
Nov 1, 2018

Great question. Because we scraped each tag separately, a story with several tags showed up once per tag in the raw data, so multi-tagged stories were heavily duplicated.

In the cleaning phase, I grouped together all the entries with identical story URLs, combined their tag encodings, then removed all duplicates. This means that every entry in the cleaned dataset has a unique Medium URL (and, by my logic, is a unique Medium story).
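For anyone who wants to reproduce that step, here is a rough sketch of the idea in pandas. It is not the original cleaning code: the column names (`url`, `title`, `tag_*`), the one-hot tag layout, and the sample rows are all assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical raw scrape: one row per (story, tag) pair, with one-hot tag columns.
raw = pd.DataFrame({
    "url": ["medium.com/a", "medium.com/a", "medium.com/b"],
    "title": ["Story A", "Story A", "Story B"],
    "tag_ai": [1, 0, 0],
    "tag_data_science": [0, 1, 1],
})

# Assumed naming convention: every tag encoding column starts with "tag_".
tag_cols = [c for c in raw.columns if c.startswith("tag_")]

# Group rows that share a URL. Keep the first title, and take the max of each
# one-hot tag column so the merged row carries every tag the story was scraped under.
agg_spec = {"title": "first", **{c: "max" for c in tag_cols}}
cleaned = raw.groupby("url", as_index=False).agg(agg_spec)

# After grouping, each URL should appear exactly once.
assert cleaned["url"].is_unique
```

Taking the column-wise max is one way to "combine" one-hot encodings; if the tags were stored as lists instead, a set union per group would play the same role.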

Written by Harrison Jansma
