Harrison Jansma
1 min read · Nov 1, 2018

--

Great question. Because we scraped by individual tag, stories with multiple tags showed up many times in the raw data.

In the cleaning phase, I grouped together all entries with identical story URLs, combined their tag encodings, and then removed the duplicates. This means every entry in the cleaned dataset has a unique Medium URL (and, by my logic, is a unique Medium story).
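
For anyone who wants to see the idea in code, here is a minimal sketch of that kind of deduplication. It is not the actual cleaning script: it assumes each scraped row is a dict with a `url` field and one-hot tag columns prefixed with `tag_`, and the function name, column names, and example URLs are all hypothetical.

```python
def merge_duplicate_stories(rows, url_key="url", tag_prefix="tag_"):
    """Collapse rows that share a URL into one row.

    Tag columns (assumed to be 0/1 one-hot encodings) are combined with a
    logical OR, so the merged row carries every tag seen for that URL.
    Non-tag fields are taken from the first occurrence of each URL.
    """
    merged = {}  # url -> merged row; dicts keep insertion order
    for row in rows:
        url = row[url_key]
        if url not in merged:
            merged[url] = dict(row)  # copy so the input rows are not mutated
            continue
        kept = merged[url]
        for col, value in row.items():
            if col.startswith(tag_prefix):
                # OR the one-hot flags: 1 if any duplicate had the tag
                kept[col] = max(kept.get(col, 0), value)
    return list(merged.values())


# Hypothetical raw data: the same story scraped once under each of its tags.
raw = [
    {"url": "https://example.com/story-a", "title": "Story A", "tag_ai": 1, "tag_data": 0},
    {"url": "https://example.com/story-a", "title": "Story A", "tag_ai": 0, "tag_data": 1},
    {"url": "https://example.com/story-b", "title": "Story B", "tag_ai": 1, "tag_data": 0},
]

cleaned = merge_duplicate_stories(raw)
for story in cleaned:
    print(story)
```

After the merge, each URL appears exactly once, and a story that was scraped under several tags keeps all of them in a single row.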
