
Preparing a dataset to use in Hydrator

Every few months I get an email from someone looking for help with the Hydrator tool I wrote about back in 2017. Hydrator is a simple tool that rehydrates Twitter IDs into full tweets. From the Hydrator GitHub page: “Twitter’s Terms of Service do not allow the full JSON for datasets of tweets to be distributed to third parties. However they do allow datasets of tweet IDs to be shared. Hydrator helps you turn these tweet IDs back into JSON and also CSV from the comfort of your desktop.”

The advantage for me, and apparently many others, is that it’s a desktop app instead of a command-line tool, so it’s pretty damn easy to use for those of us who only occasionally need to rehydrate a list of tweet IDs.

The biggest issue people have is that Hydrator only wants a file with a single list of Twitter IDs, and nothing else. Not even a header on that list. For a reasonably sized list it’s usually no problem to use Excel or similar to clean up the offending dataset, but the other day someone asked if I could help with prepping the file from “A Twitter Dataset of 150+ million tweets related to COVID-19 for open research.” For some reason the researchers added two columns to the dataset, so in addition to the column of Twitter IDs, there are also columns for date and time.
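
Before cleaning anything, it helps to confirm what shape the file is actually in. Here is a minimal sketch, assuming a tab-separated file with a header row. The sample file, its IDs, and the date and time column names are made-up stand-ins for the real dataset:

```shell
# Build a tiny stand-in file in a scratch directory (made-up IDs and assumed column names).
work=$(mktemp -d)
printf 'tweet_id\tdate\ttime\n' > "$work/sample.tsv"
printf '1000000000000000001\t2020-03-22\t00:00:00\n' >> "$work/sample.tsv"
printf '1000000000000000002\t2020-03-22\t00:00:01\n' >> "$work/sample.tsv"

# Peek at the first few lines to see the header and the column layout.
head -n 3 "$work/sample.tsv"

# Count tab-separated fields on every line; a single value means every row has the same width.
column_counts=$(awk -F'\t' '{ print NF }' "$work/sample.tsv" | sort -u)
echo "columns per row: $column_counts"
```

Anything other than a single column per row means the file needs trimming before Hydrator will take it.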

Note this is all happening on my MacBook. YMMV on other platforms.

I did some Googling and decided that csvkit should do the trick and it did, with a couple of caveats. First, it’s an incredibly simple command-line tool, with a stellar step-by-step tutorial. Trust me, you can use it! So I installed it and ran the first command to turn the COVID-19 dataset from a TSV (tab-separated values) file into a CSV (comma-separated values) file:

in2csv full_dataset-clean.tsv > full_dataset-clean.csv

And it didn’t work 🙁 csvkit is supposed to be able to recognize and parse TSV, but for whatever reason it didn’t in this case. No biggie, back to Google, and after a little trial and error, I settled on this single command line to convert the file from TSV to CSV:

cat full_dataset-clean.tsv | tr "\\t" "," > full_dataset-clean.csv 

And just below the answer where I found that command is a suggestion that I might not even have needed csvkit for the next step, but I didn’t see that until just now 🙂
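
A sketch of one way to skip csvkit for that step, assuming the tweet IDs sit in the first column of the TSV. It uses the standard cut utility, which splits on tabs by default, so no CSV conversion is needed. The sample file and IDs are made up:

```shell
# Made-up stand-in for the real TSV (assumed column names).
work=$(mktemp -d)
printf 'tweet_id\tdate\ttime\n' > "$work/sample.tsv"
printf '1000000000000000001\t2020-03-22\t00:00:00\n' >> "$work/sample.tsv"
printf '1000000000000000002\t2020-03-22\t00:00:01\n' >> "$work/sample.tsv"

# cut uses tab as its default delimiter, so -f1 keeps only the first column.
cut -f1 "$work/sample.tsv" > "$work/sample-tweet_ids.txt"

cat "$work/sample-tweet_ids.txt"
```

The header line is still in the output at this point, so it would need to be removed separately.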

So that command did indeed translate my TSV file into a CSV file, and I went back to the csvkit tutorial to get a file with only the first column, using the command:

csvcut -c 1 full_dataset-clean.csv > full_dataset-clean-tweet_ids.csv

One last problem is that there’s still a header on that column (tweet_id), which I verified with the command:

csvcut -n full_dataset-clean-tweet_ids.csv

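Google suggests the sed (stream editor) command will do the trick. One macOS wrinkle: the BSD sed that ships with the Mac expects a backup suffix right after -i, so the commonly suggested sed -i -e form quietly leaves behind a copy of the file with -e tacked onto its name. Passing an empty suffix avoids that (with GNU sed on Linux, leave out the ''):

sed -i '' -e '1d' full_dataset-clean-tweet_ids.csv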
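
If you’d rather not edit the file in place at all, here is a sketch that writes the headerless list to a new file with tail instead, which behaves the same on macOS and Linux. The file names and IDs are placeholders:

```shell
# Made-up single-column file that still has its header.
work=$(mktemp -d)
printf 'tweet_id\n1000000000000000001\n1000000000000000002\n' > "$work/ids-with-header.csv"

# tail -n +2 prints everything from the second line onward, dropping the header.
tail -n +2 "$work/ids-with-header.csv" > "$work/ids.txt"

cat "$work/ids.txt"
```

The original file is left untouched, which is handy if something goes wrong and you need to start over.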

And I can verify the first line is now gone with:

head full_dataset-clean-tweet_ids.csv
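
head only shows the first ten lines, so for a file this size a fuller check may be worth the few extra seconds. This sketch counts the lines that are not purely digits, assuming tweet IDs are plain integers. The sample file is made up:

```shell
# Made-up cleaned file: one ID per line, no header.
work=$(mktemp -d)
printf '1000000000000000001\n1000000000000000002\n' > "$work/ids.txt"

# Count lines that are anything other than one or more digits
# (a leftover header, blank lines, stray columns).
# grep exits non-zero when the count is 0, hence the "|| true".
bad_lines=$(grep -Evc '^[0-9]+$' "$work/ids.txt" || true)
total_lines=$(wc -l < "$work/ids.txt" | tr -d ' ')

echo "$total_lines lines, $bad_lines that do not look like tweet IDs"
```

A non-zero count of bad lines is a sign that a header or an extra column survived the cleanup.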

Probably way more convoluted than it needed to be, but it got the job done, and now I can point Hydrator at my full_dataset-clean-tweet_ids.csv file and Bob’s your uncle!

Note to self, xsv is another CSV tool that might be worth exploring.
