Using Datasette for an ELT Personal Twitter Data Warehouse
Twitter publishes some analytics for you, and I check them about once a month, but they are extremely limited when it comes to the important things, like exploring your follower graph.
This is where Datasette comes in. Datasette is a great EDA (exploratory data analysis) tool for any arbitrary SQLite dataset. You can use it to build your own Twitter analytics!
Today, at Simon Willison’s prompting, I decided to try to replicate his 2018 work in analyzing his own Twitter network.
TL;DR
You can check out my own dataset here: https://swyx-twitter-datasette.glitch.me/twitter/tweets?_sort_desc=favorite_count&_facet=retweeted&_facet=favorited&_facet=user&user=33521530&retweeted=1
Things to try:
- check out the selected facets
- sort/hide individual columns
- search
- build up sql filters (eg favorite_count > 100, retweet_count > 100)
Keep in mind that all of this is automatically generated from a sqlite table (which itself was automatically generated by running twitter-to-sqlite), greatly reducing the bar for GUI-based, SQL-based EDA.
And because it is web based and the SQL is serialized into the URL, you can easily share whatever you find with coworkers.
The How
Optional: Get Credentials
The most time consuming piece of this is going to be getting Twitter API credentials. The process is largely unchanged since 2018 - get a developer account on https://developer.twitter.com/ (some approval time may be required… so get this out of the way ASAP), then create a new app https://apps.twitter.com/ and eventually get to the four “Keys and Tokens” you will need to do this:
Optional: Test Your Credentials
You can test your credentials by writing a little Python script locally (pip install requests_oauthlib
, then run python script.py
. I also like to use venv)
from requests_oauthlib import OAuth1Session
twitter = OAuth1Session(
client_key='FILL_ME_IN',
client_secret='FILL_ME_IN',
resource_owner_key='FILL_ME_IN-FILL_ME_IN',
resource_owner_secret='FILL_ME_IN'
)
print(twitter.get( 'https://api.twitter.com/2/users/by/username/swyx').json())
print(twitter.get( 'https://api.twitter.com/2/users/1465910794114134018/following/?user.fields=public_metrics').json())
By default you only have access to the Twitter API v2 endpoints, though you can apply for Elevated Access if you need Twitter API v1 as many libraries and blogposts are still on API v1.
From here you can browse the Twitter API reference for what data you’d like to extract… but you don’t have to, if you agree enough with Simon’s opinions in twitter-to-sqlite!
Scraping your data with twitter-to-sqlite
One thing I love about how Simon has been building Datasette is the rich ecosystem of data extraction tools he has also groomed. He’s got standard things for extracting data from many sources he cares about into SQLite (and, as I’ve already said, Datasette takes it from there all the way to a viewable explorable product).
So it is with twitter-to-sqlite:
$ pip install twitter-to-sqlite
$ twitter-to-sqlite auth
Create an app here: https://developer.twitter.com/en/apps
Then navigate to 'Keys and tokens' and paste in the following:
API key: xxx
API secret key: xxx
Access token: xxx
Access token secret: xxx
$ twitter-to-sqlite user-timeline twitter.db
Importing tweets [#####-------------------------------] 2799/17780 00:01:39
This gives you a nice little twitter.db
file containing every tweet you’ve ever made or liked or retweeted that is still on your timeline (timeline gets cut off after 3,200 tweets). You can also crawl your followers with twitter-to-sqlite followers twitter.db
.
Deploying to Glitch
Simon’s old method relies on Zeit Now, which has since been deprecated. Datasette will work wherever serverful web hosting works, but its nice to host this somewhere free since it is just a personal public data tool. Render.com or Fly.io would probably work, but I felt like trying out Glitch since that is Simon’s preferred platform.
Unfortunately I ran into a bunch of problems here and took a couple of hours to fix them. TLDR; just remix my Glitch:
The important unlock here is understanding how to upload your SQLite db to Glitch for it to run. You can’t just drag and drop it into the “assets” folder as that is just for CDN hosting, and you need access to this file in the Glitch build process.
The way I figured out how to do it was to clone the git repo locally (unlike that answer, there is no more Advanced Options menu in Glitch, you have to go Tools -> Import and Export -> copy your project’s Git URL), and add the file to git (on a different branch, because of Glitch’s idiosyncratic setup… read the instructions on the forum).
git clone https://<git access token>@api.glitch.com/git/swyx-twitter-datasette
cd swyx-twitter-datasette
git checkout -b addData
###
# add/replace your twitter.db file to this repo and commit it
###
git push --set-upstream origin addData
And then inside Glitch you’ll need to open the Terminal and do:
git merge addData
refresh
And the twitter.db should show up in there.
If all goes well your project should be running like mine: https://swyx-twitter-datasette.glitch.me/twitter/tweets (you can rename your Glitch to a custom vanity subdomain) and you can explore to your hearts content!
The Why
So. That’s a lot of work to navel-gaze at a bunch of Tweets. Why bother?
It’s not really about the tweets for me. It’s about lowering the cost of EDA. I’m putting in my reps when I don’t need it, so that I can be faster/more confident when I do.
Everyone wants to operate on data-driven insights, but the cost of wrangling data is often too high and so data, especially your own data, doesn’t get looked at.
EDA tools like Datasette dramatically lower the cost of data analysis, with a surprisingly simple ELT contract: You handle the Extract phase, it handles the Load, and exposes a standard UI for you to do dynamic Transforms.
Since the UI is constrained but has all the basic EDA features, it reduces the amount of custom coding you have to do, while giving you just enough flexibility to make queries and derive insights for any arbitrary need.
And since can see all your data and all fields at a glance, you can come up with questions you didn’t even think to ask. ELT with Datasette is the data-driven equivalent of dumping out your legos.