Dump data from StackOverflow into a SQLite db
Downloads all your contributions to StackOverflow into a searchable, sortable SQLite database, including your questions, answers, and comments.
The best way to install the package is by using pipx:

```bash
pipx install stackoverflow-to-sqlite
```
It's also available via Homebrew:

```bash
brew install xavdid/projects/stackoverflow-to-sqlite
```
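However you install it, you can confirm the CLI is on your path by printing its version:

```bash
stackoverflow-to-sqlite --version
```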
```
Usage: stackoverflow-to-sqlite [OPTIONS] USER_ID

  Save all the contributions for a StackOverflow user to a SQLite database.

Options:
  --version  Show the version and exit.
  --db FILE  A path to a SQLite database file. If it doesn't exist, it will be
             created. While it can have any extension, `.db` or `.sqlite` is
             recommended.
  --help     Show this message and exit.
```
The CLI takes a single required argument: a StackOverflow user id. The easiest way to get this is from a user's profile page, where it's the number in the URL (e.g. `https://stackoverflow.com/users/1825390/...`).
The simplest usage is to pass that id directly to the CLI and use the default database location:

```bash
% stackoverflow-to-sqlite 1825390
```
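If you'd rather write somewhere specific, the `--db` option (described above) controls the output path:

```bash
% stackoverflow-to-sqlite 1825390 --db path/to/archive.db
```

Once it finishes, you can poke at the data with any SQLite client. For example, using the `sqlite3` CLI and the `questions` table (one of the tables referenced in the Datasette config below):

```bash
sqlite3 path/to/archive.db 'select count(*) from questions'
```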
The resulting SQLite database pairs well with Datasette, a tool for exploring SQLite databases in the browser. Below is my recommended configuration.
First, install `datasette`:

```bash
pipx install datasette
```
Then, add the recommended plugins (for rendering timestamps and markdown):

```bash
pipx inject datasette datasette-render-markdown datasette-render-timestamps
```
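You can confirm Datasette picked them up by listing its installed plugins:

```bash
datasette plugins
```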
Finally, create a `metadata.json` file next to your `stackoverflow.db` with the following:
```json
{
  "databases": {
    "stackoverflow": {
      "tables": {
        "questions": {
          "sort_desc": "creation_date",
          "plugins": {
            "datasette-render-markdown": {
              "columns": ["body_markdown"]
            },
            "datasette-render-timestamps": {
              "columns": ["creation_date", "closed_date", "last_activity_date"]
            }
          }
        },
        "answers": {
          "sort_desc": "creation_date",
          "plugins": {
            "datasette-render-markdown": {
              "columns": ["body_markdown"]
            },
            "datasette-render-timestamps": {
              "columns": ["last_edit_date", "creation_date"]
            }
          }
        },
        "comments": {
          "sort_desc": "creation_date",
          "plugins": {
            "datasette-render-markdown": {
              "columns": ["body_markdown"]
            },
            "datasette-render-timestamps": {
              "columns": ["creation_date"]
            }
          }
        },
        "tags": {
          "sort": "name"
        }
      }
    }
  }
}
```
Now when you run:

```bash
datasette serve stackoverflow.db --metadata metadata.json
```

you'll get nice, formatted output!
StackOverflow has recently announced some pretty major AI-related plans, and they don't allow you to modify or remove your content in protest. There's no real guarantee about what they will or won't do with the content you've produced.

Ultimately, there's no better steward of data you've put time and energy into creating than you. This tool builds a searchable archive of everything you've ever said on StackOverflow, which is nice to have in case the site changes for the worse.
At some point, I'd like to crawl the entire Stack Exchange network. An account id is shared across all Stack Exchange sites, while a user id is specific to each site, so I'm using the former as the primary key to better represent that.
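If you're curious how that's reflected in the generated tables, the `sqlite3` CLI can print the schema (using one of the table names from the Datasette config above; the exact columns depend on the package version):

```bash
sqlite3 stackoverflow.db '.schema questions'
```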
Datasette truncates long text fields by default. You can disable this behavior with the `truncate_cells_html` setting when running `datasette` (see the docs):

```bash
datasette stackoverflow.db --setting truncate_cells_html 0
```
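Settings and metadata compose, so you can combine this with the configuration from above:

```bash
datasette stackoverflow.db --metadata metadata.json --setting truncate_cells_html 0
```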
Yes, currently it does a full backup every time the command is run. Technically it performs an upsert on every row, so existing rows are updated with any new data.

Eventually I'd like to stop fetching once we've hit an item that's already saved, but that optimization hasn't been a priority.
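For illustration, an upsert inserts a row when its primary key is new and updates it in place otherwise. Here's a minimal sketch using the `sqlite3` CLI, with a hypothetical table and columns rather than this package's actual schema:

```bash
sqlite3 example.db <<'SQL'
CREATE TABLE IF NOT EXISTS posts (post_id INTEGER PRIMARY KEY, body TEXT);
-- hypothetical example: the first run inserts row 1;
-- every later run updates the existing row instead of duplicating it
INSERT INTO posts (post_id, body) VALUES (1, 'latest body')
  ON CONFLICT (post_id) DO UPDATE SET body = excluded.body;
SQL
```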
Because the goal is to capture your own data, not archive all of StackOverflow. There are better avenues for that.
This section is for people making changes to this package.
When in a virtual environment, run the following:

```bash
just install
```

This installs the package in `--edit` mode and makes its dependencies available. You can now run `stackoverflow-to-sqlite` to invoke the CLI.
In your virtual environment, a simple `just test` should run the unit test suite. You can also run `just typecheck` for type checking.
These notes are mostly for myself (or other contributors).

Run `just release` while your venv is active (if the upload is rejected, verify that `~/.pypirc` is empty).