The current state of StackOverflow’s data dumps
My search engine got its start from StackExchange data dumps, and they’re still one of the major data sources I use. But the official source I was using stopped updating in April 2024, so my data has been stuck in time for a while. I finally got around to investigating why I hadn’t gotten updates, and it turns out that the interesting part of the issue is not some technical problem on my end, but the politics between StackExchange corporate and the volunteer moderators. This turns out to have been a complicated and long-running conflict, so here’s my attempt to summarize and link to primary sources to provide an overview and timeline of the situation.
Trouble brewing
Discussions about the data dumps can be found on the data-dumps tag on meta.stackexchange. In the past, this mostly consisted of announcements about technical glitches preparing the quarterly dumps or questions about the data and schema. The first signs of trouble came in June 2023, when the usual dump failed to appear. This, it turned out, was not the usual sort of hitch that had frequently plagued the dump before, but an intentional decision that had been made quietly months prior:
The job that uploads the data dump to Archive.org was disabled on 28 March, and marked to not be re-enabled without approval of senior leadership.
This, understandably, raised a lot of eyebrows. Further furor was kicked up when the CTO explained they were “working on a strategy to protect Stack Overflow data from being misused by companies building LLMs”; as one user noted:
In order to prevent companies building LLMs from getting our data (which they already have, and have had for years) your plan is to keep all of us from getting our data, am I reading that right?
I won’t try to rehash all the discussion, since you can go and read it yourself; but the upshot was that the dump was re-enabled, though with many caveats about the future that left people with more questions than answers. This created a whole flurry of discussion about what the company actually intended, what the legal issues were, and how the community could pick up the slack if the dumps were yanked again.
However, as far as I can tell the next official statement wasn’t until the end of July, when a staff member announced that they were committed to maintaining the data dumps for the “forseeable future”. Many people found this statement concerningly vague, and a number of people also picked up on the phrase “community members have free access to them for legitimate usages that support the community”, presciently noting that this gave StackExchange a lot of leeway to restrict access to the dumps as they saw fit.
Despite these concerns, for the next few quarters things seemed like they had gone back to normal. In August 2023 the regularly scheduled dump encountered the usual glitches but otherwise proceeded as expected; and similarly in December. In February 2024, another announcement described their continued work on making the export process more robust, and committed to estimated dates for releases through the end of the year. In March this schedule was adjusted slightly, in a post which received praise for being transparent and involving the community in changes to the dump process.
A new dump
The next sign of trouble, as far as I can tell, came in May 2024, when StackExchange announced an “exciting new partnership” with OpenAI in a poorly-received meta.stackexchange post. There were a lot of things that people in the community were unhappy with; these are mostly out of scope for this post, but are somewhat summarised in response to the question “Why are people so unhappy about the OpenAI partnership?”
This raised a lot of worries, but concrete concerns specific to the data dump did not surface until July, when StackExchange, in a reversal of their previous recommitment to transparency, unexpectedly announced “a change to the data-dump process”. Although it was described as “primarily only a change in location for where the data dump is accessed,” the announcement buried the lede on two major changes:
- Instead of being uploaded to the Internet Archive, where they could be easily accessed, new dumps would be hosted on StackExchange’s servers and downloading all of them would require manually logging in to 368 separate sites and navigating to the page with the download link.
- Before downloading, you had to agree to a license saying you were only going to use the file for non-commercial purposes.
Not only was this significant change to licensing and access presented as though it was a relatively minor improvement, but it also turned out that it wasn’t even ready:
Unfortunately, this new process will not be ready until mid-August. Because this is an important change, we want to use it for the July data dump release, which means that we will miss our previous commitment date of July 31 for the data dump.
This meant that nobody could actually see the process they were discussing in action, but the community had already seen enough from the post to have a lot of concerns.
Part of the new process was a checkbox you had to agree to in order to download each dump file:
I agree that I will use this file for non-commercial use. I will not use it for any other purpose, and I will not transfer it to others without permission from Stack Overflow. I certify that I am not downloading this file on behalf of my employer, for use in a for-profit enterprise. I have read and agree to the Terms of Service and the Acceptable Use Policy, and have read and understand Stack Overflow's privacy notice.
Users quickly noted that this attempt to add a restriction on commercial use wasn’t actually compliant with the CC BY-SA license on the actual content, which forbids adding downstream restrictions. The company responded that “We are asking politely that you not, but realistically speaking, there's nothing we can do to prevent you from sharing it, once you have it in your hands. We recognize that.” Many people found this justification dubious at best when contrasted against the very explicit wording on the clickthrough agreement. Two weeks after this announcement, and after consulting their lawyers, the company replied that they were going to change the text of this agreement (on the grounds that “Many users have found that language to be unnecessarily bureaucratic and difficult to parse and understand”, though a number of them rebutted that they found it all too clear); the final version is:
I understand that this file is being provided to me for my own use and for projects that do not include training a large language model (LLM), and that should I distribute this file for the purpose of LLM training, Stack Overflow reserves the right to decline to allow me access to future downloads of this data dump.
Despite the revisions, this checkbox continued to raise questions, such as:
- Even without the checkbox, are SE's changes legal? (conclusion: yes, though you don’t have to be happy about it and they probably can’t stop people from freely redistributing the files after downloading them.)
- Do the new Data Dump terms of use allow commercial use NOT involving LLMs? (again concluding that they can’t actually stop you.)
- How can Stack Exchange prove a violation to the Data Dump download agreement? (conclusion: at least right now, they don’t have any technical means to identify violators.)
- Does SE plan to decline providing future versions of the data dumps to individuals who may use or share it for training LLMs? (conclusion: despite all of the foregoing, it seems like they really want to try it anyway.)
Volunteer mirrors
Another question that was quickly raised, given that the license allowed redistribution even though the company didn’t want it to, was whether someone would continue to upload the dumps to the Internet Archive. As the announcement explains in its FAQ section:
Will Stack Overflow be uploading the data dump to archive.org?
Stack Overflow is no longer uploading the data dump to archive.org. [...] We would really rather users do not upload the file to archive.org or similar data pile sites. [...] Our hope is that because the process for individuals to request the dumps is lightweight and quick, you won’t feel the need to undermine these efforts to encourage commercial re-users to contribute back.
This hope, it seems, was not to be; users replied that they had already contributed with the understanding that the license allowed commercial use, and they saw no reason that the content they had already created should suddenly be subject to a new requirement to “contribute back”. In a telling exchange, one company representative acknowledged the futility of trying to stop this, remarking, “My hope is that over time, they see that there's no need [to mirror], but that will take time.” Community members, however, remained steadfast, calling out the need to safeguard against future corporate decisions that could further restrict access.
At this point the old upload had been disabled but the new process wasn’t available yet, so there was no new dump to download. Sometime around August, the June dumps became available; two volunteers downloaded them and prepared checksums before re-uploading them to the Internet Archive, a process that has continued as newer dumps have been made available.
Where to get the dumps today
The official way to download a dump is:
- Log in to the site you want a dump for.
- Click your profile (in the upper right-hand corner).
- Click the Settings tab.
- Click “Data dump access” in the sidebar.
- Click the checkbox to agree to the terms of the download.
- Click “Download data”.
However, volunteers have taken up the slack of continuing to upload all of the dumps to the Internet Archive. As of this writing, the latest version of the dump can be found in the stackexchange_20241231 collection; searching for stackexchange on archive.org will get the latest collection as well as a variety of older ones. At least for the moment, official checksums are posted when each quarter’s data dump is released, which can be used to verify the integrity of the volunteer-provided downloads. Somebody has also created communitydatadump.com, which has an extensive list of past dumps as well as archive.org and torrent links.
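If you want to check a volunteer-provided download against the official checksums, something like the following rough sketch could work. It rests on assumptions that aren't confirmed anywhere above: that the checksums are SHA-256, and that you've copied them into a text listing with one `<hex digest>  <filename>` pair per line (the format `sha256sum` prints). The function names and the file names in the usage note are made up for illustration, so adjust the algorithm and parsing to match whatever is actually posted.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte dumps never sit in memory."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def parse_checksum_list(text: str) -> dict[str, str]:
    """Parse '<hex digest>  <filename>' lines (an assumed format) into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, _, name = line.partition(" ")
        # sha256sum marks binary-mode entries with a leading '*'; drop it.
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_dumps(directory: Path, expected: dict[str, str], algorithm: str = "sha256") -> dict[str, str]:
    """Return a status per expected file: 'ok', 'mismatch', or 'missing'."""
    results = {}
    for name, digest in expected.items():
        path = directory / name
        if not path.is_file():
            results[name] = "missing"
        elif file_digest(path, algorithm) == digest:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

With the checksums saved to a hypothetical `checksums.txt` next to the downloaded files in a `dumps/` directory, `verify_dumps(Path("dumps"), parse_checksum_list(Path("checksums.txt").read_text()))` returns a status for every file in the listing. Anything reported as `mismatch` is worth re-downloading before you trust it.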
Continued Tensions
This post is only summarizing a tiny corner of what seems to be going on with StackOverflow and other StackExchange sites, since I tried to limit myself to only the contention around the data dumps in particular. The discussion included posts like a moderator asking “What was going through the pseudo-mind of your dysfunctional organisation”, and a former member of staff who had strong words about the truthfulness of even company-internal conversations prior to turning off the dump. These complaints are not limited to the data dumps, though; the disconnect between corporate decisions and community needs remains a major point of contention across a variety of different topics, which the post Company failure to meet expectations attempts to summarize. (There was also a moderation strike in the process, which cited the data dumps as one of its concerns. This post was long and involved enough already, so I did not attempt to dig into that part of the situation.)
I’ve updated Feep! Search to use the community-uploaded Archive.org copies, mostly because that lets me keep my current automated process and avoid a lot of tedious clicking. I’m also going to continue keeping an eye on this situation; hopefully the company keeps to their current commitment, but they have a lot of trust that they need to regain. (For example, I can’t use the latest dump because it’s missing a bunch of posts, a problem which some users are finding it hard to consider in good faith.) The data dumps were my catalyst to getting started building a search engine, and I hope they will continue to be available to inspire more interesting projects from other people as well.