Feep! » Blog » Post

Reading all of StackOverflow

For a search engine to be of any use, it has to have a collection of data to search over. I want to start off with a reasonably large dataset so I can start getting a feel for what tuning is actually needed, but I don’t particularly want to spend a lot of time in the beginning building my own crawler and dealing with all of the vagarities of HTML out in the wild, so I want to start with a large, reasonably structured dataset that I can ingest quickly and be off to the races. Fortunately, Stack Overflow provides a complete XML data dump which I've downloaded and started playing around with.

(Note: this post is backdated; it was written Feb 24, 2022.)

Looking at the data

The StackOverflow data dump comes as a large (16 gigabyte) file, stackoverflow.com-Posts.7z, which contains a single file, Posts.xml. (The dumps for other sites have multiple XML files for different database tables, but stackoverflow.com gets split up since it’s so big.) This is easy to extract to -stdandard output and take a look at:

7z e -so stackoverflow.com-Posts.7z Posts.xml | head -n4
<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="742" ViewCount="61738" Body="&lt;p&gt;I want to use a &lt;code&gt;Track-Bar&lt;/code&gt; to change a &lt;code&gt;Form&lt;/code&gt;'s opacity.&lt;/p&gt;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&lt;pre class=&quot;lang-cs prettyprint-override&quot;&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When I build the application, it gives the following error:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;pre class=&quot;lang-none prettyprint-override&quot;&gt;&lt;code&gt;Cannot implicitly convert type decimal to double&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;I have tried using &lt;code&gt;trans&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt;, but then the &lt;code&gt;Control&lt;/code&gt; doesn't work. This code worked fine in a past VB.NET project.&lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="3072350" LastEditorDisplayName="Rich B" LastEditDate="2021-02-26T03:31:15.027" LastActivityDate="2021-11-15T21:15:29.713" Title="How to convert a Decimal to a Double in C#?" Tags="&lt;c#&gt;&lt;floating-point&gt;&lt;type-conversion&gt;&lt;double&gt;&lt;decimal&gt;" AnswerCount="12" CommentCount="3" FavoriteCount="59" CommunityOwnedDate="2012-10-31T16:42:47.213" ContentLicense="CC BY-SA 4.0" />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="309" ViewCount="22174" Body="&lt;p&gt;I have an absolutely positioned &lt;code&gt;div&lt;/code&gt; containing several children, one of which is a relatively positioned &lt;code&gt;div&lt;/code&gt;. When I use a &lt;code&gt;percentage-based width&lt;/code&gt; on the child &lt;code&gt;div&lt;/code&gt;, it collapses to &lt;code&gt;0 width&lt;/code&gt; on IE7, but not on Firefox or Safari.&lt;/p&gt;&#xA;&lt;p&gt;If I use &lt;code&gt;pixel width&lt;/code&gt;, it works. If the parent is relatively positioned, the percentage width on the child works.&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Is there something I'm missing here?&lt;/li&gt;&#xA;&lt;li&gt;Is there an easy fix for this besides the &lt;code&gt;pixel-based width&lt;/code&gt; on the child?&lt;/li&gt;&#xA;&lt;li&gt;Is there an area of the CSS specification that covers this?&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;" OwnerUserId="9" LastEditorUserId="9134576" LastEditorDisplayName="user14723686" LastEditDate="2021-01-29T18:46:45.963" LastActivityDate="2021-01-29T18:46:45.963" Title="Why did the width collapse in the percentage width child element in an absolutely positioned parent on Internet Explorer 7?" Tags="&lt;html&gt;&lt;css&gt;&lt;internet-explorer-7&gt;" AnswerCount="7" CommentCount="0" FavoriteCount="13" ContentLicense="CC BY-SA 4.0" />

...well, it’s easy to extract; maybe not so much to look at. Here's that first <row/> again, but with better formatting:

<row
	Id="4"
	PostTypeId="1"
	AcceptedAnswerId="7"
	CreationDate="2008-07-31T21:42:52.667"
	Score="742"
	ViewCount="61738"
	Body="&lt;p&gt;I want to use a &lt;code&gt;Track-Bar&lt;/code&gt; to change a &lt;code&gt;Form&lt;/code&gt;'s opacity.&lt;/p&gt;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&lt;pre class=&quot;lang-cs prettyprint-override&quot;&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When I build the application, it gives the following error:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;pre class=&quot;lang-none prettyprint-override&quot;&gt;&lt;code&gt;Cannot implicitly convert type decimal to double&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;I have tried using &lt;code&gt;trans&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt;, but then the &lt;code&gt;Control&lt;/code&gt; doesn't work. This code worked fine in a past VB.NET project.&lt;/p&gt;&#xA;"
	OwnerUserId="8"
	LastEditorUserId="3072350"
	LastEditorDisplayName="Rich B"
	LastEditDate="2021-02-26T03:31:15.027"
	LastActivityDate="2021-11-15T21:15:29.713"
	Title="How to convert a Decimal to a Double in C#?"
	Tags="&lt;c#&gt;&lt;floating-point&gt;&lt;type-conversion&gt;&lt;double&gt;&lt;decimal&gt;"
	AnswerCount="12"
	CommentCount="3"
	FavoriteCount="59"
	CommunityOwnedDate="2012-10-31T16:42:47.213"
	ContentLicense="CC BY-SA 4.0" />

From what I can tell, this is pretty much a direct dump of the database schema. Here’s the fields I’m interested in:

So that’s a question; where’s its answer?

7z e -so stackoverflow.com-Posts.7z Posts.xml | grep 'ParentId="4"' | head -n1
<row
	Id="7"
	PostTypeId="2"
	ParentId="4"
	CreationDate="2008-07-31T22:17:57.883"
	Score="495"
	Body="&lt;p&gt;An explicit cast to &lt;code&gt;double&lt;/code&gt; like this isn't necessary:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = (double) trackBar1.Value / 5000.0;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&#xA;&lt;p&gt;Identifying the constant as &lt;code&gt;5000.0&lt;/code&gt; (or as &lt;code&gt;5000d&lt;/code&gt;) is sufficient:&lt;/p&gt;&#xA;&#xA;&lt;pre&gt;&lt;code&gt;double trans = trackBar1.Value / 5000.0;&#xA;double trans = trackBar1.Value / 5000d;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;"
	OwnerUserId="9"
	LastEditorUserId="5496973"
	LastEditDate="2019-10-21T14:03:54.607"
	LastActivityDate="2019-10-21T14:03:54.607"
	CommentCount="0"
	ContentLicense="CC BY-SA 4.0" /> <!-- [formatting added for clarity] -->

So far, so good. To reconstruct the contents of a question page on StackOverflow, I just need to look at each question, find its answers, and merge them together.

Hm. Find the answers. When they could be anywhere in an XML file that’s 16GB when compressed. This might be a bit difficult.

A streaming XML parser

Step 1 is to figure out how to read the file at all. It’s a single giant XML file, and I don’t have anywhere near enough RAM to load the entire DOM at once the way most XML parsers want to. Fortunately a bit of searching for “streaming XML parsers” reveals that Python’s built-in xml.etree.ElementTree library is well-suited to the task: it has an iterparse function that emits events as it’s reading the XML, and doesn’t mind if you manipulate the resulting tree while it’s being built. So all we need to do is keep track of the root <posts> element, and clear it out after we’ve seen each row.

import sys, json, xml.etree.ElementTree as etree

root = None
for _, elem in etree.iterparse(sys.stdin, events=['start']):
	if elem.tag == 'posts':
		root = elem
	elif elem.tag == 'row':
		print(json.dumps(elem.attrib))
	else:
		raise Exception('Unknown post type: '+elem.attrib['PostTypeId'])
	root.clear() # Discard each <row> once we've seen it

And now we’ve got each <row> in newline-delimited JSON format:

7z e -so data/stackoverflow.com-Posts.7z Posts.xml | python3 ./extract_xml.py | head -n4
{"Id": "4", "PostTypeId": "1", "AcceptedAnswerId": "7", "CreationDate": "2008-07-31T21:42:52.667", "Score": "742", "ViewCount": "61738", "Body": "<p>I want to use a <code>Track-Bar</code> to change a <code>Form</code>'s opacity.</p>\n<p>This is my code:</p>\n<pre class=\"lang-cs prettyprint-override\"><code>decimal trans = trackBar1.Value / 5000;\nthis.Opacity = trans;\n</code></pre>\n<p>When I build the application, it gives the following error:</p>\n<blockquote>\n<pre class=\"lang-none prettyprint-override\"><code>Cannot implicitly convert type decimal to double\n</code></pre>\n</blockquote>\n<p>I have tried using <code>trans</code> and <code>double</code>, but then the <code>Control</code> doesn't work. This code worked fine in a past VB.NET project.</p>\n", "OwnerUserId": "8", "LastEditorUserId": "3072350", "LastEditorDisplayName": "Rich B", "LastEditDate": "2021-02-26T03:31:15.027", "LastActivityDate": "2021-11-15T21:15:29.713", "Title": "How to convert a Decimal to a Double in C#?", "Tags": "<c#><floating-point><type-conversion><double><decimal>", "AnswerCount": "12", "CommentCount": "3", "FavoriteCount": "59", "CommunityOwnedDate": "2012-10-31T16:42:47.213", "ContentLicense": "CC BY-SA 4.0"}
{"Id": "6", "PostTypeId": "1", "AcceptedAnswerId": "31", "CreationDate": "2008-07-31T22:08:08.620", "Score": "309", "ViewCount": "22174", "Body": "<p>I have an absolutely positioned <code>div</code> containing several children, one of which is a relatively positioned <code>div</code>. When I use a <code>percentage-based width</code> on the child <code>div</code>, it collapses to <code>0 width</code> on IE7, but not on Firefox or Safari.</p>\n<p>If I use <code>pixel width</code>, it works. If the parent is relatively positioned, the percentage width on the child works.</p>\n<ol>\n<li>Is there something I'm missing here?</li>\n<li>Is there an easy fix for this besides the <code>pixel-based width</code> on the child?</li>\n<li>Is there an area of the CSS specification that covers this?</li>\n</ol>\n", "OwnerUserId": "9", "LastEditorUserId": "9134576", "LastEditorDisplayName": "user14723686", "LastEditDate": "2021-01-29T18:46:45.963", "LastActivityDate": "2021-01-29T18:46:45.963", "Title": "Why did the width collapse in the percentage width child element in an absolutely positioned parent on Internet Explorer 7?", "Tags": "<html><css><internet-explorer-7>", "AnswerCount": "7", "CommentCount": "0", "FavoriteCount": "13", "ContentLicense": "CC BY-SA 4.0"}
{"Id": "7", "PostTypeId": "2", "ParentId": "4", "CreationDate": "2008-07-31T22:17:57.883", "Score": "495", "Body": "<p>An explicit cast to <code>double</code> like this isn't necessary:</p>\n\n<pre><code>double trans = (double) trackBar1.Value / 5000.0;\n</code></pre>\n\n<p>Identifying the constant as <code>5000.0</code> (or as <code>5000d</code>) is sufficient:</p>\n\n<pre><code>double trans = trackBar1.Value / 5000.0;\ndouble trans = trackBar1.Value / 5000d;\n</code></pre>\n", "OwnerUserId": "9", "LastEditorUserId": "5496973", "LastEditDate": "2019-10-21T14:03:54.607", "LastActivityDate": "2019-10-21T14:03:54.607", "CommentCount": "0", "ContentLicense": "CC BY-SA 4.0"}
{"Id": "9", "PostTypeId": "1", "AcceptedAnswerId": "1404", "CreationDate": "2008-07-31T23:40:59.743", "Score": "2069", "ViewCount": "709135", "Body": "<p>Given a <code>DateTime</code> representing a person's birthday, how do I calculate their age in years?</p>\n", "OwnerUserId": "1", "LastEditorUserId": "6537157", "LastEditorDisplayName": "Rich B", "LastEditDate": "2021-01-05T17:33:32.393", "LastActivityDate": "2021-11-15T21:33:05.663", "Title": "How do I calculate someone's age based on a DateTime type birthday?", "Tags": "<c#><.net><datetime>", "AnswerCount": "68", "CommentCount": "10", "FavoriteCount": "483", "CommunityOwnedDate": "2011-08-16T19:40:43.080", "ContentLicense": "CC BY-SA 4.0"}

This JSON Lines format is a lot more convenient to work with than XML: it’s easy to stream, since extracting each row is simply a matter of iterating over lines, and JSON parsers are a lot more readily available than XML ones.

Finding the answer to a question

Anyway, now that I can read the entire file without blowing out my RAM, it’s time to figure out how to join the questions and answers together. My first thought was to dump everything into a database, and let that do the joining. I dumped the data I needed into a SQLite database with separate tables for questions and answers, and then started querying it. Unfortunately, while querying for individual questions and their answers worked fine, this was unreasonably slow when trying to get a list of every single question and their answers. I poked around in the query plan for a bit, but trying to understand the finer points of index optimization didn’t seem like the way I wanted to spend the afternoon so I decided to strike out on my own. (This is the nice thing about side projects; you can go where the fancy takes you.)

Since I was having problems with the database index, I decided to make my own simple index instead. If it was slow, I figured, I could at least see why it was slow fairly easily, and I might also learn something about data structures in the process.

I figured I’d start with the simplest possible index, and see where things fell over. So, I defined a dictionary mapping question IDs to an array of offsets in a question file. As I wrote each line of JSON to the output file, I would use tell to get the current offset into the file and append it to the appropriate index array. Then, I could iterate over the questions a second time, seeking to the location of each answer and merging them with the question.

The first part of this plan went remarkably well: I iterated through the XML file, writing each line to the output file and updating the index as appropriate. (I initially worried that I might run out of RAM for the index and have to figure out how to store it in a file, but integers—even a fairly large list of them—are quite small in comparison to gigabytes of memory.)

When it came time to join the questions and answers together, though, my new approach was still very slow. However, since it was only a few dozen lines of code it was much easier to analyze what was going on. With the help of a profiler, I discovered that the vast majority of the time was being spent on that seek() call. Since the file was far too large to fit in RAM, and the questions were spread around, most of the time the data wasn’t in the page cache, and that call was translating into an actual disk seek. I have a spinning HDD, so each seek is roughly 10ms—multiply that by 22 million questions and it adds up fast! My back-of-the-envelope math was that it would take 3 days to complete; I didn’t have the patience for that, nor did I think it would do wonders for the hardware.

(In hindsight, my database was probably having the same issue; if I had told it to use a clustered index on question ID that probably would have solved the problem. But this didn’t occur to me until much later.)

Clearly, I’d need to make sure that the answers for each question were near each other as they were written to the disk, so that I could read them in all at once for each question without having to seek all over the place. Nothing immedately came to mind, so I set my computer aside and went to make lunch. The sandwich wasn’t very insightful, though, so I started searching the Web for ways to do joins on large datasets, and eventually found out about external sorting, a class of algorithms for efficiently sorting data that’s too large to fit in RAM. This is exactly what I need, so I dove down the rabbit hole of researching about different strategies, their tradeoffs, and the different optimizations that could be done.

The UNIX Way

In the course of my research, I found a page discussing the various heuristics that GNU sort uses to optimize the sorting of arbitrary data. I was thinking about how to apply this to my program when I had the dawning realization that I could just use sort and take advantage the already-heavily-optimized program to do exactly what I needed, already preinstalled on my computer!

After that revelation, things came together pretty quickly. sort doesn’t understand JSON, but it was trivial to prefix each JSON line with the question ID, sort it on that first field, and then use cut afterwards to strip off the ID prefix and get back JSON:

sort -k1,1 -s -n "$SITE.answers.jsonlx" \
| cut -d' ' -f2- \
> "$SITE.answers.sorted.jsonl"

This isn’t exactly instant, but it does complete in a mostly reasonable amount of time and appears to be primarily I/O bound, so I don’t think I’m likely to do much better trying to write my own.

Zip it up!

Now that I have both questions and answers sorted by the question ID, it’s just a matter of joining them up. This is slightly complicated by the need to iterate over them in parallel: I need to use the iterator protocol directly for the answers, rather than having the sugar of a for loop, but this is otherwise quite straightforward:

async function* source(exchange, exchange_path) {
       const answers = read_jsonl(`${exchange_path}.answers.sorted.jsonl`)
       async function get_next_answer() {
               const {value, done} = await answers.next()
               return done ? null : value
       }
       let next_answer = await get_next_answer()
       for await (question of read_jsonl(`${exchange_path}.questions.jsonl`)) {
               let html = question.Body
               while (next_answer && next_answer.ParentId == question.Id) {
                       html += ' '+next_answer.Body
                       next_answer = await get_next_answer()
               }
               yield {
                       url: `https://${exchange}/questions/${question.Id}`,
                       title: question.Title,
                       html,
               }
       }
       if (next_answer !== null) {
               throw Error('Failed to consume all answers')
       }
}

Now that I have a stream of StackOverflow pages, the next step will be to load them into a full-text search so I can start querying them. But this post is long enough already, so that’s a topic for another time.