Actor Ages

About the project

The goal of this project was to try to parse semi-organised information from Wikipedia. I just needed to extract movie release dates, the cast list of each movie and the birth/death dates of each actor, as well as their birth places, to be able to show how old each actor was at the time their movies were released.

Luckily, Wikipedia has the concept of infoboxes. These are the small tables on the right-hand side of Wikipedia pages containing structured data about the subject of the page. Movies, for example, have a released parameter indicating the release date, and people have their birth date, birth place and death date present as well. This should make parsing that information easy (...it wasn't).
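
As an illustration (simplified and from memory, not copied from a live page), the wikitext behind a film infobox and a person infobox looks roughly like this; the {{...}} constructs are templates, which come back to haunt me later:

    {{Infobox film
    | name     = The Matrix
    | released = {{Film date|1999|03|31}}
    }}

    {{Infobox person
    | name        = Keanu Reeves
    | birth_date  = {{Birth date and age|1964|9|2}}
    | birth_place = Beirut, Lebanon
    }}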

Acquiring the data

Obviously, you wouldn't want to let a scraper loose on Wikipedia, parsing all the HTML and extracting what you need. Luckily, dumps are provided containing the wikitext of all pages. The main file, enwiki-latest-pages-articles-multistream.xml.bz2 (~24GB), however, contains the entirety of Wikipedia (well, the latest version of each page), and I just needed to extract movies, actors and birthplace locations.

Retrieving movie page IDs

To determine which pages were about movies, I had to take a look at the categories assigned to each page and pick out those that correspond to a movie. Wikipedia has a category for the films of each year, so it's simply a case of fetching all page IDs of the 1900 Films category up to, let's say, 2039 Films.

Which Wikipedia page IDs are assigned to which category is provided in a dump called enwiki-latest-categorylinks.sql.gz (~30GB extracted). It's a MySQL SQL dump file, so the naive approach was to import it locally in MySQL so I could perform queries against it. This clearly didn't work, as it was taking hours to make any progress importing the dump. Instead, it was quite a bit quicker to just write a little parser in C++ that parses the INSERT INTO statements, checks whether each record matches the ____ Films category, and extracts the page ID.
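
To give an idea of the approach, here is a compressed sketch, not the actual parser: it leans on std::regex for brevity where the real thing scans the multi-megabyte INSERT lines by hand, and it assumes the categorylinks tuple layout of cl_from (page ID) followed by cl_to (category name):

    #include <fstream>
    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        std::ifstream in("enwiki-latest-categorylinks.sql");   // decompressed dump
        std::regex tuple(R"(\((\d+),'([^']*)')");              // (cl_from,'cl_to',...
        std::regex films(R"(^\d{4}_[Ff]ilms$)");               // e.g. 1999_films

        std::string line;
        while (std::getline(in, line)) {
            if (line.rfind("INSERT INTO", 0) != 0) continue;   // skip non-data lines
            for (auto it = std::sregex_iterator(line.begin(), line.end(), tuple);
                 it != std::sregex_iterator(); ++it) {
                if (std::regex_match((*it)[2].str(), films))
                    std::cout << (*it)[1] << '\n';             // page ID of a film page
            }
        }
    }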

This reduced the time to extract which pages were about movies from several hours to only a couple of minutes, resulting in about one and a half megabytes of data: around 160,000 page IDs. These were loaded into a Postgres database for easy querying.

Extracting wikitext

There are a couple of steps I needed to take before I could actually see the wikitext for a specific page ID. There is some information on Wikipedia about how to read the big XML file. Basically, it's a multistream bz2 file, and the offset at which you have to start reading a bz2 stream is determined by interpreting another file called enwiki-latest-pages-articles-multistream-index.txt.bz2 (~1GB extracted). This text file is simply a three-field, colon-separated list: offset : page ID : page title, for example:

213241905:30007:The Matrix

Meaning: at file offset 213241905 in enwiki-latest-pages-articles-multistream.xml.bz2, a bz2 stream will start that contains an XML file for 100 pages, one of which will be for page ID 30007, titled 'The Matrix'.

I added another (very simple) parser for this file and also inserted the contents into the Postgres database. Of course, I couldn't filter out just the movie IDs here; I needed the offsets for all pages, as I would need to be able to fetch any random page later on, e.g. for each actor and birthplace.
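
The parser really is trivial. A sketch of the idea (the struct and its fields are just illustrative), keeping in mind that only the first two colons are separators, since page titles can contain colons themselves:

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>

    struct IndexEntry {
        uint64_t offset;      // byte offset of the bz2 stream in the big .xml.bz2
        uint64_t page_id;
        std::string title;
    };

    int main() {
        std::ifstream in("enwiki-latest-pages-articles-multistream-index.txt");
        std::string line;
        while (std::getline(in, line)) {
            auto c1 = line.find(':');
            auto c2 = line.find(':', c1 + 1);
            if (c1 == std::string::npos || c2 == std::string::npos) continue;
            IndexEntry e{std::stoull(line.substr(0, c1)),
                         std::stoull(line.substr(c1 + 1, c2 - c1 - 1)),
                         line.substr(c2 + 1)};
            // ...collect e for insertion into Postgres...
            std::cout << e.title << " -> stream at byte " << e.offset << '\n';
        }
    }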

With the help of the BZip2 library and an XML parser, it wasn't that much trouble to extract the appropriate stream, parse the 100 pages in the resulting XML and pick out the one with the correct ID. I now had code that could look up the wikitext for any page in the dump.
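
The essence of the extraction step looks roughly like this. It's a stripped-down sketch using libbz2's low-level bz_stream API: seek to the offset from the index, decompress until BZ_STREAM_END, and hand the resulting XML fragment of ~100 <page> elements to an XML parser (wrapped in a dummy root element if the parser insists on one):

    #include <bzlib.h>
    #include <fstream>
    #include <string>
    #include <vector>

    // Decompress the single bz2 stream starting at 'offset' in the multistream dump.
    std::string read_stream_at(const char* path, long long offset) {
        std::ifstream f(path, std::ios::binary);
        f.seekg(static_cast<std::streamoff>(offset));

        bz_stream bz{};
        BZ2_bzDecompressInit(&bz, /*verbosity=*/0, /*small=*/0);

        std::vector<char> in(1 << 20), out(1 << 20);
        std::string xml;
        int ret = BZ_OK;
        while (ret != BZ_STREAM_END) {
            if (bz.avail_in == 0) {                         // refill the input buffer
                f.read(in.data(), in.size());
                bz.avail_in = static_cast<unsigned>(f.gcount());
                bz.next_in  = in.data();
                if (bz.avail_in == 0) break;                // truncated stream
            }
            bz.next_out  = out.data();
            bz.avail_out = static_cast<unsigned>(out.size());
            ret = BZ2_bzDecompress(&bz);
            xml.append(out.data(), out.size() - bz.avail_out);
            if (ret != BZ_OK && ret != BZ_STREAM_END) break; // bail on error
        }
        BZ2_bzDecompressEnd(&bz);
        return xml;                                          // ~100 <page> elements
    }

    int main() {
        // Offset taken from the index example above; page 30007 is in this stream.
        std::string xml = read_stream_at(
            "enwiki-latest-pages-articles-multistream.xml.bz2", 213241905LL);
        // ...feed to an XML parser and look for the page with <id>30007</id>...
    }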

Parsing wikitext

Extracting the cast list was next. Again, a naive approach was to throw a couple of regular expressions at the wikitext, but this proved unmaintainable pretty quickly. There is too much variation, and {{templates}} are present, which have to be interpreted in a very specific way. I looked at what was available online in the form of wikitext parsers and didn't find any that suited my needs. So I did what every programmer would do in that situation, and simply wrote my own parser, combined with a wiki-template handler.
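
To give a flavour of what the hand-rolled parser has to deal with, here's a heavily condensed sketch (the real parser builds a proper tree and also handles [[links]], comments and named parameters): find a {{template}} while honouring nesting, then split it into a name and its |-separated parameters. It's exactly this nesting that makes plain regular expressions fall over.

    #include <iostream>
    #include <string>
    #include <vector>

    // Returns the position just past the "}}" matching the "{{" at 'start'.
    size_t find_template_end(const std::string& text, size_t start) {
        int depth = 0;
        for (size_t i = start; i + 1 < text.size(); ++i) {
            if (text[i] == '{' && text[i + 1] == '{') { depth++; i++; }
            else if (text[i] == '}' && text[i + 1] == '}') {
                if (--depth == 0) return i + 2;
                i++;
            }
        }
        return std::string::npos;   // unbalanced
    }

    int main() {
        std::string wikitext = "{{Birth date and age|1964|9|2}}";
        size_t start = wikitext.find("{{");
        size_t end = find_template_end(wikitext, start);

        // Split the inner text on '|', but only at nesting depth zero.
        std::string inner = wikitext.substr(start + 2, end - start - 4);
        std::vector<std::string> parts;
        std::string cur;
        int depth = 0;
        for (char c : inner) {
            if (c == '{' || c == '[') depth++;
            else if (c == '}' || c == ']') depth--;
            if (c == '|' && depth == 0) { parts.push_back(cur); cur.clear(); }
            else cur += c;
        }
        parts.push_back(cur);

        std::cout << "template: " << parts[0] << "\n";   // "Birth date and age"
        for (size_t i = 1; i < parts.size(); ++i)
            std::cout << "  param " << i << ": " << parts[i] << "\n";
    }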

Extracting all the data

Now all that was left was to interpret the parsed tree, get all the associated actors, parse their individual infoboxes for relevant data, along with their birth place locations, if linked. All this information was inserted into the Postgres database as well. Initially I had written it so that, during the processing of all movies, actors and locations, every record found would be immediately inserted into the database. This proved to be a huge mistake, as it took about 80 minutes to process everything. After some research, I found it was way more efficient to just write all the information to a couple of CSV files and then COPY those into Postgres. This reduced the total processing time to around 4 minutes for all movies, actors and locations.
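
The shape of that change, as a minimal sketch (table and column names are placeholders, not the actual schema):

    #include <fstream>

    int main() {
        // During processing: buffer one CSV line per record instead of one INSERT.
        std::ofstream csv("actors.csv");
        csv << 123456 << ",\"Keanu Reeves\",1964-09-02\n";
        // ...hundreds of thousands more rows...
        csv.close();

        // Afterwards, a single bulk load into Postgres replaces all the INSERTs:
        //   COPY actors (page_id, name, birth_date) FROM '/path/to/actors.csv'
        //     WITH (FORMAT csv);
    }

Per-row INSERTs pay for a round trip and per-statement overhead on every single record, while COPY streams everything in one statement, which is where most of the 80-to-4-minute difference comes from.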

All this data was then exported to pre-gzipped JSON files on the hard drive, which the website then reads. I chose to write the website without any framework at all; it's simple HTML, CSS and plain JavaScript.

Errors and inconsistencies

During the interpretation of the data, I encountered lots of errors: from non-existent dates (February 30th) to information put into the wrong infobox fields (e.g. birth place and birth date swapped). The most common error, however, was incorrectly linked actors, mostly due to the actor having a similar name to someone else. There were mythological deities, medieval priests, foodstuffs, generic name pages, towns, companies and royal titles being linked as actors in movies. I've since made many hundreds of corrections.

Wikipedia is very liberal when it comes to its content and template formatting, so a lot of flexibility had to be built into the extraction of the relevant data. For parsing birth and death dates, for example, I encountered many templates used to provide these dates. These templates could occur as "startdate", "enddate", "birthdate", "birthdateandage", "dateofbirth", "dateofbirthandage", "bda", "dob", "deathdateandage", "dda", "deathdate", "dateofdeath", "deathdateandgivenage", "dateofdeathandage", "releasedate", "disappeareddateandage" or "disappeareddate", and would all be parsed in the exact same way. But then there are completely different templates that have to be parsed in a totally different way, such as "birthdeathage", "birthyearandage", "bya", "birthmonthandage", "yearofbirthandage", "deathyearandage", "dya", "oldstyledate" and "oldstyledatedy". Not to mention some actors having their dates entered as a plain textual date. The actors without an infobox were even more challenging, as their dates had to be extracted from the lead paragraph of the body.
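
One way to keep that manageable, sketched below with illustrative handler signatures: normalise the template name (lowercase, spaces and underscores stripped) and look it up in a dispatch table, so all the aliases that share a layout also share a handler.

    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Date { int year = 0, month = 0, day = 0; };
    using Params  = std::vector<std::string>;
    using Handler = std::function<std::optional<Date>(const Params&)>;

    // "birthdate", "bda", "deathdate", ...: year|month|day as positional parameters.
    std::optional<Date> parse_ymd(const Params& p) {
        if (p.size() < 3) return std::nullopt;
        return Date{std::stoi(p[0]), std::stoi(p[1]), std::stoi(p[2])};
    }

    // "birthyearandage", "bya", ...: only a year is given.
    std::optional<Date> parse_year_only(const Params& p) {
        if (p.empty()) return std::nullopt;
        return Date{std::stoi(p[0]), 0, 0};
    }

    const std::unordered_map<std::string, Handler> handlers = {
        {"birthdate", parse_ymd},             {"birthdateandage", parse_ymd},
        {"dob", parse_ymd},                   {"bda", parse_ymd},
        {"deathdate", parse_ymd},             {"deathdateandage", parse_ymd},
        {"birthyearandage", parse_year_only}, {"bya", parse_year_only},
        // ...and so on for the remaining aliases and their own handlers...
    };

    int main() {
        // Normalised name and parameters as they would come out of the parser.
        Params params = {"1964", "9", "2"};
        if (auto it = handlers.find("birthdateandage"); it != handlers.end())
            if (auto d = it->second(params))
                std::cout << d->year << '-' << d->month << '-' << d->day << '\n';
    }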

Conclusion

While it was a very interesting challenge to see how difficult it would be to extract useful information from an inherently unstructured source, the project is not finished; more polishing needs to be done. A lot of movies, actors and birth place locations were extracted though, so I am happy with the results.
