The Unknown Perils of Mining Wikipedia

Robots learning from robots

Because of the breadth and availability of its content, Wikipedia has been widely used as a reference dataset for research in machine learning and for tech demos. However, Wikipedia has some serious problems that are not apparent from our familiarity with it as a resource for human beings.

Wikipedia has good coverage of popular topics and very irregular coverage of unpopular topics. Human readers rarely notice this, since it is precisely the popular pages that are consumed: the most popular 12% of Wikipedia accounts for 90% of all traffic. The irregularity of coverage is poisonous to many models. A topic model trained on all of Wikipedia, for example, will associate "river" with "Romania" and "village" with "Turkey". Why? Because there are 10k pages on villages in Turkey, and comparatively few pages on villages in other places.

To make things worse, unpopular pages are very often robot-generated. For example, rambot authored 98% of all the articles on US towns, and half of the Swedish Wikipedia is written by lsjbot! Robot-generated pages are built by inserting data into sentence templates. The sheer mass of these pages means that a huge proportion of the language examples a model learns from are just the same template used over and over. Robots learning from robots.
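To make the template mechanism concrete, here is a minimal sketch of how such pages can be produced. The template text and the data rows are invented for illustration; they are not the actual templates or data used by rambot or lsjbot.

```python
# Hypothetical template and rows, made up for this example only.
TEMPLATE = (
    "{name} is a town in {county}, {state}. "
    "The population was {population} at the {year} census."
)

ROWS = [
    {"name": "Alphaton", "county": "Red County", "state": "Ohio",
     "population": 1204, "year": 2000},
    {"name": "Betaville", "county": "Blue County", "state": "Iowa",
     "population": 873, "year": 2000},
]


def generate_articles(template, rows):
    """Fill the template once per data row, as a page-writing bot might."""
    return [template.format(**row) for row in rows]


def skeleton(sentence, row):
    """Swap each inserted value back for its field name.

    This naive string replacement only works because the sample values
    above do not overlap with each other or with the template wording.
    """
    for field, value in row.items():
        sentence = sentence.replace(str(value), "{" + field + "}")
    return sentence


if __name__ == "__main__":
    for article, row in zip(generate_articles(TEMPLATE, ROWS), ROWS):
        print(article)
        print("  skeleton:", skeleton(article, row))
```

Every generated article collapses back to the same skeleton, which is the point: a model trained on thousands of such pages sees one sentence pattern thousands of times.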

Popularity filtering

The most useful trick is to exclude those Wikipedia pages that are not viewed frequently. This automatically excludes the mass of robot-generated pages, and retains those pages that are frequently viewed and therefore edited by humans. Wikipedia publishes page view statistics. Below, for example, are the most frequently viewed pages on the day we derived the dataset (some time ago).

                                   title  dailyviews
                               Main Page     8501031
Climatic Research Unit email controversy      175964
                        Sachin Tendulkar      151859
                                  Jaguar       96096
                            Andy Kaufman       85422
                               Ram-Leela       61326
                 Great Oxygenation Event       55590
                           United States       54713
         General Educational Development       53607
                      Financial services       53305
                                Facebook       46794

On the day that we downloaded the dump, there were 4.3M pages, 1.3M of which had not been viewed once on that day, and 3.8M (i.e. 88%) were looked at fewer than 20 times. The human experience of Wikipedia is restricted to a very small proportion of the pages. Moreover, we found that the performance of our test models improved considerably when we trained on only popular pages.
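As a rough sketch of the filter itself, assume the page view statistics have already been loaded into a dictionary mapping page title to daily views. The function names, the dictionary layout, and the low-traffic sample titles are assumptions made for this example; the threshold of 20 is the one used in this post.

```python
def popular_titles(daily_views, min_views=20):
    """Return the set of titles with at least `min_views` views that day."""
    return {title for title, views in daily_views.items() if views >= min_views}


def share_below(daily_views, threshold=20):
    """Fraction of pages viewed fewer than `threshold` times."""
    if not daily_views:
        return 0.0
    below = sum(1 for views in daily_views.values() if views < threshold)
    return below / len(daily_views)


if __name__ == "__main__":
    # Two real rows from the table above plus two invented low-traffic pages.
    sample = {
        "Jaguar": 96096,
        "Andy Kaufman": 85422,
        "Hypothetical village A": 3,
        "Hypothetical village B": 0,
    }
    print(sorted(popular_titles(sample)))
    print(share_below(sample))

    # The 88% figure quoted above is simply 3.8M out of 4.3M pages.
    print(round(3.8 / 4.3, 2))
```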

Roll-your-own

Wikipedia publishes regular dumps of its content in XML format (here). The dataset we provide below is from October 2013, but unless you care about the latest rap star, that shouldn't bother you. Just in case you'd like to create an updated version of the dataset, here is how it was done:

Download the latest XML data dump

Use the Wikipedia extractor, version 2.6, by Giuseppe Attardi and Antonio Fuschetto of the TANL project at the University of Pisa to produce nice small XML files for each page of the form: <doc id="" url="" title="">...</doc>
We used the single threaded version 2.6 (the multi-threaded version caused us problems).

Wrangle these into the format you want using an XML parser.

Download some page view statistics, and remove all pages with fewer than (e.g.) 20 daily page views.

Exclude non-content pages based on title, e.g. "Image: XXX", "User: XXX"

Drop disambiguation pages.

Drop stubs.
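The steps above can be sketched roughly as follows. This is not the code we ran: it assumes the extractor output has been read into a string, uses a regular expression rather than a strict XML parser (in case the page text is not well-formed XML), and the disambiguation and stub checks are simple stand-in heuristics chosen for illustration. The names `iter_docs`, `keep`, and `min_words` are hypothetical.

```python
import re

# Matches blocks of the form <doc id="" url="" title="">...</doc>.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">'
    r"(?P<text>.*?)</doc>",
    re.DOTALL,
)

# Title prefixes mentioned in the step above; extend as needed.
EXCLUDED_PREFIXES = ("Image:", "User:")


def iter_docs(raw):
    """Yield one dict per <doc> block found in the extractor output."""
    for match in DOC_RE.finditer(raw):
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "title": match.group("title"),
            "text": match.group("text").strip(),
        }


def keep(doc, daily_views, min_views=20, min_words=50):
    """Apply the filtering steps to a single page."""
    title = doc["title"]
    if title.startswith(EXCLUDED_PREFIXES):
        return False
    # Assumption: disambiguation pages are recognised by their title only.
    if "(disambiguation)" in title:
        return False
    # Assumption: a stub is any page with fewer than `min_words` words.
    if len(doc["text"].split()) < min_words:
        return False
    return daily_views.get(title, 0) >= min_views
```

A page survives only if it passes every check, so the order of the checks affects speed but not the result.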

The dataset

We worked through this procedure using an XML data dump from October 2013, retaining only those pages with at least 20 page views. You can download it here as a UTF encoded, two-column CSV, the first column being the URL of the page, the second column being the text of the page. Linefeeds in the text are escaped as '\n'. There are 463k pages. File size is 1.2GB compressed (gzip).
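A minimal sketch of reading the file with the Python standard library follows. It assumes the encoding is UTF-8 and that the CSV uses the default quoting rules of Python's `csv` module; neither is stated above, so check both against the actual file. The function name `read_pages` and the file name in the usage comment are hypothetical.

```python
import csv
import gzip

# Page texts can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def read_pages(path, encoding="utf-8"):
    """Yield (url, text) pairs from the gzipped two-column CSV."""
    with gzip.open(path, "rt", encoding=encoding, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue  # skip rows that do not have exactly two columns
            url, text = row
            # Turn the escaped two-character sequence \n back into a linefeed.
            yield url, text.replace("\\n", "\n")


# Usage (file name is a placeholder):
# for url, text in read_pages("wikipedia_popular.csv.gz"):
#     print(url, len(text))
```

Reading row by row keeps memory use low, which matters for a file of this size. Note that the unescaping cannot tell an escaped linefeed apart from a literal backslash followed by "n" in the original page text.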
