If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the overwhelming mass of robot-generated pages that no human ever reads. We provide a cleaned corpus, along with a Wikipedia recommendation API derived from it.
Because of the breadth and availability of its content, Wikipedia has been widely used as a reference dataset for research in machine learning and for tech demos. However, Wikipedia has some serious problems that are not apparent from our familiarity with it as a resource for human beings.
Wikipedia has good coverage of popular topics and very irregular coverage of unpopular topics. Humans are unaware of this, since it is precisely the popular pages that are consumed: the most popular 12% of Wikipedia accounts for 90% of all traffic. The irregularity of coverage is poisonous to many models. A topic model trained on all of Wikipedia, for example, will associate "river" with "Romania" and "village" with "Turkey". Why? Because there are 10k pages on villages in Turkey, and comparatively few pages on villages anywhere else.
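To check a concentration figure like the one above on your own page view data, a minimal sketch follows. The function name and the toy counts are made up for illustration; they are not taken from the real statistics.

```python
def traffic_share_of_top(view_counts, top_fraction):
    """Share of all views captured by the most-viewed `top_fraction` of pages."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(view_counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Always keep at least one page so tiny fractions still return a value.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / total


# Toy data (hypothetical): one very popular page and a long tail.
toy_counts = [900, 50, 20, 10, 10, 5, 3, 2, 0, 0]
share = traffic_share_of_top(toy_counts, 0.1)
print(f"Top 10% of pages receive {share:.0%} of views")
```

Run against a full day of real view counts with `top_fraction=0.12`, this is the kind of calculation that would let you compare your own data with the 12% / 90% figure.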
To make things worse, unpopular pages are very often robot-generated. For example, rambot authored 98% of all the articles on US towns, and half of the Swedish Wikipedia is written by lsjbot! Robot-generated pages are built by inserting data into sentence templates. The sheer mass of these pages means that a huge proportion of the language examples a model learns from are just the same template used over and over. Robots learning from robots.
The most useful trick is to exclude those Wikipedia pages that are not viewed frequently. This automatically excludes the mass of robot-generated pages, and retains those pages that are frequently viewed and therefore edited by humans. Wikipedia publishes page view statistics. Below, for example, are the most frequently viewed pages on the day we derived the dataset (some time ago).
On the day that we downloaded the dump, there were 4.3M pages, 1.3M of which had not been viewed once on that day, and 3.8M (i.e. 88%) were looked at fewer than 20 times. The human experience of Wikipedia is restricted to a very small proportion of the pages. Moreover, we found that the performance of our test models improved considerably when we trained on only popular pages.
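The popularity filter can be sketched roughly as follows. This is an illustration, not the exact script we ran. It assumes the page view statistics arrive as whitespace-separated lines of project code, page title, request count and bytes transferred; check that layout against the files you actually download. The function names and sample lines are invented for the example.

```python
from collections import Counter


def count_views(lines, project="en"):
    """Sum request counts per page title for one project.

    Assumed line layout: "<project> <title> <requests> <bytes>".
    Lines that do not match that shape are skipped.
    """
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or parts[0] != project:
            continue
        title, count = parts[1], parts[2]
        if count.isdigit():
            views[title] += int(count)
    return views


def popular_titles(views, min_views=20):
    """Keep only titles viewed at least `min_views` times."""
    return {title for title, n in views.items() if n >= min_views}


# Made-up sample lines standing in for one day of statistics.
sample = [
    "en Main_Page 150 123456",
    "en Obscure_Village 3 2048",
    "en Obscure_Village 4 2048",
    "sv Main_Page 99 4096",
    "en Albert_Einstein 20 65536",
]
views = count_views(sample)
kept = popular_titles(views)
print(sorted(kept))
```

The threshold of 20 matches the cut-off discussed above. If the statistics are split across several files for a day, feed all of their lines through `count_views` before applying the threshold.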
Wikipedia publishes regular dumps of its content in XML format (here). The dataset we provide below is from October 2013, but unless you care about the latest rap star, that shouldn't bother you. Just in case you'd like to create an updated version of the dataset, here is how it was done:
We worked through this procedure using an XML data dump from October 2013, retaining only those pages with at least 20 page views. You can download it here as a UTF-encoded, two-column CSV, the first column being the URL of the page, the second column being the text of the page. Linefeeds in the text are escaped as '\n'. There are 463k pages. File size is 1.2GB compressed (gzip).
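If you want to load the file in Python, a minimal reader might look like the sketch below. It assumes the encoding is UTF-8 and that the file uses standard CSV quoting, neither of which is spelled out above. The function name `read_pages` is ours, not part of any library.

```python
import csv
import gzip

# Page texts can be far longer than the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def read_pages(path):
    """Yield (url, text) pairs from the gzipped two-column CSV.

    Escaped linefeeds (a backslash followed by "n") are turned back
    into real newlines. Rows without exactly two columns are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 2:
                continue
            url, text = row
            yield url, text.replace("\\n", "\n")
```

Because `read_pages` is a generator, you can iterate over the 463k pages one at a time without holding the whole corpus in memory.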