July 12, 2015

The arXiv as Dataset

The arXiv is a repository of over 1 million preprints in physics, mathematics and computer science. It is truly open access, and the preprints are an excellent dataset for testing out all sorts of language modelling / machine learning prototypes.

What's available?

Full-text and article metadata are published in two different ways.

The preprint metadata (title, abstract, authors, categories) are published via the OAI Protocol for Metadata Harvesting (OAI-PMH) and via the arXiv API.
The full-text of all preprints is made available in a huge data dump on S3, either as PDFs or as TeX source files.

I've found the preprint metadata much easier to work with, being smaller in size, cleaner and more frequently updated. Using the OAI-PMH interface, we fetch new abstracts every day.
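For a rough idea of what a daily update request looks like, here is a minimal sketch that builds an OAI-PMH `ListRecords` URL with a `from` date. The verb and the `metadataPrefix`, `from` and `set` arguments come from the OAI-PMH 2.0 specification. The base URL and the helper name `build_list_records_url` are assumptions for illustration, so check the arXiv help pages for the current endpoint before relying on it.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Assumed endpoint: verify the current OAI-PMH base URL in the arXiv docs.
ARXIV_OAI_BASE = "http://export.arxiv.org/oai2"


def build_list_records_url(since, metadata_prefix="oai_dc", set_spec=None,
                           base_url=ARXIV_OAI_BASE):
    """Build a ListRecords request for records changed on or after `since`."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since.isoformat(),  # day granularity: YYYY-MM-DD
    }
    if set_spec is not None:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Ask for everything that changed since yesterday.
    yesterday = date.today() - timedelta(days=1)
    print(build_list_records_url(yesterday))
```

A large response is split across several pages linked by a `resumptionToken`, which a harvesting library would normally follow for you.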

The arXiv as a dataset

There are many tasks for which the arXiv is an ideal dataset. You could use the tags (subject categories and MSC classes) to train a tagger, for instance, or test out your ideas for summarisation or keyword extraction. We feed the abstracts into our content recommender to provide a way to browse the arXiv conceptually: when you read an abstract, articles with conceptually related abstracts are surfaced automatically (see earlier post). The arXiv dataset often turns up in the language modelling literature as well, including in a recent paper by arXiv founder Paul Ginsparg and Alexander Alemi.

OAI-PMH -- WTF?

I had never heard of OAI-PMH before I wanted to work with arXiv data. It must have been popular at some stage, because there is a very long list of institutions that publish via OAI-PMH. To my knowledge, though, most are too small to be interesting, e.g. the ePrint server of such-and-such a university. The big three seem to be the arXiv, CERN and PubMed Central.

All OAI-PMH publishers must serve "Dublin Core" metadata, an XML-based format that looks like this:

<pre>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.openarchives.org/OAI/2.0/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
<dc:creator>Toth, Csaba D.</dc:creator>
<dc:subject>Mathematics - Combinatorics</dc:subject>
<dc:subject>05B25, 11T99</dc:subject>
<dc:description> It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences. This
bound is the best possible, and it generalizes the celebrated theorem by
Szemer\'edi and Trotter about point-line incidences in the real Euclidean plane
${\mathbb R}^2$.
</dc:description>
<dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
<dc:date>2003-05-19</dc:date>
<dc:date>2014-05-16</dc:date>
<dc:type>text</dc:type>
<dc:identifier>http://arxiv.org/abs/math/0305283</dc:identifier>
<dc:identifier>Combinatorica 35 (1) (2015), 95-126</dc:identifier>
<dc:identifier>doi:10.1007/s00493-014-2686-2</dc:identifier>
</oai_dc:dc>
</pre>
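To show how a record like this maps onto a flat structure, here is a minimal sketch using Python's standard `xml.etree.ElementTree` (not BeautifulSoup). The function name `parse_dc_record` and the output field names are made up for illustration, and the rule that a comment is any description starting with "Comment:" is an assumption drawn from this one sample. The sample record is trimmed to the first sentence of the abstract.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def parse_dc_record(xml_text):
    """Turn one oai_dc record into a plain dict of strings and lists."""
    root = ET.fromstring(xml_text)

    def texts(tag):
        # Stripped text of every <dc:tag> child, skipping empty elements.
        return [el.text.strip() for el in root.findall(DC + tag)
                if el.text and el.text.strip()]

    descriptions = texts("description")
    identifiers = texts("identifier")
    return {
        "title": " ".join(texts("title")),
        "authors": texts("creator"),
        "subjects": texts("subject"),
        # Assumption from the sample: comments start with "Comment:".
        "abstract": " ".join(" ".join(d.split()) for d in descriptions
                             if not d.startswith("Comment:")),
        "dates": texts("date"),
        "url": next((i for i in identifiers if i.startswith("http")), None),
        "doi": next((i[len("doi:"):] for i in identifiers
                     if i.startswith("doi:")), None),
    }


# Trimmed copy of the record above (raw string, since the abstract has backslashes).
SAMPLE = r"""<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>The Szemeredi-Trotter Theorem in the Complex Plane</dc:title>
<dc:creator>Toth, Csaba D.</dc:creator>
<dc:subject>Mathematics - Combinatorics</dc:subject>
<dc:subject>05B25, 11T99</dc:subject>
<dc:description> It is shown that $n$ points and $e$ lines in the complex Euclidean plane
${\mathbb C}^2$ determine $O(n^{2/3}e^{2/3}+n+e)$ point-line incidences.
</dc:description>
<dc:description>Comment: 24 pages, 5 figures, to appear in Combinatorica</dc:description>
<dc:date>2003-05-19</dc:date>
<dc:date>2014-05-16</dc:date>
<dc:identifier>http://arxiv.org/abs/math/0305283</dc:identifier>
<dc:identifier>doi:10.1007/s00493-014-2686-2</dc:identifier>
</oai_dc:dc>"""

if __name__ == "__main__":
    print(parse_dc_record(SAMPLE))
```

Note that Dublin Core reuses the same element for different things: both the abstract and the comment arrive as `dc:description`, and the arXiv URL, journal reference and DOI all arrive as `dc:identifier`, so some heuristic like the prefix checks here is needed to tell them apart.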

Harvesting

We use the Python package oai-harvest (by John Harrison at the University of Liverpool) to harvest the OAI-PMH metadata. It comes with some neat command-line tools that accept a date filter when harvesting, which is useful for update cycles. (One word of warning: it is best to start small, since there are over 1M records on the arXiv and oai-harvest writes out a file for each one.) We then use BeautifulSoup to process the XML into a format we find more amenable.
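Since the harvest leaves one file per record, a first processing step is often to merge them into a single file. The sketch below is one way to do that, not our actual pipeline: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, assumes each harvested file ends in `.xml` and has an `oai_dc:dc` root element, and the function name `records_to_jsonl` and the output fields are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element namespace


def records_to_jsonl(harvest_dir, out_path, limit=None):
    """Merge per-record XML files into a single JSON Lines file.

    Returns the number of records written. `limit` caps how many records
    are written, which is handy for a small trial run.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumption: harvested files sit directly in harvest_dir as *.xml.
        for path in sorted(Path(harvest_dir).glob("*.xml")):
            if limit is not None and written >= limit:
                break
            try:
                root = ET.parse(path).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed XML
            row = {
                "file": path.name,
                "title": root.findtext(DC + "title", default="").strip(),
                "authors": [el.text for el in root.findall(DC + "creator")
                            if el.text],
            }
            out.write(json.dumps(row) + "\n")
            written += 1
    return written
```

Running it first with a small `limit` is a cheap way to check the output format before processing the full set of files.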

That's it; I hope you found it useful. Please do check out the arXiv demo!

Extra links

pmigdal on DataTau brought to my attention this StackExchange post with more about OAI-PMH for the arXiv.
