By ignoring citation graphs and keywords, you can discover papers and researchers you never knew existed. Check it out here (on arXiv papers = ML, CS, math & physics).
For the technically inclined, the arXiv is a treasure trove of recent research in machine learning, computer science, mathematics and physics, making available the full text of over 1M preprints. I first met the arXiv as a mathematician, and it was (and still is) the first place to look for new maths research. Papers appear on the arXiv long before they are officially published, and are often revised there after publication.
The arXiv is very effective as a repository, but it desperately lacks any content discovery functionality (this is not a critique: as a repository, the arXiv shows how a community can simply bypass research paywalls). In any case, when I started applying machine learning to content recommendation, the arXiv was at the top of my hit list. So, to show off what we could do here at Lateral, we built a document discovery solution for the arXiv.
Here is how it works: once you find a paper you are interested in, all the papers that are thematically similar are listed directly below it. You can search by title or author, or, by pressing the "paragraph button" next to the search box, search by pasting in a whole chunk of text, e.g. the abstract of an article that interests you. Star the articles you like, and keep cruising for new papers.
I shamelessly lobbied my coworkers to build this tool together, because I wanted it for myself. Using it, I have found many papers (and researchers!) I had never heard of who work on topics similar to mine. In every case, the new paper or researcher had remained undiscovered for one of two reasons: either the researcher belonged to a different community, so our citation graphs never overlapped, or the paper expressed similar ideas using different keywords.
Our document model is purposefully blind to both social cliques and keywords -- it breaks through both these barriers. It sees only ideas!
We place our content recommender at your disposal and you can do what you want with it. It understands any English text that you throw at it. You could build a similar tool yourself, for any corpus you want. Knock yourselves out.
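To make "build a similar tool yourself" concrete, here is a minimal sketch of one way a text-similarity recommender could work: rank documents by cosine similarity of TF-IDF vectors. This is only an illustration in plain Python, not Lateral's actual model, and the function names are my own:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict of term -> weight) per document.

    `docs` is a list of tokenised documents (lists of words).
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(query_idx, vecs):
    """Indices of the other documents, ranked by similarity to the query."""
    sims = [(cosine(vecs[query_idx], v), i)
            for i, v in enumerate(vecs) if i != query_idx]
    return [i for _, i in sorted(sims, reverse=True)]
```

Feed it the abstracts of your corpus (tokenised however you like) and `most_similar` gives you a crude "papers like this one" list; swapping TF-IDF for learned document embeddings is the natural next step.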
The interface is (we hope) functional but stripped back. There are some problems rendering some of the LaTeX. It would be great to have the ability to restrict the search by date, or to link to citations. All these things can be done! Please get in touch with your other suggestions and ideas.
Breaking documents into “chunks”, like sections and subsections, is easy for humans but surprisingly hard for computers. In this post we explain why that is, why it’s a valuable problem to solve, and introduce our new solution.
This post describes a simple principle to split documents into coherent segments, using word embeddings.
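As a toy illustration of the general idea (not necessarily the exact principle the post describes), one simple embedding-based segmenter averages the word vectors of each sentence and starts a new segment whenever the similarity between consecutive sentence vectors drops below a threshold. The embeddings, dimension and threshold below are made-up placeholders:

```python
import math

def mean_vector(tokens, embeddings, dim):
    """Average the word vectors of a sentence; zero vector if none are known."""
    vec = [0.0] * dim
    known = [embeddings[t] for t in tokens if t in embeddings]
    for v in known:
        for i, x in enumerate(v):
            vec[i] += x
    if known:
        vec = [x / len(known) for x in vec]
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def segment(sentences, embeddings, dim, threshold=0.5):
    """Greedy split: return the indices where new segments begin.

    A boundary is placed before sentence i whenever its vector is
    dissimilar to the previous sentence's vector.
    """
    vecs = [mean_vector(s, embeddings, dim) for s in sentences]
    boundaries = [0]
    for i in range(1, len(vecs)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            boundaries.append(i)
    return boundaries
```

With pretrained embeddings, consecutive sentences on the same topic tend to have similar mean vectors, so a sharp similarity drop is a reasonable (if greedy) signal of a topic boundary.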