Breaking documents into “chunks”, such as sections and subsections, is easy for humans but surprisingly hard for computers. In this post we explain why that is, why it’s a valuable problem to solve, and introduce our new solution.
This post describes a simple principle for splitting documents into coherent segments using word embeddings.
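To give a flavour of the idea, here is a minimal sketch (not necessarily the post’s exact method): represent each sentence by the average of its word vectors, and start a new segment wherever the similarity between adjacent sentences drops. The `vectors` lookup and the threshold value are assumptions for illustration.

```python
# Sketch of embedding-based segmentation: average word vectors per
# sentence, split where adjacent-sentence cosine similarity is low.
import numpy as np

def sentence_vector(sentence, vectors):
    # `vectors` maps a word to a NumPy array (e.g. gensim KeyedVectors
    # or a plain dict); out-of-vocabulary words are skipped.
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return None
    return np.mean([vectors[w] for w in words], axis=0)

def segment(sentences, vectors, threshold=0.4):
    segments, current = [], [sentences[0]]
    prev = sentence_vector(sentences[0], vectors)
    for sent in sentences[1:]:
        vec = sentence_vector(sent, vectors)
        if prev is not None and vec is not None:
            sim = np.dot(prev, vec) / (np.linalg.norm(prev) * np.linalg.norm(vec))
            if sim < threshold:  # likely topic shift: close the segment
                segments.append(current)
                current = []
        current.append(sent)
        if vec is not None:
            prev = vec
    segments.append(current)
    return segments
```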
In this blog post we describe an experiment in constructing semantic trees and show how they can improve the quality of learned word embeddings on common word analogy and similarity tasks.
How can you learn a map from a German-language to an English-language word vectorisation model, enabling cross-lingual document comparison?
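One standard recipe, due to Mikolov et al. (2013) and which may or may not be the approach taken in the post, is to fit a linear map from German vectors to English vectors by least squares over a small bilingual dictionary. The vector lookups and seed dictionary below are assumptions for illustration.

```python
# Sketch: learn a linear map W with W @ x_de ≈ x_en, fit by least
# squares over (German, English) translation pairs.
import numpy as np

def fit_translation_matrix(pairs, de_vecs, en_vecs):
    # `pairs` is a hypothetical seed dictionary of (German, English)
    # word pairs; `de_vecs`/`en_vecs` map words to NumPy vectors.
    X = np.array([de_vecs[de] for de, en in pairs])  # German side
    Y = np.array([en_vecs[en] for de, en in pairs])  # English side
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)        # solves X @ B ≈ Y
    return B.T                                       # so that W @ x ≈ y

def translate(word, W, de_vecs):
    # Map a German vector into the English space; the nearest English
    # neighbours of the result are candidate translations.
    return W @ de_vecs[word]
```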
By labelling documents with the users who read them, we used fastText to hack together a “hybrid recommender” system.
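In fastText’s supervised mode this amounts to treating each reader as a label on the documents they read, so that predicting labels for a new document recommends it to likely readers. The file name and hyperparameters below are assumptions, not the post’s exact setup.

```python
# Rough sketch of the "hybrid recommender" idea with fastText's
# supervised classifier. Training file format, one document per line:
#   __label__user42 __label__user7 the text of the document ...
import fasttext

model = fasttext.train_supervised(input="reads.txt", epoch=25, wordNgrams=2)

# Top-5 users most likely to want to read an unseen document.
labels, scores = model.predict("text of an unseen document", k=5)
print(labels, scores)
```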
What kind of language do British parliamentarians use? We used the Lateral API to provide an overview by clustering debates and creating word clouds.
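The post itself uses the Lateral API; as a generic illustration of the same pipeline (cluster, then build a word cloud per cluster), the sketch below substitutes TF-IDF with k-means and the `wordcloud` package, with `debates` assumed to be a list of debate transcripts.

```python
# Generic cluster-then-word-cloud pipeline (not the Lateral API).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from wordcloud import WordCloud

vectoriser = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectoriser.fit_transform(debates)  # `debates`: list of strings

km = KMeans(n_clusters=10, random_state=0).fit(X)

for cluster in range(km.n_clusters):
    # Pool the text of each cluster and render one word cloud per topic.
    text = " ".join(d for d, c in zip(debates, km.labels_) if c == cluster)
    WordCloud(width=800, height=400).generate(text).to_file(f"cluster_{cluster}.png")
```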
Previously we wrote about how machines can learn meaning. An exciting consequence of this approach is that it also enables us to teach machines new languages.
The arXiv is a repository of over 1 million preprints. It is truly open access and excellent for testing language-modelling and machine-learning prototypes.
Computers consist of on/off switches and process meaningless symbols. So how can we hope that machines will learn the meaning of words and documents?
If a machine is to learn about humans from Wikipedia, it must experience the corpus as a human sees it and ignore the mass of robot-generated pages.