Computers consist of on/off switches and process meaningless symbols. So how is it that we can hope that computers might understand the meaning of words, products, actions and documents? If most of us consider machine learning to be magic, it is because we don’t yet have an answer to this question. Here, I’ll provide an answer in the context of machines learning the meaning of words. But as we’ll see, the approach is the same everywhere.
Firstly, some motivation. Why would you want a machine to understand the meaning of a word? Consider the case where it doesn’t, and words are treated as meaningless symbols. In this case, the only way to compare two words is to check if they are the same word. So the machine considers the word ship to be totally unrelated to the word boat: these two words would be as unrelated to one another as the words cat and dynamite.
This "keyword autism" has the advantage of precision: it's useful in keyword search, for instance, if you know the document you want and you know its exact title. But it is catastrophic for document discovery. Imagine having a research assistant who, when asked to find documents about "Nordic boat building", deliberately ignored an article on "Scandinavian ship construction" because the words weren’t exactly the same. It’d be time for a new research assistant.
At Lateral, we’ve built a tool for document discovery, for surfacing relevant documents that you didn’t know were there. For us, then, keyword autism was the enemy. Our machines needed to understand that the words ship and boat represent very similar concepts. Our machines needed to learn word meaning.
The key insight, made at the beginning of the information age, is to replace word meaning with something that machines can actually measure. It’s called the Distributional Hypothesis, and claims:
“words are characterised by the company that they keep”
This means, for example, that the words ship and boat must represent related notions because they both occur often with the words stern, sail and sea, but almost never with glycerine or meow. On the other hand, the words that occur with dynamite are very different from those that occur with cat, so these words must represent unrelated notions. Much has been made of this of late in the machine learning community (e.g. word2vec), but the idea is in fact seventy years old.
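To make "the company that they keep" concrete, here is a minimal sketch in Python. The six-sentence corpus, the stopword list and the `company` helper are all invented for illustration (they are not our production code), and "company" is simplified to mean "appears in the same sentence".

```python
from collections import defaultdict

# Toy corpus invented for illustration; a real system needs far more text.
sentences = [
    "the ship set sail on the open sea",
    "the boat set sail across the sea",
    "the stern of the ship faced the sea",
    "the stern of the boat was painted red",
    "the cat began to meow at the door",
    "the dynamite contained glycerine",
]

# Hypothetical stopword list: very common words say little about meaning.
STOPWORDS = {"the", "of", "on", "to", "at", "was", "a"}

def company(sentences):
    """Map each word to the set of words it shares a sentence with."""
    neighbours = defaultdict(set)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOPWORDS]
        for w in words:
            neighbours[w].update(x for x in words if x != w)
    return neighbours

neighbours = company(sentences)
shared_ship_boat = neighbours["ship"] & neighbours["boat"]
shared_cat_dynamite = neighbours["cat"] & neighbours["dynamite"]

print(sorted(shared_ship_boat))     # ['sail', 'sea', 'set', 'stern']
print(sorted(shared_cat_dynamite))  # []
```

Even on this tiny corpus, ship and boat keep overlapping company, while cat and dynamite share none.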
If the distributional hypothesis seems at all familiar, it's because the same approach is applied in different domains. Consider, for example, building a recommender for an e-commerce website. Two products are related to the extent that they tend to be purchased together, and two customers are similar to the extent that they buy similar products. The fundamental insight is that objects (be they words or products) are related to one another by their use. The relationships between objects are used as a proxy for any intrinsic meaning the objects might have. Mathematicians will find this point of view familiar from abstract algebra and category theory.
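The recommender version of the same idea can be sketched in a few lines. The baskets below are made up, and `co_purchase_counts` and `related_products` are hypothetical helper names; a real recommender would also normalise for product popularity, which this sketch ignores.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: each set is one customer's basket.
baskets = [
    {"tent", "sleeping bag", "camp stove"},
    {"tent", "sleeping bag"},
    {"camp stove", "gas canister"},
    {"novel", "bookmark"},
    {"tent", "camp stove"},
]

def co_purchase_counts(baskets):
    """Count how often each unordered pair of products is bought together."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[(a, b)] += 1
    return counts

def related_products(product, counts):
    """Rank other products by how often they were bought with `product`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return scores.most_common()

counts = co_purchase_counts(baskets)
tent_related = dict(related_products("tent", counts))
print(tent_related)
```

Nothing here knows what a tent is. The tent is related to the sleeping bag and the camp stove purely because of how customers use them together.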
So the machine can learn which words are related by processing text, sentence by sentence, and seeing which words occur together. Formally, we are trying to estimate the probability that a word w occurs, given e.g. that the word cat occurs: P(w | cat).
We can now forget about word meaning and use these probability distributions, which can be estimated by the machine, in its place. The word cat is then represented by a vector consisting of its co-occurrence probabilities. These vectors live in a very high dimensional vector space, but we can use dimension reduction to make this representation more robust and tractable.
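Here is a rough sketch of that representation, under some loudly stated assumptions: the corpus is a toy one invented for this example, the probability of a word w given a word c is estimated simply as the share of c's sentence-level co-occurrences that involve w, and the function names are hypothetical. Dimension reduction is left out to keep the example short.

```python
import math
from collections import Counter, defaultdict

# Toy corpus invented for illustration; a real system needs far more text.
sentences = [
    "the ship set sail on the open sea",
    "the boat set sail across the sea",
    "the stern of the ship faced the sea",
    "the stern of the boat was painted red",
    "the cat began to meow at the door",
    "the dynamite contained glycerine",
]
STOPWORDS = {"the", "of", "on", "to", "at", "was", "a"}

def cooccurrence_probabilities(sentences):
    """Estimate P(w | c) as the share of c's co-occurrences that involve w."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOPWORDS]
        for i, c in enumerate(words):
            for j, w in enumerate(words):
                if i != j:
                    counts[c][w] += 1
    probs = {}
    for c, row in counts.items():
        total = sum(row.values())
        probs[c] = {w: n / total for w, n in row.items()}
    return probs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(p * v.get(w, 0.0) for w, p in u.items())
    norm_u = math.sqrt(sum(p * p for p in u.values()))
    norm_v = math.sqrt(sum(p * p for p in v.values()))
    return dot / (norm_u * norm_v)

probs = cooccurrence_probabilities(sentences)
ship_boat = cosine(probs["ship"], probs["boat"])
cat_dynamite = cosine(probs["cat"], probs["dynamite"])

print(round(ship_boat, 2))  # roughly 0.63
print(cat_dynamite)         # 0.0
```

Each word's vector is stored sparsely as a dict, since most words never co-occur with most other words. With a realistic vocabulary these vectors have one coordinate per word, which is why dimension reduction matters in practice.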
A task for us, then, is collecting lots of text, so that the machine has an understanding of word relationships from a wide variety of disciplines. In this way, we built an artificial mind that seems to have studied every degree at university. It has studied the news, political science, mathematics, pharmacology, geology and has read patents and case law. You can use it for your own applications with our API, or check out some of the demos.
I hope that has helped demystify our machine learning somewhat!