Previously we've written about how machines can learn meaning. One of the exciting opportunities of this approach is that it also means they can learn new languages very quickly. All you need is enough text data. Wikipedia offers a great starting point and partnering with content providers enables us to quickly gather additional data. We have recently started working on supporting new languages, and thought we would share some initial impressions here.
While it would be awesome to have a native speaker of every language on the team (we currently cover about seven), this isn't always possible. What's amazing about teaching a machine a new language, though, is that the team doesn't need a native speaker to do it.
With a mixture of standardised testing and some Google Translate quality control, anyone can train the machine to learn a new language. It's a simple observation, but one that I think is pretty cool.
Another simple observation: since many machine learning-based language services work for English only, there are opportunities that can readily be filled by providers whose software is language-agnostic by design.
For new companies entering this space, I would recommend considering new languages early on. It feels like something that is easy to put off mentally even once it's worth doing.
We will be releasing our first new language APIs publicly in the near future. If you have text content in languages other than English that you would be interested in recommending, please let us know. We'd also love to hear from you if you have a lot of text content in any language and would like to share it with us to help us train a recommender model.
If there are any languages you would like to see supported, especially those you feel are generally under-supported, click here to suggest one.
If you're working on machine learning solutions for multiple languages, or are considering training new languages and have any questions, please get in touch. It would be great to compare notes, especially on opportunities for understanding multiple languages!
Finally, if you know of any large open-access text databases in academia or law, in any language, we would love to hear about them.
Breaking documents into “chunks”, like sections and subsections, is easy for humans, but surprisingly hard for computers. In this post we explain why this is, why it’s a valuable problem to solve, and we introduce our new solution.
This post describes a simple principle for splitting documents into coherent segments using word embeddings.
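To make the idea concrete, here is a minimal sketch of embedding-based segmentation. It uses a tiny hand-made embedding table and a threshold chosen for illustration (both are assumptions for this example; a real system would use trained vectors such as word2vec or fastText and a tuned or adaptive threshold): each sentence is represented by the average of its word vectors, and a new segment starts wherever the cosine similarity between adjacent sentences drops, signalling a topic shift.

```python
import numpy as np

# Toy 2-D word embeddings for illustration only; in practice these
# would come from a trained model (e.g. word2vec or fastText).
EMBEDDINGS = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "pet": np.array([0.8, 0.2]),
    "stock": np.array([0.0, 1.0]),
    "market": np.array([0.1, 0.9]),
    "price": np.array([0.2, 0.8]),
}

def sentence_vector(sentence):
    """Average the embeddings of the words we have vectors for."""
    vecs = [EMBEDDINGS[w] for w in sentence.split() if w in EMBEDDINGS]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def segment(sentences, threshold=0.7):
    """Start a new segment where adjacent sentences diverge in meaning."""
    segments = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(sentence_vector(prev), sentence_vector(cur)) < threshold:
            segments.append([])  # topic shift: open a new segment
        segments[-1].append(cur)
    return segments

sentences = [
    "the cat is a pet",
    "a dog is a pet",
    "the stock market price",
    "the market price rose",
]
print(segment(sentences))
# The pet sentences and the market sentences end up in separate segments.
```

This greedy adjacent-sentence comparison is the simplest variant; more robust approaches compare windows of sentences or optimise segment boundaries globally, but the underlying principle is the same.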