For the last few months, we’ve been doing occasional work on an approximate nearest neighbours (ANN) vector search tool, written in Python. It’s still not finished and there are many rough edges, but it comes with a working DynamoDB adaptor and hence operates out-of-memory, one our main requirements. On the down side, it isn’t as fast as it needs to be, and is untested in production.
Today, we are open sourcing it, in the hope that others can have some benefit from it, and perhaps even contribute to it. The vision we have for this tool is the following:
How are we doing so far? At the moment, it isn’t as fast as we’d like, we’ve not benchmarked its accuracy and it’s not production-tested. However, it does satisfy the remaining requirements: users can determine the number of recommendations on a per-call basis, vectors can be added with no downtime, it’s scalable and it operates out-of-memory, using a DynamoDB adaptor.
We would’ve much preferred to use a pre-existing tool, but we couldn’t find one that ticked all of the above boxes, though there are some excellent open source projects that go some of the way: FLANN (see also the python binding pyflann), nearpy and annoy, for example. However, at the time of writing none of these operates out-of-memory, although nearpy does feature a Redis adaptor. We wanted it to operate out-of-memory both so that the size of the corpus we can recommend from isn’t dependent on how much RAM we have, and also so that we don’t have to wait for this to load into memory every time we reboot the server.
How does it work? I'll address this in more detail in a forthcoming post, but the short version is: using a collection of hyperplane hashers (AKA "random projections"), like annoy. Both the number of hashers and the number of hyperplanes per hasher can be specified by the user. This accounts for the scalability: if the number of vectors in the search space doubles, we need only add one extra hyperplane to each hasher to ensure the number of vectors per chamber remains constant. At search time, each hasher is queried, and the resulting candidate nearest neighbours are all placed in a big pool. Distances are then computed explicitly between the input vector and each vector in the pool, and the ids of the nearest "N" neighbours ("N" being specified by the user at query time) returned. So having more hashers gives more accurate results, but increases the query time.
Here are some things I'd like to see improved, but haven't had a chance to address myself:
How to create a VHD that is fully compatible with Azure from an Ubuntu Cloud Image base.
In this blog post, I will create a Chrome extension that modifies this blog to set a custom background and to modify the HTML