At Lateral we use PostgreSQL to store documents for our visualiser. Each document consists of a text column and a JSON column with meta data inside such as a title, date and URL. We wanted to create a fast search experience for the visualiser that lets you search the full text of documents as well as their titles to quickly find a document to get Lateral recommendations for.
To achieve this we assessed our options; we could use an open source solution such as Apache Solr or Elasticsearch, we could use a managed solution such as Elastic or Algolia or we could use PostgreSQL’s full text search. We decided to use PostgreSQL for the following reasons:
The main arguments against using PostgreSQL search are accuracy and speed at scale. The accuracy of PostgreSQL searches is not the best. In the examples that follow I’ve explained how to weight titles over text, but we found a few cases where typing the exact title of a document wouldn’t return that document as the first result. This isn’t ideal and please let us know if you have any feedback about this. We’ve had issues in the past with the speed of PostgreSQL searches. This was for 23 million records and we found the speed deteriorates around 1 to 2 million rows.
So even though the dedicated solutions could probably work a bit better, we decided that for this use case they weren’t necessary.
To get optimum performance with PostgreSQL full text search you need to create a column to store the tsvector values with an index on it. Then you need to fill this column with the tsv values and create a trigger to update the field on INSERT and UPDATE. Then you’ll be able to query the table quickly.
To create the column for the tsvector values run:
And then to create the index on this column run:
Now you have somewhere to store the tsvectors you can populate the column. In this example, I am assuming a table with the structure I mentioned above, with a text column and a meta column that contains a JSON object with a title. To populate the tsv column with tsvectors, run the following:
This query gets the title from the meta JSON column and gives it the heighest weight of A. Then it gets the text value and weights it D. Then it combines the two tsvectors and writes them to the tsv column we just created.
At this point, if your data was static you could stop and start querying. But we want all future rows and updates to have up-to-date tsv columns so we need to create a trigger to do this. Firstly, you’ll need to create a function to take a column and update the tsv column:
This basically does the same as the update query above but as a function. Now you need to create a trigger to execute that function when any row is updated or inserted:
Now you can search the database. Replace both occurences of YOUR QUERY text with what you want to search and run the following query:
This query performs two queries; firstly it performs the search on the indexed tsv column. Then it ranks and sorts those results and returns 5.
This query takes about 50ms! We experimented with returning the document text with results and found that if we returned the full text of documents, it added about 350ms to the query time which seems to be network overhead rather than a slower query. If we only returned 200 characters from the texts then it only added around 100ms.
There is some added overhead because we’re using Rails to query PostgreSQL, so it’s not as fast as something like Algolia but it’s good enough for our use case.
See the search in action in one of our recommender demos:
How to create a VHD that is fully compatible with Azure from an Ubuntu Cloud Image base.
In this blog post, I will create a Chrome extension that modifies this blog to set a custom background and to modify the HTML