PostgreSQL Full-Text Search in a Nutshell

ChunTing Wu - Jun 14 - Dev Community

If you ask me to choose a database for microservices, I would probably say PostgreSQL.

On one hand, PostgreSQL is a popular open-source database with many mature practices on both the server side and the client side. On the other hand, PostgreSQL is very "scalable": that scalability covers not only non-functional requirements such as traffic and data volume, but also functional requirements such as full-text search.

In the early stages of planning a microservice, we may not be sure what features it will need. I have to say, the tutorials that insist microservices must have clear boundaries early in the design are nonsense.

Most of the time, once a microservice goes live, more requirements and more iterations follow, and those requirements don't care about the boundaries you defined at the beginning.

Full-text search is a good example. In the early days of a service, we might work just fine with an exact match or a prefix match on text. Until one day you get a request: "Hey, let's add a search feature!"

If the database you chose in the beginning doesn't have a search function, then that's a big problem.

I'm sure you wouldn't introduce a database like Elasticsearch, which specializes in full-text search, just for this sudden need. So what can we do?

Fortunately, PostgreSQL has a solution.

PostgreSQL Built-In Features

The official documentation has a large chapter on how to do full-text search with PostgreSQL.

Let's skip all the technical details and go straight to the how-to.



SELECT id, title, body, author, created_at, updated_at, published,
       ts_rank(tsv, to_tsquery('english', 'excited | trends')) AS rank
FROM blog_posts
WHERE tsv @@ to_tsquery('english', 'excited | trends')
ORDER BY rank DESC;



blog_posts is a table of stored blog posts, and tsv is a special column. It is not metadata that a blog post needs; it is a column we create purely for search.



ALTER TABLE blog_posts ADD COLUMN tsv tsvector;
UPDATE blog_posts SET tsv = to_tsvector('english', title || ' ' || body);



As we can see, tsv is the result of concatenating title and body and applying English stemming.
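One caveat: the UPDATE above populates tsv only once, so it goes stale as rows change. On PostgreSQL 12 or later, a sketch of one way to keep it in sync automatically is a stored generated column (same table layout assumed):

```sql
-- Sketch, PostgreSQL 12+: replace the manually refreshed column
-- with one the database maintains on every INSERT/UPDATE.
ALTER TABLE blog_posts DROP COLUMN tsv;
ALTER TABLE blog_posts ADD COLUMN tsv tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;
```

On older versions, a trigger calling tsvector_update_trigger achieves the same effect.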

The functions to_tsvector and to_tsquery are the core of this query. Through these two functions, the text is normalized into lexemes using the built-in dictionaries. If you're familiar with Elasticsearch, what these two functions do corresponds to the analyzer.

Let's explain this with a flowchart.

(Flowchart: text → tokenizer → token filter → tsvector / tsquery → @@ match → ts_rank scoring)

The tsvector is the result we store in advance, produced by tokenizer and token-filter operations. The same processing is applied to the tsquery, only lighter. We then match the tsquery against the tsvector with the matching operator @@, which produces the set of matching rows. Finally, ts_rank computes a score for sorting.
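To make this concrete, here is what the two functions actually produce with the english configuration: stop words are dropped, and the remaining words are stemmed and stored with their positions.

```sql
SELECT to_tsvector('english', 'The quick brown foxes jumped over it');
-- 'brown':3 'fox':4 'jump':5 'quick':2

SELECT to_tsquery('english', 'excited | trends');
-- 'excit' | 'trend'
```

Because both sides are normalized to the same lexemes, a query for "excited" matches a document that says "exciting" or "excitement".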

In fact, the whole process is much the same as what Elasticsearch does, except that PostgreSQL offers less room for customization and relies on its built-in dictionaries.

It's worth mentioning that the tsv column needs a GIN index, otherwise performance will be poor.



CREATE INDEX idx_fts ON blog_posts USING gin(tsv);



Elasticsearch is efficient not because it has a powerful analyzer, but because it uses inverted indexes under the hood. In PostgreSQL's case, that is GIN, the Generalized Inverted Index.
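A minimal sketch of the inverted-index idea (not PostgreSQL's actual implementation): map each token to the set of row ids containing it, so a search reads a few posting lists instead of scanning every row.

```python
def build_index(docs):
    """docs: {doc_id: text}. Returns {token: set of doc_ids}."""
    index = {}
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            index.setdefault(token, set()).add(doc_id)
    return index

def search_any(index, *terms):
    """OR-search over posting lists, like the 'excited | trends' tsquery."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

docs = {
    1: "postgres full text search",
    2: "inverted index structures",
    3: "full text search with inverted index",
}
index = build_index(docs)
print(search_any(index, "inverted"))       # doc ids 2 and 3
print(search_any(index, "full", "index"))  # doc ids 1, 2 and 3
```

A real GIN index also compresses posting lists and handles positions, but the lookup principle is the same.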

Again, the technical detail is not the focus of this article, so I'll skip it.

Non-built-in PostgreSQL catalogs

It is not difficult to list all currently supported languages for PostgreSQL.



SELECT cfgname FROM pg_ts_config;



As of today (PostgreSQL 16), there are only 29 in total.



postgres=# SELECT cfgname FROM pg_ts_config;
  cfgname
------------
 simple
 arabic
 armenian
 basque
 catalan
 danish
 dutch
 english
 finnish
 french
 german
 greek
 hindi
 hungarian
 indonesian
 irish
 italian
 lithuanian
 nepali
 norwegian
 portuguese
 romanian
 russian
 serbian
 spanish
 swedish
 tamil
 turkish
 yiddish
(29 rows)



As we can see, there is no CJK (Chinese, Japanese, Korean) language at all, i.e. there is no built-in ability to segment such text.

Languages without built-in support can be handled by extensions. Take Chinese as an example: pg_jieba is a widely used extension.

After installing the extension, we just need to switch the text search configuration in PostgreSQL.



CREATE EXTENSION pg_jieba;
UPDATE blog_posts SET tsv = to_tsvector('jieba', title || ' ' || body);



The above is an example for to_tsvector, and of course to_tsquery works the same way.

So for languages without built-in support, we can find language extensions that extend PostgreSQL's capabilities. This is one of the great things about PostgreSQL: it has a rich ecosystem of additional features.

What about AWS RDS?

In the previous section we mentioned that we can install additional extensions to support more languages; however, AWS RDS only allows extensions from its approved list, so we cannot install custom ones.

In this case, we have to transfer the server-side effort to the client-side.

In other words, we need to perform the language-specific stemming and tokenization on the client side, and write the tsv column from the client as well.

Let's continue with jieba as an example. The segmentation engine behind pg_jieba is also available as a Python package, so let's use Python for the example.



import jieba

def tokenize(text):
    # Segment the text and re-join it with spaces, so that
    # PostgreSQL can later split it on whitespace.
    return ' '.join(jieba.cut(text))

# `cur` is an open database cursor (e.g. from psycopg2).
cur.execute("""
    INSERT INTO blog_posts (title, body, author, tsv)
    VALUES (%s, %s, %s, to_tsvector('english', %s))
""", (title, body, author, tokenize(title + ' ' + body)))



Similarly, a query is done by running jieba.cut first and then to_tsquery. One interesting thing: we still use english as the configuration in this example, but it doesn't really matter which one we use; all we need is the ability to split text on whitespace.
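The query side can be sketched like this; `build_tsquery` is an illustrative helper (not from the post) that joins the segmented tokens with `&` so to_tsquery requires all of them:

```python
def build_tsquery(text, cut=None):
    """Join segmented tokens with '&' for to_tsquery.
    `cut` is a tokenizer such as jieba.cut; defaults to whitespace split."""
    tokens = list(cut(text)) if cut else text.split()
    return ' & '.join(tokens)

query = build_tsquery("full text search")
print(query)  # full & text & search

# With jieba installed: build_tsquery(user_input, cut=jieba.cut)
# Then, mirroring the earlier SELECT:
# cur.execute("""
#     SELECT id, title, ts_rank(tsv, to_tsquery('english', %s)) AS rank
#     FROM blog_posts
#     WHERE tsv @@ to_tsquery('english', %s)
#     ORDER BY rank DESC
# """, (query, query))
```

Use `' | '` instead of `' & '` if any-token matching is closer to what your search feature needs.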

Conclusion

In this article, we can see the power of PostgreSQL.

It has many useful features and a rich ecosystem. Even when the ready-made implementations are not enough, we can still build our own on top of its stable and flexible design.

That's why I prefer PostgreSQL.

Even if we don't think something through at the beginning, chances are we can fix it later without getting stuck in a mess.

This article provided an example of full-text search, and here is one more interesting one. At one time, I chose PostgreSQL for a microservice because of its ACID capabilities. However, the requirements changed so drastically that ACID was no longer the priority; what mattered instead was the ability to flexibly support a variety of data types.

Suddenly, PostgreSQL can be transformed into a document database, storing all kinds of nested JSON through JSONB.
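A minimal sketch of that JSONB usage (the events table and its fields are hypothetical, just for illustration):

```sql
CREATE TABLE events (
    id bigserial PRIMARY KEY,
    payload jsonb NOT NULL  -- arbitrary nested JSON documents
);

-- GIN shows up again: it makes containment queries on jsonb fast.
CREATE INDEX idx_events_payload ON events USING gin(payload);

-- Find events whose payload contains {"type": "click"}.
SELECT id FROM events WHERE payload @> '{"type": "click"}';
```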

Thanks to the flexibility of PostgreSQL, it has saved me many times, and I hope this article serves as a useful reference for you.
