If you ask me to choose a database for microservices, I would probably say PostgreSQL.
On one hand, PostgreSQL is a popular open source database with many mature practices, both on the server side and the client side. On the other hand, PostgreSQL is very "scalable". By scalability I mean not only non-functional requirements such as traffic and data volume, but also functional requirements such as full-text search.
In the early stages of planning a microservice, we may not be sure what features it will eventually need. Frankly, the tutorials that keep telling us microservices must have clear boundaries early in the design are nonsense.
Most of the time, as a microservice goes live, there are more requirements and more iterations, and those requirements don't care about the boundaries you define at the beginning.
Full-text search is a good example. In the early days of a service, we might be able to work just fine with an exact match or prefix match of text. Until one day, you get a request: "Hey, let's add a search feature!"
If the database you chose in the beginning doesn't have a search function, then that's a big problem.
I'm sure you wouldn't introduce a database like Elasticsearch, which specializes in full-text search, for this sudden need. So what to do?
Fortunately, PostgreSQL has a solution.
PostgreSQL Built-In Features
In the official documentation there is a large chapter on how to do full-text search with PostgreSQL.
Let's skip all the technical details and go straight to the how-to.
SELECT id, title, body, author, created_at, updated_at, published,
ts_rank(tsv, to_tsquery('english', 'excited | trends')) AS rank
FROM blog_posts
WHERE tsv @@ to_tsquery('english', 'excited | trends')
ORDER BY rank DESC;
blog_posts is a table of stored blog posts, and tsv is a special column. It is not metadata that a blog post needs; it is a column we create purely for search.
ALTER TABLE blog_posts ADD COLUMN tsv tsvector;
UPDATE blog_posts SET tsv = to_tsvector('english', title || ' ' || body);
As we can see, tsv is the result of concatenating title and body and running the text through English stemming. The functions to_tsvector and to_tsquery are the core of this query. It is through these two functions that the text is efficiently normalized using the built-in dictionary. If you're familiar with Elasticsearch, what these two functions do corresponds to an analyzer.
Let's explain this with a flowchart.

The tsvector is the result that we store in advance after the tokenizer and token-filter operations. The same processing is applied to the tsquery, only it is lighter. Then we compare the tsquery and the tsvector with the match operator @@, which produces the set of rows matching the query. Finally, a score is calculated with ts_rank and used to sort the results.
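To make this concrete, here is roughly what the two functions produce; the exact lexemes and positions depend on the dictionary version, so treat the output as an approximation.

SELECT to_tsvector('english', 'We are excited about these new trends');
-- roughly: 'excit':3 'new':6 'trend':7   (stop words dropped, words stemmed)

SELECT to_tsquery('english', 'excited | trends');
-- roughly: 'excit' | 'trend'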
In fact, the whole process is the same as what Elasticsearch does, except that PostgreSQL takes away the ability to customize and relies on the built-in dictionary.
It's worth mentioning that the tsv column needs to be indexed with a GIN index, otherwise performance will be poor.
CREATE INDEX idx_fts ON blog_posts USING gin(tsv);
Elasticsearch is efficient not because it has a powerful analyzer, but because it uses inverted indexes under the hood. In PostgreSQL's case, that is GIN, the Generalized Inverted Index.
Again, the technical detail is not the focus of this article, so I'll skip it.
Non-built-in PostgreSQL catalogs
It is not difficult to list all currently supported languages for PostgreSQL.
SELECT cfgname FROM pg_ts_config;
As of today (PostgreSQL 16), there are only 29 in total.
postgres=# SELECT cfgname FROM pg_ts_config;
cfgname
------------
simple
arabic
armenian
basque
catalan
danish
dutch
english
finnish
french
german
greek
hindi
hungarian
indonesian
irish
italian
lithuanian
nepali
norwegian
portuguese
romanian
russian
serbian
spanish
swedish
tamil
turkish
yiddish
(29 rows)
As we can see, there is no CJK language at all, i.e. no built-in ability to handle double-byte characters.
Languages without built-in support can be handled by extensions. Taking Chinese as an example, pg_jieba is a widely used extension.
After installing the extension, we just need to modify the catalog in PostgreSQL.
CREATE EXTENSION pg_jieba;
UPDATE blog_posts SET tsv = to_tsvector('jieba', title || ' ' || body);
The above is an example with to_tsvector, and of course, to_tsquery works the same way.
So for languages without built-in support, we can find extensions that enhance PostgreSQL's capabilities. This is one of the great things about PostgreSQL: it has a rich ecosystem of add-ons.
What about AWS RDS?
In the previous section we mentioned that we can install additional extensions to support more languages. However, AWS RDS does not let us install arbitrary extensions; only those on its supported list are available.
In this case, we have to move the server-side effort to the client side. In other words, we need to implement the language-specific stemming on the client and write the tsv column from there.
Let's continue with jieba as an example. jieba is the core segmentation logic behind pg_jieba, and it is also available as a Python package, so let's use Python for the example.
import jieba

def tokenize(text):
    # Segment with jieba and join tokens with spaces so PostgreSQL can re-split them
    return ' '.join(jieba.cut(text))

# cur is an open database cursor; title, body, author come from the request
cur.execute("""
    INSERT INTO blog_posts (title, body, author, tsv)
    VALUES (%s, %s, %s, to_tsvector('english', %s))
""", (title, body, author, tokenize(title + ' ' + body)))
Similarly, the query is done with jieba.cut followed by to_tsquery. One interesting thing is that we still use english as the catalog in this example, but it doesn't really matter which one we use; all we need is the ability to split text on whitespace.
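As a rough sketch of the query side (the search helper and the OR-joined query string are my own illustration, assuming the phrase contains no to_tsquery operators):

def search(cur, phrase):
    # Segment the phrase with jieba and join tokens with '|' so to_tsquery
    # treats them as OR terms; skip whitespace-only tokens
    query = ' | '.join(t for t in jieba.cut(phrase) if t.strip())
    cur.execute("""
        SELECT id, title, ts_rank(tsv, to_tsquery('english', %s)) AS rank
        FROM blog_posts
        WHERE tsv @@ to_tsquery('english', %s)
        ORDER BY rank DESC
    """, (query, query))
    return cur.fetchall()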
Conclusion
In this article, we can see the power of PostgreSQL.
It has many useful features and a rich ecosystem. Even when the ready-made implementations are not enough, we can still build our own solutions on top of its stable and flexible design.
That's why I prefer PostgreSQL.
Even if we haven't thought everything through at the beginning, there is a good chance we can fix it later without getting stuck in a mess.
This article used full-text search as its example, so let me add one more interesting case. At one time, I was using PostgreSQL for a microservice mainly for its ACID capabilities. However, the requirements changed so drastically that ACID was no longer the priority; what mattered was the ability to flexibly support a variety of data types.
Suddenly, PostgreSQL could be turned into a document database, storing all kinds of nested JSON through JSONB.
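For illustration only (the events table and its fields are hypothetical, not from the service I mentioned), the switch can be as small as this:

CREATE TABLE events (
    id bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

INSERT INTO events (payload)
VALUES ('{"type": "signup", "user": {"id": 42, "plan": "free"}}');

-- Query into the nested document with the JSONB operators
SELECT payload->'user'->>'plan' AS plan
FROM events
WHERE payload->>'type' = 'signup';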
The flexibility of PostgreSQL has saved me many times, and I hope this article serves as a useful reference for you.