Intro
I have always struggled with SQL syntax. Well, not exactly struggled, it's pretty simple, but I just dislike those weird joins, excessive capitalization, and those pesky ORMs that only make things worse. Wouldn't it be great if SQL syntax was as universal as data structures?
Fortunately, with the emergence of modern AI technologies, especially LLM, we can now write natural language queries and directly ask the database in plain English or even in your native language! Well, at least we can try. In this article, I will discuss the different approaches attempted by the community and share two public repositories containing my projects.
Naïve approach
Perhaps the first idea that comes to mind when trying out ChatGPT is simply prompting it to translate text into SQL. It's as simple as that:
Please translate this to SQL and output only SQL code: show all customers
However, this approach may not work in 80-90% of cases when the table name is not something that made up. Is it "Customer," "Customers," or "customer"? To solve this issue, we need to consider the
Schema
To address the problem, we can easily retrieve the schema using our preferred method. For example, here's what I do for SQLite:
SELECT name, sql
FROM sqlite_master
WHERE type='table';
Once we have the schema, we can modify the prompt:
You should translate everything to SQL queries. For assistance, here is the schema: {schema}
Please translate this to SQL and output only SQL code: show all customers
I have experimented with this setup, and it has worked well with almost any database. About 8 out of 10 queries have been accurate. I haven't conducted any benchmarks or measurements yet, but so far it looks promising! However, generating queries might be costly, especially when every interaction with a new database begins with simple queries like "show all tables" or "show all customers." What if we could memorize some of the most frequent queries?
Semantic search
What is this and why? There is plenty of information available online for a quick introduction. For example, you can refer to this article from Pinecone. Or, since I use Weaviate in my project, here is the link to their nice blog post about vectors, embeddings and vector search. Semantic search focuses on the meaning rather than just string matching. Now, let's integrate our translator with the vector database! After generating a query, we store the question-query pair in our database:
{
"question": "show all tables",
"query": "SELECT name from FROM sqlite_master WHERE type='table';"
}
When a new query comes in, such as "all tables" or "names of tables," our application first searches the database for similar queries and questions. If none are found, it generates a new one and stores it in the database. This ensures that our configuration learns and enriches itself with new queries and questions. For example, let's say our team of 2-3 users retrieves information every day. Over time, there will be fewer and fewer new generations as there are more similar queries. I haven't conducted extensive tests to know if my hypothesis really holds in the long run, apart from the Chinook sample questions. Nonetheless, the concept appears promising!
Implementation
Based on the aforementioned idea, I have developed two projects. The first is an API service written in Go with Weaviate on the backend, while the second is a prototype built rapidly in Python with ChromaDB to demonstrate the hypothesis. The first project is likely to be supported and improved, so you're welcome to participate if you're interested! Please note that both projects are highly experimental, so don't expect a production-ready solution. However, considering that vector databases are extremely fast, the first project should perform exceptionally well with the appropriate deployment. Feel free to test it out with your databases!
Inside the repositories you will find a quick video demo
Python version: https://github.com/zakharsmirnoff/charm-query-py
Go version: https://github.com/zakharsmirnoff/charm-query
Thank you for reading this! If you like the idea, please favorite the repository on GitHub, react to this post, and leave your comments below. I would be incredibly happy to hear your thoughts on the concept.