Weekend Project - Building an NLQ App with ChatGPT

Hannsta - Sep 3 - Dev Community

Part 1: Generating the Data Model

It’s a common refrain: “Just ask your data a question in natural language!”

Having spent most of my career in the space, I can't help but think... it looks so easy…

So, my goal over the next few posts is to see how far I can get using open-source software and AI. My starter goal is the table stakes:

  1. When a user asks a simple “What” question, my app will return a result set with an answer.

My flex goals are a bit more ambitious:

  1. The data setup should not involve any manual user interventions. All GPT.
  2. The result should be flexible and dynamic so the user can interact with it to ask subsequent questions.
  3. The user should be able to ask “Why” questions.

I have broken this up into a few distinct steps. The first two are covered in this article. The third will be available shortly after.

  • Step 1: Model Suggestion
  • Step 2: Model Creation
  • Step 3: Querying the Model

Why Not Just Have GPT Generate SQL?

Let's start by addressing the obvious: it might seem like a good idea to have GPT generate SQL directly from user queries. Maybe in a year or so that will be the correct way, but for now I have found it a bit lacking. As you scale up in complexity, both in the number of tables and in the types of calculations that need to be performed, GPT starts to forget things. In an effort to solve a complex series of joins, it may forget the columns you need for your visualizations.

By placing a metadata layer between ChatGPT and the SQL generation, we can simplify the job for GPT. Instead of asking GPT to generate the entire SQL query from scratch, we ask it to navigate a predefined structure. This not only improves the accuracy of the queries but also makes the system more secure. For example, implementing Row-Level Security (RLS) is much easier when you have a metadata layer controlling what data is accessible. This layer acts as a buffer, ensuring that the generated queries adhere to the rules and structures defined by the model.
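To make that concrete, here is a rough sketch of the idea. The StructuredQuery type below is purely illustrative and not part of any library; the point is that GPT fills in a constrained structure, and the metadata layer owns the SQL:

// Purely illustrative: a constrained structure GPT fills in, instead of free-form SQL.
type StructuredQuery = {
    model: string;       // which predefined Model to query
    columns: string[];   // only columns that exist in that Model
    filters?: string[];  // filters expressed against Model columns
    limit?: number;
};

// GPT only produces a StructuredQuery. The metadata layer decides how it becomes
// SQL, so rules like a row-level security predicate can be injected there
// without GPT ever seeing them.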

Establishing the Data Model

For this example, I am using RestBI, but most BI tools will support this in one form or another. In this context, a Model is the metadata layer, a code representation of your data structure; think of it like LookML, for those familiar with Looker. It defines the tables, columns, joins, and other elements needed to build queries. The model abstracts away the complexity of the underlying database, providing a consistent interface for querying data. For GPT, the end result is a nice flat table structure.

One advantage of using Models here is that they provide a set of related fields. So if a user has subsequent questions, or wants to know "why", we can let them pivot around the initial visualization, within the context of the Model, without having to go all the way back to the database every time.


Step 1: Model Suggestion

Getting GPT to Propose a Starting Point

Running against a sample or test database should be pretty easy, but let’s think ahead. Most databases at a company of any size have thousands of tables, and most of those tables are purpose-built and irrelevant to any given user’s question.

To simplify the starting point, I wanted the first step to be a list of potential Models that exist within the database. An auto-discoverer. Break out the Models by use case, tell me what tables each needs, and why.

The Prompt

The idea is this: by feeding GPT metadata about the database, it can suggest different combinations of tables that might be useful for answering specific types of questions. In this case, I obtained the metadata through the RestBI /metadata endpoint.
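The fetch itself isn't shown in this post, but a minimal sketch might look like the following. The request shape and the assumption that /metadata returns a Table[] are mine:

// Sketch only: pull table/column metadata from RestBI.
// RESTBI_URL, the request body, and the response shape are assumptions.
const RESTBI_URL = process.env.RESTBI_URL ?? 'http://localhost:3000';

const fetchMetadata = async (connection: Connection): Promise<Table[]> => {
    const res = await fetch(`${RESTBI_URL}/metadata`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(connection),
    });
    if (!res.ok) throw new Error(`Metadata request failed: ${res.status}`);
    return res.json();
};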

I then ask GPT to suggest a series of Models as a list of the PossibleModel type:

const ListPossibleModels = async (tables: Table[], userSuggestions: string | null = null) => {
    let prompt = `
    The following is a list of all tables in the database, and the columns within those tables in JSON format:
    ${
        tables.map((table) => {
            return JSON.stringify({
                table: table.name,
                columns: table.columns.map((column) => column.name)
            })
        }).join(', ')
    }
    A Model is a collection of tables and columns that can be used to create queries. Please analyze the tables
    and columns above and suggest what models could be created from them. This could be one or several models, based on the dataset.
    A model typically represents a set of business questions, and tables that can be properly joined together to answer those questions.
    type PossibleModel = {
        name: string,
        tables: string[],
        reason: string
    }
    Please provide your suggestions in JSON format as a list of PossibleModel. Do not provide any other words or text in the response.
    `;
    let response = await Chat(prompt);
    return JSON.parse(response);
}

I love TypeScript, and I am mostly using Next.js on this project. By providing ChatGPT with the types themselves, I can ensure its output matches my own structure without any fancy manipulation.
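The Chat helper used above isn't shown in this post; a minimal sketch using the OpenAI Node SDK could look like this (the model name is my own choice):

import OpenAI from 'openai';

// Sketch of the Chat helper used throughout this post (not shown in the original).
// The model name is an assumption.
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export const Chat = async (prompt: string): Promise<string> => {
    const completion = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [{ role: 'user', content: prompt }],
    });
    return completion.choices[0].message.content ?? '';
};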

How did it do?


As is the trend with GPT, at first I was really happy with my results! They looked spot on. It did a good job of breaking out tables by use case, explaining why, and selecting the most relevant tables for each model.

On further inspection I started to realize some flaws:

  • The models were typically pretty small and use-case focused. There were 155 tables across 15 schemas in my database, and most of the suggested models only used 4 or 5. I know I can safely join up to 20 of these into one.
  • It was only picking the most obviously named tables. My database is old, like back when people were proud of their efficiency in obfuscating table names, old. Any table like this was ignored. Junk in, junk out... not exactly a groundbreaking lesson, but yes, it can be a problem.

Next Steps: Refining the Suggestions

As an immediate improvement, I introduced a user feedback mechanism. A text input box in my UI allows the user to add some additional context to the process:

${ userSuggestions ? `Here are an additional set of instructions to consider: ${userSuggestions}` : ''} 

Now I could say “Please generate a model that is all encompassing of sales, product, categories, and customers”.
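Wired together, the call ends up looking something like this sketch, with tables being the metadata fetched earlier:

// Sketch: pass the user's free-text guidance into the suggestion prompt.
const suggestions: PossibleModel[] = await ListPossibleModels(
    tables,
    'Please generate a model that is all encompassing of sales, product, categories, and customers'
);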

A few ideas for the future:

  1. When the user submits feedback, include the last set of suggestions. This way the user can say things like “use the first model you generated, but include these 3 tables as well”.
  2. Start with the question. Maybe I am thinking about this wrong. Maybe I need to take the user’s question and work backwards into the model needed to answer it.

Criticisms aside, I’m happy enough with this to move onto model creation.


Step 2: Model Generation

Moving from Suggestion to Creation

This one took me a few iterations to get right, and I know there is still a lot more to be done. But if I had to grade the final result, I would give it a solid B. Not quite ready for production, but it definitely exceeded expectations and made me eager to keep going.

The Model object

First, let's talk about what a data model actually is. In the context of business intelligence, a data model is essentially a blueprint that defines how data is organized: what tables exist, how they relate to each other, and what columns are exposed within them.

Importantly, it defines how SQL should be generated. In tools like Looker, this is known as LookML, but the concept is similar across most BI platforms.

In our case, we are using RestBI, and the Model is represented in JSON. This is nice because the structure is simple and controlled, and ChatGPT tends to do a good job of generating JSON.

Given a Model and a Query, RestBI can generate a result set for us.
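Querying is the subject of the next post, but to give a sense of the flow: you hand RestBI a Model plus a Query describing the columns you want, and it returns rows. The shape below is only a rough illustration of that idea, not RestBI's actual Query type:

// Rough illustration only - not RestBI's actual Query type.
// The idea: reference columns that exist in the Model and let RestBI
// generate and run the SQL.
const exampleQuery = {
    modelId: 'sales-model',
    columns: ['Product Name', 'Total Sales'],
    filters: ["Order Date >= '2024-01-01'"],
    limit: 100,
};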

A case for sending Types over Examples

Everyone knows to build examples into their prompt, right? Well, in the past I have found that examples tend to be both helpful and confusing to GPT.

  • Helpful because it clearly outlines what format the result should take.
  • Confusing, because an example often seeds a thought in GPT that may not be relevant. If my example is a sales example but I am now querying supply chain data, GPT will tend to include more sales content than I want, because the example suggests it.

Since RestBI is JSON based and created in Typescript, there are thorough type definitions for what a model is. I decided to pass in a minified version of the type definitions instead of an example model.

const modelTypes = `export enum DatabaseType{POSTGRES='POSTGRES',MYSQL='MYSQL',ORACLE='ORACLE',SQL_SERVER='SQL_SERVER',SQLITE='SQLITE',SNOWFLAKE='SNOWFLAKE'}export enum ColumnDataType{STRING='STRING',NUMBER='NUMBER',DATE='DATE',BOOLEAN='BOOLEAN',JSON='JSON'}export enum ColumnType{MEASURE='MEASURE',DIMENSION='DIMENSION'}export type Connection={id:string,name:string,host:string,port:number,user:string,password:string,database:string,type:DatabaseType,schema?:string,warehouse?:string,role?:string}export type Table={id:string,dbName:string,name:string,schema?:string,alias?:string,validated?:boolean,columns:Column[]}export type Column={id:string,dbName:string,name:string,alias?:string,type?:ColumnType,validated?:boolean,dataType?:ColumnDataType,aggregationType?:string}export type JoinClause={column1:string,column2:string,operator:string}export type Join={id:string,validated?:boolean,table1:string,table2:string,clauses:JoinClause[],joinType?:string}export type Model={id:string,name:string,displayName?:string,connection:Connection,tables:Table[],joins:Join[],formulas:Formula[],filters:Filter[]}export type Formula={id:string,name:string,expression:string}export type Filter={id:string,name:string,expression:string}`;

First Attempt: One-Shot Generation

My first approach was straightforward: have GPT generate the entire model in one go. The prompt was similar to what I used for the suggestions, but with the expectation that it would output a fully formed model ready to use.

To narrow the scope down to what was relevant, I took the user’s suggested Model as a basis and only considered tables from that suggestion.

Here’s the initial approach:

export const CreateModel = async (tables: Table[], possibleModel: PossibleModel, userSuggestions: string) => {
    const relevantTables = tables.filter((table) => possibleModel.tables.includes(table.name) || possibleModel.tables.includes(table.dbName));
    let prompt = `
    The following is a list of all tables in the database, and the columns within those tables in JSON format:
    ${
        relevantTables.map((table) => {
            return JSON.stringify({
                table: table.name,
                schema: table.schema,
                columns: table.columns.map((column) => column.name)
            })
        }).join(', ')
    }
    Please create a valid Model based on the tables and columns above. The model should be named ${possibleModel.name}.
    ${ userSuggestions ? `Here are an additional set of instructions to consider: ${userSuggestions}` : ''}
    Here are the appropriate type definitions for creating the model:
   ${ modelTypes }
    Do not include a connection object in the model. Include tables, joins, formulas, and filters as needed. Include schema in tables if available. Be sure to limit columns to those that are necessary for the model.
    Please provide your suggestions in JSON format as a Model. Do not provide any other words or text in the response.
    `;
    let response = await Chat(prompt);
    return JSON.parse(response);
}  


Again, at first I was very encouraged. The first few models GPT generated were valid! Real models that could instantly be used! But as I explored more, a few problems started to appear:

  • Not every model was valid. Sometimes it would confuse column names, sometimes it would mix up schemas, sometimes it would forget joins.
  • Another problem was that the models tended to be incomplete. While my suggestion had 7 tables, the resulting model only had 2 or 3.

Second Attempt: Validation and Iteration

To address the validity shortcomings of the one-shot approach, I introduced a validation pass. After generating the model, I automatically validate it against the metadata to ensure everything is accurate and complete. If the model doesn’t pass validation, I send a nearly identical prompt back to GPT with the added “hey, here were the errors, please fix them.”

I implemented this with the RestBI /validate endpoint.

Simplified, here’s how that process looks:

client.validateModel(model).then((result: ValidationResult) => {
    // Collect any table/column errors flagged by the validation pass
    let errors = generateErrorList(result.model.tables);
    if (errors.length > 0) {
        setValidated(false);
        // Feed the errors back to GPT and ask it to repair the model
        fixModel(model, errors.join(', '));
    }
});
It was an easy adjustment, and it seems to work pretty well. Most models are properly validated after only one loop. I'll likely keep this going forward.
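The fixModel call above isn't shown in this post; here is a sketch of what that retry prompt might look like, reusing the same type definitions (the exact wording is my own):

// Sketch of the retry step: re-send the broken model along with the
// validation errors and ask GPT for a corrected Model.
export const fixModel = async (model: Model, errors: string) => {
    const prompt = `
    The following Model failed validation:
    ${JSON.stringify(model)}
    Here are the validation errors: ${errors}
    Here are the appropriate type definitions for creating the model:
    ${modelTypes}
    Please fix the errors and return the corrected Model in JSON format.
    Do not provide any other words or text in the response.
    `;
    const response = await Chat(prompt);
    return JSON.parse(response);
};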

Third Attempt: Table-by-Table Generation

I mentioned that one of the main struggles with the initial approach was that it generated incomplete models, with only a fraction of the tables I was interested in.

This time, I went through each of the tables listed in the suggested model. Instead of prompting GPT to create a Model, I prompted it to generate a single Table, allowing it to focus on picking good columns and generating aliases:

export const CreateTable = async (tables: Table[], possibleTable: string, userSuggestions: string) => {
    // Guard against tables that don't exist in the metadata
    const relevantTable = tables.find((table) => table.name === possibleTable || table.dbName === possibleTable);
    if (!relevantTable) {
        throw new Error(`Table ${possibleTable} was not found in the database metadata`);
    }
    let prompt = `
    The following is the definition of the table ${relevantTable.name}.
    ${JSON.stringify(relevantTable)}
    Please create a valid Table based on the tables and columns above. The table should be named ${possibleTable}.
    ${ userSuggestions ? `Here are an additional set of instructions to consider: ${userSuggestions}` : ''}
    Here are the appropriate type definitions for creating the table:
    export enum ColumnDataType{STRING='STRING',NUMBER='NUMBER',DATE='DATE',BOOLEAN='BOOLEAN',JSON='JSON'}export enum ColumnType{MEASURE='MEASURE',DIMENSION='DIMENSION'}export type Column={id:string,dbName:string,name:string,alias?:string,type?:ColumnType,validated?:boolean,dataType?:ColumnDataType,aggregationType?:string}export type Table={id:string,dbName:string,name:string,schema?:string,alias?:string,validated?:boolean,columns:Column[]}
    Include columns that most people would find useful, ensure their display name is human readable. Include schema if available.
    Please provide your suggestions in JSON format as a Table. Do not provide any other words or text in the response.
    `;
    let response = await Chat(prompt);
    return JSON.parse(response);
}

Once this was done, I added the resulting tables to a new Model manually:

for (let tableName of suggestedModel.tables) {
  let table = await CreateTable(metadata, tableName, suggestion);
  model.tables.push(table);
}
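For completeness, the model object being populated above starts as a bare skeleton along these lines; the id generation and connection handling are assumptions on my part (recall we told GPT not to generate a connection):

// Sketch: the skeleton the generated tables get pushed into.
let model: Model = {
    id: crypto.randomUUID(),
    name: suggestedModel.name,
    connection: selectedConnection, // attached by the app, never by GPT
    tables: [],
    joins: [],
    formulas: [],
    filters: [],
};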

I then similarly created a single prompt to generate all of the joins.
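That joins prompt isn't reproduced here, but it follows the same pattern as CreateTable; a sketch under that assumption:

// Sketch: generate all joins in a single prompt, mirroring the table prompt.
// The wording is my own guess at the pattern described above.
export const CreateJoins = async (modelTables: Table[], userSuggestions: string) => {
    const prompt = `
    The following tables are part of a Model:
    ${modelTables.map((t) => JSON.stringify({ table: t.name, schema: t.schema, columns: t.columns.map((c) => c.name) })).join(', ')}
    Please create the Joins needed to connect these tables.
    ${userSuggestions ? `Here are an additional set of instructions to consider: ${userSuggestions}` : ''}
    Here are the appropriate type definitions:
    export type JoinClause={column1:string,column2:string,operator:string}export type Join={id:string,validated?:boolean,table1:string,table2:string,clauses:JoinClause[],joinType?:string}
    Please provide your suggestions in JSON format as a list of Join. Do not provide any other words or text in the response.
    `;
    const response = await Chat(prompt);
    return JSON.parse(response);
};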

Possible next steps

  1. Expand the prompt to include other column attributes like default aggregations
  2. Include a pass for formulas, default filters, etc
  3. A simple mechanism to alter the model once created, e.g. add/remove columns or tables

How Did It Do?


The improvement in quality was immediately noticeable. I had all of the tables I wanted and a good variety of columns. And of course, it passed validation.

Obviously, there is still a lot more to do here. But it feels good enough to start testing the Query functionality.

Conclusion

It's a work in progress, but I'm pretty happy with the result for only a few hours of work.

In the next post, we'll provide an interface that allows true natural language queries and use ChatGPT to translate them into something our Model recognizes and can query.

Stay tuned, I'll publish the follow-up shortly! Please leave any thoughts you have in the comments; I'd love to hear from you.
