Update on the Text-to-SQL LLMs analysis. New models added: Claude-4, Grok-3, Llama-3
Introduction
In a previous publication, Benchmark of 7 LLMs for BI-Style SQL, I conducted a benchmark study evaluating the performance of several large language models (LLMs) in generating SQL queries for database questions. The focus was on three key metrics: accuracy, latency, and cost. The cost analysis was not exhaustive due to the limited number of models tested at that time, but it gave me a good starting point for understanding how different LLMs perform at generating SQL queries.
Now, I have extended the test with 10 additional models, including the latest versions of Claude, GPT-4, Grok-3, and the Llama models exposed in Microsoft Azure as Service Endpoints. The goal is to see how these new models compare to the previous ones in terms of their ability to generate accurate SQL queries, their speed in responding, and the cost-effectiveness of using them for SQL generation tasks. I am still missing the Gemini family, which will be tested in the future. I have discarded Anthropic's Opus and OpenAI's o3 and o4 models because they are too advanced (and expensive) for text-to-SQL tasks.
I updated the benchmark to 20 different BI-style SQL questions, providing the necessary schema context for each question. The models were tasked with generating SQL queries that would return the expected results, and their outputs were compared against ground-truth answers.
These are my own, vendor-agnostic tests. The results depend heavily on the dataset and on the queries your users would actually ask, so you can use them as a guideline, but you should not take them as a definitive benchmark. See the “queries used and measuring accuracy” section below for more details.
Finally, this is a work in progress. In the next iteration I will add a few things that I believe are missing now. The downside is that I will not be able to compare results across iterations; the upside is that the benchmark is incremental and will become more mature, solid, and useful over time.
TL;DR
- Prompt is king — include schema + question context.
- Accept ≥90% row accuracy; retry or tweak the prompt if not.
- Reject models with >5 s latency or pricey outliers, unless accuracy gaps justify it.
- My top 3 in this test:
- gpt-4.1-mini: 0.64 USD per 1000 queries, 2.77 seconds latency, and 90% accuracy.
- Llama-3-3-70B-Instruct: 5.25 USD per 1000 queries, 3.16 seconds latency, and 85% accuracy.
- Claude-sonnet-4-20250514: 5.97 USD per 1000 queries, 4.18 seconds latency, and 94% accuracy.
- Be prepared to pay premium token prices for excellent accuracy and low latencies.
- My tests run against Azure deployments hosted in us-east-2; the Claude models are accessed through Anthropic's hosted API.
The metrics
When you have more than one metric to measure, things get interesting: conclusions change depending on how you weight each one. For this benchmark, I have used these metrics:
- Execution success: Whether the model’s SQL parsed and ran without error. An ideal model always succeeds (100%); lower success means syntax or semantic issues. Mini models tend to fail more frequently, but do not assume the syntax is wrong: they are a bit special, and sometimes you need to parse or remove something from the output (see the small parsing sketch after this list). The query itself is still valid, but you need to pay attention to the output.
- Column-level accuracy: How many columns of data the model’s query returned correctly. We compute percent_columns_equality by comparing the LLM’s result set to the expected set. I have not used it in this benchmark for the cases where data accuracy was wrong, because I have not figured out yet how to compare the columns’ ordinal positions and names. I will probably update this in the future.
- Row-level accuracy: How many rows of data the model’s query returned correctly. I compute percent_data_rows_equality by comparing the LLM’s result set to the expected set. In short, 100% means the exact same rows (ORDER BY included) as the ground truth.
- Latency: I measure both LLM time (the API latency to generate the SQL) and SQL execution time (time to run the query against the database). In practice, the total user-perceived latency is LLM time + SQL time. I have used only LLM time for this publication. Some models were deployed in us-east-2 and others in Europe (Anthropic); as I am based in Europe, network latency to Anthropic should be lower than to the models deployed in Azure. I have not added network latencies to my calculations.
- Token cost: Each model’s pricing per 1K tokens varies. Average input was 850 tokens and average output was 190 tokens, so token usage was not high. I have not used any caching mechanism that could reduce the number of tokens used; although interesting to dig into, I do not see it as important for this particular use case. I also did not apply the 30% discount Anthropic offers for joining the Development Partner Program, which basically shares your data for improving their models.
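As a side note on the output-parsing point under execution success, this is a minimal sketch of the kind of cleanup some models need before their SQL can be executed; the extract_sql helper and the example query are my own illustration, not part of the benchmark code.

import re

def extract_sql(llm_output: str) -> str:
    # Some models wrap the query in markdown fences or add a short preamble;
    # keep only the SQL itself.
    match = re.search(r"```(?:sql)?\s*(.*?)```", llm_output, re.DOTALL | re.IGNORECASE)
    return (match.group(1) if match else llm_output).strip()

print(extract_sql("```sql\nSELECT region, SUM(revenue) FROM sales GROUP BY region;\n```"))
# SELECT region, SUM(revenue) FROM sales GROUP BY region;

And this is roughly how the cost per 1,000 queries can be derived from the average token counts above. The per-1K prices are illustrative placeholders, not an official price sheet.

# Illustrative per-1K-token prices in USD; replace them with each vendor's real pricing.
price_in_per_1k = 0.0004
price_out_per_1k = 0.0016

avg_input_tokens = 850
avg_output_tokens = 190

cost_per_query = (
    (avg_input_tokens / 1000) * price_in_per_1k
    + (avg_output_tokens / 1000) * price_out_per_1k
)
print(round(cost_per_query * 1000, 2))  # ~0.64 USD per 1,000 queries with these example prices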
The analysis framework
- PREPARE The process begins with the PREPARE phase, which establishes the foundation for model evaluation. This step includes preparing the dataset and schema, followed by defining the queries and their expected results. Once the environment is set, a selection of 15–20 candidate models is made. This preparation ensures that all subsequent evaluations are conducted under consistent and reproducible conditions.
- TEST — 1 In the TEST — 1 phase, an initial round of model testing is carried out. Each selected model undergoes 2–3 test iterations. The results of these tests are then analyzed to determine the performance of each model. A performance threshold of 90% is used as a benchmark. Models that meet or exceed this threshold proceed to the conclusion phase, while those that fall short move on to further evaluation.
- TEST — 2 Models that do not meet the 90% performance threshold enter the TEST — 2 phase. Here, underperforming models are diagnosed to identify specific issues. Based on this diagnosis, tests are tuned accordingly and re-run. In this round, both performance and query cost are evaluated. If a model still performs below expectations or is too costly, it is considered for rejection.
- CONCLUSIONS In the CONCLUSIONS phase, models that have passed the evaluation process are documented and integrated into the application. This includes detailed records of the testing and tuning phases. Models that failed to meet the criteria are discarded. Finally, all results — including the rationale for model selection or rejection — are shared with stakeholders to ensure transparency and promote informed decision-making.
Accuracy, Latency, and Costs — Model Selection Workflow
This process describes a multi-criteria, multi-step selection workflow to identify the top three LLMs based on their performance across three dimensions: accuracy, latency, and cost per query.
Step 1: Analyze Executed Queries
Start by analyzing the executed queries. The models’ results are sorted by accuracy to understand which ones deliver the most reliable outputs. In my case, this step has several sub-steps: first, check that the query executes; then check that the columns match the master data; and then check that the values match. This process is done automatically with a Python script that compares the dataframes. The process is far from perfect, but it works nicely.
import pandas as pd

def compare_results(baseline_df: pd.DataFrame, llm_df: pd.DataFrame) -> dict:
    # Convert to lists of lists (ignore column names/order)
    baseline_rows = [list(row) for row in baseline_df.to_numpy()]
    llm_rows = [list(row) for row in llm_df.to_numpy()]
    # Convert to sets of tuples for easier comparison
    baseline_set = set(map(tuple, baseline_rows))
    llm_set = set(map(tuple, llm_rows))
    intersection_size = len(baseline_set.intersection(llm_set))
    # Calculate percentages (1.0 means a perfect match with the ground truth)
    percent_data_rows_equality = round(intersection_size / len(baseline_set), 2)
    percent_rows_equality = round(len(llm_rows) / len(baseline_rows), 2) if len(llm_rows) <= len(baseline_rows) else 0
    percent_columns_equality = round(len(llm_df.columns) / len(baseline_df.columns), 2) if len(llm_df.columns) <= len(baseline_df.columns) else 0
    return {
        "percent_data_rows_equality": percent_data_rows_equality,
        "percent_rows_equality": percent_rows_equality,
        "percent_columns_equality": percent_columns_equality,
    }
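For example, with two tiny made-up result sets (not taken from the real benchmark data), the comparison behaves like this:

baseline_df = pd.DataFrame({"region": ["EMEA", "AMER"], "revenue": [100, 200]})
llm_df = pd.DataFrame({"region": ["EMEA", "AMER"], "revenue": [100, 250]})
print(compare_results(baseline_df, llm_df))
# {'percent_data_rows_equality': 0.5, 'percent_rows_equality': 1.0, 'percent_columns_equality': 1.0}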
Step 2: Filter Models by Accuracy
From the accuracy-ranked list, select only those models that achieve 90% or higher accuracy; these are considered efficient in terms of output quality. My tests are built on 20 queries; however, I had to discard a few queries from the analysis because they failed in all the models. I will refine this further, but the current results already reveal clear trends. The combined sketch under Step 5 shows this filter in code.
Step 3: Filter by Latency
Evaluate the response times of the high-accuracy models. Retain only the models with a latency of less than 5 seconds, as speed is important for user experience. This number is arbitrary, but it is a good starting point. You can adjust it based on your needs and the models you are testing. Sub 3 seconds would be ideal.
At this point, I am not paying attention to network latency, as I am running the tests from Europe and some models are deployed in us-east-2. I will probably add this in the future, but for now it is not a priority. I have investigated a little, and I have read that latencies from central Europe to Anthropic are around 30–50 ms, and to Azure in us-east-2 around 100–150 ms. These numbers might be huge for mission-critical applications, but for BI-style SQL queries I assume they are acceptable. In later iterations, I will try to be more precise with the latencies.
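For reference, this is roughly how the two latency components can be captured. The helper below is a generic sketch, and the lambdas are stand-ins for the real LLM and database calls, not the actual benchmark code.

import time

def timed(fn):
    # Run any callable and return (result, elapsed seconds).
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Stand-ins for the real calls; only llm_seconds is used in this publication.
sql, llm_seconds = timed(lambda: "SELECT region, SUM(revenue) FROM sales GROUP BY region")
rows, sql_seconds = timed(lambda: [("EMEA", 100), ("AMER", 200)])
total_perceived_latency = llm_seconds + sql_seconds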
Step 4: Sort by Cost per Query
Next, assess the cost per query of the remaining models and sort them from lowest to highest to identify the most economical options. The cheaper the model, the better. My experience is that once you reach 90% accuracy, each extra point of accuracy gets expensive. This sort is also included in the combined sketch under Step 5.
Step 5: Match Across All Metrics
Match the top-performing models across the three criteria:
- Accuracy (≥90%)
- Latency (<5s)
- Cost (low)
These values will vary depending on the use case. I have selected them based on my tests, not on a rule of thumb.
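Putting Steps 2 to 5 together, here is the combined sketch of the filtering and sorting over a results table. The three finalists reuse the numbers quoted in this post, while "some-slow-model" and the column names are made up for illustration.

import pandas as pd

results = pd.DataFrame({
    "model": ["gpt-4.1-mini", "Llama-3-3-70B-Instruct", "Claude-sonnet-4-20250514", "some-slow-model"],
    "accuracy": [0.90, 0.85, 0.94, 0.92],
    "latency_s": [2.77, 3.16, 4.18, 7.40],
    "usd_per_1k_queries": [0.64, 5.25, 5.97, 3.10],
})

# Steps 2 and 3: keep only the models that are accurate and fast enough
# (the results section relaxes the 90% accuracy target to 85%).
shortlist = results[(results["accuracy"] >= 0.85) & (results["latency_s"] < 5)]

# Step 4: cheapest first; Step 5: whatever survives all filters is the candidate set.
print(shortlist.sort_values("usd_per_1k_queries"))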
Step 6: Select the Best
Using a weighted scoring system, select the top 3 models that offer the best overall balance between accuracy, latency, and cost. These are the optimal models for production use or further evaluation.
This involves identifying overlapping models that perform well in all dimensions. It is manual work, gut feeling, and experience. I think it is great to select three models and then sell them to the business in different flavours such as general, medium, and premium. The goal is to have a model for each use case, rather than one model that fits all. This is the best way to get the most out of GenAI.
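If you prefer something more repeatable than gut feeling, a weighted score is one option. This is a sketch only: the weights, the 5-second and 10-USD reference points, and the rounding are my own illustrative choices, not a rule from the benchmark.

def weighted_score(accuracy: float, latency_s: float, usd_per_1k: float,
                   weights=(0.5, 0.3, 0.2)) -> float:
    # Convert latency and cost into "higher is better" scores before weighting.
    latency_score = max(0.0, 1 - latency_s / 5.0)    # 5 s = the latency budget used here
    cost_score = max(0.0, 1 - usd_per_1k / 10.0)     # 10 USD per 1K queries as an arbitrary ceiling
    w_acc, w_lat, w_cost = weights
    return round(w_acc * accuracy + w_lat * latency_score + w_cost * cost_score, 3)

print(weighted_score(0.90, 2.77, 0.64))   # gpt-4.1-mini             -> ~0.77
print(weighted_score(0.85, 3.16, 5.25))   # Llama-3-3-70B-Instruct   -> ~0.63
print(weighted_score(0.94, 4.18, 5.97))   # Claude-sonnet-4-20250514 -> ~0.60

With these particular weights gpt-4.1-mini comes out on top, but different weights will reorder the list.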
The results
Row-level accuracy
In this benchmark, I consider appropriate the models that perform >=85%. I have not dug deeper into the failing ones yet; in the next iteration I will investigate further to see how to improve their performance.
LLM Generation time
As said earlier, I consider good the models that respond in <=5 seconds. I have given Llama-4-Maverick a pass because of the latencies from my hometown to the Azure data centers in us-east-2.
Cost per Query
This is where my manual work goes (I will try to automate it in the next iteration, though).
To the right of the left-hand table, you can see two columns highlighted in red:
- Q for Accuracy (quality): The models that have >= 85% accuracy.
- L for Latency: The models that have <= 5 seconds latency.
My manual decision is to select, bottom-up, the 3 models that have both L and Q, which are:
- gpt-4.1-mini: 0.64 USD per 1000 queries, 2.77 seconds latency, and 90% accuracy.
- Llama-3-3-70B-Instruct: 5.25 USD per 1000 queries, 3.16 seconds latency, and 85% accuracy.
- Claude-sonnet-4-20250514: 5.97 USD per 1000 queries, 4.18 seconds latency, and 94% accuracy.
Conclusions and next steps
In this benchmark, I have evaluated 20 different LLMs for BI-style SQL generation, focusing on accuracy, latency, and cost. The results indicate that the models can achieve high accuracy rates, with many performing above the 85% threshold. Latency is generally acceptable, with most models responding within 5 seconds. Cost per query varies significantly, with some models being more economical than others.
In future iterations, I plan to:
- Add more models, including the Gemini family.
- Refine the dataset and queries to improve accuracy.
- Automate the analysis process to streamline the evaluation of model performance.
- Investigate further into the models that underperformed to identify potential improvements.
- Explore the impact of network latency on model performance, especially for models hosted in different regions.
- Explore the potential of Langfuse to make decision-making more efficient.