
Speed, Accuracy, Cost: A Practical Benchmark of 7 LLMs for BI-Style SQL

10 min read · May 20, 2025

Introduction

Text-to-SQL is a popular use case in GenAI. In a previous post, How to Safely Use LLMs for Text-to-SQL with Stored Procedures, I broke down the risks and detailed how to approach them. A follow-up article, Building Modern Text-to-SQL Systems with GenAI: Lessons from LinkedIn, Uber, and Open Source, surveyed some interesting architectures.

In this third publication I benchmarked a few LLMs on 15 different queries against the TPC-H database, which is commonly used for analytical benchmarking. You can build the database using this script. These are the specifications, and you can download a DuckDB backup from here.

I ran ten batches of fifteen queries each and evaluated the following models. For every run I captured:

  • Execution success — did the SQL parse and run?
  • Row-level accuracy — did the result set match the expert baseline?
  • Latency — how long did generation and execution take?
  • Cost — token spend per successful query (logged with Langfuse).

All experiments were logged end-to-end in Langfuse, and custom logs in .csv and .yaml, giving me a single pane of glass to monitor prompt, response, SQL execution, and cost.

TL;DR

  • GPT-4.1 and GPT-4.1-mini top the chart: >97% execution success and 83–87% row-level accuracy, with mean generation times of approximately 6 and 3 seconds respectively.
  • GPT-4o-mini is the sprinter, at approximately 2 seconds per query and 100% execution success, but its accuracy plateaus at 73%.
  • DeepSeek-V3-0324 and GPT-4.1-nano form a solid middle tier (93–95% success, 70–78% accuracy).
  • GPT-4o (full) and Phi-4 stumble: too many syntax or planning errors to ship without rigorous automatic validation.
  • Note: everything was logged end-to-end in Langfuse to monitor prompts, responses, SQL execution, and cost across 10 runs × 15 BI-style questions (star-schema joins, aggregates, CTEs, PIVOT, ordering).

The Process

The process looks like this:

The pipeline iterates over each model and each question, generates the SQL query, executes it, and logs the results.
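To make that concrete, here is a minimal sketch of the loop in Python. The helper functions generate_sql and run_sql, the file name questions.yaml, and the model identifiers are placeholders for illustration, not my exact code:

import time
import yaml

# Minimal sketch of the benchmark loop (placeholder helpers, not the exact code I used).
MODELS = ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano",
          "gpt-4o", "gpt-4o-mini", "phi-4", "deepseek-v3"]
N_RUNS = 10

def generate_sql(model: str, question: str) -> tuple[str, float]:
    """Placeholder: call the LLM with the system prompt (see below) and time the call."""
    start = time.perf_counter()
    sql = "SELECT 1;"  # replace with the model's answer
    return sql, time.perf_counter() - start

def run_sql(sql: str) -> tuple[list, float]:
    """Placeholder: execute the query on SQL Server (e.g. via pyodbc) and time it."""
    start = time.perf_counter()
    rows: list = []  # replace with cursor.fetchall()
    return rows, time.perf_counter() - start

with open("questions.yaml", encoding="utf-8") as f:
    questions = yaml.safe_load(f)["questions"]

results = []
for run in range(1, N_RUNS + 1):
    for model in MODELS:
        for q in questions:
            sql, duration_llm = generate_sql(model, q["user_question"])
            rows, duration_sql = run_sql(sql)
            results.append({"run": run, "model": model,
                            "question_number": q["question_number"],
                            "llm_sql_query": sql, "rows": len(rows),
                            "duration_llm": duration_llm,
                            "duration_sql": duration_sql})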

The Models

Due to resource constraints, I tested only the following models:

  • GPT-4.1: The latest iteration of GPT-4, offering the highest capability and a massive context window (up to 1 million tokens).
  • GPT-4.1-mini: A distilled, smaller version of GPT-4.1, designed to balance speed and intelligence.
  • GPT-4.1-nano: An even smaller, ultra-fast variant of GPT-4.1. This model prioritizes low latency and cost, at the price of a noticeably reduced ability to handle complex queries.
  • GPT-4o and GPT-4o-mini: The previous multimodal ("omni") generation of GPT-4, plus its smaller, cheaper sibling.
  • Phi-4: Phi-4 is a 14B parameter model from Microsoft’s Phi series focusing on complex reasoning in a small package.
  • DeepSeek-V3: A massive Mixture-of-Experts LLM (671B parameters total, with 37B active per token) introduced in late 2024. It is a strong general-purpose model, particularly good at coding and reasoning, so it should handle structured query generation well.

These models were excluded because I could not get them to produce Transact-SQL code:

  • Phi-4-mini-instruct, Phi-3-small-128k-instruct, and Phi-3-medium-128k-instruct did not produce SQL code: lots of comments, and no SQL.
  • I also left out the o3 and o4 models because they are aimed at other types of tasks.

System Prompt

system_message = "RETURN Transact-SQL query sentence.\nThis is SQL Server.\n"

if tables_schema != "":
    system_message += f"[ database schema -- {tables_schema} -- database schema]\n"
if sql_expert_rules != "":
    system_message += f"[ expert rules -- {sql_expert_rules} -- expert rules]\n"

system_message += (
    "If you do not have enough information to create the Transact-SQL sentence, just say: \"I don't know\" \n"
    "Return only the Transact-SQL sentence that will be executed, no comments. \n"
    "Do not return the code as markdown, or comments. \n"
    "Use snake_case format for aliases, and do not use special characters or accents. \n"
    "Do not make comments. What you retrieve will be executed in the database. \n"
    "Provide a valid Transact-SQL sentence. Do not quote the sentence with things like ```sql, ```code, or similar. \n"
)

user_prompt = f"""
{user_prompt}
"""

Expert Rules

These were the rules I used in order to guide the Transact-SQL generation:

# 15 T-SQL Rules & Patterns to use for good Transact-SQL code creation
1. Use `OVER` clauses for running totals instead of self-joins.
2. Use `CROSS APPLY` to unpivot inline calculations.
3. Avoid using `SELECT *`; be explicit with columns.
4. Use `CTEs` for recursive logic and query modularization.
5. Use `PIVOT` and `UNPIVOT` for flexible row-column transformations.
6. Use `PARTITION BY` with `OVER` for grouped window functions.
7. Prefer `NOT EXISTS` over `NOT IN` for NULL-safe anti-joins.
8. Avoid unnecessary DISTINCT — verify the root cause of duplication.
9. Prefer `INNER JOIN` over `WHERE` for join conditions.
10. Use `EXISTS` for checking presence efficiently.
11. Combine conditions with `CASE` for conditional aggregations.
12. Use `OFFSET-FETCH` for efficient pagination.
13. Use table aliases for readability, especially in joins.
14. Filter using `HAVING` only for aggregated conditions.
15. Replace date functions (YEAR, MONTH, DAY) in filters with BETWEEN start_date AND end_date when possible (e.g., o_orderdate BETWEEN '1996-01-01' AND '1996-12-31' instead of YEAR(o_orderdate) = 1996), so the predicate stays sargable.

Note: I don't see much value in asking the model to pretend to be an expert, because there is no control over what "an expert" means. I prefer explicit rules that I know are good practices and that I can control.

The Questions

All the questions were written with a BI analyst in mind. I used the power of LLMs to build them. My tip: go to ChatGPT, share the database schema, and ask it to generate questions while acting as a BI analyst. Then ask ChatGPT to build the Transact-SQL queries for the questions it has just produced. This is a query example stored in a .yaml configuration file:

questions:
  - question_number: 01
    user_question: |
      Which customers from the 'BUILDING' market segment placed the highest number of orders in 1996, and what was the total value of those orders?
      Include only those customers with more than 10 orders.
      List the customers in descending order of total value.
      Include the customer name, number of orders, and total value of orders.
    sql_query: |
      SELECT c.c_name, COUNT(o.o_orderkey) AS num_orders, SUM(o.o_totalprice) AS total_value
      FROM customer c
      JOIN orders o ON c.c_custkey = o.o_custkey
      WHERE c.c_mktsegment = 'BUILDING' AND YEAR(o.o_orderdate) = 1996
      GROUP BY c.c_name
      HAVING COUNT(o.o_orderkey) > 10
      ORDER BY total_value DESC;
    llm_sql_query: ""
    tables_used:
      - "customer"
      - "orders"

As I am very familiar with Transact-SQL, I reviewed the generated code after building the answers; if you are specific with the questions, the quality is good.

For a later project, I will check that the data rows match between the expert query and the model query, which is why I specifically asked for ordered results and snake_case aliases, to make those comparisons possible.
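
A minimal sketch of that comparison, assuming both result sets are fetched as lists of dicts and compared with pandas (illustrative, not the final implementation):

import pandas as pd

def results_match(expert_rows: list[dict], llm_rows: list[dict]) -> bool:
    """Compare two result sets row by row.

    Assumes both queries were asked to ORDER BY the same columns and to use
    snake_case aliases, so columns can be aligned by name.
    """
    expert_df = pd.DataFrame(expert_rows)
    llm_df = pd.DataFrame(llm_rows)
    if set(expert_df.columns) != set(llm_df.columns):
        return False
    # Align column order, then compare values positionally (order matters
    # because both queries were required to sort their output). Numeric
    # columns may need rounding before the comparison.
    llm_df = llm_df[expert_df.columns]
    return expert_df.reset_index(drop=True).equals(llm_df.reset_index(drop=True))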

Logging to Langfuse as Observability Framework

I encourage you to test it. It is super easy to add the logging (called observations) to your code. Just decorate your methods, and boom! ready!

You need to think a little about where to add the observations (payloads can get large), and once you have gathered some data you will clearly see where to add labels, tags, and other metadata to your observations.

I don't think a full explanation of the product is necessary: just go to Langfuse and check it out. It is free for open-source projects. If you have questions, there is an Ask AI button you can use for envisioning, architecture, implementation, and any other Q&A about the project. It is connected to the GitHub repo, and you can even ask it to write code for you. So, all fine!
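
As an example, this is roughly what the decorator-based logging looks like with the Langfuse Python SDK (v2 decorators module). The function body is a placeholder; only the @observe decorator and the metadata call are the Langfuse part:

from langfuse.decorators import observe, langfuse_context

@observe()  # creates an observation (span) for every call of this function
def generate_sql(model: str, user_question: str) -> str:
    # Placeholder body: build the prompt and call the model here.
    sql = "SELECT 1;"
    # Optional metadata that later shows up in the Langfuse UI.
    langfuse_context.update_current_observation(
        metadata={"model": model, "question": user_question}
    )
    return sql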

This is a screenshot of the costs of the models I used:

Lots of detail on the models' and methods' performance:

Full details of the model performance, costs, tags, input, and output parameters:

Full details of the methods and arguments used:

And probably what I like most is the timeline view of your executions:

There is still room for adding new things, like scoring, but for my purpose, it is great.

Logging customization

These are the custom logs I used to record the models' results. I used YAML format and added the model name, the question, the generated SQL code, and the execution times.

  • in cyan: one file for each model.
  • in red:
    • the user_question that the model will translate,
    • the sql_query, which is the reference query built by the SQL expert,
    • the llm_sql_query, which is the query the LLM generates.
  • in yellow, the metric values: rows, the number of rows returned by SQL Server; duration_sql, the time SQL Server took to execute the query; and duration_llm, the time the LLM spent building the Transact-SQL query.
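
A sketch of how one of these records could be written, assuming PyYAML and an illustrative per-model file name (the field names match the ones above; the values are dummy examples):

import yaml

record = {
    "question_number": 1,
    "user_question": "Which customers from the 'BUILDING' market segment ...",
    "sql_query": "SELECT c.c_name, ... ORDER BY total_value DESC;",  # expert baseline
    "llm_sql_query": "SELECT ...",                                   # model output
    "rows": 7,              # example value
    "duration_sql": 0.42,   # example: seconds spent by SQL Server
    "duration_llm": 3.10,   # example: seconds spent generating the query
}

# One YAML log file per model, e.g. logs/gpt-4.1-mini.yaml (naming is illustrative).
# Each dump appends one more list item to the file.
with open("logs/gpt-4.1-mini.yaml", "a", encoding="utf-8") as f:
    yaml.safe_dump([record], f, sort_keys=False, allow_unicode=True)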

The Results: Aggregate Performance Metrics

The results of the benchmark are summarized in Table 1. The models were evaluated on their ability to generate SQL queries whose results matched the expert baseline, as well as on execution success rate, average SQL generation time, and average execution time.

For completeness, Table 1 also includes the average number of rows returned by each model's queries (when they executed). This metric indicates whether a model tended to overshoot or undershoot the result set compared to the baseline, which averaged ~7.1 rows per query in this benchmark.

Table 1: Overall comparison of LLM models on 15 Text-to-SQL questions. Generation Time is how long the model took to produce the SQL. Execution Time is the database runtime of the query (for successfully executed queries only). Exec. Success is the percentage of queries that ran without error (out of total attempts). Accuracy is the percentage of attempts that returned the correct result (matching the expert's answer). Avg Rows is the average number of rows returned by the model's queries (baseline correct results average ~7.1 rows).

SQL Output Quality and Style

While all models can output syntactically correct SQL (most of the time), the quality and style of their SQL can differ, which has implications for readability, maintainability, and even performance. Here are some qualitative observations from the queries generated by each model:

  • GPT-4.1: High-quality SQL, often expert-level. Confident use of advanced features like WITH, PIVOT, and window functions. Clean, efficient, and readable.
  • GPT-4.1-mini/nano: Simplified SQL — basic joins and aggregations. Easy to read but sometimes omits key steps (e.g., missing LIMIT or ORDER BY). Rarely uses advanced constructs.
  • GPT-4o (full & mini): Generally correct but more verbose. Sometimes uses subqueries where a HAVING clause would suffice. Style feels older or brute-force, though usually accurate.
  • Phi-4: Heavy use of CTEs for step-by-step logic. Often over-decomposes, sometimes repeats work. Understandable but inefficient and error-prone.
  • DeepSeek-V3: Practical and reliable SQL using standard syntax. Avoids database-specific features. Produces correct, portable results, but queries can be verbose or less elegant.

Cost Analysis (Token Usage and API Pricing)

Beyond performance, another crucial factor in model selection is cost efficiency. Each model has different token consumption characteristics and pricing under the Azure OpenAI (or OpenAI API) pricing model as of May 2025. We estimated the costs for each model by looking at how many tokens are used per query (prompt + generated SQL) and applying the latest pricing per 1,000 tokens for each model. Table 2 summarizes the relative costs.
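
The underlying arithmetic per query is simple. Here is an illustrative sketch; treat the per-1K-token prices as placeholders and plug in the current rates for your deployment:

# Illustrative cost-per-query calculation; prices are placeholders, not guaranteed rates.
PRICE_PER_1K = {
    # (input_price_usd, output_price_usd) per 1,000 tokens -- replace with current pricing
    "gpt-4.1-mini": (0.0004, 0.0016),
}

def query_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    price_in, price_out = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * price_in + (completion_tokens / 1000) * price_out

# Example: a 1,800-token prompt (schema + rules) and a 120-token SQL answer.
print(query_cost("gpt-4.1-mini", prompt_tokens=1800, completion_tokens=120))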

Note: a detailed cost analysis will follow in the next publication.

Conclusion

Several conclusions can be built depending on the project priorities: cost, accuracy, speed, or a combination of these. The choice of model should be based on the specific use case and the trade-offs that are acceptable for your application.

  • For a mission-critical business intelligence assistant where answers must be correct and complex questions are common, GPT-4.1 (or GPT-4.1-mini) is recommended, possibly augmented with DeepSeek-V3 for a second opinion on tough queries. The extra cost is justified by the accuracy and reliability.
  • For a high-traffic query service (say, powering natural language SQL for thousands of users) where cost and speed matter, GPT-4.1-mini or nano would drastically reduce latency and expenses, and you’d accept the occasional error or handle it in post-processing. Phi-4 could even be considered if you have robust error-handling, as it’s very cheap, but its lower accuracy means more oversight is needed.
  • If you have special constraints (data privacy, need to run on-premises, etc.), an open model like DeepSeek-V3 (or a smaller fine-tuned open model) might be the way to go despite the infrastructure cost, because it gives you full control. As our results showed, DeepSeek-V3 did not disappoint in performance — it’s at the level of the best closed models in many respects.
  • The analysis needs to be completed in future work with other models such as Mistral-7B/8x7B, Mixtral-8x22B, and Meta Llama-3, which may offer competitive performance at lower cost.
  • Models advance so fast that this analysis holds today, but as new models are released the landscape, the performance numbers, and the right decision will change.


Written by Eladio Rincón Herrera

Data projects: Engineer, DBA, Architect, GenAI, Mentor, Manager, Support, Trainer, Speaker, former SQL MVP
