I Tested 70 AI Models on DAX – Here's What Actually Works
I spent the past few months building DAXBench—a systematic evaluation of 70 different AI models on their ability to write DAX formulas for Power BI.
The goal was simple: figure out which AI assistants actually work for DAX, and which ones waste your time with code that looks right but won't compile.
The Core Finding
Here's what the data shows:
- 360 syntax errors (formulas that won't run at all)
- 63 semantic errors (formulas that run but return the wrong result)
That's nearly a 6:1 ratio. Large language models understand your business logic. They know what to calculate. They fail on how to write it in DAX.
DAX is rigid. Drop the comma that stands in for an optional parameter, miss a bracket, or use the wrong reference notation, and it won't run. And that's exactly where AI consistently breaks down.
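Here's a minimal sketch of the last of those, the reference-notation trap, using a hypothetical Sales[Amount] column rather than anything from the benchmark:

```
-- SQL-style dot notation looks plausible but is not valid DAX; this won't parse
-- Total Sales = SUM(Sales.Amount)

-- DAX only accepts bracketed column references: Table[Column]
Total Sales = SUM(Sales[Amount])
```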
Provider Patterns
When you test dozens of models from the same provider, patterns emerge.
OpenAI acts like an overconfident senior developer. The reasoning models (o1, o3) overthink everything—adding fiscal year parameters nobody asked for, defining unnecessary variables. The simpler coding models (GPT-5.2, Codex) actually perform better on DAX tasks.
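To illustrate the over-engineering pattern (a made-up example with hypothetical Sales and Date tables, not an actual o1/o3 response):

```
-- What the task asks for: a plain year-to-date total
Sales YTD = TOTALYTD(SUM(Sales[Amount]), 'Date'[Date])

-- The over-thought version: a fiscal year-end nobody requested, plus a pass-through variable
Sales YTD Overbuilt =
VAR Result =
    TOTALYTD(SUM(Sales[Amount]), 'Date'[Date], "6/30")
RETURN
    Result
```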
DeepSeek took the biggest leap between generations. V3 models were confused about which language to use—they'd write SQL syntax inside DAX formulas. V3.2 fixed this completely and is now the best open-weights option you can run locally.
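Sketched with hypothetical Sales and Customer tables (not actual V3 output), the language confusion looks like this:

```
-- The V3-era failure mode: SQL leaking into a measure definition (not valid DAX)
-- Top 10 Sales = SELECT TOP 10 SUM(Amount) FROM Sales GROUP BY CustomerName

-- The same intent written in DAX
Top 10 Customer Sales =
CALCULATE(
    SUM(Sales[Amount]),
    TOPN(10, VALUES(Customer[CustomerName]), CALCULATE(SUM(Sales[Amount])), DESC)
)
```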
Google went from server crashes to 80% accuracy. Early Gemini models wrote DAX that was logically correct but computationally catastrophic—row-by-row iteration over million-row fact tables. Gemini 3 learned to use CALCULATE and KEEPFILTERS properly.
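Here's roughly what that difference looks like, assuming a hypothetical Sales fact table related to a Product dimension:

```
-- The catastrophic pattern: iterate the fact table row by row and test each row
Red Sales Slow =
SUMX(
    FILTER(Sales, RELATED(Product[Color]) = "Red"),
    Sales[Amount]
)

-- The idiomatic pattern: let CALCULATE and KEEPFILTERS adjust the filter context instead
Red Sales Fast =
CALCULATE(
    SUM(Sales[Amount]),
    KEEPFILTERS(Product[Color] = "Red")
)
```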
Anthropic had issues with variable scoping in earlier Claude models. Opus 4.5 now produces the cleanest, most consistent DAX output in the benchmark.
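The scoping trap, again with placeholder table names rather than real benchmark responses, looks like this:

```
-- Scoping trap: the variable is evaluated where it's declared,
-- so the CALCULATE filter below has no effect on it
Red Sales Wrong =
VAR TotalAmount = SUM(Sales[Amount])
RETURN
    CALCULATE(TotalAmount, Product[Color] = "Red")

-- Correct: keep the expression inside CALCULATE so the filter applies
Red Sales Right =
CALCULATE(SUM(Sales[Amount]), Product[Color] = "Red")
```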
The 40% Problem
One function caused 40% of all failures: RANKX.
Not because the math was wrong. Because models drop the single placeholder comma that skips an optional argument. The logic is perfect. It just won't compile.
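Concretely, the broken and fixed versions differ by one comma (the [Total Sales] measure and Product table here are placeholders):

```
-- Typical AI output: DESC lands in the optional <value> slot because the blank placeholder is missing
-- RANKX(ALL(Product), [Total Sales], DESC)        -- won't compile

-- Correct: the extra comma skips the optional <value> argument so DESC reaches <order>
Sales Rank = RANKX(ALL(Product), [Total Sales], , DESC)
```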
This is the pattern across the board—AI gets the intent right but trips on DAX-specific grammar that doesn't exist in SQL or Python.
What Should You Use?
Based on the benchmark data:
- Best accuracy: GPT-5.2 or Codex (~80%)
- Cleanest code: Claude Opus 4.5
- Best open-weights: DeepSeek V3.2
- Avoid: Anything under 70B parameters, and reasoning models for simple DAX tasks
Full Results
The complete benchmark data and methodology are available at daxbench.com.
In the next post, I'll cover the 5 specific error patterns that account for 80% of AI-generated DAX mistakes, and show you exactly how to spot and fix them.