The operating question.
NL-to-SQL evals for finance.
NL-to-SQL systems fail differently in finance because the answer has to respect metric definitions, tenant scope, semantic layers, row-level permissions, and reporting context.
Evaluation surfaces.
- Dataset quality: whether the tables, joins, filters, and metric definitions are sufficient.
- Answer correctness: whether the final answer matches the question and source data.
- Materiality: whether an error would change a decision, report, or control.
Audit-grade output.
Keep the natural-language question, generated SQL, source records, expected answer, reviewer decision, and exception reason together.
Guidethe question, evidence, artifact, and action to sequence
Evidencethe source-linked facts needed for a defensible read
Next movehow the guidance connects back to the AI Audit