The operating question.

NL-to-SQL evals for finance.

NL-to-SQL systems fail differently in finance because the answer has to respect metric definitions, tenant scope, semantic layers, row-level permissions, and reporting context.

Evaluation surfaces.

  • Dataset quality: whether the tables, joins, filters, and metric definitions are sufficient.
  • Answer correctness: whether the final answer matches the question and source data.
  • Materiality: whether an error would change a decision, report, or control.

Audit-grade output.

Keep the natural-language question, generated SQL, source records, expected answer, reviewer decision, and exception reason together.

Guidethe question, evidence, artifact, and action to sequence
Evidencethe source-linked facts needed for a defensible read
Next movehow the guidance connects back to the AI Audit