The operating result.
Building AI Customers Could Trust.
~60% to 95% stated FP&A accuracy (~90% measured; not an audited fact), 144% NRR, 20% fewer false positives, 90+ regression scenarios, and rollout to 100+ customers.
The buyer's exposure was simple: a team will not act on a number it does not trust. We made the AI trustworthy and reliable by re-architecting the implementation, not by bolting evals onto the end.
Start with the architecture.
The reliability gap lived in context construction, retrieval relevance, prompts, review, and deterministic finance paths, not in the choice of model.
> We went from being unsure of our accuracy to rolling out the product to over 100 customers and having visibility across the behavior of our agentic application. The robust implementation was further validated by our customers' AI teams, and the vendor visibility allowed us to build enterprise trust.
CTO, AI-native finance SaaS
Engineer the reliability loop.
We authored the implementation as harness engineering: deciding where deterministic SQL belonged and where an LLM helped, so each consequential answer was exact where it had to be and judgment was applied only where judgment earned it. Each failure mode got an owner and a release check.
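The split between deterministic SQL and LLM judgment can be sketched as a small router. This is an illustrative sketch, not the shipped implementation: the `FAST_PATHS` table, the `ledger` schema, the keyword matching, and the `llm` callable are all assumptions made for the example.

```python
import sqlite3
from typing import Callable

# Hypothetical fast-path table: a question keyword mapped to exact SQL.
FAST_PATHS = {
    "total revenue": "SELECT SUM(amount) FROM ledger WHERE account = 'revenue'",
    "total expenses": "SELECT SUM(amount) FROM ledger WHERE account = 'expense'",
}


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory ledger with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ledger (account TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO ledger VALUES (?, ?)",
        [("revenue", 1200.0), ("revenue", 800.0), ("expense", 500.0)],
    )
    return conn


def answer(question: str, conn: sqlite3.Connection, llm: Callable[[str], str]) -> dict:
    """Use exact SQL when a fast path matches; otherwise defer to the LLM."""
    normalized = question.lower()
    for keyword, sql in FAST_PATHS.items():
        if keyword in normalized:
            value = conn.execute(sql).fetchone()[0]
            return {"path": "sql", "value": value}
    return {"path": "llm", "value": llm(question)}
```

In this sketch a question such as "What is total revenue?" never reaches the model, so the figure is exactly what the ledger holds, while an open-ended question like "Why did margins move?" falls through to the LLM path.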
Accuracy moved after context-layer work, RAG relevance tuning, prompt tuning, reviewer agents, and SQL fast paths.
False positives fell after critical outputs moved through reviewer checks and release evals.
High-risk FP&A behaviors were covered in a criticality-weighted eval loop.
The product moved from uncertain accuracy into rollout with visibility across agent behavior.
The hardest FP&A workflow was capped by weak context construction, not model choice alone.
The RAG path often found plausible evidence before the most relevant finance record.
Critical outputs needed reviewer agents, SQL fast paths, and higher eval weight than routine answers.
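One way to give critical outputs a higher eval weight than routine answers is a criticality-weighted pass rate. The sketch below is a minimal illustration under stated assumptions: the `EvalResult` type, the integer weights, and the scenario names are hypothetical, not the values used in the engagement.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalResult:
    scenario: str
    passed: bool
    criticality: int  # hypothetical weight; higher means more consequential


def weighted_pass_rate(results: Iterable[EvalResult]) -> float:
    """Share of total criticality weight carried by passing scenarios."""
    results = list(results)
    total = sum(r.criticality for r in results)
    if total == 0:
        raise ValueError("no weighted scenarios to score")
    passed = sum(r.criticality for r in results if r.passed)
    return passed / total


# Illustrative data: one critical failure outweighs two routine passes.
results = [
    EvalResult("revenue_rollup", passed=False, criticality=5),
    EvalResult("greeting", passed=True, criticality=1),
    EvalResult("glossary_lookup", passed=True, criticality=1),
]
score = weighted_pass_rate(results)  # 2 / 7, versus 2 / 3 unweighted
```

With these made-up weights, a single failure on a consequential figure pulls the score well below the unweighted pass rate, which is the behavior a criticality-weighted loop is meant to expose.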
Move accuracy into production.
The shift was from a promising FP&A agent to a system the team could act on without re-checking every figure.
Ship the reliability DAG.
We used criticality to decide which eval loops mattered most, then reinforced the product paths that carried the highest risk.
- Context-layer redesign
- RAG relevance tuning
- Prompt and fallback paths
- SQL fast-path scope
- Failure-mode taxonomy
- Reviewer-agent checks
- Criticality-weighted eval DAG
- Regression reinforcement
- 95% stated accuracy target
- False-positive checks
- 90+ regression scenarios
- Product-confidence readout
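Several of the stages above can be arranged as a dependency graph and run in topological order. The sketch below is only an illustration: the edges between stages are assumptions chosen for the example, since the real dependencies are not stated here. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it.
# These edges are hypothetical and exist only to show the mechanism.
STAGES = {
    "context_layer": set(),
    "rag_relevance": {"context_layer"},
    "prompt_and_fallback": {"rag_relevance"},
    "sql_fast_path": {"context_layer"},
    "reviewer_checks": {"prompt_and_fallback", "sql_fast_path"},
    "regression_scenarios": {"reviewer_checks"},
}


def run_order(stages: dict) -> list:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(stages).static_order())


order = run_order(STAGES)
```

Under these assumed edges, `context_layer` always runs first and `regression_scenarios` always runs last; the middle stages may appear in any order that respects their dependencies.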
What the team can now defend.
The same system improved agent behavior, release confidence, and the reliability claim the team makes to every customer's AI reviewers. It is a claim those reviewers can inspect against the mechanism and the evidence.
What shipped.
- Context layer, retrieval ranking, and prompt tuning for the FP&A workflow.
- Reviewer-agent checks and deterministic SQL fast paths for consequential questions.
- Criticality-weighted eval DAG with 90+ regression scenarios, run as the deploy gate.
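Running the regression scenarios as a deploy gate could look like the sketch below. It is a hedged illustration rather than the shipped gate: the `Scenario` type, the rule that any critical failure blocks release, and the `min_pass_rate` default of 0.9 are all assumptions picked for the example.

```python
from typing import List, NamedTuple, Tuple


class Scenario(NamedTuple):
    name: str
    passed: bool
    critical: bool  # hypothetical flag for consequential finance outputs


def deploy_gate(results: List[Scenario], min_pass_rate: float = 0.9) -> Tuple[bool, List[str]]:
    """Return (release_allowed, reasons). An empty reasons list means release."""
    if not results:
        return False, ["no scenarios ran"]
    reasons = []
    # Assumed rule: any failed critical scenario blocks the release outright.
    failed_critical = [r.name for r in results if r.critical and not r.passed]
    if failed_critical:
        reasons.append("critical failures: " + ", ".join(failed_critical))
    # Assumed rule: the overall pass rate must also clear an arbitrary threshold.
    rate = sum(r.passed for r in results) / len(results)
    if rate < min_pass_rate:
        reasons.append(f"pass rate {rate:.0%} below {min_pass_rate:.0%}")
    return not reasons, reasons
```

A CI job could call `deploy_gate` after the eval run and fail the build whenever the first element is `False`, printing the reasons so the blocking scenario is visible.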
Proof.
- ~60% to 95% stated FP&A accuracy (~90% measured; not an audited fact) in the FP&A AI implementation.
- 144% net revenue retention as customers expanded.
- 20% reduction in false positives after the eval pipeline went live.
- 90+ high-risk document scenarios covered in regression testing.
- Rolled out to 100+ customers with agent-behavior visibility.
Build AI that holds.
This is harness engineering: AI Engineering and Evals authored into one implementation, so production agents stay reliable when a customer's AI team comes asking.
Next move: AI Engineering (build), Production Evals (deploy gate), Continuous Evals (release cadence).
Engineering inside your own product? Start with a discovery call and we will walk the FP&A implementation in detail.