AI’s struggle to master accounting procedures

AI is solving famously hard maths problems. So why can't it audit a simple spreadsheet? The answer reveals a deep truth about AI's struggle to master accounting.

21 Jan, 2026


At a glance:

  • Generative AI currently struggles with the mathematical precision required for accounting tasks.
  • Tests show AI models are inconsistent and unreliable for spreadsheet logic and auditing.
  • AI can solve complex abstract maths but fails at simple, procedural business logic.
  • Hybrid systems combining AI with deterministic rules may offer a path to automation.

For the past two years, generative AI has been on a dream run, creating new text, code, images and video. Tech companies have promised that these cheeky chatbots would automate much of white-collar work – and accounting has been near the top of the list.

But a basic question still hangs in the air: can generative AI handle the mathematics of accounting yet?

It shouldn’t be too hard. Computers have been crunching calculations for decades. But as businesses start testing generative AI models on real-world finance tasks – reconciliation, forecasting, anomaly detection – they’re discovering the limits of today’s tools and are double-checking the vendors’ claims.

Some are running quiet experiments. Others, like Salesforce, are revising their strategy in public. What they’re finding could determine whether generative AI replaces the accountant – or stays in its lane as a language model built for writing tasks.

The accountant who put AI to the test

Simon Thorne is a spreadsheet tragic. A senior lecturer in computer science at Cardiff Metropolitan University, he was one of the first to test generative AI on real-world accounting logic.

Thorne had used various AI models before the current AI boom. He says he was “fairly astounded” by ChatGPT’s fluency once it was released to the public. He soon realised that people were using generative AI for more than just generating language and code. They were using it to fill spreadsheets.

“I noticed that there were some problems with what it would output. So I wanted to understand how well it can do certain things that are common in spreadsheets,” Thorne told Financial Accountant.

In 2023, Thorne released a series of tests aimed at answering that question. They ranged from basic tasks – like error-spotting in a profit and loss statement – to abstract puzzles and multi-step calculations. Over the next two years he expanded the suite into five categories: auditing, spreadsheet logic, domain knowledge, deterministic logic and pure maths.

Each test mimicked something a real user might ask a chatbot – “audit this budget”, “build a rolling average”, “spot the error in this interest calculation”. Some came straight from financial workflows. Others were logic challenges reworded to avoid contamination from AI training data. They included the so-called “astronaut puzzle”, a classic constraint logic problem, and a punishing entropy test involving probabilities and logarithms.

The doctored profit and loss statement included typical errors: hardcoded formulas, inconsistent rounding, duplicated entries. Gemini 2.5 picked up all of them. Microsoft’s Copilot missed half. In the entropy test, models often started strong but unravelled midway. “Once you get beyond a series of steps, it seems to just break down,” Thorne says.
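Checks like these need no AI at all: duplicated entries and inconsistent rounding can be caught deterministically. A minimal sketch, using invented data and hypothetical rules for illustration:

```python
from collections import Counter

# Invented P&L rows for illustration: (description, amount)
rows = [
    ("Software licences", 1200.00),
    ("Travel", 845.50),
    ("Software licences", 1200.00),   # duplicated entry
    ("Consulting", 999.999),          # inconsistent rounding
]

def find_duplicates(rows):
    """Rows that appear more than once."""
    return [row for row, n in Counter(rows).items() if n > 1]

def find_rounding_issues(rows, places=2):
    """Amounts carrying more precision than the ledger's two decimal places."""
    return [row for row in rows if round(row[1], places) != row[1]]

print(find_duplicates(rows))        # [('Software licences', 1200.0)]
print(find_rounding_issues(rows))   # [('Consulting', 999.999)]
```

A rule like this either fires or it doesn't; it never produces a plausible-sounding wrong answer.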

He also built a deliberately opaque spreadsheet using named ranges, seeded with subtle structural flaws. It was a realistic model, the kind that causes grief in every finance team.

To keep the tests clean, Thorne never published the prompts. “When you look through my paper, you won’t find the prompts that I used,” he says. “I’m trying to protect my own set of tests.” If he publishes them, he believes, they will get picked up by AI vendors who will train their algorithms specifically to defeat them.

In 2025, Thorne ran 21 models through his suite, including GPT-4, Claude, Gemini and Copilot. The results were “very fractured and inconsistent”. Some models hallucinated answers. Others guessed formulas that looked plausible but were logically wrong. Only Claude 2.5 and GPT-4 returned the correct bank interest formula – and only when prompted with exact phrasing.

“They work on probability. So whatever looks like the right answer based on the input is the most probable answer. When I give it exactly the same puzzle but it’s got a different theme … it’s utterly unable to do that,” Thorne says.

“Once you get beyond a series of steps, it seems to just break down.”

Simon Thorne

The maths paradox

So if generative AI can’t reliably calculate a rolling interest formula, why is it solving unsolved maths problems?

That’s the paradox confronting researchers watching AI progress on a different front. In recent months, AI models have been quietly knocking off problems from the Erdős list – a notorious collection of unsolved mathematical puzzles compiled by the late Hungarian mathematician Paul Erdős.

The list includes more than 1,000 problems, spanning number theory, combinatorics, graph theory and geometry. Many are deceptively simple to state, but fiendishly hard to solve.

Between Christmas 2025 and mid-January 2026, 15 problems shifted from “open” to “solved” on the official Erdős website. Eleven of those credited AI models for their role in the solution.

One of the more eye-catching results came from Neel Somani, a former quant and startup founder, who fed an Erdős problem into GPT-5.2 over the holiday break. Fifteen minutes later, ChatGPT returned a proof, reported TechCrunch. It cited Legendre’s formula, Bertrand’s postulate, and the Star of David theorem.

It also drew on a 2013 MathOverflow thread by Harvard mathematician Noam Elkies – but crucially, it didn’t copy. It built a different argument, producing a more complete solution to a variant of the original problem.

“I was curious to establish a baseline for when LLMs are effectively able to solve open math problems compared to where they struggle,” Somani told TechCrunch. The surprise was how much more capable the latest models proved on complex mathematical problems than their predecessors.

The mathematician Terence Tao has tracked eight problems where AI models made “meaningful autonomous progress”, and six more where they rediscovered and extended existing work.

How can this paradox exist? In one context, AI is struggling to audit a spreadsheet. In another, it’s collaborating with mathematicians to extend the frontiers of human knowledge.

Part of the answer lies in how these problems are structured.

Brilliant theory, broken practice

The breakthroughs in mathematics are impressive. But when it comes to practical business workflows, the results are less inspiring.

Business software company Salesforce was one of the earliest and loudest voices in the generative AI boom. CEO Marc Benioff even suggested renaming the company after its AI platform, Agentforce. But when Agentforce was tested in the real world, things didn’t go to plan.

One customer, home security firm Vivint, set up a basic instruction: send a customer satisfaction survey after every support call. No impressive acrobatics required – just a trigger, a task and an outcome. But in production, the surveys only went out some of the time. There was no logic to the failures. The task was simply skipped.

Salesforce’s CTO later explained the problem: LLMs struggle to follow more than eight steps in sequence. The system wasn’t broken; it just quietly dropped instructions without telling anyone.

As of early 2026, Agentforce no longer relies on language models alone. It runs on what Salesforce calls hybrid reasoning. LLMs still manage the conversation. But critical tasks are handed off to deterministic scripts – step-by-step rules that guarantee follow-through. An “Agent Script” ensures every required action happens in order, no matter how confidently the chatbot responds.
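The idea behind this split can be sketched in a few lines. In the sketch below, the language model would handle the conversation, while a deterministic script owns the critical steps and verifies follow-through. The step names and structure are hypothetical, not Salesforce’s actual Agent Script API:

```python
# Hypothetical sketch of "hybrid reasoning": critical steps live in a
# deterministic script, not in the language model's output.

def log_call(ctx):
    ctx["logged"] = True          # step 1: record the support call

def send_survey(ctx):
    ctx["survey_sent"] = True     # step 2: send the satisfaction survey

REQUIRED_STEPS = [log_call, send_survey]   # fixed order; nothing is optional

def run_script(ctx):
    for step in REQUIRED_STEPS:
        step(ctx)                 # execute every step unconditionally
    # fail loudly if anything was skipped, instead of dropping it silently
    if not (ctx.get("logged") and ctx.get("survey_sent")):
        raise RuntimeError("required step was skipped")
    return ctx

print(run_script({"call_id": 42}))
```

The point of the design is that a skipped task becomes an error someone sees, rather than a silent omission.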

What makes finance different

The contrast between solving Erdős problems and failing survey triggers isn’t as contradictory as it first appears. In fact, it highlights the fundamental design trade-off in generative AI – and suggests why it struggles in accounting.

Large language models are probabilistic engines. They’re trained to predict the most likely next word in a sequence based on vast amounts of text. That makes them surprisingly good at exploring abstract problems – like spotting patterns, drawing analogies or suggesting proofs. Mathematics research is an example of an open-ended domain which thrives on variation and creative leaps. In such fields, large language models can be genuinely useful.

But accounting isn’t abstract. It’s procedural. Tasks like reconciliation, auditing and compliance require determinism. Every step must follow the last. Every figure must be accurate. There is no “roughly right”.
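One of Thorne’s own prompt examples, “build a rolling average”, illustrates the contrast: a few lines of conventional code produce the same exact answer every time. A minimal sketch, using `Decimal` to avoid floating-point drift:

```python
from decimal import Decimal

def rolling_average(values, window):
    """Exact n-period rolling average, computed deterministically."""
    vals = [Decimal(str(v)) for v in values]
    out = []
    for i in range(window - 1, len(vals)):
        out.append(sum(vals[i - window + 1 : i + 1]) / window)
    return out

# Two-period averages of the series: 150, 250 and 350
print(rolling_average([100, 200, 300, 400], 2))
```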

This is where LLMs fall down. They don’t run calculations; they simulate what a correct answer might look like. And they can be very convincing. They cite formulas, mimic logic, and format outputs perfectly. But it’s still pattern-matching. And the longer or more ambiguous the task, the more likely they are to drift or hallucinate.

Thorne saw this in his entropy test: models would get the first part right, then break down midway. Salesforce saw it with Agentforce: fluent conversations, broken execution. Digits, a US accounting startup, benchmarked LLMs on transaction classification – and found none exceeded 70 percent accuracy without tight constraints. (Like Agentforce, Digits achieves much higher accuracy by combining LLMs with deterministic models.)
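What “tight constraints” can mean in practice is a deterministic guardrail around the model’s suggestion. The sketch below is purely illustrative – the stand-in `llm_suggest` function and the account codes are invented, not Digits’ actual system:

```python
# Hypothetical guardrail: an LLM's suggested account code is accepted only
# if it passes deterministic validation; otherwise it goes to a human.

CHART_OF_ACCOUNTS = {"6100": "Travel", "6200": "Software", "6300": "Consulting"}

def llm_suggest(description):
    # Placeholder for a model call; here just a canned guess.
    return "6200" if "licence" in description.lower() else "9999"

def classify(description):
    code = llm_suggest(description)
    if code in CHART_OF_ACCOUNTS:          # deterministic check
        return code, CHART_OF_ACCOUNTS[code]
    return None, "needs human review"

print(classify("Annual software licence"))   # ('6200', 'Software')
print(classify("Mystery payment"))           # (None, 'needs human review')
```

The model can still be wrong, but it can no longer invent an account that doesn’t exist.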

A saying allegedly popular inside Microsoft’s Excel AI group encapsulates the problem: “Ninety-nine percent correct is 100 percent wrong.”

In accounting, near enough is just not good enough.

Not if, but how

The question isn’t whether generative AI can do accounting. We know the answer to that already. On its own, it can’t.

The better question is: Can generative AI, when paired with other models and rule-based systems, do accounting accurately enough to replace humans?

That’s still not settled. But the theory shows it’s not impossible. Hybrid systems – like those used by Salesforce and Digits – use language models for context and communication, and rely on deterministic logic for critical steps. Done well, this approach could deliver automation with guardrails – and accuracy that holds up under audit.

We haven’t seen a fully autonomous accounting system work at scale. But given enough time, it still seems possible.


