Why AI Agents Fail Today Despite the Hype
The agentic AI hype promises autonomous decision-makers that replace employees or dramatically boost efficiency. The reality, according to Maria Sukhareva (Principal AI Expert at Siemens) in “Why AI Agents Disappoint,” is that general-purpose AI agents don’t work for most real-world business use cases.
The WebArena benchmark quantifies the gap. Researchers created realistic web environments (e-commerce sites, forums, development platforms) and asked GPT-4-based agents to complete end-to-end tasks like “Find the cheapest phone case and email me the link.” Success rate: 14.41%.
The architecture is brittle
Current agents follow a ReAct cycle (Plan, Act, Observe, Repeat), as I explored previously. This looks like reasoning, but it’s sequential token prediction that chooses from a fixed set of pre-defined tools. When one step fails, the entire chain collapses.
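A minimal sketch of that loop makes the fragility concrete. The tool names and the plan_step hook below are hypothetical (this is not any specific framework): each observation is just appended to the history, and the next prediction builds on whatever is there, error or not.

```python
# Minimal ReAct-style loop: plan -> act -> observe -> repeat.
# Illustrative sketch only; `plan_step` stands in for the LLM call and the
# tool bodies are placeholders (both are assumptions, not a real framework).
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "search_catalog": lambda query: f"results for {query!r}",
    "send_email": lambda body: "email sent",
}

def run_agent(task: str, plan_step: Callable[[list[str]], dict], max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # The "reasoning" is one model call that picks the next tool
        # by predicting tokens over the history so far.
        action = plan_step(history)  # e.g. {"tool": "search_catalog", "args": {"query": "phone case"}}
        if action["tool"] == "finish":
            return action.get("answer", "")
        try:
            observation = TOOLS[action["tool"]](**action["args"])
        except Exception as exc:
            # A single failed call (unknown tool, bad argument, moved file)
            # becomes just another string in the history. Every later step
            # builds on top of it -- this is how one error sinks the chain.
            observation = f"ERROR: {exc}"
        history.append(f"Action: {action} -> Observation: {observation}")
    return "gave up after max_steps"
```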
The problem isn’t the architecture alone. It’s what happens when reality deviates from the plan.
If an agent struggles through a task and eventually succeeds, it retains zero memory of that success. Next session, it repeats the same mistakes. There’s no mechanism to update its knowledge base or behavior from experience. (Research like Meta’s Early Experience approach explores ways agents could learn from their own rollouts, but these methods aren’t yet production-ready.)
Error propagation compounds this. A single wrong click or a misread file in step two ruins the entire workflow. Attempted fixes like self-reflection or multi-agent debate act as band-aids. They sometimes amplify false reasoning rather than correct it.
Why coding works
AI-assisted coding (GitHub Copilot, Claude Code) is the exception. Agents genuinely deliver value here.
The environment is constrained. The IDE provides clear boundaries. The data is primarily text and code. Feedback loops are immediate. You run the code, see if it works, and adjust.
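That run-check-adjust loop is mechanical enough to write down. A rough sketch, assuming a hypothetical generate_patch hook for the model call: generate a change, run the tests, feed the failure output back, repeat.

```python
# The coding feedback loop, roughly: generate -> run tests -> feed errors back.
# Sketch only; `generate_patch` stands in for the model call (an assumption).
import subprocess
from typing import Callable

def fix_until_green(test_cmd: list[str],
                    generate_patch: Callable[[str], None],
                    max_attempts: int = 5) -> bool:
    feedback = ""
    for _ in range(max_attempts):
        generate_patch(feedback)                  # model edits files based on the last failure
        result = subprocess.run(test_cmd, capture_output=True, text=True)
        if result.returncode == 0:                # tests pass: an unambiguous success signal
            return True
        feedback = result.stdout + result.stderr  # tests fail: the error output *is* the feedback
    return False

# Usage: fix_until_green(["pytest", "-x"], my_patch_fn)
```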
This reveals the requirements for agents to succeed elsewhere: constrained environments, homogeneous data types, and fast feedback loops.
Real-world business tasks fail all three tests. A financial audit requires juggling emails, database logs, PDF invoices, and regulatory texts simultaneously. When a document is missing or unreadable, agents hallucinate results rather than problem-solving the way a human would.
Business workflows assume human common sense. Nobody emails a contact ten times in one minute. Nobody needs explicit instructions not to delete the production database. Agents require rigid IF…THEN rules for everything. They can’t handle dynamic obstacles (a file moved to a new database, a contact out of office) without being explicitly programmed for each scenario.
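In practice those rules end up hand-written, one per failure mode the team has already seen. A caricature of what that looks like (the checks and names are hypothetical, not any real product):

```python
# Every piece of "common sense" becomes an explicit, hand-coded rule.
# Caricatured sketch; the checks and field names are hypothetical.
def guard_action(action: dict, state: dict) -> bool:
    """Return True only if the action passes every hard-coded rule."""
    if action["type"] == "send_email":
        # Rule added after an agent emailed the same contact ten times in a minute.
        if state["emails_sent_to"].get(action["to"], 0) >= 3:
            return False
    if action["type"] == "run_sql":
        # Rule added after an agent was told to "clean up" a table.
        if "DROP" in action["query"].upper() or "prod" in action["database"]:
            return False
    if action["type"] == "read_file":
        # Rule added after a file moved and the agent hallucinated its contents.
        if action["path"] not in state["known_paths"]:
            return False
    # ...and nothing here covers the contact who is out of office,
    # the file that moved to a new database, or the next surprise.
    return True
```

The list only ever grows, and each rule covers exactly one scenario that has already gone wrong.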
We’re a decade away from autonomy?
The market hype suggests we’re approaching the “Observer” level, where machines work fully autonomously. Sukhareva argues we’re actually at the “Collaborator” level where humans guide machines. Citing Andrej Karpathy, she estimates it will take at least ten years to fix these fundamental cognitive issues.
The gap isn’t just technical. It’s architectural. Current agents lack the cognitive structure for proactive learning, robust error recovery, and multimodal reasoning.
Companies investing in “agentic AI” based on hype videos and demos should understand what they’re actually buying: brittle sequential machines that work in constrained environments with immediate feedback. Not autonomous decision-makers.
The coding success proves agents can work when the environment matches their capabilities. The question for product teams isn’t whether to use agents. It’s whether your use case looks more like an IDE or like a financial audit.

