Observability in Agentic Flows - Debugging the "Black Box"

How to instrument agentic systems for deep traceability and debugging.

Nabin Pokhrel
#observability #tracing

TL;DR

  • Traditional logs, metrics, and traces tell you what happened. Agents need a fourth pillar, reasoning logs, to explain why a specific action was chosen.
  • Instrument every tool call as an OpenTelemetry span with the agent's reasoning as a span attribute. Agents backtrack and branch; flat log lines are nearly useless.
  • Standard APM doesn't track what matters for agents: reasoning divergence, backtrack frequency, per-tool success rate, and tokens per task. Build those metrics yourself.

An agent deletes a staging database table. The logs say: DELETE FROM config_overrides;.

What they don't say is why.

Turns out the agent couldn't find a configuration file. It was looking in the wrong directory. It concluded the environment was corrupted. It decided a "clean reset" was the best course of action.

Perfectly logical reasoning. Absolutely catastrophic outcome.

The file existed. It was in /opt/app/config.yaml. The agent just didn't look there.

This is the observability problem in agentic systems. Traditional logging doesn't solve it.

A Fourth Pillar

In regular software, there are three pillars: logs, metrics, traces. They tell you what happened, how often, and in what order.

Usually, that's enough.

For agents, you need a fourth: reasoning logs.

A regular log: Tool called: deleteTable(config_overrides)

A reasoning log: "Searched for config.yaml in /etc/app/ and /home/app/. Neither path contained the file. This suggests the environment is corrupted. Will reset by clearing the overrides table."

The first tells you what happened. The second tells you why, and more importantly, where the reasoning went wrong.

Externalizing the Chain of Thought

The biggest mistake: treating the agent's reasoning as an internal implementation detail.

It's not. In production, the chain of thought is your most important debugging signal.

Force every agent to output its reasoning into a structured field captured by the logging infrastructure. Not as a debug log level that gets filtered out in production, but as a first-class field on every agent action.

interface AgentAction {
  actionId: string;
  timestamp: number;
  reasoning: string;      // WHY the agent chose this action
  plan: string[];         // What it intends to do next
  toolName: string;
  toolArgs: Record<string, unknown>;
  result: unknown;
  durationMs: number;
}

When something goes wrong, don't grep for errors. Grep for the reasoning.

The reasoning tells you where the agent's mental model diverged from reality. That's almost always the root cause.
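
Concretely, a minimal sketch of that logging path, assuming the AgentAction shape above (logAgentAction and the console transport are illustrative, not a fixed API):

// Emit each action as a single JSON line so the reasoning is greppable.
function logAgentAction(action: AgentAction): void {
  console.log(JSON.stringify({ event: 'agent.action', ...action }));
}

// During an incident: grep '"reasoning"' agent.log | grep -i corrupt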

Spans for Tool Calls

Every tool execution should be its own OpenTelemetry span. This isn't optional.

import { trace, SpanStatusCode } from '@opentelemetry/api';

// executeTool is your application's tool dispatcher; declared here so the
// snippet stands alone.
declare function executeTool(
  toolName: string,
  args: Record<string, unknown>
): Promise<unknown>;

const tracer = trace.getTracer('agent-tools');

async function instrumentedToolCall(
  toolName: string,
  args: Record<string, unknown>,
  reasoning: string
) {
  return tracer.startActiveSpan(`agent.tool.${toolName}`, async (span) => {
    span.setAttribute('agent.reasoning', reasoning);
    span.setAttribute('tool.name', toolName);
    span.setAttribute('tool.args', JSON.stringify(args));
    
    try {
      const result = await executeTool(toolName, args);
      span.setAttribute('tool.result', JSON.stringify(result));
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({ 
        code: SpanStatusCode.ERROR, 
        message: error instanceof Error ? error.message : 'Unknown error' 
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
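
Calling it might look like this (the tool, arguments, and reasoning string here are illustrative):

const config = await instrumentedToolCall(
  'readFile',
  { path: '/etc/app/config.yaml' },
  'Loading the app config to verify the environment before making changes.'
);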

Now when you look at a trace, you don't just see "tool A called tool B." You see why tool A was called, what it expected to happen, and what actually happened.

That context makes the difference between a 5-minute diagnosis and a 5-hour one.

Agents Don't Execute Linearly

Here's something that catches people off guard: agents backtrack. They branch. They retry.

A flat list of log lines is nearly useless for understanding what an agent actually did.

Picture the trace as a tree: one branch ends where the agent decided to backtrack and try a different approach. In a flat log, that looks like the agent did the same research twice for no reason. In a tree view, you can see the decision point and why it changed course.

Tools like LangSmith visualize this automatically. But even a custom solution storing parent-child relationships between actions and rendering them as a tree is worth building. A parentActionId field and a React tree component. One day of work. Weeks of saved debugging time.
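
A minimal sketch of the storage side, assuming AgentAction grows a parentActionId field (buildActionTree is an illustrative helper, not a library API):

interface ActionNode extends AgentAction {
  parentActionId: string | null;  // null for root actions
  children: ActionNode[];
}

// Group a flat action log into a tree; backtracks show up as sibling branches.
function buildActionTree(
  actions: Array<AgentAction & { parentActionId: string | null }>
): ActionNode[] {
  const nodes = new Map<string, ActionNode>();
  for (const a of actions) nodes.set(a.actionId, { ...a, children: [] });

  const roots: ActionNode[] = [];
  for (const node of nodes.values()) {
    const parent = node.parentActionId ? nodes.get(node.parentActionId) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}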

What to Actually Monitor

After running agents in production for a while, I've found these are the metrics that actually matter:

  • Reasoning divergence rate: how often the agent's stated plan doesn't match what it executed. A high rate means confusing prompts.
  • Backtrack frequency: how often the agent undoes or retries a step. Some is healthy. Too much means thrashing.
  • Tool call success rate per tool: not overall. Per tool. If readFile fails 30% of the time, that's a config problem, not an agent problem.
  • Total tokens per task: not just for cost, but as a proxy for efficiency. If the same task keeps using more tokens over time, something is drifting.

None of these exist in standard APM tools.

You have to build them. But once you have them, you can reason about agent performance the same way you reason about service performance.
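
As a sketch of one of them, here's per-tool success rate derived from the same action log. The success flag is an assumed extension of AgentAction, set wherever instrumentedToolCall resolves or throws:

// success is an assumed extension of AgentAction, not part of the interface above.
type LoggedAction = AgentAction & { success: boolean };

// Per-tool success rate over a window of logged actions.
function perToolSuccessRate(actions: LoggedAction[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const a of actions) {
    const t = totals.get(a.toolName) ?? { ok: 0, all: 0 };
    t.all += 1;
    if (a.success) t.ok += 1;
    totals.set(a.toolName, t);
  }
  // e.g. a readFile rate of 0.7 points at the environment, not the agent.
  return new Map([...totals].map(([tool, t]) => [tool, t.ok / t.all]));
}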

Frequently Asked Questions

Why isn't standard logging enough for agents?

Standard logs tell you what happened (e.g., "API called"). But agents are non-deterministic decision-makers. To debug them, you need to know why the agent decided that specific call was necessary. Without the "Reasoning Log," you're left guessing the agent's internal state.

What is "Reasoning Divergence"?

Reasoning Divergence is a metric that tracks how often an agent's stated plan conflicts with the actual action it takes. High divergence often points to "instruction drift" or overly complex system prompts that confuse the model's decision-making process.

How do OpenTelemetry spans help with agent debugging?

By making every tool call a span, you can visualize the exact sequence of actions. Attaching the LLM's reasoning as a span attribute allows you to use standard observability tools (like Jaeger or Honeycomb) to see the logic that led to a specific database error or API timeout.

Is it expensive to log reasoning for every action?

While it adds to your storage costs, it's significantly cheaper than the developer time required to reverse-engineer a failed agentic workflow. In production, these logs are your only reliable way to perform "post-mortems" on AI decisions.


How to cite
Pokhrel, N. (2026). "Observability in Agentic Flows - Debugging the 'Black Box'". Native Agents. https://nativeagents.dev/posts/internals/observability-in-agentic-flows