Bringing Production-Grade Observability to AI Agent Workflows with OpenTelemetry

Community Article · Published October 31, 2025


Introduction

As AI agent systems move from research prototypes to production deployments, observability becomes critical. Unlike traditional applications where you can inspect logs or database queries, AI workflows involve complex interactions between multiple agents, LLM calls, and external systems. Understanding what's happening inside these black boxes is essential for debugging, optimization, and ensuring reliability at scale.

In this article, we'll explore how to add comprehensive observability to AI agent workflows using OpenTelemetry, the industry-standard observability framework. We'll see how structured tracing can reveal insights into agent behavior, LLM performance, and workflow execution patterns that are otherwise invisible.

The Observability Challenge in AI Workflows

Traditional observability tools were built for request-response architectures. AI agent workflows are fundamentally different:

  • Long-running processes: Tasks can span minutes or hours, not milliseconds
  • Non-deterministic behavior: LLM outputs vary between runs, making reproducibility challenging
  • Complex state management: Multiple agents interact with shared context and task queues
  • Cost implications: Each LLM call has both latency and financial costs that need tracking
  • Nested execution: Agents can spawn sub-tasks or invoke other agents recursively

Without proper observability, debugging a failed workflow feels like searching for a needle in a haystack. You might know a task failed, but not which agent, which LLM call, or what context led to the failure.

What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral open standard for generating, collecting, and exporting telemetry data (traces, metrics, and logs). It's become the de facto standard for observability, supported by virtually every monitoring platform, from open-source tools like Jaeger and Prometheus to commercial platforms like Datadog, New Relic, and specialized AI observability tools.

Core Concepts

Traces represent a request's journey through your system. They consist of:

  • Spans: Individual operations (e.g., "LLM call", "task execution")
  • Span hierarchy: Parent-child relationships showing execution flow
  • Attributes: Key-value pairs capturing context (model name, tokens used, duration)

Semantic Conventions are standardized attribute names that ensure observability tools can automatically recognize and display data correctly. For example, llm.request.model tells any OTel-compatible tool "this is an LLM model identifier."
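
As a quick illustration of these concepts, the snippet below creates one span for a single LLM call and attaches semantic attributes to it using the standard @opentelemetry/api package. This is a minimal sketch; the attribute values are placeholders.

import { trace, SpanKind } from '@opentelemetry/api';

const tracer = trace.getTracer('example-service');

// One span representing a single LLM call, enriched with semantic attributes
tracer.startActiveSpan('llm.call', { kind: SpanKind.CLIENT }, (span) => {
  span.setAttribute('llm.request.model', 'gpt-4');   // model identifier
  span.setAttribute('llm.usage.input_tokens', 245);  // prompt tokens
  span.setAttribute('llm.usage.output_tokens', 312); // completion tokens
  // ...the actual LLM request would happen here...
  span.end(); // ending the span records its duration
});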

Why OpenTelemetry for AI?

The AI observability ecosystem is rapidly converging on OpenTelemetry:

  • Langfuse, a leading LLM observability platform, supports OTLP ingestion
  • Phoenix by Arize uses OpenTelemetry for trace collection
  • SigNoz provides full-stack observability including LLM traces
  • Braintrust and Dash0 both accept traces exported via OTLP

By instrumenting your AI workflows with OpenTelemetry, you get vendor portability and can export to multiple platforms simultaneously.

KaibanJS: A Framework for Multi-Agent Workflows

To demonstrate OpenTelemetry integration in practice, we'll use KaibanJS, a framework designed for building production-ready multi-agent systems. KaibanJS provides a structured approach to orchestrating AI agents through task-based workflows.

What Makes KaibanJS Suitable for Production

KaibanJS addresses several challenges inherent in multi-agent systems:

  • Explicit Task Dependencies: Tasks can declare dependencies, ensuring proper execution order and enabling parallel execution where possible
  • Agent Specialization: Each agent is defined with a specific role, goal, and background, promoting clear separation of concerns
  • Workflow Orchestration: The framework handles agent coordination, task queuing, and state management automatically
  • Event-Driven Architecture: Built-in event system enables observability and extensibility through subscribers
  • Type-Safe API: TypeScript-first design provides compile-time safety for complex workflows

The framework abstracts away the complexity of managing multiple LLM interactions, retries, and error handling, letting developers focus on defining the business logic of their agent workflows. This also makes it an ideal candidate for adding observability: the structured nature of its workflows maps naturally to trace hierarchies.

Introducing @kaibanjs/opentelemetry

The @kaibanjs/opentelemetry package bridges KaibanJS workflows with the OpenTelemetry ecosystem. It automatically instruments your workflows without requiring changes to your agent or task definitions, making observability a non-invasive addition to your system.

The package works by:

  1. Subscribing to Workflow Events: It listens to KaibanJS internal events (task status updates, agent thinking phases) without modifying the core framework
  2. Mapping Events to Spans: Automatically converts workflow events into OpenTelemetry spans with proper parent-child relationships
  3. Enriching with Context: Adds semantic attributes like token usage, costs, and model information to each span
  4. Exporting Traces: Sends structured traces to your configured observability platforms via OTLP

Since it operates through KaibanJS's subscriber pattern, you can add or remove observability without touching your workflow code: a critical requirement for production systems, where you want to instrument existing applications without risk.
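
Conceptually, the event-to-span mapping looks something like the sketch below. This is purely illustrative: the event shape, the onTaskStatusChange handler, and the status names are assumptions made for the example, not the package's actual internals.

import { trace, SpanKind, SpanStatusCode, Span } from '@opentelemetry/api';

const tracer = trace.getTracer('kaibanjs-workflow');

// Hypothetical shape of a task-status event emitted by the workflow
interface TaskStatusEvent {
  taskTitle: string;
  agentName: string;
  status: 'DOING' | 'DONE';
}

// Keep one open span per in-flight task
const openSpans = new Map<string, Span>();

function onTaskStatusChange(event: TaskStatusEvent) {
  if (event.status === 'DOING') {
    // Task started: open a span and attach workflow context
    const span = tracer.startSpan(`Task: ${event.taskTitle}`, { kind: SpanKind.CLIENT });
    span.setAttribute('task.name', event.taskTitle);
    span.setAttribute('agent.name', event.agentName);
    openSpans.set(event.taskTitle, span);
  } else {
    // Task finished: mark it OK and close the span
    const span = openSpans.get(event.taskTitle);
    span?.setStatus({ code: SpanStatusCode.OK });
    span?.end();
    openSpans.delete(event.taskTitle);
  }
}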

Instrumenting AI Agent Workflows

Let's see how we can instrument a multi-agent workflow with @kaibanjs/opentelemetry. Imagine a content processing system where agents collaborate to extract, analyze, and synthesize information:

import { Team, Agent, Task } from 'kaibanjs';
import { enableOpenTelemetry } from '@kaibanjs/opentelemetry';

// Define specialized agents
const contentExtractor = new Agent({
  name: 'ContentExtractor',
  role: 'Extract and structure content',
  goal: 'Parse unstructured content into structured formats',
  background: 'Expert in NLP and information extraction',
});

const contentAnalyzer = new Agent({
  name: 'ContentAnalyzer',
  role: 'Analyze extracted content',
  goal: 'Identify patterns and insights',
  background: 'Expert in content analysis and pattern recognition',
});

const contentSynthesizer = new Agent({
  name: 'ContentSynthesizer',
  role: 'Synthesize findings',
  goal: 'Combine insights into coherent summaries',
  background: 'Expert in synthesis and summarization',
});

// Define tasks that form a workflow
const extractTask = new Task({
  title: 'Extract Content',
  description: 'Extract structured data from: {input}',
  expectedOutput: 'Structured JSON with key information',
  agent: contentExtractor,
});

const analyzeTask = new Task({
  title: 'Analyze Content',
  description: 'Analyze the extracted content for patterns',
  expectedOutput: 'Analysis report with identified patterns',
  agent: contentAnalyzer,
  dependencies: [extractTask],
});

const synthesizeTask = new Task({
  title: 'Synthesize Findings',
  description: 'Create a summary from the analysis',
  expectedOutput: 'Executive summary',
  agent: contentSynthesizer,
  dependencies: [analyzeTask],
});

const team = new Team({
  name: 'Content Processing Team',
  agents: [contentExtractor, contentAnalyzer, contentSynthesizer],
  tasks: [extractTask, analyzeTask, synthesizeTask],
});

Now, let's add observability:

const config = {
  enabled: true,
  sampling: {
    rate: 1.0, // In production, use 0.1-0.3
    strategy: 'always', // or 'probabilistic' for production
  },
  attributes: {
    includeSensitiveData: false,
    customAttributes: {
      'service.name': 'content-processing',
      'service.version': '1.0.0',
      'service.environment': process.env.NODE_ENV || 'development',
    },
  },
  exporters: {
    console: true, // For local development
    otlp: [
      // Send to multiple platforms simultaneously
      {
        endpoint: 'https://cloud.langfuse.com/api/public/otel',
        protocol: 'http',
        headers: {
          Authorization: `Basic ${Buffer.from(
            `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`
          ).toString('base64')}`,
        },
        serviceName: 'content-processing',
      },
      {
        endpoint: 'https://ingest.us.signoz.cloud:443',
        protocol: 'grpc',
        headers: { 'signoz-access-token': process.env.SIGNOZ_TOKEN },
        serviceName: 'content-processing',
      },
    ],
  },
};

enableOpenTelemetry(team, config);

// Run the workflow
await team.start({ input: 'Your content here...' });

Understanding the Trace Structure

When the workflow runs, OpenTelemetry creates a hierarchical trace structure:

Task: Extract Content (CLIENT span)
├── Agent Thinking Span (CLIENT)
│   ├── Attributes:
│   │   - kaiban.llm.request.model: "gpt-4"
│   │   - kaiban.llm.request.provider: "openai"
│   │   - kaiban.llm.usage.input_tokens: 245
│   │   - kaiban.llm.usage.output_tokens: 312
│   │   - kaiban.llm.usage.cost: 0.012
│   │   - agent.id: "content-extractor-1"
│   │   - agent.name: "ContentExtractor"
│   └── Duration: 2.3s
├── Agent Thinking Span (CLIENT)
│   └── (Second iteration/refinement)
└── Status: DONE

Task: Analyze Content (CLIENT span)
├── Agent Thinking Span (CLIENT)
└── Status: DONE

Task: Synthesize Findings (CLIENT span)
└── Agent Thinking Span (CLIENT)

This structure immediately reveals:

  • Task dependencies: Which tasks run sequentially vs. in parallel
  • Agent iteration patterns: How many times agents "think" before producing output
  • LLM costs: Per-task and per-agent token usage and costs
  • Bottlenecks: Which agents or tasks take the longest

LLM-Specific Semantic Conventions

One key innovation is the use of semantic conventions specifically for LLM operations. These conventions use the kaiban.llm.* namespace, ensuring observability platforms automatically recognize LLM data:

Request Attributes

{
  'kaiban.llm.request.model': 'gpt-4-turbo-preview',
  'kaiban.llm.request.provider': 'openai',
  'kaiban.llm.request.iteration': 1,
  'kaiban.llm.request.input_length': 1524,
}

Usage Metrics

{
  'kaiban.llm.usage.input_tokens': 245,
  'kaiban.llm.usage.output_tokens': 312,
  'kaiban.llm.usage.total_tokens': 557,
  'kaiban.llm.usage.cost': 0.012, // USD
}

Response Attributes

{
  'kaiban.llm.response.status': 'completed',
  'kaiban.llm.response.duration': 2300, // milliseconds
  'kaiban.llm.response.output_length': 1842,
}

Platforms like Langfuse and Phoenix automatically recognize these attributes and display them in specialized LLM views, making it easy to:

  • Track token usage trends over time
  • Identify expensive model calls
  • Monitor latency patterns
  • Debug failed LLM interactions

Production Considerations

Sampling Strategies

In production, you'll want to sample traces rather than recording everything:

sampling: {
  rate: 0.1, // Sample 10% of workflows
  strategy: 'probabilistic',
}

This reduces overhead while still capturing representative data. You can increase sampling for critical workflows or error cases.
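
One common pattern is to drive the rate from an environment variable so that staging environments or incident investigations can raise it without a redeploy. A small sketch, assuming the same configuration shape as above and a hypothetical TRACE_SAMPLE_RATE variable:

sampling: {
  // Hypothetical env var; falls back to 10% when unset
  rate: Number(process.env.TRACE_SAMPLE_RATE ?? '0.1'),
  strategy: 'probabilistic',
}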

Multi-Service Export

Exporting to multiple platforms provides redundancy and allows teams to use specialized tools:

exporters: {
  otlp: [
    // Langfuse for LLM-specific analysis
    {
      endpoint: 'https://cloud.langfuse.com/api/public/otel',
      protocol: 'http',
      serviceName: 'kaibanjs-langfuse',
    },
    // SigNoz for infrastructure monitoring
    {
      endpoint: 'https://ingest.us.signoz.cloud:443',
      protocol: 'grpc',
      serviceName: 'kaibanjs-signoz',
    },
    // Your internal OTLP collector
    {
      endpoint: 'https://otel-collector.internal.com',
      protocol: 'http',
      serviceName: 'kaibanjs-internal',
    },
  ],
}

Environment-Based Configuration

Use environment variables for sensitive configuration:

export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-service.com"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $API_TOKEN"

Then reference them in code:

exporters: {
  otlp: {
    serviceName: 'my-service',
    // Automatically uses environment variables
  },
}

Real-World Insights

With proper instrumentation, you can answer questions like:

  1. Cost Analysis: "Which tasks are consuming the most tokens?"
    • Filter spans by task.name and sum kaiban.llm.usage.cost (see the sketch after this list)
  2. Performance Optimization: "Are agents making unnecessary iterations?"
    • Count agent.thinking spans per task span
  3. Failure Debugging: "What was the exact LLM input when a task failed?"
    • Inspect span attributes at the time of failure
  4. Resource Planning: "How long do workflows typically take?"
    • Aggregate task span durations across traces
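
For instance, once spans are exported (to a file, the console exporter, or any backend you can query), the cost question in point 1 reduces to a small aggregation. A purely illustrative sketch, assuming the spans have already been collected into an array of { name, attributes } objects:

interface ExportedSpan {
  name: string;
  attributes: Record<string, string | number>;
}

// Sum LLM cost per task, keyed by the task.name attribute
function costPerTask(spans: ExportedSpan[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const span of spans) {
    const cost = Number(span.attributes['kaiban.llm.usage.cost'] ?? 0);
    if (cost > 0) {
      const task = String(span.attributes['task.name'] ?? span.name);
      totals[task] = (totals[task] ?? 0) + cost;
    }
  }
  return totals;
}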

Conclusion

OpenTelemetry brings production-grade observability to AI agent workflows, providing visibility into complex multi-agent systems that was previously difficult to achieve. By using standardized semantic conventions and supporting multiple export destinations, you get vendor portability and can choose the right tool for each team's needs.

The integration is non-intrusive: it observes your workflows without modifying core logic, making it safe to add to existing systems. As AI agent systems become more complex and move into production, observability isn't just nice-to-have; it's essential for reliability, cost management, and continuous improvement.

Whether you're debugging a failed workflow, optimizing token usage, or analyzing agent behavior patterns, structured traces with OpenTelemetry give you the insights you need to build and operate reliable AI agent systems at scale.

This integration demonstrates how modern observability standards can be applied to AI systems, making complex multi-agent workflows observable and debuggable. The semantic conventions ensure compatibility with a growing ecosystem of AI-focused observability tools.
