The @hexr_agent decorator sets up all OTel providers at decoration time, so adding observability is as simple as using the SDK.
How telemetry flows
All telemetry from your agents and platform services flows through a single OpenTelemetry Collector, then routes to dedicated backends. The data sources and what each emits:
| Source | What it emits |
|---|---|
| Python SDK (hexr_llm, hexr_tool, @hexr_agent) | Agent spans, LLM metrics, tool invocations |
| Envoy proxies | mTLS metrics, connection counts, TLS handshake latency |
| A2A sidecars | Task lifecycle, message throughput, SSE connections |
| Platform services (Vault, Gateway, Credential Injector) | Operation rates, latency, error counts |
Automatic instrumentation
You get complete observability without writing any instrumentation code: @hexr_agent wires up the providers at decoration time, and the SDK emits spans for every operation.
Trace spans
Every SDK operation generates a span you can inspect in Jaeger with full agent identity context:
| Span name | Key attributes | Source |
|---|---|---|
| hexr.agent.invoke | agent_name, tenant, framework, status | @hexr_agent decorator |
| hexr.tool.invoke | service, region, cache_tier | hexr_tool() |
| hexr.cache.lookup | tier (L1/L2/L3), hit, duration_ms | Credential cache |
| hexr.credential.exchange | provider, service, spiffe_id | Credential Injector client |
| hexr.llm.chat | gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | hexr_llm() proxy |
| hexr.vault.get | path, tenant | hexr.vault module |
| hexr.gateway.call | tool_name, arguments | hexr.gateway module |
| hexr.a2a.client.send | target_agent, task_id, task_state | A2AClient |
| hexr.a2a.bridge.execute | source_agent, task_id | A2A bridge |
| hexr.sandbox.exec | language, timeout, exit_code | hexr.sandbox |
| hexr.browser.browse | url, actions_count | hexr.browser |
| hexr.guard.scan | scan_type (prompt/output), is_valid | hexr.guard |
LLM Guard span attributes
When LLM Guard blocks a prompt or response, additional attributes are set on the parent hexr.llm.chat span:
| Attribute | Type | Description |
|---|---|---|
| hexr.guard.prompt_blocked | bool | true if the input prompt was blocked |
| hexr.guard.scanners | string | Scanner results that triggered the block |
| hexr.guard.output_blocked | bool | true if the LLM response was blocked |
| hexr.guard.output_scanners | string | Scanner results that triggered the output block |
When a guard triggers, the span status is also set to ERROR with a description of which guard triggered.
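The table above can be read as a small contract on the parent span. This stdlib-only sketch models that contract with a toy span object (the real spans are OpenTelemetry spans; only the attribute names come from the docs, the helper function is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span for illustration; stands in for an OTel span."""
    name: str
    attributes: dict = field(default_factory=dict)
    status: str = "OK"
    status_description: str = ""

def apply_guard_result(span, prompt_blocked, triggered_scanners):
    # Mirror the documented behaviour: set hexr.guard.* attributes on the
    # parent hexr.llm.chat span, and mark it ERROR when a guard triggers.
    span.attributes["hexr.guard.prompt_blocked"] = prompt_blocked
    span.attributes["hexr.guard.scanners"] = ", ".join(triggered_scanners)
    if prompt_blocked:
        span.status = "ERROR"
        span.status_description = f"blocked by: {', '.join(triggered_scanners)}"

span = Span("hexr.llm.chat")
apply_guard_result(span, True, ["PromptInjection"])
```

After this runs, the span carries `hexr.guard.prompt_blocked = true`, `hexr.guard.scanners = "PromptInjection"`, and an ERROR status naming the guard, which is the shape you would filter on in Jaeger.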
Metrics
Agent metrics
| Metric | Type | Description |
|---|---|---|
| hexr.agent.invocations | Counter | Total agent invocations |
| hexr.agent.active | UpDownCounter | Currently active invocations |
| hexr.agent.duration | Histogram | Invocation duration in seconds |
Tool and credential metrics
| Metric | Type | Description |
|---|---|---|
| hexr.tool.invocations | Counter | Total tool calls by service |
| hexr.tool.duration | Histogram | Tool call duration |
| hexr.cache.hits | Counter | Cache hits by tier (L1/L2/L3) |
| hexr.cache.misses | Counter | Cache misses |
| hexr.cache.lookup.duration | Histogram | Cache lookup latency |
| hexr.credential.exchanges | Counter | Full credential exchanges |
| hexr.credential.failures | Counter | Failed exchanges |
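The hexr.cache.hits and hexr.cache.misses counters are what a hit-ratio panel would be built from. A quick sketch of the arithmetic, with made-up sample values:

```python
# Sample counter values per tier (hexr.cache.hits is labelled by tier;
# the numbers are illustrative, not real data).
hits = {"L1": 900, "L2": 80, "L3": 15}
misses = 5  # hexr.cache.misses

total_lookups = sum(hits.values()) + misses   # 1000
hit_ratio = sum(hits.values()) / total_lookups  # 0.995
```

A 99.5% hit ratio like this means almost all credential lookups avoid a full exchange, which is exactly the relationship between hexr.cache.hits and hexr.credential.exchanges you would watch on a dashboard.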
LLM metrics
| Metric | Type | Description |
|---|---|---|
| hexr.llm.calls | Counter | Total LLM API calls |
| hexr.llm.call_errors | Counter | Failed LLM calls |
| hexr.llm.call.duration | Histogram | LLM call latency |
| hexr.llm.input_tokens | Counter | Total input tokens consumed |
| hexr.llm.output_tokens | Counter | Total output tokens generated |
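Two derived values these counters commonly feed (error rate and average tokens per call), sketched with made-up sample numbers:

```python
# Illustrative counter snapshots; not real data.
llm_calls = 250          # hexr.llm.calls
llm_call_errors = 5      # hexr.llm.call_errors
input_tokens = 300_000   # hexr.llm.input_tokens
output_tokens = 120_000  # hexr.llm.output_tokens

error_rate = llm_call_errors / llm_calls                           # 0.02
avg_tokens_per_call = (input_tokens + output_tokens) / llm_calls   # 1680.0
```

The same two ratios work over a time window when computed from counter rates instead of raw totals.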
A2A metrics
| Metric | Type | Description |
|---|---|---|
| hexr.a2a.sends | Counter | Messages sent |
| hexr.a2a.send_failures | Counter | Failed sends |
| hexr.a2a.send.duration | Histogram | Send latency |
| hexr.a2a.bridge.executions | Counter | Bridge handler calls |
LLM Guard metrics
| Metric | Type | Description |
|---|---|---|
| hexr_guard_scans_total | Counter | Total scans by direction (input/output) and scanner |
| hexr_guard_blocks_total | Counter | Total blocks by direction and scanner |
| hexr_guard_scan_duration_seconds | Histogram | Scan latency by direction |
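Because both counters carry the same direction and scanner labels, a per-scanner block rate falls out by dividing blocks by scans label-by-label. A sketch with made-up label values:

```python
from collections import Counter

# Illustrative label sets: (direction, scanner) -> count. Not real data.
scans = Counter({("input", "PromptInjection"): 1000,
                 ("input", "Toxicity"): 1000})      # hexr_guard_scans_total
blocks = Counter({("input", "PromptInjection"): 12,
                  ("input", "Toxicity"): 3})        # hexr_guard_blocks_total

# Per-label block rate, the quantity a guard dashboard panel would plot.
block_rate = {labels: blocks[labels] / scans[labels] for labels in scans}
```

A sudden rise in one scanner's rate (say, PromptInjection at 1.2% here) is a more useful alert signal than the aggregate block count.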
Pre-built Grafana dashboards
Hexr ships with two Grafana dashboards covering 42 panels out of the box, with no setup required.
Platform overview (23 panels)
Covers system-wide health across all your agents:
- Agent pod status and container health
- Credential exchange rates and cache hit ratios
- mTLS connection counts and TLS handshake latency
- SPIRE entry counts and SVID rotation rates
- OTel Collector throughput (traces/sec, metrics/sec)
- Vault operation rates and latency
- Gateway tool invocation rates
A2A communication (19 panels)
Covers inter-agent messaging for multi-agent workflows:
- Task lifecycle: submitted → working → completed / failed
- Message throughput per agent pair
- Task duration histograms
- SSE streaming connection counts
- Valkey task store operations
- Error rates by task state transition
- Cross-namespace communication patterns
GenAI semantic conventions
hexr_llm() follows the OpenTelemetry GenAI semantic conventions, so your traces are compatible with any OTel-native LLM observability tool:
| Attribute | Example value |
|---|---|
| gen_ai.system | openai, anthropic, google_genai, cohere, mistral |
| gen_ai.request.model | gpt-4o, claude-3-opus, gemini-pro |
| gen_ai.response.model | gpt-4o-2024-08-06 |
| gen_ai.usage.input_tokens | 1200 |
| gen_ai.usage.output_tokens | 800 |
| gen_ai.response.id | chatcmpl-abc123 |
| gen_ai.response.finish_reasons | ["stop"] |
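Put together, a hexr.llm.chat span carries an attribute map shaped like the table above. A sketch using the example values (as a plain dict, since any OTel-native tool consumes these conventional keys the same way):

```python
# Attribute map mirroring the GenAI semantic-convention table above.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 800,
    "gen_ai.response.id": "chatcmpl-abc123",
    "gen_ai.response.finish_reasons": ["stop"],
}

# Because the keys are standardized, downstream tools can aggregate
# without knowing anything about Hexr:
total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
```

This is the compatibility point: a generic LLM-observability backend that understands gen_ai.* attributes can cost, group, and filter Hexr traces with zero custom mapping.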
Prometheus scrape targets
Prometheus scrapes metrics from your agent pods and all Hexr platform services automatically. The pre-configured targets include:
| Target | Metrics |
|---|---|
| Agent pods (per tenant) | Task lifecycle, message throughput |
| Credential Injector | Exchange rates, OPA decisions |
| Gateway | Tool calls, import counts |
| Vault | Secret operations, encryption |
| OTel Collector | Collector health, pipeline stats |