The @hexr_agent decorator sets up all OTel providers at decoration time, so adding observability is as simple as using the SDK.
How telemetry flows
All telemetry from your agents and platform services flows through a single OpenTelemetry Collector, then routes to dedicated backends. The data sources and what each emits:
| Source | What it emits |
|---|---|
| Python SDK (hexr_llm, hexr_tool, @hexr_agent) | Agent spans, LLM metrics, tool invocations |
| Envoy proxies | mTLS metrics, connection counts, TLS handshake latency |
| A2A sidecars | Task lifecycle, message throughput, SSE connections |
| Platform services (Vault, Gateway, Credential Injector) | Operation rates, latency, error counts |
Automatic instrumentation
You get complete observability without writing any instrumentation code: @hexr_agent wires up the providers at decoration time, and the SDK emits spans for every operation.
Trace spans
Every SDK operation generates a span you can inspect in Jaeger with full agent identity context:
| Span name | Key attributes | Source |
|---|---|---|
| hexr.agent.invoke | agent_name, tenant, framework, status | @hexr_agent decorator |
| hexr.tool.invoke | service, region, cache_tier | hexr_tool() |
| hexr.cache.lookup | tier (L1/L2/L3), hit, duration_ms | Credential cache |
| hexr.credential.exchange | provider, service, spiffe_id | Credential Injector client |
| hexr.llm.chat | gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | hexr_llm() proxy |
| hexr.vault.get | path, tenant | hexr.vault module |
| hexr.gateway.call | tool_name, arguments | hexr.gateway module |
| hexr.a2a.client.send | target_agent, task_id, task_state | A2AClient |
| hexr.a2a.bridge.execute | source_agent, task_id | A2A bridge |
| hexr.sandbox.exec | language, timeout, exit_code | hexr.sandbox |
| hexr.browser.browse | url, actions_count | hexr.browser |
| hexr.guard.scan | scan_type (prompt/output), is_valid | hexr.guard |
LLM Guard span attributes
When LLM Guard blocks a prompt or response, additional attributes are set on the parent hexr.llm.chat span:
| Attribute | Type | Description |
|---|---|---|
| hexr.guard.prompt_blocked | bool | true if the input prompt was blocked |
| hexr.guard.scanners | string | Scanner results that triggered the block |
| hexr.guard.output_blocked | bool | true if the LLM response was blocked |
| hexr.guard.output_scanners | string | Scanner results that triggered the output block |
When a guard triggers, the span status is also set to ERROR with a description of which guard triggered.
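The table above can be read as a small contract on the parent span. This stdlib-only sketch models that contract with a toy span object (the real spans are OpenTelemetry spans; only the attribute names come from the docs, the helper function is hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Toy span for illustration; stands in for an OTel span."""
    name: str
    attributes: dict = field(default_factory=dict)
    status: str = "OK"
    status_description: str = ""

def apply_guard_result(span, prompt_blocked, triggered_scanners):
    # Mirror the documented behaviour: set hexr.guard.* attributes on the
    # parent hexr.llm.chat span, and mark it ERROR when a guard triggers.
    span.attributes["hexr.guard.prompt_blocked"] = prompt_blocked
    span.attributes["hexr.guard.scanners"] = ", ".join(triggered_scanners)
    if prompt_blocked:
        span.status = "ERROR"
        span.status_description = f"blocked by: {', '.join(triggered_scanners)}"

span = Span("hexr.llm.chat")
apply_guard_result(span, True, ["PromptInjection"])
```

After this runs, the span carries `hexr.guard.prompt_blocked = true`, `hexr.guard.scanners = "PromptInjection"`, and an ERROR status naming the guard, which is the shape you would filter on in Jaeger.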
Metrics
Agent metrics
| Metric | Type | Description |
|---|---|---|
| hexr.agent.invocations | Counter | Total agent invocations |
| hexr.agent.active | UpDownCounter | Currently active invocations |
| hexr.agent.duration | Histogram | Invocation duration in seconds |
Tool and credential metrics
| Metric | Type | Description |
|---|---|---|
| hexr.tool.invocations | Counter | Total tool calls by service |
| hexr.tool.duration | Histogram | Tool call duration |
| hexr.cache.hits | Counter | Cache hits by tier (L1/L2/L3) |
| hexr.cache.misses | Counter | Cache misses |
| hexr.cache.lookup.duration | Histogram | Cache lookup latency |
| hexr.credential.exchanges | Counter | Full credential exchanges |
| hexr.credential.failures | Counter | Failed exchanges |
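The hexr.cache.hits and hexr.cache.misses counters are what a hit-ratio panel would be built from. A quick sketch of the arithmetic, with made-up sample values:

```python
# Sample counter values per tier (hexr.cache.hits is labelled by tier;
# the numbers are illustrative, not real data).
hits = {"L1": 900, "L2": 80, "L3": 15}
misses = 5  # hexr.cache.misses

total_lookups = sum(hits.values()) + misses   # 1000
hit_ratio = sum(hits.values()) / total_lookups  # 0.995
```

A 99.5% hit ratio like this means almost all credential lookups avoid a full exchange, which is exactly the relationship between hexr.cache.hits and hexr.credential.exchanges you would watch on a dashboard.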
LLM metrics
| Metric | Type | Description |
|---|---|---|
| hexr.llm.calls | Counter | Total LLM API calls |
| hexr.llm.call_errors | Counter | Failed LLM calls |
| hexr.llm.call.duration | Histogram | LLM call latency |
| hexr.llm.input_tokens | Counter | Total input tokens consumed |
| hexr.llm.output_tokens | Counter | Total output tokens generated |
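Two derived values these counters commonly feed (error rate and average tokens per call), sketched with made-up sample numbers:

```python
# Illustrative counter snapshots; not real data.
llm_calls = 250          # hexr.llm.calls
llm_call_errors = 5      # hexr.llm.call_errors
input_tokens = 300_000   # hexr.llm.input_tokens
output_tokens = 120_000  # hexr.llm.output_tokens

error_rate = llm_call_errors / llm_calls                           # 0.02
avg_tokens_per_call = (input_tokens + output_tokens) / llm_calls   # 1680.0
```

The same two ratios work over a time window when computed from counter rates instead of raw totals.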
A2A metrics
| Metric | Type | Description |
|---|---|---|
| hexr.a2a.sends | Counter | Messages sent |
| hexr.a2a.send_failures | Counter | Failed sends |
| hexr.a2a.send.duration | Histogram | Send latency |
| hexr.a2a.bridge.executions | Counter | Bridge handler calls |
LLM Guard metrics
| Metric | Type | Description |
|---|---|---|
| hexr_guard_scans_total | Counter | Total scans by direction (input/output) and scanner |
| hexr_guard_blocks_total | Counter | Total blocks by direction and scanner |
| hexr_guard_scan_duration_seconds | Histogram | Scan latency by direction |
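Because both counters carry the same direction and scanner labels, a per-scanner block rate falls out by dividing blocks by scans label-by-label. A sketch with made-up label values:

```python
from collections import Counter

# Illustrative label sets: (direction, scanner) -> count. Not real data.
scans = Counter({("input", "PromptInjection"): 1000,
                 ("input", "Toxicity"): 1000})      # hexr_guard_scans_total
blocks = Counter({("input", "PromptInjection"): 12,
                  ("input", "Toxicity"): 3})        # hexr_guard_blocks_total

# Per-label block rate, the quantity a guard dashboard panel would plot.
block_rate = {labels: blocks[labels] / scans[labels] for labels in scans}
```

A sudden rise in one scanner's rate (say, PromptInjection at 1.2% here) is a more useful alert signal than the aggregate block count.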
Pre-built Grafana dashboards
Hexr ships with two Grafana dashboards covering 42 panels out of the box, with no setup required.
Platform overview (23 panels)
Covers system-wide health across all your agents:
- Agent pod status and container health
- Credential exchange rates and cache hit ratios
- mTLS connection counts and TLS handshake latency
- SPIRE entry counts and SVID rotation rates
- OTel Collector throughput (traces/sec, metrics/sec)
- Vault operation rates and latency
- Gateway tool invocation rates
A2A communication (19 panels)
Covers inter-agent messaging for multi-agent workflows:
- Task lifecycle: submitted → working → completed / failed
- Message throughput per agent pair
- Task duration histograms
- SSE streaming connection counts
- Valkey task store operations
- Error rates by task state transition
- Cross-namespace communication patterns
GenAI semantic conventions
hexr_llm() follows the OpenTelemetry GenAI semantic conventions, so your traces are compatible with any OTel-native LLM observability tool:
| Attribute | Example value |
|---|---|
| gen_ai.system | openai, anthropic, google_genai, cohere, mistral |
| gen_ai.request.model | gpt-4o, claude-3-opus, gemini-pro |
| gen_ai.response.model | gpt-4o-2024-08-06 |
| gen_ai.usage.input_tokens | 1200 |
| gen_ai.usage.output_tokens | 800 |
| gen_ai.response.id | chatcmpl-abc123 |
| gen_ai.response.finish_reasons | ["stop"] |
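Put together, a hexr.llm.chat span carries an attribute map shaped like the table above. A sketch using the example values (as a plain dict, since any OTel-native tool consumes these conventional keys the same way):

```python
# Attribute map mirroring the GenAI semantic-convention table above.
span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.usage.input_tokens": 1200,
    "gen_ai.usage.output_tokens": 800,
    "gen_ai.response.id": "chatcmpl-abc123",
    "gen_ai.response.finish_reasons": ["stop"],
}

# Because the keys are standardized, downstream tools can aggregate
# without knowing anything about Hexr:
total_tokens = (span_attributes["gen_ai.usage.input_tokens"]
                + span_attributes["gen_ai.usage.output_tokens"])
```

This is the compatibility point: a generic LLM-observability backend that understands gen_ai.* attributes can cost, group, and filter Hexr traces with zero custom mapping.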
Prometheus scrape targets
Prometheus scrapes metrics from your agent pods and all Hexr platform services automatically. The pre-configured targets include:
| Target | Metrics |
|---|---|
| Agent pods (per tenant) | Task lifecycle, message throughput |
| Credential Injector | Exchange rates, OPA decisions |
| Gateway | Tool calls, import counts |
| Vault | Secret operations, encryption |
| OTel Collector | Collector health, pipeline stats |