# Observability Guide
This guide covers MCP Hangar's observability features: metrics, tracing, logging, and health checks.
## Table of Contents
- Quick Start
- Metrics
- Tracing
- Langfuse Integration
- Logging
- Health Checks
- Alerting
- SLIs/SLOs
- Troubleshooting
- Best Practices
## Quick Start

### Prerequisites

For full tracing support, install the OpenTelemetry dependencies. The exact package set depends on your setup; the sketch below assumes the standard OpenTelemetry API, SDK, and OTLP exporter packages:
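```bash
# Assumed package set; check the project's install docs for an extras group
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp
```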
### Enable Full Observability Stack

```bash
# Start the monitoring stack (Prometheus, Grafana, Jaeger)
docker compose -f docker-compose.monitoring.yml --profile tracing up -d
```
Access the dashboards:

- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- Jaeger: http://localhost:16686
### Configure MCP Hangar

```yaml
# config.yaml
logging:
  level: INFO
  json_format: true

observability:
  tracing:
    enabled: true
    otlp_endpoint: http://localhost:4317
  metrics:
    enabled: true
    endpoint: /metrics
```
## Metrics

### Available Metrics

MCP Hangar exports Prometheus metrics at `/metrics`:
#### Tool Invocations

| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_registry_tool_calls_total` | Counter | `provider`, `tool` | Total tool invocations |
| `mcp_registry_tool_call_duration_seconds` | Histogram | `provider`, `tool` | Invocation latency |
| `mcp_registry_tool_call_errors_total` | Counter | `provider`, `tool`, `error_type` | Failed invocations |
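For example, latency percentiles can be derived from the duration histogram; a typical query, assuming the standard `_bucket` series that Prometheus histograms export:

```promql
# P95 tool-call latency per provider over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (provider, le) (rate(mcp_registry_tool_call_duration_seconds_bucket[5m]))
)
```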
#### Provider State

| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_registry_provider_state` | Gauge | `provider`, `state` | Current provider state |
| `mcp_registry_cold_starts_total` | Counter | `provider`, `mode` | Cold start count |
| `mcp_registry_cold_start_duration_seconds` | Histogram | `provider`, `mode` | Cold start latency |
#### Health Checks

| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_registry_health_checks` | Counter | `provider`, `result` | Health check executions |
| `mcp_registry_health_check_consecutive_failures` | Gauge | `provider` | Consecutive failures |
#### Circuit Breaker

| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_registry_circuit_breaker_state` | Gauge | `provider` | State: 0=closed, 1=open, 2=half_open |
| `mcp_registry_circuit_breaker_failures_total` | Counter | `provider` | Circuit breaker trip count |
#### Retry Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `mcp_registry_retry_attempts_total` | Counter | `provider`, `tool`, `attempt_number` | Retry attempts |
| `mcp_registry_retry_exhausted_total` | Counter | `provider`, `tool` | Retries exhausted |
| `mcp_registry_retry_succeeded_total` | Counter | `provider`, `tool`, `attempt_number` | Successful retries |
### Prometheus Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'mcp-hangar'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 10s
```
### Grafana Dashboards

Pre-built dashboards are in `monitoring/grafana/dashboards/`:

| Dashboard | Description |
|---|---|
| `overview.json` | High-level health, latency percentiles, error rates |
| `providers.json` | Per-provider details and state transitions |
| `discovery.json` | Auto-discovery metrics and source health |

Import them via the Grafana UI, or use the provisioning configuration in `monitoring/grafana/provisioning/`.
## Tracing

### OpenTelemetry Integration

MCP Hangar supports distributed tracing via OpenTelemetry:

```python
from mcp_hangar.observability import init_tracing, get_tracer

# Initialize once at application startup
init_tracing(
    service_name="mcp-hangar",
    otlp_endpoint="http://localhost:4317",
)

# Get a tracer for your module
tracer = get_tracer(__name__)

# Create spans for operations
with tracer.start_as_current_span("process_request") as span:
    span.set_attribute("request.id", request_id)
    span.set_attribute("provider.id", provider_id)
    result = process_request()
```
### Using the trace_span Context Manager

For simpler usage:

```python
from mcp_hangar.observability import trace_span

with trace_span("my_operation", {"key": "value"}) as span:
    span.add_event("checkpoint_reached")
    do_work()
```
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MCP_TRACING_ENABLED` | `true` | Enable/disable tracing |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | `http://localhost:4317` | OTLP collector endpoint (gRPC) |
| `OTEL_SERVICE_NAME` | `mcp-hangar` | Service name in traces |
| `MCP_ENVIRONMENT` | `development` | Deployment environment tag |
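For example, pointing traces at a collector in another environment:

```bash
export MCP_TRACING_ENABLED=true
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317  # hostname is illustrative
export OTEL_SERVICE_NAME=mcp-hangar
export MCP_ENVIRONMENT=production
```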
### Trace Context Propagation

Propagate trace context across service boundaries:

```python
from mcp_hangar.observability import inject_trace_context, extract_trace_context

# Inject context into outgoing request headers
headers = {}
inject_trace_context(headers)
# headers now contains 'traceparent' and 'tracestate'

# Extract context from an incoming request
context = extract_trace_context(request_headers)
```
### Getting Current Trace Information

```python
from mcp_hangar.observability import get_current_trace_id, get_current_span_id

# Get the current trace ID for log correlation
trace_id = get_current_trace_id()  # Returns hex string or None
span_id = get_current_span_id()    # Returns hex string or None
```
### Viewing Traces

1. Open the Jaeger UI at http://localhost:16686
2. Select the `mcp-hangar` service from the dropdown
3. Set the time range and click **Find Traces**
4. Click on a trace to see the span tree
## Langfuse Integration
MCP Hangar integrates with Langfuse for LLM-specific observability, providing end-to-end tracing of tool invocations from your LLM application through MCP Hangar to individual providers.
### Why Langfuse?
| Feature | Benefit |
|---|---|
| End-to-end traces | See the complete journey from LLM prompt → tool call → provider response |
| Cost attribution | Track costs per provider, tool, user, or session |
| Latency analysis | Identify slow providers and optimize performance |
| Quality scoring | Correlate provider health with LLM response quality |
| Evals | Run automated evaluations on tool outputs |
### Installation

The integration requires the Langfuse Python SDK; a typical install, assuming the standard `langfuse` package from PyPI:
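```bash
pip install langfuse
```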
### Configuration

#### Via Environment Variables

```bash
export HANGAR_LANGFUSE_ENABLED=true
export LANGFUSE_PUBLIC_KEY=pk-lf-...
export LANGFUSE_SECRET_KEY=sk-lf-...
export LANGFUSE_HOST=https://cloud.langfuse.com  # or self-hosted URL
```
#### Via config.yaml

```yaml
observability:
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
    host: https://cloud.langfuse.com
    sample_rate: 1.0      # 0.0-1.0, fraction of traces to sample
    scrub_inputs: false   # Redact sensitive input data
    scrub_outputs: false  # Redact sensitive output data
```
### Trace Propagation

To correlate traces from your LLM application with MCP Hangar, pass a `trace_id` when invoking tools:

```python
from mcp_hangar.application.services import TracedProviderService

# Invoke with trace context from your LLM application
result = traced_service.invoke_tool(
    provider_id="math",
    tool_name="add",
    arguments={"a": 1, "b": 2},
    trace_id="your-langfuse-trace-id",  # Correlates with the LLM trace
    user_id="user-123",                 # For cost attribution
    session_id="session-456",           # For grouping related calls
)
```
### What Gets Traced
| Event | Recorded Data |
|---|---|
| Tool invocation | Provider, tool, input params, output, latency, success/error |
| Health check | Provider, healthy status, latency |
### Recorded Scores

| Score Name | Description |
|---|---|
| `tool_success` | 1.0 for success, 0.0 for error |
| `tool_latency_ms` | Invocation latency in milliseconds |
| `provider_healthy` | 1.0 if healthy, 0.0 if unhealthy |
| `health_check_latency_ms` | Health check latency in milliseconds |
### Using TracedProviderService

The `TracedProviderService` wraps `ProviderService` to automatically trace operations:

```python
from mcp_hangar.application.services import ProviderService, TracedProviderService
from mcp_hangar.infrastructure.observability import LangfuseObservabilityAdapter, LangfuseConfig

# Create the observability adapter
langfuse_config = LangfuseConfig(
    enabled=True,
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)
observability = LangfuseObservabilityAdapter(langfuse_config)

# Wrap your existing service
traced_service = TracedProviderService(
    provider_service=provider_service,
    observability=observability,
)

# All tool invocations are now traced
result = traced_service.invoke_tool("math", "add", {"a": 1, "b": 2})
```
### GDPR Compliance

Enable input/output scrubbing to avoid sending sensitive data to Langfuse:

```yaml
observability:
  langfuse:
    enabled: true
    scrub_inputs: true   # Only sends parameter keys, not values
    scrub_outputs: true  # Only sends output structure, not content
```
### Viewing Traces in Langfuse

1. Open the Langfuse dashboard at https://cloud.langfuse.com
2. Navigate to **Traces**
3. Filter by:
   - `metadata.mcp_hangar = true` for MCP Hangar traces
   - `metadata.provider = math` for a specific provider
4. Click on a trace to see:
   - Input parameters
   - Output results
   - Latency breakdown
   - Recorded scores
### Combining with OpenTelemetry

Langfuse and OpenTelemetry can run simultaneously: Langfuse focuses on LLM-specific observability, while OpenTelemetry provides infrastructure-level tracing:

```yaml
observability:
  tracing:
    enabled: true
    otlp_endpoint: http://localhost:4317
  langfuse:
    enabled: true
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}
```
## Logging

### Structured Logging

MCP Hangar uses structlog for structured JSON logging:

```json
{
  "timestamp": "2026-01-09T10:30:00.123Z",
  "level": "info",
  "event": "tool_invoked",
  "provider": "sqlite",
  "tool": "query",
  "duration_ms": 150,
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "mcp-hangar"
}
```
### Log Correlation with Traces

Include trace IDs in logs for correlation:

```python
from mcp_hangar.observability import get_current_trace_id
from mcp_hangar.logging_config import get_logger

logger = get_logger(__name__)

def handle_request():
    trace_id = get_current_trace_id()
    logger.info("processing_request", trace_id=trace_id, request_id=req_id)
```
### Configuration

```yaml
# config.yaml
logging:
  level: INFO                # DEBUG, INFO, WARNING, ERROR, CRITICAL
  json_format: true          # Enable JSON output for log aggregation
  file: logs/mcp-hangar.log  # Optional file output
```

Environment variable override (the variable name below is illustrative; check your deployment for the exact name):
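```bash
# Hypothetical variable, following the HANGAR_ prefix used elsewhere in this guide
export HANGAR_LOG_LEVEL=DEBUG
```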
## Health Checks

### HTTP Endpoints

MCP Hangar provides standard health endpoints compatible with Kubernetes and other orchestrators:

| Endpoint | HTTP Method | Purpose | Use Case |
|---|---|---|---|
| `/health/live` | GET | Liveness check | Container restart decisions |
| `/health/ready` | GET | Readiness check | Traffic routing decisions |
| `/health/startup` | GET | Startup check | Startup completion gate |
### Response Format

```json
{
  "status": "healthy",
  "checks": [
    {
      "name": "providers",
      "status": "healthy",
      "duration_ms": 1.2,
      "message": "Check passed"
    }
  ],
  "version": "0.1.0",
  "uptime_seconds": 3600.5
}
```
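To inspect this payload directly (assuming the default port 8000 from the Quick Start):

```bash
curl -s http://localhost:8000/health/ready
```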
### Container Orchestration Configuration

#### Kubernetes

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: mcp-hangar
      image: mcp-hangar:latest
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 5
        failureThreshold: 3
      startupProbe:
        httpGet:
          path: /health/startup
          port: 8000
        failureThreshold: 30
        periodSeconds: 2
```
#### Docker Compose

```yaml
services:
  mcp-hangar:
    image: mcp-hangar:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```
### Custom Health Checks

Register application-specific health checks:

```python
from mcp_hangar.observability import HealthCheck, get_health_endpoint

def check_database_connection():
    """Return True if the database is reachable."""
    try:
        db.execute("SELECT 1")
        return True
    except Exception:
        return False

# Register the check
endpoint = get_health_endpoint()
endpoint.register_check(HealthCheck(
    name="database",
    check_fn=check_database_connection,
    timeout_seconds=5.0,
    critical=True,  # False = degraded instead of unhealthy on failure
))

# Mark startup complete when ready
endpoint.mark_startup_complete()
```
### Async Health Checks

```python
import aiohttp

async def check_external_service():
    """Async health check example."""
    async with aiohttp.ClientSession() as session:
        async with session.get("http://external-service/health") as resp:
            return resp.status == 200

endpoint.register_check(HealthCheck(
    name="external_service",
    check_fn=check_external_service,
    timeout_seconds=3.0,
    critical=False,  # Non-critical: failure results in degraded state
))
```
## Alerting

### Alert Rules

Pre-configured alert rules are in `monitoring/prometheus/alerts/`:

#### Critical Alerts (Immediate Response Required)

| Alert | Condition | Description |
|---|---|---|
| `MCPHangarAllProvidersDown` | No ready providers for 1m | Complete service outage |
| `MCPHangarHighErrorRate` | Error rate > 10% for 2m | Significant failures |
| `MCPHangarCircuitBreakerOpen` | Any circuit breaker open | Provider isolation triggered |
| `MCPHangarNotResponding` | Scrape failures for 1m | Service unreachable |
| `MCPHangarStartupFailed` | Repeated startup failures | Provider cannot initialize |
#### Warning Alerts (Investigation Required)

| Alert | Condition | Description |
|---|---|---|
| `MCPHangarProviderDegraded` | Provider degraded for 5m | Provider experiencing issues |
| `MCPHangarHighLatencyP95` | P95 > 5s for 5m | Performance degradation |
| `MCPHangarFrequentColdStarts` | Cold start rate > 0.1/s | Consider increasing idle TTL |
| `MCPHangarDiscoverySourceUnhealthy` | Source unhealthy for 5m | Discovery issues |
| `MCPHangarLowAvailability` | Availability < 80% for 5m | Multiple providers affected |
| `MCPHangarRetryExhaustion` | High retry exhaustion rate | Persistent failures |
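As an illustration of how these conditions map onto Prometheus alerting rules, here is a sketch of what the high-error-rate rule might look like; the rule files shipped in `monitoring/prometheus/alerts/` are authoritative:

```yaml
groups:
  - name: mcp-hangar-example
    rules:
      - alert: MCPHangarHighErrorRate
        # Error rate > 10% for 2m, matching the condition in the table above
        expr: |
          sum(rate(mcp_registry_tool_call_errors_total[5m]))
            / sum(rate(mcp_registry_tool_calls_total[5m])) > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MCP Hangar tool-call error rate above 10%"
```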
### Alertmanager Configuration

Configure notification routing in `monitoring/alertmanager/alertmanager.yaml`:

```yaml
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://your-webhook-endpoint'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<your-service-key>'
        severity: critical
  - name: 'slack'
    slack_configs:
      - api_url: '<your-slack-webhook-url>'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```
## SLIs/SLOs

### Service Level Indicators

| SLI | Metric | Good Event |
|---|---|---|
| Availability | Provider ready state | `mcp_registry_provider_state{state="ready"}` = 1 |
| Latency | Tool invocation duration | Request completes in < 2s |
| Error Rate | Failed invocations | Call does not increment `mcp_registry_tool_call_errors_total` |
### Recommended SLOs
| SLI | Target | Measurement Window |
|---|---|---|
| Availability | 99.9% | 30 days rolling |
| Latency (P95) | < 2s | 5 minute window |
| Error Rate | < 1% | 5 minute window |
### Error Budget Calculation

```promql
# Error budget remaining for a 99.9% SLO (1.0 = full budget, 0.0 = exhausted)
1 - (
  sum(increase(mcp_registry_tool_call_errors_total[30d])) /
  sum(increase(mcp_registry_tool_calls_total[30d]))
) / 0.001
```
### Availability Query

```promql
# Current availability ratio
sum(mcp_registry_provider_state{state="ready"}) /
sum(mcp_registry_provider_state)
```
## Troubleshooting

### Metrics Not Visible

1. Verify the endpoint is accessible (see the checks below).
2. Check Prometheus targets: open http://localhost:9090/targets and verify the MCP Hangar target is `UP`.
3. Review the Prometheus logs.
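A few hedged checks for these steps, assuming the default ports from the Quick Start (the service name in the compose file may differ):

```bash
# 1. The metrics endpoint should respond with Prometheus text format
curl -s http://localhost:8000/metrics | head

# 3. Scrape errors show up in the Prometheus logs
docker compose -f docker-compose.monitoring.yml logs prometheus | tail
```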
### Traces Not Appearing

1. Verify tracing is enabled.
2. Check OTLP endpoint connectivity.
3. Look for initialization errors in the logs.
4. Verify the OpenTelemetry packages are installed.
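Corresponding checks for steps 1, 2, and 4; the exact log messages and package layout depend on your environment:

```bash
# 1. Tracing must not be explicitly disabled
echo "${MCP_TRACING_ENABLED:-unset}"

# 2. The OTLP gRPC port should accept connections
nc -zv localhost 4317

# 4. The OpenTelemetry SDK should be installed
pip list | grep -i opentelemetry
```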
### Health Check Failures

1. Get the detailed health status.
2. Inspect the individual results in the `checks` array of the response.
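For example, assuming the default port and that `jq` is installed:

```bash
# Full readiness payload, including per-check results
curl -s http://localhost:8000/health/ready | jq

# Only the checks that are not healthy
curl -s http://localhost:8000/health/ready | jq '.checks[] | select(.status != "healthy")'
```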
### High Cardinality Warnings

- Review metric label values for unbounded sets
- Avoid user-provided values in labels
- Use label aggregation in queries (see the example below)
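For example, aggregating away the `tool` label keeps the result set bounded by the number of providers:

```promql
# Per-provider call rate, aggregated over all tools
sum by (provider) (rate(mcp_registry_tool_calls_total[5m]))
```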
## Best Practices

### Metrics

- **Use meaningful labels** - Include provider, tool, and outcome
- **Avoid high cardinality** - Don't use request IDs or timestamps as labels
- **Set retention appropriately** - 15 days for metrics, 7 days for traces

### Tracing

- **Initialize early** - Call `init_tracing()` at application startup
- **Use semantic attributes** - Follow OpenTelemetry conventions
- **Propagate context** - Inject/extract for cross-service traces

### Alerting

- **Create runbooks** - Document response procedures for each alert
- **Test alerts regularly** - Verify notification channels work
- **Tune thresholds** - Adjust based on baseline behavior

### Health Checks

- **Keep checks fast** - Use short timeouts (< 5s)
- **Distinguish critical vs non-critical** - Use `critical=False` for degraded states
- **Monitor the monitors** - Alert on Prometheus/Grafana health