Monitoring and Metrics
This guide shows you how to set up comprehensive monitoring and observability for Passage using OpenTelemetry, Prometheus, Grafana, and Sentry.
Overview
Passage provides built-in observability through:
- OpenTelemetry - Metrics and distributed tracing
- Sentry (optional) - Error tracking and reporting
- Structured Logging - JSON-formatted logs with context
OpenTelemetry Integration
Passage natively exports metrics and traces using OpenTelemetry (OTLP over HTTP).
Configuration
Configure OpenTelemetry in your `config.toml`:

```toml
[otel]
environment = "production"
traces_endpoint = "https://otlp-gateway.example.com/v1/traces"
traces_token = "base64_auth_token"
metrics_endpoint = "https://otlp-gateway.example.com/v1/metrics"
metrics_token = "base64_auth_token"
```

Grafana Cloud Setup
- Get Your Endpoints:
  - Navigate to Configuration → Data Sources → OpenTelemetry
  - Copy the OTLP endpoint URLs

- Generate Auth Tokens:

  ```sh
  # Format: instanceID:token
  echo -n "12345:glc_xxxxx" | base64
  ```

- Configure Passage:

  ```toml
  [otel]
  environment = "production"
  traces_endpoint = "https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/traces"
  traces_token = "MTIzNDU6Z2xjX3h4eHh4"
  metrics_endpoint = "https://otlp-gateway-prod-us-central-0.grafana.net/otlp/v1/metrics"
  metrics_token = "MTIzNDU6Z2xjX3h4eHh4"
  ```
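If you want to double-check a token before deploying, the base64 encoding can be reproduced in a few lines of Python (the instance ID and token below are the placeholder values from the example above):

```python
import base64

def grafana_otlp_token(instance_id: str, api_token: str) -> str:
    """Build the base64 auth token in the instanceID:token format shown above."""
    raw = f"{instance_id}:{api_token}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")

# Placeholder values from the example above
print(grafana_otlp_token("12345", "glc_xxxxx"))  # MTIzNDU6Z2xjX3h4eHh4
```

If the output doesn't match the token in your config, the instance ID or token was copied incorrectly.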
Environment Variables
Override with environment variables:

```sh
export PASSAGE_OTEL_ENVIRONMENT=production
export PASSAGE_OTEL_TRACES_ENDPOINT=https://otlp.example.com/v1/traces
export PASSAGE_OTEL_TRACES_TOKEN=base64_token
export PASSAGE_OTEL_METRICS_ENDPOINT=https://otlp.example.com/v1/metrics
export PASSAGE_OTEL_METRICS_TOKEN=base64_token
```

Metrics
Passage exports the following metrics via OpenTelemetry:
Connection Metrics
| Metric | Type | Description |
|---|---|---|
| `passage_connections_total` | Counter | Total connection attempts |
| `passage_connections_active` | Gauge | Currently active connections |
| `passage_connections_failed` | Counter | Failed connection attempts |
| `passage_connections_rate_limited` | Counter | Connections blocked by rate limiter |
Labels:
- `client_ip` - Client IP address
- `server_address` - Server address connected to
- `protocol_version` - Minecraft protocol version
Request Metrics
| Metric | Type | Description |
|---|---|---|
| `passage_requests_total` | Counter | Total requests (status pings + logins) |
| `passage_status_requests_total` | Counter | Server list ping requests |
| `passage_login_requests_total` | Counter | Login/join requests |
| `passage_request_duration_seconds` | Histogram | Request processing time |
Labels:
- `request_type` - `status` or `login`
- `adapter_type` - Adapter used (`fixed`, `http`, `grpc`, etc.)
- `result` - `success` or `failure`
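With these labels in place, per-adapter failure ratios are straightforward to derive. A PromQL sketch (assuming the Prometheus-style metric names listed above):

```promql
sum by (adapter_type) (rate(passage_requests_total{result="failure"}[5m]))
/
sum by (adapter_type) (rate(passage_requests_total[5m]))
```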
Adapter Metrics
| Metric | Type | Description |
|---|---|---|
| `passage_adapter_requests_total` | Counter | Adapter invocations |
| `passage_adapter_errors_total` | Counter | Adapter errors |
| `passage_adapter_duration_seconds` | Histogram | Adapter response time |
Labels:
- `adapter_name` - `status`, `discovery`, or `strategy`
- `adapter_type` - Implementation type (`fixed`, `http`, `grpc`, etc.)
- `error_type` - Error category (if applicable)
Target Metrics
| Metric | Type | Description |
|---|---|---|
| `passage_targets_discovered` | Gauge | Number of discovered backend servers |
| `passage_target_selections_total` | Counter | Target selection operations |
| `passage_target_connections_total` | Counter | Successful target connections |
| `passage_target_connection_failures_total` | Counter | Failed target connections |
Labels:
- `target_identifier` - Target server ID
- `target_address` - Target server address
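Since `passage_target_connections_total` counts only successes, a per-target failure ratio is failures divided by successes plus failures. A PromQL sketch using the metric and label names above:

```promql
sum by (target_identifier) (rate(passage_target_connection_failures_total[5m]))
/
sum by (target_identifier) (
    rate(passage_target_connections_total[5m])
  + rate(passage_target_connection_failures_total[5m])
)
```

A sustained nonzero ratio for a single `target_identifier` usually points at one sick backend rather than a Passage-wide issue.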
System Metrics
| Metric | Type | Description |
|---|---|---|
| `passage_uptime_seconds` | Gauge | Uptime in seconds |
| `passage_version_info` | Gauge | Version information (value always 1) |
Labels:
- `version` - Passage version
Distributed Tracing
Passage exports distributed traces to help debug performance issues and track request flows.
Trace Spans
Each connection generates the following spans:

- `connection` - Overall connection lifecycle
  - Attributes: `client_ip`, `server_address`, `protocol_version`, `username`, `user_id`
- `status_request` - Server list ping (if applicable)
  - Attributes: `adapter_type`
- `discovery` - Target discovery
  - Attributes: `adapter_type`, `targets_found`
- `strategy` - Target selection
  - Attributes: `adapter_type`, `selected_target`
- `target_connection` - Backend server connection
  - Attributes: `target_identifier`, `target_address`
Viewing Traces
In Grafana, navigate to Explore → Traces and search by:

- Service name: `passage`
- Operation: `connection`, `status_request`, `discovery`, etc.
- Attributes: `username`, `client_ip`, `target_identifier`
Prometheus Setup
If you’re using Prometheus instead of Grafana Cloud, set up an OpenTelemetry Collector:
OpenTelemetry Collector Configuration
```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: "jaeger:14250"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
```

Docker Compose Setup
```yaml
version: '3.8'

services:
  passage:
    image: ghcr.io/scrayosnet/passage:latest
    ports:
      - "25565:25565"
    environment:
      - PASSAGE_OTEL_METRICS_ENDPOINT=http://otel-collector:4318/v1/metrics
      - PASSAGE_OTEL_TRACES_ENDPOINT=http://otel-collector:4318/v1/traces
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4318:4318"
      - "8889:8889"

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
```

Prometheus Configuration
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'passage'
    static_configs:
      - targets: ['otel-collector:8889']
```

Grafana Dashboards
Section titled “Grafana Dashboards”Creating a Dashboard
- Navigate to Grafana
- Create → Dashboard
- Add Panel
Example Queries
Connection Rate

```promql
rate(passage_connections_total[5m])
```

Active Connections

```promql
passage_connections_active
```

Request Latency (p95)

```promql
histogram_quantile(0.95, rate(passage_request_duration_seconds_bucket[5m]))
```

Error Rate

```promql
rate(passage_connections_failed[5m])
```

Adapter Performance

```promql
histogram_quantile(0.99, rate(passage_adapter_duration_seconds_bucket[5m]))
```

Target Distribution

```promql
sum(passage_target_selections_total) by (target_identifier)
```

Pre-Built Dashboard
A pre-built Grafana dashboard is available in the Passage repository:
```sh
# Import from file
curl -o passage-dashboard.json \
  https://raw.githubusercontent.com/scrayosnet/passage/main/docs/grafana-dashboard.json

# Import in Grafana:
# Dashboard → Import → Upload JSON file
```

Logging
Passage uses structured logging with configurable log levels.
Log Levels
Set via the `RUST_LOG` environment variable:
```sh
# Error only
RUST_LOG=error passage

# Info (default)
RUST_LOG=info passage

# Debug
RUST_LOG=debug passage

# Trace (very verbose)
RUST_LOG=trace passage

# Per-module levels
RUST_LOG=passage=debug,passage::adapter=trace passage
```

Log Format
Logs are output in JSON format:
```json
{
  "timestamp": "2024-02-05T10:30:45.123Z",
  "level": "INFO",
  "target": "passage::connection",
  "message": "Connection established",
  "client_ip": "192.168.1.100",
  "username": "Steve",
  "user_id": "069a79f4-44e9-4726-a5be-fca90e38aaf5",
  "target": "hub-1"
}
```

Centralized Logging
Section titled “Centralized Logging”Loki (Grafana)
Section titled “Loki (Grafana)”services: passage: image: ghcr.io/scrayosnet/passage:latest logging: driver: loki options: loki-url: "http://loki:3100/loki/api/v1/push" loki-batch-size: "400"
loki: image: grafana/loki:latest ports: - "3100:3100" volumes: - ./loki-config.yaml:/etc/loki/local-config.yamlELK Stack
Section titled “ELK Stack”services: passage: image: ghcr.io/scrayosnet/passage:latest logging: driver: fluentd options: fluentd-address: localhost:24224 tag: passage
fluentd: image: fluent/fluentd:latest ports: - "24224:24224" volumes: - ./fluent.conf:/fluentd/etc/fluent.confSentry Error Tracking
Sentry provides real-time error tracking and alerting.
Configuration
```toml
[sentry]
enabled = true
debug = false
environment = "production"
```

Environment Variables
```sh
export PASSAGE_SENTRY_ENABLED=true
export PASSAGE_SENTRY_ADDRESS=https://your-key@sentry.io/project-id
export PASSAGE_SENTRY_ENVIRONMENT=production
```

What Gets Reported
Sentry captures:
- Panic/crash events
- Adapter errors
- Connection failures
- Configuration errors
Each event includes:
- Stack traces
- Request context (username, IP, target)
- Environment information
- Custom tags and metadata
Viewing Errors
In Sentry:

- Navigate to Issues
- Filter by environment (`production`)
- View stack traces and context
- Set up alerts for new/recurring errors
Health Checks
Passage doesn’t expose a dedicated health check endpoint, but you can monitor health by:
Connection Test

```sh
# Test if Passage is accepting connections
nc -zv localhost 25565
```

Minecraft Status Check

```sh
# Using mcstatus tool
pip install mcstatus
mcstatus localhost:25565 status
```

Kubernetes Liveness Probe

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: passage
spec:
  containers:
    - name: passage
      image: ghcr.io/scrayosnet/passage:latest
      livenessProbe:
        tcpSocket:
          port: 25565
        initialDelaySeconds: 5
        periodSeconds: 10
      readinessProbe:
        tcpSocket:
          port: 25565
        initialDelaySeconds: 2
        periodSeconds: 5
```

Alerting
Section titled “Alerting”Grafana Alerts
Create alerts in Grafana for:
High Error Rate
```promql
rate(passage_connections_failed[5m]) > 0.1
```

High Latency

```promql
histogram_quantile(0.95, rate(passage_request_duration_seconds_bucket[5m])) > 0.5
```

No Active Connections (possible crash)

```promql
passage_connections_active == 0
```

Adapter Errors

```promql
rate(passage_adapter_errors_total[5m]) > 0.05
```

Prometheus Alertmanager
```yaml
groups:
  - name: passage
    interval: 30s
    rules:
      - alert: PassageHighErrorRate
        expr: rate(passage_connections_failed[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: PassageHighLatency
        expr: histogram_quantile(0.95, rate(passage_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}s"

      - alert: PassageDown
        expr: up{job="passage"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Passage is down"
          description: "Passage instance has been down for more than 1 minute"
```

Performance Tuning
Section titled “Performance Tuning”Identify Bottlenecks
Use metrics to identify performance issues:

- High `passage_adapter_duration_seconds` → Optimize adapters
- High `passage_request_duration_seconds` → Check adapter performance
- High `passage_target_connection_failures_total` → Backend server issues
- High `passage_connections_rate_limited` → Adjust rate limiter settings
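The share of connections hitting the rate limiter can be tracked directly before deciding to loosen the limits. A PromQL sketch from the connection counters above:

```promql
sum(rate(passage_connections_rate_limited[5m]))
/
sum(rate(passage_connections_total[5m]))
```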
Adapter Optimization
- Keep adapter response times under 50ms
- Use caching where appropriate
- Implement connection pooling
- Monitor adapter-specific metrics
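Passage doesn’t prescribe how adapters cache. Purely as an illustration, a minimal TTL cache like the following (a hypothetical helper, not part of Passage) keeps a slow lookup out of the hot path:

```python
import time

class TTLCache:
    """Tiny TTL cache: returns a cached value until it expires."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the slow lookup
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value

# Example: cache a (pretend) discovery lookup for 5 seconds
cache = TTLCache(ttl_seconds=5.0)
calls = []
targets = cache.get_or_compute("discovery", lambda: calls.append(1) or ["hub-1", "hub-2"])
targets = cache.get_or_compute("discovery", lambda: calls.append(1) or ["hub-1", "hub-2"])
print(len(calls))  # 1 - the second call was served from cache
```

The trade-off is staleness: a 5-second TTL means a drained backend can still be selected for up to 5 seconds, so pick a TTL shorter than your tolerance for stale targets.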
Best Practices
Section titled “Best Practices”Monitoring
- Set up dashboards for key metrics
- Configure alerts for critical issues
- Monitor both Passage and adapters
- Track long-term trends
Logging
- Use structured logging (JSON)
- Set appropriate log levels
- Aggregate logs centrally
- Include context (username, IP, target)
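As a small illustration of working with the JSON log format shown earlier, lines can be parsed and filtered with a few lines of Python (field names match the example log entry; the helper functions themselves are hypothetical):

```python
import json

def parse_log_line(line: str) -> dict:
    """Parse one JSON-formatted Passage log line into a dict."""
    return json.loads(line)

def is_error(entry: dict) -> bool:
    """True for WARN/ERROR entries, the ones usually worth alerting on."""
    return entry.get("level") in ("WARN", "ERROR")

# Field names taken from the example entry in "Log Format"
line = '{"timestamp": "2024-02-05T10:30:45.123Z", "level": "INFO", "message": "Connection established", "username": "Steve"}'
entry = parse_log_line(line)
print(entry["username"], is_error(entry))  # Steve False
```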
Observability
- Enable OpenTelemetry in production
- Use distributed tracing for debugging
- Monitor adapter performance
- Track error rates and latency
Security
- Protect metrics endpoints
- Secure OTLP credentials
- Monitor for unusual patterns
- Set up security alerts
Troubleshooting
Section titled “Troubleshooting”No Metrics Appearing
- Check OTLP endpoints:

  ```sh
  curl -v $PASSAGE_OTEL_METRICS_ENDPOINT
  ```

- Verify authentication:

  ```sh
  echo $PASSAGE_OTEL_METRICS_TOKEN | base64 -d
  ```

- Check logs:

  ```sh
  RUST_LOG=debug passage
  ```
High Latency
- Check adapter metrics: `passage_adapter_duration_seconds`
- View traces in Grafana
- Optimize slow adapters
- Consider caching
Missing Traces
Section titled “Missing Traces”- Ensure traces endpoint is configured
- Check sample rate (default: 100%)
- Verify network connectivity
- Check OpenTelemetry Collector logs