Health and Observability
Observe mBedLM-core through layered signals: serving health, route behavior, contract quality, and workflow outcomes.
Observability Overview
Health is not a single metric. The platform is monitored as a set of linked indicators so operators can identify whether regressions come from serving, routing policy, contract shape, or domain adapters.
Signal Flow
graph TD
  A[Serving signals] --> B[Routing signals]
  B --> C[Contract signals]
  C --> D[Workflow signals]
  D --> E[Alerting and rollback]
Primary Signal Groups
- Serving Signals: endpoint readiness, latency, timeout rate, and backend reachability.
- Routing Signals: route decision distribution, fallback frequency, and escalation trends.
- Contract Signals: schema validation pass rate and malformed response incidents.
- Workflow Signals: plan/execute success rate, step failures, and completion latency.
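As a rough illustration of the routing signals, the sketch below derives a route decision distribution and a fallback frequency from a list of route labels. The function names and the `"fallback"` label are assumptions made for this example, not part of mBedLM-core.

```python
from collections import Counter
from typing import Dict, Iterable


def route_distribution(decisions: Iterable[str]) -> Dict[str, float]:
    """Return the share of requests sent to each route (empty dict if no data)."""
    counts = Counter(decisions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {route: n / total for route, n in counts.items()}


def fallback_frequency(decisions: Iterable[str], fallback_label: str = "fallback") -> float:
    """Return the fraction of decisions that used the (hypothetical) fallback route."""
    return route_distribution(decisions).get(fallback_label, 0.0)


# Hypothetical sample of route decisions collected over one window.
sample = ["primary", "primary", "fallback", "primary"]
distribution = route_distribution(sample)
fallback_share = fallback_frequency(sample)
```

Comparing the distribution across windows is one simple way to spot the route drift and escalation trends mentioned above.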
Logging and Trace Model
- Correlate request/session identifiers across ingress, orchestration, and adapter boundaries.
- Emit structured lifecycle events for preflight, route, inference, and normalization stages.
- Preserve enough metadata for replay and regression investigation without leaking sensitive payloads.
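The logging points above can be sketched as a small event builder. This is a minimal sketch, not the platform's actual event schema: the field names, the `SAFE_KEYS` allow-list, and both helper functions are assumptions chosen for illustration. Only the four stage names come from the list above.

```python
import json
import logging
import time
from typing import Any, Dict

# Stage names taken from the lifecycle list above.
STAGES = ("preflight", "route", "inference", "normalization")

# Hypothetical allow-list: only these metadata keys are logged, so raw
# payload fields (prompts, responses) never reach the log stream.
SAFE_KEYS = {"route", "model", "latency_ms", "status"}


def build_event(request_id: str, session_id: str, stage: str, **metadata: Any) -> Dict[str, Any]:
    """Build one structured lifecycle event carrying the correlation identifiers."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return {
        "ts": time.time(),
        "request_id": request_id,
        "session_id": session_id,
        "stage": stage,
        "metadata": {k: v for k, v in metadata.items() if k in SAFE_KEYS},
    }


def emit(logger: logging.Logger, event: Dict[str, Any]) -> str:
    """Serialize the event as one JSON line, log it, and return the line."""
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line
```

Because every event carries the same `request_id` and `session_id`, events from ingress, orchestration, and adapters can be joined later for replay. The allow-list drops any key it does not recognize, such as a hypothetical `prompt` field.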
Operator Checks
python -m cli.main system status
python -m cli.main workflow logs
python -m cli.main workflow status
Alert Strategy
- Page on sustained growth in 5xx or timeout rates in production trust tier paths.
- Warn on elevated fallback rates and route drift before they lead to user-visible failures.
- Block promotion when contract integrity drops below accepted thresholds.
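The three alert rules above could be evaluated along the following lines. This is a sketch under stated assumptions: the threshold values are placeholders (real values are environment-private), "sustained" is simplified to "above threshold for N consecutive windows", route drift is not modeled, and all names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class Action(Enum):
    PAGE = "page"
    WARN = "warn"
    BLOCK_PROMOTION = "block_promotion"


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values for illustration only.
    error_rate_page: float = 0.05
    sustained_windows: int = 3
    fallback_rate_warn: float = 0.20
    contract_pass_min: float = 0.99


def evaluate(
    error_rates: Sequence[float],
    fallback_rate: float,
    contract_pass_rate: float,
    t: Thresholds = Thresholds(),
) -> List[Action]:
    """Map current signal values to alert actions.

    error_rates holds the combined 5xx/timeout rate per window, oldest first.
    """
    actions: List[Action] = []
    recent = error_rates[-t.sustained_windows:]
    # Page only when every one of the last N windows exceeds the threshold.
    if len(recent) == t.sustained_windows and all(r > t.error_rate_page for r in recent):
        actions.append(Action.PAGE)
    if fallback_rate > t.fallback_rate_warn:
        actions.append(Action.WARN)
    if contract_pass_rate < t.contract_pass_min:
        actions.append(Action.BLOCK_PROMOTION)
    return actions
```

Requiring several consecutive windows before paging keeps a single noisy window from waking an operator, while the warn and block rules react to the latest value.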
Curated Source Synthesis
- Health ownership is mapped per subsystem with explicit SLO and rollback triggers.
- Readiness checks include synthetic path validation, not just process liveness.
- Observability signals are tied to promotion gates and operational decision points.
- Quality guardrails and response normalization are monitored as first-class runtime health signals.
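To illustrate the difference between process liveness and synthetic path validation, the sketch below treats a service as ready only when a synthetic request also returns a well-formed response. The response shape (`status` and `output` keys) and both function names are assumptions made for this example.

```python
from typing import Any, Callable, Mapping

# Hypothetical minimal contract for a synthetic response.
REQUIRED_KEYS = {"status", "output"}


def is_valid_response(response: Mapping[str, Any]) -> bool:
    """Check the assumed contract: required keys present and status is 'ok'."""
    return REQUIRED_KEYS.issubset(response.keys()) and response.get("status") == "ok"


def is_ready(
    process_alive: Callable[[], bool],
    synthetic_probe: Callable[[], Mapping[str, Any]],
) -> bool:
    """Ready means: the process is up AND a synthetic request succeeds end to end."""
    if not process_alive():
        return False
    try:
        response = synthetic_probe()
    except Exception:
        # Any failure along the synthetic path counts as not ready.
        return False
    return is_valid_response(response)
```

A caller would pass in its own liveness check and a probe that sends a known-safe request through the full serving path. A live process whose probe fails or returns a malformed response is reported as not ready.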
What Is Intentionally Not Included
- Private alert routing channels, pager schedules, and escalation rosters.
- Internal dashboard URLs and tenant-specific instrumentation settings.
- Environment-private threshold values and incident postmortem references.