Your LLM traces are write-only

You spent weeks building observability for your LLM app. Traces in Jaeger. Metrics in Grafana. Alerts in Slack. You can see exactly what your model says, how long it takes, and how much it costs.

Then you change the prompt. Did the model get better? Worse? For which inputs? You have no idea — because your traces are write-only. You observe but never evaluate. Your production data sits in Jaeger and never becomes a test.

We built the bridge from traces to tests. Then we ran it on our own traces and discovered half our spans had no content — because recordContent was off by default. The tool designed to extract test data couldn't extract anything. Fixed that. Here's the workflow.

The loop nobody closes

Every LLM team has some version of this:

1. Deploy prompt v2
2. Watch dashboards for a few hours
3. "Looks fine, latency is similar, no errors"
4. Move on

"Looks fine" is not evaluation. You're checking system health — latency, errors, cost — but not output quality. Your model could be ret