Distributed Tracing in ML Pipelines: From Preprocessing to Inference

Source: DEV Community
How OpenTelemetry exposes the bottlenecks your metrics will never see

Samuel Desseaux · Erythix

1. The Lie of the Green Dashboard

It is 2 PM on a Tuesday. Your team receives a user report: predictions have been slow since this morning. You open Grafana: CPU at 38%, GPU at 72%, HTTP error rate at 0.2%, p99 latency at 1.4s. Nothing breaches a configured threshold. You tell the user everything looks nominal. Two hours later comes a second report. Then a third. The problem exists; your tools cannot see it.

This scenario is not hypothetical. It is the daily reality of most teams operating ML pipelines in production without distributed tracing. Classic metrics measure the state of a service at a given moment. They do not measure the life of a request as it travels through multiple services. These are two fundamentally different levels of observation, and conflating them is a systematic source of operational blind spots.

The distinction matters more in ML pipelines than anywhere else in software engineering
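To make the metrics-versus-traces distinction concrete, here is a toy sketch (deliberately not the OpenTelemetry API; the `ToyTracer` and `handle_request` names and the simulated stage durations are invented for illustration) of what a trace records that a service-level metric cannot: one tree of timed spans per request, so latency can be attributed to individual pipeline stages instead of being collapsed into a single number.

```python
# Toy illustration of per-request spans, NOT the OpenTelemetry API.
# A trace is a tree of spans; each span times one stage of one request.
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: List["Span"] = field(default_factory=list)

    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class ToyTracer:
    """Records one tree of spans per request."""

    def __init__(self) -> None:
        self.stack: List[Span] = []
        self.root: Optional[Span] = None

    def start(self, name: str) -> Span:
        span = Span(name=name, start=time.perf_counter())
        if self.stack:
            self.stack[-1].children.append(span)  # nest under current span
        else:
            self.root = span                      # first span is the root
        self.stack.append(span)
        return span

    def finish(self) -> None:
        span = self.stack.pop()
        span.end = time.perf_counter()


def handle_request(tracer: ToyTracer) -> None:
    tracer.start("predict")        # root span: the whole request
    tracer.start("preprocess")
    time.sleep(0.002)              # simulated tokenization / scaling
    tracer.finish()
    tracer.start("feature_lookup")
    time.sleep(0.010)              # simulated feature-store call
    tracer.finish()
    tracer.start("inference")
    time.sleep(0.003)              # simulated model forward pass
    tracer.finish()
    tracer.finish()


tracer = ToyTracer()
handle_request(tracer)
# The trace attributes this request's latency to individual stages --
# exactly what a single aggregate "request latency" metric hides.
for child in tracer.root.children:
    print(f"{child.name}: {child.duration_ms():.1f} ms")
```

In a real pipeline the same shape comes from OpenTelemetry's `tracer.start_as_current_span(...)` context managers, with spans exported to a backend instead of printed; the point here is only the data model: per-request, per-stage timings rather than per-service aggregates.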