I Couldn’t Debug My AI/ML GPU Incident - So I Built My Own Tool
Several weeks ago, I encountered some problems with ML jobs running on my GPU server. I received alerts triggered at midnight, and one of the jobs failed due to GPU memory usage. The next morning, ...

Source: DEV Community
Several weeks ago, I encountered some problems with ML jobs running on my GPU server. I received alerts triggered at midnight, and one of the jobs failed due to GPU memory usage. The next morning, I performed a root-cause analysis to understand what had happened the night before. However, I couldn’t identify the issue because I only had access to overall GPU usage metrics at current time. I used nvidia-smi and nvtop to inspect the current state, but there was no trace about the issue we got from last night. Therefore, I realized I needed a solution to prevent similar problems from happening in the future. I tried using DCGM exporter to expose GPU metrics, but it couldn’t provide PID-level metrics. I also tested it in a Kubernetes environment to get pod-level metrics, but it didn’t work because our GPUs only support time-slicing mode. Therefore, I developed an open-source tool called gpuxray to monitor GPUs at the process level. This tool has helped our team significantly when observing