The $800 Ghost in the Machine: Hunting Down Wasted Preemptible VMs

How runaway preemptible VMs were silently draining our GCP budget – and how we finally caught them.

Our ML team loved GCP's preemptible VMs. Cheap compute for training? Sign us up! But then our costs, even with preemptibles, began creeping upward. Not dramatically, just enough to raise an eyebrow and trigger that nagging feeling that something was off.

It wasn’t a sudden spike, more like a slow leak – the kind that's hard to detect until you're standing ankle-deep in the server room wondering where all the water came from.

The Preemptible Paradox: Cheaper Isn't Always Better

Preemptible VMs are fantastic in theory: steeply discounted compute, in exchange for Google's right to reclaim the instance at any time. But like any powerful tool, they require careful handling. We thought we were being diligent. We had scripts that automatically spun down instances after training jobs finished. Or so we thought.

Our initial investigations focused on optimizing our training scripts, tweaking parameters, and even exploring different instance types. We were looking for code inefficiencies, assuming our infrastructure was behaving as expected. We were wrong.

The Idle VM Graveyard

The 'aha!' moment arrived courtesy of EazyOps' real-time monitoring. We started seeing a pattern: preemptible VMs would spin up for training, the job would complete, but the instances wouldn't terminate. They became ghost VMs, idling in the machine, racking up charges for hours, sometimes even days.
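
In hindsight, even a crude audit would have surfaced these much earlier. Here's a minimal sketch of the idea, assuming the google-cloud-compute client, a hypothetical project ID, and a guess at our longest plausible training run: flag any preemptible instance that has been RUNNING longer than a training job ever should.

```python
# Minimal ghost-VM audit sketch. PROJECT and MAX_HOURS are
# hypothetical placeholders; adjust for your own workloads.
from datetime import datetime, timedelta, timezone

from google.cloud import compute_v1

PROJECT = "my-ml-project"   # hypothetical project ID
MAX_HOURS = 6               # longest we'd expect a training job to run

def find_ghost_vms(project: str, max_age: timedelta) -> list[str]:
    """Return preemptible instances that have been RUNNING longer than max_age."""
    client = compute_v1.InstancesClient()
    now = datetime.now(timezone.utc)
    ghosts = []
    # aggregated_list walks every zone in the project.
    for zone, scoped in client.aggregated_list(project=project):
        for inst in scoped.instances or []:
            if inst.status != "RUNNING" or not inst.scheduling.preemptible:
                continue
            created = datetime.fromisoformat(inst.creation_timestamp)
            if now - created > max_age:
                ghosts.append(f"{zone}/{inst.name}")
    return ghosts

if __name__ == "__main__":
    for ghost in find_ghost_vms(PROJECT, timedelta(hours=MAX_HOURS)):
        print("possible ghost VM:", ghost)
```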

The problem wasn’t with the preemptible VMs themselves; it was with our orchestration. Our scripts had a subtle but critical flaw – a race condition that prevented them from reliably terminating instances after job completion. These weren't being preempted by Google; they were being abandoned by us.
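
To make the failure mode concrete, here's a simplified reconstruction of the pattern that bit us. The function names are hypothetical stand-ins and our real scripts were more involved, but the one-shot status check is the essence of the bug:

```python
# Simplified reconstruction of the flawed cleanup pattern.
import time

def get_job_status(job_id: str) -> str:
    """Stub for our job tracker. The catch: it could still report
    RUNNING for a short window after the job actually finished."""
    ...

def delete_instance(instance_name: str) -> None:
    """Stub for the GCE instance-delete call."""
    ...

def cleanup_after_job(job_id: str, instance_name: str) -> None:
    # BUG: a single check, fired once when the job was expected to
    # finish. Lose the race against the status write and the VM is
    # never deleted -- no retry, no alert.
    if get_job_status(job_id) == "DONE":
        delete_instance(instance_name)

def cleanup_after_job_fixed(job_id: str, instance_name: str,
                            poll_seconds: int = 60) -> None:
    # The fix is boring: keep polling until the terminal state is
    # visible, then delete. Cleanup is retried until confirmed.
    while get_job_status(job_id) != "DONE":
        time.sleep(poll_seconds)
    delete_instance(instance_name)
```

The lesson baked into the fix: treat teardown as something to retry until it's confirmed, not a fire-and-forget step.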

EazyOps to the Rescue: Taming the Ghosts

We implemented EazyOps' AI-based policies to automatically tag our ML workloads and enforce termination after a defined period of inactivity. Within minutes, EazyOps identified and terminated several idle instances, immediately stemming the $800/month leak.
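
EazyOps' policies are proprietary, so this isn't their implementation, but conceptually a "terminate after too long idle" rule reduces to something like the sketch below, built on GCP's public Monitoring and Compute APIs. The project ID, the 5% CPU threshold, and the one-hour window are assumptions for illustration:

```python
# Sketch of an idle-termination policy: terminate an instance whose
# mean CPU utilization stayed below a threshold for a full window.
import time

from google.cloud import compute_v1, monitoring_v3

PROJECT = "my-ml-project"       # hypothetical
IDLE_CPU_THRESHOLD = 0.05       # mean CPU below 5% counts as idle...
IDLE_WINDOW_SECONDS = 3600      # ...sustained over the last hour

def mean_cpu_utilization(project: str, instance_name: str) -> float:
    """Average CPU utilization for an instance over the idle window."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now},
         "start_time": {"seconds": now - IDLE_WINDOW_SECONDS}}
    )
    series = client.list_time_series(
        request={
            "name": f"projects/{project}",
            "filter": (
                'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                f'AND metric.labels.instance_name="{instance_name}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    points = [p.value.double_value for ts in series for p in ts.points]
    return sum(points) / len(points) if points else 0.0

def terminate_if_idle(project: str, zone: str, instance_name: str) -> None:
    """Delete the instance if it idled through the whole window."""
    if mean_cpu_utilization(project, instance_name) < IDLE_CPU_THRESHOLD:
        compute_v1.InstancesClient().delete(
            project=project, zone=zone, instance=instance_name
        )
        print(f"terminated idle instance {zone}/{instance_name}")
```

The value of a managed policy is everything around this core: discovering instances via tags, tuning thresholds per workload, and running the check continuously instead of on someone's laptop.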

Results: From Ghost Hunting to Gold Mining

The results were immediate: our GCP compute bill dropped about 15%, roughly $800 in savings per month. Beyond the savings, we gained something even more valuable: peace of mind. We knew that EazyOps was constantly vigilant, ensuring our resources were used efficiently and that no more ghost VMs were haunting our infrastructure.

Lessons Learned: The Importance of Vigilance

  • Automation is essential, but verification is critical: Assume nothing. Regularly audit your automated processes to ensure they are working as intended.
  • Real-time monitoring is invaluable: Traditional cost analysis tools are often reactive. Real-time visibility allows you to identify and address issues before they become major problems.
  • Tagging and labeling are your friends: Comprehensive tagging allows for granular control and analysis, enabling you to quickly pinpoint the source of cost anomalies (see the sketch below).
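
As an illustration of that last point, here is a small sketch of stamping a training instance with cost-attribution labels through the Compute API. The label keys and the job-id convention are our hypothetical choices, not a GCP standard:

```python
# Attach cost-attribution labels to a running instance. Note that GCP
# label keys/values allow only lowercase letters, digits, hyphens,
# and underscores.
from google.cloud import compute_v1

def label_training_instance(project: str, zone: str, name: str,
                            job_id: str) -> None:
    client = compute_v1.InstancesClient()
    # Read the current labels; set_labels requires the fingerprint
    # from the live instance to guard against concurrent updates.
    instance = client.get(project=project, zone=zone, instance=name)
    labels = dict(instance.labels)
    labels.update({"team": "ml", "workload": "training", "job-id": job_id})
    client.set_labels(
        project=project, zone=zone, instance=name,
        instances_set_labels_request_resource=compute_v1.InstancesSetLabelsRequest(
            labels=labels,
            label_fingerprint=instance.label_fingerprint,
        ),
    )
```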

What's Next: Proactive Cost Management

Moving forward, we’re exploring more advanced features of EazyOps, like predictive cost modeling and automated resource optimization. Our goal is to move from reactive cost management to a proactive approach, where we can anticipate and prevent waste before it occurs.

About Shujat

Shujat is a Senior Backend Engineer at EazyOps, working at the intersection of performance engineering, cloud cost optimization, and AI infrastructure. He writes to share practical strategies for building efficient, intelligent systems.