Postmortem: The Popcorn Panic

OJ - Aug 19 - Dev Community

Issue Summary

Duration:

The “Popcorn Panic” outage of May 6, 2024 ran from 7:15 PM to 8:35 PM (GMT+1) on a bustling night. For a nerve-wracking 80 minutes, our servers went dark, much like a cinema screen before the show starts.

Impact:

During peak movie hours, 85% of customers eager to grab their movie tickets, popcorn, and snacks found themselves stuck in line, facing a blank screen instead of a blockbuster trailer. With no way to process sales at the tills, the crowd grew restless, popcorn supplies remained intact, and the smell of fresh nachos became a bittersweet reminder of what could have been.

Root Cause:

The root cause of this cinematic disaster was traced back to an unexpected power surge, triggered by a massive popcorn machine overload. The machines had been working overtime to meet demand, but unfortunately, they sent a surge down the power line that took our servers offline.

Timeline

  • 7:15 PM - Issue Detected: Tills went unresponsive just as the line for the 7:30 PM show started to swell. Customers tapped impatiently on their phones, hoping the problem was temporary.
  • 7:17 PM - Initial Investigation: Suspected a network glitch. Rebooted the tills, but they remained as unresponsive as a phone with no signal.
  • 7:25 PM - Customer Complaints: “I just want my popcorn!” echoed through the lobby as the smell of freshly popped kernels turned from enticing to infuriating.
  • 7:30 PM - Misleading Investigation: The team suspected a network outage and reset all the routers, but the tills remained silent, mocking our efforts.
  • 7:45 PM - Incident Escalation: IT was alerted. They sprinted to the server room, fueled by a combination of adrenaline and fear of facing an angry crowd.
  • 8:00 PM - Root Cause Discovery: IT discovered that a power surge had tripped the server’s breaker, caused by the overloaded popcorn machines working at full throttle.
  • 8:15 PM - Resolution: The breaker was reset, and power was redistributed so the overzealous popcorn machines no longer overloaded the servers' supply. Tills started rebooting, and the lines slowly started moving again.
  • 8:35 PM - Service Restoration: Ticket sales, popcorn, and snack purchases resumed, and the movie-going experience was restored to its buttery, salty glory.

Root Cause and Resolution

The root cause of this outage was an overworked popcorn machine that sent an unexpected power surge through the system, tripping the server’s breaker. The surge took down our tills, preventing us from processing sales and leaving moviegoers hungry and ticketless.

To resolve the issue, the server’s breaker was reset, and power was redistributed to ensure the popcorn machines didn't overpower the system again. While the server took its time rebooting, the team also drew up measures to prevent future snack-induced outages.

Corrective and Preventative Measures

Improvements and Fixes:

  • Upgrade the power distribution system to handle the load from all snack equipment, ensuring that no popcorn machine holds our servers hostage again.
  • Introduce a surge protector specifically for snack-related machinery.
  • Conduct regular power load testing during peak snack times to prevent future surges.
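
As a rough illustration of what a power load check might compare, here is a minimal sketch in Python. Every number in it is an assumption made for the example: the wattages, the 230 V supply, the 20 A breaker, and the 80% headroom rule of thumb are hypothetical and were not measured during this incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Appliance:
    name: str
    watts: float  # hypothetical nameplate rating, not a measured value


def circuit_load_amps(appliances, volts):
    """Total current drawn by all appliances on one circuit, in amps."""
    return sum(a.watts for a in appliances) / volts


def is_overloaded(appliances, volts, breaker_amps, headroom=0.8):
    """True if the load exceeds the chosen fraction of the breaker rating.

    The 0.8 headroom is an assumed rule of thumb, not a site requirement.
    """
    return circuit_load_amps(appliances, volts) > breaker_amps * headroom


# Hypothetical layout: popcorn machines sharing a circuit with the servers.
shared = [
    Appliance("popcorn machine A", 1800),
    Appliance("popcorn machine B", 1800),
    Appliance("server rack", 900),
]
# Hypothetical layout after the servers get their own circuit.
dedicated = [Appliance("server rack", 900)]

print(is_overloaded(shared, volts=230, breaker_amps=20))
print(is_overloaded(dedicated, volts=230, breaker_amps=20))
```

With these made-up figures, the shared circuit lands above the headroom limit and the dedicated one sits well below it, which is the kind of comparison a load test during peak snack times would make with real measurements.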

Tasks:

  1. Install dedicated power circuits for high-demand machines like popcorn makers and fryers.
  2. Add surge protectors to the server room to shield the system from future culinary catastrophes.
  3. Implement monitoring to detect unusual power spikes and reroute energy as needed.
  4. Train staff to recognize early signs of overworked equipment and take preventative action.
  5. Plan for a backup manual ticketing system that can be used during outages.
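
The monitoring in task 3 could start as something as simple as flagging readings that jump well above a recent baseline. The sketch below assumes current readings (in amps) arrive as a plain list. The function name `detect_spikes`, the window size, and the 1.5x threshold are all hypothetical choices for illustration. It only detects spikes; rerouting power would be a separate step.

```python
from collections import deque
from statistics import mean


def detect_spikes(readings, window=5, ratio=1.5):
    """Return indices of readings above `ratio` times the recent baseline.

    The baseline is the mean of the last `window` normal readings.
    Nothing is flagged until the window has filled up.
    """
    recent = deque(maxlen=window)
    spikes = []
    for i, amps in enumerate(readings):
        if len(recent) == window and amps > ratio * mean(recent):
            spikes.append(i)
        else:
            # Only normal readings update the baseline, so one spike
            # does not raise the threshold for the next reading.
            recent.append(amps)
    return spikes


# Made-up sample data: a steady draw with one sudden jump.
sample = [8.0, 8.2, 7.9, 8.1, 8.0, 8.3, 19.5, 8.1]
print(detect_spikes(sample))  # prints [6]
```

A real deployment would read from a metering device and alert staff rather than print, but the same comparison against a rolling baseline applies.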

Final Note: While the smell of popcorn may be irresistible, it turns out the machines that create it can wreak havoc if left unchecked. With these measures, we’re ensuring that no buttery treat will ever hold our servers hostage again.
