Even in the Cloud
A recent post of mine talked about how losing focus on delivering features can sink a development company and waste a lot of money. Wes took the time to reply in detail, and his points were too good not to share at length in a follow-up.
In that piece I described how GoCo wasted money by failing to focus on delivering functionality. Wes offered a different version (I have cut the text slightly; read the original here):
GoCo has an awesome TODO application. Their initial features included:
- Adding ToDo's
- Marking them complete
At some point, someone decided that it's important to spend time (and thus money) talking to customers, and getting their perspective on what they really want. And they discovered a slew of new features that could be added to the application:
- Deleting ToDo's without marking them complete
- Sharing ToDo's
- Keeping a ToDo history
1000's of customers later, the application went from a snappy 50ms response time when viewing the website to 4 seconds to load! Everyone was so focused on writing features that they didn't consider everything else that is involved in how computers work. In the months following, they encountered the following issues, all of which could have been solved by engineers focused not on delivering new features, but on the "operational health" of the application:
- They ran out of disk space and caused an outage
- The MySQL server storing all the ToDo's ran into the PrimaryKey Integer limit and crashed, causing a huge outage as someone struggled to migrate the tables to use a new primary key of Int64 (or BigInt): m.signalvnoise.com/update-on-basec...
- After completion of the autocomplete feature, the entire application ground to a halt. Why? Every key press would send a packet to the server, asking the server for options that could be used to complete what the user was writing. The network and CPU load on the server caused huge latency issues.
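A quick aside on that last item. One common way to ease the autocomplete load is debouncing: wait until the user pauses before asking the server for suggestions. Wes's reply doesn't say how GoCo would fix it, so this is only a minimal sketch of the idea in Python, simulated over a list of keypress timestamps rather than real browser events. The function name, the 300 ms wait, and the sample data are all assumptions made up for illustration.

```python
from typing import List, Tuple

# (time in ms, text in the box after the keypress); a hypothetical trace format
Keypress = Tuple[int, str]


def debounced_requests(keypresses: List[Keypress], wait_ms: int) -> List[Keypress]:
    """Return the (fire_time_ms, text) requests a trailing debounce would send.

    Assumes keypresses are sorted by time. A request is sent only when
    at least wait_ms pass with no further keypress.
    """
    requests: List[Keypress] = []
    for i, (t, text) in enumerate(keypresses):
        is_last = i == len(keypresses) - 1
        # Send only if the user then stayed quiet for the whole wait window.
        if is_last or keypresses[i + 1][0] - t >= wait_ms:
            requests.append((t + wait_ms, text))
    return requests


# Made-up trace: typing "milk" quickly, pausing, then adding " and".
presses = [
    (0, "m"), (100, "mi"), (200, "mil"), (300, "milk"),
    (1000, "milk "), (1100, "milk a"), (1200, "milk an"), (1300, "milk and"),
]
sent = debounced_requests(presses, wait_ms=300)

print(f"{len(presses)} keypresses -> {len(sent)} requests")
for fire_time, text in sent:
    print(fire_time, repr(text))
```

In this made-up trace, eight keypresses turn into two requests, one for "milk" and one for "milk and". A real client would do the same thing with a timer that is reset on each keypress, and the right wait is a judgment call between server load and how responsive the suggestions feel.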
And here we come to the countervailing point of every story on design, balance and priorities: if you really do focus entirely on new features while letting a Platform-as-a-Service worry about everything else, you can run into trouble. In fact, a company that measures itself entirely on ‘features delivered’ will inevitably fail. If you’re not acknowledging performance as a huge factor in success, you’re toast! There's a popular Coding Horror article about this ("Performance is a Feature"): https://blog.codinghorror.com/performance-is-a-feature/
But I’d like to shift the focus a bit from these abstract priorities and talk about the human meaning of prioritization.
Switching to PaaS means a big change in priorities
When we adopt a Platform-as-a-Service we are implicitly focusing less on one area and more on another. Network configuration, physical security, and scaling planning all become less of a concern, but they don't disappear: decisions still need to be made around network config, physical security (your office, laptops, etc.), and scaling. Making those changes will have an effect on your team.
Changing priorities has a human cost
When we discuss prioritization abstractly, we are imagining that our priorities can shift cleanly and smoothly, a bit like a budget in SimCity.
In SimCity, you can slash budgets one year and double them the next, and while your staff will complain, it will still just ‘happen.’ No one will come to you and say ‘sorry boss we can’t hire more firefighters, they all moved away after you cut funding last year.’
And this is how you know SimCity is a game and not real life, because a real dev team cannot suddenly switch to a different model without changing the landscape of the team. If we’re talking about ‘getting out of the business of running servers,’ it’s natural to ask ‘then what will the people who run our servers do?’
Infrastructure engineers can do bigger, better things with PaaS
The change in the work of operations is massive with PaaS. With a traditional hosted platform, you needed someone to manage all the pieces of your service right down to the metal. With PaaS, managing services can go from being many people's full-time jobs, including being on call, to part of one or two people's jobs. This frees your platform people to deliver great things for their team.
Have you met an experienced infrastructure engineer who knows their cloud tools? Someone who understands Heroku scheduler, metrics, and dataclips like the back of their hand? And beyond Heroku, skills like Chef/Ansible/Terraform, GitLab CI/CD, and PagerDuty configuration? What they can do in an afternoon would take me a month and a half. Authentication? Storage? Privacy? They handle it all in a blink, and set up automations that enable their team to deliver more, faster.
When a team selects cloud services and Platform-as-a-Service, it is crucial that it develops the expertise to truly understand these services, because the ‘multiplier effect’ of a good infrastructure or automation engineer is huge. The ability to spin up containers quickly and scale them smoothly means these engineers can focus on hard challenges for your team like improving your CI/CD pipeline or handling horizontal scale.
Further, good infrastructure engineering can have a massive impact on cost. Since architecture improvements can happen much faster on PaaS, an experienced engineer can bring those costs down quickly.
In the end, platform engineering remains hugely important, but the nature of the work has changed. Instead of being the people whose sole job is to keep the engine running, the modern-day coal-shovelers of the team, platform engineers become wizards who can deliver new tools and help the team pursue business opportunities with agility.