Are you wondering what it takes to become an SRE from a SysAdmin background? Our latest blog covers the growth areas and technical skills needed to successfully transition to an SRE role.
The last decade has seen widespread adoption of SRE practices based on the best practices laid out by Google. Many SysAdmins have observed this trend and are now considering becoming SREs, which raises the question: how much do the skills of an SRE and a SysAdmin overlap?
Both roles are concerned with IT operations and there is a significant overlap in their respective responsibilities. Broadly, Google has defined SRE to be software engineering principles applied to IT operations at scale. What does this mean in reality? It means running operations according to a few key principles, such as managing risk explicitly and automating repetitive work, frequently with technologies that may be new to some SysAdmins.
In this blog we look at the growth areas and skills a SysAdmin needs to pick up to become an SRE. The transition requires some changes in mindset as well as some new technical skills, but it shouldn't be difficult for an experienced SysAdmin.
Mindset Changes
Embracing Risk
As a SysAdmin, the primary focus of your work has been to maintain order and keep the systems under your care running smoothly. SysAdmins have traditionally focused on keeping their infrastructure stable and secure and on eliminating any risk of failure. SREs, on the other hand, recognize that some amount of failure is inevitable. The error budget is an SRE concept that quantifies the amount of downtime your infrastructure can have before you are in breach of an SLO (service level objective). Armed with that knowledge, an SRE can decide whether to support agility and allow riskier changes or to be more safety conscious and risk averse. This allows SREs to leverage risk for the benefit of the product rather than futilely attempting to eliminate it and potentially becoming a bottleneck.
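To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability SLO measured over a rolling 30-day window. The function names and the 30-day default are illustrative assumptions, not part of any standard tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical "three nines" SLO).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the SLO has already been breached.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"Budget for a 99.9% SLO: {budget:.1f} minutes per 30 days")
    print(f"Remaining after 10 minutes down: {budget_remaining(0.999, 10.0):.0%}")
```

A team might treat a mostly unspent budget as room for riskier releases, and a nearly exhausted one as a signal to slow down and prioritise stability.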
Reducing Toil
Much of SRE concerns itself with removing toil. In this context, toil refers to tasks that are repetitive and don't add any enduring value to the upkeep of your infrastructure. Removing toil often means automating jobs that are repetitive and time-consuming. By limiting toil to half of their work, SREs free up time to improve other aspects of the system. Improvements in system stability and performance are encouraged, and creative solutions can materialise. SysAdmins are all too familiar with the repetitive configuration of hardware and software to fit the needs of their organisation. Most experienced SysAdmins have developed automation practices that work well within their organisation but are not standardised. As an SRE, you are expected to know standardisation practices that will work across organisations of all types and across the major tech stacks. Automation using software such as Puppet, Chef and Ansible helps minimise repetitive steps and frees SysAdmins for more substantive and thorough work.
Automate all the things
Automation is a substantial aspect of good SRE practice. It is applied to the tasks that have been identified as toil in the system. This can include running scripts when certain events occur, monitoring clusters, automating full-scale deployments of code and infrastructure (Infrastructure as Code) and auto-configuring virtual machines in the cloud. SREs automate to keep their workload under control and to ensure that it does not grow linearly with the number of users or machines they maintain. Other benefits of automation include more reliable deployments, improved performance and an all-round reduction in cost.
Dealing with failure: Understanding SLOs and blameless postmortems
SysAdmins are familiar with the RCA (Root Cause Analysis) process: when a failure occurs, the root cause is identified and a solution is put in place. The best practices Google has created for SREs go beyond root causes and focus on understanding the weaknesses in the system that led to the breakdown. Blameless postmortems encourage you to find flaws in the existing reporting and operational processes rather than in individuals. Good SRE practices insist on keeping people in the loop when failure occurs, including your customers. This is a cultural shift for SysAdmins, who rarely keep customers in the loop when things go down. These practices also include a formal, written incident postmortem process. The conclusions from an incident postmortem must then be fed back into the planning process for future deployments. Failure takes on a fresh perspective from an SRE’s viewpoint: it is an opportunity to learn from your mistakes and do better next time around.
Soft Skills
SRE culture demands much greater collaboration with other parts of the organisation. While SLOs bring greater transparency to operations, achieving consensus on those objectives and deciding on the next step can often be challenging. Business teams, product management, developers and SREs all have slightly different goals and incentives. Bridging the gap between these stakeholder perspectives may require conflict resolution skills. Explaining the trade-off between feature development and stability, and how error budgets can help decide the best outcome, requires strong communication skills. Finally, good negotiation skills will ensure that SRE goals are accepted in the face of pressure from Business, Product or Development.
Technical Skills
Transitioning from being a SysAdmin to an SRE requires brushing up on or acquiring various technical skills.
Programming & Testing Skills: The emphasis on toil reduction and automation in SRE requires significantly stronger programming and testing skills. Typically, an SRE should know one highly productive scripting language like Python and one high-performance systems language like Go.
Infrastructure as Code: Traditionally, infrastructure deployment has been a slow, manual, labour-intensive process. Because of this, it is expensive, inelastic, inconsistent and unreliable. Infrastructure as Code (IaC) is an automation technique that brings the rigor of software engineering to infrastructure management. Tools like Ansible, Terraform, Puppet or Chef can be used to power an IaC initiative.
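The core idea behind most IaC tools is declaring the desired state as data and letting software work out the changes needed to reach it. The following is a toy sketch of that pattern in plain Python, using an in-memory inventory. The Server type and the plan and apply functions are hypothetical names for illustration and do not reflect the API of Ansible, Terraform, Puppet or Chef.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical resource description: a named machine of a given size."""
    name: str
    size: str


def plan(desired: dict[str, Server], actual: dict[str, Server]) -> list[tuple[str, str]]:
    """Return the (action, name) steps needed to make actual match desired."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


def apply(desired: dict[str, Server], actual: dict[str, Server]) -> dict[str, Server]:
    """Carry out the plan against a copy of the inventory and return the new state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # "create" and "update" both converge on the desired spec
            new_state[name] = desired[name]
    return new_state


if __name__ == "__main__":
    desired = {"web-1": Server("web-1", "small"), "web-2": Server("web-2", "large")}
    actual = {"web-1": Server("web-1", "medium"), "db-1": Server("db-1", "large")}
    print(plan(desired, actual))
    converged = apply(desired, actual)
    print(plan(desired, converged))  # a second run finds nothing left to do
```

Because the plan is computed from the difference between desired and actual state, running it again after convergence produces no steps, which is the repeatability that manual deployment lacks.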
Cloud, Containers & Container Orchestration: Cloud and container services make physical hardware, which was previously difficult to automate, manageable via standardised APIs. As an added benefit, they are usually cheaper, more flexible and faster to provision than traditional hardware. They have also made the IaC technique far more powerful and useful. Knowledge of AWS, Kubernetes and Docker is now considered a basic skill for SREs.
Modern Monitoring Tools: Active checking systems, metrics collection, and log aggregation have been the traditional mainstays of monitoring. More recently, code instrumentation and distributed tracing have been added to this arsenal. Older de facto standard tools like Nagios, Ganglia and rsyslog have been surpassed by tools like Prometheus, Datadog, and the ELK stack. APMs like New Relic are now key for instrumentation, and OpenTelemetry seems very promising as a distributed tracing tool. Familiarity with these platforms is a significant requirement for a good SRE.
Statistical Analysis: SRE culture demands hard data to support decision making. With the vast volumes of data being generated by monitoring tools, some basic statistical analysis is necessary to generate actionable data. This data can be used for capacity planning, release planning, continuous improvement and incident response.
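As one example of the basic statistics involved, the sketch below summarises request latencies with percentiles using only Python's standard library. The latency_summary function and the sample data are made up for illustration.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarise latency samples (in milliseconds) with the mean and key percentiles."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


if __name__ == "__main__":
    # Illustrative data: 98 fast requests and 2 very slow ones
    samples = [100.0] * 98 + [5000.0] * 2
    for name, value in latency_summary(samples).items():
        print(f"{name}: {value:.0f} ms")
```

With this sample data the mean is 198 ms even though the median request takes 100 ms, while the 99th percentile exposes the 5000 ms outliers. This is why latency objectives are often expressed as percentiles rather than averages.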
Conclusion
SysAdmins and SREs are both expected to be drivers of reliability and of change that benefits customers. If you are a SysAdmin, you have doubtless carried out many operations at the systems level that will be invaluable to you as an SRE. The necessary areas of growth include learning to adapt to change, since the SRE practices in vogue today may very well change tomorrow. An SRE is someone who brings practices that have been a mainstay of software development at scale to the operations side. This crossover pays dividends for organisations, which find solutions to recurrent problems without investing in more manpower and hardware. The future of SRE is bright as more organisations seek to cut costs and streamline their IT operations.
Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.