Job profile: Become a site reliability engineer

Posted on

May 11, 2021

What does a site reliability engineer do? We're glad you asked.

With technology evolving in so many different directions, it has become increasingly hard to keep track of the siloed job roles that companies keep throwing out there in employment postings. Gone are the days of the jack-of-all-trades "IT expert," and here to stay are the specific roles and job descriptions that have replaced him (or her).

Software development and design are getting faster and more complex the further we progress. This is pushing business teams past their bounds and frustrating IT operations teams more than ever before. DevOps, where this role sits, gained popularity as a way to combat siloed workflows, decreased collaboration, and a lack of visibility.

While establishing a 'culture of DevOps' has helped teams collaborate better and deliver reliable software faster, DevOps teams don't necessarily have someone specifically dedicated to developing systems that increase site reliability and performance. That is where a site reliability engineer (SRE) comes into the picture.

What's in a name?

The concept of site reliability engineering (also SRE) was initially brought to life by Google engineer Ben Treynor, an amazing man. After Google had implemented the SRE role, the company published its popular SRE e-book, helping the movement gain traction in the industry.

A site reliability engineer sits at the crossroads of traditional technology and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems. Think of tools like Splunk, large-scale data moves, and custom-written load-balancing software.

The actual clinical definition of a site reliability engineer or SRE is pretty straightforward: In general, an SRE individual or team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. Site reliability engineers create a bridge between development and operations by applying a software engineering mindset to system administration topics.
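To make "availability" concrete, it can help to translate a target percentage into the downtime it actually allows. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration, not something defined by any particular SRE team.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Return how many minutes of downtime a target allows over a period.

    availability_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an illustrative assumption.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves about 43.2 minutes of downtime.
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes")
```

A number like this gives development and operations a shared yardstick: if an outage has already used most of the allowance, that is a signal to slow down risky changes.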

How might that be phrased in plain English? Well, you might simply ask, "How well are our servers and networks running for the applications that are being built?" You would, of course, ask this of an SRE. Did we run out of storage on the data warehouse or data mart? Ask the SRE. Is the system up or down? Simply ask the SRE.
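Questions like "did we run out of storage?" and "is the system up or down?" are exactly the kind of thing an SRE tends to turn into a small script. Here is a rough sketch using only the Python standard library; the helper names, the host and port, and the 90 percent threshold are all hypothetical values picked for the example.

```python
import shutil
import socket


def disk_used_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of disk space used at the given path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Hypothetical threshold: warn when the disk is more than 90% full.
    if disk_used_fraction("/") > 0.90:
        print("WARNING: disk is more than 90% full")
    # Hypothetical target: a service listening locally on port 8080.
    print("service up" if is_port_open("127.0.0.1", 8080) else "service down")
```

A TCP connection check only shows that something is listening; a real health check would usually also confirm that the application answers correctly.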

DevOps, in general, pushed shared responsibility for the reliability of your applications and infrastructure. And while this is a great first step forward, it does not proactively help teams add resilience to their system. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace. Enter the SRE.

Job responsibilities

Some of the general things that an SRE does might include any of the following.

Proactively building and implementing services that help technology and support teams do their jobs better. This can be anything from adjustments to monitoring and alerting to code changes in production (small ones, mind you).
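As one illustration of an "adjustment to monitoring and alerting," a common tweak is to alert only when a metric stays above its threshold for several samples in a row, so that a single spike does not page anyone. The sketch below is an assumption about how such a rule might look; `should_alert` and its parameters are hypothetical and not taken from any specific monitoring tool.

```python
from typing import Iterable


def should_alert(samples: Iterable[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    streak = 0
    for value in samples:
        # Extend the streak on a breach, reset it on a healthy sample.
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False


if __name__ == "__main__":
    cpu = [0.40, 0.95, 0.50, 0.92, 0.96, 0.97]
    # The single spike at 0.95 is ignored; the last three samples trigger the alert.
    print(should_alert(cpu, threshold=0.90))  # True
```

Raising or lowering `consecutive` trades alert noise against detection speed, which is the kind of small tuning decision described above.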

A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management. They code things that operations and network engineers used to do manually. They augment and grow ops. Think of a bridge between Development and Operations — as the name DevOps would imply.

One of the least-known things an SRE does is gain exposure to systems in both staging and production, as well as to all technical teams. This allows them to take part in work with software development, support, IT operations and on-call duties. It also means they build up a great amount of historical knowledge over time.

Instead of siloing this knowledge in the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it. I love asking any SRE to show me their runbook. If they are good, they can and will pull it up immediately.

Stay sharp

The area where SREs must stay ahead most diligently is toolset changes. While they write a lot of their own tools, and a lot of tools for the team, they are also subject to an overload of tools. Another emerging area of interest is Splunk and other monitoring and DevOps tools. I think DevOps teams and SREs will start to write fewer of their own tools and lean more heavily on out-of-the-box tools that automate most of their tasks.

Job preparation

The right background can vary for this role, since it sits between operations and development. A background in traditional software development or network engineering is certainly helpful.

A background in critical thinking is likely to also be of use — I knew a psychology major who was one of the best I have ever met in this role. I also like operations individuals who have learned to code on their own for the SRE role.

Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes and technology within any organization. Ultimately, the best part about being an SRE is assisting clients, including people both internal and external to the organization.

SREs won't spend all (or even most) of their time building new features for customers, but they're constantly making an impact on customer experience. In fact, if you're looking for the IT job role that impacts customers most directly — then SRE might be it.

Certification

The DevOps Institute, fittingly enough, offers perhaps the most directly applicable certification for potential SREs with its Site Reliability Engineer (SRE) Foundation credential. India's DevOps School (devopsschool.com) also offers the similarly named Site Reliability Engineering Certified Professional (SRECP) credential.

IBM appears to be leading the charge among IT companies that provide certification with its IBM Certified Professional SRE - Cloud v1 credential.

Make a difference

Site reliability engineers not only improve the lives of customers, both internal and external, but they also make things better for frontline service desk teams, IT professionals, and software developers. Site reliability engineering can be one of the most fulfilling roles for a software engineer or operations specialist.

Working as an SRE can help you better understand the struggles of IT and support, making you a better developer or DevOps team member going forward. No matter what role you are in, I wish you success and prosperity.

About the Author

Nathan Kimpel is a seasoned information technology and operations executive with a diverse background in all areas of company functionality, and a keen focus on all aspects of IT operations and security. Over his 20 years in the industry, he has held every job in IT and currently serves as a Project Manager in the St. Louis (Missouri) area, overseeing 50-plus projects. He has years of success driving multi-million dollar improvements in technology, products and teams. His wide range of skills includes finance, as well as ERP and CRM systems. Certifications include PMP, CISSP, CEH, ITIL and Microsoft.

Posted to topic:

Jobs and Salary
