Most Important Skills for a Site Reliability Manager at Google
David, a Site Reliability Manager at Google, says the role requires both "human empathy" and "mechanical sympathy": understanding the people as well as the technology. He also names efficient task switching, which he says takes practice, along with the ability to "visualize how computers...can fail" and a pessimistic imagination for anticipating problems.
Human Empathy, Systems Thinking, Efficient Task Switching, Predictive Failure Analysis, Problem-Solving
Advizer Information
Name: David Fayram
Job Title: Site Reliability Manager
Company: Google
Undergrad: University of California, Santa Barbara
Grad Programs: None
Majors: Computer Science
Industries: Energy & Utilities, Technology, Advertising, Communications & Marketing
Job Functions: Cyber Security and IT
Traits: Took Out Loans, Worked 20+ Hours in School, LGBTQ
Video Highlights
1. Human empathy and mechanical sympathy are crucial skills for Site Reliability Managers. Empathy is needed to effectively manage people, while mechanical sympathy helps in understanding and predicting system failures.
2. Efficient task switching is essential due to the frequent interruptions and need to quickly address production issues. This skill requires practice and developing effective note-taking and progress tracking techniques.
3. Imagination and the ability to visualize system failures, their interactions, and their probabilities are key to proactive problem-solving and preventing outages. This 'imagination of pessimism' helps anticipate and mitigate potential risks.
Transcript
What skills are most important for a job like yours?
I think the most important skills are human empathy and mechanical sympathy. As a manager, human empathy is really important. You can't be a good manager without understanding that the people you're working with and who are working under you are people, and you have to think in that way.
Mechanical sympathy is also really important. The SRE philosophy at Google, and really anywhere that there's SRE, is that you need to be able to think about machines predictably, understand how they work and how they fail, and anticipate that. Having a mix of both of these skills is essential if you want to be a manager in the SRE space.
You've also got to be able to efficiently task switch. This is actually really hard and it's a skill that you have to develop. I don't think I've met many people who can do it naturally; most people say they can, and then they're kind of bad at it.
You have to be able to lock up what you're doing right now and be able to come back to it quickly and efficiently. That's just practice: learning what notes you need to take, learning how to track your progress for yourself, and learning how to make yourself legible to the company.
Interrupts occur all the time, especially if you've got a pager and you're working on a project, and then something breaks in production. You go fix it, maybe you're back six hours later or two hours later. You want to be able to keep doing your work and not have the whole day be ruined.
You just need to be able to visualize how computers, large systems, and individual software can fail and interact. A lot of it is just imagination. The imagination of pessimism is what one of my colleagues calls it. It's just being able to imagine failure and then come up with ways to address that potential failure and weigh the relative probability of those outcomes.
