Software Engineering
Engineering Culture at FAANG
Google's Project Aristotle studied 180 teams and found that the most important factor in team effectiveness was not how many star engineers a team had but psychological safety: the ability to take risks without fear of punishment. Culture is not a soft topic. It is often the root cause of repeated incidents.
- **Google** requires blameless postmortems for all P1/P2 incidents as part of the SRE contract. An incident cannot be closed without a postmortem.
- **Etsy**, after adopting a blameless culture, increased deploy frequency from 1 per week to 50 per day - engineers stopped fearing experimentation.
- **Stripe** publishes some postmortems publicly, demonstrating transparency; their engineering ladder is included in every job description.
On-Call Rotation
On-call rotation means engineers are responsible for responding to production incidents outside business hours. The practice forces teams to build observable, maintainable systems - because the people who build them are the ones paged at 3am.
Alert fatigue is as dangerous as missing alerts. A team that receives 50 alerts per shift will learn to ignore all of them. Alert hygiene is not optional maintenance - it is a core reliability practice.
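One way to make alert hygiene concrete is to measure, per alerting rule, how often a page actually required action. The sketch below is only an illustration: the `AlertEvent` record, the `noisy_rules` helper, the rule names, and the 0.5 threshold are all hypothetical assumptions, not part of any particular monitoring tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """One page received during a shift (hypothetical record shape)."""
    rule: str          # name of the alerting rule that fired
    actionable: bool   # did the responder actually have to do something?


def noisy_rules(events, min_actionable_ratio=0.5):
    """Return rules whose share of actionable pages is below the threshold.

    The 0.5 default is an illustrative assumption, not an industry standard.
    """
    fired = Counter(e.rule for e in events)
    acted = Counter(e.rule for e in events if e.actionable)
    return sorted(
        rule for rule, total in fired.items()
        if acted[rule] / total < min_actionable_ratio
    )


# Made-up pages from one shift, for illustration only.
shift = [
    AlertEvent("disk_usage_warning", False),
    AlertEvent("disk_usage_warning", False),
    AlertEvent("disk_usage_warning", True),
    AlertEvent("checkout_error_rate", True),
    AlertEvent("checkout_error_rate", True),
    AlertEvent("cpu_above_80_percent", False),
]
print(noisy_rules(shift))  # ['cpu_above_80_percent', 'disk_usage_warning']
```

Rules that show up in this list are candidates for tuning or deletion, which is one possible way to turn "alert hygiene" into a recurring review item.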
A team of 4 engineers considers introducing an on-call rotation. Why is this a problem?
Postmortem
A postmortem is a structured document written after a production incident to understand what happened, why it happened, and how to prevent recurrence. The goal is organizational learning, not assigning blame.
A postmortem with no action items - or action items with no owner and no deadline - is a historical document, not a learning tool. Action items must be tracked to completion, not filed and forgotten.
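The owner-and-deadline rule can be checked mechanically. The following is a minimal sketch, assuming action items are kept as simple records; the `ActionItem` fields and the `untracked` and `overdue` helpers are hypothetical names chosen for illustration, and the sample items are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """A postmortem follow-up (hypothetical record shape)."""
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None
    done: bool = False


def untracked(items):
    """Action items that lack an owner or a deadline."""
    return [i for i in items if not i.owner or i.due is None]


def overdue(items, today):
    """Open action items whose deadline has already passed."""
    return [i for i in items
            if i.due is not None and not i.done and i.due < today]


# Made-up example items.
items = [
    ActionItem("Add canary stage to deploy pipeline",
               owner="dana", due=date(2024, 3, 1)),
    ActionItem("Document rollback procedure", owner="lee"),
    ActionItem("Tune connection pool alert"),
]

print([i.description for i in untracked(items)])
print([i.description for i in overdue(items, today=date(2024, 3, 15))])
```

Here the second and third items are reported as untracked, and the first is reported as overdue. A postmortem review could refuse to close while `untracked` returns anything, and a periodic check of `overdue` keeps items from being filed and forgotten.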
A postmortem is complete and action items are identified. What indicates a good postmortem outcome?
Blameless Culture
Blameless culture shifts the question from 'who made the mistake?' to 'what in the system allowed this to happen?'. This shift is not about avoiding accountability - it is about producing actionable improvements rather than scapegoats.
Blame culture makes incidents invisible. Engineers who fear punishment stop reporting near-misses, hide problems while they are still small, and deploy changes during low-visibility windows. Blameless culture makes the system transparent.
In a blameless culture, a critical incident occurs. An engineer honestly reports they deployed a change they were unsure about. How does leadership respond?
Engineering Ladder
An Engineering Ladder is a framework that defines levels of scope, impact, and expectations for engineers. It provides a shared vocabulary for career development conversations between engineers and managers.
The ladder is a self-assessment tool, not an HR control mechanism. An engineer who understands the scope expectations at the next level can deliberately seek projects that build that scope.
Misconception: An engineering ladder is an HR tool for controlling salaries and is not useful for engineers.
Reality: An engineering ladder is a framework for self-assessment, career planning, and having explicit conversations about growth with managers.
Companies that publish their ladders (Stripe, Dropbox, Buffer) report that engineers use them to self-direct career growth more effectively. The ladder makes invisible expectations visible.
An engineer has written excellent code for 7 years and wants promotion to Staff level. What is required?
Key Ideas
- **On-Call** only works with a minimum of 8 engineers in rotation and strict alert hygiene: every alert must be actionable - otherwise delete it.
- **Postmortem** is not a historical record - it must yield concrete action items, each with an owner and a deadline; without those, it is useless.
- **Blameless culture** shifts the question from 'who is at fault' to 'what in the system allowed this to happen' - the only way to systematically improve reliability.
- **Engineering Ladder** defines not years of experience but scope of impact: task → component → team → domain → company.
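The rotation-size point in the list above can be illustrated with simple arithmetic. This sketch assumes a single primary on-call slot, equal shifts, and a 52-week year; the `rotation_load` helper is a hypothetical name, and the calculation does not derive the 8-engineer minimum, it only shows how the load changes with team size.

```python
def rotation_load(engineers, shift_weeks=1):
    """Return (weeks on call per year, weeks of rest between shifts).

    Assumes one primary on-call slot, equal shifts, and a 52-week year.
    """
    if engineers < 1:
        raise ValueError("a rotation needs at least one engineer")
    weeks_on_call = 52 / engineers
    rest_between_shifts = (engineers - 1) * shift_weeks
    return weeks_on_call, rest_between_shifts


for size in (4, 8):
    on_call, rest = rotation_load(size)
    print(f"{size} engineers: {on_call} weeks/year on call, "
          f"{rest} weeks between shifts")
```

Under these assumptions, a 4-person rotation puts each engineer on call for 13 weeks a year with 3 weeks between shifts, while an 8-person rotation halves that to 6.5 weeks a year with 7 weeks between shifts.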
Related Topics
Engineering culture connects reliability practices, team health, and career growth:
- SRE and Error Budget — On-call, SLO, and postmortems are three pillars of SRE practice that are impossible without blameless culture.
- Technical Leadership — Tech Leads shape blameless culture in their team and model it for junior engineers.
- Chaos Engineering — Chaos Engineering (Chaos Monkey) only works in teams with blameless culture - a production incident must not trigger fear.
Questions for Reflection
- How does the current team respond to mistakes: by finding who is at fault or finding systemic causes? What can be changed right now?
- Is there an explicit engineering ladder in the company? Is it clear what is required to advance to the next level?
- If on-call rotation were introduced in the current team - which three alerts would be the first ones and why?