Job Description
SRE Sr Leader
Remote
6 Months
We are seeking an SRE Senior Leader to drive system uptime, performance, and scalability by blending software engineering with operational expertise. The role leads teams that define SLIs/SLOs, automate infrastructure through infrastructure as code (IaC), manage incidents, and conduct post-mortems. Key responsibilities include mentoring engineers, setting reliability strategy, and optimizing cloud costs.
Core Responsibilities:
• Leadership & Mentoring: Lead a team of SREs, manage sprint planning, and foster career growth.
• System Reliability & Strategy: Own the uptime, performance, and capacity planning of production systems.
• Automation & Tools: Reduce manual work (toil) by building automation, managing infrastructure as code (e.g., Terraform, Kubernetes manifests), and enhancing observability.
• Incident Management: Lead incident response, drive root cause analysis (RCA), and implement post-mortem action items.
• SLI/SLO Management: Define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to balance velocity and reliability.
Required Skills and Qualifications:
• Technical Expertise: Proficiency in coding/scripting (e.g., Python, Go) and familiarity with CI/CD tools.
• Infrastructure Skills: Strong knowledge of cloud platforms (AWS, GCP, Azure), Linux, networking, and container orchestration (Kubernetes).
• Leadership Experience: Proven experience leading technical teams and managing complex projects.
• Communication: Ability to communicate technical SRE initiatives to stakeholders across the organization.
Preferred Experience:
• 5+ years in SRE, DevOps, or Software Engineering.
• Experience managing 24/7 high-availability production environments.