Summary:
This is a developed professional role for an AWS-focused SRE. Individuals at this level are responsible for basic reliability and toil reduction projects. They can observe the performance of a system and configure proactive alerting to protect service levels. They are ready to join the on-call rotation, can participate in disaster recovery tests in production environments, and may train new team members.
Scope and Key Responsibilities:
Creates monitoring queries and establishes service level baselines
Supports senior engineers during incidents
Makes contributions during post-mortems and RCAs
Participates in disaster recovery testing
Implements automation and executes code in production environments
Contributes to SRE knowledge documentation
Observability: Level 3
Able to create proactive alert rules that detect conditions that are urgent and actionable, so that alerts page support teams before users are impacted.
Can create and configure browser agents to monitor performance of apps including user satisfaction, JavaScript errors, session performance, and core web vitals.
Can create complex synthetic transactions that include scripts to simulate user flows and functionality from the browser or API endpoints.
Able to create advanced Application Performance Monitoring (APM) and Browser distributed traces that give insight into application performance.
Able to recommend and create Service Level Objectives using the four Golden Signals: latency, traffic, errors, and saturation.
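As an illustration of the SLO work described above, the following is a minimal sketch in Python of how an availability or latency SLI might be computed from request counts and compared against an SLO target. The `SloWindow` shape, the field names, and the targets used in the usage note are assumptions made for this example, not part of any specific monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one evaluation window (hypothetical shape)."""
    total_requests: int   # traffic signal
    failed_requests: int  # errors signal
    slow_requests: int    # latency signal: requests slower than the agreed threshold


def availability_sli(window: SloWindow) -> float:
    """Fraction of requests that did not fail."""
    if window.total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return 1 - window.failed_requests / window.total_requests


def latency_sli(window: SloWindow) -> float:
    """Fraction of requests served within the latency threshold."""
    if window.total_requests == 0:
        return 1.0
    return 1 - window.slow_requests / window.total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been breached.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1 - slo_target
    spent = 1 - sli
    return 1 - spent / budget
```

For example, a window of 100,000 requests with 50 failures gives an availability SLI of 99.95%; against a 99.9% target, roughly half of the error budget remains.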
Incident Management: Level 3
Able to create and/or present RCAs, including the executive summary, timeline, detailed impact statement, follow-on actions, and residual risks.
Can lead scenario modelling exercises and the creation of workflows that are triggered by an SLO breach.
Able to participate in the on-call rotation and provide on-call support for other SREs.
Can write advanced automation scripts for incident response including failovers and rollbacks.
Design for Reliability: Level 3
Can make theoretical performance (latency, traffic) and capacity recommendations based on customer demand and growth estimates.
Has good knowledge of DevOps practices including monitoring, virtual networks, cloud storage, containers and orchestration, CI/CD, configuration management, and securing cloud applications.
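A capacity recommendation based on growth estimates can be sketched as follows. This is a simplified illustration in Python that assumes compound monthly growth and a fixed per-instance throughput; the function names, the 60% default utilization target, and the single spare instance are hypothetical choices for the example rather than recommended values.

```python
import math


def projected_peak_rps(current_peak_rps: float,
                       monthly_growth_rate: float,
                       months: int) -> float:
    """Project peak requests per second, assuming compound monthly growth."""
    return current_peak_rps * (1 + monthly_growth_rate) ** months


def instances_needed(peak_rps: float,
                     rps_per_instance: float,
                     target_utilization: float = 0.6,
                     spare_instances: int = 1) -> int:
    """Instances required to serve peak_rps at the target utilization.

    target_utilization leaves headroom for traffic spikes, and
    spare_instances covers the loss of a node or a zone.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    usable_rps = rps_per_instance * target_utilization
    return math.ceil(peak_rps / usable_rps) + spare_instances
```

A typical use would be to project demand 12 months ahead with `projected_peak_rps` and pass the result to `instances_needed`, then compare the answer with the current fleet size to frame the recommendation.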
Disaster Recovery: Level 3
Capable of participating on call to assist in the recovery from Major Incidents in production environments.
Can test system and component failover within and between geographic regions in production environments.
Able to automate the recovery of systems and components using Infrastructure-as-Code and Configuration Management scripts.
Platforms and Automation: Level 3
Able to identify opportunities to improve the developer experience by leveraging observability tools, paved road components, shared services, and self-service portals.
Able to improve software delivery performance by recommending and/or implementing automated build and release processes and removing manual tasks.
Able to maintain and secure cloud environments without degrading software delivery performance.
Reliability Culture: Level 3
Can contribute to SRE knowledge base articles and training material.
Able to analyze toil by looking at ticket trends and can recommend focus areas for the team.
Can independently work on small toil elimination projects.
Any Graduate