Site Reliability Engineering Lead / L3 Support

Job Description:

8+ years of experience in operations for production systems
Implement scalable solutions for applications, infrastructure, and databases
Responsible for the maintenance, configuration, and reliable operation of VMs, network servers, and applications
Identify, install, and configure tools for managing, monitoring, and reporting the performance of servers and applications (on-prem, cloud, hybrid)
Perform HW and SW upgrades
Troubleshoot hardware and software problems and incidents by working with application team members, running diagnostics, and assessing the impact of issues
Perform capacity planning for HW and SW
Experience implementing logging, monitoring, and alerting solutions
Experience in service level observability
Experience in performance improvement solutions across applications and infrastructure
Understanding of metrics such as MTTR, MTBR, MTD, etc., along with SLOs, SLIs, and error budgets
Knowledge in Chaos Engineering tools and practices
Good experience with scripting languages such as Bash, Python, Groovy, and Go (Golang)
Implementation experience in SRE Tools and Accelerators
Understanding of Infrastructure as Code practices and ability to create and modify runbooks and SOPs
Good understanding of containers and container platforms

Good to Have

3+ years of application architecture and development knowledge with Java or .NET technologies for API, database, microservices, integration layer, and mobile
Implementation experience with CI/CD tools and pipelines
Good understanding of ITSM processes, basic metrics for support quality and reliability
Internal / External SRE Certification

Key Skills: SRE, Java, Jenkins, L3, Bash, Python, Groovy

Any Graduate
