Job Description:
- 8+ years of experience in operations for production systems
- Implement scalable solutions for applications, infrastructure, and databases
- Responsible for the maintenance, configuration, and reliable operation of VMs, network servers, and applications
- Identify, install, and configure tools for managing, monitoring, and reporting on the performance of servers and applications (on-prem, cloud, hybrid)
- Perform HW and SW upgrades
- Troubleshoot hardware and software problems and incidents by working with application team members, running diagnostics, and assessing the impact of issues
- Perform capacity planning for HW and SW
- Experience implementing logging, monitoring, and alerting solutions
- Experience in service-level observability
- Experience with performance improvement solutions across applications and infrastructure
- Understanding of metrics such as MTTR, MTBR, and MTD, along with SLOs, SLIs, and error budgets
- Knowledge of chaos engineering tools and practices
- Good experience with scripting languages such as Bash, Python, Groovy, and Go
- Implementation experience with SRE tools and accelerators
- Understanding of Infrastructure as Code practices and ability to create and modify runbooks and SOPs
- Good understanding of containers and container platforms
Good to Have:
- 3+ years of application architecture and development experience with Java or .NET technologies for APIs, databases, microservices, integration layers, and mobile
- Implementation experience with CI/CD tools and pipelines
- Good understanding of ITSM processes and basic metrics for support quality and reliability
- Internal or external SRE certification
Key Skills: SRE, Java, Jenkins, L3, Bash, Python, Groovy