Description

Job Description:

 

  • 8+ years of experience in operations for production systems
  • Implement scalable solutions for applications, infrastructure, and databases
  • Responsible for the maintenance, configuration, and reliable operation of VMs, network servers, and applications
  • Identify, install, and configure tools for managing, monitoring, and reporting the performance of servers and applications (on-prem, cloud, hybrid)
  • Perform HW and SW upgrades
  • Troubleshoot hardware and software problems and incidents by working with application team members, running diagnostics, and assessing the impact of issues
  • Perform capacity planning for HW and SW
  • Experience implementing logging, monitoring, and alerting solutions
  • Experience in service-level observability
  • Experience with performance improvement solutions across applications and infrastructure
  • Understanding of metrics such as MTTR, MTBR, MTD, etc., along with SLOs, SLIs, and error budgets
  • Knowledge of chaos engineering tools and practices
  • Good experience with scripting and programming languages such as Bash, Python, Groovy, and Go
  • Implementation experience with SRE tools and accelerators
  • Understanding of Infrastructure as Code practices and ability to create and modify runbooks and SOPs
  • Good understanding of containers and container platforms

 

Good to Have

  • 3+ years of application architecture and development knowledge with Java or .NET technologies for APIs, databases, microservices, integration layers, and mobile
  • Implementation experience with CI/CD tools and pipelines
  • Good understanding of ITSM processes and basic metrics for support quality and reliability
  • Internal or external SRE certification

 

 

Key Skills: SRE, Java, Jenkins, L3, Bash, Python, Groovy

 


 

Education

Any Graduate