Job Description

  • The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure, and technology controls.
     
  • The role includes both oversight of production operations for our systems and development/engineering of solutions to maximize system reliability and automation.
     
  • Own the infrastructure and APM, and work with DevOps teams to build, release, monitor, and run services to improve service reliability.
     
  • Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python.
     
  • Work with Ansible, Puppet, Chef, Terraform, or another configuration management / orchestration suite; know where it breaks, work towards fixing those problems, and explore new alternatives.
     
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system reliability.
     
  • Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology, and engineering practices.
     
  • Work closely with Engineering, QA, and Operations teams to deliver large-scale systems optimally using CI/CD pipelines.
     
  • Evaluate technology options and define the build, delivery, and deployment pipeline for applications.
     
  • Understand, define, measure, and improve reliability metrics (SLO/SLI), observability (monitoring, logging, and tracing solutions), and Ops processes (incident and problem management), and streamline and automate release management.
     
  • Be a strong believer in automation as a driver of sustained continuous improvement: automate toil and runbooks, and improve the ability of applications to auto-heal, leading to improved reliability.
     
  • Should have supported Production Incidents (PIs) on a company's critical applications.
     
  • Troubleshoot, debug, and diagnose operational issues and drive them to closure.
     
  • Be a subject matter expert, able to upskill and cross-skill engineering teams on SRE principles, tools, and execution.
     
Requirements

  • Should have experience in software development.
     
  • 4+ years of relevant SRE experience, including incident management and SLO/SLI management.
     
  • Experience with an Application Performance Monitoring (APM) tool such as New Relic, or with comparable tools for monitoring, logging, and tracing.
     
  • 3+ years of experience designing and developing testing utilities in Python or a similar language to integrate test automation into the DevOps pipeline.
     
  • CI/CD Integration.
     
  • Containerisation - Kubernetes, Docker, Rancher, etc.
     
  • 3+ years of experience with version control systems such as Git.
     
  • Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
     
  • Expert-level, hands-on knowledge of a public cloud platform: AWS and/or Google Cloud Platform.
     
  • A professional-level certification on one of the public clouds is highly desirable.
     
  • Proven experience in handling large-scale, growing infrastructure across data centers and heterogeneous cloud platforms.
     
  • Understanding of software delivery life cycles, particularly Agile/Lean and DevOps.

Education

Any Graduate