The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure, and technology controls.
The role includes both oversight of production operations for our systems and the development/engineering of solutions to maximize system reliability and automation.
Own the infrastructure and APM, and work with DevOps teams to build, release, monitor, and run services to improve service reliability.
Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python.
Work with Ansible, Puppet, Chef, Terraform, or another configuration management / orchestration suite: know where it is broken, work towards fixing it, and explore new alternatives.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system reliability.
Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology, and engineering practices.
Work closely with Engineering, QA, and Operations teams to deliver large-scale systems optimally using CI/CD pipelines.
Evaluate technology options and define the build, delivery, and deployment pipeline for applications.
Understand, define, measure, and improve reliability metrics (SLO/SLI), observability (monitoring, logging, and tracing solutions), and Ops processes (incident and problem management), and streamline and automate release management.
Be a strong believer in automation as a means of sustained continuous improvement: automate toil and runbooks, and improve the ability of applications to auto-heal, leading to improved reliability.
Should have supported Production Incidents (PIs) on a company's critical applications.
Troubleshoot, debug, and diagnose operational issues and drive them to closure.
Be a subject matter expert, able to upskill / cross-skill engineering teams on SRE principles, tools, and execution.
Requirements
Should have experience in software development.
4+ years of relevant SRE experience, including incident management and SLO/SLI management.
Experience with the Application Performance Monitoring (APM) tool New Relic, or with comparable tools for monitoring, logging, and tracing.
3+ years of experience designing and developing testing utilities in Python or a similar language to integrate test automation into the DevOps pipeline.
Experience with CI/CD integration.
Containerisation - Kubernetes, Docker, Rancher, etc.
3+ years of experience with version control systems such as Git.
Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
Expert-level, hands-on knowledge of the public cloud platforms AWS and/or Google Cloud Platform.
A professional-level certification on one of the public clouds is highly desirable.
Proven experience handling large-scale, growing infrastructure across data centers and heterogeneous cloud platforms.
Understanding of software delivery life cycles, particularly Agile/Lean and DevOps.