Job Description

Responsibilities:
Design and Infrastructure Development
• Design and implement highly available, scalable, and fault-tolerant infrastructure.
• Collaborate with engineering teams to define and implement reliability standards and best practices.
Automation and Operations
• Automate infrastructure provisioning, configuration, and deployment processes to streamline operations.
• Collaborate with other software engineers to design and implement deployment strategies using automated continuous integration and continuous delivery pipelines.
Monitoring and Performance Management
• Monitor system performance and identify potential issues to ensure uptime and optimal performance.
• Collaborate with software engineering teams to improve system reliability through automated testing, fault tolerance, and disaster recovery planning.
Incident Management
• Lead incident management efforts, overseeing response processes and coordinating with cross-functional teams.
• Design and implement incident response playbooks and escalation procedures for timely and effective resolution.
• Conduct post-incident reviews to identify root causes and implement preventative measures.


Experience
• 5+ years of experience as a Site Reliability Engineer or in a similar role.
• 4+ years of experience performing engineering and support in Azure.
• 4+ years of experience supporting enterprise-level complex applications and platforms in production.
• 3+ years of experience designing and building complex observability solutions using industry-standard tools or custom-built solutions.
• 2+ years of experience building and supporting CI/CD pipelines in GitHub.

Technical Skills
• Proficiency in programming languages such as Python, Go, Java, or C#, with a focus on automation and scripting.
• Strong proficiency in Infrastructure as Code (IaC) principles using tools like Terraform.
• Experience with container orchestration tools, such as Kubernetes, and containerization technologies like Docker.
• 3+ years of experience working with configuration and monitoring technologies such as Ansible, Grafana, Elastic, Splunk, and Prometheus.

Desired Qualifications
• A degree in computer science or engineering.
• Experience with Agile Scrum (daily standups, sprint planning, and sprint retrospectives).
• Experience with datastores such as relational databases, NoSQL databases, and cloud storage services.

Education

Any Graduate