Description

Responsibilities

Engage in and improve the whole lifecycle of services, from inception and design through deployment and operation.
Design, operate, maintain, and troubleshoot enterprise systems such as databases, message queues, APIs, and distributed applications through the use of data and metrics such as SLOs and error budgets. 
Establish and practice sustainable incident response and blameless postmortems to prevent problem recurrence. 
Support services before they go live through activities such as system design, developing software platforms and frameworks, capacity planning, and launch reviews. 
Scale systems sustainably through mechanisms like scripting and automation; evolve systems by pushing changes that improve their operational management, reliability, and velocity.
Work closely with development teams, infrastructure teams, and business stakeholders across multiple time zones to understand requirements and design solutions.
Ensure that hardware design meets business and technical requirements, including performance, scalability, and reliability.
Ensure that hardware design meets industry standards and best practices for data center infrastructure.
Create and maintain detailed documentation on system configurations, procedures, and operational policies. 
Perform day-to-day server administration (physical and virtual), storage administration, network configuration, application support, and health and performance monitoring, ensuring quick turnaround times as well as performance, availability, and security.
Deploy infrastructure both manually and via configuration management and automation platforms.
Troubleshoot hardware, software, and network-related issues; resolve reported problems quickly; and perform root cause analysis to prevent recurrence.

Qualifications

Minimum

Experience programming in Python or other languages. 
Experience in designing, analysing, and troubleshooting large-scale distributed systems.
Ability to work in a 24x7 on-call rotation (approx. 1 week every 2 months).
Systematic problem-solving approach, strong communication skills, and a sense of ownership and drive.
Working experience with observability platforms such as Elastic or Datadog.
Experience deploying and troubleshooting Linux systems (Red Hat/CentOS, Ubuntu) as well as VMware environments (ESXi, NSX, vSAN).
Experience working directly with end users to determine deployment and configuration requirements.
Ability to lift 15+ kilograms when working with storage equipment.

Preferred

7+ years as a Site Reliability Engineer, DevOps Engineer, or Infrastructure Engineer.
Understanding of Unix/Linux and, optionally, Windows operating systems.
Experience working with Infrastructure as Code and automation tools (Ansible, Terraform, CloudFormation).
Well organised, with the ability to prioritise tasks independently, set goals, and follow them through to completion.
Experience with containerisation, container orchestration, and virtualisation technologies such as Docker, Kubernetes, and VMware.
Expertise with hybrid cloud environments (bare metal and public cloud; AWS and Azure preferred).
Knowledge of storage technologies (SAN / NAS devices).

Education

Any Graduate