Description

About the job
Run the production environment by monitoring availability and taking a holistic view of system health.
Support the applications as part of an on-call rotation.
Provide stability to our applications and facilitate rapid feature development by proactively taking an active role in the direction of the service.
Eliminate manual work through automation and look for further opportunities to automate.
Implement and maintain SLOs, driving their adoption and automation.
Production Readiness/Health Scoring & Error Budget Tracking.
Runbook standards, maintenance, and updates.

Skill/Experience/Education:
Experience using DevOps tools and technologies such as GitLab, and Infrastructure as Code tools such as Terraform.
Strong troubleshooting skills and experience building and enhancing observability using monitoring tools.
Proactive approach to observability maturity: identifying problems, performance bottlenecks, and areas for improvement.
Leading incident response and supporting application teams.
Blameless postmortems.
Developer feedback for enhanced logging, runbooks, and addressing technical debt.
Promoting observability best practices.
Experience with the monitoring tools Dynatrace and Splunk.
Experience with public cloud platforms, preferably AWS, and API gateways.
Experience developing APIs, microservices, or frontends is a plus.
Experience using a source version control (SVC) system such as Git.
Desired Skills and Experience
SITE RELIABILITY

Education

Any Graduate