Job Description:
Ability to work in a blameless, data-driven fashion and manage incident response.
Build monitoring solutions, including dashboards and alerting, using industry-standard tools such as Dynatrace, Grafana, and cloud-native tools (Azure, AWS)
Code and support infrastructure automation across the CI/CD pipeline using Python, Ansible, and Terraform
Demonstrate strong programming skills and thorough knowledge of systems, especially Azure, Databricks, OpenAI, and AWS
Enhance reliability through designing, building, and maintaining scalable core infrastructure.
Familiarity with Agile and/or SAFe (Scaled Agile Framework) processes
Improve operational processes and team practices.
Intimate knowledge of SRE principles and practices, including SLIs, SLOs, toil reduction, observability, and automation
On-call rotation for incident response and proactive incident measures
Strong analytical and documentation skills
Strong communication skills and an ability to collaborate.
Strong problem-solving skills and ability to think under pressure.
SRE, Databricks, Purview, Terraform, Python, Observability/Infrastructure/Config as Code
Dynatrace, Grafana, Ansible, Jenkins, Bitbucket, AW
Any Graduate