Description

Job Description:

Ability to work in a blameless, data-driven fashion and manage incident response.

Build monitoring solutions, including dashboards and alerting, using industry-standard tools such as Dynatrace and Grafana as well as cloud-native tools (Azure, AWS)

Code and support infrastructure automation across the CI/CD pipeline using Python, Ansible, and Terraform

Demonstrate strong programming skills and thorough knowledge of systems, especially Azure, Databricks, OpenAI, and AWS

Enhance reliability through designing, building, and maintaining scalable core infrastructure.

Familiarity with Agile and/or SAFe (Scaled Agile Framework) processes

Improve operational processes and team practices.

In-depth knowledge of SRE principles and practices, including SLIs, SLOs, toil reduction, observability, and automation

Participate in the on-call rotation for incident response and proactive incident prevention

Strong analytical and documentation skills

Strong communication skills and an ability to collaborate.

Strong problem-solving skills and ability to think under pressure.

SRE, Databricks, Purview, Terraform, Python, Observability/Infrastructure/Config as Code

Dynatrace, Grafana, Ansible, Jenkins, Bitbucket, AW

Education

Any Graduate