Description

Responsibilities:

Observability and Monitoring:

  • Develop and implement robust observability strategies, including logging, metrics, and tracing, to gain deep insights into the performance and health of our systems.
  • Collaborate with cross-functional teams to establish and enforce best practices for instrumentation, logging, and monitoring throughout the software development lifecycle.

Site Reliability Engineering:

  • Lead initiatives to improve the reliability, availability, and scalability of our applications and infrastructure.
  • Collaborate with development teams to design and implement systems that are resilient to failures and capable of quick recovery.
  • Drive the adoption of SRE principles and practices across the organization.

Incident Management:

  • Develop and refine incident response processes, ensuring timely detection, analysis, and resolution of incidents.
  • Collaborate with teams to conduct post-incident reviews, identify root causes, and implement preventive measures.

Automation and Tooling:

  • Build and maintain automation tools for deployment, monitoring, and incident response to streamline operational processes.
  • Evaluate and integrate third-party tools to enhance observability and SRE capabilities.

Collaboration and Leadership:

  • Provide technical leadership and mentorship to the engineering team.
  • Collaborate with product managers, architects, and other stakeholders to align observability and SRE initiatives with business goals.

Qualifications:

  • Bachelor's degree or higher in Computer Science, Software Engineering, or a related field.
  • Extensive experience in software engineering with a focus on observability, monitoring, and SRE.
  • Strong expertise in designing and implementing distributed systems for high availability and reliability.
  • Proficiency in application performance monitoring (APM), real user monitoring (RUM), synthetic monitoring, correlation, and alert and incident management (e.g., OpenTelemetry, Jaeger, Kloudfuse, ServiceNow).
  • Proficiency in one or more programming languages (e.g., Java, Python, Go).
  • Experience with cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
  • In-depth knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog, Aternity) and incident management processes.
  • In-depth knowledge of ML and AI techniques and tooling (e.g., anomaly detection, outlier detection, AIOps, LLMs).
  • Excellent communication and collaboration skills.
  • Demonstrated ability to lead technical initiatives and mentor team members.

Education

Bachelor’s Degree