Responsibilities:
Observability and Monitoring:
- Develop and implement robust observability strategies, including logging, metrics, and tracing, to gain deep insights into the performance and health of our systems.
- Collaborate with cross-functional teams to establish and enforce best practices for instrumentation, logging, and monitoring throughout the software development lifecycle.
Site Reliability Engineering:
- Lead initiatives to improve the reliability, availability, and scalability of our applications and infrastructure.
- Collaborate with development teams to design and implement systems that are resilient to failures and capable of quick recovery.
- Drive the adoption of SRE principles and practices across the organization.
Incident Management:
- Develop and refine incident response processes, ensuring timely detection, analysis, and resolution of incidents.
- Collaborate with teams to conduct post-incident reviews, identify root causes, and implement preventive measures.
Automation and Tooling:
- Build and maintain automation tools for deployment, monitoring, and incident response to streamline operational processes.
- Evaluate and integrate third-party tools to enhance observability and SRE capabilities.
Collaboration and Leadership:
- Provide technical leadership and mentorship to the engineering team.
- Collaborate with product managers, architects, and other stakeholders to align observability and SRE initiatives with business goals.
Qualifications:
- Bachelor's or higher degree in Computer Science, Software Engineering, or a related field.
- Extensive experience in software engineering with a focus on observability, monitoring, and SRE.
- Strong expertise in designing and implementing distributed systems for high availability and reliability.
- Proficiency in APM (application performance monitoring), RUM (real user monitoring), synthetic monitoring, event correlation, and alert and incident management (e.g., OpenTelemetry, Jaeger, Kloudfuse, ServiceNow).
- Proficiency in one or more programming languages (e.g., Java, Python, Go).
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
- In-depth knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog, Aternity) and incident management processes.
- In-depth knowledge of ML and AI techniques applied to operations (e.g., anomaly detection, outlier detection, AIOps, LLMs).
- Excellent communication and collaboration skills.
- Demonstrated ability to lead technical initiatives and mentor team members.