Description

Position Overview:

The Principal DevOps/SRE Engineer will lead the development and implementation of a cutting-edge observability framework, focusing on application-centric monitoring and insights. This role is responsible for ensuring the performance, reliability, and scalability of business-critical applications through enhanced visibility into application-level metrics, leveraging modern observability tools like Datadog. The ideal candidate will collaborate closely with development teams to deliver robust monitoring solutions that go beyond infrastructure and provide deep insights into application behavior and performance.

Responsibilities:

•                    Design and Implement Observability Framework: Develop and implement an end-to-end observability framework that extends beyond infrastructure to focus on application-specific metrics. Ensure comprehensive visibility into the performance of key business applications.

•                    Datadog Integration and Enhancement: Use Datadog to instrument application-level monitoring, defining SLIs and SLOs around the golden signals to track performance, availability, and reliability.

•                    Develop SLI/SLO Blueprints: Create and maintain SLI/SLO blueprints for key business applications, defining and measuring golden signals (latency, traffic, errors, saturation) to ensure optimal system health.

•                    System Performance Optimization: Proactively monitor and assess application performance, identifying areas for improvement. Collaborate with development and SRE teams to implement performance optimization measures.

•                    Dashboard and Visualization: Develop centralized dashboards with drill-down capabilities, providing real-time visibility into the health of applications and enabling quick identification of performance issues.

•                    Business Journey Mapping: Work closely with business and engineering teams to map out critical business journeys and ensure that observability systems capture relevant metrics for each journey.

•                    Gap Analysis and Continuous Improvement: Perform baseline measurements, identify gaps in existing monitoring systems, and work to close those gaps by integrating additional telemetry data.

•                    Incident Response and Alerting: Define and implement alerting mechanisms based on SLI/SLO thresholds. Ensure the observability system can trigger appropriate alerts and escalations in case of performance degradation.

•                    Collaboration with Development Teams: Work alongside development and data engineering teams to embed observability practices into the SDLC, ensuring that monitoring is an integral part of the application architecture from the ground up.

•                    Knowledge Sharing: Provide training and guidance to teams on best practices for application observability, ensuring consistent adoption of tools and methodologies across the organization.

Qualifications:

•                    11-15 years of hands-on experience in DevOps/SRE, with a strong focus on observability for large-scale, high-performance applications.

•                    Expertise in using and enhancing observability tools like Datadog, including deep experience with metrics collection, alerting, and dashboard creation.

•                    Proven ability to create and implement SLI/SLO frameworks to track application performance, availability, and reliability.

•                    Strong understanding of monitoring application health across various services, containers, and microservices architectures.

•                    Experience in business journey mapping and ensuring observability captures relevant metrics at every stage of the user experience.

•                    Expertise in root cause analysis and providing insights into system performance through observability data.

•                    Proficiency in programming/scripting languages (e.g., Python, Bash) for automation and tool integration.

•                    Proven track record of driving performance improvements and maintaining system health through proactive monitoring and alerting.

Preferred Skills:

•                    Hands-on experience with cloud-native applications running on AWS or other cloud platforms.

•                    Familiarity with CI/CD pipelines and integrating observability tools into the development lifecycle.

•                    AWS certifications or equivalent experience with cloud infrastructure monitoring.

•                    Strong knowledge of modern infrastructure stacks, including Kubernetes, Docker, and serverless architectures.

•                    Experience working in agile environments, collaborating closely with product, development, and operations teams.

Education:

Any graduate (bachelor's degree in any discipline).