Summary
As an Observability Software Engineer II, you will be responsible for designing, implementing, and maintaining our observability platform. You'll work closely with cross-functional teams to ensure our systems are transparent, measurable, and reliable. By leveraging your expertise in observability tools and techniques, you will help us gain deep insights into our applications, infrastructure, and user experiences.
Responsibilities
Design and develop robust observability solutions to monitor, analyze, and troubleshoot distributed systems.
Familiarity with OpenTelemetry (OTel) standards and tools.
Previous experience working with application teams to implement “self-healing” systems, i.e., alerting that triggers automated remediation.
Implement and configure monitoring, logging, tracing, and alerting systems to ensure comprehensive coverage of our infrastructure and applications.
Collaborate with software engineers to instrument code for telemetry data collection and analysis.
Optimize observability tooling and processes to improve system reliability, performance, and scalability.
Create dashboards, reports, and visualizations to provide actionable insights into system health and performance.
Investigate and resolve incidents by analyzing telemetry data and identifying root causes.
Stay current with industry trends and best practices in observability, and recommend improvements to our observability strategy and infrastructure.
Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
1-2 years of experience as an Observability Engineer or in a similar role in a production environment.
Deep understanding of observability principles, methodologies, and tools such as Prometheus, Grafana, Jaeger, and the ELK stack.
Proficiency in programming or scripting languages such as Java, Python, or Go for automation and tooling development.
Strong knowledge of cloud computing platforms (AWS preferred) and container orchestration systems (e.g., Kubernetes).
Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
Strong communication skills and the ability to collaborate effectively with cross-functional teams.