Description

Job Description:

  • The client is seeking an experienced monitoring tools and OpenTelemetry Subject Matter Expert (SME) who will be responsible for designing, implementing, and optimizing monitoring solutions and leveraging OpenTelemetry to enhance observability within the Enterprise Command Center (ECC).
  • The SME should collaborate with the Incident Management team to troubleshoot and resolve incidents.

Key Job Functions:

  • Lead the design and implementation of monitoring solutions using industry-standard tools such as Splunk and others.
  • Customize monitoring configurations to align with organizational requirements.
  • Implement and integrate OpenTelemetry across various applications and services for enhanced observability.
  • Optimize monitoring solutions for efficiency and accuracy, ensuring minimal impact on system performance.
  • Design and implement application and infrastructure performance monitoring in the AWS Cloud environment.
  • Create monitors and dashboards to monitor applications and infrastructure performance.
  • Perform deep statistical analysis using performance data to help identify capacity and performance bottlenecks.
  • Configure alerting mechanisms within monitoring tools to proactively identify and address potential issues.
  • Develop comprehensive documentation for monitoring tool configurations, OpenTelemetry implementations, and best practices.
  • Provide training to incident management teams on utilizing monitoring tools and interpreting OpenTelemetry data effectively.
  • Set up monitoring dashboards for incident detection and alerting.
  • Perform end-to-end analysis of transactions in an observability environment.
  • Troubleshoot incidents and identify root causes quickly using wire data analytics, application performance management, and event correlation monitoring tools.
  • Diagnose and resolve incidents by providing factual data from the various monitoring and instrumentation systems.

Job Requirements:

  • A good understanding of the IT Cloud infrastructure that includes AWS Cloud, middleware, database, storage and/or network infrastructure.
  • Strong understanding of IT infrastructure, networking, security concepts and application architecture.
  • Hands-on experience with OpenTelemetry instrumentation and telemetry data collection.
  • Proven experience as a Splunk SME with in-depth knowledge of Splunk architecture and components.
  • Excellent troubleshooting and problem-solving skills.
  • Strong documentation skills and attention to detail.
  • Proactive monitoring of hardware, software, and environmental alerts or malfunctions.
  • Analyze dashboards and monitoring tools to look for trends and patterns in application/infrastructure health and performance.
  • Monitor applications and infrastructure using tools like Splunk, Dynatrace, Catchpoint, Moogsoft, xMatters, SignalFx, SolarWinds, ExtraHop, etc.
  • Expert understanding of microservice-based applications deployed in the cloud using Lambda, ECS Fargate, etc.
  • Proficiency in AWS services like IAM, roles, security groups, EC2, S3, Lambda, ALB, ECS, etc.
  • Experience working with AWS tools like ELB, RDS, Redshift, DynamoDB, Aurora, Route53, Lambda, S3, Batch, CloudWatch, CloudTrail, WAF, etc.
  • Hands-on experience with transaction-level monitoring using Dynatrace and Splunk.
  • Create Splunk search queries and dashboards.
  • Be the SME in helping recognize and onboard new data sources into Splunk and other tools, analyze the data for anomalies and trends, and build dashboards highlighting the key trends in the data.
  • Implement best-in-class engineering strategies to support a distributed, clustered Splunk environment consisting of Search Heads, Indexers, Forwarders, and the Splunk Enterprise Security (ES) app, spanning security, performance, engineering, and operational roles.
  • Use the open-source observability framework OpenTelemetry for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs to help analyze application performance and behavior.
  • Use distributed tracing in an end-to-end visibility environment that consists of microservices, containers, serverless, and Lambda.
  • Work closely with application teams and business stakeholders to perform troubleshooting and aid in incident triage. 
  • Influence other technical teams on incident calls and articulate troubleshooting steps effectively.
  • Follow up on items that could negatively impact production operations, assist with postmortem related activities, and support various efforts related to operational improvements.
  • Strong relationship management skills and the ability to multitask and work well in a high-stress environment, both within teams and independently.

Education

Bachelor’s Degree