Title: SRE / Datadog SME
Location: Dallas, TX
Long-Term Contract
RESPONSIBILITIES
• Establish monitoring, tracing, logging, and alerting for shared platforms
• Define SLAs and SLOs and set up monitoring to ensure availability targets are being met
• Develop tools and workflows utilizing engineering best practices, such as infrastructure as code and CI/CD, to promote reliability and availability
• Collaborate with platform engineers and developers to improve operational stability and reliability
• The candidate should have implemented Datadog from scratch and supported monitoring for an organization
REQUIREMENTS
- Python and application monitoring
- Datadog SME
- Minimum 5 years of Datadog experience
- Datadog should be the candidate's primary skill
- Bachelor's degree in computer science or a related field, or equivalent experience
- Proven work experience as a Site Reliability Engineer or in a similar role
- Expert in infrastructure as code (Terraform, Docker, Helm)
- Expert in monitoring tools such as Datadog or Dynatrace
- Cloud experience, preferably Azure
- Experience with container technologies (Docker and Kubernetes)
- Experience with configuration and administration of CI/CD pipelines, preferably using GitHub Actions
- Capable of writing comprehensive technical documentation and diagrams
- Working knowledge of bash and shell scripting
- Understanding of the end-to-end application development lifecycle, from code commit to production deployment
- DevOps, reliability, and security mindset, including an understanding of production controls and change processes