Building the foundation for modern ops. Using the available monitoring system, the SRE will analyze the current design and propose ways to improve environment monitoring, including what should and should not be monitored, and why. The main problem the SRE will need to solve for our team is making the current state of each edge deployment (system health, SLIs, performance) available in the cloud. The SRE should be able to identify product issues as they arise in production and test environments and create solutions, automated as far as possible, for fixing them, so that incident management stays sustainable.
Responsibilities
In charge of maintaining and improving the product monitoring system
Incident response management (troubleshooting, resolution, documentation, post-mortem analysis)
Knowledge sharing on the lessons learnt
Be a bridge between operations and development
Experience Required
Building solutions from scratch
Writing code to automate processes (log analysis, testing production environments, alerts automation)
Expertise in cloud providers
Tools
Incident management/on-call: PagerDuty
Logging: ELK/Kibana, Seq
Languages: Python, C#, scripting
Databases: SQL, MongoDB
Network: Basic network knowledge (inbound/outbound and firewall rules)
Monitoring: Prometheus, Grafana
Project management and issue tracking: Azure DevOps, Wiki
Source code management: Git
Infrastructure and orchestration: SaltStack, Docker, Zededa
Education: Any graduate