About the job
At Flipkart, Site Reliability Engineers (SREs) combine engineering experience and an innate drive to improve existing systems and processes with the creativity to develop novel solutions to evolving challenges that can hamper the reliability, performance and availability of critical platform services and applications. The SRE team builds solutions (processes and tools) to enhance the reliability posture of all critical platform services and applications. Its key goals are improving the uptime, availability, performance and stability of services and the platform, and automating repetitive work. The SRE team at Flipkart looks for engineers who bring creative and innovative problem-solving, build and deploy solutions that operate at scale, and enjoy collaborating with cross-functional teams to develop real-world solutions and positive user experiences at every interaction.
What you’ll do?
Build software and systems to manage platform infrastructure and applications by using one or more programming languages (e.g. Go, Python, Java)
Apply observability tools and methodologies (e.g. logging, metrics, tracing) to highly available web services
Design and deliver cloud-native infrastructure solutions on public cloud or comparable private cloud platforms
Design and manage large-scale, complex cloud-based infrastructure and the applications hosted on it
Handle incident response and management as part of an on-call rotation
Focus on a 360° reliability posture to improve the reliability, quality and performance of platform services and applications/microservices.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve.
Understanding of reliability and resilience frameworks.
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
What you’ll need?
Bachelor's or Master’s degree in Computer Science/Information Technology, with 6 to 12 years of relevant industry experience.
Extensive coding experience in at least one programming language such as Python, Java or Go.
Ability to work with complex business flows and deal with large volumes of data.
Strong knowledge of operating systems (preferably Linux/Unix) and network topologies.
Full-stack troubleshooting skills across network, application, management fabric, and distributed services layers.
Drive efficiencies through software improvement and root cause analysis that improve service delivery, maturity, and scalability.
Familiarity with hybrid, public and private cloud deployments and environments
Experience with cloud-agnostic frameworks for constructs such as Infrastructure as Code (IaC), configuration management, and deployments
Awareness of containerization technologies (Docker, Kubernetes, etc)
Sound knowledge of observability tools and APMs
Experience with service tiering and SLOs/SLIs