Description

Carry out SRE duties for Big Data on various open-source platforms such as Hadoop, Spark, and HBase.

• Monitor the platforms and follow runbooks/SOPs to manage platform and application problems.

• Familiarize yourself with the cluster maintenance processes and implement changes as per the documented installation and validation plans.

• Demonstrate strong troubleshooting and debugging skills to pinpoint and resolve issues, and advise on how to prevent similar problems in the future.

• Conduct thorough root cause analysis of major production incidents, document the findings for future reference, and put in place proactive measures to enhance system reliability.

• Automate routine tasks using scripts or automation tools to reduce manual work, lower the chance of human error, and improve system reliability.

• Technical skills required:

o At least 2-3 years of experience for a junior-level role and 5+ years for a mid-level or senior-level role, working as a Hadoop site reliability engineer.

o High-level knowledge of Hadoop platforms and core Hadoop components.

o Troubleshooting both Hadoop platform service problems and application problems, and identifying the root cause.

o Writing Ansible playbooks and automating manual tasks using Ansible, shell scripting, and Python scripting.

o Familiarity with Unix/Linux system internals, networking, and distributed systems.

 

Job description for Kafka SRE:

• Carry out SRE duties for the Kafka streaming platform.

• Have a thorough understanding of the Kafka architecture, including the concepts of producers, consumers, topics, and partitions.

• Monitor the platforms and follow runbooks/SOPs to manage platform and application problems.

• Familiarize yourself with the cluster maintenance processes and implement changes as per the documented installation and validation plans.

• Demonstrate strong troubleshooting and debugging skills to pinpoint and resolve issues, and advise on how to prevent similar problems in the future.

• Conduct thorough root cause analysis of major production incidents, document the findings for future reference, and put in place proactive measures to enhance system reliability.

• Automate routine tasks using scripts or automation tools to reduce manual work, lower the chance of human error, and improve system reliability.

• Technical skills required:

o At least 2-3 years of experience for a junior-level role and 5+ years for a mid-level or senior-level role, working as a site reliability engineer for the Kafka platform.

o Deep knowledge of core Kafka components such as producers, consumers, topics, and partitions.

o Troubleshooting both Kafka platform service problems and application problems, and identifying the root cause.

o Writing Ansible playbooks and automating manual tasks using Ansible, shell scripting, and Python.

o Familiarity with Unix/Linux system internals, networking, and distributed systems.

 

Key Skills
Education

Any Graduate