Job description for Hadoop SRE:
• Carry out SRE duties for Big Data on open-source platforms such as Hadoop, Spark, and HBase.
• Monitor the platforms and follow runbooks/SOPs to resolve platform and application problems.
• Familiarize yourself with the cluster maintenance processes and implement changes as per the documented installation and validation plans.
• Apply strong troubleshooting and debugging skills to pinpoint and fix issues, and advise on how to prevent them from recurring.
• Conduct thorough root cause analysis of major production incidents, document the findings for future reference, and put proactive measures in place to improve system reliability.
• Automate routine tasks using scripts or automation tools to reduce manual work, lower the chance of human error, and improve system reliability.
• Technical Skills required:
o At least 2-3 years of experience for a junior-level role, or 5+ years for a mid-level/senior-level role, working as a Hadoop site reliability engineer.
o High-level knowledge of Hadoop platforms and core Hadoop components.
o Troubleshooting both Hadoop platform service problems and application problems, and identifying the root cause.
o Writing Ansible playbooks and automating manual tasks using Ansible, shell scripting, and Python scripting.
o Familiarity with Unix/Linux system internals, networking, and distributed systems.
Job description for Kafka SRE:
• Carry out SRE duties for the Kafka streaming platform.
• Have a thorough understanding of the Kafka architecture, including concepts such as producers, consumers, topics, and partitions.
• Monitor the platforms and follow runbooks/SOPs to resolve platform and application problems.
• Familiarize yourself with the cluster maintenance processes and implement changes as per the documented installation and validation plans.
• Apply strong troubleshooting and debugging skills to pinpoint and fix issues, and advise on how to prevent them from recurring.
• Conduct thorough root cause analysis of major production incidents, document the findings for future reference, and put proactive measures in place to improve system reliability.
• Automate routine tasks using scripts or automation tools to reduce manual work, lower the chance of human error, and improve system reliability.
• Technical Skills required:
o At least 2-3 years of experience for a junior-level role, or 5+ years for a mid-level/senior-level role, working as a site reliability engineer for the Kafka platform.
o In-depth knowledge of core Kafka components such as producers, consumers, topics, and partitions.
o Troubleshooting both Kafka platform service problems and application problems, and identifying the root cause.
o Writing Ansible playbooks and automating manual tasks using Ansible, shell scripting, and Python.
o Familiarity with Unix/Linux system internals, networking, and distributed systems.
Any Graduate