Lead complex technology initiatives towards Hadoop platform stability, automation initiatives.
Platform Incident and Problem management
Be part of 24/7 platform support team to perform platform incident management and triaging team.
Assess the impact of the platform issue and prioritize the triage by partnering with platform tenants, platform development and Hadoop vendor teams.
Facilitate and perform root cause analysis of platform incidents towards determining permanent resolution involving Tenants and Hadoop vendor.
Communicate with Stakeholders, leadership team on the triaging progress status.
To troubleshoot tenant reported failures by deep diving onto client logs to identify the root cause.
Use Autosys as scheduler to schedule, execute platform routine activities.
Hadoop Cluster Monitoring
Monitor Hadoop cluster health and performance.
Automate new monitoring opportunities leveraging available monitoring and ing tools such as Prometheus, Thousand Eyes, Splunk etc.
Perform troubleshooting on the actionable s and tuning to ensure optimal cluster performance.
Hadoop Cluster Configuration and tuning
Periodically tune and configure Hadoop ecosystem components such as HDFS, YARN, Hive, HBase, Spark, HPOS etc.
Schedule the Change request and track the approval process for implementation.
Platform Backup and Recovery:
Ensure Hadoop platform is resilient with data backup and recovery strategies in place.
Perform emergency or scheduled failover of platform to Disaster recovery site and fall back.
Upgrades and Patch Management:
Plan and execute upgrades of Hadoop ecosystem components and patches.
Ensure compatibility and smooth integration of new features and enhancements.
Work with Operating System administrators and patching automation build team to schedule and execute patching.
Perform Cluster health validation post patching and communicate with stakeholders.
Identify automation and process improvement opportunities related to patching to implement it.
Documentation and Reporting:
Maintain comprehensive documentation of Hadoop cluster configurations & troubleshooting processes, and procedures.
Develop and Schedule to Generate reports on cluster usage, performance metrics, and capacity utilization.
Collaboration and Support:
Ability to work both independently and in collaboration with Platform development and Platform tenant teams to troubleshoot and fix Hadoop platform issues, maintain production stability to enterprise-wide applications.
Collaborate and consult with key technical experts, senior technology team, and Hadoop vendor to resolve complex technical issues and achieve goals.
Mandatory Skills and Qualifications:
Proven Experience in technology incident and problem management role.
Proven Experience in using monitoring tools like Prometheus, Splunk, SiteScope & thousand eyes etc.,
Proven experience as a Unix Scripting
Proven experience in Hadoop administration or in a similar role.
Strong knowledge of Hadoop ecosystem and its components (Hive, HBase, Spark, Hue ).
Proven Experience is using Autosys or similar job scheduler.
Excellent problem-solving and troubleshooting skills.
Excellent communication and people skills for collaborating with cross-functional teams
Any Graduate