Description

About the Role:

 

Key Responsibilities:

  • Perform hardware diagnostics and troubleshooting for NVIDIA H100 Boxes, identifying and resolving issues promptly to minimize system downtime.
  • Conduct thorough hardware assessments and evaluations to ensure optimal performance and reliability of HPC systems.
  • Collaborate with cross-functional teams including system administrators, engineers, and vendors to address hardware-related concerns and implement solutions.
  • Develop and maintain detailed documentation of hardware configurations, diagnostic procedures, and troubleshooting steps for future reference.
  • Execute hardware replacements, upgrades, and installations as needed, adhering to safety protocols and manufacturer guidelines.
  • Implement preventive maintenance strategies to extend the lifespan of HPC hardware and mitigate potential failures.
  • Stay current on developments in HPC hardware technology, particularly those pertaining to NVIDIA H100 Boxes, and recommend system enhancements.
  • Provide technical guidance and support to junior team members, sharing knowledge and best practices in HPC hardware management.

 

Qualifications:

  • Proven experience in high-performance computing environments, with a focus on hardware diagnostics and maintenance.
  • In-depth knowledge of the architecture, components, and functionality of NVIDIA H100 Boxes.
  • Proficiency in hardware troubleshooting methodologies and diagnostic tools.
  • Familiarity with system administration tasks in Linux environments.
  • Strong communication and collaboration skills, with the ability to work effectively in a team-oriented environment.
  • Excellent problem-solving skills and attention to detail.
  • Relevant certifications (e.g., NVIDIA CUDA certification, CompTIA A+) are a plus.

Education

Any Graduate