Perform hardware diagnostics and troubleshooting for NVIDIA H100 Boxes, identifying, and resolving issues promptly to minimize system downtime.
Conduct thorough hardware assessments and evaluations to ensure optimal performance and reliability of HPC systems.
Collaborate with cross-functional teams including system administrators, engineers, and vendors to address hardware-related concerns and implement solutions.
Develop and maintain detailed documentation of hardware configurations, diagnostics procedures, and troubleshooting steps for future reference.
Execute hardware replacements, upgrades, and installations as needed, adhering to safety protocols and manufacturer guidelines.
Implement preventive maintenance strategies to extend the lifespan of HPC hardware and mitigate potential failures.
Stay updated on the latest developments and advancements in HPC hardware technology, particularly pertaining to NVIDIA H100 Boxes, and make recommendations for system enhancements.
Provide technical guidance and support to junior team members, sharing knowledge and best practices in HPC hardware management.
Qualifications:
Proven experience in high-performance computing environments, with a focus on hardware diagnostics and maintenance.
In-depth knowledge of NVIDIA H100 Boxes architecture, components, and functionalities.
Proficiency in hardware troubleshooting methodologies and diagnostic tools.
Familiarity with system administration tasks in Linux environments.
Strong communication and collaboration skills, with the ability to work effectively in a team-oriented environment.
Excellent problem-solving skills and attention to detail.
Relevant certifications (e.g., NVIDIA CUDA certification, CompTIA A+) are a plus.