Description


Job Description:
Technical/Functional Skills:

  • Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, Go, Rust, GPU drivers, and cluster interconnects with 200G/400G networking.
  • Experience managing GPU clusters and optimizing GPU-based services, tools, and software.

Roles & Responsibilities:

  • Develop, implement, and maintain GPU-based clusters of 10 to 1000 nodes, ensuring optimal performance and availability.
  • Administer ML/AI platforms – distributed ML services, LLMs, Vector-DB, and AI inferencing – by managing deployments, resource allocation, monitoring, and security.
  • Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise.
  • Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards.
  • Compile reports, document procedures, and publish recommendations for improving AI infrastructure and solutions.
  • Use AI/ML to continuously improve the internal processes and tools used in the end-to-end delivery of this team's services.

Education

Any Graduate