Job Description:
Technical/Functional Skills:
- Proficiency in RoCEv2, K8s, KVM, Ubuntu, Python, Shell, Go, Rust, GPU drivers, and cluster interconnects with 200G/400G networking.
- Experience managing GPU clusters and optimizing GPU-based services, tools, and software.
Roles & Responsibilities:
- Develop, implement, and maintain GPU-based clusters of 10 to 1000 nodes, ensuring optimal performance and availability.
- Administer Client/AI platforms (distributed Client services, LLMs, vector databases, and AI inferencing) by managing deployments, resource allocation, monitoring, and security.
- Collaborate with cross-functional teams to address AI infrastructure requirements, support AI-related projects, and provide technical expertise.
- Monitor and evaluate the performance of AI systems and clusters, ensuring that they adhere to industry best practices and meet company standards.
- Compile reports, document procedures, and publish recommendations for improving AI infrastructure and solutions.
- Use AI/Client to continuously improve the internal processes and tools used in the end-to-end delivery of your services on this team.