DevOps Engineer
Location: Remote (India)
Employment Type: Full-Time
About the Company:
Our company is a leading provider of innovative device-tracking solutions, specializing in real-time tracking, IoT integration, and advanced analytics.
Our mission is to enhance operational efficiency and security through cutting-edge technology. We are looking for a talented DevOps Engineer to join our team and drive the scalability, performance, and reliability of our systems.
Job Summary:
As a DevOps Engineer focused on deep learning, you will design, deploy, and maintain highly scalable, high-performance infrastructure for training and deploying machine learning models. Collaborating with AI researchers and engineers, you'll optimize workflows, manage large-scale data pipelines, and ensure seamless integration of the latest AI solutions.
Key Responsibilities:
- Develop and maintain CI/CD pipelines for deep learning model training, testing, and deployment.
- Design scalable, distributed infrastructure to support high-performance training on GPUs/TPUs.
- Automate provisioning of cloud-based and on-premises deep learning clusters using tools like Terraform or Ansible.
- Manage containerized environments (Docker) and orchestration systems (Kubernetes) for AI workloads.
- Optimize workflows for data preprocessing, model training, and inference using tools like Dask, Apache Spark, or Ray.
- Implement and maintain monitoring and logging systems for model performance and infrastructure health.
- Collaborate with AI teams to improve deployment pipelines for real-time and batch inference.
- Ensure the security and integrity of sensitive data used in AI workflows.
- Scale data storage and processing pipelines for large datasets used in model training.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 7+ years of experience as a DevOps Engineer, preferably in machine learning or data science environments.
- Strong experience with CI/CD tools (e.g., GitLab CI, Jenkins, CircleCI).
- Proficiency in cloud platforms (AWS, GCP, Azure) with a focus on GPU-based instances.
- Expertise in containerization (Docker) and orchestration (Kubernetes).
- Experience with infrastructure-as-code tools like Terraform, CloudFormation, or Ansible.
- Solid programming skills in Python, Bash, or Go.
- Familiarity with deep learning frameworks (e.g., TensorFlow, PyTorch).
- Strong understanding of GPU/TPU optimization for large-scale model training.
Nice-to-Have Skills:
- Experience with MLOps tools like MLflow, Kubeflow, or SageMaker.
- Familiarity with distributed training strategies (e.g., Horovod, DeepSpeed).
- Knowledge of high-performance storage systems for ML datasets.
- Exposure to monitoring AI models using Prometheus, Grafana, or similar tools.
- Certifications in cloud platforms or Kubernetes.
Why Join Us?
- Work on cutting-edge technologies at the intersection of AI and infrastructure.
- Collaborate with world-class researchers and engineers in deep learning.
- Access state-of-the-art hardware and tooling for AI development.
- Competitive salary, benefits, and career growth opportunities.
- A culture that values innovation, learning, and collaboration.