DevOps Engineer 

Location: Remote (India) 
Employment Type: Full-Time

About the Company:
Our company is a leading provider of innovative device tracking solutions, specializing in real-time tracking, IoT integration, and advanced analytics.

Our mission is to enhance operational efficiency and security through cutting-edge technology. We are looking for a talented DevOps Engineer to join our team and drive the scalability, performance, and reliability of our systems.

Job Summary:
As a DevOps Engineer focused on deep learning, you will design, deploy, and maintain highly scalable, high-performance infrastructure for training and deploying machine learning models. Collaborating with AI researchers and engineers, you will optimize workflows, manage large-scale data pipelines, and ensure seamless integration of cutting-edge AI solutions.

Key Responsibilities:

  • Develop and maintain CI/CD pipelines for deep learning model training, testing, and deployment.
  • Design scalable, distributed infrastructure to support high-performance training on GPUs/TPUs.
  • Automate provisioning of cloud-based and on-premises deep learning clusters using tools like Terraform or Ansible.
  • Manage containerized environments (Docker) and orchestration systems (Kubernetes) for AI workloads.
  • Optimize workflows for data preprocessing, model training, and inference using tools like Dask, Apache Spark, or Ray.
  • Implement and maintain monitoring and logging systems for model performance and infrastructure health.
  • Collaborate with AI teams to improve deployment pipelines for real-time and batch inference.
  • Ensure the security and integrity of sensitive data used in AI workflows.
  • Scale data storage and processing pipelines for large datasets used in model training.

Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience as a DevOps Engineer, preferably in machine learning or data science environments.
  • Strong experience with CI/CD tools (e.g., GitLab CI, Jenkins, CircleCI).
  • Proficiency in cloud platforms (AWS, GCP, Azure) with a focus on GPU-based instances.
  • Expertise in containerization (Docker) and orchestration (Kubernetes).
  • Experience with infrastructure-as-code tools like Terraform, CloudFormation, or Ansible.
  • Solid programming skills in Python, Bash, or Go.
  • Familiarity with deep learning frameworks (e.g., TensorFlow, PyTorch).
  • Strong understanding of GPU/TPU optimization for large-scale model training.

Nice-to-Have Skills:

  • Experience with MLOps tools like MLflow, Kubeflow, or Amazon SageMaker.
  • Familiarity with distributed training strategies (e.g., Horovod, DeepSpeed).
  • Knowledge of high-performance storage systems for ML datasets.
  • Exposure to monitoring AI models using Prometheus, Grafana, or similar tools.
  • Certifications in cloud platforms or Kubernetes.

Why Join Us?

  • Work on cutting-edge technologies at the intersection of AI and infrastructure.
  • Collaborate with world-class researchers and engineers in deep learning.
  • Get access to state-of-the-art hardware and tooling for AI development.
  • Competitive salary, benefits, and career growth opportunities.
  • A culture that values innovation, learning, and collaboration.