Responsibilities:
MLOps Framework and Infrastructure:
Create scalable MLOps frameworks and infrastructure to support the full machine learning lifecycle, from data ingestion to model deployment and monitoring.
Develop and maintain monitoring and alerting systems to track model performance, data drift, and system health, enabling proactive issue detection and resolution.
Work closely with data scientists and software engineers to seamlessly integrate machine learning models into production systems, prioritizing robustness, scalability, and performance.
CI/CD and Automation:
Implement automation and CI/CD practices to streamline model deployment, version control, and testing, ensuring efficient and reliable updates and rollbacks.
Implement CI/CD for data engineering and machine learning platforms.
Architecture and Design:
Provide architecture and design for orchestrating the ML engineering and data engineering components/services.
Design and develop data engineering and ML engineering pipelines, and productionize them with optimal scaling.
Infrastructure and Provisioning:
Establish scalable and efficient infrastructure for training and inference, leveraging cloud platforms.
Set up automated provisioning of GCP resources for machine learning and data engineering services.
Experience provisioning and configuring infrastructure on public and hybrid clouds; GCP experience is mandatory.
Experience administering Kubernetes and Google Kubernetes Engine (GKE), and an understanding of manifest management with Helm.
Data Engineering and ML Operations:
Build automated pipelines using Airflow/Kubeflow to run data engineering and machine learning components.
Build scalable solutions to bring AI models to production.
Adopt best practices for writing processed data (ML features) to appropriate data lakes, warehouses, or feature stores.
DevOps Management:
Manage code repositories and innovate development and release procedures.
Configure automated deployment to multiple environments.
Manage JIRA stories related to DevOps work.
Set up monitoring and alerts for production workloads on Google Cloud Platform.
Automation and Optimization:
Recommend and implement automated solutions to improve the performance and reliability of the system.
Experience in automating multiple systems using Bash and languages such as Python and Go.
Collaboration and Communication:
Strong communication skills across the board, with a passion for finding and sharing best practices and driving greater discipline.
Excellent verbal and written communication skills in English.
Basic Requirements:
Experience:
8+ years of IT experience, including at least 4 years of relevant experience in DevOps, CloudOps, Dockerization, and containerization.
Experience with CI/CD pipelines and related tools such as Jenkins, CircleCI, or Google Cloud Build.
Experience with configuration management and deployment tools such as Terraform and Google Cloud Deployment Manager.
Experience Dockerizing machine learning scripts and deploying them on Google Cloud Platform.
Experience iterating on products already in production, and with experiment management and data versioning.
Experience with Google Cloud Platform, Kubernetes, Google Kubernetes Engine (GKE), and Helm.
Experience with Agile and Scrum development processes.
Technical Skills:
Strong understanding of MLOps practices, including Git, CI/CD, unit and statistical testing of input data, experiment tracking, model registry, scheduling of ML pipelines, production drift monitoring and alerting, and cloud computing optimization.
Strong understanding of distributed computing, data warehousing, and big data technologies.
Strong programming skills in Bash, Python, and Go.
Strong understanding of data engineering and ML engineering.
Soft Skills:
Excellent problem-solving and analytical skills.
Strong communication and collaboration skills.
Passion for finding and sharing best practices and driving greater discipline.
Nice to Have:
Linux/Unix:
Experience managing Linux/Unix platforms and administering application servers (e.g., Tomcat, JBoss).
Experience with DNS and with Linux system configuration and administration (CentOS, Red Hat).
Monitoring Tools:
Experience with monitoring tools geared towards user experience and deep diagnostics (e.g., Splunk, New Relic, AppDynamics, Prometheus, Grafana).
Education:
Any Graduate