Job Description:

Job Title: AIOps Engineer

Experience: 10+ Years.
Job Type: Full time Permanent.

Job Location: Gurgaon / Gurugram, India.  

Job Purpose:

As an AIOps Engineer, you will lead the design and development of AI solutions to improve system reliability and resilience. Your focus will be on building automation to reduce manual effort and prevent incidents, particularly in incident management. You'll need expertise in AI technologies, strong problem-solving abilities, and a background in software development. Collaborating with cross-functional teams, you'll implement scalable and robust AIOps solutions, helping customers to separate signal from noise across billions of data points and providing AI-based root cause analysis.

Responsibilities:

  • Assist in the design and deployment of AI/ML solutions, ensuring alignment with business goals and technical requirements.
  • Collaborate with service and solution owners, technical & application teams to gather requirements, identify use cases, and propose solutions.
  • Create detailed technical solutions, including system architecture, data flow, and integration points.
  • Build integrations with monitoring tools for metrics, logs, and traces across infrastructure, application, security, and network domains.
  • Integrate with tools such as AppDynamics, New Relic, Splunk, Azure Log Analytics, SCOM, ServiceNow, and Rundeck.
  • Implement scalable, efficient, and robust AI models, algorithms, and systems.
  • Collaborate with service and solution owners, technical, and non-technical teams to collect, clean, and prepare data for AI model training and evaluation.
  • Develop and maintain best practices and coding standards for AI solution development.
  • Troubleshoot and resolve technical issues related to AI systems.
  • Foster a culture of knowledge sharing and growth.
  • Create content (documentation, workflow diagrams, test plans, training material) related to specific use cases or best practices.
  • Provide training and guidance, as needed.
  • Monitor AIOps dashboards for continual improvement opportunities.
  • Provide regular KPI and metric updates, including findings and adoption rates, to leadership, escalating concerns as appropriate.
  • Stay up-to-date on advancements in AI technologies and research to drive innovation and continuous improvement.
  • Collaborate with Incident and Problem Management to reduce MTTR and incident volume.
  • Design, implement, and maintain AIOps solutions to monitor and analyze IT systems, applications, and networks.
  • Deploy machine learning algorithms for anomaly detection, root cause analysis, and incident prediction.
  • Configure and manage observability tools and platforms to gain real-time visibility into system health and performance.
  • Develop monitoring dashboards, alerts, and reports to provide comprehensive insights into the IT environment.
  • Conduct root cause analysis for incidents using data from AIOps and observability tools to identify underlying issues.
  • Work closely with software engineers to instrument applications with appropriate logging, metrics, and tracing capabilities.
  • Continuously analyze monitoring data to identify trends, anomalies, and opportunities for optimization.
  • Stay updated with industry trends and advancements in AIOps and observability practices, and recommend new tools or methodologies for adoption.
  • Design, develop, and implement AI models and algorithms using state-of-the-art techniques such as GPT, VAEs, and GANs.
  • Collaborate with cross-functional teams to define AI project requirements and objectives, ensuring alignment with overall business goals.
  • Conduct research to stay up to date with the latest advancements in generative AI, machine learning, and deep learning, and identify opportunities to integrate them into our products and services.
  • Optimize existing generative AI models for improved performance, scalability, and efficiency.
  • Develop and maintain AI pipelines, including data preprocessing, feature extraction, model training, and evaluation.
  • Develop clear and concise documentation, including technical specifications, user guides, and presentations, to communicate complex AI concepts to both technical and non-technical stakeholders.
  • Contribute to the establishment of best practices and standards for generative AI development within the organization.
  • Provide technical mentorship and guidance to junior team members.
  • Apply trusted AI practices to ensure fairness, transparency, and accountability in AI models and systems.
  • Drive DevOps and MLOps practices, covering continuous integration, deployment, and monitoring of AI models.
  • Use tools such as Docker, Kubernetes, and Git to build and manage AI pipelines.
  • Implement monitoring and logging tools to ensure AI model performance and reliability.
  • Collaborate seamlessly with software engineering and operations teams for efficient AI model integration and deployment.

Required Qualifications:

  • Minimum 5 years of experience in Data Science and Machine Learning.
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
  • 8+ years of industry experience in software development.
  • Strong proficiency in Java, Spring Boot, REST Web Services, Kafka, and Microservices architecture.
  • Strong understanding of object-oriented programming principles and design patterns.
  • Experience with AIOps and machine learning is highly desirable.
  • Knowledge of OpenTelemetry is an added advantage.
  • Experience with other monitoring tools such as Prometheus and Grafana.
  • Experience with observability solutions such as Dynatrace, Datadog, and Instana is highly desirable.
  • Experience working with mainframe systems is a plus (willingness to learn is also acceptable).
  • Excellent problem-solving and analytical skills.
  • Strong communication and collaboration skills.
  • Ability to work independently and manage multiple projects simultaneously.
  • Passion for learning new technologies and continuous improvement.
  • In-depth knowledge of machine learning, deep learning, and generative AI techniques.
  • Knowledge of and experience in generative AI.
  • Proficiency in programming languages such as Python and R, and in frameworks such as TensorFlow or PyTorch.
  • Strong understanding of NLP techniques and frameworks such as BERT, GPT, or Transformer models.
  • Familiarity with computer vision techniques for image recognition, object detection, or image generation.
  • Experience with cloud platforms such as Azure or AWS.
  • Expertise in data engineering, including data curation, cleaning, and preprocessing.
  • Knowledge of trusted AI practices, ensuring fairness, transparency, and accountability in AI models and systems.
  • Strong collaboration with software engineering and operations teams to ensure seamless integration and deployment of AI models.
  • Excellent problem-solving and analytical skills, with the ability to translate business requirements into technical solutions.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively with stakeholders at various levels.
  • Track record of driving innovation and staying updated with the latest AI research and advancements.
  • Knowledge of IT operations concepts and processes, such as monitoring, incident management, root cause analysis, and remediation.
  • You have a degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field. A Master's degree is preferred.
  • You have solid experience developing and implementing generative AI models, with a strong understanding of deep learning techniques such as GPT, VAEs, and GANs.
  • You are proficient in Python and have experience with machine learning libraries and frameworks such as TensorFlow, PyTorch, or Keras.
  • You have strong knowledge of data structures, algorithms, and software engineering principles.
  • You are familiar with cloud-based platforms and services, such as AWS, GCP, or Azure.
  • You have experience with natural language processing (NLP) techniques and tools, such as SpaCy, NLTK, or Hugging Face.
  • You are familiar with data visualization tools and libraries, such as Matplotlib, Seaborn, or Plotly.
  • You have knowledge of software development methodologies, such as Agile or Scrum.
  • You possess excellent problem-solving skills, with the ability to think critically and creatively to develop innovative AI solutions.
  • You have strong communication skills, with the ability to effectively convey complex technical concepts to a diverse audience.
  • You possess a proactive mindset, with the ability to work independently and collaboratively in a fast-paced, dynamic environment.
  • Familiarity with DevOps and MLOps practices, including continuous integration, deployment, and monitoring of AI models.
  • ITIL v3 / v4 certification.
  • Hands-on experience with observability and AIOps tools (preferably SolarWinds, Moogsoft, BigPanda, Splunk, ScienceLogic, LogicMonitor, New Relic, AppDynamics, etc.).
  • Experience working with AI frameworks and tools such as TensorFlow, PyTorch, and Scikit-learn.

Other Information:

  • Travel: as required.
  • Job is primarily performed in a Hybrid office environment.

    Qualified applicants may send their latest CV to ([email protected])

Education

Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.

Salary

INR 35-50