Description

Job Description:
Objectives of this role:
Run the production environment by monitoring availability and taking a holistic view of system health.
Build software and systems to manage platform infrastructure and applications.
Improve reliability, quality, and time-to-market of our suite of software solutions.
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
Provide primary operational support and engineering for multiple large-scale distributed software applications.

Responsibilities:
Gather and analyse metrics from operating systems as well as applications to assist in performance tuning and fault finding.
Partner with the development team, data scientists, and MLOps architects/engineers to improve services through rigorous testing and release procedures.
Participate in system design consulting, platform management, production troubleshooting, and capacity planning.
Create and manage sustainable systems and services through automation and uplifts.
Balance feature development speed and reliability with well-defined service-level objectives.

Required skills and qualifications:
Ability to program (structured and OOP) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript.
Experience working with AWS services such as Amazon S3, Amazon SageMaker, and Amazon Bedrock.
Excellent knowledge of cloud-native infrastructure, such as AWS Lambda and OpenShift.
Good understanding of API management and the ability to troubleshoot API-related issues.
Automation mindset for managing cloud infrastructure using AWS CloudFormation or Terraform.
Impeccable creative and communication skills.
Ability to solve problems in a fast-paced, high-stakes environment.
Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.

Education

Any graduate