Key Responsibilities:
· Infrastructure Management:
o Design, implement, and manage scalable, reliable, and secure infrastructure using cloud services (e.g., AWS, Azure, GCP) and on-premises solutions.
o Automate infrastructure provisioning, monitoring, and maintenance using tools like Terraform, Ansible, or Puppet.
· Monitoring and Incident Response:
o Develop and maintain monitoring, logging, and alerting systems to detect and respond to issues proactively.
o Lead incident response efforts, perform root cause analysis, and implement corrective actions to prevent recurrence.
· Performance Optimization:
o Continuously monitor system performance and optimize infrastructure to meet service level objectives (SLOs) and service level agreements (SLAs).
o Collaborate with development teams to identify and resolve performance bottlenecks.
· Reliability Engineering:
o Implement and advocate for best practices in reliability engineering, including chaos engineering, fault injection, and resilience testing.
o Design and implement disaster recovery and business continuity plans.
· Collaboration and Communication:
o Work closely with cross-functional teams to ensure seamless deployment and integration of new features and updates.
o Communicate effectively with stakeholders, providing updates on system status, performance metrics, and improvement initiatives.
· Compliance and Security:
o Ensure all infrastructure and operations comply with healthcare industry regulations (e.g., HIPAA, HITECH) and security best practices.
o Conduct regular security assessments and audits to identify and mitigate risks.
Required Skills:
· Education:
o Bachelor’s degree in Computer Science, Information Technology, or a related field. Relevant certifications (e.g., AWS Certified DevOps Engineer, Google Cloud Professional Cloud DevOps Engineer) are a plus.
· Experience:
o 3-5+ years of experience in site reliability engineering, DevOps, or a related role.
o Proven experience in managing cloud infrastructure and services.
o Strong background in scripting and automation (e.g., Python, Bash, or other shell scripting).
· Technical Skills:
o Proficiency with infrastructure as code (IaC) tools such as Terraform, Ansible, or Puppet.
o Experience with monitoring and logging tools like Prometheus, Grafana, ELK Stack, or Splunk.
o Solid understanding of networking, security principles, and compliance requirements in the healthcare industry.