Description

Primary Responsibilities:

As an experienced Site Reliability Engineer (SRE) with 6+ years of expertise, you will design, implement, and maintain our systems and infrastructure so that they are highly available, performant, and scalable. You will play a key role in driving the automation, monitoring, and incident response processes that keep our systems reliable, including Java-based applications and related technologies.


 

Key Responsibilities:

1. System Reliability: Design, build, and maintain systems and services with a focus on reliability, scalability, and performance, including Java-based applications. Implement best practices for high availability and fault tolerance.

2. Automation: Develop and maintain automation scripts and tools, including Java-based automation, to streamline system provisioning, configuration management, and deployment processes. Promote Infrastructure as Code (IaC) principles.

3. Monitoring and Alerting: Set up and manage comprehensive monitoring and alerting solutions to detect and respond to performance issues and outages promptly. Utilize tools such as:

- Monitoring: Prometheus, Grafana, Datadog.

- Logging: ELK Stack, Splunk.

4. Incident Response: Lead and participate in incident response activities for Java-based applications and related services, including root cause analysis, resolution, and documentation. Collaborate with cross-functional teams to prevent recurring incidents.

5. Capacity Planning: Monitor system resource utilization, plan for capacity upgrades, and optimize resource allocation to ensure smooth operation during traffic spikes.

6. Performance Optimization: Identify bottlenecks and performance issues in Java applications and infrastructure, and work to resolve them. Implement caching strategies, Java code optimizations, and related techniques.

7. Security and Compliance: Collaborate with security teams to ensure Java applications and systems are secure and compliant with industry standards and regulations. Implement security best practices, including data encryption and access controls.

8. Disaster Recovery: Design and maintain disaster recovery plans and backup strategies for Java-based applications to minimize data loss and downtime in case of failures.

9. Documentation: Create and maintain comprehensive documentation for Java application configurations, procedures, and incident response processes.


 

Qualifications:

• Engineering degree in Computer Science, Information Technology, or related field.

• Minimum 4 years of hands-on experience in Site Reliability Engineering, System Administration, or a related role, with a strong emphasis on Java and related IT technologies.

• Proficiency in Java programming and related frameworks.

• Proficiency in scripting and programming languages (e.g., Python, Bash, Go).

• Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).

• Strong knowledge of cloud platforms (e.g., AWS, Azure, GCP) and infrastructure as code tools (e.g., Terraform, Ansible).

• Expertise in managing and troubleshooting Linux-based systems.

• Strong problem-solving skills and the ability to analyze and resolve complex issues.

• Excellent communication and collaboration skills.

• Relevant certifications such as Certified Kubernetes Administrator (CKA) or AWS Certified DevOps Engineer are a plus.

Education

ANY GRADUATE