Site Reliability Engineer - Networking Support

NVIDIA
Azusa, CA, U.S.

Description

NVIDIA is looking for a Site Reliability Engineer (SRE) to join its Networking Support team. As an SRE at NVIDIA, you will ensure that our customers' production environments remain reliable and available. We are seeking an SRE with the mindset and methodology to maintain, monitor, and troubleshoot data center (DC) networking equipment.

SRE's culture of diversity, intellectual curiosity, problem solving and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you will be doing:

Monitor equipment, applications, and processes through various tools, applications, and consoles

Rapidly debug and triage incidents and user-reported issues

Work with Tier 2 and Tier 3 support as required

Make valuable contributions to the overall health, performance, and reliability of the networking equipment and Infrastructure Services

Develop documentation for Operations processes

Work rotating shifts, including weekends and holidays, and overtime as required

What we need to see:

Must live in close proximity to Phoenix, Arizona

BS or diploma in the Information Technology field, or equivalent experience

4+ years of site reliability engineering experience working on large-scale distributed microservices in a production environment, with a real passion for automation and tooling

Must be able to operate network devices, including racking and stacking and replacing modules and other components such as cabling and transceivers

Expertise in incident management, organizational change, and problem management processes. Ability to detect all service-impacting issues and to handle accurate triage, partner communication, impact containment, service restoration, and post-incident follow-up

Proven strengths in problem-solving and root-causing issues, while continuously seeking ways to drive optimization, efficiency, and the bottom line

Experience performing operational activities, including batch processing, system backups, and maintenance; monitoring and providing Level 1 network and server support; monitoring and responding to data center environmental alarms; and monitoring various application systems

Experience handling special requests for network configuration changes, system reboots, server and network switch reboots, file restores, web updates, and terminal messaging

Knowledge of TCP/IP networks and troubleshooting tools; knowledge of the Linux operating system and associated tools

Able to work a rotating shift schedule that includes days, nights, weekends and holidays as necessary

Key Skills

TCP/IP, Linux, system reboots, server reboots

Education

Bachelor’s Degree


Posted On: 25-Nov-2024
Experience: 4+ years of experience
Availability: Remote
Category: Sr. Site Reliability Engineer
Tenure: Full-Time Position