Position Summary
The Senior Reliability Engineer (Infrastructure) is responsible for ensuring the reliability, availability, and recoverability of JetBlue’s critical infrastructure platforms. This role applies engineering discipline to operational challenges, leads response to complex incidents, and drives improvements that reduce operational risk over time. The Senior Reliability Engineer works closely with cloud, platform, network, and application teams to ensure infrastructure systems are observable, resilient, and safe to operate in production, while exhibiting the JetBlue values of Safety, Caring, Integrity, Passion, and Fun.
Essential Responsibilities
- Own reliability outcomes for critical infrastructure platforms supporting JetBlue production systems.
- Define and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for infrastructure capabilities.
- Lead the response to complex infrastructure incidents, including diagnosis and resolution, as Incident Commander or senior technical authority.
- Participate in a 24x7 on-call rotation and help improve incident response practices.
- Diagnose and mitigate failures across Linux systems, Kubernetes platforms, Azure cloud infrastructure, and networking layers.
- Review and approve high-risk infrastructure changes with consideration for blast radius, rollback readiness, and dependency impact.
- Identify and mitigate capacity, scaling, and saturation risks across infrastructure systems.
- Improve monitoring, alerting, and dashboards so they reflect actual system health and customer impact.
- Reduce operational toil through automation, tooling, and reliability-focused engineering improvements.
- Develop and maintain operational documentation, runbooks, and recovery procedures.
- Lead blameless post-incident reviews and drive corrective actions to prevent repeat incidents.
- Mentor engineers on operational excellence, reliability practices, and incident response.
- Collaborate with cloud, platform, network, and security teams to ensure reliable and secure infrastructure operations.
- Ensure infrastructure platforms meet regulatory, compliance, and security requirements as applicable.
- Other duties as assigned.