Senior Principal Site Reliability Engineer

Define reliability architecture for AI services

Set technical direction for building and scaling AI services, ensuring production-grade reliability and performance. Deliver SLO frameworks and fault tolerance patterns for AI compute and platform services.

Why This Role?

Influence product engineering teams with exceptional technical expertise

Required Skills

Site Reliability Engineering, AI Infrastructure, Cloud Computing, System Design, Mentorship

Keywords

AI Infrastructure, Site Reliability Engineering, Cloud Technology, GPU Compute, SLO Frameworks

Original description from Jobicy

Do you want to shape the future of AI infrastructure?

Ready to define the reliability architecture for AI products, from GPU compute to globally distributed inference, ensuring performance and reliability at scale.

Join the Akamai AI Team

Akamai's Cloud Technology Group offers AI infrastructure globally. The GPU compute platform provides dedicated resources, from single GPUs to full clusters. These resources support training, simulation, inference, and various workloads. Site Reliability Engineering is integrated early to guarantee production-grade reliability and performance.

Partner with the best

As Senior Principal SRE for AI, this role involves setting technical direction for building, operating, and scaling AI services. Responsibilities include writing code, designing systems, and solving complex reliability issues. Additionally, mentoring team members, defining technical standards, and promoting engineering best practices are essential. Success depends on achieving influence with product engineering teams through exceptional technical expertise.

As a Principal Site Reliability Engineer, you will be responsible for:

Defining the reliability architecture for Akamai's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity



Source: Jobicy
Job Type: Full time
Location: Regional Remote · Remote
Category: Engineering
Seniority: Senior
Posted: Mar 29, 2026