Job Description

About Runloop

Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform eliminates friction in environment setup and dependencies, enabling teams to experiment, iterate, and deploy seamlessly. We're a small but dedicated team working to deliver a rock-solid platform that empowers innovation.

The Role

We're looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, observability, performance, and security of our core platform, the foundation upon which our users build. You'll work closely with engineering to maintain resilient systems that power our code sandboxes, while mentoring peers on reliability practices. This role blends deep operational expertise with a software engineering mindset.

What You'll Do

Design, operate, and improve production infrastructure on AWS, GCP, or Azure.
Define and monitor SLIs/SLOs, manage error budgets, and maintain observability with Prometheus, Grafana, and logging/tracing frameworks.
Build automation for deployments, scaling, and recovery, reducing toil and creating self-healing systems.
Lead incident response, root-cause analysis, and blameless postmortems.
Collaborate with developers to design scalable, reliable services.
Optimize distributed systems, networking, and sandbox performance.
Plan for capacity growth and support safe release/change management.
Mentor engineers on reliability and frontend distributed systems (CDNs, caching, client observability).

Qualifications

Proven experience as an SRE, DevOps Engineer, or similar role.
Strong programming skills (Python or Go preferred).
Deep knowledge of containerization (Docker, Kubernetes).
Expertise in infrastructure-as-code (Terraform or Pulumi).
Strong understanding of networking, Linux, and system security.
Hands-on experience with distributed systems and observability (metrics, logs, tracing).
Skilled in incident management, on-call rotations, and postmortem processes.
Ability to mentor and influence best practices across teams.

Bonus Points

Experience with chaos engineering, CI/CD for frontend delivery, or observability tools like Sentry, RUM, or synthetic monitoring.

Benefits

Competitive salary and equity.
Comprehensive health, dental, and vision insurance for you and your dependents.
Free lunch and snacks.
Opportunity to shape the future of AI-driven software engineering in a high-impact role.

Location

On-site in San Francisco, CA (4 days/week in the office, with 1 optional WFH day).

Join Us

If you're passionate about building resilient systems that empower developers and want to shape the future of AI-driven software engineering, we'd love to hear from you. Join Runloop and help build the infrastructure that powers tomorrow's AI.

Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity, or any other characteristic protected by law.

Job Tags

Full time, Work at office, Work from home
