Site Reliability Engineer (SRE) - AI Infrastructure (San Francisco) Job at Hamilton Barnes Associates Limited, San Francisco, CA

  • Hamilton Barnes Associates Limited
  • San Francisco, CA

Job Description

Are you looking for an exciting new opportunity?

Join a stealth-mode hyperscale data center startup building a next-generation AI and cloud platform designed for startups and advanced research, powered by thousands of H100, H200, and B200 GPUs available on demand. Their platform supports everything from rapid experimentation to full-scale model training and inference, with flexible orchestration via Slurm, Kubernetes, or direct SSH access.

This is a rare opportunity to work at the intersection of hyperscale infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment. If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it.

Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimize resource scheduling, GPU utilization, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware-accelerated environments is a strong plus.

Benefits

  • Equity

Salary

  • $300,000 gross per year

Job Tags

Full time, flexible hours
