Infrastructure Engineer
- San Francisco, California, United States
About
Sciforium is an AI infrastructure company that develops advanced AI models and operates a proprietary serving platform. Following a new multi-million-dollar funding round and direct sponsorship from AMD, including dedicated support from AMD engineers, the team is entering full operational mode with ambitious growth plans. The company is expanding its engineering team with multiple ML engineers who will develop GPU-level kernels and contribute to the core technology stack. Sciforium offers a collaborative work environment with high impact and autonomy.
Role Overview
We are looking for a hands-on engineer to own the health and performance of our GPU cluster. You will be the primary custodian of our compute infrastructure, bridging the gap between hardware operations and machine learning workflows. You will manage everything from the Linux OS layer up to the ML software stack (CUDA/ROCm, JAX, vLLM), ensuring our research and engineering teams have a stable, high-performance environment.
Key Responsibilities
1. System Health & Reliability (SRE)
- On-Call Response: Serve as the primary point of contact for system outages and GPU failures, ensuring rapid resolution to minimize downtime.
- Cluster Monitoring: Proactively monitor GPU health, thermal metrics, and system loads to catch issues before they impact active workloads.
- Vendor Liaison: Work closely with data center staff and third-party hardware engineers to coordinate repairs, part replacements, and physical maintenance.
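For illustration, the Cluster Monitoring duty above is the kind of task we expect to be automated with small scripts. The sketch below is a minimal, hypothetical example that polls GPU temperatures through nvidia-smi and flags anything above an assumed threshold; the threshold value and the alerting hook are placeholders rather than part of this role description, and an AMD node would query rocm-smi instead.

import subprocess

TEMP_LIMIT_C = 85  # assumed alert threshold; tune to data-center guidance

def read_gpu_temps():
    """Return (gpu_index, temperature_C) pairs reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(field) for field in line.split(","))
            for line in out.strip().splitlines()]

def main():
    for idx, temp in read_gpu_temps():
        if temp >= TEMP_LIMIT_C:
            # Placeholder: a real deployment would page or post to an alerting system.
            print(f"ALERT: GPU {idx} running at {temp} C")

if __name__ == "__main__":
    main()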
2. Linux & Network Administration
- OS Management: Install, patch, and maintain Linux operating systems (Ubuntu/CentOS/RHEL) across the cluster, ensuring consistent configuration management.
- Security & Access: Configure and maintain firewalls, VPNs, and SSH access controls to secure our infrastructure.
- Identity Management: Manage LDAP/Active Directory for centralized user authentication and ensure storage permissions (NFS/GPFS/Lustre) are correctly applied.
3. GPU & ML Stack Engineering
- Deployment: Lead the bring-up of new GPU nodes, ensuring they are integrated smoothly into the existing cluster.
- Driver Management: Build and install custom kernel modules, GPU drivers (NVIDIA & AMD), and necessary plugins.
- Software Stack Maintenance: Maintain and update the ML libraries and frameworks, specifically JAX, PyTorch, CUDA, and ROCm.
- Debugging: Troubleshoot complex hardware-software interaction issues, such as compilation failures with vLLM on ROCm or CUDA memory leaks.
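As a hedged example of the stack-maintenance work listed above, a quick post-upgrade sanity check might confirm that the frameworks still see the accelerators after a driver or library update; the snippet below is one hypothetical way to do that with PyTorch and JAX (ROCm builds of PyTorch report AMD GPUs through the same torch.cuda API).

# Quick post-upgrade check: do PyTorch and JAX both see the GPUs?
import torch
import jax

def check_stack():
    n = torch.cuda.device_count()          # ROCm builds report AMD GPUs here too
    print(f"PyTorch sees {n} device(s)")
    for i in range(n):
        print(" ", torch.cuda.get_device_name(i))
    print("JAX devices:", jax.devices())   # should list GPU devices, not just CPU

if __name__ == "__main__":
    check_stack()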
Qualifications
Must-Haves:
- 3+ years of experience in Linux Systems Administration (handling boot processes, kernel modules, systemd, etc.).
- Deep understanding of GPU architecture and experience managing NVIDIA (H100/A100) or AMD (MI300/MI250) accelerators.
- Experience managing networking security (VPNs, iptables/firewalld) and directory services (LDAP/FreeIPA).
- Proficiency in Bash scripting and Python for automation.
- Experience with ML software stacks (CUDA toolkit, ROCm, cuDNN, NCCL).
Nice-to-Haves:
- Experience with cluster job schedulers (Slurm, Kubernetes/Run:AI).
- Familiarity with vLLM serving and inference optimizations.
- Experience using configuration management tools like Ansible, SaltStack, or Terraform.
- Previous experience in a research or startup environment supporting ML Researchers.
Benefits
- Medical, dental, and vision insurance
- 401(k) plan
- Daily lunch, snacks, and beverages
- Flexible time off
- Competitive salary and equity
Equal Opportunity
Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.
Languages
- English