About the Role
We are looking for a Lead Data Center Technician to serve as the senior hands-on technical presence at our data center site. You will lead break/fix and hardware lifecycle operations for GPU compute clusters, govern high-risk change work, act as the top operational escalation point for incidents, and mentor a growing team of technicians. This is a leadership role for a seasoned technician who thrives in high-density, mission-critical environments and wants to help build the operational backbone of a fast-scaling AI infrastructure company.
Key Responsibilities
Hardware Operations & Break/Fix
- Own lifecycle management for GPU servers, networking equipment, and supporting components, including diagnosis, replacement, and performance validation, without impacting live workloads
- Lead rack-and-stack, cabling, and commissioning for new GPU cluster deployments (including NVIDIA HGX-class platforms)
- Set and enforce standards for hardware interventions, power distribution work, and post-maintenance validation
Network Installation & Augmentation
- Lead planning, risk assessment, and execution for structured cabling, InfiniBand/Ethernet fabric builds, and network augmentations
- Define acceptance criteria and sign-off gates for physical network changes; validate end-to-end connectivity before handover
Incident Response & Troubleshooting
- Serve as the highest non-managerial escalation point for site incidents; lead triage, root-cause analysis, and corrective actions
- Own the ticket queue, trend analysis, and resolution documentation, keeping resolution times within SLA targets
- Participate in the on-call rotation and coordinate coverage for peak events
Safety, Compliance & Documentation
- Govern adherence to SOPs/MOPs and site rules for all high-risk work, including electrical and liquid-cooling-adjacent activities
- Maintain comprehensive documentation, audit trails, and change records; support customer and compliance audits
- Enforce physical security procedures and access protocols
Leadership & Continuous Improvement
- Mentor and develop technicians; set expectations for execution quality, safety, and customer focus
- Lead post-incident reviews and implement durable process improvements
- Partner with engineering, capacity, and networking teams to reduce single points of failure and improve monitoring/alerting
Pay: $65.00-$72.12 per hour
Work Location: In person