About Us
WeCloudData is a premier learning academy dedicated to offering top-quality data and AI training to our students and corporate clients. Recognized as a leading school in the data and AI category, we have expanded our offerings to include consultancy and career services, supporting our students beyond the classroom. Over the next decade, our goal is to influence millions of learners and drive positive change in the future of work through our commitment to excellent learning experiences, technological innovation, and work-integrated learning environments. We pride ourselves on setting industry standards and being at the forefront of data and AI education.
Job Description
WeCloudData is seeking an experienced AI Infrastructure Operations Instructor to deliver a hands-on bootcamp focused on GPU-enabled AI systems, Kubernetes operations, AI deployment, observability, and infrastructure monitoring.
The instructor will train students to deploy, operate, monitor, and troubleshoot AI workloads in modern cloud and data center environments. This role focuses on the infrastructure and operational side of AI systems rather than AI model development or research.
The ideal candidate has experience operating production AI platforms, deploying containerized applications, managing Kubernetes environments, and working with GPU-enabled infrastructure. The instructor should be comfortable teaching beginners and helping students transition into AI Infrastructure Operations, MLOps, and AI Platform Engineering careers.
Key Responsibilities
Instruction & Delivery
- Deliver instructor-led lectures and workshops
- Conduct hands-on labs and troubleshooting exercises
- Mentor students throughout the bootcamp
- Support students in completing capstone projects
- Evaluate assignments and provide technical feedback
Curriculum Coverage
- Teach topics including:
  - AI infrastructure fundamentals
  - Training vs inference workloads
  - GPU fundamentals and operations
  - Linux administration
  - Python scripting
  - Containerization using Docker
  - AI model deployment and serving
  - Kubernetes operations
  - GPU workload scheduling
  - Monitoring and observability
  - AI platform operations
  - AI infrastructure troubleshooting
  - AI data center fundamentals
- The instructor should be capable of guiding students through all major topics outlined in the bootcamp curriculum.
Lab Management
- Prepare cloud-based lab environments
- Configure Kubernetes clusters
- Manage GPU-enabled infrastructure
- Support Docker and container deployment labs
- Create troubleshooting scenarios and exercises
- Maintain capstone project environments
Professional Experience
Minimum
- 5+ years of experience in one or more of the following:
  - Platform Engineering
  - Cloud Engineering
  - Site Reliability Engineering (SRE)
  - MLOps
  - AI Platform Operations
  - Infrastructure Engineering
  - DevOps Engineering
Preferred
- 2+ years supporting AI or machine learning workloads in production
Technical Expertise
- Linux
  - Strong hands-on experience with:
    - Linux administration
    - Shell scripting
    - Process management
    - Networking fundamentals
    - System troubleshooting
- Containers
  - Strong experience with:
    - Docker
    - Container image management
    - Containerized application deployment
- Kubernetes
  - Practical experience with:
    - Pods
    - Deployments
    - Services
    - Ingress
    - Autoscaling
    - Helm
    - Monitoring Kubernetes workloads
- Cloud Platforms
  - Experience with one or more:
    - AWS
    - Azure
    - Google Cloud
    - Alibaba Cloud
- Monitoring & Observability
  - Experience with:
    - Prometheus
    - Grafana
    - Logging platforms
    - Alerting systems
    - Infrastructure monitoring
AI Infrastructure Knowledge
- Must understand:
  - Training vs inference workloads
  - AI deployment architectures
  - Model serving concepts
  - GPU utilization concepts
  - AI workload bottlenecks
  - Throughput and latency metrics
  - AI platform operations
- The instructor does not need to be an AI researcher but should be comfortable deploying and operating AI workloads.
Preferred Qualifications
GPU & AI Infrastructure
- Experience with:
  - NVIDIA GPUs
  - CUDA ecosystem
  - GPU monitoring tools
  - GPU scheduling concepts
  - Multi-GPU systems
- Bonus:
  - NVIDIA AI Enterprise
  - NVIDIA NIM
  - Triton Inference Server
  - vLLM
  - Ray Serve
MLOps & AI Platform Experience
- Experience with:
  - MLflow
  - Kubeflow
  - Model serving platforms
  - Vector databases
  - RAG deployment architectures
  - LLM inference systems
- These align strongly with the capstone projects proposed in the curriculum.
Preferred Certifications
Strongly Preferred
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Application Developer (CKAD)
Nice to Have
- NVIDIA NCA-AIIO
- AWS Solutions Architect
- AWS SysOps Administrator
- Azure Administrator
- Azure AI Infrastructure certifications
Application question(s):
- We need both onsite and online instructors. Would you be willing to travel to Saudi Arabia?
Work Location: Remote