AI Platform Engineer – High Performance Computing
Location
100% Remote India
Job Type
Full-Time
Experience Level
Mid-Level
Salary Range
18,00,000 to 20,00,000
Job Description
The High-Performance Computing (HPC) AI Managed Services Engineer will be responsible for the design, deployment, and ongoing operational support of large-scale GPU-accelerated clusters tailored for AI and ML workloads. This role combines deep systems engineering with specialized knowledge of AI infrastructure like NVIDIA DGX platforms and high-speed interconnects.
Core Responsibilities
• Infrastructure Management: Build and maintain HPC compute nodes, GPU clusters, and high-performance storage systems (e.g., Lustre, VAST Data, Weka).
• Workload Orchestration: Implement and manage job schedulers like Slurm, LSF, or PBS, and container orchestration platforms like Kubernetes for AI model training and inference.
• Performance Tuning: Monitor and optimize system health, focusing on GPU utilization, network throughput (InfiniBand/RoCE), and storage I/O for Large Language Models (LLMs).
• Managed Services Support: Provide L1/L2/L3 support, including troubleshooting hardware/software incidents, managing escalations, and interacting directly with customers for project implementation.
• Automation: Develop automation scripts using Python, Bash, Ansible, or Terraform for rapid cluster deployment and configuration management.
Required Skills & Experience
• Experience: Typically 3–7+ years in HPC engineering, Linux system administration (RHEL/Ubuntu), or production AI infrastructure support.
• NVIDIA Ecosystem: Deep knowledge of CUDA, NCCL, and GPU-enabled server management.
• Networking: Proficiency in high-speed interconnects such as InfiniBand, RoCE, and Spectrum-X.
• Parallel File Systems: Hands-on experience with Lustre, GPFS, or BeeGFS for distributed data processing.
• Tools: Familiarity with monitoring and logging tools like Grafana, Prometheus, or Kibana.
Educational Qualifications
• Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related technical field.
Common Professional Tools & Platforms
• Schedulers: Slurm, Kubernetes, PBS Professional
• Automation: Ansible, Terraform, Python
• Hardware Partners: NVIDIA, Dell, HPE, AMD
About QuickHyre AI
QuickHyre AI is an online job portal and HRMS software platform for hiring and workforce management, trusted by over 10 million active users. It connects freshers and experienced professionals across all career levels with verified job opportunities, while enabling companies to hire, onboard, and manage employees through a single system. Candidates use QuickHyre AI to search and apply for jobs across entry-level, mid-level, and senior roles, including internships and contract positions across tech and non-tech domains. Employers use QuickHyre AI as a hiring platform, applicant tracking system (ATS), and HRMS to post jobs, screen candidates, manage applications, onboard hires, and maintain employee records. Built for startups, growing companies, and enterprises, QuickHyre AI reduces irrelevant applications, improves hiring efficiency, and simplifies HR operations. By combining job discovery, online recruitment, and HR management software, QuickHyre AI supports end-to-end talent acquisition and workforce administration at scale.