Inference Optimization Engineer (LLM and Runtime)
Location
Bengaluru, Karnataka, India
Job Type
Full-Time
Experience Level
Mid-Level
Salary Range
Not disclosed
Job Description
Hiring for a client.

We are seeking a highly skilled and innovative Inference Optimization Engineer (LLM and Runtime) to design, develop, and optimize cutting-edge AI systems that power intelligent, scalable, and agent-driven workflows. This role blends frontier generative AI research with robust engineering and requires expertise in machine learning, deep learning, and large language models (LLMs), along with awareness of current industry trends. The ideal candidate will collaborate with cross-functional teams to build production-ready AI solutions that address real-world business challenges while keeping our platforms at the forefront of AI innovation.

Key Tasks and Accountability:
Optimize and customize large-scale generative models (LLMs) for efficient inference and serving.
Apply and evaluate advanced model optimization techniques such as quantization, pruning, distillation, tensor parallelism, and caching strategies to improve model efficiency, throughput, and inference performance.
Implement custom fine-tuning pipelines using parameter-efficient methods (LoRA, QLoRA, adapters, etc.) to achieve task-specific goals while minimizing compute overhead.
Optimize runtime performance of inference stacks using frameworks like vLLM, TensorRT-LLM, DeepSpeed-Inference, and Hugging Face Accelerate.
Design and implement scalable model-serving architectures on GPU clusters and cloud infrastructure (AWS, GCP, or Azure).
Work closely with platform and infrastructure teams to reduce latency, memory footprint, and cost-per-token during production inference.
Evaluate hardware–software co-optimization strategies across GPUs (NVIDIA A100/H100), TPUs, or custom accelerators.
Monitor and profile performance using tools such as Nsight, PyTorch Profiler, and Triton Metrics to drive continuous improvement.

Key Requirements:
Education & Experience
Ph.D. in Computer Science or a related field, with a specialization in Deep Learning, Generative AI, or Artificial Intelligence and Machine Learning (AI/ML).
2–3 years of hands-on experience in large language model (LLM) or deep learning optimization, gained through academic or industry work.
Skills
Strong analytical and mathematical reasoning ability with a focus on measurable performance gains.
Collaborative mindset, with the ability to work across research, engineering, and product teams.
Pragmatic problem-solver who values efficiency, reproducibility, and maintainable code over theoretical exploration.
Curiosity-driven attitude: keeps up with emerging model compression and inference technologies.

What You’ll Do
Take ownership of the end-to-end optimization lifecycle, from profiling bottlenecks to delivering production-optimized LLMs.
Develop custom inference pipelines capable of high throughput and low latency under real-world traffic.
Build and maintain internal libraries, wrappers, and benchmarking suites for continuous performance evaluation.

What You Will Bring
Hands-on experience in building and optimizing machine learning or agentic systems at scale.
A builder’s mindset: bias toward action, comfort with experimentation, and enthusiasm for solving complex, open-ended challenges.
Startup DNA: bias to action, comfort with ambiguity, love of fast iteration, and a flexible growth mindset.
About QuickHyre AI
QuickHyre AI is an online job portal and HRMS software platform for hiring and workforce management, trusted by over 10 million active users. It connects freshers and experienced professionals across all career levels with verified job opportunities, while enabling companies to hire, onboard, and manage employees through a single system. Candidates use QuickHyre AI to search and apply for jobs across entry-level, mid-level, and senior roles, including internships and contract positions across tech and non-tech domains. Employers use QuickHyre AI as a hiring platform, applicant tracking system (ATS), and HRMS to post jobs, screen candidates, manage applications, onboard hires, and maintain employee records. Built for startups, growing companies, and enterprises, QuickHyre AI reduces irrelevant applications, improves hiring efficiency, and simplifies HR operations. By combining job discovery, online recruitment, and HR management software, QuickHyre AI supports end-to-end talent acquisition and workforce administration at scale.