Job Title : Kubernetes Engineer AI


Company : Qubrid AI


Location : Hyderabad, Telangana


Created : 2025-04-14


Job Type : Full Time


Job Description

2-3 years of experience required. Must have deep hands-on Kubernetes and Linux experience. This is not a theoretical position.

Qubrid AI is a scalable, high-performance AI platform, and we're looking for a Kubernetes Engineer / Architect with deep experience in container orchestration, scheduling, and GPU-enabled infrastructure to help power it. In this role, you'll design, optimize, and maintain the core infrastructure that supports cutting-edge AI/ML workloads, from model training to large-scale inference, on Kubernetes. You'll work closely with ML engineers, data scientists, and DevOps to enable a seamless, robust, and efficient end-to-end AI platform that runs across hybrid or multi-cloud environments.

Work from Home! Compensation for 2-3 years of experience: 7-8 LPA, plus an annual bonus. Please do not apply if the compensation is not acceptable. You'll work with our founders and US team members, plus an offshore cloud and AI team.

Responsibilities:
- Architect and manage Kubernetes infrastructure to support large-scale AI/ML pipelines and GPU workloads.
- Implement GPU scheduling strategies and integrate components like the NVIDIA Device Plugin, GPU Operator, and MIG configurations (an illustrative scheduling sketch appears at the end of this posting).
- Optimize the deployment and orchestration of deep learning models, distributed training jobs (e.g., Horovod, PyTorch, TensorFlow), and serving frameworks.
- Support Kubeflow, Ray, MLflow, or other AI-centric platforms running on Kubernetes.
- Design and implement scalable multi-tenant environments with isolation, security, and resource quotas.
- Automate infrastructure using Helm, Kustomize, Terraform, or Ansible.
- Build observability for AI workloads using Prometheus, Grafana, and GPU monitoring tools (see the monitoring sketch at the end of this posting).
- Drive infrastructure decisions, lead architecture reviews, and contribute to internal best practices for running production AI workloads.
- Collaborate with cross-functional teams to ensure a frictionless ML engineering experience from data ingestion to production deployment.

Requirements:
- 3-5+ years of hands-on experience with Kubernetes in production environments.
- Expert in the Linux operating system.
- Strong knowledge of containerization technologies (Docker, containerd) and orchestration at scale.
- Experience managing NVIDIA GPU workloads in Kubernetes environments, including drivers, device plugins, GPU sharing, and MIG.
- Familiarity with AI/ML infrastructure components, such as model training pipelines, inference services, and distributed computing frameworks.
- Solid understanding of Kubernetes internals: scheduling, networking, security, RBAC, taints/tolerations, and affinity rules.
- Experience with CI/CD and GitOps tools (e.g., ArgoCD, Flux, Jenkins).
- Proficient in scripting or development (Python, Bash, or Go preferred).
- Experience with cloud-native monitoring/logging stacks (e.g., Prometheus, ELK, Grafana).

Other Nice-to-Have Skills:
- Exposure to ML orchestration tools like Kubeflow, Ray, Airflow, or Metaflow.
- Experience with distributed training and large-scale model serving.
- Familiarity with data versioning tools (e.g., DVC) and model lifecycle platforms (e.g., MLflow, Seldon).
- Certifications such as CKA, CKAD, or relevant cloud certifications (AWS, GCP, Azure).
- Knowledge of high-performance storage solutions and GPU-aware data pipelines.
- Strong experience in Python and writing fault-tolerant code.
- Proficiency in deep learning frameworks (PyTorch, TensorRT, LangChain).
- Experience with image processing technologies (e.g., Stable Diffusion, OpenCV, Dlib, Pillow, NumPy, SIFT).
- Experience with Natural Language Processing (NLP/NLU) technologies and portals (e.g., HuggingFace, NLTK, spaCy, Flair).
- Familiarity with Vertex AI.
- Familiarity with Gemma, Mistral, Llama, etc.
- Experience with person/scene understanding (pose, re-identification, etc.).
- Familiarity with training custom AI models such as detectors (YOLO), classifiers, and transformers.
- Technical knowledge of the latest advancements in AI, especially in vision (e.g., ViTs, CLIP, and LLMs).
- Understanding of Docker and Git.
- Good understanding of optimizing data processing pipelines.
- Experience successfully applying machine learning to solve a real-world problem.
- Ability to work and thrive in a start-up environment.
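
To give a concrete flavour of the GPU-scheduling work described in the responsibilities, here is a minimal sketch of submitting a GPU Pod with the official Kubernetes Python client. It assumes the NVIDIA Device Plugin exposes the nvidia.com/gpu resource and that GPU nodes carry an nvidia.com/gpu:NoSchedule taint; the taint key, image tag, and namespace are illustrative assumptions.

# Minimal sketch: schedule a Pod onto a GPU node using the official
# Kubernetes Python client. Assumes the NVIDIA Device Plugin exposes the
# nvidia.com/gpu resource and that GPU nodes are tainted with
# nvidia.com/gpu:NoSchedule (a common but cluster-specific convention).
from kubernetes import client, config


def build_gpu_pod(name: str = "gpu-smoke-test") -> client.V1Pod:
    container = client.V1Container(
        name="cuda-check",
        image="nvidia/cuda:12.4.1-base-ubuntu22.04",  # illustrative image tag
        command=["nvidia-smi"],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # request exactly one GPU
        ),
    )
    spec = client.V1PodSpec(
        containers=[container],
        restart_policy="Never",
        # Tolerate the assumed taint so the Pod may land on GPU nodes.
        tolerations=[
            client.V1Toleration(
                key="nvidia.com/gpu", operator="Exists", effect="NoSchedule"
            )
        ],
    )
    return client.V1Pod(
        api_version="v1",
        kind="Pod",
        metadata=client.V1ObjectMeta(name=name, labels={"workload": "gpu-test"}),
        spec=spec,
    )


if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=build_gpu_pod())
    print("GPU test Pod submitted")

Running the script submits a one-shot nvidia-smi Pod; the same pattern extends to MIG-sliced resources, node affinity rules, and resource quotas in multi-tenant namespaces.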
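
On the observability side, here is a similar minimal sketch of pulling per-GPU utilization from Prometheus, assuming it scrapes NVIDIA's dcgm-exporter. The Prometheus address and the label names are assumptions and will differ per cluster.

# Minimal sketch: read per-GPU utilization from a Prometheus server that
# scrapes NVIDIA's dcgm-exporter. The Prometheus URL and the label names
# used below are assumptions; adjust them to the cluster's actual setup.
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster address


def gpu_utilization(prom_url: str = PROMETHEUS_URL) -> list[dict]:
    # DCGM_FI_DEV_GPU_UTIL is the dcgm-exporter gauge for GPU utilization (%).
    resp = requests.get(
        f"{prom_url}/api/v1/query",
        params={"query": "avg by (gpu, Hostname) (DCGM_FI_DEV_GPU_UTIL)"},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return [
        {
            "gpu": sample["metric"].get("gpu", "unknown"),
            "node": sample["metric"].get("Hostname", "unknown"),
            "utilization_pct": float(sample["value"][1]),
        }
        for sample in payload["data"]["result"]
    ]


if __name__ == "__main__":
    for row in gpu_utilization():
        print(f"node={row['node']} gpu={row['gpu']} util={row['utilization_pct']:.1f}%")

The same query pattern works for any dcgm-exporter metric and can feed Grafana dashboards, alerting rules, or autoscaling logic for AI workloads.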