Job Description

We are seeking highly specialized candidates willing to work fully remotely in the India / Korea time zone for an exciting project. Salary is not a constraint for the right candidate.

The role is Compute Service Engineer, requiring 8+ years of experience. This team is responsible for compute system design, hardware optimization, OS hardening, system provisioning, deployment, automation, and operations.

Roles and Responsibilities

Design and Implementation:
● Design, implement, and maintain compute services for both GPU and non-GPU environments.
● Develop and optimize algorithms for high-performance computing tasks, especially in the AI/ML training and inference domain.

Infrastructure Management:
● Manage and monitor compute clusters, ensuring high availability and performance.
● Implement and maintain automation scripts for infrastructure provisioning and management.

Performance Optimization:
● Analyze and optimize the performance of compute workloads.
● Implement best practices for resource utilization and efficiency.

Collaboration:
● Work closely with data scientists, researchers, and other engineering teams to understand and meet their compute requirements.
● Collaborate with hardware vendors to evaluate and integrate new technologies.

Security and Compliance:
● Ensure that compute services comply with security policies and industry standards.
● Implement and maintain security measures to protect data and compute resources.

Troubleshooting and Support:
● Provide support for compute-related issues, including debugging and resolving hardware and software problems.
● Develop and maintain documentation for troubleshooting procedures and best practices.

Continuous Improvement:
● Stay updated with the latest advancements in compute technologies and integrate them into the infrastructure.
● Continuously improve the reliability, scalability, and performance of compute services.

Qualifications

Education:
● Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
● NVIDIA and AI certification.

Experience:
● 8+ years of experience managing on-premise GPU or non-GPU systems.
● Proven experience in managing and optimizing GPU and non-GPU compute environments.
● Skills in building and operating AI infrastructure.
● Experience with high-performance computing (HPC) and parallel processing, including bare-metal and large-scale virtual environments.
● Experience implementing virtualization architectures, leveraging expertise with Kubernetes distributions and cloud technologies in bare-metal environments.
● Proficiency in hardware technologies such as SR-IOV, DPU, and GPU, with proven experience in implementing these technologies in virtualized and containerized environments.

Technical Skills:
● Proficiency in programming languages such as Python, C++, or similar.
● Experience with infrastructure as code (IaC) tools like Terraform, Ansible, or similar.
● Familiarity with containerization and orchestration tools like Docker and Kubernetes.
● Familiarity with the technologies underlying Kubernetes, including CRI, CSI, CNI, Operators, the GPU device plugin, and RDMA/InfiniBand integration.
● Knowledge of cloud platforms (AWS, Azure, GCP) and their compute services.

Soft Skills:
● Strong problem-solving skills and attention to detail.
● Excellent communication and collaboration skills.
● Ability to work in a fast-paced, dynamic environment.

Job Title

Company : Subhanu Technologies & Solutions

Location : Kollam, Kerala

Created : 2025-01-07

Job Type : Full Time