DevOps Engineer - AI Cloud GPU Compute Company - Massive Opportunity - Remote-hk

  • Date: 17 Jul 2024
  • Location: Remote
  • Work Type: Permanent / Full Time
About the Company:
Our client is a groundbreaking technology company pioneering GPU-based compute infrastructure. Specializing in innovative solutions for various industries, from AI and machine learning to high-performance computing (HPC), our client is committed to pushing the boundaries of computational capabilities. Leveraging the latest advancements in hardware and software, they empower their clients with unparalleled computational resources.

About the Role:
We are in search of a highly skilled and motivated Infrastructure Operations Engineer to join our vibrant team. As an integral part of the InfraOps team, you will be instrumental in managing and optimizing our GPU-based compute infrastructure across multiple locations and partners, ensuring maximum performance, scalability, and reliability.


Responsibilities:
Infrastructure Management:
  • Deploy, configure, and maintain GPU-based compute infrastructure, including servers, storage, networking, and the associated software stack, for compute sourced from numerous providers worldwide, with hardware ranging from 4090s to H200s.
Monitoring and Optimization:
  • Implement robust monitoring and alerting systems to identify performance bottlenecks, resource constraints, and potential failures proactively. Continuously optimize infrastructure for enhanced performance, efficiency, and cost-effectiveness.
Automation and Orchestration:
  • Develop automation scripts and tools to streamline deployment, configuration, and management of infrastructure components. Implement infrastructure as code (IaC) principles to enable rapid provisioning and scaling.
Security and Compliance:
  • Enforce security best practices to safeguard sensitive data and ensure compliance with regulations and industry standards. Conduct regular security audits and vulnerability assessments.
Incident Response and Troubleshooting:
  • Provide tier-3 support for infrastructure-related issues, investigating root causes and implementing timely resolutions. Participate in on-call rotation to address critical incidents outside regular business hours.
Capacity Planning and Scaling:
  • Collaborate with cross-functional teams to forecast resource requirements, plan capacity upgrades, and scale infrastructure to accommodate growing workloads and user demands.
Documentation and Knowledge Sharing:
  • Maintain comprehensive documentation of infrastructure configurations, procedures, and troubleshooting guidelines. Share knowledge and best practices with team members to facilitate continuous learning and skill development.
Requirements:
  • Experience in infrastructure operations, preferably in a DevOps, SRE, Sales Engineering, or Solution Architect role, with a focus on GPU compute.
  • Proficiency in managing GPU-based compute infrastructure, including NVIDIA GPUs and CUDA programming.
  • Strong expertise in Linux system administration and scripting (e.g., Bash, Python).
  • Experience with configuration management tools (e.g., Ansible, Chef, Puppet) and version control systems (e.g., Git).
  • Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Solid understanding of networking concepts, protocols, and troubleshooting techniques.
  • Strong communication skills and the ability to collaborate effectively with cross-functional teams; Mandarin language proficiency is a significant bonus.
Preferred Qualifications:
  • Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and hybrid cloud architectures.
  • Knowledge of HPC frameworks and job scheduling systems (e.g., Slurm, PBS Pro).
  • Familiarity with GPU-accelerated libraries and frameworks (e.g., TensorFlow, PyTorch, CUDA Toolkit).
  • Understanding of cybersecurity principles and practices, including encryption, access controls, and threat detection/prevention.
  • Bonus points for familiarity with Web3 (cryptocurrency, tokenization of RWAs, mining/staking, etc.).
Benefits:
  • Competitive compensation structure with flexibility regarding fiat/token mix.
  • Flexible benefits package, depending on location and setup.
  • Salary is adaptable based on location and setup.
  • Flexible work hours and remote work options.
Why Our Client?
Join a team dedicated to democratizing access to high-performance computing for AI. Enjoy autonomy and resources to significantly influence product strategy and contribute to the growth of a rapidly scaling company.

Excited to pioneer the future of AI compute solutions? Apply today!

Skip the line...
If you've read this far and you think you have what it takes to be successful in this role, then skip the line and email me directly at

Please include your resume and a brief note showcasing your specific experience with GPU compute infrastructure. Outline what you know about AI, GPUs, blockchain, and H200s, and why you would be the perfect fit for this role.