
CUDA Kernel Software Engineer
- Shanghai
- Long-term
- Full-time
- Design and implement highly optimized ML kernels (e.g., matrix operations, attention mechanisms) for AMD GPUs using ROCm.
- Profile, debug, and tune kernel performance to maximize hardware utilization for AI workloads.
- Collaborate with ML researchers and framework developers to integrate kernels into AI frameworks (e.g., PyTorch, TensorFlow) and inference engines (e.g., vLLM, SGLang).
- Contribute to the ROCm software stack by identifying and resolving bottlenecks in libraries such as MIOpen, BLAS, or Composable Kernel.
- Stay current with AI/ML trends (LLMs, quantization, distributed inference) and apply them to kernel development.
- Document and communicate technical designs, benchmarks, and best practices.
- Troubleshoot and resolve issues related to GPU compatibility, performance, and scalability.
- 2+ years of experience in GPU kernel development for machine learning (ROCm or CUDA).
- Proficiency in C/C++ and Python, with experience in performance-critical programming.
- Strong understanding of ML frameworks (PyTorch, TensorFlow) and GPU-accelerated libraries.
- Basic knowledge of modern AI technologies (LLMs, transformers, inference optimization).
- Familiarity with parallel computing, memory optimization, and hardware architectures.
- Problem-solving skills and the ability to work in a fast-paced environment.
- Direct experience with AMD ROCm development (HIP, MIOpen, Composable Kernel).
- Knowledge of LLM-specific optimizations (e.g., FlashAttention, PagedAttention in vLLM).
- Experience with distributed training/inference or model compression techniques.
- Contributions to open-source ML projects or GPU compute libraries.
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field.