dafu-wu/README.md

Dafu Wu

AI Infrastructure Engineer

Building the substrate for AI systems that train, reason, and improve themselves — from GPU clusters to agentic RL pipelines.


What I Work On

Large-scale LLM Training Infrastructure
Led the end-to-end design and implementation of cloud-native AI training infrastructure, built from the ground up to support large-scale distributed training across heterogeneous GPU clusters (A100, H100, GB200). Integrated GPU scheduling, high-performance networking, and distributed training frameworks (PyTorch, Ray), achieving high cluster-wide Model FLOPs Utilization (MFU). Drove system-level performance optimization across the compute, networking, and storage layers, addressing bottlenecks in NCCL communication, GPU utilization, and I/O throughput in multi-node environments.

GPU Cluster Scheduling & Kubernetes-Native AI Platforms
Architected a multi-cluster scheduling system spanning 5 GPU clusters, enabling cross-cluster workload orchestration, resource pooling, and improved global utilization. Reviewer and contributor to Volcano (CNCF), with contributions to gang scheduling, capacity plugin correctness, and Dynamic Resource Allocation (DRA) resource management. Designed end-to-end ML platforms on Kubernetes, covering job lifecycle management, GPU affinity, multi-tenancy, autoscaling, and observability.

Agentic RL & Inference Infrastructure
Built infrastructure for agentic RL training and inference. Integrated RL training frameworks (veRL, AReaL, NeMo-RL) and high-throughput inference engines (vLLM, SGLang) into production platforms. Built a pluggable OSWorld sandbox provider on the training platform, enabling closed-loop RL training pipelines for computer-use agents at scale.


Technical Stack

Layer                Technologies
Training Frameworks  PyTorch, Ray
RL Training          veRL, AReaL, NeMo-RL
Inference            vLLM, SGLang
Distributed          NCCL
Orchestration        Kubernetes, Volcano
Hardware             A100, H100, GB200

Principles

  • Systems thinking first — AI infrastructure requires deep understanding of the full stack: hardware, networking, runtime, and model architecture.
  • Measure before optimizing — bottlenecks in distributed training are rarely where intuition suggests. Profile first, optimize second.
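
As a rough illustration of measuring first, MFU can be estimated from observed training throughput before any tuning begins. The sketch below is not taken from any system described here: the function name `estimate_mfu`, the common approximation of about 6 FLOPs per parameter per trained token (which ignores attention FLOPs), and every number in the example are assumptions chosen for illustration.

```python
def estimate_mfu(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate Model FLOPs Utilization (MFU) from measured throughput.

    Assumes a dense transformer and the approximate training cost of
    6 * n_params FLOPs per token (forward + backward, attention ignored).
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical run: a 7B-parameter model on 64 GPUs, each with a
# 312 TFLOP/s dense BF16 peak (the A100 datasheet figure).
mfu = estimate_mfu(
    n_params=7e9,
    tokens_per_second=200_000,  # assumed measured cluster-wide throughput
    n_gpus=64,
    peak_flops_per_gpu=312e12,
)
print(f"Estimated MFU: {mfu:.1%}")
```

A number like this only says that a gap exists, not where it comes from; locating the gap in communication, input pipeline, or kernels still requires profiling.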

Open Source


Interests

Particularly drawn to automated research and self-improving agents — systems that can autonomously explore, experiment, and refine themselves. Interested in the infrastructure challenges these workloads introduce: long-horizon task execution, scalable sandbox environments, and tight feedback loops between inference and training.


Contact

GitHub

Pinned

  1. volcano-sh/volcano (Public)

    A Cloud Native Batch System (Project under CNCF)

    Go, 5.7k stars, 1.4k forks

  2. NVIDIA-NeMo/RL (Public)

    Scalable toolkit for efficient model reinforcement

    Python, 1.8k stars, 444 forks

  3. verl-project/verl (Public)

    verl/HybridFlow: A Flexible and Efficient RL Post-Training Framework

    Python, 22.2k stars, 4.2k forks

  4. areal-project/AReaL (Public)

    The RL Bridge for LLM-based Agent Applications. Made Simple & Flexible.

    Python, 5.3k stars, 530 forks