A hands-on, project-based guide to Machine Learning Operations built specifically for DevOps, Platform, and SRE engineers.
No ML background required. Every concept is explained through DevOps analogies you already understand.
If you are completely new to MLOps, read our DevOps to MLOps guide first.
- Who This Is For
- What We Build
- Prerequisites
- Learning Path
- Phase 1: Local Dev & Pipelines
- Phase 2: Enterprise Orchestration for ML
- Tech Stack
- Recommended Reading
- License
Most MLOps resources are written for data scientists learning infrastructure. This repo flips that.
You do not need to become a data scientist. But just like understanding how a Java application is built makes you a better DevOps engineer, understanding how an ML model is built, trained, and served makes you effective at operating ML workloads in production.
| Track | What You Learn |
|---|---|
| Traditional ML | Train, serve, automate, and monitor a real ML model on Kubernetes |
| Foundational Models | Serve LLMs in production using vLLM, TGI, and Ollama |
| LLM-Powered DevOps | Monitor K8s clusters, build RAG pipelines and agents with LLMs |
Everything runs on Kubernetes, Docker, and tools you already use.
| Skill | Level |
|---|---|
| Linux CLI | Intermediate |
| Docker | Intermediate |
| Kubernetes | Intermediate |
| AWS | Basic to Intermediate |
| Python | Basic (read and run scripts) |
| Git | Intermediate |
No ML experience needed. That is what this repo teaches.
| Phase | Track | Title | Status |
|---|---|---|---|
| 1 | Traditional ML | Local Dev & Pipelines | Done |
| 1 | Traditional ML | K8s Deploy & Model Serving | Done |
| 2 | Traditional ML | Enterprise Orchestration | In Progress |
| 3 | Traditional ML | Monitor & Observe | Planned |
| 4 | Foundational Models | Foundational Models | Planned |
| 5 | Foundational Models | LLM Serving & Scaling | Planned |
| 6 | LLM-Powered DevOps | LLM-Powered DevOps | Planned |
| 7 | LLM-Powered DevOps | Emerging AI Ops | Planned |
Goal: Build the full ML foundation on your local machine, from raw data to a trained, tested model.
Use case throughout: Employee attrition prediction for a large organisation (~500,000 employees). One problem, end to end, which keeps the focus on infrastructure and operations rather than data science theory.
| Step | Title | Guide |
|---|---|---|
| 1 | Project Dataset Pipeline | Read the Guide |
| 2 | Data Preparation Stages | Read the Guide |
| 3 | Training & Building the Prediction Model | Read the Guide |
| 4 | From Model to Live API with KServe | Read the Guide |
Code: phase-1-local-dev/
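The four steps above can be sketched in a few lines of pandas and scikit-learn. This is a hedged illustration only: the column names and synthetic data below are assumptions for demonstration, not the schema of the repo's actual `employee_attrition.csv`.

```python
# Minimal sketch of the Phase 1 flow: raw data -> trained, tested model.
# The feature columns and synthetic labels are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
df = pd.DataFrame({
    "tenure_years": rng.uniform(0, 20, n),
    "salary_band": rng.integers(1, 6, n),
    "overtime_hours": rng.uniform(0, 30, n),
})
# Synthetic label: short tenure plus heavy overtime -> more likely to leave.
df["attrition"] = ((df["overtime_hours"] / 30 - df["tenure_years"] / 20
                    + rng.normal(0, 0.3, n)) > 0).astype(int)

# Hold out a test split, train, and evaluate on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="attrition"), df["attrition"],
    test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"holdout accuracy: {acc:.2f}")
```

In Phase 1 the same trained model is then packaged and exposed as a live API with KServe (step 4); the point here is only the data-to-model shape of the pipeline.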
Goal: Replace local, manual ML workflows with production-grade orchestration. Versioned data, automated pipelines, experiment tracking, and scalable training.
| Step | Title | What it Covers | Guide |
|---|---|---|---|
| 1 | Data Versioning Fundamentals | Understanding Data Drift, Model Decay, and Dataset Versioning | Read the Guide |
| 2 | Hands-On Data Version Control with AWS S3 | Working with DVC and AWS S3 to version the dataset required for ML | Read the Guide |
| 3 | Data Versioning Using Airflow on Kubernetes | An ETL pipeline that produces a fresh employee_attrition.csv dataset and versions it on S3 using DVC | Coming This Saturday |
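The data drift idea from step 1 can be illustrated without any ML tooling: compare a versioned "reference" snapshot of the dataset against fresh data and flag features whose distribution has moved. A production setup would use a tool like Evidently AI; the threshold and check below are simplified assumptions.

```python
# Hedged sketch of a drift check: flag a numeric column as drifted if its
# mean moved by more than `threshold` reference standard deviations.
import numpy as np
import pandas as pd

def drift_report(reference: pd.DataFrame, current: pd.DataFrame,
                 threshold: float = 0.2) -> dict[str, bool]:
    report = {}
    for col in reference.columns:
        ref_std = reference[col].std() or 1.0  # guard against zero variance
        shift = abs(current[col].mean() - reference[col].mean()) / ref_std
        report[col] = bool(shift > threshold)
    return report

rng = np.random.default_rng(0)
ref = pd.DataFrame({"tenure": rng.normal(8, 3, 10_000),
                    "overtime": rng.normal(10, 5, 10_000)})
cur = pd.DataFrame({"tenure": rng.normal(8, 3, 10_000),      # unchanged
                    "overtime": rng.normal(15, 5, 10_000)})  # mean shifted
print(drift_report(ref, cur))  # only "overtime" should be flagged
```

This is exactly the situation DVC-versioned snapshots make possible: with each dataset version stored on S3, yesterday's data is always available as the reference for today's comparison.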
| Category | Tools |
|---|---|
| Data Pipeline | Python, Pandas |
| Model Training | scikit-learn, XGBoost |
| API / Serving | FastAPI, Flask, Docker, KServe |
| Orchestration | Airflow, Kubeflow, MLflow Pipelines |
| Monitoring | Prometheus, Grafana, Evidently AI |
| Infrastructure | Kubernetes, Helm, GitHub Actions |
| LLM Serving | vLLM, TGI, Ollama |
- Ray: Open-source distributed computing framework for Python and AI workloads
- rtk: High-performance CLI proxy that reduces LLM token consumption.
Dual licensed:
- Code (scripts, configs, manifests): Apache 2.0
- Content (README, guides, docs): All Rights Reserved
For commercial licensing: contact@devopscube.com