Machine learning operations (MLOps) applies development operations (DevOps) practices to building, deploying, and running machine learning models. It adds discipline to model development and deployment, making the process more reliable and productive.
dedicated ML tooling
- closed source
- open source
- Jina
popular alternative tools
- Argo Workflow
- Gitea (also supports cron jobs for periodic retraining / inference)
data storage
typical ETL workflow
discussions
- Slurm vs Kubernetes (Nebius)
available architectures
- CPU (slow)
- GPU (cumbersome)
- tensor core GPU
- e.g. Nvidia A100
relevant cloud products
baseten
- abstract gpu autoscaling and hosting
- multi-cluster hosting → add compute on the edge
- team is excited about:
TGI vs vLLM
vLLM reportedly 15% faster than TGI for Mistral and more stable under higher load: https://tunehq.ai/blog/comparing-vllm-and-tgi
tech used by companies
- Nvidia
- Enroot
- Pyxis
- baseten
- Kubernetes
- PyTorch
MLOps design patterns
- Data representation design patterns
- #1 Hashed Feature
- #2 Embedding
- #3 Feature Cross
- #4 Multimodal Input
- Problem representation design patterns
- #5 Reframing
- #6 Multilabel
- #7 Ensemble
- #8 Cascade
- #9 Neutral Class
- #10 Rebalancing
- Patterns that modify model training
- #11 Useful Overfitting
- #12 Checkpoints
- #13 Transfer Learning
- #14 Distribution Strategy
- #15 Hyperparameter Tuning
- Resilience patterns
- #16 Stateless Serving Function
- #17 Batch Serving
- #18 Continued Model Evaluation
- #19 Two-Phase Predictions
- #20 Keyed Predictions
- Reproducibility patterns
- #21 Transform
- #22 Repeatable Splitting
- #23 Bridged Schema
- #24 Windowed Inference
- #25 Workflow Pipeline
- #26 Feature Store
- #27 Model Versioning
- Responsible AI
- #28 Heuristic Benchmark
- #29 Explainable Predictions
- #30 Fairness Lens
- https://github.com/GoogleCloudPlatform/ml-design-patterns
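
Two of the patterns above (#1 Hashed Feature and #22, the repeatable split) rest on the same idea: a deterministic hash of a string. A minimal sketch follows, not the reference implementation from the linked repo. Assumptions: MD5 from the standard library stands in for whatever fingerprint function a real pipeline would use, and the function names, bucket count, split percentages, and sample values are all hypothetical.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic integer hash of a string.

    Python's built-in hash() is salted per process for str, so it is not
    repeatable across runs. MD5 is used here only as a stable fingerprint
    (assumption: any stable, well-mixed hash would do).
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16)


def hashed_feature(value: str, num_buckets: int = 10) -> int:
    """#1 Hashed Feature: map a high-cardinality category to a fixed bucket.

    Unseen categories still land in some bucket, at the cost of collisions.
    """
    return stable_hash(value) % num_buckets


def assign_split(key: str, train_pct: int = 80, valid_pct: int = 10) -> str:
    """#22 repeatable split: choose train/valid/test from a hash of a key.

    The same key always gets the same split, so correlated rows that share
    a key (e.g. the same date) never straddle train and test.
    """
    bucket = stable_hash(key) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + valid_pct:
        return "valid"
    return "test"


# Hypothetical sample values, for illustration only.
for airport in ["ORD", "JFK", "SFO"]:
    print(airport, "-> bucket", hashed_feature(airport))

for date in ["2024-01-01", "2024-01-02", "2024-01-03"]:
    print(date, "->", assign_split(date))
```

Design note: splitting on a hash of a column rather than on a random number keeps the split stable across reruns and across machines, which is the point of the reproducibility group of patterns.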