MLOps Community Amsterdam NL meetup April 2024

This edition was organised by Nebius at a very fancy venue. I was curious to learn more about the current state and community of the MLOps industry and was pleased to find a lot of expert engineers in the room. I noticed a high percentage of expats in the audience, and also a big Russian-speaking contingent, partly because of the largely Russian development team behind Nebius.

Big kudos to Filipp for addressing his stuttering in the intro of his presentation, which didn’t keep him from delivering a deep technical talk. Filipp shed light on some interesting ways to avoid wasting costly GPU cycles by detecting hardware failures early and improving data loading when restoring from checkpoints.

Luka presented an interesting use case for time series prediction. Their IoT devices can almost literally tap into the power lines of machinery to capture power usage, and they detect machine idling with time series models. As I’m a big fan of time series and time series forecasting, this talk was right up my alley. It was very interesting to hear their experiences with Flink, which I didn’t know about before, and that they didn’t opt for a dedicated time series database but just work with Postgres, which matches the expertise already in their team.


talk 1: Fail fast & recover faster: infrastructure resilience of multi-node LLM training - Filipp Fisin (Senior MLE @ Nebius)

Training an LLM in a multi-node setup is a complex and expensive process. Training failures can’t be eliminated, but downtime can be reduced.
In this talk, we provide an overview of techniques for more resilient training that we’ve found useful in our JAX-based multi-node training setup, namely:

  • multi-node training orchestration in Kubernetes via Argo with automatic failure recovery
  • a special type of Kubernetes health check to detect whether a training process is stuck
  • techniques to efficiently save and load terabyte-scale checkpoints
  • XLA compilation cache
  • GPU node monitoring and auto-cordoning

context

~300B parameters, 1,000+ GPUs, JAX-based training framework

Flow

  • batch split across devices
  • model replicas (one per device)
  • per-replica gradients
  • averaged gradients (all-reduce across replicas)
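
This is standard data-parallel training: each replica computes gradients on its slice of the batch, and the gradients are averaged before the optimizer step. A minimal JAX sketch of that flow (illustrative only; the loss function and SGD update are placeholders, not Nebius’s actual framework code):

```python
import functools

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Hypothetical linear model standing in for the real LLM forward pass.
    preds = batch["x"] @ params["w"]
    return jnp.mean((preds - batch["y"]) ** 2)

# One model replica per device; params and batch carry a leading device axis.
@functools.partial(jax.pmap, axis_name="replicas")
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # Average gradients across all replicas (the all-reduce step).
    grads = jax.lax.pmean(grads, axis_name="replicas")
    # Plain SGD update as a stand-in for the real optimizer.
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
```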

Parallel training failure reasons

  • hardware failure
  • data loading
  • data center issues (power, software)

Orchestration

Dealing with hardware failures

  • split up work into checkpoints to preserve progress
  • health checks that disable nodes that are for instance unreachable
  • a DaemonSet that checks the exit code of nvidia-smi to catch broken GPUs (see the sketch after this list)
  • if one node is slower, all nodes will have to wait for it
    • hourly NCCL all-reduce performance test
    • works well for catching degraded RAM or GPU performance
  • training deadlocks
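
A rough sketch of the nvidia-smi exit-code check, which in the talk runs as a Kubernetes DaemonSet on every GPU node so unhealthy nodes can be cordoned automatically; the marker file path and interval below are assumptions, not Nebius’s actual setup:

```python
import pathlib
import subprocess
import time

# Hypothetical marker file that a Kubernetes liveness/readiness probe watches.
HEALTH_FILE = pathlib.Path("/var/run/gpu_healthy")

def gpu_healthy() -> bool:
    # nvidia-smi exits with a non-zero code when the driver or a GPU is broken.
    return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0

if __name__ == "__main__":
    while True:
        if gpu_healthy():
            HEALTH_FILE.touch()
        else:
            # Probe fails, the node gets marked unhealthy and cordoned.
            HEALTH_FILE.unlink(missing_ok=True)
        time.sleep(60)
```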

distributed write to deal with model size

3.5 TB training state size

  • sharding tensors across nodes
  • synchronous copy of the training state to RAM
  • asynchronous write from RAM to persistent storage
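
A minimal sketch of that two-stage save, assuming a JAX pytree training state; a real terabyte-scale setup would use a proper checkpointing library (e.g. Orbax) and write each shard from its own node, but the idea is the same:

```python
import pickle
import threading

import jax

def save_checkpoint_async(train_state, path):
    # Stage 1 (synchronous): copy the (sharded) device tensors into host RAM.
    # Training can resume as soon as this copy finishes.
    host_state = jax.device_get(train_state)

    # Stage 2 (asynchronous): write the host copy to persistent storage in a
    # background thread so the GPUs are not idle during the slow I/O.
    def _write():
        with open(path, "wb") as f:
            pickle.dump(host_state, f)

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # join() this before starting the next checkpoint
```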

time costly issues

  • 90 min restore from persistent storage
  • 15 min JAX graph compilation (mitigated by the XLA compilation cache; see the snippet below)
  • use InfiniBand for peer-to-peer data sharing between nodes instead of having every node restore from persistent storage
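
For the compilation time, recent JAX versions expose a persistent compilation cache; a minimal sketch, where the shared-storage path is an assumption:

```python
import jax

# Persist compiled XLA executables so a restarted job skips the ~15 min
# recompilation; the directory would live on storage shared by all nodes.
jax.config.update("jax_compilation_cache_dir", "/mnt/shared/xla_cache")
```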

qa

  • restrictions on models?
    • it will work with any model whose tensors can be sharded across GPUs
  • why Kubernetes instead of Slurm?
    • the infrastructure serves lots of different use cases
    • lots of engineers already have Kubernetes experience

talk 2: Realtime Standby Energy Waste Prediction - Luka Sturtewagen (Principal DE @ Sensorfact)

At Sensorfact, our mission is to minimize industrial waste, particularly in energy consumption. This way we help our customers raise the bar for their sustainability KPIs. For example, we measure energy usage for our customers at the individual machine level. Armed with this detailed but massive data, we provide tailored advice on reducing energy waste, covering areas such as energy use outside production hours, compressed air leakages, and suboptimal machine usage. We have ML models to detect standby energy waste in batch. Recently we have even transformed our pipeline to be able to predict in real time. This allows us to provide our customers with immediate insights and alerts through our app, ultimately enabling proactive waste reduction strategies.

context

  • IoT sensor platform
    • using electricity sensors that clip around power cables
  • Predictive maintenance

goal

  • optimize the use of machines
    • leakages in compressed air systems
  • prevent power usage when machines are on standby

standby waste

  • biggest win: standby waste, i.e. turning off idling machines
  • detect when machines idle using sensor data
  • measure at 30 sec intervals

model choosing

  • not one single model, but ~50,000 models to cover all machines
  • a hidden Markov model combined with other (unspecified) algorithms
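
The talk did not name a library, but as a hedged sketch of what per-machine idle detection with a Gaussian HMM could look like (here with hmmlearn, dummy data, and two hidden states for “producing” vs “standby”):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Dummy power readings (watts) for one machine at 30-second intervals:
# alternating production / standby phases plus noise.
t = np.arange(2880)
power = np.where(t % 720 < 360, 500.0, 5.0).reshape(-1, 1)
power = power + np.random.rand(2880, 1) * 10.0

# One model per machine (the talk mentions ~50,000 models in total).
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(power)

states = model.predict(power)
# Treat the state with the lower mean power as "standby".
standby_state = int(np.argmin(model.means_.ravel()))
is_standby = states == standby_state
```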

flow

  • 6 weeks of data used for the models
  • Postgres
  • an energy consultant validates the results (“human in the loop”)
  • batch predictions
  • tech
    • Prefect & Dask (see the sketch below)
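
The notes only name Prefect & Dask; a guess at how the batch flow might be laid out with Prefect (task bodies omitted, and Dask could be plugged in via Prefect’s Dask task runner), not Sensorfact’s actual pipeline:

```python
from prefect import flow, task

@task
def load_measurements(machine_id: str):
    # Would query ~6 weeks of 30-second readings for this machine from Postgres.
    ...

@task
def predict_standby(machine_id: str, measurements):
    # Would run the per-machine model in batch and write predictions back.
    ...

@flow
def batch_standby_flow(machine_ids: list[str]):
    for machine_id in machine_ids:
        measurements = load_measurements(machine_id)
        predict_standby(machine_id, measurements)
```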

realtime standby detection

  • tech
    • streaming for model inference
    • detect standby in real time with rolling / sliding windows (see the sketch below)
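
A plain-Python sketch of the sliding-window idea (the real pipeline does this in Flink); it reuses the per-machine HMM from the earlier sketch, and the window length is an assumption:

```python
from collections import deque

import numpy as np

WINDOW = 20  # 20 readings x 30 s = 10 minutes (assumed window length)

def make_detector(model, standby_state):
    window = deque(maxlen=WINDOW)

    def on_measurement(power_watts: float) -> bool:
        """Called for every incoming reading; True while the machine idles."""
        window.append(power_watts)
        if len(window) < WINDOW:
            return False
        states = model.predict(np.array(window).reshape(-1, 1))
        # Only flag standby when the whole window sits in the standby state.
        return bool((states == standby_state).all())

    return on_measurement
```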

versioning

model versions picked up via change data capture (CDC)?

pipeline

  • measurements + CDC stream of models
  • enriched measurements
  • post-process to filter out short standby states (noisy); see the sketch below
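
A small sketch of that post-processing step: drop standby runs shorter than a minimum duration so sensor noise does not trigger alerts (the threshold is an assumption):

```python
MIN_STANDBY_READINGS = 10  # e.g. 5 minutes at 30-second intervals (assumed)

def filter_short_standby(flags):
    """Set standby runs shorter than MIN_STANDBY_READINGS back to False."""
    flags = list(flags)
    run_start = None
    for i, flag in enumerate(flags + [False]):  # sentinel closes the last run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start < MIN_STANDBY_READINGS:
                flags[run_start:i] = [False] * (i - run_start)
            run_start = None
    return flags
```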

deployments

  • Kubernetes
    • Flink Kubernetes operator
  • cluster of task managers
  • adaptive autoscaling based on metrics

qa

  • What alternative tech did you consider?
    • PySpark, which would only be near-realtime
  • how do you save the models?
    • the type of model stays consistent across machines; only the parameters change
  • biggest challenges?
    • sensors connected to multiple bridges, sending duplicate and out-of-order data
