There are two things you will need: the NVIDIA device plugin for scheduling (eksctl even adds this for you) and the DCGM exporter for observability (watch out for DCGM_FI_DEV_XID_ERRORS). You will also need the NVIDIA drivers (either baked into the machine image with something like HashiCorp Packer, or shipped in containers) and the NVIDIA Container Runtime so that containers can use the GPU. Those two dependencies are also needed for running GPU containers locally.
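Once the device plugin is running, pods request GPUs through the extended `nvidia.com/gpu` resource it advertises. A minimal sketch, assuming the plugin is deployed (the pod name and sample image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                 # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: nvidia/samples:vectoradd-cuda11.2.1  # example CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1      # resource advertised by the NVIDIA device plugin
```

The scheduler will only place this pod on a node where the device plugin has registered at least one free GPU.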
Machine learning and deep learning require a lot of resources, which means the cluster can become much bigger, and GPUs bring additional complexity, as described in the OpenAI blog. Since AI workloads need a lot of resources and are probably not constant, we also want the cluster to be scalable (GPUs are expensive!). For that we can use the Cluster Autoscaler (and the Descheduler for improved resource utilization; according to its roadmap, the Descheduler will eventually be integrated with the Cluster Autoscaler). If we want to scale up from zero, the Cluster Autoscaler needs some hints about the GPU nodes, such as resource hints and GPU specifics.
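To make scale-from-zero concrete: on AWS, the Cluster Autoscaler reads node-template tags on the Auto Scaling group backing the node group, so it knows that a currently empty group would provide GPUs if scaled up. A sketch of the relevant tags, with an assumed cluster name and illustrative values:

```yaml
# Tags on the ASG backing the GPU node group (cluster name and values are illustrative)
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/my-cluster: "owned"
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type: "p3.2xlarge"
```

Without the `node-template/resources` tag, the autoscaler cannot tell that scaling the group up would satisfy a pending pod's `nvidia.com/gpu` request, and the group stays at zero.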
NVIDIA created the GPU Operator to handle "everything GPU related" and simplify our lives. It has great features, but I believe the team still needs some time to polish a few rough edges. Not only does it take care of things like drivers, DCGM, and the device plugin, but it also ships GPU Feature Discovery (GFD), which detects the driver version, GPU type, and many other properties thanks to its Node Feature Discovery (NFD) integration.
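To illustrate what GFD contributes: it labels each node with GPU properties, which workloads can then target via node selectors or affinity. The label keys below are the ones GFD applies; the values shown are illustrative examples:

```yaml
# Example node labels applied by GPU Feature Discovery (values are illustrative)
nvidia.com/gpu.product: Tesla-V100-SXM2-16GB
nvidia.com/gpu.count: "1"
nvidia.com/gpu.memory: "16160"
nvidia.com/cuda.driver.major: "470"
```

A pod that needs a specific GPU model can then use, for example, a `nodeSelector` on `nvidia.com/gpu.product` instead of hardcoding instance types.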