How to share Nvidia GPUs that don’t support MIG and when vGPU isn’t an option

I recently had to build a system to share two Nvidia L40S GPUs (each with 48 GB of memory) among a group of users. The L40S does not support MIG (Multi-Instance GPU), so slicing an individual GPU in hardware was not an option. Nvidia's vGPU was the other candidate, but it would require separate per-GPU or per-user licensing, which did not make sense for a small academic setup.

How can I share a couple of GPUs with a small group of users without MIG or vGPU, and without giving everyone root access on the host?

The TL;DR answer is:

  • Genv for soft GPU partitioning and enforcement

  • Docker for per-user environments

  • Some XFS tricks for disk quotas

  • A lot of glue scripts

This post walks you through what I built, what worked, and what still needs improvement.

What is genv and why did I use it?

Genv is an open-source GPU environment and cluster manager. You can think of it as "virtual environments for GPUs".

It allows you to define environments with:

  • Limited GPU memory capacity

  • Number of GPUs

  • An enforcer that terminates processes in environments that use more than their allocated memory.

This was perfect for running training/inference scripts on GPUs. As an admin, I had a way to enforce the memory capacity that was allocated for that environment without needing MIG or vGPU licenses. The soft partitioning was good enough for a small set of trustworthy users.

High-level architecture

  • Bare metal Ubuntu 24.04 host

    • Nvidia drivers and CUDA installed

    • Nvidia container toolkit configured for Docker

    • Genv installed and configured

    • genv-docker as a Docker runtime

    • Docker data root on an XFS loopback image with project quotas

    • genv enforce running as a systemd service

  • One long-lived container per user

    • Image built from nvidia/cuda:12.9.1-runtime-ubuntu24.04

    • Python, PyTorch, TensorFlow, Jupyter, etc, baked in

    • GPU access via genv runtime

    • Limited GPU memory per container

    • Limited disk usage per container

  • SSH login

    • Users SSH into the host

    • A ForceCommand script auto-starts or attaches them to their container

    • Their home directory is mounted inside the container

From the user's perspective, this is a VM that they can SSH into. However, they land inside their own GPU-ready container, which includes the necessary libraries/tools pre-installed, as well as external drives mounted, allowing them to access their larger files/data.

Host side setup

On the host, I started with the basic Nvidia container stack:

  1. Install GPU drivers and CUDA - I used ubuntu-drivers autoinstall to get an appropriate driver and then followed Nvidia's guide to install the container toolkit and hook it into Docker.

  2. Install genv and genv-docker - Following the genv docs, I installed the core genv tool and the Docker integration, and added the genv runtime into Docker's daemon.json instead of passing it via dockerd flags.

  3. Build a base GPU image - Below is a stripped-down version that produces a base image of roughly 12 GB.

FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04
# 1. Install OS packages (Python, build tools, SSSD client libraries)
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-venv python3-pip python3-dev \
    build-essential git curl ca-certificates \
    libsm6 libxext6 libnss-sss sssd-common \
    sudo vim wget \
 && rm -rf /var/lib/apt/lists/*


# 2. Optimize pip usage
ENV PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_NO_CACHE_DIR=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PIP_BREAK_SYSTEM_PACKAGES=1


# 3. Install ML stack
RUN python3 -m pip install --no-cache-dir \
    numpy pandas scipy scikit-learn matplotlib seaborn \
    scikit-image transformers \
 && python3 -m pip install --no-cache-dir \
    torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 \
    --index-url https://download.pytorch.org/whl/cu124 \
 && python3 -m pip install --no-cache-dir tensorflow==2.16.*


WORKDIR /workspace
CMD ["sleep", "infinity"]

The goal is to provide users with a "batteries included" ML environment, so they do not need to install additional software (they can still use apt-get install).

Disk quotas with XFS project quotas

Because all the containers run on a single host, one user could fill up the host's entire disk, so a hard cap on each container's disk usage is non-negotiable.

Docker's overlay2.size storage option can enforce such a cap, but only when the backing file system is XFS mounted with project quotas (pquota). So I created a 1 TB loopback image, formatted it as XFS (ftype=1), and mounted it on /var/lib/docker with the pquota flag.

Below are the relevant commands:

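# Allocate a 1 TB backing file and format it as XFS with ftype=1 (needed by overlay2)
sudo fallocate -l 1T /home/docker-data.img
LOOP=$(sudo losetup -f --show /home/docker-data.img)
sudo mkfs.xfs -n ftype=1 "$LOOP"
sudo losetup -d "$LOOP"

# Mount it on Docker's data root with project quotas enabled (do this while Docker is stopped)
echo '/home/docker-data.img /var/lib/docker xfs loop,pquota 0 0' | sudo tee -a /etc/fstab
sudo mount -a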

Then, in /etc/docker/daemon.json:

{
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.size=100G"],
  "log-driver": "json-file",
  "log-opts": { "max-size": "50m", "max-file": "3" }
}

This gives:

  • A global upper bound for Docker's data

  • A default 100 GB upper bound per container

  • Log rotation so JSON logs do not eat the disk

You can also attach XFS project quotas to specific named volumes to set per-volume caps.
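Before relying on the per-container cap, it can be worth confirming that the Docker data root really is mounted with project quotas. The following is a minimal sketch, not part of the original setup: `has_project_quota` is a hypothetical helper that inspects a mount-option string. It accepts both `pquota` and `prjquota`, because the kernel typically reports the option under the latter name.

```shell
# Hypothetical helper: succeeds if a comma-separated mount option string
# enables XFS project quotas ("pquota" is usually reported as "prjquota").
has_project_quota() {
    case ",$1," in
        *,pquota,*|*,prjquota,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on the host (not run automatically):
#   opts=$(findmnt -no OPTIONS /var/lib/docker)
#   has_project_quota "$opts" && echo "project quotas enabled"
#   xfs_info /var/lib/docker | grep ftype    # expect ftype=1
```

If the check fails, Docker will most likely refuse to start with overlay2.size set, so it is cheaper to find out here than after a daemon restart.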

Auto-launching a personal container on SSH

I wanted the user experience to be simple: SSH to the host, and you are automatically dropped into your personal GPU container. That requires three things:

  • The user must first exist on the host machine.

  • A container is provisioned for that user. I used the naming convention gpu-{USERNAME}.

  • SSH is configured with a Match User block that sets ForceCommand /usr/local/bin/docker_shell for the allowed users, so the docker_shell script (which I wrote) runs for that user on every login.
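The SSH side of this can be sketched as a small generator for the Match block. This is an illustration under assumptions: `render_match_block` is a hypothetical helper, and the drop-in path in the usage comment is a guess at a sensible location rather than the file used in the original setup.

```shell
# Hypothetical helper: print an sshd_config fragment that forces the given
# users into the docker_shell script. Usage: render_match_block alice bob
render_match_block() {
    [ "$#" -gt 0 ] || return 1          # need at least one username
    users=$(printf '%s,' "$@")          # join usernames with commas
    users=${users%,}                    # drop the trailing comma
    printf 'Match User %s\n' "$users"
    printf '    ForceCommand /usr/local/bin/docker_shell\n'
}

# Example usage (path is an assumption; validate before reloading sshd):
#   render_match_block alice bob | sudo tee /etc/ssh/sshd_config.d/90-gpu-users.conf
#   sudo sshd -t
```

Running `sshd -t` before reloading the SSH service checks the configuration for syntax errors, which matters when a typo could lock everyone out.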

This script does the following:

  • Derives the container name from the username, for example, gpu-alice.

  • Looks up the path of the user's external drive (each user is given one, named after their username) and mounts it at /external_drive in the container.

  • Uses genv-docker run with:

    • --runtime genv

    • --user uid:gid

    • A host path bind for /var/lib/sss and a sudoers snippet (to give the user sudo access inside the container)

    • --gpus 1

    • --gpu-memory set to a slice of VRAM (for example, 8410mi)

    • --network host

If the container already exists, the script starts or attaches instead of recreating it. If it has never been created, it runs genv-docker run to spin up a fresh one.
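The start-or-create logic described above could look roughly like the sketch below. It is not the actual docker_shell script. The image name, the drive root, and the `DRY_RUN` switch are assumptions added for illustration, the sudoers bind is omitted, and it assumes genv-docker run forwards ordinary docker run flags such as -d, --name and -v.

```shell
IMAGE="gpu-cont"            # assumed image name
GPU_MEMORY="8410mi"         # example slice from the text
DRIVE_ROOT="/mnt/drives"    # hypothetical location of per-user drives

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

container_name_for() {
    printf 'gpu-%s\n' "$1"
}

# Create a fresh container. Arguments: username, uid, gid, home directory.
create_container() {
    user=$1 uid=$2 gid=$3 home=$4
    run genv-docker run -d \
        --name "$(container_name_for "$user")" \
        --runtime genv \
        --user "$uid:$gid" \
        --gpus 1 \
        --gpu-memory "$GPU_MEMORY" \
        --network host \
        -v "$home:$home" \
        -v "$DRIVE_ROOT/$user:/external_drive" \
        -v /var/lib/sss:/var/lib/sss \
        "$IMAGE"
}

# Entry point for ForceCommand: reuse the container if it exists, else create it.
enter_container() {
    user=$(id -un)
    name=$(container_name_for "$user")
    if docker container inspect "$name" >/dev/null 2>&1; then
        run docker start "$name" >/dev/null   # no-op if already running
    else
        create_container "$user" "$(id -u)" "$(id -g)" "$HOME"
    fi
    run docker exec -it "$name" bash
}

# A real script would end by calling: enter_container
```

Setting `DRY_RUN=1` prints the command that would be executed, which makes it easier to review the flags before letting the script touch a live host.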

From the user's point of view, this feels like a personal VM, but in reality, they are inside a container with a GPU slice and disk limit.

Enforcing GPU memory limits with genv-enforce

Everything so far is about wiring. The enforcement itself comes from genv enforce.

There are two key facts:

  • Inside each container, nvidia-smi is wrapped by genv and shows a reduced "visible" memory, for example, 8 GiB, even though the physical L40S has 48 GB.

  • Framework APIs like PyTorch's torch.cuda.get_device_properties(0).total_memory still see the real device size and may try to use the entire GPU memory, because the GPU is not actually partitioned in hardware.

To make the limits real, I run genv enforce --env-memory --non-env-processes --interval 1 on the host as a systemd service, with a unit file like:

[Unit]
Description=genv enforce watchdog
After=network-online.target local-fs.target

[Service]
Type=simple
ExecStart=/usr/local/bin/genv enforce --env-memory --non-env-processes --interval 1
StandardOutput=append:/var/log/genv-enforce.log
StandardError=append:/var/log/genv-enforce.log
Restart=always
RestartSec=2s

[Install]
WantedBy=multi-user.target

This loop looks at GPU usage and kills:

  • Any process in a genv environment that exceeds its GPU memory quota

  • Any process touching the GPU that is not associated with a genv environment at all.

It is a software cop, not a hardware partition, but for friendly users you can trust, it is enough.

Provisioning and housekeeping

To create a new user environment, I wrote a small provisioning script:

./provision_gpu_container.sh <username> <gpu_memory_gb> <gpus> [disk_gb] [image]

For example:

./provision_gpu_container.sh johndoe 6 1 40 gpu-cont

This:

  • Sets up their Unix account and Docker group membership

  • Configures sudo for their numeric UID

  • Starts a container named gpu-johndoe with:

    • 1 GPU

    • 6 GB GPU memory

    • 40 GB disk space
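The argument handling for such a script can be sketched as follows. This is not the original provision_gpu_container.sh: the function names and the default values are assumptions, and the "GB" argument is treated as GiB when converting to the MiB-suffixed form used for --gpu-memory.

```shell
# Convert whole GiB to the "<n>mi" form (assumption: GB argument means GiB).
gib_to_genv_mib() {
    printf '%smi\n' "$(( $1 * 1024 ))"
}

# Print the settings a provisioning run would use, applying defaults for
# the optional arguments. Default values are assumptions.
provision_plan() {
    if [ "$#" -lt 3 ]; then
        echo "usage: provision_plan <username> <gpu_memory_gb> <gpus> [disk_gb] [image]" >&2
        return 2
    fi
    username=$1 gpu_gb=$2 gpus=$3
    disk_gb=${4:-100}       # assumed default, mirroring overlay2.size=100G
    image=${5:-gpu-cont}    # assumed default image name
    printf 'name=gpu-%s gpus=%s gpu_memory=%s storage_opt=size=%sG image=%s\n' \
        "$username" "$gpus" "$(gib_to_genv_mib "$gpu_gb")" "$disk_gb" "$image"
}
```

The `storage_opt` value corresponds to Docker's per-container `--storage-opt size=...` flag, which is one plausible way to apply a disk cap that differs from the daemon-wide default.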

On the maintenance side, I have:

  • A simple command to restart any gpu-* containers after a host reboot.

  • Checks to confirm genv-enforce is running.

  • A note to users that any running processes inside containers die when the host reboots, but their files remain.
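The first two maintenance items can be approximated with a few shell functions. This is a sketch rather than the commands actually used, and the unit name genv-enforce.service is an assumption, since the text does not name the unit file.

```shell
# Keep only container names that follow the gpu-{USERNAME} convention.
select_gpu_containers() {
    grep '^gpu-' || true    # grep exits 1 on no match; treat that as "none"
}

# Start every existing gpu-* container (harmless for ones already running).
restart_gpu_containers() {
    docker ps -a --format '{{.Names}}' | select_gpu_containers | xargs -r docker start
}

# Report whether the enforcer unit is active (unit name is an assumption).
check_enforcer() {
    if systemctl is-active --quiet genv-enforce.service; then
        echo "genv enforce is running"
    else
        echo "genv enforce is NOT running" >&2
        return 1
    fi
}
```

After a host reboot, running `restart_gpu_containers` followed by `check_enforcer` would cover both checks in one go.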

What worked well

A few things turned out nicely:

  • Usability - Users SSH once and land in a GPU-ready shell. No need to think about Docker or genv.

  • Per-user containment - Each user lives in their own container with its own Python env and disk quota. Users cannot easily interfere with each other's packages.

  • Soft GPU partitioning - genv plus genv-docker gave me fine-grained control over how much VRAM each user can consume, and genv enforce gave me a safety net when someone inevitably went over.

  • Reuse of existing identities and storage - I reused our existing SSSD setup for Unix users and mounted their campus drive into the container so they did not need to learn a new storage system.

What was painful and what I would fix

You can feel the rough edges in the system. A few examples:

  1. Memory debugging is hard - Users cannot easily see how close they are to their genv quota. They only see "process killed" and have to guess that it was an out-of-memory kill. Environment monitoring, or a simple CLI that displays "You currently use X of Y MiB", would help.

  2. File movement - Copying data in and out of the container is not straightforward, even with the external drive mounted. A helper script around rsync or scp would probably help.

  3. No scheduling - Right now, this is "first-come, first-served". No scheduler decides who gets a GPU slice and when. For a small group, this is fine, but if more users show up, I would need a real scheduler, or a move to Kubernetes or Slurm with GPU support.

  4. No comprehensive observability or monitoring - I currently only have genv envs, which shows me which environments are actively in use. We could spin up a small service in each container that monitors GPU usage and reports it back. A control plane could then use this data to manage environments without needing the CLI or scripts.
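The "X of Y MiB" message from the first item could be prototyped with something like the sketch below. It is not an existing genv feature, and both function names are hypothetical. It also assumes nvidia-smi can list the user's processes, which may not hold inside a container because of PID namespaces.

```shell
# Sum per-process GPU memory figures (MiB, one per line) read from stdin.
sum_mib() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Format the usage message. Arguments: used MiB, quota MiB.
usage_message() {
    used=$1 quota=$2
    pct=$(( used * 100 / quota ))
    printf 'You currently use %s of %s MiB (%s%%)\n' "$used" "$quota" "$pct"
}

# Example usage (the quota value would have to come from the provisioning data):
#   used=$(nvidia-smi --query-compute-apps=used_memory \
#              --format=csv,noheader,nounits | sum_mib)
#   usage_message "$used" 8410
```

Even this rough number would let a user tell a quota kill apart from an unrelated crash.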

When this system makes sense

This setup is not a universal pattern, but it can work well if:

  • You have a small number of users who mostly trust each other.

  • You have a handful of large GPUs without MIG support, like the L40S, and you want to slice them into several usable chunks.

  • vGPU licensing is overkill for your budget and complexity tolerance.

  • You want users to feel like they have "their own machine" with a stable environment.

If you have dozens or hundreds of users, strict multi-tenant requirements, or want strong isolation, hardware options such as MIG-capable GPUs, full vGPU, or a proper GPU-aware cluster scheduler may be a better fit.

Core takeaway

If you are stuck with GPUs that cannot do MIG and vGPU is not practical, you can still get a long way with:

  • genv environments for soft VRAM slices

  • genv-docker to attach containers to those environments

  • Docker containers per user instead of full VMs

  • A small amount of scripting for SSH integration and quotas
