Docker-in-Docker for GFX hardware access

One problem I’ve repeatedly struggled with when running a Docker Swarm cluster is that swarm services have no way to use “devices”, such as graphics cards (or to run as privileged containers, should you want that).

The problem is that, yes, you can volume-mount /dev/dri/* into the container, and those devices will show up. But when you attempt to use them, even if you are root and/or filesystem permissions allow it, the container’s device cgroup rules will block access (a bind mount doesn’t add the device to the container’s allow list the way --device does), and you’ll get “Operation not permitted”.

I found a solution. It’s not the most graceful, but it works: docker-in-docker.

services:
  svc:
    image: docker:28-dind
    entrypoint: [sh, -c]
    command: >-
      'exec docker run
      --interactive
      --user 0:0
      -e DISPATCHARR_ENV=aio
      -e REDIS_HOST=localhost
      -e CELERY_BROKER_URL=redis://localhost:6379/0
      -e DISPATCHARR_LOG_LEVEL=info
      -v /docker/data/dispatcharr:/data
      --name=dispatcharr_svc_sub
      --network=traefik
      --device=/dev/dri/renderD128:/dev/dri/renderD128
      --device=/dev/dri/card0:/dev/dri/card0
      --rm
      ghcr.io/dispatcharr/dispatcharr:latest'
    volumes:
      - /etc/docker/daemon.json:/etc/docker/daemon.json:ro
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        order: stop-first
      placement:
        constraints:
          - node.labels.gpu == true
      labels:
        - traefik.enable=true
        - traefik.http.routers.dispatcharr.entrypoints=https
        - traefik.http.services.dispatcharr.loadbalancer.server.url=http://dispatcharr_svc_sub:9191
        - traefik.swarm.network=traefik
    networks:
      - traefik

networks:
  traefik:
    external: true

To take this apart, this is actually two containers:

“Outer” container

This is what the swarm will schedule on the cluster. It runs the docker-in-docker (“dind”) image, and it mounts in both the docker socket and the /etc/docker/daemon.json file from the host it lands on. The daemon.json mount is optional. Note: this doesn’t need to be a swarm manager socket, as it won’t be interacting with the swarm. Most importantly, it runs the normal docker command to start an “inner” docker container (docker run …). I also added a placement constraint (node.labels.gpu == true) so it will only land on the systems I want it to, even though this container itself doesn’t mount or use the GPUs.
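For the placement constraint to match anything, at least one node has to carry the gpu label. A minimal sketch of how that could be set from a swarm manager, wrapped in small shell helpers; the function names and the node name worker-1 are placeholders for illustration, not part of the setup above:

```shell
# Add the label that the constraint "node.labels.gpu == true" matches.
# Must be run against a swarm manager; "$1" is the node name or ID.
label_gpu_node() {
  docker node update --label-add gpu=true "$1"
}

# Print the labels currently set on a node, as JSON.
show_node_labels() {
  docker node inspect --format '{{ json .Spec.Labels }}' "$1"
}
```

Usage would look like label_gpu_node worker-1, followed by show_node_labels worker-1 to confirm the label is there.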

“Inner” container

This is a non-swarm docker container that just happens to be running on the same host. Despite calling them inner & outer, these containers aren’t “nested”. They’re running beside each other on the host, so volume and device mounts refer to paths on the host itself, not to anything inside the outer container. The inner container has full access to --device, --privileged, etc., like any normal docker container. In my case, I add the devices --device=/dev/dri/renderD128:/dev/dri/renderD128 and --device=/dev/dri/card0:/dev/dri/card0 to access the GPU. In this example, it’s running dispatcharr.
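Because the inner container is an ordinary container on the host, it can be inspected with the host’s own docker CLI. A small sketch, assuming the container name used in the example above; the helper names are made up for illustration:

```shell
# Print the device mappings the daemon recorded for the container (JSON).
# Defaults to the predictable name used in this post.
inner_devices() {
  docker inspect --format '{{ json .HostConfig.Devices }}' "${1:-dispatcharr_svc_sub}"
}

# List the DRI device nodes as seen from inside the container.
inner_dri_listing() {
  docker exec "${1:-dispatcharr_svc_sub}" ls -l /dev/dri
}
```

If the --device flags took effect, inner_devices should list both /dev/dri entries, and inner_dri_listing should show the matching device nodes.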

Linking

The outer container runs the inner container in the foreground (exec docker run with --interactive), which means that if the swarm container is stopped or rescheduled for any reason, it will cascade-stop the inner container. Likewise, if anything happens to the inner container, the outer container’s command will exit, stopping the outer container too, which will then be rescheduled by the swarm orchestrator.

Storage

In my case, there’s a local mount to /docker/data/dispatcharr, which is a location on glusterfs, mounted on all nodes.

Networking

Networking is unfortunately complex and painful. In this case, the inner container will open port 9191 (in my case, this isn’t exposed, but it could be by adding -p 9191:9191), but the container can land on any eligible host in the swarm, and will move around.

I’m using traefik as an ingress, but since the inner container isn’t on the swarm, it won’t be seen by traefik’s swarm provider.

There are many solutions, but here is mine. Connect the inner container to the traefik network with --network=traefik (which requires the network to be “attachable”), and give it a predictable name with --name=dispatcharr_svc_sub. Then have the outer container set a label that traefik will pick up, traefik.http.services.dispatcharr.loadbalancer.server.url=http://dispatcharr_svc_sub:9191, which points at the predictable name on the predictable port. Since that container is conveniently on the traefik network, traefik can reach it.
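The “attachable” requirement applies to the overlay network itself: a standalone docker run container can only join a swarm overlay network that was created as attachable. A minimal sketch of creating and checking such a network from a manager node; the helper names are hypothetical, and traefik is the network name from the example:

```shell
# Create a swarm overlay network that standalone containers may join.
create_attachable_network() {
  docker network create --driver overlay --attachable "$1"
}

# Succeeds (exit status 0) only if the named network is attachable.
network_is_attachable() {
  [ "$(docker network inspect --format '{{ .Attachable }}' "$1")" = "true" ]
}
```

For example, create_attachable_network traefik, then network_is_attachable traefik && echo ok.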

This networking setup has trade-offs: I’m limited to one instance of the container, and stop-first is required so that I don’t run into a conflict on the predictable name.
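A possible failure mode worth guarding against (this is an assumption, not something observed in the setup above): if the outer container is killed hard before the inner one has finished stopping, a container with the predictable name may still exist when the replacement starts, and docker run would then fail on the name. A hedged sketch of a cleanup step that could run before docker run; the helper name is made up:

```shell
# Force-remove any leftover container with the given name.
# Output is discarded and failures are ignored, so this is a no-op
# when no such container exists.
remove_stale_inner() {
  docker rm --force "$1" >/dev/null 2>&1 || true
}
```

In the compose command this would amount to running docker rm --force dispatcharr_svc_sub before the exec docker run … line.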