22.02.2026

Run 70B LLMs on Consumer GPUs: NVMe-to-GPU Direct Transfer


Running a 70-billion parameter LLM typically requires multiple A100 GPUs or expensive cloud instances. But what if you could run Llama 3.1 70B on a single RTX 3090 with its mere 24GB of VRAM? A new technique making waves in the AI community does exactly that by using NVMe storage as virtual VRAM and bypassing the CPU entirely.

Quick Reference

# Install dependencies (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit nvme-cli

# Check whether the NVMe controller exposes a CMB (Controller Memory Buffer)
# (CMB support is reported in the controller registers, not the namespace)
sudo nvme show-regs /dev/nvme0 | grep -i cmb

# Verify the GPUDirect Storage kernel module is loaded
lsmod | grep nvidia_fs

The VRAM Problem

Large language models have memory requirements that exceed most consumer GPUs:

Model            Parameters   FP16 Memory   INT8 Memory
Llama 3.1 8B     8B           16 GB         8 GB
Llama 3.1 70B    70B          140 GB        70 GB
Llama 3.1 405B   405B         810 GB        405 GB
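As a sanity check, these figures follow directly from parameter count times bytes per parameter (FP16 = 2 bytes, INT8 = 1 byte); activation and KV-cache overhead is ignored here:

```python
def model_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory: 1e9 params * N bytes/param = N GB per billion."""
    return params_billions * bytes_per_param

# Reproduce the table above
for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    fp16 = model_memory_gb(params, 2)  # FP16: 2 bytes per parameter
    int8 = model_memory_gb(params, 1)  # INT8: 1 byte per parameter
    print(f"Llama 3.1 {name}: FP16 {fp16:.0f} GB, INT8 {int8:.0f} GB")
```

Real deployments need a few extra gigabytes on top of this for activations and the KV cache, which is why even the raw weight number is only a lower bound.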

An RTX 3090 with 24GB cannot fit even an 8-bit quantized 70B model. Traditional solutions include:

  • Model quantization (4-bit, 2-bit): Reduces quality
  • CPU offloading: Slow PCIe transfers through system RAM
  • Tensor parallelism: Requires multiple expensive GPUs

Enter NVMe-to-GPU Direct Transfer

The key innovation is using NVIDIA GPUDirect Storage (GDS) combined with NVMe Controller Memory Buffer (CMB) to stream model weights directly from SSD to GPU memory, completely bypassing the CPU and system RAM.

How It Works

Traditional Path:
NVMe SSD -> PCIe -> CPU -> System RAM -> PCIe -> GPU VRAM

Direct Path:
NVMe SSD -> PCIe -> GPU VRAM (CPU bypassed)

This approach treats fast NVMe storage as an extension of GPU memory. During inference, only the active layers need to reside in VRAM while inactive layers remain on the SSD, streaming in as needed.
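The residency logic behind this can be sketched as a small LRU cache. The class below is an illustrative toy, not the GDS or ntransformer API: a real implementation issues asynchronous cuFile reads into pre-allocated CUDA buffers, but the eviction and miss accounting work the same way.

```python
from collections import OrderedDict

class LayerStreamer:
    """Toy model of NVMe->VRAM layer streaming: only `capacity` layers are
    resident at once; requesting a non-resident layer evicts the least
    recently used one and 'reads' the new layer from disk."""

    def __init__(self, num_layers: int, capacity: int):
        self.num_layers = num_layers
        self.capacity = capacity
        self.resident = OrderedDict()  # layer index -> weights (LRU order)
        self.disk_reads = 0

    def _load_from_ssd(self, idx: int):
        self.disk_reads += 1           # stand-in for a direct NVMe->GPU read
        return f"weights[{idx}]"       # placeholder payload

    def get_layer(self, idx: int):
        if idx in self.resident:
            self.resident.move_to_end(idx)         # mark as recently used
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict LRU layer
            self.resident[idx] = self._load_from_ssd(idx)
        return self.resident[idx]

# One forward pass over an 80-layer model with room for 8 layers in "VRAM":
s = LayerStreamer(num_layers=80, capacity=8)
for i in range(80):
    s.get_layer(i)
print(s.disk_reads)  # 80 reads on a cold pass
```

Since a sequential pass with capacity below the layer count misses on every layer, raw caching alone does not help; the wins come from overlapping these reads with computation via prefetching, covered below.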

Hardware Requirements

Not all hardware supports this technique. You need:

  1. NVIDIA GPU with GPUDirect Storage support (RTX 30/40 series, A-series, H-series)
  2. NVMe SSD with CMB support and high sequential read speeds (5+ GB/s recommended)
  3. PCIe 4.0 or 5.0 motherboard with proper lane configuration
  4. Linux kernel 5.15+ with nvidia-gds module

Check Your Hardware

# Verify GDS support
dpkg -l | grep nvidia-gds

# List available NVMe devices
nvme list

# Check NVMe PCIe link speed
sudo lspci -vv -s $(lspci | grep -i nvme | cut -d' ' -f1) | grep -i lnksta

Setting Up the Environment

Install NVIDIA GDS

# Add the NVIDIA CUDA repository keyring (if not already present);
# adjust the distro path (here ubuntu2204) to your release
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Install GDS
sudo apt-get update
sudo apt-get install -y nvidia-gds
sudo modprobe nvidia_fs

Configure Huge Pages

GDS works best with huge pages enabled:

# Reserve huge pages (default size, typically 2 MB)
echo 16 | sudo tee /proc/sys/vm/nr_hugepages

# Make persistent across reboots
echo "vm.nr_hugepages=16" | sudo tee -a /etc/sysctl.conf

Verify GDS Is Working

# Run the GDS verification tool
/usr/local/cuda/gds/tools/gdscheck -p

Expected output should show "GPUDirect Storage supported" for your NVMe devices.

Running Llama 70B with ntransformer

The ntransformer project implements this technique for running large LLMs:

# Clone the repository
git clone https://github.com/xaskasdf/ntransformer.git
cd ntransformer

# Install dependencies
pip install -r requirements.txt

# Download Llama 3.1 70B weights
# (requires HuggingFace access)
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
  --local-dir ./models/llama-70b

# Prepare weights for NVMe streaming
python prepare_weights.py --model ./models/llama-70b \
  --output /mnt/nvme/llama-70b-prepared

# Run inference
python run.py --model-path /mnt/nvme/llama-70b-prepared \
  --gds-enabled \
  --max-batch-size 1 \
  --prompt "Explain quantum computing"

Performance Characteristics

The trade-off is latency versus cost. Here are typical benchmarks on an RTX 3090 with a Samsung 990 Pro NVMe:

Metric                 Traditional (A100 80GB)   NVMe Offload (RTX 3090)
Time to first token    0.5 s                     2.1 s
Tokens per second      45                        12
Hardware cost          $15,000                   $1,500
Power consumption      400 W                     350 W

For batch processing or non-interactive use cases, the 10x cost reduction often justifies the 3-4x performance decrease.
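Working the table numbers through makes the trade-off concrete: throughput drops less than 4x while hardware cost drops 10x, so cost-normalized throughput favors the consumer setup.

```python
# Cost-normalized throughput from the benchmark table above
a100 = {"cost_usd": 15_000, "tok_per_s": 45}
rtx3090 = {"cost_usd": 1_500, "tok_per_s": 12}

slowdown = a100["tok_per_s"] / rtx3090["tok_per_s"]   # 3.75x slower
cost_ratio = a100["cost_usd"] / rtx3090["cost_usd"]   # 10x cheaper
tok_per_dollar = {name: gpu["tok_per_s"] / gpu["cost_usd"]
                  for name, gpu in [("A100", a100), ("RTX 3090", rtx3090)]}

print(f"{slowdown:.2f}x slower, {cost_ratio:.0f}x cheaper")
# tokens/s per dollar: A100 0.003, RTX 3090 0.008 (~2.7x better)
```

This comparison ignores power and operating costs, but at these wattages (400 W vs. 350 W) electricity barely shifts the balance.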

Optimizing Performance

Several factors affect throughput:

Use the Fastest NVMe Possible

PCIe 5.0 NVMe drives like the Crucial T705 deliver 12+ GB/s sequential reads (PCIe 4.0 drives such as the Samsung 990 Pro top out around 7 GB/s), reducing layer loading time significantly.
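A back-of-the-envelope estimate shows why drive speed matters. Assuming Llama 3.1 70B's 80 transformer layers split the 140 GB of FP16 weights roughly evenly (an approximation; embedding and output matrices are ignored):

```python
# Rough per-layer load time for Llama 3.1 70B in FP16
total_gb = 140                      # from the memory table above
num_layers = 80                     # transformer blocks in Llama 3.1 70B
layer_gb = total_gb / num_layers    # ~1.75 GB per layer

for name, gbps in [("PCIe 4.0 NVMe (~7 GB/s)", 7),
                   ("PCIe 5.0 NVMe (~12 GB/s)", 12)]:
    print(f"{name}: {layer_gb / gbps * 1000:.0f} ms per layer")
```

At ~1.75 GB per layer, a Gen4 drive needs roughly 250 ms per layer versus about 146 ms on a Gen5 drive; multiplied across every non-resident layer per token, that gap dominates end-to-end latency.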

Optimize Layer Scheduling

The inference engine predicts which layers are needed next and preloads them:

# Configuration for aggressive prefetching
config = {
    "prefetch_layers": 4,      # Load 4 layers ahead
    "cache_layers": 8,         # Keep 8 layers in VRAM
    "stream_priority": "high"  # Use high-priority CUDA streams
}
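A minimal sketch of what such a scheduler decides with those settings (illustrative only; `plan_prefetch` is a hypothetical helper, and a real engine would issue the resulting loads as asynchronous GDS reads on high-priority CUDA streams):

```python
def plan_prefetch(current_layer: int, num_layers: int,
                  prefetch_layers: int, cached: set) -> list:
    """Return the layer indices to start loading now: the next
    `prefetch_layers` layers that are not already resident in VRAM."""
    upcoming = range(current_layer + 1,
                     min(current_layer + 1 + prefetch_layers, num_layers))
    return [i for i in upcoming if i not in cached]

# While layer 10 computes, layers 11-14 stream in if absent:
print(plan_prefetch(10, 80, 4, cached={11, 12}))  # -> [13, 14]
```

Because a decoder runs its layers in a fixed order, lookahead prediction is trivial during a forward pass; the tuning question is how many in-flight reads the NVMe drive can sustain without starving the current layer's transfer.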

Monitor GDS Statistics

# Watch real-time GDS throughput
watch -n 1 'cat /proc/driver/nvidia-fs/stats'

When to Use This Approach

NVMe-to-GPU offloading makes sense when:

  • Running 70B+ models for development or testing
  • Cost is more important than latency
  • Batch processing large datasets overnight
  • Building proof-of-concepts before investing in GPU clusters

It is less suitable for:

  • Production real-time inference with strict SLAs
  • High-throughput serving (many concurrent users)
  • Applications requiring sub-second response times

Alternatives to Consider

If NVMe offloading does not fit your needs, consider:

  • vLLM with PagedAttention: Better memory efficiency for serving
  • llama.cpp quantization: Run smaller quantized models entirely in VRAM
  • Cloud spot instances: A100 spot pricing can be cost-effective for burst workloads

Conclusion

NVMe-to-GPU direct transfer democratizes access to large language models. You can now experiment with 70B parameter models on hardware you likely already own. While not suitable for high-throughput production serving, it enables research, development, and batch processing at a fraction of the traditional cost.

The technique represents a broader trend in AI infrastructure: creative hardware utilization to work around the GPU memory bottleneck that currently limits LLM deployment.


Want to automate AI infrastructure monitoring and LLM deployment pipelines? Akmatori helps SRE teams build intelligent agents that manage complex systems with natural language commands.

Automate incident response and prevent on-call burnout with AI-driven agents!