# Run 70B LLMs on Consumer GPUs: NVMe-to-GPU Direct Transfer

Running a 70-billion parameter LLM typically requires multiple A100 GPUs or expensive cloud instances. But what if you could run Llama 3.1 70B on a single RTX 3090 with its mere 24GB of VRAM? A new technique making waves in the AI community does exactly that by using NVMe storage as virtual VRAM and bypassing the CPU entirely.
## Quick Reference

```bash
# Install dependencies (Ubuntu/Debian)
sudo apt install nvidia-cuda-toolkit nvme-cli

# Check whether the NVMe controller exposes a CMB (Controller Memory Buffer)
sudo nvme show-regs -H /dev/nvme0 | grep -i cmb

# Verify the GPUDirect Storage kernel module is loaded
lsmod | grep nvidia_fs
```
## The VRAM Problem
Large language models have memory requirements that exceed the capacity of most consumer GPUs:
| Model | Parameters | FP16 Memory | INT8 Memory |
|---|---|---|---|
| Llama 3.1 8B | 8B | 16 GB | 8 GB |
| Llama 3.1 70B | 70B | 140 GB | 70 GB |
| Llama 3.1 405B | 405B | 810 GB | 405 GB |
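The FP16 and INT8 columns follow from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter (2 for FP16, 1 for INT8), before KV cache and activation overhead. A quick sanity check:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Ignores KV cache and activation memory, which add several more GB.
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

for name, params in [("8B", 8e9), ("70B", 70e9), ("405B", 405e9)]:
    print(f"Llama 3.1 {name}: FP16 {weight_memory_gb(params, 2):.0f} GB, "
          f"INT8 {weight_memory_gb(params, 1):.0f} GB")
```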
An RTX 3090 with 24GB cannot fit even an 8-bit quantized 70B model. Traditional solutions include:
- Model quantization (4-bit, 2-bit): Reduces quality
- CPU offloading: Slow PCIe transfers through system RAM
- Tensor parallelism: Requires multiple expensive GPUs
## Enter NVMe-to-GPU Direct Transfer
The key innovation is using NVIDIA GPUDirect Storage (GDS) combined with NVMe Controller Memory Buffer (CMB) to stream model weights directly from SSD to GPU memory, completely bypassing the CPU and system RAM.
### How It Works

```
Traditional path:
NVMe SSD -> PCIe -> CPU -> System RAM -> PCIe -> GPU VRAM

Direct path:
NVMe SSD -> PCIe -> GPU VRAM (CPU bypassed)
```
This approach treats fast NVMe storage as an extension of GPU memory. During inference, only the active layers need to reside in VRAM while inactive layers remain on the SSD, streaming in as needed.
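The residency logic can be sketched as a fixed-capacity cache of layers. This is a simplified model with an LRU eviction policy; the class and policy here are illustrative, not ntransformer's actual implementation:

```python
from collections import OrderedDict

class LayerCache:
    """Toy model of VRAM layer residency: keep at most `capacity` layers
    'in VRAM'; evict the least-recently-used layer when a new one streams in."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()  # layer_id -> True
        self.ssd_reads = 0

    def fetch(self, layer_id: int):
        if layer_id in self.resident:
            self.resident.move_to_end(layer_id)  # hit: already in VRAM
            return
        self.ssd_reads += 1                      # miss: stream from NVMe
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)    # evict the LRU layer
        self.resident[layer_id] = True

# One forward pass over an 80-layer model with room for 8 resident layers:
cache = LayerCache(capacity=8)
for layer in range(80):
    cache.fetch(layer)
print(cache.ssd_reads)  # -> 80
```

On a cold pass every layer misses once, and because a forward pass visits all 80 layers in order while only 8 fit, a second pass misses them all again. That is why raw NVMe bandwidth and prefetching dominate throughput in this regime.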
## Hardware Requirements
Not all hardware supports this technique. You need:
- NVIDIA GPU with GPUDirect Storage support (RTX 30/40 series, A-series, H-series)
- NVMe SSD with CMB support and high sequential read speeds (5+ GB/s recommended)
- PCIe 4.0 or 5.0 motherboard with proper lane configuration
- Linux kernel 5.15+ with nvidia-gds module
### Check Your Hardware

```bash
# Verify GDS support
dpkg -l | grep nvidia-gds

# List available NVMe devices
nvme list

# Check PCIe link speed of the first NVMe controller
sudo lspci -vv -s "$(lspci | grep -i 'non-volatile' | head -n1 | cut -d' ' -f1)" | grep -i lnksta
```
## Setting Up the Environment

### Install NVIDIA GDS

```bash
# Add the NVIDIA CUDA repository keyring (nvidia-gds ships from the CUDA repo)
distribution=$(. /etc/os-release; echo $ID${VERSION_ID//./})
wget https://developer.download.nvidia.com/compute/cuda/repos/${distribution}/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb

# Install GDS
sudo apt-get update
sudo apt-get install -y nvidia-gds
sudo modprobe nvidia_fs
```
### Configure Huge Pages

GDS works best with huge pages enabled:

```bash
# Reserve 16 huge pages of the default size (typically 2 MB);
# 1 GB pages would require default_hugepagesz=1G on the kernel command line
echo 16 | sudo tee /proc/sys/vm/nr_hugepages

# Make persistent across reboots
echo "vm.nr_hugepages=16" | sudo tee -a /etc/sysctl.conf
```
### Verify GDS Is Working

```bash
# Run the GDS verification tool (gdscheck.py in recent CUDA releases)
/usr/local/cuda/gds/tools/gdscheck.py -p
```
Expected output should show "GPUDirect Storage supported" for your NVMe devices.
## Running Llama 70B with ntransformer
The ntransformer project implements this technique for running large LLMs:
```bash
# Clone the repository
git clone https://github.com/xaskasdf/ntransformer.git
cd ntransformer

# Install dependencies
pip install -r requirements.txt

# Download Llama 3.1 70B weights (requires HuggingFace access)
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
    --local-dir ./models/llama-70b

# Prepare weights for NVMe streaming
python prepare_weights.py --model ./models/llama-70b \
    --output /mnt/nvme/llama-70b-prepared

# Run inference
python run.py --model-path /mnt/nvme/llama-70b-prepared \
    --gds-enabled \
    --max-batch-size 1 \
    --prompt "Explain quantum computing"
```
## Performance Characteristics
The trade-off is latency versus cost. Here are typical benchmarks on an RTX 3090 with a Samsung 990 Pro NVMe:
| Metric | Traditional (A100 80GB) | NVMe Offload (RTX 3090) |
|---|---|---|
| Time to first token | 0.5s | 2.1s |
| Tokens per second | 45 | 12 |
| Hardware cost | $15,000 | $1,500 |
| Power consumption | 400W | 350W |
For batch processing or non-interactive use cases, the 10x cost reduction often justifies the 3-4x performance decrease.
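The table's numbers make that trade-off concrete:

```python
# Hardware cost per unit of throughput, using the benchmark table above.
a100_cost, a100_tps = 15_000, 45   # A100 80GB
rtx_cost, rtx_tps = 1_500, 12      # RTX 3090 + NVMe offload

cost_per_tps_a100 = a100_cost / a100_tps
cost_per_tps_rtx = rtx_cost / rtx_tps

print(round(cost_per_tps_a100))  # -> 333 dollars per token/s
print(round(cost_per_tps_rtx))   # -> 125 dollars per token/s
print(a100_tps / rtx_tps)        # -> 3.75 (the "3-4x" slowdown)
```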
## Optimizing Performance
Several factors affect throughput:
### Use the Fastest NVMe Possible

A PCIe 5.0 drive like the Crucial T705 delivers 12+ GB/s sequential reads, and even a PCIe 4.0 drive like the Samsung 990 Pro manages roughly 7 GB/s, cutting layer loading time significantly.
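A back-of-the-envelope calculation shows why bandwidth matters: when a layer must stream from disk before it can run, per-layer load time is model size divided by (layer count times read speed). Assuming an INT8-quantized 70B model (~70 GB) split across 80 transformer layers (both figures are illustrative assumptions):

```python
def layer_load_ms(model_gb: float, num_layers: int, read_gbps: float) -> float:
    """Time to stream one layer's weights from NVMe, in milliseconds."""
    return model_gb / num_layers / read_gbps * 1000

# 70 GB of INT8 weights over 80 layers, at different sequential read speeds:
for speed in (5, 7, 12):  # GB/s, roughly PCIe 3/4/5-class drives
    print(f"{speed} GB/s -> {layer_load_ms(70, 80, speed):.0f} ms per layer")
```

Going from a 5 GB/s drive to a 12 GB/s one cuts the per-layer stall from about 175 ms to about 73 ms, which compounds across every layer of every token.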
### Optimize Layer Scheduling

The inference engine predicts which layers are needed next and preloads them:

```python
# Configuration for aggressive prefetching
config = {
    "prefetch_layers": 4,       # Load 4 layers ahead
    "cache_layers": 8,          # Keep 8 layers in VRAM
    "stream_priority": "high",  # Use high-priority CUDA streams
}
```
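A toy sketch of how a `prefetch_layers`-style knob could drive scheduling (the function and its shape are illustrative, not ntransformer's real API):

```python
def prefetch_plan(current_layer: int, total_layers: int, prefetch_layers: int):
    """Layers to start streaming while `current_layer` computes.
    A real engine would issue these as asynchronous reads on
    high-priority CUDA streams rather than returning a list."""
    first = current_layer + 1
    last = min(current_layer + prefetch_layers, total_layers - 1)
    return list(range(first, last + 1))

print(prefetch_plan(10, 80, 4))  # -> [11, 12, 13, 14]
print(prefetch_plan(78, 80, 4))  # -> [79] (clamped at the final layer)
```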
### Monitor GDS Statistics

```bash
# Watch real-time GDS throughput
watch -n 1 'cat /proc/driver/nvidia-fs/stats'
```
## When to Use This Approach
NVMe-to-GPU offloading makes sense when:
- Running 70B+ models for development or testing
- Cost is more important than latency
- Batch processing large datasets overnight
- Building proof-of-concepts before investing in GPU clusters
It is less suitable for:
- Production real-time inference with strict SLAs
- High-throughput serving (many concurrent users)
- Applications requiring sub-second response times
## Alternatives to Consider
If NVMe offloading does not fit your needs, consider:
- vLLM with PagedAttention: Better memory efficiency for serving
- llama.cpp quantization: Run smaller quantized models entirely in VRAM
- Cloud spot instances: A100 spot pricing can be cost-effective for burst workloads
## Conclusion
NVMe-to-GPU direct transfer democratizes access to large language models. You can now experiment with 70B parameter models on hardware you likely already own. While not suitable for high-throughput production serving, it enables research, development, and batch processing at a fraction of the traditional cost.
The technique represents a broader trend in AI infrastructure: creative hardware utilization to work around the GPU memory bottleneck that currently limits LLM deployment.
Want to automate AI infrastructure monitoring and LLM deployment pipelines? Akmatori helps SRE teams build intelligent agents that manage complex systems with natural language commands.
