My AI Rig
Dual GeForce RTX 3090 GPUs for Local Large Language Models
In the AI age, a robust computing setup is essential for developers, researchers, and enthusiasts who would rather run LLM and diffusion models locally than rely solely on the big cloud providers. This article explores how to build a high-performance local AI lab using dual NVIDIA GeForce RTX 3090 GPUs. We'll discuss why this setup is ideal for a home lab, look at which models to run and the inference frameworks that serve them, including Llama.cpp and vLLM, and compare the setup against other GPU options like AMD cards. Additionally, we'll explore why CUDA stands out compared to ROCm and Vulkan for AI applications.
Why Choose Dual GeForce RTX 3090 GPUs for Your Home Lab?
The NVIDIA GeForce RTX 3090 is renowned for its exceptional performance in AI and deep learning tasks. Here's why a dual RTX 3090 setup is ideal for a home AI lab:
- High VRAM Capacity: Each RTX 3090 comes with 24GB of GDDR6X VRAM, totaling 48GB when using two GPUs. This ample memory is crucial for running large language models (LLMs).
- Cost-Effective Performance: While not cheap, the RTX 3090 offers a sweet spot between consumer and enterprise GPUs, delivering near-professional performance without the hefty price tag of data center cards.
- CUDA Support: NVIDIA's CUDA platform provides optimized performance for AI applications, outpacing alternatives like ROCm or Vulkan in compatibility and efficiency.
- Widespread Community and Documentation: The RTX series is widely used, offering abundant resources, tutorials, and community support.
With so many models out there, where to start?
With dual RTX 3090 GPUs, you can efficiently run a wide variety of large language models, all the way up to 70B-parameter models at 4-bit (Q4) quantization. It's worth noting that nowadays even small models in the sub-10B-parameter range can be very effective in many applications.
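As a back-of-the-envelope check, here is roughly how the VRAM math works out for a Q4-quantized 70B model, assuming the common rule of thumb of about half a byte per parameter for the weights (KV cache and runtime overhead come on top):
# Rough VRAM estimate for a 4-bit (Q4) quantized model.
# Rule of thumb: ~0.5 bytes per parameter for the weights alone.
params = 70e9                                  # 70B parameters
weights_gib = params * 0.5 / 1024**3
print(f"~{weights_gib:.0f} GiB for weights")   # ~33 GiB, leaving headroom in 48 GiB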
- Llama 3.1: One simply can't discuss local LLMs without mentioning Llama 3.1, a competent and efficient series of models by Meta.
- Qwen2.5-Coder: November 2024 saw the release of Qwen2.5-Coder, a highly efficient and effective family of models renowned for its excellent performance-to-size ratio. Its small size makes it a compelling choice for dual RTX 3090 setups.
How to Run Models on Dual RTX 3090 GPUs
- Llama.cpp: An efficient framework for running Meta's LLaMA models on local hardware. It's optimized for performance and lower resource consumption, making it suitable for a dual RTX 3090 setup.
- vLLM: A high-throughput and memory-efficient inference engine for LLMs. It leverages efficient batching and optimized CUDA kernels, which align well with the capabilities of the RTX 3090.
- Ollama: A well-established convenience layer that makes running local models a breeze by handling model installation and serving. It uses llama.cpp under the hood; see the example below.
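For example, once Ollama is installed, pulling and chatting with a model is a single command:
ollama run llama3.1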
Setting Up Your Local AI Lab: Step-by-Step Guide
1. Hardware Requirements
Ensure you have the following components for an optimal setup:
- Two NVIDIA GeForce RTX 3090 GPUs
- A motherboard with dual PCIe x16 slots
- A high-wattage power supply unit (minimum 1500W recommended)
- A multi-core CPU (e.g., AMD Ryzen 9 7950X)
- At least 32GB of RAM
- High-speed NVMe SSDs for storage
- Effective cooling solutions to manage heat. This is a very important point in any multi-GPU setup: two 3090s generate a lot of heat!
2. Installing the GPUs
Install the two RTX 3090 GPUs into the PCIe slots on your motherboard. Ensure they are securely seated and properly powered. While NVLink is an option, it's not necessary for most AI workloads and can be omitted to reduce complexity.
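Once the cards are seated, you can verify that the system detects both, even before any drivers are installed, with a quick PCI listing:
lspci | grep -i nvidia
Both GPUs should appear in the output.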
3. Setting Up the Software Environment
Use a Linux-based operating system like Ubuntu for better compatibility with AI tools. Install the latest NVIDIA drivers and CUDA toolkit to leverage the full power of your GPUs.
sudo apt update && sudo apt upgrade
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-550   # or the version `ubuntu-drivers devices` recommends; recent vLLM builds need a CUDA 12-capable driver
sudo apt install nvidia-cuda-toolkit
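After a reboot, confirm that the driver sees both GPUs:
nvidia-smi
You should see two RTX 3090 entries, each with 24GB of VRAM.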
4. Installing AI Frameworks and Libraries
Set up a Python environment and install essential AI libraries such as TensorFlow, PyTorch, and any specific libraries required for Llama.cpp and vLLM.
sudo apt install python3-venv python3-pip
python3 -m venv ai_lab_env
source ai_lab_env/bin/activate
pip install --upgrade pip
pip install tensorflow torch torchvision torchaudio
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python   # build llama-cpp-python with CUDA support
pip install vllm
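As a quick smoke test, here is a minimal vLLM sketch that shards a model across both GPUs with tensor parallelism (the model name is only an example; substitute any Hugging Face model that fits in 48GB):
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across both RTX 3090s.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)  # example model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)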
5. Configuring Multi-GPU Support
Modify your model configurations to utilize both GPUs. Frameworks like PyTorch make it straightforward to distribute workloads across multiple GPUs; DataParallel shown below is the simplest approach (for serious training workloads, PyTorch's documentation recommends DistributedDataParallel instead).
import torch

model = YourModel()  # YourModel is a placeholder for your own torch.nn.Module
if torch.cuda.device_count() > 1:
    # Replicate the model on both GPUs and split each batch between them.
    model = torch.nn.DataParallel(model)
model.to('cuda')
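A quick sanity check that PyTorch actually sees both cards:
import torch

print(torch.cuda.device_count())      # expect 2
print(torch.cuda.get_device_name(0))  # expect "NVIDIA GeForce RTX 3090"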
Comparing the RTX 3090 with Other GPUs
NVIDIA GeForce RTX Series
While the RTX 3090 offers top-tier performance, other RTX GPUs like the 3080 or 3070 are also viable but come with less VRAM and lower computational capabilities. The RTX 3090's 24GB VRAM is particularly beneficial for large models that require extensive memory.
AMD Radeon Series
AMD's GPUs, such as the Radeon RX 6900 XT, offer competitive raw performance and are often more cost-effective. However, they lack the extensive AI ecosystem support that NVIDIA's CUDA platform provides. AMD relies on ROCm (Radeon Open Compute), which has limited compatibility with AI frameworks and models.
Why CUDA Outperforms ROCm and Vulkan
CUDA is NVIDIA's proprietary parallel computing platform and programming model, widely adopted in the AI community for several reasons:
- Extensive Framework Support: CUDA is natively supported by major AI frameworks like TensorFlow and PyTorch, ensuring seamless integration and optimization.
- Optimized Performance: CUDA libraries are highly optimized for NVIDIA hardware, delivering superior performance in AI computations compared to ROCm or Vulkan.
- Robust Developer Tools: NVIDIA provides comprehensive tools for debugging, profiling, and optimizing AI applications.
- Community and Resources: A vast community of developers contributes to a rich ecosystem of tutorials, forums, and third-party libraries.
On the other hand, ROCm is AMD's open compute platform but suffers from limited support and compatibility issues. Vulkan, while a powerful graphics API, is not specifically optimized for AI workloads and lacks the extensive libraries and community support found with CUDA.
Optimizing Performance for Llama.cpp and vLLM
To maximize the efficiency of your dual RTX 3090 setup when running models like Llama.cpp and vLLM:
- Utilize GPU Acceleration: Ensure that the models are configured to leverage GPU acceleration fully. Both Llama.cpp and vLLM have options to offload computations to GPUs; see the sketch after this list.
- Adjust Precision: Use mixed-precision or lower-precision computations (e.g., FP16 or INT8) to reduce memory usage and increase speed without significantly impacting accuracy.
- Efficient Batching: Optimize batch sizes to make the best use of GPU memory and processing power.
- Profiling and Monitoring: Use tools like NVIDIA's Nsight Systems to profile your applications and identify performance bottlenecks.
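Putting the first two points together, here is a minimal llama-cpp-python sketch that offloads all layers to the GPUs, splits the weights evenly across both cards, and loads a Q4-quantized GGUF file (the model path and split ratio are illustrative):
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # example path to a Q4 GGUF
    n_gpu_layers=-1,          # offload every layer to the GPUs
    tensor_split=[0.5, 0.5],  # balance the weights across both 3090s
    n_ctx=4096,               # context window; larger values use more VRAM
)
out = llm("Q: Why quantize a model? A:", max_tokens=64)
print(out["choices"][0]["text"])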
Cost Analysis and Energy Considerations
Building a local AI lab is a significant investment, but it can offer long-term benefits over cloud-based solutions.
- Initial Hardware Costs: RTX 3090 GPUs are priced around $1,200 each, making the total GPU cost approximately $2,400. Despite increased demand, it is still possible to find deals: I sourced two almost-new 3090s second-hand for $1,200 total. Be aware of scammers, however, and never pay in advance without payment protection such as PayPal Goods & Services.
- Energy Consumption: Each RTX 3090 has a 350W TDP, and transient spikes can approach 500W. Ensure your power supply can handle the load and consider the impact on your electricity bill. At some cost in performance, you can always power-limit the GPUs to reduce power consumption and heat generation; see the example after this list.
- Long-Term Savings: For heavy users, the initial investment can be offset by savings from reduced cloud service fees. Having a way to run powerful models locally can also increase use cases for LLM utilization when the unit cost of running is low.
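For example, nvidia-smi can cap each card's board power; values around 280W are a popular compromise for the 3090, though the exact number is yours to tune:
sudo nvidia-smi -pl 280   # applies to all GPUs; use -i 0 or -i 1 to target one card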
Advantages of a Local AI Lab Over Cloud-Based Solutions
While cloud services offer scalability and convenience, a local AI lab provides unique benefits:
- Data Privacy: Keep sensitive data on-premises, reducing the risk of data breaches.
- Cost Efficiency: Avoid recurring cloud costs, which can accumulate significantly over time, especially with large-scale experiments.
- Performance: Eliminate latency issues associated with data transfer to and from the cloud.
- Customization: Full control over your hardware and software stack allows for tailored optimizations.
Conclusion
Building a local AI lab with dual GeForce RTX 3090 GPUs empowers you to run advanced AI models efficiently with frameworks like Llama.cpp and vLLM. The RTX 3090's strong performance, combined with NVIDIA's CUDA platform, provides a robust foundation for AI development that outperforms other GPU options in both compatibility and speed. By investing in this setup, you gain control, flexibility, and the ability to push the boundaries of what's possible in AI research and application, all from the comfort of your home.
Contact me to discuss hardware for local LLMs.

This article is part of our series on AI development and high-performance computing. Dive deeper into AI technologies and hardware configurations with our other articles.