One day, I started wondering: is it possible to run AI models on the ordinary desktop where I play games? I remembered that I have an NVIDIA graphics card in it, so I started digging into what I could do, and the doubts piled up. Will the desktop be able to handle it? If yes, where do I start? Is it going to crash the system?
So many questions. Let’s ask gen AI (ChatGPT, Gemini, etc.).
When I put in the query, it started giving me answers that assume huge servers and configurations. No! No!… !!! But do I have that much computing power? I need answers that fit my existing computing resources.
First things first, I grabbed a piece of paper. Oh wait! I know that works, but in today’s digital world, VS Code is the new notepad :P. I gathered a couple of pieces of information.
Desktop Configurations
| Component | Configuration |
|---|---|
| CPU | Intel Core i5-11400 (11th Gen) |
| RAM | 32 GB |
| Storage | SSD |
| GPU | MSI GeForce GTX 1650 VENTUS XS OC (NVIDIA, 4 GB VRAM) |
| OS | Windows 11 |
The configuration looks decent to me. Will it be able to run any LLMs? The answer is YES. Let’s give it a try.

Prerequisites
1. Update Windows
- Update the operating system. I am using Windows, so I performed a Windows Update.
2. Update GPU Drivers
Update the GPU drivers. I have an NVIDIA GPU, so I updated the NVIDIA drivers.
To put the CUDA cores to work, download the CUDA toolkit:
https://developer.nvidia.com/cuda/toolkit
If the GPU is older, like mine, use the link below to look up its compute capability first.
https://developer.nvidia.com/cuda/gpus
For the GTX 1650, the compute capability is 7.5. (Note: that is the compute capability, not a CUDA toolkit version; any recent CUDA toolkit supports it.)
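On reasonably recent drivers, nvidia-smi can also report the compute capability directly (note: the compute_cap query field does not exist on very old driver versions):
nvidia-smi --query-gpu=name,compute_cap --format=csv
# prints something like: NVIDIA GeForce GTX 1650, 7.5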
Run the command below in cmd or PowerShell to check whether the drivers and the CUDA toolkit are installed.
nvidia-smi
The output will look like this:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 591.86 Driver Version: 591.86 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1650 WDDM | 00000000:01:00.0 On | N/A |
| 40% 36C P8 15W / 90W | 681MiB / 4096MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1816 C+G ...8bbwe\PhoneExperienceHost.exe N/A |
| |
+-----------------------------------------------------------------------------------------+
3. WSL Installation
Since I am running Windows, I need a Linux distribution to run an LLM locally. For that, I have options such as Docker, a virtual machine, or WSL. To take advantage of the GPU in a virtual environment, WSL is the best option.
- Check current WSL distros installed on the system.
wsl --list --verbose
- Install Ubuntu 22.04 using WSL 2.
wsl --install -d Ubuntu-22.04
- After downloading the distro, it will ask for a username and password. Provide them; they will be used to log in.
- Check the installation.
wsl --list --verbose
NAME STATE VERSION
* Ubuntu-22.04 Stopped 2
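If the distro shows up with VERSION 1 instead of 2, WSL 2 can be made the default and the existing distro converted in place:
wsl --set-default-version 2
wsl --set-version Ubuntu-22.04 2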
Moving the installation to another directory
- Export the installation to a tar file so it can be moved to another directory.
wsl --export Ubuntu-22.04 D:\AI\ubuntu.tar
- Unregister the current installation from the list of installed distros.
wsl --unregister Ubuntu-22.04
- Import to another location.
wsl --import Ubuntu-22.04 D:\AI\ubuntu D:\AI\ubuntu.tar --version 2
- Login to Ubuntu using PowerShell.
wsl -d Ubuntu-22.04
- Update the default user (note: the command below overwrites any existing /etc/wsl.conf).
echo -e "[user]\ndefault=your_username" | sudo tee /etc/wsl.conf
- Restart WSL.
wsl --shutdown
- Login to Ubuntu using PowerShell.
wsl -d Ubuntu-22.04
- Check whether the drivers are visible inside WSL/Ubuntu. Run the command below inside the same window.
nvidia-smi
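If nvidia-smi is not found inside Ubuntu, it is worth checking whether Windows has mounted the GPU driver libraries into the distro; in a standard WSL 2 setup they live under /usr/lib/wsl/lib:
ls /usr/lib/wsl/lib
# expect libcuda.so and a WSL build of nvidia-smi in the listing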
Okay! Now the desktop is ready to install and run LLMs. But which one should I run? How do I interact with an LLM? How do I manage LLMs locally?

Install Ollama
Let’s use Ollama. Docs: https://docs.ollama.com/
- Run the command below to install Ollama.
curl -fsSL https://ollama.com/install.sh | sh
- Verify installation.
ollama --version
- Create a directory to store Ollama models, and an environment variable that points to the new location.
mkdir <path-to-store-ollama-models>
nano ~/.bashrc
# add this line at the end of ~/.bashrc, then save and exit:
export OLLAMA_MODELS=<path-to-store-ollama-models>
source ~/.bashrc
# validate that the environment variable is set properly
echo $OLLAMA_MODELS
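One caveat, assuming the install script behaves the standard way: if systemd is enabled in WSL, Ollama gets registered as a service, and that service does not read ~/.bashrc. In that case, per Ollama’s FAQ, the variable goes into a service override instead:
sudo systemctl edit ollama.service
# in the editor that opens, add:
# [Service]
# Environment="OLLAMA_MODELS=<path-to-store-ollama-models>"
sudo systemctl daemon-reload
sudo systemctl restart ollama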
- Start Ollama
ollama serve
- Now Ollama will start running. In another terminal, verify that Ollama is running:
ollama -v
ollama version is 0.15.2
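Ollama also exposes a local REST API on port 11434, which gives another quick health check; /api/tags lists the models downloaded so far:
curl http://localhost:11434/api/tags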

I am done with the hard part!
- Now, let’s download an AI model. To be on the safe side, I started small, with a vector embedding model (nomic-embed-text).
ollama pull nomic-embed-text
- Test the downloaded model.
curl --location 'http://localhost:11434/api/embeddings' \
--header 'Content-Type: application/json' \
--data '{
"model": "nomic-embed-text",
"prompt": "deciphermiddleware"
}'
A vector output is generated. A successful test!!!
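To sanity-check the response without scrolling through hundreds of numbers, jq (if installed) can count the vector’s dimensions; nomic-embed-text should produce a 768-dimensional embedding:
curl -s http://localhost:11434/api/embeddings \
--header 'Content-Type: application/json' \
--data '{"model": "nomic-embed-text", "prompt": "deciphermiddleware"}' \
| jq '.embedding | length'
# 768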
How much of an LLM’s work I can offload to the GPU depends heavily on VRAM. Since the GTX 1650 has only 4 GB of VRAM, large models will not fit entirely on the GPU; instead, the layers get split between CPU and GPU. Let me try a 3B-parameter model, llama3.2.
ollama pull llama3.2
ollama run llama3.2
Output
>>> hi
How can I assist you today?
DEBUG INFO
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1918.35 MiB
load_tensors: CUDA0 model buffer size = 1488.14 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.50 MiB
llama_kv_cache: CPU KV buffer size = 32.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 416.00 MiB
llama_kv_cache: size = 448.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 224.00 MiB, V (f16): 224.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 588.73 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 875
llama_context: graph splits = 29 (with bs=512), 3 (with bs=1)
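The CPU/GPU split is also visible from Ollama itself: while the model is loaded, ollama ps in another terminal shows how it is divided between system RAM and VRAM (the output below is illustrative, not captured from this run):
ollama ps
NAME               ID              SIZE      PROCESSOR          UNTIL
llama3.2:latest    a80c4f17acd5    4.0 GB    24%/76% CPU/GPU    4 minutes from now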

Now, let’s start using the AI models and explore them further. But that is a story for another day.
I hope you enjoyed the journey. Please share your valuable feedback. 😊😊😊

