Installing AI Models on Localhost

Mon, 16 Feb 2026 00:00:00 +0000

Everything works on my Localhost

One day, I started wondering: is it possible to run AI models on the same ordinary desktop where I play games? I recalled that I have an NVIDIA graphics card in it, so I started digging into what I could do. I had so many doubts. Will the desktop be able to support it? If yes, where do I start? Is it going to crash the system?

(Image generated by AI: ChatGPT)

So many questions. Let’s ask gen AI (ChatGPT, Gemini, etc.).

When I put in the query, it started giving me answers that needed huge servers and configurations. No, no! Do I have that much computing power? I need answers that fit my existing computing resources.

First things first, I grabbed a piece of paper. Oh wait! I know this works, but in today’s digital world, VS Code is the new notepad :P. I gathered a few pieces of information.

Desktop Configuration

Component	Configuration
CPU	Intel Core i5-11400 (11th generation)
RAM	32 GB
SSD	Yes
GPU	MSI GeForce GTX 1650 VENTUS XS OC (NVIDIA, 4 GB VRAM)
OS	Windows 11

The configuration looks decent to me. Will it be able to support any LLMs? The answer is YES. Let’s give it a try.

Pre-requisites

1. Update Windows

Update the operating system. I am using Windows, so I ran Windows Update.

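2. Update GPU Drivers

Update the GPU drivers. I have an NVIDIA GPU, so I updated the NVIDIA drivers.
To use the CUDA cores, download the CUDA Toolkit.
https://developer.nvidia.com/cuda/toolkit
If the GPU is older, like mine, use the link below to look up its compute capability. https://developer.nvidia.com/cuda/gpus
The GTX 1650 has compute capability 7.5. This is a hardware feature level, not a CUDA Toolkit version: CUDA Toolkit 7.5 predates this GPU’s Turing architecture, so the 7.5 archive (https://developer.nvidia.com/cuda-75-downloads-archive) is not the right download. Pick a toolkit release that still lists compute capability 7.5 as supported.

Run the command below in cmd or PowerShell to check that the driver is installed. The CUDA version shown in the header is the highest version the driver supports; it does not prove that the toolkit itself is installed.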

nvidia-smi

The output will look like this.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 591.86                 Driver Version: 591.86         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650       WDDM |   00000000:01:00.0  On |                  N/A |
| 40%   36C    P8             15W /   90W |     681MiB /   4096MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1816    C+G   ...8bbwe\PhoneExperienceHost.exe      N/A      |
|                                                                                         |
+-----------------------------------------------------------------------------------------+
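The table above is meant for humans. For a quick scripted check, nvidia-smi also has a CSV query mode. Here is a minimal sketch, assuming the driver is recent enough to know the `compute_cap` query field; the helper names `parse_gpu_csv` and `query_gpus` are my own and not part of any NVIDIA tool.

```python
import csv
import io
import shutil
import subprocess

# Fields requested from nvidia-smi. Assumption: "compute_cap" is only
# available on reasonably recent drivers.
QUERY_FIELDS = ["name", "driver_version", "compute_cap", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Turn `--query-gpu=... --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(QUERY_FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi if it is on PATH; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        # With "nounits", memory values are plain MiB numbers.
        free_mib = int(gpu["memory.total"]) - int(gpu["memory.used"])
        print(f'{gpu["name"]}: compute capability {gpu["compute_cap"]}, {free_mib} MiB VRAM free')
```

The free VRAM figure is the one that matters later, because it limits how much of a model can be offloaded to the GPU.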

3. WSL Installation

Since I am running Windows, I wanted a Linux environment to run an LLM locally. The options are Docker, a virtual machine, or WSL. For using the GPU from a virtualized Linux environment, WSL was the best option for me.

Check which WSL distros are currently installed on the system.

wsl --list --verbose

Install Ubuntu 22.04 using WSL 2.

wsl --install -d Ubuntu-22.04

After downloading the distro, it asks for a username and password. Provide a username and password, which will be used to log in.

Check the installation.

wsl --list --verbose
  NAME            STATE           VERSION
* Ubuntu-22.04    Stopped         2
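If this check needs to be scripted (say, to confirm the distro is on WSL 2 before going further), the listing can be parsed. This is a rough sketch with hypothetical helper names. One assumption to verify on your own machine: as far as I know, wsl.exe writes its output as UTF-16LE unless the WSL_UTF8 environment variable is set to 1, so the bytes are decoded explicitly.

```python
import subprocess


def parse_wsl_list(text):
    """Parse `wsl --list --verbose` text into a list of distro dicts."""
    distros = []
    for line in text.splitlines()[1:]:  # first line is the NAME/STATE/VERSION header
        is_default = line.lstrip().startswith("*")  # "*" marks the default distro
        parts = line.replace("*", " ", 1).split()
        if len(parts) != 3:
            continue  # ignore blank or unexpected lines
        name, state, version = parts
        distros.append(
            {"name": name, "state": state, "version": int(version), "default": is_default}
        )
    return distros


def list_distros():
    """Run wsl.exe on Windows and decode its (assumed) UTF-16LE output."""
    raw = subprocess.run(
        ["wsl", "--list", "--verbose"], capture_output=True, check=True
    ).stdout
    return parse_wsl_list(raw.decode("utf-16-le"))
```

On Windows, `list_distros()` returns one dictionary per installed distro; `parse_wsl_list` can be tried anywhere by passing it the text shown above.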

Moving the installation to another directory

Export the installation to a tar file so that it can be moved to another directory.

wsl --export Ubuntu-22.04 D:\AI\ubuntu.tar

Unregister the current installation from the list of installed distros.

wsl --unregister Ubuntu-22.04

Import to another location.

wsl --import Ubuntu-22.04 D:\AI\ubuntu D:\AI\ubuntu.tar --version 2

wsl -d Ubuntu-22.04
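The order of these steps matters, because unregistering deletes the distro's disk. A small sketch of the same export, unregister, and import sequence as a script, with a dry run by default; the function names and the dry-run switch are my own additions, and the paths are the ones used above.

```python
import subprocess
from pathlib import PureWindowsPath


def move_commands(distro, target_dir):
    """Return the export, unregister, and import commands in the order they must run."""
    target = PureWindowsPath(target_dir)
    tar_file = str(target / "ubuntu.tar")
    install_dir = str(target / "ubuntu")
    return [
        ["wsl", "--export", distro, tar_file],
        ["wsl", "--unregister", distro],
        ["wsl", "--import", distro, install_dir, tar_file, "--version", "2"],
    ]


def move_distro(distro, target_dir, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for cmd in move_commands(distro, target_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence on the first failure, so the distro
            # is never unregistered if the export did not succeed.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    move_distro("Ubuntu-22.04", r"D:\AI")  # dry run: only prints the commands
```

Calling `move_distro("Ubuntu-22.04", r"D:\AI", dry_run=False)` from Windows would actually perform the move.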

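Update the default user. After an import, WSL logs in as root, so the default user has to be set again. The -a flag makes tee append to /etc/wsl.conf instead of overwriting settings that are already there.

printf '\n[user]\ndefault=your_username\n' | sudo tee -a /etc/wsl.conf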
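If /etc/wsl.conf already has other sections (for example a [boot] section), they should survive this change. Here is a minimal sketch of merging the [user] setting into existing content with Python's configparser. The function name is hypothetical, and note one limitation: configparser drops comments when it rewrites the file.

```python
import configparser
import io


def set_default_user(wsl_conf_text, username):
    """Return wsl.conf text with [user] default set, keeping other sections."""
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep key case, e.g. appendWindowsPath
    config.read_string(wsl_conf_text)
    if not config.has_section("user"):
        config.add_section("user")
    config.set("user", "default", username)
    out = io.StringIO()
    # Write "key=value" without spaces, matching the style used above.
    config.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    existing = "[boot]\nsystemd=true\n"
    print(set_default_user(existing, "your_username"))
```

The function only transforms text; writing the result back to /etc/wsl.conf still needs root, and `wsl --shutdown` afterwards.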

Restart WSL.

wsl --shutdown

wsl -d Ubuntu-22.04

Check that the GPU driver is visible inside WSL/Ubuntu. Run the command below in the same window.

nvidia-smi

Okay! Now the desktop is ready to install and run LLMs. But which one should I run? How do I interact with an LLM? How do I manage LLMs locally?

Install Ollama

Let’s use Ollama 😎. Docs: https://docs.ollama.com/

Run the command below to install Ollama.

curl -fsSL https://ollama.com/install.sh | sh

Verify installation.

ollama --version

Create a directory to store Ollama models, and create an environment variable that points to the new location.

mkdir -p <path-to-store-ollama-models>
nano ~/.bashrc
# Add this line at the end of ~/.bashrc, then save and exit:
export OLLAMA_MODELS=<path-to-store-ollama-models>
# Reload the shell configuration
source ~/.bashrc
# Validate that the environment variable is set properly
echo "$OLLAMA_MODELS"
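Before pulling gigabytes of models, it is worth confirming that the variable points at a directory that exists and is writable. A small sketch with hypothetical helper names; the fallback path is an assumption about where a user-run Ollama keeps models on Linux when the variable is unset.

```python
import os
from pathlib import Path


def resolve_models_dir(env=None):
    """Return the directory from OLLAMA_MODELS, or the assumed per-user default."""
    env = os.environ if env is None else env
    configured = env.get("OLLAMA_MODELS", "").strip()
    if configured:
        return Path(configured).expanduser()
    # Assumption: with the variable unset, a user-run `ollama serve` on Linux
    # stores models under ~/.ollama/models.
    return Path.home() / ".ollama" / "models"


def check_models_dir(path):
    """Report whether the directory exists and is writable by the current user."""
    return path.is_dir() and os.access(path, os.W_OK)


if __name__ == "__main__":
    models_dir = resolve_models_dir()
    status = "ready" if check_models_dir(models_dir) else "missing or not writable"
    print(f"{models_dir}: {status}")
```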

Start Ollama

ollama serve

Now Ollama will start running. In another terminal, verify that Ollama is running:

ollama -v

ollama version is 0.15.2
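`ollama -v` prints the client version. To ask the server itself, its HTTP API can be queried. A minimal sketch using only the standard library, assuming the server listens on the default address `http://localhost:11434` and exposes `GET /api/version`; the function names are mine.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used by `ollama serve`


def parse_version(body):
    """Extract the version string from the JSON body of GET /api/version."""
    return json.loads(body).get("version")


def server_version(base_url=OLLAMA_URL, timeout=2.0):
    """Return the running server's version, or None if nothing answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError, ValueError):
        return None


if __name__ == "__main__":
    version = server_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```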

I am done with the hard part 😎😎😎.

Now, let’s download an AI model. To be on the safe side, I started small by downloading a vector embedding model (nomic-embed-text).

ollama pull nomic-embed-text

Test the downloaded model.

curl --location 'http://localhost:11434/api/embeddings' \
--header 'Content-Type: application/json' \
--data '{
 "model": "nomic-embed-text",
 "prompt": "deciphermiddleware"
 }'

A vector is returned, so the test is successful!

How much of an LLM I can offload to the GPU depends heavily on VRAM. The GTX 1650 has only 4 GB of VRAM, so larger models will not fit entirely on the GPU. Such models are split, with some layers running on the GPU and the rest on the CPU. Let me try Llama 3.2 (llama3.2), a 3B-parameter model.
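A single vector is hard to judge by eye. Comparing two embeddings makes the test more meaningful. Below is a rough sketch that calls the same `/api/embeddings` endpoint as the curl command and computes cosine similarity; the helper names and the second prompt are only illustrations.

```python
import json
import math
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def embed(text, model="nomic-embed-text", base_url=OLLAMA_URL):
    """POST to /api/embeddings (the endpoint used above) and return the vector."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.loads(resp.read())["embedding"]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Example (needs a running Ollama server with the model pulled):
# a = embed("deciphermiddleware")
# b = embed("decipher middleware")
# print(cosine_similarity(a, b))
```

Values close to 1 mean the two texts point in nearly the same direction in embedding space.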

ollama pull llama3.2
ollama run llama3.2

Output

>>> hi
How can I assist you today?
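The same prompt can also be sent over HTTP instead of the interactive prompt. A minimal sketch, assuming the default server address and the `/api/generate` endpoint with streaming turned off; the function names are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt, model="llama3.2"):
    """Build the JSON body for POST /api/generate with streaming turned off."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="llama3.2", base_url=OLLAMA_URL, timeout=300):
    """Send one prompt and return the model's full reply as a string."""
    body = json.dumps(build_generate_request(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (needs a running Ollama server with llama3.2 pulled):
# print(generate("hi"))
```

The generous timeout is deliberate: on a partly CPU-bound setup like this one, the first reply can take a while because the model has to load.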

DEBUG INFO

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1918.35 MiB
load_tensors: CUDA0 model buffer size = 1488.14 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_seq = 4096
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.50 MiB
llama_kv_cache: CPU KV buffer size = 32.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 416.00 MiB
llama_kv_cache: size = 448.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 224.00 MiB, V (f16): 224.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 588.73 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
llama_context: graph nodes = 875
llama_context: graph splits = 29 (with bs=512), 3 (with bs=1)
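The log shows where the VRAM went. Adding up the CUDA0 buffers gives a rough picture of how much of the 4 GB card this run needed. A small sketch that sums them from the lines above; the regular expression and function name are mine, and the comparison with the earlier nvidia-smi reading is only approximate because that reading was taken at a different moment.

```python
import re

# Matches lines such as "load_tensors: CUDA0 model buffer size = 1488.14 MiB".
# "CUDA_Host" buffers live in host RAM, so only names like CUDA0, CUDA1 count.
GPU_BUFFER = re.compile(r"\bCUDA\d+ .*buffer size\s*=\s*([\d.]+) MiB")


def gpu_buffer_total_mib(log_text):
    """Sum every CUDA<n> buffer size reported in the log, in MiB."""
    return sum(float(m.group(1)) for m in GPU_BUFFER.finditer(log_text))


# Buffer lines copied from the debug output above.
LOG = """\
load_tensors: CPU_Mapped model buffer size = 1918.35 MiB
load_tensors: CUDA0 model buffer size = 1488.14 MiB
llama_kv_cache: CPU KV buffer size = 32.00 MiB
llama_kv_cache: CUDA0 KV buffer size = 416.00 MiB
llama_context: CUDA0 compute buffer size = 588.73 MiB
llama_context: CUDA_Host compute buffer size = 14.01 MiB
"""

if __name__ == "__main__":
    used = gpu_buffer_total_mib(LOG)
    free_before = 4096 - 681  # total minus usage from the nvidia-smi output earlier
    # prints: GPU buffers: 2492.87 MiB of about 3415 MiB free VRAM
    print(f"GPU buffers: {used:.2f} MiB of about {free_before} MiB free VRAM")
```

So the weights, the KV cache, and the compute buffer all compete for the same VRAM, which helps explain why only 26 of the 29 layers were offloaded.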

Now, let’s start using the AI models and explore them more. But that will be for some other day.

I hope you like the journey. Please share your valuable feedback. 😊😊😊