DecipherMiddleware

Installing AI Models on Localhost

· 1059 words · 5 minutes to read · Pranav Davar
Categories: AI
Tags: AI ollama wsl

Everything works on my Localhost

One day, I started wondering: is it possible to run AI models on the normal system where I play games? I recalled that I have an NVIDIA graphics card in my desktop, so I started digging into what I could do. I had so many doubts. Will the desktop be able to handle it? If yes, where do I start? Is it going to crash the system?

Generated By AI (ChatGPT)

So many questions. Let’s ask gen AI (ChatGPT, Gemini, etc.).

When I put in the query, the answers assumed huge servers and configurations. No! No!… But do I have that much computing power? I need answers that fit my existing computing resources.

First things first, I grabbed a piece of paper. Oh wait! I know that works, but in today’s digital world, VS Code is the new notepad :P. I gathered a few pieces of information.

Desktop Configurations

Component    Configuration
CPU          Intel Core i5-11400 (11th Generation)
RAM          32 GB
SSD          Yes
GPU          MSI GeForce GTX 1650 VENTUS XS OC (NVIDIA)
OS           Windows 11

The configuration looks decent to me. Will it be able to run any LLMs? The answer is YES. Let’s give it a try.


Pre-requisites

1. Update Windows

  • Update the operating system. I am using Windows, so I ran Windows Update.

2. Update GPU Drivers

Run the command below in cmd or PowerShell to check whether the NVIDIA driver is installed and which CUDA version it supports.

nvidia-smi

Output will look like this.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 591.86                 Driver Version: 591.86         CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1650      WDDM  |   00000000:01:00.0  On |                  N/A |
| 40%   36C    P8             15W /   90W |     681MiB /   4096MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            1816    C+G   ...8bbwe\PhoneExperienceHost.exe      N/A      |
|                                                                                         |
+-----------------------------------------------------------------------------------------+
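If the full table is more than you need, nvidia-smi’s query flags can pull out just the fields that matter here. A minimal sketch (the field list is only an example):

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

The output is a one-line CSV with the GPU name, driver version, and total VRAM.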

3. WSL Installation

Since I am running Windows, I need a Linux distribution to run an LLM locally. To do so, I have options such as Docker, a virtual machine, or WSL. To take advantage of the GPU in a virtualized environment, WSL is the best option.

  1. Check the WSL distros currently installed on the system.
wsl --list --verbose
  2. Install Ubuntu 22.04 using WSL 2.
wsl --install -d Ubuntu-22.04
  3. After the distro downloads, it will ask for a username and password. Provide them; they will be used to log in.

Check installation

wsl --list --verbose
  NAME            STATE           VERSION
* Ubuntu-22.04    Stopped         2
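The VERSION column must show 2 for GPU support inside WSL. If an existing distro shows version 1, it can be converted (a side note; not needed here, since the listing already shows 2):

wsl --set-version Ubuntu-22.04 2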

Moving the installation to another directory

  • Export the installation to a tar file so it can be moved to another directory.
wsl --export Ubuntu-22.04 D:\AI\ubuntu.tar
  • Unregister the current installation from the list of installed distros.
wsl --unregister Ubuntu-22.04
  • Import to another location.
wsl --import Ubuntu-22.04 D:\AI\ubuntu D:\AI\ubuntu.tar --version 2
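To confirm the move, list the distros again; Ubuntu-22.04 should reappear, now backed by the new directory, and the exported ubuntu.tar can be deleted afterwards.

wsl --list --verbose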

  4. Log in to Ubuntu using PowerShell.
wsl -d Ubuntu-22.04
  5. Update the default user (after an import, WSL logs you in as root by default). The resulting /etc/wsl.conf is shown after this list.
echo -e "[user]\ndefault=your_username" | sudo tee /etc/wsl.conf
  6. Restart WSL.
wsl --shutdown
  7. Log in to Ubuntu again using PowerShell.
wsl -d Ubuntu-22.04
  8. Check that the drivers are visible inside WSL/Ubuntu. Run the command below in the same window.
nvidia-smi
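For reference, after step 5 the /etc/wsl.conf should contain just the user section (replace your_username with the account created during installation):

[user]
default=your_username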

Okay! Now the desktop is ready to install and run LLMs. But which one should I run? How do I interact with an LLM? How do I manage LLMs locally?

Install Ollama

Let’s use Ollama 😎. Docs: https://docs.ollama.com/

  1. Run the command below to install Ollama.
curl -fsSL https://ollama.com/install.sh | sh
  2. Verify the installation.
ollama --version
  3. Create a directory to store Ollama models, and add an OLLAMA_MODELS environment variable in ~/.bashrc pointing to the new location.
mkdir <path-to-store-ollama-models>
nano ~/.bashrc
export OLLAMA_MODELS=<path-to-store-ollama-models>
source ~/.bashrc
## Validate that the environment variable is set properly
echo $OLLAMA_MODELS
  4. Start Ollama.
ollama serve
  5. Ollama is now running in the foreground. In another terminal, verify that it is reachable:
ollama -v

ollama version is 0.15.2
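Since ollama serve keeps the first terminal busy, one option is to run it in the background instead (a minimal sketch; the log path is only an example):

nohup ollama serve > ~/ollama.log 2>&1 &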

I am done with the hard part 😎😎😎.

  6. Now, let’s download an AI model. To be on the safe side, I started small by downloading a vector embedding model (nomic-embed-text).
ollama pull nomic-embed-text
  7. Test the downloaded model.
curl --location 'http://localhost:11434/api/embeddings' \
--header 'Content-Type: application/json' \
--data '{
    "model": "nomic-embed-text",
    "prompt": "deciphermiddleware"
  }'

A vector output is generated. A successful test!!!
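The response is a JSON object with a single embedding array. Assuming jq is installed, a quick sketch for checking the vector length, which should come back as 768 for nomic-embed-text:

curl -s http://localhost:11434/api/embeddings \
--header 'Content-Type: application/json' \
--data '{"model": "nomic-embed-text", "prompt": "deciphermiddleware"}' | jq '.embedding | length'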

How much of an LLM can be offloaded to the GPU depends heavily on VRAM. Since the GTX 1650 has only 4 GB of VRAM, large models cannot run entirely on the GPU, so the layers are shared between CPU and GPU. Let me try a 3B-parameter model, llama3.2.

ollama pull llama3.2
ollama run llama3.2

Output

>>> hi
How can I assist you today?

DEBUG INFO

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 26 repeating layers to GPU
load_tensors: offloaded 26/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  1918.35 MiB
load_tensors:        CUDA0 model buffer size =  1488.14 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     0.50 MiB
llama_kv_cache:        CPU KV buffer size =    32.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =   416.00 MiB
llama_kv_cache: size =  448.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  224.00 MiB, V (f16):  224.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      CUDA0 compute buffer size =   588.73 MiB
llama_context:  CUDA_Host compute buffer size =    14.01 MiB
llama_context: graph nodes  = 875
llama_context: graph splits = 29 (with bs=512), 3 (with bs=1)
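The debug info above shows the CPU/GPU split in action: 26 of 29 layers were offloaded to the GPU, using roughly 2.4 GiB of the 4 GiB of VRAM across the model buffer, KV cache, and compute buffer. The same local API used for the embedding test also serves text generation, so llama3.2 can be queried over HTTP instead of the interactive ollama run session. A minimal sketch (the prompt is only an example; with stream set to false the reply arrives as a single JSON object):

curl --location 'http://localhost:11434/api/generate' \
--header 'Content-Type: application/json' \
--data '{
    "model": "llama3.2",
    "prompt": "Explain WSL in one sentence.",
    "stream": false
  }'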


Now, let’s start using the AI models and explore them more. But that will be for some other day.

I hope you liked the journey. Please share your valuable feedback. 😊😊😊

