Most people use AI the same way:
- Open ChatGPT, Claude, or Gemini.
- Ask a question.
- Get an answer.
- Close the tab.
I did that too—until I started spending more time on cybersecurity research, vulnerability analysis, and application security. At some point, it felt strange researching privacy and security while sending chunks of my work to infrastructure I didn’t control.
So I decided to build my own.
…
No API keys. No subscriptions. No cloud processing. Just a Ryzen 7 7435HS, 16GB RAM, an RTX 3050, and a growing curiosity about what actually happens behind the chatbot interface.
The goal was simple: create a local AI research assistant that could search my notes, help with security research, and keep everything on my machine. What I didn’t expect was that building it would teach me more about AI, LLMs, RAG, agents, and AI security than years of simply using them ever could.
Here’s what I built and what I learned.
First: What “Building Your Own AI” Actually Means
Let’s kill the ambiguity before we go any further. There are three distinct tiers to this:
- Level 1 — Running an existing model locally: You download a pre-trained model and run inference on your own hardware. You didn’t create the intelligence; you stopped renting it. This is what most developers mean when they say “I built my own AI.” It’s also what I did. No shame in it.
- Level 2 — Customizing a foundation model: You take a pre-trained model and augment it with your own data — notes, research, codebases, documents. This is RAG (Retrieval-Augmented Generation) or fine-tuning territory. Most startups building “AI assistants” operate here. They did not build a model; they adapted one. There’s a difference, and the marketing conveniently forgets it.
- Level 3 — Training from scratch: OpenAI, Anthropic, Google, Meta. Thousands of GPUs, millions of dollars, specialized research teams. Not realistic for individuals.
I’m at Level 1 moving toward Level 2. That’s the practical sweet spot for anyone doing real work without a datacenter.
How LLMs Actually Work (The Part Nobody Explains Clearly)
LLMs are prediction systems. They don’t think. They predict the statistically most likely next token given everything in their context window.
Input: "The capital of France is"
Output: "Paris"
Do that billions of times across a massive training corpus, and complex, reasoning-like behavior emerges. That’s genuinely it. No magic. No ghost in the machine. Just very expensive statistics that happen to be incredibly useful.
Understanding this matters for security work specifically:
- Prompt Injection: The model cannot fundamentally distinguish between your system instructions and an attacker’s instructions injected through retrieved content.
- Hallucinations: The model is predicting probable text, not retrieving ground truth.
- Verification: You should never blindly trust model output on anything critical without verification.
⚠️ The core paradox of LLMs: The model is confident. The model is sometimes wrong. These two facts coexist comfortably and will continue to cause problems for everyone who forgets them.
Why Local? The OPSEC Argument
For general-purpose use, cloud AI is fine. For security research, the calculus is completely different. When you use a cloud model:
- Your prompts leave your machine.
- Sensitive context — CVEs you’re analyzing, vulnerability writeups, client infrastructure details — passes through third-party infrastructure.
- You have zero visibility into what’s logged, retained, or used for training.
If you’re doing bug bounty, AppSec auditing, or anything involving non-public vulnerability data, feeding that into a cloud API is an OPSEC problem. Full stop.
Local solves this. Your data stays on your hardware. No terms of service to audit. No compliance risk from third-party data processing. No API costs. It works completely offline, allowing unlimited experimentation without watching a token counter.
The tradeoff? Setup takes a weekend and your hardware has strict limits. Still worth it.
The Tool: Ollama
Ollama is the easiest on-ramp to local AI. Think of it as Docker for language models — it handles downloads, quantization, GPU acceleration, and exposes a clean REST API at http://localhost:11434.
Installation & Setup
# Mac / Linux installation
curl -fsSL https://ollama.sh | sh
# Pull and run a model
ollama run qwen3:8b
That’s it. The model downloads, loads into memory, and the API goes live.
What’s happening under the hood:
Download Model Weights ──> Load Into RAM/VRAM ──> Tokenize Input ──> Transformer Inference ──> Local REST API
Ollama is a model manager, a runtime, and an API server all in one. The intelligence is in the model weights; Ollama is just the plumbing that makes those weights usable without a PhD in infrastructure.
Handy Ollama CLI Commands
ollama list # See installed models
ollama pull qwen3:8b # Download a specific model
ollama rm llama3 # Remove an unwanted model
ollama ps # See what models are currently loaded in memory
Picking the Right Model for Your Hardware
Not all models fit on all machines. Here is an honest breakdown of the hardware requirements:
| RAM / VRAM | Recommended Model | Experience Notes |
|---|---|---|
| 8GB | Gemma 4 4B or Phi-4 Mini | Fits cleanly, decent quality, highly efficient |
| 16GB | Qwen3 8B or DeepSeek R1 Distilled 8B | The Sweet Spot. Fast and highly capable |
| 32GB+ | DeepSeek R1 14B–32B | High-level technical reasoning, heavily data-intensive |
My daily driver: Qwen3 8B. It provides strong technical reasoning, handles code exceptionally well, is Apache 2.0 licensed, and runs cleanly on my laptop without fighting for VRAM.
The 2026 Open Model Landscape
The open-model ecosystem moves fast. Here’s where things actually stand right now:
- Qwen3 (Alibaba): Currently one of the strongest all-around open families. Apache 2.0, excellent multilingual, and top-tier technical performance. The 4B and 8B variants run wonderfully on consumer hardware. This is my default recommendation.
- DeepSeek V4 & R1: DeepSeek changed the industry when R1 dropped. MIT licensed, it matches frontier models on math and code benchmarks. The distilled 8B and 14B variants bring heavy-duty reasoning right to consumer hardware.
- Kimi K2.5: Excellent tool usage and designed specifically for agentic workflows. Worth watching if you’re building automated agents rather than just doing Q&A.
- GLM-5.1: Focused on advanced reasoning and agentic tasks. Particularly impressive for structured technical work.
- Llama 4 Scout (Meta): The first Mixture of Experts (MoE) architecture in the Llama family, sporting a massive 10-million token context window. However, it requires ~55GB of VRAM, making it impractical for standard laptops.
- Gemma 4 (Google): Highly efficient and accessible. This is your best option if you need capable reasoning on strictly constrained hardware resources.
💡 The honest takeaway on hardware: 8GB of VRAM was borderline a couple of years ago. It’s cramped now. 12GB is the modern floor for serious local work, while 16GB gives you room to actually experiment.
The Problem With Raw Models: They Don’t Know Your Stuff
A stock model only knows what it was trained on. It doesn’t know your security notes, your project internals, your custom vulnerability writeups, or yesterday’s newly disclosed CVEs.
That gap is the main limitation of Level 1. The fix is RAG.
RAG: Teaching Your AI Your Own Knowledge
RAG stands for Retrieval-Augmented Generation. The concept is simpler than the name suggests:
User Asks Question
│
▼
Search Vector Database (ChromaDB)
│
▼
Retrieve Relevant Document Chunks
│
▼
Inject Context into System Prompt (Question + Source Chunks)
│
▼
Local LLM Generates Grounded Answer
A Practical Local RAG Implementation in Python
Here is how you can spin up a local RAG pipeline using LangChain and Ollama.
pip install chromadb langchain langchain-community langchain-ollama
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
# 1. Your proprietary knowledge base
your_docs = [
"BOLA (Broken Object Level Authorization) occurs when an API doesn't verify the requesting user has permission to access the specific object. Most common API vulnerability in 2026.",
"JWT tokens must be verified server-side. Common mistakes: not checking the signature algorithm, skipping expiry validation, or accepting 'none' as a valid algorithm.",
"Django DEBUG=True in production exposes detailed stack traces, environment variables, and raw database queries to anyone who triggers an internal server error."
]
# 2. Split text into digestible chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.create_documents(your_docs)
# 3. Initialize local embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="qwen3:8b")
vector_store = Chroma.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})
# 4. Connect to local LLM
llm = ChatOllama(model="qwen3:8b", temperature=0)
# 5. Define the RAG prompt system
system_prompt = (
"You are a security research assistant. Use the following pieces of retrieved context "
"to answer the question. If you don't know the answer, say that you don't know.nn"
"Context:n{context}"
)
prompt = ChatPromptTemplate.from_messages([
("system", system_prompt),
("human", "{input}"),
])
# 6. Create and execute the RAG chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
response = rag_chain.invoke({"input": "What is BOLA and why is it dangerous?"})
print(response["answer"])
Point this script toward your local Markdown folders, OWASP PDFs, PortSwigger writeups, or disclosed HackerOne reports, and you instantly have a local research assistant that knows your actual data.
RAG vs. Fine-Tuning: The Classic Misconception
Beginners almost always assume they need to fine-tune a model to teach it new information. Usually, they are wrong.
┌───────────────────────────────────────┬───────────────────────────────────────┐
│ USE RAG WHEN │ USE FINE-TUNING WHEN │
├───────────────────────────────────────┼───────────────────────────────────────┤
│ • Knowledge changes frequently │ • You need style/tone behavioral shifts│
│ (new CVEs, fresh writeups) │ • You need strict output formatting │
│ • You need explicit source citations │ • You want deep task specialization │
│ • You want fast, zero-cost iteration │ • You have a vast, clean dataset │
└───────────────────────────────────────┴───────────────────────────────────────┘
The play: Always start with RAG. Fine-tune only if RAG fails to meet your structural formatting needs after extensive testing.
The Next Evolution: MCP (Model Context Protocol)
RAG gives a model knowledge. MCP gives a model tools.
Model Context Protocol allows local LLMs to safely step outside their sandbox and interact natively with systems:
┌──> GitHub Repository
├──> Live CVE Databases
User ──> Agent ──> Tools ──> ├──> Burp Suite Reports
├──> Local Filesystem
└──> System Logs
A chatbot answers questions; an agent completes tasks. Imagine an automated workflow: Find latest Django CVEs ──> read advisories ──> compare against requirements.txt ──> generate a remediation report ──> open a local GitHub issue. That’s the power of tool integration.
The Security Risks Inside Your Own Local Tools
Running AI locally protects your data from leaving your machine, but it shifts the application attack surface. Prompt injection isn’t theoretical—it’s cataloged under real CVEs (e.g., CVE-2025-53773 in GitHub Copilot allowing remote code execution).
When building local RAG and agent architectures, you must defend against:
- Indirect Prompt Injection: If your RAG system ingests external data (scraped web pages, untrusted security feeds, public PDFs), an attacker can embed malicious instructions inside those sources. When the data is retrieved, it hijacks your model’s execution flow.
- Data Poisoning: Malicious or inaccurate documents dropped into your local directory will poison your vector database embeddings, leading to engineered hallucinations.
- System Prompt Leakage: If your application logic depends on sensitive instructions or hardcoded parameters in the backend prompt configuration, assume they can be extracted by a clever prompt payload.
- Blind Trust in Output: The model will give completely inaccurate code analysis or flawed exploit interpretations with absolute confidence. Always cross-reference AI-generated analysis with canonical sources like the NVD (National Vulnerability Database).
🛡️ The Defense: Apply least-privilege principles to your agent’s tools, sandbox execution environments, sanitize inputs, and treat every retrieved document chunk as potentially untrusted adversarial input.
Honest Downsides of Going Local
- The Hardware Ceiling: 16GB of RAM gets you to an 8B parameter model cleanly. DeepSeek R1 reasoning models at 14B or 32B require 24GB+ VRAM. If you want to run heavy models, you’ll eventually need to expand your hardware or securely lease a decentralized bare-metal GPU (like RunPod or Vast.ai).
- The Capability Gap: While the open-source gap closed drastically, frontier cloud models still hold an edge in multi-step complex synthesis across thousands of asynchronous documents.
- Engineering Overhead: Building robust, stateful RAG pipelines with proper chunking, accurate embeddings, and tool orchestration takes real development time. Budget a weekend for setup, not an hour.
Why This Matters Beyond Just Privacy
Building a local AI system isn’t about outperforming a trillion-dollar tech giant on a standard benchmark.
You cannot thoroughly audit AI-integrated applications if you treat the model as a black box. You cannot effectively reason about prompt injection vectors in an enterprise system if you have never engineered a document pipeline from scratch.
A few years ago, understanding the TCP/IP stack separated master engineers from beginners. Today, understanding LLM inference, embedding vectors, context windows, and tool integration protocols is becoming the new dividing line.
To paraphrase Richard Feynman: There is a profound difference between knowing the name of something and knowing the thing itself.
Build the thing. Then break it. Then secure it. That’s where the real learning starts.
Let’s Connect!
I write about API security, backend systems, and building tools from scratch.
- 💬 Drop your thoughts or local setups in the comments below!
- 🛠️ What models are you running on your local rig?