Appendix II: Running Local LLMs for Coding

Warning

To-Do: This section needs reviewing and testing on multiple systems. Help us via GitHub issues and pull requests.

Why run models locally?

Running AI models on your own machine offers significant advantages for research computing:

Privacy: Your code never leaves your machine
Offline capability: Works without internet connection
No API costs: After initial setup, inference is free
Full control: Choose models, customize behavior, no vendor lock-in
Compliance: May help meet institutional data handling requirements

The trade-off is reduced capability compared to cloud models and hardware requirements. This appendix helps you set up a practical local coding assistant.

Overview: The local AI coding stack

A local AI coding setup has three components:

+------------------+     +------------------+     +------------------+
|   Your Editor    |     |   Model Runner   |     |   Local Model    |
|   (VS Code)      | --> |   (Ollama)       | --> |   (Qwen2.5-Coder)|
|                  |     |                  |     |                  |
|   + Extension    |     |   Serves model   |     |   Runs on your   |
|   (Continue)     |     |   via HTTP API   |     |   CPU/GPU        |
+------------------+     +------------------+     +------------------+

Recommended stack for most users:

Editor: VS Code
Extension: Continue (open source, feature-rich)
Model runner: Ollama (easy installation, good performance)
Model: Qwen2.5-Coder (best benchmark performance for size)

Hardware requirements

Local LLMs require significant resources. Here’s what to expect:

Minimum requirements

Component	Minimum	Recommended
RAM	8 GB	16+ GB
Storage	10 GB free	50+ GB free
CPU	Modern multi-core	Apple Silicon / Recent Intel/AMD

GPU acceleration (optional but recommended)

GPU Type	VRAM	What you can run
None (CPU only)	-	1-3B models, slow
6-8 GB	NVIDIA/AMD	7B models comfortably
12-16 GB	NVIDIA/AMD	14B models, some 32B quantized
24+ GB	NVIDIA/AMD	32B+ models
Apple Silicon	Unified	7-14B models well, larger with patience

Apple Silicon note

Apple Silicon Macs (M1/M2/M3/M4) work well for local LLMs because they use unified memory—the same RAM serves both CPU and GPU. A Mac with 16GB unified memory can comfortably run 7B models and handle 14B models reasonably well.

Step 1: Install Ollama

Ollama is the easiest way to run local models. It handles downloading, quantization, and serving models via a simple API.

macOS

# Option 1: Download from website
# Visit https://ollama.ai and download the app

# Option 2: Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

Download the installer from ollama.ai.

Verify installation

# Start Ollama (may start automatically)
ollama serve

# In another terminal, test it
ollama --version

Step 2: Download a coding model

Recommended models for coding

Based on current benchmarks (2025), here are the best local coding models:

Model	Size	Best for	Command
qwen2.5-coder:7b	~4.5 GB	Best balance of quality/speed	`ollama pull qwen2.5-coder:7b`
qwen2.5-coder:1.5b	~1 GB	Fast autocomplete	`ollama pull qwen2.5-coder:1.5b`
qwen2.5-coder:14b	~9 GB	Higher quality, slower	`ollama pull qwen2.5-coder:14b`
deepseek-coder:6.7b	~4 GB	Good alternative	`ollama pull deepseek-coder:6.7b`
codellama:7b	~4 GB	Meta’s coding model	`ollama pull codellama:7b`
starcoder2:3b	~2 GB	Lightweight, good for autocomplete	`ollama pull starcoder2:3b`

Model recommendation: Start with Qwen2.5-Coder

Qwen2.5-Coder currently leads open-source coding benchmarks, scoring 88.4% on HumanEval with the 7B model—higher than GPT-4’s 87.1%. It significantly outperforms alternatives like StarCoder2 and DeepSeek-Coder at similar sizes.

For most users, start with qwen2.5-coder:7b for chat/assistance and qwen2.5-coder:1.5b for fast autocomplete.

Download your chosen model

# Download the recommended model (takes a few minutes)
ollama pull qwen2.5-coder:7b

# Optionally, also get a small model for autocomplete
ollama pull qwen2.5-coder:1.5b

Test the model

# Interactive test
ollama run qwen2.5-coder:7b "Write a Python function to calculate fibonacci numbers"

If this works, your model is ready.

Step 3: Install Continue extension in VS Code

Continue is an open-source AI coding assistant that integrates with VS Code and JetBrains IDEs. It supports local models via Ollama.

Install the extension

Open VS Code
Go to Extensions (Ctrl/Cmd + Shift + X)
Search for “Continue”
Click Install on “Continue - Codestral, Claude, and more”

Alternatively, from the command line:

code --install-extension Continue.continue

Initial setup

After installation:

Continue will appear in the sidebar (look for the Continue icon)
Click on it to open the Continue panel
It will prompt you to configure a model

Step 4: Configure Continue for Ollama

Continue uses a config.json file for configuration. Open it via:

Open Command Palette (Ctrl/Cmd + Shift + P)
Type “Continue: Open Config”
Select it to open the configuration file

Basic configuration

Replace the contents with this configuration:

{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 1.5B (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  },
  "tabAutocompleteOptions": {
    "multilineCompletions": "auto"
  }
}

Configuration explained

Setting	Purpose
`models`	Models available for chat (Ctrl/Cmd + L)
`tabAutocompleteModel`	Model used for inline code completion
`tabAutocompleteOptions`	Settings for autocomplete behavior

Multiple models configuration

You can configure multiple models and switch between them:

{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    },
    {
      "title": "Qwen2.5 Coder 14B (Slower)",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b"
    },
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Fast Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}

Step 5: Using Continue

Chat with your code (Ctrl/Cmd + L)

Select some code in your editor
Press Ctrl/Cmd + L
Type your question: “Explain this code” or “Add error handling”
The response appears in the Continue panel

Inline editing (Ctrl/Cmd + I)

Select code you want to modify
Press Ctrl/Cmd + I
Describe the change: “Add type hints” or “Convert to async”
Review and accept/reject the changes

Tab autocomplete

Once configured, you’ll see ghost text suggestions as you type. Press Tab to accept, or keep typing to ignore.

First-run slowness

Local models are slow on the first prompt while they load into memory. After the initial “warm-up,” responses are much faster. Don’t judge performance until after the first response.

Troubleshooting

“Connection refused” or model not responding

Make sure Ollama is running:
```
ollama serve
```
Check that the model is downloaded:
```
ollama list
```
Test the model directly:
```
ollama run qwen2.5-coder:7b "Hello"
```

Autocomplete not appearing

Check VS Code settings: editor.inlineSuggest.enabled must be true
Disable other completion extensions (GitHub Copilot, etc.) that might conflict
Ensure tabAutocompleteModel is configured in Continue config

Slow responses

Use a smaller model (1.5b or 3b) for autocomplete
Ensure you have enough RAM (close other applications)
On NVIDIA GPUs, verify CUDA is being used (check Ollama logs)
Consider using quantized models (Ollama does this automatically)

Out of memory errors

Use a smaller model
Close memory-intensive applications
On systems with dedicated GPU, ensure model fits in VRAM

Alternative setups

Cline (agentic alternative to Continue)

Cline is a VS Code extension that offers more autonomous capabilities—it can edit files and run commands, not just suggest code.

Install from VS Code marketplace, then configure for Ollama:

Open Cline settings
Select Ollama as provider
Choose your model

Cline is more powerful but also higher risk (it can modify files). Use with appropriate caution.

LM Studio (GUI alternative to Ollama)

LM Studio provides a graphical interface for running local models. It’s easier for beginners but less scriptable.

Download from lmstudio.ai
Search for and download coding models
Start the local server
Configure Continue to use the LM Studio endpoint:

{
  "models": [
    {
      "title": "LM Studio Model",
      "provider": "openai",
      "model": "local-model",
      "apiBase": "http://localhost:1234/v1"
    }
  ]
}

llama.cpp (advanced)

For maximum control and efficiency, you can run llama.cpp directly. This is more complex but offers the best performance tuning.

# Clone and build
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make

# Download a model (GGUF format) from HuggingFace
# Then run the server
./llama-server -m path/to/model.gguf --port 8080

Configure Continue to use it:

{
  "models": [
    {
      "title": "llama.cpp Model",
      "provider": "openai",
      "model": "local",
      "apiBase": "http://localhost:8080/v1"
    }
  ]
}

Comparison: Local vs. cloud models

Aspect	Local (Ollama + Qwen2.5)	Cloud (GPT-4, Claude)
Privacy	Complete	Code sent to servers
Cost	Free after setup	Per-token charges
Quality	Good (88% HumanEval)	Better (90%+ HumanEval)
Speed	Depends on hardware	Generally faster
Offline	Yes	No
Setup	Required	None
Context window	Model-dependent (4K-32K typical)	Often larger (128K+)

When to use local vs. cloud

Use local models when:

Working with sensitive/proprietary code
Need offline capability
Want to avoid API costs
Institutional policy restricts cloud AI

Use cloud models when:

Maximum quality is critical
Working with very large codebases (need big context)
Hardware is insufficient for local models
Quick tasks where setup time isn’t justified

Model recommendations by use case

Use case	Recommended model	Why
General coding chat	qwen2.5-coder:7b	Best quality/speed balance
Fast autocomplete	qwen2.5-coder:1.5b	Snappy tab completion
Complex problems	qwen2.5-coder:14b or 32b	Higher quality, needs more RAM
Limited hardware	starcoder2:3b	Runs on minimal resources
Research/multilingual	deepseek-coder:6.7b	Good multilingual support

Security considerations

Warning

Running local models is safer from a data privacy perspective, but introduces other considerations:

Model provenance: Only download models from trusted sources (Ollama library, HuggingFace official repos)
System resources: Models can consume significant CPU/GPU/RAM
Network exposure: By default, Ollama only listens on localhost. Don’t expose it to the network without authentication
Model output: Local models can still generate insecure code—verification practices from the main course still apply

Further resources

Official documentation

Tutorials and guides

Model benchmarks and comparisons

Keypoints

Local AI coding requires: model runner (Ollama) + IDE extension (Continue) + model (Qwen2.5-Coder)
Qwen2.5-Coder currently leads open-source coding benchmarks
Use smaller models (1.5b-3b) for autocomplete, larger (7b+) for chat
First response is slow while model loads; subsequent responses are faster
Local models trade some quality for privacy, offline capability, and zero API costs