Appendix II: Running Local LLMs for Coding

Warning

To-Do: This section needs reviewing and testing on multiple systems. Help us via GitHub issues and pull requests.

Why run models locally?

Running AI models on your own machine offers significant advantages for research computing:

  • Privacy: Your code never leaves your machine

  • Offline capability: Works without internet connection

  • No API costs: After initial setup, inference is free

  • Full control: Choose models, customize behavior, no vendor lock-in

  • Compliance: May help meet institutional data handling requirements

The trade-offs are reduced capability compared to cloud models and nontrivial hardware requirements. This appendix helps you set up a practical local coding assistant.

Overview: The local AI coding stack

A local AI coding setup has three components:

+------------------+     +------------------+     +------------------+
|   Your Editor    |     |   Model Runner   |     |   Local Model    |
|   (VS Code)      | --> |   (Ollama)       | --> |   (Qwen2.5-Coder)|
|                  |     |                  |     |                  |
|   + Extension    |     |   Serves model   |     |   Runs on your   |
|   (Continue)     |     |   via HTTP API   |     |   CPU/GPU        |
+------------------+     +------------------+     +------------------+

Recommended stack for most users:

  • Editor: VS Code

  • Extension: Continue (open source, feature-rich)

  • Model runner: Ollama (easy installation, good performance)

  • Model: Qwen2.5-Coder (best benchmark performance for size)

Hardware requirements

Local LLMs require significant resources. Here’s what to expect:

Minimum requirements

| Component | Minimum           | Recommended                      |
|-----------|-------------------|----------------------------------|
| RAM       | 8 GB              | 16+ GB                           |
| Storage   | 10 GB free        | 50+ GB free                      |
| CPU       | Modern multi-core | Apple Silicon / recent Intel/AMD |
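
A rough way to sanity-check these numbers: a 4-bit quantized model (Ollama's default) needs about half a gigabyte of RAM per billion parameters for the weights, plus some allowance for the context cache and runtime. The sketch below uses that rule of thumb; the overhead figure is a loose assumption, not a measurement.

```python
def approx_ram_gb(params_billion, bits_per_weight=4, overhead_gb=1.5):
    """Rough RAM estimate for a quantized model.

    Weights take params * bits/8 bytes; overhead_gb is a loose
    allowance for the KV cache and runtime (an assumption).
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Why a 7B model fits comfortably in 8 GB but a 14B model is a squeeze:
for size in (1.5, 7, 14, 32):
    print(f"{size:>4}B at 4-bit: ~{approx_ram_gb(size):.1f} GB")
```

By this estimate a 7B model wants roughly 5 GB and a 32B model well over 16 GB, which is why the larger variants need the "recommended" tier above.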

Step 1: Install Ollama

Ollama is the easiest way to run local models. It handles downloading, quantization, and serving models via a simple API.

macOS

# Option 1: Download from website
# Visit https://ollama.ai and download the app

# Option 2: Homebrew
brew install ollama

Linux

curl -fsSL https://ollama.ai/install.sh | sh

Windows

Download the installer from ollama.ai.

Verify installation

# Start the Ollama server (it may already be running in the background)
ollama serve

# In another terminal, confirm the install and that the server responds
ollama --version
ollama list

Step 2: Download a coding model

Download your chosen model

# Download the recommended model (takes a few minutes)
ollama pull qwen2.5-coder:7b

# Optionally, also get a small model for autocomplete
ollama pull qwen2.5-coder:1.5b

Test the model

# One-off test from the shell
ollama run qwen2.5-coder:7b "Write a Python function to calculate fibonacci numbers"

If this works, your model is ready.
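
Under the hood, ollama run talks to a local HTTP server, and editor extensions use the same API. You can probe it directly with nothing but the Python standard library. A minimal sketch, assuming Ollama is running on its default port 11434:

```python
import json
import urllib.request

# The same kind of request an editor extension sends to Ollama.
# Assumes `ollama serve` is running on the default port 11434.
payload = {
    "model": "qwen2.5-coder:7b",
    "prompt": "Write a Python function to calculate fibonacci numbers",
    "stream": False,  # one JSON reply instead of a token stream
}

def generate(payload, url="http://localhost:11434/api/generate"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Uncomment with Ollama running:
# print(generate(payload))
```

If this request succeeds, any tool that speaks Ollama's API will work too.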

Step 3: Install Continue extension in VS Code

Continue is an open-source AI coding assistant that integrates with VS Code and JetBrains IDEs. It supports local models via Ollama.

Install the extension

  1. Open VS Code

  2. Go to Extensions (Ctrl/Cmd + Shift + X)

  3. Search for “Continue”

  4. Click Install on “Continue - Codestral, Claude, and more”

Alternatively, from the command line:

code --install-extension Continue.continue

Initial setup

After installation:

  1. Continue will appear in the sidebar (look for the Continue icon)

  2. Click on it to open the Continue panel

  3. It will prompt you to configure a model

Step 4: Configure Continue for Ollama

Continue uses a config.json file for configuration. Open it via:

  1. Open Command Palette (Ctrl/Cmd + Shift + P)

  2. Type “Continue: Open Config”

  3. Select it to open the configuration file

Basic configuration

Replace the contents with this configuration:

{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder 1.5B (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  },
  "tabAutocompleteOptions": {
    "multilineCompletions": "auto"
  }
}

Configuration explained

| Setting                | Purpose                                  |
|------------------------|------------------------------------------|
| models                 | Models available for chat (Ctrl/Cmd + L) |
| tabAutocompleteModel   | Model used for inline code completion    |
| tabAutocompleteOptions | Settings for autocomplete behavior       |

Multiple models configuration

You can configure multiple models and switch between them:

{
  "models": [
    {
      "title": "Qwen2.5 Coder 7B",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    },
    {
      "title": "Qwen2.5 Coder 14B (Slower)",
      "provider": "ollama",
      "model": "qwen2.5-coder:14b"
    },
    {
      "title": "DeepSeek Coder",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Fast Autocomplete",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  }
}

Step 5: Using Continue

Chat with your code (Ctrl/Cmd + L)

  1. Select some code in your editor

  2. Press Ctrl/Cmd + L

  3. Type your question: “Explain this code” or “Add error handling”

  4. The response appears in the Continue panel

Inline editing (Ctrl/Cmd + I)

  1. Select code you want to modify

  2. Press Ctrl/Cmd + I

  3. Describe the change: “Add type hints” or “Convert to async”

  4. Review and accept/reject the changes

Tab autocomplete

Once configured, you’ll see ghost text suggestions as you type. Press Tab to accept, or keep typing to ignore.

First-run slowness

Local models are slow on the first prompt while they load into memory. After the initial “warm-up,” responses are much faster. Don’t judge performance until after the first response.
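
You can hide the warm-up by preloading the model before you start working. Ollama's generate endpoint treats an empty prompt as "load the model but generate nothing", and its keep_alive field controls how long the weights stay in memory (the default is around five minutes). A small sketch, assuming the default port:

```python
import json
import urllib.request

# "Warm up" a model before you need it: an empty prompt makes Ollama
# load the weights without generating, and keep_alive asks it to stay
# resident for 30 minutes instead of the default ~5.
payload = {"model": "qwen2.5-coder:7b", "prompt": "", "keep_alive": "30m"}

def preload(payload, url="http://localhost:11434/api/generate"):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()

# preload(payload)  # run once after `ollama serve` starts
```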

Troubleshooting

“Connection refused” or model not responding

  1. Make sure Ollama is running:

    ollama serve
    
  2. Check that the model is downloaded:

    ollama list
    
  3. Test the model directly:

    ollama run qwen2.5-coder:7b "Hello"
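
If those checks pass but your editor still cannot connect, probe the HTTP endpoint the extension actually talks to. Ollama's /api/tags endpoint lists the downloaded models; a sketch using only the standard library, assuming the default port:

```python
import json
import urllib.request

# /api/tags lists every model Ollama has downloaded. If this call
# fails, the server is not running or is on a non-default port.
def list_models(url="http://localhost:11434/api/tags"):
    with urllib.request.urlopen(url) as resp:
        return [m["name"] for m in json.loads(resp.read())["models"]]

# print(list_models())  # should include 'qwen2.5-coder:7b'
```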
    

Autocomplete not appearing

  1. Check VS Code settings: editor.inlineSuggest.enabled must be true

  2. Disable other completion extensions (GitHub Copilot, etc.) that might conflict

  3. Ensure tabAutocompleteModel is configured in Continue config

Slow responses

  • Use a smaller model (1.5b or 3b) for autocomplete

  • Ensure you have enough RAM (close other applications)

  • On NVIDIA GPUs, verify CUDA is being used (check Ollama logs)

  • Consider using quantized models (Ollama does this automatically)

Out of memory errors

  • Use a smaller model

  • Close memory-intensive applications

  • On systems with dedicated GPU, ensure model fits in VRAM

Alternative setups

Cline (agentic alternative to Continue)

Cline is a VS Code extension that offers more autonomous capabilities—it can edit files and run commands, not just suggest code.

Install from VS Code marketplace, then configure for Ollama:

  1. Open Cline settings

  2. Select Ollama as provider

  3. Choose your model

Cline is more powerful but also higher risk (it can modify files). Use with appropriate caution.

LM Studio (GUI alternative to Ollama)

LM Studio provides a graphical interface for running local models. It’s easier for beginners but less scriptable.

  1. Download from lmstudio.ai

  2. Search for and download coding models

  3. Start the local server

  4. Configure Continue to use the LM Studio endpoint:

{
  "models": [
    {
      "title": "LM Studio Model",
      "provider": "openai",
      "model": "local-model",
      "apiBase": "http://localhost:1234/v1"
    }
  ]
}

llama.cpp (advanced)

For maximum control and efficiency, you can run llama.cpp directly. This is more complex but offers the best performance tuning.

# Clone and build (llama.cpp now uses CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Download a model (GGUF format) from HuggingFace
# Then run the server
./build/bin/llama-server -m path/to/model.gguf --port 8080

Configure Continue to use it:

{
  "models": [
    {
      "title": "llama.cpp Model",
      "provider": "openai",
      "model": "local",
      "apiBase": "http://localhost:8080/v1"
    }
  ]
}
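
Both LM Studio and llama-server expose an OpenAI-compatible API, which is why Continue's "openai" provider works for both once apiBase is overridden. The same client code therefore works against either backend. A sketch of a chat-completions request, assuming the llama.cpp server above on port 8080:

```python
import json
import urllib.request

# llama.cpp's server and LM Studio both speak the OpenAI
# chat-completions protocol; only the base URL differs.
payload = {
    "model": "local",  # llama.cpp largely ignores this field
    "messages": [{"role": "user", "content": "Explain list comprehensions"}],
}

def chat(payload, api_base="http://localhost:8080/v1"):
    req = urllib.request.Request(
        api_base + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Swap api_base to "http://localhost:1234/v1" for LM Studio.
# print(chat(payload))
```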

Comparison: Local vs. cloud models

| Aspect         | Local (Ollama + Qwen2.5)         | Cloud (GPT-4, Claude)   |
|----------------|----------------------------------|-------------------------|
| Privacy        | Complete                         | Code sent to servers    |
| Cost           | Free after setup                 | Per-token charges       |
| Quality        | Good (88% HumanEval)             | Better (90%+ HumanEval) |
| Speed          | Depends on hardware              | Generally faster        |
| Offline        | Yes                              | No                      |
| Setup          | Required                         | None                    |
| Context window | Model-dependent (4K-32K typical) | Often larger (128K+)    |
When to use local vs. cloud

Use local models when:

  • Working with sensitive/proprietary code

  • Need offline capability

  • Want to avoid API costs

  • Institutional policy restricts cloud AI

Use cloud models when:

  • Maximum quality is critical

  • Working with very large codebases (need big context)

  • Hardware is insufficient for local models

  • Quick tasks where setup time isn’t justified

Model recommendations by use case

| Use case              | Recommended model        | Why                            |
|-----------------------|--------------------------|--------------------------------|
| General coding chat   | qwen2.5-coder:7b         | Best quality/speed balance     |
| Fast autocomplete     | qwen2.5-coder:1.5b       | Snappy tab completion          |
| Complex problems      | qwen2.5-coder:14b or 32b | Higher quality, needs more RAM |
| Limited hardware      | starcoder2:3b            | Runs on minimal resources      |
| Research/multilingual | deepseek-coder:6.7b      | Good multilingual support      |

Security considerations

Warning

Running local models is safer from a data privacy perspective, but introduces other considerations:

  1. Model provenance: Only download models from trusted sources (Ollama library, HuggingFace official repos)

  2. System resources: Models can consume significant CPU/GPU/RAM

  3. Network exposure: By default, Ollama only listens on localhost. Don’t expose it to the network without authentication

  4. Model output: Local models can still generate insecure code—verification practices from the main course still apply

Further resources

Official documentation

Tutorials and guides

Model benchmarks and comparisons

Keypoints

  • Local AI coding requires: model runner (Ollama) + IDE extension (Continue) + model (Qwen2.5-Coder)

  • Qwen2.5-Coder currently leads open-source coding benchmarks

  • Use smaller models (1.5b-3b) for autocomplete, larger (7b+) for chat

  • First response is slow while model loads; subsequent responses are faster

  • Local models trade some quality for privacy, offline capability, and zero API costs