Rapid Prototyping LLM UIs with Raspberry Pi: Offline Demos for Stakeholder Buy-in
prototyping · edge-ai · hardware


rreacts
2026-02-09 12:00:00
11 min read

Build privacy-first, offline LLM demos on Raspberry Pi 5 + AI HAT+ 2 and ship stakeholder-ready React prototypes without cloud dependencies.

Ship LLM demos without the cloud, in the room

You're under a tight timeline: stakeholders want to feel the product, not read a doc. But sending sensitive customer examples to a cloud API is a non-starter — and flaky demo Wi‑Fi kills credibility. What if you could run a privacy-preserving LLM demo locally on a pocket-sized device and hand your stakeholders a responsive, production‑looking UI? In 2026, the Raspberry Pi 5 paired with the AI HAT+ 2 makes that feasible. This guide teaches frontend engineers how to prototype an offline, privacy-first React demo app on a Raspberry Pi with AI HAT+ 2 to win buy‑in during product ideation.

Why this matters in 2026

Edge LLMs and hybrid models became mainstream in late 2024–2025. By early 2026 companies expect demos that respect privacy, run disconnected, and demonstrate real UX flow rather than mocked screenshots. Big vendors are moving to hybrid strategies — even Apple and Google announced integrations that signal a shift toward models running both server-side and on-device — so demonstrating offline capability is a strategic advantage when pitching product direction.

The Raspberry Pi 5 + AI HAT+ 2 (announced in 2025) brings dedicated AI acceleration to a low-cost platform, enabling quantized models to run locally for proof-of-concept demos.

What you'll build (in under a week)

  • A hardened Raspberry Pi prototype that runs an LLM locally with the AI HAT+ 2.
  • A small Node.js local API that streams tokens from the model.
  • A React demo app (Vite) that connects to the Pi over the local network and shows streaming text, progressive UX, and privacy indicators.
  • Deployment tips: single-asset offline builds, kiosk mode, and demo scripts to hand to stakeholders.

Prerequisites

  • Raspberry Pi 5 with AI HAT+ 2 attached (or equivalent Pi + NPU HAT).
  • microSD card (32GB+), power supply, optional touchscreen or portable display.
  • Laptop with SSH and USB-C network access to Pi.
  • Basic Node.js and React knowledge (we include code snippets you can copy).

Step 1 — Prepare your Pi (fastest path)

  1. Flash Raspberry Pi OS (64-bit) or a lightweight Ubuntu image with Raspberry Pi 5 support. Use Raspberry Pi Imager or balenaEtcher.
  2. Boot the Pi, update packages and enable SSH:
    sudo apt update && sudo apt upgrade -y
    sudo raspi-config nonint do_ssh 0
  3. Install build essentials and common tooling:
    sudo apt install -y git build-essential cmake python3-venv python3-pip nodejs npm nginx
  4. Follow the AI HAT+ 2 vendor instructions to install drivers and runtime. The HAT typically ships an installer or Debian packages that expose the NPU to runtimes such as llama.cpp builds or vendor-backed inference stacks. After installation, verify the device is visible (a sketch follows below).
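
Exact verification commands depend on the vendor stack, so treat the following as a sketch: it assumes a PCIe-attached NPU and, for the last command, a Hailo-style runtime that ships the hailortcli tool (an assumption; substitute your vendor's CLI if it differs).

# Check that the accelerator enumerates on the PCIe bus (device names vary by vendor)
lspci | grep -i -E 'hailo|npu'
dmesg | grep -i -E 'hailo|npu'

# If the vendor runtime ships a CLI (HailoRT's hailortcli, for example), ask the firmware to identify itself
hailortcli fw-control identify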

Network & privacy setup

For an in-person demo, isolate the Pi from the internet: configure a hotspot or Ethernet-only network that has no upstream gateway. Disable cloud services and block outbound traffic with ufw when you're demoing (this maps to sandboxing and isolation best practices you should follow when running local agents).

sudo apt install ufw
sudo ufw default deny outgoing
sudo ufw allow from 192.168.4.0/24 to any port 3000  # allow local app traffic
sudo ufw enable
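
If there's no spare router for the demo, the Pi can host the isolated network itself. On Raspberry Pi OS Bookworm, which uses NetworkManager, a one-line hotspot is enough; the SSID and password below are placeholders, and note that NetworkManager's shared mode defaults to the 10.42.0.0/24 subnet, so adjust the ufw rule above to match.

# Stand up a Wi-Fi hotspot on the Pi (no upstream gateway, so the demo stays offline)
sudo nmcli device wifi hotspot ifname wlan0 ssid PiDemo password "demo-pass-2026"

# Confirm the hotspot connection is active
nmcli connection show --active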

Step 2 — Choose a local LLM runtime and model

In 2026 the ecosystem has matured: popular local runtimes include llama.cpp (optimized for small devices and quantized GGUF models), MLC-LLM (which compiles models for on-device accelerators), and vendor runtime stacks that expose the HAT's NPU. For privacy and licensing reasons, pick an open model compatible with local execution and quantized formats (GGUF / ggml) — or use a permissively licensed instruction-tuned model appropriate to your demo scope.

The practical recommendation: use a compact instruction-tuned model (e.g., ~3B to 7B parameter class quantized to Q4_K or similar) that the AI HAT+ 2 can accelerate. This yields fast, coherent responses good enough for product demos without huge RAM demands.

Install llama.cpp (example)

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

Convert or download a quantized GGUF/ggml model compatible with your runtime. Some toolchains include converters; follow model licensing rules and sandboxing guidance from resources on running desktop LLM agents safely (sandboxing & isolation best practices). Place the model in /home/pi/models/demo_model.gguf.
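
If you'd rather download a ready-made quantized file than convert one yourself, most GGUF models on Hugging Face can be fetched directly; the repository and file names below are placeholders, so substitute a model whose license fits your demo. The smoke test assumes the cmake build from the previous step.

# Fetch a quantized GGUF (placeholder URL -- pick a model and quantization that fit your demo)
mkdir -p /home/pi/models
wget -O /home/pi/models/demo_model.gguf \
  "https://huggingface.co/<org>/<model>-GGUF/resolve/main/<model>-Q4_K_M.gguf"

# Quick smoke test from the llama.cpp directory before wiring up the API
./build/bin/llama-cli -m /home/pi/models/demo_model.gguf -p "Say hello in one sentence." -n 32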

Step 3 — Local API that streams tokens

A streaming API sells the illusion of an intelligent assistant. Token streaming also keeps the UI responsive. We'll show a minimal Node.js server that spawns a local LLM subprocess, parses token output, and forwards it as Server-Sent Events (SSE) to the React app.

// server/index.js (simplified)
const express = require('express')
const { spawn } = require('child_process')
const app = express()
app.use(express.json())

app.post('/api/generate', (req, res) => {
  res.set({ 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' })
  res.flushHeaders()

  const prompt = req.body.prompt || ''
  // Run llama.cpp's CLI binary, which streams generated tokens to stdout.
  // Recent llama.cpp builds name it llama-cli under build/bin; older builds used ./main.
  // The --color flag is dropped so ANSI escape codes don't end up in the stream.
  const proc = spawn('/home/pi/llama.cpp/build/bin/llama-cli', ['-m', '/home/pi/models/demo_model.gguf', '-p', prompt])

  proc.stdout.on('data', chunk => {
    // parse and forward token(s)
    const text = chunk.toString()
    res.write(`data: ${JSON.stringify({ token: text })}\n\n`)
  })

  proc.on('close', () => {
    res.write('event: done\ndata: {}\n\n')
    res.end()
  })

  req.on('close', () => {
    proc.kill()
  })
})

app.listen(3000, () => console.log('API listening on 3000'))
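
With the server running on the Pi, you can smoke-test the streaming endpoint from a laptop on the same demo network before touching the UI. raspberrypi.local below is the Pi's default mDNS hostname; substitute your Pi's address if you've changed it.

curl -N -X POST http://raspberrypi.local:3000/api/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Give me one sentence about onboarding."}'

The -N flag turns off curl's output buffering so you can watch the SSE events arrive as the model generates.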

This minimal server is intentionally simple so you can iterate quickly. For production-like demos, add request limits, per-session logs (encrypted locally), and robust process supervision.
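
For process supervision, a small systemd unit restarts the API automatically if it crashes mid-demo. This is a sketch; the user, paths, and unit name are assumptions based on the layout used in this guide.

# /etc/systemd/system/llm-demo-api.service
[Unit]
Description=Local LLM demo API
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/demo/server
ExecStart=/usr/bin/node index.js
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target

Enable it with sudo systemctl daemon-reload && sudo systemctl enable --now llm-demo-api so the API comes back on its own after a crash or reboot.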

Step 4 — Build a React demo app optimized for offline demos

Use Vite + React for the fastest dev loop. Keep the UI focused — show capability, not feature-completeness. Include these elements:

  • Prompt input with examples to guide non-technical stakeholders.
  • Streaming text component that appends tokens as they arrive.
  • Latency & privacy indicators (e.g., “offline”, “local only”, “no cloud”).
  • Fallback mock for spotty hardware (use canned responses) so demos never fail completely.
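
The client below calls /api/generate with a relative URL, so during development you can proxy that path from the Vite dev server to the Pi. A minimal sketch, assuming a stock Vite + React project (npm create vite@latest demo-ui -- --template react) and a Pi reachable at raspberrypi.local; adjust the target for your demo network.

// vite.config.js -- forward /api requests to the Pi's local API during development
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'

export default defineConfig({
  plugins: [react()],
  server: {
    proxy: {
      '/api': 'http://raspberrypi.local:3000',
    },
  },
})

For the offline build served directly on the Pi, drop the proxy; one option is to let nginx serve the static assets and forward /api to localhost:3000.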

Streaming client example (React)

// src/App.jsx (concept)
import { useState, useRef } from 'react'

export default function App() {
  const [prompt, setPrompt] = useState('Summarize our onboarding flow in 3 bullets')
  const [output, setOutput] = useState('')
  const controllerRef = useRef(null)

  async function start() {
    setOutput('')
    if (controllerRef.current) controllerRef.current.abort()
    controllerRef.current = new AbortController()

    const res = await fetch('/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt }),
      signal: controllerRef.current.signal,
    })

    const reader = res.body.getReader()
    const decoder = new TextDecoder()
    let buffer = ''
    let done = false
    while (!done) {
      const { value, done: streamDone } = await reader.read()
      done = streamDone
      if (value) {
        buffer += decoder.decode(value, { stream: true })
        // SSE events are separated by blank lines; only parse complete events
        const events = buffer.split('\n\n')
        buffer = events.pop() || ''
        for (const evt of events) {
          const dataLine = evt.split('\n').find(line => line.startsWith('data: '))
          if (!dataLine) continue
          try {
            const { token } = JSON.parse(dataLine.slice('data: '.length))
            if (token) setOutput(prev => prev + token) // the final done event has no token and is skipped
          } catch {
            // ignore lines that aren't valid JSON
          }
        }
      }
    }
  }

  return (