Edge AI with Raspberry Pi 5 and React: Building a Low-Latency Local Inference Dashboard

2026-01-24 12:00:00
11 min read

Build a low-latency React dashboard that streams local generative and vision AI from Raspberry Pi 5 + AI HAT+ 2 for privacy-first, offline inference.

Low-latency edge AI for developers: why this matters now

Keeping inference close to users is no longer an optional performance trick — it’s a practical requirement for privacy-sensitive, offline-first apps and sub-100ms UX. If you’re supporting complex visual features, on-device generation, or strict data governance, shipping every request to the cloud increases latency, costs, and risk. This guide shows you how to connect a Raspberry Pi 5 equipped with the new AI HAT+ 2 to a production-ready React dashboard for local generative and vision inference, using WebSocket-based streaming, secure deployment, and modern build tooling.

The elevator pitch (what you’ll get)

By the end of this article you’ll understand a pragmatic architecture for local inference, secure low-latency transport using WebSocket, a reference server implementation you can run on Raspberry Pi 5 + AI HAT+ 2, and a React dashboard built with modern toolchains (Vite, React 19+, TypeScript). You’ll also get deployment recipes (systemd or container), CI hints for multi-arch builds, and production hardening tips for privacy and offline operation.

Late 2025 and early 2026 were pivotal for edge AI. Vendors released more compact quantized models and device SDKs tuned for NPUs and small form-factor compute. Browsers continued rolling out WebNN and improved WebGPU support, enabling richer in-browser ML. On-device LLMs matured for many use-cases, and the new AI HAT+ 2 (released late 2025) made generative/vision workloads practical on Raspberry Pi 5-class hardware.

For teams shipping React apps, this means you can build dashboards that control and visualize local inference with acceptable latency and strong privacy guarantees, while integrating cleanly into existing DevOps pipelines.

High-level architecture

Components

  • Raspberry Pi 5 + AI HAT+ 2: runs local inference server, optionally leverages NPU on the HAT.
  • Inference server: exposes a WebSocket API for bidirectional streaming of images, tokens, and status updates.
  • React dashboard: Web UI connecting via secure WebSocket (wss) to the Pi; renders real-time results and controls inference parameters.
  • Edge router / reverse proxy: TLS termination and optional authentication (Caddy, Nginx, or Traefik).
  • CI/CD: builds multi-arch Docker images and deploys to the Pi or pushes artifacts for local installation. See platform and build reviews like NextStream Cloud Platform Review for CI/CD considerations.

Dataflow

  1. User interacts with the React dashboard (start inference, upload image, adjust prompt).
  2. Dashboard opens a WebSocket connection to the Pi and streams input (binary images or JSON prompts).
  3. Inference server processes requests locally using optimized runtime (ONNX Runtime, TensorFlow Lite, or vendor SDK) and returns streamed tokens or result frames over WebSocket.
  4. Dashboard renders partial outputs as they arrive, improving perceived latency.

Why WebSocket (and when to consider alternatives)

WebSocket offers low-latency, bidirectional, persistent connections that are simple to implement and work reliably across networks. It is ideal for streaming tokenized outputs, progress updates, and image chunks with minimal overhead.

When you need sub-50ms peer-to-peer streaming (camera-to-dashboard), evaluate WebRTC DataChannel. For many control-and-visualize dashboards, WebSocket is simpler and sufficiently fast.

Reference implementation: pieces you can copy

The examples here use a minimal WebSocket server (Python/FastAPI) on the Pi and a TypeScript React client (Vite). Replace the inference calls with the SDK or runtime you prefer on the Pi (ONNX Runtime, TensorFlow Lite, or vendor-provided API for AI HAT+ 2).

1) Minimal FastAPI WebSocket server (Raspberry Pi)

This server receives JSON control messages and binary image frames, runs (simulated) inference, and streams text or image results back to clients; swap the placeholder generator for your real runtime, offloaded so it does not block the event loop.

from fastapi import FastAPI, WebSocket, WebSocketDisconnect
import asyncio
import json

app = FastAPI()

# Placeholder: replace with your model runtime initialization
class LocalModel:
    def __init__(self):
        # load quantized/optimized model for AI HAT+ 2
        pass

    async def infer(self, payload):
        # simulate streaming tokens or results
        for i in range(5):
            await asyncio.sleep(0.2)
            yield {"chunk": i, "text": f"partial {i}"}

model = LocalModel()

@app.websocket('/ws')
async def websocket_endpoint(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            data = await ws.receive()
            # data is a raw ASGI message: text, bytes, or a disconnect event
            if data.get('type') == 'websocket.disconnect':
                break
            if data.get('text'):
                msg = json.loads(data['text'])
                if msg.get('type') == 'infer':
                    async for chunk in model.infer(msg.get('payload')):
                        await ws.send_text(json.dumps({"type": "partial", "payload": chunk}))
            elif 'bytes' in data and data['bytes']:
                # handle binary image frame
                image_bytes = data['bytes']
                # run image model and send back result
                await ws.send_text(json.dumps({"type": "image_result", "payload": {"size": len(image_bytes)}}))
    except WebSocketDisconnect:
        print('client disconnected')

Notes:

  • Integrate the AI HAT+ 2 runtime where LocalModel is initialized.
  • Use asyncio tasks for long-running inferences to avoid blocking the event loop.
  • Return partial results aggressively to improve perceived performance.

2) TypeScript React WebSocket client (Vite)

Keep the dashboard simple. Use hooks to manage connection and streaming state. Render partial results as they arrive.

import React, {useEffect, useRef, useState} from 'react'

// Shape of the streamed partial results sent by the server
type Chunk = { chunk: number; text: string }

export default function Dashboard() {
  const wsRef = useRef<WebSocket | null>(null)
  const [connected, setConnected] = useState(false)
  const [chunks, setChunks] = useState<Chunk[]>([])

  useEffect(() => {
    const ws = new WebSocket('wss://pi.local/ws')
    ws.binaryType = 'arraybuffer'
    ws.onopen = () => setConnected(true)
    ws.onmessage = (ev) => {
      try {
        const payload = JSON.parse(ev.data)
        if (payload.type === 'partial') setChunks(c => [...c, payload.payload])
      } catch (e) { console.error(e) }
    }
    ws.onclose = () => setConnected(false)
    wsRef.current = ws
    return () => ws.close()
  }, [])

  function startInference() {
    wsRef.current?.send(JSON.stringify({type: 'infer', payload: {prompt: 'Describe this scene'}}))
  }

  return (
    <div>
      <p>Connected: {String(connected)}</p>
      <button onClick={startInference} disabled={!connected}>Start inference</button>
      <ul>
        {chunks.map((c, i) => (
          <li key={i}>Chunk {i}: {JSON.stringify(c)}</li>
        ))}
      </ul>
    </div>
  )
}

Model runtimes and on-device strategies (practical guidance)

Pick the runtime that matches your workload and the AI HAT+ 2 SDK. Practical options in 2026:

  • ONNX Runtime with an Arm NN or vendor NPU execution provider — great for quantized transformer and vision models. For production-grade telemetry and profiling, combine runtime metrics with modern observability tooling.
  • TensorFlow Lite for optimized vision and small transformer models.
  • Vendor SDK for AI HAT+ 2 — this exposes NPU acceleration and power controls; check vendor docs for best practices.
  • WebNN / WebGPU for in-browser inference when the model fits in memory and you want to keep everything on the client (a feature-detection sketch follows this list). For architecting offline-first client flows and resilient exports, see guidance on offline-first tooling and observability.
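If you take the browser path, feature-detect the acceleration APIs before loading any model and fall back to the Pi otherwise. A minimal TypeScript sketch using standard browser globals — note that WebNN's navigator.ml is still experimental in some browsers, so treat its presence as a capability hint rather than a guarantee.

// Detect which browser-side acceleration path is available before deciding
// whether to load an in-browser model or defer everything to the Pi.
type BrowserAccel = 'webnn' | 'webgpu' | 'wasm-only'

export async function detectBrowserAccel(): Promise<BrowserAccel> {
  // WebNN exposes navigator.ml (still experimental in some browsers).
  if ('ml' in navigator) return 'webnn'

  // WebGPU exposes navigator.gpu; requestAdapter() can still return null on
  // unsupported hardware, so check the adapter rather than the property alone.
  const gpu = (navigator as any).gpu
  if (gpu && (await gpu.requestAdapter()) !== null) return 'webgpu'

  // Fall back to WASM inference, or route the request to the edge device.
  return 'wasm-only'
}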

Key optimizations:

  • Use quantized models (8-bit/4-bit where supported) to reduce memory and inference cost. Consider security and permissioning patterns like zero-trust for generative agents when models accept user-provided prompts or private data.
  • Batch small requests if possible to utilize NPU throughput efficiently.
  • Stream outputs rather than waiting for full completion to improve UX.
  • Cache warm-up artifacts (compiled kernels, quantization tables) between runs.

Security, privacy, and offline-first practices

The central privacy win of this architecture: inference and raw data never leave your local network unless you explicitly opt-in. Make this trustworthy:

  • Local-only by default: default the service to listen on local network or local-only hostnames unless the operator enables external exposure.
  • TLS: terminate TLS at the edge router using Let's Encrypt or use mTLS for stronger guarantees on private networks.
  • Auth tokens: short-lived JWTs or signed cookies; rotate tokens in CI and provide an admin UI to revoke keys (a client-side connection sketch follows this list).
  • Audit logs: keep a tamper-evident log of inference calls (anonymized if necessary) for compliance.
  • No cloud fallback unless configured: never silently proxy requests or data to a hosted model; require an explicit operator opt-in for any remote path.
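Browsers cannot attach custom headers to a WebSocket handshake, so one common pattern is to fetch a short-lived token over HTTPS first and pass it when opening the wss connection. A minimal TypeScript sketch of the dashboard side — the /auth/ws-token endpoint and the token query parameter are illustrative assumptions, not part of the reference server above; the server would verify the token before calling ws.accept().

// Sketch: authenticate the dashboard-to-Pi WebSocket with a short-lived token.
// Assumes a hypothetical HTTPS endpoint (/auth/ws-token) returning { token }.
export async function openAuthedSocket(base: string): Promise<WebSocket> {
  const res = await fetch(`${base}/auth/ws-token`, { credentials: 'include' })
  if (!res.ok) throw new Error(`token request failed: ${res.status}`)
  const { token } = (await res.json()) as { token: string }

  // The browser WebSocket API cannot set headers, so the token rides in the
  // query string; keep it short-lived and single-use to limit exposure.
  const url = `${base.replace(/^http/, 'ws')}/ws?token=${encodeURIComponent(token)}`
  const ws = new WebSocket(url)
  ws.binaryType = 'arraybuffer'
  return ws
}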

Deployment recipes

Systemd unit (simple local install)

[Unit]
Description=Local Edge Inference Service
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/app
ExecStart=/usr/bin/python3 -m uvicorn main:app --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target

Install and enable:

sudo cp inference.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now inference.service

Containerized deployment (Docker, multi-arch CI)

Use GitHub Actions to build multi-arch images (linux/arm64, linux/amd64) and push them to a registry, then run them on the Pi (arm64) with docker run or podman. For advice on multi-arch CI and packaging, consult platform reviews such as NextStream's CI notes.

# Dockerfile (simplified)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

GitHub Actions snippet (multi-arch):

name: Build and push
on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v4
        with:
          push: true
          platforms: linux/amd64,linux/arm64
          tags: ghcr.io/${{ github.repository }}:latest

Performance tuning and observability

Tuning edge inference is an iterative cycle:

  • Measure baseline — profile cold start, warm inference, memory usage, and CPU/NPU utilization (a client-side time-to-first-chunk sketch follows this list).
  • Use the profiler — ONNX Runtime and many vendor SDKs provide profilers and timeline exports to identify hotspots. Combine that telemetry with modern observability practices for preprod and local stacks.
  • Optimize input paths — send compressed/resized images, use efficient serialization (binary blobs for images; JSON for control).
  • Rate limit — protect the device from accidental DoS by limiting concurrent sessions and frequency of expensive inferences.
  • Expose metrics — Prometheus metrics endpoint or logs with structured JSON for central collection.
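Device-side profilers show where inference time goes; for perceived latency it also helps to measure time-to-first-chunk from the dashboard itself, as referenced in the baseline item above. A small TypeScript sketch, assuming the same {type: 'infer'} and {type: 'partial'} message shapes as the reference client:

// Measure latency from sending an 'infer' request to the first streamed chunk.
export function measureTimeToFirstChunk(ws: WebSocket, payload: unknown): Promise<number> {
  return new Promise((resolve) => {
    const start = performance.now()
    const onMessage = (ev: MessageEvent) => {
      try {
        const msg = JSON.parse(ev.data)
        if (msg.type === 'partial') {
          ws.removeEventListener('message', onMessage)
          resolve(performance.now() - start) // milliseconds to first chunk
        }
      } catch {
        // binary or non-JSON frames are ignored for this measurement
      }
    }
    ws.addEventListener('message', onMessage)
    ws.send(JSON.stringify({ type: 'infer', payload }))
  })
}

Log these numbers alongside the server-side metrics so regressions in the model or the transport show up in the same place.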

React dashboard best practices for low-latency streaming

  • Render partial data as chunks arrive — users perceive lower latency if they see progressive results.
  • Use Suspense and streaming rendering for gradual hydration; pair it with a service worker and cached assets if the dashboard must also work offline-first.
  • Keep state local for streaming flows — avoid global re-renders while many small updates arrive; use useRef and local queues (see the hook sketch after this list). For teams building small fast frontends, see guidance on micro-apps and developer tooling.
  • Profile rendering with React Profiler (2026 tooling improvements make profiling hooks and transitions easier) to isolate rerenders.
  • Chunk large images and use Web Workers for heavy post-processing to keep the UI thread responsive.
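For the local-state advice above, one workable pattern is to buffer incoming chunks in a ref and flush them to React state at most once per animation frame, so a burst of WebSocket messages triggers a single re-render. A sketch of such a hook — the name useChunkBuffer is illustrative, not a library API:

import { useCallback, useEffect, useRef, useState } from 'react'

// Buffer streamed chunks in a ref and flush them to state once per animation
// frame, so dozens of WebSocket messages per second cause one re-render.
export function useChunkBuffer<T>() {
  const buffer = useRef<T[]>([])
  const frame = useRef<number | null>(null)
  const [chunks, setChunks] = useState<T[]>([])

  const push = useCallback((chunk: T) => {
    buffer.current.push(chunk)
    if (frame.current === null) {
      frame.current = requestAnimationFrame(() => {
        frame.current = null
        setChunks(prev => [...prev, ...buffer.current.splice(0)])
      })
    }
  }, [])

  // Cancel any pending flush when the component unmounts.
  useEffect(() => () => {
    if (frame.current !== null) cancelAnimationFrame(frame.current)
  }, [])

  return { chunks, push }
}

In the Dashboard component above, call push(payload.payload) from onmessage instead of calling setChunks directly.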

Edge cases and failure modes

Plan for intermittent networks, power cycles, and model corruption. Practical mitigations:

  • Automatic model verification: checksum and signature verification at boot.
  • Graceful degradation: fall back to local lightweight model or cached outputs when the full model fails to load.
  • Retries and exponential backoff: for noisy networks between the dashboard and the Pi (reconnect sketch below).
  • Health probes: expose /health and /metrics so orchestrators or watchdogs can restart failed services.
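For the retry item above, a small client-side reconnect helper with exponential backoff and jitter usually suffices; the delay cap and jitter range below are illustrative defaults, not values the server requires.

// Reconnect to the Pi with exponential backoff plus jitter so a flapping
// network does not hammer the device with simultaneous reconnect attempts.
export function connectWithBackoff(
  url: string,
  onOpen: (ws: WebSocket) => void,
  maxDelayMs = 30_000,
) {
  let attempt = 0
  let stopped = false

  const connect = () => {
    const ws = new WebSocket(url)
    ws.binaryType = 'arraybuffer'
    ws.onopen = () => {
      attempt = 0 // reset backoff after a successful connection
      onOpen(ws)
    }
    ws.onclose = () => {
      if (stopped) return
      const base = Math.min(maxDelayMs, 1000 * 2 ** attempt)
      const delay = base / 2 + Math.random() * (base / 2) // jitter in [base/2, base]
      attempt += 1
      setTimeout(connect, delay)
    }
  }

  connect()
  return () => { stopped = true } // call the returned function to stop reconnecting
}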

Example: shipping a feature — constrained image captioning

Use case: a kiosk collects images, captions them locally, and displays results in a React dashboard that never touches the cloud.

  1. Deploy a compact captioning model on AI HAT+ 2 (quantized ONNX model).
  2. Client uploads the image via a WebSocket binary frame (resize-and-send sketch after these steps). For reliable uploads from mobile clients, consider a tested client SDK.
  3. Pi runs the model and streams tokenized caption back; React renders incrementally.
  4. Logs and metrics go to a local Prometheus instance for operations.

The net effect is sub-second captioning with full data privacy and offline resilience — ideal for regulated environments and edge retail deployments.

Production checklist

  • Model is quantized and benchmarked on the device.
  • Inference service runs as a managed unit (systemd or container) and restarts on failure.
  • HTTPS/TLS in place via edge router; WebSocket uses wss://.
  • Auth tokens and optional mTLS configured for dashboard-to-Pi connections.
  • CI builds multi-arch images; images are signed and verifiable.
  • Metrics and logs aggregated to a local or private observability stack.

Future-proofing and 2026+ predictions

Expect a continued shift toward smaller, higher-quality quantized models and improved NPU tooling. Browsers will standardize WebNN and extend WebGPU, making richer client-side ML feasible. For React teams, that means hybrid architectures will become the default: small models running directly in the browser when possible, larger/accelerated models running on nearby edge devices like Raspberry Pi 5 + AI HAT+ 2.

Operationally, edge MLOps will mature: model signing, auto-rollbacks, and remote attestation will standardize. Designing your dashboard and inference layer to support remote updates with safeties will save headaches.

Final actionable checklist: get started in 60 minutes

  1. Acquire Raspberry Pi 5 + AI HAT+ 2 and flash a 64-bit OS image.
  2. Install Python3, FastAPI, and your chosen runtime (ONNX/TFLite or vendor SDK).
  3. Clone the reference server and start it as a systemd service.
  4. Create a Vite + React TypeScript dashboard and connect to ws://pi.local:8000/ws locally.
  5. Benchmark one small quantized model; enable streaming in the server and render partial results in the dashboard.
  6. Wrap with a reverse proxy (Caddy recommended for automatic TLS) and enable wss in production.

Resources and further reading

  • AI HAT+ 2 announcement coverage (late 2025) and vendor docs for NPU SDKs
  • ONNX Runtime and TensorFlow Lite docs for ARM/NPU backends
  • FastAPI docs: WebSocket examples and deployment patterns
  • React + Vite performance guides and React Profiler

“Bring inference local when latency, privacy, or offline capability matter — architecture matters more than model size.”

Call to action

If you’re ready to try this pattern, clone the companion repo (starter server + Vite dashboard) and deploy it to a Raspberry Pi 5 with AI HAT+ 2. Start with a simple captioning demo, iterate on quantization and streaming, and measure before you optimize. Share your results or issues on the project’s issue tracker so we can iterate on hardening and platform-specific tips.

Want a template? Download the reference repo, follow the 60-minute checklist above, and join the community discussion for deployment strategies and model recommendations tuned for AI HAT+ 2.
