On-Device Voice Assistant
A self-contained voice assistant that runs entirely on a Jetson Orin Nano. No cloud APIs, no internet required during conversation. Speech recognition, language model, text-to-speech, and voice activity detection all run on the board, with the LLM, TTS, and VAD on the integrated GPU.
This started as a question: can you run a useful conversational agent on an 8 GB edge device with under 4-second response latency, and can you make it actually pleasant to talk to? The answer turned out to be yes, with caveats. This page documents the system, shows some demo recordings, and explains the design choices that kept it usable.
The primary use cases right now are structured voice surveys (where the flow of questions is deterministic and the LLM only provides natural phrasing) and freeform conversation (open-ended chat with optional note-taking). The system supports additional flow types and is designed to be extensible.
Demo
Survey Mode
The assistant runs a structured intake survey: fixed questions, fixed order, bounded LLM follow-up probes. The LLM adds natural phrasing but never controls the flow, never invents facts, and never generates the survey questions themselves.
Basic intake survey (~5 min). The assistant asks about comfort level, goals, prior experience, privacy expectations, and usefulness.
Freeform Mode
Open-ended conversation. The assistant greets first, then listens. The user can ask questions, take notes ("remember that my appointment is Tuesday"), or just talk. Memory is ephemeral by default, wiped at the start of each session.
Freeform conversation with note-taking. Response latency is typically 2.5–3.5 seconds.
Hardware
| Component | What | Why |
|---|---|---|
| Compute | NVIDIA Jetson Orin Nano (8 GB) | CUDA GPU for parallel LLM + TTS + VAD inference |
| Microphone & speaker | Anker PowerConf S3 | Hardware echo cancellation, full-duplex, USB |
| Storage | 128 GB microSD | Holds JetPack, models, runtime |
| Network | Any local network (Wi-Fi, USB-C Ethernet) | SSH access for control; not needed during conversation |
The full setup: Jetson Orin Nano, Anker PowerConf S3, power supply. No screen needed.
Architecture
The voice pipeline is a state machine: idle, listening, transcribing, thinking, speaking, and back to idle. Barge-in is supported: the microphone and VAD stay active during playback, so the user can interrupt at any time.
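A minimal sketch of that loop in Python. This is illustrative only: `vad`, `transcribe`, `generate_reply`, and `play` are hypothetical stand-ins for the real Silero, whisper-server, Ollama, and Kokoro calls.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    TRANSCRIBING = auto()
    THINKING = auto()
    SPEAKING = auto()

def session_loop(vad, transcribe, generate_reply, play):
    state = State.IDLE
    while True:
        if state is State.IDLE:
            state = State.LISTENING             # wake on session start / play button
        elif state is State.LISTENING:
            audio = vad.record_until_silence()  # Silero VAD decides end of speech
            state = State.TRANSCRIBING
        elif state is State.TRANSCRIBING:
            text = transcribe(audio)            # long-running whisper-server
            state = State.THINKING
        elif state is State.THINKING:
            reply = generate_reply(text)        # Ollama; streamed in practice
            state = State.SPEAKING
        elif state is State.SPEAKING:
            # Barge-in: the mic and VAD stay active during playback, and
            # playback stops as soon as the user starts talking again.
            interrupted = play(reply, stop_on=vad.speech_detected)
            state = State.LISTENING if interrupted else State.IDLE
```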
Audio pipeline. Each stage overlaps: TTS starts on the first LLM clause, not after the full reply.
| Stage | Engine | Runs on |
|---|---|---|
| Voice activity | Silero VAD (ONNX) | CUDA |
| Speech recognition | whisper-server (tiny.en, long-running) | CPU / CUDA |
| Language model | Ollama → qwen2.5:1.5b | CUDA, kept warm in VRAM |
| Text-to-speech | Kokoro v1.0 (ONNX, 24 kHz) | CUDA |
| Audio I/O | PortAudio / sounddevice | USB speakerphone |
Latency Budget
End-of-user-speech to first audio out, measured on warm qwen2.5:1.5b. Because the stages overlap, the typical total comes in below the sum of the individual stages:
| Stage | Time |
|---|---|
| VAD silence detection | ~350 ms |
| Whisper ASR | ~0.8 s |
| LLM first chunk + Kokoro first clause | ~1.5–2.4 s |
| Speaker wake pad | ~0.3 s |
| Typical total | ~2.6–3.5 s |
Two choices keep this usable. First, the ASR server stays loaded in memory instead of cold-starting per utterance, which drops transcription from ~1.2 s to ~0.8 s. Second, TTS starts synthesizing the first clause (at the first comma) while the LLM is still generating the rest of the reply.
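A rough sketch of the second trick, assuming Ollama's default HTTP endpoint on the device; `synthesize` is a hypothetical stand-in for the Kokoro call, not the project's actual API:

```python
import json
import requests

CLAUSE_MARKS = (",", ".", "!", "?")

def speak_streaming(prompt, synthesize, model="qwen2.5:1.5b"):
    """Stream an LLM reply and hand each clause to TTS as it completes."""
    resp = requests.post(
        "http://localhost:11434/api/generate",   # Ollama's default endpoint
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    )
    buffer = ""
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        buffer += chunk.get("response", "")
        # Flush at the first clause boundary (comma or sentence end), so
        # TTS starts synthesizing while the LLM is still generating.
        if buffer.rstrip().endswith(CLAUSE_MARKS):
            synthesize(buffer)   # hypothetical: enqueue this clause for playback
            buffer = ""
        if chunk.get("done"):
            break
    if buffer.strip():
        synthesize(buffer)       # flush any trailing partial clause
```

Cutting at the first comma trades a small risk of awkward prosody for a second or more of perceived latency, which is the right trade for conversation.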
Structured Flows
The survey mode is built on deterministic YAML state machines. Each state either speaks a fixed verbatim line or uses a `prompt_template` to let the LLM phrase a bounded response. Transitions are explicit: the LLM never decides what question comes next.
Basic intake survey flow. Solid boxes are fixed verbatim text; shaded boxes are bounded LLM probes.
A flow YAML file looks something like this (shortened; the field names below are a sketch of the shape described above, not the exact schema):
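```yaml
id: basic_intake
states:
  greet:
    say: "Hi! I have a few quick questions about your experience."  # fixed verbatim line
    next: comfort
  comfort:
    say: "On a scale of one to five, how comfortable are you talking to a voice assistant?"
    next: comfort_probe
  comfort_probe:
    # Bounded LLM probe: the template constrains what the model may say,
    # and the transition stays fixed regardless of the model's output.
    prompt_template: "Acknowledge the rating {answer} warmly in one short sentence, then ask why."
    next: goals
  goals:
    say: "What would you want an assistant like this to help you with?"
    next: end
  end:
    say: "That's everything. Thank you!"
```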
The key principle: fixed questions, bounded probes. The LLM adds warmth and follow-up phrasing. It does not control the order, invent facts, or decide when the survey ends. This is important for research contexts where you need reproducible instruments.
Other flows are straightforward to add. The system currently ships with a daily check-in, an evening reflection, a curiosity survey, and a vaccination consultation flow (following a CDC/WHO motivational interviewing framework). New flows are a YAML file and a one-line config change.
Control and Deployment
The system runs headless; no screen is needed on the Jetson. You SSH in from a laptop, set the boot flow, and start the supervisor. Logs stream to stdout and can be followed in real time over SSH, and the speakerphone's play button can start and stop sessions without keeping a terminal open.
Environment
Everything runs on the Jetson. The laptop is only used for setup, syncing code, and monitoring. No cloud service is involved during conversation.
Deployment environment. The laptop connects over Wi-Fi for setup; audio goes through USB.
Privacy posture: by default, all transcripts, semantic facts, and vector embeddings are wiped at the start of each session. Nothing accumulates on the device. This is configurable; persistent memory can be enabled for personal-assistant use cases.
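A sketch of what that default looks like, with illustrative names (the real store and the `persist` flag are assumptions, not the project's actual config):

```python
class SessionMemory:
    """Ephemeral by default: everything is dropped when a session starts."""

    def __init__(self, persist: bool = False):
        self.persist = persist
        self.transcripts: list[str] = []
        self.facts: dict[str, str] = {}          # semantic facts ("appointment is Tuesday")
        self.embeddings: list[list[float]] = []  # vector store entries

    def start_session(self):
        if not self.persist:
            # Default posture: wipe transcripts, facts, and embeddings so
            # nothing accumulates on the device between sessions.
            self.transcripts.clear()
            self.facts.clear()
            self.embeddings.clear()
```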
Next Steps
The current system works, but there are several directions I still want to explore.
If you are interested in using this for research or have ideas, get in touch.