XiaoZhi Device

Bitcraft ESP32-S3 AI assistant · 240 MHz dual-core · 8 MB PSRAM · 16 MB Flash

What is XiaoZhi?

XiaoZhi is an AI voice assistant built on the Bitcraft ESP32-S3: a pocket-sized hardware device that listens for a wake word, records speech, streams audio to a cloud AI backend, and speaks the response back through its built-in speaker. It runs a custom fork of the fully open-source firmware (github.com/78/xiaozhi-esp32); the fork adds telemetry, media playback, and OTA updates served from this site.

Boot Sequence

1. Power on
Chip boots and initialises the display, audio codec, and buttons. Shows "Initializing…" on screen.
2. WiFi connect
Reads saved credentials from NVS flash. On first run, creates a hotspot and lets the user configure WiFi via browser at 192.168.4.1.
3. OTA check
Once the network is up, POSTs device info to /esp32/xiaozhi/ota.php on this server. If a newer firmware version is available, downloads and flashes it, then reboots.
4. AI server connect
Opens a WebSocket to api.tenclass.net/xiaozhi/v1/. Exchanges hello messages and negotiates MCP capabilities. The device then enters the Idle state, ready to talk.
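
The hello exchange in the AI server connect step can be sketched roughly as follows. This is an illustration under stated assumptions, not the firmware's code: the JSON field names (type, version, transport, features, audio_params) and both helper functions are hypothetical. The audio values (Opus, 16 kHz, mono, 60 ms frames) are taken from the Conversation Flow description.

```python
import json


def build_hello(sample_rate: int = 16000, frame_ms: int = 60) -> str:
    """Build a hypothetical client hello announcing audio format and MCP support."""
    hello = {
        "type": "hello",
        "version": 1,                # assumed protocol version number
        "transport": "websocket",
        "features": {"mcp": True},   # assumed way of advertising MCP capability
        "audio_params": {
            "format": "opus",
            "sample_rate": sample_rate,
            "channels": 1,           # mono, as described for the microphone path
            "frame_duration": frame_ms,
        },
    }
    return json.dumps(hello)


def server_accepts(reply_json: str) -> bool:
    """Treat the handshake as complete when the server answers with its own hello."""
    reply = json.loads(reply_json)
    return reply.get("type") == "hello" and reply.get("transport") == "websocket"
```

In this sketch the device would send build_hello() as a text frame right after the WebSocket opens, and only move to Idle once server_accepts() returns True for the reply.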

Conversation Flow

1. Wake word
WakeNet9 model runs continuously on-device, listening for "你好小智". Costs ~15% CPU, runs even in idle. No audio is transmitted until the wake word fires.
2. Record & encode
Audio is captured at 16 kHz mono via I2S microphone. The AFE processor applies noise suppression and echo cancellation. Speech is encoded as Opus (60 ms frames, VBR).
3. Stream to AI
Opus frames are sent over the WebSocket in a binary protocol (16-byte header + payload). The cloud AI (Qwen / DeepSeek / other) receives streaming audio and performs ASR → LLM → TTS in one pass.
4. Speak response
The AI streams back Opus TTS audio. The device decodes and plays it at 24 kHz through the ES8311 codec and speaker. The display shows the response text while speaking.
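
The framing numbers in the steps above can be checked with a short sketch. The sample counts follow directly from the stated rates and the 60 ms frame length. The 16-byte header layout (version, type, reserved, timestamp, payload size, big-endian) is an assumption made for illustration; the text only states the header's size.

```python
import struct

FRAME_MS = 60

# Assumed header layout: u16 version, u16 type, u32 reserved,
# u32 timestamp, u32 payload size. Big-endian, no padding: 16 bytes.
HEADER = struct.Struct(">HHIII")


def samples_per_frame(sample_rate_hz: int, frame_ms: int = FRAME_MS) -> int:
    """Number of PCM samples covered by one Opus frame."""
    return sample_rate_hz * frame_ms // 1000


def pack_frame(opus_payload: bytes, timestamp_ms: int,
               version: int = 2, msg_type: int = 0) -> bytes:
    """Prefix an encoded Opus frame with the (assumed) 16-byte header."""
    header = HEADER.pack(version, msg_type, 0,
                         timestamp_ms & 0xFFFFFFFF, len(opus_payload))
    return header + opus_payload


def unpack_frame(packet: bytes) -> tuple[int, bytes]:
    """Return (timestamp_ms, payload) and reject truncated packets."""
    _version, _msg_type, _reserved, timestamp_ms, size = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated packet")
    return timestamp_ms, payload
```

With these definitions, one 60 ms frame holds 960 samples on the 16 kHz microphone path and 1440 samples on the 24 kHz playback path.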

Telemetry

The firmware sends operational telemetry to /esp32/xiaozhi/telemetry. Idle heartbeats arrive about every 5 minutes. During active video playback the firmware also emits a progress snapshot every 5 seconds, covering startup latency, HTTP read timing, packet copy cost, queue pressure, decode/present timings, frame-age data, and stream resume counters. These snapshots make it possible to distinguish transport failures from local render limits while the video is still playing.

event: boot · heartbeat · conversation_start · conversation_end · wake_word · video_playback_progress · video_playback_end
firmware: Running firmware version string
battery: Charge level % + charging flag
temperature: Chip die temperature in °C
rssi: WiFi signal strength in dBm
free_heap: Available heap memory (bytes)
uptime_ms: Time since last boot (ms)
conversation_count: Total conversations since boot
playback fields: Startup timing, HTTP read counts/bytes/latency, packet copy cost, queue overflow vs backlog drops, JPEG decode timing, LVGL/present timing, frame age before/after render, resume attempt/failure counters, expected stream bytes, and an explicit playback end reason
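
As a rough illustration of the fields above, a heartbeat payload might be assembled like this. The top-level field names come from the list; the JSON encoding, the nested shape of battery, and the function names are assumptions, since the wire format is not specified here.

```python
import json

HEARTBEAT_INTERVAL_S = 5 * 60        # idle heartbeat, roughly every 5 minutes
PLAYBACK_PROGRESS_INTERVAL_S = 5     # progress snapshot during video playback

EVENTS = {
    "boot", "heartbeat", "conversation_start", "conversation_end",
    "wake_word", "video_playback_progress", "video_playback_end",
}


def build_telemetry(event, firmware, battery_pct, charging, temperature_c,
                    rssi_dbm, free_heap, uptime_ms, conversation_count):
    """Serialise one telemetry record (hypothetical JSON layout)."""
    if event not in EVENTS:
        raise ValueError(f"unknown event: {event}")
    return json.dumps({
        "event": event,
        "firmware": firmware,
        # Assumed shape: charge level and charging flag grouped together.
        "battery": {"level": battery_pct, "charging": charging},
        "temperature": temperature_c,
        "rssi": rssi_dbm,
        "free_heap": free_heap,
        "uptime_ms": uptime_ms,
        "conversation_count": conversation_count,
    })


def report_interval_s(video_playing: bool) -> int:
    """Pick the reporting cadence described in the text."""
    return PLAYBACK_PROGRESS_INTERVAL_S if video_playing else HEARTBEAT_INTERVAL_S
```

The playback fields are left out of this sketch because their exact names are not given.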

MCP — AI Controls the Device

The device runs a built-in MCP (Model Context Protocol) server. The cloud AI can call device tools mid-conversation — letting it respond to natural commands like "turn up the volume" or "what's your battery?"

self.audio_speaker.set_volume: Set volume 0–100
self.get_device_status: Returns firmware, battery, IP, uptime
self.media.search: Search the media library by natural language
self.media.play: Start audio playback from library
self.video.search: Search video library
self.video.play: Play video on device display
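
A tool call such as "turn up the volume" could look roughly like the sketch below. MCP messages are JSON-RPC 2.0, and tools are invoked with the tools/call method. The argument name volume, the in-memory state dictionary, and the handler itself are assumptions made for illustration, not the firmware's implementation.

```python
import json


def make_tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Build the JSON-RPC request the cloud side would send."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def handle_tool_call(raw: str, state: dict) -> dict:
    """Dispatch one tools/call request against a toy device state."""
    request = json.loads(raw)
    params = request["params"]
    if params["name"] == "self.audio_speaker.set_volume":
        # Clamp to the documented 0-100 range; "volume" is an assumed argument name.
        state["volume"] = max(0, min(100, int(params["arguments"]["volume"])))
        result = {"content": [{"type": "text", "text": str(state["volume"])}],
                  "isError": False}
        return {"jsonrpc": "2.0", "id": request["id"], "result": result}
    # Unknown tools are reported as a JSON-RPC error.
    return {"jsonrpc": "2.0", "id": request["id"],
            "error": {"code": -32602, "message": f"Unknown tool: {params['name']}"}}
```

For example, handle_tool_call(make_tool_call(1, "self.audio_speaker.set_volume", {"volume": 130}), state) would leave state["volume"] at 100, because the value is clamped before it is stored.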

Media Playback

The AI can discover and play media from the server's library. Two playback engines run in parallel:

MediaPlayer: Streams OGG/Opus audio files over HTTP. Audio only. The AI searches with self.media.search and triggers with self.media.play.
VideoPlayer: Streams .axv binary containers (MJPEG + Opus, interleaved). Renders JPEG frames to the 240×240 display at 8 fps while playing synchronised audio.

Both players abort gracefully when a new conversation starts.
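
The 8 fps figure implies a 125 ms budget per video frame. The sketch below works through that arithmetic and shows one possible way to measure how late a frame is. The drop policy and the RGB565 pixel format are assumptions for illustration only; the actual VideoPlayer logic is not described here.

```python
FPS = 8
FRAME_INTERVAL_MS = 1000 // FPS          # 125 ms between frames

WIDTH = HEIGHT = 240
BYTES_PER_PIXEL = 2                      # assumes an RGB565 framebuffer
DECODED_FRAME_BYTES = WIDTH * HEIGHT * BYTES_PER_PIXEL


def due_time_ms(start_ms: int, frame_index: int) -> int:
    """When a frame should appear, relative to the playback start time."""
    return start_ms + frame_index * FRAME_INTERVAL_MS


def frame_age_ms(now_ms: int, start_ms: int, frame_index: int) -> int:
    """Positive when the frame is late, negative when it is early."""
    return now_ms - due_time_ms(start_ms, frame_index)


def should_drop(now_ms: int, start_ms: int, frame_index: int) -> bool:
    """Hypothetical policy: skip a frame that is more than one interval late."""
    return frame_age_ms(now_ms, start_ms, frame_index) > FRAME_INTERVAL_MS
```

Under these assumptions a decoded frame occupies 115,200 bytes, which is small next to the 8 MB of PSRAM listed in the header.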

OTA Firmware Updates

On every boot the device checks this server for a newer firmware version. If one exists it downloads the binary, writes it to the inactive OTA flash slot (ota_0 / ota_1 alternate), then reboots into the new firmware. The ESP32 OTA subsystem rolls back automatically if the new firmware fails to boot.
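
The update decision described above can be sketched in a few lines. This is a simplified model, not the ESP32 OTA API: the dotted numeric version format and all function names are assumptions, and the real flashing and rollback steps are handled by the firmware.

```python
from typing import Optional

SLOTS = ("ota_0", "ota_1")


def parse_version(version: str) -> tuple:
    """Turn '1.10.2' into (1, 10, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_newer(server_version: str, running_version: str) -> bool:
    return parse_version(server_version) > parse_version(running_version)


def inactive_slot(running_slot: str) -> str:
    """The slot that is not currently booted; updates are written there."""
    if running_slot not in SLOTS:
        raise ValueError(f"unknown slot: {running_slot}")
    return SLOTS[1 - SLOTS.index(running_slot)]


def plan_update(running_version: str, running_slot: str,
                server_version: str) -> Optional[str]:
    """Return the slot to flash, or None when the device is up to date."""
    if is_newer(server_version, running_version):
        return inactive_slot(running_slot)
    return None
```

Comparing parsed tuples rather than raw strings matters here: as strings, "1.10.0" sorts before "1.9.3", which would hide a newer release.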

The assets partition (fonts, sounds, emoji) is separate — it is NOT updated by OTA. Changes to assets require a full USB flash.