XiaoZhi Device
Bitcraft ESP32-S3 AI assistant · 240 MHz dual-core · 8 MB PSRAM · 16 MB Flash
What is XiaoZhi?
XiaoZhi is a Bitcraft ESP32-S3 AI voice assistant: a pocket-sized hardware device that listens for a wake word, records speech, streams the audio to a cloud AI backend, and speaks the response through its built-in speaker. It runs fully open-source firmware (github.com/78/xiaozhi-esp32), here in a custom fork that adds telemetry, media playback, and an OTA server hosted on this site.
Boot Sequence
/esp32/xiaozhi/ota.php on this server. If a newer firmware version is available, downloads and flashes it, then reboots.
api.tenclass.net/xiaozhi/v1/. Exchanges hello messages, negotiates MCP capabilities.
Device enters Idle state, ready to talk.
Conversation Flow
Telemetry
The firmware sends operational telemetry to /esp32/xiaozhi/telemetry. Idle heartbeats arrive every ~5 minutes. During active video playback the device also emits a progress snapshot every 5 seconds, covering startup latency, HTTP read timing, packet copy cost, queue pressure, decode/present timings, frame-age data, and stream resume counters. With these snapshots, transport failures can be distinguished from local render limits while the video is still playing.
| event | boot · heartbeat · conversation_start · conversation_end · wake_word · video_playback_progress · video_playback_end |
| firmware | Running firmware version string |
| battery | Charge level % + charging flag |
| temperature | Chip die temperature in °C |
| rssi | WiFi signal strength in dBm |
| free_heap | Available heap memory (bytes) |
| uptime_ms | Time since last boot (ms) |
| conversation_count | Total conversations since boot |
| playback fields | Startup timing, HTTP read counts/bytes/latency, packet copy cost, queue overflow vs backlog drops, JPEG decode timing, LVGL/present timing, frame age before/after render, resume attempt/failure counters, expected stream bytes, and an explicit playback end reason |
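As a rough illustration of the fields above, here is a minimal sketch of how a heartbeat payload might be assembled. The top-level keys come from the table. The JSON encoding, the nested shape of `battery`, the helper name `build_heartbeat`, and all sample values are assumptions, not the firmware's actual wire format.

```python
import json
import time


def build_heartbeat(firmware, battery_pct, charging, temperature_c,
                    rssi_dbm, free_heap, boot_time_s, conversation_count,
                    now_s=None):
    """Assemble a heartbeat payload using the field names from the table."""
    now_s = time.monotonic() if now_s is None else now_s
    return {
        "event": "heartbeat",
        "firmware": firmware,
        # Assumption: battery is a nested object. The table only says it
        # carries a charge level (%) and a charging flag.
        "battery": {"level": battery_pct, "charging": charging},
        "temperature": temperature_c,      # chip die temperature, °C
        "rssi": rssi_dbm,                  # WiFi signal strength, dBm
        "free_heap": free_heap,            # bytes
        "uptime_ms": int((now_s - boot_time_s) * 1000),
        "conversation_count": conversation_count,
    }


# Hypothetical sample values, five minutes after boot.
payload = build_heartbeat("1.0.0", 87, False, 41.5, -62, 143_000,
                          boot_time_s=0.0, conversation_count=3, now_s=300.0)
body = json.dumps(payload)
```

A server-side handler could decode `body` and branch on the `event` field to separate heartbeats from playback snapshots.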
MCP — AI Controls the Device
The device runs a built-in MCP (Model Context Protocol) server. The cloud AI can call device tools mid-conversation — letting it respond to natural commands like "turn up the volume" or "what's your battery?"
| self.audio_speaker.set_volume | Set volume 0–100 |
| self.get_device_status | Returns firmware, battery, IP, uptime |
| self.media.search | Search the media library by natural language |
| self.media.play | Start audio playback from library |
| self.video.search | Search video library |
| self.video.play | Play video on device display |
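MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and an arguments object. The sketch below builds such requests for two of the tools listed above. The argument name `volume` and the helper `tool_call` are assumptions for illustration, and how the message is wrapped inside the device's connection to the cloud backend is not described here.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value works


def tool_call(name, arguments=None):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


# "turn up the volume": the argument key `volume` is an assumed name.
set_volume = tool_call("self.audio_speaker.set_volume", {"volume": 70})

# "what's your battery?": a status query with no arguments.
status = tool_call("self.get_device_status")

print(json.dumps(set_volume, indent=2))
```

The device would answer each request with a JSON-RPC response carrying the same `id`, which lets the AI match results to calls made mid-conversation.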
Media Playback
The AI can discover and play media from the server's library. Two playback engines run in parallel:
| MediaPlayer | Streams OGG/Opus audio files over HTTP. Audio only. The AI searches with self.media.search and triggers with self.media.play. |
| VideoPlayer | Streams .axv binary containers (MJPEG + Opus, interleaved). Renders JPEG frames to the 240×240 display at 8 fps while playing synchronised audio. |
Both players abort gracefully when a new conversation starts.
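At 8 fps each video frame has a 125 ms slot. How the firmware keeps frames aligned with audio is not described here; the sketch below assumes a common approach in which the audio clock is the master and each frame is rendered, held, or dropped depending on how late it is. The function names and the one-frame lateness threshold are hypothetical.

```python
FPS = 8
FRAME_INTERVAL_MS = 1000 // FPS  # 125 ms per frame at 8 fps


def frame_due_ms(frame_index):
    """Presentation time of a frame, measured from the start of playback."""
    return frame_index * FRAME_INTERVAL_MS


def decide(frame_index, audio_clock_ms, max_late_ms=FRAME_INTERVAL_MS):
    """Return 'wait', 'render', or 'drop' for a frame, given the audio clock.

    Assumption: audio is the master clock, and a frame more than one frame
    interval late is skipped rather than shown.
    """
    due = frame_due_ms(frame_index)
    if audio_clock_ms < due:
        return "wait"    # frame is early; hold it until audio catches up
    if audio_clock_ms - due > max_late_ms:
        return "drop"    # too stale to be worth decoding and presenting
    return "render"
```

For example, frame 8 is due at 1000 ms, so `decide(8, 900)` holds it, `decide(8, 1000)` renders it, and `decide(8, 1200)` drops it. A rule like this is one place where frame-age telemetry would come from.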
OTA Firmware Updates
On every boot the device checks this server for a newer firmware version. If one exists, it downloads the binary, writes it to the inactive OTA flash slot (ota_0 and ota_1 alternate), then reboots into the new firmware. The ESP32 OTA subsystem rolls back automatically if the new firmware fails to boot.
The assets partition (fonts, sounds, emoji) is separate — it is NOT updated by OTA. Changes to assets require a full USB flash.
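The update decision and slot alternation described above can be sketched as follows. This is an illustration only: the dotted numeric version format is an assumption, the response format of the OTA endpoint is not described here, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_newer(available, running):
    """True when the server's version is strictly newer than the running one."""
    return parse_version(available) > parse_version(running)


def inactive_slot(active):
    """Return the OTA slot that is not currently booted."""
    slots = ("ota_0", "ota_1")
    if active not in slots:
        raise ValueError(f"unknown OTA slot: {active!r}")
    return slots[1 - slots.index(active)]


# Hypothetical boot-time check: running 1.9.3 from ota_0, server offers 1.10.0.
if is_newer("1.10.0", "1.9.3"):
    target = inactive_slot("ota_0")  # the new image would be written here
```

Comparing parsed tuples rather than raw strings avoids treating "1.10.0" as older than "1.9.3", which plain string comparison would do.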