XiaoZhi Device

Bitcraft ESP32-S3 AI assistant · 240 MHz dual-core · 8 MB PSRAM · 16 MB Flash

What is XiaoZhi?

XiaoZhi is a Bitcraft ESP32-S3 AI voice assistant — a pocket-sized hardware device that listens for a wake word, records speech, streams audio to a cloud AI backend, and speaks the response back through its built-in speaker. It runs fully open-source firmware (github.com/78/xiaozhi-esp32) with a custom fork adding telemetry, media playback (audio + on-screen video), games (Flappy Bird and a DOOM attract-mode port with SFX), and an OTA server hosted here on this site. Voice chat runs through a self-hosted backend (Bedrock claude-haiku), answering in English.

Boot Sequence

Power on

Chip boots, initialises display, audio codec, buttons. Shows "Initializing…" on screen.

→

WiFi connect

Reads saved credentials from NVS flash. On first run, creates a hotspot and lets the user configure WiFi via browser at 192.168.4.1.

→

OTA check

POSTs device info to /esp32/xiaozhi/ota.php on this server. If a newer firmware version is available, downloads and flashes it, then reboots.

→

AI server connect

Opens a WebSocket to the self-hosted backend at ws://iot.danscodellaro.com/xiaozhi/v1/. Exchanges hello messages, negotiates MCP capabilities. Device enters Idle state — ready to talk.

Conversation Flow

Wake word

WakeNet9 model runs continuously on-device, listening for "你好小智". Costs ~15% CPU, runs even in idle. No audio is transmitted until the wake word fires.

→

Record & encode

Audio is captured at 16 kHz mono via I2S microphone. The AFE processor applies noise suppression and echo cancellation. Speech is encoded as Opus (60 ms frames, VBR).

→

Stream to AI

Opus frames are sent over the WebSocket in a binary protocol (16-byte header + payload). The self-hosted backend (ASR → LLM → TTS) receives streaming audio. The LLM is Bedrock claude-haiku via LiteLLM, replying in English.

→

Speak response

The AI streams back Opus TTS audio. The device decodes and plays it at 24 kHz through the ES8311 codec and speaker. The display shows the response text while speaking.

Telemetry

The firmware sends operational telemetry to /esp32/xiaozhi/telemetry. Idle heartbeats arrive every ~5 minutes. During active video playback the firmware also emits a progress snapshot every 5 seconds with startup latency, HTTP read timing, packet copy cost, queue pressure, decode/present timings, frame-age data, and stream resume counters, so transport failures can be distinguished from local render limits while the video is still playing.

event	boot · heartbeat · conversation_start · conversation_end · wake_word · video_playback_progress · video_playback_end
firmware	Running firmware version string
battery	Charge level % + charging flag
temperature	Chip die temperature in °C
rssi	WiFi signal strength in dBm
free_heap	Available heap memory (bytes)
uptime_ms	Time since last boot (ms)
conversation_count	Total conversations since boot
playback fields	Startup timing, HTTP read counts/bytes/latency, packet copy cost, queue overflow vs backlog drops, JPEG decode timing, LVGL/present timing, frame age before/after render, resume attempt/failure counters, expected stream bytes, and an explicit playback end reason
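
As a rough illustration of the fields above, a heartbeat row could be assembled as below. This is only a sketch: the flat JSON layout and the `battery_charging` key are assumptions, since the table names the fields but not the wire format.

```python
import json

# Event names taken from the table above.
KNOWN_EVENTS = {
    "boot", "heartbeat", "conversation_start", "conversation_end",
    "wake_word", "video_playback_progress", "video_playback_end",
}


def build_heartbeat(firmware, battery_pct, charging, temperature_c,
                    rssi_dbm, free_heap, uptime_ms, conversation_count,
                    event="heartbeat"):
    """Return a JSON string for one telemetry row (hypothetical layout)."""
    if event not in KNOWN_EVENTS:
        raise ValueError(f"unknown telemetry event: {event}")
    payload = {
        "event": event,
        "firmware": firmware,
        "battery": battery_pct,           # charge level in percent
        "battery_charging": charging,     # assumed key name for the charging flag
        "temperature": temperature_c,     # chip die temperature in degrees C
        "rssi": rssi_dbm,                 # WiFi signal strength in dBm
        "free_heap": free_heap,           # bytes
        "uptime_ms": uptime_ms,
        "conversation_count": conversation_count,
    }
    return json.dumps(payload)


row = build_heartbeat("2.3.18", 87, False, 41.5, -58, 142000, 360000, 3)
```

Rejecting unknown event names on the sending side keeps typos out of the telemetry database.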

MCP — AI Controls the Device

The device runs a built-in MCP (Model Context Protocol) server. The cloud AI can call device tools mid-conversation — letting it respond to natural commands like "turn up the volume" or "what's your battery?"

self.audio_speaker.set_volume	Set volume 0–100
self.get_device_status	Returns firmware, battery, IP, uptime
self.device.about	Fetches the device's "about" blurb from the server
self.media.search / .play / .stop / .status	Search, play, stop, and report status of audio from the library
self.video.search / .play / .stop / .status	Search, play, stop, and report status of video on the display
self.doom.start / .stop / .status	Launch / stop DOOM (attract mode, with SFX audio)
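
For orientation, a tool invocation in MCP is a JSON-RPC 2.0 `tools/call` request. The sketch below builds one for the volume tool. The `{"type": "mcp", "payload": ...}` envelope and the `volume` argument name are assumptions made for illustration, not confirmed details of this firmware.

```python
import itertools
import json

_request_ids = itertools.count(1)


def mcp_tool_call(name, arguments=None):
    """Build a JSON-RPC 2.0 `tools/call` request as defined by MCP."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def wrap_for_device(request):
    """Wrap the request in an assumed {"type": "mcp"} envelope for the chat WebSocket."""
    return json.dumps({"type": "mcp", "payload": request})


# "turn up the volume": the argument name `volume` is an assumption.
message = wrap_for_device(
    mcp_tool_call("self.audio_speaker.set_volume", {"volume": 80})
)
```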

Media Playback

The AI can discover and play media from the server's library. Two playback engines run in parallel:

MediaPlayer	Streams OGG/Opus audio files over HTTP. Audio only. The AI searches with `self.media.search` and triggers with `self.media.play`.
VideoPlayer	Streams `.axv` binary containers (MJPEG + Opus, interleaved) over a persistent WebSocket (plain `ws://` for low memory). Renders JPEG frames to the 240×240 display at the source frame rate — verified up to 23.976 fps (NTSC film cadence) with audio-clock-synced playback.

Both players abort gracefully when a new conversation starts.

OTA Firmware Updates

The device checks this server for a newer firmware version both on every boot AND every 60 seconds while idle (skipped during conversations or media playback). When a newer version is available it downloads the binary, writes it to the inactive OTA flash slot (ota_0 / ota_1 alternate), then reboots into the new firmware. The ESP32 OTA subsystem rolls back automatically if the new firmware fails to boot.

The assets partition (fonts, sounds, emoji) is separate — it is NOT updated by OTA. Changes to assets require a full USB flash.

idle heartbeat every ~5 min · playback progress every 5 s during active video · dashboard refresh every 5 s

Device Specifications

Device	Bitcraft (比特工坊) XiaoZhi
Chip	ESP32-S3
CPU	240 MHz dual-core Xtensa LX7
PSRAM	8 MB
Flash	16 MB
Board config	zhengchen-1.54tft-wifi-bitcraft
Upstream repo	github.com/78/xiaozhi-esp32
Fork	github.com/danscod/xiaozhi-esp32

Display

Panel	240×240 ST7789 TFT (SPI)
Graphics	LVGL (Light & Versatile Graphics Library)
Thread safety	DisplayLockGuard RAII pattern
Power save	Dims after 60 s idle, off after 300 s
High-temp alert	Red bar at bottom — triggers at 75°C
Fonts	PuHui 14–30 px (CJK+Latin), Noto, Font Awesome
Emoji	noto-emoji 128 px, twemoji 32/64 px (assets partition)

Audio

Codec	ES8311, I2S interface
Mic input	16 kHz mono
Speaker output	24 kHz via ES8311
Encoding	Opus, 60 ms frames, mono, VBR, DTX
Wake word	"你好小智" — WakeNet9 model (baked into assets)
AEC modes	Off · Device-side (ESP AFE) · Server-side

Battery & Power

Measurement	ADC Channel 7 — 6-point calibration
Calibration	2030 ADC → 0% · 2606 ADC → 100%
Charging GPIO	GPIO 9, active LOW
Temp sampling	Every 10 s — callback fires if delta ≥ 3.5°C
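
The two calibration endpoints and the temperature threshold can be sketched as follows. The firmware uses a 6-point calibration whose intermediate points are not listed here, so the straight line between the endpoints is a simplifying assumption, and both function names are hypothetical.

```python
ADC_EMPTY = 2030   # raw ADC reading mapped to 0 %
ADC_FULL = 2606    # raw ADC reading mapped to 100 %


def battery_percent(adc_raw):
    """Map a raw ADC reading to 0-100 %, clamped, using only the two endpoints."""
    span = ADC_FULL - ADC_EMPTY
    pct = (adc_raw - ADC_EMPTY) * 100 / span
    return max(0, min(100, round(pct)))


def temperature_changed(last_reported_c, current_c, threshold_c=3.5):
    """True when a new 10 s sample differs enough from the last one to fire the callback."""
    return abs(current_c - last_reported_c) >= threshold_c
```

Clamping matters because readings slightly outside the calibrated range would otherwise produce negative or above-100 percentages.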

GPIO Pin Map

Function	GPIO	Function	GPIO
I2S MIC WS	4	Display SDA	41
I2S MIC SCK	5	Display SCL	42
I2S MIC DIN	6	Display RESET	45
I2S SPK DOUT	7	Display DC	40
I2S SPK BCLK	15	Display CS	21
I2S SPK LRCK	16	Backlight	20
BOOT button	0	Battery ADC	Ch7
Volume Up	10	Charging status	9 (active LOW)
Volume Down	39

Button Behaviour

Button	Click	Long Press
BOOT (GPIO 0)	Toggle chat / start if idle	Enter WiFi config mode
Vol+ (GPIO 10)	+10% volume	Max volume (100%)
Vol− (GPIO 39)	−10% volume	Mute

Current Status

Running version	v2.3.18 (stable — 24 fps video, DOOM with SFX, English voice chat)
Server OTA version	v2.3.18
Language	English UI (en-US) · English voice chat via self-hosted backend (Bedrock claude-haiku)
Wake word	"你好小智" (Chinese — baked into WakeNet9, cannot OTA)
Board	zhengchen-1.54tft-wifi-bitcraft
Build machine	Ubuntu server, ESP-IDF v5.5.3

OTA System

On boot AND every 60 seconds while idle (see Application::StartAutoUpdateTimer) the device POSTs device info to /esp32/xiaozhi/ota.php and receives a JSON response with version + binary URL + WebSocket URL + ws_base_url. If the server version is newer it downloads, flashes the inactive OTA slot, and reboots. The auto-OTA poll is suppressed during conversations and media playback.

OTA endpoint	`/esp32/xiaozhi/ota.php` (Flask)
Binary path	`frontend/esp32/xiaozhi/xiaozhi.bin`
Version constant	`_OTA_FIRMWARE_VERSION` in `server.py`
Poll cadence	Boot + every 60 s when idle
Round-trip	~3 min from `systemctl restart travel-api` to device running new firmware

Partition Table

Partition	Offset	Size	Purpose
nvs	0x9000	16 KB	WiFi credentials, ota_url NVS config
ota_0	0x20000	3.94 MB	Firmware slot A
ota_1	0x410000	3.94 MB	Firmware slot B
assets	0x800000	8 MB	LittleFS: fonts, icons, OGG sounds, emojis

Assets partition is NOT updated by OTA — font/sound changes require a full USB flash.
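
The offsets in the table can be cross-checked with a little arithmetic. The sketch below derives the OTA slot size from the gap between offsets and confirms that the listed partitions neither overlap nor run past the 16 MB flash. Partitions not shown in the table are ignored, and `check_layout` is a hypothetical helper.

```python
KIB = 1024
MIB = 1024 * KIB

# (name, offset, size_bytes); OTA slot sizes are derived from the gaps between offsets.
PARTITIONS = [
    ("nvs",    0x9000,   16 * KIB),
    ("ota_0",  0x20000,  0x410000 - 0x20000),
    ("ota_1",  0x410000, 0x800000 - 0x410000),
    ("assets", 0x800000, 8 * MIB),
]
FLASH_SIZE = 16 * MIB


def check_layout(partitions, flash_size):
    """Raise if any partition overlaps the next one or runs past the end of flash."""
    ordered = sorted(partitions, key=lambda p: p[1])
    for (name, off, size), (next_name, next_off, _) in zip(ordered, ordered[1:]):
        if off + size > next_off:
            raise ValueError(f"{name} overlaps {next_name}")
    last_name, last_off, last_size = ordered[-1]
    if last_off + last_size > flash_size:
        raise ValueError(f"{last_name} exceeds flash")
    return True


ota_slot_mib = (0x410000 - 0x20000) / MIB   # 3.9375, shown as 3.94 MB in the table
layout_ok = check_layout(PARTITIONS, FLASH_SIZE)
```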

Deploy Workflow (Server Build — preferred)

# 1. Bump version — do this BEFORE building
sed -i 's/set(PROJECT_VER ".*")/set(PROJECT_VER "2.2.X")/' \
    /home/ubuntu/danscodellaro.com/xiaozhi-esp32/CMakeLists.txt

# 2. Build — IMPORTANT: capture the real exit code, not tail's
source /home/ubuntu/esp-idf/export.sh
cd /home/ubuntu/danscodellaro.com/xiaozhi-esp32
idf.py build > /tmp/build.log 2>&1; echo "BUILD EXIT: $?"
tail -5 /tmp/build.log
# (piping idf.py through `tail` masks compile failures with tail's exit 0)

# 3. Asserted deploy — refuses to ship if binary != CMakeLists version
expected=$(grep PROJECT_VER CMakeLists.txt | grep -oE "[0-9]+\.[0-9]+\.[0-9]+")
built=$(strings build/xiaozhi.bin | grep -E "^[0-9]+\.[0-9]+\.[0-9]+$" | head -1)
if [ "$expected" != "$built" ]; then echo "MISMATCH $expected vs $built"; exit 1; fi

cp build/xiaozhi.bin \
    /home/ubuntu/danscodellaro.com/frontend/esp32/xiaozhi/xiaozhi.bin

sed -i "s/_OTA_FIRMWARE_VERSION = \".*\"/_OTA_FIRMWARE_VERSION = \"$expected\"/" \
    /home/ubuntu/danscodellaro.com/api/server.py

sudo systemctl restart travel-api

# 4. Optional: free ~350 MB disk space if tight on space
# rm -rf build/   # only after you're sure nothing else needs the build dir

The version constant and the binary MUST stay in sync: a mismatch causes an infinite OTA loop. We hit this with 2.2.86: the build had silently failed but tail's exit 0 hid it, so the deploy shipped a stale 2.2.85 binary labelled "2.2.86" and the device looped for ~10 minutes downloading the same wrong file.
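
The `strings | grep` assertion in step 3 can also be written in Python, which may be easier to reuse from a deploy script. This is a minimal sketch of the same idea, not the project's actual tooling; the function names are hypothetical.

```python
import re

PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")     # roughly what `strings` prints
VERSION = re.compile(rb"[0-9]+\.[0-9]+\.[0-9]+")


def embedded_version(blob):
    """Return the first printable run that is exactly x.y.z, or None."""
    for run in PRINTABLE_RUN.finditer(blob):
        if VERSION.fullmatch(run.group()):
            return run.group().decode("ascii")
    return None


def assert_deployable(blob, expected):
    """Refuse to ship when the binary's embedded version differs from `expected`."""
    built = embedded_version(blob)
    if built != expected:
        raise ValueError(f"MISMATCH {expected} vs {built}")
    return built
```

Typical use would be `assert_deployable(open("build/xiaozhi.bin", "rb").read(), expected)` before the `cp` in step 3.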

Deploy Workflow (Windows Build → Server)

# Build on Windows:
cd C:\Users\dansc\xiaozhi-esp32
python idf_build.py build

# SCP to server:
scp -i C:/Keys/lightsail-cos.pem build/xiaozhi.bin \
    [email protected]:/home/ubuntu/danscodellaro.com/frontend/esp32/xiaozhi/xiaozhi.bin

# Then, on the server, update _OTA_FIRMWARE_VERSION in api/server.py and
# restart travel-api (the last two commands of step 3 above).

OTA Response Format

The OTA endpoint (POST /esp32/xiaozhi/ota.php) returns:

{
  "firmware": {
    "version": "2.3.18",
    "url": "https://www.danscodellaro.com/esp32/xiaozhi/xiaozhi.bin"
  },
  "websocket":   { "url": "ws://iot.danscodellaro.com/xiaozhi/v1/", "token": "" },
  "ws_base_url": "ws://iot.danscodellaro.com",
  "server_time": { "timestamp": 1775386136101, "timezone_offset": 600 }
}

The device compares firmware.version against its running version. If server is newer → download + flash + reboot. On success the app calls MarkCurrentVersionValid() to prevent OTA rollback.
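
The comparison can be sketched in a few lines. How the firmware actually compares versions is not shown here, so the numeric, component-wise comparison below is an assumption; it is chosen because a plain string comparison would rank "2.3.9" above "2.3.18".

```python
import json


def parse_version(text):
    """Turn "2.3.18" into (2, 3, 18) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in text.split("."))


def update_available(ota_response_json, running_version):
    """Return the binary URL if the server version is newer, else None."""
    firmware = json.loads(ota_response_json)["firmware"]
    if parse_version(firmware["version"]) > parse_version(running_version):
        return firmware["url"]
    return None


response = json.dumps({
    "firmware": {
        "version": "2.3.18",
        "url": "https://www.danscodellaro.com/esp32/xiaozhi/xiaozhi.bin",
    }
})
```

With this response, a device running 2.3.9 would get the URL back and a device already on 2.3.18 would get `None`.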

ws_base_url (added 2.2.83) is the plain-WS base URL used by the chat WS, telemetry WS, and video media WS. It is now pinned to the stable DNS name ws://iot.danscodellaro.com — a DNS-only, Cloudflare-bypass plaintext origin reserved for the device. The device caches it in NVS namespace net, key ws_base. (Earlier builds used a raw ws://<public-ip> that a server.py daemon refreshed from AWS IMDS; the iot hostname removed the need to chase IP changes.)

Pointing Device at This Server (First Time)

# Device in download mode (hold BOOT + power cycle via USB)
python set_ota_url.py
# Patches NVS at offset 0x9000 only — no firmware reflash needed

USB Serial Debugging

Connect via USB-C on COM3 (CH340K — check Device Manager). Open at 115200 baud.

idf.py -p COM3 monitor
# Or PuTTY / Arduino IDE serial monitor at 115200 baud

Recovering from a Bad Flash

If a new build crashes on boot, the OTA bootloader rolls back automatically after repeated failures. To stop the loop immediately: revert _OTA_FIRMWARE_VERSION in server.py to the last known-good version, then sudo systemctl restart travel-api. The device will see "already up to date" and stop re-downloading.

Custom Files (untracked — required for build)

These files are not committed to git. They are required for a correct build — never delete them, and always restore them when cloning to a new machine. The committed main/CMakeLists.txt lists both telemetry.cc and media_player.cc as source files; a missing file causes a "no such file" build error.

File	Purpose
`main/telemetry.cc`	Posts heartbeat/event metrics to `/esp32/xiaozhi/telemetry`. Called from `application.cc` after activation.
`main/telemetry.h`	Header for Telemetry class.
`main/media_player.cc`	MCP tools for media search/play. Registered in `McpServer::AddCommonTools()`.
`main/media_player.h`	Header for MediaPlayer class.

⚠️ Known Footgun: Circular GetInstance() Deadlock

If a singleton's constructor calls GetInstance() on another singleton that also calls back into the first, the C++11 magic-static initialisation guard will deadlock the main task permanently. The device hangs at that point on every boot with no panic — it just silently freezes.

This hit us in v2.2.6 and v2.2.7: McpServer() constructor called MediaPlayer::RegisterMcpTools(), which called McpServer::GetInstance() — re-entering the already-held static-init mutex. Fixed in v2.2.8 by moving the RegisterMcpTools() call from the constructor to AddCommonTools().

Rule: never call GetInstance() on a singleton from within another singleton's constructor.

Language / UI Strings

Language is set in sdkconfig. Changing it triggers scripts/gen_lang.py, which regenerates main/assets/lang_config.h. Do not edit lang_config.h directly — it is overwritten on every build.

# sdkconfig — switch to English UI:
# CONFIG_LANGUAGE_ZH_CN is not set
CONFIG_LANGUAGE_EN_US=y

One hardcoded string exists outside the locale system: the high-temp warning popup in main/boards/zhengchen-1.54tft-wifi-bitcraft/zhengchen_lcd_display.h line ~30.

Audio Files — OGG/Opus

Audio-only tracks use the OGG container with Opus-encoded audio. These match the device's native audio pipeline, so no transcoding is needed on playback.

Container	OGG
Codec	Opus
Sample rate	24 000 Hz
Channels	Mono (1)
Bitrate	24 kbps
Frame duration	60 ms
VBR	Off (CBR)
File extension	`.ogg`

Convert with ffmpeg

ffmpeg -i input.mp3 \
  -vn -ar 24000 -ac 1 \
  -c:a libopus -b:a 24k \
  -frame_duration 60 -vbr off \
  output.ogg

Video Files — .axv Container

Video uses a custom binary container (.axv) that interleaves JPEG video frames and raw Opus audio packets. The device streams it over a persistent WebSocket and decodes each frame type separately.

Container	`.axv` (custom binary)
Video codec	JPEG (one frame per packet)
Resolution	240 × 240 px
Frame rate	Configurable. Verified up to 23.976 fps (NTSC film). The encoder script `scripts/make_axv.py` accepts `--fps` with float precision; ts_ms timestamps are computed as `round(idx × 1000 / fps)` to avoid drift on non-integer rates.
JPEG quality	`-q:v 5` default (~60% quality, 8–12 KB/frame). Lower the number for higher quality, raise for smaller files.
Max JPEG size	32 KB per frame (firmware hard limit)
Audio codec	Opus (raw packets, no OGG wrapper)
Audio rate	24 000 Hz mono, 24 kbps, 60 ms frames

.axv Binary Frame Format

Each frame (video or audio) is prefixed with a 16-byte header. All four fields are big-endian (network byte order). The firmware rejects any file with wrong byte order — it logs "Frame sync lost" and stops silently.

Bytes	Field	Value
0–3	magic	`0xDEADBEEF` big-endian → raw bytes `DE AD BE EF`
4–7	type	`0` = video (JPEG) · `1` = audio (Opus) — big-endian
8–11	length	Payload byte count (uint32, big-endian)
12–15	ts_ms	Presentation timestamp in milliseconds (big-endian; used for muxing order only — device uses audio backpressure for A/V sync)
16+	payload	Raw JPEG bytes (type 0) or raw Opus packet (type 1)

Common mistake: Python's struct.pack uses the host's native byte order by default, which is little-endian on typical x86 and ARM machines. Always use struct.pack(">IIII", ...); the ">" prefix is required. Type constants are also easy to get wrong: video=0, audio=1 (not 1/2). The encoder script is pack_axv.py in the same directory as each .axv file.
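
The header layout above can be expressed with a single `struct.Struct`. This is a minimal sketch of a writer and reader for the frame format as documented in the table; the helper names are hypothetical and it is not the project's encoder script.

```python
import struct

AXV_MAGIC = 0xDEADBEEF
AXV_HEADER = struct.Struct(">IIII")   # magic, type, length, ts_ms; all big-endian
TYPE_VIDEO, TYPE_AUDIO = 0, 1


def pack_frame(frame_type, ts_ms, payload):
    """Prefix a JPEG or Opus payload with the 16-byte .axv header."""
    return AXV_HEADER.pack(AXV_MAGIC, frame_type, len(payload), ts_ms) + payload


def iter_frames(data):
    """Yield (type, ts_ms, payload) tuples from a bytes object holding .axv frames."""
    pos = 0
    while pos < len(data):
        magic, frame_type, length, ts_ms = AXV_HEADER.unpack_from(data, pos)
        if magic != AXV_MAGIC:
            raise ValueError(f"bad magic at offset {pos}")
        pos += AXV_HEADER.size
        yield frame_type, ts_ms, data[pos:pos + length]
        pos += length
```

Checking the magic on every frame mirrors the firmware's behaviour of stopping when frame sync is lost.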

Frames are interleaved in presentation-timestamp order. At 24 fps video (~42 ms) and 60 ms audio, video frames out-number audio packets ~7:5. At 8 fps, the ratio is roughly 1 video frame per 2 audio packets (125 ms vs 60 ms).
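
The interleaving order can be reproduced with `heapq.merge`. The sketch below is illustrative only: it assumes audio packets are timestamped at exact multiples of 60 ms, and `mux_order` is a hypothetical helper rather than part of the encoder.

```python
import heapq


def video_ts(idx, fps):
    """Timestamp of video frame idx, using the round(idx * 1000 / fps) formula."""
    return round(idx * 1000 / fps)


def mux_order(n_video, fps, n_audio, audio_frame_ms=60):
    """Return a list of ("V" or "A", ts_ms) pairs in presentation-timestamp order."""
    video = (("V", video_ts(i, fps)) for i in range(n_video))
    audio = (("A", i * audio_frame_ms) for i in range(n_audio))
    return list(heapq.merge(video, audio, key=lambda frame: frame[1]))


one_second_24fps = mux_order(24, 24, 17)
start_8fps = mux_order(3, 8, 5)
```

In the 8 fps case the first video frame is followed by three audio packets (0, 60, 120 ms) before the next video frame at 125 ms, which matches the roughly 1:2 ratio described above.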

Creating .axv Files

Use scripts/make_axv.py on the server. It requires ffmpeg in PATH and handles aspect ratio padding automatically.

From a video file

# Default (8 fps — small files, low CPU)
python3 scripts/make_axv.py --input video.mp4 --output out.axv

# Source-rate match (24 fps — cinema-cadence playback, ~3× file size)
python3 scripts/make_axv.py --input video.mp4 --output out.axv --fps 23.976

# Lower quality to shrink file at high fps
python3 scripts/make_axv.py --input video.mp4 --output out.axv --fps 23.976 --quality 8

Generate a test pattern (30 s SMPTE bars + 440 Hz tone)

python3 scripts/make_axv.py --test --output test_video.axv

The script extracts video at 240×240 with black-bar padding, re-encodes audio to Opus at 24 kHz mono, then interleaves both streams by presentation timestamp. Float-precision timestamps avoid drift at non-integer rates: at 23.976 fps the exact frame interval is about 41.708 ms, so stepping by a fixed integer interval over a 23-minute episode would drift by roughly 23 seconds (41 ms steps) or 10 seconds (42 ms steps).
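
The size of the drift depends on which integer scheme is used. As a worked example under one stated assumption (every frame advances by a fixed whole number of milliseconds), the sketch below compares that against the per-frame rounding formula.

```python
FPS = 23.976
EPISODE_S = 23 * 60


def drift_seconds(step_ms, fps=FPS, duration_s=EPISODE_S):
    """Drift after duration_s when every frame advances by a fixed integer step_ms."""
    frames = round(duration_s * fps)
    exact_ms = frames * 1000 / fps
    return abs(frames * step_ms - exact_ms) / 1000


def per_frame_error_ms(idx, fps=FPS):
    """Error of round(idx * 1000 / fps); bounded by half a millisecond, never accumulating."""
    return abs(round(idx * 1000 / fps) - idx * 1000 / fps)


truncated = drift_seconds(int(1000 / FPS))     # fixed 41 ms step
rounded = drift_seconds(round(1000 / FPS))     # fixed 42 ms step
```

A fixed 41 ms step ends up more than 23 seconds off and a fixed 42 ms step almost 10 seconds off, while the rounding formula stays within half a millisecond at any frame index.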

ffmpeg internals

# Video frames (extracted as individual JPEGs):
ffmpeg -i input.mp4 \
  -vf "scale=240:240:force_original_aspect_ratio=decrease,pad=240:240:(ow-iw)/2:(oh-ih)/2:black" \
  -r 23.976 -q:v 5 frames/%06d.jpg

# Audio (Opus in OGG, then raw packets demuxed):
ffmpeg -i input.mp4 -vn -ar 24000 -ac 1 \
  -c:a libopus -b:a 24k -frame_duration 60 -vbr off audio.ogg

Media Library JSON Schema

Each item in the library (api/media_library.json) follows this schema:

Field	Type	Required	Description
`id`	string	Yes	Unique kebab-case identifier — used by MCP tools to select items
`title`	string	Yes	Human-readable title — shown on device display during playback
`url`	string	Yes	Absolute HTTPS URL to the `.ogg` or `.axv` file
`type`	string	Yes	`"audio"` or `"video"`
`description`	string	Recommended	Used by the AI for search relevance scoring
`tags`	string[]	Recommended	Keywords for search — e.g. `["jazz", "piano", "relaxing"]`
`duration_s`	number	Recommended	Duration in seconds — shown in search results
`series`	string	Optional	Series or album name — included in search text

Add an item via API

curl -X POST https://www.danscodellaro.com/esp32/xiaozhi/media/library \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-track",
    "title": "My Track",
    "description": "An Opus audio file.",
    "type": "audio",
    "tags": ["music"],
    "duration_s": 120,
    "url": "https://www.danscodellaro.com/esp32/xiaozhi/media/files/my-track.ogg"
  }'

Posting an item with an existing id performs an upsert (update). Delete via DELETE /esp32/xiaozhi/media/library/<id>.
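
Before posting, an item can be checked against the schema table locally. The validator below is a hypothetical sketch based only on that table; the server's own validation rules are not documented here and may differ.

```python
import re

KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
EXTENSION_FOR_TYPE = {"audio": ".ogg", "video": ".axv"}
REQUIRED_FIELDS = ("id", "title", "url", "type")


def validate_item(item):
    """Return a list of problems; an empty list means the item looks valid."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS
                if not isinstance(item.get(name), str) or not item[name]]
    if problems:
        return problems
    if not KEBAB_CASE.fullmatch(item["id"]):
        problems.append("id must be kebab-case")
    extension = EXTENSION_FOR_TYPE.get(item["type"])
    if extension is None:
        problems.append('type must be "audio" or "video"')
    elif not item["url"].endswith(extension):
        problems.append(f"url should end in {extension}")
    if not item["url"].startswith("https://"):
        problems.append("url must be an absolute HTTPS URL")
    if "tags" in item and not all(isinstance(t, str) for t in item["tags"]):
        problems.append("tags must be a list of strings")
    return problems


example = {
    "id": "my-track",
    "title": "My Track",
    "description": "An Opus audio file.",
    "type": "audio",
    "tags": ["music"],
    "duration_s": 120,
    "url": "https://www.danscodellaro.com/esp32/xiaozhi/media/files/my-track.ogg",
}
```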

File Location on Server

Audio files	`frontend/esp32/xiaozhi/media/files/*.ogg`
Video files	`frontend/esp32/xiaozhi/media/files/*.axv`
Library JSON	`api/media_library.json`
Muxer script	`scripts/make_axv.py`
Public URL base	`https://www.danscodellaro.com/esp32/xiaozhi/media/files/`

Files in media/files/ are served directly by Caddy as static assets — no Flask involved.

Server Infrastructure

Host: AWS Lightsail Ubuntu (foundry.danscodellaro.com) · Web: Caddy · API: Python Flask (travel-api.service, port 5001)

Path	Purpose
`frontend/esp32/xiaozhi/xiaozhi.bin`	Firmware binary served to device for OTA
`api/server.py`	Flask app — OTA endpoint, telemetry, media library, about
`api/esp32_telemetry.db`	SQLite telemetry DB (heartbeats, boot events, ~5 min interval)
`api/ota_requests.log`	Per-request log of every OTA check with full device JSON
`api/media_library.json`	Media library catalogue (id, title, description, url). Managed via admin API.
`frontend/esp32/xiaozhi/media/files/`	OGG audio + test AXV files — served by Caddy, no auth
`frontend/esp32/xiaozhi/media/video/`	AXV video files organised in subdirectories by title — served by Caddy, no auth

Telemetry Endpoints

POST /esp32/xiaozhi/telemetry	Device heartbeat (every ~5 min)
GET /esp32/xiaozhi/telemetry/latest	Most-recent row — used for status polling
GET /esp32/xiaozhi/telemetry/data?after_id=X	Incremental fetch of rows since `after_id`

Media Library Endpoints

GET /esp32/xiaozhi/media/search?q=…	AI search via MCP tool `self.media.search`
GET /esp32/xiaozhi/media/search/item/<id>	AI resolves stream URL via `self.media.play`
GET /esp32/xiaozhi/media/library	Admin — list full catalogue
POST /esp32/xiaozhi/media/library	Admin — add/upsert item (JSON body)
DELETE /esp32/xiaozhi/media/library/<id>	Admin — remove item

Audio files: OGG Opus, 24 kHz, mono, 60 ms frames (-frame_duration 60). Files encoded with any other frame size produce no audible output on the device.

OTA & About Endpoints

POST /esp32/xiaozhi/ota.php	Device checks for updates on boot and every 60 s while idle. Returns version, binary URL, WebSocket URL, `ws_base_url`, and server time.
GET /esp32/xiaozhi/about	Returns JSON `{"title": "…", "content": "…"}` — used by `self.device.about` MCP tool and the About page.

Firmware Version History

Version	Date	Status	Notes
2.0.4.4	—	archived	Factory firmware. Full flash dump backed up at `C:\Users\dansc\xiaozhi_backup_full.bin`.
2.2.5	2026-04-05	superseded	First custom build — board `zhengchen-1.54tft-wifi-bitcraft`, OTA URL set to this server, Chinese UI. OTA-flashed from 2.0.4.4.
2.2.5 †	2026-04-05	running on device	English UI rebuild (Windows) — `CONFIG_LANGUAGE_EN_US=y`, fixed hardcoded Chinese high-temp label.
2.2.6	2026-04-07	crashes	First Linux build. Hangs after AudioCodec init — never reaches WiFi. Root cause: circular `GetInstance()` deadlock fixed in 2.2.8.
2.2.7	2026-04-07	crashes	Added `http->SetTimeout(300000)`. Same deadlock crash as 2.2.6.
2.2.8	2026-04-08	superseded	Fixed `GetInstance()` deadlock: moved `MediaPlayer::RegisterMcpTools()` from constructor to `AddCommonTools()`. First working Linux build.
2.2.9	2026-04-08	superseded	Fixed media playback silent failure: `stream_url_` cleared by `StopPlayback()` before `StreamTask` ran.
2.2.10	2026-04-08	superseded	Fixed `ReadAll()` deadlock for files >8 KB. Rewrote `StreamTask` to stream via `Read()` chunks into `OggDemuxer`.
2.2.11	2026-04-08	superseded	Fixed stack overflow: `OggDemuxer` moved to heap; task stack increased to 16 384 bytes. Audio confirmed working.
2.2.12	2026-04-08	superseded	Added `MediaPlayer::Stop()`. BOOT button stops media on tap. Used unsafe `vTaskDelete` — replaced in 2.2.13.
2.2.13	2026-04-08	regression	Atomic `stop_requested_` flag. Added wait-for-idle loop — broke audio (output path torn down at idle). Rolled back in 2.2.14.
2.2.14	2026-04-08	superseded	Reverted wait-for-idle; audio works. Atomic stop flag retained.
2.2.15	2026-04-08	superseded	Added Flappy Bird game. Button not responding — `BUTTON_SINGLE_CLICK` debounce too slow. Fixed in 2.2.17.
2.2.16	2026-04-08	superseded	Attempted to suppress AI TTS via empty MCP return — insufficient (AI speaks regardless). Superseded by 2.2.18.
2.2.17	2026-04-08	superseded	Fixed Flappy Bird button: moved to `BUTTON_PRESS_DOWN` (fires immediately). AI speech suppressed after game/media launch.
2.2.18	2026-04-08	superseded	Fixed AI speech overlap: `AbortSpeaking(kAbortReasonNone)` cuts TTS at protocol level. BOOT button stop with `press_consumed_` flag. Graceful game exit via `lv_scr_load_anim`.
2.2.19	2026-04-08	superseded	Flappy Bird: persistent "GAME OVER" overlay; long-press BOOT exits. Long-press WiFi config blocked during game.
2.2.20	2026-04-08	superseded	Added VideoPlayer module: `.axv` container (MJPEG + Opus), 4 MCP tools. Double-buffered JPEG decode to PSRAM. Test video (SMPTE bars) added.
2.2.21–2.2.23	2026-04-09	superseded	Details not documented.
2.2.24	2026-04-09	superseded	Baseline for this session — VideoPlayer functional.
2.2.25	2026-04-10	superseded	Fixed video running ~1.5× too fast: wake word acoustic feedback (speaker → mic) was triggering `ResetDecoder()` during playback, breaking audio backpressure. Fixed by disabling wake word detection during video and guarding `ResetDecoder()` calls in state machine.
2.2.26	2026-04-10	superseded	Added FlappyBird sound effects (flap/score/death via OGG assets). Added `self.device.about` MCP tool — fetches description from `/esp32/xiaozhi/about` and returns it to AI. Added public `/esp32/xiaozhi/about.html` page.
2.2.27	2026-04-10	superseded	Fixed generic media playback path so video items resolved by `self.media.search` are delegated to `VideoPlayer` instead of the audio-only OGG demuxer.
2.2.28	2026-04-10	superseded	Split video playback into separate reader and renderer tasks, added audio prebuffering, and started dropping stale video frames based on container timestamps instead of blocking audio.
2.2.29	2026-04-10	superseded	Ends active voice sessions when video starts and ignores late conversational events while media is active, preventing the AI from interrupting playback with retry speech.
2.2.30	2026-04-10	superseded	Removes per-frame JPEG output allocation/copy churn by decoding directly into reusable PSRAM framebuffers. Audio stayed solid; video remained decode-bound.
2.2.31	2026-04-11	superseded	Added playback-end diagnostics: HTTP read stalls, audio queue block time, late-frame stats, deeper audio buffering, and higher stream-reader priority.
2.2.32	2026-04-11	superseded	Streams playback telemetry live every 5 seconds during active video and surfaces the detailed diagnostics on this dashboard. Designed to distinguish stream transport stalls from local decode/render limits while playback is still running.
2.2.33	2026-04-11	regression	Introduced ranged video fetching and timeline resync, but startup failed before the first packet was delivered. Video items resolved correctly from the library, but playback exited after about 1.4 seconds with zero rendered frames.
2.2.34	2026-04-11	regression	Attempted a startup fallback to the legacy stream when the first `Range` request was ignored, but telemetry showed the same early-exit failure pattern as 2.2.33. The ranged transport remained broken on-device.
2.2.35	2026-04-12	superseded	Removed the ranged transport path and restored the known-good continuous HTTP streaming reader from 2.2.32 while keeping the live playback telemetry. This recovery build made video work again after the 2.2.33/2.2.34 transport regression.
2.2.36	2026-04-12	superseded	Expanded video telemetry into a full pipeline profile: startup latency, HTTP read timing, packet copy cost, queue overflow vs backlog drops, JPEG decode timing, frame-present timing, frame-age metrics, and explicit playback end reason. That data showed the renderer was oversleeping on far-future frames.
2.2.37	2026-04-12	superseded	Reworked video frame selection and pacing. The renderer now trims stale frames, selects the newest frame that is actually due, preserves future frames in the queue, and rechecks pacing in short intervals instead of sleeping on far-future frames.
2.2.38	2026-04-12	superseded	Added resumable video transport and new resume counters. Playback did attempt a resume after an early stream close, but the device’s range request was malformed and the server returned HTTP 416.
2.2.39	2026-04-12	superseded	Fixes the resume request formatting. The firmware now builds the `Range` header using a plain decimal string instead of printf-style formatting, which avoids malformed resume requests on the embedded libc build.
2.2.40–2.2.74	2026-04–06	superseded	Extensive video-pipeline iteration: ranged HTTP, sync_v1 batched transport, then full WebSocket streaming (`media/ws/<id>`) for video. Server-side pacing (`_WS_LEAD_MS = 10000`) keeps the device ~10 s ahead of audio_clock for jitter absorption. NVS-deferred end-of-playback summaries to avoid HttpClient lifecycle bugs.
2.2.75–2.2.82	2026-06-03/04	superseded	Patched managed-component `tcp_receive` and `ssl_receive` task priorities from 1 → 6 to stop them being starved by the video tasks. Coredump partition added to capture panics. Persistent telemetry WebSocket so per-event POSTs no longer instantiate HttpClient and trip the lifetime races. End-of-playback summary now sent via WS immediately (NVS only as fallback when WS isn't up).
2.2.83–2.2.85	2026-06-04	superseded	Restructured WS receive: dispatch happens inline on the receive task, eliminating the intermediate `raw_queue` whose auto-drop policy was silently discarding the OLDEST frames (the ones the renderer needed). Pinned `video_stream`/`video_render` to core 0/1 respectively. Pinned audio: `opus_codec`+`audio_output`+`audio_input`+`audio_detection`+`audio_communication` and both AFE-internal tasks to core 0. Dynamic `ws_base_url` via OTA response so server IP changes propagate without re-flash. Auto-OTA poll every 60 s when idle.
2.2.86–2.2.87	2026-06-04	superseded	Added per-core CPU stats to the playback summary (`cpu_core0_busy_pct`, `cpu_core1_busy_pct`, `cpu_top_task_name`, `cpu_top_task_pct`, `cpu_top_task_core`). Enabled `CONFIG_FREERTOS_VTASKLIST_INCLUDE_COREID=y`. Initial implementation reported since-boot stats; corrected to per-playback-window delta. Build-deploy verification: the script now extracts the version string from the binary and aborts if it doesn't match `CMakeLists.txt`, after a 2.2.86 ship of the wrong binary caused a download-reboot loop.
2.2.88	2026-06-04	superseded	Big fix. The render task was spin-looping at 34 591 wakeups/sec when frames were not yet due. The "future_wait_us" branch did `continue` before reaching the `vTaskDelay` further down the loop body — sleep code unreachable. Result: core 1 sat at 100% even on 8 fps content, and LVGL got starved of CPU which is why visual playback didn't match the rendered-frame counter. Moved the delay inside the same branch as the continue. Core 1 dropped from 100% → 33% at 8 fps.
2.2.89	2026-06-05	superseded	Bumped `kMaxQueuedVideoFrames` 120 → 320 to cover the server's 10 s lead at 24 fps source rate. Verified 24 fps playback: R = 23.40 fps rendered (97.6% of 23.976 source), only 55 backlog drops over 137 s, 0 queue overflow, 0 audio underruns, core 1 at 81 %, core 0 at 27 %. Sustained ~245 KB/s network throughput. Top task `taskLVGL` at 43 % on core 1.
2.3.00–2.3.07	2026-06	superseded	Audio-path hardening (PSRAM task stacks, AFE VAD blocking, `SendStartListening`), VAD auto-stop + 60 s idle update poll, and the start of the DOOM (prboom) port under a gitignored `components/` directory with the WAD in the assets partition.
2.3.08–2.3.09	2026-06-22	superseded	ESP coredump → boot-telemetry → erase; per-core CPU stats; `ws_base_url` NVS cache; audio tasks pinned to core 0; video sync-playback rework. 2.3.09 switched the device's hardcoded URLs (media search, telemetry, about) from `https://www.danscodellaro.com` to `http://iot.danscodellaro.com` (Cloudflare-bypass plaintext origin).
2.3.10–2.3.12	2026-06-22	superseded	Root-caused video + DOOM launch failures: PSRAM task stacks. The 32 KB video stream task and the DOOM engine task were created with internal-SRAM stacks, which silently fail once AFE audio + LVGL are loaded. Moved both to PSRAM via `xTaskCreatePinnedToCoreWithCaps` (+ `vTaskDeleteWithCaps`). Video confirmed playing on hardware; DOOM now launches.
2.3.13–2.3.15	2026-06-22	superseded	DOOM display fixes (horizontal-wrap LUT overflow, top-bar flicker, palette[255]), `%zu` logging fix (newlib-nano drops `%z`) → counters now print, and internal-SRAM relief during playback.
2.3.16–2.3.17	2026-06-22	superseded	DOOM exit-freeze deadlock fixed (cross-thread LVGL lock → `lvgl_port_stop()`/`resume()`); higher-quality video encode; DOOM SFX audio added from scratch (DMX lump parser + ≤8-channel mixer pushed to the codec) with a `SetOutputKeepAlive` fix so the power-save loop no longer mutes it. DOOM audio confirmed working by ear.
2.3.18	2026-06-22	current	Fixed two regressions: reverted the served video to baseline 4:2:0 JPEG (the device's `esp_new_jpeg` decoder rejects 4:4:4) and reordered DOOM start to take the display lock before `lvgl_port_stop()`, eliminating a post-DOOM `main`-task TWDT hang. Current shipping firmware: English voice chat (Bedrock claude-haiku via self-hosted backend), local media search + Simpsons video over WebSocket, Flappy Bird, and DOOM (attract mode, with SFX).

† 2.2.5 "running on device" is the Windows English rebuild — different binary from the first Linux 2.2.5 build.

Known Issues

Issue	Status	Details
Display stuck on "Initializing…"	open	`HandleStateChangedEvent()` in `application.cc` has no handlers for `kDeviceStateStarting`, `kDeviceStateActivating`, or `kDeviceStateUpgrading` — display never updates from initial state during these phases. Device works correctly; display-only bug.
Video playback frame rate remains low	fixed v2.2.88	Resolved via a chain of fixes: `tcp_receive` priority bump (2.2.82), `raw_queue` elimination + task pinning + opus codec on core 0 (2.2.83–2.2.85), render-task spin-loop sleep fix (2.2.88), then queue size bump for 24 fps source (2.2.89). The device now sustains 23.4 of 23.976 source fps with 0 audio underruns. The current ceiling on this hardware is around 30 fps (LVGL becomes the bottleneck).
AI speaks over media for first few seconds	fixed v2.2.18	`AbortSpeaking(kAbortReasonNone)` cuts TTS at protocol level. Earlier attempt (returning `""` from MCP tool) was insufficient — AI generates spoken response regardless.
Device can't hear "stop" while media plays	by design	AEC is disabled during media playback — microphone effectively blocked. Use BOOT button (single tap) or `self.media.stop` MCP tool before playback starts.
Wake word is Chinese	by design	"你好小智" baked into WakeNet9 model in assets partition. Changing UI language does not change wake word. New wake word requires ESP-SR pipeline retraining + full USB flash.
Assets not updated by OTA	by design	8 MB LittleFS assets partition (fonts, icons, OGG sounds) is separate from OTA slots. Changes require full USB flash.

AutoStop	VAD detects silence → stops automatically
ManualStop	Hold button to talk, release to stop
Realtime	Continuous streaming to server

Binary frame	`version(2) | type(2) | reserved(4) | timestamp(4) | payload_size(4) | payload`
type 0	Opus audio frame
type 1	JSON message
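
The 16-byte chat frame header above maps naturally onto `struct`. The sketch below assumes network (big-endian) byte order and a default `version` value of 2; neither is stated in the table, so treat both as assumptions. The helper names are hypothetical.

```python
import struct

# version(2) | type(2) | reserved(4) | timestamp(4) | payload_size(4)
WS_HEADER = struct.Struct(">HHIII")   # big-endian is an assumption
TYPE_OPUS, TYPE_JSON = 0, 1


def pack_ws_frame(frame_type, timestamp_ms, payload, version=2):
    """Build one binary frame; the default version number is an assumed value."""
    header = WS_HEADER.pack(version, frame_type, 0, timestamp_ms, len(payload))
    return header + payload


def unpack_ws_frame(frame):
    """Split a binary frame into (version, type, timestamp_ms, payload)."""
    version, frame_type, _reserved, timestamp_ms, size = WS_HEADER.unpack_from(frame)
    payload = frame[WS_HEADER.size:WS_HEADER.size + size]
    if len(payload) != size:
        raise ValueError("truncated frame")
    return version, frame_type, timestamp_ms, payload
```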

wake_word_detected	Wake word heard
start_listening	Begin capture, includes mode
stop_listening	Stop capture
abort_speaking	Interrupt TTS playback
mcp	MCP tool call pass-through

self.audio_speaker.set_volume	Set device volume 0–100
self.get_device_status	Returns firmware, battery, IP, uptime
self.media.* / self.video.*	Search / play / stop / status for audio and video
self.doom.*	start / stop / status for DOOM (attract mode, with SFX)

SCHEDULE	Deferred task scheduling
SEND_AUDIO	Audio ready to send
WAKE_WORD_DETECTED	Wake word heard
VAD_CHANGE	Voice activity state change
TOGGLE_CHAT	Button press event
NETWORK_CONNECTED	WiFi connected
CLOCK_TICK	1-second timer