Moxin Voice
Real-Time Translation & Voice Synthesis in Pure Rust
Live bilingual translation, GPU-accelerated text-to-speech, and zero-shot voice cloning — all running locally on Apple Silicon. Built entirely in Rust with zero Python dependencies.
Explore on GitHub
Open source under Apache 2.0. Clone, build, and run locally.
github.com/moxin-org/Moxin-Voice
2.3x
Real-time TTS Synthesis
9
Preset Voices (4 Languages)
0s
Training for Voice Cloning
Benchmarked on Apple M3 Max • Qwen3-TTS 1.7B 8-bit quantized
Core Capabilities
Live Translation
Real-time bilingual subtitles — deploying at GOSIM Paris 2026
A real-time bilingual subtitle overlay that captures any audio source and produces live translated subtitles in a floating window. Designed for conferences, meetings, and live events where language barriers matter.
ASR Pipeline
Powered by Qwen3-ASR-1.7B (8-bit quantized), running entirely in Rust via OminiX MLX on Metal GPU. VAD-segmented for accurate sentence boundaries with real-time streaming output.
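To make the idea of VAD segmentation concrete, here is a minimal sketch of an energy-based voice activity detector. It is not the detector Moxin Voice ships: the function name `segment_speech`, its parameters, and the mean-energy threshold rule are all assumptions chosen for illustration, and a production VAD is likely model-based.

```rust
/// Hypothetical energy-based voice activity detector (illustration only,
/// not the project's actual VAD). Returns (start, end) sample ranges
/// that are judged to contain speech.
pub fn segment_speech(
    samples: &[f32],
    frame_len: usize,
    energy_threshold: f32,
    max_silence_frames: usize,
) -> Vec<(usize, usize)> {
    let mut segments = Vec::new();
    if frame_len == 0 {
        return segments;
    }
    let mut start: Option<usize> = None; // first sample of the open segment
    let mut last_voiced_end = 0usize; // end of the most recent voiced frame
    let mut silence_run = 0usize; // consecutive silent frames inside a segment

    for (i, frame) in samples.chunks(frame_len).enumerate() {
        let frame_start = i * frame_len;
        let frame_end = frame_start + frame.len();
        // Mean squared amplitude of the frame.
        let energy = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;

        if energy >= energy_threshold {
            if start.is_none() {
                start = Some(frame_start);
            }
            last_voiced_end = frame_end;
            silence_run = 0;
        } else if let Some(s) = start {
            silence_run += 1;
            // Close the segment once the pause is long enough.
            if silence_run > max_silence_frames {
                segments.push((s, last_voiced_end));
                start = None;
                silence_run = 0;
            }
        }
    }
    // Flush a segment that is still open at the end of the buffer.
    if let Some(s) = start {
        segments.push((s, last_voiced_end));
    }
    segments
}
```

Each returned range would then be handed to the recognizer as one utterance, which is how pauses in speech become sentence boundaries in this sketch.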
Translation Engine
Rolling translation with context-aware windowing for coherent output. Bilingual subtitle pairs (e.g. Chinese ↔ English) render in a native overlay window that stays above all apps.
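One way to picture rolling translation with a context window is a fixed-size queue of recent sentence pairs that is prepended to each new sentence. The sketch below assumes that design; `RollingContext`, its methods, and the `source => translation` prompt format are hypothetical and do not describe the project's actual translation engine.

```rust
use std::collections::VecDeque;

/// Hypothetical rolling context buffer for sentence-by-sentence translation.
pub struct RollingContext {
    max_sentences: usize,
    history: VecDeque<(String, String)>, // (source, translation) pairs
}

impl RollingContext {
    pub fn new(max_sentences: usize) -> Self {
        Self { max_sentences, history: VecDeque::new() }
    }

    /// Record a finished pair, evicting the oldest once the window is full.
    pub fn push(&mut self, source: &str, translation: &str) {
        if self.max_sentences == 0 {
            return;
        }
        if self.history.len() == self.max_sentences {
            self.history.pop_front();
        }
        self.history.push_back((source.to_owned(), translation.to_owned()));
    }

    /// Build the text handed to the translator: prior pairs, then the new sentence.
    pub fn build_prompt(&self, next_source: &str) -> String {
        let mut prompt = String::new();
        for (src, dst) in &self.history {
            prompt.push_str(&format!("{src} => {dst}\n"));
        }
        prompt.push_str(next_source);
        prompt
    }
}
```

Keeping only the last few pairs bounds the prompt length while still giving the translator enough context to keep terminology and pronouns consistent across subtitles.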
Audio Capture
Captures system audio directly via macOS ScreenCaptureKit — no virtual audio drivers, no kernel extensions. Also supports microphone input for in-person scenarios.
GOSIM Paris 2026
Being deployed for live conference translation at GOSIM Paris 2026, providing real-time bilingual subtitles for keynotes and technical sessions to a multilingual audience.
Text-to-Speech
9 preset voices, 4 languages
High-quality neural speech synthesis powered by Qwen3-TTS (1.7B parameters, 8-bit quantized) running on Metal GPU via OminiX MLX. Synthesis runs at 2.3x real-time speed (benchmarked on an Apple M3 Max). Supports Chinese, English, Japanese, and Korean with distinct voice characters.
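The 2.3x figure is easier to reason about as arithmetic. The sketch below assumes that "2.3x real-time" means audio is generated 2.3 times faster than it plays back; the function names are illustrative and not part of any Moxin Voice API.

```rust
/// Time needed to synthesize `audio_secs` of speech at a given speed-up
/// over real time (2.3 is the figure quoted for an M3 Max).
pub fn synthesis_time_secs(audio_secs: f64, speedup: f64) -> f64 {
    audio_secs / speedup
}

/// Real-time factor (RTF): compute time divided by audio duration.
/// Values below 1.0 mean synthesis outpaces playback.
pub fn real_time_factor(speedup: f64) -> f64 {
    1.0 / speedup
}
```

Under that assumption, 10 seconds of speech takes roughly 4.35 seconds to generate, which corresponds to a real-time factor of about 0.43.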
Zero-Shot Voice Cloning
Clone any voice in seconds
Record or upload 5-30 seconds of reference audio and clone any voice instantly using In-Context Learning (ICL Express mode). Uses Qwen3-TTS-Base (1.7B, 8-bit) for synthesis and Qwen3-ASR (1.7B, 8-bit) for automatic reference audio transcription. No training, no fine-tuning, no cloud upload.
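The 5-30 second requirement can be checked before any model runs. The following is a minimal sketch of such a check, assuming mono PCM input; `check_reference_clip` and `RefClipError` are hypothetical names, not the project's actual validation code.

```rust
/// Bounds quoted in the text: 5 to 30 seconds of reference audio.
pub const MIN_REF_SECS: f64 = 5.0;
pub const MAX_REF_SECS: f64 = 30.0;

#[derive(Debug, PartialEq)]
pub enum RefClipError {
    TooShort(f64),
    TooLong(f64),
    InvalidSampleRate,
}

/// Hypothetical pre-flight check for a mono reference clip.
/// Returns the clip duration in seconds when it falls inside the bounds.
pub fn check_reference_clip(
    num_samples: usize,
    sample_rate_hz: u32,
) -> Result<f64, RefClipError> {
    if sample_rate_hz == 0 {
        return Err(RefClipError::InvalidSampleRate);
    }
    let secs = num_samples as f64 / sample_rate_hz as f64;
    if secs < MIN_REF_SECS {
        Err(RefClipError::TooShort(secs))
    } else if secs > MAX_REF_SECS {
        Err(RefClipError::TooLong(secs))
    } else {
        Ok(secs)
    }
}
```

Rejecting out-of-range clips up front avoids spending GPU time on transcription and synthesis for audio that is unlikely to clone well.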
Built on the Moxin Stack
GPU-accelerated UI framework, pure Rust
Rust ML inference on Apple Metal GPU
Dataflow orchestration for voice pipelines
Speech synthesis, 8-bit quantized
Speech recognition, 8-bit quantized
Cross-platform audio I/O in Rust
How It Compares
Leading TTS platforms like ElevenLabs and MiniMax offer powerful cloud APIs. Moxin Voice takes a fundamentally different approach: everything runs locally, in pure Rust, on your hardware.
| | Moxin Voice | ElevenLabs | MiniMax TTS |
|---|---|---|---|
| Runtime | 100% Local | Cloud API | Cloud API |
| Language | Pure Rust | Python (server) | Python (server) |
| Data Privacy | Never leaves device | Uploaded to cloud | Uploaded to cloud |
| Live Translation | Built-in, on-device | Dubbing Studio (cloud) | Not available |
| Voice Cloning | Zero-shot, on-device | Instant / Professional | API-based |
| Pricing | Free & Open Source | $5-$330/mo | Pay-per-character |
| Latency | No network round trip | Network-dependent | Network-dependent |
| Internet Required | No | Yes | Yes |
| Memory Safety | Rust ownership model | N/A (cloud) | N/A (cloud) |
| Source Code | Apache 2.0 | Proprietary | Proprietary |
Why Pure Rust Matters
Zero Python, Zero Overhead
No Python runtime, no virtualenv, no pip conflicts. The entire TTS pipeline — from UI to GPU inference — compiles to a single native binary. No GIL, no garbage collection pauses during synthesis.
Memory Safety at the GPU Boundary
Rust's ownership model extends into Metal GPU operations via OminiX MLX. Buffer lifetimes, tensor shapes, and kernel dispatch are all checked at compile time — eliminating crashes from dangling pointers or buffer overflows.
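As an illustration of what compile-time shape and lifetime checking can look like in Rust, the sketch below encodes matrix dimensions as const generics and moves buffers by value. It is a CPU-only toy: the `Matrix` type and `consume` function are hypothetical and are not the OminiX MLX API.

```rust
/// Hypothetical matrix type whose dimensions are part of the type.
pub struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>, // row-major, length R * C
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    pub fn filled(value: f32) -> Self {
        Self { data: vec![value; R * C] }
    }

    pub fn get(&self, row: usize, col: usize) -> f32 {
        self.data[row * C + col]
    }

    /// Only compiles when the inner dimensions agree: (R x C) * (C x K).
    pub fn matmul<const K: usize>(&self, rhs: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = vec![0.0f32; R * K];
        for i in 0..R {
            for j in 0..K {
                for k in 0..C {
                    out[i * K + j] += self.data[i * C + k] * rhs.data[k * K + j];
                }
            }
        }
        Matrix { data: out }
    }
}

/// Taking the matrix by value moves ownership into this function,
/// so the caller cannot use the buffer after the call.
pub fn consume<const R: usize, const C: usize>(m: Matrix<R, C>) -> usize {
    m.data.len()
}
```

With these types, multiplying a `Matrix<2, 3>` by a `Matrix<4, 4>` is rejected by the compiler, and using a matrix after passing it to `consume` is a borrow-checker error rather than a runtime crash.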
Native Performance
2.3x real-time synthesis speed on Apple Silicon. No interpreter overhead, no FFI marshaling between Python and C. The audio pipeline runs at native speed from text input to speaker output.
Ship as a Single Binary
cargo build --release produces one self-contained executable. No Docker, no conda environments, no system-wide library conflicts. Just download, build, and run.
Voice Library
Bright, slightly edgy young female
Warm, gentle young female
Low, mellow seasoned male
Clear Beijing young male
Dynamic male with rhythmic drive
Sunny American male
Playful Japanese female
Warm Korean female
Clone any voice with 5-30s of audio
Quick Start
Requires macOS 14.0+ (Sonoma) • Apple Silicon (M1/M2/M3/M4) • Rust 1.82+ • Dora CLI • Python 3 (used only by the model download script)
```shell
# Install Dora CLI
cargo install dora-cli --locked

# Clone the repository
git clone https://github.com/moxin-org/Moxin-Voice.git
cd Moxin-Voice

# Download TTS models (~3.5 GB)
python3 scripts/download_models.py

# Build and run
cargo build --release
dora up && dora start apps/moxin-voice/dataflow/tts.yml
```