Moxin Voice
Real-Time Translation & Voice Synthesis in Pure Rust

Live bilingual translation, GPU-accelerated text-to-speech, and zero-shot voice cloning — all running locally on Apple Silicon. Built entirely in Rust with zero Python dependencies.

Live demo: deploying at GOSIM Paris 2026 — live translation for conference talks

Explore on GitHub

Open source under Apache 2.0. Clone, build, and run locally.

github.com/moxin-org/Moxin-Voice

2.3x real-time TTS synthesis

9 preset voices (4 languages)

0s training for voice cloning

Benchmarked on Apple M3 Max • Qwen3-TTS 1.7B 8-bit quantized

Core Capabilities

Flagship Feature

Live Translation

Real-time bilingual subtitles — deploying at GOSIM Paris 2026

A real-time bilingual subtitle overlay that captures any audio source and produces live translated subtitles in a floating window. Designed for conferences, meetings, and live events where language barriers matter.

ASR Pipeline

Powered by Qwen3-ASR-1.7B (8-bit quantized), running entirely in Rust via OminiX MLX on Metal GPU. VAD-segmented for accurate sentence boundaries with real-time streaming output.
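The VAD step above can be illustrated with a toy energy-based segmenter in plain Rust. This is a sketch only: the function name, frame size, and threshold are hypothetical and do not reflect the actual VAD model used in the pipeline.

```rust
/// Hypothetical energy-based VAD sketch: splits a mono f32 sample stream
/// into speech segments wherever frame RMS energy stays below a threshold
/// for a run of consecutive frames. Returns (start, end) sample indices.
fn segment_speech(
    samples: &[f32],
    frame_len: usize,
    threshold: f32,
    min_silence_frames: usize,
) -> Vec<(usize, usize)> {
    let mut segments = Vec::new();
    let mut seg_start: Option<usize> = None;
    let mut silent_run = 0;
    for (i, frame) in samples.chunks(frame_len).enumerate() {
        // Root-mean-square energy of this frame.
        let rms = (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt();
        if rms >= threshold {
            silent_run = 0;
            if seg_start.is_none() {
                seg_start = Some(i * frame_len);
            }
        } else if let Some(start) = seg_start {
            silent_run += 1;
            if silent_run >= min_silence_frames {
                // Close the segment at the start of the silence run.
                let end = (i + 1 - silent_run) * frame_len;
                segments.push((start, end));
                seg_start = None;
                silent_run = 0;
            }
        }
    }
    // Speech running to the end of the buffer becomes a final segment.
    if let Some(start) = seg_start {
        segments.push((start, samples.len()));
    }
    segments
}
```

A production VAD uses a trained model rather than raw energy, but the segment-on-sustained-silence structure is the same idea that yields sentence-aligned chunks for the ASR model.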

Translation Engine

Rolling translation with context-aware windowing for coherent output. Bilingual subtitle pairs (e.g. Chinese ↔ English) render in a native overlay window that stays above all apps.
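The rolling, context-aware windowing can be sketched like this. `RollingTranslator` and its `translate` closure are illustrative stand-ins, not the project's real API; the point is that each new sentence is translated together with the last few sentences so pronouns and terminology stay coherent.

```rust
use std::collections::VecDeque;

/// Sketch of a rolling translation window (names are hypothetical).
/// The `translate` closure stands in for the real translation model.
struct RollingTranslator<F: Fn(&str) -> String> {
    context: VecDeque<String>,
    max_context: usize,
    translate: F,
}

impl<F: Fn(&str) -> String> RollingTranslator<F> {
    fn new(max_context: usize, translate: F) -> Self {
        Self { context: VecDeque::new(), max_context, translate }
    }

    /// Feed one recognized sentence; returns the (source, translation)
    /// subtitle pair for the overlay window.
    fn push(&mut self, sentence: &str) -> (String, String) {
        // Build the model input from the rolling window plus the new sentence.
        let mut window: Vec<&str> = self.context.iter().map(String::as_str).collect();
        window.push(sentence);
        let translated = (self.translate)(&window.join(" "));
        // Slide the window forward, dropping the oldest sentence when full.
        if self.context.len() == self.max_context {
            self.context.pop_front();
        }
        self.context.push_back(sentence.to_string());
        (sentence.to_string(), translated)
    }
}
```

The bounded `VecDeque` keeps prompt size constant during an hours-long talk while still giving the model enough history for coherent output.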

Audio Capture

Captures system audio directly via macOS ScreenCaptureKit — no virtual audio drivers, no kernel extensions. Also supports microphone input for in-person scenarios.

GOSIM Paris 2026

Being deployed for live conference translation at GOSIM Paris 2026, providing real-time bilingual subtitles for keynotes and technical sessions to a multilingual audience.

Qwen3-ASR-1.7B ScreenCaptureKit Floating Overlay VAD Segmentation OminiX MLX Zero Python

Text-to-Speech

9 preset voices, 4 languages

High-quality neural speech synthesis powered by Qwen3-TTS (1.7B parameters, 8-bit quantized) running on Metal GPU via OminiX MLX. 2.3x real-time synthesis speed. Supports Chinese, English, Japanese, and Korean with distinct voice characters.

Qwen3-TTS-1.7B Chinese English Japanese Korean WAV Export
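For reference, "2.3x real-time" means the engine produces audio 2.3 times faster than it takes to play back. A minimal sketch of that calculation, assuming a 24 kHz output rate for illustration:

```rust
/// Real-time speed factor: seconds of audio produced per second of
/// wall-clock synthesis time. A value above 1.0 means faster than playback.
fn realtime_speed(num_samples: usize, sample_rate: u32, synthesis_secs: f64) -> f64 {
    let audio_secs = num_samples as f64 / sample_rate as f64;
    audio_secs / synthesis_secs
}
```

For example, synthesizing 23 seconds of 24 kHz audio in 10 seconds of wall-clock time gives a factor of 2.3.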

Zero-Shot Voice Cloning

Clone any voice in seconds

Record or upload 5-30 seconds of reference audio and clone any voice instantly using In-Context Learning (ICL Express mode). Uses Qwen3-TTS-Base (1.7B, 8-bit) for synthesis and Qwen3-ASR (1.7B, 8-bit) for automatic reference audio transcription. No training, no fine-tuning, no cloud upload.

5-30s Reference Audio ICL Express No Training Fully Local
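The 5-30 second reference-audio constraint can be sketched as a simple validation step. The `check_reference` function and its error messages are hypothetical; only the duration bounds come from the description above.

```rust
use std::time::Duration;

/// Hypothetical check that reference audio for ICL voice cloning falls in
/// the documented 5-30 second range. Duration is derived from sample count.
fn check_reference(num_samples: usize, sample_rate: u32) -> Result<Duration, String> {
    let secs = num_samples as f64 / sample_rate as f64;
    if secs < 5.0 {
        Err(format!("reference too short: {secs:.1}s (need at least 5s)"))
    } else if secs > 30.0 {
        Err(format!("reference too long: {secs:.1}s (max 30s)"))
    } else {
        Ok(Duration::from_secs_f64(secs))
    }
}
```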

Built on the Moxin Stack

Makepad

GPU-accelerated UI framework, pure Rust

OminiX MLX

Rust ML inference on Apple Metal GPU

DORA

Dataflow orchestration for voice pipelines

Qwen3-TTS 1.7B

Speech synthesis, 8-bit quantized

Qwen3-ASR 1.7B

Speech recognition, 8-bit quantized

CPAL

Cross-platform audio I/O in Rust

How It Compares

Leading TTS platforms like ElevenLabs and MiniMax offer powerful cloud APIs. Moxin Voice takes a fundamentally different approach: everything runs locally, in pure Rust, on your hardware.

|                   | Moxin Voice          | ElevenLabs             | MiniMax TTS       |
|-------------------|----------------------|------------------------|-------------------|
| Runtime           | 100% local           | Cloud API              | Cloud API         |
| Language          | Pure Rust            | Python (server)        | Python (server)   |
| Data privacy      | Never leaves device  | Uploaded to cloud      | Uploaded to cloud |
| Live translation  | Built-in, on-device  | Dubbing Studio (cloud) | Not available     |
| Voice cloning     | Zero-shot, on-device | Instant / Professional | API-based         |
| Pricing           | Free & open source   | $5-$330/mo             | Pay-per-character |
| Latency           | Instant (no network) | Network-dependent      | Network-dependent |
| Internet required | No                   | Yes                    | Yes               |
| Memory safety     | Rust ownership model | N/A (cloud)            | N/A (cloud)       |
| Source code       | Apache 2.0           | Proprietary            | Proprietary       |

Why Pure Rust Matters

Zero Python, Zero Overhead

No Python runtime, no virtualenv, no pip conflicts. The entire TTS pipeline — from UI to GPU inference — compiles to a single native binary. No GIL, no garbage collection pauses during synthesis.

Memory Safety at the GPU Boundary

Rust's ownership model extends into Metal GPU operations via OminiX MLX. Buffer lifetimes, tensor shapes, and kernel dispatch are all checked at compile time — eliminating crashes from dangling pointers or buffer overflows.

Native Performance

2.3x real-time synthesis speed on Apple Silicon. No interpreter overhead, no FFI marshaling between Python and C. The audio pipeline runs at native speed from text input to speaker output.

Ship as a Single Binary

cargo build --release produces one self-contained executable. No Docker, no conda environments, no system-wide library conflicts. Just download, build, and run.

Voice Library

ZH · Vivian: bright, slightly edgy young female

ZH · Serena: warm, gentle young female

ZH · Uncle Fu: low, mellow seasoned male

ZH · Dylan: clear young male with a Beijing accent

EN · Ryan: dynamic male with rhythmic drive

EN · Aiden: sunny American male

JA · Ono Anna: playful Japanese female

KO · Sohee: warm Korean female

ICL · Your Voice: clone any voice with 5-30s of audio

Quick Start

Requires macOS 14.0+ (Sonoma) • Apple Silicon (M1/M2/M3/M4) • Rust 1.82+ • Dora CLI

# Install Dora CLI
cargo install dora-cli --locked

# Clone the repository
git clone https://github.com/moxin-org/Moxin-Voice.git
cd Moxin-Voice

# Download TTS models (~3.5 GB)
python3 scripts/download_models.py

# Build and run
cargo build --release
dora up && dora start apps/moxin-voice/dataflow/tts.yml