
Clone voices, generate speech across seven TTS engines, dictate into any app, and talk to agents in voices you own. A free and local alternative to ElevenLabs and WisprFlow, running entirely on your machine.
macOS, Windows, Linux
Alexander AI Voice is open-source and used by creators, voice artists, podcasters, writers, developers, accessibility users, and curious humans all over the world. Sponsor the project and your logo lands on the homepage, in the app, in the README, and on the sponsors page — in front of every one of them.
Everything you need to clone voices, generate speech, and produce multi-voice content — running entirely on your machine.
Multiple TTS engines for exceptional voice quality. Clone any voice from a few seconds of audio with natural intonation and emotion.
Create multi-voice narratives with a timeline-based editor. Arrange tracks, trim clips, and mix conversations between characters.
Apply pitch shift, reverb, delay, compression, and more — then save as presets. Preview effects live and set defaults per voice profile.
Run GPU inference locally with Metal, CUDA, ROCm, Intel Arc, or DirectML — or connect to a remote machine. One-click server setup with automatic discovery.
Powered by Whisper for accurate speech-to-text. Automatically extract reference text from voice samples.
Generate up to 50,000 characters in one go. Text is auto-split at sentence boundaries, generated per-chunk, and crossfaded seamlessly.
Three ways to get a sample in. Upload a clip, record from your microphone, or capture audio playing on your system. Alexander AI Voice clones the voice from as little as 3 seconds of audio.
Click to record from your microphone.
Maximum duration: 30 seconds.
Hold a shortcut anywhere on your machine, speak, release. The transcript lands in a focused text field in any app, or your clipboard. Agents speak back through the same pill in any cloned voice.
Base, Small, Medium, Large, and Turbo. Pick the size that fits your hardware and quality bar — 99 languages across every tier, all running locally.
A local LLM cleans ums, self-corrections, and punctuation without rephrasing. Optional, toggleable, and never leaves your machine.
Any MCP-aware agent — Claude Code, Cursor, Cline — gets a voice with one tool call. The pill surfaces when an agent is speaking, so you always see what’s coming out of your machine.
One tool call — voicebox.speak— and any MCP-aware agent can talk to you in a voice you’ve cloned. Claude Code, Cursor, Cline, or anything that speaks MCP.
{
"mcpServers": {
"voicebox": {
"url": "http://127.0.0.1:17493/mcp"
}
}
}// In any MCP-aware agent:
await voicebox.speak({
text: "Deploy complete.",
profile: "Morgan",
})POST /speakfor anything that doesn’t speak MCP — ACP, A2A, shell scripts, or custom harnesses.Bind each MCP client to a voice profile. Claude Code in Morgan, Cursor in Scarlett — you know which agent is talking without looking.
Every agent-initiated speech surfaces the pill. No silent background TTS — you always see what’s coming out of your machine.
MCP ships day one. ACP, A2A, and anything else built on a tool-call primitive slots into the same endpoint.
Give any voice profile a free-form personality. Then Rewrite your text in their voice, or let them Compose a fresh line of their own — your cloned voice, in full character.
“1940s noir detective. World-weary, cynical, every situation a metaphor for the city's underbelly. Talks like he's seen one stack trace too many.”
Restate your text in their voice while preserving every idea. Same content, their delivery — for scripts, dubs, and consistent character voice across long-form work.
No input needed — hit the button and the character improvises a fresh line of their own. Roll again for another take. Useful for game dialogue, narration cues, or character barks.
Every engine you download becomes a REST endpoint on your machine. Build apps, games, and voice tools with full programmatic control — no API keys, no rate limits, no per-character fees.
http://127.0.0.1:17493/generateGenerate speech/generate/{id}/cancelCancel a generation/profilesList voice profiles/profilesCreate a new profile/models/statusModel catalog & state/historyPast generations/healthServer healthcurl -X POST http://127.0.0.1:17493/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Welcome to the game, player one.",
"profile_id": "b3f1c2d4-5e6f-4a7b-8c9d-0e1f2a3b4c5d",
"engine": "qwen_custom_voice",
"instruct": "warm, slow, cinematic"
}' \
--output line.wavGenerate NPC dialogue on the fly, localize characters into new languages, or ship expressive voice lines without a studio.
Give your app or AI agent a voice. Real-time narration, accessibility readouts, voice replies — all running on the user's machine.
Batch-generate audiobook chapters, automate podcast intros, or wire Alexander AI Voice into your Stream Deck. It's just a localhost URL.
Walkthroughs from the community covering setup, voice cloning, and production workflows.

Kevin Stratvert

Julian Goldie SEO

Dave Swift

Tech指南

StinkyScrublet

mikbes
Pick the right model for every job — TTS, transcription, refinement. All models run locally on your hardware. Download once, use forever.
Text → speech. Voice cloning, preset voices, and delivery control.
High-quality multilingual cloning with natural prosody. The only engine with delivery instructions — control tone, pace, and emotion with natural language.
Production-grade voice cloning with the broadest language support. 23 languages with zero-shot cloning and emotion exaggeration control.
Lightweight and fast. Supports paralinguistic tags — embed [laugh], [sigh], [gasp] directly in your text for expressive speech.
Ultra-fast, CPU-friendly cloning at 48kHz. Exceeds 150x realtime on CPU with ~1GB VRAM. The fastest engine for quick iterations.
Nine premium preset speakers with natural-language style control. "Speak slowly with warmth", "authoritative and clear" — tone and pace adapt.
Speech-language model with text-acoustic dual alignment. Built for long-form — 700s+ coherent audio without drift. Multilingual at 3B.
Tiny 82M-parameter TTS that runs at CPU realtime with negligible VRAM. Pre-built voice styles — pick a voice, type, generate.
Speech → text. Multi-language STT for dictation and captures.
The default. Mature multilingual ASR across a wide size range — pick Tiny for speed or Large for best accuracy.
Pruned Whisper Large v3. Near-best quality at roughly 8x the speed — the right default for real-time dictation.
Transcript refinement, persona replies, and on-device reasoning.
Powers transcript cleanup, persona voice replies, and the voice I/O loop. Shares its runtime with the TTS/STT stack — one model cache, one GPU story.
Available for macOS, Windows, and Linux. No dependencies required.