February 19, 2026 • 9 min read
Tags:
- Voice Cloning Software
- Voice Box
- Offline Speech
- Text To Speech
- Voice Generation Model
A walkthrough of Voice Box, a free, locally hosted voice cloning tool highlighting setup, models, and practical cloning workflow.
The transcript demonstrates how to install and use Voice Box, a **locally hosted voice-cloning and text-to-speech (TTS) application** that runs entirely on your computer. Unlike browser-based services such as ElevenLabs, this tool performs voice cloning and speech generation locally and does not need internet access once the application and its models are installed.
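The tutorial focuses on installing Voice Box, downloading its models, creating a voice profile from recorded or imported audio, and generating speech with that profile.
The main technologies demonstrated are Qwen TTS 1.7B for text-to-speech generation and Whisper Base for automatic speech transcription.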
The workflow shows how to clone a voice from short audio clips (≤30 seconds each) and synthesize speech using that voice.
Example for Windows:
setup.exe
Typical installer options:
After installation, launch Voice Box.
Immediately after opening the application, navigate to:
Model Management
Recommended models and what they enable:
| Model | Purpose |
|---|---|
| Qwen TTS 1.7B | Text-to-speech generation |
| Whisper Base | Automatic speech transcription |
Download both before creating voices to avoid workflow interruptions.
Inside Model Management:
Locate:
Qwen TTS 1.7B
Click:
Download
Notes:
Possible issue mentioned: the download may stall. If it does:
Force quit the application
Restart Voice Box
Resume the download
Inside Model Management:
Find:
Whisper Base
Click:
Download
Purpose:
When recording a voice sample, the system requires the exact spoken text.
Without Whisper:
You must manually type the spoken sentence.
With Whisper:
Click "Transcribe"
The system automatically generates the transcript.
Navigate to:
Create Voice
Three input options are available:
| Input Method | Description |
|---|---|
| Audio File | Import prerecorded voice clip |
| Microphone Recording | Record voice directly |
| System Audio | Capture sound from computer playback |
Example using microphone recording.
Constraints:
Maximum clip length = 30 seconds
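If you prepare clips outside Voice Box, it can help to check their length before importing them. The following is a minimal sketch that assumes uncompressed WAV input and uses only Python's standard `wave` module. The helper names are hypothetical and are not part of Voice Box.

```python
import wave

MAX_CLIP_SECONDS = 30.0  # limit stated in the walkthrough


def clip_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(path: str, limit: float = MAX_CLIP_SECONDS) -> bool:
    """True if the clip is no longer than the limit."""
    return clip_duration_seconds(path) <= limit
```

Compressed formats such as MP3 are not readable by `wave` and would need a different library.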
Steps:
Example spoken text from transcript:
Hello everybody. This is Stinky Scribblelet.
I'm currently doing a small recording for the YouTube video
to clone my own voice.
If Whisper Base is installed:
Click:
Transcribe
The software automatically fills the transcription field with the spoken text.
Example output:
Hello everybody this is Stinky Scribblelet
I'm currently doing a small recording for the YouTube video
to clone my own voice
The transcript tells the model which words the clip contains, so it can align the spoken phonemes with the audio and capture the speaker's identity.
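The example output above differs from the spoken text only in punctuation. If you type a transcript by hand and want to compare it with an automatic one, a rough check is to compare the two after stripping case and punctuation. This is a sketch only; `normalize` is a hypothetical helper, not something Voice Box or Whisper provides.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


spoken = (
    "Hello everybody. This is Stinky Scribblelet. "
    "I'm currently doing a small recording for the YouTube video "
    "to clone my own voice."
)
transcribed = (
    "Hello everybody this is Stinky Scribblelet "
    "I'm currently doing a small recording for the YouTube video "
    "to clone my own voice"
)

# True when both texts contain the same words in the same order.
same_words = normalize(spoken) == normalize(transcribed)
```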
After transcription:
Click:
Create Profile
The voice appears in the voice list.
Example profile name:
Scrub
Navigate to:
Generate Speech
Select the voice profile:
Scrub
Enter text:
Hello this is Stinky Scrubblelet
Thanks for watching my video
Hopefully this sounds like me
Click:
Generate
Processing notes:
Output:
An audio file generated in the cloned voice.
Voice quality improves with more training samples.
Recommendations:
Add multiple clips
Use clean audio
Avoid:
Example improvement process:
Add Clip 1 (20s)
Add Clip 2 (25s)
Add Clip 3 (15s)
More samples generally yield better timbre reproduction.
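One way to obtain several clips is to cut a longer recording into segments that each respect the 30-second limit. The sketch below only computes the cut points. `split_points` is a hypothetical helper, and it splits evenly, so in practice you would move each cut to a nearby pause to avoid cutting mid-word.

```python
import math


def split_points(total_seconds: float, max_len: float = 30.0) -> list[tuple[float, float]]:
    """Divide a recording into equal segments, each at most max_len seconds."""
    if total_seconds <= 0 or max_len <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_len)
    step = total_seconds / count
    # Derive each boundary from its index to avoid accumulating float error,
    # and pin the final boundary to the exact total.
    bounds = [i * step for i in range(count)] + [total_seconds]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, `split_points(70)` yields three segments of roughly 23.3 seconds each rather than two full clips and a short remainder.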
System audio capture is useful for cloning voices from videos or podcasts.
Steps:
System Audio Capture
Start System Capture
Then repeat transcription and profile creation.
Voice Box includes a feature called:
Story
This allows multiple voices to interact in a scripted dialogue.
Example workflow:
Speaker 1: Alex Jones
Speaker 2: Joe Rogan
Speaker 3: Theo Von
Example script:
Alex Jones: The globalists are at it again.
Joe Rogan: I don't know man, that's pretty wild.
Theo Von: Brother I saw a raccoon do that once.
Voice Box will generate a single audio conversation.
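The walkthrough does not show how Voice Box stores a story internally. As an illustration only, a script in the `Speaker: line` form above can be split into speaker and text pairs like this. `parse_script` is a hypothetical helper and does not reflect Voice Box's own format.

```python
def parse_script(script: str) -> list[tuple[str, str]]:
    """Split 'Speaker: line' text into (speaker, line) pairs."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        # Split on the first colon only, so colons inside the line survive.
        speaker, sep, text = line.partition(":")
        if not sep or not speaker.strip() or not text.strip():
            raise ValueError(f"expected 'Speaker: text', got {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


script = """\
Alex Jones: The globalists are at it again.
Joe Rogan: I don't know man, that's pretty wild.
Theo Von: Brother I saw a raccoon do that once.
"""
turns = parse_script(script)
speakers = [speaker for speaker, _ in turns]
```

Each pair maps naturally to one voice profile plus the text that profile should speak.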
- Demonstrated on: Windows
- Clips: maximum length 30 seconds; multiple clips allowed
- Processing: local inference; no internet required after setup
- Input methods: microphone, audio file, system audio capture
- Application sections: Model Management, Create Voice, Generate Speech, Story, Import Voice
Avoid:
Clean recordings dramatically improve synthesis.
Speak normally.
Avoid:
Models learn tone and cadence, not just pitch.
Ideal dataset:
3–10 voice clips
10–30 seconds each
This improves:
If the Qwen TTS download stalls:
Close Voice Box
Restart application
Resume download
Without Whisper you must type transcripts manually.
With Whisper:
Click Transcribe → automatic text
This saves time and prevents transcription mistakes.
Because generation runs locally, speed depends on your hardware. Cloud services like ElevenLabs may generate faster.
The transcript does not specify system requirements.
Typical requirements for local TTS models:
Recommended
GPU: NVIDIA RTX 2060+
VRAM: 6–12 GB
RAM: 16 GB
Storage: ~10 GB for models
CPU inference still works, but it is slower.
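Before downloading models, you may want to confirm there is enough free disk space. The sketch below uses Python's standard `shutil.disk_usage`. The 10 GB default is the rough estimate given above, not a figure published by Voice Box, and `has_free_space` is a hypothetical helper.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_free_space(path: str = ".", required_gb: float = 10.0) -> bool:
    """True if the drive holding `path` has at least required_gb GiB free."""
    # required_gb defaults to the article's rough estimate (an assumption).
    return shutil.disk_usage(path).free >= required_gb * GIB
```

Pass the folder where you expect models to be stored, since that location is not stated in the walkthrough.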
Bad samples produce:
Some voice platforms restrict cloning public figures.
While local tools allow it technically, usage may violate:
Noise can confuse phoneme alignment during training, reducing voice similarity.
To deepen practical understanding of voice synthesis and speech AI:
These topics help optimize and customize local voice cloning systems.
This foundational NLP textbook covers speech recognition, language modeling, and speech synthesis techniques. It explains the theoretical frameworks behind systems like Whisper and neural TTS models, helping readers understand the architecture powering modern voice cloning tools.
A practical guide to implementing neural models used in speech and language AI systems. It introduces deep learning frameworks used for TTS, sequence models, and embeddings that form the basis of modern voice synthesis systems.
This book focuses on practical aspects of voice technologies, including speech generation, voice persona design, and conversational audio systems. It provides insight into how generated voices are used in real-world products.
A strong technical resource for understanding digital signal processing techniques used in speech analysis, audio feature extraction, and synthesis pipelines relevant to voice cloning systems.
A comprehensive reference on deep neural networks. It covers architectures such as convolutional and recurrent networks that underpin neural TTS models like those used in Voice Box and similar speech synthesis frameworks.