February 19, 2026 • 9 min read

Get Started with Voicebox: Open-Source Alternative to ElevenLabs Tutorial

One Sentence Summary

A walkthrough of Voice Box, a free, locally hosted voice cloning tool, covering setup, model downloads, and a practical cloning workflow.

Main Points

Voice Box is free, locally hosted voice cloning software that needs no internet connection after installation.
The video stresses legal limits: cloning public figures or celebrities may be restricted.
The install process is straightforward: download the Windows setup.exe and follow the wizard.
Opening Model Management first is recommended, especially to download Qwen TTS 1.7B for quality.
Whisper Base is advised to simplify transcription and alignment of spoken text.
Voice creation supports audio file, microphone, or system audio inputs for samples.
Maximum clip length per sample is 30 seconds; multiple clips improve accuracy.
Generating speech downloads the required TTS model if not already present.
The software supports cloning multiple voices (e.g., Stinky Scrubblet, Alex Jones, Rogan).
A “story” feature lets you compose conversations among multiple cloned voices.

Takeaways

Use Whisper to automatically transcribe audio for accurate text alignment.
Prefer Qwen TTS 1.7B for higher-fidelity voice cloning.
System audio capture helps clone voices using on-screen video or broadcasts.
Build multiple samples for each voice to improve likeness and consistency.
The “story” feature enables multi-voice dialogue for realistic conversations.

Extended Summary

The transcript demonstrates how to install and use Voice Box, a locally hosted voice-cloning and text-to-speech (TTS) application that runs entirely on your computer. Unlike browser-based services such as ElevenLabs, it performs voice cloning and speech generation locally and needs no internet access after installation.

The tutorial focuses on:

Installing Voice Box on Windows
Downloading required AI models
Recording or importing voice samples
Automatically transcribing audio with Whisper
Creating a voice profile
Generating speech from typed text
Producing multi-speaker conversations using the Story feature

The main technologies demonstrated are:

Voice Box (local voice cloning application)
Qwen TTS 1.7B voice generation model
Whisper Base speech-to-text transcription model

The workflow shows how to clone a voice from short audio clips (≤30 seconds each) and synthesize speech using that voice.

Detailed Step-by-Step Breakdown

1. Download the Software

Navigate to the Voice Box download page (link mentioned in video description).
Select the installer corresponding to your OS.

Example for Windows:

setup.exe

Run the installer.

Typical installer options:

Installation directory selection
Desktop shortcut creation
Standard install wizard steps

After installation, launch Voice Box.

2. Open Model Management

Immediately after opening the application:

Navigate to:

Model Management

Download required AI models.

Recommended models:

Qwen TTS 1.7B (or newest version)
Whisper Base

These models enable:

Model	Purpose
Qwen TTS 1.7B	Text-to-speech generation
Whisper Base	Automatic speech transcription

Download both before creating voices to avoid workflow interruptions.

3. Download the TTS Model

Inside Model Management:

Locate:

Qwen TTS 1.7B

Click:

Download

Notes:

A smaller alternative model is available, but it is less accurate.
Qwen TTS 1.7B provides higher-quality voice reproduction.

Possible issue mentioned:

Download progress may appear frozen.
Fix:

Force quit application
Restart Voice Box
Resume download

4. Download Whisper Transcription Model

Inside Model Management:

Find:

Whisper Base

Click:

Download

Purpose:

When recording a voice sample, the system requires the exact spoken text.

Without Whisper:

You must manually type the spoken sentence.

With Whisper:

Click "Transcribe"

The system automatically generates the transcript.
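
For readers curious what the Transcribe button roughly corresponds to, here is a minimal sketch using the standalone open-source `openai-whisper` Python package. This is an assumption for illustration only: the source does not show how Voice Box calls Whisper internally, and `transcribe_clip` is a hypothetical helper name.

```python
from typing import Any, Optional


def transcribe_clip(path: str, model: Optional[Any] = None) -> str:
    """Return the transcript of one audio clip as plain text."""
    if model is None:
        # Assumption: the `openai-whisper` package is installed
        # (pip install openai-whisper) and ffmpeg is available on PATH.
        import whisper

        model = whisper.load_model("base")  # the "Whisper Base" checkpoint
    result = model.transcribe(path)  # returns a dict with a "text" key
    return result["text"].strip()
```

Calling `transcribe_clip("sample.wav")` (with `sample.wav` standing in for your own recording) would return the text that you would otherwise type by hand. Passing a preloaded model avoids reloading the checkpoint for every clip.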

5. Create a New Voice Profile

Navigate to:

Create Voice

Three input options are available:

Input Method	Description
Audio File	Import prerecorded voice clip
Microphone Recording	Record voice directly
System Audio	Capture sound from computer playback

6. Record a Voice Sample

Example using microphone recording.

Constraints:

Maximum clip length = 30 seconds

Steps:

Click Record from Microphone
Speak naturally for ~10–30 seconds

Example spoken text from transcript:

Hello everybody. This is Stinky Scribblelet. 
I'm currently doing a small recording for the YouTube video 
to clone my own voice.

Stop recording.
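
If you import prerecorded files instead of recording live, it can help to check them against the 30-second limit before opening Voice Box. The sketch below uses only Python's standard `wave` module and assumes uncompressed WAV input; the function names are hypothetical and this is not part of Voice Box itself.

```python
import wave

MAX_CLIP_SECONDS = 30.0  # per-clip limit stated in the tutorial


def wav_duration_seconds(path: str) -> float:
    """Return the length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_clip_limit(path: str, limit: float = MAX_CLIP_SECONDS) -> bool:
    """True if the clip is no longer than the given limit."""
    return wav_duration_seconds(path) <= limit
```

Other formats such as MP3 would need a different library to measure, so treat this as a WAV-only check.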

7. Generate the Transcript

If Whisper Base is installed:

Click:

Transcribe

The software automatically fills the transcription field with the spoken text.

Example output:

Hello everybody this is Stinky Scribblelet 
I'm currently doing a small recording for the YouTube video 
to clone my own voice

The transcript tells the model which words are spoken in the sample, so it can align the audio with the text when building the voice profile.
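
Note that the example output drops punctuation and some capitalization compared with the spoken text. If you want to confirm that a transcript still matches what was said, one simple approach (an illustrative sketch, not something Voice Box does) is to compare both strings after normalizing them:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


spoken = (
    "Hello everybody. This is Stinky Scribblelet. "
    "I'm currently doing a small recording for the YouTube video "
    "to clone my own voice."
)
transcribed = (
    "Hello everybody this is Stinky Scribblelet "
    "I'm currently doing a small recording for the YouTube video "
    "to clone my own voice"
)

# True when the two texts contain the same words in the same order.
matches = normalize(spoken) == normalize(transcribed)
```

A mismatch here usually points to a misheard word, which is worth correcting in the transcription field before creating the profile.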

8. Create the Voice Profile

After transcription:

Click:

Create Profile

The voice appears in the voice list.

Example profile name:

Scrub

9. Generate Speech

Navigate to:

Generate Speech

Select the voice profile:

Scrub

Enter text:

Hello this is Stinky Scrubblelet
Thanks for watching my video
Hopefully this sounds like me

Click:

Generate

Processing notes:

Runs locally on CPU/GPU
May take ~1 minute depending on hardware

Output:

An audio file generated in the cloned voice.

10. Improve Voice Accuracy

Voice quality improves with more training samples.

Recommendations:

Add multiple clips
Use clean audio
Avoid:
- background music
- impressions
- exaggerated speech

Example improvement process:

Add Clip 1 (20s)
Add Clip 2 (25s)
Add Clip 3 (15s)

More samples → better timbre reproduction.
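
When the source material is one long recording, it has to be cut into clips that respect the 30-second limit. The following sketch only plans the cut points; the 10-second minimum is an assumption borrowed from the "10–30 seconds each" guideline in this article, and `plan_clips` is a hypothetical helper.

```python
from typing import List, Tuple


def plan_clips(
    total_seconds: float, max_len: float = 30.0, min_len: float = 10.0
) -> List[Tuple[float, float]]:
    """Split a recording into consecutive (start, end) windows.

    Each window is at most max_len seconds long. A trailing window
    shorter than min_len is dropped rather than kept as a weak sample.
    """
    clips = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_len, total_seconds)
        if end - start >= min_len:
            clips.append((start, end))
        start = end
    return clips
```

For a 75-second recording this yields three windows (0–30, 30–60, 60–75). In practice you would nudge the cut points so they fall on pauses rather than mid-word.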

11. Capture Audio From System Output

Useful for cloning voices from videos or podcasts.

Steps:

Open Create Voice
Select:

System Audio Capture

Play a video containing the target voice
Click:

Start System Capture

Stop capture before background music begins.

Then repeat transcription and profile creation.

12. Generate Multi-Speaker Conversations

Voice Box includes a feature called:

Story

This allows multiple voices to interact in a scripted dialogue.

Example workflow:

Speaker 1: Alex Jones
Speaker 2: Joe Rogan
Speaker 3: Theo Von

Example script:

Alex Jones: The globalists are at it again.
Joe Rogan: I don't know man, that's pretty wild.
Theo Von: Brother I saw a raccoon do that once.

Voice Box will generate a single audio conversation.
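
If you draft Story scripts in a text editor first, a small parser can confirm that every line has a speaker and show which voice profiles you will need. This is a hypothetical pre-check, assuming the "Speaker: line" layout shown above; the source does not describe Voice Box's own script format.

```python
from typing import List, Tuple


def parse_story(script: str) -> List[Tuple[str, str]]:
    """Turn 'Speaker: text' lines into (speaker, text) pairs."""
    turns = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"Missing 'Speaker:' prefix in line: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


script = """\
Alex Jones: The globalists are at it again.
Joe Rogan: I don't know man, that's pretty wild.
Theo Von: Brother I saw a raccoon do that once.
"""
turns = parse_story(script)
# Distinct speakers, i.e. the voice profiles the script relies on.
speakers = sorted({speaker for speaker, _ in turns})
```

Each name in `speakers` should correspond to a voice profile you have already created.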

Key Technical Details

Software

Voice Box – local voice cloning software
Whisper Base – speech-to-text transcription model
Qwen TTS 1.7B – main TTS synthesis model

Operating System

Demonstrated on:

Windows

Clip Requirements

Maximum clip length: 30 seconds
Multiple clips allowed

Processing Method

Local inference
No internet required after setup

Input Sources

Microphone
Audio file
System audio capture

Key Interface Sections

Model Management
Create Voice
Generate Speech
Story
Import Voice

Pro Tips

1. Use Clean Voice Samples

Avoid:

music
crowd noise
sound effects

Clean recordings dramatically improve synthesis.

2. Use Natural Speech

Speak normally.

Avoid:

yelling
impressions
exaggerated accents

Models learn tone and cadence, not just pitch.

3. Add Multiple Voice Samples

Ideal dataset:

3–10 voice clips
10–30 seconds each

This improves:

pronunciation
pacing
emotional tone

4. Restart If Model Download Freezes

If the Qwen TTS download stalls:

Close Voice Box
Restart application
Resume download

5. Use Whisper for Faster Workflow

Without Whisper you must type transcripts manually.

With Whisper:

Click Transcribe → automatic text

This saves time and prevents transcription mistakes.

Potential Limitations / Warnings

1. Processing Speed

Because generation runs locally:

CPU-only systems may take 30–90 seconds per generation.

Cloud services like ElevenLabs may generate faster.

2. Hardware Requirements (Assumption: Standard Setup)

The transcript does not specify system requirements.

The figures below are general assumptions for local TTS models of this size, not values from the video:

Recommended

GPU: NVIDIA RTX 2060+
VRAM: 6–12 GB
RAM: 16 GB
Storage: ~10 GB for models

CPU inference still works but is slower.
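
Before downloading models, you can check whether the target drive has enough free space. The sketch below uses Python's standard `shutil.disk_usage`; the 10 GB default simply mirrors the rough storage estimate above, which is itself an assumption rather than a documented requirement.

```python
import shutil


def free_gb(path: str = ".") -> float:
    """Free space, in GiB, on the drive that contains `path`."""
    return shutil.disk_usage(path).free / 1024**3


def has_room_for_models(path: str = ".", required_gb: float = 10.0) -> bool:
    """True if the drive has at least `required_gb` GiB free."""
    return free_gb(path) >= required_gb
```

Point `path` at the drive where Voice Box stores its models, which may differ from the installation directory.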

3. Training Data Quality

Bad samples produce:

robotic speech
incorrect tone
mismatched cadence

4. Legal / Ethical Concerns

Some voice platforms restrict cloning public figures.

While local tools allow it technically, usage may violate:

publicity rights
platform policies
regional laws

5. Background Noise Sensitivity

Noise can confuse phoneme alignment during training, reducing voice similarity.

Recommended Follow-Up Resources

To deepen practical understanding of voice synthesis and speech AI:

Speech synthesis and neural TTS architecture tutorials
Whisper speech recognition documentation
Neural vocoder design (WaveNet, HiFi-GAN)
Voice cloning techniques (speaker embedding models)
Audio preprocessing for machine learning

These topics help optimize and customize local voice cloning systems.

Suggested Books (5)

1. Speech And Language Processing — Daniel Jurafsky & James H. Martin

This foundational NLP textbook covers speech recognition, language modeling, and speech synthesis techniques. It explains the theoretical frameworks behind systems like Whisper and neural TTS models, helping readers understand the architecture powering modern voice cloning tools.

2. Deep Learning for Natural Language Processing — Palash Goyal, Sumit Pandey, Karan Jain

A practical guide to implementing neural models used in speech and language AI systems. It introduces deep learning frameworks used for TTS, sequence models, and embeddings that form the basis of modern voice synthesis systems.

3. Designing Voice User Interfaces — Cathy Pearl

This book focuses on practical aspects of voice technologies, including speech generation, voice persona design, and conversational audio systems. It provides insight into how generated voices are used in real-world products.

4. Applied Speech and Audio Processing with MATLAB — Ian McLoughlin

A strong technical resource for understanding digital signal processing techniques used in speech analysis, audio feature extraction, and synthesis pipelines relevant to voice cloning systems.

5. Deep Learning — Ian Goodfellow, Yoshua Bengio, Aaron Courville

A comprehensive reference on deep neural networks. It covers architectures such as convolutional and recurrent networks that underpin neural TTS models like those used in Voice Box and similar speech synthesis frameworks.
