Tortoise-TTS is a text-to-speech system capable of producing high-quality, natural-sounding speech. It is trained on a large dataset of text and audio and can produce a wide range of voices, including male, female, and child voices. Although generation is slower than many other systems, its output quality makes it suitable for applications such as authoring audiobooks, podcasting, and enabling accessibility for individuals with disabilities.
What is Tortoise-TTS
Tortoise is a very expressive TTS system with excellent voice-cloning capabilities. It is built from an autoregressive acoustic model similar to GPT, which converts input text to discretized acoustic tokens; a diffusion model, which converts these tokens to spectrogram frames; and a UnivNet vocoder, which converts the spectrograms to the final audio output. Its main disadvantage is that Tortoise is extremely slow compared to parallel TTS methods such as VITS.
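The three-stage pipeline can be sketched in pseudocode (the function names here are illustrative, not Tortoise's actual API):

```
tokens     = autoregressive_model(text, voice_conditioning)  # GPT-style model: text -> acoustic tokens
mel_frames = diffusion_model(tokens, voice_conditioning)     # tokens -> spectrogram frames
waveform   = vocoder(mel_frames)                             # UnivNet: spectrogram -> audio
```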
How to Install Tortoise-TTS
- Install PyTorch: Follow the instructions provided at https://pytorch.org/get-started/locally/ to install PyTorch. It’s recommended to use the Conda installation path on Windows to avoid dependency issues.
- Install Tortoise TTS and its dependencies:
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python -m pip install -r ./requirements.txt
python setup.py install
- If you’re using Windows, you’ll also need to install pysoundfile:
conda install -c conda-forge pysoundfile
- Use do_tts.py script to speak a single phrase:
python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
- Use the read.py script to read large amounts of text:
python tortoise/read.py --textfile <your_text_to_be_read> --voice random
Replace <your_text_to_be_read> with the path to the text file you want to convert to speech. The script will generate spoken clips and combine them into a single output file. If any clips have issues, you can regenerate them by running read.py with the --regenerate argument.
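To give a feel for what read.py does with a large text file, here is a toy chunker in the same spirit: it splits the text at sentence boundaries into pieces small enough to synthesize one at a time. This is an illustrative sketch only, not Tortoise's actual implementation.

```python
import re

def split_text(text, max_chars=200):
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio clips concatenated into the final output file.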
- Use Tortoise TTS programmatically via the API:
import tortoise.api as api
import tortoise.utils as utils

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
You can use the Tortoise TTS API to generate speech programmatically. Here, clips_paths is a list of paths to your reference audio files, and the first argument is the text you want converted to speech.
Please note that Tortoise TTS is designed to run on an NVIDIA GPU. Without one, you can expect severe performance limitations.
Voice customization guide
Tortoise was trained primarily to be a multi-speaker model. This is accomplished through the use of reference clips.
These are recordings of a speaker that you supply to help guide speech creation. These clips are used to identify several output features, such as voice pitch and tone, speaking speed, and even speaking flaws such as a lisp or stuttering. The reference clip is also used to assess non-voice characteristics of the audio output such as loudness, background noise, recording quality, and reverb.
Random voice
Tortoise includes a mechanism that generates a voice at random. These voices do not correspond to any real speaker and are different each time you run it; the results can be fascinating.
You can activate the random voice by entering 'random' as the voice name. Tortoise will handle the rest.
This is constructed in the ML space by projecting a random vector onto the voice conditioning latent space.
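Conceptually, this amounts to sampling a random point in the voice-conditioning latent space. The sketch below illustrates only the idea; the real latents come from Tortoise's own networks, and the dimension (1024) is a made-up placeholder.

```python
import numpy as np

# Sample a random vector standing in for a voice-conditioning latent.
# In Tortoise this vector would then be projected into the conditioning
# space consumed by the autoregressive and diffusion models.
rng = np.random.default_rng(seed=0)
random_voice_latent = rng.standard_normal(1024)
```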
Provided voices
This repo includes a number of pre-packaged voices. Voices prefixed with "train_" come from the training set and outperform the rest; for the highest-quality output, choose one of these. Try the others to see what Tortoise can do with zero-shot imitation.
Adding a new voice
To add new voices to Tortoise, you will need to do the following:
- Collect audio clips from your speaker(s). YouTube interviews (you can get the audio via youtube-dl), audiobooks, and podcasts are all good sources. The next section contains guidelines for creating good clips.
- Cut your recordings into roughly 10-second clips. You'll need at least three clips; more is preferable, though in my tests I only went up to five.
- Save each clip as a floating-point WAV file with a 22,050 Hz sample rate.
- Create a subfolder under voices/ named after your new voice.
- Place your clips in that subfolder.
- Run the Tortoise utilities with --voice=<your_subfolder_name>.
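The steps above can be sketched as a small helper that lays out the voice folder. The voice name, clip paths, and voices_root default are placeholders for illustration, not part of Tortoise's API.

```python
from pathlib import Path
import shutil

def add_voice(voice_name, clip_paths, voices_root="tortoise/voices"):
    """Copy prepared WAV clips into a new subfolder under the voices/ directory."""
    voice_dir = Path(voices_root) / voice_name
    voice_dir.mkdir(parents=True, exist_ok=True)
    for clip in clip_paths:
        shutil.copy(clip, voice_dir)  # clips should already be float WAV @ 22,050 Hz
    return voice_dir
```

After this, the new voice is selectable by its folder name, e.g. --voice=<your_subfolder_name>.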
Picking good reference clips
As mentioned above, your reference clips have a profound impact on the output of Tortoise. Following are some tips for picking good clips:
- Avoid clips with background music, noise, or reverb. Such clips were removed from the training dataset, so Tortoise is unlikely to do well with them.
- Avoid recordings of speeches; amplification systems typically introduce distortion.
- Phone call clips should be avoided.
- Avoid clips with stuttering, stammering, or filler words like "uh" or "like."
- Find clips spoken in the manner you want your output to sound. For example, if you want to hear your target speaker read an audiobook, look for clips of them reading a book.
- The specific text spoken in the clips is unimportant, but varied text tends to perform better.
Advanced Usage of Tortoise-TTS
- Generation Settings: Tortoise TTS exposes a number of options for customizing the generated speech. The defaults are tuned for overall quality, but you can experiment with other parameters to produce different results. These options are available through the API; a complete list can be found in the api.tts documentation.
- Prompt Engineering: Prompt engineering is the process of modifying the input prompt to evoke particular emotions or effects in the output speech. Tortoise TTS includes an automatic redaction mechanism that lets you control what is actually spoken by enclosing text in brackets. For example, "[I am really sad,] Please feed me" will result in only "Please feed me" being spoken, in a sorrowful tone.
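The redaction idea can be illustrated with a toy function: bracketed text steers the delivery but is stripped from what gets spoken. This is only a sketch of the concept, not Tortoise's actual implementation.

```python
import re

def redact(prompt):
    """Remove bracketed guidance text, leaving only the words to be spoken."""
    return re.sub(r"\[.*?\]\s*", "", prompt).strip()
```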
- Playing with the Voice Latent: Tortoise TTS processes reference clips with a submodel to produce a latent representation of the voice. These latents can be manipulated to affect output features such as tone, speaking pace, and speech impediments. For example, you can blend two distinct voice latents to get an "average" of the two voices.
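A toy sketch of blending two voice latents: the real Tortoise latents are torch tensors produced by get_conditioning_latents(), and the dimension used here (512) is a placeholder.

```python
import numpy as np

# Stand-in latent vectors for two different voices.
voice_a = np.random.default_rng(0).standard_normal(512)
voice_b = np.random.default_rng(1).standard_normal(512)

# An equal-weight linear blend yields an "average" voice; adjusting the
# weights shifts the result toward one speaker or the other.
blended = 0.5 * voice_a + 0.5 * voice_b
```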
- Generating Conditioning Latents: The get_conditioning_latents.py script included with Tortoise may be used to extract conditioning latents for a voice. This script creates a .pth pickle file with the conditioning latents (autoregressive_latent, diffusion_latent) for a given voice. To retrieve the latents programmatically, use the api.TextToSpeech.get_conditioning_latents() function.
- Using Raw Conditioning Latents: You can use conditioning latents directly to produce speech by creating a subfolder under the voices/ directory and placing in it a single .pth file containing the pickled conditioning latents as a tuple (autoregressive_latent, diffusion_latent). Tortoise-TTS will then use those latents to generate speech with the desired voice qualities.
Conclusion
Tortoise-TTS is a new text-to-speech system that offers high quality and diversity. It is designed with a focus on delivering multi-voice capabilities and realistic prosody and intonation. The system is still under development, but it has already shown promise in a variety of applications, such as audiobooks, e-learning, and gaming. We hope you found this article helpful. If you have any questions, please feel free to leave a comment below.