Top Free Text-to-Speech (TTS) Libraries for Python

As more artificial intelligence applications are built, the need for a text-to-speech (TTS) engine grows. The good news is that many open-source TTS modules are available. This post covers the top text-to-speech (TTS) libraries for Python.

gTTS

gTTS (Google Text-to-Speech) is a Python library that converts text to speech through Google Translate's text-to-speech API, so it needs an internet connection. It is designed to be easy to use and offers a few options for controlling the speech output, such as the language, the regional accent (the tld argument), and a slower speaking speed.

At the time of writing, the project had 1.7k stars on GitHub.

Usage

To use gTTS, you will need to install the library using pip:

pip install gTTS

Then, you can use the gTTS class to create an instance of the text-to-speech converter. You can pass the text you want to convert to speech as a string to the gTTS constructor. For example:

from gtts import gTTS

tts = gTTS("Hello this is a normal text and i am a python package. lol")

Once you have an instance of the gTTS class, you can use the save method to save the speech to a file. For example:


tts.save("speech.mp3")

You can also use the gTTS class to change the speech output’s language and speech speed. For example:

tts = gTTS("Bonjour, ceci est un texte normal et je suis un paquet python. lol", lang='fr')
tts.save("speech.mp3")

tts = gTTS("Hello this is a normal text and i am a python package. lol", slow=True)
tts.save("speech.mp3")

Complete code and output

from gtts import gTTS

tts = gTTS("Hello this is a normal text and i am a python package. lol")
tts.save("speech.mp3")
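If the audio should go to a web response or another library rather than a file on disk, gTTS also offers a write_to_fp method that writes the MP3 data to any file-like object. The helper below is a minimal sketch of that idea. The function name synthesize_to_bytes and the tts_class parameter are illustrative, not part of gTTS; tts_class exists only so the helper can be tried with a stand-in class when gTTS or a network connection is unavailable.

```python
import io


def synthesize_to_bytes(text, lang="en", slow=False, tts_class=None):
    """Return the synthesized speech as MP3 bytes instead of saving a file.

    tts_class is a hypothetical hook: any class with the same constructor
    arguments and a write_to_fp(fp) method can be swapped in.
    """
    if tts_class is None:
        # Imported lazily so the helper can be loaded without gTTS installed.
        from gtts import gTTS
        tts_class = gTTS

    buffer = io.BytesIO()
    # write_to_fp streams the MP3 data into the in-memory buffer.
    tts_class(text, lang=lang, slow=slow).write_to_fp(buffer)
    return buffer.getvalue()
```

Calling synthesize_to_bytes("Hello from a machine") with gTTS installed should return the same MP3 data that save would have written to speech.mp3.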

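gTTS does not expose volume or pitch controls; beyond the language and the slow flag, the main remaining option is the tld argument, which selects the Google Translate domain and with it the regional accent. To change volume or pitch, post-process the saved audio with another tool. You can find more information about the available options in the gTTS documentation.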

CoquiTTS

I already have a series of videos and posts about coquiTTS that you can find here.

CoquiTTS is a neural text-to-speech (TTS) library developed in PyTorch. It is designed to be easy to use and ships pre-trained models for many languages, along with tools for training and fine-tuning your own models.

It is the most popular package in this list, with 7.4k stars on GitHub at the time of writing.

To use CoquiTTS, you will need to install the library using pip:

pip install TTS

Once you have installed the library, you can use the ModelManager class to download a pre-trained model and its default vocoder, then pass the downloaded checkpoint and config paths to the Synthesizer constructor. For example:

# import all the modules that we will need to use
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

path = "/path/to/pip/site-packages/TTS/.models.json"

model_manager = ModelManager(path)

model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")

voc_path, voc_config_path, _ = model_manager.download_model(model_item["default_vocoder"])

syn = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    vocoder_checkpoint=voc_path,
    vocoder_config=voc_config_path
)

Once you have an instance of the Synthesizer class, you can pass the text you want to convert to its tts method to generate speech. You can then save the speech to a file using the save_wav method. For example:

text = "Hello from a machine"

outputs = syn.tts(text)
syn.save_wav(outputs, "audio-1.wav")

Complete code and output

from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer
import site
location = site.getsitepackages()[0]

path = location+"/TTS/.models.json"

model_manager = ModelManager(path)

model_path, config_path, model_item = model_manager.download_model("tts_models/en/ljspeech/tacotron2-DDC")

voc_path, voc_config_path, _ = model_manager.download_model(model_item["default_vocoder"])

synthesizer = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    vocoder_checkpoint=voc_path,
    vocoder_config=voc_config_path
)

text = "Hello from a machine"

outputs = synthesizer.tts(text)
synthesizer.save_wav(outputs, "audio-1.wav")
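The complete code above relies on site.getsitepackages()[0] to find .models.json, which can point at the wrong directory when the package is installed in a user site or a differently laid-out environment. A possibly more robust approach is to ask Python where the package itself lives. The helper below is a sketch of that idea using only the standard library; the name package_file is illustrative and not part of CoquiTTS.

```python
from importlib.util import find_spec
from pathlib import Path


def package_file(package_name, relative_path):
    """Return the path of a file that ships inside an installed package.

    find_spec locates a top-level package without importing it, which
    avoids loading a heavy library just to read a path.
    """
    spec = find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"{package_name} is not installed")
    # spec.origin points at the package's __init__.py; its parent is the package folder.
    return Path(spec.origin).parent / relative_path
```

With this helper, the path line could become path = str(package_file("TTS", ".models.json")), assuming the file keeps that name and location in the installed version.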

You can find the documentation here.

TensorFlowTTS

TensorFlowTTS (TensorFlow Text-to-Speech) is a deep learning-based text-to-speech (TTS) library built on TensorFlow 2, an open-source platform for machine learning and artificial intelligence, and maintained by the TensorSpeech project. It is designed to be easy to use and provides a range of features for building TTS systems, including support for multiple languages and customizable models.

It has 3k stars on GitHub.

To use TensorFlowTTS, you will need to install the library using pip:

pip install TensorFlowTTS

Sample code

import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")
# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")
input_ids = processor.text_to_sequence("Hello from a computer.")
# fastspeech inference
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios =tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")
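The last two lines store floating-point audio at a sample rate of 22050 Hz as 16-bit PCM. To make those two arguments concrete, here is a rough standard-library sketch of what such a write involves: clip each sample to [-1.0, 1.0], scale it to a 16-bit integer, and record the sample rate in the WAV header. The function write_pcm16 is illustrative only, and soundfile's exact scaling may differ slightly.

```python
import struct
import wave


def write_pcm16(path, samples, sample_rate=22050):
    """Write a list of float samples in [-1.0, 1.0] as a mono 16-bit PCM WAV file.

    Returns the duration of the audio in seconds.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        frames += struct.pack("<h", int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))

    return len(samples) / sample_rate
```

The sample rate matters on playback: writing the same samples with a different rate changes both the speed and the pitch of the speech.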

TensorFlowTTS also provides pre-trained models for various languages, including English, Chinese, and Japanese. You can use these models to perform speech synthesis without training your own model. You can find more information about how to use TensorFlowTTS and the available options on GitHub.

pyttsx3

pyttsx3 is a Python text-to-speech (TTS) library that works offline by driving the speech engine already available on your platform: SAPI5 on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux. pyttsx3 is designed to be easy to use and provides options for controlling speech output, such as the speaking rate, the volume, and the voice.

It has 1.3k stars on GitHub.

To use pyttsx3, you will need to install the library using pip:

pip install pyttsx3

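Once you have installed the library, you can use the pyttsx3.init function to create an instance of the text-to-speech converter. You can optionally pass the name of the driver you want to use (sapi5, nsss, or espeak) as an argument to the init function; with no argument, pyttsx3 picks the default driver for your platform. For example: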

import pyttsx3
engine = pyttsx3.init()

Once you have an instance of the TTS engine, you can use the say method to generate speech from text. The say method takes the text you want to synthesize as an argument. For example:

engine.say("Hello from a machine")
engine.runAndWait()
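The engine's output can be tuned with setProperty, and save_to_file writes the speech to disk instead of playing it. The sketch below wraps those calls in one helper. The function name speak and its default values are illustrative choices, and the engine parameter is a hypothetical hook so the helper can be tried with a stand-in object on a machine without a speech driver.

```python
def speak(text, engine=None, rate=150, volume=0.9, outfile=None):
    """Speak text aloud, or save it to outfile when a filename is given.

    rate is in words per minute and volume ranges from 0.0 to 1.0.
    """
    if engine is None:
        # Imported lazily so the helper can be loaded without pyttsx3 installed.
        import pyttsx3
        engine = pyttsx3.init()

    engine.setProperty("rate", rate)
    engine.setProperty("volume", volume)

    if outfile is None:
        engine.say(text)                    # queue the text for playback
    else:
        engine.save_to_file(text, outfile)  # queue the text for writing to a file

    engine.runAndWait()                     # process the queued command
    return engine
```

For example, speak("Hello from a machine", rate=120) should play the sentence more slowly, while speak("Hello from a machine", outfile="hello.wav") should write it to a file; the audio formats supported depend on the platform driver.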

larynx

Larynx is a text-to-speech (TTS) library written in Python that runs pre-trained neural TTS and vocoder models locally, so it works offline and does not depend on a cloud API.

To use Larynx, you will need to install the library using pip:

pip install larynx

Once you have installed the library, you can call the text_to_speech function to convert text to speech. It accepts the following parameters:

text: str,
lang: str,
tts_model: typing.Union[TextToSpeechModel, Future],
vocoder_model: typing.Union[VocoderModel, Future],
audio_settings: AudioSettings,
number_converters: bool = False,
disable_currency: bool = False,
word_indexes: bool = False,
inline_pronunciations: bool = False,
phoneme_transform: typing.Optional[typing.Callable[[str], str]] = None,
text_lang: typing.Optional[str] = None,
phoneme_lang: typing.Optional[str] = None,
tts_settings: typing.Optional[typing.Dict[str, typing.Any]] = None,
vocoder_settings: typing.Optional[typing.Dict[str, typing.Any]] = None,
max_workers: typing.Optional[int] = 2,
executor: typing.Optional[Executor] = None,
phonemizer: typing.Optional[gruut.Phonemizer] = None,

You can save the speech to a file using the write function of the wavfile module. For example:

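from larynx import text_to_speech
from larynx import wavfile
import numpy as np

# params is a dict holding the keyword arguments listed above
# (text, lang, tts_model, vocoder_model, audio_settings, ...).
# The results are consumed only once here: if text_to_speech returns a
# generator, printing list(...) first would leave nothing for the loop.
audios = []
for text, audio in text_to_speech(**params):
    print(text)
    audios.append(audio)

# rate must be the sample rate of the audio in Hz, not 1.
# This assumes audio_settings exposes it as sample_rate.
sample_rate = params["audio_settings"].sample_rate
wavfile.write(data=np.concatenate(audios), rate=sample_rate, filename="a.wav")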
