OpenAI

OpenAITTSVoice module-attribute

OpenAITTSVoice = Literal[
    "alloy",
    "ash",
    "ballad",
    "echo",
    "coral",
    "fable",
    "onyx",
    "nova",
    "sage",
    "shimmer",
    "verse",
]

Available voices for OpenAI's text-to-speech API.
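
Because OpenAITTSVoice is a plain Literal alias, its members can be enumerated at runtime, e.g. to validate a user-supplied voice name. A small sketch (the import path is assumed from the source location at the bottom of this page):

```python
from typing import get_args

from mosaico.speech_synthesizers.openai import OpenAITTSVoice  # assumed path

voices = get_args(OpenAITTSVoice)  # ("alloy", "ash", ..., "verse")
assert "alloy" in voices
```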

OpenAISpeechSynthesizer

Bases: BaseModel

Speech synthesizer using OpenAI's API.
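
A minimal construction sketch using only the fields documented below; the import path is assumed from the source location at the bottom of this page:

```python
from mosaico.speech_synthesizers.openai import OpenAISpeechSynthesizer  # assumed path

synthesizer = OpenAISpeechSynthesizer(
    api_key="sk-...",         # optional; defaults to None
    model="gpt-4o-mini-tts",  # the default; "tts-1" and "tts-1-hd" are also accepted
    voice="nova",
    speed=1.0,
)
```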

provider class-attribute

provider: str = 'openai'

Provider name for OpenAI.

api_key class-attribute instance-attribute

api_key: str | None = None

API key for OpenAI's API.

base_url class-attribute instance-attribute

base_url: str | None = None

Base URL for OpenAI's API.

model class-attribute instance-attribute

model: Literal["gpt-4o-mini-tts", "tts-1", "tts-1-hd"] = (
    "gpt-4o-mini-tts"
)

Model to use for speech synthesis.

voice class-attribute instance-attribute

voice: OpenAITTSVoice = 'alloy'

Voice to use for speech synthesis.

speed class-attribute instance-attribute

speed: Annotated[float, Field(ge=0.25, le=4)] = 1.0

Speed of speech synthesis.

timeout class-attribute instance-attribute

timeout: PositiveInt = 120

Timeout for speech synthesis in seconds.

instructions class-attribute instance-attribute

instructions: str | None = None

Instructions passed to the model. Valid only when the model is from the GPT-4o family or higher.
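
For example, a sketch of both cases (the constraint is enforced inside synthesize, as the source below shows):

```python
# valid: "gpt-4o-mini-tts" is a GPT-4o-family model, so instructions are allowed
synthesizer = OpenAISpeechSynthesizer(
    model="gpt-4o-mini-tts",
    instructions="Speak slowly, in a calm and reassuring tone.",
)

# with model="tts-1" or "tts-1-hd", calling synthesize() while instructions
# is set raises a ValueError
```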

silence_threshold class-attribute instance-attribute

silence_threshold: float | None = None

Silence threshold used when stripping silence from the synthesized audio. Applied only when silence_duration is also set.

silence_duration class-attribute instance-attribute

silence_duration: float | None = None

Silence duration used when stripping silence from the synthesized audio. Applied only when silence_threshold is also set.
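
When both fields are set, synthesize strips silence from each resulting asset via AudioAsset.strip_silence. A sketch with illustrative values (neither value is a documented default):

```python
synthesizer = OpenAISpeechSynthesizer(
    silence_threshold=-40.0,  # illustrative value
    silence_duration=0.5,     # illustrative value, assumed to be in seconds
)
```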

synthesize

synthesize(
    texts: Sequence[str],
    *,
    audio_params: AudioAssetParams | None = None,
    **kwargs: Any
) -> list[AudioAsset]

Synthesize speech from texts using OpenAI's API.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| texts | Sequence[str] | Texts to synthesize. | *required* |
| audio_params | AudioAssetParams \| None | Parameters for the audio asset. | None |
| kwargs | Any | Additional parameters for the OpenAI API. | {} |

Returns:

| Type | Description |
| --- | --- |
| list[AudioAsset] | List of audio assets. |
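
A usage sketch, continuing the construction example above; per-call keyword arguments override the configured defaults, as the source below shows:

```python
assets = synthesizer.synthesize(
    ["Hello, world.", "See you soon."],
    voice="shimmer",  # overrides synthesizer.voice for this call only
    speed=1.25,
)
# assuming the asset exposes the AudioInfo built below as `.info`
print(assets[0].info.duration)  # duration in seconds
```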

Source code in src/mosaico/speech_synthesizers/openai.py
def synthesize(
    self, texts: Sequence[str], *, audio_params: AudioAssetParams | None = None, **kwargs: Any
) -> list[AudioAsset]:
    """
    Synthesize speech from texts using OpenAI's API.

    :param texts: Texts to synthesize.
    :param audio_params: Parameters for the audio asset.
    :param kwargs: Additional parameters for the OpenAI API.
    :return: List of audio assets.
    """
    assets = []

    model = kwargs.pop("model", self.model)
    # pop per-call overrides once, outside the loop, so a "voice" or "speed"
    # passed via kwargs applies to every text rather than only the first
    voice = kwargs.pop("voice", self.voice)
    speed = kwargs.pop("speed", self.speed)
    instructions = kwargs.pop("instructions", self.instructions)
    silence_threshold = kwargs.pop("silence_threshold", self.silence_threshold)
    silence_duration = kwargs.pop("silence_duration", self.silence_duration)

    if instructions and model.startswith("tts-"):
        raise ValueError("`instructions` cannot be set when model is not from the GPT-4o family or higher.")

    for text in texts:
        response = self._client.audio.speech.create(
            input=text,
            model=model,
            instructions=instructions,
            voice=voice,
            speed=speed,
            response_format="mp3",
            **kwargs,
        )
        segment = AudioSegment.from_file(io.BytesIO(response.content), format="mp3")
        asset = AudioAsset.from_data(
            response.content,
            params=audio_params if audio_params is not None else {},
            mime_type="audio/mpeg",
            info=AudioInfo(
                duration=segment.duration_seconds,
                sample_rate=segment.frame_rate,
                sample_width=segment.sample_width,
                channels=segment.channels,
            ),
        )

        if silence_threshold is not None and silence_duration is not None:
            asset = asset.strip_silence(silence_threshold, silence_duration)

        assets.append(asset)

    return assets