Research Demo

SupertonicTTS

Towards Highly Scalable and Efficient Text-to-Speech System

Overview

We present SupertonicTTS, a novel text-to-speech (TTS) system designed for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder that provides a continuous latent representation, a text-to-latent module that uses flow matching to map text to latents, and an utterance-level duration predictor. To keep the architecture lightweight, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and using cross-attention for text-speech alignment, which eliminates the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment. Experimental results show that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared with contemporary TTS models.

Comparison of Inference Steps

Audio samples were generated using our proposed method with varying numbers of inference steps (4, 8, 16, 32, 64, 128).

Listen and compare the synthesized speech quality across different inference steps.
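
To illustrate why the number of inference steps can matter for a flow-matching sampler, the sketch below integrates a toy ordinary differential equation with a fixed-step Euler solver at the step counts listed above. It is only an illustration under stated assumptions: the velocity field `toy_velocity`, its `target` and `rate` parameters, and the choice of an Euler solver are hypothetical stand-ins and are not taken from SupertonicTTS, whose learned velocity field and solver are not described on this page.

```python
import math


def toy_velocity(x, t, target=1.0, rate=2.0):
    """Hypothetical stand-in for a learned velocity field v(x, t).

    A real flow-matching model would be a neural network conditioned on
    text and reference speech; this toy field simply pulls x toward `target`.
    """
    return rate * (target - x)


def euler_sample(x0, num_steps, velocity=toy_velocity):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


def exact_solution(x0, target=1.0, rate=2.0):
    """Closed-form value of x(1) for the toy field, used as the reference."""
    return target + (x0 - target) * math.exp(-rate)


if __name__ == "__main__":
    x0 = -2.0  # assumed starting point, playing the role of a noise sample
    reference = exact_solution(x0)
    for n in (4, 8, 16, 32, 64, 128):
        err = abs(euler_sample(x0, n) - reference)
        print(f"{n:>3} steps: |error| = {err:.6f}")
```

In this toy setting the gap to the exact solution shrinks as the step count grows, at the cost of proportionally more velocity evaluations. Whether perceived audio quality follows the same trend is what the samples below let you judge.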

Text 1: "This truth which I have learned from her lips is confirmed by his face in which we have both beheld that of our son."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.
Proposed

Step 4

Generated with 4 inference steps.
Proposed

Step 8

Generated with 8 inference steps.
Proposed

Step 16

Generated with 16 inference steps.
Proposed

Step 32

Generated with 32 inference steps.
Proposed

Step 64

Generated with 64 inference steps.
Proposed

Step 128

Generated with 128 inference steps.
GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 2: "This sentence also defines our sins as great so great in fact that the whole world could not make amends for a single sin."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.
Proposed

Step 4

Generated with 4 inference steps.
Proposed

Step 8

Generated with 8 inference steps.
Proposed

Step 16

Generated with 16 inference steps.
Proposed

Step 32

Generated with 32 inference steps.
Proposed

Step 64

Generated with 64 inference steps.
Proposed

Step 128

Generated with 128 inference steps.
GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 3: "I remember now and I congratulate myself do you love anyone."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.
Proposed

Step 4

Generated with 4 inference steps.
Proposed

Step 8

Generated with 8 inference steps.
Proposed

Step 16

Generated with 16 inference steps.
Proposed

Step 32

Generated with 32 inference steps.
Proposed

Step 64

Generated with 64 inference steps.
Proposed

Step 128

Generated with 128 inference steps.
GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 4: "Notwithstanding the high resolution of hawkeye he fully comprehended all the difficulties and danger he was about to incur."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.
Proposed

Step 4

Generated with 4 inference steps.
Proposed

Step 8

Generated with 8 inference steps.
Proposed

Step 16

Generated with 16 inference steps.
Proposed

Step 32

Generated with 32 inference steps.
Proposed

Step 64

Generated with 64 inference steps.
Proposed

Step 128

Generated with 128 inference steps.
GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Comparison with Other TTS Models

We generated audio samples with our proposed method, using the reference speech provided on each baseline model's demo page.

Listen and compare the audio samples generated with our proposed method versus existing baseline approaches.

Baseline: VALL-E

Text 1: "Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 2: "They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 3: "Number ten fresh nelly is waiting on you good night husband."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 4: "And lay me down in thy cold bed and leave my shining lot."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 5: "Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 6: "The army found the people in poverty and left them in comparative wealth."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VALL-E

Sourced from VALL-E demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Baseline: VoiceBox

Text 1: "Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Text 3: "And lay me down in thy cold bed and leave my shining lot."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Text 2: "They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Text 4: "And the whole night the tree stood still and in deep thought."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Text 5: "Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Text 6: "The army found the people in poverty and left them in comparative wealth."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

VoiceBox

Sourced from VoiceBox demo page.

Baseline: Mega-TTS 2

Text 1: "There their sad condition evoked for a time general commiseration."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 2: "When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 3: "Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 4: "In what a disgraceful light might it not strike so vain a man!"

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

Text 5: "Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

Text 6: "Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

Mega-TTS 2

Sourced from Mega-TTS 2 demo page.

Baseline: CLaM-TTS

Text 1: "Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 2: "He was in deep converse with the clerk and entered the hall holding him by the arm."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 3: "Indeed, there were only one or two strangers who could be admitted among the sisters without producing the same result."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 4: "For if he's anywhere on the farm, we can send for him in a minute."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 5: "Their piety would be like their names, like their faces, like their clothes, and it was idle for him to tell himself that their humble and contrite hearts it might be paid a far-richer tribute of devotion than his had ever been. A gift tenfold more acceptable than his elaborate adoration."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Text 6: "The air and the earth are curiously mated and intermingled as if the one were the breath of the other."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

CLaM-TTS

Sourced from CLaM-TTS demo page.

GT

Ground-truth Speech

Original recording from the reference speaker reading the text.

Baseline: DiTTo-TTS

Text 1: "Do not therefore think that the gothic school is an easy one."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.

Text 2: "Did ever anybody see the like screamed missus poyser running towards the table when her eye had fallen on the blue stream."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.

Text 3: "It is sold everywhere but for the last three weeks nobody will use any snuff but that sold at the civet cat."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.

Text 4: "And this was why kenneth and beth discovered him conversing with the young woman in the buggy."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.

Text 5: "She felt the force of the objections."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.

Text 6: "She can't get it out of her head, even after fifty years."

Reference

Reference Speech

Original recording used as reference for zero-shot TTS synthesis.

Proposed

Our Method

Generated using the proposed model with 32 inference steps.

Baseline

DiTTo-TTS

Sourced from DiTTo-TTS demo page.