Overview
We present SupertonicTTS, a novel text-to-speech (TTS) system designed for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder that produces continuous latent representations, a text-to-latent module that uses flow matching to map text to latents, and an utterance-level duration predictor. To keep the architecture lightweight, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and using cross-attention for text-speech alignment, which eliminates the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment. Experimental results show that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models.
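To make "temporal compression of latents" more concrete, here is a minimal sketch of one common way to do it: stacking several consecutive latent frames into a single wider frame, so the text-to-latent module sees a shorter sequence. This illustrates the general idea only and is not taken from the SupertonicTTS code. The function names, the zero-padding choice, and the example sizes are all assumptions.

```python
# Illustrative sketch only: names, padding scheme, and sizes are assumptions,
# not the authors' implementation.

def compress_latents(latents, factor):
    """Stack `factor` consecutive frames into one wider frame.

    latents: list of T frames, each a list of C floats.
    Returns ceil(T / factor) frames, each of width C * factor.
    The sequence is right-padded with zero frames when T is not a multiple
    of `factor` (an assumed padding choice).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if not latents:
        return []
    channels = len(latents[0])
    pad = (-len(latents)) % factor
    padded = latents + [[0.0] * channels for _ in range(pad)]
    # Concatenate each group of `factor` frames along the channel axis.
    return [sum(padded[i:i + factor], []) for i in range(0, len(padded), factor)]


def decompress_latents(compressed, factor, num_frames):
    """Invert compress_latents and drop the padded frames."""
    if not compressed:
        return []
    channels = len(compressed[0]) // factor
    frames = [row[k * channels:(k + 1) * channels]
              for row in compressed
              for k in range(factor)]
    return frames[:num_frames]


# Example: 7 frames with 2 channels each, compressed by a factor of 3.
latents = [[float(t), float(t) + 0.5] for t in range(7)]
compressed = compress_latents(latents, factor=3)
restored = decompress_latents(compressed, factor=3, num_frames=len(latents))
print(len(latents), "frames ->", len(compressed), "frames of width", len(compressed[0]))
```

Under this scheme the sequence length shrinks by the compression factor while the per-frame width grows by the same factor, so no information is lost. The saving comes from running the sequence model over fewer time steps.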
Comparison of Inference Steps
Audio samples were generated using our proposed method with varying numbers of inference steps (4, 8, 16, 32, 64, 128).
Listen and compare the synthesized speech quality across different inference steps.
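For readers unfamiliar with what an "inference step" means in a flow-matching model, the sketch below integrates an ordinary differential equation from t = 0 to t = 1 with fixed-step Euler updates, calling the velocity function once per step. This page does not state which solver SupertonicTTS uses, so the Euler solver is an assumption. The velocity field is a hand-written toy stand-in for a learned network, and `TARGET` and `RATE` are hypothetical values chosen only so that the exact endpoint is known.

```python
import math

# Toy stand-in for a learned velocity network (assumption): pulls x toward
# TARGET at a constant rate, so the exact solution at t = 1 is known.
TARGET = [1.0, -2.0, 0.5]
RATE = 3.0


def toy_velocity(x, t):
    """Velocity dx/dt at state x and time t (t is unused in this toy field)."""
    return [RATE * (target_i - x_i) for x_i, target_i in zip(x, TARGET)]


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps.

    One velocity evaluation per step, so cost grows linearly with num_steps.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [x_i + dt * v_i for x_i, v_i in zip(x, v)]
    return x


def exact_endpoint(x0):
    """Closed-form solution of the toy ODE at t = 1."""
    decay = math.exp(-RATE)
    return [target_i + (x_i - target_i) * decay for x_i, target_i in zip(x0, TARGET)]


x0 = [0.0, 0.0, 0.0]  # stands in for the initial noise sample
reference = exact_endpoint(x0)
errors = {}
for steps in (4, 8, 16, 32, 64, 128):
    out = euler_sample(toy_velocity, x0, steps)
    errors[steps] = max(abs(a - b) for a, b in zip(out, reference))
    print(f"{steps:>3} steps: max abs error = {errors[steps]:.6f}")
```

In this toy setting the discretization error shrinks as the step count grows, while compute grows linearly with it. The audio samples below let you judge the same trade-off for the actual model.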
Text 1: "This truth which I have learned from her lips is confirmed by his face in which we have both beheld that of our son."
Reference Speech
Step 4
Step 8
Step 16
Step 32
Step 64
Step 128
Ground-truth Speech
Text 2: "This sentence also defines our sins as great so great in fact that the whole world could not make amends for a single sin."
Reference Speech
Step 4
Step 8
Step 16
Step 32
Step 64
Step 128
Ground-truth Speech
Text 3: "I remember now and I congratulate myself do you love anyone."
Reference Speech
Step 4
Step 8
Step 16
Step 32
Step 64
Step 128
Ground-truth Speech
Text 4: "Notwithstanding the high resolution of hawkeye he fully comprehended all the difficulties and danger he was about to incur."
Reference Speech
Step 4
Step 8
Step 16
Step 32
Step 64
Step 128
Ground-truth Speech
Comparison with Other TTS Models
We generated audio samples with our proposed method, using reference speech taken from each baseline model's demo page.
Listen and compare the audio samples generated with our proposed method versus existing baseline approaches.
Text 1: "Yea his honourable worship is within but he hath a godly minister or two with him and likewise a leech."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 2: "They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 3: "Number ten fresh nelly is waiting on you good night husband."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 4: "And lay me down in thy cold bed and leave my shining lot."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 5: "Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 6: "The army found the people in poverty and left them in comparative wealth."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 1: "Thus did this humane and right minded father comfort his unhappy daughter and her mother embracing her again did all she could to soothe her feelings."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 2: "And lay me down in thy cold bed and leave my shining lot."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 3: "They moved thereafter cautiously about the hut groping before and about them to find something to show that warrenton had fulfilled his mission."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 4: "And the whole night the tree stood still and in deep thought."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 5: "Instead of shoes the old man wore boots with turnover tops and his blue coat had wide cuffs of gold braid."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 6: "The army found the people in poverty and left them in comparative wealth."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 1: "There their sad condition evoked for a time general commiseration."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 2: "When they entered the stage box on the left the first act was well under way, the scene being the interior of a cabin in the south of Ireland."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 3: "Now in this of Burne Jones, the landscape is clearly full of light everywhere, color or glass light: that is, the outline is prepared for modification of color only."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 4: "In what a disgraceful light might it not strike so vain a man!"
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 5: "Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 6: "Then I would never talk to that person about boa constrictors, or primeval forests, or stars. I would bring myself down to his level."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 1: "Thus did this humane and right minded father comfort his unhappy daughter, and her mother embracing her again, did all she could to soothe her feelings."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 2: "He was in deep converse with the clerk and entered the hall holding him by the arm."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 3: "Indeed, there were only one or two strangers who could be admitted among the sisters without producing the same result."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 4: "For if he's anywhere on the farm, we can send for him in a minute."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 5: "Their piety would be like their names, like their faces, like their clothes, and it was idle for him to tell himself that their humble and contrite hearts it might be paid a far-richer tribute of devotion than his had ever been. A gift tenfold more acceptable than his elaborate adoration."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 6: "The air and the earth are curiously mated and intermingled as if the one were the breath of the other."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Ground-truth Speech
Original recording from the reference speaker reading the text.
Text 1: "Do not therefore think that the gothic school is an easy one."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 2: "Did ever anybody see the like screamed missus poyser running towards the table when her eye had fallen on the blue stream."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 3: "It is sold everywhere but for the last three weeks nobody will use any snuff but that sold at the civet cat."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 4: "And this was why kenneth and beth discovered him conversing with the young woman in the buggy."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 5: "She felt the force of the objections."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Text 6: "She can't get it out of her head, even after fifty years."
Reference Speech
Original recording used as reference for zero-shot TTS synthesis.
Our Method
Generated using the proposed model with 32 inference steps.
Full Research Paper
For a detailed description of our methodology and comprehensive results, please refer to our full research paper on arXiv:
View on arXiv