Samples for the paper "Prompt-to-Speech Using a Large Language Model"

Discourse neutral task

We first prompt the model to make only local modifications based on the target text for a discourse neutral task. Measuring perceived naturalness on this task shows whether the local modifications on their own have a measurable impact. We compare the proposed method with the baseline model and the two reference-based models, using validation utterances from the LJSpeech corpus.

|Ground-truth|Style-token oracle|Prosody-transfer oracle|Baseline|Proposed|

Target speaking style task

We measure the perceived "appropriateness" of utterances to evaluate how well the proposed method predicts prosody for a given target speaking style. Participants are shown the target text and speaking style when performing this evaluation. For this we create a small text corpus: we first select a number of speaking styles and then, for each one, generate a set of target texts that we deem appropriate for that style.

|Target Speaking Style|Text|Baseline|Proposed Method|
|---|---|---|---|
|Afraid for their life|Please, don't hurt me! I'll give you everything I have, just let me go!| | |
|Afraid for their life|Somebody help me! Help me, please!| | |
|In a hurry|Time is running out, we have to make it to the meeting!| | |
|In a hurry|We have to hurry if we want to catch the last train!| | |
|Murmuring|Just keep calm, everything will be alright| | |
|Murmuring|I should have listened to my gut| | |
|Talking to a child|It's okay to make mistakes, that's how we learn.| | |
|Talking to a child|I love you more than anything in the world.| | |
|Explaining with frustration|The bowl is in the cabinet furthest to the left.| | |
|Explaining with frustration|It's not Joe, it's Frank who's coming to the party.| | |
|Talking like a robot|I'm afraid I can't do that.| | |
|Talking like a robot|That is not a valid command.| | |
|In a fight with someone|You will regret this!| | |
|In a fight with someone|This fight will be over as fast as it started!| | |

Dialogue task

We also evaluate the appropriateness of utterances in the context of a two-line dialogue. Here, the LLM is instructed to make modifications based only on the previous line in a dialogue and the target text, not a target style. We first choose six "hidden" speaking styles and then generate 10 two-line dialogues where the target text would naturally fit the speaking style.

|Style|Previous line|Target text|Baseline|Proposed Method|
|---|---|---|---|---|
|Having a friendly conversation|I love your outfit today, where did you get it?|Thank you! I actually got it on sale at a boutique downtown.| | |
|Having a friendly conversation|I heard you're going on vacation soon, where are you headed?|Yeah, I'm going to Hawaii! I can't wait to relax on the beach and soak up some sun.| | |
|In a heated argument|I can't believe you did that!|What's the big deal? It's not like it affects you.| | |
|In a heated argument|You never pay attention to me anymore.|That's because you're always nagging me about everything!| | |
|Formal discussion|According to the data, we're seeing a decline in sales.|Yes, I agree. However, we should also consider the external factors affecting our industry.| | |
|Formal discussion|We need to review the proposed changes to our company's mission statement.|Agreed. We should schedule a meeting with the stakeholders to discuss this further.| | |
|Sarcastic response|Nice weather we're having today.|Oh yeah, I love sweating through my clothes every five minutes.| | |
|Sarcastic response|I can't believe you forgot my birthday again.|Oh, I'm sorry. I must have mistaken you for someone who actually matters.| | |
|Excited|I got the lead role in the play!|That's amazing! Congratulations, you're going to be fantastic!| | |
|Excited|Where are we going?|Prepare yourself because we are going to Disneyland!| | |
|Apologizing|You missed our meeting yesterday.|I know, I'm sorry. Something urgent came up and I wasn't able to make it.| | |
|Apologizing|You were supposed to bring the food to the party.|I'm sorry, I completely forgot. Let me order some food delivery right now.| | |

Example prompt

The following is an example prompt that was used to adjust duration, pitch and energy of a text given a target speaking style:

Prompt:

        
Could you help me with generating speech? I am trying to select the most appropriate values of the following prosodic properties: duration, energy and pitch.

given some target text and a target speaking style. Assume that all the speech is a part of a dialogue. Assume that I already have the speech for the target text but I can change the prosodic properties in a relative manner. I will give you:
1. Either a previous line in the dialogue or a target speaking style
2. The target text

And I want you to tell me:
1. How to change the pitch, energy and duration, in general, for the target text. We can increase or decrease these values as long as they don't go out of the range that is considered normal for the current speaker. So for each of those attributes, tell me how much to change them: (0: the standard value, -5: the minimum value, 5: the maximum value). A positive value for duration means a slower speaking rate, a negative value means a faster speaking rate. You should select these values appropriately given either the target style or the dialogue context. These values should not be used to emphasise specific words in the sentence.

2. How prominent each word is on a scale from 0 to 5 (0: standard prominence, 5: maximum prominence). Note that this value should reflect how salient the word is in the given sentence so most words should probably have a value of 0. Salient words should then have a value ranging from 1-5, considering how salient they are in the sentence.

Instructions:
1. Report the sentence level change of pitch, energy and duration in a separate table. Pitch, energy and duration should be columns in the table. Pitch first, then energy and finally the duration
2. Report the prominence level of each word in a separate table. Remove any punctuation as we do not have to predict prominence for those symbols.

Here are templates for the two tables:

The sentence-level attribute table:
|Pitch|Energy|Duration|
|---|---|---|
|the chosen pitch value|the chosen energy value|the chosen duration value|

The prominence table (example sentence: he had a big house):
|he|had|a|big|house| ...
|---|---|---|...
|the chosen prominence level for word 1| the chosen prominence level for word 2| ...

Rules you must follow:
1. All words in the target text should be included in the prominence table.
2. Do not write any text above or between the tables. You should include your reasoning after the tables.
3. You must use the table templates.
4. Your choice of parameters and reasoning have to make sense. For example, you couldn't suggest a decrease but then motivate that choice by saying an increase is appropriate.
5. Make your suggestions independently of who it is that is speaking. The only thing that matters is that they are a native US speaker.

Your first example is the following:
Target style: Talking to a child
Target text: I will always take care of you!
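
As an illustration of how the last two lines of such a prompt could be filled in for each utterance, here is a minimal Python sketch. It is not the code used for the paper. The function name `build_task_lines` and the label "Previous line" for the dialogue variant are assumptions, since the page only shows the target-style variant.

```python
from typing import Optional


def build_task_lines(target_text: str,
                     target_style: Optional[str] = None,
                     previous_line: Optional[str] = None) -> str:
    """Format the per-utterance lines that close the prompt.

    Exactly one of target_style / previous_line must be given, mirroring
    the prompt's "either a previous line in the dialogue or a target
    speaking style".
    """
    if (target_style is None) == (previous_line is None):
        raise ValueError("Provide exactly one of target_style or previous_line")
    if target_style is not None:
        context = f"Target style: {target_style}"
    else:
        # Assumed label: the dialogue variant is not shown on this page.
        context = f"Previous line: {previous_line}"
    return f"{context}\nTarget text: {target_text}"
```

Calling `build_task_lines("I will always take care of you!", target_style="Talking to a child")` reproduces the two lines that end the example prompt.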
        
    

Response:

        
Here are my suggested prosodic changes for the given target text in the style of talking to a child:

Pitch	Energy	Duration
2	2	1

Word	Pitch	Energy	Duration
I	0	1	0
will	0	0	0
always	1	1	0
take	0	0	0
care	0	1	1
of	0	0	0
you	2	2	1

Just a slight increase in pitch, energy and duration is recommended to make the speech sound
more cheerful and caring.

For word-level changes, the word "always" should be emphasised by a slight increase in pitch
and energy to indicate that the speaker is making a promise to the child. The word "care"
should have a higher energy and longer duration to make the speaker sound more caring and
empathetic towards the child. The word "you" should have a significant increase in pitch,
energy and duration to make the child feel loved and appreciated.
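
A reply laid out like the one above could be read back into numeric values roughly as follows. This is a minimal sketch under stated assumptions, not the parsing code used for the paper: it assumes whitespace-separated tables, a sentence-level header of exactly "Pitch Energy Duration", a word-level table whose header starts with "Word", and a blank line ending the word table. The function name `parse_response` is hypothetical.

```python
from typing import Dict, List, Tuple

COLUMN_NAMES = {"Pitch", "Energy", "Duration", "Prominence"}


def parse_response(text: str) -> Tuple[Dict[str, int], List[Tuple[str, List[int]]]]:
    """Return (sentence-level values, per-word values) from a model reply."""
    sentence: Dict[str, int] = {}
    words: List[Tuple[str, List[int]]] = []
    lines = [ln.strip() for ln in text.splitlines()]
    i = 0
    while i < len(lines):
        cells = lines[i].split()
        if cells == ["Pitch", "Energy", "Duration"] and i + 1 < len(lines):
            # Sentence-level table: one header row, then one row of values.
            values = [int(v) for v in lines[i + 1].split()]
            sentence = dict(zip(("pitch", "energy", "duration"), values))
            i += 2
        elif len(cells) > 1 and cells[0] == "Word" and set(cells[1:]) <= COLUMN_NAMES:
            # Word-level table: one row per word until the next blank line.
            i += 1
            while i < len(lines) and lines[i]:
                word, *vals = lines[i].split()
                words.append((word, [int(v) for v in vals]))
                i += 1
        else:
            i += 1
    return sentence, words
```

The sketch raises `ValueError` when a table cell is not an integer, which gives a simple signal that the reply did not follow the expected layout.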
        
    

Few-shot samples

These human-generated examples are additionally provided to the model through prompting:

        
Here are ten example answers for this task:

# example 1
Target speaking style: Narration
Target text: "The old woman walked through the door."

These are the changes I would suggest:

Pitch   Energy  Duration
-1      -1      2

Word    Prominence
The     0
old     0
woman   0
walked  0
through 0
the     0
door    1

Explanation:
We make the overall duration slightly longer since that is fitting for narration. We lower the pitch and energy for a calm sounding voice which is fitting for this neutral target text. We make the word "door" prominent since it is the most important word in the sentence.


# example 2
Target speaking style: Inspirational speech
Target text: "You can do it!"

Here is how I would change the prosody:

Pitch   Energy  Duration
2       3       -2

Word    Prominence
You     1
can     3
do      0
it      1

Explanation:
We make the overall duration shorter since the speaker is probably excited, given the target speaking style. We further raise the pitch and energy for a more excited sounding voice which is fitting for this target text. We add a low-level prominence to the words "You" and "it". The word "can" is the most important word in the sentence since the speaker is likely convincing the listener that they can do something.


# example 3
Target speaking style: Angry
Target text: "I can't stand this anymore!"

These are my suggestions:

Pitch   Energy  Duration
-1      2       -1

Word    Prominence
I       1
can't   0
stand   2
this    2
anymore 0


Explanation:
We lower the pitch and increase the energy to make the voice sound more angry. We slightly increase the speaking rate since this is probably something that is said in a heated argument. The word "I" is made slightly more prominent since the speaker is reflecting on their own feelings. The words "stand" and "this" are made more prominent since the speaker probably wants to emphasize that they have had enough of something.


# example 4
Target speaking style: A college professor giving a lecture
Target text: "The following example illustrates the point."

This is what I would suggest:

Pitch   Energy  Duration
-1       0       2

Word        Prominence
The         0
following   0
example     0
illustrates 2
the         0
point       1

Explanation:
We make the speaking rate slower since that helps the listener to follow along. A slightly lower pitch is fitting for this target speaking style. We only make the words "illustrates" and "point" slightly more prominent since they are directly involved in the point the speaker is trying to make.


# example 5
Target speaking style: A villain in a movie
Target text: "Soon, this city will be mine!"

These are my suggestions for this speaking style:

Pitch   Energy  Duration
-3      4       -2

Word    Prominence
Soon    0
this    0
city    0
will    0
be      0
mine    4

Explanation:
A lower pitch and higher energy is fitting for a voice that is perceived as evil. We make the overall duration shorter to indicate excitement in the speaker's voice. We make the word "mine" much more prominent since it emphasizes the goal of the speaker.


# example 6
Target speaking style: A child
Target text: "Can we have pizza for dinner?"

My suggestions:

Pitch   Energy  Duration
2       0       2

Word    Prominence
Can     0
we      0
have    0
pizza   2
for     0
dinner  0

Explanation:
We make the pitch higher and the speaking rate slower which would be appropriate for a child speaking. We emphasize the word "pizza" since it is likely the focus of the child's request.


# example 7
Target speaking style: An old man telling a story
Target text: "In those days, we didn't have much."

I would make the following adjustments:

Pitch   Energy  Duration
-1      -3      3

Word    Prominence
In      0
those   0
days    1
we      0
didn't  0
have    0
much    0

Explanation:
The energy and speaking rate is decreased to make the voice sound older. The pitch is slightly lowered since it is an old man. We only emphasize the word "days" slightly since the speaker is describing a particular time period.


# example 8
Target speaking style: A shy person
Target text: "I'm not sure what they think of me."

Here are my suggestions:

Pitch   Energy  Duration
-2      -2      2

Word    Prominence
I'm     0
not     1
sure    0
what    0
they    2
think   0
of      0
me      1

Explanation:
We decrease pitch, energy and speaking rate, which would be appropriate for a less confident speaker. We add a small prominence value to the word "not" to indicate the negation. We then add prominence to the word "they" since the speaker is shy and probably worries about what other people think of them. We also add a low prominence value to the word "me" since the speaker is talking about themselves.


# example 9
Target speaking style: A child waking up on Christmas morning
Target text: "Quick, let's go and open the presents!"

I would make the following adjustments:

Pitch   Energy  Duration
2       3       -2

Word    Prominence
Quick   3
let's   0
go      0
and     0
open    0
the     0
presents 3

Explanation:
We increase pitch, energy and speaking rate since that would be fitting for the excitement that the child must be experiencing. We add high prominence values to the words "Quick" and "presents". The word "Quick" is prominent because it indicates the urgency of the situation. The word "presents" is prominent because it explains why the situation is exciting.


# example 10
Target speaking style: A very tired person
Target text: "We should probably go to bed."

These are the changes I would make:

Pitch   Energy  Duration
-2      -4      3

Word        Prominence
We          0
should      0
probably    0
go          0
to          0
bed         0

Explanation:
We decrease pitch, energy and speaking rate significantly since the speaker is probably speaking with very low effort. We don't add any particular prominence for this utterance since the tone of the voice should get the message across.
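
Few-shot entries in the layout above could be generated from structured data, which also makes it easy to check that every value stays in the ranges stated in the prompt (-5 to 5 for pitch, energy and duration, 0 to 5 for prominence). The following Python sketch is illustrative only. The function name `format_example` and its parameters are assumptions, and the free-form sentence that precedes each table in the examples is omitted.

```python
import string
from typing import Sequence


def format_example(index: int, style: str, text: str,
                   pitch: int, energy: int, duration: int,
                   prominence: Sequence[int], explanation: str) -> str:
    """Render one few-shot example and validate its value ranges."""
    for name, value in (("pitch", pitch), ("energy", energy), ("duration", duration)):
        if not -5 <= value <= 5:
            raise ValueError(f"{name} must be in [-5, 5], got {value}")
    # Strip leading/trailing punctuation only, so "can't" and "let's" stay whole.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    if len(words) != len(prominence):
        raise ValueError("need exactly one prominence value per word")
    if any(not 0 <= p <= 5 for p in prominence):
        raise ValueError("prominence values must be in [0, 5]")
    # Pad the word column so the prominence values line up.
    width = max([len("Word")] + [len(w) for w in words]) + 1
    rows = "\n".join(f"{w:<{width}}{p}" for w, p in zip(words, prominence))
    return (
        f"# example {index}\n"
        f"Target speaking style: {style}\n"
        f'Target text: "{text}"\n\n'
        f"Pitch   Energy  Duration\n"
        f"{pitch:<8}{energy:<8}{duration}\n\n"
        f"{'Word':<{width}}Prominence\n"
        f"{rows}\n\n"
        f"Explanation:\n{explanation}"
    )
```

Validating the ranges before rendering keeps hand-written examples consistent with the scale described in the prompt.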