
TTS as a Service gRPC API

The Nuance TTS (Text to Speech) service is powered by the Nuance Vocalizer for Cloud (NVC) engine, which synthesizes speech from text. NVC works with Nuance Vocalizer for Enterprise (NVE) and Nuance voice packs to generate speech.

Nuance TTS as a service lets you request speech synthesis using gRPC from NVC engines running on Nuance-hosted machines. It works with a voice in your language and locale, with choices of gender and age.

The gRPC protocol provided by NVC allows a client application to request speech synthesis services in all the programming languages supported by gRPC.

gRPC is an open source RPC (remote procedure call) framework used to create services. It uses HTTP/2 for transport and protocol buffers to define message and service interfaces. NVC supports Protocol Buffers version 3, also known as proto3.

Version: v1

This release supports version v1 of the gRPC protocol.

The earlier protocol, v1beta1, is obsolete and no longer supported.

Prerequisites from Mix

Before developing your TTS gRPC application, you need a Nuance Mix project. This project provides credentials to run your application against the Nuance-hosted NVC engine.

  1. Create a Mix project and model: see Mix.nlu workflow to:

    • Create a Mix project.

    • Create, train, and build a model in the project. If you are using other Nuance "as a service" products (such as ASRaaS or NLUaaS), you may use the same Mix project for NVC. Your project must include a model even though it is not needed for your NVC application.

    • Create and deploy an application configuration for the project.

  2. Generate a "secret" and client ID of your Mix project: see Mix.dashboard Obtain authentication for services. Later you will use these credentials to request an authorization token to run your application.

  3. Learn the URL to call the TTS service: see Mix.dashboard Accessing a runtime service.

gRPC setup

Install gRPC for your programming language

$ python3 -m pip install --upgrade pip
$ python3 -m pip install grpcio
$ python3 -m pip install grpcio-tools

Generate client stubs

$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ synthesizer.proto

$ ls -1 syn*.py
synthesizer_pb2_grpc.py
synthesizer_pb2.py

The basic steps in using the NVC gRPC protocol are:

  1. Download the gRPC proto file here: synthesizer.proto. This file contains a generic version of the functions or classes that request speech synthesis from an NVC engine.

  2. Install gRPC for the programming language of your choice, including C++, Java, Python, Go, Ruby, C#, Node.js, and others. See gRPC Documentation for a complete list and instructions on using gRPC with each one.

  3. Generate client stub files in your programming language from the proto file using gRPC protoc. Depending on your programming language, the stubs may consist of one file or multiple files.

    These stub files contain the methods and fields from the proto file as implemented in your programming language. You will consult the stubs in conjunction with the proto file.

  4. Write your client app, referencing the functions or classes in the client stub files. See Client app development for details and a scenario.

  5. Run your client app to request synthesis. See Sample Python app.

Client app development

The gRPC protocol for NVC lets you create client applications for synthesizing text and obtaining information about available voices.

Sequence flow

The essential tasks are illustrated in the following high-level sequence flow of an application at run time.

(Sequence flow diagram)

Development steps

Authenticate and run simple Mix client (run-simple-mix-client.sh)

#!/bin/bash 

CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="9L4l...8oda"
export MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" \
"https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr nlu tts dlg' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"

./simple-mix-client.py --server_url 'tts.api.nuance.com:443' \
  --token $MY_TOKEN \
  --name 'Zoe-Ml' \
  --model 'enhanced' \
  --text "The wind was a torrent of darkness, among the gusty trees." \
  --output_audio_file 'highwayman.pcm' 

Simple Mix client (simple-mix-client.py)

#!/usr/bin/env python3

# Import functions
import sys
import grpc
import argparse
from synthesizer_pb2 import *
from synthesizer_pb2_grpc import *
from google.protobuf import text_format

# Define synthesis request
def create_synthesis_request(name, model, text, ssml, sample_rate, send_log_events, client_data):
    request = SynthesisRequest()

    request.voice.name = name
    request.voice.model = model

    pcm = PCM(sample_rate_hz=sample_rate)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)

    if text:
        request.input.text.text = text
    elif ssml:
        request.input.ssml.text = ssml
    else:
        raise RuntimeError("No input text or SSML defined.")

    request.event_params.send_log_events = send_log_events

    return request


def main():
    parser = argparse.ArgumentParser(
        prog="simple-mix-client.py",
        usage="%(prog)s [-options]",
        add_help=False,
        formatter_class=lambda prog: argparse.HelpFormatter(
            prog, max_help_position=45, width=100)
    )

    # Set arguments
    options = parser.add_argument_group("options")
    options.add_argument("-h", "--help", action="help",
                         help="Show this help message and exit")
    options.add_argument("--server_url", nargs="?",
                         help="Server hostname (default=localhost)", default="localhost:8080")
    options.add_argument("--token", nargs="?",
                         help="Authentication token", required=True)
    options.add_argument("--name", nargs="?", help="Voice name", required=True)
    options.add_argument("--model", nargs="?",
                         help="Voice model", required=True)
    options.add_argument("--sample_rate", nargs="?",
                         help="PCM sample rate (default=22050)", type=int, default=22050)
    options.add_argument("--text", nargs="?", help="Input text")
    options.add_argument("--ssml", nargs="?", help="Input SSML")
    options.add_argument("--send_log_events",
                         action="store_true", help="Subscribe to Log Events")
    options.add_argument("--output_audio_file", nargs="?",
                         help="Destination file path for synthesized audio")
    options.add_argument("--client_data", nargs="?",
                         help="Client information in key value pairs")

    args = parser.parse_args()

    # Create channel and stub 
    call_credentials = grpc.access_token_call_credentials(args.token)
    channel_credentials = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(), call_credentials)

    # Send request and process results
    with grpc.secure_channel(args.server_url, credentials=channel_credentials) as channel:
        stub = SynthesizerStub(channel)
        request = create_synthesis_request(name=args.name, model=args.model, text=args.text,
            ssml=args.ssml, sample_rate=args.sample_rate, send_log_events=args.send_log_events,
            client_data=args.client_data)
        stream_in = stub.Synthesize(request)
        audio_file = None
        try:
            if args.output_audio_file:
                audio_file = open(args.output_audio_file, "wb")               
            for response in stream_in:
                if response.HasField("audio"):
                    print("Received audio: %d bytes" % len(response.audio))
                    if audio_file:
                        audio_file.write(response.audio)
                elif response.HasField("events"):
                    print("Received events")
                    print(text_format.MessageToString(response.events))
                else:
                    if response.status.code == 200:
                        print("Received status response: SUCCESS")
                    else:
                        print("Received status response: FAILED")
                        print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                        print('Error: {}'.format(response.status.details))
        except Exception as e:
            print(e)
        if audio_file:
            print("Saved audio to {}".format(args.output_audio_file))
            audio_file.close()


if __name__ == '__main__':
    main()

This section describes how to implement the basic functionality of NVC in the context of a simple Python application, shown at the right.

This application synthesizes plain text or SSML input, streaming the audio back to the client and optionally creating an audio file containing the synthesized speech.

For a more sophisticated application to download and run, see Sample Python app.

Authenticate

Nuance Mix uses the OAuth 2.0 protocol for authentication. The client application must provide an access token to be able to access the NVC runtime service. The token expires after a short period of time so must be regenerated frequently.

Your client application uses the client ID and secret from the Mix Dashboard (see Prerequisites from Mix) to generate an authentication token from the Mix Authentication Service.

The token may be generated in several ways, either as part of the client application or as a script file. This Python example uses a Linux script to generate a token and store it in an environment variable. The token is then passed to the application, where it is used to create a secure connection to the TTS service.
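If you prefer to generate the token inside the client itself, the following is a minimal Python sketch of the same request the script makes with curl. It assumes the requests package is installed; the credential values are placeholders.

import requests

# Fetch an OAuth 2.0 access token from the Mix Authentication Service,
# mirroring the curl command in the shell script.
CLIENT_ID = "appID%3A...ENTER MIX CLIENT_ID..."
SECRET = "...ENTER MIX SECRET..."

resp = requests.post(
    "https://auth.crt.nuance.com/oauth2/token",
    auth=(CLIENT_ID, SECRET),
    data={"grant_type": "client_credentials", "scope": "tts"},
)
resp.raise_for_status()
token = resp.json()["access_token"]   # pass this as --token to the app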

Import functions

Import all functions from the NVC gRPC files, synthesizer*.*, generated in gRPC setup, along with other utilities.

Do not edit these synthesizer*.* files.

Set arguments

To provide essential information to the app and let users customize its operation, include arguments that the app can accept; see the options defined under "Set arguments" in the sample app.

Define synthesis request

Create a synthesis request using SynthesisRequest, including the arguments received from the end user. In this example, the request looks for a voice name and model plus the input to synthesize, either plain text or SSML.

The input is provided in the script file that runs the app; in run-simple-mix-client.sh, it is passed with the --text argument.

Features not shown in this simple application are described in other sections.

Create channel and stub

Create a secure gRPC channel and authenticate the application to the TTS service by providing the URL of the hosted service and an access token.

In many situations, the service URL and token are passed to the application as arguments. In this Python app, the URL is in the --server_url argument and the token is in --token.

Using this channel information, define a client stub function or class.

In some languages, this stub is defined in the generated client files: in Python it is named SynthesizerStub and in Go it is SynthesizerClient. In other languages, such as Java, you must create your own stub.

In the next step, you will call this stub and process the results.

Send request and process results

Finally, call the stub to send the synthesis request, then process the response (a stream of responses) using the fields in SynthesisResponse. For example, return the synthesized audio to the client as a stream, save it to an audio file, or both.

In this example, the audio is saved to a file named in the --output_audio_file argument.

If you prefer a non-streamed response, see Streamed vs. unary response.

If you have multiple requests, direct them all to the same channel and stub. For an example, see Sample Python app - Multiple requests.

Sample Python app

Download and extract the sample Python app

$ unzip sample-python-app.zip
Archive:  sample-python-app.zip
  inflating: mix-client.py
  inflating: flow.py
  inflating: run-mix-client.sh
  inflating: synthesizer.proto
  inflating: synthesizer_pb2.py
  inflating: synthesizer_pb2_grpc.py

$ ls -1
flow.py
mix-client.py
run-mix-client.sh
synthesizer.proto
synthesizer_pb2.py
synthesizer_pb2_grpc.py

$ python3 --version
Python 3.6.8 

The sample Python application consists of these files, which you may download here and extract; see the listing at the right.

To run this sample app, you need Python 3.6 or later. Use python3 --version to check which version you have.

You also need Mix authentication credentials (a client ID and secret) to let you call the Nuance-hosted NVC engine. See Prerequisites from Mix.

You can use the application to check for available voices and/or request synthesis. Here are a few scenarios you can try.

Run app for help

Results from help request

$ ./mix-client.py -h
usage: mix-client.py [-options]

options:
  -h, --help                      Show this help message and exit
  -f file [file ...], --files file [file ...]
                                  List of flow files to execute sequentially,
                                  default=['flow.py']
  -p, --parallel                  Run each flow in a separate thread
  -i [num], --iterations [num]    Number of times to run the list of files, default=1
  -s [url], --serverUrl [url]     Mix TTS server URL, default=tts.api.nuance.com
  --token [token]                 Authentication token (required)
  --saveAudio                     Save whole audio to disk
  --saveAudioChunks               Save each individual audio chunk to disk
  --saveAudioAsWav                Save each audio file in WAVE format
  --sendUnary                     Receive one response (UnarySynthesize) instead of a
                                  stream of responses (Synthesize)
  --maxReceiveSizeMB [megabytes]  Maximum length of gRPC server response in megabytes,
                                  default=50 MB

For a quick check that the application is working, and to see the arguments it accepts, run the client app directly using the help (-h or --help) option.

$ ./mix-client.py -h

See the results at the right and notice the defaults: the input files default to flow.py and the server URL to tts.api.nuance.com.

The defaults for these two arguments mean you do not need to specify an input file or a server URL as you run the sample app.

Run app for voices

Results from get-voices request

$ ./run-mix-client.sh
2020-09-09 13:46:27,629 (140276734273344) INFO  Iteration #1
2020-09-09 13:46:27,638 (140276734273344) DEBUG Creating secure gRPC channel
2020-09-09 13:46:27,640 (140276734273344) INFO  Running file [flow.py]
2020-09-09 13:46:27,640 (140276734273344) DEBUG [voice {
  language: "en-us"
}
]
2020-09-09 13:46:27,640 (140276734273344) INFO  Sending GetVoices request
2020-09-09 13:46:27,976 (140276734273344) INFO  voices {
  name: "Ava-Mls"
  model: "enhanced"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.1"
}
...
voices {
  name: "Evan"
  model: "enhanced"
  language: "en-us"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.1.1"
}
voices {
  name: "Nathan"
  model: "enhanced"
  language: "en-us"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.0.1"
}
...
voices {
  name: "Zoe-Ml"
  model: "enhanced"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.0.2"
}

2020-09-09 13:46:27,977 (140276734273344) INFO  Done running file [flow.py]
2020-09-09 13:46:27,977 (140276734273344) INFO  Iteration #1 complete
2020-09-09 13:46:27,978 (140276734273344) INFO  Done

When you ask NVC to synthesize text, you must specify a named voice. To learn which voices are available, send a get-voices request, entering your requirements in the flow.py input file.

  1. Make sure your current directory contains the app files plus the gRPC files.

    flow.py
    mix-client.py
    run-mix-client.sh
    synthesizer_pb2_grpc.py
    synthesizer_pb2.py
    synthesizer.proto
    

  2. Edit the run script, run-mix-client.sh, to add your CLIENT_ID and SECRET. These are your Mix credentials as described in Authenticate.

    #!/bin/bash
     
    CLIENT_ID="appID%3A...ENTER MIX CLIENT_ID..."
    SECRET="...ENTER MIX SECRET..."
    export MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" \
    "https://auth.crt.nuance.com/oauth2/token" \
    -d 'grant_type=client_credentials' -d 'scope=asr nlu tts' \
    | python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"
     
    ./mix-client.py --token $MY_TOKEN --saveAudio --saveAudioAsWav
    

  3. Edit the input file, flow.py, to request all American English voices, and turn off synthesis.

    from synthesizer_pb2 import *
     
    list_of_requests = []
     
    # GetVoices request
    request = GetVoicesRequest()
    #request.voice.name = "Evan"
    request.voice.language = "en-us"     # Request all en-us voices
     
    # Add request to list
    list_of_requests.append(request)     # Enable voice request
     
    # ---
     
    # Synthesis request
    ... 
    #Add request to list
    #list_of_requests.append(request)    # Disable synthesis with #
     
    # ---
    

  4. Run the application using the script file.

    $ ./run-mix-client.sh
    

See the results at the right.

Get more voices

You can experiment with this request: for example, to see all available voices, remove or comment out all the request.voice lines, leaving only the main GetVoicesRequest.

# GetVoices request
request = GetVoicesRequest()            # Keep only this line
#request.voice.name = "Evan"
#request.voice.language = "en-us"

The results include all voices available from the Nuance-hosted NVC service.

Run app for synthesis

Results from synthesis request (some events are omitted)

$ ./run-mix-client.sh
2020-09-09 13:58:52,142 (140022203164480) INFO  Iteration #1
2020-09-09 13:58:52,151 (140022203164480) DEBUG Creating secure gRPC channel
2020-09-09 13:58:52,153 (140022203164480) INFO  Running file [flow.py]
2020-09-09 13:58:52,153 (140022203164480) DEBUG [voice {
  name: "Evan"
}
, voice {
  name: "Evan"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
  volume_percentage: 80
  speaking_rate_factor: 1.0
  audio_chunk_duration_ms: 2000
}
input {
  text {
    text: "This is a test. A very simple test."
  }
}
event_params {
  send_log_events: true
}
user_id: "MyApplicationUser"
]
2020-09-09 13:58:52,154 (140022203164480) INFO  Sending GetVoices request
2020-09-09 13:58:52,303 (140022203164480) INFO  voices {
  name: "Evan"
  model: "enhanced"
  language: "en-us"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.1.1"
}

2020-09-09 13:58:52,303 (140022203164480) INFO  Sending Synthesis request
2020-09-09 13:58:52,351 (140022203164480) INFO  Received events
2020-09-09 13:58:52,352 (140022203164480) INFO  events {...}

2020-09-09 13:58:52,538 (140022203164480) INFO  Received audio: 57484 bytes
2020-09-09 13:58:52,538 (140022203164480) INFO  Received events
2020-09-09 13:58:52,538 (140022203164480) INFO  events {...}

2020-09-09 13:58:52,658 (140022203164480) INFO  Received audio: 70432 bytes
2020-09-09 13:58:52,658 (140022203164480) INFO  Received events
2020-09-09 13:58:52,659 (140022203164480) INFO  events {...}

2020-09-09 13:58:52,663 (140022203164480) INFO  Received status response: SUCCESS
2020-09-09 13:58:52,664 (140022203164480) INFO  Wrote audio to flow.py_i1_s1.wav
2020-09-09 13:58:52,664 (140022203164480) INFO  Done running file [flow.py]
2020-09-09 13:58:52,665 (140022203164480) INFO  Iteration #1 complete
2020-09-09 13:58:52,665 (140022203164480) INFO  Done

Once you know the voice you want to use, you can ask NVC to synthesize a simple test string and save the resulting audio in a WAV file. Again enter your requirements in flow.py.

  1. Look at run-mix-client.sh and notice the --saveAudio and --saveAudioAsWav arguments. There is no need to include the --files argument since flow.py is the default input filename.

    #!/bin/bash
     
    CLIENT_ID="appID%3A..." 
    SECRET="..."
    export MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" \
    "https://auth.crt.nuance.com/oauth2/token" \
    -d 'grant_type=client_credentials' -d 'scope=asr nlu tts' \
    | python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"
     
    ./mix-client.py --token $MY_TOKEN --saveAudio --saveAudioAsWav
    

  2. Edit flow.py to verify that your voice is available, then request synthesis using that voice.

    from synthesizer_pb2 import *
     
    list_of_requests = []
     
    # GetVoices request
    request = GetVoicesRequest()
    request.voice.name = "Evan"         #  Request a specific voice 
     
    # Add request to list
    list_of_requests.append(request)
     
    # ---
     
    # Synthesis request
    request = SynthesisRequest()
     
    request.voice.name = "Evan"         # Request synthesis using that voice
    request.voice.model = "enhanced"
     
    pcm = PCM(sample_rate_hz=22050)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)
     
    request.audio_params.volume_percentage = 80
    request.audio_params.speaking_rate_factor = 1.0
    request.audio_params.audio_chunk_duration_ms = 2000
     
    request.input.text.text = "This is a test. A very simple test."
     
    request.event_params.send_log_events = True
     
    request.user_id = "MyApplicationUser"
     
    #Add request to list
    list_of_requests.append(request)    # Enable synthesis request
     
    # ---
    

  3. Run the application using the script file.

    $ ./run-mix-client.sh
    

See the results at the right and notice the audio file created: flow.py_i1_s1.wav.

Multiple requests

Results from multiple synthesis

$ ./run-mix-client.sh
2020-09-27 14:26:27,209 (140665436571456) INFO  Iteration #1
2020-09-27 14:26:27,219 (140665436571456) DEBUG Creating secure gRPC channel
2020-09-27 14:26:27,221 (140665436571456) INFO  Running file [flow.py]
2020-09-27 14:26:27,221 (140665436571456) DEBUG [voice {
  name: "Evan"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
}
input {
  text {
    text: "This is a test. A very simple test."
  }
}
, 2, voice {
  name: "Evan"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
}
input {
  text {
    text: "Your coffee will be ready in 5 minutes."
  }
}
, 2, voice {
  name: "Zoe-Ml"
  model: "enhanced"
}
audio_params {
  audio_format {
    pcm {
      sample_rate_hz: 22050
    }
  }
}
input {
  text {
    text: "The wind was a torrent of darkness, among the gusty trees."
  }
}
]
2020-09-27 14:26:27,221 (140665436571456) INFO  Sending Synthesis request
2020-09-27 14:26:27,643 (140665436571456) INFO  Received audio: 57484 bytes
2020-09-27 14:26:27,672 (140665436571456) INFO  Received audio: 70432 bytes
2020-09-27 14:26:27,673 (140665436571456) INFO  Received status response: SUCCESS
2020-09-27 14:26:27,673 (140665436571456) INFO  Wrote audio to flow.py_i1_s1.wav
2020-09-27 14:26:27,673 (140665436571456) INFO  Waiting for 2 seconds
2020-09-27 14:26:29,675 (140665436571456) INFO  Sending Synthesis request
2020-09-27 14:26:29,852 (140665436571456) INFO  Received audio: 44756 bytes
2020-09-27 14:26:29,882 (140665436571456) INFO  Received audio: 67030 bytes
2020-09-27 14:26:29,882 (140665436571456) INFO  Received status response: SUCCESS
2020-09-27 14:26:29,883 (140665436571456) INFO  Wrote audio to flow.py_i1_s2.wav
2020-09-27 14:26:29,883 (140665436571456) INFO  Waiting for 2 seconds
2020-09-27 14:26:31,885 (140665436571456) INFO  Sending Synthesis request
2020-09-27 14:26:32,042 (140665436571456) INFO  Received audio: 24926 bytes
2020-09-27 14:26:32,042 (140665436571456) INFO  Received audio: 11942 bytes
...
2020-09-27 14:26:32,101 (140665436571456) INFO  Received audio: 59466 bytes
2020-09-27 14:26:32,101 (140665436571456) INFO  Received status response: SUCCESS
2020-09-27 14:26:32,102 (140665436571456) INFO  Wrote audio to flow.py_i1_s3.wav
2020-09-27 14:26:32,102 (140665436571456) INFO  Done running file [flow.py]
2020-09-27 14:26:32,102 (140665436571456) INFO  Iteration #1 complete
2020-09-27 14:26:32,102 (140665436571456) INFO  Done

You can send multiple requests for synthesis (and/or get voices) in the same session. For efficient communication with the NVC server, all requests use the same channel and stub. This scenario sends three synthesis requests.

  1. Edit flow.py to add two more synthesis requests. (You may keep the get-voices request or remove it.) Optionally pause for a couple of seconds after each synthesis request.

    from synthesizer_pb2 import *
     
    list_of_requests = []
     
    # Synthesis request 
    request = SynthesisRequest()         # First request 
    request.voice.name = "Evan"
    request.voice.model = "enhanced"
    pcm = PCM(sample_rate_hz=22050)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)
    request.input.text.text = "This is a test. A very simple test."
    list_of_requests.append(request)
    list_of_requests.append(2)           # Optionally pause after request 
     
    # Synthesis request 
    request = SynthesisRequest()         # Second request 
    request.voice.name = "Evan" 
    request.voice.model = "enhanced"
    pcm = PCM(sample_rate_hz=22050)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)
    request.input.text.text = "Your coffee will be ready in 5 minutes."
    list_of_requests.append(request)
    list_of_requests.append(2)           # Optionally pause after request 
     
    # Synthesis request 
    request = SynthesisRequest()         # Third request 
    request.voice.name = "Zoe-Ml"
    request.voice.model = "enhanced"
    pcm = PCM(sample_rate_hz=22050)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)
    request.input.text.text = "The wind was a torrent of darkness, among the gusty trees."
    list_of_requests.append(request)
    

  2. Run the application using the script file.

    $ ./run-mix-client.sh
    

See the results at the right and notice the three audio files created: flow.py_i1_s1.wav, flow.py_i1_s2.wav, and flow.py_i1_s3.wav.

Use different input

The input in this example is plain text ("This is a test...") but you can also provide input in the form of SSML and control codes. See Reference topics - Input to synthesize, which includes details plus examples that you can use in this sample application.

By default, the synthesized voice is streamed back to the application but you may request a unary (non-streamed, single package) synthesis response using the --sendUnary argument. See Streamed vs. unary response.

What's list_of_requests?

The application expects all input files to declare a global array named list_of_requests. It sequentially processes the requests contained in that array.

You may optionally instruct the application to wait a number of seconds between requests, by appending a number value to list_of_requests. For example:

list_of_requests.append(request1)
list_of_requests.append(1.5)
list_of_requests.append(request2)

Once request1 is complete, the application pauses for 1.5 seconds before executing request2.

Reference topics

This section provides more information about topics in the gRPC API.

Status codes

Code Message Indicates
200 Success Synthesis completed successfully.
400 Bad request A malformed or unsupported client request was rejected.
403 Forbidden A restricted voice was requested but you are not authorized to use it.
500 Internal server error An unknown error has occurred on the server.
502 Resource error An error has occurred with a synthesis resource.

Streamed vs. unary response

One request, two possible responses

service Synthesizer {
  rpc Synthesize(SynthesisRequest) returns (stream SynthesisResponse) {} 
  rpc UnarySynthesize(SynthesisRequest) returns (UnarySynthesisResponse) {}
. . .
message SynthesisRequest { 
  Voice voice = 1;  
  AudioParameters audio_params = 2; 
  Input input = 3;   
  EventParameters event_params = 4;  
  map<string, string> client_data = 5; 
}

message SynthesisResponse {
  oneof response {
    Status status = 1;   
    Events events = 2;   
    bytes audio = 3;     
  }
}

message UnarySynthesisResponse {  
  Status status = 1;   
  Events events = 2;   
  bytes audio = 3;     
}

NVC offers two types of synthesis response: a streamed response available in SynthesisResponse and a non-streamed response in UnarySynthesisResponse.

The request is the same in both cases: SynthesisRequest specifies a voice, the input text to synthesize, and optional parameters. The response is either a stream of SynthesisResponse messages or a single UnarySynthesisResponse message.
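To illustrate the difference in client code, here is a minimal sketch, assuming a stub and request created as in simple-mix-client.py; process_chunk and save_audio are hypothetical helpers.

# Streamed response: Synthesize returns an iterator of SynthesisResponse
# messages, so audio arrives in chunks as it is synthesized.
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        process_chunk(response.audio)

# Unary response: UnarySynthesize returns a single UnarySynthesisResponse
# holding the status, any events, and the complete audio.
response = stub.UnarySynthesize(request)
if response.status.code == 200:
    save_audio(response.audio)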

Defaults

The proto file provides the following defaults for messages in SynthesisRequest. Mandatory fields are noted in the table.

Items in SynthesisRequest Default
    voice (Voice)  
    name Mandatory, e.g. 'Evan'
    model Mandatory, e.g. 'enhanced'
    age_group (EnumAgeGroup) ADULT
    gender (EnumGender) ANY
    audio_params (AudioParameters)  
    audio_format (AudioFormat) PCM 22.5kHz
    volume_percentage 80
    speaking_rate_factor 1.0
    audio_chunk_duration_ms 20000 (20 seconds)
    target_audio_length_ms 0, meaning no maximum duration
    disable_early_emission False: Send audio segments as soon as possible
    input (Input)  
    text (Text) Mandatory: one of text, tokenized_sequence, or ssml
    tokenized_sequence (TokenizedSequence)
    ssml (SSML)
      ssml_validation_mode (EnumSSMLValidationMode) STRICT
    escape_sequence \! and <ESC>
    resources (SynthesisResource)  
      type (EnumResourceType) USER_DICTIONARY
    lid_params (LanguageIdentificationParameters)
      disable False: LID is turned on
      languages Blank, meaning use all available languages
    always_use_highest_confidence False: Use highest language with any confidence score
    download_params (DownloadParameters)  
      headers Blank
      refuse_cookies False: Accept cookies
      request_timeout_ms NVC server default, usually 30000 (30 seconds)
    event_params (EventParameters)  
    send_sentence_marker_events False: Do not send
    send_word_marker_events False: Do not send
    send_phoneme_marker_events False: Do not send
    send_bookmark_marker_events False: Do not send
    send_paragraph_marker_events False: Do not send
    send_visemes False: Do not send
    send_log_events False: Do not send
    suppress_input False: Include text and URIs in logs
    client_data Blank
    user_id Blank
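For example, a request can rely on these defaults and override only what it needs. This sketch, using the message classes from synthesizer_pb2 as in flow.py, sets only the mandatory voice fields, the input, and a non-default volume.

request = SynthesisRequest()
request.voice.name = "Evan"                   # mandatory
request.voice.model = "enhanced"              # mandatory
request.input.text.text = "This is a test."   # mandatory input
request.audio_params.volume_percentage = 90   # override the default of 80
# Everything else keeps the defaults above, e.g. PCM 22.5kHz audio and
# speaking_rate_factor 1.0, with no marker events sent.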

Input to synthesize

Plain text input

SynthesisRequest (
    voice = Voice (
        name = "Evan",
        model = "enhanced"
     ),
     input.text.text = "Your coffee will be ready in 5 minutes"
)

SSML input

SynthesisRequest (
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    input = Input (
        ssml = SSML (
            text = '<?xml version="1.0"?><speak  xmlns="http://www.w3.org/2001/10/synthesis" 
xml:lang="en-US" version="1.0"><prosody volume="10">I can speak rather quietly,</prosody>
<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        ) 
    )
)

Tokenized sequence input

SynthesisRequest (
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    input = Input (
        tokenized_sequence = TokenizedSequence (
            tokens = [
                Token (text = "My name is "),
                Token (control_code = ControlCode (
                    key = "pause", value = "300")),
                Token (text = "Jeremiah Jones")
            ]
        )
    )
)   

You provide the text for NVC to synthesize in one of three ways using the Input message: as plain text (text), as SSML (ssml), or as a tokenized sequence of text and control codes (tokenized_sequence), as shown in the examples at the right.

If you are using the sample Python app and its input file, flow.py, enter the different types of input as request.input lines. (When flow.py contains multiple request.input sections, it executes only the last uncommented section.)
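For instance, the three input types could appear in flow.py as alternative request.input lines, with only one left uncommented; the SSML and tokens here are abbreviated from the examples above.

# Plain text input
request.input.text.text = "Your coffee will be ready in 5 minutes"

# SSML input
#request.input.ssml.text = '<?xml version="1.0"?><speak ...> ... </speak>'

# Tokenized sequence input
#request.input.tokenized_sequence.tokens.extend([
#    Token (text = "My name is "),
#    Token (control_code = ControlCode (key = "pause", value = "300")),
#    Token (text = "Jeremiah Jones")
#])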

SSML tags

Generic example

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
Text before control code. 
<prosody volume="10">Text following or affected by control code.</prosody> 
</speak>

SSML elements may be included when using the input type Input - SSML. These tags indicate how the text segments within the tag should be spoken.

See Control codes to accomplish the same type of control in tokenized sequence input.

NVC supports the following SSML elements and attributes in SSML input. For details about these items, see SSML Specification 1.0. Note that NVC does not support all SSML elements and attributes listed in the W3C specification.

xml version

xml declaration and speak

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US" version="1.0">
Input text and tags</speak>

The XML declaration specifies the XML version, 1.0.

speak

The root SSML element. Mandatory. It contains the required attributes, xml:lang and version, and encloses text to be synthesized along with these optional elements: audio, break, lexicon, mark, p, prosody, say-as, and s.

audio

audio

Say your name at the beep. <audio src="/tmp/recordings/beep.wav" fetchtimeout="30s"/>

The audio element inserts a digital audio recording at the current location. The src attribute specifies the location of the recording as a URI or local file.

NVC supports headerless WAV files containing 16-bit PCM samples.

The audio element supports extra attributes to control internet fetching, as described in the VoiceXML 2.0 specification for this element, such as fetchtimeout in the example at the right.

NVC does not support the audio expr attribute defined in the VoiceXML 2.0 specification.

break

break

His name is <break time="300ms"/> Michael.

Tom lives in New York City. So does John. He's at 180 Park Ave. <break strength="none"/> Room 24

The break element controls pausing between words, overriding the default breaks based on punctuation in the text. The break tag has two optional attributes, time and strength, shown in the examples at the right.

lexicon

lexicon

<lexicon uri="/tmp/lexicons/american_english.bdc" xml:id="En-us dict" 
type="application/edct-bin-dictionary"/>

The lexicon element loads user dictionaries, rulesets, and so on. For details, see the SSML specification - lexicon, but the principal attributes are uri, xml:id, and type, as shown in the example.

NVC parses all lexicon elements and loads tuning data before starting text-to-speech conversion. This tuning data is unloaded when the last sample buffer is generated, or when the TTS process is stopped, so lexicon elements only affect the current synthesis request.

mark

mark

This bookmark <mark name="bookmark1"/> marks a reference point. 
Another <mark name="bookmark2"/> does the same.

The mark element inserts a bookmark that is returned in the results. The value can be any string.

p

p

<p>Welcome to Vocalizer.</p>
<p>Vocalizer is a state-of-the-art text to speech system.</p>

The p element indicates a paragraph break. A paragraph break is equivalent to break strength="x-strong".

prosody

The prosody element specifies intonation in the generated voice using several attributes. You may combine multiple attributes within the same prosody element.

prosody - rate

prosody - rate

This is my normal speaking rate. 
<prosody rate="+50%"> But I can speed up the rate.</prosody>
<prosody rate="-25%">Or I can slow it down.</prosody>

Prosody rate sets the speaking rate as a keyword, a number (0-100), or a relative percentage (+/-n%). The keywords are x-slow, slow, medium (the default), fast, and x-fast.

prosody - timbre

prosody - timbre

This is the normal timbre of my voice. 
<prosody timbre="young"> I can sound a bit younger. </prosody> 
<prosody timbre="old" rate="-10%"> Or older and hopefully wiser. </prosody>

Prosody timbre changes the speaking voice to sound bigger and older (lower values) or smaller and younger (higher values). Not supported for all languages. The value is a keyword, a number (50-200, default is 100), or a relative percentage (+/-n%). The keywords include young and old, as in the example.

prosody - volume

prosody - volume

This is my normal speaking volume. 
<prosody volume="-50%">I can also speak rather quietly,</prosody> 
<prosody volume="+50%"> or also very loudly.</prosody>

Prosody volume changes the speaking volume. The value is a keyword, a number (0-100), or a relative percentage (+/-n%). The keywords are: silent, x-soft, soft, medium (default), loud, or x-loud.

s

s

<s>The wind was a torrent of darkness, among the gusty trees</s>
<s>The moon was a ghostly galleon, tossed upon cloudy seas</s>

The s element indicates a sentence break. A sentence break is equivalent to break strength="strong".

say-as

say-as

<say-as interpret-as="address">Apt. 7-12, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715</say-as>

<say-as interpret-as="currency">12USD</say-as>

<say-as interpret-as="date">11/21/2020</say-as>

<say-as interpret-as="name">Care Telecom Ltd</say-as>

<say-as interpret-as="number">1343455</say-as>

<say-as interpret-as="ordinal">12th</say-as>

<say-as interpret-as="phone">1-800-688-0068</say-as>

<say-as interpret-as="raw">app.</say-as>

<say-as interpret-as="sms">CU :-)</say-as>

<say-as interpret-as="spell" format="alphanumeric">a34y - 347</say-as>

<say-as interpret-as="spell" format="strict">a34y - 347</say-as>

<say-as interpret-as="state">FL</say-as>

<say-as interpret-as="streetname">Emerson Rd.</say-as>

<say-as interpret-as="streetnumber">11001-11010</say-as>

<say-as interpret-as="time">10:00</say-as>

<say-as interpret-as="zip">01803</say-as>

The say-as element controls how to say specific types of text, using the interpret-as attribute to specify a value and (in some cases) a format. A wide range of input is accepted for most values. The values, illustrated at the right, are: address, currency, date, name, number, ordinal, phone, raw, sms, spell (with format alphanumeric or strict), state, streetname, streetnumber, time, and zip.

style

style

Hello, this is Samantha. <style name="lively">Hope you’re having a nice day!</style>

Hello, this is Samantha. <style name="lively">Hope you’re having a nice day!</style>
<voice name="nathan">Hello, this is Nathan.</voice>

The style element sets the speaking style of the voice. Values for name depend on the voice but are usually neutral, lively, forceful, and apologetic. The default depends on the voice. If you request a style that the voice does not support, there is no effect.

The first example reads "Hello, this is Samantha" in Samantha's default style, then switches to lively style to say "Hope you're having a nice day!"

The style resets to default at the end of the synthesis request or if it encounters a change of voice. The second example continues with Nathan in default style saying "Hello, this is Nathan."

voice

voice

<voice name="samantha">Hello, this is Samantha. </voice>
<voice name="tom">Hello, this is Tom.</voice>

The voice element changes the speaking voice, which also forces a sentence break. Values for name are the voices available to the session.

Control codes

Tokenized sequence structure

SynthesisRequest - Input - TokenizedSequence - 
    Token - text "Text before control code"
    Token - ControlCode (key="code name", value="code value")
    Token - text "Text following or affected by control code"

Generic example

request.input.tokenized_sequence.tokens.extend ([
    Token (text = "Text before control code"),
    Token (control_code=ControlCode (key="code name", value="code value")),
    Token (text = "Text following or affected by control code")
])

Control codes, sometimes known as control sequences, may be included in the input text when using the input type Input - TokenizedSequence. These codes indicate how the text segments following the code should be spoken.

See SSML tags to accomplish the same types of control in SSML input.

Nuance supports the following control codes and values in TokenizedSequence.

audio

audio

Token - text "Say your name at the beep."  
Token - ControlCode (key="audio", value="/tmp/recordings/beep.wav") 

The audio code inserts a digital audio recording at the current location. The value specifies the location of the recording as a URI or local file.

NVC supports headerless WAV files containing 16-bit PCM samples.

The audio code supports extra attributes to control internet fetching, as described in the VoiceXML Version 2.0 specification for this element.

NVC does not support the audio expr attribute defined in the VoiceXML 2.0 specification.

eos

eos

Token - text "Tom lives in the U.S."  
Token - ControlCode (key="eos", value="1") 
Token - text "So does John. 180 Park Ave."
Token - ControlCode (key="eos", value="0")
Token - text "Room 24"

The eos code controls end-of-sentence detection. Values are: 1 to force a sentence break, and 0 to suppress a sentence break.

To disable automatic end-of-sentence detection for a block of text, use readmode explicit_eos.

lang

lang

Token - text "The name of the song is. " 
Token - ControlCode (key="lang", value="unknown")
Token - text "Mon pays, c'est l'hiver."
Token - ControlCode (key="lang", value="normal")
Token - text "It's a famous Quebec song meaning my country is winter."

The lang code labels text identified as from an unknown language. Values are unknown and normal.

The value lang unknown labels all text from that position (up to a lang normal or the end of input) as being from an unknown language. NVC then uses its language identification feature on a sentence-by-sentence basis to determine the language, and switches to a voice for that language if necessary. The original voice is restored at the next lang normal or the end of the synthesis request.

See LanguageIdentificationParameters.

Language identification is only supported for a limited set of languages.
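As a hedged sketch only: assuming lid_params follows the fields in the Defaults table, with languages taking three-letter codes like the language_tlw values returned by GetVoices, restricting language identification in flow.py might look like this.

# Unverified sketch: limit language identification to American English
# and Canadian French (codes assumed from language_tlw conventions).
request.lid_params.disable = False
request.lid_params.languages.extend(["enu", "frc"])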

mrk

mrk

Token - ControlCode (key="mrk", value="important")
Token - text "This is an important point"

The mrk code inserts a bookmark that is returned in the results. The value can be any name.

pause

pause

Token - text "My name is"  
Token - ControlCode (key="pause", value="300")  
Token - text "Jeremiah Jones"

The pause code inserts a pause of a specified duration in milliseconds. Values are from 1 to 65535.

para

para

Token - text "Introduction to Vocalizer"
Token - ControlCode (key="para")
Token - text "Vocalizer is a state-of-the-art text-to-speech system."

The para code indicates a paragraph break and implies a sentence break. The difference between this and eos 1 (end of sentence) is that this triggers the delivery of a paragraph mark event.

prompt

prompt

Token - ControlCode (key="prompt", value="banking::confirm_account_number")
Token - text "Thanks"

The prompt code inserts an ActivePrompt at a specific location in the text. The value is the name of the prompt within an ActivePrompt database.

To use an ActivePrompt database, you must first load it into the session using SynthesisRequest - Input - SynthesisResource - EnumResourceType - ACTIVEPROMPT_DB.
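As a sketch modeled on the user dictionary example under Synthesis resources, loading an ActivePrompt database inline might look like this; the file path is hypothetical.

# Load an ActivePrompt database as an inline synthesis resource, then
# reference its prompts with the prompt control code.
prompt_db = SynthesisResource()
prompt_db.type = EnumResourceType.ACTIVEPROMPT_DB
prompt_db.body = open('/path/to/prompts.db', 'rb').read()
request.input.resources.extend([prompt_db])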

rate

rate

Token - text "I can"
Token - ControlCode (key="rate", value="75")
Token - text "speed up the rate"
Token - ControlCode (key="rate", value="25")
Token - text "or slow it down"

The rate code sets the speaking rate as a percentage of the default speaking rate. Values are from 1 to 100, with 50 as the default rate.

readmode

readmode

Token - ControlCode (key="readmode", value="sent")
Token - text "Please buy green apples. You can also get pears."
Token - ControlCode (key="readmode", value="char")
Token - text "Apples"
Token - ControlCode (key="readmode", value="word")
Token - text "Please buy green apples."
Token - ControlCode (key="readmode", value="line")
Token - text "Bananas. Low-fat milk. Whole wheat flour."
Token - ControlCode (key="readmode", value="explicit_eos")
Token - text "Bananas. Low-fat milk. Whole wheat flour."

The readmode code changes the reading mode from sentence mode (the default) to specialized modes. Values are the modes sent, char, word, line, and explicit_eos, demonstrated at the right.

Return to readmode sent after the specialized read mode.

rst

rst

Token - ControlCode (key="vol", value="10")
Token - text "The volume is set to a low value"
Token - ControlCode (key="rst")
Token - text "Now it is reset to its default value"

The rst code resets all codes to the default values.

spell

spell

Token - ControlCode (key="tn", value="spell")
Token - ControlCode (key="spell", value="200")
Token - text "a134b"
Token - ControlCode (key="tn", value="normal")

The spell code sets the inter-character pause, in milliseconds, for tn - spell. Values are from 1 to 65535.

style

style

Token - text "Hello, this is Samantha.
Token - ControlCode (key="style", value="lively")
Token - text "Hope you're having a nice day!"

Token - text "Hello, this is Samantha.
Token - ControlCode (key="style", value="lively")
Token - text "Hope you're having a nice day!"
Token - ControlCode (key="voice", value="nathan")
Token - text "Hello, this is Nathan.

The style code sets the speaking style of the voice. Values depend on the voice but are usually neutral, lively, forceful, and apologetic. The default is usually neutral. If you request a style that the voice does not support, there is no effect.

The first example reads "Hello, this is Samantha" in Samantha's default style, then switches to lively style to say "Hope you're having a nice day!"

The style resets to default at the end of the synthesis request or if it encounters a change of voice. The second example continues with Nathan in default style saying "Hello, this is Nathan."

timbre

timbre

Token - ControlCode (key="timbre", value="180")
Token - text "I can sound quite young"
Token - ControlCode (key="timbre", value="50")
Token - text "Or I can sound old and maybe wise."
Token - ControlCode (key="tn", value="normal")

The timbre code changes the speaking voice to sound bigger and older (lower values) or smaller and younger (higher values). Values are between 50 and 200, and 100 is typical.

Not supported for all languages.

tn

The tn code guides text normalization. Values are the text types listed below. After a specially normalized fragment, apply another tn code or return to tn - normal.

tn - address

tn - address

Token - ControlCode (key="tn", value="address")
Token - text "Apt. 7-12, 28 N. Whitney St., Saint Augustine Beach, FL 32084-6715
Token - ControlCode (key="tn", value="normal")  

Full name and address

Token - ControlCode (key="tn", value="name")  
Token - text "Aardvark & Sons Co. Inc."
Token - ControlCode (key="tn", value="address")
Token - text "123 E. Forest Ave., Portland, ME 04103
Token - ControlCode (key="tn", value="normal")  

The tn - address code provides optimal reading for complete postal addresses.

Do not include the name portion of the address to avoid undesired expansions of name-specific abbreviations. Instead, include the name in a separate tn - name section prior to the tn - address.

For example, the full name and address at the right is read as: "Aardvark and Sons Company Incorporated, one two three East Forest Avenue, Portland, Maine, zero four one zero three."

tn - alphanumeric

The tn - alphanumeric code is an alias of tn - spell:alphanumeric.

tn - boolean

tn - boolean

Token - ControlCode (key="tn", value="boolean")  
Token - text = "true"
Token - ControlCode (key="tn", value="normal")  

The tn - boolean code reads boolean values (true, false, yes, no) by spelling them out. This example spells out "T R U E."

tn - cardinal

The tn - cardinal code is an alias of tn - number.

tn - characters

The tn - characters code is an alias of tn - spell:alphanumeric.

tn - currency

tn - currency

Token - ControlCode (key="tn", value="currency")  
Token - text = "123.45USD"
Token - ControlCode (key="tn", value="normal")  

The tn - currency code reads text as currency. For example, "123.45USD" is read as "one hundred twenty three U S dollars and forty five cents."

tn - date

tn - date

Token - ControlCode (key="tn", value="date")  
Token - text = "11/21/1984"
Token - ControlCode (key="tn", value="normal")  

The tn - date code reads text as a date. For example, "11/21/1984" is read as "November twenty-first, nineteen eighty four."

The precise output is determined by the voice, and ambiguous dates are interpreted according to the conventions of the voice's locale. For example, "05/12/2020" is read by an American English voice as "May twelfth two thousand twenty" and by a British English voice as "the fifth of December two thousand and twenty."

tn - digits

The tn - digits code is an alias of tn - number.

tn - name

tn - name

Token - ControlCode (key="tn", value="name")  
Token - text = "Care Telecom Ltd"

The tn - name code gives correct reading of names.

tn - number

tn - number

Token - ControlCode (key="tn", value="number")  
Token - text = "1343455"
Token - ControlCode (key="tn", v="normal")  

The tn - number code reads the text as cardinal numbers, or digits. For example "1343455" is read as "one million three hundred forty three thousand four hundred and fifty five."

tn - normal

The tn - normal code returns to generic normalization following a text fragment that is normalized in a special way. All the examples in this tn section include tn - normal following the specific normalization segment.

tn - ordinal

tn - ordinal

Token - ControlCode (key="tn", value="ordinal")  
Token - text "12th"
Token - ControlCode (key="tn", value="normal")  

The tn - ordinal code reads positional numbers such as 1st, 2nd, 3rd, and so on.

tn - phone

tn - phone

Token - ControlCode (key="tn", value="phone")  
Token - text = "1-H800-688-0068"
Token - ControlCode (key="tn", value="normal")  

The tn - phone code reads telephone numbers. For example, "1-800-688-0068" is read as "One, eight hundred, six eight eight, zero zero six eight."

tn - raw

tn - raw

Token - ControlCode (key="tn", value="raw")  
Token - text = "app."
Token - ControlCode (key="tn", value="normal")  

The tn - raw code provides a literal reading of the text, such as blocking undesired abbreviation expansion. It operates principally on abbreviations and acronyms but may affect the surrounding text as well.

tn - sms

tn - sms

Token - ControlCode (key="tn", value="sms")  
Token - text = "ttyl, James, :-)"
Token - ControlCode (key="tn", value="normal")  

The tn - sms code gives short message service (SMS) reading. For example, "ttyl, James, :-)" is read as "Talk to you later, James, smiley happy."

tn - spell:alphanumeric

tn - spell:alphanumeric

Token - ControlCode (key="tn", value="spell:alphanumeric")  
Token - text = "a34y - 347"
Token - ControlCode (key="tn", value="normal") 

The tn - spell:alphanumeric code spells out all alphabetic and numeric characters, but does not read white space, special characters, and punctuation marks. For example, "a34y - 347" is pronounced "A three four Y, three four seven."

For both types of spell normalization, characters with accents are indicated. For example: "café" is spoken as "C A F E acute" and "Abc" is spoken as "Upper case A B C."

tn - spell:strict

tn - spell:strict

Token - ControlCode (key="tn", value="spell:strict")  
Token - text = "a34y - 347"
Token - ControlCode (key="tn", value="normal") 

The tn - spell:strict code spells out all characters, including white space, special characters, and punctuation marks.

For example, "a34y - 347" is pronounced "A three four Y, space hyphen space, three four seven."

tn - state

tn - state

Token - ControlCode (key="tn", value="state")  
Token - text "FL"
Token - ControlCode (key="tn", value="normal") 

The tn - state code expands and pronounces state, city, and province names and abbreviations. Not supported for all languages.

tn - streetname

tn - streetname

Token - ControlCode (key="tn", value="streetname")  
Token - text = "Emerson Rd."
Token - ControlCode (key="tn", value="normal") 

The tn - streetname code reads street names and abbreviations. Not supported for all languages.

tn - telephone

The tn - telephone code is an alias of tn - phone.

tn - time

tn - time

Token - ControlCode (key="tn", value="time")  
Token - text = "10:00"
Token - ControlCode (key="tn", value="normal") 

The tn - time code gives a time of day reading. For example, 10:00 is pronounced "ten o'clock."

tn - zip

tn - zip

Token - ControlCode (key="tn", value="zip")  
Token - text = "01803"
Token - ControlCode (key="tn", value="normal") 

The tn - zip code reads US zip codes. Supported for American English only.

voice

voice

Token - ControlCode (key="voice", value="samantha")
Token - text "Hello, this is Samantha. "
Token - ControlCode (key="voice", value="tom")
Token - text "Hello, this is Tom."

The voice code changes the speaking voice, which also forces a sentence break. Values are the voices within the request.

vol

vol

Token - text "I can"
Token - ControlCode (key="vol", value="10")
Token - text "speak rather quietly,"
Token - ControlCode (key="vol", value="90")
Token - text "but also very loudly."

The vol code changes the volume as a percentage of maximum volume. Values are from 0 (silent) to 100 (maximum volume). The default is typically 80.

wait

wait

Token - ControlCode (key="wait", value="2")
Token - text "There will be a short wait period after this sentence."  
Token - ControlCode (key="wait", value="9") 
Token - text "This sentence will be followed by a long wait. Did you notice the difference?"

The wait code specifies the end-of-sentence pause duration. Values are from 0 to 9, where the pause is 200 milliseconds multiplied by the value.

Synthesis resources

To enhance basic synthesis, you can add external resources such as user dictionaries.

External resources are not currently supported in Nuance-hosted NVC, but you may include inline dictionaries with SynthesisResource - body.

User dictionary

Source user dictionary

[Header]
Language = ENU
[SubHeader]
Content = EDCT_CONTENT_BROAD_NARROWS
Representation = EDCT_REPR_SZZ_STRING
[Data]
zero // #'zi.R+o&U#
addr // #'@.dR+Es#
adm // #@d.'2mI.n$.'stR+e&I.S$n#
[SubHeader]
Content=EDCT_CONTENT_ORTHOGRAPHIC
Representation=EDCT_REPR_SZ_STRING
[Data]
Info      Information
IT        "Information Technology"
DLL       "Dynamic Link Library"
A-level   "advanced level"
Afr       africa
Acc       account
TEL       telephone
Anon      anonymous
AP        "associated press" 

Compiled dictionary referenced in flow.py

request.input.text.text = "I need to find a DLL."

user_dict = SynthesisResource()
user_dict.type = EnumResourceType.USER_DICTIONARY
user_dict.body = open('/path/to/user_dictionary.dcb', 'rb').read()
request.input.resources.extend([user_dict])

A user dictionary alters the default pronunciation of words spoken by NVC. For example, you can define the pronunciation of words from foreign languages, expand special acronyms, and tune the pronunciation of words with unusual spelling.

The sample dictionary shown at the right includes the pronunciation of "zero," the expansion and pronunciation of "addr" and "adm," plus the expansion of several abbreviated words and acronyms.

You may use this dictionary in NVC by referencing it with SynthesisResource - body.

  1. Compile the source dictionary using Nuance Vocalizer Studio or its conversion tool, dictcpl. In this example, the resulting compiled file is user_dictionary.dcb.
  2. Reference user_dictionary.dcb in flow.py, which serves as input to the sample mix-client.py application. See the example at the right for the SynthesisResource code that reads the dictionary as a local file.
  3. Run mix-client.py. The audio output is: "I need to find a dynamic link library."

gRPC API

NVC provides a protocol buffer file, synthesizer.proto, that defines Nuance's text-to-speech service for gRPC. This file contains the building blocks of your voice synthesis applications.

Once you have transformed the proto file into functions and classes in your programming language using gRPC tools, you can call these functions from your application to set parameters, request synthesis, and receive the resulting audio.

See Client app development and Sample Python app for scenarios and examples in Python. For other languages, consult the gRPC and Protocol Buffers documentation.

Proto file structure

Structure of synthesizer.proto

Synthesizer
    GetVoices
        GetVoicesRequest
        GetVoicesResponse
    Synthesize
        SynthesisRequest
        SynthesisResponse
    UnarySynthesize
        SynthesisRequest
        UnarySynthesisResponse

GetVoicesRequest / GetVoicesResponse
    voice Voice
        age_group EnumAgeGroup
        gender EnumGender
        voice fields

SynthesisRequest
    voice Voice
        voice fields
    audio_params AudioParameters
        audio parm fields
        audio_format AudioFormat
            audio format fields
            ogg_opus OggOpus | opus Opus
                Opus fields
                vbr EnumVariableBitrate
    input Input
        text Text
        ssml SSML
          ssml_validation_mode EnumSSMLValidationMode
        tokenized_sequence TokenizedSequence
        resources SynthesisResource
            resource fields
            type EnumResourceType
        lid_params LanguageIdentificationParameters
        download_params DownloadParameters
    event_params EventParameters
        event parm fields
    client_data
    user_id

SynthesisResponse
    status Status
    events Events
        Event
    audio

UnarySynthesisResponse
    status Status
    events Events
        Event
    audio

The proto file defines a Synthesizer service with three RPC methods: GetVoices, Synthesize, and UnarySynthesize. Details about each component are referenced by name within the proto file.

These are the fields that make up the GetVoices request and response:

Proto file: GetVoices

And these are the principal fields in the Synthesize and UnarySynthesize request and response:

Proto file: Synthesis

Synthesizer

The Synthesizer service offers these functionalities:

Name Request Type Response Type
GetVoices GetVoicesRequest GetVoicesResponse
Synthesize SynthesisRequest SynthesisResponse stream
UnarySynthesize SynthesisRequest UnarySynthesisResponse
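
The difference between the three methods is most visible in how the generated Python stub is called. A brief sketch, assuming a stub and a SynthesisRequest named request as in the examples below:

# GetVoices: one request, one GetVoicesResponse
voices_response = stub.GetVoices(GetVoicesRequest())

# Synthesize: one request, a stream of SynthesisResponse messages
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        pass  # process the next audio chunk as it arrives

# UnarySynthesize: one request, a single UnarySynthesisResponse
unary_response = stub.UnarySynthesize(request)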

GetVoicesRequest

Get all American English voices

GetVoicesRequest (
    voice = Voice (
        language = "en-us"
    )
)

Get one named voice

GetVoicesRequest (
    voice = Voice (
        name = "Evan"
    )
)

Input message for Synthesizer - GetVoices, to query the voices available to the client.

Field Type Description
voice Voice Optionally filter the voices to retrieve, e.g. set language to en-US to return only American English voices.

Voice

Input or output message for voices.

These fields are supported in all cases:

Field Type Description
name string The voice's name, e.g. 'Evan'. Mandatory for SynthesisRequest.
model string The voice's quality model, e.g. 'enhanced' or 'standard'. Mandatory for SynthesisRequest.

These Voice fields are used only in GetVoicesRequest and GetVoicesResponse. They are ignored in SynthesisRequest.

Field Type Description
language string IETF language code, e.g. 'en-US'. Search for voices with a specific language. Some voices support multiple languages.
age_group EnumAgeGroup Search for adult or child voices.
gender EnumGender Search for voices with a certain gender.
sample_rate_hz uint32 Search for a certain native sample rate.
language_tlw string Three-letter language code (e.g. 'enu' for American English) for configuring language identification in Input.
restricted bool Used only in GetVoicesResponse, to identify restricted voices (restricted = true). These are custom voices available only to specific customers. Default is false, meaning the voice is public.
version string Used only in GetVoicesResponse, to return the voice's version.

EnumAgeGroup

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying whether the voice uses its adult or child version, if available. Included in Voice.

Name Number Description
ADULT 0 Adult voice. Default for GetVoicesRequest.
CHILD 1 Child voice.

EnumGender

Input field for GetVoicesRequest or output field for GetVoicesResponse, specifying gender for voices that support multiple genders. Included in Voice.

Name Number Description
ANY 0 Any gender voice. Default for GetVoicesRequest.
MALE 1 Male voice.
FEMALE 2 Female voice.
NEUTRAL 3 Neutral gender voice.
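
The filter fields can be combined. For example, a request for adult female American English voices, in the same style as the examples above (assuming the enum values are imported from the generated module):

GetVoicesRequest (
    voice = Voice (
        language = "en-us",
        gender = FEMALE,
        age_group = ADULT
    )
)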

GetVoicesResponse

Response to GetVoicesRequest

2020-10-04 10:26:57,840 (139731120789312) INFO  Sending GetVoices request
2020-10-04 10:26:58,051 (139731120789312) INFO  voices {
  name: "Ava-Ml"
  model: "enhanced"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "3.0.1"
}
voices {
  name: "Ava-Mls"
  model: "enhanced"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "2.0.1"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Erica"
  model: "standard"
  language: "en-us"
  gender: FEMALE
  sample_rate_hz: 8000
  language_tlw: "enu"
  restricted: true
  version: "1.0.2"
}
voices {
  name: "Evan"
  model: "enhanced"
  language: "en-us"
  gender: MALE
  sample_rate_hz: 22050
  language_tlw: "enu"
  version: "1.1.1"
}
. . . 

Output message for Synthesizer - GetVoices. Includes a list of voices that matched the input criteria, if any.

Field Type Description
voices Voice Repeated. Voices and characteristics returned.
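
In Python, the returned list can be iterated directly. A minimal sketch, assuming a stub as in the earlier setup example:

response = stub.GetVoices(GetVoicesRequest(voice = Voice(language = "en-us")))
for voice in response.voices:
    print(voice.name, voice.model, voice.sample_rate_hz, voice.version)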

SynthesisRequest

Synthesis request with most fields

SynthesisRequest(
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters (
        audio_format = AudioFormat (
            pcm = PCM (sample_rate_hz = 22050) # Default value
        ),
        volume_percentage = 80,                # Default value
        speaking_rate_factor = 1.0             # Default value
    ),
    input = Input (
        text = Text (
           text = "Your coffee will be ready in 5 minutes")
    ),
    event_params = EventParameters (
        send_log_events = True,
        suppress_input = True  
    ),
    client_data = {'company':'Aardvark Coffee','user':'Leslie'},
    user_id = "leslie.somebody@aardvark.com"
)

Minimal synthesis request, using all defaults

SynthesisRequest(
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    input = Input (
        text = Text (
           text = "Your coffee will be ready in 5 minutes")
    )
)

Input message for Synthesizer - Synthesize. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio.

Field Type Description
voice Voice The voice to use for audio synthesis. Mandatory.
audio_params AudioParameters Output audio parameters, such as encoding and volume.
input Input Input text to synthesize, tuning data, etc. Mandatory.
event_params EventParameters Markers and other info to include in server events returned during synthesis.
client_data map<string,string> Map of client-supplied key:value pairs to inject into the call log.
user_id string Identifies a specific user within the application.

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

Field Type Description
audio_format AudioFormat Audio encoding. Default PCM 22050 Hz.
volume_percentage uint32 Volume amplitude, from 0 to 100. Default 80.
speaking_rate_factor float Speaking rate, from 0 to 2.0. Default 1.0.
audio_chunk_duration_ms uint32 Maximum duration, in ms, of an audio chunk delivered to the client, from 1 to 60000. Default is 20000 (20 seconds). When this parameter is large enough (for example, 20 or 30 seconds), each audio chunk contains an audible segment surrounded by silence.
target_audio_length_ms uint32 Maximum duration, in ms, of synthesized audio. When greater than 0, the server stops ongoing synthesis at the first sentence end, or silence, closest to the value.
disable_early_emission bool By default, audio segments are emitted as soon as possible, even if they are not audible. This behavior may be disabled.
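
For example, to deliver audio in chunks of at most 30 seconds and stop synthesis at the sentence end closest to two minutes, the parameters above might be set as follows (a sketch; the values are illustrative):

audio_params = AudioParameters (
    audio_format = AudioFormat (pcm = PCM (sample_rate_hz = 22050)),
    audio_chunk_duration_ms = 30000,   # chunks of up to 30 seconds
    target_audio_length_ms = 120000    # stop near the two-minute mark
)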

AudioFormat

Default format (PCM 22050) with alternatives shown in commented lines

SynthesisRequest(
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    audio_params = AudioParameters (
        audio_format = AudioFormat (
            pcm = PCM (sample_rate_hz = 22050)
#           alaw = ALaw ()
#           ulaw = ULaw ()
#           ogg_opus = OggOpus (sample_rate_hz=16000)
#           opus = Opus (sample_rate_hz=8000, bit_rate_bps=30000)
        )
    )
)

Input message for audio encoding of synthesized text. Included in AudioParameters.

Field Type Description
pcm PCM Signed 16-bit little-endian PCM. Default. See PCM for supported sample rates.
alaw ALaw G.711 A-law, 8kHz.
ulaw ULaw G.711 Mu-law, 8kHz.
ogg_opus OggOpus Ogg Opus, 8 kHz, 16 kHz, or 24 kHz.
opus Opus Opus, 8 kHz, 16 kHz, or 24 kHz. The audio will be sent one Opus packet at a time.

PCM

Input message defining PCM sample rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate: 8000, 16000, 22050 (default), 24000.

ALaw

Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

ULaw

Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate. Supported values: 8000, 16000, 24000 Hz.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000 bps.
max_frame_duration_ms float Opus frame size, in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

Opus

Input message defining Opus output rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate. Supported values: 8000, 16000, 24000 Hz.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000 bps.
max_frame_duration_ms float Opus frame size, in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

EnumVariableBitrate

Settings for variable bitrate. Included in OggOpus and Opus. Turned on by default.

Name Number Description
VARIABLE_BITRATE_ON 0 Use variable bitrate. Default.
VARIABLE_BITRATE_OFF 1 Do not use variable bitrate.
VARIABLE_BITRATE_CONSTRAINED 2 Use constrained variable bitrate.
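
For example, to request constrained variable bitrate Opus output, in the style of the AudioFormat example above (assuming the enum value is imported from the generated module):

audio_format = AudioFormat (
    opus = Opus (
        sample_rate_hz = 16000,
        bit_rate_bps = 28000,
        vbr = VARIABLE_BITRATE_CONSTRAINED
    )
)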

Input

Input message containing text to synthesize and synthesis parameters, including tuning data, etc. Included in SynthesisRequest. The type of input may be plain text, SSML, or a sequence of plain text and Nuance control codes. See Input to synthesize for more examples.

Field Type Description
text Text Plain text input.
ssml SSML SSML input, including text and SSML elements.
tokenized_sequence TokenizedSequence Sequence of text and Nuance control codes.
resources SynthesisResource Repeated. Synthesis resources (user dictionaries, rulesets, etc.) to tune synthesized audio. Default blank.
lid_params LanguageIdentificationParameters Language identification (LID) parameters.
download_params DownloadParameters Remote file download parameters.

Text

Plain text input

SynthesisRequest (
   voice = Voice (
       name = "Evan",
       model = "enhanced"
    ),
    input = Input (
        text = Text (
           text = "Your coffee will be ready in 5 minutes")
    ),
)

Input message for synthesizing plain text. The encoding must be UTF-8.

Field Type Description
text string Plain input text in UTF-8 encoding.
uri string Remote URI to the plain input text. Not supported in Nuance-hosted NVC.

SSML

SSML input

SynthesisRequest (
   voice = Voice (
       name = "Evan",
       model = "enhanced"
    ),
    input = Input (
        ssml = SSML (
            text = '<?xml version="1.0"?><speak xmlns="http://www.w3.org/2001/10/synthesis" '
                   'xml:lang="en-US" version="1.0"><prosody volume="10">I can speak rather quietly,</prosody>'
                   '<prosody volume="90">But also very loudly.</prosody></speak>',
            ssml_validation_mode = WARN
        ) 
    )
)

Input message for synthesizing SSML input. See SSML tags for a list of supported elements and examples.

Field Type Description
text string SSML input text and elements.
uri string Remote URI to the SSML input text. Not supported in Nuance-hosted NVC.
ssml_validation_mode EnumSSMLValidationMode SSML validation mode. Default STRICT.

EnumSSMLValidationMode

SSML validation mode when using SSML input. Included in SSML. Strict by default but can be relaxed.

Name Number Description
STRICT 0 Strict SSML validation. Default.
WARN 1 Give warning only.
NONE 2 Do not validate.

TokenizedSequence

Tokenized sequence

SynthesisRequest (
   voice = Voice (
       name = "Evan",
       model = "enhanced"
    ),
    input = Input (
        tokenized_sequence = TokenizedSequence (
            tokens = [
                Token (control_code = ControlCode (
                    key = "vol",
                    value = "10")),
                Token (text = "I can speak rather quietly,"),
                Token (control_code = ControlCode (
                    key = "vol",
                    value = "90")),
                Token (text = "but also very loudly.")
            ]
        )
    )
)

Input message for synthesizing a sequence of plain text and Nuance control codes.

Field Type Description
tokens Token Repeated. Sequence of text and control codes.

Token

The unit when using TokenizedSequence for input. Each token can be either plain text or a Nuance control code. See Control codes for a list of supported codes and examples.

Field Type Description
text string Plain input text.
control_code ControlCode Nuance control code.

ControlCode

Nuance control code that specifies how text should be spoken, similar to SSML elements.

Field Type Description
key string Name of the control code, e.g. 'pause'.
value string Value of the control code.
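
For example, a pause might be inserted between two text tokens as follows. This is a sketch: the 'pause' key is mentioned above, and the value is assumed to be a duration in milliseconds, as with the corresponding Nuance control code.

tokens = [
    Token (text = "Your total is"),
    Token (control_code = ControlCode (key = "pause", value = "500")),  # assumed ms
    Token (text = "twenty dollars.")
]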

SynthesisResource

Compiled user dictionary

request = SynthesisRequest (
    voice = Voice (name = "Evan", model = "enhanced"),
    input = Input (
        text = Text (text = "Your coffee will be ready in 5 minutes"),
        resources =  [
            SynthesisResource (
                type = USER_DICTIONARY,
                body = open("/path/to/user_dictionary.dcb", 'rb').read()
            )
        ]
    )
)

Input message specifying the type of file to tune the synthesized output and its location or contents. Included in Input.

Field Type Description
type EnumResourceType Resource type, e.g. user dictionary. Default USER_DICTIONARY.
uri string URI to the remote resource. Currently disabled.
body bytes For EnumResourceType USER_DICTIONARY, the contents of the file. See User dictionary for an example.

EnumResourceType

The type of synthesis resource to tune the output. Included in SynthesisResource. User dictionaries provide custom pronunciations, rulesets apply search-and-replace rules to input text, and ActivePrompt databases help tune synthesized audio under certain conditions, using Nuance Vocalizer Studio.

Only USER_DICTIONARY is currently supported in Nuance-hosted NVC, with the dictionary specified inline.

Name Number Description
USER_DICTIONARY 0 User dictionary (application/edct-bin-dictionary). Default.
TEXT_USER_RULESET 1 Text user ruleset (application/x-vocalizer-rettt+text).
BINARY_USER_RULESET 2 Binary user ruleset (application/x-vocalizer-rettt+bin).
ACTIVEPROMPT_DB 3 ActivePrompt database (application/x-vocalizer/activeprompt-db).
ACTIVEPROMPT_DB_AUTO 4 ActivePrompt database with automatic insertion (application/x-vocalizer/activeprompt-db;mode=automatic).
SYSTEM_DICTIONARY 5 Nuance system dictionary (application/sdct-bin-dictionary).

LanguageIdentificationParameters

LID parameters in Input message

SynthesisRequest(
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    input = Input (
       tokenized_sequence = TokenizedSequence (
            tokens = [
                Token (text = "The name of the song is. "),
                Token (control_code = ControlCode (
                    key = "lang",
                    value = "unknown")),
                Token (text = "Mon pays, c'est l'hiver."),
                Token (control_code = ControlCode (
                    key = "lang",
                    value = "normal")),
                Token (text = "It's a famous Quebec song meaning, my country is winter.")
            ]
        ),
        lid_params = LanguageIdentificationParameters (
            languages = ["frc", "enu"]
        )
    )
)

Input message controlling the language identifier. Included in Input. The language identifier runs on input blocks labeled with the control code lang unknown or the SSML attribute xml:lang="unknown". The identifier automatically restricts the matched languages to the installed voices. The languages field below limits the permissible languages further, and also sets the order of precedence (first to last) when candidates have equal confidence scores.

Field Type Description
disable bool Whether to disable language identification. By default, language identification is turned on.
languages string Repeated. List of three-letter language codes (e.g. enu, frc, spm) to restrict language identification results, in order of precedence. Use GetVoicesRequest to obtain the three-letter codes, returned in GetVoicesResponse - language_tlw. Default blank.
always_use_highest_confidence bool If enabled, language identification always chooses the language with the highest confidence score, even if the score is low. Default false, meaning use language with any confidence.

DownloadParameters

Input message containing parameters for remote file download, whether for input text (Input.uri) or a SynthesisResource (SynthesisResource.uri). Included in Input.

Field Type Description
headers map<string,string> Map of HTTP header name:value pairs to include in outgoing requests. Supported headers: max_age, max_stale.
request_timeout_ms uint32 Request timeout in ms. Default (0) means server default, usually 30000 (30 seconds).
refuse_cookies bool Whether to disable cookies. By default, HTTP requests accept cookies.
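
A sketch of setting these parameters (illustrative only, since remote URIs are currently disabled in Nuance-hosted NVC):

download_params = DownloadParameters (
    headers = {"max_age": "86400"},  # one of the supported headers
    request_timeout_ms = 10000,      # fail the download after 10 seconds
    refuse_cookies = True            # do not accept cookies
)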

EventParameters

Event parameters in SynthesisRequest

SynthesisRequest(
    voice = Voice (
        name = "Evan",
        model = "enhanced"
    ),
    input = Input (
        text = Text (
           text = "Your coffee will be ready in 5 minutes.")
    ),
    event_params = EventParameters (
        send_sentence_marker_events = True,
        send_paragraph_marker_events = True,
        send_log_events = True,
        suppress_input = True
    )
)

Input message that defines event subscription parameters. Included in SynthesisRequest. Subscribed events are sent throughout the SynthesisResponse stream as they are generated. Marker events are sent as specific parts of the synthesized audio are reached, for example, the end of a word, sentence, or user-defined bookmark.

Log events are produced throughout a synthesis request, for example when the server loads a voice or an audio chunk is ready to send.

Field Type Description
send_sentence_marker_events bool Sentence marker. Default: do not send.
send_word_marker_events bool Word marker. Default: do not send.
send_phoneme_marker_events bool Phoneme marker. Default: do not send.
send_bookmark_marker_events bool Bookmark marker. Default: do not send.
send_paragraph_marker_events bool Paragraph marker. Default: do not send.
send_visemes bool Lipsync information. Default: do not send.
send_log_events bool Whether to log events during synthesis. By default, logging is turned off.
suppress_input bool Whether to omit input text and URIs from log events. By default, these items are included.
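
Subscribed events arrive in the response stream alongside audio. A brief sketch of unpacking them, assuming a response stream stream_in as in the SynthesisResponse example below:

for response in stream_in:
    if response.HasField("events"):
        for event in response.events.events:
            print(event.name, dict(event.values))  # e.g. "Markers" plus its data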

SynthesisResponse

Response to synthesis request

audio_file = None
try:
    if args.output_audio_file:
        audio_file = open(args.output_audio_file, "wb")
    for response in stream_in:
        if response.HasField("audio"):
            print("Received audio: %d bytes" % len(response.audio))
            if audio_file:
                audio_file.write(response.audio)
        elif response.HasField("events"):
            print("Received events")
            print(text_format.MessageToString(response.events))
        else:
            if response.status.code == 200:
                print("Received status response: SUCCESS")
            else:
                print("Received status response: FAILED")
                print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                print('Error: {}'.format(response.status.details))
except Exception as e:
    print(e)
if audio_file:
    print("Saved audio to {}".format(args.output_audio_file))
    audio_file.close()

The Synthesizer - Synthesize RPC call returns a stream of SynthesisResponse messages. (See UnarySynthesisResponse for a non-streamed response.) Each response contains one of:

Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes The latest audio buffer.

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse and UnarySynthesisResponse.

Field Type Description
code uint32 HTTP-style return code: 200, 4xx, or 5xx as appropriate. See Status codes.
message string Brief description of the status.
details string Longer description if available.

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated fields in Protobuf. Included in SynthesisResponse and UnarySynthesisResponse.

Field Type Description
events Event Repeated. One or more events.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Field Type Description
name string Either "Markers" or the name of the event in the case of a Log Event.
values map<string,string> Map of key:value data relevant to the current event.

UnarySynthesisResponse

The Synthesizer - UnarySynthesize RPC call returns a single UnarySynthesisResponse message. It is similar to SynthesisResponse, but contains all the response information at once (status, events, and the complete audio) rather than one field per streamed message. The response contains:

Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes Audio buffer of the synthesized text.
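
A minimal sketch of handling the unary response, assuming a stub and a SynthesisRequest named request as in the streaming example:

response = stub.UnarySynthesize(request)
if response.status.code == 200:
    # The complete audio arrives in one buffer
    with open("output.pcm", "wb") as f:
        f.write(response.audio)
else:
    print("Synthesis failed: {} {}".format(response.status.code, response.status.message))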

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Proto | Notes | C++ | Java | Python
----- | ----- | --- | ---- | ------
double | | double | double | float
float | | float | float | float
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. | int32 | int | int
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. | int64 | long | int/long
uint32 | Uses variable-length encoding. | uint32 | int | int/long
uint64 | Uses variable-length encoding. | uint64 | long | int/long
sint32 | Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. | int32 | int | int
sint64 | Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. | int64 | long | int/long
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long
sfixed32 | Always four bytes. | int32 | int | int
sfixed64 | Always eight bytes. | int64 | long | int/long
bool | | bool | boolean | boolean
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode
bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str

Change log

2020-09-30

These changes were made:

2020-08-19

These changes were made:

2020-06-24

The TTS v1beta1 protocol is deprecated: it is currently being monitored and may be removed in the near future. If you are using v1beta1, we recommend that you upgrade to v1.

2020-05-31

These changes were made to the API and documentation:

2020-04-30

These changes were made to the documentation:

2020-03-31

These changes were made to the API and documentation:

The Input message changed between v1beta1 and v1:

v1beta1:

message Input {
  string type = 1;
  oneof input_data {
    string uri = 2;
    string body = 3;
    bytes body_as_bytes = 4;
  }
  string escape_sequence = 5;
}

v1:

message Input {
  oneof input_data {
    Text text = 1;
    SSML ssml = 2;
    TokenizedSequence tokenized_sequence = 3;
  }
}

message Text {}
message SSML {}
message TokenizedSequence {}
message Token {}
message ControlCode {}

2020-02-19

These changes were made to the API and documentation:

2020-01-22

These changes were made to the API and documentation:

2019-12-18

These changes were made to the TTSaaS gRPC API documentation:

2019-12-02

These changes were made to the TTSaaS gRPC API documentation:

2019-11-15

Below are changes made to the TTSaaS gRPC API documentation since the initial Beta release: