TTS as a Service gRPC API

The Nuance TTS (Text to Speech) service is powered by the Nuance Vocalizer for Cloud (NVC) engine, which synthesizes speech from text. NVC works with Nuance Vocalizer for Enterprise (NVE) and Nuance voice packs to generate speech.

The gRPC protocol provided by NVC allows a client application to request speech synthesis services in all the programming languages supported by gRPC.

gRPC is an open source RPC (remote procedure call) framework used to create services. It uses HTTP/2 for transport and protocol buffers to define the service interface and message structures. NVC supports Protocol Buffers version 3, also known as proto3.

Nuance TTS as a service means you can request speech synthesis using gRPC from NVC engines running on Nuance-hosted machines.

Version: v1beta1 (obsolete)

To use the latest protocol version, see v1 documentation.

Prerequisites from Mix

Before developing your TTS gRPC application, you need a Nuance Mix project. This project provides the credentials to run your application against the Nuance-hosted NVC engine.

  1. Create a Mix project and model: see Mix.nlu workflow to:

    • Create a Mix project.

    • Create, train, and build a model in the project. If you are using other Nuance "as a service" products (such as Krypton or NLU), you may use the same Mix project for NVC. Your project must include a model even though it is not needed for your NVC application.

    • Create and deploy an application configuration for the project.

  2. Generate a "secret" and client ID for your Mix project: see Mix Dashboard Obtain authentication for services. Later you will use these credentials to request an authorization token to run your application.

  3. Learn the URL to call the TTS service: see Mix Dashboard Accessing a runtime service.

    • The URL is ttsaas.beta.mix.nuance.com:443

gRPC setup

Download proto file

nuance_tts.proto

Install gRPC for your programming language

$ python3 -m pip install --upgrade pip
$ python3 -m pip install grpcio
$ python3 -m pip install grpcio-tools
$ go get google.golang.org/grpc
$ go get github.com/akamensky/argparse
$ go get github.com/youpy/go-wav

Generate client stubs

$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ nuance_tts.proto

$ ls -1 nuance_t*.py
nuance_tts_pb2.py
nuance_tts_pb2_grpc.py
$ protoc -I ./src/v1beta1/ ./src/v1beta1/nuance_tts.proto --go_out=plugins=grpc:./src/v1beta1

$ ls -1 ./src/v1beta1/nuance_t*.go
nuance_tts.pb.go

The basic steps in using the NVC gRPC protocol are:

  1. Download the gRPC proto file here: nuance_tts.proto. This proto3 file contains a generic version of the functions or classes that can request speech synthesis from an NVC engine.

  2. Install gRPC for the programming language of your choice, including C++, Java, Python, Go, Ruby, C#, Node.js, and others. See gRPC Documentation for a complete list and instructions on using gRPC with each one.

  3. Generate client stub files in your programming language from the proto files using gRPC protoc. Depending on your programming language, the stubs may consist of one file or multiple files.

    These stub files contain the methods and fields from the proto files as implemented in your programming language. You will consult the stubs in conjunction with the proto files.

  4. Write your client app, referencing the functions or classes in the client stub files. See Client app development for details and a scenario.

  5. Run your client app to request synthesis. See Sample Python app and Sample Go app.

Client app development

The gRPC protocol for NVC lets you create a client application for synthesizing text. In general, you need to: generate an authentication token, connect to the TTS service, import functions from the generated client stubs, set the synthesis parameters, send a synthesis request, and process the streamed results.

This section describes how to implement this basic functionality of NVC in the context of a client application written in Python or Go. For the complete applications, see Sample Python app and Sample Go app.

These tasks are illustrated in the following high-level sequence flow.

Sequence flow

Step 1: Generate token

This simple example uses a Linux script to request the token and save it in a variable

#!/bin/bash

CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"
MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" "https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=tts' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"

For the Go app, set the client ID and secret in a config file, config.json

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

Nuance Mix uses the OAuth 2.0 protocol for authentication. The client application must provide an access token to be able to access the NVC runtime service. The token expires after 3600 seconds (one hour), so it must be regenerated regularly.

Your client application uses the client ID and secret from the Mix Dashboard (see Prerequisites from Mix) to generate an authentication token from the Mix Authentication Service, available at the following URL:

auth.crt.nuance.com/oauth2/token

The token may be generated in several ways: for example, with a shell script that calls the authentication service directly (as in the Python scenario above) or with application code that requests and caches the token (as in the Go scenario).
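For illustration, a few lines of Python can also request the token directly. This is a minimal sketch, assuming the third-party requests library (not part of the sample apps); the credentials are the same Mix client ID (with its colon URL-encoded as %3A) and secret used by the script above

import requests

# Mix credentials from the Mix Dashboard (see Prerequisites from Mix).
# The ":" in the client ID must be URL-encoded as "%3A".
CLIENT_ID = "appID%3A<your Mix client ID>"
SECRET = "<your Mix client secret>"

resp = requests.post(
    "https://auth.crt.nuance.com/oauth2/token",
    auth=(CLIENT_ID, SECRET),
    data={"grant_type": "client_credentials", "scope": "tts"},
)
resp.raise_for_status()
token = resp.json()["access_token"]  # expires after 3600 seconds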

Step 2: Authenticate and connect

"Source" the token-generating script, then pass the variable to the Python app along with the URL of the TTS service

$ source ./gen-token.sh
$ python3 simple-client.py --server_url 'ttsaas.beta.mix.nuance.com:443' \
  --token $MY_TOKEN \
  --name 'Evan' \
  --model 'enhanced' \
  --body 'Hello world' \
  --sample_rate 22050 \
  --output_audio_file 'hello.pcm'

The app uses the token to create the secure connection to the TTS service

call_credentials = grpc.access_token_call_credentials(args.token)
channel_credentials = grpc.composite_channel_credentials(
    grpc.ssl_channel_credentials(), call_credentials)

with grpc.secure_channel(args.server_url, credentials=channel_credentials) as channel:
    stub = nuance_tts_pb2_grpc.SynthesizerStub(channel)
    request = create_synthesis_request(name=args.name, model=args.model, body=args.body,
        uri=args.uri, sample_rate=args.sample_rate, send_log_events=args.send_log_events)

The Go app collects the service URL (server) and credentials (configFile) in tts_client.go

func main() {

    // collect arguments
    parser := argparse.NewParser("tts_client", "Use Nuance TTS to vocalize text")
    server := parser.String("s", "server", &argparse.Options{
        Default: "ttsaas.beta.mix.nuance.com:443",
        Help:    "server host:port",
    })
    . . . 
    configFile := parser.String("c", "configFile", &argparse.Options{
        Default: "config.json",
        Help:    "config file containing client credentials (client_id and client_secret)",

Then calls authenticate.go to generate and validate the token using the values from config.json

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "net/url"
    "os"
    "strings"
    "time"
)

const (
    TokenCache  = "token.cache"
    TokenMaxAge = 59 // minutes
    GrantType   = "client_credentials"
    Scope       = "tts"
)

type Token struct {
    AccessToken string `json:"access_token"`
    ExpiresIn   int    `json:"expires_in"`
    Scope       string `json:"scope"`
    TokenType   string `json:"token_type"`
}

func (t *Token) String(pretty bool) string {
    var str []byte
    var err error

    if pretty {
        str, err = json.MarshalIndent(t, "", "  ")
    } else {
        str, err = json.Marshal(t)
    }

    if err != nil {
        log.Printf("Error marshalling token to json: %s", err)
    }

    return string(str)
}

type Authenticator struct {
    config Config
    token  *Token
}

func (a *Authenticator) generateToken() (*Token, error) {
    a.token = nil

    body := strings.NewReader(fmt.Sprintf("grant_type=%s&scope=%s", GrantType, Scope))
    req, err := http.NewRequest("POST", a.config.TokenURL, body)
    if err != nil {
        return nil, err
    }

    req.SetBasicAuth(url.QueryEscape(a.config.ClientID), url.QueryEscape(a.config.ClientSecret))
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return nil, errors.New(resp.Status)
    }

    bodyBytes, _ := ioutil.ReadAll(resp.Body)
    t := &Token{}
    err = json.Unmarshal(bodyBytes, t)
    if err != nil {
        return nil, err
    }

    a.token = t
    return a.token, nil
}

func (a *Authenticator) isTokenValid() bool {

    // Is token cached?
    info, err := os.Stat(TokenCache)
    if err != nil {
        return false
    }

    // Can token be read from file?
    source, err := ioutil.ReadFile(TokenCache)
    if err != nil {
        return false
    }

    // Are contents of token valid?
    t := &Token{}
    err = json.Unmarshal(source, t)
    if err != nil || len(t.AccessToken) == 0 {
        return false
    }

    // Has token expired?
    lapsed := time.Since(info.ModTime())
    if lapsed > (TokenMaxAge * time.Minute) {
        return false
    }

    // All tests passed
    a.token = t
    return true
}

func (a *Authenticator) cacheToken() {
    outputJSON, err := json.MarshalIndent(a.token, "", "  ")
    if err != nil {
        log.Printf("Failed to cache token: %v", err)
        return
    }

    err = ioutil.WriteFile(TokenCache, outputJSON, 0644)
    if err != nil {
        log.Printf("Failed to cache token: %v", err)
    }

    return
}

func (a *Authenticator) Authenticate() (*Token, error) {
    if a.isTokenValid() {
        return a.token, nil
    }

    if _, err := a.generateToken(); err != nil {
        return nil, err
    }

    a.cacheToken()
    return a.token, nil
}

func NewAuthenticator(config Config) *Authenticator {
    a := &Authenticator{
        config: config,
    }
    return a
}

The app creates a secure gRPC channel and authenticates itself to the TTS service by providing the URL of the hosted service and an access token.

URL: In both these examples, the URL of the service is passed to the application as an argument.

Token: There are several ways to generate and use the token that authenticates the application to the TTS service. The code samples show two methods: the Python scenario generates the token with a shell script and passes it to the app as a command-line argument, while the Go app generates the token in authenticate.go and caches it in a local file.

Step 3: Import functions

Import functions from stubs

import sys
import grpc
import wave
import argparse
import nuance_tts_pb2
import nuance_tts_pb2_grpc
from google.protobuf import text_format
The Go app imports the generated stub package

import (
    . . . 
    pb "./v1beta1"
)

The app imports all functions from the NVC client stubs that were generated in gRPC setup.

Do not edit these stub files.

Step 4: Set synthesis parameters

The Python app collects synthesis parameters on the command line

$ ./simple-client.py -h
usage: simple-client.py [-options]

options:
  -h, --help                               Show this help message and exit
  --server_url [SERVER_URL]                Server hostname (default=localhost)
  --token [TOKEN]                          Authentication token
  --name [NAME]                            Voice name
  --model [MODEL]                          Voice model
  --sample_rate [SAMPLE_RATE]              PCM sample rate (default=22050)
  --body [BODY]                            Input text
  --uri [URI]                              Input URI
  --send_log_events                        Subscribe to Log Events
  --output_audio_file [OUTPUT_AUDIO_FILE]  Destination file path for synthesized audio

The Go app collects synthesis parameters from a params.json file

{
    "voice": {
      "name": "Zoe-sc",
      "model": "enhanced"
    },
    "audio_params": {
      "audio_format": {
        "AudioFormat": {
          "Pcm": {
            "sample_rate_hz": 22050
          }
        }
      }
    },
    "input": {
      "type": "text/plain;charset=utf-8",
      "InputData": {
        "Body": "Hello world. This is a test."
      }
    },
    "event_params": {
      "send_sentence_marker_events": false,
      "send_word_marker_events": false
    }
  }

The app uses parameters to define how the voice synthesis should be performed. Some synthesis parameters are:

  • voice: the voice name (for example, Evan or Zoe-Sc) and model (for example, enhanced).

  • audio_params: the audio format, in these examples PCM with a sample rate of 22050 Hz.

  • input: the type of input (plain text) and the text itself, passed directly in body or referenced with uri.

  • event_params: event subscriptions such as log events, sentence marker events, and word marker events.

In the Python application, these parameters are collected from the command line. In the Go sample application, they are set in a params.json file and read by the application.

In the next step, these parameters are added to a SynthesisRequest, ready to send to NVC to generate the synthesized speech.

See SynthesisRequest for other parameters that affect synthesis.

Step 5: Request synthesis

The Python app adds the parameters to the SynthesisRequest

def create_synthesis_request(name, model, body, uri, sample_rate, send_log_events=False):
    request = nuance_tts_pb2.SynthesisRequest()

    request.voice.name = name
    request.voice.model = model

    pcm = nuance_tts_pb2.PCM(sample_rate_hz=sample_rate)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)

    request.input.type = "text/plain;charset=utf-8"
    if body:
        request.input.body = body
    elif uri:
        request.input.uri = uri
    else:
        raise RuntimeError("--body or --uri must be defined.")

    request.event_params.send_log_events = send_log_events

    return request

The Go app reads params.json and constructs parameters

    paramsFile := parser.String("p", "paramsFile", &argparse.Options{
        Default: "params.json",
        Help:    "file containing tts parameters to customize TTS behavior",
    })
    . . . 
    {
        parameters := initializeSynthesisRequest(*paramsFile, textInput)
        jout, _ := json.MarshalIndent(parameters, "", "  ")
        log.Printf("parameters: %v", string(jout))
        synthesize(ctx, client, parameters, *outFile)
    }

The app defines a SynthesisRequest, including the parameters collected from the user on the command line or from a file.

Step 6: Call client stub

Call main client stub

call_credentials = grpc.access_token_call_credentials(args.token)
channel_credentials = grpc.composite_channel_credentials(
    grpc.ssl_channel_credentials(), call_credentials)

with grpc.secure_channel(args.server_url, credentials=channel_credentials) as channel:
    stub = nuance_tts_pb2_grpc.SynthesizerStub(channel)
    request = create_synthesis_request(name=args.name, model=args.model, body=args.body,
        uri=args.uri, sample_rate=args.sample_rate, send_log_events=args.send_log_events)
    stream_in = stub.Synthesize(request)
    client := pb.NewSynthesizerClient(conn)
    ctx, cancel := CreateChannelContext(&token.AccessToken)

To call the service, the app needs the location of the NVC service and the authentication token used to create the secure channel. See Step 2: Authenticate and connect.

Using this information, the app calls a client stub function or class. In some languages, this stub is defined in the generated client files: in Python it is named SynthesizerStub and in Go it is SynthesizerClient. In other languages, such as Java, you must create your own stub.

Step 7: Receive results

Receive results from NVC

    try:
        if args.output_audio_file:
            audio_file = open(args.output_audio_file, "wb")
        for response in stream_in:
            if response.HasField("audio"):
                print("Received audio: %d bytes" % len(response.audio))
                if audio_file:
                    audio_file.write(response.audio)
            elif response.HasField("events"):
                print("Received events")
                print(text_format.MessageToString(response.events))
            else:
                if response.status.code == 200:
                    print("Received status response: SUCCESS")
                else:
                    print("Received status response: FAILED")
                    print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                    print('Error: {}'.format(response.status.details))
    except Exception as e:
        print(e)
    if audio_file:
        print("Saved audio to {}".format(args.output_audio_file))
        audio_file.close()
    outFile := parser.String("o", "outFile", &argparse.Options{
        Required: false,
        Help:     "file to save audio to",
    })
    . . . 
    {
        parameters := initializeSynthesisRequest(*paramsFile, textInput)
        jout, _ := json.MarshalIndent(parameters, "", "  ")
        log.Printf("parameters: %v", string(jout))
        synthesize(ctx, client, parameters, *outFile)
    }

Finally, the app receives the results from the NVC engine, in these examples saving the synthesized speech to a file. In both apps, the file name is collected from the user on the command line.

Synthesis resources

Once you have experimented with basic synthesis, you can add external resources such as user dictionaries and tuning data.

Get voices

Each SynthesisRequest must include a voice name and model. If you need to know which voices and models are available, issue a GetVoicesRequest.

Python get voices

The get_voices_client.py app returns the voices available from the NVC server

#!/usr/bin/env python3

import sys
import grpc
import wave
import argparse
import nuance_tts_pb2
import nuance_tts_pb2_grpc
from google.protobuf import text_format

def create_get_voices_request(name=None, model=None, language=None, sample_rate=0):
    request = nuance_tts_pb2.GetVoicesRequest()
    if name:
        request.voice.name = name
    if model:
        request.voice.model = model
    if language:
        request.voice.language = language
    request.voice.sample_rate_hz=sample_rate

    return request

def main():
    parser = argparse.ArgumentParser(
        prog="get_voices_client.py",
        usage="%(prog)s [-options]",
        add_help=False,
        formatter_class=lambda prog: argparse.HelpFormatter(
            prog, max_help_position=45, width=100)
    )

    options = parser.add_argument_group("options")
    options.add_argument("-h", "--help", action="help",
                         help="Show this help message and exit")
    options.add_argument("--server_url", nargs="?",
                         help="Server hostname (default=localhost)", default="localhost:8080")
    options.add_argument("--token", nargs="?",
                         help="Authentication token", required=True)
    options.add_argument("--name", nargs="?", help="Voice name")
    options.add_argument("--model", nargs="?",
                         help="Voice model")
    options.add_argument("--language", nargs="?",
                         help="Voice IETF language")
    options.add_argument("--sample_rate", nargs="?",
                         help="Sample rate", type=int, default=0)

    args = parser.parse_args()

    call_credentials = grpc.access_token_call_credentials(args.token)
    channel_credentials = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(), call_credentials)

    with grpc.secure_channel(args.server_url, credentials=channel_credentials) as channel:
        stub = nuance_tts_pb2_grpc.SynthesizerStub(channel)
        request = create_get_voices_request(name=args.name, model=args.model, language=args.language, sample_rate=args.sample_rate)
        response = stub.GetVoices(request)

        for voice in response.voices:
            print("Voice: %s, Model: %s, Language: %s, Sample rate: %d" % (voice.name, voice.model, voice.language, voice.sample_rate_hz))

if __name__ == '__main__':
    main()

In the Python scenario, this request is a separate application. The get_voices_client.py application gathers parameters from the user, then sends a GetVoicesRequest with the values received. To see the parameters that the application accepts, use the -h parameter. Make sure to generate a token (see Step 1: Generate token) to run the application against the Nuance-hosted NVC server.

$ source ./gen-token-var.sh
$ ./get_voices_client.py -h
usage: get_voices_client.py [-options]  
  
options:
  -h, --help                   Show this help message and exit
  --server_url [SERVER_URL]    Server hostname (default=localhost)
  --token [TOKEN]              Authentication token
  --name [NAME]                Voice name
  --model [MODEL]              Voice model
  --language [LANGUAGE]        Voice IETF language
  --sample_rate [SAMPLE_RATE]  Sample rate

In the example below, several limiting parameters are included, so only one voice is returned.

$ ./get_voices_client.py \
  --server_url 'ttsaas.beta.mix.nuance.com:443' --token $MY_TOKEN \
  --name 'Evan' --model 'enhanced' --sample_rate 22050
Voice: Evan, Model: enhanced, Language: en-us, Sample rate: 22050

In this second example, only the mandatory parameters (server and token) are provided, so NVC returns all available voices.

$ ./get_voices_client.py \
  --server_url 'ttsaas.beta.mix.nuance.com:443' --token $MY_TOKEN
  
Voice: Evan, Model: enhanced, Language: en-us, Sample rate: 22050
Voice: Zoe-Sc, Model: enhanced, Language: en-us, Sample rate: 22050

Go get voices

The Go app returns available voices when the -v option is set on the command line

$ go run ./src -h
usage: tts_client [-h|--help] [-s|--server "<value>"] [-i|--textInput
                "<value>"] [-o|--outFile "<value>"] [-c|--configFile
                "<value>"] [-p|--paramsFile "<value>"] [-v|--getVoices]
                Use Nuance TTS to vocalize text
Arguments:
-h  --help        Print help information
-s  --server      server host:port. Default: ttsaas.beta.mix.nuance.com:443
-i  --textInput   text input
-o  --outFile     file to save audio to
-c  --configFile  config file containing client credentials (client_id and
                    client_secret). Default: config.json
-p  --paramsFile  file containing tts parameters to customize TTS behavior.
                    Default: params.json
-v  --getVoices   Specify this flag to retrieve available voices

The tts_client.go app includes a function that performs a basic GetVoicesRequest

func getVoices(ctx context.Context, client pb.SynthesizerClient) {

    req := &pb.GetVoicesRequest{}
    resp, err := client.GetVoices(ctx, req)
    if err != nil {
        log.Printf("Error getting voices: %s", err)
        return
    }

    out, _ := json.MarshalIndent(resp, "", "  ")
    log.Printf("Voices: %s", string(out))
}

In the Go example, GetVoicesRequest is part of the main application. Specify -v (or --getVoices) to retrieve available voices.

$ go run ./src -v
2019/11/01 11:31:29 Voices: {
"voices": [
    {
    "name": "Evan",
    "model": "enhanced",
    "language": "en-us"
    },
    {
    "name": "Zoe-Sc",
    "model": "enhanced",
    "language": "en-us"
    }
]
}

Sample applications

This section contains sample client applications for NVC.

Sample Python app

This basic Python app, simple-client.py, requests and receives synthesized voice audio

#!/usr/bin/env python3

import sys
import grpc
import wave
import argparse
import nuance_tts_pb2
import nuance_tts_pb2_grpc
from google.protobuf import text_format

def create_synthesis_request(name, model, body, uri, sample_rate, send_log_events=False):
    request = nuance_tts_pb2.SynthesisRequest()

    request.voice.name = name
    request.voice.model = model

    pcm = nuance_tts_pb2.PCM(sample_rate_hz=sample_rate)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)

    request.input.type = "text/plain;charset=utf-8"
    if body:
        request.input.body = body
    elif uri:
        request.input.uri = uri
    else:
        raise RuntimeError("--body or --uri must be defined.")

    request.event_params.send_log_events = send_log_events

    return request

def main():
    parser = argparse.ArgumentParser(
        prog="simple-client.py",
        usage="%(prog)s [-options]",
        add_help=False,
        formatter_class=lambda prog: argparse.HelpFormatter(
            prog, max_help_position=45, width=100)
    )

    options = parser.add_argument_group("options")
    options.add_argument("-h", "--help", action="help",
                         help="Show this help message and exit")
    options.add_argument("--server_url", nargs="?",
                         help="Server hostname (default=localhost)", default="localhost:8080")
    options.add_argument("--token", nargs="?",
                         help="Authentication token", required=True)
    options.add_argument("--name", nargs="?", help="Voice name", required=True)
    options.add_argument("--model", nargs="?",
                         help="Voice model", required=True)
    options.add_argument("--sample_rate", nargs="?",
                         help="PCM sample rate (default=22050)", type=int, default=22050)
    options.add_argument("--body", nargs="?", help="Input text")
    options.add_argument("--uri", nargs="?", help="Input URI")
    options.add_argument("--send_log_events",
                         action="store_true", help="Subscribe to Log Events")
    options.add_argument("--output_audio_file", nargs="?",
                         help="Destination file path for synthesized audio")

    args = parser.parse_args()

    call_credentials = grpc.access_token_call_credentials(args.token)
    channel_credentials = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(), call_credentials)

    with grpc.secure_channel(args.server_url, credentials=channel_credentials) as channel:
        stub = nuance_tts_pb2_grpc.SynthesizerStub(channel)
        request = create_synthesis_request(name=args.name, model=args.model, body=args.body,
            uri=args.uri, sample_rate=args.sample_rate, send_log_events=args.send_log_events)
        stream_in = stub.Synthesize(request)
        audio_file = None
        try:
            if args.output_audio_file:
                audio_file = open(args.output_audio_file, "wb")
            for response in stream_in:
                if response.HasField("audio"):
                    print("Received audio: %d bytes" % len(response.audio))
                    if audio_file:
                        audio_file.write(response.audio)
                elif response.HasField("events"):
                    print("Received events")
                    print(text_format.MessageToString(response.events))
                else:
                    if response.status.code == 200:
                        print("Received status response: SUCCESS")
                    else:
                        print("Received status response: FAILED")
                        print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                        print('Error: {}'.format(response.status.details))
        except Exception as e:
            print(e)
        if audio_file:
            print("Saved audio to {}".format(args.output_audio_file))
            audio_file.close()


if __name__ == '__main__':
    main()

There are several ways to run this small Python 3.6 app. The complete application is shown above. In this scenario, first "source" the script file (see Step 1: Generate token) to generate the authentication token and store it in the MY_TOKEN environment variable, then run the app with the values shown.

$ source ./gen-token-var.sh
$ python3 simple-client.py --server_url 'ttsaas.beta.mix.nuance.com:443' \
  --token $MY_TOKEN \
  --name 'Evan' \
  --model 'enhanced' \
  --body 'Hello world' \
  --sample_rate 22050 \
  --output_audio_file 'hello.pcm'
Received audio: 68886 bytes
Received status response: SUCCESS
Saved audio to hello.pcm

The app synthesizes the text "Hello world" in Evan's voice and saves it in a file, hello.pcm, in the current location of the app.
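The output is headerless PCM, which many audio players cannot open directly. As a minimal sketch, you could wrap the file in a WAV container with Python's standard wave module, assuming the 16-bit mono samples that NVC produces for PCM output

import wave

# Wrap the raw PCM output in a WAV container so standard players can open it.
# Assumes 16-bit mono samples at the sample rate requested above (22050 Hz).
with open("hello.pcm", "rb") as f:
    pcm_data = f.read()

with wave.open("hello.wav", "wb") as wav:
    wav.setnchannels(1)       # NVC PCM output is mono
    wav.setsampwidth(2)       # 16 bits per sample
    wav.setframerate(22050)   # must match --sample_rate
    wav.writeframes(pcm_data)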

If you need to provide more input text, you could add text to a variable. For example, set the BODY variable to a message, then run the app using "$BODY" as the input text.

$ export BODY="Your order will be ready in 10 minutes. You can collect it at the pick-up counter."  
$ source ./gen-token-var.sh
$ python3 simple-client.py --server_url 'ttsaas.beta.mix.nuance.com:443' \
  --token $MY_TOKEN \
  --name 'Zoe-Sc' \
  --model 'enhanced' \
  --body "$BODY" \
  --sample_rate 22050 \
  --output_audio_file 'output.pcm'
Received audio: 36636 bytes
Received audio: 15188 bytes
Received audio: 5432 bytes
Received audio: 13258 bytes
Received audio: 36294 bytes
Received audio: 41650 bytes
Received audio: 24678 bytes
Received audio: 42446 bytes
Received status response: SUCCESS
Saved audio to output.pcm

Note that this simple application only provides speech synthesis, or text-to-speech. In a real-life situation, this operation would likely be part of a larger application such as a virtual assistant that includes recognition and semantic understanding as well as speech synthesis.

Sample Python app: unary response

The client.py application supports a unary synthesis response

#import wave
import argparse
import sys
import time
import logging
import grpc
import os
from importlib.machinery import SourceFileLoader
import threading
from google.protobuf import text_format

import nuance_tts_pb2
import nuance_tts_pb2_grpc

thread_context = threading.local()
total_first_chunk_latency = 0
total_synthesis = 0

args = None

# Generates the .wav file header for a given set of parameters
def genHeader(sampleRate, bitsPerSample, channels, datasize, formattype):
    o = bytes("RIFF",'ascii')                                               # (4byte) Marks file as RIFF
    o += (datasize + 36).to_bytes(4,'little')                               # (4byte) File size in bytes excluding this and RIFF marker
    o += bytes("WAVE",'ascii')                                              # (4byte) File type
    o += bytes("fmt ",'ascii')                                              # (4byte) Format Chunk Marker
    o += (16).to_bytes(4,'little')                                          # (4byte) Length of above format data
    o += (formattype).to_bytes(2,'little')                                  # (2byte) Format type (1 - PCM)
    o += (channels).to_bytes(2,'little')                                    # (2byte) Will always be 1 for TTS
    o += (sampleRate).to_bytes(4,'little')                                  # (4byte) Sample rate
    o += (sampleRate * channels * bitsPerSample // 8).to_bytes(4,'little')  # (4byte) Byte rate (bytes per second)
    o += (channels * bitsPerSample // 8).to_bytes(2,'little')               # (2byte) Block align (bytes per sample frame)
    o += (bitsPerSample).to_bytes(2,'little')                               # (2byte) Bits per sample
    o += bytes("data",'ascii')                                              # (4byte) Data Chunk Marker
    o += (datasize).to_bytes(4,'little')                                    # (4byte) Data size in bytes

    return o

def send_get_voices_request(grpc_client, request):
    log.info("Sending GetVoices request")

    client_span = None
    get_voices_span = None
    metadata = []

    if args.jaeger:
        log.debug("Injecting Jaeger span context into request")
        client_span = tracer.start_span("Client.gRPC")
        get_voices_span = tracer.start_span(
            "Client.GetVoices", child_of=client_span)
        carrier = dict()
        tracer.inject(get_voices_span.context,
                      opentracing.propagation.Format.TEXT_MAP, carrier)
        metadata.append(('uber-trace-id', carrier['uber-trace-id']))

    response = grpc_client.GetVoices(request=request, metadata=metadata)

    log.info(text_format.MessageToString(response))

    if get_voices_span:
        get_voices_span.finish()
    if client_span:
        client_span.finish()

def send_synthesis_request(grpc_client, request, metadata=None):
    log.info("Sending Synthesis request")

    audio_file = None
    audio_file_name = ""
    extension = ""
    waveheader = ""
    sampleRate  = 0
    bitsPerSample = 0
    channels = 1 
    audioformat = 0
    currentaudiolen = 0
    num_chunks = 0
    metadata = []
    client_span = None
    synthesis_span = None
    received_first_audio_chunk = False
    call_credentials = None

    global total_synthesis
    total_synthesis = total_synthesis + 1

    global args

    thread_context.num_synthesis = thread_context.num_synthesis + 1

    if args.saveAudio or args.saveAudioChunks:
        if request.audio_params.audio_format.HasField("pcm"):
            extension = "pcm"
            sampleRate = request.audio_params.audio_format.pcm.sample_rate_hz
            bitsPerSample = 16
            audioformat = 1
        elif request.audio_params.audio_format.HasField("alaw"):
            extension = "alaw"
            bitsPerSample = 8
            sampleRate = 8000
            audioformat = 6
        elif request.audio_params.audio_format.HasField("ulaw"):
            extension = "ulaw"
            bitsPerSample = 8
            sampleRate = 8000
            audioformat = 7
        elif request.audio_params.audio_format.HasField("ogg_opus"):
            extension = "ogg"
        elif request.audio_params.audio_format.HasField("opus"):
            extension = "opus"

        if args.saveAudio:
            if args.saveAudioinWave:
                if request.audio_params.audio_format.HasField("ogg_opus") or request.audio_params.audio_format.HasField("opus"):
                    log.warn("Cannot set to wav format for OggOpus and Opus, ignoring")
                else:
                    extension = "wav"

            if request.audio_params.audio_format.HasField("opus"):
                log.warn("Cannot save whole audio for Opus, ignoring.")
            else:
                audio_file_name = "%s_i%d_s%d.%s" % (
                thread_context.file, num_iterations, thread_context.num_synthesis, extension)
                audio_file = open(audio_file_name, "wb")

        if args.saveAudioChunks:
            if args.saveAudioinWave:
                if request.audio_params.audio_format.HasField("ogg_opus") or request.audio_params.audio_format.HasField("opus"):
                    log.warn("Cannot set to wav format for OggOpus and Opus, ignoring")
                else:
                    extension = "wav"

    if args.nmaid:
        metadata.append(('x-nuance-client-id', args.nmaid))

    if args.jaeger:
        log.debug("Injecting Jaeger span context into request")
        client_span = tracer.start_span("Client.gRPC")
        synthesis_span = tracer.start_span(
            "Client.Synthesize", child_of=client_span)
        carrier = dict()
        tracer.inject(synthesis_span.context,
                      opentracing.propagation.Format.TEXT_MAP, carrier)
        metadata.append(('uber-trace-id', carrier['uber-trace-id']))

    start = time.monotonic()

    responses = grpc_client.Synthesize(
        request=request, metadata=metadata)

    for response in responses:
        if response.HasField("audio"):
            log.info("Received audio: %d bytes" % len(response.audio))
            if not received_first_audio_chunk:
                received_first_audio_chunk = True
                latency = time.monotonic() - start
                log.info("First chunk latency: {} seconds".format(latency))
                global total_first_chunk_latency
                total_first_chunk_latency = total_first_chunk_latency + latency
                log.info("Average first-chunk latency (over {} synthesis requests): {} seconds".format(
                    total_synthesis, total_first_chunk_latency/(total_synthesis)))

            if args.saveAudio:
                if args.saveAudioinWave:
                    if request.audio_params.audio_format.HasField("ogg_opus") or request.audio_params.audio_format.HasField("opus"):
                        log.warn("Cannot set to wav format for OggOpus and Opus")
                    else:
                        currentaudiolen += len(response.audio)
                        waveheader = genHeader(sampleRate,bitsPerSample,channels,currentaudiolen,audioformat)
                        audio_file.seek(0,0)
                        audio_file.write(waveheader)
                        audio_file.seek(0,2)
                if audio_file:
                    audio_file.write(response.audio)
            if args.saveAudioChunks:
                if request.audio_params.audio_format.HasField("ogg_opus"):
                    log.warn("Cannot save separate audio chunks for Ogg Opus, ignoring")
                else:
                    num_chunks = num_chunks + 1
                    chunk_file_name = "%s_i%d_s%d_c%d.%s" % (
                        thread_context.file, num_iterations, thread_context.num_synthesis, num_chunks, extension)
                    chunk_audio_file = open(chunk_file_name, "wb")
                    if args.saveAudioinWave:
                        #Adding wav header before writing to audio file
                        waveheader = genHeader(sampleRate,bitsPerSample,channels,len(response.audio),audioformat)
                        chunk_audio_file.write(waveheader)
                    chunk_audio_file.write(response.audio)
                    chunk_audio_file.close()
                    log.info("Wrote audio chunk to %s" % chunk_file_name)
        elif response.HasField("events"):
            log.info("Received events")
            log.info(text_format.MessageToString(response.events))
        else:
            if response.status.code == 200:
                log.info("Received status response: SUCCESS")
            else:
                log.error("Received status response: FAILED")
                log.error("Code: {}, Message: {}".format(response.status.code, response.status.message))
                log.error('Error: {}'.format(response.status.details))

    if args.saveAudio and audio_file:
        audio_file.close()
        log.info("Wrote audio to %s" % audio_file_name)

    if synthesis_span:
        synthesis_span.finish()
    if client_span:
        client_span.finish()

def send_unary_synthesis_request(grpc_client, request, metadata=None):
    log.info("Sending Unary Synthesis request")

    audio_file = None
    audio_file_name = ""
    extension = ""
    waveheader = ""
    sampleRate  = 0
    bitsPerSample = 0
    channels = 1 
    audioformat = 0
    currentaudiolen = 0
    num_chunks = 0
    metadata = []
    client_span = None
    synthesis_span = None
    received_first_audio_chunk = False
    call_credentials = None

    global total_synthesis
    total_synthesis = total_synthesis + 1

    global args

    thread_context.num_synthesis = thread_context.num_synthesis + 1

    if args.saveAudio or args.saveAudioChunks:
        if request.audio_params.audio_format.HasField("pcm"):
            extension = "pcm"
            sampleRate = request.audio_params.audio_format.pcm.sample_rate_hz
            bitsPerSample = 16
            audioformat = 1
        elif request.audio_params.audio_format.HasField("alaw"):
            extension = "alaw"
            bitsPerSample = 8
            sampleRate = 8000
            audioformat = 6
        elif request.audio_params.audio_format.HasField("ulaw"):
            extension = "ulaw"
            bitsPerSample = 8
            sampleRate = 8000
            audioformat = 7
        elif request.audio_params.audio_format.HasField("ogg_opus"):
            extension = "ogg"
        else:
            extension = "opus"

        if args.saveAudio:
            if args.saveAudioinWave:
                if request.audio_params.audio_format.HasField("ogg_opus") or request.audio_params.audio_format.HasField("opus"):
                    log.warn("Cannot set to wav format for Ogg Opus, ignoring")
                else:
                    extension = "wav"

            if request.audio_params.audio_format.HasField("opus"):
                log.warn("Cannot save whole audio for Opus, ignoring")
            else:
                audio_file_name = "%s_i%d_s%d.%s" % (
                thread_context.file, num_iterations, thread_context.num_synthesis, extension)
                audio_file = open(audio_file_name, "wb")

    if args.nmaid:
        metadata.append(('x-nuance-client-id', args.nmaid))

    if args.jaeger:
        log.debug("Injecting Jaeger span context into request")
        client_span = tracer.start_span("Client.gRPC")
        synthesis_span = tracer.start_span(
            "Client.Synthesize", child_of=client_span)
        carrier = dict()
        tracer.inject(synthesis_span.context,
                      opentracing.propagation.Format.TEXT_MAP, carrier)
        metadata.append(('uber-trace-id', carrier['uber-trace-id']))

    start = time.monotonic()

    response = grpc_client.UnarySynthesize(
        request=request, metadata=metadata)

    if response.status.code == 200:
        log.info("Received audio: %d bytes" % len(response.audio))
        if not received_first_audio_chunk:
            received_first_audio_chunk = True
            latency = time.monotonic() - start
            log.info("First chunk latency: {} seconds".format(latency))
            global total_first_chunk_latency
            total_first_chunk_latency = total_first_chunk_latency + latency
            log.info("Average first-chunk latency (over {} synthesis requests): {} seconds".format(
                total_synthesis, total_first_chunk_latency/(total_synthesis)))

        if args.saveAudio:
            if args.saveAudioinWave:
                if request.audio_params.audio_format.HasField("ogg_opus") or request.audio_params.audio_format.HasField("opus"):
                    log.warn("Cannot save wave format for Opus, ignoring")
                else:
                    currentaudiolen += len(response.audio)
                    waveheader = genHeader(sampleRate,bitsPerSample,channels,currentaudiolen,audioformat)
                    audio_file.seek(0,0)
                    audio_file.write(waveheader)
                    audio_file.seek(0,2)
            if audio_file:
                audio_file.write(response.audio)
        if args.saveAudioChunks:
            if request.audio_params.audio_format.HasField("ogg_opus"):
                log.warn("Cannot save separate audio chunks for Opus, ignoring")
            else:
                num_chunks = num_chunks + 1
                chunk_file_name = "%s_i%d_s%d_c%d.%s" % (
                    thread_context.file, num_iterations, thread_context.num_synthesis, num_chunks, extension)
                chunk_audio_file = open(chunk_file_name, "wb")
                if args.saveAudioinWave:
                    #Adding wav header before writing to audio file
                    waveheader = genHeader(sampleRate,bitsPerSample,channels,len(response.audio),audioformat)
                    chunk_audio_file.write(waveheader)
                chunk_audio_file.write(response.audio)
                chunk_audio_file.close()
                log.info("Wrote audio chunk to %s" % chunk_file_name)
        if response.HasField("events"):
            log.info("Received events")
            log.info(text_format.MessageToString(response.events))
        log.info("Received status response: SUCCESS")
    else:
        if response.HasField("events"):
            log.info("Received events")
            log.info(text_format.MessageToString(response.events))
        log.error("Received status response: FAILED")
        log.error("Code: {}, Message: {}".format(response.status.code, response.status.message))
        log.error('Error: {}'.format(response.status.details))

    if args.saveAudio and audio_file:
        audio_file.close()
        log.info("Wrote audio to %s" % audio_file_name)

    if synthesis_span:
        synthesis_span.finish()
    if client_span:
        client_span.finish()

def parse_args():
    global args
    parser = argparse.ArgumentParser(
        prog="client.py",
        usage="%(prog)s [-options]",
        add_help=False,
        formatter_class=lambda prog: argparse.HelpFormatter(
            prog, max_help_position=45, width=100)
    )

    options = parser.add_argument_group("options")
    options.add_argument("-h", "--help", action="help",
                         help="Show this help message and exit")
    options.add_argument("--nmaid", nargs="?", help=argparse.SUPPRESS)
    options.add_argument("--token", nargs="?", help=argparse.SUPPRESS)
    options.add_argument("-f", "--files", metavar="file", nargs="+",
                         help="List of flow files to execute sequentially, default=['flow.py']", default=['flow.py'])
    options.add_argument("-p", "--parallel", action="store_true",
                         help="Run each flow in a separate thread.")
    options.add_argument("-i", "--iterations", metavar="num", nargs="?",
                         help="Number of times to run the list of files, default=1", default=1, type=int)
    options.add_argument("-s", "--serverUrl", metavar="url", nargs="?",
                         help="NVC server URL, default=localhost:8080", default='localhost:8080')
    options.add_argument("--secure", action="store_true",
                         help="Connect to the server using a secure gRPC channel.")
    options.add_argument("--rootCerts",  metavar="file", nargs="?",
                         help="Root certificates when using secure channel.")
    options.add_argument("--privateKey",  metavar="file", nargs="?",
                         help="Certificate private key when using secure channel.")
    options.add_argument("--certChain",  metavar="file", nargs="?",
                         help="Certificate chain when using secure channel.")
    options.add_argument("--saveAudio", action="store_true",
                         help="Save audio to disk")
    options.add_argument("--saveAudioChunks", action="store_true",
                         help="Save each individual audio chunk to disk")
    options.add_argument("--saveAudioinWave", action="store_true",
                         help="Save each audio to disk in WAVE format")                      
    options.add_argument("--jaeger", metavar="addr", nargs="?", const='udp://localhost:6831',
                         help="Send UDP opentrace spans, default addr=udp://localhost:6831")
    options.add_argument("--sendUnary", action="store_true",
                         help="Receive a single response of audio instead of streams of audio")
    options.add_argument("--maxReceiveSizeMB", metavar="megabytes", nargs="?", help="Maximum length of gRPC server response in megabytes, default=50 MB", default=50, type=int)

    args = parser.parse_args()

def initialize_tracing():
    if args.jaeger:
        print("Enabling Jaeger traces")
        global opentracing
        import opentracing
        import jaeger_client

        from urllib.parse import urlparse
        agent_addr = urlparse(args.jaeger)
        if not agent_addr.netloc:
            raise Exception(
                "invalid jaeger agent address: {}".format(args.jaeger))
        if not agent_addr.hostname:
            raise Exception(
                "missing hostname in jaeger agent address: {}".format(args.jaeger))
        if not agent_addr.port:
            raise Exception(
                "missing port in jaeger agent address: {}".format(args.jaeger))
        tracer_config = {
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'local_agent': {
                'reporting_host': agent_addr.hostname,
                'reporting_port': agent_addr.port
            },
            'logging': True
        }
        config = jaeger_client.Config(
            config=tracer_config, service_name='NVCClient', validate=True)
        global tracer
        tracer = config.initialize_tracer()

def create_channel():
    call_credentials = None
    channel = None

    if args.token:
        log.debug("Adding CallCredentials with token %s" % args.token)
        call_credentials = grpc.access_token_call_credentials(args.token)

    if args.secure:
        log.debug("Creating secure gRPC channel")
        root_certificates = None
        certificate_chain = None
        private_key = None
        if args.rootCerts:
            log.debug("Adding root certs")
            root_certificates = open(args.rootCerts, 'rb').read()
        if args.certChain:
            log.debug("Adding cert chain")
            certificate_chain = open(args.certChain, 'rb').read()
        if args.privateKey:
            log.debug("Adding private key")
            private_key = open(args.privateKey, 'rb').read()

        channel_credentials = grpc.ssl_channel_credentials(root_certificates=root_certificates, private_key=private_key, certificate_chain=certificate_chain)
        if call_credentials is not None:
            channel_credentials = grpc.composite_channel_credentials(channel_credentials, call_credentials)
        channel = grpc.secure_channel(args.serverUrl, credentials=channel_credentials, options=[('grpc.max_receive_message_length', args.maxReceiveSizeMB * 1024 * 1024)])
    else:
        log.debug("Creating insecure gRPC channel")
        channel = grpc.insecure_channel(args.serverUrl, options=[('grpc.max_receive_message_length', args.maxReceiveSizeMB * 1024 * 1024)])

    return channel

def run_one_file(file, list_of_requests):
    with create_channel() as channel:
        grpc_client = nuance_tts_pb2_grpc.SynthesizerStub(channel=channel)
        log.info("Running file [%s]" % file)
        log.debug(list_of_requests)

        thread_context.num_synthesis = 0
        thread_context.file = os.path.basename(file)

        for request in list_of_requests:
            if isinstance(request, nuance_tts_pb2.GetVoicesRequest):
                send_get_voices_request(grpc_client, request)
            elif isinstance(request, nuance_tts_pb2.SynthesisRequest):
                if args.sendUnary:
                    send_unary_synthesis_request(grpc_client, request)
                else:
                    send_synthesis_request(grpc_client, request)
            elif isinstance(request, (int, float)):
                log.info("Waiting for {} seconds".format(request))
                time.sleep(request)
        log.info("Done running file [%s]" % file)

def run():
    parse_args()

    log_level = logging.DEBUG
    global log
    log = logging.getLogger('')
    logging.basicConfig(
        format='%(asctime)s %(levelname)-5s: %(message)s', level=log_level)

    initialize_tracing()

    for i in range(args.iterations):
        global num_iterations
        num_iterations = i + 1
        log.info("Iteration #{}".format(num_iterations))
        threads = []
        for file in args.files:
            absolute_path = os.path.abspath(file)
            module_name = os.path.splitext(absolute_path)[0]
            module = SourceFileLoader(module_name, absolute_path).load_module()

            # module = importlib.import_module(basename)
            if module.list_of_requests == None:
                raise Exception(
                        "Error importing [%s]: variable list_of_requests not defined" % file)
            if args.parallel:
                log.info("Running flows in parallel")
                thread = threading.Thread(target=run_one_file, args=[file, module.list_of_requests])
                threads.append(thread)
                thread.start()
            else:
                run_one_file(file, module.list_of_requests)
        for thread in threads:
            thread.join()
        log.info("Iteration #{} complete".format(num_iterations))

    if total_synthesis > 0:
        log.info("Average first-chunk latency (over {} synthesis requests): {} seconds".format(total_synthesis, total_first_chunk_latency/(total_synthesis)))

    if args.jaeger:
        tracer.close()
        # Need to give time to tracer to flush the spans: https://github.com/jaegertracing/jaeger-client-python/issues/50
        time.sleep(2)
    print("Done")

if __name__ == '__main__':
    run()

This Python app, client.py, optionally returns a unary (single message) synthesis response rather than a stream of audio chunks.

You can obtain this type of response using the --sendUnary command line argument.
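The difference at the call site is small. Here is a minimal sketch contrasting the two calls, reusing the stub and request objects built in the earlier steps; handle_audio is a hypothetical placeholder for whatever the app does with the bytes

# Streaming synthesis: the server returns a stream of responses
# (audio chunks and events, followed by a final status).
for response in stub.Synthesize(request):
    if response.HasField("audio"):
        handle_audio(response.audio)  # hypothetical handler

# Unary synthesis: the server returns a single response containing
# the status, any events, and the complete audio in one message.
response = stub.UnarySynthesize(request)
if response.status.code == 200:
    handle_audio(response.audio)  # hypothetical handler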

Sample Go app

The Go sample app consists of two config files and three application files

The config.json file contains your authentication values

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

The params.json file contains your synthesis and other parameters

{
    "voice": {
      "name": "Zoe-sc",
      "model": "enhanced"
    },
    "audio_params": {
      "audio_format": {
        "AudioFormat": {
          "Pcm": {
            "sample_rate_hz": 22050
          }
        }
      }
    },
    "input": {
      "type": "text/plain;charset=utf-8",
      "InputData": {
        "Body": "Hello world. This is a test"
      }
    },
    "event_params": {
      "send_sentence_marker_events": false,
      "send_word_marker_events": false
    }
  }

The src/tts_client.go file is the main application file

/*
 * Copyright 2019 Nuance Communications Inc.
 */

package main

import (
    "context"
    "crypto/tls"
    "encoding/json"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "os"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    "google.golang.org/grpc/metadata"

    pb "./v1beta1"

    "github.com/akamensky/argparse"
)

func CreateChannelContext(token *string) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)

    // https://github.com/grpc/grpc-go/blob/master/Documentation/grpc-metadata.md
    ctx = metadata.AppendToOutgoingContext(ctx, "authorization", "Bearer "+*token)

    return ctx, cancel
}

func getVoices(ctx context.Context, client pb.SynthesizerClient) {

    req := &pb.GetVoicesRequest{}
    resp, err := client.GetVoices(ctx, req)
    if err != nil {
        log.Printf("Error getting voices: %s", err)
        return
    }

    out, _ := json.MarshalIndent(resp, "", "  ")
    log.Printf("Voices: %s", string(out))
}

func synthesize(ctx context.Context, client pb.SynthesizerClient, params *pb.SynthesisRequest, pcmFile string) {
    stream, err := client.Synthesize(ctx, params)
    if err != nil {
        log.Fatalf("%v.RouteChat(_) = _, %v", client, err)
    }
    waitc := make(chan struct{})
    go func() {
        bytesWritten := 0
        var ttsAudioOut *os.File

        for {
            resp, err := stream.Recv()
            if err == io.EOF {
                // read done.
                close(waitc)
                return
            }
            if err != nil {
                log.Fatalf("Failed to receive a response : %v", err)
            }
            switch response := resp.Response.(type) {
            case *pb.SynthesisResponse_Status:
                log.Printf("received status")
                out, _ := json.MarshalIndent(response.Status, "", "  ")
                log.Printf("%s", string(out))
            case *pb.SynthesisResponse_Events:
                log.Printf("received events")
                out, _ := json.MarshalIndent(response.Events, "", "  ")
                log.Printf("%s", string(out))
            case *pb.SynthesisResponse_Audio:
                log.Printf("received audio")
                if bytesWritten == 0 {
                    ttsAudioOut, _ = os.Create(pcmFile)
                }
                n, _ := ttsAudioOut.Write(response.Audio)
                bytesWritten += n
            }
        }
    }()
    <-waitc
}

func setInputData(req *pb.SynthesisRequest, textInput *string) {

    // If text input has been provided, over-write any text provided in the params file
    if textInput != nil && len(*textInput) > 0 {
        req.Input.InputData = &pb.Input_Body{
            Body: *textInput,
        }
    }
}

func initializeSynthesisRequest(paramFile string, textInput *string) *pb.SynthesisRequest {

    // Default voice parameters
    voice := &pb.Voice{
        Name:  "Evan",
        Model: "enhanced",
    }

    // Default audio parameters (22kHz PCM)
    format := &pb.AudioParameters{
        AudioFormat: &pb.AudioFormat{
            AudioFormat: &pb.AudioFormat_Pcm{
                Pcm: &pb.PCM{
                    SampleRateHz: 22050,
                },
            },
        },
    }

    // Default text to synthesize
    input := &pb.Input{
        Type:      "text/plain;charset=utf-8",
        InputData: &pb.Input_Body{}, // needs to be initialized for json.Marshal to work
    }

    req := &pb.SynthesisRequest{
        Voice:       voice,
        AudioParams: format,
        Input:       input,
    }

    // If textInput has been provided, this overwrites InputData in the params file
    defer setInputData(req, textInput)

    source, err := ioutil.ReadFile(paramFile)
    if err != nil {
        log.Printf("Error reading params file: %s", err)
        return req
    }

    err = json.Unmarshal(source, req)
    if err != nil {
        log.Printf("Error parsing params file: %s", err)
        return req
    }

    return req
}

func main() {

    // Collect arguments
    parser := argparse.NewParser("tts_client", "Use Nuance TTS to vocalize text")
    server := parser.String("s", "server", &argparse.Options{
        Default: "ttsaas.beta.mix.nuance.com:443",
        Help:    "server host:port",
    })
    textInput := parser.String("i", "textInput", &argparse.Options{
        Required: false,
        Help:     "text input",
    })
    outFile := parser.String("o", "outFile", &argparse.Options{
        Required: false,
        Help:     "file to save audio to",
    })
    configFile := parser.String("c", "configFile", &argparse.Options{
        Default: "config.json",
        Help:    "config file containing client credentials (client_id and client_secret)",
    })
    paramsFile := parser.String("p", "paramsFile", &argparse.Options{
        Default: "params.json",
        Help:    "file containing tts parameters to customize TTS behavior",
    })
    voices := parser.Flag("v", "getVoices", &argparse.Options{
        Help: "Specify this flag to retrieve available voices",
    })
    err := parser.Parse(os.Args)
    if err != nil {
        fmt.Print(parser.Usage(err))
        os.Exit(1)
    }

    // Import the user's Mix Beta credentials
    config, err := NewConfig(*configFile)
    if err != nil {
        log.Fatalf("Error importing user credentials: %v", err)
        os.Exit(1)
    }

    // Authenticate the user's credentials
    auth := NewAuthenticator(*config)
    token, err := auth.Authenticate()
    if err != nil {
        log.Fatalf("Error authenticating to Mix: %v", err)
        os.Exit(1)
    }

    // Connect to TTS service
    // credentials.NewTLS does not return an error; it wraps the TLS config directly
    creds := credentials.NewTLS(&tls.Config{})
    conn, err := grpc.Dial(*server, grpc.WithTransportCredentials(creds))
    if err != nil {
        log.Fatalf("Failed to dial server: %v", err)
    }
    defer conn.Close()

    // Run a synthesis request
    client := pb.NewSynthesizerClient(conn)
    ctx, cancel := CreateChannelContext(&token.AccessToken)
    defer cancel()
    if *voices {
        getVoices(ctx, client)
    } else {
        parameters := initializeSynthesisRequest(*paramsFile, textInput)
        jout, _ := json.MarshalIndent(parameters, "", "  ")
        log.Printf("parameters: %v", string(jout))
        synthesize(ctx, client, parameters, *outFile)
    }
}

The src/authenticate.go file authenticates the app to the TTS service. See Step 2: Authenticate and connect.

The src/config.go file handles the authentication values in config.json.
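
Based on the Config struct below, config.json takes this shape; the values are placeholders for the credentials you generated in Mix:

{
  "client_id": "<client ID from your Mix project>",
  "client_secret": "<client secret from your Mix project>",
  "token_url": "<URL of the authorization token service>"
}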

package main

import (
    "encoding/json"
    "io/ioutil"
    "log"
)

type Config struct {
    ClientID     string `json:"client_id"`
    ClientSecret string `json:"client_secret"`
    TokenURL     string `json:"token_url"`
}

func (c *Config) LoadConfig(configFile string) (*Config, error) {
    source, err := ioutil.ReadFile(configFile)
    if err != nil {
        return nil, err
    }
    err = json.Unmarshal(source, c)
    if err != nil {
        return nil, err
    }

    return c, nil
}

func (c *Config) String(pretty bool) string {
    var str []byte
    var err error

    if pretty {
        str, err = json.MarshalIndent(c, "", "  ")
    } else {
        str, err = json.Marshal(c)
    }

    if err != nil {
        log.Printf("Error marshalling config to json: %s", err)
    }

    return string(str)
}

func NewConfig(configFile string) (*Config, error) {
    config := &Config{}
    _, err := config.LoadConfig(configFile)

    return config, err
}

This Go application consists of these files: src/main.go (the synthesis client shown above), src/authenticate.go, and src/config.go.

For example, run the app with the help option to see the values you may pass to the application.

$ go run ./src -h
usage: tts_client [-h|--help] [-s|--server "<value>"] [-i|--textInput
                "<value>"] [-o|--outFile "<value>"] [-c|--configFile
                "<value>"] [-p|--paramsFile "<value>"] [-v|--getVoices]
                Use Nuance TTS to vocalize text
Arguments:
-h  --help        Print help information
-s  --server      server host:port. Default: ttsaas.beta.mix.nuance.com:443
-i  --textInput   text input
-o  --outFile     file to save audio to
-c  --configFile  config file containing client credentials (client_id and
                    client_secret). Default: config.json
-p  --paramsFile  file containing tts parameters to customize TTS behavior.
                    Default: params.json
-v  --getVoices   Specify this flag to retrieve available voices
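
For example, a params.json like the following (a minimal sketch matching the JSON shape the app prints below) selects the Zoe-Sc voice:

{
  "voice": {
    "name": "Zoe-Sc",
    "model": "enhanced"
  }
}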

This runs the app with a SynthesisRequest, generating the audio in out.pcm.

$ go run ./src  -o out.pcm -i "This is a test using params dot json to specify Zoe as the voice"
2019/11/01 11:33:32 parameters: {
  "voice": {
    "name": "Zoe-Sc",
    "model": "enhanced"
  },
  "audio_params": {
    "audio_format": {
      "AudioFormat": {
        "Pcm": {
          "sample_rate_hz": 22050
        }
      }
    }
  },
  "input": {
    "type": "text/plain;charset=utf-8",
    "InputData": {
      "Body": "This is a test using params dot json to specify Zoe as the voice"
    }
  },
  "event_params": {}
}
2019/11/01 11:33:33 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received status
2019/11/01 11:33:34 {
  "code": 200,
  "message": "OK"
}

Reference topics

This section provides more information about topics in the gRPC API.

Status codes

Code Message Indicates
200 Success Synthesis completed successfully.
400 Bad request A malformed or unsupported client request was rejected.

Streamed vs. unary response

One request, two possible responses

service Synthesizer {
  rpc Synthesize(SynthesisRequest) returns (stream SynthesisResponse) {}
  rpc UnarySynthesize(SynthesisRequest) returns (UnarySynthesisResponse) {} 
}
. . .
message SynthesisRequest { 
  Voice voice = 1;  
  AudioParameters audio_params = 2; 
  Input input = 3;   
  EventParameters event_params = 4;  
  map<string, string> client_data = 5; 
}

message SynthesisResponse {
  oneof response {
    Status status = 1;   
    Events events = 2;   
    bytes audio = 3;     
  }
}

message UnarySynthesisResponse {  
  Status status = 1;   
  Events events = 2;   
  bytes audio = 3;     
}

NVC offers two types of synthesis response: a streamed response available in SynthesisResponse and a non-streamed response in UnarySynthesisResponse.

The request is the same in both cases: SynthesisRequest specifies a voice, the input text to synthesize, and optional parameters. The response can be either a stream of SynthesisResponse messages, each containing one of a status, a list of events, or an audio buffer, or a single UnarySynthesisResponse message containing the status, events, and audio together.

See Sample Python app: unary response for an application that includes the unary response, activated by a command line flag.
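
For the Go app shown earlier, a unary request is a small change. This is a minimal sketch, assuming the generated stub exposes UnarySynthesize with the usual unary signature and that ctx and params are set up as in main.go:

resp, err := client.UnarySynthesize(ctx, params)
if err != nil {
    log.Fatalf("UnarySynthesize failed: %v", err)
}
// The single response carries status, events, and audio together
log.Printf("status: %d %s", resp.Status.Code, resp.Status.Message)
if len(resp.Audio) > 0 {
    if err := ioutil.WriteFile("out.pcm", resp.Audio, 0644); err != nil {
        log.Fatalf("Failed to write audio: %v", err)
    }
}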

Defaults

The proto file provides the following defaults for messages in SynthesisRequest. Mandatory fields are noted in the Default column.

Items in SynthesisRequest Default
    voice (Voice)
    name Mandatory, e.g. 'Evan'
    model Mandatory, e.g. 'enhanced'
    language Voice's default language
    age_group (EnumAgeGroup) ADULT
    gender (EnumGender) ANY
    audio_params (AudioParameters)
    audio_format (AudioFormat) PCM 22.05kHz
    volume_percentage 80
    speaking_rate_percentage 50
    audio_chunk_duration_ms 20000 (20 seconds)
    target_audio_length_ms 0, meaning no maximum duration
    disable_early_emission False: Send audio segments as soon as possible
    input (Input)
    type text/plain;charset=utf-8
    body Mandatory
    uri Disabled
    body_as_bytes Disabled
    escape_sequence \! and <ESC>
    resources (SynthesisResource)
      type (EnumResourceType) USER_DICTIONARY
    ssml_validation (EnumSSMLValidation) STRICT
    lid_params (LanguageIdentificationParameters)
      disable False: LID is turned on
      languages Blank, meaning use all available languages
      always_use_highest_confidence False: use a language with any confidence score
    download_params (DownloadParameters)
      max_age 0 delta seconds
      max_stale 0 delta seconds
      request_timeout_ms 0*, usually 30000 (30 seconds)
      refuse_cookies False: Accept cookies
    event_params (EventParameters)
    send_sentence_marker_events False: Do not send
    send_word_marker_events False: Do not send
    send_phoneme_marker_events False: Do not send
    send_bookmark_marker_events False: Do not send
    send_paragraph_marker_events False: Do not send
    send_visemes False: Do not send
    send_log_events False: Do not send
    suppress_input False: Include text and URIs in logs
    client_data Blank

* Items marked with an asterisk (*) default to 0, meaning a server default: the default is set in the configuration file used by the NVC engine instance. The values shown here are taken from the sample config files provided with NVC.

gRPC API

NVC provides a protocol buffer file, nuance_tts.proto, to define Nuance's TTS service for gRPC. This file contains the building blocks of your voice synthesis applications.

Once you have transformed the proto file into functions and classes in your programming language using gRPC tools, you can call these functions from your application to set parameters, request synthesis, and receive the resulting audio.

See Client app development for scenarios and sample apps in Python and Go. For other languages, consult the gRPC and Protocol Buffers documentation.

Field names in proto and stub files

In this section, the names of the fields are shown as they appear in the proto files. To see how they are generated in your programming language, consult your generated files. For example:

Proto file Python Go Java
audio_params audio_params AudioParams audioParams or getAudioParams
language language Language language or getLanguage

For details about how proto file terms are transformed in the different programming languages, see the Protocol Buffers documentation for your language.

Proto file structure

Structure of nuance_tts.proto

Synthesizer
    GetVoices
        GetVoicesRequest
        GetVoicesResponse
    Synthesize
        SynthesisRequest
        SynthesisResponse
    UnarySynthesize
        SynthesisRequest
        UnarySynthesisResponse

GetVoicesRequest / GetVoicesResponse
    Voice
        EnumAgeGroup
        EnumGender
        voice fields

SynthesisRequest
    Voice
        voice fields
    AudioParameters
        audio parameter fields
        AudioFormat
            audio format fields
            OggOpus
                OggOpus fields
                EnumVariableBitrate
            Opus
                Opus fields
                EnumVariableBitrate
    Input
        input fields
        SynthesisResource
            resource fields
            EnumResourceType
        EnumSSMLValidation
        LanguageIdentificationParameters
        DownloadParameters
    EventParameters
        event parameter fields
    client_data

SynthesisResponse
    Status
    Events
        Event
    audio

UnarySynthesisResponse
    Status
    Events
        Event
    audio

The proto file defines a Synthesizer service with three RPC methods: GetVoices, Synthesize, and UnarySynthesize. Details about each component are referenced by name within the proto file.

This is the structure of the GetVoices request and response:

Proto files: GetVoices

And this shows the Synthesize and UnarySynthesize request and response:

Proto files: Synthesize

Synthesizer

The Synthesizer service offers these RPC methods:

Name Request Type Response Type
GetVoices GetVoicesRequest GetVoicesResponse
Synthesize SynthesisRequest SynthesisResponse stream
UnarySynthesize SynthesisRequest UnarySynthesisResponse

GetVoicesRequest

Get voices request/response code

def create_get_voices_request(name=None, model=None, language=None, sample_rate=0):
    request = nuance_tts_pb2.GetVoicesRequest()
    if name:
        request.voice.name = name
    if model:
        request.voice.model = model
    if language:
        request.voice.language = language
    request.voice.sample_rate_hz = sample_rate
. . .     
        for voice in response.voices:
            print("Voice: %s, Model: %s, Language: %s, Sample rate: %d" % (voice.name, voice.model, voice.language, voice.sample_rate_hz))
func getVoices(ctx context.Context, client pb.SynthesizerClient) {

    req := &pb.GetVoicesRequest{}
    resp, err := client.GetVoices(ctx, req)
    if err != nil {
        log.Printf("Error getting voices: %s", err)
        return
    }

    out, _ := json.MarshalIndent(resp, "", "  ")
    log.Printf("Voices: %s", string(out))
}

Sample response

Voice: Evan, Model: enhanced, Language: en-us, Sample rate: 22050
Voice: Zoe-Sc, Model: enhanced, Language: en-us, Sample rate: 22050
2019/11/01 11:31:29 Voices: {
  "voices": [
    {
      "name": "Evan",
      "model": "enhanced",
      "language": "en-us"
    },
    {
      "name": "Zoe-Sc",
      "model": "enhanced",
      "language": "en-us"
    }
  ]
}

Input message for Synthesizer - GetVoices, to query the voices available to the client.

Field Type Description
voice Voice Optionally filter the voices to retrieve, e.g. set language to en-US to return only American English voices.
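
For example, this Go fragment (a sketch using the generated types from the sample app) requests only American English voices:

// Filter the GetVoices query by language; leave Voice empty to list all voices
req := &pb.GetVoicesRequest{
    Voice: &pb.Voice{Language: "en-US"},
}
resp, err := client.GetVoices(ctx, req)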

Voice

Input or output message for voices. When sent as input in GetVoicesRequest, it filters the voices to retrieve; in SynthesisRequest, it specifies the voice to use for synthesis.

When received as output in GetVoicesResponse, it returns the list of available voices.

Field Type Description
name string The voice's name, e.g. 'Evan'. Mandatory for SynthesisRequest.
model string The voice's quality model, e.g. 'enhanced' or 'standard'. Mandatory for SynthesisRequest.
language string IETF language code, e.g. 'en-US'. Some voices may support multiple languages. Default is the voice's default language.
age_group EnumAgeGroup Used only in GetVoicesRequest and GetVoicesResponse, to search for adult or child voices. Ignored otherwise.
gender EnumGender Used only in GetVoicesRequest and GetVoicesResponse, to search for voices with a certain gender. Ignored otherwise.
sample_rate_hz uint32 Used only in GetVoicesRequest and GetVoicesResponse, to search for a certain native sample rate. Ignored otherwise.
language_tlw string Used only in GetVoicesRequest and GetVoicesResponse. Three-letter language code (e.g. 'enu' for American English) for configuring language identification in Input.
restricted bool Used only in GetVoicesResponse, to identify restricted voices (restricted: true). These are custom voices available only to specific customers. Default is false, meaning the voice is public.

EnumAgeGroup

Input or output field specifying whether the voice uses its adult or child version, if available. Included in Voice.

Name Number Description
ADULT 0 Adult voice. Default for GetVoicesRequest.
CHILD 1 Child voice.

EnumGender

Input or output field, specifying gender for voices that support multiple genders. Included in Voice.

Name Number Description
ANY 0 Any gender voice. Default for GetVoicesRequest.
MALE 1 Male voice.
FEMALE 2 Female voice.
NEUTRAL 3 Neutral gender voice.

GetVoicesResponse

Output message for Synthesizer - GetVoices. Includes a list of voices that matched the input criteria, if any.

Field Type Description
voices Voice Repeated. Voices and characteristics returned.

SynthesisRequest

Synthesis request

def create_synthesis_request(name, model, body, uri, sample_rate, send_log_events=False):
    request = nuance_tts_pb2.SynthesisRequest()

    request.voice.name = name
    request.voice.model = model

    pcm = nuance_tts_pb2.PCM(sample_rate_hz=sample_rate)
    request.audio_params.audio_format.pcm.CopyFrom(pcm)

    request.input.type = "text/plain;charset=utf-8"
    if body:
        request.input.body = body
    elif uri:
        request.input.uri = uri
    else:
        raise RuntimeError("--body or --uri must be defined.")

    request.event_params.send_log_events = send_log_events

    return request
func initializeSynthesisRequest(paramFile string, textInput *string) *pb.SynthesisRequest {

    // Default voice parameters
    voice := &pb.Voice{
        Name:  "Evan",
        Model: "enhanced",
    }

    // Default audio parameters (22kHz PCM)
    format := &pb.AudioParameters{
        AudioFormat: &pb.AudioFormat{
            AudioFormat: &pb.AudioFormat_Pcm{
                Pcm: &pb.PCM{
                    SampleRateHz: 22050,
                },
            },
        },
    }

    // Default text to synthesize
    input := &pb.Input{
        Type:      "text/plain;charset=utf-8",
        InputData: &pb.Input_Body{}, // needs to be initialized for json.Marshal to work
    }

    req := &pb.SynthesisRequest{
        Voice:       voice,
        AudioParams: format,
        Input:       input,
    }

Input message for Synthesizer - Synthesize. Specifies input text, audio parameters, and events to subscribe to, in exchange for synthesized audio.

Field Type Description
voice Voice The voice to use for audio synthesis.
audio_params AudioParameters Output audio parameters, such as encoding and volume.
input Input Input text to synthesize, tuning data, etc.
event_params EventParameters Markers and other info to include in server events returned during synthesis.
client_data string,string Repeated. Optional client-supplied key-value pairs to inject into the call log.
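
For example, in Go the client_data map might be populated like this (a sketch; the key names are hypothetical, since any string pairs are accepted):

// Hypothetical key-value pairs to inject into the call log
req.ClientData = map[string]string{
    "app_version": "1.2.3",
    "request_tag": "regression-test",
}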

AudioParameters

Input message for audio-related parameters during synthesis, including encoding, volume, and audio length. Included in SynthesisRequest.

Field Type Description
audio_format AudioFormat Audio encoding. Default PCM 22.05kHz.
volume_percentage uint32 Volume amplitude, from 0 to 100. Default 80.
speaking_rate_percentage uint32 Speaking rate, from 0 to 100. Default 50.
audio_chunk_duration_ms uint32 Maximum duration, in ms, of an audio chunk delivered to the client, from 1 to 60000. Default is 20000 (20 seconds). When this parameter is large enough (for example, 20 or 30 seconds), each audio chunk contains an audible segment surrounded by silence.
target_audio_length_ms uint32 Maximum duration, in ms, of synthesized audio. When greater than 0, the server stops ongoing synthesis at the first sentence end, or silence, closest to the value.
disable_early_emission bool By default, audio segments are emitted as soon as possible, even if they are not audible. This behavior may be disabled.

AudioFormat

Input message for audio encoding of synthesized text. Included in AudioParameters.

Field Type Description
pcm PCM Signed 16-bit little endian PCM, 8kHz to 24kHz. Default.
alaw ALaw G.711 A-law, 8kHz.
ulaw ULaw G.711 Mu-law, 8kHz.
ogg_opus OggOpus Ogg Opus, 8kHz, 16kHz, or 24kHz.
opus Opus Opus, 8kHz, 16kHz, or 24kHz. The audio will be sent one Opus packet at a time.

PCM

Input message defining PCM sample rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate: 8000, 16000, 22050 (default), 24000.

ALaw

Input message defining A-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

ULaw

Input message defining Mu-law audio format. Included in AudioFormat. G.711 audio formats are set to 8kHz.

OggOpus

Input message defining Ogg Opus output rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate. Supported values: 8000, 16000, 24000 Hz.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000 bps.
max_frame_duration_ms float Opus frame size, in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.
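
A sketch of selecting Ogg Opus output in Go; it assumes the generated oneof wrapper is named AudioFormat_OggOpus, by analogy with the AudioFormat_Pcm wrapper used in the sample app:

req.AudioParams = &pb.AudioParameters{
    AudioFormat: &pb.AudioFormat{
        AudioFormat: &pb.AudioFormat_OggOpus{
            OggOpus: &pb.OggOpus{
                SampleRateHz: 24000, // 8000, 16000, or 24000 Hz
                BitRateBps:   28000, // default bitrate
            },
        },
    },
}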

Opus

Input message defining Opus output rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Output sample rate. Supported values: 8000, 16000, 24000 Hz.
bit_rate_bps uint32 Valid range is 500 to 256000 bps. Default 28000 bps.
max_frame_duration_ms float Opus frame size, in ms: 2.5, 5, 10, 20, 40, 60. Default 20.
complexity uint32 Computational complexity. A complexity of 0 means the codec default.
vbr EnumVariableBitrate Variable bitrate. On by default.

EnumVariableBitrate

Settings for variable bitrate. Included in OggOpus and Opus. Turned on by default.

Name Number Description
VARIABLE_BITRATE_ON 0 Use variable bitrate. Default.
VARIABLE_BITRATE_OFF 1 Do not use variable bitrate.
VARIABLE_BITRATE_CONSTRAINED 2 Use constrained variable bitrate.

Input

Input message containing text to synthesize and synthesis parameters, including tuning data, etc. Included in SynthesisRequest. The mime content type of the data may be plain text or SSML, as set in the type field below.

The escape_sequence field defines an additional escape sequence for control sequences within the input text (the ESC character, 0x1B, is also accepted). Default is \!. The value is a Perl 5 compatible regular expression, for example ###. Special characters may be escaped with the backslash (\) character. Avoid characters that might appear in your input text; otherwise you may inadvertently create an extra escape sequence. For example, \$ is a poor choice if your input text may include $.

Field Type Description
type string Mime content type of the data, as described above. Default text/plain;charset=utf-8.
uri string Remote URI to the input text. Currently disabled.
body string Actual input text or SSML in UTF-8 encoding. Mandatory.
body_as_bytes bytes Binary-safe input text or SSML. Can be used with any encoding. Currently disabled.
escape_sequence string Additional escape character for control sequences within the input text. Default is \! and ESC.
resources SynthesisResource Repeated. Synthesis resources (user dictionaries, rulesets, etc.) to tune synthesized audio. Default blank.
ssml_validation EnumSSMLValidation SSML validation mode. Default STRICT.
lid_params LanguageIdentificationParameters LID parameters.
download_params DownloadParameters Remote file download parameters.

SynthesisResource

Input message specifying the type of file to tune the synthesized output and its location or contents. Included in Input.

Field Type Description
type EnumResourceType Resource type, e.g. user dictionary, etc. Default USER_DICTIONARY.
uri string URI to the remote resource, or
body bytes For EnumResourceType USER_DICTIONARY, the contents of the file.
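
For example, this Go fragment (a sketch; dictBytes is assumed to hold a compiled user dictionary loaded by the caller) attaches an inline dictionary to the request:

// dictBytes: contents of a compiled user dictionary
// (application/edct-bin-dictionary)
req.Input.Resources = []*pb.SynthesisResource{
    {
        Type: pb.EnumResourceType_USER_DICTIONARY,
        Body: dictBytes,
    },
}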

EnumResourceType

The type of synthesis resource to tune the output. Included in SynthesisResource. User dictionaries provide custom pronunciations, rulesets apply search-and-replace rules to input text and ActivePrompt databases help tune synthesized audio under certain conditions, using Nuance Vocalizer Studio.

Name Number Description
USER_DICTIONARY 0 User dictionary (application/edct-bin-dictionary). Default.
TEXT_USER_RULESET 1 Text user ruleset (application/x-vocalizer-rettt+text).
BINARY_USER_RULESET 2 Binary user ruleset (application/x-vocalizer-rettt+bin).
ACTIVEPROMPT_DB 3 ActivePrompt database (application/x-vocalizer/activeprompt-db).
ACTIVEPROMPT_DB_AUTO 4 ActivePrompt database with automatic insertion (application/x-vocalizer/activeprompt-db;mode=automatic).
SYSTEM_DICTIONARY 5 Nuance system dictionary (application/sdct-bin-dictionary).

EnumSSMLValidation

SSML validation mode when using SSML input. Included in Input. Strict by default but can be relaxed.

Name Number Description
STRICT 0 Strict SSML validation. Default.
WARN 1 Give warning only.
NONE 2 Do not validate.
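
To relax validation in Go (a one-line sketch; the field name assumes the standard protoc-gen-go conversion of ssml_validation):

req.Input.SsmlValidation = pb.EnumSSMLValidation_WARN // warn instead of rejecting invalid SSML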

LanguageIdentificationParameters

Input message controlling the language identifier. Included in Input. The language identifier runs on input blocks labeled with the <ESC>\lang=unknown\ control sequence or SSML xml:lang="unknown". The language identifier automatically restricts the matched languages to the installed voices. The languages field further limits the permissible languages, and also sets the order of precedence (first to last) when they have equal confidence scores.

Field Type Description
disable bool Whether to disable language identification. Turned on by default.
languages string Repeated. List of three-letter language codes (e.g. enu, frc, spm) to restrict language identification results, in order of precedence. Use GetVoicesRequest - Voice - language_tlw to obtain the three-letter codes. Default blank.
always_use_highest_confidence bool If enabled, language identification always chooses the language with the highest confidence score, even if the score is low. Default false, meaning use language with any confidence.
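
For example, this Go fragment (a sketch using the generated field names) restricts identification to American English and Canadian French, in that order of precedence:

req.Input.LidParams = &pb.LanguageIdentificationParameters{
    // Three-letter codes, as returned in GetVoicesResponse language_tlw
    Languages: []string{"enu", "frc"},
}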

DownloadParameters

Input message containing parameters for remote file download, whether for input text (Input.uri) or a SynthesisResource (SynthesisResource.uri). Included in Input.

Field Type Description
max_age uint32 See https://tools.ietf.org/html/rfc7234#section-5.2.1.1. Default 0 seconds.
max_stale uint32 See https://tools.ietf.org/html/rfc7234#section-5.2.1.2. Default 0 seconds.
request_timeout_ms uint32 Request timeout in ms. Default (0) means server default, usually 30000 (30 seconds).
refuse_cookies bool Whether to disable cookies. By default, HTTP requests accept cookies.

EventParameters

Input message that defines event subscription parameters. Included in SynthesisRequest. Events that are requested are sent throughout the SynthesisResponse stream, when generated. Marker events can send events as certain parts of the synthesized audio are reached, for example, at the end of a word, sentence, or user-defined bookmark.

Log events are produced throughout a synthesis request for events such as a voice loaded by the server or an audio chunk being ready to send.

Field Type Description
send_sentence_marker_events bool Sentence marker. Default: do not send.
send_word_marker_events bool Word marker. Default: do not send.
send_phoneme_marker_events bool Phoneme marker. Default: do not send.
send_bookmark_marker_events bool Bookmark marker. Default: do not send.
send_paragraph_marker_events bool Paragraph marker. Default: do not send.
send_visemes bool Lipsync information. Default: do not send.
send_log_events bool Whether to log events during synthesis. By default, logging is turned off.
suppress_input bool Whether to omit input text and URIs from log events. By default, these items are included.
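
For example, this Go fragment (a sketch using the generated field names) subscribes to word and sentence markers while keeping input text out of the logs:

req.EventParams = &pb.EventParameters{
    SendWordMarkerEvents:     true,
    SendSentenceMarkerEvents: true,
    SuppressInput:            true, // omit input text and URIs from log events
}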

SynthesisResponse

Response to synthesis request

        audio_file = None
        try:
            if args.output_audio_file:
                audio_file = open(args.output_audio_file, "wb")
            for response in stream_in:
                if response.HasField("audio"):
                    print("Received audio: %d bytes" % len(response.audio))
                    if audio_file:
                        audio_file.write(response.audio)
                elif response.HasField("events"):
                    print("Received events")
                    print(text_format.MessageToString(response.events))
                else:
                    if response.status.code == 200:
                        print("Received status response: SUCCESS")
                    else:
                        print("Received status response: FAILED")
                        print("Code: {}, Message: {}".format(response.status.code, response.status.message))
                        print('Error: {}'.format(response.status.details))
for {
    resp, err := stream.Recv()
    if err == io.EOF {
        // read done.
        close(waitc)
        return
    }
    if err != nil {
        log.Fatalf("Failed to receive a response : %v", err)
    }
    switch response := resp.Response.(type) {
    case *pb.SynthesisResponse_Status:
        log.Printf("received status")
        out, _ := json.MarshalIndent(response.Status, "", "  ")
        log.Printf("%s", string(out))
    case *pb.SynthesisResponse_Events:
        log.Printf("received events")
        out, _ := json.MarshalIndent(response.Events, "", "  ")
        log.Printf("%s", string(out))
    case *pb.SynthesisResponse_Audio:
        log.Printf("received audio")
        if bytesWritten == 0 {
            ttsAudioOut, err = os.Create(pcmFile)
            if err != nil {
                log.Fatalf("Failed to create audio file %s: %v", pcmFile, err)
            }
        }
        n, _ := ttsAudioOut.Write(response.Audio)
        bytesWritten += n
    }
}

Response example

Received audio: 36636 bytes
Received audio: 15188 bytes
Received audio: 5432 bytes
Received audio: 13258 bytes
Received audio: 36294 bytes
Received audio: 41650 bytes
Received audio: 24678 bytes
Received audio: 42446 bytes
Received status response: SUCCESS
2019/11/01 11:33:33 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received audio
2019/11/01 11:33:34 received status
2019/11/01 11:33:34 {
  "code": 200,
  "message": "OK"
}

The Synthesizer - Synthesize RPC call returns a stream of SynthesisResponse messages. (See UnarySynthesisResponse for a non-streamed response.) Each response contains one of:

Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes The latest audio buffer.

Status

Output message containing a status response, indicating completion or failure of a Synthesize call. Included in SynthesisResponse and UnarySynthesisResponse.

Field Type Description
code uint32 HTTP-style return code: 200, 4xx, or 5xx as appropriate.
message string Brief description of the status.
details string Longer description if available.

Events

Output message defining a container for a list of events. This container is needed because oneof does not allow repeated parameters in Protobuf. Included in SynthesisResponse and UnarySynthesisResponse.

Field Type Description
events Event Repeated. One or more events.

Event

Output message defining an event message. Included in Events. See EventParameters for details.

Field Type Description
name string Either "Markers" or the name of the event in the case of a Log Event.
values string,string Repeated. Key-value data relevant to the current event.

UnarySynthesisResponse

The Synthesizer - UnarySynthesize RPC call returns a single UnarySynthesisResponse message. It is similar to SynthesisResponse but includes all the information instead of a single type of response. The response contains:

Field Type Description
status Status A status response, indicating completion or failure of the request.
events Events A list of events. See EventParameters for details.
audio bytes Audio buffer of the synthesized text.

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean boolean
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str

Change log

2020-02-19

These changes were made to the API and documentation:

2020-01-22

These changes were made to the API and documentation:

2019-12-18

These changes were made to the TTSaaS gRPC API documentation:

2019-12-02

These changes were made to the TTSaaS gRPC API documentation:

2019-11-15

Below are changes made to the TTSaaS gRPC API documentation since the initial Beta release: