
ASR as a Service gRPC API

The Nuance ASR (Automatic Speech Recognition) service is powered by Nuance's Krypton transcription engine, which performs real-time large vocabulary continuous speech recognition. It converts an audio stream of human speech into text by recognizing the speech and transcribing it into text. Krypton can work independently or as part of a complete speech and/or virtual assistant system.

Krypton works with a data pack in a specific language and locale, and can optionally work with recognition aides such as domain language models and wordsets to improve recognition and transcription in specific environments or businesses.

The gRPC protocol provided by Krypton allows a client application to request transcription services in all the programming languages supported by gRPC.

gRPC is an open source RPC (remote procedure call) software used to create services. It uses HTTP/2 for transport and protocol buffers to define the structure of the application. Krypton supports Protocol Buffers version 3, also known as proto3.

Version: v1beta1 (deprecated)

You may continue to use v1beta1 in your Krypton gRPC applications without any changes, following this documentation. When Krypton receives a request from your client application, it identifies the protocol version transparently.

See v1 documentation to use the latest protocol version, including instructions on upgrading to v1.

You may use only one protocol version per application. You may not combine v1beta1, v1beta2, and/or v1 syntax in one application.

Prerequisites from Mix

Before developing your gRPC application, you need a Mix project. This project provides credentials to run your application against the Nuance-hosted Krypton engine. It also helps you create one or more domain language models to improve recognition in your Krypton application.

See Mix.nlu workflow for the steps to:

Generate a "secret" and client ID of your Mix project. Later, you will use these credentials to request an authorization token to run your application.

gRPC setup

Download proto files

nuance_asr.proto
nuance_asr_resource.proto
nuance_asr_result.proto

Install gRPC for programming language

Python
$ pip install --upgrade pip
$ pip install grpcio
$ pip install grpcio-tools
Go
$ go get google.golang.org/grpc
$ go get github.com/akamensky/argparse
$ go get github.com/youpy/go-wav

Generate client stubs from proto files

Python
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ nuance_asr.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ nuance_asr_resource.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ nuance_asr_result.proto

$ ls -1 nuance*.py
nuance_asr_pb2_grpc.py  
nuance_asr_pb2.py  
nuance_asr_resource_pb2.py  
nuance_asr_result_pb2.py
Go
$ protoc -I ./v1beta1/ ./v1beta1/nuance_asr.proto --go_out=plugins=grpc:./v1beta1
$ protoc -I ./v1beta1/ ./v1beta1/nuance_asr_resource.proto --go_out=./v1beta1
$ protoc -I ./v1beta1/ ./v1beta1/nuance_asr_result.proto --go_out=./v1beta1

$ ls v1beta1/nuance*.go 
v1beta1/nuance_asr_resource.pb.go
v1beta1/nuance_asr.pb.go
v1beta1/nuance_asr_result.pb.go

The basic steps in using the Krypton gRPC protocol are:

  1. Download the three gRPC proto files: nuance_asr.proto, nuance_asr_resource.proto, and nuance_asr_result.proto. These proto3 files contain a generic version of the functions or classes for requesting transcription from a Krypton engine.

  2. Install gRPC for the programming language of your choice, including C++, Java, Python, Go, Ruby, C#, Node.js, and others. See gRPC Documentation for a complete list and instructions on using gRPC with each language.

    This documentation uses Python 3.6 and Go as its programming language examples. For Python, it uses pip to install the gRPC components, grpcio and grpcio-tools. For Go, it uses go get to install gRPC.

  3. Generate client stub files in your programming language from the proto files. For example, for Python and Go, use the protoc tool to generate client stubs. The generated stubs are named nuance_asr*.py for Python and nuance_asr*.pb.go for Go.

    You do not need to review the generated stubs in detail. All the information needed to write your app is contained in the documentation provided with the proto files. See Reference.

  4. Write your client app, referencing the functions or classes in the client stub files. See Client application for details and a scenario, including domain language models (DLMs) and wordsets.

  5. Run your client app to request transcription, optionally passing DLMs and wordsets to improve recognition. See Sample Python app and Sample Go app.

Client application

The gRPC protocol for Krypton lets you create a voice recognition application that incorporates all the functionality expected for recognizing and transcribing speech. In general, you will want to:

This section describes how to implement the basic functionality of Krypton in the context of a Python 3.6 application named my-python-app.py that uses the Krypton gRPC protocol. For the complete application, see Sample Python app.

See also Sample Go app for an application written in Go.

Step 1: Set up token

Save token to file: gen-token.sh

#!/bin/bash

CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"
curl -s -u "$CLIENT_ID:$SECRET" "https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])' \
> my-token.txt

Save token to variable: gen-token-var.sh

#!/bin/bash

CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"
MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" "https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"

Nuance Mix uses the OAuth 2.0 protocol for authentication. Your client application must provide an access token to be able to access the ASR runtime service. The token expires after a certain period (usually an hour), so it must be regenerated frequently.

Depending on how your application expects the token, you can include the generation code within your application or create a script that saves the token in a variable or file:

The curl command in these scripts generates a JSON object including the access_token field that contains the token, then uses Python tools to extract the token from the JSON. The resulting file or environment variable contains only the token.

In this scenario, the colon (:) in the client ID must be changed to the code %3A so curl can parse the value correctly:

appID:NMDPTRIAL_your_name_nuance_com_20190919T190532565840  
-->  
appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840
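
If you generate the token from Python rather than from a shell script, the standard library can apply the same encoding. A minimal sketch (the client ID is the placeholder used in the scripts above):

from urllib.parse import quote

client_id = "appID:NMDPTRIAL_your_name_nuance_com_20190919T190532565840"
encoded_id = quote(client_id, safe='')
print(encoded_id)   # appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840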

Once you have created the script file, you can run it before running your application. See Sample Python app for an example.

Step 2: Import functions

Import functions from stubs

import sys, asyncio, wave, grpc
from nuance_asr_resource_pb2 import *
from nuance_asr_result_pb2 import *
from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *
from h2.exceptions import StreamClosedError

In your client application, first import functions from the Krypton client stubs that you generated in gRPC setup step 3. This Python example imports all functions from the four generated stub files: nuance_asr_pb2.py, nuance_asr_pb2_grpc.py, nuance_asr_resource_pb2.py, and nuance_asr_result_pb2.py.

Store these files in a location relative to the application that you are creating. In this example, they are in the same directory.

You do not need to edit these generated stub files, or even understand them thoroughly. You can consult the original proto files, either directly or here in the documentation (see Reference), to learn about the objects that will make up your transcription application.
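
If you prefer to keep the generated stubs in a subdirectory rather than beside the application, a minimal sketch of one way to make them importable (the stubs directory name is only an example):

import sys, os

# Make the generated stubs importable from a subdirectory named stubs
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), 'stubs'))

from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *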

Step 3: Set recognition parms

Set reco parms: language and audio_format are mandatory

async def stream_out(wf):
    try:
        init = RecognizeInitMessage(
          parameters = RecognitionParameters(
            language = 'eng-USA',   
            audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),  
            result_type = 'IMMUTABLE_PARTIAL', 
            utterance_detection_mode='MULTIPLE', 
            recognition_flags = RecognitionFlags(auto_punctuate=True)))

The first part of your application sets RecognitionParameters, or parameters that customize the type of recognition you want. Some examples are language and audio_format (both mandatory), result_type, utterance_detection_mode, and recognition_flags.

For details about all recognition parameters, see RecognizeInitMessage and its RecognitionParameters, as described in the proto file.

Step 4: Request transcription

Request transcription and simulate audio stream

yield RecognizeRequest(recognize_init_message=init)

# simulate a typical realtime audio stream
print(f'stream {wf.name}')
packet_duration = 0.020
packet_samples = int(wf.getframerate() * packet_duration)
for packet in iter(lambda: wf.readframes(packet_samples), b''):
    yield RecognizeRequest(audio=packet)
    await asyncio.sleep(packet_duration)
print('stream complete')

After setting recognition parameters, your app defines a function that calls the RecognizeRequest stream, including your recognition parameters.

In this Python example, this is achieved with a two-part yield structure that first sends recognition parameters then sends the audio for transcription in chunks.

yield RecognizeRequest(recognize_init_message=init)
. . . 
    yield RecognizeRequest(audio=packet)

Normally your app will send streaming audio to Krypton for processing but, for simplicity, this app simulates streaming audio by breaking up an audio file into chunks and feeding it to Krypton a bit at a time.

Step 5: Set channel and audio source

Set location of Krypton instance and audio file

try:
    hostaddr = sys.argv[1]
    token_file = sys.argv[2]
    audio_file = sys.argv[3]
    with open(token_file, 'r') as tf:
        access_token = tf.read().strip()
    . . . 
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)     
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)

The app must include the location of the Krypton instance, the authentication token, and where the audio is obtained.

This example calls the Krypton service on a secure channel passed from the command line, along with the token. It expects the audio to be provided as an audio file referenced on the command line. For example:

$ ./my-python-app.py \
    asraas.beta.mix.nuance.com:443 \
    my-token.txt \
    my-audio-file.wav

Step 6: Receive results

Receive results from Krypton

try:
    async for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

Finally the app returns the results received from the Krypton engine. This simple app merely prints the resulting transcription on screen as it is streamed from Krypton, sentence by sentence, with intermediate partial sentence results when the app has requested PARTIAL or IMMUTABLE_PARTIAL results.

It returns the formatted_text field, although you may also request minimally_formatted_text. See Hypothesis.
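
For example, a minimal variation of the loop shown above prints the minimally formatted form instead (assuming the result contains at least one hypothesis):

print(f'{restype}: {message.result.hypotheses[0].minimally_formatted_text}')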

The results may be long or short depending on the length of your audio file and the result type.

Result type IMMUTABLE_PARTIAL

Results from audio file with result type IMMUTABLE_PARTIAL

$ ./my-python-app.py monday_morning_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream monday_morning_16.wav
partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining.
partial : I'm getting ready
partial : I'm getting ready to
partial : I'm getting ready to walk
partial : I'm getting ready to walk to the
partial : I'm getting ready to walk to the train commute
final : I'm getting ready to walk to the train commute into work.
partial : I'll catch
partial : I'll catch the
partial : I'll catch the 750
partial : I'll catch the 758 train from
final : I'll catch the 758 train from Cedar Park station.
partial : It will take
partial : It will take me an hour
partial : It will take me an hour to get
final : It will take me an hour to get into town.
stream complete
200 Success

This example shows the transcription results from my audio file, monday_morning_16.wav, a 16kHz wave file talking about my commute into work. The audio file says:

It's Monday morning and the sun is shining.
I'm getting ready to walk to the train and commute into work.
I'll catch the 7:58 train from Cedar Park station.
It will take me an hour to get into town.

The result type in this example is IMMUTABLE_PARTIAL, meaning that partial results are delivered after a slight delay, to ensure that the recognized words do not change with the rest of the received speech. See Result type for the other choices.

In this example and the next, Krypton performs the transcription using only the data pack. For these simple sentences, the recognition is nearly perfect.

Result type FINAL

Result type FINAL returns only the final version of each sentence

$ ./my-python-app.py weather16.wav
stream weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to traffic further to the east
stream complete
200 Success

This example transcribes the audio file weather16.wav, which talks about winter weather in Montreal. The file says:

There is more snow coming to the Montreal area in the next few days.
We're expecting ten centimeters overnight and the winds are blowing hard.
Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east.

The result type in this case is FINAL, meaning only the final transcription version is returned.

Step 7: Add DLMs and wordsets

Declare DLM and wordset (and see Wordset example for another way of declaring the wordset)

# Declare a DLM 
names_places_dlm = RecognitionResource(external_reference = 
  ResourceReference(type='DOMAIN_LM',
  uri='urn:nuance:mix/eng-USA/names-places/mix.asr'),
  weight_value=700)

# Declare an inline wordset in that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan goth lin","lhan goth luhn"]},{"literal":"Auchenblae"}]}')

# Add 16kHz sample rate for 16kHz files + FINAL result type + resources 
init = RecognizeInitMessage(
    parameters = RecognitionParameters(language='eng-USA', 
      audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
      result_type='FINAL', 
      utterance_detection_mode='MULTIPLE'),  
    resources = [ names_places_dlm, places_wordset ]
)

Once you have experimented with basic transcription, you can add resources such as domain language models and wordsets to improve recognition of specific terms and language in your environment. For example, you might add a DLM containing names and places in your business.

In this application, the resources are declared with RecognitionResource then added with RecognizeInitMessage - resources.

Before and after DLM and wordset

Before: Without a DLM or wordset, unusual place names are not recognized

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington tickets in Cambridgeshire England.
final : I'm speaking to you from the town of cooking out in Northamptonshire.
final : We visited the village of steeple Morton on our way to highland common in Yorkshire.
final : We spent a week in the town of land Gosling in Wales. 
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

After: Recognition is perfect with a DLM and wordset

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington Piggots in Cambridgeshire England.
final : I'm speaking to you from the town of Cogenhoe in Northamptonshire.
final : We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
final : We spent a week in the town of Llangollen in Wales.
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

The audio file in this example, abington8_16.wav, is a 16kHz recording containing a variety of place names, some common and some unusual. The audio file says:

I'm going on a trip to Abington Piggots in Cambridgeshire, England.
I'm speaking to you from the town of Cogenhoe in Northamptonshire.
We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
We spent a week in the town of Llangollen in Wales.
Have you ever thought of moving to La Jolla in California.

Without a DLM or wordset, the unusual place names are not recognized correctly.

But when all the place names are defined, either in the DLM or in a wordset, there is perfect recognition.

Wordset example

Wordset file places-wordset.json

{
   "PLACES":[ 
      { 
         "literal":"La Jolla",
         "spoken":[ "la hoya", "la jolla" ]
      },
      { 
         "literal":"Llanfairpwllgwyngyll",
         "spoken":[ "lan vire pool guin gill" ]
      },
      { 
         "literal":"Abington Pigotts"
      },
      { 
         "literal":"Steeple Morden"
      },
      { 
         "literal":"Hoyland Common"
      },
      { 
         "literal":"Cogenhoe",
         "spoken":[ "cook no" ]
      },
      { 
         "literal":"Fordoun",
         "spoken":[ "forden" ]
      },
      { 
         "literal":"Llangollen",
         "spoken":[ "lan goth lin", "lhan goth luhn" ]
      },
      { 
         "literal":"Auchenblae"
      }]
}

The DLM referenced here (names-places) includes an entity named PLACES, which contains a few place names. See Domain LMs for more information about DLMs.

This wordset extends the PLACES entity with additional place names. Notice that a spoken form is provided only for terms that do not follow the standard pronunciation rules for the language. See Wordsets for information on creating wordsets.

For inline wordsets, you may specify the JSON within the application, compressed (without spaces) and enclosed in single quotation marks:

# Define an inline wordset for that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan goth lin","lhan goth luhn"]},{"literal":"Auchenblae"}]}')

Alternatively, you could store the wordset in a local file (see places-wordset.json) and read the file at runtime. For example:

# Define a wordset in a file for that DLM 
places_wordset_content = None
with open('places-wordset.json', 'r') as f:
    places_wordset_content = f.read()
places_wordset = RecognitionResource(inline_wordset=places_wordset_content)
 
def client_stream(wf):
    try:
        init = RecognizeInitMessage(
            parameters = RecognitionParameters(language='eng-USA', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
              result_type='IMMUTABLE_PARTIAL', 
              utterance_detection_mode='MULTIPLE'),
            resources = [ names_places_dlm, places_wordset ]
        )

Sample Python app

This basic Python app, my-python-app.py, requests and returns transcription of an audio file

#!/usr/bin/env python3

import sys, wave, grpc, traceback
from concurrent.futures import CancelledError
from time import sleep
from nuance_asr_resource_pb2 import *
from nuance_asr_result_pb2 import *
from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *
from h2.exceptions import StreamClosedError

# Define a DLM resource
names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM',
      uri='urn:nuance:mix/eng-USA/names-places/mix.asr',reuse='HIGH_REUSE'), 
    weight_value=700)

# Define an inline wordset for an entity in that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}')

# Generator provides the sequence of client -> server messages
def client_stream(wf):
    try:
        # Start the recognition
        init = RecognizeInitMessage(
            parameters = RecognitionParameters(language='eng-USA', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
              result_type='FINAL', 
              utterance_detection_mode='MULTIPLE',
              recognition_flags = RecognitionFlags(auto_punctuate=True)),
            resources = [ names_places_dlm, places_wordset ],
            client_data = {'company':'Aardvark','user':'Leslie'} 
        )
        yield RecognizeRequest(recognize_init_message=init)

        # Simulate a typical realtime audio stream
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognizeRequest(audio=packet)
            sleep(packet_duration)
        print('stream complete')
    except CancelledError as e:
        print(f'client stream: RPC canceled')
    except Exception as e:
        print(f'client stream: {type(e)}')
        traceback.print_exc()

hostaddr = access_token = audio_file = None
try:
    hostaddr = sys.argv[1]
    token_file = sys.argv[2]
    audio_file = sys.argv[3]
    with open(token_file, 'r') as tf:
        access_token = tf.read().strip()
except Exception as e:
    print(f'usage: {sys.argv[0]} <hostaddr> <token_file> <audio_file.wav>')
    exit(1)

with wave.open(audio_file, 'r') as wf:
    assert wf.getsampwidth() == 2, f'{audio_file} is not linear PCM'
    assert wf.getframerate() in [8000, 16000], f'{audio_file} sample rate must be 8000 or 16000'
    setattr(wf, 'name', audio_file)
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)     
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)
        stream_in = stub.Recognize(client_stream(wf))
        try:
            # Iterate through the returned server -> client messages
            for message in stream_in:
                if message.HasField('status'):
                    if message.status.details:
                         print(f'{message.status.code} {message.status.message} - {message.status.details}')
                    else:
                         print(f'{message.status.code} {message.status.message}')
                elif message.HasField('result'):
                    restype = 'partial' if message.result.result_type else 'final'
                    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
        except StreamClosedError:
            pass
        except Exception as e:
            print(f'server stream: {type(e)}')
            traceback.print_exc()

This is the little Python 3.6 app used in the examples, complete with DLM and wordset references. It performs these tasks:

Running the app with a token

To run this sample application, first generate a token (see Step 1: Set up token), then run the application, passing it the location of the Nuance-hosted Krypton server, your token file, and an audio file to transcribe. For example:

$ cat gen-token.sh
 
#!/bin/bash
CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"
curl -s -u "$CLIENT_ID:$SECRET" "https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])' \
> my-token.txt   
 
$ ./gen-token.sh
$ ./my-python-app.py asraas.beta.mix.nuance.com:443 my-token.txt audio-file.wav

The gen-token.sh script used in this example generates a token that authenticates the application to the Krypton server. It takes your client credentials from the Mix dashboard and stores the resulting token in my-token.txt.

You may instead incorporate the token-generation code within the application, reading the credentials from a configuration file.
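
For example, here is a minimal sketch of generating the token from within a Python application using the third-party requests library (an assumption, not part of the generated stubs); the client ID and secret are the same placeholders as in gen-token.sh:

import requests
from urllib.parse import quote

# Client ID and secret from the Mix dashboard (same placeholders as gen-token.sh)
client_id = quote("appID:NMDPTRIAL_your_name_nuance_com_20190919T190532565840", safe='')
secret = "5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"

# Request an access token from the Mix authorization server
resp = requests.post("https://auth.crt.nuance.com/oauth2/token",
                     auth=(client_id, secret),
                     data={'grant_type': 'client_credentials', 'scope': 'asr'})
resp.raise_for_status()
access_token = resp.json()['access_token']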

No DLM required?

To request transcription without the DLM and wordset, simply comment out the resources line.

        init = RecognizeInitMessage(
            parameters = RecognitionParameters(language='eng-USA', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
              result_type='FINAL', 
              utterance_detection_mode='MULTIPLE',
              recognition_flags = RecognitionFlags(auto_punctuate=True)),
#           resources = [ names_places_dlm, places_wordset ],
            client_data = {'company':'Aardvark','user':'Leslie'}
        )

Sample Go app

This basic Go app, nuance_asr_sample_client.go, requests and returns transcription of an audio file

package main

import (
    "os"
    "fmt"
    "strings"
    "context"
    "io"
    "io/ioutil"
    "log"
    "time"
    "crypto/tls"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    "google.golang.org/grpc/metadata"
    pb "coretech.nuance.com/asr/sample-go-client/v1beta1"

    "github.com/youpy/go-wav"
    "github.com/akamensky/argparse"
)

// recognize initiates the recognition request, streams the given WAV audio file to the server, and prints responses
func recognize(client pb.RecognizerClient, wavFormat *wav.WavFormat, wavReader *wav.Reader, token *string) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    // https://github.com/grpc/grpc-go/blob/master/Documentation/grpc-metadata.md
    ctx = metadata.AppendToOutgoingContext(ctx, "authorization", "Bearer " + *token)

    stream, err := client.Recognize(ctx)
    if err != nil {
        log.Fatalf("%v.RouteChat(_) = _, %v", client, err)
    }
    waitc := make(chan struct{})
    go func() {
        for {
            recognizeResponse, err := stream.Recv()
            if err == io.EOF {
                // read done.
                close(waitc)
                return
            }
            if err != nil {
                log.Fatalf("Failed to receive a response : %v", err)
            }
            log.Printf("response: %v\n", recognizeResponse);
        }
    }()

    parameters := &pb.RecognitionParameters{
        Language: "eng-USA",
        Topic: "GEN",
        AudioFormat: &pb.AudioFormat{
            AudioFormatUnion: &pb.AudioFormat_Pcm{
                Pcm: &pb.PCM{
                    SampleRateHz: wavFormat.SampleRate,
                },
            },
        },
    }

    // Declare a DLM 
    names_places_dlm := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_ExternalReference{
            ExternalReference: &pb.ResourceReference{
                Type: pb.EnumResourceType_DOMAIN_LM,
                Uri: "urn:nuance:mix/eng-USA/names-places/mix.asr",
                Reuse: pb.EnumResourceReuse_HIGH_REUSE,
            },
        },
        WeightUnion: &pb.RecognitionResource_WeightValue{
            WeightValue: 700,
        },
    }

    // Declare an inline wordset in that DLM 
    places_wordset := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_InlineWordset{
            InlineWordset: `{"PLACES":[{"literal":"Fordoun","spoken":["forden"]}, {"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]}, {"literal":"Auchenblae"}]}`,
        },
    }

    request := &pb.RecognizeRequest{
        RequestUnion: &pb.RecognizeRequest_RecognizeInitMessage{
            RecognizeInitMessage: &pb.RecognizeInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }

    if err := stream.Send(request); err != nil {
        log.Fatalf("Failed to start recognition: %v", err)
    }

    // simulate audio stream
    const packetDuration = 0.020
    packetSize := uint32(float64(wavFormat.ByteRate) * packetDuration)
    packet := make([]byte, packetSize)
    for {
        packetBytesRead, err := wavReader.Read(packet)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        audio := packet[0:packetBytesRead]
        audioRequest := &pb.RecognizeRequest_Audio{ Audio: audio }
        request := &pb.RecognizeRequest{ RequestUnion: audioRequest }
        if err := stream.Send(request); err != nil {
            if err != io.EOF {
                log.Fatalf("Failed to send audio: %v", err)
            }
            break
        }
        time.Sleep(time.Duration(packetDuration * 1000) * time.Millisecond)
    }
    stream.CloseSend()
    <-waitc
}

func main() {

    // collect arguments
    parser := argparse.NewParser("nuance_asr_sample_client", "Use Nuance MIX ASR to recognize WAV audio")
    server := parser.String("s", "server", &argparse.Options{
        Default: "asraas.beta.mix.nuance.com:443",
        Help: "server host:port",
    })
    wavFile := parser.File("w", "wavFile", os.O_RDONLY, 0, &argparse.Options{
        Required: true,
        Help: "audio file to recognize",
    })
    tokenFile := parser.File("t", "tokenFile", os.O_RDONLY, 0, &argparse.Options{
        Required: true,
        Help: "text file containing base64 access token",
    })
    err := parser.Parse(os.Args)
    if err != nil {
        fmt.Print(parser.Usage(err))
        os.Exit(1)
    }
    defer wavFile.Close()
    defer tokenFile.Close()
    tokenBytes, err := ioutil.ReadAll(tokenFile)
    if err != nil {
        log.Fatalf("Failed reading token file: %v", err)
    }
    token := strings.TrimSpace(string(tokenBytes))

    // validate the audio file is compatible
    wavReader := wav.NewReader(wavFile)
    wavFormat, _ := wavReader.Format()
    if wavFormat.AudioFormat != 1 || wavFormat.BitsPerSample != 16 {
        log.Fatalf("Unsupported audio format: %v", wavFormat);
    }
    sampleRate := wavFormat.SampleRate
    if sampleRate != 8000 && sampleRate != 16000 {
        log.Fatalf("Unsupported sample rate: %v", sampleRate);
        os.Exit(1)
    }

    creds := credentials.NewTLS(&tls.Config{})
    conn, err := grpc.Dial(*server, grpc.WithTransportCredentials(creds))
    if err != nil {
        log.Fatalf("fail to dial: %v", err)
    }
    defer conn.Close()
    client := pb.NewRecognizerClient(conn)
    recognize(client, wavFormat, wavReader, &token)
}

This Go application performs these tasks:

For example, this runs the app using the default server URL, with a token file my-token.txt and an audio file, my_audio_file.wav.

$ go run nuance_asr_sample_client.go -t my-token.txt -w my_audio_file.wav

Reference

Krypton provides three protocol buffer (.proto) files to define Nuance's ASR service for gRPC. These files contain the building blocks of your transcription applications.

Once you have transformed the proto files into functions and classes in your programming language using gRPC tools, you can call these functions from your application to request transcription, to set recognition parameters, to load “helper” resources such as domain language models and wordsets, and to send the resulting transcription where required.

See Client application for a scenario using Python that sets different recognition parameters, requests transcription, and loads domain LMs and wordsets. A sample Python app is included, along with a Go app. For other languages, consult the gRPC and Protocol Buffer documentation:

For detailed information about the contents of the proto files, see:

Overview

The proto files define an RPC service with a Recognize method that streams RecognizeRequest and RecognizeResponse messages. Details about each component are referenced by name within the proto file.

Proto files

Recognizer service

service Recognizer {
  rpc Recognize (stream RecognizeRequest) returns (stream RecognizeResponse);
}

Status response message

{
  status: {
    code: 100
    message: 'Continue'
    details: 'recognition started on audio/l16;rate=8000 stream'
  }
  cookies: {  ... }
}

A single Recognizer service provides a single Recognize method supporting bi-directional streaming of requests and responses.

The client first provides a recognition request message with parameters indicating at minimum what language to use. Optionally, it can also include resources to customize the data packs used for recognition, and arbitrary client data to be injected into call recording for reference in offline tuning workflows.

In response to the recognition request message, Krypton returns a status message confirming the outcome of the request. Usually the message is Continue: recognition started on audio/l16;rate=8000 stream.

Status messages include HTTP-aligned status codes. A failure to begin recognizing is reflected in a 4xx or 5xx status as appropriate. (Cookies returned from resource fetches, if any, are returned in the first response only.)

When a 100 status is received the client may proceed to send one or more messages bearing binary audio samples in the format indicated in the recognize message (default: signed PCM/8000 Hz). The server responds with zero or more result messages reflecting the outcome of recognizing the incoming audio, until a terminating condition is reached, at which point the server sends a final status message indicating normal completion (200/204) or any errors encountered (4xx/5xx). Termination conditions include:

If the client cancels the RPC, no further messages are received from the server. If the server encounters an error, it attempts to send a final error status and then cancels the RPC.

Status codes

Code Message Indicates
100 Continue Recognition parameters and resources were accepted and successfully configured. Client can proceed to send audio data. Also returned in response to a start_timers message.
200 Success Recognition completed and returned at least one result.
204 No result Recognition completed without producing a result.
400 Bad request A malformed or unsupported client request was rejected.
404 No speech No utterance was detected in the audio stream for a number of samples corresponding to no_input_timeout_ms.
408 Audio timeout Excessive stall in sending audio data.
409 Conflict The recognizer is currently in use by another client.
410 Not recognizing A start_timers message was received but no in-progress recognition exists.
413 Too much speech Recognition of in-utterance samples reached a duration corresponding to recognition_timeout_ms.
500 Internal server error A serious error occurred that prevented the request from completing normally.
502 Resource error One or more resources failed to load.
503 Service unavailable Unused; reserved for gateways.

Result type

Final results

final : It's Monday morning and the sun is shining

Partial results

partial : It's
partial : It's me
partial : It's month
partial : It's Monday
partial : It's Monday no
partial : It's Monday more
partial : It's Monday March
partial : It's Monday morning
partial : It's Monday morning and
partial : It's Monday morning and the
partial : It's Monday morning and this
partial : It's Monday morning and the sun
partial : It's Monday morning and the center
partial : It's Monday morning and the sun is
partial : It's Monday morning and the sonny's
partial : It's Monday morning and the sunshine
final : It's Monday morning and the sun is shining

Immutable partial results

partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining

Krypton offers three different types of results for the transcription of each utterance in the audio stream. The desired type is set in the request with RecognitionParameters - EnumResultType. In the response, the actual type is indicated in Result - EnumResultType.

Some data packs perform additional processing after the initial transcription. The transcription may change slightly during this second pass, even for immutable partial results. For example, Krypton originally recognized "the 7:58 train" as "the 750 A-Train" but adjusted it during a second pass, returning the correct transcription in the final version of the sentence.

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station

Formatted text

Formatted vs. minimally formatted text

Formatted text: December 9, 2005
Minimally formatted text: December nine two thousand and five

Formatted text: $500
Minimally formatted text: Five hundred dollars

Formatted text: I'm okay James, how about yourself?
Minimally formatted text: I'm okay James, how about yourself?

Krypton returns transcriptions in two formats: formatted text and minimally formatted text. See Result - Hypothesis.

Formatted text includes initial capitals for recognized names and places, numbers expressed as digits, currency symbols, and common abbreviations. In minimally formatted text, words are spelled out but basic capitalization and punctuation are included.

In many cases, both formats are identical.

For more precise control over the formatted text results, you may specify a formatting scheme as a recognition parameter (RecognitionParameters - Formatting - scheme). The scheme is a keyword for a formatting type supported by the data pack and determines how results are presented in the formatted text result. For example, this is how the utterance “seven eleven” is formatted for these types:

Values depend on the data pack. For example, some eng-USA data packs include: phone, date, num_as_digits, time, default, address, all_as_words.
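
For example, a minimal sketch of requesting a specific formatting scheme in the recognition parameters (assuming the data pack supports the date scheme listed above):

init = RecognizeInitMessage(
    parameters = RecognitionParameters(
        language = 'eng-USA',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        formatting = Formatting(scheme='date')))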

Resources

Krypton works with one or more factory data packs, available in several languages and locales. The data pack includes a base acoustic model and a base language model.

The base acoustic model is trained to give good performance in many acoustic environments. The base language model is developed to remain current with popular vocabulary and language use. As such, Krypton paired with a data pack is ready for use out-of-the-box for many applications.

You may extend the data pack at runtime using several types of specialization objects: domain language models (DLMs), wordsets, builtins, and speaker profiles.

Each recognition turn can leverage a weighted mix of domain LMs, wordsets, builtins, and speaker profiles.

Domain LMs

Domain LM

names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance:mix/eng-USA/names-places/mix.asr'),
    weight_value=700)
. . . 
init = RecognizeInitMessage(
    parameters = RecognitionParameters(language='eng-USA',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    resources = [ names_places_dlm ]

Each data pack supplied with Krypton provides a base language model that lets the transcription engine recognize the most common terms and constructs in the language and locale. You may complement this language model with one or more domain-specific models, called domain language models (domain LMs or DLMs).

In Krypton, a DLM is a recognition resource declared with RecognitionResource and activated in RecognizeInitMessage - resources.

For Nuance-hosted Krypton, domain LMs are created in Nuance Mix (see Mix.nlu workflow) and accessed via a URN available from Mix. See the code sample at the right for an example of a URN. Domain LMs may contain entities (sometimes called concepts), or collections of specific terms. These entities may in turn be extended with wordsets.

Wordsets

Inline wordset, places_wordset, extends the PLACES entity

names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance:mix/eng-USA/names-places/mix.asr'),
    weight_value=700)

places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"Fordoun","spoken":["forden"]},{"literal":"Auchenblae"}]}')

init = RecognizeInitMessage(
    parameters = RecognitionParameters(language='eng-USA',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    resources = [ names_places_dlm, places_wordset ]

A wordset is a collection of words and short phrases that extends Krypton's recognition vocabulary by providing additional values for entities in a DLM. For example, a wordset might provide the names in a user’s contact list or local place names. Like DLMs, wordsets are declared with RecognitionResource and activated in RecognizeInitMessage.

Using wordsets in Krypton requires three items:

The wordset is defined in JSON format as one or more arrays. Each array is named after an entity defined within a DLM to which words can be added at runtime. Entities are templates that tell Krypton how and where words are used in a conversation.

For example, you might have an entity, NAMES, containing personal names, or PLACES, with place names used by the application. The wordset adds to the existing terms in the entity, but applies only to the current recognition session. The terms in the wordset are not added permanently to the entity. All entities must be defined in domain LMs, which are loaded and activated along with the wordset.

This wordset adds terms to the PLACES entity

{
  "PLACES" : [
    {"literal" : "La Jolla", 
     "spoken" : ["la hoya", "la jolla"] 
    },
    {"literal" : "Beaulieu", 
     "spoken": ["bewly"] 
    },
    {"literal" : "Worcester", 
     "spoken": ["wuster"] 
    },
    {"literal" : "Abington Pigotts"
    },
    {"literal" : "Steeple Morden"
    }
  ]
}

The wordset includes additional values for one or more entities. The syntax is:

{
   "entity" : [
      { "literal": "written form",
      "spoken": ["spoken form 1", "spoken form n"]
      },
      { "literal": "written form",
      "spoken":"spoken form 1", "spoken form n"] },
   ...
   ],
   "entity" : [ ... ]
}

Syntax
entity String An entity defined in a domain LM, containing a set of values. The name is case-sensitive. Consult the DLM for entity names.
literal String The written form of the value to be returned in the formatted_text field.
spoken Array (Optional) One or more spoken forms of the value. When not supplied, Krypton guesses the pronunciation of the word from the literal form. Include a spoken form only if the literal is difficult to pronounce or has an unusual pronunciation in the language.

Wordsets may be provided in any of these ways:

In all cases, the wordset's underlying DLM must also be declared and activated.

Builtins

Data pack builtins

cal_builtin = RecognitionResource(builtin='CALENDARX')
distance_builtin = RecognitionResource(builtin='DISTANCE')
. . . 
init = RecognizeInitMessage(
    parameters = RecognitionParameters(language='eng-USA',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    resources = [ names_places_dlm, cal_builtin, distance_builtin ]

The data pack may include one or more builtins, which are predefined recognition objects focused on common tasks (numbers, dates, and so on) or general information in a vertical domain such as financial services or healthcare. The available builtins depend on the data pack. For many eng-USA data packs, for example, the builtins are:

ALPHANUM                 AMOUNT              BOOLEAN 
CALENDARX                CARDINAL_NUMBER     DATE 
DIGITS                   DISTANCE            DOUBLE 
DURATION                 DURATION_RANGE      GENERIC_ORDER 
GLOBAL                   NUMBERS             ORDINAL_NUMBER
QUANTITY_REL             TEMPERATURE         TIME 
VERT_FINANCIAL_SERVICES 
VERT_HEALTHCARE 
VERT_TELECOMMUNICATIONS 
VERT_TRAVEL

To use a built-in in Krypton, define it with RecognitionResource - builtin and activate it in RecognizeInitMessage - resources.

Defaults

The proto files provide the following default values for messages in the RecognizeRequest sent to Krypton. Mandatory fields are noted as such.

Items in RecognizeRequest Default
recognize_init_message (RecognizeInitMessage)  
    parameters (RecognitionParameters)  
      language Mandatory, e.g. 'eng-USA'
    topic Default 'GEN'
    audio_format (AudioFormat) Mandatory, e.g. 'PCM'
    utterance_detection_mode (EnumUtteranceDetectionMode) SINGLE (0): transcribe one utterance only
    result_type (EnumResultType) FINAL (0): return only final version of each utterance
    recognition_flags (RecognitionFlags)  
        auto_punctuate False: Do not punctuate results
      filter_profanity False: Leave profanity as is
      include_tokenization False: Do not include tokenized result
      stall_timers False: Start no-input timers
      discard_speaker_adaptation False: Keep speaker profile data
      suppress_call_recording False: Send data to CLAG
      mask_load_failures False: Loading errors terminate recognition
    no_input_timeout_ms 0*, usually no timeout
    recognition_timeout_ms 0*, usually no timeout
    utterance_end_silence_ms 0*, usually 500ms or half second
    speech_detection_sensitivity 500
    max_hypotheses 0*, usually 10 hypotheses
    speech_domain Depends on data pack
    formatting (Formatting)  
        scheme Depends on data pack
      options Blank
  resources (RecognitionResource)  
      external_reference (ResourceReference)  
        type (EnumResourceType) Mandatory with resources - external_reference
      uri Mandatory with resources - external_reference
      max_age 0 delta seconds
      max_stale 0 delta seconds
      min_fresh 0 delta seconds
      cookies Blank
      request_timeout_ms 0*, usually 10000ms or 10 seconds
      reuse (EnumResourceReuse) LOW_REUSE: only one recognition
    inline_wordset Blank
    builtin Blank
    weight_enum (EnumWeight) 0, meaning MEDIUM
    weight_value 0
  client_data Blank
control_message (ControlMessage) Blank
audio Mandatory

* Items marked with an asterisk (*) default to 0, meaning a server default: the default is set in the configuration file used by the Krypton engine instance. The values shown here are the values set in the sample configuration files (default.yaml and development.yaml) provided with the Krypton engine. In the case of max_hypotheses, the default (10 hypotheses) is set internally within Krypton.

Sequence flow

Applications based on the Krypton gRPC protocol follow a similar structure and runtime flow. These basic steps apply to any programming language.
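
A minimal sketch of that flow in Python, condensed from the Sample Python app (no DLM, wordset, or error handling):

#!/usr/bin/env python3
# Condensed from the Sample Python app: send one init message, then audio, then read results

import sys, wave, grpc
from time import sleep
from nuance_asr_resource_pb2 import *
from nuance_asr_result_pb2 import *
from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *

hostaddr, token_file, audio_file = sys.argv[1], sys.argv[2], sys.argv[3]
with open(token_file, 'r') as tf:
    access_token = tf.read().strip()

def client_stream(wf):
    # 1. Send the recognition parameters
    yield RecognizeRequest(recognize_init_message=RecognizeInitMessage(
        parameters=RecognitionParameters(
            language='eng-USA',
            audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())))))
    # 2. Send the audio in 20 ms packets
    packet_samples = int(wf.getframerate() * 0.020)
    for packet in iter(lambda: wf.readframes(packet_samples), b''):
        yield RecognizeRequest(audio=packet)
        sleep(0.020)

with wave.open(audio_file, 'r') as wf:
    # 3. Open a secure channel, attach the token, and call Recognize
    call_creds = grpc.access_token_call_credentials(access_token)
    channel_creds = grpc.composite_channel_credentials(grpc.ssl_channel_credentials(), call_creds)
    with grpc.secure_channel(hostaddr, credentials=channel_creds) as channel:
        stub = RecognizerStub(channel)
        # 4. Read status and result messages until the server closes the stream
        for message in stub.Recognize(client_stream(wf)):
            if message.HasField('status'):
                print(message.status.code, message.status.message)
            elif message.HasField('result'):
                print(message.result.hypotheses[0].formatted_text)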


nuance_asr.proto

Structure of nuance_asr proto files

Recognizer     Recognize
        RecognizeRequest
        RecognizeResponse

RecognizeRequest
    RecognizeInitMessage
        RecognitionParameters
            language and other recognition parameter fields
            AudioFormat
            EnumResultType
            RecognitionFlags
            Formatting
        RecognitionResource
            recognition resource fields
            ResourceReference
                Resource reference fields
                EnumResourceType
                EnumResourceReuse
            EnumWeight
        client_data
    ControlMessage
        StartTimersControlMessage
    audio

RecognizeResponse
    Status
    StartOfSpeech
    Result
        result fields
        EnumResultType
        UtteranceInfo
             utterance fields
            Dsp
        Hypothesis
             hypothesis fields
            Word
    cookies

The nuance_asr.proto file is one of three proto files that together contain the functions and classes to request recognition from a Nuance-hosted Krypton server. Once you have transformed these files into functions and classes in your own programming language, you call these functions from your speech recognition application. See gRPC setup and Client application.

Recognizer

Streaming recognition service API.

Name Request Type Response Type Description
Recognize RecognizeRequest stream RecognizeResponse stream Starts a recognition request and returns a response.

RecognizeRequest

Input stream messages that request recognition, sent one at a time in a specific order. The first mandatory field sends recognition parameters and resources, the final field sends audio to be recognized. Included in Recognizer - Recognize service.

Field Type Description
recognize_init_message RecognizeInitMessage Required first message in the RPC input stream, sends parameters and resources for recognition.
control_message ControlMessage Optional second message in the RPC input stream, for timer control.
audio bytes Audio samples in the selected encoding for recognition.

RecognizeInitMessage

Input message that initiates a new recognition turn. Included in RecognizeRequest.

Field Type Description
parameters RecognitionParameters Language, audio format, and other recognition parameters.
resources RecognitionResource Repeated. Optional resources (DLMs, wordsets, builtins) to improve recognition.
client_data string,string Repeated. Optional client-supplied key,value pairs to inject into the call log.

RecognitionParameters

Input message that defines parameters for the recognition process. Included in RecognizeInitMessage. The language and audio_format parameters are mandatory. All others are optional. See Defaults for a list of default values.

Field Type Description
language string Mandatory. Language and country (locale) code as xxx-XXX, e.g. 'eng-USA' for American English.
topic string Specialized language model in data pack. Default is 'GEN' (generic).
audio_format AudioFormat Mandatory. Audio codec type and sample rate.
utterance_detection_mode EnumUtteranceDetectionMode How end of utterance is determined. Default SINGLE.
result_type EnumResultType The level of transcription results. Default FINAL.
recognition_flags RecognitionFlags Boolean recognition parameters.
no_input_timeout_ms uint32 Maximum silence, in ms, allowed while waiting for user input after recognition timers are started. Default (0) means server default, usually no timeout.
recognition_timeout_ms uint32 Maximum duration, in ms, of recognition turn. Default (0) means server default, usually no timeout.
utterance_end_silence_ms uint32 Minimum silence, in ms, that determines the end of an utterance. Default (0) means server default, usually 500ms or half a second.
speech_detection_sensitivity uint32 A balance between detecting speech and noise (breathing, etc.), 0 to 1000. 0 means ignore all noise, 1000 means interpret all noise as speech. Default is 500.
max_hypotheses uint32 Maximum number of n-best hypotheses to return. Default (0) means server default, usually 10 hypotheses.
speech_domain string Mapping to internal weight sets for language models in the data pack. Values depend on the data pack.
formatting Formatting Formatting keyword.

AudioFormat

Mandatory input message containing the audio format of the audio to transcribe. Included in RecognitionParameters.

Field Type Description
pcm PCM Signed 16-bit little endian PCM, 8kHz or 16kHz.
alaw ALaw G.711 A-law, 8kHz.
ulaw Ulaw G.711 µ-law, 8kHz.
opus OggOpus Ogg Opus, 8kHz or 16kHz.

PCM

Input message defining PCM sample rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Audio sample rate: 0, 8000, 16000. Default 0, meaning 8000.

ALaw

Input message defining A-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Ulaw

Input message defining µ-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

OggOpus

Input message defining Opus output rate, per RFC7845 decoder standard. Included in AudioFormat. The decoder output rate must match one of the available acoustic models in the data pack.

Field Type Description
output_rate_hz uint32 Decoder output rate: 0, 8000, 16000. Default 0, meaning 8000.

EnumUtteranceDetectionMode

Input field specifying how utterances should be detected and transcribed within the audio stream. Included in RecognitionParameters. The default is SINGLE. When the detection mode is DISABLED, the recognition ends only when the client stops sending audio.

Name Number Description
SINGLE 0 Return recognition results for one utterance only, ignoring any trailing audio. Default.
MULTIPLE 1 Return results for all utterances detected in the audio stream.
DISABLED 2 Return recognition results for all audio provided by the client, without separating it into utterances.

RecognitionFlags

Input message containing boolean recognition parameters. Included in RecognitionParameters. The default is false in all cases.

Field Type Description
auto_punctuate bool Whether to enable auto punctuation, if available for the language.
filter_profanity bool Whether to mask known profanities as *** in transcription, if available for the language.
include_tokenization bool Whether to include tokenized recognition result.
stall_timers bool Whether to disable recognition timers. By default, timers start when recognition begins.
discard_speaker_adaptation bool If speaker profiles are used, whether to discard updated speaker data. By default, data is stored.
suppress_call_recording bool Whether to disable recording to Call Log Aggregator (CLAG). By default, call logs, metadata, and audio are collected by CLAG. Call recording may also be disabled at the server level, in which case this parameter has no effect.
mask_load_failures bool When true, errors loading external resources are not reflected in the Status message and do not terminate recognition. They are still reflected in logs.

Formatting

Input message specifying how the transcription results are presented, using a keyword for a formatting type supported by the data pack. Included in RecognitionParameters. See Formatted text.

Field Type Description
scheme string Keyword for a formatting type defined in the data pack. Default is data pack dependent.
options string,bool Repeated. Optional key,value pairs.

ControlMessage

Input message that starts the recognition no-input timer. Included in RecognizeRequest. This setting is only effective if timers were disabled in the recognition request.

Field Type Description
start_timers_message StartTimersControlMessage Starts the recognition no-input timer.

StartTimersControlMessage

Input message the client sends when starting the no-input timer. Included in ControlMessage.
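
For example, a minimal sketch of the timer sequence: disable the timers in the initial request with stall_timers, then start them with a control message (where exactly you start the timers depends on your application):

# Inside the client request generator (see client_stream in the sample app)

# Disable the recognition timers when sending the initial request
init = RecognizeInitMessage(
    parameters = RecognitionParameters(
        language = 'eng-USA',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        recognition_flags = RecognitionFlags(stall_timers=True)))
yield RecognizeRequest(recognize_init_message=init)

# Later, for example once a prompt has finished playing, start the no-input timer
yield RecognizeRequest(
    control_message=ControlMessage(start_timers_message=StartTimersControlMessage()))

# Then continue streaming audio as usual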

RecognizeResponse

Output stream of messages in response to a recognize request. Included in Recognizer - Recognize service.

Field Type Description
status Status Always the first message returned, indicating whether recognition was initiated successfully.
start_of_speech StartOfSpeech When speech was detected.
result Result The partial or final recognition result. A series of partial results may precede the final result.
cookies string,string Repeated. One uri,cookies entry per ResourceReference.uri where cookies were returned, for the first response only.

Status

Output message indicating the status of the transcription. See Status codes for details about the codes. The message and details are developer-facing error messages in English. User-facing messages should be localized by the client based on the status code. Included in RecognizeResponse.

Field Type Description
code uint32 HTTP-style return code: 100, 200, 4xx, or 5xx as appropriate.
message string Brief description of the status.
details string Longer description if available.
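
For illustration, a minimal response-handling sketch. The RecognizerStub class name follows from the Recognizer service in nuance_asr.proto, and the channel and request iterator are assumed to be created elsewhere (see the client samples).

# Sketch only: channel and request_iterator are created and authorized elsewhere.
import nuance_asr_pb2_grpc

def stream_responses(channel, request_iterator):
    stub = nuance_asr_pb2_grpc.RecognizerStub(channel)
    for response in stub.Recognize(request_iterator):
        if response.HasField("status"):
            print(f"status {response.status.code}: {response.status.message}")
        elif response.HasField("start_of_speech"):
            print(f"speech detected at {response.start_of_speech.first_audio_to_start_of_speech_ms} ms")
        elif response.HasField("result"):
            print(response.result.hypotheses[0].formatted_text)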

StartOfSpeech

Output message containing the start-of-speech message. Included in RecognizeResponse.

Field Type Description
first_audio_to_start_of_speech_ms uint32 Offset, in milliseconds, from the start of the audio stream to the start of detected speech.

nuance_asr_resource.proto

The nuance_asr_resource.proto file is one of three proto files that together contain the functions and classes to request recognition from a Nuance-hosted Krypton server. This file contains items related to recognition resources such as domain language models, wordsets, builtins, and so on.

RecognitionResource

Input message defining one or more recognition resources (domain LMs, wordsets, speaker profiles, builtins) to improve recognition. Included in RecognizeInitMessage. Domain LMs and speaker profiles must be external references but wordsets may also be provided inline.

Field Type Description
external_reference ResourceReference The resource is an external file. Mandatory for DLMs and speaker profiles.
inline_wordset string Inline wordset JSON resource. See Wordsets for the format. Default blank, meaning no inline wordset.
builtin string Name of a builtin resource in the data pack. Default blank, meaning no builtins.
weight_enum EnumWeight Keyword for weight of DLM relative to data pack. Default MEDIUM.
weight_value uint32 Weight of DLM relative to data pack as number from 0 to 1000. Default 0.
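
For illustration, a minimal Python sketch that attaches an inline wordset using the nuance_asr_resource_pb2 stub generated earlier; the wordset JSON content is a placeholder only (see Wordsets for the actual format). The resulting resource is passed to the recognition request in RecognizeInitMessage along with any other resources.

# Sketch only: the wordset content below is a placeholder; see Wordsets for the format.
import nuance_asr_resource_pb2

places_wordset = nuance_asr_resource_pb2.RecognitionResource(
    inline_wordset='{"PLACES": [{"literal": "La Jolla"}, {"literal": "Beaulieu"}]}')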

ResourceReference

Input message for fetching an external recognition resource. Included in RecognitionResource. See Resources. For Nuance-hosted Krypton, the uri for DLMs is a URN, e.g. urn:nuance:mix/eng-USA/A2_C70/mix.asr.

Field Type Description
type EnumResourceType Resource type. Default UNDEFINED_RESOURCE_TYPE.
uri string Location of the resource as a URN reference.
max_age uint32 Default 0 seconds. See https://tools.ietf.org/html/rfc7234#section-5.2.1.1
max_stale uint32 Default 0 seconds. See https://tools.ietf.org/html/rfc7234#section-5.2.1.2
min_fresh uint32 Default 0 seconds. See https://tools.ietf.org/html/rfc7234#section-5.2.1.3
cookies string Cookies to transmit to the server. Default blank.
request_timeout_ms uint32 Time to wait when downloading resources, in milliseconds. Default (0) means server default, usually 10000 ms (10 seconds).
reuse EnumResourceReuse Whether the resource will be used multiple times. Default LOW_REUSE.
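
As a sketch of an external reference, the following uses the example URN above as the DLM location and requests high reuse so the DLM stays loaded across a sequence of recognition turns; all field and enum names are taken from the tables in this section.

# Sketch only: the URN is the example shown above; substitute your own Mix DLM.
import nuance_asr_resource_pb2

dlm = nuance_asr_resource_pb2.RecognitionResource(
    external_reference=nuance_asr_resource_pb2.ResourceReference(
        type=nuance_asr_resource_pb2.EnumResourceType.DOMAIN_LM,
        uri="urn:nuance:mix/eng-USA/A2_C70/mix.asr",
        reuse=nuance_asr_resource_pb2.EnumResourceReuse.HIGH_REUSE),
    weight_enum=nuance_asr_resource_pb2.EnumWeight.MEDIUM)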

EnumResourceType

Input field defining the content type of a recognition resource: a domain LM, a wordset (plain text or compiled), or a speaker profile. Included in ResourceReference. See Resources.

Name Number Description
UNDEFINED_RESOURCE_TYPE 0 Resource type is not specified. Client must always specify a type.
WORDSET 1 Resource is a plain-text JSON wordset, provided inline or referenced as a file.
COMPILED_WORDSET 2 Resource is a compiled wordset zip file.
DOMAIN_LM 3 Resource is a domain LM zip file.
SPEAKER_PROFILE 4 Resource is a speaker profile defined in a Minio endpoint.

EnumResourceReuse

Input field specifying whether the domain LMs or wordset will be used for one or many recognition turns. Included in ResourceReference.

Name Number Description
UNDEFINED_REUSE 0 Not specified: currently defaults to LOW_REUSE.
LOW_REUSE 1 The resource will be used for only one recognition turn.
HIGH_REUSE 5 The resource will be used for a sequence of recognition turns.

EnumWeight

Input field setting the weight of the domain LM relative to the data pack, as a keyword. Included in RecognitionResource. Wordsets and speaker profiles do not have a weight. See also weight_value to specify a numeric value.

Name Number Description
DEFAULT_WEIGHT 0 Same effect as MEDIUM
LOWEST 1 The DLM has less effect than the data pack.
LOW 2
MEDIUM 3 The DLM has roughly an equal effect compared to the data pack.
HIGH 4
HIGHEST 5 The DLM has more effect than the data pack.

nuance_asr_result.proto

The nuance_asr_result.proto file is one of three proto files that together contain the functions and classes to request recognition from a Nuance-hosted Krypton server. This file contains items related to results returned by the Krypton server to the client.

Result

Output message containing the transcription result, including the result type, the start and end times, metadata about the transcription, and one or more transcription hypotheses. Included in RecognizeResponse.

Field Type Description
result_type EnumResultType Whether final, partial, or immutable results are returned.
abs_start_ms uint32 Start time in the audio stream, in milliseconds.
abs_end_ms uint32 End time in the audio stream, in milliseconds.
utterance_info UtteranceInfo Information about each sentence.
hypotheses Hypothesis Repeated. One or more transcription variations.

EnumResultType

Input and output field specifying how transcription results for each utterance are returned. See Result type for examples. In a request RecognitionParameters, it specifies the desired result type. In a response Result, it indicates the actual result type that was returned.

Name Number Description
FINAL 0 Only the final transcription result of each utterance is returned.
PARTIAL 1 Variable partial results are returned, followed by a final result.
IMMUTABLE_PARTIAL 2 Stabilized partial results are returned, followed by a final result.
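
For illustration, a minimal sketch that requests stabilized partial results and then acts only on final ones. The RecognitionParameters field name result_type is assumed here; the parameter itself is documented with RecognitionParameters earlier in this reference.

# Sketch only: the RecognitionParameters field name result_type is an assumption.
import nuance_asr_pb2
import nuance_asr_result_pb2

params = nuance_asr_pb2.RecognitionParameters(
    result_type=nuance_asr_result_pb2.EnumResultType.IMMUTABLE_PARTIAL)

# In the response loop, act only once a result is final:
# if response.result.result_type == nuance_asr_result_pb2.EnumResultType.FINAL: ...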

UtteranceInfo

Output message containing information about the recognized sentence in the transcription result. Included in Result.

Field Type Description
duration_ms uint32 Utterance duration in milliseconds.
clipping_duration_ms uint32 Milliseconds of clipping detected.
dropped_speech_packets uint32 Number of speech audio buffers discarded during processing.
dropped_nonspeech_packets uint32 Number of non-speech audio buffers discarded during processing.
dsp Dsp Digital signal processing results.

Dsp

Output message containing digital signal processing results. Included in UtteranceInfo.

Field Type Description
snr_estimate float The estimated speech-to-noise ratio.
level float Estimated speech signal level.
stereo bool Always false, meaning mono audio.
initial_silence float Milliseconds of silence observed before start of utterance.
initial_energy float Energy feature value of first speech frame.
final_energy float Energy feature value of last speech frame.
mean_energy float Average energy feature value of utterance.

Hypothesis

Output message containing one or more proposed transcriptions of the audio stream. Included in Result. Each variation has its own confidence level along with the text in two levels of formatting. See Formatted text.

Field Type Description
confidence uint32 The confidence score for the entire transcription, 0-999.
average_confidence uint32 The confidence score for the hypothesis, 0-999: the average of the word confidence scores, weighted by each word's duration.
rejected bool Whether the hypothesis was rejected.
formatted_text string Formatted text of the result, e.g. $500.
minimally_formatted_text string Slightly formatted text of the result, e.g. Five hundred dollars.
words Word Repeated. One or more recognized words in the result.
encrypted_tokenization string Nuance-internal representation of the recognition result.

Word

Output message containing one or more recognized words in the hypothesis, including the text, confidence score, and timing information. Included in Hypothesis.

Field Type Description
text string The recognized word.
confidence uint32 The confidence score of the recognized word, 0-999.
start_ms uint32 Word start offset in the audio stream.
end_ms uint32 Word end offset in the audio stream.
silence_after_word_ms uint32 The amount of silence, in ms, detected after the word.
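
For illustration, a small helper that walks a Result using the fields documented above, printing each accepted hypothesis and its word timings.

# Sketch only: assumes "result" is a Result message taken from a RecognizeResponse.
def print_result(result):
    for hyp in result.hypotheses:
        if hyp.rejected:
            continue
        print(f"[{hyp.average_confidence}] {hyp.formatted_text}")
        for word in hyp.words:
            print(f"  {word.text}: {word.start_ms}-{word.end_ms} ms, confidence {word.confidence}")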

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean boolean
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str

Change log

2019-11-15

Below are changes made to the ASRaaS gRPC API documentation since the initial Beta release: