
ASR as a Service gRPC API

The Nuance ASR (Automatic Speech Recognition) service is powered by Nuance's Krypton transcription engine, which performs real-time large vocabulary continuous speech recognition. It converts an audio stream of human speech into text. Krypton can work independently or as part of a complete speech or virtual assistant system.

Krypton works with a data pack in a specific language and locale, and can optionally work with recognition aides such as domain language models and wordsets to improve recognition and transcription in specific environments or businesses.

The gRPC protocol provided by Krypton allows a client application to request transcription services in all the programming languages supported by gRPC.

gRPC is an open source RPC (remote procedure call) framework used to create services. It uses HTTP/2 for transport and protocol buffers to define the structure of messages and services. Krypton supports Protocol Buffers version 3, also known as proto3.

Version: v1beta2 (deprecated)

You may continue to use v1beta2 in your Krypton gRPC applications without any changes, following this documentation. When Krypton receives a request from your client application, it identifies the protocol version transparently.

See v1 documentation to use the latest protocol version, including instructions on upgrading to v1.

You may use only one protocol version per application. Do not combine v1beta1, v1beta2, or v1 syntax in the same application.

Prerequisites from Mix

Before developing your gRPC application, you need a Nuance Mix project. This project provides credentials to run your application against the Nuance-hosted Krypton ASR engine. It also lets you create one or more domain language models (DLMs) to improve recognition in your application.

  1. Create a Mix project and model: see Mix.nlu workflow to:

    • Create a Mix project.

    • Create, train, and build a model in the project. The model must include an intent, optionally entities, and a few annotated sentences.

      Since your model is for recognition only (not understanding), you can use any intent name, for example DUMMY, and add entities and sentences to that intent. Your entities (for example NAMES and PLACES) should contain words that are specific to your application environment. In your application, you can add more words to these categories using wordsets.

    • Create and deploy an application configuration for the project.

  2. Learn how to reference the DLM in your application. You may only reference DLMs created in your Mix project.

  3. Generate a client ID and "secret" for your Mix project. Later you will use these credentials to request an authorization token to run your application.

  4. Learn the URL to call the Krypton ASR service.

    • The standard URL is asraas.beta.mix.nuance.com:443
    • The URL for smart routing is asraas2.beta.mix.nuance.com:443

gRPC setup

Download proto files

nuance_asr.proto
nuance_asr_resource.proto
nuance_asr_result.proto

Install gRPC for programming language

$ pip install --upgrade pip
$ pip install grpcio
$ pip install grpcio-tools
$ go get google.golang.org/grpc
$ go get github.com/akamensky/argparse
$ go get github.com/youpy/go-wav

Generate client stubs from proto files

$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ nuance_asr.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ nuance_asr_resource.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ nuance_asr_result.proto

$ ls -1 nuance*.py
nuance_asr_pb2_grpc.py  
nuance_asr_pb2.py  
nuance_asr_resource_pb2.py  
nuance_asr_result_pb2.py
$ protoc -I ./v1beta2/ ./v1beta2/nuance_asr.proto --go_out=plugins=grpc:./v1beta2
$ protoc -I ./v1beta2/ ./v1beta2/nuance_asr_resource.proto --go_out=./v1beta2
$ protoc -I ./v1beta2/ ./v1beta2/nuance_asr_result.proto --go_out=./v1beta2

$ ls -1 v1beta2/nuance*.go 
nuance_asr_resource.pb.go
nuance_asr.pb.go
nuance_asr_result.pb.go

The basic steps in using the Krypton gRPC protocol are:

  1. Download the three gRPC proto files here. These files contain a generic version of the functions or classes for requesting transcription from a Krypton engine.

    • nuance_asr.proto
    • nuance_asr_resource.proto
    • nuance_asr_result.proto

  2. Install gRPC for your programming language, including C++, Java, Python, Go, Ruby, C#, Node.js, and others. See gRPC Documentation for a complete list and instructions on using gRPC with each one.

  3. Generate client stub files in your programming language from the proto files using gRPC protoc. Depending on your programming language, the stubs may consist of one file or multiple files per proto file.

    These stub files contain the methods and fields from the proto files as implemented in your programming language. You will consult the stubs in conjunction with the proto files; a short sketch after this list shows one way to inspect them.

  4. Write your client app, referencing the functions or classes in the client stub files. See Client app development for details and a scenario, including domain language models (DLMs) and wordsets.

  5. Run your client app to request transcription, optionally passing DLMs and wordsets to improve recognition. See Sample Python app and Sample Go app.
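To see the exact field and message names generated in your stubs (step 3 above), a minimal Python sketch such as the following can help. It assumes RecognitionParameters is generated into nuance_asr_pb2; adjust the import if your stubs place it elsewhere.

# List the fields of a generated message class to confirm parameter names
from nuance_asr_pb2 import RecognitionParameters

for name in RecognitionParameters.DESCRIPTOR.fields_by_name:
    print(name)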

Client app development

The gRPC protocol for Krypton lets you create a voice recognition application that includes all the functionality for recognizing and transcribing speech. In general, you need to generate an authentication token, connect to the service, set recognition parameters, send the audio for transcription, and process the results. Optionally, you can add DLMs and wordsets to improve recognition.

This section describes how to implement the basic functionality of Krypton in the context of a Python and Go application. For the complete applications, see Sample Python app and Sample Go app.

These tasks are illustrated in the following high-level sequence flow diagram.

Step 1: Generate token

This simple example uses a Linux script to request the token and save it in a file

#!/bin/bash

# Change the colon (:) in your CLIENT_ID to code %3A
CLIENT_ID="appID%3ANMDPTRIAL_your_name_nuance_com_20190919T190532565840"
SECRET="5JEAu0YSAjV97oV3BWy2PRofy6V8FGmywiUbc0UfkGE"
curl -s -u "$CLIENT_ID:$SECRET" "https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])' \
> my-token.txt

The Go application sets the client ID and secret in a config file, config.json

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

The Java application sets the client ID and secret in a config file, config.json

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

Nuance Mix uses the OAuth 2.0 protocol for authentication. To call the ASR runtime service, your client application must request and then provide an access token. The token expires after 3600 seconds (an hour) so must be regenerated frequently.

Your client application uses the client ID and secret from the Mix Dashboard to generate an authentication token from the URL auth.crt.nuance.com/oauth2/token. See Prerequisites from Mix.

You can generate the token in one of several ways: for example, with a standalone script that saves the token to a file (as the Python sample does), or within the application itself using credentials read from a config file (as the Go and Java samples do).

Step 2: Authenticate and connect

Run the token-generating script, then pass the token file to the Python app, along with the URL of the ASR service

$ ./gen-token.sh
$ ./my-python-app.py asraas.beta.mix.nuance.com:443 my-token.txt audio-file.wav

The Python app uses the token as it creates the secure connection to the ASR service

try:
    hostaddr = sys.argv[1]
    token_file = sys.argv[2]
    audio_file = sys.argv[3]
    with open(token_file, 'r') as tf:
        access_token = tf.read().strip()
    . . . 
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)     
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:

The Go app collects the service URL (server) and authentication credentials (configFile) in asr_client.go

func main() {
    // Collect arguments
    parser := argparse.NewParser("nuance_asr_sample_client", "Use Nuance ASR to recognize WAV audio")
    server := parser.String("s", "server", &argparse.Options{
        Default: "asraas.beta.mix.nuance.com:443",
        Help:    "server host:port",
    })
    . . . 
    // Import the user's Mix Beta credentials
    config, err := NewConfig(*configFile)
    if err != nil {
        log.Fatalf("Error importing user credentials: %v", err)
        os.Exit(1)
    }
    // Authenticate the user's credentials
    auth := NewAuthenticator(*config)
    token, err := auth.Authenticate()
    if err != nil {
        log.Fatalf("Error authenticating to Mix: %v", err)
        os.Exit(1)
    }

Then calls authenticate.go to generate and validate the token using the values from config.json

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
    "net/url"
    "os"
    "strings"
    "time"
)

const (
    TokenCache  = "token.cache"
    TokenMaxAge = 59 // minutes
    GrantType   = "client_credentials"
    Scope       = "asr"
)

type Token struct {
    AccessToken string `json:"access_token"`
    ExpiresIn   int    `json:"expires_in"`
    Scope       string `json:"scope"`
    TokenType   string `json:"token_type"`
}

func (t *Token) String(pretty bool) string {
    var str []byte
    var err error

    if pretty {
        str, err = json.MarshalIndent(t, "", "  ")
    } else {
        str, err = json.Marshal(t)
    }

    if err != nil {
        log.Printf("Error marshalling token to json: %s", err)
    }

    return string(str)
}

type Authenticator struct {
    config Config
    token  *Token
}

func (a *Authenticator) generateToken() (*Token, error) {
    a.token = nil

    body := strings.NewReader(fmt.Sprintf("grant_type=%s&scope=%s", GrantType, Scope))
    req, err := http.NewRequest("POST", a.config.TokenURL, body)
    if err != nil {
        return nil, err
    }

    req.SetBasicAuth(url.QueryEscape(a.config.ClientID), url.QueryEscape(a.config.ClientSecret))
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return nil, errors.New(resp.Status)
    }

    bodyBytes, _ := ioutil.ReadAll(resp.Body)
    t := &Token{}
    err = json.Unmarshal(bodyBytes, t)
    if err != nil {
        return nil, err
    }

    a.token = t
    return a.token, nil
}

func (a *Authenticator) isTokenValid() bool {

    // Is token cached?
    info, err := os.Stat(TokenCache)
    if err != nil {
        return false
    }

    // Can token be read from file?
    source, err := ioutil.ReadFile(TokenCache)
    if err != nil {
        return false
    }

    // Are contents of token valid?
    t := &Token{}
    err = json.Unmarshal(source, t)
    if err != nil || len(t.AccessToken) == 0 {
        return false
    }

    // Has token expired?
    lapsed := time.Since(info.ModTime())
    if lapsed > (TokenMaxAge * time.Minute) {
        return false
    }

    // All tests passed
    a.token = t
    return true
}

func (a *Authenticator) cacheToken() {
    outputJSON, err := json.MarshalIndent(a.token, "", "  ")
    if err != nil {
        log.Printf("Failed to cache token: %v", err)
        return
    }

    err = ioutil.WriteFile(TokenCache, outputJSON, 0644)
    if err != nil {
        log.Printf("Failed to cache token: %v", err)
    }

    return
}

func (a *Authenticator) Authenticate() (*Token, error) {
    if a.isTokenValid() {
        return a.token, nil
    }

    if _, err := a.generateToken(); err != nil {
        return nil, err
    }

    a.cacheToken()
    return a.token, nil
}

func NewAuthenticator(config Config) *Authenticator {
    a := &Authenticator{
        config: config,
    }
    return a
}

The Java app collects the service URL (server) and authentication credentials (configFile) in AsrClient.java

public class AsrClient {

    public class Defaults {
        static final String SERVER = "asraas.beta.mix.nuance.com:443";
        static final String CONFIG_FILE = "config.json";
        static final String PARAMS_FILE = "params.json";
        static final String AUDIO_SOURCE = "microphone";
    }

Then calls Authenticator.java to generate and validate the token using the values from config.json

public class Authenticator {

    static final String GRANT_TYPE = "client_credentials";
    static final String SCOPE = "asr";
    static final String TOKEN_CACHE = "token.cache";
    static final long TOKEN_MAX_AGE = 3540000; //in ms == 59 minutes;

    Configuration config;
    Token token;

    public Authenticator(Configuration config) {
        this.config = config;
    }

    private Token generateToken() throws Exception {
        token = null;

        String auth = URLEncoder.encode(config.getClientID(), "UTF-8") + ":" + config.getClientSecret();
        String authentication = Base64.getEncoder().encodeToString(auth.getBytes());

        String content = String.format("grant_type=%s&scope=%s", GRANT_TYPE, SCOPE);

        URL url = new URL(config.getTokenURL());

        HttpsURLConnection connection = (HttpsURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);

        connection.setRequestProperty("Authorization", "Basic " + authentication);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        connection.setRequestProperty("Accept", "application/json");

        PrintStream os = new PrintStream(connection.getOutputStream());
        os.print(content);
        os.close();

        Gson gson = new Gson();

        BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));

        // Parse the configuration parameters...
        token = gson.fromJson(reader, Token.class);
        return token;
    }

    private boolean isTokenValid() {
        File f = new File(TOKEN_CACHE);
        if(!f.exists() || f.isDirectory() || !f.canRead()) { 
            return false;
        }

        Gson gson = new Gson();
        try {
            BufferedReader reader = new BufferedReader(new FileReader(TOKEN_CACHE));
            Token t = gson.fromJson(reader, Token.class);
            if (t.accessToken == null || t.accessToken.isEmpty()) {
                return false;
            }

            if ((System.currentTimeMillis() - f.lastModified()) > TOKEN_MAX_AGE) {
                return false;
            }

            token = t;                
        } catch (Exception e) {
            return false;
        }
        return true;
    }

    private void cacheToken() {
        // Create a new Gson object
        Gson gson = new Gson();

        try {
            String jsonString = gson.toJson(token);
            FileWriter fileWriter = new FileWriter(TOKEN_CACHE);
            fileWriter.write(jsonString);
            fileWriter.close();                
        } catch (Exception e) {
            // Ignore...
        }
    }

    public Token Authenticate() throws Exception {
        if (isTokenValid()) {
            return token;
        }

        if (generateToken() != null) {
            cacheToken();
        }

        return token;
    }

    @Override
    public String toString() {
        return super.toString();
    }
}

You create a secure gRPC channel and authenticate your application to the ASR service by providing the URL of the hosted ASR service and an access token.

In both these examples, you pass the URL of the ASR service to the application as an argument.

See also Smart routing for an alternative URL.

There are several ways to generate and use the token that authenticates your application to the ASR service. The code samples show two methods: the Python sample reads a token generated beforehand by a script, while the Go and Java samples generate and cache the token from within the application.

Step 3: Import functions

Import functions from stubs

from nuance_asr_resource_pb2 import *
from nuance_asr_result_pb2 import *
from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *
import (
    . . . 
    pb "./v1beta2"
)
import io.grpc.*;
import io.grpc.stub.StreamObserver;
import xaas.sample.asr.java.client.Params.Parameters;
import xaas.sample.asr.java.client.Audio.IAudioSource;
import xaas.sample.asr.java.client.Audio.Microphone;
import xaas.sample.asr.java.client.Audio.AudioFile;

import com.nuance.grpc.asr.v1beta1.*;

In your client application, import all functions from the Krypton client stubs that you generated in gRPC setup.

Do not edit these stub files.

Step 4: Set recognition parameters

Set recognition parameters in the Python app

async def stream_out(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(
                language = 'en-US',
                audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
                result_type = 'IMMUTABLE_PARTIAL',
                utterance_detection_mode = 'MULTIPLE',
                recognition_flags = RecognitionFlags(auto_punctuate=True)),
            resources = [ names_places_dlm, places_wordset ])

Set recognition parameters in params.json

{
    "language": "en-US",
    "topic": "GEN",
    "utterance_detection_mode": 1,
    "result_type": 2,
    "recognition_flags": {
        "auto_punctuate": true,
        "filter_profanity": false,
        "include_tokenization": false,
        "small_timers": false,
        "discard_speaker_adaptation": false,
        "suppress_call_recording": false,
        "mask_load_failures": true
    },
    "no_input_timeout_ms": 1000,
    "recognition_timeout_ms": 0,
    "utterance_end_silence_ms": 900,
    "max_hypotheses": 5
}

The application reads params.json and constructs parameters

paramsFile := parser.String("p", "paramsFile", &argparse.Options{
        Default: "params.json",
        Help:    "file containing recognition parameters to customize ASR behavior",
    })

pcmFormat := &pb.AudioFormat_Pcm{Pcm: &pb.PCM{SampleRateHz: sampleRate}}
    audioFormat := &pb.AudioFormat{AudioFormatUnion: pcmFormat}
    parameters.AudioFormat = audioFormat
    jp, err := json.MarshalIndent(parameters, "", "  ")
    log.Printf("Reco Params: %s", jp)

    init := &pb.RecognitionInitMessage{Parameters: parameters}
    initRequest := &pb.RecognitionRequest_RecognitionInitMessage{RecognitionInitMessage: init}
    request := &pb.RecognitionRequest{RequestUnion: initRequest}

Set recognition parameters in params.json

{
    "language": "en-US",
    "topic": "GEN",
    "utterance_detection_mode": 1,
    "result_type": 2,
    "recognition_flags": {
        "auto_punctuate": true,
        "filter_profanity": false,
        "include_tokenization": false,
        "small_timers": false,
        "discard_speaker_adaptation": false,
        "suppress_call_recording": false,
        "mask_load_failures": true
    },
    "no_input_timeout_ms": 1000,
    "recognition_timeout_ms": 0,
    "utterance_end_silence_ms": 900,
    "max_hypotheses": 5
}

The app reads params.json and constructs parameters

private RecognitionParameters initializeRecognitionRequest(String paramsFile) {
    Params params = new Params(paramsFile);
    Parameters parameters = params.getParameters();
    xaas.sample.asr.java.client.Params.RecognitionFlags flags = parameters.getRecognitionFlags();
    log(String.format("Recognition Parameters: %s", params.toString()));

    AudioFormat audioFormat = AudioFormat.newBuilder()
        .setPcm(PCM.newBuilder().setSampleRateHz(16000).build())
        .build();

    return RecognitionParameters.newBuilder()
        .setAudioFormat(audioFormat)
        .setLanguage(parameters.getLanguage())
        .setTopic(parameters.getTopic())
        .setUtteranceDetectionMode(EnumUtteranceDetectionMode.forNumber (parameters.getUtteranceDetectionMode()))
        .setResultType(EnumResultType.forNumber(parameters.getResultType()))
...

The application sets a RecognitionInitMessage containing RecognitionParameters, the parameters that define the type of recognition you want. Consult your generated stubs for the precise parameter names (see Field names in proto and stub files). Some parameters are language, audio_format, result_type, utterance_detection_mode, and recognition_flags.

For details about all recognition parameters, see RecognitionParameters.

RecognitionInitMessage may also include resources such as domain language models and wordsets, which customize recognition for a specific environment or business. See Step 8: Add DLMs and wordsets.

Step 5: Request transcription

Request transcription and simulate audio stream

yield RecognitionRequest(recognition_init_message=init)

# simulate a typical realtime audio stream
print(f'stream {wf.name}')
packet_duration = 0.020
packet_samples = int(wf.getframerate() * packet_duration)
for packet in iter(lambda: wf.readframes(packet_samples), b''):
    yield RecognitionRequest(audio=packet)
    await asyncio.sleep(packet_duration)
print('stream complete')
pcmFormat := &pb.AudioFormat_Pcm{Pcm: &pb.PCM{SampleRateHz: sampleRate}}
    audioFormat := &pb.AudioFormat{AudioFormatUnion: pcmFormat}
    parameters.AudioFormat = audioFormat
    jp, err := json.MarshalIndent(parameters, "", "  ")
    log.Printf("Reco Params: %s", jp)

    init := &pb.RecognitionInitMessage{Parameters: parameters}
    initRequest := &pb.RecognitionRequest_RecognitionInitMessage{RecognitionInitMessage: init}
    request := &pb.RecognitionRequest{RequestUnion: initRequest}
. . . 
packet := make([]byte, packetSize)
    for {
        packetBytesRead, err := wavReader.Read(packet)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        audio := packet[0:packetBytesRead]
        audioRequest := &pb.RecognitionRequest_Audio{Audio: audio}
        request := &pb.RecognitionRequest{RequestUnion: audioRequest}
        if err := stream.Send(request); err != nil {
            log.Fatalf("Failed to send audio: %v", err)
        }
        time.Sleep(time.Duration(packetSize/640*20) * time.Millisecond) 
    }
private void recognize(RecognitionParameters params, String audioFile) {
    RecognitionInitMessage init = RecognitionInitMessage.newBuilder()
        .setParameters(params)
        .build();
    RecognitionRequest request = RecognitionRequest.newBuilder()
        .setRecognitionInitMessage(init)
        .build();

After setting recognition parameters, the app defines a function that calls the RecognitionRequest stream, including your recognition parameters.

In the Python example, this is achieved with a two-part yield structure that first sends recognition parameters then sends the audio for transcription in chunks.

yield RecognitionRequest(recognition_init_message=init)
. . . 
    yield RecognitionRequest(audio=chunk)

Normally your app will send streaming audio to Krypton for processing but, for simplicity, these applications simulate streaming audio by breaking up an audio file into chunks and feeding it to Krypton a bit at a time.
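If you want to send real streaming audio instead, you could replace the file-reading loop with a microphone capture loop. The following is a minimal sketch only, assuming the third-party PyAudio package and 16kHz linear PCM input; it is not part of the sample apps.

import pyaudio

# Hypothetical generator: yields the init message, then 20 ms microphone packets
def mic_stream(init, packet_duration=0.020, rate=16000):
    yield RecognitionRequest(recognition_init_message=init)
    pa = pyaudio.PyAudio()
    frames = int(rate * packet_duration)
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=rate,
                     input=True, frames_per_buffer=frames)
    try:
        while True:  # stop with Ctrl+C or add your own end-of-speech condition
            packet = stream.read(frames)
            yield RecognitionRequest(audio=packet)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()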

Step 6: Call client stub

Call main client stub

     with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)
func main() {
    . . . 
    client := pb.NewRecognizerClient(conn)
    ctx, cancel := CreateChannelContext(&token.AccessToken)
    defer cancel()

    parameters := initializeRecognitionParameters(*paramsFile)
    recognize(ctx, client, parameters, sampleRate, packetSize, wavReader, &token.AccessToken)
}
    ManagedChannel chan = createChannel(server);
    RecognizerGrpc.RecognizerStub conn = createConnection(chan, String.format("%s %s", t.getTokenType(), t.getAccessToken()));

    AsrClient client = new AsrClient(conn);
    RecognitionParameters params = client.initializeRecognitionRequest(paramsFile);
    client.recognize(params, audioSource);
    shutdown(chan);

The app must provide the location of the Krypton instance, the authentication token, and the source of the audio. See Step 2: Authenticate and connect.

Using this information, the app calls a client stub function or class defined in the generated client files: in Python it is named RecognizerStub, in Go it is RecognizerClient, and in Java it is RecognizerStub.

Step 7: Process results

Receive results

try:
    async for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                 print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                 print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
            switch recognitionResponse.ResponseUnion.(type) {
            case *pb.RecognitionResponse_Status:
                status := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_Status).Status
                log.Printf("Status [%d, %s]: %s", status.Code, status.Message, status.Details)
            case *pb.RecognitionResponse_StartOfSpeech:
                sos := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_StartOfSpeech).StartOfSpeech
                log.Printf("Start of Speech detected: %dms", sos.FirstAudioToStartOfSpeechMs)
            case *pb.RecognitionResponse_Result:
                result := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_Result).Result
                for idx, h := range result.Hypotheses {
                    log.Printf("hypothesis #%d (%s): %s", idx, result.ResultType.String(), h.FormattedText)
                }
            }
    public void onNext(RecognitionResponse response) {
        switch (response.getResponseUnionCase()) {
            case RESULT:
                Result result = response.getResult();
                System.out.println(String.format("Transcription [%s]: [conf: %d] %s", 
                result.getResultType(), 
                result.getHypotheses(0).getAverageConfidence(),
                result.getHypotheses(0).getFormattedText()));
                if (result.getResultTypeValue() == EnumResultType.FINAL_VALUE) {
                    done.countDown();
                }
                break;
            case START_OF_SPEECH:
                StartOfSpeech sos = response.getStartOfSpeech();
                System.out.println(String.format("Start of Speech detected: %dms", sos.getFirstAudioToStartOfSpeechMs()));
                break;
            case STATUS:
                com.nuance.grpc.asr.v1beta1.Status status = response.getStatus();
                System.out.println(String.format("Recognition Status: %d %s", status.getCode(), status.getMessage()));
                break;
            default:
                break;
        }
    }

Finally, the app processes the results received from the Krypton engine. These apps print the resulting transcription on screen as it is streamed from Krypton, sentence by sentence, with intermediate partial sentence results when the app has requested PARTIAL or IMMUTABLE_PARTIAL results.

The results may be long or short depending on the length of your audio, the result type, and the Result fields requested.
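For example, if you set max_hypotheses to return several hypotheses, you might print each one with its confidence. This is a minimal Python sketch only, assuming the average_confidence and formatted_text hypothesis fields shown in the Go and Java samples.

# Inside the loop over server -> client messages: print every hypothesis
for message in stream_in:
    if message.HasField('result'):
        for n, hyp in enumerate(message.result.hypotheses):
            print(f'hypothesis {n} '
                  f'(conf {hyp.average_confidence:.2f}): {hyp.formatted_text}')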

Result type IMMUTABLE_PARTIAL

Results from audio file with result type IMMUTABLE_PARTIAL

$ ./my-python-app.py monday_morning_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream monday_morning_16.wav
partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining.
partial : I'm getting ready
partial : I'm getting ready to
partial : I'm getting ready to walk
partial : I'm getting ready to walk to the
partial : I'm getting ready to walk to the train commute
final : I'm getting ready to walk to the train commute into work.
partial : I'll catch
partial : I'll catch the
partial : I'll catch the 750
partial : I'll catch the 758 train from
final : I'll catch the 758 train from Cedar Park station.
partial : It will take
partial : It will take me an hour
partial : It will take me an hour to get
final : It will take me an hour to get into town.
stream complete
200 Success

This example shows the transcription results from my audio file, monday_morning_16.wav, a 16kHz wave file talking about my commute into work. The audio file says:

It's Monday morning and the sun is shining.
I'm getting ready to walk to the train and commute into work.
I'll catch the 7:58 train from Cedar Park station.
It will take me an hour to get into town.

The result type in this example is IMMUTABLE_PARTIAL, meaning that partial results are delivered after a slight delay, to ensure that the transcribed words do not change as the rest of the speech is received. See Result type for the other choices.

In this example and the next, Krypton performs the transcription using only the data pack. For these simple sentences, the recognition is nearly perfect.

Result type FINAL

Result type FINAL returns only the final version of each sentence

$ ./my-python-app.py weather16.wav
stream weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to traffic further to the east
stream complete
200 Success

This example transcribes the audio file weather16.wav, which talks about winter weather in Montreal. The file says:

There is more snow coming to the Montreal area in the next few days.
We're expecting ten centimeters overnight and the winds are blowing hard.
Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east.

The result type in this case is FINAL, meaning only the final transcription version is returned.

Step 8: Add DLMs and wordsets

Declare DLM and wordset (see Wordset example to read the wordset from a file)

# Declare a DLM defined in your Mix project
names_places_dlm = RecognitionResource(external_reference = 
  ResourceReference(type='DOMAIN_LM',
  uri='urn:nuance:mix/eng-USA/<context-tag>/mix.asr'),
  weight_value=700)

# Declare an inline wordset in that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan goth lin","lhan goth luhn"]},{"literal":"Auchenblae"}]}')

# Set 16kHz sample rate for 16kHz files and include the resources
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US', 
      audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
      result_type='FINAL', 
      utterance_detection_mode='MULTIPLE'),  
    resources = [ names_places_dlm, places_wordset ]
)
// Declare a DLM in your own Mix project
    names_places_dlm := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_ExternalReference{
            ExternalReference: &pb.ResourceReference{
                Type: pb.EnumResourceType_DOMAIN_LM,
                Uri: "urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr",
            },
        },
        WeightUnion: &pb.RecognitionResource_WeightValue{
            WeightValue: 700,
        },
    }

    // Declare an inline wordset in that DLM 
    places_wordset := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_InlineWordset{
            InlineWordset: `{"PLACES":[{"literal":"Fordoun","spoken":["forden"]}, {"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]}, {"literal":"Auchenblae"}]}`,
        },
    }
. . .
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }

Once you have experimented with basic transcription, you can add resources such as domain language models and wordsets to improve recognition of specific terms and language in your environment. For example, you might add a DLM containing names and places in your business.

Resources are declared in RecognitionInitMessage - resources.
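Rather than embedding the wordset JSON in your source as shown above, you could read it from a file. This is a minimal sketch, assuming a hypothetical file named places-wordset.json containing the same JSON as the inline declaration.

import json

# Read the wordset JSON from a file and pass it as an inline wordset
with open('places-wordset.json', 'r') as f:
    places_wordset_json = f.read()

# Optional sanity check that the file holds well-formed JSON
json.loads(places_wordset_json)

places_wordset = RecognitionResource(inline_wordset=places_wordset_json)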

Before and after DLM and wordset

Before: Without a DLM or wordset, unusual place names are not recognized

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington tickets in Cambridgeshire England.
final : I'm speaking to you from the town of cooking out in Northamptonshire.
final : We visited the village of steeple Morton on our way to highland common in Yorkshire.
final : We spent a week in the town of land Gosling in Wales. 
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

After: Recognition is perfect with a DLM and wordset

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington Piggots in Cambridgeshire England.
final : I'm speaking to you from the town of Cogenhoe in Northamptonshire.
final : We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
final : We spent a week in the town of Llangollen in Wales.
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

The audio file in this example, abington8_16.wav, is a 16kHz recording containing a variety of place names, some common and some unusual. The audio file says:

I'm going on a trip to Abington Piggots in Cambridgeshire, England.
I'm speaking to you from the town of Cogenhoe in Northamptonshire.
We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
We spent a week in the town of Llangollen in Wales.
Have you ever thought of moving to La Jolla in California.

Without a DLM or wordset, the unusual place names are not recognized correctly.

But when all the place names are defined, either in the DLM or in a wordset, there is perfect recognition.

Sample applications

This section contains sample client applications for Krypton ASR.

Sample Python app

This basic Python app, my-python-app.py, requests and returns transcription of an audio file

#!/usr/bin/env python3

import sys, wave, grpc, traceback
from time import sleep
from nuance_asr_resource_pb2 import *
from nuance_asr_result_pb2 import *
from nuance_asr_pb2 import *
from nuance_asr_pb2_grpc import *

# Declare a DLM that exists in your Mix project
names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM',
      uri='urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr'),
    reuse='HIGH_REUSE', 
    weight_value=700)

# Declare an inline wordset for an entity in that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}')

# Generator provides the sequence of client -> server messages
def client_stream(wf):
    try:
        # Start the recognition
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(language='en-US', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
              result_type='FINAL', 
              utterance_detection_mode='MULTIPLE',
              recognition_flags = RecognitionFlags(auto_punctuate=True)),
            resources = [ names_places_dlm, places_wordset ],
            client_data = {'company':'Aardvark','user':'Leslie'} 
        )
        yield RecognitionRequest(recognition_init_message=init)

        # Simulate a typical realtime audio stream
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)
            sleep(packet_duration)
        print('stream complete')
    except CancelledError as e:
        print(f'client stream: RPC canceled')
    except Exception as e:
        print(f'client stream: {type(e)}')
        traceback.print_exc()

hostaddr = access_token = audio_file = None
try:
    hostaddr = sys.argv[1]
    token_file = sys.argv[2]
    audio_file = sys.argv[3]
    with open(token_file, 'r') as tf:
        access_token = tf.read().strip()
except Exception as e:
    print(f'usage: {sys.argv[0]} <hostaddr> <token_file> <audio_file.wav>')
    exit(1)

with wave.open(audio_file, 'r') as wf:
    assert wf.getsampwidth() == 2, f'{audio_file} is not linear PCM'
    assert wf.getframerate() in [8000, 16000], f'{audio_file} sample rate must be 8000 or 16000'
    setattr(wf, 'name', audio_file)
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)     
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)
        stream_in = stub.Recognize(client_stream(wf))
        try:
            # Iterate through the returned server -> client messages
            for message in stream_in:
                if message.HasField('status'):
                    if message.status.details:
                         print(f'{message.status.code} {message.status.message} - {message.status.details}')
                    else:
                         print(f'{message.status.code} {message.status.message}')
                elif message.HasField('result'):
                    restype = 'partial' if message.result.result_type else 'final'
                    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
        except StreamClosedError:
            pass
        except Exception as e:
            print(f'server stream: {type(e)}')
            traceback.print_exc()

This is the small Python 3.6 app used in the examples, complete with DLM and wordset references. It performs these tasks: it declares a DLM and an inline wordset, authenticates to the ASR service using the token you provide, streams the audio file to Krypton in small packets, and prints the status messages and transcription results returned by the server.

Running the Python app

To run this sample Python application, first generate a token (see Step 1: Generate token), then run the application, passing it the location of the Nuance-hosted Krypton server, your token file, and an audio file to transcribe. For example:

$ ./gen-token.sh 
$ ./my-python-app.py asraas.beta.mix.nuance.com:443 my-token.txt audio-file.wav

The gen-token.sh script used in this example generates a token that authenticates the application to the Krypton server. It takes your client credentials from the Mix dashboard and stores the resulting token in my-token.txt.

You may instead incorporate the token-generation code within the application, reading the credentials from a configuration file. See the Go application for that scenario.
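As a minimal Python sketch of that approach (assuming the third-party requests package and a config.json like the one shown for the Go app), the token request could look like this:

import json
import requests
from urllib.parse import quote

# Read client_id, client_secret, and token_url from config.json
with open('config.json', 'r') as f:
    config = json.load(f)

# As in the Go and Java samples, URL-encode the client ID (it contains a colon)
client_id = quote(config['client_id'], safe='')

# Request an access token using the OAuth 2.0 client_credentials grant
resp = requests.post(
    config['token_url'],
    auth=(client_id, config['client_secret']),
    data={'grant_type': 'client_credentials', 'scope': 'asr'})
resp.raise_for_status()
access_token = resp.json()['access_token']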

No DLM required?

To request transcription without the DLM and wordset, simply comment out the resources line.

        init = RecognitionInitMessage(
            parameters = RecognitionParameters(language='en-US', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
              result_type='FINAL', 
              utterance_detection_mode='MULTIPLE',
              recognition_flags = RecognitionFlags(auto_punctuate=True)),
#           resources = [ names_places_dlm, places_wordset ],
            client_data = {'company':'Aardvark','user':'Leslie'}

Sample Go app

The Go sample app consists of two config files and three application source files

The config.json file contains your authentication values

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

The params.json file contains your recognition parameters

{
    "language": "en-US",
    "topic": "GEN",
    "utterance_detection_mode": 1,
    "result_type": 2,
    "recognition_flags": {
        "auto_punctuate": true,
        "filter_profanity": false,
        "include_tokenization": false,
        "small_timers": false,
        "discard_speaker_adaptation": false,
        "suppress_call_recording": false,
        "mask_load_failures": true
    },
    "no_input_timeout_ms": 1000,
    "recognition_timeout_ms": 0,
    "utterance_end_silence_ms": 900,
    "max_hypotheses": 5
}

The src/asr_client.go file is the main application file

/*
 * Copyright 2019 Nuance Communications Inc.
 */
package main
import (
    "context"
    "crypto/tls"
    "encoding/json"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "os"
    "time"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials"
    "google.golang.org/grpc/metadata"
    pb "./v1beta2"
    "github.com/akamensky/argparse"
    "github.com/youpy/go-wav"
)
func CreateChannelContext(token *string) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    // https://github.com/grpc/grpc-go/blob/master/Documentation/grpc-metadata.md
    ctx = metadata.AppendToOutgoingContext(ctx, "authorization", "Bearer "+*token)
    return ctx, cancel
}
// Send recognition request and stream WAV audio file to server, then print responses
func recognize(ctx context.Context, client pb.RecognizerClient, parameters *pb.RecognitionParameters, sampleRate uint32, packetSize uint32, wavReader *wav.Reader, token *string) {
    stream, err := client.Recognize(ctx)
    if err != nil {
        log.Fatalf("%v.RouteChat(_) = _, %v", client, err)
    }
    waitc := make(chan struct{})
    go func() {
        for {
            recognitionResponse, err := stream.Recv()
            if err == io.EOF {
                // read done.
                close(waitc)
                return
            }
            if err != nil {
                log.Fatalf("Failed to receive a response : %v", err)
            }
            //out, err := json.MarshalIndent(recognitionResponse, "", "  ")
            //log.Printf("response: %s\n", out)
            switch recognitionResponse.ResponseUnion.(type) {
            case *pb.RecognitionResponse_Status:
                status := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_Status).Status
                log.Printf("Status [%d, %s]: %s", status.Code, status.Message, status.Details)
            case *pb.RecognitionResponse_StartOfSpeech:
                sos := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_StartOfSpeech).StartOfSpeech
                log.Printf("Start of Speech detected: %dms", sos.FirstAudioToStartOfSpeechMs)
            case *pb.RecognitionResponse_Result:
                result := recognitionResponse.ResponseUnion.(*pb.RecognitionResponse_Result).Result
                for idx, h := range result.Hypotheses {
                    log.Printf("hypothesis #%d (%s): %s", idx, result.ResultType.String(), h.FormattedText)
                }
            }
        }
    }()
    pcmFormat := &pb.AudioFormat_Pcm{Pcm: &pb.PCM{SampleRateHz: sampleRate}}
    audioFormat := &pb.AudioFormat{AudioFormatUnion: pcmFormat}
    parameters.AudioFormat = audioFormat
    jp, err := json.MarshalIndent(parameters, "", "  ")
    log.Printf("Reco Params: %s", jp)
    init := &pb.RecognitionInitMessage{Parameters: parameters}
    initRequest := &pb.RecognitionRequest_RecognitionInitMessage{RecognitionInitMessage: init}
    request := &pb.RecognitionRequest{RequestUnion: initRequest}
    if err := stream.Send(request); err != nil {
        log.Fatalf("Failed to start recognition: %v", err)
    }
    // Simulate audio stream
    packet := make([]byte, packetSize)
    for {
        packetBytesRead, err := wavReader.Read(packet)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        audio := packet[0:packetBytesRead]
        audioRequest := &pb.RecognitionRequest_Audio{Audio: audio}
        request := &pb.RecognitionRequest{RequestUnion: audioRequest}
        if err := stream.Send(request); err != nil {
            log.Fatalf("Failed to send audio: %v", err)
        }
        time.Sleep(time.Duration(packetSize/640*20) * time.Millisecond) // Pace the audio to mimic real-time
    } 
    stream.CloseSend()
    <-waitc
}
func initializeRecognitionParameters(paramFile string) *pb.RecognitionParameters {
    parameters := &pb.RecognitionParameters{
        Language: "en-US",
        Topic:    "GEN",
    }
    source, err := ioutil.ReadFile(paramFile)
    if err != nil {
        log.Printf("Error reading params file: %s", err)
        return parameters
    }
    err = json.Unmarshal(source, parameters)
    if err != nil {
        log.Printf("Error parsing params file: %s", err)
        return parameters
    }
    return parameters
}
func main() {
    // Collect arguments
    parser := argparse.NewParser("nuance_asr_sample_client", "Use Nuance ASR to recognize WAV audio")
    server := parser.String("s", "server", &argparse.Options{
        Default: "asraas.beta.mix.nuance.com:443",
        Help:    "server host:port",
    })
    wavFile := parser.File("w", "wavFile", os.O_RDONLY, 0, &argparse.Options{
        Required: true,
        Help:     "audio file to recognize",
    })
    configFile := parser.String("c", "configFile", &argparse.Options{
        Default: "config.json",
        Help:    "config file containing client credentials (client_id and client_secret)",
    })
    paramsFile := parser.String("p", "paramsFile", &argparse.Options{
        Default: "params.json",
        Help:    "file containing recognition parameters to customize ASR behavior",
    })
    err := parser.Parse(os.Args)
    if err != nil {
        fmt.Print(parser.Usage(err))
        os.Exit(1)
    }
    defer wavFile.Close()
    // Import the user's Mix Beta credentials
    config, err := NewConfig(*configFile)
    if err != nil {
        log.Fatalf("Error importing user credentials: %v", err)
        os.Exit(1)
    }
    // Authenticate the user's credentials
    auth := NewAuthenticator(*config)
    token, err := auth.Authenticate()
    if err != nil {
        log.Fatalf("Error authenticating to Mix: %v", err)
        os.Exit(1)
    }
    // Validate the audio file is compatible
    wavReader := wav.NewReader(wavFile)
    wavFormat, _ := wavReader.Format()
    if wavFormat.AudioFormat != 1 || wavFormat.BitsPerSample != 16 {
        log.Fatalf("Unsupported audio format: %v", wavFormat)
        os.Exit(1)
    }
    sampleRate := wavFormat.SampleRate
    if sampleRate != 8000 && sampleRate != 16000 {
        log.Fatalf("Unsupported sample rate: %v", sampleRate)
        os.Exit(1)
    }
    const packetDuration = 0.020
    packetSize := uint32(float64(wavFormat.ByteRate) * packetDuration)
    // Connect to ASR service
    creds := credentials.NewTLS(&tls.Config{})
    conn, err := grpc.Dial(*server, grpc.WithTransportCredentials(creds))
    if err != nil {
        log.Fatalf("fail to dial: %v", err)
        os.Exit(1)
    }
    defer conn.Close()
    // Run a recognition request
    client := pb.NewRecognizerClient(conn)
    ctx, cancel := CreateChannelContext(&token.AccessToken)
    defer cancel()
    parameters := initializeRecognitionParameters(*paramsFile)
    recognize(ctx, client, parameters, sampleRate, packetSize, wavReader, &token.AccessToken)
}

The src/authenticate.go file authenticates the app to the ASR service. See Step 2: Authenticate and connect

The src/config.go file handles the authentication values in config.json

package main

import (
    "encoding/json"
    "io/ioutil"
    "log"
)

type Config struct {
    ClientID     string `json:"client_id"`
    ClientSecret string `json:"client_secret"`
    TokenURL     string `json:"token_url"`
}

func (c *Config) LoadConfig(configFile string) (*Config, error) {
    source, err := ioutil.ReadFile(configFile)
    if err != nil {
        return nil, err
    }
    err = json.Unmarshal(source, c)
    if err != nil {
        return nil, err
    }

    return c, nil
}

func (c *Config) String(pretty bool) string {
    var str []byte
    var err error

    if pretty {
        str, err = json.MarshalIndent(c, "", "  ")
    } else {
        str, err = json.Marshal(c)
    }

    if err != nil {
        log.Printf("Error marshalling config to json: %s", err)
    }

    return string(str)
}

func NewConfig(configFile string) (*Config, error) {
    config := &Config{}
    _, err := config.LoadConfig(configFile)

    return config, err
}

This Go application consists of these files: config.json (authentication values), params.json (recognition parameters), src/asr_client.go (the main application), src/authenticate.go (token generation and caching), and src/config.go (handling of the config.json values).

Running the Go app

For example, this runs the Go app with the help option to see the values that you may pass to the application.

$ go run ./src -h
usage:  asr_client [-h|--help] [-s|--server "<value>"]
                                -w|--wavFile <file> [-c|--configFile "<value>"]
                                [-p|--paramsFile "<value>"]
                                Use Nuance ASR to recognize audio
Arguments:
-h  --help        Print help information
-s  --server      server host:port. Default: asraas.beta.mix.nuance.com:443
-w  --wavFile     audio file to recognize
-c  --configFile  config file containing client credentials (client_id and
                    client_secret). Default: config.json
-p  --paramsFile  file containing recognition parameters to customize ASR
                    behavior. Default: params.json

And this runs the app using the default config and parameter files and an audio file, my_audio_file.wav.

$ go run ./src -w my_audio_file.wav

Sample Java app

The Java sample app consists of two config files and three application source files

The config.json file contains your authentication values

{
    "client_id": "appID:<Provide Your Mix Beta App Id>",
    "client_secret": "<Provide Your Mix Beta Client Secret>",
    "token_url": "https://auth.crt.nuance.com/oauth2/token"
}

The params.json file contains your recognition parameters

{
    "language": "en-US",
    "topic": "GEN",
    "utterance_detection_mode": 1,
    "result_type": 2,
    "recognition_flags": {
        "auto_punctuate": true,
        "filter_profanity": false,
        "include_tokenization": false,
        "small_timers": true,
        "discard_speaker_adaptation": false,
        "suppress_call_recording": false,
        "mask_load_failures": true
    },
    "no_input_timeout_ms": 1000,
    "recognition_timeout_ms": 0,
    "utterance_end_silence_ms": 900,
    "max_hypotheses": 5
}

The src/.../AsrClient.java file is the main application file

/*
 * This Java source file was generated by the Gradle 'init' task.
 */
package xaas.sample.asr.java.client;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Base64;
import java.util.Iterator;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

import javax.net.ssl.HttpsURLConnection;

/* Processing JSON and reading local files. */
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.gson.JsonArray;
import com.google.gson.JsonIOException;
import com.google.gson.JsonSyntaxException;
import com.google.gson.stream.JsonReader;
import com.google.gson.TypeAdapterFactory;
import com.google.protobuf.ByteString;
import com.googlecode.protobuf.format.JsonFormat;

import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Option;
import org.apache.commons.cli.OptionGroup;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.MissingOptionException;
import org.json.JSONArray;
import org.json.JSONException;
import org.json.JSONObject;

/* Generated from the ASR gRPC proto files. */
import io.grpc.*;
import io.grpc.stub.StreamObserver;
import xaas.sample.asr.java.client.Params.Parameters;
import xaas.sample.asr.java.client.Audio.IAudioSource;
import xaas.sample.asr.java.client.Audio.Microphone;
import xaas.sample.asr.java.client.Audio.AudioFile;


import com.nuance.grpc.asr.v1beta1.*;

public class AsrClient {

    public class Defaults {
        static final String SERVER = "asraas.beta.mix.nuance.com:443";
        static final String CONFIG_FILE = "config.json";
        static final String PARAMS_FILE = "params.json";
        static final String AUDIO_SOURCE = "microphone";
    }

    private RecognizerGrpc.RecognizerStub conn;
    private StreamObserver<RecognitionRequest> mRecoRequests;
    private IAudioSource mAudioSource = null;
    private CountDownLatch done;

    public AsrClient(RecognizerGrpc.RecognizerStub conn) {
        this.conn = conn;
    }

    public static ManagedChannel createChannel(String server) {
        ManagedChannel chan = ManagedChannelBuilder.forTarget(server)
                                                    .useTransportSecurity()
                                                    .build();

        return chan;
    }

    public static RecognizerGrpc.RecognizerStub createConnection(ManagedChannel chan, String accessToken) {
        RecognizerGrpc.RecognizerStub stub = RecognizerGrpc.newStub(chan).withCallCredentials(new CallCredentials() {
            @Override
            public void applyRequestMetadata(RequestInfo r, Executor e, MetadataApplier m) {
                e.execute(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            Metadata headers = new Metadata();
                            Metadata.Key<String> clientIdKey =
                                    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER);
                            headers.put(clientIdKey, accessToken);
                            m.apply(headers);
                        } catch (Throwable ex) {
                            //log the exception
                            ex.printStackTrace(System.out);
                        }
                    }
                });
            }

            @Override
            public void thisUsesUnstableApi() {
            }
        });
        return stub;
    }

    public static void shutdown(ManagedChannel chan) throws InterruptedException {
        chan.shutdown().awaitTermination(2, TimeUnit.SECONDS);
    }

    private final IAudioSource.Listener mAudioListener = new IAudioSource.Listener() {
        @Override
        public void onRecordingStarted() {
            log("Recording started");
        }

        @Override
        public void onRecordingStopped() {
            log("Recording stopped.");
            mRecoRequests.onCompleted();
        }

        @Override
        public void onRecordingError(String s) {
            log("Recording error: " + s);
            mRecoRequests.onCompleted();
        }

        @Override
        public void onRecord(final byte[] data, final int size) {

            RecognitionRequest req = RecognitionRequest.newBuilder()
                                                    .setAudio(ByteString.copyFrom(data)).build();
            mRecoRequests.onNext(req);
        }
    };

    private RecognitionParameters initializeRecognitionRequest(String paramsFile) {
        Params params = new Params(paramsFile);
        Parameters parameters = params.getParameters();
        xaas.sample.asr.java.client.Params.RecognitionFlags flags = parameters.getRecognitionFlags();
        log(String.format("Recognition Parameters: %s", params.toString()));

        AudioFormat audioFormat = AudioFormat.newBuilder()
                                            .setPcm(PCM.newBuilder().setSampleRateHz(16000).build())
                                            .build();

        return RecognitionParameters.newBuilder()
                                    .setAudioFormat(audioFormat)
                                    .setLanguage(parameters.getLanguage())
                                    .setTopic(parameters.getTopic())
                                    .setUtteranceDetectionMode(EnumUtteranceDetectionMode.forNumber(parameters.getUtteranceDetectionMode()))
                                    .setResultType(EnumResultType.forNumber(parameters.getResultType()))
                                    .setNoInputTimeoutMs(parameters.getNoInputTimeoutMs())
                                    .setRecognitionTimeoutMs(parameters.getRecognitionTimeoutMs())
                                    .setUtteranceEndSilenceMs(parameters.getUtteranceEndSilenceMs())
                                    .setMaxHypotheses(parameters.getMaxHypotheses())
                                    .setRecognitionFlags(RecognitionFlags.newBuilder()
                                                                    .setAutoPunctuate(flags.isAutoPunctuate())
                                                                    .setFilterProfanity(flags.isFilterProfanity())
                                                                    .setIncludeTokenization(flags.isIncludeTokenization())
                                                                    .setStallTimers(flags.isStallTimers())
                                                                    .setDiscardSpeakerAdaptation(flags.isDiscardSpeakerAdaptation())
                                                                    .setSuppressCallRecording(flags.isSuppressCallRecording())
                                                                    .setMaskLoadFailures(flags.isMaskLoadFailures())
                                                                    .build())
                                    .build();
    }

    private void recognize(RecognitionParameters params, String audioFile) {

        RecognitionInitMessage init = RecognitionInitMessage.newBuilder()
                                                        .setParameters(params)
                                                        .build();


        RecognitionRequest request = RecognitionRequest.newBuilder()
                                                    .setRecognitionInitMessage(init)
                                                    .build();

        done = new CountDownLatch(1);
        mRecoRequests = conn.recognize(new StreamObserver<RecognitionResponse>() {

            @Override
            public void onCompleted() {
                log("onCompleted");
                done.countDown();
            }
            @Override
            public void onError(Throwable T) {
                T.printStackTrace(System.out);
                done.countDown();
            }
            @Override
            public void onNext(RecognitionResponse response) {
                switch (response.getResponseUnionCase()) {
                    case RESULT:
                        Result result = response.getResult();
                        System.out.println(String.format("Transcription [%s]: [conf: %d] %s", 
                        result.getResultType(), 
                        result.getHypotheses(0).getAverageConfidence(),
                        result.getHypotheses(0).getFormattedText()));
                        if (result.getResultTypeValue() == EnumResultType.FINAL_VALUE) {
                            done.countDown();
                        }
                        break;
                    case START_OF_SPEECH:
                        StartOfSpeech sos = response.getStartOfSpeech();
                        System.out.println(String.format("Start of Speech detected: %dms", sos.getFirstAudioToStartOfSpeechMs()));
                        break;
                    case STATUS:
                        com.nuance.grpc.asr.v1beta1.Status status = response.getStatus();
                        System.out.println(String.format("Recognition Status: %d %s", status.getCode(), status.getMessage()));
                        break;
                    default:
                        break;
                }
            }
        });

        mRecoRequests.onNext(request);

        try {
            if (audioFile == null || audioFile.isBlank() || audioFile.equalsIgnoreCase("microphone")) {
                mAudioSource = new Microphone(mAudioListener);
            } else {
                mAudioSource = new AudioFile(new File(audioFile), mAudioListener);
            }
            mAudioSource.start();
        }
        catch (Exception e) {
            e.printStackTrace();
            done.countDown();
        }

        try {
            done.await();
        } catch (Exception e) {
            log(e.getMessage());
        } finally {
            if (mAudioSource != null) {
                mAudioSource.stop();
            }
        }

    }

    public static void log(String msg) {
        System.out.println(msg);
    }

    /**
     * Generate cmd line options.
     *
     * @return the options
     */
    public static Options generateCmdLineOptions() {
        Options options = new Options();

        /** Help option */
        options.addOption( Option.builder("h")
                                .argName("help")
                                .required(false)
                                .longOpt("help")
                                .desc("Print this help information")
                                .build() );

        options.addOption( Option.builder("s")
                                .argName("server")
                                .hasArg()
                                .required(false)
                                .longOpt("server")
                                .desc("asr server URL host:port. Default: " + Defaults.SERVER)
                                .build() );

        options.addOption( Option.builder("c")
                                .argName("configFile")
                                .hasArg()
                                .required(false)
                                .longOpt("configFile")
                                .desc("config file containing client credentials (client_id and\n" +
                                    "client_secret). Default: " + Defaults.CONFIG_FILE)
                                .build() );

        options.addOption( Option.builder("p")
                                .argName("paramsFile")
                                .hasArg()
                                .required(false)
                                .longOpt("paramsFile")
                                .desc("parameter file containing recognition params. Default: " + Defaults.PARAMS_FILE)
                                .build() );

        options.addOption( Option.builder("a")
                                .argName("audioSource")
                                .hasArg()
                                .required(false)
                                .longOpt("audioSource")
                                .desc("Specify an audio source. Default is microphone. Specify a path to stream from file.")
                                .build() );

        return options;
    }

    /**
     * Parses the command line.
     *
     * @param args the args
     * @param options the options
     * @return the command line
     * @throws ParseException the parse exception
     */
    public static CommandLine parseCommandLine(String[] args, Options options) throws ParseException {

        CommandLineParser parser = new DefaultParser();     
        return parser.parse(options, args);
    }

    /**
     * Prints the usage.
     *
     * @param options the options
     */
    public static void printUsage(Options options) {
        HelpFormatter formatter = new HelpFormatter();
        formatter.setOptionComparator(null);
        formatter.setWidth(800);

       String path = AsrClient.class.getProtectionDomain().getCodeSource().getLocation().getFile();
       File f = new File(path);
       String jar = f.getName();

       formatter.printHelp("java -jar " + jar + " [-h|--help] [-s|--server \"<value>\"] [-c|--configFile\n" +
                         "                                " + 
                         "\"<value>\"] [-a|audioFile \"<value>\"]\n" +
                         "                                " + 
                        "Use Nuance MIX ASR to vocalize text\n\nArguments:\n\n"
                         , options);
   }

    public static void main(String[] args) {
        try {
            // Initialize available options and then parse the command line
            Options options = AsrClient.generateCmdLineOptions();
            CommandLine cmd = AsrClient.parseCommandLine(args, options);

            // If --help was specified, display usage details and exit, even if other options were provided
            if( cmd.hasOption("help") ) {
                printUsage(options);
                System.exit(0);
            }

            String configFile = cmd.getOptionValue("configFile", Defaults.CONFIG_FILE);
            String paramsFile = cmd.getOptionValue("paramsFile", Defaults.PARAMS_FILE);
            String server = cmd.getOptionValue("server", Defaults.SERVER);
            String audioSource = cmd.getOptionValue("audioSource", Defaults.AUDIO_SOURCE);

            // Load credentials from config file
            Config c = new Config(configFile);

            // Authenticate and create a token
            Authenticator a = new Authenticator(c.getConfiguration());
            Token t = a.Authenticate();

            // Create a connection
            ManagedChannel chan = createChannel(server);
            RecognizerGrpc.RecognizerStub conn = createConnection(chan, String.format("%s %s", t.getTokenType(), t.getAccessToken()));

            // Run the ASR request
            AsrClient client = new AsrClient(conn);
            RecognitionParameters params = client.initializeRecognitionRequest(paramsFile);
            client.recognize(params, audioSource);
            shutdown(chan);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The src/.../Authenticator.java file authenticates the app to the ASR service. See Step 2: Authenticate and connect.

The src/.../Config.java file handles the authentication values in config.json.

package xaas.sample.asr.java.client;

import java.io.BufferedReader;
import java.io.FileReader;

import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.GsonBuilder;

/**
 * Config
 */
public class Config {

    class Configuration {
        @SerializedName("client_id")
        String clientID;
        @SerializedName("client_secret")
        String clientSecret;
        @SerializedName("token_url")
        String tokenURL;

        public String getClientID() {
            return clientID;
        }

        public void setClientID(String clientID) {
            this.clientID = clientID;
        }

        public String getClientSecret() {
            return clientSecret;
        }

        public void setClientSecret(String clientSecret) {
            this.clientSecret = clientSecret;
        }

        public String getTokenURL() {
            return tokenURL;
        }

        public void setTokenURL(String tokenURL) {
            this.tokenURL = tokenURL;
        }
    }

    Configuration configuration;

    public Config(String configFile) throws Exception {
        loadConfig(configFile);
    }

    private boolean isEmpty(String var) {
        if (var == null || var.isBlank()) {
            return true;
        }
        return false;
    }

    private boolean loadConfig(String configFile) throws Exception {

        Gson gson = new Gson();

        BufferedReader reader = new BufferedReader(new FileReader(configFile));

        // Parse the configuration parameters...
        configuration = gson.fromJson(reader, Configuration.class);
        if (isEmpty(configuration.clientID) || isEmpty(configuration.clientSecret) || isEmpty(configuration.tokenURL)) {
            throw new IllegalArgumentException("Invalid configuration file.");
        }
        return true;
    }

    @Override
    public String toString() {
        //Gson gson = new Gson();
        Gson gson = new GsonBuilder().setPrettyPrinting().create();

        return gson.toJson(configuration);
    }

    public Configuration getConfiguration() {
        return configuration;
    }
}

The src/.../Params.java file handles the recognition parameters in params.json.

package xaas.sample.asr.java.client;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileNotFoundException;

import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.GsonBuilder;

/* Config */
public class Params {

    class RecognitionFlags {
        @SerializedName("auto_punctuate")
        boolean autoPunctuate;

        @SerializedName("filter_profanity")
        boolean filterProfanity;

        @SerializedName("include_tokenization")
        boolean includeTokenization;

        @SerializedName("small_timers")
        boolean smallTimers;

        @SerializedName("discard_speaker_adaptation")
        boolean discardSpeakerAdaptation;

        @SerializedName("suppress_call_recording")
        boolean suppressCallRecording;

        @SerializedName("mask_load_failures")
        boolean maskLoadFailures;

        public boolean isAutoPunctuate() {
            return autoPunctuate;
        }

        public void setAutoPunctuate(boolean autoPunctuate) {
            this.autoPunctuate = autoPunctuate;
        }

        public boolean isFilterProfanity() {
            return filterProfanity;
        }

        public void setFilterProfanity(boolean filterProfanity) {
            this.filterProfanity = filterProfanity;
        }

        public boolean isIncludeTokenization() {
            return includeTokenization;
        }

        public void setIncludeTokenization(boolean includeTokenization) {
            this.includeTokenization = includeTokenization;
        }

        public boolean isStallTimers() {
            return stallTimers;
        }

        public void setStallTimers(boolean stallTimers) {
            this.stallTimers = stallTimers;
        }

        public boolean isDiscardSpeakerAdaptation() {
            return discardSpeakerAdaptation;
        }

        public void setDiscardSpeakerAdaptation(boolean discardSpeakerAdaptation) {
            this.discardSpeakerAdaptation = discardSpeakerAdaptation;
        }

        public boolean isSuppressCallRecording() {
            return suppressCallRecording;
        }

        public void setSuppressCallRecording(boolean suppressCallRecording) {
            this.suppressCallRecording = suppressCallRecording;
        }

        public boolean isMaskLoadFailures() {
            return maskLoadFailures;
        }

        public void setMaskLoadFailures(boolean maskLoadFailures) {
            this.maskLoadFailures = maskLoadFailures;
        }
    }

    class Parameters {
        @SerializedName("language")
        String language;

        @SerializedName("topic")
        String topic;

        @SerializedName("utterance_detection_mode")
        int utteranceDetectionMode;

        @SerializedName("result_type")
        int resultType;

        @SerializedName("recognition_flags")
        RecognitionFlags recognitionFlags;

        @SerializedName("no_input_timeout_ms")
        int noInputTimeoutMs;

        @SerializedName("recognition_timeout_ms")
        int recognitionTimeoutMs;

        @SerializedName("utterance_end_silence_ms")
        int utteranceEndSilenceMs;

        @SerializedName("max_hypotheses")
        int maxHypotheses;

        public String getLanguage() {
            return this.language;
        }

        public void setLanguage(String language) {
            this.language = language;
        }

        public String getTopic() {
            return topic;
        }

        public void setTopic(String topic) {
            this.topic = topic;
        }

        public int getUtteranceDetectionMode() {
            return utteranceDetectionMode;
        }

        public void setUtteranceDetectionMode(int utteranceDetectionMode) {
            this.utteranceDetectionMode = utteranceDetectionMode;
        }

        public int getResultType() {
            return resultType;
        }

        public void setResultType(int resultType) {
            this.resultType = resultType;
        }

        public int getNoInputTimeoutMs() {
            return noInputTimeoutMs;
        }

        public void setNoInputTimeoutMs(int noInputTimeoutMs) {
            this.noInputTimeoutMs = noInputTimeoutMs;
        }

        public int getRecognitionTimeoutMs() {
            return recognitionTimeoutMs;
        }

        public void setRecognitionTimeoutMs(int recognitionTimeoutMs) {
            this.recognitionTimeoutMs = recognitionTimeoutMs;
        }

        public int getUtteranceEndSilenceMs() {
            return utteranceEndSilenceMs;
        }

        public void setUtteranceEndSilenceMs(int utteranceEndSilenceMs) {
            this.utteranceEndSilenceMs = utteranceEndSilenceMs;
        }

        public int getMaxHypotheses() {
            return maxHypotheses;
        }

        public void setMaxHypotheses(int maxHypotheses) {
            this.maxHypotheses = maxHypotheses;
        }

        public RecognitionFlags getRecognitionFlags() {
            return recognitionFlags;
        }

        public void setRecognitionFlags(RecognitionFlags recognitionFlags) {
            this.recognitionFlags = recognitionFlags;
        }

    }

    Parameters parameters;

    public Params(String paramsFile) {
        loadParams(paramsFile);
    }

    private boolean isEmpty(String var) {
        if (var == null || var.isBlank()) {
            return true;
        }
        return false;
    }

    private boolean loadParams(String paramsFile) {

        try {
            Gson gson = new Gson();

            BufferedReader reader = new BufferedReader(new FileReader(paramsFile));

            // Parse the configuration parameters...
            parameters = gson.fromJson(reader, Parameters.class);
        }
        catch (FileNotFoundException e) {
            e.printStackTrace();
            return false;
        }

        return true;
    }

    @Override
    public String toString() {
        Gson gson = new GsonBuilder().setPrettyPrinting().create();

        return gson.toJson(parameters);
    }

    public Parameters getParameters() {
        return parameters;
    }
}

The src/.../Token.java file handles the authentication token.

package xaas.sample.asr.java.client;

import com.google.gson.Gson;
import com.google.gson.annotations.SerializedName;
import com.google.gson.GsonBuilder;

/* Token */
public class Token {
    @SerializedName("access_token")
    String accessToken;

    @SerializedName("expires_in")
    int expiresIn;

    @SerializedName("scope")
    String scope;

    @SerializedName("token_type")
    String tokenType;

    public String getAccessToken() {
        return accessToken;
    }

    public void setAccessToken(String accessToken) {
        this.accessToken = accessToken;
    }

    public int getExpiresIn() {
        return expiresIn;
    }

    public void setExpiresIn(int expiresIn) {
        this.expiresIn = expiresIn;
    }

    public String getScope() {
        return scope;
    }

    public void setScope(String scope) {
        this.scope = scope;
    }

    public String getTokenType() {
        return tokenType;
    }

    public void setTokenType(String tokenType) {
        this.tokenType = tokenType;
    }

    public String toString() {
            Gson gson = new GsonBuilder().setPrettyPrinting().create();

            return gson.toJson(this);
    }
}

This Java application consists of these files: AsrClient.java, Authenticator.java, Config.java, Params.java, and Token.java, plus the audio input helpers (IAudioSource, Microphone, and AudioFile).

Running the Java app

For example, this runs the Java app with the help option to see the values that you may pass to the application.

$ java -jar build/libs/asr_client.jar -h
usage: java -jar asr_client.jar [-h|--help] [-s|--server "<value>"] [-c|--configFile
                                "<value>"] [-a|audioFile "<value>"]
                                Use Nuance MIX ASR to recognize WAV audio
Arguments:
 -h,--help                        Print this help information
 -s,--server <server>             asr server URL host:port. Default: asraas.beta.mix.nuance.com:443
 -c,--configFile <configFile>     config file containing client credentials (client_id and
                                  client_secret). Default: config.json
 -p,--paramsFile <paramsFile>     parameter file containing recognition params. Default: params.json
 -a,--audioSource <audioSource>   Specify an audio source. Default is microphone. Specify a path to stream from file.

And this runs the app using the default config and parameter files and an audio file, sample.pcm.

$ java -jar build/libs/asr_client.jar -a sample.pcm 
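The config.json and params.json files themselves are not shown above. As a guide, here is a minimal sketch of a params.json file, inferred from the @SerializedName annotations in Params.java; the values are illustrative only, and 0 generally means "use the server default" as described under Defaults:

{
  "language": "en-US",
  "topic": "GEN",
  "utterance_detection_mode": 0,
  "result_type": 0,
  "no_input_timeout_ms": 0,
  "recognition_timeout_ms": 0,
  "utterance_end_silence_ms": 0,
  "max_hypotheses": 0,
  "recognition_flags": {
    "auto_punctuate": true,
    "filter_profanity": false,
    "include_tokenization": false,
    "stall_timers": false,
    "discard_speaker_adaptation": false,
    "suppress_call_recording": false,
    "mask_load_failures": false
  }
}

The config.json file, read by Config.java, holds the client_id, client_secret, and token_url values obtained from your Mix project.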

Reference topics

This section provides more information about topics in the gRPC API.

Status messages and codes

Recognizer service

service Recognizer {
  rpc Recognize (stream RecognitionRequest) returns (stream RecognitionResponse);
}

Status response message

{
  status: {
    code: 100
    message: 'Continue'
    details: 'recognition started on audio/l16;rate=8000 stream'
  }
  cookies: {  ... }
}

A single Recognizer service provides a single Recognize method supporting bi-directional streaming of requests and responses.

The client first provides a recognition request message with parameters indicating at minimum what language to use. Optionally, it can also include resources to customize the data packs used for recognition, and arbitrary client data to be injected into call recording for reference in offline tuning workflows.

In response to the recognition request message, Krypton returns a status message confirming the outcome of the request. Usually the message is Continue: recognition started on audio/l16;rate=8000 stream.

Status messages include HTTP-aligned status codes. A failure to begin recognizing is reflected in a 4xx or 5xx status as appropriate. (Cookies returned from resource fetches, if any, are returned in the first response only.)

When a 100 status is received the client may proceed to send one or more messages bearing binary audio samples in the format indicated in the recognize message (default: signed PCM/8000 Hz). The server responds with zero or more result messages reflecting the outcome of recognizing the incoming audio, until a terminating condition is reached, at which point the server sends a final status message indicating normal completion (200/204) or any errors encountered (4xx/5xx). Termination conditions include:

If the client cancels the RPC, no further messages are received from the server. If the server encounters an error, it attempts to send a final error status and then cancels the RPC.

Status codes

Code Message Indicates
100 Continue Recognition parameters and resources were accepted and successfully configured. Client can proceed to send audio data. Also returned in response to a start_timers message.
200 Success Recognition completed and returned at least one result.
204 No result Recognition completed without producing a result.
400 Bad request A malformed or unsupported client request was rejected.
404 No speech No utterance was detected in the audio stream for a number of samples corresponding to no_input_timeout_ms.
408 Audio timeout Excessive stall in sending audio data.
409 Conflict The recognizer is currently in use by another client.
410 Not recognizing A start_timers message was received but no in-progress recognition exists.
413 Too much speech Recognition of in-utterance samples reached a duration corresponding to recognition_timeout_ms.
500 Internal server error A serious error occurred that prevented the request from completing normally.
502 Resource error One or more resources failed to load.
503 Service unavailable Unused; reserved for gateways.
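For illustration, here is a minimal Python sketch of a client loop that branches on these messages and codes. It is not one of the sample apps: the module name nuance_asr_pb2_grpc, and the client_stream request generator and wf audio file shown in the Python samples elsewhere in this document, are assumptions.

import grpc
import nuance_asr_pb2_grpc as pb_grpc   # assumed name of the generated stub module

# Channel creation is simplified here; token call credentials are omitted for brevity
channel = grpc.secure_channel('asraas.beta.mix.nuance.com:443', grpc.ssl_channel_credentials())
stub = pb_grpc.RecognizerStub(channel)

try:
    for response in stub.Recognize(client_stream(wf)):   # client_stream as in the Python samples
        kind = response.WhichOneof('response_union')
        if kind == 'status':
            s = response.status
            print(f'{s.code} {s.message} - {s.details}')
            if s.code >= 400:            # 4xx/5xx: recognition did not complete normally
                break
        elif kind == 'start_of_speech':
            print(f'start of speech after {response.start_of_speech.first_audio_to_start_of_speech_ms}ms')
        elif kind == 'result':
            print(response.result.hypotheses[0].formatted_text)
except grpc.RpcError as e:
    # If the server cancels the RPC after its final error status, the failure surfaces here
    print(f'gRPC error: {e.code()} {e.details()}')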

Result type

Final results

final : It's Monday morning and the sun is shining

Partial results

partial : It's
partial : It's me
partial : It's month
partial : It's Monday
partial : It's Monday no
partial : It's Monday more
partial : It's Monday March
partial : It's Monday morning
partial : It's Monday morning and
partial : It's Monday morning and the
partial : It's Monday morning and this
partial : It's Monday morning and the sun
partial : It's Monday morning and the center
partial : It's Monday morning and the sun is
partial : It's Monday morning and the sonny's
partial : It's Monday morning and the sunshine
final : It's Monday morning and the sun is shining

Immutable partial results

partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining

Krypton offers three different types of results for the transcription of each utterance in the audio stream. The desired type is set in the request with RecognitionParameters - EnumResultType. In the response, the actual type is indicated in Result - EnumResultType.

Some data packs perform additional processing after the initial transcription. The transcription may change slightly during this second pass, even for immutable partial results. For example, Krypton originally recognized "the 7:58 train" as "the 750 A-Train" but adjusted it during a second pass, returning the correct transcription in the final version of the sentence.

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station

Formatted text

Formatted vs. minimally formatted text

Formatted text:           December 9, 2005
Minimally formatted text: December nine two thousand and five

Formatted text:           $500
Minimally formatted text: Five hundred dollars

Formatted text:           I'll catch the 758 train
Minimally formatted text: I'll catch the seven fifty eight train

Formatted text:           I'm okay James, how about yourself?
Minimally formatted text: I'm okay James, how about yourself?

Krypton returns transcriptions in two formats: formatted text and minimally formatted text. See Result - Hypothesis.

Formatted text includes initial capitals for recognized names and places, numbers expressed as digits, currency symbols, and common abbreviations. In minimally formatted text, words are spelled out but basic capitalization and punctuation are included.

In many cases, both formats are identical.

For more precise control over the formatted text results, you may specify a formatting scheme as a recognition parameter (RecognitionParameters - Formatting - scheme). The scheme is a keyword for a formatting type supported by the data pack and determines how results are presented in the formatted text result. For example, the utterance “seven eleven” is formatted differently depending on the scheme: as a time, a date, a phone number, and so on.

The available scheme values depend on the data pack. For example, some American English data packs include: phone, date, num_as_digits, time, default, address, all_as_words.
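A scheme could be requested as follows (a minimal Python sketch; 'date' is simply one of the keywords listed above, and the right value depends on your data pack):

# Request a date-oriented formatting scheme for the formatted text result
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        formatting = Formatting(scheme='date')))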

Opus audio format

Krypton supports the Opus audio format, either raw Opus (RFC 6716) or Ogg-encapsulated Opus (RFC 7845). The recommended encoder settings for Opus for speech recognition are:

Please note that Opus is a lossy codec, so you should not expect recognition results to be identical to those obtained with PCM audio.

See AudioFormat for the other supported formats: PCM, A-law and µ-law.
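For example, Ogg-encapsulated Opus input could be declared like this (a minimal Python sketch; see the OggOpus and Opus messages in the gRPC API section for the available fields):

# Declare 16kHz Ogg Opus audio instead of PCM
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(ogg_opus=OggOpus(output_rate_hz=16000))))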

Smart routing

The Krypton ASR service offers “smart routing,” meaning you can specify any supported language and topic in your recognition request, and your request is routed automatically to a Krypton engine running that language and topic. Currently Mix supports American English (en-US), German (de-DE), and Japanese (ja-JP) with the General (gen) topic.

The URL for this service is currently asraas2.beta.mix.nuance.com. (The standard URL is asraas.beta.mix.nuance.com.)

When using smart routing, we recommend you include a topic as well as the (mandatory) language in your recognition request, for example:

RecognitionParameters(
   language = 'en-US',
   topic = 'GEN',
   . . .
)

The topic is currently optional (the default is 'GEN'), but for clarity and forward compatibility a specific topic is recommended.

Resources

Krypton works with one or more factory data packs, available in several languages and locales. The data pack includes two main models: a base acoustic model and a base language model.

The base acoustic model is trained to give good performance in many acoustic environments. The base language model is developed to remain current with popular vocabulary and language use. As such, Krypton paired with a data pack is ready for use out-of-the-box for many applications.

You may extend the data pack at runtime using several types of specialization resources, described in the sections below: domain language models (DLMs), wordsets, builtins, and speaker profiles.

Each recognition turn can leverage a weighted mix of domain LMs, wordsets, and builtins.
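For example, a single request might activate a DLM, a wordset, and a builtin together (a minimal Python sketch that simply combines the declarations shown individually in the sections below; the URN placeholder, weight, and wordset content are illustrative):

# A weighted DLM, an inline wordset extending one of its entities, and a builtin
names_places_dlm = RecognitionResource(external_reference =
    ResourceReference(type='DOMAIN_LM',
        uri='urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr'),
    weight_value=700)
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"Auchenblae"}]}')
cal_builtin = RecognitionResource(builtin='CALENDARX')

init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ names_places_dlm, places_wordset, cal_builtin ])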

Domain LMs

Domain LM

# Define a DLM
names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr'),
    weight_value=700)

# Declare the DLM in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ names_places_dlm ]
)
    // Define a DLM 
    names_places_dlm := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_ExternalReference{
            ExternalReference: &pb.ResourceReference{
                Type: pb.EnumResourceType_DOMAIN_LM,
                Uri: "urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr",
                Reuse: pb.EnumResourceReuse_HIGH_REUSE,
            },
        },
        WeightUnion: &pb.RecognitionResource_WeightValue{
            WeightValue: 700,
        },
    }
    // Declare the DLM in RecognitionInitMessage
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm },
            },
        },
    }

Each data pack supplied with Krypton provides a base language model that lets the transcription engine recognize the most common terms and constructs in the language and locale. You may complement this language model with one or more domain-specific models, called domain language models (domain LMs or DLMs).

In Krypton, a DLM is a resource declared with RecognitionInitMessage - RecognitionResource.

Domain LMs are created in Nuance Mix (see Mix.nlu workflow) and accessed via a URN available from Mix. See the Python code sample at the right for an example of a URN.

Domain LMs may contain entities (sometimes called concepts), or collections of specific terms. These entities may in turn be extended with wordsets.

Wordsets

Inline wordset, places_wordset, extends the PLACES entity

# Define a DLM
names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr'),
    weight_value=700)

# Define an inline wordset in that DLM
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"Fordoun","spoken":["forden"]},{"literal":"Auchenblae"}]}')


init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ names_places_dlm, places_wordset ]
)
    // Define a DLM 
    names_places_dlm := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_ExternalReference{
            ExternalReference: &pb.ResourceReference{
                Type: pb.EnumResourceType_DOMAIN_LM,
                Uri: "urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr",
                Reuse: pb.EnumResourceReuse_HIGH_REUSE,
            },
        },
        WeightUnion: &pb.RecognitionResource_WeightValue{
            WeightValue: 700,
        },
    }

    // Define an inline wordset in that DLM 
    places_wordset := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_InlineWordset{
            InlineWordset: `{"PLACES":[{"literal":"Fordoun","spoken":["forden"]}, {"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]}, {"literal":"Auchenblae"}]}`,
        },
    }

   // Declare the DLM and wordset in RecognitionInitMessage
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }

A wordset is a collection of words and short phrases that extends Krypton's recognition vocabulary by providing additional values for entities in a DLM. For example, a wordset might provide the names in a user’s contact list or local place names. Like DLMs, wordsets are declared with RecognitionInitMessage - resources.

Using wordsets in Krypton requires three items: the wordset itself, a DLM that defines the entities the wordset extends, and the declaration of both the wordset and the DLM as resources in the recognition request.

The wordset is defined in JSON format as one or more arrays. Each array is named after an entity defined within a DLM to which words can be added at runtime. Entities are templates that tell Krypton how and where words are used in a conversation.

For example, you might have an entity, NAMES, containing personal names, or PLACES, with place names used by the application. The wordset adds to the existing terms in the entity, but applies only to the current recognition session. The terms in the wordset are not added permanently to the entity. All entities must be defined in domain LMs, which are loaded and activated along with the wordset.

This wordset adds terms to the PLACES entity

{
  "PLACES" : [
    {"literal" : "La Jolla", 
     "spoken" : ["la hoya", "la jolla"] 
    },
    {"literal" : "Beaulieu", 
     "spoken": ["bewly"] 
    },
    {"literal" : "Worcester", 
     "spoken": ["wuster"] 
    },
    {"literal" : "Abington Pigotts"
    },
    {"literal" : "Steeple Morden"
    }
  ]
}

The wordset includes additional values for one or more entities. The syntax is:

{
   "entity" : [
      { "literal": "written form",
      "spoken": ["spoken form 1", "spoken form n"]
      },
      { "literal": "written form",
      "spoken":"spoken form 1", "spoken form n"] },
   ...
   ],
   "entity" : [ ... ]
}

Syntax
Element Type Description
entity String An entity defined in a domain LM, containing a set of values. The name is case-sensitive. Consult the DLM for entity names.
literal String The written form of the value, as returned in the formatted text result.
spoken Array (Optional) One or more spoken forms of the value. When not supplied, Krypton guesses the pronunciation of the word from the literal form. Include a spoken form only if the literal is difficult to pronounce or has an unusual pronunciation in the language.

Wordset example

Wordset file places-wordset.json

{
   "PLACES":[ 
      { 
         "literal":"La Jolla",
         "spoken":[ "la hoya", "la jolla" ]
      },
      { 
         "literal":"Llanfairpwllgwyngyll",
         "spoken":[ "lan vire pool guin gill" ]
      },
      { 
         "literal":"Abington Pigotts"
      },
      { 
         "literal":"Steeple Morden"
      },
      { 
         "literal":"Hoyland Common"
      },
      { 
         "literal":"Cogenhoe",
         "spoken":[ "cook no" ]
      },
      { 
         "literal":"Fordoun",
         "spoken":[ "forden" ]
      },
      { 
         "literal":"Llangollen",
         "spoken":[ "lan goth lin", "lhan goth luhn" ]
      },
      { 
         "literal":"Auchenblae"
      }]
}

Before: Without a wordset, unusual place names are not recognized

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington tickets in Cambridgeshire England.
final : I'm speaking to you from the town of cooking out in Northamptonshire.
final : We visited the village of steeple Morton on our way to highland common in Yorkshire.
final : We spent a week in the town of land Gosling in Wales. 
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

After: Recognition is perfect with a wordset

$ ./my-python-app.py abington8_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
stream abington8_16.wav
final : I'm going on a trip to Abington Piggots in Cambridgeshire England.
final : I'm speaking to you from the town of Cogenhoe in Northamptonshire.
final : We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
final : We spent a week in the town of Llangollen in Wales.
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

This wordset extends the PLACES entity in the names-places DLM with additional place names. Notice that a spoken form is provided only for terms that do not follow the standard pronunciation rules for the language.

For inline wordsets, you may specify the JSON within the application, compressed (without spaces) and enclosed in single quotation marks:

# Define an inline wordset for that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La
Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll",
"spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":
"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":
["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen",
"spoken":["lan goth lin","lhan goth luhn"]},{"literal":"Auchenblae"}]}')

Alternatively, you could store the wordset in a local file (see places-wordset.json) and read the file at runtime. For example, in Python:

# Define a wordset in a file for that DLM 
places_wordset_content = None
with open('places-wordset.json', 'r') as f:
    places_wordset_content = f.read()
places_wordset = RecognitionResource(inline_wordset=places_wordset_content)
 
def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(language='en-US', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
              result_type='IMMUTABLE_PARTIAL', 
              utterance_detection_mode='MULTIPLE'),
            resources = [ names_places_dlm, places_wordset ]
        )
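Another option is to build the wordset JSON from application data at runtime, for example from a list of place names or a user's contacts (a minimal sketch; the entity name and values are illustrative):

import json

# Build an inline wordset for the PLACES entity from application data
place_names = [
    {'literal': 'Fordoun', 'spoken': ['forden']},
    {'literal': 'Auchenblae'},
]
places_wordset = RecognitionResource(inline_wordset=json.dumps({'PLACES': place_names}))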

In the examples at the right, the audio file, abington8_16.wav, is a 16kHz recording containing a variety of place names, some common and some unusual.

The Before example shows the recognition without a DLM or wordset containing these place names. Notice that the unusual names are not recognized correctly.

The After example shows that recognition is perfect when all the place names are defined in a wordset.

Builtins

Data pack builtins

# define builtins
cal_builtin = RecognitionResource(builtin='CALENDARX')
distance_builtin = RecognitionResource(builtin='DISTANCE')

# Declare builtins in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ names_places_dlm, cal_builtin, distance_builtin ]
)
    // Define builtins  
    cal_builtin := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_Builtin{
            Builtin: `CALENDARX`,
        },
    }
    distance_builtin := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_Builtin{
            Builtin: `DISTANCE`,
        },
    }

   // Declare builtins in RecognitionInitMessage
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, cal_builtin, distance_builtin },
            },
        },
    }

The data pack may include one or more builtins, which are predefined recognition objects focused on common tasks (numbers, dates, and so on) or general information in a vertical domain such as financial services or healthcare. The available builtins depend on the data pack. For many American English data packs, for example, the builtins are:

ALPHANUM                 AMOUNT              BOOLEAN 
CALENDARX                CARDINAL_NUMBER     DATE 
DIGITS                   DISTANCE            DOUBLE 
DURATION                 DURATION_RANGE      GENERIC_ORDER 
GLOBAL                   NUMBERS             ORDINAL_NUMBER
QUANTITY_REL             TEMPERATURE         TIME 
VERT_FINANCIAL_SERVICES 
VERT_HEALTHCARE 
VERT_TELECOMMUNICATIONS 
VERT_TRAVEL

To use a builtin in Krypton, declare it with RecognitionResource - builtin and activate it in RecognitionInitMessage - resources.

Speaker profiles

Speaker profile

speaker_profile = RecognitionResource(external_reference = 
    ResourceReference(type='SPEAKER_PROFILE', 
        uri='urn:nuance:asr/speakerid/james.somebody@abc.com'))
. . . 
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ names_places_dlm, places_wordset, speaker_profile ]
)

Speaker adaptation is a technique that adapts and improves speech recognition based on qualities of the speaker and channel. The best results are achieved by updating the data pack's acoustic model in real time based on the immediate utterance.

Krypton maintains adaptation data for each caller as speaker profiles in an internal datastore.

To use speaker profiles in Krypton, specify them in RecognitionInitMessage - RecognitionResource - ResourceReference with type SPEAKER_PROFILE and a URI to urn:nuance:asr/speakerid/<speaker>. The speaker must be a unique identifier, for example:

uri='urn:nuance:asr/speakerid/socha.someone@aardvark.com'  
uri='urn:nuance:asr/speakerid/erij-lastname' 
uri='urn:nuance:asr/speakerid/device-1234'  
uri='urn:nuance:asr/speakerid/33ba3676-3423-438c-9581-bec1dc52548a'

The first time you send a request with a speaker profile, Krypton creates a profile based on the speaker identifier, and stores the data in the profile. On subsequent requests with the same speaker identifier, Krypton adds the data to the profile. This data then adapts the acoustic model for that specific speaker, providing customized recognition.

After the Krypton session, the adapted data is saved by default. If this information is not required after the session, set RecognitionParameters - RecognitionFlags - discard_speaker_adaptation = True.
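For example (a minimal Python sketch reusing the speaker_profile resource declared above):

# Use the speaker profile for this request only, discarding the adapted data afterwards
init = RecognitionInitMessage(
    parameters = RecognitionParameters(language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        recognition_flags = RecognitionFlags(discard_speaker_adaptation=True)),
    resources = [ names_places_dlm, places_wordset, speaker_profile ])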

Defaults

The proto files provide the following default values for messages in the RecognitionRequest sent to Krypton. Mandatory fields are shown in bold.

Items in RecognitionRequest Default
recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
      language Mandatory, e.g. 'en-US'
    topic Default 'GEN'
    audio_format (AudioFormat) Mandatory, e.g. 'PCM'
    utterance_detection_mode (EnumUtteranceDetectionMode) SINGLE (0): transcribe one utterance only
    result_type (EnumResultType) FINAL (0): return only final version of each utterance
    recognition_flags (RecognitionFlags)  
        auto_punctuate False: Do not punctuate results
      filter_profanity False: Leave profanity as is
      include_tokenization False: Do not include tokenized result
      stall_timers False: Start no-input timers
      discard_speaker_adaptation False: Keep speaker profile data
      suppress_call_recording False: Send data to CLAG
      mask_load_failures False: Loading errors terminate recognition
    no_input_timeout_ms 0*, usually no timeout
    recognition_timeout_ms 0*, usually no timeout
    utterance_end_silence_ms 0*, usually 500ms or half a second
    speech_detection_sensitivity 500
    max_hypotheses 0*, usually 10 hypotheses
    speech_domain Depends on data pack
    formatting (Formatting)  
        scheme Depends on data pack
      options Blank
  resources (RecognitionResource)
      external_reference (ResourceReference)
        type (EnumResourceType) Mandatory with resources - external_reference
      uri Mandatory with resources - external_reference
      max_age 0 delta seconds
      max_stale 0 delta seconds
      min_fresh 0 delta seconds
      cookies Blank
      request_timeout_ms 0*, usually 10000ms or 10 seconds
    inline_wordset Blank
    builtin Blank
    inline_grammar Blank
    weight_enum (EnumWeight) 0, meaning MEDIUM
    weight_value 0
    reuse (EnumResourceReuse) LOW_REUSE: only one recognition
  client_data Blank
control_message (ControlMessage) Blank
audio Mandatory

* Items marked with an asterisk (*) default to 0, meaning a server default: the default is set in the configuration file used by the Krypton engine instance. The values shown here are the values set in the sample configuration files (default.yaml and development.yaml) provided with the Krypton engine. In the case of max_hypotheses, the default (10 hypotheses) is set internally within Krypton.

gRPC API

Krypton provides three protocol buffer (.proto) files to define Nuance's ASR service for gRPC. These files contain the building blocks of your transcription applications.

Once you have transformed the proto files into functions and classes in your programming language using gRPC tools, you can call these functions from your application to request transcription, to set recognition parameters, to load “helper” resources such as domain language models and wordsets, and to send the resulting transcription where required.
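For example, in Python the generated stubs might be used as follows to open a secure channel and create a Recognizer stub (a hedged sketch: the module name nuance_asr_pb2_grpc follows from the protoc commands shown earlier, and the token placeholder stands for the access token you request with your client ID and secret):

import grpc
import nuance_asr_pb2_grpc as pb_grpc    # generated from nuance_asr.proto

token = '<access token from the Mix authorization service>'   # placeholder
call_creds = grpc.access_token_call_credentials(token)
channel_creds = grpc.composite_channel_credentials(grpc.ssl_channel_credentials(), call_creds)

with grpc.secure_channel('asraas.beta.mix.nuance.com:443', channel_creds) as channel:
    stub = pb_grpc.RecognizerStub(channel)
    # stub.Recognize(...) streams RecognitionRequest and RecognitionResponse messages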

See Client app development for a scenario using Python and Go that sets different recognition parameters, requests transcription, and loads domain LMs and wordsets. Sample Python and Go apps are included. For other languages, consult the gRPC and Protocol Buffer documentation:

Field names in proto and stub files

In this section, the names of the fields are shown as they appear in the proto files. To see how they are generated in your programming language, consult your generated files. For example:

Proto file Python Go Java
audio_format audio_format AudioFormat audioFormat or getAudioFormat
language language Language language or getLanguage

For details about how proto file terms are transformed in the different programming languages, see the Protocol Buffers documentation for:

Proto file structure

Structure of proto files

Recognizer     Recognize
        RecognitionRequest
        RecognitionResponse

RecognitionRequest
    recognition_init_message RecognitionInitMessage
        parameters RecognitionParameters
            language and other recognition parameter fields
            audio_format AudioFormat
            result_type EnumResultType
            recognition_flags RecognitionFlags
            formatting Formatting
        resources RecognitionResource
            external_reference ResourceReference
                type EnumResourceType
            inline_wordset
            builtin
            inline_grammar
            weight_enum EnumWeight
            weight_value
        client_data
    control_message ControlMessage
        start_timers_message StartTimersControlMessage
    audio

RecognitionResponse
    status Status
    start_of_speech StartOfSpeech
    result Result
        result fields
        result_type EnumResultType
        utterance_info UtteranceInfo
             utterance fields
            dsp Dsp
        hypotheses Hypothesis
             hypothesis fields
            word Word
    cookies

The proto files define an RPC service with a Recognize method that streams a RecognitionRequest and RecognitionResponse. Details about each component are referenced by name within the proto file.

Proto files

Recognizer

See individual sections for examples

Streaming recognition service API.

Name Request Type Response Type Description
Recognize RecognitionRequest stream RecognitionResponse stream Starts a recognition request and returns a response.

RecognitionRequest

RecognitionRequest sends parameters then audio to be transcribed

def client_stream(wf):
    try:
        # Start the recognition
        init = RecognitionInitMessage(. . .)
        yield RecognitionRequest(recognition_init_message=init)

        # Simulate a typical realtime audio stream
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }
    . . . 
    // Simulate audio stream
    packet := make([]byte, packetSize)
    for {
        packetBytesRead, err := wavReader.Read(packet)
        if err != nil {
            if err != io.EOF {
                fmt.Println(err)
            }
            break
        }
        audio := packet[0:packetBytesRead]
        audioRequest := &pb.RecognitionRequest_Audio{Audio: audio}
        request := &pb.RecognitionRequest{RequestUnion: audioRequest}
        if err := stream.Send(request); err != nil {
            log.Fatalf("Failed to send audio: %v", err)
        }
        time.Sleep(time.Duration(packetSize/640*20) * time.Millisecond) // Pace audio to mimic real-time
    } 
    stream.CloseSend()
private void recognize(RecognitionParameters params, String audioFile) {
    RecognitionInitMessage init = RecognitionInitMessage.newBuilder()
        .setParameters(params)
        .build();
    RecognitionRequest request = RecognitionRequest.newBuilder()
        .setRecognitionInitMessage(init)
        .build();

Input stream messages that request recognition, sent one at a time in a specific order. The first mandatory field sends recognition parameters and resources; the final field sends audio to be recognized. Included in Recognizer - Recognize service.

Field Type Description
recognition_init_message RecognitionInitMessage Required first message in the RPC input stream, sends parameters and resources for recognition.
control_message ControlMessage Optional second message in the RPC input stream, for timer control.
audio bytes Audio samples in the selected encoding for recognition.

This method includes:

RecognitionRequest
    recognition_init_message (RecognitionInitMessage)
        parameters (RecognitionParameters)
        resources (RecognitionResource)
        client_data
    control_message (ControlMessage)
    audio
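For example, the optional control_message could be sent from the client_stream generator like this (a minimal Python sketch; it assumes stall_timers was set to True in RecognitionFlags so that the no-input timer starts only when this message is received):

# Inside the client_stream generator, after yielding the recognition_init_message:
yield RecognitionRequest(
    control_message=ControlMessage(start_timers_message=StartTimersControlMessage()))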

RecognitionInitMessage

RecognitionInitMessage example

def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(. . .),
            resources = [ RecognitionResource(. . .) ],
            client_data = {'company':'Aardvark','user':'Leslie'} 
        )
        yield RecognitionRequest(recognition_init_message=init)
    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }
private void recognize(RecognitionParameters params, String audioFile) {
    RecognitionInitMessage init = RecognitionInitMessage.newBuilder()
        .setParameters(params)
        .build();
    RecognitionRequest request = RecognitionRequest.newBuilder()
        .setRecognitionInitMessage(init)
        .build();

Input message that initiates a new recognition turn. Included in RecognitionRequest.

Field Type Description
parameters RecognitionParameters Language, audio format, and other recognition parameters.
resources RecognitionResource Repeated. Optional resources (DLMs, wordsets, builtins) to improve recognition.
client_data string,string Repeated. Optional client-supplied key,value pairs to inject into the call log.

This message includes:

RecognitionRequest
    recognition_init_message (RecognitionInitMessage)
        parameters (RecognitionParameters)
            language
            topic
            audio_format
            utterance_detection_mode
            result_type
            etc.
        resources (RecognitionResource)
            external_reference
                type
                uri
            inline_wordset
            builtin
            inline_grammar
            weight_enum/value
            reuse
        client_data

RecognitionParameters

RecognitionParameters example

def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(language='en-US', 
              audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
              result_type='FINAL', 
              utterance_detection_mode='MULTIPLE',
              recognition_flags = RecognitionFlags(auto_punctuate=True))
        )
    parameters := &pb.RecognitionParameters{
        Language: "en-US",
        Topic: "GEN",
        AudioFormat: &pb.AudioFormat{
            AudioFormatUnion: &pb.AudioFormat_Pcm{
                Pcm: &pb.PCM{
                    SampleRateHz: wavFormat.SampleRate,
                },
            },
        },
    }
private RecognitionParameters initializeRecognitionRequest(String paramsFile) {
    Params params = new Params(paramsFile);
    Parameters parameters = params.getParameters();
    xaas.sample.asr.java.client.Params.RecognitionFlags flags = parameters.getRecognitionFlags();
    log(String.format("Recognition Parameters: %s", params.toString()));

    AudioFormat audioFormat = AudioFormat.newBuilder()
        .setPcm(PCM.newBuilder().setSampleRateHz(16000).build())
        .build();

    return RecognitionParameters.newBuilder()
        .setAudioFormat(audioFormat)
        .setLanguage(parameters.getLanguage())
        .setTopic(parameters.getTopic())
        .setUtteranceDetectionMode(EnumUtteranceDetectionMode.forNumber(parameters.getUtteranceDetectionMode()))
        .setResultType(EnumResultType.forNumber(parameters.getResultType()))

Input message that defines parameters for the recognition process. Included in RecognitionInitMessage. The language and audio_format parameters are mandatory. All others are optional. See Defaults for a list of default values.

Field Type Description
language string Mandatory. Language and country (locale) code as xx-XX, e.g. 'en-US' for American English.
Codes in the form xxx-XXX, e.g. 'eng-USA' are also supported for backward compatibility.
topic string Specialized language model in data pack. Default is 'GEN' (generic).
audio_format AudioFormat Mandatory. Audio codec type and sample rate.
utterance_detection_mode EnumUtteranceDetectionMode How end of utterance is determined. Default SINGLE.
result_type EnumResultType The level of transcription results. Default FINAL.
recognition_flags RecognitionFlags Boolean recognition parameters.
no_input_timeout_ms uint32 Maximum silence, in ms, allowed while waiting for user input after recognition timers are started. Default (0) means server default, usually no timeout.
recognition_timeout_ms uint32 Maximum duration, in ms, of recognition turn. Default (0) means server default, usually no timeout.
utterance_end_silence_ms uint32 Minimum silence, in ms, that determines the end of an utterance. Default (0) means server default, usually 500ms or half a second.
speech_detection_sensitivity uint32 A balance between detecting speech and noise (breathing, etc.), 0 to 1000. 0 means ignore all noise, 1000 means interpret all noise as speech. Default is 500.
max_hypotheses uint32 Maximum number of n-best hypotheses to return. Default (0) means server default, usually 10 hypotheses.
speech_domain string Mapping to internal weight sets for language models in the data pack. Values depend on the data pack.
formatting Formatting Keyword for the formatting scheme applied to transcription results. See Formatting.

This message includes:

RecognitionRequest
    recognition_init_message (RecognitionInitMessage)
        parameters (RecognitionParameters)
            language
            topic
            audio_format
                pcm | alaw | ulaw | opus | ogg_opus
            utterance_detection_mode - SINGLE | MULTIPLE | DISABLED
            result_type - FINAL | PARTIAL | IMMUTABLE_PARTIAL
            recognition_flags
                auto_punctuate
                filter_profanity
                mask_load_failures
                etc.
            speech_detection_sensitivity
            max_hypotheses
            formatting
            etc.

AudioFormat

Mandatory input message containing the audio format of the audio to transcribe. Included in RecognitionParameters.

Field Type Description
pcm PCM Signed 16-bit little endian PCM, 8kHz or 16kHz.
alaw ALaw G.711 A-law, 8kHz.
ulaw Ulaw G.711 µ-law, 8kHz.
opus Opus RFC 6716 Opus, 8kHz or 16kHz.
ogg_opus OggOpus RFC 7845 Ogg-encapsulated Opus, 8kHz or 16kHz.

PCM

Input message defining PCM sample rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Audio sample rate: 0, 8000, 16000. Default 0, meaning 8000.
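
For example, a client transcribing a 16kHz WAV file can describe its audio like this (a minimal Python sketch; it assumes the generated stub classes are imported and that wf is an open wave file object, as in the earlier examples):

# Signed 16-bit little endian PCM at the WAV file's sample rate (8000 or 16000)
audio_format = AudioFormat(
    pcm = PCM(sample_rate_hz = wf.getframerate())
)

parameters = RecognitionParameters(
    language = 'en-US',
    audio_format = audio_format
)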

Alaw

Input message defining A-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Ulaw

Input message defining µ-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Opus

Input message defining Opus packet stream decoding parameters. Included in AudioFormat. See Opus audio format for encoding recommendations.

Field Type Description
output_rate_hz uint32 Decoder output rate: 0, 8000, 16000. Default 0, meaning 8000.
preskip_samples uint32 Number of decoder output samples, at 48 kHz, to skip.
source_rate_hz uint32 Optional. Input source sample rate.
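
For example, a client sending Opus packets captured at 16kHz could describe them like this (a Python sketch, assuming the stub classes are imported as in the earlier examples):

# Ask the decoder to output 16kHz audio; source_rate_hz is optional metadata
audio_format = AudioFormat(
    opus = Opus(
        output_rate_hz = 16000,
        source_rate_hz = 16000
    )
)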

OggOpus

Input message defining Ogg-encapsulated Opus audio stream parameters. Included in AudioFormat.

Field Type Description
output_rate_hz uint32 Decoder output rate: 0, 8000, 16000. Default 0, meaning 8000.

EnumUtteranceDetectionMode

Input field specifying how utterances should be detected and transcribed within the audio stream. Included in RecognitionParameters. The default is SINGLE. When the detection mode is DISABLED, the recognition ends only when the client stops sending audio.

Name Number Description
SINGLE 0 Return recognition results for one utterance only, ignoring any trailing audio. Default.
MULTIPLE 1 Return results for all utterances detected in the audio stream.
DISABLED 2 Return recognition results for all audio provided by the client, without separating it into utterances.

EnumResultType

Input and output field specifying how transcription results for each utterance are returned. See Result type for examples. In a request RecognitionParameters, it specifies the desired result type. In a response Result, it indicates the actual result type that was returned.

Name Number Description
FINAL 0 Only the final transcription result of each utterance is returned.
PARTIAL 1 Variable partial results are returned, followed by a final result.
IMMUTABLE_PARTIAL 2 Stabilized partial results are returned, followed by a final result.
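
For example, to receive stabilized partial results ahead of each final result, set the result type in the request (a Python sketch, assuming the stub classes are imported as in the earlier examples):

init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        audio_format = AudioFormat(pcm = PCM(sample_rate_hz = 16000)),
        result_type = 'IMMUTABLE_PARTIAL',
        utterance_detection_mode = 'MULTIPLE'
    )
)

# In the response loop, message.result.result_type reports the type actually
# returned: a series of IMMUTABLE_PARTIAL results ends with a FINAL result.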

RecognitionFlags

Input message containing boolean recognition parameters. Included in RecognitionParameters. The default is false in all cases.

Field Type Description
auto_punctuate bool Whether to enable auto punctuation, if available for the language.
filter_profanity bool Whether to mask known profanities as *** in transcription, if available for the language.
include_tokenization bool Whether to include tokenized recognition result.
stall_timers bool Whether to disable recognition timers. By default, timers start when recognition begins.
discard_speaker_adaptation bool If speaker profiles are used, whether to discard updated speaker data. By default, data is stored.
suppress_call_recording bool Whether to disable recording to Call Log Aggregator (CLAG). By default, call logs, metadata, and audio are collected by CLAG. Call recording may also be disabled at the server level, in which case this parameter has no effect.
mask_load_failures bool When true, errors loading external resources are not reflected in the Status message and do not terminate recognition. They are still reflected in logs.
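
For example, to enable auto punctuation and profanity filtering (a Python sketch; both flags depend on availability in the language):

parameters = RecognitionParameters(
    language = 'en-US',
    audio_format = AudioFormat(pcm = PCM(sample_rate_hz = 16000)),
    recognition_flags = RecognitionFlags(
        auto_punctuate = True,      # punctuate the transcript
        filter_profanity = True     # mask known profanities as ***
    )
)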

Formatting

Input message specifying how the transcription results are presented, using a keyword for a formatting type supported by the data pack. Included in RecognitionParameters. See Formatted text.

Field Type Description
scheme string Keyword for a formatting type defined in the data pack. Default is data pack dependent.
options string,bool Repeated. Optional key,value pairs.
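
A sketch of selecting a formatting scheme follows. The scheme keyword and option key shown here are placeholders only: the values available depend on the data pack.

parameters = RecognitionParameters(
    language = 'en-US',
    audio_format = AudioFormat(pcm = PCM(sample_rate_hz = 16000)),
    formatting = Formatting(
        scheme = 'date',                    # placeholder keyword
        options = {'some_option': True}     # placeholder option key
    )
)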

ControlMessage

Input message that starts the recognition no-input timer. Included in RecognitionRequest. This message has an effect only if timers were disabled (stall_timers) in the recognition request.

Field Type Description
start_timers_message StartTimersControlMessage Starts the recognition no-input timer.

StartTimersControlMessage

Input message the client sends when starting the no-input timer. Included in ControlMessage.
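
For example, a client that set stall_timers in the recognition flags can start the no-input timer later by sending a control message on the request stream (a Python sketch; the control_message field name on RecognitionRequest is assumed from the proto file):

# Sent from the generator that yields RecognitionRequest messages,
# after the RecognitionInitMessage, for example once a prompt finishes playing
yield RecognitionRequest(
    control_message = ControlMessage(
        start_timers_message = StartTimersControlMessage()
    )
)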

RecognitionResource

RecognitionResource example


# Define a DLM that exists in your Mix project
names_places_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM',
      uri='urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr',reuse='HIGH_REUSE'), 
    weight_value=700)

# Define an inline wordset for an entity in that DLM 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}')

def client_stream(wf):
    try:
        # Start the recognition
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(. . .),
            resources = [ names_places_dlm, places_wordset ]
        )
    // Declare a DLM that exists in your Mix project 
    names_places_dlm := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_ExternalReference{
            ExternalReference: &pb.ResourceReference{
                Type: pb.EnumResourceType_DOMAIN_LM,
                Uri: "urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr",
                Reuse: pb.EnumResourceReuse_HIGH_REUSE,
            },
        },
        WeightUnion: &pb.RecognitionResource_WeightValue{
            WeightValue: 700,
        },
    }

    // Declare an inline wordset in that DLM 
    places_wordset := &pb.RecognitionResource{
        ResourceUnion: &pb.RecognitionResource_InlineWordset{
            InlineWordset: `{"PLACES":[{"literal":"Fordoun","spoken":["forden"]}, {"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]}, {"literal":"Auchenblae"}]}`,
        },
    }

    request := &pb.RecognitionRequest{
        RequestUnion: &pb.RecognitionRequest_RecognitionInitMessage{
            RecognitionInitMessage: &pb.RecognitionInitMessage{
                Parameters: parameters,
                Resources: []*pb.RecognitionResource{ names_places_dlm, places_wordset },
            },
        },
    }

Input message defining one or more recognition resources (domain LMs, wordsets, and builtins) to improve recognition. Included in RecognitionInitMessage. Domain LMs must be external references but wordsets must be provided inline.

Field Type Description
external_reference ResourceReference The resource is an external file. Mandatory for DLMs.
inline_wordset string Inline wordset JSON resource. See Wordsets for the format. Default blank, meaning no inline wordset.
builtin string Name of a builtin resource in the data pack. Default blank, meaning no builtins.
inline_grammar string Inline grammar, SRGS XML format. Default blank, meaning no inline grammar. For Nuance internal use only.
weight_enum EnumWeight Keyword for weight of DLM relative to data pack. Default MEDIUM.
weight_value uint32 Weight of DLM relative to data pack as number from 0 to 1000. Default 0.
reuse EnumResourceReuse Whether the resource will be used multiple times. Default LOW_REUSE.

This message includes:

RecognitionRequest
    recognition_init_message (RecognitionInitMessage)
        parameters (RecognitionParameters)
        resources (RecognitionResource)
            external_reference (ResourceReference)
                type - DOMAIN_LM | SPEAKER_PROFILE
                uri
                etc.
            inline_wordset
            builtin
            inline_grammar
            weight_enum - LOWEST | LOW | MEDIUM | HIGH | HIGHEST
            weight_value
            reuse - LOW_REUSE | HIGH_REUSE

ResourceReference

Input message for fetching an external DLM that exists in your Mix project, or for creating or updating a speaker profile. Included in RecognitionResource. See Domain LMs and Speaker profiles.

Field Type Description
type EnumResourceType Resource type. Default UNDEFINED_RESOURCE_TYPE.
uri string Location of the resource as a URN reference:
DLM: urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr
Speaker profile: urn:nuance:asr/speakerid/<speaker>
max_age uint32 Default 0 seconds. See RFC 7234 section 5.2.1.1
max_stale uint32 Default 0 seconds. See RFC 7234 section 5.2.1.2
min_fresh uint32 Default 0 seconds. See RFC 7234 section 5.2.1.3
cookies string Cookies to transmit to the server. Default blank.
request_timeout_ms uint32 Time to wait when downloading resources. Default (0) means server default, usually 10000ms or 10 seconds.
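
In addition to the DLM reference shown above, a speaker profile can be referenced the same way (a Python sketch; <speaker> is a placeholder for your own speaker identifier):

# Reference a speaker profile in the Krypton datastore
speaker_profile = RecognitionResource(
    external_reference = ResourceReference(
        type = 'SPEAKER_PROFILE',
        uri = 'urn:nuance:asr/speakerid/<speaker>'
    )
)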

EnumResourceType

Input field defining the content type of an external recognition resource: only domain LMs and speaker profiles are currently supported as external references. Included in ResourceReference. See Resources.

Name Number Description
UNDEFINED_RESOURCE_TYPE 0 Resource type is not specified. Client must always specify a type.
WORDSET 1 Resource is a plain-text JSON wordset. Not currently supported, although inline_wordset is supported.
COMPILED_WORDSET 2 Resource is a compiled wordset. Not currently supported.
DOMAIN_LM 3 Resource is a domain LM.
SPEAKER_PROFILE 4 Resource is a speaker profile in a Krypton datastore.
GRAMMAR 5 Resource is an SRGS XML file. Not currently supported.

EnumWeight

Input field setting the weight of the domain LM relative to the data pack, as a keyword. Included in RecognitionResource. Wordsets do not have a weight. See also weight_value to specify a numeric value.

Name Number Description
DEFAULT_WEIGHT 0 Same effect as MEDIUM
LOWEST 1 The DLM has less effect than the data pack.
LOW 2
MEDIUM 3 The DLM has roughly an equal effect compared to the data pack.
HIGH 4
HIGHEST 5 The DLM has more effect than the data pack.
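
For example, the names_places_dlm resource shown earlier could use a keyword weight instead of a numeric weight_value (a Python sketch):

names_places_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type = 'DOMAIN_LM',
        uri = 'urn:nuance:mix/eng-USA/<Mix_context_tag>/mix.asr',
        reuse = 'HIGH_REUSE'),
    weight_enum = 'HIGH'    # instead of weight_value = 700
)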

EnumResourceReuse

Input field specifying whether the domain LM or wordset will be used for one or many recognition turns. Included in RecognitionResource.

Name Number Description
UNDEFINED_REUSE 0 Not specified: currently defaults to LOW_REUSE.
LOW_REUSE 1 The resource will be used for only one recognition turn.
HIGH_REUSE 5 The resource will be used for a sequence of recognition turns.

RecognitionResponse

RecognitionResponse example

        try:
            # Iterate through the returned server -> client messages
            for message in stream_in:
                if message.HasField('status'):
                    if message.status.details:
                         print(f'{message.status.code} {message.status.message} - {message.status.details}')
                    else:
                         print(f'{message.status.code} {message.status.message}')
                elif message.HasField('result'):
                    restype = 'partial' if message.result.result_type else 'final'
                    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
// Recognize initiates the recognition request, streams the given WAV audio file to the server, and prints responses
func recognize(client pb.RecognizerClient, wavFormat *wav.WavFormat, wavReader *wav.Reader, token *string) {
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    ctx = metadata.AppendToOutgoingContext(ctx, "authorization", "Bearer " + *token)

    stream, err := client.Recognize(ctx)
    if err != nil {
        log.Fatalf("%v.RouteChat(_) = _, %v", client, err)
    }
    waitc := make(chan struct{})
    go func() {
        for {
            recognitionResponse, err := stream.Recv()
            if err == io.EOF {
                // read done.
                close(waitc)
                return
            }
            if err != nil {
                log.Fatalf("Failed to receive a response : %v", err)
            }
            log.Printf("response: %v\n", recognitionResponse);
        }
    }()
    public void onNext(RecognitionResponse response) {
        switch (response.getResponseUnionCase()) {
            case RESULT:
                Result result = response.getResult();
                System.out.println(String.format("Transcription [%s]: [conf: %d] %s", 
                result.getResultType(), 
                result.getHypotheses(0).getAverageConfidence(),
                result.getHypotheses(0).getFormattedText()));
                if (result.getResultTypeValue() == EnumResultType.FINAL_VALUE) {
                    done.countDown();
                }
                break;
            case START_OF_SPEECH:
                StartOfSpeech sos = response.getStartOfSpeech();
                System.out.println(String.format("Start of Speech detected: %dms", sos.getFirstAudioToStartOfSpeechMs()));
                break;
            case STATUS:
                com.nuance.grpc.asr.v1beta2.Status status = response.getStatus();
                System.out.println(String.format("Recognition Status: %d %s", status.getCode(), status.getMessage()));
                break;
            default:
                break;
        }
    }

Output stream of messages in response to a recognize request. Included in Recognizer - Recognize service.

Field Type Description
status Status Always the first message returned, indicating whether recognition was initiated successfully.
start_of_speech StartOfSpeech When speech was detected.
result Result The partial or final recognition result. A series of partial results may precede the final result.
cookies string,string Repeated. One uri,cookies entry per ResourceReference.uri where cookies were returned, for the first response only.

This message includes:

RecognitionResponse
    status (Status)
        code
        message
        details
    start_of_speech (StartOfSpeech)
        first_audio_to_start_of_speech_ms
    result (Result)
        result_type - FINAL | PARTIAL | IMMUTABLE_PARTIAL
        abs_start_ms
        abs_end_ms
        utterance_info (UtteranceInfo)
            duration_ms
            clipping_duration_ms
            dropped_speech_packets
            dropped_nonspeech_packets
            dsp (Dsp)
                digital signal processing results
        hypotheses (Hypothesis)
            confidence
            average_confidence
            rejected
            formatted_text
            minimally_formatted_text
            words (Words)
                text
                confidence
                start_ms
                end_ms
                silence_after_word_ms
                grammar_rule
            encrypted_tokenization
            grammar_id
    cookies

Status

Output message indicating the status of the transcription. See Status codes for details about the codes. The message and details are developer-facing error messages in English. User-facing messages should be localized by the client based on the status code. Included in RecognitionResponse.

Field Type Description
code uint32 HTTP-style return code: 100, 200, 4xx, or 5xx as appropriate.
message string Brief description of the status.
details string Longer description if available.

StartOfSpeech

Output message containing the start-of-speech message. Included in RecognitionResponse.

Field Type Description
first_audio_to_start_of_speech_ms uint32 Offset from start of audio stream to start of speech detected.

Result

See Result type and Formatted text for examples of transcription results in different formats.

Output message containing the transcription result, including the result type, the start and end times, metadata about the transcription, and one or more transcription hypotheses. Included in RecognitionResponse.

Field Type Description
result_type EnumResultType Whether final, partial, or immutable results are returned.
abs_start_ms uint32 Audio stream start time.
abs_end_ms uint32 Audio stream end time.
utterance_info UtteranceInfo Information about each sentence.
hypotheses Hypothesis Repeated. One or more transcription variations.

UtteranceInfo

Output message containing information about the recognized sentence in the transcription result. Included in Result.

Field Type Description
duration_ms uint32 Utterance duration in milliseconds.
clipping_duration_ms uint32 Milliseconds of clipping detected.
dropped_speech_packets uint32 Number of speech audio buffers discarded during processing.
dropped_nonspeech_packets uint32 Number of non-speech audio buffers discarded during processing.
dsp Dsp Digital signal processing results.

Dsp

Output message containing digital signal processing results. Included in UtteranceInfo.

Field Type Description
snr_estimate float The estimated speech-to-noise ratio.
level float Estimated speech signal level.
stereo bool Always false, meaning mono audio.
initial_silence_ms uint32 Milliseconds of silence observed before start of utterance.
initial_energy float Energy feature value of first speech frame.
final_energy float Energy feature value of last speech frame.
mean_energy float Average energy feature value of utterance.
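
For example, a client can log this metadata for each final result (a Python sketch, continuing the response loop shown earlier):

info = message.result.utterance_info
print(f'utterance: {info.duration_ms} ms, '
      f'initial silence: {info.dsp.initial_silence_ms} ms, '
      f'SNR estimate: {info.dsp.snr_estimate:.1f}')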

Hypothesis

Output message containing one or more proposed transcriptions of the audio stream. Included in Result. Each variation has its own confidence level along with the text in two levels of formatting. See Formatted text.

Field Type Description
confidence uint32 The confidence score for the entire transcription, 0-999.
average_confidence uint32 The confidence score for the hypothesis, 0-999: the average of all word confidence scores based on their duration.
rejected bool Whether the hypothesis was rejected.
formatted_text string Formatted text of the result, e.g. $500.
minimally_formatted_text string Slightly formatted text of the result, e.g. Five hundred dollars.
words Word Repeated. One or more recognized words in the result.
encrypted_tokenization string Nuance-internal representation of the recognition result. Not returned when result originates from a grammar.
grammar_id string Identifier of the matching grammar, as grammar_0, grammar_1, etc. representing the order the grammars were provided as resources. Returned when result originates from an SRGS grammar rather than generic dictation.

Word

Output message containing one or more recognized words in the hypothesis, including the text, confidence score, and timing information. Included in Hypothesis.

Field Type Description
text string The recognized word.
confidence uint32 The confidence score of the recognized word, 0-999.
start_ms uint32 Word start offset in the audio stream.
end_ms uint32 Word end offset in the audio stream.
silence_after_word_ms uint32 The amount of silence, in ms, detected after the word.
grammar_rule string The grammar rule that recognized the word text. Returned when result originates from an SRGS grammar rather than generic dictation.
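
For example, a client can walk the hypotheses of a final result and print each word with its confidence and timing (a Python sketch, continuing the response loop shown earlier):

for hyp in message.result.hypotheses:
    if hyp.rejected:
        continue
    print(f'[{hyp.average_confidence}] {hyp.formatted_text}')
    for word in hyp.words:
        print(f'  {word.text}: conf {word.confidence}, {word.start_ms}-{word.end_ms} ms')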

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str

Change log

2020-02-19

These changes were made to the ASRaaS software and documentation:

2020-01-22

These changes were made to the ASRaaS gRPC software and documentation since the last Beta release:

2019-12-18

These changes were made to the ASRaaS gRPC API and documentation since the last Beta release:

2019-11-15

Below are changes made to the ASRaaS gRPC API documentation since the initial Beta release: