Glossary of terms
Annotation is the process of marking up a sample phrase or sentence to identify the entities (previously called concepts) and the intents that it contains. Annotated samples are the main source of data used to train an NLU model.
Application programming interface. Specification of routines, data structures, object classes, and protocols used to communicate with a software system or a platform such as Nuance Mix.
The app ID (or application ID) is a unique application string. It is used to reference the resources created and managed in Nuance Mix.
A Mix application defines a set of credentials that you use to access Mix.asr, Mix.nlu, and Mix.dialog resources. Your Mix application is deployed from Mix.dashboard, where you can deploy your application to multiple runtime environments (for example, sandbox, QA, production).
Note: Currently, only the sandbox environment is supported. In future releases, multiple environments will be supported.
An application configuration associates an app ID with the Mix.asr, Mix.nlu, and Mix.dialog resources deployed in a runtime environment.
An application configuration is created by specifying:
- A context tag, which is a name identifying this application configuration
- The versions of the Mix.asr, Mix.nlu, and Mix.dialog resources to include in this application configuration
You use the application configuration at runtime by providing the app ID and context tag to load resources.
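The relationship described above can be pictured as a lookup keyed by app ID and context tag. The following is only an illustrative sketch of that idea, not how Mix stores or loads configurations; the class, field names, IDs, tags, and version strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical record tying an app ID and context tag to resource versions."""
    app_id: str
    context_tag: str
    # Maps a resource kind ("asr", "nlu", "dialog") to the version included.
    resource_versions: dict = field(default_factory=dict)


# Hypothetical registry: one entry per (app ID, context tag) pair.
CONFIGS = {
    ("MY_APP_ID", "coffee_v1"): AppConfig(
        "MY_APP_ID", "coffee_v1", {"asr": "1", "nlu": "3", "dialog": "2"}
    ),
}


def load_resources(app_id: str, context_tag: str) -> dict:
    """Return the resource versions for a configuration, or raise KeyError."""
    return CONFIGS[(app_id, context_tag)].resource_versions
```

Calling `load_resources("MY_APP_ID", "coffee_v1")` returns the version mapping for that configuration; an unknown pair raises `KeyError`.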
Automatic Speech Recognition. Conversion of spoken words to interpretable text. The Nuance ASR service is powered by Nuance's Krypton engine, which performs realtime large vocabulary continuous speech recognition.
A data pack may include one or more builtins, which are predefined recognition objects focused on common tasks (numbers, dates, and so on) or general information in a vertical domain such as financial services or healthcare.
Medium through which a message is transmitted to its intended audience, for example, to customers. Channels range from traditional media such as print and broadcast (TV and radio), through video and email, to digital channels such as SMS, live chat, and chatbots or virtual assistants.
Nuance Mix uses the OAuth 2.0 protocol for authentication. All client applications must provide an access token to be able to access the ASR, NLU, Dialog, and TTS runtime services. To obtain an access token, a client ID and a client secret must be provided.
The client secret is generated through Mix.dashboard and used to obtain an access token for authentication.
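As a rough illustration of exchanging a client ID and client secret for an access token, the sketch below builds (but does not send) a token request. It assumes the OAuth 2.0 client-credentials grant with HTTP Basic client authentication; the token URL, the scope values, and the sample credentials are hypothetical placeholders, so substitute the ones documented for your environment.

```python
import base64
import urllib.parse
import urllib.request

# Hypothetical placeholder; use the token endpoint documented for your environment.
TOKEN_URL = "https://auth.example.com/oauth2/token"


def build_token_request(client_id: str, client_secret: str, scope: str) -> urllib.request.Request:
    """Build, without sending, an OAuth 2.0 client-credentials token request."""
    # RFC 6749 (section 2.3.1) asks for the ID and secret to be form-encoded
    # before they are combined for HTTP Basic authentication.
    user = urllib.parse.quote_plus(client_id)
    password = urllib.parse.quote_plus(client_secret)
    basic = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")

    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": scope}
    ).encode("ascii")

    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# Hypothetical credentials and scopes, for illustration only.
request = build_token_request("appID:example", "my-client-secret", "asr nlu")
```

Passing `request` to `urllib.request.urlopen` would send it; per the OAuth 2.0 specification, a successful JSON response carries the token in its `access_token` field.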
Value assigned as a measure of the NLU engine's confidence that it can correctly identify the intent of a sentence. The higher the score, the more likely it is that the result matches what the user said.
A context tag is a string used to identify an application configuration.
A data pack is a set of data files that are used to configure the Krypton recognition engine and the Nuance Text Processing Engine (NTpE) for a particular language. The data pack consists of an acoustic model, language model, parameter files, and other configuration files.
Interaction between a user and a client application. A single unit of interaction or single transaction is often referred to as a dialog state.
Logical flow of the client application, including various dialog states, primary paths of informational exchanges, transaction outcomes, and decision logic. In Mix.dialog a dialog flow comprises nodes that perform operations such as prompting the user, collecting and evaluating a response, retrieving information from a backend system, or transferring the user to a live agent for assistance.
Method used to clarify when the recognized item has more than one possible meaning.
Domain language model
The Krypton engine uses domain language models to identify the words and phrases most likely spoken by users of your application. Domain language models are overlaid on the factory or base data pack Krypton uses to provide a vocabulary for the application. You generate these custom models from training data that is representative of your application.
Dual Tone Multi Frequency. Also known as touchtone. Two-tone signal representing the digits 0-9, *, and #. Each DTMF signal is composed of one tone from a high-frequency group of tones, and a second tone from a low-frequency group.
In Mix you have the option to create projects that support DTMF as user input, for example, in IVR systems and other phone-based interfaces.
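To make the two-tone structure concrete, here is a minimal sketch that maps each keypad key to its low-group and high-group frequencies and synthesizes the summed signal. The frequency values are the standard DTMF keypad assignments; the function names, the 8 kHz sample rate, and the 50 ms duration are illustrative choices and are not taken from Mix.

```python
import math

LOW_HZ = (697, 770, 852, 941)    # low-frequency group, one tone per keypad row
HIGH_HZ = (1209, 1336, 1477)     # high-frequency group, one tone per keypad column
KEYPAD = ("123", "456", "789", "*0#")


def dtmf_frequencies(key: str) -> tuple:
    """Return the (low, high) frequency pair in Hz for one keypad key."""
    if len(key) == 1:
        for row, keys in enumerate(KEYPAD):
            col = keys.find(key)
            if col != -1:
                return LOW_HZ[row], HIGH_HZ[col]
    raise ValueError(f"not a DTMF key: {key!r}")


def dtmf_samples(key: str, duration_s: float = 0.05, sample_rate: int = 8000) -> list:
    """Synthesize the key as the average of two sine waves, in the range [-1, 1]."""
    low, high = dtmf_frequencies(key)
    count = round(duration_s * sample_rate)
    return [
        0.5 * (math.sin(2 * math.pi * low * n / sample_rate)
               + math.sin(2 * math.pi * high * n / sample_rate))
        for n in range(count)
    ]
```

For example, `dtmf_frequencies("5")` returns `(770, 1336)`: the second row's low tone paired with the second column's high tone.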
Entities (previously referred to as concepts) identify details or categories of information relevant to your application. In Mix.nlu you define entities in an ontology, and then annotate your sample data by labeling the tokens with entities.
For example, if the intent is ORDER_DRINK, a relevant entity might be [DRINK_TYPE]. In the sample sentences "I'd like an iced vanilla latte" and "What's in the caramel macchiato," you might annotate the words "latte" and "macchiato" with the [DRINK_TYPE] entity.
While the intent is the overall meaning of a sentence, entities and values capture the meaning of individual words and phrases in that sentence. See also predefined entity.
Words and word sequences the recognizer can recognize, and the interpretations for those utterances.
In Mix.nlu, predefined entities are expressed as grammars; that is, in a set of rules defining all the ways of expressing items associated with a given entity, without having to enumerate them. For example, the predefined entity [CALENDARX] is a grammar for specifying dates and times. A list of such expressions ("July 5th", "3rd of June", "tomorrow", "a week from Wednesday", and so on) would be unwieldy, to say the least. Instead, a grammar provides a relatively compact way to accomplish the same thing.
gRPC is an open source RPC (remote procedure call) framework used to create and connect services. It uses HTTP/2 for transport and protocol buffers to define the structure of the application.
Nuance shorthand for the syntax for grammars defined in the XML format of the W3C Speech Recognition Grammar Specification. May also refer to the file extension for such grammars. The current specification for GRXML is available on the Web at the W3C.
The Hypertext Transfer Protocol (HTTP) is the underlying protocol used by the World Wide Web and defines how messages are formatted and transmitted, and the actions web servers and browsers should take in response to various commands. Hypertext is structured text that uses logical links (hyperlinks) between nodes containing text.
HTTPS (also called HTTP over TLS, HTTP over SSL, and HTTP Secure) is a protocol for secure communication over a computer network that is widely used on the Internet.
An intent defines and identifies an intended action. An utterance or query spoken by a user will express an intent, for example, to order a drink. As you develop an NLU model, you define intents based on what you expect your users to do in your application. You will then link intents to functions or methods in your client application logic. Intents are almost always associated with entities that serve to further specify particulars about the intended action.
Representation of the meaning of a sentence. May also refer to the recognition of an utterance from text rather than audio.
Interactive Voice Response. General-purpose system for developing and deploying telephony applications that perform automated operations and transactions for callers, primarily via voice and DTMF input.
Krypton is Nuance's enterprise-grade, realtime large vocabulary continuous speech recognition and transcription engine. The Krypton engine converts an audio stream of human speech into text by recognizing the speech and transcribing it into text. Krypton supports domain language models among other forms of specialization, allowing it to understand terms specific to a field of work/application.
Statistical or neural model for the syntax of language constructs. A recognizer uses language models to bias it appropriately towards more common phrases. These models increase overall recognition accuracy.
A literal is the range of tokens in a user's query that corresponds to a certain entity. The literal is the exact spoken text of this range.
For example, in the query "I'd like a large t-shirt," the literal corresponding to the entity [TSHIRT_SIZE] is "large". Other literals might be "small", "medium", "big", "very big", and "extra large". When you annotate samples, you select a range of text to tag with an entity.
Literals can be paired with values. For example, "small", "medium", and "large" can be paired with the values "S", "M", and "L", respectively. Multiple literals can have the same value, which makes it easy to map different ways a user might say an entity into a single common form. For example, "large", "big", "very big" might all be given the same value: "L".
In addition, if your NLU model has a list entity, it isn't necessary to define all of the literals for that entity. The NLU model will infer literals for the entity that are not in the list. Inferred literals will not have values returned, only the literal itself.
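The literal-to-value pairing can be sketched as a simple lookup. This is only an illustration of the concept, not the format Mix.nlu uses or returns: the dictionary and function names are hypothetical, and "huge" stands in for an inferred literal that is not in the list.

```python
from typing import Optional

# Several literals can share one value, giving a single common form.
TSHIRT_SIZE_VALUES = {
    "small": "S",
    "medium": "M",
    "large": "L",
    "big": "L",
    "very big": "L",
}


def resolve_literal(literal: str) -> dict:
    """Pair a literal with its value; unlisted literals get no value (None)."""
    value: Optional[str] = TSHIRT_SIZE_VALUES.get(literal.strip().lower())
    return {"literal": literal, "value": value}


listed = resolve_literal("very big")   # value is "L"
inferred = resolve_literal("huge")     # value is None: only the literal comes back
```

Application logic that consumes such results should therefore be prepared to handle a literal that arrives without a value.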
A modality specifies a format used to exchange information with the user, such as TTS, audio, text, and so on.
When you create a project, the following modalities are available:
- Rich text: Lets you specify text messages that can be displayed on any screen, such as SMS messages. It also provides the ability to include richer content in messages, such as HTML tags that can be used in a web chat.
- Audio Script: Lets you specify the transcription of an audio message. This can be used as a fallback when an audio file is not available.
- TTS: Lets you specify text that can be spoken using speech synthesis.
- Interactivity: Lets you add interactive elements to the message, such as buttons and clickable links.
natural language understanding
Speech recognition techniques that permit a user to answer a prompt with a full phrase or sentence, as in everyday conversation. Typically, natural speech is longer in duration and has a broad range of possible meanings. A grammar (or model) capable of natural language understanding must accept a wide variety of different phrases.
The Natural Language Engine (NLE) is Nuance's enterprise grade text-to-meaning engine or semantic engine. NLE provides ontology-based semantic processing. NLE takes as input the token sequence provided by the Nuance Text Processing Engine (NTpE) and from this input identifies the intent and/or meanings expressed in the human-machine turn. The outcome from NLE is typically used to drive the next machine-human turn.
A dialog flow comprises nodes that perform operations such as prompting the user, evaluating a response, retrieving information from a backend system, or transferring the user to a live agent for assistance.
Mix.dialog provides several types of nodes that each perform a specific kind of operation. For example, Start, Question & Answer, Message, Decision, and so on. For more information, see Dialog design elements.
The Nuance Text Processing Engine (NTpE) is Nuance's normalization and tokenization (or lexical analysis) engine. NTpE applies transformation rules and formats output for display or for further processing by semantic engines such as NLE.
Nuance Vocalizer for Cloud. The Nuance text-to-speech engine, which powers Mix's TTS service to synthesize speech from text.
An ontology is a formal specification of how words and language structures are related to meanings, typically within some specific context.
In Mix.nlu, the ontology for your model comprises intents, entities (previously referred to as concepts), information about relationships (for example, between intents and entities or between entities and other entities), and grammars associated with such information. The ontology is the central schema for organizing your model and its sample data. The intents, entities, and associations between them are all stored in an ontology.
A predefined entity is a generally useful entity that has been predefined in Mix.nlu to save you the trouble of defining the entity from scratch. Examples of predefined entities are: monetary amounts, Boolean values, calendar items (dates, times, or both), Cardinal and Ordinal numbers, and so on.
Process of identifying and interpreting spoken language. Recognition is performed by a recognizer such as Krypton, which in turn requires models that define the words and phrases that can be recognized.
The Mix resources available in an application are:
- DLMs for Mix.asr
- NLU models for Mix.nlu
- Conversational applications for Mix.dialog
A sample phrase or sentence that you add to your NLU model. After you annotate samples with intents and entities, the model is trained to learn the annotations. You can exclude samples that you have not yet finished annotating ("Exclude from model") so that they are not used for training the NLU model.
See confidence score.
In the Dialog as a Service API, a selector helps identify the channel and language to use for each interaction. For more information, see Selectors.
Semantic interpretation allows utterances to be interpreted into structured objects that can be understood by an application.
Complete, continual interaction between a speaker and a dialog application. In telephony environments, this is the duration of a call.
Unique identifier for each interaction between a speaker and a dialog application. Typically generated at the start of a session.
Transport Layer Security. Cryptographic protocol that provides endpoint authentication and communications confidentiality over networks such as the Internet. TLS and its predecessor, Secure Sockets Layer (SSL), encrypt the segments of network connections at the Transport Layer end-to-end.
A token is a unit in text made up of characters. Individual words are tokens, as are intent and entity labels.
Consider the following string: "I'd like an [HOT_COLD_TYPE]iced[/] [FLAVOR]vanilla[/] [DRINK_TYPE]latte[/]."
Words like "I'd," "like," and "an" are tokens, as are the entity labels [HOT_COLD_TYPE], [FLAVOR], and [DRINK_TYPE].
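As an illustration, the annotated string above can be separated into its plain text and its (entity label, literal) pairs with a small regular expression. This is a sketch that assumes flat, non-nested annotations written exactly in the bracket notation of the example; it is not the parser Mix.nlu uses, and the function name is hypothetical.

```python
import re

# Matches "[ENTITY_LABEL]literal text[/]" for flat (non-nested) annotations.
ANNOTATION = re.compile(r"\[([A-Z_]+)\](.*?)\[/\]")


def parse_annotated(sample: str) -> tuple:
    """Return (plain_text, [(entity_label, literal), ...]) for one sample."""
    pairs = ANNOTATION.findall(sample)
    # Replace each annotation with just its literal to recover the plain text.
    plain_text = ANNOTATION.sub(lambda match: match.group(2), sample)
    return plain_text, pairs


text, entities = parse_annotated(
    "I'd like an [HOT_COLD_TYPE]iced[/] [FLAVOR]vanilla[/] [DRINK_TYPE]latte[/]."
)
# text     -> "I'd like an iced vanilla latte."
# entities -> [("HOT_COLD_TYPE", "iced"), ("FLAVOR", "vanilla"), ("DRINK_TYPE", "latte")]
```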
Process in which a sequence of strings is broken up into individual words, keywords, phrases, symbols, and other elements called tokens. Tokens can be individual words or phrases.
Using user input to refine interpretation accuracy. In Nuance Mix, training is the process of building an NLU model based on the data that you have provided.
Developing a model is an iterative process that includes multiple training passes. Every time you add or remove sample sentences, annotate those samples to identify intents and entities, include or exclude certain samples, and so on, you change the training data. As a direct result, your model no longer reflects the most up-to-date data. When this happens, the model must be retrained behind the scenes so that you can test the changes, expose errors and inconsistencies, and so on.
TRaining Set XML. A specification defined, owned, and maintained by Nuance to enable developers to manage an entire model in a single file outside of Nuance Mix and import the model into a Mix.nlu project. You can also manage training data in separate TRSX files and import them individually.
Text-To-Speech. The process of synthesizing audible speech from typed text.
A system prompt followed by a user's response. In telephony environments, the response can be a hangup, silence, or even noise that triggers recognition; a recognition can exit after detecting speech, DTMF, a hangup, or after a timeout.
Most transactional applications, such as trading a stock or paying a credit card bill, require multiple turns. Simpler applications require one or two turns. The more turns in a dialog, the more complex it tends to be to design.
Uniform Resource Identifier. Generic term for all types of names and addresses that refer to objects on the Web.
Uniform Resource Locator. Global address of documents and other resources on the Web; a URL is a kind of URI.
Uniform Resource Name. A name that identifies a resource on the Internet. Unlike URLs, which use network addresses (domain, directory path, filename), URNs use regular words that are protocol- and location-independent.
Uniform Resource Names (URNs) are used in Mix to load a specific Mix resource, described in the application configuration. A URN helps the service determine how to parse the resources in a context tag.
Distinct chunk of caller speech, usually in response to a prompt, that is recognized using a specific active grammar. An utterance is referred to colloquially as an "utt."
Set of words that can be understood as a part of an application. For example, both "cents" and "dollars" are in the vocabulary of the currency builtin, but these terms can only be said in particular locations in the phrase.
Complete set of recorded audio files used by Nuance Vocalizer for Cloud to enable a TTS persona. Each voice pack provides a male or a female voice for a particular locale (for example, for Canadian French, American English, or Australian English).
WebSocket is a communications protocol that provides full-duplex communications channels over a single TCP connection.
Weight values set the relative importance when processing speech input between the base language model (data pack) and specialization objects such as domain language models, builtins, and wordsets. Weights apply to speech recognition only and have no impact on meaning extraction; therefore, recognition weights are relevant to the Krypton engine but not to NLU.
A wordset is a set of words that customizes the vocabulary used by an application at runtime. For example, an application might use wordsets to fetch user-specific information and add recognizable values to a grammar (such as the appropriate bank account information for a specific user). The Krypton recognition engine and NLE use wordsets for dynamic content injection.