
ASR as a Service gRPC API

Nuance ASR provides real-time speech recognition

Nuance ASR

Nuance ASR (Automatic Speech Recognition) as a Service is powered by Krypton, a speech-to-text engine that transcribes speech into text in real time.

Krypton works with Nuance data packs in many languages, and optionally uses domain language models and wordsets to customize recognition for specific environments.

The gRPC protocol provided by Krypton allows a client application to request transcription services in any of the programming languages supported by gRPC.

gRPC is an open-source RPC (remote procedure call) framework that uses HTTP/2 for transport and protocol buffers to define the API. Krypton supports Protocol Buffers version 3, also known as proto3.

Version: v1

This release supports three versions of the gRPC Krypton protocol: v1, v1beta1, and v1beta2.

You may continue to use v1beta1 or v1beta2 without any changes to your applications. When Krypton receives a request from your client application, it identifies the protocol version transparently.

You may use only one protocol version per application. You cannot combine v1beta1, v1beta2, and/or v1 syntax in one application.

Upgrading to v1

Generate client stubs from the new proto files

$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ recognizer.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ resource.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ result.proto

$ ls -1 *_pb2*.py
recognizer_pb2_grpc.py  
recognizer_pb2.py  
resource_pb2.py  
result_pb2.py 

In client app, change RecognizeXxx to RecognitionXxx (v1beta2)

RecognizeRequest       --> RecognitionRequest 
RecognizeResponse      --> RecognitionResponse  
RecognizeInitMessage   --> RecognitionInitMessage  
recognize_init_message --> recognition_init_message

Change URN format, move reuse parameter (v1beta2), change type of weight_value (v1)

RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance:mix/eng-USA/<context_tag>/mix.asr', 
        reuse='HIGH_REUSE'),
    weight_value=700)
-->
RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'), 
    reuse='HIGH_REUSE',
    weight_value=0.7)

Rename Dsp field (v1beta1)

initial_silence  --> initial_silence_ms

Use 4-letter language codes (optional) (v1beta2)

RecognitionParameters(language='eng-USA',
-->
RecognitionParameters(language='en-us', 

Change opus parameter if used (v1beta2)

audio_format=AudioFormat(opus=OggOpus(output_rate_hz=16000)), 
-->
audio_format=AudioFormat(ogg_opus=OggOpus(output_rate_hz=16000)), 

Rename fields and data types (v1)

output_rate_hz                        --> decode_rate_hz
snr_estimate                          --> snr_estimate_db
speech_detection_sensitivity (uint32) --> (float), default 0.5
stereo (bool)                         --> num_channels (uint32), default 1 (mono)
weight_value (uint32)                 --> (float), default 0.0
confidence (uint32)                   --> (float)
average_confidence (uint32)           --> (float)

Regroup internal fields (v1)

message ResourceReference {
  ...
  oneof optional_resource_reference_max_age {uint32 max_age = 3;}
  oneof optional_resource_reference_max_stale {uint32 max_stale = 4;}
  oneof optional_resource_reference_min_fresh {uint32 min_fresh = 5;}
  string cookies = 6;
  ...
} 
-->
message ResourceReference {
  ...
  map<string, string> headers = 8; 
  ...
}

To upgrade to the v1 protocol from v1beta1 or v1beta2, you need to regenerate your programming-language stub files from the new proto files, then make small adjustments to your client application.

Rerun proto files

First regenerate your client stubs from the new proto files, as described in gRPC setup.

  1. Download the v1 proto files here. We recommend you make a new directory for the v1 files.

  2. Use gRPC tools to generate the client stubs from the proto files.

  3. Notice the new client stub files.

Update for v1beta2

Adjust your client application for the changes made to the protocol in v1beta2.

Update for v1

Then apply the changes made in v1.

Prerequisites from Mix

Before developing your gRPC application, you need a Nuance Mix project. This project provides credentials to run your application against the Nuance-hosted Krypton ASR engine. It also lets you create one or more domain language models (DLMs) to improve recognition in your application.

  1. Create a Mix project and model: see Mix.nlu workflow to:

    • Create a Mix project.

    • Create, train, and build a model in the project. The model must include an intent, optionally entities, and a few annotated sentences.

      Since your model is for recognition only (not understanding), you can use any intent name, for example DUMMY, and add entities and sentences to that intent. Your entities (for example NAMES and PLACES) should contain words that are specific to your application environment. In your application, you can add more words to these categories using wordsets.

    • Create and deploy an application configuration for the project.

  2. Generate a "secret" and client ID of your Mix project: see Authorize your client application. Later you will use these credentials to request an access token to run your application.

  3. Learn the URL to call the Krypton ASR service: see Accessing a runtime service.

  4. Learn how to reference the DLMs in your application. You may only reference DLMs created in your Mix project. See Accessing a runtime service - URN.

gRPC setup

Download proto files

recognizer.proto
resource.proto
result.proto

Install gRPC for programming language

$ pip install --upgrade pip
$ pip install grpcio
$ pip install grpcio-tools

Generate client stubs from proto files

$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ --grpc_python_out=./ recognizer.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ resource.proto
$ python3 -m grpc_tools.protoc --proto_path=./ --python_out=./ result.proto

$ ls -1 *_pb2*.py
recognizer_pb2_grpc.py  
recognizer_pb2.py  
resource_pb2.py  
result_pb2.py

The basic steps for using the Krypton gRPC protocol are:

  1. Download the three gRPC proto files here. These files contain a generic version of the functions or classes for requesting transcription from a Krypton engine.

    • recognizer.proto
    • resource.proto
    • result.proto

  2. Install gRPC for your programming language, including C++, Java, Python, Go, Ruby, C#, Node.js, and others. See gRPC Documentation for a complete list and instructions on using gRPC with each one.

  3. Generate client stub files in your programming language from the proto files using gRPC protoc. Depending on your programming language, the stubs may consist of one file or multiple files per proto file.

    These stub files contain the methods and fields from the proto files as implemented in your programming language. You will consult the stubs in conjunction with the proto files.

    Some languages, such as Node.js, can use the proto files directly, meaning client stubs are not required. Consult the gRPC documentation for your programming language.

  4. Write your client app, referencing the functions or classes in the client stub files. See Client app development for details and a scenario, including domain language models (DLMs) and wordsets.

  5. Run your client app to request transcription, optionally passing DLMs and wordsets to improve recognition. See Sample Python app.

Client app development

The gRPC protocol for Krypton lets you create a client application for recognizing and transcribing speech. This section describes how to implement the basic functionality of Krypton in the context of a Python application. For the complete application, see Sample Python app.

The essential tasks are illustrated in the following high-level sequence flow:

Sequence flow

Step 1: Authorize and connect

Authorize and run Python client (run-python-client.sh)

#!/bin/bash

CLIENT_ID="appID%3ANMDPTRIAL_your_name_company_com_20201102T144327123022%3Ageo%3Aus%3AclientName%3Adefault"
SECRET="9L4l...8oda"
export MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" \
"https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr nlu tts dlg' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"

./my-python-client.py asr.api.nuance.com:443 $MY_TOKEN $1

Nuance Mix uses the OAuth 2.0 protocol for authorization. The client application must provide an access token to be able to access the ASR runtime service. The token expires after a short period of time, so it must be regenerated frequently.

Your client application uses the client ID and secret from the Mix Dashboard (see Prerequisites from Mix) to generate an access token from the Nuance authorization server.

The client ID starts with appID: followed by a unique identifier. If you are using the curl command, replace the colon with %3A so the value can be parsed correctly:

appID:NMDPTRIAL_your_name_company_com_2020...  
-->     
appID%3ANMDPTRIAL_your_name_company_com_2020...

The token may be generated in several ways, either as part of the client application or as a script file. This Python example uses a Linux script to generate a token and store it in an environment variable. The token is then passed to the application, where it is used to create a secure connection to the ASR service.

Step 2: Import functions

Import functions from stubs

from resource_pb2 import *
from result_pb2 import *
from recognizer_pb2 import *
from recognizer_pb2_grpc import *

The application imports all functions from the Krypton client stubs that you generated from the proto files in gRPC setup.

Do not edit these stub files.

Step 3: Set recognition parms

Set recognition parameters

async def stream_out(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(
                language = 'en-US',   
                audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),  
                result_type = 'IMMUTABLE_PARTIAL', 
                utterance_detection_mode='MULTIPLE', 
                recognition_flags = RecognitionFlags(auto_punctuate=True)),
            resources = [ travel_dlm, places_wordset ]
        )

The application sets a RecognitionInitMessage containing RecognitionParameters, the parameters that define the type of recognition you want. Consult your generated stubs for the precise parameter names. Key parameters include language, audio_format, result_type, utterance_detection_mode, and recognition_flags.

For details about all recognition parameters, see RecognitionParameters.

RecognitionInitMessage may also include resources such as domain language models and wordsets, which customize recognition for a specific environment or business. See Add DLMs and wordsets.

Step 4: Call client stub

Define and call client stub

try:
    hostaddr = sys.argv[1]
    access_token = sys.argv[2]
    audio_file = sys.argv[3]
    . . . 
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)
        stream_in = stub.Recognize(client_stream(wf))

The app must include the location of the Krypton instance, the access token, and where the audio is obtained. See Authorize and connect.

Using this information, the app calls a client stub function or class. In some languages, this stub is defined in the generated client files: in Python it is named RecognizerStub, in Go it is RecognizerClient, and in Java it is RecognizerStub.

Step 5: Request transcription

Request transcription and simulate audio stream

def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(
                language='en-US',
                audio_format=AudioFormat(
                    pcm=PCM(sample_rate_hz=wf.getframerate())),    
                result_type='FINAL', 
                utterance_detection_mode='MULTIPLE'),
            resources = [ travel_dlm, places_wordset ]
        )
        yield RecognitionRequest(recognition_init_message=init)

        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)
            sleep(packet_duration)

After setting recognition parameters, the app sends the RecognitionRequest stream, including recognition parameters and the audio to transcribe, to the channel and stub.

In this Python example, this is achieved with a two-part yield structure that first sends recognition parameters then sends the audio for transcription in chunks.

yield RecognitionRequest(recognition_init_message=init)
. . . 
yield RecognitionRequest(audio=chunk)

Normally your app will send streaming audio to Krypton for processing but, for simplicity, this application simulates streaming audio by breaking up an audio file into chunks and feeding it to Krypton a bit at a time.

Step 6: Process results

Receive results

        try:
            # Iterate through messages returned from the server
            for message in stream_in:
                if message.HasField('status'):
                    if message.status.details:
                        print(f'{message.status.code} {message.status.message} - {message.status.details}')
                    else:
                        print(f'{message.status.code} {message.status.message}')
                elif message.HasField('result'):
                    restype = 'partial' if message.result.result_type else 'final'
                    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
        except Exception as e:
            print(f'server stream: {type(e)}')
            traceback.print_exc()

Finally the app returns the results received from the Krypton engine. This app prints the resulting transcription on screen as it is streamed from Krypton, sentence by sentence, with intermediate partial sentence results when the app has requested PARTIAL or IMMUTABLE_PARTIAL results.

The results may be long or short depending on the length of your audio, the result type, and the Result fields requested.

Result type IMMUTABLE_PARTIAL

Results from audio file with result type IMMUTABLE_PARTIAL

stream ../audio/monday_morning_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining.
partial : I'm getting ready
partial : I'm getting ready to
partial : I'm getting ready to walk
partial : I'm getting ready to walk to the
partial : I'm getting ready to walk to the train commute
final : I'm getting ready to walk to the train commute into work.
partial : I'll catch
partial : I'll catch the
partial : I'll catch the 750
partial : I'll catch the 758 train from
final : I'll catch the 758 train from Cedar Park station.
partial : It will take
partial : It will take me an hour
partial : It will take me an hour to get
final : It will take me an hour to get into town.
stream complete
200 Success

This example shows the transcription results from my audio file, monday_morning_16.wav, a 16kHz wave file talking about my commute into work. The audio file says:

It's Monday morning and the sun is shining.
I'm getting ready to walk to the train and commute into work.
I'll catch the seven fifty-eight train from Cedar Park station.
It will take me an hour to get into town.

The result type in this example is IMMUTABLE_PARTIAL, meaning that partial results are delivered after a slight delay, to ensure that the recognized words do not change with the rest of the received speech. See Result type for the other choices.

Result type FINAL

Result type FINAL returns only the final version of each sentence

stream ../audio/weather16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: There is more snow coming to the Montreal area in the next few days
final: We're expecting 10 cm overnight and the winds are blowing hard
final: Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to traffic further to the east
stream complete
200 Success

This example transcribes the audio file weather16.wav, which talks about winter weather in Montreal. The file says:

There is more snow coming to the Montreal area in the next few days.
We're expecting ten centimeters overnight and the winds are blowing hard.
Our radar and satellite pictures show that we're on the western edge of the storm system as it continues to track further to the east.

The result type in this case is FINAL, meaning only the final transcription version is returned.

In both these examples, Krypton performs the transcription using only the data pack. For these simple sentences, the recognition is nearly perfect.

Step 7: Add DLMs and wordsets

Declare DLM and wordset

# Declare a DLM defined in your Mix project
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    reuse='HIGH_REUSE',
    weight_value=0.7)

# Define a wordset that extends an entity in the DLM
places_wordset = RecognitionResource(
    inline_wordset='{"PLACES":[{"literal":"La Jolla", "spoken":["la hoya","la jolla"]},
{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},
{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},
{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},
{"literal":"Fordoun","spoken":["forden","fordoun"]},{"literal":"Llangollen",
"spoken":["lan goth lin","lan gollen"]},{"literal":"Auchenblae"}]}',
    reuse='HIGH_REUSE')

# Add recognition parms and resources 
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        result_type='FINAL',
        utterance_detection_mode='MULTIPLE'),
    resources = [ travel_dlm, places_wordset ]
)

Once you have experimented with basic transcription, you can add resources such as domain language models and wordsets to improve recognition of specific terms and language in your environment. For example, you might add resources containing names and places in your business.

Include DLMs and wordsets as RecognitionResource objects in the resources field of RecognitionInitMessage, as shown at the right.

Before and after DLM and wordset

Before: Without a DLM or wordset, unusual place names are not recognized

stream ../audio/abington.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final : I'm going on a trip to Abington tickets in Cambridgeshire England.
final : I'm speaking to you from the town of cooking out in Northamptonshire.
final : We visited the village of steeple Morton on our way to highland common in Yorkshire.
final : We spent a week in the town of land Gosling in Wales. 
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

After: Recognition is perfect with a DLM and wordset

stream ../audio/abington.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final : I'm going on a trip to Abington Piggots in Cambridgeshire England.
final : I'm speaking to you from the town of Cogenhoe in Northamptonshire.
final : We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
final : We spent a week in the town of Llangollen in Wales.
final : Have you ever thought of moving to La Jolla in California.
stream complete
200 Success

The audio file in this example, abington.wav, is a recording containing a variety of place names, some common and some unusual. The recording says:

I'm going on a trip to Abington Piggots in Cambridgeshire, England.
I'm speaking to you from the town of Cogenhoe [cook-no] in Northamptonshire.
We visited the village of Steeple Morden on our way to Hoyland Common in Yorkshire.
We spent a week in the town of Llangollen [lan-goth-lin] in Wales.
Have you ever thought of moving to La Jolla [la-hoya] in California.

Without a DLM or wordset, the unusual place names are not recognized correctly.

But when all the place names are defined, either in the DLM or in a wordset such as the following, there is perfect recognition.

{
   "PLACES": [ 
      { "literal":"La Jolla",
        "spoken":[ "la hoya","la jolla" ] },
      { "literal":"Llanfairpwllgwyngyll",
        "spoken":[ "lan vire pool guin gill" ] },
      { "literal":"Abington Pigotts" },
      { "literal":"Steeple Morden" },
      { "literal":"Hoyland Common" },
      { "literal":"Cogenhoe",
        "spoken":[ "cook no" ] },
      { "literal":"Fordoun",
        "spoken":[ "forden","fordoun" ] },
      { "literal":"Llangollen",
        "spoken":[ "lan goth lin","lan gollen" ] },
      { "literal":"Auchenblae" }
   ]
}

Sample Python app

A shell script, run-python-client.sh, obtains an access token and runs the app

#!/bin/bash

CLIENT_ID="appID%3ANMDPTRIAL_your_name_company_com_20201102T144327123022%3Ageo%3Aus%3AclientName%3Adefault"
SECRET="9L4l...8oda"
export MY_TOKEN="`curl -s -u "$CLIENT_ID:$SECRET" \
"https://auth.crt.nuance.com/oauth2/token" \
-d 'grant_type=client_credentials' -d 'scope=asr nlu tts dlg' \
| python -c 'import sys, json; print(json.load(sys.stdin)["access_token"])'`"

./my-python-client.py asr.api.nuance.com:443 $MY_TOKEN ../audio/towns_16.wav

This basic Python app, my-python-client.py, transcribes an audio file

#!/usr/bin/env python3

import sys, wave, grpc, traceback
from concurrent.futures import CancelledError
from time import sleep
from resource_pb2 import *
from result_pb2 import *
from recognizer_pb2 import *
from recognizer_pb2_grpc import *

# Declare a DLM that exists in a Mix project
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    reuse='HIGH_REUSE', 
    weight_value=0.7)

# Declare an inline wordset for an entity in that DLM 
places_wordset = RecognitionResource(
    inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}',
    reuse='HIGH_REUSE')

# Send recognition request parameters and audio
def client_stream(wf):
    try:
        # Set recognition parameters
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(
                language='en-US', 
                audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
                result_type='FINAL', 
                utterance_detection_mode='MULTIPLE',
                recognition_flags = RecognitionFlags(auto_punctuate=True)),
            resources = [ travel_dlm, places_wordset ],
            client_data = {'company':'Aardvark','user':'Leslie'} 
        )
        yield RecognitionRequest(recognition_init_message=init)

        # Simulate a realtime audio stream using an audio file
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)
            sleep(packet_duration)
        print('stream complete')
    except CancelledError as e:
        print(f'client stream: RPC canceled')
    except Exception as e:
        print(f'client stream: {type(e)}')
        traceback.print_exc()

# Collect arguments from user
hostaddr = access_token = audio_file = None
try:
    hostaddr = sys.argv[1]
    access_token = sys.argv[2]
    audio_file = sys.argv[3]
except Exception as e:
    print(f'usage: {sys.argv[0]} <hostaddr> <token> <audio_file.wav>')
    exit(1)

# Check audio file attributes and open secure channel with token
with wave.open(audio_file, 'r') as wf:
    assert wf.getsampwidth() == 2, f'{audio_file} is not linear PCM'
    assert wf.getframerate() in [8000, 16000], f'{audio_file} sample rate must be 8000 or 16000'
    assert wf.getnchannels() == 1, f'{audio_file} is not a mono audio file'
    setattr(wf, 'name', audio_file)
    call_credentials = grpc.access_token_call_credentials(access_token)
    ssl_credentials = grpc.ssl_channel_credentials()
    channel_credentials = grpc.composite_channel_credentials(ssl_credentials, call_credentials)     
    with grpc.secure_channel(hostaddr, credentials=channel_credentials) as channel:
        stub = RecognizerStub(channel)
        stream_in = stub.Recognize(client_stream(wf))
        try:
            # Iterate through messages returned from server
            for message in stream_in:
                if message.HasField('status'):
                    if message.status.details:
                         print(f'{message.status.code} {message.status.message} - {message.status.details}')
                    else:
                         print(f'{message.status.code} {message.status.message}')
                elif message.HasField('result'):
                    restype = 'partial' if message.result.result_type else 'final'
                    print(f'{restype}: {message.result.hypotheses[0].formatted_text}')
        except grpc.RpcError:
            pass
        except Exception as e:
            print(f'server stream: {type(e)}')
            traceback.print_exc()

A simple Python 3.6 client application is shown at the right. To run it against the data pack alone, without the DLM and wordset, comment out the resources line in RecognitionInitMessage:

init = RecognitionInitMessage(
    parameters = RecognitionParameters(...),
#   resources = [ travel_dlm, places_wordset ], 
    client_data = {'company':'Aardvark','user':'Leslie'}
)

Running the Python app

Run this sample Python application from the shell script, which generates a token and runs the app. Pass it the name of an audio file.

$ ./run-python-client.sh ../audio/towns_16.wav
stream ../audio/towns_16.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: I'm going on a trip to Abington Pigotts in Cambridgeshire England
final: I'm speaking to you from the town of Cogenhoe in Northamptonshire
final: We stopped at the village of steeple Morden on our way to Hoyland Common in Yorkshire
final: We spent a week in the town of Llangollen in Wales
final: Have you ever thought of moving to La Jolla in California
stream complete
200 Success

The run-python-client.sh script generates a token that authorizes the application to call the Krypton service. It takes your credentials and stores the resulting token in an environment variable, MY_TOKEN.

You may instead incorporate the token-generation code within the application, reading the credentials from a configuration file.
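If you prefer that approach, a minimal sketch (not part of the official sample) using the Python requests package might look like the following. The token endpoint and scope are taken from run-python-client.sh; how you store and read the client ID and secret is up to you.

# Fetch an access token from the Nuance authorization server.
# Assumes: pip install requests. The client_id is the Mix client ID with
# ':' replaced by '%3A', exactly as in the shell script.
import requests

def get_access_token(client_id, secret):
    resp = requests.post(
        'https://auth.crt.nuance.com/oauth2/token',
        auth=(client_id, secret),
        data={'grant_type': 'client_credentials', 'scope': 'asr nlu tts dlg'})
    resp.raise_for_status()
    return resp.json()['access_token']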

Reference topics

This section provides more information about topics in the gRPC API.

Status messages and codes

Recognizer service

service Recognizer {
  rpc Recognize (stream RecognitionRequest) returns (stream RecognitionResponse);
}

Status response message

{
  status: {
    code: 100
    message: 'Continue'
    details: 'recognition started on audio/l16;rate=8000 stream'
  }
  cookies: {  ... }
}

The Recognizer service provides a single Recognize method that supports bidirectional streaming of requests and responses.

The client first provides a recognition request message with parameters indicating at minimum what language to use. Optionally, it can also include resources to customize the data packs used for recognition, and arbitrary client data to be injected into call recording for reference in offline tuning workflows.

In response to the recognition request message, Krypton returns a status message confirming the outcome of the request. Usually the message is Continue: recognition started on audio/l16;rate=8000 stream.

Status messages include HTTP-aligned status codes. A failure to begin recognizing is reflected in a 4xx or 5xx status as appropriate. (Cookies returned from resource fetches, if any, are returned in the first response only.)

When a 100 Continue status is received, the client may proceed to send one or more messages bearing binary audio samples in the format indicated in the recognize message (default: signed PCM/8000 Hz). The server responds with zero or more result messages reflecting the outcome of recognizing the incoming audio, until a terminating condition is reached, at which point the server sends a final status message indicating normal completion (200/204) or any errors encountered (4xx/5xx). Termination conditions include the client closing the audio stream and the expiry of a timer such as no_input_timeout_ms or recognition_timeout_ms.

If the client cancels the RPC, no further messages are received from the server. If the server encounters an error, it attempts to send a final error status and then cancels the RPC.

Status codes

Code Message Indicates
100 Continue Recognition parameters and resources were accepted and successfully configured. Client can proceed to send audio data.
Also returned in response to a start_timers_message, which starts the no-input timer manually.
200 Success Audio was processed, recognition completed, and returned a result with at least one hypothesis. Each hypothesis includes a confidence score, the text of the result, and (for the final result only) whether the hypothesis was accepted or rejected.
200 Success is returned for both accepted and rejected results. A rejected result means that one or more hypotheses are returned, all with rejected = True.
204 No result Recognition completed without producing a result. This may occur if the client closes the RPC stream before sending any audio.
400 Bad request A malformed or unsupported client request was rejected.
403 Forbidden A request specified a topic that the client is not authorized to use.
404 No speech No utterance was detected in the audio stream for a number of samples corresponding to no_input_timeout_ms. This may occur if the audio does not contain anything resembling speech.
408 Audio timeout Excessive stall in sending audio data.
409 Conflict The recognizer is currently in use by another client.
410 Not recognizing A start_timers_message was received (to start the no-input timer manually) but no in-progress recognition exists.
413 Too much speech Recognition of utterance samples reached a duration corresponding to recognition_timeout_ms.
500 Internal server error A serious error occurred that prevented the request from completing normally.
502 Resource error One or more resources failed to load.
503 Service unavailable Unused; reserved for gateways.

Result type

Final results

final : It's Monday morning and the sun is shining

Partial results

partial : It's
partial : It's me
partial : It's month
partial : It's Monday
partial : It's Monday no
partial : It's Monday more
partial : It's Monday March
partial : It's Monday morning
partial : It's Monday morning and
partial : It's Monday morning and the
partial : It's Monday morning and this
partial : It's Monday morning and the sun
partial : It's Monday morning and the center
partial : It's Monday morning and the sun is
partial : It's Monday morning and the sonny's
partial : It's Monday morning and the sunshine
final : It's Monday morning and the sun is shining

Immutable partial results

partial : It's Monday
partial : It's Monday morning and the
final : It's Monday morning and the sun is shining

Krypton offers three different types of results for the transcription of each utterance in the audio stream. Specify the desired result with RecognitionParameters - EnumResultType. In the response, the actual type is indicated in Result - EnumResultType.
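For example, a minimal sketch of requesting immutable partial results and checking the type of each returned result; it reuses the message and field names from the samples in this guide and assumes the stubs are already imported:

# Request immutable partial results
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        result_type='IMMUTABLE_PARTIAL',
        utterance_detection_mode='MULTIPLE'))

# Later, while iterating the response stream, check the actual type:
# result_type is zero for final results, non-zero for partial results
for message in stream_in:
    if message.HasField('result'):
        restype = 'partial' if message.result.result_type else 'final'
        print(f'{restype}: {message.result.hypotheses[0].formatted_text}')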

Some data packs perform additional processing after the initial transcription. The transcription may change slightly during this second pass, even for immutable partial results. For example, Krypton originally recognized "the seven fifty eight train" as "the 750 A-Train" but adjusted it during a second pass, returning "the 758 train" in the final version of the sentence.

partial : I'll catch the 750
partial : I'll catch the 750 A-Train
final : I'll catch the 758 train from Cedar Park station

Formatted text

Formatted vs. minimally formatted text

Formatted text:           December 9, 2005
Minimally formatted text: December nine two thousand and five

Formatted text:           $500
Minimally formatted text: Five hundred dollars

Formatted text:           I'll catch the 758 train
Minimally formatted text: I'll catch the seven fifty eight train

Formatted text:           We’re expecting 10 cm overnight
Minimally formatted text: We’re expecting ten centimeters overnight

Formatted text:           I'm okay James, how about yourself?
Minimally formatted text: I'm okay James, how about yourself?

Krypton returns transcriptions in two formats: formatted text and minimally formatted text. See Result - Hypothesis.

Formatted text includes initial capitals for recognized names and places, numbers expressed as digits, currency symbols, and common abbreviations. In minimally formatted text, words are spelled out but basic capitalization and punctuation are included.

In many cases, both formats are identical.
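Both renderings are available on each hypothesis in the result. A minimal sketch of printing the two fields for the best hypothesis of each final result, assuming the snake_case field name minimally_formatted_text alongside the formatted_text field used in the sample app:

for message in stream_in:
    if message.HasField('result') and not message.result.result_type:  # final result
        best = message.result.hypotheses[0]
        print('formatted:           ', best.formatted_text)
        print('minimally formatted: ', best.minimally_formatted_text)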

Krypton uses the settings in the data pack to format the material in formattedText, for example displaying "ten centimeters" as "10 cm." For more precise control, you may specify a formatting scheme and/or option as a recognition parameter (RecognitionParameters - Formatting - scheme and options).

Formatting scheme

Formatting scheme

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US', 
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type = 'FINAL', 
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme='date',
            options = {'abbreviate_titles':True, 'abbreviate_units':False, 'censor_full_words':True}
        )
    )
)

The formatting scheme determines how ambiguous numbers are displayed in the formattedText field. Only one scheme may be specified, for example scheme='date'.

The available schemes depend on the data pack, but most data packs support date, time, phone, address, all_as_words, default, and num_as_digits.

Each scheme is a collection of many options (see Formatting options below), but the defining option is PatternBias, which sets the preferred pattern for numbers that cannot otherwise be interpreted. The values of PatternBias give their name to most of the schemes: date, time, phone, address, and default.

The PatternBias option cannot be modified, but you may adjust other options using formatting options.

date, time, phone, and address

Formatting schemes help Krypton interpret ambiguous numbers, e.g. "It's seven twenty six"

scheme='date'    -->  It's 7/26
scheme='time'    -->  It's 7:26
scheme='address' -->  It's 726
scheme='phone'   -->  It's 726 

The formatting schemes date, time, phone, and address tell Krypton to prefer one pattern for ambiguous numbers.

By default, Krypton can identify many numbers as dates, times, or phone numbers on its own. But it considers other numbers ambiguous: "It's seven twenty six," for example, could be a date, a time, or a plain number.

By setting the formatting scheme to date, time, phone, or address, you instruct Krypton to interpret these ambiguous numbers as the specified pattern. For example, if you know that the utterances coming into your application are likely to contain dates rather than times, set scheme to date.

all_as_words

Scheme all_as_words

"I'll catch the seven twenty six a m train"

With scheme='all_as_words'
-->  I'll catch the seven twenty six a.m. train

With the default or any other scheme
--> I'll catch the 7:26 AM train

The all_as_words scheme displays all numbers as words, even when a pattern (date, time, phone, or address) is found. For example, Krypton identifies "My address is seven twenty six brookline avenue cambridge mass oh two one three nine" as an address, but with all_as_words the street number and postal code are still written out as words.

default

This scheme is the default. It has the same effect as not specifying a scheme. If Krypton cannot determine the format of the number, it interprets it as a cardinal number.

num_as_digits

The num_as_digits scheme is the same as default, except in its treatment of numbers under 10.

Num_as_digits affects isolated cardinal and ordinal numbers, plural cardinals (ones, twos, nineteen fifties, etc.), some prices, and fractions. "Isolated" means a number that is not found within a greater pattern such as a date or time.

This scheme has no modifiable options.

all_as_katakana

Formatting scheme all_as_katakana

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP', 
        ...
        formatting = Formatting(
            scheme='all_as_katakana'
        )
    )
)

With and without all_as_katakana

Japanese form of "How many kilograms can I check in?"

With scheme='all_as_katakana'
--> アズケルニモツノオモサハナンキロマデデスカ

With the default or any other scheme
-->  預ける荷物の重さは何キロまでですか

Available for Japanese data packs only, the all_as_katakana scheme returns the transcription in Katakana, meaning the output is entirely in the phonetic Katakana script, without Kanji, Arabic numbers, or Latin characters.

When all_as_katakana is not specified, the output is a mix of scripts representing standard written Japanese.

This scheme has no modifiable options.

Formatting options

No formatting scheme or options: default scheme is in effect

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US', 
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type = 'FINAL', 
        utterance_detection_mode = 'MULTIPLE'
    )
)

Scheme only: all options in the date scheme are in effect

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US', 
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type = 'FINAL', 
        utterance_detection_mode = 'MULTIPLE',
        formatting = Formatting(
            scheme='date'
        )
    )
)

Options only: options in the default scheme are overridden by specific options

RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            options = {'abbreviate_titles':True,'abbreviate_units':False,'censor_full_words':True}
        )
    )
)

Both: options in the date scheme are overridden by specific options

RecognitionInitMessage(
    parameters = RecognitionParameters(
        ...
        formatting = Formatting(
            scheme='date',
            options={'abbreviate_titles':True,'abbreviate_units':False,'censor_profanities':True}
        )
    )
)

Formatting options are individual parameters for displaying words and numbers in the formattedText result field. All options are part of the current formatting scheme (default if not specified) but can be set on their own to override the current setting.

The available options depend on the data pack. See Formatting options by language.

All options are boolean. The values are set in the scheme to which they belong. (The num_as_digits and all_as_katakana schemes have no modifiable options.) The default value of each option is shown below, first for the default, date, time, phone, and address schemes and then for the all_as_words scheme.

       
PatternBias
The defining characteristic of the scheme. Not modifiable. Its value gives the scheme its name: default, date, time, phone, or address.

abbreviate_titles
Whether to abbreviate titles such as Captain (Capt), Director (Dir), Madame (Mme), Professor (Prof), etc. In American English, a period follows the abbreviation. The titles Mr, Mrs, and Dr are always abbreviated.
Default: False (all_as_words: False)

abbreviate_units
Whether to abbreviate units of measure such as centimeters (cm), meters (m), megabytes (MB), pounds (lbs), ounces (oz), miles per hour (mph), etc. When true, metric units are always abbreviated, but imperial one-word tokens are not abbreviated, so ten feet is 10 feet and twelve quarts is 12 quarts. The formatting of expressions with multiple units depends on the units involved: only common combinations are formatted.
Default: True (all_as_words: False)

Arabic_numerals_not_Kanji (Japanese)
How to display numbers.
False: All numbers are displayed in Kanji.
True: Numbers are either Arabic or half-formatted, depending on the half-formatted (million_as_numerals) setting.
By default, cardinals are half-formatted, meaning that magnitude words (thousands, millions, etc.) are in Kanji.
See Japanese options.
Default: True (all_as_words: False)

capitalize_2nd_person_pronouns (German)
Whether to capitalize second-person personal pronouns such as Du, Dich, etc.
Default: False (all_as_words: False)

capitalize_3rd_person_pronouns (German)
Whether to capitalize third-person personal pronouns such as Sie, Ihnen, etc.
Default: True (all_as_words: True)

censor_full_words
Whether to mask profanities completely with asterisks, for example "********" versus "frigging."
Default: False (all_as_words: False)

censor_profanities
Whether to mask profanities partially with asterisks, for example “fr*gging” versus “frigging.”
Default: False (all_as_words: False)

expand_contractions
In English data packs, whether to expand common contractions, for example "don't" versus "do not" or "it's nice" versus "it is nice."
Default: False (all_as_words: False)

format_addresses
Whether to format text identified as postal addresses. This does not include adding commas or new lines. Full street address formatting is done for most data packs, following the standards of the country's postal service.
Default: True (all_as_words: False)

format_currency_codes
When format_prices is true, whether to replace the currency symbol with its ISO currency code, for example 125USD instead of $125.
Default: False (all_as_words: False)

format_dates
Whether to format text identified as dates as, for example, 7/26/1994, 7/26/94, or 7/26. The order of month and day depends on the data pack.
Default: True (all_as_words: False)

format_non-USA_postcodes
For non-US data packs, whether to format UK and Canadian postcodes. UK postcodes have the form A9 9AA, A99 9AA, etc. Canadian postal codes have the form A9A 9A9.
Default: False (all_as_words: False)

format_phone_numbers
For US and Canadian data packs, whether to format numbers identified as phone numbers, as 123-456-7890 or 456-7899, optionally with 1 or +1 before the number.
Default: True (all_as_words: False)

format_prices
Whether to format numbers identified as prices, including currency symbols and price ranges. The currency symbol depends on the data pack language.
Default: True (all_as_words: False)

format_social_security_numbers
Whether to format numbers identified as US social security numbers or (for Canadian data packs) Canadian social insurance numbers. Both are a series of nine digits formatted as 123-45-6789 or 123 456 789.
Default: False (all_as_words: False)

format_times
Whether to format numbers identified as times (including both 12- and 24-hour times) as, for example, 10:35 with optional AM or PM.
Default: True (all_as_words: False)

format_URLs_and_email_addresses
Whether to format web and email addresses, including @ (for at) and most suffixes, including multiple suffixes, for example .ac.edu. Numbers are displayed as digits and output is in lowercase.
Default: True (all_as_words: False)

format_USA_phone_numbers (Mexican)
Whether to use US phone formatting instead of Mexican.
Default: False (all_as_words: False)

improper_fractions_as_numerals
Whether to express improper fractions as numbers, for example 5/4 versus five fourths.
Default: True (all_as_words: False)

million_as_numerals
Whether to half-format numbers ending in million, billion, trillion, and so on, for example 5 million.
See Japanese options.
Default: True (all_as_words: Inactive)

mixed_numbers_as_numerals
Whether to express numbers that are a combination of an integer and a fraction (three and a half) as numerals (3 1/2).
Default: True (all_as_words: False)

two_spaces_after_period
Whether to insert two spaces (instead of one) following a period (full stop), question mark, or exclamation mark.
Default: False (all_as_words: False)

Japanese options

Formatting options in Japanese data packs

format_addresses
censor_profanities
abbreviate_units
format_phone_numbers
Arabic_numerals_not_Kanji
format_times
format_dates
million_as_numerals
format_URLs_and_email_addresses
format_prices

Combining options: This displays all numbers in Kanji

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP', 
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':False}
        )
    )
)

This displays numbers in Kanji and Arabic. This is the default setting so may be omitted.

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP', 
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':True, 'million_as_numerals':True}
        )
    )
)

This displays all numbers in Arabic

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'ja-JP', 
        ...
        formatting = Formatting(
            options = {'Arabic_numerals_not_Kanji':True, 'million_as_numerals':False}
        )
    )
)

Japanese data packs support ten formatting options, shown at the right. (See also all_as_katakana for a Japanese-specific formatting scheme.) In these data packs, two options work together to specify how numbers are displayed.

You can control how numbers are displayed by combining Arabic_numerals_not_Kanji and million_as_numerals:

All Kanji (Arabic: False)      Half-formatted, the default        All Arabic (Arabic: True,
                               (Arabic: True, million: True)      million: False)
All numbers are displayed      Magnitude words are in Kanji       All numbers are displayed
in Kanji.                      and the rest in Arabic.            in Arabic.

三                             3                                  3
十一                           11                                 11
六十五                         65                                 65
八百三十七                     837                                837
千                             1,000                              1,000
千九百四十五                   1,945                              1,945
八千五百                       8,500                              8,500
一万                           1万                                10,000
一万五千                       1万5,000                           15,000
一億三千万                     1億3,000万                         130,000,000
二億五                         2億5                               200,000,005

Scheme vs. options

Scheme vs. options

Utterance: "My address is seven twenty six brookline avenue cambridge mass"

With any formatting scheme and 
formatting option 'format_addresses':true
--> My address is 726 Brookline Ave., Cambridge, MA

formatting option 'format_addresses':false
--> My address is 726 Brookline Avenue Cambridge Mass

Some formatting schemes have similar names to formatting options, for example the date, phone, time, and address schemes and the options format_dates, format_times, and so on. What's the difference?

These schemes tell Krypton how to interpret ambiguous numbers, while the options tell Krypton how to format text for display. For example, the address scheme tells Krypton to read an ambiguous number such as "seven twenty six" as a street number, while the format_addresses option controls whether text identified as an address is formatted for display, as shown at the right.

When you set formatting options, be aware of the default for the scheme to which it belongs. For example, format_prices is True for most schemes, so there is no need to set it explicitly if you want prices to be shown with currency symbols and characters.

Other schemes—all_as_words, num_as_digits, and all_as_katakana—set general instructions for displaying the transcription and are not related to interpretation of ambiguous numbers.

Formatting options by language

Each language supports a different set of formatting options, which you may modify to customize the way that Krypton formats its transcription. See Formatting options.

Arabic (ara-XWW)

censor_profanities
format_dates
format_times
format_URLs_and_email_addresses

Chinese (China, cmn-CHN)

abbreviate_units
censor_profanities
format_addresses
format_channel_numbers
format_dates
format_phone_numbers
format_times
million_as_numerals
no_math_symbols

Chinese (Taiwan, cmn-TWN)

As Chinese plus:

censor_full_words
format_prices

Croatian (hrv-HRV)

abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Czech (ces-CZE)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_social_security_numbers

Danish (dan-DNK)

abbreviate_units
censor_full_words
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Dutch (nld-NLD)

As Danish plus:

format_addresses

English (USA eng-USA)

abbreviate_titles
abbreviate_units
censor_full_words
censor_profanities
expand_contractions
format_addresses
format_currency_codes
format_dates
format_non-USA_postcodes
format_phone_numbers
format_prices
format_social_security_numbers
format_times
format_URLs_and_email_addresses
improper_fractions_as_numerals
million_as_numerals
mixed_numbers_as_numerals
two_spaces_after_period

English (Australia eng-AUS, Britain eng-GBR)

As English (US) excluding:

format_non-USA_postcodes
format_social_security_numbers

English (India eng-IND)

As English (US) excluding:

format_addresses
format_non-USA_postcodes

Finnish (fin-FIN)

abbreviate_units
censor_profanities
format_currency_codes
format_prices
format_times
format_URLs_and_email_addresses

French (France, fra-FRA), Italian (ita-ITA)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

French (Canada fra-CAN)

As French plus:

format_social_insurance_numbers

German (deu-DEU)

abbreviate_units
capitalize_2nd_person_pronouns
capitalize_3rd_person_pronouns
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Greek (ell-GRC)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Hebrew (heb-ISR)

abbreviate_units
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Hindi (hin-IND)

abbreviate_units
format_dates
format_prices
format_times

Hungarian (hun-HUN)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Indonesian (ind-IDN)

abbreviate_units
censor_profanities
format_dates
format_phone_numbers
format_prices
format_times

Japanese (jpn-JPN)

abbreviate_units
Arabic_numerals_not_Kanji
censor_full_words
censor_profanities
format_addresses
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Korean (kor-KOR)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses

Norwegian (nor-NOR), Polish (pol-POL)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses

Portuguese (Brazil por-BRA, Portugal por-PRT)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Romanian (ron-ROU)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Slovak (slk-SVK), Ukrainian (ukr-UKR)

abbreviate_units
censor_profanities
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses

Spanish (spa-ESP)

abbreviate_units
censor_profanities
format_addresses
format_currency_codes
format_dates
format_phone_numbers
format_prices
format_times
format_URLs_and_email_addresses
format_USA_phone_numbers
million_as_numerals

Spanish Latin America (spa-XLA), USA (spa-USA)

As Spanish plus:

format_USA_phone_numbers

Thai (tha-THA)

abbreviate_units
censor_profanities
format_dates
format_prices
format_times

Turkish (tur-TUR), Swedish (swe-SWE), Russian (rus-RUS)

abbreviate_units
censor_full_words
censor_profanities
format_addresses
format_currency_codes
format_dates
format_prices
format_times
format_URLs_and_email_addresses
million_as_numerals

Vietnamese (vie-VNM)

abbreviate_units
censor_full_words
censor_profanities
format_dates
format_prices
format_times

Opus audio format

Krypton supports the Opus audio format, either raw Opus (RFC 6716) or Ogg-encapsulated Opus (RFC 7845). The recommended encoder settings for Opus for speech recognition are:

Please note that Opus is a lossy codec, so you should not expect recognition results to be identical to those obtained with PCM audio.

See AudioFormat for the other supported formats: PCM, A-law and µ-law.
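A minimal sketch of requesting recognition of Ogg Opus audio: the ogg_opus field appears in the upgrade notes above, and decode_rate_hz is assumed to be the v1 name of the former output_rate_hz field (see Rename fields and data types).

init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format=AudioFormat(ogg_opus=OggOpus(decode_rate_hz=16000)),
        result_type='FINAL',
        utterance_detection_mode='MULTIPLE'))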

Timers

Default no-input timer

Krypton offers three timers for limiting user silence and recognition time: a no-input timer, a recognition timer, and an end-of-utterance timer. See Timeout and timer fields below.

No-input timer

No-input timeout on its own can cause problems

RecognitionRequest(
    recognition_init_message = RecognitionInitMessage(  
        parameters = RecognitionParameters(  
            no_input_timeout_ms = 3000
        )
    )
)                 
[*** Play prompt to user ***]
RecognitionRequest(audio)

Timer expires before user speaks

By default, the no-input timer starts when recognition starts, but has an infinite timeout, meaning Krypton simply waits for the user to speak and never times out.

If you set a no-input timeout, for example no_input_timeout_ms = 3000, the user must start speaking within 3 seconds. If a prompt plays as recognition starts, the recognition may time out before the user hears the prompt.

Add stall_timers and start_timers_message

RecognitionRequest(
    recognition_init_message = RecognitionInitMessage(  
        parameters = RecognitionParameters(
            no_input_timeout_ms = 3000,  
            recognition_flags = RecognitionFlags(stall_timers = True)
        )
    )
)         
[*** Play prompt to user ***]
RecognitionRequest(
    control_message = ControlMessage(
        start_timers_message = StartTimersControlMessage() 
    )
) 
RecognitionRequest(audio)  

Timer starts later

To avoid this problem, use stall_timers and start_timers_message to start the no-input timer only after the prompt finishes.

Timeout and timer fields

Field Description
RecognitionParameters
no_input_timeout_ms
(no-input timer)

Time to wait for user input. Default is 0, meaning infinite.

By default, the no-input timer starts with recognition_init_message but is only effective when no_input_timeout_ms has a value.

When stall_timers=True, you can start the timer manually with start_timers_message.

recognition_timeout_ms
(recognition timer)
Duration of recognition, in milliseconds. Default is 0, meaning infinite.

The recognition timer starts when speech input starts (after the no-input timer) but is only effective when recognition_timeout_ms has a value.

utterance_end_silence_ms
(utterance end timer)
Period of time that signals the end of an utterance. Default is 500 (ms, or half a second).

The utterance end timer starts automatically.

RecognitionFlags
stall_timers

Do not start the no-input timer. Default is False.

By default, the no-input timer starts with recognition_init_message. When stall_timers=True, this timer does not start at that time.

The other timers are not affected by stall_timers.

ControlMessage
start_timers_message

Starts the no-input timer if it was disabled by stall_timers. This message starts the no-input timer manually.
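As an illustration, here is a minimal sketch that sets all three timers in one request. The field names are those described above; the values are examples only, not recommendations.

init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        no_input_timeout_ms=3000,        # user must start speaking within 3 seconds
        recognition_timeout_ms=30000,    # stop recognition after 30 seconds of speech
        utterance_end_silence_ms=800,    # 800 ms of silence ends an utterance
        recognition_flags=RecognitionFlags(stall_timers=True)))  # start no-input timer later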

Resources

In the context of Krypton, resources are objects that facilitate or improve recognition of user speech. Resources include data packs, domain language models, wordsets, builtins, and speaker profiles.

At least one data pack is required. Other resources are optional.

Data packs

Data pack includes acoustic and language model

Data pack

Krypton works with one or more factory data packs, available in several languages and locales. The data pack includes these neural network-based components: a base acoustic model and a base language model.

The base acoustic model is trained to give good performance in many acoustic environments. The base language model is developed to remain current with popular vocabulary and language use. As such, Krypton paired with a data pack is ready for use out-of-the-box for many applications.

You may extend the data pack at runtime using several types of specialization resources: domain language models, wordsets, builtins, and speaker profiles.

Each recognition turn leverages a weighted mix of builtins, domain LMs, and wordsets. See Resource weights.

Builtins

Data pack builtins

# Define builtins
cal_builtin = RecognitionResource(
    builtin='CALENDARX',
    weight_value=0.2)

distance_builtin = RecognitionResource(
    builtin='DISTANCE',
    weight_value=0.2)

# Include builtins in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ travel_dlm, cal_builtin, distance_builtin ]
)

The data pack may include one or more builtins, which are predefined recognition objects focused on common tasks (numbers, dates, and so on) or general information in a vertical domain such as financial services or healthcare. The available builtins depend on the data pack. For American English data packs, for example, the builtins are:

ALPHANUM           DOUBLE            TEMPERATURE
AMOUNT             DURATION          TIME
BOOLEAN            DURATION_RANGE    VERT_FINANCIAL_SERVICES
CALENDARX          GENERIC_ORDER     VERT_HEALTHCARE
CARDINAL_NUMBER    GLOBAL            VERT_TELECOMMUNICATIONS
DATE               NUMBERS           VERT_TRAVEL
DIGITS             ORDINAL_NUMBER
DISTANCE           QUANTITY_REL

To use a builtin in Krypton, declare it with RecognitionResource - builtin and activate it in RecognitionInitMessage - resources.

Domain LMs

Two DLMs with entities

DLMs

Load a DLM

# Define DLM 
travel_dlm = RecognitionResource(external_reference =
    ResourceReference(type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    weight_value=0.7)

# Include DLM in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ travel_dlm ]
)

Each data pack supplied with Krypton provides a base language model that lets the transcription engine recognize the most common terms and constructs in the language and locale.

You may complement this language model with one or more domain-specific models, called domain language models (domain LMs or DLMs). Each DLM is based on sentences from a specific environment, or domain, and may include entities, or collections of terms used in that environment.

DLMs are created in Nuance Mix and accessed via a URN available from Mix. See Prerequisites from Mix and the code sample at the right for an example of a URN.

In Krypton, a DLM is a resource declared with RecognitionInitMessage - RecognitionResource. Krypton accepts up to ten DLMs, which are weighted along with other recognition objects. See Resource weights.

Wordsets

Wordsets extend entities in DLMs

Inline wordset, places_wordset, extends the PLACES entity

# Define DLM
travel_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    weight_value=0.7)

# Define a wordset that extends an entity in that DLM
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden","fordoun"]},{"literal":"Llangollen","spoken":["lan goth lin","lan gollen"]},{"literal":"Auchenblae"}]}')

# Include DLM and wordset in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        result_type='FINAL',
        utterance_detection_mode='MULTIPLE'),
    resources = [ travel_dlm, places_wordset ]
)

A wordset is a collection of words and short phrases that extends Krypton's recognition vocabulary by providing additional values for entities in a DLM. For example, a wordset might provide the names in a user’s contact list or local place names.

Wordsets are declared with RecognitionInitMessage - RecognitionResource - inline_wordset.

Defining wordsets

The wordset is defined in JSON format as one or more arrays. Each array is named after an entity defined within a DLM to which words can be added at runtime. Entities are templates that tell Krypton how and where words are used in a conversation.

For example, you might have an entity, PLACES, with place names used by the application, or NAMES, containing personal names. The wordset adds to the existing terms in the entity, but applies only to the current recognition session. The terms in the wordset are not added permanently to the entity.

All entities must be defined in DLMs, which are loaded along with the wordset.

The wordset includes additional values for one or more entities. The syntax is:

{
   "<entity-1>" : [
      { "literal": "<written form>",
        "spoken": ["<spoken form 1>", "<spoken form n>"]
      },
      { "literal": "<written form>",
        "spoken": ["<spoken form 1">, "<spoken form n>"]
      },
      ...
   ],
   "<entity-n>": [ ... ]
}

Syntax
<entity> String An entity defined in a domain LM, containing a set of values. The name is case-sensitive. Consult the DLM for entity names.
literal String The written form of the value that Krypton returns in the formatted_text field.
spoken Array (Optional) One or more spoken forms of the value. When not supplied, Krypton guesses the pronunciation of the word from the literal. Include a spoken form only if the literal is difficult to pronounce or has an unusual pronunciation in the language.

When a spoken form is supplied, it is the only source for recognition: the literal is not considered. If the literal pronunciation is also valid, you should include it as a spoken form.

For example, the city of Worcester, Massachusetts is pronounced wuster, but users reading it on a map may say it literally, as worcester. To allow Krypton to recognize both forms, specify:
{"literal":"Worcester","spoken":["wuster","worcester"]}

See Before and after DLM and wordset to see the difference that a wordset can make on recognition.

Wordsets: inline or read from file

Wordset defined inline

# Define DLM
travel_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    weight_value=0.7)

# Define the wordset inline 
places_wordset = RecognitionResource(inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya","la jolla"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden","fordoun"]},{"literal":"Llangollen","spoken":["lan goth lin","lan gollen"]},{"literal":"Auchenblae"}]}')

# Include the DLM and wordset in RecognitionInitMessage 
init = RecognitionInitMessage(
    parameters = RecognitionParameters(...),
    resources = [ travel_dlm, places_wordset ]
)

Wordset read from a local file using Python function

# Define DLM
travel_dlm = RecognitionResource(external_reference = 
    ResourceReference(type='DOMAIN_LM', 
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    weight_value=0.7)

# Read wordset from local file 
places_wordset_content = None
with open('places-wordset.json', 'r') as f:
    places_wordset_content = f.read()
places_wordset = RecognitionResource(inline_wordset=places_wordset_content)

# Include the DLM and wordset in RecognitionInitMessage 
init = RecognitionInitMessage(
    parameters = RecognitionParameters(...),
    resources = [ travel_dlm, places_wordset ]
)

This wordset extends the PLACES entity in the DLM with additional place names. Notice that a spoken form is provided only for terms that do not follow the standard pronunciation rules for the language.

{
   "PLACES": [ 
      { "literal":"La Jolla",
        "spoken":[ "la hoya","la jolla" ] },
      { "literal":"Llanfairpwllgwyngyll",
        "spoken":[ "lan vire pool guin gill" ] },
      { "literal":"Abington Pigotts" },
      { "literal":"Steeple Morden" },
      { "literal":"Hoyland Common" },
      { "literal":"Cogenhoe",
        "spoken":[ "cook no" ] },
      { "literal":"Fordoun",
        "spoken":[ "forden","fordoun" ] },
      { "literal":"Llangollen",
        "spoken":[ "lan goth lin","lan gollen" ] },
      { "literal":"Auchenblae" }
   ]
}

To use this wordset, specify it in RecognitionResource - inline_wordset, as shown in the examples at the right.
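
You may also construct the wordset JSON programmatically rather than hand-writing the string. This is a minimal sketch (not from the sample app) that builds a wordset from a Python dict with json.dumps; it assumes the travel_dlm resource and the generated message classes used in the examples above.

# Sketch only: build a wordset from a Python dict and serialize it with json.dumps
import json

places = {
    "PLACES": [
        {"literal": "La Jolla", "spoken": ["la hoya", "la jolla"]},
        {"literal": "Abington Pigotts"},
        {"literal": "Worcester", "spoken": ["wuster", "worcester"]}
    ]
}
places_wordset = RecognitionResource(inline_wordset=json.dumps(places))

# Include the DLM and wordset as usual
init = RecognitionInitMessage(
    parameters = RecognitionParameters(...),
    resources = [ travel_dlm, places_wordset ]
)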

Speaker profiles

Speaker profile

# Define speaker profile
speaker_profile = RecognitionResource(
    external_reference = ResourceReference(
        type='SPEAKER_PROFILE'))

# Include profile in RecognitionInitMessage
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000))),
    resources = [ travel_dlm, places_wordset, speaker_profile ],
    user_id = 'james.somebody@aardvark.com'
)

Optionally discard data after request

# Define speaker profile
speaker_profile = RecognitionResource(
    external_reference = ResourceReference(
        type = 'SPEAKER_PROFILE'))

# Include profile in RecognitionInitMessage, (optionally) discard after adaptation
init = RecognitionInitMessage(
    parameters = RecognitionParameters(
        language = 'en-US',
        audio_format = AudioFormat(pcm=PCM(sample_rate_hz=16000)),
        recognition_flags = RecognitionFlags(
            discard_speaker_adaptation=True)),
    resources = [ travel_dlm, places_wordset, speaker_profile ],
    user_id = 'james.somebody@aardvark.com'
)

Speaker adaptation is a technique that adapts and improves speech recognition based on qualities of the speaker and channel. The best results are achieved by updating the data pack's acoustic model in real time based on the immediate utterance.

Krypton maintains adaptation data for each caller as speaker profiles in an internal datastore.

To use speaker profiles in Krypton, specify them in RecognitionInitMessage - RecognitionResource - ResourceReference with type SPEAKER_PROFILE, and include a user_id in RecognitionInitMessage. The user_id must be a unique identifier for a speaker, for example:

user_id='socha.someone@aardvark.com'  
user_id='erij-lastname'   
user_id='device-1234'    
user_id='33ba3676-3423-438c-9581-bec1dc52548a'

The first time you send a request with a speaker profile, Krypton creates a profile based on the user id and stores the data in the profile. On subsequent requests with the same user id, Krypton adds the data to the profile, which adapts the acoustic model for that specific speaker, providing custom recognition.

Speaker profiles do not have a weight.

After the Krypton session, the adapted data is saved by default. If this information is not required after the session, set RecognitionParameters - RecognitionFlags - discard_speaker_adaptation=True.

Resource weights

Resources used in recognition

Resource weight

A wordset, two DLMs, and one builtin are declared in this example, leaving the base LM with a weight of 0.200

Weight example

In each recognition turn, Krypton uses a weighted mix of resources: the base LM plus any builtins, DLMs, and wordsets declared in the recognition request. You may set specific weights for DLMs and builtins. You cannot set a weight for wordsets.

The total weight of all resources is 1.0, made up of these components:

Component Weight
Base LM

By default, the base language model has a weight of 1.0 minus the combined weight of the other components in the recognition turn, with a minimum of 0.1 (10%). If the other resources together exceed 0.9, their weights are scaled down so the base LM keeps its minimum weight of 0.1.

When RecognitionFlags - allow_zero_base_lm_weight is true, other resources may use the entire weight, with the base LM reduced to zero. In this case, the words in the base LM are still recognized, but with lower probability than words in the DLMs and other resources.

Builtins

The default weight of each declared builtin is 0.25, or MEDIUM. You may set a weight for each builtin with RecognitionResource - weight_enum or weight_value.

Domain LMs

The default weight of each declared DLM is 0.25, or MEDIUM. You may set a weight for each DLM with RecognitionResource - weight_enum or weight_value.

Wordsets

The weight of each wordset is tied to the weight of its DLM. You cannot set a weight for wordsets.

Wordsets also have a small fixed weight (0.1) to ensure clarity within the wordset, so values such as John Smith and Jon Taylor are not confused as John Taylor and Jon Smith. This weight applies to all wordsets together.
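
As an illustration of how these components combine, the following sketch shows the arithmetic for a turn with one DLM, one builtin, and a wordset. The weights are hypothetical, not Krypton output.

# Hypothetical weights for one recognition turn (illustration only)
dlm_weight     = 0.5    # weight_value set on the DLM
builtin_weight = 0.25   # weight_value set on the builtin
wordset_weight = 0.1    # fixed weight shared by all wordsets

declared = dlm_weight + builtin_weight + wordset_weight   # 0.85
# The base LM receives the remainder, but at least 0.1
# unless allow_zero_base_lm_weight is set
base_lm_weight = max(0.1, 1.0 - declared)                 # 0.15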

This DLM is the principal resource weight

# Define DLM with 100% weight
names_places_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/names-places/mix.asr?=language=eng-USA'),
    reuse='HIGH_REUSE',
    weight_value=1.0)

# Set allow_zero_base_lm_weight to let DLM use all weight
RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type='PARTIAL', 
        utterance_detection_mode='MULTIPLE',
        recognition_flags = RecognitionFlags(
            allow_zero_base_lm_weight=True)
    )
)

If you wish to emphasize the DLM at the expense of the base LM, give it a weight of 1.0 and turn on the recognition flag, allow_zero_base_lm_weight. In this example, the base LM has little effect on recognition.

Defaults

The proto files provide the following default values for messages in the RecognitionRequest sent to Krypton. Mandatory fields are shown in bold.

   
Items in RecognitionRequest Default
recognition_init_message (RecognitionInitMessage)
  parameters (RecognitionParameters)
    language Mandatory, e.g. 'en-US'
    topic Default 'GEN'
    audio_format (AudioFormat) Mandatory, e.g. 'PCM'
    utterance_detection_mode (EnumUtteranceDetectionMode) SINGLE (0): transcribe one utterance only
    result_type (EnumResultType) FINAL (0): return only final version of each utterance
    recognition_flags (RecognitionFlags)
      auto_punctuate False: Do not punctuate results
      filter_profanity False: Leave profanity as is
      include_tokenization False: Do not include tokenized result
      stall_timers False: Start no-input timer
      discard_speaker_adaptation False: Keep speaker profile data
      suppress_call_recording False: Log calls
      mask_load_failures False: Loading errors end recognition
      allow_zero_base_lm_weight False: Base LM uses minimum 10% resource weight
    no_input_timeout_ms 0*, usually no timeout
    recognition_timeout_ms 0*, usually no timeout
    utterance_end_silence_ms 0*, usually 500 ms or half a second
    speech_detection_sensitivity 0.5
    max_hypotheses 0*, usually 10 hypotheses
    speech_domain Depends on data pack
    formatting (Formatting)
      scheme Depends on data pack
      options Blank
  resources (RecognitionResource)
    external_reference (ResourceReference)
      type (EnumResourceType) Mandatory with resources - external_reference
      uri Mandatory with resources - external_reference
      mask_load_failures False: Loading errors end recognition
      request_timeout_ms 0*, usually 10000 ms or 10 seconds
      headers Blank
    inline_wordset Blank
    builtin Blank
    inline_grammar Blank
    weight_enum (EnumWeight) 0, meaning MEDIUM
    weight_value 0
    reuse (EnumResourceReuse) LOW_REUSE: only one recognition
  client_data Blank
  user_id Blank
control_message (ControlMessage) Blank
audio Mandatory

* Items marked with an asterisk (*) default to 0, meaning a server default: the default is set in the configuration file used by the Krypton engine instance. The values shown here are the values set in the sample configuration files (default.yaml and development.yaml) provided with the Krypton engine. In the case of max_hypotheses, the default (10 hypotheses) is set internally within Krypton.

gRPC API

Krypton provides three protocol buffer (.proto) files to define Nuance's ASR service for gRPC. These files contain the building blocks of your transcription applications.

Once you have transformed the proto files into functions and classes in your programming language using gRPC tools, you can call these functions from your application to request transcription, to set recognition parameters, to load “helper” resources such as domain language models and wordsets, and to send the resulting transcription where required.

See Client app development and Sample Python app for scenarios and examples in Python. For other languages, consult the gRPC and Protocol Buffers documentation.

Proto file structure

Structure of proto files

Recognizer
    Recognize
        RecognitionRequest
        RecognitionResponse

RecognitionRequest
    recognition_init_message RecognitionInitMessage
        parameters RecognitionParameters
            language and other recognition parameter fields
            audio_format AudioFormat
            result_type EnumResultType
            recognition_flags RecognitionFlags
            formatting Formatting
        resources RecognitionResource
            external_reference ResourceReference
                type EnumResourceType
            inline_wordset
            builtin
            inline_grammar
            weight_enum EnumWeight | weight_value
        client_data
        user_id
    control_message ControlMessage
        start_timers_message StartTimersControlMessage
    audio

RecognitionResponse
    status Status
    start_of_speech StartOfSpeech
    result Result
        result fields
        result_type EnumResultType
        utterance_info UtteranceInfo
            utterance fields
            dsp Dsp
        hypotheses Hypothesis
            hypothesis fields
            words Word
                word fields
        data_pack DataPack
            data pack fields
    cookies

The proto files define an RPC service with a Recognize method that streams RecognitionRequest and RecognitionResponse messages. Each component is described by name in the sections below.

This shows the structure of the principal request fields:

Proto file: request

And this shows the main response fields:

Proto file: response

Recognizer

See individual sections for examples

Streaming recognition service API.

Name Request Type Response Type Description
Recognize RecognitionRequest stream RecognitionResponse stream Starts a recognition request and returns a response.
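
As an orientation, this is a minimal sketch of opening a secure channel and calling the streaming Recognize method from Python. The server address and access token are placeholders, the import path depends on how you generated the stubs from recognizer.proto, and client_stream is a request generator like the one shown under RecognitionRequest below. See Sample Python app for the complete client.

# Sketch only: open a secure channel and call the streaming Recognize method
import grpc
import wave
from recognizer_pb2_grpc import RecognizerStub   # module path depends on how you generated the stubs

wf = wave.open('audio.wav', 'rb')   # audio file to transcribe
token = '<access_token>'            # placeholder: obtain a token from your authorization service

# Attach the token to a TLS channel
credentials = grpc.composite_channel_credentials(
    grpc.ssl_channel_credentials(),
    grpc.access_token_call_credentials(token))

with grpc.secure_channel('<asr_server>:443', credentials=credentials) as channel:
    stub = RecognizerStub(channel)
    # client_stream(wf) is a generator of RecognitionRequest messages,
    # like the one shown under RecognitionRequest below
    for message in stub.Recognize(client_stream(wf)):
        print(message)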

RecognitionRequest

RecognitionRequest sends recognition_init_message, then audio to be transcribed

def client_stream(wf):
    try:
        # Start the recognition
        init = RecognitionInitMessage(. . .)
        yield RecognitionRequest(recognition_init_message=init)

        # Simulate a typical realtime audio stream
        print(f'stream {wf.name}')
        packet_duration = 0.020
        packet_samples = int(wf.getframerate() * packet_duration)
        for packet in iter(lambda: wf.readframes(packet_samples), b''):
            yield RecognitionRequest(audio=packet)

For a control_message example, see Timers.

Input stream messages that request recognition, sent one at a time in a specific order. The first mandatory field sends recognition parameters and resources; the final field sends audio to be recognized. Included in Recognizer - Recognize service.

Field Type Description
recognition_init_message RecognitionInitMessage Mandatory. First message in the RPC input stream, sends parameters and resources for recognition.
control_message ControlMessage Optional second message in the RPC input stream, for timer control.
audio bytes Mandatory. Subsequent message containing audio samples in the selected encoding for recognition.

This method includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
    resources (RecognitionResource)
    client_data
    user_id
  control_message (ControlMessage)
  audio

RecognitionInitMessage

RecognitionInitMessage example

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type='FINAL', 
        utterance_detection_mode='MULTIPLE',
        recognition_flags = RecognitionFlags(auto_punctuate=True)
    ),
    resources = [travel_dlm, places_wordset],
    client_data = {'company':'Aardvark','user':'James'},
    user_id = 'james.somebody@aardvark.com'
)

Minimal RecognitionInitMessage

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=22050))    
    )
)

Input message that initiates a new recognition turn. Included in RecognitionRequest.

Field Type Description
parameters RecognitionParameters Mandatory. Language, audio format, and other recognition parameters.
resources RecognitionResource Repeated. Resources (DLMs, wordsets, builtins) to improve recognition.
client_data map<string,string> Map of client-supplied key, value pairs to inject into the call log.
user_id string Identifies a specific user within the application.

This message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
      language
      topic
      audio_format
      utterance_detection_mode
      result_type
      etc.
    resources (RecognitionResource)
      external_reference
        type
        uri
      inline_wordset
      builtin
      inline_grammar
      weight_enum | weight_value
      reuse
    client_data
    user_id

RecognitionParameters

RecognitionParameters example

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type='PARTIAL', 
        utterance_detection_mode='MULTIPLE',
        recognition_flags = RecognitionFlags(auto_punctuate=True)
    )
)

For examples of the formatting parameter, see Formatting and Formatted text.

Input message that defines parameters for the recognition process. Included in RecognitionInitMessage.

The language and audio_format parameters are mandatory. All others are optional. See Defaults for a list of default values.

Field Type Description
language string Mandatory. Language and country (locale) code as xx-XX, e.g. 'en-US' for American English.
Codes in the form xxx-XXX, e.g. 'eng-USA' are also supported for backward compatibility.
topic string Specialized language model in data pack. Default is 'GEN' (generic).
audio_format AudioFormat Mandatory. Audio codec type and sample rate.
utterance_detection_mode EnumUtteranceDetectionMode How end of utterance is determined. Default SINGLE.
result_type EnumResultType The level of transcription results. Default FINAL.
recognition_flags RecognitionFlags Boolean recognition parameters.
no_input_timeout_ms uint32 Maximum silence, in ms, allowed while waiting for user input after recognition timers are started. Default (0) means server default, usually no timeout. See Timers.
recognition_timeout_ms uint32 Maximum duration, in ms, of recognition turn. Default (0) means server default, usually no timeout.
utterance_end_silence_ms uint32 Minimum silence, in ms, that determines the end of an utterance. Default (0) means server default, usually 500ms or half a second.
speech_detection_sensitivity float A balance between detecting speech and noise (breathing, etc.), 0 to 1.
0 means ignore all noise, 1 means interpret all noise as speech. Default is 0.5.
max_hypotheses uint32 Maximum number of n-best hypotheses to return. Default (0) means server default, usually 10 hypotheses.
speech_domain string Mapping to internal weight sets for language models in the data pack. Values depend on the data pack.
formatting Formatting Formatting keyword.

This message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
      language
      topic
      audio_format
        pcm|alaw|ulaw|opus|ogg_opus
      utterance_detection_mode - SINGLE|MULTIPLE|DISABLED
      result_type - FINAL|PARTIAL|IMMUTABLE_PARTIAL
      recognition_flags
        auto_punctuate
        filter_profanity
        mask_load_failures
        etc.
      speech_detection_sensitivity
      max_hypotheses
      formatting
      etc.

AudioFormat

PCM format, with alternatives shown in commented lines

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US',
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),
#        audio_format=AudioFormat(alaw=Alaw()),
#        audio_format=AudioFormat(ulaw=Ulaw()),
#        audio_format=AudioFormat(opus=Opus(source_rate_hz=16000)),
#        audio_format=AudioFormat(ogg_opus=OggOpus(output_rate_hz=16000)),
        result_type='FINAL',
        utterance_detection_mode='MULTIPLE'
    )
)

Mandatory input message containing the audio format of the audio to transcribe. Included in RecognitionParameters.

Field Type Description
pcm PCM Signed 16-bit little endian PCM, 8kHz or 16kHz.
alaw ALaw G.711 A-law, 8kHz.
ulaw Ulaw G.711 µ-law, 8kHz.
opus Opus RFC 6716 Opus, 8kHz or 16kHz.
ogg_opus OggOpus RFC 7845 Ogg-encapsulated Opus, 8kHz or 16kHz.

PCM

Input message defining PCM sample rate. Included in AudioFormat.

Field Type Description
sample_rate_hz uint32 Audio sample rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.

Alaw

Input message defining A-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Ulaw

Input message defining µ-law audio format. G.711 audio formats are set to 8kHz. Included in AudioFormat.

Opus

Input message defining Opus packet stream decoding parameters. Included in AudioFormat. See Opus audio format for encoding recommendations.

Field Type Description
decode_rate_hz uint32 Decoder output rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.
preskip_samples uint32 Number of decoder output samples (at 48 kHz) to skip.
source_rate_hz uint32 Input source sample rate in Hertz.

OggOpus

Input message defining Ogg-encapsulated Opus audio stream parameters. Included in AudioFormat.

Field Type Description
output_rate_hz uint32 Decoder output rate in Hertz: 0, 8000, 16000. Default 0, meaning 8000.

EnumUtteranceDetectionMode

Input field specifying how utterances should be detected and transcribed within the audio stream. Included in RecognitionParameters. The default is SINGLE. When the detection mode is DISABLED, the recognition ends only when the client stops sending audio.

Name Number Description
SINGLE 0 Return recognition results for one utterance only, ignoring any trailing audio. Default.
MULTIPLE 1 Return results for all utterances detected in the audio stream.
DISABLED 2 Return recognition results for all audio provided by the client, without separating it into utterances.
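
For example, to transcribe an entire audio stream as one sequence of results without utterance segmentation, you might set the mode to DISABLED. This is a sketch only; the other parameters follow the examples above.

# Sketch: disable utterance detection so all audio is transcribed as one stream
params = RecognitionParameters(
    language='en-US',
    audio_format=AudioFormat(pcm=PCM(sample_rate_hz=16000)),
    utterance_detection_mode='DISABLED',
    result_type='FINAL')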

EnumResultType

Input and output field specifying how transcription results for each utterance are returned. See Result type for examples. In a request RecognitionParameters, it specifies the desired result type. In a response Result, it indicates the actual result type that was returned.

Name Number Description
FINAL 0 Only the final transcription result of each utterance is returned. Default.
PARTIAL 1 Variable partial results are returned, followed by a final result.
IMMUTABLE_PARTIAL 2 Stabilized partial results are returned, followed by a final result.

RecognitionFlags

Recognition flags are set within recognition parameters

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type='PARTIAL', 
        utterance_detection_mode='MULTIPLE',
        recognition_flags = RecognitionFlags(
            auto_punctuate=True,
            filter_profanity=True,
            suppress_initial_capitalization=True,
            allow_zero_base_lm_weight=True
        )
    )
)

When suppress_initial_capitalization=True, sentences start with lowercase

stream ../audio/testtowns.wav
100 Continue - recognition started on audio/l16;rate=16000 stream
final: my father's family comes from the town Fordoun in Scotland near Aberdeen
final: another town nearby is called Auchenblae
final: when we were in Wales we visited the town of Llangollen

Input message containing boolean recognition parameters. Included in RecognitionParameters. The default is false in all cases.

Field Type Description
auto_punctuate bool Whether to enable auto punctuation, if available for the language.
filter_profanity bool Whether to mask known profanities as *** in transcription, if available for the language.
include_tokenization bool Whether to include tokenized recognition result.
stall_timers bool Whether to disable the no-input timer. By default, this timer starts when recognition begins. See Timers.
discard_speaker_adaptation bool If speaker profiles are used, whether to discard updated speaker data. By default, data is stored.
suppress_call_recording bool Whether to disable call logging. By default, call logs, metadata, and audio are collected. Call logging may be disabled by the Krypton service, in which case this parameter has no effect.
mask_load_failures bool When true, errors loading external resources are not reflected in the Status message and do not terminate recognition. They are still reflected in logs.
To set this flag for a specific resource, use RecognitionResource - ResourceReference - mask_load_failures.
suppress_initial_capitalization bool When true, the first word in a sentence is not automatically capitalized. This option does not affect words that are capitalized by definition, such as proper names, place names, etc. See example at right.
allow_zero_base_lm_weight bool When true, custom resources (DLMs, wordsets, etc.) can use the entire weight space, disabling the base LM contribution. By default, the base LM uses at least 10% of the weight space. See Resource weights. Even when true, words from the base LM are still recognized, but with lower probability.

Formatting

Formatting example

RecognitionInitMessage(
    parameters = RecognitionParameters(
        language='en-US', 
        audio_format=AudioFormat(pcm=PCM(sample_rate_hz=wf.getframerate())),    
        result_type='IMMUTABLE_PARTIAL', 
        utterance_detection_mode='MULTIPLE',
        formatting = Formatting(
            scheme='date',
            options = {'abbreviate_titles':True,'abbreviate_units':False,'censor_full_words':True})
    )
)   

Input message specifying how the transcription results are presented, using keywords for formatting types and options supported by the data pack. Included in RecognitionParameters. See Formatted text.

Field Type Description
scheme string Keyword for a formatting type defined in the data pack.
options map<string,bool> Map of key, value pairs of formatting options and values defined in the data pack.

ControlMessage

Input message that starts the recognition no-input timer. Included in RecognitionRequest. This message is only effective if the no-input timer was disabled with stall_timers in the recognition request. See Timers.

Field Type Description
start_timers_message StartTimersControlMessage Starts the recognition no-input timer.

StartTimersControlMessage

Input message the client sends when starting the no-input timer. Included in ControlMessage.

RecognitionResource

RecognitionResource example

# Define a DLM that exists in your Mix project
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA'),
    reuse='HIGH_REUSE', 
    weight_value=0.7)

# Define an inline wordset for an entity in that DLM
places_wordset = RecognitionResource(
    inline_wordset='{"PLACES":[{"literal":"La Jolla","spoken":["la hoya"]},{"literal":"Llanfairpwllgwyngyll","spoken":["lan vire pool guin gill"]},{"literal":"Abington Pigotts"},{"literal":"Steeple Morden"},{"literal":"Hoyland Common"},{"literal":"Cogenhoe","spoken":["cook no"]},{"literal":"Fordoun","spoken":["forden"]},{"literal":"Llangollen","spoken":["lan-goth-lin","lhan-goth-luhn"]},{"literal":"Auchenblae"}]}',
    reuse='HIGH_REUSE')

# Include DLM and wordset in RecognitionInitMessage
def client_stream(wf):
    try:
        init = RecognitionInitMessage(
            parameters = RecognitionParameters(. . .),
            resources = [travel_dlm, places_wordset]
        )

Input message defining one or more recognition resources (domain LMs, wordsets, and builtins) to improve recognition. Included in RecognitionInitMessage. Domain LMs must be external references but wordsets must be provided inline.

Field Type Description
external_reference ResourceReference The resource is an external file. Mandatory for DLMs and settings files.
inline_wordset string Inline wordset JSON resource. See Wordsets for the format. Default blank, meaning no inline wordset.
builtin string Name of a builtin resource in the data pack. Default blank, meaning no builtins.
inline_grammar string Inline grammar, SRGS XML format. Default blank, meaning no inline grammar. For Nuance internal use only.
weight_enum EnumWeight Keyword for weight of DLM or builtin. Default is MEDIUM, meaning 0.250. Use either weight_enum or weight_value (next). Wordsets and speaker profiles do not take a weight. See Resource weights.
weight_value float Weight of DLM or builtin as a numeric value from 0 to 1. Default 0.
reuse EnumResourceReuse Whether the resource will be used multiple times. Default LOW_REUSE.

This message includes:

RecognitionRequest
  recognition_init_message (RecognitionInitMessage)
    parameters (RecognitionParameters)
    resources (RecognitionResource)
      external_reference (ResourceReference)
        type - DOMAIN_LM|SPEAKER_PROFILE|SETTINGS
        uri
        etc.
      inline_wordset
      builtin
      inline_grammar
      weight_enum - LOWEST to HIGHEST | weight_value
      reuse - LOW_REUSE|HIGH_REUSE

ResourceReference

External reference examples

# Define a DLM (A77_C1 is my context tag from Mix) 
travel_dlm = RecognitionResource(
    external_reference = ResourceReference(
        type='DOMAIN_LM',
        uri='urn:nuance-mix:tag:model/A77_C1/mix.asr?=language=eng-USA'), 
    reuse='HIGH_REUSE',
    weight_value=0.7)

# Define a settings file
settings = RecognitionResource(
    external_reference = ResourceReference(
        type='SETTINGS',
        uri='urn:nuance-mix:tag:settings/A77_C1/asr'))

# Define a speaker profile (no URI)
speaker_profile = RecognitionResource(
    external_reference = ResourceReference(
        type='SPEAKER_PROFILE'))

Input message for fetching an external DLM or settings file that exists in your Mix project, or for creating or updating a speaker profile. Included in RecognitionResource. See Domain LMs and Speaker profiles.

Field Type Description
type EnumResourceType Resource type. Default UNDEFINED_RESOURCE_TYPE.
uri string Location of the resource as a URN reference:
DLM: urn:nuance-mix:tag:model/<context_tag>/mix.asr?=language=eng-USA
Settings: urn:nuance-mix:tag:setting/<context_tag>/asr
Speaker profile: Not used. The speaker is identified in RecognitionInitMessage - user_id.
mask_load_failures bool When true, errors loading the resource are not reflected in the Status message and do not terminate recognition. They are still reflected in logs.
To apply this flag to all resources, use RecognitionParameters - RecognitionFlags - mask_load_failures.
request_timeout_ms uint32 Time to wait when downloading resources. Default (0) means server default, usually 10000 ms or 10 seconds.
headers map<string,string> Map of HTTP cache-control directives, including max-age, max-stale, min-fresh, etc. For example, in Python:
headers = {'cache-control': 'max-age=604800, max-stale=3600'}

EnumResourceType

Input field defining the content type of an external recognition resource. Included in ResourceReference. See Resources.

Name Number Description
UNDEFINED_RESOURCE_TYPE 0 Resource type is not specified. Client must always specify a type.
WORDSET 1 Resource is a plain-text JSON wordset. Not currently supported, although inline_wordset is supported.
COMPILED_WORDSET 2 Resource is a compiled wordset. Not currently supported.
DOMAIN_LM 3 Resource is a domain LM.
SPEAKER_PROFILE 4 Resource is a speaker profile in a Krypton datastore.
GRAMMAR 5 Resource is an SRGS XML file. Not currently supported.
SETTINGS 6 Resource is ASR settings metadata, including the desired data pack version.

EnumWeight

Input field setting the weight of the domain LM or builtin relative to the data pack, as a keyword. Included in RecognitionResource. Wordsets and speaker profiles do not have a weight. See weight_value to specify a numeric value. See Resource weights.

Name Number Description
DEFAULT_WEIGHT 0 Same effect as MEDIUM.
LOWEST 1 The resource has minimal influence on the recognition process, equivalent to weight_value 0.05.
LOW 2 The resource has noticeable influence, equivalent to weight_value 0.1.
MEDIUM 3 The resource has a moderate influence, equivalent to weight_value 0.25.
HIGH 4 Words from the resource may be favored over words from the data pack, equivalent to weight_value 0.5.
HIGHEST 5 The resource has the greatest influence on the recognition, equivalent to weight_value 0.9.

EnumResourceReuse

Input field specifying whether the domain LM or wordset will be used for one or many recognition turns. Included in RecognitionResource.

Name Number Description
UNDEFINED_REUSE 0 Not specified: currently defaults to LOW_REUSE.
LOW_REUSE 1 The resource will be used for only one recognition turn.
HIGH_REUSE 5 The resource will be used for a sequence of recognition turns.

RecognitionResponse

RecognitionResponse example

# Iterate through the returned server -> client messages
try:
    for message in stream_in:
        if message.HasField('status'):
            if message.status.details:
                print(f'{message.status.code} {message.status.message} - {message.status.details}')
            else:
                print(f'{message.status.code} {message.status.message}')
        elif message.HasField('result'):
            restype = 'partial' if message.result.result_type else 'final'
            print(f'{restype}: {message.result.hypotheses[0].formatted_text}')

Output stream of messages in response to a recognize request. Included in Recognizer - Recognize service.

Field Type Description
status Status Always the first message returned, indicating whether recognition was initiated successfully.
start_of_speech StartOfSpeech When speech was detected.
result Result The partial or final recognition result. A series of partial results may precede the final result.
cookies map<string,string> Map of uri:cookies for each ResourceReference uri where cookies were returned, for the first response only.

This message includes:

RecognitionResponse
  status (Status)
    code
    message
    details
  start_of_speech (StartOfSpeech)
    first_audio_to_start_of_speech_ms
  result (Result)
    result_type - FINAL|PARTIAL|IMMUTABLE_PARTIAL
    abs_start_ms
    abs_end_ms
    utterance_info (UtteranceInfo)
      duration_ms
      clipping_duration_ms
      dropped_speech_packets
      dropped_nonspeech_packets
      dsp (Dsp)
        digital signal processing results
    hypotheses (Hypothesis)
      confidence
      average_confidence
      rejected
      formatted_text
      minimally_formatted_text
      words (Word)
        text
        confidence
        start_ms
        end_ms
        silence_after_word_ms
        grammar_rule
      encrypted_tokenization
      grammar_id
    data_pack (DataPack)
      language
      topic
      version
      id
  cookies

Status

Output message indicating the status of the transcription. See Status codes for details about the codes. The message and details are developer-facing error messages in English. User-facing messages should be localized by the client based on the status code. Included in RecognitionResponse.

Field Type Description
code uint32 HTTP-style return code: 100, 200, 4xx, or 5xx as appropriate.
message string Brief description of the status.
details string Longer description if available.

StartOfSpeech

Output message containing the start-of-speech message. Included in RecognitionResponse.

Field Type Description
first_audio_to_start_of_speech_ms uint32 Offset from start of audio stream to start of speech detected.

Result

See Result type and Formatted text for examples of transcription results in different formats

Output message containing the transcription result, including the result type, the start and end times, metadata about the transcription, and one or more transcription hypotheses. Included in RecognitionResponse.

Field Type Description
result_type EnumResultType Whether the result is final, partial, or immutable partial.
abs_start_ms uint32 Audio stream start time.
abs_end_ms uint32 Audio stream end time.
utterance_info UtteranceInfo Information about each sentence.
hypotheses Hypothesis Repeated. One or more transcription variations.
data_pack DataPack Data pack information.

UtteranceInfo

Output message containing information about the recognized sentence in the transcription result. Included in Result.

Field Type Description
duration_ms uint32 Utterance duration in milliseconds.
clipping_duration_ms uint32 Milliseconds of clipping detected.
dropped_speech_packets uint32 Number of speech audio buffers discarded during processing.
dropped_nonspeech_packets uint32 Number of non-speech audio buffers discarded during processing.
dsp Dsp Digital signal processing results.

Dsp

Output message containing digital signal processing results. Included in UtteranceInfo.

Field Type Description
snr_estimate_db float The estimated speech-to-noise ratio.
level float Estimated speech signal level.
num_channels uint32 Number of channels. Default is 1, meaning mono audio.
initial_silence_ms uint32 Milliseconds of silence observed before start of utterance.
initial_energy float Energy feature value of first speech frame.
final_energy float Energy feature value of last speech frame.
mean_energy float Average energy feature value of utterance.

Hypothesis

Output message containing one or more proposed transcriptions of the audio stream. Included in Result. Each variation has its own confidence level along with the text in two levels of formatting. See Formatted text.

Field Type Description
confidence float The confidence score for the entire transcription, 0 to 1.
average_confidence float The confidence score for the hypothesis, 0 to 1: the average of all word confidence scores based on their duration.
rejected bool Whether the hypothesis was rejected or accepted.
True: The hypothesis was rejected.
False: The hypothesis was accepted.
The recognizer determines rejection based on an internal algorithm. If the audio input cannot be assigned to a sequence of tokens with sufficiently high probability, it is rejected.
Recognition can be improved with domain LMs, wordsets, and builtins.
The rejected field is returned for final results only, not for partial results.
formatted_text string Formatted text of the result, e.g. $500.
minimally_formatted_text string Slightly formatted text of the result, e.g. Five hundred dollars.
words Word Repeated. One or more recognized words in the result.
encrypted_tokenization string Nuance-internal representation of the recognition result. Not returned when result originates from a grammar.
grammar_id string Identifier of the matching grammar, as grammar_0, grammar_1, etc. representing the order the grammars were provided as resources. Returned when result originates from an SRGS grammar rather than generic dictation.

Word

Output message containing one or more recognized words in the hypothesis, including the text, confidence score, and timing information. Included in Hypothesis.

Field Type Description
text string The recognized word.
confidence float The confidence score of the recognized word, 0 to 1.
start_ms uint32 Word start offset in the audio stream.
end_ms uint32 Word end offset in the audio stream.
silence_after_word_ms uint32 The amount of silence, in ms, detected after the word.
grammar_rule string The grammar rule that recognized the word text. Returned when result originates from an SRGS grammar rather than generic dictation.
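
As an illustration, the response loop shown under RecognitionResponse could be extended to print per-word details from the best hypothesis of each final result. This is a sketch only, using the fields described above.

# Sketch: print per-word details from the best hypothesis of a final result
if message.HasField('result') and message.result.result_type == 0:  # FINAL
    best = message.result.hypotheses[0]
    print(f'{best.formatted_text} (average confidence {best.average_confidence:.2f})')
    for word in best.words:
        print(f'  {word.text}: confidence {word.confidence:.2f}, '
              f'{word.start_ms}-{word.end_ms} ms')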

DataPack

Output message containing information about the current data pack. Included in Result.

Field Type Description
language string Language of the data pack loaded by Krypton.
topic string Topic of the data pack loaded by Krypton.
version string Version of the data pack loaded by Krypton.
id string Identifier string of the data pack, including nightly update information if a nightly build was loaded.

Scalar value types

The data types in the proto files are mapped to equivalent types in the generated client stub files.

Proto Notes C++ Java Python
double double double float
float float float float
int32 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint32 instead. int32 int int
int64 Uses variable-length encoding. Inefficient for encoding negative numbers. If your field is likely to have negative values, use sint64 instead. int64 long int/long
uint32 Uses variable-length encoding. uint32 int int/long
uint64 Uses variable-length encoding. uint64 long int/long
sint32 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int32s. int32 int int
sint64 Uses variable-length encoding. Signed int value. These encode negative numbers more efficiently than regular int64s. int64 long int/long
fixed32 Always four bytes. More efficient than uint32 if values are often greater than 2^28. uint32 int int
fixed64 Always eight bytes. More efficient than uint64 if values are often greater than 2^56. uint64 long int/long
sfixed32 Always four bytes. int32 int int
sfixed64 Always eight bytes. int64 long int/long
bool bool boolean bool
string A string must always contain UTF-8 encoded or 7-bit ASCII text. string String str/unicode
bytes May contain any arbitrary sequence of bytes. string ByteString str

Change log

2021-01-13

The Reference - Formatted text section was updated to include:

2020-12-21

2020-12-14

Three new fields were added to the proto files:

The sample Python app was updated to check for stereo audio files, which are not supported.

2020-10-27

These documentation changes were made:

2020-06-29

A new section was added to the result.proto file: see DataPack.

2020-04-30

The proto files were renamed from nuance_asr*.proto to:

These proto files are available in the zip file, nuance_asr_proto_files_v1.zip. The content of the files remains the same with the exception of new Java options referenced in recognizer.proto:

option java_multiple_files = true;
option java_package = "com.nuance.rpc.asr.v1";
option java_outer_classname = "RecognizerProto";

2020-03-31

The field names and data types in the v1 protocol are aligned with other Nuance as-a-service engines. See Upgrading to v1 for instructions on adjusting your applications to the latest protocol. The changes made since v1beta2 are:

New information was added about timer settings and interaction: see Timers.

The status codes were updated to clarify the notion of rejection: see Status messages and codes.

A new resource type, SETTINGS, was added, allowing you to set the data pack version. See ResourceReference.

2020-02-19

These changes were made to the ASRaaS software and documentation:

2020-01-22

These changes were made to the ASRaaS gRPC software and documentation since the last Beta release:

2019-12-18

The protocol was updated to v1beta2, with these changes:
- RecognizeXxx → RecognitionXxx: The proto file methods RecognizeRequest, RecognizeResponse, and RecognizeInitMessage were renamed RecognitionRequest, RecognitionResponse, and RecognitionInitMessage.
- The Dsp - initial_silence field was renamed initial_silence_ms.
- Locale codes for the RecognitionParameters - language field were changed from xxx-XXX (for example, eng-USA) to xx-XX (en-US).
- The RecognitionResource - reuse field (LOW_REUSE, HIGH_REUSE) applies to wordsets as well as DLMs, meaning both types of resources can be used for multiple recognition turns.
- The AudioFormat - opus field (representing Ogg Opus) was replaced with opus (for raw Opus) and ogg_opus (for Ogg-encapsulated Opus audio).

2019-11-15

Below are changes made to the ASRaaS gRPC API documentation since the initial Beta release: