The Python Command Line Interface (CLI) is intended to give developers an easy-to-use, interactive interface for exploring the NDEV ASR and TTS HTTP services.
Moreover, the CLI provides utilities that help developers deal with capturing and resampling audio.
The CLI provides the following commands:
Service | Command |
---|---|
Capturing Audio | record_wav.py some_output_file.wav |
Resample Audio | resample.sh some_sample_file.wav |
ASR | asr.py some_sample_file.{wav,spx,ogg} --lang=en_US |
Streaming ASR | asr_stream.py --lang=tr_TR |
TTS | tts.py some_output_file.{wav,spx,ogg,mp3} --lang=zh_HK |
Before proceeding, make sure you have created an app profile using ndevmobile.com and have identified that the app is using the HTTP services.
These installation instructions assume you use Homebrew to manage your packages.
brew install libsamplerate
brew install portaudio
Replace apt-get with the appropriate package manager for your system.
apt-get install libsamplerate
Once you have installed portaudio, install the required Python modules. Before proceeding with the modules in requirements.txt, ensure that numpy is installed. If it is not:
pip install numpy
With numpy installed, it is up to you whether to create a virtualenv for the project. Either way, install the required packages:
pip install -r requirements.txt
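To confirm that the installation steps above took effect, a small check like the following may help. It is only a sketch: the module list contains just the two modules this guide names (numpy and pyaudio), and requirements.txt may pull in others that are not checked here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: only the modules mentioned in this guide are listed here.
REQUIRED = ["numpy", "pyaudio"]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All listed modules are importable.")
```

If anything is reported missing, rerunning the pip commands above inside the same environment is the first thing to try.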
Execute the following command within the root directory of the project.
export PYTHONPATH=$PYTHONPATH:`pwd`
Then complete the credentials.json file as described below.
To download these utilities, you can clone the project on GitHub:
git clone git@github.com:NuanceDev/ndev-python-http-cli.git
The project’s structure is defined below.
ndev-python-http-cli/
├── README.md
├── bin
│   ├── asr.py
│   ├── asr_stream.py
│   ├── asr_then_tts.py
│   ├── play_wav.py
│   ├── record_and_recognize.sh
│   ├── record_wav.py
│   ├── resample.sh
│   └── tts.py
├── credentials.json
├── ndev
│   ├── __init__.py
│   ├── asr.py
│   ├── core.py
│   └── tts.py
├── requirements.txt
└── setup.py
The ndev directory houses all the necessary interfaces for the NDEV HTTP API.
Peruse this code to learn about various aspects of the APIs for both ASR and TTS, like the languages available or the sampling rates available for a given codec.
The bin directory houses all the scripts that you are able to run. Invoke each script from the root directory of the CLI, because the scripts currently rely on credentials.json being present there.
In order to work with the CLI, you will need to have an NDEV developer account.
Once you have an account, create an application, and be sure to specify that you’re using the HTTP service. You will receive an email with credentials and will be able to access the dev portal to see those credentials at any time.
These are the properties that you will need for the Python CLI:
The credentials for your app are needed in order to make requests through the CLI. Define them in the credentials.json file.
Here is an example of what the file looks like:
{
  "appId": "HTTP_NMDP_MyApp_20130506033030",
  "appKey": "[128-char-key]",
  "asrUrl": "[protocol://pathname:port]",
  "asrEndpoint": "/dictation",
  "ttsUrl": "[protocol://pathname:port]",
  "ttsEndpoint": "/tts"
}
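As an illustration of how a script might consume this file, here is a minimal sketch that loads it and checks that every property from the example above is present and non-empty. The function name load_credentials and the validation rule are assumptions made for this example; the CLI's own loader in the ndev package may work differently.

```python
import json

# Property names taken from the example credentials.json above.
REQUIRED_KEYS = ("appId", "appKey", "asrUrl", "asrEndpoint", "ttsUrl", "ttsEndpoint")


def load_credentials(path="credentials.json"):
    """Load the credentials file and fail early if a property is missing or empty.

    Hypothetical helper, not part of the CLI.
    """
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("credentials file is missing: " + ", ".join(missing))
    return creds
```

Calling load_credentials() with no argument reads credentials.json from the current working directory, which matches the requirement that scripts be run from the root of the project.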
Note: This utility records audio at the sampling rate of the audio device you choose to capture from. This will most likely be 44.1kHz or 48kHz. You will need to downsample the recorded audio in order to use it with the NDEV HTTP services.
The CLI provides an interface for capturing audio using the portaudio library. To use the record_wav.py script, provide the name of the wave file to write, including the wav extension, like so:
python bin/record_wav.py test.wav
Press Enter to begin capturing audio.
Press Ctrl+C to end capturing audio if the recording is < 10s long.
Here is an example of the output after having recorded a wav file.
Recording to: test.wav
Here are the available audio devices:
[0] Built-in Microph Default Sample Rate: 44100
[1] Built-in Input Default Sample Rate: 44100
[2] Built-in Output Default Sample Rate: 44100
Which device would you like to record audio from: 0
[enter] to begin recording, [ctrl-c] to cancel
o recording (ctrl+c to stop) ^C
x done recording
The utility leverages the following Python modules
During installation you would have run pip install -r requirements.txt. This installs pyaudio.
If you record audio using the record_wav.py script, you will notice that the wav file is stored at the sample rate it was captured at, possibly 44.1kHz or 48kHz. The NDEV HTTP service requires 8kHz or 16kHz audio, so the recording needs to be resampled.
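Whether a particular recording needs resampling can be checked by reading its header. The sketch below uses Python's standard wave module; the helper names describe_wav and needs_resampling are made up for this example, and the accepted rates are simply the 8kHz and 16kHz values stated above.

```python
import wave

# Rates this guide says the NDEV HTTP service accepts.
SUPPORTED_RATES = (8000, 16000)


def describe_wav(path):
    """Read basic format information from a wav file header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_width": wav.getsampwidth(),   # bytes per sample
            "sample_rate": wav.getframerate(),    # frames per second (Hz)
            "num_channels": wav.getnchannels(),
            "num_frames": wav.getnframes(),
        }


def needs_resampling(path):
    """Return True when the file's sample rate is not one of the supported rates."""
    return describe_wav(path)["sample_rate"] not in SUPPORTED_RATES
```

For a file captured at 44.1kHz, needs_resampling returns True, which is the cue to run the resampling step described next.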
The CLI offers a resample.sh script that provides an interface to the SoX utility.
Specify the wav file that you want to resample and optionally pass in the sample rate to resample to, e.g. 8k or 16000.
Here is an example of how to use the script:
./bin/resample.sh test.wav 8k
If you do not define the sample rate, a rate of 16000 will be used as the default. The utility will create a new wav file after resampling, with a naming pattern like [name]_[samplerate].wav.
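The rate handling and naming pattern described above can be sketched in Python as follows. This is not the contents of resample.sh: the helper names are hypothetical, the output name is assumed to reuse the rate exactly as typed (so 8k gives test_8k.wav), and the SoX invocation shown is one common way to resample (an output rate set with -r), which may differ from what the script actually runs.

```python
import os


def parse_rate(text="16000"):
    """Turn a rate such as '8k' or '16000' into an integer number of Hz."""
    text = text.strip().lower()
    if text.endswith("k"):
        return int(round(float(text[:-1]) * 1000))
    return int(text)


def resampled_name(path, rate_text):
    """Build [name]_[samplerate].wav next to the input file.

    Assumption: the suffix is the rate as the user typed it.
    """
    root, ext = os.path.splitext(path)
    return "{}_{}{}".format(root, rate_text, ext)


def sox_command(path, rate_text="16000"):
    """Return an argument list that asks SoX to write a resampled copy."""
    rate = parse_rate(rate_text)
    return ["sox", path, "-r", str(rate), resampled_name(path, rate_text)]


print(sox_command("test.wav", "8k"))
```

The returned list can be handed to subprocess.run once SoX is installed; building the list separately keeps the naming logic easy to check without touching any audio.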
To take advantage of the resampling utility, install SoX.
All ASR requests are performed using chunked transfer encoding.
To perform speech recognition on an audio file using the NDEV HTTP services, use the asr.py script.
This utility will do the following:
Determine the appropriate request headers based on the audio file
Ask the user for a language to use if one is not defined
Build the request using data available and credentials.json
Issue the request to the HTTP service
Display the top result for the performed recognition, or display the error message from the server
Usage: asr.py {source_file.wav} [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code
The asr.py utility prints the details of an ASR request. For example, using a wav file sampled at a rate of 16kHz and the language en_US results in the following output:
* analyzing audio stream...

Request URL
protocol://server:port/endpoint

Request Params
---------------
appId --
appKey --
id --

Request Headers
---------------
Content-Type audio/x-wav;bit=16;codec=pcm;rate=16000
Transfer-Encoding chunked
Accept text/plain
Accept-Topic Dictation
Accept-Language en_US

Audio Information
-----------------
Sample Width 2
Sample Rate 16000
Num Channels 1
Bit Rate 16
Audio File test_16k.wav
Bytes Sent 94366/94366 100%

* analyzed stream.
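The pieces shown in that output can be assembled roughly as follows. This is a sketch, not the code in ndev/asr.py: the function names are invented for the example, the header values are copied from the output above, and placing appId, appKey, and id in the query string is an assumption based on the "Request Params" section. No request is sent here.

```python
import wave
from urllib.parse import urlencode


def asr_headers(wav_path, language="en_US"):
    """Derive request headers from the wav file, mirroring the output above."""
    with wave.open(wav_path, "rb") as wav:
        bits = wav.getsampwidth() * 8
        rate = wav.getframerate()
    return {
        "Content-Type": "audio/x-wav;bit={};codec=pcm;rate={}".format(bits, rate),
        "Transfer-Encoding": "chunked",
        "Accept": "text/plain",
        "Accept-Topic": "Dictation",
        "Accept-Language": language,
    }


def asr_url(creds, request_id):
    """Join asrUrl and asrEndpoint and append the request parameters.

    Assumption: the three parameters are sent as a query string.
    """
    query = urlencode(
        {"appId": creds["appId"], "appKey": creds["appKey"], "id": request_id}
    )
    return "{}{}?{}".format(creds["asrUrl"], creds["asrEndpoint"], query)


def read_chunks(path, chunk_size=2048):
    """Yield the file in fixed-size pieces, suitable for a chunked request body."""
    with open(path, "rb") as audio:
        while True:
            chunk = audio.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

An HTTP client that accepts an iterable request body will typically send each yielded piece as one chunk and set the Transfer-Encoding header itself. The chunk size of 2048 bytes is an arbitrary choice for this sketch.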
Sending audio data in real time while capturing it enhances the user experience drastically when integrating speech into your applications.
The asr_stream.py utility performs real-time audio capture and streaming for speech recognition.
Usage: asr_stream.py [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code
  -s SAMPLERATE, --samplerate=SAMPLERATE
                        specify the desired samplerate for audio transfer
  -v, --verbose         see the raw HTTP
To view the raw bytes transferred during the request, use the -v (verbose) flag.
The language is optional. If it is not specified, you will be prompted to choose from the available languages.
Speech synthesis from text is a compelling feature that can be added to enhance an application.
The CLI TTS utilities encourage experimentation and allow you to store an audio file that is returned from the server based on text and the given language.
Please note that these utilities should not be used to gather samples that can then be used later. This is stated in the Terms of Use.
Usage: tts.py {destination_file_name.format} {text_to_synthesize} [options]

Options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --lang=LANGUAGE
                        desired language via language code, i.e. en_US
  -r SAMPLERATE, --rate=SAMPLERATE
                        the sample rate to use for the create audio file if relevant, i.e. 16000
Here is an example of making a TTS request, given the destination path of a wav file and some text:
./bin/tts.py test.wav "this is a test"
NDEV HTTP Python CLI from Nuance Communications
for more info see: http://nuancedev.github.io

Select Synthesis Language
[0] Arabic ar_WW
[1] Australian English en_AU
[2] Bahasa (Indonesia) id_ID
[3] Basque eu_ES
[4] Belgian Dutch nl_BE
[5] Canadian French fr_CA
[6] Cantonese zh_HK
[7] Catalan ca_ES
[8] Czech cs_CZ
[9] Danish da_DK
[10] Dutch nl_NL
[11] Finnish fi_FI
[12] French fr_FR
[13] German de_DE
[14] Greek el_GR
[15] Hindi hi_IN
[16] Hungarian hu_HU
[17] Indian English en_IN
[18] Irish English en_IE
[19] Italian it_IT
[20] Japanese jp_JP
[21] Korean ko_KR
[22] Mandarin zh_CN
[23] Norwegian no_NO
[24] Polish pl_PL
[25] Portuguese pt_PT
[26] Portuguese Braz. pt_BR
[27] Romanian ro_RO
[28] Russian ru_RU
[29] Scottish English en_SC
[30] Slovak sk_SK
[31] South African English en_ZA
[32] Spanish Castilian es_ES
[33] Spanish Mexican es_MX
[34] Swedish sv_SE
[35] Taiwanese Mandarin zh_TW
[36] Thai th_TH
[37] Turkish tr_TR
[38] UK English en_UK
[39] US English en_US
Which language (default: US English)? 39

The following voices are available in en_US..
[0] Allison (F)
[1] Carol (F)
[2] Samantha (F)
[3] Tom (M)
Which voice would you like to use? 0

Using Language: US English (en_US) Voice: Allison

The following sample rates are available for the 'wav' format..
[0] 8000Hz
[1] 16000Hz
[2] 22000Hz
What sample rate would you like to use? 1

Using Sample Rate: 16000

* synthesizing text...

Request URL
---------------
[url here]

Request Headers
---------------
Content-Type: text/plain; charset=utf-8
Accept: audio/x-wav;bit=16;codec=pcm;rate=16000

Making request: 1.038076 seconds, 43648 bytes

* synthesize request complete
✓ TTS Text synthesized to file -> test.wav
The TTS service can create synthesized samples in the following formats:
wav (Sample Rates: 8k, 16k, 22k)
spx or ogg (Sample Rates: 8k, 16k)
mp3
amr
The format is determined by the extension of the output file name.
For example, if you specify test.wav, the resulting file will be in wave format (16-bit PCM).
Alternatively, if you specify test.mp3, the resulting file will be in mp3 format with a bit rate of 128kbps.
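The extension-based selection can be pictured with a small lookup. The sketch below is illustrative only: the helper names are hypothetical, the rate table repeats the formats and rates listed above (with no rates given for mp3 and amr), and the Accept value is built only for wav because that is the only header this guide shows.

```python
import os

# Formats and sample rates as listed in this guide; empty means none were listed.
SAMPLE_RATES = {
    "wav": (8000, 16000, 22000),
    "spx": (8000, 16000),
    "ogg": (8000, 16000),
    "mp3": (),
    "amr": (),
}


def output_format(path):
    """Pick the audio format from the destination file's extension."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SAMPLE_RATES:
        raise ValueError("unsupported output format: {!r}".format(ext))
    return ext


def allowed_rates(path):
    """Return the sample rates listed for the file's format."""
    return SAMPLE_RATES[output_format(path)]


def wav_accept_header(rate):
    """Build the Accept value seen in the example output for a wav request."""
    if rate not in SAMPLE_RATES["wav"]:
        raise ValueError("unsupported wav sample rate: {}".format(rate))
    return "audio/x-wav;bit=16;codec=pcm;rate={}".format(rate)


print(output_format("test.mp3"), allowed_rates("test.wav"))
```

Keeping the table in one place makes it straightforward to reject an unsupported extension before any request is made.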
If you work with spx or ogg files, you may want to use the speex library to decode them into wave.