File Processing

DeepTone™'s File Processing functionality allows you to extract insights from your audio files.

Working with stereo files

DeepTone™ processes each audio channel separately. If you provide a stereo file, you can specify a single channel to be processed; otherwise, all channels will be processed independently.
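For example, assuming an engine initialised as shown in Example Usage below (the file path is a placeholder), a call could restrict the analysis to the left channel:

# Analyse only the left channel (channel 0) of a stereo file
left_only = engine.process_file(
    filename="PATH_TO_STEREO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
)

# Omitting `channel` analyses every channel independently;
# the result then contains one entry per channel, keyed by index
all_channels = engine.process_file(
    filename="PATH_TO_STEREO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
)
print(all_channels["channels"].keys())  # e.g. "0" and "1" for a stereo file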

Sample data

You can download this sample audio file of a woman speaking to follow the examples below. For a code sample, go to Example Usage.

Supported formats

Currently, only WAV files are supported. Ideally, the files should be 16-bit PCM with a sample rate of 16 kHz. If a different sample rate is provided, the file will be up- or downsampled accordingly. Be aware, though, that files with sample rates lower than recommended may yield degraded analysis results.

If you're not sure whether your audio files meet these criteria, you can verify them with the SoX CLI tool:

sox --i PATH_TO_YOUR_AUDIO_FILE

The result will be something similar to:

Input File : PATH_TO_YOUR_AUDIO_FILE
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.99 = 63840 samples ~ 299.25 CDDA sectors
File Size : 128k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
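Alternatively, if you prefer to check from Python, the standard library's wave module can read the same header information (a minimal sketch; the file path is a placeholder):

import wave

# Inspect the WAV header to verify the recommended format
with wave.open("PATH_TO_YOUR_AUDIO_FILE", "rb") as wav:
    print("Channels    :", wav.getnchannels())
    print("Sample Rate :", wav.getframerate())
    print("Precision   :", wav.getsampwidth() * 8, "bit")  # sample width is in bytes
    # 16-bit PCM at 16 kHz is the recommended input
    meets_criteria = wav.getsampwidth() == 2 and wav.getframerate() == 16000
    print("Meets recommended format:", meets_criteria)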

If your files don't match these criteria, SoX can also convert them with the following command:

sox PATH_TO_YOUR_AUDIO_FILE -b 16 PATH_TO_OUTPUT_FILE rate 16k

Configuration options and outputs

The available configuration options and output types depend on the SDK language.

For a code sample, go to Example Usage. For a detailed output specification, go to Output specification.

Available configuration options

The process_file function accepts the following arguments (see the minimal sketch after this list):

  • filename - the path to the file to be analysed
  • models - the list of model names to use for the audio analysis
  • output_period - how often (in milliseconds, a multiple of 64) the output of the models should be returned
  • channel - optionally, a single channel to analyse; otherwise all channels will be analysed
  • include_summary - optionally, whether the output should contain a summary of the analysis; defaults to False
  • include_transitions - optionally, whether the output should contain the transitions of the analysis; defaults to False
  • use_chunking - optionally, whether the data should be chunked before the analysis (recommended for large files to avoid memory issues)
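A minimal call needs only the first three arguments; the optional ones fall back to the defaults above. A sketch, assuming an engine initialised as in Example Usage below:

# Minimal call: returns just the plain time series for every channel
output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,  # one result every 1024 ms (must be a multiple of 64)
)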

Available Outputs

There are three possible output types, depending on the parameters that you pass to the process_file function:

  • a plain time series - the default output type, always returned
  • a summary - appended to the results when include_summary=True
  • a simplified time series - appended to the results when include_transitions=True


See below for examples of each of the three outputs:

  • plain time series (according to the specified output_period):
{
  "channels": {
    "0": {
      "time_series": [
        {
          "timestamp": 0,
          "gender": {
            "result": "male",
            "confidence": 0.92
          }
        },
        {
          "timestamp": 1024,
          "gender": {
            "result": "male",
            "confidence": 0.86
          }
        },
        {
          "timestamp": 2048,
          "gender": {
            "result": "male",
            "confidence": 0.85
          }
        },
        ...
        {
          "timestamp": 29696,
          "gender": {
            "result": "male",
            "confidence": 0.98
          }
        }
      ]
    }
  }
}
  • summary (showing fraction of each class across the entire file):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "summary": {
        "gender": {
          "male_fraction": 0.7451,
          "female_fraction": 0.1024,
          "no_speech_fraction": 0.112,
          "unknown_fraction": 0.0405
        }
      }
    }
  }
}
  • simplified time series (indicating transition points between alternating results):
{
  "channels": {
    "0": {
      "time_series": [ ... ],
      "transitions": {
        "gender": [
          {
            "timestamp_start": 0,
            "timestamp_end": 1024,
            "result": "female",
            "confidence": 0.96
          },
          {
            "timestamp_start": 1024,
            "timestamp_end": 3072,
            "result": "male",
            "confidence": 0.87
          },
          ...
          {
            "timestamp_start": 8192,
            "timestamp_end": 12288,
            "result": "female",
            "confidence": 0.89
          }
        ]
      }
    }
  }
}
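Because each transition entry carries a start and an end timestamp (in milliseconds), per-segment durations follow directly. A minimal sketch, assuming output holds a result shaped like the example above:

# Derive the duration of each segment from the simplified time series
for segment in output["channels"]["0"]["transitions"]["gender"]:
    duration_ms = segment["timestamp_end"] - segment["timestamp_start"]
    print(f'{segment["result"]}: {duration_ms} ms '
          f'(confidence {segment["confidence"]})')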

Example Usage

You can use the process_file method to process your audio files.

from deeptone import Deeptone
from deeptone.deeptone import GENDER_MALE, GENDER_FEMALE, GENDER_UNKOWN, GENDER_NO_SPEECH

# Initialise Deeptone
engine = Deeptone(model_path="path/to/model", license_key="...")

output = engine.process_file(
    filename="PATH_TO_AUDIO_FILE",
    models=[engine.models.Gender],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
    include_transitions=True,
)

The returned object contains a time series with the analysis of the file, broken down by the provided output_period:

# Inspect the result
print(output)

print("Time series:")
for ts_result in output["channels"]["0"]["time_series"]:
    ts = ts_result["timestamp"]
    res = ts_result["gender"]
    print(f'Timestamp: {ts}ms\tresult: {res["result"]}\t'
          f'confidence: {res["confidence"]}')

summary = output["channels"]["0"]["summary"]["gender"]
male = summary[f"{GENDER_MALE}_fraction"] * 100
female = summary[f"{GENDER_FEMALE}_fraction"] * 100
no_speech = summary[f"{GENDER_NO_SPEECH}_fraction"] * 100
unknown = summary[f"{GENDER_UNKOWN}_fraction"] * 100
print(f'\nSummary: male: {male}%, female: {female}%, no_speech: {no_speech}%, unknown: {unknown}%')

print("\nTransitions:")
for ts_result in output["channels"]["0"]["transitions"]["gender"]:
    ts = ts_result["timestamp_start"]
    print(f'Timestamp: {ts}ms\tresult: {ts_result["result"]}\t'
          f'confidence: {ts_result["confidence"]}')

The output of the script would be something like:

Time series:
Timestamp: 0ms result: female confidence: 0.6418
Timestamp: 1024ms result: female confidence: 0.9002
Timestamp: 2048ms result: female confidence: 0.4725
Timestamp: 3072ms result: female confidence: 0.4679
Summary: male: 0.0%, female: 85.48%, no_speech: 0.0%, unknown: 14.52%
Transitions:
Timestamp: 0ms result: unknown confidence: 0.01510
Timestamp: 320ms result: female confidence: 0.8075
Timestamp: 2880ms result: unknown confidence: 0.0771
Timestamp: 3136ms result: female confidence: 0.4931

Raw output:

{
  "channels": {
    "0": {
      "time_series": [
        { "timestamp": 0, "gender": { "result": "female", "confidence": 0.6418 } },
        { "timestamp": 1024, "gender": { "result": "female", "confidence": 0.9002 } },
        { "timestamp": 2048, "gender": { "result": "female", "confidence": 0.4725 } },
        { "timestamp": 3072, "gender": { "result": "female", "confidence": 0.4679 } }
      ],
      "summary": {
        "gender": { "male_fraction": 0, "female_fraction": 0.8548, "no_speech_fraction": 0.0, "unknown_fraction": 0.1452 }
      },
      "transitions": {
        "gender": [
          { "timestamp_start": 0, "timestamp_end": 320, "result": "unknown", "confidence": 0.0151 },
          { "timestamp_start": 320, "timestamp_end": 2880, "result": "female", "confidence": 0.8075 },
          { "timestamp_start": 2880, "timestamp_end": 3136, "result": "unknown", "confidence": 0.0771 },
          { "timestamp_start": 3136, "timestamp_end": 3968, "result": "female", "confidence": 0.4931 }
        ]
      }
    }
  }
}

Further examples

For more examples of using the summary and transitions, head to the Speech model recipes and the Arousal model recipes sections.