Speech model recipes

Overview

The speech model is integrated into the other models to ensure voice insights are only applied to the parts of the audio that contain speech. On its own, however, it can be used for more granular control when detecting music and voice in audio. Future iterations of the model will include silence detection and more detailed human sounds.
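
As a minimal sketch of standalone use, the snippet below runs only the Speech model on a file and prints its summary. The call mirrors the examples further down; the "speech" summary key is an assumption by analogy with the "arousal" summary used there, so check the output schema of your Deeptone version.

from deeptone import Deeptone

# Minimal sketch: run the Speech model on its own (paths are placeholders)
engine = Deeptone(model_path="path/to/your.model", license_key="YOUR_LICENSE_KEY")
output = engine.process_file(
    filename="path/to/audio.wav",
    models=[engine.models.Speech],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
)
# Assumed by analogy with the arousal summary: per-class "<class>_fraction" keys
print(output["channels"]["0"]["summary"]["speech"])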

The examples here combine the Speech model with the Arousal model to detect music and high-energy segments in audio from vlogs:

  • basic analysis of audio files with the built-in summarisation options - Example 1
  • custom summarisation options for audio file analysis - Example 2

Prerequisites

  • Deeptone with a license key and model file(s)
  • the audio file(s) you want to process

Sample data

For the examples below, you can download this sample audio file of our CTO talking about OTO.

Default summaries - Example 1

Remember to add the path to a valid .model file, a valid license key, and the path to your audio file before running the example.

In this example we make use of the summary and transitions outputs, which are optionally calculated when processing a file.

The summary output gives us the fraction of the audio that falls into each class. In the case below we are interested in the high-arousal part of the speech, ignoring the audio with no speech detected: since the fractions are computed over the whole file, we divide the high-arousal fraction by the fraction of audio that does contain speech (1 minus the no-speech fraction) to renormalise it to speaking time.

In the second part of the example, we also look at the transitions output to count how many uninterrupted segments of music were found in the file.
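
For orientation, the parts of the output used below have roughly the following shape. This is an illustrative sketch with made-up values, assuming the AROUSAL_HIGH and AROUSAL_NO_SPEECH constants resolve to "high" and "no_speech"; the exact schema may differ in your Deeptone version.

# Illustrative shape only - not verbatim Deeptone output; all values are made up
example_output = {
    "channels": {
        "0": {
            "summary": {
                # one "<class>_fraction" entry per arousal class
                "arousal": {"high_fraction": 0.21, "low_fraction": 0.44, "no_speech_fraction": 0.35},
            },
            "transitions": {
                # one entry per uninterrupted segment; timestamps in milliseconds
                "speech": [
                    {"result": "music", "timestamp_start": 0, "timestamp_end": 2112, "confidence": 0.93},
                ],
            },
        },
    },
}

With these illustrative values, the high-arousal fraction of speaking time would be 0.21 / (1 - 0.35) ≈ 0.32, i.e. about 32%.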

from deeptone import Deeptone
from deeptone.deeptone import AROUSAL_HIGH, AROUSAL_NO_SPEECH

# Set the required constants
VALID_LICENSE_KEY = None
MODEL_PATH = None
FILE_TO_PROCESS = None
assert None not in (VALID_LICENSE_KEY, MODEL_PATH, FILE_TO_PROCESS), "Set the required constants"

# Initialise Deeptone
engine = Deeptone(model_path=MODEL_PATH, license_key=VALID_LICENSE_KEY)

# Process the file with the Speech and Arousal models, requesting the
# optional summary and transitions outputs
output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.Speech, engine.models.Arousal],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
    include_transitions=True,
)

# Fraction of high-arousal audio, renormalised to speech-only time
arousal_summary = output["channels"]["0"]["summary"]["arousal"]
high_part = arousal_summary[f"{AROUSAL_HIGH}_fraction"] / (
    1 - arousal_summary[f"{AROUSAL_NO_SPEECH}_fraction"]
)
total_time_aroused = round(high_part * 100, 2)
print(f"You were excited {total_time_aroused}% of the time you were speaking")

# Count uninterrupted music segments using the speech transitions
music_transitions = [
    transition
    for transition in output["channels"]["0"]["transitions"]["speech"]
    if transition["result"] == "music"
]
print(f"You had {len(music_transitions)} music transitions")

Custom summaries - Example 2

The built-in summary and transitions outputs are a useful way to collect high-level information from an audio file. They operate at the most granular level of the output - 64ms in most models. As a result, even very small pauses between speech will be reflected in the output. Depending on your use case, you may want a more custom summarisation.

In this second example, instead of counting all segments with music, we count only those longer than 1s. The same logic can be applied when calculating summaries of the audio file - you can always operate at the level of the timeseries and compute whatever property you need, for instance merging segments separated by very short gaps, as in the sketch after the code below.

from deeptone import Deeptone
from deeptone.deeptone import AROUSAL_HIGH, AROUSAL_NO_SPEECH

# Set the required constants
VALID_LICENSE_KEY = None
MODEL_PATH = None
FILE_TO_PROCESS = None
assert None not in (VALID_LICENSE_KEY, MODEL_PATH, FILE_TO_PROCESS), "Set the required constants"

# Initialise Deeptone
engine = Deeptone(model_path=MODEL_PATH, license_key=VALID_LICENSE_KEY)

# Process the file with the Speech and Arousal models, requesting the
# optional summary and transitions outputs
output = engine.process_file(
    filename=FILE_TO_PROCESS,
    models=[engine.models.Speech, engine.models.Arousal],
    output_period=1024,
    channel=0,
    use_chunking=True,
    include_summary=True,
    include_transitions=True,
)

# Fraction of high-arousal audio, renormalised to speech-only time
arousal_summary = output["channels"]["0"]["summary"]["arousal"]
high_part = arousal_summary[f"{AROUSAL_HIGH}_fraction"] / (
    1 - arousal_summary[f"{AROUSAL_NO_SPEECH}_fraction"]
)
total_time_aroused = round(high_part * 100, 2)
print(f"You were excited {total_time_aroused}% of the time you were speaking")

# Count all uninterrupted music segments, regardless of duration
music_transitions = [
    transition
    for transition in output["channels"]["0"]["transitions"]["speech"]
    if transition["result"] == "music"
]
print(f"You had {len(music_transitions)} music transitions")

# Custom transition count: keep only confident music segments longer than 1s
# (timestamps are in milliseconds)
long_music_transitions = 0
for transition in output["channels"]["0"]["transitions"]["speech"]:
    if transition["result"] == "music":
        music_duration = transition["timestamp_end"] - transition["timestamp_start"]
        if music_duration > 1000 and transition["confidence"] > 0.5:
            long_music_transitions += 1
print(f"You had {long_music_transitions} music transitions longer than 1s")