The speech model is integrated into the other models to ensure that voice insights are only applied to the parts of the audio that contain speech. On its own, however, it can be used for more granular control when detecting music and voice in audio. Future iterations of the model will include silence detection and more detailed human sounds.
The examples here combine the Speech model with the Arousal model to detect music and high-energy segments in audio from vlogs:
- basic analysis of audio files with built-in summarisation options - example 1
- custom summarisation options for audio file analysis - example 2
- Deeptone, with a license key and model files
- the audio file(s) you want to process
You can download this sample audio file, featuring our CTO talking about OTO, to use with the examples below.
Remember to add the path to a valid `.model` file, a valid license key, and the path to your audio file before running the example.
In these examples we make use of the `transitions`-level outputs, which are optionally calculated when processing a file.
The `summary` output gives the fraction of the audio that falls into a particular class. In the case below we are interested in the high-arousal portion of the speech, ignoring audio with no speech detected.
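To illustrate this kind of summary, here is a minimal Python sketch that computes the high-arousal fraction of speech from a per-frame timeseries. The field names and label values here are hypothetical stand-ins, not the SDK's exact output schema.

```python
# Hypothetical per-frame timeseries: each frame carries a Speech-model
# label and an Arousal-model label (illustrative field names only).
timeseries = [
    {"speech": "no_speech", "arousal": "low"},
    {"speech": "speech", "arousal": "high"},
    {"speech": "speech", "arousal": "high"},
    {"speech": "speech", "arousal": "low"},
    {"speech": "music", "arousal": "low"},
]

# Keep only the frames where speech was detected, ignoring the rest.
speech_frames = [f for f in timeseries if f["speech"] == "speech"]

# Fraction of the speech that is high-arousal.
high_arousal = [f for f in speech_frames if f["arousal"] == "high"]
fraction = len(high_arousal) / len(speech_frames) if speech_frames else 0.0
print(fraction)  # 2 of 3 speech frames are high-arousal
```

The same pattern extends to any class of interest: filter the frames you care about, then take the ratio.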
In the second part of the example, we also look at the `transitions` output to count how many uninterrupted segments of music can be found in this file.
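To make the counting concrete, the sketch below assumes the `transitions` output can be represented as a list of labelled segments with start and end times in milliseconds; the field names are illustrative, not the SDK's exact schema.

```python
# Hypothetical transitions output: consecutive labelled segments with
# start/end times in milliseconds (illustrative field names only).
transitions = [
    {"label": "music", "start": 0, "end": 1920},
    {"label": "speech", "start": 1920, "end": 5000},
    {"label": "music", "start": 5000, "end": 5384},
    {"label": "speech", "start": 5384, "end": 9000},
    {"label": "music", "start": 9000, "end": 12000},
]

# Each entry labelled "music" is one uninterrupted music segment.
music_segments = [s for s in transitions if s["label"] == "music"]
print(len(music_segments))  # 3 uninterrupted music segments
```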
The `transitions` output is a useful way to collect high-level information from an audio file. It operates at the most granular level of the output - 64 ms in most models. As a result, even very short pauses in speech will be reflected in the output. Depending on your use case, you may want a more customised summarisation.
In this second example, instead of counting all segments containing music, we could count only those longer than 1 s. The same logic applies when calculating summaries of the audio file - you can always operate at the level of the timeseries and compute whatever property you need.
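Filtering by duration is a small extension of the same idea. Again, the segment structure below is a hypothetical stand-in for the SDK's actual output:

```python
# Hypothetical transitions output: labelled segments with start/end
# times in milliseconds (illustrative field names only).
transitions = [
    {"label": "music", "start": 0, "end": 1920},      # 1.92 s
    {"label": "music", "start": 5000, "end": 5384},   # 0.384 s - too short
    {"label": "music", "start": 9000, "end": 12000},  # 3 s
]

MIN_DURATION_MS = 1000  # only count music segments longer than 1 s

long_music = [
    s for s in transitions
    if s["label"] == "music" and s["end"] - s["start"] >= MIN_DURATION_MS
]
print(len(long_music))  # 2 music segments exceed 1 s
```

Any other per-segment property (total music duration, longest segment, and so on) can be computed the same way from the segment boundaries.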