Speech Model

The speech model classifies audio as "Speech", "Music", or "Other". The "Other" class covers everything that is neither speech nor music, including silence.

The receptive field of this model is 1082ms.


Receptive Field    Result Type
1082ms             result ∈ ["speech", "music", "other"]


The time-series result is an iterable whose elements contain the following information:

{
    "timestamp": 0,
    "speech": {
        "result": "speech",
        "confidence": 0.92
    }
}


If a summary is requested, the following is returned:

{
    "speech": {
        "speech_fraction": 0.30,
        "other_fraction": 0.65,
        "music_fraction": 0.05
    }
}

where x_fraction is the fraction of the input duration (a value between 0 and 1) during which class x was identified. The three fractions sum to 1.
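
To illustrate what the summary fractions mean, the sketch below derives them from a list of time-series elements. It is not the service's own implementation: the function name `summarize` is hypothetical, and it assumes the time-series elements are evenly spaced, so that each element stands for the same amount of audio.

```python
from collections import Counter

CLASSES = ("speech", "music", "other")


def summarize(series):
    """Return the per-class fractions for a list of time-series elements.

    Assumption: elements are evenly spaced in time, so the share of
    elements with a given class equals the share of the input duration.
    """
    counts = Counter(element["speech"]["result"] for element in series)
    total = sum(counts.values())
    if total == 0:
        # No elements: report zero for every class instead of dividing by zero.
        return {f"{name}_fraction": 0.0 for name in CLASSES}
    return {f"{name}_fraction": counts[name] / total for name in CLASSES}


# Hypothetical input in the time-series format shown earlier.
example_series = [
    {"timestamp": 0, "speech": {"result": "other", "confidence": 0.96}},
    {"timestamp": 500, "speech": {"result": "other", "confidence": 0.91}},
    {"timestamp": 1000, "speech": {"result": "speech", "confidence": 0.92}},
    {"timestamp": 1500, "speech": {"result": "music", "confidence": 0.86}},
]
example_summary = summarize(example_series)
```

A class that never occurs in the input still gets an entry with the value 0.0, which keeps the shape of the result the same for every input.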


If transitions are requested, a time-series with the following transition elements is returned:

[
    {
        "timestamp_start": 0,
        "timestamp_end": 1500,
        "result": "other",
        "confidence": 0.96
    },
    {
        "timestamp_start": 1500,
        "timestamp_end": 6000,
        "result": "music",
        "confidence": 0.86
    }
]

The result above means that the first 1500ms of the audio snippet contained neither speech nor music, and that music was detected between 1500ms and 6000ms.
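
The relationship between the time-series and the transitions can be sketched by collapsing consecutive elements with the same class into one segment. This is only an illustration under stated assumptions, not the service's method: the function name `to_transitions` and the `end_ms` parameter are hypothetical, timestamps are assumed to be in milliseconds and to mark the start of each element, and the segment confidence is assumed to be the mean of its elements' confidences (the source does not say how it is aggregated).

```python
from itertools import groupby


def to_transitions(series, end_ms):
    """Collapse runs of equal classes into transition elements.

    series: time-series elements ordered by "timestamp" (assumed ms).
    end_ms: total input duration, used as the end of the last segment.
    """
    transitions = []
    for label, group in groupby(series, key=lambda e: e["speech"]["result"]):
        frames = list(group)
        confidences = [frame["speech"]["confidence"] for frame in frames]
        transitions.append({
            "timestamp_start": frames[0]["timestamp"],
            "timestamp_end": None,  # filled in below
            "result": label,
            # Assumption: mean confidence over the run.
            "confidence": sum(confidences) / len(confidences),
        })
    # Each segment ends where the next one starts; the last ends at end_ms.
    for current, following in zip(transitions, transitions[1:]):
        current["timestamp_end"] = following["timestamp_start"]
    if transitions:
        transitions[-1]["timestamp_end"] = end_ms
    return transitions


# Hypothetical input: three "other" elements followed by two "music" elements.
example_series = [
    {"timestamp": 0, "speech": {"result": "other", "confidence": 0.96}},
    {"timestamp": 500, "speech": {"result": "other", "confidence": 0.96}},
    {"timestamp": 1000, "speech": {"result": "other", "confidence": 0.96}},
    {"timestamp": 1500, "speech": {"result": "music", "confidence": 0.86}},
    {"timestamp": 2000, "speech": {"result": "music", "confidence": 0.86}},
]
example_transitions = to_transitions(example_series, end_ms=2500)
```

With this input, the sketch yields one "other" segment from 0 to 1500 and one "music" segment from 1500 to 2500, mirroring the shape of the transition elements shown above.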