Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification

Authors: 
A. Schindler
T. Lidy
A. Rauber
Type: 
Poster presentation with proceedings
Proceedings: 
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)
Pages: 
118 - 122
Year: 
2017
ISBN: 
ISBN: 978-952-15-4042-4
Abstract: 
In this paper we present a Deep Neural Network architecture for the task of acoustic scene classification which harnesses information from increasing temporal resolutions of Mel-Spectrogram segments. This architecture is composed of separate parallel Convolutional Neural Networks which learn spectral and temporal representations for each input resolution. The resolutions are chosen to cover fine-grained characteristics of a scene's spectral texture as well as its distribution of acoustic events. The proposed model shows a 3.56% absolute improvement over the best performing single-resolution model and a 12.49% improvement over the DCASE 2017 Acoustic Scene Classification task baseline.
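The multi-resolution idea in the abstract can be illustrated with a toy sketch: compute spectrograms of the same audio at several temporal resolutions (window/hop lengths), feed each to its own "branch", and concatenate the branch outputs. This is a minimal numpy illustration under assumed parameters, not the paper's actual CNN; `branch_features` is a hypothetical stand-in (simple temporal pooling) for a learned convolutional branch, and the window sizes are illustrative.

```python
import numpy as np

def frame_signal(x, win, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n = 1 + (len(x) - win) // hop
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def log_spectrogram(x, win, hop):
    """Log-compressed magnitude spectrogram of a Hann-windowed STFT."""
    frames = frame_signal(x, win, hop) * np.hanning(win)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(mag)  # shape: (time, freq)

def branch_features(spec):
    """Stand-in for one CNN branch: global average pooling over time."""
    return spec.mean(axis=0)

# Hypothetical (window, hop) pairs: shorter windows give finer temporal
# resolution, longer windows give finer spectral resolution.
resolutions = [(256, 128), (1024, 512), (4096, 2048)]

rng = np.random.default_rng(0)
audio = rng.standard_normal(44100)  # 1 s of noise as a stand-in scene

# Parallel "branches", one per resolution, fused by concatenation.
fused = np.concatenate(
    [branch_features(log_spectrogram(audio, w, h)) for w, h in resolutions]
)
print(fused.shape)  # single feature vector combining all resolutions
```

In the paper's architecture the fused representation would then feed a classifier; here the fusion by concatenation is the only part retained, to show how the per-resolution branches stay separate until the merge point.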
TU Focus: 
Information and Communication Technology
Reference: 

A. Schindler, T. Lidy, A. Rauber:
"Multi-Temporal Resolution Convolutional Neural Networks for Acoustic Scene Classification";
Poster: Detection and Classification of Acoustic Scenes and Events, Munich, Germany; 16.11.2017 - 17.11.2017; in: "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)", (2017), ISBN: 978-952-15-4042-4; pp. 118 - 122.

Additional Information

Last changed: 
11.01.2018 19:05:08
Accepted: 
Accepted
TU Id: 
267111
Department Focus: 
Computational Intelligence
Author List: 
A. Schindler, T. Lidy, A. Rauber
Abstract German: 
In dieser Arbeit präsentieren wir eine Deep Neural Network Architektur zur akustischen Szenenklassifikation, die Informationen aus verschiedenen zeitlichen Auflösungen von Mel-Spektrogramm-Segmenten nutzt. Diese Architektur besteht aus getrennten parallelen Convolutional-Neuronalen Netzen, die spektrale und zeitliche Darstellungen für jede Eingangsauflösung lernen. Die Auflösungen werden so gewählt, dass sie feinkörnige Eigenschaften der spektralen Textur einer Szene sowie ihre Verteilung von akustischen Ereignissen abdecken. Das vorgeschlagene Modell zeigt eine 3,56%-ige absolute Verbesserung des besten Single-Resolution-Modells und 12,49% der Baseline des DCASE 2017 Acoustic Scenes Classification-Tasks.