Abstract
Human activities can be recognized by the sounds they produce, so researchers have proposed methods that recognize activities by detecting the presence of sound events in short time frames. However, in crowded environments many sound events overlap, which makes it impossible to distinguish the individual events and causes detection methods to fail.
To address this issue and make sound-based models suitable for crowd activities, this paper proposes predicting the proportion of activities happening in a specific place, using two neural network-based regression models: a CNN model and a concatenate model. The CNN model takes Mel-bands as input, a design that is very popular in single-activity recognition problems. Building on the CNN model, we also designed a concatenate model that additionally takes the global FFT feature as input to further improve performance.
The approach is evaluated on three generated groups of audio samples, each with a different level of crowdedness. Both RMSE and the coefficient of determination (R2 score) are used as evaluation metrics. The experiments show that the concatenate model performs statistically better across the dataset, with an R2 score of 0.7377. The results indicate that the concatenate model, which combines short-frame and holistic features, outperforms any single-feature model.
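For readers unfamiliar with the two evaluation metrics, the following is a minimal sketch of how RMSE and the R2 score are commonly computed for a regression output such as predicted activity proportions. The function names and the sample values are hypothetical illustrations, not the paper's code or data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical example: true vs. predicted proportions of four activities
# in one audio sample (values invented for illustration only).
true_proportions = [0.5, 0.3, 0.2, 0.0]
predicted_proportions = [0.4, 0.3, 0.2, 0.1]

print("RMSE:", rmse(true_proportions, predicted_proportions))
print("R2:", r2_score(true_proportions, predicted_proportions))
```

A lower RMSE and an R2 score closer to 1 both indicate predictions that track the true proportions more closely. R2 additionally normalizes the error by the variance of the true values.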
Original language | English |
---|---|
Title of host publication | PETRA '20: The 13th PErvasive Technologies Related to Assistive Environments Conference |
Publisher | ACM SigCHI |
Pages | 126-133 |
Number of pages | 8 |
ISBN (Electronic) | 9781450377737 |
ISBN (Print) | 978-1-4503-7773-7 |
DOIs | |
Publication status | Published - 30 Jun 2020 |
Event | 13th ACM International Conference on PErvasive Technologies Related to Assistive Environments, PETRA 2020 - Corfu Holiday Palace, Virtual, Online, Greece; Duration: 30 Jun 2020 → 3 Jul 2020; Conference number: 13 |
Conference
Conference | 13th ACM International Conference on PErvasive Technologies Related to Assistive Environments, PETRA 2020 |
---|---|
Abbreviated title | PETRA 2020 |
Country/Territory | Greece |
City | Virtual, Online |
Period | 30/06/20 → 3/07/20 |
Keywords
- ambient intelligence
- automatic sound event recognition
- concatenate neural network
- convolutional neural network
- crowd activity monitoring
- machine learning
- mel-bands spectrogram
- r-squared score