Abstract
This paper proposes an approach for the automatic recognition of roles in settings such as news and talk shows, where roles correspond to specific functions such as Anchorman, Guest or Interview Participant. The approach is based purely on nonverbal vocal behavioral cues, including who talks when and how much (turn-taking behavior), and statistical properties of pitch, formants, energy and speaking rate (prosodic behavior). The experiments were performed on a corpus of around 50 hours of broadcast material, and the accuracy, defined as the percentage of time correctly labeled in terms of role, is up to 89%. Both turn-taking and prosodic behavior lead to satisfactory results. Furthermore, on one database, their combination leads to a statistically significant improvement.
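The accuracy measure named in the abstract (percentage of time correctly labeled in terms of role) can be illustrated with a minimal sketch. The segment representation, the function name `time_weighted_accuracy`, and the toy data below are assumptions made for illustration; they are not taken from the paper, which may compute the measure differently.

```python
from typing import List, Tuple

# Hypothetical representation: (duration in seconds, true role, predicted role).
Segment = Tuple[float, str, str]


def time_weighted_accuracy(segments: List[Segment]) -> float:
    """Fraction of total time whose predicted role matches the true role."""
    total = sum(duration for duration, _, _ in segments)
    if total <= 0:
        raise ValueError("total duration must be positive")
    correct = sum(
        duration for duration, truth, predicted in segments if truth == predicted
    )
    return correct / total


# Toy data (invented for illustration, not from the corpus).
segments = [
    (120.0, "Anchorman", "Anchorman"),
    (60.0, "Guest", "Anchorman"),
    (20.0, "Guest", "Guest"),
]
accuracy = time_weighted_accuracy(segments)
```

Weighting by duration rather than counting segments means that a long mislabeled turn lowers the score more than a short one, which matches the "percentage of time" wording.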
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the ACM International Conference on Multimedia |
| Place of Publication | New York |
| Publisher | Association for Computing Machinery |
| Pages | 847-850 |
| Number of pages | 4 |
| ISBN (Print) | 978-1-60558-933-6 |
| Publication status | Published - 2010 |
| Event | 18th ACM Multimedia Conference, MM 2010, Firenze, Italy. Duration: 25 Oct 2010 → 29 Oct 2010. Conference number: 18. http://www.sigmm.org/archive/MM/mm10/www.acmmm10.org/index.html |
Conference
| Conference | 18th ACM Multimedia Conference, MM 2010 |
|---|---|
| Abbreviated title | MM |
| Country/Territory | Italy |
| City | Firenze |
| Period | 25/10/10 → 29/10/10 |
Keywords
- METIS-271131
- EC Grant Agreement nr.: FP7/231287
- EWI-18805
- IR-74618