Abstract
This paper proposes an approach for the automatic recognition of roles in settings such as news and talk shows, where roles correspond to specific functions like Anchorman, Guest or Interview Participant. The approach is based purely on nonverbal vocal behavioral cues, including who talks when and how much (turn-taking behavior), and statistical properties of pitch, formants, energy and speaking rate (prosodic behavior). The experiments were performed on a corpus of around 50 hours of broadcast material, and the accuracy, measured as the percentage of time correctly labeled in terms of role, reaches up to 89%. Both turn-taking and prosodic behavior lead to satisfactory results. Furthermore, on one database, their combination leads to a statistically significant improvement.
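The accuracy measure described in the abstract, the percentage of time correctly labeled in terms of role, can be illustrated with a minimal sketch. The segment representation, the function name `time_weighted_accuracy` and the toy data below are assumptions made for illustration; they are not taken from the paper, which may compute the measure differently in detail.

```python
from typing import List, Tuple

# Hypothetical segment format: (duration in seconds, true role, predicted role).
Segment = Tuple[float, str, str]


def time_weighted_accuracy(segments: List[Segment]) -> float:
    """Return the fraction of total time whose predicted role matches the true role."""
    total = sum(duration for duration, _, _ in segments)
    if total <= 0:
        raise ValueError("segments must cover a positive total duration")
    correct = sum(duration for duration, true, pred in segments if true == pred)
    return correct / total


# Toy data for illustration only; not drawn from the paper's corpus.
example = [
    (60.0, "Anchorman", "Anchorman"),
    (30.0, "Guest", "Anchorman"),
    (10.0, "Guest", "Guest"),
]
accuracy = time_weighted_accuracy(example)
print(f"{accuracy:.0%}")
```

Weighting by duration rather than counting segments means that a long, mislabeled turn lowers the score more than a short one, which matches the "percentage of time" wording in the abstract.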
Original language | English |
---|---|
Title of host publication | Proceedings of the ACM International Conference on Multimedia |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 847-850 |
Number of pages | 4 |
ISBN (Print) | 978-1-60558-933-6 |
Publication status | Published - 2010 |
Event | 18th ACM Multimedia Conference, MM 2010 - Firenze, Italy. Duration: 25 Oct 2010 → 29 Oct 2010. Conference number: 18. http://www.sigmm.org/archive/MM/mm10/www.acmmm10.org/index.html |
Conference
Conference | 18th ACM Multimedia Conference, MM 2010 |
---|---|
Abbreviated title | MM |
Country/Territory | Italy |
City | Firenze |
Period | 25/10/10 → 29/10/10 |
Keywords
- METIS-271131
- EC Grant Agreement nr.: FP7/231287
- EWI-18805
- IR-74618