Response Selection and Turn-taking for a Sensitive Artificial Listening Agent

Mark ter Maat

Abstract

Communication with machines is inherently not ‘natural’, yet we still prefer to interact with them without learning new skills, using types of communication we already know. Ideally, we want to communicate with a machine just as we communicate with people: we explain (using our voice and gestures) what we want the machine to do, and it understands this and performs the required task. In its simplest form, such a dialogue system receives the user’s input as written text, which it has to parse and analyze to extract the user’s intentions. But a more complex dialogue system can perceive the user via a microphone and a camera, and the user can use normal speech and gestures to explain his or her intentions. However, this means that the system has to take other aspects of human conversation into account besides interpreting the user’s intentions: for example, it has to manage correct turn-taking behaviour, provide feedback, and maintain an appropriate level of politeness. This thesis focuses on two aspects of the interaction between a user and a virtual agent (a dialogue system with a visual embodiment): the perception of turn-taking strategies and the selection of appropriate responses. This research was carried out in the context of the SEMAINE project, in which a virtual listening agent was built: a virtual agent that tries to keep the user talking for as long as possible. Additionally, the system consists of four specific characters, each with a certain emotional state: a happy, a gloomy, an aggressive, and a pragmatic one. These characters also try to bring the user into the same emotional state as they themselves are in. Turn-taking is a good example of something that is completely natural for most people, but very hard to teach a system.
And while most dialogue systems aim to have the agent’s responses start as soon as possible after the user’s end of turn without overlapping it, evidence indicates that starting too early or too late is not always inappropriate per se. People might start speaking early out of enthusiasm, or later than usual because they are thinking. This thesis describes a study of how different turn-taking strategies used by a dialogue system influence the perception that users have of that system. These turn-taking strategies combine different start times for the next turn (starting before the user’s turn has finished, directly when it finishes, or after a small pause) with different responses when overlapping speech is detected (stop speaking, continue normally, or continue with a raised voice). These strategies were evaluated in two studies. In the first study, a simulator was created that generated conversations by having two agents ‘talk’ to each other. The turn-taking behaviour of each agent was scripted beforehand, and the resulting conversation was played using non-intelligible speech. After listening to a simulated conversation, the users completed a questionnaire containing semantic differential scales about how they perceived a participant in the conversation. In the second study, the users actively participated in the conversation themselves. They were interviewed by a dialogue system, but the exact timing of each question was controlled by a human wizard. This wizard varied the start time of the questions depending on the selected strategy for that particular interview, and after each interview the users completed a questionnaire about how they perceived the dialogue system. These studies showed that starting too early (that is, interrupting the user) was mostly associated with negative and strong personality attributes: agents were perceived as less agreeable and more assertive.
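The strategy space described above can be sketched as the cross product of the start-time options and the overlap reactions. This is a minimal illustrative sketch; the labels are invented for illustration and are not the thesis's own terminology.

```python
from itertools import product

# Hypothetical labels for the turn-taking strategies described above
# (names are illustrative, not taken from the thesis).
START_TIMES = ["early", "direct", "pause"]         # when the agent starts its turn
OVERLAP_RESPONSES = ["stop", "continue", "raise"]  # reaction to overlapping speech

# Each strategy combines one start time with one overlap reaction,
# giving a 3 x 3 space of nine candidate strategies.
strategies = list(product(START_TIMES, OVERLAP_RESPONSES))
print(len(strategies))  # 9
```

In the studies, individual combinations from such a space were assigned to simulated agents or to the wizard-controlled interviewer, and each was rated separately.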
Leaving pauses between turns had the opposite associations: it was perceived as more agreeable and less assertive, and created the feeling of having more rapport. The studies also showed that different strategies influence the response behaviour of the users themselves. The users seemed to ‘adapt’ to the interviewing agent’s turn-taking strategy, for example by talking faster and with shorter turns when the interviewer started early during the interview. The final part of the thesis describes the response selection of the listening agent. We decided to select an appropriate response based on non-verbal input, rather than on the content of the user’s speech, to make the listening agent capable of responding appropriately regardless of the topic. The thesis first describes handcrafted models and then a more data-driven approach. In this approach, humans annotated videos containing user turns with appropriate possible responses. Classifiers were then trained to learn how to respond after a user’s turn. Different methods were used to create the training data and evaluate the results. The classifiers were tested by letting them predict appropriate responses for new fragments and letting humans rate these responses. We found that some classifiers produced significantly more appropriate responses than a random model.
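The data-driven pipeline above maps non-verbal features of a user turn to an annotated response class. As a minimal sketch of that idea, a nearest-centroid classifier over invented feature vectors (the thesis used its own feature set and classifiers; the features, values, and response labels below are purely hypothetical):

```python
import math
from collections import defaultdict

# Illustrative non-verbal features per user turn (values are invented):
# (mean pitch in Hz, mean energy, turn length in seconds) -> annotated response.
training = [
    ((220.0, 0.8, 2.1), "nod"),
    ((210.0, 0.7, 1.8), "nod"),
    ((150.0, 0.3, 6.5), "hmm"),
    ((140.0, 0.2, 7.0), "hmm"),
]

def centroids(data):
    """Average the feature vectors of each response class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for features, label in data:
        acc = sums[label]
        for i, v in enumerate(features):
            acc[i] += v
        acc[3] += 1
    return {label: tuple(s / acc[3] for s in acc[:3]) for label, acc in sums.items()}

def predict(model, features):
    """Pick the response class whose centroid is closest to the new turn."""
    return min(model, key=lambda label: math.dist(features, model[label]))

model = centroids(training)
print(predict(model, (215.0, 0.75, 2.0)))  # "nod": near the high-pitch, short-turn centroid
```

The evaluation in the thesis was then human-in-the-loop: predicted responses for unseen fragments were rated by people rather than scored against a gold label.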
Original language: Undefined
Awarding Institution:
  • University of Twente
Supervisors/Advisors:
  • Nijholt, Antinus, Supervisor
  • Heylen, Dirk K.J., Advisor
Date of Award: 30 Nov 2011
Place of Publication: Enschede
Print ISBNs: 978-94-6191-105-6
DOIs: 10.3990/1.9789461911056
State: Published - 30 Nov 2011

Keywords

  • IR-78566
  • Virtual agents
  • HMI-MI: MULTIMODAL INTERACTIONS
  • METIS-285079
  • Turn Taking
  • HMI-CI: Computational Intelligence
  • HMI-IA: Intelligent Agents
  • EC Grant Agreement nr.: FP7/211486
  • EWI-21416
  • Response selection

Cite this

@misc{9ca2fe1de2ec48ca9691d6564b3e50ad,
title = "Response Selection and Turn-taking for a Sensitive Artificial Listening Agent",
keywords = "IR-78566, Virtual agents, HMI-MI: MULTIMODAL INTERACTIONS, METIS-285079, Turn Taking, HMI-CI: Computational Intelligence, HMI-IA: Intelligent Agents, EC Grant Agreement nr.: FP7/211486, EWI-21416, Response selection",
author = "{ter Maat}, Mark",
note = "SIKS Dissertation Series; no. 2011-48",
year = "2011",
month = "11",
doi = "10.3990/1.9789461911056",
isbn = "978-94-6191-105-6",
school = "University of Twente",

}

Response Selection and Turn-taking for a Sensitive Artificial Listening Agent. / ter Maat, Mark.

Enschede, 2011. 119 p.

Research output: Scientific: PhD Thesis - Research UT, graduation UT

TY - THES

T1 - Response Selection and Turn-taking for a Sensitive Artificial Listening Agent

AU - ter Maat, Mark

N1 - SIKS Dissertation Series; no. 2011-48

PY - 2011/11/30

Y1 - 2011/11/30


KW - IR-78566

KW - Virtual agents

KW - HMI-MI: MULTIMODAL INTERACTIONS

KW - METIS-285079

KW - Turn Taking

KW - HMI-CI: Computational Intelligence

KW - HMI-IA: Intelligent Agents

KW - EC Grant Agreement nr.: FP7/211486

KW - EWI-21416

KW - Response selection

U2 - 10.3990/1.9789461911056

DO - 10.3990/1.9789461911056

M3 - PhD Thesis - Research UT, graduation UT

SN - 978-94-6191-105-6

ER -