MultiMediate Grand Challenge 2024

MultiMediate’24: Multi-Domain Engagement Estimation

Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling

Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11377–11382, 2024.


Estimating the momentary level of participants’ engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate’24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset which consists of group discussions between three to four people. In this way, MultiMediate’24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate’24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.

doi: 10.1145/3664647.3689004

Paper: mueller24_mm.pdf

Paper Access: https://dl.acm.org/doi/abs/10.1145/3664647.3689004

@inproceedings{mueller24_mm,
  author = {M{\"{u}}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Penzkofer, Anna and Schiller, Dominik and Br{\'{e}}mond, Fran{\c{c}}ois and Alexandersson, Jan and Andr{\'{e}}, Elisabeth and Bulling, Andreas},
  title = {MultiMediate'24: Multi-Domain Engagement Estimation},
  year = {2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  doi = {10.1145/3664647.3689004},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
  pages = {11377--11382},
  url = {https://dl.acm.org/doi/abs/10.1145/3664647.3689004}
}

Multi-Domain Engagement Estimation

Estimating the momentary level of participants' engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate'24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset which consists of group discussions between three to four people. In this way, MultiMediate'24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction.

Organisers

Cognitive Assistants
DFKI GmbH
Germany

Philipp Müller

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Jan Alexandersson

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Human-Computer Interaction and Cognitive Systems
University of Stuttgart
Germany

Andreas Bulling

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Anna Penzkofer

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Human Centered Artificial Intelligence
Augsburg University
Germany

Elisabeth André

Universitätsstr. 6a
86159 Augsburg, Germany

Tobias Baur

Universitätsstr. 6a
86159 Augsburg, Germany

Michael Dietz

Universitätsstr. 6a
86159 Augsburg, Germany

Dominik Schiller

Universitätsstr. 6a
86159 Augsburg, Germany

INRIA
Sophia Antipolis

Michal Balazia

2004, route des Lucioles
06902 Sophia Antipolis, France

François Brémond

2004, route des Lucioles
06902 Sophia Antipolis, France

MultiMediate Grand Challenge 2023

MultiMediate’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions

Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling

Proceedings of the 31st ACM International Conference on Multimedia, pp. 9640–9645, 2023.


doi: 10.1145/3581783.3613851

Paper: mueller23_mm.pdf

Paper Access: http://arxiv.org/abs/2308.08256

@inproceedings{mueller23_mm,
  title = {{MultiMediate}'23: {Engagement} {Estimation} and {Bodily} {Behaviour} {Recognition} in {Social} {Interactions}},
  shorttitle = {{MultiMediate}'23},
  author = {Müller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Schiller, Dominik and Guermal, Mohammed and Thomas, Dominike and Brémond, François and Alexandersson, Jan and André, Elisabeth and Bulling, Andreas},
  year = {2023},
  booktitle = {Proceedings of the 31st {ACM} {International} {Conference} on {Multimedia}},
  pages = {9640--9645},
  doi = {10.1145/3581783.3613851},
  url = {http://arxiv.org/abs/2308.08256},
  keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction},
  annote = {Comment: ACM MultiMedia'23}
}

Bodily Behaviour Recognition

Bodily behaviours like fumbling, gesturing or crossed arms are key signals in social interactions and are related to many higher-level attributes including liking, attractiveness, social verticality, stress and anxiety. While impressive progress has been made on human body and hand pose estimation, the recognition of such more complex bodily behaviours is still underexplored. With the bodily behaviour recognition task, we present the first challenge addressing this problem. We formulate bodily behaviour recognition as a 14-class multi-label classification task. This task is based on the recently released BBSI dataset (Balazia et al., 2022). Challenge participants will receive 64-frame video snippets as input and need to output a score indicating the likelihood of each behaviour class being present. To counter class imbalances, performance will be evaluated using macro-averaged average precision.
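As a rough illustration of the metric named above, the following Python sketch computes average precision per behaviour class and then takes the unweighted mean across classes. It is not the official evaluation script: the function names, the [sample][class] data layout, and the tie handling (ties keep their input order) are assumptions made for this example.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class: mean precision at each positive hit."""
    # Rank samples by descending score; Python's stable sort keeps the
    # input order for tied scores (an assumption of this sketch).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("class has no positive samples")
    return precision_sum / hits


def macro_average_precision(
    y_true: List[List[int]], scores: List[List[float]]
) -> float:
    """Unweighted mean of per-class average precision.

    Both arguments are indexed as [sample][class] (hypothetical layout).
    """
    n_classes = len(y_true[0])
    per_class = [
        average_precision([row[c] for row in y_true], [row[c] for row in scores])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy example with 3 snippets and 2 classes (the challenge uses 14 classes).
labels = [[1, 0], [0, 1], [1, 1]]
predicted = [[0.9, 0.5], [0.1, 0.8], [0.8, 0.3]]
macro_ap = macro_average_precision(labels, predicted)
```

Because every class contributes equally to the mean, rare behaviours weigh as much as frequent ones, which is the stated reason for choosing macro averaging.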

Engagement Estimation

Knowing how engaged participants are is important for a mediator whose goal is to keep engagement at a high level. Engagement is closely linked to the previous MultiMediate tasks of eye contact and backchannel detection. For the purpose of this challenge, we collected novel annotations of engagement on the Novice-Expert Interaction (NoXi) database (Cafaro et al., 2017). This database consists of dyadic, screen-mediated interactions focussed on information exchange. Interactions took place in several languages, and participants were recorded with video cameras and microphones. The task includes the continuous, frame-wise prediction of the level of conversational engagement of each participant on a continuous scale from 0 (lowest) to 1 (highest). Participants are encouraged to investigate multimodal as well as reciprocal behaviour of both interlocutors. We will use the Concordance Correlation Coefficient (CCC) to evaluate predictions.
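For readers unfamiliar with the CCC, a minimal Python sketch of its standard definition follows: twice the covariance of the two sequences, divided by the sum of their variances plus the squared difference of their means. The function name and the use of population (divide by n) statistics are assumptions of this example; the challenge's own evaluation code may differ in such details.

```python
from typing import Sequence


def concordance_correlation_coefficient(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> float:
    """CCC = 2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Population variance and covariance (divide by n), an assumption here.
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2 * cov / denom


# Frame-wise engagement values in [0, 1] for one participant (toy data).
annotated = [0.1, 0.5, 0.9]
shifted = [0.3, 0.7, 1.1]  # same shape as the annotation, constant offset
ccc_shifted = concordance_correlation_coefficient(annotated, shifted)
```

Unlike Pearson correlation, the CCC penalises a constant offset or a scale mismatch between prediction and annotation, so the shifted toy prediction scores below 1 even though it is perfectly correlated.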

Organisers

Cognitive Assistants
DFKI GmbH
Germany

Philipp Müller

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Jan Alexandersson

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Human-Computer Interaction and Cognitive Systems
University of Stuttgart
Germany

Andreas Bulling

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Dominike Thomas

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Human Centered Artificial Intelligence
Augsburg University
Germany

Elisabeth André

Universitätsstr. 6a
86159 Augsburg, Germany

Tobias Baur

Universitätsstr. 6a
86159 Augsburg, Germany

Michael Dietz

Universitätsstr. 6a
86159 Augsburg, Germany

Dominik Schiller

Universitätsstr. 6a
86159 Augsburg, Germany

INRIA
Sophia Antipolis

Michal Balazia

2004, route des Lucioles
06902 Sophia Antipolis, France

François Brémond

2004, route des Lucioles
06902 Sophia Antipolis, France

MultiMediate Grand Challenge 2022

MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions

Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling

arXiv:2209.09578, pp. 1–6, 2022.


Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.


Paper: mueller22_arxiv.pdf

Paper Access: http://arxiv.org/abs/2209.09578

@techreport{mueller22_arxiv,
  title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions},
  author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and Andr{\'{e}}, Elisabeth and Bulling, Andreas},
  year = {2022},
  pages = {1--6},
  url = {http://arxiv.org/abs/2209.09578}
}

The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions

Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, Michel Valstar

Proceedings of 19th ACM International Conference on Multimodal Interaction, pp. 350–359, 2017.


We present a novel multi-lingual database of natural dyadic novice-expert interactions, named NoXi, featuring screen-mediated dyadic human interactions in the context of information exchange and retrieval. NoXi is designed to provide spontaneous interactions with emphasis on adaptive behaviors and unexpected situations (e.g. conversational interruptions). A rich set of audio-visual data, as well as continuous and discrete annotations are publicly available through a web interface. Descriptors include low level social signals (e.g. gestures, smiles), functional descriptors (e.g. turn-taking, dialogue acts) and interaction descriptors (e.g. engagement, interest, and fluidity).

doi: 10.1145/3136755.3136780

Paper: cafaro17_icmi.pdf

@inproceedings{cafaro17_icmi,
  title = {The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions},
  author = {Cafaro, Angelo and Wagner, Johannes and Baur, Tobias and Dermouche, Soumia and Torres, Mercedes Torres and Pelachaud, Catherine and André, Elisabeth and Valstar, Michel},
  year = {2017},
  booktitle = {Proceedings of 19th ACM International Conference on Multimodal Interaction},
  doi = {10.1145/3136755.3136780},
  pages = {350--359}
}

Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation

Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August von Liechtenstein, François Brémond

Proceedings of the 30th ACM International Conference on Multimedia, pp. 70–79, 2022.


Body language is an eye-catching social signal and its automatic analysis can significantly advance artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress in low-level tasks like head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper we present BBSI, the first set of annotations of complex Bodily Behaviors embedded in continuous Social Interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatial-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module and Swin Transformer. Results are promising and indicate a great room for improvement in this difficult task. Representing a key piece in the puzzle towards automatic understanding of social behavior, BBSI is fully available to the research community.

doi: 10.1145/3503161.3548363

Paper: balazia22_mm.pdf

Paper Access: https://doi.org/10.1145/3503161.3548363

@inproceedings{balazia22_mm,
  author = {Balazia, Michal and M\"{u}ller, Philipp and T\'{a}nczos, \'{A}kos Levente and Liechtenstein, August von and Br\'{e}mond, Fran\c{c}ois},
  title = {Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation},
  year = {2022},
  url = {https://doi.org/10.1145/3503161.3548363},
  doi = {10.1145/3503161.3548363},
  booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
  pages = {70--79}
}

Backchannel Detection

Backchannels serve important meta-conversational purposes like signifying attention or indicating agreement. They can be expressed in a variety of ways - ranging from vocal behaviour (“yes”, “ah-ha”) to subtle nonverbal cues like head nods or hand movements. The backchannel detection sub-challenge focuses on classifying whether a participant of a group interaction expresses a backchannel at a given point in time. Challenge participants will be required to perform this classification based on a 10-second context window of audiovisual recordings of the whole group.

Agreement Estimation

A key function of backchannels is the expression of agreement or disagreement towards the current speaker. It is crucial for artificial mediators to have access to this information to understand the group structure and to intervene to avoid potential escalations. In this sub-challenge, participants will address the task of automatically estimating the amount of agreement expressed in a backchannel. In line with the backchannel detection sub-challenge, a 10-second audiovisual context window containing views on all interactants will be provided.

Organisers

Cognitive Assistants
DFKI GmbH
Germany

Philipp Müller

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Patrick Gebhard

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Hali Lindsay

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Human-Computer Interaction and Cognitive Systems
University of Stuttgart
Germany

Andreas Bulling

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Dominike Thomas

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Human Centered Artificial Intelligence
Augsburg University
Germany

Elisabeth André

Universitätsstr. 6a
86159 Augsburg, Germany

Dominik Schiller

Universitätsstr. 6a
86159 Augsburg, Germany

Michael Dietz

Universitätsstr. 6a
86159 Augsburg, Germany

MultiMediate Grand Challenge 2021

MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation

Philipp Müller, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Michael Dietz, Patrick Gebhard, Elisabeth André, Andreas Bulling

Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.


Artificial mediators are promising to support human group conversations but at present their abilities are limited by insufficient progress in group behaviour analysis. The MultiMediate challenge addresses, for the first time, two fundamental group behaviour analysis tasks in well-defined conditions: eye contact detection and next speaker prediction. For training and evaluation, MultiMediate makes use of the MPIIGroupInteraction dataset consisting of 22 three- to four-person discussions as well as of an unpublished test set of six additional discussions. This paper describes the MultiMediate challenge and presents the challenge dataset including novel fine-grained speaking annotations that were collected for the purpose of MultiMediate. Furthermore, we present baseline approaches and ablation studies for both challenge tasks.

doi: 10.1145/3474085.3479219

Paper: mueller21_mm.pdf

@inproceedings{mueller21_mm,
  title = {MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation},
  author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Zhang, Guanhua and Dietz, Michael and Gebhard, Patrick and Andr{\'{e}}, Elisabeth and Bulling, Andreas},
  year = {2021},
  pages = {4878--4882},
  doi = {10.1145/3474085.3479219},
  booktitle = {Proc. ACM Multimedia (MM)}
}

Eye Contact Detection Sub-challenge

This sub-challenge focuses on eye contact detection in group interactions from ambient RGB cameras. We define eye contact as a discrete indication of whether a participant is looking at another participant’s face, and if so, who this other participant is. Video and audio recordings over a 10-second context window will be provided as input to provide temporal context for the classification decision. Eye contact has to be detected for the last frame of this context window, making the task formulation also applicable to an online prediction scenario as encountered by artificial mediators.

Next Speaker Prediction Sub-challenge

In the next speaker prediction sub-challenge, approaches need to predict which members of the group will be speaking at a future point in time. Similar to the eye contact detection sub-challenge, video and audio recordings over a 10-second context window will be provided as input. Based on this information, approaches need to predict the speaking status of each participant at one second after the end of the context window.
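The timing relationship between the context window and the prediction target can be sketched in Python as below. The frame rate of 25 fps, the function name, and the convention of inclusive frame indices are hypothetical choices for this illustration; the text above only fixes the 10-second window and the one-second prediction horizon.

```python
from typing import NamedTuple


class WindowIndices(NamedTuple):
    start: int   # first frame of the context window (inclusive)
    end: int     # last frame of the context window (inclusive)
    target: int  # frame whose speaking status is to be predicted


def next_speaker_window(
    end_frame: int,
    fps: int = 25,              # assumed frame rate, not given in the text
    context_seconds: int = 10,  # context window length from the task description
    horizon_seconds: int = 1,   # prediction horizon from the task description
) -> WindowIndices:
    """Return the frame indices for one next speaker prediction sample."""
    context_frames = context_seconds * fps
    start = end_frame - context_frames + 1
    if start < 0:
        raise ValueError("not enough frames before end_frame for a full window")
    return WindowIndices(start, end_frame, end_frame + horizon_seconds * fps)


# A window ending at frame 499 covers frames 250..499 and targets frame 524.
sample = next_speaker_window(499)
```

Under the same assumptions, the eye contact detection sub-challenge would use the same window with the target set to the window's last frame, i.e. a horizon of zero seconds.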

Evaluation of Participants’ Approaches

For the purpose of this challenge, we model the next speaker prediction problem as a multi-label problem. Hence, a model for this task should predict a binary value (speaking = 1, not-speaking = 0) for each participant for a given sample. As a metric to compare the submitted models, we will use the unweighted average recall over all samples (see the scikit-learn recall_score(y_true, y_pred, average='macro') function).
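To make the metric concrete, here is a dependency-free Python sketch of unweighted average recall. It assumes the per-participant binary labels have been flattened into one list, so that recall is computed separately for the speaking and not-speaking classes and then averaged; whether the official evaluation flattens labels this way is not stated in the text, and the function name is hypothetical.

```python
from typing import Sequence


def unweighted_average_recall(
    y_true: Sequence[int], y_pred: Sequence[int], classes: Sequence[int] = (0, 1)
) -> float:
    """Mean of per-class recall, with every class weighted equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have equal length")
    recalls = []
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        if support == 0:
            # Classes absent from the ground truth are skipped in this sketch.
            continue
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(correct / support)
    if not recalls:
        raise ValueError("no class from `classes` occurs in y_true")
    return sum(recalls) / len(recalls)


# Speaking status (1 = speaking) for two samples of a four-person group,
# flattened participant by participant (toy data).
truth = [1, 0, 0, 0, 1, 0, 0, 1]
guess = [1, 0, 0, 1, 0, 0, 0, 1]
uar = unweighted_average_recall(truth, guess)
```

Averaging per-class recall keeps the score from being dominated by the not-speaking class, which is typically far more frequent than the speaking class in group discussions.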

For the eye contact detection task, the problem is modeled as a multi-class problem. Given a specific participant, a submitted model should predict with which other participant he or she is making eye contact. The task is modeled using five classes: one for each participant's position (classes 1-4) and an additional class for no eye contact (class 0). To evaluate performance on this task, we will use accuracy as a metric (see the scikit-learn accuracy_score(y_true, y_pred) function).
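The five-class encoding and the accuracy metric described above can be illustrated with a short Python sketch. The constant and function names are hypothetical, and the label validation is an addition of this example rather than something the challenge specifies.

```python
from typing import Sequence

NO_EYE_CONTACT = 0            # class 0: no eye contact
POSITION_CLASSES = (1, 2, 3, 4)  # classes 1-4: one per participant position
VALID_CLASSES = (NO_EYE_CONTACT,) + POSITION_CLASSES


def eye_contact_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class equals the annotated class."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    for label in list(y_true) + list(y_pred):
        if label not in VALID_CLASSES:
            raise ValueError(f"unexpected class label: {label}")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy example: five samples for one participant, three predicted correctly.
annotated_targets = [0, 2, 3, 1, 0]
predicted_targets = [0, 2, 4, 1, 1]
accuracy = eye_contact_accuracy(annotated_targets, predicted_targets)
```

Plain accuracy treats all five classes alike, so a model that always predicts the most frequent class can still score well if the class distribution is skewed.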

Participants will receive training and validation data that can be used to build solutions for each sub-challenge (eye contact detection and next speaker prediction). The evaluation of these approaches will then be performed remotely on our side with the unpublished test portion of the dataset. For that, participants will create and upload Docker images with their solutions, which are then evaluated on our systems (for more information regarding the process visit this link).

Organisers

Cognitive Assistants
DFKI GmbH
Germany

Philipp Müller

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Patrick Gebhard

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany

Human-Computer Interaction and Cognitive Systems
University of Stuttgart
Germany

Andreas Bulling

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Dominike Thomas

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Guanhua Zhang

Pfaffenwaldring 5a
70569 Stuttgart, Germany

Human Centered Multimedia
Augsburg University
Germany

Elisabeth André

Universitätsstr. 6a
86159 Augsburg, Germany

Dominik Schiller

Universitätsstr. 6a
86159 Augsburg, Germany

Michael Dietz

Universitätsstr. 6a
86159 Augsburg, Germany
