MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation

MultiMediate Grand Challenge 2021

  1. MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation

    Philipp Müller, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Michael Dietz, Patrick Gebhard, Elisabeth André, Andreas Bulling

    Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.

Eye Contact Detection Sub-challenge

This sub-challenge focuses on eye contact detection in group interactions from ambient RGB cameras. We define eye contact as a discrete indication of whether a participant is looking at another participant's face and, if so, which participant that is. Video and audio recordings over a 10-second context window are provided as input to give temporal context for the classification decision. Eye contact has to be detected for the last frame of this context window, which makes the task formulation applicable to an online prediction scenario as encountered by artificial mediators.

Next Speaker Prediction Sub-challenge

In the next speaker prediction sub-challenge, approaches need to predict which members of the group will be speaking at a future point in time. As in the eye contact detection sub-challenge, video and audio recordings over a 10-second context window are provided as input. Based on this information, approaches need to predict the speaking status of each participant one second after the end of the context window.
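The timing of the two tasks can be illustrated with a minimal sketch. The function name, its parameters, and the example label time of 30 seconds are hypothetical; only the 10-second window length and the offsets (last frame for eye contact, one second ahead for next speaker) come from the task descriptions.

```python
def context_window(label_time_s, horizon_s, window_s=10.0):
    """Return (start, end) in seconds of the input window for a given label time.

    label_time_s: time at which the label is defined (hypothetical input)
    horizon_s:    gap between the end of the window and the label time
    window_s:     length of the context window (10 s in this challenge)
    """
    end = label_time_s - horizon_s
    start = end - window_s
    if start < 0:
        raise ValueError("not enough recording before the label time")
    return start, end


# Eye contact: the label refers to the last frame of the window (no gap).
eye_contact_window = context_window(30.0, horizon_s=0.0)

# Next speaker: the label lies one second after the end of the window.
next_speaker_window = context_window(30.0, horizon_s=1.0)

print(eye_contact_window, next_speaker_window)
```

In both cases the model only sees data up to the end of the window, which is what makes the setup usable for online prediction.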

Evaluation of Participants’ Approaches

For the purpose of this challenge, we model the next speaker prediction problem as a multi-label problem. A model for this task should therefore predict a binary value (speaking = 1, not speaking = 0) for each participant in a given sample. To compare submitted models, we will use the unweighted average recall over all samples (see the scikit-learn function recall_score(y_true, y_pred, average='macro')).
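As a rough illustration of this metric, the sketch below computes unweighted average recall in plain Python. It assumes that the per-participant binary labels of all samples are flattened into a single vector before scoring, which is an assumption and not necessarily how the organisers aggregate them. The function name and the toy labels are hypothetical. For binary labels where both classes occur in y_true, the result should match scikit-learn's recall_score with average='macro'.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with every class weighted equally."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    totals = defaultdict(int)  # number of true instances per class
    hits = defaultdict(int)    # correctly predicted instances per class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    if not totals:
        raise ValueError("y_true must not be empty")
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical toy data: 2 samples x 4 participants, flattened (assumption).
y_true = [1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

uar = unweighted_average_recall(y_true, y_pred)
print(f"UAR = {uar:.3f}")
```

Because most participants are silent at any given moment, averaging recall over the two classes keeps a model that always predicts "not speaking" from scoring well.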

For the eye contact detection task, the problem is modeled as a multi-class problem. Given a specific participant, a submitted model should predict which other participant he or she is making eye contact with. The task uses five classes: one for each participant's position (classes 1-4) and an additional class for no eye contact (class 0). To evaluate performance on this task, we will use accuracy as the metric (see the scikit-learn function accuracy_score(y_true, y_pred)).
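The five-class labelling and the accuracy metric can be sketched as follows. This is a plain-Python stand-in for scikit-learn's accuracy_score, not the organisers' evaluation code; the function name, the constants, and the toy labels are hypothetical.

```python
NO_EYE_CONTACT = 0           # class 0: the participant looks at nobody's face
POSITIONS = (1, 2, 3, 4)     # classes 1-4: one per participant position
VALID_CLASSES = {NO_EYE_CONTACT, *POSITIONS}


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("y_true must not be empty")
    if not set(y_true) | set(y_pred) <= VALID_CLASSES:
        raise ValueError("labels must be in the range 0-4")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: five samples for one observed participant.
y_true = [0, 2, 3, 0, 1]
y_pred = [0, 2, 4, 0, 0]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
```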

Participants will receive training and validation data that can be used to build solutions for each sub-challenge (eye contact detection and next speaker prediction). The approaches will then be evaluated remotely on our side using the unpublished test portion of the dataset. To this end, participants will create and upload Docker images containing their solutions, which are then evaluated on our systems (for more information regarding the process visit this link).

Organisers

Cognitive Assistants
DFKI GmbH
Germany

Philipp Müller

Stuhlsatzenhausweg 3
D-66123 Saarbrücken, Germany


Patrick Gebhard

Stuhlsatzenhausweg 3
66123 Saarbrücken, Germany


Human-Computer Interaction and Cognitive Systems
University of Stuttgart
Germany

Andreas Bulling

Pfaffenwaldring 5a
70569 Stuttgart, Germany


Dominike Thomas

Pfaffenwaldring 5a
70569 Stuttgart, Germany


Guanhua Zhang

Pfaffenwaldring 5a
70569 Stuttgart, Germany


Human Centered Multimedia
Augsburg University
Germany

Elisabeth André

Universitätsstr. 6a
86159 Augsburg, Germany


Dominik Schiller

Universitätsstr. 6a
86159 Augsburg, Germany


Michael Dietz

Universitätsstr. 6a
86159 Augsburg, Germany