MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation

MultiMediate 2024

Here, we introduce the different challenge tasks, evaluation methodology and rules for participation.

Baseline approaches are available at

Multi-Domain Engagement Estimation

Knowing how engaged participants are is important for a mediator whose goal is to keep engagement at a high level. Engagement is closely linked to the previous MultiMediate tasks of eye contact and backchannel detection. For the purpose of this challenge, we collected novel annotations of engagement on the Novice-Expert Interaction (NoXi) database (Cafaro et al., 2017). This database consists of dyadic, screen-mediated interactions focussed on information exchange. Interactions took place in several languages, and participants were recorded with video cameras and microphones. The task is the frame-wise prediction of each participant's level of conversational engagement on a continuous scale from 0 (lowest) to 1 (highest). Participants are encouraged to investigate multimodal as well as reciprocal behaviour of both interlocutors. We will use the Concordance Correlation Coefficient (CCC) to evaluate predictions.

The overall performance of a team will be evaluated by taking the average CCC across four different test datasets. Of these four datasets, two will include validation sets that will be made available to participants (please sign up to our mailing list to be notified):

  • NoXi (MultiMediate’23 version): This part of the evaluation set is identical to the test set of MultiMediate’23 and consists of 16 sessions (in English, French and German). It comes from the same domain as the training set, so it serves as a reference for comparing MultiMediate’24 submissions to MultiMediate’23 results, as well as a point of comparison for evaluating the impact of out-of-domain test scenarios on performance.
  • NoXi (additional languages): This evaluation set includes four languages that are not part of the NoXi training set: two sessions in Arabic, two in Italian, four in Indonesian, and four in Spanish. As a result, this evaluation set tests the ability of participants' approaches to transfer to new languages and cultural backgrounds not seen at training time.
  • MPIIGroupInteraction: For MultiMediate’24, we collected novel engagement annotations on the MPIIGroupInteraction test and validation sets. The validation set with ground truth annotations will be provided to participants to monitor their performance on the out-of-domain task. In addition, it may be used as a limited set of training data to develop supervised domain adaptation approaches.
  • Speed Dating Dataset: The as yet unpublished Speed Dating dataset was recorded at DFKI and consists of dyadic speed dates conducted with a videoconferencing tool. Each speed date lasted approximately 6 minutes, and video and audio recordings were captured with the cameras and microphones of the participants' devices. The dataset represents a challenging addition to the other lab-based evaluation sets, as it was recorded in an unrestricted in-the-wild setting. Participants took part in the data collection with their private devices such as laptops and smartphones, and the lighting conditions and the position of the participant relative to the device were not controlled. In most cases, the recorded videos primarily show the face, so we do not provide pose information for this data. As for MPIIGroupInteraction, we will make a validation set available to participants.
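
The scoring described above (CCC per test dataset, then the average over datasets) can be sketched as follows. This is a minimal illustration and not the official evaluation script: it uses the standard definition of the CCC with population variances, and the function names, the handling of constant inputs, and the choice to concatenate all frames of a dataset into one sequence are assumptions.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    # Population (1/N) variance and covariance.
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # Assumption: return 0.0 when both sequences are constant and identical.
    return 2 * cov / denom if denom > 0 else 0.0


def mean_ccc_across_datasets(per_dataset):
    """per_dataset maps a dataset name to (ground_truth, predictions).

    Returns the unweighted mean CCC and the per-dataset scores.
    """
    scores = {name: ccc(t, p) for name, (t, p) in per_dataset.items()}
    return fmean(scores.values()), scores


# Toy example with two hypothetical datasets.
overall, per_set = mean_ccc_across_datasets({
    "in_domain": ([0.0, 0.5, 1.0], [0.0, 0.5, 1.0]),   # perfect agreement
    "out_of_domain": ([0.0, 1.0], [1.0, 2.0]),          # constant offset of 1
})
print(per_set, overall)
```

Unlike Pearson correlation, the CCC penalises a constant offset or scale mismatch between predictions and annotations, which is why the second toy dataset scores well below 1 even though its predictions are perfectly correlated with the ground truth.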

Continuing MultiMediate Tasks

In addition to the task described above, we also invite submissions to the three most popular tasks included in MultiMediate’21-’23.

Bodily Behaviour Recognition

Bodily behaviours like fumbling, gesturing or crossed arms are key signals in social interactions and are related to many higher-level attributes including liking, attractiveness, social verticality, stress and anxiety. While impressive progress has been made on human body and hand pose estimation, the recognition of such more complex bodily behaviours is still underexplored. With the bodily behaviour recognition task, we present the first challenge addressing this problem. We formulate bodily behaviour recognition as a 14-class multi-label classification problem. This task is based on the recently released BBSI dataset (Balazia et al., 2022). Challenge participants will receive 64-frame video snippets as input and need to output a score indicating the likelihood of each behaviour class being present. To counter class imbalances, performance will be evaluated using macro-averaged average precision.
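
To illustrate the metric, the sketch below computes average precision per behaviour class from ranked scores and then takes the unweighted mean over classes. It is an assumption-laden illustration rather than the official scorer: the function names are hypothetical, tied scores are ranked in input order, and classes without any positive example are skipped.

```python
def average_precision(labels, scores):
    """Non-interpolated average precision for one class.

    labels: 0/1 ground truth per snippet; scores: predicted likelihood per snippet.
    Returns None if the class has no positive example.
    """
    # Rank snippets by descending score (ties keep input order; an assumption).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else None


def macro_average_precision(label_matrix, score_matrix):
    """Unweighted mean of per-class AP; rows are snippets, columns are classes."""
    n_classes = len(label_matrix[0])
    aps = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in label_matrix],
                               [row[c] for row in score_matrix])
        if ap is not None:  # assumption: skip classes with no positives
            aps.append(ap)
    if not aps:
        raise ValueError("no class has a positive example")
    return sum(aps) / len(aps)


# Toy example: 4 snippets, 2 classes (the task itself has 14 classes).
y_true = [[1, 0], [0, 1], [1, 0], [0, 0]]
y_score = [[0.9, 0.2], [0.8, 0.6], [0.7, 0.1], [0.1, 0.3]]
print(macro_average_precision(y_true, y_score))
```

Because every class contributes equally to the mean, a rare behaviour weighs as much as a frequent one, which is the sense in which macro averaging counters class imbalance.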

Backchannel Detection (MultiMediate’22 task)

Backchannels serve important meta-conversational purposes like signifying attention or indicating agreement. They can be expressed in a variety of ways - ranging from vocal behaviour (“yes”, “ah-ha”) to subtle nonverbal cues like head nods or hand movements. The backchannel detection sub-challenge focuses on classifying whether a participant of a group interaction expresses a backchannel at a given point in time. Challenge participants will be required to perform this classification based on a 10-second context window of audiovisual recordings of the whole group. Approaches will be evaluated using classification accuracy.

Eye Contact Detection (MultiMediate’21 task)

We define eye contact as a discrete indication of whether a participant is looking at another participant’s face, and if so, who this other participant is. Video and audio recordings over a 10-second context window will be provided as input to provide temporal context for the classification decision. Eye contact has to be detected for the last frame of the 10-second context window. In the next speaker prediction sub-challenge, participants need to predict the speaking status of each participant one second after the end of the context window. Approaches will be evaluated using classification accuracy.

Evaluation of Participants’ Approaches

Training and validation data for each sub-challenge can be downloaded at We will provide baseline implementations along with pre-computed features to minimise the overhead for participants. For the tasks newly included in this year's challenge, the test set (without ground truth) will be released two weeks before the challenge deadline. Participants will in turn submit their predictions for evaluation (details will follow).

We will evaluate approaches with the following metrics: accuracy for backchannel detection and eye contact estimation, mean squared error for agreement estimation from backchannels, and unweighted average recall for next speaker prediction.
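
For reference, the three metrics named above can be sketched in a few lines. This is a minimal illustration under assumptions, not the evaluation server's code: the function names are hypothetical, and unweighted average recall is computed over the classes that occur in the ground truth.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall; each class counts equally regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):  # assumption: classes taken from ground truth
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy example: a classifier that always predicts the majority class 1.
labels = [1, 1, 1, 0]
preds = [1, 1, 1, 1]
print(accuracy(labels, preds))                   # 0.75
print(unweighted_average_recall(labels, preds))  # 0.5
print(mean_squared_error([0.0, 1.0], [0.5, 0.5]))  # 0.25
```

The toy example shows why unweighted average recall is the stricter choice under class imbalance: always predicting the majority class reaches 0.75 accuracy but only 0.5 unweighted average recall.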

Rules for participation

  • The competition is team-based. A single person can only be part of a single team.
  • For bodily behaviour recognition and engagement estimation tasks, each team will have 5 evaluation runs on the test set (per task).
  • For the tasks that were already included in MultiMediate’21-’23, three evaluations on the test set are allowed per month. In July 2024, we will make an exception and allow for five evaluations on the test set.
  • Additional datasets can be used, but they need to be publicly available.
  • The Organisers will not participate in the challenge.
  • For awarding certificates for 1st, 2nd and 3rd place in each sub-challenge, we will only consider approaches that are described in accepted papers submitted to the ACM MM Grand Challenge track.
  • The evaluation servers will be open until the paper submission deadline (12 July 2024).
  • The test set (without labels) will be provided to participants two weeks before the challenge deadline. Manually annotating the test set is not allowed.

References

  1. Michal Balazia, Philipp Müller, Ákos Levente Tánczos, August von Liechtenstein, François Brémond. Bodily Behaviors in Social Interaction: Novel Annotations and State-of-the-Art Evaluation. Proceedings of the 30th ACM International Conference on Multimedia, pp. 70–79, 2022.
  2. Angelo Cafaro, Johannes Wagner, Tobias Baur, Soumia Dermouche, Mercedes Torres Torres, Catherine Pelachaud, Elisabeth André, Michel Valstar. The NoXi Database: Multimodal Recordings of Mediated Novice-Expert Interactions. Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 350–359, 2017.