Recently, the number of devices equipped with both audio and video sensors (wearable cameras, smartphones) has been growing steadily, leading to increasing interest in exploiting joint audio-video processing to address single-modal challenges and to open new application contexts.
However, the audio and video communities still remain rather isolated, and efforts are often driven by specific needs (a project or a challenge), focusing on rather simple combinations of the multi-modal information, such as early feature stacking or late score combination. These approaches do not really exploit the potential of multimodal information: early fusion cannot deal with intermittent features, while late fusion simply skips the unreliable modality. Moreover, neither method employs multimodality to reduce model mismatch or to allow training on very limited amounts of data. Conversely, an effective joint processing of the audio and video information, fully aware of the complementarity of the two modalities, can help circumvent the critical issues of each single modality, leading to more robust and smarter systems.
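To make the limitations of the two baseline schemes concrete, here is a minimal sketch (not the project's actual pipeline; feature dimensions and the fixed weight `w` are illustrative assumptions) contrasting early feature stacking with late score combination:

```python
import numpy as np

def early_fusion(audio_feat, video_feat):
    """Early fusion: stack the per-frame feature vectors before a single
    classifier. This breaks down when one modality is intermittent,
    because the stacked vector is undefined whenever a part is missing."""
    return np.concatenate([audio_feat, video_feat], axis=-1)

def late_fusion(audio_scores, video_scores, w=0.5):
    """Late fusion: linearly combine per-class scores of two independently
    trained classifiers. An unreliable modality is merely down-weighted
    (or skipped entirely with w = 0 or 1); its model is never improved."""
    return w * audio_scores + (1.0 - w) * video_scores

# Toy example with two classes.
audio_scores = np.array([0.7, 0.3])
video_scores = np.array([0.4, 0.6])
print(late_fusion(audio_scores, video_scores))  # [0.55 0.45]
```

Note that in both schemes the single-modal models themselves are left untouched, which is exactly the gap the cross-modal adaptation described below addresses.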
Some examples are the following:
- Target identification, in particular for egocentric data, which raises specific issues not addressed by traditional systems.
- Speaker diarization, since in many cases (e.g. recordings of meetings, public debates, TV talk-shows, etc.) information about "who is speaking and when" can be extracted from both audio and video data, thus improving both speech segmentation and speaker clustering.
- Automatic indexing of large audio/video archives, since the detection of important events characterizing a given audio/video recording, i.e. events containing specific noises, keywords, background audio, etc. (e.g. important events in sports competitions, news reporting situations in particular environments, etc.), can significantly improve if audio/video information is used in a reinforcement learning paradigm.
Multi-modal Model Adaptation for Target ID on Egocentric Data
As a first application scenario, we consider target (person) identification, one of the most common tasks in audio-video processing, which can greatly benefit from a more advanced integration than score combination or feature stacking. The idea is to move beyond traditional combination methods and their limitations by introducing cross-modal aware processing, where the processing and modelling of one modality depend on the information provided by the other sensor.
We developed a multimodal unsupervised continuous adaptation of each single-modal model using the information from the other modality: as soon as a new observation is available, the related models (audio and/or video) are updated to mitigate possible mismatches and to improve weak models, in a sort of on-line Co-EM scheme.
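The on-line scheme above can be sketched as follows. This is a hedged toy illustration, not the published method: the per-class centroid models, the margin-based confidence, the threshold `thr`, and the learning rate `lr` are all simplifying assumptions standing in for the actual audio and video models.

```python
import numpy as np

class ModalityModel:
    """Toy per-class centroid model for one modality (audio or video)."""
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids, dtype=float)

    def scores(self, x):
        # Negative Euclidean distances used as per-class similarity scores.
        return -np.linalg.norm(self.centroids - x, axis=1)

    def predict(self, x):
        s = self.scores(x)
        idx = int(np.argmax(s))
        # Confidence: margin between the best and second-best score.
        margin = np.sort(s)[-1] - np.sort(s)[-2]
        return idx, margin

    def adapt(self, x, label, lr=0.2):
        # On-line update: move the labelled centroid towards the new
        # observation, mitigating model mismatch over time.
        self.centroids[label] += lr * (x - self.centroids[label])

def cross_modal_step(audio, video, a_obs, v_obs, thr=0.5):
    """One on-line co-EM-style step: as soon as a new observation pair
    arrives, each model is adapted with the label predicted by the
    *other* modality, but only when that modality is confident enough."""
    a_lab, a_conf = audio.predict(a_obs)
    v_lab, v_conf = video.predict(v_obs)
    if v_conf > thr:
        audio.adapt(a_obs, v_lab)   # video supervises audio
    if a_conf > thr:
        video.adapt(v_obs, a_lab)   # audio supervises video
```

A usage example: starting from two-class models `ModalityModel([[0.0], [10.0]])` for each modality, feeding an observation pair near class 0 pulls both class-0 centroids towards the new data, so a weak or mismatched model in one modality is gradually corrected by the other.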
We focused on egocentric data acquired by wearable devices. Egocentric data are particularly interesting because they pose a series of severe challenges (rapidly varying environmental conditions, limited training material, highly intermittent features, very short interactions), which highlight the potential of the complementarity of audio-visual data.
More details about the data and the processing chain are available in the following publications:
A. Brutti, A. Cavallaro, “On-line cross-modal adaptation for audio-visual person identification with wearable cameras”, IEEE Transactions on Human-Machine Systems, to appear, 2016
A. Brutti, A. Cavallaro, “Unsupervised cross-modal deep-model adaptation for audio-visual re-identification with wearable cameras”, ICCV Workshop CVAVM, 2017