Review of Paper *Multimodal Machine Learning: A Survey and Taxonomy*

The paper identifies five broad challenges faced by multimodal machine learning, namely:

  • representation (how to represent multimodal data)
  • translation (how to map data from one modality to another)
  • alignment (how to identify relations between modalities)
  • fusion (how to join semantic information from different modalities)
  • co-learning (how to transfer knowledge from models/representations of one modality to another)

This part of the paper covers the alignment, fusion, and co-learning challenges for multimodal learning. These sections do a good job of surveying the older methods used to tackle each challenge, along with their pros and cons, and the authors also discuss newer methods and where room for improvement remains.

Alignment: The paper shows that earlier work on multimodal alignment focused on unsupervised approaches (without direct alignment labels) using graphical models and dynamic programming techniques such as dynamic time warping (DTW). Both families come with restrictions: they assume temporal consistency between modalities, monotonic alignments, no large jumps in time, and so on. With DTW, the similarity metric and the alignment can be learned jointly, whereas graphical models require expert knowledge to construct. With the growing availability of aligned datasets, deep learning is now widely applied to perform explicit alignment. Attention mechanisms in particular have proven helpful, letting the model focus on sub-components of the source instance and easing the burden on the encoder to compress everything into a single representation.
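To make the DTW constraints concrete, here is a minimal NumPy sketch of DTW alignment between two feature sequences; the sequence lengths and feature dimensions are synthetic placeholders, not anything from the paper. The recursion only permits monotonic, single-step moves, which is exactly the "no large jumps in time" restriction mentioned above.

```python
import numpy as np

def dtw(x, y):
    """Align sequences x (n, d) and y (m, d); return total cost and warping path."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local (Euclidean) dissimilarity
            # Monotonic recursion: match, step in x, or step in y — no large jumps.
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return cost[n, m], path[::-1]

# Hypothetical modality features, e.g., audio frames vs. video frames.
audio = np.random.randn(50, 16)
video = np.random.randn(40, 16)
total_cost, alignment = dtw(audio, video)
```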

Fusion: The paper next shows that fusion is a widely researched topic, tackled with multiple approaches: model-agnostic methods (early, hybrid, and late fusion), graphical models, multiple kernel learning (MKL), and various types of neural networks. Each method has its own shortcomings; graphical models and MKL are suited to scenarios with limited training data, while neural networks shine when learning complex decision boundaries on large datasets.
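As an illustration of the model-agnostic methods, here is a minimal sketch contrasting early fusion (concatenate modality features, train one model) with late fusion (train one model per modality, combine their predictions). The feature matrices, labels, and classifier choice are all hypothetical placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_text = rng.normal(size=(200, 32))    # hypothetical modality 1 features
X_image = rng.normal(size=(200, 64))   # hypothetical modality 2 features
y = rng.integers(0, 2, size=200)       # synthetic binary labels

# Early fusion: concatenate features from both modalities, train a single model.
early = LogisticRegression(max_iter=1000).fit(np.hstack([X_text, X_image]), y)

# Late fusion: train one model per modality, then average their probabilities.
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
clf_image = LogisticRegression(max_iter=1000).fit(X_image, y)
late_proba = (clf_text.predict_proba(X_text) + clf_image.predict_proba(X_image)) / 2
```

A hybrid scheme would sit between the two, e.g., combining the early-fusion model's outputs with the per-modality models' outputs.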

Co-learning: This section describes the challenge of co-learning, in which one modality aids another with semantic information during training; this can in turn help with the previously described challenges (fusion, translation, and alignment). The paper discusses techniques for co-learning such as co-training, multimodal representation learning, conceptual grounding, and zero-shot learning, along with applications like visual classification and action recognition.
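As one concrete instance of co-learning, here is a minimal zero-shot classification sketch in which per-class attribute vectors act as a shared semantic space: a regressor trained only on seen classes can label an unseen class by nearest attribute vector. All data, class names, and attribute values below are synthetic assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Hypothetical per-class semantic attributes (the shared space).
attributes = {"cat": [1, 0, 1], "dog": [1, 1, 0], "zebra": [0, 1, 1]}

# Train a projection from visual features to attribute space on SEEN classes only.
X_seen = rng.normal(size=(100, 20))
y_seen = rng.choice(["cat", "dog"], size=100)
A_seen = np.array([attributes[c] for c in y_seen])
proj = Ridge().fit(X_seen, A_seen)

# At test time, classify by nearest attribute vector — including UNSEEN classes.
x_test = rng.normal(size=(1, 20))
pred_attr = proj.predict(x_test)
classes = list(attributes)
dists = [np.linalg.norm(pred_attr - np.array(attributes[c])) for c in classes]
print("predicted:", classes[int(np.argmin(dists))])  # can predict "zebra" with no zebra training data
```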

Discussion: The paper does an excellent job of highlighting the challenges in multimodal machine learning. Since these methods are applied in different ways across applications, it is not possible to benchmark progress on solving, say, "alignment" in isolation. Still, I feel it would have been nice to see an empirical study comparing DTW, graphical models, and CNNs for alignment on a single dataset, just to showcase the overall progress.

Also, this paper led me to read up on recent work on graph neural networks that learn longer temporal dependencies in long, unaligned sequences (link), and this link highlights the practical challenge of incomplete multimodal data and then proposes a unique strategy for learning using Heterogeneous Graph-based Multimodal Fusion.