Robust Factor Graph Attention
Overview
Visual Dialog is a challenging problem in which models must not only reason over multiple modalities but also maintain context in the form of dialog history to answer the given query. It is a more natural extension of the Visual Question Answering task, as it allows multi-round communication with the agent, and significant progress has been made in this domain. In this paper, we report the results of two baselines on the VisDial dataset and discuss their current challenges. We attempt to address these challenges through a combination of methods and design choices, such as a contrastive loss formulation, data-augmentation strategies, and the generation of unimodal and multimodal heuristic scores during training. Our goal is thus to make VisDial models more robust and accurate for general use.
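To make the contrastive loss formulation concrete, the sketch below shows one plausible instantiation for the discriminative VisDial setting, where a fused question/history/image representation is scored against the 100 candidate answers and trained with an InfoNCE-style cross-entropy over candidates. The tensor shapes, function name, and temperature value are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch (PyTorch) of a candidate-ranking contrastive loss for VisDial.
# Assumptions: a fused multimodal "context" vector per dialog round and
# embeddings for the 100 answer options; the ground-truth option is the
# positive and the remaining 99 act as in-batch negatives.
import torch
import torch.nn.functional as F

def contrastive_candidate_loss(context, candidates, gt_index, temperature=0.07):
    """
    context:    (batch, dim)        fused question/history/image representation
    candidates: (batch, 100, dim)   embeddings of the 100 answer options
    gt_index:   (batch,)            index of the ground-truth answer option
    """
    context = F.normalize(context, dim=-1)
    candidates = F.normalize(candidates, dim=-1)

    # Cosine similarity between the context and every candidate answer.
    logits = torch.einsum("bd,bkd->bk", context, candidates) / temperature

    # Cross-entropy over candidates: pull the ground-truth answer toward the
    # context while pushing the other candidates away.
    return F.cross_entropy(logits, gt_index)
```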