Current (as of 06/2020)
See below for older code and project pages.
Vision and Language
Multimodal
- MMF: A modular framework for vision & language multimodal research from Facebook AI Research (FAIR).
- 12-in-1: Multi-Task Vision and Language Representation Learning [pdf] [code] [CVPR presentation] [Demo from Cloud CV]
Images with Text
- TextVQA: answering questions about images with text
- Dataset
- Code: M4C (CVPR 2020)
- TextCaps: describing images with text
- Dataset
- Code: M4C-Captioner
VQA
- In Defense of Grid Features for Visual Question Answering [pdf] [code] [CVPR presentation] (see the sketch after this list)
- See also TextVQA and Multimodal above
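The core idea of the grid-features paper is that a plain convolutional feature map, taken from the right layer at a high enough resolution, can replace detector-based region features in VQA models. Below is a minimal PyTorch sketch of that extraction step only; the linked code is the real pipeline, and the backbone, layer choice, and input size here are illustrative assumptions.

```python
import torch
import torchvision.models as models

# Use the final conv feature map of an ImageNet-pretrained ResNet-50 as
# "grid features": an H x W grid of 2048-d vectors, no object detector needed.
backbone = models.resnet50(pretrained=True)
grid_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
grid_extractor.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 448, 448)      # stand-in for a preprocessed image
    grid = grid_extractor(image)             # (1, 2048, 14, 14) feature grid
    # Flatten the grid into a set of "region-like" vectors, which can be fed
    # to a region-based VQA model in place of detector boxes.
    feats = grid.flatten(2).transpose(1, 2)  # (1, 196, 2048)

print(feats.shape)
```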
Video Description
- ActivityNet Entities
Long-tail
Continual Learning
Older
Visual Question Answering
- Multimodal Compact Bilinear Pooling (VQA 2016 challenge winner): Code and Demo (see the sketch after this list)
- Neural Module Networks (EMNLP 2016 best paper award): Code
- LSTM-based encoder-decoder approach: Code and DAQUAR dataset
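Multimodal Compact Bilinear Pooling approximates the outer product of the image and question features without ever materializing it: each vector is hashed into a Count Sketch, and the two sketches are convolved via an element-wise product in FFT space. Below is a minimal NumPy sketch of that fusion step; the linked code is the full model, the 16,000-d output follows the paper, and all names here are illustrative.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x to d dims with a Count Sketch given hash buckets h and signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)  # scatter-add each signed input coordinate
    return y

def mcb(v, q, d=16000, seed=0):
    """Compact bilinear pooling of v and q: convolve their Count Sketches,
    computed as an element-wise product in the Fourier domain."""
    rng = np.random.default_rng(seed)
    sketches = []
    for x in (v, q):  # independent hash/sign functions per modality
        h = rng.integers(0, d, size=x.shape[0])
        s = rng.choice([-1.0, 1.0], size=x.shape[0])
        sketches.append(np.fft.rfft(count_sketch(x, h, s, d)))
    return np.fft.irfft(sketches[0] * sketches[1], n=d)

# Toy usage: fuse a 2048-d image feature with a 300-d question feature.
fused = mcb(np.random.randn(2048), np.random.randn(300))
print(fused.shape)  # (16000,) joint representation
```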
Grounding
Image and video description
- Recurrent models for computer vision: LRCN
- Video description: S2VT (see the sketch after this list)
- A Movie Description Dataset
- Detailed and summary description for MPII Cooking 2: TACoS Multilevel
- Captioning novel objects in images and videos: DCC
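S2VT treats video description as sequence-to-sequence translation from per-frame CNN features to words. In the paper a single stacked LSTM is shared between the encoding and decoding phases; the sketch below uses the simpler two-LSTM encoder-decoder variant of the same idea, with hypothetical dimensions and names throughout.

```python
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    """Encoder-decoder captioner: one LSTM reads the frame features,
    a second LSTM emits the sentence (simplified relative to S2VT)."""
    def __init__(self, feat_dim=4096, hidden=512, vocab=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, captions):
        # frames: (B, T, feat_dim) CNN features, one per sampled frame
        _, state = self.encoder(frames)           # keep only the final state
        dec_in = self.embed(captions)             # (B, L, hidden), teacher forcing
        dec_out, _ = self.decoder(dec_in, state)  # decoder conditioned on the video
        return self.out(dec_out)                  # (B, L, vocab) word logits

model = Seq2SeqCaptioner()
logits = model(torch.randn(2, 30, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```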
Visual Knowledge Transfer with linguistic knowledge
Activity Recognition
Older versions:
- MPII Cooking Activities Dataset (includes a human pose challenge)
- MPII Cooking Composite Activities