

Bottom-Up and Top-Down Attention for Visual Question Answering

A TensorFlow implementation of the winning entry of the 2017 VQA Challenge. The model details are described in the paper "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge".
This implementation is inspired by the PyTorch implementation by hengyuan-hu.
This code was written in collaboration with vaicarran.

Some Info.

  • This code does not use the Visual Genome dataset for pretraining.
  • More hidden units are used than in the original paper (512 → 1024).
  • Batch normalization is used in the classifier.


I checked the final results; they can differ depending on whether early stopping is used.

Model            Validation Accuracy   Training Time
Reported Model   63.15                 12 - 18 hours (Tesla K40)
TF Model         61~64                 < 1 hour (Tesla P40)

Model Architecture

Proposed Model (in paper)


Implemented Graph (tensorboard)

Learning curve (score)


Main Codes

  • ./ : data preprocessing and tensorflow dataset modules.
  • ./models/ : tensorflow operation wrapper.
  • ./models/ : model class
  • ./models/ : word embedding and question embedding
  • ./models/ : proposed top-down-attention module
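
As a rough illustration of what a top-down attention module does, the sketch below scores each image region against the question embedding, normalizes the scores with a softmax, and returns the weighted sum of the region features. It is plain Python written for this note, not code from this repository: the function names are hypothetical, and the dot-product score is a simplifying assumption standing in for the learned scoring network described in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_down_attention(region_features, question_vec):
    """Attend over image regions, using the question as the query.

    region_features: list of K region vectors (lists of floats)
    question_vec:    question embedding with the same dimension
    Returns (weights, attended); attended is the weighted sum of the regions.
    NOTE: the dot-product score is an illustrative assumption, not the
    scoring function used in this repository.
    """
    scores = [
        sum(v * q for v, q in zip(region, question_vec))
        for region in region_features
    ]
    weights = softmax(scores)
    dim = len(question_vec)
    attended = [
        sum(w * region[d] for w, region in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Toy input: three regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
weights, attended = top_down_attention(regions, question)
```

In this toy input the first and third regions align with the question vector, so they receive equal weights that are larger than the weight of the second region.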


Make sure you are on a machine with an NVIDIA GPU, Python 3, and about 100 GB of disk space.
This code needs more memory than the PyTorch version (70~80 GB).
Work is ongoing to improve memory efficiency.
If you find a fix, PRs are always welcome.

- tensorflow 1.13.0
- h5py
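
Before downloading the data, it may help to confirm that the target disk has enough room. The snippet below is a small standard-library sketch, not part of this repository; the helper names are hypothetical, and the 100 GB threshold is simply the approximate figure quoted above.

```python
import shutil

REQUIRED_GB = 100  # approximate requirement quoted in this README


def free_gigabytes(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """True if the filesystem containing `path` has at least `required_gb` free."""
    return free_gigabytes(path) >= required_gb


if __name__ == "__main__":
    free = free_gigabytes(".")
    status = "OK" if has_enough_space(".") else "insufficient"
    print(f"free disk space: {free:.1f} GB ({status}, need about {REQUIRED_GB} GB)")
```

Run it from the directory where the dataset will be stored, since the check applies to the filesystem that contains the given path.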

Data Setup

The data download and preprocessing modules are from original_repo and pytorch_repo.

Make sure the dataset is downloaded to the ./data/ directory.

>> mkdir data
>> sh tools/
>> sh tools/


For the available hyperparameter settings, refer to the arguments in

>> python

If you want to visualize your results and the graph:

>> tensorboard --logdir='./tensorboard' (--host=YOUR_IP) (--port=YOUR_PORT)

Early stopping

Evaluation on the validation data is available at every epoch.
Early stopping is also available to prevent overfitting.
However, because of a memory issue, the early-stopping code is currently commented out.
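
For readers who want to see the idea, patience-based early stopping can be sketched in a few lines of plain Python. This is an illustration only, not the code from this repository: the class name, the `patience` and `min_delta` parameters, and the validation scores are all made up for the example.

```python
class EarlyStopping:
    """Stop training when the validation score stops improving.

    `patience` and `min_delta` are illustrative defaults, not values
    taken from this repository.
    """

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, score):
        """Record this epoch's validation score; return True to stop."""
        if score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, one per epoch.
val_scores = [55.0, 60.1, 62.4, 62.3, 62.0, 61.8, 61.5]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, score in enumerate(val_scores):
    if stopper.update(epoch, score):
        stopped_at = epoch
        break
```

A training loop would call `update` once per epoch after evaluating on the validation data, and keep the checkpoint saved at `best_epoch`.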


  • "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" paper.
  • PyTorch implementation by hengyuan-hu.