
Image Captioning in Keras

(Note: You can read an in-depth tutorial about the implementation in this blog post.)

This is an implementation of an image captioning model based on Vinyals et al., with a few differences:

  • For the CNN, we use Inception v3 instead of Inception v1.

  • For the RNN, we use a multi-layered LSTM instead of a single-layered one.

  • We don't have a special start-of-sentence word, so we feed the first word at t = 1 instead of t = 2.

  • We use different values for some hyperparameters:

    Hyperparameter      Value
    Learning rate       0.00051
    Batch size          32
    Epochs              33
    Dropout rate        0.22
    Embedding size      300
    LSTM output size    300
    LSTM layers         3
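
For convenience, the hyperparameters listed above could be gathered into a single configuration object. The following is only a minimal sketch: the class name and field names are hypothetical and are not taken from this repository, and the code assumes Python 3.7+ (the repository itself was only tested on Python 2.7).

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CaptioningHyperparams:
    """Hypothetical container; names are illustrative, values come from the table."""
    learning_rate: float = 0.00051
    batch_size: int = 32
    epochs: int = 33
    dropout_rate: float = 0.22
    embedding_size: int = 300
    lstm_output_size: int = 300
    lstm_layers: int = 3


if __name__ == "__main__":
    # Print every hyperparameter as "name: value".
    for name, value in asdict(CaptioningHyperparams()).items():
        print(f"{name}: {value}")
```

Freezing the dataclass keeps a training run from silently mutating its own settings; a variant can be created with `dataclasses.replace`.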

Examples of Captions Generated by the Proposed Model

Result examples without errors

Evaluation Metrics

Quantitatively, the proposed model's performance is on par with Vinyals' model on the Flickr8k dataset:

Metric    Proposed Model    Vinyals' Model
BLEU-1    61.8              63
BLEU-2    40.8              41
BLEU-3    27.8              27
BLEU-4    19.0              N/A
CIDEr     41.5              N/A
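
As a reminder of what the BLEU-n rows measure, here is a minimal sketch of corpus-level BLEU: clipped n-gram precision for orders 1 to n, combined by a geometric mean and multiplied by a brevity penalty. This is not the pycocoevalcap implementation used by this project. The function names are hypothetical, captions are assumed to be already tokenized, and the brevity penalty uses the reference length closest to each candidate.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references_list, max_n=4):
    """Cumulative BLEU-max_n over a corpus, as a fraction in [0, 1].

    candidates: list of token lists, one per image.
    references_list: list of lists of token lists (several references per image).
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # candidate n-grams, per order
    cand_len = 0
    ref_len = 0

    for cand, refs in zip(candidates, references_list):
        cand_len += len(cand)
        # Use the reference whose length is closest to the candidate's.
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            # Clip each n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for ref in refs:
                for gram, count in ngram_counts(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())

    if min(matched) == 0:
        return 0.0  # an order with no matches makes the geometric mean zero
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    cands = [["a", "dog", "runs", "on", "grass"]]
    refs = [[["a", "dog", "runs", "on", "the", "grass"],
             ["a", "dog", "is", "running"]]]
    for n in (1, 2):
        print(f"BLEU-{n}: {corpus_bleu(cands, refs, max_n=n):.3f}")
```

The function returns a fraction, whereas the table above appears to report scores on a 0 to 100 scale.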

Environment Setup

  1. Download the required dataset.

  2. Download pretrained word vectors.

  3. Download pycocoevalcap data.

  4. Install the dependencies.

    Note: The code has only been tested on Python 2.7. It may need minor changes to work on Python 3.

    # Optional: Create and activate your virtualenv / Conda environment
    pip install -r requirements.txt

  5. Set up PYTHONPATH.

    source ./scripts/

Using a Pretrained Model

  1. Download a pretrained model from the releases page.

  2. Copy model-weights.hdf5 to keras-image-captioning/results/flickr8k/final-model.

  3. You can now run inference from that checkpoint by executing the command below from the keras-image-captioning directory:

    python -m keras_image_captioning.inference \
    --dataset-type test \
    --method beam_search \
    --beam-size 3 \
    --training-dir results/flickr8k/final-model
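
The --method beam_search and --beam-size 3 options select beam search decoding. As a rough illustration of the technique (not this repository's implementation), the sketch below runs beam search over a toy next-word model. The function `toy_next_word_log_probs`, its vocabulary, and the `<eos>` token are all made up for the example.

```python
import math


def beam_search(next_log_probs, beam_size=3, max_len=10, eos="<eos>"):
    """Return (words, score) for the best sequence found by beam search.

    next_log_probs(prefix) must return a dict {word: log-probability}
    for the word that follows `prefix` (a tuple of words).
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []

    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (word,), score + logp))
        # Keep only the beam_size best partial captions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break

    # Fall back to unfinished beams if nothing reached the end token.
    best_prefix, best_score = max(finished or beams, key=lambda item: item[1])
    return list(best_prefix), best_score


def toy_next_word_log_probs(prefix):
    """Hypothetical stand-in for the captioning model's softmax output."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    probs = table.get(prefix, {"<eos>": 1.0})
    return {word: math.log(p) for word, p in probs.items()}


if __name__ == "__main__":
    for size in (1, 3):
        words, score = beam_search(toy_next_word_log_probs, beam_size=size)
        print(size, " ".join(words), round(math.exp(score), 2))
```

With a beam size of 1 the search is greedy and commits to the most likely first word, while a wider beam can recover a sequence whose first word is less likely but whose overall probability is higher. The trade-off is that each decoding step evaluates the model once per beam.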

Training from Scratch

1. Run a Training

To reproduce the model, execute:

python -m \
  --training-label repro-final-model \
  --from-training-dir results/flickr8k/final-model

There are many arguments available that you can look inside

2. Run an Inference and Evaluate It

python -m keras_image_captioning.inference \
  --dataset-type test \
  --method beam_search \
  --beam-size 3 \
  --training-dir var/flickr8k/training-results/repro-final-model


  • --dataset-type can be either validation or test.
  • You can view the generated captions at var/flickr8k/training-results/repro-final-model/test-predictions-3-20.yaml and compare them with my results at results/flickr8k/final-model/test-predictions-3-20.yaml.


MIT License. See LICENSE file for details.
