Image Captioning in Keras
(Note: You can read an in-depth tutorial about the implementation in this blog post.)
This is an implementation of an image captioning model based on Vinyals et al., with a few differences (a rough sketch of the resulting architecture follows the hyperparameter table below):
For the CNN, we use Inception v3 instead of Inception v1.
For the RNN, we use a multi-layered LSTM instead of a single-layered one.
We don't have a special start-of-sentence word, so we feed the first word at t = 1 instead of t = 2.
We use different values for some hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Learning rate | 0.00051 |
| Batch size | 32 |
| Epochs | 33 |
| Dropout rate | 0.22 |
| Embedding size | 300 |
| LSTM output size | 300 |
| LSTM layers | 3 |
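For orientation, here is a minimal Keras sketch of the kind of architecture described above: Inception v3 pooled features fed as the first timestep, 300-d word embeddings, and a 3-layer LSTM with dropout. It is not the repository's actual code; the feature dimensionality, vocabulary size, maximum caption length, and the Keras 2.x API are assumptions.

```python
# Illustrative sketch only -- not the repository's implementation.
from keras.layers import (Input, Dense, Dropout, Embedding, LSTM,
                          RepeatVector, TimeDistributed, concatenate)
from keras.models import Model
from keras.optimizers import Adam

VOCAB_SIZE = 10000    # assumed vocabulary size
MAX_CAPTION_LEN = 20  # assumed maximum caption length
EMBEDDING_SIZE = 300
LSTM_SIZE = 300
DROPOUT_RATE = 0.22

# Image branch: a 2048-d Inception v3 pooled feature vector, projected to the
# embedding size and used as the caption's first timestep.
image_features = Input(shape=(2048,), name='image_features')
image_step = RepeatVector(1)(Dense(EMBEDDING_SIZE, activation='relu')(image_features))

# Text branch: word ids embedded into the same 300-d space.
caption_words = Input(shape=(MAX_CAPTION_LEN,), name='caption_words')
word_embeddings = Embedding(VOCAB_SIZE, EMBEDDING_SIZE)(caption_words)

# Prepend the image "word" to the word sequence, then stack 3 LSTM layers.
sequence = concatenate([image_step, word_embeddings], axis=1)
for _ in range(3):
    sequence = LSTM(LSTM_SIZE, return_sequences=True)(sequence)
    sequence = Dropout(DROPOUT_RATE)(sequence)

# Predict a word distribution at every timestep.
word_probs = TimeDistributed(Dense(VOCAB_SIZE, activation='softmax'))(sequence)

model = Model(inputs=[image_features, caption_words], outputs=word_probs)
model.compile(optimizer=Adam(lr=0.00051), loss='categorical_crossentropy')
model.summary()
```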
Examples of Captions Generated by the Proposed Model
Quantitatively, the proposed model's performance is on par with Vinyals' model on the Flickr8k dataset:
| Metric | Proposed Model | Vinyals' Model |
| --- | --- | --- |
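For reference, metrics such as BLEU and CIDEr can be computed with pycocoevalcap (whose data is downloaded in the setup steps below). The snippet is only a rough illustration with made-up captions; it assumes the pycocoevalcap package is importable and that captions are already lowercased and tokenized.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Toy example: each image id maps to its reference captions and to the single
# generated caption, both as lists of plain, pre-tokenized strings.
references = {
    'img1': ['a dog runs on the grass', 'a brown dog is running'],
    'img2': ['two children play soccer in the park'],
}
predictions = {
    'img1': ['a dog running on the grass'],
    'img2': ['two kids are playing soccer'],
}

for name, scorer in [('BLEU', Bleu(4)), ('CIDEr', Cider())]:
    score, _ = scorer.compute_score(references, predictions)
    print('%s: %s' % (name, score))
```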
Download the dataset needed.
Download pretrained word vectors.
Download pycocoevalcap data.
Install the dependencies.
Note: It was only tested on Python 2.7. It may need minor code changes to work on Python 3.
```sh
# Optional: Create and activate your virtualenv / Conda environment
pip install -r requirements.txt
```
Run a Training
To reproduce the model, execute:
```sh
python -m keras_image_captioning.training \
  --training-label repro-final-model \
  --from-training-dir results/flickr8k/final-model
```
There are many more arguments available, which you can look up in the training module's source.
Run an Inference and Evaluate It
```sh
python -m keras_image_captioning.inference \
  --dataset-type test \
  --method beam_search \
  --beam-size 3 \
  --training-dir var/flickr8k/training-results/repro-final-model
```
`--dataset-type` can be either 'validation' or 'test'.
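For intuition, beam search keeps the `--beam-size` best partial captions at each step instead of greedily committing to the single most likely next word. Below is a generic, illustrative sketch of that decoding loop, not the repository's implementation; `predict_next`, the end-of-sentence id, and the length limit are assumed names and defaults.

```python
import heapq
import numpy as np

def beam_search_caption(predict_next, beam_size=3, max_len=20, eos_id=0):
    # `predict_next(seq)` is a hypothetical callable returning a vector of
    # next-word log-probabilities given the partial caption `seq`.
    beams = [(0.0, [])]  # (cumulative log-probability, word-id sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == eos_id:  # caption already finished
                candidates.append((logp, seq))
                continue
            next_logp = predict_next(seq)  # shape: (vocab_size,)
            # Expand only the top `beam_size` continuations of this beam.
            for word_id in np.argsort(next_logp)[-beam_size:]:
                candidates.append((logp + float(next_logp[word_id]),
                                   seq + [int(word_id)]))
        # Keep the `beam_size` best partial captions overall.
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```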
- You can look at the generated captions in `var/flickr8k/training-results/repro-final-model/test-predictions-3-20.yaml`. You can compare them with my result at
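If you prefer to inspect the predictions programmatically, here is a minimal sketch; it assumes the file is a YAML mapping from image identifier to generated caption, which may not match the actual layout.

```python
import yaml

path = ('var/flickr8k/training-results/repro-final-model/'
        'test-predictions-3-20.yaml')
with open(path) as f:
    predictions = yaml.safe_load(f)

# Print a handful of (image id, caption) pairs; the exact structure of the
# YAML file is an assumption here.
for image_id, caption in list(predictions.items())[:5]:
    print('%s -> %s' % (image_id, caption))
```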
MIT License. See LICENSE file for details.