
Last Commit
May. 25, 2018
Aug. 29, 2017


Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, and other text generation tasks. It provides reference implementations of various sequence-to-sequence models.

Fairseq features multi-GPU (distributed) training on one machine or across multiple machines and fast beam search generation on both CPU and GPU. It also includes pre-trained models for several benchmark translation datasets.


Requirements and Installation

Fairseq currently requires PyTorch version >= 0.4.0. Please follow the PyTorch installation instructions.

If you use Docker, make sure to increase the shared memory size by passing either --ipc=host or --shm-size as a command-line option to nvidia-docker run.

After PyTorch is installed, you can install fairseq with:

pip install -r requirements.txt
python build
python develop

Quick Start

The following command-line tools are provided:

  • python Data pre-processing: build vocabularies and binarize training data
  • python Train a new model on one or multiple GPUs
  • python Translate pre-processed data with a trained model
  • python Translate raw text with a trained model
  • python BLEU scoring of generated translations against reference translations

Evaluating Pre-trained Models

First, download a pre-trained model along with its vocabularies:

$ curl | tar xvjf -

This model uses a Byte Pair Encoding (BPE) vocabulary, so the encoding must be applied to the source text before it can be translated. This can be done with the BPE encoding script, using the wmt14.en-fr.fconv-cuda/bpecodes file. @@ is used as a continuation marker, and the original text can be easily recovered with e.g. sed s/@@ //g or by passing the --remove-bpe flag to the generation script. Prior to BPE, input text needs to be tokenized using tokenizer.perl from mosesdecoder.
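
As a minimal sketch of the marker-removal step, the Python function below mirrors the sed expression: it deletes every "@@ " sequence so that split subwords are joined again. The name remove_bpe_markers is hypothetical and is not part of fairseq.

```python
def remove_bpe_markers(line, marker="@@ "):
    """Join BPE subwords by deleting the continuation marker.

    Hypothetical helper for illustration; mirrors `sed 's/@@ //g'`.
    """
    return line.replace(marker, "")


encoded = "Why is it rare to discover new marine mam@@ mal species ?"
print(remove_bpe_markers(encoded))
```

Because the marker includes the trailing space, a subword such as "mam@@" is glued directly to the piece that follows it.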

Let's use python to generate translations interactively. Here, we use a beam size of 5:

$ MODEL_DIR=wmt14.en-fr.fconv-py
$ python \
 --path $MODEL_DIR/ $MODEL_DIR \
 --beam 5
| loading model(s) from wmt14.en-fr.fconv-py/
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O       Why is it rare to discover new marine mam@@ mal species ?
H       -0.06429661810398102    Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A       0 1 3 3 5 6 6 8 8 8 7 11 12

This generation script produces four types of outputs: a line prefixed with S shows the supplied source sentence after applying the vocabulary; O is a copy of the original source sentence; H is the hypothesis along with an average log-likelihood; and A is the attention maxima for each word in the hypothesis, including the end-of-sentence marker which is omitted from the text.
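
The prefixed lines can be picked apart with a few lines of Python. This is only a sketch: it assumes the fields are tab-separated (the cut -f commands used for scoring suggest this), and parse_output is a hypothetical helper, not a fairseq function.

```python
import math


def parse_output(lines):
    """Collect prefixed output lines into a dict (assumes tab-separated fields)."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag = fields[0]
        if tag in ("S", "O"):
            result[tag] = fields[1]
        elif tag == "H":
            # H carries the average log-likelihood, then the hypothesis text.
            result["score"] = float(fields[1])
            result["H"] = fields[2]
        elif tag == "A":
            # One source position per hypothesis token, plus one for end-of-sentence.
            result["A"] = [int(i) for i in fields[1].split()]
    return result


sample = [
    "O\tWhy is it rare to discover new marine mam@@ mal species ?",
    "H\t-0.06429661810398102\tPourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?",
    "A\t0 1 3 3 5 6 6 8 8 8 7 11 12",
]
parsed = parse_output(sample)
hyp_tokens = parsed["H"].split()
# The alignment has one extra entry for the omitted end-of-sentence marker.
assert len(parsed["A"]) == len(hyp_tokens) + 1
# exp(average log-likelihood) is the geometric-mean per-token probability.
print(round(math.exp(parsed["score"]), 3))
```

Exponentiating the H score gives a rough per-token confidence, which can be easier to read than the raw log-likelihood.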

Check below for a full list of pre-trained models available.

Training a New Model

Data Pre-processing

Fairseq contains example pre-processing scripts for several translation datasets: IWSLT 2014 (German-English), WMT 2014 (English-French) and WMT 2014 (English-German). To pre-process and binarize the IWSLT dataset:

$ cd data/
$ bash
$ cd ..
$ TEXT=data/
$ python --source-lang de --target-lang en \
  --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
  --destdir data-bin/

This will write binarized data that can be used for model training to data-bin/
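
To illustrate what building a vocabulary and binarizing text mean, here is a toy sketch. It is not fairseq's implementation: the function names, the special symbols (<pad>, </s>, <unk>), and the ordering rule are assumptions chosen for the example.

```python
from collections import Counter


def build_vocab(sentences, specials=("<pad>", "</s>", "<unk>")):
    """Map each token to an integer ID, most frequent first (after specials)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    # Sort by descending frequency, then alphabetically for determinism.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(list(specials) + ordered)}


def binarize(sentence, vocab):
    """Replace tokens by IDs, map unknown tokens to <unk>, append end-of-sentence."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]
    return ids + [vocab["</s>"]]


train = ["danke .", "vielen dank .", "danke schön ."]
vocab = build_vocab(train)
print(binarize("danke sehr .", vocab))  # "sehr" is unseen, so it maps to <unk>
```

Storing sentences as integer IDs rather than strings is what makes the pre-processed data compact and fast to load during training.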


Use python to train a new model. Here are a few example settings that work well for the IWSLT 2014 dataset:

$ mkdir -p checkpoints/fconv
$ CUDA_VISIBLE_DEVICES=0 python data-bin/ \
  --lr 0.25 --clip-norm 0.1 --dropout 0.2 --max-tokens 4000 \
  --arch fconv_iwslt_de_en --save-dir checkpoints/fconv

By default, python will use all available GPUs on your machine. Use the CUDA_VISIBLE_DEVICES environment variable to select specific GPUs and/or to change the number of GPU devices that will be used.

Also note that the batch size is specified in terms of the maximum number of tokens per batch (--max-tokens). You may need to use a smaller value depending on the available GPU memory on your system.
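
The idea of a token-based batch size can be sketched as follows. This is a simplified illustration, not fairseq's actual batching code: it assumes a batch's cost is its padded size (number of sentences times the longest sentence), and batch_by_tokens is a hypothetical name.

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sentence indices so that each padded batch stays within max_tokens.

    Padded size = number of sentences * length of the longest sentence.
    A sentence longer than max_tokens ends up alone in its own batch.
    """
    batches, current = [], []
    # Sorting by length keeps padding small within each batch.
    for idx in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        # Indices arrive in ascending length, so lengths[idx] is the batch maximum.
        if current and (len(current) + 1) * lengths[idx] > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


lengths = [5, 3, 8, 2]
print(batch_by_tokens(lengths, max_tokens=16))
print(batch_by_tokens(lengths, max_tokens=8))
```

Lowering max_tokens produces more, smaller batches, which is why a smaller value helps when GPU memory is limited.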


Once your model is trained, you can generate translations using python (for binarized data) or python (for raw text):

$ python data-bin/ \
  --path checkpoints/fconv/ \
  --batch-size 128 --beam 5
  | [de] dictionary: 35475 types
  | [en] dictionary: 24739 types
  | data-bin/ test 6750 examples
  | model fconv
  | loaded checkpoint trainings/fconv/
  S-721   danke .
  T-721   thank you .

To generate translations with only a CPU, use the --cpu flag. BPE continuation markers can be removed with the --remove-bpe flag.

Pre-trained Models

We provide the following pre-trained fully convolutional sequence-to-sequence models:

In addition, we provide pre-processed and binarized test sets for the models above:

Generation with the binarized test sets can be run in batch mode as follows, e.g. for WMT 2014 English-French on a GTX-1080ti:

$ curl | tar xvjf - -C data-bin
$ curl | tar xvjf - -C data-bin
$ python data-bin/wmt14.en-fr.newstest2014  \
  --path data-bin/wmt14.en-fr.fconv-py/ \
  --beam 5 --batch-size 128 --remove-bpe | tee /tmp/gen.out
| Translated 3003 sentences (96311 tokens) in 166.0s (580.04 tokens/s)
| Generate test with beam=5: BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)

# Scoring with
$ grep ^H /tmp/gen.out | cut -f3- > /tmp/gen.out.sys
$ grep ^T /tmp/gen.out | cut -f2- > /tmp/gen.out.ref
$ python --sys /tmp/gen.out.sys --ref /tmp/gen.out.ref
BLEU4 = 40.83, 67.5/46.9/34.4/25.5 (BP=1.000, ratio=1.006, syslen=83262, reflen=82787)
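
The numbers in the BLEU line fit together in the standard way: the score is the brevity penalty times the geometric mean of the four n-gram precisions. The sketch below recomputes it from the printed values. The function name bleu is hypothetical, and because the printed precisions are rounded to one decimal, the result matches the reported 40.83 only approximately.

```python
import math


def bleu(precisions, sys_len, ref_len):
    """Combine n-gram precisions (as fractions) into a corpus-level BLEU score."""
    # Brevity penalty: applies only when the system output is not longer than the reference.
    bp = 1.0 if sys_len > ref_len else math.exp(1.0 - ref_len / sys_len)
    # Geometric mean of the n-gram precisions.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return 100.0 * bp * geo_mean


score = bleu([0.675, 0.469, 0.344, 0.255], sys_len=83262, ref_len=82787)
print(round(score, 1))          # close to the reported BLEU4
print(round(83262 / 82787, 3))  # the "ratio" field: syslen / reflen
```

Here the system output is longer than the reference (ratio above 1), so the brevity penalty is 1.000 and the score is just the geometric mean of the precisions.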

Distributed version

Distributed training in fairseq is implemented on top of torch.distributed. Training begins by launching one worker process per GPU. These workers discover each other via a unique host and port (required) that are used to establish an initial connection. Additionally, each worker is given a rank, i.e. a unique number from 0 to n-1, where n is the total number of GPUs.
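
As an illustration of rank assignment, the sketch below numbers workers node by node, for example on 16 nodes with 8 GPUs each. The node-major layout and the name global_rank are assumptions for this example. Any assignment works as long as every rank from 0 to n-1 is used exactly once.

```python
def global_rank(node_index, local_gpu, gpus_per_node):
    """Node-major rank layout (an assumed convention for illustration)."""
    return node_index * gpus_per_node + local_gpu


nodes, gpus_per_node = 16, 8
world_size = nodes * gpus_per_node
ranks = [
    global_rank(n, g, gpus_per_node)
    for n in range(nodes)
    for g in range(gpus_per_node)
]
# Every rank from 0 to world_size - 1 must appear exactly once.
assert sorted(ranks) == list(range(world_size))
print(world_size, ranks[0], ranks[-1])
```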

If you run on a cluster managed by SLURM you can train a large English-French model on the WMT 2014 dataset on 16 nodes with 8 GPUs each (in total 128 GPUs) using this command:

$ DATA=... # path to the preprocessed dataset, must be visible from all nodes
$ PORT=9218 # any available TCP port that can be used by the trainer to establish the initial connection
$ sbatch --job-name fairseq-py --gres gpu:8 --nodes 16 --ntasks-per-node 8 \
    --cpus-per-task 10 --no-requeue --wrap 'srun --output train.log.node%t \
    --error train.stderr.node%t.%j python $DATA --distributed-world-size 128 \
    --distributed-port $PORT --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
    --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
    --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 --wd 0.0001'

Alternatively, you can manually start one process per GPU:

$ DATA=... # path to the preprocessed dataset, must be visible from all nodes
$ HOST_PORT=... # host:port; has to be one of the hosts that will be used by the job, and the port on that host has to be available
$ RANK=... # the rank of this process, has to go from 0 to 127 in case of 128 GPUs
$ python $DATA --distributed-world-size 128 \
      --force-anneal 50 --lr-scheduler fixed --max-epoch 55 \
      --arch fconv_wmt_en_fr --optimizer nag --lr 0.1,4 --max-tokens 3000 \
      --clip-norm 0.1 --dropout 0.1 --criterion label_smoothed_cross_entropy \
      --label-smoothing 0.1 --wd 0.0001 \
      --distributed-init-method="tcp://$HOST_PORT" --distributed-rank=$RANK

Join the fairseq community


If you use the code in your paper, then please cite it as:

@inproceedings{gehring2017convs2s,
  author    = {Gehring, Jonas and Auli, Michael and Grangier, David and Yarats, Denis and Dauphin, Yann N},
  title     = "{Convolutional Sequence to Sequence Learning}",
  booktitle = {Proc. of ICML},
  year      = 2017,
}


fairseq(-py) is BSD-licensed. The license applies to the pre-trained models as well. We also provide an additional patent grant.


This is a PyTorch version of fairseq, a sequence-to-sequence learning toolkit from Facebook AI Research. The original authors of this reimplementation are (in no particular order) Sergey Edunov, Myle Ott, and Sam Gross.

Latest Releases
 Feb. 12 2018