A TensorFlow Implementation of the Transformer: Attention Is All You Need
- NumPy >= 1.11.1
- TensorFlow >= 1.2 (1.1 should probably work too, though I didn't test it)
Why This Project?
I tried to implement the idea in Attention Is All You Need. The authors claimed that their model, the Transformer, outperformed the state of the art in machine translation using only attention, with no CNNs and no RNNs. How cool is that! At the end of the paper, they promise to make their code available soon, but apparently it has not been released yet. I have two goals with this project. One is to gain a full understanding of the paper; it's often hard for me to get a good grasp of an idea before writing some code for it. The other is to share my code with people who are interested in this model before the official code is unveiled.
Differences from the original paper
I don't intend to replicate the paper exactly. Rather, I aim to implement the main ideas in the paper and verify them in a SIMPLE and QUICK way. In this respect, some parts of my code differ from the paper. Among them:
- I used the IWSLT 2016 de-en dataset, not the WMT dataset, because the former is much smaller and requires no special preprocessing.
- I constructed the vocabulary with words, not subwords, for simplicity. Of course, you can try BPE or word-piece if you want.
- I parameterized the positional encoding. The paper used a sinusoidal formula, but Noam, one of the authors, says both work. See the discussion on reddit.
- The paper adjusted the learning rate according to the global step. I fixed the learning rate to a small number, 0.0001, simply because training was reasonably fast with the small dataset (only a couple of hours on a single GTX 1060!!).
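To make the last two differences concrete, here is a minimal NumPy sketch of what the paper describes next to the simpler choices made here. It is an illustration only, not code from this repository: the function names are made up, and `learned_positional_encoding` just builds a randomly initialised table standing in for a trainable variable. The defaults `d_model=512` and `warmup_steps=4000` are the values given in the paper for the base model.

```python
import numpy as np


def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed encoding from the paper; returns an array of shape (max_len, d_model)."""
    positions = np.arange(max_len, dtype=np.float64)[:, np.newaxis]  # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                         # (1, d_model)
    # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
    angles = positions / np.power(10000.0, 2.0 * (dims // 2) / float(d_model))
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding


def learned_positional_encoding(max_len, d_model, seed=0):
    """Stand-in for a parameterized encoding: a lookup table that training would update."""
    rng = np.random.RandomState(seed)
    return rng.normal(scale=d_model ** -0.5, size=(max_len, d_model))


def paper_learning_rate(step, d_model=512, warmup_steps=4000):
    """Schedule from the paper: linear warmup, then inverse square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


FIXED_LEARNING_RATE = 0.0001  # the constant used in this project

if __name__ == "__main__":
    print(sinusoidal_positional_encoding(50, 16).shape)
    print(learned_positional_encoding(50, 16).shape)
    for step in (1, 1000, 4000, 100000):
        print(step, paper_learning_rate(step), FIXED_LEARNING_RATE)
```

Either table is added to the token embeddings position by position. The paper's schedule peaks at `warmup_steps` and decays afterwards, whereas the constant rate stays at 0.0001 throughout.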
- hyperparams.py includes all hyper parameters that are needed.
- prepro.py creates vocabulary files for the source and the target.
- data_load.py contains functions regarding loading and batching data.
- modules.py has all building blocks for encoder/decoder networks.
- train.py has the model.
- eval.py is for evaluation.
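As a rough idea of what a word-level vocabulary involves, the following sketch counts whitespace-separated words and maps them to integer ids. It is not the code in prepro.py: `build_vocab`, `encode`, the `min_count` threshold and the special tokens `<PAD>`, `<UNK>`, `<S>`, `</S>` are all assumptions chosen for illustration.

```python
from collections import Counter


def build_vocab(sentences, min_count=1, specials=("<PAD>", "<UNK>", "<S>", "</S>")):
    """Map each sufficiently frequent word to an integer id (hypothetical helper)."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Most frequent words first; ties are broken alphabetically for determinism.
    words = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    vocab = list(specials) + words
    return {word: idx for idx, word in enumerate(vocab)}


def encode(sentence, word2idx, unk="<UNK>"):
    """Convert a sentence to ids, replacing out-of-vocabulary words with the unk id."""
    return [word2idx.get(word, word2idx[unk]) for word in sentence.split()]


if __name__ == "__main__":
    corpus = ["Ich hielt dagegen", "Ich möchte ein paar von euch sehen"]
    word2idx = build_vocab(corpus)
    print(encode("Ich hielt Alex", word2idx))
```

Words that fall below `min_count`, or that never appear in the training data, all share the `<UNK>` id. That is the main cost of a word-level vocabulary compared with subwords.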
- STEP 1. Download IWSLT 2016 German–English parallel corpus and extract it to
wget -qO- --show-progress https://wit3.fbk.eu/archive/2016-01//texts/de/en/de-en.tgz | tar xz; mv de-en corpora
- STEP 2. Adjust hyper parameters in
- STEP 3. Run prepro.py to generate vocabulary files to the
- STEP 4. Run train.py or download the pretrained files.
Training Loss and Accuracy
- Training Loss
- Training Accuracy
I got a BLEU score of 17.14. (Recall that I trained with a small dataset and a limited vocabulary.) Some of the evaluation results are as follows. Details are available in the
source: Sie war eine jährige Frau namens Alex
expected: She was a yearold woman named Alex
got: She was a woman named yearold name
source: Und als ich das hörte war ich erleichtert
expected: Now when I heard this I was so relieved
got: And when I heard that I was an
source: Meine Kommilitonin bekam nämlich einen Brandstifter als ersten Patienten
expected: My classmate got an arsonist for her first client
got: Because my first came from an in patients
source: Das kriege ich hin dachte ich mir
expected: This I thought I could handle
got: I'll go ahead and I thought
source: Aber ich habe es nicht hingekriegt
expected: But I didn't handle it
got: But I didn't it
source: Ich hielt dagegen
expected: I pushed back
got: I thought about it
source: Das ist es was Psychologen einen AhaMoment nennen
expected: That's what psychologists call an Aha moment
got: That's what a like a
source: Meldet euch wenn ihr in euren ern seid
expected: Raise your hand if you're in your s
got: Get yourself in your s
source: Ich möchte ein paar von euch sehen
expected: I really want to see some twentysomethings here
got: I want to see some of you
source: Oh yeah Ihr seid alle unglaublich
expected: Oh yay Y'all's awesome
got: Oh yeah you all are incredibly
source: Dies ist nicht meine Meinung Das sind Fakten
expected: This is not my opinion These are the facts
got: This is not my opinion These are facts
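For readers unfamiliar with BLEU, here is a small sketch of how the metric compares a "got" line with its "expected" line: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. This is a simplified, unsmoothed, single-sentence version with made-up function names. It is not necessarily how eval.py computes the score, and the reported figure is a corpus-level score (conventionally shown on a 0 to 100 scale), which pools n-gram counts over all sentences.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU in [0, 1] against a single reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[gram]) for gram, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precision_sum += math.log(float(overlap) / total)
    # Penalise hypotheses that are shorter than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - float(len(ref)) / len(hyp))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


if __name__ == "__main__":
    expected = "This is not my opinion These are the facts"
    got = "This is not my opinion These are facts"
    print(sentence_bleu(expected, got))
```

Short outputs with no matching 4-gram, such as "I thought about it" against "I pushed back", score zero under this unsmoothed variant. This is one reason BLEU is usually reported at corpus level.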