Counting 2,784 Big Data & Machine Learning Frameworks, Toolsets, and Examples...
Suggestion? Feedback? Tweet @stkim1

Last Commit
May. 6, 2018
Mar. 21, 2018



This package contains fast TensorFlow and NumPy implementations of GloVe and Mittens.

By vectorizing the GloVe objective function, we deliver massive speed gains over other Python implementations (10x on CPU; 60x on GPU). See the Speed section below.

The caveat is that our implementation is only suitable for modest vocabularies (up to ~20k tokens should be fine) since the co-occurrence matrix must be held in memory.

Vectorizing the objective also reveals that it is amenable to a retrofitting term that encourages representations to remain close to pretrained embeddings. This is useful for domains that require specialized representations but lack sufficient data to train them from scratch. Mittens starts with the general-purpose pretrained representations and tunes them to a specialized domain.



Mittens only requires numpy. However, if tensorflow is available, that will be used instead. The two implementations use the same cost function and optimizer, so the only difference is that the tensorflow version shows a small speed improvement on CPU, and a large speed improvement when run on GPU.

User installation

The easiest way to install mittens is with pip:

pip install -U mittens

You can also install it by cloning the repository and adding it to your Python path. Make sure you have at least numpy installed.

Note that neither method automatically installs TensorFlow: see their instructions.


For both examples, it is assumed that you have already computed the weighted co-occurrence matrix (cooccurence for vocabulary vocab).


from mittens import GloVe

# Load `cooccurrence`
# Train GloVe model
glove_model = GloVe(n=25, max_iter=1000)  # 25 is the embedding dimension
embeddings =

embeddings is now an np.array of size (len(vocab), n), where the rows correspond to the tokens in vocab.

A small complete example:

from mittens import GloVe
import numpy as np

cooccurrence = np.array([
    [  4.,   4.,   2.,   0.],
    [  4.,  61.,   8.,  18.],
    [  2.,   8.,  10.,   0.],
    [  0.,  18.,   0.,   5.]])
glove_model = GloVe(n=2, max_iter=100)
embeddings =

array([[ 1.13700831, -1.16577291],
       [ 2.52644205,  1.56363213],
       [ 0.2376546 ,  0.96793109],
       [ 0.41685158,  1.32988596]], dtype=float32)


To use Mittens, you first need pre-trained embeddings. In our paper, we used Pennington et al's embeddings, available from the Stanford GloVe website.

These vectors should be stored in a dict, where the key is the token and the value is the vector. For example, the function glove2dict below manipulates a Stanford embedding file into the appropriate format.

import csv
import numpy as np

def glove2dict(glove_filename):
    with open(glove_filename) as f:
        reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
        embed = {line[0]: np.array(list(map(float, line[1:])))
                for line in reader}
    return embed

Now that we have our embeddings (stored as original_embeddings), as well as a co-occurrence matrix and associated vocabulary, we're ready to train Mittens:

from mittens import Mittens

# Load `cooccurrence` and `vocab`
# Load `original_embedding`
mittens_model = Mittens(n=50, max_iter=1000)
# Note: n must match the original embedding dimension
new_embeddings =
    initial_embedding_dict= original_embedding)

Once trained, new_embeddings should be compatible with the existing embeddings in the sense that they will be oriented such that using a mix of the the two embeddings is meaningful (e.g. using original embeddings for any test-set tokens that were not in the training set).


We compared the per-epoch speed (measured in seconds) for a variety of vocabulary sizes using randomly-generated co-occurrence matrices that were approximately 90% sparse. As we see here, for matrices that fit into memory, performance is competitive with the official C implementation when run on a GPU.

For denser co-occurrence matrices, Mittens will have an advantage over the C implementation since it's speed does not depend on sparsity, while the official release is linear in the number of non-zero entries.

5K (CPU) 10K (CPU) 20K (CPU) 5K (GPU) 10K (GPU) 20K (GPU)
Non-vectorized TensorFlow 14.02 63.80 252.65 13.56 55.51 226.41
Vectorized Numpy 1.48 7.35 50.03
Vectorized TensorFlow 1.19 5.00 28.69 0.27 0.95 3.68
Official GloVe 0.66 1.24 3.50


[1] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.

[2] Nicholas Dingwall and Christopher Potts. 2018. Mittens: An Extension of GloVe for Learning Domain-Specialized Representations. (NAACL 2018) [code]