Welcome to WordEmbeddingLoader’s documentation!
Loaders and savers for different implementations of word embeddings. The motivation for this project is that writing a loader for each pretrained word embedding format is cumbersome. This project provides a simple interface for loading pretrained word embedding files in different formats.
from word_embedding_loader import WordEmbedding
# it will automatically determine format from content
wv = WordEmbedding.load('path/to/embedding.bin')
# The project exposes a minimal interface: vocab maps a word (as bytes) to its row in vectors
print(wv.vectors[wv.vocab[b'is']])
# Modify the word embedding and save it in any supported format
wv.save('path/to/save.txt', 'word2vec', binary=False)
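Because vocab maps a word to a row index in vectors (as the lookup above shows), going from a row back to its word needs an inverse mapping. The following is a minimal sketch of one way to build it. It uses a plain dict and nested lists as stand-ins for wv.vocab and wv.vectors, and it assumes the indices are contiguous and start at 0. The name `index_to_word` is hypothetical and not part of this project.

```python
# Stand-ins for wv.vocab and wv.vectors (assumption: indices run 0..n-1).
vocab = {b"the": 0, b"is": 1, b"cat": 2}
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Sort words by row index so that index_to_word[i] is the word for vectors[i].
index_to_word = sorted(vocab, key=vocab.get)

for word, vector in zip(index_to_word, vectors):
    print(word, vector)
```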
This project currently supports the following formats:
- GloVe, Global Vectors for Word Representation, by Jeffrey Pennington, Richard Socher, Christopher D. Manning from Stanford NLP group.
- word2vec, by Mikolov.
  - text (create with the `-binary 0` option, the default)
  - binary (create with the `-binary 1` option)
- gensim's `models.word2vec` module (coming)
- original HDFS format: a performance-centric option for loading and saving word embeddings (coming)
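To make the difference between the two text layouts concrete, here is a minimal sketch in plain Python. It is not this project's implementation. It assumes the commonly distributed layouts: a word2vec text file starts with a header line holding the vocabulary size and the dimension, while a GloVe text file has no header and starts directly with a word vector. The function names `looks_like_word2vec_header` and `parse_text_embedding` are hypothetical.

```python
import io


def looks_like_word2vec_header(line):
    """Return True if line looks like a 'vocab_size dim' header.

    This is an assumed heuristic, not this project's detection logic.
    """
    parts = line.split()
    return len(parts) == 2 and all(p.isdigit() for p in parts)


def parse_text_embedding(stream):
    """Parse a binary stream of 'word v1 v2 ...' lines into (vocab, vectors).

    vocab maps each word (kept as bytes) to its row index in vectors.
    """
    vocab, vectors = {}, []
    first = stream.readline()
    if not looks_like_word2vec_header(first):
        # GloVe-style file: the first line is already a word vector.
        stream.seek(0)
    for line in stream:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(b" ")
        vocab[parts[0]] = len(vectors)
        vectors.append([float(x.decode("ascii")) for x in parts[1:]])
    return vocab, vectors


# The same two vectors, once with a word2vec header and once GloVe-style.
word2vec_text = io.BytesIO(b"2 3\nthe 0.1 0.2 0.3\nis 0.4 0.5 0.6\n")
glove_text = io.BytesIO(b"the 0.1 0.2 0.3\nis 0.4 0.5 0.6\n")

vocab_w2v, vectors_w2v = parse_text_embedding(word2vec_text)
vocab_glove, vectors_glove = parse_text_embedding(glove_text)
```

The header check is ambiguous for a one-dimensional GloVe file whose first word is a number, so a real loader would need a stricter test.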
Sometimes you want to combine an external program with a word embedding file of your own choice. This project also provides a simple executable to convert a word embedding from one format to another.
# it will automatically determine the format from the content
word-embedding-loader convert -t glove test/word_embedding_loader/word2vec.bin test.bin
# Get help for command/subcommand
word-embedding-loader --help
word-embedding-loader convert --help
Issues with encoding
This project does not decode the vocabulary: words are stored as bytes. It is up to users to determine the encoding and decode them, for example:
decoded_vocab = {k.decode('latin-1'): v for k, v in wv.vocab.items()}
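If the encoding is not known in advance, one option is to try a strict UTF-8 decode first and fall back to latin-1, which can decode any byte sequence. The sketch below uses a plain dict as a stand-in for wv.vocab. `decode_vocab` is a hypothetical helper and not part of this project.

```python
def decode_vocab(vocab, encodings=("utf-8", "latin-1")):
    """Return a copy of vocab with bytes keys decoded to text.

    Each key is tried against the encodings in order, and the first one
    that decodes without error wins. latin-1 never fails, so it works as
    a last resort, but it may produce mojibake for data in other encodings.
    """
    decoded = {}
    for word, index in vocab.items():
        for encoding in encodings:
            try:
                decoded[word.decode(encoding)] = index
                break
            except UnicodeDecodeError:
                continue
        else:
            raise ValueError("could not decode %r" % (word,))
    return decoded


# Stand-in for wv.vocab: bytes keys mapped to row indices.
# The second key is valid UTF-8; the third is latin-1 and invalid as UTF-8.
raw_vocab = {b"is": 0, "caf\u00e9".encode("utf-8"): 1, b"na\xefve": 2}
decoded_vocab = decode_vocab(raw_vocab)
```

Two distinct byte strings can decode to the same text, in which case the later entry overwrites the earlier one. Check for such collisions if the row indices matter.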
Development
This project uses Cython to build some modules, so you need Cython for development.
```bash
pip install -r requirements.txt
```
If the environment variable `DEVELOP_WE` is set, it will try to rebuild `.pyx` modules.
```bash
DEVELOP_WE=1 python setup.py test
```
CHANGELOG
v0.2
- Support for Python 3.4+.
- WordEmbedding.vocab stores words as bytes instead of unicode.
  - This allows more consistent loading/saving without needing to care about encoding.
- Bugfixes:
  - building Sphinx fails when the package is not installed
  - issues loading pretrained word2vec GoogleNews-vectors-negative300.bin (#1, #4)
v0.1
- First release.
- Supports word2vec and glove.
- Documentation using Sphinx.
- CLI interface for converting formats.