In this blog I explore implementing document similarity using enriched word vectors. Several libraries, such as Gensim, spaCy, and FastText, can build word vectors from a corpus and use them in a document similarity solution. I previously shared such an implementation using Gensim in this blog. The limitation of that approach is the size of the corpus: if the corpus is small, the generated word vectors may not contain enough semantic and syntactic information to be really useful.

To address this issue, in this blog I show how to enrich the word vectors with public pre-trained vectors, such as the Google word2vec vectors trained on about 100 billion words from a Google News dataset. I use the gensim library to combine the Google word vectors with the word vectors created from our own corpus. I also use spaCy, which provides very simple interfaces to load word vectors and calculate document similarity.

We begin with the needed imports.

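import re
import pandas as pd
import gensim
import spacy
from gensim.models import Word2Vec
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize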

The input dataset is a JSON file containing the text as strings, one per line. We first read it into a pandas DataFrame and shuffle the rows using DataFrame.sample with the fraction set to 1.

sample = pd.read_json("data_files.json", encoding='utf-8')
sample = sample.sample(frac=1).reset_index(drop=True)
sample = sample[['text']]
print('The shape of the input data frame: {}'.format(sample.shape))

Then we clean the text to get rid of unnecessary characters and stop words. We can also stem the words, but in this example I have set stem to False. The two helper functions used here are listed at the end of this post.

sample['text'] = sample['text'].apply(default_clean)
sample['text'] = sample['text'].apply(stop_and_stem, stem=False)

Next we build the corpus vocabulary using Gensim. Gensim’s word2vec expects a sequence of sentences as its input, each sentence a list of words (utf8 strings). Keeping the entire corpus as a Python list can use up a lot of RAM when the input is large, but Gensim only requires that the input provide sentences sequentially when iterated over. So we use a Python generator to tokenize the input for Gensim. This code can be further enhanced for cases where the inputs are spread across multiple files or directories; as long as we are able to provide sentences sequentially, it will work.

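def tokenize(sentences):
    for sentence in sentences:
        yield nltk.word_tokenize(sentence)

# Gensim iterates over the corpus more than once, so the one-shot
# generator is materialized into a list here
sentences = list(tokenize(sample['text']))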
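
The list(...) call above gives up the memory savings of the generator, because Gensim walks over the corpus more than once (one pass to build the vocabulary, then one per training epoch) and a plain generator is exhausted after the first pass. A minimal sketch of a restartable alternative is below. The class name `SentenceStream` is hypothetical, and `str.split` stands in for `nltk.word_tokenize` only so that the example runs without NLTK data.

```python
class SentenceStream:
    """Re-iterable stream of tokenized sentences (hypothetical helper)."""

    def __init__(self, texts, tokenizer=str.split):
        self.texts = texts          # any re-iterable: list, pandas Series, ...
        self.tokenizer = tokenizer  # str.split is a stand-in for nltk.word_tokenize

    def __iter__(self):
        # A fresh generator is created on every call, so each pass restarts
        for text in self.texts:
            yield self.tokenizer(text)


stream = SentenceStream(["the cat sat", "dogs bark loudly"])
first_pass = list(stream)
second_pass = list(stream)  # not empty: iteration restarted from the top
```

Passing something like `SentenceStream(sample['text'], nltk.word_tokenize)` in place of the list should keep memory use flat, since only one sentence is tokenized at a time.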

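We now initialize a word2vec model. The parameters passed are highly dependent on the corpus and need to be selected carefully. For more details on selecting and tuning the parameters, you can refer to the Gensim tutorial here. Note that the size and iter parameter names used below are those of Gensim 3.x; Gensim 4 renamed them to vector_size and epochs.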

model = Word2Vec(size=300, window=10, min_count=5, workers=11, alpha=0.025, iter=20)

We then go ahead and build the model vocabulary from our corpus.

model.build_vocab(sentences)


The next step is enriching it with the Google word2vec vectors. We can download the GoogleNews word2vec file from the S3 store below to local disk and unzip it.

gunzip GoogleNews-vectors-negative300.bin.gz

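We then intersect the vocabulary with the Google word vectors. This improves our model using the Google word vectors: no words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone. Note that by default (lockf=0.0) the imported vectors are locked and stay fixed during later training; passing lockf=1.0 lets them keep updating.

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)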
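
To make the intersect semantics concrete, here is a toy sketch using plain dictionaries in place of real embedding matrices. The function `intersect_vectors` and the tiny two-dimensional vectors are made up for illustration; this is not Gensim's implementation, only the behavior described above.

```python
def intersect_vectors(corpus_vectors, pretrained_vectors):
    """Return a copy of corpus_vectors where shared words take pretrained weights."""
    merged = {}
    for word, vector in corpus_vectors.items():
        # Shared words adopt the pretrained weights; other words are left alone
        merged[word] = list(pretrained_vectors.get(word, vector))
    # Words that exist only in pretrained_vectors are never added
    return merged


corpus_vectors = {"invoice": [0.1, 0.2], "kubectl": [0.3, 0.4]}
pretrained_vectors = {"invoice": [0.9, 0.8], "president": [0.5, 0.5]}
merged = intersect_vectors(corpus_vectors, pretrained_vectors)
```

In this toy case "invoice" picks up the pretrained weights, "kubectl" keeps the corpus-trained ones, and "president" never enters the vocabulary.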


After that we train the model using our corpus.

model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)

Once the enriched word2vec model is available, we save its vectors to disk in the word2vec text format. It is also possible to use the model directly to implement a document similarity solution, but in this blog I have used spaCy, which provides quite easy-to-use interfaces.

model.wv.save_word2vec_format('enrich_with_google_w2v.txt', binary=False)


We then convert the word2vec model into a spaCy model using the commands below. There are some known issues with spacy init-model; for this step I used spaCy version 2.0.16, which works fine.

gzip ./enrich_with_google_w2v.txt
python3 -m spacy init-model en spacy.word2vec.model --vectors-loc enrich_with_google_w2v.txt.gz

We now have a brand new spaCy model containing our word vectors, which we load.

enrich_w2v = spacy.load('./spacy.word2vec.model')

We run the model on our sample corpus to generate the vectors for each document. The vectors are also stored in the pandas DataFrame, which lets us access them easily when calculating similarity scores across the entire document set.

sample['enriched_text'] = sample['text'].apply(lambda x: enrich_w2v(x))

Once we have the vectorization of the documents done we are good to go – for any new input test document, we use the spacy model to vectorize the document text

test_doc = enrich_w2v("test document")

and calculate the similarity score between the test vector and each of the vectors in our sample DataFrame. We then store the score in the DataFrame itself and sort by it, so that the documents closest to the test document are at the top.

def get_distance(doc_str, input_str):
    return doc_str.similarity(input_str)

sample['doc_dist'] = sample['enriched_text'].apply(get_distance, input_str=test_doc)
sample_top = sample.sort_values(by=['doc_dist'], ascending=False).head(10)
sample_top[['text', 'doc_dist']]
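
For intuition about what the score means: by default, spaCy's `Doc.similarity` is the cosine similarity between the two documents' vectors, and a document's vector is the average of its token vectors. The sketch below reproduces that calculation in plain Python; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def average_vector(token_vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(token_vectors)
    return [sum(dims) / count for dims in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = average_vector([[1.0, 0.0], [1.0, 0.0]])  # both tokens point along x
doc_b = average_vector([[0.0, 2.0]])              # single token along y
same = cosine_similarity(doc_a, doc_a)
different = cosine_similarity(doc_a, doc_b)
```

Despite the column name doc_dist, the value is a similarity rather than a distance: scores close to 1 mean the two documents point in nearly the same direction, which is why the DataFrame is sorted in descending order.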

This provides an easy-to-use approach to generating document similarity.

In my next blog, I will share the implementation for Document Similarity using FastText and also the comparison of results among all these different solutions.

Finally, below is the code for the two simple helper functions we used to clean the corpus.

def default_clean(text):
    ''' Removes default bad characters '''
    if not (pd.isnull(text)):
        # Empty-string entries are dropped: replacing '' would insert a space between every character
        bad_chars = set(["@", "+", '/', "'", '"', '\\', '(', ')', '\\n', '?', '#', ',', '.', '[', ']', '%', '$', '&', ';', '!', ':', "*", "_", "=", "}", "{"])
        for char in bad_chars:
            text = text.replace(char, " ")
        # Strip digits once, after all bad characters have been replaced
        text = re.sub(r'\d+', "", text)
    return text

def stop_and_stem(text, stem=True, stemmer=PorterStemmer()):
    ''' Removes stopwords and does stemming '''
    stoplist = stopwords.words('english')
    if stem:
        text_stemmed = [stemmer.stem(word) for word in word_tokenize(text) if word not in stoplist and len(word) > 3]
    else:
        text_stemmed = [word for word in word_tokenize(text) if word not in stoplist and len(word) > 3]
    text = ' '.join(text_stemmed)
    return text
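
As an aside, the character-by-character loop in default_clean can also be written with `str.translate`, which makes a single pass over the string. This is only a sketch of an alternative, not the code used in this post: the name `clean_with_translate` is hypothetical, the character set is abbreviated, and the `None` check stands in for `pd.isnull` so the example needs no pandas.

```python
import re

# Abbreviated set of characters to blank out (extend as needed)
BAD_CHARS = "@+/'\"\\()?#,.[]%$&;!:*_={}\n"
# Translation table mapping every bad character to a space
TO_SPACE = str.maketrans({char: " " for char in BAD_CHARS})


def clean_with_translate(text):
    """Replace bad characters with spaces, then drop digits."""
    if text is None:
        return text
    text = text.translate(TO_SPACE)
    return re.sub(r"\d+", "", text)


cleaned = clean_with_translate("Total: $42 (approx.)")
```

The result still contains runs of spaces where characters were removed; the tokenizer in stop_and_stem takes care of those.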

Hope you found this useful.

If you have any questions or suggestions for future blogs, please do drop a line in the comments section.