Gensim Doc2Vec builds on word2vec to learn unsupervised, continuous representations of larger blocks of text, such as sentences, paragraphs or entire documents. It is an implementation of Quoc Le & Tomáš Mikolov's paper “Distributed Representations of Sentences and Documents”.

Doc2Vec trains on documents by creating a vector representation of each document using either the “distributed memory” (dm) or the “distributed bag of words” (dbow) approach described in the paper.
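The mode is selected through the dm argument of the Doc2Vec constructor. A minimal sketch, using the same (older gensim) parameter names as the rest of this post:

from gensim.models import Doc2Vec

# dm=1 trains the "distributed memory" (PV-DM) model,
# dm=0 trains the "distributed bag of words" (PV-DBOW) model
dm_model = Doc2Vec(dm=1, size=100, window=10, min_count=5, workers=4)
dbow_model = Doc2Vec(dm=0, size=100, min_count=5, workers=4)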

The use case I have implemented is to identify the most similar documents to a given document in a training set of roughly 20,000 documents. All the documents are labelled, and there are some 500 unique document labels.

We begin by including the needed imports.

import pandas as pd
import numpy as np
import nltk
import re

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

The input dataset is a JSON file with the text of each document as a single long string and a label associated with each. We first read it into a pandas DataFrame and randomly re-order the rows using DataFrame.sample with the fraction set to 1.

sample = pd.read_json("data_files.json", encoding='utf-8')
sample = sample.sample(frac=1).reset_index(drop=True)
sample = sample[['text', 'label']]
print ('The shape of the input data frame: {}'.format(sample.shape))

Then we clean the text to get rid of unnecessary characters and stop words. We could also stem the words, but in this example I have disabled stemming.

sample['text'] = sample['text'].apply(default_clean)
sample['text'] = sample['text'].apply(stop_and_stem, stem=False)

The input to Doc2Vec is an iterator of TaggedDocument (formerly LabeledSentence) objects. Each such object represents a single document and consists of two simple lists: a list of words and a list of tags (labels). Since our documents and labels live in a pandas DataFrame, we need to convert them into such lists of words and labels. For this we implement a TaggedDocumentIterator class, which takes the text and label Series as lists and yields a TaggedDocument of words and tags for each document.

class TaggedDocumentIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list
    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

docLabels = list(sample['label'])
data = list(sample['text'])
sentences = TaggedDocumentIterator(data, docLabels)

Once we have the TaggedDocumentIterator for our input data ready, we can train the Doc2Vec model. Doc2Vec learns representations for words and labels simultaneously.

model = Doc2Vec(size=100, window=10, min_count=5, workers=11, alpha=0.025, iter=20)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
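Since word and label vectors are learned together, both can be inspected once training is done. A small illustration; 'some_word' and 'some_label' below are hypothetical placeholders for a word in the vocabulary and one of the document labels:

# Vector learned for a word in the vocabulary ('some_word' is a placeholder)
word_vec = model.wv['some_word']
# Vector learned for a document label ('some_label' is a placeholder)
label_vec = model.docvecs['some_label']
# Labels whose vectors are most similar to a given label's vector
print (model.docvecs.most_similar('some_label'))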

Once the model is created it may be a good idea to save it. Gensim provides utility methods to save and load the model from disk.

# Store the model to mmap-able files
model.save('/tmp/model_docsimilarity.doc2vec')
# Load the model
model = Doc2Vec.load('/tmp/model_docsimilarity.doc2vec')

One way to test our model is to take a sample document from the input dataset and check whether the model can find similar documents in that dataset; the model should always return the sample document itself as the closest match.

The Doc2Vec model provides an infer_vector method which generates the vector representation of a new document, which can then be compared with the document vectors in the trained model.

def test_predict():
    #Select a random document for the document dataset
    rand_int = np.random.randint(0, sample.shape[0])
    print ('Random int {}'.format(rand_int))
    test_sample = sample.iloc[rand_int]['text']
    label = sample.iloc[rand_int, sample.columns.get_loc('label')]

    #Clean the document using the utility functions used in train phase
    test_sample = default_clean(test_sample)
    test_sample = stop_and_stem(test_sample, stem=False)

    #Convert the sample document into a list and use the infer_vector method to get a vector representation for it
    new_doc_words = test_sample.split()
    new_doc_vec = model.infer_vector(new_doc_words, steps=50, alpha=0.25)

    #use the most_similar utility to find the most similar documents.
    similars = model.docvecs.most_similar(positive=[new_doc_vec])
    print ('Label of the test document: {}'.format(label))
    print (similars)

test_predict()

Both the model creation and infer_vector for a new document took some time for me to optimize before I got good results. I will share the optimization steps and the results in my next blog. I have also implemented the same use case with scikit-learn's k-nearest neighbors (KNeighbors) algorithm on the same dataset for comparison.
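As a rough sketch of how that comparison can be set up (illustrative only; the tuned parameters and results are for the next post), the learned document vectors can be collected into a matrix and indexed with scikit-learn's NearestNeighbors:

from sklearn.neighbors import NearestNeighbors

# Collect the learned label vectors into a matrix (gensim 3.x docvecs/doctags API)
labels = list(model.docvecs.doctags.keys())
doc_matrix = np.array([model.docvecs[label] for label in labels])

# Build a cosine-distance k-nearest-neighbors index over the document vectors
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
nn.fit(doc_matrix)

# Query with an inferred vector for a new document (e.g. new_doc_vec from test_predict)
distances, indices = nn.kneighbors([new_doc_vec])
print ([labels[i] for i in indices[0]])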

For now, here is some more code: the two simple helper functions used above to clean the document text.


def default_clean(text):
    '''
    Removes default bad characters
    '''
    if not pd.isnull(text):
        # text = filter(lambda x: x in string.printable, text)
        bad_chars = set(["@", "+", '/', "'", '"', '\\', '(', ')', '\\n', '?', '#', ',',
                         '.', '[', ']', '%', '$', '&', ';', '!', ':', '*', '_', '=', '}', '{'])
        for char in bad_chars:
            text = text.replace(char, " ")
        text = re.sub(r'\d+', "", text)
    return text

def stop_and_stem(text, stem=True, stemmer=PorterStemmer()):
    '''
    Removes stopwords and optionally stems the remaining words
    '''
    stoplist = stopwords.words('english')
    if stem:
        text_stemmed = [stemmer.stem(word) for word in word_tokenize(text)
                        if word not in stoplist and len(word) > 3]
    else:
        text_stemmed = [word for word in word_tokenize(text)
                        if word not in stoplist and len(word) > 3]
    text = ' '.join(text_stemmed)
    return text

In the next blog, I will share the results and the comparison with k-nearest neighbors in identifying document similarity.

Hope you found this useful.

If you have any questions or suggestions for future blogs, please do drop a line in the comments section.