Gensim's Doc2Vec builds on word2vec to learn, in an unsupervised way, continuous representations for larger blocks of text such as sentences, paragraphs, or entire documents. It is an implementation of Quoc Le and Tomáš Mikolov's paper "Distributed Representations of Sentences and Documents".
Doc2Vec trains on documents and creates vector representations for them using either of the two architectures described in the paper: "distributed memory" (DM) and "distributed bag of words" (DBOW).
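For reference, the architecture is chosen with the dm flag of the gensim Doc2Vec constructor. A minimal sketch (the hyperparameter values here are placeholders, not the ones used later in this post):

from gensim.models import Doc2Vec

# dm=1 selects the "distributed memory" (PV-DM) architecture - the default
model_dm = Doc2Vec(dm=1, size=100, window=10, min_count=5)

# dm=0 selects the "distributed bag of words" (PV-DBOW) architecture
model_dbow = Doc2Vec(dm=0, size=100, min_count=5)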
The use case I have implemented is to identify the documents most similar to a given document within a training set of roughly 20,000 documents. All the documents are labelled, and there are some 500 unique document labels.
We begin with the needed imports:
import pandas as pd
import numpy as np
import nltk
import re
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
The input dataset is a JSON file in which each record has the text as a single long string and an associated label. We first read it into a pandas DataFrame and randomly reorder the rows using DataFrame.sample with the fraction set to 1.
sample = pd.read_json("data_files.json", encoding='utf-8')
sample = sample.sample(frac=1).reset_index(drop=True)
sample = sample[['text', 'label']]
print('The shape of the input data frame: {}'.format(sample.shape))
Then we clean the text to get rid of unnecessary characters and stop words, using the two helper functions listed at the end of this post (default_clean and stop_and_stem). We could also stem the words, but in this example stemming is turned off.
sample['text'] = sample['text'].apply(default_clean)
sample['text'] = sample['text'].apply(stop_and_stem, stem=False)
The input to Doc2Vec is an iterator of TaggedDocument objects (LabeledSentence in older gensim versions). Each such object represents a single document and consists of two simple lists: a list of words and a list of tags (labels). Since our input is a pandas DataFrame of documents and labels, we need to convert it into this form. For this we implement a TaggedDocumentIterator class, which takes the text and label Series as lists and creates a Python iterator that yields a TaggedDocument of words and tags for each document.
class TaggedDocumentIterator(object):
    def __init__(self, doc_list, labels_list):
        self.labels_list = labels_list
        self.doc_list = doc_list

    def __iter__(self):
        for idx, doc in enumerate(self.doc_list):
            yield TaggedDocument(words=doc.split(), tags=[self.labels_list[idx]])

docLabels = list(sample['label'])
data = list(sample['text'])
sentences = TaggedDocumentIterator(data, docLabels)
Once the TaggedDocumentIterator for our input data is ready, we can train the Doc2Vec model. Doc2Vec learns representations for words and labels simultaneously.
model = Doc2Vec(size=100, window=10, min_count=5, workers=11, alpha=0.025, iter=20)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
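Note that the parameter names above are from an older gensim release. If you are on gensim 4.x, where size was renamed to vector_size, iter to epochs, and model.docvecs to model.dv, the equivalent training code looks like this:

# Equivalent training code for gensim 4.x
model = Doc2Vec(vector_size=100, window=10, min_count=5, workers=11, alpha=0.025, epochs=20)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)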
Once the model is trained, it is a good idea to save it. Gensim provides utility methods to save and load the model from disk.
# Store the model to mmap-able files
model.save('/tmp/model_docsimilarity.doc2vec')

# Load the model
model = Doc2Vec.load('/tmp/model_docsimilarity.doc2vec')
One way to test our model is to take a sample document from the input dataset and check whether the model can find similar documents in that dataset; the model should always find the sample document itself as the closest match.
The Doc2Vec model provides an infer_vector method, which generates the vector representation of a new document so that it can be compared with the document vectors learned during training.
def test_predict():
    # Select a random document from the document dataset
    rand_int = np.random.randint(0, sample.shape[0])
    print('Random int {}'.format(rand_int))
    test_sample = sample.iloc[rand_int]['text']
    label = sample.iloc[rand_int, sample.columns.get_loc('label')]

    # Clean the document using the same utility functions used in the training phase
    test_sample = default_clean(test_sample)
    test_sample = stop_and_stem(test_sample, stem=False)

    # Convert the sample document into a list of words and use the
    # infer_vector method to get a vector representation for it
    new_doc_words = test_sample.split()
    new_doc_vec = model.infer_vector(new_doc_words, steps=50, alpha=0.25)

    # Use the most_similar utility to find the most similar documents;
    # the top hit should be the sample document's own label
    similars = model.docvecs.most_similar(positive=[new_doc_vec])
    print('Actual label: {}'.format(label))
    print('Most similar: {}'.format(similars))

test_predict()
Both the model creation and infer_vector for a new document took me some time to optimize before getting good results. I will share the optimization steps and the results in my next blog. I have also implemented the same use case with sklearn's k-nearest-neighbours algorithm on the same dataset for comparison.
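To give a rough idea of what that comparison looks like, here is a minimal sketch (not the exact code I used) that fits sklearn's NearestNeighbors over the document vectors learned by the model and queries it with an inferred vector:

from sklearn.neighbors import NearestNeighbors

# Collect the learned document vectors and their tags from the trained model
tags = list(model.docvecs.doctags.keys())
doc_vectors = np.array([model.docvecs[tag] for tag in tags])

# Fit a cosine-distance k-nearest-neighbours index over the document vectors
knn = NearestNeighbors(n_neighbors=10, metric='cosine')
knn.fit(doc_vectors)

# Query with a vector inferred for a new, cleaned and tokenised document
# (new_doc_words as prepared in test_predict above)
new_doc_vec = model.infer_vector(new_doc_words, steps=50, alpha=0.25)
distances, indices = knn.kneighbors([new_doc_vec])
print([(tags[i], dist) for i, dist in zip(indices[0], distances[0])])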
For now, here is some more code: the two simple helper functions used to clean the document text.
def default_clean(text):
    '''
    Removes default bad characters
    '''
    if not pd.isnull(text):
        bad_chars = set(["@", "+", '/', "'", '"', '\\', '(', ')', '\\n',
                         '?', '#', ',', '.', '[', ']', '%', '$', '&',
                         '!', ';', ':', '*', '_', '=', '}', '{'])
        for char in bad_chars:
            text = text.replace(char, " ")
        text = re.sub(r'\d+', "", text)
    return text

def stop_and_stem(text, stem=True, stemmer=PorterStemmer()):
    '''
    Removes stopwords and optionally stems the remaining words
    '''
    stoplist = stopwords.words('english')
    if stem:
        text_stemmed = [stemmer.stem(word) for word in word_tokenize(text)
                        if word not in stoplist and len(word) > 3]
    else:
        text_stemmed = [word for word in word_tokenize(text)
                        if word not in stoplist and len(word) > 3]
    text = ' '.join(text_stemmed)
    return text
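A quick sanity check of the two helpers (the sample string is just an illustration):

raw_text = "Call me at #555, it's urgent!"
cleaned = default_clean(raw_text)          # punctuation replaced with spaces, digits removed
print(stop_and_stem(cleaned, stem=False))  # stopwords and words of 3 or fewer characters dropped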
In the next blog, I will share the results and the comparison with kneighbor in identifying document similarity.
Hope you found this useful.
If you have any questions or suggestions for future blogs, please do drop a line in the comments section.
I’m trying to perform exactly this task and trying to adapt your code to my documents. However, for your helper functions I get the error `default_clean() takes 0 positional arguments but 1 was given`. Any thoughts?
Hi, thanks for noticing the issue. There was a missing argument in the default_clean function – I have updated it now.
Awesome. For `stop_and_stem`, did you mean to use a plain `>`? The character as rendered gives me a syntax error.
It's for checking the length and removing short words – only words with length > 3 are kept.
Can you share the “data_files.json” file, or at least a sample of it? This will help us understand the data structure used in this exercise.
Hi Manish,
The dataset is plain text as a string in the text column with an associated label – so nothing special about it. Here is one example of such a text: https://github.com/praveen049/MLtext2/blob/master/data/sms.tsv
It says the page doesn’t exist.
I moved it to a public repo. Here it is: https://github.com/praveen049/pandas/blob/master/sms.csv
You need to be logged into GitHub to access it.
Please also be careful with the delimiter in the file – it should be a tab. The label is the first word and the text is the rest of the line.
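For reference, a minimal sketch of reading that tab-delimited layout into a DataFrame with the column names this post uses:

import pandas as pd

# Tab-delimited: the label comes first, the text is the rest of the line
sample = pd.read_csv('sms.csv', sep='\t', names=['label', 'text'])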
Hi, where can I see what your dataset looks like?
Hi, the dataset is a pandas DataFrame with rows of sentences. The one I used has a label, but it’s not relevant for this particular functionality. Here is a sample dataset which I used for testing: https://github.com/praveen049/pandas/blob/master/sms.csv
I have two datasets (a train set and a test set).
What I need is to query for similar documents in the test set only; I do not care about the train set.
But because the test set is very small, I have to train the model with a large enough training dataset.
When I query for similar documents with a new document vector, it always returns similar documents from the training set. How can I get only the similar documents in the test set?
Hi, one way would be to train a word2vec model on the larger corpus and then build the Doc2Vec model on the smaller corpus in which you want to find similar documents. Please check this blog, which I published for a similar use case: https://praveenbezawada.com/2019/03/22/document-similarity-using-gensim-word2vec/
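Another quick option, just a sketch and not part of the reply above: keep training on the full corpus but filter the most_similar results down to tags that belong to the test set (test_df here stands for a hypothetical DataFrame holding the test documents):

# Collect the tags that belong to the test set (test_df is hypothetical)
test_labels = set(test_df['label'])

# Rank against all documents, then keep only hits whose tag is in the test set
all_hits = model.docvecs.most_similar(positive=[new_doc_vec], topn=len(docLabels))
test_hits = [(tag, score) for tag, score in all_hits if tag in test_labels][:10]
print(test_hits)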