In a previous post, I described a document similarity solution built on gensim Doc2Vec. One limitation of that approach is that Doc2Vec needs a large document corpus to produce good results. In many cases, the corpus in which we want to find documents similar to a given query document is too small to train a Doc2Vec model that captures the semantic relationships among the words in its vocabulary.

In this post, I show a solution that uses a Word2Vec model trained on a much larger corpus to implement document similarity. The solution is based on gensim's SoftCosineSimilarity. The soft cosine measure (or “soft” similarity) between two vectors, proposed in this paper, takes the similarity between pairs of features into account. Traditional cosine similarity treats the features of the vector space model (VSM) as independent, or orthogonal, whereas the soft cosine measure also considers how similar the features are to one another. This generalizes the concept of cosine similarity: when no two distinct features are similar, the soft cosine reduces to the ordinary cosine.
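
To make the difference concrete, here is a minimal sketch of the soft cosine formula, a^T S b / (sqrt(a^T S a) * sqrt(b^T S b)), in plain Python. The three-word vocabulary and the 0.8 similarity between "car" and "automobile" are made-up values for illustration only; in the rest of the post the matrix S comes from word2vec.

```python
import math

def bilinear(a, s, b):
    """Return a^T S b for plain Python lists."""
    return sum(a[i] * s[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def soft_cosine(a, b, s):
    """Soft cosine: a^T S b / (sqrt(a^T S a) * sqrt(b^T S b))."""
    return bilinear(a, s, b) / (
        math.sqrt(bilinear(a, s, a)) * math.sqrt(bilinear(b, s, b)))

# Hypothetical toy vocabulary: ["car", "automobile", "banana"]
doc_a = [1, 0, 0]  # mentions only "car"
doc_b = [0, 1, 0]  # mentions only "automobile"

# Identity matrix: every feature is treated as orthogonal to the others
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Made-up similarity of 0.8 between "car" and "automobile"
related = [[1, 0.8, 0], [0.8, 1, 0], [0, 0, 1]]

plain = soft_cosine(doc_a, doc_b, identity)  # same as the ordinary cosine
soft = soft_cosine(doc_a, doc_b, related)
print(plain, soft)
```

With the identity matrix the two documents share no words and score zero, while the soft version gives them credit for using related words.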

Let’s begin by importing the needed packages.

import numpy as np
import pandas as pd
import nltk
import os
from gensim.models import Word2Vec

I have a large corpus of sentences extracted from Windows text files and stored one sentence per line, with all the files in a single folder. Gensim requires the input to yield sentences sequentially when iterated over.

Below is a small iterator that processes the input file by file, line by line. This iterator code is from the gensim word2vec tutorial.

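class MySentences(object):
    """Stream tokenized sentences from every file in a directory, one line at a time."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            # The source files are Windows text files, hence the cp1252 encoding
            with open(os.path.join(self.dirname, fname), encoding='cp1252') as f:
                for line in f:
                    yield line.lower().split()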
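
Gensim makes more than one pass over the corpus (one to build the vocabulary, then one per training epoch), which is why the input is a class with `__iter__` rather than a plain generator. The sketch below illustrates the difference on a temporary folder; `LineSentences` and `line_generator` are hypothetical names, and the sample file is written as UTF-8 only to keep the example portable.

```python
import os
import tempfile

class LineSentences:
    """Same idea as MySentences: yield one tokenized line at a time."""

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            with open(path, encoding='utf-8') as handle:
                for line in handle:
                    yield line.lower().split()

def line_generator(dirname):
    """A plain generator over the same files; it can be consumed only once."""
    yield from LineSentences(dirname)

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'w', encoding='utf-8') as handle:
        handle.write('First sentence here\nSecond sentence here\n')

    sentences = LineSentences(tmp)
    first_pass = list(sentences)
    second_pass = list(sentences)  # __iter__ runs again, so this is a fresh pass

    gen = line_generator(tmp)
    gen_first = list(gen)
    gen_second = list(gen)  # the generator is exhausted, nothing is left

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```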

The code below iterates over the corpus sentences, trains a word2vec model, and saves it to disk.

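corpus_sentences = MySentences('../data/corpus/')
# gensim 3.x argument names; gensim 4 renames size and iter to vector_size and epochs
model = Word2Vec(size=300, window=10, min_count=5, workers=11, alpha=0.025, iter=20)
# The vocabulary must be built before training; this also sets model.corpus_count
model.build_vocab(corpus_sentences)
model.train(corpus_sentences, total_examples=model.corpus_count, epochs=model.iter)
model.save('../models/corpus_word2vec.model')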

It is not necessary to build the word2vec model from your own corpus. If you do not have a sufficiently large corpus, you can use an off-the-shelf pretrained model instead.

from gensim.corpora import Dictionary
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
from nltk import word_tokenize
from nltk.corpus import stopwords

Below is a simple preprocessor that cleans the documents for the document similarity use case.

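# Assumes English text; requires nltk.download('stopwords') and nltk.download('punkt')
stop_words = set(stopwords.words('english'))

def preprocess(doc):
    doc = doc.lower()  # Lowercase
    doc = word_tokenize(doc)  # Tokenize into words
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and special characters
    return doc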

We have all the pieces in place. Let’s begin by loading the word2vec model.

gates_model = Word2Vec.load('../models/corpus_word2vec.model')

We then load the document corpus for which we want to build the document similarity functionality. If this corpus were large enough, we could use it directly to build the Doc2Vec solution. In this case, we use it together with the word2vec model that we built on the larger corpus.

doc_df = pd.read_json('../data/document_data.json')
doc_df.columns

Index(['text', 'id'], dtype='object')

Using the word2vec model, we build a WordEmbeddingSimilarityIndex, which is a term similarity index that computes cosine similarities between word embeddings.

termsim_index = WordEmbeddingSimilarityIndex(gates_model.wv)
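
As a rough illustration of what such an index provides, the sketch below derives term-to-term similarities from made-up 2-d vectors. The `threshold` and `exponent` arguments mirror what I understand to be the defaults of WordEmbeddingSimilarityIndex (0.0 and 2.0); treat that as an assumption and check the gensim documentation for your version. `toy_vectors` and `term_similarity` are hypothetical names.

```python
import math

def cosine(u, v):
    """Ordinary cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Made-up 2-d "embeddings"; real word2vec vectors have hundreds of dimensions
toy_vectors = {
    'car': [1.0, 0.0],
    'automobile': [0.8, 0.6],
    'banana': [-0.6, 0.8],
}

def term_similarity(t1, t2, threshold=0.0, exponent=2.0):
    """Cosine raised to `exponent`, kept only if the cosine exceeds `threshold`."""
    if t1 == t2:
        return 1.0
    sim = cosine(toy_vectors[t1], toy_vectors[t2])
    return sim ** exponent if sim > threshold else 0.0

terms = sorted(toy_vectors)
matrix = {t1: {t2: term_similarity(t1, t2) for t2 in terms} for t1 in terms}
for t1 in terms:
    print(t1, [round(matrix[t1][t2], 2) for t2 in terms])
```

Only the "car" and "automobile" pair ends up with a non-zero off-diagonal entry, which is what keeps the resulting term similarity matrix sparse.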

Using the document corpus, we construct a dictionary and a term similarity matrix.

corpus_list = doc_df['text'].tolist()
# Clean the corpus with the same preprocessor that is applied to queries
corpus_list_token = [preprocess(each) for each in corpus_list]
dictionary = Dictionary(corpus_list_token)
bow_corpus = [dictionary.doc2bow(document) for document in corpus_list_token]
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
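
For readers new to the bag-of-words format, here is a small plain-Python sketch of what a dictionary and `doc2bow` produce: each document becomes a list of (token id, count) pairs, and tokens missing from the dictionary are skipped. `build_dictionary` and the standalone `doc2bow` are hypothetical stand-ins, and the ids gensim assigns may differ from the ones shown here.

```python
from collections import Counter

def build_dictionary(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(token2id, tokens):
    """Return sorted (token_id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

docs = [['cheap', 'car', 'car'], ['fast', 'automobile']]
token2id = build_dictionary(docs)
bow_docs = [doc2bow(token2id, d) for d in docs]

# "spaceship" never appeared in the corpus, so it is dropped from the query
query_bow = doc2bow(token2id, ['car', 'spaceship'])
print(token2id, bow_docs, query_bow)
```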

Next, we build an index that computes soft cosine similarity against the corpus of documents. The index matrix is held in memory and can also be saved to disk.

docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
docsim_index.save('../models/gensim_docims_index')

To use the docsim index, we load the index matrix and search a query against it to find the most similar documents. Below we select a random document from the document corpus and find the documents most similar to it.

docsim_index = SoftCosineSimilarity.load('../models/gensim_docims_index')
randint = np.random.randint(0, len(doc_df))  # Pick a random document from the corpus
query = preprocess(doc_df['text'].iloc[randint])
sims = docsim_index[dictionary.doc2bow(query)]  # (document index, score) pairs
result_list = [corpus_list_token[i] for i in [a[0] for a in sims]]
score_list = [a[1] for a in sims]
print('Input query : {}\n'.format(' '.join(corpus_list_token[randint])))
results = [' '.join(each) for each in result_list]
for score, result in zip(score_list, results):
    print('{:.3f} : {}'.format(score, result))

Hope you found this useful.

If you have any questions or suggestions, please drop a line in the comments section.