Experimenting with different features is fundamental to developing machine learning applications. But when you try many ideas, the code often gets messy, and it becomes hard to keep track of which features work well and which should be discarded.

Using design patterns to implement feature extraction helps keep the code clean and makes it quick to go from prototyping ideas to productizing them. In this context, combining the Strategy and Chain of Responsibility patterns gives clean code for prototyping new ideas with little overhead.

Let’s first see what these patterns are. The Strategy pattern allows selecting an algorithm from a family of algorithms at runtime. This is achieved by encapsulating each algorithm and exposing a common interface through which one is selected at runtime.
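
As a minimal sketch of the Strategy pattern, each algorithm can sit behind the same call signature and be picked by name at runtime. The function names, the STRATEGIES dictionary and run_strategy are hypothetical and are not part of the code developed later in this post.

```python
# Hypothetical strategies: each one takes a string and returns a number.
def count_characters(text):
    return len(text)


def count_words(text):
    return len(text.split())


# The common interface is "a callable that takes text".
# The dictionary key selects which algorithm runs.
STRATEGIES = {
    "characters": count_characters,
    "words": count_words,
}


def run_strategy(name, text):
    """Pick a strategy by name at runtime and apply it to the text."""
    return STRATEGIES[name](text)


print(run_strategy("words", "design patterns keep code clean"))  # 5
```

Adding a new algorithm then only means adding one function and one dictionary entry; the calling code does not change.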

The Chain of Responsibility pattern consists of a chain of processors, each of which completes part of the task. In our case, each processor extracts a set of features and then passes the data along to the next processor in the chain.
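
The chain idea can be sketched roughly as follows. The Processor base class and the two concrete processors are hypothetical names chosen for illustration; the client shown later in this post gets a similar effect with a plain list of extractors.

```python
class Processor:
    """One link in the chain: does its part, then hands the data to the next link."""

    def __init__(self, next_processor=None):
        self._next = next_processor

    def process(self, text, features):
        # Do this link's share of the work.
        self.handle(text, features)
        # Pass the same data along to the next link, if there is one.
        if self._next is not None:
            self._next.process(text, features)
        return features

    def handle(self, text, features):
        raise NotImplementedError


class LengthProcessor(Processor):
    def handle(self, text, features):
        features["length"] = len(text)


class WordCountProcessor(Processor):
    def handle(self, text, features):
        features["words"] = len(text.split())


# Build a two-link chain: length first, then word count.
chain = LengthProcessor(WordCountProcessor())
print(chain.process("hello world", {}))  # {'length': 11, 'words': 2}
```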

In this post we will see how to use these patterns to extract features from text in a flexible way.

We begin by defining an abstract class. It contains one abstract method for feature extraction, which is later implemented by concrete classes that extract different features and return them as a numpy array.

from abc import ABC, abstractmethod
class FeatureExtractor(ABC): 

    @abstractmethod
    def extract_features(self, text, **kwargs):
        """ Extracts features from the text and returns a numpy array """
        pass 
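
A quick way to see what the abstract base class buys us: Python refuses to instantiate a subclass that has not implemented the abstract method. IncompleteExtractor below is a hypothetical class used only for this check.

```python
from abc import ABC, abstractmethod


class FeatureExtractor(ABC):
    @abstractmethod
    def extract_features(self, text, **kwargs):
        """Extract features from the text and return a numpy array."""


class IncompleteExtractor(FeatureExtractor):
    """Hypothetical subclass that forgets to implement extract_features."""


try:
    IncompleteExtractor()
    instantiated = True
except TypeError:
    # Objects with unimplemented abstract methods cannot be created.
    instantiated = False

print(instantiated)  # False
```

A half-finished extractor therefore fails as soon as it is created, rather than later in the middle of an experiment.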

When trying new ideas for feature extraction, we write concrete implementations of this abstract class, as below. Normally we will have different algorithms that generate different sets of features from the text.

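import numpy as np

class ConcreteFeatureExtractor_1(FeatureExtractor):
    def extract_features(self, text, **kwargs):
        """ This is a concrete feature extractor """
        # Placeholder: compute the real features from the text here.
        features_numpy_array = np.zeros(1)
        return features_numpy_array

class ConcreteFeatureExtractor_2(FeatureExtractor):
    def extract_features(self, text, **kwargs):
        """ This is a concrete feature extractor """
        # Placeholder: compute the real features from the text here.
        features_numpy_array = np.zeros(1)
        return features_numpy_array

class ConcreteFeatureExtractor_3(FeatureExtractor):
    def extract_features(self, text, **kwargs):
        """ This is a concrete feature extractor """
        # Placeholder: compute the real features from the text here.
        features_numpy_array = np.zeros(1)
        return features_numpy_array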
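
To make the skeleton above more concrete, here is a sketch of what two extractors might look like. The class names and the features themselves (length counts and punctuation counts) are assumptions picked for illustration, not the features from any particular project.

```python
from abc import ABC, abstractmethod

import numpy as np


class FeatureExtractor(ABC):
    @abstractmethod
    def extract_features(self, text, **kwargs):
        """Extract features from the text and return a numpy array."""


class LengthFeatureExtractor(FeatureExtractor):
    """Hypothetical extractor: character count and word count."""

    def extract_features(self, text, **kwargs):
        return np.array([len(text), len(text.split())], dtype=float)


class PunctuationFeatureExtractor(FeatureExtractor):
    """Hypothetical extractor: counts of '.', ',', '!' and '?'."""

    def extract_features(self, text, **kwargs):
        return np.array([text.count(c) for c in ".,!?"], dtype=float)


print(LengthFeatureExtractor().extract_features("Hello, world!"))
print(PunctuationFeatureExtractor().extract_features("Hello, world!"))
```

Each extractor is a self-contained strategy: it can be written, tested and thrown away without touching the others.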

Once the concrete feature extractors are implemented, we can chain them together by defining a client class, as below:

class FeatureExtractorClient:

    def __init__(self):
        self._extractor1 = ConcreteFeatureExtractor_1()
        self._extractor2 = ConcreteFeatureExtractor_2()
        self._extractor3 = ConcreteFeatureExtractor_3()
        self._load_dict()

    def _load_dict(self):
        # Map each test name to the chain of extractors it uses.
        self.extractor_dict = {
             "test1": [self._extractor1, self._extractor2],
             "test2": [self._extractor2, self._extractor3],
             "test3": [self._extractor3]
             }

    def get_features(self, text, test='test1'):
        # Fail with a clear message instead of iterating over None.
        if test not in self.extractor_dict:
            raise KeyError(f"Unknown test: {test!r}")
        extractor_chain = self.extractor_dict[test]
        # Run every extractor in the chain on the same text.
        feature_list = [extractor.extract_features(text) for extractor in extractor_chain]
        return feature_list

Once the chains of extractors are ready, we can call get_features on the FeatureExtractorClient, passing the text and the name of the chain we want to test. The function returns feature_list, a list of the numpy arrays returned by each extractor in the chain.
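
Here is a rough usage sketch. To keep it self-contained it uses two hypothetical stand-in extractors and a module-level get_features function instead of the full client class; the final np.concatenate step is also an assumption about how the list of arrays might be fed to a model.

```python
import numpy as np


class LengthExtractor:
    """Hypothetical stand-in for a concrete feature extractor."""

    def extract_features(self, text, **kwargs):
        return np.array([len(text)], dtype=float)


class WordCountExtractor:
    """Hypothetical stand-in for a second concrete feature extractor."""

    def extract_features(self, text, **kwargs):
        return np.array([len(text.split())], dtype=float)


# Same idea as extractor_dict: a test name maps to a chain of extractors.
chains = {
    "test1": [LengthExtractor(), WordCountExtractor()],
    "test2": [WordCountExtractor()],
}


def get_features(text, test="test1"):
    """Run every extractor in the chosen chain and collect the arrays."""
    return [extractor.extract_features(text) for extractor in chains[test]]


feature_list = get_features("clean code matters", test="test1")
# Join the per-extractor arrays into one flat feature vector.
feature_vector = np.concatenate(feature_list)
print(feature_vector.shape)  # (2,)
```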

This makes it easy to keep track of all the tests with different feature extractors and finally select the best set of features. Productizing is then simply a matter of selecting our best ‘test’.
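
One possible way to compare the tests is sketched below. The select_best_test helper and the evaluate callable are hypothetical, and the scores are made-up placeholders that only make the sketch runnable; in practice evaluate would train and validate a model on the features from each chain.

```python
def select_best_test(test_names, evaluate):
    """Score every test with evaluate and return the best name plus all scores."""
    scores = {name: evaluate(name) for name in test_names}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up placeholder scores, NOT real measurements.
placeholder_scores = {"test1": 0.0, "test2": 1.0, "test3": 0.5}

best, scores = select_best_test(["test1", "test2", "test3"], placeholder_scores.get)
print(best)  # test2
```

The winning name can then be passed as the test argument of get_features in production.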

Hope you found this useful.

If you have any questions or suggestions for future blogs, please do drop a line in the comments section.