In this blog, I will share how to port an existing ML workflow to MLflow. The workflow in this example is the document similarity use case which I shared in my previous blogs. Let's begin by setting up the environment.

Setup Conda environment

MLflow supports Conda and Docker environments. For this example, let's set up a Conda environment.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"

Then let's install the packages needed for the document similarity application. This version of document similarity is built with scikit-learn's TF-IDF and k-nearest neighbours implementations.

conda init
conda install numpy
conda install pandas
conda install nltk
conda install scikit-learn
conda install -c conda-forge mlflow

Using the MLflow tracker

Now that we have the environment ready, let's begin with the MLflow tracker. To get started, I have cloned the ML application project repo; below is the directory structure.

.
└── project
    ├── .git
    ├── __init__.py
    ├── train
    │   ├── __init__.py
    │   └── train.py
    └── data
        └── sample_data.json

The model training code is the train/train.py module. The code for this example is quite simple: it uses scikit-learn's TfidfVectorizer to generate feature vectors from the text and then k-nearest neighbours to identify the most similar documents.
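For context, here is a minimal sketch of what the training code could look like. The data loading and the 'text' field are assumptions for illustration; the actual train.py may differ.

import json

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Load the sample documents (a 'text' field is assumed here)
with open('data/sample_data.json') as f:
    docs = pd.DataFrame(json.load(f))

# Turn the raw text into TF-IDF feature vectors
tfidf = TfidfVectorizer(stop_words='english')
vectors = tfidf.fit_transform(docs['text'])

# Fit a k-nearest neighbours model on the vectors
model = NearestNeighbors(n_neighbors=5, metric='cosine')
model.fit(vectors)

# Retrieve the documents most similar to the first one
distances, indices = model.kneighbors(vectors[0])
print(docs['text'].iloc[indices[0]])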

The MLflow tracker records training runs and provides an interface to log the parameters, code versions, metrics, and artifact files associated with each run.

In the code snippet below, model is a k-nearest neighbours model object and tfidf is a TfidfVectorizer object. Below is a sample set of parameters and artifacts which I am logging as a reference.

import mlflow
import mlflow.sklearn

# Log the model parameters for this run
mlflow.log_param('n_neighbors', model.n_neighbors)
mlflow.log_param('metric', model.metric)

# Log the fitted model and the vectorizer as run artifacts
mlflow.sklearn.log_model(model, "model")
mlflow.sklearn.log_model(tfidf, "vectorizer")

# Save local copies as well; save_model writes to a new directory
mlflow.sklearn.save_model(model, './models/model', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE)
mlflow.sklearn.save_model(tfidf, './models/tfidf', serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE)
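The locally saved copies can later be loaded back for inference, for example:

import mlflow.sklearn

# Load the saved model and vectorizer back into memory
model = mlflow.sklearn.load_model('./models/model')
tfidf = mlflow.sklearn.load_model('./models/tfidf')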

The training code can be executed as a Python module, as below, including any parameters as arguments:

python -m train.train

Each execution of the code is an MLflow run, and the MLflow UI helps track the individual executions.

The UI can be accessed using the command:

mlflow ui

By default, the UI launches on port 5000 on localhost.
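If port 5000 is already in use, the UI can be launched on a different port:

mlflow ui --port 5001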

[Screenshot: MLflow UI listing the tracked runs]

The UI lists all the MLflow runs and the logged information. In this example, the metrics are logged once per run, but it is possible to log metrics for each epoch of a run and track them live, as sketched below.
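A minimal sketch of per-epoch logging; the loss values here are placeholders for illustration:

import mlflow

with mlflow.start_run():
    # The step argument orders the values so the UI can plot them as a live curve
    for epoch, loss in enumerate([0.9, 0.5, 0.3]):
        mlflow.log_metric('loss', loss, step=epoch)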

Using MLflow projects

As the next step, let's convert the existing ML application into an MLflow project.

Let's begin by exporting a snapshot of the Conda environment into the project root directory; remember to use the --no-builds flag, as below:

conda env export --no-builds > project-env.yaml
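The exported file lists the environment name, channels, and pinned package versions. A trimmed illustration of what it might contain (the versions below are placeholders; yours will reflect your local environment):

name: base
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - numpy=1.16
  - pandas=0.24
  - nltk=3.4
  - scikit-learn=0.21
  - mlflow=1.1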

Next, we create the MLflow MLproject file containing information about the project. Below is an example MLproject file for the project that we are porting. The file can specify a name and a Conda or Docker environment, as well as more detailed information about each entry point. Specifically, each entry point defines a command to run and the parameters to pass to it. More information about the MLproject file format can be found in the MLflow documentation.

name: MLAppName
conda_env: project-env.yaml
entry_points:
    main:
        parameters:
            regularization: {type: float, default: 0.1}
        command: "python -m train.train {regularization}"

Below is the structure of the project root directory with the newly added files.

.
└── project
    ├── .git
    ├── __init__.py
    ├── MLproject
    ├── train
    │   ├── __init__.py
    │   └── train.py
    ├── data
    │   └── sample_data.json
    └── project-env.yaml

Commit and push the changes to the project repo, and we are done.

The code repo is now an MLflow project and can be executed from a different server with a simple mlflow run command, as below:

mlflow run git@gitlab.ext.company.com:user/mlflow-project.git
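Entry-point parameters defined in the MLproject file can be overridden at run time with the -P flag:

mlflow run git@gitlab.ext.company.com:user/mlflow-project.git -P regularization=0.2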

The runs are visible in the MLflow UI, as below.

[Screenshot: MLflow UI listing the runs launched with mlflow run]

This allows all the runs and their results to be tracked. The MLflow tracking and project APIs help build multi-step ML workflows quickly and make it easy to keep track of modifications and results.

Hope you found this introduction useful. In the next blog, I will introduce tracking MLflow runs using a central tracking server.