In this blog, I will share how to port an existing ML workflow to MLflow. The workflow in this example is the document similarity use case I shared in my previous blogs. Let’s begin by setting up the environment.
Setup Conda environment
MLflow supports both Conda and Docker environments. In this example, we will set up a Conda environment.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
Then let’s install the packages needed for the document similarity application. This version of document similarity is built with scikit-learn’s TF-IDF and k-nearest neighbours implementations.
conda init
conda install numpy
conda install pandas
conda install nltk
conda install scikit-learn
conda install -c conda-forge mlflow
Using MLFlow tracker
Now that we have the environment ready, let’s begin using the MLflow tracker. To get started, I have cloned the ML application project repo; below is the directory structure.
.
└── project
    ├── .git
    ├── __init__.py
    ├── train
    │   ├── __init__.py
    │   └── train.py
    └── data
        └── sample_data.json
The model training code lives in the train/train.py module. The code for this example is quite simple: it uses scikit-learn’s TfidfVectorizer to generate feature vectors from text and then k-nearest neighbours to identify the most similar documents.
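As a minimal sketch of that pipeline (the corpus here is made up; the real train.py reads data/sample_data.json):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

documents = [
    "machine learning with python",
    "deep learning neural networks",
    "python web development",
]

# Fit TF-IDF on the corpus and build a nearest-neighbour index over the vectors
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(documents)

model = NearestNeighbors(n_neighbors=2, metric="cosine")
model.fit(vectors)

# Find the documents most similar to a query
query = tfidf.transform(["learning python"])
distances, indices = model.kneighbors(query)
```

The kneighbors call returns the indices of the most similar documents in the corpus along with their cosine distances.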
The MLflow tracker records training runs and provides an interface to log the parameters, code versions, metrics, and artifact files associated with each run.
In the code snippet below, model is a k-nearest neighbours model object and tfidf is a TfidfVectorizer object. Below is a sample set of parameters and metrics that I am logging as a reference.
import mlflow
import mlflow.sklearn

mlflow.log_param('n_neighbors', model.n_neighbors)
mlflow.log_param('metric', model.metric)

mlflow.sklearn.log_model(model, "model")
mlflow.sklearn.log_model(tfidf, "vectorizer")

mlflow.sklearn.save_model(model, './models/model',
                          serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE)
mlflow.sklearn.save_model(tfidf, './models/tfidf',
                          serialization_format=mlflow.sklearn.SERIALIZATION_FORMAT_PICKLE)
The training code can be executed as a Python module, passing any parameters as arguments:
python -m train.train
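For command-line arguments to reach the module, train.py needs to parse them. A minimal sketch using argparse (the parameter name here is illustrative, not the actual signature of train.py):

```python
import argparse

def parse_args(argv=None):
    # A single optional positional argument with a sensible default,
    # matching the "python -m train.train <value>" invocation style
    parser = argparse.ArgumentParser(description="Train the document similarity model")
    parser.add_argument("n_neighbors", nargs="?", type=int, default=5,
                        help="number of neighbours for the k-NN model")
    return parser.parse_args(argv)

args = parse_args(["10"])
```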
Each execution of the code is an MLflow run, and the MLflow UI helps track the individual executions.
The UI can be accessed using the command
mlflow ui
By default, the UI launches on port 5000 on localhost.
The UI lists all the MLflow runs and the logged information. In this example, the metrics are logged once per run, but it is possible to log metrics for each epoch of the run and track them live.
Using MLFlow projects
Next, let’s convert the existing ML application into an MLflow project.
Let’s begin by exporting a snapshot of the Conda environment into the project root directory; remember to use the --no-builds flag, as below:
conda env export --no-builds > project-env.yaml
Next, we create the MLflow MLproject file containing the project’s information. Below is an example of the MLproject file for the project we are porting. The file can specify a name and a Conda or Docker environment, as well as more detailed information about each entry point. Specifically, each entry point defines a command to run and parameters to pass to that command. More information about the MLproject file format can be found in the MLflow documentation.
name: MLAppName
conda_env: project-env.yaml
entry_points:
  main:
    parameters:
      regularization: {type: float, default: 0.1}
    command: "python -m train.train {regularization}"
Below is the structure of the project root directory with the newly added files.
.
└── project
    ├── .git
    ├── __init__.py
    ├── MLproject
    ├── train
    │   ├── __init__.py
    │   └── train.py
    ├── data
    │   └── sample_data.json
    └── project-env.yaml
Commit and push the changes to the project repo, and we are done.
The code repo is now an MLproject and can be executed from a different server with a simple mlflow run command, as below:
mlflow run git@gitlab.ext.company.com:user/mlflow-project.git
The runs are visible in the MLFlow UI as below.
This allows for tracking all the runs and their results. Together, the MLflow tracking and project APIs help build multi-step ML workflows quickly and make it easy to keep track of modifications and results.
Hope you found this introduction useful. In the next blogs, I will introduce tracking MLflow runs using a central tracking server.