In this chapter you will learn:
- how to set up Jupyter Notebook,
- how to build two ML algorithms,
- how to save pre-processing details and algorithms.
Set up Jupyter Notebook
For building ML algorithms, I'm using Jupyter Notebook. It can be easily installed:
# run commands in your project directory
pip3 install jupyter notebook
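This tutorial assumes a virtualenv named venv. If you have not created it yet, here is a minimal sketch (the name venv is an assumption, chosen to match the kernel name used below; ipykernel provides the kernel registration command):
# create and activate a virtual environment named venv
python3 -m venv venv
source venv/bin/activate
# ipykernel is needed for the kernel registration command below
pip3 install ipykernel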
To set Jupyter to use the local virtualenv environment, run:
ipython kernel install --user --name=venv
I will create a research directory where I will put the Jupyter files. To start Jupyter Notebook, run:
# create a research directory
mkdir research
cd research
# start Jupyter
jupyter notebook
When starting a new notebook, make sure that you select the correct kernel, venv in our case.
Train ML algorithms
Before building ML algorithms, we need to install the required packages:
pip3 install numpy pandas scikit-learn joblib
The numpy and pandas packages are used for data manipulation, joblib is used for saving ML objects, and the scikit-learn package (imported as sklearn) offers a wide range of ML algorithms. After the installation, restart the Jupyter kernel so that the new packages are available.
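To quickly confirm that the packages are available in the notebook kernel, you can print their versions (a simple sanity check, not part of the tutorial's required code):
# confirm the packages are importable in the notebook kernel
import numpy, pandas, sklearn, joblib
print(numpy.__version__, pandas.__version__, sklearn.__version__, joblib.__version__)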
The first step in our code is to load packages:
import json # will be needed for saving preprocessing details
import numpy as np # for data manipulation
import pandas as pd # for data manipulation
from sklearn.model_selection import train_test_split # will be used for data split
from sklearn.preprocessing import LabelEncoder # for preprocessing
from sklearn.ensemble import RandomForestClassifier # for training the algorithm
from sklearn.ensemble import ExtraTreesClassifier # for training the algorithm
import joblib # for saving algorithm and preprocessing objects
Loading data
In this tutorial, I will use the Adult Income data set. Here, ML will be used to predict whether a person's income exceeds $50K/year based on census data. I will load the data from my public repository, which contains data sets that are good for getting started with ML.
The code to load the data and show its first rows:
# load dataset
df = pd.read_csv('https://raw.githubusercontent.com/pplonski/datasets-for-start/master/adult/data.csv', skipinitialspace=True)
x_cols = [c for c in df.columns if c != 'income']
# set input matrix and target column
X = df[x_cols]
y = df['income']
# show first rows of data
df.head()
The X matrix has 32,561 rows and 14 columns. This is the input data for our algorithm; each row describes one person. The y vector has 32,561 values indicating whether the income exceeds 50K per year.
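You can verify these dimensions yourself; the expected values below assume the data set downloaded above:
# check input and target dimensions
print(X.shape) # expected: (32561, 14)
print(y.shape) # expected: (32561,)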
Before starting data pre-processing, we will split our data into training and testing subsets. We will use 30% of the data for testing.
# data split train / test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1234)
Data pre-processing
In our data set, there are missing values and categorical columns. For training I will use the Random Forest algorithm from the sklearn package. Its current implementation cannot handle missing values or categorical columns, which is why we need to apply pre-processing first.
To fill missing values, we will use the most frequent value in each column (there are many other filling methods; the one selected here is just for example purposes).
# fill missing values
train_mode = dict(X_train.mode().iloc[0])
X_train = X_train.fillna(train_mode)
print(train_mode)
The train_mode values look like this:
{'age': 31.0,
'workclass': 3.0,
'fnlwgt': 121124.0,
'education': 11.0,
'education-num': 9.0,
'marital-status': 2.0,
'occupation': 9.0,
'relationship': 0.0,
'race': 4.0,
'sex': 1.0,
'capital-gain': 0.0,
'capital-loss': 0.0,
'hours-per-week': 40.0,
'native-country': 37.0}
From train_mode you can see that, for example, the most frequent value in the age column is 31.0.
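Had we preferred a different strategy, sklearn's SimpleImputer supports alternatives such as median imputation for numeric columns. The sketch below is purely illustrative and is not used later in this chapter; it works on a copy, since X_train has already been filled above (the column list is an example):
# illustrative alternative: fill numeric columns with the median
from sklearn.impute import SimpleImputer
numeric_cols = ["age", "hours-per-week"] # example columns
imputer = SimpleImputer(strategy="median")
X_alt = X_train.copy() # keep X_train unchanged
X_alt[numeric_cols] = imputer.fit_transform(X_alt[numeric_cols])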
Let's convert the categorical columns into numbers. I will use LabelEncoder from the sklearn package:
# convert categoricals into numbers
encoders = {}
for column in ['workclass', 'education', 'marital-status',
               'occupation', 'relationship', 'race',
               'sex', 'native-country']:
    categorical_convert = LabelEncoder()
    X_train[column] = categorical_convert.fit_transform(X_train[column])
    encoders[column] = categorical_convert
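The same pre-processing must be applied to the test data before computing predictions. A minimal sketch, reusing the train_mode and encoders objects from above (note that LabelEncoder's transform() raises an error for categories it did not see during fitting):
# apply the training pre-processing to the test data
X_test = X_test.fillna(train_mode)
for column, encoder in encoders.items():
    X_test[column] = encoder.transform(X_test[column])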
Training the algorithms
Data is ready, so we can train our Random Forest algorithm.
# train the Random Forest algorithm
rf = RandomForestClassifier(n_estimators = 100)
rf = rf.fit(X_train, y_train)
We will also train the Extra Trees algorithm:
# train the Extra Trees algorithm
et = ExtraTreesClassifier(n_estimators = 100)
et = et.fit(X_train, y_train)
As you can see, training an algorithm is easy: just two lines of code each, much less than the data reading and pre-processing.
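As a sanity check, we can measure the accuracy of both models on the test set. This assumes the test data has been pre-processed exactly as in the sketch above:
# evaluate both models on the held-out test data
from sklearn.metrics import accuracy_score
print("RF accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("ET accuracy:", accuracy_score(y_test, et.predict(X_test)))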
Now, let's save the algorithms that we have created. The important thing to notice is that the ML algorithm is not only the rf and et variables (with the model weights): we also need to save the pre-processing objects train_mode and encoders. For saving, I will use the joblib package.
# save preprocessing objects and RF algorithm
joblib.dump(train_mode, "./train_mode.joblib", compress=True)
joblib.dump(encoders, "./encoders.joblib", compress=True)
joblib.dump(rf, "./random_forest.joblib", compress=True)
joblib.dump(et, "./extra_trees.joblib", compress=True)
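For reference, the saved objects can later be loaded back (for example, in the server code) with joblib.load:
# restore pre-processing objects and algorithms from disk
train_mode = joblib.load("./train_mode.joblib")
encoders = joblib.load("./encoders.joblib")
rf = joblib.load("./random_forest.joblib")
et = joblib.load("./extra_trees.joblib")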
Add ML code and artifacts to the repository
Before continuing to the next chapter, let’s add our notebook and files to the repository.
# execute in project main directory
git add research/*
git commit -am "add ML code and algorithms"
git push
Each file with pre-processing objects and algorithms is smaller than 100 MB, which is the GitHub file size limit. For larger files, it is better to use a separate version control system designed for data, like DVC; however, that is a more advanced topic.