Sklearn Basics 1: Train, Evaluate and Deploy a Classifier

In this lesson, we will learn how to train, evaluate and deploy a classifier with Khiops sklearn.

We start by importing the Khiops sklearn classifier, KhiopsClassifier, and saving the location of the Khiops samples directory in a variable:

from os import path
import pandas as pd

from khiops import core as kh
from khiops.sklearn import KhiopsClassifier

samples_dir = kh.get_samples_dir()
print(f"Khiops samples directory located at {samples_dir}")
Khiops samples directory located at /github/home/khiops_data/samples

Training a Classifier

We’ll train a classifier for the Iris dataset. This is a classical dataset describing plants of the genus Iris. It contains 150 records, 50 for each of the three Iris variants: Setosa, Virginica and Versicolor. Each record contains the length and the width of both the petal and the sepal of the plant. The standard task with this dataset is to build a classifier that predicts the type of Iris from the petal and sepal measurements.

To train a classifier with Khiops, we only need a dataframe, which we load from a file.

Let’s first save the location of this file in the variable iris_data_file, load it and take a look at its contents:

iris_data_file = path.join(samples_dir, "Iris", "Iris.txt")
print("")
print("Iris data: first 5 records")
iris_df = pd.read_csv(iris_data_file, sep="\t")
iris_df.head()  # head() returns the first 5 rows by default
Iris data: first 5 records
   SepalLength  SepalWidth  PetalLength  PetalWidth        Class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
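
The loading step above relies on the sep="\t" argument of pd.read_csv. As a minimal sketch of what that argument does, the block below parses a few of the rows shown above from an in-memory string instead of the real Iris.txt file. The variable names iris_text and sample_df are illustrative only.

```python
import io

import pandas as pd

# A few rows of the Iris data shown above, written as tab-separated text.
# This string stands in for the real file so the sketch runs anywhere.
iris_text = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
    "4.7\t3.2\t1.3\t0.2\tIris-setosa\n"
)

# sep="\t" tells pandas that the columns are separated by tabs, not commas.
sample_df = pd.read_csv(io.StringIO(iris_text), sep="\t")

# The four measurements are parsed as floats, the label as text.
print(sample_df.dtypes)
print(sample_df.shape)
```

Without sep="\t", pandas would look for commas and read each line as a single column.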

Before training the classifier, we split the data into the feature matrix (sepal length, sepal width, etc.) and the target vector containing the labels (the Class column).

X_iris_train = iris_df.drop("Class", axis=1)
y_iris_train = iris_df["Class"]

Let’s check the contents of the feature matrix and the target vector:

print("Features of the Iris dataset:")
display(X_iris_train.head())
print("")
print("Label of the Iris dataset:")
display(y_iris_train.head())
Features of the Iris dataset:
   SepalLength  SepalWidth  PetalLength  PetalWidth
0          5.1         3.5          1.4         0.2
1          4.9         3.0          1.4         0.2
2          4.7         3.2          1.3         0.2
3          4.6         3.1          1.5         0.2
4          5.0         3.6          1.4         0.2
Label of the Iris dataset:
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: Class, dtype: object
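
To make the split more concrete, here is a small sketch on a hand-built dataframe that reuses three of the rows shown above. It illustrates two properties the tutorial relies on: drop returns a new dataframe and leaves the original untouched, and the feature matrix and target vector keep the same row index. The names toy_df, X_toy and y_toy are illustrative only.

```python
import pandas as pd

# Three Iris records copied from the output above.
toy_df = pd.DataFrame(
    {
        "SepalLength": [5.1, 4.9, 4.7],
        "SepalWidth": [3.5, 3.0, 3.2],
        "PetalLength": [1.4, 1.4, 1.3],
        "PetalWidth": [0.2, 0.2, 0.2],
        "Class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Feature matrix: every column except the label (a DataFrame).
X_toy = toy_df.drop("Class", axis=1)
# Target vector: the label column alone (a Series).
y_toy = toy_df["Class"]

# drop() returned a copy, so the original still has its Class column.
print(list(toy_df.columns))
print(list(X_toy.columns))
# Both pieces share the same row index, so row i of X matches value i of y.
print(X_toy.index.equals(y_toy.index))
```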

Let’s now train the classifier with the fit method of the KhiopsClassifier estimator. After fitting, the model is ready to classify new Iris plants.

Note: By default, Khiops builds 10 decision trees. They are not necessary for this tutorial, so we set n_trees=0.

khc_iris = KhiopsClassifier(n_trees=0)
khc_iris.fit(X_iris_train, y_iris_train)
KhiopsClassifier(n_trees=0)

Exercise

We’ll repeat the same steps with the Adult dataset. It contains characteristics of an adult population in the USA, such as age, gender and education. The task here is to predict the variable class, which indicates whether the individual earns more or less than 50,000 dollars.

Let’s start by loading the Adult dataframe and checking its contents:

Load the adult dataset and take a look at its content

adult_data_file = path.join(samples_dir, "Adult", "Adult.txt")
print("")
print("Adult data: first 5 records")
adult_df = pd.read_csv(adult_data_file, sep="\t")
adult_df.head()  # head() returns the first 5 rows by default
Adult data: first 5 records
   Label  age         workclass  fnlwgt  education  education_num
0      1   39         State-gov   77516  Bachelors             13
1      2   50  Self-emp-not-inc   83311  Bachelors             13
2      3   38           Private  215646    HS-grad              9
3      4   53           Private  234721       11th              7
4      5   28           Private  338409  Bachelors             13

       marital_status         occupation   relationship   race     sex
0       Never-married       Adm-clerical  Not-in-family  White    Male
1  Married-civ-spouse    Exec-managerial        Husband  White    Male
2            Divorced  Handlers-cleaners  Not-in-family  White    Male
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

   capital_gain  capital_loss  hours_per_week native_country class
0          2174             0              40  United-States  less
1             0             0              13  United-States  less
2             0             0              40  United-States  less
3             0             0              40  United-States  less
4             0             0              40           Cuba  less

Build the feature matrix and the target vector to train the Adult classifier

Note that the name of the target variable is class (in lower case!).

X_adult_train = adult_df.drop(["class"], axis=1)
y_adult_train = adult_df["class"]
print("Adult dataset feature matrix (first 10 rows):")
display(X_adult_train.head(10))
print("")
print("Adult dataset target vector (first 10 values):")
display(y_adult_train.head(10))
Adult dataset feature matrix (first 10 rows):
   Label  age         workclass  fnlwgt  education  education_num
0      1   39         State-gov   77516  Bachelors             13
1      2   50  Self-emp-not-inc   83311  Bachelors             13
2      3   38           Private  215646    HS-grad              9
3      4   53           Private  234721       11th              7
4      5   28           Private  338409  Bachelors             13
5      6   37           Private  284582    Masters             14
6      7   49           Private  160187        9th              5
7      8   52  Self-emp-not-inc  209642    HS-grad              9
8      9   31           Private   45781    Masters             14
9     10   42           Private  159449  Bachelors             13

          marital_status         occupation   relationship   race     sex
0          Never-married       Adm-clerical  Not-in-family  White    Male
1     Married-civ-spouse    Exec-managerial        Husband  White    Male
2               Divorced  Handlers-cleaners  Not-in-family  White    Male
3     Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4     Married-civ-spouse     Prof-specialty           Wife  Black  Female
5     Married-civ-spouse    Exec-managerial           Wife  White  Female
6  Married-spouse-absent      Other-service  Not-in-family  Black  Female
7     Married-civ-spouse    Exec-managerial        Husband  White    Male
8          Never-married     Prof-specialty  Not-in-family  White  Female
9     Married-civ-spouse    Exec-managerial        Husband  White    Male

   capital_gain  capital_loss  hours_per_week native_country
0          2174             0              40  United-States
1             0             0              13  United-States
2             0             0              40  United-States
3             0             0              40  United-States
4             0             0              40           Cuba
5             0             0              40  United-States
6             0             0              16        Jamaica
7             0             0              45  United-States
8         14084             0              50  United-States
9          5178             0              40  United-States
Adult dataset target vector (first 10 values):
0    less
1    less
2    less
3    less
4    less
5    less
6    less
7    more
8    more
9    more
Name: class, dtype: object
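
Before training on a target vector like this one, it can be useful to look at how the labels are distributed. The sketch below uses only the ten values displayed above, so the shares it prints describe this small sample and not the whole Adult dataset. The name y_sample is illustrative only.

```python
import pandas as pd

# The ten target values displayed above: seven "less" followed by three "more".
y_sample = pd.Series(["less"] * 7 + ["more"] * 3, name="class")

# Absolute number of records per label.
counts = y_sample.value_counts()
# Same information as a fraction of the sample.
shares = y_sample.value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two calls on y_adult_train would give the class balance of the full training set.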

Train a classifier for the Adult dataset

Do not forget to set n_trees=0

khc_adult = KhiopsClassifier(n_trees=0)
khc_adult.fit(X_adult_train, y_adult_train)
KhiopsClassifier(n_trees=0)

Accessing the Classifier’s Basic Train Evaluation Metrics

Khiops calculates evaluation metrics on the training dataset. We access them via the model’s model_report_ attribute, which is an instance of the AnalysisResults class. Let’s check this out:

iris_results = khc_iris.model_report_
print(type(iris_results))
<class 'khiops.core.analysis_results.AnalysisResults'>

The model evaluation report is stored in the train_evaluation_report attribute of iris_results.

iris_train_eval = iris_results.train_evaluation_report
print(type(iris_train_eval))
<class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the get_snb_performance method of iris_train_eval:

iris_train_performance = iris_train_eval.get_snb_performance()
print(type(iris_train_performance))
<class 'khiops.core.analysis_results.PredictorPerformance'>

The iris_train_performance object is an instance of the PredictorPerformance class and has accuracy and auc attributes:

print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris train AUC     : {iris_train_performance.auc}")
Iris train accuracy: 0.96
Iris train AUC     : 0.9914
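
The three access steps above (model_report_, then train_evaluation_report, then get_snb_performance) can be wrapped in a small helper. The sketch below is one possible way to do it. The function name train_metrics is hypothetical, and so that the block runs without Khiops installed it is exercised on a stand-in object built with SimpleNamespace that only mimics the attribute names used in this lesson.

```python
from types import SimpleNamespace


def train_metrics(fitted_estimator):
    """Return (accuracy, auc) measured on the training data.

    Follows the same attribute chain as the cells above; the estimator
    is assumed to have been fitted already.
    """
    report = fitted_estimator.model_report_
    performance = report.train_evaluation_report.get_snb_performance()
    return performance.accuracy, performance.auc


# Hypothetical stand-in for a fitted estimator (NOT a Khiops object).
# It carries the Iris values printed above, for demonstration only.
fake_performance = SimpleNamespace(accuracy=0.96, auc=0.9914)
fake_estimator = SimpleNamespace(
    model_report_=SimpleNamespace(
        train_evaluation_report=SimpleNamespace(
            get_snb_performance=lambda: fake_performance
        )
    )
)

accuracy, auc = train_metrics(fake_estimator)
print(f"Train accuracy: {accuracy}")
print(f"Train AUC     : {auc}")
```

With the objects from this lesson, the call would presumably be train_metrics(khc_iris) or train_metrics(khc_adult).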

The PredictorPerformance object also has a confusion_matrix attribute:

iris_classes = iris_train_performance.confusion_matrix.values
iris_confusion_matrix = pd.DataFrame(
    iris_train_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
print("Iris train confusion matrix:")
iris_confusion_matrix
Iris train confusion matrix:
                 Iris-setosa  Iris-versicolor  Iris-virginica
Iris-setosa               50                0               0
Iris-versicolor            0               49               5
Iris-virginica             0                1              45
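
The accuracy reported earlier can be recovered from this confusion matrix: correctly classified records sit on the diagonal, so accuracy is the diagonal sum divided by the total count. The sketch below checks this with the counts copied from the output above. The function name accuracy_from_confusion is illustrative only.

```python
# Confusion matrix counts copied from the output above.
iris_matrix = [
    [50, 0, 0],
    [0, 49, 5],
    [0, 1, 45],
]


def accuracy_from_confusion(matrix):
    """Accuracy = correctly classified records / all records."""
    # Diagonal cells are the records whose predicted and actual labels agree.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


iris_accuracy = accuracy_from_confusion(iris_matrix)
print(f"Accuracy from confusion matrix: {iris_accuracy}")
```

Here the diagonal sums to 50 + 49 + 45 = 144 out of 150 records, which gives 0.96, the train accuracy printed earlier. The result does not depend on whether rows hold the predicted or the actual classes.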

Exercise

Access the adult modeling report and print its type

adult_results = khc_adult.model_report_
type(adult_results)
khiops.core.analysis_results.AnalysisResults

Save the evaluation report of the Adult classification into the variable adult_train_eval

adult_train_eval = adult_results.train_evaluation_report

Show the model’s train accuracy, auc and confusion matrix

adult_train_performance = adult_train_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult train AUC     : {adult_train_performance.auc}")

adult_classes = adult_train_performance.confusion_matrix.values
adult_confusion_matrix = pd.DataFrame(
    adult_train_performance.confusion_matrix.matrix,
    columns=adult_classes,
    index=adult_classes,
)
print("Adult train confusion matrix:")
adult_confusion_matrix
Adult train accuracy: 0.869334
Adult train AUC     : 0.925553
Adult train confusion matrix:
       less  more
less  35197  4424
more   1958  7263

Deploying a Classifier

We are now going to deploy the Iris classifier khc_iris, which we have just trained, on the same dataset (normally we would do this on new data).
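
As a side note on the "new data" remark, one common way to get data the model has not seen is to hold out part of the table before training. The sketch below does this with pandas only, on a hypothetical toy table; it is not part of the original lesson, and the 70/30 proportion and the random_state value are arbitrary choices.

```python
import pandas as pd

# Hypothetical toy table standing in for a dataframe such as iris_df.
toy_df = pd.DataFrame(
    {
        "feature": list(range(10)),
        "Class": ["a", "b"] * 5,
    }
)

# Draw 70% of the rows at random for training (seeded for repeatability).
train_df = toy_df.sample(frac=0.7, random_state=0)
# The remaining rows, never seen during training, form the test set.
test_df = toy_df.drop(train_df.index)

# Split each part into features and target, as done earlier in the lesson.
X_train, y_train = train_df.drop("Class", axis=1), train_df["Class"]
X_test, y_test = test_df.drop("Class", axis=1), test_df["Class"]

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
```

A model fitted on X_train and y_train could then be deployed on X_test with the methods described next.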

The learned classifier can be deployed in two different ways:

  • to predict a class, using the predict method of the model.

  • to predict class probabilities, using the predict_proba method of the model.

Let’s first predict the Iris labels:

iris_predictions = khc_iris.predict(X_iris_train)
print("Iris model predictions (first 10 values):")
iris_predictions[:10]
Iris model predictions (first 10 values):
array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa'], dtype='<U15')

Let’s now predict the probabilities for each Iris type. Note that the column order of this matrix is given by the estimator attribute khc_iris.classes_:

iris_probas = khc_iris.predict_proba(X_iris_train)
print(f"Iris classes {khc_iris.classes_}")
print("Iris model probabilities for each class (first 10 rows):")
iris_probas[:10]
Iris classes ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Iris model probabilities for each class (first 10 rows):
array([[0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729],
       [0.99730542, 0.00134729, 0.00134729]])
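
Since the probability matrix is a plain array, pairing it with the class names makes it easier to read. The sketch below labels the columns with the classes printed above and checks that each row sums to 1. It copies two rows from the output instead of calling the model, and the names classes, probas and probas_df are illustrative only.

```python
import numpy as np
import pandas as pd

# Class order and two probability rows copied from the output above.
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
probas = np.array(
    [
        [0.99730542, 0.00134729, 0.00134729],
        [0.99730542, 0.00134729, 0.00134729],
    ]
)

# Column i of the matrix corresponds to classes[i].
probas_df = pd.DataFrame(probas, columns=classes)
print(probas_df)

# Each row is a probability distribution over the classes.
row_sums = probas.sum(axis=1)
print(row_sums)
```

With the objects from this lesson, the equivalent would be pd.DataFrame(iris_probas, columns=khc_iris.classes_).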

Exercise

Use the predict and predict_proba methods to deploy the Adult model khc_adult

Which columns are deployed in each case?

adult_predictions = khc_adult.predict(X_adult_train)
print("Adult model predictions (first 10 values):")
display(adult_predictions[:10])

adult_probas = khc_adult.predict_proba(X_adult_train)
print(f"Adult classes {khc_adult.classes_}")
print("Adult model probabilities for each class (first 10 rows):")
display(adult_probas[:10])
Adult model predictions (first 10 values):
array(['less', 'more', 'less', 'less', 'less', 'more', 'less', 'more',
       'more', 'more'], dtype='<U4')
Adult classes ['less' 'more']
Adult model probabilities for each class (first 10 rows):
array([[9.99994845e-01, 5.15479465e-06],
       [4.05868754e-01, 5.94131246e-01],
       [9.61770510e-01, 3.82294902e-02],
       [9.12629478e-01, 8.73705223e-02],
       [5.62226618e-01, 4.37773382e-01],
       [2.32734078e-01, 7.67265922e-01],
       [9.93356522e-01, 6.64347792e-03],
       [4.24222870e-01, 5.75777130e-01],
       [1.79285954e-03, 9.98207141e-01],
       [5.17299187e-06, 9.99994827e-01]])
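
The two outputs above can be related to each other. Assuming that predict returns the most probable class (an assumption that matches the values shown here, not a statement about the library’s internals), the predicted labels are obtained by taking the column with the highest probability in each row and looking it up in classes_. The sketch below checks this on the ten rows copied from the output. The variable names are illustrative only.

```python
import numpy as np

# Class order and probabilities copied from the output above.
adult_classes = np.array(["less", "more"])
adult_probas_sample = np.array(
    [
        [9.99994845e-01, 5.15479465e-06],
        [4.05868754e-01, 5.94131246e-01],
        [9.61770510e-01, 3.82294902e-02],
        [9.12629478e-01, 8.73705223e-02],
        [5.62226618e-01, 4.37773382e-01],
        [2.32734078e-01, 7.67265922e-01],
        [9.93356522e-01, 6.64347792e-03],
        [4.24222870e-01, 5.75777130e-01],
        [1.79285954e-03, 9.98207141e-01],
        [5.17299187e-06, 9.99994827e-01],
    ]
)

# Index of the highest probability in each row, mapped to its class name.
labels_from_probas = adult_classes[adult_probas_sample.argmax(axis=1)]
print(labels_from_probas)
```

The resulting labels agree with the ten predictions displayed above, including the closer calls such as the second row (0.59 for more).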