Sklearn Basics 1: Train, Evaluate and Deploy a Classifier¶
In this lesson, we will learn how to train, evaluate and deploy a classifier with Khiops sklearn.
We start by importing the sklearn estimator KhiopsClassifier:
import os
import pandas as pd
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
# If there are any issues, you may check the Khiops status with the following command
# kh.get_runner().print_status()
Training a Classifier¶
We’ll train a classifier for the Iris dataset. This is a classical
dataset containing data on different plants belonging to the genus
Iris. It contains 150 records, 50 for each of the three Iris variants:
Setosa, Virginica and Versicolor. Each record contains the length and
the width of both the petal and the sepal of the plant. The standard
task, when using this dataset, is to construct a classifier for the
type of the Iris, based on the petal and sepal characteristics.
To train a classifier with Khiops, we only need a dataframe that we are going to load from a file.
Let’s first save the location of this file into a variable iris_data_file, load it and take a look at its content:
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
iris_df = pd.read_csv(iris_data_file, sep="\t")
print("Iris data: first 5 records")
iris_df.head()
Iris data: first 5 records
SepalLength SepalWidth PetalLength PetalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Before training the classifier, we split the data into the feature
matrix (sepal length, width, etc.) and the target vector containing the
labels (the Class column).
X_iris_train = iris_df.drop("Class", axis=1)
y_iris_train = iris_df["Class"]
Let’s check the contents of the feature matrix and the target vector:
print("Features of the Iris dataset:")
display(X_iris_train.head())
print("")
print("Label of the Iris dataset:")
display(y_iris_train.head())
Features of the Iris dataset:
SepalLength SepalWidth PetalLength PetalWidth
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
Label of the Iris dataset:
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
Name: Class, dtype: object
Let’s now train the classifier with the Khiops estimator
KhiopsClassifier. Its fit method returns a model ready to classify new
Iris plants.
Note: By default Khiops builds 10 decision trees. This is not necessary for this tutorial, so we set n_trees=0:
khc_iris = KhiopsClassifier(n_trees=0)
khc_iris.fit(X_iris_train, y_iris_train)
KhiopsClassifier(n_trees=0)
Exercise¶
We’ll repeat the same steps with the Adult dataset. It contains
characteristics of an adult population in the USA such as age, gender
and education. The task here is to predict the variable class, which
indicates whether the individual earns more or less than 50,000 dollars.
Let’s start by loading the Adult dataframe and checking its contents:
Load the adult dataset and take a look at its content¶
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
adult_df = pd.read_csv(adult_data_file, sep="\t")
print("Adult data: first 5 records")
adult_df.head()
Adult data: first 5 records
Label age workclass fnlwgt education education_num
0 1 39 State-gov 77516 Bachelors 13
1 2 50 Self-emp-not-inc 83311 Bachelors 13
2 3 38 Private 215646 HS-grad 9
3 4 53 Private 234721 11th 7
4 5 28 Private 338409 Bachelors 13
marital_status occupation relationship race sex
0 Never-married Adm-clerical Not-in-family White Male
1 Married-civ-spouse Exec-managerial Husband White Male
2 Divorced Handlers-cleaners Not-in-family White Male
3 Married-civ-spouse Handlers-cleaners Husband Black Male
4 Married-civ-spouse Prof-specialty Wife Black Female
capital_gain capital_loss hours_per_week native_country class
0 2174 0 40 United-States less
1 0 0 13 United-States less
2 0 0 40 United-States less
3 0 0 40 United-States less
4 0 0 40 Cuba less
Build the feature matrix and the target vector to train the Adult classifier¶
Note that the name of the target variable is class (in lower case!).
X_adult_train = adult_df.drop(["class"], axis=1)
y_adult_train = adult_df["class"]
print("Adult dataset feature matrix (first 10 rows):")
display(X_adult_train.head(10))
print("Adult dataset target vector (first 10 values):")
display(y_adult_train.head(10))
Adult dataset feature matrix (first 10 rows):
Label age workclass fnlwgt education education_num
0 1 39 State-gov 77516 Bachelors 13
1 2 50 Self-emp-not-inc 83311 Bachelors 13
2 3 38 Private 215646 HS-grad 9
3 4 53 Private 234721 11th 7
4 5 28 Private 338409 Bachelors 13
5 6 37 Private 284582 Masters 14
6 7 49 Private 160187 9th 5
7 8 52 Self-emp-not-inc 209642 HS-grad 9
8 9 31 Private 45781 Masters 14
9 10 42 Private 159449 Bachelors 13
marital_status occupation relationship race sex
0 Never-married Adm-clerical Not-in-family White Male
1 Married-civ-spouse Exec-managerial Husband White Male
2 Divorced Handlers-cleaners Not-in-family White Male
3 Married-civ-spouse Handlers-cleaners Husband Black Male
4 Married-civ-spouse Prof-specialty Wife Black Female
5 Married-civ-spouse Exec-managerial Wife White Female
6 Married-spouse-absent Other-service Not-in-family Black Female
7 Married-civ-spouse Exec-managerial Husband White Male
8 Never-married Prof-specialty Not-in-family White Female
9 Married-civ-spouse Exec-managerial Husband White Male
capital_gain capital_loss hours_per_week native_country
0 2174 0 40 United-States
1 0 0 13 United-States
2 0 0 40 United-States
3 0 0 40 United-States
4 0 0 40 Cuba
5 0 0 40 United-States
6 0 0 16 Jamaica
7 0 0 45 United-States
8 14084 0 50 United-States
9 5178 0 40 United-States
Adult dataset target vector (first 10 values):
0 less
1 less
2 less
3 less
4 less
5 less
6 less
7 more
8 more
9 more
Name: class, dtype: object
Train a classifier for the Adult dataset¶
Do not forget to set n_trees=0:
khc_adult = KhiopsClassifier(n_trees=0)
khc_adult.fit(X_adult_train, y_adult_train)
KhiopsClassifier(n_trees=0)
Accessing the Classifier’s Basic Train Evaluation Metrics¶
Khiops calculates evaluation metrics for the training dataset. We access
them via the model’s attribute model_report_, which is an instance of
the AnalysisResults class. Let’s check this out:
iris_results = khc_iris.model_report_
print(type(iris_results))
<class 'khiops.core.analysis_results.AnalysisResults'>
The model evaluation report is stored in the train_evaluation_report
attribute of iris_results.
iris_train_eval = iris_results.train_evaluation_report
print(type(iris_train_eval))
<class 'khiops.core.analysis_results.EvaluationReport'>
We access the default predictor’s metrics with the
get_snb_performance method of iris_train_eval:
iris_train_performance = iris_train_eval.get_snb_performance()
print(type(iris_train_performance))
<class 'khiops.core.analysis_results.PredictorPerformance'>
This object, iris_train_performance, is of class PredictorPerformance
and has accuracy and auc attributes:
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris train AUC : {iris_train_performance.auc}")
Iris train accuracy: 0.96
Iris train AUC : 0.9914
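The access steps above can be wrapped in a small helper. The sketch below is one possible way to do it, using only the attributes shown in this lesson (model_report_, train_evaluation_report, get_snb_performance, accuracy and auc). The function name train_metrics and the stand-in object fake_khc are hypothetical; the stand-in only mimics that attribute chain so that the snippet runs without Khiops installed.

```python
from types import SimpleNamespace


def train_metrics(estimator):
    """Return the train accuracy and AUC of a fitted Khiops estimator.

    Hypothetical helper: it only chains the attributes used in this lesson.
    """
    performance = (
        estimator.model_report_.train_evaluation_report.get_snb_performance()
    )
    return {"accuracy": performance.accuracy, "auc": performance.auc}


# Stand-in object (an assumption for illustration, not a Khiops class): it
# mimics the attribute chain and reuses the Iris values printed above.
fake_performance = SimpleNamespace(accuracy=0.96, auc=0.9914)
fake_eval = SimpleNamespace(get_snb_performance=lambda: fake_performance)
fake_khc = SimpleNamespace(
    model_report_=SimpleNamespace(train_evaluation_report=fake_eval)
)

metrics = train_metrics(fake_khc)
print(metrics)
```

With the estimator trained in this lesson, the call would be train_metrics(khc_iris).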
The PredictorPerformance object also has a confusion_matrix attribute:
iris_classes = iris_train_performance.confusion_matrix.values
iris_confusion_matrix = pd.DataFrame(
    iris_train_performance.confusion_matrix.matrix,
    columns=iris_classes,
    index=iris_classes,
)
print("Iris train confusion matrix:")
iris_confusion_matrix
Iris train confusion matrix:
Iris-setosa Iris-versicolor Iris-virginica
Iris-setosa 50 0 0
Iris-versicolor 0 49 5
Iris-virginica 0 1 45
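The accuracy reported earlier can be recovered from this matrix: it is the sum of the diagonal (the correct predictions) divided by the total number of records. The sketch below copies the displayed values into a plain Python list; the variable names are illustrative. Since every column sums to 50, the number of records per class, the columns appear to hold the true classes and the rows the predicted ones, but this orientation is inferred from the numbers and is not stated in the lesson.

```python
# Values copied from the Iris train confusion matrix displayed above
matrix = [
    [50, 0, 0],
    [0, 49, 5],
    [0, 1, 45],
]

# Correct predictions sit on the diagonal
correct = sum(matrix[i][i] for i in range(len(matrix)))
total = sum(sum(row) for row in matrix)
accuracy = correct / total

# Column sums: one value per class
column_sums = [sum(row[j] for row in matrix) for j in range(len(matrix))]

print(f"accuracy = {correct}/{total} = {accuracy}")
print(f"column sums = {column_sums}")
```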
To further explore the results we can see the report with the Khiops Visualization app:
# To visualize uncomment the lines below
# khc_iris.export_report_file("./iris_report.khj")
# kh.visualize_report("./iris_report.khj")
Exercise¶
Access the adult modeling report and print its type¶
adult_results = khc_adult.model_report_
type(adult_results)
khiops.core.analysis_results.AnalysisResults
Save the evaluation report of the Adult classification into the variable adult_train_eval¶
adult_train_eval = adult_results.train_evaluation_report
Show the model’s train accuracy, AUC and confusion matrix¶
adult_train_performance = adult_train_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult train AUC : {adult_train_performance.auc}")
adult_classes = adult_train_performance.confusion_matrix.values
adult_confusion_matrix = pd.DataFrame(
    adult_train_performance.confusion_matrix.matrix,
    columns=adult_classes,
    index=adult_classes,
)
print("Adult train confusion matrix:")
adult_confusion_matrix
Adult train accuracy: 0.869334
Adult train AUC : 0.925553
Adult train confusion matrix:
less more
less 35197 4424
more 1958 7263
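To put the accuracy in context, it helps to compare it with a majority-class baseline and to look at the minority class separately. The sketch below copies the matrix above into plain Python and assumes that rows are predicted classes and columns are true classes; that orientation is an assumption, not something stated in this lesson. Variable names are illustrative.

```python
# Values copied from the Adult train confusion matrix displayed above.
# Assumption: rows are predicted classes, columns are true classes.
classes = ["less", "more"]
matrix = [
    [35197, 4424],
    [1958, 7263],
]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Number of records per true class (column sums under the assumption above)
true_counts = [matrix[0][j] + matrix[1][j] for j in range(2)]

# Accuracy of always predicting the most frequent class
baseline_accuracy = max(true_counts) / total

# Share of true "more" records that the model labels as "more"
recall_more = matrix[1][1] / true_counts[1]

print(f"accuracy          = {accuracy:.6f}")
print(f"baseline accuracy = {baseline_accuracy:.6f}")
print(f"recall for 'more' = {recall_more:.6f}")
```

If the orientation were the other way round, the last quantity would be the precision for "more" and the baseline would be computed from the row sums; the accuracy is the same in both cases.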
Deploying a Classifier¶
We are now going to deploy the Iris classifier khc_iris, which we have
just trained, on the same dataset (normally we would do this on new
data).
The learned classifier can be deployed in two different ways:
- to predict a class, using the predict method of the model;
- to predict class probabilities, using the predict_proba method of the model.
Let’s first predict the Iris labels:
iris_predictions = khc_iris.predict(X_iris_train)
print("Iris model predictions (first 10 values):")
iris_predictions[:10]
Iris model predictions (first 10 values):
array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa'], dtype='<U15')
Let’s now predict the probabilities for each Iris type. Note that the
column order of this matrix is given by the estimator attribute
classes_:
iris_probas = khc_iris.predict_proba(X_iris_train)
print(f"Iris classes {khc_iris.classes_}")
print("Iris model probabilities for each class (first 10 rows):")
iris_probas[:10]
Iris classes ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']
Iris model probabilities for each class (first 10 rows):
array([[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729],
[0.99730542, 0.00134729, 0.00134729]])
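Because the columns of predict_proba follow the order of classes_, pairing the two gives a labelled view of each row. The sketch below does this in plain Python with the values printed above, copied by hand so that it runs without Khiops; the variable names are illustrative.

```python
# Values copied from the output above
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
first_row = [0.99730542, 0.00134729, 0.00134729]

# Pair each probability with its class, following the column order
labelled = dict(zip(classes, first_row))
for name, proba in labelled.items():
    print(f"{name:<16} {proba:.8f}")

# The probabilities of one row should add up to 1 (up to rounding)
row_sum = sum(first_row)
print(f"sum = {row_sum:.8f}")
```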
Exercise¶
Use the predict and predict_proba methods to deploy the Adult model khc_adult¶
Which columns are deployed in each case?
adult_predictions = khc_adult.predict(X_adult_train)
print("Adult model predictions (first 10 values):")
display(adult_predictions[:10])
adult_probas = khc_adult.predict_proba(X_adult_train)
print(f"Adult classes {khc_adult.classes_}")
print("Adult model probabilities for each class (first 10 rows):")
display(adult_probas[:10])
Adult model predictions (first 10 values):
array(['less', 'more', 'less', 'less', 'less', 'more', 'less', 'more',
'more', 'more'], dtype='<U4')
Adult classes ['less' 'more']
Adult model probabilities for each class (first 10 rows):
array([[9.99994845e-01, 5.15479465e-06],
[4.05868754e-01, 5.94131246e-01],
[9.61770510e-01, 3.82294902e-02],
[9.12629478e-01, 8.73705223e-02],
[5.62226618e-01, 4.37773382e-01],
[2.32734078e-01, 7.67265922e-01],
[9.93356522e-01, 6.64347792e-03],
[4.24222870e-01, 5.75777130e-01],
[1.79285954e-03, 9.98207141e-01],
[5.17299187e-06, 9.99994827e-01]])
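For scikit-learn classifiers, the predicted label usually corresponds to the class with the highest probability. The sketch below checks this on the ten rows displayed above, copied by hand into plain Python; whether KhiopsClassifier always follows this rule is an assumption that the displayed output supports but does not prove. Variable names are illustrative.

```python
# Values copied from the outputs above
classes = ["less", "more"]
probas = [
    [9.99994845e-01, 5.15479465e-06],
    [4.05868754e-01, 5.94131246e-01],
    [9.61770510e-01, 3.82294902e-02],
    [9.12629478e-01, 8.73705223e-02],
    [5.62226618e-01, 4.37773382e-01],
    [2.32734078e-01, 7.67265922e-01],
    [9.93356522e-01, 6.64347792e-03],
    [4.24222870e-01, 5.75777130e-01],
    [1.79285954e-03, 9.98207141e-01],
    [5.17299187e-06, 9.99994827e-01],
]
displayed_predictions = [
    "less", "more", "less", "less", "less",
    "more", "less", "more", "more", "more",
]

# For each row, pick the class whose column holds the largest probability
argmax_predictions = [
    classes[max(range(len(row)), key=row.__getitem__)] for row in probas
]

print(argmax_predictions == displayed_predictions)
```

On the NumPy arrays used in this lesson, the same lookup could be written as khc_adult.classes_[adult_probas.argmax(axis=1)].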
Open the training report with the Khiops Visualization app¶
# To visualize uncomment the lines below
# khc_adult.export_report_file("./adult_report.khj")
# kh.visualize_report("./adult_report.khj")