Core Basics 1: Train, Evaluate and Deploy a Classifier

In this lesson we will learn how to train, evaluate and deploy classifiers with Khiops.

We start by importing Khiops and some helper functions, and by saving the location of the Khiops samples directory into a variable:

from os import path

from khiops import core as kh
from helper_functions import explorer_open, peek

samples_dir = kh.get_samples_dir()
print(f"Khiops samples directory located at {samples_dir}")
Khiops samples directory located at /github/home/khiops_data/samples

Training a Classifier

We’ll train a classifier for the Iris dataset. This is a classic dataset describing plants of the genus Iris. It contains 150 records, 50 for each of three variants of Iris: Setosa, Virginica and Versicolor. Each record contains the length and width of the plant’s petal and sepal. The standard task for this dataset is to build a classifier that predicts the type of Iris from these length and width measurements.

Now to train a classifier with Khiops we use two types of files:
- A plain-text delimited data file (for example a csv file)
- A dictionary file which describes the schema of the above data table (.kdic file extension)

Let’s save into variables the locations of these files for the Iris dataset and then take a look at their contents:

iris_kdic = path.join(samples_dir, "Iris", "Iris.kdic")
iris_data_file = path.join(samples_dir, "Iris", "Iris.txt")

print("")
print(f"Iris dictionary file location: {iris_kdic}")
print("")
peek(iris_kdic)

print("")
print("")
print(f"Iris data location: {iris_data_file}")
print("")
peek(iris_data_file)
Iris dictionary file location: /github/home/khiops_data/samples/Iris/Iris.kdic


Dictionary  Iris
{
    Numerical       SepalLength     ;
    Numerical       SepalWidth      ;
    Numerical       PetalLength     ;
    Numerical       PetalWidth      ;
    Categorical     Class   ;
};


Iris data location: /github/home/khiops_data/samples/Iris/Iris.txt

SepalLength SepalWidth      PetalLength     PetalWidth      Class
5.1 3.5     1.4     0.2     Iris-setosa
4.9 3.0     1.4     0.2     Iris-setosa
4.7 3.2     1.3     0.2     Iris-setosa
4.6 3.1     1.5     0.2     Iris-setosa
5.0 3.6     1.4     0.2     Iris-setosa
5.4 3.9     1.7     0.4     Iris-setosa
4.6 3.4     1.4     0.3     Iris-setosa
5.0 3.4     1.5     0.2     Iris-setosa
4.4 2.9     1.4     0.2     Iris-setosa
4.9 3.1     1.5     0.1     Iris-setosa
...
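
To make the relationship between the two files concrete, here is a minimal sketch that checks that the data file header lists the same variables as the dictionary. It does not use Khiops: `parse_simple_kdic` is a hypothetical helper that only understands the plain `Type Name ;` declarations shown above, and the tab-separated header is an assumption about how the data file is delimited.

```python
import re

# Dictionary text copied from the Iris.kdic contents shown above.
KDIC_TEXT = """
Dictionary  Iris
{
    Numerical       SepalLength     ;
    Numerical       SepalWidth      ;
    Numerical       PetalLength     ;
    Numerical       PetalWidth      ;
    Categorical     Class   ;
};
"""

# Header line of the data file, assumed here to be tab-separated.
DATA_HEADER = "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass"

# Matches only plain declarations such as "Numerical  SepalLength  ;".
VARIABLE_LINE = re.compile(r"^\s*(Numerical|Categorical)\s+(\w+)\s*;")


def parse_simple_kdic(text):
    """Return (name, type) pairs for the plain variable declarations in text."""
    variables = []
    for line in text.splitlines():
        match = VARIABLE_LINE.match(line)
        if match:
            var_type, name = match.groups()
            variables.append((name, var_type))
    return variables


variables = parse_simple_kdic(KDIC_TEXT)
declared_names = [name for name, _ in variables]
header_columns = DATA_HEADER.split("\t")

print(variables)
print("Header matches dictionary:", header_columns == declared_names)
```

Real dictionary files can contain derived variables, metadata and other constructs that this sketch ignores, so it is only meant for reading the simple example above.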

Note that the Iris variant information is in the column Class. Now let’s specify a directory in which to save our results:

iris_results_dir = path.join("exercises", "Iris")
print(f"Iris results directory: {iris_results_dir}")
Iris results directory: exercises/Iris

We are now ready to train the classifier with the Khiops function train_predictor. This function returns a tuple containing the locations of two files:
- the modeling report (AllReports.khj): a JSON file containing information such as the informativeness of each variable, the variables selected for the model and performance metrics.
- the model’s dictionary file (Modeling.kdic): an enriched version of the initial dictionary file that contains the model. It can be used to make predictions on new data.

iris_report, iris_model_kdic = kh.train_predictor(
    iris_kdic,
    dictionary_name="Iris",
    data_table_path=iris_data_file,
    target_variable="Class",
    results_dir=iris_results_dir,
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"Iris report file located at: {iris_report}")
print(f"Iris modeling dictionary file located at: {iris_model_kdic}")
Iris report file located at: exercises/Iris/AllReports.khj
Iris modeling dictionary file located at: exercises/Iris/Modeling.kdic
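
Since the report is a JSON file, it can also be opened with Python's standard json module for a quick look. The sketch below lists the top-level keys of a JSON file; `top_level_keys` is a hypothetical helper, the UTF-8 encoding is an assumption, and the demo file uses made-up key names so that the example runs without Khiops. It says nothing about the actual layout of AllReports.khj.

```python
import json
import os
import tempfile


def top_level_keys(json_path):
    """Return the sorted top-level keys of a JSON file holding an object."""
    with open(json_path, encoding="utf-8") as json_file:
        content = json.load(json_file)
    return sorted(content)


# Stand-in file with made-up keys (not the real report structure).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_report.json")
    with open(demo_path, "w", encoding="utf-8") as demo_file:
        json.dump({"hypotheticalSectionB": {}, "hypotheticalSectionA": {}}, demo_file)
    demo_keys = top_level_keys(demo_path)

print(demo_keys)
```

With the training cell above executed, calling `top_level_keys(iris_report)` should list the sections of the real report. For anything beyond a quick look, the Khiops reader function introduced later in this lesson is the better tool.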

You can verify that the result files were created in iris_results_dir. In the next sections, we’ll use the file at iris_report to assess the model’s performance and the file at iris_model_kdic to deploy it.

# To take a look at the directory where the resulting files are stored
# explorer_open(iris_results_dir)

Exercise

We’ll repeat the examples in this notebook with the Adult dataset. It contains characteristics of the adult population in the USA, such as age, gender and education. The task is to predict the variable class, which indicates whether the individual earns more or less than 50,000 dollars.

Let’s start by putting into variables the paths for the Adult dataset:

adult_kdic = path.join(samples_dir, "Adult", "Adult.kdic")
adult_data_file = path.join(samples_dir, "Adult", "Adult.txt")

Train a classifier for the Adult database

Note that the name of the target variable is class (in lower case!). Do not forget to set max_trees=0. Save the resulting file locations into the variables adult_report and adult_model_kdic and print them.

adult_results_dir = path.join("exercises", "Adult")  # results directory for the Adult exercise
adult_report, adult_model_kdic = kh.train_predictor(
    adult_kdic,
    dictionary_name="Adult",
    data_table_path=adult_data_file,
    target_variable="class",
    results_dir=adult_results_dir,
    max_trees=0,
)
print(f"Adult report file located at: {adult_report}")
print(f"Adult modeling dictionary file located at: {adult_model_kdic}")
Adult report file located at: exercises/Adult/AllReports.khj
Adult modeling dictionary file located at: exercises/Adult/Modeling.kdic

Accessing a Classifier’s Basic Evaluation Metrics

We access the classifier’s evaluation metrics by loading the file at iris_report with the Khiops function read_analysis_results_file:

iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))
<class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the AnalysisResults class. The model evaluation reports are stored in its train_evaluation_report and test_evaluation_report attributes, which are of class EvaluationReport:

iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))
<class 'khiops.core.analysis_results.EvaluationReport'>
<class 'khiops.core.analysis_results.EvaluationReport'>

We access the default predictor’s metrics with the get_snb_performance method of the evaluation report objects:

iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()

These objects are of class PredictorPerformance and have accuracy and auc attributes for these metrics:

print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy:  {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC:  {iris_test_performance.auc}")
Iris train accuracy: 0.980952
Iris test accuracy:  0.955556

Iris train AUC: 0.997868
Iris test AUC:  0.984362
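
As a sanity check on how to read these numbers, recall that accuracy is the fraction of records classified correctly. The reported values are consistent with 103 correct predictions out of 105 training records and 43 out of 45 test records. These counts are inferred from the printed accuracies and from the 150 records in the dataset; they are not read from the report.

```python
def accuracy(n_correct, n_total):
    """Fraction of correctly classified records."""
    return n_correct / n_total


# Counts inferred from the printed accuracies (an assumption, see above).
train_accuracy = round(accuracy(103, 105), 6)
test_accuracy = round(accuracy(43, 45), 6)

print(train_accuracy)  # 0.980952
print(test_accuracy)   # 0.955556
```

Under this reading, each misclassified test record moves the test accuracy by about 2.2 points, so the gap between train and test accuracy here amounts to very few records.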

Exercise

Read the contents of the file at adult_report for the Adult analysis and show its type.

adult_results = kh.read_analysis_results_file(adult_report)
type(adult_results)
khiops.core.analysis_results.AnalysisResults

Save the evaluation reports of the Adult classification to the variables adult_train_eval and adult_test_eval.

adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report

Show the model’s train and test accuracies and AUCs.

adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy:  {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC:  {adult_test_performance.auc}")
Adult train accuracy: 0.869295
Adult test accuracy:  0.865714

Adult train AUC: 0.926145
Adult test AUC:  0.921665

Deploying a Classifier

We are going to deploy the Iris classifier we have just trained, applying it to the same dataset it was trained on (normally we would deploy it on new data). We saved the model in the file at iris_model_kdic. This file is usually large and hard to read, so you should know what you are doing before editing it. Just this time, let’s take a quick look at its contents:

peek(iris_model_kdic, 25)
#Khiops 10.2.2

Dictionary  SNB_Iris
<InitialDictionary="Iris"> <PredictorLabel="Selective Naive Bayes"> <PredictorTy ...
{
Unused      Numerical       SepalLength             ; <Cost=1.38629> <Level=0.331855>
Unused      Numerical       SepalWidth              ; <Cost=1.38629> <Level=0.116679>
Unused      Numerical       PetalLength             ; <Cost=1.38629> <Importance=0.46748> <Level=0.621 ...
Unused      Numerical       PetalWidth              ; <Cost=1.38629> <Importance=0.538587> <Level=0.663 ...
Unused      Categorical     Class           ; <TargetVariable>
Unused      Structure(DataGrid)     VClass   = DataGrid(ValueSetC("Iris-setosa", "Iris-ver ...
Unused      Structure(DataGrid)     PPetalLength     = DataGrid(IntervalBounds(3.15, 4.75, 5 ...
Unused      Structure(DataGrid)     PPetalWidth      = DataGrid(IntervalBounds(0.75, 1.75), V ...
Unused      Structure(Classifier)   SNBClass         = SNBClassifier(Vector(0.3515625, 0.4375) ...
    Categorical     PredictedClass   = TargetValue(SNBClass)        ; <Prediction>
Unused      Numerical       ScoreClass       = TargetProb(SNBClass) ; <Score>
    Numerical       ProbClassIris-setosa   = TargetProbAt(SNBClass, "Iris-setosa")        ; <Ta ...
    Numerical       ProbClassIris-versicolor       = TargetProbAt(SNBClass, "Iris-versicolor ...
    Numerical       ProbClassIris-virginica        = TargetProbAt(SNBClass, "Iris-virginica") ...
};

Note that the modeling dictionary contains 4 used variables (those not marked Unused):
- PredictedClass: The class with the highest probability according to the model
- ProbClassIris-setosa, ProbClassIris-versicolor, ProbClassIris-virginica: The probabilities of each class according to the model

These will be the columns of the output table when deploying the model:

iris_deployment_file = path.join(iris_results_dir, "iris_deployment.txt")
kh.deploy_model(
    iris_model_kdic,
    dictionary_name="SNB_Iris",
    data_table_path=iris_data_file,
    output_data_table_path=iris_deployment_file,
)

peek(iris_deployment_file)
PredictedClass      ProbClassIris-setosa    ProbClassIris-versicolor        ProbClassIris-virgi ...
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
Iris-setosa 0.9884494887    0.008598869265  0.002951642068
...
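
The relationship between the deployed columns can be checked by hand on one output row: the predicted class is the one with the highest probability, and the class probabilities add up to 1 (up to rounding). The sketch below uses the values of the first row shown above and plain Python only.

```python
# Probabilities from the first deployed row, keyed by class name.
probabilities = {
    "Iris-setosa": 0.9884494887,
    "Iris-versicolor": 0.008598869265,
    "Iris-virginica": 0.002951642068,
}

# The predicted class is the key with the largest probability.
predicted_class = max(probabilities, key=probabilities.get)

# The probabilities should sum to 1, up to the printed precision.
total = sum(probabilities.values())
sums_to_one = abs(total - 1.0) < 1e-6

print(predicted_class)
print(sums_to_one)
```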

Exercise

Use the deploy_model function to deploy the model stored in the file at adult_model_kdic.

Which columns are deployed?

adult_deployment_file = path.join(adult_results_dir, "adult_deployment.txt")
kh.deploy_model(
    adult_model_kdic,
    dictionary_name="SNB_Adult",
    data_table_path=adult_data_file,
    output_data_table_path=adult_deployment_file,
)
peek(adult_deployment_file)
Predictedclass      Probclassless   Probclassmore
less        0.9999926658    7.33418716e-06
more        0.4122763795    0.5877236205
less        0.9624691952    0.03753080482
less        0.9158716208    0.08412837917
less        0.5717571015    0.4282428985
more        0.2594836411    0.7405163589
less        0.9939376151    0.006062384897
more        0.4223655109    0.5776344891
more        0.001798128     0.998201872
more        7.347401589e-06 0.9999926526
...
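
One way to answer the question about deployed columns programmatically is to read the header of the output table. The sketch below parses the ten rows shown above with the standard csv module, assuming the deployed file is tab-separated, and also counts the predicted classes. The sample is embedded as a string so the example runs on its own.

```python
import csv
import io
from collections import Counter

# The deployed rows shown above, assumed to be tab-separated.
SAMPLE = "\n".join(
    [
        "Predictedclass\tProbclassless\tProbclassmore",
        "less\t0.9999926658\t7.33418716e-06",
        "more\t0.4122763795\t0.5877236205",
        "less\t0.9624691952\t0.03753080482",
        "less\t0.9158716208\t0.08412837917",
        "less\t0.5717571015\t0.4282428985",
        "more\t0.2594836411\t0.7405163589",
        "less\t0.9939376151\t0.006062384897",
        "more\t0.4223655109\t0.5776344891",
        "more\t0.001798128\t0.998201872",
        "more\t7.347401589e-06\t0.9999926526",
    ]
)

reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
columns = reader.fieldnames  # column names come from the header line
rows = list(reader)

# Count how often each class is predicted in the sample.
counts = Counter(row["Predictedclass"] for row in rows)

# Each prediction should match the larger of the two probabilities.
consistent = all(
    (row["Predictedclass"] == "more")
    == (float(row["Probclassmore"]) > float(row["Probclassless"]))
    for row in rows
)

print(columns)
print(dict(counts))
print(consistent)
```

To run the same check on the real output, replace `io.StringIO(SAMPLE)` with the file opened at adult_deployment_file.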