Sklearn Basics 2: Train a Classifier on a Star Multi-Table Dataset

In this notebook, we will learn how to train a classifier on a multi-table dataset composed of two tables (a root table and a secondary table). It is highly recommended to complete the Sklearn Basics 1 lesson first if you are not familiar with Khiops’ sklearn estimators.

We start by importing the necessary modules, including the sklearn estimator KhiopsClassifier:

import os
import pandas as pd
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
from khiops.utils.helpers import train_test_split_dataset
from sklearn import metrics

# If there are any issues, you may check the Khiops status with the following command
# kh.get_runner().print_status()

Training a Multi-Table Classifier

We’ll train a “sarcasm detector” using the dataset HeadlineSarcasm. In its raw form, the dataset contains a list of text headlines paired with a label that indicates whether its source is a sarcastic site (such as The Onion) or not.

We have transformed this dataset into two tables such that the text-label record

"groundbreaking study finds gratification can be deliberately postponed"    yes

is transformed to an entry in a table that contains (id, label) records

97 yes

and various entries in a secondary table linking a headline id to its words and positions

97  0   groundbreaking
97  1   study
97  2   finds
97  3   gratification
97  4   can
97  5   be
97  6   deliberately
97  7   postponed

Thus, the HeadlineSarcasm dataset has the following multi-table schema

+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
     |        |Position     |
     +-1:n--->|Word         |
              +-------------+

The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).

Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only intended for pedagogical purposes.

To train the KhiopsClassifier for this setup, we must specify a multi-table dataset. Let’s first check the content of the created tables:

  • The main table Headline

  • The secondary table HeadlineWords

sarcasm_dataset_dir = os.path.join("data", "HeadlineSarcasm")
headlines_file = os.path.join(sarcasm_dataset_dir, "Headlines.txt")
headlines_df = pd.read_csv(headlines_file, sep="\t")
print("Headlines table (first 10 rows)")
display(headlines_df.head(10))

headlines_words_file = os.path.join(sarcasm_dataset_dir, "HeadlineWords.txt")
headlines_words_df = pd.read_csv(headlines_words_file, sep="\t")
print("HeadlineWords table (first 10 rows)")
display(headlines_words_df.head(10))
Headlines table (first 10 rows)
   HeadlineId IsSarcasm
0           0       yes
1           1        no
2          10        no
3         100       yes
4        1000       yes
5       10000        no
6       10001       yes
7       10002        no
8       10003       yes
9       10004        no
HeadlineWords table (first 10 rows)
   HeadlineId  Position             Word
0           0         0  thirtysomething
1           0         1       scientists
2           0         2           unveil
3           0         3         doomsday
4           0         4            clock
5           0         5               of
6           0         6             hair
7           0         7             loss
8           1         0              dem
9           1         1             rep.

Before training the classifier, we split the main table into a feature matrix (only the HeadlineId column) and a target vector containing the labels (the IsSarcasm column).

headlines_main_df = headlines_df.drop("IsSarcasm", axis=1)
y_sarcasm = headlines_df["IsSarcasm"]

You may note that the feature matrix does not contain any features, but do not worry: the Khiops AutoML engine will automatically create features by aggregating the columns of HeadlineWords for each headline (more details about this below).
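To build an intuition for this feature construction, here is a minimal pandas sketch of the kind of aggregates that can be derived from a secondary table. This is a hypothetical illustration on a toy extract of HeadlineWords, not Khiops’ actual implementation:

```python
import pandas as pd

# Toy extract of the HeadlineWords secondary table
words_df = pd.DataFrame({
    "HeadlineId": [97, 97, 97, 98, 98],
    "Position": [0, 1, 2, 0, 1],
    "Word": ["groundbreaking", "study", "finds", "dem", "rep."],
})

# Aggregate the secondary table per headline: one row per HeadlineId,
# each column being a feature usable by a single-table classifier
aggregates = words_df.groupby("HeadlineId").agg(
    word_count=("Word", "size"),          # number of words in the headline
    distinct_words=("Word", "nunique"),   # number of different words
)
print(aggregates)
```

Khiops explores a large space of such aggregates automatically, so you do not have to engineer them by hand.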

Moreover, instead of passing an X table to the fit method, we pass a multi-table dataset specification, which is a dictionary with the following format:

X = {
   "main_table": <name of the main table>,
   "tables" : {
       <name of table 1>: (<dataframe of table 1>, <key column names of table 1>),
       <name of table 2>: (<dataframe of table 2>, <key column names of table 2>),
       ...
    }
}

Note that the key of each table is specified either as a single column name or as a tuple (or list) of the column names composing the key.
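For instance, a dataset whose secondary table is keyed by two columns could be specified as follows. The table and column names here (customers, orders) are made up for illustration:

```python
import pandas as pd

# Hypothetical main and secondary tables
customers_df = pd.DataFrame({"CustomerId": [1, 2], "Age": [34, 51]})
orders_df = pd.DataFrame({
    "CustomerId": [1, 1, 2],
    "OrderId": ["A", "B", "A"],
    "Amount": [10.0, 25.0, 7.5],
})

# The main table has a single-column key; the secondary table a composite key
X = {
    "main_table": "customers",
    "tables": {
        "customers": (customers_df, "CustomerId"),
        "orders": (orders_df, ("CustomerId", "OrderId")),
    },
}
```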

So for our HeadlineSarcasm case, we specify the dataset as:

X_sarcasm = {
    "main_table": "headlines",
    "tables": {
        "headlines": (headlines_main_df, "HeadlineId"),
        "headline_words": (headlines_words_df, "HeadlineId"),
    },
}

To separate this dataset into train and test, we use the khiops-python helper function train_test_split_dataset. This function can split dict dataset specifications:

(
    X_sarcasm_train,
    X_sarcasm_test,
    y_sarcasm_train,
    y_sarcasm_test,
) = train_test_split_dataset(X_sarcasm, y_sarcasm)

The call to the KhiopsClassifier fit method is very similar to the single-table case, but this time we specify the additional parameter n_features, which is the number of aggregates that the Khiops AutoML engine will construct and analyze during training. Some examples of the features it will create for HeadlineSarcasm are:

  • Number of different words in the headline

  • Most common word in the headline

  • Number of times the word ‘the’ appears

  • …

The Khiops AutoML engine will then evaluate, select, and combine these features to build a classifier. Here we request 1000 features (the default is 100):

Note: By default Khiops builds 10 decision tree features. This is not necessary for this tutorial so we set ``n_trees=0``

khc_sarcasm = KhiopsClassifier(n_features=1000, n_trees=0)
khc_sarcasm.fit(X_sarcasm_train, y_sarcasm_train)
KhiopsClassifier(n_features=1000, n_trees=0)

We quickly check its train accuracy and AUC as in the previous tutorial:

sarcasm_train_performance = (
    khc_sarcasm.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"HeadlineSarcasm train accuracy: {sarcasm_train_performance.accuracy}")
print(f"HeadlineSarcasm train auc     : {sarcasm_train_performance.auc}")
HeadlineSarcasm train accuracy: 0.863958
HeadlineSarcasm train auc     : 0.941691

Now, we use our sarcasm classifier to obtain predictions and probabilities on the test data:

y_sarcasm_test_predicted = khc_sarcasm.predict(X_sarcasm_test)
probas_sarcasm_test = khc_sarcasm.predict_proba(X_sarcasm_test)

print("HeadlineSarcasm test predictions (first 10 values):")
display(y_sarcasm_test_predicted[:10])
print("HeadlineSarcasm test prediction probabilities (first 10 values):")
display(probas_sarcasm_test[:10])
HeadlineSarcasm test predictions (first 10 values):
array(['no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes'],
      dtype='<U3')
HeadlineSarcasm test prediction probabilities (first 10 values):
array([[0.95532504, 0.04467496],
       [0.1746559 , 0.8253441 ],
       [0.06993393, 0.93006607],
       [0.82820742, 0.17179258],
       [0.30951808, 0.69048192],
       [0.80932336, 0.19067664],
       [0.70963749, 0.29036251],
       [0.07363783, 0.92636217],
       [0.86032515, 0.13967485],
       [0.0613004 , 0.9386996 ]])

Finally, we estimate the accuracy and AUC on the test data:

sarcasm_test_accuracy = metrics.accuracy_score(y_sarcasm_test, y_sarcasm_test_predicted)
sarcasm_test_auc = metrics.roc_auc_score(y_sarcasm_test, probas_sarcasm_test[:, 1])

print(f"Sarcasm test accuracy: {sarcasm_test_accuracy}")
print(f"Sarcasm test auc     : {sarcasm_test_auc}")
Sarcasm test accuracy: 0.8178895877009085
Sarcasm test auc     : 0.9063618657053968
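A word of caution on the probas_sarcasm_test[:, 1] indexing above: roc_auc_score expects the scores of the positive class, and sklearn-style estimators order the columns of predict_proba by sorted class labels, so column 1 holds P(yes) here. The toy check below illustrates this convention with made-up labels and probabilities:

```python
import numpy as np
from sklearn import metrics

# Toy binary problem with string labels; "yes" is the positive class
y_true = np.array(["no", "yes", "yes", "no"])

# Columns ordered like sorted(["no", "yes"]), so column 1 = P(yes)
probas = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.3, 0.7],
    [0.6, 0.4],
])

# roc_auc_score treats the lexicographically greater label ("yes") as positive
auc = metrics.roc_auc_score(y_true, probas[:, 1])
print(auc)  # 1.0 here: every "yes" outranks every "no"
```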

To further explore the results, we can inspect the report with the Khiops Visualization app:

# To visualize uncomment the lines below
# khc_sarcasm.export_report_file("./sarcasm_report.khj")
# kh.visualize_report("./sarcasm_report.khj")

Exercise

Repeat the previous steps with the AccidentsSummary dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:

+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           | +---------------+
|Light          | |Vehicles       |
|Department     | +---------------+
|Commune        | |AccidentId*    |
|InAgglomeration| |VehicleId*     |
|...            | |Direction      |
+---------------+ |Category       |
       |          |PassengerNumber|
       +---1:n--->|...            |
                  +---------------+

For each accident, we have both its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (its Direction or PassengerNumber). We first load the tables of the AccidentsSummary into dataframes:

accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "AccidentsSummary")

accidents_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
accidents_df = pd.read_csv(accidents_file, sep="\t", encoding="latin1")
print("Accidents dataframe (first 10 rows):")
display(accidents_df.head(10))
print()

vehicles_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
vehicles_df = pd.read_csv(vehicles_file, sep="\t", encoding="latin1")
print("Vehicles dataframe (first 10 rows):")
display(vehicles_df.head(10))
Accidents dataframe (first 10 rows):
     AccidentId    Gravity        Date      Hour               Light
0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight
1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight
2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight
3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight
4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight
5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn
6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn
7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn
8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight
9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight

   Department  Commune InAgglomeration IntersectionType    Weather
0         590        5              No           Y-type     Normal
1         590       11             Yes           Square   VeryGood
2         590      477             Yes           T-type     Normal
3         590       52             Yes   NoIntersection   VeryGood
4         590      477             Yes   NoIntersection     Normal
5         590       52             Yes   NoIntersection  LightRain
6         590      133             Yes   NoIntersection     Normal
7         590       11             Yes   NoIntersection     Normal
8         590      550              No   NoIntersection     Normal
9         590       51             Yes           X-type     Normal

                      CollisionType               PostalAddress
0  2Vehicles-BehindVehicles-Frontal      route des Ansereuilles
1                       NoCollision  Place du général de Gaul
2                       NoCollision              Rue  nationale
3                    2Vehicles-Side         30 rue Jules Guesde
4                    2Vehicles-Side          72 rue Victor Hugo
5                             Other                         D39
6                             Other          4 route de camphin
7                             Other          rue saint exupéry
8                             Other          rue de l'égalité
9  2Vehicles-BehindVehicles-Frontal     face au 59 rue de Lille
Vehicles dataframe (first 10 rows):
     AccidentId VehicleId Direction          Category  PassengerNumber
0  201800000001       A01   Unknown         Car<=3.5T                0
1  201800000001       B01   Unknown         Car<=3.5T                0
2  201800000002       A01   Unknown         Car<=3.5T                0
3  201800000003       A01   Unknown  Motorbike>125cm3                0
4  201800000003       B01   Unknown         Car<=3.5T                0
5  201800000003       C01   Unknown         Car<=3.5T                0
6  201800000004       A01   Unknown         Car<=3.5T                0
7  201800000004       B01   Unknown           Bicycle                0
8  201800000005       A01   Unknown             Moped                0
9  201800000005       B01   Unknown         Car<=3.5T                0

       FixedObstacle MobileObstacle ImpactPoint           Maneuver
0                NaN        Vehicle  RightFront         TurnToLeft
1                NaN        Vehicle   LeftFront  NoDirectionChange
2                NaN     Pedestrian         NaN  NoDirectionChange
3  StationaryVehicle        Vehicle       Front  NoDirectionChange
4                NaN        Vehicle    LeftSide         TurnToLeft
5                NaN            NaN   RightSide             Parked
6                NaN          Other  RightFront          Avoidance
7                NaN        Vehicle    LeftSide                NaN
8                NaN        Vehicle  RightFront           PassLeft
9                NaN        Vehicle   LeftFront               Park

Create the main feature matrix and the target vector for AccidentsSummary

Note that the target variable is Gravity.

accidents_main_df = accidents_df.drop("Gravity", axis=1)
y_accidents = accidents_df["Gravity"]
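Before training, it is worth checking the class distribution of Gravity: lethal accidents are rare, so a high raw accuracy can be misleading and the AUC is the more informative metric. The sketch below shows the check on a made-up target series; in practice, run value_counts on y_accidents itself:

```python
import pandas as pd

# Toy stand-in for y_accidents; the 90/10 split is illustrative only
y_toy = pd.Series(["NonLethal"] * 9 + ["Lethal"])

# Share of each class in the target
class_shares = y_toy.value_counts(normalize=True)
print(class_shares)
# A classifier that always predicts "NonLethal" would already
# reach 90% accuracy on this toy target
```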

Create the multi-table dataset specification

Note that the main table has a single key AccidentId, while the secondary table has a composite key made of AccidentId and VehicleId.

X_accidents = {
    "main_table": "accidents",
    "tables": {
        "accidents": (accidents_main_df, "AccidentId"),
        "vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
    },
}
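A composite key should uniquely identify the rows of its table. Here is a quick pandas sanity check, shown on a toy extract of Vehicles; in practice, run it on vehicles_df:

```python
import pandas as pd

# Toy extract of Vehicles: VehicleId alone repeats across accidents,
# but the pair (AccidentId, VehicleId) is unique
vehicles_toy = pd.DataFrame({
    "AccidentId": [201800000001, 201800000001, 201800000002],
    "VehicleId": ["A01", "B01", "A01"],
})

has_duplicates = vehicles_toy.duplicated(subset=["AccidentId", "VehicleId"]).any()
print(has_duplicates)  # False: the composite key identifies each row
```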

Split the dataset into train and test

(
    X_accidents_train,
    X_accidents_test,
    y_accidents_train,
    y_accidents_test,
) = train_test_split_dataset(X_accidents, y_accidents)

Train a classifier with this dataset

  • You may choose the number of features n_features to be created by the Khiops AutoML engine

  • Set the number of trees to zero (n_trees=0)

khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
khc_accidents.fit(X_accidents_train, y_accidents_train)
KhiopsClassifier(n_features=1000, n_trees=0)

Deploy the classifier to obtain predictions and probabilities on the test data

y_accidents_test_predicted = khc_accidents.predict(X_accidents_test)
probas_accidents_test = khc_accidents.predict_proba(X_accidents_test)

print("Accidents test predictions (first 10 values):")
display(y_accidents_test_predicted[:10])
print("Accidents test prediction probabilities (first 10 values):")
display(probas_accidents_test[:10])
Accidents test predictions (first 10 values):
array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
       'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
      dtype='<U9')
Accidents test prediction probabilities (first 10 values):
array([[0.06586758, 0.93413242],
       [0.0316421 , 0.9683579 ],
       [0.01812328, 0.98187672],
       [0.02759779, 0.97240221],
       [0.07860092, 0.92139908],
       [0.04881271, 0.95118729],
       [0.1313691 , 0.8686309 ],
       [0.02545221, 0.97454779],
       [0.00644633, 0.99355367],
       [0.03363112, 0.96636888]])

Obtain the accuracy and AUC on the test dataset

accidents_test_accuracy = metrics.accuracy_score(
    y_accidents_test, y_accidents_test_predicted
)
accidents_test_auc = metrics.roc_auc_score(
    y_accidents_test, probas_accidents_test[:, 1]
)

print(f"Accidents test accuracy: {accidents_test_accuracy}")
print(f"Accidents test auc     : {accidents_test_auc}")
Accidents test accuracy: 0.9455904748719368
Accidents test auc     : 0.8091979330822332

Explore the report with the Khiops Visualization App

# To visualize uncomment the lines below
# khc_accidents.export_report_file("./accidents_report.khj")
# kh.visualize_report("./accidents_report.khj")