Sklearn Basics 2: Train a Classifier on a Star Multi-Table Dataset

In this notebook, we will learn how to train a classifier on a multi-table dataset composed of two tables (a root table and a secondary table). It is highly recommended to complete the Sklearn Basics 1 lesson first if you are not familiar with Khiops’ sklearn estimators.

We start by importing pandas, the Khiops core module kh (used later to locate the Khiops samples directory) and the Khiops sklearn classifier KhiopsClassifier:

from os import path
import pandas as pd

from khiops import core as kh
from khiops.sklearn import KhiopsClassifier

Training a Multi-Table Classifier

We’ll train a “sarcasm detector” using the dataset HeadlineSarcasm. In its raw form, the dataset contains a list of text headlines paired with a label that indicates whether its source is a sarcastic site (such as The Onion) or not.

We have transformed this dataset into two tables, so that a text-label record such as

"groundbreaking study finds gratification can be deliberately postponed"    yes

becomes an entry in a main table that contains (id, label) records

97 yes

and various entries in a secondary table linking a headline id to its words and positions

97  0   groundbreaking
97  1   study
97  2   finds
97  3   gratification
97  4   can
97  5   be
97  6   deliberately
97  7   postponed

Thus, the HeadlineSarcasm dataset has the following multi-table schema

+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
     |        |Position     |
     +-1:n--->|Word         |
              +-------------+

The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).

Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only intended for pedagogical purposes.
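
For illustration, here is a minimal pandas sketch of how such a transformation could be done. The raw column names headline and is_sarcastic are hypothetical:

# Hypothetical raw table: one row per headline (column names are illustrative)
raw_df = pd.DataFrame({
    "headline": ["groundbreaking study finds gratification can be deliberately postponed"],
    "is_sarcastic": ["yes"],
})

# Main table: one (HeadlineId, IsSarcasm) record per headline
main_df = pd.DataFrame(
    {"HeadlineId": raw_df.index, "IsSarcasm": raw_df["is_sarcastic"]}
)

# Secondary table: one (HeadlineId, Position, Word) record per word
words_df = (
    raw_df["headline"]
    .str.split()
    .explode()
    .rename("Word")
    .reset_index()
    .rename(columns={"index": "HeadlineId"})
)
words_df["Position"] = words_df.groupby("HeadlineId").cumcount()
words_df = words_df[["HeadlineId", "Position", "Word"]]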

To train the KhiopsClassifier for this setup we must specify a multi-table dataset. Let’s first check the content of the created tables:

  • The main table Headline

  • The secondary table HeadlineWords

sarcasm_dataset_dir = path.join("data", "HeadlineSarcasm")
headlines_file = path.join(sarcasm_dataset_dir, "Headlines.txt")
headlines_df = pd.read_csv(headlines_file, sep="\t")
print("Headlines table (first 10 rows)")
display(headlines_df.head(10))

headlines_words_file = path.join(sarcasm_dataset_dir, "HeadlineWords.txt")
headlines_words_df = pd.read_csv(headlines_words_file, sep="\t")
print("HeadlineWords table (first 10 rows)")
display(headlines_words_df.head(10))
Headlines table (first 10 rows)
   HeadlineId IsSarcasm
0           0       yes
1           1        no
2          10        no
3         100       yes
4        1000       yes
5       10000        no
6       10001       yes
7       10002        no
8       10003       yes
9       10004        no
HeadlineWords table (first 10 rows)
   HeadlineId  Position             Word
0           0         0  thirtysomething
1           0         1       scientists
2           0         2           unveil
3           0         3         doomsday
4           0         4            clock
5           0         5               of
6           0         6             hair
7           0         7             loss
8           1         0              dem
9           1         1             rep.

Before training the classifier, we split the main table into a feature matrix (only the HeadlineId column) and a target vector containing the labels (the IsSarcasm column).

headlines_train_df = headlines_df.drop("IsSarcasm", axis=1)
y_sarcasm_train = headlines_df["IsSarcasm"]

You may note that the feature matrix does not contain any features, but do not worry: the Khiops AutoML engine will automatically create features by aggregating the columns of HeadlineWords for each headline (more details about this below).

Moreover, instead of passing an X table to the fit method, we pass a multi-table dataset specification, which is a dictionary with the following format:

X = {
   "main_table": <name of the main table>,
   "tables" : {
       <name of table 1>: (<dataframe of table 1>, <key column names of table 1>),
       <name of table 2>: (<dataframe of table 2>, <key column names of table 2>),
       ...
    }
}

Note that the key columns of each table are specified either as a single column name or as a tuple of the column names composing the key.
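
For example, a secondary table with a two-column key would be declared with a tuple. The table and column names below are purely illustrative (the AccidentsSummary exercise at the end of this notebook uses exactly this form):

X = {
    "main_table": "customers",
    "tables": {
        "customers": (customers_df, "CustomerId"),
        "purchases": (purchases_df, ("CustomerId", "PurchaseId")),
    },
}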

So for our HeadlineSarcasm case, we specify the dataset as:

X_sarcasm_train = {
    "main_table": "headlines",
    "tables": {
        "headlines": (headlines_train_df, "HeadlineId"),
        "headline_words": (headlines_words_df, "HeadlineId"),
    },
}

The call to the KhiopsClassifier fit method is very similar to the single-table case, but this time we specify the additional parameter n_features, which is the number of aggregates that the Khiops AutoML engine will construct and analyze during training. Some examples of the features it will create for HeadlineSarcasm are:

  • Number of different words in the headline

  • Most common word in the headline

  • Number of times the word ‘the’ appears

  • …

The Khiops AutoML engine will also evaluate, select and combine these features to build a classifier. Here we request 1000 features (the default is 100):

Note: By default Khiops builds 10 decision tree features. These are not necessary for this tutorial, so we set ``n_trees=0``.

khc_sarcasm = KhiopsClassifier(n_features=1000, n_trees=0)
khc_sarcasm.fit(X_sarcasm_train, y_sarcasm_train)
KhiopsClassifier(n_features=1000, n_trees=0)

We quickly check its train accuracy and AUC as in the previous tutorial:

sarcasm_train_performance = (
    khc_sarcasm.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"HeadlineSarcasm train accuracy: {sarcasm_train_performance.accuracy}")
print(f"HeadlineSarcasm train auc     : {sarcasm_train_performance.auc}")
HeadlineSarcasm train accuracy: 0.856808
HeadlineSarcasm train auc     : 0.93599

Now, we use our sarcasm classifier to obtain predictions on the training data. Normally, we would do this on new test data, which again requires a multi-table dataset specification; a sketch of how one could be built is shown below.
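
For illustration, here is a minimal sketch, assuming we hold out 20% of the main table as a test set. It relies on Khiops matching secondary-table records to the main table by key, so the full HeadlineWords table can be reused as-is:

from sklearn.model_selection import train_test_split

# Split only the main table (and the target) into train and test parts
main_train_df, main_test_df, y_train, y_test = train_test_split(
    headlines_train_df, y_sarcasm_train, test_size=0.2, random_state=42
)

# Test specification: same shape as before, with the test main table
X_sarcasm_test = {
    "main_table": "headlines",
    "tables": {
        "headlines": (main_test_df, "HeadlineId"),
        "headline_words": (headlines_words_df, "HeadlineId"),
    },
}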

sarcasm_predictions = khc_sarcasm.predict(X_sarcasm_train)
print("HeadlineSarcasm train predictions (first 10 values):")
display(sarcasm_predictions[:10])
HeadlineSarcasm train predictions (first 10 values):
array(['yes', 'no', 'no', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'no'],
      dtype='<U3')
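
Since KhiopsClassifier follows the scikit-learn classifier API, we can also obtain class probability estimates with predict_proba; a quick sketch on the training data:

# Probability estimates; columns follow the order of khc_sarcasm.classes_
sarcasm_probas = khc_sarcasm.predict_proba(X_sarcasm_train)
print(f"Classes: {khc_sarcasm.classes_}")
display(sarcasm_probas[:10])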

Exercise

Repeat the previous steps with the AccidentsSummary dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:

+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           | +---------------+
|Light          | |Vehicles       |
|Department     | +---------------+
|Commune        | |AccidentId*    |
|InAgglomeration| |VehicleId*     |
|...            | |Direction      |
+---------------+ |Category       |
       |          |PassengerNumber|
       +---1:n--->|...            |
                  +---------------+

For each accident, we have both its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (its Direction or PassengerNumber). We first load the tables of the AccidentsSummary dataset into dataframes:

accidents_dataset_dir = path.join(kh.get_samples_dir(), "AccidentsSummary")

accidents_file = path.join(accidents_dataset_dir, "Accidents.txt")
accidents_df = pd.read_csv(accidents_file, sep="\t", encoding="ISO-8859-1")
print(f"Accidents dataframe (first 10 rows):")
display(accidents_df.head(10))
print()

vehicles_file = path.join(accidents_dataset_dir, "Vehicles.txt")
vehicles_df = pd.read_csv(vehicles_file, sep="\t", encoding="ISO-8859-1")
print(f"Vehicles dataframe (first 10 rows):")
display(vehicles_df.head(10))
Accidents dataframe (first 10 rows):
     AccidentId    Gravity        Date      Hour               Light
0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight
1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight
2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight
3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight
4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight
5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn
6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn
7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn
8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight
9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight

   Department  Commune InAgglomeration IntersectionType    Weather
0         590        5              No           Y-type     Normal
1         590       11             Yes           Square   VeryGood
2         590      477             Yes           T-type     Normal
3         590       52             Yes   NoIntersection   VeryGood
4         590      477             Yes   NoIntersection     Normal
5         590       52             Yes   NoIntersection  LightRain
6         590      133             Yes   NoIntersection     Normal
7         590       11             Yes   NoIntersection     Normal
8         590      550              No   NoIntersection     Normal
9         590       51             Yes           X-type     Normal

                      CollisionType             PostalAddress
0  2Vehicles-BehindVehicles-Frontal    route des Ansereuilles
1                       NoCollision  Place du général de Gaul
2                       NoCollision            Rue  nationale
3                    2Vehicles-Side       30 rue Jules Guesde
4                    2Vehicles-Side        72 rue Victor Hugo
5                             Other                       D39
6                             Other        4 route de camphin
7                             Other         rue saint exupéry
8                             Other          rue de l'égalité
9  2Vehicles-BehindVehicles-Frontal   face au 59 rue de Lille
Vehicles dataframe (first 10 rows):
     AccidentId VehicleId Direction          Category  PassengerNumber
0  201800000001       A01   Unknown         Car<=3.5T                0
1  201800000001       B01   Unknown         Car<=3.5T                0
2  201800000002       A01   Unknown         Car<=3.5T                0
3  201800000003       A01   Unknown  Motorbike>125cm3                0
4  201800000003       B01   Unknown         Car<=3.5T                0
5  201800000003       C01   Unknown         Car<=3.5T                0
6  201800000004       A01   Unknown         Car<=3.5T                0
7  201800000004       B01   Unknown           Bicycle                0
8  201800000005       A01   Unknown             Moped                0
9  201800000005       B01   Unknown         Car<=3.5T                0

       FixedObstacle MobileObstacle ImpactPoint           Maneuver
0                NaN        Vehicle  RightFront         TurnToLeft
1                NaN        Vehicle   LeftFront  NoDirectionChange
2                NaN     Pedestrian         NaN  NoDirectionChange
3  StationaryVehicle        Vehicle       Front  NoDirectionChange
4                NaN        Vehicle    LeftSide         TurnToLeft
5                NaN            NaN   RightSide             Parked
6                NaN          Other  RightFront          Avoidance
7                NaN        Vehicle    LeftSide                NaN
8                NaN        Vehicle  RightFront           PassLeft
9                NaN        Vehicle   LeftFront               Park

Create the main feature matrix and the target vector for AccidentsSummary

Note that the target variable is Gravity.

accidents_main_df = accidents_df.drop("Gravity", axis=1)
y_accidents_train = accidents_df["Gravity"]

Create the multi-table dataset specification

Note that the main table has a single key column AccidentId, while the secondary table has a composite key composed of AccidentId and VehicleId.

X_accidents_train = {
    "main_table": "accidents",
    "tables": {
        "accidents": (accidents_main_df, "AccidentId"),
        "vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
    },
}

Train a classifier with this dataset

  • You may choose the number of features n_features to be created by the Khiops AutoML engine

  • Set the number of trees to zero (n_trees=0)

khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
khc_accidents.fit(X_accidents_train, y_accidents_train)
KhiopsClassifier(n_features=1000, n_trees=0)

Deploy the classifier to obtain predictions on the training data

Note that usually one deploys the model on new test data. We deploy on the train dataset to keep the tutorial simple.

khc_accidents.predict(X_accidents_train)
array(['NonLethal', 'NonLethal', 'NonLethal', ..., 'NonLethal',
       'NonLethal', 'NonLethal'], dtype='<U9')
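
To wrap up, we can check the train accuracy and AUC of this classifier exactly as we did for HeadlineSarcasm:

accidents_train_performance = (
    khc_accidents.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"AccidentsSummary train accuracy: {accidents_train_performance.accuracy}")
print(f"AccidentsSummary train auc     : {accidents_train_performance.auc}")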