Sklearn Basics 2: Train a Classifier on a Star Multi-Table Dataset¶
In this notebook, we will learn how to train a classifier on multi-table data composed of two tables (a root table and a secondary table). It is highly recommended to complete the Sklearn Basics 1 lesson first if you are not familiar with Khiops’ sklearn estimators.
We start by importing the sklearn estimator KhiopsClassifier:
import os
import pandas as pd
from khiops import core as kh
from khiops.sklearn import KhiopsClassifier
from khiops.utils.helpers import train_test_split_dataset
from sklearn import metrics
# If there are any issues you may check the Khiops status with the following command
# kh.get_runner().print_status()
Training a Multi-Table Classifier¶
We’ll train a “sarcasm detector” using the HeadlineSarcasm dataset.
In its raw form, the dataset contains a list of text headlines paired
with a label that indicates whether its source is a sarcastic site (such
as The Onion) or not.
We have transformed this dataset into two tables such that the text-label record
"groundbreaking study finds gratification can be deliberately postponed" yes
is transformed to an entry in a table that contains (id, label) records
97 yes
and various entries in a secondary table linking a headline id to its words and positions
97 0 groundbreaking
97 1 study
97 2 finds
97 3 gratification
97 4 can
97 5 be
97 6 deliberately
97 7 postponed
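For illustration, here is a minimal pandas sketch (a toy record and hypothetical code, not part of the dataset preparation scripts) of how such a word table can be derived from a raw headline:

```python
import pandas as pd

# Hypothetical illustration: split one raw (id, text, label) record into
# the two tables described above.
record = {
    "HeadlineId": 97,
    "Headline": "groundbreaking study finds gratification can be deliberately postponed",
    "IsSarcasm": "yes",
}

# Main table entry: (id, label)
headline_row = {"HeadlineId": record["HeadlineId"], "IsSarcasm": record["IsSarcasm"]}

# Secondary table entries: one row per (id, position, word)
word_rows = pd.DataFrame(
    [(record["HeadlineId"], pos, word)
     for pos, word in enumerate(record["Headline"].split())],
    columns=["HeadlineId", "Position", "Word"],
)
```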
Thus, the HeadlineSarcasm dataset has the following multi-table schema:
+-----------+
|Headline |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId* |
| |Position |
+-1:n--->|Word |
+-------------+
The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).
Note: There are other, more appropriate methods for this text-mining problem. This multi-table setup is intended for pedagogical purposes only.
To train the KhiopsClassifier for this setup we must specify a multi-table dataset. Let’s first check the contents of the created tables:
- The main table Headline
- The secondary table HeadlineWords
sarcasm_dataset_dir = os.path.join("data", "HeadlineSarcasm")
headlines_file = os.path.join(sarcasm_dataset_dir, "Headlines.txt")
headlines_df = pd.read_csv(headlines_file, sep="\t")
print("Headlines table (first 10 rows)")
display(headlines_df.head(10))
headlines_words_file = os.path.join(sarcasm_dataset_dir, "HeadlineWords.txt")
headlines_words_df = pd.read_csv(headlines_words_file, sep="\t")
print("HeadlineWords table (first 10 rows)")
display(headlines_words_df.head(10))
Headlines table (first 10 rows)
HeadlineId IsSarcasm
0 0 yes
1 1 no
2 10 no
3 100 yes
4 1000 yes
5 10000 no
6 10001 yes
7 10002 no
8 10003 yes
9 10004 no
HeadlineWords table (first 10 rows)
HeadlineId Position Word
0 0 0 thirtysomething
1 0 1 scientists
2 0 2 unveil
3 0 3 doomsday
4 0 4 clock
5 0 5 of
6 0 6 hair
7 0 7 loss
8 1 0 dem
9 1 1 rep.
Before training the classifier, we split the main table into a feature matrix (only the HeadlineId column) and a target vector containing the labels (the IsSarcasm column).
headlines_main_df = headlines_df.drop("IsSarcasm", axis=1)
y_sarcasm = headlines_df["IsSarcasm"]
You may note that the feature matrix does not contain any features, but do not worry: the Khiops AutoML engine will automatically create features by aggregating the columns of HeadlineWords for each headline (more details about this below).
Moreover, instead of passing an X table to the fit method, we pass a multi-table dataset specification, which is a dictionary with the following format:
X = {
"main_table": <name of the main table>,
"tables" : {
<name of table 1>: (<dataframe of table 1>, <key column names of table 1>),
<name of table 2>: (<dataframe of table 2>, <key column names of table 2>),
...
}
}
Note that the key columns of each table are specified either as a single column name or as a tuple of the column names composing the key. So for our HeadlineSarcasm case, we specify the dataset as:
X_sarcasm = {
"main_table": "headlines",
"tables": {
"headlines": (headlines_main_df, "HeadlineId"),
"headline_words": (headlines_words_df, "HeadlineId"),
},
}
To split this dataset into train and test, we use the khiops-python helper function train_test_split_dataset. This function can split dict dataset specifications:
(
X_sarcasm_train,
X_sarcasm_test,
y_sarcasm_train,
y_sarcasm_test,
) = train_test_split_dataset(X_sarcasm, y_sarcasm)
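For intuition only, here is a toy sketch of what such a consistent multi-table split entails: the main table and target are split, and the secondary table is filtered by key membership so each part keeps only the words of its own headlines. The helper does all of this for you; the data below is made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy main and secondary tables sharing the key "HeadlineId"
main_df = pd.DataFrame({"HeadlineId": [0, 1, 2, 3]})
words_df = pd.DataFrame({
    "HeadlineId": [0, 0, 1, 2, 2, 3],
    "Position":   [0, 1, 0, 0, 1, 0],
    "Word":       ["a", "b", "c", "d", "e", "f"],
})
y = pd.Series(["yes", "no", "no", "yes"])

# Split the main table and the target, then filter the secondary table
# by key membership so each split keeps only its own headlines' words
main_train, main_test, y_train, y_test = train_test_split(
    main_df, y, test_size=0.5, random_state=0
)
words_train = words_df[words_df["HeadlineId"].isin(main_train["HeadlineId"])]
words_test = words_df[words_df["HeadlineId"].isin(main_test["HeadlineId"])]
```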
The call to the KhiopsClassifier fit method is very similar to the single-table case, but this time we specify the additional parameter n_features, which is the number of aggregate features that the Khiops AutoML engine will construct and analyze during training. Some examples of the features it will create for HeadlineSarcasm are:
- Number of different words in the headline
- Most common word in the headline
- Number of times the word ‘the’ appears
- …
The Khiops AutoML engine will also evaluate, select, and combine these features to build a classifier. Here we request 1000 features (the default is 100):
Note: By default Khiops builds 10 decision tree features. This is not necessary for this tutorial, so we set ``n_trees=0``.
khc_sarcasm = KhiopsClassifier(n_features=1000, n_trees=0)
khc_sarcasm.fit(X_sarcasm_train, y_sarcasm_train)
KhiopsClassifier(n_features=1000, n_trees=0)
We quickly check its train accuracy and AUC as in the previous tutorial:
sarcasm_train_performance = (
khc_sarcasm.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"HeadlineSarcasm train accuracy: {sarcasm_train_performance.accuracy}")
print(f"HeadlineSarcasm train auc : {sarcasm_train_performance.auc}")
HeadlineSarcasm train accuracy: 0.863958
HeadlineSarcasm train auc : 0.941691
Now, we use our sarcasm classifier to obtain predictions and probabilities on the test data:
y_sarcasm_test_predicted = khc_sarcasm.predict(X_sarcasm_test)
probas_sarcasm_test = khc_sarcasm.predict_proba(X_sarcasm_test)
print("HeadlineSarcasm test predictions (first 10 values):")
display(y_sarcasm_test_predicted[:10])
print("HeadlineSarcasm test prediction probabilities (first 10 values):")
display(probas_sarcasm_test[:10])
HeadlineSarcasm test predictions (first 10 values):
array(['no', 'yes', 'yes', 'no', 'yes', 'no', 'no', 'yes', 'no', 'yes'],
dtype='<U3')
HeadlineSarcasm test prediction probabilities (first 10 values):
array([[0.95532504, 0.04467496],
[0.1746559 , 0.8253441 ],
[0.06993393, 0.93006607],
[0.82820742, 0.17179258],
[0.30951808, 0.69048192],
[0.80932336, 0.19067664],
[0.70963749, 0.29036251],
[0.07363783, 0.92636217],
[0.86032515, 0.13967485],
[0.0613004 , 0.9386996 ]])
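Note that, following the scikit-learn convention, the columns of predict_proba follow the order of the estimator’s classes_ attribute; here ‘no’ comes before ‘yes’, so the second column holds the probability of ‘yes’. A small sketch (with hard-coded values standing in for khc_sarcasm.classes_ and the probabilities above) of how to select the column for a given class:

```python
import numpy as np

# Sketch: find the probability column of a given class label, assuming the
# scikit-learn convention that predict_proba columns follow classes_.
# The arrays below are hard-coded stand-ins for khc_sarcasm.classes_ and
# the probabilities displayed above.
classes = np.array(["no", "yes"])
probas = np.array([[0.95532504, 0.04467496],
                   [0.1746559, 0.8253441]])

yes_col = int(np.where(classes == "yes")[0][0])
p_yes = probas[:, yes_col]  # probabilities of the "yes" class
```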
Finally, we estimate the accuracy and AUC on the test data:
sarcasm_test_accuracy = metrics.accuracy_score(y_sarcasm_test, y_sarcasm_test_predicted)
sarcasm_test_auc = metrics.roc_auc_score(y_sarcasm_test, probas_sarcasm_test[:, 1])
print(f"Sarcasm test accuracy: {sarcasm_test_accuracy}")
print(f"Sarcasm test auc : {sarcasm_test_auc}")
Sarcasm test accuracy: 0.8178895877009085
Sarcasm test auc : 0.9063618657053968
To further explore the results we can see the report with the Khiops Visualization app:
# To visualize uncomment the lines below
# khc_sarcasm.export_report_file("./sarcasm_report.khj")
# kh.visualize_report("./sarcasm_report.khj")
Exercise¶
Repeat the previous steps with the AccidentsSummary dataset. This dataset describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:
+---------------+
|Accidents |
+---------------+
|AccidentId* |
|Gravity |
|Date |
|Hour | +---------------+
|Light | |Vehicles |
|Department | +---------------+
|Commune | |AccidentId* |
|InAgglomeration| |VehicleId* |
|... | |Direction |
+---------------+ |Category |
| |PassengerNumber|
+---1:n--->|... |
+---------------+
For each accident, we have both its characteristics (such as Gravity or the Light conditions) and those of each involved vehicle (its Direction or PassengerNumber). We first load the tables of AccidentsSummary into dataframes:
accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "AccidentsSummary")
accidents_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
accidents_df = pd.read_csv(accidents_file, sep="\t", encoding="latin1")
print(f"Accidents dataframe (first 10 rows):")
display(accidents_df.head(10))
print()
vehicles_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
vehicles_df = pd.read_csv(vehicles_file, sep="\t", encoding="latin1")
print(f"Vehicles dataframe (first 10 rows):")
display(vehicles_df.head(10))
Accidents dataframe (first 10 rows):
     AccidentId    Gravity        Date      Hour               Light
0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight
1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight
2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight
3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight
4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight
5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn
6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn
7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn
8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight
9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight

   Department  Commune InAgglomeration IntersectionType    Weather
0         590        5              No           Y-type     Normal
1         590       11             Yes           Square   VeryGood
2         590      477             Yes           T-type     Normal
3         590       52             Yes   NoIntersection   VeryGood
4         590      477             Yes   NoIntersection     Normal
5         590       52             Yes   NoIntersection  LightRain
6         590      133             Yes   NoIntersection     Normal
7         590       11             Yes   NoIntersection     Normal
8         590      550              No   NoIntersection     Normal
9         590       51             Yes           X-type     Normal

                      CollisionType             PostalAddress
0  2Vehicles-BehindVehicles-Frontal    route des Ansereuilles
1                       NoCollision  Place du général de Gaul
2                       NoCollision             Rue nationale
3                    2Vehicles-Side       30 rue Jules Guesde
4                    2Vehicles-Side        72 rue Victor Hugo
5                             Other                       D39
6                             Other        4 route de camphin
7                             Other         rue saint exupéry
8                             Other          rue de l'égalité
9  2Vehicles-BehindVehicles-Frontal   face au 59 rue de Lille
Vehicles dataframe (first 10 rows):
     AccidentId VehicleId Direction          Category  PassengerNumber
0  201800000001       A01   Unknown         Car<=3.5T                0
1  201800000001       B01   Unknown         Car<=3.5T                0
2  201800000002       A01   Unknown         Car<=3.5T                0
3  201800000003       A01   Unknown  Motorbike>125cm3                0
4  201800000003       B01   Unknown         Car<=3.5T                0
5  201800000003       C01   Unknown         Car<=3.5T                0
6  201800000004       A01   Unknown         Car<=3.5T                0
7  201800000004       B01   Unknown           Bicycle                0
8  201800000005       A01   Unknown             Moped                0
9  201800000005       B01   Unknown         Car<=3.5T                0

       FixedObstacle MobileObstacle ImpactPoint           Maneuver
0                NaN        Vehicle  RightFront         TurnToLeft
1                NaN        Vehicle   LeftFront  NoDirectionChange
2                NaN     Pedestrian         NaN  NoDirectionChange
3  StationaryVehicle        Vehicle       Front  NoDirectionChange
4                NaN        Vehicle    LeftSide         TurnToLeft
5                NaN            NaN   RightSide             Parked
6                NaN          Other  RightFront          Avoidance
7                NaN        Vehicle    LeftSide                NaN
8                NaN        Vehicle  RightFront           PassLeft
9                NaN        Vehicle   LeftFront               Park
Create the main feature matrix and the target vector for AccidentsSummary¶
Note that the target variable is Gravity.
accidents_main_df = accidents_df.drop("Gravity", axis=1)
y_accidents = accidents_df["Gravity"]
Create the multi-table dataset specification¶
Note that the main table has a single key column AccidentId, whereas the secondary table has a two-column key: AccidentId and VehicleId.
X_accidents = {
"main_table": "accidents",
"tables": {
"accidents": (accidents_main_df, "AccidentId"),
"vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
},
}
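A quick sanity check worth running on such data (a sketch with a toy Vehicles-like table) is that the composite key contains no duplicates, i.e. that the pair (AccidentId, VehicleId) uniquely identifies each row:

```python
import pandas as pd

# Toy Vehicles-like table keyed by the pair (AccidentId, VehicleId)
vehicles = pd.DataFrame({
    "AccidentId": [201800000001, 201800000001, 201800000002],
    "VehicleId": ["A01", "B01", "A01"],
    "Category": ["Car<=3.5T", "Car<=3.5T", "Motorbike>125cm3"],
})

# No (AccidentId, VehicleId) pair should appear twice
n_duplicates = vehicles.duplicated(subset=["AccidentId", "VehicleId"]).sum()
print(n_duplicates)  # → 0
```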
Split the dataset into train and test¶
(
X_accidents_train,
X_accidents_test,
y_accidents_train,
y_accidents_test,
) = train_test_split_dataset(X_accidents, y_accidents)
Train a classifier with this dataset¶
- You may choose the number of features n_features to be created by the Khiops AutoML engine
- Set the number of trees to zero (n_trees=0)
khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
khc_accidents.fit(X_accidents_train, y_accidents_train)
KhiopsClassifier(n_features=1000, n_trees=0)
Print the train accuracy and AUC of the model¶
accidents_train_performance = (
khc_accidents.model_report_.train_evaluation_report.get_snb_performance()
)
print(f"AccidentsSummary train accuracy: {accidents_train_performance.accuracy}")
print(f"AccidentsSummary train auc : {accidents_train_performance.auc}")
AccidentsSummary train accuracy: 0.944735
AccidentsSummary train auc : 0.814207
Deploy the classifier to obtain predictions and probabilities on the test data¶
y_accidents_test_predicted = khc_accidents.predict(X_accidents_test)
probas_accidents_test = khc_accidents.predict_proba(X_accidents_test)
print("Accidents test predictions (first 10 values):")
display(y_accidents_test_predicted[:10])
print("Accidents test prediction probabilities (first 10 values):")
display(probas_accidents_test[:10])
Accidents test predictions (first 10 values):
array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
dtype='<U9')
Accidents test prediction probabilities (first 10 values):
array([[0.06586758, 0.93413242],
[0.0316421 , 0.9683579 ],
[0.01812328, 0.98187672],
[0.02759779, 0.97240221],
[0.07860092, 0.92139908],
[0.04881271, 0.95118729],
[0.1313691 , 0.8686309 ],
[0.02545221, 0.97454779],
[0.00644633, 0.99355367],
[0.03363112, 0.96636888]])
Obtain the accuracy and AUC on the test dataset¶
accidents_test_accuracy = metrics.accuracy_score(
y_accidents_test, y_accidents_test_predicted
)
accidents_test_auc = metrics.roc_auc_score(
y_accidents_test, probas_accidents_test[:, 1]
)
print(f"Accidents test accuracy: {accidents_test_accuracy}")
print(f"Accidents test auc : {accidents_test_auc}")
Accidents test accuracy: 0.9455904748719368
Accidents test auc : 0.8091979330822332
Explore the report with the Khiops Visualization App¶
# To visualize uncomment the lines below
# khc_accidents.export_report_file("./accidents_report.khj")
# kh.visualize_report("./accidents_report.khj")