Core Basics 2: Train a Classifier on a Star Multi-Table Dataset
In this notebook we learn how to train a classifier on multi-table data composed of two tables (a root table and a secondary table). We highly recommend going through the Core Basics 1 lesson first if you are not familiar with Khiops.
Make sure you have installed Khiops and Khiops Visualization.
We start by importing Khiops, checking its installation and defining some helper functions:
import os
import platform
import subprocess
from khiops import core as kh
# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")
# If there are any issues you may check the Khiops status with the following command
# kh.get_runner().print_status()
Training a Multi-Table Classifier
We’ll train a “sarcasm detector” using the dataset HeadlineSarcasm. In its raw form, it contains a list of text headlines, each paired with a label that indicates whether its source is a sarcastic site (such as The Onion) or not.
We have transformed this dataset into two tables such that the text-label record
"groundbreaking study finds gratification can be deliberately postponed" yes
is transformed to an entry in a table that contains id-label records
97 yes
and various entries in a secondary table linking a headline id to its words and positions
97 0 groundbreaking
97 1 study
97 2 finds
97 3 gratification
97 4 can
97 5 be
97 6 deliberately
97 7 postponed
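The transformation above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the script that was used to prepare the dataset: the helper name split_headline and the whitespace tokenization are assumptions.

```python
def split_headline(headline_id, text, label):
    """Splits one text-label record into a main-table row and secondary-table rows.

    Hypothetical helper: it tokenizes on whitespace, which may differ from the
    preprocessing actually applied to HeadlineSarcasm.
    """
    main_row = (headline_id, label)
    word_rows = [
        (headline_id, position, word) for position, word in enumerate(text.split())
    ]
    return main_row, word_rows


main_row, word_rows = split_headline(
    97,
    "groundbreaking study finds gratification can be deliberately postponed",
    "yes",
)
print(*main_row)  # the id-label record of the main table
for row in word_rows:
    print(*row)  # one id-position-word record per word
```

Each word row repeats the headline id, which is what later allows the two tables to be linked through a key.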
Thus the HeadlineSarcasm dataset has the following multi-table schema
+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
     |        |Position     |
     +-1:n--->|Word         |
              +-------------+
The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).
Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purposes.
To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let’s check the contents of the HeadlineSarcasm dictionary file:
sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")
print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
print("")
peek(sarcasm_kdic, n=15)
HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic
Root Dictionary Headline(HeadlineId)
{
Categorical HeadlineId;
Categorical IsSarcasm;
Table(Words) HeadlineWords;
};
Dictionary Words(HeadlineId)
{
Categorical HeadlineId;
Numerical Position;
Categorical Word;
};
As in the single-table case, the .kdic file describes the schema for both tables, but note the following differences:
- The dictionary for the table Headline is prefixed by the Root keyword to indicate that it is the main one.
- For both tables, the dictionary name is followed by (HeadlineId) to indicate that HeadlineId is the key of the table.
- The schema for the main table contains an extra special variable defined with the statement Table(Words) HeadlineWords. This declaration, in addition to the shared key variable, is necessary to indicate the 1:n relationship between the main and secondary tables.
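To make these three points concrete, here is a rough sketch that extracts them from the dictionary text with regular expressions. It is not the Khiops parser: the function summarize_kdic is hypothetical and only handles the simple syntax shown above. For real work, prefer the dictionary-reading facilities of the Khiops Python library.

```python
import re

# The dictionary file contents shown above, pasted as a string
KDIC_TEXT = """
Root Dictionary Headline(HeadlineId)
{
    Categorical HeadlineId;
    Categorical IsSarcasm;
    Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
    Categorical HeadlineId;
    Numerical Position;
    Categorical Word;
};
"""


def summarize_kdic(text):
    """Returns, per dictionary, its Root flag, its key and its Table variables."""
    header = re.compile(
        r"^\s*(Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)", re.MULTILINE
    )
    table = re.compile(r"^\s*Table\((\w+)\)\s+(\w+)\s*;", re.MULTILINE)
    matches = list(header.finditer(text))
    summary = {}
    for i, match in enumerate(matches):
        # The body of a dictionary runs until the next dictionary header
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        summary[match.group(2)] = {
            "root": match.group(1) is not None,
            "key": [k.strip() for k in match.group(3).split(",") if k.strip()],
            # Maps each Table variable name to the dictionary it points to
            "tables": {var: dic for dic, var in table.findall(body)},
        }
    return summary


summary = summarize_kdic(KDIC_TEXT)
for name, info in summary.items():
    print(name, info)
```

Note that the Table variable is named HeadlineWords while the dictionary it refers to is named Words; the two names play different roles.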
Now let’s store the locations of the main and secondary tables and peek at their contents:
sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
print("")
peek(sarcasm_headlines_file, n=3)
print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
print("")
peek(sarcasm_words_file, n=15)
HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt
HeadlineId IsSarcasm
0 yes
1 no
HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt
HeadlineId Position Word
0 0 thirtysomething
0 1 scientists
0 2 unveil
0 3 doomsday
0 4 clock
0 5 of
0 6 hair
0 7 loss
1 0 dem
1 1 rep.
1 2 totally
1 3 nails
1 4 why
1 5 congress
The call to train_predictor will be very similar to the single-table case, but there are some differences.
The first is that we must pass the path of the extra secondary data table. This is done with the additional_data_tables parameter, which is a Python dictionary containing one key-value pair for each secondary table. More precisely:
- keys are the data paths of the secondary tables. In this case only Headline`HeadlineWords
- values are the file paths of the secondary tables. In this case only the file path we stored in sarcasm_words_file
Note: To understand what data paths are, see the “Multi-Table Tasks” section of the Khiops core.api documentation.
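As a small illustration, the dictionary for this parameter could be assembled as follows. The helper make_data_path is hypothetical and not part of Khiops; it simply joins names with a backtick, following the data path used in this tutorial. The data path format is an assumption that may vary between Khiops versions, so check the core.api documentation of the version you use.

```python
import os


def make_data_path(root_dictionary, *table_variables):
    """Joins the root dictionary name and Table variable names with a backtick.

    Hypothetical helper: it mirrors the data path Headline`HeadlineWords
    used in this tutorial.
    """
    return "`".join((root_dictionary, *table_variables))


sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")

# One key-value pair per secondary table: data path -> data file path
additional_data_tables = {
    make_data_path("Headline", "HeadlineWords"): sarcasm_words_file,
}
print(additional_data_tables)
```

The second component of the data path is the name of the Table variable declared in the root dictionary (HeadlineWords), not the name of the secondary dictionary (Words).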
Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the HeadlineSarcasm dataset Khiops can create features such as:
- Number of different words in the headline
- Most common word in the headline before the third one
- Number of times the word ‘the’ appears
- …
It will then evaluate, select and combine the created features to build a classifier. We’ll ask it to create 1000 of these features (the default is 100).
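To give an intuition of what such aggregates look like, the sketch below computes three of them by hand for a single headline. Khiops constructs and selects its features automatically, so this code is not how Khiops works internally. The function name and the output keys are made up for illustration, and "before the third one" is interpreted here as "among the words with Position below 3", which is an assumption.

```python
from collections import Counter


def headline_aggregates(words):
    """Computes three example aggregates from the ordered words of one headline."""
    # Words at positions 0, 1 and 2 (assumed reading of "before the third one")
    first_words = Counter(words[:3])
    return {
        "distinct_word_count": len(set(words)),
        "most_common_early_word": (
            first_words.most_common(1)[0][0] if first_words else None
        ),
        "count_of_the": words.count("the"),
    }


# Words of the headline with HeadlineId 0 in the secondary table
words = "thirtysomething scientists unveil doomsday clock of hair loss".split()
print(headline_aggregates(words))
```

Each aggregate summarizes the variable number of secondary records of a headline into a single value, which is what lets a classifier work on one row per headline.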
With these considerations, let’s set up some extra variables and train the classifier:
sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")
sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
    sarcasm_kdic,
    dictionary_name="Headline",  # This must be the main/root dictionary
    data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
    target_variable="IsSarcasm",
    results_dir=sarcasm_results_dir,
    additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
    max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")
HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic
We may now take a look at the results with the visualization tool:
# To visualize uncomment the line below
# kh.visualize_report(sarcasm_report)
Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops sort_data_table function or your favorite software. The examples of this tutorial have their tables pre-sorted.
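The sketch below shows what lexicographical order means for key values. It is a plain Python illustration on made-up rows, not a replacement for sort_data_table: it assumes a tab-separated table with a header line, and Python's string comparison may not match Khiops' ordering for every character.

```python
def sort_rows_by_key(lines, key_index=0, sep="\t"):
    """Sorts the data lines by their key field compared as text.

    The first line is treated as the header and stays in first position.
    """
    header, *rows = lines
    rows.sort(key=lambda row: row.split(sep)[key_index])
    return [header, *rows]


# Hypothetical unsorted table
table = [
    "HeadlineId\tIsSarcasm",
    "10\tno",
    "2\tyes",
    "1\tno",
]
for line in sort_rows_by_key(table):
    print(line)
```

Because keys are compared as text, the key 10 comes before the key 2, which differs from numerical order.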
Exercise time!
Repeat the previous steps with the AccidentsSummary dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:
+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           |  +---------------+
|Light          |  |Vehicles       |
|Department     |  +---------------+
|Commune        |  |AccidentId*    |
|InAgglomeration|  |VehicleId*     |
|...            |  |Direction      |
+---------------+  |Category       |
        |          |PassengerNumber|
        +---1:n--->|...            |
                   +---------------+
So for each accident we have its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (its Direction or PassengerNumber). The main task for this dataset is to predict the variable Gravity, which has two possible values: Lethal and NonLethal.
We first save the paths of the AccidentsSummary dictionary file and data table files into variables:
accidents_kdic = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
)
accidents_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
)
vehicles_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
)
Print the file locations and use the function peek to list their contents
Which table is the Root in this case?
print(f"Accidents dictionary file: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)
print(f"Accidents (main) data table: {accidents_data_file}")
print("")
peek(accidents_data_file)
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
Accidents dictionary file: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Table(Vehicle) Vehicles;
};
Dictionary Vehicle(AccidentId, VehicleId)
{
Categorical AccidentId;
Categorical VehicleId;
Categorical Direction;
Categorical Category;
Numerical PassengerNumber;
Categorical FixedObstacle;
Categorical MobileObstacle;
Categorical ImpactPoint;
Categorical Maneuver;
};
Accidents (main) data table: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt
AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress
201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles
201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul
201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale
201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde
201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo
201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39
201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin
201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry
201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité
Vehicles data table: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
We now save the results directory for this exercise:
accidents_results_dir = os.path.join("exercises", "AccidentSummary")
print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")
AccidentsSummary exercise results directory: exercises/AccidentSummary
Train a classifier for the Accidents database with 1000 variables
Save the resulting file locations into the variables accidents_report and accidents_model_kdic and print them.
Do not forget:
- The target variable is Gravity
- The key for the additional_data_tables parameter is Accident`Vehicles and its value is vehicles_data_file
- Set max_trees=0
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={"Accident`Vehicles": vehicles_data_file},
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"AccidentsSummary report file: {accidents_report}")
print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")
AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic
Take a look at the report
Which variables best predict the gravity of an accident?
# To visualize uncomment the line below
# kh.visualize_report(accidents_report)