Core Basics 2: Train a Classifier on a Star Multi-Table Dataset¶
In this notebook we learn how to train a classifier with multi-table data composed of two tables (a root table and a secondary table). It is highly recommended to go through the Core Basics 1 lesson first if you are not familiar with Khiops.
We start by importing Khiops and some helper functions:
from os import path
from khiops import core as kh
from helper_functions import explorer_open, peek
Training a Multi-Table Classifier¶
We’ll train a “sarcasm detector” using the HeadlineSarcasm dataset.
In its raw form, it contains a list of text headlines paired with a
label that indicates whether its source is a sarcastic site (such as
The Onion) or not.
We have transformed this dataset into two tables such that the text-label record
"groundbreaking study finds gratification can be deliberately postponed" yes
is transformed to an entry in a table that contains id-label records
97 yes
and various entries in a secondary table linking a headline id to its words and positions
97 0 groundbreaking
97 1 study
97 2 finds
97 3 gratification
97 4 can
97 5 be
97 6 deliberately
97 7 postponed
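The transformation above can be sketched in a few lines of plain Python. This is only an illustration, not the script used to prepare the dataset: the function name split_headlines and its first_id parameter are hypothetical, and splitting on whitespace is an assumed tokenization.

```python
def split_headlines(records, first_id=0):
    """Split (text, label) records into root-table and secondary-table rows."""
    headlines = []  # rows of the root table: (headline id, label)
    words = []  # rows of the secondary table: (headline id, position, word)
    for headline_id, (text, label) in enumerate(records, start=first_id):
        headlines.append((headline_id, label))
        # Assumed tokenization: split the headline on whitespace
        for position, word in enumerate(text.split()):
            words.append((headline_id, position, word))
    return headlines, words


records = [
    ("groundbreaking study finds gratification can be deliberately postponed", "yes"),
]
# first_id=97 only reproduces the id used in the example above
headlines, words = split_headlines(records, first_id=97)
print(headlines)
for row in words:
    print(*row)
```

Each headline yields exactly one row in the root table and one row per word in the secondary table, which is the 1:n relation the rest of the notebook relies on.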
Thus the HeadlineSarcasm dataset has the following multi-table schema:
+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
      |       |Position     |
      +-1:n-->|Word         |
              +-------------+
The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).
Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purposes.
To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let’s check the contents of the HeadlineSarcasm dictionary file:
sarcasm_kdic = path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")
print("")
print(f"HeadlineSarcasm dictionary file location: {sarcasm_kdic}")
print("")
peek(sarcasm_kdic, n=15)
HeadlineSarcasm dictionary file location: data/HeadlineSarcasm/HeadlineSarcasm.kdic
Root Dictionary Headline(HeadlineId)
{
Categorical HeadlineId;
Categorical IsSarcasm;
Table(Words) HeadlineWords;
};
Dictionary Words(HeadlineId)
{
Categorical HeadlineId;
Numerical Position;
Categorical Word;
};
As in the single-table case, the .kdic file describes the schema for both tables, but note the following differences:
- The dictionary for the Headline table is prefixed by the Root keyword to indicate that it is the main one.
- For both tables, the dictionary name is followed by (HeadlineId) to indicate that HeadlineId is the key of the table.
- The schema for the main table contains an extra special variable defined with the statement Table(Words) HeadlineWords. This variable, in addition to the shared key variable, is necessary to indicate the 1:n relationship between the main and secondary tables.
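To make these three points concrete, here is a minimal sketch that extracts them from the dictionary text shown above. It is not part of the Khiops API: summarize_kdic is a hypothetical helper based on two simple regular expressions, and it only understands the constructs that appear in this particular file.

```python
import re

# Dictionary text copied from the peek output above
KDIC_TEXT = """\
Root Dictionary Headline(HeadlineId)
{
Categorical HeadlineId;
Categorical IsSarcasm;
Table(Words) HeadlineWords;
};
Dictionary Words(HeadlineId)
{
Categorical HeadlineId;
Numerical Position;
Categorical Word;
};
"""

HEADER = re.compile(r"(Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)")
TABLE_VAR = re.compile(r"Table\((\w+)\)\s+(\w+)\s*;")


def summarize_kdic(text):
    """Return, per dictionary: root flag, key variables and Table variables."""
    summary = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(2)
            summary[current] = {
                "root": header.group(1) is not None,
                "key": [name.strip() for name in header.group(3).split(",")],
                "tables": {},
            }
            continue
        table = TABLE_VAR.match(line)
        if table and current is not None:
            # Maps the Table variable name to the dictionary of its records
            summary[current]["tables"][table.group(2)] = table.group(1)
    return summary


summary = summarize_kdic(KDIC_TEXT)
for name, info in summary.items():
    print(name, info)
```

The summary shows that Headline is the root dictionary, that both dictionaries are keyed by HeadlineId, and that the HeadlineWords variable points to the Words dictionary.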
Now let’s store the locations of the main and secondary tables and peek at their contents:
sarcasm_headlines_file = path.join("data", "HeadlineSarcasm", "Headlines.txt")
sarcasm_words_file = path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
print("")
print(f"HeadlineSarcasm main table file location: {sarcasm_headlines_file}")
print("")
peek(sarcasm_headlines_file, n=3)
print("")
print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
print("")
peek(sarcasm_words_file, n=15)
HeadlineSarcasm main table file location: data/HeadlineSarcasm/Headlines.txt
HeadlineId IsSarcasm
0 yes
1 no
10 no
...
HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt
HeadlineId Position Word
0 0 thirtysomething
0 1 scientists
0 2 unveil
0 3 doomsday
0 4 clock
0 5 of
0 6 hair
0 7 loss
1 0 dem
1 1 rep.
1 2 totally
1 3 nails
1 4 why
1 5 congress
1 6 is
...
The call to train_predictor will be very similar to the single-table case, but there are some differences.
The first is that we must pass the path of the extra secondary data table. This is done with the additional_data_tables parameter, which is a Python dictionary containing a key-value pair for each secondary table. More precisely:
- keys describe the data paths of secondary tables. In this case only Headline`HeadlineWords
- values describe the file paths of secondary tables. In this case only the file path we stored in sarcasm_words_file
Note: To understand what data paths are, see the “Multi-Table Tasks” section of the Khiops core.api documentation.
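As a small illustration of how such a mapping is put together, the sketch below builds the data path from the root dictionary name and the Table variable name, joined by a backtick as in this tutorial. The helper secondary_data_path is hypothetical (it is not a Khiops function) and only covers a secondary table directly attached to the root table.

```python
from os import path


def secondary_data_path(root_dictionary, table_variable):
    """Build the data path of a secondary table directly under the root table."""
    # Convention used in this tutorial: <root dictionary>`<Table variable>
    return f"{root_dictionary}`{table_variable}"


sarcasm_words_file = path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
additional_data_tables = {
    secondary_data_path("Headline", "HeadlineWords"): sarcasm_words_file,
}
print(additional_data_tables)
```

Note that the key uses the name of the Table variable (HeadlineWords), not the name of the secondary dictionary (Words).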
Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the HeadlineSarcasm dataset Khiops can create features such as:
- Number of different words in the headline
- Most common word in the headline before the third one
- Number of times the word ‘the’ appears
- …
It will then evaluate, select and combine the created features to build a classifier. We’ll ask it to create 1000 of these features (the default is 100).
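To get an intuition of what such aggregates compute, here is a hand-written sketch in plain Python for a single headline. It is not how Khiops builds its features: the function names are hypothetical, and reading “before the third one” as a threshold on Position is an assumption.

```python
from collections import Counter

# (Position, Word) rows of headline 0, taken from the secondary table peek above
headline_words = [
    (0, "thirtysomething"), (1, "scientists"), (2, "unveil"), (3, "doomsday"),
    (4, "clock"), (5, "of"), (6, "hair"), (7, "loss"),
]


def distinct_word_count(rows):
    """Number of different words in the headline."""
    return len({word for _, word in rows})


def word_count(rows, target):
    """Number of times a given word appears in the headline."""
    return sum(1 for _, word in rows if word == target)


def most_common_word_before(rows, position):
    """Most common word among those whose position is below the threshold."""
    early = [word for pos, word in rows if pos < position]
    return Counter(early).most_common(1)[0][0] if early else None


print(distinct_word_count(headline_words))
print(word_count(headline_words, "the"))
print(most_common_word_before(headline_words, 2))
```

Each function turns the variable number of secondary records of a headline into a single value, which is what allows the result to be used as an ordinary column of the root table.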
With these considerations, let’s set up some extra variables and train the classifier:
sarcasm_results_dir = path.join("exercises", "HeadlineSarcasm")
sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
    sarcasm_kdic,
    dictionary_name="Headline",  # This must be the main/root dictionary
    data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
    target_variable="IsSarcasm",
    results_dir=sarcasm_results_dir,
    additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
    max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")
HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic
We may now take a look at the results with the visualization tool:
# explorer_open(sarcasm_report)
Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops sort_data_table function or your favorite software. The tables used in the examples of this tutorial are pre-sorted.
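The sketch below shows what lexicographical order means for a key column, using only the Python standard library rather than sort_data_table. The helper sort_table_by_key is hypothetical, the table is a toy one (the row with id 2 is invented for the illustration), and it assumes tab-separated files with a header line. Python compares strings by code point, which agrees with byte order for ASCII keys; whether it agrees with Khiops for other characters is an assumption.

```python
import csv
import io

# Toy unsorted root table, tab-separated with a header line
unsorted_table = "HeadlineId\tIsSarcasm\n10\tno\n2\tyes\n0\tyes\n1\tno\n"


def sort_table_by_key(text, key_column):
    """Sort the rows of a tab-separated table by one key column, as strings."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    key_index = header.index(key_column)
    # Keys are compared as text, not as numbers
    rows = sorted(reader, key=lambda row: row[key_index])
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


sorted_table = sort_table_by_key(unsorted_table, "HeadlineId")
print(sorted_table)
```

Because keys are compared as text, 10 is placed before 2, which is the same ordering visible in the main table peek above (0, 1, 10).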
Exercise time!¶
Repeat the previous steps with the AccidentsSummary dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:
+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           |  +---------------+
|Light          |  |Vehicles       |
|Department     |  +---------------+
|Commune        |  |AccidentId*    |
|InAgglomeration|  |VehicleId*     |
|...            |  |Direction      |
+---------------+  |Category       |
       |           |PassengerNumber|
       +---1:n---->|...            |
                   +---------------+
So for each accident we have its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (its Direction or PassengerNumber). The main task for this dataset is to predict the variable Gravity, which has two possible values: Lethal and NonLethal.
We first save the paths of the AccidentsSummary dictionary file and data table files into variables:
accidents_kdic = path.join(kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic")
accidents_data_file = path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
)
vehicles_data_file = path.join(kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt")
Print the file locations and use the function peek to list their contents¶
Which table is the Root in this case?
print("")
print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)
print("")
print(f"Accidents data table location: {accidents_data_file}")
print("")
peek(accidents_data_file)
print("")
print(f"Vehicles secondary data table location: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
Accidents dictionary file location: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Table(Vehicle) Vehicles;
};
Dictionary Vehicle(AccidentId, VehicleId)
{
Categorical AccidentId;
Categorical VehicleId;
Categorical Direction;
Categorical Category;
Numerical PassengerNumber;
Categorical FixedObstacle;
Categorical MobileObstacle;
Categorical ImpactPoint;
Categorical Maneuver;
};
Accidents data table location: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt
AccidentId Gravity Date Hour Light Department Commune InAgglomeration Intersecti ...
201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Ve ...
201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood ...
201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal No ...
201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection V ...
201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection N ...
201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersec ...
201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoInte ...
201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoInterse ...
201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection No ...
201800000010 NonLethal 2018-03-19 15:30:00 Daylight 590 051 Yes X-type Normal 2V ...
...
Vehicles secondary data table location: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObst ...
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDi ...
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
201800000005 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront Park
...
We now save the results directory for this exercise:
accidents_results_dir = path.join("exercises", "AccidentSummary")
print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")
AccidentsSummary exercise results directory: exercises/AccidentSummary
Train a classifier for the AccidentsSummary dataset with 1000 variables¶
Save the resulting file locations into the variables accidents_report and accidents_model_kdic and print them.
Do not forget:
- The target variable is Gravity
- The key for the additional_data_tables parameter is Accident`Vehicles and its value is that of vehicles_data_file
- Set max_trees=0
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={"Accident`Vehicles": vehicles_data_file},
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"AccidentsSummary report file located at: {accidents_report}")
print(f"AccidentsSummary modeling dictionary file located at: {accidents_model_kdic}")
AccidentsSummary report file located at: exercises/AccidentSummary/AllReports.khj
AccidentsSummary modeling dictionary file located at: exercises/AccidentSummary/Modeling.kdic
Take a look at the report¶
Which variables best predict the gravity of an accident?
# explorer_open(accidents_report)