Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset
In this notebook, we learn how to train a classifier on more complex multi-table data, where a secondary table is itself the parent of another table (i.e. a snowflake schema). If you are not familiar with Khiops, we highly recommend going through the Basics 1 and Basics 2 lessons first.
We start by importing khiops and some helper functions:
from os import path
from khiops import core as kh
from helper_functions import explorer_open, peek
/github/home/.local/lib/python3.10/site-packages/khiops/core/internals/runner.py:1259: UserWarning: Too few cores: 2. To efficiently run Khiops in parallel at least 3 processes are needed. Khiops will run in a single process.
warnings.warn(
Training a Multi-Table Classifier
We’ll train a multi-table classifier on an extension of the AccidentsSummary dataset that we used in the previous notebook, Basics 2. This dataset, Accidents, contains two additional tables, Place and User, and is organized in the following relational snowflake schema:
Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User
|
| -- 1:1 -- Place
Note that the target variable is Gravity.
To train a classifier for this setup, this schema must be encoded in the dictionary file. Let’s check the contents of the Accidents dictionary file:
accidents_dataset_dir = path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = path.join(accidents_dataset_dir, "Accidents.kdic")
print("")
print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)
Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity = IfC(
G(TableSum(Vehicles, TableCount(TableSelection(Users, EQc(Gravity, "Death" ...
"Lethal", "NonLethal");
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Categorical GPSCode;
Numerical Latitude;
Numerical Longitude;
Entity(Place) Place;
Table(Vehicle) Vehicles;
};
Dictionary Place(AccidentId)
{
Categorical AccidentId;
Categorical RoadType;
Categorical RoadNumber;
Categorical RoadSecNumber;
Categorical RoadLetter;
Categorical Circulation;
Numerical LaneNumber;
Categorical SpecialLane;
Categorical Slope;
Categorical RoadMarkerId;
Numerical RoadMarkerDistance;
Categorical Layout;
Numerical StripWidth;
Numerical LaneWidth;
Categorical SurfaceCondition;
Categorical Infrastructure;
...
Note the following differences from the dictionary of the AccidentsSummary dataset:
- The schema of the main table contains one extra special variable, defined with the statement Entity(Place) Place, which indicates a 1:1 relationship between the Accident and Place tables.
- The main table Accident and the entity Place have the same key, AccidentId. The table Vehicle and its child table User have two keys, AccidentId and VehicleId.
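The key structure described above can also be read programmatically from the dictionary headers. The following is a minimal sketch, not part of the Khiops API: dictionary_keys is a hypothetical helper, and the sample text is an abbreviated stand-in for Accidents.kdic whose Vehicle and User headers are written from the description above rather than copied from the file.

```python
import re

# Abbreviated stand-in for Accidents.kdic: only a few variables are kept.
# The Vehicle and User headers are assumed from the key description in the text.
SAMPLE_KDIC = """\
Root Dictionary Accident(AccidentId)
{
    Categorical AccidentId;
    Entity(Place) Place;
    Table(Vehicle) Vehicles;
};
Dictionary Place(AccidentId)
{
    Categorical AccidentId;
};
Dictionary Vehicle(AccidentId, VehicleId)
{
    Categorical AccidentId;
    Categorical VehicleId;
    Table(User) Users;
};
Dictionary User(AccidentId, VehicleId)
{
    Categorical AccidentId;
    Categorical VehicleId;
};
"""

# Matches "Dictionary Name(key1, key2)" with an optional leading "Root".
HEADER_PATTERN = re.compile(
    r"^\s*(?:Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)", re.MULTILINE
)


def dictionary_keys(kdic_text):
    """Map each dictionary name to the list of its key variables."""
    return {
        name: [key.strip() for key in keys.split(",") if key.strip()]
        for name, keys in HEADER_PATTERN.findall(kdic_text)
    }


keys_by_dictionary = dictionary_keys(SAMPLE_KDIC)
for name, keys in keys_by_dictionary.items():
    print(f"{name}: {keys}")
```

On the real file, the same helper could be applied to the text read from accidents_kdic.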
Now let’s store the locations of the tables and peek at their contents:
print("")
accidents_data_file = path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table location: {accidents_data_file}")
print("")
peek(accidents_data_file)
print("")
vehicles_data_file = path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table location: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
print("")
places_data_file = path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table location: {places_data_file}")
print("")
peek(places_data_file)
print("")
users_data_file = path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table location: {users_data_file}")
print("")
peek(users_data_file)
Accidents data table location: /github/home/khiops_data/samples/Accidents/Accidents.txt
AccidentId Date Hour Light Department Commune InAgglomeration IntersectionType W ...
201800000001 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-Beh ...
201800000002 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollisio ...
201800000003 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision ...
201800000004 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2V ...
201800000005 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Veh ...
201800000006 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection Light ...
201800000007 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection N ...
201800000008 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection N ...
201800000009 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other ...
201800000010 2018-03-19 15:30:00 Daylight 590 051 Yes X-type Normal 2Vehicles-Be ...
...
Vehicles data table location: /github/home/khiops_data/samples/Accidents/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObst ...
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDi ...
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
201800000005 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront Park
...
Places data table location: /github/home/khiops_data/samples/Accidents/Places.txt
AccidentId RoadType RoadNumber RoadSecNumber RoadLetter Circulation LaneNumber S ...
201800000001 Departamental 41 C TwoWay 2 0 Flat RightCurve Normal Unknown L ...
201800000002 Communal 41 D TwoWay 2 0 Flat LeftCurve Normal Unknown Lane 00 ...
201800000003 Departamental 39 D TwoWay 2 0 Flat Straight Normal Unknown Lan ...
201800000004 Departamental 39 TwoWay 2 0 Flat Straight Normal Unknown Lane ...
201800000005 Communal OneWay 1 0 Flat Straight Normal Unknown Lane 00
201800000006 Departamental 39 D Unknown 2 0 Uphill LeftCurve Wet Unknown Sh ...
201800000007 Departamental 41 D TwoWay 2 0 Flat 16 500 Straight Normal Unknow ...
201800000008 Communal - TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000009 Departamental 141 D TwoWay 2 0 Flat Straight Normal Unknown Sh ...
201800000010 Departamental 641 TwoWay 2 Bike Flat 1 670 Straight Normal Unkn ...
...
Users data table location: /github/home/khiops_data/samples/Accidents/Users.txt
AccidentId VehicleId Seat Category Gravity Gender TripReason SafetyDevice Safety ...
201800000001 A01 1 Driver Unscathed Male Leisure SeatBelt Yes None None Unknown ...
201800000001 B01 1 Driver InjuredAndHospitalized Male None SeatBelt Yes None Non ...
201800000002 A01 1 Driver Unscathed Male None SeatBelt Yes None None Unknown 194 ...
201800000002 A01 Pedestrian MildlyInjured Male None Helmet OnLane<=OnSidewalk0 ...
201800000003 A01 1 Driver InjuredAndHospitalized Male Leisure Helmet Yes None No ...
201800000003 C01 1 Driver Unscathed Male None ChildrenDevice None None Unknown ...
201800000004 A01 1 Driver Unscathed Male Leisure SeatBelt Yes None None Unknown ...
201800000004 B01 1 Driver InjuredAndHospitalized Male Leisure Helmet None None ...
201800000005 A01 1 Driver MildlyInjured Male Leisure Helmet Yes None None Unknow ...
201800000005 B01 1 Driver Unscathed Male Leisure SeatBelt Yes None None Unknown ...
...
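The shared keys are what tie the records of the different files together. As an illustration only (this is not how Khiops reads the data internally), the sketch below groups the first Vehicles.txt records shown above by AccidentId. Only the two key columns are reproduced, the fields are assumed to be tab-separated, and group_by_key is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Key columns of the first Vehicles.txt records displayed above.
# Tab-separated fields are an assumption about the file format.
VEHICLES_SAMPLE = (
    "AccidentId\tVehicleId\n"
    "201800000001\tA01\n"
    "201800000001\tB01\n"
    "201800000002\tA01\n"
    "201800000003\tA01\n"
    "201800000003\tB01\n"
    "201800000003\tC01\n"
)


def group_by_key(table_text, key, value):
    """Collect the values of one column for each distinct value of the key column."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[key]].append(row[value])
    return dict(groups)


vehicles_by_accident = group_by_key(VEHICLES_SAMPLE, "AccidentId", "VehicleId")
for accident_id, vehicle_ids in vehicles_by_accident.items():
    print(accident_id, vehicle_ids)
```

Each accident maps to one or more vehicles, which is the 1:n relationship of the schema. The same grouping on the pair (AccidentId, VehicleId) would relate Users records to their vehicle.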
Train a classifier for the Accidents database with 1000 variables
The call to train_predictor is exactly the same as in the exercise of the previous notebook, Basics 2. The only difference is that the additional_data_tables dictionary, which contains the paths of the additional tables, is extended with two new data paths:
- The data path of the entity Place is Accident`Place.
- The data path of the table User is Accident`Vehicles`Users.
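These data paths follow directly from the schema: each one is the root dictionary name followed by the names of the table or entity variables traversed, joined by backticks. Here is a minimal sketch of that rule, assuming the backtick convention used in this notebook (other Khiops versions may use a different separator). SCHEMA and data_paths are hypothetical names introduced for illustration.

```python
# Sub-table variables reachable from the root dictionary Accident.
# Place and Vehicles are declared in Accidents.kdic; the Users variable of
# Vehicle is taken from the data path given in the text.
SCHEMA = {
    "Place": {},
    "Vehicles": {"Users": {}},
}


def data_paths(root_name, sub_tables, separator="`"):
    """List the data path of every sub-table, parents before children."""
    paths = []
    for variable_name, children in sub_tables.items():
        path = f"{root_name}{separator}{variable_name}"
        paths.append(path)
        # A child's data path extends the data path of its parent table.
        paths.extend(data_paths(path, children, separator))
    return paths


accident_paths = data_paths("Accident", SCHEMA)
print(accident_paths)
```

The resulting strings are the keys expected in additional_data_tables, each mapped to the file path of the corresponding table.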
As before, we’ll ask Khiops to create 1000 additional features with its multi-table AutoML mode.
Do not forget:
- The target variable is Gravity
- Set max_trees=0
With these considerations, let’s now train the classifier:
accidents_results_dir = path.join("exercises", "Accidents")
accidents_report, accidents_model_kdic = kh.train_predictor(
accidents_kdic,
dictionary_name="Accident",
data_table_path=accidents_data_file,
target_variable="Gravity",
results_dir=accidents_results_dir,
additional_data_tables={
"Accident`Vehicles": vehicles_data_file,
"Accident`Place": places_data_file,
"Accident`Vehicles`Users": users_data_file,
},
max_constructed_variables=1000,
max_trees=0,
)
print(f"Accidents report file located at: {accidents_report}")
print(f"Accidents modeling dictionary file located at: {accidents_model_kdic}")
Accidents report file located at: exercises/Accidents/AllReports.khj
Accidents modeling dictionary file located at: exercises/Accidents/Modeling.kdic
Take a look at the report
Which variables best predict the gravity of an accident?
# explorer_open(accidents_report)
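As an alternative to the visualization tool, the report can be inspected in Python, since AllReports.khj is a JSON file. The sketch below ranks variables by their level. The key names preparationReport, variablesStatistics, name and level are assumptions about the report layout that should be checked against your own file, top_variables is a hypothetical helper, and the small inline report is a synthetic placeholder, not actual results.

```python
import json


def top_variables(report, n=10):
    """Return the n (name, level) pairs with the highest level."""
    # Assumed layout: report["preparationReport"]["variablesStatistics"] is a
    # list of objects that each carry a "name" and a "level".
    stats = report.get("preparationReport", {}).get("variablesStatistics", [])
    ranked = sorted(stats, key=lambda s: s.get("level", 0.0), reverse=True)
    return [(s["name"], s.get("level", 0.0)) for s in ranked[:n]]


# Synthetic placeholder with made-up names and levels, only to exercise the helper.
SAMPLE_REPORT = json.loads(
    """
    {"preparationReport": {"variablesStatistics": [
        {"name": "VariableA", "level": 0.01},
        {"name": "VariableB", "level": 0.05},
        {"name": "VariableC", "level": 0.0}
    ]}}
    """
)

top_two = top_variables(SAMPLE_REPORT, n=2)
print(top_two)

# With the real report produced by the training cell above:
# with open(accidents_report, encoding="utf-8") as report_file:
#     print(top_variables(json.load(report_file)))
```

Variables with a level of zero carry no information about the target under this measure, so the top of the ranking is where to look for the answer to the question above.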