Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset
In this notebook, we learn how to train a classifier on a more complex multi-table dataset, where a secondary table is itself the parent of another table (i.e. a snowflake schema). We highly recommend going through the Basics 1 and Basics 2 lessons first if you are not familiar with Khiops.
Make sure you have installed Khiops and Khiops Visualization.
We start by importing Khiops, checking its installation and defining some helper functions:
import os
import platform
import subprocess
from khiops import core as kh
# Define helper functions
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")
# If there are any issues, you may check the Khiops status with the following command
# kh.get_runner().print_status()
Training a Multi-Table Classifier
We’ll train a multi-table classifier on an extension of the dataset AccidentsSummary that we used in the previous notebook, Sklearn Basics 2. This dataset, Accidents, contains two additional tables, Place and User, and is organized in the following relational snowflake schema:
Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User
|
| -- 1:1 -- Place
Note that the target variable is Gravity.
To train the classifier for this setup, this schema must be encoded in the dictionary file. Let’s check the contents of the Accidents dictionary file:
accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = os.path.join(accidents_dataset_dir, "Accidents.kdic")
print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=45)
Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
Date Date;
Time Hour;
Categorical Light;
Categorical Department;
Categorical Commune;
Categorical InAgglomeration;
Categorical IntersectionType;
Categorical Weather;
Categorical CollisionType;
Categorical PostalAddress;
Categorical GPSCode;
Numerical Latitude;
Numerical Longitude;
Entity(Place) Place;
Table(Vehicle) Vehicles;
};
Dictionary Place(AccidentId)
{
Categorical AccidentId;
Categorical RoadType;
Categorical RoadNumber;
Categorical RoadSecNumber;
Categorical RoadLetter;
Categorical Circulation;
Numerical LaneNumber;
Categorical SpecialLane;
Categorical Slope;
Categorical RoadMarkerId;
Numerical RoadMarkerDistance;
Categorical Layout;
Numerical StripWidth;
Numerical LaneWidth;
Categorical SurfaceCondition;
Categorical Infrastructure;
Categorical Localization;
Categorical SchoolNear;
};
Dictionary Vehicle(AccidentId, VehicleId)
Note the following differences in comparison with the dictionary of the AccidentsSummary dataset:
- The schema for the main table contains one extra special variable, defined with the statement Entity(Place) Place, which indicates a 1:1 relationship between the Accident and Place tables.
- The main table Accident and the entity Place have the same key, AccidentId. The table Vehicle and its child table User have two keys, AccidentId and VehicleId.
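To make the keys and relation variables easier to spot, the dictionary text can be scanned with a couple of regular expressions. This is only a rough sketch: `summarize_kdic` is a hypothetical helper, not part of Khiops, and it assumes the simple layout shown above (it ignores comments, derivation rules and other parts of the dictionary syntax).

```python
import re

# Matches headers such as "Root Dictionary Accident(AccidentId)" or
# "Dictionary Vehicle(AccidentId, VehicleId)"
HEADER_RE = re.compile(r"^\s*(?:Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)")
# Matches relation variables such as "Entity(Place) Place;" or
# "Table(Vehicle) Vehicles;"
RELATION_RE = re.compile(r"^\s*(Entity|Table)\((\w+)\)\s+(\w+)\s*;")


def summarize_kdic(text):
    """Returns {dictionary: {"keys": [...], "relations": [(kind, target, variable)]}}"""
    summary = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            keys = [key.strip() for key in header.group(2).split(",") if key.strip()]
            summary[current] = {"keys": keys, "relations": []}
            continue
        relation = RELATION_RE.match(line)
        if relation and current is not None:
            summary[current]["relations"].append(relation.groups())
    return summary


# Excerpt of the dictionary printed above (most variables omitted)
sample = """
Root Dictionary Accident(AccidentId)
{
    Categorical AccidentId;
    Entity(Place) Place;
    Table(Vehicle) Vehicles;
};
Dictionary Place(AccidentId)
{
    Categorical AccidentId;
};
Dictionary Vehicle(AccidentId, VehicleId)
"""
kdic_summary = summarize_kdic(sample)
for name, info in kdic_summary.items():
    print(name, info["keys"], info["relations"])
```

On the real file, the same function could be applied to the text read from accidents_kdic.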
Now let’s store the location of the tables and peek at their contents:
accidents_data_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table: {accidents_data_file}")
print("")
peek(accidents_data_file)
vehicles_data_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)
places_data_file = os.path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table: {places_data_file}")
print("")
peek(places_data_file)
users_data_file = os.path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table: {users_data_file}")
print("")
peek(users_data_file)
Accidents data table: /github/home/khiops_data/samples/Accidents/Accidents.txt
AccidentId Gravity Date Hour Light Department Commune InAgglomeration IntersectionType Weather CollisionType PostalAddress GPSCode Latitude Longitude
201800000001 NonLethal 2018-01-24 15:05:00 Daylight 590 005 No Y-type Normal 2Vehicles-BehindVehicles-Frontal route des Ansereuilles M 50.55737 2.55737
201800000002 NonLethal 2018-02-12 10:15:00 Daylight 590 011 Yes Square VeryGood NoCollision Place du général de Gaul M 50.52936 2.52936
201800000003 NonLethal 2018-03-04 11:35:00 Daylight 590 477 Yes T-type Normal NoCollision Rue nationale M 50.51243 2.51243
201800000004 NonLethal 2018-05-05 17:35:00 Daylight 590 052 Yes NoIntersection VeryGood 2Vehicles-Side 30 rue Jules Guesde M 50.51974 2.51974
201800000005 NonLethal 2018-06-26 16:05:00 Daylight 590 477 Yes NoIntersection Normal 2Vehicles-Side 72 rue Victor Hugo M 50.51607 2.51607
201800000006 NonLethal 2018-09-23 06:30:00 TwilightOrDawn 590 052 Yes NoIntersection LightRain Other D39 M 50.52132 2.52132
201800000007 NonLethal 2018-09-26 00:40:00 NightStreelightsOn 590 133 Yes NoIntersection Normal Other 4 route de camphin M 50.52211 2.52211
201800000008 Lethal 2018-11-30 17:15:00 NightStreelightsOn 590 011 Yes NoIntersection Normal Other rue saint exupéry M 50.53146 2.53146
201800000009 NonLethal 2018-02-18 15:57:00 Daylight 590 550 No NoIntersection Normal Other rue de l'égalité M 50.53707 2.53707
Vehicles data table: /github/home/khiops_data/samples/Accidents/Vehicles.txt
AccidentId VehicleId Direction Category PassengerNumber FixedObstacle MobileObstacle ImpactPoint Maneuver
201800000001 A01 Unknown Car<=3.5T 0 None Vehicle RightFront TurnToLeft
201800000001 B01 Unknown Car<=3.5T 0 None Vehicle LeftFront NoDirectionChange
201800000002 A01 Unknown Car<=3.5T 0 None Pedestrian None NoDirectionChange
201800000003 A01 Unknown Motorbike>125cm3 0 StationaryVehicle Vehicle Front NoDirectionChange
201800000003 B01 Unknown Car<=3.5T 0 None Vehicle LeftSide TurnToLeft
201800000003 C01 Unknown Car<=3.5T 0 None None RightSide Parked
201800000004 A01 Unknown Car<=3.5T 0 None Other RightFront Avoidance
201800000004 B01 Unknown Bicycle 0 None Vehicle LeftSide None
201800000005 A01 Unknown Moped 0 None Vehicle RightFront PassLeft
Places data table: /github/home/khiops_data/samples/Accidents/Places.txt
AccidentId RoadType RoadNumber RoadSecNumber RoadLetter Circulation LaneNumber SpecialLane Slope RoadMarkerId RoadMarkerDistance Layout StripWidth LaneWidth SurfaceCondition Infrastructure Localization SchoolNear
201800000001 Departamental 41 C TwoWay 2 0 Flat RightCurve Normal Unknown Lane 00
201800000002 Communal 41 D TwoWay 2 0 Flat LeftCurve Normal Unknown Lane 00
201800000003 Departamental 39 D TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000004 Departamental 39 TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000005 Communal OneWay 1 0 Flat Straight Normal Unknown Lane 00
201800000006 Departamental 39 D Unknown 2 0 Uphill LeftCurve Wet Unknown Shoulder 00
201800000007 Departamental 41 D TwoWay 2 0 Flat 16 500 Straight Normal Unknown Shoulder 00
201800000008 Communal - TwoWay 2 0 Flat Straight Normal Unknown Lane 00
201800000009 Departamental 141 D TwoWay 2 0 Flat Straight Normal Unknown Shoulder 00
Users data table: /github/home/khiops_data/samples/Accidents/Users.txt
AccidentId VehicleId Seat Category Gender TripReason SafetyDevice SafetyDeviceUsed PedestrianLocation PedestrianAction PedestrianCompany BirthYear
201800000001 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1960
201800000001 B01 1 Driver Male None SeatBelt Yes None None Unknown 1928
201800000002 A01 1 Driver Male None SeatBelt Yes None None Unknown 1947
201800000002 A01 Pedestrian Male None Helmet OnLane<=OnSidewalk0mCrossing Crossing Alone 1959
201800000003 A01 1 Driver Male Leisure Helmet Yes None None Unknown 1987
201800000003 C01 1 Driver Male None ChildrenDevice None None Unknown 1977
201800000004 A01 1 Driver Male Leisure SeatBelt Yes None None Unknown 1982
201800000004 B01 1 Driver Male Leisure Helmet None None Unknown 2013
201800000005 A01 1 Driver Male Leisure Helmet Yes None None Unknown 2001
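Since the tables are linked through their keys, it can be useful to check that every child record has a parent before training. The sketch below assumes the tables are tab-separated with a header line; `read_keys` and `find_orphans` are hypothetical helpers, and the in-memory rows are illustrative stand-ins rather than actual records.

```python
import csv
import io


def read_keys(table_file, key_fields):
    """Returns the set of key tuples found in a tab-separated table with a header"""
    reader = csv.DictReader(table_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {tuple(row[field] for field in key_fields) for row in reader}


def find_orphans(parent_file, child_file, key_fields):
    """Returns the child keys that have no matching record in the parent table"""
    parent_keys = read_keys(parent_file, key_fields)
    child_keys = read_keys(child_file, key_fields)
    return sorted(child_keys - parent_keys)


# Tiny in-memory stand-ins for Vehicles.txt and Users.txt (illustrative only)
vehicles = io.StringIO(
    "AccidentId\tVehicleId\tCategory\n"
    "201800000001\tA01\tCar<=3.5T\n"
    "201800000001\tB01\tCar<=3.5T\n"
)
users = io.StringIO(
    "AccidentId\tVehicleId\tCategory\n"
    "201800000001\tA01\tDriver\n"
    "201800000001\tC01\tDriver\n"
)
orphans = find_orphans(vehicles, users, ["AccidentId", "VehicleId"])
print(orphans)  # → [('201800000001', 'C01')]
```

With the real files, the two arguments would be file objects opened on vehicles_data_file and users_data_file, using the same encoding options as peek.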
Train a classifier for the Accidents database with 1000 variables
The call to train_predictor is exactly the same as in the exercise of the previous notebook, Sklearn Basics 2. The only difference is that the additional_data_tables dictionary, which contains the paths of the additional tables, is extended with two new data paths:
- The path of the entity Place is Accident`Place.
- The path of the table User is Accident`Vehicles`Users.
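The data paths above appear to follow one pattern: the root dictionary name, followed by the names of the relation variables that lead to the table, joined with backticks. The sketch below derives them from a small description of the schema. `build_data_paths` is a hypothetical helper, and the pattern itself is inferred from the three paths used in this notebook rather than taken from a formal specification.

```python
def build_data_paths(schema, root):
    """Returns the data path of every secondary table reachable from the root"""
    paths = {}

    def visit(dictionary_name, prefix):
        # Follow each relation variable and extend the path with its name
        for variable_name, child_dictionary in schema.get(dictionary_name, {}).items():
            path = f"{prefix}`{variable_name}"
            paths[path] = child_dictionary
            visit(child_dictionary, path)

    visit(root, root)
    return paths


# Relation variables per dictionary: {dictionary: {variable name: child dictionary}}
schema = {
    "Accident": {"Place": "Place", "Vehicles": "Vehicle"},
    "Vehicle": {"Users": "User"},
}
data_paths = build_data_paths(schema, "Accident")
for path, dictionary_name in data_paths.items():
    print(path, "->", dictionary_name)
```

Note that the path uses the variable names (Vehicles, Users), not the dictionary names (Vehicle, User).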
As before, we’ll ask Khiops to create 1000 additional features with its multi-table AutoML mode.
Do not forget:
- The target variable is Gravity
- Set max_trees=0
With these considerations, let’s now train the classifier:
accidents_results_dir = os.path.join("exercises", "Accidents")
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={
        "Accident`Vehicles": vehicles_data_file,
        "Accident`Place": places_data_file,
        "Accident`Vehicles`Users": users_data_file,
    },
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"Accidents report file: {accidents_report}")
print(f"Accidents modeling dictionary file: {accidents_model_kdic}")
Accidents report file: exercises/Accidents/AllReports.khj
Accidents modeling dictionary file: exercises/Accidents/Modeling.kdic
Take a look at the report
Which variables best predict the gravity of an accident?
# To visualize the report, uncomment the line below
# kh.visualize_report(accidents_report)
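If the visualization tool is not at hand, the question above can also be approached by reading the report directly, since the .khj file is a JSON document. The sketch below is hedged: it assumes the variables are listed under a preparationReport entry, in a variablesStatistics list whose items carry name and level fields, so check this layout against your own file. `top_variables` and `load_report` are hypothetical helpers, and the miniature report only illustrates the assumed shape.

```python
import json


def top_variables(report, n=10):
    """Returns the n (name, level) pairs with the highest level in a report dict"""
    # Assumption: variables are listed under preparationReport -> variablesStatistics
    statistics = report.get("preparationReport", {}).get("variablesStatistics", [])
    ranked = sorted(statistics, key=lambda stats: stats.get("level", 0), reverse=True)
    return [(stats["name"], stats.get("level", 0)) for stats in ranked[:n]]


def load_report(report_path):
    """Loads a Khiops JSON report file (.khj) into a dict"""
    with open(report_path, encoding="utf8", errors="replace") as report_file:
        return json.load(report_file)


# Hypothetical miniature report, only to show the assumed shape
toy_report = {
    "preparationReport": {
        "variablesStatistics": [
            {"name": "VariableA", "level": 0.01},
            {"name": "VariableB", "level": 0.05},
        ]
    }
}
print(top_variables(toy_report, n=1))  # → [('VariableB', 0.05)]
```

On the trained model, this would be called as top_variables(load_report(accidents_report)).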