Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset

In this notebook, we learn how to train a classifier on more complex multi-table data, where a secondary table is itself the parent of another table (i.e., a snowflake schema). We highly recommend going through the Basics 1 and Basics 2 lessons first if you are not familiar with Khiops.

We start by importing khiops and some helper functions:

from os import path

from khiops import core as kh
from helper_functions import explorer_open, peek

Training a Multi-Table Classifier

We’ll train a multi-table classifier on an extension of the AccidentsSummary dataset that we used in the previous notebook, Sklearn Basics 2. This dataset, Accidents, contains two additional tables, Place and User, and is organized in the following relational snowflake schema:

Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User
|
| -- 1:1 -- Place

Note that the target variable is Gravity.

To train the classifier for this setup, the schema must be encoded in the dictionary file. Let’s check the contents of the Accidents dictionary file:

accidents_dataset_dir = path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = path.join(accidents_dataset_dir, "Accidents.kdic")

print("")
print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)
Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic

Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical       Gravity = IfC(
      G(TableSum(Vehicles, TableCount(TableSelection(Users, EQc(Gravity, "Death" ...
      "Lethal", "NonLethal");
  Date Date;
  Time Hour;
  Categorical Light;
  Categorical Department;
  Categorical Commune;
  Categorical InAgglomeration;
  Categorical IntersectionType;
  Categorical Weather;
  Categorical CollisionType;
  Categorical PostalAddress;
  Categorical GPSCode;
  Numerical Latitude;
  Numerical Longitude;
  Entity(Place) Place;
  Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
  Categorical AccidentId;
  Categorical RoadType;
  Categorical RoadNumber;
  Categorical RoadSecNumber;
  Categorical RoadLetter;
  Categorical Circulation;
  Numerical LaneNumber;
  Categorical SpecialLane;
  Categorical Slope;
  Categorical RoadMarkerId;
  Numerical RoadMarkerDistance;
  Categorical Layout;
  Numerical StripWidth;
  Numerical LaneWidth;
  Categorical SurfaceCondition;
  Categorical Infrastructure;
...

Note the following differences compared with the dictionary of the AccidentsSummary dataset:

  • The schema for the main table contains one extra special variable, defined with the statement Entity(Place) Place, which indicates a 1:1 relationship between the Accident and Place tables.

  • The main table Accident and the entity Place share the same key, AccidentId. The Vehicle table and its child table User have a key made of two fields, AccidentId and VehicleId.

Now let’s store the locations of the tables and peek at their contents:

print("")
accidents_data_file = path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table location: {accidents_data_file}")
print("")
peek(accidents_data_file)

print("")
vehicles_data_file = path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table location: {vehicles_data_file}")
print("")
peek(vehicles_data_file)

print("")
places_data_file = path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table location: {places_data_file}")
print("")
peek(places_data_file)

print("")
users_data_file = path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table location: {users_data_file}")
print("")
peek(users_data_file)
Accidents data table location: /github/home/khiops_data/samples/Accidents/Accidents.txt

AccidentId  Date    Hour    Light   Department      Commune InAgglomeration IntersectionType        W ...
201800000001        2018-01-24      15:05:00        Daylight        590     005     No      Y-type  Normal  2Vehicles-Beh ...
201800000002        2018-02-12      10:15:00        Daylight        590     011     Yes     Square  VeryGood        NoCollisio ...
201800000003        2018-03-04      11:35:00        Daylight        590     477     Yes     T-type  Normal  NoCollision      ...
201800000004        2018-05-05      17:35:00        Daylight        590     052     Yes     NoIntersection  VeryGood        2V ...
201800000005        2018-06-26      16:05:00        Daylight        590     477     Yes     NoIntersection  Normal  2Veh ...
201800000006        2018-09-23      06:30:00        TwilightOrDawn  590     052     Yes     NoIntersection  Light ...
201800000007        2018-09-26      00:40:00        NightStreelightsOn      590     133     Yes     NoIntersection  N ...
201800000008        2018-11-30      17:15:00        NightStreelightsOn      590     011     Yes     NoIntersection  N ...
201800000009        2018-02-18      15:57:00        Daylight        590     550     No      NoIntersection  Normal  Other ...
201800000010        2018-03-19      15:30:00        Daylight        590     051     Yes     X-type  Normal  2Vehicles-Be ...
...

Vehicles data table location: /github/home/khiops_data/samples/Accidents/Vehicles.txt

AccidentId  VehicleId       Direction       Category        PassengerNumber FixedObstacle   MobileObst ...
201800000001        A01     Unknown Car<=3.5T       0       None    Vehicle RightFront      TurnToLeft
201800000001        B01     Unknown Car<=3.5T       0       None    Vehicle LeftFront       NoDirectionChange
201800000002        A01     Unknown Car<=3.5T       0       None    Pedestrian      None    NoDirectionChange
201800000003        A01     Unknown Motorbike>125cm3        0       StationaryVehicle       Vehicle Front   NoDi ...
201800000003        B01     Unknown Car<=3.5T       0       None    Vehicle LeftSide        TurnToLeft
201800000003        C01     Unknown Car<=3.5T       0       None    None    RightSide       Parked
201800000004        A01     Unknown Car<=3.5T       0       None    Other   RightFront      Avoidance
201800000004        B01     Unknown Bicycle 0       None    Vehicle LeftSide        None
201800000005        A01     Unknown Moped   0       None    Vehicle RightFront      PassLeft
201800000005        B01     Unknown Car<=3.5T       0       None    Vehicle LeftFront       Park
...

Places data table location: /github/home/khiops_data/samples/Accidents/Places.txt

AccidentId  RoadType        RoadNumber      RoadSecNumber   RoadLetter      Circulation     LaneNumber      S ...
201800000001        Departamental   41              C       TwoWay  2       0       Flat                    RightCurve                      Normal  Unknown L ...
201800000002        Communal        41              D       TwoWay  2       0       Flat                    LeftCurve                       Normal  Unknown Lane    00 ...
201800000003        Departamental   39              D       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lan ...
201800000004        Departamental   39                      TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lane ...
201800000005        Communal                                OneWay  1       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000006        Departamental   39              D       Unknown 2       0       Uphill                  LeftCurve                       Wet     Unknown Sh ...
201800000007        Departamental   41              D       TwoWay  2       0       Flat    16      500     Straight                        Normal  Unknow ...
201800000008        Communal        -                       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000009        Departamental   141             D       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Sh ...
201800000010        Departamental   641                     TwoWay  2       Bike    Flat    1       670     Straight                        Normal  Unkn ...
...

Users data table location: /github/home/khiops_data/samples/Accidents/Users.txt

AccidentId  VehicleId       Seat    Category        Gravity Gender  TripReason      SafetyDevice    Safety ...
201800000001        A01     1       Driver  Unscathed       Male    Leisure SeatBelt        Yes     None    None    Unknown  ...
201800000001        B01     1       Driver  InjuredAndHospitalized  Male    None    SeatBelt        Yes     None    Non ...
201800000002        A01     1       Driver  Unscathed       Male    None    SeatBelt        Yes     None    None    Unknown 194 ...
201800000002        A01             Pedestrian      MildlyInjured   Male    None    Helmet          OnLane<=OnSidewalk0 ...
201800000003        A01     1       Driver  InjuredAndHospitalized  Male    Leisure Helmet  Yes     None    No ...
201800000003        C01     1       Driver  Unscathed       Male    None    ChildrenDevice          None    None    Unknown  ...
201800000004        A01     1       Driver  Unscathed       Male    Leisure SeatBelt        Yes     None    None    Unknown  ...
201800000004        B01     1       Driver  InjuredAndHospitalized  Male    Leisure Helmet          None    None     ...
201800000005        A01     1       Driver  MildlyInjured   Male    Leisure Helmet  Yes     None    None    Unknow ...
201800000005        B01     1       Driver  Unscathed       Male    Leisure SeatBelt        Yes     None    None    Unknown  ...
...
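The key structure described earlier can be illustrated with a few rows from the samples above. The sketch below is only an illustration in plain Python, not how Khiops works internally: the helper names `nest_by_keys` and `count_users` are hypothetical, and the rows are a hand-copied subset of the Vehicles.txt and Users.txt listings, limited to the key fields and the Gravity column. It attaches users to vehicles through the two-field key and vehicles to accidents through AccidentId alone, then computes a simple per-accident count, similar in spirit to the count-based rule visible in the (truncated) Gravity definition of the dictionary.

```python
from collections import defaultdict

# (AccidentId, VehicleId) pairs copied from the Vehicles.txt sample above.
vehicles = [
    ("201800000001", "A01"),
    ("201800000001", "B01"),
    ("201800000003", "A01"),
    ("201800000003", "B01"),
    ("201800000003", "C01"),
]

# (AccidentId, VehicleId, Gravity) triples copied from the Users.txt sample above.
users = [
    ("201800000001", "A01", "Unscathed"),
    ("201800000001", "B01", "InjuredAndHospitalized"),
    ("201800000003", "A01", "InjuredAndHospitalized"),
    ("201800000003", "C01", "Unscathed"),
]


def nest_by_keys(vehicles, users):
    """Attach each user to its vehicle, and each vehicle to its accident."""
    users_by_vehicle = defaultdict(list)
    for accident_id, vehicle_id, gravity in users:
        # Users are matched to vehicles on the full two-field key.
        users_by_vehicle[(accident_id, vehicle_id)].append(gravity)

    accidents = {}
    for accident_id, vehicle_id in vehicles:
        # Vehicles are matched to accidents on the first key field only.
        vehicle_users = users_by_vehicle.get((accident_id, vehicle_id), [])
        accidents.setdefault(accident_id, {})[vehicle_id] = vehicle_users
    return accidents


def count_users(accidents, accident_id, gravity):
    """Count the users of one accident that have the given Gravity value."""
    return sum(
        user_gravity == gravity
        for vehicle_users in accidents[accident_id].values()
        for user_gravity in vehicle_users
    )


nested = nest_by_keys(vehicles, users)
print(nested["201800000003"])
print(count_users(nested, "201800000001", "Unscathed"))
```

Because a user row carries both AccidentId and VehicleId, it can be placed in the hierarchy without any extra lookup, which is why the child tables repeat the parent's key fields.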

Train a classifier for the Accidents database with 1000 variables

The call to train_predictor is exactly the same as in the exercise of the previous notebook, Sklearn Basics 2. The only difference is that the additional_data_tables dictionary, which maps the data path of each additional table to its file, gains two new entries:

  • The path of the entity Place is Accident`Place.

  • The path of the table User is Accident`Vehicles`Users.
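To make the naming pattern explicit, here is a minimal sketch that derives these data paths from a nested description of the schema. The `SCHEMA` dictionary and the `data_paths` helper are hypothetical and not part of Khiops; the names used are those of the table variables in the dictionary file (Place, Vehicles, Users), joined with a backtick as in the paths listed here.

```python
# Nested description of the schema: each table variable maps to its own
# sub-tables. This structure is an assumption made for the illustration.
SCHEMA = {
    "Accident": {
        "Place": {},
        "Vehicles": {"Users": {}},
    }
}


def data_paths(schema, sep="`"):
    """List the data path of every secondary table, parents before children."""
    paths = []

    def walk(prefix, children):
        for name, sub_tables in children.items():
            data_path = f"{prefix}{sep}{name}"
            paths.append(data_path)
            walk(data_path, sub_tables)

    for root_name, children in schema.items():
        walk(root_name, children)
    return paths


for data_path in data_paths(SCHEMA):
    print(data_path)
```

Each path starts at the root dictionary and follows the table variables down to the target table, so a table nested two levels deep gets a path with three components.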

As before, we’ll ask Khiops to create 1000 additional features with its multi-table AutoML mode.

Do not forget:

  • The target variable is Gravity.

  • Set max_trees=0.

With these considerations, let’s now train the classifier:

accidents_results_dir = path.join("exercises", "Accidents")
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={
        "Accident`Vehicles": vehicles_data_file,
        "Accident`Place": places_data_file,
        "Accident`Vehicles`Users": users_data_file,
    },
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"Accidents report file located at: {accidents_report}")
print(f"Accidents modeling dictionary file located at: {accidents_model_kdic}")
Accidents report file located at: exercises/Accidents/AllReports.khj
Accidents modeling dictionary file located at: exercises/Accidents/Modeling.kdic
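If the training call fails because a table cannot be found, a quick check of the additional_data_tables mapping can help locate the problem. The helper below is a hypothetical convenience written for this notebook, not a Khiops function; it only assumes that the mapping goes from data paths to local file paths, as in the call above.

```python
from os import path


def missing_tables(data_tables):
    """Return the data paths whose file does not exist, sorted for stable output."""
    return sorted(
        data_path
        for data_path, file_path in data_tables.items()
        if not path.isfile(file_path)
    )
```

In this notebook it could be called with the same dictionary that is passed as additional_data_tables; an empty list means every table file was found.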

Take a look at the report

Which variables best predict the gravity of an accident?

# explorer_open(accidents_report)
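When a graphical viewer is not available, the report can also be inspected programmatically. The sketch below assumes that the .khj report is a JSON file with an object at its top level; the helper name `report_sections` is hypothetical, and the function deliberately makes no assumption about which sections the report contains.

```python
import json


def report_sections(report_path):
    """Return the top-level keys of a JSON report file, in file order."""
    with open(report_path, encoding="utf-8") as report_file:
        report = json.load(report_file)
    return list(report.keys())
```

In this notebook it would be called as report_sections(accidents_report), and the returned names can then be used to drill into the loaded JSON.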