Core Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset

In this notebook, we learn how to train a classifier on more complex multi-table data, where a secondary table is itself the parent of another table (i.e. a snowflake schema). It is highly recommended to go through the Basics 1 and Basics 2 lessons first if you are not familiar with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining some helper functions:

import os
import platform
import subprocess
from itertools import islice

from khiops import core as kh

# Define helper functions
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        # islice stops after n lines instead of loading the whole file
        for line in islice(file, n):
            print(line, end="")
    print("")


# If there are any issues, you may print the Khiops status with the following command
# kh.get_runner().print_status()

Training a Multi-Table Classifier

We’ll train a multi-table classifier on an extension of the AccidentsSummary dataset that we used in the previous notebook, Core Basics 2. This dataset, Accidents, contains two additional tables, Place and User, and is organized in the following relational snowflake schema:

Accident
|
| -- 1:n -- Vehicle
|             |
|             |-- 1:n -- User
|
| -- 1:1 -- Place

Note that the target variable is Gravity.

To train a classifier for this setup, the schema must be encoded in the dictionary file. Let’s check the contents of the Accidents dictionary file:

accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
accidents_kdic = os.path.join(accidents_dataset_dir, "Accidents.kdic")

print(f"Accidents dictionary file location: {accidents_kdic}")
print("")
peek(accidents_kdic, n=45)
Accidents dictionary file location: /github/home/khiops_data/samples/Accidents/Accidents.kdic

Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical Gravity;
  Date Date;
  Time Hour;
  Categorical Light;
  Categorical Department;
  Categorical Commune;
  Categorical InAgglomeration;
  Categorical IntersectionType;
  Categorical Weather;
  Categorical CollisionType;
  Categorical PostalAddress;
  Categorical GPSCode;
  Numerical Latitude;
  Numerical Longitude;
  Entity(Place) Place;
  Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
  Categorical AccidentId;
  Categorical RoadType;
  Categorical RoadNumber;
  Categorical RoadSecNumber;
  Categorical RoadLetter;
  Categorical Circulation;
  Numerical LaneNumber;
  Categorical SpecialLane;
  Categorical Slope;
  Categorical RoadMarkerId;
  Numerical RoadMarkerDistance;
  Categorical Layout;
  Numerical StripWidth;
  Numerical LaneWidth;
  Categorical SurfaceCondition;
  Categorical Infrastructure;
  Categorical Localization;
  Categorical SchoolNear;
};


Dictionary Vehicle(AccidentId, VehicleId)

Note the following differences from the dictionary of the AccidentsSummary dataset:

  • The schema for the main table contains one extra special variable, defined with the statement Entity(Place) Place, which indicates a 1:1 relationship between the Accident and Place tables.

  • The main table Accident and the entity Place have the same key, AccidentId. The table Vehicle and its child table User are both keyed by two fields, AccidentId and VehicleId.
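
The relationships described above can also be read from the dictionary text with a few lines of Python. The following is only a minimal sketch: list_relations is a hypothetical helper, not part of the Khiops API, and it relies on regular expressions that assume the one-declaration-per-line layout printed above rather than on a real dictionary parser. The excerpt keeps only a few fields and leaves out Vehicle, whose body is cut off in the output above.

```python
import re

# Excerpt of the Accidents dictionary shown above (most fields omitted)
KDIC_EXCERPT = """
Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical Gravity;
  Entity(Place) Place;
  Table(Vehicle) Vehicles;
};

Dictionary Place(AccidentId)
{
  Categorical AccidentId;
  Categorical RoadType;
};
"""

# "Dictionary Name(Key1, Key2)" headers, optionally prefixed by "Root"
DICT_RE = re.compile(r"^\s*(?:Root\s+)?Dictionary\s+(\w+)\s*\(([^)]*)\)")
# "Entity(Child) Variable;" (1:1) and "Table(Child) Variable;" (1:n) declarations
LINK_RE = re.compile(r"^\s*(Entity|Table)\((\w+)\)\s+(\w+)\s*;")


def list_relations(kdic_text):
    """Returns the key fields per dictionary and the parent-child relations"""
    keys, relations, current = {}, [], None
    for line in kdic_text.splitlines():
        dict_match = DICT_RE.match(line)
        if dict_match:
            current = dict_match.group(1)
            keys[current] = [k.strip() for k in dict_match.group(2).split(",")]
            continue
        link_match = LINK_RE.match(line)
        if link_match and current is not None:
            kind = "1:1" if link_match.group(1) == "Entity" else "1:n"
            relations.append(
                (current, kind, link_match.group(2), link_match.group(3))
            )
    return keys, relations


keys, relations = list_relations(KDIC_EXCERPT)
for parent, kind, child, variable in relations:
    print(f"{parent} --{kind}--> {child} (variable {variable})")
```

On the real file, the same function could be applied to the text read from accidents_kdic, provided the layout assumption holds.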

Now let’s store the locations of the tables and peek at their contents:

accidents_data_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
print(f"Accidents data table: {accidents_data_file}")
print("")
peek(accidents_data_file)

vehicles_data_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)

places_data_file = os.path.join(accidents_dataset_dir, "Places.txt")
print(f"Places data table: {places_data_file}")
print("")
peek(places_data_file)

users_data_file = os.path.join(accidents_dataset_dir, "Users.txt")
print(f"Users data table: {users_data_file}")
print("")
peek(users_data_file)
Accidents data table: /github/home/khiops_data/samples/Accidents/Accidents.txt

AccidentId  Gravity Date    Hour    Light   Department      Commune InAgglomeration IntersectionType        Weather CollisionType   PostalAddress   GPSCode Latitude        Longitude
201800000001        NonLethal       2018-01-24      15:05:00        Daylight        590     005     No      Y-type  Normal  2Vehicles-BehindVehicles-Frontal        route des Ansereuilles  M       50.55737        2.55737
201800000002        NonLethal       2018-02-12      10:15:00        Daylight        590     011     Yes     Square  VeryGood        NoCollision     Place du général de Gaul        M       50.52936        2.52936
201800000003        NonLethal       2018-03-04      11:35:00        Daylight        590     477     Yes     T-type  Normal  NoCollision     Rue  nationale  M       50.51243        2.51243
201800000004        NonLethal       2018-05-05      17:35:00        Daylight        590     052     Yes     NoIntersection  VeryGood        2Vehicles-Side  30 rue Jules Guesde     M       50.51974        2.51974
201800000005        NonLethal       2018-06-26      16:05:00        Daylight        590     477     Yes     NoIntersection  Normal  2Vehicles-Side  72 rue Victor Hugo      M       50.51607        2.51607
201800000006        NonLethal       2018-09-23      06:30:00        TwilightOrDawn  590     052     Yes     NoIntersection  LightRain       Other   D39     M       50.52132        2.52132
201800000007        NonLethal       2018-09-26      00:40:00        NightStreelightsOn      590     133     Yes     NoIntersection  Normal  Other   4 route de camphin      M       50.52211        2.52211
201800000008        Lethal  2018-11-30      17:15:00        NightStreelightsOn      590     011     Yes     NoIntersection  Normal  Other   rue saint exupéry       M       50.53146        2.53146
201800000009        NonLethal       2018-02-18      15:57:00        Daylight        590     550     No      NoIntersection  Normal  Other   rue de l'égalité        M       50.53707        2.53707

Vehicles data table: /github/home/khiops_data/samples/Accidents/Vehicles.txt

AccidentId  VehicleId       Direction       Category        PassengerNumber FixedObstacle   MobileObstacle  ImpactPoint     Maneuver
201800000001        A01     Unknown Car<=3.5T       0       None    Vehicle RightFront      TurnToLeft
201800000001        B01     Unknown Car<=3.5T       0       None    Vehicle LeftFront       NoDirectionChange
201800000002        A01     Unknown Car<=3.5T       0       None    Pedestrian      None    NoDirectionChange
201800000003        A01     Unknown Motorbike>125cm3        0       StationaryVehicle       Vehicle Front   NoDirectionChange
201800000003        B01     Unknown Car<=3.5T       0       None    Vehicle LeftSide        TurnToLeft
201800000003        C01     Unknown Car<=3.5T       0       None    None    RightSide       Parked
201800000004        A01     Unknown Car<=3.5T       0       None    Other   RightFront      Avoidance
201800000004        B01     Unknown Bicycle 0       None    Vehicle LeftSide        None
201800000005        A01     Unknown Moped   0       None    Vehicle RightFront      PassLeft

Places data table: /github/home/khiops_data/samples/Accidents/Places.txt

AccidentId  RoadType        RoadNumber      RoadSecNumber   RoadLetter      Circulation     LaneNumber      SpecialLane     Slope   RoadMarkerId    RoadMarkerDistance      Layout  StripWidth      LaneWidth       SurfaceCondition        Infrastructure  Localization    SchoolNear
201800000001        Departamental   41              C       TwoWay  2       0       Flat                    RightCurve                      Normal  Unknown Lane    00
201800000002        Communal        41              D       TwoWay  2       0       Flat                    LeftCurve                       Normal  Unknown Lane    00
201800000003        Departamental   39              D       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000004        Departamental   39                      TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000005        Communal                                OneWay  1       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000006        Departamental   39              D       Unknown 2       0       Uphill                  LeftCurve                       Wet     Unknown Shoulder        00
201800000007        Departamental   41              D       TwoWay  2       0       Flat    16      500     Straight                        Normal  Unknown Shoulder        00
201800000008        Communal        -                       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Lane    00
201800000009        Departamental   141             D       TwoWay  2       0       Flat                    Straight                        Normal  Unknown Shoulder        00

Users data table: /github/home/khiops_data/samples/Accidents/Users.txt

AccidentId  VehicleId       Seat    Category        Gender  TripReason      SafetyDevice    SafetyDeviceUsed        PedestrianLocation      PedestrianAction        PedestrianCompany       BirthYear
201800000001        A01     1       Driver  Male    Leisure SeatBelt        Yes     None    None    Unknown 1960
201800000001        B01     1       Driver  Male    None    SeatBelt        Yes     None    None    Unknown 1928
201800000002        A01     1       Driver  Male    None    SeatBelt        Yes     None    None    Unknown 1947
201800000002        A01             Pedestrian      Male    None    Helmet          OnLane<=OnSidewalk0mCrossing    Crossing        Alone   1959
201800000003        A01     1       Driver  Male    Leisure Helmet  Yes     None    None    Unknown 1987
201800000003        C01     1       Driver  Male    None    ChildrenDevice          None    None    Unknown 1977
201800000004        A01     1       Driver  Male    Leisure SeatBelt        Yes     None    None    Unknown 1982
201800000004        B01     1       Driver  Male    Leisure Helmet          None    None    Unknown 2013
201800000005        A01     1       Driver  Male    Leisure Helmet  Yes     None    None    Unknown 2001

Train a classifier for the Accidents database with 1000 variables

The call to train_predictor is exactly the same as in the exercise of the previous notebook, Core Basics 2. The only difference is that the dictionary additional_data_tables, which maps the data paths of the additional tables to their files, gets two new entries:

  • Path of entity Place is Accident`Place.

  • Path of table User is Accident`Vehicles`Users.
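
The two paths above follow a simple pattern: start from the root dictionary name and append, separated by a backtick, the name of each table or entity variable traversed on the way to the target table. The sketch below illustrates that rule under stated assumptions: the SCHEMA mapping is written by hand from the dictionary file (the Users variable of Vehicle is inferred from the path above), and build_data_paths is a hypothetical helper, not a Khiops function.

```python
# Hand-written view of the schema:
# dictionary name -> list of (variable name, child dictionary name)
SCHEMA = {
    "Accident": [("Place", "Place"), ("Vehicles", "Vehicle")],
    "Vehicle": [("Users", "User")],
}


def build_data_paths(schema, root, separator="`"):
    """Lists the data paths of all secondary tables reachable from root"""
    paths = []

    def visit(dictionary, prefix):
        for variable, child in schema.get(dictionary, []):
            path = f"{prefix}{separator}{variable}"
            paths.append(path)
            # Recurse: a secondary table may itself have child tables
            visit(child, path)

    visit(root, root)
    return paths


data_paths = build_data_paths(SCHEMA, "Accident")
print(data_paths)
```

Note that the path components are variable names (Vehicles, Users), not dictionary names (Vehicle, User).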

As before, we’ll ask Khiops to create 1000 additional features with its multi-table AutoML mode.

Do not forget:

  • The target variable is Gravity.

  • Set max_trees=0.

With these considerations, let’s now train the classifier:

accidents_results_dir = os.path.join("exercises", "Accidents")
accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={
        "Accident`Vehicles": vehicles_data_file,
        "Accident`Place": places_data_file,
        "Accident`Vehicles`Users": users_data_file,
    },
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"Accidents report file: {accidents_report}")
print(f"Accidents modeling dictionary file: {accidents_model_kdic}")
Accidents report file: exercises/Accidents/AllReports.khj
Accidents modeling dictionary file: exercises/Accidents/Modeling.kdic

Take a look at the report

Which variables best predict the gravity of an accident?

# To visualize, uncomment the line below
# kh.visualize_report(accidents_report)
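
The question can also be explored without the visualization tool, since the report is a JSON file. The following is a hedged sketch: it assumes that the report stores per-variable statistics under preparationReport → variablesStatistics, with name and level fields, which should be checked against the report produced by your Khiops version. The helpers top_variables and load_report are hypothetical, and the toy report is made up solely to exercise the code; its values are not results from the Accidents data.

```python
import json


def load_report(report_path):
    """Loads a JSON report file into a Python dictionary"""
    with open(report_path, encoding="utf8", errors="replace") as report_file:
        return json.load(report_file)


def top_variables(report, n=10):
    """Returns the n (name, level) pairs with the highest level"""
    # Assumed layout: report["preparationReport"]["variablesStatistics"]
    stats = report.get("preparationReport", {}).get("variablesStatistics", [])
    ranked = sorted(stats, key=lambda stat: stat.get("level", 0), reverse=True)
    return [(stat["name"], stat.get("level", 0)) for stat in ranked[:n]]


# Toy report with made-up names and values, only to show the expected shape
toy_report = {
    "preparationReport": {
        "variablesStatistics": [
            {"name": "VarA", "level": 0.01},
            {"name": "VarB", "level": 0.2},
            {"name": "VarC", "level": 0.0},
        ]
    }
}
print(top_variables(toy_report, n=2))

# With the report trained above (if the assumed layout holds):
# print(top_variables(load_report(accidents_report)))
```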