Core Basics 2: Train a Classifier on a Star Multi-Table Dataset

In this notebook we learn how to train a classifier on a multi-table dataset composed of two tables (a root table and a secondary table). If you are not familiar with Khiops, we highly recommend going through the Core Basics 1 lesson first.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining some helper functions:

import os
import platform
import subprocess
from khiops import core as kh

# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
        for line in file.readlines()[:n]:
            print(line, end="")
    print("")


# If there are any issues you may check the Khiops status with the following command
# kh.get_runner().print_status()

Training a Multi-Table Classifier

We’ll train a “sarcasm detector” using the dataset HeadlineSarcasm. In its raw form, it contains a list of text headlines paired with a label that indicates whether its source is a sarcastic site (such as The Onion) or not.

We have transformed this dataset into two tables such that the text-label record

"groundbreaking study finds gratification can be deliberately postponed"    yes

is transformed to an entry in a table that contains id-label records

97 yes

and various entries in a secondary table linking a headline id to its words and positions

0   groundbreaking
1   study
2   finds
3   gratification
4   can
5   be
6   deliberately
7   postponed

Thus the HeadlineSarcasm dataset has the following multi-table schema

+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
     |        |Position     |
     +-1:n--->|Word         |
              +-------------+

The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).
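
The transformation described above can be sketched in plain Python. This is only an illustration: the function name `split_headline` and the whitespace tokenization are assumptions, not the procedure actually used to build the dataset files.

```python
def split_headline(headline_id, text, label):
    """Splits a (text, label) record into a main record and word records."""
    # Main table record: one row per headline with its key and label
    main_record = (headline_id, label)

    # Secondary table records: one row per word, all sharing the headline key
    word_records = [
        (headline_id, position, word)
        for position, word in enumerate(text.split())
    ]
    return main_record, word_records


main_record, word_records = split_headline(
    97,
    "groundbreaking study finds gratification can be deliberately postponed",
    "yes",
)
print(main_record)
for record in word_records:
    print(record)
```

Every word record carries the same key value as its main record, which is what allows the 1:n link between the two tables.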

Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purposes.

To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let’s check the contents of the HeadlineSarcasm dictionary file:

sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")

print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
print("")
peek(sarcasm_kdic, n=15)

HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic

Root Dictionary Headline(HeadlineId)
{
  Categorical HeadlineId;
  Categorical IsSarcasm;
  Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
  Categorical HeadlineId;
  Numerical Position;
  Categorical Word;
};

As in the single-table case, the .kdic file describes the schema for both tables, but note the following differences:

- The dictionary for the table Headline is prefixed by the Root keyword to indicate that it is the main one.
- For both tables, the dictionary names are followed by (HeadlineId) to indicate that HeadlineId is the key of these tables.
- The schema for the main table contains an extra special variable defined with the statement Table(Words) HeadlineWords. This, in addition to sharing the same key variable, is necessary to indicate the 1:n relationship between the main and secondary tables.
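
To make these three points concrete, here is a rough sketch that extracts them from the dictionary text with regular expressions. It is not a parser for the dictionary file format: the patterns and the names `header_pattern` and `table_pattern` are assumptions that only cover the constructs shown in this file.

```python
import re

kdic_text = """
Root Dictionary Headline(HeadlineId)
{
  Categorical HeadlineId;
  Categorical IsSarcasm;
  Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
  Categorical HeadlineId;
  Numerical Position;
  Categorical Word;
};
"""

# Dictionary headers: optional "Root", the name, and the key variables in parentheses
header_pattern = re.compile(r"^(Root\s+)?Dictionary\s+(\w+)\((.*?)\)", re.MULTILINE)
# Table variables: "Table(<secondary dictionary>) <variable name>;"
table_pattern = re.compile(r"^[ \t]*Table\((\w+)\)\s+(\w+);", re.MULTILINE)

dictionaries = [
    {
        "name": name,
        "root": bool(root),
        "key": [part.strip() for part in key.split(",")],
    }
    for root, name, key in header_pattern.findall(kdic_text)
]
relations = table_pattern.findall(kdic_text)

for dictionary in dictionaries:
    print(dictionary)
print(relations)
```

The single entry in `relations` pairs the secondary dictionary (Words) with the table variable of the main dictionary (HeadlineWords).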

Now let’s store the locations of the main and secondary tables and peek at their contents:

sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")

print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
print("")
peek(sarcasm_headlines_file, n=3)

print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
print("")
peek(sarcasm_words_file, n=15)

HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt

HeadlineId  IsSarcasm
 yes
 no

HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt

HeadlineId  Position        Word
 0       thirtysomething
 1       scientists
 2       unveil
 3       doomsday
 4       clock
 5       of
 6       hair
 7       loss
 0       dem
 1       rep.
 2       totally
 3       nails
 4       why
 5       congress

The call to train_predictor will be very similar to the single-table case, but there are some differences.

The first is that we must pass the path of the extra secondary data table. This is done with the additional_data_tables parameter, a Python dictionary containing one key-value pair per secondary table. More precisely:

- keys are the data paths of the secondary tables. In this case only Headline`HeadlineWords
- values are the file paths of the secondary tables. In this case only the file path we stored in sarcasm_words_file

Note: To understand what data paths are, see the “Multi-Table Tasks” section of the Khiops ``core.api`` documentation.
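
As a small illustration of how such a mapping is assembled, the sketch below builds the data path from its two parts. The helper `build_data_path` is hypothetical and not part of Khiops; it only reproduces the backtick convention used in this tutorial, which may differ in other Khiops versions.

```python
import os


def build_data_path(root_dictionary, table_variable):
    """Joins a root dictionary name and a table variable name with a backtick."""
    return f"{root_dictionary}`{table_variable}"


# File path of the secondary table, as stored earlier in this notebook
words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")

# One key-value pair per secondary table: data path -> file path
additional_data_tables = {
    build_data_path("Headline", "HeadlineWords"): words_file,
}
print(additional_data_tables)
```

Note that the second part of the data path is the name of the Table variable (HeadlineWords), not the name of the secondary dictionary (Words).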

Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the HeadlineSarcasm dataset Khiops can create features such as:

- Number of different words in the headline
- Most common word in the headline before the third one
- Number of times the word ‘the’ appears
- …

It will then evaluate, select and combine the created features to build a classifier. We’ll ask to create 1000 of these features (the default is 100).
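
To get an intuition for what such aggregates look like, the sketch below computes two of them by hand for the example headline with id 97. Khiops constructs and selects its features automatically, so this is only an illustration; the feature names `distinct_word_count` and `the_count` are hypothetical.

```python
from collections import defaultdict

# Secondary table rows (HeadlineId, Position, Word) for the example headline
text = "groundbreaking study finds gratification can be deliberately postponed"
word_rows = [(97, position, word) for position, word in enumerate(text.split())]

# Group the words of the secondary table by the key of the main table
words_by_headline = defaultdict(list)
for headline_id, _position, word in word_rows:
    words_by_headline[headline_id].append(word)

# One row of aggregates per headline, i.e. per record of the main table
features = {
    headline_id: {
        "distinct_word_count": len(set(words)),
        "the_count": words.count("the"),
    }
    for headline_id, words in words_by_headline.items()
}
print(features)
```

Each aggregate summarizes the variable number of secondary records of a headline into a single value, so the result fits in one row of the main table.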

With these considerations, let’s set up some extra variables and train the classifier:

sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")

sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
    sarcasm_kdic,
    dictionary_name="Headline",  # This must be the main/root dictionary
    data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
    target_variable="IsSarcasm",
    results_dir=sarcasm_results_dir,
    additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
    max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")

HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic

We may now take a look at the results with the visualization tool:

# To visualize uncomment the line below
# kh.visualize_report(sarcasm_report)

Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops ``sort_data_table`` function or your favorite software. The examples of this tutorial have their tables pre-sorted.
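
Lexicographical order compares keys as strings, which can differ from numeric order. The sketch below shows the difference on a few hypothetical keys and labels; it assumes that Python's default string comparison is a fair stand-in for the order meant here. For real data files, prefer the sort_data_table function mentioned above.

```python
# Hypothetical key values, stored as strings as in a Categorical key
keys = ["9", "10", "100", "2"]

numeric_order = sorted(keys, key=int)
lexicographic_order = sorted(keys)

print(numeric_order)
print(lexicographic_order)

# Sorting the rows of a table by its key column (first field), as strings
rows = [("9", "yes"), ("10", "no"), ("100", "no"), ("2", "yes")]
sorted_rows = sorted(rows, key=lambda row: row[0])
print(sorted_rows)
```

With these keys, "10" and "100" come before "2" in lexicographical order because the comparison goes character by character.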

Exercise time!

Repeat the previous steps with the AccidentsSummary dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:

+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           | +---------------+
|Light          | |Vehicles       |
|Department     | +---------------+
|Commune        | |AccidentId*    |
|InAgglomeration| |VehicleId*     |
|...            | |Direction      |
+---------------+ |Category       |
       |          |PassengerNumber|
       +---1:n--->|...            |
                  +---------------+

So for each accident we have its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (such as its Direction or PassengerNumber). The main task for this dataset is to predict the variable Gravity, which has two possible values: Lethal and NonLethal.

We first save the paths of the AccidentsSummary dictionary file and data table files into variables:

accidents_kdic = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
)
accidents_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
)
vehicles_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
)

Print the file locations and use the function `peek` to list their contents

Which table is the Root in this case?

print(f"Accidents dictionary file: {accidents_kdic}")
print("")
peek(accidents_kdic, n=40)

print(f"Accidents (main) data table: {accidents_data_file}")
print("")
peek(accidents_data_file)

print(f"Vehicles data table: {vehicles_data_file}")
print("")
peek(vehicles_data_file)

Accidents dictionary file: /github/home/khiops_data/samples/AccidentsSummary/Accidents.kdic

Root Dictionary Accident(AccidentId)
{
  Categorical AccidentId;
  Categorical       Gravity;
  Date Date;
  Time Hour;
  Categorical Light;
  Categorical Department;
  Categorical Commune;
  Categorical InAgglomeration;
  Categorical IntersectionType;
  Categorical Weather;
  Categorical CollisionType;
  Categorical PostalAddress;
  Table(Vehicle) Vehicles;
};

Dictionary Vehicle(AccidentId, VehicleId)
{
 Categorical AccidentId;
 Categorical VehicleId;
 Categorical Direction;
 Categorical Category;
 Numerical PassengerNumber;
 Categorical FixedObstacle;
 Categorical MobileObstacle;
 Categorical ImpactPoint;
 Categorical Maneuver;
};

Accidents (main) data table: /github/home/khiops_data/samples/AccidentsSummary/Accidents.txt

AccidentId  Gravity Date    Hour    Light   Department      Commune InAgglomeration IntersectionType        Weather CollisionType   PostalAddress
201800000001        NonLethal       2018-01-24      15:05:00        Daylight        590     005     No      Y-type  Normal  2Vehicles-BehindVehicles-Frontal        route des Ansereuilles
201800000002        NonLethal       2018-02-12      10:15:00        Daylight        590     011     Yes     Square  VeryGood        NoCollision     Place du général de Gaul
201800000003        NonLethal       2018-03-04      11:35:00        Daylight        590     477     Yes     T-type  Normal  NoCollision     Rue  nationale
201800000004        NonLethal       2018-05-05      17:35:00        Daylight        590     052     Yes     NoIntersection  VeryGood        2Vehicles-Side  30 rue Jules Guesde
201800000005        NonLethal       2018-06-26      16:05:00        Daylight        590     477     Yes     NoIntersection  Normal  2Vehicles-Side  72 rue Victor Hugo
201800000006        NonLethal       2018-09-23      06:30:00        TwilightOrDawn  590     052     Yes     NoIntersection  LightRain       Other   D39
201800000007        NonLethal       2018-09-26      00:40:00        NightStreelightsOn      590     133     Yes     NoIntersection  Normal  Other   4 route de camphin
201800000008        Lethal  2018-11-30      17:15:00        NightStreelightsOn      590     011     Yes     NoIntersection  Normal  Other   rue saint exupéry
201800000009        NonLethal       2018-02-18      15:57:00        Daylight        590     550     No      NoIntersection  Normal  Other   rue de l'égalité

Vehicles data table: /github/home/khiops_data/samples/AccidentsSummary/Vehicles.txt

AccidentId  VehicleId       Direction       Category        PassengerNumber FixedObstacle   MobileObstacle  ImpactPoint     Maneuver
201800000001        A01     Unknown Car<=3.5T       0       None    Vehicle RightFront      TurnToLeft
201800000001        B01     Unknown Car<=3.5T       0       None    Vehicle LeftFront       NoDirectionChange
201800000002        A01     Unknown Car<=3.5T       0       None    Pedestrian      None    NoDirectionChange
201800000003        A01     Unknown Motorbike>125cm3        0       StationaryVehicle       Vehicle Front   NoDirectionChange
201800000003        B01     Unknown Car<=3.5T       0       None    Vehicle LeftSide        TurnToLeft
201800000003        C01     Unknown Car<=3.5T       0       None    None    RightSide       Parked
201800000004        A01     Unknown Car<=3.5T       0       None    Other   RightFront      Avoidance
201800000004        B01     Unknown Bicycle 0       None    Vehicle LeftSide        None
201800000005        A01     Unknown Moped   0       None    Vehicle RightFront      PassLeft

We now save the results directory for this exercise:

accidents_results_dir = os.path.join("exercises", "AccidentSummary")
print(f"AccidentsSummary exercise results directory: {accidents_results_dir}")

AccidentsSummary exercise results directory: exercises/AccidentSummary

Train a classifier for the `Accidents` database with 1000 variables

Save the resulting file locations into the variables accidents_report and accidents_model_kdic and print them.

Do not forget:

- The target variable is Gravity
- The key for the additional_data_tables parameter is Accident`Vehicles and its value is vehicles_data_file
- Set max_trees=0

accidents_report, accidents_model_kdic = kh.train_predictor(
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={"Accident`Vehicles": vehicles_data_file},
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"AccidentsSummary report file: {accidents_report}")
print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")

AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic

Take a look at the report

Which variables predict the gravity of an accident well?

# To visualize uncomment the line below
# kh.visualize_report(accidents_report)