Core Basics 2: Train a Classifier on a Star Multi-Table Dataset

In this notebook we learn how to train a classifier on multi-table data composed of two tables (a root table and a secondary table). We highly recommend going through the Core Basics 1 lesson first if you are not familiar with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops, checking its installation and defining some helper functions:

import os
import platform
import subprocess
from khiops import core as kh

# Define peek helper function
def peek(file_path, n=10):
    """Shows the first n lines of a file"""
    with open(file_path, encoding="utf8", errors="replace") as file:
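        # Iterate lazily so that large files are not fully loaded into memory
        for line_number, line in enumerate(file):
            if line_number >= n:
                break
            print(line, end="")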
    print("")


# If there are any issues you may print the Khiops status with the following command
# kh.get_runner().print_status()

Training a Multi-Table Classifier

We’ll train a “sarcasm detector” using the dataset HeadlineSarcasm. In its raw form, it contains a list of text headlines paired with a label that indicates whether its source is a sarcastic site (such as The Onion) or not.

We have transformed this dataset into two tables such that the text-label record

"groundbreaking study finds gratification can be deliberately postponed"    yes

is transformed to an entry in a table that contains id-label records

97 yes

and various entries in a secondary table linking a headline id to its words and positions

97  0   groundbreaking
97  1   study
97  2   finds
97  3   gratification
97  4   can
97  5   be
97  6   deliberately
97  7   postponed
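The transformation above can be sketched in a few lines of Python. This is only an illustration: the function name split_headline is hypothetical, and splitting on whitespace is an assumption about how the words were obtained, not a description of the preprocessing actually applied to the dataset.

```python
def split_headline(headline_id, text, label):
    """Splits one text-label record into a main-table row and word rows."""
    # Main table row: (headline id, label)
    main_row = (headline_id, label)
    # One secondary row per word: (headline id, position, word)
    word_rows = [
        (headline_id, position, word)
        for position, word in enumerate(text.split())
    ]
    return main_row, word_rows


main_row, word_rows = split_headline(
    97,
    "groundbreaking study finds gratification can be deliberately postponed",
    "yes",
)

# Print both tables as tab-separated records
print(*main_row, sep="\t")
for row in word_rows:
    print(*row, sep="\t")
```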

Thus the HeadlineSarcasm dataset has the following multi-table schema

+-----------+
|Headline   |
+-----------+ +-------------+
|HeadlineId*| |HeadlineWords|
|IsSarcasm  | +-------------+
+-----------+ |HeadlineId*  |
     |        |Position     |
     +-1:n--->|Word         |
              +-------------+

The HeadlineId variable is special because it is a key that links a particular headline to its words (a 1:n relation).
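To make the role of the key concrete, here is a minimal sketch that groups secondary records by HeadlineId and attaches them to their main record. The rows are a short excerpt of the tables shown later in this notebook; the variable names are illustrative and this is not how Khiops processes the tables internally.

```python
from collections import defaultdict

# Excerpt of the main table: (HeadlineId, IsSarcasm)
headline_rows = [("0", "yes"), ("1", "no")]

# Excerpt of the secondary table: (HeadlineId, Position, Word)
word_rows = [
    ("0", 0, "thirtysomething"),
    ("0", 1, "scientists"),
    ("0", 2, "unveil"),
    ("1", 0, "dem"),
    ("1", 1, "rep."),
]

# Group the secondary records by their key: one headline, many words (1:n)
words_by_headline = defaultdict(list)
for headline_id, position, word in word_rows:
    words_by_headline[headline_id].append((position, word))

# Follow the key from each main record to its words, ordered by position
for headline_id, label in headline_rows:
    words = [word for _, word in sorted(words_by_headline[headline_id])]
    print(headline_id, label, " ".join(words))
```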

Note: There are other methods more appropriate for this text-mining problem. This multi-table setup is only for pedagogical purposes.

To train a classifier with Khiops in this multi-table setup, this schema must be codified in the dictionary file. Let’s check the contents of the HeadlineSarcasm dictionary file:

sarcasm_kdic = os.path.join("data", "HeadlineSarcasm", "HeadlineSarcasm.kdic")

print(f"HeadlineSarcasm dictionary file: {sarcasm_kdic}")
print("")
peek(sarcasm_kdic, n=15)

HeadlineSarcasm dictionary file: data/HeadlineSarcasm/HeadlineSarcasm.kdic

Root Dictionary Headline(HeadlineId)
{
  Categorical HeadlineId;
  Categorical IsSarcasm;
  Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
  Categorical HeadlineId;
  Numerical Position;
  Categorical Word;
};

As in the single-table case, the .kdic file describes the schema of both tables, but note the following differences:
- The dictionary of the Headline table is prefixed by the Root keyword to indicate that it is the main one.
- For both tables, the dictionary name is followed by (HeadlineId) to indicate that HeadlineId is the key of the table.
- The schema of the main table contains an extra special variable, defined with the statement Table(Words) HeadlineWords. Together with the shared key variable, this statement is necessary to indicate the 1:n relationship between the main and secondary tables.
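The three points can be checked mechanically on the dictionary text. The sketch below uses two small regular expressions that only cover the simple layout of this particular file; it is not a real .kdic parser, and the names HEADER, TABLE_VAR and summarize_kdic are hypothetical.

```python
import re

# Dictionary text copied from the output shown in this notebook
KDIC_TEXT = """
Root Dictionary Headline(HeadlineId)
{
  Categorical HeadlineId;
  Categorical IsSarcasm;
  Table(Words) HeadlineWords;
};

Dictionary Words(HeadlineId)
{
  Categorical HeadlineId;
  Numerical Position;
  Categorical Word;
};
"""

# Matches "[Root] Dictionary <Name>(<Key>)" at the start of a line
HEADER = re.compile(r"^(Root\s+)?Dictionary\s+(\w+)\((\w+)\)", re.MULTILINE)
# Matches "Table(<Dictionary>) <VariableName>;"
TABLE_VAR = re.compile(r"^\s*Table\((\w+)\)\s+(\w+);", re.MULTILINE)


def summarize_kdic(text):
    """Returns the dictionary headers and the Table variables found in text."""
    headers = [
        {"name": name, "key": key, "is_root": bool(root)}
        for root, name, key in HEADER.findall(text)
    ]
    links = TABLE_VAR.findall(text)
    return headers, links


headers, links = summarize_kdic(KDIC_TEXT)
for header in headers:
    print(header)
for target_dictionary, variable_name in links:
    print(f"Table variable {variable_name} points to dictionary {target_dictionary}")
```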

Now let’s store the locations of the main and secondary tables and peek at their contents:

sarcasm_headlines_file = os.path.join("data", "HeadlineSarcasm", "Headlines.txt")
sarcasm_words_file = os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")

print(f"HeadlineSarcasm main table file: {sarcasm_headlines_file}")
print("")
peek(sarcasm_headlines_file, n=3)

print(f"HeadlineSarcasm secondary table file location: {sarcasm_words_file}")
print("")
peek(sarcasm_words_file, n=15)

HeadlineSarcasm main table file: data/HeadlineSarcasm/Headlines.txt

HeadlineId  IsSarcasm
0   yes
1   no

HeadlineSarcasm secondary table file location: data/HeadlineSarcasm/HeadlineWords.txt

HeadlineId  Position        Word
0   0       thirtysomething
0   1       scientists
0   2       unveil
0   3       doomsday
0   4       clock
0   5       of
0   6       hair
0   7       loss
1   0       dem
1   1       rep.
1   2       totally
1   3       nails
1   4       why
1   5       congress

The call to train_predictor will be very similar to the single-table case, but there are some differences.

The first is that we must pass the path of the extra secondary data table. This is done with the additional_data_tables parameter, a Python dictionary containing one key-value pair per secondary table. More precisely:
- keys are the data paths of the secondary tables. In this case there is only Headline`HeadlineWords
- values are the file paths of the secondary tables. In this case there is only the file path we stored in sarcasm_words_file

Note: To understand what data paths are, see the “Multi-Table Tasks” section of the Khiops core.api documentation.
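As a small illustration of the key-value structure, the sketch below builds the dictionary for our single secondary table. The helper build_data_path is hypothetical (it is not part of Khiops); it only reproduces the backtick-separated form used in this tutorial, so check the Khiops documentation for the convention of your installed version.

```python
import os


def build_data_path(root_dictionary, *table_variables):
    """Joins the root dictionary name and table variable names with backticks.

    Hypothetical helper: it only mirrors the data path form used in this tutorial.
    """
    return "`".join((root_dictionary, *table_variables))


# Key: data path of the secondary table, value: path of its data file
words_data_path = build_data_path("Headline", "HeadlineWords")
additional_data_tables = {
    words_data_path: os.path.join("data", "HeadlineSarcasm", "HeadlineWords.txt")
}
print(additional_data_tables)
```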

Secondly, we specify how many features/aggregates Khiops will create with its multi-table AutoML mode. For the HeadlineSarcasm dataset Khiops can create features such as:
- Number of different words in the headline
- Most common word in the headline before the third one
- Number of times the word ‘the’ appears
- …

Khiops then evaluates, selects and combines the created features to build a classifier. We’ll ask it to create 1000 of these features (the default is 100).
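To give an intuition of what such aggregates look like, here is a plain Python sketch that computes the three example features for the headline with id 97. It only illustrates the idea and is not how Khiops constructs or names its features. In particular, “before the third one” is read here as Position < 3, which is an interpretation.

```python
from collections import Counter

# Secondary table records of headline 97 as (Position, Word) pairs
words = [
    "groundbreaking", "study", "finds", "gratification",
    "can", "be", "deliberately", "postponed",
]
rows = list(enumerate(words))

# Number of different words in the headline
distinct_word_count = len({word for _, word in rows})

# Number of times the word 'the' appears
the_count = sum(1 for _, word in rows if word == "the")

# Most common word among the first positions (assumption: Position < 3)
early_words = [word for position, word in rows if position < 3]
most_common_early_word = Counter(early_words).most_common(1)[0][0]

print(f"Distinct words: {distinct_word_count}")
print(f"Occurrences of 'the': {the_count}")
print(f"Most common early word: {most_common_early_word}")
```

Each aggregate turns the variable number of secondary records of a headline into a single value, so the result can be used as an ordinary column of the main table.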

With these considerations, let’s set up some extra variables and train the classifier:

sarcasm_results_dir = os.path.join("exercises", "HeadlineSarcasm")

sarcasm_report, sarcasm_model_kdic = kh.train_predictor(
    sarcasm_kdic,
    dictionary_name="Headline",  # This must be the main/root dictionary
    data_table_path=sarcasm_headlines_file,  # This must be the data file for the main table
    target_variable="IsSarcasm",
    results_dir=sarcasm_results_dir,
    additional_data_tables={"Headline`HeadlineWords": sarcasm_words_file},
    max_constructed_variables=1000,  # by default Khiops constructs 100 variables for AutoML multi-table
    max_trees=0,  # by default Khiops constructs 10 decision tree variables
)
print(f"HeadlineSarcasm report file located at: {sarcasm_report}")
print(f"HeadlineSarcasm modeling dictionary file located at: {sarcasm_model_kdic}")

HeadlineSarcasm report file located at: exercises/HeadlineSarcasm/AllReports.khj
HeadlineSarcasm modeling dictionary file located at: exercises/HeadlineSarcasm/Modeling.kdic

We can now take a look at the results with the visualization tool:

# To visualize uncomment the line below
# kh.visualize_report(sarcasm_report)

Note: In the multi-table case, the input tables must be sorted by their key column in lexicographical order. To do this you may use the Khiops sort_data_table function or your favorite software. The tables used in this tutorial are already sorted.
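Lexicographical order is not the same as numerical order, which is easy to overlook with numeric-looking keys. The sketch below shows the difference with Python string comparison and a hypothetical helper, is_sorted_by_key, that checks the order of in-memory rows. Python compares strings by code point, which is assumed here to be a reasonable stand-in for plain ASCII keys; refer to the Khiops documentation for the exact ordering it expects.

```python
# Keys listed in numerical order
keys = ["1", "2", "9", "10", "11"]

# As strings they sort lexicographically: "10" and "11" come before "2"
lexicographic = sorted(keys)
print(lexicographic)


def is_sorted_by_key(rows, key_index=0):
    """Checks that rows are in non-decreasing lexicographical key order."""
    column = [row[key_index] for row in rows]
    return all(first <= second for first, second in zip(column, column[1:]))


# The numerically ordered rows are not lexicographically sorted
print(is_sorted_by_key([(key,) for key in keys]))
print(is_sorted_by_key([(key,) for key in lexicographic]))
```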

Exercise time!

Repeat the previous steps with the AccidentsSummary dataset. It describes the characteristics of traffic accidents that happened in France in 2018. It has two tables with the following schema:

+---------------+
|Accidents      |
+---------------+
|AccidentId*    |
|Gravity        |
|Date           |
|Hour           | +---------------+
|Light          | |Vehicles       |
|Department     | +---------------+
|Commune        | |AccidentId*    |
|InAgglomeration| |VehicleId*     |
|...            | |Direction      |
+---------------+ |Category       |
       |          |PassengerNumber|
       +---1:n--->|...            |
                  +---------------+

So for each accident we have its characteristics (such as Gravity or Light conditions) and those of each involved vehicle (such as its Direction or PassengerNumber). The main task for this dataset is to predict the variable Gravity, which has two possible values: Lethal and NonLethal.

We first save the paths of the AccidentsSummary dictionary file and data table files into variables:

accidents_kdic = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.kdic"
)
accidents_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Accidents.txt"
)
vehicles_data_file = os.path.join(
    kh.get_samples_dir(), "AccidentsSummary", "Vehicles.txt"
)

Train a classifier for the Accidents database with 1000 variables

Save the resulting file locations into the variables accidents_report and accidents_model_kdic and print them.

Do not forget:
- The target variable is Gravity
- The key for the additional_data_tables parameter is Accident`Vehicles and its value is vehicles_data_file
- Set max_trees=0

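# Directory for the results (matches the output paths shown below)
accidents_results_dir = os.path.join("exercises", "AccidentSummary")

accidents_report, accidents_model_kdic = kh.train_predictor(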
    accidents_kdic,
    dictionary_name="Accident",
    data_table_path=accidents_data_file,
    target_variable="Gravity",
    results_dir=accidents_results_dir,
    additional_data_tables={"Accident`Vehicles": vehicles_data_file},
    max_constructed_variables=1000,
    max_trees=0,
)
print(f"AccidentsSummary report file: {accidents_report}")
print(f"AccidentsSummary modeling dictionary: {accidents_model_kdic}")

AccidentsSummary report file: exercises/AccidentSummary/AllReports.khj
AccidentsSummary modeling dictionary: exercises/AccidentSummary/Modeling.kdic

Take a look at the report

Which variables best predict the gravity of an accident?

# To visualize uncomment the line below
# kh.visualize_report(accidents_report)