Multi-Table Learning Primer¶
Machine learning on tabular data is traditionally performed on a single table containing a record for each statistical object of the sample. However, data is usually stored in databases with multiple tables whose relationships are specified through a schema. Thus, before training a predictor, a preprocessing is necessary to flatten the relational tables into a single one containing all relevant information for the learning task.
This preprocessing (which belongs to the feature engineering process) is often very time consuming. One of the main Khiops features is to automate this process by natively learning predictors from multi-table datasets. Specifically, Khiops automatically:
generates features (aggregates) from the multi-table schema
evaluates the predictive value of the generated features
selects a small subset of the generated features to build a predictor
Supported Multi-Table Schemas¶
Khiops allows to train an estimator on the two most common dataset schemas: star and snowflake.
More complex schemas are supported only by the core
library at the moment.
Star Schema¶
In this schema there is one main table containing the identifier of the statistical object and its basic properties. The main table points to one or more secondary ones, each having or more records associated to a given object.
Here is a simple example of a star schema (for brevity we do not show the tables’ columns):
Customer(id_customer)
|
+---1:1--- Address(id_customer)
|
+---1:n--- Service(id_customer, id_product)
The statistical object in this schema is a Customer
that is associated to a unique Address
and to one or more Services
. In parentheses we show the key columns of each table that act as
foreign keys to associate the records of each table.
You can find more information about the star schema in its Wikipedia article.
Snowflake Schema¶
The snowflake schema generalizes the star schema by allowing each secondary table to be in a star schema by itself, forming a tree whose root is the main table.
Here we extend the previous star schema to a snowflake schema
Customer(id_customer)
|
+---1:1--- Address(id_customer)
|
+---1:n--- Service(id_customer, id_product)
|
+---1:n--- Usage(id_customer, id_product)
Again, the statistical object in this schema is a Customer
that is associated to a unique
Address
and to one or more Services
. But additionally, each Service
is associated to one
or more Usages
.
You can find more information about the snowflake schema in its Wikipedia article.
Multi-Table Learning with Scikit-Learn Estimators¶
The supervised estimators in khiops.sklearn handle multi-table datasets with a special input
feature object X
. Specifically, instead of a pandas.DataFrame
, X
must be a dict
that
specifies the dataset schema in the following way:
X = {
"main_table": <name of the main table>,
"tables" : {
<name of the main table>: (<dataframe of the main table>, <key of the main table>),
<name of table 1>: (<dataframe of table 1>, <key of table 1>),
<name of table 2>: (<dataframe of table 2>, <key of table 2>),
...
}
"relations" : [
(<name of the main table>, <name of a different table>, <entity flag>),
(<name of another table>, <name of yet another table>, <entity flag>),
...
],
}
The three fields of this dictionary are:
main_table
: The name of the main table.tables
: A dictionary indexed by the tables’ names. Each table is associated to a 2-tuple containing the following fields:The
pandas.DataFrame
object of the table.The key columns’ names : Either a list of strings or a single string.
relations
: An optional field containing a list of tuples describing the relations between tables. The first two values (Strings) of each tuple correspond to names of both the parent and the child table involved in the relation. A third value (Boolean) can be optionally added to the tuple to indicate if the relation is either1:n
or1:1
(entity). For example, If the tuple(table1, table2, True)
is contained in this field, it means that:table1
andtable2
are in a1:1
relationshipThe key of
table1
is contained in that oftable2
(ie. keys are hierarchical)
If the
relations
field is not present then Khiops Python assumes that the tables are in a star schema.
Note
With respect to Khiops, Khiops Python sklearn estimators have some limitations. They currently do not support external data tables.
This feature will be available in upcoming releases. If you need to use it, you can use the khiops.core
sub-module (see below).
Examples¶
Star Schema¶
For the AccidentsSummary
dataset above where tables are related through the following star
schema:
Accident(AccidentId)
|
+---1:n--- Vehicle(AccidentId, VehicleId)
We build the input X
as follows:
accidents_df = pd.read_csv(f"{kh.get_samples_dir()}/AccidentsSummary/Accidents.txt", sep="\t", encoding="latin1")
vehicles_df = pd.read_csv(f"{kh.get_samples_dir()}/AccidentsSummary/Vehicles.txt", sep="\t", encoding="latin1")
X = {
"main_table" : "Accident",
"tables": {
"Accident": (accidents_df.drop("Gravity", axis=1), "AccidentId"),
"Vehicle": (vehicles_df, ["AccidentId", "VehicleId"])
}
}
Snowflake Schema¶
For the Accidents
dataset (an extension of AccidentsSummary
) where tables are related
through the following snowflake schema
Accident(AccidentId)
|
+--- 1:n --- Vehicle(AccidentId, VehicleId)
| |
| +--- 1:n --- User(AccidentId, VehicleId)
|
+--- 1:1 --- Place(AccidentId)
We build the input X
as follows:
# We use `Accidents.txt` table of `AccidentsSummary` as it contains the `Gravity` label pre-calculated
accidents_df = pd.read_csv(f"{kh.get_samples_dir()}/AccidentsSummary/Accidents.txt", sep="\t", encoding="latin1")
vehicles_df = pd.read_csv(f"{kh.get_samples_dir()}/Accidents/Vehicles.txt", sep="\t", encoding="latin1")
users_df = pd.read_csv(f"{kh.get_samples_dir()}/Accidents/Users.txt", sep="\t", encoding="latin1")
places_df = pd.read_csv(f"{kh.get_samples_dir()}/Accidents/Places.txt", sep="\t", encoding="latin1")
X = {
"main_table": "Accidents",
"tables": {
"Accidents": (accidents_df.drop("Gravity", axis=1), "AccidentId"),
"Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
"Users": (users_df, ["AccidentId", "VehicleId"]),
"Places": (places_df, ["AccidentId"]),
},
"relations": [
("Accidents", "Vehicles"),
("Vehicles", "Users"),
("Accidents", "Places", True),
],
}
Both datasets can be found in the Khiops samples directory.
Multi-table learning with the Core API¶
The functions in khiops.core
that allow using multi-table datasets have the optional parameter
additional_data_tables
. This dictionary links the secondary tables to their data file paths and
it’s indexed by their data paths which are specified as the regular expression:
root_table_name(`table_variable_name)*
Specifically:
the data path for a root table is its name
the data path for a secondary table is composed of the name of its source root table followed by the chain of table variable names leading to it. The path parts are separated by a backtick
`
.
Types of secondary tables include:
Table
type: sub-tables in a 0:n relationshipExample: A “Customers” main table with a “Services” secondary table describing the services that each customer has subscribed to. So a customer can have zero services (inactive customer) or one or many.
Entity
table: sub-tables in a 0:1 relationshipExample: A “Customers” main table with a “Address” secondary table describing the address of a customer with fields such as “Street”, “StreetNumber”, etc. In this setting a customer can have at most one address.
External data tables: Another table set (with a
Root
table) that is entirely loaded in memoryExample: The “Address” sub-table in the example above can point to a table “City” containing information about the city where the address is located. The number of cities is much smaller than the number of addresses so it may make sense to load it entirely in memory for efficiency reasons.
Note that besides the root table names the components of a data path are table variable names and not table names. For further details about the multi-table capabilities of Khiops refer to the documentation at the Khiops site.
The class DictionaryDomain
provides the helper method extract_data_paths
that extracts the
data paths from a given root dictionary.
Note
To execute multi-table tasks, Khiops requires the data table files to be sorted by their key
columns. You may use the sort_data_table
function to preprocess your data files before
executing these tasks.
Examples¶
Star Schema¶
Let’s consider the following Khiops dictionary file for the AccidentsSummary
dataset
found in Khiops samples. Note that tables in this dataset are related through a star schema.
# samples/AccidentsSummary/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
Categorical Gravity;
// <more variables ...>
Table(Vehicle) Vehicles; // This is a table variable (type Table)
};
Dictionary Vehicle(AccidentId, VehicleId)
{
Categorical AccidentId;
Categorical VehicleId;
Categorical Direction;
Categorical Category;
// <more variables ...>
};
This dictionary represents the following relational schema:
Accident(AccidentId)
|
+---1:n--- Vehicle(AccidentId, VehicleId)
In this case the additional_data_tables
argument consists of only one path: that of the
secondary table Vehicle
. Since it is pointed by the main table Accident
via the table
variable Vehicle
the additional_data_tables
parameter should be set as:
additional_data_tables = {"Accident`Vehicles": f"{kh.get_samples_dir()}/Vehicles.txt"}
Snowflake Schema¶
Let’s now consider the dictionary file for the Accidents
dataset where tables are related
through a snowflake schema.
# samples/Accidents/Accidents.kdic
Root Dictionary Accident(AccidentId)
{
Categorical AccidentId;
// The target "Gravity" is calculated from a sub-table
// See: https://khiops.org/setup/KhiopsGuide.pdf#page=58
Categorical Gravity = IfC(
G(TableSum(Vehicles, TableCount(TableSelection(Users, EQc(Gravity, "Death")))), 0),
"Lethal", "NonLethal");
// <more variables..>
Entity(Place) Place; // This is a table variable type Entity: 1-1 relation)
Table(Vehicle) Vehicles; // This is a table variable (type Table)
};
Dictionary Place(AccidentId)
{
Categorical AccidentId;
Categorical RoadType;
// <more variables..>
Categorical SchoolNear;
};
Dictionary Vehicle(AccidentId, VehicleId)
{
Categorical AccidentId;
Categorical VehicleId;
// <more variables..>
Table(User) Users; // This is a table variable (type Table)
};
Dictionary User(AccidentId, VehicleId) {
Categorical AccidentId;
Categorical VehicleId;
Categorical Seat;
Categorical Category;
Unused Categorical Gravity; // Must be disabled since the target is a function of it
// <more variables..>
Numerical BirthYear;
};
This time, the relational schema is as follows:
Accident(AccidentId)
|
+--- 1:n --- Vehicle(AccidentId, VehicleId)
| |
| +--- 1:n --- User(AccidentId, VehicleId)
|
+--- 1:1 --- Place(AccidentId)
The additional_data_tables
parameter must be set as:
additional_data_tables = {
"Accident`Place": "/path/to/Places.txt",
"Accident`Vehicles": "/path/to/Vehicles.txt",
"Accident`Vehicles`Users": "/path/to/Users.txt"
}