Sklearn Basics 4: Train a Coclustering

The steps to train a coclustering model with Khiops are very similar to what we have already seen in the basic classifier tutorials.

We start by importing the KhiopsCoclustering estimator and some helper functions:

from os import path
import numpy as np
import pandas as pd
from helper_functions import explorer_open

from khiops.sklearn import KhiopsCoclustering

For this tutorial, we use the dataset CountriesByOrganization, which contains the country-organization relation for a large number of countries and organizations (it is a bit outdated, though). The objective is to build a coclustering between Country and Organization and see which countries resemble each other the most in terms of the organizations they belong to.

Let’s first load this dataset and check its content:

countries_data_file = path.join(
    "data", "CountriesByOrganization", "CountriesByOrganization.csv"
)
X_countries = pd.read_csv(countries_data_file, sep=";")
print("CountriesByOrganization dataset:")
display(X_countries)
CountriesByOrganization dataset:
           Country Organization
0      Afghanistan         AsDB
1      Afghanistan      COLOMBO
2      Afghanistan          ECO
3      Afghanistan       ICCROM
4      Afghanistan          NAM
...            ...          ...
11187     Zimbabwe          WHO
11188     Zimbabwe         WIPO
11189     Zimbabwe          WMO
11190     Zimbabwe          WTO
11191     Zimbabwe       WTOURO

[11192 rows x 2 columns]

Now, let’s build the coclustering model.

Note that a coclustering model is learned in an unsupervised way and aims to jointly cluster the rows and columns of a matrix. To be able to deploy it on a specific column, we therefore need to provide that column's name. We do this by setting the fit parameter id_column:

khcc_countries = KhiopsCoclustering()
khcc_countries.fit(X_countries, id_column="Country")
KhiopsCoclustering()
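
To make the "rows and columns of a matrix" view more concrete, here is a minimal sketch that pivots a few (Country, Organization) pairs into a binary country-by-organization table using only the standard library. The pairs are copied from the dataset preview above; the variable names are illustrative, and this is not a description of how Khiops represents the data internally.

```python
from collections import defaultdict

# (Country, Organization) pairs copied from the dataset preview above.
pairs = [
    ("Afghanistan", "AsDB"),
    ("Afghanistan", "COLOMBO"),
    ("Afghanistan", "ECO"),
    ("Zimbabwe", "WHO"),
    ("Zimbabwe", "WIPO"),
    ("Zimbabwe", "WMO"),
]

# Group the organizations by country: one matrix row per distinct country.
memberships = defaultdict(set)
for country, organization in pairs:
    memberships[country].add(organization)

# The matrix columns are the distinct organizations, in sorted order.
organizations = sorted({organization for _, organization in pairs})

# Binary matrix: 1 if the country belongs to the organization, 0 otherwise.
matrix = {
    country: [int(organization in orgs) for organization in organizations]
    for country, orgs in memberships.items()
}

print(organizations)
for country, row in matrix.items():
    print(f"{country:12s} {row}")
```

A coclustering groups the rows (countries) and the columns (organizations) of such a table at the same time. The fit call above receives the two-column table directly, so this pivot is only meant to build intuition.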

Now let’s access the coclustering training report to obtain the cluster information of the Country dimension. Since the clusters of each dimension form a hierarchy, we only access the leaf clusters:

countries_clusters = khcc_countries.model_report_.coclustering_report.get_dimension(
    "Country"
).clusters
countries_leaf_clusters = [cluster for cluster in countries_clusters if cluster.is_leaf]
print(f"Number of leaf clusters: {len(countries_leaf_clusters)}:")
for index, cluster in enumerate(countries_leaf_clusters, start=1):
    print(f"cluster {index:02d}: {cluster.name}")
Number of leaf clusters: 12:
cluster 01: {Germany, France, Netherlands, ...}
cluster 02: {United States of America, Canada, Japan, ...}
cluster 03: {Poland, Hungary, Turkey, ...}
cluster 04: {Kazakhstan, Kyrgyzstan, Azerbaijan, ...}
cluster 05: {Venezuela, Nicaragua, Ecuador, ...}
cluster 06: {Trinidad and Tobago, Barbados, Grenada, ...}
cluster 07: {Niger, Ivory Coast, Benin, ...}
cluster 08: {Tanzania, Uganda, Kenya, ...}
cluster 09: {Qatar, Saudi Arabia, United Arab Emirates, ...}
cluster 10: {Tunisia, Algeria, Morocco, ...}
cluster 11: {India, Malaysia, Indonesia, ...}
cluster 12: {Papua New Guinea, Fiji, Nepal, ...}

The composition of the clusters is also available. For the first one we have:

print(f"Members of the cluster {countries_leaf_clusters[0].name}:")
for value_obj in countries_leaf_clusters[0].leaf_part.values:
    print(value_obj.value)
Members of the cluster {Germany, France, Netherlands, ...}:
Germany
France
Netherlands
Denmark
Sweden
Belgium
Finland
Italy
Norway
Spain
Portugal
Austria
United Kingdom
Luxembourg
Switzerland
Greece
Ireland
Iceland
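
The same traversal can be turned into a lookup table from each member to its leaf cluster. The sketch below assumes only the attributes already used above (name, leaf_part.values and value); the helper names build_member_index and make_stub_cluster are hypothetical, and the stub objects stand in for the report's clusters so that the example runs without a trained model.

```python
from types import SimpleNamespace


def build_member_index(leaf_clusters):
    """Map each member value to the name of the leaf cluster containing it."""
    index = {}
    for cluster in leaf_clusters:
        for value_obj in cluster.leaf_part.values:
            index[value_obj.value] = cluster.name
    return index


def make_stub_cluster(name, members):
    """Hypothetical stand-in exposing only .name and .leaf_part.values[i].value."""
    values = [SimpleNamespace(value=member) for member in members]
    return SimpleNamespace(name=name, leaf_part=SimpleNamespace(values=values))


# Stub data: cluster names and (truncated) member lists taken from the outputs above.
stub_clusters = [
    make_stub_cluster(
        "{Germany, France, Netherlands, ...}", ["Germany", "France", "Netherlands"]
    ),
    make_stub_cluster(
        "{Tunisia, Algeria, Morocco, ...}", ["Tunisia", "Algeria", "Morocco"]
    ),
]

member_index = build_member_index(stub_clusters)
print(member_index["France"])
```

With the fitted model, calling build_member_index(countries_leaf_clusters) should produce the same kind of mapping, provided the report objects behave as in the loop above.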

The coclustering is a complex model, so it is better to visualize it with the Khiops Co-visualization app. Let’s export the report to a .khcj file and open it:

countries_report = path.join("exercises", "countries.khcj")
khcc_countries.export_report_file(countries_report)
# explorer_open(countries_report)

Finally, let’s deploy the coclustering model on the training data X_countries:

countries_predictions = khcc_countries.predict(X_countries)
print("Predicted clusters (first 10)")
display(countries_predictions[:10])
Predicted clusters (first 10)
array([['Afghanistan', '{India, Malaysia, Indonesia, ...}'],
       ['Albania', '{Poland, Hungary, Turkey, ...}'],
       ['Algeria', '{Tunisia, Algeria, Morocco, ...}'],
       ['Andorra', '{Poland, Hungary, Turkey, ...}'],
       ['Angola', '{Niger, Ivory Coast, Benin, ...}'],
       ['Antigua and Barbuda',
        '{Trinidad and Tobago, Barbados, Grenada, ...}'],
       ['Argentina', '{Venezuela, Nicaragua, Ecuador, ...}'],
       ['Armenia', '{Kazakhstan, Kyrgyzstan, Azerbaijan, ...}'],
       ['Australia', '{United States of America, Canada, Japan, ...}'],
       ['Austria', '{Germany, France, Netherlands, ...}']], dtype=object)
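
Since each prediction row pairs an identifier with a cluster name, it is straightforward to regroup the rows by cluster. The following sketch does so with the ten rows shown above, copied into a plain list so that it runs without the trained model; with the real predictions, the same loop should work on countries_predictions, assuming it keeps this two-column layout.

```python
from collections import defaultdict

# First ten prediction rows, copied from the output above: [Country, cluster name].
prediction_rows = [
    ["Afghanistan", "{India, Malaysia, Indonesia, ...}"],
    ["Albania", "{Poland, Hungary, Turkey, ...}"],
    ["Algeria", "{Tunisia, Algeria, Morocco, ...}"],
    ["Andorra", "{Poland, Hungary, Turkey, ...}"],
    ["Angola", "{Niger, Ivory Coast, Benin, ...}"],
    ["Antigua and Barbuda", "{Trinidad and Tobago, Barbados, Grenada, ...}"],
    ["Argentina", "{Venezuela, Nicaragua, Ecuador, ...}"],
    ["Armenia", "{Kazakhstan, Kyrgyzstan, Azerbaijan, ...}"],
    ["Australia", "{United States of America, Canada, Japan, ...}"],
    ["Austria", "{Germany, France, Netherlands, ...}"],
]

# Collect the countries assigned to each cluster, keeping the row order.
countries_by_cluster = defaultdict(list)
for country, cluster_name in prediction_rows:
    countries_by_cluster[cluster_name].append(country)

for cluster_name, countries in countries_by_cluster.items():
    print(f"{cluster_name}: {', '.join(countries)}")
```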

Exercise

We’ll build a coclustering model for the Tokyo2021 dataset. It is extracted from the Athletes table of the Tokyo 2021 Kaggle dataset, and each record contains three variables:

- Name: the name of a competing athlete
- Country: the country (or organization) the athlete represents
- Discipline: the athlete’s discipline

The objective of this exercise is to build a coclustering between Country and Discipline and see which countries resemble each other the most in terms of the athletes they bring to the Olympics. We start by loading the contents into a dataframe:

tokyo_data_file = path.join("data", "Tokyo2021", "Athletes.csv")
X_tokyo = pd.read_csv(tokyo_data_file, encoding="ISO-8859-1")
print("Tokyo2021 dataset (first 10 rows):")
display(X_tokyo.head(10))
Tokyo2021 dataset (first 10 rows):
                Name                   Country           Discipline
0    AALERUD Katrine                    Norway         Cycling Road
1        ABAD Nestor                     Spain  Artistic Gymnastics
2  ABAGNALE Giovanni                     Italy               Rowing
3     ABALDE Alberto                     Spain           Basketball
4      ABALDE Tamara                     Spain           Basketball
5          ABALO Luc                    France             Handball
6       ABAROA Cesar                     Chile               Rowing
7      ABASS Abobakr                     Sudan             Swimming
8   ABBASALI Hamideh  Islamic Republic of Iran               Karate
9      ABBASOV Islam                Azerbaijan            Wrestling

Train the coclustering for the variables Country and Discipline

Call fit with the following parameters:

- X=X_tokyo[["Country", "Discipline"]]
- id_column="Country"

khcc_tokyo = KhiopsCoclustering()
khcc_tokyo.fit(X_tokyo[["Country", "Discipline"]], id_column="Country")
KhiopsCoclustering()

Obtain the number and names of the clusters of the Country dimension

tokyo_clusters = khcc_tokyo.model_report_.coclustering_report.get_dimension(
    "Country"
).clusters
tokyo_leaf_clusters = [cluster for cluster in tokyo_clusters if cluster.is_leaf]
print(f"Number of leaf clusters: {len(tokyo_leaf_clusters)}:")
for index, cluster in enumerate(tokyo_leaf_clusters, start=1):
    print(f"cluster {index:02d}: {cluster.name}")
Number of leaf clusters: 39:
cluster 01: {Ghana, Kosovo, Republic of Moldova, ...}
cluster 02: {Jamaica, Ethiopia, Trinidad and Tobago, ...}
cluster 03: {Kenya, Fiji}
cluster 04: {Uzbekistan, Azerbaijan, Mongolia, ...}
cluster 05: {Serbia, Islamic Republic of Iran}
cluster 06: {Turkey, Tunisia, Venezuela, ...}
cluster 07: {Chinese Taipei, Thailand, Indonesia, ...}
cluster 08: {Switzerland, Austria, Hong Kong, China, ...}
cluster 09: {Colombia, Morocco, Ecuador, ...}
cluster 10: {Ukraine, Belarus, Slovakia}
cluster 11: {Kazakhstan, Croatia, Greece}
cluster 12: {Japan}
cluster 13: {Argentina}
cluster 14: {Republic of Korea}
cluster 15: {Egypt}
cluster 16: {Israel, Dominican Republic}
cluster 17: {Mexico}
cluster 18: {Zambia, Saudi Arabia, Honduras, ...}
cluster 19: {Romania}
cluster 20: {Great Britain, Ireland}
cluster 21: {New Zealand}
cluster 22: {Australia}
cluster 23: {Canada}
cluster 24: {People's Republic of China}
cluster 25: {United States of America}
cluster 26: {Italy}
cluster 27: {Poland, Lithuania}
cluster 28: {Germany, Belgium}
cluster 29: {India}
cluster 30: {Czech Republic, Nigeria, Slovenia, ...}
cluster 31: {Spain}
cluster 32: {South Africa}
cluster 33: {Netherlands}
cluster 34: {ROC}
cluster 35: {Hungary, Montenegro}
cluster 36: {Norway, Denmark, Portugal}
cluster 37: {Sweden, Angola, Bahrain}
cluster 38: {Brazil}
cluster 39: {France}

Deploy the learned coclustering model on the training data and check the obtained clusters

tokyo_predictions = khcc_tokyo.predict(X_tokyo[["Country", "Discipline"]])
print("Predicted clusters (first 10)")
display(tokyo_predictions[:10])
Predicted clusters (first 10)
array([['Norway', '{Norway, Denmark, Portugal}'],
       ['Spain', '{Spain}'],
       ['Italy', '{Italy}'],
       ['France', '{France}'],
       ['Chile', '{Zambia, Saudi Arabia, Honduras, ...}'],
       ['Sudan', '{Ghana, Kosovo, Republic of Moldova, ...}'],
       ['Islamic Republic of Iran', '{Serbia, Islamic Republic of Iran}'],
       ['Azerbaijan', '{Uzbekistan, Azerbaijan, Mongolia, ...}'],
       ['Netherlands', '{Netherlands}'],
       ['Australia', '{Australia}']], dtype=object)
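
Judging from the output above, the predictions appear to contain one row per distinct Country rather than one row per athlete. Under that assumption, a cluster can be attached to each athlete with a simple dictionary lookup. The sketch below uses a few rows copied from the dataset preview and from the prediction output; the variable names are illustrative.

```python
# Prediction rows copied from the output above: [Country, cluster name].
prediction_rows = [
    ["Norway", "{Norway, Denmark, Portugal}"],
    ["Spain", "{Spain}"],
    ["Italy", "{Italy}"],
    ["France", "{France}"],
    ["Chile", "{Zambia, Saudi Arabia, Honduras, ...}"],
    ["Sudan", "{Ghana, Kosovo, Republic of Moldova, ...}"],
]

# A few athletes copied from the dataset preview: (Name, Country).
athletes = [
    ("AALERUD Katrine", "Norway"),
    ("ABAD Nestor", "Spain"),
    ("ABALDE Alberto", "Spain"),
    ("ABALO Luc", "France"),
    ("ABASS Abobakr", "Sudan"),
]

# Lookup table from country to predicted cluster name.
cluster_of_country = {country: cluster for country, cluster in prediction_rows}

# Attach the cluster of the athlete's country; None if the country is unknown.
athlete_clusters = [
    (name, cluster_of_country.get(country)) for name, country in athletes
]

for name, cluster in athlete_clusters:
    print(f"{name}: {cluster}")
```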