utils.dataset

Submodule of khiops.utils

Classes for handling diverse data tables

Functions

check_dataset_spec

Checks that a dataset spec is valid

get_khiops_type

Translates a numpy dtype to a Khiops dictionary type

get_khiops_variable_name

Returns the Khiops variable name associated with a column id

read_internal_data_table

Reads a data table file into a DataFrame using the internal format settings

write_internal_data_table

Writes a DataFrame to a data table file using the internal format settings

Classes

Dataset

A representation of a dataset

DatasetTable

A generic dataset table

FileTable

DatasetTable encapsulating a delimited text data file

NumpyTable

DatasetTable encapsulating a NumPy array

PandasTable

DatasetTable encapsulating a pandas dataframe

SparseTable

DatasetTable encapsulating a SciPy sparse matrix

class khiops.utils.dataset.Dataset(X, y=None, categorical_target=True, key=None)

Bases: object

A representation of a dataset

Parameters:
X: pandas.DataFrame or dict (Deprecated types: tuple and list)
Either:
  • A single dataframe

  • A dict dataset specification

y: pandas.Series or str, optional

The target column.

categorical_target: bool, default True

True if the vector y should be considered as a categorical variable. If False it is considered as numeric. Ignored if y is None.

key: str

The name of the key column for all tables. Deprecated: Will be removed in khiops-python 11.
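As an illustration of the dict form of X, a dataset specification might look like the sketch below. The keys `main_table`, `tables` and `relations`, and the `(source, key)` tuple layout, are assumptions based on the multi-table format referred to in the Multi-Table Learning Primer; they are not stated on this page. The table names and file paths are hypothetical placeholders.

```python
# Hypothetical two-table dataset: one row per client, many orders per client.
# The layout (keys and tuple shapes) is an assumption, not taken from this page.
dataset_spec = {
    "main_table": "clients",
    "tables": {
        # table name -> (table source, key column name or names)
        "clients": ("data/clients.txt", "client_id"),
        "orders": ("data/orders.txt", ["client_id", "order_id"]),
    },
    # (parent table, child table) pairs
    "relations": [("clients", "orders")],
}

# Basic consistency: every table mentioned elsewhere is declared in "tables".
declared = set(dataset_spec["tables"])
referenced = {dataset_spec["main_table"]}
for parent, child in dataset_spec["relations"]:
    referenced.update((parent, child))
undeclared = referenced - declared
```

In this sketch `undeclared` is empty, which is the minimum one would expect of a coherent specification.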

copy()

Creates a copy of the dataset

Referenced pandas.DataFrame, numpy.ndarray and scipy.sparse.spmatrix objects in tables are copied as references.
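What "copied as references" implies can be sketched with plain Python objects. The `TableHolder` class below is a hypothetical stand-in, not the Dataset class; it only shows that a copy of the container still shares the underlying data objects.

```python
import copy


class TableHolder:
    """Hypothetical stand-in for an object that holds table data."""

    def __init__(self, tables):
        self.tables = tables


original = TableHolder({"main": [[1, 2], [3, 4]]})

# Copy the holder and its container, but not the data objects inside it.
duplicate = copy.copy(original)
duplicate.tables = dict(original.tables)

# The containers are distinct, the data object is shared.
same_container = duplicate.tables is original.tables
same_data = duplicate.tables["main"] is original.tables["main"]

# An in-place change made through the copy is visible through the original.
duplicate.tables["main"][0][0] = 99
seen_in_original = original.tables["main"][0][0]
```

If independent data is needed, the underlying objects would have to be copied explicitly before modifying them in place.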

create_khiops_dictionary_domain()

Creates a Khiops dictionary domain representing this dataset

Returns:
DictionaryDomain

The dictionary domain object representing this dataset

create_table_files_for_khiops(output_dir, sort=True)

Prepares the tables of the dataset to be used by Khiops

If this is a multi-table dataset, it will create sorted copies of the tables.

Parameters:
output_dir: str

The directory where the sorted tables will be created.

Returns:
tuple

A tuple containing:

  • The path of the main table

  • A dictionary containing the relation [table-name -> file-path] for the secondary tables. The dictionary is empty for monotable datasets.

get_table(table_name)

Returns a table by its name

Parameters:
table_name: str

The name of the table to be retrieved.

Returns:
DatasetTable

The table object for the specified name.

Raises:
KeyError

If there is no table with the specified name.
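Because a missing table is reported with KeyError, callers that want a fallback can catch it. The sketch below uses a hypothetical `StubDataset` that mimics only the documented `get_table` contract; `get_table_or_default` is likewise a helper introduced for illustration, not part of the library.

```python
class StubDataset:
    """Hypothetical stand-in exposing only the documented get_table contract."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def get_table(self, table_name):
        # A plain dict lookup raises KeyError for unknown names.
        return self._tables[table_name]


def get_table_or_default(dataset, table_name, default=None):
    """Return the named table, or `default` if the dataset has no such table."""
    try:
        return dataset.get_table(table_name)
    except KeyError:
        return default


stub = StubDataset({"clients": "clients-table-object"})
found = get_table_or_default(stub, "clients")
missing = get_table_or_default(stub, "orders", default="no such table")
```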

property is_in_memory

bool : True if the dataset is in-memory

A dataset is in-memory if all of its tables are pandas.DataFrame, numpy.ndarray or scipy.sparse.spmatrix objects.

property is_multitable

bool : True if the dataset is multitable

property table_type

type : The table type of this dataset’s tables

Possible values: PandasTable, NumpyTable, SparseTable or FileTable.

to_spec()

Returns a dictionary specification of this dataset

class khiops.utils.dataset.DatasetTable(name, key=None)

Bases: ABC

A generic dataset table

check_key()

Checks that the key columns exist

create_khiops_dictionary()

Creates a Khiops dictionary representing this table

Returns:
Dictionary

The Khiops Dictionary object describing this table’s schema

abstract create_table_file_for_khiops(output_dir, sort=True)

Creates a copy of the table at the specified directory

n_features()

Returns the number of features of the table

The target column does not count.

class khiops.utils.dataset.FileTable(name, path, key=None, sep='\t', header=True)

Bases: DatasetTable

DatasetTable encapsulating a delimited text data file

Parameters:
name: str

Name for the table.

path: str

Path of the file containing the table.

key: list-like of str, optional

The names of the columns composing the key.

sep: str, optional

Field separator character. If not specified it will be inferred from the file.

header: bool, optional

Indicates whether the file has a header line.

create_table_file_for_khiops(output_dir, sort=True)

Creates a copy of the table at the specified directory

class khiops.utils.dataset.NumpyTable(name, array, key=None)

Bases: DatasetTable

DatasetTable encapsulating a NumPy array

Parameters:
name: str

Name for the table.

array: numpy.ndarray of shape (n_samples, n_features_in)

The array to be encapsulated.

key: array-like of int, optional

The names of the columns composing the key.

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

class khiops.utils.dataset.PandasTable(name, dataframe, key=None)

Bases: DatasetTable

DatasetTable encapsulating a pandas dataframe

Parameters:
name: str

Name for the table.

dataframe: pandas.DataFrame

The data frame to be encapsulated. It must be non-empty.

key: list of str, optional

The names of the columns composing the key.

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

class khiops.utils.dataset.SparseTable(name, matrix, key=None)

Bases: DatasetTable

DatasetTable encapsulating a SciPy sparse matrix

Parameters:
name: str

Name for the table.

matrix: scipy.sparse.spmatrix

The sparse matrix to be encapsulated.

key: list of str, optional

The names of the columns composing the key.

create_khiops_dictionary()

Creates a Khiops dictionary representing this sparse table

Adds metadata to each sparse variable

Returns:
Dictionary

The Khiops Dictionary object describing this table’s schema

create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)

Creates a copy of the table at the specified directory

khiops.utils.dataset.check_dataset_spec(ds_spec)

Checks that a dataset spec is valid

Parameters:
ds_spec: dict

A specification of a multi-table dataset (see Multi-Table Learning Primer).

Raises:
TypeError

If there are objects of the spec with invalid type.

ValueError

If there are objects of the spec with invalid values.
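The split between the two exceptions can be sketched as follows: a wrong kind of object raises TypeError, while a well-typed but inconsistent value raises ValueError. The function `sketch_check_spec` and the specific rules it enforces (including the `tables` and `main_table` keys) are assumptions for illustration only; the checks actually performed by `check_dataset_spec` are not listed on this page.

```python
def sketch_check_spec(ds_spec):
    """Illustrative validator; the rules here are assumed, not the library's."""
    if not isinstance(ds_spec, dict):
        raise TypeError("the dataset spec must be a dict")
    if "tables" not in ds_spec:
        raise ValueError("the dataset spec must contain a 'tables' entry")
    tables = ds_spec["tables"]
    if not isinstance(tables, dict):
        raise TypeError("'tables' must be a dict")
    if not tables:
        raise ValueError("'tables' must not be empty")
    main_table = ds_spec.get("main_table")
    if main_table is not None:
        if not isinstance(main_table, str):
            raise TypeError("'main_table' must be a str")
        if main_table not in tables:
            raise ValueError("'main_table' must be one of the declared tables")


def outcome(ds_spec):
    """Return the name of the exception raised, or 'ok' if none is raised."""
    try:
        sketch_check_spec(ds_spec)
    except (TypeError, ValueError) as error:
        return type(error).__name__
    return "ok"
```

Callers can then treat TypeError as a programming mistake in how the spec was built, and ValueError as a problem with its contents.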

khiops.utils.dataset.get_khiops_type(numpy_type)

Translates a numpy dtype to a Khiops dictionary type

Parameters:
numpy_type: numpy.dtype

Numpy type of the column

Returns:
str

Khiops type name. Either “Categorical”, “Numerical” or “Timestamp”
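A translation of this kind can be sketched on dtype names. Only the three return values come from the description above; the mapping rules (integer and floating-point dtypes to "Numerical", datetime64 to "Timestamp", everything else to "Categorical") and the function name `sketch_khiops_type` are assumptions, and the sketch works on the dtype's string name so that it runs without NumPy.

```python
def sketch_khiops_type(dtype_name):
    """Map a dtype name such as 'int64' to a Khiops type name (assumed rules)."""
    name = str(dtype_name).lower()
    if name.startswith(("int", "uint", "float")):
        return "Numerical"
    if name.startswith("datetime64"):
        return "Timestamp"
    # Assumption: anything else (object, str, category, bool, ...) is categorical.
    return "Categorical"


examples = {
    name: sketch_khiops_type(name)
    for name in ("int64", "float32", "datetime64[ns]", "object")
}
```

With a real array, `str(array.dtype)` would give the name passed to the sketch.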

khiops.utils.dataset.get_khiops_variable_name(column_id)

Returns the Khiops variable name associated with a column id

khiops.utils.dataset.read_internal_data_table(file_path_or_stream)

Reads a data table file into a DataFrame using the internal format settings

The table is read with the following settings:

  • Use tab as separator

  • Read the column names from the first line

  • Use '"' as quote character

  • Use csv.QUOTE_MINIMAL

  • Double quoting enabled (quotes within quotes can be escaped with '""')

  • UTF-8 encoding

Parameters:
file_path_or_stream: str or file object

The path of the internal data table file to be read or a readable file object.

Returns:
pandas.DataFrame

The dataframe representation.
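The effect of these settings can be seen with the standard library's csv module, used here as a stand-in that does not require pandas or Khiops. This is a sketch of the format only, not of how `read_internal_data_table` is implemented; the sample content is hypothetical.

```python
import csv
import io

# Tab-separated content with a header line. The second field of the data row
# contains a tab and embedded quotes, so it is quoted and its quotes are doubled.
raw = 'id\tcomment\n1\t"said ""hi""\tthen left"\n'

reader = csv.reader(
    io.StringIO(raw),
    delimiter="\t",             # tab as separator
    quotechar='"',              # '"' as quote character
    quoting=csv.QUOTE_MINIMAL,  # only fields that need it are quoted
    doublequote=True,           # '""' inside a quoted field means one '"'
)
header = next(reader)  # column names come from the first line
rows = list(reader)
```

When reading from a path instead of an in-memory stream, the file would be opened with `encoding="utf-8"` to match the last setting in the list.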

khiops.utils.dataset.write_internal_data_table(dataframe, file_path_or_stream)

Writes a DataFrame to a data table file using the internal format settings

The table is written with the following settings:

  • Use tab as separator

  • Write the column names on the first line

  • Use '"' as quote character

  • Use csv.QUOTE_MINIMAL

  • Double quoting enabled (quotes within quotes can be escaped with '""')

  • UTF-8 encoding

  • The index is not written

Parameters:
dataframe: pandas.DataFrame

The dataframe to write.

file_path_or_stream: str or file object

The path of the internal data table file to be written or a writable file object.
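The written format can be sketched with the standard library's csv module, which accepts the same quoting options. This is an illustration of the resulting file layout under the settings listed above, not the implementation of `write_internal_data_table`; the column names and records are hypothetical, and `lineterminator="\n"` is chosen here only to keep the output easy to inspect.

```python
import csv
import io

columns = ["id", "comment"]
records = [[1, "plain"], [2, 'has "quotes"'], [3, "has\ttab"]]

buffer = io.StringIO()
writer = csv.writer(
    buffer,
    delimiter="\t",             # tab as separator
    quotechar='"',              # '"' as quote character
    quoting=csv.QUOTE_MINIMAL,  # quote only fields containing tab or quote
    doublequote=True,           # embedded quotes are written as '""'
    lineterminator="\n",
)
writer.writerow(columns)   # column names on the first line
writer.writerows(records)  # data rows only, no index column
lines = buffer.getvalue().splitlines()
```

Only the second and third data rows end up quoted, because they are the only ones containing the quote character or the separator.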