utils.dataset
Submodule of khiops.utils
Classes for handling diverse data tables
Functions
- check_dataset_spec: Checks that a dataset spec is valid
- get_khiops_type: Translates a numpy dtype to a Khiops dictionary type
- get_khiops_variable_name: Returns the Khiops variable name associated with a column id
- read_internal_data_table: Reads a data table file into a DataFrame with the internal format settings
- write_internal_data_table: Writes a DataFrame to a data table file with the internal format settings
Classes
- Dataset: A representation of a dataset
- DatasetTable: A generic dataset table
- FileTable: DatasetTable encapsulating a delimited text data file
- NumpyTable: DatasetTable encapsulating a NumPy array
- PandasTable: DatasetTable encapsulating a pandas dataframe
- SparseTable: DatasetTable encapsulating a SciPy sparse matrix
- class khiops.utils.dataset.Dataset(X, y=None, categorical_target=True, key=None)
Bases: object
A representation of a dataset
- Parameters:
- X : pandas.DataFrame or dict (Deprecated types: tuple and list)
Either a single dataframe or a dict dataset specification.
- y : pandas.Series or str, optional
The target column.
- categorical_target : bool, default True
True if the vector y should be considered as a categorical variable. If False it is considered as numeric. Ignored if y is None.
- key : str
The name of the key column for all tables. Deprecated: Will be removed in khiops-python 11.
- copy()
Creates a copy of the dataset
Referenced pandas.DataFrame, numpy.ndarray and scipy.sparse.spmatrix objects in the tables are copied as references.
- create_khiops_dictionary_domain()
Creates a Khiops dictionary domain representing this dataset
- Returns:
DictionaryDomain
The dictionary domain object representing this dataset
- create_table_files_for_khiops(output_dir, sort=True)
Prepares the tables of the dataset to be used by Khiops
If this is a multi-table dataset, it creates sorted copies of the tables.
- Parameters:
- output_dir : str
The directory where the sorted tables will be created.
- Returns:
- tuple
A tuple containing the path of the main table and a dictionary with the relation [table-name -> file-path] for the secondary tables. The dictionary is empty for monotable datasets.
- get_table(table_name)
Returns a table by its name
- Parameters:
- table_name : str
The name of the table to be retrieved.
- Returns:
DatasetTable
The table object for the specified name.
- Raises:
KeyError
If there is no table with the specified name.
- property is_in_memory
bool : True if the dataset is in-memory
A dataset is in-memory if it consists only of pandas.DataFrame, numpy.ndarray or scipy.sparse.spmatrix tables.
- property is_multitable
bool : True if the dataset is multitable
- property table_type
type : The table type of this dataset’s tables
Possible values:
- to_spec()
Returns a dictionary specification of this dataset
- class khiops.utils.dataset.DatasetTable(name, key=None)
Bases: ABC
A generic dataset table
- check_key()
Checks that the key columns exist
- create_khiops_dictionary()
Creates a Khiops dictionary representing this table
- Returns:
Dictionary
The Khiops Dictionary object describing this table’s schema
- abstract create_table_file_for_khiops(output_dir, sort=True)
Creates a copy of the table at the specified directory
- n_features()
Returns the number of features of the table
The target column does not count.
- class khiops.utils.dataset.FileTable(name, path, key=None, sep='\t', header=True)
Bases: DatasetTable
DatasetTable encapsulating a delimited text data file
- Parameters:
- name : str
Name for the table.
- path : str
Path of the file containing the table.
- key : list-like of str, optional
The names of the columns composing the key.
- sep : str, optional
Field separator character. If not specified it will be inferred from the file.
- header : bool, optional
Indicates whether the table file has a header line.
- create_table_file_for_khiops(output_dir, sort=True)
Creates a copy of the table at the specified directory
- class khiops.utils.dataset.NumpyTable(name, array, key=None)
Bases: DatasetTable
DatasetTable encapsulating a NumPy array
- Parameters:
- name : str
Name for the table.
- array : numpy.ndarray of shape (n_samples, n_features_in)
The array to be encapsulated.
- key : array-like of int, optional
The names of the columns composing the key.
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)
Creates a copy of the table at the specified directory
- class khiops.utils.dataset.PandasTable(name, dataframe, key=None)
Bases: DatasetTable
DatasetTable encapsulating a pandas dataframe
- Parameters:
- name : str
Name for the table.
- dataframe : pandas.DataFrame
The data frame to be encapsulated. It must be non-empty.
- key : list of str, optional
The names of the columns composing the key.
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)
Creates a copy of the table at the specified directory
- class khiops.utils.dataset.SparseTable(name, matrix, key=None)
Bases: DatasetTable
DatasetTable encapsulating a SciPy sparse matrix
- Parameters:
- name : str
Name for the table.
- matrix : scipy.sparse.spmatrix
The sparse matrix to be encapsulated.
- key : list of str, optional
The names of the columns composing the key.
- create_khiops_dictionary()
Creates a Khiops dictionary representing this sparse table
Adds metadata to each sparse variable
- Returns:
Dictionary
The Khiops Dictionary object describing this table’s schema
- create_table_file_for_khiops(output_dir, sort=True, target_column=None, target_column_id=None)
Creates a copy of the table at the specified directory
- khiops.utils.dataset.check_dataset_spec(ds_spec)
Checks that a dataset spec is valid
- Parameters:
- ds_spec : dict
A specification of a multi-table dataset (see Multi-Table Learning Primer).
- Raises:
- TypeError
If there are objects of the spec with invalid type.
- ValueError
If there are objects of the spec with invalid values.
- khiops.utils.dataset.get_khiops_type(numpy_type)
Translates a numpy dtype to a Khiops dictionary type
- Parameters:
- numpy_type : numpy.dtype
Numpy type of the column
- Returns:
- str
Khiops type name. Either “Categorical”, “Numerical” or “Timestamp”
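As a rough illustration of the kind of translation described above, the following sketch classifies dtype names given as plain strings, such as those produced by str(numpy.dtype). The function name sketch_khiops_type and the rule it applies (integer and float dtypes map to "Numerical", datetime64 dtypes to "Timestamp", everything else to "Categorical") are assumptions made for this example; the rule actually applied by get_khiops_type may differ.

```python
def sketch_khiops_type(dtype_name):
    """Hypothetical mapping from a dtype name to a Khiops type name.

    ASSUMPTION: this rule is illustrative only and is not taken from
    the khiops implementation of get_khiops_type.
    """
    name = str(dtype_name).lower()
    if name.startswith(("int", "float")):
        return "Numerical"
    if name.startswith("datetime64"):
        return "Timestamp"
    # Anything else (e.g. "object") is treated as categorical in this sketch
    return "Categorical"


# Dtype names as they would appear when calling str() on a numpy dtype
for dtype_name in ("int64", "float32", "datetime64[ns]", "object"):
    print(dtype_name, "->", sketch_khiops_type(dtype_name))
```

Working on names rather than dtype objects keeps the sketch free of any NumPy dependency; with NumPy available, one could pass str(array.dtype) instead.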
- khiops.utils.dataset.get_khiops_variable_name(column_id)
Returns the Khiops variable name associated with a column id
- khiops.utils.dataset.read_internal_data_table(file_path_or_stream)
Reads a data table file into a DataFrame with the internal format settings
The table is read with the following settings:
Use tab as separator
Read the column names from the first line
Use '"' as quote character
Double quoting enabled (quotes within quotes can be escaped with '""')
UTF-8 encoding
- Parameters:
- file_path_or_stream : str or file object
The path of the internal data table file to be read or a readable file object.
- Returns:
pandas.DataFrame
The dataframe representation.
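The format settings listed for read_internal_data_table can be reproduced with the standard library csv module. This is only a sketch of the file format, not the function's implementation: the helper name read_tab_quoted and the sample content are hypothetical, and the helper returns a list of dicts rather than a pandas.DataFrame.

```python
import csv
import io


def read_tab_quoted(stream):
    """Parse a text stream using tab-separated, double-quoted settings."""
    reader = csv.DictReader(
        stream,
        delimiter="\t",    # tab as separator
        quotechar='"',     # '"' as quote character
        doublequote=True,  # '""' inside a quoted field is one literal quote
    )
    # DictReader takes the column names from the first line
    return list(reader)


# Hypothetical table content held in memory for the example
sample = 'Id\tComment\n1\t"said ""hello"""\n2\tcafé\n'
rows = read_tab_quoted(io.StringIO(sample, newline=""))
print(rows)
```

When reading from a real file instead of an in-memory stream, the UTF-8 setting would correspond to opening it with open(path, encoding="utf-8", newline="").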
- khiops.utils.dataset.write_internal_data_table(dataframe, file_path_or_stream)
Writes a DataFrame to a data table file with the internal format settings
The table is written with the following settings:
Use tab as separator
Write the column names on the first line
Use '"' as quote character
Double quoting enabled (quotes within quotes can be escaped with '""')
UTF-8 encoding
The index is not written
- Parameters:
- dataframe : pandas.DataFrame
The dataframe to write.
- file_path_or_stream : str or file object
The path of the internal data table file to be written or a writable file object.
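The format settings listed for write_internal_data_table can likewise be sketched with the standard library csv module. The helper name write_tab_quoted is hypothetical, and the minimal quoting policy and the "\n" line terminator are assumptions of this example, since the settings above do not specify them. Rows are plain lists here, so there is no index to write.

```python
import csv
import io


def write_tab_quoted(header, rows, stream):
    """Write a header and rows using tab-separated, double-quoted settings."""
    writer = csv.writer(
        stream,
        delimiter="\t",             # tab as separator
        quotechar='"',              # '"' as quote character
        doublequote=True,           # embedded quotes are written as '""'
        quoting=csv.QUOTE_MINIMAL,  # ASSUMPTION: quote only when needed
        lineterminator="\n",        # ASSUMPTION: Unix line endings
    )
    writer.writerow(header)  # column names on the first line
    writer.writerows(rows)


buffer = io.StringIO(newline="")
write_tab_quoted(["Id", "Comment"], [[1, 'said "hello"'], [2, "plain"]], buffer)
text = buffer.getvalue()
print(text)
```

When writing to a real file, the UTF-8 setting would correspond to opening it with open(path, "w", encoding="utf-8", newline="").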