core.api
Submodule of khiops.core
API for the execution of the Khiops AutoML suite
The methods in this module execute all Khiops and Khiops Coclustering tasks.
Functions
build_deployed_dictionary | Builds a dictionary file to read the output table of a deployed model
build_dictionary_from_data_table | Builds a dictionary file by analyzing a data table file
build_multi_table_dictionary | Builds a multi-table dictionary from a dictionary with a key
check_database | Checks if a data table is compatible with a dictionary file
deploy_model | Deploys a model on a data table
detect_data_table_format | Detects the format of a data table
evaluate_predictor | Evaluates the predictors in a dictionary file on a database
export_dictionary_as_json | Exports a Khiops dictionary file to JSON format (.kdicj)
extract_clusters | Extracts clusters to a tab separated (TSV) file
extract_keys_from_data_table | Extracts from data table unique occurrences of a key variable
get_khiops_coclustering_info | Returns the Khiops Coclustering license information
get_khiops_info | Returns the Khiops license information
get_khiops_version | Returns the Khiops version
get_samples_dir | Returns the Khiops’ samples directory path
prepare_coclustering_deployment | Prepares an individual-variable coclustering deployment
simplify_coclustering | Simplifies a coclustering model
sort_data_table | Sorts a data table
train_coclustering | Trains a coclustering model from a data table
train_predictor | Trains a model from a data table
train_recoder | Trains a recoding model from a data table
- khiops.core.api.build_deployed_dictionary(dictionary_file_path_or_domain, dictionary_name, output_dictionary_file_path, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Builds a dictionary file to read the output table of a deployed model
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary to be analyzed.
- output_dictionary_file_path : str
Path of the output dictionary file.
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument
Examples
- See the corresponding functions of the samples.py documentation script.
- khiops.core.api.build_dictionary_from_data_table(data_table_path, output_dictionary_name, output_dictionary_file_path, detect_format=True, header_line=None, field_separator=None, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Builds a dictionary file by analyzing a data table file
- Parameters:
- data_table_path : str
Path of the data table file.
- output_dictionary_name : str
Name of the dictionary to be created.
- output_dictionary_file_path : str
Path of the output dictionary file.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- …
See Common Parameters.
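The interaction between detect_format, header_line and field_separator recurs in most functions of this module, so a small sketch may help. The helper below is hypothetical (it is not part of khiops) and only encodes one reading of the rules stated above. In particular, it assumes that the documented defaults (True and “\t”) also apply when detection is overridden by setting just one of the two explicit options.

```python
def resolve_format_options(detect_format=True, header_line=None, field_separator=None):
    """Return (run_detection, header_line, field_separator).

    Hypothetical helper: one reading of the documented rules, not the
    library's implementation.
    """
    explicit = header_line is not None or field_separator is not None
    if detect_format and not explicit:
        # Nothing was set explicitly: the format is left to automatic detection.
        return True, None, None
    # Detection is disabled or overridden: fall back to the documented defaults.
    if header_line is None:
        header_line = True
    if field_separator is None or field_separator == "":
        field_separator = "\t"  # "" counts as a tab
    return False, header_line, field_separator


print(resolve_format_options())
print(resolve_format_options(field_separator=";"))
```

Under this reading, passing either explicit option is enough to bypass detection, which is why many callers simply leave all three at their defaults.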
- khiops.core.api.build_multi_table_dictionary(dictionary_file_path_or_domain, root_dictionary_name, secondary_table_variable_name, output_dictionary_file_path, overwrite_dictionary_file=False, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False)¶
Builds a multi-table dictionary from a dictionary with a key
Warning
This method is deprecated since Khiops 10.1.3 and will be removed in Khiops 11. Use the build_multi_table_dictionary_domain helper function to the same effect.
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- root_dictionary_name : str
Name for the new root dictionary.
- secondary_table_variable_name : str
Name, in the root dictionary, for the “table” variable of the secondary table.
- output_dictionary_file_path : str
Path of the output dictionary file.
- overwrite_dictionary_file : bool, default False
If True it will overwrite an input dictionary file.
- …
See Common Parameters.
- Raises:
ValueError
Invalid values of an argument
- khiops.core.api.check_database(dictionary_file_path_or_domain, dictionary_name, data_table_path, detect_format=True, header_line=None, field_separator=None, sample_percentage=100.0, sampling_mode='Include sample', selection_variable='', selection_value='', additional_data_tables=None, max_messages=20, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Checks if a data table is compatible with a dictionary file
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary of the table to be checked.
- data_table_path : str
Path of the data table file.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentage : float, default 100.0
See the sampling_mode option below.
- sampling_mode : “Include sample” or “Exclude sample”
If equal to “Include sample” it checks sample_percentage percent of the data; if equal to “Exclude sample” it checks the complement of the data selected with “Include sample”. See also Database Sampling.
- selection_variable : str, default “”
It checks only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value : str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tables : dict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- max_messages : int, default 20
Maximum number of error messages to write in the log file.
- …
See Common Parameters.
Examples
- See the corresponding function of the samples.py documentation script.
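The sampling and selection options described above can be pictured with a small in-memory sketch. It is not how Khiops samples a database: the function name, the use of Python's random module, the seed argument and the string comparison of selection values are all assumptions made for illustration. It only shows that “Include sample” and “Exclude sample” select complementary parts of the table, and that the selection filter is skipped when selection_variable is “”.

```python
import random


def filter_records(records, sample_percentage=100.0, sampling_mode="Include sample",
                   selection_variable="", selection_value="", seed=0):
    """Illustrative only: keep the records a task would look at."""
    if sampling_mode not in ("Include sample", "Exclude sample"):
        raise ValueError(f"unknown sampling_mode: {sampling_mode!r}")
    rng = random.Random(seed)  # fixed seed so both modes see the same draws
    kept = []
    for record in records:
        # One draw per record, so the two modes partition the table.
        in_sample = rng.random() * 100.0 < sample_percentage
        if sampling_mode == "Exclude sample":
            in_sample = not in_sample
        if not in_sample:
            continue
        # The selection filter is ignored when selection_variable is "".
        if selection_variable != "" and str(record[selection_variable]) != str(selection_value):
            continue
        kept.append(record)
    return kept


records = [{"id": i, "group": "a" if i % 2 else "b"} for i in range(100)]
included = filter_records(records, 30.0, "Include sample")
excluded = filter_records(records, 30.0, "Exclude sample")
print(len(included) + len(excluded))
```

Because both calls share the same seed, every record lands in exactly one of the two samples.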
- khiops.core.api.deploy_model(dictionary_file_path_or_domain, dictionary_name, data_table_path, output_data_table_path, detect_format=True, header_line=None, field_separator=None, sample_percentage=100.0, sampling_mode='Include sample', selection_variable='', selection_value='', additional_data_tables=None, output_header_line=True, output_field_separator='\t', output_additional_data_tables=None, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Deploys a model on a data table
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object. This file/object defines the model to be deployed. Note that this model is not necessarily a predictor; it can be a generic table transformation.
- dictionary_name : str
Name of the dictionary to be analyzed.
- data_table_path : str
Path of the data table file.
- output_data_table_path : str
Path of the output data file.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentage : float, default 100.0
See the sampling_mode option below.
- sampling_mode : “Include sample” or “Exclude sample”
If equal to “Include sample” it deploys the model on sample_percentage percent of the data. If equal to “Exclude sample” it deploys the model on the complement of the data selected with “Include sample”. See also Database Sampling.
- selection_variable : str, default “”
It deploys only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value : str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tables : dict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- output_header_line : bool, default True
If True, writes a header line with the column names in the output table.
- output_field_separator : str, default “\t”
The field separator character for the output table (“” counts as “\t”).
- output_additional_data_tables : dict, optional
A dictionary containing the output data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument.
Examples
- See the corresponding functions of the samples.py documentation script.
- khiops.core.api.detect_data_table_format(data_table_path, dictionary_file_path_or_domain=None, dictionary_name=None, trace=False, stdout_file_path='', stderr_file_path='')¶
Detects the format of a data table
Runs a heuristic to detect the format of a data table. The detection heuristic is more accurate if a dictionary with the table schema is provided.
- Parameters:
- data_table_path : str
Path of the data table file.
- dictionary_file_path_or_domain : str or DictionaryDomain, optional
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str, optional
Name of the dictionary.
- Returns:
- tuple
A 2-tuple containing:
the header_line boolean
the field_separator character
These are exactly the parameters expected in many Khiops Python API functions.
Examples
- See the corresponding function of the samples.py documentation script.
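Since the returned 2-tuple matches the header_line and field_separator parameters of other functions, it can be forwarded directly. The sketch below assumes an `api` object exposing detect_data_table_format and check_database with the signatures documented on this page (for instance the khiops.core.api module on a machine where Khiops is installed). The two helper names are hypothetical.

```python
def format_kwargs(detected_format):
    """Turn the documented 2-tuple into keyword arguments for other API calls."""
    header_line, field_separator = detected_format
    return {"header_line": header_line, "field_separator": field_separator}


def check_with_detected_format(api, dictionary_file_path, dictionary_name, data_table_path):
    """Detect the table format once, then reuse it for check_database.

    `api` is any object exposing the functions documented on this page.
    """
    detected = api.detect_data_table_format(data_table_path)
    return api.check_database(
        dictionary_file_path,
        dictionary_name,
        data_table_path,
        **format_kwargs(detected),
    )
```

Passing the module as an argument keeps the helper importable, and testable with a stub, on machines without Khiops.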
- khiops.core.api.evaluate_predictor(dictionary_file_path_or_domain, train_dictionary_name, data_table_path, results_dir, detect_format=True, header_line=None, field_separator=None, sample_percentage=100.0, sampling_mode='Include sample', selection_variable='', selection_value='', additional_data_tables=None, main_target_value='', results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Evaluates the predictors in a dictionary file on a database
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- train_dictionary_name : str
Name of the main dictionary used while training the models.
- data_table_path : str
Path of the evaluation data table file.
- results_dir : str
Path of the results directory.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentage : float, default 100.0
See the sampling_mode option below.
- sampling_mode : “Include sample” or “Exclude sample”
If equal to “Include sample” it evaluates the predictor on sample_percentage percent of the data. If equal to “Exclude sample” it evaluates the predictor on the complement of the data selected with “Include sample”. See also Database Sampling.
- selection_variable : str, default “”
It evaluates with only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value : str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tables : dict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
Note
Use the initial dictionary name in the data paths.
- main_target_value : str, default “”
If this target value is specified then it guarantees the calculation of lift curves for it.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Returns:
- str
The path of the JSON evaluation report (extension .khj).
- Raises:
TypeError
Invalid type of an argument.
Examples
- See the corresponding functions of the samples.py documentation script.
- khiops.core.api.export_dictionary_as_json(dictionary_file_path_or_domain, json_dictionary_file_path, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='')¶
Exports a Khiops dictionary file to JSON format (.kdicj)
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- json_dictionary_file_path : str
Path of the output JSON dictionary file (.kdicj).
- …
See Common Parameters.
Examples
- See the corresponding function of the samples.py documentation script.
- khiops.core.api.extract_clusters(coclustering_file_path, cluster_variable, clusters_file_path, max_preserved_information=0, max_cells=0, batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Extracts clusters to a tab separated (TSV) file
- Parameters:
- coclustering_file_path : str
Path of the coclustering model file (extension .khc or .khcj).
- cluster_variable : str
Name of the variable for which the clusters are extracted.
- clusters_file_path : str
Path of the output clusters TSV file.
- max_preserved_information : int, default 0
Maximum information preserved in the simplified coclustering. If equal to 0 there is no limit.
- max_cells : int, default 0
Maximum number of cells in the simplified coclustering. If equal to 0 there is no limit.
- …
See Common Parameters.
Examples
- See the corresponding function of the samples.py documentation script.
- khiops.core.api.extract_keys_from_data_table(dictionary_file_path_or_domain, dictionary_name, data_table_path, output_data_table_path, detect_format=True, header_line=None, field_separator=None, output_header_line=True, output_field_separator='\t', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Extracts from data table unique occurrences of a key variable
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary of the data table.
- data_table_path : str
Path of the data table file.
- output_data_table_path : str
Path of the output data file.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- output_header_line : bool, default True
If True, writes a header line with the column names in the output table.
- output_field_separator : str, default “\t”
The field separator character for the output table (“” counts as “\t”).
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument.
Examples
- See the corresponding function of the samples.py documentation script.
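As a rough picture of what "unique occurrences of a key variable" means, the sketch below deduplicates one column of a small tab-separated table held in memory. It only illustrates the expected result. The real function works on files and takes the key from the dictionary, and the helper name and single-column key here are assumptions.

```python
import csv
import io


def unique_keys(tsv_text, key_column):
    """Return the distinct values of key_column, in order of first appearance."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    keys = []
    for row in reader:
        key = row[key_column]
        if key not in seen:
            seen.add(key)
            keys.append(key)
    return keys


table = "id\tvalue\nA\t1\nA\t2\nB\t3\n"
print(unique_keys(table, "id"))
```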
- khiops.core.api.get_khiops_coclustering_info()¶
Returns the Khiops Coclustering license information
Warning
This method is deprecated since Khiops 10.1 and will be removed in Khiops 11. Use get_khiops_version to obtain the Khiops version of your system.
- Returns:
- tuple
A 4-tuple containing:
The tool version
The name of the machine
The ID of the machine
The number of remaining days for the license
- khiops.core.api.get_khiops_info()¶
Returns the Khiops license information
Warning
This method is deprecated since Khiops 10.1 and will be removed in Khiops 11. Use get_khiops_version to obtain the Khiops version of your system.
- Returns:
- tuple
A 4-tuple containing:
The tool version
The name of the machine
The ID of the machine
The number of remaining days for the license
- khiops.core.api.get_khiops_version()¶
Returns the Khiops version
- Returns:
- str
The Khiops version of the current KhiopsRunner backend.
- khiops.core.api.get_samples_dir()¶
Returns the Khiops’ samples directory path
- Returns:
- str
The path of the Khiops samples directory.
- khiops.core.api.prepare_coclustering_deployment(dictionary_file_path_or_domain, dictionary_name, coclustering_file_path, table_variable, deployed_variable_name, results_dir, max_preserved_information=0, max_cells=0, max_part_numbers=None, build_cluster_variable=True, build_distance_variables=False, build_frequency_variables=False, variables_prefix='', results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Prepares an individual-variable coclustering deployment
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary to be analyzed.
- coclustering_file_path : str
Path of the coclustering model file (extension .khc or .khcj).
- table_variable : str
Name of the table variable in the dictionary.
- deployed_variable_name : str
Name of the coclustering variable to deploy.
- results_dir : str
Path of the results directory.
- max_preserved_information : int, default 0
Maximum information preserved in the simplified coclustering. If equal to 0 there is no limit.
- max_cells : int, default 0
Maximum number of cells in the simplified coclustering. If equal to 0 there is no limit.
- max_part_numbers : dict, optional
Dictionary associating variable names to their maximum number of parts to preserve in the simplified coclustering. For variables not present in max_part_numbers there is no limit.
- build_cluster_variable : bool, default True
If True, includes a cluster id variable in the deployment.
- build_distance_variables : bool, default False
If True, includes a cluster distance variable in the deployment.
- build_frequency_variables : bool, default False
If True, includes the frequency variables in the deployment.
- variables_prefix : str, default “”
Prefix for the variables in the deployment dictionary.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument
Examples
- See the corresponding function of the samples.py documentation script.
- khiops.core.api.simplify_coclustering(coclustering_file_path, simplified_coclustering_file_path, results_dir, max_preserved_information=0, max_cells=0, max_total_parts=0, max_part_numbers=None, results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Simplifies a coclustering model
- Parameters:
- coclustering_file_path : str
Path of the coclustering file (extension .khc or .khcj).
- simplified_coclustering_file_path : str
Path of the output coclustering file.
- results_dir : str
Path of the results directory.
- max_preserved_information : int, default 0
Maximum information preserved in the simplified coclustering. If equal to 0 there is no limit.
- max_cells : int, default 0
Maximum number of cells in the simplified coclustering. If equal to 0 there is no limit.
- max_total_parts : int, default 0
Maximum number of parts totaled over all variables. If equal to 0 there is no limit.
- max_part_numbers : dict, optional
Dictionary that associates variable names to their maximum number of parts to preserve in the simplified coclustering. If not set there is no limit.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument.
Examples
- See the corresponding function of the samples.py documentation script.
- khiops.core.api.sort_data_table(dictionary_file_path_or_domain, dictionary_name, data_table_path, output_data_table_path, sort_variables=None, detect_format=True, header_line=None, field_separator=None, output_header_line=True, output_field_separator='\t', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Sorts a data table
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary to be analyzed.
- data_table_path : str
Path of the data table file.
- output_data_table_path : str
Path of the output data file.
- sort_variables : list of str, optional
The names of the variables to sort. If not set sorts the table by its key.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- output_header_line : bool, default True
If True, writes a header line with the column names in the output table.
- output_field_separator : str, default “\t”
The field separator character for the output table (“” counts as “\t”).
- …
See Common Parameters.
- Raises:
TypeError
Invalid type of an argument.
Examples
- See the corresponding functions of the samples.py documentation script.
- khiops.core.api.train_coclustering(dictionary_file_path_or_domain, dictionary_name, data_table_path, coclustering_variables, results_dir, detect_format=True, header_line=None, field_separator=None, sample_percentage=100.0, sampling_mode='Include sample', selection_variable='', selection_value='', additional_data_tables=None, frequency_variable='', min_optimization_time=0, results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Trains a coclustering model from a data table
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary to be analyzed.
- data_table_path : str
Path of the data table file.
- coclustering_variables : list of str
The names of variables to use in coclustering. Min length: 2. Max length: 10.
- results_dir : str
Path of the results directory.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentage : float, default 100.0
See the sampling_mode option below.
- sampling_mode : “Include sample” or “Exclude sample”
If equal to “Include sample” it trains the coclustering estimator on sample_percentage percent of the data. If equal to “Exclude sample” it trains the coclustering estimator on the complement of the data selected with “Include sample”. See also Database Sampling.
- selection_variable : str, default “”
It trains with only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value : str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tables : dict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- frequency_variable : str, default “”
Name of the frequency variable.
- min_optimization_time : int, default 0
Minimum optimization time in seconds.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Returns:
- str
The path of the resulting coclustering file.
- Raises:
ValueError
Number of coclustering variables out of the range 2-10.
TypeError
Invalid type of an argument.
Examples
- See the corresponding function of the samples.py documentation script.
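The constraint on coclustering_variables (between 2 and 10 names, with ValueError and TypeError as documented above) can be checked before launching a potentially long task. The helper below is a hypothetical pre-check written for this page. It mirrors the documented constraint and is not the library's own validation code.

```python
def check_coclustering_variables(coclustering_variables):
    """Validate a list of variable names against the documented 2-10 range."""
    if not isinstance(coclustering_variables, list) or not all(
        isinstance(name, str) for name in coclustering_variables
    ):
        raise TypeError("coclustering_variables must be a list of str")
    if not 2 <= len(coclustering_variables) <= 10:
        raise ValueError(
            "expected between 2 and 10 coclustering variables, "
            f"got {len(coclustering_variables)}"
        )
    return coclustering_variables


print(check_coclustering_variables(["age", "workclass"]))
```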
- khiops.core.api.train_predictor(dictionary_file_path_or_domain, dictionary_name, data_table_path, target_variable, results_dir, detect_format=True, header_line=None, field_separator=None, sample_percentage=70.0, sampling_mode='Include sample', use_complement_as_test=True, selection_variable='', selection_value='', additional_data_tables=None, main_target_value='', snb_predictor=True, univariate_predictor_number=0, max_evaluated_variables=0, max_selected_variables=0, max_constructed_variables=100, construction_rules=None, max_trees=10, max_pairs=0, all_possible_pairs=True, specific_pairs=None, group_target_value=False, discretization_method=None, min_interval_frequency=0, max_intervals=0, grouping_method=None, min_group_frequency=0, max_groups=0, results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Trains a model from a data table
- Parameters:
- dictionary_file_path_or_domain : str or DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_name : str
Name of the dictionary to be analyzed.
- data_table_path : str
Path of the data table file.
- target_variable : str
Name of the target variable. If the specified variable is categorical it constructs a classifier and if it is numerical a regressor. If equal to “” it performs an unsupervised analysis.
- results_dir : str
Path of the results directory.
- detect_format : bool, default True
If True, detects automatically whether the data table file has a header and its field separator. Ignored if header_line or field_separator is set.
- header_line : bool, optional (default True if detect_format is False)
If True, uses the first line of the data as column names. Overrides detect_format if set.
- field_separator : str, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentage : float, default 70.0
See the sampling_mode option below.
- sampling_mode : “Include sample” or “Exclude sample”
If equal to “Include sample” it trains the predictor on sample_percentage percent of the data and tests the model on the remainder of the data if use_complement_as_test is set to True. If equal to “Exclude sample” the train and test datasets above are exchanged. See also Database Sampling.
- use_complement_as_test : bool, default True
Uses the complement of the sampled database as test database for computing the model’s performance metrics.
- fill_test_database_settings : bool, default False
It creates a test database as the complement of the train database. Deprecated: will be removed in Khiops 11; use use_complement_as_test instead.
- selection_variable : str, default “”
It trains with only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value : str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tables : dict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- main_target_value : str, default “”
If this target value is specified then it guarantees the calculation of lift curves for it.
- snb_predictor : bool, default True
If True, it trains a Selective Naive Bayes predictor.
- univariate_predictor_number : int, default 0
Number of univariate predictors to train.
- map_predictor : bool, default False
If True, trains a Maximum a Posteriori Naive Bayes predictor. Deprecated: will be removed in Khiops Python 11.
- max_evaluated_variables : int, default 0
Maximum number of variables to be evaluated in the SNB predictor training. If equal to 0 it evaluates all informative variables.
- max_selected_variables : int, default 0
Maximum number of variables to be selected in the SNB predictor. If equal to 0 it selects all the variables kept in the training.
- max_constructed_variables : int, default 100
Maximum number of variables to construct.
- construction_rules : list of str, optional
Allowed rules for the automatic variable construction. If not set it uses all possible rules.
- max_trees : int, default 10
Maximum number of trees to construct. Not yet available in regression.
- max_pairs : int, default 0
Maximum number of variable pairs to construct.
- specific_pairs : list of tuple, optional
User-specified pairs as a list of 2-tuples of variable names. If a given tuple contains only one non-empty variable name, then it generates all the pairs containing it (within the limit max_pairs).
- all_possible_pairs : bool, default True
If True, tries to create all possible pairs within the limit max_pairs. The pairs and variables given in specific_pairs have priority.
- only_pairs_with : str, default “”
Constructs only pairs with the specified variable name. If equal to the empty string “” it considers all variables to make pairs. Deprecated: will be removed in Khiops Python 11; use specific_pairs instead.
- group_target_value : bool, default False
Allows grouping of the target variable values in classification. It can substantially increase the training time.
- discretization_method : str
Name of the discretization method. Its valid values depend on the task:
Supervised: “MODL” (default), “EqualWidth” or “EqualFrequency”
Unsupervised: “EqualWidth” (default), “EqualFrequency” or “None”
- min_interval_frequency : int, default 0
Minimum number of instances in an interval. If equal to 0 it is automatically calculated.
- max_intervals : int, default 0
Maximum number of intervals to construct. If equal to 0 it is automatically calculated.
- grouping_method : str
Name of the grouping method. Its valid values depend on the task:
Supervised: “MODL” (default) or “BasicGrouping”
Unsupervised: “BasicGrouping” (default) or “None”
- min_group_frequency : int, default 0
Minimum number of instances for a group.
- max_groups : int, default 0
Maximum number of groups. If equal to 0 it is automatically calculated.
- results_prefix : str, default “”
Prefix of the result files.
- …
See Common Parameters.
- Returns:
- tuple
A 2-tuple containing:
The reports file path
The modeling dictionary file path in the supervised case.
- Raises:
ValueError
Invalid values of an argument
TypeError
Invalid type of an argument
Examples
- See the corresponding functions of the samples.py documentation script.
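The returned 2-tuple is typically chained into evaluate_predictor. The sketch below shows one way to do so, under stated assumptions: `api` is any object exposing train_predictor and evaluate_predictor with the signatures documented on this page (for instance the khiops.core.api module), the wrapper name and the separate evaluation_results_dir argument are hypothetical, and the evaluation is run on the complement of the training sample by switching sampling_mode.

```python
def train_then_evaluate(api, dictionary_file_path, dictionary_name, data_table_path,
                        target_variable, results_dir, evaluation_results_dir,
                        sample_percentage=70.0):
    """Train on a sample, then evaluate on its complement."""
    # train_predictor is documented to return (report path, modeling dictionary path).
    report_path, model_dictionary_path = api.train_predictor(
        dictionary_file_path,
        dictionary_name,
        data_table_path,
        target_variable,
        results_dir,
        sample_percentage=sample_percentage,
        sampling_mode="Include sample",
    )
    # Same percentage with "Exclude sample" selects the records not used for training.
    evaluation_report_path = api.evaluate_predictor(
        model_dictionary_path,
        dictionary_name,  # name of the main dictionary used while training
        data_table_path,
        evaluation_results_dir,
        sample_percentage=sample_percentage,
        sampling_mode="Exclude sample",
    )
    return report_path, model_dictionary_path, evaluation_report_path
```

With use_complement_as_test left at its default, train_predictor already reports test metrics on the complement, so the separate evaluation mainly matters when a standalone evaluation report is wanted.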
- khiops.core.api.train_recoder(dictionary_file_path_or_domain, dictionary_name, data_table_path, target_variable, results_dir, detect_format=True, header_line=None, field_separator=None, sample_percentage=100.0, sampling_mode='Include sample', selection_variable='', selection_value='', additional_data_tables=None, max_constructed_variables=100, construction_rules=None, max_trees=0, max_pairs=0, all_possible_pairs=True, specific_pairs=None, informative_variables_only=True, max_variables=0, keep_initial_categorical_variables=False, keep_initial_numerical_variables=False, categorical_recoding_method='part Id', numerical_recoding_method='part Id', pairs_recoding_method='part Id', group_target_value=False, discretization_method=None, min_interval_frequency=0, max_intervals=0, grouping_method=None, min_group_frequency=0, max_groups=0, results_prefix='', batch_mode=True, log_file_path=None, output_scenario_path=None, task_file_path=None, trace=False, stdout_file_path='', stderr_file_path='', **kwargs)¶
Trains a recoding model from a data table
A recoding model consists of the discretization of numerical variables and the grouping of categorical variables.
If the target_variable is specified, these partitions are constructed in supervised mode, meaning that each resulting discretization/grouping best separates the target variable while maintaining a simple interval/group model of the data. Different recoding methods can be specified via the numerical_recoding_method, categorical_recoding_method and pairs_recoding_method options.
The output files of this process contain a dictionary file (.kdic) that can be used to recode databases with the deploy_model function.
- Parameters:
- dictionary_file_path_or_domainstr or
DictionaryDomain
Path of a Khiops dictionary file or a DictionaryDomain object.
- dictionary_namestr
Name of the dictionary to be recoded.
- data_table_pathstr
Path of the data table file.
- target_variablestr
Name of the target variable. If equal to “” it trains an unsupervised recoder.
- results_dirstr
Path of the results directory.
- detect_formatbool, default True
If True, detects automatically whether the data table file has a header and its field separator. It is ignored if header_line or field_separator are set.
- header_linebool, optional (default True if detect_format is False)
If True, it uses the first line of the data as column names. Overrides detect_format if set.
- field_separatorstr, optional (default “\t” if detect_format is False)
A field separator character. Overrides detect_format if set (“” counts as “\t”).
- sample_percentagefloat, default 100.0
See the sampling_mode option below.
- sampling_mode“Include sample” or “Exclude sample”
If equal to “Include sample” it trains the recoder on sample_percentage percent of the data. If equal to “Exclude sample” it trains the recoder on the complement of the data selected with “Include sample”. See also Database Sampling.
- selection_variablestr, default “”
It trains with only the records such that the value of selection_variable is equal to selection_value. Ignored if equal to “”.
- selection_value: str or int or float, default “”
See the selection_variable option above. Ignored if equal to “”.
- additional_data_tablesdict, optional
A dictionary containing the data paths and file paths for a multi-table dictionary file. For more details see Multi-Table Learning Primer.
- max_constructed_variablesint, default 100
Maximum number of variables to construct.
- construction_ruleslist of str, optional
Allowed rules for the automatic variable construction. If not set it uses all possible rules.
- max_treesint, default 0
Maximum number of trees to construct. Not yet available in regression.
- max_pairsint, default 0
Maximum number of variable pairs to construct.
- specific_pairslist of tuple, optional
User-specified pairs as a list of 2-tuples of variable names. If a given tuple contains only one non-empty variable name, then it generates all the pairs containing it (within the limit max_pairs).
- all_possible_pairsbool, default True
If True, tries to create all possible pairs within the limit max_pairs. The pairs and variables given in specific_pairs have priority.
- only_pairs_withstr, default “”
Constructs only pairs with the specified variable name. If equal to the empty string “” it considers all variables to make pairs. Deprecated: it will be removed in Khiops Python 11; use specific_pairs instead.
- group_target_valuebool, default False
Allows grouping of the target variable values in classification. It can substantially increase the training time.
- discretization_methodstr
- Name of the discretization method. Its valid values depend on the task:
Supervised: “MODL” (default), “EqualWidth” or “EqualFrequency”.
Unsupervised: “EqualWidth” (default), “EqualFrequency” or “None”.
- min_interval_frequencyint, default 0
Minimum number of instances in an interval. If equal to 0 it is automatically calculated.
- max_intervalsint, default 0
Maximum number of intervals to construct. If equal to 0 it is automatically calculated.
- informative_variables_onlybool, default True
If True, keeps only informative variables.
- max_variablesint, default 0
Maximum number of variables to keep. If equal to 0 keeps all variables.
- keep_initial_categorical_variablesbool, default False
If True, keeps the initial categorical variables.
- keep_initial_numerical_variablesbool, default False
If True, keeps the initial numerical variables.
- categorical_recoding_methodstr
- Type of recoding for categorical variables. Types available:
“part Id” (default): An id for the interval/group
“part label”: A label for the interval/group
“0-1 binarization”: A 0-1 coding of the interval/group id
“conditional info”: Conditional information of the interval/group
“none”: Keeps the variable as-is
- numerical_recoding_methodstr
- Type of recoding for numerical variables. Types available:
“part Id” (default): An id for the interval/group
“part label”: A label for the interval/group
“0-1 binarization”: A 0-1 coding of the interval/group id
“conditional info”: Conditional information of the interval/group
“center-reduction”: “(X - Mean(X)) / StdDev(X)”
“0-1 normalization”: “(X - Min(X)) / (Max(X) - Min(X))”
“rank normalization”: mean normalized rank (between 0 and 1) of the instances
“none”: Keeps the variable as-is
- pairs_recoding_methodstr
- Type of recoding for bivariate variables. Types available:
“part Id” (default): An id for the interval/group
“part label”: A label for the interval/group
“0-1 binarization”: A 0-1 coding of the interval/group id
“conditional info”: Conditional information of the interval/group
“none”: Keeps the variable as-is
- grouping_methodstr
- Name of the grouping method. Its valid values depend on the task:
Supervised: “MODL” (default) or “BasicGrouping”.
Unsupervised: “BasicGrouping” (default) or “None”.
- min_group_frequencyint, default 0
Minimum number of instances for a group.
- max_groupsint, default 0
Maximum number of groups. If equal to 0 it is automatically calculated.
- results_prefixstr, default “”
Prefix of the result files.
- …
See Common Parameters.
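The “center-reduction” and “0-1 normalization” formulas given for numerical_recoding_method can be checked by hand with a few lines of plain Python. This is a sketch and not the Khiops implementation: the function names are hypothetical, the population standard deviation is an assumption (the formula above does not say which variant StdDev denotes), and mapping a constant column to 0.0 is a choice made here to avoid a division by zero.

```python
from statistics import fmean, pstdev


def center_reduction(values):
    """Apply (X - Mean(X)) / StdDev(X) to a list of numbers."""
    mean = fmean(values)
    # Assumption: StdDev is the population standard deviation
    std = pstdev(values)
    if std == 0:
        # Choice made for this sketch: a constant column maps to 0.0
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


def zero_one_normalization(values):
    """Apply (X - Min(X)) / (Max(X) - Min(X)) to a list of numbers."""
    low, high = min(values), max(values)
    if high == low:
        # Choice made for this sketch: a constant column maps to 0.0
        return [0.0 for _ in values]
    return [(x - low) / (high - low) for x in values]


print(zero_one_normalization([1, 2, 3, 4, 5]))
print(center_reduction([2, 4, 4, 4, 5, 5, 7, 9]))
```

“rank normalization” is left out because the description above does not fix how ties are ranked.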
- Returns:
- tuple
- A 2-tuple containing:
The path of the JSON report file of the process
The path of the dictionary containing the recoding model
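A call to train_recoder and the unpacking of its 2-tuple might look like the sketch below. The file paths, the dictionary name and the variable names ("Adult", "class", "age" and so on) are hypothetical placeholders; only the parameter names and the returned pair come from the signature and the Returns section above. Khiops is imported lazily inside run_recoder so that the argument-building part runs without an installation.

```python
def build_recoder_arguments(results_dir):
    """Collect train_recoder arguments for a hypothetical dataset."""
    return {
        # Hypothetical paths and names, to be replaced with real ones
        "dictionary_file_path_or_domain": "Adult.kdic",
        "dictionary_name": "Adult",
        "data_table_path": "Adult.txt",
        "target_variable": "class",  # "" would train an unsupervised recoder
        "results_dir": results_dir,
        "max_pairs": 10,
        # The second tuple has one empty name: all pairs containing
        # "education" are generated, within the limit max_pairs
        "specific_pairs": [("age", "workclass"), ("education", "")],
        "numerical_recoding_method": "center-reduction",
        "categorical_recoding_method": "0-1 binarization",
    }


def run_recoder(results_dir):
    """Train the recoder and return the two paths of the result tuple."""
    # Lazy import: requires the khiops package and a Khiops installation
    from khiops.core import api

    report_path, recoding_dictionary_path = api.train_recoder(
        **build_recoder_arguments(results_dir)
    )
    return report_path, recoding_dictionary_path
```

The second path is the dictionary file that the description above says can be passed to deploy_model to recode a database.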
Examples
- See the following functions of the
samples.py
documentation script: