Notes¶
Khiops API Notes¶
Common Parameters¶
The functions in the khiops.core.api
have the following common parameters.
- log_file_path: str, default “”
Path of the log file for the Khiops process (command line option -e of the desktop app). If equal to “”, no log file is written.
- output_scenario_path: str, default “”
Path of the output Khiops scenario file (command line option -o of the desktop app). If equal to “”, no output scenario file is generated.
- task_file_path: str, default “”
Path of the task file for the Khiops process (command line option -p of the desktop app). If equal to “”, no task file is written.
- trace: bool, default False
If True, prints the command line executed by the process and does not delete any temporary files created.
- stdout_file_path: str, default “”
Advanced Path to a file where the Khiops process writes its stdout stream. Normally Khiops should not write to this stream, but MPI, filesystem plugins or debug versions may do so. The stream is captured with UTF-8 encoding, replacing encoding errors. If equal to “”, no file is written.
- stderr_file_path: str, default “”
Advanced Path to a file where the Khiops process writes its stderr stream. Normally Khiops should not write to this stream, but MPI, filesystem plugins or debug versions may do so. The stream is captured with UTF-8 encoding, replacing encoding errors. If equal to “”, no file is written.
- max_cores: int, optional
Advanced Maximum number of cores for Khiops executions. If not set, then Khiops uses all available CPU cores in the system.
- memory_limit_mb: int, optional
Advanced Maximum amount of memory (in MB) for Khiops executions. If not set, then Khiops uses all available system memory.
- temp_dir: str, default “”
Advanced Temporary directory for Khiops executions. If set to “”, then Khiops uses the system’s temporary directory.
- scenario_prologue: str, default “”
Advanced Prologue to prepend to all Khiops execution scenarios.
- force_ansi_scenario: bool, default False
Advanced If True, the internal scenario generated by Khiops forces characters such as accented ones to be decoded with the UTF8->ANSI Khiops transformation.
- batch_mode: bool, default True
Deprecated Will be removed in Khiops 11. If True, activates batch mode (command line option -b of the desktop app).
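Because these parameters are shared by the khiops.core.api functions, they can be collected once and passed as keyword arguments. The following is a minimal sketch, not part of the library: the helper name common_khiops_options, the file names and the resource limits are all hypothetical, and only the parameter names come from the list above.

```python
import os
import tempfile


def common_khiops_options(run_dir, debug=False):
    """Build keyword arguments for the common parameters (hypothetical helper)."""
    return {
        # Write the Khiops log next to the other run artifacts.
        "log_file_path": os.path.join(run_dir, "khiops.log"),
        # Keep the output scenario only when debugging; "" disables it.
        "output_scenario_path": (
            os.path.join(run_dir, "scenario._kh") if debug else ""
        ),
        # trace=True prints the command line and keeps temporary files.
        "trace": debug,
        # Illustrative resource limits; omit them to use all cores and memory.
        "max_cores": 4,
        "memory_limit_mb": 2048,
    }


run_dir = tempfile.mkdtemp(prefix="khiops_run_")
options = common_khiops_options(run_dir, debug=True)

# Assumed usage (the task-specific arguments are left out on purpose):
#   from khiops import core as kh
#   kh.train_predictor(..., **options)
```

Keeping the options in a plain dict makes it easy to reuse the same logging and resource settings across several calls.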
Input Types¶
The types accepted in most methods and classes of khiops.core
are flexible:
- str can be replaced by bytes. This adds flexibility for file paths and automatically created variable names (data-dependent).
- list can be replaced by any class implementing the collections.abc.Sequence interface except str and bytes.
- dict can be replaced by any class implementing the collections.abc.Mapping interface.
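The rules above can be expressed as checks against the abstract base classes in collections.abc. This is an illustrative sketch of the stated rules, not the validation code used by khiops.core; the function names are hypothetical.

```python
from collections.abc import Mapping, Sequence
from types import MappingProxyType


def is_list_like(value):
    """True for any Sequence except str and bytes, mirroring the list rule."""
    return isinstance(value, Sequence) and not isinstance(value, (str, bytes))


def is_dict_like(value):
    """True for any Mapping, mirroring the dict rule."""
    return isinstance(value, Mapping)


# Tuples and ranges are sequences, so they qualify as list replacements.
tuple_ok = is_list_like(("age", "education"))
range_ok = is_list_like(range(3))

# Strings and bytes are sequences too, but they are explicitly excluded.
str_ok = is_list_like("age")
bytes_ok = is_list_like(b"age")

# A read-only mapping view qualifies as a dict replacement.
mapping_ok = is_dict_like(MappingProxyType({"age": "Numerical"}))
```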
Database Sampling¶
Several khiops.core.api
functions can operate on dataset samples instead of the full datasets.
This sampling behavior is fully customizable by the user: one can specify that the function
operates on the specified sample or on its complement.
The sampling behavior is controlled with two parameters:
- sample_percentage: A real number between 0 and 100 specifying the percentage of the data to be used as sample.
- sampling_mode: A string specifying the sampling operation mode:
  - “Include sample”: The sample consists of sample_percentage percent of the individuals in the dataset.
  - “Exclude sample”: The sample consists of 100 - sample_percentage percent of the individuals in the dataset. The sample is exactly the complement of that obtained with “Include sample”.
In the case of the train_predictor
function the additional boolean parameter
use_complement_as_test
specifies whether the complement of the selected sample is used to
evaluate the trained predictor.
An Example¶
Setting the following in a train_predictor call:
- sample_percentage to 20
- sampling_mode to “Exclude sample”
- use_complement_as_test to True
specifies a 20-80 split of the dataset. Since sampling_mode is “Exclude sample”, the predictor will be trained on the 80 % part. The remaining 20 % will be used to evaluate the predictor’s performance because use_complement_as_test is True.
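The arithmetic behind this example can be sketched as a small function. It only reproduces the rules described in this section and does not call Khiops; the name describe_split is hypothetical, and returning None for the test part when use_complement_as_test is False is a choice of this sketch, since the text does not say which data is then used for evaluation.

```python
def describe_split(sample_percentage, sampling_mode, use_complement_as_test):
    """Return (train_percent, test_percent) implied by the sampling parameters."""
    if not 0 <= sample_percentage <= 100:
        raise ValueError("sample_percentage must be between 0 and 100")

    if sampling_mode == "Include sample":
        # The sample is sample_percentage percent of the individuals.
        train_percent = sample_percentage
    elif sampling_mode == "Exclude sample":
        # The sample is the complement of the "Include sample" one.
        train_percent = 100 - sample_percentage
    else:
        raise ValueError(f"unknown sampling_mode: {sampling_mode!r}")

    # The complement is evaluated only when explicitly requested.
    test_percent = 100 - train_percent if use_complement_as_test else None
    return train_percent, test_percent


# The settings from the example: train on 80 %, evaluate on 20 %.
split = describe_split(20, "Exclude sample", True)
```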
Samples Directory Customization¶
The samples directory usable by the Khiops Python library can be customized via
the KHIOPS_SAMPLES_DIR
environment variable, which specifies the path to the Khiops sample
datasets directory.
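As a minimal sketch, the variable can be set from Python before the samples directory is looked up. The directory name and the helper resolve_samples_dir are hypothetical; the helper only illustrates an "environment variable first, default otherwise" lookup and is not the library's own resolution logic.

```python
import os


def resolve_samples_dir(default_dir):
    """Return KHIOPS_SAMPLES_DIR if set and non-empty, else default_dir."""
    return os.environ.get("KHIOPS_SAMPLES_DIR") or default_dir


# Hypothetical location of a local copy of the sample datasets.
custom_dir = os.path.join(os.path.expanduser("~"), "my_khiops_samples")
os.environ["KHIOPS_SAMPLES_DIR"] = custom_dir

samples_dir = resolve_samples_dir("/hypothetical/default/samples")
```

Setting the variable in the shell before starting Python has the same effect and avoids depending on the order of imports.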
Khiops JSON Files¶
Generalities¶
The structure of the Khiops JSON files is self-documented:
- Most of the information is available as key-value pairs, where the keys resemble the labels used in Khiops’ classic report files (tab-separated plain-text files with extension .xls) or dictionary files.
- To be human-readable, the files are beautified with comfortable spacing and indentation.
Structure and Performance¶
The Khiops JSON files may be large (tens of MB) when analyzing datasets with many columns, or when specifying the creation of thousands of variables in the multi-table case. To handle these situations, the report attributes in the JSON file are sorted by increasing size, thus easing the use of streaming parsers.
Furthermore, memory-scalable parsing techniques can be implemented. For example, the heavier parts of the file can be separated and split into chunks. These chunks can then be indexed using the information found at the top of the report, allowing on-demand access to the detailed parts of the report.
Khiops Report Files Structure (.khj)¶
At the top level the order is as follows:
- Modeling report
- Evaluation report(s)
- Preparation report(s)
The preparation reports are at the end because they can be very large when many variables are analyzed.
Each report field is organized in three sections:
- Summary: General (short) information about the report
- A list of report items:
  - Variable statistics (preparation), trained predictor (modeling) and predictor performance (evaluation)
  - Each item has a “Rank”
    - Example: The second most informative variable has the categorical rank “R02”
  - Each item is described by a few summary attributes
- A dictionary of detailed report items. The keys of this dictionary are the previously mentioned “Rank” attributes. Note that:
  - Not all report items are detailed
  - The detailed information may be large (example: data grid).
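The three-section layout can be illustrated with a toy report. The key names ("summary", "items", "detailedItems", "rank", "name") and the variable names are hypothetical placeholders chosen for this sketch, not the actual keys of a .khj file. Only the lookup pattern matters: find an item in the list, then use its rank as the key into the dictionary of detailed items.

```python
import json

# Toy report text following the summary / items / detailed-items layout.
toy_report_text = json.dumps(
    {
        "summary": {"variables": 2},
        "items": [
            {"rank": "R01", "name": "age"},
            {"rank": "R02", "name": "education"},
        ],
        # Not every item is detailed: only R02 has an entry here.
        "detailedItems": {"R02": {"dataGrid": {"partition": ["low", "high"]}}},
    }
)


def find_details(report, name):
    """Return the detailed entry of the item called name, or None."""
    for item in report["items"]:
        if item["name"] == name:
            # The rank links the short item to its detailed counterpart.
            return report["detailedItems"].get(item["rank"])
    return None


report = json.loads(toy_report_text)
education_details = find_details(report, "education")
age_details = find_details(report, "age")  # None: this item has no details
```

Because the detailed entries are keyed by rank, a reader that only needs the summaries can skip the detailed dictionary entirely.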