API#

Preprocessing#

The wildwood.preprocessing module contains the Encoder class. The Encoder transforms an input pandas.DataFrame or numpy.ndarray into a wildwood.FeaturesBitArray instance.

Encoder#

class wildwood.preprocessing.Encoder(max_bins=256, subsample=200000, is_categorical=None, cat_min_categories='log', handle_unknown='error', random_state=None, verbose=False)[source]#

A class that transforms an input pandas.DataFrame or numpy.ndarray into a wildwood.FeaturesBitArray instance, corresponding to a column-wise binning of the original columns.

Categorical columns are simply ordinal-encoded using contiguous non-negative integers, while continuous columns are binned using inter-quantile intervals, so that each bin contains approximately the same number of samples.

Both mappings from categories to integers (for categorical columns) and from inter-quantile intervals to integers (for continuous columns) are computed using the .fit() method.

The .transform() method will bin the features and create the features’ bitarray. Its default behavior is to raise an error whenever an unknown category is met, but this can be changed using the handle_unknown option.

When the input is a pandas.DataFrame, we support the encoding of missing values both for categorical and numerical columns. However, when the input is a numpy.ndarray, missing values are supported only for a numerical data type. Other situations might raise unexpected errors. If a column contains missing values, the last bin (last integer) is used to encode them.
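
A minimal usage sketch (the toy DataFrame below and its column names are purely illustrative; the Encoder API is used as documented on this page):

    import numpy as np
    import pandas as pd
    from wildwood.preprocessing import Encoder

    # Toy data with one categorical and one continuous column, each containing
    # a missing value (illustrative only).
    df = pd.DataFrame(
        {
            "color": pd.Categorical(["red", "blue", "red", None, "green"]),
            "size": [1.2, 3.4, np.nan, 2.2, 5.1],
        }
    )

    encoder = Encoder(handle_unknown="consider_missing")
    encoder.fit(df)                            # learns the category and quantile mappings
    features_bitarray = encoder.transform(df)  # binned data as a FeaturesBitArray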

Parameters:
  • max_bins (int) – The maximum number of bins for numerical columns, not including the bin used for missing values, if any. Should be at least 3. We will use max_bins bins when the column has no missing values, and max_bins + 1 bins if it does. The last bin (at index max_bins) is used to encode missing values. If a column has fewer than max_bins distinct inter-quantile intervals or categories, fewer than max_bins bins are used for it.

  • is_categorical (None or numpy.ndarray) – If not None, it is a numpy.ndarray of shape (n_features,) with boolean dtype, which corresponds to a categorical indicator for each column as specified by the user.

  • handle_unknown ({"error", "consider_missing"}, default="error") – If set to “error”, an error will be raised at transform whenever a category was not seen during fit. If set to “consider_missing”, we will consider it as a missing value (it will end up in the same bin as missing values).

  • cat_min_categories (int or {"log", "sqrt"}, default="log") – When a column is numerical and is_categorical is None, WildWood decides that it is categorical whenever its number of unique values is smaller than or equal to cat_min_categories. Otherwise, it is considered numerical. If an int larger than 3 is given, we use it as cat_min_categories. If “log”, we set cat_min_categories=max(2, floor(log(n_samples))). If “sqrt”, we set cat_min_categories=max(2, floor(sqrt(n_samples))). Default is “log” (see the sketch after this parameter list).

  • subsample (int or None, default=200000) – If n_samples > subsample, then subsample samples are chosen at random to compute the quantiles. If None, the whole dataset is used.

  • random_state (int, default=None) – Allows to seed the random number generator used to generate a subsample for quantiles computation.

  • verbose (bool, default=False) – If True, display warnings concerning columns typing.
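
As a rough illustration of the “log” and “sqrt” heuristics used by cat_min_categories, the threshold below which a numerical column is treated as categorical can be computed as follows (a sketch of the documented formulas, not the library's internal code):

    from math import floor, log, sqrt

    def cat_min_categories_threshold(n_samples, strategy="log"):
        # "log"  -> max(2, floor(log(n_samples)))
        # "sqrt" -> max(2, floor(sqrt(n_samples)))
        if strategy == "log":
            return max(2, floor(log(n_samples)))
        if strategy == "sqrt":
            return max(2, floor(sqrt(n_samples)))
        raise ValueError("strategy must be 'log' or 'sqrt'")

    print(cat_min_categories_threshold(10_000, "log"))   # 9
    print(cat_min_categories_threshold(10_000, "sqrt"))  # 100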

n_samples_in_#

The number of samples passed to fit.

Type:

int

n_features_in_#

The number of features (columns) passed to fit.

Type:

int

categories_#

A dictionary that maps the index of a categorical column to an array containing its raw categories. For instance, categories_[2][4] is the raw category corresponding to the bin index 4 for column index 2.

Type:

dict

binning_thresholds_#

A dictionary that maps the index of a continuous column to an array containing its binning thresholds. It is usually of length max_bins - 1, unless the column has fewer unique values than that.

Type:

dict

is_categorical_#

A numpy array of shape (n_features,) with boolean dtype, which indicates whether each feature is considered categorical by WildWood. This might differ from the is_categorical given by the user. See the _checks.py module for details.

Type:

numpy.ndarray
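
The fitted attributes above can be inspected after calling fit(); a small self-contained sketch (the data is illustrative and the printed values depend on it):

    import pandas as pd
    from wildwood.preprocessing import Encoder

    df = pd.DataFrame(
        {
            "color": ["red", "blue", "red", "green"],   # string column, expected to be categorical
            "size": [1.2, 3.4, 2.2, 5.1],               # numeric column, expected to be continuous
        }
    )

    encoder = Encoder().fit(df)
    print(encoder.n_samples_in_, encoder.n_features_in_)  # 4 2
    print(encoder.is_categorical_)       # boolean mask, one entry per column
    print(encoder.categories_)           # {column index: array of raw categories}
    print(encoder.binning_thresholds_)   # {column index: array of binning thresholds}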

fit(X, y=None)[source]#

This computes, for each column of X, a mapping from raw values to bins: categories to integers for categorical columns and inter-quantile intervals to integers for continuous columns. It also infers whether each column is categorical or continuous, using the is_categorical parameter when provided, and tries to guess it otherwise.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to fit and transform. It can be either a pandas.DataFrame or a 2D numpy.ndarray.

  • y (None) – This is ignored.

Returns:

self – The current Encoder instance.

Return type:

Encoder

fit_transform(X, y=None, **fit_params)[source]#

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

inverse_transform(features_bitarray, return_dataframe=True, columns=None, index=None)[source]#
Parameters:
  • features_bitarray

  • return_dataframe

  • columns

  • index
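
The parameter descriptions are not provided above; based on the signature, a plausible round-trip usage looks like the following sketch (the exact semantics of each argument are inferred, and binned continuous values can only be recovered approximately):

    import pandas as pd
    from wildwood.preprocessing import Encoder

    df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1.2, 3.4, 2.2]})

    encoder = Encoder()
    encoder.fit(df)
    features_bitarray = encoder.transform(df)

    df_back = encoder.inverse_transform(
        features_bitarray,        # the FeaturesBitArray produced by transform
        return_dataframe=True,    # return a pandas.DataFrame rather than an ndarray
        columns=df.columns,       # column names for the output (assumed semantics)
        index=df.index,           # index for the output (assumed semantics)
    )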

set_inverse_transform_request(*, columns: bool | None | str = '$UNCHANGED$', features_bitarray: bool | None | str = '$UNCHANGED$', index: bool | None | str = '$UNCHANGED$', return_dataframe: bool | None | str = '$UNCHANGED$') Encoder[source]#

Request metadata passed to the inverse_transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • columns (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for columns parameter in inverse_transform.

  • features_bitarray (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for features_bitarray parameter in inverse_transform.

  • index (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for index parameter in inverse_transform.

  • return_dataframe (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for return_dataframe parameter in inverse_transform.

Returns:

self – The updated object.

Return type:

object

set_output(*, transform=None)[source]#

Set output container.

See the scikit-learn example “Introducing the set_output API” (plot_set_output.py) for an example of how to use the API.

Parameters:

transform ({"default", "pandas"}, default=None) –

Configure output of transform and fit_transform.

  • ”default”: Default output format of a transformer

  • ”pandas”: DataFrame output

  • None: Transform configuration is unchanged

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

transform(X, y=None)[source]#

Bins the columns in X. Both continuous and categorical columns are mapped to a contiguous range of non-negative integers. The resulting binned data is stored in a memory-efficient FeaturesBitArray class, which internally uses a bitarray.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – The data to transform using binning. It can be either a pandas dataframe or a 2d numpy array.

  • y (None) – This is ignored.

Returns:

output – A WildWood FeaturesBitArray instance corresponding to the binned data.

Return type:

FeaturesBitArray

Random Forest algorithms#

WildWood exposes the two classes ForestClassifier for multi-class classification and ForestRegressor for regression.

Multi-class classification with the ForestClassifier class#

class wildwood.ForestClassifier(*, n_estimators: int = 10, criterion: str = 'gini', loss: str = 'log', step: float = 1.0, aggregation: bool = True, dirichlet: float = 0.5, max_depth: None | int = None, min_samples_split: int = 2, min_samples_leaf: int = 1, max_bins: int = 256, categorical_features=None, max_features: None | str | int = 'auto', handle_unknown='consider_missing', cat_min_categories='log', subsample=200000, n_jobs: int = 1, random_state=None, verbose: bool = False, class_weight=None, multiclass='multinomial', cat_split_strategy='binary')[source]#

WildWood forest for classification.

It grows n_estimators trees in parallel using bootstrap samples and aggregates their predictions (bagging). Each tree uses “in-the-bag” samples to grow itself and “out-of-bag” samples to compute aggregation weights for all possible subtrees of the whole tree.

The prediction function of each tree in WildWood is very different from that of a standard decision tree whenever aggregation=True (default). Indeed, the predictions of a tree are computed here as an aggregation with exponential weights of all the predictions given by all possible subtrees (prunings) of the full tree. The required computations are performed efficiently thanks to a variant of the context tree weighting algorithm.

Also, both continuous and categorical features are binned with a maximum of max_bins bins, allowing the use of an efficient histogram-based split search.
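
A minimal classification sketch (the scikit-learn dataset and split are purely illustrative):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from wildwood import ForestClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    clf = ForestClassifier(n_estimators=10, random_state=42)
    clf.fit(X_train, y_train)

    print(clf.score(X_test, y_test))        # mean accuracy on the held-out split
    print(clf.predict_proba(X_test[:3]))    # class probabilities, shape (3, n_classes_)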

Parameters:
  • n_estimators (int, default=10) – The number of trees in the forest.

  • criterion ({"gini", "entropy"}, default="gini") – The impurity criterion used to measure the quality of a split. The supported impurity criteria are “gini” for the Gini impurity and “entropy” for the entropy impurity.

  • loss ({"log"}, default="log") – The loss used for the computation of the aggregation weights. Only “log” is supported for now, namely the log-loss for classification.

  • step (float, default=1.0) – Step-size for the aggregation weights. Default is 1.0 for classification with the log-loss, which is the best theoretical choice. A larger value will lead to larger aggregation weights for subtrees with better out-of-bag (validation) loss.

  • aggregation (bool, default=True) – Controls if aggregation is used in the trees. It is highly recommended to leave it as True.

  • dirichlet (float, default=0.5) – Regularization level of the class frequencies used for predictions in each node. A good default is dirichlet=0.5 for binary classification.

  • max_depth (int, default=None) – The maximum depth of a tree. If None, then nodes from the tree are split until they are “pure” (impurity is zero) or until they contain min_samples_split samples.

  • min_samples_split (int, default=2) – The minimum number of training samples and out-of-bag samples required to split a node. This must be >= 2.

  • min_samples_leaf (int, default=1) – A split point is considered if it leaves at least min_samples_leaf training samples and out-of-bag samples in both the left and right children. This must be >= 1.

  • max_bins (int, default=256) – The maximum number of bins for numerical columns, not including the bin used for missing values, if any. Should be at least 4. Before training, each column of the input array X is binned into integer-valued bins, corresponding to inter-quantile intervals, enabling faster split finding. We will use max_bins bins when the column has no missing values, and max_bins + 1 bins if it does. The last bin (at index max_bins) is used to encode missing values. If a column has fewer than max_bins distinct inter-quantile intervals or categories, fewer than max_bins bins are used for it.

  • categorical_features (array-like, default=None) – Array-like containing boolean or integer values, of shape (n_features,) or (n_categorical_features,), indicating the categorical features. If None: no feature will be considered categorical. If boolean array-like: boolean mask indicating categorical features. If integer array-like: integer indices indicating categorical features.

  • max_features ({"auto", "sqrt", "log2"} or int, default="auto") – The number of features to consider when looking for the best split. If int, consider max_features features at each split. If “auto”, max_features=sqrt(n_features). If “sqrt”, max_features=sqrt(n_features) (same as “auto”). If “log2”, max_features=log2(n_features). If None, max_features=n_features.

  • handle_unknown ({"error", "consider_missing"}, default="consider_missing") – If set to “error”, an error will be raised while encoding the data whenever a category in a categorical column was not seen during fit. If set to “consider_missing”, we will consider it as a missing value (it will end up in the same bin as missing values).

  • cat_min_categories (int or {"log", "sqrt"}, default="log") – When a column contains numerical values and its type is not specified through categorical_features, WildWood decides that it is categorical whenever its number of unique values is smaller than or equal to cat_min_categories. Otherwise, it is considered numerical. If an int larger than 3 is given, we use it as cat_min_categories. If “log”, we set cat_min_categories=max(2, floor(log(n_samples))). If “sqrt”, we set cat_min_categories=max(2, floor(sqrt(n_samples))). Default is “log”.

  • subsample (int or None, default=200000) – If n_samples > subsample, then subsample samples are chosen at random to compute the quantiles used to bin numerical columns. If None, the whole dataset is used.

  • n_jobs (int, default=1) – The number of jobs to run in parallel for fit(), predict(), predict_proba() and apply(). All these methods are parallelized over the trees in the forest. n_jobs=-1 means using all processors.

  • random_state (int, RandomState instance or None, default=None) – Controls both the randomness involved in bootstrapping the samples and sampling the features when looking for the best splits (if max_features < n_features). See Bootstrap and feature subsampling for details.

  • verbose (bool, default=False) – Controls the verbosity when fitting and predicting.

  • class_weight ("balanced" or None, default=None) – Weights associated with classes. If None, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). These weights will be multiplied with sample_weight when passed through the fit() method.

  • multiclass ({"multinomial", "ovr"}, default="multinomial") – Used only for multiclass classification with n_classes_ > 2 and data with categorical features. If “multinomial”, n_estimators trees will be trained to make multiclass predictions. See also cat_split_strategy in this case. If “ovr”, we use a one-versus-all strategy, where labels are binarized, n_classes_ * n_estimators trees are trained to make binary predictions, and the final predictions are obtained as normalized scores. Use multiclass="ovr" together with categorical_features for the best results in multiclass problems with categorical features (see the sketch after this parameter list).

  • cat_split_strategy ({"binary", "all", "random"}, default="binary") – Used only for multiclass classification with n_classes_ > 2, data with categorical features and multiclass="multinomial". If “binary”, the split search for categorical features uses a single loop over the bins, sorted with respect to the proportion of labels with class 1 in each bin. If “all”, it uses n_classes_ loops, corresponding to the bins sorted with respect to the proportion of labels of each class. If “random”, it performs a single loop, with bins sorted at random.
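
For multiclass problems with categorical features, the recommendation above is to combine multiclass="ovr" with categorical_features; a sketch on synthetic data (the data itself is meaningless and only illustrates the parameter usage):

    import numpy as np
    from wildwood import ForestClassifier

    rng = np.random.default_rng(0)
    n_samples = 1000
    # Two numerical features and one categorical feature given as integer codes.
    X = np.column_stack([
        rng.normal(size=n_samples),
        rng.normal(size=n_samples),
        rng.integers(0, 5, size=n_samples),
    ])
    y = rng.integers(0, 3, size=n_samples)   # three classes

    clf = ForestClassifier(
        n_estimators=10,
        categorical_features=[2],   # integer indices of the categorical columns
        multiclass="ovr",
        random_state=0,
    )
    clf.fit(X, y)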

classes_#

The classes labels.

Type:

ndarray of shape (n_classes,)

n_samples_in_#

The number of samples when fit() is performed.

Type:

int

n_features_in_#

The number of features when fit() is performed.

Type:

int

n_classes_#

The number of classes.

Type:

int

apply(X)[source]#

Apply trees in the forest to X, return leaf indices.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.

Returns:

X_leaves – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

fit(X, y, sample_weight=None, categorical_features=None, randomized_depth=False)[source]#

Trains WildWood’s forest predictor from the training set (X, y).

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be binned into a uint8 data type.

  • y (array-like of shape (n_samples,)) – The target values (class labels in classification, real numbers in regression).

  • sample_weight (array-like of shape (n_samples,), default=None) – If None, then samples are equally weighted. Otherwise, samples are weighted. If sample_weight[42] = 3.0, then all computations behave as if there were 3 lines with the same contents as X[42] (for split finding, node predictions, and the aggregation algorithm's computation of validation losses).

  • categorical_features (array-like, default=None) – Array-like containing boolean or integer values, of shape (n_features,) or (n_categorical_features,), indicating the categorical features. Note that this can also be specified as a parameter of the class. If None: no feature will be considered categorical. If boolean array-like: boolean mask indicating categorical features. If integer array-like: integer indices indicating categorical features.

Returns:

self – The fitted forest.

Return type:

object

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X)[source]#

Predict class for X.

The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input samples.

Returns:

y – The predicted classes.

Return type:

ndarray of shape (n_samples,)

predict_proba(X)[source]#

Predict class probabilities for X.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. If aggregation=False, the class probability of a single tree is a regularization using the dirichlet parameter of the fraction of samples of the same class in a leaf. If aggregation=True the class probability of a single tree is an aggregation with exponential weights of the predictions of all pruned subtrees it contains. See Prediction function: aggregation with exponential weights for more details.

Parameters:

X ({array-like} of shape (n_samples, n_features)) – The input samples.

Returns:

output – The class probabilities of the input samples.

Return type:

ndarray of shape (n_samples, n_classes)
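
The effect of aggregation on the returned probabilities can be compared directly by toggling the aggregation parameter (a sketch; the magnitude of the difference depends on the data):

    from sklearn.datasets import load_breast_cancer
    from wildwood import ForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    proba_agg = ForestClassifier(aggregation=True, random_state=0).fit(X, y).predict_proba(X)
    proba_raw = ForestClassifier(aggregation=False, random_state=0).fit(X, y).predict_proba(X)

    print(proba_agg.shape)                    # (n_samples, n_classes)
    print(abs(proba_agg - proba_raw).max())   # aggregation changes the probability estimates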

predict_proba_trees(X)[source]#

Gives the predict_proba(X) of each tree in the forest.

This simply returns a (n_estimators, n_samples, n_classes) ndarray containing the predict_proba of each tree in the forest, see predict_proba() for details.

Parameters:

X ({array-like} of shape (n_samples, n_features)) – The input samples.

Returns:

output – The predicted class probabilities by each tree for the input samples.

Return type:

ndarray of shape (n_estimators, n_samples, n_classes)

score(X, y, sample_weight=None)[source]#

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float

set_fit_request(*, categorical_features: bool | None | str = '$UNCHANGED$', randomized_depth: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') ForestClassifier[source]#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • categorical_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for categorical_features parameter in fit.

  • randomized_depth (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for randomized_depth parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') ForestClassifier[source]#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

Regression with the ForestRegressor class#

class wildwood.ForestRegressor(*, n_estimators: int = 10, criterion: str = 'mse', loss: str = 'mse', step: float = 1.0, aggregation: bool = True, max_depth: None | int = None, min_samples_split: int = 2, min_samples_leaf: int = 1, max_bins: int = 256, categorical_features=None, max_features: str | int = 'auto', handle_unknown='consider_missing', cat_min_categories='log', subsample=200000, n_jobs: int = 1, random_state=None, verbose: bool = False)[source]#

WildWood forest for regression.

It grows n_estimators trees in parallel using bootstrap samples and aggregates their predictions (bagging). Each tree uses “in-the-bag” samples to grow itself and “out-of-bag” samples to compute aggregation weights for all possible subtrees of the whole tree.

The prediction function of each tree in WildWood is very different from that of a standard decision tree whenever aggregation=True (default). Indeed, the predictions of a tree are computed here as an aggregation with exponential weights of all the predictions given by all possible subtrees (prunings) of the full tree. The required computations are performed efficiently thanks to a variant of the context tree weighting algorithm.

Also, continuous features are binned with a maximum of max_bins bins (+1 if a column contains missing values), allowing the use of an efficient histogram-based split search.
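
A minimal regression sketch (the synthetic dataset is purely illustrative):

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from wildwood import ForestRegressor

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    reg = ForestRegressor(n_estimators=10, n_jobs=-1, random_state=42)
    reg.fit(X_train, y_train)

    print(reg.score(X_test, y_test))    # coefficient of determination R^2 on the test split
    print(reg.predict(X_test[:5]))      # predicted targets for the first 5 test samples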

Parameters:
  • n_estimators (int, default=10) – The number of trees in the forest.

  • criterion ({"mse"}, default="mse") – The impurity criterion used to measure the quality of a split. Only “mse”, which corresponds to variance reduction for split finding, is available for now.

  • loss ({"mse"}, default="mse") – The loss used for the computation of the aggregation weights. Only “mse” is supported for now, which corresponds to the least-squares loss.

  • step (float, default=1.0) – Step-size for the aggregation weights. Default is 1.0; a larger value will lead to larger aggregation weights for subtrees with better out-of-bag (validation) loss.

  • aggregation (bool, default=True) – Controls if aggregation is used in the trees. It is highly recommended to leave it as True.

  • max_depth (int, default=None) – The maximum depth of a tree. If None, then nodes from the tree are split until they are “pure” (impurity is zero) or until they contain min_samples_split samples.

  • min_samples_split (int, default=2) – The minimum number of training samples and out-of-bag samples required to split a node. This must be >= 2.

  • min_samples_leaf (int, default=1) – A split point is considered if it leaves at least min_samples_leaf training samples and out-of-bag samples in both the left and right children. This must be >= 1.

  • max_bins (int, default=256) – The maximum number of bins for numerical columns, not including the bin used for missing values, if any. Should be at least 3. Before training, each column of the input array X is binned into integer-valued bins, corresponding to inter-quantile intervals, enabling faster split finding. We will use max_bins bins when the column has no missing values, and max_bins + 1 bins if it does. The last bin (at index max_bins) is used to encode missing values. If a column has fewer than max_bins distinct inter-quantile intervals or categories, fewer than max_bins bins are used for it.

  • categorical_features (array-like, default=None) – Array-like containing boolean or integer values, of shape (n_features,) or (n_categorical_features,), indicating the categorical features. If None: no feature will be considered categorical. If boolean array-like: boolean mask indicating categorical features. If integer array-like: integer indices indicating categorical features.

  • max_features ({"auto", "sqrt", "log2"} or int, default="auto") – The number of features to consider when looking for the best split. If int, consider max_features features at each split. If “auto”, max_features=sqrt(n_features). If “sqrt”, max_features=sqrt(n_features) (same as “auto”). If “log2”, max_features=log2(n_features). If None, max_features=n_features.

  • handle_unknown ({"error", "consider_missing"}, default="consider_missing") – If set to “error”, an error will be raised while encoding the data whenever a category in a categorical column was not seen during fit. If set to “consider_missing”, we will consider it as a missing value (it will end up in the same bin as missing values).

  • cat_min_categories (int or {"log", "sqrt"}, default="log") – When a column contains numerical values and its type is not specified through categorical_features, WildWood decides that it is categorical whenever its number of unique values is smaller than or equal to cat_min_categories. Otherwise, it is considered numerical. If an int larger than 3 is given, we use it as cat_min_categories. If “log”, we set cat_min_categories=max(2, floor(log(n_samples))). If “sqrt”, we set cat_min_categories=max(2, floor(sqrt(n_samples))). Default is “log”.

  • subsample (int or None, default=200000) – If n_samples > subsample, then subsample samples are chosen at random to compute the quantiles used to bin numerical columns. If None, the whole dataset is used.

  • n_jobs (int, default=1) – The number of jobs to run in parallel for fit(), predict() and apply(). All these methods are parallelized over the trees in the forest. n_jobs=-1 means using all processors.

  • random_state (int, RandomState instance or None, default=None) – Controls both the randomness involved in bootstrapping the samples and sampling the features when looking for the best splits (if max_features < n_features). See Bootstrap and feature subsampling for details.

  • verbose (bool, default=False) – Controls the verbosity when fitting and predicting.

n_samples_in_#

The number of samples when fit() is performed.

Type:

int

n_features_in_#

The number of features when fit() is performed.

Type:

int

apply(X)[source]#

Apply trees in the forest to X, return leaf indices.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.

Returns:

X_leaves – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

fit(X, y, sample_weight=None, categorical_features=None, randomized_depth=False)[source]#

Trains WildWood’s forest predictor from the training set (X, y).

Parameters:
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples. Internally, it will be binned into a uint8 data type.

  • y (array-like of shape (n_samples,)) – The target values (class labels in classification, real numbers in regression).

  • sample_weight (array-like of shape (n_samples,), default=None) – If None, then samples are equally weighted. Otherwise, samples are weighted. If sample_weight[42] = 3.0, then all computations behave as if there were 3 lines with the same contents as X[42] (for split finding, node predictions, and the aggregation algorithm's computation of validation losses).

  • categorical_features (array-like, default=None) – Array-like containing boolean or integer values, of shape (n_features,) or (n_categorical_features,), indicating the categorical features. Note that this can also be specified as a parameter of the class. If None: no feature will be considered categorical. If boolean array-like: boolean mask indicating categorical features. If integer array-like: integer indices indicating categorical features.

Returns:

self – The fitted forest.

Return type:

object

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

predict(X)[source]#

Predict regression target for X.

The predicted regression target of an input sample is computed as the mean predicted regression targets of the trees in the forest.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.

Returns:

y – The predicted values.

Return type:

ndarray of shape (n_samples,)

predict_trees(X)[source]#

Gives the predict(X) of each tree in the forest.

This simply returns a (n_estimators, n_samples) ndarray containing the predict of each tree in the forest, see predict() for details.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.

Returns:

output – The predicted target regression values by each tree for the input samples.

Return type:

ndarray of shape (n_estimators, n_samples)

score(X, y, sample_weight=None)[source]#

Return the coefficient of determination of the prediction.

The coefficient of determination \(R^2\) is defined as \((1 - \frac{u}{v})\), where \(u\) is the residual sum of squares ((y_true - y_pred) ** 2).sum() and \(v\) is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a \(R^2\) score of 0.0.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Test samples. For some estimators this may be a precomputed kernel matrix or a list of generic objects instead with shape (n_samples, n_samples_fitted), where n_samples_fitted is the number of samples used in the fitting for the estimator.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True values for X.

  • sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – \(R^2\) of self.predict(X) w.r.t. y.

Return type:

float

Notes

The \(R^2\) score used when calling score on a regressor uses multioutput='uniform_average' from version 0.23 to keep consistent with default value of r2_score(). This influences the score method of all the multioutput regressors (except for MultiOutputRegressor).

set_fit_request(*, categorical_features: bool | None | str = '$UNCHANGED$', randomized_depth: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') ForestRegressor[source]#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • categorical_features (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for categorical_features parameter in fit.

  • randomized_depth (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for randomized_depth parameter in fit.

  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') ForestRegressor[source]#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object