coniferest package

Submodules

coniferest.aadforest module

class coniferest.aadforest.AADForest(n_trees=100, n_subsamples=256, max_depth=None, tau=0.97, C_a=1.0, prior_influence=1.0, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Active Anomaly Detection with Isolation Forest.

See Das et al., 2017 https://arxiv.org/abs/1708.09441

Parameters:
  • n_trees (int, optional) – Number of trees in the isolation forest.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • n_jobs (int or None, optional) – Number of threads to use for scoring. If None, all available CPUs are used.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.

  • prior_influence (float or callable, optional) – A regularization coefficient in the loss function. Default is 1.0. Signature: ‘(anomaly_count, nominal_count) -> float’.
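
A minimal usage sketch in doctest form (the synthetic data and the two labelled objects are purely illustrative):

>>> import numpy as np
>>> from coniferest.aadforest import AADForest
>>> from coniferest.label import Label
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))      # synthetic feature matrix
>>> labels = np.full(1000, Label.UNKNOWN)  # no prior knowledge yet
>>> labels[0] = Label.ANOMALY              # one expert-confirmed anomaly
>>> labels[1] = Label.REGULAR              # one expert-confirmed regular object
>>> forest = AADForest(n_trees=100, random_seed=42).fit(data, labels)
>>> scores = forest.score_samples(data)    # one anomaly score per object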

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)
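
For instance, continuing the AADForest sketch above:

>>> leaves = forest.apply(data)  # one leaf index per (sample, tree) pair
>>> leaves.shape
(1000, 100)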

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees using the supplied data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely to be much smaller than data, in which case it is wasteful to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self
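
A sketch of the same training as in the class example above, with the two known objects kept separate from the bulk of the data:

>>> known_data = data[:2]  # only the expert-labelled objects
>>> known_labels = np.array([Label.ANOMALY, Label.REGULAR])
>>> forest = AADForest(random_seed=42).fit_known(data, known_data, known_labels)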

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.calc_paths_sum module

coniferest.calc_paths_sum.calc_apply(selectors, indices, data, num_threads=1)
coniferest.calc_paths_sum.calc_feature_delta_sum(selectors, indices, data, num_threads=1)
coniferest.calc_paths_sum.calc_paths_sum(selectors, indices, data, weights=None, num_threads=1)
coniferest.calc_paths_sum.calc_paths_sum_transpose(selectors, indices, data, leaf_count, weights=None, num_threads=1)

coniferest.coniferest module

class coniferest.coniferest.Coniferest(trees=None, n_subsamples=256, max_depth=None, n_jobs=-1, random_seed=None)[source]

Bases: ABC

Base class for the forests in the package. It provides the basic low-level machinery built on scikit-learn's trees.

Parameters:
  • trees (list or None, optional) – List with the trees in the forest. If None, an empty list is used.

  • n_subsamples (int, optional) – Number of subsamples to use for training.

  • max_depth (int or None, optional) – Maximum depth of the trees in use. If None, log2(n_subsamples) is used.

  • n_jobs (int, optional) – Number of threads to use for scoring. If -1, the number of available CPUs is used.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, a random seed is used.

build_one_tree(data)[source]

Build just one tree.

Parameters:

data – Features to build the tree from.

Return type:

A tree.

build_trees(data, n_trees)[source]

Build n_trees trees from the supplied data.

Parameters:
  • data – Features.

  • n_trees – Number of trees to build.

Return type:

List of trees.

abstract feature_importance(x)[source]
abstract feature_signature(x)[source]
abstract fit(data, labels=None)[source]

Fit to the supplied data.

abstract fit_known(data, known_data=None, known_labels=None)[source]

Fit to the supplied data with priors.

abstract score_samples(samples)[source]

Evaluate scores for samples.
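
As a hypothetical sketch of deriving a concrete forest (the ToyForest name and the _evaluator attribute are illustrative assumptions, not part of the package; real subclasses such as IsolationForest below are more elaborate):

>>> from coniferest.coniferest import Coniferest, ConiferestEvaluator
>>> class ToyForest(Coniferest):
...     def fit(self, data, labels=None):
...         self.trees = self.build_trees(data, n_trees=100)
...         self._evaluator = ConiferestEvaluator(self)  # fast scoring backend
...         return self
...     def fit_known(self, data, known_data=None, known_labels=None):
...         return self.fit(data)  # this toy ignores the priors
...     def score_samples(self, samples):
...         return self._evaluator.score_samples(samples)
...     def feature_importance(self, x):
...         return self._evaluator.feature_importance(x)
...     def feature_signature(self, x):
...         return self._evaluator.feature_signature(x)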

class coniferest.coniferest.ConiferestEvaluator(coniferest, map_value=None)[source]

Bases: ForestEvaluator

Fast evaluator of scores for Coniferests.

Parameters:
  • coniferest (Coniferest) – The forest to build the evaluator from.

  • map_value (callable or None) – Optional function to map leaf values; it must accept a 1-D array of values and return an array of the same shape.

classmethod extract_selectors(tree, map_value=None)[source]

Extract node representations for the tree.

Parameters:
  • tree – Tree to extract selectors from.

  • map_value – Optional function to map leaf values.

Return type:

Array with selectors.

coniferest.evaluator module

class coniferest.evaluator.ForestEvaluator(samples, selectors, indices, leaf_count, *, num_threads)[source]

Bases: object

apply(x)[source]
classmethod average_path_length(n_nodes)[source]

The average path length computation is abstracted because in different cases we may want to use slightly different formulas to match other software exactly.

By default we use our own implementation.

classmethod combine_selectors(selectors_list)[source]

Combine several node arrays into one array of nodes and one array of start indices.

Parameters:

selectors_list – List of node arrays to combine.

Returns:

A pair of arrays

Return type:

the combined node array and the array of starting indices.

feature_importance(x)[source]
feature_signature(x)[source]
score_samples(x)[source]

Compute scores for the given samples.

Parameters:

x – Features to calculate scores of. Should be C-contiguous for performance.

Return type:

Array of scores.

selector_dtype = dtype([('feature', '<i4'), ('left', '<i4'), ('value', '<f8'), ('right', '<i4'), ('log_n_node_samples', '<f4')])
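
A sketch of allocating a selector record array with this dtype using numpy:

>>> import numpy as np
>>> from coniferest.evaluator import ForestEvaluator
>>> nodes = np.zeros(2, dtype=ForestEvaluator.selector_dtype)
>>> sorted(nodes.dtype.names)
['feature', 'left', 'log_n_node_samples', 'right', 'value']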

coniferest.experiment module

coniferest.isoforest module

class coniferest.isoforest.IsolationForest(n_trees=100, n_subsamples=256, max_depth=None, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Isolation forest.

This is a reimplementation of sklearn.ensemble.IsolationForest that trains and evaluates much faster. It also supports multi-threaded evaluation (sample scoring).

Parameters:
  • n_trees (int, optional) – Number of trees in forest to build.

  • n_subsamples (int, optional) – Number of subsamples to use for building the trees.

  • max_depth (int or None, optional) – Maximal tree depth. If None, log2(n_subsamples) is used.

  • n_jobs (int or None, optional) – Number of threads to use for evaluation. If None, use all available CPUs.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, random seed is used.
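
A minimal usage sketch (synthetic data; the comment on score ordering is an assumption following the sklearn convention):

>>> import numpy as np
>>> from coniferest.isoforest import IsolationForest
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))
>>> forest = IsolationForest(n_trees=128, random_seed=42).fit(data)
>>> scores = forest.score_samples(data)
>>> candidates = np.argsort(scores)[:10]  # ten most anomalous objects, assuming lower scores mean more anomalous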

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees based on data.

Parameters:
  • data – 2-d array with features.

  • labels – Unused. Defaults to None.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

Fit to the supplied data with priors.

score_samples(samples)[source]

Compute scores for the given samples.

Parameters:

samples – 2-d array with features.

Return type:

1-d array with scores.

coniferest.label module

class coniferest.label.Label(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Anomalous classification labels.

Three types of labels:

  • -1 for anomalies, referenced either as Label.ANOMALY or as Label.A,

  • 0 for unknowns: Label.UNKNOWN or Label.U,

  • 1 for regular data: Label.REGULAR or Label.R.

A = -1
ANOMALY = -1
R = 1
REGULAR = 1
U = 0
UNKNOWN = 0
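
The aliases are the same enum members, and since Label is an IntEnum, labels can be stored directly in integer arrays:

>>> import numpy as np
>>> from coniferest.label import Label
>>> Label.ANOMALY is Label.A
True
>>> np.array([Label.A, Label.U, Label.R])
array([-1,  0,  1])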

coniferest.limeforest module

class coniferest.limeforest.LimeEvaluator(pine_forest)[source]

Bases: ForestEvaluator

classmethod extract_selectors(pine)[source]
class coniferest.limeforest.RandomLime(features, selectors, values)[source]

Bases: object

paths(x)[source]
class coniferest.limeforest.RandomLimeForest(trees=100, subsamples=256, depth=None, seed=0)[source]

Bases: object

fit(data)[source]
mean_paths(data)[source]
scores(data)[source]
class coniferest.limeforest.RandomLimeGenerator(sample, depth, seed=0)[source]

Bases: object

coniferest.pineforest module

class coniferest.pineforest.PineForest(n_trees=100, n_subsamples=256, max_depth=None, n_spare_trees=400, regenerate_trees=False, weight_ratio=1.0, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Pine Forest for active anomaly detection.

Pine Forests are filtering isolation forests: a simple way of incorporating prior knowledge about what is anomalous and what is not.

The standard fit procedure, called with data alone, works exactly as for an isolation forest. The behaviour changes when the additional labels parameter is supplied: fit then generates not only n_trees but n_spare_trees extra trees, and afterwards filters n_spare_trees of them out, keeping the n_trees that deliver the best scores on the data known to be anomalous. See the sketch after the parameter list below.

Parameters:
  • n_trees (int, optional) – Number of trees to keep for estimating anomaly scores.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • n_spare_trees (int, optional) – Number of trees to generate additionally for further filtering.

  • regenerate_trees (bool, optional) – Whether to throw out all the trees during retraining or to mix old trees with fresh ones. False by default, so old and new trees are mixed.

  • weight_ratio (float, optional) – Relative weight of false positives with respect to true positives (negatives are of little interest in anomaly detection). The weight is used during the filtering process.

  • n_jobs (int, optional) – Number of threads to use for scoring. If None, the number of available CPUs is used.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.
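
A hypothetical active-learning sketch (the loop structure and the assumption that lower scores mean more anomalous are illustrative, not part of the API):

>>> import numpy as np
>>> from coniferest.label import Label
>>> from coniferest.pineforest import PineForest
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))
>>> forest = PineForest(n_trees=100, n_spare_trees=400, random_seed=42)
>>> known_idx = [42]                # pretend an expert has labelled object 42...
>>> known_labels = [Label.ANOMALY]  # ...as a genuine anomaly
>>> for _ in range(3):              # a tiny expert-in-the-loop cycle
...     forest = forest.fit_known(data, data[known_idx], np.array(known_labels))
...     scores = forest.score_samples(data)
...     scores[known_idx] = np.inf  # never re-select already-labelled objects
...     known_idx.append(int(np.argmin(scores)))  # next candidate for the expert
...     known_labels.append(Label.ANOMALY)        # pretend each one is confirmed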

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
filter_trees(trees, data, labels, n_filter, weight_ratio=1)[source]

Filter the trees out.

Parameters:
  • trees – Trees to filter.

  • n_filter – Number of trees to filter out.

  • data – The labeled objects themselves.

  • labels – The labels of the objects. -1 is anomaly, 1 is not anomaly, 0 is uninformative.

  • weight_ratio – Weight of false positives relative to false negatives. Defaults to 1.

fit(data, labels=None)[source]

Build the trees using the supplied data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely to be much smaller than data, in which case it is wasteful to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.utils module

coniferest.utils.average_path_length(n)[source]

Average path length computation.

Parameters:

n – Either an array of tree depths to compute average path lengths for, or a single tree depth scalar.

Return type:

Average path length.
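
For example (output values omitted, since the exact formula is implementation-specific):

>>> import numpy as np
>>> from coniferest.utils import average_path_length
>>> c = average_path_length(256)                   # scalar depth in, scalar out
>>> cs = average_path_length(np.array([16, 256]))  # vectorized over an array of depths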