coniferest package

Submodules

coniferest.aadforest module

class coniferest.aadforest.AADForest(n_trees=100, n_subsamples=256, max_depth=None, tau=0.97, C_a=1.0, prior_influence=1.0, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Active Anomaly Detection with Isolation Forest.

See Das et al., 2017 https://arxiv.org/abs/1708.09441

Parameters:
  • n_trees (int, optional) – Number of trees in the isolation forest.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • n_jobs (int or None, optional) – Number of threads to use for scoring. If None, all available CPUs are used.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.

  • prior_influence (float or callable, optional) – A regularization coefficient in the loss function. Default is 1.0. Signature: ‘(anomaly_count, nominal_count) -> float’.
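
A minimal usage sketch in doctest form (the synthetic data and the two labelled objects are purely illustrative):

>>> import numpy as np
>>> from coniferest.aadforest import AADForest
>>> from coniferest.label import Label
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))      # synthetic feature matrix
>>> labels = np.full(1000, Label.UNKNOWN)  # no prior knowledge yet
>>> labels[0] = Label.ANOMALY              # one expert-confirmed anomaly
>>> labels[1] = Label.REGULAR              # one expert-confirmed regular object
>>> forest = AADForest(n_trees=100, random_seed=42).fit(data, labels)
>>> scores = forest.score_samples(data)    # one anomaly score per object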

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)
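
For instance, continuing the AADForest sketch above:

>>> leaves = forest.apply(data)  # one leaf index per (sample, tree) pair
>>> leaves.shape
(1000, 100)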

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees using the supplied data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely to be much smaller than data, in which case it is wasteful to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self
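
A sketch of the same training as in the class example above, with the two known objects kept separate from the bulk of the data:

>>> known_data = data[:2]  # only the expert-labelled objects
>>> known_labels = np.array([Label.ANOMALY, Label.REGULAR])
>>> forest = AADForest(random_seed=42).fit_known(data, known_data, known_labels)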

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.calc_paths_sum module

coniferest.calc_paths_sum.calc_apply(selectors, indices, data, num_threads=1)
coniferest.calc_paths_sum.calc_feature_delta_sum(selectors, indices, data, num_threads=1)
coniferest.calc_paths_sum.calc_paths_sum(selectors, indices, data, weights=None, num_threads=1)
coniferest.calc_paths_sum.calc_paths_sum_transpose(selectors, indices, data, leaf_count, weights=None, num_threads=1)

coniferest.coniferest module

class coniferest.coniferest.Coniferest(trees=None, n_subsamples=256, max_depth=None, n_jobs=-1, random_seed=None)[source]

Bases: ABC

Base class for the forests in the package. It provides the basic low-level machinery built on scikit-learn's trees.

Parameters:
  • trees (list or None, optional) – List with the trees in the forest. If None, an empty list is used.

  • n_subsamples (int, optional) – Number of subsamples to use for training.

  • max_depth (int or None, optional) – Maximum depth of the trees in use. If None, log2(n_subsamples) is used.

  • n_jobs (int, optional) – Number of threads to use for scoring. If -1, the number of available CPUs is used.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, a random seed is used.

build_one_tree(data)[source]

Build just one tree.

Parameters:

data – Features to build the tree from.

Return type:

A tree.

build_trees(data, n_trees)[source]

Build n_trees trees from the supplied data.

Parameters:
  • data – Features.

  • n_trees – Number of trees to build.

Return type:

List of trees.

abstract feature_importance(x)[source]
abstract feature_signature(x)[source]
abstract fit(data, labels=None)[source]

Fit to the supplied data.

abstract fit_known(data, known_data=None, known_labels=None)[source]

Fit to the supplied data with priors.

abstract score_samples(samples)[source]

Evaluate scores for samples.
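
As a hypothetical sketch of deriving a concrete forest (the ToyForest name and the _evaluator attribute are illustrative assumptions, not part of the package; real subclasses such as IsolationForest below are more elaborate):

>>> from coniferest.coniferest import Coniferest, ConiferestEvaluator
>>> class ToyForest(Coniferest):
...     def fit(self, data, labels=None):
...         self.trees = self.build_trees(data, n_trees=100)
...         self._evaluator = ConiferestEvaluator(self)  # fast scoring backend
...         return self
...     def fit_known(self, data, known_data=None, known_labels=None):
...         return self.fit(data)  # this toy ignores the priors
...     def score_samples(self, samples):
...         return self._evaluator.score_samples(samples)
...     def feature_importance(self, x):
...         return self._evaluator.feature_importance(x)
...     def feature_signature(self, x):
...         return self._evaluator.feature_signature(x)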

class coniferest.coniferest.ConiferestEvaluator(coniferest, map_value=None)[source]

Bases: ForestEvaluator

Fast evaluator of scores for Coniferests.

Parameters:
  • coniferest (Coniferest) – The forest to build the evaluator from.

  • map_value (callable or None) – Optional function to map leaf values; it must accept a 1-D array of values and return an array of the same shape.

classmethod extract_selectors(tree, map_value=None)[source]

Extract node representations for the tree.

Parameters:
  • tree – Tree to extract selectors from.

  • map_value – Optional function to map leaf values.

Return type:

Array with selectors.

coniferest.evaluator module

class coniferest.evaluator.ForestEvaluator(samples, selectors, indices, leaf_count, *, num_threads)[source]

Bases: object

apply(x)[source]
classmethod average_path_length(n_nodes)[source]

The average path length computation is abstracted because in different cases we may want to use slightly different formulas to match other software exactly.

By default we use our own implementation.

classmethod combine_selectors(selectors_list)[source]

Combine several node arrays into one array of nodes and one array of start indices.

Parameters:

selectors_list – List of node arrays to combine.

Returns:

A pair of arrays

Return type:

the combined node array and the array of starting indices.

feature_importance(x)[source]
feature_signature(x)[source]
score_samples(x)[source]

Compute scores for the given samples.

Parameters:

x – Features to calculate scores of. Should be C-contiguous for performance.

Return type:

Array of scores.

selector_dtype = dtype([('feature', '<i4'), ('left', '<i4'), ('value', '<f8'), ('right', '<i4'), ('log_n_node_samples', '<f4')])
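
A sketch of allocating a selector record array with this dtype using numpy:

>>> import numpy as np
>>> from coniferest.evaluator import ForestEvaluator
>>> nodes = np.zeros(2, dtype=ForestEvaluator.selector_dtype)
>>> sorted(nodes.dtype.names)
['feature', 'left', 'log_n_node_samples', 'right', 'value']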

coniferest.experiment module

coniferest.isoforest module

class coniferest.isoforest.IsolationForest(n_trees=100, n_subsamples=256, max_depth=None, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Isolation forest.

This is a reimplementation of sklearn.ensemble.IsolationForest that trains and evaluates much faster. It also supports multi-threaded evaluation (sample scoring).

Parameters:
  • n_trees (int, optional) – Number of trees in forest to build.

  • n_subsamples (int, optional) – Number of subsamples to use for building the trees.

  • max_depth (int or None, optional) – Maximal tree depth. If None, log2(n_subsamples) is used.

  • n_jobs (int or None, optional) – Number of threads to use for evaluation. If None, use all available CPUs.

  • random_seed (int or None, optional) – Seed for reproducibility. If None, random seed is used.
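
A minimal usage sketch (synthetic data; the comment on score ordering is an assumption following the sklearn convention):

>>> import numpy as np
>>> from coniferest.isoforest import IsolationForest
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))
>>> forest = IsolationForest(n_trees=128, random_seed=42).fit(data)
>>> scores = forest.score_samples(data)
>>> candidates = np.argsort(scores)[:10]  # ten most anomalous objects, assuming lower scores mean more anomalous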

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
fit(data, labels=None)[source]

Build the trees based on data.

Parameters:
  • data – 2-d array with features.

  • labels – Unused. Defaults to None.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

Fit to the supplied data with priors.

score_samples(samples)[source]

Compute scores for the given samples.

Parameters:

samples – 2-d array with features.

Return type:

1-d array with scores.

coniferest.label module

class coniferest.label.Label(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: IntEnum

Anomalous classification labels.

Three types of labels:

  • -1 for anomalies, referenced either as Label.ANOMALY or as Label.A,

  • 0 for unknowns: Label.UNKNOWN or Label.U,

  • 1 for regular data: Label.REGULAR or Label.R.

A = -1
ANOMALY = -1
R = 1
REGULAR = 1
U = 0
UNKNOWN = 0
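
The aliases are the same enum members, and since Label is an IntEnum, labels can be stored directly in integer arrays:

>>> import numpy as np
>>> from coniferest.label import Label
>>> Label.ANOMALY is Label.A
True
>>> np.array([Label.A, Label.U, Label.R])
array([-1,  0,  1])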

coniferest.limeforest module

class coniferest.limeforest.LimeEvaluator(pine_forest)[source]

Bases: ForestEvaluator

classmethod extract_selectors(pine)[source]
class coniferest.limeforest.RandomLime(features, selectors, values)[source]

Bases: object

paths(x)[source]
class coniferest.limeforest.RandomLimeForest(trees=100, subsamples=256, depth=None, seed=0)[source]

Bases: object

fit(data)[source]
mean_paths(data)[source]
scores(data)[source]
class coniferest.limeforest.RandomLimeGenerator(sample, depth, seed=0)[source]

Bases: object

coniferest.pineforest module

class coniferest.pineforest.PineForest(n_trees=100, n_subsamples=256, max_depth=None, n_spare_trees=400, regenerate_trees=False, weight_ratio=1.0, n_jobs=None, random_seed=None)[source]

Bases: Coniferest

Pine Forest for active anomaly detection.

Pine Forests are filtering isolation forests: a simple way of incorporating prior knowledge about what is anomalous and what is not.

The standard fit procedure, called with data alone, works exactly as for an isolation forest. The behaviour changes when the additional labels parameter is supplied: fit then generates not only n_trees but n_spare_trees extra trees, and afterwards filters n_spare_trees of them out, keeping the n_trees that deliver the best scores on the data known to be anomalous. See the sketch after the parameter list below.

Parameters:
  • n_trees (int, optional) – Number of trees to keep for estimating anomaly scores.

  • n_subsamples (int, optional) – How many subsamples should be used to build every tree.

  • max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.

  • n_spare_trees (int, optional) – Number of trees to generate additionally for further filtering.

  • regenerate_trees (bool, optional) – Whether to throw out all the trees during retraining or to mix old trees with fresh ones. False by default, so old and new trees are mixed.

  • weight_ratio (float, optional) – Relative weight of false positives with respect to true positives (negatives are of little interest in anomaly detection). The weight is used during the filtering process.

  • n_jobs (int, optional) – Number of threads to use for scoring. If None, the number of available CPUs is used.

  • random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.
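
A hypothetical active-learning sketch (the loop structure and the assumption that lower scores mean more anomalous are illustrative, not part of the API):

>>> import numpy as np
>>> from coniferest.label import Label
>>> from coniferest.pineforest import PineForest
>>> rng = np.random.default_rng(0)
>>> data = rng.normal(size=(1000, 2))
>>> forest = PineForest(n_trees=100, n_spare_trees=400, random_seed=42)
>>> known_idx = [42]                # pretend an expert has labelled object 42...
>>> known_labels = [Label.ANOMALY]  # ...as a genuine anomaly
>>> for _ in range(3):              # a tiny expert-in-the-loop cycle
...     forest = forest.fit_known(data, data[known_idx], np.array(known_labels))
...     scores = forest.score_samples(data)
...     scores[known_idx] = np.inf  # never re-select already-labelled objects
...     known_idx.append(int(np.argmin(scores)))  # next candidate for the expert
...     known_labels.append(Label.ANOMALY)        # pretend each one is confirmed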

apply(x)[source]

Apply the forest to X, return leaf indices.

Parameters:

x (ndarray of shape (n_samples, n_features)) – 2-d array with features.

Returns:

x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Return type:

ndarray of shape (n_samples, n_estimators)

feature_importance(x)[source]
feature_signature(x)[source]
filter_trees(trees, data, labels, n_filter, weight_ratio=1)[source]

Filter the trees out.

Parameters:
  • trees – Trees to filter.

  • n_filter – Number of trees to filter out.

  • data – The labeled objects themselves.

  • labels – The labels of the objects. -1 is anomaly, 1 is not anomaly, 0 is uninformative.

  • weight_ratio – Weight of false positives relative to false negatives. Defaults to 1.

fit(data, labels=None)[source]

Build the trees using the supplied data.

Parameters:
  • data – Array with feature values of objects.

  • labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.

Return type:

self

fit_known(data, known_data=None, known_labels=None)[source]

The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely to be much smaller than data, in which case it is wasteful to hold labels for the whole dataset.

Parameters:
  • data – Training data (array with feature values) to build trees with.

  • known_data – Feature values of known data.

  • known_labels – Labels of known data.

Return type:

self

score_samples(samples)[source]

Compute scores for the supplied data.

Parameters:

samples – Feature values to compute scores on.

Return type:

Array with computed scores.

coniferest.utils module

coniferest.utils.average_path_length(n)[source]

Average path length computation.

Parameters:

n – Either an array of tree depths to compute average path lengths for, or a single tree depth scalar.

Return type:

Average path length.
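
For example (output values omitted, since the exact formula is implementation-specific):

>>> import numpy as np
>>> from coniferest.utils import average_path_length
>>> c = average_path_length(256)                   # scalar depth in, scalar out
>>> cs = average_path_length(np.array([16, 256]))  # vectorized over an array of depths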