coniferest package¶
Subpackages¶
- coniferest.datasets package
- coniferest.session package
Session
Session.current
Session.known_anomalies
Session.known_labels
Session.known_regulars
Session.known_unknowns
Session.last_decision
Session.model
Session.run()
Session.scores
Session.terminate()
Session.terminated
- Submodules
- coniferest.session.callback module
- coniferest.session.oracle module
- coniferest.sklearn package
Submodules¶
coniferest.aadforest module¶
- class coniferest.aadforest.AADForest(n_trees=100, n_subsamples=256, max_depth=None, tau=0.97, C_a=1.0, prior_influence=1.0, n_jobs=None, random_seed=None)[source]¶
Bases:
Coniferest
Active Anomaly Detection with Isolation Forest.
See Das et al., 2017 https://arxiv.org/abs/1708.09441
- Parameters:
n_trees (int, optional) – Number of trees in the isolation forest.
n_subsamples (int, optional) – How many subsamples should be used to build every tree.
max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.
n_jobs (int or None, optional) – Number of threads to use for scoring. If None, all available CPUs are used.
random_seed (int or None, optional) – Random seed to use for reproducibility. If None, a random seed is used.
prior_influence (float or callable, optional) – A regularization coefficient in the loss function. Default is 1.0. Signature: ‘(anomaly_count, nominal_count) -> float’
- apply(x)[source]¶
Apply the forest to X, return leaf indices.
- Parameters:
x (ndarray shape (n_samples, n_features)) – 2-d array with features.
- Returns:
x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.
- Return type:
ndarray of shape (n_samples, n_estimators)
- fit(data, labels=None)[source]¶
Build the trees with the given data.
- Parameters:
data – Array with feature values of objects.
labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.
- Return type:
self
- fit_known(data, known_data=None, known_labels=None)[source]¶
The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely much smaller than data, in which case it is wasteful to hold labels for the whole dataset.
- Parameters:
data – Training data (array with feature values) to build trees with.
known_data – Feature values of known data.
known_labels – Labels of known data.
- Return type:
self
coniferest.calc_paths_sum module¶
- coniferest.calc_paths_sum.calc_apply(selectors, indices, data, num_threads=1)¶
- coniferest.calc_paths_sum.calc_feature_delta_sum(selectors, indices, data, num_threads=1)¶
- coniferest.calc_paths_sum.calc_paths_sum(selectors, indices, data, weights=None, num_threads=1)¶
- coniferest.calc_paths_sum.calc_paths_sum_transpose(selectors, indices, data, leaf_count, weights=None, num_threads=1)¶
coniferest.coniferest module¶
- class coniferest.coniferest.Coniferest(trees=None, n_subsamples=256, max_depth=None, n_jobs=-1, random_seed=None)[source]¶
Bases:
ABC
Base class for the forests in the package. It sets up the basic low-level machinery around the sklearn trees used here.
- Parameters:
trees (list or None, optional) – List with the trees in the forest. If None, then empty list is used.
n_subsamples (int, optional) – Subsamples to use for the training.
max_depth (int or None, optional) – Maximum depth of the trees in use. If None, then log2(n_subsamples)
n_jobs (int, optional) – Number of threads to use for scoring. If -1, then number of CPUs is used.
random_seed (int or None, optional) – Seed for the reproducibility. If None, then random seed is used.
- build_one_tree(data)[source]¶
Build just one tree.
- Parameters:
data – Features to build the tree from.
- Return type:
A tree.
- build_trees(data, n_trees)[source]¶
Build n_trees trees from the supplied data.
- Parameters:
data – Features.
n_trees – Number of trees to build.
- Return type:
List of trees.
- class coniferest.coniferest.ConiferestEvaluator(coniferest, map_value=None)[source]¶
Bases:
ForestEvaluator
Fast evaluator of scores for Coniferests.
- Parameters:
coniferest (Coniferest) – The forest for building the evaluator from.
map_value (callable or None) – Optional function to map leaf values; must accept a 1-D array of values and return an array of the same shape.
coniferest.evaluator module¶
- class coniferest.evaluator.ForestEvaluator(samples, selectors, indices, leaf_count, *, num_threads)[source]¶
Bases:
object
- classmethod average_path_length(n_nodes)[source]¶
The average path length is abstracted because in different cases slightly different formulas may be needed to match other software exactly.
By default, our own implementation is used.
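The default implementation is not reproduced on this page; the usual isolation-forest normalization (Liu et al., 2008) is c(n) = 2(H(n−1) − (n−1)/n), with H the harmonic number. A standalone sketch of that conventional formula, which may differ from the package's exact variant, especially for small n:

```python
import numpy as np

def average_path_length(n_nodes):
    """c(n): expected depth of an unsuccessful BST search over n samples.

    Large-n approximation using H(n-1) ~ ln(n-1) + Euler's gamma; the
    package may use a corrected or exact formula instead.
    """
    n = np.asarray(n_nodes, dtype=float)
    harmonic = np.log(n - 1.0) + np.euler_gamma  # H(n-1) approximation
    return 2.0 * (harmonic - (n - 1.0) / n)
```

Isolation-forest scores are normalized by c(n_subsamples), so agreeing on this constant is what makes scores comparable across implementations.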
- classmethod combine_selectors(selectors_list)[source]¶
Combine several node arrays into one array of nodes and one array of start indices.
- Parameters:
selectors_list – List of node arrays to combine.
- Returns:
Pair of two arrays
- Return type:
node array and array of starting indices.
- score_samples(x)[source]¶
Perform the computations.
- Parameters:
x – Features to calculate scores of. Should be C-contiguous for performance.
- Return type:
Array of scores.
- selector_dtype = dtype([('feature', '<i4'), ('left', '<i4'), ('value', '<f8'), ('right', '<i4'), ('log_n_node_samples', '<f4')])¶
coniferest.experiment module¶
coniferest.isoforest module¶
- class coniferest.isoforest.IsolationForest(n_trees=100, n_subsamples=256, max_depth=None, n_jobs=None, random_seed=None)[source]¶
Bases:
Coniferest
Isolation forest.
This is a reimplementation of sklearn.ensemble.IsolationForest, which trains and evaluates much faster. It also supports multi-threading for evaluation (sample scoring).
- Parameters:
n_trees (int, optional) – Number of trees in forest to build.
n_subsamples (int, optional) – Number of subsamples to use for building the trees.
max_depth (int or None, optional) – Maximal tree depth. If None, log2(n_subsamples) is used.
n_jobs (int or None, optional) – Number of threads to use for evaluation. If None, use all available CPUs.
random_seed (int or None, optional) – Seed for reproducibility. If None, random seed is used.
- apply(x)[source]¶
Apply the forest to X, return leaf indices.
- Parameters:
x (ndarray shape (n_samples, n_features)) – 2-d array with features.
- Returns:
x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.
- Return type:
ndarray of shape (n_samples, n_estimators)
coniferest.label module¶
- class coniferest.label.Label(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
IntEnum
Anomalous classification labels.
Three types of labels:
-1 for anomalies, referenced either as Label.ANOMALY or as Label.A,
0 for unknowns: Label.UNKNOWN or Label.U,
1 for regular data: Label.REGULAR or Label.R.
- A = -1¶
- ANOMALY = -1¶
- R = 1¶
- REGULAR = 1¶
- U = 0¶
- UNKNOWN = 0¶
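Since Label is an IntEnum, its members compare and compute as plain integers. A self-contained mirror of the documented values (the real class lives in coniferest.label):

```python
from enum import IntEnum

class Label(IntEnum):
    """Mirror of coniferest.label.Label: -1 anomaly, 0 unknown, 1 regular."""
    ANOMALY = -1
    A = -1        # alias for ANOMALY
    UNKNOWN = 0
    U = 0         # alias for UNKNOWN
    REGULAR = 1
    R = 1         # alias for REGULAR

assert Label.A is Label.ANOMALY                       # aliases resolve to one member
assert Label.ANOMALY < Label.UNKNOWN < Label.REGULAR  # plain int ordering
```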
coniferest.limeforest module¶
- class coniferest.limeforest.LimeEvaluator(pine_forest)[source]¶
Bases:
ForestEvaluator
coniferest.pineforest module¶
- class coniferest.pineforest.PineForest(n_trees=100, n_subsamples=256, max_depth=None, n_spare_trees=400, regenerate_trees=False, weight_ratio=1.0, n_jobs=None, random_seed=None)[source]¶
Bases:
Coniferest
Pine Forest for active anomaly detection.
Pine Forests are filtering isolation forests: a simple way of incorporating prior knowledge about what is anomalous and what is not.
The standard fit procedure with two parameters works exactly the same as the isolation forest’s. The behaviour changes when the additional labels parameter is supplied: fit then builds not only n_trees trees but n_spare_trees additional ones, and afterwards filters out n_spare_trees of them, leaving only the n_trees that deliver better scores for the data known to be anomalous.
- Parameters:
n_trees (int, optional) – Number of trees to keep for estimating anomaly scores.
n_subsamples (int, optional) – How many subsamples should be used to build every tree.
max_depth (int or None, optional) – Maximum depth of every tree. If None, log2(n_subsamples) is used.
n_spare_trees (int, optional) – Number of trees to generate additionally for further filtering.
regenerate_trees (bool, optional) – Whether to throw out all the trees during retraining, or to mix old trees with fresh ones. False by default, so we mix.
weight_ratio (float, optional) – The relative weight of false positives versus true positives (i.e. we are not interested in negatives in anomaly detection). The weight is used during the filtering process.
n_jobs (int or None, optional) – Number of threads to use for scoring. If None, all available CPUs are used.
random_seed (int or None, optional) – Random seed. If None, a random seed is used.
- apply(x)[source]¶
Apply the forest to X, return leaf indices.
- Parameters:
x (ndarray shape (n_samples, n_features)) – 2-d array with features.
- Returns:
x_leafs – For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.
- Return type:
ndarray of shape (n_samples, n_estimators)
- filter_trees(trees, data, labels, n_filter, weight_ratio=1)[source]¶
Filter the trees out.
- Parameters:
trees – Trees to filter.
n_filter – Number of trees to filter out.
data – The labeled objects themselves.
labels – The labels of the objects. -1 is anomaly, 1 is not anomaly, 0 is uninformative.
weight_ratio – Weight of the false-positive experience relative to the false-negative one. Defaults to 1.
- fit(data, labels=None)[source]¶
Build the trees with the given data.
- Parameters:
data – Array with feature values of objects.
labels – Optional. Labels of objects. May be regular, anomalous or unknown. See Label data for details.
- Return type:
self
- fit_known(data, known_data=None, known_labels=None)[source]¶
The same as fit, but with a slightly different API. Known data and labels are kept separate from the training data for time and space efficiency: known_data is likely much smaller than data, in which case it is wasteful to hold labels for the whole dataset.
- Parameters:
data – Training data (array with feature values) to build trees with.
known_data – Feature values of known data.
known_labels – Labels of known data.
- Return type:
self