US baby names dataset¶
This notebook gives an example of Active Anomaly Detection with coniferest and the US baby names dataset.
Developers of coniferest:
- Matwey Kornilov (MSU)
- Vladimir Korolev
- Konstantin Malanchev (LINCC Frameworks / CMU), notebook author

The tutorial is co-authored by Etienne Russeil (LPC).
Install and import modules¶
[1]:
%pip install coniferest
%pip install pandas
%pip install requests
[2]:
import zipfile
from io import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests

from coniferest.isoforest import IsolationForest
from coniferest.pineforest import PineForest
from coniferest.session import Session
from coniferest.session.callback import TerminateAfter, prompt_decision_callback, Label
Data preparation¶
Download the data and put it into a single data frame.
[3]:
%%time

URL = 'https://www.ssa.gov/OACT/babynames/state/namesbystate.zip'

response = requests.get(URL)
if not response.ok:
    raise RuntimeError('Cannot download the file')

dfs = []
with zipfile.ZipFile(BytesIO(response.content)) as zip_ref:
    for filename in zip_ref.namelist():
        if not filename.lower().endswith('.txt'):
            continue
        with zip_ref.open(filename) as f:
            df = pd.read_csv(f, header=None, names=['State', 'Gender', 'Year', 'Name', 'Count'])
            dfs.append(df)

raw = pd.concat(dfs, axis=0, ignore_index=True)
raw
CPU times: user 3.02 s, sys: 477 ms, total: 3.5 s
Wall time: 3.92 s
[3]:
 | State | Gender | Year | Name | Count |
---|---|---|---|---|---|
0 | AK | F | 1910 | Mary | 14 |
1 | AK | F | 1910 | Annie | 12 |
2 | AK | F | 1910 | Anna | 10 |
3 | AK | F | 1910 | Margaret | 8 |
4 | AK | F | 1910 | Helen | 7 |
... | ... | ... | ... | ... | ... |
6504156 | WY | M | 2023 | Parker | 5 |
6504157 | WY | M | 2023 | Rhett | 5 |
6504158 | WY | M | 2023 | Roman | 5 |
6504159 | WY | M | 2023 | Ryan | 5 |
6504160 | WY | M | 2023 | Timothy | 5 |
6504161 rows × 5 columns
Let’s transform the data into a feature matrix in which each column holds the number of US newborns given a name in a given year, normalized by the name’s peak value. We also apply a quality filter: a name must appear at least 10,000 times over the full time range, which removes noisy data from names that are barely used.

Optionally, we use the first few Fourier terms to better detect “bumps” and “waves” in the time series.
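As a purely illustrative aside (not part of the pipeline below, using made-up toy series), the power spectrum normalized by its zero-frequency term is exactly zero for a constant series and clearly non-zero for a series with a localised bump, which is why these extra features help separate such shapes:

# Toy illustration only: compare normalized power spectra of a flat series and a "bump"
flat = np.ones(114)            # constant popularity over 114 years
bump = np.ones(114)
bump[50:60] += 5.0             # a short-lived spike in popularity

for label, series in [('flat', flat), ('bump', bump)]:
    ps = np.square(np.abs(np.fft.fft(series)))
    ps_norm = ps / ps[0]       # normalize by the zero-frequency term
    print(label, np.round(ps_norm[1:4], 4))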
[4]:
%%time

WITH_FFT = True

all_years = np.unique(raw['Year'])
all_names = np.unique(raw['Name'])

# Accumulate names over states and genders
counts = raw.groupby(['Name', 'Year']).apply(lambda df: df['Count'].sum())

# Transform to a dataframe where names are labels, years are columns and counts are values (features)
years = [f'year_{i}' for i in all_years]
year_columns = pd.DataFrame(data=0.0, index=all_names, columns=years)
for name, year in counts.index:
    year_columns.loc[name, f'year_{year}'] = counts.loc[name, year]

# Account for total population changes
trend = year_columns.sum(axis=0)
detrended = year_columns / trend

# Normalise and filter
norm = detrended.apply(lambda column: column / detrended.max(axis=1))
filtered = norm[year_columns.sum(axis=1) >= 10_000]

if WITH_FFT:
    # Fourier-transform, normalize by zero frequency and keep the power spectrum of a few lowest frequencies
    power_spectrum = np.square(np.abs(np.fft.fft(filtered)))
    power_spectrum_norm = power_spectrum / power_spectrum[:, 0, None]
    power_spectrum_low_freq = power_spectrum_norm[:, 1:21]
    frequencies = [f'freq_{i}' for i in range(power_spectrum_low_freq.shape[1])]
    power = pd.DataFrame(data=power_spectrum_low_freq, index=filtered.index, columns=frequencies)
    # Concatenate time-series data and power spectrum
    final = pd.merge(filtered, power, left_index=True, right_index=True)
else:
    # Use time-series data only
    final = filtered

final
<timed exec>:7: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
CPU times: user 1min 53s, sys: 459 ms, total: 1min 53s
Wall time: 1min 53s
[4]:
 | year_1910 | year_1911 | year_1912 | year_1913 | year_1914 | year_1915 | year_1916 | year_1917 | year_1918 | year_1919 | ... | freq_10 | freq_11 | freq_12 | freq_13 | freq_14 | freq_15 | freq_16 | freq_17 | freq_18 | freq_19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aaliyah | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.006223 | 0.011881 | 0.012348 | 0.005589 | 0.000240 | 0.001929 | 0.005152 | 0.004159 | 0.000908 | 0.000148 |
Aaron | 0.045718 | 0.053744 | 0.062024 | 0.077323 | 0.074977 | 0.065559 | 0.066898 | 0.066008 | 0.065487 | 0.066251 | ... | 0.000102 | 0.000375 | 0.000008 | 0.000174 | 0.000008 | 0.000116 | 0.000323 | 0.000183 | 0.000018 | 0.000206 |
Abbey | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.003382 | 0.002492 | 0.000490 | 0.000341 | 0.000792 | 0.000629 | 0.000184 | 0.000041 | 0.000019 | 0.000142 |
Abbie | 0.359932 | 0.269787 | 0.306439 | 0.464589 | 0.215530 | 0.302400 | 0.349221 | 0.307588 | 0.197794 | 0.346314 | ... | 0.000872 | 0.000340 | 0.000209 | 0.000339 | 0.001921 | 0.000232 | 0.000076 | 0.000102 | 0.000040 | 0.000605 |
Abbigail | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.002642 | 0.001734 | 0.001141 | 0.001235 | 0.001421 | 0.001037 | 0.000383 | 0.000060 | 0.000073 | 0.000134 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Zelma | 0.932474 | 1.000000 | 0.729411 | 0.723820 | 0.652394 | 0.669620 | 0.600107 | 0.576397 | 0.536196 | 0.576366 | ... | 0.012419 | 0.010827 | 0.009236 | 0.007330 | 0.005842 | 0.005533 | 0.005881 | 0.005967 | 0.006053 | 0.006224 |
Zion | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.029102 | 0.015499 | 0.004003 | 0.004927 | 0.011328 | 0.011587 | 0.005323 | 0.001245 | 0.003695 | 0.007141 |
Zoe | 0.009243 | 0.000000 | 0.003225 | 0.000000 | 0.001845 | 0.002542 | 0.004274 | 0.004372 | 0.002133 | 0.001221 | ... | 0.007844 | 0.009167 | 0.006538 | 0.002683 | 0.000962 | 0.000918 | 0.001534 | 0.002198 | 0.002286 | 0.002146 |
Zoey | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.009403 | 0.011478 | 0.011382 | 0.009234 | 0.005897 | 0.002628 | 0.000721 | 0.000420 | 0.000919 | 0.001503 |
Zuri | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.049637 | 0.039396 | 0.035679 | 0.034023 | 0.030917 | 0.027702 | 0.022838 | 0.016293 | 0.011780 | 0.010174 |
2309 rows × 134 columns
Plotting function
[5]:
def basic_plot(idx):
    cols = [col.startswith('year') for col in final.columns]
    all_years = [int(col.removeprefix('year_')) for col in final.columns[cols]]
    if isinstance(idx, str):
        counts = final.loc[idx][cols]
        title = idx
    else:
        counts = final.iloc[idx][cols]
        title = final.iloc[idx].name
    plt.plot(all_years, counts.values)
    # plt.ylim(-0.1, 1.1)
    plt.title(title)
    plt.xlabel('Year')
    plt.ylabel('Count')
    plt.show()
We can now easily look at the evolution of a given name over the years
[6]:
print(len(final))
basic_plot('Anastasia')
basic_plot('Leo')
2309
Classical anomaly detection¶
[7]:
model = IsolationForest(random_seed=1, n_trees=1000)
model.fit(np.array(final))
scores = model.score_samples(np.array(final))
ordered_scores, ordered_index = zip(*sorted(zip(scores, final.index)))
print(f"Top 10 weirdest names : {ordered_index[:10]}")
print(f"Top 10 most regular names : {ordered_index[-10:]}")
Top 10 weirdest names : ('Manuel', 'Alfonso', 'Marshall', 'Vincent', 'Margarita', 'Byron', 'Ignacio', 'Benito', 'Anthony', 'Ruben')
Top 10 most regular names : ('Aleah', 'Kenzie', 'Camden', 'Jayla', 'Haven', 'Kian', 'Bryson', 'Teagan', 'Holden', 'Raegan')
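As an optional extra, we can look at the distribution of the scores. The list above is sorted in ascending order because lower scores mean more anomalous samples, so the left tail of the histogram corresponds to the “weird” names:

# Optional sketch: visualise the score distribution produced by the isolation forest
plt.hist(scores, bins=50)
plt.xlabel('Isolation forest score')
plt.ylabel('Number of names')
plt.show()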
Let’s have a look at the time series of the most regular and the weirdest names.
[8]:
for normal in ordered_index[-4:]:
basic_plot(normal)
[9]:
for weird in ordered_index[:4]:
basic_plot(weird)
It seems that the anomalies are either very localised peaks or very recently trending names.
Active Anomaly Detection¶
First, we need a callback function that will help us make decisions.
Comment out the dummy decision function and uncomment the interactive one¶
[10]:
# Comment
def help_decision(metadata, data, session):
    """Dummy, says YES to everything"""
    return Label.ANOMALY


### UNCOMMENT
# def help_decision(metadata, data, session):
#     """Plots data and asks expert interactively"""
#     basic_plot(metadata)
#     return prompt_decision_callback(metadata, data, session)
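If you prefer a fully scripted run, one option is to emulate the “recent growth” expert programmatically. The function below is only an example heuristic, not part of coniferest: the five-year window and the 0.5 threshold are arbitrary illustrative choices. It uses the same (metadata, data, session) signature as the callbacks above and labels a name as an anomaly when its counts are still close to the all-time peak in the last few years.

# Example heuristic only: a scripted stand-in for the "recent growth" expert
def help_decision(metadata, data, session):
    """Says YES when the last five years are close to the all-time peak"""
    year_cols = [c for c in final.columns if c.startswith('year')]
    recent_mean = final.loc[metadata, year_cols].values[-5:].mean()
    # The 0.5 threshold is an arbitrary illustration; tune it to taste
    return Label.ANOMALY if recent_mean > 0.5 else Label.REGULAR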
Let’s create a model and run a session.
Let’s run PineForest and say YES every time we see recent growth.
[11]:
model = PineForest(
    # Number of trees to use for predictions
    n_trees=256,
    # Number of new trees to grow for each decision
    n_spare_trees=768,
    # Fix random seed for reproducibility
    random_seed=0,
)
session = Session(
    data=final,
    metadata=final.index,
    model=model,
    decision_callback=help_decision,
    on_decision_callbacks=[
        TerminateAfter(10),
    ],
)
session.run()
[11]:
<coniferest.session.Session at 0x7f361c98f9d0>
Wow! Almost 100% of the selected names show the behavior we were looking for.
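To check what the actively trained forest has picked up, we can re-score the whole dataset with the trained model, just as we did with the plain isolation forest above. This is a small optional sketch; it relies on the fact that the session trains the very model instance we passed in, so the model variable can be reused directly.

# Re-score the data with the actively trained PineForest and list the current top names
active_scores = model.score_samples(np.array(final))
_, active_index = zip(*sorted(zip(active_scores, final.index)))
print(f"Top 10 names after the active session: {active_index[:10]}")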
Now we can run it again and say YES every time we see a sharp peak.
[12]:
model = PineForest(
    # Number of trees to use for predictions
    n_trees=256,
    # Number of new trees to grow for each decision
    n_spare_trees=768,
    # Fix random seed for reproducibility
    random_seed=0,
)
session = Session(
    data=final,
    metadata=final.index,
    model=model,
    decision_callback=help_decision,
    on_decision_callbacks=[
        TerminateAfter(20),
    ],
)
session.run()
[12]:
<coniferest.session.Session at 0x7f35fc540a50>
PineForest learns which profiles the user finds interesting and presents more objects like them.
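It can also be instructive to look back at the names we actually labelled during the session. The sketch below assumes that your coniferest version exposes the collected decisions as the known_labels attribute of Session (a mapping from sample index to the assigned Label); if it does not, the same information can be recorded manually inside the decision callback.

# Assumption: session.known_labels maps the positional sample index to the assigned Label
for idx, label in session.known_labels.items():
    print(final.index[idx], label)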
Try changing the PineForest hyperparameters and see how the results change: for example, different random seeds and numbers of trees.
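For instance, here is a small sketch that refits PineForest with a few different random seeds, without any active labels, and compares the resulting top-10 lists. We assume here that PineForest can be fitted without labels, in which case it behaves like a plain isolation forest.

# Sketch: compare static PineForest rankings for several random seeds
X = np.array(final)
for seed in (0, 1, 2):
    forest = PineForest(n_trees=256, n_spare_trees=768, random_seed=seed)
    forest.fit(X)
    top = [final.index[i] for i in np.argsort(forest.score_samples(X))[:10]]
    print(f"random_seed={seed}: {top}")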