
US baby names dataset

This notebook gives an example of Active Anomaly Detection with coniferest and the US baby names dataset.

Developers of coniferest:
- Matwey Kornilov (MSU)
- Vladimir Korolev
- Konstantin Malanchev (LINCC Frameworks / CMU), notebook author

The tutorial is co-authored by Etienne Russeil (LPC)

Run this NB in Google Colab

Install and import modules

%pip install coniferest
%pip install pandas
%pip install requests
import zipfile
from io import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from coniferest.isoforest import IsolationForest
from coniferest.pineforest import PineForest
from coniferest.session import Session
from coniferest.session.callback import TerminateAfter, prompt_decision_callback, Label

Data preparation

Download the data and combine it into a single data frame


URL = ''

response = requests.get(URL)
if not response.ok:
    raise RuntimeError('Cannot download the file')

dfs = []
with zipfile.ZipFile(BytesIO(response.content)) as zip_ref:
    for filename in zip_ref.namelist():
        # Skip everything in the archive except the text data files
        if not filename.lower().endswith('.txt'):
            continue
        with zip_ref.open(filename) as f:
            df = pd.read_csv(f, header=None, names=['State', 'Gender', 'Year', 'Name', 'Count'])
        dfs.append(df)
raw = pd.concat(dfs, axis=0, ignore_index=True)

State Gender Year Name Count
0 AK F 1910 Mary 14
1 AK F 1910 Annie 12
2 AK F 1910 Anna 10
3 AK F 1910 Margaret 8
4 AK F 1910 Helen 7
... ... ... ... ... ...
6408036 WY M 2022 Lane 5
6408037 WY M 2022 Michael 5
6408038 WY M 2022 Nicholas 5
6408039 WY M 2022 River 5
6408040 WY M 2022 Silas 5

6408041 rows × 5 columns

Let’s transform the data into a feature matrix where each row is a name and each column is the share of US babies given that name in a given year, normalized by the name’s peak value. We also apply a quality filter: a name must appear at least 10,000 times over the full time range, which removes noisy series from names that are barely used.
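To make the two normalisation steps concrete, here is a minimal sketch on a toy table. The names and counts are made up for illustration, and `toy_counts`, `toy_share` and `toy_norm` are hypothetical variable names, not part of the notebook.

```python
import pandas as pd

# Toy counts: rows are names, columns are years (hypothetical numbers).
toy_counts = pd.DataFrame(
    {"year_2000": [10.0, 30.0], "year_2001": [40.0, 40.0], "year_2002": [50.0, 150.0]},
    index=["Ada", "Bob"],
)

# Step 1: divide each year (column) by the total count in that year,
# so that overall population changes do not dominate the signal.
toy_share = toy_counts / toy_counts.sum(axis=0)

# Step 2: divide each name (row) by its own peak share, so every row peaks at 1.
toy_norm = toy_share.div(toy_share.max(axis=1), axis=0)

print(toy_norm.round(3))
```

Note that the raw counts for "Ada" grow every year, yet the normalised series peaks in the middle year, because the total number of counted babies grows even faster.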

Optionally, we use the first few Fourier terms to better detect “bumps” and “waves” in the time series.
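As an illustration of why low-frequency Fourier power helps, the sketch below builds two synthetic series, a single slow bump and a faster wave, and looks at which low frequency carries the most power. The series shapes and the choice of ten frequencies are assumptions made for this example only.

```python
import numpy as np

years = np.arange(100)
# Two hypothetical normalised series: a slow single bump and a wave with 5 cycles.
bump = np.exp(-0.5 * ((years - 50) / 15.0) ** 2)
wave = 0.5 + 0.5 * np.sin(2 * np.pi * 5 * years / years.size)
series = np.vstack([bump, wave])

# Power spectrum along the time axis, normalised by the zero-frequency term.
power = np.abs(np.fft.fft(series, axis=1)) ** 2
power_norm = power / power[:, 0, None]

# Keep a few of the lowest non-zero frequencies as extra features.
low_freq = power_norm[:, 1:11]

# Index (starting from 1) of the strongest low frequency for each series.
dominant = low_freq.argmax(axis=1) + 1
print(dominant)
```

The bump concentrates its power in the very first frequency, while the wave peaks at the frequency matching its number of cycles, so the two shapes end up far apart in feature space.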



all_years = np.unique(raw['Year'])
all_names = np.unique(raw['Name'])

# Accumulate names over states and genders
counts = raw.groupby(['Name', 'Year'])['Count'].sum()

# Transform to a dataframe where names are labels, years are columns and counts are values (features)
years = [f'year_{i}' for i in all_years]
year_columns = pd.DataFrame(data=0.0, index=all_names, columns=years)
for name, year in counts.index:
    year_columns.loc[name, f'year_{year}'] = counts.loc[name, year]

# Account for total population changes
trend = year_columns.sum(axis=0)
detrended = year_columns / trend

# Normalise and filter
norm = detrended.apply(lambda column: column / detrended.max(axis=1))
filtered = norm[year_columns.sum(axis=1) >= 10_000]

# Set to False to use the time-series data only.
# NOTE: the flag name is an assumption; the original switch was lost from this cell.
use_fourier = True

if use_fourier:
    # Fourier-transform, normalize by zero frequency and get the power spectrum for a few lowest frequencies
    power_spectrum = np.square(np.abs(np.fft.fft(filtered)))
    power_spectrum_norm = power_spectrum / power_spectrum[:, 0, None]
    power_spectrum_low_freq = power_spectrum_norm[:, 1:21]
    frequencies = [f'freq_{i}' for i in range(power_spectrum_low_freq.shape[1])]
    power = pd.DataFrame(data=power_spectrum_low_freq, index=filtered.index, columns=frequencies)

    # Concatenate time-series data and power spectrum
    final = pd.merge(filtered, power, left_index=True, right_index=True)
else:
    # Use time-series data only
    final = filtered

year_1910 year_1911 year_1912 year_1913 year_1914 year_1915 year_1916 year_1917 year_1918 year_1919 ... freq_10 freq_11 freq_12 freq_13 freq_14 freq_15 freq_16 freq_17 freq_18 freq_19
Aaliyah 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.003798 0.010392 0.014046 0.009058 0.001551 0.000523 0.004251 0.005545 0.002679 0.000158
Aaron 0.045713 0.053738 0.062017 0.077314 0.074968 0.065551 0.066889 0.066000 0.065479 0.066243 ... 0.000027 0.000414 0.000062 0.000149 0.000064 0.000026 0.000249 0.000294 0.000018 0.000106
Abbey 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.003482 0.002272 0.000375 0.000402 0.000803 0.000559 0.000138 0.000031 0.000025 0.000196
Abbie 0.359906 0.269767 0.306417 0.464556 0.215515 0.302378 0.349195 0.307566 0.197780 0.346289 ... 0.001223 0.000129 0.000258 0.000432 0.001641 0.000377 0.000124 0.000077 0.000007 0.000592
Abbigail 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.002585 0.001688 0.001163 0.001316 0.001434 0.000935 0.000288 0.000048 0.000093 0.000134
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Zelma 0.932472 1.000000 0.729412 0.723820 0.652394 0.669620 0.600104 0.576395 0.536197 0.576365 ... 0.012243 0.010688 0.009015 0.007108 0.005727 0.005567 0.005932 0.005930 0.006165 0.005992
Zion 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.034877 0.024488 0.009058 0.003736 0.009423 0.014325 0.010923 0.003766 0.001656 0.005069
Zoe 0.009243 0.000000 0.003225 0.000000 0.001845 0.002542 0.004274 0.004372 0.002133 0.001221 ... 0.005160 0.008107 0.007699 0.004506 0.002055 0.000953 0.000763 0.001230 0.001576 0.001706
Zoey 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.007129 0.010408 0.012235 0.011975 0.009734 0.006270 0.003117 0.001407 0.000990 0.001205
Zuri 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.056581 0.044149 0.038412 0.037009 0.035146 0.033844 0.031390 0.025277 0.018562 0.013838

2280 rows × 133 columns
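The explicit loop over (name, year) pairs used to build the feature matrix can be slow on millions of rows. A possible vectorised alternative, sketched here on a miniature stand-in for the raw table (`toy_raw` and its values are hypothetical), is to aggregate and pivot in one chain:

```python
import pandas as pd

# Hypothetical miniature of the raw table.
toy_raw = pd.DataFrame(
    {
        "State": ["AK", "WY", "AK", "AK"],
        "Gender": ["F", "F", "F", "M"],
        "Year": [1910, 1910, 1911, 1911],
        "Name": ["Mary", "Mary", "Mary", "John"],
        "Count": [14, 6, 9, 5],
    }
)

# Sum over states and genders, then move years into columns in one step.
toy_wide = (
    toy_raw.groupby(["Name", "Year"])["Count"]
    .sum()
    .unstack(fill_value=0)   # missing (name, year) pairs become 0
    .add_prefix("year_")     # 1910 -> "year_1910"
    .astype(float)
)
print(toy_wide)
```

This should give the same layout as the loop: names as the index, one `year_*` column per year, and zeros where a name was not recorded.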

Plotting function

def basic_plot(idx):
    """Plot the yearly time series of a name given by label (str) or by row position (int)."""
    cols = [col.startswith('year') for col in final.columns]
    all_years = [int(col.removeprefix('year_')) for col in final.columns[cols]]

    if isinstance(idx, str):
        counts = final.loc[idx][cols]
        title = idx
    else:
        counts = final.iloc[idx][cols]
        title = final.iloc[idx].name

    plt.plot(all_years, counts.values)
    # plt.ylim(-0.1, 1.1)
    plt.title(title)
    plt.show()

We can now easily look at the evolution of a given name over the years


Classical anomaly detection

model = IsolationForest(random_seed=1, n_trees=1000)
# The forest must be built on the data before it can score samples
model.fit(np.array(final))
scores = model.score_samples(np.array(final))
ordered_scores, ordered_index = zip(*sorted(zip(scores, final.index)))

print(f"Top 10 weirdest names : {ordered_index[:10]}")
print(f"Top 10 most regular names : {ordered_index[-10:]}")
Top 10 weirdest names : ('Manuel', 'Alfonso', 'Marshall', 'Vincent', 'Margarita', 'Byron', 'Anthony', 'Benito', 'Rudy', 'Ignacio')
Top 10 most regular names : ('Annalise', 'Aubrie', 'Fernanda', 'Holden', 'Rylan', 'Jayla', 'Teagan', 'Kailyn', 'Bryson', 'Raegan')
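The ranking above relies on the convention that lower scores mean more anomalous samples. An equivalent way to obtain the ordering, sketched here with made-up names and scores, is `np.argsort`, which avoids zipping and unzipping Python tuples:

```python
import numpy as np

# Hypothetical names and scores; lower means more anomalous, as in the cell above.
toy_names = np.array(["Ada", "Bob", "Cy", "Dee"])
toy_scores = np.array([-0.40, -0.75, -0.42, -0.55])

# Ascending sort: the most anomalous samples come first.
order = np.argsort(toy_scores)
weirdest = toy_names[order[:2]]
most_regular = toy_names[order[-2:]]

print("weirdest:", weirdest)
print("most regular:", most_regular)
```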

Let’s have a look at their distributions

for normal in ordered_index[-4:]:
    basic_plot(normal)
for weird in ordered_index[:4]:
    basic_plot(weird)

It seems that the anomalies are either names with a very localised peak or names that started trending very recently.

Active Anomaly Detection

First, we need a function that helps us make a decision.

To label candidates yourself, comment out the dummy decision function and uncomment the interactive one.

# Comment out this dummy function to label interactively
def help_decision(metadata, data, session):
    """Dummy, says YES to everything"""
    return Label.ANOMALY

# def help_decision(metadata, data, session):
#     """Plots data and asks expert interactively"""
#     basic_plot(metadata)
#     return prompt_decision_callback(metadata, data, session)
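Between the dummy and the interactive function there is a middle ground: a rule that imitates the expert. The sketch below is one possible heuristic for "recent growth"; the function name `looks_like_recent_growth` and both thresholds are hypothetical. It returns a plain boolean and operates on the yearly columns only; a decision function could call it and return `Label.ANOMALY` when it is true.

```python
import numpy as np

def looks_like_recent_growth(series, recent=10, threshold=0.5):
    """Return True if the series is high in its last `recent` points and low before them.

    `recent` and `threshold` are illustrative values, not tuned on the real data.
    """
    series = np.asarray(series, dtype=float)
    recent_mean = series[-recent:].mean()
    earlier_max = series[:-recent].max()
    return bool(recent_mean > threshold and earlier_max < threshold)

years = np.arange(1910, 2023)
# Hypothetical peak-normalised series: flat at zero, then rising to 1 by 2020.
rising = np.clip((years - 2005) / 15.0, 0.0, 1.0)
# Hypothetical name that has been popular throughout.
classic = np.full(years.size, 0.9)

print(looks_like_recent_growth(rising), looks_like_recent_growth(classic))
```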

Let’s create a model and run a session

Let’s run PineForest and say YES every time we see recent growth

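model = PineForest(
    # Number of trees to use for predictions
    n_trees=256,  # value lost in the source; assumed
    # Number of new trees to grow for each decision
    n_spare_trees=768,  # value lost in the source; assumed
    # Fix random seed for reproducibility
    random_seed=0,  # value lost in the source; assumed
)
# Session arguments are reconstructed from context; the budget of 20 decisions is an assumption
session = Session(data=final, metadata=final.index, model=model,
                  decision_callback=help_decision, on_decision_callbacks=[TerminateAfter(20)])
session.run()
<coniferest.session.Session at 0x7ff7ea42aa90>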

Wow! Almost 100% of the candidates show the behavior we were looking for.

Now we can run it again and say YES every time we see a sharp peak

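model = PineForest(
    # Number of trees to use for predictions
    n_trees=256,  # value lost in the source; assumed
    # Number of new trees to grow for each decision
    n_spare_trees=768,  # value lost in the source; assumed
    # Fix random seed for reproducibility
    random_seed=0,  # value lost in the source; assumed
)
# Session arguments are reconstructed from context; the budget of 20 decisions is an assumption
session = Session(data=final, metadata=final.index, model=model,
                  decision_callback=help_decision, on_decision_callbacks=[TerminateAfter(20)])
session.run()
<coniferest.session.Session at 0x7ff844156610>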

PineForest learns which profile the user finds interesting and surfaces more examples of it.
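The "sharp peak" profile from the second run can also be written down as a simple rule, which is handy for checking what the expert answers would look like. This is only a sketch: the function name `looks_like_sharp_peak`, the half-maximum level and the width limit are all hypothetical choices.

```python
import numpy as np

def looks_like_sharp_peak(series, max_width=10, level=0.5):
    """Return True if at most `max_width` points reach `level` times the maximum.

    The check counts points anywhere in the series; it does not require them to be contiguous.
    """
    series = np.asarray(series, dtype=float)
    above = series >= level * series.max()
    return bool(above.sum() <= max_width)

years = np.arange(1910, 2023)
# Hypothetical series: a narrow bump and a wide bump, both centred on 1960.
spike = np.exp(-0.5 * ((years - 1960) / 2.0) ** 2)
broad = np.exp(-0.5 * ((years - 1960) / 30.0) ** 2)

print(looks_like_sharp_peak(spike), looks_like_sharp_peak(broad))
```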

Try changing the PineForest hyperparameters and see how the results change. Try different random seeds and numbers of trees.

