Count Thresholder
Map infrequent categorical variables to a new/separate category. Input columns
to the CountThresholder must be of type int, string, dict, or list. For each
column in the input, the transformed output is a column where the input
category is retained as is if it has occurred at least threshold times in
the training data. Categories that do not satisfy this condition are set to
output_category_name.
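The rule above can be illustrated with a minimal pure-Python sketch. It is not GraphLab's implementation: the function name threshold_column is hypothetical, and only the parameter names threshold and output_category_name are taken from the text.

```python
from collections import Counter


def threshold_column(values, threshold, output_category_name=None):
    """Replace categories seen fewer than `threshold` times with a placeholder.

    Hypothetical helper for illustration only; not part of GraphLab.
    """
    counts = Counter(values)  # occurrences of each category in the column
    return [
        v if counts[v] >= threshold else output_category_name
        for v in values
    ]


column = [1, 2, 3, 2, 3]
result = threshold_column(column, threshold=2)
print(result)  # [None, 2, 3, 2, 3]
```

Here the value 1 occurs only once, so it is the only entry replaced by the placeholder (None by default in this sketch).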
The behaviour for different input data column types is as follows (see transform() for examples):
string : Strings are replaced with output_category_name if the threshold condition described above is not satisfied.
int : Behaves the same way as string. If output_category_name is of type string, then the entire column is cast to string.
list : Each value in the list is mapped in the same way as a string value.
dict : The key of the dictionary is treated as a namespace and the value is treated as a sub-category in the namespace. The categorical variable passed through the transformer is a combination of the namespace and the sub-category.
Use the threshold parameter to set the minimum number of occurrences a category needs in order to be preserved.
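The per-type behaviour described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not GraphLab's code: all three function names are hypothetical, repeated elements within one list are assumed to count once per appearance, and a dict entry is assumed to be counted as a (key, value) pair.

```python
from collections import Counter


def threshold_int_column(values, threshold, output_category_name=None):
    """Integers behave like strings; a string placeholder casts the column to str."""
    counts = Counter(values)
    kept = [v if counts[v] >= threshold else output_category_name for v in values]
    if isinstance(output_category_name, str):
        kept = [str(v) for v in kept]  # whole column becomes string-typed
    return kept


def threshold_list_column(rows, threshold, output_category_name=None):
    """Count every list element across all rows, then map each element."""
    counts = Counter(item for row in rows for item in row)
    return [
        [item if counts[item] >= threshold else output_category_name for item in row]
        for row in rows
    ]


def threshold_dict_column(rows, threshold, output_category_name=None):
    """Treat each (key, value) pair as one category: namespace plus sub-category."""
    counts = Counter((k, v) for row in rows for k, v in row.items())
    return [
        {k: (v if counts[(k, v)] >= threshold else output_category_name)
         for k, v in row.items()}
        for row in rows
    ]


ints = threshold_int_column([1, 2, 2], threshold=2, output_category_name='rare')
lists = threshold_list_column([['cat', 'mammal'], ['human', 'mammal']], threshold=2)
dicts = threshold_dict_column(
    [{'size': 'big', 'color': 'red'}, {'size': 'big', 'color': 'blue'}],
    threshold=2,
)
print(ints)   # ['rare', '2', '2']
print(lists)  # [[None, 'mammal'], [None, 'mammal']]
print(dicts)
```

Keying the dict counts on the (key, value) pair means the same value under two different keys is tracked as two separate categories, which matches the namespace description above.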
Introductory Example
import graphlab as gl
from graphlab.toolkits.feature_engineering import CountThresholder
# Create data.
sf = gl.SFrame({'a': [1,2,3], 'b' : [2,3,4]})
# Create a transformer.
count_tr = gl.feature_engineering.create(sf, CountThresholder(threshold = 1))
# Transform the data.
transformed_sf = count_tr.transform(sf)
# Save the transformer.
count_tr.save('save-path')
# Return the categories that are not discarded.
count_tr['categories']
Columns:
feature str
category str
Rows: 6
Data:
+---------+----------+
| feature | category |
+---------+----------+
| a | 1 |
| a | 2 |
| a | 3 |
| b | 2 |
| b | 3 |
| b | 4 |
+---------+----------+
[6 rows x 2 columns]
Fitting and transforming
Once a CountThresholder object is constructed, it must be fitted before the transform function can be called to generate the thresholded features.
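To show why fitting comes first, here is a minimal sketch of the fit/transform split. The class SimpleCountThresholder is a hypothetical stand-in written for this illustration, not the GraphLab class. It assumes that counts are learned only during fit and that a category never seen in training has a count of zero, so it is replaced.

```python
from collections import Counter


class SimpleCountThresholder:
    """Hypothetical stand-in illustrating the fit/transform split."""

    def __init__(self, threshold=1, output_category_name=None):
        self.threshold = threshold
        self.output_category_name = output_category_name
        self.kept_ = None  # set of retained categories, populated by fit()

    def fit(self, values):
        # Learn which categories occur often enough in the training data.
        counts = Counter(values)
        self.kept_ = {v for v, n in counts.items() if n >= self.threshold}
        return self

    def transform(self, values):
        # Apply the learned set; no counting happens here.
        if self.kept_ is None:
            raise RuntimeError("call fit() before transform()")
        return [v if v in self.kept_ else self.output_category_name for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


tr = SimpleCountThresholder(threshold=2).fit(['a', 'b', 'a', 'c', 'c'])
new_data = tr.transform(['a', 'b', 'd', 'c'])
print(new_data)  # ['a', None, None, 'c']
```

In this sketch 'b' is dropped because it appeared only once during fitting, and 'd' is dropped because it was never seen at all.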
# String/Integer columns
# ----------------------------------------------------------------------
sf = gl.SFrame({'a' : [1,2,3,2,3], 'b' : [2,3,4,2,3]})
# Set all categories that did not occur at least 2 times to None.
count_tr = gl.feature_engineering.CountThresholder(threshold = 2)
# Fit and transform on the same data.
transformed_sf = count_tr.fit_transform(sf)
Columns:
a int
b int
Rows: 5
Data:
+-------+--------+
| a | b |
+-------+--------+
| None | 2 |
| 2 | 3 |
| 3 | None |
| 2 | 2 |
| 3 | 3 |
+-------+--------+
[5 rows x 2 columns]
# Lists can be used to encode sets of categories for each example.
# ----------------------------------------------------------------------
sf = gl.SFrame({'categories': [['cat', 'mammal'],
['cat', 'mammal'],
['human', 'mammal'],
['seahawk', 'bird'],
['duck', 'bird'],
['seahawk', 'bird']]})
# Construct and fit.
from graphlab.toolkits.feature_engineering import CountThresholder
count_tr = gl.feature_engineering.create(sf, CountThresholder(threshold = 2))
# Transform the data
transformed_sf = count_tr.transform(sf)
Columns:
categories list
Rows: 6
Data:
+-----------------+
| categories |
+-----------------+
| [cat, mammal] |
| [cat, mammal] |
| [None, mammal] |
| [seahawk, bird] |
| [None, bird] |
| [seahawk, bird] |
+-----------------+
[6 rows x 1 columns]
# Dictionaries can be used for namespaces & sub-categories.
# ----------------------------------------------------------------------
sf = gl.SFrame({'attributes':
[{'height':'tall', 'age': 'senior', 'weight': 'thin'},
{'height':'short', 'age': 'child', 'weight': 'thin'},
{'height':'giant', 'age': 'adult', 'weight': 'fat'},
{'height':'short', 'age': 'child', 'weight': 'thin'},
{'height':'tall', 'age': 'child', 'weight': 'fat'}]})
# Construct and fit.
from graphlab.toolkits.feature_engineering import CountThresholder
count_tr = gl.feature_engineering.create(sf,
CountThresholder(threshold = 2))
# Transform the data
transformed_sf = count_tr.transform(sf)
Columns:
attributes dict
Rows: 5
Data:
+-------------------------------+
| attributes |
+-------------------------------+
| {'age': None, 'weight': 't... |
| {'age': 'child', 'weight':... |
| {'age': None, 'weight': No... |
| {'age': 'child', 'weight':... |
| {'age': 'child', 'weight':... |
+-------------------------------+