TF-IDF

The prototypical application of the TF-IDF transformation is a document collection in which each element represents a document in bag-of-words format, i.e. a dictionary whose keys are words and whose values are the number of times each word occurs in the document. See the reference section for further reading.

The TF-IDF transformation performs the following computation:

$$ \mbox{TF-IDF}(w, d) = tf(w, d) \cdot \log(N / f(w)) $$

where $$tf(w, d)$$ is the number of times word $$w$$ appeared in document $$d$$, $$f(w)$$ is the number of documents word $$w$$ appeared in, $$N$$ is the number of documents, and we use the natural logarithm.
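The formula above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the GraphLab implementation: the function name `tf_idf` and its list-of-dicts input are assumptions made for the example.

```python
import math

def tf_idf(docs):
    """Apply TF-IDF(w, d) = tf(w, d) * log(N / f(w)) to bag-of-words dicts."""
    n_docs = len(docs)  # N: number of documents

    # f(w): number of documents in which word w appears.
    doc_freq = {}
    for doc in docs:
        for word in doc:
            doc_freq[word] = doc_freq.get(word, 0) + 1

    # tf(w, d) is the count stored in each document's dictionary.
    return [
        {word: count * math.log(n_docs / doc_freq[word])
         for word, count in doc.items()}
        for doc in docs
    ]

docs = [{'a': 1, 'good': 1, 'example': 1},
        {'a': 1, 'better': 1, 'example': 1}]
scores = tf_idf(docs)
```

Words that occur in every document, such as 'a' here, receive a score of 0 because log(N / N) = 0.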

The transformed output is a column of type dict, where each key is a word and each value is the TF-IDF score of that word for the row.

The behavior of TF-IDF for each supported input column type is as follows.

  • dict: Each (key, value) pair is treated as a count associated with the key for this row. A common example is a dict element that contains a bag-of-words representation of a document, where each key is a word and each value is the number of times that word occurs in the document. All non-numeric values are ignored.

  • list: The list is converted to bag-of-words format, where the keys are the unique elements in the list and the values are the counts of those unique elements. After this step, the behavior is identical to dict.

  • string: Behaves identically to a dict, where the dictionary is generated by converting the string into bag-of-words format. For example, 'I really like really fluffy dogs' would get converted to {'I': 1, 'really': 2, 'like': 1, 'fluffy': 1, 'dogs': 1}.
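The three input types described above can be reduced to a single bag-of-words form, roughly as follows. This is a sketch in plain Python: the helper name `to_bag_of_words` is hypothetical, and splitting strings on whitespace is an assumption, since the exact tokenization rule is not stated here.

```python
from collections import Counter

def to_bag_of_words(value):
    """Normalize a dict, list, or string into a {term: count} dictionary."""
    if isinstance(value, dict):
        # Keep only numeric values; non-numeric values are ignored.
        return {k: v for k, v in value.items() if isinstance(v, (int, float))}
    if isinstance(value, str):
        # Assumption: tokens are separated by whitespace.
        value = value.split()
    # Lists (and tokenized strings) are counted element by element.
    return dict(Counter(value))

bag = to_bag_of_words('I really like really fluffy dogs')
```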

Introductory Example

import graphlab as gl

# Create data.
sf = gl.SFrame({'a': ['1','2','3'], 'b' : [2,3,4]})

# Create a TF-IDF transformer.
from graphlab.toolkits.feature_engineering import TFIDF
encoder = gl.feature_engineering.create(sf, TFIDF('a'))

# Transform the data.
transformed_sf = encoder.transform(sf)
Columns:
      a dict
      b int

Rows: 3

Data:
+---------------------------+---+
|             a             | b |
+---------------------------+---+
| {'1': 1.0986122886681098} | 2 |
| {'2': 1.0986122886681098} | 3 |
| {'3': 1.0986122886681098} | 4 |
+---------------------------+---+
[3 rows x 2 columns]
# Save the transformer.
encoder.save('save-path')

# Return the document frequency of each term.
encoder['document_frequencies']
Columns:
    feature_column  str
    term    str
    document_frequency  str

Rows: 3

Data:
+----------------+------+--------------------+
| feature_column | term | document_frequency |
+----------------+------+--------------------+
|       a        |  1   |         1          |
|       a        |  2   |         1          |
|       a        |  3   |         1          |
+----------------+------+--------------------+
[3 rows x 3 columns]
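As a quick check, the scores in the introductory example can be reproduced from this document frequency table. The snippet below is plain Python rather than GraphLab code. The variable names are illustrative, N = 3 is taken from the three rows of the example, and it assumes the string '1' becomes the bag of words {'1': 1}.

```python
import math

n_docs = 3  # N: three rows in the example SFrame
document_frequency = {'1': 1, '2': 1, '3': 1}  # values from the table above

# First row: the string '1' contains the term '1' once.
term_counts = {'1': 1}

scores = {
    term: count * math.log(n_docs / document_frequency[term])
    for term, count in term_counts.items()
}
```

Each term occurs once, in exactly one of the three documents, so every score equals log(3), which matches the value of about 1.0986 shown in the transformed column.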

Fitting and Transforming

# For list columns:

l1 = ['a','good','example']
l2 = ['a','better','example']
sf = gl.SFrame({'a' : [l1,l2]})
tfidf = gl.feature_engineering.TFIDF('a')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
        a   dict

Rows: 2

Data:
+-------------------------------+
|               a               |
+-------------------------------+
| {'a': 0.0, 'good': 0.69314... |
| {'better': 0.6931471805599... |
+-------------------------------+
[2 rows x 1 columns]
# For string columns:

sf = gl.SFrame({'a' : ['a good example', 'a better example']})
tfidf = gl.feature_engineering.TFIDF('a')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
        a   dict

Rows: 2

Data:
+-------------------------------+
|               a               |
+-------------------------------+
| {'a': 0.0, 'good': 0.69314... |
| {'better': 0.6931471805599... |
+-------------------------------+
[2 rows x 1 columns]
# For dictionary columns:
sf = gl.SFrame(
    {'docs': [{'this': 1, 'is': 1, 'a': 2, 'sample': 1},
              {'this': 1, 'is': 1, 'another': 2, 'example': 3}]})
tfidf = gl.feature_engineering.TFIDF('docs')
fit_tfidf = tfidf.fit(sf)
transformed_sf = fit_tfidf.transform(sf)
Columns:
    docs  dict

Rows: 2

Data:
+-------------------------------+
|              docs             |
+-------------------------------+
| {'this': 0.0, 'a': 1.38629... |
| {'this': 0.0, 'is': 0.0, '... |
+-------------------------------+
[2 rows x 1 columns]
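The dictionary example can also be worked by hand. With $$N = 2$$, the words 'this' and 'is' occur in both documents, so $$f(w) = 2$$, while every other word occurs in only one document, so $$f(w) = 1$$. Assuming the transformation applies the formula exactly as stated, and writing $$d_1$$ and $$d_2$$ for the first and second rows (labels introduced here for illustration), a few of the scores would be:

$$
\begin{aligned}
\mbox{TF-IDF}(\mbox{this}, d_1) &= 1 \cdot \log(2 / 2) = 0 \\
\mbox{TF-IDF}(\mbox{a}, d_1) &= 2 \cdot \log(2 / 1) \approx 1.3863 \\
\mbox{TF-IDF}(\mbox{example}, d_2) &= 3 \cdot \log(2 / 1) \approx 2.0794
\end{aligned}
$$

The values for 'this' and 'a' agree with the visible part of the output above. The value for 'example' is derived from the formula only, since that part of the output is truncated.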