BM25

The BM25 function scores each document in a corpus according to the document's relevance to a particular text query. For a query with terms $$q_1, \ldots, q_n$$, the BM25 score for document $$d$$ is:

$$ \mbox{BM25}(d) = \sum_{i=1}^n \mbox{IDF}(q_i) \frac{f(q_i) (k_1 + 1)}{f(q_i) + k_1 (1 - b + b |D| / d_{avg})} $$

where:

  • $$f(q_i)$$ is the number of times term $$q_i$$ occurs in document $$d$$,
  • $$|D|$$ is the number of words in document $$d$$,
  • $$d_{avg}$$ is the average number of words per document,
  • $$b$$ and $$k_1$$ are free parameters for Okapi BM25.

The first quantity in the sum is the inverse document frequency. For a corpus with $$N$$ documents, inverse document frequency for term $$q_i$$ is:

$$ \mbox{IDF}(q_i) = \log \frac{N - N(q_i) + 0.5}{N(q_i) + 0.5} $$

where $$N(q_i)$$ is the number of documents in the corpus that contain term $$q_i$$.

The transformed output is a column of type float with the BM25 score for each document. For more details on the BM25 score, see http://en.wikipedia.org/wiki/Okapi_BM25.
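As a concrete illustration, the two formulas above can be sketched in plain Python. This is a minimal sketch, not the toolkit's implementation; the free parameters $$k_1 = 1.5$$ and $$b = 0.75$$ are common Okapi defaults and an assumption here, though they appear to reproduce the scores in the examples below.

```python
import math

# Common Okapi defaults (assumption: the toolkit's defaults may differ).
K1 = 1.5
B = 0.75

def idf(n_docs, doc_freq):
    # Inverse document frequency with the +0.5 smoothing from the formula above.
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25(query, doc, corpus, k1=K1, b=B):
    """Score one bag-of-words dict `doc` against a list of `query` terms.

    `corpus` is a list of bag-of-words dicts; `doc` should be one of them.
    """
    n_docs = len(corpus)
    avg_len = sum(sum(d.values()) for d in corpus) / float(n_docs)
    doc_len = sum(doc.values())  # |D| in the formula above
    score = 0.0
    for term in query:
        f = doc.get(term, 0)  # f(q_i): term count in this document
        doc_freq = sum(1 for d in corpus if term in d)  # N(q_i)
        score += idf(n_docs, doc_freq) * f * (k1 + 1) / (
            f + k1 * (1 - b + b * doc_len / avg_len))
    return score
```

A document containing none of the query terms contributes zero for every term, so its score is 0.0, as in the third row of the example output below.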

The behavior of BM25 for different input data column types is as follows:

  • dict : Each (key, value) pair is treated as the count associated with the key for this row. A common example is to have a dict element contain a bag-of-words representation of a document, where each key is a word and each value is the number of times that word occurs in the document. All non-numeric values are ignored.

  • list : The list is converted to bag-of-words format, where the keys are the unique elements in the list and the values are the counts of those elements. After this step, the behavior is identical to dict.

  • string : Behaves identically to a dict, where the dictionary is generated by converting the string into a bag-of-words format. For example, "I really like really fluffy dogs" would get converted to {'I': 1, 'really': 2, 'like': 1, 'fluffy': 1, 'dogs': 1}.
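Since both list and string inputs reduce to the dict case, the conversion can be sketched with a hypothetical helper like the one below. This sketch splits strings on whitespace only; the toolkit's own tokenization may differ.

```python
from collections import Counter

def to_bag_of_words(value):
    # Normalize list or string input into the dict form that BM25 scores.
    # (Assumption: this splits strings on whitespace only, with no
    # punctuation or case handling.)
    if isinstance(value, str):
        value = value.split()
    if isinstance(value, list):
        return dict(Counter(value))
    if isinstance(value, dict):
        # Drop non-numeric values, mirroring the dict behavior above.
        return {k: v for k, v in value.items()
                if isinstance(v, (int, float))}
    raise TypeError('unsupported input type: %r' % type(value))
```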

Introductory Example

import graphlab as gl

# Create data.
sf = gl.SFrame({'docs': [{'this': 1, 'is': 1, 'a': 2, 'sample': 1},
                         {'this': 1, 'is': 1, 'another': 2, 'example': 3},
                         {'final': 1, 'doc': 1, 'here': 2}]})

# Create a query set.
query = ['a','query','example']

# Create a BM25 encoder.
from graphlab.toolkits.feature_engineering import BM25
encoder = gl.feature_engineering.create(dataset = sf, transformers = BM25(feature = 'docs', query = query))

# Transform the data.
transformed_sf = encoder.transform(sf)
Data:
+----------------+
|      docs      |
+----------------+
| 0.744711615513 |
| 0.789682123696 |
|      0.0       |
+----------------+
[3 rows x 1 columns]
# Save the transformer.
encoder.save('save-path')

# Return the document frequencies of the query terms.
encoder['document_frequencies']
Data:
+----------------+---------+--------------------+
| feature_column |   term  | document_frequency |
+----------------+---------+--------------------+
|      docs      |    a    |         1          |
|      docs      | example |         1          |
+----------------+---------+--------------------+
[2 rows x 3 columns]

Fitting and Transforming

import graphlab as gl

# Dictionary Input:
sf = gl.SFrame({'docs': [{'this': 1, 'is': 1, 'a': 2, 'sample': 1},
                         {'this': 1, 'is': 1, 'another': 2, 'example': 3},
                         {'final': 1, 'doc': 1, 'here': 2}]})
# Create a query set.
query = ['a','query','example']
encoder = gl.feature_engineering.BM25(feature = 'docs', query = query)
encoder = encoder.fit(data = sf)
transformed_sf = encoder.transform(data = sf)
transformed_sf
+----------------+
|      docs      |
+----------------+
| 0.744711615513 |
| 0.789682123696 |
|      0.0       |
+----------------+
[3 rows x 1 columns]
# List Input:
l1 = ['this', 'is', 'a', 'a', 'sample']
l2 = ['this', 'is', 'another', 'another', 'example', 'example', 'example']
l3 = ['final', 'doc', 'here', 'here']
sf = gl.SFrame({'docs' : [l1,l2,l3]})

# Create a query set.
query = ['a','query','example']
encoder = gl.feature_engineering.BM25(feature = 'docs', query = query)
encoder = encoder.fit(data = sf)
transformed_sf = encoder.transform(data = sf)
transformed_sf
+----------------+
|      docs      |
+----------------+
| 0.744711615513 |
| 0.789682123696 |
|      0.0       |
+----------------+
[3 rows x 1 columns]
# String Input:
s1 = 'this is a a sample'
s2 = 'this is another another example example example'
s3 = 'final doc here here'
sf = gl.SFrame({'docs' : [s1,s2,s3]})

# Create a query set.
query = ['a','query','example']
encoder = gl.feature_engineering.BM25(feature = 'docs', query = query)
encoder = encoder.fit(data = sf)
transformed_sf = encoder.transform(data = sf)
transformed_sf
+----------------+
|      docs      |
+----------------+
| 0.744711615513 |
| 0.789682123696 |
|      0.0       |
+----------------+
[3 rows x 1 columns]