Exercises

In the code block below, import the StackOverflow dataset SFrame that you saved during earlier exercises. Note that this data is shared courtesy of StackExchange and is under the Creative Commons Attribution-ShareAlike 3.0 Unported License. This particular version of the data set was used in a recent Kaggle competition.

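import os
import graphlab

if os.path.exists('stack_overflow'):
    # Reuse the local copy saved during the earlier exercises.
    sf = graphlab.SFrame('stack_overflow')
else:
    # Otherwise download the dataset and cache it locally.
    sf = graphlab.SFrame('http://s3.amazonaws.com/dato-datasets/stack_overflow')
    sf.save('stack_overflow')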

Question 1: Visually explore the above data using GraphLab Canvas.

sf.show()

In this section we will make a model that can be used to recommend new tags to users.

Question 2: Create a new column called Tags where each element is a list of all the tags used for that question. (Hint: Check out sf.pack_columns.)

sf = sf.pack_columns(column_prefix='Tag', new_column_name='Tags')

Question 3: Make your SFrame only contain the OwnerUserId column and the Tags column you created in the previous step.

sf = sf[['OwnerUserId', 'Tags']]

Question 4: Use the following Python function to remove the empty strings from every list in the Tags column.

def remove_empty(tags):
    return [tag for tag in tags if tag != '']
sf['Tags'] = sf['Tags'].apply(remove_empty)

Question 5: Create a new SFrame called user_tag that has a row for every (user, tag) pair. (Hint: See sf.stack.)

user_tag = sf.stack(column_name='Tags', new_column_name='Tag')

Question 6: Create a new SFrame called user_tag_count that has three columns:

- `OwnerUserId`
- `Tag`
- `Count`

where Count contains the number of times the given Tag was used by that particular OwnerUserId. (Hint: See groupby.)

# Name the output column explicitly so that later steps can refer to 'Count'.
user_tag_count = user_tag.groupby(['OwnerUserId', 'Tag'],
                                  {'Count': graphlab.aggregate.COUNT()})
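
To make the data flow of Questions 2 through 6 concrete, here is a minimal sketch of the same pack, clean, stack, and count steps in plain Python. It does not use GraphLab Create; the toy rows and the helper `pack_tags` are hypothetical stand-ins for the real SFrame and for `pack_columns`.

```python
from collections import Counter

# Hypothetical toy rows standing in for the StackOverflow SFrame.
rows = [
    {"OwnerUserId": 1, "Tag1": "python", "Tag2": "pandas", "Tag3": ""},
    {"OwnerUserId": 1, "Tag1": "python", "Tag2": "", "Tag3": ""},
    {"OwnerUserId": 2, "Tag1": "javascript", "Tag2": "jquery", "Tag3": "html"},
]

def pack_tags(row, prefix="Tag"):
    """Collect every column whose name starts with `prefix` into one list."""
    return [value for key, value in sorted(row.items()) if key.startswith(prefix)]

def remove_empty(tags):
    """Drop empty strings, as in Question 4."""
    return [tag for tag in tags if tag != ""]

# Pack the tag columns, drop empty strings, then emit one (user, tag) pair per tag.
pairs = [
    (row["OwnerUserId"], tag)
    for row in rows
    for tag in remove_empty(pack_tags(row))
]

# Count how often each (user, tag) pair occurs.
counts = Counter(pairs)
user_tag_count = [
    {"OwnerUserId": user, "Tag": tag, "Count": n}
    for (user, tag), n in sorted(counts.items())
]
```

Each dictionary in the resulting list plays the role of one row of the three-column SFrame described in Question 6.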

Question 7: Visually explore this summarized version of your data set with GraphLab Canvas.

user_tag_count.show()

Creating a Model

Question 8: Use graphlab.recommender.create() to create a model that can be used to recommend tags to each user.

m = graphlab.recommender.create(user_tag_count, user_id='OwnerUserId', item_id='Tag')

Question 9: Print a summary of the model by simply entering the name of the object.

m

Question 10: Get all unique users from the first 10000 observations and save them as a variable called users.

users = user_tag_count.head(10000)['OwnerUserId'].unique()

Question 11: Get 20 recommendations for each user in your list of users. Save these as a new SFrame called recs.

recs = m.recommend(users, k=20)

When people use recommendation systems for online commerce, it's often useful to be able to recommend products from a single category of items, e.g. recommending shoes to somebody who typically buys shirts.

To illustrate how this can be done with GraphLab Create, suppose we have a JavaScript user who is trying to learn Python. Below, we take only the JavaScript users and see which Python-related tags to recommend to them.

Question 12: Create a variable called javascript_users that contains all unique users who have used the javascript tag.

javascript_users = user_tag_count['OwnerUserId'][user_tag_count['Tag'] == 'javascript'].unique()

Question 13: Use the model you created above to find the 20 items most similar to the tag "python". Create a variable called python_items containing just these similar items.

python_items = m.get_similar_items(['python'], k=20)
python_items = python_items['similar']
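
For intuition about what "similar items" can mean, the sketch below ranks tags by Jaccard similarity of the sets of users who used them. This is one common similarity measure and is shown only as an illustration; it is an assumption that it resembles what your model computes, since the measure depends on the model that `graphlab.recommender.create` selected. The data and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0

# Hypothetical toy data: the set of users who have used each tag.
users_by_tag = {
    "python": {1, 2, 3},
    "numpy": {1, 2},
    "django": {3, 4},
    "javascript": {5, 6},
}

def similar_items(tag, users_by_tag, k):
    """Return the k tags most similar to `tag` as (tag, score), best first."""
    scores = [
        (other, jaccard(users_by_tag[tag], users))
        for other, users in users_by_tag.items()
        if other != tag
    ]
    # Sort by descending score, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

python_neighbors = similar_items("python", users_by_tag, k=2)
```

Here "numpy" ranks first because it shares two of the three "python" users, while "javascript" shares none.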

Question 14: For each user in javascript_users, make 5 recommendations among the items in python_items.

python_recs = m.recommend(users=javascript_users, items=python_items, k=5)
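
Conceptually, restricting recommendations to a set of items means ranking only the candidates in that set. The following plain-Python sketch shows the idea for a single user; the scores, the item sets, and the function `recommend_restricted` are hypothetical, and the sketch is not GraphLab's implementation.

```python
def recommend_restricted(scores, seen, allowed, k):
    """Top-k items by score, limited to `allowed` and skipping `seen` items."""
    candidates = [
        (item, score)
        for item, score in scores.items()
        if item in allowed and item not in seen
    ]
    # Sort by descending score, breaking ties alphabetically.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in candidates[:k]]

# Hypothetical predicted scores for one JavaScript user.
scores = {"jquery": 0.9, "django": 0.6, "numpy": 0.4, "flask": 0.5, "python": 0.8}
seen = {"javascript", "jquery", "python"}            # tags the user already used
allowed = {"python", "django", "numpy", "flask"}     # stands in for python_items

top = recommend_restricted(scores, seen, allowed, k=2)
```

Note that "python" is dropped even though it scores highly, because the user has already used it.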

Question 15: Use GraphLab Canvas to find out the 10 most often recommended items.

python_recs.show()  # Then click on the Summary tab and look at the histogram in the second column.
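
If you prefer to check the Canvas histogram programmatically, counting how often each item is recommended is a simple tally. The sketch below uses `collections.Counter` on a hypothetical list of (user, tag) recommendations rather than on the real `python_recs` SFrame.

```python
from collections import Counter

# Hypothetical recommendations as (user, tag) pairs.
toy_recs = [
    (1, "django"), (1, "flask"),
    (2, "django"), (2, "flask"),
    (3, "numpy"), (3, "django"),
]

# Tally how many times each tag was recommended and keep the most frequent ones.
tag_counts = Counter(tag for _, tag in toy_recs)
top_tags = tag_counts.most_common(2)
```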

Question 16: Save your model to a file.

m.save('my_model')

Experimenting with New Models

Question 17: Create a train/test split of the user_tag_count data from the section above. (Hint: Use random_split_by_user.)

train, test = graphlab.recommender.util.random_split_by_user(user_tag_count,
                                                             user_id='OwnerUserId',
                                                             item_id='Tag')
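
The idea behind a per-user split is that some of each user's items are held out for testing while the rest remain available for training. The sketch below is a simplified illustration in plain Python; the function `split_by_user`, its `test_fraction` and `seed` parameters, and the toy data are hypothetical, and the library function has its own defaults and sampling details.

```python
import random

def split_by_user(observations, test_fraction=0.2, seed=0):
    """Hold out a share of each user's (user, item) pairs for testing."""
    rng = random.Random(seed)
    by_user = {}
    for user, item in observations:
        by_user.setdefault(user, []).append(item)

    train, test = [], []
    for user, items in sorted(by_user.items()):
        items = sorted(items)
        rng.shuffle(items)
        # The first n_test shuffled items go to the test set, the rest to train.
        n_test = int(len(items) * test_fraction)
        test += [(user, item) for item in items[:n_test]]
        train += [(user, item) for item in items[n_test:]]
    return train, test

# Two hypothetical users with ten tags each.
observations = [(u, "tag%d" % i) for u in (1, 2) for i in range(10)]
train_pairs, test_pairs = split_by_user(observations)
```

Because every user keeps some items in the training set, a model trained on it can still produce recommendations for the users being evaluated.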

Question 18: Create a recommender model like you did above that only uses the training set.

m1 = graphlab.recommender.create(train, user_id='OwnerUserId', item_id='Tag')

Question 19: Create a matrix factorization model that is better at ranking by setting the ranking_regularization argument to 1.

m2 = graphlab.ranking_factorization_recommender.create(train,
                                                       user_id='OwnerUserId',
                                                       item_id='Tag',
                                                       target='Count',
                                                       ranking_regularization=1)

Question 20: Retrieve the coefficients for each user that were learned by this algorithm.

m2['coefficients']['OwnerUserId']

Question 21: Compare the predictive performance of the two models. Given the ability to make 10 recommendations, which model predicted the highest proportion of items in the test set (on average)?

results = graphlab.recommender.util.compare_models(test, [m1, m2],
                                                   metric='precision_recall')
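
The "proportion of items in the test set" that a model recovers corresponds to recall at a given cutoff, while precision is the share of recommendations that were hits. As a minimal sketch of how these two numbers are computed for one user (the function name and the toy data are hypothetical, and the library averages such values across users in its own way):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = float(hits) / k
    recall = float(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and held-out test items for one user.
recommended = ["django", "flask", "numpy", "pandas", "scipy"]
held_out = {"flask", "pandas", "matplotlib"}

p, r = precision_recall_at_k(recommended, held_out, k=5)
```

In this example two of the five recommendations are hits, and those two hits cover two of the three held-out items.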