Cross validation

Data is the first argument to all of the model parameter search functions. This argument accepts several different input types, letting you choose how model performance is evaluated for a given set of parameters.

You can provide a single (train, validation) pair: by default, each model is trained on the first element and evaluated on both elements.

import graphlab as gl

url = 'http://s3.amazonaws.com/gl-testdata/xgboost/mushroom.csv'
data = gl.SFrame.read_csv(url)
(train, valid) = data.random_split(0.7)
# my_model is a model constructor and my_params a dict of parameter values.
job = gl.model_parameter_search.create((train, valid), my_model, my_params)
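
Once the job has finished, you can retrieve a summary of the evaluated parameter combinations; a brief sketch, assuming the search job exposes the same get_results() method used with cross_val_score later in this section:

# Summary of metrics for each parameter combination tried during the search.
results = job.get_results()
print(results)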

You can also provide a list of (train, validation) pairs. The evaluation results for each parameter combination are averaged across the folds.

# train0, valid0, train1, valid1 are existing (train, validation) SFrame pairs.
folds = [(train0, valid0), (train1, valid1)]
job = gl.model_parameter_search.create(folds, my_model, my_params)

We also provide a convenience object, KFold, for performing model parameter search using K folds.

folds = gl.cross_validation.KFold(data, 5)
job = gl.random_search.create(folds,
                              my_model,
                              my_params)

In this case, the returned KFold object splits the data lazily to minimize communication costs.
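
If you need the individual folds yourself, a KFold object can also be iterated directly; a minimal sketch, assuming iteration yields (train, validation) SFrame pairs:

# Each iteration materializes one train/validation split on demand.
for train, valid in folds:
    print(train.num_rows(), valid.num_rows())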

Cross validation for a single parameter set

We also provide a convenience function, cross_val_score, for evaluating model performance via cross validation for a single set of parameters.

url = 'http://s3.amazonaws.com/gl-testdata/xgboost/mushroom.csv'
data = gl.SFrame.read_csv(url)
# Convert the class column to a binary target (1 if poisonous).
data['label'] = (data['label'] == 'p')
folds = gl.cross_validation.KFold(data, 5)
params = {'target': 'label', 'max_depth': 5}
job = gl.cross_validation.cross_val_score(folds,
                                          gl.boosted_trees_classifier.create,
                                          params)
print(job.get_results())

This is analogous to scikit-learn's cross_val_score.
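
For comparison, here is a rough scikit-learn equivalent, shown on a synthetic dataset (not the mushroom data) with a gradient boosted classifier standing in for the GraphLab model:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the mushroom data.
X, y = make_classification(n_samples=1000, random_state=0)

# Five-fold cross validation of a single parameter setting.
scores = cross_val_score(GradientBoostingClassifier(max_depth=5), X, y, cv=5)
print(scores.mean())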

Additional reading

To learn more about the benefits of K-fold cross validation, see Section 5.1 of An Introduction to Statistical Learning.