Defining the search

Dato Distributed supports several methods for model parameter search: grid search, random search, and manual search.

Using dictionaries to specify parameters

You specify the set of parameters to search over with a dictionary. The dictionary keys are the names of the parameters and the values are the candidate values for each parameter. Any value that is a str, int, or float is treated as a list containing that single value.

For example, specifying {"target": "y"} means that "y" will be the chosen target every time the model is fit. Some arguments are themselves list-typed; in particular, features is a list of features to be used in the model. If you want to search over a list-typed argument, you must provide an iterable over valid argument values. For example, using {"features": [["col_a"], ["col_a", "col_b"]]} would search over the two feature sets. If you just want to use the same set of features for each model, use {"features": [["col_a"]]}.
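The wrapping rule described above can be illustrated with a short pure-Python sketch. The helper name normalize_params is hypothetical and is not part of the library; it only mirrors the documented behavior for str, int, and float values and assumes every other value is a finite iterable of candidates.

```python
def normalize_params(params):
    """Wrap scalar values so every parameter maps to a list of candidates.

    Hypothetical helper for illustration only; not a library function.
    """
    normalized = {}
    for name, value in params.items():
        if isinstance(value, (str, int, float)):
            # A bare str, int, or float is a single fixed choice.
            normalized[name] = [value]
        else:
            # Anything else is assumed to be an iterable of candidate values.
            normalized[name] = list(value)
    return normalized


example = {"target": "y", "features": [["col_a"], ["col_a", "col_b"]]}
normalized = normalize_params(example)
print(normalized)
```

Note that a single feature set still has to be wrapped in an outer list, because the feature list itself is not a scalar.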

Specifying a grid search

Grid searches are especially useful when you have a relatively small set of parameters over which to search.

You may define a grid of parameters by specifying the possible values for each parameter. The method grid_search.create will then train a model for each unique combination.

The collection of all combinations of valid parameter values defines a grid of model parameters that will be considered. For example, providing the following params dictionary

params = {'target': 'label',
          'step_size': 0.3,
          'features': [['a'], ['a', 'b']],
          'max_depth': [1, 2]}

will create the following set of combinations:

[{'target': 'label', 'step_size': 0.3, 'features': ['a'], 'max_depth': 1},
 {'target': 'label', 'step_size': 0.3, 'features': ['a'], 'max_depth': 2},
 {'target': 'label', 'step_size': 0.3, 'features': ['a', 'b'], 'max_depth': 1},
 {'target': 'label', 'step_size': 0.3, 'features': ['a', 'b'], 'max_depth': 2}]
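To make the expansion concrete, here is a minimal sketch of how such a grid could be enumerated with itertools.product. The function expand_grid is a hypothetical illustration, not the implementation behind grid_search.create, and the dictionary is only shaped like the one above.

```python
from itertools import product


def expand_grid(params):
    """Return one dictionary per unique combination of candidate values.

    Hypothetical helper for illustration only; not a library function.
    """
    names = list(params)
    candidate_lists = []
    for name in names:
        value = params[name]
        if isinstance(value, (str, int, float)):
            # Scalars are treated as a list with a single candidate.
            candidate_lists.append([value])
        else:
            candidate_lists.append(list(value))
    # The Cartesian product yields every combination exactly once.
    return [dict(zip(names, combo)) for combo in product(*candidate_lists)]


grid_params = {'target': 'label',
               'step_size': 0.3,
               'features': [['a'], ['a', 'b']],
               'max_depth': [1, 2]}
combinations = expand_grid(grid_params)
for combo in combinations:
    print(combo)
```

The number of combinations is the product of the number of candidates per parameter (here 1 x 1 x 2 x 2 = 4), which is why grid search suits small search spaces.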

Using a random search space

You may not always know which areas of a search space are most promising. In such situations, it can be useful to pick parameter combinations from random distributions. The top-level method, model_parameter_search, currently chooses random_search.create by default.

For example, for a real-valued parameter such as step_size, you might want to draw values from an exponential distribution. In the following example, each parameter combination will contain:

a target value of 'label'
a max_depth value of either 5 or 7 (chosen randomly)
a step_size value drawn randomly from an exponential distribution shifted to start at 0.1 (the first positional argument of scipy's expon is loc, so use expon(scale=.1) if you want a mean of 0.1)

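import graphlab as gl
import scipy.stats

# Load the mushroom data and turn the label into an indicator for the value 'p'
url = 'http://s3.amazonaws.com/gl-testdata/xgboost/mushroom.csv'
data = gl.SFrame.read_csv(url)
data['label'] = (data['label'] == 'p')

# Hold out roughly 20% of the rows for validation
train, valid = data.random_split(.8)

# Lists are sampled from at random; the scipy distribution supplies step_size
params = {'target': 'label',
          'max_depth': [5, 7],
          'step_size': scipy.stats.distributions.expon(.1)}
job = gl.random_search.create((train, valid),
                              gl.boosted_trees_regression.create,
                              params)
job.get_results()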

Columns:
        model_id        int
        max_depth       int
        step_size       float
        target          str
        training_rmse   float
        validation_rmse float

Rows: 8

Data:
+----------+-----------+----------------+--------+-------------------+
| model_id | max_depth |   step_size    | target |   training_rmse   |
+----------+-----------+----------------+--------+-------------------+
|    9     |     7     | 0.742280945789 | label  | 0.000562821322042 |
|    8     |     5     | 0.37544111673  | label  |  0.00963600115039 |
|    1     |     5     | 0.138909527035 | label  |   0.11368970605   |
|    0     |     7     | 0.977843893103 | label  | 0.000269710408328 |
|    3     |     7     | 0.32559648473  | label  |  0.0110626696535  |
|    2     |     5     | 0.330703633987 | label  |  0.0137912720349  |
|    5     |     7     | 0.408652318249 | label  |  0.00367912426229 |
|    4     |     7     | 0.295146249231 | label  |  0.0162840474088  |
+----------+-----------+----------------+--------+-------------------+
+-------------------+
|  validation_rmse  |
+-------------------+
| 0.000790839939725 |
|  0.0123972020261  |
|   0.114722098681  |
| 0.000369491390958 |
|  0.0120762185507  |
|  0.0169411827805  |
|  0.00439583387505 |
|  0.0171414864358  |
+-------------------+
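The sampling idea behind a random search can be sketched in plain Python. Everything here is an assumption made for illustration: sample_params and the Exponential class are hypothetical stand-ins, the rvs(rng) signature is simplified and differs from scipy's, and the sketch does not reproduce how random_search.create works internally. The stand-in uses loc=0.1 because scipy.stats.expon takes loc as its first positional argument.

```python
import random


class Exponential:
    """Hypothetical stand-in for a scipy.stats frozen exponential distribution."""

    def __init__(self, loc=0.0, scale=1.0):
        self.loc = loc
        self.scale = scale

    def rvs(self, rng):
        # random.expovariate expects the rate, which is 1 / scale.
        return self.loc + rng.expovariate(1.0 / self.scale)


def sample_params(params, rng):
    """Draw one parameter combination from a search space dictionary."""
    sampled = {}
    for name, value in params.items():
        if hasattr(value, "rvs"):
            # Distribution-like objects are sampled directly.
            sampled[name] = value.rvs(rng)
        elif isinstance(value, (str, int, float)):
            # Scalars are fixed for every combination.
            sampled[name] = value
        else:
            # Lists contribute one randomly chosen candidate.
            sampled[name] = rng.choice(list(value))
    return sampled


rng = random.Random(0)  # seeded so repeated runs draw the same values
search_space = {'target': 'label',
                'max_depth': [5, 7],
                'step_size': Exponential(loc=0.1)}
draws = [sample_params(search_space, rng) for _ in range(8)]
for draw in draws:
    print(draw)
```

With loc=0.1 and the default scale of 1, every sampled step_size is at least 0.1; passing scale=0.1 with loc=0.0 instead would give a distribution whose mean is 0.1.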

Manually specifying parameters

If you want full control over your parameter search, then you can use the manual_search.create function. All you need to do is to pass in a list of parameter dictionaries; a model will be fit for each parameter set.

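# Reuses the train and valid SFrames from the random search example
factory = gl.boosted_trees_classifier.create
# One model is fit for each dictionary in this list
params = [{'target': 'label', 'max_depth': 3},
          {'target': 'label', 'max_depth': 6}]
job = gl.manual_search.create((train, valid),
                              factory, params)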
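Conceptually, a manual search is a loop that fits one model per parameter dictionary. The following sketch shows that loop under stated assumptions: manual_search and toy_factory are hypothetical names, the toy factory only records its settings instead of training anything, and none of this reflects how manual_search.create schedules work.

```python
def manual_search(factory, data, param_sets):
    """Call the factory once per parameter dictionary and collect the results."""
    results = []
    for model_id, params in enumerate(param_sets):
        # Each dictionary is passed to the factory as keyword arguments.
        model = factory(data, **params)
        results.append({'model_id': model_id, 'params': params, 'model': model})
    return results


def toy_factory(data, target, max_depth):
    """Hypothetical model factory: records its settings instead of training."""
    return {'target': target, 'max_depth': max_depth, 'num_rows': len(data)}


toy_data = [{'label': 1}, {'label': 0}, {'label': 1}]
param_sets = [{'target': 'label', 'max_depth': 3},
              {'target': 'label', 'max_depth': 6}]
results = manual_search(toy_factory, toy_data, param_sets)
for result in results:
    print(result['model_id'], result['params'])
```

Because the parameter sets are listed explicitly, nothing is expanded or sampled: two dictionaries produce exactly two models.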