Distributing model parameter search
All model parameter search methods, as well as cross_val_score, can run their jobs either locally or remotely.
Local
By default, jobs are scheduled to run locally in an asynchronous fashion. This is called a LocalAsync environment.
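For example, the following sketch runs a search in the default local environment. It assumes train and valid are SFrames with a 'label' target column, uses graphlab.boosted_trees_classifier.create as an illustrative model factory, and searches a hypothetical parameter grid:

import graphlab

# With no environment argument, the search runs as a LocalAsync job.
# 'target' is held fixed; 'max_depth' is searched over three values.
params = {'target': 'label', 'max_depth': [4, 6, 8]}
j = graphlab.model_parameter_search.create((train, valid),
                                           graphlab.boosted_trees_classifier.create,
                                           params)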
Remote
You may also run jobs on an EC2 cluster or a Hadoop cluster. This is especially useful when you want to perform a larger-scale parameter search.
For EC2, you first create an EC2 environment and pass it as the environment argument:
import graphlab

# Configure and launch a 4-node EC2 cluster, staging files in S3.
ec2config = graphlab.deploy.Ec2Config()
ec2 = graphlab.deploy.ec2_cluster.create(name='mps',
                                         s3_path='s3://bucket/path',
                                         ec2_config=ec2config,
                                         num_hosts=4)

# Run the parameter search on the cluster.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=ec2)
To launch jobs on a Hadoop cluster, you instead create a Hadoop environment and pass it as the environment argument:
# Connect to an existing Hadoop cluster. The placeholder should point
# to your Dato Distributed installation.
hd = graphlab.deploy.hadoop_cluster.create(name='hadoop-cluster',
                                           dato_dist_path=<path to installation>)

# Run the same parameter search, this time on Hadoop.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=hd)
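In either case, the returned job executes asynchronously. As a sketch of collecting the output, assuming the standard GraphLab Create job methods get_status and get_results:

status = j.get_status()    # e.g. 'Running' or 'Completed'
results = j.get_results()  # results for each parameter combination searched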
For more details on creating EC2- and Hadoop-based environments, check out the API docs or the Deployment chapter of the user guide.
When getting started, it is useful to keep perform_trial_run=True (the default) so that a quick trial run verifies that your models are constructed properly before the full search launches.
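Once you have verified the setup, the trial run can be skipped by flipping the flag. A sketch, reusing my_model, my_params, and the ec2 environment from the examples above:

# Skip the trial run once the setup has been verified.
j = graphlab.model_parameter_search.create((train, valid),
                                           my_model, my_params,
                                           environment=ec2,
                                           perform_trial_run=False)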