Evaluating Models

Evaluation metrics are tied to the machine learning task. There are different metrics for classification, regression, ranking, clustering, topic modeling, and so on. If you built a classifier to detect spam emails vs. normal emails, then you should consider classification performance metrics such as average accuracy, log-loss, and AUC. If you are trying to predict a numeric value, such as Google’s daily stock price, then you might want to consider regression metrics like the root-mean-squared error (RMSE). If you are ranking items by relevance to a query, as in a search engine, then look into ranking metrics such as precision and recall (also popular as classification metrics) or NDCG (normalized discounted cumulative gain). These are all examples of performance metrics for various tasks.
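To make the ranking case concrete, here is a minimal sketch that computes NDCG with scikit-learn’s `ndcg_score` on a single toy query; the relevance grades and model scores below are invented for illustration. (Classification and regression metrics are illustrated in their own sections.)

```python
# A minimal sketch, assuming scikit-learn is installed; the relevance
# grades and model scores are toy data for a single invented query.
import numpy as np
from sklearn.metrics import ndcg_score

# Graded relevance judgments for five retrieved documents (higher = more
# relevant) and the model's ranking scores for the same documents.
true_relevance = np.asarray([[3, 2, 3, 0, 1]])
model_scores = np.asarray([[0.9, 0.7, 0.8, 0.6, 0.2]])

# NDCG rewards placing highly relevant documents near the top of the ranking.
print("NDCG:", ndcg_score(true_relevance, model_scores))
```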

Classification Metrics

Classification is about predicting class labels given input data. In binary classification, there are two possible output classes. In multi-class classification, there are more than two possible classes. An example of binary classification is spam detection, where the input data could be the email text and metadata (sender, sending time), and the output label is either “spam” or “not spam.” Sometimes, people use generic names for the two classes: “positive” and “negative,” or “class 1” and “class 0.”

There are many ways of measuring classification performance; accuracy, log-loss, and AUC, introduced above, are among the most popular.
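As a quick sketch of those metrics, the snippet below computes accuracy, log-loss, and AUC with scikit-learn on a toy set of binary spam labels; the labels and predicted probabilities are invented for illustration.

```python
# A minimal sketch, assuming scikit-learn is installed; the labels and
# predicted probabilities are toy data invented for illustration.
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]                      # 1 = spam, 0 = not spam
y_prob = [0.9, 0.2, 0.45, 0.8, 0.6, 0.1]         # model's predicted P(spam)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # hard labels at a 0.5 threshold

print("accuracy:", accuracy_score(y_true, y_pred))  # fraction of correct labels
print("log-loss:", log_loss(y_true, y_prob))        # penalizes confident mistakes
print("AUC:", roc_auc_score(y_true, y_prob))        # how well scores rank spam above non-spam
```

Note that accuracy operates on hard labels, while log-loss and AUC consume the predicted probabilities directly, so they can distinguish a barely wrong prediction from a confidently wrong one.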

Regression Metrics

In a regression task, the model learns to predict numeric scores. An example is predicting the price of a stock on future days given past price history and other information about the company and the market. Another example is personalized recommendations, where the goal is to predict a user’s rating for an item.

There are many ways of measuring regression performance; RMSE, mentioned above, is one of the most widely used.
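As a sketch of that metric, the snippet below computes RMSE with scikit-learn and NumPy on invented stock-price numbers, both via `mean_squared_error` and directly from the definition.

```python
# A minimal sketch, assuming NumPy and scikit-learn are installed; the
# prices below are toy numbers invented for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([102.3, 101.8, 104.1, 103.0])  # actual daily prices
y_pred = np.array([101.9, 102.5, 103.6, 104.2])  # model's predictions

# RMSE = sqrt(mean((y_true - y_pred)^2)); mean_squared_error gives the MSE.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)

# The same value computed directly from the definition:
assert np.isclose(rmse, np.sqrt(np.mean((y_true - y_pred) ** 2)))
```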