# Recipe for Success

Predicting the Number of Steps in Recipes on Food.com

By Matthew Yeh

## Introduction
Have you ever wondered how many steps a recipe has before you start cooking? This project aims to answer the question, “What is the relationship between recipe features, such as the number of ingredients and the average rating, and the number of steps in the recipe?” To that end, it predicts the number of steps in a recipe from features such as its average rating and the number of minutes it takes to make.
The dataset this project is based on contains recipes and ratings from food.com. It was originally scraped and used by the authors of this recommender systems paper.
The raw dataset has 83,782 recipes. The columns that are relevant to this project are:

- `avg_rating`: the average rating of the recipe, computed from the user ratings
- `minutes`: the number of minutes it takes to make the recipe
- `n_steps`: the number of steps in the recipe
- `n_ingredients`: the number of ingredients used in the recipe
- `nutrition`: the nutritional information of the recipe, including calories, total fat, sugar, and sodium
## Data Cleaning and Exploratory Data Analysis
The data cleaning process involved a number of steps:
- Assembling the dataset: The dataset was split into two files, so I had to merge them into a single DataFrame. I loaded the recipe file, which contained the recipe ID, name, and other information, and the rating file, which contained reviews and ratings for the recipes. I replaced ratings of 0 with NaN values: ratings range from 1 to 5 stars, so a rating of 0 indicated invalid or missing data, and treating it as missing ensured that the average ratings would not be skewed. I then merged the two DataFrames on the recipe ID and computed each recipe’s average rating by grouping by the recipe ID and taking the mean of its ratings. (A sketch of these cleaning steps appears after this list.)
- Creating new columns from the `nutrition` column: The `nutrition` column contained lists of nutritional information for each recipe. I extracted the calories, total fat, sugar, sodium, protein, saturated fat, and carbohydrates from these lists, created a new column for each value, and then removed the `nutrition` column from the DataFrame.
- Cleaning the data: I removed any rows with missing values in the `avg_rating` column. There were no missing values in any other quantitative columns, so no other rows needed to be removed. I did not impute any missing values, as it would not make sense to impute the average rating of a recipe.
- Exploratory Data Analysis: I created a number of visualizations to explore the relationships between the features in the dataset. I found that recipes that take longer to make tend to have more steps, and that recipes with more ingredients tend to have more steps.
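Below is a minimal sketch of the cleaning steps above, assuming the two raw files are named `RAW_recipes.csv` and `RAW_interactions.csv` and use the dataset’s usual column names (`id`, `recipe_id`, `rating`, `nutrition`); the project’s actual code may differ:

```python
import pandas as pd
import numpy as np

# Assumed file and column names; adjust to match the actual dataset files.
recipes = pd.read_csv('RAW_recipes.csv')
interactions = pd.read_csv('RAW_interactions.csv')

# Ratings of 0 indicate missing/invalid data, since valid ratings are 1-5 stars.
interactions['rating'] = interactions['rating'].replace(0, np.nan)

# Merge ratings onto recipes and compute each recipe's average rating.
merged = recipes.merge(interactions, left_on='id', right_on='recipe_id', how='left')
avg_rating = merged.groupby('id')['rating'].mean().rename('avg_rating')
recipes = recipes.merge(avg_rating, left_on='id', right_index=True)

# Expand the stringified `nutrition` lists into separate columns, then drop the original.
nutrition_cols = ['calories (#)', 'total fat (PDV)', 'sugar (PDV)', 'sodium (PDV)',
                  'protein (PDV)', 'saturated fat (PDV)', 'carbohydrates (PDV)']
recipes[nutrition_cols] = pd.DataFrame(
    recipes['nutrition'].apply(eval).tolist(), index=recipes.index
)
recipes = recipes.drop(columns=['nutrition'])

# Drop recipes that have no ratings at all (missing avg_rating).
recipes = recipes.dropna(subset=['avg_rating'])
```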
The plot above shows the distribution of the number of steps in the recipes. The distribution is right-skewed, with most recipes having fewer than 20 steps but some recipes having far more, with an extreme outlier at 100 steps. The mode is 7 steps, with 6,846 recipes having 7 steps. This indicates that most recipes are relatively simple to make, with only a few steps involved.
This plot shows the relationship between the number of steps and the number of ingredients in the recipes. There is generally a positive relationship between the two variables, with recipes with more ingredients tending to have more steps. This makes intuitive sense, as more ingredients would likely require more steps to prepare.
Next, I created a table aggregating the average rating, calories, and number of steps for recipes in different time ranges:
| time_range | avg_rating | calories (#) | n_steps |
|------------|------------|--------------|---------|
| < 15 min   | 4.67088    | 313.495      | 5.54733 |
| 15-30 min  | 4.62338    | 375.643      | 9.33545 |
| 30-60 min  | 4.60655    | 445.806      | 11.4932 |
| 1-2 hrs    | 4.62744    | 558.003      | 13.1389 |
| 2+ hrs     | 4.59355    | 553.625      | 12.3121 |
The table shows that recipes that take longer to make tend to have slightly lower ratings, more calories, and more steps, although the calorie and step counts level off for recipes that take more than two hours.
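This aggregation could be reproduced with something along these lines (the bin edges are inferred from the time ranges in the table; the project’s actual code may differ):

```python
import numpy as np
import pandas as pd

# Bin recipes into the time ranges shown in the table above.
bins = [0, 15, 30, 60, 120, np.inf]
labels = ['<15', '15-30', '30-60', '1-2 hrs', '2+ hrs']
recipes['time_range'] = pd.cut(recipes['minutes'], bins=bins, labels=labels)

# Average rating, calories, and number of steps within each time range.
summary = (
    recipes
    .groupby('time_range')[['avg_rating', 'calories (#)', 'n_steps']]
    .mean()
)
print(summary)
```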
## Framing a Prediction Problem
Based on the analysis in the previous section, there appear to be relationships between the number of steps in a recipe and features such as the time in minutes to make it, the average rating, the number of ingredients, and the number of calories. I will use these features, all of which would be known at the time of prediction for a given recipe on food.com, to predict the number of steps in the recipe.
I will be predicting the number of steps in a recipe, which makes this a regression problem. The response variable is `n_steps`, a quantitative variable. I chose this variable because it is likely to be related to the complexity of the recipe and can be predicted using the other features in the dataset.
I will be using the mean squared error as the metric to evaluate my model. I chose this metric because it penalizes large errors more than the mean absolute error, which is important because I want to minimize the number of large errors in my predictions. The mean squared error is also easier to interpret than other metrics such as the mean absolute percentage error, which can be difficult to interpret when the target variable has a wide range of values.
The R^2 score will also be used to evaluate the model. This metric measures the proportion of the variance in the target variable that is predictable from the features. It is a useful metric for regression problems because it provides an indication of how well the model explains the variance in the target variable.
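For illustration, both metrics can be computed with scikit-learn (the arrays below are made-up values, not results from this project):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted step counts, for illustration only.
y_true = [5, 9, 12, 7, 20]
y_pred = [6, 8, 14, 7, 16]

mse = mean_squared_error(y_true, y_pred)  # average squared error; penalizes large misses
r2 = r2_score(y_true, y_pred)             # fraction of variance in y_true explained

print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```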
## Baseline Model
My baseline model is a linear regression model that uses the following features:
- `avg_rating`: the average rating of the recipe
- `minutes`: the number of minutes it takes to make the recipe

These were chosen based on the results from the aggregated table, which showed that recipes that take longer to make tend to have more steps. In addition, there appeared to be a slight negative relationship between the average rating and the number of steps in the recipes.
No encoding was necessary, as both of these features are already quantitative. The performance of the model is as follows:
- Mean Squared Error: 39.79106912197526
- R^2 Score: 0.0005333644176367391
The baseline model is not very good, as the R^2 score is very low and the mean squared error is relatively high, indicating that the model does not explain much of the variance in the number of steps in the recipes.
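For reference, a minimal sketch of how such a baseline could be fit, assuming the cleaned `recipes` DataFrame from earlier and an 80/20 train/test split (the project’s actual split is not specified here):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Baseline features and target from the cleaned DataFrame.
X = recipes[['avg_rating', 'minutes']]
y = recipes['n_steps']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = LinearRegression()
baseline.fit(X_train, y_train)

preds = baseline.predict(X_test)
print('MSE:', mean_squared_error(y_test, preds))
print('R^2:', r2_score(y_test, preds))
```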
## Final Model
The final modeling algorithm I chose is a LASSO regression model that uses the following features:
- `avg_rating`: the average rating of the recipe
- `minutes`: the number of minutes it takes to make the recipe
- `calories (#)`: the number of calories in the recipe
- `n_ingredients`: the number of ingredients in the recipe
These features were chosen because they are likely to be related to the number of steps in a recipe. For example, recipes with more ingredients may have more steps, and recipes with more calories may have more steps. The LASSO regression model was chosen because it performs feature selection by shrinking the coefficients of irrelevant features to 0. This helps prevent overfitting and makes the model more interpretable.
I log-transformed the `minutes` feature to make the data more normally distributed, as it was right-skewed, with some recipes taking a very long time to make (these recipes mostly involved some kind of pickling or fermentation). I also standardized the features so that they have a mean of 0 and a standard deviation of 1. This helps the model converge faster and makes the coefficients more interpretable.
To create more features, I also applied polynomial transformations: I used the `PolynomialFeatures` class from scikit-learn to create polynomial features of the log-transformed `minutes` column and the `n_ingredients` column. This allows the model to capture non-linear relationships between the features and the target variable.
I also used a quantile transformer to transform the `calories (#)` column. This transformation helps the model better capture the relationship between the `calories (#)` column and the target variable by making the data more normally distributed.
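A sketch of what this preprocessing and modeling pipeline could look like in scikit-learn, building on the cleaned `recipes` DataFrame from earlier (the project’s exact construction may differ; `np.log1p` is used here so that zero-minute entries do not break the log transform):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (FunctionTransformer, PolynomialFeatures,
                                   QuantileTransformer, StandardScaler)

# Features and target for the final model (using the cleaned `recipes` DataFrame).
X = recipes[['avg_rating', 'minutes', 'calories (#)', 'n_ingredients']]
y = recipes['n_steps']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Log-transform `minutes` (log1p avoids log(0)), then expand it into polynomial features.
log_poly_minutes = Pipeline([
    ('log', FunctionTransformer(np.log1p)),
    ('poly', PolynomialFeatures(degree=5, include_bias=False)),
])
poly_ingredients = PolynomialFeatures(degree=4, include_bias=False)

preprocess = ColumnTransformer([
    ('minutes', log_poly_minutes, ['minutes']),
    ('ingredients', poly_ingredients, ['n_ingredients']),
    ('calories', QuantileTransformer(output_distribution='normal'), ['calories (#)']),
    ('rating', 'passthrough', ['avg_rating']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('scale', StandardScaler()),      # standardize to mean 0, standard deviation 1
    ('lasso', Lasso(alpha=2 ** -5)),  # best regularization strength reported below
])

model.fit(X_train, y_train)
```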
The regularization hyperparameter that performed the best was alpha = 0.03125, or 2^(-5). The polynomial degrees that performed the best were 5 for the `minutes` column and 4 for the `n_ingredients` column.
These hyperparameters were selected using cross-validation. The final model’s performance is as follows:
- Mean Squared Error: 29.4196790148663
- R^2 Score: 0.2610405235715909
The final model is an improvement over the baseline model, with a lower mean squared error and a higher R^2 score, indicating that it explains more of the variance in the number of steps in the recipes. This increase in performance is likely due to the addition of features more closely related to the number of steps in a recipe, and to the use of a more flexible model that can capture non-linear relationships between the features and the target variable. However, there is still room for improvement, as the R^2 score remains relatively low.
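For reference, a sketch of how the cross-validated hyperparameter search could be set up with scikit-learn’s `GridSearchCV`, using the pipeline sketched above (the grid values are illustrative and simply bracket the reported best settings):

```python
from sklearn.model_selection import GridSearchCV

# Illustrative grid around the best values reported above.
param_grid = {
    'lasso__alpha': [2 ** k for k in range(-7, -2)],
    'preprocess__minutes__poly__degree': [3, 4, 5, 6],
    'preprocess__ingredients__degree': [2, 3, 4, 5],
}

search = GridSearchCV(
    model,                             # pipeline from the sketch above
    param_grid,
    scoring='neg_mean_squared_error',  # GridSearchCV maximizes, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)   # e.g. the alpha and degrees reported above
print(-search.best_score_)   # cross-validated MSE of the best configuration
```

Expanding this grid, or engineering additional features, would be a natural next step given the variance that the model still leaves unexplained.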