Dataset
Dataset Structure
Each recipe in the main 500,000-recipe dataset includes the following attributes:
Textual Information
Field | Description |
---|---|
Name | The recipe title |
Description | A brief textual overview or background of the recipe |
Steps | An ordered list of instructions for preparation |
Categorical Fields
Field | Examples |
---|---|
Tags | vegan, gluten-free |
General Category | quick, dessert, holiday, main course, appetizer |
Nutritional Values
Nutrient | Unit |
---|---|
Calories | |
Cholesterol | |
Sodium | |
Carbohydrates | |
Fiber | |
Sugar | |
Protein | |
Total Fat | |
Saturated Fat | |
Ingredients
Each ingredient entry includes:
Field | Description |
---|---|
Name | The name of the ingredient |
Quantity | How much is needed |
Unit of Measure | The measurement unit (e.g., grams, cups) |
Optional Metadata | Preparation style or additional descriptors (e.g., “chopped”, “fresh”) |
Additional Metadata
Field | Description | Unit |
---|---|---|
Preparation Time | Time needed to prepare | |
Cooking Time | Time needed to cook | |
Serving Size | Number of servings | |
Author | Creator of the recipe | |
Images | Associated images (if any) | |
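As a rough illustration, the attributes listed above could be held in a record type like the one below. This is only a sketch: the class names, field names, and types (`Ingredient`, `Recipe`, `descriptors`, and so on) are assumptions chosen for readability, not the dataset's actual schema, and units are left unspecified because the tables above do not state them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ingredient:
    name: str                                  # the name of the ingredient
    quantity: Optional[float] = None           # how much is needed
    unit: Optional[str] = None                 # e.g. "grams", "cups"
    descriptors: List[str] = field(default_factory=list)  # e.g. ["chopped", "fresh"]


@dataclass
class Recipe:
    name: str                                  # recipe title
    description: str                           # brief overview or background
    steps: List[str]                           # ordered preparation instructions
    ingredients: List[Ingredient]
    tags: List[str] = field(default_factory=list)
    general_category: Optional[str] = None     # assumed single-valued here
    nutrition: Dict[str, float] = field(default_factory=dict)  # nutrient -> value
    prep_time: Optional[float] = None
    cook_time: Optional[float] = None
    servings: Optional[int] = None
    author: Optional[str] = None
    images: List[str] = field(default_factory=list)


# A made-up record, for illustration only.
example = Recipe(
    name="Tomato Salad",
    description="A simple salad.",
    steps=["Chop the tomato.", "Season and serve."],
    ingredients=[Ingredient("tomato", 2.0, None, ["chopped"])],
    tags=["vegan"],
    general_category="quick",
)
```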
Preprocessing and Filtering
To reduce sparsity and improve semantic consistency, we filtered out all tags, categories, and ingredients that appeared fewer than five times across the corpus. This reduced the number of distinct ingredient types from over 450,000 to approximately 20,000 standardized ingredients. To handle the removed rare ingredients, we applied semantic-similarity-based imputation with pretrained language models, replacing each rare item with its most similar frequent equivalent. The same strategy was applied to rare tags and categories. For modeling purposes, we retained a subset of the most semantically informative features: Name, Steps, Ingredients, Nutritional Information, Tags, and General Category.
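The filtering and imputation steps described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function names are hypothetical, and the `embed` lookup is a stand-in for vectors that would come from a pretrained language model (here it is a hand-written toy dictionary). Only the threshold of five occurrences comes from the text.

```python
import math
from collections import Counter

MIN_COUNT = 5  # from the text: items seen fewer than five times are treated as rare


def split_by_frequency(recipes, min_count=MIN_COUNT):
    """Return (frequent, rare) item sets from a list of per-recipe item lists."""
    counts = Counter(item for items in recipes for item in items)
    frequent = {item for item, n in counts.items() if n >= min_count}
    rare = set(counts) - frequent
    return frequent, rare


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_replacements(frequent, rare, embed):
    """Map each rare item to its most similar frequent item."""
    candidates = sorted(frequent)  # sorted for deterministic tie-breaking
    return {
        r: max(candidates, key=lambda f: cosine(embed[r], embed[f]))
        for r in rare
    }


# Toy corpus and toy 2-d "embeddings" (assumptions for illustration only).
recipes = [["tomato", "salt"]] * 5 + [["roma tomato", "salt"], ["sea salt", "tomato"]]
toy_embeddings = {
    "tomato": [1.0, 0.0],
    "salt": [0.0, 1.0],
    "roma tomato": [0.9, 0.1],
    "sea salt": [0.2, 0.8],
}

frequent, rare = split_by_frequency(recipes)
replacements = build_replacements(frequent, rare, toy_embeddings)
cleaned = [[replacements.get(item, item) for item in items] for items in recipes]
```

The same two functions apply unchanged to tags or categories, since they only assume a list of item lists and an embedding lookup.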
Extended Recipes via RecipeNLG
To further scale our dataset, we used the RecipeNLG dataset, which contains over 2 million recipes, including many from Food.com. RecipeNLG provides Title, Ingredients, and Steps, but lacks nutritional data, categories, and tags. To address this, we trained a classification model on our curated 500,000-recipe dataset to impute the missing tags and general categories in RecipeNLG. This enables the integration of additional recipe data for training and evaluation while maintaining semantic richness.
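The idea of training on labeled recipes and then predicting a category for unlabeled ones can be illustrated with a small text classifier. The text does not say which model was used, so the sketch below substitutes a plain multinomial Naive Bayes over bag-of-words tokens purely as a stand-in; the class name, the training examples, and the single-label setup are all assumptions.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayesCategoryModel:
    """Multinomial Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.label_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.label_counts[label] / total)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up "curated" examples with known categories.
curated_texts = [
    "chocolate cake sugar butter bake",
    "vanilla ice cream sugar",
    "grilled chicken rice garlic",
    "beef stew potatoes carrots",
]
curated_labels = ["dessert", "dessert", "main course", "main course"]

model = NaiveBayesCategoryModel().fit(curated_texts, curated_labels)

# Impute a category for an unlabeled recipe (title plus ingredients as text).
imputed = model.predict("sugar cookies butter vanilla")
```

Tags would typically need a multi-label variant (one binary decision per tag), which this single-label sketch does not cover.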
Ingredient Nutrition Matching
We also incorporated nutritional values for individual ingredients using the USDA FoodData Central database. Ingredient-level nutrition information was aggregated per recipe and cross-checked against recipe-level nutrition estimates where available, increasing the reliability of health-related attributes in the dataset.
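The aggregation and cross-check can be sketched as below. Everything specific here is an assumption made for illustration: the per-100 g table holds made-up placeholder values rather than real USDA FoodData Central data, ingredient quantities are assumed to be already converted to grams, and the 20% relative tolerance is an arbitrary choice.

```python
# Placeholder per-100 g values; NOT real USDA FoodData Central data.
PER_100G = {
    "ingredient_a": {"calories": 100.0, "protein": 2.0},
    "ingredient_b": {"calories": 400.0, "protein": 10.0},
}


def aggregate_nutrition(ingredients, table):
    """Sum nutrients over (name, grams) pairs, scaling per-100 g values by weight."""
    totals = {}
    for name, grams in ingredients:
        for nutrient, per_100g in table[name].items():
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100.0
    return totals


def within_tolerance(estimated, reported, rel_tol=0.2):
    """For each nutrient present in both, check the estimate against the reported value."""
    return {
        nutrient: abs(estimated[nutrient] - reported[nutrient]) <= rel_tol * abs(reported[nutrient])
        for nutrient in reported
        if nutrient in estimated
    }


recipe_ingredients = [("ingredient_a", 200.0), ("ingredient_b", 50.0)]
estimated = aggregate_nutrition(recipe_ingredients, PER_100G)

# Made-up recipe-level values to compare against.
reported = {"calories": 420.0, "protein": 15.0}
agreement = within_tolerance(estimated, reported)
```

Nutrients flagged as disagreeing (here, protein) would be candidates for manual review or for preferring one source over the other.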