
Dataset

Dataset Structure

Each recipe in the main 500,000-recipe dataset includes the following attributes:

Textual Information

| Field | Description |
| --- | --- |
| Name | The recipe title |
| Description | A brief textual overview or background of the recipe |
| Steps | An ordered list of instructions for preparation |

Categorical Fields

| Field | Examples |
| --- | --- |
| Tags | vegan, gluten-free |
| General Category | quick, dessert, holiday, main course, appetizer |

Nutritional Values

| Nutrient | Unit |
| --- | --- |
| Calories | |
| Cholesterol | |
| Sodium | |
| Carbohydrates | |
| Fiber | |
| Sugar | |
| Protein | |
| Total Fat | |
| Saturated Fat | |

Ingredients

Each ingredient entry includes:

| Field | Description |
| --- | --- |
| Name | The name of the ingredient |
| Quantity | How much is needed |
| Unit of Measure | The measurement unit (e.g., grams, cups) |
| Optional Metadata | Preparation style or additional descriptors (e.g., “chopped”, “fresh”) |

Additional Metadata

| Field | Description | Unit |
| --- | --- | --- |
| Preparation Time | Time needed to prepare | |
| Cooking Time | Time needed to cook | |
| Serving Size | Number of servings | |
| Author | Creator of the recipe | |
| Images | Associated images (if any) | |
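
The attribute groups above can be pictured as a single record type. The following is only a sketch: the class and field names, the types, and the optional/required split are assumptions made for illustration, not the dataset's actual schema or storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ingredient:
    name: str                         # the name of the ingredient
    quantity: Optional[float] = None  # how much is needed
    unit: Optional[str] = None        # e.g. "grams", "cups"
    descriptors: List[str] = field(default_factory=list)  # e.g. ["chopped", "fresh"]


@dataclass
class Recipe:
    # Textual information
    name: str
    description: str
    steps: List[str]
    # Ingredients
    ingredients: List[Ingredient]
    # Categorical fields (a single general category is an assumption)
    tags: List[str] = field(default_factory=list)
    general_category: Optional[str] = None
    # Nutritional values keyed by nutrient name, e.g. "calories"
    nutrition: Dict[str, float] = field(default_factory=dict)
    # Additional metadata (time units are not specified in the source)
    preparation_time: Optional[float] = None
    cooking_time: Optional[float] = None
    serving_size: Optional[int] = None
    author: Optional[str] = None
    images: List[str] = field(default_factory=list)


# Hypothetical example record.
example = Recipe(
    name="Tomato Salad",
    description="A simple summer salad.",
    steps=["Chop the tomatoes.", "Season and serve."],
    ingredients=[Ingredient("tomato", 2.0, "cups", ["chopped", "fresh"])],
    tags=["vegan", "gluten-free"],
    general_category="appetizer",
)
```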

Preprocessing and Filtering

To reduce sparsity and improve semantic consistency, we filtered out all tags, categories, and ingredients that appeared fewer than five times across the corpus. This reduced the number of distinct ingredient types from over 450,000 to approximately 20,000 standardized ingredients. To handle the removed rare ingredients, we applied semantic similarity-based imputation using pretrained language models, replacing each rare item with its most similar frequent equivalent. The same strategy was applied to rare tags and categories. For modeling purposes, we retained a subset of the most semantically informative features: Name, Steps, Ingredients, Nutritional Information, Tags, and General Category.
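
The filtering and imputation step can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the function names are hypothetical, and `similarity` is a character-level stand-in from the standard library, whereas the text describes semantic similarity from pretrained language models.

```python
from collections import Counter
from difflib import SequenceMatcher

MIN_COUNT = 5  # items seen fewer than five times are treated as rare


def similarity(a, b):
    # Stand-in only: string similarity, NOT a pretrained language model.
    # In practice this would compare embedding vectors of the two items.
    return SequenceMatcher(None, a, b).ratio()


def build_replacements(recipes, min_count=MIN_COUNT):
    """Map each rare item to its most similar frequent item."""
    counts = Counter(item for recipe in recipes for item in recipe)
    frequent = sorted(item for item, n in counts.items() if n >= min_count)
    rare = [item for item, n in counts.items() if n < min_count]
    if not frequent:
        return {}
    return {r: max(frequent, key=lambda f: similarity(r, f)) for r in rare}


def normalize(recipes, replacements):
    """Rewrite every recipe so that rare items are replaced."""
    return [[replacements.get(item, item) for item in recipe] for recipe in recipes]


# Toy corpus: "tomatoes" appears once, so it is rare and gets imputed.
corpus = [["tomato", "salt"]] * 5 + [["tomatoes", "salt"]]
replacements = build_replacements(corpus)
normalized = normalize(corpus, replacements)
```

The same two functions would apply unchanged to tags and categories, since they only operate on lists of strings.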

Extended Recipes via RecipeNLG

To further scale our dataset, we leveraged the RecipeNLG dataset, which contains over 2 million recipes, including many from Food.com. RecipeNLG provides Title, Ingredients, and Steps, but lacks nutritional data, categories, and tags. To address this, we trained a classification model on our curated 500,000-recipe dataset to impute the missing tags and general categories in RecipeNLG. This allows additional recipe data to be integrated for training and evaluation while maintaining semantic richness.
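
The idea of training on labeled recipes and imputing categories for unlabeled ones can be illustrated with a toy example. The source does not say which classification model was used; the bag-of-words naive Bayes classifier below is a swapped-in technique chosen only because it fits in a few lines, and the class name and training examples are made up.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayesCategoryModel:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up labeled recipes standing in for the curated dataset.
labeled_texts = [
    "chocolate cake sugar flour butter",
    "vanilla ice cream sugar",
    "grilled chicken rice garlic",
    "beef stew potato onion",
]
labeled_categories = ["dessert", "dessert", "main course", "main course"]

model = NaiveBayesCategoryModel().fit(labeled_texts, labeled_categories)

# An unlabeled recipe (title plus ingredients) receives an imputed category.
imputed = model.predict("chocolate sugar cookies")
```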

Ingredient Nutrition Matching

We also incorporated nutritional values for individual ingredients using the USDA FoodData Central database. Ingredient-level nutrition information was aggregated per recipe and cross-checked against recipe-level nutrition estimates where available, increasing the reliability of health-related attributes in the dataset.
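
A minimal sketch of the aggregation and cross-check might look like the following. The lookup table holds placeholder numbers, not real USDA FoodData Central values, and the function names, the gram-based quantities, and the 20% tolerance are all assumptions for illustration.

```python
# Hypothetical per-100 g nutrient values; placeholders, NOT real USDA figures.
PER_100G = {
    "flour": {"calories": 350.0, "protein": 10.0},
    "butter": {"calories": 700.0, "protein": 1.0},
}


def aggregate(ingredients, table=PER_100G):
    """Sum nutrients over (name, grams) pairs; unmatched ingredients are skipped."""
    totals = {}
    for name, grams in ingredients:
        per_100g = table.get(name)
        if per_100g is None:
            continue
        for nutrient, value in per_100g.items():
            totals[nutrient] = totals.get(nutrient, 0.0) + value * grams / 100.0
    return totals


def cross_check(aggregated, reported, rel_tol=0.2):
    """Return nutrients whose aggregated value deviates from the reported
    recipe-level value by more than rel_tol (relative difference)."""
    flagged = {}
    for nutrient, reported_value in reported.items():
        if nutrient not in aggregated or reported_value == 0:
            continue
        rel_diff = abs(aggregated[nutrient] - reported_value) / abs(reported_value)
        if rel_diff > rel_tol:
            flagged[nutrient] = rel_diff
    return flagged


# Quantities are assumed to be already converted to grams.
totals = aggregate([("flour", 200), ("butter", 50)])
flagged = cross_check(totals, {"calories": 1000.0, "protein": 40.0})
```

Here the aggregated calories fall within the tolerance of the reported value, while protein is flagged for review.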
