Related Work

Research on computational modeling of culinary data has expanded rapidly in recent years, driven by the emergence of large-scale recipe datasets and the increasing interest in personalized nutrition, food recommendation systems, and generative cooking tools. Much of the early work in this domain focused on rule-based or retrieval-based methods, which relied on keyword matching, user ratings, or ingredient overlaps to suggest recipes. While simple, these approaches often failed to generalize or capture semantic nuances in recipe text.
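To make the retrieval-based baseline concrete, the sketch below ranks candidate recipes by ingredient overlap (Jaccard similarity). It is an illustration of the general idea rather than a reimplementation of any specific prior system; the toy recipes and scoring function are our own assumptions.

```python
# Minimal sketch of an ingredient-overlap (retrieval-based) recommender.
# The toy recipes and the Jaccard scoring are illustrative only.

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two ingredient sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(query_ingredients: set[str], recipes: dict[str, set[str]], k: int = 3):
    """Rank recipes by ingredient overlap with the query ingredients."""
    scored = sorted(
        recipes.items(),
        key=lambda item: jaccard(query_ingredients, item[1]),
        reverse=True,
    )
    return scored[:k]

recipes = {
    "tomato soup": {"tomato", "onion", "garlic", "basil"},
    "pesto pasta": {"pasta", "basil", "garlic", "pine nuts"},
    "guacamole": {"avocado", "onion", "lime", "cilantro"},
}
print(recommend({"garlic", "basil", "tomato"}, recipes))
```

Such overlap-based scoring treats ingredients as interchangeable tokens, which is precisely why it misses the semantic nuances (substitutions, paraphrased instructions) that embedding-based methods aim to capture.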

A significant advancement came with the Recipe1M dataset introduced by Salvador et al. (2017), which enabled the development of joint multimodal embeddings linking recipes and images. Their im2recipe model used a deep neural network trained with a cosine similarity loss to align visual and textual modalities in a shared embedding space. Subsequent works extended this idea, using contrastive learning (e.g., triplet loss) and transformer-based encoders to further enhance performance in cross-modal retrieval tasks.
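The sketch below gives a schematic PyTorch view of this style of cross-modal alignment: a text tower and an image tower project precomputed features into a shared space where matching pairs are pulled together. The layer sizes and the use of a cosine-embedding loss are illustrative assumptions, not the exact im2recipe configuration.

```python
import torch
import torch.nn as nn

# Schematic two-tower model for cross-modal retrieval: recipe-text features
# and image features are projected into a shared embedding space and aligned
# with a cosine-based loss. All dimensions are illustrative placeholders.

class TwoTowerEmbedder(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, embed_dim), nn.Tanh())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, embed_dim), nn.Tanh())

    def forward(self, text_feats, image_feats):
        t = nn.functional.normalize(self.text_proj(text_feats), dim=-1)
        v = nn.functional.normalize(self.image_proj(image_feats), dim=-1)
        return t, v

model = TwoTowerEmbedder()
loss_fn = nn.CosineEmbeddingLoss(margin=0.1)

# Toy batch of precomputed text and image features for matching pairs.
text_feats = torch.randn(8, 768)
image_feats = torch.randn(8, 2048)
t, v = model(text_feats, image_feats)

# Positive pairs (label +1) are pulled together; in practice, mismatched
# recipe-image pairs with label -1 would be mixed into the same batch.
loss = loss_fn(t, v, torch.ones(t.size(0)))
loss.backward()
```

Triplet-loss variants follow the same pattern but score an anchor against one matching and one mismatched embedding per update.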

Other studies explored sequence modeling approaches for ingredient prediction or instruction generation, often leveraging recurrent neural networks or transformer architectures. For example, Kusupati et al. (2020) used transformers to model ingredient co-occurrence, while Wang et al. (2019) proposed hierarchical models to separately embed ingredients and instructions before merging them into a unified representation.
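In the spirit of these hierarchical designs, the sketch below embeds ingredients and instructions with separate encoders and then fuses them into a single recipe vector. The GRU encoders, vocabulary size, and fusion layer are illustrative stand-ins, not the configurations reported in the cited works.

```python
import torch
import torch.nn as nn

# Hedged sketch of a hierarchical recipe encoder: separate encoders for
# ingredients and instructions, merged into one unified representation.
# Sizes and the GRU choice are illustrative assumptions.

class HierarchicalRecipeEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=256, hidden_dim=256, out_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.ingredient_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.instruction_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fuse = nn.Linear(2 * hidden_dim, out_dim)

    def forward(self, ingredient_ids, instruction_ids):
        _, h_ing = self.ingredient_enc(self.embed(ingredient_ids))
        _, h_ins = self.instruction_enc(self.embed(instruction_ids))
        merged = torch.cat([h_ing[-1], h_ins[-1]], dim=-1)
        return self.fuse(merged)  # unified recipe representation

encoder = HierarchicalRecipeEncoder()
ingredients = torch.randint(1, 30000, (4, 20))    # tokenized ingredient lists
instructions = torch.randint(1, 30000, (4, 120))  # tokenized instructions
recipe_vecs = encoder(ingredients, instructions)  # shape: (4, 512)
```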

More recently, research has focused on personalized or health-aware embeddings, where models are fine-tuned based on user preferences, dietary goals, or cultural constraints. These systems often require access to user-specific data or additional metadata, making them harder to generalize without substantial preprocessing or user feedback loops.

In contrast to prior work, we propose a text-only, autoencoder-based architecture that learns holistic recipe embeddings by reconstructing the full recipe from a latent space. This approach avoids reliance on image data or user interactions, yielding a lightweight, interpretable, and general-purpose embedding suitable for multiple downstream tasks. Additionally, we address a major limitation of existing datasets by releasing a new, curated Food.com dataset with cleaned ingredient and instruction fields, enabling more effective model training and evaluation.
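For orientation, the sketch below shows the general shape of such a text-only sequence autoencoder: the full recipe text is encoded into a latent vector, and the token sequence is reconstructed from that vector. The GRU encoder/decoder, dimensions, and training loop are simplified placeholders, not the exact architecture or hyperparameters used in our experiments.

```python
import torch
import torch.nn as nn

# Simplified sketch of a text-only recipe autoencoder: encode the recipe into
# a latent embedding, then reconstruct the token sequence from it. All sizes
# and the GRU encoder/decoder are illustrative placeholders.

class RecipeAutoencoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=256, latent_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, latent_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, vocab_size)

    def encode(self, token_ids):
        _, h = self.encoder(self.embed(token_ids))
        return h[-1]  # latent recipe embedding, shape (batch, latent_dim)

    def forward(self, token_ids):
        z = self.encode(token_ids)
        # Teacher-forced reconstruction, conditioned on the latent state.
        dec_out, _ = self.decoder(self.embed(token_ids), z.unsqueeze(0).contiguous())
        return self.out(dec_out)  # per-position logits over the vocabulary

model = RecipeAutoencoder()
tokens = torch.randint(1, 30000, (4, 200))        # toy batch of tokenized recipes
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)
loss.backward()
recipe_embeddings = model.encode(tokens)          # embeddings for downstream tasks
```

After training, only the encoder is needed at inference time, which is what keeps the resulting embedding pipeline lightweight relative to multimodal or user-conditioned alternatives.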
