MUHAI Recipe Execution Benchmark

A benchmark for natural language understanding

About

This benchmark for recipe understanding in autonomous agents aims to advance the field of natural language understanding by providing a setting in which performance can be measured on the everyday human activity of cooking. Demonstrating deep understanding of such an activity requires both linguistic and extralinguistic skills, including reasoning with domain knowledge. To this end, the benchmark provides a number of recipes written in natural (human) English that are to be converted into procedural semantic networks of cooking operations that autonomous agents can interpret and execute. Also included is a system, supporting one-click installation and execution, that performs recipe execution tasks in simulation, allowing both analysis and evaluation of predicted networks. The evaluation metrics are mostly simulation-based: an agent demonstrates deep understanding of a recipe precisely by taking all the actions required to cook the intended dish.

Download

The full benchmark is available both standalone and as part of the Babel toolkit. Both options provide the same benchmark functionality, but the Babel toolkit additionally offers the option of extending the system.

Components

  • MUHAI Cooking Language

    A new representation language composed of executable predicate-argument structures that encode the meaning of procedural cooking operations, together with the dependencies and temporal ordering between them. Complete recipe executions, including ingredient gathering, can be expressed in this language; an illustrative sketch is given below the component list.

  • Test Set

    A test set of 30 recipes is provided with gold-standard annotations in the MUHAI Cooking Language. These recipes were collected from a variety of sources and contain many of the linguistic and extralinguistic challenges that are inherent to the cooking domain.

  • Metrics

    A suite of simulation-based and non-simulation-based metrics that assess predicted networks from multiple perspectives, with the aim of approximating real-world utility as closely as possible. These consist of:

    • Smatch Score: A commonly used semantic graph comparison score that measures the overlap between two semantic structures
    • Goal-Condition Success: A metric that measures how many of the required goal conditions are reached during recipe execution (a sketch of this metric is given below the component list)
    • Dish Approximation Score: An estimate of the similarity between two prepared food products, i.e. the dish produced by a predicted network and the gold-standard dish
    • Recipe Execution Time: An efficiency measure that tracks how long recipe execution would take using a given solution

  • Kitchen Simulator

    A qualitative simulation engine implemented in Incremental Recruitment Language using the Babel toolkit. It takes as input a file containing a procedural semantic network composed of MUHAI Cooking Language primitives, and returns both execution and evaluation results for further analysis; a sketch of a possible invocation is given below the component list.

  • Extensive Documentation

    Extensive documentation accompanies the benchmark, covering the data sources, the MUHAI Cooking Language and the initial kitchen states of the simulator. In-depth usage examples are also included that demonstrate how to make optimal use of the benchmark and how to interpret its results.
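
To give a flavour of the MUHAI Cooking Language, the fragment below sketches how the opening steps of a recipe could be annotated. It is an illustrative sketch only: the primitive names, their argument orders and the exact variable conventions shown here are assumptions, and the authoritative inventory of primitives is found in the benchmark documentation.

    ;; Illustrative sketch, not the benchmark's authoritative notation.
    ;; Variables (prefixed with ?) thread the evolving kitchen state
    ;; from one operation to the next, encoding both dependencies and
    ;; temporal ordering between the cooking operations.
    (get-kitchen ?kitchen)
    (fetch-and-proportion ?flour ?ks-1 ?kitchen ?target-1 flour 500 g)
    (fetch-and-proportion ?butter ?ks-2 ?ks-1 ?target-2 butter 250 g)
    (mix ?dough ?ks-3 ?ks-2 ?flour ?butter)

Because each operation consumes the kitchen state produced by an earlier one, the network as a whole forms a procedural semantic graph rather than a flat sequence of instructions.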
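
As an example of how the simulation-based metrics can be read, goal-condition success admits a natural formulation as the fraction of annotated goal conditions that are reached at some point during simulated execution. This formulation is a plausible reading rather than the official definition, which is given in the documentation and in the thesis below:

    GCS = (number of goal conditions reached during execution)
          / (number of goal conditions in the gold annotation)

Under this reading, a solution whose execution reaches 8 of the 10 goal conditions annotated for a recipe would score 0.8 on that recipe.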
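
Finally, the call below sketches how the kitchen simulator might be invoked from a Common Lisp REPL once Babel is loaded. The entry point evaluate-solution-file and its keyword arguments are hypothetical placeholders introduced purely for illustration; the actual entry point is described in the benchmark documentation.

    ;; Hypothetical invocation: the function name and its arguments are
    ;; placeholders, not the benchmark's actual API.
    (evaluate-solution-file "my-solution.lisp"         ; predicted networks
                            :gold "gold-standard.lisp" ; gold annotations
                            :metrics '(:smatch
                                       :goal-condition-success
                                       :dish-approximation-score
                                       :execution-time))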

Master's Thesis

This benchmark was developed by Robin De Haes in the context of his master's thesis at VUB, under the guidance of Professor Dr. Paul Van Eecke and Dr. Jens Nevens of the VUB research team EHAI. The thesis contains elaborate information about the benchmark's background, design choices and implementation details.
A Benchmark for Recipe Understanding in Autonomous Agents
Author: Robin De Haes
Promotor: Professor Dr. Paul Van Eecke
Advisor: Dr. Jens Nevens

Citation:
De Haes, R. (2023). A Benchmark for Recipe Understanding in Autonomous Agents (Master's thesis). Vrije Universiteit Brussel, Brussels, Belgium.

Acknowledgements

The initial idea behind the benchmark was developed in the context of the European project MUHAI, which aims to introduce Meaning and Understanding in Human-centric Artificial Intelligence.