NOTE: It is recommended to use Firefox to optimally explore the contents of this page.
I. The CLEVR Grammar
This section provides a complete specification of the CLEVR Grammar. In total, the grammar consists of 170 constructions, 55 of which are morphological or lexical constructions. The remaining 115 constructions together capture the grammatical structures that are used in the dataset, e.g. noun phrases, prepositional phrases and a wide variety of interrogative structures.
The complete construction inventory of the CLEVR Grammar is shown below. Click on any construction to expand it. A search box at the top of the construction inventory lets you look up a construction by name: enter the name, e.g. "cube-morph-cxn", and click 'Search'. The search result is then shown below the construction inventory.
II. Comprehension
In this section, we demonstrate the comprehension process for different questions from the CLEVR dataset. The FCG web interface contains the following parts: the initial transient structure, the construction application process, a list of applied constructions, the resulting transient structure and finally the semantic representation. Note that many of the boxes reveal more information when you click on them. More information on how to use this demo can be found in the web demonstration guide.
Example 1
As a first example, we demonstrate the comprehension process on the example sentence that is used throughout the paper: "What material is the red cube?".
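To make the shape of such a meaning representation concrete, the sketch below writes the meaning of this question as a list of predicates that are linked through shared variables. The predicate names, variable names and the `Predicate` class are assumptions made for illustration, loosely modelled on CLEVR-style functional programs; they are not taken from the grammar itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Predicate:
    name: str                      # operation, e.g. "filter"
    output: str                    # variable this predicate produces
    source: Optional[str] = None   # variable it consumes (None for the first one)
    value: Optional[str] = None    # category or attribute it is bound to, if any


# Hypothetical meaning for "What material is the red cube?"
MEANING = [
    Predicate("get-context", output="?context"),
    Predicate("filter", output="?cubes", source="?context", value="cube"),
    Predicate("filter", output="?red-cubes", source="?cubes", value="red"),
    Predicate("unique", output="?object", source="?red-cubes"),
    Predicate("query", output="?answer", source="?object", value="material"),
]


def is_chain(predicates):
    """Return True if every predicate consumes the output of the one before it."""
    if not predicates or predicates[0].source is not None:
        return False
    return all(nxt.source == prev.output
               for prev, nxt in zip(predicates, predicates[1:]))


print(len(MEANING), is_chain(MEANING))
```

Because each predicate takes the output of the previous one as its input, the representation forms a simple sequence.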
The first example is still rather small, containing only five predicates. In this second example, we take a more complex question: "What number of red cubes are the same size as the blue ball?".
Comprehending "what number of red cubes are the same size as the blue ball"
In this third example, we show that the meaning representation does not necessarily have to be a sequence of predicates. It can also be a tree structure, as demonstrated by the question "Are there an equal number of blue things and green balls?".
Comprehending "are there an equal number of blue things and green balls"
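A tree-structured meaning of this kind can be pictured as two independent counting branches that are joined by a comparison at the root. The sketch below is a minimal illustration in plain Python: the operation names, the nested-tuple encoding and the toy scene are all assumptions, not the representation produced by the grammar.

```python
# Toy scene; the objects and their attribute values are invented for illustration.
SCENE = [
    {"shape": "sphere", "color": "green"},
    {"shape": "sphere", "color": "green"},
    {"shape": "cube", "color": "blue"},
    {"shape": "cylinder", "color": "blue"},
    {"shape": "cube", "color": "red"},
]

# Nested tuples of the form (operation, arguments..., child nodes...).
# The root has two children, which is what makes this a tree rather than a chain.
TREE = (
    "equal",
    ("count", ("filter", "color", "blue", ("scene",))),
    ("count", ("filter", "color", "green",
               ("filter", "shape", "sphere", ("scene",)))),
)


def evaluate(node, scene):
    """Recursively evaluate a node of the tree against the scene."""
    op = node[0]
    if op == "scene":
        return list(scene)
    if op == "filter":
        _, attribute, wanted, child = node
        return [obj for obj in evaluate(child, scene) if obj[attribute] == wanted]
    if op == "count":
        return len(evaluate(node[1], scene))
    if op == "equal":
        return evaluate(node[1], scene) == evaluate(node[2], scene)
    raise ValueError(f"unknown operation: {op}")


print(evaluate(TREE, SCENE))
```

In this toy scene both branches count two objects, so the comparison at the root succeeds.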
There are many different question types in the CLEVR dataset. In this fourth and final example, we demonstrate yet another type of question: "Do the large metal cube left of the red thing and the small cylinder have the same color?".
Comprehending "do the large metal cube left of the red thing and the small cylinder have the same color"
III. Formulation
Fluid Construction Grammar is a bidirectional formalism. It can map utterances to meanings, but also meanings to utterances. This is also true for the CLEVR Grammar. In this section, we demonstrate this formulation process. Notably, because of the design of the CLEVR dataset, multiple questions map to the same semantic representation, e.g. the questions "What material is the red cube?" and "There is a red cube; what is its material?". Using FCG's 'formulate-all' operation, which explores the entire search space instead of stopping at the first solution, we can show all different questions obtained from a single input meaning. We demonstrate this using the semantic representation of the example sentence "What material is the red cube?".
The 'formulate-all' operation for this example returns six possible questions. In theory, there are many more possible solutions because of the large number of synonyms for the different nouns and adjectives. A cube can also be a block, metal can also be shiny and so on. To avoid an enormous search space, the CLEVR Grammar was configured to explore only the grammatical variation and to ignore this lexical variation.
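The effect of lexical variation on the number of solutions can be illustrated with a small sketch. It takes the two grammatical variants quoted above as string templates and expands them with the cube/block synonym pair. The template strings, the `expand` helper and the slot name are assumptions for illustration only; this is not how FCG's formulation process works internally.

```python
from itertools import product

# The two grammatical variants quoted in the text, with a slot for the noun.
TEMPLATES = [
    "What material is the red {shape}?",
    "There is a red {shape}; what is its material?",
]

# Only the cube/block pair is taken from the text; a fuller table would
# add one entry per word that has synonyms.
SYNONYMS = {"shape": ["cube", "block"]}


def expand(templates, synonyms):
    """Combine every template with every combination of synonyms."""
    slots = sorted(synonyms)
    questions = []
    for template in templates:
        for choice in product(*(synonyms[slot] for slot in slots)):
            questions.append(template.format(**dict(zip(slots, choice))))
    return questions


for question in expand(TEMPLATES, SYNONYMS):
    print(question)
```

Every additional word with n synonyms multiplies the number of questions by n, which is why restricting the search to grammatical variation keeps the search space manageable.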
IV. Operational VQA system
In order to have a fully operational VQA system, the semantic representation that is the result of FCG's comprehension operation needs to be executed. In particular, it needs to be executed on a specific scene of objects. For this, we use a procedural semantics framework called Incremental Recruitment Language (IRL). In this section, we demonstrate both the comprehension process of an input question and the subsequent execution process of the resulting semantic representation on a scene of objects. The scene of objects is shown in the image below.
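The idea of executing a semantic representation on a scene can be sketched as follows. This is a plain Python stand-in, not IRL: the step names, the `execute` function and the toy scene are assumptions for illustration, and the scene does not correspond to the image on this page.

```python
# Toy scene; the objects and their attribute values are invented for illustration.
SCENE = [
    {"shape": "cube", "color": "red", "material": "rubber", "size": "large"},
    {"shape": "sphere", "color": "blue", "material": "metal", "size": "small"},
    {"shape": "cube", "color": "green", "material": "metal", "size": "small"},
]

# Hypothetical program for "What material is the red cube?".
PROGRAM = [
    ("filter", "shape", "cube"),
    ("filter", "color", "red"),
    ("unique",),
    ("query", "material"),
]


def execute(program, scene):
    """Run the steps in order, feeding each result into the next step."""
    value = list(scene)  # start from the whole scene as context
    for step in program:
        op = step[0]
        if op == "filter":
            _, attribute, wanted = step
            value = [obj for obj in value if obj[attribute] == wanted]
        elif op == "unique":
            if len(value) != 1:
                raise ValueError(f"expected exactly one object, found {len(value)}")
            value = value[0]
        elif op == "query":
            value = value[step[1]]
        else:
            raise ValueError(f"unknown operation: {op}")
    return value


print(execute(PROGRAM, SCENE))  # prints "rubber" for this toy scene
```

The same program gives a different answer, or fails at the `unique` step, when it is run on a different scene, which is why the execution step always needs a specific scene of objects.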