Computational Construction Grammar for Visual Question Answering

This web demonstration accompanies the paper:

Anonymous Authors (2019). Computational Construction Grammar for Visual Question Answering. Submitted to Linguistics Vanguard.

Explanations on how to use this demo can be found in the web demonstration guide.


This demonstration has the following parts:

I. The CLEVR Grammar

II. Comprehension

III. Formulation

IV. Operational VQA system

NOTE: It is recommended to use Firefox to optimally explore the contents of this page.


I. The CLEVR Grammar

This section provides a complete specification of the CLEVR Grammar. In total, the grammar consists of 170 constructions, 55 of which are morphological or lexical constructions. The remaining 115 constructions collaboratively capture the grammatical structures that are used in the dataset, e.g. noun phrases, prepositional phrases and a wide variety of interrogative structures.

The complete construction inventory of the CLEVR Grammar is shown below. Every construction can be further expanded by clicking on it. To further explore the constructions of the CLEVR Grammar, a search box is given at the top of the construction inventory. To use it, first enter the name of a construction, e.g. "cube-morph-cxn". When clicking 'Search', the search result will be shown below the construction inventory.

II. Comprehension

In this section, we demonstrate the comprehension process for different questions from the CLEVR dataset. The FCG web interface contains the following parts: the initial transient structure, the construction application process, a list of applied constructions, the resulting transient structure and, finally, the semantic representation. Note that many of the boxes reveal more information when you click on them. More information on how to use this demo can be found in the web demonstration guide.

Example 1

As a first example, we demonstrate the comprehension process on the example sentence that is used throughout the paper: "What material is the red cube?".


Comprehending "what material is the red cube"


[Interactive FCG visualization: initial transient structure, construction application process, applied constructions, and resulting transient structure.]

Meaning:

((get-context ?source-490) (filter ?target-1008 ?source-490 ?shape-90) (bind attribute-category ?attribute-369 material) (query ?target-1115 ?target-object-177 ?attribute-369) (bind color-category ?color-114 red) (filter ?target-1007 ?target-1008 ?color-114) (bind shape-category ?shape-90 cube) (unique ?target-object-177 ?target-1007))
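To make the network structure concrete, the predicates above can be written out as plain data; the minimal Python sketch below (predicate and variable names transcribed from the meaning above) shows how shared variables are what link the predicates into a network:

```python
# The meaning of "What material is the red cube?" as a predicate network.
# Arguments starting with '?' are variables; shared variables link predicates.
meaning = [
    ("get-context", "?source-490"),
    ("filter", "?target-1008", "?source-490", "?shape-90"),
    ("bind", "attribute-category", "?attribute-369", "material"),
    ("query", "?target-1115", "?target-object-177", "?attribute-369"),
    ("bind", "color-category", "?color-114", "red"),
    ("filter", "?target-1007", "?target-1008", "?color-114"),
    ("bind", "shape-category", "?shape-90", "cube"),
    ("unique", "?target-object-177", "?target-1007"),
]

def linked(p, q):
    """Two predicates are connected when they share at least one variable."""
    variables = lambda pred: {a for a in pred if a.startswith("?")}
    return bool(variables(p) & variables(q))

# The 'unique' predicate feeds its result into 'query':
print(linked(meaning[7], meaning[3]))  # True: they share ?target-object-177
```

Each edge in the network visualization corresponds to exactly such a shared variable between two predicates.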

Example 2

The first example is still rather small, containing only five functional predicates (excluding the bind statements). In this example, we take a more complex question: "What number of red cubes are the same size as the blue ball?".


Comprehending "what number of red cubes are the same size as the blue ball"


[Interactive FCG visualization: initial transient structure, construction application process, applied constructions, and resulting transient structure.]

Meaning:

((get-context ?source-545) (filter ?target-1127 ?source-545 ?shape-102) (filter ?target-1130 ?target-1129 ?color-122) (filter ?target-1129 ?target-1241 ?shape-100) (bind color-category ?color-122 red) (count! ?target-1158 ?target-1130) (filter ?target-1126 ?target-1127 ?color-124) (unique ?target-object-187 ?target-1126) (bind color-category ?color-124 blue) (bind shape-category ?shape-100 cube) (bind shape-category ?shape-102 sphere) (same ?target-1241 ?target-object-187 ?attribute-407) (bind attribute-category ?attribute-407 size))

Example 3

In this third example, we show that the meaning representation does not necessarily have to form a single chain of predicates. It can also branch into a tree structure, as demonstrated by the question "Are there an equal number of blue things and green balls?".


Comprehending "are there an equal number of blue things and green balls"


[Interactive FCG visualization: initial transient structure, construction application process, applied constructions, and resulting transient structure.]

Meaning:

((bind shape-category ?shape-113 thing) (filter ?target-1265 ?source-613 ?shape-113) (bind shape-category ?shape-112 sphere) (filter ?target-1262 ?source-613 ?shape-112) (filter ?target-1263 ?target-1262 ?color-139) (bind color-category ?color-139 green) (get-context ?source-613) (filter ?target-1266 ?target-1265 ?color-136) (bind color-category ?color-136 blue) (count! ?count-148 ?target-1263) (count! ?count-147 ?target-1266) (equal-integer ?target-1294 ?count-147 ?count-148))
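The branching can be checked mechanically: the final equal-integer predicate takes two inputs that are produced by two independent count! branches. A small Python sketch (predicates transcribed from the meaning above; the convention that a predicate's first variable is its output is an assumption for this illustration) makes this explicit:

```python
# The meaning of "are there an equal number of blue things and green balls",
# transcribed from the network above. Assumption for this sketch: the first
# '?'-variable of each non-bind predicate is its output.
meaning = [
    ("get-context", "?source-613"),
    ("filter", "?target-1265", "?source-613", "?shape-113"),
    ("bind", "shape-category", "?shape-113", "thing"),
    ("filter", "?target-1266", "?target-1265", "?color-136"),
    ("bind", "color-category", "?color-136", "blue"),
    ("filter", "?target-1262", "?source-613", "?shape-112"),
    ("bind", "shape-category", "?shape-112", "sphere"),
    ("filter", "?target-1263", "?target-1262", "?color-139"),
    ("bind", "color-category", "?color-139", "green"),
    ("count!", "?count-147", "?target-1266"),
    ("count!", "?count-148", "?target-1263"),
    ("equal-integer", "?target-1294", "?count-147", "?count-148"),
]

# Map each output variable to the predicate that produces it.
producers = {p[1]: p[0] for p in meaning if p[0] != "bind"}

# The inputs of the final predicate come from two separate branches:
final = meaning[-1]
print([producers[v] for v in final[2:]])  # ['count!', 'count!']
```

Because the two count! predicates sit on disjoint filter chains that only meet at equal-integer, the network is a tree rather than a linear sequence.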

Example 4

There are many different question types in the CLEVR dataset. In this fourth and final example, we demonstrate yet another type of question: "Do the large metal cube left of the red thing and the small cylinder have the same color?".


Comprehending "do the large metal cube left of the red thing and the small cylinder have the same color"


[Interactive FCG visualization: initial transient structure, construction application process, applied constructions, and resulting transient structure.]

Meaning:

((bind spatial-relation-category ?spatial-relation-208 left) (relate ?source-665 ?target-object-242 ?spatial-relation-208) (filter ?target-1391 ?target-1384 ?size-44) (unique ?target-object-241 ?target-1391) (bind attribute-category ?attribute-496 color) (bind size-category ?size-44 large) (filter ?target-1384 ?target-1383 ?material-31) (bind material-category ?material-31 metal) (filter ?target-1383 ?source-665 ?shape-123) (bind shape-category ?shape-123 cube) (bind shape-category ?shape-124 cylinder) (bind shape-category ?shape-122 thing) (filter ?target-1386 ?source-667 ?shape-122) (bind color-category ?color-148 red) (filter ?target-1385 ?source-667 ?shape-124) (bind size-category ?size-45 small) (filter ?target-1393 ?target-1385 ?size-45) (unique ?target-object-234 ?target-1393) (filter ?target-1389 ?target-1386 ?color-148) (unique ?target-object-242 ?target-1389) (get-context ?source-667) (query ?src-299 ?target-object-241 ?attribute-496) (query ?src-300 ?target-object-234 ?attribute-496) (equal? ?target-1419 ?src-299 ?src-300 ?attribute-496))

III. Formulation

Fluid Construction Grammar is a bidirectional formalism: it can map utterances to meanings, but also meanings to utterances. This is also true for the CLEVR Grammar. In this section, we demonstrate this formulation process. Notably, because of the design of the CLEVR dataset, multiple questions map to the same semantic representation, e.g. the questions "What material is the red cube?" and "There is a red cube; what is its material?". Using FCG's 'formulate-all' operation, which explores the entire search space instead of stopping at the first solution, we can show all the different questions obtained from a single input meaning. We demonstrate this using the semantic representation of the example sentence "What material is the red cube?".
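The idea of bidirectionality can be illustrated with a toy sketch (this is not FCG itself; the rule table below is invented for illustration): the same mapping table serves comprehension (form to meaning) and formulation (meaning to all forms), and exploring the whole space in formulation surfaces every paraphrase:

```python
# Toy illustration of a bidirectional mapping (not FCG itself).
# Both utterances below share one meaning, mirroring the CLEVR situation.
rules = {
    "what material is the red cube": "query-material(red-cube)",
    "there is a red cube; what is its material": "query-material(red-cube)",
}

def comprehend(utterance):
    """Map a form to its meaning."""
    return rules[utterance]

def formulate_all(meaning):
    """Explore the entire space: return every form with this meaning."""
    return sorted(u for u, m in rules.items() if m == meaning)

print(formulate_all("query-material(red-cube)"))
```

In FCG the two directions are not a lookup table but the same constructions applied in production or comprehension mode; the sketch only conveys that one meaning can yield several utterances.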


Formulating (all solutions) 

((get-context ?context) (filter ?cube-set ?context ?shape-1) (filter ?red-cube-set ?cube-set ?color-1) (bind shape-category ?shape-1 cube) (unique ?red-cube ?red-cube-set) (bind color-category ?color-1 red) (query ?target ?red-cube ?attribute-1) (bind attribute-category ?attribute-1 material))

[Interactive FCG visualization: initial structure, application process, and six solutions, each with its applied constructions and resulting structure.]

Utterances:

The 'formulate-all' operation for this example returns six possible questions. In theory, there are many more possible solutions because of the large number of synonyms for the different nouns and adjectives: a cube can also be a block, metal can also be shiny, and so on. To avoid an enormous search space, the CLEVR Grammar was configured to explore only the grammatical variation and to ignore this lexical variation.

IV. Operational VQA system

In order to have a fully operational VQA system, the semantic representation resulting from FCG's comprehension operation needs to be executed on a specific scene of objects. For this, we use a procedural semantics framework called Incremental Recruitment Language (IRL). In this section, we demonstrate both the comprehension of an input question and the subsequent execution of the resulting semantic representation on a scene of objects. The scene is shown in the image below.


Comprehending "what material is the red cube"


[Interactive FCG visualization: initial transient structure, construction application process, applied constructions, and resulting transient structure.]

Meaning:

((get-context ?source-877) (filter ?target-1781 ?source-877 ?shape-160) (bind attribute-category ?attribute-648 material) (query ?target-1846 ?target-object-307 ?attribute-648) (bind color-category ?color-204 red) (filter ?target-1780 ?target-1781 ?color-204) (bind shape-category ?shape-160 cube) (unique ?target-object-307 ?target-1780))


Evaluating the IRL program

IRL program:
((get-context ?source-877) (bind attribute-category ?attribute-648 material) (bind color-category ?color-204 red) (filter ?target-1781 ?source-877 ?shape-160) (bind shape-category ?shape-160 cube) (unique ?target-object-307 ?target-1780) (filter ?target-1780 ?target-1781 ?color-204) (query ?target-1846 ?target-object-307 ?attribute-648))
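To convey what executing such a program involves, the sketch below hand-rolls the same primitives in Python. This is not IRL itself: the primitive signatures are simplified (the bound categories appear as plain arguments), and the two-object scene is invented for illustration.

```python
# Minimal sketch of procedural-semantics execution (not IRL itself):
# each primitive consumes and produces values, mirroring the program above.
# The toy scene below is invented for illustration.
scene = [
    {"size": "small", "color": "red", "material": "metal", "shape": "cube"},
    {"size": "large", "color": "blue", "material": "rubber", "shape": "sphere"},
]

def get_context():
    """Return the full set of objects in the scene."""
    return scene

def filter_set(objects, attribute, value):
    """Keep only the objects whose attribute matches the bound value."""
    return [o for o in objects if o[attribute] == value]

def unique(objects):
    """Succeed only when the set contains exactly one object."""
    assert len(objects) == 1, "filter did not yield a single object"
    return objects[0]

def query(obj, attribute):
    """Read off the queried attribute of the object."""
    return obj[attribute]

# Executing "What material is the red cube?" step by step:
cubes = filter_set(get_context(), "shape", "cube")
red_cubes = filter_set(cubes, "color", "red")
answer = query(unique(red_cubes), "material")
print(answer)  # metal
```

In IRL proper, the primitives are chained through the shared variables of the network and evaluated by constraint propagation rather than called in a fixed order, but the data flow is the same.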

solution 1
?target-1846 = METAL (material-category, score 1.000)
?target-object-307 = small red metal cube (clevr-object, score 1.000)
?target-1780 = {small red metal cube} (clevr-object-set, score 1.000)
?target-1781 = {large yellow metal cube, large yellow metal cube, small brown metal cube, small red metal cube} (clevr-object-set, score 1.000)
?source-877 = {large yellow metal cube, large purple rubber cylinder, large yellow metal cube, large cyan rubber cylinder, large red metal sphere, small brown metal cube, small red metal cube, small red metal cylinder, small blue rubber sphere, small green rubber sphere} (clevr-object-set, score 1.000)
?attribute-648 = MATERIAL (attribute-category, score 1.000)
?color-204 = RED (color-category, score 1.000)
?shape-160 = CUBE (shape-category, score 1.000)