A Brief Summary of Gated-Attention Architectures for Task-Oriented Language Grounding
A link to the original paper
What is it about
- The paper tackles task-oriented language grounding: an agent in a 3D Doom-based environment receives a natural-language instruction and raw pixel observations, and must navigate to the object the instruction describes. The key contribution is a Gated-Attention mechanism for fusing the instruction with the visual input.
Approach
- State Processing Module: creates a joint representation of the instruction and the image observed by the agent. Two types of joint representation are considered: concatenation (used by previous papers) and Gated-Attention multimodal fusion (the method preferred in this paper, based on multiplicative interactions between the two modalities).
    - Takes the current state s_t = {I_t, L} as input, where I_t is the image observation and L is the instruction.
    - Consists of a CNN to process the image, x_I = f(I_t; θ_conv) ∈ R^(d×H×W), where d is the number of feature maps, H and W are the height and width of each feature map, and θ_conv are the parameters of the CNN.
    - A GRU network processes the instruction, x_L = f(L; θ_gru), where θ_gru are the parameters of the GRU.
    - A multimodal fusion unit M(x_I, x_L) combines the image and instruction representations.
    - Fusion through Gated-Attention (a code sketch follows this list)
        - Intuition: the CNN detects features of the objects in the image, and the instruction-derived attention vector a_L gates the feature maps that are relevant to the given instruction.
        - x_L is passed through a fully-connected linear layer with sigmoid activation. The output dimension of this layer equals d, the number of feature maps in the CNN output.
        - The output a_L = h(x_L) ∈ R^d is called the attention vector.
        - Each element of a_L is expanded over an H×W plane, giving a 3-dimensional tensor M(a_L) ∈ R^(d×H×W) such that M(a_L)[i, j, k] = a_L[i].
        - This tensor is multiplied element-wise with the output of the CNN: M_GA(x_I, x_L) = M(h(x_L)) ⊙ x_I = M(a_L) ⊙ x_I.
    - Fusion through concatenation
        - The two representations are flattened and concatenated: M_concat(x_I, x_L) = [vec(x_I); vec(x_L)], where vec(·) denotes flattening.
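As a concrete illustration of the two fusion options, here is a minimal PyTorch sketch, assuming the CNN output x_I has shape (batch, d, H, W) and the GRU instruction embedding x_L has shape (batch, instr_dim). The module names and sizes are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    """Gated-Attention fusion: gate the CNN feature maps with an
    instruction-conditioned attention vector a_L = sigmoid(W x_L)."""

    def __init__(self, instr_dim: int, num_feature_maps: int):
        super().__init__()
        # Linear layer maps the instruction embedding to d gate values.
        self.attention = nn.Linear(instr_dim, num_feature_maps)

    def forward(self, x_I: torch.Tensor, x_L: torch.Tensor) -> torch.Tensor:
        # x_I: (batch, d, H, W) CNN output; x_L: (batch, instr_dim) GRU output.
        a_L = torch.sigmoid(self.attention(x_L))   # (batch, d) attention vector
        M_aL = a_L.unsqueeze(2).unsqueeze(3)        # (batch, d, 1, 1)
        M_aL = M_aL.expand_as(x_I)                  # expand each gate over H x W
        return M_aL * x_I                           # element-wise gating

def concat_fusion(x_I: torch.Tensor, x_L: torch.Tensor) -> torch.Tensor:
    """Baseline fusion: flatten both representations and concatenate them."""
    return torch.cat([x_I.flatten(start_dim=1), x_L], dim=1)

if __name__ == "__main__":
    # Illustrative tensor sizes only (not the paper's hyperparameters).
    x_I = torch.randn(2, 64, 8, 17)   # d=64 feature maps of size 8x17
    x_L = torch.randn(2, 256)         # instruction embedding
    fused = GatedAttention(instr_dim=256, num_feature_maps=64)(x_I, x_L)
    print(fused.shape)                # torch.Size([2, 64, 8, 17])
```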
- Policy Module: learns a policy for executing the instructions.
    - The output of the multimodal fusion unit (concatenation or Gated-Attention), i.e. the combined representation of the visual input and the instruction, is fed as input to the policy module.
    - Via imitation learning, using an oracle in the Doom environment:
        - Contains a fully-connected layer to estimate the policy function.
        - An oracle provides the exact action to perform at each step.
        - The oracle re-orients the agent with left/right turns whenever the angle to the target object exceeds the environment's minimum turn angle, and moves it forward otherwise.
    - Via reinforcement learning, with positive and negative rewards given according to the agent's actions:
        - Uses the A3C algorithm: a deep neural network learns both the policy and the value function, and multiple worker threads asynchronously update the shared parameters. The network consists of an LSTM layer followed by fully-connected layers (a code sketch follows this list).
        - Uses entropy regularization for improved exploration of the environment.
        - Uses the Generalized Advantage Estimator (GAE) to reduce the variance of the policy-gradient estimates.
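Below is a minimal sketch of the RL ingredients named above (GAE advantages plus an actor-critic loss with entropy regularization), assuming per-step rewards, log-probabilities, entropies, and value estimates from a single rollout. The function names, γ/λ values, and loss weights are illustrative and not taken from the paper.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.
    rewards: list of T floats; values: tensor of T+1 state values
    (the last entry is the bootstrap value of the final state)."""
    advantages = torch.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def a3c_loss(log_probs, entropies, values, rewards,
             value_coef=0.5, entropy_coef=0.01):
    """Actor-critic loss with entropy regularization for one rollout.
    log_probs/entropies: length-T tensors from the policy head;
    values: length-(T+1) tensor from the value head."""
    adv = gae_advantages(rewards, values.detach())
    returns = adv + values[:-1].detach()          # targets for the critic
    policy_loss = -(log_probs * adv).sum()        # policy-gradient term
    value_loss = (returns - values[:-1]).pow(2).sum()
    entropy_bonus = entropies.sum()               # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```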
Environment
- The authors create an environment (based on the Doom game) in which the agent can execute natural-language instructions and receives a positive reward on successful completion of the task.
- An instruction is a combination of an action, attributes, and an object.
- An instruction can contain multiple attributes, but is limited to one action and one object.
- Attributes include color, shape, and size.
- There are 70 manually generated instructions; for each instruction, the environment can automatically create multiple episodes with randomly selected objects (one correct object and four incorrect ones, i.e. five objects per episode) placed at random locations (a code sketch follows at the end of this section).
- Challenges
    - The same instruction can refer to different objects in different episodes, e.g. 'Go to red card'.
    - Objects might occlude each other.
    - Objects may not be in the agent's field of view.
    - The map can be complicated, requiring exploration before the agent can make the correct decision.
- Difficulty levels, based on how objects are spawned in an episode:
    - Easy: objects are spawned at fixed locations along a single line in the agent's field of view.
    - Medium: objects are spawned at random locations, but within the agent's field of view; the agent spawns at a fixed location.
    - Hard: both the objects and the agent spawn at random locations, and objects may or may not be in the initial field of view; the agent needs to explore the map.
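To make the instruction/episode structure concrete, here is a toy Python sketch that assembles an episode spec (instruction = action + attribute + object, one correct object plus four distractors). The vocabularies, class names, and sampling logic are invented for illustration and are not the paper's environment code.

```python
import random
from dataclasses import dataclass

# Illustrative vocabularies; the paper's environment defines its own sets.
ACTIONS = ["Go to"]
COLORS = ["red", "green", "blue", "yellow"]
OBJECTS = ["card", "torch", "pillar", "armor"]

@dataclass
class EpisodeSpec:
    instruction: str
    target: tuple        # (color, object) the agent must reach
    distractors: list    # the 4 incorrect objects
    difficulty: str      # "easy" | "medium" | "hard"

def make_episode(difficulty: str = "hard") -> EpisodeSpec:
    # One correct object, described by the instruction...
    color, obj = random.choice(COLORS), random.choice(OBJECTS)
    instruction = f"{random.choice(ACTIONS)} {color} {obj}"
    # ...plus four distractors that do not match the description.
    distractors = []
    while len(distractors) < 4:
        d = (random.choice(COLORS), random.choice(OBJECTS))
        if d != (color, obj):
            distractors.append(d)
    return EpisodeSpec(instruction, (color, obj), distractors, difficulty)

if __name__ == "__main__":
    ep = make_episode("medium")
    print(ep.instruction, ep.distractors)
```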
Experimental Setup
- Experiments are performed in all three difficulty modes
- Objects are restricted to 5 per episode (1 correct and 4 incorrect).
- During training, episodes are generated from a set of 55 instructions; the remaining 15 are held out for zero-shot evaluation (never seen during training).
- An episode ends when the agent reaches any object or when the episode time limit of 30 elapses.
- The evaluation metric is accuracy: reaching the correct object before the time limit elapses (a code sketch follows at the end of this section).
- Two scenarios for evaluation
- Multitask-generalization
    - To make sure the model is not overfitting to the training set and can follow the training instructions on unseen maps
- Unseen maps but training set instructions
- Zero-shot Evaluation
    - To test whether the model can generalize to entirely new conditions
- Both instructions and maps are unseen in this evaluation
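A small sketch of the evaluation protocol described above (55/15 instruction split; accuracy = fraction of episodes in which the correct object is reached in time). run_episode is a hypothetical stub standing in for the real agent/environment rollout.

```python
import random

def run_episode(agent, instruction, max_steps=30):
    """Stub standing in for the real agent/environment rollout.
    Returns True if the agent reached the correct object in time."""
    return random.random() < 0.5   # placeholder outcome

def evaluate(agent, instructions, episodes_per_instruction=10):
    """Accuracy = fraction of episodes in which the agent reaches
    the correct object before the episode time limit."""
    results = [run_episode(agent, instr)
               for instr in instructions
               for _ in range(episodes_per_instruction)]
    return sum(results) / len(results)

# 70 instructions: 55 for training, 15 held out for zero-shot evaluation.
all_instructions = [f"instruction_{i}" for i in range(70)]   # placeholder strings
random.shuffle(all_instructions)
train_instructions = all_instructions[:55]
zero_shot_instructions = all_instructions[55:]

# Multitask generalization: training-set instructions on unseen maps.
print("multitask accuracy:", evaluate(agent=None, instructions=train_instructions))
# Zero-shot generalization: held-out instructions on unseen maps.
print("zero-shot accuracy:", evaluate(agent=None, instructions=zero_shot_instructions))
```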
Hyperparameters
Conclusion
- Models using Gated-Attention fusion (A3C for reinforcement learning, Behavioral Cloning/DAgger for imitation learning) outperformed their concatenation counterparts in both multitask and zero-shot generalization, across all difficulty modes.