Attention Models
Olof Mogren
Chalmers University of Technology
Feb 2016

Attention Models
• Focus on parts of input
• Improves NN performance on different tasks
• IBM1 attention mechanism (1980s)
• “One of the most exciting advancements”
  - Ilya Sutskever, Dec 2015

Attention Models
Arxiv 2016
• Multi-Way, Multilingual Neural Machine Translation with a Shared ...
• Incorporating Structural Alignment Biases into an Attentional Neural ...
• Language to Logical Form with Neural Attention
• Human Attention Estimation for Natural Images: An Automatic Gaze ...
• Implicit Distortion and Fertility Models for Attention-based ...
• Survey on the attention based RNN model and its applications in ...
• From Softmax to Sparsemax: A Sparse Model of Attention and ...
• A Convolutional Attention Network for Extreme Summarization ...
• Learning Efficient Algorithms with Hierarchical Attentive Memory
• Attentive Pooling Networks
• Attention-Based Convolutional Neural Network for Machine ...
Modelling Language using RNNs
• Recurrent Neural Networks
• Gated additive sequence modelling:
  LSTM (and variants) details
• Language models: P(word_i | word_1, ..., word_{i-1})

Encoder-Decoder Framework
[Figure: encoder reads x_1, x_2, x_3 into a single vector; decoder emits y_1, y_2, y_3]
• Sequence to Sequence Learning with Neural Networks
  Ilya Sutskever, Oriol Vinyals, Quoc V. Le, NIPS 2014
• Neural Machine Translation (NMT)
• Fixed vector representation for sequences (sketch below)
• Reversed input sentence!
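A minimal sketch of the fixed-vector bottleneck, using a toy vanilla RNN rather than the LSTM used in practice; the parameter names and dimensions are illustrative. The encoder squeezes the whole input sequence into one state vector, and the decoder is conditioned only on that vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_y, T = 4, 8, 4, 5

# Illustrative parameters of a toy vanilla-RNN encoder-decoder (no attention).
W_xh = rng.normal(scale=0.1, size=(d_h, d_x))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_hy = rng.normal(scale=0.1, size=(d_y, d_h))

def encode(xs):
    """Compress the whole input sequence into ONE fixed-size vector."""
    h = np.zeros(d_h)
    for x in reversed(xs):               # reversed input order, as in Sutskever et al. 2014
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

def decode(h, steps):
    """Generate outputs conditioned only on that single vector."""
    ys = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)            # toy state update; real decoders also feed back y_{t-1}
        ys.append(W_hy @ h)
    return ys

xs = [rng.normal(size=d_x) for _ in range(T)]
c = encode(xs)
print(c.shape, len(decode(c, 3)))        # (8,): the whole sentence lives in 8 numbers
```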
NMT with Attention

p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
s_i = f(s_{i-1}, y_{i-1}, c_i)
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
e_{ij} = a(s_{i-1}, h_j)

[Figure: at step t the decoder state s_t (computed from s_{t-1} and y_{t-1}) attends over the encoder annotations h_1, ..., h_T of the inputs x_1, ..., x_T; the weights α_{t,1}, ..., α_{t,T} combine the annotations into the context vector as a weighted sum.]

Neural Machine Translation by Jointly Learning to Align and Translate
Bahdanau, Cho, Bengio, ICLR 2015
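Reading the formulas above directly as code: a minimal NumPy sketch of one attention step. The alignment model a(·, ·) is written in the usual additive (tanh) form; the weight names W_s, W_h, v and all dimensions are illustrative toy values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """One attention step: e_ij = v^T tanh(W_s s_{i-1} + W_h h_j),
    alpha = softmax(e), c_i = sum_j alpha_ij h_j."""
    e = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # scores over the T_x positions
    alpha = softmax(e)                            # attention weights, sum to 1
    c = alpha @ H                                 # context vector: weighted sum of annotations
    return c, alpha

# toy dimensions and random parameters
rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 5, 8, 6, 4
H = rng.normal(size=(T_x, d_h))                   # encoder annotations h_1..h_{T_x}
s_prev = rng.normal(size=d_s)                     # previous decoder state s_{i-1}
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)
c, alpha = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.sum(), c.shape)                       # -> 1.0, (8,)
```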
Alignment - (more)
[Figure: soft alignments between source and target words. (a) “Il convient de noter que l' environnement marin est le moins connu de l' environnement .” ↔ “It should be noted that the marine environment is the least known of environments .”; (b) “L' accord sur la zone économique européenne a été signé en août 1992 .” ↔ “The agreement on the European Economic Area was signed in August 1992 .”]
Caption Generation
• Convolutional network:
  Oxford net,
  19 layers,
  stacks of 3x3 conv-layers,
  max-pooling.
• Annotation vectors: a = {a_1, ..., a_L}, a_i ∈ R^D
• Attention over a (sketch below).
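A hedged sketch of what “attention over a” means here: the same additive scoring as in the NMT slides, applied to the L spatial annotation vectors produced by the convolutional network, conditioned on the decoder state. The grid size, feature dimension and weight names below are illustrative.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attend_over_annotations(a, h_dec, W_a, W_h, v):
    """Soft attention over L annotation vectors a_i in R^D given the decoder state:
    the 'positions' being attended are image regions rather than source words."""
    e = np.tanh(a @ W_a.T + h_dec @ W_h.T) @ v    # (L,) relevance of each region
    alpha = softmax(e)                            # where to look
    z = alpha @ a                                 # (D,) expected annotation vector (context)
    return z, alpha

rng = np.random.default_rng(0)
L, D, d_h, d_a = 14 * 14, 512, 256, 128           # e.g. a 14x14 grid of 512-d conv features
a = rng.normal(size=(L, D))
h_dec = rng.normal(size=d_h)
W_a = rng.normal(size=(d_a, D))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)
z, alpha = attend_over_annotations(a, h_dec, W_a, W_h, v)
print(z.shape, alpha.argmax())                    # context vector and the most attended region
```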
Caption Generation
• “Translating” from images to natural language

Attention Visualization
Source Code Summarization
• Predict function names
  given function body
• Convolutional attention
  mechanism; 1D patterns
• Out of vocabulary terms
  handled (copy mechanism)
details

Source Code Summarization
[Figure: attention vectors α and κ over the body subtokens
<s> { return ( mFlags & eBulletFlag ) == eBulletFlag ; } </s>
while predicting the target name subtokens m1 “is”, m2 “bullet”, m3 END (probabilities 0.436, 0.174, 0.012), together with a copy gate λ.]

A Convolutional Attention Network for Extreme Summarization of Source Code
Allamanis et al. Feb 2016 (arxiv draft)
Memory Networks
• Attention refers back to internal memory; state of encoder
• Neural Turing Machines
• (End-To-End) Memory Networks:
  explicit memory mechanisms
  (out of scope today)

mogren@chalmers.se
http://mogren.one/
http://www.cse.chalmers.se/research/lab/
Appendix
Teaching Machines to Read and Comprehend, Dec 2015
Hermann, Kocisky, Grefenstette, Espeholt, Kay, Suleyman, Blunsom
Alignment - (back)
[Figure: further alignment examples. (c) “La destruction de l' équipement signifie que la Syrie ne peut plus produire de nouvelles armes chimiques .” ↔ “Destruction of the equipment means that Syria can no longer produce new chemical weapons .”; (d) “" Cela va changer mon avenir avec ma famille " , a dit l' homme .” ↔ “" This will change my ...”]

Draw
DRAW, A Recurrent Neural Network For Image Generation - 2015
Gregor, Danihelka, Graves, Rezende, Wierstra
LSTM
[Figure: LSTM diagram, Christopher Olah]
back

Source Code Summarization
• K_l1: patterns in input
• K_l2 (and K_α, K_κ): higher level abstractions
• α, κ: attention over input subtokens
• Simple version: only K_α, for decoding
• Complete version: uses K_λ for deciding on
  generation or copying (sketch below)

A Convolutional Attention Network for Extreme Summarization of Source Code
Allamanis et al. Feb 2016 (arxiv draft)
back
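A hedged illustration of the generate-or-copy decision the K_λ kernels feed into: a toy mixture that interpolates a vocabulary distribution with the copy attention over the body subtokens. All token names and numbers below are made up.

```python
def mix_copy_generate(p_vocab, alpha_copy, body_subtokens, lam):
    """Final distribution over the next name subtoken:
    lam weighs copying a subtoken straight from the body against
    generating one from the vocabulary."""
    p = {w: (1.0 - lam) * pw for w, pw in p_vocab.items()}
    for tok, a in zip(body_subtokens, alpha_copy):   # scatter the copy probability mass
        p[tok] = p.get(tok, 0.0) + lam * a
    return p

# made-up numbers: vocabulary distribution, copy attention over the body, gate lam
p_vocab = {"is": 0.30, "get": 0.25, "set": 0.20, "UNK": 0.25}
body = ["m", "flags", "e", "bullet", "flag"]
alpha_copy = [0.05, 0.10, 0.05, 0.70, 0.10]          # sums to 1 over the body subtokens
lam = 0.6                                            # gate leans towards copying here
p = mix_copy_generate(p_vocab, alpha_copy, body, lam)
print(max(p, key=p.get), round(sum(p.values()), 6))  # -> bullet 1.0
```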
IBM Model 1: The first translation attention model!
A simple generative model for p(s|t) is derived by introducing a
latent variable a into the conditional probability:

p(s|t) = \sum_a \frac{p(J|I)}{(I+1)^J} \prod_{j=1}^{J} p(s_j \mid t_{a_j}),

where:
• s and t are the input (source) and output (target) sentences
  of length J and I respectively,
• a is a vector of length J consisting of integer indexes into the
  target sentence, known as the alignment,
• p(J|I) is not important for training the model and we'll treat
  it as a constant.
To learn this model we use the EM algorithm to find the MLE
values for the parameters p(s_j | t_{a_j}).
back
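A minimal sketch of the likelihood on the slide above: because the alignment a factorises over source positions, summing it out turns the model into a product of per-word sums, the same "soft lookup" flavour that the attention models rediscover. The translation table below is hypothetical, and p(J|I) is treated as a constant as stated above.

```python
from collections import defaultdict

def ibm1_likelihood(s, t, trans_prob, p_len=1.0):
    """p(s|t) under IBM Model 1. Summing the alignment a out factorises
    per source position:
        p(s|t) = p(J|I) / (I+1)^J * prod_j sum_{i=0..I} p(s_j | t_i)
    with t_0 = NULL; trans_prob[(s_word, t_word)] holds p(s_word | t_word)."""
    t_ext = ["<NULL>"] + t                    # index 0 is the empty (NULL) target word
    J, I = len(s), len(t)
    prob = p_len / (I + 1) ** J               # p(J|I) treated as a constant
    for s_word in s:
        prob *= sum(trans_prob[(s_word, t_word)] for t_word in t_ext)
    return prob

# hypothetical translation table, just for illustration
trans_prob = defaultdict(float, {
    ("la", "the"): 0.7, ("maison", "house"): 0.8,
    ("la", "house"): 0.05, ("maison", "the"): 0.1,
})
print(ibm1_likelihood(["la", "maison"], ["the", "house"], trans_prob))
```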
Soft vs Hard Attention

Soft
• Weighted average of whole input
• Differentiable loss
• Increased computational cost

Hard
• Sample parts of input
• Policy gradient
• Variational methods
• Reinforcement Learning
• Decreased computational cost

(contrast sketched below)
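A toy contrast under made-up scores e and inputs H: soft attention returns a differentiable weighted average of everything, while hard attention samples one position, which is why it needs policy-gradient or variational training.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_attention(e, H):
    """Soft: differentiable weighted average over ALL inputs."""
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    return alpha @ H                 # every h_j contributes

def hard_attention(e, H):
    """Hard: sample ONE position; gradients need policy-gradient / variational methods."""
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    j = rng.choice(len(e), p=alpha)
    return H[j]                      # only one h_j is read

e = rng.normal(size=6)               # attention scores over 6 inputs
H = rng.normal(size=(6, 4))          # the inputs themselves
print(soft_attention(e, H))
print(hard_attention(e, H))
```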