\(s_t\) is the target hidden state and \(h_i\) are the source hidden states; each has shape `(n,1)`. \(c_t\) is the final context vector, and \(\alpha_{t,i}\) is the alignment weight (the normalized alignment score) that \(s_t\) assigns to the \(i\)-th of the \(m\) source hidden states.

\[\begin{aligned}
c_t&=\sum_{i=1}^m \alpha_{t,i}h_i \\
\alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^m \exp(score(s_t,h_j))}
\end{aligned}
\]
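The two equations above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper names (`softmax`, `dot`, `attend`) and the toy values are made up for the example, and the dot product stands in for whichever score function is chosen.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(s_t, hs, score=dot):
    """Return (weights, context) for state s_t over the hidden states hs."""
    # One score per hidden state h_i, normalized into weights that sum to 1.
    weights = softmax([score(s_t, h) for h in hs])
    # The context vector is the weighted sum of the hidden states.
    dim = len(hs[0])
    context = [sum(w * h[k] for w, h in zip(weights, hs)) for k in range(dim)]
    return weights, context

# Toy example: three hidden states of dimension 2 (values are made up).
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
weights, context = attend(s_t, hs)
```

Swapping the `score` argument is all it takes to try the other score functions discussed in this post.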

## Global (Soft) vs. Local (Hard)

Global attention takes all source hidden states into account, whereas local attention uses only a subset of them.
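As a rough sketch of the difference, the snippet below normalizes scores over a window of positions only. The function name `local_attention_weights`, the `center` and `half_width` parameters, and the score values are assumptions made for illustration; how the window is actually chosen depends on the model.

```python
import math

def local_attention_weights(scores, center, half_width):
    """Softmax restricted to positions in [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(scores) - 1, center + half_width)
    window = scores[lo:hi + 1]
    largest = max(window)
    exps = [math.exp(x - largest) for x in window]
    total = sum(exps)
    # Positions outside the window get zero weight.
    weights = [0.0] * len(scores)
    for offset, e in enumerate(exps):
        weights[lo + offset] = e / total
    return weights

scores = [0.5, 2.0, 1.0, 0.1, 3.0]  # made-up scores for five source positions

# A window wide enough to cover every position behaves like global attention.
global_w = local_attention_weights(scores, center=2, half_width=len(scores))
# A narrow window only attends to positions 0..2.
local_w = local_attention_weights(scores, center=1, half_width=1)
```

With the narrow window, positions 3 and 4 receive zero weight even though position 4 has the highest raw score.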

## Content-Based vs. Location-Based

Content-based attention scores each position using both the source and the target hidden states, whereas location-based attention uses only the target hidden state \(s_t\).

Here are several popular attention mechanisms:

#### Dot-Product

\[score(s_t,h_i)=s_t^Th_i\]

#### Scaled Dot-Product

\[score(s_t,h_i)=\frac{s_t^Th_i}{\sqrt{n}}\] where \(n\) is the dimension of the vectors. Google’s Transformer model applies a similar scaling factor when computing self-attention: \(score=\frac{QK^T}{\sqrt{n}}\), where \(n\) is the dimension of the keys.
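A small sketch of the two dot-product variants, with made-up vectors; the function names `dot_score` and `scaled_dot_score` are chosen for this example only.

```python
import math

def dot_score(s_t, h_i):
    """Plain dot-product score s_t^T h_i."""
    return sum(a * b for a, b in zip(s_t, h_i))

def scaled_dot_score(s_t, h_i):
    """Dot-product score divided by sqrt(n), where n is the vector dimension."""
    return dot_score(s_t, h_i) / math.sqrt(len(s_t))

s_t = [1.0, 2.0, 3.0, 4.0]
h_i = [0.5, 0.5, 0.5, 0.5]

raw = dot_score(s_t, h_i)            # 5.0
scaled = scaled_dot_score(s_t, h_i)  # 5.0 / sqrt(4) = 2.5
```

The scaling keeps scores from growing with the dimension, which would otherwise push the softmax towards very peaked weights.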

#### Location-Based

\[\alpha_t=softmax(W_as_t)\]

Here the softmax yields the alignment weights for all source positions directly, so no \(h_i\) appears.
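A minimal sketch of the location-based form, which computes the weights from \(s_t\) alone without looking at any \(h_i\). It assumes \(W_a\) has one row per source position (three positions here); the helper names and numbers are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def location_weights(W_a, s_t):
    """Attention weights computed from s_t only: softmax(W_a s_t)."""
    return softmax(matvec(W_a, s_t))

# Assumed shapes: W_a is (3, 2) for three source positions and n = 2.
W_a = [[0.1, 0.2],
       [0.0, -0.3],
       [0.5, 0.5]]
s_t = [1.0, 2.0]
alpha_t = location_weights(W_a, s_t)
```

Because the number of rows in `W_a` fixes the number of positions, this variant ties the weights to position rather than to content.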

#### General

\[score(s_t,h_i)=s_t^TW_ah_i\]

\(W_a\)’s shape is `(n,n)`.
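The general score is a bilinear form, sketched below with made-up values. With the identity matrix as \(W_a\) it reduces to the plain dot product; the helper names are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def general_score(s_t, W_a, h_i):
    """Bilinear score s_t^T W_a h_i, with W_a of shape (n, n)."""
    return dot(s_t, matvec(W_a, h_i))

s_t = [1.0, 2.0]
h_i = [3.0, 4.0]

identity = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[2.0, 0.0], [0.0, 0.5]]  # made-up weights; learned in a real model

same_as_dot = general_score(s_t, identity, h_i)  # 1*3 + 2*4 = 11.0
weighted = general_score(s_t, W_a, h_i)          # 1*6 + 2*2 = 10.0
```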

#### Concat

\[score(s_t,h_i)=v_a^Ttanh(W_a[s_t,h_i])\]

\(v_a\)’s shape is `(x,1)`, and \(W_a\)’s shape is `(x,2n)`, since the concatenation \([s_t,h_i]\) has shape `(2n,1)`. This is similar to a neural network with one hidden layer of size \(x\).
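The concat score can be sketched as a tiny one-hidden-layer network. The weights below are made up, and the hidden size is chosen equal to the concatenated length (4) purely to keep the example small.

```python
import math

def concat_score(s_t, h_i, W_a, v_a):
    """Score v_a^T tanh(W_a [s_t, h_i])."""
    z = s_t + h_i  # list concatenation: [s_t, h_i]
    # Hidden layer: one tanh unit per row of W_a.
    hidden = [math.tanh(sum(w * x for w, x in zip(row, z))) for row in W_a]
    # Output layer: project the hidden units to a single scalar score.
    return sum(v * a for v, a in zip(v_a, hidden))

s_t = [1.0, 0.0]
h_i = [0.0, 1.0]

# Made-up weights; in a real model W_a and v_a are learned.
W_a = [[1.0, 0.0, 0.0, 0.0],
       [0.0, 0.0, 0.0, 1.0],
       [0.0, 0.0, 0.0, 0.0],
       [0.0, 1.0, 0.0, 0.0]]
v_a = [1.0, 1.0, 1.0, 1.0]

score = concat_score(s_t, h_i, W_a, v_a)  # tanh(1) + tanh(1) + tanh(0) + tanh(0)
```

Unlike the dot-product variants, this form has a nonlinearity between the two projections.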

While working on a slot-filling project, I compared these mechanisms. **Concat** attention produced the best results.

Ref: