[Paper Review] The Attention Mechanism, NLP


 

 

[Figure: paper title]

 

 

The paper reviewed today is 'NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE', published in 2014. It drew a great deal of attention by proposing the Attention mechanism, an extension of the existing Seq2Seq approach. The Attention mechanism is essential knowledge for anyone working in NLP.

 

 

[Source url: https://arxiv.org/abs/1409.0473 , Cornell University] 

[Github url: https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb]

[์ž์„ธํ•œ Attention Mechanism ์„ค๋ช… : https://wikidocs.net/22893]

 

 

 

 

1. Summary

 

๊ธฐ์กด์˜ Seq2Seq ๋ฐฉ์‹ ํ›ˆ๋ จ์€ NLP ์„ฑ๋Šฅ์„ ํ–ฅ์ƒํ–ˆ๋‹ค. ํ•˜์ง€๋งŒ Context vector์— ๋ฌธ์žฅ์˜ ๋ชจ๋“  ์ •๋ณด๋ฅผ ์••์ถ•์‹œํ‚ค๋Š” ๋ฐฉ์‹์€, ์••์ถ•์‹œํ‚ค๋Š” ์ •๋ณด๊ฐ€ ๊ธธ์–ด์ง€๋ฉด ๊ธธ์–ด์งˆ์ˆ˜๋ก ๋ณ‘๋ชฉ ํ˜„์ƒ์„ ๋‚ณ์•˜๋‹ค. ์ด์— ๋”ฐ๋ผ ๋ชจ๋ธ์ด ์Šค์Šค๋กœ(soft alignment) ์—ฐ๊ด€์„ฑ ์žˆ๋Š” Sentence part๋ฅผ ์ฐพ์•„์„œ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ๋‹ค.

 

์ €์ž๋Š” ์ž์„ธํ•œ Attention ๊ธฐ๋ฒ• ์„ค๋ช…์— ์•ž์„œ, Seq2Seq ์›๋ฆฌ๊ณผ RNN ์ž‘๋™ ๋ฐฉ์‹ ๋“ฑ์„ ์งš๊ณ  ๋„˜์–ด๊ฐ„๋‹ค. ํ•ด๋‹น ๋…ผ๋ฌธ์—์„  RNN ๋ชจ๋ธ๋กœ bidirectional RNN(์–‘๋ฐฉํ–ฅ RNN)์„ ํ™œ์šฉํ•œ๋‹ค. ๋ณดํ†ต์˜ RNN ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋’ค์˜ ์š”์†Œ๋ฅผ ์ฐธ๊ณ ํ•  ์ˆ˜ ์—†์ง€๋งŒ, ์ˆœ๋ฐฉํ–ฅ/์—ญ๋ฐฉํ–ฅ RNN์„ ๋ถ™์—ฌ์„œ ๋ชจ๋‘ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ฐœ์กฐํ•œ ๊ตฌ์กฐ์ด๋‹ค.

 

 

Dot-product attention illustrated as a figure; the principle presented in the paper works in the same way.

[Source: https://wikidocs.net/22893]
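A minimal sketch of that score โ†’ softmax โ†’ weighted-sum flow (an illustrative assumption in PyTorch; the function name and tensor shapes are my own, and the paper itself actually uses an additive/MLP alignment model rather than a plain dot product):

```python
import torch
import torch.nn.functional as F

def dot_product_attention(s, H):
    """s: current decoder hidden state [batch, hid]
       H: encoder annotations h_1..h_Tx [batch, src_len, hid]
       Returns the context vector [batch, hid] and attention weights [batch, src_len]."""
    scores = torch.bmm(H, s.unsqueeze(2)).squeeze(2)        # e_j = s . h_j  -> [batch, src_len]
    alpha = F.softmax(scores, dim=1)                        # normalise scores into a distribution
    context = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)   # weighted sum of the annotations
    return context, alpha
```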

 

 

 

 

 

2. ๊ณต์‹ ์ดํ•ด

 

Originally, the Encoder-Decoder approach used a single context vector c computed only from the encoder hidden states h.
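In the paper's notation, the whole source sentence is squeezed into one fixed vector, typically taken from the final hidden state:

$$ c = q(\{h_1, \dots, h_{T_x}\}), \qquad \text{e.g. } c = h_{T_x} $$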

 

 

 

ํ•˜์ง€๋งŒ ์ €์ž๊ฐ€ ์ œ์‹œํ•œ ๋ชจ๋ธ์€ ์กฐ๊ธˆ ๋‹ค๋ฅธ Context vector๋ฅผ ์ œ์‹œํ•œ๋‹ค.

์ฐจ๊ทผ์ฐจ๊ทผ ์‚ดํŽด๋ณด์ž.

 

 

 

First, a scalar score is computed that measures how relevant the decoder hidden state s used when predicting the i-th word (the previous state) is to the j-th encoder annotation vector h. Here a is the alignment model; any function that captures this relevance well can be used.
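Reconstructing the formula from the paper, where the concrete alignment model a is a small feed-forward network trained jointly with the rest of the system:

$$ e_{ij} = a(s_{i-1}, h_j), \qquad a(s_{i-1}, h_j) = v_a^{\top} \tanh\!\left(W_a s_{i-1} + U_a h_j\right) $$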

 

 

 

* The softmax function converts the scores into probability values.

* Tx denotes the number of words fed into the Encoder.

 

์ด์ „์— ๊ณ„์‚ฐํ•œ ์œ ์‚ฌ๋„ ์Šค์นผ๋ผ ๊ฐ’์ด ๋“ค์–ด๊ฐ„๋‹ค.

i ๋ฒˆ์งธ ๋‹จ์–ด ์˜ˆ์ธกํ•  ๋•Œ ์ •์˜๋˜๋Š” attention value ์•ŒํŒŒ๋ฅผ ๊ตฌํ•œ๋‹ค.
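Reconstructing the formula from the paper:

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$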

 

 

The attention weights are multiplied with the encoder annotation vectors h and summed, producing the context vector.
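In other words, the context vector for the i-th output is a weighted sum of the encoder annotations:

$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$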

 

 

์ด์ „์— ๊ตฌํ•ด์ง„ Context vector๋Š” i ๋ฒˆ์งธ ํžˆ๋“  ์Šคํ…Œ์ดํŠธ๋ฅผ ๊ตฌํ•˜๋Š” ์˜์–‘์ œ๊ฐ€ ๋œ๋‹ค.

 

 

๋งˆ์ง€๋ง‰, ์ด๋•Œ๊นŒ์ง€ ๊ตฌํ•œ ์š”์†Œ๋“ค์„ ๋‹จ์–ด ์˜ˆ์ธก์— ์‚ฌ์šฉํ•œ๋‹ค.

 

 

 

 

 

3. Conclusion

 

 

In the experiments, the longer the sentences, the more the Attention mechanism shone.

Even an Attention model trained on sentences of up to 30 words showed better performance than a conventional model trained on sentences of up to 50 words.

 

 

 

[Key sentences excerpted from the paper's Conclusion]

 

We conjectured that the use of a fixed-length context vector is problematic for translating long sentences, based on a recent empirical study reported by Cho et al. (2014b) and Pouget-Abadie et al. (2014). In this paper, we proposed a novel architecture that addresses this issue. This lets the model focus only on information relevant to the generation of the next target word. The experiment revealed that the proposed RNNsearch outperforms the conventional encoder–decoder model (RNNencdec). We were able to conclude that the model can correctly align each target word with the relevant words. One of challenges left for the future is to better handle unknown, or rare words.