

Mem2Seq: Effectively Incorporating Knowledge Bases into End-to-End Task-Oriented Dialog Systems (Paper Review)

Mem2Seq

Created: Apr 22, 2020 4:36 PM

paper

code

Abstract

Mem2Seq : multi-hop attention over memories + pointer network

Introduction

Task-oriented dialog systems

  • Helps the user achieve a specific goal (e.g., a restaurant reservation).
  • The ability to query the KB is essential.

Traditionally → pipeline system

  • Language Understanding → dialog management → knowledge query → Language Generation
  • Such pipelined systems gain stability by combining domain-specific knowledge with slot-filling techniques.
  • However, modeling the dependencies between modules is complex, and interpreting the KB requires human effort.

Recently → end-to-end

RNN encoder-decoder models (Serban et al., 2016; Wen et al., 2017; Zhao et al., 2017)

  • they can directly map plain text dialog history to the output responses
  • the dialog states are latent, so there is no need for hand-crafted state labels.

attention-based copy mechanism (Gulcehre et al., 2016; Eric and Manning, 2017)

  • copy words directly from the input sources to the output responses.
  • Even when unknown tokens appear in the dialog history, the model can still generate the correct entities.

Problems with the above approaches

  1. It is hard to effectively incorporate KB information into the RNN hidden states.
  2. Applying attention mechanisms over long sequences is time-consuming.

end-to-end memory networks (MemNNs)

  • recurrent attention models over a possibly large external memory
  • approach
    • They write external memories into several embedding matrices
    • use query vectors to read memories repeatedly.
  • This lets the model memorize external KB information and quickly encode long dialog histories.
    • multi-hop mechanism → high performance
  • Problems
    • Instead of generating word-by-word, they select the response from a predefined candidate pool.

Memory-to-Sequence (ours)

  • Addresses the problems above.
  • MemNN + sequential generative architecture
    • global multi-hop attention mechanisms to copy words directly from dialog histories or KBs.
  • contributions
    1. first model to combine multi-hop attention mechanisms with the idea of pointer networks.
      • Effectively incorporates KB information.
    2. Learns to generate dynamic queries to control memory access.
    3. Faster, and achieves state-of-the-art results on several task-oriented dialog datasets.

Model Description

Mem2Seq consists of a MemNN encoder and a memory decoder.

(Figure: overall Mem2Seq architecture, showing the MemNN encoder and the memory decoder)

MemNN encoder : create a vector representation of the dialog history.

memory decoder : reads and copies the memory to generate a response.

 

$X = \{x_1, \dots, x_n\}$ : all the words in the dialog history

$B = \{b_1, \dots, b_l\}$ : the KB tuples

$U = [B;X]$ : the concatenation of the two sets $X$ and $B$

$Y = \{y_1, \dots, y_m\}$ : the set of words in the expected system response

$PTR = \{ptr_1, \dots, ptr_m\}$ : the pointer indices, defined as

$ptr_i = \max(z) \ \text{if} \ \exists z \ \text{s.t.} \ y_i = u_z, \ \text{otherwise} \ n+l+1$

where $u_z \in U$ is the input sequence and $n+l+1$ is the sentinel position index.

  • y_i = u_z : if a word of the expected system response appears somewhere in U, i.e. it can be copied from the input, then ptr_i is max(z), the last memory position where it occurs ... ? (see the sketch below)
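A minimal sketch of how these PTR labels could be derived from a target response and the memory U (the function and variable names here are my own, not the paper's):

```python
def pointer_targets(response, memory, sentinel_idx):
    """ptr_i = max(z) if the response word y_i occurs in memory U at positions z, else the sentinel index."""
    targets = []
    for y in response:
        positions = [z for z, u in enumerate(memory) if u == y]
        targets.append(max(positions) if positions else sentinel_idx)
    return targets

# With 0-indexed memory the sentinel sits one slot past the end (n + l + 1 in the paper's 1-indexing).
memory = ["hello", "The_Westin", "5_miles"]
print(pointer_targets(["5_miles", "please"], memory, sentinel_idx=len(memory)))  # -> [2, 3]
```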

Memory Encoder

standard MemNN with adjacent weighted tying

input : word-level information in U

memories

  • set of trainable embedding matrices C = {C^1, .. C^{K+1}}
  • C^k : maps token to vector.

Over K hops, the model computes an attention weight for each memory i (q^k is the query vector):

$p^k_i = \text{Softmax}\left((q^k)^\top C^k_i\right), \quad \text{where} \ C^k_i = C^k(x_i)$

p^k : soft memory selector.

  • Determines the relevance of each memory with respect to the query vector.

C^k = a [vocab, D] embedding matrix?

C^k_i = the embedding of the i-th memory token, i.e. C^k(x_i), with shape [1, D]

Thoughts

  • p^k_i
    • The similarity between the query embedding and the i-th word? That is, how much that word can influence the given query?
  • p^k
    • All of the p^k_i (one per word) gathered together, i.e. the importance of every memory word with respect to the current query q^k?

How the memory is read

$o^k = \sum_i p^k_i C^{k+1}_i$

  • weighted sum over C^{k+1}
  • C^{k+1} is used because of the adjacent weighted tying.

The query vector is updated at every hop:

$q^{k+1} = q^k + o^k$

o^K is the final encoder result and becomes the decoder's input.
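A minimal PyTorch-style sketch of the encoder loop described above; the class name, shapes, and the externally supplied initial query q0 are my assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemNNEncoder(nn.Module):
    """Multi-hop MemNN encoder with adjacent weighted tying (sketch of the equations above)."""
    def __init__(self, vocab_size, emb_dim, hops=3):
        super().__init__()
        self.hops = hops
        # K + 1 trainable embedding matrices C^1 .. C^{K+1}
        self.C = nn.ModuleList(nn.Embedding(vocab_size, emb_dim) for _ in range(hops + 1))

    def forward(self, story, q0):
        # story: [batch, mem_len] token ids of U = [B; X];  q0: [batch, emb_dim] initial query
        q, o = q0, q0
        for k in range(self.hops):
            m_in = self.C[k](story)       # C^k_i,      [batch, mem_len, emb_dim]
            m_out = self.C[k + 1](story)  # C^{k+1}_i,  [batch, mem_len, emb_dim]
            p = F.softmax(torch.einsum("bd,bmd->bm", q, m_in), dim=-1)  # p^k_i = Softmax((q^k)^T C^k_i)
            o = torch.einsum("bm,bmd->bd", p, m_out)                    # o^k = sum_i p^k_i C^{k+1}_i
            q = q + o                                                   # q^{k+1} = q^k + o^k
        return o  # o^K, used as the decoder's initial query h_0
```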

Memory Decoder

RNN + MemNN

The MemNN loads both X (the dialog history) and B (the KB).

In the MemNN, a GRU is used as a dynamic query generator.

At each step, the GRU takes the previously generated word and the previous query as input and produces a new query vector.

$h_t = \text{GRU}\left(C^1(\hat{y}_{t-1}), h_{t-1}\right)$

The query (h_t) is passed through the MemNN to generate a token.

h_0 = o^K (result of encoder.)

Two distributions are learned at each step:

  1. P_vocab : one over all the words in the vocabulary

    $P_{vocab}(\hat{y}_t) = \text{Softmax}\left(W_1 [h_t; o^1]\right)$
  2. P_ptr : dialog history and KB information

    $P_{ptr} = p^K_t$

Our decoder generates tokens by pointing to the input words in the memory (similar to a pointer network).

we expect the attention weights in the first and the last hop to show a “looser” and “sharper” distribution, respectively.

  • 1st hop : for retrieving memory information
  • last hop : choose the exact token leveraging the pointer supervision.

The model is jointly learned by minimizing the sum of two standard cross-entropy losses:

$ \text{between }P_{vocab}(\hat{y}_t)~ \text{ and }y_t \in Y \text{ for vocab distribution} $

$ \text{between }P_{ptr}(\hat{y}_t)~ \text{ and }ptr_t \in PTR \text{ for memory distribution} $
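A compact sketch of one decoder step and the per-step joint loss, following the description above; the module layout and names are assumptions (in particular, the memory hops reuse the same structure as the encoder sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemDecoderStep(nn.Module):
    """One decoding step: dynamic query via a GRU, then K hops over memory for P_vocab and P_ptr."""
    def __init__(self, vocab_size, emb_dim, hops=3):
        super().__init__()
        self.C = nn.ModuleList(nn.Embedding(vocab_size, emb_dim) for _ in range(hops + 1))
        self.gru = nn.GRUCell(emb_dim, emb_dim)        # dynamic query generator
        self.W1 = nn.Linear(2 * emb_dim, vocab_size)   # maps [h_t; o^1] to the vocabulary
        self.hops = hops

    def forward(self, prev_word, h_prev, story):
        # prev_word: [batch] ids of y_{t-1};  h_prev: [batch, emb_dim];  story: [batch, mem_len] ids of U
        h = self.gru(self.C[0](prev_word), h_prev)     # h_t = GRU(C^1(y_{t-1}), h_{t-1})
        q = h
        for k in range(self.hops):
            m_in, m_out = self.C[k](story), self.C[k + 1](story)
            p = F.softmax(torch.einsum("bd,bmd->bm", q, m_in), dim=-1)  # hop-k attention over memory
            o = torch.einsum("bm,bmd->bd", p, m_out)
            if k == 0:
                o1 = o                                  # first-hop readout feeds P_vocab
            q = q + o
        p_vocab = F.softmax(self.W1(torch.cat([h, o1], dim=-1)), dim=-1)  # P_vocab = Softmax(W1[h_t; o^1])
        p_ptr = p                                        # P_ptr = p^K_t, the last-hop attention
        return p_vocab, p_ptr, h

# Per-step joint loss (sketch): cross-entropy of P_vocab against y_t plus cross-entropy of P_ptr against ptr_t
# loss_t = F.nll_loss(torch.log(p_vocab + 1e-12), y_t) + F.nll_loss(torch.log(p_ptr + 1e-12), ptr_t)
```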

Sentinel

If the expected word does not appear in the memories, P_ptr is trained to produce the sentinel token ($).


If the sentinel is selected, the model generates the token from P_vocab.

If the sentinel is not selected, the model takes the memory content pointed to by P_ptr.

The sentinel token is used as a hard gate that controls which distribution to use at each time step.

→ Thus the model needs no learned gating function and is not constrained by a soft gate function.
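A small illustration of how this hard gate could look at greedy decoding time (my own sketch, not the authors' code):

```python
import torch

def decode_token(p_vocab, p_ptr, memory_tokens, sentinel_idx):
    """Copy from memory via P_ptr unless the sentinel is chosen; then fall back to generating from P_vocab."""
    ptr_choice = torch.argmax(p_ptr, dim=-1)      # [batch] memory positions (possibly the sentinel)
    vocab_choice = torch.argmax(p_vocab, dim=-1)  # [batch] vocabulary word ids
    out = []
    for b in range(ptr_choice.size(0)):
        if ptr_choice[b].item() == sentinel_idx:
            out.append(("generate", vocab_choice[b].item()))              # $ selected -> use P_vocab
        else:
            out.append(("copy", memory_tokens[b][ptr_choice[b].item()]))  # copy the pointed memory word
    return out
```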

Memory Content

We store word-level content X in the memory module.

  • we add temporal information and speaker information in each token of X to capture the sequential dependencies.
  • ex : hello t1$u → “hello” at time step 1 spoken by a user.

B, the KB information

  • (subject, relation, object) representation.
  • ex : (The Westin, Distance, 5 miles)
  • The word embeddings of s, r, and o are summed to form the KB memory representation (sketched below).
  • The object of a triple is what can be generated as an output word via P_ptr.
    • ex : (The Westin, Distance, 5 miles) is pointed → “5 miles” as an output word
  • Only the section of the KB relevant to the specific dialog is loaded into memory.
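A small sketch of how the memory contents described above could be assembled; the `$s` speaker tag for system turns and the `emb()` lookup are assumptions for illustration:

```python
def dialog_memory(turns):
    """Word-level dialog memory X: each word carries temporal (turn) and speaker information."""
    memory = []
    for t, (speaker, utterance) in enumerate(turns, start=1):
        tag = "$u" if speaker == "user" else "$s"
        memory.extend(f"{word} t{t}{tag}" for word in utterance.split())
    return memory

# "hello" at time step 1 spoken by the user -> "hello t1$u"
print(dialog_memory([("user", "hello"), ("system", "how may I help you")]))

# KB memory B: a (subject, relation, object) triple is represented by the sum of its word embeddings,
# and the object ("5 miles") is what P_ptr can copy into the response, e.g.:
# kb_vector = emb("The Westin") + emb("Distance") + emb("5 miles")   # assumed emb() lookup
```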

Note that QRN, MemNN, and GMemNN viewed the bAbI dialog tasks as classification problems. Although their task is easier than our generative setting, Mem2Seq can still surpass their performance.

Related works

Based on end-to-end memory networks.

  • Liu and Perez, 2017; Wu et al., 2017, 2018
  • In each of these architectures, the output is produced by generating a sequence of tokens, or by selecting a set of predefined utterances.

Based on sequence-to-sequence models

  • Zhao et al., 2017
    • These architectures have better language modeling ability, but they do not work well in KB retrieval.
  • Eric and Manning, 2017
    • Seq2Seq fails to map the correct entities into the generated responses. To alleviate this problem, copy-augmented Seq2Seq models were used.
    • These models outperform utterance selection methods by copying relevant information directly from the KBs.