Handling long text with BERT

Article

The idea is borrowed from the RoBERT approach in this paper.

Example: a 250-word text
- Chunk 1 : words 0:200
- Chunk 2 : words 150:250 (the last 50 words of Chunk 1 are prepended.)

Split the input sequence into fixed-size segments with overlap.
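A minimal sketch of the overlapping split, assuming the text is already a list of words. The function name `split_with_overlap` and the defaults (`chunk_size=200`, `overlap=50`) are illustrative, taken from the 250-word example above rather than from the paper's code.

```python
def split_with_overlap(words, chunk_size=200, overlap=50):
    """Split a list of words into fixed-size chunks that share `overlap` words.

    Hypothetical helper: each chunk starts `chunk_size - overlap` words after
    the previous one, so the last `overlap` words of a chunk are repeated at
    the start of the next. The final chunk may be shorter than `chunk_size`.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not words:
        return []

    stride = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
        start += stride
    return chunks


words = [f"w{i}" for i in range(250)]
for i, chunk in enumerate(split_with_overlap(words), start=1):
    print(f"Chunk {i}: {chunk[0]} .. {chunk[-1]} ({len(chunk)} words)")
```

With 250 words this yields two chunks covering words 0:200 and 150:250, matching the example. In practice the split would likely be done on BERT tokens rather than whitespace words, since BERT's length limit is counted in tokens.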

For each segment, obtain H and P from BERT.

Stack these segment-level representations (H?) to form the input sequence of a small LSTM (100-dim).

The output of this LSTM is the document embedding.

Two fully connected layers (the first with ReLU, the second with softmax, same dim as the number of classes) predict the final result.
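The LSTM and classifier steps above can be sketched in PyTorch roughly as follows. This is a sketch under assumptions, not the paper's implementation: the class name `RecurrentHead`, the 768-dim segment vectors (BERT-base hidden size), the 30-dim first fully connected layer, and the use of the LSTM's final hidden state as the document embedding are all choices made here for illustration.

```python
import torch
import torch.nn as nn


class RecurrentHead(nn.Module):
    """Hypothetical LSTM-over-segments classifier head."""

    def __init__(self, seg_dim=768, lstm_dim=100, fc_dim=30, num_classes=2):
        super().__init__()
        # Small LSTM (100-dim) that reads the stacked segment representations.
        self.lstm = nn.LSTM(seg_dim, lstm_dim, batch_first=True)
        # Two fully connected layers: ReLU, then softmax over the classes.
        self.fc1 = nn.Linear(lstm_dim, fc_dim)  # fc_dim is an assumed size
        self.fc2 = nn.Linear(fc_dim, num_classes)

    def forward(self, segments):
        # segments: (batch, num_segments, seg_dim), one vector per BERT segment.
        _, (h_n, _) = self.lstm(segments)
        doc_embedding = h_n[-1]  # (batch, lstm_dim), final hidden state
        hidden = torch.relu(self.fc1(doc_embedding))
        return torch.softmax(self.fc2(hidden), dim=-1)


# Random stand-ins for the per-segment BERT vectors of 3 documents, 2 segments each.
segments = torch.randn(3, 2, 768)
probs = RecurrentHead(num_classes=4)(segments)
print(probs.shape)  # torch.Size([3, 4])
```

When training with `nn.CrossEntropyLoss`, the final softmax would normally be dropped and the raw outputs of `fc2` passed to the loss, since that loss applies log-softmax itself.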

Variant: replace the LSTM with a small Transformer model (2 layers of the transformer building block, containing self-attention, fully connected layers, etc.)
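A rough sketch of this variant, using PyTorch's built-in `nn.TransformerEncoder` in place of the LSTM. Everything beyond "2 layers" is an assumption made here: the class name `TransformerHead`, the number of heads, the feed-forward size, mean pooling to get the document embedding, and the single linear classifier. Positional encodings are omitted for brevity, so this version ignores segment order.

```python
import torch
import torch.nn as nn


class TransformerHead(nn.Module):
    """Hypothetical Transformer-over-segments classifier head."""

    def __init__(self, seg_dim=768, num_heads=8, ff_dim=256,
                 num_layers=2, num_classes=2):
        super().__init__()
        # One transformer building block: self-attention + feed-forward.
        layer = nn.TransformerEncoderLayer(
            d_model=seg_dim,
            nhead=num_heads,          # assumed value
            dim_feedforward=ff_dim,   # assumed value
            batch_first=True,
        )
        # Stack 2 such blocks, as described in the notes.
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(seg_dim, num_classes)

    def forward(self, segments):
        # segments: (batch, num_segments, seg_dim)
        encoded = self.encoder(segments)
        # Assumed pooling: average over segments to get one document vector.
        doc_embedding = encoded.mean(dim=1)
        return torch.softmax(self.classifier(doc_embedding), dim=-1)


segments = torch.randn(3, 2, 768)
probs = TransformerHead(num_classes=4)(segments)
print(probs.shape)  # torch.Size([3, 4])
```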