All posts

(22)

[Paper Reading] Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference 1 Introduction It is not precisely understood why multi-head attention performs better than single-head attention. (It is probably the diversity of perspectives, but that is not certain.) This paper tries to understand multi-head attention from a Bayesian perspective by placing deterministic attention in a stochastic setting. It introduces a new algorithm that improves the repulsiveness of multi-head attention without adding extra trainable parameters or other regularization. (Repulsiveness is implemented by using the similarity between heads as a loss.) Bayesian interpretation..

[Paper Reading] Are Sixteen Heads Really Better than One? 1 Introduction After training, most attention heads can be removed at test time. The encoder-decoder layers react sensitively to pruning, so multi-head attention appears to play an important role there. Training produces both important and unimportant heads. 2 Background: Attention, Multi-headed Attention, and Masking 2.3 Masking Attention Heads To exclude the influence of a particular head, that head is masked, and the attention equation is modified accordingly. 3 Are All Attention Heads Important? The change in performance is observed while removing one or more heads. 3...
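The head masking mentioned in section 2.3 can be sketched as follows. This is a minimal NumPy illustration, not the paper's code: it assumes each head's output is multiplied by a 0/1 mask value before the heads are summed, and the function name `masked_multi_head_attention`, the weight layout, and the argument `head_mask` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_multi_head_attention(x, q, Wq, Wk, Wv, Wo, head_mask):
    """Sum of per-head attention outputs, each scaled by a 0/1 mask.

    x: (n, d) sequence attended over; q: (m, d) queries.
    Wq, Wk, Wv: (H, d, d_h); Wo: (H, d_h, d); head_mask: (H,) of 0s and 1s.
    (Shapes and names are assumptions for this sketch.)
    """
    out = np.zeros((q.shape[0], Wo.shape[-1]))
    for h in range(len(head_mask)):
        Q = q @ Wq[h]
        K = x @ Wk[h]
        V = x @ Wv[h]
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        head_out = softmax(scores) @ V @ Wo[h]
        # A mask value of 0 removes this head's contribution entirely.
        out += head_mask[h] * head_out
    return out


rng = np.random.default_rng(0)
H, d, d_h, n, m = 4, 8, 2, 5, 3
Wq, Wk, Wv = (rng.normal(size=(H, d, d_h)) for _ in range(3))
Wo = rng.normal(size=(H, d_h, d))
x, q = rng.normal(size=(n, d)), rng.normal(size=(m, d))

all_heads = masked_multi_head_attention(x, q, Wq, Wk, Wv, Wo, np.ones(H))
mask = np.ones(H)
mask[0] = 0.0  # ablate head 0 only
without_head0 = masked_multi_head_attention(x, q, Wq, Wk, Wv, Wo, mask)
```

Comparing a task metric computed from `all_heads` against the same metric computed from `without_head0` is the kind of single-head ablation the excerpt describes.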

[Paper Reading] Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping 1 Introduction Transformer-based methods have been very effective on NLP problems. However, despite self-attention, parallelizable recurrence, and very high-performance hardware, steps such as pre-training still consume a considerable amount of time. This paper introduces training techniques and architectural changes to speed up the pre-training of Transformer networks. The authors report that reducing the number of layers or trying stochastic depth was not effective. (What is stochastic depth?) -> Entire layers are randomly skipped in order to effectively shorten the network..
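The parenthetical on stochastic depth can be illustrated with a small sketch. This shows only the general idea of randomly skipping whole residual layers during training; it is not the progressive layer dropping method of the paper. The function name `stochastic_depth_forward`, the single shared `keep_prob`, and the inference-time scaling by `keep_prob` are assumptions made for this example.

```python
import random


def stochastic_depth_forward(x, layers, keep_prob, training=True, rng=random):
    """Run residual layers, skipping each whole layer with prob 1 - keep_prob.

    `layers` is a list of callables computing the residual branch f(x).
    (Names and the single shared keep_prob are assumptions for this sketch.)
    """
    for layer in layers:
        if training:
            if rng.random() >= keep_prob:
                continue  # layer skipped: only the identity path is used
            x = x + layer(x)
        else:
            # At inference every layer runs; the residual branch is scaled
            # by keep_prob to match its expected contribution in training.
            x = x + keep_prob * layer(x)
    return x


# Toy residual branches on scalars, just to show the control flow.
toy_layers = [lambda v: 1.0, lambda v: 2.0]
always_kept = stochastic_depth_forward(0.0, toy_layers, keep_prob=1.0)
never_kept = stochastic_depth_forward(0.0, toy_layers, keep_prob=0.0)
inference = stochastic_depth_forward(0.0, toy_layers, keep_prob=0.5, training=False)
```

With `keep_prob=1.0` the network behaves like an ordinary residual stack, and lowering it shortens the expected depth per training step.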
