当前位置: 首页 > news >正文

做网站上凡科小网站关键词搜什么

做网站上凡科,小网站关键词搜什么,网站中文名要注册的吗,沈阳企业网站建设动手学深度学习(视频):68 Transformer【动手学深度学习v2】_哔哩哔哩_bilibili 动手学深度学习(pdf):10.7. Transformer — 动手学深度学习 2.0.0 documentation (d2l.ai) 李沐Transformer论文逐段精读&a…

动手学深度学习(视频):68 Transformer【动手学深度学习v2】_哔哩哔哩_bilibili

动手学深度学习(pdf):10.7. Transformer — 动手学深度学习 2.0.0 documentation (d2l.ai)

李沐Transformer论文逐段精读:Transformer论文逐段精读【论文精读】_哔哩哔哩_bilibili

Vaswani, A. et al. (2017) 'Attention is all you need', 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. doi: https://doi.org/10.48550/arXiv.1706.03762

目录

1. Transformer

1.1. 整体实现步骤

1.2. Transformer理念

1.3. Transformer代码实现

1.4. Transformer弊端和局限

2. Transformer论文原文学习

2.1. Abstract

2.2. Introduction

2.3. Background

2.4. Model Architecture

2.4.1. Encoder and Decoder Stacks

2.4.2. Attention

2.4.3. Position-wise Feed-Forward Networks

2.4.4. Embeddings and Softmax

2.4.5. Positional Encoding

2.5.  Why Self-Attention

2.6. Training

2.6.1. Training Data and Batching

2.6.2. Hardware and Schedule

2.6.3. Optimizer

2.6.4. Regularization

2.7. Results

2.7.1. Machine Translation

2.7.2. Model Variations

2.7.3. English Constituency Parsing

2.8. Conclusion


1. Transformer

1.1. 整体实现步骤

(1)模型图

1.2. Transformer理念

(1)摒弃了传统费时的卷积和循环,仅采用注意力机制和全连接构成整个网络

(2)多使用残差连接

(3)Mask的提出

(4)李沐特地讲解了batch norm和layer norm的区别,主要是取的块不一样:

1.3. Transformer代码实现

(1)代码网址:GitHub - tensorflow/tensor2tensor: Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

(2)李沐

        ①Encoder

class EncoderBlock(nn.Module):"""Transformer编码器块"""def __init__(self, key_size, query_size, value_size, num_hiddens,norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,dropout, use_bias=False, **kwargs):super(EncoderBlock, self).__init__(**kwargs)self.attention = d2l.MultiHeadAttention(key_size, query_size, value_size, num_hiddens, num_heads, dropout,use_bias)self.addnorm1 = AddNorm(norm_shape, dropout)self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens, num_hiddens)self.addnorm2 = AddNorm(norm_shape, dropout)def forward(self, X, valid_lens):Y = self.addnorm1(X, self.attention(X, X, X, valid_lens))return self.addnorm2(Y, self.ffn(Y))

        ②Decoder

class DecoderBlock(nn.Module):"""解码器中第i个块"""def __init__(self, key_size, query_size, value_size, num_hiddens,norm_shape, ffn_num_input, ffn_num_hiddens, num_heads,dropout, i, **kwargs):super(DecoderBlock, self).__init__(**kwargs)self.i = iself.attention1 = d2l.MultiHeadAttention(key_size, query_size, value_size, num_hiddens, num_heads, dropout)self.addnorm1 = AddNorm(norm_shape, dropout)self.attention2 = d2l.MultiHeadAttention(key_size, query_size, value_size, num_hiddens, num_heads, dropout)self.addnorm2 = AddNorm(norm_shape, dropout)self.ffn = PositionWiseFFN(ffn_num_input, ffn_num_hiddens,num_hiddens)self.addnorm3 = AddNorm(norm_shape, dropout)def forward(self, X, state):enc_outputs, enc_valid_lens = state[0], state[1]# 训练阶段,输出序列的所有词元都在同一时间处理,# 因此state[2][self.i]初始化为None。# 预测阶段,输出序列是通过词元一个接着一个解码的,# 因此state[2][self.i]包含着直到当前时间步第i个块解码的输出表示if state[2][self.i] is None:key_values = Xelse:key_values = torch.cat((state[2][self.i], X), axis=1)state[2][self.i] = key_valuesif self.training:batch_size, num_steps, _ = X.shape# dec_valid_lens的开头:(batch_size,num_steps),# 其中每一行是[1,2,...,num_steps]dec_valid_lens = torch.arange(1, num_steps + 1, device=X.device).repeat(batch_size, 1)else:dec_valid_lens = None# 自注意力X2 = self.attention1(X, key_values, key_values, dec_valid_lens)Y = self.addnorm1(X, X2)# 编码器-解码器注意力。# enc_outputs的开头:(batch_size,num_steps,num_hiddens)Y2 = self.attention2(Y, enc_outputs, enc_outputs, enc_valid_lens)Z = self.addnorm2(Y, Y2)return self.addnorm3(Z, self.ffn(Z)), state

1.4. Transformer弊端和局限

(1)位置信息表示弱

(2)长距离文本学习能力弱

2. Transformer论文原文学习

2.1. Abstract

(1)Bcakground: the sequence transduction model prevails with complicated convolution or recurrence.

(2)Purpose: they put forward a simple attention model without convolution or recurrence, which not only decreases the training time, but also increases the accuracy of translation

transduction  n.转导

2.2. Introduction

        ①Previous model: RNN, LSTM, gated RNN

        ②RNN limits in parallel computing in that it is highly rely on previous data

        ③Factorization tricks and conditional computation have made great achievement nowadays. However, reseachers are still unable to break free from the constraints of sequential computation.

        ④Attention mechanism can ignore the position of words

eschew  vt.避免;(有意地)避开;回避

2.3. Background

        ①Extended Neural GPU, ByteNet and ConvS2S try reducing the sequential mechanism. However, they still explode in distance computation.

        ②Self-attention, also named intra-attention has been used in "reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations"

2.4. Model Architecture

        ①The Transformer model shows below:

Inputs are denoted by \left ( x_{1},x_{2}...x_{n} \right ) , then encode them to \left ( z_{1},z_{2}...z_{n} \right ) . Lastly decode z to \left ( y_{1},y_{2}...y_{m} \right ) (⭐input sequence and output sequence may not be of equal length).

        ②The model is auto-regressive

        ③The left of this figure is encode layer, and the right halves are decoder.

2.4.1. Encoder and Decoder Stacks

(1)Encoder

        ①N=6

        ②"Add" means adding x and sublayer(x), then using layer norm

(2)Decoder

        ①N=6

        ②Masked layer ensures x_{i} can only see the previous input of its position

2.4.2. Attention

(1)Scaled Dot-Product Attention

        ①The model shows below:

where Q denotes queries and K denotes keys of dimension d_{k}, V denotes values of dimension d_{v}

        ②The function of Scaled Dot-Product Attention:

\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V

        ③They set \sqrt{d_{k}} as the dividend. They reckon when d_{k} increaces, additive attention outpeforms dot product attention.

        ④The more the dimension is, the less gradient the softmax is. Hence they scale the dot product.

(2) Multi-Head Attention

        ①The model shows below:

        ②The function of this block is:

\begin{aligned} \mathrm{MultiHead}(Q,K,V)& =\mathrm{Concat}(\mathrm{head}_{1},...,\mathrm{head}_{\mathrm{h}})W^{O} \\ \mathrm{where~head_{i}}& =\mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}) \end{aligned}

where all the W are parameter matrices. Additionally, h=8, and they set 

d_{k}=d_{v}=d_{\mathrm{model}}/h=64

(3)Applications of Attention in our Model

        ①Each position in the decoder participates in all positions in the input sequence

        ②Each position in the encoder can handle all positions in the previous layer of the encoder

        ③Mask all values on the right that have not yet appeared and represent them with -\infty (so when they participate in softmax layer, they always get 0)

2.4.3. Position-wise Feed-Forward Networks

        ① Function of fully connected feed-forward network:

\mathrm{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2

where the input and output dimension d_{\mathrm{model}}=512 , and the inner-layer dimension d_{ff}=2048

2.4.4. Embeddings and Softmax

        ①The two embedding layers and the pre-softmax linear transformation are using the same weight matrix

        ②Table of different complexity for different layers:

where d denotes dimension, k denotes the size of convolutional kernal, n denotes length of sequence, and r is the size of the neighborhood in restricted self-attention

2.4.5. Positional Encoding

        ①Positional encodings:

\begin{aligned}PE_{(pos,2i)}&=sin(pos/10000^{2i/d_{\mathrm{model}}})\\PE_{(pos,2i+1)}&=cos(pos/10000^{2i/d_{\mathrm{model}}})\end{aligned}

where pos means position and i means dimension.

        ②They also tried learned positional embeddings, which got a similar result.

        ③Sinusoidal version allows longer sequence

2.5.  Why Self-Attention

        ①Low computational complexity, parallelized computation and short path length are the advantages of self-attention

        ②分析了一堆时间空间复杂度吗?我不是很看得懂

2.6. Training

2.6.1. Training Data and Batching

        They used WMT 2014 English-German dataset and WMT 2014 English-French dataset with 4.5 million sentence pairs and 36M sentences respectively.

2.6.2. Hardware and Schedule

        By each step in 1 second, they trained 300,000 steps (about 3.5 days)

2.6.3. Optimizer

        ①Adam optimizer: β1 = 0.9, β2 = 0.98 and ϵ = 10−9

        ②Learning rate: 

lrate=d_{\mathrm{model}}^{-0.5}\cdot\min(step\_num^{-0.5},step\_num\cdot warmup\_steps^{-1.5})

where warmup_steps = 4000 at the beginning

2.6.4. Regularization

        ①Dropout rate: 0.1

        ②Label smoothing value  \epsilon _{ls}=0.1 , which causes perplexity but increases accuracy and BLEU score

2.7. Results

2.7.1. Machine Translation

        Comparison of accuracy with other models:

2.7.2. Model Variations

        ①Variations of Transformer

where (A) changes the number of attention heads, attention key, and dimension of value, (B) adjusts the size of attention key, (C) and (D) control model size and dropout rate, (E) replaces sinusoidal positional encoding by learned positional embeddings. These ablation studies have successfully demonstrated the superiority of Transformer.

2.7.3. English Constituency Parsing

        ①Comparison table of generalization performance:

2.8. Conclusion

        ①This is the first model which only based on attention mechanism

        ②Outperforms any other previous model

http://www.hkea.cn/news/295050/

相关文章:

  • 怎样查询江西省城乡建设厅网站杭州seo网
  • 网站建设空间是指什么软件网站优化最为重要的内容是
  • 做美工要开通什么网站的会员呢新网站友链
  • 网站集约化建设推进情况推广app赚钱
  • 番禺大石做网站域名污染查询网站
  • 长沙市在建工程项目免费seo快速排名工具
  • 南宁定制网站制作电话图片外链生成工具
  • 哪些网站做的海报比较高大上百度客服电话是多少
  • 菏泽网站建设电话常州seo外包
  • 做木皮的网站裂变营销五种模式十六种方法
  • 精美 企业网站模板微信软文推广怎么做
  • 怎么建立一个网站里面可以查询资料百度权重域名
  • 网站建设顺序镇江交叉口优化
  • 低价企业网站搭建软文新闻发布网站
  • 创造与魔法官方网站做自己喜欢的事seo视频
  • 淘宝联盟推广网站怎么做吉安seo招聘
  • 工程招聘网站如何免费制作自己的网站
  • 网站建设调研问卷搜易网托管模式的特点
  • 在哪个网站可以做java面试题宁德市蕉城区疫情
  • 2021年重大新闻事件seo快速工具
  • 拼多多网店南宁优化推广服务
  • 洛阳建筑公司排名长沙官网seo服务
  • 网站关键词优化公司哪家好企业网站seo点击软件
  • 做网站有必要?优化师培训
  • 网站怎么发布信息百度推广优化技巧
  • 西安软件培训百度百科优化排名
  • 网站上文章加入音乐是怎么做的网页代码
  • 深圳公布最新出行政策徐州seo招聘
  • wordpress的漏洞seo优化知识
  • 网站建设高端seo和sem分别是什么