Welcome to follow my CSDN: https://spike.blog.csdn.net/
Article link: https://spike.blog.csdn.net/article/details/143749468
Disclaimer: This article is based on personal knowledge and public materials, and is intended for academic exchange only; discussion is welcome, reposting is not permitted.

The network modules that dominate the parameter count of a (multimodal) large language model are three kinds: Linear, Embedding, and Norm (LayerNorm or RMSNorm); a multimodal model additionally includes Conv3D. The parameter count computed by hand agrees exactly with the count reported directly by PyTorch.
PyTorch source:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

We use Qwen2-VL-7B-Instruct, Qwen2-7B-Instruct, and Llama-3.1-8B-Instruct as examples.
Parameter counts by network module
- Linear: a weight matrix, optionally plus a bias. Linear(in_features=w, out_features=h, bias=True) has w*h + h parameters; with bias=False, it has w*h.
- Embedding: can be viewed as a Linear without bias, i.e. vocab_size * hidden_size parameters.
- Norm: LayerNorm has 2 trainable parameters, γ and β, per dimension; with hidden_size = h, every dimension contributes two parameters, i.e. 2*h in total. RMSNorm has only 1 trainable parameter per dimension, i.e. h in total.
- Conv3D: Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False) has in_channels * out_channels * kernel volume = 3*1280*2*14*14 = 1505280 parameters.
- RotaryEmbedding, Activation, and Dropout: rotary position embeddings, activation functions, and Dropout have no trainable parameters.
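The per-module formulas above can be expressed as small helper functions for quick checks (a minimal sketch; the function names are my own, not from any library):

```python
import math

# Hypothetical helpers mirroring the per-module parameter formulas above.
def linear_params(w, h, bias=True):
    # weight matrix w*h, plus h bias terms when bias=True
    return w * h + (h if bias else 0)

def embedding_params(vocab_size, hidden_size):
    # an Embedding is a Linear without bias
    return vocab_size * hidden_size

def layernorm_params(hidden_size):
    # gamma and beta per dimension
    return 2 * hidden_size

def rmsnorm_params(hidden_size):
    # one scale parameter per dimension
    return hidden_size

def conv3d_params(c_in, c_out, kernel, bias=False):
    # in_channels * out_channels * kernel volume (+ out_channels bias terms)
    return c_in * c_out * math.prod(kernel) + (c_out if bias else 0)

print(conv3d_params(3, 1280, (2, 14, 14)))  # 1505280
```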
Llama-3.1-8B-Instruct parameter count:

128256*4096 + 32*(4096*4096*2 + 4096*1024*2 + 4096*14336*3 + 2*4096) + 4096 + 4096*128256 = 8030261248 ≈ 8B
That is:

Parameters = Embedding + layers * (Linear_QKVO + Linear_mlp + RMSNorm) + RMSNorm + Linear
Computed parameter count: [Info] parameters: 8030261248
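The formula can be verified with a few lines of arithmetic (a sketch; the variable names are mine, and the hyperparameters are those quoted above):

```python
# Llama-3.1-8B-Instruct hyperparameters
vocab, hidden, inter, layers, kv_dim = 128256, 4096, 14336, 32, 1024

embed = vocab * hidden                      # embed_tokens
per_layer = (hidden * hidden * 2            # q_proj + o_proj
             + hidden * kv_dim * 2          # k_proj + v_proj
             + hidden * inter * 3           # gate_proj + up_proj + down_proj
             + 2 * hidden)                  # two RMSNorms per layer
# final RMSNorm + lm_head added at the end
total = embed + layers * per_layer + hidden + hidden * vocab
print(total)  # 8030261248
```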
Network structure of the large language model Llama-3.1-8B-Instruct:
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=4096, out_features=128256, bias=False)
)

Network structure of the multimodal vision-language model Qwen2-VL-7B-Instruct:
Qwen2VLForConditionalGeneration(
  (visual): Qwen2VisionTransformerPretrainedModel(
    (patch_embed): PatchEmbed(
      (proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
    )
    (rotary_pos_emb): VisionRotaryEmbedding()
    (blocks): ModuleList(
      (0-31): 32 x Qwen2VLVisionBlock(
        (norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
        (attn): VisionSdpaAttention(
          (qkv): Linear(in_features=1280, out_features=3840, bias=True)
          (proj): Linear(in_features=1280, out_features=1280, bias=True)
        )
        (mlp): VisionMlp(
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (act): QuickGELUActivation()
          (fc2): Linear(in_features=5120, out_features=1280, bias=True)
        )
      )
    )
    (merger): PatchMerger(
      (ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=5120, out_features=5120, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=5120, out_features=3584, bias=True)
      )
    )
  )
  (model): Qwen2VLModel(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2VLDecoderLayer(
        (self_attn): Qwen2VLSdpaAttention(
          (q_proj): Linear(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2VLRotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), eps=1e-06)
    (rotary_emb): Qwen2VLRotaryEmbedding()
  )
  (lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)

Total parameter count: [Info] parameters: 8291375616
Vision model parameter count: [Info] parameters model.visual: 675759104
Language model parameter count: [Info] parameters model.model: 7070619136, [Info] parameters model.lm_head: 544997376
That is: 675759104 (8.15%) + 7070619136 (85.28%) + 544997376 (6.57%) = 8291375616 ≈ 8B
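The three components and their shares can be checked directly (a sketch using the counts reported above):

```python
# Component counts reported by count_parameters above
visual, language, lm_head = 675759104, 7070619136, 544997376
total = visual + language + lm_head
print(total)  # 8291375616

# Share of each component in the total
for name, n in [("visual", visual), ("language", language), ("lm_head", lm_head)]:
    print(f"{name}: {n / total:.2%}")  # 8.15% / 85.28% / 6.57%
```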
Parameter count of Qwen2-VL-7B-Instruct's Qwen2VisionTransformerPretrainedModel:
- patch_embed parameter count: 3*1280*2*14*14 = 1505280
- blocks parameter count: [Info] parameters model.visual.blocks: 629678080; in detail: 32*(1280*2*2 + (1280+1)*3840 + (1280+1)*1280 + 1280*5121 + 5120*1281) = 629678080
- merger parameter count: 1280*2 + 5120*5121 + (5120+1)*3584 = 44575744
Combined formula:

3*1280*2*14*14 + 32*(1280*2*2 + (1280+1)*3840 + (1280+1)*1280 + 1280*5121 + 5120*1281) + 1280*2 + 5120*5121 + (5120+1)*3584 = 675759104
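The vision-tower formula can be verified term by term (a sketch; the variable names are my own):

```python
patch_embed = 3 * 1280 * 2 * 14 * 14                 # Conv3d, no bias
per_block = (1280 * 2 * 2                            # norm1 + norm2 (LayerNorm)
             + (1280 + 1) * 3840                     # qkv, bias=True
             + (1280 + 1) * 1280                     # proj, bias=True
             + 5120 * 1281                           # fc1 (1280 -> 5120), bias=True
             + 1280 * 5121)                          # fc2 (5120 -> 1280), bias=True
merger = 1280 * 2 + 5120 * 5121 + (5120 + 1) * 3584  # ln_q + two biased Linears
visual = patch_embed + 32 * per_block + merger
print(visual)  # 675759104
```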
Parameter count of Qwen2-VL-7B-Instruct's Qwen2VLModel:

152064*3584 + 28*((3584+1)*3584 + (3584+1)*512*2 + 3584*3584 + 3584*18944*3 + 2*3584) + 3584 = 7070619136

and of the lm_head: 3584*152064 = 544997376
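The language-model side checks out the same way (a sketch with my own variable names; note that q/k/v here carry biases while o_proj does not):

```python
# Qwen2VLModel hyperparameters
vocab, hidden, inter, layers, kv_dim = 152064, 3584, 18944, 28, 512

embed = vocab * hidden
per_layer = ((hidden + 1) * hidden          # q_proj, bias=True
             + (hidden + 1) * kv_dim * 2    # k_proj + v_proj, bias=True
             + hidden * hidden              # o_proj, bias=False
             + hidden * inter * 3           # gate_proj + up_proj + down_proj
             + 2 * hidden)                  # two RMSNorms per layer
language = embed + layers * per_layer + hidden  # + final RMSNorm
lm_head = hidden * vocab
print(language, lm_head)  # 7070619136 544997376
```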
Thus, the manually computed parameter counts for Qwen2-VL-7B align exactly with PyTorch's.
Test
# Pretrained models: inspect their vocabulary sizes
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

print(f"[Info] transformers version: {transformers.__version__}")

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# ------------ Qwen2-VL-7B ----------- #
model_path = "[your path]/llm/Qwen/Qwen2-VL-7B-Instruct"
print(f"[Info] model_path: {model_path}")

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
configuration = model.config
print(f"[Info] Qwen2-VL-7B vocab_size: {configuration.vocab_size}")
print(model)
print(f"[Info] parameters: {count_parameters(model)}")
print(f"[Info] parameters model.visual: {count_parameters(model.visual)}")
print(f"[Info] parameters model.model: {count_parameters(model.model)}")
print(f"[Info] parameters model.lm_head: {count_parameters(model.lm_head)}")
print(f"[Info] parameters model.visual.patch_embed: {count_parameters(model.visual.patch_embed)}")
print(f"[Info] parameters model.visual.blocks: {count_parameters(model.visual.blocks)}")
print(f"[Info] parameters model.visual.blocks[0].norm1: {count_parameters(model.visual.blocks[0].norm1)}")
print(f"[Info] parameters model.visual.blocks[0].norm2: {count_parameters(model.visual.blocks[0].norm2)}")
print(f"[Info] parameters model.visual.blocks[0].attn: {count_parameters(model.visual.blocks[0].attn)}")
print(f"[Info] parameters model.visual.blocks[0].mlp: {count_parameters(model.visual.blocks[0].mlp)}")
# ------------ Qwen2-VL-7B ----------- #

# ------------ Qwen2-7B ----------- #
model_path = "[your path]/llm/Qwen/Qwen2-7B-Instruct"
print(f"[Info] model_path: {model_path}")
device = "cuda"  # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(f"[Info] Qwen2-7B vocab_size: {tokenizer.vocab_size}")
print(model)
print(f"[Info] parameters: {count_parameters(model)}")
# ------------ Qwen2-7B ----------- #

# ------------ Llama-3.1-8B ----------- #
model_path = "[your path]/llm/Meta-Llama-3.1-8B-Instruct"
print(f"[Info] model_path: {model_path}")
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(f"[Info] Llama-3.1-8B vocab_size: {tokenizer.vocab_size}")
print(model)
print(f"[Info] parameters: {count_parameters(model)}")
# ------------ Llama-3.1-8B ----------- #

The parameter count of Qwen2-7B is 7615616512, i.e. 7070619136 + 544997376 = 7615616512.