Running DeepSeek-R1-Distill-Qwen-32B on a 24 GB GPU

Contents:
I. Background
II. Solution
III. Steps
  1. Download the model
  2. Install dependencies
  3. Quantize
  4. Generate the inference code
  5. Run
    A. Cache capped at 128 entries
    B. Unlimited cache
    C. Output

I. Background

With the rapid development of deep learning, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing. However, as parameter counts grow exponentially, the compute resources needed to run these models, and in particular GPU memory, have become enormous. Running an LLM whose weights exceed the available GPU memory is therefore a pressing challenge.

This article demonstrates how to run an LLM that does not fit in GPU memory. By quantizing the model weights and optimizing the memory-management strategy, we aim to work around the hardware bottleneck and offer a new approach to deploying large models.

II. Solution

The solution combines weight quantization, a device-memory cache, and a custom Linear module:

1. INT4 block quantization of weights
  - Quantization strategy: weights are quantized to INT4 (4-bit integers) in blocks of 128 elements, which drastically reduces the storage they occupy.
  - Memory benefit: after INT4 quantization, all weights fit in host (CPU) memory. This relieves GPU-memory pressure and sets up fast subsequent reads.

2. Reduced disk I/O
  - Full preload: all quantized INT4 weights are loaded into host memory once, so the model never reads from disk during inference, eliminating disk I/O latency and throughput bottlenecks.

3. Device-memory cache
  - Design: a cache in GPU device memory holds at most N entries. N depends on the specific GPU configuration; the goal is to use as much free device memory as possible to maximize read efficiency.
  - Dynamic management: the cache allocates and frees entries intelligently so that data is served efficiently without exceeding the device-memory limit.

4. Weight-prefetch thread
  - Separation of concerns: a dedicated prefetch thread dequantizes INT4 weights from host memory back into the compute dtype and loads them into the device-memory cache.
  - Efficiency: asynchronous prefetching keeps weights ready when the model needs them, minimizing stalls.

5. Custom Linear module
  - Replacement: nn.Linear is replaced with a custom Module during model construction and loading, and this module carries the linear computation.
  - Mechanism: in forward(), the module fetches its weight from the device cache, computes, and the weight's device memory is then freed for later layers.
  - Benefit: this load-on-demand scheme keeps weights from occupying device memory for the entire run, greatly improving memory utilization.

III. Steps

1. Download the model

```shell
# Model card: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

# Download the model
apt install git-lfs -y
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git
```

2. Install dependencies

```shell
MAX_JOBS=4 pip install flash-attn==2.3.6
pip install torch-tb-profiler
```

3. Quantize

```shell
cat > extract_weights.py <<'EOF'
import torch
import os
import sys
from tqdm import tqdm
from glob import glob
from safetensors.torch import safe_open, save_file


def quantize_tensor_int4(tensor):
    """
    Quantize a bfloat16 tensor to int4 with a block size of 128 and
    return the per-block scales.

    Args:
        tensor (torch.Tensor): bfloat16 input tensor.
    Returns:
        int4_tensor (torch.Tensor): quantized uint8 tensor; each element
            packs two int4 values.
        scales (torch.Tensor): bfloat16 scale for each block.
    """
    # Ensure the input is bfloat16
    tensor = tensor.to(torch.bfloat16)
    # Flatten to 1-D
    flat_tensor = tensor.flatten()
    N = flat_tensor.numel()
    block_size = 128
    num_blocks = (N + block_size - 1) // block_size  # number of blocks
    # Block index of each element
    indices = torch.arange(N, device=flat_tensor.device)
    block_indices = indices // block_size  # shape: [N]
    # Per-block max absolute value
    abs_tensor = flat_tensor.abs()
    zeros_needed = num_blocks * block_size - N
    # Pad so the length is num_blocks * block_size
    if zeros_needed > 0:
        padded_abs_tensor = torch.cat([abs_tensor,
                                       torch.zeros(zeros_needed, device=abs_tensor.device, dtype=abs_tensor.dtype)])
    else:
        padded_abs_tensor = abs_tensor
    reshaped_abs_tensor = padded_abs_tensor.view(num_blocks, block_size)
    x_max = reshaped_abs_tensor.max(dim=1).values  # shape: [num_blocks]
    # Guard against x_max == 0 to avoid division by zero
    x_max_nonzero = x_max.clone()
    x_max_nonzero[x_max_nonzero == 0] = 1.0
    # Compute scales
    scales = x_max_nonzero / 7.0  # shape: [num_blocks]
    scales = scales.to(torch.bfloat16)
    # Quantize
    scales_expanded = scales[block_indices]  # shape: [N]
    q = torch.round(flat_tensor / scales_expanded).clamp(-8, 7).to(torch.int8)
    # Convert signed int4 to an unsigned representation
    q_unsigned = q & 0x0F  # map [-8, 7] to [0, 15]
    # Pad with one zero if the element count is odd
    if N % 2 != 0:
        q_unsigned = torch.cat([q_unsigned, torch.zeros(1, dtype=torch.int8, device=q.device)])
    # Pack two int4 values into one uint8
    q_pairs = q_unsigned.view(-1, 2)
    int4_tensor = (q_pairs[:, 0].to(torch.uint8) << 4) | q_pairs[:, 1].to(torch.uint8)
    return int4_tensor, scales


torch.set_default_device("cuda")
if len(sys.argv) != 3:
    print(f"{sys.argv[0]} input_model_dir output_dir")
else:
    input_model_dir = sys.argv[1]
    output_dir = sys.argv[2]
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    state_dicts = {}
    for file_path in tqdm(glob(os.path.join(input_model_dir, "*.safetensors"))):
        with safe_open(file_path, framework="pt", device="cuda") as f:
            for name in f.keys():
                param: torch.Tensor = f.get_tensor(name)
                # print(name, param.shape, param.dtype)
                if "norm" in name or "embed" in name:
                    state_dicts[name] = param
                else:
                    if "weight" in name:
                        int4_tensor, scales = quantize_tensor_int4(param)
                        state_dict = {}
                        state_dict["w"] = int4_tensor.data
                        state_dict["scales"] = scales.data
                        state_dict["shape"] = param.shape
                        torch.save(state_dict, os.path.join(output_dir, f"{name}.pt"))
                    else:
                        torch.save(param.data, os.path.join(output_dir, f"{name}.pt"))
    torch.save(state_dicts, os.path.join(output_dir, "others.pt"))
EOF
python extract_weights.py DeepSeek-R1-Distill-Qwen-32B ./data
```

4. Generate the inference code

```shell
cat > infer.py <<'EOF'
import sys
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import time
import numpy as np
import json
from torch.utils.data import Dataset, DataLoader
import threading
from torch import Tensor
from tqdm import tqdm
import triton
import triton.language as tl
import queue

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


@triton.jit
def dequantize_kernel(
    int4_ptr,                  # pointer to the quantized int4 tensor
    scales_ptr,                # pointer to the per-block scales
    output_ptr,                # pointer to the output tensor
    N,                         # total number of elements
    num_blocks,                # total number of blocks
    BLOCK_SIZE: tl.constexpr,  # elements handled per program
):
    # Global element index
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < N
    # Index into the int4 tensor: each uint8 packs two int4 values
    int4_idxs = offs // 2
    int4_vals = tl.load(int4_ptr + int4_idxs, mask=int4_idxs < (N + 1) // 2)
    # Extract the high or low 4 bits
    shift = 4 * (1 - (offs % 2))
    q = (int4_vals >> shift) & 0x0F
    q = q.to(tl.int8)
    # Convert unsigned int4 back to signed: map [0, 15] to [-8, 7]
    q = (q + 8) % 16 - 8
    # Block index of each element
    block_size = 128
    block_idxs = offs // block_size
    scales = tl.load(scales_ptr + block_idxs, mask=block_idxs < num_blocks)
    # Dequantize
    dequantized = q.to(tl.float32) * scales
    # Store the result
    tl.store(output_ptr + offs, dequantized, mask=mask)


def dequantize_tensor_int4_triton(int4_tensor, scales, original_shape):
    N = original_shape.numel()
    num_blocks = scales.numel()
    output = torch.empty(N, dtype=torch.bfloat16, device=int4_tensor.device)
    # Tune the launch block size dynamically (512-1024 recommended on A100)
    BLOCK_SIZE = min(1024, triton.next_power_of_2(N))
    grid = (triton.cdiv(N, BLOCK_SIZE),)
    dequantize_kernel[grid](int4_tensor, scales, output,
                            N, scales.numel(), BLOCK_SIZE=BLOCK_SIZE)
    output = output.view(original_shape)
    return output


def load_pinned_tensor(path):
    data = torch.load(path, map_location="cpu", weights_only=True)  # load to CPU first
    # Recursively pin every tensor in the loaded object
    def _pin(tensor):
        if isinstance(tensor, torch.Tensor):
            return tensor.pin_memory()
        elif isinstance(tensor, dict):
            return {k: _pin(v) for k, v in tensor.items()}
        elif isinstance(tensor, (list, tuple)):
            return type(tensor)(_pin(x) for x in tensor)
        else:
            return tensor
    return _pin(data)


class WeightCache:
    def __init__(self, weight_names, weight_dir, max_cache_size):
        self.weight_names = weight_names
        self.weight_dir = weight_dir
        if max_cache_size == -1:
            self.max_cache_size = len(weight_names)
        else:
            self.max_cache_size = max_cache_size
        self.cache = {}
        self.cache_lock = threading.Lock()
        self.condition = threading.Condition(self.cache_lock)
        self.index = 0
        self.weight_cpu = []
        self.dequantized = {}
        self.accessed_weights = set()  # weights that have been get()'d
        for name in tqdm(self.weight_names):
            weight_path = os.path.join(self.weight_dir, name + ".pt")
            self.weight_cpu.append(load_pinned_tensor(weight_path))
        self.loader_thread = threading.Thread(target=self._loader)
        self.loader_thread.daemon = True
        self.loader_thread.start()
        self.last_ts = time.time()

    def _loader(self):
        stream = torch.cuda.Stream()
        while True:
            with self.condition:
                while len(self.cache) >= self.max_cache_size:
                    # Try evicting a weight that has already been read
                    removed = False
                    for weight_name in list(self.cache.keys()):
                        if weight_name in self.accessed_weights:
                            del self.cache[weight_name]
                            self.accessed_weights.remove(weight_name)
                            removed = True
                            break  # evict one at a time
                    if not removed:
                        self.condition.wait()
                # Pick the next weight to load into the cache
                if self.index >= len(self.weight_names):
                    self.index = 0
                weight_name = self.weight_names[self.index]
                if weight_name in self.cache:
                    time.sleep(0.01)
                    continue
                w = self.weight_cpu[self.index]
            with torch.cuda.stream(stream):
                if "weight" in weight_name:
                    new_weight = {
                        "w": w["w"].to(device, non_blocking=False),
                        "scales": w["scales"].to(device, non_blocking=False),
                        "shape": w["shape"],
                    }
                else:
                    new_weight = w.to(device, non_blocking=False)
            with self.condition:
                self.cache[weight_name] = new_weight
                self.index += 1
                self.condition.notify_all()

    def wait_full(self):
        with self.condition:
            while len(self.cache) < self.max_cache_size:
                self.condition.wait()
            print(len(self.cache), self.max_cache_size)

    def get(self, weight_name):
        with self.condition:
            while weight_name not in self.cache:
                self.condition.wait()
            weight = self.cache[weight_name]       # not removed from the cache here
            self.accessed_weights.add(weight_name) # mark as read, so the loader may evict it
            self.condition.notify_all()
            return weight


class TextGenerationDataset(Dataset):
    def __init__(self, json_data):
        self.data = json.loads(json_data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        input_text = item["input"]
        expected_output = item["expected_output"]
        return input_text, expected_output


class Streamer:
    def __init__(self, tokenizer):
        self.cache = []
        self.tokenizer = tokenizer
        self.start_time = None  # start timestamp
        self.token_count = 0    # number of generated tokens

    def put(self, token):
        if self.start_time is None:
            self.start_time = time.time()  # initialize the start time
        decoded = self.tokenizer.decode(token[0], skip_special_tokens=True)
        self.cache.append(decoded)
        self.token_count += token.numel()  # update the token count
        elapsed_time = time.time() - self.start_time
        tokens_per_sec = self.token_count / elapsed_time if elapsed_time > 0 else 0
        print(f"{tokens_per_sec:.2f} tokens/sec| {''.join(self.cache)}", end="\r", flush=True)

    def end(self):
        total_time = time.time() - self.start_time if self.start_time else 0
        print("\nGeneration complete.")
        if total_time > 0:
            avg_tokens_per_sec = self.token_count / total_time
            print(f"Total tokens: {self.token_count}, total time: {total_time:.2f}s, "
                  f"average speed: {avg_tokens_per_sec:.2f} tokens/sec.")
        else:
            print("Total time too short to compute tokens per second.")


class MyLinear(torch.nn.Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = True  # placeholder; the real weight lives in the cache
        if bias:
            self.bias = True
        else:
            self.bias = False

    def forward(self, x):
        # Fetch the packed weight from the device cache and dequantize it on the fly
        w = self.weight_cache.get(f"{self.w_name}.weight")
        weight = dequantize_tensor_int4_triton(w["w"], w["scales"], w["shape"])
        if self.bias:
            bias = self.weight_cache.get(f"{self.w_name}.bias")
        else:
            bias = None
        return torch.nn.functional.linear(x, weight, bias)


def set_linear_name(model, weight_cache):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.w_name = name
            module.weight_cache = weight_cache


torch.nn.Linear = MyLinear

input_model_dir = sys.argv[1]
input_weights_dir = sys.argv[2]
cache_queue_size = int(sys.argv[3])
torch.set_default_device("cuda")
tokenizer = AutoTokenizer.from_pretrained(input_model_dir)
from transformers.models.qwen2 import Qwen2ForCausalLM, Qwen2Config
config = Qwen2Config.from_pretrained(f"{input_model_dir}/config.json")
config.use_cache = True
config.torch_dtype = torch.float16
config._attn_implementation = "flash_attention_2"
model = Qwen2ForCausalLM(config).bfloat16().to(device)
checkpoint = torch.load(f"{input_weights_dir}/others.pt", weights_only=True)
model.load_state_dict(checkpoint)

weight_map = []
with open(os.path.join(input_model_dir, "model.safetensors.index.json")) as f:
    for name in json.load(f)["weight_map"].keys():
        if "norm" in name or "embed" in name:
            pass
        else:
            weight_map.append(name)
json_data = r"""
[{"input": "1.1+2.3=?", "expected_output": "TODO"}]
"""
weight_cache = WeightCache(weight_map, input_weights_dir, cache_queue_size)
print("wait done")
set_linear_name(model, weight_cache)
model.eval()

test_dataset = TextGenerationDataset(json_data)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)

dataloader_iter = iter(test_dataloader)
input_text, expected_output = next(dataloader_iter)
inputs = tokenizer(input_text, return_tensors="pt").to(device)
streamer = Streamer(tokenizer)

if True:
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        with torch.inference_mode():
            # outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
            #                          do_sample=True, pad_token_id=tokenizer.eos_token_id,
            #                          num_beams=1, repetition_penalty=1.1)
            outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
                                     use_cache=config.use_cache)
else:
    def trace_handler(prof):
        print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))
        prof.export_chrome_trace("output.json")

    import torch.autograd.profiler as profiler
    from torch.profiler import profile, record_function, ProfilerActivity
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
            on_trace_ready=trace_handler) as p:
        stream = torch.cuda.Stream()
        with torch.cuda.stream(stream):
            with torch.inference_mode():
                outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
                                         use_cache=True, do_sample=True,
                                         pad_token_id=tokenizer.eos_token_id,
                                         num_beams=1, repetition_penalty=1.1)
                # outputs = model.generate(**inputs, max_length=4096, streamer=streamer)
                # outputs = model.generate(**inputs, max_length=8)
        p.step()
EOF
```

5. Run

A. Cache capped at 128 entries

```shell
export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data 128
```

Performance:

Total tokens: 403, total time: 572.34s, average speed: 0.70 tokens/sec.

GPU utilization (nvidia-smi excerpt):

```
|   0  NVIDIA GeForce RTX 3090     Off | 00000000:03:00.0 Off |                  N/A |
| 71%   62C    P0     186W / 350W      | 10354MiB / 24576MiB  |     95%      Default |
|    0   N/A  N/A     47129      C   python                              10258MiB    |
```

B. Unlimited cache

```shell
export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data -1
```

Performance:

Total tokens: 403, total time: 72.84s, average speed: 5.53 tokens/sec.

GPU utilization (nvidia-smi excerpt):

```
|   0  NVIDIA GeForce RTX 3090     Off | 00000000:03:00.0 Off |                  N/A |
| 73%   65C    P0     330W / 350W      | 22678MiB / 24576MiB  |     97%      Default |
|    0   N/A  N/A     47903      C   python                              22582MiB    |
```

C. Output

```
1.1+2.3=?
Let me think. Okay, so I need to add 1.1 and 2.3 together. Hmm, let me visualize this.
I remember that when adding decimals, it's important to line up the decimal points to
make sure each place value is correctly added. So, I can write them one under the
other like this:

  1.1
+ 2.3
-----

Starting from the rightmost digit, which is the tenths place. 1 (from 1.1) plus
3 (from 2.3) equals 4. So, I write down 4 in the tenths place.
Next, moving to the units place. 1 (from 1.1) plus 2 (from 2.3) equals 3. So, I
write down 3 in the units place.
Putting it all together, the sum is 3.4. Let me double-check to make sure I didn't
make a mistake. 1.1 plus 2 is 3.1, and then adding the 0.3 more gives me 3.4. Yep,
that seems right. I think I got it! The answer should be 3.4.

To add 1.1 and 2.3, follow these steps:

Align the decimal points:
  1.1
+ 2.3
-----

Add the tenths place: 1 (from 1.1) + 3 (from 2.3) = 4
Add the units place: 1 (from 1.1) + 2 (from 2.3) = 3
Combine the results: 3.4

Final Answer: \boxed{3.4}
Generation complete.
```
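The packing and sign-mapping conventions used by extract_weights.py and the Triton kernel (high nibble first, `q & 0x0F` on the way in, `(q + 8) % 16 - 8` on the way out) can be exercised on CPU without a GPU. The sketch below is a minimal stand-alone re-implementation for verification, not the article's code; the function names are my own:

```python
import torch

def quantize_block(x, block_size=128):
    """Quantize a 1-D float tensor to packed int4 nibbles with per-block scales."""
    N = x.numel()
    num_blocks = (N + block_size - 1) // block_size
    pad = num_blocks * block_size - N
    xp = torch.cat([x, torch.zeros(pad)]).view(num_blocks, block_size)
    scales = xp.abs().max(dim=1).values.clamp(min=1e-12) / 7.0  # int4 range is [-8, 7]
    q = torch.round(xp / scales[:, None]).clamp(-8, 7).to(torch.int8).flatten()
    qu = q & 0x0F                                 # signed int4 -> unsigned nibble
    pairs = qu.view(-1, 2)                        # block_size is even, so this divides
    packed = (pairs[:, 0].to(torch.uint8) << 4) | pairs[:, 1].to(torch.uint8)
    return packed, scales

def dequantize_block(packed, scales, N, block_size=128):
    """Undo quantize_block: unpack nibbles, restore sign, multiply by scales."""
    hi = (packed >> 4).to(torch.int8)
    lo = (packed & 0x0F).to(torch.int8)
    qu = torch.stack([hi, lo], dim=1).flatten()[:N]
    q = (qu + 8) % 16 - 8                         # unsigned nibble -> signed [-8, 7]
    block_idx = torch.arange(N) // block_size
    return q.to(torch.float32) * scales[block_idx]

x = torch.randn(300)                              # deliberately not a multiple of 128
packed, scales = quantize_block(x)
x_hat = dequantize_block(packed, scales, x.numel())
# Per-element error is at most half a quantization step of its block
print("max abs error:", (x - x_hat).abs().max().item())
```

Because each block's scale is max|x|/7, every element maps into [-7, 7] before rounding, so the clamp never distorts values and the round-trip error stays within scale/2 per block.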
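As a rough sanity check on why INT4 makes the weights fit in host memory comfortably: at a nominal 32B parameters, 4-bit weights plus one bf16 scale per 128-element block come to about a quarter of the bf16 footprint. A back-of-the-envelope calculation (nominal parameter count, ignoring the unquantized norm and embedding tensors):

```python
params = 32e9                              # nominal parameter count, approximate
int4_bytes = params * 0.5                  # 4 bits per weight
scale_bytes = (params / 128) * 2           # one bf16 scale per 128-weight block
total_gb = (int4_bytes + scale_bytes) / 2**30
bf16_gb = params * 2 / 2**30
print(f"int4: {total_gb:.1f} GiB vs bf16: {bf16_gb:.1f} GiB")
```

Roughly 15 GiB of packed weights versus about 60 GiB in bf16, which is why the full quantized model fits in host RAM while only a working set needs to live in the 24 GB of device memory.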

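The WeightCache above is a bounded producer-consumer cache: one loader thread fills device memory ahead of the forward pass, and an entry becomes evictable only after some consumer has get() it. The CPU-only sketch below isolates that synchronization pattern with a single threading.Condition; the class and function names are hypothetical and no CUDA is involved:

```python
import threading

class BoundedPrefetchCache:
    """A loader thread fills a bounded cache in key order; consumers
    mark entries as read, which makes them evictable."""
    def __init__(self, keys, load_fn, max_size):
        self.keys = keys
        self.load_fn = load_fn      # how to materialize a value (stand-in for H2D copy)
        self.max_size = max_size
        self.cache = {}
        self.accessed = set()       # keys read at least once
        self.cond = threading.Condition()
        threading.Thread(target=self._loader, daemon=True).start()

    def _loader(self):
        i = 0
        while True:
            with self.cond:
                # Block while the cache is full and nothing is evictable yet
                while len(self.cache) >= self.max_size:
                    evictable = [k for k in self.cache if k in self.accessed]
                    if evictable:
                        del self.cache[evictable[0]]
                        self.accessed.discard(evictable[0])
                    else:
                        self.cond.wait()
                key = self.keys[i % len(self.keys)]  # round-robin over all keys
                i += 1
                if key in self.cache:
                    continue
            value = self.load_fn(key)                # expensive work outside the lock
            with self.cond:
                self.cache[key] = value
                self.cond.notify_all()

    def get(self, key):
        with self.cond:
            while key not in self.cache:
                self.cond.wait()
            self.accessed.add(key)                   # now evictable by the loader
            self.cond.notify_all()
            return self.cache[key]

cache = BoundedPrefetchCache([f"layer{i}" for i in range(6)], str.upper, max_size=2)
print([cache.get(f"layer{i}") for i in range(6)])    # six uppercase names, in order
```

Sequential access in loader order never deadlocks here: the loader stays at most max_size entries ahead, and each get() both unblocks itself (waiting until the entry arrives) and unblocks the loader (marking the entry evictable). The same property is what lets the article's version run with a cache of only 128 entries.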