Running DeepSeek-R1-Distill-Qwen-32B on a 24 GB GPU

Contents:
I. Background
II. Solution
III. Steps
  1. Download the model
  2. Install dependencies
  3. Quantize
  4. Generate the inference code
  5. Run
    A. Cache capped at 128 entries
    B. Unlimited cache
    C. Output

I. Background

With the rapid development of deep learning, large language models (LLMs) have demonstrated remarkable capabilities in natural language processing. However, as parameter counts grow exponentially, the compute resources needed to run these models, and in particular GPU memory, have become enormous. Running an LLM whose weights exceed the available GPU memory is therefore a pressing challenge.

This article demonstrates how to run an LLM that does not fit in GPU memory. By quantizing the model weights and optimizing the memory-management strategy, we aim to work around the hardware bottleneck and offer a new approach to deploying large models.

II. Solution

The solution combines weight quantization, a device-memory cache, and a custom Linear module:

1. INT4 block quantization of weights
  - Quantization strategy: weights are quantized to INT4 (4-bit integers) in blocks of 128 elements, which drastically reduces the storage they occupy.
  - Memory benefit: after INT4 quantization, all weights fit in host (CPU) memory. This relieves GPU-memory pressure and sets up fast subsequent reads.

2. Reduced disk I/O
  - Full preload: all quantized INT4 weights are loaded into host memory once, so the model never reads from disk during inference, eliminating disk I/O latency and throughput bottlenecks.

3. Device-memory cache
  - Design: a cache in GPU device memory holds at most N entries. N depends on the specific GPU configuration; the goal is to use as much free device memory as possible to maximize read efficiency.
  - Dynamic management: the cache allocates and frees entries intelligently so that data is served efficiently without exceeding the device-memory limit.

4. Weight-prefetch thread
  - Separation of concerns: a dedicated prefetch thread dequantizes INT4 weights from host memory back into the compute dtype and loads them into the device-memory cache.
  - Efficiency: asynchronous prefetching keeps weights ready when the model needs them, minimizing stalls.

5. Custom Linear module
  - Replacement: nn.Linear is replaced with a custom Module during model construction and loading, and this module carries the linear computation.
  - Mechanism: in forward(), the module fetches its weight from the device cache, computes, and the weight's device memory is then freed for later layers.
  - Benefit: this load-on-demand scheme keeps weights from occupying device memory for the entire run, greatly improving memory utilization.

III. Steps

1. Download the model

```shell
# Model card: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

# Download the model
apt install git-lfs -y
git clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B.git
```

2. Install dependencies

```shell
MAX_JOBS=4 pip install flash-attn==2.3.6
pip install torch-tb-profiler
```

3. Quantize

```shell
cat > extract_weights.py <<'EOF'
import torch
import os
import sys
from tqdm import tqdm
from glob import glob
from safetensors.torch import safe_open, save_file


def quantize_tensor_int4(tensor):
    """
    Quantize a bfloat16 tensor to int4 with a block size of 128 and
    return the per-block scales.

    Args:
        tensor (torch.Tensor): bfloat16 input tensor.
    Returns:
        int4_tensor (torch.Tensor): quantized uint8 tensor; each element
            packs two int4 values.
        scales (torch.Tensor): bfloat16 scale for each block.
    """
    # Ensure the input is bfloat16
    tensor = tensor.to(torch.bfloat16)
    # Flatten to 1-D
    flat_tensor = tensor.flatten()
    N = flat_tensor.numel()
    block_size = 128
    num_blocks = (N + block_size - 1) // block_size  # number of blocks
    # Block index of each element
    indices = torch.arange(N, device=flat_tensor.device)
    block_indices = indices // block_size  # shape: [N]
    # Per-block max absolute value
    abs_tensor = flat_tensor.abs()
    zeros_needed = num_blocks * block_size - N
    # Pad so the length is num_blocks * block_size
    if zeros_needed > 0:
        padded_abs_tensor = torch.cat([abs_tensor,
                                       torch.zeros(zeros_needed, device=abs_tensor.device, dtype=abs_tensor.dtype)])
    else:
        padded_abs_tensor = abs_tensor
    reshaped_abs_tensor = padded_abs_tensor.view(num_blocks, block_size)
    x_max = reshaped_abs_tensor.max(dim=1).values  # shape: [num_blocks]
    # Guard against x_max == 0 to avoid division by zero
    x_max_nonzero = x_max.clone()
    x_max_nonzero[x_max_nonzero == 0] = 1.0
    # Compute scales
    scales = x_max_nonzero / 7.0  # shape: [num_blocks]
    scales = scales.to(torch.bfloat16)
    # Quantize
    scales_expanded = scales[block_indices]  # shape: [N]
    q = torch.round(flat_tensor / scales_expanded).clamp(-8, 7).to(torch.int8)
    # Convert signed int4 to an unsigned representation
    q_unsigned = q & 0x0F  # map [-8, 7] to [0, 15]
    # Pad with one zero if the element count is odd
    if N % 2 != 0:
        q_unsigned = torch.cat([q_unsigned, torch.zeros(1, dtype=torch.int8, device=q.device)])
    # Pack two int4 values into one uint8
    q_pairs = q_unsigned.view(-1, 2)
    int4_tensor = (q_pairs[:, 0].to(torch.uint8) << 4) | q_pairs[:, 1].to(torch.uint8)
    return int4_tensor, scales


torch.set_default_device("cuda")
if len(sys.argv) != 3:
    print(f"{sys.argv[0]} input_model_dir output_dir")
else:
    input_model_dir = sys.argv[1]
    output_dir = sys.argv[2]
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    state_dicts = {}
    for file_path in tqdm(glob(os.path.join(input_model_dir, "*.safetensors"))):
        with safe_open(file_path, framework="pt", device="cuda") as f:
            for name in f.keys():
                param: torch.Tensor = f.get_tensor(name)
                # print(name, param.shape, param.dtype)
                if "norm" in name or "embed" in name:
                    state_dicts[name] = param
                else:
                    if "weight" in name:
                        int4_tensor, scales = quantize_tensor_int4(param)
                        state_dict = {}
                        state_dict["w"] = int4_tensor.data
                        state_dict["scales"] = scales.data
                        state_dict["shape"] = param.shape
                        torch.save(state_dict, os.path.join(output_dir, f"{name}.pt"))
                    else:
                        torch.save(param.data, os.path.join(output_dir, f"{name}.pt"))
    torch.save(state_dicts, os.path.join(output_dir, "others.pt"))
EOF
python extract_weights.py DeepSeek-R1-Distill-Qwen-32B ./data
```

4. Generate the inference code

```shell
cat > infer.py <<'EOF'
import sys
import os
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import time
import numpy as np
import json
from torch.utils.data import Dataset, DataLoader
import threading
from torch import Tensor
from tqdm import tqdm
import triton
import triton.language as tl
import queue

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


@triton.jit
def dequantize_kernel(
    int4_ptr,                  # pointer to the quantized int4 tensor
    scales_ptr,                # pointer to the per-block scales
    output_ptr,                # pointer to the output tensor
    N,                         # total number of elements
    num_blocks,                # total number of blocks
    BLOCK_SIZE: tl.constexpr,  # elements handled per program
):
    # Global element index
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offs < N
    # Index into the int4 tensor: each uint8 packs two int4 values
    int4_idxs = offs // 2
    int4_vals = tl.load(int4_ptr + int4_idxs, mask=int4_idxs < (N + 1) // 2)
    # Extract the high or low 4 bits
    shift = 4 * (1 - (offs % 2))
    q = (int4_vals >> shift) & 0x0F
    q = q.to(tl.int8)
    # Convert unsigned int4 back to signed: map [0, 15] to [-8, 7]
    q = (q + 8) % 16 - 8
    # Block index of each element
    block_size = 128
    block_idxs = offs // block_size
    scales = tl.load(scales_ptr + block_idxs, mask=block_idxs < num_blocks)
    # Dequantize
    dequantized = q.to(tl.float32) * scales
    # Store the result
    tl.store(output_ptr + offs, dequantized, mask=mask)


def dequantize_tensor_int4_triton(int4_tensor, scales, original_shape):
    N = original_shape.numel()
    num_blocks = scales.numel()
    output = torch.empty(N, dtype=torch.bfloat16, device=int4_tensor.device)
    # Tune the launch block size dynamically (512-1024 recommended on A100)
    BLOCK_SIZE = min(1024, triton.next_power_of_2(N))
    grid = (triton.cdiv(N, BLOCK_SIZE),)
    dequantize_kernel[grid](int4_tensor, scales, output,
                            N, scales.numel(), BLOCK_SIZE=BLOCK_SIZE)
    output = output.view(original_shape)
    return output


def load_pinned_tensor(path):
    data = torch.load(path, map_location="cpu", weights_only=True)  # load to CPU first
    # Recursively pin every tensor in the loaded object
    def _pin(tensor):
        if isinstance(tensor, torch.Tensor):
            return tensor.pin_memory()
        elif isinstance(tensor, dict):
            return {k: _pin(v) for k, v in tensor.items()}
        elif isinstance(tensor, (list, tuple)):
            return type(tensor)(_pin(x) for x in tensor)
        else:
            return tensor
    return _pin(data)


class WeightCache:
    def __init__(self, weight_names, weight_dir, max_cache_size):
        self.weight_names = weight_names
        self.weight_dir = weight_dir
        if max_cache_size == -1:
            self.max_cache_size = len(weight_names)
        else:
            self.max_cache_size = max_cache_size
        self.cache = {}
        self.cache_lock = threading.Lock()
        self.condition = threading.Condition(self.cache_lock)
        self.index = 0
        self.weight_cpu = []
        self.dequantized = {}
        self.accessed_weights = set()  # weights that have been get()'d
        for name in tqdm(self.weight_names):
            weight_path = os.path.join(self.weight_dir, name + ".pt")
            self.weight_cpu.append(load_pinned_tensor(weight_path))
        self.loader_thread = threading.Thread(target=self._loader)
        self.loader_thread.daemon = True
        self.loader_thread.start()
        self.last_ts = time.time()

    def _loader(self):
        stream = torch.cuda.Stream()
        while True:
            with self.condition:
                while len(self.cache) >= self.max_cache_size:
                    # Try evicting a weight that has already been read
                    removed = False
                    for weight_name in list(self.cache.keys()):
                        if weight_name in self.accessed_weights:
                            del self.cache[weight_name]
                            self.accessed_weights.remove(weight_name)
                            removed = True
                            break  # evict one at a time
                    if not removed:
                        self.condition.wait()
                # Pick the next weight to load into the cache
                if self.index >= len(self.weight_names):
                    self.index = 0
                weight_name = self.weight_names[self.index]
                if weight_name in self.cache:
                    time.sleep(0.01)
                    continue
                w = self.weight_cpu[self.index]
            with torch.cuda.stream(stream):
                if "weight" in weight_name:
                    new_weight = {
                        "w": w["w"].to(device, non_blocking=False),
                        "scales": w["scales"].to(device, non_blocking=False),
                        "shape": w["shape"],
                    }
                else:
                    new_weight = w.to(device, non_blocking=False)
            with self.condition:
                self.cache[weight_name] = new_weight
                self.index += 1
                self.condition.notify_all()

    def wait_full(self):
        with self.condition:
            while len(self.cache) < self.max_cache_size:
                self.condition.wait()
            print(len(self.cache), self.max_cache_size)

    def get(self, weight_name):
        with self.condition:
            while weight_name not in self.cache:
                self.condition.wait()
            weight = self.cache[weight_name]       # not removed from the cache here
            self.accessed_weights.add(weight_name) # mark as read, so the loader may evict it
            self.condition.notify_all()
            return weight


class TextGenerationDataset(Dataset):
    def __init__(self, json_data):
        self.data = json.loads(json_data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        input_text = item["input"]
        expected_output = item["expected_output"]
        return input_text, expected_output


class Streamer:
    def __init__(self, tokenizer):
        self.cache = []
        self.tokenizer = tokenizer
        self.start_time = None  # start timestamp
        self.token_count = 0    # number of generated tokens

    def put(self, token):
        if self.start_time is None:
            self.start_time = time.time()  # initialize the start time
        decoded = self.tokenizer.decode(token[0], skip_special_tokens=True)
        self.cache.append(decoded)
        self.token_count += token.numel()  # update the token count
        elapsed_time = time.time() - self.start_time
        tokens_per_sec = self.token_count / elapsed_time if elapsed_time > 0 else 0
        print(f"{tokens_per_sec:.2f} tokens/sec| {''.join(self.cache)}", end="\r", flush=True)

    def end(self):
        total_time = time.time() - self.start_time if self.start_time else 0
        print("\nGeneration complete.")
        if total_time > 0:
            avg_tokens_per_sec = self.token_count / total_time
            print(f"Total tokens: {self.token_count}, total time: {total_time:.2f}s, "
                  f"average speed: {avg_tokens_per_sec:.2f} tokens/sec.")
        else:
            print("Total time too short to compute tokens per second.")


class MyLinear(torch.nn.Module):
    __constants__ = ["in_features", "out_features"]
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {"device": device, "dtype": dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = True  # placeholder; the real weight lives in the cache
        if bias:
            self.bias = True
        else:
            self.bias = False

    def forward(self, x):
        # Fetch the packed weight from the device cache and dequantize it on the fly
        w = self.weight_cache.get(f"{self.w_name}.weight")
        weight = dequantize_tensor_int4_triton(w["w"], w["scales"], w["shape"])
        if self.bias:
            bias = self.weight_cache.get(f"{self.w_name}.bias")
        else:
            bias = None
        return torch.nn.functional.linear(x, weight, bias)


def set_linear_name(model, weight_cache):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.w_name = name
            module.weight_cache = weight_cache


torch.nn.Linear = MyLinear

input_model_dir = sys.argv[1]
input_weights_dir = sys.argv[2]
cache_queue_size = int(sys.argv[3])
torch.set_default_device("cuda")
tokenizer = AutoTokenizer.from_pretrained(input_model_dir)
from transformers.models.qwen2 import Qwen2ForCausalLM, Qwen2Config
config = Qwen2Config.from_pretrained(f"{input_model_dir}/config.json")
config.use_cache = True
config.torch_dtype = torch.float16
config._attn_implementation = "flash_attention_2"
model = Qwen2ForCausalLM(config).bfloat16().to(device)
checkpoint = torch.load(f"{input_weights_dir}/others.pt", weights_only=True)
model.load_state_dict(checkpoint)

weight_map = []
with open(os.path.join(input_model_dir, "model.safetensors.index.json")) as f:
    for name in json.load(f)["weight_map"].keys():
        if "norm" in name or "embed" in name:
            pass
        else:
            weight_map.append(name)
json_data = r"""
[{"input": "1.1+2.3=?", "expected_output": "TODO"}]
"""
weight_cache = WeightCache(weight_map, input_weights_dir, cache_queue_size)
print("wait done")
set_linear_name(model, weight_cache)
model.eval()

test_dataset = TextGenerationDataset(json_data)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)

dataloader_iter = iter(test_dataloader)
input_text, expected_output = next(dataloader_iter)
inputs = tokenizer(input_text, return_tensors="pt").to(device)
streamer = Streamer(tokenizer)

if True:
    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        with torch.inference_mode():
            # outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
            #                          do_sample=True, pad_token_id=tokenizer.eos_token_id,
            #                          num_beams=1, repetition_penalty=1.1)
            outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
                                     use_cache=config.use_cache)
else:
    def trace_handler(prof):
        print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=-1))
        prof.export_chrome_trace("output.json")

    import torch.autograd.profiler as profiler
    from torch.profiler import profile, record_function, ProfilerActivity
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
            on_trace_ready=trace_handler) as p:
        stream = torch.cuda.Stream()
        with torch.cuda.stream(stream):
            with torch.inference_mode():
                outputs = model.generate(**inputs, max_length=4096, streamer=streamer,
                                         use_cache=True, do_sample=True,
                                         pad_token_id=tokenizer.eos_token_id,
                                         num_beams=1, repetition_penalty=1.1)
                # outputs = model.generate(**inputs, max_length=4096, streamer=streamer)
                # outputs = model.generate(**inputs, max_length=8)
        p.step()
EOF
```

5. Run

A. Cache capped at 128 entries

```shell
export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data 128
```

Performance:

Total tokens: 403, total time: 572.34s, average speed: 0.70 tokens/sec.

GPU utilization (nvidia-smi excerpt):

```
|   0  NVIDIA GeForce RTX 3090     Off | 00000000:03:00.0 Off |                  N/A |
| 71%   62C    P0     186W / 350W      | 10354MiB / 24576MiB  |     95%      Default |
|    0   N/A  N/A     47129      C   python                              10258MiB    |
```

B. Unlimited cache

```shell
export TRITON_CACHE_DIR=$PWD/cache
python infer.py DeepSeek-R1-Distill-Qwen-32B data -1
```

Performance:

Total tokens: 403, total time: 72.84s, average speed: 5.53 tokens/sec.

GPU utilization (nvidia-smi excerpt):

```
|   0  NVIDIA GeForce RTX 3090     Off | 00000000:03:00.0 Off |                  N/A |
| 73%   65C    P0     330W / 350W      | 22678MiB / 24576MiB  |     97%      Default |
|    0   N/A  N/A     47903      C   python                              22582MiB    |
```

C. Output

```
1.1+2.3=?
Let me think. Okay, so I need to add 1.1 and 2.3 together. Hmm, let me visualize this.
I remember that when adding decimals, it's important to line up the decimal points to
make sure each place value is correctly added. So, I can write them one under the
other like this:

  1.1
+ 2.3
-----

Starting from the rightmost digit, which is the tenths place. 1 (from 1.1) plus
3 (from 2.3) equals 4. So, I write down 4 in the tenths place.
Next, moving to the units place. 1 (from 1.1) plus 2 (from 2.3) equals 3. So, I
write down 3 in the units place.
Putting it all together, the sum is 3.4. Let me double-check to make sure I didn't
make a mistake. 1.1 plus 2 is 3.1, and then adding the 0.3 more gives me 3.4. Yep,
that seems right. I think I got it! The answer should be 3.4.

To add 1.1 and 2.3, follow these steps:

Align the decimal points:
  1.1
+ 2.3
-----

Add the tenths place: 1 (from 1.1) + 3 (from 2.3) = 4
Add the units place: 1 (from 1.1) + 2 (from 2.3) = 3
Combine the results: 3.4

Final Answer: \boxed{3.4}
Generation complete.
```
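The packing and sign-mapping conventions used by extract_weights.py and the Triton kernel (high nibble first, `q & 0x0F` on the way in, `(q + 8) % 16 - 8` on the way out) can be exercised on CPU without a GPU. The sketch below is a minimal stand-alone re-implementation for verification, not the article's code; the function names are my own:

```python
import torch

def quantize_block(x, block_size=128):
    """Quantize a 1-D float tensor to packed int4 nibbles with per-block scales."""
    N = x.numel()
    num_blocks = (N + block_size - 1) // block_size
    pad = num_blocks * block_size - N
    xp = torch.cat([x, torch.zeros(pad)]).view(num_blocks, block_size)
    scales = xp.abs().max(dim=1).values.clamp(min=1e-12) / 7.0  # int4 range is [-8, 7]
    q = torch.round(xp / scales[:, None]).clamp(-8, 7).to(torch.int8).flatten()
    qu = q & 0x0F                                 # signed int4 -> unsigned nibble
    pairs = qu.view(-1, 2)                        # block_size is even, so this divides
    packed = (pairs[:, 0].to(torch.uint8) << 4) | pairs[:, 1].to(torch.uint8)
    return packed, scales

def dequantize_block(packed, scales, N, block_size=128):
    """Undo quantize_block: unpack nibbles, restore sign, multiply by scales."""
    hi = (packed >> 4).to(torch.int8)
    lo = (packed & 0x0F).to(torch.int8)
    qu = torch.stack([hi, lo], dim=1).flatten()[:N]
    q = (qu + 8) % 16 - 8                         # unsigned nibble -> signed [-8, 7]
    block_idx = torch.arange(N) // block_size
    return q.to(torch.float32) * scales[block_idx]

x = torch.randn(300)                              # deliberately not a multiple of 128
packed, scales = quantize_block(x)
x_hat = dequantize_block(packed, scales, x.numel())
# Per-element error is at most half a quantization step of its block
print("max abs error:", (x - x_hat).abs().max().item())
```

Because each block's scale is max|x|/7, every element maps into [-7, 7] before rounding, so the clamp never distorts values and the round-trip error stays within scale/2 per block.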
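As a rough sanity check on why INT4 makes the weights fit in host memory comfortably: at a nominal 32B parameters, 4-bit weights plus one bf16 scale per 128-element block come to about a quarter of the bf16 footprint. A back-of-the-envelope calculation (nominal parameter count, ignoring the unquantized norm and embedding tensors):

```python
params = 32e9                              # nominal parameter count, approximate
int4_bytes = params * 0.5                  # 4 bits per weight
scale_bytes = (params / 128) * 2           # one bf16 scale per 128-weight block
total_gb = (int4_bytes + scale_bytes) / 2**30
bf16_gb = params * 2 / 2**30
print(f"int4: {total_gb:.1f} GiB vs bf16: {bf16_gb:.1f} GiB")
```

Roughly 15 GiB of packed weights versus about 60 GiB in bf16, which is why the full quantized model fits in host RAM while only a working set needs to live in the 24 GB of device memory.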

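The WeightCache above is a bounded producer-consumer cache: one loader thread fills device memory ahead of the forward pass, and an entry becomes evictable only after some consumer has get() it. The CPU-only sketch below isolates that synchronization pattern with a single threading.Condition; the class and function names are hypothetical and no CUDA is involved:

```python
import threading

class BoundedPrefetchCache:
    """A loader thread fills a bounded cache in key order; consumers
    mark entries as read, which makes them evictable."""
    def __init__(self, keys, load_fn, max_size):
        self.keys = keys
        self.load_fn = load_fn      # how to materialize a value (stand-in for H2D copy)
        self.max_size = max_size
        self.cache = {}
        self.accessed = set()       # keys read at least once
        self.cond = threading.Condition()
        threading.Thread(target=self._loader, daemon=True).start()

    def _loader(self):
        i = 0
        while True:
            with self.cond:
                # Block while the cache is full and nothing is evictable yet
                while len(self.cache) >= self.max_size:
                    evictable = [k for k in self.cache if k in self.accessed]
                    if evictable:
                        del self.cache[evictable[0]]
                        self.accessed.discard(evictable[0])
                    else:
                        self.cond.wait()
                key = self.keys[i % len(self.keys)]  # round-robin over all keys
                i += 1
                if key in self.cache:
                    continue
            value = self.load_fn(key)                # expensive work outside the lock
            with self.cond:
                self.cache[key] = value
                self.cond.notify_all()

    def get(self, key):
        with self.cond:
            while key not in self.cache:
                self.cond.wait()
            self.accessed.add(key)                   # now evictable by the loader
            self.cond.notify_all()
            return self.cache[key]

cache = BoundedPrefetchCache([f"layer{i}" for i in range(6)], str.upper, max_size=2)
print([cache.get(f"layer{i}") for i in range(6)])    # six uppercase names, in order
```

Sequential access in loader order never deadlocks here: the loader stays at most max_size entries ahead, and each get() both unblocks itself (waiting until the entry arrives) and unblocks the loader (marking the entry evictable). The same property is what lets the article's version run with a cache of only 128 entries.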