A very easy-to-use LLM inference framework: bigdl-llm (now renamed ipex-llm)

Contents: GitHub repo · Environment · Install dependencies · Download a test model · Load and optimize a pretrained model · Build a chat application with the optimized model
IPEX-LLM is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency.
It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc. It provides seamless integration with llama.cpp, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc. 50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
GitHub repo

https://github.com/intel-analytics/ipex-llm

Environment

- Ubuntu 22.04 LTS
- Python 3.11
Install dependencies

pip install --pre --upgrade bigdl-llm[all] -i https://mirrors.aliyun.com/pypi/simple/

Download a test model
Configure your environment as described in this article and you can download large models at full speed: "Rapidly download LLM models from huggingface without a VPN".
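For reference, one common way to speed up such downloads is to route huggingface-cli through a mirror endpoint via the HF_ENDPOINT environment variable. A minimal config sketch (hf-mirror.com is a community mirror and an assumption on my part, not necessarily what the linked article uses):

```shell
# Route subsequent huggingface-cli / huggingface_hub traffic through a
# mirror endpoint (hf-mirror.com is an assumed community mirror; substitute
# whichever mirror the article you follow recommends)
export HF_ENDPOINT=https://hf-mirror.com
# ...then run the same `huggingface-cli download` command as below
```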
Download command:

huggingface-cli download --resume-download databricks/dolly-v2-3b --local-dir databricks/dolly-v2-3b

Load and optimize a pretrained model
Load and optimize the model:

from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "openlm-research/open_llama_3b_v2"
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
Save the optimized model:

save_directory = "./open-llama-3b-v2-bigdl-llm-INT4"
model.save_low_bit(save_directory)
del model

Load the optimized model back:

model = AutoModelForCausalLM.load_low_bit(save_directory)
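To sanity-check that the saved INT4 checkpoint really is smaller than the original fp16 weights, you can compare on-disk sizes. A minimal sketch — the `dir_size_mb` helper is my own illustration, not part of bigdl-llm:

```python
import os

def dir_size_mb(path):
    """Sum the sizes of all files under `path`, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

# Example (assumes the save step above has run):
# print(f"INT4 checkpoint: {dir_size_mb('./open-llama-3b-v2-bigdl-llm-INT4'):.1f} MB")
```

For a 3B-parameter model you would expect the INT4 directory to be roughly a quarter of the fp16 checkpoint's size.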
Build a chat application with the optimized model
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
import torch

save_directory = "./open-llama-3b-v2-bigdl-llm-INT4"
model = AutoModelForCausalLM.load_low_bit(save_directory)
# the original snippet used `tokenizer` without defining it; load it from the
# original model repo
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b_v2")

with torch.inference_mode():
    prompt = "Q: What is CPU?\nA:"
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print("-" * 20, "Output", "-" * 20)
    print(output_str)

Output:
-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.
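Note that the example above is single-turn: generation stops only at max_new_tokens, which is why the model invents a follow-up "Q: What is RAM?" pair on its own. To turn this into an actual chat loop you need to carry the conversation history in the prompt. A minimal sketch — `build_prompt` is my own illustration, not a bigdl-llm API:

```python
def build_prompt(history, question):
    """Format past (question, answer) pairs plus the new question in the
    same "Q: ...\\nA:" style used by the example above."""
    parts = []
    for q, a in history:
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

# Usage: feed the result to tokenizer.encode(...) as before, then append
# (question, model_answer) to `history` after each turn.
```

Stripping the echoed prompt and anything after the next "Q:" from the decoded output gives you the answer for the current turn.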
Other related APIs are documented here: https://github.com/intel-analytics/bigdl-llm-tutorial/blob/main/Chinese_Version/ch_3_AppDev_Basic/3_BasicApp.ipynb