怎么样利用一些网站开发客户,关于网站建设的工作总结,网站制作优化推广,国外的平面设计网站简介
Chinese-LLaMA-Alpaca-2基于Meta发布的可商用大模型Llama-2开发, 是中文LLaMAAlpaca大模型的第二期项目.
量化
模型的下载还是应用脚本
bash hfd.sh hfl/chinese-alpaca-2-13b --tool aria2c -x 8应用llama.cpp进行量化, 主要参考该教程. 其中比较折腾的是与BLAS…简介
Chinese-LLaMA-Alpaca-2基于Meta发布的可商用大模型Llama-2开发, 是中文LLaMAAlpaca大模型的第二期项目.
量化
模型的下载还是应用脚本
bash hfd.sh hfl/chinese-alpaca-2-13b --tool aria2c -x 8应用llama.cpp进行量化, 主要参考该教程. 其中比较折腾的是与BLAS一起编译.
OpenBLAS
这个真是一言难尽, 非常折腾也没起作用(issue1 issue2). 而且提升很小, 后续再尝试能不能成功.
cuBLAS
这个提升较为明显, 在有Nvidia GPU的情况下, 需要折腾应该就只有非root用户手动安装一下CUDA toolkit, 然后在CMakeLists.txt中指定一下路径即可. 手动安装CUDA toolkit和cuDnn后, 在CMakeLists.txt中加入:
# ${cuda path}示例: /home/orange/software/cuda-118
set(CUDA_TOOLKIT_ROOT_DIR ${cuda path})进行编译即可
mkdir build
cd build
cmake .. -DLLAMA_CUBLASON
cmake --build . --config Release量化
编译完成llama.cpp后, 进行量化
python convert.py zh-models/chinese-alpaca-2-7b/
./build/bin/quantize ./zh-models/chinese-alpaca-2-7b/ggml-model-f16.gguf ./zh-models/chinese-alpaca-2-7b/ggml-model-q8_0.gguf q8_0部署测试
直接使用./build/bin/main -m ./zh-models/chinese-alpaca-2-7b/ggml-model-q8_0.gguf不能进行对话, 加入参数-i表示交互模式, 也可以使用教程中的脚本形式. 按照tutorial, 新建chat.sh文件并填入以下内容
#!/bin/bash# temporary script to chat with Chinese Alpaca-2 model
# usage: ./chat.sh alpaca2-ggml-model-path your-first-instructionSYSTEMYou are a helpful assistant. 你是一个乐于助人的助手。
FIRST_INSTRUCTION$2./build/bin/main -m $1 \
--color -i -c 4096 -t 8 --temp 0.5 --top_k 40 --top_p 0.9 --repeat_penalty 1.1 \
--in-prefix-bos --in-prefix [INST] --in-suffix [/INST] -p \
[INST] SYS
$SYSTEM
/SYS$FIRST_INSTRUCTION [/INST]运行
bash chat.sh ./zh-models/chinese-alpaca-2-7b/ggml-model-q8_0.gguf 请列举5条文明乘车的建议成功实现对话, 部署测试成功.
测试
下载并解压测试数据
chinese-alpaca-2-1.3b
测试命令:
./build/bin/perplexity -m ./zh-models/chinese-alpaca-2-1.3b/ggml-model-q8_0.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 20由于使用cmake编译, 可执行文件位于build/bin下, 注意执行文件和模型, 数据的路径替换即可. 测试数据如下:
main: build 2509 (50ccaf5e)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed 1711210157
llama_model_loader: loaded meta data with 23 key-value pairs and 39 tensors from ./zh-models/chinese-alpaca-2-1.3b/ggml-model-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str llama
llama_model_loader: - kv 1: general.name str LLaMA v2
llama_model_loader: - kv 2: llama.vocab_size u32 55296
llama_model_loader: - kv 3: llama.context_length u32 4096
llama_model_loader: - kv 4: llama.embedding_length u32 4096
llama_model_loader: - kv 5: llama.block_count u32 4
llama_model_loader: - kv 6: llama.feed_forward_length u32 11008
llama_model_loader: - kv 7: llama.rope.dimension_count u32 128
llama_model_loader: - kv 8: llama.attention.head_count u32 32
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 32
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 10000.000000
llama_model_loader: - kv 12: general.file_type u32 7
llama_model_loader: - kv 13: tokenizer.ggml.model str llama
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,55296] [unk, s, /s, 0x00, ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,55296] [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,55296] [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 2
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool false
llama_model_loader: - kv 22: general.quantization_version u32 2
llama_model_loader: - type f32: 9 tensors
llama_model_loader: - type q8_0: 30 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format GGUF V3 (latest)
llm_load_print_meta: arch llama
llm_load_print_meta: vocab type SPM
llm_load_print_meta: n_vocab 55296
llm_load_print_meta: n_merges 0
llm_load_print_meta: n_ctx_train 4096
llm_load_print_meta: n_embd 4096
llm_load_print_meta: n_head 32
llm_load_print_meta: n_head_kv 32
llm_load_print_meta: n_layer 4
llm_load_print_meta: n_rot 128
llm_load_print_meta: n_embd_head_k 128
llm_load_print_meta: n_embd_head_v 128
llm_load_print_meta: n_gqa 1
llm_load_print_meta: n_embd_k_gqa 4096
llm_load_print_meta: n_embd_v_gqa 4096
llm_load_print_meta: f_norm_eps 0.0e00
llm_load_print_meta: f_norm_rms_eps 1.0e-05
llm_load_print_meta: f_clamp_kqv 0.0e00
llm_load_print_meta: f_max_alibi_bias 0.0e00
llm_load_print_meta: f_logit_scale 0.0e00
llm_load_print_meta: n_ff 11008
llm_load_print_meta: n_expert 0
llm_load_print_meta: n_expert_used 0
llm_load_print_meta: causal attn 1
llm_load_print_meta: pooling type 0
llm_load_print_meta: rope type 0
llm_load_print_meta: rope scaling linear
llm_load_print_meta: freq_base_train 10000.0
llm_load_print_meta: freq_scale_train 1
llm_load_print_meta: n_yarn_orig_ctx 4096
llm_load_print_meta: rope_finetuned unknown
llm_load_print_meta: ssm_d_conv 0
llm_load_print_meta: ssm_d_inner 0
llm_load_print_meta: ssm_d_state 0
llm_load_print_meta: ssm_dt_rank 0
llm_load_print_meta: model type ?B
llm_load_print_meta: model ftype Q8_0
llm_load_print_meta: model params 1.26 B
llm_load_print_meta: model size 1.25 GiB (8.50 BPW)
llm_load_print_meta: general.name LLaMA v2
llm_load_print_meta: BOS token 1 s
llm_load_print_meta: EOS token 2 /s
llm_load_print_meta: UNK token 0 unk
llm_load_print_meta: PAD token 0 unk
llm_load_print_meta: LF token 13 0x0A
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yesDevice 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size 0.05 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 5/5 layers to GPU
llm_load_tensors: CPU buffer size 229.50 MiB
llm_load_tensors: CUDA0 buffer size 615.28 MiB
llm_load_tensors: CUDA1 buffer size 434.61 MiB
...............................
llama_new_context_with_model: n_ctx 2048
llama_new_context_with_model: n_batch 2048
llama_new_context_with_model: n_ubatch 512
llama_new_context_with_model: freq_base 10000.0
llama_new_context_with_model: freq_scale 1
llama_kv_cache_init: CUDA0 KV buffer size 96.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size 32.00 MiB
llama_new_context_with_model: KV self size 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size 432.00 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies4)
llama_new_context_with_model: CUDA0 compute buffer size 208.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size 200.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size 24.02 MiB
llama_new_context_with_model: graph nodes 136
llama_new_context_with_model: graph splits 3system_info: n_threads 76 / 152 | AVX 1 | AVX_VNNI 0 | AVX2 1 | AVX512 1 | AVX512_VBMI 1 | AVX512_VNNI 1 | FMA 1 | NEON 0 | ARM_FMA 0 | F16C 1 | FP16_VA 0 | WASM_SIMD 0 | BLAS 1 | SSE3 1 | SSSE3 1 | VSX 0 | MATMUL_INT8 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 1187.98 ms
perplexity: calculating perplexity over 655 chunks, n_ctx512, batch_size2048, n_seq4
perplexity: 0.06 seconds per pass - ETA 0.17 minutes
[1]35.2055,[2]3151.7331,[3]9745.8526,[4]3056.9236
......
[653]1226.9638,[654]1219.7704,[655]1213.9217,
Final estimate: PPL 1213.9217 /- 16.09822llama_print_timings: load time 2998.83 ms
llama_print_timings: sample time 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time 8371.13 ms / 335360 tokens ( 0.02 ms per token, 40061.49 tokens per second)
llama_print_timings: eval time 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time 17937.28 ms / 335361 tokenschinese-alpaca-2-13b
测试命令:
./build/bin/perplexity -m ./zh-models/chinese-alpaca-2-13b/ggml-model-q8_0.gguf -f ./wikitext-2-raw/wiki.test.raw -ngl 10测试数据如下:
main: build 2509 (50ccaf5e)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed 1711210012
llama_model_loader: loaded meta data with 21 key-value pairs and 363 tensors from ./zh-models/chinese-alpaca-2-13b/ggml-model-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str llama
llama_model_loader: - kv 1: general.name str LLaMA v2
llama_model_loader: - kv 2: llama.vocab_size u32 55296
llama_model_loader: - kv 3: llama.context_length u32 4096
llama_model_loader: - kv 4: llama.embedding_length u32 5120
llama_model_loader: - kv 5: llama.block_count u32 40
llama_model_loader: - kv 6: llama.feed_forward_length u32 13824
llama_model_loader: - kv 7: llama.rope.dimension_count u32 128
llama_model_loader: - kv 8: llama.attention.head_count u32 40
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 40
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 0.000010
llama_model_loader: - kv 11: general.file_type u32 7
llama_model_loader: - kv 12: tokenizer.ggml.model str llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,55296] [unk, s, /s, 0x00, ...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,55296] [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,55296] [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 2
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool false
llama_model_loader: - kv 20: general.quantization_version u32 2
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 282 tensors
llm_load_vocab: mismatch in special tokens definition ( 889/55296 vs 259/55296 ).
llm_load_print_meta: format GGUF V3 (latest)
llm_load_print_meta: arch llama
llm_load_print_meta: vocab type SPM
llm_load_print_meta: n_vocab 55296
llm_load_print_meta: n_merges 0
llm_load_print_meta: n_ctx_train 4096
llm_load_print_meta: n_embd 5120
llm_load_print_meta: n_head 40
llm_load_print_meta: n_head_kv 40
llm_load_print_meta: n_layer 40
llm_load_print_meta: n_rot 128
llm_load_print_meta: n_embd_head_k 128
llm_load_print_meta: n_embd_head_v 128
llm_load_print_meta: n_gqa 1
llm_load_print_meta: n_embd_k_gqa 5120
llm_load_print_meta: n_embd_v_gqa 5120
llm_load_print_meta: f_norm_eps 0.0e00
llm_load_print_meta: f_norm_rms_eps 1.0e-05
llm_load_print_meta: f_clamp_kqv 0.0e00
llm_load_print_meta: f_max_alibi_bias 0.0e00
llm_load_print_meta: f_logit_scale 0.0e00
llm_load_print_meta: n_ff 13824
llm_load_print_meta: n_expert 0
llm_load_print_meta: n_expert_used 0
llm_load_print_meta: causal attn 1
llm_load_print_meta: pooling type 0
llm_load_print_meta: rope type 0
llm_load_print_meta: rope scaling linear
llm_load_print_meta: freq_base_train 10000.0
llm_load_print_meta: freq_scale_train 1
llm_load_print_meta: n_yarn_orig_ctx 4096
llm_load_print_meta: rope_finetuned unknown
llm_load_print_meta: ssm_d_conv 0
llm_load_print_meta: ssm_d_inner 0
llm_load_print_meta: ssm_d_state 0
llm_load_print_meta: ssm_dt_rank 0
llm_load_print_meta: model type 13B
llm_load_print_meta: model ftype Q8_0
llm_load_print_meta: model params 13.25 B
llm_load_print_meta: model size 13.12 GiB (8.50 BPW)
llm_load_print_meta: general.name LLaMA v2
llm_load_print_meta: BOS token 1 s
llm_load_print_meta: EOS token 2 /s
llm_load_print_meta: UNK token 0 unk
llm_load_print_meta: LF token 13 0x0A
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:Device 0: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yesDevice 1: NVIDIA A100-PCIE-40GB, compute capability 8.0, VMM: yes
llm_load_tensors: ggml ctx size 0.42 MiB
llm_load_tensors: offloading 10 repeating layers to GPU
llm_load_tensors: offloaded 10/41 layers to GPU
llm_load_tensors: CPU buffer size 13431.58 MiB
llm_load_tensors: CUDA0 buffer size 1607.23 MiB
llm_load_tensors: CUDA1 buffer size 1607.23 MiB
..................................................................................................
llama_new_context_with_model: n_ctx 2048
llama_new_context_with_model: n_batch 2048
llama_new_context_with_model: n_ubatch 512
llama_new_context_with_model: freq_base 10000.0
llama_new_context_with_model: freq_scale 1
llama_kv_cache_init: CUDA_Host KV buffer size 1200.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size 200.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size 200.00 MiB
llama_new_context_with_model: KV self size 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size 432.00 MiB
llama_new_context_with_model: CUDA0 compute buffer size 404.88 MiB
llama_new_context_with_model: CUDA1 compute buffer size 204.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size 24.00 MiB
llama_new_context_with_model: graph nodes 1324
llama_new_context_with_model: graph splits 335system_info: n_threads 76 / 152 | AVX 1 | AVX_VNNI 0 | AVX2 1 | AVX512 1 | AVX512_VBMI 1 | AVX512_VNNI 1 | FMA 1 | NEON 0 | ARM_FMA 0 | F16C 1 | FP16_VA 0 | WASM_SIMD 0 | BLAS 1 | SSE3 1 | SSSE3 1 | VSX 0 | MATMUL_INT8 0 |
perplexity: tokenizing the input ..
perplexity: tokenization took 728.604 ms
perplexity: calculating perplexity over 655 chunks, n_ctx512, batch_size2048, n_seq4
perplexity: 7.36 seconds per pass - ETA 20.08 minutes
[1]4.8998,[2]5.3381,[3]6.0623,
......
[654]6.3736,[655]6.3713,
Final estimate: PPL 6.3713 /- 0.03705llama_print_timings: load time 17705.89 ms
llama_print_timings: sample time 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time 1130068.93 ms / 335360 tokens ( 3.37 ms per token, 296.76 tokens per second)
llama_print_timings: eval time 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time 1137969.38 ms / 335361 tokens