Can this be run in a pure CPU environment?
I ran it on a CPU server, and it hangs right after printing "Model loaded." Is there a problem with my code, or is a pure CPU environment not supported?
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json
import os
model_path = "/mnt/genie/GENIE_en_7b"  # confirm this path, or change to "/mnt/GENIE_en_7b" or "/mnt/genice/GENIE_en_8b"
device = torch.device("cpu")
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model path {model_path} does not exist; please check the path or file integrity")
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, use_fast=True)
print("Tokenizer loaded.")
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True  # reduce peak RAM while loading the weights
).to(device)
print("Model loaded.")
PROMPT_TEMPLATE = "Human:\n{query}\n\nAssistant:\n"
EHR = [
    # Chinese EHR excerpt: chronic hepatitis B for 10+ years, previously abnormal liver function that improved after TCM treatment; HBsAg turned negative about a year ago, but liver biopsy indicated viral hepatitis with fibrosis (G1S3-4)
    "慢性乙型肝炎病史10余年,曾有肝功能异常,中医治疗后好转;1年余前查HBsAg转阴,但肝脏病理提示病毒性肝炎伴肝纤维化(G1S3-4)"
]
texts = [PROMPT_TEMPLATE.format(query=k) for k in EHR]
temperature = 0.7
max_new_tokens = 50
for prompt in texts:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        do_sample=True,  # required for temperature to take effect; greedy decoding ignores it
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    try:
        json_result = json.loads(result.split("Assistant:\n")[1])
        print("=" * 20)
        print(json.dumps(json_result, indent=2, ensure_ascii=False))
    except json.JSONDecodeError:
        print("=" * 20)
        print("Invalid JSON:", result)
Hello, the inference framework does support a CPU-only environment. However, because the output is very long, generation can take a long time.
As a reference, producing 8192 tokens takes about 4 seconds with vLLM, about 5 minutes with transformers (your code) on a GPU, and possibly around an hour on CPU.
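If the run looks stuck after "Model loaded.", it is most likely just generating slowly rather than hanging. Below is a minimal sketch, assuming a recent transformers version and the same checkpoint path as in your script, that attaches a TextStreamer so each token is printed as soon as it is produced, which makes the progress visible on CPU.

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_path = "/mnt/genie/GENIE_en_7b"  # same path as in your script; adjust if needed
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "Human:\nhello\n\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt")
model.generate(
    **inputs,
    streamer=streamer,  # prints each decoded token immediately instead of waiting for the full output
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)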
Hello, I keep running into issues with the vLLM version in a pure CPU environment and cannot get it to work normally. May I ask if there are any plans to release an Ollama version?
You can use transformers instead of vLLM; it is just slower (and much slower still on CPU). I personally do not recommend CPU-only inference, as large language models usually do not fit well in that setting.
As for Ollama, I think it is up to the Ollama developers to decide which models are included.
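If you do stay on CPU with transformers, the sketch below may help a little. It is only an illustration, assuming your machine has enough RAM for the full-precision weights: it pins the PyTorch thread count to the available cores and runs generation under inference mode, both of which typically help on CPU.

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.set_num_threads(os.cpu_count())  # let PyTorch use all available cores for the matrix multiplies

model_path = "/mnt/genie/GENIE_en_7b"  # path taken from the question above; adjust to your setup
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,  # bfloat16 can halve memory, but CPU support varies by hardware
    low_cpu_mem_usage=True,
)

prompt = "Human:\nhello\n\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():  # skip autograd bookkeeping during generation
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))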