Can this be run in a pure CPU environment?
I ran it on a CPU server, and it hangs right after printing "Model loaded." Is there a problem with my code, or is a pure CPU environment not supported?
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import json
import os
model_path = "/mnt/genie/GENIE_en_7b"  # confirm this path, or change to "/mnt/GENIE_en_7b" or "/mnt/genice/GENIE_en_8b"
device = torch.device("cpu")
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model path {model_path} does not exist; please check the path or file integrity")
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, use_fast=True)
print("Tokenizer loaded.")
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True  # reduce peak RAM while loading the weights
).to(device)
print("Model loaded.")
PROMPT_TEMPLATE = "Human:\n{query}\n\nAssistant:\n"
EHR = [
    # Chinese EHR excerpt: chronic hepatitis B for 10+ years, previously abnormal liver function that improved after TCM treatment; HBsAg turned negative about a year ago, but liver biopsy indicated viral hepatitis with fibrosis (G1S3-4)
    "慢性乙型肝炎病史10余年,曾有肝功能异常,中医治疗后好转;1年余前查HBsAg转阴,但肝脏病理提示病毒性肝炎伴肝纤维化(G1S3-4)"
]
texts = [PROMPT_TEMPLATE.format(query=k) for k in EHR]
temperature = 0.7
max_new_tokens = 50
for prompt in texts:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        **inputs,
        do_sample=True,  # required for temperature to take effect; greedy decoding ignores it
        temperature=temperature,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    try:
        json_result = json.loads(result.split("Assistant:\n")[1])
        print("=" * 20)
        print(json.dumps(json_result, indent=2, ensure_ascii=False))
    except json.JSONDecodeError:
        print("=" * 20)
        print("Invalid JSON:", result)
Hello, the inference framework does support a CPU-only environment. However, because the output is very long, generation can take a long time.
As a reference, producing 8192 tokens takes about 4 seconds with vLLM, about 5 minutes with transformers (your code) on a GPU, and possibly around an hour on CPU.
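If the run looks stuck after "Model loaded.", it is most likely just generating slowly rather than hanging. Below is a minimal sketch, assuming a recent transformers version and the same checkpoint path as in your script, that attaches a TextStreamer so each token is printed as soon as it is produced, which makes the progress visible on CPU.

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import torch

model_path = "/mnt/genie/GENIE_en_7b"  # same path as in your script; adjust if needed
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
prompt = "Human:\nhello\n\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt")
model.generate(
    **inputs,
    streamer=streamer,  # prints each decoded token immediately instead of waiting for the full output
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)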
Hello, I keep running into issues with the vLLM version in a pure CPU environment and cannot get it to work normally. May I ask if there are any plans to release an Ollama version?
You can use transformers instead of vLLM; it is just slower (and much slower still on CPU). I personally do not recommend CPU-only inference, as large language models usually do not fit well in that setting.
As for Ollama, I think it is up to the Ollama developers to decide which models are included.
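If you do stay on CPU with transformers, the sketch below may help a little. It is only an illustration, assuming your machine has enough RAM for the full-precision weights: it pins the PyTorch thread count to the available cores and runs generation under inference mode, both of which typically help on CPU.

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.set_num_threads(os.cpu_count())  # let PyTorch use all available cores for the matrix multiplies

model_path = "/mnt/genie/GENIE_en_7b"  # path taken from the question above; adjust to your setup
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    local_files_only=True,
    torch_dtype=torch.float32,  # bfloat16 can halve memory, but CPU support varies by hardware
    low_cpu_mem_usage=True,
)

prompt = "Human:\nhello\n\nAssistant:\n"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.inference_mode():  # skip autograd bookkeeping during generation
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))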