Any idea on how to fix this: KeyError: 'layers.31.mlp.shared_expert.down_proj.weight'

#1
by kq - opened

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 && export CUDA_VISIBLE_DEVICES=0,1,2,3 && vllm serve /home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --port 12303 --gpu-memory-utilization 0.80 --dtype float16 --tensor-parallel-size 4 --max-model-len 131072 --max-seq-len-to-capture 131072 --api-key token-deaf --reasoning-parser deepseek_r1 --enable-auto-tool-choice --tool-call-parser hermes

INFO 09-13 21:32:24 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=5689) INFO 09-13 21:32:29 [api_server.py:1896] vLLM API server version 0.10.2rc3.dev38+g99bfef841
(APIServer pid=5689) INFO 09-13 21:32:29 [utils.py:328] non-default args: {'model_tag': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', 'port': 12303, 'api_key': ['token-deaf'], 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', 'dtype': 'float16', 'max_model_len': 131072, 'max_seq_len_to_capture': 131072, 'reasoning_parser': 'deepseek_r1', 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.8}
(APIServer pid=5689) INFO 09-13 21:32:39 [__init__.py:750] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=5689) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=5689) WARNING 09-13 21:32:39 [__init__.py:2792] Casting torch.bfloat16 to torch.float16.
(APIServer pid=5689) INFO 09-13 21:32:39 [__init__.py:1831] Using max model len 131072
(APIServer pid=5689) WARNING 09-13 21:32:40 [_ipex_ops.py:16] Import error msg: No module named 'intel_extension_for_pytorch'
(APIServer pid=5689) INFO 09-13 21:32:40 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=5689) INFO 09-13 21:32:40 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=5689) INFO 09-13 21:32:40 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=5689) INFO 09-13 21:32:41 [config.py:390] Setting attention block size to 272 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=5689) INFO 09-13 21:32:41 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-13 21:32:48 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=5870) INFO 09-13 21:32:52 [core.py:655] Waiting for init message from front-end.
(EngineCore_DP0 pid=5870) INFO 09-13 21:32:52 [core.py:76] Initializing a V1 LLM engine (v0.10.2rc3.dev38+g99bfef841) with config: model='/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', speculative_config=None, tokenizer='/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=5870) WARNING 09-13 21:32:52 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 18 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=5870) INFO 09-13 21:32:52 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_d93da4d8'), local_subscribe_addr='ipc:///tmp/a04a3827-dc06-4c6e-9813-591136f37ead', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-13 21:32:59 [__init__.py:216] Automatically detected platform cuda.
INFO 09-13 21:32:59 [__init__.py:216] Automatically detected platform cuda.
INFO 09-13 21:32:59 [__init__.py:216] Automatically detected platform cuda.
INFO 09-13 21:32:59 [__init__.py:216] Automatically detected platform cuda.
W0913 21:33:04.109000 5951 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0913 21:33:04.109000 5951 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0913 21:33:04.245000 5950 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0913 21:33:04.245000 5950 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0913 21:33:04.298000 5949 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0913 21:33:04.298000 5949 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0913 21:33:04.342000 5948 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
W0913 21:33:04.342000 5948 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 09-13 21:33:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ab8da3c2'), local_subscribe_addr='ipc:///tmp/520e9f52-10b7-4579-8e80-a6a13321e633', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-13 21:33:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4b2989a0'), local_subscribe_addr='ipc:///tmp/66b2cff3-d898-4278-b6ba-f2a49c656f3f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-13 21:33:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0a52f59b'), local_subscribe_addr='ipc:///tmp/26c5e039-d2a7-46ff-99f4-c3842611624d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-13 21:33:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_37fc6e66'), local_subscribe_addr='ipc:///tmp/ac2045bc-67f8-4d43-80e0-100f62da6e44', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W913 21:33:06.408255837 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W913 21:33:06.422664470 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W913 21:33:06.554744240 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W913 21:33:06.563152943 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-13 21:33:06 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 21:33:06 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-13 21:33:06 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 21:33:06 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-13 21:33:06 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 21:33:06 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 21:33:06 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-13 21:33:06 [pynccl.py:70] vLLM is using nccl==2.27.3
WARNING 09-13 21:33:07 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-13 21:33:07 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-13 21:33:07 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-13 21:33:07 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-13 21:33:07 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_52b7b2ac'), local_subscribe_addr='ipc:///tmp/b8c9cc70-a741-454a-8104-7f9063096525', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 09-13 21:33:07 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 09-13 21:33:07 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 09-13 21:33:07 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-13 21:33:07 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-13 21:33:07 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-13 21:33:07 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-13 21:33:07 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-13 21:33:07 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(Worker_TP2 pid=5950) INFO 09-13 21:33:07 [gpu_model_runner.py:2340] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit...
(Worker_TP3 pid=5951) INFO 09-13 21:33:07 [gpu_model_runner.py:2340] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit...
(Worker_TP1 pid=5949) INFO 09-13 21:33:07 [gpu_model_runner.py:2340] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit...
(Worker_TP0 pid=5948) INFO 09-13 21:33:07 [gpu_model_runner.py:2340] Starting to load model /home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit...
(Worker_TP2 pid=5950) INFO 09-13 21:33:07 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP3 pid=5951) INFO 09-13 21:33:07 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP1 pid=5949) INFO 09-13 21:33:07 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP0 pid=5948) INFO 09-13 21:33:07 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP2 pid=5950) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(Worker_TP2 pid=5950) torch_dtype is deprecated! Use dtype instead!
(Worker_TP2 pid=5950) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP3 pid=5951) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(Worker_TP3 pid=5951) torch_dtype is deprecated! Use dtype instead!
(Worker_TP3 pid=5951) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP2 pid=5950) INFO 09-13 21:33:08 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP3 pid=5951) INFO 09-13 21:33:08 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP0 pid=5948) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(Worker_TP0 pid=5948) torch_dtype is deprecated! Use dtype instead!
(Worker_TP0 pid=5948) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP0 pid=5948) INFO 09-13 21:33:08 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP1 pid=5949) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using BitBLASLinearKernel for CompressedTensorsWNA16
(Worker_TP1 pid=5949) torch_dtype is deprecated! Use dtype instead!
(Worker_TP1 pid=5949) INFO 09-13 21:33:08 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
(Worker_TP1 pid=5949) INFO 09-13 21:33:08 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP2 pid=5950) INFO 09-13 21:33:08 [cuda.py:369] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=5951) INFO 09-13 21:33:08 [cuda.py:369] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=5948) INFO 09-13 21:33:08 [cuda.py:369] Using Flash Attention backend on V1 engine.
(Worker_TP1 pid=5949) INFO 09-13 21:33:08 [cuda.py:369] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:00<?, ?it/s]
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] WorkerProc failed to start.
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] Traceback (most recent call last):
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 574, in worker_main
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] worker = WorkerProc(*args, **kwargs)
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 440, in __init__
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] self.worker.load_model()
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2373, in load_model
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] self.model = model_loader.load_model(
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] self.load_weights(model, model_config)
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 265, in load_weights
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] loaded_weights = model.load_weights(
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1216, in load_weights
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] return loader.load_weights(weights)
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 291, in load_weights
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] autoloaded_weights = set(self._load_module("", self.module, weights))
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 249, in _load_module
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] yield from self._load_module(prefix,
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 222, in _load_module
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] loaded_params = module_load_weights(weights)
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1044, in load_weights
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] param = params_dict[name]
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] ~~~~~~~~~~~^^^^^^
(Worker_TP3 pid=5951) ERROR 09-13 21:33:11 [multiproc_executor.py:600] KeyError: 'layers.31.mlp.shared_expert.down_proj.weight'
(Worker_TP3 pid=5951) INFO 09-13 21:33:11 [multiproc_executor.py:561] Parent process exited, terminating worker
(Worker_TP2 pid=5950) INFO 09-13 21:33:11 [multiproc_executor.py:561] Parent process exited, terminating worker
(Worker_TP0 pid=5948) INFO 09-13 21:33:11 [multiproc_executor.py:561] Parent process exited, terminating worker
Loading safetensors checkpoint shards: 0% Completed | 0/10 [00:01<?, ?it/s]
(Worker_TP0 pid=5948)
(Worker_TP1 pid=5949) INFO 09-13 21:33:11 [multiproc_executor.py:561] Parent process exited, terminating worker
[rank0]:[W913 21:33:12.009355025 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] EngineCore failed to start.
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] Traceback (most recent call last):
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 710, in run_engine_core
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 509, in __init__
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] self._init_executor()
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 107, in _init_executor
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 512, in wait_for_ready
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] raise e from None
(EngineCore_DP0 pid=5870) ERROR 09-13 21:33:13 [core.py:719] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=5870) Process EngineCore_DP0:
(EngineCore_DP0 pid=5870) Traceback (most recent call last):
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=5870) self.run()
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=5870) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 723, in run_engine_core
(EngineCore_DP0 pid=5870) raise e
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 710, in run_engine_core
(EngineCore_DP0 pid=5870) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=5870) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 509, in __init__
(EngineCore_DP0 pid=5870) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=5870) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=5870) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=5870) self._init_executor()
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 107, in _init_executor
(EngineCore_DP0 pid=5870) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=5870) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=5870) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 512, in wait_for_ready
(EngineCore_DP0 pid=5870) raise e from None
(EngineCore_DP0 pid=5870) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=5689) Traceback (most recent call last):
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/bin/vllm", line 8, in
(APIServer pid=5689) sys.exit(main())
(APIServer pid=5689) ^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=5689) args.dispatch_function(args)
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=5689) uvloop.run(run_server(args))
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
(APIServer pid=5689) return __asyncio.run(
(APIServer pid=5689) ^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=5689) return runner.run(main)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=5689) return self._loop.run_until_complete(task)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
(APIServer pid=5689) return await main
(APIServer pid=5689) ^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=5689) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=5689) async with build_async_engine_client(
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=5689) return await anext(self.gen)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=5689) async with build_async_engine_client_from_engine_args(
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 210, in aenter
(APIServer pid=5689) return await anext(self.gen)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=5689) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils/init.py", line 1595, in inner
(APIServer pid=5689) return fn(*args, **kwargs)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 209, in from_vllm_config
(APIServer pid=5689) return cls(
(APIServer pid=5689) ^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in init
(APIServer pid=5689) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=5689) return AsyncMPClient(*client_args)
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in init
(APIServer pid=5689) super().init(
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in init
(APIServer pid=5689) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=5689) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/contextlib.py", line 144, in exit
(APIServer pid=5689) next(self.gen)
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=5689) wait_for_engine_startup(
(APIServer pid=5689) File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=5689) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=5689) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/deaf/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

P.S.
vllm -v
INFO 09-13 21:37:22 [__init__.py:216] Automatically detected platform cuda.
0.10.2rc3.dev38+g99bfef841.cu129

The same error also happens with transformers upgraded.
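
For anyone hitting the same error: the KeyError comes from the loader finding a tensor name in the checkpoint that the quantized model it built has no parameter for, so it helps to see exactly what the downloaded shards store for that module. The snippet below is only a minimal diagnostic sketch, assuming the standard sharded-safetensors layout with a model.safetensors.index.json next to the shards (the log above shows 10 shards):

import json
from pathlib import Path

# Checkpoint path taken from the serve command above.
ckpt = Path("/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit")

# Sharded safetensors checkpoints ship an index mapping tensor name -> shard file.
index = json.loads((ckpt / "model.safetensors.index.json").read_text())
weight_map = index["weight_map"]

# Show every tensor the checkpoint stores for the module named in the KeyError.
# (Checkpoint names usually carry a "model." prefix, and an AWQ/compressed-tensors
# export may store qweight/weight_packed/scales instead of a plain .weight.)
needle = "layers.31.mlp.shared_expert.down_proj"
hits = sorted(name for name in weight_map if needle in name)
if hits:
    for name in hits:
        print(name, "->", weight_map[name])
else:
    print("No tensors found for", needle, "- the shard set looks incomplete or stale.")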

Owner

Thank you for trying this model. Please re-download the weights in about 30 minutes; that should fix this problem.
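
One way to force a clean re-download once the fixed files are up is huggingface_hub's snapshot_download with force_download. This is only a minimal sketch: the repo id below is a placeholder to replace with this repository's actual id, and local_dir matches the path used in the serve command above.

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="owner/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",  # placeholder, use this repo's real id
    local_dir="/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",  # path from the serve command
    force_download=True,  # ignore cached/stale shards and fetch fresh files
)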

Impressively quick reply. Thank you very much.

Thank you. I love the quality of this model.
It works! Nice work! Running on 4x RTX 3090, it allows 5.72x maximum concurrency at the full 128K context length. Avg generation throughput is 88 tokens/s on the first 4K of context and about 70 tokens/s at 64K context.

(APIServer pid=38412) INFO 09-14 12:38:26 [loggers.py:123] Engine 000: Avg prompt throughput: 6647.7 tokens/s, Avg generation throughput: 66.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.0%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:38:36 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 68.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:38:46 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 69.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.2%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:38:56 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 68.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:39:06 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 68.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.4%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:39:16 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 69.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.5%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:39:26 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 69.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO 09-14 12:39:36 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 69.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.7%, Prefix cache hit rate: 0.0%
(APIServer pid=38412) INFO: 10.10.2.5:63975 - "POST /v1/chat/completions HTTP/1.1" 200 OK
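
For reference, those requests were plain OpenAI-compatible chat completions against the served endpoint (the POST /v1/chat/completions line above). A minimal client sketch, assuming the server from the original command is reachable on localhost port 12303 with api key token-deaf, and that the served model name defaults to the local model path:

from openai import OpenAI

# Endpoint and key come from the vllm serve command at the top of this thread.
client = OpenAI(base_url="http://localhost:12303/v1", api_key="token-deaf")

resp = client.chat.completions.create(
    model="/home/deaf/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit",  # served_model_name defaults to the model path
    messages=[{"role": "user", "content": "Summarize the Qwen3-Next architecture in one sentence."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)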

Owner

If you run into any errors or have any feedback, positive or negative, in the future, please don't hesitate to let me know :) Any feedback helps me improve my models!
