Model not loading

#8
by nephepritou - opened

Command used to load the model:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 python3 -m vllm.entrypoints.openai.api_server --model /home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit --served-model-name qwen3-next-thinking --port 5007 -pp 3 --dtype float16 --max-model-len 131072 --gpu-memory-utilization 0.95 --max-num-seqs 4 --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code

Got an error:

INFO 11-17 17:07:36 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=812078) INFO 11-17 17:07:36 [api_server.py:1977] vLLM API server version 0.11.1rc7.dev237+gd4acf518d
(APIServer pid=812078) INFO 11-17 17:07:36 [utils.py:253] non-default args: {'port': 5007, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 131072, 'served_model_name': ['qwen3-next-instruct'], 'pipeline_parallel_size': 3, 'gpu_memory_utilization': 0.95, 'max_num_seqs': 4}
(APIServer pid=812078) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=812078) INFO 11-17 17:07:40 [model.py:631] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=812078) WARNING 11-17 17:07:40 [model.py:1971] Casting torch.bfloat16 to torch.float16.
(APIServer pid=812078) INFO 11-17 17:07:40 [model.py:1745] Using max model len 131072
(APIServer pid=812078) INFO 11-17 17:07:40 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=812078) INFO 11-17 17:07:40 [config.py:308] Disabling cascade attention since it is not supported for hybrid models.
(APIServer pid=812078) INFO 11-17 17:07:41 [config.py:432] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=812078) INFO 11-17 17:07:41 [config.py:456] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore_DP0 pid=812858) INFO 11-17 17:07:45 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc7.dev237+gd4acf518d) with config: model='/home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', speculative_config=None, tokenizer='/home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=3, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=qwen3-next-instruct, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 8, 'local_cache_dir': None}
(EngineCore_DP0 pid=812858) WARNING 11-17 17:07:45 [multiproc_executor.py:869] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 11-17 17:07:48 [parallel_state.py:1208] world_size=3 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:42063 backend=nccl
INFO 11-17 17:07:52 [parallel_state.py:1208] world_size=3 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:42063 backend=nccl
INFO 11-17 17:07:55 [parallel_state.py:1208] world_size=3 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:42063 backend=nccl
[Gloo] Rank 0 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
[Gloo] Rank 2 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
[Gloo] Rank 1 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
[Gloo] Rank 2 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
[Gloo] Rank 1 is connected to 2 peer ranks. Expected number of connected peer ranks is : 2
INFO 11-17 17:07:55 [pynccl.py:111] vLLM is using nccl==2.27.7
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 11-17 17:07:55 [parallel_state.py:1394] rank 2 in world size 3 is assigned as DP rank 0, PP rank 2, TP rank 0, EP rank 0
INFO 11-17 17:07:55 [parallel_state.py:1394] rank 1 in world size 3 is assigned as DP rank 0, PP rank 1, TP rank 0, EP rank 0
INFO 11-17 17:07:55 [parallel_state.py:1394] rank 0 in world size 3 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(Worker_PP0 pid=813163) INFO 11-17 17:07:55 [gpu_model_runner.py:3047] Starting to load model /home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit...
(Worker_PP0 pid=813163) INFO 11-17 17:07:55 [layer.py:342] Enabled separate cuda stream for MoE shared_experts
(Worker_PP2 pid=813601) INFO 11-17 17:07:55 [layer.py:342] Enabled separate cuda stream for MoE shared_experts
(Worker_PP0 pid=813163) INFO 11-17 17:07:55 [compressed_tensors_moe.py:162] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_PP2 pid=813601) INFO 11-17 17:07:55 [compressed_tensors_moe.py:162] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_PP1 pid=813363) INFO 11-17 17:07:55 [layer.py:342] Enabled separate cuda stream for MoE shared_experts
(Worker_PP1 pid=813363) INFO 11-17 17:07:55 [compressed_tensors_moe.py:162] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_PP2 pid=813601) INFO 11-17 17:07:55 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(Worker_PP2 pid=813601) INFO 11-17 17:07:55 [cuda.py:427] Using FLASH_ATTN backend.
(Worker_PP1 pid=813363) INFO 11-17 17:07:55 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(Worker_PP1 pid=813363) INFO 11-17 17:07:55 [cuda.py:427] Using FLASH_ATTN backend.
(Worker_PP0 pid=813163) INFO 11-17 17:07:55 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(Worker_PP0 pid=813163) INFO 11-17 17:07:55 [cuda.py:427] Using FLASH_ATTN backend.
Loading safetensors checkpoint shards:   0% Completed | 0/10 [00:00<?, ?it/s]
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743] WorkerProc failed to start.
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743] Traceback (most recent call last):
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 715, in worker_main
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     worker = WorkerProc(*args, **kwargs)
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 555, in __init__
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     self.worker.load_model()
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 275, in load_model
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3064, in load_model
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     self.model = model_loader.load_model(
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 49, in load_model
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     model = initialize_model(
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]             ^^^^^^^^^^^^^^^^^
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 55, in initialize_model
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1218, in __init__
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     self.set_moe_parameters()
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]   File "/home/user/llm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1158, in set_moe_parameters
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743]     raise RuntimeError("No Qwen3Next layer found in the model.layers.")
(Worker_PP2 pid=813601) ERROR 11-17 17:07:56 [multiproc_executor.py:743] RuntimeError: No Qwen3Next layer found in the model.layers

It fails with 0.11.1rc7.dev237+gd4acf518d. With 0.11.0 the model loads, but without MTP support.

Interesting. Could you try adding the arg --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'?
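For reference, that would make the full launch look roughly like this (same paths and flags as the original command above, just wrapped across lines):

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2 python3 -m vllm.entrypoints.openai.api_server \
  --model /home/user/llm/models/Qwen3-Next-80B-A3B-Thinking-AWQ-4bit \
  --served-model-name qwen3-next-thinking \
  --port 5007 -pp 3 --dtype float16 --max-model-len 131072 \
  --gpu-memory-utilization 0.95 --max-num-seqs 4 \
  --enable-auto-tool-choice --tool-call-parser hermes --trust-remote-code \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'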

I'd already tried that in the first place - the output is the same. For some reason, the vLLM nightly can't find the Qwen3Next layer. What is your vLLM build version? I want to install that specific version and verify whether it's broken in vLLM or just in my setup.
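To compare exact builds, the installed version string can be printed with, for example:

python3 -c "import vllm; print(vllm.__version__)"   # or: pip show vllm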

https://github.com/vllm-project/vllm/pull/28960 - culprit found and fixed, at least for me.
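For anyone hitting the same error: one way to pick up a build that already includes that fix should be the nightly wheel index (a sketch based on vLLM's install docs; double-check the index URL and flags against the current docs):

pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly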

Well... it loaded, but crashed. I think it's above my capabilities :(

I tried with both vLLM 0.11.1rc7.dev264+gf6aa12269.cu129 and 0.11.1rc7.dev134+g5d6ce2b96.precompiled, and both work. Is the crash after the PR the same as before the PR, or is it a different issue?
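(If that .precompiled suffix is from vLLM's Python-only build path, a similar build can be reproduced roughly as follows - a sketch based on the source-build instructions in the vLLM docs:)

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install -e .   # Python-only editable install that reuses precompiled binaries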

It was crashing with --dtype half in the args (recommended for AWQ). Now it loads, but only without spec decode.
The error is AttributeError: 'GPUModelRunner' object has no attribute 'drafter'. Maybe -pp 3 with RTX 3090s is not compatible?

Well, don't worry - https://github.com/vllm-project/vllm/issues/27404
It's a vLLM issue with Qwen3 Next, spec decode, and pipeline parallelism, not an issue with the weights. But thanks to you I've become a contributor to vLLM :D
