This model was released on 2024-04-05 and added to Hugging Face Transformers on 2026-01-10.
LW-DETR
LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.
The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications (sketched in the configuration example after the list):
- Efficient ViT Encoder: Uses a plain ViT with interleaved window/global attention and a window-major organization to drastically reduce attention complexity and latency.
- Richer Input: Aggregates multi-level features from the encoder and uses a C2f Projector (from YOLOv8) to produce two-scale feature maps ($1/8$ and $1/32$).
- Faster Decoder: Employs a shallow 3-layer DETR decoder with deformable cross-attention for lower latency and faster convergence.
- Optimized Queries: Uses a mixed-query scheme combining learnable content queries and generated spatial queries.
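As a rough illustration, these design choices map onto the configuration classes documented below. The snippet is a minimal sketch with illustrative values (in particular the window-attention block indices and scale factors), not the settings of a specific released checkpoint:

from transformers import LwDetrConfig, LwDetrViTConfig, LwDetrForObjectDetection

# Plain ViT encoder with interleaved window/global attention:
# the listed block indices use window attention, the remaining blocks use global attention.
backbone_config = LwDetrViTConfig(
    num_hidden_layers=12,
    window_block_indices=[0, 1, 3, 4, 6, 7, 9, 10],  # illustrative interleaving
    num_windows=16,
)

# Shallow 3-layer decoder and a C2f projector producing two feature scales
# (scale factors 2.0 and 0.5 relative to the 1/16 patch stride, i.e. roughly 1/8 and 1/32).
config = LwDetrConfig(
    backbone_config=backbone_config,
    projector_scale_factors=[2.0, 0.5],
    decoder_layers=3,
    num_queries=300,
)

model = LwDetrForObjectDetection(config)  # randomly initialized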
You can find all the available LW-DETR checkpoints under the AnnaZhang organization. The original code can be found here.
This model was contributed by stevenbucaille.
Click on the LW-DETR models in the right sidebar for more examples of how to apply LW-DETR to different object detection tasks.
The examples below demonstrate how to perform object detection with the Pipeline and the AutoModel classes.
from transformers import pipeline
import torch
pipeline = pipeline(
"object-detection",
model="AnnaZhang/lwdetr_small_60e_coco",
dtype=torch.float16,
device_map=0
)
pipeline("http://images.cocodataset.org/val2017/000000039769.jpg")Resources
Resources
A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LW-DETR.
- Scripts for finetuning LwDetrForObjectDetection with Trainer or Accelerate can be found here.
- See also: Object detection task guide.
LwDetrConfig
class transformers.LwDetrConfig
< source >( backbone_config = None projector_scale_factors: list = [] hidden_expansion = 0.5 c2f_num_blocks = 3 activation_function = 'silu' batch_norm_eps = 1e-05 d_model = 256 dropout = 0.1 decoder_ffn_dim = 2048 decoder_n_points = 4 decoder_layers: int = 3 decoder_self_attention_heads: int = 8 decoder_cross_attention_heads: int = 16 decoder_activation_function = 'relu' num_queries = 300 attention_bias = True attention_dropout = 0.0 activation_dropout = 0.0 group_detr: int = 13 init_std = 0.02 disable_custom_kernels = True class_cost = 2 bbox_cost = 5 giou_cost = 2 mask_loss_coefficient = 1 dice_loss_coefficient = 1 bbox_loss_coefficient = 5 giou_loss_coefficient = 2 eos_coefficient = 0.1 focal_alpha = 0.25 auxiliary_loss = True **kwargs )
Parameters
- backbone_config (PretrainedConfig or dict, optional) — The configuration of the backbone model. If not provided, will default to LwDetrViTConfig with a small ViT architecture optimized for detection tasks.
- projector_scale_factors (list[float], optional, defaults to []) — Scale factors for the feature pyramid network. Each scale factor determines the resolution of features at different levels. Supported values are 0.5, 1.0, and 2.0.
- hidden_expansion (float, optional, defaults to 0.5) — Expansion factor for hidden dimensions in the projector layers.
- c2f_num_blocks (int, optional, defaults to 3) — Number of blocks in the C2F layer.
- activation_function (str, optional, defaults to "silu") — The non-linear activation function in the projector. Supported values are "silu", "relu", "gelu".
- batch_norm_eps (float, optional, defaults to 1e-05) — The epsilon value for batch normalization layers.
- d_model (int, optional, defaults to 256) — Dimension of the model layers and the number of expected features in the decoder inputs.
- dropout (float, optional, defaults to 0.1) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- decoder_ffn_dim (int, optional, defaults to 2048) — Dimension of the “intermediate” (often named feed-forward) layer in the decoder.
- decoder_n_points (int, optional, defaults to 4) — The number of sampled keys in each feature level for each attention head in the decoder.
- decoder_layers (int, optional, defaults to 3) — Number of decoder layers in the transformer.
- decoder_self_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the decoder self-attention.
- decoder_cross_attention_heads (int, optional, defaults to 16) — Number of attention heads for each attention layer in the decoder cross-attention.
- decoder_activation_function (str, optional, defaults to "relu") — The non-linear activation function in the decoder. Supported values are "relu", "silu", "gelu".
- num_queries (int, optional, defaults to 300) — Number of object queries, i.e. detection slots. This is the maximal number of objects LwDetrModel can detect in a single image.
- attention_bias (bool, optional, defaults to True) — Whether to add bias to the attention layers.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- activation_dropout (float, optional, defaults to 0.0) — The dropout ratio for activations inside the fully connected layer.
- group_detr (int, optional, defaults to 13) — Number of groups for the Group DETR attention mechanism, which helps reduce computational complexity.
- init_std (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- disable_custom_kernels (bool, optional, defaults to True) — Disable the use of custom CUDA and CPU kernels. This option is necessary for the ONNX export, as custom kernels are not supported by PyTorch ONNX export.
- class_cost (float, optional, defaults to 2) — Relative weight of the classification error in the Hungarian matching cost.
- bbox_cost (float, optional, defaults to 5) — Relative weight of the L1 error of the bounding box coordinates in the Hungarian matching cost.
- giou_cost (float, optional, defaults to 2) — Relative weight of the generalized IoU loss of the bounding box in the Hungarian matching cost.
- mask_loss_coefficient (float, optional, defaults to 1) — Relative weight of the Focal loss in the panoptic segmentation loss.
- dice_loss_coefficient (float, optional, defaults to 1) — Relative weight of the DICE/F-1 loss in the panoptic segmentation loss.
- bbox_loss_coefficient (float, optional, defaults to 5) — Relative weight of the L1 bounding box loss in the object detection loss.
- giou_loss_coefficient (float, optional, defaults to 2) — Relative weight of the generalized IoU loss in the object detection loss.
- eos_coefficient (float, optional, defaults to 0.1) — Relative classification weight of the ‘no-object’ class in the object detection loss.
- focal_alpha (float, optional, defaults to 0.25) — Alpha parameter in the focal loss.
- auxiliary_loss (bool, optional, defaults to True) — Whether auxiliary decoding losses (loss at each decoder layer) are to be used.
This is the configuration class to store the configuration of a LwDetrModel. It is used to instantiate a LW-DETR model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the LW-DETR AnnaZhang/lwdetr_small_60e_coco architecture.
LW-DETR (Lightweight Detection Transformer) is a transformer-based object detection model designed for real-time detection tasks. It replaces traditional CNN-based detectors like YOLO with a more efficient transformer architecture that achieves competitive performance while being computationally lightweight.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Examples:
>>> from transformers import LwDetrConfig, LwDetrModel
>>> # Initializing a LW-DETR AnnaZhang/lwdetr_small_60e_coco style configuration
>>> configuration = LwDetrConfig()
>>> # Initializing a model (with random weights) from the AnnaZhang/lwdetr_small_60e_coco style configuration
>>> model = LwDetrModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
LwDetrViTConfig
class transformers.LwDetrViTConfig
< source >( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 mlp_ratio = 4 hidden_act = 'gelu' dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-06 image_size = 256 pretrain_image_size = 224 patch_size = 16 num_channels = 3 qkv_bias = True window_block_indices = [] use_absolute_position_embeddings = True out_features = None out_indices = None cae_init_values: float = 0.1 num_windows = 16 **kwargs )
Parameters
- hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
- num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
- mlp_ratio (int, optional, defaults to 4) — Ratio of mlp hidden dim to embedding dim.
- hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.
- dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (float, optional, defaults to 1e-06) — The epsilon used by the layer normalization layers.
- image_size (int, optional, defaults to 256) — The size (resolution) of each image.
- pretrain_image_size (int, optional, defaults to 224) — The size (resolution) of each image during pretraining.
- patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
- num_channels (int, optional, defaults to 3) — The number of input channels.
- qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
- window_block_indices (list[int], optional, defaults to []) — List of indices of blocks that should have window attention instead of regular global self-attention.
- use_absolute_position_embeddings (bool, optional, defaults to True) — Whether to add absolute position embeddings to the patch embeddings.
- out_features (list[str], optional) — If used as backbone, list of features to output. Can be any of "stem", "stage1", "stage2", etc. (depending on how many stages the model has). If unset and out_indices is set, will default to the corresponding stages. If unset and out_indices is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.
- out_indices (list[int], optional) — If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how many stages the model has). If unset and out_features is set, will default to the corresponding stages. If unset and out_features is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.
- cae_init_values (float, optional, defaults to 0.1) — Initialization value for CAE parameters when use_cae is enabled.
- num_windows (int, optional, defaults to 16) — Number of windows for window-based attention. Must be a perfect square and the image size must be divisible by the square root of this value. This enables efficient window-major feature map organization.
This is the configuration class to store the configuration of a LwDetrViTModel. It is used to instantiate an
LW-DETR ViT model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the LW-DETR ViT
AnnaZhang/lwdetr_small_60e_coco architecture.
LW-DETR ViT is the Vision Transformer backbone used in the LW-DETR model for real-time object detection. It features interleaved window and global attention mechanisms to reduce computational complexity while maintaining high performance. The model uses a window-major feature map organization for efficient attention computation.
Configuration objects inherit from VitDetConfig and can be used to control the model outputs. Read the documentation from VitDetConfig for more information.
Example:
>>> from transformers import LwDetrViTConfig, LwDetrViTModel
>>> # Initializing a LW-DETR ViT configuration
>>> configuration = LwDetrViTConfig()
>>> # Initializing a model (with random weights) from the configuration
>>> model = LwDetrViTModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
LwDetrModel
class transformers.LwDetrModel
< source >( config: LwDetrConfig )
Parameters
- config (LwDetrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare LW Detr Model (consisting of a backbone and decoder Transformer) outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor = None pixel_mask: torch.LongTensor | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.lw_detr.modeling_lw_detr.LwDetrModelOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using DeformableDetrImageProcessorFast. See DeformableDetrImageProcessorFast.call() for details (processor_class uses DeformableDetrImageProcessorFast for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
  - 1 for pixels that are real (i.e. not masked),
  - 0 for pixels that are padding (i.e. masked).
Returns
transformers.models.lw_detr.modeling_lw_detr.LwDetrModelOutput or tuple(torch.FloatTensor)
A transformers.models.lw_detr.modeling_lw_detr.LwDetrModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (LwDetrConfig) and inputs.
- init_reference_points (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Initial reference points sent through the Transformer decoder.
- last_hidden_state (torch.FloatTensor | None of shape (batch_size, sequence_length, hidden_size), defaults to None) — Sequence of hidden-states at the output of the last layer of the model.
- intermediate_hidden_states (torch.FloatTensor of shape (batch_size, config.decoder_layers, num_queries, hidden_size)) — Stacked intermediate hidden states (output of each layer of the decoder).
- intermediate_reference_points (torch.FloatTensor of shape (batch_size, config.decoder_layers, num_queries, 4)) — Stacked intermediate reference points (reference points of each layer of the decoder).
- enc_outputs_class (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels), optional, returned when config.with_box_refine=True and config.two_stage=True) — Predicted bounding boxes scores where the top config.two_stage_num_proposals scoring bounding boxes are picked as region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and background).
- enc_outputs_coord_logits (torch.FloatTensor of shape (batch_size, sequence_length, 4), optional, returned when config.with_box_refine=True and config.two_stage=True) — Logits of predicted bounding boxes coordinates in the first stage.
The LwDetrModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, LwDetrModel
>>> from PIL import Image
>>> import requests
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("AnnaZhang/lwdetr_small_60e_coco")
>>> model = LwDetrModel.from_pretrained("AnnaZhang/lwdetr_small_60e_coco")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 300, 256]
LwDetrForObjectDetection
class transformers.LwDetrForObjectDetection
< source >( config: LwDetrConfig )
Parameters
- config (LwDetrConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
LW DETR Model (consisting of a backbone and decoder Transformer) with object detection heads on top, for tasks such as COCO detection.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: FloatTensor = None pixel_mask: torch.LongTensor | None = None labels: list[dict] | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.lw_detr.modeling_lw_detr.LwDetrObjectDetectionOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using DeformableDetrImageProcessorFast. See DeformableDetrImageProcessorFast.call() for details (processor_class uses DeformableDetrImageProcessorFast for processing images).
- pixel_mask (torch.LongTensor of shape (batch_size, height, width), optional) — Mask to avoid performing attention on padding pixel values. Mask values selected in [0, 1]:
  - 1 for pixels that are real (i.e. not masked),
  - 0 for pixels that are padding (i.e. masked).
- labels (list[Dict] of len (batch_size,), optional) — Labels for computing the bipartite matching loss. List of dicts, each dictionary containing at least the following 2 keys: ‘class_labels’ and ‘boxes’ (the class labels and bounding boxes of an image in the batch respectively). The class labels themselves should be a torch.LongTensor of len (number of bounding boxes in the image,) and the boxes a torch.FloatTensor of shape (number of bounding boxes in the image, 4). A minimal fine-tuning sketch is shown after the example below.
Returns
transformers.models.lw_detr.modeling_lw_detr.LwDetrObjectDetectionOutput or tuple(torch.FloatTensor)
A transformers.models.lw_detr.modeling_lw_detr.LwDetrObjectDetectionOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (LwDetrConfig) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels are provided) — Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized scale-invariant IoU loss.
- loss_dict (Dict, optional) — A dictionary containing the individual losses. Useful for logging.
- logits (torch.FloatTensor of shape (batch_size, num_queries, num_classes + 1)) — Classification logits (including no-object) for all queries.
- pred_boxes (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding possible padding). You can use ~DeformableDetrProcessor.post_process_object_detection to retrieve the unnormalized bounding boxes.
- auxiliary_outputs (list[Dict], optional) — Optional, only returned when auxiliary losses are activated (i.e. config.auxiliary_loss is set to True) and labels are provided. It is a list of dictionaries containing the two above keys (logits and pred_boxes) for each decoder layer.
- init_reference_points (torch.FloatTensor of shape (batch_size, num_queries, 4)) — Initial reference points sent through the Transformer decoder.
- last_hidden_state (torch.FloatTensor | None of shape (batch_size, sequence_length, hidden_size), defaults to None) — Sequence of hidden-states at the output of the last layer of the model.
- intermediate_hidden_states (torch.FloatTensor of shape (batch_size, config.decoder_layers, num_queries, hidden_size)) — Stacked intermediate hidden states (output of each layer of the decoder).
- intermediate_reference_points (torch.FloatTensor of shape (batch_size, config.decoder_layers, num_queries, 4)) — Stacked intermediate reference points (reference points of each layer of the decoder).
- enc_outputs_class (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels), optional, returned when config.with_box_refine=True and config.two_stage=True) — Predicted bounding boxes scores where the top config.two_stage_num_proposals scoring bounding boxes are picked as region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and background).
- enc_outputs_coord_logits (torch.FloatTensor of shape (batch_size, sequence_length, 4), optional, returned when config.with_box_refine=True and config.two_stage=True) — Logits of predicted bounding boxes coordinates in the first stage.
The LwDetrForObjectDetection forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import AutoImageProcessor, LwDetrForObjectDetection
>>> from PIL import Image
>>> import requests
>>> import torch
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = AutoImageProcessor.from_pretrained("AnnaZhang/lwdetr_small_60e_coco")
>>> model = LwDetrForObjectDetection.from_pretrained("AnnaZhang/lwdetr_small_60e_coco")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
... 0
... ]
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
... box = [round(i, 2) for i in box.tolist()]
... print(
... f"Detected {model.config.id2label[label.item()]} with confidence "
... f"{round(score.item(), 3)} at location {box}"
... )
Detected cat with confidence 0.8 at location [16.5, 52.84, 318.25, 470.78]
Detected cat with confidence 0.789 at location [342.19, 24.3, 640.02, 372.25]
Detected remote with confidence 0.633 at location [40.79, 72.78, 176.76, 117.25]
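For fine-tuning, the labels argument described above yields a detection loss that can be backpropagated. The continuation below is a minimal sketch with dummy targets; the class ids and boxes (normalized (center_x, center_y, width, height)) are illustrative only:

>>> labels = [
...     {
...         "class_labels": torch.tensor([1, 17]),  # illustrative class ids
...         "boxes": torch.tensor([[0.5, 0.5, 0.4, 0.4], [0.2, 0.3, 0.1, 0.2]]),  # normalized (cx, cy, w, h)
...     }
... ]
>>> outputs = model(**inputs, labels=labels)
>>> outputs.loss.backward()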
LwDetrViTBackbone
class transformers.LwDetrViTBackbone
< source >( config )
Parameters
- config (LwDetrViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The LW-DETR ViT backbone.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: Tensor **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BackboneOutput or tuple(torch.FloatTensor)
Parameters
- pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using DeformableDetrImageProcessorFast. See DeformableDetrImageProcessorFast.call() for details (processor_class uses DeformableDetrImageProcessorFast for processing images).
Returns
transformers.modeling_outputs.BackboneOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BackboneOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (LwDetrConfig) and inputs.
- feature_maps (tuple(torch.FloatTensor) of shape (batch_size, num_channels, height, width)) — Feature maps of the stages.
- hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size) or (batch_size, num_channels, height, width), depending on the backbone. Hidden-states of the model at the output of each stage plus the initial embedding outputs.
- attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Only applicable if the backbone uses attention. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LwDetrViTBackbone forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
Examples:
>>> from transformers import LwDetrViTConfig, LwDetrViTBackbone
>>> import torch
>>> config = LwDetrViTConfig()
>>> model = LwDetrViTBackbone(config)
>>> pixel_values = torch.randn(1, 3, 224, 224)
>>> with torch.no_grad():
... outputs = model(pixel_values)
>>> feature_maps = outputs.feature_maps
>>> list(feature_maps[-1].shape)
[1, 768, 14, 14]