
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian1,2 Yifan Ding1 Yunhao Ge1 Sifei Liu1 Hanzi Mao1 Boyi Li1,2 Marco Pavone1
Ming-Yu Liu1 Trevor Darrell2 Adam Yala2,3 Yin Cui1
1 NVIDIA 2 UC Berkeley 3 UCSF

arXiv:2504.16072v1 [cs.CV] 22 Apr 2025

[Figure 1 example outputs]
Image (dog, top): "A white, fluffy dog with a thick coat, pointed ears, and a black nose. The dog has a wide-open mouth with a pink tongue hanging out, and its eyes are dark and alert. The fur is dense and appears soft, with some darker patches on its back."
Video (cat, bottom; region specified on any frame): "A ginger cat with a sleek, short-haired coat is captured in a series of graceful movements. Initially, the cat is seen with its head turned slightly to the side, ears perked up, and eyes wide open, suggesting alertness or curiosity. As the sequence progresses, the cat's body is elongated, indicating a stretch or a poised stance. Its tail is held high, curving slightly at the tip, a sign of confidence or playfulness. The cat's front paw is extended forward, as if reaching or preparing to pounce."
Figure 1: Describe Anything Model (DAM) generates detailed localized captions for user-specified regions
within images (top) and videos (bottom). DAM accepts various region specifications, including clicks, scribbles,
boxes, and masks. For videos, specifying the region in any frame suffices.
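As the caption notes, sparse region specifications (clicks, scribbles, boxes) are first converted into masks before captioning. The sketch below illustrates that front end under stated assumptions: it uses the public segment-anything package to turn a single click into a mask, while the `dam.describe(...)` call is a hypothetical placeholder for the model's inference API, which is not specified in the paper.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Turn a user click into a binary mask with SAM, then hand it to the captioner.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # checkpoint path is illustrative
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)                       # image_rgb: HxWx3 uint8 RGB array
masks, scores, _ = predictor.predict(
    point_coords=np.array([[640, 360]]),             # (x, y) of the click
    point_labels=np.array([1]),                      # 1 = foreground click
    multimask_output=False,
)
caption = dam.describe(image_rgb, masks[0])          # hypothetical DAM inference call
```

For videos, the same idea applies, with SAM 2 propagating the mask from a single annotated frame to the remaining frames (Sec. 3.3).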

Abstract

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state of the art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

1. Introduction

Image captioning has been a longstanding challenge in computer vision and natural language processing [18], as it involves understanding and describing visual content in natural language. While recent Vision-Language Models (VLMs) have achieved impressive results in image-level captioning, generating detailed and accurate captions for specific regions within an image remains an open problem. This challenge intensifies with videos, where models must additionally capture dynamic visual content, such as human actions, object motions, and human-object interactions. If resolved, it would open new doors for fine-grained grounded image/video understanding [49, 102] and generation [40, 42].

Most existing VLMs (e.g., GPT-4o [54]) lack mechanisms for precise localization. Recent approaches that empower VLMs to take 2D localization cues such as bounding boxes [30, 82, 85, 101] often yield brief phrases rather than detailed descriptions. While there are methods [15, 45, 93, 95, 99, 100] that produce longer captions, they provide minimal detail or include unrelated content from other regions, as shown in Fig. 5.

© 2025 NVIDIA. All rights reserved.



This raises the question: What makes detailed localized captioning (DLC) so challenging?

We identify three key obstacles to DLC:

1. Loss of Region Details: As shown in Fig. 2, prior methods extract local features from global image representations, often leading to loss of fine-grained details, particularly for small objects in complex scenes. By the time the LLM processes the visual features, crucial details necessary for generating precise captions are already lost. Cropping the region of interest may enhance detail but risks losing essential contextual cues.

2. Scarcity of High-Quality Datasets: Datasets such as RefCOCOs [31, 50] and Visual Genome [34] typically offer only short phrases that do not suffice for training models to generate rich, detailed captions. Recent synthetic data approaches [45, 93] are based on bounding boxes, which cannot precisely convey the exact region of interest, while methods that rely on global captions [28] may have difficulty capturing non-salient regions.

3. Limitations in Benchmarks: Prior localized captioning benchmarks compare generated captions against reference captions using language-based image captioning metrics [2, 5, 43, 56, 79] or LLM-based scoring [95, 96]. Such techniques are not well suited to DLC: since the reference captions provided in the benchmarks often lack comprehensive details of the region, DLC models are often unfairly penalized for correct details not explicitly mentioned in the reference.

[Figure 2 diagram]
Figure 2: Top: Prior regional captioners derive regional features from global image representations, leading to vague descriptions. Bottom: Zooming in (cropping the image region) enhances detail but loses contextual cues, degrading recognition. This underscores the need for a design that encodes detail-rich regional features while preserving context for improved DLC performance.

We propose the following solutions to these challenges:

To tackle the loss of details in regional features, we propose the Describe Anything Model (DAM), which preserves both local detail and global context. DAM achieves this through two key innovations (Fig. 3): 1) the focal prompt, which encodes the region of interest with high token density, and 2) the localized vision backbone, which ensures precise localization while integrating global context. These components enable DAM to generate detailed and accurate captions, even for small objects in complex scenes.

To overcome the scarcity of high-quality DLC datasets, we introduce the Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP) to generate high-quality localized captions in two stages. First, leveraging high-quality masks and keywords (e.g., class names, part names, or entities) from human-annotated segmentation datasets, we query a VLM to expand each keyword into a detailed caption given each mask-referred region. Second, inspired by self-training-based SSL in image classification [7, 8, 37, 69, 88], DLC-SDP performs self-training with web images as an unlabeled dataset and segmentation datasets as labeled data. An LLM further summarizes descriptions into multiple granularities, yielding a diverse dataset with high-quality localized captions and enabling our model to outperform strong API-only baselines such as GPT-4o [54] and o1 [55].

To mitigate the limitations of current benchmarks, we introduce DLC-Bench, which evaluates detailed localized captions based on a set of predefined positive and negative attributes for each region, eliminating the reliance on comprehensive reference captions. This approach provides a more flexible and accurate evaluation, encouraging models to generate informative and precise descriptions.

We summarize our contributions as follows:

1. Describe Anything Model (DAM): A novel architecture with a focal prompt and a localized vision backbone for multi-granular regional image and video captioning.
2. SSL Data Pipeline (DLC-SDP): A semi-supervised data pipeline that leverages high-quality segmentation annotations and unlabeled web images for scalable and diverse data curation.
3. DLC-Bench: A benchmark designed to evaluate DLC without reference captions.


Component | Previous Practice | Problem | Our Solution | Advantages
Describe Anything Model (DAM) | Extracting regional features from global image features | Regional details already lost in image feature extraction and not provided to the LLM | Providing a focal prompt to the proposed localized vision backbone | Detail-rich, contextful features allowing for accurate, multi-granular localized descriptions
SSL Data Pipeline (DLC-SDP) | Querying a data curation VLM with referring boxes and global image captions | Imprecise referring to the data curation model | Reframing the query as a mask-referred keyword expansion question | Leverages high-quality, precise human-annotated regional masks and keywords
SSL Data Pipeline (DLC-SDP) | Fully supervised learning | Limited data with high annotation quality | Semi-supervised learning | Scalable to diverse web-scale unlabeled datasets
Benchmark (DLC-Bench) | Pred caption + reference GT caption → language-based similarity metrics or LLM scorer | Incorrect hallucination penalty for correct details not present in the reference caption | Pred caption + query for positive/negative attributes → LLM scorer | Accurate detail and hallucination assessment without relying on reference captions

Table 1: Advantages of our proposed model DAM, our SSL data pipeline DLC-SDP, and our benchmark DLC-Bench over previous practices.

Unlike generalist models, we focus on localized image and video captioning across multiple granularities, achieving SOTA performance on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning. Advantages of DAM, DLC-SDP, and DLC-Bench over prior practices are presented in Tab. 1. We release our code, models, data, and benchmark at describe-anything.github.io.

2. Related Work

Vision-Language Models (VLMs). VLMs integrate visual and textual inputs for multimodal understanding and are broadly classified into BLIP-style [1, 3, 22, 25, 36, 38, 39] and LLaVA-style [4, 6, 9, 19, 20, 44, 46–48, 81, 87, 103]. However, these models lack precise localization capabilities, limiting their ability to generate regional descriptions.

Localized Image Captioning. While general VLMs generate image-level captions, localized captioning requires fine-grained regional descriptions. SoM [91, 92] augments VLMs with visual markers, but these markers may blend with the background, as discussed in App. A. Region-aware VLMs [15, 28, 30, 45, 57, 61, 70, 82, 83, 95, 99–101] introduce regional referring inputs. Recent efforts such as Merlin, Artemis, and VideoRefer [59, 94, 96] extend region-based captioning to videos. However, these methods still struggle with capturing intricate details in the referring regions, as shown by the examples in Fig. 5. This is because prior models either extract localized features from global image embeddings or simply encode the referring condition as referring tokens, which leads to insufficient regional details for the LLM, especially for small objects. We address this via focal prompting and a localized vision backbone, balancing local detail with global context.

Another limitation is the scarcity of high-quality datasets. Datasets like RefCOCOs [31, 50] and VG [34] provide only short phrases. Recent approaches [28, 45, 61, 93, 95, 96] use bounding-box-based VLM queries, sometimes augmented with global captions, for synthetic generation, which leads to the caveats discussed in App. C.1. We propose an SSL data pipeline that uses human-annotated and unlabeled data for richer regional descriptions.

Benchmarking Localized Captioning. Prior works [28, 30, 59, 61, 93, 95, 98, 100, 101] evaluate localized captioning by computing language-based image captioning metrics [2, 5, 43, 56, 79] between predicted captions and reference captions. However, these metrics focus on textual matching and may not correlate well with the factual correctness or quality of detailed descriptions. Others [95, 96] use Sentence-BERT [63] and text-only LLMs to score the predictions against reference captions. However, reference captions often lack comprehensive details about the region of interest, which penalizes models for correct details not explicitly mentioned in the reference by treating them as hallucinations, as discussed in App. C.2. Our DLC-Bench resolves this issue by eliminating the need for reference captions.


Vision Models with Focus. Prior works enhance attention to salient regions using focal self-attention [90], ToSA [68], ToMe [10], DW-ViT [65], Quadformer [66], and V* [86]. These methods allocate resources dynamically to salient regions defined by the model. In contrast, our focal prompt explicitly prioritizes user-specified regions, ensuring accurate and detailed captions even for non-salient objects.

3. DAM: Describe Anything Model

The Describe Anything Model (DAM) generates detailed localized descriptions of user-specified regions within images and videos. DAM effectively balances local detail and contextual information through our proposed focal prompt and localized vision backbone.

3.1. Task Formulation

The task of detailed localized captioning involves generating comprehensive textual descriptions focused exclusively on specified regions within images or videos. Formally, given N input frames I^(i) ∈ R^(H×W×3) and corresponding binary masks M^(i) ∈ {0, 1}^(H×W) indicating the region of interest in each frame, the objective is to produce a detailed description T of the content within the region through a captioning model:

    T = CaptioningModel({I^(i), M^(i)}_{i=1}^{N}).    (1)

We focus on using binary masks M^(i) as the localization input, since other forms of localization (e.g., points, scribbles, boxes, or masks on an image or a subset of frames in a video) can be transformed into masks via segmentation models such as SAM [32] and SAM 2 [62]. For simplicity, we first introduce our method for localized image captioning, omitting the frame index i, and later extend it to videos in Sec. 3.3.

3.2. Model Architecture

As shown in Fig. 3, DAM consists of two key components: the focal prompt and the localized vision backbone.

[Figure 3 diagram]
Figure 3: Architecture of the Describe Anything Model (DAM). DAM employs a focal prompt to encode user-specified regions with high token density while preserving context for detailed understanding. Focal cropping is applied to the image and its corresponding mask, retaining surrounding areas for local context. Both the full image and the focal crop are fed into the localized vision backbone, where images and binary masks are embedded in a spatially aligned fashion. Global context from the full image is leveraged to help understand the focal crop through gated cross-attention. The resulting visual features and prompt tokens are fed into a large language model to generate detailed, context-aware descriptions. * indicates modules initialized to output zeros.

3.2.1. Focal Prompt

To provide a detailed representation of the region of interest within its context, we introduce the focal prompt, which includes both the full image and a focal crop centered around the specified area, along with their corresponding masks.

We first extract the bounding box B of the mask M and expand it by a factor α in both the horizontal and vertical directions to include additional surrounding context:

    B′ = ExpandBox(B, α).    (2)

For instance, setting α = 3 results in a region that can be up to 9 times as large as the original bounding box, subject to clipping at the image boundaries. If either the height or width of the expanded box is less than 48 pixels, we enforce a minimum size of 48 pixels in that dimension to ensure sufficient context for very small regions.
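A minimal sketch of this focal cropping step is given below. It assumes the expansion is centered on the original box; the exact centering and rounding behavior are our assumptions, since the paper only specifies the expansion factor α and the 48-pixel minimum.

```python
import numpy as np

def focal_crop(image, mask, alpha=3.0, min_size=48):
    """Expand the mask's bounding box by `alpha`, clamp to the image, and crop
    both the image and the mask (a sketch; centering/rounding are assumptions)."""
    H, W = mask.shape
    ys, xs = np.where(mask > 0)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    h = max((y1 - y0) * alpha, min_size)          # expanded height, at least 48 px
    w = max((x1 - x0) * alpha, min_size)          # expanded width, at least 48 px
    ty, by = int(max(0, cy - h / 2)), int(min(H, cy + h / 2))
    lx, rx = int(max(0, cx - w / 2)), int(min(W, cx + w / 2))
    return image[ty:by, lx:rx], mask[ty:by, lx:rx]
```

The resulting crop and its mask form the second half of the focal prompt described next.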


The focal crop of the image and mask are then:

    I′ = I|_{B′},  M′ = M|_{B′},    (3)

where |_{B′} denotes cropping to B′. The focal prompt thus consists of 1) the full image I and its mask M and 2) the focal crop I′ and its mask M′. By including both the full image and the focal crop, along with their masks, the focal prompt contains both global context and a detailed view of the region of interest.

3.2.2. Localized Vision Backbone

Effectively processing all four components of the focal prompt with a VLM is non-trivial, as naively concatenating the full image and the focal crop leads to a loss in performance (Tab. 8). We propose the localized vision backbone, which 1) achieves localized understanding by encoding the masks in a spatially aligned manner and 2) integrates global context into the region of interest through gated cross-attention.

Handling Localization Inputs. Similar to how an image is encoded by a linear patch embedding layer in vision transformers (ViTs) [24], we integrate the mask M into its corresponding full image I through another patch embedding layer that takes in 2D inputs with one channel.

Specifically, the full image I and its mask M are processed through patch embedding layers, followed by the global vision encoder f_G(·) to obtain global visual features z. The focal crop I′ and its mask M′ undergo a similar process with the regional vision encoder f_R(·), except that f_R(·) also takes z as context to obtain the final fused visual features z′. Specifically, we have:

    x = E_I(I) + E_M(M) + P,      z = f_G(x),        (4)
    x′ = E_I(I′) + E_M(M′) + P,   z′ = f_R(x′, z),    (5)

where E_I(·) and E_M(·) are the image and mask patch embedding layers, respectively, x and x′ are the global and focal embedded inputs carrying information for both the image and the mask, and P denotes the positional encoding.

The newly added mask embedding layer E_M is initialized to output zeros, ensuring that the VLM's initial behavior is unaffected prior to fine-tuning.

Regional Feature Encoding with Gated Cross-Attention Adapters. To integrate global context into the focal prompt, we insert gated cross-attention adapters [1, 38] into each transformer block of the regional vision encoder f_R. After the self-attention and feed-forward layers, we add a gated cross-attention mechanism that allows local features to attend to global features:

    h′^(l) = h^(l) + tanh(γ^(l)) · CrossAttn(h^(l), z),    (6)
    h^(l)_Adapter = h′^(l) + tanh(β^(l)) · FFN(h′^(l)),    (7)

where h^(l) is the output of the l-th self-attention block in f_R, γ^(l) and β^(l) are learnable scaling parameters initialized to zero, and CrossAttn denotes cross-attention with queries from h^(l) and keys and values from the global features z, similar to how cross-attention is employed in encoder-decoder Transformers [78]. h^(l)_Adapter is used in place of h^(l) in the next Transformer block. To reduce the number of parameters, f_R shares self-attention block weights with f_G.

By initializing γ^(l) and β^(l) to zero, we ensure that the initial behavior of the model remains identical to the original VLM prior to fine-tuning. During training, the model learns to leverage the global context to enhance local feature representations, facilitating detailed and contextually accurate descriptions.

Generating Detailed Localized Descriptions. The visual features from both the global and regional vision encoders are combined and fed into the large language model to generate detailed, context-aware descriptions T:

    T = LLM(t, z′),    (8)

where t denotes textual prompt tokens.

Notably, the proposed components do not increase the sequence length of the vision tokens, ensuring that DAM remains efficient. By initializing the new modules (mask embedding E_M and scaling parameters γ^(l) and β^(l)) to zeros, we preserve the pre-trained capabilities of the VLM prior to fine-tuning, allowing for smooth adaptation of an off-the-shelf VLM without rerunning pre-training. Thanks to this design, our model requires far less training data (∼1.5M samples) than prior works that involve VLM pretraining.
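To make Eqs. (6)-(7) concrete, below is a minimal PyTorch sketch of one gated cross-attention adapter; the head count, FFN width, and the use of nn.MultiheadAttention are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Sketch of the tanh-gated adapter in Eqs. (6)-(7): focal-crop tokens attend
    to global features z, with gates initialized to zero so the adapted encoder
    initially behaves exactly like the original VLM."""

    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )
        self.gamma = nn.Parameter(torch.zeros(1))  # gate for cross-attention, Eq. (6)
        self.beta = nn.Parameter(torch.zeros(1))   # gate for the feed-forward path, Eq. (7)

    def forward(self, h, z):
        # h: (B, N_focal, C) output of the l-th self-attention block of f_R
        # z: (B, N_global, C) global visual features from f_G
        attn_out, _ = self.cross_attn(query=h, key=z, value=z)
        h = h + torch.tanh(self.gamma) * attn_out        # Eq. (6)
        h = h + torch.tanh(self.beta) * self.ffn(h)      # Eq. (7)
        return h
```

Because tanh(0) = 0, both residual branches start as identity mappings, which matches the zero-initialization argument for adapting an off-the-shelf VLM without rerunning pre-training.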


3.3. Extension to Videos

Since images can be considered as videos with a single frame, the model naturally extends to handling videos by processing sequences of frames and their corresponding masks. The visual features from all frames are concatenated in the sequence dimension and fed into the language model to generate detailed localized descriptions across the video frames, compatible with how VLMs are pretrained to handle videos. We leverage SAM 2 [62] to turn sparse localizations into a mask for each frame.

4. DLC-SDP: SSL-based Data Pipeline

The effectiveness of DAM depends critically on the availability of high-quality training data for detailed localized descriptions. To this end, we propose the Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP), a two-stage approach that enables us to build a large and diverse dataset with high-quality localized descriptions.

4.1. Stage 1: Leveraging Existing Annotations

The first stage of DLC-SDP reframes the data generation problem into a vision-grounded description expansion task. We observed that although current VLMs often struggle to generate detailed localized descriptions when given a referring mask, they can effectively expand short localized descriptions into detailed ones. Inspired by this observation, we leverage high-quality human-annotated masks and keywords (object class names, part names, entities, etc.) from existing segmentation datasets, reframing the VLM query to expand each regional keyword into a detailed caption given the referring mask.

Importantly, our model is trained to predict these high-quality descriptions without taking the initial keywords as inputs. Since no class labels are provided at inference time, and existing VLMs perform poorly without them, our approach ensures superior data quality compared to direct VLM prompting for distillation.

4.2. Stage 2: SSL with Unlabeled Data

Since it is not scalable to rely on high-quality manual annotations, the second stage of DLC-SDP employs self-training-based semi-supervised learning techniques inspired by successful approaches in image classification [7, 8, 69, 89]. Our self-training approach involves four steps:

1. Mask Generation. We use open-vocabulary segmentation models [30, 44] to extract object masks from unlabeled web images.
2. Description Generation. Our DAM, initially trained on the DLC dataset built from annotated segmentation datasets, generates detailed localized descriptions for these regions.
3. Confidence-based Filtering. We apply CLIP-based confidence filtering to keep only high-quality samples, following the SSL literature.
4. Data Expansion. The newly generated (image, mask, description) triplets are added to our training dataset.

This semi-supervised approach dramatically expands the range of object categories and increases data diversity beyond the initial supervised dataset. Furthermore, to support the capability of DAM for multi-granular captioning, we leverage an LLM to summarize the detailed descriptions into shorter forms (e.g., phrases or short sentences), enabling DAM to flexibly generate captions ranging from succinct phrases to multi-sentence narratives.

By systematically curating our training data through this two-stage SSL pipeline, DAM achieves significant performance improvements. Notably, our model trained on data obtained with DLC-SDP outperforms GPT-4o [54] and o1 [55], two strong closed-source baselines, showing the effectiveness of DLC-SDP. We present implementation details of DLC-SDP in App. H.

5. DLC-Bench: Benchmark for DLC

We introduce DLC-Bench, a benchmark designed for DLC that eliminates the need for comprehensive reference captions. The core intuition behind DLC-Bench is that an ideal description should be rich in relevant details while strictly avoiding factual errors or information about irrelevant regions. Therefore, we assess predictions based on a set of predefined positive and negative attributes for each region.

As illustrated in Fig. 4, the evaluation process for a model like DAM has two steps:

1. The model is prompted to generate a detailed description for each masked region in the benchmark dataset.
2. An LLM serves as a judge, assessing the generated description by responding to a set of manually curated positive and negative questions about region details.
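The sketch below illustrates this two-step protocol under stated assumptions: `captioner`, `llm_judge`, and the question/scoring interface are hypothetical placeholders, and the exact point and penalty scheme (detailed below and in App. D) is only approximated.

```python
def evaluate_dlc_bench(captioner, llm_judge, benchmark):
    """Hypothetical sketch of DLC-Bench scoring: describe each region, then let an
    LLM judge answer curated positive/negative questions about the description."""
    pos_scores, neg_scores = [], []
    for region in benchmark:
        desc = captioner.describe(region.image, region.mask)      # step 1
        for q in region.questions:                                # step 2
            choice = llm_judge(description=desc, question=q.text, choices=q.choices)
            score = q.score_for(choice)   # e.g., point for a correct detail, penalty for an error
            (pos_scores if q.is_positive else neg_scores).append(score)
    # Positive and negative accuracies are reported separately, as in Tab. 5.
    return sum(pos_scores) / len(pos_scores), sum(neg_scores) / len(neg_scores)
```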


[Figure 4 example]
(a) A region in DLC-Bench for the model to describe. (b) Detailed localized caption: "A white electric stove with a smooth, flat cooktop featuring two visible black burners. The control panel is located at the back, with a black surface and a few visible knobs." (c) Question to the LLM judge: "Which of the following is applicable to the description? A. The control panel or the stove is not mentioned. (Missing details) B. The location of the control panel is mentioned in the description but is not back. (Factual error) C. The location of the control panel is mentioned in the description and is back. (Right location) D. The location of the control panel is not mentioned, but the control panel of the stove is mentioned. (Partial recognition)" (d) LLM response: "C. (Right location)", so the model gets a point.

Figure 4: We propose DLC-Bench, a benchmark tailored to detailed localized captioning. In DLC-Bench, a captioning model is prompted to describe a specified image region (a). The generated description (b) is then evaluated by querying an LLM judge (c). Points are assigned or deducted based on the LLM's response (d). The question shown in (c) is an example of a positive question.

Method | LVIS Sem. Sim. (↑) | LVIS Sem. IoU (↑) | PACO Sem. Sim. (↑) | PACO Sem. IoU (↑)
LLaVA-7B [48] | 49.0 | 19.8 | 42.2 | 14.6
Shikra-7B [15] | 49.7 | 19.8 | 43.6 | 11.4
GPT4RoI-7B [99] | 51.3 | 12.0 | 48.0 | 12.1
Osprey-7B [95] | 65.2 | 38.2 | 73.1 | 52.7
Ferret-13B [93] | 65.0 | 37.8 | - | -
VP-SPHINX-7B [45] | 86.0 | 61.2 | 74.2 | 49.9
VP-LLAVA-8B [45] | 86.7 | 61.5 | 75.7 | 50.0
DAM-8B (Ours) | 89.0 | 77.7 | 84.2 | 73.2

Table 2: LVIS [29] and PACO [60] open-class keyword-level captioning benchmarks (values in %). DAM excels particularly in the challenging PACO benchmark that requires distinguishing between objects and parts.

Method | BLEU | METEOR | ROUGE-L | CIDEr | SPICE
Shikra-7B [15] | 18.2 | 15.3 | 25.2 | 49.8 | 22.0
GPT4RoI-7B [99] | 19.7 | 17.7 | 29.9 | 61.7 | 24.0
Ferret-7B [93] | 11.1 | 8.8 | 22.7 | 38.1 | 17.5
VP-SPHINX-13B [45] | 15.2 | 15.6 | 27.2 | 67.4 | 24.0
RegionGPT-7B [28] | 16.1 | 16.7 | 27.4 | 54.6 | 20.5
DAM-8B (Ours) | 22.6 | 17.8 | 31.2 | 74.7 | 25.5

Table 3: Zero-shot evaluation on the phrase-level dataset Flickr30k Entities [58]. Our model achieves a 12.3% average relative improvement over the previous best.

Method | BLEU | METEOR | ROUGE-L | CIDEr | SPICE | CLAIR
Shikra-7B [15] | 29.5 | 11.1 | 23.9 | 42.7 | 9.0 | 34.5
GPT4RoI-7B [99] | 27.1 | 11.6 | 26.8 | 59.9 | 11.1 | 43.9
Ferret-7B [93] | 24.6 | 10.7 | 22.3 | 39.7 | 8.2 | 45.2
GLaMM-7B [61] | 23.2 | 10.1 | 23.8 | 51.1 | 8.7 | 43.8
VP-SPHINX-13B [45] | 22.6 | 10.7 | 22.6 | 32.4 | 7.6 | 51.2
RegionGPT-7B [28] | 25.4 | 12.2 | 25.3 | 42.0 | 8.1 | 37.2
DAM-8B (Ours) | 38.7 | 19.4 | 37.1 | 70.0 | 16.9 | 57.9

Table 4: Zero-shot evaluation on the detailed captioning dataset Ref-L4 [14]. BLEU, METEOR, ROUGE-L, CIDEr, and SPICE are short captioning metrics; CLAIR is a long captioning metric. Our method achieves 33.4% and 13.1% average relative improvement on the short and long language-based captioning metrics, respectively.

DLC-Bench employs two categories of questions for each annotated instance, with an example in Fig. A.5:

• Positive questions focus on specific attributes of object parts that should be present in the description. The model earns a point if the description accurately includes the specified detail; omissions receive no points, while factual errors incur a penalty.
• Negative questions focus on details that should not be present: either attributes typical of similar objects but absent in the target instance, or descriptions irrelevant to the specified region. A point is awarded if the model correctly omits such details; conversely, including them results in a penalty. To avoid awarding high scores to captions that are completely off, points are only awarded if the caption correctly recognizes the object.

This approach provides a more flexible and accurate evaluation, encouraging models to generate informative and precise descriptions without constraints from incomplete reference captions.

Our DLC-Bench comprises a total of 892 manually verified questions covering a wide range of attributes and potential cases for hallucinations. Details on the curation process and the scoring mechanism are provided in App. D.

6. Results

DAM excels at localized image and video captioning across multiple granularities including keyword, phrase, and detailed captions, achieving SOTA on 7 in-domain and zero-shot benchmarks (Tabs. 2 to 7). We explain the details of each benchmark in App. B.

6.1. Quantitative Results

Open-class keyword-level localized captioning requires the model to output keywords containing the object and part entities that describe the region. Tested on the object-level LVIS [29] and part-level PACO [60] datasets in Tab. 2, our method achieves state-of-the-art performance.

PACO is a challenging benchmark that includes both full objects and parts in complex scenes and requires the model to decide whether the region is an object or a part; on it, our method achieves 73.2% semantic IoU and 84.2% semantic similarity, outperforming the previous best by 23.2% and 8.5%, respectively.

Phrase-level localized captioning requires the model to output a phrase containing a brief description for each region, typically covering object identification and attributes within a few words. Tested zero-shot on Flickr30k Entities [58], our model achieves strong performance, outperforming the previous best by a 12.3% relative improvement averaged over 5 metrics.

Detailed localized captioning requires the model to output a detailed description for each region, with the length spanning from a long sentence to multiple sentences. An ideal description covers the object in the region and its parts, as well as their attributes and relationships. We benchmark this capability on the challenging Ref-L4 [14] benchmark and our proposed DLC-Bench. On the Ref-L4 benchmark in Tab. 4, our method achieves a 33.4% relative improvement on average over the previous best on short language-based captioning metrics [2, 5, 43, 56, 79], and a 13.1% relative improvement on the long language-based captioning metric [13].

We also benchmark various regional captioning models on our proposed DLC-Bench, which does not suffer from the reference-caption limitations of previous benchmarks. As shown in Tab. 5, our Describe Anything Model (DAM) significantly outperforms existing general and region-specific VLMs, achieving state-of-the-art positive and negative accuracy and demonstrating its ability to produce detailed and accurate descriptions. Remarkably, DAM surpasses GPT-4o [54] and o1 [55], two strong API-only baselines. DAM also surpasses models with thinking mode enabled [55, 73, 74].

Detailed localized video captioning requires the model to output a detailed description for each region in a video. We benchmark this capability on the challenging HC-STVG [71] benchmark and the detailed captioning benchmark proposed by VideoRefer [96].

Method | #Params | Pos (%) | Neg (%) | Avg (%)
General VLMs:
GPT-4o [54] | - | 43.4 | 79.6 | 61.5
o1 [55]† | - | 46.3 | 78.8 | 62.5
Claude 3.7 Sonnet [73]† | - | 21.8 | 50.4 | 36.1
Gemini 2.5 Pro [74, 75]† | - | 36.5 | 75.2 | 55.8
Llama-3.2 Vision [25] | 11B | 30.7 | 63.8 | 47.3
VILA1.5-Llama-3 [44] | 8B | 22.5 | 61.0 | 41.8
InternVL2.5 [20, 21, 84] | 8B | 15.9 | 42.0 | 28.9
LLaVA v1.6 [46–48] | 7B | 15.4 | 55.0 | 35.2
Qwen2.5-VL [77, 81] | 7B | 20.3 | 62.2 | 41.2
VILA1.5 [44] | 3B | 16.0 | 50.0 | 33.0
Region-specific VLMs (full / cropped input):
GPT4RoI [99] | 7B | 6.5 / 3.5 | 46.2 / 52.0 | 26.3 / 27.7
Shikra [15] | 7B | 2.7 / 8.0 | 41.8 / 51.4 | 22.2 / 29.7
Ferret [93] | 7B | 6.4 / 14.2 | 38.4 / 46.8 | 22.4 / 30.5
RegionGPT [28] | 7B | 13.0 / 10.6 | 41.4 / 46.4 | 27.2 / 28.5
ControlCap [101] | 0.3B | 18.3 / 3.6 | 75.6 / 53.6 | 47.0 / 28.6
SCA [30] | 3B | 3.4 / 0.1 | 44.6 / 18.4 | 24.0 / 9.3
OMG-LLaVA [100] | 7B | 0.9 / 5.6 | 16.0 / 32.6 | 8.5 / 19.1
VP-SPHINX [45] | 13B | 11.7 / 26.3 | 33.2 / 71.6 | 22.5 / 49.0
DAM (Ours) | 3B | 52.3 | 82.2 | 67.3

Table 5: Accuracies on detailed localized captioning in our proposed DLC-Bench. DAM outperforms previous API-only models, open-source models, and region-specific VLMs on detailed localized captioning. Underlined in the original: the second-best method. †: models with thinking mode.

Method | BLEU@4 | METEOR | ROUGE-L | CIDEr | SPICE
Osprey-7B [95] | 0.7 | 12.0 | 18.0 | 1.2 | 15.6
Ferret-13B [93] | 0.5 | 10.2 | 17.0 | 1.2 | 11.2
Shikra-7B [15] | 1.3 | 11.5 | 19.3 | 3.1 | 13.6
Merlin-7B [94] | 3.3 | 11.3 | 26.0 | 10.5 | 20.1
Artemis-7B [59] | 15.5 | 18.0 | 40.8 | 53.2 | 25.4
VideoRefer-7B [96] | 16.5 | 18.7 | 42.4 | 68.6 | 28.3
DAM-8B (Ours) | 19.8 | 21.0 | 45.9 | 91.0 | 31.4

Table 6: Detailed localized video captioning on HC-STVG [71].

In Tab. 6, our proposed DAM achieves a 19.8% relative improvement over the previous best on HC-STVG, including the concurrent work VideoRefer [96]. In Tab. 7, on the benchmark proposed by the concurrent work VideoRefer [96], DAM surpasses the previous best in both zero-shot and in-domain settings, where zero-shot indicates not being trained on in-domain datasets derived from Panda-70M [17], which the benchmark also sources videos from.

Finally, we analyzed the performance of DAM on the HD (hallucination detection) sub-task and found that DAM often predicts correct details not present in the reference caption. This indicates that the lower zero-shot performance on this sub-task is not necessarily due to hallucination by our model but rather due to missing details in the reference caption. We illustrate this further in App. C.2.


[Figure 5 example outputs for the prompt "describe this object" (a cucumber slice)]
RegionGPT: "a slice of cucumber that is on the left side of the camera"
ControlCap: "star shaped slice of green cucumber"
Segment and Caption Anything (SCA): "white paper on suitcase"
GLaMM: "the cut up cucumber on the top"
OMG-LLaVA: "The image features a green cucumber slice with a red tomato on top of it. The tomato slice is placed on top of the green cucumber slice, creating a visually appealing and healthy snack. The combination of the green cucumber and tomato on the same slice makes for a delicious and nutritious meal option."
VP-SPHINX: "In the region, there is a rectangular piece of vegetable that resembles a camera. It is a square piece of a vegetable, possibly a cucumber, positioned horizontally. It is located towards the center of the region and is the focal point of the image."
Describe Anything Model (Ours): "A rectangular slice of cucumber with a light green, watery interior. The slice features a central star-shaped pattern with radiating lines extending outward. The surface is smooth with a slightly translucent appearance, and the edges are clean and straight. The cucumber has a uniform light green color throughout."

Figure 5: DAM generates detailed and localized descriptions, whereas prior works generate descriptions that are less precise. Green: correct description. Red: factual error or mislocalization.

Method | SC | AD | TD | HD† | Avg.
Zero-shot:
Qwen2-VL-7B [81] | 3.30 | 2.54 | 2.22 | 2.12 | 2.55
InternVL2-26B [20] | 4.08 | 3.35 | 3.08 | 2.28 | 3.20
GPT-4o-mini [54] | 3.89 | 3.18 | 2.62 | 2.50 | 3.05
GPT-4o [54] | 4.15 | 3.31 | 3.11 | 2.43 | 3.25
Osprey-7B [95] | 3.30 | 2.66 | 2.10 | 1.58 | 2.41
Ferret-7B [93] | 3.20 | 2.38 | 1.97 | 1.38 | 2.23
Elysium-7B [80] | 2.35 | 0.30 | 0.02 | 3.59 | 1.57
Artemis-7B [59] | 3.42 | 1.34 | 1.39 | 2.90 | 2.26
DAM-8B (Ours) | 4.45 | 3.30 | 3.03 | 2.58 | 3.34
In-domain*:
VideoRefer-7B [96] | 4.44 | 3.27 | 3.10 | 3.04 | 3.46
DAM-8B (Ours) | 4.69 | 3.61 | 3.34 | 3.09 | 3.68

Table 7: Performance on detailed localized video description on VideoRefer-Bench-D [96]. †: We provide an analysis of hallucination scores (HD) in Sec. 6.1 and App. C.2. *: trained on the in-domain VideoRefer-700k with regard to VideoRefer-Bench, both sourcing videos from Panda-70M [17].

6.2. Qualitative Results

Qualitative comparisons in Fig. 5 show that DAM excels in both accuracy and the level of detail.

Detailed Localized Video Captioning. DAM describes user-specified objects in videos with localization from any frame. As shown in Fig. 6(a), DAM effectively describes objects under challenging conditions, such as motion and occlusion. We offer more video examples in App. G.3.

Controlling Description Granularity. As shown in Fig. 7, DAM allows control over the amount of detail and the length of descriptions with different prompts.

Zero-shot 3D Object Captioning. Our model can also describe objects in multi-view datasets such as Co3Dv2 [64], integrating information from multiple frames to provide coherent descriptions of 3D objects (Fig. 6(b)).

[Figure 6 examples]
(a) Input video of a running person; DAM output: "A person wearing a black shirt and dark shorts is captured in a dynamic sequence of movement. Initially, the individual appears to be in a running motion, with their body slightly leaning forward, suggesting speed and agility. The arms are bent at the elbows, and the legs are positioned in a way that indicates a swift, forward stride. As the sequence progresses, the person maintains a consistent pace, with their legs alternating in a rhythmic pattern, indicative of a sprinting action. The posture remains upright, and the head is slightly tilted forward, emphasizing focus and determination. The overall movement conveys a sense of urgency and athleticism, as the person continues to move swiftly across the scene." DAM reliably describes user-specified objects in videos, even under strong camera and object motion and heavy occlusion.
(b) Multi-view input; DAM output: "The car is a blue SUV with a boxy design, featuring a spare tire mounted on the rear door. It has a four-door configuration with black side mirrors and tinted windows. The vehicle is equipped with silver alloy wheels and black trim around the wheel arches. The front of the car includes a rectangular grille and rectangular headlights. The side of the car displays a horizontal silver stripe running along the length of the doors. The rear of the car has vertical taillights and a visible license plate area. The car's body is slightly worn, indicating use over time." DAM can describe objects in multi-view datasets, like this car from Co3Dv2 [64], by integrating information from multiple frames.

Figure 6: DAM accurately describes user-specified regions in videos and multi-view scenes under challenging conditions. More results are presented in Fig. A.8.

[Figure 7 example for the prompt "describe this object" (a chair)]
Prompt: "Provide a brief description of the masked region." Output: "A modern chair with a curved backrest and light wood legs."
Prompt: "Provide a highly detailed description of the masked region." Output: "A modern chair with a curved backrest and a textured, light brown fabric upholstery. The chair features a smooth, rounded top edge and a slightly tapered back. It is supported by four slender, light-colored wooden legs that are slightly splayed for stability."

Figure 7: DAM offers multi-granular localized descriptions.

6.3. Ablations

Visual Prompting. We analyze different prompting strategies and find that both localized inputs and contextual information are crucial for accurate descriptions. Using only the full image limits focus on specific regions (48.7%), while local crops improve detail but lose context (60.1%). Simply concatenating both performs poorly (42.4%). Adding cross-attention significantly improves performance (63.2%), and using focal crops further enhances results (65.4%). Our best approach, the focal prompt, integrates focal crops with cross-attention, achieving 67.3% accuracy without increasing sequence length for the LLM.


Prompting | XAttn | #IT | Pos (%) | Neg (%) | Avg (%)
Full Image Only | No | 196 | 32.1 | 65.4 | 48.7
Local Crop Only | No | 196 | 43.5 | 76.6 | 60.1 (+11.4)
Full + Local Crop | No* | 392 | 26.3 | 58.6 | 42.4 (-6.3)
Full + Local Crop | Yes | 196 | 45.7 | 80.6 | 63.2 (+14.5)
Focal Crop Only | No | 196 | 47.3 | 83.6 | 65.4 (+16.7)
Full + Focal Crop (Ours) | Yes | 196 | 52.3 | 82.2 | 67.3 (+18.6)

Table 8: Ablation studies across different visual prompts, cross-attention settings, and the number of image tokens. Local crop denotes cropping without surrounding context. XAttn: cross-attention. #IT: number of image tokens. * indicates that the full image and the crop are concatenated on the sequence dimension. The final row (Full + Focal Crop) is our proposed focal prompt.

Data | # regions | Pos (%) | Neg (%) | Avg (%)
LVIS | 373k | 34.0 | 72.6 | 53.3
+ additional datasets | 602k | 47.5 | 80.0 | 63.8 (+10.5)
+ SSL on 10% of SA-1B | 1.38M | 52.3 | 82.2 | 67.3 (+14.0)

Table 9: Our model benefits from diverse datasets generated by DLC-SDP. Scaling the dataset in size and diversity significantly improves model performance, and SSL further enhances performance using widely available unannotated images.

Data Scaling. Expanding supervised datasets boosts performance, demonstrating the value of diverse regional data. Incorporating semi-supervised learning (SSL) with 10% of unannotated SA-1B images further improves accuracy to 67.3%, showcasing our data pipeline's scalability.

Additional ablations, including training a model with the architecture from prior work on our data, comparing image-only vs. image-video joint training, and ablating prompt augmentations, are in App. E.

7. Discussions and Conclusion

We introduced DAM, a model for detailed localized captioning in images and videos, balancing local detail and global context through a focal prompt and a localized vision backbone. We developed DLC-SDP, an SSL data pipeline leveraging segmentation datasets and unlabeled web images for high-quality captions. We also proposed DLC-Bench, a benchmark using attribute-based evaluation to overcome the limitations of reference-based scoring. DAM achieves SOTA on 7 benchmarks in multi-granular regional captioning. We present additional discussions in App. C.


A. Challenges in Generating Detailed Localized Descriptions with Off-the-Shelf VLMs

Although cutting-edge Vision-Language Models (VLMs), such as GPT-4o [54] and LLaVA [46–48], excel at generating global-level image descriptions, producing detailed localized image captions remains an open problem. Specifically, these VLMs only take in RGB images along with text prompts and do not allow users to accurately specify regions of interest.

While users could employ text to localize the object to be described, this approach is often cumbersome and inefficient, requiring precise referring phrases that may still be difficult for the VLM to interpret. This can lead to mislocalization of the intended object, as illustrated in Fig. A.1(a).

The required effort for both the user and the model can be significantly reduced if the user is allowed to specify the region directly using a representation in 2D coordinates that the model can understand. With this idea in mind, we focus on generating detailed localized descriptions by enabling users to specify a region in an image for the model to describe in detail. Since spatial representations such as points and boxes can be converted into masks using SAM [32] and SAM 2 [62], we concentrate on regions specified by mask inputs.

A first attempt to address this problem with existing VLMs is to reduce the task to global image captioning by presenting only the region to the VLM through masking or cropping, as shown in Fig. A.1(b). While this forces the VLM to focus solely on the specified region, freeing users from the burden of expressing localizations as phrases, the lack of contextual information makes the task much more challenging and often confuses the VLM. This confusion can prevent the model from correctly identifying the object, let alone providing detailed descriptions of its parts. In more extreme cases, the model may even refuse to caption the region due to insufficient information in the cropped or masked image. Therefore, generating detailed localized captions requires more than just the local region.

An alternative approach to prompt existing off-the-shelf VLMs for localized descriptions is to overlay markings such as points, scribbles, contours, and alpha masks on the image [91, 92], as shown in Fig. A.1(c). However, these markings may blend into the object or the background in highly complex scenes, making them unrecognizable to the VLMs. This issue is especially common for small objects that are not the main focus of the scene. Furthermore, the markings may render the image out-of-distribution, confusing the VLMs and disrupting the quality of output that they were originally capable of generating.

The exploration above highlights a conflict between the precision of localization and the availability of context. On one hand, we want the model to accurately focus on a specific region without mentioning other regions, such as other objects or the background. On the other hand, the model needs to leverage contextual information to correctly identify the object in the region of interest. This conflict makes it very difficult for current VLMs to produce high-quality localized descriptions.

Our proposed model overcomes this challenge by taking the localization as a separate input in 2D space. This approach has the advantage of making the localization more explicit for the VLMs to parse while keeping the image within its original distribution, thus preventing the model from being distracted by the markings. This technique leads to accurate localization even in complex scenes, as illustrated in Fig. A.1(d). Note that since Fig. A.1(d) mainly focuses on explaining the design choice of inputting masks to the model, focal prompting is included as part of the model and is omitted in the figure for simplicity. We refer readers to Fig. 3 for illustrations of focal prompting.

B. Evaluation Benchmarks

Our DAM is designed to perform well at localized image and video captioning across multiple granularities, including keyword, phrase, and detailed captions. Therefore, we evaluate and achieve SOTA on 7 in-domain and zero-shot benchmarks:

1. The LVIS open-class keyword-level benchmark in Tab. 2.
2. The PACO open-class keyword-level benchmark (including objects and parts as regions) in Tab. 2.
3. The Flickr30k Entities phrase-level benchmark in Tab. 3.
4. The Ref-L4 detailed captioning benchmark in Tab. 4.
5. Our proposed DLC-Bench detailed localized captioning benchmark in Tab. 5.
6. The HC-STVG detailed video captioning benchmark in Tab. 6.
7. The VideoRefer detailed video captioning benchmark in Tab. 7.


[Figure A.1 panels]
(a) Baseline 1: Existing VLM + localization through text. Prompt: "Describe the rightmost building in the image." Output: "The rightmost building in the image is a sleek, modern skyscraper under construction, ..." (incorrect). It is hard for humans to come up with accurate phrasing and hard for models to localize the exact referred object.
(b) Baseline 2: Existing VLM + localization through cropping or masking. Output: "The image shows a tall, multi-layered structure that appears to resemble a damaged or partially broken skyscraper. ..." It is easy for humans to indicate the object of interest but hard for the model to understand the region without context.
(c) Baseline 3: Existing VLM + localization through markings on the image. Output: "The region that is marked with the mask in the image is the Statue of Liberty. It is located near the far right of the image, ..." (incorrect). It is easy for humans to indicate the object of interest but hard for the model to localize the object in complex scenes.
(d) Our approach: Describe Anything Model + localization as a separate input. Output: "A modern skyscraper with a stepped design, featuring a series of vertical, evenly spaced windows ..." (correct). It is easy for humans to indicate the object of interest and easy for the model to localize the exact referred object.

Figure A.1: Existing Vision-Language Models (VLMs) do not perform well in generating localized descriptions. (a) to (c) demonstrate several ways to prompt existing VLMs, but none achieves satisfactory performance, leading to the need for a new method that is capable of providing detailed and localized descriptions. In (d), we propose a model that accepts the condition in a separate form of input, making it easy for users to specify the object of interest and for the models to accurately localize the referred object. Note that our focal prompt, proposed in Sec. 3, is considered part of the Describe Anything Model and is not shown in the figure for simplicity.

We offer an explanation of each setup.

B.1. Keyword-level Localized Captioning Benchmarks

Open-class keyword-level localized captioning benchmarks, proposed in [95], require the model to output keywords containing the object and part entities that describe the region. In contrast to closed-class keyword-level localized captioning, which constrains the model output to a provided set of choices, open-class keyword-level localized captioning takes free-form text outputs from the model. The evaluation results are in Tab. 2.

1. For LVIS [29], this involves predicting the class name as a keyword, given the segmentation mask of an object. A typical class name ranges from one to four words.
2. For PACO [60], this involves predicting the class name of an object in the mask if the mask contains a full object, or the object name and the part name if the mask contains an object part. This is especially challenging because it requires the model to understand the nuances between full objects and object parts.

B.2. Phrase-level Localized Captioning Benchmarks

Phrase-level localized captioning requires the model to output a phrase containing a brief description for each region, covering object identification and attributes, typically within a few words. The metrics typically used in phrase-level benchmarks are CIDEr, METEOR, BLEU, ROUGE-L, and SPICE [2, 5, 43, 56, 79].


We refer to these metrics as short captioning metrics, as opposed to metrics from LLM-based evaluations that support evaluating detailed captions.

We perform zero-shot evaluation on the grounded phrases in Flickr30k Entities [58], where our model is not trained on the entities annotated in the training split of Flickr30k Entities. Results are in Tab. 3.

B.3. Detailed Localized Captioning Benchmarks

Detailed localized captioning requires the model to output a detailed description for each region, with the length spanning from a long sentence to multiple sentences.

1. We perform zero-shot evaluation on the detailed captions in the Objects365 [67] split of Ref-L4 [14], since we do not train on the Objects365 dataset. We evaluate the prediction quality by computing short captioning metrics and the CLAIR [13] score against the reference captions in the dataset. We use CLAIR to evaluate the raw detailed outputs, while we summarize both the prediction and the ground truth with GPT-4o-mini [54] before evaluation with the short captioning metrics. No ground truth or reference captions are provided to GPT-4o-mini, and the LLM setting is exactly the same for all models for fairness. Results are in Tab. 4.
2. We evaluate our model on DLC-Bench, our proposed benchmark for fine-grained region-based captioning. This evaluation is also zero-shot. We present details about our benchmark in App. D. Results are in Tab. 5.

B.4. Detailed Localized Video Captioning Benchmarks

1. We conduct evaluation on HC-STVG [71], a spatial-temporal video grounding dataset with detailed captions used in prior and concurrent work [59, 96]. Following prior work [59], we evaluate the quality of localized captions with CIDEr, METEOR, BLEU, ROUGE-L, and SPICE [2, 5, 43, 56, 79]. Results are in Tab. 6.
2. We also perform evaluation on the detailed localized video description benchmark in VideoRefer-Bench, proposed by concurrent work [96]. GPT-4o is used to provide scores along four dimensions on a scale of 1 to 5: Subject Correspondence (SC), Appearance Description (AD), Temporal Description (TD), and Hallucination Detection (HD). The zero-shot setting indicates that our model is not trained on Panda-70M [17], the dataset from which VideoRefer-Bench sources its videos. The in-domain setting indicates mixing the detailed caption subset of VideoRefer-700k, which is also curated from Panda-70M [17], into our training data. Results are in Tab. 7.

[Figure A.2 illustration]
With a box as the referring input, the data-curation VLM is ambiguous about the target: "I don't have enough information to know what the box is referring to! Possibility 1: the dish. Possibility 2: the plate." With a mask, the reference is unambiguous: "It's the dish! A dish that appears to be a seafood meal, specifically including shrimp ..." or "It's the plate! A white, oval-shaped dish with a slightly raised rim ..."

Figure A.2: Caveats of using boxes to indicate regions of interest. Top: Using a box to indicate the region of interest leads to ambiguity. Middle and bottom: Switching to a mask representation leads to more specific referring and correct descriptions.

C. Discussions

C.1. The Caveats of Using Referring Boxes in the Data Pipeline

Caveats exist when boxes are used to refer to regions in the data pipeline. As shown in Fig. A.2, boxes can be ambiguous in terms of what they are referring to, causing uncertainty for the VLM that we use in our data pipeline. In contrast, masks are much more specific about the region being referred to. This motivates us to use manually annotated masks in existing segmentation datasets rather than bounding boxes in order to curate high-quality data for DLC with little referring ambiguity. We additionally take in the manually annotated keywords (e.g., class names, part names, entities) in these datasets for the regions we annotate in our data pipeline, further reducing the ambiguity and potential confusion for the VLM in the data pipeline.


Input Video

DAM Output: A man with short, dark hair and a slight stubble is wearing a denim shirt with rolled-up sleeves. He begins by standing with his hands clasped together in front of him, exuding a calm and composed demeanor. As the sequence progresses, he raises his hands slightly, maintaining a gentle and expressive gesture. His facial expression shifts subtly, suggesting he is engaged in conversation or explanation. The man's movements are smooth and deliberate, as he occasionally opens his hands and gestures outward, indicating emphasis or clarification. Throughout the sequence, his posture remains upright and attentive, conveying a sense of engagement and focus.

Reference Caption: A man with short black hair is standing on the left, wearing a black jacket, as if reporting news

GPT Evaluation on the Hallucination Detection (HD) Dimension: Hallucination Detection: 1. Explanation: The predicted answer includes several imaginative elements, such as gestures and expressions, that are not mentioned in the correct answer, indicating hallucinations in the description.

Figure A.3: The pitfall of using reference captions for caption evaluation. Evaluation benchmarks based on reference captions may incorrectly treat correct details in the predicted caption as hallucination. Since the GPT evaluator relies solely on the ground truth caption without viewing the video, it mistakenly flags gestures and expressions as hallucinations, resulting in a low score. However, the evaluation is invalid since the predicted details are correct.

in manually annotated keywords (e.g., class names, part names, entities) in the datasets for the regions we are annotating in our data pipeline to further reduce the ambiguity and potential confusion for our VLM in the data pipeline.

C.2. The Pitfall of Using Reference Captions in Benchmarks

As discussed in Sec. 5 and 6.1, caveats exist for using a "ground truth" reference caption for benchmarking localized descriptions. Specifically, since such a reference caption is hardly comprehensive and may not contain all the details about the region of interest, the metrics from the benchmark will treat correct details in the predicted caption that are not mentioned in the ground truth reference caption as hallucinations. This discourages the model from generating detailed captions.

We analyzed the performance of our method on the HD (hallucination detection) sub-task in VideoRefer-Bench [96] and found that our model often predicts correct details that are not present in the reference caption. The example in Fig. A.3 shows this phenomenon. While our model's prediction includes appearance and motion details about the change of the person's gesture and expression, such details are not mentioned in the reference caption in the dataset. Since the GPT evaluator does not see the video and uses the ground truth caption as the only source of information, it incorrectly believes that the gestures and expressions are hallucinations and gives our caption a very low score on the hallucination detection dimension. However, the evaluation is not valid, as our model is correct in its descriptions of the gestures and expressions.

This indicates that the lower score on this sub-task is not due to hallucination by our model, but rather due to the missing details in the reference caption and the fact that our model, evaluated in a zero-shot setting, is not aware of what types of details are preferred by or included in the reference caption.

C.3. Failure Cases

We show two failure cases of DAM in Fig. A.4. In Fig. A.4(a), DAM misrecognizes the frog-shaped slipper as a frog. In Fig. A.4(b), DAM describes the person as pulling the body upward. We expect these errors to be mitigated by broader data coverage.

C.4. Potential Limitations

DAM is only trained for multi-granular localized captioning, especially detailed localized captioning (DLC), and is not specifically optimized for other general vision-language tasks. However, DAM is designed for in-depth analysis of multi-granular image and video localized description rather than for breadth in general vision-language understanding,


DAM (Ours): A bright green frog with a smooth, glossy texture. The frog has a rounded head with prominent, slightly raised eyes. Its body is streamlined, tapering towards the tail, and the limbs are short and sturdy.

(a) DAM might still misrecognize a region and output an incorrect description. For example, it misrecognizes the frog-shaped slipper to be a frog.

Input Video

DAM (Ours): A person wearing a sleeveless white top and blue shorts is engaged in a dynamic exercise routine. The individual is equipped with a pair of red and white athletic shoes, which provide stability and support during the activity. The person is holding onto a bar or handle with both hands, suggesting a pull-up or similar upper-body exercise. The posture is upright, with the back slightly arched, indicating the exertion of muscles during the movement. The legs are bent at the knees, with feet firmly planted on the ground, providing a stable base. Throughout the sequence, the person maintains a consistent rhythm, pulling the body upward and then lowering it back down in a controlled manner. This repetitive motion showcases the individual's strength and endurance, as they continue the exercise with focus and determination.

(b) DAM might also be confused by the combination of the object motion and the camera motion. In this example, it makes the mistake of describing the person as pulling the body upward.

Figure A.4: Failure cases for our proposed DAM.

which justifies the design choice.

C.5. Computational Efficiency

DAM incorporates our proposed localized vision encoder, which differs from the SigLIP [97] vision encoder used in [44] by adding two key components: patch embedding layers for encoding the mask and cross-attention blocks. Importantly, these components do not alter the dimensions or sequence length of the vision features passed to the large language model, ensuring that the parameter count and computational efficiency of the large language model are unaffected. Since the vision encoder represents only a small fraction of the total parameters and computational operations, the overall increase in FLOPs and parameter count remains marginal, maintaining the model's efficiency.

To be more specific, unlike prior works that derive regional features from image features for each region, the regional feature used in our approach comes directly from a global and a focal view of the input image, with cross-attention enhancing the focal representation. This design is justified as the vision encoder is much smaller than the LLM (400M vs. 3B/8B parameters), with minimal latency impact (0.06s compared to 1.49s for the 3B LLM as measured in our pipeline). This overhead is outweighed by the benefits of preserving fine details that global image features miss (as indicated in Tab. 8), especially for small regions. Finally, DAM 3B outperforms much larger models on the challenging DLC-Bench (Tab. 5), showing our efficiency.
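To illustrate the design described above, here is a minimal, non-authoritative PyTorch sketch of how a mask patch embedding and a cross-attention block can be wrapped around a vision encoder without changing the token count handed to the LLM. Module names, dimensions, and the single-block structure are illustrative assumptions, not the exact DAM implementation.

import torch
import torch.nn as nn

class ToyPatchEncoder(nn.Module):
    """Stand-in for a SigLIP-like trunk: image -> (B, N, D) tokens."""
    def __init__(self, dim=1152, patch=14):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)

class LocalizedVisionEncoderSketch(nn.Module):
    """Global view + focal view; focal tokens cross-attend to global tokens.

    The output keeps the same (batch, num_tokens, dim) shape as a plain
    vision encoder, so the projector and LLM side stay untouched.
    """
    def __init__(self, vision_encoder, dim=1152, patch=14):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.mask_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)

    def forward(self, global_img, focal_img, focal_mask):
        # global_img, focal_img: (B, 3, H, W); focal_mask: (B, 1, H, W)
        g = self.vision_encoder(global_img)                         # (B, N, D)
        f = self.vision_encoder(focal_img)                          # (B, N, D)
        m = self.mask_embed(focal_mask).flatten(2).transpose(1, 2)  # (B, N, D)
        f = f + m                                   # inject the mask prompt
        f_ctx, _ = self.cross_attn(query=f, key=g, value=g)
        return f + f_ctx                            # still (B, N, D)

enc = LocalizedVisionEncoderSketch(ToyPatchEncoder())
out = enc(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224),
          torch.rand(2, 1, 224, 224))
print(out.shape)  # torch.Size([2, 256, 1152])

Because the output keeps the original token layout, the extra cost stays in the much smaller vision tower, which is consistent with the latency figures quoted above.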
C.6. Training Data

In addition to the details on data annotation presented in App. H.1, we discuss the training data of our work in this section and present a comparison with recent works. Compared with the recent works Ferret [93], which used 1.1M unreleased samples, and RegionGPT [28], which used 1.5M unreleased samples,
RegionGPT [28] which used 1.5M unreleased samples,


we train our model on a comparable amount of data (1.5M samples). However, we obtain much better performance (Tab. 5), which shows the effectiveness of DAM.

C.7. Performances of Baseline Models on DLC-Bench

Interestingly, region-specific VLMs often perform on par with or worse than generic VLMs. This is likely because many are trained on datasets with short regional captions, leading them to produce brief, phrase-level descriptions. Even when prompted for longer descriptions [28, 100], these models tend to include irrelevant details about the background, speculations, or hallucinations, due to insufficient regional information. Providing crops instead of full images leads to mixed results for different region-specific VLMs, since these models are not designed to describe regions in crops.

D. Details for DLC-Bench

Image and Instance Selection. We leveraged a subset of the Objects365 v2 [67] validation set, which was manually annotated with segmentation masks in [27], for image and instance selection. We collected a set of 892 challenging questions from this subset, each containing one object of interest. Each question was manually inspected, and questions with ambiguous or unclear answers were filtered out. To maintain the integrity of our benchmark, we conducted de-duplication to ensure that no images used in the benchmark were present in our training dataset for detailed localized captioning.

Positive Question Generation. For each masked region, we prompted an off-the-shelf Visual Language Model (VLM) to generate a list of parts. Subsequently, for the whole object and each part, we asked the VLM to generate a list of properties covering aspects such as color, shape, texture, materials, and size. Each property is stored in the form ([object name], [part name], [property name], [property value]). For example, if the masked region is a corgi, the VLM could describe the brown fur of the corgi as (corgi, fur, color, brown).

We used this list of properties as a starting point for manual curation. We then manually added significant properties that the VLM missed, revised inaccurate properties, and removed hallucinated or ambiguous properties from the VLM outputs. Finally, we turned these properties into questions that test whether a description accurately covers the property.

Negative Question Generation. We targeted mislocalization and hallucination, which are two types of negatives (i.e., cases in which a property or an object should not be included in the descriptions). Specifically, for mislocalization errors, we prompted the VLMs to generate a list of objects in the image that are not in the masked region. We also prompted the VLMs to generate a list of parts that are commonly associated with the object type of the masked region but are not present or visible in the masked object in the image (e.g., the head of a corgi if it is occluded and thus not included in the masked region).

To avoid biasing towards one specific off-the-shelf VLM, we leveraged multiple VLMs for different instances to generate the initial positives and negatives. Specifically, we annotated 34 regions using GPT-4o [54], 35 using Gemini 1.5 Pro [75, 76], and 31 using Anthropic Claude 3.5 Sonnet [72] for the initial property generation. We used the same image prompting method for all VLMs as we did when prompting the VLMs in the first stage of our data pipeline.

Note that the choices for each question are mutually exclusive, which ensures one option is always valid and leaves no room for two options to be true at the same time.
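As a concrete illustration of the curated format, the following is a small, hypothetical sketch of how a property tuple such as (corgi, fur, color, brown) can be expanded into the kind of mutually exclusive multiple-choice options shown in Fig. A.5. The scoring values mirror the examples in the figure, but the helper itself is not part of any released benchmark code.

from dataclasses import dataclass

@dataclass
class Property:
    obj: str        # e.g., "corgi"
    part: str       # e.g., "fur"
    name: str       # e.g., "color"
    value: str      # e.g., "brown"

def positive_question(p: Property):
    """Build mutually exclusive options for one curated property."""
    subject = f"the {p.part} of the {p.obj}" if p.part else f"the {p.obj}"
    return {
        "question": "Which of the following is applicable to the description?",
        "choices": [
            (f"The {p.obj} is not mentioned.", 0),
            (f"The {p.name} of {subject} is mentioned but is not {p.value}.", -1),
            (f"The {p.name} of {subject} is mentioned and is {p.value}.", 1),
            (f"The {p.name} is not mentioned, but {subject} is mentioned.", 0.5),
        ],
    }

def negative_question(obj: str, absent_detail: str):
    """Penalize hallucinated or mislocalized details that should be absent."""
    return {
        "question": "Which of the following is applicable to the description?",
        "choices": [
            (f"The {absent_detail} of the {obj} is mentioned in the description.", -1),
            (f"The {absent_detail} of the {obj} is not mentioned in the description.", 1),
            (f"The {obj} is not mentioned in the description.", 0),
        ],
    }

# Example mirroring Fig. A.5:
q = positive_question(Property("corgi", "fur", "color", "brown"))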
Scoring Mechanism. Our evaluation methodology involves scoring the models based on their ability to include correct details and exclude incorrect or irrelevant information.

To evaluate a model like DAM for its ability to output detailed localized captions, we first prompt the model to generate descriptions for each of the masked instances. Then, instead of directly asking our model to provide answers to these questions, we prompt a text-only LLM, Llama 3.1 8B [25], to serve as a judge that rates the localized descriptions according to the positive and negative questions.

For each model-generated description, we apply the following scoring rules:


(a) An example region in the benchmark for the model to describe. Task: describe this region.

(b) An example description to be evaluated: A white electric stove with a smooth, flat cooktop featuring two visible black burners. The control panel is located at the back, with a black surface and a few visible knobs. The stove has a slightly raised backguard.

(c) An example positive question in DLC-Bench:
Q: Which of the following is applicable to the description?
Choices:
A. The control panel or the stove is not mentioned. (score: 0)
B. The location of the control panel is mentioned in the description but is not back. (score: -1)
C. The location of the control panel is mentioned in the description and is back. (score: 1)
D. The location of the control panel is not mentioned, but the control panel of the stove is mentioned. (score: 0.5)

(d) An example negative question in DLC-Bench:
Q: Which of the following is applicable to the description?
Choices:
A. The induction cooktop surface of the stove is mentioned in the description. (score: -1)
B. The induction cooktop surface of the stove is not mentioned in the description. (score: 1)
C. The stove is not mentioned in the description. (score: 0)

Figure A.5: An example from DLC-Bench for detailed localized captioning. (a) The process begins by
prompting a model to describe a specified region within the image. The resulting description is then evaluated
using a text-only LLM as a judge that rates each response by answering positive and negative questions. (b)
shows an example description to be evaluated. (c) Positive questions are designed to test whether the model
correctly identifies specific details within the described region. The model receives points for accurate details
and is penalized for factual errors. The bold option (option C) indicates that the LLM judge believes that option
C is applicable, allowing the model to get a point for this example positive question. (d) Negative questions
ensure the model refrains from mentioning irrelevant or nonexistent details. Mislocalization or hallucination
results in penalties to prevent false positives. The bold option (option B) indicates that the LLM judge believes
that option B is applicable, allowing the model to get a point for this negative question.

• Positive Scoring: For each positive question, if the description correctly includes the specified detail, the model receives a point. To prevent models from artificially inflating their scores by generating excessively long descriptions and guessing details, we penalize incorrect details and discourage models from including uncertain or erroneous content. If the detail is mentioned but incorrectly (e.g., wrong color), a penalty of one point is applied. No point is awarded if the description does not mention the detail. Partial points (0.5 points) are awarded for answers that are partially correct but insufficiently detailed. Note that the model gets positive points only when the object recognition is correct, as the correctness of the details depends on the correctness of the overall region recognition. We present a positive example in Fig. A.5(c).
• Negative Scoring: For each negative question, if the description appropriately excludes the incorrect or irrelevant detail, the model gets a point. If the description includes the detail, indicating mislocalization or hallucination, a penalty is applied. The model gets zero or negative points when the object recognition is incorrect, since otherwise a caption that is random and completely off could get high scores on the negative questions. We present a negative example in Fig. A.5(d).

The positive (negative) score for a model is the sum of points for positive (negative) questions, normalized by the maximum possible score to yield a percentage for comparison. We also average the positive and negative scores to obtain an overall score, which represents the model's overall capability in detailed localized captioning.

We present an example from DLC-Bench in Fig. A.5. The example region in Fig. A.5(a) features a stove with coil burners. An example description of the region is presented in Fig. A.5(b). For the example positive question in Fig. A.5(c), the LLM judge selects option C, as the caption correctly mentions that the control panel is at the back, allowing the model to get a point for this positive question. For the negative question in Fig. A.5(d), the LLM judge selects option B, as the caption correctly indicates that it is not an induction cooktop, allowing the model to get a point for this negative question.
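The aggregation described above can be summarized by a short, illustrative helper. The per-option point values follow Fig. A.5, while the data structures and function names are our own and do not correspond to released benchmark code.

def aggregate_dlc_bench(judged_positives, judged_negatives):
    """Turn per-question judge outcomes into Pos(%), Neg(%), and Avg(%).

    judged_positives / judged_negatives: lists of floats, where each entry
    is the point value of the option the LLM judge selected for one
    question (e.g., 1, 0.5, 0, or -1 for positives; 1, 0, or -1 for
    negatives). We assume a maximum of 1 point per question.
    """
    def normalize(points):
        if not points:
            return 0.0
        return 100.0 * sum(points) / len(points)  # percentage of the maximum

    pos = normalize(judged_positives)
    neg = normalize(judged_negatives)
    return pos, neg, (pos + neg) / 2.0  # overall = average of the two

# Example: penalties can drive a sub-score below zero, as seen for some
# baselines in Tab. A.4.
print(aggregate_dlc_bench([1, 0.5, -1, 0], [1, 1, -1]))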


Evaluation Setting. For our models, we follow our inference setting described in App. H.

E. Additional Ablation Studies

Model Architecture with the Same Training Data. A model's performance is largely due to two factors: model architecture design and training data. Since both factors differ for different models, it is hard to compare the effectiveness of different model architectures head-to-head.

To this end, we compare our model architecture against VP-SPHINX [45], the strongest prior baseline in most benchmarks that we tested on, by continuously training a VP-SPHINX model [45] on our data after pre-training on the originally proposed datasets. This is a fair comparison since our method is also fine-tuned from a pretrained VLM, VILA-1.5 [44], with two stages of training prior to training on our region-specific dataset.

As shown in Tab. A.1, our model architecture achieves much better performance on the detailed localized captioning benchmark DLC-Bench when trained on the same dataset from our proposed data pipeline. This justifies that our proposed focal prompt and localized visual backbone are able to provide more detailed features compared to just the global image features extracted by a vision encoder on the full image with a regional referring feature, as employed in [45].

Prompt Augmentation. We compared variants of our model with and without prompt augmentation. As shown in Tab. A.2, incorporating prompt augmentation slightly degrades our model's performance on the positive questions in our benchmark. We hypothesize that despite introducing variations in the prompts and enhancing the model's instruction-following capabilities, prompt augmentation creates a mismatch between the prompts used during training and those used during evaluation (as we always use the same prompt for evaluation, which is detailed in App. H.3). Since the prompt used during evaluation might not be the same as the prompts used in training, the model may also occasionally reference other tasks from our mixed-in dataset ShareGPT-4V for the length of its outputs. This may cause the model to produce outputs that are not as detailed as when it is trained exclusively with the original prompt. Importantly, the model's performance on the negative questions remains unchanged, indicating that prompt augmentation does not lead to hallucinations or mislocalization.

Despite the slight degradation of performance on the benchmark (0.6% in overall accuracy), we observed that prompt augmentation improves instruction-following capabilities when prompts include additional instructions, particularly those specifying requirements on the length of the outputs. Therefore, we default to using the model without prompt augmentation in our benchmark evaluations, including ablations. In contrast, we employ the model with prompt augmentation in the qualitative evaluations.

Image-only Training vs. Image+Video Joint Training. We also compare our image-only DAM with DAM trained jointly on images and videos in Tab. A.3. Our model with image-video joint training slightly outperforms the image-only model on detailed localized image captioning. Note that for this ablation study, we keep the model size the same and use the 3B model for both image-only training and image-video joint training. We use image-only training as the default option for results in our benchmark for simplicity.

F. Additional Quantitative Results

Set-of-Marks Prompting. We present a comparison with baseline VLMs that use Set-of-Marks (SoM) prompting [91] in Tab. A.4. SoM leads to degraded results compared to the prompt engineering method used in stage one of our data annotation pipeline. This is mostly because the marks proposed by SoM blend in with the object or the background in complex scenes. They might also mask out some part of the object, which interferes with the model's understanding capabilities. Therefore, for fair comparisons, we use the same prompt engineering method as in stage one of our data annotation pipeline for our main results in Tab. 5. Importantly, region-specific VLMs, including DAM, have predefined ways of encoding regional inputs, making SoM inapplicable to these models.


VP-SPHINX Arch Our Arch


Avg (%) 50.2 63.8

Table A.1: Ablations on architecture design compared to our strongest baseline VP-SPHINX [45]. We
trained a model with VP-SPHINX [45] architecture on our curated DLC data from various segmentation datasets.
The results on DLC-Bench indicate the advantages of our model architecture that allows detailed localized
features to be presented to the LLM for DLC.

Prompt Augmentation Pos (%) Neg (%) Avg (%)


No 52.3 82.2 67.3
Yes 51.3 82.2 66.7

Table A.2: Comparison of performance of DAM with and without prompt augmentation. Prompt augmen-
tation has minimal effect on DAM’s performance on DLC-Bench. While descriptions generated by the model
may occasionally be less detailed, leading to a slight decrease in the performance on positive questions, we
observed that prompt augmentation enhances instruction following when prompts include specific guidelines,
such as length constraints. By default, we use the model without prompt augmentation for our benchmark results, including ablations.

Setting Pos (%) Neg (%) Avg (%)


Image-only Training 52.3 82.2 67.3
Image+Video Joint Training 52.4 85.4 68.9

Table A.3: Comparison of our image-only DAM and DAM trained jointly on the localized image description and localized video description tasks. Joint training yields higher-quality localized image descriptions than image-only training.

G. Additional Qualitative Results

G.1. Detailed Localized Image Captioning

In Fig. A.7, we present additional examples from LVIS [29] to show our model's strong performance on detailed localized image captioning.

Our model demonstrates robust localization and region understanding capabilities. In the first example, it accurately describes the sofa cushion without mentioning the dog that is outside the masked region. In the second example, it correctly identifies the roller blind, which would be challenging to recognize based solely on a local crop without context. In the third example, the model provides a detailed description of the giraffe without referencing the birds perched on it, as they fall outside the masked region. These examples highlight our model's precise localization abilities and its effectiveness in perceiving regional details with contextual understanding.

G.2. Zero-shot QA Capabilities

Although not trained on any regional QA datasets, DAM surprisingly exhibits emerging zero-shot capabilities on regional QA.

In Fig. A.6, we show examples of our model performing zero-shot QA. DAM is able to identify properties of objects in the masked regions. For example, it is able to identify the color of the clothing, the material of the stick, and the textural pattern of the fish in the first three examples. DAM is also capable of performing object recognition for a region in the image, identifying the strawberry in the last image.

G.3. Detailed Localized Video Captioning

We present more examples for detailed localized video captioning in Fig. A.8 and Fig. A.9. Our model can describe objects in videos with large object motion and camera motion. DAM can also identify stationary objects by indicating that they are stationary in the description.


Method #Params Pos (%) Neg (%) Avg (%)

API-only General VLMs:
GPT-4o (SoM) [54] - 5.0 29.2 17.1
o1 (SoM) [55]† - 0.8 28.0 14.4
Claude 3.7 Sonnet (SoM) [73]† - 0.5 40.2 20.4
Gemini 2.5 Pro (SoM) [74, 75]† - 13.2 65.0 39.1

Open-source General VLMs:
Llama-3.2 Vision (SoM) [25] 11B 16.8 40.4 28.6
Llama-3 VILA1.5 (SoM) [44] 8B 0.6 0.6 0.6
InternVL2.5 (SoM) [20, 21, 84] 8B 8.6 28.6 18.6
LLaVA v1.6 (SoM) [46–48] 7B 2.2 3.8 3.0
Qwen2.5-VL (SoM) [77, 81] 7B 8.5 27.2 17.8
VILA1.5 (SoM) [44] 3B -0.4 15.4 7.5

DAM (Ours) 3B 52.3 82.2 67.3

Table A.4: Additional results with existing general VLMs using Set-of-Mark (SoM) prompting [91]. The results are accuracies on detailed localized captioning in DLC-Bench. Compared with the results in Tab. 5, which are obtained with the same prompt engineering as we use in stage 1 of our data pipeline, SoM leads to degraded quality. In this comparison, the advantages of our method over prior baselines are much larger. Negative numbers are due to penalties from factual errors. Note that region-specific VLMs, including our proposed DAM, have predefined ways of inputting regions, and thus SoM prompting is not applicable to these models. †: models with thinking mode.

Figure A.6: Emerging zero-shot QA capabilities. DAM could answer questions about regions in an image, showcasing capabilities such as object recognition and property identification. Examples shown in the figure:
Q: What is the main color of the clothing in the masked area? A: The main color of the clothing in the image is blue. ✔
Q: What is the material of the object in the masked region in the image? A: The object in the image is made of wood. ✔
Q: How would you describe patterns of the object in the masked region? Answer with one word. A: Striped ✔
Q: What is the object in the masked region? Answer with one word. A: Strawberry ✔

G.4. Qualitative Comparisons with Strong Baselines

Detailed Localized Image Captioning. We also present qualitative comparisons with GPT-4o [54] and our strongest open-weight baseline VP-SPHINX [45] on detailed localized image captioning in Fig. A.10.

In both examples, GPT-4o could not correctly recognize the objects in the masked regions, providing only vague descriptions. VP-SPHINX, while better than GPT-4o, still struggles with accurate object recognition and detailed descriptions. In the left image, VP-SPHINX incorrectly describes a group of seals when the masked region contains only one seal. In the right image, VP-SPHINX identifies the towel but provides minimal detail, missing key attributes like its color and texture.

Our model outputs detailed and high-quality descriptions of the seal and the towel. This improvement stems from our model's design, which enables the fusion of object-specific information with broader contextual understanding.

Detailed Localized Video Captioning. We present comparisons with three strong video understanding models, GPT-4o [54], Qwen2.5-VL [77], and the recent work VideoRefer [96], on detailed localized video captioning in Fig. A.11. In the top example, we observed that GPT-4o struggles to interpret the cow's movements. Similarly, Qwen2.5-VL-7B incorrectly perceives the cow as stationary. VideoRefer-7B provides minimal motion and appearance details. In contrast, our 8B model accurately identifies the motion of the cow, providing more detailed information about it.

In the bottom example, GPT-4o misidentifies the object, mistakenly assuming the animal is transforming into a wolf or a pig. Meanwhile, Qwen2.5-VL-7B believes only the sheep's head is moving. VideoRefer-7B recognizes that the sheep is moving but provides little detail about the appearance of the sheep. In contrast, our model correctly identifies the animal in the


masked region as a sheep throughout the video and accurately recognizes its full movement, providing details about its motion and appearance.

Figure A.7 examples. Images with regions of interest and DAM's detailed localized descriptions:
1. A brown leather cushion with a smooth texture and a slightly curved upper edge, exhibiting a sheen indicative of a soft, possibly plush surface.
2. A navy blue roller blind with a smooth texture and a horizontal seam dividing it into two sections.
3. An adult giraffe with a pattern of large, irregular brown patches separated by cream-colored lines on its neck. The giraffe has two ossicones on top of its head, dark eyes with long eyelashes, and a pair of small ears. Its skin appears smooth, and the giraffe's expression is calm.

Figure A.7: Additional results from LVIS [29] demonstrating DAM's detailed localized image captioning capabilities. Our model exhibits robust region understanding and localization across diverse scenarios. It produces precise descriptions of objects within masked regions while successfully identifying challenging details like the roller blind in the second example through effective use of contextual cues.

H. Implementation Details

H.1. Data Annotation Pipeline

Stage 1. We annotate four existing instance and semantic segmentation datasets for detailed localized descriptions. We use off-the-shelf VLMs for region annotations, with 603k regions across 202k images with detailed localized descriptions in total in stage 1. For the model variant used in the PACO [60] open-class dataset evaluation, we additionally merged in 81k annotated instances from PACO [60] to improve its part description capabilities, leading to 684k annotated regions, as detailed in Tab. A.5. To prompt a VLM to output detailed localized descriptions, we input a cropped image and a masked image. While the cropped image allows coarse localization and provides high token density per pixel for clear descriptions, the masked image helps localize the object of interest when there are multiple instances of the same category. The category name is also provided in the text prompt, relieving the model from having to identify the object without the context from the image. We present the prompt for data annotation in Tab. A.7.

Dataset # Images # Regions
Stage 1:
LVIS [29] 90,613 373,551
Mapillary Vistas v2.0 [53] 17,762 100,538
COCO Stuff [11] 28,365 32,474
OpenImages v7 [33, 35] 64,874 96,006
PACO [60] 24,599 81,325
Stage 2:
SA-1B (10%) 592,822 774,309
Total 819,035 1,458,203

Table A.5: Dataset statistics across stages with total images and regions for training detailed localized image captioning. In stage 1, we annotated 684k regions across 226k images from existing instance and semantic segmentation datasets. In stage 2, we perform SSL on 10% of SA-1B images without using the masks provided by the dataset, resulting in 774k regions across 593k images. In total, we annotated 1.46M regions across 819k images with detailed localized descriptions. This diverse and high-quality dataset is the key to our model's performance. Note that due to filtering, the numbers of instances and images are lower than in the original datasets.

Dataset # Videos # Regions
SA-V [62] 36,922 93,969

Table A.6: Dataset statistics with total videos and regions for training detailed localized video captioning. We label 94k regions across 37k videos from the SA-V dataset [62] for detailed localized video captioning. Note that each region indicates an instance across multiple frames in the video.

Stage 2. We annotate 10% of SA-1B through self-


Input
Video

A young child, dressed in a light-colored long-sleeve shirt and beige pants, is captured in a sequence of dynamic movements. The
child appears to be in mid-motion, possibly dancing or playing, with one arm raised and the other slightly bent. The child's posture
suggests a lively and energetic demeanor, with a slight forward lean indicating movement. The legs are positioned in a way that
suggests a step or a jump, with one foot slightly lifted off the ground. The child's head is tilted slightly upward, adding to the sense of
motion and engagement. Throughout the sequence, the child maintains a consistent posture, with the arms and legs moving in
coordination, conveying a sense of playful activity.
(a) DAM accurately describes fine-grained details for the masked region.

Input
Video

A cat with a predominantly white coat and distinct orange patches is seen in a series of fluid movements. Initially, the cat is crouched
low, its body compact and poised, suggesting a state of alertness or curiosity. As the sequence progresses, the cat begins to rise, its
body elongating and stretching, indicating a transition from a crouched position to a more upright stance. The cat's ears are perked,
and its eyes are focused, suggesting attentiveness to its surroundings. The tail, initially curled, extends outwards, providing balance
and a sense of readiness. The cat's movements are smooth and deliberate, showcasing its agility and grace. As it continues to rise, the
cat's posture becomes more upright, and its head turns slightly, as if scanning the environment or preparing to move in a new
direction. The sequence captures the essence of a cat's natural curiosity and fluid motion, highlighting its ability to navigate its
environment with ease and elegance.
(b) DAM is still able to output high-quality descriptions in scenes with fast camera and object motion.

Input
Video

A polished brown leather shoe, characterized by its sleek, glossy finish and classic design, moves gracefully through a sequence of
steps. The shoe's pointed toe and smooth surface reflect light subtly, emphasizing its refined craftsmanship. As it progresses, the shoe
maintains a steady, rhythmic motion, suggesting a confident stride. The heel, slightly elevated, provides a sense of balance and
stability, while the sole appears sturdy, designed for comfort and support. Throughout the sequence, the shoe's movement is fluid and
purposeful, indicative of a person walking with intent and poise. The shoe's interaction with the ground is seamless, showcasing its
flexibility and durability. The overall impression is one of elegance and sophistication, with the shoe serving as both a functional and
stylish accessory.
(c) DAM is able to accurately keep track of small objects across the dynamic scene.

Input
Video

A stone sculpture of a standing figure, draped in a flowing robe, is depicted in a serene and poised stance. The figure's right arm is
bent at the elbow, with the hand gently resting on the chest, while the left arm is slightly extended outward, suggesting a gesture of
contemplation or prayer. The head is adorned with a simple, rounded headpiece, and the facial features are finely detailed, conveying
a sense of calm and introspection. The robe, intricately carved, wraps around the body, creating soft folds and drapes that add to the
sculpture's lifelike quality. The figure stands firmly on a base, exuding a sense of stability and grace. Throughout the sequence, the
sculpture remains stationary, embodying a timeless and tranquil presence.
(d) DAM correctly identifies stationary objects in the video, offering detailed descriptions for the appearance of the specified object.

Figure A.8: Additional results from DAM on detailed localized video captioning (Part 1). Our model is
able to accurately describe small objects in complex scenes that involve large object motion and camera motion.
Our model also correctly identifies stationary objects captured by a non-stationary camera by saying they are
stationary. Videos visualized in this figure are from SA-V [62] dataset.

labeling, resulting in 774k annotations across 593k images, as detailed in Tab. A.5. Due to filtering, the final number of instances and images is lower than the original 10% subset of SA-1B. We do not use the


Input
Video

An elephant, characterized by its large, rounded ears and long, curved tusks, moves gracefully through the sequence. Its skin appears
rough and textured, with a grayish-brown hue that suggests a natural, earthy tone. The elephant's trunk, a versatile and muscular
appendage, is prominently visible, swaying gently as it walks. Its eyes are small and partially obscured by the folds of its skin, giving it
a calm and serene expression. The elephant's legs are sturdy and columnar, supporting its massive body with ease. As it progresses
through the sequence, the elephant maintains a steady and deliberate pace, its movements fluid and unhurried. The tusks, slightly
curved and pointed, catch the light subtly, adding a touch of elegance to its robust form. Overall, the elephant exudes a sense of
majesty and tranquility, moving with a quiet confidence that is both powerful and graceful.

(a) DAM accurately describes fine-grained details for the masked region.

Input
Video

A white SUV is prominently featured, showcasing a sleek and modern design. The vehicle's body is smooth and aerodynamic, with a
slightly elevated stance typical of SUVs. The front of the car is characterized by a distinctive grille and sharp, angular headlights that
give it a bold and assertive look. As the sequence progresses, the SUV moves steadily forward, maintaining a consistent trajectory. The
wheels, with their intricate alloy design, rotate smoothly, indicating a well-maintained vehicle. The side mirrors are positioned to
provide optimal visibility, and the windows are tinted, adding a touch of sophistication. The SUV's movement is fluid, suggesting a
well-engineered suspension system that handles the road with ease. Throughout the sequence, the car remains the focal point,
exuding a sense of reliability and style.

(b) DAM is able to handle objects that are heavily occluded in some frames.

Input
Video

A person wearing a blue jacket with white text on the back is captured in a sequence of dynamic movements. Initially, the individual is
seen in a slightly bent posture, suggesting a moment of balance or readiness. As the sequence progresses, the person appears to be
in motion, possibly walking or running, with their arms extended forward, indicating a sense of urgency or direction. The jacket, with
its distinct color and text, remains a focal point throughout the sequence, emphasizing the person's presence and movement. The
person's head is slightly tilted, and their posture suggests a forward-leaning stance, consistent with active movement. The sequence
concludes with the person seemingly in mid-stride, capturing a sense of continuous motion and purpose.

(c) DAM is able to output detailed descriptions for objects despite occlusion due to object motion and human articulation.

Input
Video

A cat with a sleek, short-haired coat of light gray and white fur is prominently featured. Its ears are perked up, and its large, expressive
eyes are wide open, giving it an alert and curious demeanor. The cat's body is slightly turned, suggesting a poised and attentive
stance. As the sequence progresses, the cat's head tilts slightly upward, maintaining its focused gaze. Its whiskers are long and
prominent, adding to its inquisitive expression. The cat's posture remains mostly upright, with subtle shifts in weight that suggest a
readiness to move or react. Throughout the sequence, the cat's overall demeanor is one of calm vigilance, as if it is observing
something of interest with keen attention.
(d) DAM correctly identifies slight motion in the video, showing its capability for detailed motion analysis.

Figure A.9: Additional results from DAM on detailed localized video captioning (Part 2). Our model is
able to accurately describe objects that are partially occluded and is able to perceive and describe slight motion.
Videos visualized in this figure are from MOSE [23] dataset.

masks provided with SA-1B, as they contain a large number of masks for parts. Instead, we employ the open-vocabulary detector OWL-ViT v2 [51, 52] to detect objects in the images, and then use SAM [62] to generate masks for the detected instances. Finally, we use SigLIP [97] to evaluate the image-text similarity,


Input Image (left example)
GPT-4o: A dark brown, elongated, and slightly curved object with a smooth texture and a tapered end. [Did not recognize the object ✗]
VP-SPHINX: A large group of seals is scattered across a sandy beach. The seals appear to be resting, lying on their sides, with some facing the water and others turned toward the land. They exhibit a range of sizes, suggesting both adult and younger seals are present. The beach itself is sandy, and the water's edge is visible. In the background, a few birds can be seen, likely seagulls, and the sky is overcast. [Did not recognize the object ✗]
DAM (Ours): The harbor seal has a sleek, elongated body with a dark brown coloration. Its skin appears smooth and slightly shiny, with a few lighter patches scattered across its back. The seal's body tapers towards the tail, which is not visible in the image. [Recognized the object ✔, abundant details ✔]

Input Image (right example)
GPT-4o: A light gray, elongated, and slightly curved object with a smooth texture and subtle, irregular surface patterns. The object has a narrow, tapering end and a broader, slightly wavy opposite end. [Did not recognize the object ✗]
VP-SPHINX: A towel is hanging off the edge of a white bathtub. The towel is positioned in a way that suggests it is easily accessible for someone taking a bath. The towel appears to be clean and ready for use. [Recognized the object ✔, minimal details ✗]
DAM (Ours): A light gray towel with a soft, plush texture, featuring a subtle, darker gray pattern along the edges. The towel is folded and draped, creating a flowing, slightly wrinkled appearance. [Recognized the object ✔, abundant details ✔]

Figure A.10: Qualitative comparisons demonstrate the superior localized image understanding capabilities
of our model compared to GPT-4o [54] and VP-SPHINX [45], our strongest open-weight baseline. GPT-4o
struggles to recognize objects in masked regions accurately, offering only vague descriptions. In the left image,
VP-SPHINX incorrectly describes a group of seals when the masked region contains only one seal. In the right
image, VP-SPHINX identifies the towel but provides minimal detail, missing key attributes like its color. In
contrast, our model delivers precise, detailed descriptions and captures the seal’s sleek elongated body, dark
brown coloration with lighter patches, and the towel’s light gray color, wrinkled texture, and darker edge
pattern. This superior performance stems from our model’s architecture that effectively fuses object-specific
details with broader contextual understanding.

taking the region as an image.

To ensure data quality, we apply extensive filtering (i.e., rejection sampling) based on confidence scores from OWL-ViT v2, SAM, and the SigLIP image-text similarity. We also ensure that we have at most two instances per image, and for images with two instances, the two instances have to be from different classes. The object category names produced by OWL-ViT v2 are then put into a variant of our Describe Anything Model, which is trained on data from stage 1 and optimized for self-labeling. This variant was trained with the class name included in the prompt with 50% probability, since during self-labeling we have a class name as part of each input. The object category proposals used by OWL-ViT v2 are generated by VILA 1.5 [44].
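Below is a minimal, library-agnostic sketch of the rejection-sampling step described above. No specific OWL-ViT v2 / SAM / SigLIP API is assumed (their scores are taken as plain inputs), and all thresholds are illustrative placeholders rather than the values used in our pipeline.

def filter_region_proposals(detections, det_thresh=0.4, mask_thresh=0.8,
                            sim_thresh=0.25):
    """Rejection sampling over candidate regions for stage-2 self-labeling.

    detections: list of dicts with keys
      'class_name', 'det_score'   (open-vocabulary detector confidence),
      'mask_score'                (mask quality / IoU prediction),
      'sim_score'                 (image-text similarity of the region crop).
    Returns at most two accepted regions per image, from different classes.
    """
    accepted = []
    for d in sorted(detections, key=lambda d: d["det_score"], reverse=True):
        if d["det_score"] < det_thresh:
            continue
        if d["mask_score"] < mask_thresh or d["sim_score"] < sim_thresh:
            continue
        if any(a["class_name"] == d["class_name"] for a in accepted):
            continue  # two instances must come from different classes
        accepted.append(d)
        if len(accepted) == 2:  # keep at most two instances per image
            break
    return accepted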
Detailed localized video captioning. We annotated 94k regions across 37k videos from the SA-V dataset [62] for detailed localized video captioning, as detailed in Tab. A.6. Note that each region, also called a masklet, indicates an instance across multiple frames in the video. In contrast to our use of SA-1B, where we did not use the masks that come with the dataset, we use the high-quality masklets that come with the videos. We found that many masklets cover parts of an instance, which is not necessarily helpful for describing the whole object, a common use case of our model. Therefore, we performed instance segmentation on the videos with ViTDet [41] + Cascade Mask R-CNN [12] trained by EVA-02 [26] and used voting to match the segmentation masks with the masklets. In this way, we filter out most of the masklets that are parts, since they likely do not correspond to instance masks. The matched masklets carry the class name from the matched instance segmentation mask, which is used in the annotation process to obtain a detailed localized caption for each masklet.

H.2. Model Training

We start from off-the-shelf VILA 1.5 [44] models that are publicly available on HuggingFace. For image-only training, we fine-tune the VILA 1.5 3B model. For joint image-video training, we use the VILA 1.5 8B model. We use the SigLIP [97] vision encoder, following VILA 1.5. To prevent catastrophic forgetting and to maintain instruction following capabilities, we mix in ShareGPT-


Input Video (top example)
GPT-4o: A cow is standing in an outdoor area, tethered with a rope. Its body is mostly stationary, with slight movements as it shifts its weight from one leg to another. The cow occasionally flicks its tail, indicating a relaxed state. Its head is lowered, possibly grazing or sniffing the ground. The cow's coat is smooth, with a consistent color pattern. As the sequence progresses, the cow takes small steps forward, adjusting its position slightly. The motion is gradual, with the cow maintaining a calm demeanor throughout. [Incorrect motion understanding ✗]
Qwen2.5-VL: A cow, tethered by a rope, stands in a courtyard surrounded by other cows and a blue vehicle. The cow's body is predominantly light brown with darker patches, and it has a white patch on its face. Its tail is slightly raised, and it appears to be looking around, possibly observing its surroundings or reacting to sounds. The cow remains stationary throughout the video, maintaining a calm demeanor. The background includes a concrete floor scattered with some debris, a wall with a window, and a blue vehicle parked nearby. The overall scene suggests a rural setting where the cows are likely being kept for grazing or other agricultural purposes. [Incorrect motion understanding ✗]
VideoRefer: The cow that was the first to move forward and start walking. [Minimal motion and appearance details ✗]
DAM (Ours): A cow with a rich brown coat and a lighter patch on its rump is depicted in a sequence of movements. Initially, the cow is seen with its head slightly lowered, suggesting a calm demeanor. As the sequence progresses, the cow begins to move forward, its legs extending in a steady, rhythmic gait. The tail, with its tufted end, sways gently with each step, adding a sense of fluidity to its motion. The cow's body remains mostly upright, with its back slightly arched, indicating a relaxed posture. The legs, sturdy and well-defined, carry the cow forward with a sense of purpose. Throughout the sequence, the cow maintains a consistent pace, its movements smooth and unhurried, embodying a serene and composed presence. [Recognizes the motion ✔, abundant details ✔]

Input Video (bottom example)
GPT-4o: An animal is present in the region, initially appearing as a sheep with a woolly coat. It stands on a patch of ground with sparse vegetation. As the sequence progresses, the animal undergoes a series of transformations. Its appearance shifts from a sheep to a creature resembling a wolf, with a sleeker, more streamlined body and a different texture. The transformation continues as the animal takes on the form of a pig, with a rounded body and a pinkish hue. This change is gradual, with the animal's features morphing smoothly from one form to another. Throughout these transformations, the animal remains in a similar position, occasionally adjusting its stance slightly. The sequence concludes with the animal returning to a sheep-like form, completing the cycle of transformations. [Incorrect appearance understanding ✗]
Qwen2.5-VL: A sheep stands in a field, initially facing forward. It then turns its head to the left, looking over its shoulder. The sheep's body remains mostly stationary, but its head movement suggests it might be reacting to something off-camera. The background consists of dry grass and scattered twigs, indicating an outdoor rural setting. [Incorrect motion understanding ✗]
VideoRefer: The sheep in the video is white and appears to be grazing. It is moving around the field, occasionally stopping to nibble on the grass. The sheep's motion is slow and deliberate as it searches for food. [Minimal appearance details ✗]
DAM (Ours): A sheep with a thick, woolly coat of light beige and white fur is seen moving gracefully. Its body is robust and rounded, with a slightly curved back. The sheep's head is small in proportion to its body, with a pair of alert, medium-sized ears that stand upright. As it moves, the sheep's legs are sturdy and well-positioned, supporting its weight as it walks with a steady, deliberate pace. The tail is short and slightly curved, adding to its balanced silhouette. Throughout the sequence, the sheep maintains a consistent gait, suggesting a calm and purposeful demeanor. Its movements are smooth and fluid, indicating a sense of ease and familiarity with its surroundings. [Recognizes the motion ✔, abundant details ✔]

Figure A.11: Our proposed DAM demonstrates superior localized video understanding compared to GPT-4o [54], Qwen2.5-VL [77], and VideoRefer [96]. Top figure: DAM accurately captures the cow's forward movement with comprehensive details, whereas GPT-4o and Qwen2.5-VL mistakenly perceive the cow as stationary. Compared to VideoRefer, DAM provides richer descriptions of both motion and appearance. Bottom figure: DAM correctly recognizes the animal as a sheep and accurately describes its graceful movement, while GPT-4o erroneously identifies it as transforming into other animals, and Qwen2.5-VL incorrectly perceives that only the sheep's head is moving. VideoRefer provides limited appearance details, while DAM offers extensive, accurate descriptions. These cases highlight DAM's precise understanding of motion and appearance throughout video sequences.


You are responsible to write a very descriptive caption to describe the {{ category }} in the provided SEGMENTED image. You may leverage the surrounding context of the SEGMENTED image provided in the CROPPED image.
You must not mention any background in the caption and only describe the {{ category }} in the SEGMENTED image! The caption must ONLY contain sufficient details to reconstruct the same {{ category }} in the SEGMENTED image but nothing else!
Here are some additional rules you need to follow when describing the {{ category }} in the SEGMENTED image:
1. If there are multiple {{ category }} in the CROPPED image, focus on the {{ category }} in the SEGMENTED image.
2. If the {{ category }} in the SEGMENTED image is occluded by other objects, only describe the visible part. DO NOT mention anything that is not directly related to the visible part of {{ category }}, such as "A segment of", which part is invisible, etc. For objects with text written on it, describe the object instead of just outputting the text written on it.
Here is the SEGMENTED image that needs caption:

Table A.7: Our prompt for data annotation in stage 1.

4V [16] with our localized image/video captioning dataset collected with our proposed data pipeline. Following the VILA 1.5 training and inference recipe, we treat videos as 8 images concatenated in the sequence.

We closely follow VILA 1.5's recipe for the supervised fine-tuning stage and train all modules, including the vision backbone, the projector, and the LLM. We fine-tune the model for 1 epoch. For the 3B model, we use a batch size of 2048 with a learning rate of 1e-4 on 8 Nvidia A100 GPUs. For the 8B model, we use a batch size of 2048 with a learning rate of 1e-5 on 32 Nvidia A100 GPUs. Both models take less than a day to train. We use a cosine scheduler with a warmup ratio of 0.03. No weight decay is used. For training our model that takes in a class name for self-labeling, we randomly put the class name in the prompt with 50% probability. For models without prompt augmentation, which is detailed below, we simply use the prompt "Describe the masked region in detail." Following VILA, we always put image tokens in front of the textual tokens. As for the setting of the focal crop, we extend the crop by 1× the width towards the left and right, and 1× the height towards the top and bottom, unless we hit the boundaries of the image, in which case we take the boundaries; i.e., α = 3 and the total area of the crop is enlarged by up to 9×. If either the height or width is less than 48 pixels, we take 48 pixels for that direction to encode more context for very small regions, since the small regions themselves do not have much useful information.
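The focal-crop expansion described above can be written as a small helper. This is an illustrative sketch of the stated rule (expand by one box width/height on each side, clamp to the image, and enforce a 48-pixel minimum extent); it is not the exact code used in DAM.

def expand_focal_crop(box, img_w, img_h, factor=1.0, min_size=48):
    """Expand a region box (x0, y0, x1, y1) into the focal crop.

    Each side is extended by `factor` times the box width/height (so the
    area grows by up to (1 + 2 * factor)^2 = 9x for factor=1), with a
    minimum extent of `min_size` pixels, clamped to the image.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx0, cy0 = x0 - factor * w, y0 - factor * h
    cx1, cy1 = x1 + factor * w, y1 + factor * h

    # Enforce a minimum crop size for very small regions.
    if cx1 - cx0 < min_size:
        cx, half = (cx0 + cx1) / 2, min_size / 2
        cx0, cx1 = cx - half, cx + half
    if cy1 - cy0 < min_size:
        cy, half = (cy0 + cy1) / 2, min_size / 2
        cy0, cy1 = cy - half, cy + half

    # Clamp to image boundaries.
    return (max(0, cx0), max(0, cy0), min(img_w, cx1), min(img_h, cy1))

# Example: a 10x10 box expands to 30x30, then is padded to the 48-pixel minimum.
print(expand_focal_crop((100, 100, 110, 110), 640, 480))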
Prompt Augmentation. We trained a variant of our model with prompt augmentation to enhance generalization capabilities beyond detailed localized captioning, as analyzed in App. G. For these models, during training, we randomly select one of 15 prompts from a predefined set. These prompts may or may not include a {prompt_suffix}. The default prompt suffix is in detail. However, we introduce variability by conditioning the prompt on the number of words or sentences in the target caption.

Specifically, with a 20% probability, we condition the prompt on the number of sentences, using suffixes like in one sentence or in [number of sentences] sentences (e.g., in 2 sentences). If the caption contains only one sentence, we use phrases like in a sentence or in one sentence.

With another 20% probability, we condition the prompt on the number of words in the target caption. For captions with a small word count, we use exact numbers (e.g., in 3 words). For longer captions (up to 200 words), we may round the word count to the nearest ten and use phrases like in about 50 words or in around 50 words. If the caption exceeds 200 words, we use the suffix in more than 200 words.

The list of prompts that include a {prompt_suffix} is as follows:
1. Describe the masked region {prompt_suffix}.
2. Describe the masked area {prompt_suffix}.
3. What can you describe about the masked region {prompt_suffix}?
4. Can you describe the masked region {prompt_suffix}?
5. Provide an explanation of the masked region {prompt_suffix}.
6. Depict the masked area {prompt_suffix}.


7. Portray the masked area {prompt_suffix}.
8. Describe what the masked region looks like {prompt_suffix}.
9. Illustrate the masked region {prompt_suffix}.
10. How would you explain the masked area {prompt_suffix}?
11. What details can you provide about the masked region {prompt_suffix}?
12. What does the masked region entail {prompt_suffix}?
13. How would you illustrate the masked region {prompt_suffix}?
14. How would you depict the masked area {prompt_suffix}?
15. How would you portray the masked area {prompt_suffix}?

Additionally, we have prompts that inherently request detailed descriptions without requiring a suffix:
1. Give a detailed description of the masked region.
2. Provide a thorough description of the masked region.
3. Can you explain the details of the masked area?
4. Give a detailed account of the masked region.
5. Describe the masked area comprehensively.
6. Provide an in-depth description of the masked region.
7. Explain the specifics of the masked area.
8. Can you provide a thorough explanation of the masked region?
9. What are the details of the masked area?
10. Provide a comprehensive description of the masked area.
11. What specific details can you provide about the masked region?
12. Can you give an in-depth account of the masked section?
13. What are the main characteristics of the masked region?
14. Give a thorough description of the masked area's details.
15. Provide detailed information about the masked area.

For prompts without a suffix, we do not condition the generation on the number of words or sentences.

During training, we select prompts based on the prompt_suffix:
• If the prompt_suffix is in detail (the default option), we may choose from either set of prompts.
• If the prompt_suffix specifies word or sentence counts, we select only from prompts that include {prompt_suffix}.

This approach introduces variability in the prompts, encouraging the model to generate responses with the controls from the prompts in mind, thereby enhancing its generalization and instruction-following capabilities.

For prompts without a suffix, we do not condition the generation on the number of words or sentences.

During training, we select prompts based on the prompt_suffix:
• If the prompt_suffix is in detail (the default option), we may choose from either set of prompts.
• If the prompt_suffix specifies word or sentence counts, we select only from prompts that include {prompt_suffix}.
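A minimal sketch of this selection rule is given below; the pool names and function are illustrative, and the two prompt lists above are abbreviated.

import random

# The two prompt pools listed above, abbreviated here for illustration.
SUFFIX_PROMPTS = [
    "Describe the masked region {prompt_suffix}.",
    "Depict the masked area {prompt_suffix}.",
    # ... the remaining templates that contain {prompt_suffix}
]
NO_SUFFIX_PROMPTS = [
    "Give a detailed description of the masked region.",
    "Provide a thorough description of the masked region.",
    # ... the remaining templates that already request detail
]

def sample_training_prompt(prompt_suffix="in detail"):
    if prompt_suffix == "in detail":
        # Default suffix: either pool may be used.
        template = random.choice(SUFFIX_PROMPTS + NO_SUFFIX_PROMPTS)
    else:
        # Word/sentence-count suffixes: only templates with the placeholder.
        template = random.choice(SUFFIX_PROMPTS)
    return template.replace("{prompt_suffix}", prompt_suffix)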

This approach introduces variability in the prompts, encouraging the model to generate responses with controls from the prompts in mind, thereby enhancing its generalization and instruction-following capabilities.

H.3. Inference Setting

Unless otherwise mentioned, our prompt for obtaining detailed localized image descriptions at inference time is the following:

Describe the masked region in detail.

Our prompt for obtaining detailed localized video descriptions at inference time is the following:

Given the video in the form of a sequence of frames above, describe the object in the masked region in the video in detail. Focus on appearance, motion, and actions. If the motion involves multiple stages or steps, break down each stage and describe the movements or changes sequentially. Ensure each phase of motion is described clearly, highlighting transitions between actions.

For Co3Dv2 [64] sequences that we treat as videos, we use the following prompt:

Describe the masked region in the video in detail. The video consists of multiple views of a stationary object. Focus on the appearance of the object without mentioning any motion or actions.
References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 2022. 3, 5
[2] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In ECCV, pages 382–398. Springer, 2016. 2, 3, 8
[3] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv:2308.01390, 2023. 3
[4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv:2309.16609, 2023. 3
[5] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL Workshops, 2005. 2, 3, 8
[6] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8b: A multimodal architecture for ai agents, 2023. 3
[7] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv:1911.09785, 2019. 2, 6
[8] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019. 2, 6
[9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv:2407.07726, 2024. 3
[10] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv:2210.09461, 2022. 4
[11] H Caesar, J Uijlings, and V Ferrari. Coco-stuff: Thing and stuff classes in context. arXiv:1612.03716, 2016. 11
[12] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: High quality object detection and instance segmentation. TPAMI, 2019. 14
[13] David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models. arXiv:2310.12971, 2023. 8, 3
[14] Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv:2406.16866, 2024. 7, 8, 3
[15] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv:2306.15195, 2023. 1, 3, 7, 8
[16] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023. 16
[17] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331, 2024. 8, 9, 3
[18] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325, 2015. 1
[19] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. On scaling up a multilingual vision and language model. In CVPR, 2024. 3
[20] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv:2312.14238, 2023. 3, 8, 9, 10

[21] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv:2412.05271, 2024. 8, 10
[22] Wenliang Dai, Junnan Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023. 3
[23] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. Mose: A new dataset for video object segmentation in complex scenes. In CVPR, 2023. 13
[24] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020. 5
[25] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv:2407.21783, 2024. 3, 8, 6, 10
[26] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 2024. 14
[27] Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, et al. Dataseg: Taming a universal multi-dataset multi-task segmentation model. NeurIPS, 2024. 6
[28] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision language model. In CVPR, 2024. 2, 3, 7, 8, 5, 6
[29] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 7, 2, 9, 11
[30] Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. In CVPR, 2024. 1, 3, 6, 8
[31] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 2, 3
[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023. 4, 1
[33] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017. 11
[34] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 2, 3
[35] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 11
[36] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NeurIPS, 2024. 3
[37] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896. Atlanta, 2013. 2
[38] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 3, 5
[39] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 3
[40] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, pages 10955–10965, 2022. 1

[41] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 14
[42] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, pages 22511–22521, 2023. 1
[43] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 2, 3, 8
[44] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024. 3, 6, 8, 5, 10, 14
[45] Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable MLLMs to comprehend what you want. In ICLR, 2025. 1, 2, 3, 7, 8, 9, 10, 14
[46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 3, 8, 1, 10
[47] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[48] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 3, 7, 8, 1, 10
[49] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024. 1
[50] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016. 2, 3
[51] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, 2022. 13
[52] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. NeurIPS, 2024. 13
[53] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017. 11
[54] OpenAI. Gpt-4o system card, 2024. 1, 2, 6, 8, 9, 3, 10, 14, 15
[55] OpenAI. Learning to reason with llms, 2024. 2, 6, 8, 10
[56] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002. 2, 3, 8
[57] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023. 3
[58] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015. 7, 8, 3
[59] Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. NeurIPS, 37:114321–114347, 2024. 3, 8, 9
[60] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. In CVPR, pages 7141–7151, 2023. 7, 2, 11
[61] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, 2024. 3, 7
[62] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv:2408.00714, 2024. 4, 6, 1, 11, 12, 13, 14

[63] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv:1908.10084, 2019. 3
[64] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021. 9, 17
[65] Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, and Xiaojun Chang. Beyond fixation: Dynamic window visual transformer. In CVPR, 2022. 4
[66] Tomer Ronen, Omer Levy, and Avram Golbert. Vision transformers with mixed-resolution tokenization. In CVPR, 2023. 4
[67] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 3, 6
[68] Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, and Fatih Porikli. Tosa: Token selective attention for efficient vision transformers. arXiv:2406.08816, 2024. 4
[69] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. NeurIPS, 2020. 2, 6
[70] Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In CVPR, 2024. 3
[71] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021. 8, 3
[72] Claude Team. Claude 3.5 sonnet, 2024. 6
[73] Claude Team. Claude 3.7 sonnet and claude code, 2025. 8, 10
[74] Gemini Team. Gemini 2.5: Our most intelligent ai model. 2025. 8, 10
[75] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023. 8, 6, 10
[76] Gemini Team, M Reid, N Savinov, D Teplyashin, Lepikhin Dmitry, T Lillicrap, JB Alayrac, R Soricut, A Lazaridou, O Firat, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024. 6
[77] Qwen Team. Qwen2.5-vl, 2025. 8, 10, 15
[78] A Vaswani. Attention is all you need. NeurIPS, 2017. 5
[79] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015. 2, 3, 8
[80] Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In ECCV, pages 166–185. Springer, 2024. 9
[81] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024. 3, 8, 9, 10
[82] Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, and Shanshan Zhao. Caption anything: Interactive image description with diverse multimodal controls. arXiv:2305.02677, 2023. 1, 3
[83] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. arXiv:2308.01907, 2023. 3
[84] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv:2411.10442, 2024. 8, 10
[85] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv:2212.00280, 2022. 1
[86] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv:2312.14135, 2023. 4

[87] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv:2409.04429, 2024. 3
[88] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. NeurIPS, 2020. 2
[89] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In CVPR, pages 10687–10698, 2020. 6
[90] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv:2107.00641, 2021. 4
[91] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv:2310.11441, 2023. 3, 1, 8, 10
[92] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv:2309.17421, 2023. 3, 1
[93] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024. 1, 2, 3, 7, 8, 9, 5
[94] En Yu, Liang Zhao, Yana Wei, Jinrong Yang, Dongming Wu, Lingyu Kong, Haoran Wei, Tiancai Wang, Zheng Ge, Xiangyu Zhang, et al. Merlin: Empowering multimodal llms with foresight minds. In ECCV, pages 425–443. Springer, 2024. 3, 8
[95] Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. In CVPR, 2024. 1, 2, 3, 7, 8, 9
[96] Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. arXiv:2501.00599, 2024. 2, 3, 8, 9, 4, 10, 15
[97] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023. 5, 13, 14
[98] Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, and Yinfei Yang. Ferret-v2: An improved baseline for referring and grounding with large language models. In COLM, 2024. 3
[99] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv:2307.03601, 2023. 1, 3, 7, 8
[100] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. arXiv:2406.19389, 2024. 1, 3, 8, 6
[101] Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Qixiang Ye, and Fang Wan. Controlcap: Controllable region-level captioning. pages 21–38, 2024. 1, 3, 8
[102] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022. 1
[103] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 3
