Describe Anything: Detailed Localized Image and Video Captioning
Long Lian1,2 Yifan Ding1 Yunhao Ge1 Sifei Liu1 Hanzi Mao1 Boyi Li1,2 Marco Pavone1
Ming-Yu Liu1 Trevor Darrell2 Adam Yala2,3 Yin Cui1
1 NVIDIA 2 UC Berkeley 3 UCSF
Example output shown in Figure 1: "A white, fluffy dog with a thick coat, pointed ears, and a black nose. The dog has a wide-open mouth with a pink tongue hanging out, and its eyes are dark and alert. The fur is dense and appears soft, with some darker patches on its back."
Figure 1: Describe Anything Model (DAM) generates detailed localized captions for user-specified regions
within images (top) and videos (bottom). DAM accepts various region specifications, including clicks, scribbles,
boxes, and masks. For videos, specifying the region in any frame suffices.
Abstract

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a semi-supervised learning (SSL)-based data pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets a new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

1. Introduction

Image captioning has been a longstanding challenge in computer vision and natural language processing [18], as it involves understanding and describing visual content in natural language. While recent Vision-Language Models (VLMs) have achieved impressive results in image-level captioning, generating detailed and accurate captions for specific regions within an image remains an open problem. This challenge intensifies with videos, where models must additionally capture dynamic visual content, such as human actions, object motions, and human-object interactions. If resolved, it would open new doors for fine-grained grounded image/video understanding [49, 102] and generation [40, 42].

Most existing VLMs (e.g., GPT-4o [54]) lack mechanisms for precise localization. Recent approaches that empower VLMs to take 2D localization cues such as bounding boxes [30, 82, 85, 101] often yield brief phrases rather than detailed descriptions. While there are methods [15, 45, 93, 95, 99, 100] that produce longer captions, they provide minimal detail or inaccurate descriptions, as shown in Fig. 5. This raises the question: What makes detailed localized captioning (DLC) so challenging?

[Figure: Previous methods derive a regional feature from the global image feature before the LLM. Given the full image and a region mask, the baseline output ("A dark green, circular object with a smooth surface and a ...") lacks details; the reason is that region details are already lost during image feature extraction.]
2
Describe Anything: Detailed Localized Image and Video Captioning
Table 1: Advantages of our proposed model DAM, our SSL data pipeline DLC-SDP, and our benchmark DLC-Bench over previous practices.
2. SSL Data Pipeline (DLC-SDP): A semi-supervised data pipeline that leverages high-quality segmentation annotations and unlabeled web images for scalable and diverse data curation.
3. DLC-Bench: A benchmark designed to evaluate DLC without reference captions.

Unlike generalist models, we focus on localized image and video captioning across multiple granularities, achieving SOTA performance on 7 benchmarks across keyword, phrase, and detailed multi-sentence captioning. Advantages of DAM, DLC-SDP, and DLC-Bench over prior practices are presented in Tab. 1. We release our code, models, data, and benchmark at describe-anything.github.io.

2. Related Work

Vision-Language Models (VLMs). VLMs integrate visual and textual inputs for multimodal understanding and are broadly classified into BLIP-style [1, 3, 22, 25, 36, 38, 39] and LLaVA-style [4, 6, 9, 19, 20, 44, 46–48, 81, 87, 103]. However, these models lack precise localization capabilities, limiting their ability to generate regional descriptions.

Localized Image Captioning. While general VLMs generate image-level captions, localized captioning requires fine-grained regional descriptions. SoM [91, 92] augments VLMs with visual markers, but these markers may blend with the background, as discussed in App. A. Region-aware VLMs [15, 28, 30, 45, 57, 61, 70, 82, 83, 95, 99–101] introduce regional referring inputs. Recent efforts such as Merlin, Artemis, and VideoRefer [59, 94, 96] extend region-based captioning to videos. However, these methods still struggle with capturing intricate details in the referring regions, as shown by the examples in Fig. 5. This is because prior models either extract localized features from global image embeddings or simply encode the referring condition as referring tokens, which leads to insufficient regional details for the LLM, especially for small objects. We address this via focal prompting and a localized vision backbone, balancing local detail with global context.

Another limitation is the scarcity of high-quality datasets. Datasets like RefCOCOs [31, 50] and VG [34] provide only short phrases. Recent approaches [28, 45, 61, 93, 95, 96] use bounding-box-based VLM queries, sometimes augmented with global captions, for synthetic generation, which leads to the caveats discussed in App. C.1. We propose an SSL data pipeline that uses human-annotated and unlabeled data for richer regional descriptions.

Benchmarking Localized Captioning. Prior works [28, 30, 59, 61, 93, 95, 98, 100, 101] evaluate localized captioning by computing language-based image captioning metrics [2, 5, 43, 56, 79] between predicted captions and reference captions. However, these metrics focus on textual matching and may not correlate well with the factual correctness or quality of detailed descriptions. Other works [95, 96] use Sentence-BERT [63] and text-only LLMs to score the predictions against reference captions. However, reference captions often lack comprehensive details about the region of interest, which penalizes models for correct details not explicitly mentioned in the reference by treating them as hallucinations, as discussed in App. C.2. Our DLC-Bench resolves this issue by eliminating the need for reference captions.
Vision Models with Focus. Prior works en-

Figure 3: Architecture of the Describe Anything Model (DAM). DAM employs a focal prompt to encode user-specified regions with high token density while preserving context for detailed understanding. Focal cropping is applied to the image and its corresponding mask, retaining surrounding areas for local context. Both the full image and the focal crop are fed into the localized vision backbone, where images and binary masks are embedded in a spatially aligned fashion. Global context from the full image is leveraged to help understand the focal crop through gated cross-attention. The resulting visual features and prompt tokens are fed into a large language model to generate detailed, context-aware descriptions. * indicates initialized to output zeros. [Diagram labels: Full Image I with Mask M; Focal Cropping; Focal Crop I′ with Mask M′; f_G; Cross-Attention* (repeat N times); Detailed Localized Descriptions, e.g., "A red, translucent, dome-shaped light with a white base. …"]

3.1. Task Formulation

The task of detailed localized captioning involves generating comprehensive textual descriptions focused exclusively on specified regions within images or videos. Formally, given N input frames I^(i) ∈ R^(H×W×3) and corresponding binary masks M^(i) ∈ {0, 1}^(H×W) indicating the region of interest in each frame, the objective is to produce a detailed description T of the content within the region through a captioning model:

T = CaptioningModel({I^(i), M^(i)}_{i=1}^{N})    (1)

We focus on using binary masks M^(i) as the localization input, since other forms of localization (e.g., points, scribbles, boxes, or masks on an image or a subset of frames in a video) can be transformed into masks via segmentation models such as SAM [32] and SAM 2 [62]. For simplicity, we first introduce our method for localized image captioning, omitting the frame index i, and later extend it to videos in Sec. 3.3.

3.2. Model Architecture

As shown in Fig. 3, DAM consists of two key components: the focal prompt and the localized vision backbone.

3.2.1. Focal Prompt

To provide a detailed representation of the region of interest within its context, we introduce the focal prompt, which includes both the full image and a focal crop centered around the specified area, along with their corresponding masks.

We first extract the bounding box B of the mask M and expand it by a factor α in both the horizontal and vertical directions to include additional surrounding context:

B′ = ExpandBox(B, α).    (2)

For instance, setting α = 3 results in a region that can be up to 9 times as large as the original bounding box, subject to clipping at the image boundaries. If either the height or width of the expanded box is less than 48 pixels, we enforce a minimum size of 48 pixels in that dimension to ensure sufficient context for very small regions.
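To make the focal prompt construction concrete, the following is a minimal sketch of focal cropping as described above: take the bounding box of the mask, expand it by a factor α around its center, clip it to the image, and enforce a 48-pixel minimum size. The helper names and the exact expansion convention are our assumptions, not the released implementation.

```python
import numpy as np

def mask_bounding_box(mask: np.ndarray):
    """Tight bounding box (x0, y0, x1, y1) of a non-empty binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def expand_box(box, alpha, img_w, img_h, min_size=48):
    """Expand a box by `alpha` around its center, clip to the image,
    and enforce a minimum side length (48 px in the paper's description)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = (x1 - x0) * alpha, (y1 - y0) * alpha
    w, h = max(w, min_size), max(h, min_size)
    nx0, nx1 = int(max(0, cx - w / 2)), int(min(img_w, cx + w / 2))
    ny0, ny1 = int(max(0, cy - h / 2)), int(min(img_h, cy + h / 2))
    return nx0, ny0, nx1, ny1

def focal_prompt(image: np.ndarray, mask: np.ndarray, alpha: float = 3.0):
    """Build the four focal-prompt components: full image I, mask M,
    focal crop I', and the correspondingly cropped mask M'."""
    h, w = mask.shape
    x0, y0, x1, y1 = expand_box(mask_bounding_box(mask), alpha, img_w=w, img_h=h)
    return image, mask, image[y0:y1, x0:x1], mask[y0:y1, x0:x1]
```

With α = 3 the expanded box covers up to 9× the area of the original bounding box, matching the example above.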
3.2.2. Localized Vision Backbone

Effectively processing all four components of the focal prompt with a VLM is non-trivial, as naively concatenating the full image and the focal crop leads to a loss in performance (Tab. 8). We propose the localized vision backbone, which 1) achieves localized understanding by encoding the masks in a spatially aligned manner and 2) integrates global context into the region of interest through gated cross-attention.

Handling Localization Inputs. Similar to how an image is encoded by a linear patch embedding layer in vision transformers (ViTs) [24], we integrate the mask M into its corresponding full image I through another patch embedding layer that takes in 2D inputs with one channel.

Specifically, the full image I and its mask M are processed through patch embedding layers, followed by the global vision encoder f_G(·) to obtain global visual features z. The focal crop I′ and its mask M′ undergo a similar process with the regional vision encoder f_R(·), except that f_R(·) also takes z as context to obtain the final fused visual features z′. Specifically, we have:

x = E_I(I) + E_M(M) + P,    z = f_G(x),    (4)
x′ = E_I(I′) + E_M(M′) + P,    z′ = f_R(x′, z),    (5)

where E_I(·) and E_M(·) are the image and mask patch embedding layers, respectively, x and x′ are the global and focal embedded inputs with information for both the image and the mask, and P denotes the positional encoding.

The newly added mask embedding layer E_M is initialized to output zeros, ensuring that the VLM's initial behavior is unaffected prior to fine-tuning.

In the gated cross-attention adapters of f_R, h^(l) is the output of the l-th self-attention block in f_R, γ^(l) and β^(l) are learnable scaling parameters initialized to zero, and CrossAttn denotes cross-attention with queries from h^(l) and keys and values from the global features z, similar to how cross-attention is employed in encoder-decoder Transformers [78]. h^(l)_Adapter is used in place of h^(l) in the next Transformer block. To reduce the number of parameters, f_R shares self-attention block weights with f_G.

By initializing γ^(l) and β^(l) to zero, we ensure that the initial behavior of the model remains identical to the original VLM prior to fine-tuning. During training, the model learns to leverage the global context to enhance local feature representations, facilitating detailed and contextually accurate descriptions.

Generating Detailed Localized Descriptions. The visual features from both the global and regional vision encoders are combined and fed into the large language model to generate detailed, context-aware descriptions T:

T = LLM(t, z′),    (8)

where t denotes textual prompt tokens.

Notably, the proposed components do not increase the sequence length of the vision tokens, ensuring that DAM remains efficient. By initializing the new modules (the mask embedding E_M and the scaling parameters γ^(l) and β^(l)) to zeros, we preserve the pre-trained capabilities of the VLM prior to fine-tuning, allowing for smooth adaptation of an off-the-shelf VLM without rerunning pre-training. Thanks to this design, our model requires far less training data (∼1.5M samples) than prior works that involve VLM pretraining.
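The following PyTorch-style sketch illustrates the two mechanisms described in this subsection: a zero-initialized mask patch embedding added to the image patch embedding (Eqs. 4–5), and a zero-initialized, gated cross-attention adapter that lets the regional encoder f_R attend to the global features z. The exact adapter equations are not reproduced in this text, so the block structure, the gating form, and the module names below are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskAwarePatchEmbed(nn.Module):
    """Patch-embed the RGB image and its binary mask; the mask branch (E_M) is
    zero-initialized so the VLM behaves as before fine-tuning (Eqs. 4-5)."""
    def __init__(self, dim=1024, patch=14):
        super().__init__()
        self.embed_image = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.embed_mask = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        nn.init.zeros_(self.embed_mask.weight)
        nn.init.zeros_(self.embed_mask.bias)

    def forward(self, image, mask, pos):
        # image: (B, 3, H, W), mask: (B, 1, H, W), pos: positional encoding P
        x = self.embed_image(image) + self.embed_mask(mask)   # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2) + pos             # (B, N, dim)

class GatedCrossAttnAdapter(nn.Module):
    """Gated cross-attention: queries from the focal features h, keys/values from
    the global features z; the gate starts at zero so the adapter is a no-op initially."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(dim))

    def forward(self, h, z):
        out, _ = self.attn(query=h, key=z, value=z)
        return h + self.gamma * out   # used in place of h in the next block
```

In the full model, one such adapter would be interleaved after each self-attention block of f_R (which shares its self-attention weights with f_G), and the fused features z′ are passed to the LLM together with the prompt tokens t, as in Eq. (8).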
4.2. Stage 2: SSL with Unlabeled Data

Since it is not scalable to rely on high-quality manual annotations, the second stage of DLC-SDP employs self-training-based semi-supervised learning on unlabeled web images.

As illustrated in Fig. 4, the evaluation process for a model like DAM has two steps:

1. The model is prompted to generate a detailed description for each specified region in the benchmark.
2. The resulting description is evaluated by a text-only LLM judge that answers positive and negative questions about it.
Table 2: LVIS [29] and PACO [60] open-class keyword-level captioning benchmarks. DAM excels particularly in the challenging PACO benchmark that requires distinguishing between objects and parts.

Table 3: Zero-shot evaluation on the phrase-level dataset Flickr30k Entities [58]. Our model achieves 12.3% average relative improvement against the previous best.

Table 4: Zero-shot evaluation on the detailed captioning dataset Ref-L4 [14]. Our method achieves 33.4% and 13.1% average relative improvement on the short/long language-based captioning metrics, respectively.
DLC-Bench employs two categories of questions for each annotated instance, with examples in Fig. A.5:

• Positive questions focus on specific attributes of object parts that should be present in the description. The model earns a point if the description accurately includes the specified detail; omissions receive no points, while factual errors incur a penalty.
• Negative questions focus on details that should not be present: either attributes typical of similar objects but absent in the target instance, or descriptions irrelevant to the specified region. A point is awarded if the model correctly omits such details; conversely, including them results in a penalty. To avoid awarding high scores to captions that are completely off, points are only awarded if the caption correctly recognizes the object.

Our DLC-Bench comprises a total of 892 manually verified questions covering a wide range of attributes and potential cases of hallucination. Details on the curation process and the scoring mechanism are provided in App. D.

6. Results

DAM excels at localized image and video captioning across multiple granularities, including keyword, phrase, and detailed captions, achieving SOTA on 7 in-domain and zero-shot benchmarks (Tabs. 2 to 7). We explain the details of each benchmark in App. B.

6.1. Quantitative Results

The open-class keyword-level localized captioning task requires the model to output keywords containing the object and part entities that describe the region. Tested on the object-level LVIS [29] and part-level PACO [60] datasets in Tab. 2, our method achieves state-of-the-art results.

On the detailed captioning benchmark Ref-L4 [14] (Tab. 4), our method achieves 33.4% relative improvement on average over the previous best on the short language-based captioning metrics [2, 5, 43, 56, 79], and 13.1% relative improvement on the long language-based captioning metrics [13].

We also benchmark various regional captioning models on our proposed DLC-Bench, which does not suffer from the limitation of requiring reference captions in previous benchmarks. As shown in Tab. 5, our Describe Anything Model (DAM) significantly outperforms existing general and region-specific VLMs, achieving state-of-the-art positive and negative accuracy and demonstrating its ability to produce detailed and accurate descriptions. Remarkably, DAM surpasses GPT-4o [54] and o1 [55], two strong API-only baselines. DAM also surpasses models with thinking mode enabled [55, 73, 74].

Detailed localized video captioning requires the model to output a detailed description for each region in a video. We benchmark this capability on the challenging HC-STVG [71] benchmark and the detailed captioning benchmark proposed by VideoRefer [96].

Table 6: Detailed localized video captioning on HC-STVG [71].

In Tab. 6, our proposed DAM achieves 19.8% relative improvement over the previous best on HC-STVG, including concurrent work VideoRefer [96]. In Tab. 7, on the benchmark proposed by concurrent work VideoRefer [96], our DAM surpasses the previous best in both zero-shot and in-domain settings, where zero-shot indicates not being trained on in-domain datasets derived from Panda-70M [17], from which the benchmark also sources its videos.

Finally, we analyzed the performance of DAM on the HD (hallucination detection) sub-task and found that DAM often predicts correct details not present in the reference caption. This indicates that the lower zero-shot performance on this sub-task is not necessarily due to hallucination by our model but rather due to missing details in the reference caption. We illustrate this further in App. C.2.
Table 7: Detailed localized video captioning on the benchmark proposed by VideoRefer [96].

Method               SC    AD    TD    HD†   Avg.
Zero-shot:
Qwen2-VL-7B [81]     3.30  2.54  2.22  2.12  2.55
InternVL2-26B [20]   4.08  3.35  3.08  2.28  3.20
GPT-4o-mini [54]     3.89  3.18  2.62  2.50  3.05
GPT-4o [54]          4.15  3.31  3.11  2.43  3.25
Osprey-7B [95]       3.30  2.66  2.10  1.58  2.41
Ferret-7B [93]       3.20  2.38  1.97  1.38  2.23
Elysium-7B [80]      2.35  0.30  0.02  3.59  1.57
Artemis-7B [59]      3.42  1.34  1.39  2.90  2.26
DAM-8B (Ours)        4.45  3.30  3.03  2.58  3.34
In-domain*:
VideoRefer-7B [96]   4.44  3.27  3.10  3.04  3.46
DAM-8B (Ours)        4.69  3.61  3.34  3.09  3.68

[Figure: Qualitative comparison on "Task: describe this object" (a cucumber slice).
RegionGPT: "a slice of cucumber that is on the left side of the camera"
ControlCap: "star shaped slice of green cucumber"
Segment and Caption Anything (SCA): "white paper on suitcase"
GLaMM: "the cut up cucumber on the top"
OMG-LLaVA: "The image features a green cucumber slice with a red tomato on top of it. The tomato slice is placed on top of the green cucumber slice, creating a visually appealing and healthy snack. The combination of the green cucumber and tomato on the same slice makes for a delicious and nutritious meal option."
VP-SPHINX: "In the region, there is a rectangular piece of vegetable that resembles a camera. It is a square piece of a vegetable, possibly a cucumber, positioned horizontally. It is located towards the center of the region and is the focal point of the image."
Describe Anything Model (Ours): "A rectangular slice of cucumber with a light green, watery interior. The slice features a central star-shaped pattern with radiating lines ..."]

DAM can also describe objects from multi-view datasets such as Co3Dv2 [64], integrating information from multiple frames to provide coherent descriptions of 3D objects (Fig. 6(b)).

(b) DAM can describe objects in multi-view datasets, like this car from Co3Dv2 [64], by integrating information from multiple frames.

Figure 6: DAM accurately describes user-specified regions in videos and multi-view scenes under challenging conditions. More results are presented in Fig. A.8.

6.3. Ablations

Visual Prompting. We analyze different prompting strategies and find that both localized inputs and contextual information are crucial for accurate descriptions. Using only the full image limits focus on specific regions (48.7%), while local crops improve detail but lose context (60.1%). Simply concatenating both performs poorly (42.4%). Adding cross-attention significantly improves performance (63.2%), and using focal crops further enhances results (65.4%). Our best approach, the focal prompt, integrates focal crops with cross-attention, achieving 67.3% accuracy without increasing the sequence length for the LLM.

[Figure 7: DAM offers multi-granular localized descriptions. Task: describe this object. Prompt: "Provide a brief description of the masked region." → "A modern chair with a curved backrest and light wood legs." Prompt: "Provide a highly detailed description of the masked region." → "A modern chair with a curved backrest and a textured, light brown fabric upholstery. The chair features a smooth, rounded top edge and a slightly tapered back. It is supported by four slender, light-colored wooden legs that are slightly splayed for stability."]
A. Challenges in Generating Detailed Localized Descriptions with Off-the-Shelf VLMs

Although cutting-edge Vision-Language Models (VLMs), such as GPT-4o [54] and LLaVA [46–48], excel at generating global-level image descriptions, producing detailed localized image captions remains an open problem. Specifically, these VLMs only take in RGB images along with text prompts and do not allow users to accurately specify regions of interest.

While users could employ text to localize the object to be described, this approach is often cumbersome and inefficient, requiring precise referring phrases that may still be difficult for the VLM to interpret. This can lead to mislocalization of the intended object, as illustrated in Fig. A.1(a).

The required effort for both the user and the model can be significantly reduced if the user is allowed to specify the region directly using a representation in 2D coordinates that the model can understand. With this idea in mind, we focus on generating detailed localized descriptions by enabling users to specify a region in an image for the model to describe in detail. Since spatial representations such as points and boxes can be converted into masks using SAM [32] and SAM 2 [62], we concentrate on regions specified by mask inputs.

A first attempt to address this problem with existing VLMs is to reduce the task to global image captioning by presenting only the region to the VLM through masking or cropping, as shown in Fig. A.1(b). While this forces the VLM to focus solely on the specified region, freeing users from the burden of expressing localizations as phrases, the lack of contextual information makes the task much more challenging and often confuses the VLM. This confusion can prevent the model from correctly identifying the object, let alone providing detailed descriptions of its parts. In more extreme cases, the model may even refuse to caption the region due to insufficient information in the cropped or masked image. Therefore, generating detailed localized captions requires more than just the local region.

An alternative approach to prompt existing off-the-shelf VLMs for localized descriptions is to overlay markings such as points, scribbles, contours, and alpha masks on the image [91, 92], as shown in Fig. A.1(c). However, these markings may blend into the object or the background in highly complex scenes, making them unrecognizable to the VLMs. This issue is especially common for small objects that are not the main focus of the scene. Furthermore, the markings may render the image out-of-distribution, confusing the VLMs and disrupting the quality of output that they were originally capable of generating.

The exploration above highlights a conflict between the precision of localization and the availability of context. On one hand, we want the model to accurately focus on a specific region without mentioning other regions, such as other objects or the background. On the other hand, the model needs to leverage contextual information to correctly identify the object in the region of interest. This conflict makes it very difficult for current VLMs to produce high-quality localized descriptions.

Our proposed model overcomes this challenge by taking the localization as a separate input in 2D space. This approach has the advantage of making the localization more explicit for the VLMs to parse while keeping the image within its original distribution, thus preventing the model from being distracted by the markings. This technique leads to accurate localization even in complex scenes, as illustrated in Fig. A.1(d). Note that since Fig. A.1(d) mainly focuses on explaining the design choice of inputting masks to the model, focal prompting is included as a part of the model and is omitted in this figure for simplicity. We refer readers to Fig. 3 for illustrations of focal prompting.
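As noted above, spatial prompts such as clicks and boxes are first converted into binary masks before being passed to the model. The snippet below shows one way to do this with the public segment_anything package; the checkpoint path and click coordinates are placeholders, and this is a generic illustration rather than the paper's preprocessing code.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def click_to_mask(image_rgb: np.ndarray, x: int, y: int) -> np.ndarray:
    """Turn a single positive click into a binary mask usable as the localization input."""
    predictor.set_image(image_rgb)                 # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),                # 1 = foreground click
        multimask_output=True,
    )
    # A box prompt works analogously via predict(box=np.array([x0, y0, x1, y1])).
    return masks[np.argmax(scores)]                # keep SAM's highest-scoring proposal
```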
(d) Our Approach: Describe Anything Model (Ours) + Localization as a separate input

Figure A.1: Existing Vision-Language Models (VLMs) do not perform well in generating localized descriptions. (a) to (c) demonstrate several ways to prompt existing VLMs, but none achieves satisfactory performance, leading to the need for a new method that is capable of providing detailed and localized descriptions. In (d), we propose a model that accepts the condition in a separate form of input, making it easy for users to specify the object of interest and for the models to accurately localize the referred object. Note that our focal prompt, proposed in Sec. 3, is considered part of the Describe Anything Model and is not shown in the figure for simplicity.

B. Evaluation Benchmarks

Our DAM is designed to perform well at localized image and video captioning across multiple granularities, including keyword, phrase, and detailed captions. Therefore, we evaluate and achieve SOTA on 7 in-domain and zero-shot benchmarks:

1. The LVIS open-class keyword-level benchmark in Tab. 2.
2. The PACO open-class keyword-level benchmark (including objects and parts as regions) in Tab. 2.
3. The Flickr30k Entities phrase-level benchmark in Tab. 3.
4. The Ref-L4 detailed captioning benchmark in Tab. 4.
5. Our proposed DLC-Bench detailed localized captioning benchmark in Tab. 5.
6. The HC-STVG detailed video captioning benchmark in Tab. 6.
7. The VideoRefer detailed video captioning benchmark in Tab. 7.

We offer an explanation for each setup.

B.1. Keyword-level Localized Captioning Benchmarks

Open-class keyword-level localized captioning benchmarks, proposed in [95], require the model to output keywords containing the object and part entities to describe the region. In contrast to closed-class keyword-level localized captioning, which constrains the model output to several provided choices, open-class keyword-level localized captioning takes free-form text outputs from the model. The evaluation results are in Tab. 2.

1. For LVIS [29], this involves predicting the class name as a keyword, given the segmentation mask of an object. A typical class name ranges from one word to four words.
2. For PACO [60], this involves predicting the class name of an object in the mask if the mask contains a full object, or the object name and the part name if the mask contains an object part. This is especially challenging because it would require the model to understand nuances between full objects and object parts.

B.2. Phrase-level Localized Captioning Benchmarks

The phrase-level localized captioning task requires the model to output a phrase containing a brief description of the region that includes object identification and attributes, typically within a few words. The metrics typically used in phrase-level benchmarks are CIDEr, METEOR, BLEU, ROUGE-L, and SPICE.
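For reference, phrase-level metrics such as BLEU and CIDEr are commonly computed with the pycocoevalcap toolkit, as in the minimal example below. This is a generic illustration of the metrics listed above, not the paper's evaluation code, and it assumes captions are already lowercased and tokenized.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# region_id -> list of captions (references may contain several, predictions contain one)
references = {0: ["a red dome shaped light with a white base"]}
predictions = {0: ["a red translucent dome shaped light"]}

bleu_scores, _ = Bleu(4).compute_score(references, predictions)   # BLEU-1..4
cider_score, _ = Cider().compute_score(references, predictions)

print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)
```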
Input Video

DAM Output: "A man with short, dark hair and a slight stubble is wearing a denim shirt with rolled-up sleeves. He begins by standing with his hands clasped together in front of him, exuding a calm and composed demeanor. As the sequence progresses, he raises his hands slightly, maintaining a gentle and expressive gesture. His facial expression shifts subtly, suggesting he is engaged in conversation or explanation. The man's movements are smooth and deliberate, as he occasionally opens his hands and gestures outward, indicating emphasis or clarification. Throughout the sequence, his posture remains upright and attentive, conveying a sense of engagement and focus."

Reference Caption: "A man with short black hair is standing on the left, wearing a black jacket, as if reporting news"

GPT Evaluation on Hallucination Detection (HD) Dimension: Hallucination Detection: 1. Explanation: The predicted answer includes several imaginative elements, such as gestures and expressions, that are not mentioned in the correct answer, indicating hallucinations in the description.
Figure A.3: The pitfall of using reference captions for caption evaluation. Evaluation benchmarks based on
reference captions may incorrectly treat correct details in the predicted caption as hallucination. Since the GPT
evaluator relies solely on the ground truth caption without viewing the video, it mistakenly flags gestures and
expressions as hallucinations, resulting in a low score. However, the evaluation is invalid since the predicted
details are correct.
in manually annotated keywords (e.g., class names, part names, entities) in the datasets for regions we are annotating in our data pipeline to further reduce the ambiguity and potential confusion for our VLM in the data pipeline.

C.2. The Pitfall of Using Reference Captions in Benchmarks

As discussed in Sec. 5 and 6.1, caveats exist for using a "ground truth" reference caption for benchmarking localized descriptions. Specifically, since such a reference caption is hardly comprehensive and may not contain all the details about the region of interest, the metrics from the benchmark will treat correct details in the predicted caption that are not mentioned in the ground truth reference caption as hallucinations. This discourages the model from generating detailed captions.

We analyzed the performance of our method on the HD (hallucination detection) sub-task in VideoRefer-Bench [96] and found that our model often predicts correct details that are not present in the reference caption. The example in Fig. A.3 shows this phenomenon. While our model's prediction includes appearance and motion details about the change of the person's gesture and expression, such details are not mentioned in the reference caption in the dataset. Since the GPT evaluator does not see the video and uses the ground truth caption as the only source of information, it incorrectly believes that the gestures and expressions are hallucinations and gives our caption a very low score on the hallucination detection dimension. However, the evaluation is not valid, as our model is correct in its descriptions of the gestures and expressions.

This indicates that the lower score on this sub-task is not due to hallucination by our model, but rather due to the missing details in the reference caption and the fact that our model, evaluated in a zero-shot setting, is not aware of what types of details are preferred by or included in the reference caption.

C.3. Failure Cases

We show two failure cases of DAM in Fig. A.4. In Fig. A.4(a), DAM misrecognizes the frog-shaped slipper as a frog. In Fig. A.4(b), DAM describes the person as pulling the body upward. We expect these errors to be mitigated by broader data coverage.

C.4. Potential Limitations

DAM is only trained for multi-granular localized captioning, especially detailed localized captioning (DLC), and is not specifically optimized for other general vision-language tasks. However, DAM is designed for in-depth analysis of the task of multi-granular image and video localized descriptions rather than for breadth in general vision-language understanding,
DAM (Ours): "A bright green frog with a smooth, glossy texture. The frog has a rounded head with prominent, slightly raised eyes. Its body is streamlined, tapering towards the tail, and the limbs are short and sturdy."

(a) DAM might still misrecognize a region and output an incorrect description. For example, it misrecognizes the frog-shaped slipper to be a frog.

Input Video

DAM (Ours): "A person wearing a sleeveless white top and blue shorts is engaged in a dynamic exercise routine. The individual is equipped with a pair of red and white athletic shoes, which provide stability and support during the activity. The person is holding onto a bar or handle with both hands, suggesting a pull-up or similar upper-body exercise. The posture is upright, with the back slightly arched, indicating the exertion of muscles during the movement. The legs are bent at the knees, with feet firmly planted on the ground, providing a stable base. Throughout the sequence, the person maintains a consistent rhythm, pulling the body upward and then lowering it back down in a controlled manner. This repetitive motion showcases the individual's strength and endurance, as they continue the exercise with focus and determination."

(b) DAM might also be confused by the combination of the object motion and the camera motion. In this example, it makes the mistake of describing the person as pulling the body upward.
which justifies the design choice.

C.5. Computational Efficiency

DAM incorporates our proposed localized vision encoder, which differs from the SigLIP [97] vision encoder used in [44] by adding two key components: patch embedding layers for encoding the mask and cross-attention blocks. Importantly, these components do not alter the dimensions or sequence length of the vision features passed to the large language model, ensuring that the parameter count and computational efficiency of the large language model are unaffected. Since the vision encoder represents only a small fraction of the total parameters and computational operations, the overall increase in FLOPs and parameter count remains marginal, maintaining the model's efficiency.

To be more specific, unlike prior works that derive regional features from image features for each region, the regional feature used in our approach comes directly from a global and a focal view of the input image, with cross-attention enhancing the focal representation. This design is justified as the vision encoder is much smaller than the LLM (400M vs. 3B/8B parameters), with minimal latency impact (0.06s compared to 1.49s for the 3B LLM, as measured in our pipeline). This overhead is outweighed by the benefits of preserving fine details that global image features miss, as indicated in Tab. 8, especially for small regions. Finally, DAM-3B outperforms much larger models on the challenging DLC-Bench (Tab. 5), showing our efficiency.

C.6. Training Data

In addition to the details on data annotation presented in App. H.1, we discuss the training data of our work in this section and present a comparison with recent works. Compared with the recent works Ferret [93], which used 1.1M unreleased samples, and RegionGPT [28], which used 1.5M unreleased samples,
we train our model on a comparable amount of data (1.5M samples). However, we obtain much better performance (Tab. 5), which shows the effectiveness of DAM.

properties, and removed hallucinated or ambiguous properties from the VLM outputs. Finally, we turned these properties into questions that test whether a description accurately covers the property.
(a) An example region in the benchmark for the model to describe. Task: describe this region.

(b) An example description to be evaluated:
"A white electric stove with a smooth, flat cooktop featuring two visible black burners. The control panel is located at the back, with a black surface and a few visible knobs. The stove has a slightly raised backguard."

(c) An example positive question in DLC-Bench:
Q: Which of the following is applicable to the description?
Choices:
A. The control panel or the stove is not mentioned. (score: 0)
B. The location of the control panel is mentioned in the description but is not back. (score: -1)
C. The location of the control panel is mentioned in the description and is back. (score: 1)
D. The location of the control panel is not mentioned, but the control panel of the stove is mentioned. (score: 0.5)

(d) An example negative question in DLC-Bench:
Q: Which of the following is applicable to the description?
Choices:
A. The induction cooktop surface of the stove is mentioned in the description. (score: -1)
B. The induction cooktop surface of the stove is not mentioned in the description. (score: 1)
C. The stove is not mentioned in the description. (score: 0)
Figure A.5: An example from DLC-Bench for detailed localized captioning. (a) The process begins by
prompting a model to describe a specified region within the image. The resulting description is then evaluated
using a text-only LLM as a judge that rates each response by answering positive and negative questions. (b)
shows an example description to be evaluated. (c) Positive questions are designed to test whether the model
correctly identifies specific details within the described region. The model receives points for accurate details
and is penalized for factual errors. The bold option (option C) indicates that the LLM judge believes that option
C is applicable, allowing the model to get a point for this example positive question. (d) Negative questions
ensure the model refrains from mentioning irrelevant or nonexistent details. Mislocalization or hallucination
results in penalties to prevent false positives. The bold option (option B) indicates that the LLM judge believes
that option B is applicable, allowing the model to get a point for this negative question.
• Positive Scoring: For each positive question, if the description correctly includes the specified detail, the model receives a point. To prevent models from artificially inflating their scores by generating excessively long descriptions and guessing details, we penalize incorrect details and discourage models from including uncertain or erroneous content. If the detail is mentioned but incorrectly (e.g., wrong color), a penalty of one point is applied. No point is awarded if the description does not mention the detail. Partial points (0.5 points) are awarded for answers that are partially correct but insufficiently detailed. Note that the model gets positive points only when the object recognition is correct, as the correctness of the details depends on the correctness of the overall region recognition. We present a positive example in Fig. A.5(c).
• Negative Scoring: For each negative question, if the description appropriately excludes the incorrect or irrelevant detail, the model gets a point. If the description includes the detail, indicating mislocalization or hallucination, a penalty is applied. The model gets zero or negative points when the object recognition is incorrect, since otherwise a caption that is random and completely off could get high scores on the negative questions. We present a negative example in Fig. A.5(d).

The positive (negative) score for a model is the sum of points for positive (negative) questions, normalized by the maximum possible score to yield a percentage for comparison. We also average the positive and negative scores to obtain an overall score, which represents the model's overall capability in detailed localized captioning.

We present an example from DLC-Bench in Fig. A.5. The example region in Fig. A.5(a) features a stove with coil burners. An example description of the region is presented in Fig. A.5(b). For the example positive question in Fig. A.5(c), the LLM judge selects option C, as the caption correctly mentions that the control panel is at the back, allowing the model to get a point for this positive question. For the negative question in Fig. A.5(d), the LLM judge selects option B, as the caption correctly indicates that it is not an induction cooktop, allowing the model to get a point for this negative question.

Evaluation Setting. For our models, we follow our inference setting described in App. H.

not as detailed as when it is trained exclusively with the original prompt. Importantly, the model's performance on the negative questions remains unchanged, indicating that prompt augmentation does not lead to hallucinations or mislocalization.
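As a concrete illustration of the scoring protocol described in this appendix, the sketch below aggregates judge-assigned points into positive, negative, and average percentages. The data structures are hypothetical and the LLM-judge call itself is omitted.

```python
from dataclasses import dataclass

@dataclass
class JudgedQuestion:
    kind: str        # "positive" or "negative"
    score: float     # points selected by the LLM judge: 1, 0.5, 0, or -1
    max_score: float = 1.0

def dlc_bench_scores(questions: list[JudgedQuestion]):
    """Sum judge-assigned points per question type and normalize by the maximum
    attainable points to get percentages; the overall score is their average."""
    def normalized(kind: str) -> float:
        qs = [q for q in questions if q.kind == kind]
        return 100.0 * sum(q.score for q in qs) / sum(q.max_score for q in qs)
    pos, neg = normalized("positive"), normalized("negative")
    return pos, neg, (pos + neg) / 2.0

# Example: one positive question answered correctly (option C, +1) and one
# negative question answered correctly (option B, +1).
print(dlc_bench_scores([JudgedQuestion("positive", 1.0), JudgedQuestion("negative", 1.0)]))
```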
Table A.1: Ablations on architecture design compared to our strongest baseline VP-SPHINX [45]. We
trained a model with VP-SPHINX [45] architecture on our curated DLC data from various segmentation datasets.
The results on DLC-Bench indicate the advantages of our model architecture that allows detailed localized
features to be presented to the LLM for DLC.
Table A.2: Comparison of performance of DAM with and without prompt augmentation. Prompt augmen-
tation has minimal effect on DAM’s performance on DLC-Bench. While descriptions generated by the model
may occasionally be less detailed, leading to a slight decrease in the performance on positive questions, we
observed that prompt augmentation enhances instruction following when prompts include specific guidelines,
such as length constraints. We use the model without prompt augmentation for our benchmark, including ablations, by default.
Table A.3: Comparison of performance of our image-only DAM and DAM trained with both localized
image description task and localized video description task. Joint training benefits generating high-quality
localized image descriptions compared to image-only training.
In both examples, GPT-4o could not correctly recognize the objects in the masked regions, providing only vague descriptions. VP-SPHINX, while better than GPT-4o, still struggles with accurate object recognition and detailed descriptions. In the left image, VP-SPHINX incorrectly describes a group of seals when the masked region contains only one seal. In the right image, VP-SPHINX identifies the towel but provides minimal detail, missing key attributes like its color and texture. Our model outputs detailed and high-quality descriptions of the seal and the towel. This improvement stems from our model's design, which enables the fusion of object-specific information with broader contextual understanding.

Detailed Localized Video Captioning. We present comparisons with three strong video understanding models, GPT-4o [54], Qwen2.5-VL [77], and the recent work VideoRefer [96], on detailed localized video captioning in Fig. A.11. In the top example, we observed that GPT-4o struggles to interpret the cow's movements. Similarly, Qwen2.5-VL-7B incorrectly perceives the cow as stationary. VideoRefer-7B provides minimal motion and appearance details. In contrast, our 8B model accurately identifies the motion of the cow, providing more detailed information about it. In the bottom example, GPT-4o misidentifies the object, mistakenly assuming the animal is transforming into a wolf or a pig. Meanwhile, Qwen2.5-VL-7B believes only the sheep's head is moving. VideoRefer-7B recognizes that the sheep is moving but provides little detail about the appearance of the sheep. In contrast, our model correctly identifies the animal in the region as a sheep and provides abundant appearance and motion details.
Input
Video
A young child, dressed in a light-colored long-sleeve shirt and beige pants, is captured in a sequence of dynamic movements. The
child appears to be in mid-motion, possibly dancing or playing, with one arm raised and the other slightly bent. The child's posture
suggests a lively and energetic demeanor, with a slight forward lean indicating movement. The legs are positioned in a way that
suggests a step or a jump, with one foot slightly lifted off the ground. The child's head is tilted slightly upward, adding to the sense of
motion and engagement. Throughout the sequence, the child maintains a consistent posture, with the arms and legs moving in
coordination, conveying a sense of playful activity.
(a) DAM accurately describes fine-grained details for the masked region.
Input
Video
A cat with a predominantly white coat and distinct orange patches is seen in a series of fluid movements. Initially, the cat is crouched
low, its body compact and poised, suggesting a state of alertness or curiosity. As the sequence progresses, the cat begins to rise, its
body elongating and stretching, indicating a transition from a crouched position to a more upright stance. The cat's ears are perked,
and its eyes are focused, suggesting attentiveness to its surroundings. The tail, initially curled, extends outwards, providing balance
and a sense of readiness. The cat's movements are smooth and deliberate, showcasing its agility and grace. As it continues to rise, the
cat's posture becomes more upright, and its head turns slightly, as if scanning the environment or preparing to move in a new
direction. The sequence captures the essence of a cat's natural curiosity and fluid motion, highlighting its ability to navigate its
environment with ease and elegance.
(b) DAM is still able to output high-quality descriptions in scenes with fast camera and object motion.
Input
Video
A polished brown leather shoe, characterized by its sleek, glossy finish and classic design, moves gracefully through a sequence of
steps. The shoe's pointed toe and smooth surface reflect light subtly, emphasizing its refined craftsmanship. As it progresses, the shoe
maintains a steady, rhythmic motion, suggesting a confident stride. The heel, slightly elevated, provides a sense of balance and
stability, while the sole appears sturdy, designed for comfort and support. Throughout the sequence, the shoe's movement is fluid and
purposeful, indicative of a person walking with intent and poise. The shoe's interaction with the ground is seamless, showcasing its
flexibility and durability. The overall impression is one of elegance and sophistication, with the shoe serving as both a functional and
stylish accessory.
(c) DAM is able to accurately keep track of small objects across the dynamic scene.
Input
Video
A stone sculpture of a standing figure, draped in a flowing robe, is depicted in a serene and poised stance. The figure's right arm is
bent at the elbow, with the hand gently resting on the chest, while the left arm is slightly extended outward, suggesting a gesture of
contemplation or prayer. The head is adorned with a simple, rounded headpiece, and the facial features are finely detailed, conveying
a sense of calm and introspection. The robe, intricately carved, wraps around the body, creating soft folds and drapes that add to the
sculpture's lifelike quality. The figure stands firmly on a base, exuding a sense of stability and grace. Throughout the sequence, the
sculpture remains stationary, embodying a timeless and tranquil presence.
(d) DAM correctly identifies stationary objects in the video, offering detailed descriptions for the appearance of the specified object.
Figure A.8: Additional results from DAM on detailed localized video captioning (Part 1). Our model is
able to accurately describe small objects in complex scenes that involve large object motion and camera motion.
Our model also correctly identifies stationary objects captured by a non-stationary camera by saying they are
stationary. Videos visualized in this figure are from SA-V [62] dataset.
labeling, resulting in 774k annotations across 593k images, as detailed in Tab. A.5. Due to filtering, the final number of instances and images is lower than the original 10% subset of SA-1B. We do not use the
Input
Video
An elephant, characterized by its large, rounded ears and long, curved tusks, moves gracefully through the sequence. Its skin appears
rough and textured, with a grayish-brown hue that suggests a natural, earthy tone. The elephant's trunk, a versatile and muscular
appendage, is prominently visible, swaying gently as it walks. Its eyes are small and partially obscured by the folds of its skin, giving it
a calm and serene expression. The elephant's legs are sturdy and columnar, supporting its massive body with ease. As it progresses
through the sequence, the elephant maintains a steady and deliberate pace, its movements fluid and unhurried. The tusks, slightly
curved and pointed, catch the light subtly, adding a touch of elegance to its robust form. Overall, the elephant exudes a sense of
majesty and tranquility, moving with a quiet confidence that is both powerful and graceful.
(a) DAM accurately describes fine-grained details for the masked region.
Input
Video
A white SUV is prominently featured, showcasing a sleek and modern design. The vehicle's body is smooth and aerodynamic, with a
slightly elevated stance typical of SUVs. The front of the car is characterized by a distinctive grille and sharp, angular headlights that
give it a bold and assertive look. As the sequence progresses, the SUV moves steadily forward, maintaining a consistent trajectory. The
wheels, with their intricate alloy design, rotate smoothly, indicating a well-maintained vehicle. The side mirrors are positioned to
provide optimal visibility, and the windows are tinted, adding a touch of sophistication. The SUV's movement is fluid, suggesting a
well-engineered suspension system that handles the road with ease. Throughout the sequence, the car remains the focal point,
exuding a sense of reliability and style.
(b) DAM is able to handle objects that are heavily occluded in some frames.
Input
Video
A person wearing a blue jacket with white text on the back is captured in a sequence of dynamic movements. Initially, the individual is
seen in a slightly bent posture, suggesting a moment of balance or readiness. As the sequence progresses, the person appears to be
in motion, possibly walking or running, with their arms extended forward, indicating a sense of urgency or direction. The jacket, with
its distinct color and text, remains a focal point throughout the sequence, emphasizing the person's presence and movement. The
person's head is slightly tilted, and their posture suggests a forward-leaning stance, consistent with active movement. The sequence
concludes with the person seemingly in mid-stride, capturing a sense of continuous motion and purpose.
(c) DAM is able to output detailed descriptions for objects despite occlusion due to object motion and human articulation.
Input
Video
A cat with a sleek, short-haired coat of light gray and white fur is prominently featured. Its ears are perked up, and its large, expressive
eyes are wide open, giving it an alert and curious demeanor. The cat's body is slightly turned, suggesting a poised and attentive
stance. As the sequence progresses, the cat's head tilts slightly upward, maintaining its focused gaze. Its whiskers are long and
prominent, adding to its inquisitive expression. The cat's posture remains mostly upright, with subtle shifts in weight that suggest a
readiness to move or react. Throughout the sequence, the cat's overall demeanor is one of calm vigilance, as if it is observing
something of interest with keen attention.
(d) DAM correctly identifies slight motion in the video, showing its capability for detailed motion analysis.
Figure A.9: Additional results from DAM on detailed localized video captioning (Part 2). Our model is
able to accurately describe objects that are partially occluded and is able to perceive and describe slight motion.
Videos visualized in this figure are from MOSE [23] dataset.
masks provided with SA-1B, as they contain a large number of masks for parts. Instead, we employ the open-vocabulary detector OWL-ViT v2 [51, 52] to detect objects in the images, and then use SAM [62] to generate masks for the detected instances. Finally, we use SigLIP [97] to evaluate the image-text similarity,
Figure A.10: Qualitative comparisons demonstrate the superior localized image understanding capabilities
of our model compared to GPT-4o [54] and VP-SPHINX [45], our strongest open-weight baseline. GPT-4o
struggles to recognize objects in masked regions accurately, offering only vague descriptions. In the left image,
VP-SPHINX incorrectly describes a group of seals when the masked region contains only one seal. In the right
image, VP-SPHINX identifies the towel but provides minimal detail, missing key attributes like its color. In
contrast, our model delivers precise, detailed descriptions and captures the seal’s sleek elongated body, dark
brown coloration with lighter patches, and the towel’s light gray color, wrinkled texture, and darker edge
pattern. This superior performance stems from our model’s architecture that effectively fuses object-specific
details with broader contextual understanding.
taking the region as an image.

To ensure data quality, we apply extensive filtering (i.e., rejection sampling) based on confidence scores from OWL-ViT v2, SAM, and SigLIP image-text similarity. We also ensure we have at most two instances per image, and for images with two instances, these two instances have to be from different classes. The object category names produced by OWL-ViT v2 are then put into a variant of our Describe Anything Model, which is trained on data from stage 1 and optimized for self-labeling. This variant generates descriptions with a 50% probability of incorporating class names during training, as during self-labeling we have a class name as a part of each input. The object category proposals used by OWL-ViT v2 are generated by VILA 1.5 [44].

Detailed localized video captioning. We annotated 94k regions across 37k videos from the SA-V dataset [62] for detailed localized video captioning, as detailed in Tab. A.6. Note that each region, also called a masklet, indicates an instance across multiple frames in the video. In contrast to the use of SA-1B, where we did not use the masks that come with the dataset, we use the high-quality masklets that come with the videos. We found that many masklets cover parts of an instance, which is not necessarily helpful for describing the whole object, a common use case of our model. Therefore, we performed instance segmentation on the videos with ViTDet [41] + Cascade Mask R-CNN [12] trained by EVA-02 [26] and used voting to match the segmentation masks with the masklets. In this way, we filter out most of the masklets that are parts, since they likely do not correspond to instance masks. The matched masklets carry the class name from the matched instance segmentation mask, which is used in the annotation process to obtain a detailed localized caption for each masklet.

H.2. Model Training

We start from off-the-shelf VILA 1.5 [44] models that are publicly available on HuggingFace. For image-only training, we fine-tune the VILA 1.5 3B model. For joint image-video training, we use the VILA 1.5 8B model. We use the SigLIP [97] vision encoder, following VILA 1.5. To prevent catastrophic forgetting and to maintain instruction-following capabilities, we mix in ShareGPT-
Input Video

GPT-4o: "A cow is standing in an outdoor area, tethered with a rope. Its body is mostly stationary, with slight movements as it shifts its weight from one leg to another. The cow occasionally flicks its tail, indicating a relaxed state. Its head is lowered, possibly grazing or sniffing the ground. The cow's coat is smooth, with a consistent color pattern. As the sequence progresses, the cow takes small steps forward, adjusting its position slightly. The motion is gradual, with the cow maintaining a calm demeanor throughout." [Incorrect motion understanding ✗]

Qwen2.5-VL: "A cow, tethered by a rope, stands in a courtyard surrounded by other cows and a blue vehicle. The cow's body is predominantly light brown with darker patches, and it has a white patch on its face. Its tail is slightly raised, and it appears to be looking around, possibly observing its surroundings or reacting to sounds. The cow remains stationary throughout the video, maintaining a calm demeanor. The background includes a concrete floor scattered with some debris, a wall with a window, and a blue vehicle parked nearby. The overall scene suggests a rural setting where the cows are likely being kept for grazing or other agricultural purposes." [Incorrect motion understanding ✗]

VideoRefer: "The cow that was the first to move forward and start walking." [Minimal motion and appearance details ✗]

DAM (Ours): "A cow with a rich brown coat and a lighter patch on its rump is depicted in a sequence of movements. Initially, the cow is seen with its head slightly lowered, suggesting a calm demeanor. As the sequence progresses, the cow begins to move forward, its legs extending in a steady, rhythmic gait. The tail, with its tufted end, sways gently with each step, adding a sense of fluidity to its motion. The cow's body remains mostly upright, with its back slightly arched, indicating a relaxed posture. The legs, sturdy and well-defined, carry the cow forward with a sense of purpose. Throughout the sequence, the cow maintains a consistent pace, its movements smooth and unhurried, embodying a serene and composed presence." [Recognizes the motion ✔; Abundant details ✔]

Input Video

GPT-4o: "An animal is present in the region, initially appearing as a sheep with a woolly coat. It stands on a patch of ground with sparse vegetation. As the sequence progresses, the animal undergoes a series of transformations. Its appearance shifts from a sheep to a creature resembling a wolf, with a sleeker, more streamlined body and a different texture. The transformation continues as the animal takes on the form of a pig, with a rounded body and a pinkish hue. This change is gradual, with the animal's features morphing smoothly from one form to another. Throughout these transformations, the animal remains in a similar position, occasionally adjusting its stance slightly. The sequence concludes with the animal returning to a sheep-like form, completing the cycle of transformations." [Incorrect appearance understanding ✗]

Qwen2.5-VL: "A sheep stands in a field, initially facing forward. It then turns its head to the left, looking over its shoulder. The sheep's body remains mostly stationary, but its head movement suggests it might be reacting to something off-camera. The background consists of dry grass and scattered twigs, indicating an outdoor rural setting." [Incorrect motion understanding ✗]

VideoRefer: "The sheep in the video is white and appears to be grazing. It is moving around the field, occasionally stopping to nibble on the grass. The sheep's motion is slow and deliberate as it searches for food." [Minimal appearance details ✗]

DAM (Ours): "A sheep with a thick, woolly coat of light beige and white fur is seen moving gracefully. Its body is robust and rounded, with a slightly curved back. The sheep's head is small in proportion to its body, with a pair of alert, medium-sized ears that stand upright. As it moves, the sheep's legs are sturdy and well-positioned, supporting its weight as it walks with a steady, deliberate pace. The tail is short and slightly curved, adding to its balanced silhouette. Throughout the sequence, the sheep maintains a consistent gait, suggesting a calm and purposeful demeanor. Its movements are smooth and fluid, indicating a sense of ease and familiarity with its surroundings." [Recognizes the motion ✔; Abundant details ✔]
Figure A.11: Our proposed DAM demonstrates superior localized video understanding compared to GPT-4o [54], Qwen2.5-VL [77], and VideoRefer [96]. Top figure: DAM accurately captures the cow's forward movement with comprehensive details, whereas GPT-4o and Qwen2.5-VL mistakenly perceive the cow as stationary. Compared to VideoRefer, DAM provides richer descriptions of both motion and appearance. Bottom figure: DAM correctly recognizes the animal as a sheep and accurately describes its graceful movement, while GPT-4o erroneously identifies it as transforming into other animals, and Qwen2.5-VL incorrectly perceives that only the sheep's head is moving. VideoRefer provides limited appearance details, while DAM offers extensive, accurate descriptions. These cases highlight DAM's precise understanding of motion and appearance throughout video sequences.
You are responsible to write a very descriptive caption to describe the {{ category }} in the provided SEGMENTED image. You may leverage the surrounding context of the SEGMENTED image provided in the CROPPED image.
You must not mention any background in the caption and only describe the {{ category }} in the SEGMENTED image! The caption must ONLY contain sufficient details to reconstruct the same {{ category }} in the SEGMENTED image but nothing else!
Here are some additional rules you need to follow when describing the {{ category }} in the SEGMENTED image:
1. If there are multiple {{ category }} in the CROPPED image, focus on the {{ category }} in the SEGMENTED image.
2. If the {{ category }} in the SEGMENTED image is occluded by other objects, only describe the visible part. DO NOT mention anything that is not directly related to the visible part of {{ category }}, such as "A segment of", which part is invisible, etc. For objects with text written on it, describe the object instead of just outputting the text written on it.
Here is the SEGMENTED image that needs caption:
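The double-brace placeholders in the template above suggest Jinja-style rendering; the snippet below is a minimal illustration of filling in the {{ category }} placeholder per object category (using jinja2 here is an assumption, and only the first sentence of the template is shown).

```python
from jinja2 import Template

# Illustrative rendering of the annotation prompt template above; the exact
# templating mechanism used in the pipeline is an assumption.
ANNOTATION_PROMPT = Template(
    "You are responsible to write a very descriptive caption to describe the "
    "{{ category }} in the provided SEGMENTED image. You may leverage the "
    "surrounding context of the SEGMENTED image provided in the CROPPED image."
)

print(ANNOTATION_PROMPT.render(category="dog"))
```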
H.2. Model Training

We start from off-the-shelf VILA 1.5 [44] models that are publicly available on HuggingFace. For image-only training, we fine-tune the VILA 1.5 3B model. For joint image-video training, we use the VILA 1.5 8B model. We use the SigLIP [97] vision encoder, following VILA 1.5. To prevent catastrophic forgetting and to maintain instruction-following capabilities, we mix in ShareGPT-4V [16] with our localized image/video captioning dataset collected with our proposed data pipeline. Following the VILA 1.5 training and inference recipe, we treat videos as 8 images concatenated in the sequence.

We closely follow VILA 1.5's recipe for the supervised fine-tuning stage and train all modules, including the vision backbone, the projector, and the LLM. We fine-tune the model for 1 epoch. For the 3B model, we use a batch size of 2048 with a learning rate of 1e-4 on 8 Nvidia A100 GPUs. For the 8B model, we use a batch size of 2048 with a learning rate of 1e-5 on 32 Nvidia A100 GPUs. Both models take less than a day to train. We use a cosine scheduler with a warmup ratio of 0.03; no weight decay is used.
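As a concrete illustration, the warmup-plus-cosine schedule described above can be written as follows. This is a sketch of the standard schedule under the stated warmup ratio, not the exact training code; decaying to a final learning rate of 0 is an assumption.

```python
import math

# Sketch of a cosine learning-rate schedule with a 0.03 warmup ratio, as described
# above. peak_lr corresponds to 1e-4 (3B model) or 1e-5 (8B model).
def lr_at_step(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.03) -> float:
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0
```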
For training our model that takes in a class name for self-labeling, we randomly put the class name in the prompt with 50% probability. For models without prompt augmentation, which is detailed below, we simply use the prompt "Describe the masked region in detail." Following VILA, we always put image tokens in front of the textual tokens. As for the setting of the focal crop, we extend the crop by 1× the width toward the left and right, and 1× the height toward the top and bottom, unless we hit the boundaries of the image, in which case we take the boundaries; i.e., α = 3 and the total area of the crop is enlarged up to 9×. If either the height or width is less than 48 pixels, we take 48 pixels for that direction to encode more context for very small regions, since the small regions themselves do not contain much useful information.
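The focal-crop rule above can be made concrete with a short sketch; the (x0, y0, x1, y1) box convention and the exact handling of the 48-pixel minimum are assumptions for illustration.

```python
# Sketch of the focal crop described above: extend the region's bounding box by
# 1x its width to the left and right and 1x its height to the top and bottom
# (alpha = 3, up to 9x the area), clamped to the image. For very small regions,
# a minimum of 48 pixels is assumed for the extension in that direction.
def focal_crop(box, image_w, image_h, min_ext=48):
    x0, y0, x1, y1 = box
    ext_w = max(x1 - x0, min_ext)   # horizontal extension amount
    ext_h = max(y1 - y0, min_ext)   # vertical extension amount
    return (
        max(0, x0 - ext_w),
        max(0, y0 - ext_h),
        min(image_w, x1 + ext_w),
        min(image_h, y1 + ext_h),
    )
```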
Prompt Augmentation. We trained a variant of our model with prompt augmentation to enhance generalization capabilities beyond detailed localized captioning, as analyzed in App. G. For these models, during training, we randomly select one of 15 prompts from a predefined set. These prompts may or may not include a {prompt_suffix}. The default prompt suffix is "in detail". However, we introduce variability by conditioning the prompt on the number of words or sentences in the target caption.

Specifically, with a 20% probability, we condition the prompt on the number of sentences, using suffixes like "in one sentence" or "in [number of sentences] sentences" (e.g., "in 2 sentences"). If the caption contains only one sentence, we use phrases like "in a sentence" or "in one sentence".

With another 20% probability, we condition the prompt on the number of words in the target caption. For captions with a small word count, we use exact numbers (e.g., "in 3 words"). For longer captions (up to 200 words), we may round the word count to the nearest ten and use phrases like "in about 50 words" or "in around 50 words". If the caption exceeds 200 words, we use the suffix "in more than 200 words".
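A minimal sketch of this suffix-sampling logic is given below; the exact wording choices, sentence-splitting heuristic, and the 10-word cutoff for using exact counts are illustrative assumptions.

```python
import random

# Sketch of the prompt-suffix sampling described above: 20% sentence-count
# conditioning, 20% word-count conditioning (exact for short captions, rounded
# to the nearest ten up to 200 words, "more than 200 words" beyond that),
# otherwise the default "in detail".
def sample_prompt_suffix(caption: str) -> str:
    num_sentences = len([s for s in caption.split(".") if s.strip()])
    num_words = len(caption.split())
    r = random.random()
    if r < 0.2:                                   # sentence-count conditioning
        return "in one sentence" if num_sentences == 1 else f"in {num_sentences} sentences"
    if r < 0.4:                                   # word-count conditioning
        if num_words > 200:
            return "in more than 200 words"
        if num_words <= 10:                       # assumed cutoff for exact counts
            return f"in {num_words} words"
        return f"in about {10 * round(num_words / 10)} words"
    return "in detail"                            # default suffix
```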
The list of prompts that include a {prompt_suffix} is as follows:

1. Describe the masked region {prompt_suffix}.
2. Describe the masked area {prompt_suffix}.
3. What can you describe about the masked region {prompt_suffix}?
4. Can you describe the masked region {prompt_suffix}?
5. Provide an explanation of the masked region {prompt_suffix}.
6. Depict the masked area {prompt_suffix}.
7. Portray the masked area {prompt_suffix}.
8. Describe what the masked region looks like {prompt_suffix}.
9. Illustrate the masked region {prompt_suffix}.
10. How would you explain the masked area {prompt_suffix}?
11. What details can you provide about the masked region {prompt_suffix}?
12. What does the masked region entail {prompt_suffix}?
13. How would you illustrate the masked region {prompt_suffix}?
14. How would you depict the masked area {prompt_suffix}?
15. How would you portray the masked area {prompt_suffix}?

Additionally, we have prompts that inherently request detailed descriptions without requiring a suffix:

1. Give a detailed description of the masked region.
2. Provide a thorough description of the masked region.
3. Can you explain the details of the masked area?
4. Give a detailed account of the masked region.
5. Describe the masked area comprehensively.
6. Provide an in-depth description of the masked region.
7. Explain the specifics of the masked area.
8. Can you provide a thorough explanation of the masked region?
9. What are the details of the masked area?
10. Provide a comprehensive description of the masked area.
11. What specific details can you provide about the masked region?
12. Can you give an in-depth account of the masked section?
13. What are the main characteristics of the masked region?
14. Give a thorough description of the masked area's details.
15. Provide detailed information about the masked area.

• If the prompt_suffix is "in detail" (the default option), we may choose from either set of prompts.
• If the prompt_suffix specifies word or sentence counts, we select only from prompts that include {prompt_suffix}.

This approach introduces variability in the prompts, encouraging the model to generate responses with controls from the prompts in mind, thereby enhancing its generalization and instruction-following capabilities.

H.3. Inference Setting

Unless otherwise mentioned, our prompt for obtaining detailed localized image descriptions at inference time is the following:

Describe the masked region in detail.

Our prompt for obtaining detailed localized video descriptions at inference time is the following:

Given the video in the form of a sequence of frames above, describe the object in the masked region in the video in detail. Focus on appearance, motion, and actions. If the motion involves multiple stages or steps, break down each stage and describe the movements or changes sequentially. Ensure each phase of motion is described clearly, highlighting transitions between actions.

For Co3Dv2 [64] sequences that we treat as videos, we use the following prompt:

Describe the masked region in the video in detail. The video consists of multiple views of a stationary object. Focus on the appearance of the object without mentioning any motion or actions.
References

[2] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In ECCV, pages 382–398. Springer, 2016. 2, 3, 8
[3] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv:2308.01390, 2023. 3
[4] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv:2309.16609, 2023. 3
[5] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL Workshops, 2005. 2, 3, 8
[6] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Fuyu-8b: A multimodal architecture for ai agents, 2023. 3
[7] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv:1911.09785, 2019. 2, 6
[8] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019. 2, 6
[9] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer. arXiv:2407.07726, 2024. 3
[10] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. arXiv:2210.09461, 2022. 4
[13] David Chan, Suzanne Petryk, Joseph E Gonzalez, Trevor Darrell, and John Canny. Clair: Evaluating image captions with large language models. arXiv:2310.12971, 2023. 8, 3
[14] Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. arXiv:2406.16866, 2024. 7, 8, 3
[15] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv:2306.15195, 2023. 1, 3, 7, 8
[16] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023. 16
[17] Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, et al. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In CVPR, pages 13320–13331, 2024. 8, 9, 3
[18] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325, 2015. 1
[19] Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. On scaling up a multilingual vision and language model. In CVPR, 2024. 3
[20] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv:2312.14238, 2023. 3, 8, 9, 10
[21] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv:2412.05271, 2024. 8, 10
[22] Wenliang Dai, Junnan Li, D Li, AMH Tiong, J Zhao, W Wang, B Li, P Fung, and S Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023. 3
[23] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. Mose: A new dataset for video object segmentation in complex scenes. In CVPR, 2023. 13
[24] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020. 5
[25] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv:2407.21783, 2024. 3, 8, 6, 10
[26] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. Image and Vision Computing, 2024. 14
[27] Xiuye Gu, Yin Cui, Jonathan Huang, Abdullah Rashwan, Xuan Yang, Xingyi Zhou, Golnaz Ghiasi, Weicheng Kuo, Huizhong Chen, Liang-Chieh Chen, et al. Dataseg: Taming a universal multi-dataset multi-task segmentation model. NeurIPS, 2024. 6
[28] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision language model. In CVPR, 2024. 2, 3, 7, 8, 5, 6
[29] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In CVPR, 2019. 7, 2, 9, 11
[30] Xiaoke Huang, Jianfeng Wang, Yansong Tang, Zheng Zhang, Han Hu, Jiwen Lu, Lijuan Wang, and Zicheng Liu. Segment and caption anything. In CVPR, 2024. 1, 3, 6, 8
[31] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014. 2, 3
[32] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023. 4, 1
[33] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://storage.googleapis.com/openimages/web/index.html, 2017. 11
[34] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017. 2, 3
[35] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 2020. 11
[36] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. NeurIPS, 2024. 3
[37] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896. Atlanta, 2013. 2
[38] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022. 3, 5
[39] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023. 3
[40] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, pages 10955–10965, 2022. 1
[41] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022. 14
[42] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. In CVPR, pages 22511–22521, 2023. 1
[43] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 2, 3, 8
[44] Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In CVPR, 2024. 3, 6, 8, 5, 10, 14
[45] Weifeng Lin, Xinyu Wei, Ruichuan An, Peng Gao, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, and Hongsheng Li. Draw-and-understand: Leveraging visual prompts to enable MLLMs to comprehend what you want. In ICLR, 2025. 1, 2, 3, 7, 8, 9, 10, 14
[46] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 3, 8, 1, 10
[47] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024.
[48] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 2024. 3, 7, 8, 1, 10
[49] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024. 1
[50] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016. 2, 3
[51] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, et al. Simple open-vocabulary object detection. In ECCV, 2022. 13
[52] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling open-vocabulary object detection. NeurIPS, 2024. 13
[53] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In ICCV, 2017. 11
[54] OpenAI. Gpt-4o system card, 2024. 1, 2, 6, 8, 9, 3, 10, 14, 15
[55] OpenAI. Learning to reason with llms, 2024. 2, 6, 8, 10
[56] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002. 2, 3, 8
[57] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv:2306.14824, 2023. 3
[58] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015. 7, 8, 3
[59] Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, and Yunjie Tian. Artemis: Towards referential understanding in complex videos. NeurIPS, 37:114321–114347, 2024. 3, 8, 9
[60] Vignesh Ramanathan, Anmol Kalia, Vladan Petrovic, Yi Wen, Baixue Zheng, Baishan Guo, Rui Wang, Aaron Marquez, Rama Kovvuri, Abhishek Kadian, et al. Paco: Parts and attributes of common objects. In CVPR, pages 7141–7151, 2023. 7, 2, 11
[61] Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S Khan. Glamm: Pixel grounding large multimodal model. In CVPR, 2024. 3, 7
[62] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv:2408.00714, 2024. 4, 6, 1, 11, 12, 13, 14
[63] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv:1908.10084, 2019. 3
[64] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021. 9, 17
[65] Pengzhen Ren, Changlin Li, Guangrun Wang, Yun Xiao, Qing Du, Xiaodan Liang, and Xiaojun Chang. Beyond fixation: Dynamic window visual transformer. In CVPR, 2022. 4
[66] Tomer Ronen, Omer Levy, and Avram Golbert. Vision transformers with mixed-resolution tokenization. In CVPR, 2023. 4
[67] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In ICCV, 2019. 3, 6
[68] Manish Kumar Singh, Rajeev Yasarla, Hong Cai, Mingu Lee, and Fatih Porikli. Tosa: Token selective attention for efficient vision transformers. arXiv:2406.08816, 2024. 4
[69] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. NeurIPS, 2020. 2, 6
[70] Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-clip: A clip model focusing on wherever you want. In CVPR, 2024. 3
[71] Zongheng Tang, Yue Liao, Si Liu, Guanbin Li, Xiaojie Jin, Hongxu Jiang, Qian Yu, and Dong Xu. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8238–8249, 2021. 8, 3
[72] Claude Team. Claude 3.5 sonnet, 2024. 6
[73] Claude Team. Claude 3.7 sonnet and claude code, 2025. 8, 10
[74] Gemini Team. Gemini 2.5: Our most intelligent ai model, 2025. 8, 10
[75] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023. 8, 6, 10
[76] Gemini Team, M Reid, N Savinov, D Teplyashin, Lepikhin Dmitry, T Lillicrap, JB Alayrac, R Soricut, A Lazaridou, O Firat, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024. 6
[77] Qwen Team. Qwen2.5-vl, 2025. 8, 10, 15
[78] A Vaswani. Attention is all you need. NeurIPS, 2017. 5
[79] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015. 2, 3, 8
[80] Han Wang, Yongjie Ye, Yanjie Wang, Yuxiang Nie, and Can Huang. Elysium: Exploring object-level perception in videos via mllm. In ECCV, pages 166–185. Springer, 2024. 9
[81] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv:2409.12191, 2024. 3, 8, 9, 10
[82] Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, and Shanshan Zhao. Caption anything: Interactive image description with diverse multimodal controls. arXiv:2305.02677, 2023. 1, 3
[83] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, et al. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. arXiv:2308.01907, 2023. 3
[84] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization. arXiv:2411.10442, 2024. 8, 10
[85] Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv:2212.00280, 2022. 1
[86] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv:2312.14135, 2023. 4
[87] Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, et al. Vila-u: a unified foundation model integrating visual understanding and generation. arXiv:2409.04429, 2024. 3
[88] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. NeurIPS, 2020. 2
[89] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In CVPR, pages 10687–10698, 2020. 6
[90] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. arXiv:2107.00641, 2021. 4
[91] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv:2310.11441, 2023. 3, 1, 8, 10
[92] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv:2309.17421, 2023. 3, 1
[93] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. In ICLR, 2024. 1, 2, 3, 7, 8, 9, 5
[95] Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. In CVPR, 2024. 1, 2, 3, 7, 8, 9
[98] Haotian Zhang, Haoxuan You, Philipp Dufter, Bowen Zhang, Chen Chen, Hong-You Chen, Tsu-Jui Fu, William Yang Wang, Shih-Fu Chang, Zhe Gan, and Yinfei Yang. Ferret-v2: An improved baseline for referring and grounding with large language models. In COLM, 2024. 3
[99] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, and Ping Luo. Gpt4roi: Instruction tuning large language model on region-of-interest. arXiv:2307.03601, 2023. 1, 3, 7, 8
[100] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. arXiv:2406.19389, 2024. 1, 3, 8, 6
[101] Yuzhong Zhao, Yue Liu, Zonghao Guo, Weijia Wu, Chen Gong, Qixiang Ye, and Fang Wan. Controlcap: Controllable region-level captioning. pages 21–38, 2024. 1, 3, 8
[102] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In CVPR, pages 16793–16803, 2022. 1
[103] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 3