PANORAMA: Panoptic grounded captioning via mask-guided refinement

Sara Pieri¹
Evangelos Kazakos²
Shizhe Chen¹
Josef Sivic²
Cordelia Schmid¹
¹Inria, École normale supérieure, CNRS, PSL Research University
²Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague

Abstract

Current vision-language models (VLMs) can produce detailed image captions but often hallucinate content, limiting their reliability and explainability. We tackle this by reframing captioning as a panoptic grounding task, which requires the model to generate comprehensive descriptions for all foreground and background elements while explicitly grounding each mention to a pixel-level mask. Our contributions are three-fold. First, we introduce PANORAMA, a novel mask-guided refinement framework that conditions generation on candidate panoptic segmentation masks produced by a region proposal module. These candidate masks enrich the visual input to the VLM; a mask matching and refinement module then matches them to the output captions and refines them, yielding precise per-entity grounding with consistent, detailed descriptions. Second, we construct PanoCaps, a new human-annotated dataset built from panoptic segmentation corpora. It provides dense, open-vocabulary captions with high-quality, entity-level image-text alignment. Third, PANORAMA achieves state-of-the-art performance on both the existing Grounded Conversation Generation (GCG) benchmark and our new dataset. Extensive ablations highlight the effectiveness of mask guidance and refinement in ensuring complete object coverage and consistent descriptions.


Method



Our Approach. PANORAMA is a framework for panoptic grounded captioning that turns an image into a scene description in which every mentioned entity is linked to a pixel-level segmentation mask. A region proposal module first generates candidate regions, which are encoded into compact region embeddings alongside the usual image patch tokens. The vision–language model then generates a caption interleaved with [SEG] tokens, each corresponding to an entity mention. For each [SEG] token, a region matching module selects the most relevant region proposal, which is then passed to a region refinement module to refine its mask. This “propose–match–refine” pipeline produces captions that are both comprehensive and tightly grounded in the image.
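To make the propose–match–refine pipeline concrete, here is a minimal PyTorch sketch of its three steps with toy stand-in modules. The class and method names (PanoramaSketch, encode_regions, match, refine), the pooling scheme, and all dimensions are illustrative assumptions rather than the actual PANORAMA implementation: candidate masks from the proposal module are mean-pooled into region embeddings, each generated [SEG] token is matched to its most similar region, and the matched mask is refined by a small convolutional head conditioned on the image features.

import torch
import torch.nn as nn


class PanoramaSketch(nn.Module):
    """Toy propose-match-refine pipeline (illustrative sketch, not the paper's code)."""

    def __init__(self, d_model=256):
        super().__init__()
        # Turns mask-pooled patch features into compact region embeddings.
        self.region_encoder = nn.Linear(d_model, d_model)
        # Projects [SEG] token hidden states into the same space for matching.
        self.seg_proj = nn.Linear(d_model, d_model)
        # Small head that refines a matched proposal mask, conditioned on image features.
        self.refiner = nn.Sequential(
            nn.Conv2d(d_model + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def encode_regions(self, feat_map, proposal_masks):
        # feat_map: (D, H, W) patch features; proposal_masks: (R, H, W) binary candidates.
        m = proposal_masks.float().unsqueeze(1)                    # (R, 1, H, W)
        pooled = (feat_map.unsqueeze(0) * m).sum(dim=(2, 3))       # (R, D) sum under each mask
        pooled = pooled / m.sum(dim=(2, 3)).clamp(min=1.0)         # mean-pool over mask pixels
        return self.region_encoder(pooled)                         # (R, D) region embeddings

    def match(self, seg_states, region_embs):
        # seg_states: (S, D) hidden states of the generated [SEG] tokens.
        sims = self.seg_proj(seg_states) @ region_embs.t()         # (S, R) similarity scores
        return sims.argmax(dim=-1)                                 # best proposal index per [SEG]

    def refine(self, feat_map, matched_masks):
        # matched_masks: (S, H, W) coarse masks selected for each [SEG] token.
        feats = feat_map.unsqueeze(0).expand(matched_masks.size(0), -1, -1, -1)
        x = torch.cat([feats, matched_masks.float().unsqueeze(1)], dim=1)
        return torch.sigmoid(self.refiner(x)).squeeze(1)           # (S, H, W) refined masks


if __name__ == "__main__":
    D, H, W, R, S = 256, 24, 24, 8, 3
    model = PanoramaSketch(d_model=D)
    feat_map = torch.randn(D, H, W)               # image patch features from the vision encoder
    proposals = torch.randint(0, 2, (R, H, W))    # candidate masks from the region proposal module
    seg_states = torch.randn(S, D)                # [SEG] token states from the caption decoder
    region_embs = model.encode_regions(feat_map, proposals)
    idx = model.match(seg_states, region_embs)    # one proposal per entity mention
    refined = model.refine(feat_map, proposals[idx])
    print(refined.shape)                          # torch.Size([3, 24, 24])

In the full model the matching and refinement would be trained jointly with the caption decoder; the simple argmax selection above is only the most direct way to illustrate how each [SEG] token is assigned a single candidate mask.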




Our Dataset. PanoCaps is our human-annotated dataset for panoptic grounded captioning, pairing detailed scene-level captions with pixel-level panoptic masks for every mentioned entity. It contains about 3.5K images from multiple segmentation sources, with dense, open-vocabulary captions and over 99% mask coverage, so both foreground objects and background “stuff” are exhaustively described. Unlike automatically generated corpora, which often have noisy text and incomplete masks, PanoCaps offers high-quality, diverse, and tightly aligned image–text–mask supervision, making it a strong benchmark and training set for dense, natural grounding.
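As a rough illustration of the kind of entity-level supervision such a dataset provides, the snippet below sketches one possible in-memory record layout that pairs a caption with per-mention masks. The class, its field names, and the character-span convention are assumptions made for illustration, not the released PanoCaps format.

from dataclasses import dataclass

import numpy as np


@dataclass
class GroundedCaptionRecord:
    """One image paired with a dense caption and a mask per mentioned entity."""
    image_path: str       # path to the source image
    caption: str          # dense, scene-level caption
    entity_spans: list    # (start, end) character spans of entity mentions in the caption
    masks: list           # one binary (H, W) numpy mask per mentioned entity

    def iter_entities(self):
        """Yield each entity mention together with its pixel-level mask."""
        for (start, end), mask in zip(self.entity_spans, self.masks):
            yield self.caption[start:end], mask


# Toy record; all values are dummies, not taken from the dataset.
record = GroundedCaptionRecord(
    image_path="example.jpg",
    caption="A wet black and white dog runs under a clear blue sky.",
    entity_spans=[(2, 25), (39, 53)],
    masks=[np.zeros((480, 640), dtype=bool), np.ones((480, 640), dtype=bool)],
)
for mention, mask in record.iter_entities():
    print(mention, int(mask.sum()))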


Qualitative Results



Qualitative Performance. Various examples illustrate that PANORAMA produces coherent image-level captions while densely grounding a wide variety of objects, including small instances and stuff regions, across diverse scenes. Note how the model aligns entity mentions in the captions with accurate image masks and generates detailed, semantically coherent descriptions that capture both the global scene context (e.g. “clear blue sky” and “leafless trees” in the second row) and salient object-level details (“wet black and white dog” in the third row).


BibTeX

@article{,
  author    = {Sara Pieri and Evangelos Kazakos and Shizhe Chen and Josef Sivic and Cordelia Schmid},
  title     = {PANORAMA: Panoptic grounded captioning via mask-guided refinement},
  booktitle = {},
  year      = {},
}

Acknowledgements

This work was performed using HPC resources from GENCI-IDRIS (Grants 2025-AD011015795 and AD011015795R1). It was funded in part by the French government under the management of Agence Nationale de la Recherche as part of the “France 2030” program, reference ANR-23-IACL-0008 (PR[AI]RIE-PSAI project), and by the ANR project VideoPredict (ANR-21-FAI1-0002-01). Cordelia Schmid would like to acknowledge the support of the Körber European Science Prize.