Amodal Counting using VLMs

I am working on the CAPTURe benchmark, which evaluates the ability of vision-language models (VLMs) to perform amodal counting—inferring the number of occluded objects based on visible patterns. This task reflects critical spatial reasoning skills that current VLMs largely lack.

My goal is to develop a system that improves counting accuracy under occlusion by enhancing pattern recognition, integrating visual and textual cues, and exploring hybrid approaches such as inpainting pipelines, coordinate extraction, or reasoning modules.
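
As a minimal, illustrative sketch of the coordinate-extraction idea (not the final system), the snippet below assumes a VLM has already been prompted to return the (x, y) centers of the *visible* objects, and then extrapolates the regular grid pattern to estimate the total count including occluded items. The function name `estimate_total_count`, the toy data, and the clustering thresholds are all hypothetical placeholders, not part of the CAPTURe benchmark or any existing codebase.

```python
# Sketch: infer the full object count from visible object centers by
# assuming the objects lie on an approximately regular grid, so that
# total = (#rows inferred) x (#columns inferred).
import numpy as np


def estimate_total_count(visible_points: np.ndarray) -> int:
    """Estimate visible + occluded count from (N, 2) visible centers."""
    xs, ys = visible_points[:, 0], visible_points[:, 1]

    def count_levels(values: np.ndarray) -> int:
        # Cluster 1-D coordinates into grid lines: a new row/column starts
        # whenever the gap between sorted values exceeds half the median
        # non-zero spacing. Threshold of 0.5 is an arbitrary assumption.
        v = np.sort(values)
        gaps = np.diff(v)
        if len(gaps) == 0:
            return 1
        spacing = np.median(gaps[gaps > 1e-6]) if np.any(gaps > 1e-6) else 1.0
        return int(np.sum(gaps > 0.5 * spacing)) + 1

    return count_levels(ys) * count_levels(xs)


if __name__ == "__main__":
    # Toy example: a 4x5 grid of cups with 6 of them occluded.
    full_grid = np.array([[c, r] for r in range(4) for c in range(5)], float)
    visible = np.delete(full_grid, [7, 8, 12, 13, 17, 18], axis=0)
    print(estimate_total_count(visible))  # -> 20 (4 rows x 5 columns)
```

A real pipeline would replace the toy coordinates with parsed VLM output (or detector output after inpainting the occluder) and would need to handle non-grid layouts, which this sketch does not.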

Example of an occluded image with an output from GPT-4o. While people can easily infer the number of missing cups and reason correctly over the occluded pattern, models generally struggle to reason over these occluded scenes.
---
Work in progress
Expected Completion Date: November 2025
---