Video Individual Counting (VIC), which counts each unique individual in a video sequence exactly once, has broader applications than traditional Video Crowd Counting (VCC), including urban planning, event management, and safety monitoring. Although current VIC approaches perform well, their reliance on identity-level or group-level annotations demands substantial labeling effort and expense. To reduce this annotation cost, we introduce VIC-SSL, a novel self-supervised learning approach that exploits unlabeled data through a feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting in feature space rather than image space, F-ShiftMix generates realistic crowd motion without explicit annotations while preserving global semantic coherence. VIC-SSL further integrates a Cost-guided Flow Prompt (CFP) and Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization and inter-frame correspondence learning. Extensive experiments on three datasets, SenseCrowd, CroHD, and CARLA, show that VIC-SSL substantially outperforms existing methods, achieving state-of-the-art results with significantly reduced data requirements. These results demonstrate VIC-SSL's potential to substantially lower annotation costs and improve the deployment feasibility of VIC systems in complex scenarios.
@inproceedings{huang2025flowing,
  title     = {Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting},
  author    = {Feng-Kai Huang and Bo-Lun Huang and Li-Wu Tsao and Jhih-Ciang Wu and Hong-Han Shuai and Wen-Huang Cheng},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
  year      = {2025}
}