[ACM MM 2025] Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting


    1National Taiwan University, 2National Yang Ming Chiao Tung University,
    3National Taiwan Normal University

    Teaser

VIC-SSL: A self-supervised method that learns inter-frame correspondences without manual annotations via the Foreground-driven ShiftMix (F-ShiftMix) augmentation strategy, reducing the labeled data required for downstream Video Individual Counting.

    Abstract

Video Individual Counting (VIC), which seeks to count unique individuals across video sequences without duplication, has broader applications than traditional Video Crowd Counting (VCC), including urban planning, event management, and safety monitoring. However, although current VIC approaches have demonstrated strong capabilities, their reliance on identity-level or group-level annotations necessitates substantial labeling effort and expense. To reduce the high costs of manual annotation, we introduce VIC-SSL, a novel self-supervised learning approach that utilizes unlabeled data together with a new feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting in the feature space rather than the image space, F-ShiftMix generates realistic crowd motion without explicit annotations, while preserving global semantic coherence. Furthermore, VIC-SSL integrates the Cost-guided Flow Prompt (CFP) and the Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization and inter-frame correspondence learning. Our extensive experiments across three datasets, including SenseCrowd, CroHD, and CARLA, demonstrate that VIC-SSL substantially outperforms existing methods, achieving state-of-the-art results with significantly reduced data requirements. These results showcase VIC-SSL's potential to dramatically lower annotation costs and improve the deployment feasibility of VIC systems in complex scenarios.

    Overall Architecture

    Pipeline

Architecture of VIC-SSL. VIC-SSL consists of two stages: self-supervised pre-training and fine-tuning. In pre-training, a feature extractor processes an input frame \(I\), and the F-ShiftMix module simulates crowd motion by blending and shifting its feature map \(F\) to create a pseudo-reference feature map \(\tilde{F}\). A Cost-guided Flow Prompt (CFP) \(\rho\) is then generated, guiding the Distinction-aware Cross-Attention (DCA) to emphasize frame differences. During fine-tuning, a real reference frame is used instead. After distinction-aware features are extracted by DCA, a decoupling mask \(M\) generated by the mask decoder identifies the distinct regions and decouples the feature map, from which the density decoders predict the inflow and shared crowd densities.
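The core of pre-training is synthesizing a plausible "next frame" in feature space. The sketch below illustrates the blend-and-shift idea behind F-ShiftMix in plain numpy; the shift amount, blend weight `alpha`, and the exact mixing rule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def f_shiftmix(feat, fg_mask, shift=(2, 3), alpha=0.7):
    """Sketch of a Foreground-driven ShiftMix-style augmentation.

    feat    : (C, H, W) feature map F from the extractor
    fg_mask : (H, W) binary foreground (crowd) mask
    shift   : (dy, dx) translation applied to foreground features,
              simulating crowd motion between frames (assumed form)
    alpha   : blend weight between shifted and original features
    Returns a pseudo-reference feature map F~ with the same shape as F.
    """
    _, H, W = feat.shape
    dy, dx = shift
    shifted = np.zeros_like(feat)
    smask = np.zeros_like(fg_mask)
    # Destination and source windows for a zero-padded translation.
    ys, xs = slice(max(dy, 0), H + min(dy, 0)), slice(max(dx, 0), W + min(dx, 0))
    yt, xt = slice(max(-dy, 0), H + min(-dy, 0)), slice(max(-dx, 0), W + min(-dx, 0))
    shifted[:, ys, xs] = feat[:, yt, xt]
    smask[ys, xs] = fg_mask[yt, xt]
    # Blend the shifted foreground into the original map; the background
    # (where the shifted mask is zero) keeps its original features.
    m = smask[None].astype(feat.dtype)
    return feat * (1 - m) + (alpha * shifted + (1 - alpha) * feat) * m
```

Operating in feature space rather than image space is what lets the augmentation preserve global semantic coherence while still producing localized "motion" for the model to match against.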

    Quantitative Results


    Performance comparison on SenseCrowd dataset

    On the SenseCrowd dataset, VIC-SSL establishes a new state-of-the-art across multiple metrics. With its self-supervised pre-training, it significantly surpasses the previous leading method, CGNet, achieving improvements of 14.6% in MAE and 28.8% in MSE. The model's strength is particularly evident in challenging conditions, as it reduces the MAE by 29.8 compared to PDTR in the highest-density scenarios (\(\mathcal{D}_4\)). This superior performance stems from its enhanced ability to learn inter-frame correspondences from unlabeled data, with the DCA module improving feature alignment and the CFP module discerning fine-grained local movements.
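For reference, the two headline metrics can be computed as follows. Note that in the crowd-counting literature "MSE" conventionally denotes the root of the mean squared error; this convention is assumed here, and `pred_counts`/`gt_counts` are hypothetical per-video total counts.

```python
import numpy as np

def vic_metrics(pred_counts, gt_counts):
    """MAE and 'MSE' over per-video unique-individual counts.

    Following crowd-counting convention, 'MSE' here is the root
    mean squared error (an assumption about this paper's usage).
    """
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    err = pred - gt
    mae = np.abs(err).mean()          # mean absolute count error
    mse = np.sqrt((err ** 2).mean())  # root mean squared count error
    return mae, mse
```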


(a) MAE and (b) MSE when fine-tuning on different amounts of labeled data on SenseCrowd.

VIC-SSL demonstrates high data efficiency, requiring only 50% of the labeled data to outperform state-of-the-art methods in MSE. Even with just 25% of the labeled data, it matches DRNet's performance and surpasses FMDC. This strong result stems from the self-supervised pre-training strategy, which effectively learns robust inter-frame correspondences from abundant unlabeled data; consequently, less labeled data is required during fine-tuning to reach high performance.

    Qualitative Results


    Comparison of DCA attention maps learned with and without pre-training

    A model trained from scratch struggles to maintain visual correspondence, resulting in mismatched attention values as it loses track of individuals across frames. Conversely, the pre-trained VIC-SSL model demonstrates a strong ability to accurately associate the same individuals between frames, even without using any identity-level labels. This robust correspondence is a key factor that substantially boosts the final VIC performance, validating the effectiveness of the proposed F-ShiftMix pre-training strategy.
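The attention maps visualized above come from cross-attention between the two frames' features. A minimal numpy sketch of the underlying scaled dot-product mechanism is given below; the distinction-aware weighting and the CFP guidance that DCA adds on top are omitted, so this is only the baseline operation, not the paper's full module.

```python
import numpy as np

def cross_attention(q_feat, kv_feat):
    """Scaled dot-product cross-attention between two frames' tokens.

    q_feat  : (Nq, C) tokens from the current frame
    kv_feat : (Nk, C) tokens from the (pseudo-)reference frame
    Returns the attended features (Nq, C) and the attention map (Nq, Nk),
    whose rows are the per-query correspondence distributions that the
    qualitative figures visualize.
    """
    C = q_feat.shape[1]
    scores = q_feat @ kv_feat.T / np.sqrt(C)       # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over reference tokens
    return attn @ kv_feat, attn
```

A well-matched pair of frames should yield sharp, near-one-hot rows in `attn` for each individual, which is exactly the behavior the pre-trained model exhibits and the from-scratch model lacks.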

    BibTeX

    @inproceedings{huang2025flowing,
      title={Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting},
      author={Feng-Kai Huang and Bo-Lun Huang and Li-Wu Tsao and Jhih-Ciang Wu and Hong-Han Shuai and Wen-Huang Cheng},
      booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
      year={2025}
    }