[ACM MM 2025] Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting


    1National Taiwan University, 2National Yang Ming Chiao Tung University,
    3National Taiwan Normal University

    Teaser

VIC-SSL: A self-supervised method that learns inter-frame correspondences without manual annotations via the Foreground-driven ShiftMix (F-ShiftMix) augmentation strategy, reducing the labeled data required for downstream Video Individual Counting.

    Abstract

Video Individual Counting (VIC), which seeks to count unique individuals across video sequences without duplication, has broader applications than traditional Video Crowd Counting (VCC), including urban planning, event management, and safety monitoring. However, although current VIC approaches have demonstrated strong capabilities, their reliance on identity-level or group-level annotations necessitates substantial labeling effort and expense. To reduce the high costs of manual annotation, we introduce VIC-SSL, a novel self-supervised learning approach that utilizes unlabeled data together with a new feature-level augmentation technique called Foreground-driven ShiftMix (F-ShiftMix). By blending and shifting in the feature space rather than the image space, F-ShiftMix generates realistic crowd motion without explicit annotations, while preserving global semantic coherence. Furthermore, VIC-SSL integrates the Cost-guided Flow Prompt (CFP) and the Distinction-aware Cross-Attention (DCA) to enhance flow-aware localization and inter-frame correspondence learning. Our extensive experiments across three datasets, including SenseCrowd, CroHD, and CARLA, demonstrate that VIC-SSL substantially outperforms existing methods, achieving state-of-the-art results with significantly reduced data requirements. These results showcase VIC-SSL's potential to dramatically lower annotation costs and improve the deployment feasibility of VIC systems in complex scenarios.

    Overall Architecture

    Pipeline

Architecture of VIC-SSL. VIC-SSL consists of two stages: self-supervised pre-training and fine-tuning. In pre-training, a feature extractor processes an input frame \(I\), and the F-ShiftMix module simulates crowd motion by blending and shifting its feature map \(F\) to create a pseudo-reference feature map \(\tilde{F}\). A Cost-guided Flow Prompt (CFP) \(\rho\) is then generated, guiding the Distinction-aware Cross-Attention (DCA) to emphasize frame differences. During fine-tuning, a real reference frame is used instead. After distinction-aware features are extracted by DCA, a decoupling mask \(M\) generated by the mask decoder identifies the distinct regions and decouples the feature map, from which the density decoders predict the inflow and shared crowd densities.
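The core of pre-training is synthesizing a plausible "next frame" in feature space. The sketch below illustrates the blend-and-shift idea behind F-ShiftMix in plain numpy; the shift amount, blend weight `alpha`, and the exact mixing rule are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def f_shiftmix(feat, fg_mask, shift=(2, 3), alpha=0.7):
    """Sketch of a Foreground-driven ShiftMix-style augmentation.

    feat    : (C, H, W) feature map F from the extractor
    fg_mask : (H, W) binary foreground (crowd) mask
    shift   : (dy, dx) translation applied to foreground features,
              simulating crowd motion between frames (assumed form)
    alpha   : blend weight between shifted and original features
    Returns a pseudo-reference feature map F~ with the same shape as F.
    """
    _, H, W = feat.shape
    dy, dx = shift
    shifted = np.zeros_like(feat)
    smask = np.zeros_like(fg_mask)
    # Destination and source windows for a zero-padded translation.
    ys, xs = slice(max(dy, 0), H + min(dy, 0)), slice(max(dx, 0), W + min(dx, 0))
    yt, xt = slice(max(-dy, 0), H + min(-dy, 0)), slice(max(-dx, 0), W + min(-dx, 0))
    shifted[:, ys, xs] = feat[:, yt, xt]
    smask[ys, xs] = fg_mask[yt, xt]
    # Blend the shifted foreground into the original map; the background
    # (where the shifted mask is zero) keeps its original features.
    m = smask[None].astype(feat.dtype)
    return feat * (1 - m) + (alpha * shifted + (1 - alpha) * feat) * m
```

Operating in feature space rather than image space is what lets the augmentation preserve global semantic coherence while still producing localized "motion" for the model to match against.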

    Quantitative Results


    Performance comparison on SenseCrowd dataset

    On the SenseCrowd dataset, VIC-SSL establishes a new state-of-the-art across multiple metrics. With its self-supervised pre-training, it significantly surpasses the previous leading method, CGNet, achieving improvements of 14.6% in MAE and 28.8% in MSE. The model's strength is particularly evident in challenging conditions, as it reduces the MAE by 29.8 compared to PDTR in the highest-density scenarios (\(\mathcal{D}_4\)). This superior performance stems from its enhanced ability to learn inter-frame correspondences from unlabeled data, with the DCA module improving feature alignment and the CFP module discerning fine-grained local movements.
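For reference, the two headline metrics can be computed as follows. Note that in the crowd-counting literature "MSE" conventionally denotes the root of the mean squared error; this convention is assumed here, and `pred_counts`/`gt_counts` are hypothetical per-video total counts.

```python
import numpy as np

def vic_metrics(pred_counts, gt_counts):
    """MAE and 'MSE' over per-video unique-individual counts.

    Following crowd-counting convention, 'MSE' here is the root
    mean squared error (an assumption about this paper's usage).
    """
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    err = pred - gt
    mae = np.abs(err).mean()          # mean absolute count error
    mse = np.sqrt((err ** 2).mean())  # root mean squared count error
    return mae, mse
```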


(a) MAE and (b) MSE when fine-tuning on different amounts of labeled data on SenseCrowd.

VIC-SSL demonstrates high data efficiency, requiring only 50% of the labeled data to outperform state-of-the-art methods in MSE. Even with just 25% of the labeled data, it matches DRNet's performance and surpasses FMDC. This strong result stems from the self-supervised pre-training strategy, which effectively learns robust inter-frame correspondences from abundant unlabeled data; consequently, less labeled data is required during fine-tuning to reach high performance.

    Qualitative Results


    Comparison of DCA attention maps learned with and without pre-training

    A model trained from scratch struggles to maintain visual correspondence, resulting in mismatched attention values as it loses track of individuals across frames. Conversely, the pre-trained VIC-SSL model demonstrates a strong ability to accurately associate the same individuals between frames, even without using any identity-level labels. This robust correspondence is a key factor that substantially boosts the final VIC performance, validating the effectiveness of the proposed F-ShiftMix pre-training strategy.
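The attention maps visualized above come from cross-attention between the two frames' features. A minimal numpy sketch of the underlying scaled dot-product mechanism is given below; the distinction-aware weighting and the CFP guidance that DCA adds on top are omitted, so this is only the baseline operation, not the paper's full module.

```python
import numpy as np

def cross_attention(q_feat, kv_feat):
    """Scaled dot-product cross-attention between two frames' tokens.

    q_feat  : (Nq, C) tokens from the current frame
    kv_feat : (Nk, C) tokens from the (pseudo-)reference frame
    Returns the attended features (Nq, C) and the attention map (Nq, Nk),
    whose rows are the per-query correspondence distributions that the
    qualitative figures visualize.
    """
    C = q_feat.shape[1]
    scores = q_feat @ kv_feat.T / np.sqrt(C)       # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over reference tokens
    return attn @ kv_feat, attn
```

A well-matched pair of frames should yield sharp, near-one-hot rows in `attn` for each individual, which is exactly the behavior the pre-trained model exhibits and the from-scratch model lacks.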

    BibTeX

    @inproceedings{huang2025flowing,
      title={Flowing Crowd to Count Flows: A Self-Supervised Framework for Video Individual Counting},
      author={Feng-Kai Huang and Bo-Lun Huang and Li-Wu Tsao and Jhih-Ciang Wu and Hong-Han Shuai and Wen-Huang Cheng},
      booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
      year={2025}
    }