Abstract

In the field of autonomous driving, cooperative perception through vehicle-to-vehicle communication significantly enhances environmental understanding by leveraging multi-sensor data, including LiDAR, cameras, and radar. However, traditional early- and late-fusion methods demand high bandwidth and substantial computational resources, making it difficult to balance data transmission efficiency against perception accuracy, especially for the detection of smaller objects such as pedestrians. To address these challenges, this paper proposes a novel cooperative perception framework based on two-stage intermediate-level sensor feature fusion, specifically designed for complex traffic scenarios in which pedestrians and vehicles coexist. In such scenarios, the model outperforms mainstream perception methods in detecting small objects such as pedestrians, while also improving cooperative perception accuracy for medium and large objects such as vehicles. To thoroughly validate the reliability of the proposed model, we conduct both qualitative and quantitative experiments on mainstream simulated and real-world datasets. The experimental results demonstrate that our approach outperforms state-of-the-art perception models in terms of mAP, achieving up to a 4.1% improvement in vehicle detection accuracy and a 29.2% improvement in pedestrian detection accuracy.