Urban Scene Segmentation and Cross-Dataset Transfer Learning using SegFormer
Abstract
Semantic segmentation is essential for autonomous driving applications, but state-of-the-art models are typically evaluated on large datasets like Cityscapes, leaving smaller datasets underexplored. This research gap limits our understanding of how transformer-based models generalize across diverse urban scenes with limited training data. This paper presents a comprehensive evaluation of SegFormer architectural variants (B3, B4, B5) on the CamVid dataset and investigates cross-dataset transfer learning from CamVid to KITTI. Using an optimization framework combining cross-entropy loss with class weighting and boundary-aware components, our experiments establish
new performance baselines on CamVid and demonstrate that transfer learning provides benefits when target-domain data is limited. We achieve a modest 2.57% relative mean Intersection over Union (mIoU) improvement on KITTI through knowledge transfer from CamVid, along with 61.1% faster convergence. Additionally, we observe substantial class-specific improvements of up to 30.75% for challenging categories. Our analysis provides insights into model scaling effects, cross-dataset knowledge transfer mechanisms, and practical strategies for addressing data scarcity in urban scene segmentation.
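To make the optimization framework concrete, the sketch below shows one plausible PyTorch realization of a class-weighted cross-entropy combined with a boundary-aware component. The boundary-extraction method (a max-pool/min-pool neighborhood test), the mixing coefficient lambda_boundary, and the helper names are illustrative assumptions, not the paper's exact recipe.

```python
# A minimal sketch, assuming a PyTorch training setup. The boundary detector,
# the mixing weight lambda_boundary, and all names here are hypothetical.
import torch
import torch.nn.functional as F

def boundary_weight_map(labels: torch.Tensor, kernel: int = 3) -> torch.Tensor:
    """Mark pixels whose local neighborhood contains more than one class.

    labels: (B, H, W) integer class map.
    Returns a (B, H, W) float map that is 1.0 on class boundaries, else 0.0.
    """
    lab = labels.float().unsqueeze(1)                   # (B, 1, H, W)
    pad = kernel // 2
    local_max = F.max_pool2d(lab, kernel, stride=1, padding=pad)
    local_min = -F.max_pool2d(-lab, kernel, stride=1, padding=pad)  # min-pool
    return (local_max != local_min).squeeze(1).float()

def combined_loss(logits, labels, class_weights,
                  lambda_boundary=1.0, ignore_index=255):
    """Class-weighted cross-entropy plus a boundary-weighted CE term."""
    per_pixel = F.cross_entropy(logits, labels, weight=class_weights,
                                ignore_index=ignore_index, reduction="none")
    base = per_pixel.mean()
    boundary = boundary_weight_map(labels)
    # Re-average the same per-pixel loss over boundary pixels only,
    # guarding against batches with no detected boundaries.
    boundary_term = (per_pixel * boundary).sum() / boundary.sum().clamp(min=1.0)
    return base + lambda_boundary * boundary_term
```

In training, `combined_loss(model(images), masks, weights)` would replace plain cross-entropy; the class weights counteract label imbalance (e.g. rare CamVid classes), while the boundary term emphasizes pixels near object contours, where segmentation errors concentrate.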