Weakly-Supervised Stitching Network
for Real-World Panoramic Image Generation

ECCV 2022



Dae-Young Song1, Geonsoo Lee1, HeeKyung Lee2, Gi-Mun Um2, and Donghyeon Cho1

1Computer Vision and Image Processing (CVIP) Lab., Chungnam National University, Daejeon, South Korea
2Electronics and Telecommunications Research Institute (ETRI), Daejeon, South Korea

Abstract


Generate a 360° panorama without genuine ground truth.

Recently, end-to-end deep learning-based stitching models have attracted growing attention. However, the most challenging aspect of deep learning-based stitching is obtaining pairs of narrow-field-of-view input images and wide-field-of-view ground-truth images captured from real-world scenes. To overcome this difficulty, we develop a weakly-supervised learning mechanism that trains the stitching model without genuine ground-truth images. In addition, we propose a stitching model that takes multiple real-world fisheye images as input and produces a 360° output image in equirectangular projection format. In particular, our model consists of color consistency correction, warping, and blending, and is trained with perceptual and SSIM losses. The effectiveness of the proposed algorithm is verified on two real-world stitching datasets.



Image Stitching Network Architecture


[Figure unavailable: overall architecture of the image stitching network.]
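Although the architecture figure is unavailable, the abstract's description of the model (color consistency correction, then warping into the equirectangular domain, then blending) can be summarized as a composition of three modules. Below is a minimal PyTorch sketch; the submodule names, interfaces, and tensor shapes are illustrative assumptions, not the paper's actual implementation.

import torch.nn as nn

class StitchingPipeline(nn.Module):
    """Illustrative three-stage pipeline: color correction -> warping -> blending.

    All submodules are hypothetical placeholders; the paper defines the actual networks.
    """

    def __init__(self, color_net, warp_net, blend_net):
        super().__init__()
        self.color_net = color_net  # per-view color consistency correction
        self.warp_net = warp_net    # fisheye -> equirectangular warping
        self.blend_net = blend_net  # blends the warped views into one panorama

    def forward(self, fisheye_views):
        # fisheye_views: (B, N, C, H, W), N fisheye inputs per scene
        b, n, c, h, w = fisheye_views.shape
        views = fisheye_views.reshape(b * n, c, h, w)
        corrected = self.color_net(views)                 # harmonize colors across views
        warped = self.warp_net(corrected)                 # project onto the ERP grid
        warped = warped.reshape(b, n, *warped.shape[1:])  # regroup views per scene
        return self.blend_net(warped)                     # (B, C, H_erp, W_erp) panorama

Splitting the pipeline this way mirrors the order named in the abstract, so each stage can be supervised or ablated independently.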



Dataset Configurations


[Figure unavailable: dataset configurations.]


More Ablation Studies


[Figures unavailable: additional ablation study results.]


Photoshop Result


[Figure unavailable: Photoshop stitching result.]


Effect of Local Warping Layer


[Figure unavailable: effect of the local warping layer.]


Additional Description for Perceptual Loss


[Figures unavailable: perceptual loss illustrations.]

Geometric distortions occur if an L1 loss or low-level feature maps are used for training, because the optical centers of the input and GT cameras differ. Therefore, we adopt high-level feature maps (the outputs of the 3rd, 4th, and 5th max-pooling layers) to compute the perceptual loss. Since VGG-16 is trained for classification, its low-level feature maps compare edge- and shape-like features against the GT, which are sensitive to this misalignment, whereas its high-level feature maps compare the object-level semantics used for classification. The figure below visualizes the features at each max-pooling level of VGG-16.


[Figure unavailable: feature visualizations for each max-pooling level of VGG-16.]

Download [ file1 | file2 | file3 ] to view the visualizations at a larger resolution.
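As a concrete illustration of the loss described above, below is a minimal PyTorch sketch that accumulates a feature distance at the outputs of the 3rd, 4th, and 5th max-pooling layers of a fixed, ImageNet-pretrained VGG-16 (indices 16, 23, and 30 in torchvision's vgg16().features). Using an L1 distance between feature maps is an illustrative assumption; the paper specifies the exact form of its perceptual loss.

import torch
import torch.nn.functional as F
from torchvision import models

class HighLevelPerceptualLoss(torch.nn.Module):
    """Perceptual loss on the 3rd/4th/5th max-pooling outputs of VGG-16."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)  # VGG-16 stays a fixed feature extractor
        self.vgg = vgg
        self.taps = {16, 23, 30}  # the 3rd, 4th, and 5th MaxPool2d layers

    def forward(self, pred, target):
        # pred/target: (B, 3, H, W), assumed already ImageNet-normalized
        loss, x, y = 0.0, pred, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.taps:
                # Compare only high-level maps; low-level maps are skipped
                # because they penalize the small geometric misalignments
                # caused by the differing camera centers.
                loss = loss + F.l1_loss(x, y)
        return loss

At training time this term would be combined with the SSIM loss mentioned in the abstract, e.g. loss = perc(pred, gt) + w * (1 - ssim(pred, gt)); the weighting w and the exact SSIM formulation are left unspecified here.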



Contact


For any questions, please contact eadyoung@naver.com or eadgaudiyoung@gmail.com.



Citation



@InProceedings{Song2022Weakly,
  author={Song, Dae-Young and Lee, Geonsoo and Lee, HeeKyung and Um, Gi-Mun and Cho, Donghyeon},
  title={Weakly-Supervised Stitching Network for Real-World Panoramic Image Generation},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={54--71},
  year={2022},
  organization={Springer}
}

@article{song2021end,
  title={End-to-End Image Stitching Network via Multi-Homography Estimation},
  author={Song, Dae-Young and Um, Gi-Mun and Lee, Hee Kyung and Cho, Donghyeon},
  journal={IEEE Signal Processing Letters (SPL)},
  volume={28},
  pages={763--767},
  year={2021},
  publisher={IEEE}
}