Keywords: building; deep learning; hyperparameter; self-attention; vision transformer

MeSH: Semantics; Electric Power Supplies; Neural Networks, Computer; Problem Solving; Telemetry; Image Processing, Computer-Assisted

Source: DOI:10.3390/s23115166

Abstract:
Semantic segmentation with deep learning networks has become an important approach to extracting objects from very-high-resolution (VHR) remote sensing images. Vision Transformer networks have shown significant performance improvements over traditional convolutional neural networks (CNNs) in semantic segmentation, and their architecture differs from that of CNNs. Image patches, linear embedding, and multi-head self-attention (MHSA) are among their main hyperparameters. How these should be configured for object extraction from VHR images, and how they affect network accuracy, has not been sufficiently investigated. This article explores the role of vision Transformer networks in extracting building footprints from VHR images. Transformer-based models with different hyperparameter values were designed and compared, and the impact of those values on accuracy was analyzed. The results show that smaller image patches and higher-dimensional embeddings yield better accuracy. In addition, the Transformer-based network is shown to be scalable: it can be trained on general-scale graphics processing units (GPUs) with model sizes and training times comparable to those of CNNs while achieving higher accuracy. The study provides valuable insights into the potential of vision Transformer networks for object extraction from VHR images.
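
The three hyperparameters named in the abstract can be made concrete with a small NumPy sketch. It is an illustration only, not the authors' model: the function names (`patchify`, `embed_patches`, `token_stats`), the random untrained projection, and every numeric value are assumptions chosen for the example. It shows how patch size sets the number of tokens, how the linear embedding sets token width, and how the number of attention heads divides that width.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(patches, embed_dim, rng):
    """Project flattened patches to embed_dim with random, untrained weights."""
    weight = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
    return patches @ weight


def token_stats(image_size, patch_size, embed_dim, num_heads):
    """Token count, per-head width, and attention-matrix size for a square image."""
    if embed_dim % num_heads:
        raise ValueError("embed_dim must be divisible by num_heads")
    num_tokens = (image_size // patch_size) ** 2
    return {
        "num_tokens": num_tokens,
        "head_dim": embed_dim // num_heads,
        # Self-attention compares every token with every other token.
        "attention_entries_per_head": num_tokens ** 2,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))  # hypothetical tile size, not from the paper
    for patch_size in (16, 4):       # illustrative patch sizes
        tokens = embed_patches(patchify(image, patch_size), embed_dim=96, rng=rng)
        print(patch_size, tokens.shape, token_stats(64, patch_size, 96, num_heads=3))
```

Smaller patches produce more tokens per image, and the number of pairwise attention entries grows with the square of the token count, so the finer spatial detail comes at a higher memory and compute cost.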