Keywords: Numba; Python; Zernike moments; bioinformatics; computational geometry; molecular surface; principal component analysis; protein structure; shape retrieval

Source: DOI:10.3390/molecules29010052

Abstract:
Object retrieval systems measure the degree of shape similarity between 3D models and search 3D model databases for the elements that resemble a query model. In structural bioinformatics, the query model is a protein tertiary/quaternary structure, and the objective is to find similarly shaped molecules in the Protein Data Bank (PDB). With the ever-growing size of the PDB, a direct atomic coordinate comparison with all its members is impractical. To overcome this problem, the shape of the molecules can be encoded by fixed-length feature vectors. The distance of a protein to the entire PDB can then be measured in this low-dimensional domain in linear time. The state-of-the-art approaches utilize Zernike-Canterakis (ZC) moments for the shape encoding and supply the retrieval process with geometric data of the input structures. The BioZernike descriptors have been a standard utility of the PDB since 2020. However, when trying to calculate the ZC moments locally, one encounters a shortage of libraries that are readily usable in custom programs (i.e., without relying on external binaries), in particular programs written in Python. Here, a fast and well-documented Python implementation of the Pozo-Koehl (PK) algorithm is presented. In contrast to the more popular algorithm by Novotni and Klein, which is based on the voxelized volume, the PK algorithm produces ZC moments directly from the triangular surface meshes of 3D models. In particular, it can accept the molecular surfaces of proteins as its input. In the presented PK-Zernike library, owing to Numba's just-in-time compilation, a mesh with 50,000 facets is processed by a single thread in a second at moment order 20. Since this is the first time the PK algorithm has been used in structural bioinformatics, it is employed in a novel, simple, but efficient protein structure retrieval pipeline. The elimination of the outlying chain fragments via a fast PCA-based subroutine improves the discrimination ability, allowing this pipeline to achieve a 0.961 area under the ROC curve in the BioZernike validation suite (0.997 for the assemblies). The correlation between the results of the proposed approach and those of the 3D Surfer program attains values up to 0.99.
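
The fixed-length descriptor idea in the abstract can be illustrated with a minimal sketch. It assumes that each structure has already been reduced to a vector of equal length and that Euclidean distance is an acceptable dissimilarity measure. The abstract specifies neither the metric nor any API, so the function name `rank_by_descriptor`, the entry identifiers, and the toy numbers below are all hypothetical and are not part of the PK-Zernike library.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def rank_by_descriptor(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k database entries closest to the query descriptor.

    Hypothetical helper: distances are computed in a single pass over the
    database, so the cost grows linearly with the number of entries.
    """
    scored = []
    for entry_id, descriptor in database.items():
        if len(descriptor) != len(query):
            raise ValueError(f"descriptor length mismatch for {entry_id}")
        # Euclidean distance is an assumption; other metrics could be used.
        scored.append((entry_id, math.dist(query, descriptor)))
    # Selecting the k smallest distances avoids sorting the whole list.
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])


# Toy 4-component descriptors (made-up numbers, not real ZC moments).
toy_database = {
    "entry_A": [1.0, 0.0, 0.0, 0.0],
    "entry_B": [0.0, 1.0, 0.0, 0.0],
    "entry_C": [0.9, 0.1, 0.0, 0.0],
}
toy_query = [1.0, 0.0, 0.0, 0.0]

hits = rank_by_descriptor(toy_query, toy_database, top_k=2)
for entry_id, distance in hits:
    print(f"{entry_id}: {distance:.4f}")
```

In this toy case the query coincides with `entry_A`, so that entry is ranked first, followed by the nearby `entry_C`. A real pipeline would replace the toy vectors with descriptors derived from ZC moments and would likely vectorize the distance computation.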