Keywords: approximation algorithms; average case analysis; big data; coreset; dimensionality reduction; sparsification

Source: DOI:10.3390/s21196689 (PDF via PubMed)

Abstract:
A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik-Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n≫1/ε, where ε is the approximation error, improving on, e.g., both [ICML'17] and applications for principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement in practice as well. Open source code is provided.
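For concreteness, the contrast between the standard worst-case guarantee and the average-case relaxation described above, together with the vector summarization objective, can be written roughly as follows. This is one natural formalization; the exact loss functions, norms, and normalizations used in the paper may differ.

```latex
% Worst-case coreset (C, w) for input P and a finite query set Q:
\max_{q \in Q}\; \bigl|\, \mathrm{loss}_{C,w}(q) - \mathrm{loss}_{P}(q) \,\bigr| \;\le\; \varepsilon
% Average-case relaxation suggested in this work:
\frac{1}{|Q|} \sum_{q \in Q} \bigl|\, \mathrm{loss}_{C,w}(q) - \mathrm{loss}_{P}(q) \,\bigr| \;\le\; \varepsilon
% Vector summarization: given p_1, \dots, p_n \in \mathbb{R}^d, find sparse
% non-negative weights w such that
\Bigl\| \sum_{i=1}^{n} w_i\, p_i \;-\; \sum_{i=1}^{n} p_i \Bigr\| \;\le\; \varepsilon \sum_{i=1}^{n} \| p_i \|
```

The sketch below only illustrates the vector summarization task itself with a generic norm-proportional importance-sampling estimator; it is a simple randomized baseline, not the paper's deterministic or linear-time construction. The function name vector_summarization_sample and the sampling distribution are assumptions made for illustration.

```python
import numpy as np

def vector_summarization_sample(P, m, rng=None):
    """Return indices and weights of a size-m weighted subset of the rows of P
    whose weighted sum is an unbiased estimate of P.sum(axis=0).
    (Illustrative baseline only; not the paper's algorithm.)"""
    rng = np.random.default_rng(rng)
    norms = np.linalg.norm(P, axis=1)
    total = norms.sum()
    if total == 0.0:                      # all-zero input: any subset is exact
        k = min(m, len(P))
        return np.arange(k), np.ones(k)
    probs = norms / total                 # sample proportionally to vector norms
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])      # inverse-probability weights -> unbiased sum
    return idx, weights

# Usage: the weighted subset's sum concentrates around the true sum as m grows.
P = np.random.default_rng(0).normal(size=(100_000, 10))
idx, w = vector_summarization_sample(P, m=500, rng=1)
approx = (w[:, None] * P[idx]).sum(axis=0)
rel_err = np.linalg.norm(approx - P.sum(axis=0)) / np.linalg.norm(P, axis=1).sum()
print(f"relative error: {rel_err:.4f}")
```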