Keywords: approximation algorithms; average case analysis; big data; coreset; dimensionality reduction; sparsification

Source: DOI:10.3390/s21196689 (PDF via PubMed)

Abstract:
A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik-Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n≫1/ε, where ε is the approximation error, improving on, e.g., both [ICML'17] and applications for principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement in practice as well. Open source code is provided.
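For concreteness, the contrast between the standard worst-case guarantee and the average-case relaxation described above, together with the vector summarization objective, can be written roughly as follows. This is one natural formalization; the exact loss functions, norms, and normalizations used in the paper may differ.

```latex
% Worst-case coreset (C, w) for input P and a finite query set Q:
\max_{q \in Q}\; \bigl|\, \mathrm{loss}_{C,w}(q) - \mathrm{loss}_{P}(q) \,\bigr| \;\le\; \varepsilon
% Average-case relaxation suggested in this work:
\frac{1}{|Q|} \sum_{q \in Q} \bigl|\, \mathrm{loss}_{C,w}(q) - \mathrm{loss}_{P}(q) \,\bigr| \;\le\; \varepsilon
% Vector summarization: given p_1, \dots, p_n \in \mathbb{R}^d, find sparse
% non-negative weights w such that
\Bigl\| \sum_{i=1}^{n} w_i\, p_i \;-\; \sum_{i=1}^{n} p_i \Bigr\| \;\le\; \varepsilon \sum_{i=1}^{n} \| p_i \|
```

The sketch below only illustrates the vector summarization task itself with a generic norm-proportional importance-sampling estimator; it is a simple randomized baseline, not the paper's deterministic or linear-time construction. The function name vector_summarization_sample and the sampling distribution are assumptions made for illustration.

```python
import numpy as np

def vector_summarization_sample(P, m, rng=None):
    """Return indices and weights of a size-m weighted subset of the rows of P
    whose weighted sum is an unbiased estimate of P.sum(axis=0).
    (Illustrative baseline only; not the paper's algorithm.)"""
    rng = np.random.default_rng(rng)
    norms = np.linalg.norm(P, axis=1)
    total = norms.sum()
    if total == 0.0:                      # all-zero input: any subset is exact
        k = min(m, len(P))
        return np.arange(k), np.ones(k)
    probs = norms / total                 # sample proportionally to vector norms
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])      # inverse-probability weights -> unbiased sum
    return idx, weights

# Usage: the weighted subset's sum concentrates around the true sum as m grows.
P = np.random.default_rng(0).normal(size=(100_000, 10))
idx, w = vector_summarization_sample(P, m=500, rng=1)
approx = (w[:, None] * P[idx]).sum(axis=0)
rel_err = np.linalg.norm(approx - P.sum(axis=0)) / np.linalg.norm(P, axis=1).sum()
print(f"relative error: {rel_err:.4f}")
```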