文档介绍:摘 要
top-k join 查询返回用户最感兴趣的 k 个连接结果。近来 top-k join 已经成为一个重要的研究课题,其在 Web 数据库,信息抽取和数据挖掘中均有应用。星型模式的数据仓库在实际应用中也存在 top-k join 查询,如有时决策者只想查询星型连接结果中他最感兴趣的 k 个。然而,现有 top-k join 算法不适合星型模式。为了在星型模式上有效地支持 top-k join 查询,本文提出两种索引并基于这两种索引提出一个适用于星型模式的多路 top-k join 算法 MTJS。该算法通过采用一个比现有算法更优的上界和一个剪枝策略获得了更高的效率。此外,实验也表明该算法比现有算法效率更高。
获取精确的 top-k join 查询结果的代价是较高的,而且有时决策者希望牺牲 top-k join 查询结果的精确性来缩短查询的执行时间和资源消耗。更重要的是, 我们发现现有的近似 top-k join 算法因没有充分考虑星型模式的固有特点而不适合星型模式, 因而本文提出一个基于星型模式的近似多路 top-k join 算法 MT JS- e 。 MTJS - e 是 MTJS 的一个变种算法,其通过引入一个参数ε来返回近似的 top-k join 结果。MTJS - e 因使用了 MTJS 中的总上界和剪枝策略等优化, 其性能优于现有近似的 top-k join 算法。此外,我们通过实验证明了 MTJS - e 的效率优于现有算法。而且还发现 MTJS - e 返回的结果的实际精确程度远远优于其近似度定义的精确程度。
关键词:数据仓库;星型模式;多路 top-k join 算法
Abstract
Top-k join query returns k join results that users are most interested in. Top -k join has e one of the main research issues recently, and it’s dominant in many applications, for example, data mining, web databases and information retrieval. Top-k join query also exists in data warehouse based on the star schema in practical application. For example, sometimes just the top-k join results that the decision maker is most interested in are desirable. However, the current existing algorithms aren’t suitable for the data warehouse based on the star schema. In order to efficiently support top-k join query on star schema, we propose two kinds of indices and a multiple top-k join algorithm that is suitable for star schema based on these indices. By using a tighter upper bound than current existing algorithms and a pruning strategy, the algorithm is more efficient than the current existing algorithms. Furthermore, the experiment also shows that our algorithm is more efficient than the current existing algorithm.
MTJS is the exact top-k join algorithm. However, obtaining the exact top-k join query results leads to an expensive cost, and sometimes the decision maker is willing to sacrifice the accuracy of the top-k join query results to reduce the execution time of