Advanced Engineering Sciences, 2018, 50(2): 133-140
Distributed and Parallel Construction Method of Equi-width Histogram in Cloud Database
(1. Chengdu Inst. of Computer Applications, Chinese Academy of Sciences, Chengdu 610041, China; 2. Univ. of Chinese Academy of Sciences, Beijing 100049, China; 3. Key Lab. of Advanced Manufacturing Technol., Ministry of Education, Guizhou Univ., Guiyang 550003, China)
Received: 2016-09-26    Revised: 2018-01-02
Abstract: Histograms describe data distributions intuitively and play an important role in database query optimization. In distributed cloud databases, however, existing histogram construction methods make poor use of parallel resources and incur high network traffic. To address these problems, a distributed and parallel method for constructing equi-width histograms in a relational cloud database is proposed. First, because the data stored on different nodes of the cluster are independent of one another, the request-initiating node, working in a master-slave architecture, collects the local maximum and minimum values of the working nodes via the RPC (remote procedure call) protocol before histogram construction starts, and compares them to obtain the global maximum and minimum of the table across the cluster. Second, to reduce the amount of data transmitted while the algorithm runs, each working node scans and sorts its local data and partitions them into histogram buckets defined by the global maximum and minimum; the sub-histograms are thus built in parallel, which improves the utilization of cluster computing resources. Finally, the request-initiating node aggregates the frequencies of buckets with equal boundary values across the sub-histograms to obtain the global histogram. The algorithm divides the histogram task into multiple subtasks that execute in parallel, and transmitting sub-histogram information instead of data shards greatly reduces the load on network bandwidth. The algorithm has been applied in the kernel of a relational cloud database to optimize key parameters of SQL execution paths, such as initial scan cost and data selectivity. Experiments on synthetic data and on real rating data show that the network traffic generated by the algorithm is independent of the number of tuples in the table, and that the algorithm scales well.
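The three phases described in the abstract can be sketched roughly as follows. This is a minimal single-process illustration, not the authors' implementation: the function names (`local_min_max`, `build_sub_histogram`, `merge_sub_histograms`, and so on) are hypothetical, the RPC exchange between the request-initiating node and the working nodes is replaced by plain function calls, each partition is assumed to hold numeric values of one column, and the local sort mentioned in the abstract is omitted because equi-width bucket counts do not depend on it.

```python
from typing import List, Sequence, Tuple


def local_min_max(partition: Sequence[float]) -> Tuple[float, float]:
    """Phase 1, worker side: report the local extremes of one data partition."""
    return min(partition), max(partition)


def global_min_max(extremes: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Phase 1, coordinator side: compare local extremes to get the global ones."""
    lows, highs = zip(*extremes)
    return min(lows), max(highs)


def bucket_index(value: float, lo: float, hi: float, num_buckets: int) -> int:
    """Map a value to an equi-width bucket over [lo, hi]; hi lands in the last bucket."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * num_buckets)
    return min(idx, num_buckets - 1)


def build_sub_histogram(partition: Sequence[float], lo: float, hi: float,
                        num_buckets: int) -> List[int]:
    """Phase 2, worker side: count local values per globally defined bucket."""
    counts = [0] * num_buckets
    for value in partition:
        counts[bucket_index(value, lo, hi, num_buckets)] += 1
    return counts


def merge_sub_histograms(subs: Sequence[Sequence[int]]) -> List[int]:
    """Phase 3, coordinator side: add up frequencies of buckets with equal boundaries."""
    return [sum(column) for column in zip(*subs)]


def distributed_equi_width_histogram(partitions: Sequence[Sequence[float]],
                                     num_buckets: int) -> Tuple[float, float, List[int]]:
    """Run the three phases in sequence; each loop stands in for parallel workers."""
    non_empty = [p for p in partitions if p]
    if not non_empty:
        raise ValueError("at least one partition must contain data")
    lo, hi = global_min_max([local_min_max(p) for p in non_empty])
    subs = [build_sub_histogram(p, lo, hi, num_buckets) for p in non_empty]
    return lo, hi, merge_sub_histograms(subs)


if __name__ == "__main__":
    # Two hypothetical worker partitions of the same column.
    print(distributed_equi_width_histogram([[0, 1, 2], [4, 7, 8]], num_buckets=4))
```

In this sketch each worker sends only two extremes and `num_buckets` counters to the coordinator, regardless of how many values its partition holds, which is consistent with the abstract's claim that network traffic does not depend on the number of tuples.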
Article ID: 201601076     CLC number: TP311.133.1    Document code:
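Funding: National Natural Science Foundation of China (61640209; 91746116); Sichuan Province Science and Technology Innovation Seedling Project (SCMZ2006012); Guizhou Province Science and Technology Program (Qiankehe JZ [2014] No. 2004; Qiankehe Ren (2015) No. 13)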
About the author: WANG Yang (born 1987), male, PhD candidate. Research interest: distributed parallel computing. E-mail: wangyang2014casit@outlook.com
Cite this article:
WANG Yang, ZHONG Yong, ZHOU Weibo, YANG Guanci. Distributed and Parallel Construction Method of Equi-width Histogram in Cloud Database[J]. Advanced Engineering Sciences, 2018, 50(2): 133-140.