Received: 2017-08-05 Revised: 2018-02-28
Abstract: The existing cross-collection topic model (cross-collection LDA, ccLDA) is applicable only when the topics of the individual data sources are highly similar. In addition, it forces the global topics to align with the local topics of each data source, which leads to word sparsity. To address these shortcomings, an improved cross-collection topic model (improved ccLDA, IccLDA) is proposed. During sampling, the model first determines whether a word belongs to a global topic or a local topic and then samples accordingly. This removes the requirement in ccLDA that global and local topics be aligned, reduces the dispersion of words across global and local topics, and makes the model applicable to multi-source scenarios. Topic discovery experiments on multi-source text collections were conducted on public data sets, together with a comparative analysis of the topics. The results show that, for every number of topics tested, the perplexity of IccLDA is lower than that of both LDA and ccLDA, indicating better modeling ability. Finally, further experiments on real-world data sets show that the improved model not only has better modeling ability than the original model, but also effectively discovers the common topics discussed across data sources and the local topics discussed within each data source, making it better suited to topic discovery in multi-source scenarios.
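The model comparison in the abstract rests on perplexity (困惑度), where a lower value means the model assigns higher probability to the observed words. The following is a minimal sketch of the standard perplexity formula for a fitted topic model, exp(-Σ log p(w|d) / N). It is not the authors' evaluation code; the function name `perplexity` and the inputs `doc_topic`, `topic_word`, and `docs` are hypothetical, chosen only for illustration.

```python
import math


def perplexity(doc_topic, topic_word, docs):
    """Perplexity of a topic model on tokenized documents (illustrative sketch).

    doc_topic[d][k]  : assumed probability of topic k in document d
    topic_word[k][w] : assumed probability of word id w under topic k
    docs[d]          : list of word ids in document d
    """
    log_likelihood = 0.0
    n_tokens = 0
    n_topics = len(topic_word)
    for d, words in enumerate(docs):
        for w in words:
            # p(w | d) = sum over topics of p(k | d) * p(w | k)
            p_w = sum(doc_topic[d][k] * topic_word[k][w] for k in range(n_topics))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # Perplexity is the exponentiated negative mean log-likelihood per token.
    return math.exp(-log_likelihood / n_tokens)


# Toy check: two topics that are both uniform over a 4-word vocabulary.
doc_topic = [[0.5, 0.5]]
topic_word = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
docs = [[0, 1, 2, 3]]
print(perplexity(doc_topic, topic_word, docs))
```

For a model that is uniform over a vocabulary of V words, perplexity equals V, which makes the toy case a convenient sanity check before comparing models at different numbers of topics.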
Keywords: topic detection; topic model; LDA; multi-source; IccLDA
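Article ID: 201700626 CLC number: TP391.1 Document code:
Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan Provincial Department of Science and Technology Program (16ZHSF0483)
About the author: CHEN Xingshu (1968-), female, professor, doctoral supervisor, Ph.D. Research interests: cloud computing and big data security; trusted computing. E-mail: chenxsh@scu.edu.cn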
Cite this article:
CHEN Xingshu, MA Chenxi, WANG Wenxian, GAO Yue, WANG Haizhou. Multi-source Topic Detection Analysis Based on Improved ccLDA Model[J]. Advanced Engineering Sciences, 2018, 50(2): 141-147.