Establishing Semantic Similarity of the Cluster Documents and Extracting Key Entities in the Problem of the Semantic Analysis of News Texts

  •  Anastasia Soloshenko    
  •  Yulia Orlova    
  •  Vladimir Rozaliev    
  •  Alla Zaboleeva-Zotova    


This paper is dedicated to the problem of establishing semantic similarity for the documents of the news cluster and extracting key entities from the article's text. The existing methods and algorithms for fuzzy duplicate detection texts are briefly reviewed and analysed, such as TF-IDF and its modifications, Long Sent, Megashingles and Log Shingles, and Lex Rand. The shingles algorithm essence and its main stages are described in detail. Several options of the parallel implementation for the shingles algorithm are presented: for multiprocessor heterogeneous computing systems using CUDA and Open CL and for distributed computing systems using Google App Engine. The parameters of the algorithm (operation time, acceleration) applied to the problem of the semantic analysis for news texts are assessed. In addition, the methods and algorithms for extracting key phrases from the news text are reviewed: graph methods, in particular TextRank, building horizontal visibility graphs, the Viterbi algorithm, types of Markov random fields method, as well as a comprehensive context-sensitive algorithm for news text analysis (a combination of statistical algorithms for extracting key words and algorithms for forming semantic coherence of the text blocks). These methods are analysed from the standpoint of applicability to the news articles analysis. Particular attention is paid to the peculiarities of the news text structure. Although the thematic classification and selection of key entities in text documents are powerful text processing tools, these stages of analysis cannot give a complete picture of the news piece semantics. The paper presents a methodology and a comprehensive analysis of news text, based on a combination of semantic analysis and subsequent text abstracting submitting it in a compressed format - so-called mind map.

This work is licensed under a Creative Commons Attribution 4.0 License.