Over the past 20 years, the rapid development of digital technology has been reshaping how people work and live. Data generated by information-centered activity has grown exponentially, and the resulting information overload has made these data increasingly difficult to handle with traditional techniques, creating a need for new technical solutions. Topic modeling can extract hidden topics from massive text data and surface the issues, opinions, sentiments, and trends they contain. The range of applications for topic models continues to expand: besides being widely used in business and many natural sciences, they are playing a growing role in education, sociology, literature, law, history, philosophy, and other fields of humanities and social science research.
A topic model is a text mining technique that aims to discover the hidden topics in a given collection of texts and to assign topics to each document. Its basic principle is the assumption that each document is composed of multiple topics and each topic is composed of words. By statistically analyzing word frequencies and probabilities, a topic model can infer the hidden topics and classify documents. The technique can model topics in texts at different levels, such as a single sentence, paragraph, article, web page, or whole work. At the sentence level, a topic model can be used to identify the topic of a sentence and help in understanding its meaning. At the level of web pages or social media data, it can be used to mine users' opinions and leanings on a given subject and to understand their interests and preferences across topics. For a book made up of multiple chapters, a topic model can analyze the topic structure and proportions of the whole book; alternatively, each chapter can be treated as a separate text, and a combined analysis can identify the number of topics in each chapter and the share of each topic across chapters, revealing the topic distribution and its trend over the whole book.
Topic modeling usually involves four steps. The first is text preprocessing: each document is converted into a sequence of tokens containing only meaningful words, with stop-word removal, stemming, and similar preprocessing applied as needed. The second is building a document-term matrix, in which each row represents a document, each column represents a word, and each entry gives the number of times the word appears in the document. The third is model building: a topic modeling algorithm is used to estimate the word distribution of each topic and the topic distribution of each document. The last is topic inference: for a new document, the trained model can be used to infer its topic distribution.
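The first two steps can be sketched in a few lines of Python using only the standard library. This is a minimal illustration rather than a production pipeline: the toy corpus, the stop-word list, and the function names are all hypothetical, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stop-word list, chosen only for illustration.
DOCS = [
    "The market is rising and investors are buying stocks.",
    "Investors fear the market and are selling stocks.",
    "The team won the match and the fans are happy.",
]
STOP_WORDS = {"the", "is", "and", "are", "a", "of"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_document_term_matrix(docs):
    """Return (vocabulary, matrix); rows are documents, columns are words."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[w] for w in vocab])
    return vocab, matrix


vocab, dtm = build_document_term_matrix(DOCS)
```

Each row of `dtm` is the word-count vector of one document, which is the input that a topic modeling algorithm would take in the third step.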
Main methods
Topic modeling methods are varied. Broadly speaking, by their mathematical basis, topic models can be divided into probabilistic and non-probabilistic models. Probabilistic topic models mainly include Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), the Structural Topic Model (STM), and hierarchical Latent Dirichlet Allocation (HLDA). Non-probabilistic topic models mainly include Latent Semantic Analysis (LSA) and non-negative matrix factorization (NNMF). In a specific application, the appropriate topic model should be chosen according to the research purpose. Here we mainly discuss three classic topic modeling methods: PLSA, LDA, and STM.
PLSA, developed by Thomas Hofmann, is a word-based text mining and dimensionality reduction technique. It is also the first statistical model to reveal the semantics in the document-term matrix of a text corpus. It moves latent semantic analysis from the framework of linear algebra to the framework of probability and statistics. PLSA laid a foundation for text analysis, but it has some problems. The model contains a large number of parameters, and the number of parameters grows linearly with the number of documents. It also cannot assign probabilities to documents it was not trained on. When applied to a large corpus, it is therefore prone to overfitting.
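The linear growth in parameters can be made concrete with a small count. The sketch below assumes the usual PLSA parameterization, with one topic distribution per training document and one word distribution per topic; the corpus sizes, vocabulary size, and number of topics are hypothetical values chosen only for illustration.

```python
def plsa_parameter_count(num_docs, vocab_size, num_topics):
    """Count free parameters in PLSA under the usual parameterization.

    Each document has its own distribution over topics (num_topics - 1 free
    values), and each topic has a distribution over words (vocab_size - 1).
    """
    doc_topic = num_docs * (num_topics - 1)
    topic_word = num_topics * (vocab_size - 1)
    return doc_topic + topic_word


# Hypothetical sizes: a 5,000-word vocabulary and 20 topics.
VOCAB_SIZE = 5000
NUM_TOPICS = 20
counts = {
    d: plsa_parameter_count(d, VOCAB_SIZE, NUM_TOPICS) for d in (1000, 2000, 3000)
}
```

With the vocabulary and topic number held fixed, every additional 1,000 documents adds the same 19,000 parameters, because the per-document topic distributions are themselves model parameters.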
To solve these problems, David M. Blei and colleagues generalized the PLSA model and proposed a more general statistical language model, LDA. This method allows documents to "overlap" in content instead of being divided into discrete groups, which reflects the typical usage of natural language. Specifically, in this model a document can be formed from the words of multiple topics in varying proportions. Because LDA is a generative model with many variants, it is also easy to adapt to specific application requirements. Whereas PLSA estimates its parameters entirely from the data, LDA introduces prior parameters that compensate for the shortcomings of statistics drawn from limited data, which improves the model's generalization performance.
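The generative view behind LDA can be illustrated with a short simulation. The sketch below is an assumption-laden toy, not an inference algorithm: the vocabulary, the two topic-word distributions, the symmetric prior value `alpha`, and the function names are all hypothetical, and the topic-word distributions are fixed by hand rather than drawn from a prior.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document: draw topic proportions, then a topic and a word per token."""
    num_topics = len(topic_word)
    theta = sample_dirichlet([alpha] * num_topics, rng)  # document-topic proportions
    words = []
    for _ in range(length):
        z = rng.choices(range(num_topics), weights=theta)[0]  # topic for this token
        w = rng.choices(vocab, weights=topic_word[z])[0]      # word from that topic
        words.append(w)
    return theta, words


# Hypothetical vocabulary and hand-written topic-word distributions.
VOCAB = ["market", "stocks", "investors", "match", "team", "fans"]
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "finance" topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "sports" topic
]

rng = random.Random(0)
theta, words = generate_document(TOPIC_WORD, VOCAB, alpha=0.5, length=10, rng=rng)
```

Because `theta` is a proportion vector rather than a single label, a generated document can mix words from both topics, which is the "overlap" described above.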
STM is a further extension of the LDA model. It allows covariates (such as author, time, comment type, comment location, and the speaker's stance) to be incorporated into the prior distributions of the document-topic proportions and the topic-term matrix. In this way, STM can produce the topic structure and its distribution proportions and the contexts in which topics appear at different frequencies, and it can also show charts of topic trends over time and diagrams of vocabulary differences between topics. Therefore, in both theoretical optimality and practical application, STM can optimize the computation according to the researcher's needs.
Application fields
Since their introduction, topic models have been widely used in economics, business, academic research, and other fields. In economics, for example, topic models are often applied to tasks such as forecasting financial market trends, in order to identify market risks and opportunities effectively. In business, topic models can analyze product reviews and social media texts, helping companies understand consumer needs and attitudes, optimize product design and brand marketing strategy, and implement business intelligence. In academic research, topic models can analyze massive bodies of literature, helping researchers discover hot topics and providing guidance for subsequent research. The following focuses on applications of topic models in humanities and social science research, including communication, linguistics, history, and philosophy.
Currently, computational communication is at the forefront of the communication field. Topic models support cross-sectional and longitudinal analysis of various kinds of media discourse. In addition, researchers can use topic models to analyze topics and trends in social media data in order to identify public views and attitudes toward an event or issue. In short, applying topic models in communication research can help us better understand the media environment and public opinion, and thus provides a basis for optimizing the effect of communication.
Applications of topic models in linguistics fall into three areas: speech recognition, text classification, and linguistic knowledge extraction. First, speech recognition is the process of converting speech signals into text. Analyzing large amounts of speech data with a topic model can extract the semantic topics corresponding to the speech signal and so improve recognition accuracy. Second, in text classification, topic models can quickly and effectively classify massive numbers of texts automatically by topic, speaker, period, and other factors. Finally, topic models are also widely used in linguistic knowledge extraction, which can be understood as automatically extracting linguistic knowledge (such as vocabulary, grammatical structures, and sentence types) from large numbers of texts, thereby adding depth to linguistic research.
In historical and philosophical research, topic models can be used to study the themes, topics, and semantic features associated with a specific period, region, or social group in cultural history, and then to explore the differences, similarities, and interactions among cultures, civilizations, and value systems. For example, topic modeling of commentary on Chinese cultural relics can reveal the philosophical ideas, moral values, and outlook on life in traditional Chinese culture. Colin Allen's team was the first to introduce topic models into research on the history and philosophy of science, using LDA to model the literature that Darwin read and to examine how he built up a deep and broad space of thought through that reading.
Because the number of texts they can process is, in theory, unlimited, and because they can address large-scale narrative questions that traditional text analysis cannot answer, topic models have played a significant role in the data-driven transformation of research in the humanities and social sciences. Currently, in the field of data analysis, some complex algorithms, analyses of existing data and software packages, and relation-based semantic network analysis are all being deeply integrated with topic models.
Future challenges
Topic modeling is a relatively active research field, and its advantages in practical applications are becoming increasingly apparent. As "big data" research in the social and cultural domain becomes more common, the related research tools are becoming more important. In this process, topic models have met with opportunities for development, but they also face some challenges.
First, the stability of topic models has drawn the attention of many scholars. The stability problem can be stated as follows: when a topic modeling algorithm is applied to the same data set with the same parameters, the outputs of multiple runs may not be consistent. Whether the input stays the same or documents are updated, the results of traditional topic models are often unstable. How, then, can a stable and accurate topic model be produced? Faced with this question, many researchers simply use random initialization, so the results of the topic model carry a degree of uncertainty. In unsupervised learning, a common strategy for reducing instability is ensemble clustering, which combines a large and diverse set of clusterings to reach a more stable and accurate solution. However, this line of research still lacks a multidimensional treatment of the instability problem in topic models.
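One simple way to quantify this kind of instability is to compare the top words of the topics produced by two runs. The sketch below is only an illustrative measure under stated assumptions, not a method taken from the literature discussed here: the two runs and their top-word lists are invented, and the agreement score (the average best Jaccard overlap) is one of several possible choices.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def run_agreement(run_a, run_b):
    """Average, over the topics in run_a, of the best Jaccard match found in run_b."""
    return sum(max(jaccard(t, u) for u in run_b) for t in run_a) / len(run_a)


# Hypothetical top-word lists from two runs with different random seeds.
RUN_1 = [["market", "stocks", "investors"], ["match", "team", "fans"]]
RUN_2 = [["team", "fans", "goal"], ["market", "stocks", "prices"]]

score = run_agreement(RUN_1, RUN_2)
```

A score of 1.0 would mean that every topic in the first run has an identical counterpart in the second, while lower values indicate that the topics drift between runs. Matching each topic to its best counterpart also handles the fact that topic order is arbitrary across runs.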
Second, another challenge facing topic models is interpretability. It is sometimes difficult to find a superordinate concept that defines a topic from the words grouped under it, and the superordinate concept chosen varies from person to person, so some subjectivity is unavoidable. Evaluating the quality of a topic model is one step toward addressing this problem. The most widely used measure is likelihood, but likelihood values are not well suited to indicating interpretability in probabilistic models. Automatic measures of topic quality are a good option for quality checking and interpretability. In addition, to better interpret questions related to topic models, one needs to find a topic model suited to the specific application and to explore the relationships among multiple models.
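As one example of an automatic quality measure, the sketch below computes a UMass-style coherence score, which rewards topics whose top words tend to occur in the same documents. The choice of this particular measure is an assumption made for illustration, as are the toy documents and topic word lists, and the function assumes every top word appears in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log((D(w_i, w_j) + 1) / D(w_i)) over word pairs.

    top_words is ordered from most to least probable; each word is assumed
    to occur in at least one document, so the denominator is never zero.
    """
    doc_sets = [set(d) for d in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


# Hypothetical tokenized documents and two candidate topics.
TOY_DOCS = [
    ["market", "stocks", "investors"],
    ["market", "stocks", "fear"],
    ["team", "match", "fans"],
    ["team", "fans", "market"],
]
coherent = umass_coherence(["market", "stocks"], TOY_DOCS)
mixed = umass_coherence(["market", "team"], TOY_DOCS)
```

On this toy data the topic whose words usually appear together scores higher than the mixed one. A score of this kind can be computed for every topic without human labeling, although it does not remove the need for a researcher to judge what a topic means.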
Third, topic models help with many types of text analysis, but applying them to narrative-based literary texts may be unwise. The "bag-of-words" approach used by topic models ignores grammar, context, and other important features of the text, which leads to the phenomenon that "relationships seem to matter more than grammar". For this specific type of text, some other analytical methods appear to be more effective, for example Franco Moretti's network analysis of Shakespeare's plays and David Herman's narrative logic model. These methods pay more attention to establishing the relationships between objects and plots in the text in order to reveal its deeper meaning. Therefore, in practice, researchers need to consider the type of text, the research goals, and their needs, and select the right method for analysis.
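What the bag-of-words representation discards can be seen in a two-sentence example. The sentences and the helper function below are hypothetical and serve only to illustrate the point.

```python
from collections import Counter


def bag_of_words(sentence):
    """Represent a sentence as word counts, discarding word order."""
    return Counter(sentence.lower().split())


a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

# The two sentences describe opposite events, yet their bags are identical.
same_bag = a == b
```

Here `same_bag` is `True`: who does what to whom, which is central to narrative, is invisible to any model that sees only these counts.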
With the rapid development of the Internet and the continuing growth of data, topic models will find even wider application. On the one hand, as an important method of text analysis, topic models can be combined with new statistical methods, numerical data, or spatial data to cope better with the richness of textual semantics and to provide more comprehensive and accurate information in support of humanities and social science research. On the other hand, combining topic models with semantic network analysis allows the two to complement each other and helps in understanding the associations between different topics and concepts, which leaves more room to broaden the application of topic models and strengthen their interpretability.
(This article is a phased result of the National Social Science Foundation key project "Research on Chinese Political discourse international communication based on text" (18AY006).)
(The authors are, respectively, a doctoral student and associate professor at the Graduate School of Xi'an University of Foreign Languages, and the dean and a professor of the Graduate School of Xi'an University of Foreign Languages.)
All rights reserved by China Social Sciences Magazine. This article may not be reprinted or used without permission.