Exploring the Interpretability of Topic Models
November 03, 2020 09:24 Source: "China Social Sciences", November 3, 2020, Issue 2039 Authors: Wang Xiaohong, Pujiang Huai, Colin Allen

  


The LDA topic model (Latent Dirichlet Allocation Topic Model, LDA-TM) can intuitively present both a single text and a massive text library as clusters of topic words, thanks to its distant-reading and "super bookshelf" functions. It is increasingly used to assist interpretation and argument in the humanities, and it now covers journalism and communication, literature, history, culture, poetry, ancient Chinese classics, philosophy, and other fields. For example, the Chinese Code topic model, developed by Xi'an Jiaotong University and Indiana University in the United States, is an LDA topic model built on a corpus of ancient Chinese texts.

However, what artificial intelligence and machine learning deliver is only the word clusters produced by the algorithm. What each cluster (that is, each topic) means must, at least for now, be explained by people. Figuratively speaking, interpretation by experts in a humanities field amounts to labeling the topics. Human users can only judge a topic by reviewing its highest-probability core words, and this raises two problems. First, every word is distributed over every topic, so relying only on the top 15 or 20 highest-weight words to determine a topic's meaning in effect discards the topic weights of most words. Does this impose certain limitations? Another project in our laboratory is investigating this question. Second, an LDA topic model trained on a humanities corpus needs to be of good quality, because quality is the basis for interpretation and argument. Can a computational method for evaluating model quality be established? This work is a first attempt to do so from the perspective of the topic model.

For an LDA topic model used as a new tool to assist humanities research, good quality means that the trained word clusters (topics) are interpretable, so that people can easily judge and explain their meaning. For a topic model built on a humanities corpus, such as the Chinese Code, there is no unified standard interpretation of topic content. Yet when evaluators actually face a topic, their interpretations are unlikely to differ greatly. The interpretability of the model can therefore be linked to the difficulty of human judgment: the less difficult it is for people to judge a topic, the better the model's interpretability. However, because of differences in background knowledge, goals, motivation, and other psychological factors that arise during judgment, the results of manual judgment often vary widely. In addition, manual judgment requires finding and organizing suitable people to take part in evaluating the model, so the method is relatively inefficient. Our goal is to use the results of manual evaluation as a reference and try to build a reliable computational method for evaluating model interpretability, one that can replace the inefficient manual method.

  Using manual evaluation as a reference for computational evaluation

We first obtained manual assessments of model quality through a questionnaire. We invited 150 students from different majors at a key university in China and divided them into groups. Using systematic sampling, we drew 75 topics from the Chinese Code topic model and grouped these 75 topics as well. The top 15 words of each topic were shown to the students (for example, Topic 25: Qi, Service, Heat, Governance, Disease, medicine, Cold, Blood, Huang, medicine, pulse, Yang, Pain, medicine, Yin). Each topic was evaluated by 50 students, who read its 15 most representative words. We asked the students to summarize the meaning of each topic in 2-3 words and to rate how difficult the topic was to interpret. In the end we collected 3,750 data records.
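The sampling step can be illustrated with a minimal sketch of systematic sampling: pick a random starting offset, then take items at a fixed interval. The article does not state how many topics the sample was drawn from, so the pool of 150 topic IDs, the function name `systematic_sample`, and the seed are all hypothetical.

```python
import random


def systematic_sample(items, n, seed=None):
    """Pick n items at a fixed interval, starting from a random offset."""
    if not 0 < n <= len(items):
        raise ValueError("n must be between 1 and len(items)")
    step = len(items) / n                      # sampling interval
    start = random.Random(seed).random() * step  # offset in [0, step)
    last = len(items) - 1
    # min() guards against a floating-point index landing on len(items)
    return [items[min(int(start + i * step), last)] for i in range(n)]


# Hypothetical pool of topic IDs; the real pool size is not given in the text.
topic_ids = list(range(150))
sampled = systematic_sample(topic_ids, 75, seed=2020)
print(len(sampled), sampled[:5])
```

Because the interval is fixed, the sampled topics are spread evenly over the whole pool rather than clustered by chance.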

During evaluation, the evaluators differ in many psychological factors and in knowledge background, so a standard manual evaluation result is hard to obtain. In this study, all evaluators were college students with some general knowledge of traditional Chinese culture, so their knowledge backgrounds were at roughly the same level. The average score can therefore represent the manual evaluation result (had experts in the field also taken part, the results would have had a hierarchical structure, and the expert and student evaluations would have needed different weights). In the end, each topic received 50 scores from evaluators, and we took the mean of these 50 scores as the manual evaluation result for that topic.
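As a small illustration of the aggregation described above, the sketch below averages the ratings each topic received. The ratings are invented for the example (the study collected 50 per topic on a 1-5 scale), and the function name `topic_scores` is hypothetical.

```python
from statistics import mean


def topic_scores(ratings):
    """Map each topic ID to the mean of the ratings it received."""
    return {topic: mean(scores) for topic, scores in ratings.items()}


# Hypothetical ratings on the 1-5 scale; only five per topic for brevity.
ratings = {
    25: [4, 5, 4, 3, 4],
    31: [2, 1, 3, 2, 2],
}
scores = topic_scores(ratings)
print(scores)

# Sanity check on the size of the data set: 75 topics x 50 evaluators.
total_records = 75 * 50
```

An unweighted mean is appropriate here only because all evaluators are treated as one homogeneous group.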

  Figure 2: Testing the combinatorial relationships among the 15 core characters of Topic 25 (K = 40). A set of core characters is always concentrated in a specific region, which is why we call such a group a "topic".

  Exploring possible computational methods

Many factors may affect how people understand and interpret a topic. For interpreting the Chinese Code topic model, we propose two hypotheses. Hypothesis 1 is the "semantic similarity" hypothesis. The semantic similarity among the top 15 words affects how difficult it is for evaluators to summarize and interpret the topic: the higher the semantic similarity between the words, the easier it is for evaluators to summarize and interpret the meaning the set of words expresses. Hypothesis 2 is the "word familiarity" hypothesis. Evaluators' familiarity with a set of words affects how difficult it is to summarize and interpret it: the more familiar the words, the easier it is to summarize and interpret the meaning the topic expresses.

The computational method corresponding to Hypothesis 1 is to measure the distance between words. We adopted the open-source Chinese synonyms toolkit "Synonyms" (https://github.com/huyingxi/synonyms) to measure the degree of synonymy between topic words. The toolkit uses Word2vec (https://radimrehurek.com/gensim/models/word2vec.html) to train a high-quality synonym model on large data with rich contextual information. The principle is to map semantic expressions onto vector representations. All vocabulary is thus mapped into a high-dimensional vector space, and the similarity between words can be measured by the distance between their vectors in that space.
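To make the idea of vector-space similarity concrete, here is a minimal sketch that scores a topic by the mean cosine similarity over all pairs of its words. It does not call the toolkit named above. The two-dimensional vectors are toy values invented for the example, and the function names are hypothetical. Plain cosine similarity ranges from -1 to 1, so the toolkit's own 0-1 score may be computed differently.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(words, vectors):
    """Average cosine similarity over all unordered pairs of topic words."""
    pairs = list(combinations(words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy vectors (hypothetical); real word2vec vectors have hundreds of dimensions.
vectors = {"qi": [1.0, 0.0], "blood": [1.0, 0.0], "pulse": [0.0, 1.0]}
similarity = mean_pairwise_similarity(["qi", "blood", "pulse"], vectors)
print(round(similarity, 3))
```

With 15 topic words there are 105 unordered pairs, so one number per topic summarizes how tightly its words cluster in the vector space.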

The word vectors were obtained in four steps: downloading the Chinese Wikipedia corpus, converting traditional characters to simplified characters, segmenting the text with Jieba, and training the word vectors. We then computed the degree of synonymy between the words in each topic (range 0-1; the closer to 1, the more similar the meanings) and compared it with the manual evaluation results obtained earlier (range 1-5; the larger the value, the easier the topic is to interpret). We expected the two values to be positively correlated. However, the results show a very weak negative correlation (as shown in Figure 1).
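The comparison between the computed measure and the manual scores can be sketched as a correlation coefficient. The article does not say which coefficient was used. Pearson's r is assumed here, and the data values are invented solely to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-topic values: similarity in 0-1, manual score in 1-5.
topic_similarity = [0.21, 0.35, 0.48, 0.30, 0.55]
manual_score = [3.9, 3.1, 3.4, 3.6, 2.8]
r = pearson(topic_similarity, manual_score)
print(f"r = {r:.3f}")
```

A value of r near zero, whatever its sign, would correspond to the "very weak" relationship reported in the text.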

For Hypothesis 2, to compute the relationship between familiarity and topic interpretability, we considered two measures: topic entropy and topic word frequency. The Shannon entropy of a topic measures how the topic is distributed across the documents of the corpus. The higher the topic entropy, the more likely the topic is a high-weight topic in many documents. According to our hypothesis, topic entropy and topic interpretability should be negatively correlated, because the lower the topic entropy, the fewer documents the topic appears in and the clearer its meaning. The data are consistent with our hypothesis, but the correlation is very weak (as shown in Figure 1).
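One way to compute such an entropy is sketched below. The topic's weights across documents are normalized into a probability distribution, and Shannon entropy is taken in bits. The normalization and the log base are assumptions, since the article does not give the exact formula, and `topic_entropy` is a hypothetical name.

```python
import math


def topic_entropy(doc_weights):
    """Shannon entropy (bits) of one topic's weight distribution over documents."""
    total = sum(doc_weights)
    if total == 0:
        return 0.0
    # Normalize to probabilities; zero-weight documents contribute nothing.
    probs = [w / total for w in doc_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical weights of one topic in four documents.
spread_out = topic_entropy([0.25, 0.25, 0.25, 0.25])  # topic present everywhere
focused = topic_entropy([0.9, 0.0, 0.0, 0.0])         # topic in a single document
print(spread_out, focused)
```

A topic spread evenly over many documents gets high entropy, while a topic concentrated in a few documents gets low entropy, which matches the reasoning in the paragraph above.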

As for word frequency, the higher the frequency of a word, the more familiar people are with it, so topics made up of high-frequency words should be easier to interpret. High-frequency words are often polysemous core terms of Chinese philosophy, such as "ritual", "reason", "Tao", and "qi". But polysemy does not reduce the interpretability of a topic, because when people recognize the meaning of a topic, their judgment is often based on the associations between words (that is, the context). The topics of a topic model can separate the different contexts of a polysemous word. For example, "qi" appears in the contexts of the qi of Chinese medical theory, the qi of Daoist cosmology, and the qi of Neo-Confucian (lixue) theory. We therefore hypothesized that topic word frequency should be positively correlated with interpretability. The data are consistent with our hypothesis, but the positive correlation is also weak (as shown in Figure 1, right).

In addition, when calculating frequency, we took into account that the Chinese Code is written in ancient Chinese while the human evaluators are situated in a contemporary cognitive and cultural background. We therefore used a character frequency table rather than a word frequency table, and a modern rather than an ancient Chinese character frequency table.
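A familiarity score of this kind can be sketched as the mean frequency of a topic's characters in a frequency table. The article does not say how the per-character frequencies were combined into one value per topic. Taking the mean is an assumption, and the frequency values and the function name below are hypothetical.

```python
def mean_char_frequency(chars, freq_table, default=0.0):
    """Average frequency of a topic's characters; unknown characters count as default."""
    return sum(freq_table.get(c, default) for c in chars) / len(chars)


# Hypothetical relative frequencies, standing in for a modern character frequency table.
modern_freq = {"气": 0.0012, "血": 0.0004, "阳": 0.0006}
topic_chars = ["气", "血", "阳"]
familiarity = mean_char_frequency(topic_chars, modern_freq)
print(familiarity)
```

Looking up single characters rather than words avoids having to segment the ancient text into modern words before the lookup.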

  Discussion and reflection

In exploring topic models, an important and interesting cognitive question is how people summarize the meaning of the words in a topic. We carefully examined the topics that the manual evaluation found easiest and hardest to interpret. We found that, beyond the familiarity factors examined above, evaluators' judgments of a topic's interpretability may depend on whether its characters can be combined into words. We therefore permuted the top 15 characters of each topic, generated the two-character, three-character, and four-character strings they could form, and checked these combinations against a modern Chinese dictionary and word frequency list. This gave the number of meaningful words that could be formed from each topic (as shown in Figure 2). The data analysis shows that the number of meaningful words (that is, the combinability of the topic's characters) is correlated with interpretability, consistent with our prediction.
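The combinability count described above can be sketched with ordered permutations checked against a word list. The five characters and the four-entry dictionary below are a toy stand-in for a topic's 15 characters and a real modern Chinese dictionary, and `count_meaningful_words` is a hypothetical name.

```python
from itertools import permutations


def count_meaningful_words(chars, dictionary, lengths=(2, 3, 4)):
    """Count distinct dictionary words that can be formed from ordered character combinations."""
    found = set()
    for n in lengths:
        for combo in permutations(chars, n):
            word = "".join(combo)
            if word in dictionary:
                found.add(word)
    return len(found)


# Toy inputs (hypothetical): a few topic characters and a tiny word list.
topic_chars = ["气", "血", "阴", "阳", "药"]
dictionary = {"气血", "血气", "阴阳", "中药"}
combinability = count_meaningful_words(topic_chars, dictionary)
print(combinability)
```

For 15 distinct characters this enumerates 210 + 2,730 + 32,760 = 35,700 candidate strings per topic, which is small enough to check exhaustively against a dictionary held in a set.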

Our preliminary analysis above suggests that the semantic similarity of topic words, topic entropy, and topic word frequency are three possible computational measures for evaluating the quality of a topic model. However, when evaluators interpret a topic, their familiarity with the words may matter more than the semantic similarity between the words, so computational methods designed around familiarity may be more meaningful. At the same time, it is worth examining how people interpret a set of topic words against the background of the Chinese Code topic model, and discovering the relationships among the words within a topic. Given the weak correlation obtained earlier between topic word distance and interpretability, combining word distance with combinability can serve as an idea for further investigation.

  (Author affiliations: Computational Philosophy Laboratory, Xi'an Jiaotong University; Department of Philosophy, Nanjing University; Department of History and Philosophy of Science, University of Pittsburgh)

Editor in charge: Zhang Yueying