Calculating TF-IDF in Python with scikit-learn

Cai Junshuai 2021-09-15 08:14:06

 

1 Downloading and Installing Scikit-learn

1.1 Brief Introduction

Scikit-learn is a simple and efficient tool for data mining and data analysis. It is a machine learning module for Python and is free to use under the BSD license.

Scikit-learn's basic functionality falls into six areas: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

Scikit-learn offers a rich collection of machine learning models, including SVMs, decision trees, GBDT, KNN, and so on, so an appropriate model can be chosen according to the type of problem. See the official website for details; it is recommended to get the package, modules, and documentation there.

1.2 Installation

pip install scikit-learn

Then import the feature extraction module with "from sklearn import feature_extraction".
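To confirm that the installation worked, a quick sanity check (a minimal sketch, nothing more):

import sklearn                 # fails with ImportError if the install did not succeed
print(sklearn.__version__)     # prints the installed version string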

2 TF-IDF Basics

2.1 TF-IDF Concept

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining. It is a statistical method that measures how important a word is to a corpus from two signals: how often the word appears in a text, and how often it appears across the documents of the whole corpus. Its advantage is that it filters out common but unimportant words while keeping the important words that characterize the text.

  • TF (Term Frequency) measures how often a keyword appears within a document.
  • IDF (Inverse Document Frequency) is computed from the document frequency, the number of documents in the corpus that contain the keyword. The inverse document frequency is based on the reciprocal of the document frequency and is mainly used to damp the weight of common words that appear in many documents but say little about any particular one.

Computing method: tf-idf is computed by multiplying a local component (the term frequency) by a global component (the inverse document frequency), and the resulting document vectors are then normalized to unit length. The unnormalized weight of a term t in a document d is:

tf-idf(t, d) = tf(t, d) × idf(t)

Step by step:

(1) Calculate the term frequency.

Term frequency (TF) = (number of times the word appears in the document) / (total number of words in the document)

(2) Calculate the inverse document frequency.

Inverse document frequency (IDF) = log( total number of documents in the corpus / (number of documents containing the word + 1) )
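As a minimal illustration of the two formulas in plain Python (the toy documents and helper functions here are invented for this example; they are not part of scikit-learn):

import math

def tf(word, doc):
    # term frequency: occurrences of the word / total words in the document
    words = doc.split()
    return words.count(word) / len(words)

def idf(word, docs):
    # inverse document frequency: log(total documents / (documents containing the word + 1))
    containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / (containing + 1))

docs = ["the bees like honey", "china farms bees", "china is large"]
print(tf("bees", docs[0]))   # 1 occurrence out of 4 words -> 0.25
print(idf("bees", docs))     # log(3 / (2 + 1)) = 0.0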

2.2 A Worked Example

Let's start with an example. Suppose we have a long article, "Bee Farming in China", and we want the computer to extract its keywords.

An obvious first idea is to find the words that occur most often: if a word is important, it should appear many times in the article. So we start by counting "term frequency" (TF for short).

(Note: the most frequent words will be "的", "是", "在", and similar Chinese function words (roughly "of", "is", "in"). They are called "stop words"; they contribute nothing to the result and must be filtered out.)

Suppose we filter them out and consider only the remaining meaningful words. Another problem appears: we may find that "China", "bees", and "farming" occur equally often. Does that mean that, as keywords, the three are equally important?

Clearly not. "China" is a very common word, while "bees" and "farming" are comparatively rare. If the three words appear the same number of times in an article, there is reason to believe that "bees" and "farming" matter more than "China"; in a keyword ranking, "bees" and "farming" should come before "China".

So we need an importance adjustment factor that measures whether a word is common. If a word is rare but appears many times in this article, it probably reflects what the article is about, and that is exactly what we want.

In statistical terms, on top of the term frequency we assign each word an "importance" weight. The most common words ("的", "是", "在") get the smallest weight, fairly common words ("China") get a smaller weight, and rare words ("bees", "farming") get a larger weight. This weight is called the "inverse document frequency" (IDF for short), and its size is inversely related to how common the word is.

Once we have the term frequency (TF) and the inverse document frequency (IDF), multiplying the two gives the word's TF-IDF value. The more important a word is to an article, the higher its TF-IDF value, so the top-ranked words are the article's keywords.

Here are the details of the algorithm.

  1. First step: calculate the term frequency.

     TF = (number of times the word appears in the document) / (total number of words in the document)

  2. Second step: calculate the inverse document frequency.

     IDF = log( total number of documents in the corpus / (number of documents containing the word + 1) )

  3. Third step: calculate TF-IDF.

     TF-IDF = TF × IDF

As you can see, TF-IDF is proportional to how often a word appears in a document and inversely related to how widely it appears across the corpus. Automatic keyword extraction is therefore straightforward: compute the TF-IDF value of every word in the document, sort in descending order, and take the top-ranked words.
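As a sketch, reusing the toy tf/idf helpers from section 2.1 above (again invented for illustration), the extraction procedure is just scoring, sorting, and slicing:

def extract_keywords(doc, docs, k=3):
    # score every distinct word by tf * idf, sort in descending order, keep the top k
    scores = {w: tf(w, doc) * idf(w, docs) for w in set(doc.split())}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(extract_keywords(docs[1], docs))   # 'farms' ranks first; 'china' and 'bees' both score 0.0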

Back to "Bee Farming in China": suppose the article is 1,000 words long and "China", "bees", and "farming" each appear 20 times, so all three have a term frequency (TF) of 0.02. A Google search then finds 25 billion pages containing "的"; take that as the total number of Chinese web pages. Pages containing "China" number 6.23 billion; pages containing "bees", 48.4 million; and pages containing "farming", 97.3 million. Their inverse document frequencies (IDF) and TF-IDF values are then:

Word      Pages containing it (in 100 millions)    IDF      TF-IDF
China     62.3                                     0.603    0.0121
Bees      0.484                                    2.713    0.0543
Farming   0.973                                    2.410    0.0482

(IDF here uses a base-10 logarithm.)

As the table shows, "bees" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If you also computed the TF-IDF of "的", it would be a value extremely close to 0.) So if we could choose only one keyword, it would be "bees".
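The table's numbers can be reproduced in a few lines (base-10 logarithm; the +1 smoothing term from the IDF formula above is dropped here because the page counts are huge):

import math

TOTAL = 250.0   # assumed total number of Chinese web pages, in units of 100 million
TF = 0.02       # 20 occurrences / 1000 words
for word, pages in [("China", 62.3), ("bees", 0.484), ("farming", 0.973)]:
    idf = math.log10(TOTAL / pages)
    print(word, round(idf, 3), round(TF * idf, 4))
# China 0.603 0.0121
# bees 2.713 0.0543
# farming 2.41 0.0482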

3 Calculating TF-IDF with Scikit-Learn

TF-IDF weights in Scikit-Learn are computed mainly with two classes: CountVectorizer and TfidfTransformer.

3.1 CountVectorizer

The CountVectorizer class converts the words in the texts into a term frequency matrix. A matrix element a[i][j] is the frequency of word j in text i.

The fit_transform function counts how many times each word appears, get_feature_names() returns all the feature words in the vocabulary, and toarray() shows the term frequency matrix itself.

# coding:utf-8
from sklearn.feature_extraction.text import CountVectorizer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the words in the texts into a term frequency matrix
vectorizer = CountVectorizer()
# Count the occurrences of each word
X = vectorizer.fit_transform(corpus)
# Get all feature words in the vocabulary
word = vectorizer.get_feature_names()  # in scikit-learn >= 1.2 use get_feature_names_out()
print(word)
# Show the term frequency matrix
print(X.toarray())
    
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
    
You can see from the result that the vocabulary contains 9 feature words:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

The output also shows how many of these feature words each sentence contains.

For example, the first sentence, "This is the first document.", has the term frequency vector [0, 1, 1, 1, 0, 0, 1, 0, 1].

Counting positions from 1, this means that the word at position 2, "document", occurs once; the word at position 3, "first", occurs once; the word at position 4, "is", occurs once; the word at position 7, "the", occurs once; and the word at position 9, "this", occurs once.

So every sentence yields a term frequency vector.
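If you want to see which column belongs to which word, the fitted vectorizer also has a vocabulary_ attribute (a standard CountVectorizer attribute) mapping each word to its 0-based column index:

print(vectorizer.vocabulary_)
# {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
# (the key order may differ; the values are the column indices)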

3.2 TfidfTransformer

TfidfTransformer computes the TF-IDF value of each word produced by the vectorizer. Usage is as follows:

# coding:utf-8
from sklearn.feature_extraction.text import CountVectorizer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the words in the texts into a term frequency matrix
vectorizer = CountVectorizer()
# Count the occurrences of each word
X = vectorizer.fit_transform(corpus)
# Get all feature words in the vocabulary
word = vectorizer.get_feature_names()  # in scikit-learn >= 1.2 use get_feature_names_out()
print(word)
# Show the term frequency matrix
print(X.toarray())
# ----------------------------------------------------
from sklearn.feature_extraction.text import TfidfTransformer

# Instantiate the transformer
transformer = TfidfTransformer()
print(transformer)
# Compute TF-IDF values from the term frequency matrix X
tfidf = transformer.fit_transform(X)
# tfidf[i][j] is the tf-idf weight of word j in text i
print(tfidf.toarray())
    

The output is shown below (values follow scikit-learn's defaults: smoothed idf and l2 normalization):

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
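The fitted transformer also exposes the per-word IDF values through its idf_ attribute; as expected, "the", which occurs in all four documents, gets the smallest value:

for w, v in zip(word, transformer.idf_):
    print(w, round(v, 4))   # 'the' -> 1.0; words unique to a single document -> 1.9163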

     

4 A Complete Mini Example

Usually you need the word frequency statistics and the TF-IDF values at the same time, so the core code is:

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
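Equivalently, scikit-learn's TfidfVectorizer class combines CountVectorizer and TfidfTransformer into a single step:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer().fit_transform(corpus)   # same result as the two-step pipeline above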
    
The complete program:

# coding:utf-8
__author__ = "liuxuejiang"
import jieba
import jieba.posseg as pseg
import os
import sys
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

if __name__ == "__main__":
    corpus = ["我 来到 北京 清华大学",         # segmentation result of text 1; words are separated by spaces
              "他 来到 了 网易 杭研 大厦",      # segmentation result of text 2
              "小明 硕士 毕业 与 中国 科学院",   # segmentation result of text 3
              "我 爱 北京 天安门"]             # segmentation result of text 4
    vectorizer = CountVectorizer()    # converts the words into a term frequency matrix; element a[i][j] is the count of word j in text i
    transformer = TfidfTransformer()  # computes the tf-idf weights
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))  # inner call builds the term frequency matrix, outer call computes tf-idf
    word = vectorizer.get_feature_names()  # all words in the bag-of-words model; use get_feature_names_out() in scikit-learn >= 1.2
    weight = tfidf.toarray()          # tf-idf matrix; element [i][j] is the tf-idf weight of word j in text i
    for i in range(len(weight)):      # outer loop over texts, inner loop over words
        print("------- tf-idf weights for text", i, "-------")
        for j in range(len(word)):
            print(word[j], weight[i][j])
    

Output is as follows:

------- tf-idf weights for text 0 -------   # original text: "我 来到 北京 清华大学" (I came to Tsinghua University in Beijing)
中国 0.0
北京 0.52640543361
大厦 0.0
天安门 0.0
小明 0.0
来到 0.52640543361
杭研 0.0
毕业 0.0
清华大学 0.66767854461
硕士 0.0
科学院 0.0
网易 0.0
------- tf-idf weights for text 1 -------   # original text: "他 来到 了 网易 杭研 大厦" (He came to the NetEase Hangyan building)
中国 0.0
北京 0.0
大厦 0.525472749264
天安门 0.0
小明 0.0
来到 0.414288751166
杭研 0.525472749264
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.525472749264
------- tf-idf weights for text 2 -------   # original text: "小明 硕士 毕业 与 中国 科学院" (Xiao Ming got his master's degree at the Chinese Academy of Sciences)
中国 0.4472135955
北京 0.0
大厦 0.0
天安门 0.0
小明 0.4472135955
来到 0.0
杭研 0.0
毕业 0.4472135955
清华大学 0.0
硕士 0.4472135955
科学院 0.4472135955
网易 0.0
------- tf-idf weights for text 3 -------   # original text: "我 爱 北京 天安门" (I love Beijing's Tiananmen)
中国 0.0
北京 0.61913029649
大厦 0.0
天安门 0.78528827571
小明 0.0
来到 0.0
杭研 0.0
毕业 0.0
清华大学 0.0
硕士 0.0
科学院 0.0
网易 0.0
    


Please include a link to the original when reprinting. Thanks.