How can artificial intelligence be used to classify data?
mob604757044d68 2021-07-20 04:49:21
Data has become an organization's most valuable information asset. The recognized best practice in data governance is to manage data by classification, and the maturity of artificial intelligence technology now makes it possible to classify massive volumes of data in real time. Scarlett, a network security startup, has drawn industry attention for its machine learning and natural language processing technology. Safe cattle recently had an in-depth discussion with Scarlett on this topic. This is also the first time the impressive results of its data classification engine, and the technology behind them, have appeared in the media.
Scarlett is committed to applying advanced technologies such as machine learning, natural language processing, and data mining to content recognition, data governance, data security, detection and response, and threat intelligence. Its self-developed lightweight AI engine uses algorithms and a code implementation optimized for specific scenarios. It suits a variety of business scenarios and can be distributed and pre-embedded in various platforms, which effectively expands the application scope of artificial intelligence in information security. The engine has been adopted by many partners, and in 2015 Scarlett was selected as one of the top 10 security startups in the "Top 50 Chinese Network Security Enterprises" list.

Safe cattle: Can you give a brief introduction to the data classification engine?
Scarlett: We built our content recognition engine from zero lines of code. It applies machine learning, natural language processing, and text clustering to classify and grade data accurately, in real time, based on content. The unsupervised machine learning engine analyzes a large set of unlabeled original documents and automatically organizes them into topics according to content. The semantic similarity can be adjusted flexibly through manual intervention until the clustering results are satisfactory. The clustering results are then used as labeled samples for supervised machine learning, which extracts short sentences or long combination words as semantic features and automatically generates a text classification rule base. During this process, users can also intervene manually in feature selection and use negative samples to strengthen training. The text classification rules are pushed to lightweight distributed classifiers deployed on endpoints, servers, and networks across the organization, which sense the distribution and usage of key data in real time and provide basic support for data governance.
Safe cattle: That is a lot of technical terms. Do I understand correctly that it can automatically sort many documents into different topics?

Scarlett: Yes. Today we will demonstrate with stand-alone software; the effect is the same as that of the server engine. In this demonstration, we choose a folder containing a large number of documents as the input directory and start the automatic analysis. The software parses the text content of documents in different formats, intelligently extracts semantic features, and clusters the documents by content similarity. As shown in the screenshot below, the documents are divided into topics such as the LPG, fuel oil, and gasoline and diesel markets, the international oil market, and IT operation and maintenance. There is no human intervention in this process and the samples are not labeled manually, so it is a typical unsupervised machine learning scenario.
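As a rough illustration of clustering "by content similarity", the sketch below turns documents into TF-IDF vectors and compares them with cosine similarity. This is a common textbook technique chosen for illustration, not Scarlett's actual algorithm, which the interview does not disclose. The sample sentences, the whitespace tokenization (real Chinese text would need word segmentation), and the function names are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF vectors (term -> weight)."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)
            # Smoothed inverse document frequency, always positive.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented sample documents: two oil market notes and one IT operations note.
docs = [
    "diesel price rose as gasoline inventories fell".split(),
    "gasoline and diesel price fell on weak demand".split(),
    "server backup and patch schedule for the data center".split(),
]
vectors = tfidf_vectors(docs)
sim_oil = cosine(vectors[0], vectors[1])    # two oil market notes
sim_mixed = cosine(vectors[0], vectors[2])  # oil note vs. IT note
print(f"oil vs oil: {sim_oil:.2f}, oil vs IT: {sim_mixed:.2f}")
```

Documents on the same topic share more weighted terms and therefore score a higher similarity, which is the signal a clustering step can group on.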


[Screenshot: documents automatically clustered into topics]


Safe cattle: What do the words after each category mean? Are they keywords?
Scarlett: To help users quickly understand the results of unsupervised clustering, we added a simple text summarization algorithm that lists representative keywords, so users can see at a glance what each category is about. Note that these keywords are not the semantic features the classifiers will use. When we discuss supervised training later, we will see that the software can select more meaningful short sentences or combination words.
Safe cattle: The lack of human intervention really catches my eye. But if the automatic clustering results deviate from what the user expects, how can they be adjusted?
Scarlett: There is a content similarity control above the clustering results, and dragging the pointer immediately shows different clustering results. For example, as shown in the figure below, with the similarity set to 40%, the oil market reports go from four categories to two, the domestic oil market and the international oil market, while the IT operation and maintenance documents remain a separate category.
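One simple way to produce such per-category keywords is sketched below: each term is scored by how often it occurs in a cluster, weighted by the share of its occurrences that fall inside that cluster. The scoring formula, the sample clusters, and the name `top_keywords` are assumptions made for illustration; the interview only says that "a simple text summarization algorithm" is used.

```python
from collections import Counter


def top_keywords(clusters, k=2):
    """Return the k most representative terms for each cluster.

    clusters maps a cluster name to a list of tokenized documents.
    """
    per_cluster = {
        name: Counter(term for doc in docs for term in doc)
        for name, docs in clusters.items()
    }
    overall = Counter()
    for counts in per_cluster.values():
        overall.update(counts)

    result = {}
    for name, counts in per_cluster.items():
        # Count in this cluster times the fraction of all occurrences
        # of the term that belong to this cluster.
        scored = {t: c * (c / overall[t]) for t, c in counts.items()}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        result[name] = ranked[:k]
    return result


# Invented clusters: "report" is frequent everywhere, so it is not distinctive.
clusters = {
    "cluster_0": [
        ["diesel", "price", "market", "report"],
        ["diesel", "gasoline", "market", "report"],
    ],
    "cluster_1": [
        ["server", "backup", "report"],
        ["server", "patch", "report"],
    ],
}
keywords = top_keywords(clusters)
print(keywords)
```

A term such as "report" that appears in every cluster is pushed down the list, so the displayed keywords describe what is specific to each category.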


[Screenshot: clustering results at 40% similarity]


Reducing the similarity further to 30% merges all the oil market reports into one category, and the IT operation and maintenance documents remain a category of their own.


[Screenshot: clustering results at 30% similarity]


Dragging the pointer in the opposite direction to increase the similarity to 65% shows that the fuel oil market reports, originally one category, are further subdivided into weekly and monthly reports.


[Screenshot: clustering results at 65% similarity]


Behind this convenient human-computer interface are many carefully tuned natural language processing, machine learning, and data mining algorithms. When organizing data, users can adjust the similarity flexibly according to their management objectives and requirements, clustering documents into the topic partitions they need.
Safe cattle: Market analysis reports on gasoline and diesel and on fuel oil are closely related segments, yet the AI can separate them into two categories; that is easy enough to understand. But why can reports on the same fuel oil market also be divided into weekly and monthly reports?
Scarlett: Although both cover the fuel oil market, weekly and monthly reports differ in purpose and content focus, and their authors emphasize different things. As a result, they differ somewhat in content structure and wording. The AI engine can easily distinguish these detailed similarities and differences, so it can classify accurately even within subdivided fields.
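The effect of the similarity control described above can be sketched with threshold-based single-link clustering: two documents land in the same group whenever a chain of pairs with similarity at or above the threshold connects them. The pairwise similarity values below are invented so that the behavior resembles the demo, and single-link grouping is an assumption; the engine's real clustering method is not described.

```python
from itertools import combinations


def cluster_by_threshold(names, similarity, threshold):
    """Group items into connected components of pairs with
    similarity >= threshold (single-link clustering via union-find)."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Hypothetical pairwise similarities (illustrative values only).
NAMES = ["fuel_weekly", "fuel_monthly", "gasoline_diesel",
         "international", "it_ops"]
SIM = {
    frozenset({"fuel_weekly", "fuel_monthly"}): 0.60,
    frozenset({"fuel_weekly", "gasoline_diesel"}): 0.45,
    frozenset({"fuel_monthly", "gasoline_diesel"}): 0.45,
    frozenset({"fuel_weekly", "international"}): 0.35,
    frozenset({"fuel_monthly", "international"}): 0.35,
    frozenset({"gasoline_diesel", "international"}): 0.35,
}


def similarity(a, b):
    # Any pair not listed (everything involving it_ops) is nearly unrelated.
    return SIM.get(frozenset({a, b}), 0.05)


thresholds = [0.30, 0.40, 0.65]
counts = [len(cluster_by_threshold(NAMES, similarity, t)) for t in thresholds]
for t, n in zip(thresholds, counts):
    print(f"threshold {t:.2f}: {n} clusters")
```

Raising the threshold can only split groups and lowering it can only merge them, which matches the way the categories subdivide or collapse as the pointer is dragged.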

Safe cattle: Then the clustering results can be used to train the machine automatically?
Scarlett: Yes. Next comes supervised machine learning, which refines and selects semantic features in preparation for automatically generating classification rules. Let's create a new category called "gasoline and diesel market analysis" and import the group of samples that corresponds to the clustering results. The semantic features found are shown in the figure below.


[Screenshot: semantic features extracted for the new category]


The Chinese word segmentation technology that is popular in the market is only a basic step in our engine, and word segmentation algorithms themselves differ markedly. As you can see on this page, professional terms such as "Distillate stocks are on a month on month basis" are also extracted correctly. Our natural language processing engine does not need to rely on a professional thesaurus. The ability to discover new words and proper nouns is only the first step. A 10,000-word article can yield nearly 10,000 candidate semantic features, but deciding which of them are more important and more meaningful for text classification requires more advanced algorithms for scoring and filtering. Our software intelligently ranks the semantic features and recommends them to users.
Safe cattle: What are semantic features? Why not keywords?
Scarlett: Many users ask about this. High-quality semantic features are generally short sentences or long combination words. For example, the previous page shows that the software extracted "Diesel market fundamentals". A poor Chinese word segmentation and feature extraction algorithm would instead split it into four separate keywords: "diesel", "market", "basic", and "aspect". Clearly, the combination words selected by our software better represent the meaning of the subject content and make data classification more accurate and efficient. By comparison, keyword and regular expression techniques perform very poorly at content recognition in text classification. Take an example everyone understands: when identifying contracts, the user's first instinct is the keyword "contract". But many documents that are not contracts still contain the word "contract", and some contracts do not contain it at all, using "agreement" or "memorandum of understanding" instead. Even if regular expressions are written to express logic combining multiple keywords, there will still be many false positives or missed documents. The AI engine, by contrast, computes up to hundreds of semantic features and builds a mathematical model of each category in a multi-dimensional vector space, so its recognition accuracy is high. Even if some keywords are replaced or deleted, it can still classify text accurately.
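To see why the pool of candidate features grows so quickly, the sketch below enumerates every word n-gram up to a maximum length as a candidate phrase. N-gram enumeration is only a stand-in for the engine's unpublished extraction method, and whitespace tokenization of an English sentence is an assumption; Chinese text would first need word segmentation.

```python
from collections import Counter


def candidate_phrases(tokens, max_len=3):
    """Count every contiguous word sequence of 1..max_len tokens."""
    candidates = Counter()
    for size in range(1, max_len + 1):
        for start in range(len(tokens) - size + 1):
            candidates[" ".join(tokens[start:start + size])] += 1
    return candidates


tokens = "diesel market fundamentals remain weak".split()
candidates = candidate_phrases(tokens)
# Five tokens already give 5 + 4 + 3 = 12 candidate phrases.
print(sum(candidates.values()), "candidates, for example:",
      [p for p in candidates if len(p.split()) == 3])
```

Because most of these candidates are meaningless fragments, a separate scoring and filtering step is needed to keep only the phrases that characterize a category.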
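The contract example can be made concrete with a small comparison between a single-keyword rule and a score built from several weighted phrase features. The phrases, weights, threshold, and sample texts are all invented for illustration, and a simple weighted sum stands in for the engine's vector-space model, whose details are not given in the interview.

```python
import re

# Naive rule: anything mentioning "contract" is treated as a contract.
KEYWORD_RULE = re.compile(r"\bcontract\b", re.IGNORECASE)

# Hypothetical phrase features with illustrative weights.
FEATURE_WEIGHTS = {
    "the parties agree": 2.0,
    "term and termination": 1.5,
    "governing law": 1.5,
    "in witness whereof": 2.0,
    "contract": 0.5,
}
THRESHOLD = 3.0


def keyword_match(text):
    """Single-keyword classification."""
    return bool(KEYWORD_RULE.search(text))


def feature_score(text):
    """Sum the weights of all phrase features present in the text."""
    lowered = text.lower()
    return sum(w for phrase, w in FEATURE_WEIGHTS.items() if phrase in lowered)


def feature_match(text):
    """Multi-feature classification: enough evidence must accumulate."""
    return feature_score(text) >= THRESHOLD


memo = ("Memorandum of understanding. The parties agree as follows. "
        "Governing law: the laws of the seller's country apply.")
news = "The company announced that it won a new contract last week."

for name, text in (("memo", memo), ("news", news)):
    print(name, "keyword:", keyword_match(text), "features:", feature_match(text))
```

The keyword rule flags the news item (a false positive) and misses the memorandum (an omission), while the multi-feature score gets both right because no single word decides the outcome.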

Safe cattle: How are the semantic feature scores and rankings calculated? By word frequency?
Scarlett: Semantic features are not scored simply by number of occurrences. They are ranked by the semantic importance of the short sentence or combination word in the samples, integrating factors across multiple dimensions. For example, "Distillate stocks are on a month on month basis" appears only 13 times in this category, while "International market review" and "Future market outlook" each appear 25 times. But the algorithm judges that "Distillate stocks are on a month on month basis" is more representative of this kind of sample content, so it scores and ranks higher. This is the effect of natural language understanding by a real AI engine.
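A minimal sketch of scoring that goes beyond raw frequency: each phrase's in-category count is weighted by how exclusive the phrase is to that category. The in-category counts (13 and 25) come from the interview; the counts in other categories are invented, and the formula is only one plausible factor, not the multi-dimensional scoring the engine actually uses.

```python
def specificity_score(in_class, elsewhere):
    """In-class frequency weighted by the share of all occurrences
    of the phrase that fall inside this class."""
    return in_class * in_class / (in_class + elsewhere)


# phrase -> (count in this category, hypothetical count in other categories)
features = {
    "Distillate stocks are on a month on month basis": (13, 0),
    "International market review": (25, 40),
    "Future market outlook": (25, 60),
}

by_frequency = sorted(features, key=lambda p: features[p][0], reverse=True)
by_score = sorted(features, key=lambda p: specificity_score(*features[p]),
                  reverse=True)

for phrase in by_score:
    print(f"{specificity_score(*features[phrase]):6.2f}  {phrase}")
```

Under these assumed counts, the phrase that occurs only 13 times ranks first because it appears nowhere else, while the generic section headings are discounted for showing up in many other kinds of report.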

Safe cattle: How are the data classification and recognition rules generated automatically in the next step?
Scarlett: Select the appropriate combinations in the semantic feature list and save them; a few simple mouse clicks output the classification rules to a separate file. Once this rule file is imported into the servers and the distributed lightweight classification engines, they can quickly and accurately identify and locate the data you need within massive data sets.
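The export-then-import workflow might look like the following sketch, in which selected features are saved to a rule file and a lightweight classifier loads that file to label text. The JSON layout, the file name, the weights, the score threshold, and all function names are hypothetical; the actual rule file format is not described in the interview.

```python
import json
import os
import tempfile


def export_rules(path, category, features):
    """Write the selected features (phrase -> weight) to a rule file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"category": category, "features": features}, fh,
                  ensure_ascii=False, indent=2)


def load_rules(path):
    """Read a rule file back, as a distributed classifier would."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def classify(text, rules, min_score=2.0):
    """Return the category if enough feature weight is found, else None."""
    lowered = text.lower()
    score = sum(weight for phrase, weight in rules["features"].items()
                if phrase.lower() in lowered)
    return rules["category"] if score >= min_score else None


with tempfile.TemporaryDirectory() as tmp:
    rule_path = os.path.join(tmp, "gasoline_diesel_rules.json")
    export_rules(rule_path, "gasoline and diesel market analysis", {
        "diesel market fundamentals": 1.5,
        "distillate stocks": 1.0,
        "gasoline inventories": 1.0,
    })
    rules = load_rules(rule_path)

label = classify("Weekly note: diesel market fundamentals stayed weak "
                 "while distillate stocks rose.", rules)
other = classify("Server backup schedule for the data center.", rules)
print(label, other)
```

Keeping the rules in a plain data file means the training tool and the classifiers only need to agree on the file format, which is what makes it practical to push the same rules to many endpoints.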


[Screenshot: exporting the classification rules to a file]


Safe cattle: I'm really impressed by how easy your software is to use. But technical reports say that AI involves many complex algorithms. Isn't parameter tuning very troublesome?
Scarlett: Yes. Machine learning, natural language processing, and data mining involve many complex algorithms, particular algorithms work better in particular areas, and researchers have published many papers on parameter tuning. Our early user interface had a dozen parameter tabs, each containing multiple parameters of a specific algorithm for adjustment and optimization. In later practical use, however, we found this far too complicated, because few users understand the principles of parameter tuning. So we did a lot of product work, and the interface now keeps only two options: content similarity adjustment and semantic feature selection. This greatly reduces the situations that need user intervention and greatly improves ease of use. An even more prominent advantage is that we have overcome the problem of standardizing algorithms across industries. Users in any industry, whether finance, telecom, energy, government, or manufacturing, can use the standard version directly and still get satisfactory results without customization.
Safe cattle: What are the application scenarios of the AI data classification engine?
Scarlett: Data governance has become an important focus of information security. The data classification engine has been successfully applied in email content filtering, confidential document management, knowledge mining, intelligence analysis, anti-fraud, electronic discovery and archiving, data leakage prevention, and other fields. The engine is implemented with specially optimized algorithms and code. It is stable and fast, with low energy consumption and low computing requirements, so it suits a variety of business scenarios and can be distributed and pre-embedded in various products and platforms. It has been adopted by many partners and has created great value in Fortune 500 companies across many industries.


