1. Overview

IG is short for information gain. It is a very effective feature selection method, especially when classifying with an SVM. It is not introduced in detail here; interested readers can google it.
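Concretely, the information gain of a term t is the drop in class entropy once t is observed: IG(C, t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). A minimal self-contained sketch for a binary class and binary term (the counts are toy numbers for illustration, not from this post):

```java
public class InfoGainDemo {
    // Binary entropy in bits; defined as 0 when p is 0 or 1.
    static double entropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        double q = 1.0 - p;
        return -(p * Math.log(p) + q * Math.log(q)) / Math.log(2);
    }

    // IG(C, t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)
    // from the four cells of a term/class contingency table.
    static double infoGain(int posWith, int negWith, int posWithout, int negWithout) {
        double n = posWith + negWith + posWithout + negWithout;
        double nWith = posWith + negWith;
        double nWithout = posWithout + negWithout;
        double hClass = entropy((posWith + posWithout) / n);
        double hWith = entropy(posWith / nWith);
        double hWithout = entropy(posWithout / nWithout);
        return hClass - (nWith / n) * hWith - (nWithout / n) * hWithout;
    }

    public static void main(String[] args) {
        // Toy corpus: 10 documents, 5 positive / 5 negative;
        // the term occurs in 3 positive and 1 negative document.
        System.out.printf("IG = %.4f%n", infoGain(3, 1, 2, 4)); // prints IG = 0.1245
    }
}
```

A term that splits the classes more cleanly would score closer to H(C) = 1 bit; Weka's InfoGainAttributeEval computes the same quantity per attribute.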

Chi-square is another common feature selection method. It was described in detail in the earlier article on seed word expansion, so it is not repeated here.
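For reference, the chi-square score of a term for a class can be computed directly from a 2x2 contingency table of term/class counts; a short sketch with toy counts (made up for illustration):

```java
public class ChiSquareDemo {
    // Chi-square statistic for a 2x2 term/class contingency table:
    //   a = in-class docs containing the term,  b = other docs containing it,
    //   c = in-class docs without the term,     d = other docs without it.
    // chi2 = N (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))
    static double chiSquare(double a, double b, double c, double d) {
        double n = a + b + c + d;
        double num = n * Math.pow(a * d - b * c, 2);
        double den = (a + b) * (c + d) * (a + c) * (b + d);
        return num / den;
    }

    public static void main(String[] args) {
        // 10 documents: the term appears in 3 in-class and 1 out-of-class document.
        System.out.printf("chi2 = %.4f%n", chiSquare(3, 1, 2, 4)); // prints chi2 = 1.6667
    }
}
```

The larger the statistic, the stronger the dependence between term and class; Weka's ChiSquaredAttributeEval ranks attributes by this kind of score.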

2. Usage in Weka

2.1 Feature selector code

package com.lvxinjian.alg.models.feature;

import java.nio.charset.Charset;
import java.util.ArrayList;

import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeEvaluator;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

import com.iminer.tool.common.util.FileTool;

/**
 * @Description: feature selection with Weka's built-in evaluators
 *               (currently IG and Chi-square are supported)
 */
public class FeatureSelectorByWeka {

    /**
     * @function rank and filter features with a Weka evaluator
     * @param eval                 instance of the feature scoring method
     * @param data                 the data set in ARFF format
     * @param maxNumberOfAttribute maximum number of features to keep
     * @param outputPath           path of the lexicon (lex) output file
     * @throws Exception
     */
    public void EvalueAndRank(ASEvaluation eval, Instances data,
            int maxNumberOfAttribute, String outputPath) throws Exception {
        // Ranker is not a real search strategy: it simply sorts the
        // attributes by their InfoGain/Chi-square score.
        Ranker rank = new Ranker();
        eval.buildEvaluator(data);
        int[] attrIndex =, data);

        // Collect the ranked attributes, stopping at a zero score
        // or once the feature limit is reached.
        ArrayList<String> attributeWords = new ArrayList<String>();
        for (int i = 0; i < attrIndex.length; i++) {
            if (((AttributeEvaluator) eval).evaluateAttribute(attrIndex[i]) == 0)
                break;
            if (i >= maxNumberOfAttribute)
                break;
            attributeWords.add(i + "\t"
                    + data.attribute(attrIndex[i]).name() + "\t" + "1");
        }
        FileTool.SaveListToFile(attributeWords, outputPath, false,
                Charset.forName("UTF-8")); // charset argument truncated in the original post
    }
}
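EvalueAndRank expects the data already vectorized in Weka's ARFF format. A minimal sketch of what such a file looks like (the attribute names and class values here are made up for illustration):

```
@relation sentiment

@attribute good  numeric
@attribute bad   numeric
@attribute CLASS {pos,neg}

@data
1,0,pos
0,2,neg
```

Each numeric attribute is one candidate feature; the evaluator scores every attribute against the class attribute and Ranker sorts them by that score.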
2.2 Selector wrapper and driver

package com.lvxinjian.alg.models.feature;

import;

import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import com.iminer.alg.models.generatefile.ParameterUtils;

/**
 * @Description: IG / Chi-square feature selection
 */
public class WekaFeatureSelector extends FeatureSelector {

    /** The largest number of features. */
    private int maxFeatureNum = 10000;
    /** Save path of the feature file. */
    private String outputPath = null;
    /** @Fields classname : name of the class attribute. */
    private String classname = "CLASS";
    /** Feature selection method; the default is IG. */
    private String selectMethod = "IG";

    private boolean Initialization(String options) {
        try {
            String[] paramArrayOfString = options.split(" ");
            // Maximum number of features
            String maxFeatureNum = ParameterUtils.getOption("maxFeatureNum", paramArrayOfString);
            if (maxFeatureNum.length() != 0)
                this.maxFeatureNum = Integer.parseInt(maxFeatureNum);
            // Class attribute name
            String classname = ParameterUtils.getOption("class", paramArrayOfString);
            if (classname.length() != 0)
                this.classname = classname;
            else
                System.out.println("use default class name(\"CLASS\")");
            // Save path of the selected features (required)
            String outputPath = ParameterUtils.getOption("outputPath", paramArrayOfString);
            if (outputPath.length() != 0) {
                this.outputPath = outputPath;
            } else {
                System.out.println("please initialize output path.");
                return false;
            }
            // Selection method: IG or CHI
            String selectMethod = ParameterUtils.getOption("selectMethod", paramArrayOfString);
            if (selectMethod.length() != 0)
                this.selectMethod = selectMethod;
            else
                System.out.println("use default select method(IG)");
        } catch (Exception e) {
            return false;
        }
        return true;
    }

    public boolean selectFeature(Object obj, String options) throws IOException {
        try {
            if (!Initialization(options))
                return false;
            Instances data = (Instances) obj;
            ASEvaluation selector = null;
            if (this.selectMethod.equals("IG"))
                selector = new InfoGainAttributeEval();
            else if (this.selectMethod.equals("CHI"))
                selector = new ChiSquaredAttributeEval();
            FeatureSelectorByWeka attributeSelector = new FeatureSelectorByWeka();
            attributeSelector.EvalueAndRank(selector, data, this.maxFeatureNum, this.outputPath);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        String root = "C:\\Users\\Administrator\\Desktop\\12_05\\model training\\1219\\";
        WekaFeatureSelector selector = new WekaFeatureSelector();
        Instances data = + "train.Bigram.arff");
        String options = "-maxFeatureNum 10000 -outputPath lex.txt";
        selector.selectFeature(data, options);
    }
}
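ParameterUtils.getOption is the author's own helper and is not shown in the post; judging by the option strings above, it appears to mirror the behaviour of weka.core.Utils.getOption. A hypothetical, simplified sketch of that behaviour (unlike Weka's version, it does not consume the matched flag from the array):

```java
public class OptionDemo {
    // Hypothetical stand-in for ParameterUtils.getOption: return the value
    // following "-name" in the option array, or "" when the flag is absent.
    static String getOption(String name, String[] options) {
        for (int i = 0; i < options.length - 1; i++) {
            if (options[i].equals("-" + name)) {
                return options[i + 1];
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String[] opts = "-maxFeatureNum 10000 -outputPath lex.txt".split(" ");
        System.out.println(getOption("maxFeatureNum", opts)); // prints 10000
        System.out.println(getOption("class", opts));         // prints an empty line
    }
}
```

With the option string from main above, maxFeatureNum and outputPath are found, class and selectMethod fall back to their defaults ("CLASS" and IG), and the top-ranked features are written to lex.txt as index, word, and a constant weight of 1, tab-separated.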

References:

Weka data mining (2): feature selection (IG, chi-square)

Weka study notes (4): attribute selection
