Detailed analysis of scikit-learn for text categorization

Text classification using scikit-learn

Multi-label classification format

In multi-label classification problems, a single sample may belong to several categories at the same time; for example, one news article can cover multiple topics. In this case, the dependent variable y needs to be represented as a matrix.

Multi-category (multiclass) classification means that y can take more than two values, but each sample belongs to exactly one category; it is strictly different from the multi-label problem. All scikit-learn classifiers support multi-category classification out of the box. However, if you need to modify an algorithm yourself, scikit-learn can still handle the label preparation for multi-category classification.
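
As a minimal sketch of that label preparation (LabelBinarizer is the single-label counterpart of the MultiLabelBinarizer shown below), each sample gets exactly one positive entry in the resulting indicator matrix:

from sklearn.preprocessing import LabelBinarizer

y = [0, 2, 1, 2, 0]
LabelBinarizer().fit_transform(y)
# array([[1, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0],
#        [0, 0, 1],
#        [1, 0, 0]])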

For multi-category or multi-label classification problems, there are two strategies for building classifiers: One-vs-All and One-vs-One. The examples below demonstrate how to implement both strategies.
First, MultiLabelBinarizer converts the multi-label y into a binary indicator matrix:

from sklearn.preprocessing import MultiLabelBinarizer

y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
MultiLabelBinarizer().fit_transform(y)

array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

One-Vs-The-Rest strategy

This strategy is also called the One-vs-All strategy: K discriminants are constructed (K is the number of categories), and the i-th discriminant classifies samples as belonging to category i or not. Although this approach is time-consuming, it yields an intuitive understanding of each category through its corresponding discriminant (for example, each topic in a text classification task can be characterized by the high-frequency feature words that belong only to that topic).

Multi-category classification learning

from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
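
To make the per-category intuition mentioned above concrete, the fitted OneVsRestClassifier exposes one binary estimator per class, each with its own coefficient vector. A small sketch (refitting the same model under the name ovr, chosen here for clarity) prints the highest-weighted features of each discriminant:

import numpy as np

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
for k, est in enumerate(ovr.estimators_):
    top = np.argsort(est.coef_[0])[-2:]  # indices of the two highest-weighted features
    print(k, [iris.feature_names[j] for j in top])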

Multi-label classification learning

Kaggle hosted a competition on exactly this kind of multi-label problem: Multi-label classification of printed media articles to topics (address: https://).

The introduction to the competition is as follows:

This is a multi-label classification competition for articles coming from Greek printed media. Raw data comes from the scanning of print media, article segmentation, and optical character recognition, and therefore is quite noisy. Each article is examined by a human annotator and categorized to one or more of the topics being monitored. Topics range from specific persons, products, and companies that can be easily categorized based on keywords, to more general semantic concepts, such as environment or economy. Building multi-label classifiers for the automated annotation of articles into topics can support the work of human annotators by suggesting a list of all topics by order of relevance, or even automate the annotation process for media and/or categories that are easier to predict. This saves valuable time and allows a media monitoring company to expand the portfolio of media being monitored.

We download the data from this page and use it to work through a multi-label classification example.

Data description

This text dataset has already been vectorized with the bag-of-words model: there are 201,561 feature words in total, each document corresponds to one or more labels, and there are 203 classification labels altogether. The website provides the data in two formats, ARFF and LIBSVM: the ARFF format is mainly intended for Weka, while the LIBSVM format suits the LIBSVM module in MATLAB. Here we use the LIBSVM-format data.

Each line of data begins with a comma-separated sequence of integers representing the category labels. This is followed by space-separated id:value pairs, where id is the ID of a feature word and value is that feature word's TF-IDF value in the document.

The form is as follows.

58,152 833:0.032582 1123:0.003157 1629:0.038548 ...
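
As a quick check of how this format is parsed (the toy file name here is made up for illustration), load_svmlight_file with multilabel=True splits each line into a tuple of labels and a sparse row of TF-IDF values:

from sklearn.datasets import load_svmlight_file

with open("toy.libsvm", "w") as f:
    f.write("58,152 833:0.032582 1123:0.003157 1629:0.038548\n")

X_toy, y_toy = load_svmlight_file("toy.libsvm", multilabel=True)
print(y_toy)      # [(58.0, 152.0)] -- the two category labels
print(X_toy.nnz)  # 3 nonzero TF-IDF entries in the row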

Data loading

# load modules
import os
import sys
import numpy as np
from sklearn.datasets import load_svmlight_file
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics

# set working directory
os.chdir("D:\\my_python_workfile\\Thesis\\kaggle_multilabel_classification")

# read files
X_train, y_train = load_svmlight_file("./data/wise2014-train.libsvm",
                                      dtype=np.float64, multilabel=True)
X_test, y_test = load_svmlight_file("./data/wise2014-test.libsvm",
                                    dtype=np.float64, multilabel=True)
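
A quick sanity check on what was loaded (the exact numbers depend on the downloaded files): load_svmlight_file returns a scipy sparse matrix and, with multilabel=True, one label tuple per document.

print(X_train.shape)  # (number of documents, number of feature words)
print(len(y_train))   # one tuple of topic labels per document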

Model fitting and prediction

# transform y into an indicator matrix
mb = MultiLabelBinarizer()
y_train = mb.fit_transform(y_train)

# fit the model and predict
clf = OneVsRestClassifier(LogisticRegression(), n_jobs=-1)
clf.fit(X_train, y_train)
pred_y = clf.predict(X_test)
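
The competition description suggests listing topics by order of relevance. Since LogisticRegression implements predict_proba, the fitted OneVsRestClassifier can rank topics for an article; a minimal sketch (the top-5 cutoff is an arbitrary choice for illustration):

probs = clf.predict_proba(X_test[0])              # shape (1, 203): one probability per topic
ranked = mb.classes_[np.argsort(probs[0])[::-1]]  # topic ids, most probable first
print(ranked[:5])                                 # five most relevant topics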

Model evaluation

Since the test set has no ground-truth labels, we inspect the predictions on the training set instead.

# training set result
y_predicted = clf.predict(X_train)

# report
# print(metrics.classification_report(y_train, y_predicted))
np.mean(y_predicted == y_train)

0.99604661023482433
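
Note that np.mean over the 0/1 indicator matrix measures per-entry agreement, which is optimistic when each row contains 203 mostly-zero labels. For stricter views, scikit-learn's metrics module offers multi-label-aware scores; a sketch:

from sklearn.metrics import hamming_loss, accuracy_score, f1_score

print(hamming_loss(y_train, y_predicted))                 # fraction of wrong label entries
print(accuracy_score(y_train, y_predicted))               # subset accuracy: every label must match
print(f1_score(y_train, y_predicted, average="samples"))  # per-document F1, averaged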

Save results

# write the output
out_file = open("pred.csv", "w")
out_file.write("ArticleId,Labels\n")
id = 64858
for i in range(pred_y.shape[0]):
    label = list(mb.classes_[np.where(pred_y[i, :] == 1)[0]].astype("int"))
    label = " ".join(map(str, label))
    if label == "":  # if the label is empty
        label = "103"
    out_file.write(str(id + i) + "," + label + "\n")
out_file.close()

One-Vs-One strategy

The One-vs-One strategy builds a discriminant for each pair of categories, so K(K−1)/2 discriminants are needed in total, and the category of a sample is finally decided by voting among them. For the three-class iris data below, this means 3 pairwise discriminants.

Multi-category classification learning

from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X, y = iris.data, iris.target
OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
