
I'm new to Python and machine learning. I'm trying to implement a simple machine-learning script that predicts the topic of a text; for example, texts about Barack Obama should be mapped to "politician".

I think I'm taking the right approach, but I'm not 100% sure, so I'm asking you guys.

First of all, here is my little script:

#imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#dictionary for mapping the targets
categories_dict = {'0' : 'politiker','1' : 'nonprofit org'}

import glob
#get filenames from docs
filepaths = glob.glob('Data/*.txt')
print(filepaths)

docs = []

for path in filepaths:
    with open(path, 'r') as doc:
        docs.append(doc.read())
#print(docs)


count_vect = CountVectorizer()
#train Data
X_train_count = count_vect.fit_transform(docs)
#print(X_train_count.shape)

#tf-idf transformation (occurrences to frequencies)
tfidf_transform = TfidfTransformer()
X_train_tfidf = tfidf_transform.fit_transform(X_train_count)

#the categories you want to predict; these must be in the same order as the train docs!
categories = ['0','0','0','1','1']
clf = MultinomialNB().fit(X_train_tfidf, categories)

#try to predict
to_predict = ['Barack Obama is the President of the United States','Greenpeace']

#transform (not fit_transform) the new data you want to predict
X_pred_counts = count_vect.transform(to_predict)
X_pred_tfidf = tfidf_transform.transform(X_pred_counts)
print(X_pred_tfidf)

#predict
predicted = clf.predict(X_pred_tfidf)

for doc,category in zip(to_predict,predicted):
    print('%r => %s' %(doc,categories_dict[category]))

I'm sure about the general workflow required here, but I'm not sure how to map the categories to the docs I use to train the classifier. I know they must be in the correct order, and I think I got that right, but it doesn't output the right category.

Is that because the documents I use to train the classifier are bad, or am I making a mistake I'm not aware of?

It predicts that both new texts are about target 0 (politicians).
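In case it helps reproduce the issue: the whole workflow above can also be bundled into a scikit-learn Pipeline, which makes it harder to mix up the fit/transform steps. This is only a sketch; the toy documents below are made-up stand-ins for the real training files (which are in the Dropbox link in the comments):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy stand-ins for the real training files: three "politician" docs, two "nonprofit" docs
docs = [
    'Barack Obama won the election and became President',
    'Angela Merkel is a politician and head of government',
    'The senator gave a speech about politics and the election',
    'Greenpeace is a nonprofit organization protecting the environment',
    'The nonprofit charity collects donations for the environment',
]
# labels in the same order as the docs above
categories = ['0', '0', '0', '1', '1']

# vectorizer -> tf-idf -> classifier, chained so fit/transform cannot get out of sync
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(docs, categories)

predicted = pipeline.predict(['Barack Obama is the President of the United States'])
print(predicted)
```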

Thanks in advance.

  • Looks OK, but it is difficult to say more without the data. Commented Jan 13, 2015 at 14:25
  • The data I use to train this thing is basically just a few articles I got from the internet. I uploaded it to Dropbox; it's just plain text. dropbox.com/sh/tv4dbs23nzosdss/AAAKUTYSS_q_RNyd-z21EQ6ja?dl=0 Commented Jan 13, 2015 at 14:28
  • The problem here is that you are just doing tf-idf; it is difficult to map semantics or topics using what is essentially a word-counting model, even with NB, if those terms never appear in your training set. Topic modelling may be a better approach: google.co.uk/… NLTK also has some nice methods, but it's not highly performant. Commented Jan 13, 2015 at 14:34

1 Answer


It looks like the model hyperparameters are not tuned correctly. It is difficult to draw conclusions with so little data, but if you use:

from sklearn.linear_model import LogisticRegression

model = MultinomialNB(alpha=0.5).fit(X, y)
# or
model = LogisticRegression().fit(X, y)

you will get the expected results, at least for words like "Greenpeace", "Obama", and "President", which are so obviously correlated with their corresponding class. I took a quick look at the coefficients of the model and it seems to be doing the right thing.

For a more sophisticated approach to topic modeling, I recommend you take a look at gensim.
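gensim's `LdaModel` is the usual tool for that; the same idea can also be sketched with scikit-learn's `LatentDirichletAllocation`, which avoids adding a dependency since the rest of the script already uses scikit-learn (toy documents are made up for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: two politics docs, two nonprofit docs
docs = [
    'Obama President election politics government',
    'senator politician speech election politics',
    'Greenpeace nonprofit environment donations charity',
    'nonprofit charity environment Greenpeace donations',
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document
print(doc_topics)
```

Each row of `doc_topics` sums to 1 and tells you how strongly each document loads on each of the two topics, without needing any labels.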


3 Comments

Thank you! Yeah, the data is quite small, but I will get the right datasets in the summer, so I just need something to work with, you know? ;) I will try that out tomorrow. Thank you!
I think I read you can let these parameters tune themselves in scikit? Is this better than tuning manually?
Normally you have to tune them manually using a validation set. Scikit-learn has some helper functions (for example GridSearchCV). There are also a couple of models that auto-tune, like LassoCV.
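A minimal GridSearchCV sketch for the `alpha` smoothing parameter, using the modern `sklearn.model_selection` import path and made-up toy documents (three per class so stratified 3-fold CV works):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy labeled corpus, three docs per class
docs = [
    'Obama President election', 'politician government speech',
    'senator politics election', 'Greenpeace nonprofit environment',
    'charity donations environment', 'nonprofit Greenpeace charity',
]
labels = ['0', '0', '0', '1', '1', '1']

X = TfidfVectorizer().fit_transform(docs)

# try several smoothing strengths; cv=3 keeps each fold stratified on this tiny set
search = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 0.5, 1.0]}, cv=3)
search.fit(X, labels)
best_alpha = search.best_params_['alpha']
print(best_alpha)
```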
