
I'm new to Python and machine learning. I'm trying to implement a simple machine-learning script that predicts the topic of a text; for example, texts about Barack Obama should be mapped to "politician".

I think I'm taking the right approach, but I'm not 100% sure, so I'm asking you guys.

First of all, here is my little script:

#imports
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
#dictionary for mapping the targets
categories_dict = {'0' : 'politiker','1' : 'nonprofit org'}

import glob
#get filenames from docs
filepaths = glob.glob('Data/*.txt')
print(filepaths)

docs = []

for path in filepaths:
    with open(path, 'r') as doc:
        docs.append(doc.read())
#print(docs)


count_vect = CountVectorizer()
#train Data
X_train_count = count_vect.fit_transform(docs)
#print(X_train_count.shape)

#tf-idf transformation (occurrences to frequencies)
tfidf_transform = TfidfTransformer()
X_train_tfidf = tfidf_transform.fit_transform(X_train_count)

#the categories you want to predict; these must be in the same order as the train docs!
categories = ['0','0','0','1','1']
clf = MultinomialNB().fit(X_train_tfidf, categories)

#try to predict
to_predict = ['Barack Obama is the President of the United States','Greenpeace']

#transform (not fit_transform) the new data you want to predict
X_pred_counts = count_vect.transform(to_predict)
X_pred_tfidf = tfidf_transform.transform(X_pred_counts)
print(X_pred_tfidf)

#predict
predicted = clf.predict(X_pred_tfidf)

for doc,category in zip(to_predict,predicted):
    print('%r => %s' %(doc,categories_dict[category]))

I'm sure about the general workflow required here, but I'm not sure how to map the categories to the docs I use to train the classifier. I know they must be in the correct order, and I think I got that right, but it doesn't output the right category.

Is that because the documents I use to train the classifier are bad, or am I making a mistake I'm not aware of?

It predicts that both new texts are about target 0 (politicians).
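In case it helps reproduce the issue: the whole workflow above can also be bundled into a scikit-learn Pipeline, which makes it harder to mix up the fit/transform steps. This is only a sketch; the toy documents below are made-up stand-ins for the real training files (which are in the Dropbox link in the comments):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy stand-ins for the real training files: three "politician" docs, two "nonprofit" docs
docs = [
    'Barack Obama won the election and became President',
    'Angela Merkel is a politician and head of government',
    'The senator gave a speech about politics and the election',
    'Greenpeace is a nonprofit organization protecting the environment',
    'The nonprofit charity collects donations for the environment',
]
# labels in the same order as the docs above
categories = ['0', '0', '0', '1', '1']

# vectorizer -> tf-idf -> classifier, chained so fit/transform cannot get out of sync
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(docs, categories)

predicted = pipeline.predict(['Barack Obama is the President of the United States'])
print(predicted)
```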

Thanks in advance.

  • Looks OK, but it is difficult to say more without the data. Commented Jan 13, 2015 at 14:25
  • The data I use to train this thing is basically just a few articles I got from the internet. I uploaded it to Dropbox; it's just plain text. dropbox.com/sh/tv4dbs23nzosdss/AAAKUTYSS_q_RNyd-z21EQ6ja?dl=0 Commented Jan 13, 2015 at 14:28
  • The problem here is that you are just doing tf-idf; it is difficult to map semantics or topics using what is essentially a word-counting model, even with NB, if those terms never appear in your training set. Topic modelling may be a better approach: google.co.uk/… NLTK also has some nice methods, but it's not highly performant. Commented Jan 13, 2015 at 14:34

1 Answer


It looks like the model hyperparameters are not tuned correctly. It is difficult to draw conclusions with so little data, but if you use:

from sklearn.linear_model import LogisticRegression

model = MultinomialNB(alpha=0.5).fit(X, y)
# or
model = LogisticRegression().fit(X, y)

you will get the expected results, at least for words like "Greenpeace", "Obama", and "President", which are so obviously correlated with their corresponding class. I took a quick look at the coefficients of the model and it seems to be doing the right thing.

For a more sophisticated approach to topic modeling, I recommend you take a look at gensim.
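gensim's `LdaModel` is the usual tool for that; the same idea can also be sketched with scikit-learn's `LatentDirichletAllocation`, which avoids adding a dependency since the rest of the script already uses scikit-learn (toy documents are made up for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: two politics docs, two nonprofit docs
docs = [
    'Obama President election politics government',
    'senator politician speech election politics',
    'Greenpeace nonprofit environment donations charity',
    'nonprofit charity environment Greenpeace donations',
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per document
print(doc_topics)
```

Each row of `doc_topics` sums to 1 and tells you how strongly each document loads on each of the two topics, without needing any labels.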


3 Comments

Thank you! Yeah, the data is quite small, but I will get the right datasets in the summer, so I just need something to work with, you know? ;) I will try that out tomorrow. Thank you!
I think I read you can let these parameters tune themselves in scikit? Is this better than tuning manually?
Normally you have to tune them manually using a validation set. Scikit-learn has some helper functions (for example GridSearchCV). There are also a couple of models that auto-tune, like LassoCV.
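A minimal GridSearchCV sketch for the `alpha` smoothing parameter, using the modern `sklearn.model_selection` import path and made-up toy documents (three per class so stratified 3-fold CV works):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy labeled corpus, three docs per class
docs = [
    'Obama President election', 'politician government speech',
    'senator politics election', 'Greenpeace nonprofit environment',
    'charity donations environment', 'nonprofit Greenpeace charity',
]
labels = ['0', '0', '0', '1', '1', '1']

X = TfidfVectorizer().fit_transform(docs)

# try several smoothing strengths; cv=3 keeps each fold stratified on this tiny set
search = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 0.5, 1.0]}, cv=3)
search.fit(X, labels)
best_alpha = search.best_params_['alpha']
print(best_alpha)
```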
