5


I have millions of short (up to 30 words) documents which I need to split into several known categories. It's possible, that a document matches several of the categories (seldom, but possible). It's also possible that a document doesn't match any of the categories (also seldom). I also have millions of documents which have already been categorized. What algorithm should I use to do the job. I don't need to do it fast. I need to be sure that the algorithm categorizes correctly (as far as possible).
What algorithm should I use? Is there an implementation of in in C#?
Thank you for your help!

5 Answers 5

8

Take a look at term frequency and inverse document frequency also cosine similarity to find important words to create categories and assign documents to categories based on similarity

EDIT:

Found an example here

Sign up to request clarification or add additional context in comments.

Comments

1

Interesting articles :

Comments

1

The major issue IMHO here is the length of the documents. I think I would call it phrase classification and there is work going on on this because of the twitter thing. You could bring in additional text performing a web search on the 30 words and then analyzing the top matches. There is a paper about this but I can't find it right now. Then I would try a feature vector approach (tdf-idf as in Jimmy's answer) and a multiclass SVM for classification.

Comments

0

Perhaps a decision tree combined with a NN?

Comments

0

You can use SVM Algorithm for Classify text in C# with libsvm.net library.

1 Comment

Why the late (and less complete answer)?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.