
I was trying to replicate a piece of research that involved machine learning. In it, the researcher used both feature selection and feature reduction before classifying with Gaussian classifiers.

My question is as follows: say I have 3 classes, and I select (say) the top 3 features for each class from a total of (say) 10 features. For example, the selected features are:

Class 1: F1 F2 F9
Class 2: F3 F4 F9
Class 3: F1 F5 F10

Since principal component analysis and linear discriminant analysis both work on the complete dataset, or at least on datasets in which all classes share the same features, how do I perform feature reduction on such a set and then perform training?

Here is the link to the paper: Speaker Dependent Audio Visual Emotion Recognition

The following is an excerpt from the paper:

The top 40 visual features were selected with Plus l-Take Away r algorithm using Bhattacharyya distance as a criterion function. The PCA and LDA were then applied to the selected feature set and finally single component Gaussian classifier was used for classification.

  • You seem to misunderstand feature selection, or you have some non-standard setup of multiclass classification. Suppose you didn't do dimensionality reduction; how would you then make a decision using different features for each class? Commented Nov 22, 2013 at 16:09
  • @larsmans I don't understand that part. But the paper states that the top 40 features were extracted for "each" class. Or did I misunderstand something? Commented Nov 22, 2013 at 16:25
  • @larsmans For example, have a look at the following link and see Figure 3: personal.ee.surrey.ac.uk/Personal/P.Jackson/pub/avsp08/… Commented Nov 22, 2013 at 16:27
  • I don't see anything in the papers you linked that suggests features were selected independently for each class. Commented Nov 22, 2013 at 18:42
  • It says "top 40 visual features for a neutral frame" (my emphasis). I'm guessing "neutral frame" refers to orientation, not class value. You can use the whole data set if you are trying to find the features that yield the maximum inter-class distances. Commented Nov 22, 2013 at 19:29

1 Answer


In the linked paper, a single set of features is developed for all classes. The Bhattacharyya distance is a measure of how separable two distributions are, and it has a closed form for a pair of Gaussians. The article doesn't appear to describe specifically how the Bhattacharyya distance is used (the average of a matrix of inter-class distances?). But once you have your Bhattacharyya-based criterion, there are a few ways you can select your features. You can start with an empty set of features and progressively add features to the set (based on how separable the classes are with each new feature), or you can start with all the features and progressively discard the features that provide the least separability. The Plus l-Take Away r algorithm combines those two approaches.
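As a minimal sketch (not from the paper), here is the closed-form Bhattacharyya distance between two Gaussians, plus one plausible subset-scoring criterion that averages the pairwise distances over all class pairs. The function names, and the choice of the mean (rather than, say, the minimum) over pairs, are assumptions:

```python
# Hedged sketch, not the paper's exact procedure.
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    # Mahalanobis-like term for the separation of the means
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # log-determinant ratio term for the covariance mismatch
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

def separability(X, y, feature_idx):
    """Mean Bhattacharyya distance over all class pairs, using only
    the columns of X listed in feature_idx (assumed criterion)."""
    Xs = X[:, feature_idx]
    classes = np.unique(y)
    params = {}
    for c in classes:
        Xc = Xs[y == c]
        # atleast_2d keeps the covariance 2-D when only one feature is used
        params[c] = (Xc.mean(axis=0), np.atleast_2d(np.cov(Xc, rowvar=False)))
    dists = [bhattacharyya(*params[a], *params[b])
             for i, a in enumerate(classes) for b in classes[i + 1:]]
    return float(np.mean(dists))
```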

Once the subset of original features has been selected, the feature reduction step reduces dimensionality through some transformation of the original features. As you quoted, the authors used both PCA and LDA. The important distinction between the two is that PCA is independent of the training class labels, and to reduce dimensionality you must choose how much of the variance to retain, whereas LDA tries to maximize the separability of the classes (by maximizing the ratio of between-class to within-class covariance) and yields at most one fewer components than the number of classes.
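For illustration only, here is how those two stages might be chained in scikit-learn. The 95% variance threshold is an assumption, and QuadraticDiscriminantAnalysis is used as a stand-in for the paper's single-component Gaussian classifier (it fits one Gaussian per class; the paper's exact implementation may differ):

```python
# Hedged sketch of the reduction + classification stages.
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

pipeline = make_pipeline(
    PCA(n_components=0.95),           # keep 95% of variance (assumed threshold)
    LinearDiscriminantAnalysis(),     # projects onto at most n_classes - 1 directions
    QuadraticDiscriminantAnalysis(),  # one Gaussian fitted per class
)
# X_sel_* are hypothetical names for the data restricted to the selected features:
# pipeline.fit(X_sel_train, y_train)
# predictions = pipeline.predict(X_sel_test)
```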

But the important point here is that after feature selection and reduction, the same set of features is used for all classes.


5 Comments

Thanks a lot, but my original confusion is still there. Could you help me with that? Can you write a sample algorithm just to show how a distance measure can be used? No need to emphasize Plus l-Take Away r; a simple SFS would do. I just want to know how to use the distance. Thanks a lot.
Just a comment on my last comment under the question would also do. Please just tell me whether that approach is correct.
Just as a simple example, you could calculate a matrix of Bhattacharyya distances (A-B, A-C, ..., A-Z, B-C, ..., Y-Z). It would only need to be a lower triangular matrix (to avoid the symmetric and self-distances). Then take the average of those entries and use that as your distance measure in your SFS algorithm (a sketch of this appears after these comments).
Thanks a lot! Average of the array means the average of the whole matrix, right? I.e., add each element of the matrix and then divide by the total number of elements? I understand the concept now, thanks. Just one more question: since the number of classes is constant in each case, would there actually be a difference between using the average of the distances and the sum of the distances? I've marked it as correct, but please do answer this :P
Correct. Since the number of classes is fixed, either the sum or the mean would be equivalent for feature selection.
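Here is a minimal sketch of the SFS loop asked for above, assuming a criterion such as the separability function from the earlier sketch; the names and the greedy add-one-feature strategy are assumptions, not the paper's exact procedure:

```python
# Simple sequential forward selection: greedily add the feature whose
# inclusion maximizes the criterion (e.g. the averaged Bhattacharyya
# distance over class pairs).
def sfs(X, y, criterion, k):
    """Greedily pick k column indices of X maximizing criterion(X, y, subset)."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k and remaining:
        # score each candidate feature when added to the current subset
        scores = [(criterion(X, y, selected + [f]), f) for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

For the question's example, something like sfs(X, y, separability, 3) would return three column indices chosen jointly for all classes, which is what makes the subsequent PCA/LDA step well-defined.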
