Sorting the rows of data by similarity of columns in R

Question

I have a data set that I want to sort in R in the following way. I hope the explanation is clear.

1. Sort by the elements in the Main column. This gives two chunks: one with all As and one with all Gs.
2. For the first chunk, move to the -1 column and sort by the elements there (there are two, C and T). This splits the first chunk into two smaller chunks: one with A in the Main column and C in the -1 column, and one with A in the Main column and T in the -1 column.
3. For the second chunk, move to the -1 column and do the same. This gives two smaller chunks: one with G in the Main column and C in the -1 column, and one with G in the Main column and T in the -1 column.
4. Move to the +1 column and do the same. At each step, every existing chunk is partitioned into new, smaller chunks.

I do not want to break up the rows. I want to reorder the rows only; the columns must stay where they are. How can I do that?

An idea: when I did this sorting by hand, the result had the shape of a normal distribution. That is why I gave every column a weight obtained from the normal distribution function. I then computed a weighted covariance matrix (number of rows x number of rows) from the dissimilarity coefficients between rows and those weights. Finally, I ranked the data using the eigenvectors of the correlation matrix, which includes a penalty for missing data. However, I could not reproduce the result I got by hand. My data set is very large, so I am sharing only a small part of it.

-7  -6  -5  -4  -3  -2  -1  Main    1   2   3   4
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   T   C   G   C   T   C   G   G   G   T   G
A   C   C   A   C   C   T   A   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
G   C   T   G   C   T   T   G   G   G   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   G   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   G   G   G   T   G
A   C   C   A   T   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   G   C   T   T   G   A   G   C   T
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G
A   C   C   A   C   C   T   A   G   A   T   G

Edward Carney · Accepted Answer · 2018-01-31 04:56:58Z

As I read your query, you wish to sort by column Main, then by X.1 (the -1 column) within groups of Main, then by X1 (the +1 column) within groups of X.1. The following will do just that:

library(dplyr)
# Sort rows by Main, then X.1 within Main, then X1 within X.1; columns stay in place
data.sort <- arrange(data, Main, X.1, X1)

   X.7 X.6 X.5 X.4 X.3 X.2 X.1 Main X1 X2 X3 X4
1    A   C   C   A   C   C   T    A  G  A  T  G
2    A   C   C   A   C   C   T    A  G  A  T  G
3    A   C   C   A   C   C   T    A  G  A  T  G
4    A   C   C   A   C   C   T    A  G  A  T  G
5    A   C   C   A   C   C   T    A  G  A  T  G
6    A   C   C   A   C   C   T    A  G  A  T  G
7    A   C   C   A   C   C   T    A  G  A  T  G
8    A   C   C   A   C   C   T    A  G  A  T  G
9    A   C   C   A   C   C   T    A  G  A  T  G
10   A   C   C   A   C   C   T    A  G  A  T  G
11   A   C   C   A   C   C   T    A  G  A  T  G
12   A   C   C   A   C   C   T    A  G  A  T  G
13   A   C   C   A   C   C   T    A  G  A  T  G
14   A   C   C   A   C   C   T    A  G  A  T  G
15   A   C   C   A   C   C   T    A  G  A  T  G
16   A   C   C   A   T   C   T    A  G  A  T  G
17   A   C   C   A   C   C   T    A  G  A  T  G
18   A   C   C   A   C   C   T    A  G  A  T  G
19   A   C   C   A   C   C   T    A  G  A  T  G
20   A   C   C   A   C   C   T    A  G  A  T  G
21   A   C   C   A   C   C   T    A  G  A  T  G
22   A   C   C   A   C   C   T    A  G  A  T  G
23   A   C   C   A   C   C   T    A  G  A  T  G
24   A   C   C   A   C   C   T    A  G  A  T  G
25   A   C   C   A   C   C   T    A  G  A  T  G
26   A   C   C   A   C   C   T    A  G  A  T  G
27   A   T   C   G   C   T   C    G  G  G  T  G
28   A   C   C   G   C   T   T    G  A  G  C  T
29   G   C   T   G   C   T   T    G  G  G  T  G
30   A   C   C   A   C   C   T    G  G  A  T  G
31   G   C   T   G   C   T   T    G  G  G  T  G
32   A   C   C   A   C   C   T    G  G  A  T  G
33   A   C   C   A   C   C   T    G  G  A  T  G
34   A   C   C   A   C   C   T    G  G  G  T  G

You can reverse the order with desc() as follows:

data.sort <- arrange(data, desc(Main), desc(X.1), desc(X1))

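N.B. The column names must be syntactically valid R names, so they cannot be bare numbers or start with a minus sign. When the data are read with read.table() and its default check.names = TRUE, the header -1 becomes X.1 and 1 becomes X1, which is why those names are used above.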
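
The renaming can be checked with a minimal sketch like the one below. The three-column excerpt and the names raw_names, valid_names, and data_small are made up for illustration; make.names() is the base R function that read.table() applies to headers when check.names = TRUE.

```r
# Raw headers as they appear in the question's table.
raw_names <- c("-7", "-6", "-5", "-4", "-3", "-2", "-1", "Main",
               "1", "2", "3", "4")

# make.names() turns each header into a syntactically valid R name.
valid_names <- make.names(raw_names)
print(valid_names)

# Reading a small, made-up excerpt shows the same renaming in practice.
# colClasses = "character" keeps every column as text; without it, a column
# that contains only "T" would be converted to logical TRUE.
txt <- "-1 Main 1
T A G
C G G
T G A"
data_small <- read.table(text = txt, header = TRUE,
                         colClasses = "character")
print(names(data_small))
```

Forcing character columns is a defensive choice for nucleotide data, where T is a legitimate value rather than shorthand for TRUE.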

edited Jan 31, 2018 at 4:56

answered Jan 31, 2018 at 4:39


3 Comments

genmes Over a year ago

Thanks for sharing. As I said, this is only a small part of my data set. I have more than 600 columns. How would this solution work for them?

Edward Carney Over a year ago

The other way to do the multi-column sort is to use the order() function: data[order(data[, 8], data[, 7], data[, 9]), ] sorts by Main (column 8), then -1 (column 7), then +1 (column 9). I don't know whether either method would scale up to 600 columns, though. If every column is used, that is an extensive branching tree (600 levels deep).
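
For many columns, the order() call does not have to be typed out by hand. The following is a rough sketch, not something from the original answer: sort_centre_out is a hypothetical helper, and it assumes the sort keys should alternate outward from Main (-1, +1, -2, +2, and so on), which is one reading of the steps in the question. do.call() passes every key column to order() as a separate argument.

```r
# Hypothetical helper: sort rows by the centre column, then by the columns
# at offsets -1, +1, -2, +2, ... from it. Column positions are unchanged.
sort_centre_out <- function(data, centre = "Main") {
  mid <- match(centre, names(data))
  stopifnot(!is.na(mid))
  n <- ncol(data)
  max_off <- max(mid - 1, n - mid)
  # Interleave the offsets as -1, +1, -2, +2, ...
  offsets <- as.vector(rbind(-seq_len(max_off), seq_len(max_off)))
  idx <- mid + c(0, offsets)
  idx <- idx[idx >= 1 & idx <= n]  # drop positions outside the table
  # unname() stops column names from being matched to order()'s own arguments
  ord <- do.call(order, unname(as.list(data[idx])))
  data[ord, , drop = FALSE]
}

# Made-up example data with the same naming scheme as above.
demo <- data.frame(
  X.2  = c("C", "T", "C", "C"),
  X.1  = c("T", "T", "C", "T"),
  Main = c("G", "A", "G", "A"),
  X1   = c("G", "G", "G", "A"),
  stringsAsFactors = FALSE
)
sorted <- sort_centre_out(demo)
print(sorted)
```

Because order() only consults a later key to break ties in the earlier ones, the number of key columns is not in itself a problem; most of the 600 columns will only matter for rows that agree on everything nearer to Main.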

genmes Over a year ago

Unfortunately, it doesn't give what I want. As far as I understand I should use a clustering method but I do not know which one and how.
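
The thread does not say which clustering method would fit. As one possible starting point only, the sketch below orders rows by hierarchical clustering on the Hamming distance (the number of columns in which two rows differ). The helper name hamming_order, the average-linkage choice, and the example data are all assumptions made for illustration; hclust() and as.dist() come from the stats package that ships with R.

```r
# Hypothetical helper: reorder rows so that similar rows end up adjacent,
# using hierarchical clustering on the Hamming distance between rows.
hamming_order <- function(data) {
  m <- as.matrix(data)
  n <- nrow(m)
  d <- matrix(0, n, n)
  # Add 1 to d[i, k] for every column in which rows i and k differ.
  for (j in seq_len(ncol(m))) {
    d <- d + outer(m[, j], m[, j], FUN = "!=")
  }
  hc <- hclust(as.dist(d), method = "average")
  # hc$order is the left-to-right order of the rows in the dendrogram.
  data[hc$order, , drop = FALSE]
}

# Made-up example: two distinct row patterns, interleaved.
demo2 <- data.frame(
  X.1  = c("T", "C", "T", "C"),
  Main = c("A", "G", "A", "G"),
  X1   = c("G", "A", "G", "A"),
  stringsAsFactors = FALSE
)
clustered <- hamming_order(demo2)
print(clustered)
```

The distance matrix is rows x rows, so for a very large data set it may be necessary to collapse duplicate rows first or to cluster a sample. Unlike the multi-column sort, this treats all columns equally and does not give Main any priority.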
