Creating Multiple columns in r dataframe based on existing multiple columns

Question

I have a csv dataset which is similar to the following:

Age    Food_1_1 Food_1_2 Food_1_3  Amount_1_1 Amount_1_2 Amount_1_3
6        a        b          a      2          3           4
6        b        b          c      1          2           3
6                 c          a                 4           1
9        c        c          b      1          3           1
9        c        a          b      1          2           1

Using R, I want to get the following data set which contains new set of columns a, b and c by adding the corresponding values:

Age    Food_1_1 Food_1_2 Food_1_3  Amount_1_1 Amount_1_2 Amount_1_3   a   b   c    
6            a        b        a      2          3           4        6   3   0
6            b        b        c      1          2           3        0   3   3
6                     c        a                 4           1        1   0   4
9            c        c        b      1          3           1        0   1   4
9            c        a        b      1          2           1        2   1   1

Note: My data also contains missing values. The variables Food_1_1:Food_1_3 are factors and the variables Value1:Value3 are numeric. For more clearity: 1st row of column "a" contains the addition of all amount through Amount_1_1 to Amount_1_3 related to a (say for example 2+4 =6).

In your table, there are not variables "Monday:Wednesday". Which of the columns are factors then? — Fabrizio
– Fabrizio, Commented May 22, 2020 at 11:03

Darren Tsai · Accepted Answer · 2020-05-23 07:47:08Z

Stack Food and Amount to 2 columns and count observations by group.

library(tidyverse)

df %>%
  rowid_to_column("id") %>% 
  pivot_longer(-(1:2), names_to = c(".value", "set"), names_sep = "_1_", values_drop_na = T) %>% 
  group_by(id, Food) %>%
  tally(Amount) %>%
  ungroup() %>% 
  pivot_wider(names_from = Food, values_from = n, values_fill = list(n = 0))

More concisely, use the argument values_fn:

df %>%
  rowid_to_column("id") %>% 
  pivot_longer(-(1:2), names_to = c(".value", "set"), names_sep = "_1_", values_drop_na = T) %>% 
  pivot_wider(id_cols = id, names_from = Food, values_from = Amount,
              values_fill = list(Amount = 0), values_fn = list(Amount = sum))

# # A tibble: 5 x 4
#      id     a     b     c
#   <int> <int> <int> <int>
# 1     1     6     3     0
# 2     2     0     3     3
# 3     3     1     0     4
# 4     4     0     1     4
# 5     5     2     1     1

Data

df <- structure(list(Age = c(6L, 6L, 6L, 9L, 9L), Food_1_1 = structure(c(1L, 
2L, NA, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
Food_1_2 = structure(c(2L, 2L, 3L, 3L, 1L), .Label = c("a", 
"b", "c"), class = "factor"), Food_1_3 = structure(c(1L, 
3L, 1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), 
Amount_1_1 = c(2L, 1L, NA, 1L, 1L), Amount_1_2 = c(3L, 2L, 
4L, 3L, 2L), Amount_1_3 = c(4L, 3L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -5L))

Daniel O · Accepted Answer · 2020-05-25 11:19:47Z

1

Using Darren Tsai's Example Data and in Base-R

we can get the values with

sapply(c("a","b","c"), function(x) apply(df,1, function(z) sum(as.numeric(z[grep(x,z)+3]))))

     a b c
[1,] 6 3 0
[2,] 0 3 3
[3,] 1 0 4
[4,] 0 1 4
[5,] 2 1 1

If you want to append them to your data frame then we can

cbind(df, The-above-code-goes-here )


  Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3 a b c
1   6        a        b        a          2          3          4 6 3 0
2   6        b        b        c          1          2          3 0 3 3
3   6     <NA>        c        a         NA          4          1 1 0 4
4   9        c        c        b          1          3          1 0 1 4
5   9        c        a        b          1          2          1 2 1 1

Edit: A breakdown of the code.

sapply(c("a","b","c"), function(x) ... ) This sapply function takes the arguments from the list provided a, b, & c, and runs the function in ... one at a time but renames the arguments to x so that they can be referenced.

The next layer is apply(df,1, function(z) ... ) This apply function takes the argument df - your data frame, and runs the function inside of it ... row-by-row (specified with 1).

grep(x,z) is now finding which column the letter's x = a,b,c are found in that specific row of the dataframe df = z. Now that we've located the correct column(s) that contain the letters, the actual numbers are we want to sum are three columns to the the right, so we add grep(x,z)+3 to access those numbers.

lastly sum(as.numeric(z[grep(x,z)+3])) We are summing z[ 'columns' ] where if you recal z is the row of your dataframe in every iteration of the apply 'loop'.

edited May 25, 2020 at 11:19

answered May 22, 2020 at 12:26

Daniel O

4,3789 silver badges21 bronze badges

2 Comments

fitrat Over a year ago

@ Thank you Daniel. Your code worked for my given data. However, I have some questions. What do these function(x) and function (z) for? In apply function, should it be apply (df, 2) instead of 1 as I am summing the elements across the rows? Also, I was wondering what does grep(x,z)+3 means?

Daniel O Over a year ago

I've added some detailed explanation for you. If this helped please consider accepting my answer.

Fabrizio · Accepted Answer · 2020-05-22 11:22:36Z

0

A possible solution would be the following: First, consider I put your data in a txt file formatted as follows:

Age,Food_1_1,Food_1_2,Food_1_3,Amount_1_1,Amount_1_2,Amount_1_3
6,a,b,a,2,3,4
6,b,b,c,1,2,3
6,,c,a,,4,1
9,c,c,b,1,3,1
9,c,a,b,1,2,1

The code to read the file and compute the sums is the following:

data=read.csv("test_table.txt", stringsAsFactors = FALSE)
lwords=c("a","b","c")
for(aw in lwords){
  data[,ncol(data)+1]=apply(data, 1, function(x, aw){
    return(sum(as.numeric(x[5:7][aw==x[2:4]])))
  }, aw=aw)
  colnames(data)[ncol(data)]=aw
}

The idea is very simple. In the for-loop you check if a word (imagine "a") appears in columns 2 to 4. This generates a list of true/false. You then use this list of true/false to sum the content of the columns 5 to 7.

The result:

> data
  Age Food_1_1 Food_1_2 Food_1_3 Amount_1_1 Amount_1_2 Amount_1_3 a b c
1   6        a        b        a          2          3          4 6 3 0
2   6        b        b        c          1          2          3 0 3 3
3   6                 c        a         NA          4          1 1 0 4
4   9        c        c        b          1          3          1 0 1 4
5   9        c        a        b          1          2          1 2 1 1

answered May 22, 2020 at 11:22

Fabrizio

94711 silver badges23 bronze badges

2 Comments

fitrat Over a year ago

Thank you. However, I am getting an entire 3rd row with NAs for columns a, b ,c. Is there any way, I can fix it for the missing value in that row. I am assuming this is happening because of the missing Food_1_1 in that row.

Fabrizio Over a year ago

I can't really help unless I see your data frame. Have you tried to format it as I wrote? As a comma-separated file?

Collectives™ on Stack Overflow

Creating Multiple columns in r dataframe based on existing multiple columns

3 Answers 3

Comments

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related