Creating new column based on row values of multiple data subsetting conditions

Question

I have a dataframe that looks more or less as follows (the original one has 12 years of data):

   Year   Quarter   Age_1   Age_2   Age_3   Age_4
   2005      1       158     120     665     32
   2005      2       257     145     121     14
   2005      3       68       69     336     65
   2005      4       112     458     370     101
   2006      1       75      457     741     26
   2006      2       365     134     223     45
   2006      3       257     121     654     341
   2006      4       175     124     454     12
   2007      1       697     554     217     47
   2007      2       954     987     118     54
   2007      4       498     235     112     65

The numbers in the age columns represent the number of individuals in each age class for a specific quarter within a specific year. Note that sometimes not all quarters in a year have data (e.g., the third quarter is not represented in 2007). Also, each row represents a sampling event. Although not shown in this example, the original dataset always has more than one sampling event for a specific quarter within a specific year. For example, for the first quarter of 2005 I have 47 sampling events, and therefore 47 rows.

What I'd like to have now is a dataframe structured like this:

       Year   Quarter   Age_1   Age_2   Age_3   Age_4    Cohort
       2005      1       158     120     665     32        158
       2005      2       257     145     121     14        257
       2005      3       68       69     336     65         68
       2005      4       112     458     370     101       112
       2006      1       75      457     741     26        457 
       2006      2       365     134     223     45        134
       2006      3       257     121     654     341       121
       2006      4       175     124     454     12        124
       2007      1       697     554     217     47         47
       2007      2       954     987     118     54         54
       2007      4       498     235     112     65         65

In this case, I want to create a new column (Cohort) in my original dataset that follows my cohorts through the data. In other words, for the first year of data (2005, all quarters), I take the row values of Age_1 and paste them into the new column. For the next year (2006), I take the row values of Age_2 and paste them into the new column, and so on.

I have tried to use the following function, but somehow it only works for the first couple of years:

extract_cohort_quarter <- function(d, yearclass=2005, quarterclass=1) {

 ny <- 1:nlevels(d$Year) #no. of Year levels in the dataset 
 nq <- 1:nlevels(d$Quarter)
 age0 <- (paste("age", ny, sep="_"))
 year0 <- as.character(yearclass + ny - 1)

quarter <- as.character(rep(1:4, length(age0)))
age <- rep(age0,each=4)
year <- rep(year0,each=4)

df <- data.frame(year,age,quarter,stringsAsFactors=FALSE)

n <- nrow(df)
dnew <- NULL
for(i in 1:n) {
    tmp <- subset(d, Year==df$year[i] & Quarter==df$quarter[i])
    tmp$Cohort <- tmp[[age[i]]]
    dnew <- rbind(dnew, tmp)
}
levels(dnew$Year) <- paste("Yearclass_", yearclass, ":", 
year,":",quarter,":", age, sep="")
dnew
}

I have plenty of data from age_1 to age_12 for all the years and quarters, so I don't think the problem is related to the data structure itself.

Is there an easier solution to solve this problem? Or is there a way to improve my extract_cohort_quarter() function? Any help will be much appreciated.

-M

Why is the value of Cohort for 2007, Quarter 1, not 217?
– InspectorSands, Commented Nov 17, 2017 at 14:42
@hermestrismegistus I apologize, I made a typo. I pasted the values of age_4 instead of age_3, so you're right: it should be 217 instead of 47.
– Marie-Christine Rufener, Commented Nov 18, 2017 at 9:46

denis · Accepted Answer · 2017-11-17 15:53:23Z

2

I have a simple solution, but it requires a bit of knowledge of the data.table library. I think you can easily adapt it to your further needs. Here is the data:

library(data.table)

# Example data from the question
DT <- as.data.table(list(
  Year    = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007),
  Quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 4),
  Age_1   = c(158, 257, 68, 112, 75, 365, 257, 175, 697, 954, 498),
  Age_2   = c(120, 145, 69, 458, 457, 134, 121, 124, 554, 987, 235),
  Age_3   = c(665, 121, 336, 370, 741, 223, 654, 454, 217, 118, 112),
  Age_4   = c(32, 14, 65, 101, 26, 45, 341, 12, 47, 54, 65)
))

Here is the code:

DT[,index := .GRP, by = Year]
DT[,cohort := get(paste0("Age_",index)),by = Year]

and the output:

> DT
    Year Quarter Age_1 Age_2 Age_3 Age_4 index cohort
 1: 2005       1   158   120   665    32     1    158
 2: 2005       2   257   145   121    14     1    257
 3: 2005       3    68    69   336    65     1     68
 4: 2005       4   112   458   370   101     1    112
 5: 2006       1    75   457   741    26     2    457
 6: 2006       2   365   134   223    45     2    134
 7: 2006       3   257   121   654   341     2    121
 8: 2006       4   175   124   454    12     2    124
 9: 2007       1   697   554   217    47     3    217
10: 2007       2   954   987   118    54     3    118
11: 2007       4   498   235   112    65     3    112

What it does:

DT[,index := .GRP, by = Year]

creates an index for each distinct year in your table (by = Year performs the operation per group of years, and .GRP numbers the groups in the order in which they appear). I then use that index to pick the column named Age_ followed by the index number:

DT[,cohort := get(paste0("Age_",index)),by = Year]

You can even do everything in a single line:

DT[,cohort := get(paste0("Age_",.GRP)),by = Year]
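One caveat worth sketching: .GRP numbers the years in order of appearance, so if an entire year were missing from the data, later years would shift to a younger age column. A minimal variant, assuming the first year in the data is the year class and that one calendar year corresponds to one age step, derives the index from the year itself. The small table below is made up for illustration (2007 is deliberately absent):

```r
library(data.table)

# Hypothetical data with a whole year (2007) missing
DT <- data.table(
  Year    = c(2005, 2005, 2006, 2008),
  Quarter = c(1, 2, 1, 1),
  Age_1   = c(158, 257, 75, 697),
  Age_2   = c(120, 145, 457, 554),
  Age_3   = c(665, 121, 741, 217),
  Age_4   = c(32, 14, 26, 47)
)

first_year <- min(DT$Year)  # assumption: the cohort starts in the first year present

# Index based on elapsed years rather than on group order
DT[, index := as.integer(Year - first_year + 1)]

# Grouping by index makes it a single value inside j, so get() receives one name
DT[, cohort := get(paste0("Age_", index)), by = index]
```

Here 2008 maps to Age_4, whereas .GRP would treat it as the third group and pick Age_3.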

I hope it helps

answered Nov 17, 2017 at 15:53

denis


2 Comments

Marie-Christine Rufener Over a year ago

Thank you very much. Your suggestion worked perfectly! Would you also know how I could create a new Year column in this dataset that returns a label like: Yearclass_2005:2007:Q1:age_3 In my previous function I was able to obtain this by running the following line: levels(dnew$Year) <- paste("Yearclass_", yearclass, ":", year,":",quarter,":", age, sep="")

denis Over a year ago

DT[,newcol := paste0(Year,":",Quarter,":age_",.GRP), by = Year]
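To get closer to the Yearclass_2005:2007:Q1:age_3 format requested in the comment, the same idea can be extended as a rough sketch. The names yearclass and label are illustrative, and taking the first year in the data as the year class is an assumption:

```r
library(data.table)

# Small illustrative table; only Year and Quarter are needed for the label
DT <- data.table(
  Year    = c(2005, 2006, 2007, 2007),
  Quarter = c(1, 2, 1, 4)
)

yearclass <- min(DT$Year)  # assumption: year class = first year in the data

# .GRP is the running group number for each Year, reused as the age class
DT[, label := paste0("Yearclass_", yearclass, ":", Year,
                     ":Q", Quarter, ":age_", .GRP), by = Year]
```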

akrun · Accepted Answer · 2017-11-18 12:12:23Z

1

Here is an option using tidyverse

library(dplyr)
library(tidyr)
# df1 is the data frame from the question, sorted by Year
df1 %>%
    gather(key, Cohort, -Year, -Quarter) %>%        # wide to long: one row per age class
    separate(key, into = c('key1', 'key2')) %>%     # "Age_3" becomes "Age" and "3"
    mutate(ind = match(Year, unique(Year))) %>%     # 1 for the first year, 2 for the second, ...
    group_by(Year) %>%
    filter(key2 == ind) %>%                         # keep the age class matching the year index
    mutate(newcol = paste(Year, Quarter, paste(key1, ind, sep="_"), sep=":")) %>%
    ungroup %>% 
    select(Cohort, newcol) %>%
    bind_cols(df1, .)
#   Year Quarter Age_1 Age_2 Age_3 Age_4 Cohort       newcol
#1  2005       1   158   120   665    32    158 2005:1:Age_1
#2  2005       2   257   145   121    14    257 2005:2:Age_1
#3  2005       3    68    69   336    65     68 2005:3:Age_1
#4  2005       4   112   458   370   101    112 2005:4:Age_1
#5  2006       1    75   457   741    26    457 2006:1:Age_2
#6  2006       2   365   134   223    45    134 2006:2:Age_2
#7  2006       3   257   121   654   341    121 2006:3:Age_2
#8  2006       4   175   124   454    12    124 2006:4:Age_2
#9  2007       1   697   554   217    47    217 2007:1:Age_3
#10 2007       2   954   987   118    54    118 2007:2:Age_3
#11 2007       4   498   235   112    65    112 2007:4:Age_3
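For comparison, a dependency-free sketch in base R can pick one value per row with matrix indexing. It assumes the age columns are named Age_1 to Age_4, that the first year in the data starts the cohort, and that each later calendar year moves the cohort up one age class. It also builds df1 from the question's example so the snippet runs on its own:

```r
# Example data from the question
df1 <- data.frame(
  Year    = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007),
  Quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 4),
  Age_1   = c(158, 257, 68, 112, 75, 365, 257, 175, 697, 954, 498),
  Age_2   = c(120, 145, 69, 458, 457, 134, 121, 124, 554, 987, 235),
  Age_3   = c(665, 121, 336, 370, 741, 223, 654, 454, 217, 118, 112),
  Age_4   = c(32, 14, 65, 101, 26, 45, 341, 12, 47, 54, 65)
)

# Age columns as a numeric matrix, in age order
ages <- as.matrix(df1[, paste0("Age_", 1:4)])

# Age class to follow in each row: 1 in the first year, 2 in the next, ...
idx <- df1$Year - min(df1$Year) + 1

# A two-column (row, column) index matrix extracts one cell per row
df1$Cohort <- ages[cbind(seq_len(nrow(df1)), idx)]
df1$newcol <- paste(df1$Year, df1$Quarter, paste0("Age_", idx), sep = ":")
```

Because the lookup is done row by row, this does not depend on the row order or on how many sampling events share a year and quarter.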

edited Nov 18, 2017 at 12:12

answered Nov 17, 2017 at 15:42

akrun


2 Comments

Marie-Christine Rufener Over a year ago

Thanks, it worked right away! Just one more question: is there also an easy way to get a new column like the one produced by this line from my function: levels(dnew$Year) <- paste("Yearclass_", yearclass, ":", year,":",quarter,":", age, sep="") Basically it should give me a new time stamp that is specific to year, quarter and age. From the previous data frame example, if we are in row 10, then my new column would return something like: 2007:2:Age_3 (it is age 3, because the value in the Cohort column corresponds to this particular age)

akrun Over a year ago

@Marie-ChristineRufener Added the new column
