Splitting specific column values into multiple columns

Question

-Dataset

ID<-c(1,2,3,4,5,6,7)
method<-c("cheque","DD","DD","Cheque","NetBank","NetBank","Cash")
type<-c("Type1","Type1","Type2","Type2","Type3","Type3","Type4")    
aid<-c("A1","A1","A2","A2","A3","A3","A4")  
month<-c("JAN","JAN","FEB","FEB","MAR","MAR","APR")
year<-c(2016,2016,2015,2015,2017,2017,2018)
Outcome<-c("Positive","Positive","Negative","Negative","Medium","Medium","Neutral")
ser_no<-c("A00001","A00001","A00002","A00002","A00003","A00003","A00004")
Units<-c(100,200,300,400,500,600,700)
amt<-c(1000,1500,2000,3000,4000,2500,6000)
user_cnt<-c(20,20,15,15,32,32,44)

data<-data.frame(ID=ID,type=type,aid=aid,month=month,year=year,Outcome=Outcome,ser_no=ser_no,Units=Units,amt=amt,user_cnt=user_cnt,method=method)

R> data

  ID  type aid month year  Outcome ser_no Units  amt user_cnt  method
  1 Type1  A1   JAN 2016 Positive A00001   100 1000       20  cheque
  2 Type1  A1   JAN 2016 Positive A00001   200 1500       20      DD
  3 Type2  A2   FEB 2015 Negative A00002   300 2000       15      DD
  4 Type2  A2   FEB 2015 Negative A00002   400 3000       15  Cheque
  5 Type3  A3   MAR 2017   Medium A00003   500 4000       32 NetBank
  6 Type3  A3   MAR 2017   Medium A00003   600 2500       32 NetBank
  7 Type4  A4   APR 2018  Neutral A00004   700 6000       44     


Output<-sqldf("select type,aid,month,year,Outcome,ser_no,count(distinct ID) as members,count(type) as entries,sum(UNITS) as UNITS,sum(amt) as amt,
min(amt) as LowestAmt,max(amt) as HighestAmount,AVG(amt) as Mean,user_cnt,cast (count(distinct ID) as real)/user_cnt as Suggestion 
from data group by type,aid,month,year,Outcome,ser_no")

R>Output

 type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion  
 Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000     
 Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333 
 Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000 
 Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727

I want to add method column from data in my output.

Method column can have only four value 1.Cheque 2.DD 3.NetBank 4.Blank denotes cash

I want to add a method column values in output below(Check last four columns). Can it be done without sqldf. I am trying to find the occurence of method values in a group.

Example: As per GROUP BY clause the row one has 1 Cheque and 1 DD value thus there count is displayed as 1. Netbank and Cash values are not present thus there count is 0. As per GROUP BY clause the row three has2 Netbank values thus there count is displayed as 2 and as there are no Netbank,Cash and Cheque values thus there count is 0.

 type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion    Cheque      DD  Netbank     Cash
 Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000      1         1      0         0
 Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333      1         1      0         0
 Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000      0         0      2         0
 Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727      0         0      0         1

PIG · Accepted Answer · 2017-12-11 21:10:24Z

2

I could not solve the case issue on 'cheque' as tolower is not working under sqldf.So included both the options.

sqldf("select type
,aid
,month
,year
,Outcome
,ser_no
,count(distinct ID) as members
,count(type) as entries
,sum(UNITS) as UNITS
,sum(amt) as amt
,min(amt) as LowestAmt
,max(amt) as HighestAmount
,AVG(amt) as Mean
,user_cnt
,cast (count(distinct ID) as real)/user_cnt as Suggestion
,count(case when lower(method)='cheque' then method  end ) as cheque 
,count(case when method ='DD' then method  end ) as DD 
,count(case when method ='NetBank' then method  end ) as NetBank 
,count(case when method ='Cash' then method  end ) as Cash 
from data
group by type,aid,month,year,Outcome,ser_no")

edited Dec 11, 2017 at 21:10

answered Dec 11, 2017 at 10:11

PIG

6125 silver badges15 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

Akki · Accepted Answer · 2017-12-11 11:47:35Z

1

with data table :

library(data.table)
DT <- setDT(data)
DT[,method := tolower(method)] # to avoid different count with upper and lower case
plouf<-dcast(DT[,.N, by = .(type,method)],type~ method)
plouf[is.na(plouf)]<-0 

    type cash cheque dd netbank
1: Type1   0     1    1      0
2: Type2   0     1    1      0
3: Type3   0     0    0      2
4: Type4   1     0    0      0

Here DT[,.N, by = .(type,method)] counts the different methods, and dcasdt transform it into large format. You can then merge with your output

Output <- setDT(Output)
Output[plouf, on = "type"]

    type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion cash
1: Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000    0
2: Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333    0
3: Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000    0
4: Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727    1
   cheque dd netbank
1:      1  1      0
2:      1  1      0
3:      0  0      2
4:      0  0      0

edited Dec 11, 2017 at 11:47

Akki

1,2513 gold badges15 silver badges35 bronze badges

answered Dec 11, 2017 at 7:48

denis

5,7211 gold badge16 silver badges45 bronze badges

2 Comments

Akki Over a year ago

Is there any way to get the expected output in one query itself rather than appending it? I am grouping on Six Columns type,aid,month,year,Outcome,ser_no and not just on type. dcast(data, type + month + year + Outcome + aid + ser_no ~ method, fun.aggregate = length)

denis Over a year ago

you could use the same code here by creating a grouping variable : DT[,grp := paste(type,aid,month,year,Outcome,ser_no,sep = "_")] And then dcast(DT[,.N, by = .(grp,method)],grp ~ method)

Uwe · Accepted Answer · 2017-12-12 11:59:14Z

1

The whole aggregation can be done in one statement using data.table:

library(data.table)
setDT(data)[
  , .(members = uniqueN(ID), entries = .N, UNITS = sum(Units), amt = sum(amt), 
      LowestAmt = min(amt), HighestAmount = max(amt), Mean = mean(amt), 
      user_cnt = first(user_cnt), Suggestion =  uniqueN(ID) / first(user_cnt),
      Cheque = sum(tolower(method) == "cheque"), DD = sum(tolower(method) == "dd"), 
      NetBank = sum(tolower(method) == "netbank"), 
      Cash = sum(tolower(method) %in% c("cash", ""))), 
  by = .(type, aid, month, year, Outcome, ser_no)]

    type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion Cheque DD NetBank Cash
1: Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000      1  1       0    0
2: Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333      1  1       0    0
3: Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000      0  0       2    0
4: Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727      0  0       0    1

In case there are more than just 4 different values in method I would suggest other approaches like dcast() and join.

answered Dec 12, 2017 at 11:59

Uwe

42.8k13 gold badges97 silver badges143 bronze badges

1 Comment

Akki Over a year ago

It is much slower than PIG answer on dataset of 1.4 million rows. PIG answer takes 17 seconds where data.table takes 1 minute 14 seconds. @Uwe

Collectives™ on Stack Overflow

Splitting specific column values into multiple columns

3 Answers 3

Comments

2 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related