0

-Dataset

ID<-c(1,2,3,4,5,6,7)
method<-c("cheque","DD","DD","Cheque","NetBank","NetBank","Cash")
type<-c("Type1","Type1","Type2","Type2","Type3","Type3","Type4")    
aid<-c("A1","A1","A2","A2","A3","A3","A4")  
month<-c("JAN","JAN","FEB","FEB","MAR","MAR","APR")
year<-c(2016,2016,2015,2015,2017,2017,2018)
Outcome<-c("Positive","Positive","Negative","Negative","Medium","Medium","Neutral")
ser_no<-c("A00001","A00001","A00002","A00002","A00003","A00003","A00004")
Units<-c(100,200,300,400,500,600,700)
amt<-c(1000,1500,2000,3000,4000,2500,6000)
user_cnt<-c(20,20,15,15,32,32,44)

data<-data.frame(ID=ID,type=type,aid=aid,month=month,year=year,Outcome=Outcome,ser_no=ser_no,Units=Units,amt=amt,user_cnt=user_cnt,method=method)  

R> data

  ID  type aid month year  Outcome ser_no Units  amt user_cnt  method
  1 Type1  A1   JAN 2016 Positive A00001   100 1000       20  cheque
  2 Type1  A1   JAN 2016 Positive A00001   200 1500       20      DD
  3 Type2  A2   FEB 2015 Negative A00002   300 2000       15      DD
  4 Type2  A2   FEB 2015 Negative A00002   400 3000       15  Cheque
  5 Type3  A3   MAR 2017   Medium A00003   500 4000       32 NetBank
  6 Type3  A3   MAR 2017   Medium A00003   600 2500       32 NetBank
  7 Type4  A4   APR 2018  Neutral A00004   700 6000       44     


Output<-sqldf("select type,aid,month,year,Outcome,ser_no,count(distinct ID) as members,count(type) as entries,sum(UNITS) as UNITS,sum(amt) as amt,
min(amt) as LowestAmt,max(amt) as HighestAmount,AVG(amt) as Mean,user_cnt,cast (count(distinct ID) as real)/user_cnt as Suggestion 
from data group by type,aid,month,year,Outcome,ser_no")

R>Output

 type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion  
 Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000     
 Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333 
 Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000 
 Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727 
  • I want to add method column from data in my output.

    Method column can have only four value 1.Cheque 2.DD 3.NetBank 4.Blank denotes cash

I want to add a method column values in output below(Check last four columns). Can it be done without sqldf. I am trying to find the occurence of method values in a group.

Example: As per GROUP BY clause the row one has 1 Cheque and 1 DD value thus there count is displayed as 1. Netbank and Cash values are not present thus there count is 0. As per GROUP BY clause the row three has2 Netbank values thus there count is displayed as 2 and as there are no Netbank,Cash and Cheque values thus there count is 0.

 type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion    Cheque      DD  Netbank     Cash
 Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000      1         1      0         0
 Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333      1         1      0         0
 Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000      0         0      2         0
 Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727      0         0      0         1

3 Answers 3

2

I could not solve the case issue on 'cheque' as tolower is not working under sqldf.So included both the options.

sqldf("select type
,aid
,month
,year
,Outcome
,ser_no
,count(distinct ID) as members
,count(type) as entries
,sum(UNITS) as UNITS
,sum(amt) as amt
,min(amt) as LowestAmt
,max(amt) as HighestAmount
,AVG(amt) as Mean
,user_cnt
,cast (count(distinct ID) as real)/user_cnt as Suggestion
,count(case when lower(method)='cheque' then method  end ) as cheque 
,count(case when method ='DD' then method  end ) as DD 
,count(case when method ='NetBank' then method  end ) as NetBank 
,count(case when method ='Cash' then method  end ) as Cash 
from data
group by type,aid,month,year,Outcome,ser_no")
Sign up to request clarification or add additional context in comments.

Comments

1

with data table :

library(data.table)
DT <- setDT(data)
DT[,method := tolower(method)] # to avoid different count with upper and lower case
plouf<-dcast(DT[,.N, by = .(type,method)],type~ method)
plouf[is.na(plouf)]<-0 

    type cash cheque dd netbank
1: Type1   0     1    1      0
2: Type2   0     1    1      0
3: Type3   0     0    0      2
4: Type4   1     0    0      0

Here DT[,.N, by = .(type,method)] counts the different methods, and dcasdt transform it into large format. You can then merge with your output

Output <- setDT(Output)
Output[plouf, on = "type"]

    type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion cash
1: Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000    0
2: Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333    0
3: Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000    0
4: Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727    1
   cheque dd netbank
1:      1  1      0
2:      1  1      0
3:      0  0      2
4:      0  0      0

2 Comments

Is there any way to get the expected output in one query itself rather than appending it? I am grouping on Six Columns type,aid,month,year,Outcome,ser_no and not just on type. dcast(data, type + month + year + Outcome + aid + ser_no ~ method, fun.aggregate = length)
you could use the same code here by creating a grouping variable : DT[,grp := paste(type,aid,month,year,Outcome,ser_no,sep = "_")] And then dcast(DT[,.N, by = .(grp,method)],grp ~ method)
1

The whole aggregation can be done in one statement using data.table:

library(data.table)
setDT(data)[
  , .(members = uniqueN(ID), entries = .N, UNITS = sum(Units), amt = sum(amt), 
      LowestAmt = min(amt), HighestAmount = max(amt), Mean = mean(amt), 
      user_cnt = first(user_cnt), Suggestion =  uniqueN(ID) / first(user_cnt),
      Cheque = sum(tolower(method) == "cheque"), DD = sum(tolower(method) == "dd"), 
      NetBank = sum(tolower(method) == "netbank"), 
      Cash = sum(tolower(method) %in% c("cash", ""))), 
  by = .(type, aid, month, year, Outcome, ser_no)]
    type aid month year  Outcome ser_no members entries UNITS  amt LowestAmt HighestAmount Mean user_cnt Suggestion Cheque DD NetBank Cash
1: Type1  A1   JAN 2016 Positive A00001       2       2   300 2500      1000          1500 1250       20 0.10000000      1  1       0    0
2: Type2  A2   FEB 2015 Negative A00002       2       2   700 5000      2000          3000 2500       15 0.13333333      1  1       0    0
3: Type3  A3   MAR 2017   Medium A00003       2       2  1100 6500      2500          4000 3250       32 0.06250000      0  0       2    0
4: Type4  A4   APR 2018  Neutral A00004       1       1   700 6000      6000          6000 6000       44 0.02272727      0  0       0    1

In case there are more than just 4 different values in method I would suggest other approaches like dcast() and join.

1 Comment

It is much slower than PIG answer on dataset of 1.4 million rows. PIG answer takes 17 seconds where data.table takes 1 minute 14 seconds. @Uwe

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.