I am new to R. I created the function below to calculate the mean of a pollutant across datasets contained in 332 CSV files. I am seeking advice on how I could improve this code. It takes about 38 seconds to run, which makes me think it is not very efficient.

pollutantmean <- function(directory, pollutant, id = 1:332) {
        files_list <- list.files(directory, full.names = TRUE) # create list of files
        dat <- data.frame()                                     # create empty data frame
        for (i in id) {
                dat <- rbind(dat, read.csv(files_list[i]))      # combine all the monitor data together
        }
        good <- complete.cases(dat)  # identify rows with no NA values
        mean(dat[good, pollutant])   # calculate mean of the pollutant column
} # run time ~ 37 sec - NEED TO OPTIMISE THE CODE
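
For reference, the run time can be measured with base R's system.time (a minimal sketch; "specdata" and "sulfate" are placeholder arguments, not taken from the question):

    # Time one call to the function; "elapsed" is the wall-clock figure
    system.time(pollutantmean("specdata", "sulfate", 1:332))
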
  • In short - never use loops in R, they are always slow. Also, do you really need 332 files? This is horribly slow. Why not append them to one big file (see the sketch after these comments)? Commented Apr 19, 2015 at 12:19
  • Maybe just create dat using dat <- do.call(rbind, lapply(files_list, read.csv)) instead of the way you are doing it. Commented Apr 19, 2015 at 12:21
  • @lejlot, first, loops are not always slow; it usually depends on the circumstances. Second, this is just a Coursera task. Commented Apr 19, 2015 at 12:22
  • @DavidArenburg I would love to see an example of a fast loop in R :-) Commented Apr 19, 2015 at 12:29
  • @lejlot take a look here; there are some nice examples at the end, too. Commented Apr 19, 2015 at 12:33
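
A rough sketch of the "one big file" idea from the first comment (the directory name "specdata" and the output file "combined.csv" are assumptions, not from the thread):

    # Read and stack all monitor files once, then save a single combined CSV
    files_list <- list.files("specdata", full.names = TRUE)
    all_dat    <- do.call(rbind, lapply(files_list, read.csv))
    write.csv(all_dat, "combined.csv", row.names = FALSE)

    # Later runs then need only one read instead of 332
    dat <- read.csv("combined.csv")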

2 Answers

Instead of creating an empty data.frame and calling rbind on every iteration of a for loop, you can store all the data.frames in a list and combine them in one shot. You can also use the na.rm argument of mean() so that NA values are not taken into account.

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
    df         = do.call(rbind, lapply(files_list, read.csv))  # read and combine in one shot

    mean(df[[pollutant]], na.rm=TRUE)                          # ignore NAs when averaging
}
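
For example, a call could look like this (a sketch; "specdata" and "sulfate" are placeholder arguments for the directory and column name, not taken from the question):

    # Mean sulfate value across monitors 1-10, assuming the CSVs live in "specdata"
    pollutantmean("specdata", "sulfate", 1:10)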

Optional - I would increase the readability with magrittr:

library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    list.files(directory, full.names = TRUE)[id] %>%
        lapply(read.csv) %>%
        do.call(rbind,.) %>%
        extract2(pollutant) %>%
        mean(na.rm=TRUE)
}

You can improve it by using data.table's fread function (see Quickly reading very large tables as dataframes in R). Binding the results with data.table::rbindlist is also much faster.

require(data.table)    

pollutantmean <- function(directory, pollutant, id = 1:332) {
    files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
    DT = rbindlist(lapply(files_list, fread))                  # fast read + fast bind
    mean(DT[[pollutant]], na.rm=TRUE)                          # ignore NAs when averaging
}
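
If you want to quantify the difference, a quick comparison could look like the following (a sketch; it assumes the read.csv version has been saved under a different name such as pollutantmean_base, and "specdata"/"sulfate" are placeholder arguments):

    # Compare wall-clock time of the two approaches (hypothetical names and arguments)
    system.time(pollutantmean_base("specdata", "sulfate"))  # read.csv + do.call(rbind, ...)
    system.time(pollutantmean("specdata", "sulfate"))       # fread + rbindlist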
