I am new to R. I created the function below to calculate the mean of a pollutant across datasets contained in 332 CSV files. I am seeking advice on how I could improve this code. It takes about 38 seconds to run, which makes me think it is not very efficient.

pollutantmean <- function(directory, pollutant, id = 1:332) {
        files_list <- list.files(directory, full.names = TRUE) # create list of files
        dat <- data.frame()                                     # create empty data frame
        for (i in id) {
                dat <- rbind(dat, read.csv(files_list[i]))      # combine all the monitor data together
        }
        good <- complete.cases(dat)  # identify rows with no NA values
        mean(dat[good, pollutant])   # calculate mean of the pollutant column
} # run time ~ 37 sec - NEED TO OPTIMISE THE CODE
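
For reference, the run time can be measured with base R's system.time (a minimal sketch; "specdata" and "sulfate" are placeholder arguments, not taken from the question):

    # Time one call to the function; "elapsed" is the wall-clock figure
    system.time(pollutantmean("specdata", "sulfate", 1:332))
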
  • In short - never use loops in R, they are always slow. Also, do you really need 332 files? This is horribly slow. Why not append them to one big file (see the sketch after these comments)? Commented Apr 19, 2015 at 12:19
  • Maybe just create dat using dat <- do.call(rbind, lapply(files_list, read.csv)) instead of the way you are doing it. Commented Apr 19, 2015 at 12:21
  • @lejlot, first, loops are not always slow; it usually depends on the circumstances. Second, this is just a Coursera task. Commented Apr 19, 2015 at 12:22
  • @DavidArenburg I would love to see an example of a fast loop in R :-) Commented Apr 19, 2015 at 12:29
  • @lejlot take a look here; there are some nice examples at the end, too. Commented Apr 19, 2015 at 12:33
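
A rough sketch of the "one big file" idea from the first comment (the directory name "specdata" and the output file "combined.csv" are assumptions, not from the thread):

    # Read and stack all monitor files once, then save a single combined CSV
    files_list <- list.files("specdata", full.names = TRUE)
    all_dat    <- do.call(rbind, lapply(files_list, read.csv))
    write.csv(all_dat, "combined.csv", row.names = FALSE)

    # Later runs then need only one read instead of 332
    dat <- read.csv("combined.csv")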

2 Answers

Instead of creating an empty data.frame and calling rbind on every iteration of a for loop, you can store all the data.frames in a list and combine them in one shot. You can also use the na.rm argument of mean() so that NA values are not taken into account.

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
    df         = do.call(rbind, lapply(files_list, read.csv))  # read and combine in one shot

    mean(df[[pollutant]], na.rm=TRUE)                          # ignore NAs when averaging
}
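
For example, a call could look like this (a sketch; "specdata" and "sulfate" are placeholder arguments for the directory and column name, not taken from the question):

    # Mean sulfate value across monitors 1-10, assuming the CSVs live in "specdata"
    pollutantmean("specdata", "sulfate", 1:10)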

Optional - I would increase the readability with magrittr:

library(magrittr)

pollutantmean <- function(directory, pollutant, id = 1:332)
{
    list.files(directory, full.names = TRUE)[id] %>%
        lapply(read.csv) %>%
        do.call(rbind,.) %>%
        extract2(pollutant) %>%
        mean(na.rm=TRUE)
}

You can improve it by using data.table's fread function (see Quickly reading very large tables as dataframes in R). Binding the results with data.table::rbindlist is also much faster.

require(data.table)    

pollutantmean <- function(directory, pollutant, id = 1:332) {
    files_list = list.files(directory, full.names = TRUE)[id]  # keep only the requested files
    DT = rbindlist(lapply(files_list, fread))                  # fast read + fast bind
    mean(DT[[pollutant]], na.rm=TRUE)                          # ignore NAs when averaging
}
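
If you want to quantify the difference, a quick comparison could look like the following (a sketch; it assumes the read.csv version has been saved under a different name such as pollutantmean_base, and "specdata"/"sulfate" are placeholder arguments):

    # Compare wall-clock time of the two approaches (hypothetical names and arguments)
    system.time(pollutantmean_base("specdata", "sulfate"))  # read.csv + do.call(rbind, ...)
    system.time(pollutantmean("specdata", "sulfate"))       # fread + rbindlist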
