
I have various .csv files, each with multiple columns. I am using the code below in R as a quality check: for a particular column, it counts how many rows have valid values and how many are null. The code works well for a single CSV file, but I want to run it for all the CSV files and get output for each one, plus a log file. Could anyone please help me modify the code so it can process multiple CSV files?

install.packages("readr")  # run once, if not already installed
library(readr)

check_column <- function(df, column) {
  valid_values <- !is.na(df[[column]])
  num_valid <- sum(valid_values)
  num_null <- nrow(df) - num_valid
  return(c(num_valid, num_null))
}

# Read the CSV file
df <- read_csv("data.csv")

for (column in names(df)) {
  results <- check_column(df, column)
  print(paste(column, ": ", results[1], " valid, ", results[2], " null"))
}

Sample data (not all files have the same number of columns):

Csv1.csv

D_T                     Temp (°C)  Press (Pa)  ...
2021-03-01 00:00:00+00  28         1018        ...
2021-03-02 00:00:00+00  27         1017        ...
2021-03-03 00:00:00+00  28         1019        ...
...

Csv2.csv

D_T                     Temp (°C)  Vel (m/s)  Press (Pa)  ...
2022-03-01 00:00:00+00  28         118        1018        ...
2022-03-02 00:00:00+00  27         117        1019        ...
2022-03-03 00:00:00+00  28         119        1018        ...
...
  • See stackoverflow.com/a/24376207/3358227 for many discussions of how to do things on a list of tables, including how to make that list to begin with. It tends to use lapply and friends, but the premise remains solid. Commented Jul 7, 2023 at 1:55
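The list-of-tables approach mentioned in the comment can be sketched as follows; this is an illustrative sketch, not part of the original question or answer, and it assumes the CSV files sit in the current working directory:

```r
library(readr)

# Build a named list of data frames, one per CSV in the working directory
paths <- list.files(pattern = "\\.csv$")
tables <- lapply(paths, read_csv)
names(tables) <- paths

# Apply the same per-column null count to every table at once
null_counts <- lapply(tables, function(df) {
    sapply(df, function(col) sum(is.na(col)))
})
```

Keeping the tables in one list means every later step (checking, summarising, writing logs) is a single `lapply()` over the list instead of copy-pasted code per file.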

1 Answer


How about something like this? It does not store anything in a variable; it writes one log file per CSV. Let me know if you need help with it.

library(readr)

for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    # one log file per CSV, e.g. "data.csv.log"
    out <- file(paste0(f, ".log"), open = "w")
    for (x in colnames(df)) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    }
    close(out)
}
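If you also want the counts in a variable rather than only in log files, one option (a sketch under the same assumptions, not part of the original answer) is to collect one row per column per file into a single summary data frame:

```r
library(readr)

# Collect valid/null counts for every column of every CSV
rows <- list()
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    for (x in colnames(df)) {
        rows[[length(rows) + 1]] <- data.frame(
            file = f,
            column = x,
            valid = sum(!is.na(df[[x]])),
            null = sum(is.na(df[[x]]))
        )
    }
}
summary_df <- do.call(rbind, rows)
```

A long-format data frame like `summary_df` is easy to filter (e.g. all columns with any nulls) or to write out once with `write.csv()`.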

To write into one file only:

library(readr)

# a single log file for all CSVs
out <- file("output.log", open = "w")
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    cat(f, "\n", file = out)
    for (x in colnames(df)) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    }
}
close(out)

Comments

Thank you for your response. But I have more than 50 CSV files (I gave an example of only two). Could you please modify your code accordingly?
Have you tried it? It works for any number of files. Set your working directory to where the files are with setwd(). list.files() lists every file in your current working directory. Inside the loop every file gets read and its contents displayed. I just created two example files to show that it works.
Okay, I set my working directory with setwd("F:/Test"), listed files with list.files(path="F:/Test/", pattern='.*csv', full.names=TRUE), and followed your code from the for loop. It gives me this error: Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input. Kindly help.
Try the readr version read_csv() instead of read.csv() if it worked for you before. I also posted an updated version that will write the output into a log file instead of into the console.
duplicate 'row.names' means something about that file makes reading it problematic. That has nothing to do with the fact that it was called in a for loop, it seems more likely to me that either (a) not all files are CSV files, or (b) not all of them are well-structured CSV. Check the file that erred and check it manually. Can you read it by itself? Are there more ,-delimited tokens on the first line than on the second?
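The manual check suggested in the comment above can be sketched like this; it is an illustrative example, and "bad.csv" is a placeholder for whichever file produced the error:

```r
# Count the comma-separated fields on each line of the suspect file
fields <- count.fields("bad.csv", sep = ",")

# Lines whose field count differs from the header line are suspect
which(fields != fields[1])

# Inspect the first few raw lines directly
head(readLines("bad.csv"))
```

If `count.fields()` returns varying numbers, the file is not a well-structured CSV (stray delimiters, unquoted commas, or a different format entirely), which would explain the duplicate 'row.names' error.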
