
I have various .csv files, each with multiple columns. I am using the code below in R as a quality check: for a particular column, it counts how many rows have valid values and how many are null. The code works well for a single CSV file, but I want to run it for all the CSV files and get output for each one, plus a log file. Could anyone please help me modify the code so it can process multiple CSV files?

install.packages("readr")  # run once, if not already installed
library(readr)

check_column <- function(df, column) {
  valid_values <- !is.na(df[[column]])
  num_valid <- sum(valid_values)
  num_null <- nrow(df) - num_valid
  return(c(num_valid, num_null))
}

# Read the CSV file
df <- read_csv("data.csv")

for (column in names(df)) {
  results <- check_column(df, column)
  print(paste(column, ": ", results[1], " valid, ", results[2], " null"))
}

Sample data (not all files have the same number of columns):

Csv1.csv

D_T                     Temp (°C)  Press (Pa)  ...
2021-03-01 00:00:00+00  28         1018        ...
2021-03-02 00:00:00+00  27         1017        ...
2021-03-03 00:00:00+00  28         1019        ...
...

Csv2.csv

D_T                     Temp (°C)  Vel (m/s)  Press (Pa)  ...
2022-03-01 00:00:00+00  28         118        1018        ...
2022-03-02 00:00:00+00  27         117        1019        ...
2022-03-03 00:00:00+00  28         119        1018        ...
...
  • See stackoverflow.com/a/24376207/3358227 for many discussions of how to do things on a list of tables, including how to make that list to begin with. It tends to use lapply and friends, but the premise remains solid. Commented Jul 7, 2023 at 1:55
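The list-of-tables approach mentioned in the comment can be sketched as follows; this is an illustrative sketch, not part of the original question or answer, and it assumes the CSV files sit in the current working directory:

```r
library(readr)

# Build a named list of data frames, one per CSV in the working directory
paths <- list.files(pattern = "\\.csv$")
tables <- lapply(paths, read_csv)
names(tables) <- paths

# Apply the same per-column null count to every table at once
null_counts <- lapply(tables, function(df) {
    sapply(df, function(col) sum(is.na(col)))
})
```

Keeping the tables in one list means every later step (checking, summarising, writing logs) is a single `lapply()` over the list instead of copy-pasted code per file.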

1 Answer


How about something like this? It does not store anything in a variable; it writes one log file per CSV. Let me know if you need help with it.

library(readr)

for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    # one log file per CSV, e.g. "data.csv.log"
    out <- file(paste0(f, ".log"), open = "w")
    for (x in colnames(df)) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    }
    close(out)
}
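If you also want the counts in a variable rather than only in log files, one option (a sketch under the same assumptions, not part of the original answer) is to collect one row per column per file into a single summary data frame:

```r
library(readr)

# Collect valid/null counts for every column of every CSV
rows <- list()
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    for (x in colnames(df)) {
        rows[[length(rows) + 1]] <- data.frame(
            file = f,
            column = x,
            valid = sum(!is.na(df[[x]])),
            null = sum(is.na(df[[x]]))
        )
    }
}
summary_df <- do.call(rbind, rows)
```

A long-format data frame like `summary_df` is easy to filter (e.g. all columns with any nulls) or to write out once with `write.csv()`.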

To write into one file only:

library(readr)

# a single log file for all CSVs
out <- file("output.log", open = "w")
for (f in list.files(pattern = "\\.csv$")) {
    df <- read_csv(f)
    cat(f, "\n", file = out)
    for (x in colnames(df)) {
        cat(
            paste0(x, ":"),
            sum(!is.na(df[[x]])),
            "valid,",
            sum(is.na(df[[x]])),
            "null\n",
            file = out
        )
    }
}
close(out)

Comments

Thank you for your response. But I have more than 50 CSV files (I gave an example of only two). Could you please modify your code accordingly?
Have you tried it? It works for any number of files. Set your working directory to where the files are with setwd(). list.files() lists every file in your current working directory. Inside the loop every file gets read and its contents displayed. I just created two example files to show that it works.
Okay, I set my working directory with setwd("F:/Test"), listed files with list.files(path="F:/Test/", pattern='.*csv', full.names=TRUE), and followed your code from the for loop. It gives me this error: Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input. Kindly help.
Try the readr version read_csv() instead of read.csv() if it worked for you before. I also posted an updated version that will write the output into a log file instead of into the console.
duplicate 'row.names' means something about that file makes reading it problematic. That has nothing to do with the fact that it was called in a for loop, it seems more likely to me that either (a) not all files are CSV files, or (b) not all of them are well-structured CSV. Check the file that erred and check it manually. Can you read it by itself? Are there more ,-delimited tokens on the first line than on the second?
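The manual check suggested in the comment above can be sketched like this; it is an illustrative example, and "bad.csv" is a placeholder for whichever file produced the error:

```r
# Count the comma-separated fields on each line of the suspect file
fields <- count.fields("bad.csv", sep = ",")

# Lines whose field count differs from the header line are suspect
which(fields != fields[1])

# Inspect the first few raw lines directly
head(readLines("bad.csv"))
```

If `count.fields()` returns varying numbers, the file is not a well-structured CSV (stray delimiters, unquoted commas, or a different format entirely), which would explain the duplicate 'row.names' error.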
