
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1  Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143

The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.

What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:

library(readr)

dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = FALSE, na = "NA", skip = 35)  # hard-coded skip is the problem
  # do clean-up stuff
  datalist[[i]] <- d01
}
  • Is the value D consistently in the Type column? Commented Sep 14, 2017 at 15:38
  • Can you rework your question with reproducible data so others can test it out? In principle you are on the right path, as this is a problem that can be addressed by reading in a subset of your data (use the n_max parameter in read_csv) as temp data and using grep to store the index for the appropriate number of rows to skip for each data file (see the sketch just after these comments). Commented Sep 14, 2017 at 15:40
  • Yes @D.sen, D is always in the Type column. Commented Sep 14, 2017 at 16:00
  • Looks like there are sometimes lines above the header row in my file that read "Type 'HE' for help". This is causing any solution searching for "Type" to fail. Can anyone tell me the correct regular expression to search for rows that contain only the word "Type" and no other characters? @D.sen, @bmosov01, @You-leee? Commented Sep 14, 2017 at 18:08
  • How about including the comma, as You-leee did in their grep, i.e. "Type,"? Commented Sep 14, 2017 at 18:56
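Following up on the n_max / grep suggestion above, here is a rough, untested sketch of that skip-counting idea, using readLines for the peek and the anchored "Type," pattern from the last comment. The helper name find_skip and the file name are only illustrative:

library(readr)

# Hypothetical helper: count how many lines precede the header row in a file.
# Assumes each file has exactly one line that starts with "Type," (the header).
find_skip <- function(file) {
  preamble <- readLines(file, n = 50)   # peek at the top of the file; raise n if the preamble is longer
  grep("^Type,", preamble) - 1          # lines to skip = everything before the header
}

f   <- "example_putty.log"              # illustrative file name
d01 <- read_csv(f, skip = find_skip(f), na = "NA")
d01 <- d01[grepl("^D", d01$Type), ]     # keep only the "D," data rows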

3 Answers


Another base R solution is the following: read the file in line by line, get the indices of the rows that begin with "D" and of the header row, then split those lines on ",", put them into a data.frame, and assign the names from the header row to it.

lines <- readLines(i)                    # i is the current file, as in the question's loop
dataRows <- grep("^D,", lines)           # indices of the actual data rows

# the header is the line that starts with "Type,"
headerNames <- unlist(strsplit(lines[grep("^Type,", lines)], split = ","))

data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                             nrow = length(dataRows), byrow = TRUE),
                      stringsAsFactors = FALSE)
names(data) <- headerNames

Output:

    Type       Date        Time    Duration Type           Tag ID Ant Count  Gap
1      D 2016-07-05 11:46:24.69 00:00:00.87   HA 900_226000745055  A2     8 1102
2      D 2016-07-05 11:46:43.23 00:00:01.12   HA 900_226000745055  A2    10  143
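
If it helps, here is a rough sketch of how that same readLines/grep logic could be wrapped into a function and applied to every file so the results end up in one combined data frame, per the question's goal. The function name read_putty is made up, do.call/rbind is just one way to bind, and make.unique is used only because the header contains "Type" twice:

# Hypothetical wrapper around the readLines/grep logic above
read_putty <- function(file) {
  lines <- readLines(file)
  dataRows <- grep("^D,", lines)
  headerNames <- unlist(strsplit(lines[grep("^Type,", lines)], split = ","))
  out <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                              nrow = length(dataRows), byrow = TRUE),
                       stringsAsFactors = FALSE)
  names(out) <- make.unique(headerNames)  # header has two "Type" columns
  out
}

datalist <- lapply(Sys.glob("*.*"), read_putty)
allData  <- do.call(rbind, datalist)      # one combined data frame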

1 Comment

This was the cleanest, easiest solution because of unexpected strange formatting issues. Thanks to bmosov01 and D.sen for their useful options.

You can use a custom function to loop over each file, keep only the rows whose Type column starts with D, and bind them all together at the end. Drop the bind_rows if you want to keep them as a list of separate data frames.

load_data <- function(path) {
  require(dplyr)
  files <- dir(path)
  read_files <- function(x) {
    # header = FALSE so the PuTTY preamble isn't treated as column names;
    # 9 columns per the question, otherwise the data fields get truncated
    data_file <- read.csv(file.path(path, x), header = FALSE,
                          col.names = paste0("V", 1:9),
                          stringsAsFactors = FALSE, na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1])                      # locate the header row
    colnames(data_file) <- make.unique(unlist(data_file[row.number, ]))  # header has "Type" twice
    data_file <- data_file[-(1:row.number), ]                         # drop the preamble and the header row
    data_file <- data_file %>%
      filter(grepl("^D", Type))                                       # keep data rows only
    return(data_file)
  }
  lapply(files, read_files)
}

list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))



If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):

dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  # 9 explicit column names (per the question) so the short preamble lines
  # don't make read_csv truncate the data columns
  d01 <- read_csv(i, col_names = paste0("X", 1:9), na = "NA")
  headerRow <- which(d01[[1]] == "Type")
  d01 <- d01[-(1:headerRow), ]  # keep all rows after the header row
  # do clean-up stuff
  datalist[[i]] <- d01
}

If you want to use the header row for the column names, extract it before dropping the rows:

for (i in dataFiles) {
  d01 <- read_csv(i, col_names = paste0("X", 1:9), na = "NA")
  headerRow <- which(d01[[1]] == "Type")
  header <- unlist(d01[headerRow, ])         # get names from the header row first
  d01 <- d01[-(1:headerRow), ]               # then keep all rows after the header row
  d01 <- setNames(d01, make.unique(header))  # assign names ("Type" appears twice)
  # do clean-up stuff
  datalist[[i]] <- d01
}
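
To finish with the question's goal of one combined data frame, the per-file results in datalist can then be bound together. A minimal sketch, assuming dplyr is available (the .id column is optional and just records which file each row came from):

library(dplyr)

# stack the per-file data frames; .id keeps the source file name
allData <- bind_rows(datalist, .id = "source_file")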

