
I need to read many files into R, do some clean up, and then combine them into one data frame. The files all basically start like this:

=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.07.11 09:47:35 =~=~=~=~=~=~=~=~=~=~=~=
up
Upload #18
Reader: S1  Site: AA
--------- upload 18 start ---------
Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap
E,2016-07-05,11:45:44.17,"upload 17 complete"
D,2016-07-05,11:46:24.69,00:00:00.87,HA,900_226000745055,A2,8,1102
D,2016-07-05,11:46:43.23,00:00:01.12,HA,900_226000745055,A2,10,143

The row with column headers is "Type,Date,Time,Duration,Type,Tag ID,Ant,Count,Gap". Data should have 9 columns. The problem is that the number of rows above the header string is different for every file, so I cannot simply use skip = 5. I also only need lines that begin with "D,", everything else is messages, not data.

What is the best way to read in my files, ensuring that I have 9 columns and skipping all the junk?
I have been using the read_csv function from the readr package because thus far it has produced the fewest formatting issues. But I am open to any new ideas, including a way to read in just the lines that begin with "D,". I toyed with using read.table and skip = grep("Type,", readLines(i)), but it doesn't seem to find the header string correctly. Here's my basic code:

library(readr)

dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  d01 <- read_csv(i, col_names = FALSE, na = "NA", skip = 35)  # hard-coded skip is the problem
  # do clean-up stuff
  datalist[[i]] <- d01
}
  • Is the value D consistently in the Type column? Commented Sep 14, 2017 at 15:38
  • Can you rework your question with reproducible data so others can test it out? In principle you are on the right path, as this is a problem that can be addressed by reading in a subset of your data (use the n_max parameter in read_csv) as temp data and using grep to store the index for the appropriate number of rows to skip for each data file (see the sketch just after these comments). Commented Sep 14, 2017 at 15:40
  • Yes @D.sen, D is always in the Type column. Commented Sep 14, 2017 at 16:00
  • Looks like there are sometimes lines above the header row in my file that read "Type 'HE' for help". This is causing any solution searching for "Type" to fail. Can anyone tell me the correct regular expression to search for rows that contain only the word "Type" and no other characters? @D.sen, @bmosov01, @You-leee? Commented Sep 14, 2017 at 18:08
  • How about including the comma, as You-leee did in their grep, i.e. "Type,"? Commented Sep 14, 2017 at 18:56
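Following up on the n_max / grep suggestion above, here is a rough, untested sketch of that skip-counting idea, using readLines for the peek and the anchored "Type," pattern from the last comment. The helper name find_skip and the file name are only illustrative:

library(readr)

# Hypothetical helper: count how many lines precede the header row in a file.
# Assumes each file has exactly one line that starts with "Type," (the header).
find_skip <- function(file) {
  preamble <- readLines(file, n = 50)   # peek at the top of the file; raise n if the preamble is longer
  grep("^Type,", preamble) - 1          # lines to skip = everything before the header
}

f   <- "example_putty.log"              # illustrative file name
d01 <- read_csv(f, skip = find_skip(f), na = "NA")
d01 <- d01[grepl("^D", d01$Type), ]     # keep only the "D," data rows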

3 Answers


Another base R solution is the following: read the file in line by line, get the indices of the rows that begin with "D" and of the header row, then split those lines on ",", put them into a data.frame, and assign the names from the header row to it.

lines <- readLines(i)                    # i is the current file, as in the question's loop
dataRows <- grep("^D,", lines)           # indices of the actual data rows

# the header is the line that starts with "Type,"
headerNames <- unlist(strsplit(lines[grep("^Type,", lines)], split = ","))

data <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                             nrow = length(dataRows), byrow = TRUE),
                      stringsAsFactors = FALSE)
names(data) <- headerNames

Output:

    Type       Date        Time    Duration Type           Tag ID Ant Count  Gap
1      D 2016-07-05 11:46:24.69 00:00:00.87   HA 900_226000745055  A2     8 1102
2      D 2016-07-05 11:46:43.23 00:00:01.12   HA 900_226000745055  A2    10  143
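
If it helps, here is a rough sketch of how that same readLines/grep logic could be wrapped into a function and applied to every file so the results end up in one combined data frame, per the question's goal. The function name read_putty is made up, do.call/rbind is just one way to bind, and make.unique is used only because the header contains "Type" twice:

# Hypothetical wrapper around the readLines/grep logic above
read_putty <- function(file) {
  lines <- readLines(file)
  dataRows <- grep("^D,", lines)
  headerNames <- unlist(strsplit(lines[grep("^Type,", lines)], split = ","))
  out <- as.data.frame(matrix(unlist(strsplit(lines[dataRows], ",")),
                              nrow = length(dataRows), byrow = TRUE),
                       stringsAsFactors = FALSE)
  names(out) <- make.unique(headerNames)  # header has two "Type" columns
  out
}

datalist <- lapply(Sys.glob("*.*"), read_putty)
allData  <- do.call(rbind, datalist)      # one combined data frame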

1 Comment

This was the cleanest, easiest solution because of unexpected strange formatting issues. Thanks to bmosov01 and D.sen for their useful options.

You can use a custom function to loop over each file, keep only the rows whose Type column starts with D, and bind them all together at the end. Drop the bind_rows if you want to keep them as a list of separate data frames.

load_data <- function(path) {
  require(dplyr)
  files <- dir(path)
  read_files <- function(x) {
    # header = FALSE so the PuTTY preamble isn't treated as column names;
    # 9 columns per the question, otherwise the data fields get truncated
    data_file <- read.csv(file.path(path, x), header = FALSE,
                          col.names = paste0("V", 1:9),
                          stringsAsFactors = FALSE, na.strings = c("", "NA"))
    row.number <- grep("^Type$", data_file[, 1])                      # locate the header row
    colnames(data_file) <- make.unique(unlist(data_file[row.number, ]))  # header has "Type" twice
    data_file <- data_file[-(1:row.number), ]                         # drop the preamble and the header row
    data_file <- data_file %>%
      filter(grepl("^D", Type))                                       # keep data rows only
    return(data_file)
  }
  lapply(files, read_files)
}

list_of_file <- bind_rows(load_data("YOUR_FOLDER_PATH"))



If your header row always begins with the word Type, you can simply omit the skip option from your initial read, and then remove any rows before the header row. Here's some code to get you started (not tested):

dataFiles <- Sys.glob("*.*")
datalist <- list()
for (i in dataFiles) {
  # 9 explicit column names (per the question) so the short preamble lines
  # don't make read_csv truncate the data columns
  d01 <- read_csv(i, col_names = paste0("X", 1:9), na = "NA")
  headerRow <- which(d01[[1]] == "Type")
  d01 <- d01[-(1:headerRow), ]  # keep all rows after the header row
  # do clean-up stuff
  datalist[[i]] <- d01
}

If you want to use the header row for the column names, extract it before dropping the rows:

for (i in dataFiles) {
  d01 <- read_csv(i, col_names = paste0("X", 1:9), na = "NA")
  headerRow <- which(d01[[1]] == "Type")
  header <- unlist(d01[headerRow, ])         # get names from the header row first
  d01 <- d01[-(1:headerRow), ]               # then keep all rows after the header row
  d01 <- setNames(d01, make.unique(header))  # assign names ("Type" appears twice)
  # do clean-up stuff
  datalist[[i]] <- d01
}
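
To finish with the question's goal of one combined data frame, the per-file results in datalist can then be bound together. A minimal sketch, assuming dplyr is available (the .id column is optional and just records which file each row came from):

library(dplyr)

# stack the per-file data frames; .id keeps the source file name
allData <- bind_rows(datalist, .id = "source_file")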

