
I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/

I am interested in downloading the datasets named Origin_and_Destination_Survey_DB1BMarket_1993_1.zip to Origin_and_Destination_Survey_DB1BMarket_2021_3.zip

To do this, I am trying to automate the downloads by putting the URL in a loop:

    # dates of all files
    library(tidyverse)   # for crossing(), str_c(), str_glue()

    year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>% 
      mutate(year_quarter_comb = str_c(year, "_", quarter)) %>% 
      pull(year_quarter_comb)
    
    # download all files
    for (year_quarter in year_quarter_comb) {
      get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{year_quarter}.zip"))
    }

What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate the task? I was thinking of matching by "DB1BMarket", but R is case-sensitive and the names for certain dates change to "DB1BMARKET".

I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
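A sketch of what I had in mind (assuming the same tidyverse pipeline as above): drop 2021 Q4 by value rather than by position, so the code still works once more quarters are published, and match file names case-insensitively for the DB1BMarket/DB1BMARKET issue.

    # Sketch: exclude 2021 Q4 before pulling the vector
    year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>% 
      filter(!(year == 2021 & quarter == 4)) %>% 
      mutate(year_quarter_comb = str_c(year, "_", quarter)) %>% 
      pull(year_quarter_comb)

    # A case-insensitive match would catch both spellings
    # ("file_names" here is a hypothetical vector of file names):
    # grepl("DB1BMarket", file_names, ignore.case = TRUE)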

EDIT: I was actually trying to download the files into a specific folder with the following code:

path_to_local <- "whatever location" # this is the folder where the raw data is stored.

# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)
  
  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))
  
  # download data to folder
  QCEW_files <- BTS_url %>%
    
    # download file
    curl::curl_download(down_file)
  
}
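With get_BTS_data() defined, the loop from above could be wrapped in tryCatch() so that a URL that does not exist on the server only produces a warning instead of stopping the whole run. This is just a sketch using the objects defined above:

# Sketch: download every quarter, but keep going if one URL is missing
for (year_quarter in year_quarter_comb) {
  url <- str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{year_quarter}.zip")
  tryCatch(
    get_BTS_data(url),
    error = function(e) warning("Could not download ", url, ": ", conditionMessage(e))
  )
}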

EDIT2:

I edited the code a little, based on the answer below, and it runs:

url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)

file_paths <- content %>%
  html_nodes("a") %>%
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})

It takes a while to download these datasets because the files are quite large. It downloaded files up to 2007_2, but then I got an error from curl about the connection dropping.
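One way to make the download resumable (a sketch, not fully tested against the whole archive): keep the zip files on disk, skip any file that is already there, and catch errors so a dropped connection only costs the file that was in flight.

# Sketch: resumable download loop.
# Assumes origin_destination_urls and the curl handle h from above.
dest_dir <- "airfare data"
dir.create(dest_dir, showWarnings = FALSE)

for (u in origin_destination_urls) {
  dest_zip <- file.path(dest_dir, basename(u))
  if (file.exists(dest_zip)) next   # already downloaded in an earlier run
  tryCatch({
    curl_download(u, dest_zip, handle = h)
    unzip(dest_zip, overwrite = FALSE, exdir = dest_dir)
  }, error = function(e) message("Failed (retry later): ", u, " - ", conditionMessage(e)))
}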


1 Answer


Instead of trying to generate the URLs yourself, you could scrape the file paths from the website. This avoids generating URLs for files that do not exist.

Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.

The hardest part for me here was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). Those solutions are integrated into the script below.

library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"
content <-
  httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()

file_paths <-
  content |>
  html_nodes("a") |>
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})
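If you want to keep the zip files in a specific folder instead of a temporary file (as described in the edit to the question), the last step could be adjusted like this (a sketch; path_to_local stands in for whatever folder you use):

lapply(origin_destination_urls, function(x) {
  # keep the zip in path_to_local and unzip it there as well
  local_zip <- file.path(path_to_local, basename(x))
  curl_download(x, local_zip, handle = h)
  unzip(local_zip, overwrite = FALSE, exdir = path_to_local)
})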

Comments

Thanks for your solution! I am actually storing these files in a specific folder and in zip format. I added that code in the Edit part of the question. Are the files stored in the temp drive with your code?
Hello @Till, interesting answer. Just a question: I see you used ssl_verifypeer = FALSE. Does that mean you can "bypass" a non-SSL webpage?
I am getting an error with the read_html() part of the code: Error in read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options) : argument "x" is missing, with no default
I had to add read_html(url, encoding = "Windows-1252") to avoid that error. Now I'm getting an error in the file_paths part of the code: Error in UseMethod("xml_attr") : no applicable method for 'xml_attr' applied to an object of class "character"
@Till Can you please respond to the comments?
