
I am trying to pull data from a website: https://transtats.bts.gov/PREZIP/

I am interested in downloading the datasets named Origin_and_Destination_Survey_DB1BMarket_1993_1.zip to Origin_and_Destination_Survey_DB1BMarket_2021_3.zip

To do this, I am trying to automate the downloads by putting the URL in a loop:

    # dates of all files
    library(tidyverse)   # for crossing(), str_c(), str_glue()

    year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>% 
      mutate(year_quarter_comb = str_c(year, "_", quarter)) %>% 
      pull(year_quarter_comb)
    
    # download all files
    for (year_quarter in year_quarter_comb) {
      get_BTS_data(str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{year_quarter}.zip"))
    }

What I was wondering is how I can exclude 2021 quarter 4, since the data for it is not available yet. Also, is there a better way to automate the task? I was thinking of matching by "DB1BMarket", but R is case-sensitive and the names for certain dates change to "DB1BMARKET".

I can use year_quarter_comb[-c(116)] to remove 2021_4 from the output.
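A sketch of what I had in mind (assuming the same tidyverse pipeline as above): drop 2021 Q4 by value rather than by position, so the code still works once more quarters are published, and match file names case-insensitively for the DB1BMarket/DB1BMARKET issue.

    # Sketch: exclude 2021 Q4 before pulling the vector
    year_quarter_comb <- crossing(year = 1993:2021, quarter = 1:4) %>% 
      filter(!(year == 2021 & quarter == 4)) %>% 
      mutate(year_quarter_comb = str_c(year, "_", quarter)) %>% 
      pull(year_quarter_comb)

    # A case-insensitive match would catch both spellings
    # ("file_names" here is a hypothetical vector of file names):
    # grepl("DB1BMarket", file_names, ignore.case = TRUE)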

EDIT: I was actually trying to download the files into a specific folder with the following code:

path_to_local <- "whatever location" # this is the folder where the raw data is stored.

# download data from BTS
get_BTS_data <- function(BTS_url) {
  # INPUT: URL for the zip file with the data
  # OUTPUT: NULL (this just downloads the data)
  
  # store the download in the path_to_local folder
  # down_file <- str_glue(path_to_local, "QCEW_Hawaii_", BLS_url %>% str_sub(34) %>% str_replace_all("/", "_"))
  down_file <- str_glue(path_to_local, fs::path_file(BTS_url))
  
  # download data to folder
  QCEW_files <- BTS_url %>%
    
    # download file
    curl::curl_download(down_file)
  
}
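With get_BTS_data() defined, the loop from above could be wrapped in tryCatch() so that a URL that does not exist on the server only produces a warning instead of stopping the whole run. This is just a sketch using the objects defined above:

# Sketch: download every quarter, but keep going if one URL is missing
for (year_quarter in year_quarter_comb) {
  url <- str_glue("https://transtats.bts.gov/PREZIP/Origin_and_Destination_Survey_DB1BMarket_{year_quarter}.zip")
  tryCatch(
    get_BTS_data(url),
    error = function(e) warning("Could not download ", url, ": ", conditionMessage(e))
  )
}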

EDIT2:

I edited the code a little, based on the answer below, and it runs:

url <- "http://transtats.bts.gov/PREZIP"
content <- read_html(url)

file_paths <- content %>%
  html_nodes("a") %>%
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file, overwrite = FALSE, exdir = "airfare data")
})

It takes a while to download these datasets because the files are quite large. It downloaded files up to 2007_2, but then I got an error from curl about the connection dropping.
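One way to make the download resumable (a sketch, not fully tested against the whole archive): keep the zip files on disk, skip any file that is already there, and catch errors so a dropped connection only costs the file that was in flight.

# Sketch: resumable download loop.
# Assumes origin_destination_urls and the curl handle h from above.
dest_dir <- "airfare data"
dir.create(dest_dir, showWarnings = FALSE)

for (u in origin_destination_urls) {
  dest_zip <- file.path(dest_dir, basename(u))
  if (file.exists(dest_zip)) next   # already downloaded in an earlier run
  tryCatch({
    curl_download(u, dest_zip, handle = h)
    unzip(dest_zip, overwrite = FALSE, exdir = dest_dir)
  }, error = function(e) message("Failed (retry later): ", u, " - ", conditionMessage(e)))
}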


1 Answer


Instead of trying to generate the URLs yourself, you could scrape the file paths from the website. This avoids generating URLs for files that do not exist.

Below is a short script that downloads all of the zip files you are looking for and unzips them into your working directory.

The hardest part for me here was that the server seems to have a misconfigured SSL certificate. I was able to find help here on SO for turning off SSL certificate verification for read_html() and curl_download(). Those solutions are integrated into the script below.

library(tidyverse)
library(rvest)
library(curl)

url <- "http://transtats.bts.gov/PREZIP"
content <-
  httr::GET(url, config = httr::config(ssl_verifypeer = FALSE)) |>
  read_html()

file_paths <-
  content |>
  html_nodes("a") |>
  html_attr("href")

origin_destination_paths <-
  file_paths[grepl("DB1BM", file_paths)]

base_url <- "https://transtats.bts.gov"

origin_destination_urls <-
  paste0(base_url, origin_destination_paths)

h <- new_handle()
handle_setopt(h, ssl_verifyhost = 0, ssl_verifypeer = 0)

lapply(origin_destination_urls, function(x) {
  tmp_file <- tempfile()
  curl_download(x, tmp_file, handle = h)
  unzip(tmp_file)
})
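If you want to keep the zip files in a specific folder instead of a temporary file (as described in the edit to the question), the last step could be adjusted like this (a sketch; path_to_local stands in for whatever folder you use):

lapply(origin_destination_urls, function(x) {
  # keep the zip in path_to_local and unzip it there as well
  local_zip <- file.path(path_to_local, basename(x))
  curl_download(x, local_zip, handle = h)
  unzip(local_zip, overwrite = FALSE, exdir = path_to_local)
})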

Comments

Thanks for your solution! I am actually storing these files in a specific folder and in zip format. I added that code in the Edit part of the question. Are the files stored in the temp drive with your code?
Hello @Till, interesting answer. Just a question: I see you used ssl_verifypeer = FALSE. Does that mean you can "bypass" a non-SSL webpage?
I am getting an error with the read_html() part of the code: Error in read_xml(x, encoding = encoding, ..., as_html = TRUE, options = options) : argument "x" is missing, with no default
I had to add read_html(url, encoding = "Windows-1252") to avoid that error. Now I'm getting an error in the file_paths part of the code: Error in UseMethod("xml_attr") : no applicable method for 'xml_attr' applied to an object of class "character"
@Till Can you please respond to the comments?
