While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.

I suspect the problem is in the first line of my for loop, so I tried converting the numbers from character to numeric, but that did not fix it. I also tried the advice here; however, that led to more problems.

Think you can spot where I went wrong?

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty vector  ----------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in length(numbers)){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  
ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()

gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text()  

matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, data.frame(matches))
}

I also ran this, but it did not store the scraped data in a data frame. Instead, it printed the results to the console:

map_df(get_links, function(i){
  url <- read_html(i)
  
matches <- data.frame(ageDivision <- url %>% 
  html_nodes('.category-title__age-division') %>% html_text(),
gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text() ) 

master1.tree <- rbind(master1.tree, matches)
})
  • Try for (i in numbers) to loop over your vector numbers. Commented Sep 11, 2022 at 20:44
  • It did not do anything. It looked like it was running something, but once it stopped nothing happened. There was nothing in my environment either. Commented Sep 11, 2022 at 20:49
  • nvm! I just made a few small adjustments and it worked with your suggestion! Commented Sep 11, 2022 at 20:56
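
To illustrate the suggestion in the comments, here is a minimal offline sketch of the difference between the two loop headers. The IDs in `numbers` are made-up placeholders, not real category IDs from the site. Because `length(numbers)` is a single value, the original loop runs once with `i` equal to the number of links, and the URL built from it most likely points to a page that does not exist, which would explain the 404.

```r
# Hypothetical category IDs, used only for illustration
numbers <- c(2054321, 2054322, 2054323)

# length(numbers) is the single value 3, so this loop runs exactly once
seen_wrong <- c()
for (i in length(numbers)) {
  seen_wrong <- c(seen_wrong, i)
}

# Looping over the vector itself visits every ID
seen_right <- c()
for (i in numbers) {
  seen_right <- c(seen_right, i)
}

# The URL the original loop would have requested
wrong_url <- paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', seen_wrong)
```

Here `seen_wrong` holds only the value 3, while `seen_right` holds all three IDs.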

2 Answers

Here is an alternative to your code. First, it's not necessary to extract the numbers: you can loop directly over the vector get_links. Second, I use purrr::map_df for the looping part, which is more concise than a for loop. To this end, I wrap the scraping of a single page in a custom function. Finally, I use trim = TRUE with html_text to remove leading and trailing white space:

library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .)

scrape_page <- function(url) {
  html <- read_html(url)
  data.frame(
    division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
    gender = html %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text(trim = TRUE)
  )
}

master1.tree <- purrr::map_df(get_links[1:5], scrape_page)

master1.tree
#>   division gender
#> 1 Master 1   Male
#> 2 Master 1   Male
#> 3 Master 1   Male
#> 4 Master 1   Male
#> 5 Master 1   Male
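
As a side note on the second attempt in the question, the following offline sketch suggests why the results were printed rather than stored. It assumes a hypothetical stand-in function, `fake_scrape`, in place of `scrape_page` so that no network access is needed. An assignment made inside the function passed to `map_df` only changes a local copy; the data frame that `map_df` returns has to be assigned by the caller.

```r
library(purrr)

# Hypothetical stand-in for scrape_page(): returns a one-row data frame
# without touching the network
fake_scrape <- function(id) {
  data.frame(division = "Master 1", id = id)
}

results <- data.frame()

# The assignment below creates a local `results` on each call;
# the global `results` is left unchanged
returned <- map_df(1:3, function(i) {
  results <- rbind(results, fake_scrape(i))
})

# Assigning the return value of map_df() is what stores the data
stored <- map_df(1:3, fake_scrape)
```

After this runs, the global `results` still has zero rows, while `stored` has one row per input.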

2 Comments

This looks great, thank you! I have a follow-up question. Any idea why I need to run this code (yours and mine) twice before R puts anything in the environment? I run the code, it looks like it's doing something, and when it stops, the environment is empty. When I run it a second time, everything is there.
Hm. No clue what could be the issue. When I run my or your code it just works fine on my machine. Sometimes simply restarting the R session helps.
library(rvest)
library(tidyverse)

pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')

get_links <- pageMen %>% 
  html_nodes('.categories-grid__category a') %>% 
  html_attr('href') %>%
  paste0('https://www.bjjcompsystem.com', .) 

# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)  
numbers = as.numeric(numbers)

## create empty data frame  ------------------------
master1.tree = data.frame()

## Create for loop ---------------------------------
for (i in numbers){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))

  ageDivision <- url %>% 
    html_nodes('.category-title__age-division') %>% 
    html_text()

  gender <- url %>% 
    html_nodes('.category-title__age-division+ .category-title__label') %>% 
    html_text()

matches = data.frame('division' = ageDivision,'gender' = gender)
master1.tree <- rbind(master1.tree, matches)
}
