0

I am trying to import a table from a web page using the readHtmlTable function, this is the first few rows of what the data looks like when it is in R.

              Event                Athlete Country      Result  Medal year
1          100m Men              Tom Burke     USA        12.0   GOLD 1896
2                            Fritz Hofmann     DEU   12.2 est. SILVER 1896
3                             Francis Lane     USA        12.6 BRONZE 1896
4                          Alajos Szokolyi     HUN   12.6 est. BRONZE 1896
5          400m Men              Tom Burke     USA        54.2   GOLD 1896
6                          Herbert Jamison     USA         n/a SILVER 1896
7                           Charles Gmelin     GBR         n/a BRONZE 1896
8          800m Men            Teddy Flack     AUS      2:11.0   GOLD 1896
9                            Nֳ¡ndor Dֳ¡ni     HUN 2:11.8 est. SILVER 1896
10                       Demitrios Golemis     GRE         n/a BRONZE 1896

Now if you look at the event column you can see some of the rows of the event field empty, this is the way the table is on the website, what I am looking for is the most efficient way to fill those blanks so at the end it should look like this

              Event       Athlete      Country      Result  Medal year
1          100m Men       Tom Burke       USA        12.0   GOLD 1896
2          100m Men       Fritz Hofmann   DEU   12.2 est. SILVER 1896
3          100m Men       Francis Lane    USA        12.6 BRONZE 1896
4          100m Men       Alajos Szokolyi HUN   12.6 est. BRONZE 1896
5          400m Men       Tom Burke       USA        54.2   GOLD 1896

Basically every time a field in the Event column is empty I need to fill it with the last value that was not empty. The column is saved in R as a factor and I know that technically I can do this using a for loop and going over all of the vector elements but this is very time consuming considering the fat that there are about 300000 rows in this table. I am hoping for something more efficient

1
  • 5
    Fill the empty fields with NA and then use na.locf from the zoo package. See library(zoo); ?na.locffor help and google for na.locf to find many additional examples. Commented Jul 21, 2016 at 15:27

3 Answers 3

1

Here's a toy example of how the purrr package could be used to solve your problem, assuming the data is in a data.frame and the missing values are NA:

library(purrr)

df <- data.frame("event" = c(1, NA, 2, NA, 3, NA, 5), "other" = 1:7)

df
#     event other
# 1     1     1
# 2    NA     2
# 3     2     3
# 4    NA     4
# 5     3     5
# 6    NA     6
# 7     5     7


df$event <- accumulate(.x = df$event, .f = function(x, y) { if(is.na(y)) x else y })

df
#     event other
# 1     1     1
# 2     1     2
# 3     2     3
# 4     2     4
# 5     3     5
# 6     3     6
# 7     5     7
Sign up to request clarification or add additional context in comments.

Comments

1

The fill function from the R Studio package tidyr is designed to do this kind of data cleaning. We'll use it with mutate from the dplyr package which will first replace missing values with NA. Your data is in data frame df Code looks like

library(dplyr)
library(tidyr)
df_fill <- df %>% mutate(Event = ifelse(Event == "", NA,Event)) %>%
                  fill(Event)

Comments

0

Here is a base R method using rle and inverse.rle:

# create run length encoding
temp <- rle(df$event)
# fill in missing values with previous values
temp$values[is.na(temp$values)] <- temp$values[which(is.na(temp$values))-1]

# fill in vector of data.frame
df$event <- inverse.rle(temp)

Note that this won't work if there is a missing value in the first position of the variable.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.