I have wage data in which about 95% of the values are hourly rates, but some are annual salaries. I wrote a function to convert the annual salaries to hourly rates, but it takes 1 min 40 sec to run on a dataset of 43,000 rows x 12 columns, which I did not think was large enough to be this slow.

I am curious whether there is a better way to do this than my current function. I am new to dplyr and the tidyverse, so ideally an answer would use those.

Here is some sample data:

NOC4  Region Region_Name Wage_2012 Wage_2013 Wage_2014   
0011  ER10   National    28.1      65000     NA       
0011  ER1010 Northern    NA        30.5      18       
0011  ER1020 Southern    42.3      72000     22       
0011  ER1030 Eastern     12        NA        45500    
0011  ER1040 Western     8         NA        99000    
0011  ER10   National    NA        65000     NA  

Here is what it should look like after the function:

NOC4  Region Region_Name Wage_2012 Wage_2013 Wage_2014   
0011  ER10   National    28.1      33.33     NA       
0011  ER1010 Northern    NA        30.5      18       
0011  ER1020 Southern    42.3      36.92     22       
0011  ER1030 Eastern     12        NA        23.33    
0011  ER1040 Western     8         NA        50.77    
0011  ER10   National    NA        33.33     NA  

Here is the function:

year_to_hour <- function(dataset, salary, startcol){
  # where "startcol" should be the first column containing the numeric
  # values that you are trying to convert. 
  for(i in startcol:ncol(dataset)){

    for(j in 1:nrow(dataset)){

      if(is.na(dataset[j, i])){

        j = j+1

      }else if(as.numeric(dataset[j, i]) >= as.numeric(salary)){

        dataset[j, i] = dataset[j, i]/1950
      }
      else{

        dataset[j, i] = dataset[j, i]

      }

    }

  }

  return(as_tibble(dataset))

}

converted <- year_to_hour(wage_data_messy, 1000, 4)
2 Comments
  • What is the first if, for NA values, meant to accomplish? It doesn't seem to have an effect on the output. Commented May 29, 2019 at 19:13
  • To be honest, I just kept getting errors at one point and it made some of them go away. I believe you are right however and it is useless. Commented May 29, 2019 at 21:54

2 Answers

1

R will work much faster if you let it handle the loops under the hood through "vectorized" code.

http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
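
As a rough base-R illustration of what "vectorized" means here, the row-by-row loop for a single column can collapse into one indexed assignment. The vector `wages` and the mask `is_annual` are made-up names for this sketch, not objects from the question; the threshold of 1000 and the divisor of 1950 are taken from the question's function.

```r
# Hypothetical wage vector mixing hourly rates and annual salaries
wages <- c(28.1, NA, 42.3, 65000, 72000)
salary <- 1000  # values at or above this are treated as annual salaries

# Logical mask: TRUE where a value is present and looks like an annual salary
is_annual <- !is.na(wages) & wages >= salary

# A single vectorized assignment replaces the explicit loop over rows
wages[is_annual] <- wages[is_annual] / 1950

round(wages, 2)
```

R evaluates the comparison and the division over the whole vector in compiled code, so there is no per-cell indexing into the data frame.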

Here's an approach using dplyr:

library(dplyr)
salary <- 1000
df %>%
  mutate_at(vars(Wage_2012:Wage_2014),          # For these columns...
            ~ . / if_else(. > salary, 1950, 1)) # Divide by 1950 if > salary
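
In dplyr 1.0.0 and later, the scoped verbs such as mutate_at are superseded by across(). A minimal sketch of the same idea with across(), assuming the sample rows from the question are rebuilt by hand as a tibble called `df`, and using `>=` to match the comparison in the question's function:

```r
library(dplyr)

# Sample rows from the question, rebuilt by hand for this sketch
df <- tibble(
  NOC4        = c("0011", "0011", "0011", "0011", "0011", "0011"),
  Region      = c("ER10", "ER1010", "ER1020", "ER1030", "ER1040", "ER10"),
  Region_Name = c("National", "Northern", "Southern", "Eastern", "Western", "National"),
  Wage_2012   = c(28.1, NA, 42.3, 12, 8, NA),
  Wage_2013   = c(65000, 30.5, 72000, NA, NA, 65000),
  Wage_2014   = c(NA, 18, 22, 45500, 99000, NA)
)

salary <- 1000

# Divide values at or above the threshold by 1950; NA stays NA
converted <- df %>%
  mutate(across(Wage_2012:Wage_2014,
                ~ if_else(.x >= salary, .x / 1950, .x)))
```

Rounded to two decimals, the wage columns of `converted` should match the expected table in the question (for example, 65000 / 1950 is about 33.33).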

1

Using dplyr, I would use mutate_if:

library(dplyr)
salary <- 1000
# In every numeric column, divide values above the threshold by 1950
df %>% mutate_if(is.numeric, ~ifelse(. > salary, ./1950, .))
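
One thing to watch with is.numeric as the selector: it picks up every numeric column, not only the wage columns. The sample data shows NOC4 with leading zeros, which suggests it is stored as text, but if it were read in as a number, any code above the threshold would be divided too. The sketch below uses made-up data with a numeric NOC4 to show the difference, and selects by the `Wage_` name prefix instead (an assumption based on the column names in the question).

```r
library(dplyr)

# Hypothetical data: NOC4 stored as a number rather than as text
df <- tibble(
  NOC4      = c(11, 2174, 6411),
  Wage_2013 = c(65000, 30.5, NA)
)
salary <- 1000

# mutate_if() also rescales the numeric NOC4 codes above the threshold
all_numeric <- df %>%
  mutate_if(is.numeric, ~ ifelse(. > salary, . / 1950, .))

# Selecting by column name leaves NOC4 untouched
wage_only <- df %>%
  mutate(across(starts_with("Wage_"),
                ~ ifelse(.x > salary, .x / 1950, .x)))
```

Here `all_numeric$NOC4` no longer holds the original codes, while `wage_only$NOC4` does.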
