0

Background

I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

  1. cylinders
  2. condition
  3. drive
  4. paint_color
  5. type
  6. manufacturer
  7. title_status
  8. model
  9. fuel
  10. transmission
  11. description
  12. region
  13. state
  14. price (num)
  15. posting_date (num)
  16. odometer (num)
  17. year (num)

After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables.

Request

Similar to the blog post towards data science as shown here (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), specifically under the Feature Engineering section, how can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?

Here is the Python code from the blog post.

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

What would be the equivalent SAS code for the Python code shown above?

1 Answer 1

1

It's basically just doing a word search of sorts.

A simplified example in SAS:

data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");

do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;

run;

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.

Sign up to request clarification or add additional context in comments.

6 Comments

Ok, thank you for sharing an example. I will test the code that you shared with me and see if it works. If it works, then I will create an array for each variable and looping through the list. Is there a way to create an array for all the variables and a list containing the categorical variables for the respective variables?
You could generalize this to a macro easily which would make your code more similar to the python code but first get it working with non macros.
FYI- another way I've done this before is create a list of the words and their categories, then tokenized all the words in the description and merge the two data sets to get your data. You need some restructuring of your data but it's fully dynamic.
Honestly, I have no idea what you just said since I am a complete noob to SAS and statistics in general. If you can demonstrate it in code, it would help tremendously.
I guessed that, which is why I'm suggesting this approach that you will actually understand and be able to debug and expand. Giving you a solution you don't understand and can't debug your self just means you'll be asking for help each time you need to modify anything.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.