SAS Programming: How to replace missing values in multiple columns using one column?

Question

Background

I have a large dataset in SAS with 17 variables, of which four are numeric and 13 are character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data. The variables are:

cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)

After applying specific filters to the numeric columns, none of the numeric variables has missing values. However, the 13 char/string variables each have anywhere from thousands to hundreds of thousands of missing values.

Request

Following the Feature Engineering section of this Towards Data Science blog post (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regression-1b12386c69c8), how can I write equivalent SAS code that applies regex to the description column and uses the matches to fill in missing values in the other string/char columns, such as cylinders, condition, drive, paint_color, and so on?

Here is the Python code from the blog post.

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

What would be the equivalent SAS code for the Python code shown above?

Reeza · Accepted Answer · 2021-06-30 15:01:23Z

It's basically just doing a word search of sorts.

A simplified example in SAS:

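data want;
    set have;
    * keywords to search for (4 values, up to 8 characters each);
    array _fuel(4) $ 8 _temporary_ ("gas", "hybrid", "diesel", "electric");

    * only fill rows where fuel is missing;
    if missing(fuel) then do i = 1 to dim(_fuel);
        * modifier i ignores case, modifier t trims trailing blanks;
        if find(description, _fuel(i), 'it') > 0 then fuel = _fuel(i);
        * does not deal with multiple finds so the last one found will be kept;
    end;

    drop i;
run;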

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
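For the regex route, a minimal sketch using the SAS Perl-regex functions PRXPARSE, PRXMATCH and PRXPOSN might look like the following. It assumes the same have and want data set names as above and covers only three of the columns. The patterns are tightened versions of the Python ones (word boundaries instead of literal spaces), so treat it as a starting point rather than an exact port of the blog code.

```sas
data want;
    set have;

    /* compiled pattern ids; temporary arrays keep their values across rows */
    array re(3) _temporary_;
    /* columns to fill, in the same order as the patterns below */
    array target(3) fuel transmission drive;

    /* compile each pattern once, on the first row only */
    if _n_ = 1 then do;
        re(1) = prxparse('/\b(gas|hybrid|diesel|electric)\b/i');
        re(2) = prxparse('/\b(automatic|manual)\b/i');
        re(3) = prxparse('/\b(4x4|awd|fwd|rwd|4wd)\b/i');
    end;

    do k = 1 to dim(target);
        /* fill only missing values, using the first match in description */
        if missing(target(k)) and prxmatch(re(k), description) > 0 then
            target(k) = lowcase(prxposn(re(k), 1, description));
    end;

    drop k;
run;
```

Unlike the FIND loop, this keeps the first match in the description rather than the last. If a target column is shorter than the matched word, the value will be truncated, so check the column lengths first.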


6 Comments

com_thit_nuong Over a year ago

Ok, thank you for sharing an example. I will test the code that you shared with me and see if it works. If it works, then I will create an array for each variable and loop through the list. Is there a way to create one array for all the variables, plus a list containing the categorical values for each variable?

Reeza Over a year ago

You could easily generalize this to a macro, which would make your code more similar to the Python code, but first get it working without macros.
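As a rough illustration of the macro idea, here is a sketch with a hypothetical macro named fill_from_desc (the macro name, its parameters and the _re_ helper variables are made up for this example). It assumes the have and want data sets from the answer and generates one regex lookup per column, which loosely mirrors the keys/columns loop in the Python code.

```sas
/* hypothetical helper: call it inside a DATA step, once per column */
%macro fill_from_desc(var=, words=);
    /* compile the pattern on the first row and keep its id for later rows */
    retain _re_&var;
    if _n_ = 1 then _re_&var = prxparse("/\b(&words)\b/i");

    /* fill only missing values with the first keyword found */
    if missing(&var) and prxmatch(_re_&var, description) > 0 then
        &var = lowcase(prxposn(_re_&var, 1, description));

    drop _re_&var;
%mend fill_from_desc;

data want;
    set have;
    %fill_from_desc(var=fuel,         words=gas|hybrid|diesel|electric)
    %fill_from_desc(var=transmission, words=automatic|manual)
    %fill_from_desc(var=drive,        words=4x4|awd|fwd|rwd|4wd)
run;
```

Each call only writes ordinary DATA step statements, so turning on options mprint shows the generated code in the log if something needs debugging.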

Reeza Over a year ago

FYI, another way I've done this before is to create a list of the words and their categories, tokenize all the words in the description, and then merge the two data sets to get your data. You need some restructuring of your data but it's fully dynamic.

com_thit_nuong Over a year ago

Honestly, I have no idea what you just said since I am a complete noob to SAS and statistics in general. If you can demonstrate it in code, it would help tremendously.

Reeza Over a year ago

I guessed that, which is why I'm suggesting this approach that you will actually understand and be able to debug and expand. Giving you a solution you don't understand and can't debug yourself just means you'll be asking for help each time you need to modify anything.
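For reference, the word-list approach mentioned in the comments above could be sketched roughly as follows. All data set and variable names here (lookup, have_id, tokens, found, wide, row_id, found_fuel, found_transmission) are hypothetical, and the lookup table holds only a small subset of the keywords. It is one possible reading of the idea, not necessarily how it was originally done.

```sas
/* 1. Lookup table: one row per keyword and the column it belongs to */
data lookup;
    length word $ 20 category $ 32;
    input word $ category $;
    datalines;
gas fuel
hybrid fuel
diesel fuel
electric fuel
automatic transmission
manual transmission
;
run;

/* 2. Give every row an id, then write one output row per word */
data have_id;
    set have;
    row_id = _n_;
run;

data tokens;
    set have_id(keep=row_id description);
    length word $ 20;
    do j = 1 to countw(description, ' ,.;:!?()/');
        word = lowcase(scan(description, j, ' ,.;:!?()/'));
        output;
    end;
    keep row_id word;
run;

/* 3. Join tokens to the lookup; keep one keyword per row and category
      (the alphabetically first one when several match) */
proc sql;
    create table found as
    select t.row_id, l.category, min(l.word) as word length=20
    from tokens as t inner join lookup as l
      on t.word = l.word
    group by t.row_id, l.category
    order by row_id, category;
quit;

/* 4. One column per category, then fill only the missing values */
proc transpose data=found out=wide(drop=_name_) prefix=found_;
    by row_id;
    id category;
    var word;
run;

data want;
    /* declare the helper columns so they exist even if nothing matched */
    length found_fuel found_transmission $ 20;
    merge have_id wide;
    by row_id;
    fuel         = coalescec(fuel, found_fuel);
    transmission = coalescec(transmission, found_transmission);
    drop row_id found_: ;
run;
```

Adding a keyword is then just another row in lookup, which is the dynamic part. Adding a new category also needs one more COALESCEC line in the last step. Because the description is split into single words, multi-word values such as "like new" or "parts only" would not match without extra handling.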
