2

I'm trying to figure out a small coding challenge here.

I have a variable, RESULT, that is a character variable but needs to be converted to numeric. Most of the results are regular numbers, e.g. "90", "90.0", "55.42", etc. However, there are a lot of odd values, such as "UNDETECTABLE" or "1.29E7".

What I want to do is extract all the observations that have a character OTHER than the numeric digits OR the value "." (i.e. a period). Then I can manually assign those values.

I have a very large dataset, but limited computing power, so I can't scroll through and pick out the odd observations with special characters. It just freezes up my computer and takes way too long.

Thoughts on how to best accomplish this? Is there a SAS function that works for such a task? I've thought about the compress function, but I need to make sure I'm not missing any observations with special characters (i.e. characters other than numbers and period).

Thank you!

4
  • Do you want the list of invalid values of RESULT or the full set of records that have an invalid result? The former should be much smaller, but might be harder to map to a value since there is no context for the value. Commented Nov 30, 2015 at 17:32
  • Thanks for all the helpful comments! Here's the solution I came up with that seems to work: flag = 0; comparator = compress(result,'1234567890.'); if comparator ^= "" then flag = 1; This flags observations that have any character other than the numerals and '.' From there I can add further processing to deal with the typical special characters (such as >40, ~5000, etc.) Sorry if this was unclear...I'm not sure I explained things well in the OP. I appreciate your help. Commented Dec 1, 2015 at 2:50
  • ErraticAssassin see @joe's COMPRESS answer below. I think it's doing the same as what you describe, and a bit more tidy. Commented Dec 1, 2015 at 3:13
  • The COMPRESS function will serve the purpose. But internally, COMPRESS still goes row by row to check whether each record has any undesirable characters. So if the dataset is very large and the computing power is limited, even COMPRESS can hang. There is no way to avoid scanning the whole dataset row by row. Commented Dec 1, 2015 at 15:10

3 Answers

1

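COMPRESS will handle this for you nicely, going by your precise wording. Pass '.' as the second argument and use the 'd' modifier (third argument) to add all digits to the list of characters to remove; anything left over is a character you did not expect.

Note that this won't catch values that contain only digits and periods but still aren't valid numbers (like the last one, 14.14.14).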

data have;
  input @1 char_var $30.;
datalines;
1.234
4.15E7
UNDETECTED
-143.32
+144.12
79.32°F
14.14.14
;;;;
run;

data want;
  set have;
  if compress(char_Var,'.','d') ne ' ';
run;
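
If you would rather flag rows than subset them, the same COMPRESS call can be stored in a variable. This is only a sketch built on the sample data above; the names flagged, leftover, and flag are made up for illustration.

```sas
/* Sketch: flag rows instead of subsetting, and keep the offending characters. */
data flagged;
  set have;
  length leftover $30;
  /* Remove digits ('d' modifier) and periods; whatever remains is unexpected. */
  leftover = compress(char_var, '.', 'd');
  /* A comparison evaluates to 1 (true) or 0 (false) in SAS. */
  flag = (leftover ne ' ');
run;

/* Summarize the distinct flagged values rather than browsing every row. */
proc freq data=flagged;
  where flag = 1;
  tables char_var / nocum nopercent;
run;
```

With the sample data, 1.234 and 14.14.14 should come through unflagged, while the other five values are flagged. The PROC FREQ step is one way to review the odd values without scrolling through the full dataset.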

0

Try this:

data out; /* output dataset */
  set in; /* input dataset */
  result = trim(result); /* trailing blanks should not be a problem here */
  clear_number = compress(result, '.'); /* remove periods from result */
  /* clear_number should now contain only digits; TRIM keeps trailing blanks from counting as non-digits */
  if notdigit(trim(clear_number)) then delete;
  /* but result might contain more than one period */
  if count(result, '.') > 1 then delete;
  result_numeric = result*1; /* lazy conversion */
run;

2 Comments

OP wants the opposite of what you did. Also, I would discourage the 'lazy conversion' you post; the INPUT function in SAS isn't significantly more typing and avoids needing to ignore warnings.
Oh, yes, I didn't read the question carefully. In that case, OP should write if count(result, '.')>1 or notdigit(compress(trim(result), '.')) instead, and no conversion (lazy or not) is needed at all :)
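
For reference, the condition from the last comment could be dropped into a data step roughly like this. It is a sketch only; the dataset names in and odd are placeholders, and blank values of result would likely be kept as well, since a blank is not a digit.

```sas
/* Sketch: keep only the rows whose RESULT does not look like a plain number. */
data odd;
  set in;
  /* Keep rows with more than one period, or with any character */
  /* other than digits and periods. TRIM drops trailing blanks   */
  /* so that padding is not counted as a non-digit.              */
  if count(result, '.') > 1
     or notdigit(compress(trim(result), '.')) > 0;
run;
```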
0

Couldn't you just get the distinct set of non-number values? That should be smaller than taking every observation that has a non-number value.

One way to test for a valid number is to let SAS do it for you. The INPUT() function can convert text strings to numbers. If you use the COMMA informat, then in addition to properly converting scientific notation values like 1.29E7, which is just 12900000, it can also handle values with commas or dollar signs.

proc sql;
  create table want as
    select distinct result
    from have
    where result not in (' ','.')
      and input(result,comma32.)=.
  ;
quit;

That should find values like "UNDETECTABLE", but treat values like "90", "90.0", "55.42", "1.29E7", or "12,345" as valid numbers.
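
The same idea can be sketched as a data step, where the ?? modifier on INPUT suppresses the invalid-data notes that failed conversions would otherwise write to the log. The names invalid and num are illustrative, and have is assumed to contain the character variable result.

```sas
/* Sketch: collect the distinct values that SAS cannot read as numbers. */
data invalid;
  set have;
  /* '??' suppresses invalid-data notes and leaves _ERROR_ at 0. */
  num = input(result, ?? comma32.);
  /* Keep values that failed to convert, ignoring blanks and a lone period. */
  if missing(num) and result not in (' ', '.');
  keep result;
run;

/* Reduce the output to one row per distinct value. */
proc sort data=invalid nodupkey;
  by result;
run;
```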

3 Comments

Good thought, but OP specifically calls out 1.29E7 as a result that should be flagged...?
Ah, yes, but 1.29E7 is a valid number.
Sure, but it may not be valid for their particular use case.
