How could I make this R snippet faster and more R-ish?

Question

Coming from various other languages, I find R powerful and intuitive, but I am not thrilled with its performance. So I decided to try to improve some snippet I wrote and learn how to code better in R.

Here's a function I wrote, trying to determine if a vector is binary-valued (two distinct values or just one value) or not:

isBinaryVector <- function(v) {
  if (length(v) == 0) {
    return (c(0, 1))
  }
  a <- v[1]
  b <- a
  lapply(v, function(x) { if (x != a && x != b) {if (a != b) { return (c()) } else { b = x }}})
  if (a < b) {
    return (c(a, b))
  } else {
    return (c(b, a))
  }
}

EDIT: This function is expected to look through a vector and return c() if it is not binary-valued, or c(a, b) if it is, with a being the smaller value and b the larger one (if a == b, then just c(a, a)). E.g., for

I will lapply this isBinaryVector and get:

$A
[1] 1 1

$B
[1] 1 1

$C
[1] 0 0

On a moderately sized dataset (about 1800 * 3500, 2/3 of them binary-valued) this takes about 15 seconds. The dataset contains only floating-point numbers.

Is there any way I could do this faster?

Thanks for any inputs!

I'll be honest, this function makes absolutely no sense to me. Could you provide an example of its use? Is it intended to take a data frame? Is a binary variable one with only 0/1, or one with only two distinct values?
– joran, Commented Apr 19, 2012 at 14:23
@joran: Well, it might not make much sense :) I just want to separate a data frame into two parts: a set of nominal-valued columns and a set of binary-valued (or two-distinct-valued, as you said) columns. Thanks!
– zw324, Commented Apr 19, 2012 at 14:25
Well, I don't understand how this function could actually work. Your lapply call isn't assigned to anything. If v is a data frame, a and b are both initially simply the first column of v. Then you test whether each column is identical to a and b (which themselves are identical), incorrectly using vectorized comparisons in an if statement. I could go on. Consider me baffled.
– joran, Commented Apr 19, 2012 at 14:30
Could you provide an example moderate-sized data set? Also, this is probably better suited to the code review site.
– Joshua Ulrich, Commented Apr 19, 2012 at 14:32
Please add at least two things to your question: 1) A description in words of what you are trying to do. 2) Sample data and expected results.
– Andrie, Commented Apr 19, 2012 at 14:33
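To illustrate the point about the lapply call, here is a minimal sketch of why the loop in the question cannot update b or exit early. The names b and f below are purely illustrative and are not taken from the question's data.

```r
# Assignments inside the anonymous function are local to each call,
# so the outer variable is never touched.
b <- 1
invisible(lapply(c(1, 2, 3), function(x) { b = x }))
b_after <- b   # still 1

# return() inside the anonymous function only leaves that inner function;
# the enclosing function carries on regardless.
f <- function(v) {
  lapply(v, function(x) if (x > 1) return("early exit"))
  "reached the end"
}
f_result <- f(c(1, 2, 3))   # "reached the end"
```

So even for a plain numeric vector, the original function would always fall through to the final if/else with a equal to b.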

Andrie · Accepted Answer · 2012-04-19 14:52:02Z

8

You are essentially trying to write a function that returns TRUE if a vector has exactly two unique values, and FALSE otherwise.

Try this:

> dat <- data.frame(
+   A = 1:3,
+   B = c(1, 2, 1), 
+   C = 0
+ )
> 
> sapply(dat, function(x)length(unique(x))==2)
    A     B     C 
FALSE  TRUE FALSE

Next, you want to get the min and max value. The function range does this. So:

> sapply(dat, range)
     A B C
[1,] 1 1 0
[2,] 3 2 0

And there you have all the ingredients to make a small function that is easy to understand and should be extremely quick, even on large amounts of data:

isBinary <- function(x)length(unique(x))==2

binaryValues <- function(x){
  if(isBinary(x)) range(x) else NA
}

sapply(dat, binaryValues)

$A
[1] NA

$B
[1] 1 2

$C
[1] NA
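Note that the question also counts a single-valued column (such as C) as binary, whereas isBinary above requires exactly two values. A hedged sketch of a variant that follows the question's definition is below; binaryRange is a hypothetical name, and treating empty columns as non-binary is an assumption (the original returned c(0, 1) for them).

```r
# Hypothetical variant: one OR two distinct values count as "binary".
# Always returns a length-2 numeric vector so the results line up.
binaryRange <- function(x) {
  u <- unique(x)
  if (length(u) >= 1 && length(u) <= 2) {
    as.numeric(range(u))
  } else {
    c(NA_real_, NA_real_)
  }
}

dat <- data.frame(A = 1:3, B = c(1, 2, 1), C = 0)

# vapply() checks that every result is a numeric vector of length 2,
# so the output is a 2-row matrix instead of a list.
res <- vapply(dat, binaryRange, numeric(2))
res
```

With this definition, column C yields c(0, 0), matching the c(a, a) convention described in the question, while column A still yields NA values.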

answered Apr 19, 2012 at 14:52

Andrie


3 Comments

zw324 Over a year ago

Both yours and Justin's version use 0.19s on the dataset. Thanks!

Andrie Over a year ago

@ZiyaoWei Just shows that performance is often a function of the programmer, not the language! Good luck with your project.

zw324 Over a year ago

I know I am really inexperienced at R :) Perhaps I should buy your upcoming book, but hopefully by the time it comes out I'll have passed that stage. Thanks!
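For anyone who wants to reproduce a timing like the one mentioned in the comments, here is a rough sketch on synthetic data. The dimensions, the column mix, and the names synthetic, flags and timing are all assumptions made for illustration; the asker's actual dataset is not available, so the numbers will differ.

```r
set.seed(1)
n_row <- 1800
n_col <- 500   # arbitrary; smaller than the dataset described in the question

# Roughly two thirds 0/1 columns, the rest continuous.
cols <- lapply(seq_len(n_col), function(i) {
  if (i %% 3 != 0) sample(c(0, 1), n_row, replace = TRUE) else rnorm(n_row)
})
names(cols) <- paste0("V", seq_len(n_col))
synthetic <- as.data.frame(cols)

isBinary <- function(x) length(unique(x)) == 2

# system.time() reports user, system and elapsed seconds.
timing <- system.time(flags <- sapply(synthetic, isBinary))
timing["elapsed"]
sum(flags)   # number of columns flagged as binary
```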

Justin · Answer · 2012-04-19 14:42 (edited by Community, 2017-05-23 10:34:48Z)

4

This function returns TRUE or FALSE for vectors (or columns of a data frame):

is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}
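As a quick illustration of what the sum(is.na(x)) term does, here is a small sketch; the input vectors are made up for the example.

```r
is.binary <- function(v) {
  x <- unique(v)
  # Subtract the NA entry (if any) so missing values do not count as a value.
  length(x) - sum(is.na(x)) == 2L
}

with_na   <- is.binary(c(0, 1, NA, 1))            # TRUE: NA is ignored
one_value <- is.binary(c(0, NA))                  # FALSE: one non-missing value
naive     <- length(unique(c(0, 1, NA))) == 2     # FALSE: unique() keeps NA
```

Without the correction, a 0/1 column containing missing values would be counted as having three distinct values.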

Also take a look at this post

I'd use something like this to get the column indices:

bivalued <- apply(my.data.frame, 2, is.binary)

nominal <- my.data.frame[,!bivalued]
binary <- my.data.frame[,bivalued]

Sample data:

> my.data.frame <- data.frame(c(0,1), rnorm(100), c(5, 19), letters[1:5], c('a', 'b'))
> apply(my.data.frame, 2, is.binary)
     c.0..1.   rnorm.100.     c.5..19. letters.1.5.  c..a....b.. 
        TRUE        FALSE         TRUE        FALSE         TRUE

edited May 23, 2017 at 10:34

CommunityBot


answered Apr 19, 2012 at 14:42

Justin


5 Comments

Joshua Ulrich Over a year ago

You could use sapply(my.data.frame, is.binary) instead of apply.

Justin Over a year ago

Yeah, I like the explicitness of apply. Is sapply faster?

Andrie Over a year ago

@Justin I would suggest to use lapply or sapply, since you are working with a data.frame, i.e. a list. apply does the same but has to convert the data to an array first. See my answer.

Joshua Ulrich Over a year ago

I'm not sure which is faster. apply only works on arrays though, so your data.frame is converted to a matrix before is.binary is applied. It's not an issue in this case, but can provide confusing output (e.g. compare apply(iris,2,is.numeric) with sapply(iris,is.numeric)).

Justin Over a year ago

Ah, now that I actually look at the apply code ... you are correct. I'll have to mend my ways!
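The apply versus sapply difference discussed in these comments can be seen with the built-in iris data. The sketch below also redoes the column split with sapply; the small data frame df and the drop = FALSE argument are additions for illustration, not part of the original answer.

```r
# apply() coerces the data frame to a matrix first. Because Species is a
# factor, that matrix is character, so no column looks numeric.
via_apply  <- apply(iris, 2, is.numeric)
# sapply() visits each column as it is stored in the data frame.
via_sapply <- sapply(iris, is.numeric)

is.binary <- function(v) {
  x <- unique(v)
  length(x) - sum(is.na(x)) == 2L
}

# Illustrative data: a and c have two distinct values, b has four.
df <- data.frame(a = c(0, 1, 0, 1), b = c(1, 2, 3, 4), c = c("x", "y", "x", "y"))
bivalued <- sapply(df, is.binary)

# drop = FALSE keeps the result a data frame even if one column remains.
nominal <- df[, !bivalued, drop = FALSE]
binary  <- df[, bivalued, drop = FALSE]
```

Here via_apply is FALSE for every column, while via_sapply is TRUE for the four measurement columns and FALSE for Species.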
