How to randomly sample dataframe rows with unique column values

Question

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) from the avg_score for 10 trials of incrementally sized random samples per word from a dataset similar to:

library (data.table)
set.seed(1)
df <- data.frame(
  word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
  word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)), 
  true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
  occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
  item_score = runif(40),
  avg_score = rnorm(40),
  line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
  validity = sample(c("T", "F"), 40, replace = T)

)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[df$validity=="F"] <- 0
df <- dt

<df
    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1
 4:       1    cat        234         234 0.93296878 -0.19981748   34        T           4         2
 5:       2  house         39          34 0.49471189  0.68924373   25        F           1         0
 6:       2  house         39          34 0.64499368  0.03614551   32        T           2         1
 7:       2  house         39          34 0.17580259  1.94353631  573        F           3         0
 8:       3   sung        876         876 0.60299465  0.73721373    3        T           1         1
 9:       3   sung        876         876 0.88775767  2.32133393  673        F           2         0
10:       4   door          4           4 0.49020940  0.34890935  899        T           1         1
11:       4   door          4           4 0.01838357 -1.13391666  904        T           2         2

The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to represent unique column values (line), but to return all instances where the line number is the same- meaning the actual number of rows returned could be more than the specified sample size. So, for one two-word sample size trial for "cat", the form of the desired result would be:

    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1

My basic iteration (found on this site) currently looks like:

for (i in 1:10) {

  a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = T), ])

  b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = T), ])}

}

So, I can do the standard random sample sizes, but am unsure (and couldn't find something similar or wasn't looking the right way) how to approach the goal stated above. Is there a straight-forward way to approach this?

Thanks,

lmo · Accepted Answer · 2017-04-07 18:12:32Z

3

Here is a data.table solution that uses a join on a sampled data.table.

set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.

In this instance, you get

    word_ID   word true_count occurrences item_score   avg_score line validity
 1:       1    cat        234         234 0.26550866  0.91897737   71        F
 2:       1    cat        234         234 0.57285336  0.07456498   71        T
 3:       1    cat        234         234 0.37212390  0.78213630  234        T
 4:       2  house         39          34 0.89838968 -0.05612874   32        T
 5:       2  house         39          34 0.94467527 -0.15579551  573        F
 6:       3   sung        876         876 0.62911404 -0.47815006  673        T
 7:       3   sung        876         876 0.66079779 -1.47075238    3        T
 8:       4   door          4           4 0.06178627  0.41794156  899        F
 9:       4   door          4           4 0.38410372 -0.05380504   55        F
10:       5 pretty         67          65 0.71761851 -0.39428995  100        F
11:       5 pretty         67          65 0.38003518  1.10002537  100        F
12:       5 pretty         67          65 0.49769924 -0.41499456 1003        F
13:       6  towel         81          81 0.21214252 -0.25336168  326        F
14:       6  towel         81          81 0.93470523 -0.16452360   87        F
15:       7    car         90          90 0.12555510  0.55666320   32        T
16:       7    car         90          90 0.26722067 -0.68875569   54        F
17:       8 island         43          43 0.01339033  0.36458196   87        T
18:       8 island         43          43 0.38238796  0.76853292  988        F
19:       8 island         43          43 0.86969085 -0.11234621  988        T
20:       9    ran         54          54 0.59956583 -0.61202639  754        F
21:       9    ran         54          54 0.82737332  1.43302370 4276        F
22:      10  pizza         53          51 0.79423986 -0.36722148   93        F
23:      10  pizza         53          51 0.41127443 -0.13505460   49        T
    word_ID   word true_count occurrences item_score   avg_score line validity

edited Apr 7, 2017 at 18:12

answered Apr 7, 2017 at 18:07

lmo

38.6k9 gold badges63 silver badges76 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

r2evans Over a year ago

47 seconds ... but since mine is base and yours is data.table, I'll keep my answer as a complement.

lmo Over a year ago

These look like pretty distinct methods and the more base answers, the better.

r2evans Over a year ago

I've been meaning to learn data.table. I learn well when I see base (which I recognize) paired with analogous non-base solutions. So for me, seeing both is perfectly relevant. Thanks :-)

xq1515426 Over a year ago

I would accept both if I could since they're both very instructive (with great explanations) for me especially in tandem. Is it possible to keep the method this compact, but to only return more rows for a word than the sample size if there are duplicate line values [not pull them all]? Meaning, sample until there are n rows for a word that have unique line values while keeping the samples already pulled. To do so would require a loop, correct? [you both answered my original question, this is just a follow on after seeing the solutions methods]. Thanks

r2evans · Accepted Answer · 2017-04-07 18:07:56Z

If you sample from a de-duplicated data.frame and do a subsequent left-join with the original data, you can ensure what you need.

I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a dt-specific answer ...)

By "de-duplicate", I mean:

subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36  2
head(subdf)
#   word_ID line
# 1       1   71
# 2       1  234
# 4       1   34
# 5       2   25
# 6       2   32
# 7       2  573

Note that the subdf only has three rows for 1, whereas the original data has 4:

df[1:4,]
#   word_ID word true_count occurrences item_score   avg_score line validity
# 1       1  cat        234         234  0.2655087  0.91897737   71        F
# 2       1  cat        234         234  0.3721239  0.78213630  234        T
# 3       1  cat        234         234  0.5728534  0.07456498   71        T
# 4       1  cat        234         234  0.9082078 -1.98935170   34        T

I'm using by here instead of lapply/split, but the results should be the same:

out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
#   word_ID line word true_count occurrences item_score   avg_score validity
# 1       1   34  cat        234         234  0.9082078 -1.98935170        T
# 2       1   71  cat        234         234  0.5728534  0.07456498        T
# 3       1   71  cat        234         234  0.2655087  0.91897737        F

Collectives™ on Stack Overflow

How to randomly sample dataframe rows with unique column values

2 Answers 2

4 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related