4

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) from the avg_score for 10 trials of incrementally sized random samples per word from a dataset similar to:

library (data.table)
set.seed(1)
df <- data.frame(
  word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
  word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)), 
  true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
  occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
  item_score = runif(40),
  avg_score = rnorm(40),
  line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
  validity = sample(c("T", "F"), 40, replace = T)

)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[df$validity=="F"] <- 0
df <- dt

<df
    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1
 4:       1    cat        234         234 0.93296878 -0.19981748   34        T           4         2
 5:       2  house         39          34 0.49471189  0.68924373   25        F           1         0
 6:       2  house         39          34 0.64499368  0.03614551   32        T           2         1
 7:       2  house         39          34 0.17580259  1.94353631  573        F           3         0
 8:       3   sung        876         876 0.60299465  0.73721373    3        T           1         1
 9:       3   sung        876         876 0.88775767  2.32133393  673        F           2         0
10:       4   door          4           4 0.49020940  0.34890935  899        T           1         1
11:       4   door          4           4 0.01838357 -1.13391666  904        T           2         2

The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to represent unique column values (line), but to return all instances where the line number is the same- meaning the actual number of rows returned could be more than the specified sample size. So, for one two-word sample size trial for "cat", the form of the desired result would be:

    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1

My basic iteration (found on this site) currently looks like:

for (i in 1:10) {

  a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = T), ])

  b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = T), ])}

}

So, I can do the standard random sample sizes, but am unsure (and couldn't find something similar or wasn't looking the right way) how to approach the goal stated above. Is there a straight-forward way to approach this?

Thanks,

2 Answers 2

3

Here is a data.table solution that uses a join on a sampled data.table.

set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.

In this instance, you get

    word_ID   word true_count occurrences item_score   avg_score line validity
 1:       1    cat        234         234 0.26550866  0.91897737   71        F
 2:       1    cat        234         234 0.57285336  0.07456498   71        T
 3:       1    cat        234         234 0.37212390  0.78213630  234        T
 4:       2  house         39          34 0.89838968 -0.05612874   32        T
 5:       2  house         39          34 0.94467527 -0.15579551  573        F
 6:       3   sung        876         876 0.62911404 -0.47815006  673        T
 7:       3   sung        876         876 0.66079779 -1.47075238    3        T
 8:       4   door          4           4 0.06178627  0.41794156  899        F
 9:       4   door          4           4 0.38410372 -0.05380504   55        F
10:       5 pretty         67          65 0.71761851 -0.39428995  100        F
11:       5 pretty         67          65 0.38003518  1.10002537  100        F
12:       5 pretty         67          65 0.49769924 -0.41499456 1003        F
13:       6  towel         81          81 0.21214252 -0.25336168  326        F
14:       6  towel         81          81 0.93470523 -0.16452360   87        F
15:       7    car         90          90 0.12555510  0.55666320   32        T
16:       7    car         90          90 0.26722067 -0.68875569   54        F
17:       8 island         43          43 0.01339033  0.36458196   87        T
18:       8 island         43          43 0.38238796  0.76853292  988        F
19:       8 island         43          43 0.86969085 -0.11234621  988        T
20:       9    ran         54          54 0.59956583 -0.61202639  754        F
21:       9    ran         54          54 0.82737332  1.43302370 4276        F
22:      10  pizza         53          51 0.79423986 -0.36722148   93        F
23:      10  pizza         53          51 0.41127443 -0.13505460   49        T
    word_ID   word true_count occurrences item_score   avg_score line validity
Sign up to request clarification or add additional context in comments.

4 Comments

47 seconds ... but since mine is base and yours is data.table, I'll keep my answer as a complement.
These look like pretty distinct methods and the more base answers, the better.
I've been meaning to learn data.table. I learn well when I see base (which I recognize) paired with analogous non-base solutions. So for me, seeing both is perfectly relevant. Thanks :-)
I would accept both if I could since they're both very instructive (with great explanations) for me especially in tandem. Is it possible to keep the method this compact, but to only return more rows for a word than the sample size if there are duplicate line values [not pull them all]? Meaning, sample until there are n rows for a word that have unique line values while keeping the samples already pulled. To do so would require a loop, correct? [you both answered my original question, this is just a follow on after seeing the solutions methods]. Thanks
2

If you sample from a de-duplicated data.frame and do a subsequent left-join with the original data, you can ensure what you need.

I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a dt-specific answer ...)

By "de-duplicate", I mean:

subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36  2
head(subdf)
#   word_ID line
# 1       1   71
# 2       1  234
# 4       1   34
# 5       2   25
# 6       2   32
# 7       2  573

Note that the subdf only has three rows for 1, whereas the original data has 4:

df[1:4,]
#   word_ID word true_count occurrences item_score   avg_score line validity
# 1       1  cat        234         234  0.2655087  0.91897737   71        F
# 2       1  cat        234         234  0.3721239  0.78213630  234        T
# 3       1  cat        234         234  0.5728534  0.07456498   71        T
# 4       1  cat        234         234  0.9082078 -1.98935170   34        T

I'm using by here instead of lapply/split, but the results should be the same:

out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
#   word_ID line word true_count occurrences item_score   avg_score validity
# 1       1   34  cat        234         234  0.9082078 -1.98935170        T
# 2       1   71  cat        234         234  0.5728534  0.07456498        T
# 3       1   71  cat        234         234  0.2655087  0.91897737        F

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.