The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) against the avg_score, over 10 trials of incrementally sized random samples per word, using a dataset similar to:
library(data.table)
set.seed(1)
df <- data.frame(
word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
item_score = runif(40),
avg_score = rnorm(40),
line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
validity = sample(c("T", "F"), 40, replace = TRUE)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt[validity == "F", numerator := 0L]  # invalid detections do not count towards the numerator
df <- dt
> df
    word_ID  word true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1   cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1   cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1   cat        234         234 0.74554352 -0.64807093   71        T           3         1
 4:       1   cat        234         234 0.93296878 -0.19981748   34        T           4         2
 5:       2 house         39          34 0.49471189  0.68924373   25        F           1         0
 6:       2 house         39          34 0.64499368  0.03614551   32        T           2         1
 7:       2 house         39          34 0.17580259  1.94353631  573        F           3         0
 8:       3  sung        876         876 0.60299465  0.73721373    3        T           1         1
 9:       3  sung        876         876 0.88775767  2.32133393  673        F           2         0
10:       4  door          4           4 0.49020940  0.34890935  899        T           1         1
11:       4  door          4           4 0.01838357 -1.13391666  904        T           2         2
The data represents each detection of a word in a document, so a word can appear on the same line more than once. The sample size should count unique values of the line column, but the result should include every row whose line number was sampled, meaning the number of rows returned can exceed the specified sample size. So, for one trial with a sample size of two for "cat", the desired result would have the form:
   word_ID word true_count occurrences item_score   avg_score line validity denominator numerator
1:       1  cat        234         234 0.25497614  0.15268651   71        F           1         0
2:       1  cat        234         234 0.18662407  1.77376261  234        F           2         0
3:       1  cat        234         234 0.74554352 -0.64807093   71        T           3         1
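To pin down what "sample size" means here, the selection for a single word could be sketched roughly as follows. This is only an illustration under assumptions, not a settled approach: `sample_lines` is a hypothetical helper name, the toy table copies just a few columns of the "cat" rows, and the distinct line values are drawn without replacement.

```r
library(data.table)

# Toy rows copied from the "cat" example (only the columns needed here).
cat_rows <- data.table(
  word_ID  = 1,
  word     = "cat",
  line     = c(71, 234, 71, 34),
  validity = c("F", "F", "T", "T")
)

# Hypothetical helper: draw n distinct line values, then keep every row
# that sits on one of those lines.
sample_lines <- function(x, n) {
  u <- unique(x$line)
  # Index into u instead of calling sample(u, ...) directly, because
  # sample() treats a single number k as the range 1:k.
  # Assumption: cap n at the number of distinct lines available.
  picked <- u[sample.int(length(u), min(n, length(u)))]
  x[line %in% picked]
}

set.seed(42)
one_trial <- sample_lines(cat_rows, 2)
print(one_trial)
```

Depending on which two lines are drawn, this returns two or three rows: three whenever line 71 is among them, since both detections on that line come back together.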
My basic iteration (found on this site) currently looks like:
a2 <- list()
b3 <- list()
for (i in 1:10) {
  # per word, draw 2 rows (a2) or 3 rows (b3) with replacement
  a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = TRUE), ])
  b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}
So I can draw standard random samples of a given size, but I am unsure how to approach the goal stated above (I couldn't find anything similar, or wasn't searching the right way). Is there a straightforward way to do this?
Thanks,