Sample random rows in dataframe

Question

I am struggling to find an appropriate function that returns a specified number of rows picked at random, without replacement, from a data frame in R. Can anyone help me out?

John Colby · Accepted Answer · 2011-11-25 19:15:13Z

569

Answer recommended by R Language Collective

First make some data:

> df = data.frame(matrix(rnorm(20), nrow=10))
> df
           X1         X2
1   0.7091409 -1.4061361
2  -1.1334614 -0.1973846
3   2.3343391 -0.4385071
4  -0.9040278 -0.6593677
5   0.4180331 -1.2592415
6   0.7572246 -0.5463655
7  -0.8996483  0.4231117
8  -1.0356774 -0.1640883
9  -0.3983045  0.7157506
10 -0.9060305  2.3234110

Then select some rows at random:

> df[sample(nrow(df), 3), ]
           X1         X2
9  -0.3983045  0.7157506
2  -1.1334614 -0.1973846
10 -0.9060305  2.3234110
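
For use in a script rather than at the console, the same idea might be written as below. This is only a minimal sketch: the seed value and the name `picked` are illustrative choices, and `set.seed()` is needed only when the same sample has to be reproduced later.

```r
# Make the draw reproducible; the seed value itself is arbitrary.
set.seed(42)

df <- data.frame(matrix(rnorm(20), nrow = 10))

# sample(n, k) draws k distinct integers from 1..n (no replacement by
# default); they are then used as row indices.
picked <- df[sample(nrow(df), 3), ]
print(picked)
```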

answered Nov 25, 2011 at 19:15

John Colby

12 Comments

joran Over a year ago

@nikhil See here and here for starters. You can also type ?sample in the R console to read about that function.

stackoverflowuser2010 Over a year ago

Can someone explain why sample(df,3) does not work? Why do you need df[sample(nrow(df), 3), ]?

David Braun Over a year ago

@stackoverflowuser2010, you can type ?sample and see that the first argument in the sample function must be a vector or a positive integer. I don't think a data.frame works as a vector in this case.

CousinCocaine Over a year ago

Remember to set your seed (e.g. set.seed(42) ) every time you want to reproduce that specific sample.

Ari B. Friedman Over a year ago

sample.int would be slightly faster I believe: library(microbenchmark);microbenchmark( sample( 10000, 100 ), sample.int( 10000, 100 ), times = 10000 )
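
To illustrate the question about sample(df, 3): a data frame is stored as a list of columns, so base `sample()` draws columns rather than rows. With the two-column data frame from the answer, asking for 3 of them without replacement fails because there are only 2 columns. A small sketch with a made-up three-column data frame:

```r
set.seed(1)
df <- data.frame(a = 1:5, b = 6:10, c = 11:15)

# length() of a data frame is its number of columns, so sample() sees
# a list of length 3 and picks whole columns.
cols <- sample(df, 2)               # two randomly chosen columns, all 5 rows

# Sampling row numbers and indexing with them picks whole rows instead.
rows <- df[sample(nrow(df), 2), ]   # two randomly chosen rows, all 3 columns

dim(cols)
dim(rows)
```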

Jaap · Accepted Answer · 2016-09-26 19:16:33Z

308

John Colby's answer is the right one. However, if you are a dplyr user, there is also sample_n:

sample_n(df, 10)

randomly samples 10 rows from the data frame. It calls sample.int, so it is really the same answer with less typing (and it is simpler to use with magrittr pipes, since the data frame is the first argument).

edited Sep 26, 2016 at 19:16

Jaap

answered Feb 20, 2015 at 9:30

kasterma

2 Comments

Matt_B Over a year ago

As of dplyr 1.0.0, sample_n (and sample_frac) have been superseded by slice_sample, though they remain for now.

user11130854 Over a year ago

This appears to sample without replacement, and hence also outputs a sample of size min(nrow(df), 10), so this might not be what is needed.
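
The effect of the replace argument on the sample size can be checked with slice_sample(), the successor mentioned in the previous comment. The sketch below assumes dplyr 1.0.0 or later is installed; the five-row data frame `small` is made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 is installed

set.seed(1)
small <- data.frame(id = 1:5)

# Without replacement, asking for more rows than exist returns each row once.
no_repl <- slice_sample(small, n = 10)

# With replacement, the requested size is honoured and rows may repeat.
with_repl <- slice_sample(small, n = 10, replace = TRUE)

nrow(no_repl)
nrow(with_repl)
```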

gented · Accepted Answer · 2015-12-03 10:09:39Z

50

With the data.table package, the idiom DT[sample(.N, M)] samples M random rows from the data table DT.

library(data.table)
set.seed(10)

mtcars <- data.table(mtcars)
mtcars[sample(.N, 6)]

    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
2: 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
3: 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
4: 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
5: 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
6: 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2

edited Dec 3, 2015 at 10:09

answered Oct 18, 2015 at 17:37

gented

Spacedman · Accepted Answer · 2011-11-25 19:21:29Z

36

Write one! Wrapping JC's answer gives me:

randomRows = function(df,n){
   return(df[sample(nrow(df),n),])
}

Now make it better by checking first if n<=nrow(df) and stopping with an error.
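
One possible version with that check is sketched below. The argument validation, the error message and the use of drop = FALSE are illustrative choices, not part of the original answer.

```r
# Hypothetical extension of randomRows(); checks and message are illustrative.
randomRows <- function(df, n) {
  stopifnot(is.data.frame(df), is.numeric(n), length(n) == 1)
  if (n > nrow(df)) {
    stop("cannot sample ", n, " rows from a data frame with ",
         nrow(df), " rows")
  }
  # drop = FALSE keeps a data frame even when df has a single column
  df[sample.int(nrow(df), n), , drop = FALSE]
}

set.seed(1)
randomRows(mtcars, 3)
```

Calling randomRows(mtcars, 100) now stops with an error instead of relying on the message from sample().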

answered Nov 25, 2011 at 19:21

Spacedman

Agile Bean · Accepted Answer · 2019-04-19 09:24:29Z

30

Just for completeness' sake:

dplyr can also draw a proportion (fraction) of the rows:

df %>% sample_frac(0.33)

This is convenient, e.g. in machine learning, when you need a particular split ratio such as 80%:20%.
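
An 80%:20% split could be sketched as follows. It assumes dplyr 1.0.0 or later and uses slice_sample(prop = ...) in place of sample_frac(); the data frame and its unique `id` column are hypothetical and exist only so that the unsampled rows can be recovered with anti_join().

```r
library(dplyr)  # assumes dplyr >= 1.0.0 is installed

set.seed(123)
# Hypothetical data; `id` must uniquely identify each row.
df <- data.frame(id = 1:10, y = rnorm(10))

# Draw 80% of the rows for training ...
train <- slice_sample(df, prop = 0.8)

# ... and keep the rows that were not drawn for testing.
test <- anti_join(df, train, by = "id")

nrow(train)
nrow(test)
```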

answered Apr 19, 2019 at 9:24

Agile Bean

M_Merciless · Accepted Answer · 2021-11-16 09:55:34Z

13

As @Matt_B indicates, sample_n() and sample_frac() have been superseded by slice_sample(). See the dplyr docs.

Example from docstring:

# slice_sample() allows you to random select with or without replacement
mtcars %>% slice_sample(n = 5)
mtcars %>% slice_sample(n = 5, replace = TRUE)

answered Nov 16, 2021 at 9:55

M_Merciless

krlmlr · Accepted Answer · 2024-02-27 09:14:33Z

10

Outdated answer. Please use dplyr::slice_sample() instead.

In my R package there is a function sample.rows just for this purpose:

install.packages('kimisc')

library(kimisc)
example(sample.rows)

smpl..> set.seed(42)

smpl..> sample.rows(data.frame(a=c(1,2,3), b=c(4,5,6),
                               row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

Enhancing sample by making it a generic S3 function was a bad idea, according to comments by Joris Meys to a previous answer.

edited Feb 27, 2024 at 9:14

answered Jan 15, 2014 at 11:42

krlmlr

1 Comment

quickshiftin Over a year ago

A note from ?sample_frac: "[Superseded] ‘sample_n()’ and ‘sample_frac()’ have been superseded in favour of ‘slice_sample()’"

Community · Accepted Answer · 2017-05-23 12:18:21Z

9

EDIT: This answer is now outdated, see the updated version.

In my R package I have enhanced sample so that it now behaves as expected also for data frames:

library(devtools); install_github('kimisc', 'krlmlr')

library(kimisc)
example(sample.data.frame)

smpl..> set.seed(42)

smpl..> sample(data.frame(a=c(1,2,3), b=c(4,5,6),
                           row.names=c('a', 'b', 'c')), 10, replace=TRUE)
    a b
c   3 6
c.1 3 6
a   1 4
c.2 3 6
b   2 5
b.1 2 5
c.3 3 6
a.1 1 4
b.2 2 5
c.4 3 6

This is achieved by making sample an S3 generic method and providing the necessary (trivial) functionality in a function. A call to setMethod fixes everything. The original implementation still can be accessed through base::sample.

edited May 23, 2017 at 12:18

CommunityBot

answered May 14, 2013 at 8:21

krlmlr

10 Comments

a different ben Over a year ago

What is unexpected about its treatment of data frames?

krlmlr Over a year ago

@adifferentben: When I call sample.default(df, ...) for a data frame df, it samples from the columns of the data frame, as a data frame is implemented as a list of vectors of the same length.

terdon Over a year ago

Is your package still available? I ran install_github('kimisc', 'krlmlr') and got Error: Does not appear to be an R package (no DESCRIPTION). Any way around that?

krlmlr Over a year ago

@JorisMeys: Agreed, except for the "as expected" part. Just because a data frame is implemented as a list internally, it doesn't mean it should behave as one. The [ operator for data frames is a counterexample. Also, please tell me: Have you ever, just one single time, used sample to sample columns from a data frame?

Joris Meys Over a year ago

@krlmlr The [ operator is not a counterexample: iris[2] works like a list, as does iris[[2]]. Or iris$Species, lapply(iris, mean), ... Data frames are lists, so I expect them to behave like lists. And yes, I have actually used sample(myDataframe), on a dataset where every variable contains expression data of a single gene. Your specific method helps novice users, but it also effectively changes the way sample() behaves. Note that I use "as expected" from a programmer's view, which is different from the general intuition. There's a lot in R that's not compatible with general intuition... ;)

igorkf · Accepted Answer · 2019-12-06 13:13:27Z

You could do this:

library(dplyr)

cols <- paste0("a", 1:10)
tab <- matrix(1:1000, nrow = 100, dimnames = list(NULL, cols)) %>% as_tibble()
tab
# A tibble: 100 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1     1   101   201   301   401   501   601   701   801   901
 2     2   102   202   302   402   502   602   702   802   902
 3     3   103   203   303   403   503   603   703   803   903
 4     4   104   204   304   404   504   604   704   804   904
 5     5   105   205   305   405   505   605   705   805   905
 6     6   106   206   306   406   506   606   706   806   906
 7     7   107   207   307   407   507   607   707   807   907
 8     8   108   208   308   408   508   608   708   808   908
 9     9   109   209   309   409   509   609   709   809   909
10    10   110   210   310   410   510   610   710   810   910
# ... with 90 more rows

The above creates a data frame with 10 columns and 100 rows.

Now you can sample it with sample_n:

sample_n(tab, size = 800, replace = TRUE)
# A tibble: 800 x 10
      a1    a2    a3    a4    a5    a6    a7    a8    a9   a10
   <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1    53   153   253   353   453   553   653   753   853   953
 2    14   114   214   314   414   514   614   714   814   914
 3    10   110   210   310   410   510   610   710   810   910
 4    70   170   270   370   470   570   670   770   870   970
 5    36   136   236   336   436   536   636   736   836   936
 6    77   177   277   377   477   577   677   777   877   977
 7    13   113   213   313   413   513   613   713   813   913
 8    58   158   258   358   458   558   658   758   858   958
 9    29   129   229   329   429   529   629   729   829   929
10     3   103   203   303   403   503   603   703   803   903
# ... with 790 more rows

Mohammad · Accepted Answer · 2020-04-28 11:12:17Z

6

You could do this:

sample_data = data[sample(nrow(data), sample_size, replace = FALSE), ]

answered Apr 28, 2020 at 11:12

Mohammad

abalter · Accepted Answer · 2021-10-06 05:39:49Z

6

One tidyverse way of doing this (note that sample_n() and sample_frac() have since been superseded by slice_sample()):

library(tidyverse)

df = data.frame(
  A = letters[1:10],
  B = 1:10
)

df
#>    A  B
#> 1  a  1
#> 2  b  2
#> 3  c  3
#> 4  d  4
#> 5  e  5
#> 6  f  6
#> 7  g  7
#> 8  h  8
#> 9  i  9
#> 10 j 10

df %>% sample_n(5)
#>   A  B
#> 1 e  5
#> 2 g  7
#> 3 h  8
#> 4 b  2
#> 5 j 10

df %>% sample_frac(0.5)
#>   A  B
#> 1 i  9
#> 2 g  7
#> 3 j 10
#> 4 c  3
#> 5 b  2

Created on 2021-10-05 by the reprex package (v2.0.0.9000)

answered Oct 6, 2021 at 5:39

abalter

Eric Leschinski · Accepted Answer · 2017-02-11 09:04:15Z

5

Select a random sample from a tibble in R:

library("tibble")
a <- your_tibble[sample(1:nrow(your_tibble), 150), ]

nrow takes a tibble and returns its number of rows. The first argument passed to sample is the sequence from 1 to the number of rows of your tibble. The second argument, 150, is how many rows to draw. The square brackets select the rows at the sampled indices, and the variable a receives the resulting sample.

answered Feb 11, 2017 at 9:04

Eric Leschinski

Leopoldo Sanczyk · Accepted Answer · 2018-12-17 06:02:47Z

3

I'm new to R, but I have been using this easy method that works for me:

sample_of_diamonds <- diamonds[sample(nrow(diamonds), 100), ]

PS: Feel free to point out any drawback I'm not thinking about.

answered Dec 17, 2018 at 6:02

Leopoldo Sanczyk

6 Comments

0Knowledge Over a year ago

Suppose I have 1000 rows in my df. After applying your code, 100 rows are selected at random. How can I store the remaining 900 rows (the ones that were not selected)?

Leopoldo Sanczyk Over a year ago

@Akib62 try (rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)])

0Knowledge Over a year ago

Not working. When I use the code from your comment, I get the same output as diamonds, i.e. the main dataset.

Leopoldo Sanczyk Over a year ago

@Akib62 since that selects the elements not in sample_of_diamonds, can you confirm sample_of_diamonds is not empty? That could explain your problem.

0Knowledge Over a year ago

Say I have 20 rows in my dataset. When I apply sample_of_diamonds <- diamonds[sample(nrow(diamonds),10),] I get 10 random rows, but with rest_of_diamonds <- diamonds[which(!diamonds %in% sample_of_diamonds)] I get 20 rows (the main dataset).
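
A likely explanation for the behaviour reported above is that %in% on two data frames compares their columns rather than their rows, so no row is ever excluded. One way around this, sketched here in base R, is to keep the sampled row numbers and use negative indexing for the rest. The data frame `full` is a made-up stand-in for diamonds.

```r
set.seed(7)
full <- data.frame(id = 1:20, value = rnorm(20))  # stand-in for diamonds

# Keep the sampled row numbers, not only the sampled rows.
idx <- sample(nrow(full), 10)

sampled <- full[idx, ]
rest    <- full[-idx, ]   # every row whose number is not in idx

# %in% works column-wise here: the result has one element per column.
length(full %in% sampled)

nrow(sampled)
nrow(rest)
```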

Kakoli Rani Paul · Accepted Answer · 2025-01-08 00:20:20Z

-1

df <- data.frame(
  ID = 1:10,
  Name = LETTERS[1:10],
  Score = sample(50:100, 10, replace = TRUE)
)
n <- 10
random_rows <- df[sample(nrow(df), n), ]
print(random_rows)

edited Jan 8 at 0:20

answered Jan 7 at 7:34

Kakoli Rani Paul

1 Comment

Jeremy Caney Jan 9 at 8:13

Thank you for your interest in contributing to the Stack Overflow community. This question already has quite a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?
