I have been experimenting with the random_state parameter of StratifiedKFold in sklearn, but the splits do not seem to be random. I expected random_state=5 to give me a different test set than random_state=4, but this does not seem to be the case. I have put together a rough reproducible example below. First I load my data:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
Then I set random_state=5 and store the test indices from the last fold:
skf = StratifiedKFold(n_splits=5, random_state=5)
for train, test in skf.split(X, y):
    full_test_1 = test  # overwritten each iteration, so the last fold is kept
full_test_1
array([ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149])
Doing the same procedure for random_state=4:
skf = StratifiedKFold(n_splits=5, random_state=4)
for train, test in skf.split(X, y):
    full_test_2 = test  # overwritten each iteration, so the last fold is kept
full_test_2
array([ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149])
I can then check that they are equal:
np.array_equal(full_test_1, full_test_2)
True
I do not think that the two random states should be returning the same numbers. Is there a flaw in my logic or code?
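For comparison, here is a minimal sketch of the same check with shuffle=True. It rests on the assumption, taken from the StratifiedKFold documentation, that random_state only affects the split when shuffling is enabled and that shuffle defaults to False. The helper name last_test_fold is made up for illustration and is not part of sklearn.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target


def last_test_fold(seed):
    """Return the test indices of the final fold for a given seed (hypothetical helper)."""
    # shuffle=True is the assumption being tested: samples within each class
    # are shuffled before being assigned to folds, so the seed can matter.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    test = None
    for _, test in skf.split(X, y):
        pass  # keep only the indices from the last iteration
    return test


fold_seed_5 = last_test_fold(5)
fold_seed_4 = last_test_fold(4)

# Compare two different seeds, then the same seed twice.
print(np.array_equal(fold_seed_5, fold_seed_4))
print(np.array_equal(fold_seed_5, last_test_fold(5)))
```

If the assumption holds, the first comparison should differ between seeds while the second shows that a fixed seed is reproducible. Without shuffling, the folds are taken in data order within each class, which would be consistent with the identical index arrays shown above.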