I wrote a program that generates some synthetic data. Everything runs correctly, but the program is so slow that I am only generating 1,000 rows of data (and even that takes around 3 minutes). Ideally I would like to generate about 100,000 rows; at the moment that takes upwards of 10 minutes, and I killed the program before it finished running.
I've narrowed the problem down to the way I am generating random dates, the three lines below (imports aside). Once they have run, the rest of the program executes in a few seconds.
import numpy.random as rnd
import datetime
import pandas as pd

random_days = []
for num in range(0, n):
    random_days.append(pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date))))
What I need is, given some number n, to draw that many dates at random from a sequence of business days (the business-days restriction is also important). I need to convert each value with pd.to_datetime because rnd.choice otherwise returns a numpy datetime64 object, which causes problems in other parts of the program.
Is there any way to improve my code to have it generate dates faster? Or do I need to settle for a small sample size?
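For what it's worth, the dominant cost in the loop above is that pd.bdate_range rebuilds the whole business-day calendar on every iteration. Building the calendar once and sampling from it with vectorized integer indexing should be dramatically faster; indexing the DatetimeIndex directly also keeps pandas Timestamps, so no per-element pd.to_datetime call is needed. A minimal sketch (the function name random_business_days and the example dates are mine):

```python
import numpy.random as rnd
import pandas as pd

def random_business_days(n, start_date, end_date):
    # Build the business-day calendar once, not once per sample
    bdays = pd.bdate_range(start_date, end_date)
    # Draw n random positions in one vectorized call; indexing the
    # DatetimeIndex returns pandas Timestamps, so no conversion step
    return bdays[rnd.randint(0, len(bdays), size=n)]

# Example usage with hypothetical bounds:
random_days = random_business_days(1000, "2015-01-01", "2017-12-31")
```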
EDIT:
Just to add a little more context:
I use this loop to generate the rest of my data:
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
The function one_trans is the following:
def one_trans(date):
    trans = pd.Series([date.year, date.month, date.date(), fake.company(),
                       fake.company(), fake.ssn(),
                       rnd.normal(5000000, 10000),
                       random.sample(["USD", "EUR", "JPY", "BTC"], 1)[0]],
                      index=["YEAR", "MONTH", "DATE", "SENDER", "RECEIVER",
                             "TRANSACID", "AMOUNT", "CURRENCY"])
    return trans.to_frame().transpose()
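Building one single-row DataFrame per date and concatenating them all is itself slow; constructing each column as a whole array and building the DataFrame once avoids that. A sketch of the idea (the Faker-backed SENDER/RECEIVER/TRANSACID columns are stubbed with placeholder strings here so the example stays self-contained; in the real program they would come from fake.company() and fake.ssn()):

```python
import numpy.random as rnd
import pandas as pd

def make_frame(days):
    # days: a sequence of pandas Timestamps (e.g. the sampled business days)
    n = len(days)
    idx = pd.DatetimeIndex(days)
    return pd.DataFrame({
        "YEAR": idx.year,
        "MONTH": idx.month,
        "DATE": idx.date,
        # Placeholder values standing in for fake.company() / fake.ssn()
        "SENDER": ["company-%d" % i for i in range(n)],
        "RECEIVER": ["company-%d" % i for i in range(n)],
        "TRANSACID": ["000-00-0000"] * n,
        "AMOUNT": rnd.normal(5000000, 10000, size=n),
        "CURRENCY": rnd.choice(["USD", "EUR", "JPY", "BTC"], size=n),
    })
```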
EDIT 2: This is how I implemented Vogel612's suggestion:
def rng_dates(n, start_date, end_date):
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))

random_days = rng_dates(n, start_date, end_date)
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
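Note that this generator still calls pd.bdate_range on every yield, so it keeps the original per-draw cost. If the generator shape is worth keeping, hoisting the calendar out of the loop should recover most of the speed; a sketch of that change:

```python
import numpy.random as rnd
import pandas as pd

def rng_dates(n, start_date, end_date):
    # Build the business-day calendar once, outside the loop
    bdays = pd.bdate_range(start_date, end_date)
    for _ in range(n):
        # rnd.choice over a DatetimeIndex yields a numpy datetime64,
        # so convert each draw back to a pandas Timestamp
        yield pd.to_datetime(rnd.choice(bdays))
```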