I wrote a program that generates some synthetic data. Everything runs correctly, but the program is so slow that I am only generating 1,000 rows of data (and even that takes around 3 minutes). Ideally I would like to generate about 100,000 rows; at the moment that takes upwards of 10 minutes, and I killed the program before it finished running.
I've narrowed the problem down to the way I am generating random dates, the three lines below (imports aside). Once they have run, the rest of the program executes in a few seconds.
import numpy.random as rnd
import datetime
import pandas as pd

random_days = []
for num in range(0, n):
    random_days.append(pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date))))
What I need is, given some number n, to draw that many dates at random from a sequence of business days (the business-days restriction is also important). I need to convert each value with pd.to_datetime because rnd.choice otherwise returns a numpy datetime64 object, which causes problems in other parts of the program.
Is there any way to improve my code to have it generate dates faster? Or do I need to settle for a small sample size?
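For what it's worth, the dominant cost in the loop above is that pd.bdate_range rebuilds the whole business-day calendar on every iteration. Building the calendar once and sampling from it with vectorized integer indexing should be dramatically faster; indexing the DatetimeIndex directly also keeps pandas Timestamps, so no per-element pd.to_datetime call is needed. A minimal sketch (the function name random_business_days and the example dates are mine):

```python
import numpy.random as rnd
import pandas as pd

def random_business_days(n, start_date, end_date):
    # Build the business-day calendar once, not once per sample
    bdays = pd.bdate_range(start_date, end_date)
    # Draw n random positions in one vectorized call; indexing the
    # DatetimeIndex returns pandas Timestamps, so no conversion step
    return bdays[rnd.randint(0, len(bdays), size=n)]

# Example usage with hypothetical bounds:
random_days = random_business_days(1000, "2015-01-01", "2017-12-31")
```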
EDIT:
Just to add a little more context:
I use this loop to generate the rest of my data:
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
The function one_trans is the following:
def one_trans(date):
    trans = pd.Series([date.year, date.month, date.date(), fake.company(),
                       fake.company(), fake.ssn(),
                       rnd.normal(5000000, 10000),
                       random.sample(["USD", "EUR", "JPY", "BTC"], 1)[0]],
                      index=["YEAR", "MONTH", "DATE", "SENDER", "RECEIVER",
                             "TRANSACID", "AMOUNT", "CURRENCY"])
    return trans.to_frame().transpose()
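Building one single-row DataFrame per date and concatenating them all is itself slow; constructing each column as a whole array and building the DataFrame once avoids that. A sketch of the idea (the Faker-backed SENDER/RECEIVER/TRANSACID columns are stubbed with placeholder strings here so the example stays self-contained; in the real program they would come from fake.company() and fake.ssn()):

```python
import numpy.random as rnd
import pandas as pd

def make_frame(days):
    # days: a sequence of pandas Timestamps (e.g. the sampled business days)
    n = len(days)
    idx = pd.DatetimeIndex(days)
    return pd.DataFrame({
        "YEAR": idx.year,
        "MONTH": idx.month,
        "DATE": idx.date,
        # Placeholder values standing in for fake.company() / fake.ssn()
        "SENDER": ["company-%d" % i for i in range(n)],
        "RECEIVER": ["company-%d" % i for i in range(n)],
        "TRANSACID": ["000-00-0000"] * n,
        "AMOUNT": rnd.normal(5000000, 10000, size=n),
        "CURRENCY": rnd.choice(["USD", "EUR", "JPY", "BTC"], size=n),
    })
```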
EDIT 2: This is how I implemented Vogel612's suggestion:
def rng_dates(n, start_date, end_date):
    for _ in range(n):
        yield pd.to_datetime(rnd.choice(pd.bdate_range(start_date, end_date)))

random_days = rng_dates(n, start_date, end_date)
for day in random_days:
    new_trans = one_trans(day)
    frame.append(new_trans)
frame = pd.concat(frame)
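Note that this generator still calls pd.bdate_range on every yield, so it keeps the original per-draw cost. If the generator shape is worth keeping, hoisting the calendar out of the loop should recover most of the speed; a sketch of that change:

```python
import numpy.random as rnd
import pandas as pd

def rng_dates(n, start_date, end_date):
    # Build the business-day calendar once, outside the loop
    bdays = pd.bdate_range(start_date, end_date)
    for _ in range(n):
        # rnd.choice over a DatetimeIndex yields a numpy datetime64,
        # so convert each draw back to a pandas Timestamp
        yield pd.to_datetime(rnd.choice(bdays))
```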