Function within a function for iterating over each row based on a column value

Question

I am trying to write a function for a dataframe, but it is too complex for me. I have a dataframe that looks like this:

df1

     hour    production ....      
0     1          10
0     2          20
0     1          30
0     3          40
0     1          40
0     4          30
0     1          20
0     4          10

I am trying to create a function that would do the following:

Group data by different hour
Calculate 90% confidence interval of production for each hour
If the production value of a particular row falls outside the 90% confidence interval for its respective hour, mark it as unusual in a new column

Below are the steps I currently take to do the above for each individual hour:

Calculate confidence interval

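from numpy import mean
from scipy.stats import sem, t

confidence = 0.90
data = df1['production']  # column name without the trailing space
n = len(data)
m = mean(data)
std_err = sem(data)
# half-width of the two-sided t interval
h = std_err * t.ppf((1 + confidence) / 2, n - 1)
lower_interval = m - h
upper_interval = m + h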

Then:

def confidence_interval(x):
    if x['production'] > upper_interval:
        return 1
    if x['production'] < lower_interval:
        return 1
    return 0

df1['unusual'] = df1.apply(lambda x: confidence_interval(x), axis=1)

I am doing this for each of the values in hour, then having to merge all the results back into the original dataframe.

Can anyone help me create a function that does all of the above at once? I had a go but just can't get my head around it.

jezrael · Accepted Answer · 2019-05-26 05:25:37Z

2

Create a custom function and pass it to GroupBy.transform. Inside it, use Series.between to test whether each value lies within the interval, then invert the boolean mask with ~:

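from scipy.stats import sem, t

def confidence_interval(data):
    confidence = 0.90
    n = len(data)
    m = data.mean()  # scipy.mean is no longer available in recent SciPy
    std_err = sem(data)
    h = std_err * t.ppf((1 + confidence) / 2, n - 1)
    lower_interval = m - h
    upper_interval = m + h
    #print (lower_interval ,upper_interval)
    # True for rows outside the open interval; ~ inverts the between mask
    # inclusive='neither' needs pandas 1.3+; older versions used inclusive=False
    return ~data.between(lower_interval, upper_interval, inclusive='neither')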

df1['new'] = df1.groupby('hour')['production'].transform(confidence_interval).astype(int)
print (df1)
   hour  production  new
0     1          10    0
0     2          20    1
0     1          30    0
0     3          40    1
0     1          40    0
0     4          30    0
0     1          20    0
0     4          10    0
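
For reference, the same per-hour check can be sketched without a custom function by asking transform for each group statistic by name. This is only an illustration under a few assumptions: the sample frame is rebuilt from the question's data, the column name unusual is taken from the question rather than the answer, and the explicit comparisons stand in for Series.between so the sketch does not depend on the pandas version. Hours with a single row have an undefined standard error (NaN), so their rows are flagged, which matches the output above.

```python
import pandas as pd
from scipy.stats import t

# Sample data rebuilt from the question (the original index is not reproduced).
df1 = pd.DataFrame({
    "hour": [1, 2, 1, 3, 1, 4, 1, 4],
    "production": [10, 20, 30, 40, 40, 30, 20, 10],
})

confidence = 0.90
grouped = df1.groupby("hour")["production"]

# Per-row group statistics; transform aligns them to df1's index.
n = grouped.transform("count")
m = grouped.transform("mean")
std_err = grouped.transform("sem")  # NaN for hours with a single row

# Half-width of the two-sided t interval; NaN where the group has one row.
h = std_err * t.ppf((1 + confidence) / 2, (n - 1).to_numpy())

# Strictly inside the interval; comparisons against NaN evaluate to False.
inside = (df1["production"] > m - h) & (df1["production"] < m + h)
df1["unusual"] = (~inside).astype(int)
print(df1)
```

Because every intermediate Series shares df1's index, no merge step is needed to bring the per-hour results back together.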

edited May 26, 2019 at 5:25

answered May 26, 2019 at 4:43


2 Comments

Quang Hoang Over a year ago

reset_index may not be needed if transform is used instead of apply.

jezrael Over a year ago

@QuangHoang - Thank you.
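
A small sketch of the point about transform, using a hypothetical frame with a non-default index (the data, the centered column and the lambda are illustrative assumptions, not part of the thread): the result of GroupBy.transform carries the same index as the input, so it can be assigned back as a column without any reset_index step.

```python
import pandas as pd

# Hypothetical data with a non-default index to make the alignment visible.
df = pd.DataFrame(
    {"hour": [1, 2, 1, 4, 4], "production": [10, 20, 30, 30, 10]},
    index=[10, 11, 12, 13, 14],
)

# transform returns one value per input row, labelled with the original index.
centered = df.groupby("hour")["production"].transform(lambda s: s - s.mean())

# Direct assignment works because the indexes match row for row.
df["centered"] = centered
print(df)
```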
