I am currently trying to write a function for a dataframe, but it is too complex for me. I have a dataframe that looks like this:
df1
   hour  production  ....
0     1          10
0     2          20
0     1          30
0     3          40
0     1          40
0     4          30
0     1          20
0     4          10
I am trying to create a function that would do the following:
- Group data by different hour
- Calculate 90% confidence interval of production for each hour
- If production value of a particular row falls outside the 90% confidence interval for its respective hour, mark it as unusual by creating a new column
Below are the steps I currently take to do the above for each individual hour:
Calculate confidence interval
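# imports assumed from the names used below
from numpy import mean
from scipy.stats import sem, t

confidence = 0.90
data = df1['production']  # column name without the stray trailing space
n = len(data)
m = mean(data)
std_err = sem(data)
# half-width of the two-sided interval from the t distribution
h = std_err * t.ppf((1 + confidence) / 2, n - 1)
lower_interval = m - h
upper_interval = m + h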
Then:
def confidence_interval(x):
    # return 1 if the row's production is outside the interval, else 0
    if x['production'] > upper_interval:
        return 1
    if x['production'] < lower_interval:
        return 1
    return 0

df1['unusual'] = df1.apply(confidence_interval, axis=1)
I am doing this for each of the values in hour, then having to merge all the results back together into the original dataframe.
Can anyone help me create a function that can do all of the above at once? I had a go but just can't get my head around it.
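For reference, here is a minimal sketch of what such an all-at-once function might look like, assuming the columns are named exactly hour and production. The function name flag_unusual and its parameters are hypothetical, and the standard error is computed as the sample standard deviation divided by sqrt(n), which should match the default behaviour of sem. It uses groupby with transform so that each row gets the statistics of its own hour, and no merging is needed afterwards.

```python
import numpy as np
import pandas as pd
from scipy.stats import t


def flag_unusual(df, group_col="hour", value_col="production", confidence=0.90):
    """Hypothetical helper: return a copy of df with an 'unusual' column.

    A row gets 1 if its value lies outside the t-based confidence interval
    of the mean for its own group, otherwise 0.
    """
    grouped = df.groupby(group_col)[value_col]

    # transform broadcasts each group's statistic back onto its rows
    n = grouped.transform("count")
    m = grouped.transform("mean")
    std_err = grouped.transform("std") / np.sqrt(n)  # sample std (ddof=1)

    # half-width of the interval, one value per row
    h = std_err * t.ppf((1 + confidence) / 2, (n - 1).to_numpy())

    out = df.copy()
    outside = (out[value_col] < m - h) | (out[value_col] > m + h)
    out["unusual"] = outside.astype(int)
    return out


# Sample data from the question (index simplified to the default range)
df1 = pd.DataFrame({
    "hour": [1, 2, 1, 3, 1, 4, 1, 4],
    "production": [10, 20, 30, 40, 40, 30, 20, 10],
})
result = flag_unusual(df1)
```

One design consequence worth noting: an hour with only a single row has no defined standard error, so its interval is NaN and the row is left unflagged (0) rather than raising an error.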