Smoothing curve for matplotlib.pyplot using pandas or numpy/scipy

Question

I have a data series consisting of values from several experiments (IDs 1-40; 1-5 in the MWE). The original data has about 4,000,000 entries in total, which I am trying to smooth so that I can display them:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1,2,3,4,5] * 200
plt.figure(1, figsize=(11.69,8.27))
# Both fail for my amount of data:
plt.plot(spline(df["values"], df["id"], range(100)), "r-")
plt.plot(lowess(df["values"], df["id"]), "r-")

Both scipy.interpolate and statsmodels.nonparametric.smoothers_lowess.lowess throw out-of-memory exceptions for my data. Is there an efficient way to solve this, similar to geom_smooth() with ggplot2 in GNU R?

Why do you pass range(100) in the first plot? That position takes an int.
– Lucas, Commented Jan 17, 2017 at 9:33
According to the documentation (docs.scipy.org/doc/scipy-0.18.1/reference/generated/…) it is a list/array of new x-values, which would be [0, 1, ..., 99] in that case, right?
– Robin, Commented Jan 17, 2017 at 9:40
Separate the smoothing computation from the plot call, so you see where it fails. My guess is that a plot with 4 million points is not very informative and might require a large amount of memory. Also, for lowess, the fraction used in the local regression should be reduced when the sample size is this large.
– Josef, Commented Jan 17, 2017 at 19:53
An uninformative plot is exactly why I wanted to switch from plotting all values to smoothing. It fails during the smoothing computation itself.
– Robin, Commented Jan 18, 2017 at 7:16

Samwise · Accepted Answer · 2017-01-25 02:44:09Z

I can't quite tell what you're getting at with all the dimensions to your data, but one very simple thing you can try is to just use the 'markevery' kwarg like so:

import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(1,100,int(1E7))  # num must be an integer
y=x**2
plt.figure(1, figsize=(11.69,8.27))
plt.plot(x,y,markevery=100)
plt.show()

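This will only draw a marker at every nth point (n=100 here). Note that markevery affects markers only; a plain line is still drawn through all points.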
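
To thin the data that the line is drawn through, rather than only the markers, one option is to slice the arrays before plotting. The following is a minimal sketch, assuming evenly spaced samples; the array size and the step of 100 are illustrative values, not taken from the question's data:

```python
import numpy as np

# Hypothetical stand-in for a large data set (1 million samples).
x = np.linspace(1, 100, 1_000_000)
y = x ** 2

step = 100  # assumed thinning factor; tune it to the figure width
x_thin = x[::step]  # basic slicing returns a view, so no copy is made
y_thin = y[::step]

print(x_thin.size)  # number of points that would actually be drawn

# With matplotlib available, the thinned arrays plot as usual:
# import matplotlib.pyplot as plt
# plt.plot(x_thin, y_thin)
# plt.show()
```

Slicing keeps every nth sample unchanged, so it reduces the drawing cost but does not smooth noise.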

If that doesn't help then you may want to try just a simple numpy interpolation with fewer samples like so:

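# Continues from the snippet above (np and plt are already imported).
x_large=np.linspace(1,100,int(1E7))  # dense original grid
y_large=x_large**2
x_small=np.linspace(1,100,int(1E3))  # coarse grid used for plotting
y_small=np.interp(x_small,x_large,y_large)  # linear interpolation onto the coarse grid
plt.plot(x_small,y_small)
plt.show()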
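
np.interp evaluates the data at the new x positions by linear interpolation, so on noisy measurements it resamples rather than averages. If actual smoothing is wanted, one possible alternative is to average non-overlapping blocks of samples. The sketch below is an illustration under assumptions: block_mean is a hypothetical helper, and the noise level and block size are made-up values rather than properties of the question's data.

```python
import numpy as np

def block_mean(values, block):
    """Average consecutive, non-overlapping blocks of `block` samples."""
    values = np.asarray(values, dtype=float)
    n_blocks = values.size // block
    # Drop the incomplete tail so the array reshapes cleanly.
    trimmed = values[: n_blocks * block]
    return trimmed.reshape(n_blocks, block).mean(axis=1)

# Synthetic noisy data standing in for the real measurements (assumption).
rng = np.random.default_rng(0)
x_large = np.linspace(1, 100, 1_000_000)
y_large = x_large ** 2 + rng.normal(0, 500, x_large.size)

block = 1_000  # assumed block size: 1,000 samples per plotted point
x_small = block_mean(x_large, block)
y_small = block_mean(y_large, block)

# With matplotlib available:
# import matplotlib.pyplot as plt
# plt.plot(x_small, y_small)
# plt.show()
```

This needs only a single pass over the data and one reduced-size output array, which may help when spline or lowess fitting runs out of memory. For data with several experiment ids, the same helper could be applied to each id's values separately.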
