I have a series of data consisting of values from several experiments (1-40; in the MWE it is 1-5). My original data has about 4,000,000 entries in total, which I am trying to smooth in order to display them:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import spline
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.DataFrame()
df["values"] = np.random.randint(100000, 200000, 1000)
df["id"] = [1,2,3,4,5] * 200
plt.figure(1, figsize=(11.69,8.27))
# Both fail for my amount of data:
plt.plot(spline(df["values"], df["id"], range(100)), "r-")
plt.plot(lowess(df["values"], df["id"]), "r-")

Both scipy.interpolate and statsmodels.nonparametric.smoothers_lowess.lowess throw out-of-memory exceptions for my data. Is there an efficient way to solve this, like in, e.g., GNU R using ggplot2 and geom_smooth()?

  • Why do you use range(100) in the first plot? An int is expected in that place. Commented Jan 17, 2017 at 9:33
  • According to the documentation (docs.scipy.org/doc/scipy-0.18.1/reference/generated/…), it is a list/array of new x-values, which would be [0, 1, ..., 99] in that case, right? Commented Jan 17, 2017 at 9:40
  • Separate the smoothing computation from the plot call, so you see where it fails. My guess is that creating a plot with 4 million points is not very informative, and might require a large amount of memory. Also for lowess, the fraction to be used in the local regression should be reduced when the sample size is so large. Commented Jan 17, 2017 at 19:53
  • An uninformative plot is the reason why I wanted to change from plotting all values to smoothing. It fails on creating the smoothing computation. Commented Jan 18, 2017 at 7:16

1 Answer

I can't quite tell what you're getting at with all the dimensions to your data, but one very simple thing you can try is to just use the 'markevery' kwarg like so:

import numpy as np
import matplotlib.pyplot as plt

x=np.linspace(1,100,int(1E7))  # num must be an integer, not a float
y=x**2
plt.figure(1, figsize=(11.69,8.27))
plt.plot(x,y,markevery=100)
plt.show()

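This draws a marker only at every nth point (n=100 here). Note that markevery thins the markers only; the line itself is still drawn through all points.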
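
Slicing with a stride is another way to cut the data down, and it reduces the number of points before they ever reach matplotlib. A minimal sketch, assuming evenly spaced samples so that dropping points does not distort the curve (the names step, x_thin and y_thin are only illustrative, and the array is smaller than above to keep the example light):

```python
import numpy as np

# Smaller than the 1E7 points above, just to keep the example quick.
x = np.linspace(1, 100, int(1e6))
y = x**2

step = 100           # keep every 100th sample
x_thin = x[::step]   # strided slices are views, so no data is copied
y_thin = y[::step]
```

x_thin and y_thin can then be passed to plt.plot in place of x and y.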

If that doesn't help then you may want to try just a simple numpy interpolation with fewer samples like so:

x_large=np.linspace(1,100,int(1E7))
y_large=x_large**2
x_small=np.linspace(1,100,int(1E3))
y_small=np.interp(x_small,x_large,y_large)
plt.plot(x_small,y_small)
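
Keep in mind that np.interp only interpolates linearly between neighbouring samples and does not average anything, so noisy data stays noisy after resampling. If some smoothing is wanted as well, averaging consecutive blocks of samples is one cheap option. A minimal sketch, assuming plain block averaging is an acceptable smoother for the display; block_average is a hypothetical helper written for this example, not a NumPy function, and the noisy data is synthetic:

```python
import numpy as np

def block_average(y, n_blocks):
    """Average consecutive chunks of y so that n_blocks points remain."""
    y = np.asarray(y, dtype=float)
    block = len(y) // n_blocks         # samples per block
    trimmed = y[: block * n_blocks]    # drop the leftover tail, if any
    return trimmed.reshape(n_blocks, block).mean(axis=1)

# Synthetic noisy data, standing in for the real measurements.
rng = np.random.default_rng(0)
x_large = np.linspace(1, 100, 1_000_000)
y_noisy = x_large**2 + rng.normal(0, 500, x_large.size)

# Reduce both axes the same way so the points stay aligned.
x_small = block_average(x_large, 1000)
y_small = block_average(y_noisy, 1000)
```

Memory use stays linear in the input size, and x_small, y_small can be plotted directly. For data from several experiments, the same reduction could be applied to each experiment's values separately.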