Compute a linear regression, not for all data points

Question

I'm plotting errors obtained for various mesh sizes to study the order of convergence of the model. As shown in the two plots below the "good regime" is not always present, either because the mesh is too coarse (Fig. 1) or either we reach floating point precision (Fig. 2).

So computing the linear regression on the whole data would not yield correct results. One solution would be to manually remove the undesired points and then compute the linear regression (with sklearn.linear_model for instance) but as I have about 144 data frames to deal with, it will take a lot of time!

I did not find such a feature in sklearn.linear_model. Does it exist? Or is there a module that does it?

I also imagined a workaround consisting of computing all the slopes for consecutive points, and keeping for the actual linear regression all the values that are « the most frequent » but I did not manage to implement this.

On the two figures, I underlined the points I'd like to keep for the linear regression in red.

Here is the code I used to get the plot:

import plotly.graph_objects as go
from io import StringIO
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import plotly.express as px

df = pd.read_csv(StringIO("""hsize,errorl2
0.1,0.0126488001839
0.05,0.0130209787153
0.025,0.0111789092046
0.0125,0.00766945668181
0.00625,0.0012502455118
0.003125,0.000268314690697
"""))

def linear_regression_log(x, y):   # Compute the linear regression for log-log data, but here all the points are used in the computation
    keep = ~np.isnan(y) # in my case, some data are missing
    log_x = np.log(np.array(x[keep]))
    log_y = np.log(np.array(y[keep]))

    model = LinearRegression()
    model.fit(log_x.reshape(-1, 1), log_y)

    slope = model.coef_[0]
    intercept = model.intercept_

    return slope, intercept

# Sample data
x = df['hsize']
y = df['errorl2']
a, _ = linear_regression_log(x, y)

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode='markers+lines'))
fig.update_layout(title=f"Slope: {a:.2f}")
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()

And the data I plotted:

Fig. 1:

hsize,errorl2
0.1,0.0126488001839
0.05,0.0130209787153
0.025,0.0111789092046
0.0125,0.00766945668181
0.00625,0.0012502455118
0.003125,0.000268314690697

Fig. 2:

hsize,errorl2
0.1,0.000713407986653
0.05,1.60793872143e-06
0.025,6.20078712336e-11
0.0125,2.99238475669e-13
0.00625,1.1731644955e-13
0.003125,1.88186766825e-13

are you looking for a computable criterion on which to decide if a point should be excluded? — julaine
– julaine, Commented Jan 12, 2024 at 13:33

Lourenço Monteiro Rodrigues · Accepted Answer · 2024-01-12 13:39:03Z

RANSAC (random sample consensus) would be an alternative if you had several pertinent points and a few outliers, in the examples you provide, it seems that there is almost 50-50% distribution of "good regime" and coarse_mesh/float_precision.

If you can make the assumption that these two modes (coarse mesh and float precision limits) always produce smaller slopes than the good regime, you could compute the slopes, rank them and delete the points that only contribute to those smaller ranks.

Say you define a threshold of 30% of mean slope:

slopes = points[1:] - points[:-1]
thresholds = np.mean(slopes) * .3  # for 30% of the mean slope
left_slopes = np.concatenate([[True],slope > threshold]) # slope to the left of point is large enough, first point does not have left slope, so a True is added
right_slopes = np.concatenate([slope > threshold, [True])  # slope to the right of point is large enough, last point does not have right slope, so a True is added
select = np.logical_and(left_slopes, right_slopes)  # array of booleans, where True means that that point contributes to one of the accepted slopes, and should therefore be accepted

Collectives™ on Stack Overflow

Compute a linear regression, not for all data points

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related