
I wrote a symbolic regression tool in Python. Unary and binary operators can be passed in explicitly; when nothing is specified, I want the tool to determine a plausible operator set automatically. Is there a known approach for finding the necessary operators? I tried machine-learning and correlation approaches, but they don't give good results on two test functions (2*log(x1) + 3*x2 - x1*x2 + 5 and exp(x1)). One idea would be to take the most correlated candidate term, fit a regression, subtract that main component from the target, and iterate, but I think that would fail for nested compositions such as cos(log(x1)).
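
For illustration, here is a minimal sketch of that residual-peeling idea. The candidate set is hand-picked for this example (an assumption for illustration, not how my tool builds features):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 500)
x2 = rng.uniform(1, 10, 500)
y = 2 * np.log(x1) + 3 * x2 - x1 * x2 + 5

# Hand-picked candidate features for this sketch.
candidates = {"log(x1)": np.log(x1), "x2": x2,
              "x1*x2": x1 * x2, "exp(x1)": np.exp(x1)}

residual = y - y.mean()
for _ in range(3):
    # Pick the candidate most correlated with the current residual.
    name = max(candidates,
               key=lambda c: abs(np.corrcoef(candidates[c], residual)[0, 1]))
    col = candidates[name] - candidates[name].mean()
    # Least-squares coefficient for that candidate, then peel it off.
    coef = col @ residual / (col @ col)
    residual = residual - coef * col
    print(name, round(coef, 3), round(np.abs(residual).max(), 3))

This recovers a plain linear combination of dictionary terms, but as noted above it cannot discover nested compositions such as cos(log(x1)) unless that exact composition is already in the candidate set. My current attempt: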

import operator

import numpy as np
import pandas as pd
import sympy
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 1000
x1 = np.random.uniform(1, 10, size = n)
x2 = np.random.uniform(1, 10, size = n)
y = 2 * np.log(x1) + 3 * x2 - x1 * x2 + 5
#y = np.exp(x1)

unary_operators = {"neg": (lambda x: sympy.sympify("neg(" + str(x) + ")"), operator.neg),
                   "abs": (sympy.Abs, operator.abs),
                   "inv_": (lambda x: sympy.sympify("inv_(" + str(x) + ")"), lambda x: 1 / x),
                   "sqrt": (sympy.sqrt, np.sqrt),
                   "cos": (sympy.cos, np.cos),
                   "sin": (sympy.sin, np.sin),
                   "tan": (sympy.tan, np.tan),
                   "log": (sympy.log, np.log),
                   "exp": (sympy.exp, np.exp),
                   "sinh": (sympy.sinh, np.sinh),
                   "cosh": (sympy.cosh, np.cosh),
                   "floor": (lambda x: sympy.sympify("floor(" + str(x) + ")"), np.floor),
                   "ceil": (lambda x: sympy.sympify("ceil(" + str(x) + ")"), np.ceil)}

binary_operators = {"+": (operator.add, operator.add),
                    "-": (operator.sub, operator.sub),
                    "*": (operator.mul, operator.mul),
                    "/": (operator.truediv, operator.truediv),
                    "//": (operator.floordiv, operator.floordiv),
                    "%": (operator.mod, operator.mod),
                    "**": (sympy.Pow, operator.pow)}

# True when the operator is commutative, so mirrored pairs like x2 + x1 are redundant.
symmetric_binary_operators = {"+": True, "-": False, "*": True, "/": False}
symbols = sympy.symbols("x1 x2")
X = [x1, x2]

X_raw = pd.DataFrame()

for sym, col in zip(symbols, X):
    X_raw[str(sym)] = col

feature_dict = {}   # feature name -> numeric column
ops = {}            # feature name -> operator key

for k, v in unary_operators.items():
    sym_op, num_op = v

    for var in symbols:
        feature_dict[str(sym_op(var))] = num_op(X_raw[str(var)])
        ops[str(sym_op(var))] = k

for k, v in binary_operators.items():
    sym_op, num_op = v

    for i1 in range(0, len(symbols)):
        indices2 = list(range(0, len(symbols)))

        # Commutative operators: only keep one ordering of each pair.
        if symmetric_binary_operators.get(k, False):
            indices2 = list(range(i1 + 1, len(symbols)))

        for i2 in indices2:
            feature_dict[str(sym_op(symbols[i1], symbols[i2]))] = num_op(X_raw[str(symbols[i1])], X_raw[str(symbols[i2])])
            ops[str(sym_op(symbols[i1], symbols[i2]))] = k

X_feat = pd.DataFrame(feature_dict)
# Clean non-finite values, then drop features with extreme magnitudes.
X_feat = X_feat.replace([np.inf, -np.inf, np.nan], 0)
X_feat = X_feat.loc[:, (X_feat.abs().max() < 1e6)]

correlations = X_feat.apply(lambda col: np.corrcoef(col, y)[0, 1])

# Sort by decreasing absolute correlation
correlations_sorted = correlations.abs().sort_values(ascending = False)

print("Top 10 features les plus corrélées (Pearson):\n")
for name in correlations_sorted.head(10).index:
    print(f"{name:30}  corr: {correlations[name]:+.4f}")

scaler = StandardScaler()
X_feat_scaled_array = scaler.fit_transform(X_feat)
X_feat = pd.DataFrame(X_feat_scaled_array, columns=X_feat.columns)

X_train, X_test, y_train, y_test = train_test_split(X_feat, y, test_size = 0.2, random_state = 0)

model = RandomForestRegressor(n_estimators = 200, random_state = 0)#, max_depth = 10)
model.fit(X_train, y_train)

importances = model.feature_importances_

features_sorted = sorted(zip(X_feat.columns, importances), key = lambda x: x[1], reverse = True)

# Walk down the importance ranking and record the drop relative to the
# previous feature; a large drop marks a natural cutoff point.
prev = 0

for i in range(0, len(features_sorted)):
    features_sorted[i] = list(features_sorted[i])
    x = features_sorted[i]
    x.append(abs(x[1] - prev))
    prev = x[1]

print("Top 10 features les plus utiles :\n")
for name, score, r in features_sorted[:10]:
    print(f"{name:30}  importance: {score:.4f} {r:.4f}")

# Hand-tuned cutoffs: keep an operator when one of its features shows a
# large importance drop relative to the previous feature.
min_score = 0.015
min_r = 0.05
un_ops = set()
bin_ops = set()

for name, score, r in features_sorted:
    #if (score > min_score):
    if (r > min_r):
        if (ops[name] in unary_operators):
            un_ops.add(ops[name])
        elif (ops[name] in binary_operators):
            bin_ops.add(ops[name])  

print("un_ops", un_ops)
print("bin_ops", bin_ops)

The corresponding output is:

Top 10 most correlated features (Pearson):

x1*x2                           corr: -0.9480
x1 + x2                         corr: -0.8527
neg(x1)                         corr: +0.8298
Abs(x1)                         corr: -0.8298
ceil(x1)                        corr: -0.8244
floor(x1)                       corr: -0.8244
sqrt(x1)                        corr: -0.8170
log(x1)                         corr: -0.7880
inv_(x1)                        corr: +0.6826
sinh(x1)                        corr: -0.6305

Top 10 most useful features:

x1*x2                           importance: 0.7583 0.7583
x1 + x2                         importance: 0.0850 0.6734
log(x1)                         importance: 0.0192 0.0658
inv_(x1)                        importance: 0.0170 0.0022
sinh(x1)                        importance: 0.0164 0.0007
neg(x1)                         importance: 0.0162 0.0002
sqrt(x1)                        importance: 0.0159 0.0003
exp(x1)                         importance: 0.0159 0.0001
cosh(x1)                        importance: 0.0151 0.0008
Abs(x1)                         importance: 0.0137 0.0014
un_ops {'log'}
bin_ops {'*', '+'}

  • This is more of a theoretical mathematics question: "what is the best approach". I suggest you ask your question on math.stackexchange.com instead. SO is more for code/software questions. Commented Jun 15 at 4:57
  • I tried but the topic was closed. Now it is on scicomp.stackexchange.com. Commented Jun 15 at 9:53
  • OP, this is an example of the "model selection" problem; a web search will find many resources. The problem is essentially that you can always get a better fit by adding more terms, so how do you know when to stop? There are various approaches to the "when to stop" problem. My advice is to generate a lot of alternatives (let's say all the possible combinations within some easily-described category) and then rank them by different criteria, such as number of inputs or number of operations. There are many approaches but that will get you started (a sketch of this idea follows these comments). Good luck and have fun. Commented Jun 16 at 19:22
  • @RobertDodier an expanded version of your comment would make a great Answer to what's currently an over-broad question Commented Jun 16 at 22:02
  • @ti7 Thanks, but I think I put in as much effort as it merits. But if you want to take the stuff I wrote as a point of departure and write your own answer, I think that would be awesome. Commented Jun 17 at 17:04
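
Expanding on the model-selection comment above, here is a minimal hypothetical sketch: enumerate subsets of candidate terms, fit each subset by least squares, and rank the fits by a score that penalizes complexity. The term dictionary, the added noise, and the BIC-style score are all assumptions for illustration:

import itertools
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 300)
x2 = rng.uniform(1, 10, 300)
# Small noise keeps the fit-vs-complexity comparison well-behaved.
y = 2 * np.log(x1) + 3 * x2 - x1 * x2 + 5 + rng.normal(0, 0.1, 300)

# Hypothetical candidate terms (name -> values).
terms = {"x1": x1, "x2": x2, "log(x1)": np.log(x1),
         "x1*x2": x1 * x2, "exp(x1)": np.exp(x1)}

results = []
for r in range(1, len(terms) + 1):
    for subset in itertools.combinations(terms, r):
        # Least-squares fit of y on the chosen terms plus an intercept.
        A = np.column_stack([terms[t] for t in subset] + [np.ones_like(y)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = np.sum((y - A @ coef) ** 2)
        n, p = len(y), len(subset) + 1
        # BIC-style score: fit quality plus a penalty per extra term.
        score = n * np.log(rss / n) + p * np.log(n)
        results.append((score, subset))

for score, subset in sorted(results)[:5]:
    print(f"{score:10.1f}  {' + '.join(subset)}")

With noise present, {log(x1), x2, x1*x2} should rank first: adding exp(x1) or a redundant x1 improves the fit too little to offset the complexity penalty.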
