
GESR: A GEOMETRIC EVOLUTION MODEL FOR SYMBOLIC REGRESSION


Anonymous authors

Paper under double-blind review

Abstract


Symbolic regression is a challenging task in machine learning that aims to automatically discover highly interpretable mathematical equations from limited data. Considerable effort has been devoted to this problem, yielding promising results. However, current methods still struggle with bottlenecks, especially when dealing with problems containing various kinds of noise or intricate underlying mathematical formulas. In this work, we propose a novel Geometric Evolution Symbolic Regression (GESR) algorithm. Leveraging geometric semantics, GESR transforms the symbolic regression process into an approximation to a unimodal target in n-dimensional topological space. Three key modules are then proposed to enhance the approximation: (1) a new semantic gradient concept that assists exploration and aims to improve the accuracy of the approximation; (2) a new geometric search operator, tailored for approximating the target formula directly in topological space; (3) the Levenberg-Marquardt algorithm with L2 regularization, used to adjust expression structures and balance global subtree weights in support of the proposed geometric semantic search operator. With these modules, GESR achieves state-of-the-art accuracy on multiple authoritative benchmark datasets and demonstrates a certain level of robustness against noise. The implementation is available at https://anonymous.4open.science/r/12331211321-014D.

1 INTRODUCTION


Symbolic regression (SR) is a supervised machine learning method that learns interpretable mathematical expressions directly from a given dataset. SR offers the opportunity to automatically extract highly interpretable mathematical expressions that depict the underlying patterns in complex data from a mathematical perspective, rather than relying heavily on expert intuition and sensitivity to data. Compared to the numerical solutions produced by neural networks, quantitative mathematical expressions provide more interpretability and generalization. SR has been widely applied to a variety of tasks, from scientific discovery (Makke & Chawla, 2024; Wang et al., 2023) to engineering applications (Angelis et al., 2023), including governing equations (Sun et al., 2022), finding fundamental physical laws (Udrescu & Tegmark, 2020), fault detection (Hale et al., 2022), and TCP congestion control (Sharan et al., 2022).

Genetic programming (GP) serves as the primary algorithm for SR. In GP, symbolic regression is addressed by evolving the expression trees within a population. Through iterative variation of the population under the guidance of fitness, the accuracy of the expression trees gradually improves until a satisfactory level is achieved. GP has proven very useful in many tasks, such as inferring physical laws. However, when the underlying formula is complex, its effectiveness still falls short of expectations due to the very large search space.

With the ongoing exploration of symbolic regression and machine learning, numerous approaches have been proposed to seek breakthroughs in inference time, solution rate, accuracy, and model size. Deep learning methods have emerged because they avoid the hyperparameter sensitivity of GP and offer rapid inference. SymbolicGPT (Valipour et al., 2021) uses deep language models like GPT to generate mathematical expressions for SR. AIFeynman (Udrescu et al., 2020) trains a neural network to estimate functional modularities. NGGP (Mundhenk et al., 2021) uses a hybrid approach in which neural networks assist GP by generating the starting population. The end-to-end transformer model for SR (Kamienny et al., 2022) also shows strong exploration capability and inference speed. Reinforcement learning has likewise been employed to explore and discover expressions, often together with Monte Carlo tree search. DSR (Petersen et al., 2019) is a representative reinforcement learning method, which uses a learned policy to add tokens to the parse tree in turn while a neural network generates the symbol distribution. Subsequently, uDSR (Landajuela et al., 2022) was proposed as a comprehensive framework that hybridizes GP, AIFeynman, linear models, DSR, and large-scale pre-training into a single module, achieving state-of-the-art performance.


However, these methods still face challenges in effectively generating formulas from data characterized by intricate mathematical representations. Recently, the development of semantic genetic programming (SGP) (Pawlak et al., 2014; Chen et al., 2019; Huang et al., 2022) has provided a fresh perspective. The core idea of GP lies in iterative evolution, and the benefits of such evolution have been widely demonstrated. Nevertheless, the lack of explicitly guided crossover and mutation severely hinders improvement in the accuracy of formula evolution. Semantic genetic programming exposes the unimodal landscape of symbolic regression, enabling the exploration of approximate unimodal solutions directly in the underlying topological space (named the semantic space). Compared to the random exploration of traditional genetic programming, approximation in the topological space exhibits significantly stronger guidance and superior search efficiency. However, suffering from the curse of dimensionality and the sparsity of the semantic space, search performance in the underlying semantic space slows down rapidly after several generations, and the approximation process becomes convoluted. Severe tree bloating is also a bottleneck, since SGP directly maps each step of semantic approximation to modifications of expression trees without tree-structure information in the semantic space.

To address these issues, we propose a geometric evolution symbolic regression (GESR) algorithm. The proposed algorithm significantly improves the fitting performance of symbolic expressions while effectively tackling tree bloating. Evaluated on the SRBench benchmark and the SRSD benchmark, covering both black-box and scientific-discovery symbolic regression, our method achieves state-of-the-art accuracy on both.

In summary, our work introduces several key contributions: (1) We introduce the concept of semantic gradients, motivated by the observed inconsistency between the approximation processes in the sub-semantic space and the target semantic space; it enables accurate approximation of the mappings between sub-objective spaces and objective spaces, thereby enhancing semantic fitting accuracy. (2) We develop a novel geometric semantic method capable of efficiently approximating target semantics in the sparse semantic space. (3) We present a mechanism that assists the geometric semantic method in adjusting the generated expression structure and capturing the potential solution by zeroing the weights of uninformative subtrees (via the LM algorithm). (4) Our proposed GESR consistently demonstrates promising performance across a diverse array of SRBench and SRSD benchmark datasets, as validated through rigorous ablation experiments. Compared to the baselines, GESR exhibits notably superior accuracy in scientific-discovery tasks and showcases robust noise resistance.

2 RELATED WORK


Genetic Programming: Genetic programming refers to an evolutionary algorithm instance applied to SR, in which a set of expression trees evolves iteratively under the guidance of fitness. Modules such as crossover, mutation, and selection drive the population toward minimizing the distance from the optimal expression. Many popular frameworks use GP for symbolic regression (Cranmer, 2023; Burlacu et al., 2020; La Cava et al., 2018; de Franca & Aldeia, 2021b; Zhang et al., 2023), and a survey (Mei et al., 2022) is recommended for an overview of GP for SR. In this work, we inherit the concept of population evolution from GP, combining it with geometric semantic methods (Virgolin et al., 2019; Chen et al., 2019) and constant optimization.

Geometric Semantic Evolution: Geometric semantic genetic programming (GSGP) (Vanneschi, 2016) is a type of algorithm that evolves mathematical formulas directly in n-dimensional topological space. Using semantic information, GSGP converts traditional mutations into operations with geometric properties in the underlying semantic space to approximate the unimodal target semantics, and it directly maps these geometric operations to syntactic modifications of expression trees. Many works have been proposed to improve its performance (Pietropolli et al., 2022; Chen et al., 2019; Nguyen & Chu, 2020; Virgolin et al., 2019; Huang et al., 2024).





Figure 1: Schematic of the proposed method. For each symbolic tree in the population, GESR first selects the mutated subtree based on the computed probabilities. The chosen subtree is then replaced by a combined tree generated through semantic approximation in the topological space. Periodically, each tree in the population is adjusted by the Weight Optimization module. The generated offspring is then executed to obtain the fitness value and the auxiliary information needed in the next iteration. After execution, evaluation and selection are performed, and the selected symbolic trees form the new population for the next iteration. These procedures are described in detail in Section 3.


However, due to the high-dimensional nature of the semantic space, these methods still struggle through complex intermediate processes to reach the target semantics. Moreover, the direct mapping of geometric operations to syntactic modifications leads to severe tree bloating. Searching for target semantics in a high-dimensional semantic space therefore remains challenging.

3 METHODOLOGY


3.1 PRELIMINARY


Given a dataset $D=\{(x_i, y_i)\}_{i<N}$, $(x_i, y_i)\in\mathbb{R}^d\times\mathbb{R}$, the main goal of symbolic regression (SR) methods is to find a symbolic expression $f^*$ that minimizes the theoretical risk of the input-output mapping of the dataset $D$:

$$f^{*}=\underset{f\in\mathcal{F}}{\arg\min}\;\mathbb{E}_{D\sim\Omega(D)}\bigl[L(f,D)\bigr] \tag{1}$$

Semantic genetic programming (SGP) vectorizes the output of a symbolic expression on the dataset $D$ as a semantic vector, with which the iterative improvement of the symbolic expression can be achieved through semantic target approximation in topological space. The semantic vectors of possible solutions constitute the semantic space, where each symbolic expression is mapped to a semantic point. In this paper, we propose a geometric semantic genetic programming algorithm that exploits geometric properties describing spatial relationships between symbolic expressions in the $n$-dimensional semantic space, where $n$ equals the size of the training set. By directly approximating the target semantics with a geometric method in the semantic space and mapping the corresponding geometric operations to syntactic modifications of the symbolic expression, the newly obtained expression can be brought closer to the target. For example, the midpoint of the two semantic points $s_1$ and $s_2$, which represent two symbolic expression trees $tr_1$ and $tr_2$ in the semantic space, maps to a new expression tree: $s^*: \frac{1}{2}s_1+\frac{1}{2}s_2 \Rightarrow tr^*: \frac{1}{2}tr_1+\frac{1}{2}tr_2$.
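To make this semantics-to-syntax correspondence concrete, the following minimal numpy sketch illustrates it; the two toy expressions and the five-point input grid are our own illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

# Toy illustration (our own, not from the paper): an expression's semantic
# vector is its output evaluated on the n training inputs (here n = 5).
X = np.linspace(-1.0, 1.0, 5)

tr1 = lambda x: x ** 2        # expression tree tr1
tr2 = lambda x: np.sin(x)     # expression tree tr2

s1, s2 = tr1(X), tr2(X)       # semantic points in R^5

# The midpoint in semantic space maps to the blended tree 1/2*tr1 + 1/2*tr2.
s_mid = 0.5 * s1 + 0.5 * s2
tr_mid = lambda x: 0.5 * tr1(x) + 0.5 * tr2(x)

assert np.allclose(s_mid, tr_mid(X))  # geometric operation == syntactic blend
```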

In our method, geometric semantics is used to guide mutation, as shown in Figure 1. Although geometric semantic mutation improves search efficiency and accuracy, frequent linear combinations of expressions often lead to severe tree bloating. Therefore, we backpropagate the target semantics to sub-expressions and, combined with semantic gradients and strict limitations, guide the mutation process in the sub-semantic space. The linear combination of sub-expressions also significantly increases the proportion of constants. Because backpropagation-based geometric semantic mutation only adjusts local constants, it is difficult for it to balance the optimization of global constants. Thus, we incorporate the Levenberg-Marquardt optimization algorithm with L2 regularization to periodically adjust global constants, which also smooths the formula curves. Notably, since we assign a constant (weight) to each combined subtree, our method is naturally suited to using a continuous optimization algorithm to further adjust the overall tree structure.

3.2 SEMANTIC GRADIENT


Semantic backpropagation is achieved by inversely calculating the expression tree (Krawiec & Pawlak, 2013). As the target semantics are reversed into sub-target semantics, the semantic space is also mapped to a sub-semantic space, so a solution in the semantic space can be found by searching within the sub-semantic space. In this section, a semantic gradient concept is further proposed to assist exploration in the semantic space, based on the following observation: the approximation process in the sub-semantic space cannot be equivalently propagated to the root semantic space. Figure 2 gives an example. While the target semantics can be backpropagated to the sub-semantic space, the approximation process is strongly affected by the calculation path of the mutation node, where the calculation path serves as the mapping function $f(y)$. With $f(y)$, the evaluation based on the Euclidean distance between the semantics $s$ of a subtree $tr$ and the sub-target semantics $st$ in the sub-semantic space, $\sum_{i=0}^{n}(s_i-st_i)^2$, is transformed into $\sum_{i=0}^{n}\bigl(f(s_i)-f(st_i)\bigr)^2 \approx \sum_{i=0}^{n}(w^2 x_i)^2 (s_i-st_i)^2$. The uncertain vector $x$ makes each dimension of the approximation result from the sub-semantic space deviate in the target semantic space, which significantly reduces the accuracy of semantic approximation based on backpropagation. To address this issue, we introduce the concept of semantic gradients to approximate the mapping relationship between the sub-semantic space and the target semantic space.




Figure 2: Spatial transformation under a nonlinear mapping function $f(y)$. With $f(y)$, the Euclidean distance between semantic points in the sub-semantic space changes when mapped to the target semantic space.


As illustrated in Figure 3, starting from the root node of the expression tree, the semantic gradients of the mutated node are computed along the semantic backpropagation path, thereby obtaining the gradient vector $T$ for the $j$-th node.






Figure 3: The process of semantic gradient computation. The semantic vectors $s_i$ on the backpropagation path are replaced by the corresponding sub-target semantics $st_i$.



The representation of $T$ is as follows:

$$T=\left\langle \frac{\partial f_{0}}{\partial s_{0j}},\ldots,\frac{\partial f_{n}}{\partial s_{nj}}\right\rangle =\left\langle \frac{\partial f_{0}}{\partial s_{00}}\frac{\partial s_{00}}{\partial s_{01}}\cdots\frac{\partial s_{0,j-1}}{\partial s_{0j}},\ldots,\frac{\partial f_{n}}{\partial s_{n0}}\frac{\partial s_{n0}}{\partial s_{n1}}\cdots\frac{\partial s_{n,j-1}}{\partial s_{nj}}\right\rangle \tag{2}$$

where $s_{ij}$ denotes the $i$-th dimension of the semantics of the $j$-th subtree. The partial derivative values across the semantic dimensions indicate the rate of change of the nonlinear mapping function $f(tr_j)$ with respect to the semantics of the subtree $tr_j$. Specifically, the absolute partial derivative value in each semantic dimension indicates the degree to which deviations of $tr_j$ within the corresponding sub-semantic space influence deviations of $f(tr_j)$ within the corresponding dimension of the target semantic space. Thus, the normalized value $|T|_N^j$ serves as the weight of each dimension in the sub-semantic space. As the gradient vector is affected by the semantics of the mutated node, which is not fixed, we use the sub-target semantics $st$ instead of the current mutated node semantics $s$ when computing the semantic gradient:

$$T=\left\langle \frac{\partial f_0}{\partial st_{0j}}, \frac{\partial f_1}{\partial st_{1j}}, \ldots, \frac{\partial f_n}{\partial st_{nj}} \right\rangle \tag{3}$$

where $st_{ij}$ refers to the sub-target semantics of the $i$-th dimension of the $j$-th subtree.
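The chain-rule computation of Eqs. 2-3 can be sketched as follows; the two-operator path, the constant $w$, and the toy sub-target vector are hypothetical stand-ins for an actual backpropagation path in an expression tree.

```python
import numpy as np

# Sketch of Eqs. (2)-(3): the semantic gradient of a mutated node is the
# per-dimension chain-rule product of local derivatives along the path from
# the node up to the root, evaluated at the sub-target semantics (Eq. 3).
# The two-node path (root: w * exp(.), inner: (.)^2) and the values of w
# and st are hypothetical, for illustration only.
w = 1.5
path = [  # ordered root -> mutated node: (forward fn, derivative wrt input)
    (lambda y: w * np.exp(y), lambda y: w * np.exp(y)),
    (lambda y: y ** 2,        lambda y: 2.0 * y),
]

def semantic_gradient(path, st):
    """Chain-rule product of local derivatives, one value per dimension."""
    grad = np.ones_like(st)
    y = st                            # semantics at the mutated node
    for fwd, dfn in reversed(path):   # innermost node first
        grad *= dfn(y)
        y = fwd(y)                    # propagate semantics toward the root
    return grad

st = np.array([0.1, -0.3, 0.7])         # sub-target semantics (toy values)
T = semantic_gradient(path, st)
T_norm = np.abs(T) / np.sum(np.abs(T))  # |T|_N: per-dimension weights
print(T_norm)
```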

3.3 GEOMETRIC SEMANTIC METHOD


Due to the high-dimensional nature of the semantic space, in which the dimensionality equals the training-set size, the sparsity of semantic points resulting from high-dimensional features makes it difficult for geometric semantic mutation strategies to locate the target semantics. To quickly approximate the target semantics, two steps are conducted. First, since the distribution of candidates in the semantic space is sparse, we scale each candidate semantics $s$ onto the perpendicular projection of the sub-target semantics onto the line connecting the semantic point and the origin. This step migrates the distribution of candidate semantics $s$ toward the sub-target semantics $st$, as shown below:

$$s' = \alpha s \tag{4}$$

where $\alpha$ is:

$$\alpha=\frac{s\cdot st}{s\cdot s}=\frac{\sum_{i}^{n} s_i\,st_i}{\sum_{i}^{n} s_i\,s_i} \tag{5}$$

Subsequently, we adopt $\mathrm{var}(st-\alpha s)=\sum_{i=1}^{n}\bigl((st_i-\alpha s_i)-E(st_i-\alpha s_i)\bigr)^2$ to rank the candidate set, instead of the Euclidean distance metric, aiming to select subexpressions that are more similar in functional form rather than merely closer in scalar distance. Any constant deviation can be eliminated with the constant node in the candidate set, as detailed in Appendix B. The approximated candidate set is then sorted by the Euclidean distance $dis(s_i', st)$ between the candidates' semantics and the sub-target semantics.
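A minimal sketch of this scaling and ranking step (Eqs. 4-5 plus the variance-based ranking) might look as follows; the target and candidate vectors are toy values chosen to contrast a shape-matching candidate with a poorly shaped one.

```python
import numpy as np

# Sketch of Eqs. (4)-(5) and the variance-based ranking: each candidate
# semantics s is rescaled by alpha onto the projection of st, then ranked
# by var(st - alpha*s) so that functional shape matters more than scale.

def alpha(s, st):
    return np.dot(s, st) / np.dot(s, s)           # Eq. (5)

def variance_score(s, st):
    r = st - alpha(s, st) * s                     # residual after scaling
    return np.sum((r - r.mean()) ** 2)            # variance-style ranking

st = np.array([1.0, 2.0, 3.0, 4.0])
candidates = [np.array([2.0, 4.1, 5.9, 8.2]),     # ~2x st: good shape
              np.array([4.0, 3.0, 2.0, 1.0])]     # reversed: poor shape

scores = [variance_score(s, st) for s in candidates]
best = candidates[int(np.argmin(scores))]
print(scores, alpha(best, st) * best)             # scaled winner tracks st
```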


After that, we select multiple pairs of candidates in order and use the least squares method to combine them pairwise, seeking the linear combination of candidates that best approximates the target semantics. For the selection of $tr_1$, we take the top $t$ subtrees of the sorted set as candidates. For the selection of $tr_2$, the following $m$ subtrees for each $tr_1$ form the candidate set. The candidate subtrees from the $tr_1$ and $tr_2$ sets are then linearly combined to minimize the distance $L(st, f(s^{tr_1}, s^{tr_2}))$, which, with Eq. 4, can be represented as:

$$s' = \alpha_1(1-k)\,s^{tr_1} + \alpha_2 k\, s^{tr_2} \tag{6}$$

Here, by using the least squares method to minimize the loss function $L(st, f(s^{tr_1}, s^{tr_2})) = \sum_i \bigl(st_i - f(s_i^{tr_1}, s_i^{tr_2})\bigr)^2\,|T|_{N,i} = \sum_i \bigl(st_i - \alpha_1(1-k)s_i^{tr_1} - k\alpha_2 s_i^{tr_2}\bigr)^2\,|T|_{N,i}$, we can obtain $k$ as follows:

$$k = \frac{\sum_i |T|_{N,i}\,\bigl(\alpha_2 s_i^{tr_2} - \alpha_1 s_i^{tr_1}\bigr)\bigl(st_i - \alpha_1 s_i^{tr_1}\bigr)}{\sum_i |T|_{N,i}\,\bigl(\alpha_2 s_i^{tr_2} - \alpha_1 s_i^{tr_1}\bigr)^2} \tag{7}$$

In Eq. 7, $|T|_N$ represents the normalized vector of the absolute semantic gradient. With Eqs. 4, 5, 6, and 7, a new combined tree can be generated: $tr' = \alpha_1(1-k)\,tr_1 + \alpha_2 k\,tr_2$. For the candidates $\{tr_1', tr_2', \ldots, tr_n'\}$ generated by pairwise linear combination, we first filter the candidate set based on the Euclidean distance $L(st, s')$ between the semantics $s'$ of each combined candidate subtree $tr'$ and the sub-target semantics $st$ in the root semantic space. Since the Euclidean distance used as the evaluation criterion is a scalar, it loses the vector information of the semantic space; simply pursuing a decrease in distance may sacrifice diversity and lead to local optima. Therefore, an additional parameter $\lambda$ is used to fully explore the solution space:

$$L(st, s') \le L(st, s^c)\,(1+\lambda)$$

$$\sum_i \bigl((st_i - s_i')^2\,|T|_{N,i}\bigr) \le (1+\lambda)\sum_i \bigl((st_i - s_i^c)^2\,|T|_{N,i}\bigr) \tag{8}$$

where $s^c$ refers to the semantics of the original subtree; candidates that do not satisfy Eq. 8 are filtered out. Moreover, we incorporate a variance-based constraint to ensure smoother function fitting, thereby avoiding local overfitting caused by uneven data distribution or large differences among output values within the dataset. The constraint is formulated as follows:

$$\frac{\mathrm{var}(st-s^c)}{\mathrm{var}(st-s')}=\frac{\sum_{i=1}^{n}\bigl((st_i-s_i^c)-E(st-s^c)\bigr)^2}{\sum_{i=1}^{n}\bigl((st_i-s_i')-E(st-s')\bigr)^2}\le 1+\lambda \tag{9}$$

The remaining candidates that satisfy both Eq. 8 and Eq. 9 are evaluated in a final step. Because the semantics consist solely of output vectors and carry no tree-structure information, translating the semantic-space approximation into expression trees can cause severe tree expansion. Inspired by the SPL approach (Sun et al., 2022), the final evaluation function is calculated using the following formula:

$$R=\frac{\eta^{l}}{1+\sum_{i=1}^{n}(st_i-s_i)^2\,|T|_{N,i}} \tag{10}$$

where $\eta$ is the discount factor promoting concise trees and $l$ is the size of the mutated subtree. A mutated subtree is accepted in place of the original only when its $R$ value improves on the original $R^c$. Note that the mutated subtree size is counted over internal (function) nodes rather than all nodes, to ensure fairness across function symbols. With this evaluation function, our method encourages finding more concise expressions while minimizing the Euclidean distance.
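The pairwise combination step can be sketched as below, assuming the two candidates have already been scaled by their respective $\alpha$ values (Eqs. 4-5); all vectors are illustrative toy values, and the gradient-weighted distance stands in for the filtering and evaluation of Eqs. 8-10.

```python
import numpy as np

# Sketch of the pairwise combination (Eqs. 6-7): given two candidate
# semantics already scaled by their alphas, a = alpha1 * s^{tr1} and
# b = alpha2 * s^{tr2}, the blend weight k minimizes the gradient-weighted
# least-squares loss sum_i w_i * (st_i - (1-k)*a_i - k*b_i)^2 with
# w = |T|_N.

def combine_k(a, b, st, w):
    d = b - a                                             # blend direction
    return np.sum(w * d * (st - a)) / np.sum(w * d * d)   # Eq. (7)

st = np.array([1.0, 0.5, -0.2, 0.8])                  # sub-target semantics
a = np.array([0.8, 0.9, -0.5, 0.4])                   # alpha1 * s^{tr1}
b = np.array([1.4, 0.1, 0.3, 1.1])                    # alpha2 * s^{tr2}
w = np.array([0.4, 0.3, 0.2, 0.1])                    # |T|_N, sums to 1

k = combine_k(a, b, st, w)
blend = (1 - k) * a + k * b                           # combined-tree semantics
dist = np.sum(w * (st - blend) ** 2)                  # weighted distance
print(k, dist)                                        # feeds the Eq. 8-10 checks
```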

3.4 SEMANTIC POINT TOURNAMENT SELECTION


When selecting a subtree of a symbolic tree to mutate, the traditional random selection strategy uses no prior information and reduces search efficiency. We integrate semantic information and the semantic gradient into the mutated-subtree selection strategy, choosing the Euclidean distance as the basic metric for assessing the optimization potential of candidate mutated subtrees. By computing the Euclidean distance between the semantics of each candidate mutated subtree and the sub-target semantics, and mapping it through the semantic gradient to the root semantic space, we obtain a scalar measure of the potential of each candidate mutated subtree.


We then use softmax to convert the mapped Euclidean distance of each candidate mutated subtree into a probability and apply a tournament selection strategy to choose the final mutated subtree from the candidates. The selection probability $p(tr_i)$ of each candidate $tr_i$ is computed by:

$$p(tr_i)=\mathrm{softmax}(r_i)=\frac{e^{\,(st_i-s_i)^2\,|T|_{N_i}-M}}{\sum_{j} e^{\,(st_j-s_j)^2\,|T|_{N_j}-M}} \tag{11}$$

where $r_i=(st_i-s_i)^2\,|T|_{N_i}$ and $M=\max_{0\le i\le n} r_i = \max_{0\le i\le n}\bigl((st_i-s_i)^2\,|T|_{N_i}\bigr)$ is used to prevent numerical overflow. Note that in each generation we randomly select only a few subtrees of the expression tree as candidates, ensuring randomness while allowing full exploration, since the mismatched parts of an expression may be scattered across multiple branches of the expression tree.
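A minimal sketch of this selection procedure, with toy candidate distances and a two-way tournament (the tournament size is our assumption):

```python
import numpy as np

# Sketch of Eq. (11): candidate mutation points are scored by their
# gradient-weighted distance to the sub-target semantics; the maximum M is
# subtracted before exponentiation to prevent overflow, and subtrees with
# more remaining error get higher selection probability.
rng = np.random.default_rng(0)

r = np.array([3.2, 0.5, 1.8])     # r_i: weighted distances of 3 candidates
M = r.max()                       # overflow guard from Eq. (11)
p = np.exp(r - M) / np.exp(r - M).sum()

def tournament(p, size=2):
    """Sample `size` candidates by probability, keep the most promising."""
    picks = rng.choice(len(p), size=size, replace=False, p=p)
    return max(picks, key=lambda i: p[i])

print(p, tournament(p))
```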

3.5 WEIGHT OPTIMIZATION

The optimization performed by geometric semantic mutation lies solely in the adjustment of local tree structures, and, notably, our geometric semantic method provides a weight coefficient for each subtree (Eq. 6). Unlike traditional methods that use a continuous optimization algorithm to optimize constants (Kommenda et al., 2020; De Melo et al., 2015), these weight coefficients allow us not only to adjust each subtree module at any time but also, more importantly, to automatically adjust the overall structure by optimizing constants and removing unsuitable subtrees (a subtree's weight coefficient is set to 0.0 when it is optimized to nearly zero, below 1e-6 in our method). We therefore employ the Levenberg-Marquardt (LM) algorithm (Moré, 2006) to optimize the constants of expressions at regular intervals, using the Euclidean distance as the loss function. Additionally, because expressions generated from geometric semantics often contain many constants, we add L2 regularization to smooth the function curve, as shown below:

$$\mathrm{Loss} = L\bigl(y, f(x;w)\bigr) + \frac{\beta}{2}\sum_j |w_j|^2 \tag{12}$$

Following the LM algorithm, we can optimize the constants $w$ through the formula:

$$w_{n+1} = w_n - \bigl(J_r^{T} J_r + \mu I\bigr)^{-1}\bigl(J_r^{T} r + \beta w\bigr) \tag{13}$$

where $J_r$ is the Jacobian matrix of the residuals $r$ with respect to the constants, $I$ is the identity matrix, $\mu$ is the penalty factor, $\beta$ is a hyperparameter, and $w$ denotes the constants within the expression.
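A minimal sketch of one damped update step (Eq. 13) with the L2 term of Eq. 12, applied to a hypothetical linear toy model rather than an actual expression tree:

```python
import numpy as np

# Sketch of one damped LM step (Eq. 13) with the L2 term of Eq. 12.
# `residual_fn` and `jac_fn` are assumptions standing in for evaluating
# the expression tree and differentiating it w.r.t. its constants.

def lm_step(w, residual_fn, jac_fn, mu=1e-2, beta=1e-3):
    r = residual_fn(w)                       # f(x; w) - y
    J = jac_fn(w)                            # d r / d w
    H = J.T @ J + mu * np.eye(len(w))        # damped Gauss-Newton matrix
    g = J.T @ r + beta * w                   # gradient incl. L2 term
    return w - np.linalg.solve(H, g)

# Toy problem: fit f(x; w) = w0 * x + w1 to noisy line data.
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.5 + 0.01 * np.random.default_rng(1).standard_normal(20)

residual_fn = lambda w: w[0] * x + w[1] - y
jac_fn = lambda w: np.stack([x, np.ones_like(x)], axis=1)

w = np.zeros(2)
for _ in range(50):
    w = lm_step(w, residual_fn, jac_fn)
print(w)                                     # approximately [2.0, 0.5]
```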

4 EXPERIMENTS


4.1 BASIC BENCHMARKS


We assess GESR on two mainstream large-scale benchmarks: the SRBench benchmark (La Cava et al., 2021), which includes both real-world and ground-truth problems, and the SRSD benchmark (Matsubara et al., 2022) for scientific discovery. A total of 25 baseline methods are used for performance comparison. These baselines use multiple parameter sets and find the most suitable settings through a half-grid search. Considering the significant impact of parameters on algorithm performance, and to demonstrate the superiority of our method while reducing the influence of parameter selection, the hyperparameters of our method remain the same on all datasets. The detailed hyperparameter settings and brief introductions to the baseline methods are provided in Appendix A and Appendix C, respectively. The PMLB collection includes 46 real-world black-box datasets (covering physics, home prices, etc.), 76 synthetic datasets including the Friedman datasets, and 130 Strogatz and Feynman datasets. Among them, the real-world and Friedman datasets (forming the black-box datasets) are used for overall evaluation, the Friedman datasets are used for ablation analysis, and the Strogatz and Feynman datasets with different levels of Gaussian noise are used for robustness verification. In addition, the SRSD benchmark, with 120 scientific-discovery datasets that contain dummy variables, is also used to test the effectiveness of GESR; the properties of its formulas and variables are carefully reviewed to ensure realistic sampling value ranges. Our method is assessed with a popular metric: the $R^2$ score (La Cava et al., 2021).




Table 1: SRSD + dummy variables: accuracy solution rate ($R^2>0.999$) on scientific-discovery datasets of different complexity.

Group    gplearn  AFP     AIF     DSR     E2E     uDSR    PySR    GESR
Easy     0.00%    20.0%   6.67%   76.7%   16.7%   53.3%   20.0%   100.0%
Medium   0.00%    5.00%   0.00%   45.0%   12.5%   37.5%   10.0%   87.5%
Hard     0.00%    4.00%   0.00%   22.0%   10.0%   12.0%   2.0%    58.0%



4.2 EFFECTIVENESS OF GESR


We evaluated our method on the SRSD benchmark, which contains 120 scientific-discovery datasets with dummy variables, to assess its performance on scientific-discovery problems. As shown in Table 1, our method achieved accuracy solution rates ($R^2>0.999$) of 100.0%, 87.5%, and 58.0% on the easy, medium, and hard datasets, improvements of 23.3%, 42.5%, and 36.0% over the second-ranked baseline method, respectively. These results demonstrate the strong search ability of our geometric semantic method in approximating target semantics, especially on complex scientific-discovery problems. Further experimental results on the SRSD benchmark are presented in Appendix D.3, and a qualitative analysis of GESR is discussed in Appendix D.4.





Figure 4: SRBench comparisons of $R^2$ test performance (left) and model size (right) on 120 black-box datasets.



Figure 4 presents the performance of GESR on the SRBench benchmark. In terms of overall accuracy, our method outperforms all other algorithms. Our method also exhibits advantages in model size, with substantial improvements over previous semantics-based approaches (SBP-GP).

In addition to the large-scale evaluations on the scientific-discovery problems of the SRSD benchmark and the real-world datasets of the SRBench benchmark, we have also tested our method on several common datasets (Nguyen, Livermore). Further experimental details are provided in Appendix C.

4.3 FURTHER ANALYSIS


In our method, symbolic expressions are generated by directly fitting the data at the semantic level, which may give the impression that GESR is weak in noise robustness. To further explore GESR's adaptability to noise and its capability on scientific-discovery datasets, we conducted additional experiments on the synthetic Strogatz datasets with different levels of Gaussian noise.






Figure 5: Accuracy solution rate ($R^2>0.999$) on the Feynman datasets (left) and Strogatz datasets (right) of the SRBench benchmark with different noise levels.



As depicted in Figure 5, our method achieves state-of-the-art performance. Compared to the black-box datasets, whose unknown underlying generating functions contain substantial irregular noise and dummy variables, the scientific-discovery datasets reveal more pronounced distinctions among the baseline methods. It can be observed that Gaussian noise negatively impacts the accuracy solution rate of most methods. The accuracy solution rate of GESR remains more stable than that of the other methods across different levels of Gaussian noise, demonstrating a certain degree of noise resistance. The specific performance of our method on each dataset is provided in Appendix D.6, where it can be observed that our method outperforms the others on most datasets in terms of median $R^2$, mean $R^2$, and solution rate.

4.4 ABLATION ANALYSIS


GESR integrates several components. To validate the necessity of each, we conduct ablation experiments on the Friedman datasets, individually examining the effects of the semantic gradient strategy, the semantic mutation point selection strategy, the geometric semantic approximation strategy, and constant optimization, with the full GESR method as the baseline. Table 2 shows the average $R^2$ score, average $R^2$ rank, and average model size on the Friedman datasets with each component removed or replaced.


Table 2: Experimental results on the Friedman datasets with each component removed or replaced. GESR-opt, GESR-slt, and GESR-gradient correspond to removing the constant optimization module, the semantic mutation point selection module, and the semantic gradient module, respectively. GESR-mutation replaces the geometric semantic approximation module with the linear-scaling strategy of (Virgolin et al., 2019). GESR-baseline is the complete GESR method.

             GESR-opt  GESR-slt  GESR-gradient  GESR-mutation  GESR-baseline
Avg rank     2.62      1.39      2.53           2.32           1.11
Avg $R^2$    0.929     0.939     0.901          0.929          0.947
Avg size     185       161       123            120            134


Generally, the experimental results show that removing or replacing any component negatively impacts our method, in both average $R^2$ and average rank, which highlights the necessity of the proposed components. To further test whether the removal or replacement of each component yields statistically significant differences in performance, a Wilcoxon signed-rank test (Wilcoxon, 1992) is performed on the median experimental results. Compared with the GESR-baseline, GESR-opt, GESR-slt, GESR-gradient, and GESR-mutation yield p-values below 1e-6, 1e-2, 1e-7, and 1e-6, respectively, indicating significant differences ($p<0.01$) with and without the proposed weight optimization module, semantic point tournament selection module, semantic gradient module, and geometric semantic method, respectively.

Due to the nonlinear mapping of the semantic space, the dimensions of the sub-semantic space are not equally important. The absence of semantic gradients impairs the accuracy and effectiveness of the semantic variation module; hence, compared to the GESR-baseline, GESR-gradient (without the semantic gradient module) exhibits a significant drop in average $R^2$. Another significant drop in accuracy is observed for GESR-mutation, which demonstrates the superiority of the geometric semantic approximation module. In terms of average ranking, the GESR-baseline also shows better stability of performance across datasets. The details and roles of each module are further discussed in Appendix B.

5 CONCLUSION


In this work, a novel geometric evolution model is proposed for SR to improve spatial approximation accuracy while preserving interpretability. In GESR, a geometric semantic mutation method is introduced as a geometric approximation strategy to search for expressions with optimal accuracy in the semantic space, using a semantic gradient vector as an auxiliary tool for addressing semantic-space mapping issues. A semantic-based mutation point selection method is further introduced to efficiently identify sub-expressions that need improvement. Taking into account the inherent advantage of linear combinations within the semantic space, we further incorporate the Levenberg-Marquardt algorithm as a key module for globally adjusting and optimizing expression constants. The state-of-the-art accuracy achieved on both the SRSD and SRBench benchmark datasets demonstrates the effectiveness of our approach in real-world and scientific-discovery tasks.



REFERENCES

Dimitrios Angelis, Filippos Sofos, and Theodoros E Karakasidis. Artificial intelligence in physical sciences: Symbolic regression trends and perspectives. Archives of Computational Methods in Engineering, 30(6):3845-3865, 2023.
迪米特里奥斯·安杰利斯、菲利波斯·索福斯与西奥多罗斯·E·卡拉卡西迪斯。《物理科学中的人工智能:符号回归趋势与展望》。计算工程方法档案,30(6):3845-3865,2023 年。

Ignacio Arnaldo, Krzysztof Krawiec, and Una-May O'Reilly. Multiple regression genetic programming. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO '14, pp. 879-886, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450326629. doi: 10.1145/2576768.2598291.
伊格纳西奥·阿纳尔多、克日什托夫·克拉维茨与尤娜-梅·奥莱利。《多元回归遗传编程》。载于《2014 年遗传与进化计算年会论文集》,GECCO '14,第 879-886 页,美国纽约州纽约市,2014 年。计算机协会出版。ISBN 9781450326629。doi: 10.1145/2576768.2598291。

Bogdan Burlacu, Gabriel Kronberger, and Michael Kommenda. Operon c++: an efficient genetic programming framework for symbolic regression. In Proceedings of the 2020 Genetic and Evolutionary Computation Conference Companion, GECCO '20, pp. 1562-1570, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450371278. doi: 10.1145/3377929. 3398099.
博格丹·布尔拉库、加布里埃尔·克龙伯格与迈克尔·科门达。《Operon C++:高效的符号回归遗传编程框架》。载于《2020 年遗传与进化计算会议配套论文集》,GECCO '20,第 1562-1570 页,美国纽约州纽约市,2020 年。计算机协会出版。ISBN 9781450371278。doi: 10.1145/3377929.3398099。

William La Cava, Tilak Raj Singh, James Taggart, Srinivas Suri, and Jason Moore. Learning concise representations for regression by evolving networks of trees. In International Conference on Learning Representations, 2019.
威廉·拉卡瓦、蒂拉克·拉吉·辛格、詹姆斯·塔格特、斯里尼瓦斯·苏里与杰森·摩尔。《通过演化树网络学习简洁回归表示》。载于国际学习表征会议,2019 年。

Qi Chen, Bing Xue, and Mengjie Zhang. Improving generalization of genetic programming for symbolic regression with angle-driven geometric semantic operators. IEEE Transactions on Evolutionary Computation, 23(3):488-502, 2019. doi: 10.1109/TEVC.2018.2869621.
齐晨、薛冰和张梦杰。利用角度驱动的几何语义算子提升遗传规划在符号回归中的泛化能力。《IEEE 进化计算汇刊》,23(3):488-502,2019 年。doi: 10.1109/TEVC.2018.2869621。

Miles Cranmer. Interpretable machine learning for science with pysr and symbolicregression. jl. arXiv preprint arXiv:2305.01582, 2023.
迈尔斯·克兰默。基于 PySR 和 SymbolicRegression.jl 的可解释机器学习科学研究工具。arXiv 预印本 arXiv:2305.01582,2023 年。

F. O. de Franca and G. S. I. Aldeia. Interaction-Transformation Evolutionary Algorithm for Symbolic Regression. Evolutionary Computation, 29(3):367-390, 09 2021a. ISSN 1063-6560. doi: 10.1162/ evco_a_00285.
F·O·德弗兰卡与 G·S·I·阿尔代亚。符号回归的交互-变换进化算法。《进化计算》,29(3):367-390,2021 年 9 月 a。ISSN 1063-6560。doi: 10.1162/evco_a_00285。

F. O. de Franca and G. S. I. Aldeia. Interaction-Transformation Evolutionary Algorithm for Symbolic Regression. Evolutionary Computation, 29(3):367-390, 09 2021b. ISSN 1063-6560. doi: 10.1162/evco_a_00285.
F·O·德弗兰卡与 G·S·I·阿尔代亚。符号回归的交互-变换进化算法。《进化计算》,29(3):367-390,2021 年 9 月 b。ISSN 1063-6560。doi: 10.1162/evco_a_00285。

Vinicius Veloso De Melo, Benjamin Fowler, and Wolfgang Banzhaf. Evaluating methods for constant optimization of symbolic regression benchmark problems. In 2015 Brazilian conference on intelligent systems (BRACIS), pp. 25-30. IEEE, 2015.
Vinicius Veloso De Melo、Benjamin Fowler 与 Wolfgang Banzhaf。符号回归基准问题中常数优化方法的评估。载于 2015 年巴西智能系统会议(BRACIS),第 25-30 页。IEEE,2015 年。

William T. Hale, Efi Safikou, and George M. Bollas. Inference of faults through symbolic regression of system data. Computers & Chemical Engineering, 157:107619, 2022. ISSN 0098-1354. doi: https://doi.org/10.1016/j.compchemeng.2021.107619.
William T. Hale、Efi Safikou 与 George M. Bollas。通过系统数据的符号回归进行故障推断。《计算机与化学工程》,157 卷:107619,2022 年。ISSN 0098-1354。doi: https://doi.org/10.1016/j.compchemeng.2021.107619。

Zhixing Huang, Yi Mei, and Jinghui Zhong. Semantic linear genetic programming for symbolic regression. IEEE Transactions on Cybernetics, 54(2):1321-1334, 2022.
黄志兴、梅一与钟靖辉。面向符号回归的语义线性遗传编程。《IEEE 控制论汇刊》,54 卷 2 期:1321-1334 页,2022 年。

Zhixing Huang, Yi Mei, and Jinghui Zhong. Semantic linear genetic programming for symbolic regression. IEEE Transactions on Cybernetics, 54(2):1321-1334, 2024. doi: 10.1109/TCYB.2022. 3181461.
黄志兴、梅一与钟靖辉。面向符号回归的语义线性遗传编程。《IEEE 控制论汇刊》,54 卷 2 期:1321-1334 页,2024 年。doi: 10.1109/TCYB.2022.3181461。

Ying Jin, Weilin Fu, Jian Kang, Jiadong Guo, and Jian Guo. Bayesian symbolic regression. arXiv preprint arXiv:1910.08892, 2019.
应金、傅伟林、康健、郭佳东与郭健。贝叶斯符号回归。arXiv 预印本 arXiv:1910.08892,2019 年。

Pierre-alexandre Kamienny, Stéphane d'Ascoli, Guillaume Lample, and Francois Charton. End-to-end symbolic regression with transformers. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 10269-10281. Curran Associates, Inc., 2022.
皮埃尔-亚历山大·卡缅尼、斯特凡·达斯科利、纪尧姆·朗普勒与弗朗索瓦·沙尔东。基于 Transformer 的端到端符号回归。载于 S. Koyejo、S. Mohamed、A. Agarwal、D. Belgrave、K. Cho 与 A. Oh(编),《神经信息处理系统进展》,第 35 卷,第 10269-10281 页。Curran Associates 公司,2022 年。

Maarten Keijzer. Improving symbolic regression with interval arithmetic and linear scaling. In European Conference on Genetic Programming, pp. 70-82. Springer, 2003.
马尔滕·凯泽。利用区间算术与线性缩放改进符号回归。载于《欧洲遗传编程会议》,第 70-82 页。Springer 出版社,2003 年。

Michael Kommenda, Bogdan Burlacu, Gabriel Kronberger, and Michael Affenzeller. Parameter identification for symbolic regression using nonlinear least squares. Genetic Programming and Evolvable Machines, 21(3):471-501, 2020.
迈克尔·科门达、博格丹·布尔拉库、加布里埃尔·克龙贝格尔与迈克尔·阿芬策勒。基于非线性最小二乘的符号回归参数识别。《遗传编程与可进化机器》,21(3):471-501,2020 年。


Krzysztof Krawiec and Tomasz Pawlak. Approximating geometric crossover by semantic backpropa-gation. In Proceedings of the 15th annual conference on Genetic and evolutionary computation, pp. 941-948, 2013.
Krzysztof Krawiec 与 Tomasz Pawlak。通过语义反向传播近似几何交叉。载于《第 15 届遗传与进化计算年会论文集》,第 941-948 页,2013 年。

William La Cava, Tilak Raj Singh, James Taggart, Srinivas Suri, and Jason H Moore. Learning concise representations for regression by evolving networks of trees. arXiv preprint arXiv:1807.00981, 2018.
William La Cava、Tilak Raj Singh、James Taggart、Srinivas Suri 及 Jason H Moore。通过演化树网络学习回归的简洁表示。arXiv 预印本 arXiv:1807.00981,2018 年。

William La Cava, Thomas Helmuth, Lee Spector, and Jason H. Moore. A Probabilistic and Multi-Objective Analysis of Lexicase Selection and ϵ -Lexicase Selection. Evolutionary Computation,27 (3):377-402, 09 2019. ISSN 1063-6560. doi: 10.1162/evco_a_00224.
William La Cava、Thomas Helmuth、Lee Spector 与 Jason H. Moore。词典选择与 ϵ -词典选择的概率及多目标分析。《进化计算》,27(3):377-402,2019 年 9 月。ISSN 1063-6560。doi: 10.1162/evco_a_00224。

William La Cava, Patryk Orzechowski, Bogdan Burlacu, Fabricio de Franca, Marco Virgolin, Ying Jin, Michael Kommenda, and Jason Moore. Contemporary symbolic regression methods and their relative performance. In J. Vanschoren and S. Yeung (eds.), Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
William La Cava、Patryk Orzechowski、Bogdan Burlacu、Fabricio de Franca、Marco Virgolin、Ying Jin、Michael Kommenda 及 Jason Moore。当代符号回归方法及其相对性能。载于 J. Vanschoren 与 S. Yeung 合编的《神经信息处理系统数据集与基准跟踪会议论文集》第 1 卷,2021 年。

Mikel Landajuela, Chak Shing Lee, Jiachen Yang, Ruben Glatt, Claudio P Santiago, Ignacio Aravena, Terrell Mundhenk, Garrett Mulcahy, and Brenden K Petersen. A unified framework for deep symbolic regression. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 33985-33998. Curran Associates, Inc., 2022.
米克尔·兰达胡埃拉、李泽星、杨嘉辰、鲁本·格拉特、克劳迪奥·P·圣地亚哥、伊格纳西奥·阿拉韦纳、特雷尔·蒙登克、加勒特·马尔卡希与布伦登·K·彼得森。《深度符号回归的统一框架》。载于 S. Koyejo、S. Mohamed、A. Agarwal、D. Belgrave、K. Cho 与 A. Oh(编),《神经信息处理系统进展》,第 35 卷,第 33985-33998 页。Curran Associates 公司,2022 年。

Nour Makke and Sanjay Chawla. Interpretable scientific discovery with symbolic regression: a review. Artificial Intelligence Review, 57(1):2, 2024.
努尔·马基与桑杰·查瓦拉。《符号回归的可解释科学发现:一项综述》。《人工智能评论》,57(1):2,2024 年。

Yoshitomo Matsubara, Naoya Chiba, Ryo Igarashi, and Yoshitaka Ushiku. Rethinking symbolic regression datasets and benchmarks for scientific discovery. arXiv preprint arXiv:2206.10540, 2022.
松原义智、千叶直也、五十岚亮与牛久祥孝。《重新思考科学发现中的符号回归数据集与基准》。《arXiv 预印本》,arXiv:2206.10540,2022 年。

Trent McConaghy. FFX: Fast, Scalable, Deterministic Symbolic Regression Technology, pp. 235- 260. Springer New York, New York, NY, 2011. ISBN 978-1-4614-1770-5. doi: 10.1007/ 978-1-4614-1770-5_13.
特伦特·麦康纳吉。《FFX:快速、可扩展、确定性的符号回归技术》,第 235-260 页。Springer 纽约,纽约州,2011 年。ISBN 978-1-4614-1770-5。doi:10.1007/978-1-4614-1770-5_13。

Yi Mei, Qi Chen, Andrew Lensen, Bing Xue, and Mengjie Zhang. Explainable artificial intelligence by genetic programming: A survey. IEEE Transactions on Evolutionary Computation, 27(3): 621-641, 2022.
易梅、齐晨、Andrew Lensen、薛冰与张梦杰。基于遗传编程的可解释人工智能综述。《IEEE 进化计算汇刊》,27(3): 621-641,2022 年。

Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28-July 1, 1977, pp. 105-116. Springer, 2006.
Jorge J Moré。Levenberg-Marquardt 算法:实现与理论。收录于《数值分析:1977 年 6 月 28 日-7 月 1 日邓迪会议论文集》,第 105-116 页。Springer 出版社,2006 年。

Terrell Mundhenk, Mikel Landajuela, Ruben Glatt, Claudio P Santiago, Daniel faissol, and Brenden K Petersen. Symbolic regression via deep reinforcement learning enhanced genetic programming seeding. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 24912-24923. Curran Associates, Inc., 2021.
Terrell Mundhenk、Mikel Landajuela、Ruben Glatt、Claudio P Santiago、Daniel Faissol 与 Brenden K Petersen。基于深度强化学习增强遗传编程播种的符号回归。载于 M. Ranzato、A. Beygelzimer、Y. Dauphin、P.S. Liang 与 J. Wortman Vaughan(编),《神经信息处理系统进展》,第 34 卷,第 24912-24923 页。Curran Associates 公司,2021 年。

Quang Uy Nguyen and Thi Huong Chu. Semantic approximation for reducing code bloat in genetic programming. Swarm and Evolutionary Computation, 58:100729, 2020.
Quang Uy Nguyen 与 Thi Huong Chu。语义逼近减少遗传编程中的代码膨胀。《群体与进化计算》,58 卷,100729 号,2020 年。

Tomasz P Pawlak, Bartosz Wieloch, and Krzysztof Krawiec. Semantic backpropagation for designing search operators in genetic programming. IEEE Transactions on Evolutionary Computation, 19(3): 326-340, 2014.
托马斯·P·帕夫拉克、巴托斯·维尔洛赫与克日什托夫·克拉维茨。遗传编程中基于语义反向传播的搜索算子设计。《IEEE 进化计算汇刊》,19(3):326-340,2014 年。

Brenden K Petersen, Mikel Landajuela, T Nathan Mundhenk, Claudio P Santiago, Soo K Kim, and Joanne T Kim. Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients. arXiv preprint arXiv:1912.04871, 2019.
布伦登·K·彼得森、米克尔·兰达朱埃拉、T·内森·蒙登克、克劳迪奥·P·圣地亚哥、苏·K·金与乔安妮·T·金。深度符号回归:通过风险寻求策略梯度从数据中恢复数学表达式。arXiv 预印本 arXiv:1912.04871,2019 年。

Gloria Pietropolli, Luca Manzoni, Alessia Paoletti, and Mauro Castelli. Combining geometric semantic gp with gradient-descent optimization. In European Conference on Genetic Programming (Part of EvoStar), pp. 19-33. Springer, 2022.
格洛丽亚·皮耶特罗波利、卢卡·曼佐尼、阿莱西亚·保莱蒂与毛罗·卡斯泰利。将几何语义遗传编程与梯度下降优化相结合。《欧洲遗传编程会议》(EvoStar 分册),第 19-33 页。斯普林格出版社,2022 年。


Michael Schmidt and Hod Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81-85, 2009. doi: 10.1126/science.1165893.
迈克尔·施密特与霍德·利普森。从实验数据中提炼自由形式的自然法则。《科学》,324(5923):81-85,2009 年。doi:10.1126/science.1165893。

Michael D. Schmidt and Hod Lipson. Age-fitness pareto optimization. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, GECCO '10, pp. 543-544, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300728. doi: 10.1145/1830483.1830584.
迈克尔·D·施密特与霍德·利普森。年龄-适应度帕累托优化。载于《第十二届遗传与进化计算年会论文集》,GECCO '10,第 543-544 页,美国纽约州纽约市,2010 年。计算机协会出版。ISBN 9781450300728。doi: 10.1145/1830483.1830584。

SP Sharan, Wenqing Zheng, Kuo-Feng Hsu, Jiarong Xing, Ang Chen, and Zhangyang Wang. Symbolic distillation for learned tcp congestion control. Advances in Neural Information Processing Systems, 35:10684-10695, 2022.
沙然·SP、郑文清、许国峰、邢佳荣、陈昂与王张阳。符号蒸馏用于学习型 TCP 拥塞控制。《神经信息处理系统进展》,35 卷:10684-10695 页,2022 年。

Fangzheng Sun, Yang Liu, Jian-Xun Wang, and Hao Sun. Symbolic physics learner: Discovering governing equations via monte carlo tree search. arXiv preprint arXiv:2205.13134, 2022.
孙方正、刘洋、王建勋与孙浩。符号物理学习器:通过蒙特卡洛树搜索发现支配方程。arXiv 预印本,arXiv:2205.13134,2022 年。

Silviu-Marian Udrescu and Max Tegmark. Ai feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16):eaay2631, 2020. doi: 10.1126/sciadv.aay2631.
西尔维乌-马里安·乌德雷斯科与马克斯·泰格马克。AI 费曼:一种受物理学启发的符号回归方法。《科学进展》,6 卷 16 期:eaay2631,2020 年。doi: 10.1126/sciadv.aay2631。

Silviu-Marian Udrescu, Andrew Tan, Jiahai Feng, Orisvaldo Neto, Tailin Wu, and Max Tegmark. Ai feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 4860-4871. Curran Associates, Inc., 2020.
西尔维乌-马里安·乌德雷斯科、安德鲁·谭、冯家海、奥里斯瓦尔多·内托、吴泰霖与马克斯·泰格马克。《AI 费曼 2.0:利用图模块性的帕累托最优符号回归》。载于 H.拉罗谢尔、M.兰扎托、R.哈德塞尔、M.F.巴尔坎及 H.林(编),《神经信息处理系统进展》,第 33 卷,第 4860-4871 页。柯伦联合公司,2020 年。

Mojtaba Valipour, Bowen You, Maysum Panju, and Ali Ghodsi. Symbolicgpt: A generative transformer model for symbolic regression. arXiv preprint arXiv:2106.14131, 2021.
莫杰塔巴·瓦利普尔、尤博文、潘朱梅森与阿里·戈德西。《SymbolicGPT:一种用于符号回归的生成式 Transformer 模型》。arXiv 预印本,arXiv:2106.14131,2021 年。

Leonardo Vanneschi. An introduction to geometric semantic genetic programming. In NEO 2015: Results of the Numerical and Evolutionary Optimization Workshop NEO 2015 held at September 23-25 2015 in Tijuana, Mexico, pp. 3-42. Springer, 2016.
莱昂纳多·万内斯基。《几何语义遗传编程导论》。载于《NEO 2015:2015 年 9 月 23-25 日墨西哥蒂华纳数值与进化优化研讨会 NEO 2015 成果》,第 3-42 页。斯普林格出版社,2016 年。

M. Virgolin, T. Alderliesten, C. Witteveen, and P. A. N. Bosman. Improving Model-Based Genetic Programming for Symbolic Regression of Small Expressions. Evolutionary Computation, 29(2): 211-237, 06 2021. ISSN 1063-6560. doi: 10.1162/evco_a_00278.
M.维尔戈林、T.阿尔德利斯顿、C.维特文与 P.A.N.博斯曼。《改进基于模型的遗传编程用于小规模表达式符号回归》。《进化计算》,29(2): 211-237,2021 年 6 月。ISSN 1063-6560。doi: 10.1162/evco_a_00278。

Marco Virgolin, Tanja Alderliesten, and Peter A. N. Bosman. Linear scaling with and within semantic backpropagation-based genetic programming for symbolic regression. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pp. 1084-1092, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450361118. doi: 10.1145/3321707.3321758.

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47-60, 2023.

Frank Wilcoxon. Individual comparisons by ranking methods. In Breakthroughs in statistics: Methodology and distribution, pp. 196-202. Springer, 1992.

Hengzhe Zhang, Aimin Zhou, Qi Chen, Bing Xue, and Mengjie Zhang. Sr-forest: A genetic programming based heterogeneous ensemble learning method. IEEE Transactions on Evolutionary Computation, pp. 1-1, 2023. doi: 10.1109/TEVC.2023.3243172.


Appendix


A BRIEF DESCRIPTIONS OF THE BASELINE METHODS


In this section, for convenience of reference and comparison, we provide short descriptions of the baseline methods, including the 14 original SRBench baseline methods and the 5 SRSD baseline methods used in Section 4. We refer to the original papers for additional details.

  • Unified Framework for Deep Symbolic Regression (uDSR): uDSR Landajuela et al. (2022) is a modular framework that integrates multiple SR solution strategies so that they complement one another and the advantages of each strategy are maximized.

  • Deep symbolic regression (DSR): DSR Petersen et al. (2019) leverages deep reinforcement learning to search for symbolic regression solutions, employing a neural network to represent the token distribution and a reinforcement learning strategy to train the network.

  • AIFeynman (AIF): AIFeynman Udrescu et al. (2020) is a symbolic regression method that leverages neural networks, graph modularity, hypothesis testing, and normalizing flows to improve accuracy on harder problems and robustness to noise.

  • SR with non-linear least squares (Operon): Operon Burlacu et al. (2020) is a GP framework that aims at an efficient implementation and provides a completely out-of-the-box solution for symbolic regression.

  • Semantic backpropagation genetic programming (SBP-GP): SBP-GP Virgolin et al. (2019) is a semantic genetic programming method. SBP-GP utilizes linear scaling to rescale candidate subtrees so that the output of the variation subtree approximates the sub-target semantics.

  • End-to-end Symbolic Regression with Transformers (E2E): E2E Kamienny et al. (2022) is an end-to-end symbolic regression method based on a Transformer that generates the full mathematical expression, which gives it an advantage in inference time.

  • PySR: PySR Cranmer (2023) is an open-source library whose core search algorithm is a multi-population evolution strategy with an evolve-simplify-optimize loop.

  • Multiple regression genetic programming (MRGP): MRGP Arnaldo et al. (2014) is a genetic programming method that decouples and linearly combines the subexpressions of a program by multiple regression on the target variables.

  • ϵ-lexicase selection (EPLEX): EPLEX La Cava et al. (2019) is a genetic programming method with an ϵ-lexicase parent selection strategy, in which training cases are used individually as evaluation values instead of being aggregated over all cases.

  • Feature engineering automation tool (FEAT): FEAT Cava et al. (2019) is a method for learning and optimizing interpretable representations, in which neural networks used as features are represented as syntax trees, and an archive of representations characterizing the accuracy-complexity trade-off is maintained to assist generalization and interpretation.

  • Fast function extraction (FFX): FFX McConaghy (2011) is a non-evolutionary technique leveraging pathwise regularized learning to generate a set of solutions that trade off error versus complexity, improving speed, scalability, and deterministic behavior.

  • GP version of the gene-pool optimal mixing evolutionary algorithm (GP-GOMEA): GP-GOMEA Virgolin et al. (2021) is a genetic programming method with an improved linkage-learning approach that imposes a strict size limitation, focusing on accuracy and small solutions.

  • Age-fitness Pareto optimization (AFP): AFP Schmidt & Lipson (2010) is a multi-objective genetic programming method that evolves the population on an age-fitness Pareto front to avoid premature convergence.

  • AFP with co-evolved fitness estimates (AFP-FE): AFP-FE Schmidt & Lipson (2009) is an improved AFP method with a new co-evolved fitness estimation method.


  • Bayesian symbolic regression (BSR): BSR Jin et al. (2019) is a Bayesian framework that employs a Markov chain Monte Carlo method to incorporate prior knowledge and sample symbolic trees from the posterior distribution, showing advantages in interpretability, utilization of prior knowledge, and effective memory usage.

  • ITEA: ITEA de Franca & Aldeia (2021a) is a mutation-only evolutionary algorithm with a new individual representation, in which each mutation is performed by randomly selecting among six heuristic mutation operators.

B GENERAL FRAMEWORK OF GESR


In Algorithm 1, we provide detailed pseudocode describing the framework of GESR. The population and semantic library are initialized in Lines 1-2, and the detailed parameter settings are provided in Appendix C.1. GESR employs two mutation strategies: geometric semantic approximation as the primary variation method (Line 9), and a random replacement strategy for population exploration (Line 10), both performed probabilistically in each generation. The Levenberg-Marquardt algorithm is executed periodically to adjust the global weight parameters of the expressions (Line 5). Subsequently, the mutated expressions are executed, and candidate nodes are selected for the next mutation (Line 14). Additionally, at regular intervals, several subtrees from the top-t expression trees in the population are randomly selected to supplement the semantic library, transferring useful features (Lines 16-19).


Algorithm 1: Geometric evolution symbolic regression.

Input : Symbolic regression problem P, consisting of tabular data (X, y)
Output: Best-fitting expression
Initialize population P;
Initialize semantic library Ds;
for each iteration i do
    if i is equal to the interval of constant optimization then
        C <- optimize population P with the Levenberg-Marquardt algorithm;
    else
        s ~ U(0, 1);
        if s < 0.95 then
            C <- approximate the target semantics of P with the geometric semantic mutation method;
        else
            C <- population variation with the random mutation method;
        end
    end
    Execute the symbolic trees in C;
    Select from C with the tournament strategy to form the new population P;
    if i is equal to the interval of the semantic library update then
        T <- select top-n symbolic trees from P;
        update Ds with subtrees from T;
    end
end




In the geometric semantic mutation strategy, for each expression tree in the population, we first select a subtree to be mutated from several candidate nodes based on Eq. 11. Subsequently, a specified number of candidates are randomly selected from the semantic library, and linear combination and filtering through Eqs. 4-10 are performed to replace the mutated subtree. It is worth noting that the subtree obtained by our geometric semantic method typically takes the form k1·p1 + k2·p2; when k1 or k2 is 0.0, it degenerates to k·p or 0. However, such a subtree lacks a combined representation of constants and features, i.e., k1·p1 + c. Therefore, for each expression tree in the population, we insert an additional constant node "1" into the candidates so that the combined subtree can generate such a representation.
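As a concrete illustration of this linear-combination step, the toy numpy sketch below fits k1 and k2 (plus an offset) by ordinary least squares; it is a simplified stand-in for Eqs. 4-10 (candidate selection, semantic gradients, and filtering are omitted), and the appended all-ones column plays the role of the inserted constant node "1":

```python
import numpy as np

# Toy sketch: combine two candidate subtree semantics p1, p2 to approximate the
# sub-target semantics st as k1*p1 + k2*p2 + c (the all-ones column acts as the
# constant node "1" discussed above).
rng = np.random.default_rng(0)
n = 100                                    # semantic dimension = #training samples
st = rng.normal(size=n)                    # sub-target semantics
p1, p2 = rng.normal(size=n), rng.normal(size=n)

P = np.column_stack([p1, p2, np.ones(n)])  # candidate semantics + constant "1"
k, *_ = np.linalg.lstsq(P, st, rcond=None)
approx = P @ k                             # semantics of the combined subtree
print(k, float(np.mean((st - approx) ** 2)))
```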

In the random replacement strategy, two traditional GP methods are employed: random subtree replacement and random node replacement. Combined with a small tournament size, this strategy provides diversity in the evolution. Furthermore, to limit tree size, the depth of a subtree generated by random subtree replacement is restricted to be less than or equal to that of the original subtree.

C EXPERIMENTAL SETUP


C.1 HYPERPARAMETER SELECTION



Table 3: Hyperparameter settings

| Hyperparameter | Value |
| Population size | 500 |
| Generations | 200 |
| Initial tree depth | 1-3 |
| Depth limitation | 8 |
| Weight optimization interval | 20 |
| Semantic library size | 5000 |
| Geometric semantic mutation rate | 0.5 |
| Tournament size | 2 |
| Library update interval | 10 |
| Semantic candidate size | 200 |
| Top-t for the candidate selection | 1 |
| Top-m for the candidate selection | 200 |
| Functions | <+, -, *, % protected, sin, cos, log, exp> |


The hyperparameters for GESR are listed in Table 3. In this work, our method aims to directly approximate the output of the target expression numerically and then map the operation to the formulas in the population. To mitigate the impact of poor initial expression structures, we employ a parameter setting with a large population and a small depth, allowing broader coverage of initial expression structures and increased fault tolerance. A small tournament size also facilitates group exploration. For the variation rate, we primarily use the geometric semantic search method while maintaining a certain probability of random variation. The significance of random variation lies in facilitating early-stage population exploration, preventing local optima caused by the solidification of population expressions in later stages, and feeding new expression structures into the semantic library. The semantic library is updated by frequent, small-batch sampling of elite individuals' expression structures at each stage of evolution: each update samples 20 subtrees from the top-10 individuals and randomly replaces subtrees within the library. During evolution, our function set consists of the 8 operators <+, -, *, % protected, sin, cos, log, exp>, protected by interval arithmetic Keijzer (2003).

In addition, the diversity of expression tree structures is important. We believe that a large population evolved for fewer generations outperforms a small population evolved for many generations, because population diversity is gradually lost as the number of iterations increases, and localized mutation is then insufficient to alter the overall structure. Therefore, we use a large population size with relatively few generations to promote exploration within the population.

C.2 COMPUTER RESOURCES

The experiments are carried out on a physical machine equipped with an Intel Core(TM) i7-8700 CPU @ 3.20GHz and a single NVIDIA RTX 3090 GPU with 24268 MB of global memory.


C.3 THE NOTATIONS USED IN THE PAPER



Table 4: Notations

S             The semantics of a tree
s_tr (s_j)    The semantics of the tree tr (the j-th tree)
s_c           The semantics of the original subtree before mutation
s             The semantics of the new subtree generated by mutation
s_i           The i-th dimension of the semantics s
st            The sub-target semantics in the sub-semantic space
st_tr (st_j)  The sub-target semantics of the subtree tr (the j-th subtree)
st_i          The i-th dimension of the sub-target semantics st
tr_i          The i-th symbolic expression subtree
|T|_{N,i}     The i-th dimension of the normalized semantic gradient vector
|T|_N^j       The normalized semantic gradient vector of the j-th subtree
α             The scalar factor proposed in Eq. 5
α_i           The scalar factor for the i-th semantics
k             The scalar value proposed in Eq. 7
η             The discount factor promoting concise trees
l             The mutated subtree size
J_r           The Jacobian matrix of the constants
I             The identity matrix
μ             The penalty factor in Eq. 13
β             A hyperparameter in Eq. 12 and Eq. 13
w_j           The j-th constant within an expression
w             General reference to a constant
λ             A tolerance factor in Eq. 8 and Eq. 9
D             A dataset of symbolic regression
f(x; w)       The output of a symbolic expression with input x and constants w
L(·)          A loss function
R             An evaluation value of the generated subtree, calculated by Eq. 10
R_c           An evaluation value of the original subtree
p(tr_i)       The probability of selecting the i-th subtree tr_i as the mutated subtree

D SUPPLEMENTAL BENCHMARK RESULTS


In addition to the SRBench and SRSD benchmark datasets, our method has also been evaluated on several other widely used datasets.

D.1 EXPERIMENTAL RESULTS ON LIVERMORE DATASET


The range of the Livermore datasets for both the training and test sets is uniformly limited to [-10, 10], with 1000 samples in each. The training set is randomly generated within this range, while the test set adopts equal-interval sampling. Note that only datasets within the domain of the function set are selected for evaluation, given that our function set is <+, -, *, %, sin, cos, log, exp>. Each dataset is evaluated 10 times with different seeds.
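For concreteness, the sketch below reproduces this protocol for a single-variable target, using Livermore-6 as an example; the [-10, 10] range and 1000-sample splits follow the description above, while the variable naming is ours:

```python
import numpy as np

# Livermore-6 target: 4*x^4 + 3*x^3 + 2*x^2 + x (single-variable example)
target = lambda x: 4 * x**4 + 3 * x**3 + 2 * x**2 + x

rng = np.random.default_rng(0)
x_train = rng.uniform(-10.0, 10.0, size=1000)   # random training inputs
x_test = np.linspace(-10.0, 10.0, num=1000)     # equal-interval test inputs
y_train, y_test = target(x_train), target(x_test)
print(x_train.shape, x_test.shape)
```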

Table 5 shows the median 1-R² performance of our method on each dataset over 10 runs, as well as the success rate under two precision settings. Acc_τ refers to the accuracy solution rate, where τ indicates the precision: a run counts as successfully finding a solution when 1-R² < 1.0e-(τ+1). The experimental results exhibit high accuracy across most of the datasets. However, GESR shows decreased performance on datasets containing protected operators (Livermore-4, Livermore-12), particularly higher-order ones (Livermore-5, Livermore-12). The decline in fitting ability may be attributed to the interval operation protection strategy. Nevertheless, compared to traditional protection operators, this strategy better prevents overfitting in expressions featuring numerous constants, such as those produced by geometric semantic GP. As a result, we have decided to retain the interval operation strategy.
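The Acc_τ metric can be computed as in the small sketch below (our illustrative helper; the example runs are made-up 1-R² values, not reported results):

```python
import numpy as np

# Acc_tau: fraction of runs with 1 - R^2 < 1.0e-(tau+1).
def acc_tau(one_minus_r2_runs, tau):
    runs = np.asarray(one_minus_r2_runs, dtype=float)
    return float(np.mean(runs < 10.0 ** -(tau + 1)))

runs = [0.0, 2.7e-12, 6.5e-4, 1.2e-14, 0.0]         # example 1 - R^2 values
print(acc_tau(runs, tau=3), acc_tau(runs, tau=12))  # -> 0.8 0.6
```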



Table 5: The median 1-R² value and the accuracy solution rate under different tolerances on the Livermore datasets.

| Dataset | Symbolic expression | 1-R² | Acc_3 | Acc_12 |
| Livermore-1 | 1/3 + x1 + sin(x1^2) | 0 | 100% | 100% |
| Livermore-2 | sin(x1^2)·cos(x1) - 2 | 0 | 100% | 100% |
| Livermore-3 | sin(x1^3)·cos(x1^2) - 1 | 0 | 100% | 100% |
| Livermore-4 | log(x1 + 1) + log(x1^2 + x1) + log(x1) | 6.47e-3 | 20% | 0% |
| Livermore-5 | x1^4 - x1^3 + x1^2 - x2 | 2.70e-12 | 100% | 30% |
| Livermore-6 | 4·x1^4 + 3·x1^3 + 2·x1^2 + x1 | 1.21e-14 | 100% | 100% |
| Livermore-9 | Σ_{i=1}^{9} x1^i | 1.99e-15 | 100% | 100% |
| Livermore-10 | 6·sin(x1)·cos(x2) | 0 | 100% | 100% |
| Livermore-11 | (x1^2·x2^2)/(x1 + x2) | 0 | 100% | 100% |
| Livermore-12 | x1^5/x2^3 | 1.23 | 0% | 0% |
| Livermore-14 | x1^3 + x1^2 + x1 + sin(x1) + sin(x1^2) | 0 | 100% | 100% |
| Livermore-17 | 4·sin(x1)·cos(x2) | 1.87e-8 | 100% | 30% |
| Livermore-18 | sin(x1^2)·cos(x1) - 5 | 0 | 100% | 100% |
| Livermore-19 | x1^5 + x1^4 + x1^2 + x1 | 0 | 100% | 100% |
| Livermore-21 | Σ_{i=1}^{8} x1^i | 0 | 100% | 100% |

D.2 EXPERIMENTAL RESULTS ON NGUYEN DATASET


The range and number of training and test samples for the Nguyen datasets are consistent with the Livermore settings. As shown in Table 6, our method achieves a good fit in terms of accuracy on the Nguyen datasets.


Table 6: The median 1-R² value and the accuracy solution rate under different precisions on the Nguyen datasets.

| Dataset | Symbolic expression | 1-R² | Acc_3 | Acc_12 |
| Nguyen-1 | x1^3 + x1^2 + x1 | 0 | 100% | 100% |
| Nguyen-2 | x1^4 + x1^3 + x1^2 + x1 | 0 | 100% | 100% |
| Nguyen-3 | x1^5 + x1^4 + x1^3 + x1^2 + x1 | 0 | 100% | 100% |
| Nguyen-4 | x1^6 + x1^5 + x1^4 + x1^3 + x1^2 + x1 | 0 | 100% | 80% |
| Nguyen-5 | sin(x1^2)·cos(x1) - 1 | 0 | 100% | 100% |
| Nguyen-6 | sin(x1) + sin(x1 + x1^2) | 0 | 100% | 100% |
| Nguyen-7 | log(x1 + 1) + log(x1^2 + 1) | 3.29e-7 | 100% | 20% |
| Nguyen-9 | sin(x1) + sin(x2^2) | 0 | 100% | 100% |
| Nguyen-10 | 2·sin(x1)·cos(x2) | 0 | 100% | 100% |
| Nguyen-12 | x1^4 - x1^3 - 0.5·x2^2 + x2 | 0 | 100% | 100% |

D.3 FURTHER EXPERIMENTAL RESULTS ON SRSD BENCHMARK


In this section, the accuracy solution rate and the symbolic solution rate of GESR on the SRSD benchmark (with and without dummy variables) are presented.

The results show that GESR achieves significantly better accuracy solution rates on both types of SRSD datasets, while also being competitive in terms of symbolic solution rate. It must be pointed out, however, that compared to the symbolic solution rate, the proposed GESR is better at numerical fitting under a limited model complexity.



Table 7: SRSD: Accuracy solution rate (R² > 0.999) on scientific discovery datasets with different complexity.

| Group | gplearn | AFP | AIF | DSR | E2E | uDSR | PySR | GESR |
| Easy | 6.67% | 20.0% | 33.3% | 63.3% | 26.7% | 100.0% | 66.7% | 100.0% |
| Medium | 7.50% | 2.50% | 5.0% | 45.0% | 17.5% | 75.0% | 45.0% | 92.5% |
| Hard | 2.00% | 4.00% | 6.00% | 28.0% | 14.0% | 20.0% | 38.0% | 64.0% |

Table 8: SRSD: Symbolic solution rate on scientific discovery datasets with different complexity.

| Group | gplearn | AFP | AIF | DSR | E2E | uDSR | PySR | GESR |
| Easy | 6.67% | 20.0% | 30.0% | 46.7% | 0.00% | 50.0% | 60.0% | 53.3% |
| Medium | 0.00% | 2.50% | 2.50% | 10.0% | 0.00% | 17.5% | 30.0% | 40.0% |
| Hard | 0.00% | 0.00% | 2.00% | 2.00% | 0.00% | 4.00% | 4.00% | 6.0% |



Table 9: SRSD + Dummy Variables: Symbolic solution rate on scientific discovery datasets with different complexity.

| Group | gplearn | AFP | AIF | DSR | E2E | uDSR | PySR | GESR |
| Easy | 0.00% | 16.7% | 0.00% | 10.0% | 0.00% | 10.0% | 20.0% | 3.3% |
| Medium | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 7.50% | 5.00% | 10.0% |
| Hard | 0.00% | 0.00% | 0.00% | 2.00% | 0.00% | 0.00% | 0.00% | 2.00% |




Figure 6: Qualitative analysis of the equation mutation process.




D.4 QUALITATIVE ANALYSIS OF THE GESR


The processes of searching for symbolic solutions and accuracy solutions are further investigated. The R²-iteration convergence curves and the final formulas are illustrated in Figure 6. We find that assigning a weight to each combined subtree and using the LM algorithm to adjust the global tree structure works well (many redundant subtree weights are optimized to 0, as in Feynman-I.13.4). Another interesting phenomenon is that, although our method tends to perform numerical fitting when the formula is complicated, 1-R² can still be reduced to nearly 0 even if the symbolic solution is not found.

Table 10: Examples of the evolution formulas.

| Dataset | Target formula | Simplified formula | 1-R² |

Accuracy Solution (1-R² < 1e-15)
| i.30.5 | x0/(x1·sin(x2)) | 1.0·x0·(x0 + x1)/(x1^2·sin(x2)) | 0 |
| ii.34.11 | x0·x1·x2/(2·x3) | 0.5·x0·x1·x2/x3 + … | 0 |
| ii.2.42 | x0·x3·(x1 - x2)/x4 | 1.01·x0·x3·(x1 - x2)·(0.002·x4 - 1)/x4 | 0 |
| iii.17.37 | x0·(x1·cos(x2) + 1) | … | 0 |

Symbolic Solution
| i.27.6 | 1/(x1/x2 + 1/x0) | 1.0·x0·x2/(x0·x1 + x2) | 0 |
| i.14.3 | 9.81·x0·x1 | 9.81·x0·x1 | 0 |
| ii.3.24 | 0.08·x0/x1^2 | 0.08·x0/x1^2 | 0 |
| i.14.4 | 0.5·x0·x1^2 | 0.5·x0·x1^2 | 0 |
| i.26.2 | sin(x0)/sin(x1) | 1.0·sin(x0)/sin(x1) | 0 |

Approximate Solution (1-R² < 1e-3)
| bonus.18 | (x0^2 - x2^2 - x3^2·(x3 - x4/x5 + 1) + x1^2)/(2·x0) | 0.25·x3·x4·(x0 - x5) + 0.5·x1^2/x0 + … | 4.88e-14 |
| i.15.3x | x0 - x1·x2 | 1.0·x0 - x2·(0.998·x1 + …) | 3.14e-8 |
| i.34.10 | x0/(1 - 0.333e-8·x1) | 1.0·x0 + 0.00101·x0/x1 + 0.499·log(x1) | 3e-16 |
| i.39.11 | x1·x2/(x0 - 1) | x2·(1.06·x0·x1·(0.058·x0 - 1.0) + (0.457·x0·x1 - 0.961·x1 - 48.3·x2)·sin(log(x0))·cos(2·x0/x1))/sin(log(x0)) | 4e-16 |

Special Condition
| iii.7.38 | 1.9e+34·x0·x1 | zoo·x0·x1 + 1 | 0 |
| i.48.2 | 8.99e+16·x0 | zoo·x0·(2.28e+21·x0 - 0.036·log(x0))·(zoo·x0 + 1) | 7.14e-11 |
| ii.21.32 | 8.99e+9·x0/x1 | 8.99e+9·x0/(x1·(0.454·cos((x1 + x2)/x2) + zoo)) | 5.06e-14 |
| i.37.4 | x0 + x1 + 2·sqrt(x0·x1)·cos(x2) | 0.224·(0.303·x0·x1 + 4.42·x0 + 4.42·x1 + 0.00598·cos(x2))·(0.00598·x0^2 + 0.169·cos(x0) - 0.00598·cos(x1) + 0.00598·cos(x1 - x2) + 0.848) + … | 1.51e-4 |
| bonus.7 | 3.0e+8·sqrt(-x1/x2^2) | (7.83e+10·x0·x1·x2 + x0·x2·(9.23e+33·x0 - 1.21e+9·x1 + 2.66e+8)·log(x2) - 8.18e+9·x1)/(x2·log(x2)) | 1.67e-4 |

Besides, in Table 10 we also present some of the other generated formulas in their simplified forms on the SRSD datasets, covering accuracy solutions, symbolic solutions, approximate solutions, and special conditions. In most instances, the generated formulas exhibit good interpretability after being simplified with the sympy library. However, when the operators necessary for obtaining the target formula are lacking (such as 'sqrt'), the complexity of the simplified formula may increase as it approximates the target formula as closely as possible. Additionally, it is noteworthy that some generated formulas, after simplification with sympy, yield 'zoo' values despite having an R² value of 1. This situation primarily arises from the protected division operator: a sub-expression f(x) appearing as a denominator with zero value is replaced with 1 + f(x)² during computation.
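A minimal numpy sketch of this protection rule (our illustration, not the released implementation):

```python
import numpy as np

# Protected division: wherever the denominator f(x) evaluates to zero, replace
# it with 1 + f(x)**2 (which equals 1 at those points), so execution never
# produces inf/nan even though the symbolic expression divides by zero.
def protected_div(numer, denom):
    denom = np.asarray(denom, dtype=float)
    safe = np.where(denom == 0.0, 1.0 + denom ** 2, denom)
    return np.asarray(numer, dtype=float) / safe

x = np.array([-1.0, 0.0, 2.0])
print(protected_div(1.0, x))   # -> [-1.   1.   0.5]
```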

D.5 SCALABILITY TO DATA SIZE


Since the semantic dimension is tied to the training set size, a spatial sparsity problem may arise when dealing with a large-scale dataset. This is also why Eq. 4 is used in the geometric search operator to first quickly move the semantic vector toward the sub-target semantics. To further analyze the performance of GESR under different data scales, we re-partition the black-box datasets in SRBench according to data scale and statistically compare the Avg. R² of the top-ranked SRBench methods; the results show the competitiveness of GESR.



Table 11: The Avg. R² performance on different data scales.

| Method | 0~100 | 100~500 | 500~1000 | 1000~5000 | 5000~10000 | 10000~ |
| GESR | 0.665 | 0.872 | 0.891 | 0.618 | 0.857 | 0.632 |
| SBP-GP | 0.665 | 0.857 | 0.876 | 0.608 | 0.858 | 0.630 |
| Operon | 0.580 | 0.870 | 0.890 | 0.623 | 0.861 | 0.629 |
| FEAT | 0.664 | 0.867 | 0.886 | 0.619 | 0.857 | 0.632 |
| DSR | 0.584 | 0.605 | 0.559 | 0.500 | 0.699 | 0.386 |


D.6 DETAILED RESULTS ON STROGATZ DATASETS


In order to better observe the specific performance of our method on scientific discovery datasets, we further evaluate it on the Strogatz datasets. Table 12 shows the median 1-R² over 10 independent runs of our method on the Strogatz datasets, where 1-R² is reported as 0 when the accuracy reaches the tolerance of 1e-15. Figure 7 shows the solution rate of our method on the Strogatz series, where the solution rate means the rate of finding the symbolic solution.


Table 12: The median 1-R² on the Strogatz datasets.

| Dataset | Operon | SBP-GP | AIFeynman | MRGP | DSR | GESR |
| Strogatz_bacres1 | 3.56e-7 | 3.51e-6 | 6.19e-3 | 7.02e-5 | 1.14e-1 | 3.66e-6 |
| Strogatz_bacres2 | 2.67e-8 | 2.19e-9 | 2.91e-4 | 2.59e-5 | 1.17e-1 | 2.43e-5 |
| Strogatz_barmag1 | 2.76e-6 | 1.76e-9 | 0.75 | 7.33e-5 | 1.73e-1 | 0 |
| Strogatz_barmag2 | 1.70e-7 | 0 | 0 | 4.91e-5 | 1.26e-1 | 0 |
| Strogatz_glider1 | 1.31e-12 | 0 | 4.99e-3 | 3.68e-5 | 1.14e-1 | 0 |
| Strogatz_glider2 | 1.85e-7 | 1.83e-5 | 0 | 4.55e-5 | 0 | 0 |
| Strogatz_lv1 | 2.24e-11 | 0.16 | 9.73e-1 | 2.78e-1 | 1.93e0 | 0 |
| Strogatz_lv2 | 1.40e-11 | 1.66e-4 | 0 | 2.95e-2 | 5.55e-1 | 0 |
| Strogatz_predprey1 | 3.25e-7 | 1.02e-3 | 1.01e0 | 2.83e-4 | 8.15e-2 | 1.11e-3 |
| Strogatz_predprey2 | 7.68e-6 | 2.78e-4 | 5.11e-3 | 8.04e-5 | 1.71e-1 | 1.08e-4 |
| Strogatz_shearflow1 | 8.75e-3 | 6.33e-3 | 0 | 3.43e-4 | 0 | 0 |
| Strogatz_shearflow2 | 5.11e-9 | 4.18e-11 | 1.05e0 | 3.98e-5 | 3.02e-1 | 0 |
| Strogatz_vdp1 | 3.58e-9 | 6e-5 | 1.03e0 | 4.91e-5 | 2.47e-1 | 0 |
| Strogatz_vdp2 | 4.99e-15 | 0 | 0 | 2.94e-5 | 1.10e-1 | 0 |


As can be seen from Table 12, GESR achieves a good fit on most of the Strogatz datasets. Even for the few formulas that are not recovered perfectly, a good numerical fit is still obtained. The underlying formulas of the four datasets that failed to fit are y = 20 - x1 - x1·x2/(1 + 0.5·x1²), y = 10 - x1·x2/(1 + 0.5·x1²), y = x1·(4 - x1 - x2/(1 + x1)), and y = x2·(x1/(1 + x1) - 0.075·x2). A common feature of these formulas is that they all involve protected division. This suggests that interval operations, while preventing the generated formulas from producing extreme values to some extent, also limit the ability of GESR to search for certain formulas. Thus, the exploration of protection operators is a possible direction for our future research.

Figure 7: Solution rates on the Feynman datasets (left) and Strogatz datasets (right) of the SRBench benchmark.

Figure 7 shows the success rate of our method in finding the perfect formula. It can be observed that, compared with its advantage in accuracy fitting, our method does not show a great advantage in terms of the solution rate. Although the purpose of our method is numerical fitting, it is undeniable that the high-intensity numerical fitting ability of geometric semantic variation also limits the ability to find a perfect formula. When fitting an underlying expression of high complexity, there may be multiple mismatched parts of an expression that are expected to be replaced. If the sub-expression tree containing all mismatched parts cannot be found and replaced correctly during geometric semantic variation, then, due to the semantic approximation property, a mismatched but semantically more similar sub-expression tree may be chosen for replacement. This turns the symbolic regression process into a numerical fitting process rather than a search for a perfect formula, which is an issue worth studying further in the future.
