Symbolic Regression with TreeGenome
This tutorial walks through the canonical Koza-1 benchmark (y = x⁴ + x³ + x² + x) end-to-end using `TreeGenome`, the DynamicExpressions.jl-backed expression-tree genome. It exercises the fastest evaluation path in Arborist: no `@eval` compilation, and vectorized evaluation over the whole dataset per generation.
Setup
```julia
using Arborist
using DynamicExpressions: OperatorEnum
```

Pick the operator set. For pure symbolic regression over Float32-valued inputs, the standard choice is the four basic binary operators plus one unary:
```julia
operators = OperatorEnum(;
    binary_operators = [+, -, *, /],
    unary_operators = [abs],
)
```

Generate the training data
Koza-1 evaluates on 20 equispaced points in [−1, 1]:
```julia
xs = Float32.(range(-1, 1, length=20))
X = reshape(xs, 1, :)  # n_features × n_samples (= 1 × 20)
y = xs.^4 .+ xs.^3 .+ xs.^2 .+ xs
```

`X` must be a `Matrix{T}` of shape n_features × n_samples even when there is only one input feature.
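As an aside, you can exercise this vectorized path directly: build the Koza-1 target by hand as an expression tree and evaluate it over the whole dataset in one call. A minimal sketch using the upstream DynamicExpressions.jl API (`Node` and `eval_tree_array`, which are DynamicExpressions names, not Arborist functions):

```julia
using DynamicExpressions: Node, eval_tree_array

# Hand-build the Koza-1 target x^4 + x^3 + x^2 + x as an expression tree.
x1 = Node{Float32}(; feature=1)                # leaf referring to feature 1
target = ((x1 * x1 + x1) * x1 + x1) * x1 + x1  # Horner form of the target

# One vectorized call evaluates the tree over all 20 samples at once.
ŷ, completed = eval_tree_array(target, X, operators)
@assert completed   # no NaN/Inf was encountered during evaluation
@assert ŷ ≈ y
```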
Build the evaluator and problem
```julia
evaluator = TreeFitnessEvaluator(X, y, operators)
problem = GPProblem(evaluator, TreeGenome{Float32}; seed=42)
```

The fitness evaluator computes mean squared error over the dataset every generation. Specifying `seed=42` makes the run reproducible: Arborist threads an explicit `MersenneTwister` through the entire evolution.
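Conceptually, each genome's fitness is one vectorized evaluation followed by a mean of squared residuals. The sketch below illustrates that computation; it is not Arborist's actual implementation, and the `Inf32` guard for incomplete evaluations is an assumption about how numerically invalid trees are penalized:

```julia
using DynamicExpressions: eval_tree_array

# Illustrative sketch of a per-genome MSE fitness, not the library's internals.
function mse_fitness(tree, X, y, operators)
    ŷ, completed = eval_tree_array(tree, X, operators)
    completed || return Inf32  # assumed: invalid trees get the worst fitness
    return sum(abs2, ŷ .- y) / length(y)
end
```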
Configure the algorithm
```julia
algorithm = GeneticProgramming(
    pop_size       = 100,
    generations    = 300,
    mutation_rate  = 0.4,
    crossover_rate = 0.2,
    elitism        = 2,
)
```

The defaults fit symbolic regression well: `TournamentSelection(3)` for parent selection, `SubtreeMutation()` and `PointMutation()` for mutation, and `SubtreeCrossover()` for recombination. All of these dispatch on `TreeGenome{Float32}`.
Solve
```julia
result = solve(problem, algorithm; verbose=true)
```

Verbose output prints the best and mean fitness per generation. With `seed=42`, this run typically converges to near-zero fitness within roughly 100 generations.
Read the result
println("Best fitness: ", result.best_fitness)
println("Expression: ", serialize(result.best_genome))
println("Wall time: ", round(result.wall_time, digits=2), " s")A representative run:
```
Best fitness: 6.83e-14
Expression: (((x1 * x1) + 1.0) * (x1 + 1.0)) * x1
Wall time: 3.27 s
```

The serialized expression is a valid Julia expression reusing the same variable names as the evaluator (`x1` for the first feature); here it is an exact factorization of the target, since x(x² + 1)(x + 1) = x⁴ + x³ + x² + x. `result.best_fitness` is mean squared error; `result.fitness_history` and `result.mean_history` are full per-generation trajectories that plot nicely via the Plots.jl recipes.
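For a quick convergence check, the two history vectors also plot directly with ordinary Plots.jl calls (a minimal sketch that sidesteps the recipes):

```julia
using Plots

# Best and mean MSE per generation on a log scale; convergence shows up
# as a steep drop in the best-fitness curve followed by a flat floor.
plot(result.fitness_history; yscale=:log10, label="best MSE",
     xlabel="generation", ylabel="fitness")
plot!(result.mean_history; label="mean MSE")
```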
Next steps
- Extend to Nguyen-1..10 by swapping the target function; `test/benchmarks/nguyen_regression.jl` shows the pattern, using protected `pdiv` / `plog` / `psqrt` operators for functions that would otherwise domain-error (see the sketch below).
- Add a bloat penalty via `GeneticProgramming(; bloat_penalty=1e-4)`, or recover the accuracy-vs-parsimony Pareto front with NSGA-II.
- Refine the constants found by GP using the optional BFGS pass: `GeneticProgramming(; constant_optimization=ConstantOptimization(frequency=25, top_k=5))`.
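For reference, protected operators are plain Julia functions that guard inputs which would otherwise throw a `DomainError` or produce `NaN`/`Inf`. The definitions below follow the common GP convention and are an assumption about what the benchmark file's `pdiv` / `plog` / `psqrt` look like, not a copy of it:

```julia
# Common protected-operator convention (an assumption, not the benchmark
# file's exact definitions): return a safe value instead of NaN/Inf.
pdiv(a, b) = iszero(b) ? one(a) : a / b     # protected division
plog(a)    = log(abs(a) + eps(typeof(a)))   # protected log
psqrt(a)   = sqrt(abs(a))                   # protected sqrt

operators = OperatorEnum(;
    binary_operators = [+, -, *, pdiv],
    unary_operators  = [plog, psqrt],
)
```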