TreeGenome — DynamicExpressions.jl Backend
Why Two Genome Types
Arborist.jl provides two genome types for different problem classes:
TreeGenome (DynamicExpressions.jl backed)
Appropriate for:
- Pure function approximation: symbolic regression, system identification
- Problems where the genome is
f(x1, x2, ...) -> ywith no side effects - Workloads where evaluation speed over large datasets matters
- Boolean expression problems (parity, multiplexer) expressible as trees
TreeGenome evaluates expression trees directly via DynamicExpressions.jl's vectorized evaluation engine. No @eval compilation is needed, making evaluation roughly 8x faster than ExprGenome on the Koza symbolic regression suite (1000-point dataset); the gap widens further as dataset size grows.
ExprGenome (@eval backed)
Required for:
- Programs with control flow: loops, conditionals, break/continue
- Programs with sequential mutable state
- Side-effectful operations (robotics control, the ant trail)
- General program synthesis where arbitrary Julia is needed
The Santa Fe Ant Trail is a canonical example of the second category. The Koza symbolic regression suite is a canonical example of the first.
Accessing TreeGenome
TreeGenome and TreeFitnessEvaluator are exported directly from Arborist. DynamicExpressions.jl is a hard dependency of Arborist, so no extension dance is needed — just using both packages:
using Arborist
using DynamicExpressionsQuick-Start Example
using Arborist, DynamicExpressions
# Define operators
operators = OperatorEnum(;
binary_operators = [+, -, *, /],
unary_operators = [abs]
)
# Build dataset: y = x^2
xs = Float32.(range(-1, 1, length=100))
X = reshape(xs, 1, :) # 1 feature × 100 samples
y = xs .^ 2
# Create evaluator and problem
evaluator = TreeFitnessEvaluator(X, y, operators)
problem = GPProblem(evaluator, TreeGenome{Float32}; seed=42)
# Run GP
algorithm = GeneticProgramming(
pop_size = 100,
generations = 100,
mutation_rate = 0.4,
crossover_rate = 0.2,
)
result = solve(problem, algorithm)
println("Best fitness: ", result.best_fitness)
println("Best tree: ", serialize(result.best_genome))Best fitness: 0.0
Best tree: abs(x1 * x1)Mutation Operators
TreeGenome implements three mutation types, chosen with equal probability:
- Point mutation: replaces a random node with a new random subtree of depth ≤ 2
- Constant perturbation: adds Gaussian noise (scale 0.1) to a random constant
- Hoist mutation: replaces a subtree with one of its children (bloat reduction)
These are invoked automatically when using SubtreeMutation() or PointMutation() in the algorithm's mutation_ops — the dispatch routes to the TreeGenome implementation.
Ephemeral Random Constants
Both random tree creation and point-mutation leaf creation sample numeric literals via an ephemeral random constant (ERC) sampler. The default is T(randn(rng)) — standard normal, scaled by T. To override, pass a (rng) -> T callable as GeneticProgramming(; constant_sampler=...). The sampler is threaded into TreeGenome creation via task-local storage, so no global state is mutated.
erc_uniform(lo, hi) is the canonical helper, returning a sampler that draws uniformly from [lo, hi] (Koza-style ERCs):
using Arborist
algorithm = GeneticProgramming(
pop_size = 100,
generations = 200,
constant_sampler = erc_uniform(-5.0f0, 5.0f0),
)For other distributions, write a closure directly:
using Arborist
# Log-uniform over [0.1, 10.0] for problems where constants span orders of magnitude
algorithm = GeneticProgramming(;
constant_sampler = rng -> Float32(exp(log(0.1) + log(100.0) * rand(rng))),
)constant_sampler === nothing (default) preserves the historical T(randn(rng)) behavior; the field is ignored by genome types other than TreeGenome.
Known Limitations
Distance metric: TreeGenome's
distancefunction uses absolute node count difference, which is less semantically meaningful than ExprGenome's symmetric-difference metric. This is sufficient forThresholdSpeciationbut could be improved in future versions.Deserialize: Parsing expression trees from arbitrary string representations is not implemented.
deserializereturnsnothing. The LLM operator is not yet supported for TreeGenome.Boolean problems: Boolean logic must be encoded as Float32 operations (0.0 = false, 1.0 = true) since DynamicExpressions.jl operates on numeric types. Custom boolean operators (AND, OR, XOR, etc.) must be defined as Float32 → Float32 functions and passed via
OperatorEnum.