Algorithms

GeneticProgramming

The standard GP algorithm with tournament selection, elitism, and configurable genetic operators.

using Arborist
algorithm = GeneticProgramming(
    pop_size       = 100,        # population size
    generations    = 200,        # number of generations
    mutation_rate  = 0.3,        # probability of mutation per offspring
    crossover_rate = 0.3,        # probability of crossover per pair
    elitism        = 2,          # top individuals carried forward
    bloat_penalty  = 0.0,        # coefficient on complexity(g)
    parallel       = true,       # threaded evaluation
    speciation     = NoSpeciation(),
    mutation_ops   = [SubtreeMutation(), PointMutation()],
    crossover_ops  = [SubtreeCrossover()],
    selection      = TournamentSelection(3),  # tournament size set here
)
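As a rough illustration of the generation step this configures (tournament selection plus elitism), here is a self-contained sketch over plain fitness values. The function names are invented for illustration, not Arborist internals, and the variation operators are elided:

```julia
using Random

# Pick the fittest of k randomly sampled indices (minimization).
function tournament(fitnesses, k, rng)
    best = rand(rng, 1:length(fitnesses))
    for _ in 2:k
        c = rand(rng, 1:length(fitnesses))
        fitnesses[c] < fitnesses[best] && (best = c)
    end
    return best
end

# One generation: carry the top `elitism` individuals forward unchanged,
# fill the rest of the population with tournament winners.
function next_generation(pop, fitnesses; k = 3, elitism = 2, rng = Random.default_rng())
    order  = sortperm(fitnesses)
    elites = pop[order[1:elitism]]       # carried forward unchanged
    n_children = length(pop) - elitism
    children = [pop[tournament(fitnesses, k, rng)] for _ in 1:n_children]
    return vcat(elites, children)        # mutation/crossover would apply to children here
end
```

With `elitism = 2`, the two best individuals are guaranteed to reappear in the next generation regardless of selection pressure.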

IslandModel

Multiple independent populations with periodic ring-topology migration.

using Arborist
algorithm = IslandModel(
    n_islands = 4,
    island_algorithm = GeneticProgramming(pop_size=50, generations=100),
    migration_interval = 10,  # generations between migrations
    migration_size = 2,       # individuals migrated per event
)
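The ring-migration step can be sketched as follows. This is an illustration, not Arborist's implementation: it assumes each island is a vector of (genome, fitness) pairs with lower fitness better, and that immigrants replace the destination island's worst individuals (the replacement policy is an assumption here).

```julia
# Each island sends its `migration_size` best individuals to its clockwise
# neighbor in the ring, where they replace the neighbor's worst individuals.
function migrate_ring!(islands::Vector{<:Vector}, migration_size::Int)
    n = length(islands)
    # Snapshot emigrants before any replacement happens.
    emigrants = [sort(isl; by = last)[1:migration_size] for isl in islands]
    for i in 1:n
        dest = islands[mod1(i + 1, n)]     # clockwise neighbor in the ring
        sort!(dest; by = last, rev = true) # worst individuals first
        for (j, m) in pairs(emigrants[i])
            dest[j] = m                    # overwrite the worst with immigrants
        end
    end
    return islands
end
```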
Genome support in 0.1.0

IslandModel (sequential, synchronous distributed, and asynchronous distributed) dispatches on ExprGenome, TreeGenome, and GraphGenome:

  • TreeGenome — the evaluator must be a TreeFitnessEvaluator. Migration transports Node{T} directly across workers, so the op indices remain valid against every island's shared OperatorEnum.
  • GraphGenome (under distributed=true) — each worker is assigned a disjoint innovation-ID range via init_innovation_range!((island_id - 1) * INNOVATION_STRIDE) so cross-process IDs cannot collide. The trade-off: structurally identical mutations on different workers receive different IDs and are treated as disjoint by NEAT crossover.
  • AntGenome — must use the single-population GeneticProgramming solver; the simulator is not thread-safe and has no _initialize_population method on the IslandModel path.
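The disjoint innovation-ID ranges can be sketched with a simple counter. The stride constant and counter type here are stand-ins for Arborist's INNOVATION_STRIDE and the state behind init_innovation_range!:

```julia
const STRIDE = 1_000_000   # illustrative stand-in for INNOVATION_STRIDE

mutable struct InnovationCounter
    next::Int
end

# Worker i starts its counter at (i - 1) * STRIDE, so the ranges never overlap.
init_range(island_id) = InnovationCounter((island_id - 1) * STRIDE)
next_id!(c::InnovationCounter) = (c.next += 1)

c1, c2 = init_range(1), init_range(2)
next_id!(c1)   # 1
next_id!(c1)   # 2
next_id!(c2)   # 1_000_001, disjoint from island 1's range
```

This is also where the documented trade-off comes from: the same structural mutation performed on islands 1 and 2 draws from different ranges and therefore gets different IDs.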

NSGAII

Multi-objective GP using non-dominated sorting and crowding distance with (μ+λ) survivor selection. Requires an evaluator that implements evaluate_multi(evaluator, genome) -> Vector{Float64} (i.e., a subtype of AbstractMultiObjectiveEvaluator). The included ParsimonyEvaluator wraps any single-objective AbstractEvaluator into the standard (fitness, complexity) two-objective problem.
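The dominance relation at the core of non-dominated sorting can be sketched in a few lines (minimization on every objective; this is an illustration, not Arborist's implementation):

```julia
# a dominates b iff a is no worse on every objective and strictly better on at least one.
dominates(a, b) = all(a .<= b) && any(a .< b)

# Indices of the non-dominated (front-1) objective vectors.
function pareto_front(objectives::Vector{Vector{Float64}})
    [i for i in eachindex(objectives) if
        !any(dominates(objectives[j], objectives[i]) for j in eachindex(objectives) if j != i)]
end

objs = [[1.0, 5.0], [2.0, 2.0], [3.0, 3.0], [5.0, 1.0]]
pareto_front(objs)   # [1, 2, 4]: [3.0, 3.0] is dominated by [2.0, 2.0]
```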

using Arborist, DynamicExpressions

inner = SymbolicRegressionEvaluator(
    x -> x^2 + x, domain=(-2f0, 2f0), points=30
)
evaluator = ParsimonyEvaluator(inner)

algorithm = NSGAII(
    pop_size       = 200,    # must be even, ≥ 4
    generations    = 100,
    mutation_rate  = 0.3,
    crossover_rate = 0.3,
    parallel       = true,
    mutation_ops   = [SubtreeMutation(), PointMutation()],
    crossover_ops  = [SubtreeCrossover()],
)

result = solve(GPProblem(evaluator, TreeGenome{Float32}; seed=42), algorithm)
println("Pareto front size: ", length(result.pareto_front))
Pareto front size: 200

solve(::GPProblem, ::NSGAII) returns an NSGAIIResult with these fields:

  • pareto_front::Vector{G} — non-dominated genomes from the final population
  • pareto_fitnesses::Vector{Vector{Float64}} — objective vectors for the Pareto-front genomes
  • population::Vector{G} and all_fitnesses::Vector{Vector{Float64}} — full final population and its objective vectors
  • hypervolume_history::Vector{Float64} — hypervolume of front 1 per generation (2D only; problems with three or more objectives return 0.0)
  • generations_run::Int, wall_time::Float64, objective_names::Vector{String}

See examples/nsga2_regression.jl for a complete runnable example.

Genome support in 0.1.0

NSGAII dispatches on ExprGenome and TreeGenome. AntGenome and GraphGenome are not yet supported — attempting to solve an NSGA-II problem with those genome types raises a MethodError from _nsga2_init_population. Extending NSGA-II to the remaining genome types is planned for a future release.

CMAES

Covariance Matrix Adaptation Evolution Strategy ((μ/μ_w, λ)-CMA-ES, Hansen 2016) for continuous-parameter optimization. Used to refine the weight vector of a genome whose topology is fixed — for example, a GraphGenome whose connections you want to tune after a NEAT search has discovered the structure. The algorithm itself does not mutate topology; pair it with a topology search if both are needed.

using Arborist
algorithm = CMAES(
    generations = 200,
    pop_size    = 0,       # 0 → auto λ = 4 + ⌊3·ln(n)⌋ (Hansen 2016 default)
    sigma0      = 0.5,     # initial step size
    seed_genome = true,    # start the mean from the problem's initial genome
)

The genome must opt in by implementing flatten_weights(g) -> Vector{Float64} and unflatten_weights!(g, w) -> g. The GraphGenome adaptor sorts connections by innovation number for deterministic packing; other genomes can implement these two methods to participate. See the Infrastructure API reference for details.
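The contract is small enough to sketch. The ToyGenome type and its field are invented for illustration; only the two method names come from the documented interface:

```julia
# A hypothetical genome with a flat weight vector (real genomes would pack
# their parameters in a deterministic order, as the GraphGenome adaptor does
# by sorting connections by innovation number).
mutable struct ToyGenome
    weights::Vector{Float64}
end

flatten_weights(g::ToyGenome) = copy(g.weights)

function unflatten_weights!(g::ToyGenome, w::Vector{Float64})
    copyto!(g.weights, w)   # write the optimizer's candidate back into the genome
    return g
end

g = ToyGenome([0.1, 0.2])
unflatten_weights!(g, [1.0, 2.0])
flatten_weights(g)   # [1.0, 2.0]
```

The round-trip property (unflatten then flatten returns the same vector) is what lets CMA-ES treat the genome as a point in R^n.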

MAPElites

Quality-Diversity search via grid-based behavioral archives (Mouret & Clune, 2015). Replaces single-best convergence with a coverage goal: fill as many distinct behavior cells as possible, keeping the fittest representative in each. MAPElites runs its own solve loop that samples parents from the filled archive — there is no notion of "population" carried across generations.

algorithm = MAPElites(
    feature_fn     = g -> (my_behavior_x(g), my_behavior_y(g)),
    feature_bounds = [(0.0, 1.0), (0.0, 1.0)],
    n_bins         = [20, 20],
    generations    = 200,
    batch_size     = 50,    # children produced per generation
    n_init         = 200,   # seed pool of random genomes (gen 0)
    mutation_ops   = [SubtreeMutation()],
)

result = solve(GPProblem(evaluator, ExprGenome; ...), algorithm)

feature_fn returns an NTuple{N, Float64} (or any indexable length-N collection of Real) describing where the genome lands in behavior space. feature_bounds and n_bins discretize that space — features outside [lo, hi] clamp to the extreme bin. coverage and qd_score are queryable on the archive, and the result type is MAPElitesResult{G} with archive, coverage_history, and qd_score_history. The Plots.jl recipe plotarchive(result) produces the canonical 2D heatmap (currently 2D archives only).
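The feature-to-cell mapping with clamping can be sketched like this (function names are illustrative, not Arborist's internals):

```julia
# Map one feature value into a 1-based bin index; values outside [lo, hi]
# clamp to the extreme bins.
function bin_index(feature, lo, hi, n_bins)
    t = clamp((feature - lo) / (hi - lo), 0.0, 1.0)
    return min(floor(Int, t * n_bins) + 1, n_bins)
end

# Map an N-dimensional feature tuple to its archive cell.
cell(features, bounds, bins) =
    ntuple(i -> bin_index(features[i], bounds[i]..., bins[i]), length(bins))

cell((0.31, 1.7), [(0.0, 1.0), (0.0, 1.0)], [20, 20])   # (7, 20): second feature clamps
```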

NoveltySearch

NoveltySearchEvaluator replaces the fitness signal with behavioral novelty — the negative mean distance to a genome's k nearest neighbors in a thread-safe, bounded archive of past behaviors. Useful on deceptive problems where the fitness gradient points away from the global optimum.

archive  = NoveltyArchive(Vector{Float64};
                           max_size      = 2000,
                           add_threshold = 0.0)

evaluator = NoveltySearchEvaluator(
    g       -> behavioral_signature(g),    # fingerprint_fn :: genome → fingerprint
    (a, b)  -> norm(a - b),                # distance_fn   :: (fp, fp) → Float64
    archive;
    k       = 15,
)

algorithm = GeneticProgramming(
    pop_size      = 200,
    generations   = 200,
    mutation_ops  = [SubtreeMutation(), PointMutation()],
)

result = solve(GPProblem(evaluator, ExprGenome; ...), algorithm)

NoveltySearchEvaluator returns -mean_knn_distance, so the standard minimization solve loop drives the population toward novelty. To combine novelty with task fitness, build a multi-objective evaluator and use NSGAII with both signals.
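The novelty score itself is simple to sketch: negated mean distance to the k nearest archived behaviors, so the standard minimization loop maximizes novelty (illustrative code, not Arborist's implementation):

```julia
using LinearAlgebra

# Negative mean distance from `fingerprint` to its k nearest archive entries.
function novelty_fitness(fingerprint, archive; k = 15)
    dists = sort([norm(fingerprint - a) for a in archive])
    kk = min(k, length(dists))           # archive may hold fewer than k entries
    return -sum(dists[1:kk]) / kk
end

archive = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
novelty_fitness([2.0, 0.0], archive; k = 2)   # -(1.0 + 2.0) / 2 = -1.5
```

A behavior far from everything archived gets a large negative (good) fitness; one sitting in a crowded region scores near zero (bad).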

Constant optimization

GeneticProgramming accepts an optional constant_optimization field that runs a periodic BFGS pass over the numeric constants of the top-K genomes:

using Arborist
algorithm = GeneticProgramming(
    pop_size = 100,
    generations = 200,
    constant_optimization = ConstantOptimization(
        frequency = 25,    # run every 25 generations
        top_k     = 5,     # refine the top 5 individuals
        max_iter  = 50,
        tol       = 1e-8,
        fd_step   = 1e-3,  # central finite-difference gradient step
    ),
)

Currently implemented for TreeGenome symbolic regression. The pass is rollback-guarded: a refined genome's fitness can only improve. To optimize a single genome's constants by hand, call optimize_constants!(g, evaluator; kwargs...).
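The fd_step parameter controls a central finite-difference gradient over the constant vector, which can be sketched as follows (f stands for any scalar loss over the constants; this is an illustration, not the internal implementation):

```julia
# Central finite-difference gradient: g[i] = (f(c + h·eᵢ) - f(c - h·eᵢ)) / 2h.
function fd_gradient(f, c::Vector{Float64}; h = 1e-3)
    g = similar(c)
    for i in eachindex(c)
        cp = copy(c); cp[i] += h
        cm = copy(c); cm[i] -= h
        g[i] = (f(cp) - f(cm)) / (2h)
    end
    return g
end

fd_gradient(c -> c[1]^2 + 3c[2], [2.0, 0.0])   # ≈ [4.0, 3.0]
```

BFGS then uses this gradient estimate to refine the constants; the rollback guard keeps the original genome whenever the refined fitness is not an improvement.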

Warm-starting from a seed population

solve(::GPProblem, ::GeneticProgramming) accepts an initial_population keyword carrying a user-provided seed pool. The solve loop replaces the random initial population with these genomes — useful for resuming from a hand-curated set, transferring solutions between problem variants, or seeding a refinement run from a previous best:

seed_pool = [g1, g2, g3, ...]   # length must equal algorithm.pop_size
@assert length(seed_pool) == algorithm.pop_size

result = solve(problem, algorithm; initial_population = seed_pool)

Element type must match the problem's genome type and length must equal algorithm.pop_size. initial_population is mutually exclusive with resume_from — use one or the other.

All-time-best Hall of Fame

By default GPResult reports only the single best genome at the end of the run. Pass hall_of_fame_size = K to attach a top-K archive of the distinct best fitnesses seen across all generations, so the best solutions are retained even if elitism later loses them from the population:

using Arborist, DynamicExpressions
evaluator = SymbolicRegressionEvaluator(x -> x^2 + x, domain=(-1f0, 1f0), points=15)
problem   = GPProblem(evaluator, TreeGenome{Float32}; seed=42)
algorithm = GeneticProgramming(pop_size=40, generations=20)

result = solve(problem, algorithm; hall_of_fame_size = 5)

println("HoF size: ", length(result.hall_of_fame), " / capacity ",
        result.hall_of_fame.capacity)
for i in 1:min(3, length(result.hall_of_fame))
    println("[$i] fitness=", result.hall_of_fame.fitnesses[i])
end
HoF size: 5 / capacity 5
[1] fitness=0.0035834189504926933
[2] fitness=0.003961255149946844
[3] fitness=0.005436814407233899

The archive de-duplicates by fitness within 1e-12 (cheap structural proxy — two genomes with identical fitness are treated as the same solution). hall_of_fame_size = 0 (default) disables the archive and leaves result.hall_of_fame === nothing. The archive composes with resume_from: a fresh archive is allocated on resume, since the checkpoint format does not yet persist the archive.
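The de-duplication rule can be sketched over bare fitness values (names are illustrative; the real archive stores genomes alongside their fitnesses):

```julia
# Insert f unless an archived fitness is within tol of it; keep the best
# `capacity` fitnesses, sorted ascending (lower is better).
function hof_insert!(fitnesses::Vector{Float64}, f::Float64; capacity = 5, tol = 1e-12)
    any(abs(f - x) <= tol for x in fitnesses) && return fitnesses  # duplicate by fitness proxy
    push!(fitnesses, f)
    sort!(fitnesses)
    length(fitnesses) > capacity && pop!(fitnesses)                # drop the worst
    return fitnesses
end

hof = Float64[]
for f in (0.3, 0.1, 0.1 + 1e-13, 0.2)
    hof_insert!(hof, f; capacity = 3)
end
hof   # [0.1, 0.2, 0.3]: the near-duplicate of 0.1 was rejected
```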

Checkpoint and resume

Long single-objective GeneticProgramming runs (on ExprGenome or TreeGenome) can checkpoint to disk and resume:

result = solve(problem, algorithm;
    checkpoint_every = 10,                 # generations between writes
    checkpoint_path  = "run.checkpoint",
)

# Later, in a separate process:
result = solve(problem, algorithm;
    resume_from = "run.checkpoint",
)

save_checkpoint writes atomically (write-tmp + rename) and load_checkpoint rejects mismatched format-version or Julia-version stamps. By default, resuming refuses to load a checkpoint produced by a different _algorithm_signature (different hyperparameters or operator types); pass allow_signature_mismatch=true to bypass the check. IslandModel, NSGAII, MAPElites, AntGenome, and GraphGenome are not yet wired for checkpoint/resume.
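The write-tmp + rename pattern is worth spelling out, since it is what makes a crash mid-write safe: the reader sees either the old checkpoint or the new one, never a torn file. A generic sketch (not Arborist's save_checkpoint):

```julia
# Write to a sibling temp file, then rename over the target. On POSIX
# filesystems the rename is atomic within a directory.
function atomic_write(path::AbstractString, data::Vector{UInt8})
    tmp = path * ".tmp"
    open(tmp, "w") do io
        write(io, data)
    end
    mv(tmp, path; force = true)
    return path
end

path = joinpath(mktempdir(), "run.checkpoint")
atomic_write(path, Vector{UInt8}("gen=10"))
String(read(path))   # "gen=10"
```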

Structured run logs

For research-grade reproducibility, every single-objective solve path accepts an optional RunLog:

using Arborist, DynamicExpressions
evaluator = SymbolicRegressionEvaluator(x -> x^2 + x, domain=(-1f0, 1f0), points=15)
problem   = GPProblem(evaluator, TreeGenome{Float32}; seed=42)
algorithm = GeneticProgramming(pop_size=40, generations=10)

log = RunLog()
result = solve(problem, algorithm; log = log)

# log.entries :: Vector{GenerationLog} — one entry per generation
e = entries(log)[end]
println("generations recorded: ", length(entries(log)))
println("last gen best/mean/worst: ",
        round(e.best_fitness; sigdigits=3), " / ",
        round(e.mean_fitness; sigdigits=3), " / ",
        round(e.worst_fitness; sigdigits=3))
generations recorded: 10
last gen best/mean/worst: 0.0385 / 14.1 / 210.0

The Plots.jl recipe renders three stacked subplots — fitness trajectory, species count, and unique-structure count — directly from the log:

using Plots
plot(log)