Infrastructure — API Reference

Autogenerated reference for evaluators, novelty search / MAP-Elites archives, structured run logs, checkpoint / resume, constant optimization, the AST sanitizer, code-generation primitives, the protected arithmetic operators used in symbolic regression, and shared run-utility helpers (train_test_split, summarize, run_multi_seed).

Types

Arborist.TableFitnessEvaluatorType
TableFitnessEvaluator <: AbstractEvaluator

Evaluates a function against a table of input/output examples. Fitness is mean squared error over rows where execution succeeded. Returns Inf if more than 50% of rows throw exceptions or exceed the time limit.

Fields

  • input_cols::Dict{Symbol, DataType}: input variable names and types
  • output_cols::Dict{Symbol, DataType}: output variable names and types
  • input_rows::Vector{Dict{Symbol, Any}}: input data rows
  • output_rows::Vector{Dict{Symbol, Any}}: expected output data rows
  • time_limit_ns::Int: per-call time limit in nanoseconds (default: 1,000,000, i.e. 1 ms)
source
Arborist.TableFitnessEvaluatorMethod
TableFitnessEvaluator(input_cols, output_cols, input_rows, output_rows; time_limit_ns=1_000_000)

Construct a TableFitnessEvaluator with an optional time limit per function call.

source
Arborist.NoveltyArchiveType
NoveltyArchive{B}

A thread-safe, append-only collection of behavioral fingerprints used by NoveltySearchEvaluator for k-nearest-neighbor novelty scoring.

Fields

  • entries::Vector{B}: stored fingerprints, in insertion order.
  • max_size::Int: cap on the number of entries. When the archive is full, new fingerprints are silently dropped. Oldest-out eviction would change novelty scores for already-evaluated genomes; bounded growth is the standard Lehman-Stanley behavior.
  • add_threshold::Float64: minimum novelty (mean k-NN distance) at which a new fingerprint is added. A value of 0.0 adds every evaluated fingerprint, which saturates the archive quickly; a value that is too high prevents archive growth and starves later evaluations of reference points.
  • lock::ReentrantLock: guards entries for parallel=true evaluation.
source
Arborist.NoveltySearchEvaluatorType
NoveltySearchEvaluator{F,D,B} <: AbstractEvaluator

Behavior-based evaluator. Returns the negative mean of the k nearest neighbor distances between the current genome's fingerprint and the archive — lower is better, matching the framework's convention.

Type parameters

  • F: type of the fingerprint function genome -> B.
  • D: type of the distance function (B, B) -> Float64.
  • B: type of a single behavioral fingerprint.

Fields

  • fingerprint_fn::F: extracts a behavioral descriptor from a genome. Typically performs the same rollout the base evaluator would, but records a behavior summary rather than a fitness scalar.
  • distance_fn::D: distance metric over fingerprints. Should return 0.0 for identical behaviors and positive for different ones.
  • archive::NoveltyArchive{B}: behavioral memory.
  • k::Int: number of nearest neighbors used in the novelty score.
source
Arborist.CheckpointType
Checkpoint{G}

Opaque snapshot of a mid-run single-objective GP evolution. Holds everything _run_evolution! needs to resume: population, fitnesses, generation counter, RNG state, fitness/mean histories, wall-time, and a signature derived from the algorithm config.

Not meant for direct construction — save_checkpoint is invoked internally by solve(...; checkpoint_every, checkpoint_path). Users construct one only when implementing a custom solve loop.

Fields

  • format_version::Int: internal version of the Checkpoint layout (bumped on incompatible field changes).
  • arborist_version::VersionNumber: project version from Project.toml.
  • julia_version::VersionNumber: VERSION at save time.
  • generation::Int: generation just completed. Resume starts at generation + 1.
  • population::Vector{G}: final population of the completed generation, sorted best-first.
  • fitnesses::Vector{Float64}: aligned with population.
  • rng_state::Any: copy(rng) at save time, so resumed runs draw the same random stream the interrupted one would have.
  • best_genome::G: the single best genome across the whole run so far (may differ from population[1] if the best individual was lost through breeding, for example because elitism did not preserve it).
  • best_fitness::Float64: paired with best_genome.
  • fitness_history::Vector{Float64}: per-generation best-fitness trajectory.
  • mean_history::Vector{Float64}: per-generation mean-finite-fitness trajectory.
  • wall_time::Float64: cumulative seconds elapsed (does not include pre-resume idle time).
  • algorithm_signature::UInt64: hash of the algorithm config (see _algorithm_signature) — checked on resume so the user can't hot-swap hyperparameters silently.
  • hall_of_fame::Any: optional HallOfFame{G} archive at checkpoint time, or nothing when disabled.
source
Arborist.GenerationLogType
GenerationLog

One record per generation in a RunLog. Every field is set by record!; fields whose real values arrive in a later phase (Phase F.5) are left as empty containers by F.0.

Fields

  • generation::Int: 1-based generation number.
  • best_fitness::Float64: minimum finite fitness in the generation. Inf if no individual evaluated successfully.
  • mean_fitness::Float64: mean of finite fitnesses.
  • median_fitness::Float64: median of finite fitnesses.
  • worst_fitness::Float64: maximum finite fitness. Inf if no finite fitnesses at all.
  • n_species::Int: number of active species. 1 when speciation is NoSpeciation or speciation state is unavailable.
  • species_sizes::Vector{Int}: member counts per species, in the same order as the speciation state's internal list. Empty when speciation is inactive or the solve path does not thread a SpeciationSnapshot.
  • operator_success::Dict{Symbol,Int}: per mutation/crossover operator, the count of offspring that beat the parent's fitness. Populated in F.5.
  • operator_attempted::Dict{Symbol,Int}: count of times each operator was invoked. Populated in F.5.
  • unique_structures::Int: number of distinct genomes in the generation by serialize-hash. Coarse genotypic diversity proxy.
  • wall_time::Float64: cumulative wall-clock seconds elapsed since t0 (run start).
source
Arborist.RunLogType
RunLog

A vector-like container of GenerationLog entries. Callers construct one as RunLog() and pass it to solve(...; log=log). Iteration and indexed access are supported via entries(log), length(log), and log[i].

RunLog is mutable: record! appends one entry per generation.

source
Arborist.SpeciationSnapshotType
SpeciationSnapshot

Mutable carrier passed as a kwarg to _apply_speciation!. Populated with the post-culling species count and per-species member sizes so that record! can record them without the solve path re-computing speciation.

Constructed fresh per generation and discarded; not part of the public API.

source
Arborist.ConstantOptimizationType
ConstantOptimization(; frequency=25, top_k=5, max_iter=50, tol=1e-8, fd_step=1e-3)

Configuration for the periodic constant-optimization pass. Enable by passing GeneticProgramming(; constant_optimization=ConstantOptimization(), ...).

Fields

  • frequency::Int: generations between optimization passes (default: 25). The pass runs at the end of each Nth generation — gen 25, 50, 75, ...
  • top_k::Int: number of top (lowest-fitness) individuals to optimize per pass (default: 5). Applying to all individuals would double evaluation cost every generation; applying only to elites refines the best candidates.
  • max_iter::Int: maximum BFGS iterations per individual (default: 50).
  • tol::Float64: gradient-norm convergence tolerance (default: 1e-8).
  • fd_step::Float64: half-step size for central finite differences (default: 1e-3). Too small amplifies roundoff; too large linearizes too coarsely.
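
The fd_step field describes a central finite-difference gradient. The following is a minimal sketch of that technique on a plain loss function, not Arborist's implementation; central_gradient and loss are hypothetical names introduced here for illustration.

```julia
# Hypothetical helper (not part of Arborist): central finite-difference gradient.
# fd_step is the half-step h, so each partial is (f(c + h*e_i) - f(c - h*e_i)) / 2h.
function central_gradient(f, c::Vector{Float64}; fd_step=1e-3)
    g = similar(c)
    for i in eachindex(c)
        up = copy(c); up[i] += fd_step   # perturb constant i upward
        dn = copy(c); dn[i] -= fd_step   # perturb constant i downward
        g[i] = (f(up) - f(dn)) / (2 * fd_step)
    end
    return g
end

# Toy loss over two constants; its exact gradient at (0, 0) is (-6, 4).
loss(c) = (c[1] - 3.0)^2 + 2.0 * (c[2] + 1.0)^2
g = central_gradient(loss, [0.0, 0.0])
```

Central differences are exact for quadratics up to roundoff, which is why the toy loss recovers the analytic gradient closely. On less smooth losses the trade-off described for fd_step applies.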
source
Arborist.ASTSanitizerType
ASTSanitizer

Validates ExprGenome expression trees against a whitelist of permitted function calls before @eval compilation. Rejects any expression containing calls to functions outside the whitelist.

This is a defense-in-depth measure for use with LLMMutationOperator. For purely classical GP (no LLM operator), the function set already constrains what can appear, but sanitization adds an explicit check.

Fields

  • allowed_calls::Set{Symbol}: whitelist of permitted function call symbols
  • allow_literals::Bool: whether to allow literal values (default: true)
  • allow_variables::Bool: whether to allow variable references (default: true)
source
Arborist.ASTSanitizerMethod
ASTSanitizer(; allowed_calls=DEFAULT_SAFE_CALLS, allow_literals=true, allow_variables=true)

Construct an ASTSanitizer with the default mathematical/logical whitelist.

source
Arborist.FunctionDetailsType
FunctionDetails(name, args, return_type)

Signature record for a single primitive available to evolved programs. name is the Julia Symbol the evolved code will call, args is the ordered vector of argument types, and return_type is the type produced by the call. FunctionSet collects these into the palette from which create_random_rvalue draws.

source
Arborist.FunctionSetType
FunctionSet(funcs::Set{FunctionDetails})

Container for the set of primitives that evolved expression-tree programs are permitted to call. Populated via add! or by constructing the Set directly, and passed into GenState / GPProblem to define the search space. See default_function_set and boolean_function_set for prebuilt palettes.

source
Arborist.GenStateType
GenState(rng, fset, inputs, outputs, num_temps)

Code-generation state shared across the construction and mutation of a single expression-tree genome. Holds the rng used for every random choice (no global state), the FunctionSet palette, the input/output/temp variable dictionaries typed by Symbol => DataType, the set of types in use, and a cached union of all addressable variables. All stochastic helpers in codegen.jl take a GenState and draw exclusively from state.rng, so seeded runs reproduce.

source

Functions

Arborist.evaluateMethod
evaluate(fe::TableFitnessEvaluator, f::Function) -> Float64

Evaluate f against the table of examples. Returns mean squared error over rows where execution succeeded. Returns Inf if more than 50% of rows fail, where a row fails when f throws or exceeds the per-call time limit.

Uses Base.invokelatest to handle world-age issues from @eval-defined functions.
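
The scoring rule can be sketched in plain Julia as follows. This is an illustration under simplifying assumptions, not the library's code: table_mse is a hypothetical name, rows are single scalar inputs rather than Dict rows, and the per-call time limit is omitted.

```julia
# Hypothetical helper (not part of Arborist): MSE over rows that ran successfully,
# with an Inf penalty when more than half of the rows fail.
function table_mse(f, inputs::Vector, expected::Vector{<:Real}; max_fail_frac=0.5)
    sq_errors = Float64[]
    failures = 0
    for (x, y) in zip(inputs, expected)
        try
            # invokelatest sidesteps world-age errors for freshly @eval'd functions
            pred = Base.invokelatest(f, x)
            push!(sq_errors, (Float64(pred) - Float64(y))^2)
        catch
            failures += 1   # any exception counts as a failed row
        end
    end
    failures > max_fail_frac * length(inputs) && return Inf
    isempty(sq_errors) && return Inf
    return sum(sq_errors) / length(sq_errors)
end

ok  = table_mse(x -> 2x, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
bad = table_mse(x -> x > 1.5 ? error("boom") : 2x, [1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Here ok scores a perfect fit, while bad fails on two of three rows and is therefore rejected outright.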

source
Arborist.evaluate_casesMethod
evaluate_cases(e::TableFitnessEvaluator, f::Function) -> Vector{Float64}

Return per-row squared error (Inf for rows that raised, timed out, or produced non-finite output). One entry per input row, in row order. Used by lexicase selection.

source
Arborist.input_signatureMethod
input_signature(fe::TableFitnessEvaluator) -> Dict{Symbol, DataType}

Return the input variable names and types expected by this evaluator.

source
Arborist.output_signatureMethod
output_signature(fe::TableFitnessEvaluator) -> Dict{Symbol, DataType}

Return the output variable names and types expected by this evaluator.

source
Arborist.evaluate_genomeMethod
evaluate_genome(g::AbstractGenome, e::NoveltySearchEvaluator) -> Float64

Compute the novelty score: take the genome's behavioral fingerprint, find the k nearest fingerprints in the archive, return the negative mean of those distances. The fingerprint may be added to the archive (under lock) when its novelty exceeds archive.add_threshold and the archive isn't full.

Returns Inf if fingerprint_fn raises. Returns 0.0 when the archive is empty (first genome of the run) — there's nothing to be novel against yet, but adding to the archive seeds it for subsequent calls.
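
A minimal sketch of the k-nearest-neighbor score described above, assuming a plain Vector as the archive and leaving out locking and archive insertion. novelty_score is a hypothetical name, not the library's function.

```julia
# Hypothetical helper (not part of Arborist): negative mean distance to the
# k nearest archive entries, so that lower (more negative) means more novel.
function novelty_score(fp, archive::Vector, distance_fn, k::Int)
    isempty(archive) && return 0.0                      # nothing to compare against yet
    dists = sort!([distance_fn(fp, other) for other in archive])
    nearest = dists[1:min(k, length(dists))]            # fewer than k entries is allowed
    return -sum(nearest) / length(nearest)
end

absdist(a, b) = abs(a - b)
score = novelty_score(0.5, [0.0, 1.0, 10.0], absdist, 2)
first_call = novelty_score(0.5, Float64[], absdist, 2)
```

With k = 2 the two nearest entries are both at distance 0.5, so the far entry at 10.0 does not influence the score.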

source
Arborist.load_checkpointMethod
load_checkpoint(path::AbstractString) -> Checkpoint

Load a checkpoint previously written by save_checkpoint. Raises ArgumentError if the file's Julia version or checkpoint format version differs from the current process — Julia's Serialization format is not stable across minor versions.

source
Arborist.save_checkpointMethod
save_checkpoint(ckpt::Checkpoint, path::AbstractString)

Atomically write ckpt to path. Uses Julia's Serialization stdlib. Writes to path * ".tmp" then renames, so a partial file never clobbers an older good checkpoint.
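
The write-then-rename pattern can be sketched with the Serialization stdlib as below. atomic_serialize is a hypothetical name, and the payload is a NamedTuple standing in for a real Checkpoint. Whether the rename is truly atomic depends on the filesystem; this sketch assumes the temporary file and the target live on the same one.

```julia
using Serialization

# Hypothetical helper (not part of Arborist): write to a sibling ".tmp" file,
# then rename over the target so a crash mid-write leaves the old file intact.
function atomic_serialize(obj, path::AbstractString)
    tmp = path * ".tmp"
    open(tmp, "w") do io
        serialize(io, obj)
    end
    mv(tmp, path; force=true)   # replaces any existing checkpoint at path
    return path
end

dir = mktempdir()
target = joinpath(dir, "run.ckpt")
atomic_serialize((generation=10, best_fitness=0.25), target)
restored = deserialize(target)
```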

source
Arborist.entriesMethod
entries(log::RunLog) -> Vector{GenerationLog}

Return the vector of GenerationLog entries recorded so far.

source
Arborist.record!Method
record!(log::RunLog, gen, fitnesses, genomes, wall_time;
        snapshot=nothing)

Append one GenerationLog to log with aggregate fitness statistics, optional speciation snapshot, structural diversity, and wall-clock time.

  • gen::Integer: 1-based generation index.
  • fitnesses::AbstractVector: raw fitness per individual (may contain Inf for failed evaluations).
  • genomes::AbstractVector: population genomes, parallel to fitnesses.
  • wall_time::Real: seconds since run start.
  • snapshot::Union{Nothing, SpeciationSnapshot}: if provided, its n_species / sizes fields are copied into the entry. If nothing, the entry records n_species=1 and species_sizes=[length(genomes)] (the NoSpeciation case).
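
The aggregate statistics can be sketched as follows. fitness_stats is a hypothetical name; the Inf fallbacks for best and worst follow the GenerationLog field descriptions, while the NaN fallback for mean and median is an assumption, since the reference does not say what is stored when no fitness is finite.

```julia
using Statistics

# Hypothetical helper (not part of Arborist): summary statistics over the
# finite fitnesses of one generation.
function fitness_stats(fitnesses::AbstractVector{<:Real})
    finite = filter(isfinite, fitnesses)
    if isempty(finite)
        # Assumption: mean/median fall back to NaN when nothing evaluated.
        return (best=Inf, mean=NaN, median=NaN, worst=Inf)
    end
    return (best=minimum(finite), mean=mean(finite),
            median=median(finite), worst=maximum(finite))
end

stats = fitness_stats([0.5, 2.0, Inf, 1.5])
empty_stats = fitness_stats([Inf, Inf])
```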
source
Arborist.sanitizeMethod
sanitize(san::ASTSanitizer, expr::Expr) -> Bool

Return true if the expression tree is safe (all function calls are in the whitelist), false if it contains any unsafe call. Walks the entire AST recursively.

Flags as unsafe:

  • :call nodes where args[1] is a Symbol not in allowed_calls
  • :call nodes where args[1] is a qualified name (e.g., Base.run)
  • :macrocall nodes
  • :quote or :$ interpolation nodes

Does NOT flag: assignment, block, if, while, for, literal values, variable symbols.
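
The rules above can be approximated by a short recursive walk. This is a sketch under the stated rules only, not the library's implementation; is_safe is a hypothetical name, and the allow_literals and allow_variables switches are ignored.

```julia
# Hypothetical helper (not part of Arborist): whitelist check over an Expr tree.
function is_safe(ex, allowed::Set{Symbol})
    ex isa Expr || return true    # literals, symbols, line-number nodes pass
    if ex.head === :macrocall || ex.head === :quote || ex.head === :$
        return false
    end
    if ex.head === :call
        callee = ex.args[1]
        # A qualified name such as Base.run is an Expr, not a Symbol, so it fails here.
        (callee isa Symbol && callee in allowed) || return false
        return all(a -> is_safe(a, allowed), ex.args[2:end])
    end
    # Assignment, block, if, while, for: recurse into children.
    return all(a -> is_safe(a, allowed), ex.args)
end

allowed = Set([:+, :sin])
plain     = is_safe(:(x = sin(y) + 1.0), allowed)
unlisted  = is_safe(:(x = cos(y)), allowed)
qualified = is_safe(:(Base.run(cmd)), allowed)
macroed   = is_safe(:(@show x), allowed)
```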

source
Arborist.sanitizeMethod
sanitize(san::ASTSanitizer, body::Vector{Expr}) -> Bool

Check all statements in a genome body.

source
Arborist.add!Method
add!(fset, f, nargs, input_type, return_type)

Add a primitive f taking nargs arguments of input_type and returning return_type to fset. Shorthand for building homogeneous-signature entries; for mixed argument types, push a FunctionDetails directly into fset.funcs.

source
Arborist.add_loop_checksMethod
add_loop_checks(body; limit=10_000)

Instrument a vector of body expressions with loop iteration checks. Returns a new vector (the original is not modified).

source
Arborist.add_loop_checks_exprMethod
add_loop_checks_expr(expr, limit)

Recursively instrument an expression tree, wrapping each :for and :while node with an iteration counter and a check that throws LoopLimitExceeded if the counter exceeds limit.
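
One way such instrumentation could look is sketched below. guard_loops and LoopLimitError are hypothetical stand-ins for add_loop_checks_expr and LoopLimitExceeded, and the counter placement (one fresh counter per loop, initialized just before it) is an assumption rather than the documented behavior.

```julia
struct LoopLimitError <: Exception end   # hypothetical stand-in for LoopLimitExceeded

# Hypothetical helper (not part of Arborist): wrap every :for and :while node
# with a counter that throws once it exceeds limit.
function guard_loops(ex, limit::Int)
    ex isa Expr || return ex
    args = Any[guard_loops(a, limit) for a in ex.args]   # instrument nested loops first
    if ex.head === :while || ex.head === :for
        counter = gensym("iters")
        check = quote
            $counter += 1
            $counter > $limit && throw(LoopLimitError())
        end
        body = Expr(:block, check, args[2])
        return Expr(:block, :($counter = 0), Expr(ex.head, args[1], body))
    end
    return Expr(ex.head, args...)
end

runaway_loop = guard_loops(:(while true; x += 1; end), 100)
bounded_loop = guard_loops(:(for i in 1:5; total += i; end), 100)

# Splice the instrumented loops into function bodies so variables are local.
@eval function runaway()
    x = 0
    $runaway_loop
    return x
end
@eval function small_sum()
    total = 0
    $bounded_loop
    return total
end

threw = try
    Base.invokelatest(runaway)
    false
catch err
    err isa LoopLimitError
end
total = Base.invokelatest(small_sum)
```

The infinite loop is cut off after the limit, while the bounded loop runs to completion unchanged.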

source
Arborist.construct_and_define_functionMethod

Construct a Julia function from a signature, body expressions, return expression, and return type. Evaluates the function into the current scope via @eval.

The return_expr may or may not be wrapped in :return; if it is, the value is extracted and re-wrapped with a type assertion.

source
Arborist.create_harnessMethod

Create a function expression wrapping generated body code in a typed, callable function skeleton with initialized temps and outputs.

Returns an Expr that can be @eval'd to define the function.

source
Arborist.create_random_assignmentMethod
create_random_assignment(s::GenState) -> Expr

Generate a random :(lhs = rhs) expression, rejecting self-assignments like x = x. The lvalue is drawn from get_lvalues(s) and the rvalue is built by create_random_rvalue matched to the lvalue's type. Used as the leaf case of random program construction.

source
Arborist.unravelFunction
unravel(tree, expressions=[])

Flatten an Expr tree into a list of all sub-expressions via pre-order traversal.
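
A pre-order flattening can be sketched as below. flatten_preorder is a hypothetical name, and collecting only Expr nodes (skipping symbols and literals) is an assumption about what "sub-expressions" means here.

```julia
# Hypothetical helper (not part of Arborist): visit the node first, then its
# children left to right, collecting every Expr encountered.
function flatten_preorder(tree, acc=Expr[])
    if tree isa Expr
        push!(acc, tree)
        for arg in tree.args
            flatten_preorder(arg, acc)
        end
    end
    return acc
end

tree = :(a + b * c)
nodes = flatten_preorder(tree)
```

The collected nodes are the original objects, not copies, which is what makes identity-based replacement possible afterwards.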

source
Arborist.crossoverMethod
crossover(s::GenState, parent_a::Expr, parent_b::Expr) -> Tuple{Expr, Expr}

Perform subtree crossover between two parent expression trees.

Strategy:

  • Flatten both trees with unravel().
  • Find pairs of sub-expressions with matching types (via get_rvalue_type).
  • Pick a random compatible pair, deepcopy both parents, and swap the subtrees.
  • If no compatible pair exists, return deepcopy of both parents unchanged.
source
Arborist.replace_subtree!Method
replace_subtree!(tree::Expr, target::Expr, replacement::Expr) -> Bool

Replace the first occurrence of target (by object identity) in tree with replacement. Returns true if a replacement was made, false otherwise.
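
Replacement by object identity can be sketched as follows. swap_subtree! is a hypothetical name; the sketch searches children only, so it cannot replace the root itself.

```julia
# Hypothetical helper (not part of Arborist): replace the first child that is
# the same object (===) as target, searching depth-first.
function swap_subtree!(tree::Expr, target::Expr, replacement::Expr)
    for (i, arg) in enumerate(tree.args)
        if arg === target
            tree.args[i] = replacement
            return true
        elseif arg isa Expr && swap_subtree!(arg, target, replacement)
            return true
        end
    end
    return false
end

tree = :(a + b * c)
target = tree.args[3]                                   # the b * c node itself
replaced = swap_subtree!(tree, target, :(sin(d)))
# A structurally equal but distinct Expr is not found, because matching is by identity.
lookalike = swap_subtree!(tree, :(sin(d)), :(x * y))
```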

source
Arborist.boolean_function_setMethod
boolean_function_set() -> FunctionSet

Return a function set containing boolean operators suitable for boolean GP problems (e.g., even parity).

Includes: AND (&), OR (|), NOT (!), NAND (gp_nand), NOR (gp_nor), XOR (xor), all operating on Bool.

source
Arborist.default_function_setMethod
default_function_set() -> FunctionSet

Return a default function set containing basic arithmetic, transcendental, and comparison operators suitable for numerical symbolic regression.

Includes:

  • Binary arithmetic (+, -, *, /, ^) for Float32 and Int32
  • Unary transcendentals (cos, sin, tanh, exp, sign) for Float32
  • Binary comparisons (>, <, ==, !=, >=, <=) for Float32 and Int32, returning Bool
source
Arborist.default_protected_function_setMethod
default_protected_function_set() -> FunctionSet

Return a symbolic-regression FunctionSet built around the protected operators. Suitable for ExprGenome-based symbolic regression where evolved programs must evaluate without raising domain errors.

Contents:

  • Binary arithmetic: +, -, * for Float32 and Int32; pdiv for Float32.
  • Unary transcendentals: plog, psqrt, pexp, sin, cos for Float32.

The set follows the Nguyen/Keijzer convention used in the modern symbolic regression literature (McDermott et al., 2012). pinv is not included by default; use it in place of pdiv(1.0, x) in problems where an explicit inverse primitive is desired. The two differ near zero: pinv returns zero(x), whereas pdiv returns one(a).

TreeGenome users do not need this helper: pass the raw functions directly to DynamicExpressions.OperatorEnum, e.g. OperatorEnum(; binary_operators=[+, -, *, pdiv], unary_operators=[plog, psqrt, pexp, sin, cos]).

source
Arborist.pdivMethod
pdiv(a, b)

Protected division. Returns one(a) when |b| < 1.0e-10; otherwise a / b. The canonical Koza-style guard against division by zero.

source
Arborist.pexpMethod
pexp(x)

Protected exponential, exp(clamp(x, -50, 50)). Prevents Inf overflow for large positive x while preserving finite behavior everywhere else. The clamp bounds correspond to exp(50) ≈ 5.18e21, comfortably within Float64 range.

source
Arborist.pinvMethod
pinv(x)

Protected multiplicative inverse. Returns zero(x) when |x| < 1.0e-10, otherwise one(x) / x.

source
Arborist.plogMethod
plog(x)

Protected natural logarithm, log(|x| + 1.0e-10). Finite and real-valued for every finite x; tracks log|x| away from zero and saturates at log(PROTECTED_EPS) = log(1.0e-10) ≈ -23.03 at zero.

source
Arborist.psqrtMethod
psqrt(x)

Protected square root, sqrt(|x|). Always finite and real-valued.
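
The five protected operators (pdiv, pexp, pinv, plog, psqrt) can be written out from their stated formulas as below. The _sketch names and the EPS constant are hypothetical, chosen to avoid clashing with the real definitions.

```julia
const EPS = 1.0e-10   # assumed to match the 1.0e-10 guard quoted in the docstrings

pdiv_sketch(a, b) = abs(b) < EPS ? one(a) : a / b       # 1 when the divisor is ~0
pinv_sketch(x)    = abs(x) < EPS ? zero(x) : one(x) / x # 0 when the input is ~0
plog_sketch(x)    = log(abs(x) + EPS)                   # never log(0) or log(negative)
psqrt_sketch(x)   = sqrt(abs(x))                        # never sqrt(negative)
pexp_sketch(x)    = exp(clamp(x, -50, 50))              # never overflows to Inf

guarded_div = pdiv_sketch(3.0, 0.0)
guarded_inv = pinv_sketch(0.0)
big_exp     = pexp_sketch(1000.0)
```

Note how the two division guards disagree at zero, which matters when swapping one for the other.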

source
Arborist.run_multi_seedMethod
run_multi_seed(f, seeds::AbstractVector{Int}; parallel=false) -> Vector

Call f(seed) for each integer in seeds and return a vector of the results. When parallel=true and Threads.nthreads() > 1, runs concurrently via Threads.@threads — callers must ensure f is thread-safe (no shared mutable state without locking).

Typical use:

fitnesses = run_multi_seed([1, 2, 3, 4, 5]) do seed
    problem = GPProblem(evaluator, TreeGenome{Float32}; seed=seed)
    result = solve(problem, alg)
    result.best_fitness
end
println(summarize(fitnesses))
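
For readers who want to see the dispatch logic, here is a minimal sketch of how such a helper could be written. It is not Arborist's implementation; multi_seed_sketch is a hypothetical name, and results are stored in a Vector{Any} for simplicity.

```julia
using Random

# Hypothetical helper (not part of Arborist): call f(seed) per seed, optionally
# on multiple threads. Each task writes only its own slot, so no lock is needed.
function multi_seed_sketch(f, seeds::AbstractVector{Int}; parallel=false)
    seed_list = collect(seeds)
    results = Vector{Any}(undef, length(seed_list))
    if parallel && Threads.nthreads() > 1
        Threads.@threads for i in eachindex(seed_list)
            results[i] = f(seed_list[i])
        end
    else
        for i in eachindex(seed_list)
            results[i] = f(seed_list[i])
        end
    end
    return results
end

# Each call builds its own RNG from the seed, so the closure shares no state.
draws = multi_seed_sketch([1, 2, 3]; parallel=true) do seed
    rand(MersenneTwister(seed))
end
again = multi_seed_sketch(s -> rand(MersenneTwister(s)), [1, 2, 3])
```

Results come back in seed order regardless of which thread ran which seed.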
source
Arborist.summarizeMethod
summarize(xs::AbstractVector{<:Real}) -> NamedTuple

Return (; mean, std, median, min, max, q25, q75, n) for a vector of real values. Non-finite entries are excluded from every statistic so a single Inf fitness does not poison the summary. n reports the number of finite entries used.

Matches the shape most benchmark reporting expects: mean ± std for a quick headline, quartiles for distribution shape. The implementation does not depend on Statistics, even as a weak dependency, so Pkg.test works in a sandboxed environment.
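
A dependency-free summary along these lines might look like the sketch below. summarize_sketch is a hypothetical name, and two details are assumptions not stated in the reference: std uses the n - 1 (sample) denominator, and quartiles use linear interpolation between order statistics.

```julia
# Hypothetical helper (not part of Arborist): summary statistics over finite
# entries only, written without the Statistics stdlib.
function summarize_sketch(xs::AbstractVector{<:Real})
    v = sort!(Float64[x for x in xs if isfinite(x)])
    n = length(v)
    n == 0 && return (; mean=NaN, std=NaN, median=NaN, min=NaN, max=NaN,
                        q25=NaN, q75=NaN, n=0)
    m = sum(v) / n
    sd = n > 1 ? sqrt(sum(abs2, v .- m) / (n - 1)) : 0.0   # assumed sample std
    # Assumed quantile rule: linear interpolation on the sorted vector.
    function q(p)
        pos = 1 + (n - 1) * p
        lo = floor(Int, pos)
        hi = ceil(Int, pos)
        return v[lo] + (pos - lo) * (v[hi] - v[lo])
    end
    return (; mean=m, std=sd, median=q(0.5), min=v[1], max=v[end],
              q25=q(0.25), q75=q(0.75), n=n)
end

summary = summarize_sketch([1.0, 2.0, 3.0, 4.0, Inf])
```

The single Inf entry is dropped before any statistic is computed, so n reports 4 rather than 5.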

source
Arborist.train_test_splitMethod
train_test_split(X::AbstractMatrix, y::AbstractVector;
                 test_size=0.2,
                 rng=Random.default_rng(),
                 stratify=nothing
                ) -> (X_train, y_train, X_test, y_test)

Split a feature matrix X (features × samples) and matching target vector y into train and test partitions.

Arguments

  • X::AbstractMatrix: features-by-samples (columns are samples).
  • y::AbstractVector: one target per column of X.

Keyword arguments

  • test_size::Real (default 0.2): fraction of samples to place in the test set. Must be in (0, 1).
  • rng::AbstractRNG (default Random.default_rng()): RNG used for the permutation. Pass a MersenneTwister(seed) for reproducibility.
  • stratify::Union{Nothing, AbstractVector} (default nothing): when supplied, a class-label vector of the same length as y. Sampling is done class-wise so the class proportions in both partitions match the input distribution as closely as integer rounding allows.

Returns a 4-tuple (X_train, y_train, X_test, y_test) of sub-matrices and sub-vectors.
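
The unstratified case can be sketched as follows, using the same features-by-samples layout and return order. split_columns is a hypothetical name; the rounding rule for the test-set size is an assumption, and stratification is left out.

```julia
using Random

# Hypothetical helper (not part of Arborist): shuffle column indices, then
# slice off the first n_test columns as the test partition.
function split_columns(X::AbstractMatrix, y::AbstractVector;
                       test_size=0.2, rng=Random.default_rng())
    n = size(X, 2)
    n == length(y) || throw(ArgumentError("X needs one column per entry of y"))
    0 < test_size < 1 || throw(ArgumentError("test_size must be in (0, 1)"))
    perm = randperm(rng, n)
    # Assumption: round to nearest, but keep at least one sample on each side.
    n_test = clamp(round(Int, test_size * n), 1, n - 1)
    test_idx = perm[1:n_test]
    train_idx = perm[n_test+1:end]
    return X[:, train_idx], y[train_idx], X[:, test_idx], y[test_idx]
end

X = reshape(collect(1.0:20.0), 2, 10)   # 2 features, 10 samples
y = collect(1:10)
X_train, y_train, X_test, y_test =
    split_columns(X, y; test_size=0.2, rng=MersenneTwister(42))
```

Passing a seeded MersenneTwister makes the partition reproducible across runs.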

source

Constants