
Running Evaluations

AgentOdyssey provides three equivalent interfaces for running evaluations:

| Interface | When to use |
| --- | --- |
| `eval.py` | Direct command-line usage — run a single evaluation from the terminal |
| `AgentOdyssey.run()` | Python API — integrate evaluations into scripts, sweeps, or notebooks |
| `agentodyssey run` | CLI tool — available after installing AgentOdyssey as a package (`pip install -e .`) |
Additional Dependencies

The base requirements.txt covers most of AgentOdyssey's functionality, but certain agents and LLM providers require extra packages:

| Feature | Extra packages |
| --- | --- |
| `RaptorRAGAgent` | `tiktoken`, `umap-learn`, `tenacity` |
| `Mem1Agent` | `mem0ai` |
| Gemini (`llm_provider="gemini"`) | `google-genai` |
| Claude (`llm_provider="claude"`) | `anthropic` |

Install only what you need, e.g.:

```bash
pip install tiktoken umap-learn tenacity   # for RaptorRAGAgent
```

AgentOdyssey.run()

```python
from agentodyssey import AgentOdyssey

AgentOdyssey.run(
    game_name="remnant",
    agent="LongContextAgent",
    llm_provider="openai",
    llm_name="gpt-5",
    max_steps=300,
)
```

AgentOdyssey.run() is a thin wrapper that builds and executes the corresponding eval.py command as a subprocess. Every keyword argument maps one-to-one to an eval.py flag.
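The kwarg-to-flag translation can be pictured with a minimal sketch. `build_eval_command` is a hypothetical helper, not the actual wrapper code; the real implementation may handle types and validation differently:

```python
import shlex

def build_eval_command(**kwargs):
    """Translate keyword arguments into an eval.py invocation.

    Illustrative sketch: boolean kwargs become bare flags, None values
    are dropped, everything else becomes "--key value".
    """
    cmd = ["python", "eval.py"]
    for key, value in kwargs.items():
        flag = f"--{key}"
        if isinstance(value, bool):
            if value:  # True -> bare flag, False -> omitted
                cmd.append(flag)
        elif value is not None:
            cmd.extend([flag, str(value)])
    return cmd

cmd = build_eval_command(game_name="remnant", agent="LongContextAgent",
                         max_steps=300, overwrite=True, llm_provider=None)
print(shlex.join(cmd))
```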

Agent Configuration

These parameters control which agent is created and how it is configured. When agents_config is provided, all other agent parameters in this group are ignored.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `agent` | `str` | `"HumanAgent"` | The agent class to evaluate. See Supported Agents for the full list. |
| `agent_id` | `str` | `"agent_adam_davis"` | A unique identifier for the agent. Used as the agent's subdirectory name inside the run directory. |
| `agent_name` | `str` | `"adam_davis"` | A human-readable name for the agent. Appears in game dialogue and logs. |
| `llm_name` | `str` | `None` | The LLM model identifier. Required for all LLM-based agents. Examples: `"gpt-5"`, `"Qwen/Qwen3-4B"`, `"Qwen/Qwen3-32B"`. Models starting with `gpt` use the OpenAI provider; all others are loaded locally via vLLM / HuggingFace. |
| `llm_provider` | `str` | `None` | Which LLM provider to use. One of `"openai"`, `"azure"`, `"azure_openai"`, `"claude"`, `"gemini"`, `"vllm"`, or `"huggingface"`. When `None`, the provider is auto-detected from `llm_name`. |
| `agents_config` | `str` | `None` | Path to a JSON file defining multiple agents for multi-agent evaluation. When provided, the single-agent parameters (`agent`, `agent_id`, `agent_name`, `llm_name`, `llm_provider`, and all `enable_*` flags) are ignored. |
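The auto-detection rule for `llm_provider` can be sketched as follows. `detect_provider` is a hypothetical helper illustrating the documented rule only; the real resolution logic may cover more cases:

```python
def detect_provider(llm_name: str) -> str:
    """Guess the LLM provider from the model identifier.

    Documented rule: names starting with "gpt" use the OpenAI
    provider; all other models are loaded locally (vLLM / HuggingFace).
    """
    if llm_name.startswith("gpt"):
        return "openai"
    return "vllm"

print(detect_provider("gpt-5"))          # openai
print(detect_provider("Qwen/Qwen3-4B"))  # vllm
```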

Agent Memory Modules

These boolean flags toggle optional memory-augmentation modules on the agent. The flags are independent of one another, so you can enable any combination.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `enable_short_term_memory` | `bool` | `False` | Enable the short-term memory module. Gives the agent a sliding window of recent observations that persists across steps. |
| `short_term_memory_size` | `int` | `5` | Number of recent observations to keep in the short-term memory sliding window. Only used when `enable_short_term_memory` is `True`. |
| `enable_reflection` | `bool` | `False` | Enable the reflection module. The agent periodically generates high-level reflections about its experience and stores them for future retrieval. |
| `enable_summarization` | `bool` | `False` | Enable the summarization module. The agent periodically summarizes its accumulated observations to compress context. |
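As an illustration of the sliding-window behaviour, here is a minimal sketch. `ShortTermMemory` is a hypothetical class, not the actual AgentOdyssey module:

```python
from collections import deque

class ShortTermMemory:
    """Sliding window of the N most recent observations."""

    def __init__(self, size: int = 5):
        self.window = deque(maxlen=size)

    def add(self, observation: str) -> None:
        # When the window is full, the oldest observation falls off.
        self.window.append(observation)

    def recall(self) -> list:
        return list(self.window)

mem = ShortTermMemory(size=3)
for step in range(5):
    mem.add(f"obs-{step}")
print(mem.recall())  # only the 3 most recent observations remain
```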

Game & Environment

These parameters select which game world to run and optionally override the default asset paths.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `game_name` | `str` | `"base"` | The game variant to run. Built-in options: `"base"` (the default game), `"remnant"`, `"mark"`, etc. For custom generated worlds, use the name passed to `AgentOdyssey.generate()`. The game code is loaded from `games/generated/<game_name>/` and assets from `assets/generated/<game_name>/`. |
| `world_definition_path` | `str` | `None` | Explicit path to a world definition JSON file. When `None`, the path is automatically resolved from `game_name` as `assets/generated/<game_name>/world_definitions/default.json`. |
| `env_config_path` | `str` | `None` | Explicit path to the initial environment config JSON or JSONL file. When `None`, the path is automatically resolved from `game_name` as `assets/generated/<game_name>/env_configs/initial.json`. |
| `enable_obs_valid_actions` | `bool` | `False` | Include the list of valid actions in each observation. Required for `RandomAgent` (which selects uniformly from valid actions). Optional for other agents. |
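The default path resolution can be sketched as follows. `resolve_asset_paths` is a hypothetical helper that just follows the rules in the table:

```python
def resolve_asset_paths(game_name, world_definition_path=None, env_config_path=None):
    """Resolve default asset paths from game_name.

    Explicit paths, when given, take precedence over the defaults.
    """
    world = world_definition_path or \
        f"assets/generated/{game_name}/world_definitions/default.json"
    env = env_config_path or \
        f"assets/generated/{game_name}/env_configs/initial.json"
    return world, env

world, env = resolve_asset_paths("remnant")
print(world)  # assets/generated/remnant/world_definitions/default.json
print(env)    # assets/generated/remnant/env_configs/initial.json
```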

Execution Control

These parameters control how many steps to run, reproducibility, and run resumption.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_steps` | `int` | `300` | Maximum number of environment steps before the episode is terminated. The episode may also end earlier if the game's termination condition is met. |
| `seed` | `int` | `42` | Random seed for reproducibility. Controls environment randomness (e.g. NPC behaviour, loot drops). |
| `resume_from_step` | `int` | `None` | Resume a previously interrupted run from the given step number. Requires that the run directory already exists and that `cumulative_config_save` was enabled during the original run. |
| `enforce_same_hardware` | `bool` | `False` | When resuming a run, verify that the current hardware (GPU model, CPU, RAM) matches the hardware recorded in the original run. Raises an error on mismatch. Useful for ensuring reproducible benchmarks. |

Output & Logging

These parameters control where outputs are saved and how frequently checkpoints are written.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `output_dir` | `str` | `"output"` | Root directory for all evaluation outputs. Each run creates a subdirectory tree under this path. |
| `run_dir` | `str` | `None` | Explicit path for this run's output directory. When `None`, the path is auto-generated as `<output_dir>/game_<game_name>/<llm_name>/<agent>/<extras>/`. |
| `extra_dir` | `str` | `None` | An additional directory level appended under `run_dir`. Useful for organising multiple runs with the same configuration (e.g. different seeds). |
| `overwrite` | `bool` | `False` | If the run directory already exists, delete it and start fresh. Without this flag, the run will continue from the existing state. |
| `cumulative_config_save` | `bool` | `False` | Save the environment config cumulatively each step. Required if you intend to use `resume_from_step`. |
| `debug` | `bool` | `False` | Deprecated alias for `cumulative_config_save`. |
| `memory_dir` | `str` | `"memory"` | Name of the subdirectory (under each agent's run directory) where memory checkpoints are saved. |
| `agent_memory_save_frequency` | `int` | `5` | Save the agent's memory checkpoint every N environment steps. Set to `0` or `None` to disable periodic saves (a final checkpoint is always saved at episode end). |
| `save_dep_graph_steps` | `int` | `None` | Save a dependency graph snapshot every N steps. When `None`, dependency tracking is disabled entirely. |
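The auto-generated `run_dir` pattern can be sketched as follows. `auto_run_dir` is a hypothetical helper, and the `extras` segment is assumed to encode the enabled memory modules (e.g. `no_extras`, `with_reflection`):

```python
def auto_run_dir(output_dir, game_name, llm_name, agent, extras="no_extras"):
    """Build the default run directory following the documented pattern:
    <output_dir>/game_<game_name>/<llm_name>/<agent>/<extras>/
    """
    return f"{output_dir}/game_{game_name}/{llm_name}/{agent}/{extras}"

print(auto_run_dir("output", "remnant", "gpt-5", "LongContextAgent"))
# output/game_remnant/gpt-5/LongContextAgent/no_extras
```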

agentodyssey run

The CLI mirrors every parameter from AgentOdyssey.run() using hyphenated flag names:

```bash
agentodyssey run \
  --game-name remnant \
  --agent LongContextAgent \
  --llm-provider openai \
  --llm-name gpt-5 \
  --max-steps 300 \
  --seed 42 \
  --output-dir output \
  --overwrite \
  --enable-reflection
```
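The naming rule amounts to replacing underscores with hyphens. A trivial sketch (hypothetical helper):

```python
def to_cli_flag(param: str) -> str:
    """Convert a Python parameter name to its hyphenated CLI flag."""
    return "--" + param.replace("_", "-")

print(to_cli_flag("enable_short_term_memory"))  # --enable-short-term-memory
```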

Multi-Agent Configuration

For multi-agent evaluation, provide a JSON file via `agents_config` (Python API), `--agents_config` (eval.py), or `--agents-config` (CLI) instead of specifying a single agent on the command line. The JSON file contains an `"agents"` array where each entry defines one agent:

```json
{
  "agents": [
    {
      "agent_type": "LongContextAgent",
      "agent_id": "agent_adam_davis",
      "agent_name": "adam_davis",
      "llm_name": "Qwen/Qwen3-4B",
      "llm_provider": null,
      "enable_short_term_memory": false,
      "short_term_memory_size": 5,
      "enable_reflection": false,
      "enable_summarization": false
    },
    {
      "agent_type": "VanillaRAGAgent",
      "agent_id": "agent_bella_chen",
      "agent_name": "bella_chen",
      "llm_name": "gpt-5",
      "llm_provider": "openai",
      "enable_short_term_memory": false,
      "short_term_memory_size": 5,
      "enable_reflection": true,
      "enable_summarization": false
    }
  ]
}
```

Each agent entry supports the following fields:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `agent_type` | `str` | Yes | Agent class name (see Supported Agents). |
| `agent_id` | `str` | Yes | Unique identifier. Must be different for each agent. |
| `agent_name` | `str` | Yes | Human-readable name (used in game dialogue). |
| `llm_name` | `str \| null` | No | LLM model identifier. Required for LLM-based agents. |
| `llm_provider` | `str \| null` | No | LLM provider override (e.g. `"openai"`, `"vllm"`). Auto-detected if `null`. |
| `enable_short_term_memory` | `bool` | No | Enable short-term memory module. Default `false`. |
| `short_term_memory_size` | `int` | No | Sliding window size for short-term memory. Default `5`. |
| `enable_reflection` | `bool` | No | Enable reflection module. Default `false`. |
| `enable_summarization` | `bool` | No | Enable summarization module. Default `false`. |
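The required-field and uniqueness rules above can be checked with a small sketch. `validate_agents_config` is a hypothetical helper, not part of AgentOdyssey:

```python
import json

REQUIRED = ("agent_type", "agent_id", "agent_name")

def validate_agents_config(config):
    """Sanity-check an agents_config dict against the field table:
    required fields present, agent_id unique per entry.
    """
    agents = config.get("agents", [])
    if not agents:
        raise ValueError("'agents' array is missing or empty")
    seen = set()
    for entry in agents:
        for field in REQUIRED:
            if field not in entry:
                raise ValueError(f"agent entry missing required field: {field}")
        if entry["agent_id"] in seen:
            raise ValueError(f"duplicate agent_id: {entry['agent_id']}")
        seen.add(entry["agent_id"])

config = json.loads("""
{"agents": [
  {"agent_type": "LongContextAgent",
   "agent_id": "agent_adam_davis",
   "agent_name": "adam_davis"}
]}
""")
validate_agents_config(config)  # passes silently
```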

Usage:

```bash
# eval.py
python eval.py --game_name remnant --agents_config assets/agents_configs/three_agents.json

# CLI
agentodyssey run --game-name remnant --agents-config assets/agents_configs/three_agents.json
```

```python
# Python API
AgentOdyssey.run(game_name="remnant", agents_config="assets/agents_configs/three_agents.json")
```

Multi-Agent Evaluation Still in Early Stages

The multi-agent evaluation pipeline is functional but has not yet been extensively tested or benchmarked. Please expect some rough edges and report any issues you encounter on the GitHub repository.

Output Directory Structure

Each evaluation run produces the following directory tree:

```
<output_dir>/
└── game_<game_name>/
    └── <llm_name>/
        └── <AgentType>/
            └── <no_extras | with_short_term_memory | with_reflection | with_summarization>/
                ├── config.json          # (or config.jsonl if cumulative_config_save is enabled)
                └── <agent_id>/
                    ├── agent_log.jsonl  # per-step log entries
                    └── <memory_dir>/    # agent memory checkpoints
```

Each line in agent_log.jsonl is a JSON object:

```json
{
  "step": 0,
  "action": "go north",
  "decision_time": 1.234,
  "num_input_tokens": 512,
  "num_output_tokens": 32,
  "invalid_action": false,
  "reward": {"exploration": 1, "combat": 0, "quest": 0, "total": 1},
  "observation": "You are in the castle hall...",
  "response": "{\"action\": \"go north\"}"
}
```
| Field | Description |
| --- | --- |
| `step` | Zero-indexed step number. |
| `action` | The action string the agent chose. |
| `decision_time` | Wall-clock seconds the agent took to produce the action. |
| `num_input_tokens` | Number of input tokens sent to the LLM (0 for non-LLM agents). |
| `num_output_tokens` | Number of output tokens received from the LLM. |
| `invalid_action` | Whether the action was rejected by the environment's rule engine. |
| `reward` | Reward breakdown by category and total. |
| `observation` | The text observation the agent received before acting. |
| `response` | The raw LLM response string (or `""` for non-LLM agents). |
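Because each line is an independent JSON object, the log is easy to aggregate. A minimal sketch assuming the fields above (`summarize_log` is a hypothetical helper):

```python
import json

def summarize_log(lines):
    """Aggregate per-step entries from an agent_log.jsonl file."""
    total_reward = 0
    total_tokens = 0
    invalid = 0
    for line in lines:
        entry = json.loads(line)
        total_reward += entry["reward"]["total"]
        total_tokens += entry["num_input_tokens"] + entry["num_output_tokens"]
        invalid += entry["invalid_action"]  # False counts as 0
    return {"steps": len(lines), "total_reward": total_reward,
            "total_tokens": total_tokens, "invalid_actions": invalid}

sample = [
    '{"step": 0, "action": "go north", "decision_time": 1.2, '
    '"num_input_tokens": 512, "num_output_tokens": 32, '
    '"invalid_action": false, '
    '"reward": {"exploration": 1, "combat": 0, "quest": 0, "total": 1}, '
    '"observation": "...", "response": ""}',
]
print(summarize_log(sample))
```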