Crusader_Decomp/docs/usecode-tool-improvement-plan.md

USECODE Tool Improvement Plan

Purpose

This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.

The intent is not to copy Pentagram or crusader-disasm wholesale, but to extract the parts that are genuinely useful for the current workspace toolchain:

  • tools/poc_crusader_usecode_parser.py
  • tools/export_usecode_pseudocode.py
  • the extracted owner-loaded corpus under USECODE/EUSECODE_extracted/

Short version

The most useful next upgrades are:

  1. make the decoder tables more authoritative
  2. decode loop/selector idioms into real structured searches
  3. improve intrinsic naming and signatures
  4. distinguish code from trailers more rigorously
  5. add corpus-level pattern clustering and family annotations
  6. keep strengthening the runtime bridge back into the retail binary

Current status

Implemented in the current local parser/exporter batch:

  • first evidence-backed class/slot aliasing for spawned or called helpers, so common wrappers now render with class names and known slot names such as FREE.waitNTimerTicks(...) instead of raw class_0A0C_slot_32(...)
  • first real loop-selector decoding for the common nearby_items(...) family/shape searches used by alarm and trigger bodies
  • structured rendering now upgrades the simpler selector loops to real for item in nearby_items(...) output instead of raw loopscr comment runs
  • a second common selector family now renders as readable selector_0x42(arg0=..., arg1=..., arg2=..., origin=...) signatures, and the simpler back-edge cases upgrade to for ... in selector_0x42(...) instead of raw loopscr 0x42 comment runs
  • full corpus export regenerated through tools/export_usecode_pseudocode.py, so the checked-in pseudocode corpus matches the improved renderer

Still open after this batch:

  • broader selector mini-language coverage beyond the common nearby_items(...) forms and the currently opaque but readable selector_0x42(...) fallback
  • more wrapper aliasing than the currently verified FREE.waitNTimerTicks seed entry
  • a more authoritative opcode metadata table instead of the current mixed declarative/heuristic decoder
  • corpus-level clustering/index outputs feeding back into inline annotations

Priority 1: Authoritative opcode metadata

What to borrow

From Pentagram and crusader-disasm:

  • stable opcode names
  • operand-shape knowledge
  • special handling for records like SYMBOL_INFO, LINE_NUMBER, PROCESS_EXCLUDE, and END

Why it matters

The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.

Concrete change

Move the per-opcode knowledge into a single explicit table describing:

  • mnemonic
  • stack effect where known
  • immediate layout
  • control-flow behavior
  • whether the opcode is normal code, metadata, or trailer-oriented
  • whether the opcode participates in loop selector mini-languages
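
As a rough illustration of the table shape described above, here is a minimal sketch. The opcode numbers, mnemonics, and stack-effect values are placeholders invented for this example and are not taken from Pentagram, crusader-disasm, or the current parser; only the set of fields follows the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Kind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    JUMP = "jump"
    COND_JUMP = "cond_jump"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()     # immediate layout, e.g. ("u8",)
    stack_delta: Optional[int] = None    # None means "not yet known"
    flow: Flow = Flow.FALLTHROUGH
    kind: Kind = Kind.CODE
    loop_selector: bool = False          # part of a loop-selector mini-language


# Placeholder entries: opcode numbers and names are hypothetical.
OPCODES: Dict[int, OpcodeInfo] = {
    0x00: OpcodeInfo("example_push_u8", ("u8",), stack_delta=1),
    0x01: OpcodeInfo("example_jump", ("s16",), stack_delta=0, flow=Flow.JUMP),
    0x02: OpcodeInfo("example_ret", flow=Flow.RETURN),
    0x03: OpcodeInfo("example_line_number", ("u16",), kind=Kind.METADATA),
}

IMMEDIATE_SIZES = {"u8": 1, "s8": 1, "u16": 2, "s16": 2, "u32": 4}


def instruction_length(opcode: int) -> int:
    """Length in bytes: one opcode byte plus all declared immediates."""
    info = OPCODES[opcode]
    return 1 + sum(IMMEDIATE_SIZES[t] for t in info.immediates)
```

With a table like this, the decoder loop can stay generic and regression tests can iterate over the table instead of over decode branches.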

Expected payoff

  • fewer ad hoc decode branches
  • easier regression testing against the text corpus
  • cleaner IR for later restructuring passes

Priority 2: Real loop/selector decoding

What to borrow

From the older disassembly corpus:

  • the meaning of loopscr tokens such as end, ==, item->shape, item->family, and typed literal selectors
  • the visible repeated patterns in alarm-family and trigger-family bodies

Why it matters

Where a selector form is not yet decoded, the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like loopscr value_u8=0x40 instead of the underlying search semantics.

That is the main reason scripts like ALARMHAT still read as partially machine-shaped even though the overall behavior is already understandable.

Concrete change

Introduce a small loop-selector IR layer so common loop forms render as something closer to:

for item in nearby_items(shape=0x04D0, origin=arg_06):

or:

for candidate in nearby_items(family=6, origin=arg_06):

The first target is not full generality. The first target is the set of repeated loop forms already seen in:

  • NPCTRIG
  • ALARMHAT
  • ALARMBOX
  • ALRMTRIG
  • nearby environmental families
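
A minimal sketch of what such a selector IR node and its renderer could look like follows. The token spelling and the assumed token order (field, literal, ==, end) are illustrative guesses and not the verified loopscr encoding; NearbyItems and match_simple_selector are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

# A token is either a keyword string or a (type, value) typed literal.
Token = Union[str, Tuple[str, int]]


@dataclass(frozen=True)
class NearbyItems:
    field: str    # "shape" or "family"
    value: int
    origin: str   # rendered name of the search origin, e.g. "arg_06"

    def render(self, var: str = "item") -> str:
        # Shapes read better as 4-digit hex; families as small decimals.
        val = f"0x{self.value:04X}" if self.field == "shape" else str(self.value)
        return f"for {var} in nearby_items({self.field}={val}, origin={self.origin}):"


def match_simple_selector(tokens: List[Token], origin: str) -> Optional[NearbyItems]:
    """Recognise the assumed form: item->FIELD, literal, ==, end.

    Returns None for anything else so the caller can fall back to the
    raw loopscr comment rendering.
    """
    if len(tokens) != 4:
        return None
    field_tok, literal, op, end = tokens
    if op != "==" or end != "end" or not isinstance(literal, tuple):
        return None
    if field_tok not in ("item->shape", "item->family"):
        return None
    return NearbyItems(field_tok.split("->")[1], literal[1], origin)
```

Returning None for unrecognised forms keeps the decoder conservative: anything outside the small repeated set still renders in its raw form.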

Expected payoff

  • much better readability for object-searching scripts
  • better gameplay interpretation of trigger/controller classes
  • a cleaner path to naming common search idioms

Priority 3: Better intrinsic naming and signatures

What to borrow

From Pentagram and crusader-disasm:

  • historical intrinsic names
  • text-mined call arities and stack cleanup behavior
  • rough prototype guesses from the older corpus tools

Why it matters

Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like Intrinsic0007() or generic placeholders like class_0A18_slot_20(...).

The older tools already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than rename authority.

Concrete change

Build a local intrinsic metadata table with confidence levels:

  • verified
  • strong hint
  • weak hint

Populate it from:

  • Pentagram tables
  • usecode_opcodes.txt
  • mined calli/add sp patterns from crusader_disasm.txt
  • current repo notes where compiled-side names are already justified
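
One possible shape for that table, sketched under assumptions: the field names, the min_conf threshold, and the example entry are hypothetical. The point is only that a name below the threshold shows up as a comment hint rather than as a rename.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List, Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicInfo:
    index: int
    name: str
    confidence: Confidence
    source: str                  # where the hint came from
    arity: Optional[int] = None  # None means "not yet known"


def render_call(
    index: int,
    table: Dict[int, IntrinsicInfo],
    args: List[str],
    min_conf: Confidence = Confidence.STRONG_HINT,
) -> str:
    """Render a call, promoting the name only at or above min_conf."""
    arglist = ", ".join(args)
    placeholder = f"Intrinsic{index:04X}"
    info = table.get(index)
    if info is None:
        return f"{placeholder}({arglist})"
    if info.confidence >= min_conf:
        return f"{info.name}({arglist})"
    # Below the threshold: keep the placeholder, surface the hint as a comment.
    return f"{placeholder}({arglist})  # hint: {info.name} ({info.source})"
```

Keeping the source on every entry makes it possible to audit later which names came from older tools and which are backed by the current binary.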

Expected payoff

  • more readable pseudocode
  • safer future promotion of intrinsic names
  • less confusion between Remorse-only, Regret-only, and cross-game vocabulary

Priority 4: Explicit code-versus-trailer boundaries

What to borrow

From Pentagram's symbol-info/debug-symbol handling:

  • the idea that 0x5C points into structured trailer data
  • the practical distinction between executable body and debug/local trailer rows

Why it matters

The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-ret metadata gets rendered as live code.

The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.

Concrete change

Make trailer parsing explicit in the IR:

  • code extent
  • trailer extent
  • debug symbol rows
  • line-number records
  • terminal END
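
The fields above could be carried by a small layout record like the following sketch. All class and field names are hypothetical, and the half-open extent convention and the ordering check (code before trailer) are assumptions for illustration rather than confirmed properties of the format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class DebugSymbolRow:
    name: str
    raw: bytes  # undecoded row bytes, kept for later analysis


@dataclass
class BodyLayout:
    code_extent: Tuple[int, int]      # [start, end) byte offsets
    trailer_extent: Tuple[int, int]   # [start, end) byte offsets
    debug_symbols: List[DebugSymbolRow] = field(default_factory=list)
    line_numbers: List[Tuple[int, int]] = field(default_factory=list)
    has_terminal_end: bool = False

    def validate(self) -> None:
        """Reject layouts where code and trailer overlap or are reversed."""
        cs, ce = self.code_extent
        ts, te = self.trailer_extent
        if not (cs <= ce <= ts <= te):
            raise ValueError(
                f"bad extents: code={self.code_extent} trailer={self.trailer_extent}"
            )

    def is_code(self, offset: int) -> bool:
        cs, ce = self.code_extent
        return cs <= offset < ce

    def is_trailer(self, offset: int) -> bool:
        ts, te = self.trailer_extent
        return ts <= offset < te
```

Later passes, such as the renderer or a body-mining helper, can then ask is_code(offset) instead of re-deriving the boundary themselves.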

Expected payoff

  • safer whole-corpus export
  • better local naming and source-like output
  • fewer false positives when mining repeated code bodies

Priority 5: Corpus-level pattern clustering

What to borrow

From the crusader-disasm corpus mindset:

  • treat the full body set as a searchable evidence base, not only as isolated scripts

Why it matters

The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.

Concrete change

Add corpus analysis helpers that cluster or index:

  • exact repeated bodies
  • normalized repeated bodies
  • repeated loop-selector templates
  • repeated spawn/call templates by class and slot

Those results should feed back into readable annotations like:

  • shared interaction stub
  • alarm-family controller template
  • common trigger setup pattern
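
A minimal sketch of exact and normalized body clustering follows. It assumes each body is already decoded into (mnemonic, operand bytes) pairs; that representation, the function names, and the choice to normalize by dropping operands entirely are illustrative assumptions, not the current parser's API.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Instruction = Tuple[str, bytes]  # (mnemonic, raw operand bytes)


def exact_key(body: Sequence[Instruction]) -> str:
    """Key that matches only byte-identical instruction sequences."""
    h = hashlib.sha256()
    for mnemonic, operands in body:
        h.update(mnemonic.encode("utf-8"))
        h.update(b"\0")
        h.update(len(operands).to_bytes(2, "little"))
        h.update(operands)
    return h.hexdigest()


def normalized_key(body: Sequence[Instruction]) -> str:
    """Key that ignores operand values and keeps only the mnemonic sequence."""
    text = "\n".join(mnemonic for mnemonic, _ in body)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cluster(
    corpus: Dict[str, Sequence[Instruction]],
    key_fn: Callable[[Sequence[Instruction]], str],
) -> List[List[str]]:
    """Group body names by key; return only groups with 2+ members."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(corpus):
        groups[key_fn(corpus[name])].append(name)
    return sorted(g for g in groups.values() if len(g) > 1)
```

Running cluster with exact_key first and normalized_key second gives two tiers of evidence: the first finds literal shared stubs, the second finds templates that differ only in constants.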

Expected payoff

  • faster triage of interesting scripts
  • better distinction between generic templates and unique gameplay logic
  • fewer overinterpretations of tiny bodies

Priority 6: Stronger runtime bridge and import path

What to borrow

From the local repo workflow rather than directly from Pentagram:

  • the current runtime anchors already recorded in runtime_vm_ir.tsv
  • the Ghidra-side annotation path planned in the USECODE notes

Why it matters

The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.

Concrete change

Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:

  • 000d:51fd
  • 000d:5572
  • 000d:46ec
  • 000d:21ed
  • 000d:22bc
  • 000d:ebe3
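
As a small illustration of carrying those anchors through export, the sketch below parses segment:offset strings and loads them from tab-separated text. The column names address and role are hypothetical; the actual layout of runtime_vm_ir.tsv is not described in this note.

```python
import csv
import io
from typing import Dict, Tuple

Anchor = Tuple[int, int]  # (segment, offset)


def parse_anchor(text: str) -> Anchor:
    """Parse a 'ssss:oooo' hex address such as '000d:51fd'."""
    segment, offset = text.strip().split(":")
    return int(segment, 16), int(offset, 16)


def format_anchor(anchor: Anchor) -> str:
    """Render an anchor back in the same lowercase hex form."""
    return f"{anchor[0]:04x}:{anchor[1]:04x}"


def load_anchors(tsv_text: str) -> Dict[Anchor, str]:
    """Map each anchor to its role label.

    Assumes a header row with 'address' and 'role' columns (hypothetical).
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {parse_anchor(row["address"]): row["role"] for row in reader}
```

Parsing to integer pairs rather than keeping strings makes range and ordering checks straightforward when correlating with Ghidra-side addresses.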

Expected payoff

  • easier Ghidra-side correlation
  • safer promotion of slot/event names
  • better compiled-to-script navigation

Suggested implementation order

  1. stabilize opcode metadata tables
  2. formalize trailer parsing in IR
  3. implement first real loop-selector decoder for common shape and family searches
  4. add intrinsic metadata with confidence levels
  5. add corpus clustering/index helpers
  6. extend runtime-anchor export/import integration

What not to do yet

  • Do not chase full round-tripping first. Readability is still the higher-value frontier.
  • Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
  • Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.

Current best next step

The highest-leverage next step is loop-selector decoding.

That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.