MaddoScientisto 589bfc31ef Pseudocode decompialtion improvements and docs

2026-03-26 22:10:48 +01:00

USECODE Tool Improvement Plan

Purpose

This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.

The intent is not to copy Pentagram or crusader-disasm wholesale, but to extract the parts that are genuinely useful for the current workspace toolchain:

tools/poc_crusader_usecode_parser.py
tools/export_usecode_pseudocode.py
the extracted owner-loaded corpus under USECODE/EUSECODE_extracted/

Short version

The most useful next upgrades are:

make the decoder tables more authoritative
decode loop/selector idioms into real structured searches
improve intrinsic naming and signatures
distinguish code from trailers more rigorously
add corpus-level pattern clustering and family annotations
keep strengthening the runtime bridge back into the retail binary

Current status

Implemented in the current local parser/exporter batch:

first evidence-backed class/slot aliasing for spawned or called helpers, so common wrappers now render with class names and known slot names such as FREE.waitNTimerTicks(...) instead of raw class_0A0C_slot_32(...)
first real loop-selector decoding for the common nearby_items(...) family/shape searches used by alarm and trigger bodies
structured rendering now upgrades the simpler selector loops to real for item in nearby_items(...) output instead of raw loopscr comment runs
a second common selector family now renders as readable selector_0x42(arg0=..., arg1=..., arg2=..., origin=...) signatures, and the simpler back-edge cases upgrade to for ... in selector_0x42(...) instead of raw loopscr 0x42 comment runs
full corpus export regenerated through tools/export_usecode_pseudocode.py, so the checked-in pseudocode corpus matches the improved renderer

Still open after this batch:

broader selector mini-language coverage beyond the common nearby_items(...) forms and the currently opaque but readable selector_0x42(...) fallback
more wrapper aliasing than the currently verified FREE.waitNTimerTicks seed entry
a more authoritative opcode metadata table instead of the current mixed declarative/heuristic decoder
corpus-level clustering/index outputs feeding back into inline annotations

Priority 1: Authoritative opcode metadata

What to borrow

From Pentagram and crusader-disasm:

stable opcode names
operand-shape knowledge
special handling for records like SYMBOL_INFO, LINE_NUMBER, PROCESS_EXCLUDE, and END

Why it matters

The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.

Concrete change

Move the per-opcode knowledge into a single explicit table describing:

mnemonic
stack effect where known
immediate layout
control-flow behavior
whether the opcode is normal code, metadata, or trailer-oriented
whether the opcode participates in loop selector mini-languages
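One way the table could look is sketched below. It is a minimal illustration, not the parser's current layout: the class names, the `instruction_length` helper, and every row in `OPCODES` (byte values, mnemonics, immediate layouts, stack effects) are placeholders invented for the example, and it assumes each instruction is one opcode byte followed by fixed-size immediates.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Kind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    BRANCH = "branch"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()     # immediate layout, e.g. ("u16",)
    stack_effect: Optional[int] = None   # net bytes pushed; None means "not known yet"
    flow: Flow = Flow.FALLTHROUGH        # control-flow behavior
    kind: Kind = Kind.CODE               # normal code, metadata, or trailer-oriented
    loop_selector: bool = False          # participates in a loop-selector mini-language


IMMEDIATE_SIZES = {"u8": 1, "u16": 2, "u32": 4}

# Placeholder rows only: these byte values, layouts, and stack effects are
# invented for illustration and do not describe the real USECODE encoding.
OPCODES: Dict[int, OpcodeInfo] = {
    0xF0: OpcodeInfo("example_push", ("u16",), stack_effect=2),
    0xF1: OpcodeInfo("example_jump", ("u16",), stack_effect=0, flow=Flow.BRANCH),
    0xF2: OpcodeInfo("example_ret", stack_effect=0, flow=Flow.RETURN),
    0xF3: OpcodeInfo("example_loopscr", ("u8",), loop_selector=True),
    0xF4: OpcodeInfo("example_line_number", ("u16",), kind=Kind.METADATA),
}


def instruction_length(opcode: int) -> int:
    """Opcode byte plus the size of its declared immediates."""
    info = OPCODES[opcode]
    return 1 + sum(IMMEDIATE_SIZES[name] for name in info.immediates)


def unknown_stack_effects() -> List[str]:
    """Mnemonics whose stack effect is still undeclared."""
    return sorted(i.mnemonic for i in OPCODES.values() if i.stack_effect is None)
```

With a table of this shape, a decoder loop only needs `instruction_length` and the `flow`/`kind` fields, and a helper like `unknown_stack_effects()` gives a running list of rows that are still heuristic.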

Expected payoff

fewer ad hoc decode branches
easier regression testing against the text corpus
cleaner IR for later restructuring passes

Priority 2: Real loop/selector decoding

What to borrow

From the older disassembly corpus:

the meaning of loopscr tokens such as end, ==, item->shape, item->family, and typed literal selectors
the visible repeated patterns in alarm-family and trigger-family bodies

Why it matters

Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like loopscr value_u8=0x40 instead of the underlying search semantics.

That is the main reason scripts like ALARMHAT still read as partially machine-shaped even though the overall behavior is already understandable.

Concrete change

Introduce a small loop-selector IR layer so common loop forms render as something closer to:

for item in nearby_items(shape=0x04D0, origin=arg_06):

or:

for candidate in nearby_items(family=6, origin=arg_06):

The first target is not full generality. The first target is the set of repeated loop forms already seen in:

NPCTRIG
ALARMHAT
ALARMBOX
ALRMTRIG
nearby environmental families
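The IR layer could start as small as the sketch below. It assumes the selector bytes have already been decoded into symbolic tokens and that a simple search arrives in the order field, literal, `==`, `end`; that token shape and ordering, along with the names `NearbyItems` and `lift_selector`, are assumptions made for the example rather than the actual loopscr encoding.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# A token is (kind, value), e.g. ("field", "shape") or ("lit", 0x04D0).
Token = Tuple[str, object]


@dataclass(frozen=True)
class NearbyItems:
    field: str    # "shape" or "family"
    value: int
    origin: str   # name of the origin operand, e.g. "arg_06"

    def render(self, var: str = "item") -> str:
        # Shapes read better as 4-digit hex, families as small decimals.
        if self.field == "shape":
            value = f"0x{self.value:04X}"
        else:
            value = str(self.value)
        return f"for {var} in nearby_items({self.field}={value}, origin={self.origin}):"


def lift_selector(tokens: Sequence[Token], origin: str) -> Optional[NearbyItems]:
    """Recognise `<field> <literal> == end`; return None for anything else
    so the caller can fall back to the raw loopscr comment rendering."""
    if len(tokens) != 4:
        return None
    (k0, field), (k1, value), (k2, op), (k3, _) = tokens
    if (k0, k1, k2, k3) != ("field", "lit", "op", "end"):
        return None
    if field not in ("shape", "family") or op != "==" or not isinstance(value, int):
        return None
    return NearbyItems(str(field), value, origin)


tokens = [("field", "shape"), ("lit", 0x04D0), ("op", "=="), ("end", None)]
selector = lift_selector(tokens, origin="arg_06")
print(selector.render() if selector else "# unrecognised selector")
```

Returning `None` for unrecognised forms keeps the lifter conservative: anything outside the small repeated set stays in its faithful raw rendering instead of being guessed at.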

Expected payoff

much better readability for object-searching scripts
better gameplay interpretation of trigger/controller classes
a cleaner path to naming common search idioms

Priority 3: Better intrinsic naming and signatures

What to borrow

From Pentagram and crusader-disasm:

historical intrinsic names
text-mined call arities and stack cleanup behavior
rough prototype guesses from the older corpus tools

Why it matters

Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like Intrinsic0007() or generic placeholders like class_0A18_slot_20(...).

The older tools already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than as authority for renames.

Concrete change

Build a local intrinsic metadata table with confidence levels:

verified
strong hint
weak hint

Populate it from:

Pentagram tables
usecode_opcodes.txt
mined calli/add sp patterns from crusader_disasm.txt
current repo notes where compiled-side names are already justified
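A table with confidence levels might be sketched as follows. The class names, the `display_name` policy, and the example intrinsic names are hypothetical; the only detail taken from the current output is the `Intrinsic0007`-style placeholder, which is assumed here to be a four-digit hex index.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    name: str
    confidence: Confidence
    source: str   # where the hint came from, e.g. a table or mined pattern


class IntrinsicTable:
    def __init__(self) -> None:
        self._hints: Dict[int, List[IntrinsicHint]] = {}

    def add(self, index: int, hint: IntrinsicHint) -> None:
        self._hints.setdefault(index, []).append(hint)

    def display_name(self, index: int,
                     minimum: Confidence = Confidence.VERIFIED) -> str:
        """Best name at or above `minimum`; unverified names get a '?' suffix."""
        usable = [h for h in self._hints.get(index, []) if h.confidence >= minimum]
        if not usable:
            return f"Intrinsic{index:04X}"
        best = max(usable, key=lambda h: h.confidence)
        if best.confidence == Confidence.VERIFIED:
            return best.name
        return best.name + "?"


table = IntrinsicTable()
# Hypothetical entries, for illustration only.
table.add(0x0007, IntrinsicHint("exampleWeakName", Confidence.WEAK_HINT, "old corpus"))
table.add(0x0010, IntrinsicHint("exampleVerifiedName", Confidence.VERIFIED, "repo notes"))
print(table.display_name(0x0007))                                # placeholder kept
print(table.display_name(0x0007, minimum=Confidence.WEAK_HINT))  # hinted name, marked
print(table.display_name(0x0010))
```

Defaulting `minimum` to verified means a hint never renames anything unless the exporter explicitly opts in, which matches the rule that older sources are evidence, not authority.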

Expected payoff

more readable pseudocode
safer future promotion of intrinsic names
less confusion between Remorse-only, Regret-only, and cross-game vocabulary

Priority 4: Explicit code-versus-trailer boundaries

What to borrow

From Pentagram's symbol-info/debug-symbol handling:

the idea that 0x5C points into structured trailer data
the practical distinction between executable body and debug/local trailer rows

Why it matters

The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-ret metadata gets rendered as live code.

The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.

Concrete change

Make trailer parsing explicit in the IR:

code extent
trailer extent
debug symbol rows
line-number records
terminal END
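As a rough sketch, the listed pieces could be carried on a per-script record like the one below. The field names, the tuple shapes for debug rows and line-number records, and the half-open extent convention are all assumptions for illustration; the real trailer layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Extent:
    start: int   # inclusive byte offset
    end: int     # exclusive byte offset

    def __contains__(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass
class ScriptLayout:
    code: Extent
    trailer: Extent
    # Assumed row shapes: (offset, name) and (offset, source line).
    debug_symbols: List[Tuple[int, str]] = field(default_factory=list)
    line_numbers: List[Tuple[int, int]] = field(default_factory=list)
    has_terminal_end: bool = False

    def __post_init__(self) -> None:
        # The invariant the IR should enforce: code never runs into the trailer.
        if self.code.end > self.trailer.start:
            raise ValueError("code extent overlaps trailer extent")

    def is_live_code(self, offset: int) -> bool:
        """True only for offsets inside the executable body."""
        return offset in self.code


layout = ScriptLayout(code=Extent(0x00, 0x20), trailer=Extent(0x20, 0x30),
                      has_terminal_end=True)
print(layout.is_live_code(0x1F), layout.is_live_code(0x20))
```

Later passes (rendering, body mining, clustering) can then ask `is_live_code` instead of each re-deriving where the post-ret metadata begins.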

Expected payoff

safer whole-corpus export
better local naming and source-like output
fewer false positives when mining repeated code bodies

Priority 5: Corpus-level pattern clustering

What to borrow

From the crusader-disasm corpus mindset:

treat the full body set as a searchable evidence base, not only as isolated scripts

Why it matters

The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.

Concrete change

Add corpus analysis helpers that cluster or index:

exact repeated bodies
normalized repeated bodies
repeated loop-selector templates
repeated spawn/call templates by class and slot

Those results should feed back into readable annotations like:

shared interaction stub
alarm-family controller template
common trigger setup pattern
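Exact and normalized repeated-body clustering can share one helper, sketched below. The function name `cluster_bodies` and the toy normalizer are hypothetical; a real normalizer would mask whatever operands are judged irrelevant, which this note does not yet define.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Mapping


def cluster_bodies(
    bodies: Mapping[str, bytes],
    normalize: Callable[[bytes], bytes] = lambda body: body,
) -> Dict[str, List[str]]:
    """Group script names whose (optionally normalized) bodies are identical.

    Returns digest -> sorted names, keeping only groups with more than one
    member, since singletons are not evidence of a shared template.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(bodies):
        digest = hashlib.sha256(normalize(bodies[name])).hexdigest()
        groups[digest].append(name)
    return {d: names for d, names in groups.items() if len(names) > 1}


# Toy data, not real script bodies.
bodies = {"A": b"\x01\x02", "B": b"\x01\x02", "C": b"\x01\x03"}
print(list(cluster_bodies(bodies).values()))

# Toy normalizer that ignores the low bit of every byte, standing in for
# "mask out operands that differ between otherwise identical templates".
print(list(cluster_bodies(bodies, lambda b: bytes(x & 0xFE for x in b)).values()))
```

Running the same helper with the identity normalizer and then with progressively looser normalizers gives a cheap ladder from "byte-identical stub" to "same template, different operands".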

Expected payoff

faster triage of interesting scripts
better distinction between generic templates and unique gameplay logic
fewer overinterpretations of tiny bodies

Priority 6: Stronger runtime bridge and import path

What to borrow

From the local repo workflow rather than directly from Pentagram:

the current runtime anchors already recorded in runtime_vm_ir.tsv
the Ghidra-side annotation path planned in the USECODE notes

Why it matters

The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.

Concrete change

Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:

000d:51fd
000d:5572
000d:46ec
000d:21ed
000d:22bc
000d:ebe3
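A small helper for carrying these anchors through the export might look like the sketch below. It assumes the anchors are `segment:offset` pairs written in hexadecimal; the `Anchor` type and the trailing-comment format in `annotate` are hypothetical and not an existing exporter convention.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    segment: int
    offset: int

    @classmethod
    def parse(cls, text: str) -> "Anchor":
        """Parse 'ssss:oooo' (hex, any case) into numeric parts."""
        segment, offset = text.strip().split(":")
        return cls(int(segment, 16), int(offset, 16))

    def __str__(self) -> str:
        # Canonical lower-case form, matching how the anchors are listed above.
        return f"{self.segment:04x}:{self.offset:04x}"


def annotate(line: str, anchor: Anchor) -> str:
    """Append a runtime-anchor comment to one pseudocode or index line."""
    return f"{line}  # runtime anchor {anchor}"


anchor = Anchor.parse("000D:51FD")
print(annotate("example pseudocode line", anchor))
```

Parsing into numeric parts rather than keeping raw strings makes anchors comparable and sortable regardless of how a given note spelled them.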

Expected payoff

easier Ghidra-side correlation
safer promotion of slot/event names
better compiled-to-script navigation

Suggested implementation order

1. stabilize opcode metadata tables
2. formalize trailer parsing in IR
3. implement first real loop-selector decoder for common shape and family searches
4. add intrinsic metadata with confidence levels
5. add corpus clustering/index helpers
6. extend runtime-anchor export/import integration

What not to do yet

Do not chase full round-tripping first. Readability is still the higher-value frontier.
Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.

Current best next step

The highest-leverage next step is loop-selector decoding.

That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.
