USECODE Tool Improvement Plan
Purpose
This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.
The intent is not to copy Pentagram or crusader-disasm wholesale. The intent is to extract the parts that are genuinely useful for the current workspace toolchain:
- tools/poc_crusader_usecode_parser.py
- tools/export_usecode_pseudocode.py
- the extracted owner-loaded corpus under USECODE/EUSECODE_extracted/
Short version
The most useful next upgrades are:
- make the decoder tables more authoritative
- decode loop/selector idioms into real structured searches
- improve intrinsic naming and signatures
- distinguish code from trailers more rigorously
- add corpus-level pattern clustering and family annotations
- keep strengthening the runtime bridge back into the retail binary
Priority 1: Authoritative opcode metadata
What to borrow
From Pentagram and crusader-disasm:
- stable opcode names
- operand-shape knowledge
- special handling for records like SYMBOL_INFO, LINE_NUMBER, PROCESS_EXCLUDE, and END
Why it matters
The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.
Concrete change
Move the per-opcode knowledge into a single explicit table describing:
- mnemonic
- stack effect where known
- immediate layout
- control-flow behavior
- whether the opcode is normal code, metadata, or trailer-oriented
- whether the opcode participates in loop selector mini-languages
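As a minimal sketch of what such a table could look like, assuming a plain Python dataclass per opcode: the type names (OpcodeInfo, OpKind, Flow), the helper instruction_length, and the opcode values and mnemonics below are placeholders for illustration, not real Crusader opcodes or names from the current parser.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class OpKind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    BRANCH = "branch"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()    # immediate layout, e.g. ("u8",) or ("u16", "u16")
    stack_delta: Optional[int] = None   # None means "stack effect not known yet"
    flow: Flow = Flow.FALLTHROUGH       # control-flow behavior
    kind: OpKind = OpKind.CODE          # normal code, metadata, or trailer-oriented
    loop_selector: bool = False         # part of a loop selector mini-language


IMMEDIATE_SIZES = {"u8": 1, "u16": 2, "u32": 4}

# Placeholder rows only: these opcode values and mnemonics are hypothetical.
OPCODES: Dict[int, OpcodeInfo] = {
    0xF0: OpcodeInfo("example_push_u8", ("u8",), stack_delta=2),
    0xF1: OpcodeInfo("example_jump", ("u16",), stack_delta=0, flow=Flow.BRANCH),
    0xF2: OpcodeInfo("example_line_number", ("u16",), kind=OpKind.METADATA),
}


def instruction_length(opcode: int) -> int:
    """Opcode byte plus the size of every declared immediate."""
    info = OPCODES[opcode]
    return 1 + sum(IMMEDIATE_SIZES[name] for name in info.immediates)
```

With the layout declared once per opcode, the decoder loop can derive instruction lengths from the table instead of carrying a separate branch per opcode.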
Expected payoff
- fewer ad hoc decode branches
- easier regression testing against the text corpus
- cleaner IR for later restructuring passes
Priority 2: Real loop/selector decoding
What to borrow
From the older disassembly corpus:
- the meaning of loopscr tokens such as end, ==, item->shape, item->family, and typed literal selectors
- the visible repeated patterns in alarm-family and trigger-family bodies
Why it matters
Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like loopscr value_u8=0x40 instead of the underlying search semantics.
That is the main reason scripts like ALARMHAT still read as partially machine-shaped even though the overall behavior is already understandable.
Concrete change
Introduce a small loop-selector IR layer so common loop forms render as something closer to:
for item in nearby_items(shape=0x04D0, origin=arg_06):
or:
for candidate in nearby_items(family=6, origin=arg_06):
The first target is not full generality. The first target is the set of repeated loop forms already seen in:
- NPCTRIG
- ALARMHAT
- ALARMBOX
- ALRMTRIG
- nearby environmental families
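One possible shape for that IR layer, sketched under assumptions: NearbySearch, render_search, and render_or_fallback are hypothetical names, and the sketch covers only rendering, not the decoding of loopscr bytes into the node. Selectors that have not been recognized fall back to the existing raw comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NearbySearch:
    """Hypothetical IR node for one recognized loop selector."""
    field: str          # "shape" or "family"
    value: int          # the literal the selector compares against
    origin: str         # operand the search is centred on, e.g. "arg_06"
    var: str = "item"   # loop variable name used in the rendered output


def render_search(search: NearbySearch) -> str:
    # Shapes read better as 4-digit hex; families are small decimal numbers.
    if search.field == "shape":
        literal = f"0x{search.value:04X}"
    else:
        literal = str(search.value)
    return (
        f"for {search.var} in "
        f"nearby_items({search.field}={literal}, origin={search.origin}):"
    )


def render_or_fallback(search: Optional[NearbySearch], raw_comment: str) -> str:
    """Render the structured form when known, else keep the raw selector comment."""
    if search is not None:
        return render_search(search)
    return f"# {raw_comment}"
```

Keeping the fallback path means unrecognized selector forms lose nothing: they continue to render exactly as faithfully as they do today.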
Expected payoff
- much better readability for object-searching scripts
- better gameplay interpretation of trigger/controller classes
- a cleaner path to naming common search idioms
Priority 3: Better intrinsic naming and signatures
What to borrow
From Pentagram and crusader-disasm:
- historical intrinsic names
- text-mined call arities and stack cleanup behavior
- rough prototype guesses from the older corpus tools
Why it matters
Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like Intrinsic0007() or generic placeholders like class_0A18_slot_20(...).
The older tool lines already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than rename authority.
Concrete change
Build a local intrinsic metadata table with confidence levels:
- verified
- strong hint
- weak hint
Populate it from:
- Pentagram tables
- usecode_opcodes.txt
- mined calli/add sp patterns from crusader_disasm.txt
- current repo notes where compiled-side names are already justified
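A rough sketch of how the confidence levels could gate renaming, assuming only verified names replace the placeholder while hints appear as annotations. The names Confidence, IntrinsicHint, and display_name, the annotation comment style, and the example name used in the test are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    name: str                        # candidate name from one of the sources
    confidence: Confidence
    source: str                      # where the hint came from, kept for auditing
    arg_bytes: Optional[int] = None  # stack cleanup size, if mined


def display_name(index: int, hint: Optional[IntrinsicHint]) -> str:
    """Only verified names replace the placeholder; hints are shown alongside it."""
    placeholder = f"Intrinsic{index:04X}"
    if hint is None:
        return placeholder
    if hint.confidence is Confidence.VERIFIED:
        return hint.name
    return f"{placeholder} /* {hint.confidence.name.lower()}: {hint.name} */"
```

Recording the source on every row keeps later promotion decisions traceable to the evidence that justified them.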
Expected payoff
- more readable pseudocode
- safer future promotion of intrinsic names
- less confusion between Remorse-only, Regret-only, and cross-game vocabulary
Priority 4: Explicit code-versus-trailer boundaries
What to borrow
From Pentagram's symbol-info/debug-symbol handling:
- the idea that 0x5C points into structured trailer data
- the practical distinction between executable body and debug/local trailer rows
Why it matters
The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-ret metadata gets rendered as live code.
The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.
Concrete change
Make trailer parsing explicit in the IR:
- code extent
- trailer extent
- debug symbol rows
- line-number records
- terminal END
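The fields above could be carried on a per-body layout record. This is a sketch under assumptions: Extent and BodyLayout are hypothetical names, offsets are taken to be byte offsets within one script body, and the row types for debug symbols and line numbers are left as plain tuples because their real layout is not specified here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Extent:
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset

    def __contains__(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass
class BodyLayout:
    code: Extent
    trailer: Extent
    debug_symbols: List[Tuple] = field(default_factory=list)  # row shape not fixed here
    line_numbers: List[Tuple] = field(default_factory=list)   # row shape not fixed here
    has_terminal_end: bool = False

    def is_code(self, offset: int) -> bool:
        """True only for offsets inside the executable extent."""
        return offset in self.code
```

Later passes can then ask the layout whether an offset is code before decoding it, instead of relying on the renderer to hide post-ret bytes.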
Expected payoff
- safer whole-corpus export
- better local naming and source-like output
- fewer false positives when mining repeated code bodies
Priority 5: Corpus-level pattern clustering
What to borrow
From the crusader-disasm corpus mindset:
- treat the full body set as a searchable evidence base, not only as isolated scripts
Why it matters
The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.
Concrete change
Add corpus analysis helpers that cluster or index:
- exact repeated bodies
- normalized repeated bodies
- repeated loop-selector templates
- repeated spawn/call templates by class and slot
Those results should feed back into readable annotations like:
- shared interaction stub
- alarm-family controller template
- common trigger setup pattern
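For the exact and normalized repeated-body cases, a hash-based grouping is probably enough to start with. The sketch below assumes bodies are already available as a mapping from script name to code bytes; cluster_bodies and its normalize parameter are hypothetical, and the script names in the usage example are made up.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List


def cluster_bodies(
    bodies: Dict[str, bytes],
    normalize: Callable[[bytes], bytes] = lambda body: body,
) -> List[List[str]]:
    """Group script names whose (optionally normalized) bodies are identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, body in bodies.items():
        digest = hashlib.sha256(normalize(body)).hexdigest()
        groups[digest].append(name)
    # Report only bodies that occur more than once, in a stable order.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Usage with made-up names and bytes:
example = {"SCRIPT_A": b"\x01\x02", "SCRIPT_B": b"\x01\x02", "SCRIPT_C": b"\x03"}
clusters = cluster_bodies(example)
```

Passing a normalize function (for example, one that masks immediates) turns the same helper into the normalized-body index without a second code path.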
Expected payoff
- faster triage of interesting scripts
- better distinction between generic templates and unique gameplay logic
- fewer overinterpretations of tiny bodies
Priority 6: Stronger runtime bridge and import path
What to borrow
From the local repo workflow rather than directly from Pentagram:
- the current runtime anchors already recorded in runtime_vm_ir.tsv
- the Ghidra-side annotation path planned in the USECODE notes
Why it matters
The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.
Concrete change
Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:
- 000d:51fd
- 000d:5572
- 000d:46ec
- 000d:21ed
- 000d:22bc
- 000d:ebe3
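To carry anchors through export, the segment:offset strings need a typed form. The sketch below is an assumption-laden illustration: RuntimeAnchor and annotate are hypothetical names, the trailing-comment style is invented, and nothing here reflects the actual column layout of runtime_vm_ir.tsv.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class RuntimeAnchor:
    segment: int
    offset: int

    @classmethod
    def parse(cls, text: str) -> "RuntimeAnchor":
        """Parse a 'ssss:oooo' hexadecimal segment:offset string."""
        seg, off = text.strip().split(":")
        return cls(int(seg, 16), int(off, 16))

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.offset:04x}"


def annotate(line: str, anchor: RuntimeAnchor) -> str:
    """Append the anchor to a pseudocode or index line as a trailing comment."""
    return f"{line}  # runtime anchor {anchor}"
```

Round-tripping through parse and str keeps the exported text identical to the form already used in the notes, which should make Ghidra-side lookups a plain string match.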
Expected payoff
- easier Ghidra-side correlation
- safer promotion of slot/event names
- better compiled-to-script navigation
Suggested implementation order
- stabilize opcode metadata tables
- formalize trailer parsing in IR
- implement first real loop-selector decoder for common shape and family searches
- add intrinsic metadata with confidence levels
- add corpus clustering/index helpers
- extend runtime-anchor export/import integration
What not to do yet
- Do not chase full round-tripping first. Readability is still the higher-value frontier.
- Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
- Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.
Current best next step
The most leverage is in loop-selector decoding.
That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.