# USECODE Tool Improvement Plan

## Purpose

This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.

The intent is not to copy Pentagram or `crusader-disasm` wholesale. The intent is to extract the parts that are genuinely useful for the current workspace toolchain:

- `tools/poc_crusader_usecode_parser.py`
- `tools/export_usecode_pseudocode.py`
- the extracted owner-loaded corpus under `USECODE/EUSECODE_extracted/`

## Short version

The most useful next upgrades are:

1. make the decoder tables more authoritative
2. decode loop/selector idioms into real structured searches
3. improve intrinsic naming and signatures
4. distinguish code from trailers more rigorously
5. add corpus-level pattern clustering and family annotations
6. keep strengthening the runtime bridge back into the retail binary

## Priority 1: Authoritative opcode metadata

### What to borrow

From Pentagram and `crusader-disasm`:

- stable opcode names
- operand-shape knowledge
- special handling for records like `SYMBOL_INFO`, `LINE_NUMBER`, `PROCESS_EXCLUDE`, and `END`

### Why it matters

The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.

### Concrete change

Move the per-opcode knowledge into a single explicit table describing:

- mnemonic
- stack effect where known
- immediate layout
- control-flow behavior
- whether the opcode is normal code, metadata, or trailer-oriented
- whether the opcode participates in loop selector mini-languages

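One possible shape for such a table is sketched below. Everything in it is an assumption made for illustration: the `OpcodeInfo`, `Kind`, and `Flow` names, the use of `struct` format strings for immediate layout, and in particular the opcode numbers and mnemonics, which are placeholders rather than real USECODE values.

```python
import struct
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Kind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    JUMP = "jump"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: str                     # struct format for the operand bytes
    flow: Flow = Flow.FALLTHROUGH
    kind: Kind = Kind.CODE
    stack_effect: Optional[int] = None  # None means "not yet known"
    loop_selector: bool = False


# Placeholder rows only: keys, names, and layouts are NOT real opcode data.
OPCODES: Dict[int, OpcodeInfo] = {
    0x01: OpcodeInfo("example_push_u16", "<H", stack_effect=2),
    0x02: OpcodeInfo("example_jump", "<h", flow=Flow.JUMP, stack_effect=0),
    0x03: OpcodeInfo("example_loopscr", "<B", loop_selector=True),
    0x04: OpcodeInfo("example_symbol_info", "<H", kind=Kind.METADATA),
}


def decode_one(data: bytes, offset: int) -> Tuple[OpcodeInfo, tuple, int]:
    """Decode one instruction and return (info, operands, next_offset)."""
    opcode = data[offset]
    info = OPCODES.get(opcode)
    if info is None:
        raise ValueError(f"unknown opcode 0x{opcode:02X} at offset {offset}")
    operands = struct.unpack_from(info.immediates, data, offset + 1)
    return info, operands, offset + 1 + struct.calcsize(info.immediates)
```

With a table like this the decoder loop has no per-opcode branches, and a regression test can walk the table itself to check that every row has a layout and a classification.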
### Expected payoff

- fewer ad hoc decode branches
- easier regression testing against the text corpus
- cleaner IR for later restructuring passes

## Priority 2: Real loop/selector decoding

### What to borrow

From the older disassembly corpus:

- the meaning of `loopscr` tokens such as `end`, `==`, `item->shape`, `item->family`, and typed literal selectors
- the visible repeated patterns in alarm-family and trigger-family bodies

### Why it matters

Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like `loopscr value_u8=0x40` instead of the underlying search semantics.

That is the main reason scripts like `ALARMHAT` still read as partially machine-shaped even though the overall behavior is already understandable.

### Concrete change

Introduce a small loop-selector IR layer so common loop forms render as something closer to:

```text
for item in nearby_items(shape=0x04D0, origin=arg_06):
```

or:

```text
for candidate in nearby_items(family=6, origin=arg_06):
```

The first target is not full generality. The first target is the set of repeated loop forms already seen in:

- `NPCTRIG`
- `ALARMHAT`
- `ALARMBOX`
- `ALRMTRIG`
- nearby environmental families

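A minimal sketch of what that IR layer might look like follows. It works on already-tokenized selectors rather than raw bytes, and it assumes a postfix token order (`item->shape`, literal, `==`, `end`); that ordering, and all class and function names here, are illustrative assumptions, not facts about the `loopscr` encoding.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Field:
    name: str      # e.g. "shape" or "family", from `item->shape` / `item->family`


@dataclass(frozen=True)
class Literal:
    value: int     # typed literal selector, width ignored in this sketch


@dataclass(frozen=True)
class Op:
    symbol: str    # "==" or "end"


Token = Union[Field, Literal, Op]


@dataclass(frozen=True)
class NearbySearch:
    field: str
    value: int
    origin: str


def match_simple_search(tokens: List[Token], origin: str) -> Optional[NearbySearch]:
    """Recognize the single form `FIELD LITERAL == end`; anything else is None."""
    if len(tokens) != 4:
        return None
    field, literal, compare, end = tokens
    if (
        isinstance(field, Field)
        and field.name in ("shape", "family")
        and isinstance(literal, Literal)
        and compare == Op("==")
        and end == Op("end")
    ):
        return NearbySearch(field.name, literal.value, origin)
    return None


def render(search: NearbySearch, var: str = "item") -> str:
    """Render a recognized search as a readable loop header."""
    if search.field == "shape":
        value = f"0x{search.value:04X}"   # shapes shown as 16-bit hex
    else:
        value = str(search.value)
    return f"for {var} in nearby_items({search.field}={value}, origin={search.origin}):"
```

Returning `None` for unrecognized token streams lets the exporter fall back to the existing raw `loopscr` comments, so the new layer only ever replaces output it fully understands.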
### Expected payoff

- much better readability for object-searching scripts
- better gameplay interpretation of trigger/controller classes
- a cleaner path to naming common search idioms

## Priority 3: Better intrinsic naming and signatures

### What to borrow

From Pentagram and `crusader-disasm`:

- historical intrinsic names
- text-mined call arities and stack cleanup behavior
- rough prototype guesses from the older corpus tools

### Why it matters

Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like `Intrinsic0007()` or generic placeholders like `class_0A18_slot_20(...)`.

The older tool lines already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than rename authority.

### Concrete change

Build a local intrinsic metadata table with confidence levels:

- `verified`
- `strong hint`
- `weak hint`

Populate it from:

- Pentagram tables
- `usecode_opcodes.txt`
- mined `calli`/`add sp` patterns from `crusader_disasm.txt`
- current repo notes where compiled-side names are already justified

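The confidence-tagged table could be sketched roughly as follows. The record fields, the merge rule (highest confidence wins per intrinsic index), the hexadecimal `IntrinsicNNNN` placeholder format, and the example names in the usage note are all assumptions for illustration; no real intrinsic name is implied.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    index: int
    name: str
    confidence: Confidence
    source: str                      # where the hint was mined from
    arg_bytes: Optional[int] = None  # stack cleanup size, if known


def merge_hints(hints: Iterable[IntrinsicHint]) -> Dict[int, IntrinsicHint]:
    """Keep the highest-confidence hint for each intrinsic index."""
    best: Dict[int, IntrinsicHint] = {}
    for hint in hints:
        current = best.get(hint.index)
        if current is None or hint.confidence > current.confidence:
            best[hint.index] = hint
    return best


def display_name(index: int, table: Dict[int, IntrinsicHint]) -> str:
    """Only verified names replace the placeholder; hints become comments."""
    placeholder = f"Intrinsic{index:04X}"
    hint = table.get(index)
    if hint is None:
        return placeholder
    if hint.confidence == Confidence.VERIFIED:
        return hint.name
    label = hint.confidence.name.lower()
    return f"{placeholder} /* {label}: {hint.name} */"
```

For example, merging a weak hint and a strong hint for index 7 and calling `display_name(7, table)` keeps the `Intrinsic0007` placeholder and appends the stronger hint as a comment, which matches the rule that hints inform the reader without acting as rename authority.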
### Expected payoff

- more readable pseudocode
- safer future promotion of intrinsic names
- less confusion between Remorse-only, Regret-only, and cross-game vocabulary

## Priority 4: Explicit code-versus-trailer boundaries

### What to borrow

From Pentagram's symbol-info/debug-symbol handling:

- the idea that `0x5C` points into structured trailer data
- the practical distinction between executable body and debug/local trailer rows

### Why it matters

The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-`ret` metadata gets rendered as live code.

The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.

### Concrete change

Make trailer parsing explicit in the IR:

- code extent
- trailer extent
- debug symbol rows
- line-number records
- terminal `END`

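As a rough sketch, the IR could carry those five items as explicit fields so later passes can ask which region an offset belongs to. The class names, the half-open extent convention, and the row shapes below are hypothetical; how the boundary itself is detected is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Extent:
    start: int
    end: int  # exclusive

    def __contains__(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass(frozen=True)
class DebugSymbol:
    offset: int
    name: str


@dataclass(frozen=True)
class LineRecord:
    offset: int
    line: int


@dataclass
class ScriptBodyIR:
    code: Extent
    trailer: Extent
    debug_symbols: List[DebugSymbol] = field(default_factory=list)
    line_records: List[LineRecord] = field(default_factory=list)
    has_terminal_end: bool = False

    def region_of(self, offset: int) -> str:
        """Classify an offset as 'code' or 'trailer'; reject anything else."""
        if offset in self.code:
            return "code"
        if offset in self.trailer:
            return "trailer"
        raise ValueError(f"offset {offset} is outside the body")

    def overlaps(self) -> bool:
        """True if the code and trailer extents share any byte."""
        return self.code.start < self.trailer.end and self.trailer.start < self.code.end
```

A pseudocode exporter or a repeated-body miner can then call `region_of` before treating any bytes as instructions, instead of each tool re-deriving the boundary on its own.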
### Expected payoff

- safer whole-corpus export
- better local naming and source-like output
- fewer false positives when mining repeated code bodies

## Priority 5: Corpus-level pattern clustering

### What to borrow

From the `crusader-disasm` corpus mindset:

- treat the full body set as a searchable evidence base, not only as isolated scripts

### Why it matters

The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.

### Concrete change

Add corpus analysis helpers that cluster or index:

- exact repeated bodies
- normalized repeated bodies
- repeated loop-selector templates
- repeated spawn/call templates by class and slot

Those results should feed back into readable annotations like:

- `shared interaction stub`
- `alarm-family controller template`
- `common trigger setup pattern`

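The first two helpers could be approximated by hashing, as in the sketch below. It assumes each body is available both as raw bytes and as a list of `(mnemonic, operands)` pairs, and it normalizes by simply dropping operand values; the function names and that normalization rule are illustrative choices, not the existing tool's behavior.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

Instruction = Tuple[str, tuple]   # (mnemonic, operands), assumed decoder output
T = TypeVar("T")


def exact_key(body: bytes) -> str:
    """Key for byte-identical bodies."""
    return hashlib.sha256(body).hexdigest()


def normalized_key(instructions: Sequence[Instruction]) -> str:
    """Key that ignores operand values and keeps only the mnemonic sequence."""
    shape = "\n".join(mnemonic for mnemonic, _operands in instructions)
    return hashlib.sha256(shape.encode("utf-8")).hexdigest()


def cluster(items: Dict[str, T], key_fn: Callable[[T], str]) -> List[List[str]]:
    """Group script names whose bodies share a key; singletons are dropped."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(items):
        groups[key_fn(items[name])].append(name)
    return sorted(names for names in groups.values() if len(names) > 1)
```

Clusters found by `normalized_key` but not by `exact_key` are the interesting ones for template annotations, since they share structure while differing only in constants such as shape or family values.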
### Expected payoff

- faster triage of interesting scripts
- better distinction between generic templates and unique gameplay logic
- fewer overinterpretations of tiny bodies

## Priority 6: Stronger runtime bridge and import path

### What to borrow

From the local repo workflow rather than directly from Pentagram:

- the current runtime anchors already recorded in `runtime_vm_ir.tsv`
- the Ghidra-side annotation path planned in the USECODE notes

### Why it matters

The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.

### Concrete change

Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:

- `000d:51fd`
- `000d:5572`
- `000d:46ec`
- `000d:21ed`
- `000d:22bc`
- `000d:ebe3`

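One small building block for that path is a strict parser for the `segment:offset` notation used above, so anchors are validated before they reach an index row. The sketch below assumes anchors are always two 4-digit hex fields; the `Anchor` type and the `runtime_anchors` column name are hypothetical and do not describe the actual layout of `runtime_vm_ir.tsv`.

```python
import re
from typing import Dict, List, NamedTuple


class Anchor(NamedTuple):
    segment: int
    offset: int

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.offset:04x}"


_ANCHOR_RE = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{4})$")


def parse_anchor(text: str) -> Anchor:
    """Parse 'ssss:oooo' hex notation, rejecting anything malformed."""
    match = _ANCHOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a segment:offset anchor: {text!r}")
    return Anchor(int(match.group(1), 16), int(match.group(2), 16))


def attach_anchors(row: Dict[str, str], anchors: List[Anchor]) -> Dict[str, str]:
    """Return a copy of an index row with a sorted, comma-joined anchor column."""
    annotated = dict(row)
    annotated["runtime_anchors"] = ",".join(str(a) for a in sorted(anchors))
    return annotated
```

Keeping anchors as parsed values rather than free text makes it cheap to sort them, de-duplicate them, and later emit them in whatever form the Ghidra-side import expects.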
### Expected payoff

- easier Ghidra-side correlation
- safer promotion of slot/event names
- better compiled-to-script navigation

## Suggested implementation order

1. stabilize opcode metadata tables
2. formalize trailer parsing in IR
3. implement first real loop-selector decoder for common `shape` and `family` searches
4. add intrinsic metadata with confidence levels
5. add corpus clustering/index helpers
6. extend runtime-anchor export/import integration

## What not to do yet

- Do not chase full round-tripping first. Readability is still the higher-value frontier.
- Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
- Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.

## Current best next step

The highest-leverage next step is loop-selector decoding.

That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.