docs/usecode-tool-improvement-plan.md
# USECODE Tool Improvement Plan

## Purpose

This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.

The intent is not to copy Pentagram or `crusader-disasm` wholesale. The intent is to extract the parts that are genuinely useful for the current workspace toolchain:

- `tools/poc_crusader_usecode_parser.py`
- `tools/export_usecode_pseudocode.py`
- the extracted owner-loaded corpus under `USECODE/EUSECODE_extracted/`

## Short version

The most useful next upgrades are:

1. make the decoder tables more authoritative
2. decode loop/selector idioms into real structured searches
3. improve intrinsic naming and signatures
4. distinguish code from trailers more rigorously
5. add corpus-level pattern clustering and family annotations
6. keep strengthening the runtime bridge back into the retail binary

## Priority 1: Authoritative opcode metadata

### What to borrow

From Pentagram and `crusader-disasm`:

- stable opcode names
- operand-shape knowledge
- special handling for records like `SYMBOL_INFO`, `LINE_NUMBER`, `PROCESS_EXCLUDE`, and `END`

### Why it matters

The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.

### Concrete change

Move the per-opcode knowledge into a single explicit table describing:

- mnemonic
- stack effect where known
- immediate layout
- control-flow behavior
- whether the opcode is normal code, metadata, or trailer-oriented
- whether the opcode participates in loop selector mini-languages
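
One possible shape for that table is sketched below. This is only an illustration of the declarative style, not the parser's current layout: the class names, the `struct`-style immediate codes, and the three sample entries (including their byte values and mnemonics) are placeholders invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class OpKind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    JUMP = "jump"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()    # struct-style codes: "B", "H", "I"
    stack_effect: Optional[int] = None  # net change in bytes; None = unknown
    flow: Flow = Flow.FALLTHROUGH
    kind: OpKind = OpKind.CODE
    loop_selector: bool = False


# Byte widths for the immediate codes used above.
IMMEDIATE_SIZES = {"B": 1, "H": 2, "I": 4}

# Placeholder entries only: byte values and mnemonics are NOT real opcodes.
OPCODES: Dict[int, OpcodeInfo] = {
    0xF0: OpcodeInfo("example_push_word", immediates=("H",), stack_effect=2),
    0xF1: OpcodeInfo("example_ret", flow=Flow.RETURN),
    0xF2: OpcodeInfo("example_debug_row", immediates=("H", "H"),
                     kind=OpKind.METADATA),
}


def immediate_length(info: OpcodeInfo) -> int:
    """Total number of immediate bytes that follow the opcode byte."""
    return sum(IMMEDIATE_SIZES[code] for code in info.immediates)


def is_executable(info: OpcodeInfo) -> bool:
    """True when the entry describes normal code rather than metadata."""
    return info.kind is OpKind.CODE
```

With a table like this, the decode loop can look up `OPCODES[byte]` and advance by `1 + immediate_length(info)` instead of branching per opcode, and a regression test can walk the table directly.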

### Expected payoff

- fewer ad hoc decode branches
- easier regression testing against the text corpus
- cleaner IR for later restructuring passes

## Priority 2: Real loop/selector decoding

### What to borrow

From the older disassembly corpus:

- the meaning of `loopscr` tokens such as `end`, `==`, `item->shape`, `item->family`, and typed literal selectors
- the visible repeated patterns in alarm-family and trigger-family bodies

### Why it matters

Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like `loopscr value_u8=0x40` instead of the underlying search semantics.

That is the main reason scripts like `ALARMHAT` still read as partially machine-shaped even though the overall behavior is already understandable.

### Concrete change

Introduce a small loop-selector IR layer so common loop forms render as something closer to:

```text
for item in nearby_items(shape=0x04D0, origin=arg_06):
```

or:

```text
for candidate in nearby_items(family=6, origin=arg_06):
```

The first target is not full generality. The first target is the set of repeated loop forms already seen in:

- `NPCTRIG`
- `ALARMHAT`
- `ALARMBOX`
- `ALRMTRIG`
- nearby environmental families
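
A minimal sketch of such a layer follows. It assumes the `loopscr` bytes have already been decoded into small tuples and that a simple equality search arrives in the order field, literal, `==`, `end`; both the tuple encoding and that ordering are assumptions made for the example, not confirmed properties of the corpus. Anything that does not match the one recognized template falls back to a comment, so no selector data is lost.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical decoded token forms:
#   ("field", "shape") | ("lit", 0x04D0) | ("op", "==") | ("end",)
Token = Tuple


@dataclass(frozen=True)
class FieldSearch:
    """IR node for 'iterate nearby items whose <field> equals <value>'."""
    field: str
    value: int


def match_field_search(tokens: Sequence[Token]) -> Optional[FieldSearch]:
    """Recognize the single template: field, literal, '==', end."""
    if len(tokens) != 4:
        return None
    lhs, rhs, op, end = tokens
    if (lhs[0] == "field" and rhs[0] == "lit"
            and op == ("op", "==") and end == ("end",)):
        return FieldSearch(field=lhs[1], value=rhs[1])
    return None


def render_loop(tokens: Sequence[Token], var: str = "item",
                origin: str = "arg_06") -> str:
    """Render a structured loop header, or a comment if undecoded."""
    search = match_field_search(tokens)
    if search is None:
        raw = " ".join(repr(tok) for tok in tokens)
        return f"# loopscr (undecoded): {raw}"
    # Shapes read best as 16-bit hex; other fields are shown in decimal.
    if search.field == "shape":
        value = f"0x{search.value:04X}"
    else:
        value = str(search.value)
    return f"for {var} in nearby_items({search.field}={value}, origin={origin}):"
```

For example, `render_loop([("field", "shape"), ("lit", 0x04D0), ("op", "=="), ("end",)])` yields the first target form shown above. Keeping the matcher separate from the renderer means new templates can be added one at a time as they are confirmed in the alarm and trigger families.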

### Expected payoff

- much better readability for object-searching scripts
- better gameplay interpretation of trigger/controller classes
- a cleaner path to naming common search idioms

## Priority 3: Better intrinsic naming and signatures

### What to borrow

From Pentagram and `crusader-disasm`:

- historical intrinsic names
- text-mined call arities and stack cleanup behavior
- rough prototype guesses from the older corpus tools

### Why it matters

Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like `Intrinsic0007()` or generic placeholders like `class_0A18_slot_20(...)`.

The older tool lines already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than rename authority.

### Concrete change

Build a local intrinsic metadata table with confidence levels:

- `verified`
- `strong hint`
- `weak hint`

Populate it from:

- Pentagram tables
- `usecode_opcodes.txt`
- mined `calli`/`add sp` patterns from `crusader_disasm.txt`
- current repo notes where compiled-side names are already justified
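
The table and its merge rule could look roughly like the sketch below. The record fields, the `merge_hints` and `display_name` helpers, and the sample name are hypothetical; the only behavior taken from the plan is that the three confidence levels are ordered and that anything short of `verified` stays a hint rather than a rename.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable, Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    index: int
    name: str
    arg_bytes: Optional[int]  # stack bytes cleaned up after the call, if known
    confidence: Confidence
    source: str               # free-form provenance label


def merge_hints(hints: Iterable[IntrinsicHint]) -> Dict[int, IntrinsicHint]:
    """Keep the highest-confidence hint per intrinsic index.

    On a tie, the hint seen first wins, so input order expresses
    source preference.
    """
    best: Dict[int, IntrinsicHint] = {}
    for hint in hints:
        current = best.get(hint.index)
        if current is None or hint.confidence > current.confidence:
            best[hint.index] = hint
    return best


def display_name(index: int, table: Dict[int, IntrinsicHint]) -> str:
    """Use the real name only when verified; otherwise annotate the placeholder."""
    placeholder = f"Intrinsic{index:04X}"
    hint = table.get(index)
    if hint is None:
        return placeholder
    if hint.confidence is Confidence.VERIFIED:
        return hint.name
    label = hint.confidence.name.lower().replace("_", " ")
    return f"{placeholder} /* {label}: {hint.name} */"
```

Rendering unverified names as trailing comments keeps the pseudocode searchable by the stable placeholder while still surfacing the older tools' vocabulary.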

### Expected payoff

- more readable pseudocode
- safer future promotion of intrinsic names
- less confusion between Remorse-only, Regret-only, and cross-game vocabulary

## Priority 4: Explicit code-versus-trailer boundaries

### What to borrow

From Pentagram's symbol-info/debug-symbol handling:

- the idea that `0x5C` points into structured trailer data
- the practical distinction between executable body and debug/local trailer rows

### Why it matters

The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-`ret` metadata gets rendered as live code.

The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.

### Concrete change

Make trailer parsing explicit in the IR:

- code extent
- trailer extent
- debug symbol rows
- line-number records
- terminal `END`
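
As a rough sketch, those five items could be carried by one per-body record like the one below. The field layouts of the debug and line-number rows are placeholders, and the two checks in `validate` (code ends before the trailer starts, `END` sits inside the trailer) are assumptions about the format that would need confirming against the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    """Half-open byte range [start, end) within one script body."""
    start: int
    end: int

    def __contains__(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass(frozen=True)
class DebugSymbolRow:
    offset: int
    name: str  # hypothetical field; the real row layout is not modeled


@dataclass(frozen=True)
class LineNumberRecord:
    offset: int
    line: int  # hypothetical field


@dataclass
class BodyLayout:
    code: Extent
    trailer: Extent
    debug_symbols: List[DebugSymbolRow] = field(default_factory=list)
    line_numbers: List[LineNumberRecord] = field(default_factory=list)
    end_offset: Optional[int] = None  # offset of the terminal END, if found

    def validate(self) -> None:
        """Raise ValueError if the extents violate the assumed layout."""
        if self.code.end > self.trailer.start:
            raise ValueError("code extent overlaps trailer extent")
        if self.end_offset is not None and self.end_offset not in self.trailer:
            raise ValueError("END record lies outside the trailer extent")

    def classify(self, offset: int) -> str:
        """Report whether an offset is code, trailer, or outside both."""
        if offset in self.code:
            return "code"
        if offset in self.trailer:
            return "trailer"
        return "outside"
```

Later passes, such as the pseudocode exporter or body-hashing helpers, can then ask `layout.classify(offset)` instead of re-deriving the boundary from the position of the last `ret`.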

### Expected payoff

- safer whole-corpus export
- better local naming and source-like output
- fewer false positives when mining repeated code bodies

## Priority 5: Corpus-level pattern clustering

### What to borrow

From the `crusader-disasm` corpus mindset:

- treat the full body set as a searchable evidence base, not only as isolated scripts

### Why it matters

The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.

### Concrete change

Add corpus analysis helpers that cluster or index:

- exact repeated bodies
- normalized repeated bodies
- repeated loop-selector templates
- repeated spawn/call templates by class and slot

Those results should feed back into readable annotations like:

- `shared interaction stub`
- `alarm-family controller template`
- `common trigger setup pattern`
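
The first two clustering modes can share one helper, as in the sketch below. It assumes code bodies are available as raw bytes (for exact matching) and as decoded `(mnemonic, operands)` pairs (for normalized matching); the helper names and the choice to normalize by dropping all operands are illustrative, and a real normalizer would likely be more selective.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple, TypeVar

Body = TypeVar("Body")
Instruction = Tuple[str, Tuple[int, ...]]  # (mnemonic, operands), hypothetical


def cluster_by(bodies: Dict[str, Body],
               key: Callable[[Body], Hashable]) -> List[List[str]]:
    """Group script names whose bodies map to the same key.

    Only groups with two or more members are returned, sorted for
    stable output.
    """
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for name, body in bodies.items():
        groups[key(body)].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


def exact_key(body: bytes) -> str:
    """Key for byte-identical bodies."""
    return hashlib.sha256(body).hexdigest()


def mnemonic_key(instructions: Sequence[Instruction]) -> Tuple[str, ...]:
    """Key that ignores operands, so only the opcode sequence matters."""
    return tuple(mnemonic for mnemonic, _operands in instructions)
```

Calling `cluster_by(raw_bodies, exact_key)` finds shared stubs, and `cluster_by(decoded_bodies, mnemonic_key)` finds bodies that differ only in constants. Hashing should be applied to the code extent only, so trailer differences do not split otherwise identical bodies.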

### Expected payoff

- faster triage of interesting scripts
- better distinction between generic templates and unique gameplay logic
- fewer overinterpretations of tiny bodies

## Priority 6: Stronger runtime bridge and import path

### What to borrow

From the local repo workflow rather than directly from Pentagram:

- the current runtime anchors already recorded in `runtime_vm_ir.tsv`
- the Ghidra-side annotation path planned in the USECODE notes

### Why it matters

The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.

### Concrete change

Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:

- `000d:51fd`
- `000d:5572`
- `000d:46ec`
- `000d:21ed`
- `000d:22bc`
- `000d:ebe3`
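
A small helper for carrying those anchors through the export might look like the following sketch. It only parses the `segment:offset` text form used in the list above and formats an annotation comment; the `Anchor` type, the comment wording, and the `verified` flag are assumptions, and the column layout of `runtime_vm_ir.tsv` is deliberately not modeled here.

```python
import re
from typing import NamedTuple


class Anchor(NamedTuple):
    """A segment:offset address in the retail binary."""
    segment: int
    offset: int

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.offset:04x}"


_ANCHOR_RE = re.compile(r"^([0-9A-Fa-f]{4}):([0-9A-Fa-f]{4})$")


def parse_anchor(text: str) -> Anchor:
    """Parse text such as '000d:51fd'; raise ValueError on other input."""
    match = _ANCHOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a segment:offset anchor: {text!r}")
    return Anchor(int(match.group(1), 16), int(match.group(2), 16))


def anchor_comment(anchor: Anchor, verified: bool) -> str:
    """Format a pseudocode comment line that carries a runtime anchor."""
    status = "verified" if verified else "unverified"
    return f"# runtime anchor: {anchor} ({status})"
```

Normalizing anchors to one lowercase text form at parse time makes them usable as join keys between the pseudocode index and whatever the Ghidra-side annotation path eventually consumes.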

### Expected payoff

- easier Ghidra-side correlation
- safer promotion of slot/event names
- better compiled-to-script navigation

## Suggested implementation order

1. stabilize opcode metadata tables
2. formalize trailer parsing in IR
3. implement first real loop-selector decoder for common `shape` and `family` searches
4. add intrinsic metadata with confidence levels
5. add corpus clustering/index helpers
6. extend runtime-anchor export/import integration

## What not to do yet

- Do not chase full round-tripping first. Readability is still the higher-value frontier.
- Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
- Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.

## Current best next step

The highest-leverage next step is loop-selector decoding.

That is where the older tools still give us directly reusable structure, and where the current readable output most obviously needs another step forward.