Stuff

2026-03-25 23:32:36 +01:00 · 2026-03-25 23:32:36 +01:00 · f92d1504fa
commit f92d1504fa
parent ee33f94b4b
547 changed files with 37597 additions and 0 deletions
--- a/docs/usecode-tool-improvement-plan.md
+++ b/docs/usecode-tool-improvement-plan.md
@@ -0,0 +1,248 @@
+# USECODE Tool Improvement Plan
+
+## Purpose
+
+This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.
+
+The intent is not to copy Pentagram or `crusader-disasm` wholesale. The intent is to extract the parts that are genuinely useful for the current workspace toolchain:
+
+- `tools/poc_crusader_usecode_parser.py`
+- `tools/export_usecode_pseudocode.py`
+- the extracted owner-loaded corpus under `USECODE/EUSECODE_extracted/`
+
+## Short version
+
+The most useful next upgrades are:
+
+1. make the decoder tables more authoritative
+2. decode loop/selector idioms into real structured searches
+3. improve intrinsic naming and signatures
+4. distinguish code from trailers more rigorously
+5. add corpus-level pattern clustering and family annotations
+6. keep strengthening the runtime bridge back into the retail binary
+
+## Priority 1: Authoritative opcode metadata
+
+### What to borrow
+
+From Pentagram and `crusader-disasm`:
+
+- stable opcode names
+- operand-shape knowledge
+- special handling for records like `SYMBOL_INFO`, `LINE_NUMBER`, `PROCESS_EXCLUDE`, and `END`
+
+### Why it matters
+
+The current parser already decodes enough to produce readable pseudocode, but some opcodes are still treated more heuristically than declaratively. That is fine for proof-of-concept output, but it becomes fragile once more control-flow and loop idioms are added.
+
+### Concrete change
+
+Move the per-opcode knowledge into a single explicit table describing:
+
+- mnemonic
+- stack effect where known
+- immediate layout
+- control-flow behavior
+- whether the opcode is normal code, metadata, or trailer-oriented
+- whether the opcode participates in loop selector mini-languages
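
A minimal sketch of what such a table could look like in Python. Every name here (`OpcodeInfo`, `OpKind`, `Flow`, `instruction_length`) is hypothetical, and the opcode numbers and entries are placeholders rather than real Crusader values; the point is only that the decoder reads one declarative record per opcode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class OpKind(Enum):
    CODE = "code"
    METADATA = "metadata"
    TRAILER = "trailer"


class Flow(Enum):
    FALLTHROUGH = "fallthrough"
    JUMP = "jump"
    BRANCH = "branch"
    CALL = "call"
    RETURN = "return"


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()   # immediate layout, e.g. ("u8", "u16")
    stack_delta: Optional[int] = None  # None means "not known yet"
    flow: Flow = Flow.FALLTHROUGH
    kind: OpKind = OpKind.CODE
    loop_selector: bool = False


# Placeholder entries only: these opcode numbers are NOT real Crusader values.
OPCODES = {
    0x01: OpcodeInfo("example_push_u16", ("u16",), stack_delta=2),
    0x02: OpcodeInfo("example_ret", flow=Flow.RETURN),
    0x03: OpcodeInfo("example_line_number", ("u16",), kind=OpKind.METADATA),
}

IMMEDIATE_SIZES = {"u8": 1, "u16": 2, "u32": 4}


def instruction_length(opcode: int) -> int:
    """One opcode byte plus the size of every declared immediate."""
    info = OPCODES[opcode]
    return 1 + sum(IMMEDIATE_SIZES[kind] for kind in info.immediates)
```

Keeping `stack_delta` optional lets the table record "unknown" explicitly instead of guessing, which matches the "stack effect where known" wording above.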
+
+### Expected payoff
+
+- fewer ad hoc decode branches
+- easier regression testing against the text corpus
+- cleaner IR for later restructuring passes
+
+## Priority 2: Real loop/selector decoding
+
+### What to borrow
+
+From the older disassembly corpus:
+
+- the meaning of `loopscr` tokens such as `end`, `==`, `item->shape`, `item->family`, and typed literal selectors
+- the visible repeated patterns in alarm-family and trigger-family bodies
+
+### Why it matters
+
+Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like `loopscr value_u8=0x40` instead of the underlying search semantics.
+
+That is the main reason scripts like `ALARMHAT` still read as partially machine-shaped even though the overall behavior is already understandable.
+
+### Concrete change
+
+Introduce a small loop-selector IR layer so common loop forms render as something closer to:
+
+```text
+for item in nearby_items(shape=0x04D0, origin=arg_06):
+```
+
+or:
+
+```text
+for candidate in nearby_items(family=6, origin=arg_06):
+```
+
+The first target is not full generality. The first target is the set of repeated loop forms already seen in:
+
+- `NPCTRIG`
+- `ALARMHAT`
+- `ALARMBOX`
+- `ALRMTRIG`
+- nearby environmental families
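
The loop-selector IR layer could be sketched roughly as follows. This is an illustration under stated assumptions: the selector is modelled as a list of already-decoded symbolic tokens, the token order is a guess, and `NearbySearch` and `decode_selector` are hypothetical names. The real `loopscr` byte encoding is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Symbolic tokens only; the on-disk loopscr encoding is out of scope here.
Token = Union[str, int]


@dataclass(frozen=True)
class NearbySearch:
    field: str   # "shape" or "family"
    value: int
    origin: str  # e.g. "arg_06"

    def render(self, var: str = "item") -> str:
        # Shapes read better as 4-digit hex, families as small decimals.
        literal = f"0x{self.value:04X}" if self.field == "shape" else str(self.value)
        return f"for {var} in nearby_items({self.field}={literal}, origin={self.origin}):"


def decode_selector(tokens: List[Token], origin: str) -> Optional[NearbySearch]:
    """Recognize the assumed form `item-><field> <literal> == end`.

    Returns None for anything else so the caller can keep the raw comment.
    """
    if len(tokens) != 4:
        return None
    field_tok, literal, op, end = tokens
    if op != "==" or end != "end" or not isinstance(literal, int):
        return None
    if field_tok not in ("item->shape", "item->family"):
        return None
    return NearbySearch(field_tok.split("->")[1], literal, origin)
```

Returning `None` for unrecognized forms means the exporter can fall back to the existing `loopscr value_u8=...` comments, so only the small repeated set is restructured at first.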
+
+### Expected payoff
+
+- much better readability for object-searching scripts
+- better gameplay interpretation of trigger/controller classes
+- a cleaner path to naming common search idioms
+
+## Priority 3: Better intrinsic naming and signatures
+
+### What to borrow
+
+From Pentagram and `crusader-disasm`:
+
+- historical intrinsic names
+- text-mined call arities and stack cleanup behavior
+- rough prototype guesses from the older corpus tools
+
+### Why it matters
+
+Readable pseudocode is bottlenecked less by control flow now and more by anonymous calls like `Intrinsic0007()` or generic placeholders like `class_0A18_slot_20(...)`.
+
+The older tools already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than as authority for renaming.
+
+### Concrete change
+
+Build a local intrinsic metadata table with confidence levels:
+
+- `verified`
+- `strong hint`
+- `weak hint`
+
+Populate it from:
+
+- Pentagram tables
+- `usecode_opcodes.txt`
+- mined `calli`/`add sp` patterns from `crusader_disasm.txt`
+- current repo notes where compiled-side names are already justified
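
One possible shape for that table, as a sketch. The class and function names (`Confidence`, `IntrinsicHint`, `merge_hint`, `display_name`) are hypothetical, and the four-digit hexadecimal placeholder format is an assumption based on the `Intrinsic0007()` style shown earlier.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    name: str
    confidence: Confidence
    source: str                      # where the hint came from
    arg_bytes: Optional[int] = None  # stack cleanup size, if mined


def merge_hint(table: Dict[int, IntrinsicHint], index: int, hint: IntrinsicHint) -> None:
    """Keep the existing entry unless the new hint is strictly more confident."""
    current = table.get(index)
    if current is None or hint.confidence > current.confidence:
        table[index] = hint


def display_name(
    table: Dict[int, IntrinsicHint],
    index: int,
    minimum: Confidence = Confidence.STRONG_HINT,
) -> str:
    """Use a mined name only at or above the threshold; otherwise stay anonymous."""
    hint = table.get(index)
    if hint is None or hint.confidence < minimum:
        return f"Intrinsic{index:04X}"
    return hint.name
```

With a threshold on display, weak hints can be stored and inspected without leaking into the readable pseudocode as if they were settled names.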
+
+### Expected payoff
+
+- more readable pseudocode
+- safer future promotion of intrinsic names
+- less confusion between Remorse-only, Regret-only, and cross-game vocabulary
+
+## Priority 4: Explicit code-versus-trailer boundaries
+
+### What to borrow
+
+From Pentagram's symbol-info/debug-symbol handling:
+
+- the idea that `0x5C` points into structured trailer data
+- the practical distinction between executable body and debug/local trailer rows
+
+### Why it matters
+
+The JELYHACK pass already showed how important this is. Tiny scripts are easy to misread if post-`ret` metadata gets rendered as live code.
+
+The current parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.
+
+### Concrete change
+
+Make trailer parsing explicit in the IR:
+
+- code extent
+- trailer extent
+- debug symbol rows
+- line-number records
+- terminal `END`
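
As a rough sketch, the IR could carry these pieces in one record per script body. The names (`Extent`, `BodyLayout`, `make_layout`) are hypothetical, and how the trailer start is actually located is deliberately left outside this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Extent:
    start: int
    end: int  # exclusive

    def __contains__(self, offset: int) -> bool:
        return self.start <= offset < self.end


@dataclass
class BodyLayout:
    code: Extent
    trailer: Extent
    debug_symbols: List[Tuple[int, str]] = field(default_factory=list)  # (offset, name)
    line_numbers: List[Tuple[int, int]] = field(default_factory=list)   # (offset, line)
    end_offset: Optional[int] = None  # offset of the terminal END record, if seen

    def is_executable(self, offset: int) -> bool:
        """Only offsets inside the code extent may be rendered as live code."""
        return offset in self.code


def make_layout(body_size: int, trailer_start: int) -> BodyLayout:
    """Split a body at an already-known trailer start (assumed to be given)."""
    if not 0 <= trailer_start <= body_size:
        raise ValueError("trailer start lies outside the body")
    return BodyLayout(Extent(0, trailer_start), Extent(trailer_start, body_size))
```

Later passes, such as repeated-body mining, can then ask `is_executable(offset)` instead of re-deriving the boundary on their own.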
+
+### Expected payoff
+
+- safer whole-corpus export
+- better local naming and source-like output
+- fewer false positives when mining repeated code bodies
+
+## Priority 5: Corpus-level pattern clustering
+
+### What to borrow
+
+From the `crusader-disasm` corpus mindset:
+
+- treat the full body set as a searchable evidence base, not only as isolated scripts
+
+### Why it matters
+
+The JELYHACK result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.
+
+### Concrete change
+
+Add corpus analysis helpers that cluster or index:
+
+- exact repeated bodies
+- normalized repeated bodies
+- repeated loop-selector templates
+- repeated spawn/call templates by class and slot
+
+Those results should feed back into readable annotations like:
+
+- `shared interaction stub`
+- `alarm-family controller template`
+- `common trigger setup pattern`
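
A small sketch of the first two cluster kinds. It assumes each body is available both as raw code bytes and as a decoded list of mnemonics; the function names and the choice of "mnemonics only" as the normalization are illustrative assumptions, not the existing tool behavior.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def cluster_exact(bodies: Iterable[Tuple[str, bytes]]) -> Dict[str, List[str]]:
    """Group script names whose code bytes are byte-for-byte identical."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, data in bodies:
        groups[hashlib.sha256(data).hexdigest()].append(name)
    # Only groups with at least two members count as repeated bodies.
    return {digest: sorted(names) for digest, names in groups.items() if len(names) > 1}


def cluster_normalized(
    bodies: Iterable[Tuple[str, Sequence[str]]],
) -> Dict[Tuple[str, ...], List[str]]:
    """Group script names by mnemonic sequence, ignoring operand values."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for name, mnemonics in bodies:
        groups[tuple(mnemonics)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

The byte input should be the code extent only, so that differing trailer data does not hide an otherwise shared stub.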
+
+### Expected payoff
+
+- faster triage of interesting scripts
+- better distinction between generic templates and unique gameplay logic
+- fewer overinterpretations of tiny bodies
+
+## Priority 6: Stronger runtime bridge and import path
+
+### What to borrow
+
+From the local repo workflow rather than directly from Pentagram:
+
+- the current runtime anchors already recorded in `runtime_vm_ir.tsv`
+- the Ghidra-side annotation path planned in the USECODE notes
+
+### Why it matters
+
+The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.
+
+### Concrete change
+
+Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:
+
+- `000d:51fd`
+- `000d:5572`
+- `000d:46ec`
+- `000d:21ed`
+- `000d:22bc`
+- `000d:ebe3`
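
A minimal sketch of carrying such anchors through export. It assumes the addresses above are hexadecimal `segment:offset` pairs; `RuntimeAnchor` and `annotate` are hypothetical names, and the column layout of `runtime_vm_ir.tsv` is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeAnchor:
    segment: int
    offset: int

    @classmethod
    def parse(cls, text: str) -> "RuntimeAnchor":
        """Parse a hexadecimal `segment:offset` string such as `000d:51fd`."""
        seg, sep, off = text.strip().partition(":")
        if not sep:
            raise ValueError(f"expected segment:offset, got {text!r}")
        return cls(int(seg, 16), int(off, 16))

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.offset:04x}"


def annotate(line: str, anchor: RuntimeAnchor, verified: bool) -> str:
    """Append an anchor comment only when the anchor is marked verified."""
    return f"{line}  # runtime {anchor}" if verified else line
```

Gating on `verified` keeps unconfirmed correlations out of the exported pseudocode while still letting them live in the index.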
+
+### Expected payoff
+
+- easier Ghidra-side correlation
+- safer promotion of slot/event names
+- better compiled-to-script navigation
+
+## Suggested implementation order
+
+1. stabilize opcode metadata tables
+2. formalize trailer parsing in IR
+3. implement first real loop-selector decoder for common `shape` and `family` searches
+4. add intrinsic metadata with confidence levels
+5. add corpus clustering/index helpers
+6. extend runtime-anchor export/import integration
+
+## What not to do yet
+
+- Do not chase full round-tripping first. Readability is still the higher-value frontier.
+- Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
+- Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.
+
+## Current best next step
+
+Loop-selector decoding offers the most leverage.
+
+That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.