Crusader_Decomp/docs/usecode-tool-improvement-plan.md

# USECODE Tool Improvement Plan

## Purpose

This note turns the earlier tooling comparison into a concrete improvement plan for the local parser/decompiler.

The intent is not to copy Pentagram or `crusader-disasm` wholesale. The intent is to extract the parts that are genuinely useful for the current workspace toolchain:

- `tools/poc_crusader_usecode_parser.py`
- `tools/export_usecode_pseudocode.py`
- the extracted owner-loaded corpus under `USECODE/EUSECODE_extracted/`

## Short version

The most useful next upgrades are:

1. make the decoder tables more authoritative
2. decode loop/selector idioms into real structured searches
3. improve intrinsic naming and signatures
4. distinguish code from trailers more rigorously
5. add corpus-level pattern clustering and family annotations
6. keep strengthening the runtime bridge back into the retail binary

## Priority 1: Authoritative opcode metadata

### What to borrow

From Pentagram and `crusader-disasm`:

- stable opcode names
- operand-shape knowledge
- special handling for records like `SYMBOL_INFO`, `LINE_NUMBER`, `PROCESS_EXCLUDE`, and `END`

### Why it matters

The current parser already decodes enough to produce readable pseudocode, but some opcodes are still handled by ad hoc heuristics rather than by declared metadata. That is fine for proof-of-concept output, but it becomes fragile as more control-flow and loop idioms are added.

### Concrete change

Move the per-opcode knowledge into a single explicit table describing:

- mnemonic
- stack effect where known
- immediate layout
- control-flow behavior
- whether the opcode is normal code, metadata, or trailer-oriented
- whether the opcode participates in loop selector mini-languages
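
A minimal sketch of what such a table could look like follows. The class names, field names, and opcode numbers are all hypothetical placeholders chosen for illustration; they are not taken from the current parser, Pentagram, or `crusader-disasm`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class OpKind(Enum):
    CODE = "code"          # normal executable opcode
    METADATA = "metadata"  # e.g. line-number style records
    TRAILER = "trailer"    # rows that belong to the post-code trailer


class Flow(Enum):
    NEXT = "next"      # falls through to the following opcode
    BRANCH = "branch"  # may transfer control elsewhere
    RETURN = "return"  # ends the current body
    NONE = "none"      # not executable at all


@dataclass(frozen=True)
class OpcodeInfo:
    mnemonic: str
    immediates: Tuple[str, ...] = ()   # immediate layout, e.g. ("u8",)
    stack_delta: Optional[int] = None  # None means "not yet known"
    flow: Flow = Flow.NEXT
    kind: OpKind = OpKind.CODE
    loop_selector: bool = False        # part of a loop selector mini-language


# Byte widths for the immediate type tags used above.
IMMEDIATE_SIZES = {"u8": 1, "s8": 1, "u16": 2, "s16": 2, "u32": 4}

# Placeholder entries only: these opcode numbers and mnemonics are
# invented for the example and do NOT describe the real instruction set.
OPCODES = {
    0xF0: OpcodeInfo("example_push_u8", ("u8",), stack_delta=1),
    0xF1: OpcodeInfo("example_jump", ("s16",), stack_delta=0, flow=Flow.BRANCH),
    0xF2: OpcodeInfo("example_trailer_row", ("u16",), flow=Flow.NONE,
                     kind=OpKind.TRAILER),
}


def encoded_length(info: OpcodeInfo) -> int:
    """Opcode byte plus the declared immediates."""
    return 1 + sum(IMMEDIATE_SIZES[tag] for tag in info.immediates)
```

Keeping `stack_delta` optional lets the table record "not yet known" explicitly instead of forcing a guess, which matches the "stack effect where known" wording above.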

### Expected payoff

- fewer ad hoc decode branches
- easier regression testing against the text corpus
- cleaner IR for later restructuring passes

## Priority 2: Real loop/selector decoding

### What to borrow

From the older disassembly corpus:

- the meaning of `loopscr` tokens such as `end`, `==`, `item->shape`, `item->family`, and typed literal selectors
- the visible repeated patterns in alarm-family and trigger-family bodies

### Why it matters

Right now the parser preserves loop selector bytes faithfully, but readable pseudocode still shows comments like `loopscr value_u8=0x40` instead of the underlying search semantics.

That is the main reason scripts like `ALARMHAT` still read as partially machine-shaped even though the overall behavior is already understandable.

### Concrete change

Introduce a small loop-selector IR layer so common loop forms render as something closer to:

```text
for item in nearby_items(shape=0x04D0, origin=arg_06):
```

or:

```text
for candidate in nearby_items(family=6, origin=arg_06):
```

The first target is not full generality. The first target is the set of repeated loop forms already seen in:

- `NPCTRIG`
- `ALARMHAT`
- `ALARMBOX`
- `ALRMTRIG`
- nearby environmental families
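
One possible shape for that IR layer is sketched below. It assumes the selector bytes have already been decoded into a flat token list; the token order shown (`field`, literal, `==`, `end`) is an assumption for illustration, as are the names `LoopSearch` and `lift_selector`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LoopSearch:
    var: str                      # loop variable name chosen by the renderer
    origin: str                   # expression for the search origin
    shape: Optional[int] = None   # shape filter, if any
    family: Optional[int] = None  # family filter, if any

    def render(self) -> str:
        filters = []
        if self.shape is not None:
            filters.append(f"shape=0x{self.shape:04X}")
        if self.family is not None:
            filters.append(f"family={self.family}")
        filters.append(f"origin={self.origin}")
        return f"for {self.var} in nearby_items({', '.join(filters)}):"


def lift_selector(tokens: Sequence, var: str, origin: str) -> Optional[LoopSearch]:
    """Recognize only the simplest single-comparison selector forms.

    Returns None for anything else so the caller can fall back to the
    existing raw `loopscr` comment rendering.
    """
    if len(tokens) != 4 or tokens[2] != "==" or tokens[3] != "end":
        return None
    field_name, literal = tokens[0], tokens[1]
    if not isinstance(literal, int):
        return None
    if field_name == "item->shape":
        return LoopSearch(var, origin, shape=literal)
    if field_name == "item->family":
        return LoopSearch(var, origin, family=literal)
    return None


print(lift_selector(["item->shape", 0x04D0, "==", "end"], "item", "arg_06").render())
```

Returning `None` for unrecognized selectors keeps the raw byte-faithful output as the fallback, so the decoder can grow one repeated form at a time.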

### Expected payoff

- much better readability for object-searching scripts
- better gameplay interpretation of trigger/controller classes
- a cleaner path to naming common search idioms

## Priority 3: Better intrinsic naming and signatures

### What to borrow

From Pentagram and `crusader-disasm`:

- historical intrinsic names
- text-mined call arities and stack cleanup behavior
- rough prototype guesses from the older corpus tools

### Why it matters

Readable pseudocode is now limited less by control flow than by anonymous calls like `Intrinsic0007()` or generic placeholders like `class_0A18_slot_20(...)`.

The older tools already contain partial information that can improve this materially, as long as it is treated as hint-quality evidence rather than as authority for renaming.

### Concrete change

Build a local intrinsic metadata table with confidence levels:

- `verified`
- `strong hint`
- `weak hint`

Populate it from:

- Pentagram tables
- `usecode_opcodes.txt`
- mined `calli`/`add sp` patterns from `crusader_disasm.txt`
- current repo notes where compiled-side names are already justified
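
As a rough sketch, the table could carry the confidence level and source with every entry and let the renderer promote only `verified` names. The type names, the `source` strings, and the four-hex-digit `IntrinsicNNNN` fallback format are assumptions for this example; `HypotheticalName` is a stand-in, not a real intrinsic name.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, Optional


class Confidence(IntEnum):
    WEAK_HINT = 1
    STRONG_HINT = 2
    VERIFIED = 3


@dataclass(frozen=True)
class IntrinsicHint:
    index: int
    name: str
    confidence: Confidence
    source: str                      # where the hint came from
    arg_bytes: Optional[int] = None  # stack cleanup size, if mined


def display_name(index: int, hints: Iterable[IntrinsicHint]) -> str:
    """Use a name outright only when verified; otherwise show it as a comment."""
    best = max((h for h in hints if h.index == index),
               key=lambda h: h.confidence, default=None)
    placeholder = f"Intrinsic{index:04X}"
    if best is None:
        return placeholder
    if best.confidence is Confidence.VERIFIED:
        return best.name
    return f"{placeholder} /* {best.confidence.name.lower()}: {best.name} */"


hints = [IntrinsicHint(7, "HypotheticalName", Confidence.WEAK_HINT, "pentagram")]
print(display_name(7, hints))
```

Rendering unverified names as trailing comments keeps the hint visible in pseudocode without letting it act as a rename.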

### Expected payoff

- more readable pseudocode
- safer future promotion of intrinsic names
- less confusion between Remorse-only, Regret-only, and cross-game vocabulary

## Priority 4: Explicit code-versus-trailer boundaries

### What to borrow

From Pentagram's symbol-info/debug-symbol handling:

- the idea that `0x5C` points into structured trailer data
- the practical distinction between executable body and debug/local trailer rows

### Why it matters

The `JELYHACK` pass already showed how important this is. Tiny scripts are easy to misread if post-`ret` metadata gets rendered as live code.

The parser now avoids that in readable pseudocode, but the boundary logic should become a first-class part of the IR rather than a readability-only safeguard.

### Concrete change

Make trailer parsing explicit in the IR:

- code extent
- trailer extent
- debug symbol rows
- line-number records
- terminal `END`
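
A minimal sketch of how those extents could be represented is shown below. It assumes the boundary offset has already been determined elsewhere; `BodyLayout`, its field names, and the `(offset, text)` op tuples are hypothetical and do not reflect the current parser's data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BodyLayout:
    code_start: int
    code_end: int     # exclusive; the trailer begins here
    trailer_end: int  # exclusive
    debug_symbols: List[Tuple[int, str]] = field(default_factory=list)  # (offset, name)
    line_numbers: List[Tuple[int, int]] = field(default_factory=list)   # (offset, line)
    has_terminal_end: bool = False

    def is_code(self, offset: int) -> bool:
        return self.code_start <= offset < self.code_end

    def is_trailer(self, offset: int) -> bool:
        return self.code_end <= offset < self.trailer_end


def renderable_ops(ops, layout: BodyLayout):
    """Keep only decoded rows whose offset lies inside the code extent."""
    return [op for op in ops if layout.is_code(op[0])]


layout = BodyLayout(code_start=0, code_end=5, trailer_end=12, has_terminal_end=True)
ops = [(0, "push"), (3, "ret"), (5, "trailer row"), (11, "end")]
print(renderable_ops(ops, layout))
```

With the extents stored in the IR, every later pass (rendering, clustering, export) can ask `is_code` instead of re-deriving the boundary.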

### Expected payoff

- safer whole-corpus export
- better local naming and source-like output
- fewer false positives when mining repeated code bodies

## Priority 5: Corpus-level pattern clustering

### What to borrow

From the `crusader-disasm` corpus mindset:

- treat the full body set as a searchable evidence base, not only as isolated scripts

### Why it matters

The `JELYHACK` result was only obvious after repeated-body comparison showed it was a small shared stub. The same strategy can keep the decompiler honest elsewhere.

### Concrete change

Add corpus analysis helpers that cluster or index:

- exact repeated bodies
- normalized repeated bodies
- repeated loop-selector templates
- repeated spawn/call templates by class and slot

Those results should feed back into readable annotations like:

- `shared interaction stub`
- `alarm-family controller template`
- `common trigger setup pattern`
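
The exact and normalized cases can share one helper, sketched here. It assumes each body is available as a tuple of `(mnemonic, operand)` pairs with the trailer already stripped; the function names and the sample bodies are invented for the example.

```python
import hashlib
from collections import defaultdict


def mnemonics_only(ops):
    """Normalization that ignores operands and keeps only the opcode sequence."""
    return tuple(mnemonic for mnemonic, _operand in ops)


def cluster_bodies(bodies, normalize=None):
    """Group script names whose (optionally normalized) bodies are identical.

    Returns only groups with more than one member, each sorted by name.
    """
    groups = defaultdict(list)
    for name, ops in bodies.items():
        shape = normalize(ops) if normalize else ops
        digest = hashlib.sha256(repr(shape).encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Invented sample data, not real corpus bodies.
bodies = {
    "A": (("push", 1), ("ret", None)),
    "B": (("push", 2), ("ret", None)),
    "C": (("push", 1), ("ret", None)),
}
print(cluster_bodies(bodies))                  # exact repeats
print(cluster_bodies(bodies, mnemonics_only))  # normalized repeats
```

Other normalizations (for example, blanking only class or slot operands) can be passed in the same way to index loop-selector or spawn/call templates.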

### Expected payoff

- faster triage of interesting scripts
- better distinction between generic templates and unique gameplay logic
- fewer overinterpretations of tiny bodies

## Priority 6: Stronger runtime bridge and import path

### What to borrow

From the local repo workflow rather than directly from Pentagram:

- the current runtime anchors already recorded in `runtime_vm_ir.tsv`
- the Ghidra-side annotation path planned in the USECODE notes

### Why it matters

The parser is strongest when its readable output can be tied back to the compiled loader and sequencer. That keeps the decompiler grounded instead of drifting into pure script aesthetics.

### Concrete change

Expand the export and annotation path so pseudocode/index output can carry verified runtime anchors where known, especially around:

- `000d:51fd`
- `000d:5572`
- `000d:46ec`
- `000d:21ed`
- `000d:22bc`
- `000d:ebe3`
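
A small sketch of carrying such anchors through an export row follows. It only parses the `segment:offset` text form used above; the `runtime_anchor` column name, the row layout, and the idea of a separate verified set are assumptions, and nothing here reflects the actual schema of `runtime_vm_ir.tsv`.

```python
import re
from typing import NamedTuple

ANCHOR_RE = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{4})$")


class Anchor(NamedTuple):
    segment: int
    offset: int

    def __str__(self) -> str:
        return f"{self.segment:04x}:{self.offset:04x}"


def parse_anchor(text: str) -> Anchor:
    """Parse 'ssss:oooo' hex text into an Anchor, rejecting anything else."""
    match = ANCHOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a segment:offset anchor: {text!r}")
    return Anchor(int(match.group(1), 16), int(match.group(2), 16))


def attach_anchor(index_row: dict, anchor_text: str, verified: set) -> dict:
    """Return a copy of the row, adding the anchor only if it is verified."""
    anchor = parse_anchor(anchor_text)
    if anchor not in verified:
        return dict(index_row)
    return {**index_row, "runtime_anchor": str(anchor)}


verified = {parse_anchor("000d:51fd")}
print(attach_anchor({"script": "EXAMPLE"}, "000d:51fd", verified))
```

Gating on a verified set keeps unconfirmed addresses out of the exported index, in line with carrying anchors only "where known".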

### Expected payoff

- easier Ghidra-side correlation
- safer promotion of slot/event names
- better compiled-to-script navigation

## Suggested implementation order

1. stabilize opcode metadata tables
2. formalize trailer parsing in IR
3. implement first real loop-selector decoder for common `shape` and `family` searches
4. add intrinsic metadata with confidence levels
5. add corpus clustering/index helpers
6. extend runtime-anchor export/import integration

## What not to do yet

- Do not chase full round-tripping first. Readability is still the higher-value frontier.
- Do not mass-promote intrinsic or event names from Pentagram or the old disasm corpus without current-binary support.
- Do not try to solve every loop/selector form before landing the small repeated set that already appears across the alarm and trigger families.

## Current best next step

The highest-leverage single step is loop-selector decoding.

That is the place where the older tools still give us directly reusable structure and where the current readable output most obviously needs another step forward.