Crusader_Decomp/docs/usecode-pentagram-ghidra-path.md


# Pentagram-Derived USECODE Parser And Ghidra Path
## Purpose
This note turns the earlier feasibility assessment into a concrete workflow.
The immediate goal is not to make Ghidra decompile Crusader USECODE as if it were x86. The goal is to build one trustworthy bridge layer first:
- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations
## What To Reuse From Pentagram
Useful directly:
- the opcode tokenization model from `convert/Convert.h`
- the disassembly-oriented mnemonic layout from `tools/disasm/Disasm.cpp`
- the Crusader event ordinal table from `convert/crusader/ConvertUsecodeCrusader.h`
Useful only as hints:
- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals
Not safe to reuse unchanged:
- Pentagram's Crusader header reader
- any assumption that its old `maxOffset` / `externTable` / `fixupTable` structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path as if it were a general Crusader decompiler
## Verified Local Model To Use Instead
The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.
Current authoritative inputs:
- `USECODE/EUSECODE_extracted/class_layout_index.tsv`
- `USECODE/EUSECODE_extracted/class_event_index.tsv`
- `USECODE/EUSECODE_extracted/chunks/`
Current authoritative facts:
- owner-loaded class object index is `class_id + 2`
- class bytes `8..11` provide the code-base anchor already carried in `class_layout_index.tsv`
- slot rows are 6-byte records: `u16 raw_event_entry_word + u32 raw_code_offset`
- slot body windows are already emitted conservatively as `derived_body_start`, `derived_body_end`, and `derived_body_length`
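
As a minimal sketch of the facts above, the 6-byte slot row and the code-base anchor can be read with Python's `struct` module. Little-endian byte order and treating class bytes `8..11` as one unsigned 32-bit value are assumptions made here for the 16-bit DOS target; the helper names are hypothetical and are not taken from the extractor.

```python
import struct

# Assumption: little-endian layout; this note does not state byte order.
SLOT_ROW = struct.Struct("<HI")  # u16 raw_event_entry_word + u32 raw_code_offset


def owner_object_index(class_id: int) -> int:
    """Owner-loaded class object index is class_id + 2."""
    return class_id + 2


def read_code_base_anchor(class_bytes: bytes) -> int:
    """Read class bytes 8..11 as one u32 (width and endianness are assumptions)."""
    return struct.unpack_from("<I", class_bytes, 8)[0]


def parse_slot_row(row: bytes) -> dict:
    """Split one 6-byte slot row into its two raw fields."""
    if len(row) != SLOT_ROW.size:
        raise ValueError(f"expected {SLOT_ROW.size} bytes, got {len(row)}")
    word, offset = SLOT_ROW.unpack(row)
    return {"raw_event_entry_word": word, "raw_code_offset": offset}
```

The fields stay raw on purpose: any interpretation of `raw_event_entry_word` belongs in a later layer, so the reader never has to guess what was already normalized.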
## End-To-End Process
### 1. Start from extracted owner-loaded artifacts
The parser should not reopen `EUSECODE.FLX` directly for the proof of concept. The extractor has already normalized the class and slot selection step.
Inputs:
- one row from `class_layout_index.tsv`
- one row from `class_event_index.tsv`
- the corresponding chunk file under `USECODE/EUSECODE_extracted/chunks/`
### 2. Select one body window conservatively
For a chosen class and slot:
- locate `entry_index`
- confirm `derived_body_start` and `derived_body_end`
- slice the chunk-local body bytes exactly from that range
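
The selection step above could look roughly like the following sketch. It assumes the TSV cells are integer strings that `int(value, 0)` can parse (for example `218` or `0x00DA`), and it does not assume whether `derived_body_end` is inclusive or exclusive: it slices by start and length, and only checks that the row is self-consistent under one of the two conventions. The function name is hypothetical.

```python
def slice_body(chunk: bytes, row: dict) -> bytes:
    """Return the chunk-local body bytes for one class_event_index.tsv row."""
    start = int(row["derived_body_start"], 0)
    end = int(row["derived_body_end"], 0)
    length = int(row["derived_body_length"], 0)

    # Accept either end convention, but refuse rows that disagree with themselves.
    if length not in (end - start, end - start + 1):
        raise ValueError(
            f"inconsistent window: start={start:#x} end={end:#x} length={length}"
        )
    # Refuse windows that fall outside the chunk instead of silently truncating.
    if start < 0 or length < 0 or start + length > len(chunk):
        raise ValueError("body window falls outside the chunk")

    return chunk[start:start + length]
```

Failing loudly on an inconsistent row keeps the parser from decoding bytes that belong to a neighbouring body.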
### 3. Decode opcodes with Pentagram-derived operand formats
Use Pentagram's operand-width model as the first parser source of truth.
For the proof of concept, keep decoding conservative:
- parse an op only when its operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes
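
A conservative decode loop along these lines is sketched below. The operand-width table holds placeholder entries only and should be replaced with widths derived from Pentagram's `convert/Convert.h`. Stopping after opcode `0x7A` and keeping everything behind it as tail bytes is also an assumption of this sketch, as are the `Op` and `decode_body` names.

```python
from dataclasses import dataclass

END_OPCODE = 0x7A  # this note expects bodies to terminate at opcode 0x7A

# PLACEHOLDER operand widths (bytes that follow the opcode byte).
# Illustrative only; derive the real table from Pentagram's Convert.h.
OPERAND_WIDTHS = {0x0A: 1, 0x0B: 2, END_OPCODE: 0}


@dataclass(frozen=True)
class Op:
    offset: int  # body-local offset of the opcode byte
    opcode: int
    raw: bytes   # opcode byte plus operand bytes, verbatim


def decode_body(body: bytes, widths=OPERAND_WIDTHS):
    """Return (ops, tail, stop_reason) without dropping any byte."""
    ops = []
    pos = 0
    reason = "end_of_body"
    while pos < len(body):
        opcode = body[pos]
        width = widths.get(opcode)
        if width is None:
            reason = "unknown_opcode"
            break
        end = pos + 1 + width
        if end > len(body):
            reason = "truncated_operand"
            break
        ops.append(Op(pos, opcode, body[pos:end]))
        pos = end
        if opcode == END_OPCODE:
            reason = "end_opcode"
            break
    return ops, body[pos:], reason
```

Concatenating every `raw` field and then the tail reproduces the input body exactly, which is the property that makes the output safe to feed into a lossless IR.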
### 4. Emit canonical IR v1
The parser output should be one machine-friendly object that includes:
- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors
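
One possible shape for that object is sketched below. Apart from `annotation_hints`, every key name here is a hypothetical choice for illustration, not the schema the tool actually emits, and the sample values are made up for the example. Raw bytes are stored as hex strings so the JSON stays lossless.

```python
import json


def build_ir(source, class_meta, slot_meta, ops, tail, annotation_hints):
    """Assemble one IR object; ops is a list of (offset, opcode, raw_bytes)."""
    return {
        "ir_version": 1,
        "source": source,
        "class": class_meta,
        "slot": slot_meta,
        "ops": [
            {"offset": offset, "opcode": opcode, "raw_hex": raw.hex()}
            for offset, opcode, raw in ops
        ],
        "tail_hex": tail.hex(),
        "annotation_hints": annotation_hints,
    }


def body_from_ir(ir) -> bytes:
    """Rebuild the original body bytes from the IR (lossless round trip)."""
    parsed = b"".join(bytes.fromhex(op["raw_hex"]) for op in ir["ops"])
    return parsed + bytes.fromhex(ir["tail_hex"])


# Illustrative values only.
ir = build_ir(
    source={"event_index": "USECODE/EUSECODE_extracted/class_event_index.tsv"},
    class_meta={"name": "NPCTRIG"},
    slot_meta={"slot": 0x0A, "raw_event_entry_word": 0x013E},
    ops=[(0, 0x0A, b"\x0a\x01"), (2, 0x7A, b"\x7a")],
    tail=b"\x00\x00",
    annotation_hints=[{"anchor": "000d:46ec", "kind": "plate_comment"}],
)
text = json.dumps(ir, indent=2, sort_keys=True)
```

Being able to rebuild the body from the IR alone is a cheap check that no later consumer needs to reopen the chunk file.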
### 5. Feed Ghidra with annotations, not with fake code yet
The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.
Do not try to map the bytecode into a full processor module first.
## Proof-Of-Concept Parser
Tool path:
- `tools/poc_crusader_usecode_parser.py`
Current scope:
- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON
Current deliberate limits:
- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet
That keeps the parser useful without pretending the VM is fully solved.
## Canonical Ghidra Annotation Import Path
The first importer should consume the parser IR and create only three kinds of output.
### 1. Bookmarks
Use bookmarks for class/slot-level evidence that should not be hidden inside comments.
Good first bookmark payloads:
- `NPCTRIG slot 0x0A body parsed by POC tool`
- `EVENT slot 0x0A body parsed by POC tool`
- `slot 0x13 payload-shape hint = signed_word`
### 2. Plate or decompiler comments on compiled anchors
Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.
Best current anchors:
- `000d:46ec` = context create from slot index
- `000d:0988` = referent-chain mutation family (`0x18..0x1b`)
- `000d:208b` = materialize-or-forward value lane
- `000d:21ed` = inline payload prepend stage
- `000d:22bc` = decoded matrix/pushback consumer
- `000d:2104` = mixed immediate/object finalize-to-outptr stage
- `000d:ebe3` = opcode sequence runner
Comment payload should stay short and evidence-heavy, for example:
`POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, 5 local/debug rows after ret, parsed via tools/poc_crusader_usecode_parser.py`
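
A small formatter can keep that payload shape stable across classes and slots. This is a sketch that reproduces the example string above; the function and parameter names are hypothetical.

```python
def format_anchor_comment(
    class_name: str,
    slot: int,
    body_start: int,
    body_end: int,
    raw_word: int,
    trailing_rows: int,
    tool: str = "tools/poc_crusader_usecode_parser.py",
) -> str:
    """Build one short, evidence-heavy comment line for a compiled anchor."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, "
        f"raw word 0x{raw_word:04X}, "
        f"{trailing_rows} local/debug rows after ret, "
        f"parsed via {tool}"
    )


comment = format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E, 5)
```

Generating the line from IR fields rather than typing it by hand means a comment in Ghidra can always be traced back to one parser run.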
### 3. Optional comment bundles per runtime family
If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.
Examples:
- `slot-backed-owner-loaded-body`
- `slot-plus-offset-value-reload`
- `sequencer-matrix-consumer`
- `literal-replay-interpreter-upstream`
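
Grouping by runtime family could be as simple as the sketch below. The hint fields (`family`, `anchor`, `comment`) and the anchor placeholders are hypothetical; this note does not define which compiled anchor belongs to which family, so the sample data deliberately avoids pairing them.

```python
from collections import defaultdict


def bundle_by_family(hints):
    """Group annotation hints by runtime family, preserving input order."""
    bundles = defaultdict(list)
    for hint in hints:
        bundles[hint["family"]].append(hint)
    return dict(bundles)


# Placeholder anchors: the real family-to-anchor mapping is not specified here.
hints = [
    {"family": "slot-backed-owner-loaded-body", "anchor": "<anchor-1>", "comment": "a"},
    {"family": "sequencer-matrix-consumer", "anchor": "<anchor-2>", "comment": "b"},
    {"family": "slot-backed-owner-loaded-body", "anchor": "<anchor-3>", "comment": "c"},
]
bundles = bundle_by_family(hints)
```

Keying on the family rather than the class name lets one bundle carry evidence from several classes that exercise the same compiled path.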
## Why Not A Ghidra Processor Yet
The missing pieces are still too important:
- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later `000c:fa2f` literal/replay lane is still not closed end-to-end
- the upstream selector into `entity_vm_opcode_sequence_run` is still unresolved
So the right order is:
1. parser
2. IR
3. annotation import
4. only then reconsider a language module
## User Workflow
Run the proof-of-concept parser from the repo root.
Example:
```powershell
python tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
```
Recommended first targets:
1. `NPCTRIG` slot `0x0A`
2. `NPCTRIG` slot `0x20`
3. `EVENT` slot `0x0A`
4. one `_BOOT` slot `0x10` body as a short repeated-template control sample
What to look for in the output:
- exact raw body window
- whether the body terminates cleanly at opcode `0x7A`
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions
## Next Extensions
1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
3. Add a small importer that converts `annotation_hints` into Ghidra comments and bookmarks.
4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.