# Pentagram-Derived USECODE Parser And Ghidra Path

## Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The immediate goal is not to make Ghidra decompile Crusader USECODE as if it were x86. The goal is to build one trustworthy bridge layer first:

- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

## What To Reuse From Pentagram

Useful directly:

- the opcode tokenization model from `convert/Convert.h`
- the disassembly-oriented mnemonic layout from `tools/disasm/Disasm.cpp`
- the Crusader event ordinal table from `convert/crusader/ConvertUsecodeCrusader.h`

Useful only as hints:

- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

- Pentagram's Crusader header reader
- any assumption that its old `maxOffset` / `externTable` / `fixupTable` structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path, treated as if it were a general Crusader decompiler

## Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

- `USECODE/EUSECODE_extracted/class_layout_index.tsv`
- `USECODE/EUSECODE_extracted/class_event_index.tsv`
- `USECODE/EUSECODE_extracted/chunks/`

Current authoritative facts:

- owner-loaded class object index is `class_id + 2`
- class bytes `8..11` provide the code-base anchor already carried in `class_layout_index.tsv`
- slot rows are 6-byte records: `u16 raw_event_entry_word + u32 raw_code_offset`
- slot body windows are already emitted conservatively as `derived_body_start`, `derived_body_end`, and `derived_body_length`
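
The slot-row layout and the index rule above can be sketched as follows. This is a minimal illustration, not the extractor's actual code: it assumes the fields are little-endian and packed (plausible for a DOS target, but not stated in this note), and the names `SlotRow`, `parse_slot_rows`, and `owner_object_index` are hypothetical.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian, no padding between the two fields.
SLOT_ROW = struct.Struct("<HI")  # u16 raw_event_entry_word + u32 raw_code_offset


class SlotRow(NamedTuple):
    raw_event_entry_word: int
    raw_code_offset: int


def owner_object_index(class_id: int) -> int:
    """Owner-loaded class object index, per the fact listed above."""
    return class_id + 2


def parse_slot_rows(table: bytes) -> list[SlotRow]:
    """Split a slot table into 6-byte rows; reject a ragged tail instead of guessing."""
    if len(table) % SLOT_ROW.size:
        raise ValueError(
            f"slot table length {len(table)} is not a multiple of {SLOT_ROW.size}"
        )
    return [SlotRow(*fields) for fields in SLOT_ROW.iter_unpack(table)]
```

Rejecting a table whose length is not a multiple of six keeps the parser conservative: a ragged tail is reported rather than silently dropped.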

## End-To-End Process

### 1. Start from extracted owner-loaded artifacts

The parser should not reopen `EUSECODE.FLX` directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

- one row from `class_layout_index.tsv`
- one row from `class_event_index.tsv`
- the corresponding chunk file under `USECODE/EUSECODE_extracted/chunks/`

### 2. Select one body window conservatively

For a chosen class and slot:

- locate `entry_index`
- confirm `derived_body_start` and `derived_body_end`
- slice the chunk-local body bytes exactly from that range
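
A minimal sketch of the selection step is shown here. The `derived_body_*` and `entry_index` column names come from this note; the `class_name` column, the helper names, and the treatment of `derived_body_end` as an exclusive bound are assumptions. The length cross-check is there so that a wrong bound assumption fails loudly instead of producing an off-by-one slice.

```python
import csv
import io


def read_tsv(text: str) -> list[dict[str, str]]:
    """Parse TSV text with a header row into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_row(rows: list[dict[str, str]], class_name: str, entry_index: int) -> dict[str, str]:
    """Return the single row for a class/slot pair; 'class_name' is a hypothetical column."""
    matches = [
        r for r in rows
        if r["class_name"] == class_name and int(r["entry_index"], 0) == entry_index
    ]
    if len(matches) != 1:
        raise LookupError(f"expected one row for {class_name}/{entry_index}, got {len(matches)}")
    return matches[0]


def slice_body(chunk: bytes, row: dict[str, str]) -> bytes:
    """Slice the chunk-local body window, assuming an exclusive end bound."""
    # int(x, 0) accepts decimal or 0x-prefixed hex cell values.
    start = int(row["derived_body_start"], 0)
    end = int(row["derived_body_end"], 0)
    length = int(row["derived_body_length"], 0)
    if not 0 <= start <= end <= len(chunk):
        raise ValueError(f"window {start:#x}..{end:#x} falls outside a {len(chunk)}-byte chunk")
    if end - start != length:
        raise ValueError(f"window {start:#x}..{end:#x} disagrees with length {length}")
    return chunk[start:end]
```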

### 3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

- parse an op only when its operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes
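
The three rules above can be sketched as a small decode loop. The operand-width table here is a placeholder for illustration only and has not been checked against Pentagram; a real table would be derived from `convert/Convert.h`. The names `Op`, `OPERAND_WIDTHS`, and `decode_body` are hypothetical, and treating a truncated operand the same way as an unknown opcode is a choice made for this sketch.

```python
from dataclasses import dataclass

# Placeholder widths (operand bytes following the opcode byte). Illustrative only:
# these values are NOT verified against Pentagram or the Crusader VM.
OPERAND_WIDTHS: dict[int, int] = {0x0A: 1, 0x0B: 2, 0x0C: 4}


@dataclass(frozen=True)
class Op:
    offset: int  # body-local offset of the opcode byte
    opcode: int
    raw: bytes   # opcode byte plus operand bytes, kept verbatim


def decode_body(body: bytes, widths: dict[int, int] = OPERAND_WIDTHS) -> tuple[list[Op], bytes]:
    """Decode ops until an unknown opcode or truncated operand; return (ops, tail)."""
    ops: list[Op] = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        width = widths.get(opcode)
        if width is None or pos + 1 + width > len(body):
            break  # stop cleanly; everything from pos onward is preserved as tail
        ops.append(Op(pos, opcode, body[pos:pos + 1 + width]))
        pos += 1 + width
    return ops, body[pos:]
```

Because every parsed op keeps its raw bytes and the tail is returned untouched, concatenating the op bytes and the tail always reproduces the input body.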

### 4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors
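
One possible shape for such an object is sketched here. Apart from `annotation_hints`, every key name and both function names (`build_ir`, `rebuild_body`) are assumptions made for illustration, not the tool's actual schema. The sketch also shows a round-trip check that makes the "lossless" property testable.

```python
import json


def build_ir(chunk_path: str, class_name: str, slot: int, body_start: int,
             ops: list[tuple[int, bytes]], tail: bytes) -> dict:
    """Assemble a JSON-serializable IR object; ops are (body-local offset, raw bytes)."""
    return {
        "ir_version": 1,
        "source": {"chunk_path": chunk_path},
        "class": {"name": class_name},
        "slot": {"index": slot, "body_start": body_start},
        "ops": [{"offset": off, "opcode": raw[0], "raw": raw.hex()} for off, raw in ops],
        "tail": tail.hex(),        # undecoded bytes after the last parsed op
        "annotation_hints": [],    # filled in later for compiled-side VM anchors
    }


def rebuild_body(ir: dict) -> bytes:
    """Reassemble the body bytes from the IR alone."""
    return b"".join(bytes.fromhex(op["raw"]) for op in ir["ops"]) + bytes.fromhex(ir["tail"])


def dump_ir(ir: dict) -> str:
    """Serialize with stable key order so repeated runs diff cleanly."""
    return json.dumps(ir, indent=2, sort_keys=True)
```

If `rebuild_body(json.loads(dump_ir(ir)))` equals the original body slice, the IR has lost nothing, and both the text listing and later Ghidra annotations can be generated from it alone.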

### 5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.

Do not try to map the bytecode into a full processor module first.
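
As a rough sketch of what "annotations only" could mean in practice, the helper below turns hint records into a neutral plan that a later Ghidra script could apply. It deliberately calls no Ghidra API. The hint fields (`kind`, `address`, `text`), the allowed kinds, and the function names are all assumptions for illustration; only the `segment:offset` address notation is taken from this note.

```python
import re

SEG_OFF = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{4})$")
ALLOWED_KINDS = {"bookmark", "plate_comment"}  # hypothetical kind names


def parse_seg_off(text: str) -> tuple[int, int]:
    """Parse an address written as 'ssss:oooo' (hex) into (segment, offset)."""
    match = SEG_OFF.match(text)
    if match is None:
        raise ValueError(f"not a segment:offset address: {text!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def plan_annotations(hints: list[dict]) -> list[dict]:
    """Validate hints and return a plan; anything that is not an annotation is rejected."""
    plan = []
    for hint in hints:
        if hint["kind"] not in ALLOWED_KINDS:
            raise ValueError(f"unsupported annotation kind: {hint['kind']!r}")
        segment, offset = parse_seg_off(hint["address"])
        plan.append({"kind": hint["kind"], "segment": segment,
                     "offset": offset, "text": hint["text"]})
    return plan
```

Keeping the plan separate from the Ghidra calls means it can be reviewed or diffed as plain data before anything is written into the program database.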

## Proof-Of-Concept Parser

Tool path:

- `tools/poc_crusader_usecode_parser.py`

Current scope:

- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON

Current deliberate limits:

- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

## Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

### 1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

- `NPCTRIG slot 0x0A body parsed by POC tool`
- `EVENT slot 0x0A body parsed by POC tool`
- `slot 0x13 payload-shape hint = signed_word`

### 2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

- `000d:46ec` = context create from slot index
- `000d:0988` = referent-chain mutation family (`0x18..0x1b`)
- `000d:208b` = materialize-or-forward value lane
- `000d:21ed` = inline payload prepend stage
- `000d:22bc` = decoded matrix/pushback consumer
- `000d:2104` = mixed immediate/object finalize-to-outptr stage
- `000d:ebe3` = opcode sequence runner

Comment payload should stay short and evidence-heavy, for example:

`POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, 5 local/debug rows after ret, parsed via tools/poc_crusader_usecode_parser.py`
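
A comment in this shape can be generated rather than typed by hand. The helper below is a hypothetical sketch (the function and parameter names are not part of the tool) that reproduces the example payload from its component values.

```python
def format_anchor_comment(class_name: str, slot: int, body_start: int, body_end: int,
                          raw_word: int, rows_after_ret: int,
                          tool: str = "tools/poc_crusader_usecode_parser.py") -> str:
    """Build a short, evidence-heavy anchor comment from parsed slot values."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, raw word 0x{raw_word:04X}, "
        f"{rows_after_ret} local/debug rows after ret, parsed via {tool}"
    )


print(format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E, 5))
```

Fixed-width hex formatting keeps comments from different slots visually aligned and easy to search in Ghidra.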

### 3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

- `slot-backed-owner-loaded-body`
- `slot-plus-offset-value-reload`
- `sequencer-matrix-consumer`
- `literal-replay-interpreter-upstream`

## Why Not A Ghidra Processor Yet

The missing pieces are still too important:

- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later `000c:fa2f` literal/replay lane is still not closed end-to-end
- the upstream selector into `entity_vm_opcode_sequence_run` is still unresolved

So the right order is:

1. parser
2. IR
3. annotation import
4. only then reconsider a language module

## User Workflow

Run the proof-of-concept parser from the repo root.

Example:

```powershell
python tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
```

Recommended first targets:

1. `NPCTRIG` slot `0x0A`
2. `NPCTRIG` slot `0x20`
3. `EVENT` slot `0x0A`
4. one `_BOOT` slot `0x10` body as a short repeated-template control sample

What to look for in the output:

- exact raw body window
- whether the body terminates cleanly at opcode `0x7A`
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions

## Next Extensions

1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
3. Add a small importer that converts `annotation_hints` into Ghidra comments and bookmarks.
4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.