# Pentagram-Derived USECODE Parser And Ghidra Path

## Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The immediate goal is not to make Ghidra decompile Crusader USECODE as if it were x86. The goal is to build one trustworthy bridge layer first:

- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

## What To Reuse From Pentagram

Useful directly:

- the opcode tokenization model from `convert/Convert.h`
- the disassembly-oriented mnemonic layout from `tools/disasm/Disasm.cpp`
- the Crusader event ordinal table from `convert/crusader/ConvertUsecodeCrusader.h`

Useful only as hints:

- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

- Pentagram's Crusader header reader
- any assumption that its old `maxOffset` / `externTable` / `fixupTable` structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path, which should not be treated as a general Crusader decompiler

## Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

- `USECODE/EUSECODE_extracted/class_layout_index.tsv`
- `USECODE/EUSECODE_extracted/class_event_index.tsv`
- `USECODE/EUSECODE_extracted/chunks/`

Current authoritative facts:

- owner-loaded class object index is `class_id + 2`
- class bytes `8..11` provide the code-base anchor already carried in `class_layout_index.tsv`
- slot rows are 6-byte records: `u16 raw_event_entry_word + u32 raw_code_offset`
- slot body windows are already emitted conservatively as `derived_body_start`, `derived_body_end`, and `derived_body_length`
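
As a minimal sketch of the facts above, the 6-byte slot rows can be unpacked with `struct`. Little-endian byte order is an assumption (typical for a DOS binary, not stated here), the function names are hypothetical, and treating the row ordinal as `entry_index` is also an assumption. The caller is expected to pass slot-table bytes that are already isolated; where the table starts inside a class is not covered by this sketch.

```python
import struct

# u16 raw_event_entry_word + u32 raw_code_offset, little-endian ASSUMED.
SLOT_ROW = struct.Struct("<HI")


def owner_loaded_object_index(class_id: int) -> int:
    """Owner-loaded class object index is class_id + 2 (from the notes)."""
    return class_id + 2


def parse_slot_rows(table: bytes) -> list[dict]:
    """Split an already-isolated slot table into 6-byte records."""
    if len(table) % SLOT_ROW.size:
        raise ValueError("slot table length is not a multiple of 6 bytes")
    rows = []
    for ordinal, (word, offset) in enumerate(SLOT_ROW.iter_unpack(table)):
        rows.append(
            {
                "entry_index": ordinal,  # ASSUMED: row ordinal == entry_index
                "raw_event_entry_word": word,
                "raw_code_offset": offset,
            }
        )
    return rows
```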

## End-To-End Process

### 1. Start from extracted owner-loaded artifacts

The parser should not reopen `EUSECODE.FLX` directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

- one row from `class_layout_index.tsv`
- one row from `class_event_index.tsv`
- the corresponding chunk file under `USECODE/EUSECODE_extracted/chunks/`

### 2. Select one body window conservatively

For a chosen class and slot:

- locate `entry_index`
- confirm `derived_body_start` and `derived_body_end`
- slice the chunk-local body bytes exactly from that range
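
The selection step could look roughly like the sketch below. It assumes the TSV rows passed in are already filtered to one class, that numeric cells are decimal or `0x`-prefixed hex, and that it is not yet known whether `derived_body_end` is inclusive or exclusive, so `derived_body_length` is used to decide. The helper names are hypothetical.

```python
import csv
import io


def load_tsv_rows(text: str) -> list[dict]:
    """Read tab-separated rows with a header line into dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def parse_int(cell: str) -> int:
    """Accept decimal or 0x-prefixed hex (ASSUMED cell formats)."""
    return int(cell.strip(), 0)


def select_body_window(event_rows: list[dict], chunk: bytes, entry_index: int) -> bytes:
    """Slice one slot body out of the chunk bytes, refusing to guess."""
    for row in event_rows:
        if parse_int(row["entry_index"]) == entry_index:
            break
    else:
        raise KeyError(f"entry_index {entry_index:#x} not found")

    start = parse_int(row["derived_body_start"])
    end = parse_int(row["derived_body_end"])
    length = parse_int(row["derived_body_length"])

    # Let the recorded length decide between exclusive and inclusive ends.
    if end - start == length:
        stop = end
    elif end - start + 1 == length:
        stop = end + 1
    else:
        raise ValueError("start/end/length columns disagree")

    if not 0 <= start <= stop <= len(chunk):
        raise ValueError("body window falls outside the chunk")
    return chunk[start:stop]
```

Raising on any disagreement between the three columns keeps the selection conservative: a mismatch is reported instead of silently sliced.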

### 3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

- parse an op only when its operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes
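
A minimal sketch of that conservative loop follows. The operand-width table is passed in by the caller and must be filled from Pentagram's operand formats; this sketch does not supply real widths. It also simplifies by assuming every known opcode has a fixed operand width, which will not hold for variable-length operands. Function and field names are hypothetical.

```python
def decode_body(body: bytes, operand_widths: dict[int, int]) -> dict:
    """Decode ops until the body ends or something is not understood.

    operand_widths maps opcode -> operand byte count (caller-supplied).
    Every byte of the body ends up either in an op's raw bytes or in the tail.
    """
    ops = []
    pos = 0
    stop_reason = "end_of_body"
    while pos < len(body):
        opcode = body[pos]
        width = operand_widths.get(opcode)
        if width is None:
            stop_reason = "unknown_opcode"
            break
        end = pos + 1 + width
        if end > len(body):
            stop_reason = "truncated_operand"
            break
        ops.append({"offset": pos, "opcode": opcode, "raw": body[pos:end].hex()})
        pos = end
    return {
        "ops": ops,
        "stop_reason": stop_reason,
        "tail_offset": pos,
        "tail": body[pos:].hex(),  # preserved, never discarded
    }
```

Because raw bytes are kept per op and the tail is preserved, concatenating all `raw` fields plus `tail` reproduces the input body exactly, which is what makes the output lossless.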

### 4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors
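
One possible shape for that object is sketched below. Only `annotation_hints` is a name taken from these notes; every other key (`ir_version`, `source`, `class`, `slot`, `ops`) is a hypothetical placeholder and the actual tool may use a different layout.

```python
import json


def build_ir(source: dict, class_meta: dict, slot_meta: dict,
             ops: list[dict], annotation_hints: list[dict]) -> dict:
    """Assemble one IR v1 object from already-decoded parts."""
    return {
        "ir_version": 1,
        "source": source,                      # artifact paths, tool name
        "class": class_meta,                   # class-level metadata
        "slot": slot_meta,                     # slot/event metadata
        "ops": ops,                            # exact op list with raw bytes
        "annotation_hints": annotation_hints,  # hints for compiled-side anchors
    }


def dump_ir(ir: dict) -> str:
    """Serialize with stable key order so diffs between runs stay readable."""
    return json.dumps(ir, indent=2, sort_keys=True)
```

Keeping raw bytes as hex strings inside `ops` lets the JSON round-trip without loss.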

### 5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.

Do not try to map the bytecode into a full processor module first.

## Proof-Of-Concept Parser

Tool path:

- `tools/poc_crusader_usecode_parser.py`

Current scope:

- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON

Current deliberate limits:

- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

## Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

### 1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

- `NPCTRIG slot 0x0A body parsed by POC tool`
- `EVENT slot 0x0A body parsed by POC tool`
- `slot 0x13 payload-shape hint = signed_word`

### 2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

- `000d:51fd` = slot value load path
- `000d:5572` = slot value plus additive word
- `000d:46ec` = context create from slot index
- `000d:22bc` = decoded matrix/pushback consumer
- `000d:ebe3` = opcode sequence runner

Comment payload should stay short and evidence-heavy, for example:

`POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, payload shape unresolved, parsed via tools/poc_crusader_usecode_parser.py`
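
As a sketch, a comment in that shape could be produced by a small formatter like the one below. The function name and parameter names are hypothetical; the output format mirrors the example payload above.

```python
def format_anchor_comment(class_name: str, slot: int, body_start: int,
                          body_end: int, raw_word: int,
                          payload_shape: str = "unresolved",
                          tool: str = "tools/poc_crusader_usecode_parser.py") -> str:
    """Build a short, evidence-heavy comment string for a compiled anchor."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, "
        f"raw word 0x{raw_word:04X}, "
        f"payload shape {payload_shape}, parsed via {tool}"
    )
```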

### 3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

- `slot-backed-owner-loaded-body`
- `slot-plus-offset-value-reload`
- `sequencer-matrix-consumer`
- `literal-replay-interpreter-upstream`
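
Grouping by runtime family could be sketched as below. The hint record layout (a `family` key next to whatever else a hint carries) is an assumption; these notes do not fix a schema for hint records.

```python
from collections import defaultdict


def group_by_family(hints: list[dict]) -> dict[str, list[dict]]:
    """Bundle annotation hints by runtime family, preserving input order."""
    bundles: dict[str, list[dict]] = defaultdict(list)
    for hint in hints:
        # Hints without a family are kept visible instead of being dropped.
        bundles[hint.get("family", "unassigned")].append(hint)
    return dict(bundles)
```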

## Why Not A Ghidra Processor Yet

The missing pieces are still too important:

- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later `000c:fa2f` literal/replay lane is still not closed end-to-end
- the upstream selector into `entity_vm_opcode_sequence_run` is still unresolved

So the right order is:

1. parser
2. IR
3. annotation import
4. only then reconsider a language module

## User Workflow

Run the proof-of-concept parser from the repo root.

Example:

```powershell
python tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
```

Recommended first targets:

1. `NPCTRIG` slot `0x0A`
2. `NPCTRIG` slot `0x20`
3. `EVENT` slot `0x0A`
4. one `_BOOT` slot `0x10` body as a short repeated-template control sample

What to look for in the output:

- exact raw body window
- whether the body terminates cleanly at opcode `0x7A`
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions
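
Two of these checks are easy to automate once an op list exists. The sketch below assumes each op is a dict with an `opcode` key and that undecoded bytes are reported as a hex string; both are hypothetical field conventions. Opcode bigram counts are only one crude way to surface repeated motifs.

```python
from collections import Counter

END_OPCODE = 0x7A  # terminator opcode named in the checklist above


def terminates_cleanly(ops: list[dict], tail_hex: str = "") -> bool:
    """True when the last decoded op is 0x7A and no undecoded tail remains."""
    return bool(ops) and ops[-1]["opcode"] == END_OPCODE and tail_hex == ""


def opcode_bigrams(ops: list[dict]) -> Counter:
    """Count adjacent opcode pairs as a rough repeated-motif signal."""
    codes = [op["opcode"] for op in ops]
    return Counter(zip(codes, codes[1:]))
```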

## Next Extensions

1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
3. Add a small importer that converts `annotation_hints` into Ghidra comments and bookmarks.
4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.
