Crusader_Decomp/docs/usecode-pentagram-ghidra-path.md
MaddoScientisto daa363c3d2 Add 'annotate-usecode' command to import USECODE IR JSON annotations
2026-03-24 18:14:20 +01:00

Pentagram-Derived USECODE Parser And Ghidra Path

Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The immediate goal is not to make Ghidra decompile Crusader USECODE as if it were x86. The goal is to build one trustworthy bridge layer first:

  • reuse Pentagram's Crusader opcode decoding where it is still valid
  • replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
  • emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

What To Reuse From Pentagram

Useful directly:

  • the opcode tokenization model from convert/Convert.h
  • the disassembly-oriented mnemonic layout from tools/disasm/Disasm.cpp
  • the Crusader event ordinal table from convert/crusader/ConvertUsecodeCrusader.h

Useful only as hints:

  • intrinsic names and signatures
  • old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

  • Pentagram's Crusader header reader
  • any assumption that its old maxOffset / externTable / fixupTable structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
  • the partial Node-based decompiler path, which should not be treated as a general Crusader decompiler

Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

  • USECODE/EUSECODE_extracted/class_layout_index.tsv
  • USECODE/EUSECODE_extracted/class_event_index.tsv
  • USECODE/EUSECODE_extracted/chunks/

Current authoritative facts:

  • owner-loaded class object index is class_id + 2
  • class bytes 8..11 provide the code-base anchor already carried in class_layout_index.tsv
  • slot rows are 6-byte records: u16 raw_event_entry_word + u32 raw_code_offset
  • slot body windows are already emitted conservatively as derived_body_start, derived_body_end, and derived_body_length

End-To-End Process

1. Start from extracted owner-loaded artifacts

The parser should not reopen EUSECODE.FLX directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

  • one row from class_layout_index.tsv
  • one row from class_event_index.tsv
  • the corresponding chunk file under USECODE/EUSECODE_extracted/chunks/

2. Select one body window conservatively

For a chosen class and slot:

  • locate entry_index
  • confirm derived_body_start and derived_body_end
  • slice the chunk-local body bytes exactly from that range
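
Steps 1 and 2 can be sketched as follows. The column names entry_index, derived_body_start, derived_body_end, and derived_body_length come from this note; the class_name column is an assumption about the TSV header, and whether derived_body_end is inclusive or exclusive is not stated here, so the sketch makes that an explicit parameter and cross-checks it against derived_body_length.

```python
import csv
import io


def find_slot_row(tsv_text, class_name, entry_index):
    """Return the TSV row for one class/slot pair.

    Assumption: the TSV has a 'class_name' column; adjust to the real header.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if row["class_name"] == class_name and int(row["entry_index"], 0) == entry_index:
            return row
    raise KeyError(f"no row for {class_name} slot {entry_index:#x}")


def slice_body(chunk, row, end_inclusive=False):
    """Slice the chunk-local body bytes exactly from the derived window."""
    start = int(row["derived_body_start"], 0)
    end = int(row["derived_body_end"], 0)
    length = int(row["derived_body_length"], 0)
    stop = end + 1 if end_inclusive else end
    if not 0 <= start <= stop <= len(chunk):
        raise ValueError("derived body window falls outside the chunk")
    if stop - start != length:
        raise ValueError("derived_body_length disagrees with start/end")
    return chunk[start:stop]
```

Failing loudly when the three derived fields disagree keeps the slice conservative: a wrong guess about the end convention shows up as an error instead of an off-by-one body.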

3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

  • parse an op exactly, and only when its operand format is understood
  • keep the raw bytes for every parsed op
  • stop cleanly on an unknown opcode and preserve the remaining tail bytes
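
A minimal sketch of that conservative loop is shown below. The operand-width table is a placeholder: only opcode 0x7A is named in this note, and the other entries are illustrative stand-ins that must be replaced with widths checked against Pentagram's convert/Convert.h. The dictionary keys in the output are likewise hypothetical.

```python
# Placeholder operand widths in bytes. Only 0x7A is named in this note;
# the other entries are illustrative and must be verified against Pentagram.
OPERAND_WIDTHS = {
    0x0A: 1,
    0x0B: 2,
    0x50: 0,
    0x7A: 0,
}


def decode_body(body, widths=OPERAND_WIDTHS):
    """Decode ops until the first unknown or truncated one, keeping all bytes."""
    ops = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        width = widths.get(opcode)
        # Stop cleanly on an unknown opcode or an operand that runs off the end.
        if width is None or pos + 1 + width > len(body):
            break
        raw = body[pos:pos + 1 + width]
        ops.append({
            "offset": pos,
            "opcode": opcode,
            "operand": raw[1:].hex(),
            "raw": raw.hex(),
        })
        pos += 1 + width
    # Everything from pos onward is preserved untouched as the tail.
    return {"ops": ops, "tail_offset": pos, "tail": body[pos:].hex()}
```

Because every op carries its raw bytes and the tail is kept, concatenating the raw fields and the tail reproduces the input body, which is what makes the output lossless.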

4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

  • source artifact metadata
  • class metadata
  • slot/event metadata
  • exact op list with raw bytes
  • annotation hints for compiled-side VM anchors
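
One possible shape for that object is sketched below. Apart from annotation_hints, which is named later in this note, every key here is a hypothetical choice for illustration and not the schema the tool actually emits.

```python
import json


def build_ir(source, class_meta, slot_meta, ops, tail_hex, annotation_hints):
    """Assemble one IR object; key names are illustrative, not a fixed schema."""
    return {
        "ir_version": 1,
        "source": source,                # source artifact metadata
        "class": class_meta,             # class metadata
        "slot": slot_meta,               # slot/event metadata
        "ops": ops,                      # exact op list with raw bytes
        "undecoded_tail": tail_hex,      # bytes left after a clean stop
        "annotation_hints": annotation_hints,
    }


def dump_ir(ir):
    """Serialize with stable key order so diffs between runs stay readable."""
    return json.dumps(ir, indent=2, sort_keys=True)
```

Sorting keys on output is a small design choice that makes repeated-body comparisons between IR files easier to read in a plain text diff.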

5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.

Do not try to map the bytecode into a full processor module first.

Proof-Of-Concept Parser

Tool path:

  • tools/poc_crusader_usecode_parser.py

Current scope:

  • uses the extracted TSV and chunk artifacts already in the repo
  • disassembles one selected class/slot body at a time
  • emits canonical IR JSON
  • optionally emits a readable text listing beside the JSON

Current deliberate limits:

  • no full intrinsic name table yet
  • no synthetic control-flow graph yet
  • no recompilation path yet
  • no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

  • NPCTRIG slot 0x0A body parsed by POC tool
  • EVENT slot 0x0A body parsed by POC tool
  • slot 0x13 payload-shape hint = signed_word

2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

  • 000d:51fd = slot value load path
  • 000d:5572 = slot value plus additive word
  • 000d:46ec = context create from slot index
  • 000d:22bc = decoded matrix/pushback consumer
  • 000d:ebe3 = opcode sequence runner
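
An importer needs these anchors in a machine-readable form. The sketch below only parses and re-formats the segment:offset text listed above; it deliberately does not convert to a linear or Ghidra address, because that mapping is not described in this note. The names ANCHORS, parse_anchor, and format_anchor are hypothetical.

```python
# Anchor table copied from the list above: address text -> short role label.
ANCHORS = {
    "000d:51fd": "slot value load path",
    "000d:5572": "slot value plus additive word",
    "000d:46ec": "context create from slot index",
    "000d:22bc": "decoded matrix/pushback consumer",
    "000d:ebe3": "opcode sequence runner",
}


def parse_anchor(text):
    """Split 'ssss:oooo' into (segment, offset) as 16-bit integers."""
    segment_text, offset_text = text.split(":")
    segment, offset = int(segment_text, 16), int(offset_text, 16)
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError(f"anchor out of 16-bit range: {text}")
    return segment, offset


def format_anchor(segment, offset):
    """Render a (segment, offset) pair back into the note's address style."""
    return f"{segment:04x}:{offset:04x}"
```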

Comment payload should stay short and evidence-heavy, for example:

POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, payload shape unresolved, parsed via tools/poc_crusader_usecode_parser.py
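
A comment in that shape can be generated from the slot fields with a small formatter. This is a sketch with hypothetical parameter names; the field order and wording simply mirror the example line above.

```python
def format_anchor_comment(class_name, slot, body_start, body_end, raw_word,
                          payload_shape=None,
                          tool="tools/poc_crusader_usecode_parser.py"):
    """Build one short, evidence-heavy comment line for a compiled anchor."""
    shape = payload_shape or "unresolved"
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, "
        f"raw word 0x{raw_word:04X}, "
        f"payload shape {shape}, parsed via {tool}"
    )


comment = format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E)
```

Keeping the formatter separate from the importer means the same text can be previewed in a dry run before anything is written into the Ghidra project.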

3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

  • slot-backed-owner-loaded-body
  • slot-plus-offset-value-reload
  • sequencer-matrix-consumer
  • literal-replay-interpreter-upstream
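
Grouping by runtime family can be sketched as a plain bucket step over the hint records. The runtime_family key is a hypothetical field name, and which anchor belongs to which family is not settled in this note, so any pairing passed in is the caller's claim and not something this sketch decides.

```python
from collections import defaultdict


def bundle_by_family(hints):
    """Group hint records by their 'runtime_family' value, preserving order."""
    bundles = defaultdict(list)
    for hint in hints:
        bundles[hint["runtime_family"]].append(hint)
    return dict(bundles)
```

For example, two hints tagged slot-backed-owner-loaded-body end up in one bundle even if they came from different classes, which matches the intent of grouping by family instead of by class name.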

Why Not A Ghidra Processor Yet

The missing pieces are still too important:

  • full opcode semantics are incomplete
  • stack and return discipline are incomplete
  • the relation between owner-loaded body bytes and the later 000c:fa2f literal/replay lane is still not closed end-to-end
  • the upstream selector into entity_vm_opcode_sequence_run is still unresolved

So the right order is:

  1. parser
  2. IR
  3. annotation import
  4. only then reconsider a language module

User Workflow

Run the proof-of-concept parser from the repo root.

Example:

c:/Users/Maddo/.PYENV/PYENV-WIN/versions/3.14.3/python.exe tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text

Recommended first targets:

  1. NPCTRIG slot 0x0A
  2. NPCTRIG slot 0x20
  3. EVENT slot 0x0A
  4. one _BOOT slot 0x10 body as a short repeated-template control sample

What to look for in the output:

  • exact raw body window
  • whether the body terminates cleanly at opcode 0x7A
  • body-local call targets and global-address ops
  • repeated structural motifs that can be carried back into the VM notes
  • anchor hints for the compiled runtime functions
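
The clean-termination check can be automated once the op list is available. The sketch below assumes a hypothetical op record with an integer "opcode" field and a hex-string tail of undecoded bytes; it treats a body as clean only when nothing is left undecoded and the last decoded op is 0x7A.

```python
def terminates_cleanly(ops, tail_hex, end_opcode=0x7A):
    """True when the body decoded fully and its final op is the end opcode."""
    return bool(ops) and tail_hex == "" and ops[-1]["opcode"] == end_opcode


def termination_report(ops, tail_hex, end_opcode=0x7A):
    """Summarize why a body did or did not terminate cleanly."""
    return {
        "clean": terminates_cleanly(ops, tail_hex, end_opcode),
        "op_count": len(ops),
        "last_opcode": ops[-1]["opcode"] if ops else None,
        "tail_bytes": len(tail_hex) // 2,
    }
```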

Next Extensions

  1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
  2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
  3. Add a small importer that converts annotation_hints into Ghidra comments and bookmarks.
  4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
  5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.