Add 'annotate-usecode' command to import USECODE IR JSON annotations

- Introduced a new command 'annotate-usecode' to import USECODE IR JSON annotation hints as Ghidra comments on compiled anchors.
- Added argument parsing for multiple IR JSON files, comment type selection, and a dry-run option.
- Implemented logic to read annotation records from the provided IR files and set comments on the corresponding addresses in Ghidra.
- Enhanced JSON schema to include response structure for the new command.
2026-03-24 18:14:20 +01:00
commit daa363c3d2
parent 4d3c8cd81b
39 changed files with 41450 additions and 871 deletions
--- a/docs/usecode-roundtrip-ir.md
+++ b/docs/usecode-roundtrip-ir.md
@@ -359,6 +359,135 @@ class:
      confidence: authoritative-bytes, hinted-label
 ```

+## IR v1 Parser Schema
+
+The next tooling step changes the role of this document slightly. IR v0 was a note-level target for reversible human-readable output. IR v1 is the canonical machine-facing schema for the Pentagram-derived proof-of-concept parser and any future Ghidra annotation bridge.
+
+The design constraints are now explicit:
+
+- keep every authoritative owner-loaded byte visible
+- keep slot identity separate from semantic name hints
+- keep runtime-facing metadata visible even when the body decompiler cannot yet explain it
+- preserve enough structure to emit Ghidra comments and bookmarks later without reparsing prose notes
+
+### Top-level IR object
+
+```yaml
+schema_version: crusader-usecode-ir-v1-poc
+source:
+  flex_path: USECODE/EUSECODE.FLX
+  extracted_root: USECODE/EUSECODE_extracted
+  chunk_file: USECODE/EUSECODE_extracted/chunks/chunk_191_table_1BA8_off_04C347_len_0003A8.bin
+class:
+  entry_index: 191
+  object_index: 0x365
+  class_id: 0x363
+  class_name: NPCTRIG
+  raw_code_base_u32: 0x00da
+  code_base_minus_one: 0x00d9
+  conservative_event_count: 0x21
+event:
+  slot: 0x0a
+  event_name_hint: equip
+  raw_event_entry_word: 0x013e
+  raw_code_offset: 0x00000001
+  derived_body_start: 0x00da
+  derived_body_end: 0x024f
+  derived_body_length: 373
+  repeated_template_status: ""
+body:
+  end_reason: end_opcode
+  raw_body_sha1: <digest>
+  unknown_trailing_bytes: ""
+ops:
+  - offset: 0x0000
+    absolute_body_offset: 0x00da
+    opcode: 0x5a
+    mnemonic: init
+    raw_bytes: 5a06
+    operands:
+      local_bytes: 0x06
+  - offset: 0x0011
+    absolute_body_offset: 0x00eb
+    opcode: 0x40
+    mnemonic: push_local_dword
+    raw_bytes: 40064c02
+    operands:
+      bp_offset: 0x06
+annotation_hints:
+  runtime_family: slot-backed-owner-loaded-body
+  compiled_anchors:
+    - 000d:51fd
+    - 000d:5572
+    - 000d:46ec
+    - 000d:ebe3
+```
+
+### Required fields
+
+`source` keeps the specific extracted artifact path so the parser output can always be checked against the raw chunk bytes.
+
+`class` keeps the owner-loaded identity and header math already validated in the binary.
+
+`event` keeps the exact six-byte row, split into its authoritative fields, alongside the derived body window.
+
+`body` records how far the parser got and whether any bytes remain undecoded or trailing.
+
+`ops` is intentionally lossless. Each decoded op keeps:
+
+- body-relative offset
+- absolute chunk-local offset
+- raw opcode byte
+- mnemonic
+- exact raw bytes for the whole op
+- parsed operands as typed fields
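
The invariants implied by the example record and the `ops` field list can be checked mechanically. The following is a minimal sketch, not part of the parser itself: `check_ir` and `SAMPLE` are hypothetical names, the hex fields are assumed to be loaded as integers already, and `derived_body_end` is assumed to be exclusive, which is what the example's `0x024f - 0x00da = 373` implies.

```python
def check_ir(ir: dict) -> list:
    """Return a list of human-readable consistency problems (empty if none)."""
    problems = []
    event = ir["event"]
    start = event["derived_body_start"]
    end = event["derived_body_end"]  # assumed exclusive
    if end - start != event["derived_body_length"]:
        problems.append("derived_body_length does not match end - start")
    for op in ir["ops"]:
        raw = bytes.fromhex(op["raw_bytes"])
        # absolute chunk-local offset = body start + body-relative offset
        if op["absolute_body_offset"] != start + op["offset"]:
            problems.append(f"op at {op['offset']:#06x}: absolute offset mismatch")
        # the raw bytes of an op must begin with its opcode byte
        if not raw or raw[0] != op["opcode"]:
            problems.append(f"op at {op['offset']:#06x}: raw_bytes do not start with opcode")
    return problems


# Values copied from the example record in this document.
SAMPLE = {
    "event": {
        "derived_body_start": 0x00DA,
        "derived_body_end": 0x024F,
        "derived_body_length": 373,
    },
    "ops": [
        {"offset": 0x0000, "absolute_body_offset": 0x00DA,
         "opcode": 0x5A, "raw_bytes": "5a06"},
        {"offset": 0x0011, "absolute_body_offset": 0x00EB,
         "opcode": 0x40, "raw_bytes": "40064c02"},
    ],
}

if __name__ == "__main__":
    print(check_ir(SAMPLE) or "no problems found")
```

A check like this only confirms internal consistency of one record; it says nothing about whether the decoded operands are semantically right.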
+
+`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.
+
+### Opcode result policy
+
+The parser should use four result classes only:
+
+- `decoded_op`: normal parsed opcode with structured operands
+- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
+- `raw_tail`: remaining undecoded bytes after a stop condition
+- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata
+
+That keeps the IR trustworthy even before the whole Crusader VM is modeled.
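
One way to read the four-class policy is as a decode loop that never drops a byte. The sketch below is an illustration under stated assumptions, not the Pentagram-derived parser: `ResultClass`, `Record`, `classify`, and `OPERAND_LENGTHS` are hypothetical names, the operand-length table contains only the `0x5a` entry visible in the example record, a `0x5c` byte is assumed to start a debug blob that runs to the end of the body, and the loop chooses the "stop" option after an unknown opcode.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ResultClass(Enum):
    DECODED_OP = "decoded_op"
    UNKNOWN_OPCODE = "unknown_opcode"
    RAW_TAIL = "raw_tail"
    DEBUG_BLOB = "debug_blob"


@dataclass
class Record:
    kind: ResultClass
    offset: int       # body-relative offset of the first byte
    raw_bytes: bytes  # exact bytes covered by this record


# Hypothetical table: opcode -> operand byte count. Only 0x5a (init, one
# operand byte) is taken from the example record; a real table is larger.
OPERAND_LENGTHS = {0x5A: 1}
DEBUG_MARKER = 0x5C  # assumed to anchor the symbol/debug tail


def classify(body: bytes) -> List[Record]:
    records = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        if opcode == DEBUG_MARKER:
            records.append(Record(ResultClass.DEBUG_BLOB, pos, body[pos:]))
            return records
        if opcode not in OPERAND_LENGTHS:
            # Record the single unmodeled byte, then stop decoding.
            records.append(Record(ResultClass.UNKNOWN_OPCODE, pos, body[pos:pos + 1]))
            pos += 1
            break
        end = pos + 1 + OPERAND_LENGTHS[opcode]
        if end > len(body):
            break  # truncated op: leave its bytes for raw_tail
        records.append(Record(ResultClass.DECODED_OP, pos, body[pos:end]))
        pos = end
    if pos < len(body):
        records.append(Record(ResultClass.RAW_TAIL, pos, body[pos:]))
    return records
```

Because every byte lands in exactly one record, concatenating `raw_bytes` across the result reproduces the input body, which is the property the lossless `ops` requirement depends on.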
+
+### Call-site hint policy
+
+For `call` and `spawn`-family ops, the parser may attach:
+
+- `target_class_id`
+- `target_event_slot`
+- `target_event_name_hint`
+
+It should not attach a stronger semantic claim than that. The body parser is class/event aware, but not yet authoritative about gameplay meaning.
+
+### Annotation-hint schema
+
+The Ghidra bridge should consume only small, stable items:
+
+```yaml
+annotation_hints:
+  runtime_family: slot-backed-owner-loaded-body
+  payload_shape_hint: signed_word
+  compiled_anchors:
+    - address: 000d:51fd
+      role: slot_value_loader
+    - address: 000d:5572
+      role: slot_value_plus_offset
+    - address: 000d:46ec
+      role: context_create_from_slot
+    - address: 000d:ebe3
+      role: opcode_sequence_run
+    - address: 000d:22bc
+      role: matrix_pushback_stage
+```
+
+This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
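
To illustrate how small the consumer side can stay, here is a hedged sketch of turning an `annotation_hints` object into a list of planned comments, which is roughly what a dry run would print. It is not the `annotate-usecode` implementation and makes no Ghidra calls: `parse_seg_off`, `plan_comments`, and the comment text layout are hypothetical. It accepts both anchor forms that appear in this document, bare `seg:off` strings and `address`/`role` objects.

```python
def parse_seg_off(text: str) -> tuple:
    """Parse a 'seg:off' hex string such as '000d:51fd' into two integers."""
    seg, off = text.split(":")
    return int(seg, 16), int(off, 16)


def plan_comments(hints: dict) -> list:
    """Return (address, comment_text) pairs without touching any database."""
    family = hints.get("runtime_family", "")
    shape = hints.get("payload_shape_hint")
    planned = []
    for anchor in hints.get("compiled_anchors", []):
        if isinstance(anchor, str):
            # Bare-string form from the top-level example: no role available.
            anchor = {"address": anchor, "role": "unspecified"}
        seg, off = parse_seg_off(anchor["address"])
        parts = [f"usecode-ir: role={anchor['role']}", f"family={family}"]
        if shape:
            parts.append(f"payload_shape={shape}")
        # Re-emit the address in normalized form so malformed input fails early.
        planned.append((f"{seg:04x}:{off:04x}", " ".join(parts)))
    return planned


# Values copied from the annotation-hint example in this document.
HINTS = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "payload_shape_hint": "signed_word",
    "compiled_anchors": [
        {"address": "000d:51fd", "role": "slot_value_loader"},
        {"address": "000d:22bc", "role": "matrix_pushback_stage"},
    ],
}

if __name__ == "__main__":
    for address, text in plan_comments(HINTS):
        print(address, "|", text)
```

Keeping the planning step separate from the step that writes comments means the same records can feed a dry run, a bookmark script, or a later full importer.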
+
 That is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that `use` is already a proven semantic name for this class.

 Here is the same style for one active event-bearing attachment class in the same island:
@@ -543,6 +672,43 @@ vm_effect_possible:

 That operator block is authoritative as a recovered VM vocabulary, but only ecosystem-level when attached to one specific descriptor family.

+### Binary-side slot and payload-shape evidence to preserve in IR
+
+The current VM pass also adds one useful binary-side rule for the higher event ordinals: the compiled wrapper family distinguishes slot identity from payload shape, and that distinction should survive in any round-trip IR even when the human label stays unresolved.
+
+Verified current ladder around `0005:3115..31da`:
+
+- slot `0x10`: guarded callsite only, zero extra word, packed mask `0x00010000`
+- slot `0x11`: named wrapper `entity_vm_context_try_create_mask_00020000_slot11_with_offset`, one caller-supplied extra word
+- slot `0x12`: named wrapper `entity_vm_context_try_create_mask_00040000_slot12`, zero extra word
+- slot `0x13`: named wrapper `entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity`, one sign-extended extra word after an entity-validity gate
+- slot `0x14`: named wrapper `entity_vm_context_try_create_mask_00100000_slot14_with_offset`, one caller-supplied extra word
+
+Why this matters for the IR:
+
+- It is direct binary evidence that some higher Crusader slot ordinals are already grouped by argument shape before any descriptor-family mapping is proven.
+- That means the IR should preserve `slot_id` plus `payload_shape` independently instead of collapsing everything into one guessed event-name table.
+- It also gives a bounded way to cross-check external event signatures without over-trusting them: slot `0x12` fits a zero-arg event shape, slot `0x13` fits a one-word event shape, and slot `0x14` currently conflicts with Pentagram's older zero-arg `animGetHit()` note.
+
+Practical annotation rule to adopt now:
+
+- keep higher-slot labels binary-stable as `slot 0x10` .. `slot 0x14` unless local behavior closes the label
+- attach external event names only as hints
+- attach one small `payload_shape_hint` field such as `none`, `word`, or `signed_word`
+
+Minimal hinted example:
+
+```yaml
+slot_record:
+  slot_id: 0x13
+  event_name_hint: avatarStoleSomething
+  payload_shape_hint: signed_word
+  binary_anchor: 0005:31da
+  wrapper_name: entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity
+```
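
In the ladder listed above, each packed mask equals `1 << slot_id` (for example slot `0x13` pairs with `0x00080000`). That is an observation over those five slots only, not a proven rule for other ordinals. The sketch below, with hypothetical helper names `slot_mask`, `parse_wrapper_name`, and `sign_extend_word`, shows how an importer could cross-check a wrapper name against its slot and interpret a `signed_word` payload.

```python
import re

# Wrapper names copied from the verified ladder; slot 0x10 has no named wrapper.
WRAPPERS = {
    0x11: "entity_vm_context_try_create_mask_00020000_slot11_with_offset",
    0x12: "entity_vm_context_try_create_mask_00040000_slot12",
    0x13: "entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity",
    0x14: "entity_vm_context_try_create_mask_00100000_slot14_with_offset",
}

NAME_RE = re.compile(r"_mask_([0-9a-f]{8})_slot([0-9a-f]{2})")


def slot_mask(slot_id: int) -> int:
    """Mask observed for slots 0x10..0x14: a single bit at the slot position."""
    return 1 << slot_id


def parse_wrapper_name(name: str) -> tuple:
    """Extract (mask, slot_id) from a wrapper name; both are hex in the name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"no mask/slot pair in {name!r}")
    return int(match.group(1), 16), int(match.group(2), 16)


def sign_extend_word(value: int) -> int:
    """Interpret the low 16 bits of value as a signed word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


if __name__ == "__main__":
    for slot_id, name in WRAPPERS.items():
        mask, named_slot = parse_wrapper_name(name)
        status = "ok" if (named_slot == slot_id and mask == slot_mask(slot_id)) else "MISMATCH"
        print(f"slot {slot_id:#04x}: mask {mask:#010x} {status}")
```

A mismatch here would indicate a renamed or mislabeled wrapper, which is a reason to keep `slot_id` and `payload_shape_hint` as separate fields rather than deriving one from the other.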
+
+The same pass also hardens one existing IR operator boundary: the `000d:22bc` stage is now comment-backed in Ghidra as a matrix/pushback consumer over decoded workspace bytes, not a direct descriptor-row reader. The current safe attachment point is therefore still `decoded VM workspace -> link-matrix stage`, not `NPCTRIG row -> direct entity-link emission`.
+
 ## Conservative Parser Rule To Adopt Now

 For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is: