Add 'annotate-usecode' command to import USECODE IR JSON annotations

- Introduced a new command 'annotate-usecode' to import USECODE IR JSON annotation hints as Ghidra comments on compiled anchors.
- Added argument parsing for multiple IR JSON files, comment type selection, and a dry-run option.
- Implemented logic to read annotation records from the provided IR files and set comments on the corresponding addresses in Ghidra.
- Enhanced JSON schema to include response structure for the new command.
MaddoScientisto 2026-03-24 18:14:20 +01:00
commit daa363c3d2
39 changed files with 41450 additions and 871 deletions

@@ -359,6 +359,135 @@ class:
confidence: authoritative-bytes, hinted-label
```
## IR v1 Parser Schema
The next tooling step changes the role of this document slightly. IR v0 was a note-level target for reversible human-readable output. IR v1 is the canonical machine-facing schema for the Pentagram-derived proof-of-concept parser and any future Ghidra annotation bridge.
The design constraints are now explicit:
- keep every authoritative owner-loaded byte visible
- keep slot identity separate from semantic name hints
- keep runtime-facing metadata visible even when the body decompiler cannot yet explain it
- preserve enough structure to emit Ghidra comments and bookmarks later without reparsing prose notes
### Top-level IR object
```yaml
schema_version: crusader-usecode-ir-v1-poc
source:
  flex_path: USECODE/EUSECODE.FLX
  extracted_root: USECODE/EUSECODE_extracted
  chunk_file: USECODE/EUSECODE_extracted/chunks/chunk_191_table_1BA8_off_04C347_len_0003A8.bin
class:
  entry_index: 191
  object_index: 0x365
  class_id: 0x363
  class_name: NPCTRIG
  raw_code_base_u32: 0x00da
  code_base_minus_one: 0x00d9
  conservative_event_count: 0x21
event:
  slot: 0x0a
  event_name_hint: equip
  raw_event_entry_word: 0x013e
  raw_code_offset: 0x00000001
  derived_body_start: 0x00da
  derived_body_end: 0x024f
  derived_body_length: 373
  repeated_template_status: ""
body:
  end_reason: end_opcode
  raw_body_sha1: <digest>
  unknown_trailing_bytes: ""
ops:
  - offset: 0x0000
    absolute_body_offset: 0x00da
    opcode: 0x5a
    mnemonic: init
    raw_bytes: 5a06
    operands:
      local_bytes: 0x06
  - offset: 0x0011
    absolute_body_offset: 0x00eb
    opcode: 0x40
    mnemonic: push_local_dword
    raw_bytes: 40064c02
    operands:
      bp_offset: 0x06
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  compiled_anchors:
    - 000d:51fd
    - 000d:5572
    - 000d:46ec
    - 000d:ebe3
```
### Required fields
`source` keeps the specific extracted artifact path so the parser output can always be checked against the raw chunk bytes.
`class` keeps the owner-loaded identity and header math already validated in the binary.
`event` keeps the exact six-byte row, split into its authoritative fields, plus the derived body window.
`body` records how far the parser got and whether any bytes remain undecoded or trailing.
`ops` is intentionally lossless. Each decoded op keeps:
- body-relative offset
- absolute chunk-local offset
- raw opcode byte
- mnemonic
- exact raw bytes for the whole op
- parsed operands as typed fields
`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.
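The derived fields in the sample record can be cross-checked mechanically. The following is a minimal sketch, not the parser's actual validation: the relations it checks (`derived_body_length == derived_body_end - derived_body_start`, `absolute_body_offset == derived_body_start + offset`, `code_base_minus_one == raw_code_base_u32 - 1`) are inferred from the sample values only, `check_derived_fields` is a hypothetical helper name, and it assumes `ops` is a top-level list next to `body`.

```python
# Sample values copied from the top-level IR example; hex literals become ints.
SAMPLE = {
    "class": {"raw_code_base_u32": 0x00DA, "code_base_minus_one": 0x00D9},
    "event": {
        "derived_body_start": 0x00DA,
        "derived_body_end": 0x024F,
        "derived_body_length": 373,
    },
    # Assumption: `ops` is a top-level list, sibling of `body`.
    "ops": [
        {"offset": 0x0000, "absolute_body_offset": 0x00DA,
         "opcode": 0x5A, "raw_bytes": "5a06"},
        {"offset": 0x0011, "absolute_body_offset": 0x00EB,
         "opcode": 0x40, "raw_bytes": "40064c02"},
    ],
}


def check_derived_fields(record: dict) -> list[str]:
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    cls, event = record["class"], record["event"]
    start, end = event["derived_body_start"], event["derived_body_end"]

    # Relations inferred from the sample record, not from a formal spec.
    if cls["code_base_minus_one"] != cls["raw_code_base_u32"] - 1:
        problems.append("code_base_minus_one mismatch")
    if event["derived_body_length"] != end - start:
        problems.append("derived_body_length mismatch")

    for op in record.get("ops", []):
        if op["absolute_body_offset"] != start + op["offset"]:
            problems.append(f"op at 0x{op['offset']:04x}: absolute offset mismatch")
        # The first raw byte of a lossless op should be its opcode.
        if int(op["raw_bytes"][:2], 16) != op["opcode"]:
            problems.append(f"op at 0x{op['offset']:04x}: raw_bytes do not start with opcode")
    return problems


print(check_derived_fields(SAMPLE))
```

A check like this is cheap to run on every emitted record and catches drift between the authoritative fields and the derived ones before anything reaches Ghidra.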
### Opcode result policy
The parser should emit only four result classes:
- `decoded_op`: normal parsed opcode with structured operands
- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
- `raw_tail`: remaining undecoded bytes after a stop condition
- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata
That keeps the IR trustworthy even before the whole Crusader VM is modeled.
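As a rough illustration of that policy, the sketch below walks a body and emits only the four result classes. It is not the Pentagram-derived parser: `KNOWN_OPS` is a hypothetical one-entry table (only `0x5a`/`init` with one operand byte is taken from the sample record), treating a `0x5c` byte in opcode position as the start of a debug blob is an assumption, and `classify` is an illustrative name.

```python
from dataclasses import dataclass

# Hypothetical table: opcode -> (mnemonic, operand byte count).
KNOWN_OPS = {0x5A: ("init", 1)}
# Assumption: a 0x5c byte in opcode position starts the symbol/debug tail.
DEBUG_ANCHOR = 0x5C


@dataclass
class Result:
    kind: str        # decoded_op | unknown_opcode | raw_tail | debug_blob
    offset: int      # body-relative offset
    raw_bytes: str   # exact bytes covered by this result, as hex
    mnemonic: str = ""


def classify(body: bytes) -> list[Result]:
    """Split a body into results whose raw bytes concatenate back to the body."""
    results = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        if opcode == DEBUG_ANCHOR:
            results.append(Result("debug_blob", pos, body[pos:].hex()))
            return results
        if opcode not in KNOWN_OPS:
            # Stop conservatively: record the opcode byte, then the rest as tail.
            results.append(Result("unknown_opcode", pos, body[pos:pos + 1].hex()))
            if pos + 1 < len(body):
                results.append(Result("raw_tail", pos + 1, body[pos + 1:].hex()))
            return results
        mnemonic, operand_len = KNOWN_OPS[opcode]
        end = pos + 1 + operand_len
        if end > len(body):
            # Truncated operand: keep the bytes visible instead of guessing.
            results.append(Result("raw_tail", pos, body[pos:].hex()))
            return results
        results.append(Result("decoded_op", pos, body[pos:end].hex(), mnemonic))
        pos = end
    return results


for item in classify(bytes.fromhex("5a06ff0102")):
    print(item.kind, hex(item.offset), item.raw_bytes)
```

Whatever the real opcode table looks like, the useful invariant is that the `raw_bytes` of all results, joined in order, reproduce the input body exactly.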
### Call-site hint policy
For `call` and `spawn`-family ops, the parser may attach:
- `target_class_id`
- `target_event_slot`
- `target_event_name_hint`
It should not attach a stronger semantic claim than that. The body parser is class/event aware, but not yet authoritative about gameplay meaning.
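One way to enforce that ceiling is to whitelist the three hint fields and drop anything else. This is a sketch under assumptions: the `call_site_hints` key, the `attach_call_hints` helper, and the prefix test for "call/spawn-family" mnemonics are all hypothetical, and the sample values are illustrative only.

```python
ALLOWED_CALL_HINTS = ("target_class_id", "target_event_slot", "target_event_name_hint")


def is_call_family(mnemonic: str) -> bool:
    # Assumption: call/spawn-family mnemonics start with "call" or "spawn".
    return mnemonic.startswith(("call", "spawn"))


def attach_call_hints(op: dict, candidate: dict) -> dict:
    """Return a copy of `op` carrying only the permitted call-site hints."""
    out = dict(op)
    if not is_call_family(op.get("mnemonic", "")):
        return out
    hints = {key: candidate[key] for key in ALLOWED_CALL_HINTS if key in candidate}
    if hints:
        out["call_site_hints"] = hints
    return out


op = {"mnemonic": "call", "raw_bytes": "00"}  # illustrative op, not real bytes
candidate = {
    "target_class_id": 0x363,
    "target_event_slot": 0x0A,
    "target_event_name_hint": "equip",
    "gameplay_meaning": "opens door",  # stronger claim: dropped by the filter
}
print(attach_call_hints(op, candidate))
```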
### Annotation-hint schema
The Ghidra bridge should consume only small, stable items:
```yaml
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  payload_shape_hint: signed_word
  compiled_anchors:
    - address: 000d:51fd
      role: slot_value_loader
    - address: 000d:5572
      role: slot_value_plus_offset
    - address: 000d:46ec
      role: context_create_from_slot
    - address: 000d:ebe3
      role: opcode_sequence_run
    - address: 000d:22bc
      role: matrix_pushback_stage
```
This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
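A comment-only importer could turn these hints into a dry-run plan before touching Ghidra at all. The sketch below makes no Ghidra API calls. The `plan_comments` and `parse_seg_off` names and the comment text layout are assumptions, and it accepts both anchor shapes seen in this document (bare `segment:offset` strings and `address`/`role` mappings).

```python
def parse_seg_off(text: str) -> tuple[int, int]:
    """Parse a 'segment:offset' hex pair such as '000d:51fd'."""
    segment, sep, offset = text.partition(":")
    if not sep:
        raise ValueError(f"not a segment:offset address: {text!r}")
    return int(segment, 16), int(offset, 16)


def plan_comments(hints: dict) -> list[dict]:
    """Build (address, comment) records without applying them anywhere."""
    family = hints.get("runtime_family", "")
    shape = hints.get("payload_shape_hint", "")
    plan = []
    for anchor in hints.get("compiled_anchors", []):
        if isinstance(anchor, str):
            address, role = anchor, ""
        else:
            address, role = anchor["address"], anchor.get("role", "")
        segment, offset = parse_seg_off(address)
        # Hypothetical comment layout; only non-empty fields are included.
        parts = [f"{key}={value}" for key, value in
                 (("role", role), ("family", family), ("payload", shape)) if value]
        plan.append({
            "address": f"{segment:04x}:{offset:04x}",
            "comment": "usecode-ir: " + " ".join(parts),
        })
    return plan


hints = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "payload_shape_hint": "signed_word",
    "compiled_anchors": [
        {"address": "000d:51fd", "role": "slot_value_loader"},
        "000d:22bc",
    ],
}
for entry in plan_comments(hints):
    print(entry["address"], "|", entry["comment"])
```

Printing the plan first makes a dry run trivial: the same records can later be handed to whatever Ghidra-side script actually sets the comments.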
That is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that `use` is already a proven semantic name for this class.
Here is the same style for one active event-bearing attachment class in the same island:
@@ -543,6 +672,43 @@ vm_effect_possible:
That operator block is authoritative as a recovered VM vocabulary, but only ecosystem-level when attached to one specific descriptor family.
### Binary-side slot and payload-shape evidence to preserve in IR
The current VM pass also adds one useful binary-side rule for the higher event ordinals: the compiled wrapper family distinguishes slot identity from payload shape, and that distinction should survive in any round-trip IR even when the human label stays unresolved.
Verified current ladder around `0005:3115..31da`:
- slot `0x10`: guarded callsite only, zero extra word, packed mask `0x00010000`
- slot `0x11`: named wrapper `entity_vm_context_try_create_mask_00020000_slot11_with_offset`, one caller-supplied extra word
- slot `0x12`: named wrapper `entity_vm_context_try_create_mask_00040000_slot12`, zero extra word
- slot `0x13`: named wrapper `entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity`, one sign-extended extra word after an entity-validity gate
- slot `0x14`: named wrapper `entity_vm_context_try_create_mask_00100000_slot14_with_offset`, one caller-supplied extra word
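In the five rows listed above, each packed mask is a single bit at the slot's index (`1 << slot_id`). The sketch below checks that pattern by pulling the mask and slot out of the wrapper names. This is an observation about these five rows only, not a claim about other slots, and the regular expression is an assumption about the naming convention.

```python
import re

# Slot 0x10 has no named wrapper; its mask is taken from the list entry itself.
LADDER = {0x10: 0x00010000}

WRAPPERS = [
    "entity_vm_context_try_create_mask_00020000_slot11_with_offset",
    "entity_vm_context_try_create_mask_00040000_slot12",
    "entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity",
    "entity_vm_context_try_create_mask_00100000_slot14_with_offset",
]

# Assumed naming convention: ..._mask_<8 hex digits>_slot<2 hex digits>...
PATTERN = re.compile(r"mask_([0-9a-f]{8})_slot([0-9a-f]{2})")

for name in WRAPPERS:
    match = PATTERN.search(name)
    if match is None:
        raise ValueError(f"unexpected wrapper name: {name}")
    LADDER[int(match.group(2), 16)] = int(match.group(1), 16)


def mask_matches_slot_bit(slot_id: int, mask: int) -> bool:
    """True when the mask is exactly the single bit at index slot_id."""
    return mask == 1 << slot_id


for slot_id, mask in sorted(LADDER.items()):
    print(f"slot 0x{slot_id:02x} mask 0x{mask:08x} "
          f"single-bit={mask_matches_slot_bit(slot_id, mask)}")
```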
Why this matters for the IR:
- It is direct binary evidence that some higher Crusader slot ordinals are already grouped by argument shape before any descriptor-family mapping is proven.
- That means the IR should preserve `slot_id` plus `payload_shape` independently instead of collapsing everything into one guessed event-name table.
- It also gives a bounded way to cross-check external event signatures without over-trusting them: slot `0x12` fits a zero-arg event shape, slot `0x13` fits a one-word event shape, and slot `0x14` currently conflicts with Pentagram's older zero-arg `animGetHit()` note.
Practical annotation rule to adopt now:
- keep higher-slot labels binary-stable as `slot 0x10` .. `slot 0x14` unless local behavior closes the label
- attach external event names only as hints
- attach one small `payload_shape_hint` field such as `none`, `word`, or `signed_word`
Minimal hinted example:
```yaml
slot_record:
  slot_id: 0x13
  event_name_hint: avatarStoleSomething
  payload_shape_hint: signed_word
  binary_anchor: 0005:31da
  wrapper_name: entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity
```
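The annotation rule can be expressed as a small rendering helper that keeps the binary-stable slot label in front and demotes the external name to a hint. This is a minimal sketch: `render_slot_label`, the `label_closed` flag, and the output layout are hypothetical, and `KNOWN_SHAPES` contains only the three shapes named above, which may not be the full set.

```python
# Only the shapes named in the rule above; the real set may be larger.
KNOWN_SHAPES = {"none", "word", "signed_word"}


def render_slot_label(record: dict, label_closed: bool = False) -> str:
    """Render a label that stays 'slot 0xNN' unless the label is closed."""
    shape = record.get("payload_shape_hint", "none")
    if shape not in KNOWN_SHAPES:
        raise ValueError(f"unrecognized payload_shape_hint: {shape!r}")
    hint = record.get("event_name_hint", "")

    if label_closed and hint:
        # Local behavior has confirmed the name, so it may become the label.
        label = hint
    else:
        label = f"slot 0x{record['slot_id']:02x}"

    notes = [f"payload={shape}"]
    if hint and not label_closed:
        notes.append(f"hint={hint}")
    return f"{label} ({', '.join(notes)})"


record = {
    "slot_id": 0x13,
    "event_name_hint": "avatarStoleSomething",
    "payload_shape_hint": "signed_word",
}
print(render_slot_label(record))
```

Keeping the hint in a separate note means a later pass can promote or discard it without rewriting any slot identity.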
The same pass also hardens one existing IR operator boundary: the `000d:22bc` stage is now comment-backed in Ghidra as a matrix/pushback consumer over decoded workspace bytes, not a direct descriptor-row reader. The current safe attachment point is therefore still `decoded VM workspace -> link-matrix stage`, not `NPCTRIG row -> direct entity-link emission`.
## Conservative Parser Rule To Adopt Now
For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is: