Add 'annotate-usecode' command to import USECODE IR JSON annotations
- Introduced a new command 'annotate-usecode' to import USECODE IR JSON annotation hints as Ghidra comments on compiled anchors.
- Added argument parsing for multiple IR JSON files, comment type selection, and a dry-run option.
- Implemented logic to read annotation records from the provided IR files and set comments on the corresponding addresses in Ghidra.
- Enhanced JSON schema to include response structure for the new command.
parent 4d3c8cd81b
commit daa363c3d2
39 changed files with 41450 additions and 871 deletions
@@ -359,6 +359,135 @@ class:
confidence: authoritative-bytes, hinted-label
```

## IR v1 Parser Schema

The next tooling step changes the role of this document slightly. IR v0 was a note-level target for reversible human-readable output. IR v1 is the canonical machine-facing schema for the Pentagram-derived proof-of-concept parser and any future Ghidra annotation bridge.

The design constraints are now explicit:

- keep every authoritative owner-loaded byte visible
- keep slot identity separate from semantic name hints
- keep runtime-facing metadata visible even when the body decompiler cannot yet explain it
- preserve enough structure to emit Ghidra comments and bookmarks later without reparsing prose notes

### Top-level IR object

```yaml
schema_version: crusader-usecode-ir-v1-poc
source:
  flex_path: USECODE/EUSECODE.FLX
  extracted_root: USECODE/EUSECODE_extracted
  chunk_file: USECODE/EUSECODE_extracted/chunks/chunk_191_table_1BA8_off_04C347_len_0003A8.bin
class:
  entry_index: 191
  object_index: 0x365
  class_id: 0x363
  class_name: NPCTRIG
  raw_code_base_u32: 0x00da
  code_base_minus_one: 0x00d9
  conservative_event_count: 0x21
event:
  slot: 0x0a
  event_name_hint: equip
  raw_event_entry_word: 0x013e
  raw_code_offset: 0x00000001
  derived_body_start: 0x00da
  derived_body_end: 0x024f
  derived_body_length: 373
  repeated_template_status: ""
body:
  end_reason: end_opcode
  raw_body_sha1: <digest>
  unknown_trailing_bytes: ""
ops:
  - offset: 0x0000
    absolute_body_offset: 0x00da
    opcode: 0x5a
    mnemonic: init
    raw_bytes: 5a06
    operands:
      local_bytes: 0x06
  - offset: 0x0011
    absolute_body_offset: 0x00eb
    opcode: 0x40
    mnemonic: push_local_dword
    raw_bytes: 40064c02
    operands:
      bp_offset: 0x06
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  compiled_anchors:
    - 000d:51fd
    - 000d:5572
    - 000d:46ec
    - 000d:ebe3
```
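
The derived fields in the example can be cross-checked arithmetically. The sketch below assumes three relationships that are inferred from the example values rather than stated by the schema: `derived_body_start` equals `code_base_minus_one + raw_code_offset`, `derived_body_end` is exclusive, and each op's `absolute_body_offset` equals `derived_body_start + offset`. The function name `check_derived_fields` is hypothetical.

```python
def check_derived_fields(cls: dict, event: dict, ops: list) -> list:
    """Return human-readable inconsistencies; an empty list means all checks pass."""
    problems = []

    if cls["code_base_minus_one"] != cls["raw_code_base_u32"] - 1:
        problems.append("code_base_minus_one != raw_code_base_u32 - 1")

    # Assumption: the body starts at code_base_minus_one + raw_code_offset.
    expected_start = cls["code_base_minus_one"] + event["raw_code_offset"]
    if event["derived_body_start"] != expected_start:
        problems.append("derived_body_start does not match header math")

    # Assumption: derived_body_end is exclusive, so length = end - start.
    span = event["derived_body_end"] - event["derived_body_start"]
    if event["derived_body_length"] != span:
        problems.append("derived_body_length != end - start")

    # Assumption: absolute offsets are body start plus body-relative offset.
    for op in ops:
        if op["absolute_body_offset"] != event["derived_body_start"] + op["offset"]:
            problems.append(f"op at {op['offset']:#06x}: absolute offset mismatch")

    return problems


# Values copied from the NPCTRIG example.
npctrig_class = {"raw_code_base_u32": 0x00DA, "code_base_minus_one": 0x00D9}
npctrig_event = {
    "raw_code_offset": 0x00000001,
    "derived_body_start": 0x00DA,
    "derived_body_end": 0x024F,
    "derived_body_length": 373,
}
npctrig_ops = [
    {"offset": 0x0000, "absolute_body_offset": 0x00DA},
    {"offset": 0x0011, "absolute_body_offset": 0x00EB},
]

print(check_derived_fields(npctrig_class, npctrig_event, npctrig_ops))  # prints []
```

If any of the three assumed relationships turns out to be wrong for other classes, the check would flag those records instead of silently accepting them, which is the intended failure mode for a proof-of-concept parser.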

### Required fields

`source` keeps the specific extracted artifact path so the parser output can always be checked against the raw chunk bytes.

`class` keeps the owner-loaded identity and header math already validated in the binary.

`event` keeps the exact six-byte row, split into authoritative fields plus the derived body window.

`body` records how far the parser got and whether any bytes remain undecoded or trailing.

`ops` is intentionally lossless. Each decoded op keeps:

- body-relative offset
- absolute chunk-local offset
- raw opcode byte
- mnemonic
- exact raw bytes for the whole op
- parsed operands as typed fields
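
One way to hold such an op record in memory is sketched below. The class name `DecodedOp`, the `end_offset` helper, and the rule that the first raw byte must equal the opcode are assumptions for illustration; the rule is consistent with both example ops (`5a06` for opcode `0x5a`, `40064c02` for opcode `0x40`) but is not stated by the schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DecodedOp:
    offset: int                 # body-relative offset
    absolute_body_offset: int   # absolute chunk-local offset
    opcode: int                 # raw opcode byte
    mnemonic: str
    raw_bytes: bytes            # exact bytes for the whole op
    operands: dict = field(default_factory=dict)

    def __post_init__(self):
        # Assumed invariant: the op's first raw byte is its opcode byte.
        if not self.raw_bytes or self.raw_bytes[0] != self.opcode:
            raise ValueError("raw_bytes must start with the opcode byte")

    @property
    def end_offset(self) -> int:
        """Body-relative offset of the first byte after this op."""
        return self.offset + len(self.raw_bytes)


init_op = DecodedOp(
    offset=0x0000,
    absolute_body_offset=0x00DA,
    opcode=0x5A,
    mnemonic="init",
    raw_bytes=bytes.fromhex("5a06"),
    operands={"local_bytes": 0x06},
)
print(init_op.end_offset)  # prints 2
```

Keeping `raw_bytes` on every record means the parsed operands can always be re-derived or disputed later without going back to the chunk file.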

`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.

### Opcode result policy

The parser should use four result classes only:

- `decoded_op`: normal parsed opcode with structured operands
- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
- `raw_tail`: remaining undecoded bytes after a stop condition
- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata

That keeps the IR trustworthy even before the whole Crusader VM is modeled.
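
A minimal sketch of how the four result classes could partition a body follows. It is a toy, not the parser: the length table contains only the `init` opcode from the example, treating any `0x5c` byte at an op boundary as the start of a debug blob is an assumption, and the names `ResultClass`, `KNOWN_LENGTHS`, and `classify` are hypothetical.

```python
from enum import Enum


class ResultClass(str, Enum):
    DECODED_OP = "decoded_op"
    UNKNOWN_OPCODE = "unknown_opcode"
    RAW_TAIL = "raw_tail"
    DEBUG_BLOB = "debug_blob"


# Toy table: only 0x5a (init, opcode byte plus one operand byte) is modeled here.
KNOWN_LENGTHS = {0x5A: 2}
DEBUG_MARKER = 0x5C  # assumption: a 0x5c at an op boundary starts the debug tail


def classify(body: bytes) -> list:
    """Split a body into (result_class, offset, raw_bytes) without dropping bytes."""
    results = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        if opcode == DEBUG_MARKER:
            results.append((ResultClass.DEBUG_BLOB, pos, body[pos:]))
            break
        length = KNOWN_LENGTHS.get(opcode)
        if length is None:
            # Conservative stop: record the unknown byte, keep the rest raw.
            results.append((ResultClass.UNKNOWN_OPCODE, pos, body[pos:pos + 1]))
            if pos + 1 < len(body):
                results.append((ResultClass.RAW_TAIL, pos + 1, body[pos + 1:]))
            break
        if pos + length > len(body):
            # Truncated op: keep the remaining bytes raw instead of guessing.
            results.append((ResultClass.RAW_TAIL, pos, body[pos:]))
            break
        results.append((ResultClass.DECODED_OP, pos, body[pos:pos + length]))
        pos += length
    return results


sample = bytes.fromhex("5a06ff0102")
parts = classify(sample)
# Every input byte lands in exactly one result, so the split is reversible.
assert b"".join(raw for _, _, raw in parts) == sample
```

The final assertion is the point of the policy: whichever class a byte falls into, concatenating the results reproduces the original body.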

### Call-site hint policy

For `call` and `spawn`-family ops, the parser may attach:

- `target_class_id`
- `target_event_slot`
- `target_event_name_hint`

It should not attach a stronger semantic claim than that. The body parser is class/event aware, but not yet authoritative about gameplay meaning.
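
The policy can be made hard to violate by giving the hint a closed shape. The sketch below is one possible encoding; `CallSiteHint`, `make_call_site_hint`, and the `name_hints` lookup table are hypothetical names, and the sample values reuse the class id and slot from the NPCTRIG example only for illustration.

```python
from typing import Optional, TypedDict


class CallSiteHint(TypedDict, total=False):
    target_class_id: int
    target_event_slot: int
    target_event_name_hint: str


def make_call_site_hint(class_id: int, slot: int,
                        name_hints: Optional[dict] = None) -> CallSiteHint:
    """Build a hint with the three allowed fields and nothing stronger."""
    hint: CallSiteHint = {"target_class_id": class_id, "target_event_slot": slot}
    name = (name_hints or {}).get(slot)
    if name is not None:
        # The name stays a hint; slot identity is carried separately.
        hint["target_event_name_hint"] = name
    return hint


print(make_call_site_hint(0x363, 0x0A, {0x0A: "equip"}))
```

When no name is known for a slot, the `target_event_name_hint` key is simply absent rather than filled with a guess.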

### Annotation-hint schema

The Ghidra bridge should consume only small, stable items:

```yaml
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  payload_shape_hint: signed_word
  compiled_anchors:
    - address: 000d:51fd
      role: slot_value_loader
    - address: 000d:5572
      role: slot_value_plus_offset
    - address: 000d:46ec
      role: context_create_from_slot
    - address: 000d:ebe3
      role: opcode_sequence_run
    - address: 000d:22bc
      role: matrix_pushback_stage
```

This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
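
As a rough illustration of how little an importer needs, the sketch below turns an `annotation_hints` object into a dry-run plan of `(address, comment)` pairs. It makes no Ghidra API calls; the comment text format and the names `parse_segmented` and `plan_comments` are assumptions. It accepts anchors both as bare `segment:offset` strings and as `address`/`role` objects, since both forms appear in this document.

```python
def parse_segmented(addr: str) -> tuple:
    """Parse 'ssss:oooo' into (segment, offset); raises ValueError if malformed."""
    segment, offset = addr.split(":")
    return int(segment, 16), int(offset, 16)


def plan_comments(hints: dict) -> list:
    """Return (address, comment) pairs a dry run could print before writing anything."""
    family = hints.get("runtime_family", "")
    shape = hints.get("payload_shape_hint")
    plan = []
    for anchor in hints.get("compiled_anchors", []):
        if isinstance(anchor, str):
            address, role = anchor, None
        else:
            address, role = anchor["address"], anchor.get("role")
        parse_segmented(address)  # reject malformed anchors early
        parts = ["usecode-ir", f"family={family}"]
        if role:
            parts.append(f"role={role}")
        if shape:
            parts.append(f"payload_shape_hint={shape}")
        plan.append((address, "; ".join(parts)))
    return plan


hints = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "payload_shape_hint": "signed_word",
    "compiled_anchors": [
        {"address": "000d:51fd", "role": "slot_value_loader"},
        "000d:22bc",
    ],
}
for address, comment in plan_comments(hints):
    print(address, "->", comment)
```

Separating the plan from the write step keeps the dry-run path and the real import path on the same code, so what is previewed is what would be applied.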

That is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that `use` is already a proven semantic name for this class.

Here is the same style for one active event-bearing attachment class in the same island:

@@ -543,6 +672,43 @@ vm_effect_possible:

That operator block is authoritative as a recovered VM vocabulary, but only ecosystem-level when attached to one specific descriptor family.

### Binary-side slot and payload-shape evidence to preserve in IR

The current VM pass also adds one useful binary-side rule for the higher event ordinals: the compiled wrapper family distinguishes slot identity from payload shape, and that distinction should survive in any round-trip IR even when the human label stays unresolved.

Verified current ladder around `0005:3115..31da`:

- slot `0x10`: guarded callsite only, zero extra word, packed mask `0x00010000`
- slot `0x11`: named wrapper `entity_vm_context_try_create_mask_00020000_slot11_with_offset`, one caller-supplied extra word
- slot `0x12`: named wrapper `entity_vm_context_try_create_mask_00040000_slot12`, zero extra word
- slot `0x13`: named wrapper `entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity`, one sign-extended extra word after an entity-validity gate
- slot `0x14`: named wrapper `entity_vm_context_try_create_mask_00100000_slot14_with_offset`, one caller-supplied extra word
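
In all five rows of the ladder, the packed mask (stated directly for slot `0x10`, embedded in the wrapper name for the others) has exactly one bit set, at the bit position equal to the slot number. The sketch below checks that observation; reading it as a general `mask = 1 << slot` rule is an inference from these five rows only, and the names `LADDER` and `slot_from_mask` are hypothetical.

```python
# (slot, packed mask) pairs copied from the ladder; masks for 0x11..0x14
# are taken from the wrapper names.
LADDER = [
    (0x10, 0x00010000),
    (0x11, 0x00020000),
    (0x12, 0x00040000),
    (0x13, 0x00080000),
    (0x14, 0x00100000),
]


def slot_from_mask(mask: int) -> int:
    """Return the bit index of a single-bit mask; reject anything else."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError(f"mask {mask:#010x} does not have exactly one bit set")
    return mask.bit_length() - 1


for slot, mask in LADDER:
    assert slot_from_mask(mask) == slot
    assert mask == 1 << slot
```

A check like this is a cheap guard against mislabeling a wrapper when new ladder entries are added to the IR.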

Why this matters for the IR:

- It is direct binary evidence that some higher Crusader slot ordinals are already grouped by argument shape before any descriptor-family mapping is proven.
- That means the IR should preserve `slot_id` plus `payload_shape` independently instead of collapsing everything into one guessed event-name table.
- It also gives a bounded way to cross-check external event signatures without over-trusting them: slot `0x12` fits a zero-arg event shape, slot `0x13` fits a one-word event shape, and slot `0x14` currently conflicts with Pentagram's older zero-arg `animGetHit()` note.

Practical annotation rule to adopt now:

- keep higher-slot labels binary-stable as `slot 0x10` .. `slot 0x14` unless local behavior closes the label
- attach external event names only as hints
- attach one small `payload_shape_hint` field such as `none`, `word`, or `signed_word`

Minimal hinted example:

```yaml
slot_record:
  slot_id: 0x13
  event_name_hint: avatarStoleSomething
  payload_shape_hint: signed_word
  binary_anchor: 0005:31da
  wrapper_name: entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity
```
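
The annotation rule can be sketched as a small record type in which the binary-stable label never depends on the external name. The class name `SlotRecord`, its methods, and the closed set of three payload shapes are assumptions; the text lists `none`, `word`, and `signed_word` only as examples, so a real schema may allow more.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: treat the three shapes named in the text as the full vocabulary.
KNOWN_SHAPES = {"none", "word", "signed_word"}


@dataclass(frozen=True)
class SlotRecord:
    slot_id: int
    payload_shape_hint: str
    event_name_hint: Optional[str] = None

    def __post_init__(self):
        if self.payload_shape_hint not in KNOWN_SHAPES:
            raise ValueError(f"unknown payload shape: {self.payload_shape_hint}")

    def label(self) -> str:
        """Binary-stable label; never derived from the external event name."""
        return f"slot {self.slot_id:#04x}"

    def display(self) -> str:
        """Label plus hints, with the external name clearly marked as a hint."""
        extras = [f"payload: {self.payload_shape_hint}"]
        if self.event_name_hint:
            extras.insert(0, f"hint: {self.event_name_hint}")
        return f"{self.label()} ({', '.join(extras)})"


record = SlotRecord(0x13, "signed_word", "avatarStoleSomething")
print(record.label())    # prints slot 0x13
print(record.display())  # prints slot 0x13 (hint: avatarStoleSomething, payload: signed_word)
```

Because `label()` ignores `event_name_hint`, replacing or dropping an external name later cannot change the identifier that annotations are keyed on.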

The same pass also hardens one existing IR operator boundary: the `000d:22bc` stage is now comment-backed in Ghidra as a matrix/pushback consumer over decoded workspace bytes, not a direct descriptor-row reader. The current safe attachment point is therefore still `decoded VM workspace -> link-matrix stage`, not `NPCTRIG row -> direct entity-link emission`.

## Conservative Parser Rule To Adopt Now

For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is: