MaddoScientisto daa363c3d2 Add 'annotate-usecode' command to import USECODE IR JSON annotations

- Introduced a new command 'annotate-usecode' to import USECODE IR JSON annotation hints as Ghidra comments on compiled anchors.
- Added argument parsing for multiple IR JSON files, comment type selection, and a dry-run option.
- Implemented logic to read annotation records from the provided IR files and set comments on the corresponding addresses in Ghidra.
- Enhanced JSON schema to include response structure for the new command.

2026-03-24 18:14:20 +01:00


USECODE Round-Trip IR Plan

Purpose

This note records the current evidence-backed path from Crusader USECODE bytes to a human-readable, editable, and recompilable script form.

It is intentionally conservative. ScummVM gives strong external anchors for the container layout, class/event numbering, and intrinsic naming, but it is not a symbol map for the DOS binary and it is not a ready-made round-trip compiler.

Externally Anchored Pieces

Container and class layout

ScummVM now gives a concrete second implementation for the Crusader USECODE class layout:

usecode/usecode_flex.cpp treats each class body as archive object classid + 2.
Class names come from archive object 1 at name_object + 4 + 13 * classid.
For Crusader, the class base offset is read from class bytes 8..11 and then decremented by 1.
Crusader event count is computed as (base_offset + 19) / 6.
usecode/usecode.cpp resolves event N from class data at 20 + 6 * N, with the code offset stored in bytes +2..+5 of each 6-byte event record.

Combined with the already validated FLEX container notes, the current externally anchored container model is:

FLEX entry count at 0x54
FLEX table at 0x80
USECODE class object index = classid + 2
Crusader class header contains a four-byte base-offset field at bytes 8..11
Crusader event table entries are 6 bytes each, with a known dword code offset and a still-unknown leading word
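The container model above can be sketched as a small reader. This is a minimal sketch, not the extractor's actual code: it assumes little-endian fields (usual for DOS-era data, but not stated in this note), and the names EventRow, read_code_base_u32, and read_event_rows are hypothetical. The slot count is supplied by the caller because the count rule is a separate question.

```python
import struct
from typing import List, NamedTuple

EVENT_TABLE_START = 20  # event records begin at class + 20
EVENT_ROW_SIZE = 6      # u16 leading word + u32 code offset


class EventRow(NamedTuple):
    slot: int
    raw_word: int         # still-unknown leading word, kept verbatim
    raw_code_offset: int  # dword stored in bytes +2..+5 of the record


def read_code_base_u32(class_bytes: bytes) -> int:
    """Return the raw four-byte field at bytes 8..11 (assumed little-endian)."""
    return struct.unpack_from("<I", class_bytes, 8)[0]


def read_event_rows(class_bytes: bytes, slot_count: int) -> List[EventRow]:
    """Read slot_count six-byte event records starting at class + 20."""
    rows = []
    for slot in range(slot_count):
        pos = EVENT_TABLE_START + EVENT_ROW_SIZE * slot
        word, offset = struct.unpack_from("<HI", class_bytes, pos)
        rows.append(EventRow(slot, word, offset))
    return rows
```

Keeping the leading word as an opaque integer, rather than naming it, matches the note's position that its meaning is unresolved.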

ScummVM also makes one implementation choice explicit that matters for the current mismatch: uc_machine.cpp uses get_class_base_offset() as the execution-stream base for Crusader class code, not only as metadata for event counting. That means the obj[8..11] - 1 value is part of the live code-addressing model in ScummVM, not just a comment-level interpretation.

Binary-side validation against owner-loaded classes

The first direct local validation pass against sampled owner-loaded EUSECODE class records now splits the ScummVM model into two parts: one part is confirmed, and one part still needs reconciliation.

Confirmed on sampled records (EVENT, NPCTRIG, SURCAMNS, JELYHACK, REE_BOOT, SURCAMEW, SFXTRIG):

The extracted chunk at table offset 0x88 behaves like object 1 for class names.
For each sampled class body, deriving object_index = (table_offset - 0x80) / 8, then class_id = object_index - 2, and then reading 13 bytes from object 1 at 4 + 13 * class_id yields the expected class name.
The class bodies do have a stable 4-byte header field at bytes 8..11.
The region at class + 20 is a real 6-byte event-slot table with u16 unknown_word + u32 code_or_payload_field layout.
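The table-offset-to-name derivation in the confirmed list can be written out as follows. This is an illustrative sketch: the function names are hypothetical, and treating each 13-byte name record as NUL-padded ASCII is an assumption not confirmed in this note.

```python
FLEX_TABLE_START = 0x80   # FLEX table offset
FLEX_ENTRY_SIZE = 8       # bytes per table entry, per the derivation above
NAME_TABLE_HEADER = 4     # name records start at object 1 + 4
NAME_RECORD_SIZE = 13     # bytes per class-name record


def object_index_from_table_offset(table_offset: int) -> int:
    """object_index = (table_offset - 0x80) / 8, rejecting misaligned offsets."""
    rel = table_offset - FLEX_TABLE_START
    if rel < 0 or rel % FLEX_ENTRY_SIZE:
        raise ValueError(f"not a FLEX table entry offset: {table_offset:#x}")
    return rel // FLEX_ENTRY_SIZE


def class_id_from_object_index(object_index: int) -> int:
    """Class bodies live at archive object classid + 2."""
    return object_index - 2


def class_name(name_object: bytes, class_id: int) -> str:
    """Read the 13-byte name record at 4 + 13 * class_id within object 1."""
    start = NAME_TABLE_HEADER + NAME_RECORD_SIZE * class_id
    raw = name_object[start:start + NAME_RECORD_SIZE]
    # Assumption: names are NUL-padded ASCII.
    return raw.split(b"\0", 1)[0].decode("ascii", errors="replace")
```

For example, table offset 0x88 maps to object index 1, and the NPCTRIG chunk at table offset 0x1BA8 maps to object index 0x365 and class id 0x363, matching the values recorded for that class elsewhere in this note.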

Broader family spot-checks now keep the same local structure on the owner-loaded side. In addition to the first validated set, the nearby _BOOT and environmental event families (AND_BOOT, BRO_BOOT, COR_BOOT, VAR_BOOT, FLAMEBOX, NOSTRIL, STEAMBOX) continue to fit the same table_offset -> object_index -> class_id progression with a stable bytes-8..11 dword and a 6-byte table at +20. No contradictory sample has appeared in the local EUSECODE set.

Not yet reconciled with the ScummVM event-count formula noted above:

In the sampled owner-loaded records, the raw dword at bytes 8..11 is 0x00d4, 0x00da, or 0x00e6.
Treating that dword directly as the first post-event-table offset makes the layout line up cleanly: (dword_at_8 - 20) / 6 gives 32, 33, or 35 valid slots in the samples.
Scanning instead with the previously noted ScummVM-style (base_offset + 19) / 6 interpretation overruns into inline payload and class-name bytes in the same samples.

Current best explanation:

The mismatch is now best explained as a ScummVM interpretation/detail issue, not as a proven loader-side rewrite.
The same ScummVM code path that decrements bytes 8..11 by 1 also uses that decremented value as the code-stream base. On the local owner-loaded records, this fits naturally if the raw dword is the first code-byte offset and event-table dword offsets are 1-based relative to code_base_minus_one.
Under that reading, the sampled event-count rule becomes (code_base_minus_one - 19) / 6, which is exactly equivalent to (raw_u32_at_8_11 - 20) / 6 and matches the validated 32/33/35 slot counts.
The 000d loader/runtime path (000d:44df -> 000d:4c99 -> 000d:7000 -> 000d:46ec) currently shows indexed file loading and slot-table materialization, but no verified per-class header rewrite before the VM consumes owner-backed records.
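The arithmetic behind the two readings can be checked directly. The sketch below only restates the formulas quoted in this note; the function names are hypothetical, and the second function is the formula as noted here, not a copy of ScummVM source.

```python
def local_event_count(raw_u32_at_8_11: int) -> int:
    """Local rule: the raw dword is the first post-event-table offset."""
    return (raw_u32_at_8_11 - 20) // 6


def event_count_from_code_base_minus_one(code_base_minus_one: int) -> int:
    """Same rule expressed on the decremented value."""
    return (code_base_minus_one - 19) // 6


def scummvm_style_event_count(raw_u32_at_8_11: int) -> int:
    """Formula as noted above: base_offset = raw - 1, count = (base_offset + 19) / 6."""
    base_offset = raw_u32_at_8_11 - 1
    return (base_offset + 19) // 6


if __name__ == "__main__":
    for raw in (0x00D4, 0x00DA, 0x00E6):
        print(hex(raw), local_event_count(raw), scummvm_style_event_count(raw))
```

For the sampled raw values 0x00d4, 0x00da, and 0x00e6, the local rule gives 32, 33, and 35 slots, while the noted ScummVM-style formula gives 38, 39, and 41. The six extra rows are what overrun into payload bytes in the samples.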

Current safe conclusion:

The owner-loaded class records are compatible with object 1 names, classid + 2 body lookup, a header field at bytes 8..11, and 6-byte event records at +20.
The exact meaning of the bytes-8..11 field is now narrower: on the local owner-loaded records it is best read as the first code-byte offset, with ScummVM's decremented base_offset acting as a code_base_minus_one anchor for 1-based event code offsets.
The leading word of each 6-byte event entry remains unresolved.

VM/runtime model

ScummVM also anchors several VM behaviors that line up with the current raw-binary work:

usecode/uc_machine.cpp uses ByteSet(0x1000) for Crusader globals rather than the U8 bitset path.
Remorse initializes global 0x003c to avatar number 1; Regret initializes 0x001e.
Opcode 0x11 is class/event dispatch in Crusader: the bytecode operand is an event number that is translated through get_class_event() before execution.

That makes the current local reading stronger: the 000d runtime lane looks like a Crusader-specific object/event VM that should be interpreted against Crusader event ordinals, not against U8 assumptions.

Event names

convert/crusader/convert_usecode_crusader.h gives a named event table for ids 0x00..0x1f:

Strongly usable names: look, use, anim, setActivity, cachein, hit, gotHit, hatch, schedule, release, equip, unequip, combine, calledFromAnim, enterFastArea, leaveFastArea, cast, justMoved, avatarStoleSomething, animGetHit, unhatch
Weak placeholders remain for 0x0d and 0x16..0x1f (func0D, func16..func1F)

This is enough to annotate event ordinals safely, but not enough to rename raw binary handlers unless local behavior matches.
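A hint lookup for event ordinals might look like the sketch below. It assumes the name list above is in ordinal order with 0x0d as the only gap before 0x16; that assumption agrees with the slot examples used later in this note (use at 0x01, cachein at 0x04, equip at 0x0a, enterFastArea at 0x0f, leaveFastArea at 0x10), but it should be rechecked against the header before being relied on. The function name is hypothetical.

```python
from typing import Optional

# Assumption: list order equals ordinal order; None marks the 0x0d placeholder.
_NAMED_EVENTS = [
    "look", "use", "anim", "setActivity", "cachein", "hit", "gotHit",
    "hatch", "schedule", "release", "equip", "unequip", "combine", None,
    "calledFromAnim", "enterFastArea", "leaveFastArea", "cast", "justMoved",
    "avatarStoleSomething", "animGetHit", "unhatch",
]


def event_name_hint(slot: int) -> Optional[str]:
    """Return a ScummVM-derived label for slots 0x00..0x1f, else None.

    The result is a hint only; the numeric slot stays the identifier.
    """
    if not 0 <= slot <= 0x1F:
        return None  # high slots such as 0x20..0x22 stay numeric
    if slot < len(_NAMED_EVENTS) and _NAMED_EVENTS[slot] is not None:
        return _NAMED_EVENTS[slot]
    return f"func{slot:02X}"  # weak placeholder, e.g. func0D, func16
```

Returning None for slots above 0x1f keeps the high callback slots numeric instead of inventing names for them.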

Intrinsic tables

ScummVM provides two distinct kinds of intrinsic evidence:

convert/crusader/convert_usecode_crusader.h and convert_usecode_regret.h provide ordinal-to-signature/name tables used for readable conversion.
usecode/remorse_intrinsics.h and usecode/regret_intrinsics.h provide the live runtime dispatch tables.

The safe reading is:

Remorse and Regret share the Crusader event-name table.
Remorse and Regret do not share a single intrinsic numbering/signature map.
Intrinsic names are strong hints for arity and broad subsystem identity, but they are still not direct rename authority for the DOS binary.

Safe Reuse Rules

Safe to import now

Event names as labels for event ids 0x00..0x1f in parsers, reports, and note files.
Intrinsic ordinal names as name_hint or signature_hint metadata when the ordinal and argument-byte pattern match.
High-level subsystem labels such as palette fade, camera, movie, audio, item/actor accessors, and weapon fire when they match existing binary evidence.
Slot numbers from sampled owner-loaded classes even when the event name is still only a hint.

Not safe to claim yet

Direct raw-function renames based only on ScummVM event or intrinsic names.
Remorse intrinsic numbering from Regret tables, or vice versa.
Specific descriptor-family to slot-mask mappings that are not yet proven on the binary side.
Meanings for the unknown leading word in the 6-byte Crusader event table entries.
That the ScummVM get_class_event_count() formula applies unchanged to the sampled owner-loaded EUSECODE records.

IR Requirements For Round-Tripping

The first script IR should preserve exact recompilation inputs before it tries to look pretty.

Unit of decompilation

The IR should be organized as:

USECODE archive
class
event slot
instruction stream

That matches the externally anchored class/event layout and avoids baking in any still-unproven descriptor-to-runtime assumptions.

Required top-level records

Each class record should preserve:

class_id
class_object_index (classid + 2)
name_slot_offset (4 + 13 * classid within object 1)
class_name
raw_header_prefix
raw_code_base_u32
code_base_minus_one
event_count
raw_event_table_bytes

Each event record should preserve:

event_id
event_name_hint
raw_event_entry_word
code_offset
raw_body_bytes
decoded_ops
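One way to hold these records in memory is sketched below. It is an assumption-laden illustration rather than the parser's real data model: the dataclass names are hypothetical, derived values are exposed as properties so they cannot drift from the raw bytes, and event_count uses the local (raw - 20) / 6 rule discussed earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventRecord:
    event_id: int
    raw_event_entry_word: int   # unresolved leading u16, kept verbatim
    code_offset: int
    raw_body_bytes: bytes
    event_name_hint: Optional[str] = None  # hint only, never an identifier
    decoded_ops: List[dict] = field(default_factory=list)


@dataclass
class ClassRecord:
    class_id: int
    class_name: str
    raw_header_prefix: bytes
    raw_code_base_u32: int
    raw_event_table_bytes: bytes
    events: List[EventRecord] = field(default_factory=list)

    @property
    def class_object_index(self) -> int:
        return self.class_id + 2

    @property
    def name_slot_offset(self) -> int:
        return 4 + 13 * self.class_id

    @property
    def code_base_minus_one(self) -> int:
        return self.raw_code_base_u32 - 1

    @property
    def event_count(self) -> int:
        # Assumption: local rule from the owner-loaded samples.
        return (self.raw_code_base_u32 - 20) // 6
```

A serializer can still write the derived fields out explicitly next to the raw ones, as the record lists above require.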

IR v0 Shape

The IR should separate authoritative fields from friendly hints.

class:
  class_id: 0x00be
  class_name: EVENT
  class_object_index: 0x00c0
  raw_code_base_u32: 0x0138
  code_base_minus_one: 0x0137
  raw_header_prefix: <bytes>
  events:
    - event_id: 0x04
      event_name_hint: cachein
      raw_event_entry_word: 0x????
      code_offset: 0x00001234
      ops:
        - op: intrinsic_call
          intrinsic_ordinal: 0x001e
          name_hint: Item::I_fireWeapon
          signature_hint: Item::I_fireWeapon(Item *, x, y, z, byte, int, byte)
          arg_bytes: 0x10
        - op: vm_chain_mutation
          vm_ir: APPEND_UNIQUE_INDIRECT
          opcode_hint: 0x19
        - op: unknown_raw
          bytes: <exact original bytes>

Why this shape

event_name_hint is useful for humans but does not replace the event id.
name_hint and signature_hint are useful for intrinsics but do not replace the ordinal.
unknown_raw gives a lossless fallback for still-unmapped opcodes or operand forms.
raw_event_entry_word keeps the compiler from losing bytes whose meaning is not yet settled.

Operation Families Worth Lifting First

The current binary-side evidence supports lifting a small reversible operator set first:

intrinsic_call
class_event_call
append_unique_inline
append_unique_indirect
remove_matching_inline
remove_matching_indirect
materialize_or_forward_value
prepend_inline_payload
build_entity_link_matrix
emit_or_pushback_result
push_frame_word_literal
compare_stream_dword_and_push_bool
unknown_raw

This is enough to represent the verified 000d:0988, 000d:177c, 000d:1acb, 000d:208b, 000d:21ed, and 000d:22bc families without pretending the whole VM is solved.

Metadata That Must Survive Recompilation

The compiler side will need more than pretty script text. At minimum it must preserve:

Original class ordering and sparse class ids
Original class-name table slotting
Raw class header bytes not yet semantically decoded
Raw bytes 8..11 even when a derived code_base_minus_one is also stored
Raw 6-byte event records, including the unknown leading word
Exact event order within each class
Exact code offsets or enough relocation data to rebuild them deterministically
Intrinsic ordinals and argument-byte counts
Width/sign information for immediates
Inline versus indirect payload form
String payload encoding and terminators
Any unknown opcode byte sequences verbatim

If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.
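A cheap guard for the event-table part of that list is a pack/unpack identity check. The sketch below is illustrative only: it assumes little-endian six-byte rows, and the function names are hypothetical.

```python
import struct
from typing import List, Tuple

ROW = struct.Struct("<HI")  # u16 leading word + u32 code offset (assumed little-endian)


def unpack_event_table(raw: bytes) -> List[Tuple[int, int]]:
    """Split raw event-table bytes into (leading_word, code_offset) rows."""
    if len(raw) % ROW.size:
        raise ValueError("event table length is not a multiple of 6")
    return [ROW.unpack_from(raw, pos) for pos in range(0, len(raw), ROW.size)]


def pack_event_table(rows: List[Tuple[int, int]]) -> bytes:
    """Rebuild event-table bytes from rows, including all-zero rows."""
    return b"".join(ROW.pack(word, offset) for word, offset in rows)


def survives_round_trip(raw: bytes) -> bool:
    """True when unpacking and repacking reproduces the original bytes exactly."""
    return pack_event_table(unpack_event_table(raw)) == raw
```

Running a check like this per class makes it obvious when an editor has dropped zero rows or the unknown leading word.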

Practical Naming Policy

For near-term local RE and tooling:

Use ScummVM event names as annotation labels for event slots.
Store intrinsic names as hints attached to ordinals.
Keep binary-facing renames driven by raw evidence, not by ScummVM alone.
Treat EVENT, _BOOT, and NPCTRIG as the strongest current active-event families.
Treat JELYHACK and JELYH2 as referent-anchor classes, not standalone event records.
Treat SURCAMNS and SURCAMEW as callback/eventTrigger holders, not proven active-event cores.

Repeated Slot Patterns Safe To Reuse Now

The latest pass over class_layout_index.tsv and class_event_index.tsv adds a small set of repeatable slot patterns that are safe enough to carry into decompiler output.

What is authoritative here:

whether a class has a non-zero slot entry at a given slot id
the raw u16 event word for that slot
the raw u32 code offset for that slot
repeated slot-set structure across several classes

What is still hint-level only:

the ScummVM event-name labels for slots 0x00..0x1f
any mapping from one repeated slot directly to one recovered 000d opcode family
any claim that one repeated slot family is already tied to one exact gameplay subsystem in the DOS binary

Current small safe candidate sets:

| Family | Classes | Non-zero slots | Safe implication |
| --- | --- | --- | --- |
| referent-anchor twin | `JELYHACK`, `JELYH2` | `0x01` only | these are structurally anchor-only classes, not active event hubs |
| boot-event-core | `AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `REE_BOOT`, `VAR_BOOT` | `0x0A`, `0x0F`, `0x10` | one reusable three-slot active-event core template |
| callback-eventtrigger | `SURCAMNS`, `SURCAMEW` | `0x01`, `0x0A`, `0x20`, `0x21`, `0x22` | one shared callback-oriented multi-slot template |
| environmental-event | `FLAMEBOX`, `NOSTRIL`, `STEAMBOX` | `0x0A`, `0x20`, `0x21` | one shared hazard/event template with two extra high slots |
| broad active-event lane | `EVENT`, `SFXTRIG`, and several non-island classes | `0x0A` only | slot `0x0A` is widespread enough to treat as a real repeated event slot, but too broad to over-specialize |

Concrete repeated evidence worth preserving in IR:

JELYHACK and JELYH2 both carry only slot 0x01 with the exact same row: raw_event_entry_word = 0x002A, raw_code_offset = 0x00000001.
The five _BOOT cores all share slot 0x10 with the exact same raw_event_entry_word = 0x003B, while the raw_code_offset varies by class (0x0000045c on COR_BOOT, 0x0000048b on AND_BOOT, 0x00000522 on BRO_BOOT, 0x000004df on VAR_BOOT, 0x000005a8 on REE_BOOT). That is a good example of repeated structure without identical bodies.
SURCAMNS and SURCAMEW share the same five-slot layout and the same low/high anchor rows (0x0A = 0x00D1/0x00000001, 0x22 = 0x01A3/...), but differ in the middle high-slot offsets. That looks like one shared callback template with instance-specific bodies, not two unrelated classes.
FLAMEBOX, NOSTRIL, and STEAMBOX all share one 0x0A low slot plus two extra high slots 0x20 and 0x21. Their exact words differ, so the safe reading is shared layout, not identical compiled behavior.
EVENT and SFXTRIG both participate in the wide 0x0A lane, but that family is broad enough that the slot number is more trustworthy than the ScummVM name hint.

Byte-Level Body Comparison Rules And Results

The next step after repeated row mining is to derive the chunk-local body window for each non-zero slot and compare the actual bytes instead of only the 6-byte event-table row.

Current conservative body-window rule:

body_start = code_base_minus_one + raw_code_offset
body_end = code_base_minus_one + next_non_zero_raw_code_offset in the same class, or chunk EOF when there is no later non-zero slot
this keeps the representation reversible because it is computed only from preserved header and event-table fields plus the raw chunk bytes
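The body-window rule can be sketched as below. This is not the extractor's code. It assumes that rows arrive in slot order, that a row counts as non-zero when its code offset is non-zero, and that offsets increase with slot order, as in the _BOOT classes. Classes where a lower slot carries a larger offset, such as slots 0x01 and 0x0A of SURCAMEW, are outside this sketch. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def derive_body_windows(
    code_base_minus_one: int,
    rows: List[Tuple[int, int]],
    chunk_len: int,
) -> Dict[int, Tuple[int, int, int]]:
    """Map slot -> (body_start, body_end, body_length).

    rows is a list of (slot, raw_code_offset) in slot order.
    """
    live = [(slot, off) for slot, off in rows if off != 0]
    windows = {}
    for i, (slot, off) in enumerate(live):
        start = code_base_minus_one + off
        if i + 1 < len(live):
            end = code_base_minus_one + live[i + 1][1]
        else:
            end = chunk_len  # no later non-zero slot: body runs to chunk EOF
        if end < start:
            raise ValueError(f"slot {slot:#x}: offsets are not increasing")
        windows[slot] = (start, end, end - start)
    return windows
```

With code_base_minus_one = 0x00d3, rows (0x0a, 0x1), (0x0f, 0x34c), (0x10, 0x5a8), and a chunk length of 0x06b6, this yields the REE_BOOT windows 0x00d4..0x041f, 0x041f..0x067b, and 0x067b..0x06b6 with lengths 843, 604, and 59.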

This rule is now carried directly by the extractor outputs instead of living only in notes:

USECODE/EUSECODE_extracted/class_event_index.tsv now emits derived_body_start, derived_body_end, derived_body_length, and conservative repeated_template_status columns per slot row.
USECODE/EUSECODE_extracted/boot_family_decompile.md / .tsv, callback_family_decompile.md / .tsv, and environmental_family_decompile.md / .tsv now provide concrete generated per-class decompile artifacts for the _BOOT, SURCAM*, and environmental repeated-family lanes, each grounded in emitted output rather than prose-only examples.
USECODE/EUSECODE_extracted/repeated_family_regressions.tsv now records and enforces the current repeated-family slot sets plus the verified raw-row and derived body-window fields for JELYHACK/JELYH2, _BOOT, SURCAMNS/SURCAMEW, and FLAMEBOX/NOSTRIL/STEAMBOX so extractor changes fail fast if those verified baselines drift.

What this confirms on the current repeated families:

JELYHACK and JELYH2 slot 0x01 are exact row twins but not exact body twins. Both bodies are 42 bytes long, both start at 0x00d4, both keep raw_event_entry_word = 0x002A, and both share a 10-byte prefix plus a 17-byte suffix. The first differences are at body offsets 10,11,12,24, which is consistent with one reused mini-template carrying class-local literals rather than one identical compiled body.
_BOOT slot 0x10 is the cleanest repeated-body example. All five classes have a 59-byte body, all share the same row word 0x003B, all share the same first 5 bytes and the same last 17 bytes, and none are byte-identical across the family. This is strong evidence for one shared short-template tail with class-local identifiers or immediates in the middle.
_BOOT slots 0x0A and 0x0F show the same pattern at larger sizes. Slot 0x0A bodies range from 551 to 843 bytes and share only a 3-byte prefix but a 39-byte suffix; slot 0x0F bodies range from 564 to 604 bytes and share a 3-byte prefix plus a 38-byte suffix. These are repeated family bodies, but not clones.
SURCAMNS and SURCAMEW high slots 0x20 and 0x22 also behave like near-templates, not clones. Slot 0x20 is 698 bytes in both classes with an 11-byte common prefix and an 84-byte common suffix. Slot 0x22 is 419 bytes in both classes with an 11-byte common prefix and a 53-byte common suffix.
SURCAM slot 0x21 is the strongest within-family divergence in this batch. SURCAMNS uses row word 0x0709 and a body length of 1801, while SURCAMEW uses row word 0x0655 and a body length of 1621. They still share a 20-byte suffix, so this is best read as one callback-family slot with materially different instance bodies rather than a parsing mistake.

The practical IR consequence is important: repeated-family status should be recorded separately from byte-identity status. A human-readable decompile should be able to say “same family slot template” without falsely implying “same body bytes.”
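The prefix and suffix figures quoted above come from a straightforward byte comparison. The sketch below shows one way to compute them and to keep byte-identity status separate from family status; the function names and the status strings are hypothetical, not the extractor's actual vocabulary.

```python
import hashlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of trailing bytes shared by a and b."""
    return common_prefix_len(a[::-1], b[::-1])


def body_identity_status(a: bytes, b: bytes) -> str:
    """Classify two slot bodies by byte identity only."""
    if a == b:
        return "exact-twin"
    prefix = common_prefix_len(a, b)
    # Cap the suffix so it never overlaps the prefix on short bodies.
    suffix = min(common_suffix_len(a, b), min(len(a), len(b)) - prefix)
    if prefix or suffix:
        return f"near-template; shared {prefix}-byte prefix and {suffix}-byte suffix"
    return "different-body"


def body_sha1(body: bytes) -> str:
    """Stable digest for exact identity checks; carries no semantic meaning."""
    return hashlib.sha1(body).hexdigest()
```

The family label (for example boot-event-core) would be assigned elsewhere from the slot set, so a body can be "same family" and "different-body" at the same time.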

What A Decompiled Script Looks Like Today

The most honest present-day decompilation is not a polished source language. It is a reversible descriptor-plus-event-table rendering with optional VM-op vocabulary attached where the 000d lane is already verified.

Level 0: Raw event row plus derived body window

This is the minimal human-usable row form. It preserves the original six-byte event entry, explains how the body window is derived, and records whether the slot looks like an exact twin, a near-template, or a unique body.

class_name: REE_BOOT
slot: 0x10
event_name_hint_scummvm: leaveFastArea
raw_event_entry_word: 0x003b
raw_code_offset: 0x000005a8
code_base_minus_one: 0x00d3
derived_body_start: 0x067b
derived_body_end: 0x06b6
derived_body_length: 59
repeated_template_status: boot-event-core/shared-slot-0x10
body_identity_status: non-identical; shared 5-byte prefix and 17-byte suffix across all five _BOOT bodies
body_sha1: 577c61e9c4c6...

Field meaning, using only what is currently verified:

class_name: authoritative class label from object 1 in the owner-loaded class table
slot: authoritative numeric slot id from the event table; this is safer than any guessed semantic name
event_name_hint_scummvm: external label for slots 0x00..0x1f; useful for orientation, not yet verified as the local class-specific meaning
raw_event_entry_word: the unresolved leading u16 from the 6-byte event record; authoritative bytes, unresolved semantics
raw_code_offset: the authoritative row u32; currently best read as a 1-based offset relative to code_base_minus_one
code_base_minus_one: derived from bytes 8..11 in the class header using the current conservative rule
derived_body_start and derived_body_end: computed chunk-local byte window for the slot body; useful for diffing and future recompilation, and now emitted directly in the extractor outputs
repeated_template_status: whether the row participates in a repeated family pattern such as JELY anchor twin, _BOOT event core, or SURCAM callback template
body_identity_status: whether the extracted body bytes are exact twins, near-templates, or materially different within that family
body_sha1: stable digest for exact identity checks without pretending the digest itself has semantic meaning

Level 1: Lossless event-table IR

This is the form that is closest to a future round-trip compiler.

class:
  entry_index: 0x0115
  class_id: 0x04d3
  class_name: JELYHACK
  class_object_index: 0x04d5
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x002a
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x00fe
      derived_body_length: 42
      repeated_template_status: referent-anchor-twin/shared-slot-0x01
      body_identity_status: near-template-with-JELYH2
      confidence: authoritative-bytes, hinted-label

IR v1 Parser Schema

The next tooling step changes the role of this document slightly. IR v0 was a note-level target for reversible human-readable output. IR v1 is the canonical machine-facing schema for the Pentagram-derived proof-of-concept parser and any future Ghidra annotation bridge.

The design constraints are now explicit:

keep every authoritative owner-loaded byte visible
keep slot identity separate from semantic name hints
keep runtime-facing metadata visible even when the body decompiler cannot yet explain it
preserve enough structure to emit Ghidra comments and bookmarks later without reparsing prose notes

Top-level IR object

schema_version: crusader-usecode-ir-v1-poc
source:
  flex_path: USECODE/EUSECODE.FLX
  extracted_root: USECODE/EUSECODE_extracted
  chunk_file: USECODE/EUSECODE_extracted/chunks/chunk_191_table_1BA8_off_04C347_len_0003A8.bin
class:
  entry_index: 191
  object_index: 0x365
  class_id: 0x363
  class_name: NPCTRIG
  raw_code_base_u32: 0x00da
  code_base_minus_one: 0x00d9
  conservative_event_count: 0x21
event:
  slot: 0x0a
  event_name_hint: equip
  raw_event_entry_word: 0x013e
  raw_code_offset: 0x00000001
  derived_body_start: 0x00da
  derived_body_end: 0x024f
  derived_body_length: 373
  repeated_template_status: ""
body:
  end_reason: end_opcode
  raw_body_sha1: <digest>
  unknown_trailing_bytes: ""
ops:
  - offset: 0x0000
    absolute_body_offset: 0x00da
    opcode: 0x5a
    mnemonic: init
    raw_bytes: 5a06
    operands:
      local_bytes: 0x06
  - offset: 0x0011
    absolute_body_offset: 0x00eb
    opcode: 0x40
    mnemonic: push_local_dword
    raw_bytes: 40064c02
    operands:
      bp_offset: 0x06
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  compiled_anchors:
    - 000d:51fd
    - 000d:5572
    - 000d:46ec
    - 000d:ebe3

Required fields

source keeps the specific extracted artifact path so the parser output can always be checked against the raw chunk bytes.

class keeps the owner-loaded identity and header math already validated in the binary.

event keeps the exact six-byte row, split into its authoritative fields, plus the derived body window.

body records how far the parser got and whether any bytes remain undecoded or trailing.

ops is intentionally lossless. Each decoded op keeps:

body-relative offset
absolute chunk-local offset
raw opcode byte
mnemonic
exact raw bytes for the whole op
parsed operands as typed fields

annotation_hints is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.

Opcode result policy

The parser should use four result classes only:

decoded_op: normal parsed opcode with structured operands
unknown_opcode: one-byte opcode not yet modeled; stop or fall back conservatively
raw_tail: remaining undecoded bytes after a stop condition
debug_blob: symbol/debug tail such as 0x5c-anchored metadata

That keeps the IR trustworthy even before the whole Crusader VM is modeled.
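A conservative decode loop that follows this policy might look like the sketch below. The opcode table is deliberately tiny and partly hypothetical: only 0x5a (init, one operand byte) is taken from the sample op in the schema example, and debug_blob detection is left out because its exact trigger is not specified here.

```python
from typing import List, Tuple

# opcode -> (mnemonic, operand byte count); only 0x5a comes from this note's sample
KNOWN_OPS = {0x5A: ("init", 1)}

Result = Tuple[str, int, bytes]  # (result_class, body_offset, raw_bytes)


def classify_body(body: bytes) -> List[Result]:
    """Split a slot body into decoded_op / unknown_opcode / raw_tail records."""
    results: List[Result] = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        if opcode not in KNOWN_OPS:
            # Stop at the first unmodeled opcode and keep the rest verbatim.
            results.append(("unknown_opcode", pos, body[pos:pos + 1]))
            if pos + 1 < len(body):
                results.append(("raw_tail", pos + 1, body[pos + 1:]))
            break
        _mnemonic, operand_len = KNOWN_OPS[opcode]
        end = pos + 1 + operand_len
        if end > len(body):
            # Truncated operand: do not guess, keep the bytes as a tail.
            results.append(("raw_tail", pos, body[pos:]))
            break
        results.append(("decoded_op", pos, body[pos:end]))
        pos = end
    return results
```

Because every record carries its raw bytes, concatenating them in order reproduces the original body, which is the property the four result classes are meant to protect.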

Call-site hint policy

For call and spawn-family ops, the parser may attach:

target_class_id
target_event_slot
target_event_name_hint

It should not attach a stronger semantic claim than that. The body parser is class/event aware, but not yet authoritative about gameplay meaning.

Annotation-hint schema

The Ghidra bridge should consume only small, stable items:

annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  payload_shape_hint: signed_word
  compiled_anchors:
    - address: 000d:51fd
      role: slot_value_loader
    - address: 000d:5572
      role: slot_value_plus_offset
    - address: 000d:46ec
      role: context_create_from_slot
    - address: 000d:ebe3
      role: opcode_sequence_run
    - address: 000d:22bc
      role: matrix_pushback_stage

This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
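As an illustration of how small the consumer side can be, the sketch below turns one annotation_hints mapping into plain (segment, offset, comment) records. It makes no Ghidra API calls and does not describe how any existing importer works; the function names and the comment text format are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_seg_off(address: str) -> Tuple[int, int]:
    """Parse a 'ssss:oooo' hex address such as '000d:51fd'."""
    seg, off = address.split(":")
    return int(seg, 16), int(off, 16)


def comment_records(hints: Dict) -> List[Tuple[int, int, str]]:
    """Build one comment record per compiled anchor in annotation_hints."""
    family = hints.get("runtime_family", "")
    records = []
    for anchor in hints.get("compiled_anchors", []):
        seg, off = parse_seg_off(anchor["address"])
        text = f"USECODE IR: role={anchor['role']}"
        if family:
            text += f" family={family}"
        records.append((seg, off, text))
    return records


if __name__ == "__main__":
    hints = {
        "runtime_family": "slot-backed-owner-loaded-body",
        "compiled_anchors": [
            {"address": "000d:51fd", "role": "slot_value_loader"},
            {"address": "000d:46ec", "role": "context_create_from_slot"},
        ],
    }
    for seg, off, text in comment_records(hints):
        print(f"{seg:04x}:{off:04x}  {text}")
```

Producing plain records first also makes a dry run trivial: the records can be printed without touching the database.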

The Level 1 JELYHACK rendering shown earlier is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that use is already a proven semantic name for this class.

Here is the same Level 1 style for one active event-bearing attachment class in the same island:

class:
  entry_index: 0x011b
  class_id: 0x04db
  class_name: REE_BOOT
  class_object_index: 0x04dd
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
    - event
    - counter
    - item
  events:
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x034b
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x041f
      derived_body_length: 843
      repeated_template_status: boot-event-core/shared-slot-0x0a
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x0f
      event_name_hint_scummvm: enterFastArea
      raw_event_entry_word: 0x025c
      raw_code_offset: 0x0000034c
      derived_body_start: 0x041f
      derived_body_end: 0x067b
      derived_body_length: 604
      repeated_template_status: boot-event-core/shared-slot-0x0f
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x10
      event_name_hint_scummvm: leaveFastArea
      raw_event_entry_word: 0x003b
      raw_code_offset: 0x000005a8
      derived_body_start: 0x067b
      derived_body_end: 0x06b6
      derived_body_length: 59
      repeated_template_status: boot-event-core/shared-slot-0x10
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label

And here is one callback-style multi-slot class, which shows why the high slots should stay numeric for now:

class:
  entry_index: 0x011c
  class_id: 0x04de
  class_name: SURCAMEW
  class_object_index: 0x04e0
  raw_code_base_u32: 0x00e6
  code_base_minus_one: 0x00e5
  conservative_event_count: 35
  descriptor_fields:
    - referent
    - textFile
    - monit
    - valueBox
    - passcode
    - link
    - code
    - screen
    - cameraEgg
    - trueRef
    - therma
    - eventTrigger
    - foundGun
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x00f7
      raw_code_offset: 0x000000d2
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x00d1
      raw_code_offset: 0x00000001
    - slot: 0x20
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x02ba
      raw_code_offset: 0x000001c9
      derived_body_start: 0x02ae
      derived_body_end: 0x0568
      derived_body_length: 698
      repeated_template_status: callback-eventtrigger/shared-slot-0x20
      body_identity_status: same-family-body-not-identical
    - slot: 0x21
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x0655
      raw_code_offset: 0x00000483
      derived_body_start: 0x0568
      derived_body_end: 0x0bbd
      derived_body_length: 1621
      repeated_template_status: callback-eventtrigger/shared-slot-0x21
      body_identity_status: same-family-body-not-identical
    - slot: 0x22
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x01a3
      raw_code_offset: 0x00000ad8
      derived_body_start: 0x0bbd
      derived_body_end: 0x0d60
      derived_body_length: 419
      repeated_template_status: callback-eventtrigger/shared-slot-0x22
      body_identity_status: same-family-body-not-identical

The extra derived fields are worth keeping because they answer the immediate human question that the bare event table does not: not only “which slots exist,” but also “how much body belongs to each slot” and “whether this body is a true clone or only a same-family variant.”

Level 2: Friendly but still reversible hinted form

This is the highest-level script shape that is justified right now.

anchor JELYHACK(referent)

# authoritative event rows for the anchor itself
slot 0x01  hint=use?  raw_word=0x002A  code_off=0x00000001  body=0x00D4..0x00FE  family=JELY-anchor  identity=near-template-with-JELYH2

# nearby attachment classes from the same local island
attach REE_BOOT(referent,event,counter,item)
  slot 0x0A  hint=equip?          raw_word=0x034B  code_off=0x00000001  body=0x00D4..0x041F  family=_BOOT-core  identity=shared-template-not-clone
  slot 0x0F  hint=enterFastArea?  raw_word=0x025C  code_off=0x0000034C  body=0x041F..0x067B  family=_BOOT-core  identity=shared-template-not-clone
  slot 0x10  hint=leaveFastArea?  raw_word=0x003B  code_off=0x000005A8  body=0x067B..0x06B6  family=_BOOT-core  identity=shared-template-not-clone

callback SURCAMEW(referent,textFile,monit,valueBox,passcode,link,code,screen,cameraEgg,trueRef,therma,eventTrigger,foundGun)
  slot 0x01  hint=use?    raw_word=0x00F7  code_off=0x000000D2  body=0x01B7..0x02AE
  slot 0x0A  hint=equip?  raw_word=0x00D1  code_off=0x00000001  body=0x00E6..0x02AE
  slot 0x20                raw_word=0x02BA  code_off=0x000001C9  body=0x02AE..0x0568  family=SURCAM-callback  identity=shared-template-not-clone
  slot 0x21                raw_word=0x0655  code_off=0x00000483  body=0x0568..0x0BBD  family=SURCAM-callback  identity=shared-template-with-stronger-divergence
  slot 0x22                raw_word=0x01A3  code_off=0x00000AD8  body=0x0BBD..0x0D60  family=SURCAM-callback  identity=shared-template-not-clone

attach SFXTRIG(referent,event)
  slot 0x0A  hint=equip?  raw_word=0x00B8  code_off=0x00000001

This is decompiled enough to read, diff, and later recompile because it preserves:

the original class identity
the exact non-zero event rows
the derived chunk-local body window for each row
which names are authoritative fields versus external hints
which nearby descriptors appear to be anchors, active event attachments, or callback attachments
whether a repeated family slot is an exact twin or only a structurally similar body

Level 2.5: Human annotation layer

The last layer is prose, not syntax. It should explain the honest current reading of each field so a modder can see what is safe to edit and what still needs caution.

Class name is authoritative at the container level: it comes from the owner-loaded class-name table and is not a guess.
Slot id is authoritative at the event-table level: this is the safest event identifier currently available.
Event-name hint is external: use it as orientation only when the slot is inside 0x00..0x1f and the local behavior has not yet been reverified in binary.
Raw event word is authoritative but semantically unresolved: it must survive round-trip intact.
Raw code offset is authoritative and operational: combined with code_base_minus_one, it tells us where the slot body starts in the chunk.
Body-window length is derived but useful: it tells a human whether a slot is a tiny stub-like record or a large body that deserves its own diff or annotation block.
Repeated-template status is about family structure, not byte identity: a _BOOT slot can be “the same template role” without being byte-equal across classes.
Body-identity status answers the concrete modding question “am I looking at a clone, a parameterized variant, or a different body that only occupies the same family slot?”
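
As a worked check of the code-offset rule, the short sketch below recomputes each body start from the SURCAMEW rows listed earlier. The value 0x00E5 for code_base_minus_one is an assumption inferred from the listing itself (slot 0x0A has code offset 1 and a listed start of 0x00E6), not a value read from the class header here, and the helper name body_start is hypothetical.

```python
# Rows copied from the SURCAMEW listing:
# (slot_id, raw_code_offset, body start as printed in the listing).
SURCAMEW_ROWS = [
    (0x01, 0x000000D2, 0x01B7),
    (0x0A, 0x00000001, 0x00E6),
    (0x20, 0x000001C9, 0x02AE),
    (0x21, 0x00000483, 0x0568),
    (0x22, 0x00000AD8, 0x0BBD),
]

# Assumption: inferred from the listing (offset 1 -> 0x00E6), not parsed from bytes.
CODE_BASE_MINUS_ONE = 0x00E5


def body_start(code_base_minus_one: int, raw_code_offset: int) -> int:
    """Chunk-local start of a slot body under the code_base_minus_one reading."""
    return code_base_minus_one + raw_code_offset


for slot_id, code_off, listed in SURCAMEW_ROWS:
    start = body_start(CODE_BASE_MINUS_ONE, code_off)
    status = "ok" if start == listed else "MISMATCH"
    print(f"slot 0x{slot_id:02X}: code_off=0x{code_off:08X} -> 0x{start:04X} ({status})")
```

All five listed rows agree with this arithmetic, which is only a consistency check on the listing and does not say anything about the raw event word.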

Level 3: Where the current VM IR can be attached

For classes in the active-event ecosystems (EVENT, _BOOT, NPCTRIG, SFXTRIG, and the environmental family), the current 000d work is strong enough to attach the known operator vocabulary without pretending one exact class-to-opcode decode already exists.

vm_effect_possible:
  APPEND_UNIQUE_INLINE
  APPEND_UNIQUE_INDIRECT
  REMOVE_MATCHING_INDIRECT
  REMOVE_MATCHING_INLINE
  MATERIALIZE_OR_FORWARD_VALUE
  PREPEND_INLINE_PAYLOAD
  BUILD_ENTITY_LINK_MATRIX
  EMIT_OR_PUSHBACK_RESULT
  FINALIZE_MIXED_VALUE_TO_OUTPTR

That operator block is authoritative as a recovered VM vocabulary, but its attachment to any one specific descriptor family is only ecosystem-level, not a per-class decode.

Binary-side slot and payload-shape evidence to preserve in IR

The current VM pass also adds one useful binary-side rule for the higher event ordinals: the compiled wrapper family distinguishes slot identity from payload shape, and that distinction should survive in any round-trip IR even when the human label stays unresolved.

Verified current ladder around 0005:3115..31da:

slot 0x10: guarded callsite only, zero extra word, packed mask 0x00010000
slot 0x11: named wrapper entity_vm_context_try_create_mask_00020000_slot11_with_offset, one caller-supplied extra word
slot 0x12: named wrapper entity_vm_context_try_create_mask_00040000_slot12, zero extra word
slot 0x13: named wrapper entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity, one sign-extended extra word after an entity-validity gate
slot 0x14: named wrapper entity_vm_context_try_create_mask_00100000_slot14_with_offset, one caller-supplied extra word

Why this matters for the IR:

It is direct binary evidence that some higher Crusader slot ordinals are already grouped by argument shape before any descriptor-family mapping is proven.
That means the IR should preserve slot_id plus payload_shape independently instead of collapsing everything into one guessed event-name table.
It also gives a bounded way to cross-check external event signatures without over-trusting them: slot 0x12 fits a zero-arg event shape, slot 0x13 fits a one-word event shape, and slot 0x14 currently conflicts with Pentagram's older zero-arg animGetHit() note.
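
A minimal sketch of keeping slot identity and payload shape as separate fields, using only the five ladder rows above. The SlotRecord type and its field names are hypothetical, and the observation that the packed mask equals 1 << slot_id is checked for these five rows only; nothing here shows that it holds for other slots.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlotRecord:
    slot_id: int                  # binary-stable identifier
    packed_mask: int              # mask as recorded for the wrapper/callsite
    payload_shape_hint: str       # "none", "word", or "signed_word"
    event_name_hint: Optional[str] = None  # external hint only, never truth


# Values copied from the verified ladder around 0005:3115..31da.
LADDER = [
    SlotRecord(0x10, 0x00010000, "none"),
    SlotRecord(0x11, 0x00020000, "word"),
    SlotRecord(0x12, 0x00040000, "none"),
    SlotRecord(0x13, 0x00080000, "signed_word"),
    SlotRecord(0x14, 0x00100000, "word"),
]


def mask_matches_slot(record: SlotRecord) -> bool:
    """True when the recorded mask is the single bit at position slot_id."""
    return record.packed_mask == 1 << record.slot_id


for record in LADDER:
    print(f"slot 0x{record.slot_id:02X}: mask=0x{record.packed_mask:08X} "
          f"shape={record.payload_shape_hint} bit_match={mask_matches_slot(record)}")
```

Keeping event_name_hint optional and separate means an external name can be attached or dropped later without touching the binary-stable fields.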

Practical annotation rule to adopt now:

keep higher-slot labels binary-stable as slot 0x10 .. slot 0x14 unless local behavior closes the label
attach external event names only as hints
attach one small payload_shape_hint field such as none, word, or signed_word

Minimal hinted example:

slot_record:
  slot_id: 0x13
  event_name_hint: avatarStoleSomething
  payload_shape_hint: signed_word
  binary_anchor: 0005:31da
  wrapper_name: entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity

The same pass also hardens one existing IR operator boundary: the 000d:22bc stage is now comment-backed in Ghidra as a matrix/pushback consumer over decoded workspace bytes, not a direct descriptor-row reader. The current safe attachment point is therefore still decoded VM workspace -> link-matrix stage, not NPCTRIG row -> direct entity-link emission.

Conservative Parser Rule To Adopt Now

For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is:

Preserve the raw four-byte header field at bytes 8..11 as authoritative.
Derive code_base_minus_one = raw_u32_at_8_11 - 1 for code-addressing only.
Derive event_count = (raw_u32_at_8_11 - 20) / 6 only when raw_u32_at_8_11 - 20 is non-negative and divisible by 6, and the resulting table end (20 + 6 * event_count) stays within the class object size.
Treat each event entry as u16 raw_event_entry_word + u32 raw_code_offset at class + 20 + 6 * slot.
Treat the event code offset as raw/opaque unless and until the code-addressing interpretation is needed; when needed, interpret it relative to code_base_minus_one so that offset 1 lands on the first code byte.
If the divisibility or bounds checks fail, keep the class opaque and preserve raw bytes rather than forcing a guessed event count.
tools/extract_eusecode_flx.py now implements this rule directly for the current owner-loaded EUSECODE work and emits class_layout_index.tsv plus class_event_index.tsv so raw header/event rows can be consumed by later IR tooling without re-deriving the arithmetic from prose.
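
The rule above can be sketched as follows. This is an illustration under stated assumptions, not the code in tools/extract_eusecode_flx.py: the names ClassLayout, EventRow, and parse_class_layout are hypothetical, and little-endian byte order is assumed.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional

EVENT_TABLE_START = 20   # event rows begin at class + 20
EVENT_ENTRY_SIZE = 6     # u16 raw word + u32 raw code offset


@dataclass
class EventRow:
    slot_id: int
    raw_event_entry_word: int   # preserved verbatim, meaning unresolved
    raw_code_offset: int        # preserved verbatim


@dataclass
class ClassLayout:
    raw_u32_at_8_11: int              # authoritative raw header field
    code_base_minus_one: int          # derived, for code addressing only
    events: Optional[List[EventRow]]  # None means "keep the class opaque"


def parse_class_layout(obj: bytes) -> ClassLayout:
    """Apply the conservative rule; never force a guessed event count."""
    if len(obj) < 12:
        raise ValueError("class object too short to hold bytes 8..11")
    (raw,) = struct.unpack_from("<I", obj, 8)  # assumption: little-endian
    layout = ClassLayout(raw, raw - 1, None)

    table_bytes = raw - EVENT_TABLE_START
    if table_bytes < 0 or table_bytes % EVENT_ENTRY_SIZE != 0 or raw > len(obj):
        return layout  # checks failed: stay opaque, raw bytes are untouched

    count = table_bytes // EVENT_ENTRY_SIZE
    layout.events = [
        EventRow(slot, *struct.unpack_from(
            "<HI", obj, EVENT_TABLE_START + EVENT_ENTRY_SIZE * slot))
        for slot in range(count)
    ]
    return layout
```

Returning events=None rather than raising on a failed check keeps the opaque case as an ordinary result, so a caller can still write the original bytes back unchanged.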

Remaining Binary-Side Gaps

The main blockers for a real round-trip compiler are still on the binary side:

The meaning of the first two bytes in each 6-byte Crusader event record is still unverified.
The exact provenance of ScummVM's current get_class_event_count() arithmetic is still unverified. Current local evidence says the owner-loaded/raw records fit raw_u32_at_8_11 = first_code_byte_offset, which gives (raw_u32_at_8_11 - 20) / 6, while ScummVM's (base_offset + 19) / 6 appears sign-shifted relative to that layout.
The upstream writer for selector local [BP-0x32] in the 000d:ebe3 sequencer is still unresolved.
The full control-flow opcode set and branch encoding are not yet recovered.
The exact on-disk source format behind entity_vm_runtime_owner_resource_create is still not identified.
No direct descriptor-family to slot-mask mapping is proven yet.
Callback/eventTrigger descriptors still do not have a callback-specific opcode family.

Best Current Path

The strongest present path to a usable compiler/decompiler is:

Parse classes/events as ScummVM does, except for the event count (see the next step).
Keep the class/object indexing and event-entry layout from ScummVM, but use the conservative local event-count rule above for owner-loaded/raw class parsing until a main USECODE sample proves otherwise.
Decompile only the proven operator families into structured IR.
Preserve unknown bytes verbatim in place.
Attach ScummVM event and intrinsic names as hints, not as truth.
Recompile by rebuilding the original class header and event table layout first, then re-emitting decoded and opaque ops together.

That gets to a reversible editor sooner than waiting for a full semantic VM recovery.
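
A minimal sketch of the rebuild-first step, assuming little-endian fields and treating everything after the event table as one opaque byte run. The function names split_class and rebuild_class are hypothetical, and no operator decoding is attempted; the point is only that header, table, and unknown bytes survive a split and rebuild byte-for-byte.

```python
import struct
from typing import List, Tuple

HEADER_SIZE = 20
ENTRY = struct.Struct("<HI")  # assumed little-endian: u16 raw word + u32 raw code offset


def split_class(obj: bytes) -> Tuple[bytes, List[Tuple[int, int]], bytes]:
    """Split a class object into (header, event rows, opaque code bytes)."""
    (raw,) = struct.unpack_from("<I", obj, 8)
    span = raw - HEADER_SIZE
    if span < 0 or span % ENTRY.size or raw > len(obj):
        raise ValueError("layout checks failed; keep this class opaque")
    rows = [ENTRY.unpack_from(obj, HEADER_SIZE + i * ENTRY.size)
            for i in range(span // ENTRY.size)]
    return obj[:HEADER_SIZE], rows, obj[raw:]


def rebuild_class(header: bytes, rows: List[Tuple[int, int]], code: bytes) -> bytes:
    """Re-emit header and event table first, then the code bytes verbatim."""
    (raw,) = struct.unpack_from("<I", header, 8)
    if HEADER_SIZE + ENTRY.size * len(rows) != raw:
        raise ValueError("event table no longer matches the header field at bytes 8..11")
    table = b"".join(ENTRY.pack(word, offset) for word, offset in rows)
    return bytes(header) + table + bytes(code)
```

A round-trip check such as rebuild_class(*split_class(obj)) == obj is a cheap first test for any editor built on this layout, before any decoded operator is allowed to change a byte.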
