# USECODE Round-Trip IR Plan

## Purpose

This note records the current evidence-backed path from Crusader USECODE bytes to a human-readable, editable, and recompilable script form.

It is intentionally conservative. ScummVM gives strong external anchors for the container layout, class/event numbering, and intrinsic naming, but it is not a symbol map for the DOS binary and it is not a ready-made round-trip compiler.

## Externally Anchored Pieces

### Container and class layout

ScummVM now gives a concrete second implementation for the Crusader USECODE class layout:

- `usecode/usecode_flex.cpp` treats each class body as archive object `classid + 2`.
- Class names come from archive object `1` at `name_object + 4 + 13 * classid`.
- For Crusader, the class base offset is read from class bytes `8..11` and then decremented by `1`.
- Crusader event count is computed as `(base_offset + 19) / 6`.
- `usecode/usecode.cpp` resolves event `N` from class data at `20 + 6 * N`, with the code offset stored in bytes `+2..+5` of each 6-byte event record.

Combined with the already validated FLEX container notes, the current externally anchored container model is:

- FLEX entry count at `0x54`
- FLEX table at `0x80`
- USECODE class object index = `classid + 2`
- Crusader class header contains a four-byte base-offset field at bytes `8..11`
- Crusader event table entries are 6 bytes each, with a known dword code offset and a still-unknown leading word

ScummVM also makes one implementation choice explicit that matters for the current mismatch: `uc_machine.cpp` uses `get_class_base_offset()` as the execution-stream base for Crusader class code, not only as metadata for event counting. That means the `obj[8..11] - 1` value is part of the live code-addressing model in ScummVM, not just a comment-level interpretation.

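The layout arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not the extractor's actual code: the function names are hypothetical, and little-endian byte order is an assumption based on the DOS x86 target.

```python
import struct

EVENT_TABLE_START = 20  # event records begin at class byte 20
EVENT_RECORD_SIZE = 6   # u16 leading word + u32 code offset


def class_object_index(class_id: int) -> int:
    """Archive object that holds the class body (classid + 2)."""
    return class_id + 2


def class_name_offset(class_id: int) -> int:
    """Offset of the 13-byte name slot inside archive object 1."""
    return 4 + 13 * class_id


def read_raw_code_base(class_bytes: bytes) -> int:
    """Raw dword at class bytes 8..11 (assumed little-endian)."""
    return struct.unpack_from("<I", class_bytes, 8)[0]


def read_event_record(class_bytes: bytes, event_id: int) -> tuple:
    """Return (leading_word, code_offset) for event N at 20 + 6 * N."""
    offset = EVENT_TABLE_START + EVENT_RECORD_SIZE * event_id
    return struct.unpack_from("<HI", class_bytes, offset)
```

The leading word is returned untouched rather than interpreted, which matches its still-unknown status.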
### Binary-side validation against owner-loaded classes

The first direct local validation pass against sampled owner-loaded EUSECODE class records now splits the ScummVM model into two parts: one part is confirmed, and one part still needs reconciliation.

Confirmed on sampled records (`EVENT`, `NPCTRIG`, `SURCAMNS`, `JELYHACK`, `REE_BOOT`, `SURCAMEW`, `SFXTRIG`):

- The extracted chunk at table offset `0x88` behaves like object `1` for class names.
- For each sampled class body, deriving `object_index = (table_offset - 0x80) / 8`, then `class_id = object_index - 2`, and then reading 13 bytes from object `1` at `4 + 13 * class_id` yields the expected class name.
- The class bodies do have a stable 4-byte header field at bytes `8..11`.
- The region at `class + 20` is a real 6-byte event-slot table with `u16 unknown_word + u32 code_or_payload_field` layout.

Broader family spot-checks now keep the same local structure on the owner-loaded side. In addition to the first validated set, the nearby `_BOOT` and environmental event families (`AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `VAR_BOOT`, `FLAMEBOX`, `NOSTRIL`, `STEAMBOX`) continue to fit the same `table_offset -> object_index -> class_id` progression with a stable bytes-`8..11` dword and a 6-byte table at `+20`. No contradictory sample has appeared in the local EUSECODE set.

Not yet reconciled with ScummVM's current formula note:

- In the sampled owner-loaded records, the raw dword at bytes `8..11` is `0x00d4`, `0x00da`, or `0x00e6`.
- Treating that dword directly as the first post-event-table offset makes the layout line up cleanly: `(dword_at_8 - 20) / 6` gives 32, 33, or 35 valid slots in the samples.
- Scanning instead with the previously noted ScummVM-style `(base_offset + 19) / 6` interpretation overruns into inline payload and class-name bytes in the same samples.

Current best explanation:

- The mismatch is now best explained as a ScummVM interpretation/detail issue, not as a proven loader-side rewrite.
- The same ScummVM code path that decrements bytes `8..11` by `1` also uses that decremented value as the code-stream base. On the local owner-loaded records, this fits naturally if the raw dword is the first code-byte offset and event-table dword offsets are 1-based relative to `code_base_minus_one`.
- Under that reading, the sampled event-count rule becomes `(code_base_minus_one - 19) / 6`, which is exactly equivalent to `(raw_u32_at_8_11 - 20) / 6` and matches the validated `32/33/35` slot counts.
- The `000d` loader/runtime path (`000d:44df -> 000d:4c99 -> 000d:7000 -> 000d:46ec`) currently shows indexed file loading and slot-table materialization, but no verified per-class header rewrite before the VM consumes owner-backed records.

Current safe conclusion:

- The owner-loaded class records are compatible with `object 1` names, `classid + 2` body lookup, a header field at bytes `8..11`, and 6-byte event records at `+20`.
- The exact meaning of the bytes-`8..11` field is now narrower: on the local owner-loaded records it is best read as the first code-byte offset, with ScummVM's decremented `base_offset` acting as a `code_base_minus_one` anchor for 1-based event code offsets.
- The leading word of each 6-byte event entry remains unresolved.

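The two derivations in this subsection are plain integer arithmetic, so they are easy to check in isolation. The sketch below uses hypothetical function names and only restates the formulas quoted above. The "ScummVM-style" function reproduces the formula as this note records it, not ScummVM's source.

```python
FLEX_TABLE_OFFSET = 0x80  # start of the FLEX table
FLEX_ENTRY_SIZE = 8       # bytes per table entry, per the derivation above


def object_index_from_table_offset(table_offset: int) -> int:
    """object_index = (table_offset - 0x80) / 8"""
    return (table_offset - FLEX_TABLE_OFFSET) // FLEX_ENTRY_SIZE


def class_id_from_table_offset(table_offset: int) -> int:
    """class_id = object_index - 2"""
    return object_index_from_table_offset(table_offset) - 2


def conservative_event_count(raw_code_base: int) -> int:
    """Slot count when bytes 8..11 are read as the first code-byte offset."""
    return (raw_code_base - 20) // 6


def scummvm_style_event_count(raw_code_base: int) -> int:
    """Slot count under the (base_offset + 19) / 6 reading noted above."""
    base_offset = raw_code_base - 1
    return (base_offset + 19) // 6


for raw in (0x00D4, 0x00DA, 0x00E6):
    print(hex(raw), conservative_event_count(raw), scummvm_style_event_count(raw))
```

For the three sampled header values the second formula yields six more slots than the first, which is consistent with the overrun into payload bytes described above.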
### VM/runtime model

ScummVM also anchors several VM behaviors that line up with the current raw-binary work:

- `usecode/uc_machine.cpp` uses `ByteSet(0x1000)` for Crusader globals rather than the U8 bitset path.
- Remorse initializes global `0x003c` to avatar number `1`; Regret initializes `0x001e`.
- Opcode `0x11` is class/event dispatch in Crusader: the bytecode operand is an event number that is translated through `get_class_event()` before execution.

That makes the current local reading stronger: the `000d` runtime lane looks like a Crusader-specific object/event VM that should be interpreted against Crusader event ordinals, not against U8 assumptions.

### Event names

`convert/crusader/convert_usecode_crusader.h` gives a named event table for ids `0x00..0x1f`:

- Strongly usable names: `look`, `use`, `anim`, `setActivity`, `cachein`, `hit`, `gotHit`, `hatch`, `schedule`, `release`, `equip`, `unequip`, `combine`, `calledFromAnim`, `enterFastArea`, `leaveFastArea`, `cast`, `justMoved`, `avatarStoleSomething`, `animGetHit`, `unhatch`
- Weak placeholders remain for `0x0d` and `0x16..0x1f` (`func0D`, `func16`..`func1F`)

This is enough to annotate event ordinals safely, but not enough to rename raw binary handlers unless local behavior matches.

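A hint lookup for these names can stay very small. The sketch below assumes the strongly usable names are listed in ordinal order with `0x0d` skipped. That ordering is inferred from this note, and it agrees with the hints used elsewhere here (`use` = `0x01`, `cachein` = `0x04`, `equip` = `0x0a`, `enterFastArea` = `0x0f`, `leaveFastArea` = `0x10`), but it should be checked against the ScummVM header before being relied on. The function name is hypothetical.

```python
# Names for ids 0x00..0x15; None marks the 0x0d placeholder slot.
_NAMED_EVENTS = [
    "look", "use", "anim", "setActivity", "cachein", "hit", "gotHit",
    "hatch", "schedule", "release", "equip", "unequip", "combine", None,
    "calledFromAnim", "enterFastArea", "leaveFastArea", "cast",
    "justMoved", "avatarStoleSomething", "animGetHit", "unhatch",
]


def event_name_hint(event_id: int):
    """Return a hint label for ids 0x00..0x1f, or None for higher slots."""
    if not 0 <= event_id <= 0x1F:
        return None  # slots such as 0x20..0x22 stay numeric
    name = _NAMED_EVENTS[event_id] if event_id < len(_NAMED_EVENTS) else None
    return name or f"func{event_id:02X}"
```

Returning `None` above `0x1f` keeps high slots numeric instead of inventing labels for them.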
### Intrinsic tables

ScummVM provides two distinct kinds of intrinsic evidence:

- `convert/crusader/convert_usecode_crusader.h` and `convert_usecode_regret.h` provide ordinal-to-signature/name tables used for readable conversion.
- `usecode/remorse_intrinsics.h` and `usecode/regret_intrinsics.h` provide the live runtime dispatch tables.

The safe reading is:

- Remorse and Regret share the Crusader event-name table.
- Remorse and Regret do not share a single intrinsic numbering/signature map.
- Intrinsic names are strong hints for arity and broad subsystem identity, but they are still not direct rename authority for the DOS binary.

## Safe Reuse Rules

### Safe to import now

- Event names as labels for event ids `0x00..0x1f` in parsers, reports, and note files.
- Intrinsic ordinal names as `name_hint` or `signature_hint` metadata when the ordinal and argument-byte pattern match.
- High-level subsystem labels such as palette fade, camera, movie, audio, item/actor accessors, and weapon fire when they match existing binary evidence.
- Slot numbers from sampled owner-loaded classes even when the event name is still only a hint.

### Not safe to claim yet

- Direct raw-function renames based only on ScummVM event or intrinsic names.
- Remorse intrinsic numbering from Regret tables, or vice versa.
- Specific descriptor-family to slot-mask mappings that are not yet proven on the binary side.
- Meanings for the unknown leading word in the 6-byte Crusader event table entries.
- That the ScummVM `get_class_event_count()` formula applies unchanged to the sampled owner-loaded EUSECODE records.

## IR Requirements For Round-Tripping

The first script IR should preserve exact recompilation inputs before it tries to look pretty.

## Current Parser Views

The current proof-of-concept parser now emits four complementary views for a single class/slot body:

- JSON IR: the authoritative machine-facing output for tooling and any future assembler.
- Flat text listing: a byte-faithful decode with offsets, raw bytes, and trailer sections.
- Script view: a more readable block-labeled decompilation with locals, labels, and stack-VM statements.
- Pseudocode view: a higher-level decompilation that tries to collapse common compare ladders and stack expressions into programming-language-like control flow.

The script and pseudocode views are intentionally descriptive rather than authoritative. They are meant to help read bodies like `NPCTRIG 0x0A` or `EVENT 0x0A` without losing the exact JSON IR that a round-trip compiler will need.

## Deferred Readability Follow-Ups

Keep these parser-facing readability tasks for later while the current focus stays on broad pseudocode export and class-family understanding:

1. Replace unresolved `class_XXXX_slot_YY` call labels with behavior-backed names where the compiled/runtime evidence is strong enough.
2. Replace placeholder argument names such as `arg_06` with semantic names inferred from stable usage patterns.
3. Detect more control-flow shapes beyond compare ladders, especially simple loops and early-return guards.
4. Collapse common spawn/setup idioms into more domain-specific statements when the stack pattern is consistent.
5. Run the pseudocode renderer across larger families like `EVENT`, `_BOOT`, and `SURCAM*` and tighten the heuristics where they still leak VM structure.
6. Add small behavior-level comments only where they help explain gameplay meaning rather than VM mechanics.

### Unit of decompilation

The IR should be organized as:

1. USECODE archive
2. class
3. event slot
4. instruction stream

That matches the externally anchored class/event layout and avoids baking in any still-unproven descriptor-to-runtime assumptions.

### Required top-level records

Each class record should preserve:

- `class_id`
- `class_object_index` (`classid + 2`)
- `name_slot_offset` (`4 + 13 * classid` within object `1`)
- `class_name`
- `raw_header_prefix`
- `raw_code_base_u32`
- `code_base_minus_one`
- `event_count`
- `raw_event_table_bytes`

Each event record should preserve:

- `event_id`
- `event_name_hint`
- `raw_event_entry_word`
- `code_offset`
- `raw_body_bytes`
- `decoded_ops`

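One possible in-memory shape for these two records is a pair of dataclasses. This is a sketch under stated assumptions: the class names are hypothetical, and exposing the derived values as properties (rather than stored fields) is a design choice of this example. A serializer would still write them out so the listed fields all appear in the emitted IR. The event count uses the conservative `(raw - 20) / 6` rule from the validation section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventRecord:
    event_id: int
    raw_event_entry_word: int  # unresolved leading u16, kept verbatim
    code_offset: int           # raw u32 from the event row
    raw_body_bytes: bytes
    event_name_hint: Optional[str] = None
    decoded_ops: List[dict] = field(default_factory=list)


@dataclass
class ClassRecord:
    class_id: int
    class_name: str
    raw_header_prefix: bytes
    raw_code_base_u32: int
    raw_event_table_bytes: bytes
    events: List[EventRecord] = field(default_factory=list)

    @property
    def class_object_index(self) -> int:
        return self.class_id + 2

    @property
    def name_slot_offset(self) -> int:
        return 4 + 13 * self.class_id

    @property
    def code_base_minus_one(self) -> int:
        return self.raw_code_base_u32 - 1

    @property
    def event_count(self) -> int:
        return (self.raw_code_base_u32 - 20) // 6
```

Deriving the values from the raw fields means they cannot drift out of sync with the bytes they summarize.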
## IR v0 Shape

The IR should separate authoritative fields from friendly hints.

```yaml
class:
  class_id: 0x00be
  class_name: EVENT
  class_object_index: 0x00c0
  raw_code_base_u32: 0x0138
  code_base_minus_one: 0x0137
  raw_header_prefix: <bytes>
  events:
    - event_id: 0x04
      event_name_hint: cachein
      raw_event_entry_word: 0x????
      code_offset: 0x00001234
      ops:
        - op: intrinsic_call
          intrinsic_ordinal: 0x001e
          name_hint: Item::I_fireWeapon
          signature_hint: Item::I_fireWeapon(Item *, x, y, z, byte, int, byte)
          arg_bytes: 0x10
        - op: vm_chain_mutation
          vm_ir: APPEND_UNIQUE_INDIRECT
          opcode_hint: 0x19
        - op: unknown_raw
          bytes: <exact original bytes>
```

### Why this shape

- `event_name_hint` is useful for humans but does not replace the event id.
- `name_hint` and `signature_hint` are useful for intrinsics but do not replace the ordinal.
- `unknown_raw` gives a lossless fallback for still-unmapped opcodes or operand forms.
- `raw_event_entry_word` keeps the compiler from losing bytes whose meaning is not yet settled.

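One way to make the hint/authority split testable is to drop every hint key and compare what remains. The sketch below assumes a naming convention in which any key containing `_hint` is advisory. That convention is an assumption of this example, not a rule stated by the parser, and the function name is hypothetical.

```python
def strip_hints(node):
    """Return a copy of an IR fragment with every '*_hint*' key removed."""
    if isinstance(node, dict):
        return {
            key: strip_hints(value)
            for key, value in node.items()
            if "_hint" not in key
        }
    if isinstance(node, list):
        return [strip_hints(item) for item in node]
    return node


op_a = {"op": "intrinsic_call", "intrinsic_ordinal": 0x001E,
        "name_hint": "Item::I_fireWeapon", "arg_bytes": 0x10}
op_b = {"op": "intrinsic_call", "intrinsic_ordinal": 0x001E,
        "name_hint": "some_other_label", "arg_bytes": 0x10}

# Two ops that differ only in hints carry the same authoritative content.
print(strip_hints(op_a) == strip_hints(op_b))
```

If a compiler consumes only the stripped form, editing or deleting a hint can never change the emitted bytes.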
## Operation Families Worth Lifting First

The current binary-side evidence supports lifting a small reversible operator set first:

- `intrinsic_call`
- `class_event_call`
- `append_unique_inline`
- `append_unique_indirect`
- `remove_matching_inline`
- `remove_matching_indirect`
- `materialize_or_forward_value`
- `prepend_inline_payload`
- `build_entity_link_matrix`
- `emit_or_pushback_result`
- `push_frame_word_literal`
- `compare_stream_dword_and_push_bool`
- `unknown_raw`

This is enough to represent the verified `000d:0988`, `000d:177c`, `000d:1acb`, `000d:208b`, `000d:21ed`, and `000d:22bc` families without pretending the whole VM is solved.

## Metadata That Must Survive Recompilation

The compiler side will need more than pretty script text. At minimum it must preserve:

- Original class ordering and sparse class ids
- Original class-name table slotting
- Raw class header bytes not yet semantically decoded
- Raw bytes `8..11` even when a derived `code_base_minus_one` is also stored
- Raw 6-byte event records, including the unknown leading word
- Exact event order within each class
- Exact code offsets or enough relocation data to rebuild them deterministically
- Intrinsic ordinals and argument-byte counts
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Post-`ret` debug/local symbol trailers, including the local count byte and each per-local metadata row
- Any unknown opcode byte sequences verbatim

If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.

## Practical Naming Policy

For near-term local RE and tooling:

- Use ScummVM event names as annotation labels for event slots.
- Store intrinsic names as hints attached to ordinals.
- Keep binary-facing renames driven by raw evidence, not by ScummVM alone.
- Treat `EVENT`, `_BOOT`, and `NPCTRIG` as the strongest current active-event families.
- Treat `JELYHACK` and `JELYH2` as referent-anchor classes, not standalone event records.
- Treat `SURCAMNS` and `SURCAMEW` as callback/eventTrigger holders, not proven active-event cores.

## Repeated Slot Patterns Safe To Reuse Now

The latest pass over `class_layout_index.tsv` and `class_event_index.tsv` adds a small set of repeatable slot patterns that are safe enough to carry into decompiler output.

What is authoritative here:

- whether a class has a non-zero slot entry at a given slot id
- the raw `u16` event word for that slot
- the raw `u32` code offset for that slot
- repeated slot-set structure across several classes

What is still hint-level only:

- the ScummVM event-name labels for slots `0x00..0x1f`
- any mapping from one repeated slot directly to one recovered `000d` opcode family
- any claim that one repeated slot family is already tied to one exact gameplay subsystem in the DOS binary

Current small safe candidate sets:

| Family | Classes | Non-zero slots | Safe implication |
|---|---|---|---|
| referent-anchor twin | `JELYHACK`, `JELYH2` | `0x01` only | these are structurally anchor-only classes, not active event hubs |
| boot-event-core | `AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `REE_BOOT`, `VAR_BOOT` | `0x0A`, `0x0F`, `0x10` | one reusable three-slot active-event core template |
| callback-eventtrigger | `SURCAMNS`, `SURCAMEW` | `0x01`, `0x0A`, `0x20`, `0x21`, `0x22` | one shared callback-oriented multi-slot template |
| environmental-event | `FLAMEBOX`, `NOSTRIL`, `STEAMBOX` | `0x0A`, `0x20`, `0x21` | one shared hazard/event template with two extra high slots |
| broad active-event lane | `EVENT`, `SFXTRIG`, and several non-island classes | `0x0A` only | slot `0x0A` is widespread enough to treat as a real repeated event slot, but too broad to over-specialize |

Concrete repeated evidence worth preserving in IR:

- `JELYHACK` and `JELYH2` both carry only slot `0x01` with the exact same row: `raw_event_entry_word = 0x002A`, `raw_code_offset = 0x00000001`.
- The five `_BOOT` cores all share slot `0x10` with the exact same `raw_event_entry_word = 0x003B`, while the `raw_code_offset` varies by class (`0x0000045c` on `COR_BOOT`, `0x0000048b` on `AND_BOOT`, `0x00000522` on `BRO_BOOT`, `0x000004df` on `VAR_BOOT`, `0x000005a8` on `REE_BOOT`). That is a good example of repeated structure without identical bodies.
- `SURCAMNS` and `SURCAMEW` share the same five-slot layout and the same low/high anchor rows (`0x0A = 0x00D1/0x00000001`, `0x22 = 0x01A3/...`), but differ in the middle high-slot offsets. That looks like one shared callback template with instance-specific bodies, not two unrelated classes.
- `FLAMEBOX`, `NOSTRIL`, and `STEAMBOX` all share one `0x0A` low slot plus two extra high slots `0x20` and `0x21`. Their exact words differ, so the safe reading is shared layout, not identical compiled behavior.
- `EVENT` and `SFXTRIG` both participate in the wide `0x0A` lane, but that family is broad enough that the slot number is more trustworthy than the ScummVM name hint.

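Repeated slot sets like the ones in the table can be mined mechanically by grouping classes on their set of non-zero slots. The sketch below is illustrative only: the function name is hypothetical, and the input mapping would in practice be built from `class_event_index.tsv`, whose exact column layout is not assumed here.

```python
from collections import defaultdict


def group_by_slot_set(nonzero_slots: dict) -> dict:
    """Map each shared non-zero slot set to the sorted classes that use it.

    nonzero_slots: class name -> iterable of slot ids with non-zero rows.
    Slot sets used by only one class are dropped.
    """
    groups = defaultdict(list)
    for class_name, slots in nonzero_slots.items():
        groups[frozenset(slots)].append(class_name)
    return {
        slot_set: sorted(names)
        for slot_set, names in groups.items()
        if len(names) > 1
    }


sample = {
    "JELYHACK": [0x01],
    "JELYH2": [0x01],
    "REE_BOOT": [0x0A, 0x0F, 0x10],
    "COR_BOOT": [0x0A, 0x0F, 0x10],
    "SURCAMNS": [0x01, 0x0A, 0x20, 0x21, 0x22],
    "SURCAMEW": [0x01, 0x0A, 0x20, 0x21, 0x22],
    "NPCTRIG": [0x0A, 0x0B],  # made-up slot set, present only to show filtering
}
families = group_by_slot_set(sample)
```

Grouping on slot ids alone deliberately ignores the row words and offsets, so it identifies shared layout without claiming shared bodies.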
## Byte-Level Body Comparison Rules And Results

The next step after repeated row mining is to derive the chunk-local body window for each non-zero slot and compare the actual bytes instead of only the 6-byte event-table row.

Current conservative body-window rule:

- `body_start = code_base_minus_one + raw_code_offset`
- `body_end = code_base_minus_one + next_non_zero_raw_code_offset` in the same class, or chunk EOF when there is no later non-zero slot
- this keeps the representation reversible because it is computed only from preserved header and event-table fields plus the raw chunk bytes

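The body-window rule can be sketched as a single function. Two assumptions are made here and should be checked against the extractor: "next" is taken to mean the next larger non-zero offset rather than the next slot id (in `SURCAMEW`, slot `0x01` has a larger offset than slot `0x0A`), and the chunk length passed in is the caller's responsibility. The function name is hypothetical.

```python
def derive_body_windows(code_base_minus_one: int,
                        raw_code_offsets: dict,
                        chunk_length: int) -> dict:
    """Return slot -> (body_start, body_end) for every non-zero slot.

    raw_code_offsets: slot id -> raw u32 code offset (0 means empty slot).
    """
    # Order live slots by offset so each body ends where the next begins.
    live = sorted(
        (offset, slot) for slot, offset in raw_code_offsets.items() if offset
    )
    windows = {}
    for index, (offset, slot) in enumerate(live):
        start = code_base_minus_one + offset
        if index + 1 < len(live):
            end = code_base_minus_one + live[index + 1][0]
        else:
            end = chunk_length  # last body runs to chunk EOF
        windows[slot] = (start, end)
    return windows


# REE_BOOT rows from this note; 0x06b6 is assumed to be the chunk EOF.
ree_boot = derive_body_windows(
    0x00D3, {0x0A: 0x001, 0x0F: 0x34C, 0x10: 0x5A8}, 0x06B6
)
```

With these inputs the windows match the `derived_body_start` and `derived_body_end` values listed for `REE_BOOT` in this note.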
This rule is now carried directly by the extractor outputs instead of living only in notes:

- `USECODE/EUSECODE_extracted/class_event_index.tsv` now emits `derived_body_start`, `derived_body_end`, `derived_body_length`, and conservative `repeated_template_status` columns per slot row.
- `USECODE/EUSECODE_extracted/boot_family_decompile.md` / `.tsv`, `callback_family_decompile.md` / `.tsv`, and `environmental_family_decompile.md` / `.tsv` now provide concrete generated per-class decompile artifacts for the `_BOOT`, `SURCAM*`, and environmental repeated-family lanes, each grounded in emitted output rather than prose-only examples.
- `USECODE/EUSECODE_extracted/repeated_family_regressions.tsv` now records and enforces the current repeated-family slot sets plus the verified raw-row and derived body-window fields for `JELYHACK/JELYH2`, `_BOOT`, `SURCAMNS/SURCAMEW`, and `FLAMEBOX/NOSTRIL/STEAMBOX` so extractor changes fail fast if those verified baselines drift.

What this confirms on the current repeated families:

- `JELYHACK` and `JELYH2` slot `0x01` are exact row twins but not exact body twins. Both bodies are `42` bytes long, both start at `0x00d4`, both keep `raw_event_entry_word = 0x002A`, and both share a `10`-byte prefix plus a `17`-byte suffix. The first differences are at body offsets `10,11,12,24`, which is consistent with one reused mini-template carrying class-local literals rather than one identical compiled body.
- `_BOOT` slot `0x10` is the cleanest repeated-body example. All five classes have a `59`-byte body, all share the same row word `0x003B`, all share the same first `5` bytes and the same last `17` bytes, and none are byte-identical across the family. This is strong evidence for one shared short-template tail with class-local identifiers or immediates in the middle.
- `_BOOT` slots `0x0A` and `0x0F` show the same pattern at larger sizes. Slot `0x0A` bodies range from `551` to `843` bytes and share only a `3`-byte prefix but a `39`-byte suffix; slot `0x0F` bodies range from `564` to `604` bytes and share a `3`-byte prefix plus a `38`-byte suffix. These are repeated family bodies, but not clones.
- `SURCAMNS` and `SURCAMEW` high slots `0x20` and `0x22` also behave like near-templates, not clones. Slot `0x20` is `698` bytes in both classes with an `11`-byte common prefix and an `84`-byte common suffix. Slot `0x22` is `419` bytes in both classes with an `11`-byte common prefix and a `53`-byte common suffix.
- `SURCAM` slot `0x21` is the strongest within-family divergence in this batch. `SURCAMNS` uses row word `0x0709` and a body length of `1801`, while `SURCAMEW` uses row word `0x0655` and a body length of `1621`. They still share a `20`-byte suffix, so this is best read as one callback-family slot with materially different instance bodies rather than a parsing mistake.

The practical IR consequence is important: repeated-family status should be recorded separately from byte-identity status. A human-readable decompile should be able to say “same family slot template” without falsely implying “same body bytes.”

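The prefix/suffix measurements above reduce to two small helpers plus a digest comparison. The sketch below uses hypothetical function names and synthetic bodies shaped like the `JELYHACK`/`JELYH2` case (42 bytes, differences at offsets 10, 11, 12, and 24). The byte values are invented for the example and are not taken from the real chunks.

```python
import hashlib


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def common_suffix_len(a: bytes, b: bytes) -> int:
    """Number of trailing bytes shared by a and b."""
    return common_prefix_len(a[::-1], b[::-1])


def body_identity_status(a: bytes, b: bytes) -> str:
    """Classify two bodies as exact twins or report shared prefix/suffix."""
    if hashlib.sha1(a).digest() == hashlib.sha1(b).digest():
        return "exact-twin"
    prefix = common_prefix_len(a, b)
    # Cap the suffix so it never overlaps the prefix on unequal lengths.
    suffix = min(common_suffix_len(a, b), min(len(a), len(b)) - prefix)
    return f"non-identical; shared {prefix}-byte prefix and {suffix}-byte suffix"


# Synthetic 42-byte bodies that differ at offsets 10, 11, 12, and 24.
body_a = bytes(10) + b"\x01\x02\x03" + bytes(11) + b"\x04" + bytes(17)
body_b = bytes(10) + b"\x09\x08\x07" + bytes(11) + b"\x05" + bytes(17)
status = body_identity_status(body_a, body_b)
```

Keeping this status separate from the family label is what lets a report say "same template" and "different bytes" at the same time.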
## What A Decompiled Script Looks Like Today

The most honest present-day decompilation is not a polished source language. It is a reversible descriptor-plus-event-table rendering with optional VM-op vocabulary attached where the `000d` lane is already verified.

### Level 0: Raw event row plus derived body window

This is the minimal human-usable row form. It preserves the original six-byte event entry, explains how the body window is derived, and records whether the slot looks like an exact twin, a near-template, or a unique body.

```yaml
class_name: REE_BOOT
slot: 0x10
event_name_hint_scummvm: leaveFastArea
raw_event_entry_word: 0x003b
raw_code_offset: 0x000005a8
code_base_minus_one: 0x00d3
derived_body_start: 0x067b
derived_body_end: 0x06b6
derived_body_length: 59
repeated_template_status: boot-event-core/shared-slot-0x10
body_identity_status: non-identical; shared 5-byte prefix and 17-byte suffix across all five _BOOT bodies
body_sha1: 577c61e9c4c6...
```

Field meaning, using only what is currently verified:

- `class_name`: authoritative class label from object `1` in the owner-loaded class table
- `slot`: authoritative numeric slot id from the event table; this is safer than any guessed semantic name
- `event_name_hint_scummvm`: external label for slots `0x00..0x1f`; useful for orientation, not yet verified as the local class-specific meaning
- `raw_event_entry_word`: the unresolved leading `u16` from the 6-byte event record; authoritative bytes, unresolved semantics
- `raw_code_offset`: the authoritative row `u32`; currently best read as a 1-based offset relative to `code_base_minus_one`
- `code_base_minus_one`: derived from bytes `8..11` in the class header using the current conservative rule
- `derived_body_start` and `derived_body_end`: computed chunk-local byte window for the slot body; useful for diffing and future recompilation, and now emitted directly in the extractor outputs
- `repeated_template_status`: whether the row participates in a repeated family pattern such as `JELY` anchor twin, `_BOOT` event core, or `SURCAM` callback template
- `body_identity_status`: whether the extracted body bytes are exact twins, near-templates, or materially different within that family
- `body_sha1`: stable digest for exact identity checks without pretending the digest itself has semantic meaning

### Level 1: Lossless event-table IR

This is the form that is closest to a future round-trip compiler.

```yaml
class:
  entry_index: 0x0115
  class_id: 0x04d3
  class_name: JELYHACK
  class_object_index: 0x04d5
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x002a
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x00fe
      derived_body_length: 42
      repeated_template_status: referent-anchor-twin/shared-slot-0x01
      body_identity_status: near-template-with-JELYH2
      confidence: authoritative-bytes, hinted-label
```

## IR v1 Parser Schema

The next tooling step changes the role of this document slightly. IR v0 was a note-level target for reversible human-readable output. IR v1 is the canonical machine-facing schema for the Pentagram-derived proof-of-concept parser and any future Ghidra annotation bridge.

The design constraints are now explicit:

- keep every authoritative owner-loaded byte visible
- keep slot identity separate from semantic name hints
- keep runtime-facing metadata visible even when the body decompiler cannot yet explain it
- preserve enough structure to emit Ghidra comments and bookmarks later without reparsing prose notes

### Top-level IR object

```yaml
schema_version: crusader-usecode-ir-v1-poc
source:
  flex_path: USECODE/EUSECODE.FLX
  extracted_root: USECODE/EUSECODE_extracted
  chunk_file: USECODE/EUSECODE_extracted/chunks/chunk_191_table_1BA8_off_04C347_len_0003A8.bin
class:
  entry_index: 191
  object_index: 0x365
  class_id: 0x363
  class_name: NPCTRIG
  raw_code_base_u32: 0x00da
  code_base_minus_one: 0x00d9
  conservative_event_count: 0x21
event:
  slot: 0x0a
  event_name_hint: equip
  raw_event_entry_word: 0x013e
  raw_code_offset: 0x00000001
  derived_body_start: 0x00da
  derived_body_end: 0x024f
  derived_body_length: 373
  repeated_template_status: ""
body:
  end_reason: debug_symbols_then_end
  raw_body_sha1: <digest>
  unknown_trailing_bytes: ""
  debug_symbol_offset: 0x0143
  debug_symbol_count: 5
  debug_symbols:
    - index: 0x00
      type_id: 0x69
      bp_repr: "[BP+00h]"
      name: referent
    - index: 0x01
      type_id: 0x69
      bp_repr: "[BP+0Ah]"
      name: event
  ops:
    - offset: 0x0000
      absolute_body_offset: 0x00da
      opcode: 0x5a
      mnemonic: init
      raw_bytes: 5a06
      operands:
        local_bytes: 0x06
    - offset: 0x0011
      absolute_body_offset: 0x00eb
      opcode: 0x40
      mnemonic: push_local_dword
      raw_bytes: 40064c02
      operands:
        bp_offset: 0x06
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  compiled_anchors:
    - 000d:46ec
    - 000d:0988
    - 000d:208b
    - 000d:21ed
    - 000d:22bc
    - 000d:2104
    - 000d:ebe3
```

### Required fields

`source` keeps the specific extracted artifact path so the parser output can always be checked against the raw chunk bytes.

`class` keeps the owner-loaded identity and header math already validated in the binary.

`event` keeps the exact six-byte row meaningfully split into authoritative fields plus the derived body window.

`body` records how far the parser got, whether the body terminated at a real `0x7a` end marker, and whether a post-`ret` local/debug trailer was parsed instead of being misclassified as stray opcodes.

`ops` is intentionally lossless. Each decoded op keeps:

- body-relative offset
- absolute chunk-local offset
- raw opcode byte
- mnemonic
- exact raw bytes for the whole op
- parsed operands as typed fields

`debug_symbols` preserves the owner-loaded post-`ret` local metadata block. Current evidence from `crusader-disasm` and the live extracted chunks shows that many bodies end as: executable ops -> `ret` -> local/debug symbol rows -> `0x7a` end. Those rows are not executable bytecode and should survive round-trip as structured metadata rather than raw tail bytes.

`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.

### Opcode result policy

The parser should use four result classes only:

- `decoded_op`: normal parsed opcode with structured operands
- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
- `raw_tail`: remaining undecoded bytes after a stop condition
- `debug_blob`: post-`ret` local/debug trailer ending in `0x7a`

That keeps the IR trustworthy even before the whole Crusader VM is modeled.

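The stop-and-fall-back behavior can be sketched with a toy decoder. Everything here is illustrative: the operand-length table is hypothetical and contains only opcode `0x5a` with one operand byte, as suggested by the `5a06` sample above. The `debug_blob` class is declared but not produced, because the trailer row format is not specified in this note.

```python
from enum import Enum


class ResultClass(Enum):
    DECODED_OP = "decoded_op"
    UNKNOWN_OPCODE = "unknown_opcode"
    RAW_TAIL = "raw_tail"
    DEBUG_BLOB = "debug_blob"  # not emitted by this sketch


# Hypothetical table: opcode -> operand byte count.
OPERAND_LENGTHS = {0x5A: 1}


def decode_body(body: bytes) -> list:
    """Decode until the first unmodeled opcode, then keep the rest raw."""
    results = []
    pos = 0

    def emit(kind: ResultClass, start: int, end: int) -> None:
        results.append({"result": kind.value, "offset": start,
                        "raw_bytes": body[start:end].hex()})

    while pos < len(body):
        opcode = body[pos]
        if opcode not in OPERAND_LENGTHS:
            emit(ResultClass.UNKNOWN_OPCODE, pos, pos + 1)
            if pos + 1 < len(body):
                emit(ResultClass.RAW_TAIL, pos + 1, len(body))
            break
        size = 1 + OPERAND_LENGTHS[opcode]
        if pos + size > len(body):
            emit(ResultClass.RAW_TAIL, pos, len(body))  # truncated operand
            break
        emit(ResultClass.DECODED_OP, pos, pos + size)
        pos += size
    return results


decoded = decode_body(bytes.fromhex("5a06ff0102"))
```

Concatenating the `raw_bytes` of every result reproduces the input exactly, which is the property that keeps a partially decoded body safe to re-emit.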
### Call-site hint policy

For `call` and `spawn`-family ops, the parser may attach:

- `target_class_id`
- `target_event_slot`
- `target_event_name_hint`

It should not attach a stronger semantic claim than that. The body parser is class/event aware, but not yet authoritative about gameplay meaning.

### Annotation-hint schema

The Ghidra bridge should consume only small, stable items:

```yaml
annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  payload_shape_hint: signed_word
  compiled_anchors:
    - address: 000d:46ec
      role: context_create_from_slot
    - address: 000d:0988
      role: referent_chain_mutator
    - address: 000d:208b
      role: materialize_or_forward_value
    - address: 000d:21ed
      role: prepend_inline_payload
    - address: 000d:22bc
      role: matrix_pushback_stage
    - address: 000d:2104
      role: finalize_to_outptr
    - address: 000d:ebe3
      role: opcode_sequence_run
  runtime_stage_hints:
    - stage_address: 000d:0988
      ir_name: APPEND_UNIQUE_INDIRECT
```

This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.

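A first consumer of this schema could simply flatten the anchors into address/comment pairs and leave the tool-specific step to a later importer. The sketch below is a hypothetical helper: it does not call any Ghidra API, and the comment text format is an arbitrary choice made for this example.

```python
def anchors_to_comments(annotation_hints: dict) -> list:
    """Turn compiled_anchors into (address, comment) pairs."""
    family = annotation_hints.get("runtime_family", "")
    pairs = []
    for anchor in annotation_hints.get("compiled_anchors", []):
        comment = f"USECODE IR: {anchor['role']} ({family})"
        pairs.append((anchor["address"], comment))
    return pairs


hints = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "compiled_anchors": [
        {"address": "000d:46ec", "role": "context_create_from_slot"},
        {"address": "000d:0988", "role": "referent_chain_mutator"},
    ],
}
comments = anchors_to_comments(hints)
```

Because the output is plain tuples, the same list could feed a comment script, a bookmark script, or a text report.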
The Level 1 `JELYHACK` rendering above is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that `use` is already a proven semantic name for this class.

Here is the same style for one active event-bearing attachment class in the same island:

```yaml
class:
  entry_index: 0x011b
  class_id: 0x04db
  class_name: REE_BOOT
  class_object_index: 0x04dd
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
    - event
    - counter
    - item
  events:
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x034b
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x041f
      derived_body_length: 843
      repeated_template_status: boot-event-core/shared-slot-0x0a
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x0f
      event_name_hint_scummvm: enterFastArea
      raw_event_entry_word: 0x025c
      raw_code_offset: 0x0000034c
      derived_body_start: 0x041f
      derived_body_end: 0x067b
      derived_body_length: 604
      repeated_template_status: boot-event-core/shared-slot-0x0f
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x10
      event_name_hint_scummvm: leaveFastArea
      raw_event_entry_word: 0x003b
      raw_code_offset: 0x000005a8
      derived_body_start: 0x067b
      derived_body_end: 0x06b6
      derived_body_length: 59
      repeated_template_status: boot-event-core/shared-slot-0x10
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
```

And here is one callback-style multi-slot class, which shows why the high slots should stay numeric for now:

```yaml
class:
  entry_index: 0x011c
  class_id: 0x04de
  class_name: SURCAMEW
  class_object_index: 0x04e0
  raw_code_base_u32: 0x00e6
  code_base_minus_one: 0x00e5
  conservative_event_count: 35
  descriptor_fields:
    - referent
    - textFile
    - monit
    - valueBox
    - passcode
    - link
    - code
    - screen
    - cameraEgg
    - trueRef
    - therma
    - eventTrigger
    - foundGun
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x00f7
      raw_code_offset: 0x000000d2
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x00d1
      raw_code_offset: 0x00000001
    - slot: 0x20
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x02ba
      raw_code_offset: 0x000001c9
      derived_body_start: 0x02ae
      derived_body_end: 0x0568
      derived_body_length: 698
      repeated_template_status: callback-eventtrigger/shared-slot-0x20
      body_identity_status: same-family-body-not-identical
    - slot: 0x21
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x0655
      raw_code_offset: 0x00000483
      derived_body_start: 0x0568
      derived_body_end: 0x0bbd
      derived_body_length: 1621
      repeated_template_status: callback-eventtrigger/shared-slot-0x21
      body_identity_status: same-family-body-not-identical
    - slot: 0x22
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x01a3
      raw_code_offset: 0x00000ad8
      derived_body_start: 0x0bbd
      derived_body_end: 0x0d60
      derived_body_length: 419
      repeated_template_status: callback-eventtrigger/shared-slot-0x22
      body_identity_status: same-family-body-not-identical
```

The extra derived fields are worth keeping because they answer the immediate human question that the bare event table does not: not only “which slots exist,” but also “how much body belongs to each slot” and “whether this body is a true clone or only a same-family variant.”

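As a minimal sketch of how the derived window fields above can be recomputed from the raw rows, the Python below adds `code_base_minus_one` to each `raw_code_offset` and closes each window at the next larger start. The names `EventRow` and `derive_body_windows` are hypothetical. Two things are assumptions rather than facts recovered from the binary: bodies are treated as contiguous, and the final window is closed at a caller-supplied `class_end` (here `0x06b6`, taken from the REE_BOOT listing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRow:
    """One non-zero six-byte event row (hypothetical container type)."""
    slot: int
    raw_event_entry_word: int
    raw_code_offset: int


def derive_body_windows(rows, code_base_minus_one, class_end):
    """Return {slot: (start, end, length)} for rows with a non-zero offset.

    Assumption: bodies are contiguous, so each body ends where the next
    larger start begins, and the last body ends at class_end.
    """
    starts = sorted(
        (code_base_minus_one + row.raw_code_offset, row.slot)
        for row in rows
        if row.raw_code_offset != 0
    )
    windows = {}
    for index, (start, slot) in enumerate(starts):
        end = starts[index + 1][0] if index + 1 < len(starts) else class_end
        windows[slot] = (start, end, end - start)
    return windows


# Rows copied from the REE_BOOT listing above.
REE_BOOT_ROWS = [
    EventRow(0x0A, 0x034B, 0x00000001),
    EventRow(0x0F, 0x025C, 0x0000034C),
    EventRow(0x10, 0x003B, 0x000005A8),
]
windows = derive_body_windows(REE_BOOT_ROWS, 0x00D3, 0x06B6)
```

For these three REE_BOOT rows the derived lengths (843, 604, 59) happen to equal the `raw_event_entry_word` values (`0x034b`, `0x025c`, `0x003b`). The sketch does not rely on that coincidence, and the word should still be preserved as raw data.
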
### Level 2: Friendly but still reversible hinted form

This is the highest-level script shape that is justified right now.

```text
anchor JELYHACK(referent)

# authoritative event rows for the anchor itself
slot 0x01 hint=use? raw_word=0x002A code_off=0x00000001 body=0x00D4..0x00FE family=JELY-anchor identity=near-template-with-JELYH2

# nearby attachment classes from the same local island
attach REE_BOOT(referent,event,counter,item)
slot 0x0A hint=equip? raw_word=0x034B code_off=0x00000001 body=0x00D4..0x041F family=_BOOT-core identity=shared-template-not-clone
slot 0x0F hint=enterFastArea? raw_word=0x025C code_off=0x0000034C body=0x041F..0x067B family=_BOOT-core identity=shared-template-not-clone
slot 0x10 hint=leaveFastArea? raw_word=0x003B code_off=0x000005A8 body=0x067B..0x06B6 family=_BOOT-core identity=shared-template-not-clone

callback SURCAMEW(referent,textFile,monit,valueBox,passcode,link,code,screen,cameraEgg,trueRef,therma,eventTrigger,foundGun)
slot 0x01 hint=use? raw_word=0x00F7 code_off=0x000000D2 body=0x01B7..0x02AE
slot 0x0A hint=equip? raw_word=0x00D1 code_off=0x00000001 body=0x00E6..0x01B7
slot 0x20 raw_word=0x02BA code_off=0x000001C9 body=0x02AE..0x0568 family=SURCAM-callback identity=shared-template-not-clone
slot 0x21 raw_word=0x0655 code_off=0x00000483 body=0x0568..0x0BBD family=SURCAM-callback identity=shared-template-with-stronger-divergence
slot 0x22 raw_word=0x01A3 code_off=0x00000AD8 body=0x0BBD..0x0D60 family=SURCAM-callback identity=shared-template-not-clone

attach SFXTRIG(referent,event)
slot 0x0A hint=equip? raw_word=0x00B8 code_off=0x00000001
```

This is decompiled enough to read, diff, and later recompile because it preserves:

- the original class identity
- the exact non-zero event rows
- the derived chunk-local body window for each row
- which names are authoritative fields versus external hints
- which nearby descriptors appear to be anchors, active event attachments, or callback attachments
- whether a repeated family slot is an exact twin or only a structurally similar body

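One way to check that the hinted form stays reversible is to parse each `slot` row into ordered key/value strings and re-emit it unchanged. The sketch below does that in Python. `parse_slot_line` and `emit_slot_line` are hypothetical helpers, and the grammar (single spaces, `key=value` tokens after the slot id) is inferred from the sample rows above, not taken from any existing tool.

```python
import re

# "slot 0xNN" followed by zero or more space-separated key=value tokens.
_SLOT_RE = re.compile(r"^slot (0x[0-9A-Fa-f]{2})((?: \S+=\S+)*)$")


def parse_slot_line(line):
    """Split one hinted slot row into (slot_text, [(key, value), ...]).

    Values stay as strings so hex case, zero padding, and the trailing '?'
    on hints survive a round trip untouched.
    """
    match = _SLOT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a slot row: {line!r}")
    fields = [tuple(token.split("=", 1)) for token in match.group(2).split()]
    return match.group(1), fields


def emit_slot_line(slot_text, fields):
    """Rebuild the row text from the parsed pieces, preserving field order."""
    return " ".join([f"slot {slot_text}"] + [f"{key}={value}" for key, value in fields])


sample = (
    "slot 0x0A hint=equip? raw_word=0x034B code_off=0x00000001 "
    "body=0x00D4..0x041F family=_BOOT-core identity=shared-template-not-clone"
)
slot_text, fields = parse_slot_line(sample)
roundtrip = emit_slot_line(slot_text, fields)
```

Keeping the fields as an ordered list of strings, rather than a dictionary of integers, is a deliberate choice in this sketch: it lets an editor change one value while every other token is written back byte-for-byte.
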
### Level 2.5: Human annotation layer

The last layer is prose, not syntax. It should explain the honest current reading of each field so a modder can see what is safe to edit and what still needs caution.

- Class name is authoritative at the container level: it comes from the owner-loaded class-name table and is not a guess.
- Slot id is authoritative at the event-table level: this is the safest event identifier currently available.
- Event-name hint is external: use it for orientation only, and only for slots inside `0x00..0x1f`, until the local behavior has been reverified in the binary.
- Raw event word is authoritative but semantically unresolved: it must survive round-trip intact.
- Raw code offset is authoritative and operational: combined with `code_base_minus_one`, it tells us where the slot body starts in the chunk.
- Body-window length is derived but useful: it tells a human whether a slot is a tiny stub-like record or a large body that deserves its own diff or annotation block.
- Repeated-template status is about family structure, not byte identity: a `_BOOT` slot can be “the same template role” without being byte-equal across classes.
- Body-identity status answers the concrete modding question “am I looking at a clone, a parameterized variant, or a different body that only occupies the same family slot?”

### Level 3: Where the current VM IR can be attached

For classes in the active-event ecosystems (`EVENT`, `_BOOT`, `NPCTRIG`, `SFXTRIG`, and the environmental family), the current `000d` work is strong enough to attach the known operator vocabulary without pretending one exact class-to-opcode decode already exists.

```text
vm_effect_possible:
  APPEND_UNIQUE_INLINE
  APPEND_UNIQUE_INDIRECT
  REMOVE_MATCHING_INDIRECT
  REMOVE_MATCHING_INLINE
  MATERIALIZE_OR_FORWARD_VALUE
  PREPEND_INLINE_PAYLOAD
  BUILD_ENTITY_LINK_MATRIX
  EMIT_OR_PUSHBACK_RESULT
  FINALIZE_MIXED_VALUE_TO_OUTPTR
```

That operator block is authoritative as a recovered VM vocabulary, but it is only ecosystem-level evidence when attached to one specific descriptor family.

### Binary-side slot and payload-shape evidence to preserve in IR

The current VM pass also adds one useful binary-side rule for the higher event ordinals: the compiled wrapper family distinguishes slot identity from payload shape, and that distinction should survive in any round-trip IR even when the human label stays unresolved.

Verified current ladder around `0005:3115..31da`:

- slot `0x10`: guarded callsite only, no extra word, packed mask `0x00010000`
- slot `0x11`: named wrapper `entity_vm_context_try_create_mask_00020000_slot11_with_offset`, one caller-supplied extra word
- slot `0x12`: named wrapper `entity_vm_context_try_create_mask_00040000_slot12`, no extra word
- slot `0x13`: named wrapper `entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity`, one sign-extended extra word after an entity-validity gate
- slot `0x14`: named wrapper `entity_vm_context_try_create_mask_00100000_slot14_with_offset`, one caller-supplied extra word

Why this matters for the IR:

- It is direct binary evidence that some higher Crusader slot ordinals are already grouped by argument shape before any descriptor-family mapping is proven.
- That means the IR should preserve `slot_id` plus `payload_shape` independently instead of collapsing everything into one guessed event-name table.
- It also gives a bounded way to cross-check external event signatures without over-trusting them: slot `0x12` fits a zero-arg event shape, slot `0x13` fits a one-word event shape, and slot `0x14` currently conflicts with Pentagram's older zero-arg `animGetHit()` note.

Practical annotation rule to adopt now:

- keep higher-slot labels binary-stable as `slot 0x10` .. `slot 0x14` unless local behavior closes the label
- attach external event names only as hints
- attach one small `payload_shape_hint` field such as `none`, `word`, or `signed_word`

Minimal hinted example:

```yaml
slot_record:
  slot_id: 0x13
  event_name_hint: avatarStoleSomething
  payload_shape_hint: signed_word
  binary_anchor: 0005:31da
  wrapper_name: entity_vm_context_try_create_mask_00080000_slot13_with_offset_if_valid_entity
```

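The same record shape can be sketched in Python to show `slot_id` and `payload_shape_hint` carried as independent fields. `PayloadShape`, `SlotRecord`, and `LADDER` are hypothetical names for illustration. The mask values are copied from the ladder and wrapper names above; that they equal `1 << slot_id` is an observation about these five rows only, not a proven general rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PayloadShape(Enum):
    """Argument shape seen at the wrapper callsite (hint, not semantics)."""
    NONE = "none"
    WORD = "word"
    SIGNED_WORD = "signed_word"


@dataclass(frozen=True)
class SlotRecord:
    slot_id: int                           # binary-stable identifier
    payload_shape_hint: PayloadShape       # kept separate from any event name
    packed_mask: int                       # mask value from the ladder
    event_name_hint: Optional[str] = None  # external hint only, may stay None

    def label(self) -> str:
        """Numeric label used until local behavior closes the name."""
        return f"slot 0x{self.slot_id:02x}"


# Rows transcribed from the verified ladder around 0005:3115..31da.
LADDER = [
    SlotRecord(0x10, PayloadShape.NONE, 0x00010000),
    SlotRecord(0x11, PayloadShape.WORD, 0x00020000),
    SlotRecord(0x12, PayloadShape.NONE, 0x00040000),
    SlotRecord(0x13, PayloadShape.SIGNED_WORD, 0x00080000, "avatarStoleSomething"),
    SlotRecord(0x14, PayloadShape.WORD, 0x00100000),
]
```
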
The same pass also hardens one existing IR operator boundary: the `000d:22bc` stage is now comment-backed in Ghidra as a matrix/pushback consumer over decoded workspace bytes, not a direct descriptor-row reader. The current safe attachment point is therefore still `decoded VM workspace -> link-matrix stage`, not `NPCTRIG row -> direct entity-link emission`.

## Conservative Parser Rule To Adopt Now

For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is:

- Preserve the raw four-byte header field at bytes `8..11` as authoritative.
- Derive `code_base_minus_one = raw_u32_at_8_11 - 1` for code-addressing only.
- Derive `event_count = (raw_u32_at_8_11 - 20) / 6` only when `raw_u32_at_8_11 - 20` is non-negative and divisible by `6`, and the resulting table end (`20 + 6 * event_count`) stays within the class object size.
- Treat each event entry as `u16 raw_event_entry_word + u32 raw_code_offset` at `class + 20 + 6 * slot`.
- Treat the event code offset as raw/opaque unless and until the code-addressing interpretation is needed; when needed, interpret it relative to `code_base_minus_one` so that offset `1` lands on the first code byte.
- If the divisibility or bounds checks fail, keep the class opaque and preserve raw bytes rather than forcing a guessed event count.
- `tools/extract_eusecode_flx.py` now implements this rule directly for the current owner-loaded EUSECODE work and emits `class_layout_index.tsv` plus `class_event_index.tsv` so raw header/event rows can be consumed by later IR tooling without re-deriving the arithmetic from prose.

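The rule above can be sketched as a small Python function. This is an illustration under stated assumptions, not the code in `tools/extract_eusecode_flx.py`: the function name `parse_class_layout` and the returned dictionary keys are hypothetical, and little-endian byte order is assumed for the `u32` and `u16` fields.

```python
import struct

EVENT_TABLE_START = 20   # first event record, per the rule above
EVENT_ENTRY_SIZE = 6     # u16 raw word + u32 raw code offset


def parse_class_layout(class_bytes):
    """Return header/event facts, or None when the class must stay opaque."""
    if len(class_bytes) < 12:
        return None
    (raw_base,) = struct.unpack_from("<I", class_bytes, 8)  # assumed little-endian
    table_len = raw_base - EVENT_TABLE_START
    # Reject negative, non-divisible, or out-of-bounds tables instead of guessing.
    if table_len < 0 or table_len % EVENT_ENTRY_SIZE != 0 or raw_base > len(class_bytes):
        return None
    event_count = table_len // EVENT_ENTRY_SIZE
    events = [
        struct.unpack_from("<HI", class_bytes, EVENT_TABLE_START + EVENT_ENTRY_SIZE * slot)
        for slot in range(event_count)
    ]
    return {
        "raw_u32_at_8_11": raw_base,           # authoritative, preserved as-is
        "code_base_minus_one": raw_base - 1,   # derived, for code addressing only
        "event_count": event_count,
        "events": events,                      # (raw_event_entry_word, raw_code_offset)
    }


# Synthetic two-slot class used only to exercise the checks.
_header = bytearray(20)
struct.pack_into("<I", _header, 8, 32)
synthetic = bytes(_header) + struct.pack("<HI", 0x003B, 1) + struct.pack("<HI", 0, 0) + bytes(0x3B)
layout = parse_class_layout(synthetic)
```

Returning `None` for the opaque case keeps the caller honest: a class that fails the checks is passed through as raw bytes rather than being given a guessed event count.
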
## Remaining Binary-Side Gaps

The main blockers for a real round-trip compiler are still on the binary side:

- The meaning of the first two bytes in each 6-byte Crusader event record is still unverified.
- The exact provenance of ScummVM's current `get_class_event_count()` arithmetic is still unverified; current local evidence says the owner-loaded/raw records fit `raw_u32_at_8_11 = first_code_byte_offset`, while the ScummVM count formula appears sign-shifted relative to that layout.
- The upstream writer for selector local `[BP-0x32]` in the `000d:ebe3` sequencer is still unresolved.
- The full control-flow opcode set and branch encoding are not yet recovered.
- The exact on-disk source format behind `entity_vm_runtime_owner_resource_create` is still not identified.
- No direct descriptor-family to slot-mask mapping is proven yet.
- Callback/eventTrigger descriptors still do not have a callback-specific opcode family.

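The event-count gap in the list above can be made concrete with a few lines of arithmetic. This sketch compares the conservative local rule with the count formula attributed to ScummVM earlier in this note, `(base_offset + 19) / 6` with `base_offset = raw - 1`. Integer division is an assumption, the function names are hypothetical, and the second function is a reading of the formula as quoted here, not a transcription of ScummVM source.

```python
def local_event_count(raw_u32_at_8_11):
    """Conservative local rule: the header field is the first code byte offset."""
    return (raw_u32_at_8_11 - 20) // 6


def scummvm_style_event_count(raw_u32_at_8_11):
    """Formula as quoted in this note: base offset is raw - 1, count is (base + 19) / 6."""
    base_offset = raw_u32_at_8_11 - 1
    return (base_offset + 19) // 6


# Header values from the REE_BOOT and SURCAMEW listings.
comparison = {
    raw: (local_event_count(raw), scummvm_style_event_count(raw))
    for raw in (0x00D4, 0x00E6)
}
```

For both sample headers the ScummVM-style count comes out 6 higher than the local count (38 versus 32, and 41 versus 35), which is the mismatch that the second gap describes.
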
## Best Current Path

The strongest present path to a usable compiler/decompiler is:

1. Parse the class-object indexing and 6-byte event records with the same layout ScummVM uses.
2. Keep the class/object indexing and event-entry layout from ScummVM, but use the conservative local event-count rule above for owner-loaded/raw class parsing until a main USECODE sample proves otherwise.
3. Decompile only the proven operator families into structured IR.
4. Preserve unknown bytes verbatim in place.
5. Attach ScummVM event and intrinsic names as hints, not as truth.
6. Recompile by rebuilding the original class header and event table layout first, then re-emitting decoded and opaque ops together.

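Steps 4 and 6 can be sketched as a split/rebuild pair that must reproduce the input byte-for-byte before any decoding is layered on top. `split_class` and `rebuild_class` are hypothetical helpers. Little-endian fields and a 20-byte header ahead of the event table are assumptions carried over from the parser rule, and everything after the table is treated as one opaque tail.

```python
import struct

HEADER_SIZE = 20
EVENT_ENTRY_SIZE = 6


def split_class(class_bytes, event_count):
    """Split a class object into (header, event rows, opaque tail)."""
    table_end = HEADER_SIZE + EVENT_ENTRY_SIZE * event_count
    if table_end > len(class_bytes):
        raise ValueError("event table runs past the class object")
    header = class_bytes[:HEADER_SIZE]       # kept verbatim
    rows = [
        struct.unpack_from("<HI", class_bytes, HEADER_SIZE + EVENT_ENTRY_SIZE * slot)
        for slot in range(event_count)
    ]
    tail = class_bytes[table_end:]           # code and unknown bytes, kept verbatim
    return header, rows, tail


def rebuild_class(header, rows, tail):
    """Re-emit header, then the event table, then the untouched tail."""
    table = b"".join(struct.pack("<HI", word, offset) for word, offset in rows)
    return header + table + tail


# Synthetic class object used only for the round-trip check.
original = bytes(range(20)) + struct.pack("<HI", 0x034B, 1) + struct.pack("<HI", 0, 0) + b"\xAA" * 16
header, rows, tail = split_class(original, 2)
rebuilt = rebuild_class(header, rows, tail)
```
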
That gets to a reversible editor sooner than waiting for a full semantic VM recovery.