Add Crusader-specific USECODE data and documentation

- Introduced new file `vm_mask_ladder.tsv` containing detailed mappings for Crusader USECODE VM masks and their associated descriptors.
- Added comprehensive documentation in `scummvm-crusader-reference.md` outlining the structure, findings, and implications for reverse-engineering the Crusader engine within ScummVM.
- Created `usecode-roundtrip-ir.md` to document the plan for converting Crusader USECODE bytes into a human-readable format, detailing the container layout, event names, and intrinsic tables.
- Implemented a PowerShell script `temp_usecode_sample.ps1` for extracting and analyzing USECODE data from the Crusader FLX files, providing insights into class and event structures.
MaddoScientisto 2026-03-22 17:26:39 +01:00
commit de42fd1ea1
42 changed files with 21970 additions and 1522 deletions


@ -0,0 +1,272 @@
# USECODE Round-Trip IR Plan
## Purpose
This note records the current evidence-backed path from Crusader USECODE bytes to a human-readable, editable, and recompilable script form.
It is intentionally conservative. ScummVM gives strong external anchors for the container layout, class/event numbering, and intrinsic naming, but it is not a symbol map for the DOS binary and it is not a ready-made round-trip compiler.
## Externally Anchored Pieces
### Container and class layout
ScummVM now gives a concrete second implementation for the Crusader USECODE class layout:
- `usecode/usecode_flex.cpp` treats each class body as archive object `classid + 2`.
- Class names come from archive object `1` at `name_object + 4 + 13 * classid`.
- For Crusader, the class base offset is read from class bytes `8..11` and then decremented by `1`.
- Crusader event count is computed as `(base_offset + 19) / 6`.
- `usecode/usecode.cpp` resolves event `N` from class data at `20 + 6 * N`, with the code offset stored in bytes `+2..+5` of each 6-byte event record.
Combined with the already validated FLEX container notes, the current externally anchored container model is:
- FLEX entry count at `0x54`
- FLEX table at `0x80`
- USECODE class object index = `classid + 2`
- Crusader class header contains a four-byte base-offset field at bytes `8..11`
- Crusader event table entries are 6 bytes each, with a known dword code offset and a still-unknown leading word
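The index arithmetic in the model above can be sketched as follows. This is a minimal illustration, not ScummVM's code: the function and constant names are hypothetical, and the 8-byte table-entry stride is an assumption taken from the local `(table_offset - 0x80) / 8` validation arithmetic.

```python
FLEX_ENTRY_COUNT_OFFSET = 0x54  # FLEX entry count field
FLEX_TABLE_OFFSET = 0x80        # start of the FLEX object table
FLEX_TABLE_ENTRY_SIZE = 8       # assumed stride, from the local table_offset arithmetic
CLASS_OBJECT_BIAS = 2           # class body lives at archive object classid + 2
NAME_HEADER_SIZE = 4            # bytes skipped at the start of name object 1
NAME_SLOT_SIZE = 13             # bytes per class-name slot


def class_object_index(class_id: int) -> int:
    """Archive object index that holds the class body."""
    return class_id + CLASS_OBJECT_BIAS


def table_offset_for_object(object_index: int) -> int:
    """File offset of the FLEX table entry for an archive object."""
    return FLEX_TABLE_OFFSET + FLEX_TABLE_ENTRY_SIZE * object_index


def name_slot_offset(class_id: int) -> int:
    """Offset of the class-name slot inside archive object 1."""
    return NAME_HEADER_SIZE + NAME_SLOT_SIZE * class_id
```

Under these assumptions, class id `0x00be` maps to object index `0x00c0`, and object `1` (the name object) has its table entry at `0x88`.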
ScummVM also makes one implementation choice explicit that matters for the current mismatch: `uc_machine.cpp` uses `get_class_base_offset()` as the execution-stream base for Crusader class code, not only as metadata for event counting. That means the `obj[8..11] - 1` value is part of the live code-addressing model in ScummVM, not just a comment-level interpretation.
### Binary-side validation against owner-loaded classes
The first direct local validation pass against sampled owner-loaded EUSECODE class records now splits the ScummVM model into two parts: one part is confirmed, and one part still needs reconciliation.
Confirmed on sampled records (`EVENT`, `NPCTRIG`, `SURCAMNS`, `JELYHACK`, `REE_BOOT`, `SURCAMEW`, `SFXTRIG`):
- The extracted chunk at table offset `0x88` behaves like object `1` for class names.
- For each sampled class body, deriving `object_index = (table_offset - 0x80) / 8`, then `class_id = object_index - 2`, and then reading 13 bytes from object `1` at `4 + 13 * class_id` yields the expected class name.
- The class bodies do have a stable 4-byte header field at bytes `8..11`.
- The region at `class + 20` is a real 6-byte event-slot table with `u16 unknown_word + u32 code_or_payload_field` layout.
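The reverse derivation used for this validation pass (table offset to object index to class id to name) can be sketched like this. The helper names are hypothetical, and the assumption that each 13-byte name slot is NUL-padded ASCII is mine, not something the note has verified.

```python
def class_id_from_table_offset(table_offset: int) -> int:
    """Map a FLEX table-entry offset back to a class id."""
    rel = table_offset - 0x80
    if rel < 0 or rel % 8 != 0:
        raise ValueError("offset is not on an 8-byte FLEX table entry")
    object_index = rel // 8
    if object_index < 2:
        # Objects 0 and 1 precede the class bodies (object 1 is the name table).
        raise ValueError("object index below 2 is not a class body")
    return object_index - 2


def read_class_name(name_object: bytes, class_id: int) -> str:
    """Read the 13-byte name slot for class_id from archive object 1."""
    start = 4 + 13 * class_id
    slot = name_object[start:start + 13]
    if len(slot) != 13:
        raise ValueError("name slot lies outside the name object")
    # Assumption: the slot is NUL-padded ASCII.
    return slot.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```

A quick synthetic check: with a name object built as four header bytes followed by `EVENT` and `NPCTRIG` padded to 13 bytes each, `read_class_name(obj, 1)` returns `NPCTRIG`.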
Broader family spot-checks now keep the same local structure on the owner-loaded side. In addition to the first validated set, the nearby `_BOOT` and environmental event families (`AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `VAR_BOOT`, `FLAMEBOX`, `NOSTRIL`, `STEAMBOX`) continue to fit the same `table_offset -> object_index -> class_id` progression with a stable bytes-`8..11` dword and a 6-byte table at `+20`. No contradictory sample has appeared in the local EUSECODE set.
Not yet reconciled with ScummVM's current event-count formula:
- In the sampled owner-loaded records, the raw dword at bytes `8..11` is `0x00d4`, `0x00da`, or `0x00e6`.
- Treating that dword directly as the first post-event-table offset makes the layout line up cleanly: `(dword_at_8 - 20) / 6` gives 32, 33, or 35 valid slots in the samples.
- Scanning instead with the previously noted ScummVM-style `(base_offset + 19) / 6` interpretation overruns into inline payload and class-name bytes in the same samples.
Current best explanation:
- The mismatch is now best explained as a ScummVM interpretation/detail issue, not as a proven loader-side rewrite.
- The same ScummVM code path that decrements bytes `8..11` by `1` also uses that decremented value as the code-stream base. On the local owner-loaded records, this fits naturally if the raw dword is the first code-byte offset and event-table dword offsets are 1-based relative to `code_base_minus_one`.
- Under that reading, the sampled event-count rule becomes `(code_base_minus_one - 19) / 6`, which is exactly equivalent to `(raw_u32_at_8_11 - 20) / 6` and matches the validated `32/33/35` slot counts.
- The `000d` loader/runtime path (`000d:44df -> 000d:4c99 -> 000d:7000 -> 000d:46ec`) currently shows indexed file loading and slot-table materialization, but no verified per-class header rewrite before the VM consumes owner-backed records.
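The two count formulas can be compared directly on the sampled header values. A minimal sketch with hypothetical function names; the ScummVM-style function only restates the `(base_offset + 19) / 6` formula as recorded in this note and is not copied from ScummVM source.

```python
SAMPLED_RAW_DWORDS = (0x00D4, 0x00DA, 0x00E6)


def local_event_count(raw_u32_at_8_11: int) -> int:
    """Local reading: the raw dword is the first code-byte offset."""
    return (raw_u32_at_8_11 - 20) // 6


def scummvm_style_event_count(raw_u32_at_8_11: int) -> int:
    """Formula as noted for ScummVM: decrement first, then (base + 19) / 6."""
    base_offset = raw_u32_at_8_11 - 1
    return (base_offset + 19) // 6


comparison = [
    (raw, local_event_count(raw), scummvm_style_event_count(raw))
    for raw in SAMPLED_RAW_DWORDS
]

for raw, local, scummvm in comparison:
    print(f"raw=0x{raw:04x} local={local} scummvm_style={scummvm}")
```

For the three sampled values the local counts are 32, 33, and 35, while the ScummVM-style counts are 38, 39, and 41. A scan using the larger counts would therefore read six slots past the offset stored in bytes `8..11`.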
Current safe conclusion:
- The owner-loaded class records are compatible with `object 1` names, `classid + 2` body lookup, a header field at bytes `8..11`, and 6-byte event records at `+20`.
- The exact meaning of the bytes-`8..11` field is now narrower: on the local owner-loaded records it is best read as the first code-byte offset, with ScummVM's decremented `base_offset` acting as a `code_base_minus_one` anchor for 1-based event code offsets.
- The leading word of each 6-byte event entry remains unresolved.
### VM/runtime model
ScummVM also anchors several VM behaviors that line up with the current raw-binary work:
- `usecode/uc_machine.cpp` uses `ByteSet(0x1000)` for Crusader globals rather than the U8 bitset path.
- Remorse initializes global `0x003c` to avatar number `1`; Regret initializes `0x001e`.
- Opcode `0x11` is class/event dispatch in Crusader: the bytecode operand is an event number that is translated through `get_class_event()` before execution.
That makes the current local reading stronger: the `000d` runtime lane looks like a Crusader-specific object/event VM that should be interpreted against Crusader event ordinals, not against U8 assumptions.
### Event names
`convert/crusader/convert_usecode_crusader.h` gives a named event table for ids `0x00..0x1f`:
- Strongly usable names: `look`, `use`, `anim`, `setActivity`, `cachein`, `hit`, `gotHit`, `hatch`, `schedule`, `release`, `equip`, `unequip`, `combine`, `calledFromAnim`, `enterFastArea`, `leaveFastArea`, `cast`, `justMoved`, `avatarStoleSomething`, `animGetHit`, `unhatch`
- Weak placeholders remain for `0x0d` and `0x16..0x1f` (`func0D`, `func16`..`func1F`)
This is enough to annotate event ordinals safely, but not enough to rename raw binary handlers unless local behavior matches.
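An annotation helper for these ordinals might look like the sketch below. It assumes the names listed above are in ordinal order with the `0x0d` placeholder falling between `combine` and `calledFromAnim`, which agrees with the `0x04` = `cachein` pairing used elsewhere in this note; the helper name and the `strong`/`weak` labels are hypothetical.

```python
_STRONG_NAMES = [
    "look", "use", "anim", "setActivity", "cachein", "hit", "gotHit",
    "hatch", "schedule", "release", "equip", "unequip", "combine",
    None,  # 0x0d: placeholder only
    "calledFromAnim", "enterFastArea", "leaveFastArea", "cast",
    "justMoved", "avatarStoleSomething", "animGetHit", "unhatch",
]


def event_name_hint(event_id: int) -> tuple:
    """Return (name_hint, strength) for event ids 0x00..0x1f."""
    if not 0 <= event_id <= 0x1F:
        raise ValueError("event id outside the named 0x00..0x1f range")
    if event_id < len(_STRONG_NAMES) and _STRONG_NAMES[event_id] is not None:
        return _STRONG_NAMES[event_id], "strong"
    # Weak placeholders follow the funcNN naming (func0D, func16..func1F).
    return f"func{event_id:02X}", "weak"
```

The hint is meant to sit next to the numeric event id in reports, never to replace it.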
### Intrinsic tables
ScummVM provides two distinct kinds of intrinsic evidence:
- `convert/crusader/convert_usecode_crusader.h` and `convert_usecode_regret.h` provide ordinal-to-signature/name tables used for readable conversion.
- `usecode/remorse_intrinsics.h` and `usecode/regret_intrinsics.h` provide the live runtime dispatch tables.
The safe reading is:
- Remorse and Regret share the Crusader event-name table.
- Remorse and Regret do not share a single intrinsic numbering/signature map.
- Intrinsic names are strong hints for arity and broad subsystem identity, but they are still not direct rename authority for the DOS binary.
## Safe Reuse Rules
### Safe to import now
- Event names as labels for event ids `0x00..0x1f` in parsers, reports, and note files.
- Intrinsic ordinal names as `name_hint` or `signature_hint` metadata when the ordinal and argument-byte pattern match.
- High-level subsystem labels such as palette fade, camera, movie, audio, item/actor accessors, and weapon fire when they match existing binary evidence.
- Slot numbers from sampled owner-loaded classes even when the event name is still only a hint.
### Not safe to claim yet
- Direct raw-function renames based only on ScummVM event or intrinsic names.
- Remorse intrinsic numbering from Regret tables, or vice versa.
- Specific descriptor-family to slot-mask mappings that are not yet proven on the binary side.
- Meanings for the unknown leading word in the 6-byte Crusader event table entries.
- That the ScummVM `get_class_event_count()` formula applies unchanged to the sampled owner-loaded EUSECODE records.
## IR Requirements For Round-Tripping
The first script IR should preserve exact recompilation inputs before it tries to look pretty.
### Unit of decompilation
The IR should be organized as:
1. USECODE archive
2. class
3. event slot
4. instruction stream
That matches the externally anchored class/event layout and avoids baking in any still-unproven descriptor-to-runtime assumptions.
### Required top-level records
Each class record should preserve:
- `class_id`
- `class_object_index` (`classid + 2`)
- `name_slot_offset` (`4 + 13 * classid` within object `1`)
- `class_name`
- `raw_header_prefix`
- `raw_code_base_u32`
- `code_base_minus_one`
- `event_count`
- `raw_event_table_bytes`
Each event record should preserve:
- `event_id`
- `event_name_hint`
- `raw_event_entry_word`
- `code_offset`
- `raw_body_bytes`
- `decoded_ops`
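One possible in-memory shape for these records is sketched below using Python dataclasses. The class names and the choice to derive `class_object_index`, `name_slot_offset`, `code_base_minus_one`, and `event_count` as properties (rather than store them) are assumptions of this sketch; only the raw fields are treated as authoritative.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventRecord:
    event_id: int
    raw_event_entry_word: int            # unknown leading word, kept verbatim
    code_offset: int
    raw_body_bytes: bytes = b""
    event_name_hint: Optional[str] = None  # annotation only
    decoded_ops: List[dict] = field(default_factory=list)


@dataclass
class ClassRecord:
    class_id: int
    class_name: str
    raw_header_prefix: bytes
    raw_code_base_u32: int               # raw dword from bytes 8..11
    raw_event_table_bytes: bytes
    events: List[EventRecord] = field(default_factory=list)

    @property
    def class_object_index(self) -> int:
        return self.class_id + 2

    @property
    def name_slot_offset(self) -> int:
        return 4 + 13 * self.class_id

    @property
    def code_base_minus_one(self) -> int:
        return self.raw_code_base_u32 - 1

    @property
    def event_count(self) -> int:
        # Derived from the preserved raw table: 6 bytes per event entry.
        return len(self.raw_event_table_bytes) // 6
```

Deriving the computed fields from raw ones keeps a single source of truth, so an edited record cannot drift out of sync with the bytes that will be re-emitted.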
## IR v0 Shape
The IR should separate authoritative fields from friendly hints.
```yaml
class:
  class_id: 0x00be
  class_name: EVENT
  class_object_index: 0x00c0
  raw_code_base_u32: 0x0138
  code_base_minus_one: 0x0137
  raw_header_prefix: <bytes>
  events:
    - event_id: 0x04
      event_name_hint: cachein
      raw_event_entry_word: 0x????
      code_offset: 0x00001234
      ops:
        - op: intrinsic_call
          intrinsic_ordinal: 0x001e
          name_hint: Item::I_fireWeapon
          signature_hint: Item::I_fireWeapon(Item *, x, y, z, byte, int, byte)
          arg_bytes: 0x10
        - op: vm_chain_mutation
          vm_ir: APPEND_UNIQUE_INDIRECT
          opcode_hint: 0x19
        - op: unknown_raw
          bytes: <exact original bytes>
```
### Why this shape
- `event_name_hint` is useful for humans but does not replace the event id.
- `name_hint` and `signature_hint` are useful for intrinsics but do not replace the ordinal.
- `unknown_raw` gives a lossless fallback for still-unmapped opcodes or operand forms.
- `raw_event_entry_word` keeps the compiler from losing bytes whose meaning is not yet settled.
## Operation Families Worth Lifting First
The current binary-side evidence supports lifting a small reversible operator set first:
- `intrinsic_call`
- `class_event_call`
- `append_unique_inline`
- `append_unique_indirect`
- `remove_matching_inline`
- `remove_matching_indirect`
- `materialize_or_forward_value`
- `prepend_inline_payload`
- `build_entity_link_matrix`
- `emit_or_pushback_result`
- `push_frame_word_literal`
- `compare_stream_dword_and_push_bool`
- `unknown_raw`
This is enough to represent the verified `000d:0988`, `000d:177c`, `000d:1acb`, `000d:208b`, `000d:21ed`, and `000d:22bc` families without pretending the whole VM is solved.
## Metadata That Must Survive Recompilation
The compiler side will need more than pretty script text. At minimum it must preserve:
- Original class ordering and sparse class ids
- Original class-name table slotting
- Raw class header bytes not yet semantically decoded
- Raw bytes `8..11` even when a derived `code_base_minus_one` is also stored
- Raw 6-byte event records, including the unknown leading word
- Exact event order within each class
- Exact code offsets or enough relocation data to rebuild them deterministically
- Intrinsic ordinals and argument-byte counts
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Any unknown opcode byte sequences verbatim
If any of those are dropped, the source-level form can still be readable, but it stops being a trustworthy recompilation format.
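A cheap guard for the event-table part of this list is a byte-exact round-trip check. The sketch below assumes little-endian `u16 + u32` entries (the usual convention for a DOS-era format, but an assumption here) and uses hypothetical helper names.

```python
import struct

EVENT_ENTRY = struct.Struct("<HI")  # u16 leading word + u32 code offset, 6 bytes


def unpack_event_table(raw: bytes) -> list:
    """Split a raw event table into (leading_word, code_offset) pairs."""
    return [EVENT_ENTRY.unpack_from(raw, pos) for pos in range(0, len(raw), 6)]


def pack_event_table(entries) -> bytes:
    """Re-emit the event table from (leading_word, code_offset) pairs."""
    return b"".join(EVENT_ENTRY.pack(word, offset) for word, offset in entries)


def event_table_roundtrips(raw: bytes) -> bool:
    """True when unpacking and re-packing reproduces the original bytes."""
    if len(raw) % 6 != 0:
        return False
    return pack_event_table(unpack_event_table(raw)) == raw
```

Running a check like this on every class before and after editing would catch a dropped leading word or a reordered slot early, before any bytecode is re-emitted.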
## Practical Naming Policy
For near-term local RE and tooling:
- Use ScummVM event names as annotation labels for event slots.
- Store intrinsic names as hints attached to ordinals.
- Keep binary-facing renames driven by raw evidence, not by ScummVM alone.
- Treat `EVENT`, `_BOOT`, and `NPCTRIG` as the strongest current active-event families.
- Treat `JELYHACK` and `JELYH2` as referent-anchor classes, not standalone event records.
- Treat `SURCAMNS` and `SURCAMEW` as callback/eventTrigger holders, not proven active-event cores.
## Conservative Parser Rule To Adopt Now
For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is:
- Preserve the raw four-byte header field at bytes `8..11` as authoritative.
- Derive `code_base_minus_one = raw_u32_at_8_11 - 1` for code-addressing only.
- Derive `event_count = (raw_u32_at_8_11 - 20) / 6` only when that value is non-negative, divisible by `6`, and the resulting table end stays within the class object size.
- Treat each event entry as `u16 raw_event_entry_word + u32 raw_code_offset` at `class + 20 + 6 * slot`.
- Treat the event code offset as raw/opaque unless and until the code-addressing interpretation is needed; when needed, interpret it relative to `code_base_minus_one` so that offset `1` lands on the first code byte.
- If the divisibility or bounds checks fail, keep the class opaque and preserve raw bytes rather than forcing a guessed event count.
- `tools/extract_eusecode_flx.py` now implements this rule directly for the current owner-loaded EUSECODE work and emits `class_layout_index.tsv` plus `class_event_index.tsv` so raw header/event rows can be consumed by later IR tooling without re-deriving the arithmetic from prose.
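The rule above can be sketched as a small parser. This is an illustration under stated assumptions, not the contents of `tools/extract_eusecode_flx.py`: the function names and the returned dictionary shape are hypothetical, and little-endian field encoding is assumed.

```python
import struct
from typing import Optional

EVENT_TABLE_START = 20
EVENT_ENTRY_SIZE = 6


def parse_class_events(class_bytes: bytes) -> Optional[dict]:
    """Apply the conservative rule; return None to keep the class opaque."""
    if len(class_bytes) < 12:
        return None
    (raw_base,) = struct.unpack_from("<I", class_bytes, 8)
    table_len = raw_base - EVENT_TABLE_START
    # Non-negative, divisible by 6, and table end inside the class object.
    if table_len < 0 or table_len % EVENT_ENTRY_SIZE != 0:
        return None
    if raw_base > len(class_bytes):
        return None
    events = []
    for slot in range(table_len // EVENT_ENTRY_SIZE):
        pos = EVENT_TABLE_START + EVENT_ENTRY_SIZE * slot
        word, code_offset = struct.unpack_from("<HI", class_bytes, pos)
        events.append({
            "slot": slot,
            "raw_event_entry_word": word,  # meaning still unknown
            "raw_code_offset": code_offset,
        })
    return {
        "raw_u32_at_8_11": raw_base,
        "code_base_minus_one": raw_base - 1,
        "event_count": len(events),
        "events": events,
    }


def code_position(raw_u32_at_8_11: int, raw_code_offset: int) -> int:
    """Interpret a 1-based code offset relative to code_base_minus_one."""
    return (raw_u32_at_8_11 - 1) + raw_code_offset
```

With a header dword of `0x00d4` the parser reports 32 slots, and `code_position(0x00d4, 1)` lands on `0x00d4`, the first code byte. A class whose dword fails the checks returns `None`, so the caller keeps its raw bytes untouched.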
## Remaining Binary-Side Gaps
The main blockers for a real round-trip compiler are still on the binary side:
- The meaning of the first two bytes in each 6-byte Crusader event record is still unverified.
- The exact provenance of ScummVM's current `get_class_event_count()` arithmetic is still unverified; current local evidence says the owner-loaded/raw records fit `raw_u32_at_8_11 = first_code_byte_offset`, while the ScummVM count formula adds `19` to the decremented base where the local layout requires subtracting it.
- The upstream writer for selector local `[BP-0x32]` in the `000d:ebe3` sequencer is still unresolved.
- The full control-flow opcode set and branch encoding are not yet recovered.
- The exact on-disk source format behind `entity_vm_runtime_owner_resource_create` is still not identified.
- No direct descriptor-family to slot-mask mapping is proven yet.
- Callback/eventTrigger descriptors still do not have a callback-specific opcode family.
## Best Current Path
The strongest present path to a usable compiler/decompiler is:
1. Parse class/object indexing and the 6-byte event-entry layout as ScummVM does.
2. For the event count, use the conservative local rule above instead of ScummVM's count formula when parsing owner-loaded/raw classes, until a main USECODE sample proves otherwise.
3. Decompile only the proven operator families into structured IR.
4. Preserve unknown bytes verbatim in place.
5. Attach ScummVM event and intrinsic names as hints, not as truth.
6. Recompile by rebuilding the original class header and event table layout first, then re-emitting decoded and opaque ops together.
That gets to a reversible editor sooner than waiting for a full semantic VM recovery.