Add Crusader-specific USECODE data and documentation

- Introduced new file `vm_mask_ladder.tsv` containing detailed mappings for Crusader USECODE VM masks and their associated descriptors.
- Added comprehensive documentation in `scummvm-crusader-reference.md` outlining the structure, findings, and implications for reverse-engineering the Crusader engine within ScummVM.
- Created `usecode-roundtrip-ir.md` to document the plan for converting Crusader USECODE bytes into a human-readable format, detailing the container layout, event names, and intrinsic tables.
- Implemented a PowerShell script `temp_usecode_sample.ps1` for extracting and analyzing USECODE data from the Crusader FLX files, providing insights into class and event structures.

Parent: 3daffbf113
Commit: de42fd1ea1
42 changed files with 21970 additions and 1522 deletions
272
docs/usecode-roundtrip-ir.md
Normal file
@@ -0,0 +1,272 @@
# USECODE Round-Trip IR Plan

## Purpose

This note records the current evidence-backed path from Crusader USECODE bytes to a human-readable, editable, and recompilable script form.

It is intentionally conservative. ScummVM gives strong external anchors for the container layout, class/event numbering, and intrinsic naming, but it is not a symbol map for the DOS binary and it is not a ready-made round-trip compiler.

## Externally Anchored Pieces

### Container and class layout

ScummVM now gives a concrete second implementation for the Crusader USECODE class layout:

- `usecode/usecode_flex.cpp` treats each class body as archive object `classid + 2`.
- Class names come from archive object `1` at `name_object + 4 + 13 * classid`.
- For Crusader, the class base offset is read from class bytes `8..11` and then decremented by `1`.
- Crusader event count is computed as `(base_offset + 19) / 6`.
- `usecode/usecode.cpp` resolves event `N` from class data at `20 + 6 * N`, with the code offset stored in bytes `+2..+5` of each 6-byte event record.

Combined with the already validated FLEX container notes, the current externally anchored container model is:

- FLEX entry count at `0x54`
- FLEX table at `0x80`
- USECODE class object index = `classid + 2`
- Crusader class header contains a four-byte base-offset field at bytes `8..11`
- Crusader event table entries are 6 bytes each, with a known dword code offset and a still-unknown leading word

ScummVM also makes one implementation choice explicit that matters for the current mismatch: `uc_machine.cpp` uses `get_class_base_offset()` as the execution-stream base for Crusader class code, not only as metadata for event counting. That means the `obj[8..11] - 1` value is part of the live code-addressing model in ScummVM, not just a comment-level interpretation.
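
The index arithmetic in this subsection can be written down as a few small helpers. This is a minimal sketch rather than ScummVM's code: the function and constant names are hypothetical, and little-endian byte order is an assumption based on the DOS target.

```python
import struct

NAME_TABLE_START = 4    # first class-name slot inside archive object 1
NAME_RECORD_SIZE = 13   # bytes per class-name slot
EVENT_TABLE_START = 20  # first event record inside a class body
EVENT_RECORD_SIZE = 6   # u16 leading word + u32 code offset


def class_object_index(class_id: int) -> int:
    """Archive object that holds the class body (classid + 2)."""
    return class_id + 2


def name_slot_offset(class_id: int) -> int:
    """Offset of the class name inside archive object 1."""
    return NAME_TABLE_START + NAME_RECORD_SIZE * class_id


def event_record_offset(event_id: int) -> int:
    """Offset of event N's 6-byte record inside the class body."""
    return EVENT_TABLE_START + EVENT_RECORD_SIZE * event_id


def scummvm_style_base_offset(class_body: bytes) -> int:
    """Dword at bytes 8..11, decremented by 1 as described above.

    Assumption: the field is little-endian.
    """
    (raw,) = struct.unpack_from("<I", class_body, 8)
    return raw - 1
```

For example, `class_object_index(0x00be)` gives `0x00c0`, which matches the `EVENT` entry used in the IR example later in this note.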

### Binary-side validation against owner-loaded classes

The first direct local validation pass against sampled owner-loaded EUSECODE class records now splits the ScummVM model into two parts: one part is confirmed, and one part still needs reconciliation.

Confirmed on sampled records (`EVENT`, `NPCTRIG`, `SURCAMNS`, `JELYHACK`, `REE_BOOT`, `SURCAMEW`, `SFXTRIG`):

- The extracted chunk at table offset `0x88` behaves like object `1` for class names.
- For each sampled class body, deriving `object_index = (table_offset - 0x80) / 8`, then `class_id = object_index - 2`, and then reading 13 bytes from object `1` at `4 + 13 * class_id` yields the expected class name.
- The class bodies do have a stable 4-byte header field at bytes `8..11`.
- The region at `class + 20` is a real 6-byte event-slot table with `u16 unknown_word + u32 code_or_payload_field` layout.

Broader family spot-checks now keep the same local structure on the owner-loaded side. In addition to the first validated set, the nearby `_BOOT` and environmental event families (`AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `VAR_BOOT`, `FLAMEBOX`, `NOSTRIL`, `STEAMBOX`) continue to fit the same `table_offset -> object_index -> class_id` progression with a stable bytes-`8..11` dword and a 6-byte table at `+20`. No contradictory sample has appeared in the local EUSECODE set.
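
The `table_offset -> object_index -> class_id` progression and the name lookup can be sketched as follows. The helper names are hypothetical, and treating each 13-byte name slot as NUL-padded ASCII is an assumption that this note does not state explicitly.

```python
FLEX_TABLE_START = 0x80  # FLEX table offset, as noted above
FLEX_ENTRY_SIZE = 8      # bytes per FLEX table entry
NAME_TABLE_START = 4
NAME_RECORD_SIZE = 13


def object_index_from_table_offset(table_offset: int) -> int:
    """Map a FLEX table entry offset to its archive object index."""
    rel = table_offset - FLEX_TABLE_START
    if rel < 0 or rel % FLEX_ENTRY_SIZE != 0:
        raise ValueError(f"not a FLEX table entry offset: {table_offset:#x}")
    return rel // FLEX_ENTRY_SIZE


def class_id_from_object_index(object_index: int) -> int:
    """Class bodies start at object 2; objects 0 and 1 are not class bodies."""
    if object_index < 2:
        raise ValueError("objects 0 and 1 are not class bodies")
    return object_index - 2


def class_name(name_object: bytes, class_id: int) -> str:
    """Read the 13-byte name slot for class_id from archive object 1.

    Assumption: the slot is NUL-padded ASCII.
    """
    start = NAME_TABLE_START + NAME_RECORD_SIZE * class_id
    slot = name_object[start:start + NAME_RECORD_SIZE]
    return slot.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```

With these helpers, table offset `0x88` maps to object `1` (the name object), and `0x90` maps to object `2`, which is class id `0`.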

Not yet reconciled with ScummVM's current event-count formula:

- In the sampled owner-loaded records, the raw dword at bytes `8..11` is `0x00d4`, `0x00da`, or `0x00e6`.
- Treating that dword directly as the first post-event-table offset makes the layout line up cleanly: `(dword_at_8 - 20) / 6` gives 32, 33, or 35 valid slots in the samples.
- Scanning instead with the previously noted ScummVM-style `(base_offset + 19) / 6` interpretation overruns into inline payload and class-name bytes in the same samples.
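
The arithmetic behind this mismatch can be checked directly on the three sampled header values. The sketch below only restates the two formulas from this note; the function names are hypothetical.

```python
SAMPLED_RAW_DWORDS = (0x00D4, 0x00DA, 0x00E6)


def local_slot_count(raw_dword: int) -> tuple:
    """(slots, remainder) when the dword is the first post-table offset."""
    return divmod(raw_dword - 20, 6)


def scummvm_style_slot_count(raw_dword: int) -> int:
    """Slot count under the (base_offset + 19) / 6 reading."""
    base_offset = raw_dword - 1
    return (base_offset + 19) // 6


def table_overrun(raw_dword: int) -> int:
    """Bytes by which the ScummVM-style table end passes the raw dword."""
    table_end = 20 + 6 * scummvm_style_slot_count(raw_dword)
    return table_end - raw_dword


for raw in SAMPLED_RAW_DWORDS:
    slots, remainder = local_slot_count(raw)
    print(f"{raw:#06x}: local={slots} (rem {remainder}), "
          f"scummvm_style={scummvm_style_slot_count(raw)}, "
          f"overrun={table_overrun(raw)} bytes")
```

For the three samples, the local rule divides evenly into 32, 33, and 35 slots, while the `+ 19` form gives 38, 39, and 41 slots, so a scan using it would read 36 bytes past the offset stored at bytes `8..11` in each case.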

Current best explanation:

- The mismatch is now best explained as a ScummVM interpretation/detail issue, not as a proven loader-side rewrite.
- The same ScummVM code path that decrements bytes `8..11` by `1` also uses that decremented value as the code-stream base. On the local owner-loaded records, this fits naturally if the raw dword is the first code-byte offset and event-table dword offsets are 1-based relative to `code_base_minus_one`.
- Under that reading, the sampled event-count rule becomes `(code_base_minus_one - 19) / 6`, which is exactly equivalent to `(raw_u32_at_8_11 - 20) / 6` and matches the validated `32/33/35` slot counts.
- The `000d` loader/runtime path (`000d:44df -> 000d:4c99 -> 000d:7000 -> 000d:46ec`) currently shows indexed file loading and slot-table materialization, but no verified per-class header rewrite before the VM consumes owner-backed records.
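
A short sketch of the 1-based addressing reading follows. The function names are hypothetical, and treating a stored offset of `0` as "no handler" is an assumption that this note has not verified.

```python
from typing import Optional


def resolve_event_code(raw_u32_at_8_11: int,
                       event_code_offset: int) -> Optional[int]:
    """Class-relative position of an event's first code byte.

    Offsets are read as 1-based relative to code_base_minus_one, so an
    offset of 1 lands on the first code byte. Assumption: 0 means the
    slot has no handler.
    """
    if event_code_offset == 0:
        return None
    code_base_minus_one = raw_u32_at_8_11 - 1
    return code_base_minus_one + event_code_offset


def event_count_from_base(code_base_minus_one: int) -> int:
    return (code_base_minus_one - 19) // 6


def event_count_from_raw(raw_u32_at_8_11: int) -> int:
    return (raw_u32_at_8_11 - 20) // 6
```

With a raw header value of `0x00d4`, `resolve_event_code(0x00d4, 1)` returns `0x00d4`, the first byte after the 32-slot event table, and both count helpers agree for every raw value because `(raw - 1) - 19` equals `raw - 20`.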

Current safe conclusion:

- The owner-loaded class records are compatible with `object 1` names, `classid + 2` body lookup, a header field at bytes `8..11`, and 6-byte event records at `+20`.
- The exact meaning of the bytes-`8..11` field is now narrower: on the local owner-loaded records it is best read as the first code-byte offset, with ScummVM's decremented `base_offset` acting as a `code_base_minus_one` anchor for 1-based event code offsets.
- The leading word of each 6-byte event entry remains unresolved.

### VM/runtime model

ScummVM also anchors several VM behaviors that line up with the current raw-binary work:

- `usecode/uc_machine.cpp` uses `ByteSet(0x1000)` for Crusader globals rather than the U8 bitset path.
- Remorse initializes global `0x003c` to avatar number `1`; Regret initializes `0x001e`.
- Opcode `0x11` is class/event dispatch in Crusader: the bytecode operand is an event number that is translated through `get_class_event()` before execution.

That makes the current local reading stronger: the `000d` runtime lane looks like a Crusader-specific object/event VM that should be interpreted against Crusader event ordinals, not against U8 assumptions.

### Event names

`convert/crusader/convert_usecode_crusader.h` gives a named event table for ids `0x00..0x1f`:

- Strongly usable names: `look`, `use`, `anim`, `setActivity`, `cachein`, `hit`, `gotHit`, `hatch`, `schedule`, `release`, `equip`, `unequip`, `combine`, `calledFromAnim`, `enterFastArea`, `leaveFastArea`, `cast`, `justMoved`, `avatarStoleSomething`, `animGetHit`, `unhatch`
- Weak placeholders remain for `0x0d` and `0x16..0x1f` (`func0D`, `func16`..`func1F`)

This is enough to annotate event ordinals safely, but not enough to rename raw binary handlers unless local behavior matches.
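
A label table for annotation can be built from the names listed above. This sketch assumes the names appear in ordinal order with the `0x0d` placeholder slotted in between `combine` and `calledFromAnim`; that ordering is consistent with `cachein` being event `0x04` in the IR example later in this note, but it should be checked against the ScummVM header before use. The identifiers are hypothetical.

```python
# Names in assumed ordinal order; None marks the unnamed 0x0d slot.
_ORDERED_NAMES = [
    "look", "use", "anim", "setActivity", "cachein", "hit", "gotHit",
    "hatch", "schedule", "release", "equip", "unequip", "combine",
    None,
    "calledFromAnim", "enterFastArea", "leaveFastArea", "cast",
    "justMoved", "avatarStoleSomething", "animGetHit", "unhatch",
]

EVENT_NAME_HINTS = {
    event_id: (name if name is not None else f"func{event_id:02X}")
    for event_id, name in enumerate(_ORDERED_NAMES)
}
# Weak placeholders for the remaining ids 0x16..0x1f.
EVENT_NAME_HINTS.update({i: f"func{i:02X}" for i in range(0x16, 0x20)})


def event_label(event_id: int) -> str:
    """Annotation label that keeps the numeric id authoritative."""
    hint = EVENT_NAME_HINTS.get(event_id)
    if hint is None:
        return f"event_{event_id:02x}"
    return f"event_{event_id:02x}:{hint}"
```

Keeping the numeric id in every label means a wrong or weak name hint never hides which slot a handler actually came from.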

### Intrinsic tables

ScummVM provides two distinct kinds of intrinsic evidence:

- `convert/crusader/convert_usecode_crusader.h` and `convert_usecode_regret.h` provide ordinal-to-signature/name tables used for readable conversion.
- `usecode/remorse_intrinsics.h` and `usecode/regret_intrinsics.h` provide the live runtime dispatch tables.

The safe reading is:

- Remorse and Regret share the Crusader event-name table.
- Remorse and Regret do not share a single intrinsic numbering/signature map.
- Intrinsic names are strong hints for arity and broad subsystem identity, but they are still not direct rename authority for the DOS binary.

## Safe Reuse Rules

### Safe to import now

- Event names as labels for event ids `0x00..0x1f` in parsers, reports, and note files.
- Intrinsic ordinal names as `name_hint` or `signature_hint` metadata when the ordinal and argument-byte pattern match.
- High-level subsystem labels such as palette fade, camera, movie, audio, item/actor accessors, and weapon fire when they match existing binary evidence.
- Slot numbers from sampled owner-loaded classes even when the event name is still only a hint.
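
The intrinsic rule above (attach a hint only when both the ordinal and the argument-byte count match, and never across games) could be enforced with a per-game lookup. This is an illustrative sketch: the table layout and names are hypothetical, and the single entry is borrowed from the IR example later in this note, with its assignment to Remorse being an assumption.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class IntrinsicHint(NamedTuple):
    name_hint: str
    arg_bytes: int


# Keyed by (game, ordinal) so Remorse and Regret tables cannot be mixed.
# Illustrative entry only; the game assignment is an assumption.
INTRINSIC_HINTS: Dict[Tuple[str, int], IntrinsicHint] = {
    ("remorse", 0x001E): IntrinsicHint("Item::I_fireWeapon", 0x10),
}


def intrinsic_name_hint(game: str, ordinal: int, observed_arg_bytes: int,
                        table: Dict[Tuple[str, int], IntrinsicHint] = INTRINSIC_HINTS,
                        ) -> Optional[str]:
    """Return a name hint only when ordinal and arg-byte count both match."""
    entry = table.get((game, ordinal))
    if entry is None or entry.arg_bytes != observed_arg_bytes:
        return None
    return entry.name_hint
```

A call site that sees ordinal `0x001e` with a different argument-byte count, or in the other game, gets no hint at all and keeps only the raw ordinal.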

### Not safe to claim yet

- Direct raw-function renames based only on ScummVM event or intrinsic names.
- Remorse intrinsic numbering from Regret tables, or vice versa.
- Specific descriptor-family to slot-mask mappings that are not yet proven on the binary side.
- Meanings for the unknown leading word in the 6-byte Crusader event table entries.
- That the ScummVM `get_class_event_count()` formula applies unchanged to the sampled owner-loaded EUSECODE records.

## IR Requirements For Round-Tripping

The first script IR should preserve exact recompilation inputs before it tries to look pretty.

### Unit of decompilation

The IR should be organized as:

1. USECODE archive
2. class
3. event slot
4. instruction stream

That matches the externally anchored class/event layout and avoids baking in any still-unproven descriptor-to-runtime assumptions.

### Required top-level records

Each class record should preserve:

- `class_id`
- `class_object_index` (`classid + 2`)
- `name_slot_offset` (`4 + 13 * classid` within object `1`)
- `class_name`
- `raw_header_prefix`
- `raw_code_base_u32`
- `code_base_minus_one`
- `event_count`
- `raw_event_table_bytes`

Each event record should preserve:

- `event_id`
- `event_name_hint`
- `raw_event_entry_word`
- `code_offset`
- `raw_body_bytes`
- `decoded_ops`
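
One way to hold these records in tooling is a pair of dataclasses. This is a sketch, not the layout used by any existing tool: field names follow the lists above, while computing the derived values as properties (instead of storing them) is a design assumption made so they cannot drift from the raw bytes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EventRecord:
    event_id: int
    raw_event_entry_word: int          # meaning still unknown; kept verbatim
    code_offset: int
    raw_body_bytes: bytes = b""
    event_name_hint: Optional[str] = None
    decoded_ops: List[dict] = field(default_factory=list)


@dataclass
class ClassRecord:
    class_id: int
    class_name: str
    raw_header_prefix: bytes
    raw_code_base_u32: int             # authoritative dword at bytes 8..11
    raw_event_table_bytes: bytes
    events: List[EventRecord] = field(default_factory=list)

    @property
    def class_object_index(self) -> int:
        return self.class_id + 2

    @property
    def name_slot_offset(self) -> int:
        return 4 + 13 * self.class_id

    @property
    def code_base_minus_one(self) -> int:
        return self.raw_code_base_u32 - 1

    @property
    def event_count(self) -> int:
        return len(self.raw_event_table_bytes) // 6
```

A serializer can still write the derived values out next to the raw ones for readability, as long as the loader treats the raw fields as the source of truth.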

## IR v0 Shape

The IR should separate authoritative fields from friendly hints.

```yaml
class:
  class_id: 0x00be
  class_name: EVENT
  class_object_index: 0x00c0
  raw_code_base_u32: 0x0138
  code_base_minus_one: 0x0137
  raw_header_prefix: <bytes>
  events:
    - event_id: 0x04
      event_name_hint: cachein
      raw_event_entry_word: 0x????
      code_offset: 0x00001234
      ops:
        - op: intrinsic_call
          intrinsic_ordinal: 0x001e
          name_hint: Item::I_fireWeapon
          signature_hint: Item::I_fireWeapon(Item *, x, y, z, byte, int, byte)
          arg_bytes: 0x10
        - op: vm_chain_mutation
          vm_ir: APPEND_UNIQUE_INDIRECT
          opcode_hint: 0x19
        - op: unknown_raw
          bytes: <exact original bytes>
```

### Why this shape

- `event_name_hint` is useful for humans but does not replace the event id.
- `name_hint` and `signature_hint` are useful for intrinsics but do not replace the ordinal.
- `unknown_raw` gives a lossless fallback for still-unmapped opcodes or operand forms.
- `raw_event_entry_word` keeps the compiler from losing bytes whose meaning is not yet settled.
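
The `unknown_raw` fallback can be illustrated with a toy decoder. Everything about the opcode here is hypothetical: `0xAA` and its 3-byte form are placeholders and not a recovered Crusader opcode. Because the length of an unmapped instruction is unknown, the sketch stops structured decoding at the first unmapped opcode and keeps the rest of the stream as one opaque op, which is a conservative design assumption.

```python
from typing import Callable, Dict, List, Tuple

Decoder = Callable[[bytes, int], Tuple[dict, int]]


def decode_ops(stream: bytes, decoders: Dict[int, Decoder]) -> List[dict]:
    """Decode known ops; keep everything from the first unknown opcode raw."""
    ops: List[dict] = []
    pos = 0
    while pos < len(stream):
        decoder = decoders.get(stream[pos])
        if decoder is None:
            # Unknown instruction length: resyncing would be a guess.
            ops.append({"op": "unknown_raw", "bytes": stream[pos:]})
            break
        op, length = decoder(stream, pos)
        op["bytes"] = stream[pos:pos + length]  # original bytes kept per op
        ops.append(op)
        pos += length
    return ops


def encode_ops(ops: List[dict]) -> bytes:
    """Re-emit unedited ops from their preserved original bytes."""
    return b"".join(op["bytes"] for op in ops)


def toy_word_literal(stream: bytes, pos: int) -> Tuple[dict, int]:
    """Hypothetical 3-byte op: opcode byte + little-endian u16."""
    value = int.from_bytes(stream[pos + 1:pos + 3], "little")
    return {"op": "toy_word_literal", "value": value}, 3


stream = bytes([0xAA, 0x34, 0x12, 0xFF, 0xAA, 0x00])
ops = decode_ops(stream, {0xAA: toy_word_literal})
assert encode_ops(ops) == stream
```

This only covers the unedited round trip. Re-encoding an edited op would need a real per-op encoder, which depends on opcode forms that are not yet recovered.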

## Operation Families Worth Lifting First

The current binary-side evidence supports lifting a small reversible operator set first:

- `intrinsic_call`
- `class_event_call`
- `append_unique_inline`
- `append_unique_indirect`
- `remove_matching_inline`
- `remove_matching_indirect`
- `materialize_or_forward_value`
- `prepend_inline_payload`
- `build_entity_link_matrix`
- `emit_or_pushback_result`
- `push_frame_word_literal`
- `compare_stream_dword_and_push_bool`
- `unknown_raw`

This is enough to represent the verified `000d:0988`, `000d:177c`, `000d:1acb`, `000d:208b`, `000d:21ed`, and `000d:22bc` families without pretending the whole VM is solved.

## Metadata That Must Survive Recompilation

The compiler side will need more than pretty script text. At minimum it must preserve:

- Original class ordering and sparse class ids
- Original class-name table slotting
- Raw class header bytes not yet semantically decoded
- Raw bytes `8..11` even when a derived `code_base_minus_one` is also stored
- Raw 6-byte event records, including the unknown leading word
- Exact event order within each class
- Exact code offsets or enough relocation data to rebuild them deterministically
- Intrinsic ordinals and argument-byte counts
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Any unknown opcode byte sequences verbatim

If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.

## Practical Naming Policy

For near-term local RE and tooling:

- Use ScummVM event names as annotation labels for event slots.
- Store intrinsic names as hints attached to ordinals.
- Keep binary-facing renames driven by raw evidence, not by ScummVM alone.
- Treat `EVENT`, `_BOOT`, and `NPCTRIG` as the strongest current active-event families.
- Treat `JELYHACK` and `JELYH2` as referent-anchor classes, not standalone event records.
- Treat `SURCAMNS` and `SURCAMEW` as callback/eventTrigger holders, not proven active-event cores.

## Conservative Parser Rule To Adopt Now

For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is:

- Preserve the raw four-byte header field at bytes `8..11` as authoritative.
- Derive `code_base_minus_one = raw_u32_at_8_11 - 1` for code-addressing only.
- Derive `event_count = (raw_u32_at_8_11 - 20) / 6` only when that value is non-negative, divisible by `6`, and the resulting table end stays within the class object size.
- Treat each event entry as `u16 raw_event_entry_word + u32 raw_code_offset` at `class + 20 + 6 * slot`.
- Treat the event code offset as raw/opaque unless and until the code-addressing interpretation is needed; when needed, interpret it relative to `code_base_minus_one` so that offset `1` lands on the first code byte.
- If the divisibility or bounds checks fail, keep the class opaque and preserve raw bytes rather than forcing a guessed event count.
- `tools/extract_eusecode_flx.py` now implements this rule directly for the current owner-loaded EUSECODE work and emits `class_layout_index.tsv` plus `class_event_index.tsv` so raw header/event rows can be consumed by later IR tooling without re-deriving the arithmetic from prose.
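
The rule above can be sketched as a single guarded parse function. This is not the code in `tools/extract_eusecode_flx.py`, which is not shown here; the function name is hypothetical and little-endian field order is an assumption.

```python
import struct
from typing import List, Optional, Tuple

EVENT_TABLE_START = 20
EVENT_RECORD_SIZE = 6


def parse_event_table(class_body: bytes) -> Optional[List[Tuple[int, int, int]]]:
    """Return [(slot, raw_event_entry_word, raw_code_offset), ...].

    Returns None when the checks fail, so the caller keeps the class
    opaque instead of forcing a guessed event count.
    """
    if len(class_body) < 12:
        return None
    (raw_u32,) = struct.unpack_from("<I", class_body, 8)
    table_bytes = raw_u32 - EVENT_TABLE_START
    if table_bytes < 0 or table_bytes % EVENT_RECORD_SIZE != 0:
        return None
    if raw_u32 > len(class_body):  # table end must stay inside the object
        return None
    entries = []
    for slot in range(table_bytes // EVENT_RECORD_SIZE):
        word, code_offset = struct.unpack_from(
            "<HI", class_body, EVENT_TABLE_START + EVENT_RECORD_SIZE * slot)
        entries.append((slot, word, code_offset))
    return entries
```

The function never interprets the leading word or the code offset. It only slices them out, which keeps the output usable as a raw index for later IR tooling.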

## Remaining Binary-Side Gaps

The main blockers for a real round-trip compiler are still on the binary side:

- The meaning of the first two bytes in each 6-byte Crusader event record is still unverified.
- The exact provenance of ScummVM's current `get_class_event_count()` arithmetic is still unverified; current local evidence says the owner-loaded/raw records fit `raw_u32_at_8_11 = first_code_byte_offset`, while the ScummVM count formula appears to differ in the sign of the `19` term (`+ 19` where the local layout implies `- 19`).
- The upstream writer for selector local `[BP-0x32]` in the `000d:ebe3` sequencer is still unresolved.
- The full control-flow opcode set and branch encoding are not yet recovered.
- The exact on-disk source format behind `entity_vm_runtime_owner_resource_create` is still not identified.
- No direct descriptor-family to slot-mask mapping is proven yet.
- Callback/eventTrigger descriptors still do not have a callback-specific opcode family.

## Best Current Path

The strongest present path to a usable compiler/decompiler is:

1. Parse classes/events following ScummVM's layout model, subject to the event-count exception in step 2.
2. Keep the class/object indexing and event-entry layout from ScummVM, but use the conservative local event-count rule above for owner-loaded/raw class parsing until a main USECODE sample proves otherwise.
3. Decompile only the proven operator families into structured IR.
4. Preserve unknown bytes verbatim in place.
5. Attach ScummVM event and intrinsic names as hints, not as truth.
6. Recompile by rebuilding the original class header and event table layout first, then re-emitting decoded and opaque ops together.

That gets to a reversible editor sooner than waiting for a full semantic VM recovery.
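
Step 6 can be sketched as a rebuild function plus a byte-identity check. This assumes the local reading from this note (a 20-byte header, the event table at `+20`, code starting at the offset stored in bytes `8..11`) and little-endian fields; the function name is hypothetical.

```python
import struct
from typing import List, Tuple


def rebuild_class(raw_header: bytes,
                  events: List[Tuple[int, int]],
                  code: bytes) -> bytes:
    """Rebuild a class body: header, then event table, then code bytes.

    raw_header: the original first 20 bytes, kept verbatim.
    events: (raw_event_entry_word, raw_code_offset) per slot, in order.
    code: re-emitted decoded and opaque op bytes.
    """
    if len(raw_header) != 20:
        raise ValueError("expected the 20 header bytes before the event table")
    table = b"".join(struct.pack("<HI", word, offset)
                     for word, offset in events)
    header = bytearray(raw_header)
    # Under the local reading, bytes 8..11 hold the first code-byte offset.
    struct.pack_into("<I", header, 8, 20 + len(table))
    return bytes(header) + table + code


# Round-trip check on a synthetic two-slot class body.
original_header = bytearray(20)
struct.pack_into("<I", original_header, 8, 32)
original = (bytes(original_header)
            + struct.pack("<HI", 1, 1) + struct.pack("<HI", 0, 0)
            + b"\x5a\x5b")
rebuilt = rebuild_class(original[:20], [(1, 1), (0, 0)], original[32:])
assert rebuilt == original
```

For an unedited class the rebuilt bytes should equal the original exactly. Any difference points at metadata that the IR failed to preserve.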