# USECODE Round-Trip IR Plan

## Purpose

This note records the current evidence-backed path from Crusader USECODE bytes to a human-readable, editable, and recompilable script form.

It is intentionally conservative. ScummVM gives strong external anchors for the container layout, class/event numbering, and intrinsic naming, but it is not a symbol map for the DOS binary and it is not a ready-made round-trip compiler.


## Externally Anchored Pieces

### Container and class layout

ScummVM now gives a concrete second implementation for the Crusader USECODE class layout:

- `usecode/usecode_flex.cpp` treats each class body as archive object `classid + 2`.
- Class names come from archive object `1` at `name_object + 4 + 13 * classid`.
- For Crusader, the class base offset is read from class bytes `8..11` and then decremented by `1`.
- Crusader event count is computed as `(base_offset + 19) / 6`.
- `usecode/usecode.cpp` resolves event `N` from class data at `20 + 6 * N`, with the code offset stored in bytes `+2..+5` of each 6-byte event record.

Combined with the already validated FLEX container notes, the current externally anchored container model is:

- FLEX entry count at `0x54`
- FLEX table at `0x80`
- USECODE class object index = `classid + 2`
- Crusader class header contains a four-byte base-offset field at bytes `8..11`
- Crusader event table entries are 6 bytes each, with a known dword code offset and a still-unknown leading word

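
The container model above can be sketched as a small table reader. This is a minimal sketch, not the project's extractor: it assumes little-endian fields, a 32-bit entry count, and 8-byte table entries laid out as `u32 offset + u32 size`. The 8-byte stride is consistent with the `(table_offset - 0x80) / 8` arithmetic used elsewhere in this note; the offset/size split and the function names are assumptions made for illustration.

```python
import struct

FLEX_COUNT_OFFSET = 0x54   # entry count, per the container notes
FLEX_TABLE_OFFSET = 0x80   # first table entry
FLEX_ENTRY_SIZE = 8        # assumed layout: u32 offset + u32 size


def read_flex_table(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, size) pairs for every FLEX table entry."""
    (count,) = struct.unpack_from("<I", data, FLEX_COUNT_OFFSET)
    table_end = FLEX_TABLE_OFFSET + count * FLEX_ENTRY_SIZE
    if table_end > len(data):
        raise ValueError("FLEX table runs past end of file")
    return [
        struct.unpack_from("<II", data, FLEX_TABLE_OFFSET + i * FLEX_ENTRY_SIZE)
        for i in range(count)
    ]


def class_object_index(class_id: int) -> int:
    """USECODE class bodies live at archive object classid + 2."""
    return class_id + 2
```

The bounds check rejects a file whose declared entry count would push the table past the end of the data, so a bad count fails loudly instead of yielding garbage entries.
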

ScummVM also makes one implementation choice explicit that matters for the current mismatch: `uc_machine.cpp` uses `get_class_base_offset()` as the execution-stream base for Crusader class code, not only as metadata for event counting. That means the `obj[8..11] - 1` value is part of the live code-addressing model in ScummVM, not just a comment-level interpretation.

### Binary-side validation against owner-loaded classes

The first direct local validation pass against sampled owner-loaded EUSECODE class records now splits the ScummVM model into two parts: one part is confirmed, and one part still needs reconciliation.

Confirmed on sampled records (`EVENT`, `NPCTRIG`, `SURCAMNS`, `JELYHACK`, `REE_BOOT`, `SURCAMEW`, `SFXTRIG`):

- The extracted chunk at table offset `0x88` behaves like object `1` for class names.
- For each sampled class body, deriving `object_index = (table_offset - 0x80) / 8`, then `class_id = object_index - 2`, and then reading 13 bytes from object `1` at `4 + 13 * class_id` yields the expected class name.
- The class bodies do have a stable 4-byte header field at bytes `8..11`.
- The region at `class + 20` is a real 6-byte event-slot table with `u16 unknown_word + u32 code_or_payload_field` layout.

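
The class-name derivation in the list above can be written out directly. This is a sketch under stated assumptions: names are treated as NUL-padded ASCII inside each 13-byte slot, which the note does not spell out, and both helper names are hypothetical.

```python
NAME_RECORD_SIZE = 13  # bytes per class-name slot in object 1
NAME_TABLE_START = 4   # names begin 4 bytes into object 1


def class_id_from_table_offset(table_offset: int) -> int:
    """Map a FLEX table-entry offset to a class id (object_index - 2)."""
    delta = table_offset - 0x80
    if delta < 0 or delta % 8:
        raise ValueError("not an 8-byte-aligned FLEX table offset")
    return delta // 8 - 2


def class_name(name_object: bytes, class_id: int) -> str:
    """Read the 13-byte name slot for class_id from archive object 1."""
    if class_id < 0:
        raise ValueError("class_id must be non-negative")
    start = NAME_TABLE_START + NAME_RECORD_SIZE * class_id
    raw = name_object[start:start + NAME_RECORD_SIZE]
    if len(raw) < NAME_RECORD_SIZE:
        raise ValueError("name slot lies outside object 1")
    # Assumption: names are NUL-padded ASCII within the slot.
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")
```

Note that table offset `0x88` maps to object index `1`, the name object itself, so only offsets from `0x90` upward produce a non-negative class id.
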

Broader family spot-checks now keep the same local structure on the owner-loaded side. In addition to the first validated set, the nearby `_BOOT` and environmental event families (`AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `VAR_BOOT`, `FLAMEBOX`, `NOSTRIL`, `STEAMBOX`) continue to fit the same `table_offset -> object_index -> class_id` progression with a stable bytes-`8..11` dword and a 6-byte table at `+20`. No contradictory sample has appeared in the local EUSECODE set.

Not yet reconciled with ScummVM's current formula note:

- In the sampled owner-loaded records, the raw dword at bytes `8..11` is `0x00d4`, `0x00da`, or `0x00e6`.
- Treating that dword directly as the first post-event-table offset makes the layout line up cleanly: `(dword_at_8 - 20) / 6` gives 32, 33, or 35 valid slots in the samples.
- Scanning instead with the previously noted ScummVM-style `(base_offset + 19) / 6` interpretation overruns into inline payload and class-name bytes in the same samples.

Current best explanation:

- The mismatch is now best explained as a ScummVM interpretation/detail issue, not as a proven loader-side rewrite.
- The same ScummVM code path that decrements bytes `8..11` by `1` also uses that decremented value as the code-stream base. On the local owner-loaded records, this fits naturally if the raw dword is the first code-byte offset and event-table dword offsets are 1-based relative to `code_base_minus_one`.
- Under that reading, the sampled event-count rule becomes `(code_base_minus_one - 19) / 6`, which is exactly equivalent to `(raw_u32_at_8_11 - 20) / 6` and matches the validated `32/33/35` slot counts.
- The `000d` loader/runtime path (`000d:44df -> 000d:4c99 -> 000d:7000 -> 000d:46ec`) currently shows indexed file loading and slot-table materialization, but no verified per-class header rewrite before the VM consumes owner-backed records.

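
The arithmetic behind the mismatch can be checked on the three sampled header values. The sketch below encodes the ScummVM-style rule exactly as this note records it, not as a quotation of ScummVM source, and assumes integer division in both rules. The function names are hypothetical.

```python
def scummvm_style_count(raw_u32: int) -> int:
    """Rule as recorded in this note: base = raw - 1, then (base + 19) / 6."""
    base_offset = raw_u32 - 1
    return (base_offset + 19) // 6


def local_count(raw_u32: int) -> int:
    """Local reading: the event table ends where the code begins."""
    return (raw_u32 - 20) // 6


for raw in (0x00D4, 0x00DA, 0x00E6):
    code_base_minus_one = raw - 1
    # The two local forms are the same arithmetic.
    assert code_base_minus_one - 19 == raw - 20
    # The local rule only applies when the table length is a whole slot count.
    assert (raw - 20) % 6 == 0
    print(hex(raw), local_count(raw), scummvm_style_count(raw))
```

Run as written, the loop prints `0xd4 32 38`, `0xda 33 39`, and `0xe6 35 41`. The ScummVM-style rule yields six extra slots in each case, which is 36 bytes past the end of the table under the local reading and is consistent with the observed overrun.
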

Current safe conclusion:

- The owner-loaded class records are compatible with `object 1` names, `classid + 2` body lookup, a header field at bytes `8..11`, and 6-byte event records at `+20`.
- The exact meaning of the bytes-`8..11` field is now narrower: on the local owner-loaded records it is best read as the first code-byte offset, with ScummVM's decremented `base_offset` acting as a `code_base_minus_one` anchor for 1-based event code offsets.
- The leading word of each 6-byte event entry remains unresolved.


### VM/runtime model

ScummVM also anchors several VM behaviors that line up with the current raw-binary work:

- `usecode/uc_machine.cpp` uses `ByteSet(0x1000)` for Crusader globals rather than the U8 bitset path.
- Remorse initializes global `0x003c` to avatar number `1`; Regret initializes `0x001e`.
- Opcode `0x11` is class/event dispatch in Crusader: the bytecode operand is an event number that is translated through `get_class_event()` before execution.

That makes the current local reading stronger: the `000d` runtime lane looks like a Crusader-specific object/event VM that should be interpreted against Crusader event ordinals, not against U8 assumptions.


### Event names

`convert/crusader/convert_usecode_crusader.h` gives a named event table for ids `0x00..0x1f`:

- Strongly usable names: `look`, `use`, `anim`, `setActivity`, `cachein`, `hit`, `gotHit`, `hatch`, `schedule`, `release`, `equip`, `unequip`, `combine`, `calledFromAnim`, `enterFastArea`, `leaveFastArea`, `cast`, `justMoved`, `avatarStoleSomething`, `animGetHit`, `unhatch`
- Weak placeholders remain for `0x0d` and `0x16..0x1f` (`func0D`, `func16`..`func1F`)

This is enough to annotate event ordinals safely, but not enough to rename raw binary handlers unless local behavior matches.


### Intrinsic tables

ScummVM provides two distinct kinds of intrinsic evidence:

- `convert/crusader/convert_usecode_crusader.h` and `convert_usecode_regret.h` provide ordinal-to-signature/name tables used for readable conversion.
- `usecode/remorse_intrinsics.h` and `usecode/regret_intrinsics.h` provide the live runtime dispatch tables.

The safe reading is:

- Remorse and Regret share the Crusader event-name table.
- Remorse and Regret do not share a single intrinsic numbering/signature map.
- Intrinsic names are strong hints for arity and broad subsystem identity, but they are still not direct rename authority for the DOS binary.


## Safe Reuse Rules

### Safe to import now

- Event names as labels for event ids `0x00..0x1f` in parsers, reports, and note files.
- Intrinsic ordinal names as `name_hint` or `signature_hint` metadata when the ordinal and argument-byte pattern match.
- High-level subsystem labels such as palette fade, camera, movie, audio, item/actor accessors, and weapon fire when they match existing binary evidence.
- Slot numbers from sampled owner-loaded classes even when the event name is still only a hint.

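
The "ordinal and argument-byte pattern match" condition for intrinsic hints can be enforced mechanically. The sketch below is illustrative only: the table layout, the per-game keys, and the function name are hypothetical, and the single entry reuses the `Item::I_fireWeapon` example values from the IR sketch in this note rather than a verified mapping for either game.

```python
from typing import NamedTuple, Optional


class IntrinsicHint(NamedTuple):
    name_hint: str
    arg_bytes: int


# Hypothetical per-game tables keyed by intrinsic ordinal. Keeping the games
# separate avoids borrowing Remorse numbering for Regret or vice versa.
# The entry below is a placeholder, not a verified mapping.
HINTS = {
    "remorse": {0x001E: IntrinsicHint("Item::I_fireWeapon", 0x10)},
    "regret": {},
}


def lookup_hint(game: str, ordinal: int, observed_arg_bytes: int) -> Optional[str]:
    """Return a name hint only when ordinal and argument-byte count both match."""
    hint = HINTS.get(game, {}).get(ordinal)
    if hint is None or hint.arg_bytes != observed_arg_bytes:
        return None
    return hint.name_hint
```

Returning `None` on any mismatch keeps the ordinal as the only authoritative identifier; a caller never sees a name that the observed bytes do not support.
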

### Not safe to claim yet

- Direct raw-function renames based only on ScummVM event or intrinsic names.
- Remorse intrinsic numbering from Regret tables, or vice versa.
- Specific descriptor-family to slot-mask mappings that are not yet proven on the binary side.
- Meanings for the unknown leading word in the 6-byte Crusader event table entries.
- That the ScummVM `get_class_event_count()` formula applies unchanged to the sampled owner-loaded EUSECODE records.


## IR Requirements For Round-Tripping

The first script IR should preserve exact recompilation inputs before it tries to look pretty.

### Unit of decompilation

The IR should be organized as:

1. USECODE archive
2. class
3. event slot
4. instruction stream

That matches the externally anchored class/event layout and avoids baking in any still-unproven descriptor-to-runtime assumptions.


### Required top-level records

Each class record should preserve:

- `class_id`
- `class_object_index` (`classid + 2`)
- `name_slot_offset` (`4 + 13 * classid` within object `1`)
- `class_name`
- `raw_header_prefix`
- `raw_code_base_u32`
- `code_base_minus_one`
- `event_count`
- `raw_event_table_bytes`

Each event record should preserve:

- `event_id`
- `event_name_hint`
- `raw_event_entry_word`
- `code_offset`
- `raw_body_bytes`
- `decoded_ops`

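
One possible in-memory form for these records is sketched below. The field names follow the lists above; exposing the derived values as read-only properties rather than stored fields is a design choice of this sketch, not something the note prescribes, and the class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventRecord:
    event_id: int
    raw_event_entry_word: int              # authoritative, meaning unresolved
    code_offset: int                       # authoritative u32 from the entry
    raw_body_bytes: bytes = b""
    decoded_ops: list = field(default_factory=list)
    event_name_hint: Optional[str] = None  # hint only, never authoritative


@dataclass
class ClassRecord:
    class_id: int
    class_name: str
    raw_header_prefix: bytes
    raw_code_base_u32: int
    raw_event_table_bytes: bytes
    events: list = field(default_factory=list)

    @property
    def class_object_index(self) -> int:
        return self.class_id + 2

    @property
    def name_slot_offset(self) -> int:
        return 4 + 13 * self.class_id

    @property
    def code_base_minus_one(self) -> int:
        return self.raw_code_base_u32 - 1

    @property
    def event_count(self) -> int:
        return len(self.raw_event_table_bytes) // 6
```

Deriving `class_object_index`, `name_slot_offset`, `code_base_minus_one`, and `event_count` from the raw fields means they cannot drift out of sync with the bytes that will be re-emitted. A serializer can still write them out alongside the raw values for readability.
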

## IR v0 Shape

The IR should separate authoritative fields from friendly hints.

```yaml
class:
  class_id: 0x00be
  class_name: EVENT
  class_object_index: 0x00c0
  raw_code_base_u32: 0x0138
  code_base_minus_one: 0x0137
  raw_header_prefix: <bytes>
  events:
    - event_id: 0x04
      event_name_hint: cachein
      raw_event_entry_word: 0x????
      code_offset: 0x00001234
      ops:
        - op: intrinsic_call
          intrinsic_ordinal: 0x001e
          name_hint: Item::I_fireWeapon
          signature_hint: Item::I_fireWeapon(Item *, x, y, z, byte, int, byte)
          arg_bytes: 0x10
        - op: vm_chain_mutation
          vm_ir: APPEND_UNIQUE_INDIRECT
          opcode_hint: 0x19
        - op: unknown_raw
          bytes: <exact original bytes>
```

### Why this shape

- `event_name_hint` is useful for humans but does not replace the event id.
- `name_hint` and `signature_hint` are useful for intrinsics but do not replace the ordinal.
- `unknown_raw` gives a lossless fallback for still-unmapped opcodes or operand forms.
- `raw_event_entry_word` keeps the compiler from losing bytes whose meaning is not yet settled.


## Operation Families Worth Lifting First

The current binary-side evidence supports lifting a small reversible operator set first:

- `intrinsic_call`
- `class_event_call`
- `append_unique_inline`
- `append_unique_indirect`
- `remove_matching_inline`
- `remove_matching_indirect`
- `materialize_or_forward_value`
- `prepend_inline_payload`
- `build_entity_link_matrix`
- `emit_or_pushback_result`
- `push_frame_word_literal`
- `compare_stream_dword_and_push_bool`
- `unknown_raw`

This is enough to represent the verified `000d:0988`, `000d:177c`, `000d:1acb`, `000d:208b`, `000d:21ed`, and `000d:22bc` families without pretending the whole VM is solved.


## Metadata That Must Survive Recompilation

The compiler side will need more than pretty script text. At minimum it must preserve:

- Original class ordering and sparse class ids
- Original class-name table slotting
- Raw class header bytes not yet semantically decoded
- Raw bytes `8..11` even when a derived `code_base_minus_one` is also stored
- Raw 6-byte event records, including the unknown leading word
- Exact event order within each class
- Exact code offsets or enough relocation data to rebuild them deterministically
- Intrinsic ordinals and argument-byte counts
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Any unknown opcode byte sequences verbatim

If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.


## Practical Naming Policy

For near-term local RE and tooling:

- Use ScummVM event names as annotation labels for event slots.
- Store intrinsic names as hints attached to ordinals.
- Keep binary-facing renames driven by raw evidence, not by ScummVM alone.
- Treat `EVENT`, `_BOOT`, and `NPCTRIG` as the strongest current active-event families.
- Treat `JELYHACK` and `JELYH2` as referent-anchor classes, not standalone event records.
- Treat `SURCAMNS` and `SURCAMEW` as callback/eventTrigger holders, not proven active-event cores.


## Conservative Parser Rule To Adopt Now

For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is:

- Preserve the raw four-byte header field at bytes `8..11` as authoritative.
- Derive `code_base_minus_one = raw_u32_at_8_11 - 1` for code-addressing only.
- Derive `event_count = (raw_u32_at_8_11 - 20) / 6` only when that value is non-negative, divisible by `6`, and the resulting table end stays within the class object size.
- Treat each event entry as `u16 raw_event_entry_word + u32 raw_code_offset` at `class + 20 + 6 * slot`.
- Treat the event code offset as raw/opaque unless and until the code-addressing interpretation is needed; when needed, interpret it relative to `code_base_minus_one` so that offset `1` lands on the first code byte.
- If the divisibility or bounds checks fail, keep the class opaque and preserve raw bytes rather than forcing a guessed event count.
- `tools/extract_eusecode_flx.py` now implements this rule directly for the current owner-loaded EUSECODE work and emits `class_layout_index.tsv` plus `class_event_index.tsv` so raw header/event rows can be consumed by later IR tooling without re-deriving the arithmetic from prose.

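
The rule above can be sketched as a single parsing function. This is an independent illustration, not the code in `tools/extract_eusecode_flx.py`; it assumes little-endian fields, and the function names and dictionary keys are hypothetical.

```python
import struct
from typing import Optional

HEADER_FIELD_OFFSET = 8   # raw u32 at bytes 8..11
EVENT_TABLE_OFFSET = 20   # first 6-byte event entry
EVENT_ENTRY_SIZE = 6      # u16 raw word + u32 raw code offset


def parse_class_events(obj: bytes) -> Optional[dict]:
    """Apply the conservative rule; return None to keep the class opaque."""
    if len(obj) < EVENT_TABLE_OFFSET:
        return None
    (raw_u32,) = struct.unpack_from("<I", obj, HEADER_FIELD_OFFSET)
    table_len = raw_u32 - EVENT_TABLE_OFFSET
    # Require: non-negative, divisible by 6, table end inside the object.
    if table_len < 0 or table_len % EVENT_ENTRY_SIZE or raw_u32 > len(obj):
        return None
    events = []
    for slot in range(table_len // EVENT_ENTRY_SIZE):
        word, code_offset = struct.unpack_from(
            "<HI", obj, EVENT_TABLE_OFFSET + EVENT_ENTRY_SIZE * slot
        )
        events.append({
            "slot": slot,
            "raw_event_entry_word": word,
            "raw_code_offset": code_offset,
        })
    return {
        "raw_u32_at_8_11": raw_u32,
        "code_base_minus_one": raw_u32 - 1,
        "event_count": len(events),
        "events": events,
    }


def code_position(parsed: dict, raw_code_offset: int) -> int:
    """Byte position in the class object for a 1-based event code offset."""
    return parsed["code_base_minus_one"] + raw_code_offset
```

With a header value of `0x00d4`, `parse_class_events` reports 32 slots and `code_position(parsed, 1)` returns `0xd4`, the first code byte. A header value that fails any check yields `None`, so the caller keeps the raw bytes instead of a guessed layout.
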

## Remaining Binary-Side Gaps

The main blockers for a real round-trip compiler are still on the binary side:

- The meaning of the first two bytes in each 6-byte Crusader event record is still unverified.
- The exact provenance of ScummVM's current `get_class_event_count()` arithmetic is still unverified; current local evidence says the owner-loaded/raw records fit `raw_u32_at_8_11 = first_code_byte_offset`, while the ScummVM count formula appears sign-shifted relative to that layout.
- The upstream writer for selector local `[BP-0x32]` in the `000d:ebe3` sequencer is still unresolved.
- The full control-flow opcode set and branch encoding are not yet recovered.
- The exact on-disk source format behind `entity_vm_runtime_owner_resource_create` is still not identified.
- No direct descriptor-family to slot-mask mapping is proven yet.
- Callback/eventTrigger descriptors still do not have a callback-specific opcode family.


## Best Current Path

The strongest present path to a usable compiler/decompiler is:

1. Parse the container, class-name table, and class/event records using ScummVM's layout as the starting model.
2. Keep the class/object indexing and event-entry layout from ScummVM, but use the conservative local event-count rule above for owner-loaded/raw class parsing until a main USECODE sample proves otherwise.
3. Decompile only the proven operator families into structured IR.
4. Preserve unknown bytes verbatim in place.
5. Attach ScummVM event and intrinsic names as hints, not as truth.
6. Recompile by rebuilding the original class header and event table layout first, then re-emitting decoded and opaque ops together.

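
The rebuild order in the last step can be exercised with a byte-identity check long before any opcode is decoded. The sketch below is a minimal illustration, assuming little-endian fields and treating everything after the event table as one opaque tail; the function names and dictionary keys are hypothetical.

```python
import struct


def split_class(obj: bytes) -> dict:
    """Split a class object into header, event entries, and an opaque tail."""
    if len(obj) < 20:
        return {"opaque": obj}
    (raw_u32,) = struct.unpack_from("<I", obj, 8)
    if raw_u32 < 20 or (raw_u32 - 20) % 6 or raw_u32 > len(obj):
        # Checks failed: keep the whole object verbatim.
        return {"opaque": obj}
    entries = [
        struct.unpack_from("<HI", obj, off) for off in range(20, raw_u32, 6)
    ]
    return {"header": obj[:20], "entries": entries, "tail": obj[raw_u32:]}


def rebuild_class(parts: dict) -> bytes:
    """Re-emit header and event table first, then the untouched tail."""
    if "opaque" in parts:
        return parts["opaque"]
    table = b"".join(
        struct.pack("<HI", word, code_offset)
        for word, code_offset in parts["entries"]
    )
    return parts["header"] + table + parts["tail"]
```

A useful invariant for any later tooling is `rebuild_class(split_class(obj)) == obj` for every class object in the archive. Decoded ops can then replace slices of the tail incrementally while the same check guards against byte loss.
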

That gets to a reversible editor sooner than waiting for a full semantic VM recovery.