Add detailed class event processing and family comparison tools

- Enhance `extract_eusecode_flx.py` to derive class event rows with additional metadata including derived body windows and repeated template statuses.
- Introduce `usecode_family_compare.py` for comparing event families, analyzing commonalities in event bodies, and generating reports on identical groups and differences.
- Implement new data structures for managing class event rows and family artifact specifications.
- Update output formats to include derived body information and repeated family regression checks.
- Ensure robust validation of repeated family expectations against actual extracted data.
MaddoScientisto 2026-03-22 23:24:46 +01:00
commit 4d3c8cd81b
23 changed files with 15033 additions and 14221 deletions

@@ -234,6 +234,315 @@ For near-term local RE and tooling:
- Treat `JELYHACK` and `JELYH2` as referent-anchor classes, not standalone event records.
- Treat `SURCAMNS` and `SURCAMEW` as callback/eventTrigger holders, not proven active-event cores.
## Repeated Slot Patterns Safe To Reuse Now
The latest pass over `class_layout_index.tsv` and `class_event_index.tsv` adds a small set of repeatable slot patterns that are safe enough to carry into decompiler output.
What is authoritative here:
- whether a class has a non-zero slot entry at a given slot id
- the raw `u16` event word for that slot
- the raw `u32` code offset for that slot
- repeated slot-set structure across several classes
What is still hint-level only:
- the ScummVM event-name labels for slots `0x00..0x1f`
- any mapping from one repeated slot directly to one recovered `000d` opcode family
- any claim that one repeated slot family is already tied to one exact gameplay subsystem in the DOS binary
Current small safe candidate sets:
| Family | Classes | Non-zero slots | Safe implication |
|---|---|---|---|
| referent-anchor twin | `JELYHACK`, `JELYH2` | `0x01` only | these are structurally anchor-only classes, not active event hubs |
| boot-event-core | `AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `REE_BOOT`, `VAR_BOOT` | `0x0A`, `0x0F`, `0x10` | one reusable three-slot active-event core template |
| callback-eventtrigger | `SURCAMNS`, `SURCAMEW` | `0x01`, `0x0A`, `0x20`, `0x21`, `0x22` | one shared callback-oriented multi-slot template |
| environmental-event | `FLAMEBOX`, `NOSTRIL`, `STEAMBOX` | `0x0A`, `0x20`, `0x21` | one shared hazard/event template with two extra high slots |
| broad active-event lane | `EVENT`, `SFXTRIG`, and several non-island classes | `0x0A` only | slot `0x0A` is widespread enough to treat as a real repeated event slot, but too broad to over-specialize |
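The family table above can be reproduced mechanically by grouping classes on their exact non-zero slot set. The sketch below is illustrative only: the slot sets are copied from the table, and `NONZERO_SLOTS` and `group_by_slot_set` are hypothetical names, not part of the extractor.

```python
from collections import defaultdict

# Non-zero slot ids per class, copied from the table above.
NONZERO_SLOTS = {
    "JELYHACK": {0x01},
    "JELYH2": {0x01},
    "AND_BOOT": {0x0A, 0x0F, 0x10},
    "BRO_BOOT": {0x0A, 0x0F, 0x10},
    "COR_BOOT": {0x0A, 0x0F, 0x10},
    "REE_BOOT": {0x0A, 0x0F, 0x10},
    "VAR_BOOT": {0x0A, 0x0F, 0x10},
    "SURCAMNS": {0x01, 0x0A, 0x20, 0x21, 0x22},
    "SURCAMEW": {0x01, 0x0A, 0x20, 0x21, 0x22},
    "FLAMEBOX": {0x0A, 0x20, 0x21},
    "NOSTRIL": {0x0A, 0x20, 0x21},
    "STEAMBOX": {0x0A, 0x20, 0x21},
    "EVENT": {0x0A},
    "SFXTRIG": {0x0A},
}


def group_by_slot_set(slots_by_class):
    """Map each distinct non-zero slot set to the sorted classes that carry it."""
    groups = defaultdict(list)
    for class_name, slots in slots_by_class.items():
        groups[frozenset(slots)].append(class_name)
    return {key: sorted(names) for key, names in groups.items()}


families = group_by_slot_set(NONZERO_SLOTS)
for slot_set, names in sorted(families.items(), key=lambda kv: sorted(kv[0])):
    print(", ".join(f"0x{s:02X}" for s in sorted(slot_set)), "->", names)
```

Grouping on the slot set alone deliberately ignores the row words and offsets, which matches the "shared layout, not identical behavior" reading used in this section.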
Concrete repeated evidence worth preserving in IR:
- `JELYHACK` and `JELYH2` both carry only slot `0x01` with the exact same row: `raw_event_entry_word = 0x002A`, `raw_code_offset = 0x00000001`.
- The five `_BOOT` cores all share slot `0x10` with the exact same `raw_event_entry_word = 0x003B`, while the `raw_code_offset` varies by class (`0x0000045c` on `COR_BOOT`, `0x0000048b` on `AND_BOOT`, `0x00000522` on `BRO_BOOT`, `0x000004df` on `VAR_BOOT`, `0x000005a8` on `REE_BOOT`). That is a good example of repeated structure without identical bodies.
- `SURCAMNS` and `SURCAMEW` share the same five-slot layout and the same low/high anchor rows (`0x0A = 0x00D1/0x00000001`, `0x22 = 0x01A3/...`), but differ in the middle high-slot offsets. That looks like one shared callback template with instance-specific bodies, not two unrelated classes.
- `FLAMEBOX`, `NOSTRIL`, and `STEAMBOX` all share one `0x0A` low slot plus two extra high slots `0x20` and `0x21`. Their exact words differ, so the safe reading is shared layout, not identical compiled behavior.
- `EVENT` and `SFXTRIG` both participate in the wide `0x0A` lane, but that family is broad enough that the slot number is more trustworthy than the ScummVM name hint.
## Byte-Level Body Comparison Rules And Results
The next step after repeated row mining is to derive the chunk-local body window for each non-zero slot and compare the actual bytes instead of only the 6-byte event-table row.
Current conservative body-window rule:
- `body_start = code_base_minus_one + raw_code_offset`
- `body_end = code_base_minus_one + next_non_zero_raw_code_offset` in the same class, or chunk EOF when there is no later non-zero slot
- this keeps the representation reversible because it is computed only from preserved header and event-table fields plus the raw chunk bytes
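A minimal sketch of the body-window rule, not the extractor's actual code. It assumes that "next" means the next larger non-zero `raw_code_offset` in the same class, and that the chunk length is known; the function name and the `chunk_length` parameter are hypothetical. The `REE_BOOT` inputs come from the rows quoted in this document, with the chunk EOF taken to be the end of the last slot body (`0x06b6`).

```python
def derive_body_windows(code_base_minus_one, raw_offsets_by_slot, chunk_length):
    """Return {slot: (start, end, length)} for every non-zero raw code offset.

    Assumption: the end of a body is the start of the body with the next
    larger non-zero offset in the same class, or chunk EOF for the last one.
    """
    live = sorted(
        (offset, slot) for slot, offset in raw_offsets_by_slot.items() if offset != 0
    )
    windows = {}
    for i, (offset, slot) in enumerate(live):
        start = code_base_minus_one + offset
        if i + 1 < len(live):
            end = code_base_minus_one + live[i + 1][0]
        else:
            end = chunk_length  # no later non-zero slot: run to chunk EOF
        windows[slot] = (start, end, end - start)
    return windows


# REE_BOOT rows as quoted in this document; slots with offset 0 are skipped.
ree_boot = derive_body_windows(
    code_base_minus_one=0x00D3,
    raw_offsets_by_slot={0x00: 0, 0x0A: 0x001, 0x0F: 0x34C, 0x10: 0x5A8},
    chunk_length=0x06B6,
)
for slot, (start, end, length) in sorted(ree_boot.items()):
    print(f"slot 0x{slot:02X}: 0x{start:04x}..0x{end:04x} ({length} bytes)")
```

Because the function only reads header and event-table values plus the chunk length, the windows can always be recomputed from the preserved raw fields.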
This rule is now carried directly by the extractor outputs instead of living only in notes:
- `USECODE/EUSECODE_extracted/class_event_index.tsv` now emits `derived_body_start`, `derived_body_end`, `derived_body_length`, and conservative `repeated_template_status` columns per slot row.
- `USECODE/EUSECODE_extracted/boot_family_decompile.md` / `.tsv`, `callback_family_decompile.md` / `.tsv`, and `environmental_family_decompile.md` / `.tsv` now provide concrete generated per-class decompile artifacts for the `_BOOT`, `SURCAM*`, and environmental repeated-family lanes, each grounded in emitted output rather than prose-only examples.
- `USECODE/EUSECODE_extracted/repeated_family_regressions.tsv` now records and enforces the current repeated-family slot sets, plus the verified raw-row and derived body-window fields, for `JELYHACK/JELYH2`, `_BOOT`, `SURCAMNS/SURCAMEW`, and `FLAMEBOX/NOSTRIL/STEAMBOX`. Extractor changes fail fast if those verified baselines drift.
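The fail-fast idea can be sketched as a comparison between a small expected baseline and the emitted TSV rows. This is an illustration under assumptions, not the extractor's check: the `class_name` and `slot` column names, the string formatting of values, and the `find_drift` helper are all hypothetical; only `raw_event_entry_word` and `derived_body_length` are names taken from this document.

```python
import csv
import io

# Hypothetical baseline: values taken from rows quoted in this document.
EXPECTED = {
    ("REE_BOOT", "0x10"): {"raw_event_entry_word": "0x003b", "derived_body_length": "59"},
    ("JELYHACK", "0x01"): {"raw_event_entry_word": "0x002a", "derived_body_length": "42"},
}

# Stand-in for the emitted TSV; the real file layout may differ.
SAMPLE_TSV = (
    "class_name\tslot\traw_event_entry_word\tderived_body_length\n"
    "REE_BOOT\t0x10\t0x003b\t59\n"
    "JELYHACK\t0x01\t0x002a\t42\n"
)


def find_drift(tsv_text, expected):
    """Return human-readable mismatches between expected fields and TSV rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = {(row["class_name"], row["slot"]): row for row in reader}
    problems = []
    for key, fields in expected.items():
        row = rows.get(key)
        if row is None:
            problems.append(f"{key}: row missing")
            continue
        for name, want in fields.items():
            if row.get(name) != want:
                problems.append(f"{key}: {name} is {row.get(name)!r}, expected {want!r}")
    return problems


clean = find_drift(SAMPLE_TSV, EXPECTED)
drifted = find_drift(SAMPLE_TSV.replace("\t59\n", "\t60\n"), EXPECTED)
print(clean)
print(drifted)
```

An extractor run would raise or exit non-zero whenever the returned list is non-empty.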
What this confirms on the current repeated families:
- `JELYHACK` and `JELYH2` slot `0x01` are exact row twins but not exact body twins. Both bodies are `42` bytes long, both start at `0x00d4`, both keep `raw_event_entry_word = 0x002A`, and both share a `10`-byte prefix plus a `17`-byte suffix. The first differences are at body offsets `10,11,12,24`, which is consistent with one reused mini-template carrying class-local literals rather than one identical compiled body.
- `_BOOT` slot `0x10` is the cleanest repeated-body example. All five classes have a `59`-byte body, all share the same row word `0x003B`, all share the same first `5` bytes and the same last `17` bytes, and none are byte-identical across the family. This is strong evidence for one shared short-template tail with class-local identifiers or immediates in the middle.
- `_BOOT` slots `0x0A` and `0x0F` show the same pattern at larger sizes. Slot `0x0A` bodies range from `551` to `843` bytes and share only a `3`-byte prefix but a `39`-byte suffix; slot `0x0F` bodies range from `564` to `604` bytes and share a `3`-byte prefix plus a `38`-byte suffix. These are repeated family bodies, but not clones.
- `SURCAMNS` and `SURCAMEW` high slots `0x20` and `0x22` also behave like near-templates, not clones. Slot `0x20` is `698` bytes in both classes with an `11`-byte common prefix and an `84`-byte common suffix. Slot `0x22` is `419` bytes in both classes with an `11`-byte common prefix and a `53`-byte common suffix.
- `SURCAM` slot `0x21` is the strongest within-family divergence in this batch. `SURCAMNS` uses row word `0x0709` and a body length of `1801`, while `SURCAMEW` uses row word `0x0655` and a body length of `1621`. They still share a `20`-byte suffix, so this is best read as one callback-family slot with materially different instance bodies rather than a parsing mistake.
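The prefix, suffix, and first-difference figures above are simple byte comparisons. The sketch below shows one way to compute them; the helper names are hypothetical and the two bodies are synthetic placeholders, not bytes from the game data.

```python
def common_prefix_len(bodies):
    """Number of leading bytes shared by every body in the list."""
    count = 0
    for column in zip(*bodies):  # stops at the shortest body
        if len(set(column)) != 1:
            break
        count += 1
    return count


def common_suffix_len(bodies):
    """Number of trailing bytes shared by every body in the list."""
    return common_prefix_len([body[::-1] for body in bodies])


def first_diff_offsets(a, b, limit=4):
    """First few body offsets at which two equal-length bodies differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y][:limit]


# Synthetic bodies: same 3-byte prefix and 2-byte suffix, different middle.
body_a = bytes([0x01, 0x02, 0x03, 0x41, 0x41, 0x09, 0x0A])
body_b = bytes([0x01, 0x02, 0x03, 0x42, 0x43, 0x09, 0x0A])

print(common_prefix_len([body_a, body_b]))
print(common_suffix_len([body_a, body_b]))
print(first_diff_offsets(body_a, body_b))
```

For identical bodies both lengths equal the full body length, so a byte-identity check should run before the prefix and suffix are reported as separate regions.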
The practical IR consequence is important: repeated-family status should be recorded separately from byte-identity status. A human-readable decompile should be able to say “same family slot template” without falsely implying “same body bytes.”
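One way to keep the two statuses apart is to store them as separate fields and compute byte identity only from digests. The record type, the function, and the identity labels below are hypothetical illustrations; the bodies are synthetic.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotBodyStatus:
    repeated_template_status: str  # family role only
    body_identity_status: str      # byte-level verdict only
    body_sha1: str


def describe_slot_body(family_label, body, sibling_bodies):
    """Record family membership and byte identity as independent facts."""
    digest = hashlib.sha1(body).hexdigest()
    if not sibling_bodies:
        identity = "no-siblings-compared"
    elif all(hashlib.sha1(other).hexdigest() == digest for other in sibling_bodies):
        identity = "identical"
    else:
        identity = "non-identical"
    return SlotBodyStatus(family_label, identity, digest)


# Synthetic bodies standing in for two classes of one family slot.
first = bytes([0x10, 0x20, 0x30])
second = bytes([0x10, 0x21, 0x30])

variant = describe_slot_body("boot-event-core/shared-slot-0x10", first, [second])
clone = describe_slot_body("boot-event-core/shared-slot-0x10", first, [first])
print(variant.repeated_template_status, variant.body_identity_status)
print(clone.repeated_template_status, clone.body_identity_status)
```

Both records carry the same family label, and only the identity field changes, which is the separation the paragraph above asks for.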
## What A Decompiled Script Looks Like Today
The most honest present-day decompilation is not a polished source language. It is a reversible descriptor-plus-event-table rendering with optional VM-op vocabulary attached where the `000d` lane is already verified.
### Level 0: Raw event row plus derived body window
This is the minimal human-usable row form. It preserves the original six-byte event entry, explains how the body window is derived, and records whether the slot looks like an exact twin, a near-template, or a unique body.
```yaml
class_name: REE_BOOT
slot: 0x10
event_name_hint_scummvm: leaveFastArea
raw_event_entry_word: 0x003b
raw_code_offset: 0x000005a8
code_base_minus_one: 0x00d3
derived_body_start: 0x067b
derived_body_end: 0x06b6
derived_body_length: 59
repeated_template_status: boot-event-core/shared-slot-0x10
body_identity_status: non-identical; shared 5-byte prefix and 17-byte suffix across all five _BOOT bodies
body_sha1: 577c61e9c4c6...
```
Field meaning, using only what is currently verified:
- `class_name`: authoritative class label from object `1` in the owner-loaded class table
- `slot`: authoritative numeric slot id from the event table; this is safer than any guessed semantic name
- `event_name_hint_scummvm`: external label for slots `0x00..0x1f`; useful for orientation, not yet verified as the local class-specific meaning
- `raw_event_entry_word`: the unresolved leading `u16` from the 6-byte event record; authoritative bytes, unresolved semantics
- `raw_code_offset`: the authoritative row `u32`; currently best read as a 1-based offset relative to `code_base_minus_one`
- `code_base_minus_one`: derived from bytes `8..11` in the class header using the current conservative rule, that is, `raw_code_base_u32 - 1`
- `derived_body_start` and `derived_body_end`: computed chunk-local byte window for the slot body; useful for diffing and future recompilation, and now emitted directly in the extractor outputs
- `repeated_template_status`: whether the row participates in a repeated family pattern such as `JELY` anchor twin, `_BOOT` event core, or `SURCAM` callback template
- `body_identity_status`: whether the extracted body bytes are exact twins, near-templates, or materially different within that family
- `body_sha1`: stable digest for exact identity checks without pretending the digest itself has semantic meaning
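The raw fields in the list above can be read with `struct`. This is a sketch under stated assumptions: little-endian byte order (typical for a DOS target, but not confirmed in this section), the `u16` word preceding the `u32` offset within the six-byte row, and `raw_code_base_u32` occupying header bytes `8..11`. The function names are hypothetical, and the sample bytes are built from values quoted in this document rather than read from the file.

```python
import struct

# u16 event word followed by u32 code offset; "<" means little-endian, no padding.
EVENT_ROW = struct.Struct("<HI")


def parse_event_row(row):
    """Split one 6-byte event record into (raw_event_entry_word, raw_code_offset)."""
    return EVENT_ROW.unpack(row)


def code_base_minus_one(header):
    """Read the u32 at header bytes 8..11 and subtract one."""
    (raw_code_base,) = struct.unpack_from("<I", header, 8)
    return raw_code_base - 1


# REE_BOOT slot 0x10 as it would look under the assumptions above.
row = bytes.fromhex("3b00a8050000")
word, offset = parse_event_row(row)
print(hex(word), hex(offset))

# Synthetic header: 8 unknown bytes, then raw_code_base_u32 = 0x00d4.
header = bytes(8) + struct.pack("<I", 0x00D4)
print(hex(code_base_minus_one(header)))

# Packing the parsed values reproduces the original bytes exactly.
print(EVENT_ROW.pack(word, offset) == row)
```

Keeping the unresolved word as a plain integer, with no interpretation applied, is what lets the row survive a round trip intact.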
### Level 1: Lossless event-table IR
This is the form that is closest to a future round-trip compiler.
```yaml
class:
  entry_index: 0x0115
  class_id: 0x04d3
  class_name: JELYHACK
  class_object_index: 0x04d5
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x002a
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x00fe
      derived_body_length: 42
      repeated_template_status: referent-anchor-twin/shared-slot-0x01
      body_identity_status: near-template-with-JELYH2
      confidence: authoritative-bytes, hinted-label
```
That is already a real decompilation output. It keeps the exact slot id, the exact six-byte row contents, and the exact class-header facts, while refusing to pretend that `use` is already a proven semantic name for this class.
Here is the same style for one active event-bearing attachment class in the same island:
```yaml
class:
  entry_index: 0x011b
  class_id: 0x04db
  class_name: REE_BOOT
  class_object_index: 0x04dd
  raw_code_base_u32: 0x00d4
  code_base_minus_one: 0x00d3
  conservative_event_count: 32
  descriptor_fields:
    - referent
    - event
    - counter
    - item
  events:
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x034b
      raw_code_offset: 0x00000001
      derived_body_start: 0x00d4
      derived_body_end: 0x041f
      derived_body_length: 843
      repeated_template_status: boot-event-core/shared-slot-0x0a
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x0f
      event_name_hint_scummvm: enterFastArea
      raw_event_entry_word: 0x025c
      raw_code_offset: 0x0000034c
      derived_body_start: 0x041f
      derived_body_end: 0x067b
      derived_body_length: 604
      repeated_template_status: boot-event-core/shared-slot-0x0f
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
    - slot: 0x10
      event_name_hint_scummvm: leaveFastArea
      raw_event_entry_word: 0x003b
      raw_code_offset: 0x000005a8
      derived_body_start: 0x067b
      derived_body_end: 0x06b6
      derived_body_length: 59
      repeated_template_status: boot-event-core/shared-slot-0x10
      body_identity_status: same-family-body-not-identical
      confidence: authoritative-bytes, hinted-label
```
And here is one callback-style multi-slot class, which shows why the high slots should stay numeric for now:
```yaml
class:
  entry_index: 0x011c
  class_id: 0x04de
  class_name: SURCAMEW
  class_object_index: 0x04e0
  raw_code_base_u32: 0x00e6
  code_base_minus_one: 0x00e5
  conservative_event_count: 35
  descriptor_fields:
    - referent
    - textFile
    - monit
    - valueBox
    - passcode
    - link
    - code
    - screen
    - cameraEgg
    - trueRef
    - therma
    - eventTrigger
    - foundGun
  events:
    - slot: 0x01
      event_name_hint_scummvm: use
      raw_event_entry_word: 0x00f7
      raw_code_offset: 0x000000d2
    - slot: 0x0a
      event_name_hint_scummvm: equip
      raw_event_entry_word: 0x00d1
      raw_code_offset: 0x00000001
    - slot: 0x20
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x02ba
      raw_code_offset: 0x000001c9
      derived_body_start: 0x02ae
      derived_body_end: 0x0568
      derived_body_length: 698
      repeated_template_status: callback-eventtrigger/shared-slot-0x20
      body_identity_status: same-family-body-not-identical
    - slot: 0x21
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x0655
      raw_code_offset: 0x00000483
      derived_body_start: 0x0568
      derived_body_end: 0x0bbd
      derived_body_length: 1621
      repeated_template_status: callback-eventtrigger/shared-slot-0x21
      body_identity_status: same-family-body-not-identical
    - slot: 0x22
      event_name_hint_scummvm: null
      raw_event_entry_word: 0x01a3
      raw_code_offset: 0x00000ad8
      derived_body_start: 0x0bbd
      derived_body_end: 0x0d60
      derived_body_length: 419
      repeated_template_status: callback-eventtrigger/shared-slot-0x22
      body_identity_status: same-family-body-not-identical
```
The extra derived fields are worth keeping because they answer the immediate human question that the bare event table does not: not only “which slots exist,” but also “how much body belongs to each slot” and “whether this body is a true clone or only a same-family variant.”
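Since the derived fields are pure arithmetic over preserved values, they can be cross-checked cheaply. The sketch below re-verifies the rows quoted in the YAML examples above; `ROWS` and `check_row` are hypothetical names. It also reports one observation from these sampled rows only: the raw event word happens to equal the derived body length in each of them. That is an arithmetic fact about these samples, not a claim about the word's semantics, which this document still treats as unresolved.

```python
# (class, slot, raw_word, raw_code_offset, code_base_minus_one, start, end, length)
ROWS = [
    ("JELYHACK", 0x01, 0x002A, 0x001, 0x00D3, 0x00D4, 0x00FE, 42),
    ("REE_BOOT", 0x0A, 0x034B, 0x001, 0x00D3, 0x00D4, 0x041F, 843),
    ("REE_BOOT", 0x0F, 0x025C, 0x34C, 0x00D3, 0x041F, 0x067B, 604),
    ("REE_BOOT", 0x10, 0x003B, 0x5A8, 0x00D3, 0x067B, 0x06B6, 59),
    ("SURCAMEW", 0x20, 0x02BA, 0x1C9, 0x00E5, 0x02AE, 0x0568, 698),
    ("SURCAMEW", 0x21, 0x0655, 0x483, 0x00E5, 0x0568, 0x0BBD, 1621),
    ("SURCAMEW", 0x22, 0x01A3, 0xAD8, 0x00E5, 0x0BBD, 0x0D60, 419),
]


def check_row(row):
    """Return arithmetic inconsistencies for one row (empty list when clean)."""
    name, slot, _word, offset, base, start, end, length = row
    problems = []
    if start != base + offset:
        problems.append(f"{name} slot 0x{slot:02X}: start != base + offset")
    if length != end - start:
        problems.append(f"{name} slot 0x{slot:02X}: length != end - start")
    return problems


all_problems = [problem for row in ROWS for problem in check_row(row)]
word_equals_length = all(row[2] == row[7] for row in ROWS)

print(all_problems or "derived fields are consistent")
print("raw word equals derived length in every sampled row:", word_equals_length)
```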
### Level 2: Friendly but still reversible hinted form
This is the highest-level script shape that is justified right now.
```text
anchor JELYHACK(referent)
  # authoritative event rows for the anchor itself
  slot 0x01 hint=use? raw_word=0x002A code_off=0x00000001 body=0x00D4..0x00FE family=JELY-anchor identity=near-template-with-JELYH2

# nearby attachment classes from the same local island
attach REE_BOOT(referent,event,counter,item)
  slot 0x0A hint=equip? raw_word=0x034B code_off=0x00000001 body=0x00D4..0x041F family=_BOOT-core identity=shared-template-not-clone
  slot 0x0F hint=enterFastArea? raw_word=0x025C code_off=0x0000034C body=0x041F..0x067B family=_BOOT-core identity=shared-template-not-clone
  slot 0x10 hint=leaveFastArea? raw_word=0x003B code_off=0x000005A8 body=0x067B..0x06B6 family=_BOOT-core identity=shared-template-not-clone

callback SURCAMEW(referent,textFile,monit,valueBox,passcode,link,code,screen,cameraEgg,trueRef,therma,eventTrigger,foundGun)
  slot 0x01 hint=use? raw_word=0x00F7 code_off=0x000000D2 body=0x01B7..0x02AE
  slot 0x0A hint=equip? raw_word=0x00D1 code_off=0x00000001 body=0x00E6..0x02AE
  slot 0x20 raw_word=0x02BA code_off=0x000001C9 body=0x02AE..0x0568 family=SURCAM-callback identity=shared-template-not-clone
  slot 0x21 raw_word=0x0655 code_off=0x00000483 body=0x0568..0x0BBD family=SURCAM-callback identity=shared-template-with-stronger-divergence
  slot 0x22 raw_word=0x01A3 code_off=0x00000AD8 body=0x0BBD..0x0D60 family=SURCAM-callback identity=shared-template-not-clone

attach SFXTRIG(referent,event)
  slot 0x0A hint=equip? raw_word=0x00B8 code_off=0x00000001
```
This is decompiled enough to read, diff, and later recompile because it preserves:
- the original class identity
- the exact non-zero event rows
- the derived chunk-local body window for each row
- which names are authoritative fields versus external hints
- which nearby descriptors appear to be anchors, active event attachments, or callback attachments
- whether a repeated family slot is an exact twin or only a structurally similar body
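A slot line in this hinted form can be generated directly from the Level 1 fields. The formatter below is a hypothetical sketch that reproduces the line layout shown above; it is not the emitter used by the extractor, and the parameter names are illustrative.

```python
def render_slot_line(slot, raw_word, code_off, hint=None, body=None,
                     family=None, identity=None):
    """Render one hinted slot line; optional parts are omitted when unknown."""
    parts = [f"slot 0x{slot:02X}"]
    if hint:
        parts.append(f"hint={hint}?")  # trailing "?" marks an external hint
    parts.append(f"raw_word=0x{raw_word:04X}")
    parts.append(f"code_off=0x{code_off:08X}")
    if body:
        start, end = body
        parts.append(f"body=0x{start:04X}..0x{end:04X}")
    if family:
        parts.append(f"family={family}")
    if identity:
        parts.append(f"identity={identity}")
    return " ".join(parts)


boot_line = render_slot_line(
    0x10, 0x003B, 0x5A8,
    hint="leaveFastArea",
    body=(0x067B, 0x06B6),
    family="_BOOT-core",
    identity="shared-template-not-clone",
)
sfx_line = render_slot_line(0x0A, 0x00B8, 0x1, hint="equip")
print(boot_line)
print(sfx_line)
```

Every value on the line is printed in full-width hexadecimal, so the raw word and offset can be parsed back without loss.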
### Level 2.5: Human annotation layer
The last layer is prose, not syntax. It should explain the honest current reading of each field so a modder can see what is safe to edit and what still needs caution.
- Class name is authoritative at the container level: it comes from the owner-loaded class-name table and is not a guess.
- Slot id is authoritative at the event-table level: this is the safest event identifier currently available.
- Event-name hint is external: it exists only for slots inside `0x00..0x1f`, and it should be used for orientation only until the local behavior has been reverified in the binary.
- Raw event word is authoritative but semantically unresolved: it must survive round-trip intact.
- Raw code offset is authoritative and operational: combined with `code_base_minus_one`, it tells us where the slot body starts in the chunk.
- Body-window length is derived but useful: it tells a human whether a slot is a tiny stub-like record or a large body that deserves its own diff or annotation block.
- Repeated-template status is about family structure, not byte identity: a `_BOOT` slot can be “the same template role” without being byte-equal across classes.
- Body-identity status answers the concrete modding question “am I looking at a clone, a parameterized variant, or a different body that only occupies the same family slot?”
### Level 3: Where the current VM IR can be attached
For classes in the active-event ecosystems (`EVENT`, `_BOOT`, `NPCTRIG`, `SFXTRIG`, and the environmental family), the current `000d` work is strong enough to attach the known operator vocabulary without pretending one exact class-to-opcode decode already exists.
```text
vm_effect_possible:
  APPEND_UNIQUE_INLINE
  APPEND_UNIQUE_INDIRECT
  REMOVE_MATCHING_INDIRECT
  REMOVE_MATCHING_INLINE
  MATERIALIZE_OR_FORWARD_VALUE
  PREPEND_INLINE_PAYLOAD
  BUILD_ENTITY_LINK_MATRIX
  EMIT_OR_PUSHBACK_RESULT
  FINALIZE_MIXED_VALUE_TO_OUTPTR
```
That operator block is authoritative as a recovered VM vocabulary. Attached to one specific descriptor family, it is only an ecosystem-level hint: it lists operators the family could use, not operators it is proven to use.
## Conservative Parser Rule To Adopt Now
For the current owner-loaded EUSECODE and round-trip IR work, the safest reversible rule is: