Add detailed class event processing and family comparison tools

- Enhance `extract_eusecode_flx.py` to derive class event rows with additional metadata including derived body windows and repeated template statuses.
- Introduce `usecode_family_compare.py` for comparing event families, analyzing commonalities in event bodies, and generating reports on identical groups and differences.
- Implement new data structures for managing class event rows and family artifact specifications.
- Update output formats to include derived body information and repeated family regression checks.
- Ensure robust validation of repeated family expectations against actual extracted data.
2026-03-22 23:24:46 +01:00
commit 4d3c8cd81b
parent de42fd1ea1
23 changed files with 15033 additions and 14221 deletions
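
The `docs/raw-000e.md` changes in this commit describe a body-window rule (`body_start = code_base_minus_one + raw_code_offset`, ending at the next non-zero slot offset or chunk EOF) and compare bodies by shared prefix and suffix length. The following is a minimal sketch of that rule, not the code in `extract_eusecode_flx.py` or `usecode_family_compare.py`. The function names are hypothetical, and two things are assumptions: that each 6-byte row is a little-endian 16-bit word followed by a 32-bit code offset, and that "next" means the next larger code offset among non-zero rows.

```python
import struct
from typing import Dict, List, Tuple

ROW_SIZE = 6  # one event row: 2-byte leading word + 4-byte code offset


def parse_event_rows(table: bytes) -> List[Tuple[int, int, int]]:
    """Split a raw event table into (slot, entry_word, code_offset) rows.

    Assumption: rows are little-endian ("<HI"); the diff does not state this.
    """
    rows = []
    for slot in range(len(table) // ROW_SIZE):
        word, offset = struct.unpack_from("<HI", table, slot * ROW_SIZE)
        rows.append((slot, word, offset))
    return rows


def body_windows(
    rows: List[Tuple[int, int, int]],
    code_base_minus_one: int,
    chunk_size: int,
) -> Dict[int, Tuple[int, int]]:
    """Map each slot with a non-zero code offset to its (start, end) window.

    start = code_base_minus_one + raw_code_offset; end is the start of the
    next body by offset order, or chunk EOF for the last one.
    """
    live = sorted((off, slot) for slot, _word, off in rows if off != 0)
    windows = {}
    for i, (off, slot) in enumerate(live):
        start = code_base_minus_one + off
        if i + 1 < len(live):
            end = code_base_minus_one + live[i + 1][0]
        else:
            end = chunk_size
        windows[slot] = (start, end)
    return windows


def shared_prefix_suffix(a: bytes, b: bytes) -> Tuple[int, int]:
    """Return (prefix_len, suffix_len) of bytes common to both bodies.

    The suffix scan stops before it would overlap the shared prefix.
    """
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, suffix
```

Keeping the prefix/suffix comparison separate from any digest check mirrors the distinction the doc draws between "same slot template" and "same compiled body": two bodies can share a long tail and still hash differently.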
--- a/docs/raw-000e.md
+++ b/docs/raw-000e.md
@@ -42,7 +42,8 @@ A small helper cluster in the raw `000e:` area implements a fixed-size CRLF reco
 	- `table_end = 0x6090`, which matches the first non-zero payload offset
 	- `403` non-zero entries in the current file
 - `tools/extract_eusecode_flx.py` now parses the full validated table and emits all `403` non-zero entries under `USECODE/EUSECODE_extracted/`, including `entry_index.tsv`, `descriptor_index.tsv`, `descriptor_neighborhoods.tsv`, `summary.json`, per-chunk `.bin`, and `.strings.txt` sidecars.
-- The extractor now also carries the conservative owner-loaded class rule directly into machine-readable outputs: `class_layout_index.tsv` records `object_index`, `class_id`, the raw bytes-`8..11` field, derived `code_base_minus_one`, and `conservative_event_count`, while `class_event_index.tsv` expands parsed classes into raw 6-byte event rows with slot numbers, ScummVM event-name hints for `0x00..0x1f`, unresolved leading words, and raw code-offset dwords.
+- The extractor now also carries the conservative owner-loaded class rule directly into machine-readable outputs: `class_layout_index.tsv` records `object_index`, `class_id`, the raw bytes-`8..11` field, derived `code_base_minus_one`, and `conservative_event_count`, while `class_event_index.tsv` expands parsed classes into raw 6-byte event rows with slot numbers, ScummVM event-name hints for `0x00..0x1f`, unresolved leading words, raw code-offset dwords, derived body-window columns, and conservative repeated-template status tags for the verified repeated families.
+- The extractor now emits one concrete generated per-class decompile artifact for the cleanest repeated lane too: `boot_family_decompile.md` / `.tsv` render the five `_BOOT` classes slot-by-slot with raw row bytes, derived body windows, repeated-template status, and stable body digests.
 - The generated reports now expose lightweight descriptor summaries (`primary_label`, `field_names`, `field_tags`) so the object lane can be searched by field grammar instead of only by raw names.
 - The extracted data now separates into at least two lanes:
 	- text-heavy records that fit the `000e:` CRLF parser model, such as `DATALINK` mission/objective text and `TEXTFIL1` message banks
@@ -129,6 +130,19 @@ A small helper cluster in the raw `000e:` area implements a fixed-size CRLF reco
 - The environmental hazard lane is now similarly constrained. `environmental_family_compare.tsv` shows that `FLAMEBOX` and `STEAMBOX` are close structural siblings with the same active-event backbone (`referent,event,<hazard>,<hazard2>,direction,count`) and matching `24:0A02 / 24:FC02 / 24:FE02` object-link pattern, while `NOSTRIL` is a smaller fire-specific variant that keeps the active `event` plus dual fire references and count fields but drops the direction/newType side.
 - Their neighborhoods are different enough to matter: `environmental_event_graph.md` shows `FLAMEBOX` embedded among vent/door/bridge/copy records, `NOSTRIL` among flame/pad/desk/blaster/keypad records, and `STEAMBOX` among bounce/hover/fade/steam/flame box records. So this looks like one hazard-event descriptor family reused across distinct local object islands rather than one single environmental mega-cluster.
 - The callback lane is tighter too. `callback_trigger_compare.tsv` confirms that `SURCAMNS` and `SURCAMEW` are effectively the same callback-trigger template: identical field set (`referent,textFile,monit,valueBox,passcode,link,code,screen,cameraEgg,trueRef,therma,eventTrigger,foundGun`) and identical tag grammar except for the `therma` slot offset (`24:F102` vs `24:F602`). That keeps the `eventTrigger` split credible as a true callback/attachment lane rather than only a spelling variation on active `event` carriers.
+- Mining the new `class_layout_index.tsv` / `class_event_index.tsv` outputs now gives a first small safe set of repeated non-zero slot patterns:
+	- `JELYHACK` and `JELYH2` are exact referent-anchor twins at the event-table level too: both have only slot `0x01` non-zero, with the same row `0x002A / 0x00000001`.
+	- The five `_BOOT` event cores (`AND_BOOT`, `BRO_BOOT`, `COR_BOOT`, `REE_BOOT`, `VAR_BOOT`) all share the same three-slot pattern `0x0A / 0x0F / 0x10`. The clearest exact repeated row is slot `0x10`, where all five use `raw_event_entry_word = 0x003B` with class-specific code offsets.
+	- `SURCAMNS` and `SURCAMEW` share one exact five-slot callback pattern `0x01 / 0x0A / 0x20 / 0x21 / 0x22`, including the same `0x0A = 0x00D1 / 0x00000001` anchor row and the same `0x22` event-table word `0x01A3`.
+	- `FLAMEBOX`, `NOSTRIL`, and `STEAMBOX` share one environmental-event pattern `0x0A / 0x20 / 0x21`, which is enough to treat the higher slots as real repeated structure even though the exact row values differ by class.
+	- `EVENT` and `SFXTRIG` both participate in the wide `0x0A` lane, but that lane is broad enough that the slot number is currently more trustworthy than the ScummVM label attached to it.
+- The next body-window pass now confirms that repeated slot rows are usually near-templates rather than clones. Using `body_start = code_base_minus_one + raw_code_offset` and the next non-zero slot offset or chunk EOF as the body end:
+	- `JELYHACK` and `JELYH2` slot `0x01` are both `42` bytes long with a shared `10`-byte prefix and `17`-byte suffix, but are not byte-identical.
+	- `_BOOT` slot `0x10` is a clean short-template lane: all five bodies are exactly `59` bytes long, share the same first `5` bytes and last `17` bytes, but each has a distinct digest.
+	- `_BOOT` slots `0x0A` and `0x0F` are larger variants of the same pattern: shared suffix-heavy structure, class-local middles, no exact clones.
+	- `SURCAMNS` and `SURCAMEW` slots `0x20` and `0x22` are same-length near-templates (`698` and `419` bytes respectively), while slot `0x21` diverges more strongly (`1801` vs `1621` bytes) even though it still keeps a common tail.
+- That makes the current best human-readable script model more precise: preserve repeated-family status and exact row bytes, but record byte-identity as a separate property so “same slot template” does not get mistaken for “same compiled body.”
+- That pattern pass materially improves what a decompiled USECODE script can look like right now. The honest current form is not a pretty source language; it is a reversible descriptor-plus-event-table rendering with raw slot ids, raw event-entry words, raw code offsets, and optional ScummVM labels marked as hints only. The concrete examples now live in `docs/usecode-roundtrip-ir.md` and are grounded in `readable_script_ir.md`, `readable_descriptor_templates.md`, and `runtime_descriptor_family_rankings.md`.
 - The first runtime-side follow-through on those descriptor gains is now a little tighter too. Instruction search around `000d:ebe3` confirms one fixed sequenced VM/opcode driver body, not just a vague constructor helper: it calls `000d:177c`, `000d:1acb`, `000d:0988`, the internal `000d:22bc` link-matrix block, then `000d:1d4a` and `000d:2104` in order. The key negative result is just as useful: `000d:ec31` is only the internal `CALL 000d:22bc` site inside that body, not a standalone function entry.
 - Ghidra now carries that as a conservative disassembly comment at `000d:ebe3`. That is still short of a safe rename, but it does promote the lane from “suspected constructor chain” to “verified ordered opcode/handler sequence,” which is the clearest current bridge from the descriptor-side event families back into the `000d` VM/object runtime.

@@ -191,17 +205,17 @@ All three constructor variants (`000e:2777`, `000e:2860`, `000e:2969`) follow th

 1. Call `FUN_000e_e935` (allocator — produces garbled 11KB decompile, not renamed)
 2. Set fields `+0xb4` through `+0xc2` on the result
-3. Call `000d:ebe3` directly (confirmed CALL sites at `000e:283e`, `000e:2931`, `000e:29e4`; multi-step chain initializer: calls `177c`, `1acb`, `0988`, `22bc`, `1d4a`, `2104` in sequence)
+3. Call near target `000e:ebe3` directly (confirmed CALL sites at `000e:283e`, `000e:2931`, `000e:29e4`; this is a separate mis-split `000e` region, not `FUN_000d_ebe3`)
 4. Call `assert_alive_sentinel` (assertion: checks `+0xd4 != -1`)
 5. Call `func_0x000eec83`

-The chain at `000d:ebe3` steps through VM opcode handlers (`000d:177c`, `000d:1acb`, `000d:0988`) that operate on a bytecode VM object with stack pointer at `+0xcc` (decremented by 2 per push) and segment base at `+0xce`.
+The old assumption that these constructor calls fed the `000d` VM sequencer is now retired. Raw instruction search shows the direct near calls land on `000e:ebe3`, whose current body is still mis-split/garbled and cannot yet be tied to the `000d:177c` / `000d:1acb` / `000d:0988` / `000d:22bc` / `000d:2104` chain.

 The constructor-side field setup before that sequencer is now slightly tighter too:

-- variants A and B both set `+0xc0 = 1` before the direct `000d:ebe3` call and derive `+0xc2` from `DS:0x604e`
-- variant C instead sets `+0xc0 = 0`, `+0xc2 = 1`, and `+0x4c = 0x000d` before the same sequencer call
-- these direct xrefs make `000d:ebe3` a constructor-side animation sequencer rather than a globally xref-dark dispatcher, but they still do not expose any new wrapper-level opcode number beyond the internal `0x19/0x1a/0x1b` family already proven inside `000d:0988`
+- variants A and B both set `+0xc0 = 1` before the direct `000e:ebe3` call and derive `+0xc2` from `DS:0x604e`
+- variant C instead sets `+0xc0 = 0`, `+0xc2 = 1`, and `+0x4c = 0x000d` before the same near-call lane
+- this remains useful for the animation subsystem, but it no longer counts as upstream xref evidence for `FUN_000d_ebe3`; the true selector/write path into the `000d` dispatcher is still unresolved

 ### Constructor variant renames