Pseudocode and stuff

2026-03-25 23:32:13 +01:00 · 2026-03-25 23:32:13 +01:00 · ee33f94b4b
commit ee33f94b4b
parent 7310c4fe96
466 changed files with 27770 additions and 276 deletions
--- a/docs/usecode-roundtrip-ir.md
+++ b/docs/usecode-roundtrip-ir.md
@@ -113,6 +113,28 @@ The safe reading is:

 The first script IR should preserve exact recompilation inputs before it tries to look pretty.

+## Current Parser Views
+
+The current proof-of-concept parser now emits four complementary views for a single class/slot body:
+
+- JSON IR: the authoritative machine-facing output for tooling and any future assembler.
+- Flat text listing: a byte-faithful decode with offsets, raw bytes, and trailer sections.
+- Script view: a more readable block-labeled decompilation with locals, labels, and stack-VM statements.
+- Pseudocode view: a higher-level decompilation that tries to collapse common compare ladders and stack expressions into programming-language-like control flow.
+
+The script and pseudocode views are intentionally descriptive rather than authoritative. They are meant to help read bodies like `NPCTRIG 0x0A` or `EVENT 0x0A` without losing the exact JSON IR that a round-trip compiler will need.
+
+## Deferred Readability Follow-Ups
+
+Keep these parser-facing readability tasks for later while the current focus stays on broad pseudocode export and class-family understanding:
+
+1. Replace unresolved `class_XXXX_slot_YY` call labels with behavior-backed names where the compiled/runtime evidence is strong enough.
+2. Replace placeholder argument names such as `arg_06` with semantic names inferred from stable usage patterns.
+3. Detect more control-flow shapes beyond compare ladders, especially simple loops and early-return guards.
+4. Collapse common spawn/setup idioms into more domain-specific statements when the stack pattern is consistent.
+5. Run the pseudocode renderer across larger families like `EVENT`, `_BOOT`, and `SURCAM*` and tighten the heuristics where they still leak VM structure.
+6. Add small behavior-level comments only where they help explain gameplay meaning rather than VM mechanics.
+
 ### Unit of decompilation

 The IR should be organized as:
@@ -219,6 +241,7 @@ The compiler side will need more than pretty script text. At minimum it must pre
 - Width/sign information for immediates
 - Inline versus indirect payload form
 - String payload encoding and terminators
+- Post-`ret` debug/local symbol trailers, including the local count byte and each per-local metadata row
 - Any unknown opcode byte sequences verbatim

 If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.
@@ -396,9 +419,20 @@ event:
  derived_body_length: 373
  repeated_template_status: ""
 body:
-  end_reason: end_opcode
+  end_reason: debug_symbols_then_end
  raw_body_sha1: <digest>
  unknown_trailing_bytes: ""
+  debug_symbol_offset: 0x0143
+  debug_symbol_count: 5
+debug_symbols:
+  - index: 0x00
+    type_id: 0x69
+    bp_repr: [BP+00h]
+    name: referent
+  - index: 0x01
+    type_id: 0x69
+    bp_repr: [BP+0Ah]
+    name: event
 ops:
  - offset: 0x0000
    absolute_body_offset: 0x00da
@@ -417,9 +451,12 @@ ops:
 annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  compiled_anchors:
-    - 000d:51fd
-    - 000d:5572
    - 000d:46ec
+    - 000d:0988
+    - 000d:208b
+    - 000d:21ed
+    - 000d:22bc
+    - 000d:2104
    - 000d:ebe3
 ```

@@ -431,7 +468,7 @@ annotation_hints:

 `event` keeps the exact six-byte row meaningfully split into authoritative fields plus the derived body window.

-`body` records how far the parser got and whether any bytes remain undecoded or trailing.
+`body` records how far the parser got, whether the body terminated at a real `0x7a` end marker, and whether a post-`ret` local/debug trailer was parsed instead of being misclassified as stray opcodes.

 `ops` is intentionally lossless. Each decoded op keeps:

@@ -442,6 +479,8 @@ annotation_hints:
 - exact raw bytes for the whole op
 - parsed operands as typed fields

+`debug_symbols` preserves the owner-loaded post-`ret` local metadata block. Current evidence from `crusader-disasm` and the live extracted chunks shows that many bodies end as: executable ops -> `ret` -> local/debug symbol rows -> `0x7a` end. Those rows are not executable bytecode and should survive round-trip as structured metadata rather than raw tail bytes.
+
 `annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.

 ### Opcode result policy
@@ -451,7 +490,7 @@ The parser should use four result classes only:
 - `decoded_op`: normal parsed opcode with structured operands
 - `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
 - `raw_tail`: remaining undecoded bytes after a stop condition
-- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata
+- `debug_blob`: post-`ret` local/debug trailer ending in `0x7a`

 That keeps the IR trustworthy even before the whole Crusader VM is modeled.

@@ -474,16 +513,23 @@ annotation_hints:
  runtime_family: slot-backed-owner-loaded-body
  payload_shape_hint: signed_word
  compiled_anchors:
-    - address: 000d:51fd
-      role: slot_value_loader
-    - address: 000d:5572
-      role: slot_value_plus_offset
    - address: 000d:46ec
      role: context_create_from_slot
-    - address: 000d:ebe3
-      role: opcode_sequence_run
+    - address: 000d:0988
+      role: referent_chain_mutator
+    - address: 000d:208b
+      role: materialize_or_forward_value
+    - address: 000d:21ed
+      role: prepend_inline_payload
    - address: 000d:22bc
      role: matrix_pushback_stage
+    - address: 000d:2104
+      role: finalize_to_outptr
+    - address: 000d:ebe3
+      role: opcode_sequence_run
+  runtime_stage_hints:
+    - stage_address: 000d:0988
+      ir_name: APPEND_UNIQUE_INDIRECT
 ```

 This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
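
The result-class policy and the post-`ret` trailer handling changed in this commit could be sketched roughly as follows. This is a minimal illustration, not the project's parser: the names `ResultClass`, `Span`, and `classify_post_ret` are hypothetical, the caller is assumed to already know the offset just past the `ret` op, and the per-local row layout is deliberately not decoded because the diff does not specify it. Only the `0x7a` end marker and the four class names come from the document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

END_MARKER = 0x7A  # end marker byte named in the document


class ResultClass(Enum):
    """The four result classes listed under 'Opcode result policy'."""
    DECODED_OP = "decoded_op"
    UNKNOWN_OPCODE = "unknown_opcode"
    RAW_TAIL = "raw_tail"
    DEBUG_BLOB = "debug_blob"


@dataclass(frozen=True)
class Span:
    """A classified byte range; raw bytes are kept verbatim for round-trip."""
    kind: ResultClass
    offset: int
    raw: bytes


def classify_post_ret(body: bytes, after_ret: int) -> Optional[Span]:
    """Classify the bytes that follow `ret` (hypothetical helper).

    Assumption: `after_ret` is the offset of the first byte after the
    `ret` op, as determined by the caller's opcode decoder.
    """
    tail = body[after_ret:]
    # Assumption: nothing after `ret`, or a bare end marker, is not a
    # trailer and is left to the ordinary end-of-body handling.
    if not tail or tail == bytes([END_MARKER]):
        return None
    if tail[-1] == END_MARKER:
        # Trailer rows followed by the end marker: keep as debug metadata.
        return Span(ResultClass.DEBUG_BLOB, after_ret, tail)
    # Anything else stays undecoded rather than being guessed at.
    return Span(ResultClass.RAW_TAIL, after_ret, tail)
```

Whatever class is chosen, the span keeps its bytes untouched, so `body[:after_ret] + span.raw` reproduces the original body. That is the property the IR needs before any row-level decoding of the trailer is attempted.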