Pseudocode and stuff

MaddoScientisto 2026-03-25 23:32:13 +01:00
commit ee33f94b4b
466 changed files with 27770 additions and 276 deletions


@@ -113,6 +113,28 @@ The safe reading is:
The first script IR should preserve exact recompilation inputs before it tries to look pretty.
## Current Parser Views
The current proof-of-concept parser now emits four complementary views for a single class/slot body:
- JSON IR: the authoritative machine-facing output for tooling and any future assembler.
- Flat text listing: a byte-faithful decode with offsets, raw bytes, and trailer sections.
- Script view: a more readable block-labeled decompilation with locals, labels, and stack-VM statements.
- Pseudocode view: a higher-level decompilation that tries to collapse common compare ladders and stack expressions into programming-language-like control flow.
The script and pseudocode views are intentionally descriptive rather than authoritative. They are meant to help read bodies like `NPCTRIG 0x0A` or `EVENT 0x0A` without losing the exact JSON IR that a round-trip compiler will need.
## Deferred Readability Follow-Ups
Keep these parser-facing readability tasks for later while the current focus stays on broad pseudocode export and class-family understanding:
1. Replace unresolved `class_XXXX_slot_YY` call labels with behavior-backed names where the compiled/runtime evidence is strong enough.
2. Replace placeholder argument names such as `arg_06` with semantic names inferred from stable usage patterns.
3. Detect more control-flow shapes beyond compare ladders, especially simple loops and early-return guards.
4. Collapse common spawn/setup idioms into more domain-specific statements when the stack pattern is consistent.
5. Run the pseudocode renderer across larger families like `EVENT`, `_BOOT`, and `SURCAM*` and tighten the heuristics where they still leak VM structure.
6. Add small behavior-level comments only where they help explain gameplay meaning rather than VM mechanics.
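As a rough illustration of follow-ups 1 and 2, the renaming step could be a lookup that only fires above an evidence threshold. This is a minimal sketch, not the parser's code: it assumes `XXXX` and `YY` are hex digits, and the table contents, the confidence scores, the `min_confidence` cutoff, and the bracketed back-reference format are all hypothetical.

```python
import re

# Hypothetical evidence table: (class id, slot) -> (name, confidence in [0, 1]).
# Names and scores are illustrative placeholders, not findings.
BEHAVIOR_NAMES = {
    ("0A10", "0A"): ("example_spawn_helper", 0.9),
    ("0B20", "01"): ("example_weak_guess", 0.3),
}

# Assumes the XXXX and YY parts of the label are hex digits.
CALL_LABEL = re.compile(r"\bclass_([0-9A-Fa-f]{4})_slot_([0-9A-Fa-f]{2})\b")


def rename_call_labels(text, names=BEHAVIOR_NAMES, min_confidence=0.8):
    """Replace call labels only when the evidence clears the threshold."""

    def substitute(match):
        key = (match.group(1).upper(), match.group(2).upper())
        name, confidence = names.get(key, (None, 0.0))
        if name is None or confidence < min_confidence:
            return match.group(0)  # keep the unresolved label untouched
        # Keep the original label next to the new name for traceability.
        return f"{name} [{match.group(0)}]"

    return CALL_LABEL.sub(substitute, text)


strong = rename_call_labels("call class_0A10_slot_0A(arg_06)")
weak = rename_call_labels("call class_0B20_slot_01(arg_06)")
```

Because the rename only touches the descriptive views, the JSON IR keeps the original label either way.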
### Unit of decompilation
The IR should be organized as:
@@ -219,6 +241,7 @@ The compiler side will need more than pretty script text. At minimum it must pre
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Post-`ret` debug/local symbol trailers, including the local count byte and each per-local metadata row
- Any unknown opcode byte sequences verbatim
If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.
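One way to make "trustworthy" testable is a round-trip check: if every parsed item keeps its exact raw bytes, concatenating them in offset order must reproduce the body. The sketch below assumes a hypothetical `Item` record and helper names, and the sample bytes are arbitrary placeholders rather than real Crusader opcodes (only the trailing `0x7a` end marker is taken from this document).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical parsed item; the field names are illustrative."""
    kind: str    # e.g. "decoded_op", "unknown_opcode", "raw_tail", "debug_blob"
    offset: int  # offset of the first byte inside the body
    raw: bytes   # exact bytes as found in the body


def reassemble(items):
    """Concatenate raw bytes in offset order, rejecting gaps and overlaps."""
    out = bytearray()
    for item in sorted(items, key=lambda i: i.offset):
        if item.offset != len(out):
            raise ValueError(f"gap or overlap at offset 0x{item.offset:04x}")
        out += item.raw
    return bytes(out)


def is_lossless(body, items):
    """True when the items cover the body exactly, byte for byte."""
    try:
        return reassemble(items) == body
    except ValueError:
        return False


# Placeholder body: three arbitrary bytes followed by the 0x7a end marker.
body = bytes.fromhex("0102037a")
items = [
    Item("decoded_op", 0, bytes.fromhex("0102")),
    Item("unknown_opcode", 2, bytes.fromhex("03")),
    Item("debug_blob", 3, bytes.fromhex("7a")),
]
ok = is_lossless(body, items)
dropped = is_lossless(body, items[:2])
```

A check like this fails as soon as a trailer or an unknown byte sequence is dropped, which is exactly the case the list above warns about.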
@@ -396,9 +419,20 @@ event:
derived_body_length: 373
repeated_template_status: ""
body:
end_reason: end_opcode
end_reason: debug_symbols_then_end
raw_body_sha1: <digest>
unknown_trailing_bytes: ""
debug_symbol_offset: 0x0143
debug_symbol_count: 5
debug_symbols:
- index: 0x00
type_id: 0x69
bp_repr: [BP+00h]
name: referent
- index: 0x01
type_id: 0x69
bp_repr: [BP+0Ah]
name: event
ops:
- offset: 0x0000
absolute_body_offset: 0x00da
@@ -417,9 +451,12 @@ ops:
annotation_hints:
runtime_family: slot-backed-owner-loaded-body
compiled_anchors:
- 000d:51fd
- 000d:5572
- 000d:46ec
- 000d:0988
- 000d:208b
- 000d:21ed
- 000d:22bc
- 000d:2104
- 000d:ebe3
```
@@ -431,7 +468,7 @@ annotation_hints:
`event` keeps the exact six-byte row, split into authoritative fields, plus the derived body window.
`body` records how far the parser got and whether any bytes remain undecoded or trailing.
`body` records how far the parser got, whether the body terminated at a real `0x7a` end marker, and whether a post-`ret` local/debug trailer was parsed instead of being misclassified as stray opcodes.
`ops` is intentionally lossless. Each decoded op keeps:
@@ -442,6 +479,8 @@ annotation_hints:
- exact raw bytes for the whole op
- parsed operands as typed fields
`debug_symbols` preserves the owner-loaded post-`ret` local metadata block. Current evidence from `crusader-disasm` and the live extracted chunks shows that many bodies end as: executable ops -> `ret` -> local/debug symbol rows -> `0x7a` end. Those rows are not executable bytecode and should survive round-trip as structured metadata rather than raw tail bytes.
`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.
### Opcode result policy
@@ -451,7 +490,7 @@ The parser should use four result classes only:
- `decoded_op`: normal parsed opcode with structured operands
- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
- `raw_tail`: remaining undecoded bytes after a stop condition
- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata
- `debug_blob`: post-`ret` local/debug trailer ending in `0x7a`
That keeps the IR trustworthy even before the whole Crusader VM is modeled.
@@ -474,16 +513,23 @@ annotation_hints:
runtime_family: slot-backed-owner-loaded-body
payload_shape_hint: signed_word
compiled_anchors:
- address: 000d:51fd
role: slot_value_loader
- address: 000d:5572
role: slot_value_plus_offset
- address: 000d:46ec
role: context_create_from_slot
- address: 000d:ebe3
role: opcode_sequence_run
- address: 000d:0988
role: referent_chain_mutator
- address: 000d:208b
role: materialize_or_forward_value
- address: 000d:21ed
role: prepend_inline_payload
- address: 000d:22bc
role: matrix_pushback_stage
- address: 000d:2104
role: finalize_to_outptr
- address: 000d:ebe3
role: opcode_sequence_run
runtime_stage_hints:
- stage_address: 000d:0988
ir_name: APPEND_UNIQUE_INDIRECT
```
This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
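To show how little a first importer would need, here is a minimal sketch that turns `annotation_hints` into plain comment records. It assumes the hints have already been loaded into a dict shaped like the YAML above and that addresses are `segment:offset` hex pairs. The function names and the record layout are hypothetical, and no Ghidra API is called.

```python
def parse_segmented(address):
    """Split a 'ssss:oooo' hex address into (segment, offset) integers."""
    segment, offset = address.split(":")
    return int(segment, 16), int(offset, 16)


def comment_records(hints):
    """Build one comment record per compiled anchor (hypothetical layout)."""
    family = hints.get("runtime_family", "")
    records = []
    for anchor in hints.get("compiled_anchors", []):
        segment, offset = parse_segmented(anchor["address"])
        records.append({
            "segment": segment,
            "offset": offset,
            "comment": f"{anchor['role']} ({family})",
        })
    return records


# Values copied from the example hints above.
hints = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "compiled_anchors": [
        {"address": "000d:51fd", "role": "slot_value_loader"},
        {"address": "000d:ebe3", "role": "opcode_sequence_run"},
    ],
}
records = comment_records(hints)
```

A Ghidra-side script could then iterate over `records` and attach each comment at the matching address, without ever parsing free text.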