Pseudocode and stuff
parent 7310c4fe96
commit ee33f94b4b
466 changed files with 27770 additions and 276 deletions
@ -113,6 +113,28 @@ The safe reading is:

The first script IR should preserve exact recompilation inputs before it tries to look pretty.

## Current Parser Views

The current proof-of-concept parser now emits four complementary views for a single class/slot body:

- JSON IR: the authoritative machine-facing output for tooling and any future assembler.
- Flat text listing: a byte-faithful decode with offsets, raw bytes, and trailer sections.
- Script view: a more readable block-labeled decompilation with locals, labels, and stack-VM statements.
- Pseudocode view: a higher-level decompilation that tries to collapse common compare ladders and stack expressions into programming-language-like control flow.

The script and pseudocode views are intentionally descriptive rather than authoritative. They are meant to help read bodies like `NPCTRIG 0x0A` or `EVENT 0x0A` without losing the exact JSON IR that a round-trip compiler will need.
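
As a rough illustration of how the descriptive views can be derived from the authoritative one, the sketch below renders a flat listing from a JSON IR document. The field names `raw_hex` and `mnemonic`, the opcode bytes, and the column layout are assumptions made for this example and are not taken from the actual parser.

```python
import json


def render_flat_listing(ir_json: str) -> str:
    """Render one listing row per op: offset, raw bytes, mnemonic.

    The schema here (ops[].offset, ops[].raw_hex, ops[].mnemonic) is
    hypothetical; only the idea of deriving text from the JSON IR matters.
    """
    ir = json.loads(ir_json)
    rows = []
    for op in ir["ops"]:
        raw = bytes.fromhex(op["raw_hex"])
        # Offsets and bytes are printed verbatim so the listing stays byte-faithful.
        rows.append(f"{op['offset']:04x}  {raw.hex(' '):<12}  {op['mnemonic']}")
    return "\n".join(rows)


# Made-up sample body; these opcode values are placeholders, not real VM opcodes.
sample_ir = json.dumps({
    "ops": [
        {"offset": 0, "raw_hex": "0a05", "mnemonic": "push_byte 0x05"},
        {"offset": 2, "raw_hex": "50", "mnemonic": "ret"},
    ]
})

print(render_flat_listing(sample_ir))
```

Because the listing is produced from the JSON IR rather than alongside it, any disagreement between the two views points at a renderer bug, not at a second source of truth.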

## Deferred Readability Follow-Ups

Keep these parser-facing readability tasks for later while the current focus stays on broad pseudocode export and class-family understanding:

1. Replace unresolved `class_XXXX_slot_YY` call labels with behavior-backed names where the compiled/runtime evidence is strong enough.
2. Replace placeholder argument names such as `arg_06` with semantic names inferred from stable usage patterns.
3. Detect more control-flow shapes beyond compare ladders, especially simple loops and early-return guards.
4. Collapse common spawn/setup idioms into more domain-specific statements when the stack pattern is consistent.
5. Run the pseudocode renderer across larger families like `EVENT`, `_BOOT`, and `SURCAM*` and tighten the heuristics where they still leak VM structure.
6. Add small behavior-level comments only where they help explain gameplay meaning rather than VM mechanics.

### Unit of decompilation

The IR should be organized as:

@ -219,6 +241,7 @@ The compiler side will need more than pretty script text. At minimum it must pre
- Width/sign information for immediates
- Inline versus indirect payload form
- String payload encoding and terminators
- Post-`ret` debug/local symbol trailers, including the local count byte and each per-local metadata row
- Any unknown opcode byte sequences verbatim

If any of those are dropped, a source-level editor can still be readable, but it will stop being a trustworthy recompilation format.
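
One way to make "trustworthy" testable is a round-trip check: rebuild the body from the preserved pieces and compare its digest to the recorded `raw_body_sha1`. The sketch below assumes each op carries its raw bytes in a hypothetical `raw_hex` field and that the trailer is kept as a single byte string; both are illustrative choices, not the parser's confirmed schema.

```python
import hashlib


def reassemble(ops, trailer: bytes) -> bytes:
    """Concatenate preserved op bytes in offset order, then append the trailer."""
    ordered = sorted(ops, key=lambda op: op["offset"])
    return b"".join(bytes.fromhex(op["raw_hex"]) for op in ordered) + trailer


def round_trips(ops, trailer: bytes, expected_sha1: str) -> bool:
    """True when the rebuilt body hashes to the digest recorded in the IR."""
    return hashlib.sha1(reassemble(ops, trailer)).hexdigest() == expected_sha1


# Made-up body bytes; the opcode values are placeholders for the example.
original_body = bytes.fromhex("0a05507a")
expected = hashlib.sha1(original_body).hexdigest()
ops = [
    {"offset": 0, "raw_hex": "0a05"},
    {"offset": 2, "raw_hex": "50"},
]

print(round_trips(ops, b"\x7a", expected))  # trailer preserved
print(round_trips(ops, b"", expected))      # trailer dropped
```

Dropping any preserved piece, here the trailing byte, changes the digest, so the check fails as soon as the IR loses information.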

@ -396,9 +419,20 @@ event:
derived_body_length: 373
repeated_template_status: ""
body:
end_reason: end_opcode
end_reason: debug_symbols_then_end
raw_body_sha1: <digest>
unknown_trailing_bytes: ""
debug_symbol_offset: 0x0143
debug_symbol_count: 5
debug_symbols:
- index: 0x00
type_id: 0x69
bp_repr: [BP+00h]
name: referent
- index: 0x01
type_id: 0x69
bp_repr: [BP+0Ah]
name: event
ops:
- offset: 0x0000
absolute_body_offset: 0x00da

@ -417,9 +451,12 @@ ops:
annotation_hints:
runtime_family: slot-backed-owner-loaded-body
compiled_anchors:
- 000d:51fd
- 000d:5572
- 000d:46ec
- 000d:0988
- 000d:208b
- 000d:21ed
- 000d:22bc
- 000d:2104
- 000d:ebe3
```

@ -431,7 +468,7 @@ annotation_hints:

`event` keeps the exact six-byte row meaningfully split into authoritative fields plus the derived body window.

`body` records how far the parser got and whether any bytes remain undecoded or trailing.
`body` records how far the parser got, whether the body terminated at a real `0x7a` end marker, and whether a post-`ret` local/debug trailer was parsed instead of being misclassified as stray opcodes.

`ops` is intentionally lossless. Each decoded op keeps:

@ -442,6 +479,8 @@ annotation_hints:
- exact raw bytes for the whole op
- parsed operands as typed fields

`debug_symbols` preserves the owner-loaded post-`ret` local metadata block. Current evidence from `crusader-disasm` and the live extracted chunks shows that many bodies end as: executable ops -> `ret` -> local/debug symbol rows -> `0x7a` end. Those rows are not executable bytecode and should survive round-trip as structured metadata rather than raw tail bytes.
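
The layout described above can be sketched as a simple three-way split. This is a minimal illustration that assumes the op decoder already knows the offset just past the final `ret`; it does not model the per-local row format. The `no_end_marker` label is invented for the example, and mapping an empty trailer to `end_opcode` is an interpretation of the field, not confirmed parser behavior.

```python
from dataclasses import dataclass

END_MARKER = 0x7A  # end marker byte named in the text


@dataclass
class BodySplit:
    executable: bytes     # ops up to and including the final ret
    debug_trailer: bytes  # bytes between ret and the end marker
    end_reason: str


def split_body(body: bytes, ret_end: int) -> BodySplit:
    """Split a body at ret_end, the offset just past the final ret op."""
    if not 0 <= ret_end <= len(body):
        raise ValueError("ret_end is outside the body")
    # Without a closing end marker after ret, keep the tail verbatim.
    if ret_end == len(body) or body[-1] != END_MARKER:
        return BodySplit(body[:ret_end], body[ret_end:], "no_end_marker")
    trailer = body[ret_end:-1]
    reason = "debug_symbols_then_end" if trailer else "end_opcode"
    return BodySplit(body[:ret_end], trailer, reason)


# Made-up bytes: two placeholder op bytes, a placeholder ret, two trailer bytes, end marker.
with_trailer = split_body(bytes([0x01, 0x02, 0x50, 0xAA, 0xBB, 0x7A]), ret_end=3)
without_trailer = split_body(bytes([0x01, 0x50, 0x7A]), ret_end=2)
print(with_trailer.end_reason, without_trailer.end_reason)
```

Keeping the trailer as its own field means a later stage can parse it into structured rows without the opcode decoder ever seeing those bytes.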

`annotation_hints` is the bridge to Ghidra. It is not a source-language feature. It exists so a later importer can attach the right comments and bookmarks to the compiled VM/runtime addresses without trying to infer them from free text.

### Opcode result policy

@ -451,7 +490,7 @@ The parser should use four result classes only:
- `decoded_op`: normal parsed opcode with structured operands
- `unknown_opcode`: one-byte opcode not yet modeled; stop or fall back conservatively
- `raw_tail`: remaining undecoded bytes after a stop condition
- `debug_blob`: symbol/debug tail such as `0x5c`-anchored metadata
- `debug_blob`: post-`ret` local/debug trailer ending in `0x7a`

That keeps the IR trustworthy even before the whole Crusader VM is modeled.
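
A conservative decode loop along these lines might look like the sketch below. The opcode table is entirely hypothetical (the byte values and operand widths are placeholders), and the `debug_blob` class is declared but not produced here, since recognizing the trailer depends on details outside this example.

```python
from enum import Enum


class ResultClass(Enum):
    DECODED_OP = "decoded_op"
    UNKNOWN_OPCODE = "unknown_opcode"
    RAW_TAIL = "raw_tail"
    DEBUG_BLOB = "debug_blob"  # not emitted by this sketch


# Hypothetical table: opcode byte -> (mnemonic, operand byte count).
KNOWN_OPS = {0x0A: ("push_byte", 1), 0x50: ("ret", 0)}


def decode(body: bytes):
    """Return (class, offset, raw bytes) tuples that together cover the whole body."""
    results = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        if opcode not in KNOWN_OPS:
            # Stop at the first unmodeled opcode and keep everything after it verbatim.
            results.append((ResultClass.UNKNOWN_OPCODE, pos, body[pos:pos + 1]))
            if pos + 1 < len(body):
                results.append((ResultClass.RAW_TAIL, pos + 1, body[pos + 1:]))
            break
        _mnemonic, width = KNOWN_OPS[opcode]
        end = pos + 1 + width
        if end > len(body):
            # Truncated operand: do not guess, keep the remainder as raw bytes.
            results.append((ResultClass.RAW_TAIL, pos, body[pos:]))
            break
        results.append((ResultClass.DECODED_OP, pos, body[pos:end]))
        pos = end
    return results


sample = bytes.fromhex("0a0550ff0102")
for kind, offset, raw in decode(sample):
    print(f"{offset:04x} {kind.value} {raw.hex()}")
```

The useful property is that the raw bytes of all results, joined in order, reproduce the input exactly, whether or not every opcode was understood.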

@ -474,16 +513,23 @@ annotation_hints:
runtime_family: slot-backed-owner-loaded-body
payload_shape_hint: signed_word
compiled_anchors:
- address: 000d:51fd
role: slot_value_loader
- address: 000d:5572
role: slot_value_plus_offset
- address: 000d:46ec
role: context_create_from_slot
- address: 000d:ebe3
role: opcode_sequence_run
- address: 000d:0988
role: referent_chain_mutator
- address: 000d:208b
role: materialize_or_forward_value
- address: 000d:21ed
role: prepend_inline_payload
- address: 000d:22bc
role: matrix_pushback_stage
- address: 000d:2104
role: finalize_to_outptr
- address: 000d:ebe3
role: opcode_sequence_run
runtime_stage_hints:
- stage_address: 000d:0988
ir_name: APPEND_UNIQUE_INDIRECT
```

This is deliberately smaller than a full import format. It keeps the parser reusable even if the first Ghidra-side importer is only a comment/bookmark script.
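
To show how little an importer needs from these hints, the sketch below turns `compiled_anchors` into plain comment records keyed by segment and offset. It deliberately uses no Ghidra API; the record shape and the comment text format are assumptions for illustration, and a real importer script would map these records onto whatever the Ghidra side expects.

```python
def parse_seg_off(address: str):
    """Parse a 'ssss:oooo' hex pair such as '000d:51fd' into two integers."""
    segment, offset = address.split(":")
    return int(segment, 16), int(offset, 16)


def anchors_to_comments(hints: dict):
    """Build one comment record per compiled anchor (hypothetical record shape)."""
    family = hints.get("runtime_family", "")
    records = []
    for anchor in hints.get("compiled_anchors", []):
        segment, offset = parse_seg_off(anchor["address"])
        records.append({
            "segment": segment,
            "offset": offset,
            "comment": f"{anchor['role']} ({family})",
        })
    return records


# Values copied from the annotation_hints example above.
hints = {
    "runtime_family": "slot-backed-owner-loaded-body",
    "compiled_anchors": [
        {"address": "000d:51fd", "role": "slot_value_loader"},
        {"address": "000d:0988", "role": "referent_chain_mutator"},
    ],
}

for record in anchors_to_comments(hints):
    print(f"{record['segment']:04x}:{record['offset']:04x}  {record['comment']}")
```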