# Pentagram-Derived USECODE Parser And Ghidra Path

## Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The goal is not to make Ghidra decompile Crusader USECODE as if it were x86 immediately. The goal is to build one trustworthy bridge layer first:

- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

## What To Reuse From Pentagram

Useful directly:

- the opcode tokenization model from `convert/Convert.h`
- the disassembly-oriented mnemonic layout from `tools/disasm/Disasm.cpp`
- the Crusader event ordinal table from `convert/crusader/ConvertUsecodeCrusader.h`

Useful only as hints:

- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

- Pentagram's Crusader header reader
- any assumption that its old `maxOffset` / `externTable` / `fixupTable` structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path, which should not be treated as a general Crusader decompiler

## Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

- `USECODE/EUSECODE_extracted/class_layout_index.tsv`
- `USECODE/EUSECODE_extracted/class_event_index.tsv`
- `USECODE/EUSECODE_extracted/chunks/`

Current authoritative facts:

- owner-loaded class object index is `class_id + 2`
- class bytes `8..11` provide the code-base anchor already carried in `class_layout_index.tsv`
- slot rows are 6-byte records: `u16 raw_event_entry_word + u32 raw_code_offset`
- slot body windows are already emitted conservatively as `derived_body_start`, `derived_body_end`, and `derived_body_length`
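The facts above can be sketched as a few small helpers. This is a minimal illustration, not the extractor's actual code: it assumes little-endian fields (plausible for a DOS-era format, but not stated above), and the function names and the `table_offset` parameter are hypothetical, since this note does not say where the slot table starts inside a class body.

```python
import struct

# Assumption: little-endian u16 + u32 with no padding, giving 6 bytes per row.
SLOT_ROW = struct.Struct("<HI")


def owner_loaded_index(class_id: int) -> int:
    """Owner-loaded class object index, per the rule class_id + 2."""
    return class_id + 2


def read_code_base(class_bytes: bytes) -> int:
    """Read class bytes 8..11 as one u32 (little-endian is an assumption)."""
    return struct.unpack_from("<I", class_bytes, 8)[0]


def read_slot_row(table: bytes, row_index: int, table_offset: int = 0) -> tuple[int, int]:
    """Return (raw_event_entry_word, raw_code_offset) for one 6-byte slot row.

    table_offset is a hypothetical parameter for wherever the slot table begins.
    """
    return SLOT_ROW.unpack_from(table, table_offset + row_index * SLOT_ROW.size)
```

In practice the TSV indexes already carry these values, so helpers like these would mainly serve as a cross-check against the raw chunk bytes.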
## End-To-End Process

### 1. Start from extracted owner-loaded artifacts

The parser should not reopen `EUSECODE.FLX` directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

- one row from `class_layout_index.tsv`
- one row from `class_event_index.tsv`
- the corresponding chunk file under `USECODE/EUSECODE_extracted/chunks/`
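Loading those rows might look like the sketch below. It uses only the standard `csv` module. The `class_name` column is a hypothetical name chosen for illustration; `entry_index` and the `derived_body_*` fields are the ones this note mentions, but the real TSV headers should be checked before relying on any of them.

```python
import csv
import io


def load_tsv_rows(text: str) -> list[dict[str, str]]:
    """Parse tab-separated text with a header line into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_slot_row(rows, class_name: str, entry_index: int):
    """Return the first row matching class and slot, or None.

    'class_name' is a hypothetical column name. int(..., 0) accepts both
    decimal ("10") and 0x-prefixed hex ("0x0A") cell values.
    """
    for row in rows:
        if row["class_name"] == class_name and int(row["entry_index"], 0) == entry_index:
            return row
    return None
```

Reading the file itself is then just `load_tsv_rows(path.read_text())` for whichever index is needed.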
### 2. Select one body window conservatively

For a chosen class and slot:

- locate `entry_index`
- confirm `derived_body_start` and `derived_body_end`
- slice the chunk-local body bytes exactly from that range
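A conservative slice could be written as below. This sketch assumes the window is half-open, that is, `derived_body_end` is exclusive; this note does not state that, so it should be confirmed against the extractor. The function name and the optional `expected_length` cross-check are illustrative.

```python
def slice_body(chunk: bytes, start: int, end: int, expected_length=None) -> bytes:
    """Return chunk[start:end], refusing anything outside the chunk.

    Assumption: 'end' is exclusive. If the extractor emits an inclusive end,
    the caller would need to pass end + 1 instead.
    """
    if not (0 <= start <= end <= len(chunk)):
        raise ValueError(f"body window {start:#x}..{end:#x} is outside the chunk")
    body = chunk[start:end]
    if expected_length is not None and len(body) != expected_length:
        raise ValueError("body length disagrees with derived_body_length")
    return body
```

Raising instead of clamping keeps a bad window visible rather than silently producing a shorter body.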
### 3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

- parse the op exactly when the operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes
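The three rules above amount to a small decode loop, sketched here. The `OPERAND_WIDTHS` table holds placeholder values for illustration only; a real table would be transcribed from Pentagram's operand-format model. The `DecodedOp` type and `decode_body` name are likewise hypothetical.

```python
from dataclasses import dataclass

# Placeholder operand widths in bytes, for illustration only.
OPERAND_WIDTHS = {0x0A: 1, 0x0B: 2, 0x7A: 0}


@dataclass(frozen=True)
class DecodedOp:
    offset: int  # body-local offset of the opcode byte
    opcode: int
    raw: bytes   # opcode byte plus operand bytes, exactly as found


def decode_body(body: bytes) -> tuple[list[DecodedOp], bytes]:
    """Decode ops until the body ends, an opcode is unknown, or an operand is cut off.

    Returns (ops, tail). The tail holds every byte that was not decoded, so
    joining the raw bytes of all ops with the tail reproduces the input.
    """
    ops = []
    pos = 0
    while pos < len(body):
        width = OPERAND_WIDTHS.get(body[pos])
        end = pos + 1 + (width or 0)
        if width is None or end > len(body):
            break  # unknown opcode or truncated operand: stop, keep the tail
        ops.append(DecodedOp(pos, body[pos], body[pos:end]))
        pos = end
    return ops, body[pos:]
```

Because nothing is dropped or reordered, the output stays lossless even when decoding stops early.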
### 4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors
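One possible shape for that object is sketched below as a plain JSON-serializable dict. Apart from `annotation_hints`, every key name here is an assumption made for illustration and is not the tool's actual schema. Raw bytes are stored as hex strings so the object survives a JSON round trip.

```python
import json


def build_ir(source: dict, class_meta: dict, slot_meta: dict, ops, tail: bytes, hints: list) -> dict:
    """Assemble one IR object; ops is an iterable of (offset, opcode, raw_bytes)."""
    return {
        "ir_version": 1,
        "source": source,
        "class": class_meta,
        "slot": slot_meta,
        "ops": [
            {"offset": offset, "opcode": opcode, "raw_hex": raw.hex()}
            for offset, opcode, raw in ops
        ],
        "undecoded_tail_hex": tail.hex(),
        "annotation_hints": hints,
    }


def ir_to_json(ir: dict) -> str:
    """Serialize with stable key order so repeated runs diff cleanly."""
    return json.dumps(ir, indent=2, sort_keys=True)
```

Keeping the undecoded tail in the same object means a partially understood body is still fully recoverable from the IR alone.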
### 5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.

Do not try to map the bytecode into a full processor module first.

## Proof-Of-Concept Parser

Tool path:

- `tools/poc_crusader_usecode_parser.py`

Current scope:

- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON

Current deliberate limits:

- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

## Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

### 1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

- `NPCTRIG slot 0x0A body parsed by POC tool`
- `EVENT slot 0x0A body parsed by POC tool`
- `slot 0x13 payload-shape hint = signed_word`

### 2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

- `000d:51fd` = slot value load path
- `000d:5572` = slot value plus additive word
- `000d:46ec` = context create from slot index
- `000d:22bc` = decoded matrix/pushback consumer
- `000d:ebe3` = opcode sequence runner

Comment payload should stay short and evidence-heavy, for example:

`POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, payload shape unresolved, parsed via tools/poc_crusader_usecode_parser.py`
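A small formatter keeps such payloads uniform across anchors. The sketch below reproduces the example string above from its fields; the function name and parameter names are hypothetical, and it deliberately touches no Ghidra API, so it can be tested outside Ghidra.

```python
def format_anchor_comment(
    class_name: str,
    slot: int,
    body_start: int,
    body_end: int,
    raw_word: int,
    payload_shape: str = "unresolved",
    tool: str = "tools/poc_crusader_usecode_parser.py",
) -> str:
    """Build a one-line, evidence-heavy comment payload for a compiled anchor."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, raw word 0x{raw_word:04X}, "
        f"payload shape {payload_shape}, parsed via {tool}"
    )


if __name__ == "__main__":
    print(format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E))
```

A later importer could pass the resulting string to whatever comment or bookmark call it uses, keeping the formatting logic independent of the Ghidra side.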
### 3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

- `slot-backed-owner-loaded-body`
- `slot-plus-offset-value-reload`
- `sequencer-matrix-consumer`
- `literal-replay-interpreter-upstream`

## Why Not A Ghidra Processor Yet

The missing pieces are still too important:

- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later `000c:fa2f` literal/replay lane is still not closed end-to-end
- the upstream selector into `entity_vm_opcode_sequence_run` is still unresolved

So the right order is:

1. parser
2. IR
3. annotation import
4. only then reconsider a language module

## User Workflow

Run the proof-of-concept parser from the repo root.

Example:

```powershell
c:/Users/Maddo/.PYENV/PYENV-WIN/versions/3.14.3/python.exe tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
```

Recommended first targets:

1. `NPCTRIG` slot `0x0A`
2. `NPCTRIG` slot `0x20`
3. `EVENT` slot `0x0A`
4. one `_BOOT` slot `0x10` body as a short repeated-template control sample

What to look for in the output:

- exact raw body window
- whether the body terminates cleanly at opcode `0x7A`
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions
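The termination check in that list is easy to automate. The sketch below assumes "cleanly" means the last decoded op is `0x7A` and no undecoded bytes follow it; that reading is an interpretation, not something the parser is known to enforce. Ops are represented as `(offset, opcode, raw_bytes)` tuples purely for illustration.

```python
END_OPCODE = 0x7A


def terminates_cleanly(ops, tail: bytes) -> bool:
    """True when the final decoded op is 0x7A and nothing is left undecoded.

    ops is a sequence of (offset, opcode, raw_bytes) tuples; tail is whatever
    the decoder could not consume.
    """
    return bool(ops) and ops[-1][1] == END_OPCODE and len(tail) == 0
```

Running this over the recommended first targets would give a quick signal about which bodies the current operand table already covers end to end.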
## Next Extensions

1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
3. Add a small importer that converts `annotation_hints` into Ghidra comments and bookmarks.
4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.