# Pentagram-Derived USECODE Parser And Ghidra Path

## Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The goal is not to make Ghidra decompile Crusader USECODE as if it were x86 immediately. The goal is to build one trustworthy bridge layer first:

- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

## What To Reuse From Pentagram

Useful directly:

- the opcode tokenization model from `convert/Convert.h`
- the disassembly-oriented mnemonic layout from `tools/disasm/Disasm.cpp`
- the Crusader event ordinal table from `convert/crusader/ConvertUsecodeCrusader.h`

Useful only as hints:

- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

- Pentagram's Crusader header reader
- any assumption that its old `maxOffset` / `externTable` / `fixupTable` structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path as if it were a general Crusader decompiler

## Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

- `USECODE/EUSECODE_extracted/class_layout_index.tsv`
- `USECODE/EUSECODE_extracted/class_event_index.tsv`
- `USECODE/EUSECODE_extracted/chunks/`

Current authoritative facts:

- owner-loaded class object index is `class_id + 2`
- class bytes `8..11` provide the code-base anchor already carried in `class_layout_index.tsv`
- slot rows are 6-byte records: `u16 raw_event_entry_word + u32 raw_code_offset`
- slot body windows are already emitted conservatively as `derived_body_start`, `derived_body_end`, and `derived_body_length`

## End-To-End Process

### 1. Start from extracted owner-loaded artifacts

The parser should not reopen `EUSECODE.FLX` directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

- one row from `class_layout_index.tsv`
- one row from `class_event_index.tsv`
- the corresponding chunk file under `USECODE/EUSECODE_extracted/chunks/`

### 2. Select one body window conservatively

For a chosen class and slot:

- locate `entry_index`
- confirm `derived_body_start` and `derived_body_end`
- slice the chunk-local body bytes exactly from that range

### 3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

- parse the op exactly when the operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes

### 4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors

### 5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.
Do not try to map the bytecode into a full processor module first.

## Proof-Of-Concept Parser

Tool path:

- `tools/poc_crusader_usecode_parser.py`

Current scope:

- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON

Current deliberate limits:

- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

## Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

### 1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

- `NPCTRIG slot 0x0A body parsed by POC tool`
- `EVENT slot 0x0A body parsed by POC tool`
- `slot 0x13 payload-shape hint = signed_word`

### 2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

- `000d:46ec` = context create from slot index
- `000d:0988` = referent-chain mutation family (`0x18..0x1b`)
- `000d:208b` = materialize-or-forward value lane
- `000d:21ed` = inline payload prepend stage
- `000d:22bc` = decoded matrix/pushback consumer
- `000d:2104` = mixed immediate/object finalize-to-outptr stage
- `000d:ebe3` = opcode sequence runner

Comment payload should stay short and evidence-heavy, for example:

`POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, 5 local/debug rows after ret, parsed via tools/poc_crusader_usecode_parser.py`

### 3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

- `slot-backed-owner-loaded-body`
- `slot-plus-offset-value-reload`
- `sequencer-matrix-consumer`
- `literal-replay-interpreter-upstream`

## Why Not A Ghidra Processor Yet

The missing pieces are still too important:

- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later `000c:fa2f` literal/replay lane is still not closed end-to-end
- the upstream selector into `entity_vm_opcode_sequence_run` is still unresolved

So the right order is:

1. parser
2. IR
3. annotation import
4. only then reconsider a language module

## User Workflow

Run the proof-of-concept parser from the repo root.

Example:

```powershell
c:/Users/Maddo/.PYENV/PYENV-WIN/versions/3.14.3/python.exe tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
```

Recommended first targets:

1. `NPCTRIG` slot `0x0A`
2. `NPCTRIG` slot `0x20`
3. `EVENT` slot `0x0A`
4. one `_BOOT` slot `0x10` body as a short repeated-template control sample

What to look for in the output:

- exact raw body window
- whether the body terminates cleanly at opcode `0x7A`
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions

## Next Extensions

1. Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
2. Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
3. Add a small importer that converts `annotation_hints` into Ghidra comments and bookmarks.
4. Extend the IR with control-flow edges only after branch/jump confidence is high enough.
5. Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.
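
## Illustrative Sketch: Slot Rows And Conservative Decode

The 6-byte slot row layout and the conservative decode rules described earlier in this note can be sketched in a few lines of Python. This is a minimal illustration, not the code in `tools/poc_crusader_usecode_parser.py`. The names `SlotRow`, `parse_slot_rows`, `owner_object_index`, `decode_body`, and `PLACEHOLDER_WIDTHS` are hypothetical. Little-endian byte order is an assumption based on the DOS target, and the operand widths in the demo table are placeholders: real widths have to come from Pentagram's operand formats.

```python
import struct
from dataclasses import dataclass

# Assumption: rows are little-endian (DOS target). "<HI" = u16 + u32 = 6 bytes.
SLOT_ROW = struct.Struct("<HI")


@dataclass
class SlotRow:
    raw_event_entry_word: int  # u16
    raw_code_offset: int       # u32


def parse_slot_rows(table: bytes) -> list[SlotRow]:
    """Split a slot table into 6-byte rows; refuse a ragged table."""
    if len(table) % SLOT_ROW.size:
        raise ValueError("slot table length is not a multiple of 6")
    return [SlotRow(*fields) for fields in SLOT_ROW.iter_unpack(table)]


def owner_object_index(class_id: int) -> int:
    """Owner-loaded class object index, per the verified local model."""
    return class_id + 2


def decode_body(body: bytes, operand_widths: dict[int, int]) -> dict:
    """Decode ops while the operand width is known; keep raw bytes for
    every op and preserve the undecoded tail instead of guessing."""
    ops = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        width = operand_widths.get(opcode)
        # Stop on an unknown opcode or an operand that runs past the body.
        if width is None or pos + 1 + width > len(body):
            break
        raw = body[pos:pos + 1 + width]
        ops.append({"offset": pos, "opcode": opcode, "raw": raw.hex()})
        pos += 1 + width
    return {
        "ops": ops,
        "unparsed_tail": body[pos:].hex(),
        "complete": pos == len(body),
    }


# Placeholder widths for illustration only, NOT a verified opcode table.
PLACEHOLDER_WIDTHS = {0x0A: 1, 0x7A: 0}

if __name__ == "__main__":
    rows = parse_slot_rows(bytes.fromhex("3e01da000000"))
    print(rows[0])
    print(decode_body(bytes([0x0A, 0x05, 0x7A, 0xFF, 0x01]), PLACEHOLDER_WIDTHS))
```

Passing the width table in as a parameter keeps the decoder loop independent of how complete the opcode knowledge is: an opcode missing from the table stops decoding and lands in `unparsed_tail` rather than being guessed at.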