Pentagram-Derived USECODE Parser And Ghidra Path
Purpose
This note turns the earlier feasibility assessment into a concrete workflow.
The goal is not to make Ghidra decompile Crusader USECODE as if it were x86 immediately. The goal is to build one trustworthy bridge layer first:
- reuse Pentagram's Crusader opcode decoding where it is still valid
- replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
- emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations
What To Reuse From Pentagram
Useful directly:
- the opcode tokenization model from convert/Convert.h
- the disassembly-oriented mnemonic layout from tools/disasm/Disasm.cpp
- the Crusader event ordinal table from convert/crusader/ConvertUsecodeCrusader.h
Useful only as hints:
- intrinsic names and signatures
- old event-name labels for still-unresolved higher ordinals
Not safe to reuse unchanged:
- Pentagram's Crusader header reader
- any assumption that its old maxOffset/externTable/fixupTable structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
- the partial Node-based decompiler path as if it were a general Crusader decompiler
Verified Local Model To Use Instead
The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.
Current authoritative inputs:
- USECODE/EUSECODE_extracted/class_layout_index.tsv
- USECODE/EUSECODE_extracted/class_event_index.tsv
- USECODE/EUSECODE_extracted/chunks/
Current authoritative facts:
- owner-loaded class object index is class_id + 2
- class bytes 8..11 provide the code-base anchor already carried in class_layout_index.tsv
- slot rows are 6-byte records: u16 raw_event_entry_word + u32 raw_code_offset
- slot body windows are already emitted conservatively as derived_body_start, derived_body_end, and derived_body_length
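The slot-row layout and the class index rule above can be sketched with Python's struct module. Little-endian byte order is an assumption here (the note only states the field widths), and the function names are hypothetical rather than taken from the extractor.

```python
import struct

# Assumption: little-endian packing, as is usual for a DOS x86 target.
# The note itself only gives the field widths (u16 + u32).
SLOT_ROW = struct.Struct("<HI")


def owner_loaded_object_index(class_id: int) -> int:
    """Owner-loaded class object index, per the verified local model."""
    return class_id + 2


def parse_slot_row(row: bytes) -> dict:
    """Split one 6-byte slot row into its two raw fields."""
    if len(row) != SLOT_ROW.size:
        raise ValueError(f"expected {SLOT_ROW.size} bytes, got {len(row)}")
    raw_event_entry_word, raw_code_offset = SLOT_ROW.unpack(row)
    return {
        "raw_event_entry_word": raw_event_entry_word,
        "raw_code_offset": raw_code_offset,
    }
```

Both fields are kept raw on purpose: interpreting raw_event_entry_word is left to later stages.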
End-To-End Process
1. Start from extracted owner-loaded artifacts
The parser should not reopen EUSECODE.FLX directly for the proof of concept. The extractor has already normalized the class and slot selection step.
Inputs:
- one row from class_layout_index.tsv
- one row from class_event_index.tsv
- the corresponding chunk file under USECODE/EUSECODE_extracted/chunks/
2. Select one body window conservatively
For a chosen class and slot:
- locate entry_index
- confirm derived_body_start and derived_body_end
- slice the chunk-local body bytes exactly from that range
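A minimal sketch of the confirm-and-slice step, assuming the row has already been read from class_event_index.tsv into a dict keyed by column name. Two things are assumptions rather than verified facts: that derived_body_end is exclusive (so length equals end minus start), and that the TSV stores numbers as decimal or 0x-prefixed hex. The helper name slice_body is hypothetical.

```python
def slice_body(chunk: bytes, row: dict) -> bytes:
    """Return the body bytes for one slot row, refusing inconsistent windows."""
    # int(text, 0) accepts both "218" and "0x00DA".
    start = int(row["derived_body_start"], 0)
    end = int(row["derived_body_end"], 0)
    length = int(row["derived_body_length"], 0)

    if not 0 <= start <= end <= len(chunk):
        raise ValueError(f"window {start:#x}..{end:#x} falls outside the chunk")
    # Assumption: half-open window. Adjust if the extractor uses an
    # inclusive end offset.
    if end - start != length:
        raise ValueError("derived_body_length disagrees with start/end")
    return chunk[start:end]
```

Failing loudly on a mismatch keeps the parser from silently decoding bytes that belong to a neighbouring slot.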
3. Decode opcodes with Pentagram-derived operand formats
Use Pentagram's operand-width model as the first parser source of truth.
For the proof of concept, keep decoding conservative:
- parse the op exactly when the operand format is understood
- keep the raw bytes for every parsed op
- stop cleanly on an unknown opcode and preserve the remaining tail bytes
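The three rules above can be sketched as a small decode loop. This is a simplification under stated assumptions: each opcode is treated as having a fixed operand byte count supplied by the caller, whereas the real operand formats should come from Pentagram's Convert.h and may include variable-length forms. DecodedOp and decode_body are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DecodedOp:
    offset: int   # body-local offset of the opcode byte
    opcode: int
    raw: bytes    # opcode byte plus operand bytes, kept verbatim


def decode_body(body: bytes, operand_widths: dict):
    """Decode ops until the body ends or an op cannot be parsed exactly.

    operand_widths maps opcode -> operand byte count (caller-supplied).
    Returns (ops, tail); tail is the undecoded remainder and is empty
    when the whole body decoded.
    """
    ops = []
    pos = 0
    while pos < len(body):
        opcode = body[pos]
        width = operand_widths.get(opcode)
        if width is None or pos + 1 + width > len(body):
            break  # unknown opcode or truncated operand: stop, keep the tail
        ops.append(DecodedOp(pos, opcode, body[pos:pos + 1 + width]))
        pos += 1 + width
    return ops, body[pos:]
```

Concatenating every op's raw bytes with the tail reproduces the input body, which is the lossless property the IR depends on.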
4. Emit canonical IR v1
The parser output should be one machine-friendly object that includes:
- source artifact metadata
- class metadata
- slot/event metadata
- exact op list with raw bytes
- annotation hints for compiled-side VM anchors
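One possible shape for that object, shown as a sketch only. Apart from annotation_hints, every key name below is a hypothetical placeholder and not the parser's actual schema; raw bytes are stored as hex strings so the object stays JSON-safe and lossless.

```python
import json


def build_ir(source: dict, class_meta: dict, slot_meta: dict,
             ops: list, annotation_hints: list) -> dict:
    """Assemble one IR v1 object; ops is a list of (offset, raw_bytes) pairs."""
    return {
        "ir_version": 1,
        "source": source,            # source artifact metadata
        "class": class_meta,         # class metadata
        "slot": slot_meta,           # slot/event metadata
        "ops": [
            {"offset": offset, "opcode": raw[0], "raw_hex": raw.hex()}
            for offset, raw in ops
        ],
        "annotation_hints": annotation_hints,
    }


def dump_ir(ir: dict) -> str:
    """Serialize with stable key order so IR files diff cleanly."""
    return json.dumps(ir, indent=2, sort_keys=True)
```

Sorted keys are a design choice for reviewability, not a requirement of the format.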
5. Feed Ghidra with annotations, not with fake code yet
The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.
Do not try to map the bytecode into a full processor module first.
Proof-Of-Concept Parser
Tool path:
tools/poc_crusader_usecode_parser.py
Current scope:
- uses the extracted TSV and chunk artifacts already in the repo
- disassembles one selected class/slot body at a time
- emits canonical IR JSON
- optionally emits a readable text listing beside the JSON
Current deliberate limits:
- no full intrinsic name table yet
- no synthetic control-flow graph yet
- no recompilation path yet
- no Ghidra importer yet
That keeps the parser useful without pretending the VM is fully solved.
Canonical Ghidra Annotation Import Path
The first importer should consume the parser IR and create only three kinds of output.
1. Bookmarks
Use bookmarks for class/slot-level evidence that should not be hidden inside comments.
Good first bookmark payloads:
- NPCTRIG slot 0x0A body parsed by POC tool
- EVENT slot 0x0A body parsed by POC tool
- slot 0x13 payload-shape hint = signed_word
2. Plate or decompiler comments on compiled anchors
Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.
Best current anchors:
- 000d:51fd = slot value load path
- 000d:5572 = slot value plus additive word
- 000d:46ec = context create from slot index
- 000d:22bc = decoded matrix/pushback consumer
- 000d:ebe3 = opcode sequence runner
Comment payload should stay short and evidence-heavy, for example:
POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, payload shape unresolved, parsed via tools/poc_crusader_usecode_parser.py
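A small formatter that reproduces the payload layout shown above, so every comment carries the same fields in the same order. The function and parameter names are hypothetical; only the output format is taken from the example.

```python
def format_anchor_comment(class_name: str, slot: int, body_start: int,
                          body_end: int, raw_word: int,
                          payload_shape: str = "unresolved",
                          tool: str = "tools/poc_crusader_usecode_parser.py") -> str:
    """Build one short, evidence-heavy comment line for a compiled anchor."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, "
        f"raw word 0x{raw_word:04X}, "
        f"payload shape {payload_shape}, parsed via {tool}"
    )


print(format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E))
```

Keeping the format in one function means a later importer can recognise and replace its own comments without touching hand-written ones.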
3. Optional comment bundles per runtime family
If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.
Examples:
- slot-backed-owner-loaded-body
- slot-plus-offset-value-reload
- sequencer-matrix-consumer
- literal-replay-interpreter-upstream
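Grouping by runtime family can be sketched as a plain bucketing pass over the hints. The hint fields used here (family, class_name, comment) are hypothetical, and the sketch deliberately does not assert which compiled anchor belongs to which family.

```python
from collections import defaultdict


def bundle_by_family(hints: list) -> dict:
    """Group annotation hints by runtime family, preserving input order."""
    bundles = defaultdict(list)
    for hint in hints:
        # Key on the runtime family, never on the class name, so one bundle
        # can carry evidence from several classes.
        bundles[hint["family"]].append(hint)
    return dict(bundles)
```

With this keying, hints from NPCTRIG and EVENT that share a family land in the same bundle.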
Why Not A Ghidra Processor Yet
The missing pieces are still too important:
- full opcode semantics are incomplete
- stack and return discipline are incomplete
- the relation between owner-loaded body bytes and the later 000c:fa2f literal/replay lane is still not closed end-to-end
- the upstream selector into entity_vm_opcode_sequence_run is still unresolved
So the right order is:
- parser
- IR
- annotation import
- only then reconsider a language module
User Workflow
Run the proof-of-concept parser from the repo root.
Example:
c:/Users/Maddo/.PYENV/PYENV-WIN/versions/3.14.3/python.exe tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text
Recommended first targets:
- NPCTRIG slot 0x0A
- NPCTRIG slot 0x20
- EVENT slot 0x0A
- one _BOOT slot 0x10 body as a short repeated-template control sample
What to look for in the output:
- exact raw body window
- whether the body terminates cleanly at opcode 0x7A
- body-local call targets and global-address ops
- repeated structural motifs that can be carried back into the VM notes
- anchor hints for the compiled runtime functions
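Some of these checks can be automated over the decoded output. The sketch below assumes each op is a dict with an opcode field and that any undecoded remainder is available as bytes; both are hypothetical shapes, not the parser's confirmed output. The opcode histogram is only a crude first pass at spotting repeated motifs.

```python
from collections import Counter


def summarize_body(ops: list, tail: bytes, end_opcode: int = 0x7A) -> dict:
    """Report clean termination and a per-opcode count for one decoded body."""
    opcodes = [op["opcode"] for op in ops]
    return {
        # Clean means: nothing left undecoded and the last op is the
        # expected terminator.
        "clean_end": not tail and bool(opcodes) and opcodes[-1] == end_opcode,
        "undecoded_tail_bytes": len(tail),
        "opcode_histogram": Counter(opcodes),
    }
```

A body with a non-empty tail is reported as not clean even if 0x7A appears earlier, since the remaining bytes are still unexplained.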
Next Extensions
- Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
- Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
- Add a small importer that converts annotation_hints into Ghidra comments and bookmarks.
- Extend the IR with control-flow edges only after branch/jump confidence is high enough.
- Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.