MaddoScientisto daa363c3d2 Add 'annotate-usecode' command to import USECODE IR JSON annotations

- Introduced a new command 'annotate-usecode' to import USECODE IR JSON annotation hints as Ghidra comments on compiled anchors.
- Added argument parsing for multiple IR JSON files, comment type selection, and a dry-run option.
- Implemented logic to read annotation records from the provided IR files and set comments on the corresponding addresses in Ghidra.
- Enhanced JSON schema to include response structure for the new command.

2026-03-24 18:14:20 +01:00

6.9 KiB

Pentagram-Derived USECODE Parser And Ghidra Path

Purpose

This note turns the earlier feasibility assessment into a concrete workflow.

The goal is not to make Ghidra immediately decompile Crusader USECODE as if it were x86. The goal is to build one trustworthy bridge layer first:

reuse Pentagram's Crusader opcode decoding where it is still valid
replace Pentagram's older Crusader container/header assumptions with the owner-loaded class and slot model already verified in the binary and extractor
emit a lossless IR that can drive both human-readable USECODE output and future Ghidra annotations

What To Reuse From Pentagram

Useful directly:

the opcode tokenization model from convert/Convert.h
the disassembly-oriented mnemonic layout from tools/disasm/Disasm.cpp
the Crusader event ordinal table from convert/crusader/ConvertUsecodeCrusader.h

Useful only as hints:

intrinsic names and signatures
old event-name labels for still-unresolved higher ordinals

Not safe to reuse unchanged:

Pentagram's Crusader header reader
any assumption that its old maxOffset / externTable / fixupTable structure matches the owner-loaded EUSECODE class bodies now validated in the extractor and DOS binary
the partial Node-based decompiler path, which should not be treated as a general Crusader decompiler

Verified Local Model To Use Instead

The proof-of-concept parser should be grounded in the existing local artifacts, not in Pentagram's old header logic.

Current authoritative inputs:

USECODE/EUSECODE_extracted/class_layout_index.tsv
USECODE/EUSECODE_extracted/class_event_index.tsv
USECODE/EUSECODE_extracted/chunks/

Current authoritative facts:

owner-loaded class object index is class_id + 2
class bytes 8..11 provide the code-base anchor already carried in class_layout_index.tsv
slot rows are 6-byte records: u16 raw_event_entry_word + u32 raw_code_offset
slot body windows are already emitted conservatively as derived_body_start, derived_body_end, and derived_body_length
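
The facts above can be sketched in a few lines of Python. This is a minimal illustration, not the extractor's actual code: little-endian byte order is an assumption (the note only gives field widths), and the helper names are hypothetical.

```python
import struct

# Assumption: fields are little-endian, as is typical for DOS-era data.
# Layout from the note: u16 raw_event_entry_word + u32 raw_code_offset.
SLOT_ROW = struct.Struct("<HI")


def owner_object_index(class_id: int) -> int:
    """Owner-loaded class object index is class_id + 2."""
    return class_id + 2


def parse_slot_row(row: bytes) -> dict:
    """Decode one 6-byte slot row into its two raw fields."""
    if len(row) != SLOT_ROW.size:
        raise ValueError(f"expected {SLOT_ROW.size} bytes, got {len(row)}")
    word, offset = SLOT_ROW.unpack(row)
    return {"raw_event_entry_word": word, "raw_code_offset": offset}


def code_base_anchor(class_bytes: bytes) -> int:
    """Read class bytes 8..11 as the code-base anchor (assumed u32, little-endian)."""
    return struct.unpack_from("<I", class_bytes, 8)[0]
```

Keeping the raw field names in the decoded dict makes it easy to cross-check a row against class_event_index.tsv.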

End-To-End Process

1. Start from extracted owner-loaded artifacts

The parser should not reopen EUSECODE.FLX directly for the proof of concept. The extractor has already normalized the class and slot selection step.

Inputs:

one row from class_layout_index.tsv
one row from class_event_index.tsv
the corresponding chunk file under USECODE/EUSECODE_extracted/chunks/

2. Select one body window conservatively

For a chosen class and slot:

locate entry_index
confirm derived_body_start and derived_body_end
slice the chunk-local body bytes exactly from that range
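
A minimal sketch of that slicing step follows. The function name is hypothetical, and because the note does not say whether derived_body_end is inclusive or exclusive, the sketch accepts either convention and lets derived_body_length decide the slice size.

```python
def slice_body(chunk: bytes, start: int, end: int, length: int) -> bytes:
    """Return the chunk-local body bytes, refusing inconsistent windows."""
    # Assumption: derived_body_end may be inclusive or exclusive, so both
    # are accepted as long as they agree with derived_body_length.
    if end - start not in (length, length - 1):
        raise ValueError("derived_body_end disagrees with derived_body_length")
    # Never read past the chunk; a bad window should fail loudly.
    if start < 0 or length < 0 or start + length > len(chunk):
        raise ValueError("body window falls outside the chunk")
    return chunk[start:start + length]
```

Failing on any mismatch, rather than clamping the range, keeps the selection conservative: a body is either sliced exactly or not at all.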

3. Decode opcodes with Pentagram-derived operand formats

Use Pentagram's operand-width model as the first parser source of truth.

For the proof of concept, keep decoding conservative:

parse the op exactly when the operand format is understood
keep the raw bytes for every parsed op
stop cleanly on an unknown opcode and preserve the remaining tail bytes
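
The three rules above could look roughly like the loop below. The operand-width table is a placeholder for illustration only: its entries are not verified Crusader operand formats, and the real table would come from Pentagram's model. The function and field names are likewise hypothetical.

```python
# Placeholder table: opcode -> operand byte count.
# Illustrative only; NOT verified Crusader operand formats.
OPERAND_WIDTHS = {0x0A: 1, 0x0B: 2, 0x7A: 0}


def decode_conservatively(body: bytes, widths=OPERAND_WIDTHS) -> dict:
    """Decode known ops, keep their raw bytes, and stop at the first unknown."""
    ops = []
    pos = 0
    stop_reason = "end_of_body"
    while pos < len(body):
        opcode = body[pos]
        width = widths.get(opcode)
        if width is None:
            stop_reason = "unknown_opcode"
            break
        end = pos + 1 + width
        if end > len(body):
            stop_reason = "truncated_operand"
            break
        # Raw bytes are kept for every parsed op so the listing stays lossless.
        ops.append({"offset": pos, "opcode": opcode, "raw": body[pos:end].hex()})
        pos = end
    return {
        "ops": ops,
        "stop_reason": stop_reason,
        "stop_offset": pos,
        "tail": body[pos:].hex(),  # undecoded remainder, preserved verbatim
    }
```

Concatenating every op's raw bytes with the tail reproduces the original body, which gives a cheap round-trip check on the decoder.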

4. Emit canonical IR v1

The parser output should be one machine-friendly object that includes:

source artifact metadata
class metadata
slot/event metadata
exact op list with raw bytes
annotation hints for compiled-side VM anchors
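
One possible shape for that object is sketched below. Apart from annotation_hints and the derived_body_* and raw_event_entry_word names, every key is a hypothetical choice made for illustration, and the single op entry is a placeholder rather than real parser output.

```python
import json


def build_ir_v1(source, class_meta, slot_meta, ops, annotation_hints):
    """Assemble one IR object; the key names are illustrative, not a fixed schema."""
    return {
        "ir_version": 1,
        "source": source,
        "class": class_meta,
        "slot": slot_meta,
        "ops": ops,
        "annotation_hints": annotation_hints,
    }


ir = build_ir_v1(
    source={"chunk_dir": "USECODE/EUSECODE_extracted/chunks/"},
    class_meta={"name": "NPCTRIG"},
    slot_meta={
        "slot": 0x0A,
        "raw_event_entry_word": 0x013E,
        "derived_body_start": 0x00DA,
        "derived_body_end": 0x024F,
    },
    ops=[{"offset": 0, "opcode": 0x7A, "raw": "7a"}],  # placeholder op
    annotation_hints=[{"anchor": "000d:51fd", "note": "slot value load path"}],
)

# Sorted keys keep the JSON stable across runs, which helps when diffing IR files.
text = json.dumps(ir, indent=2, sort_keys=True)
```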

5. Feed Ghidra with annotations, not with fake code yet

The first Ghidra-side use should be comments, bookmarks, and cross-reference notes on the compiled VM functions.

Do not try to map the bytecode into a full processor module first.

Proof-Of-Concept Parser

Tool path:

tools/poc_crusader_usecode_parser.py

Current scope:

uses the extracted TSV and chunk artifacts already in the repo
disassembles one selected class/slot body at a time
emits canonical IR JSON
optionally emits a readable text listing beside the JSON

Current deliberate limits:

no full intrinsic name table yet
no synthetic control-flow graph yet
no recompilation path yet
no Ghidra importer yet

That keeps the parser useful without pretending the VM is fully solved.

Canonical Ghidra Annotation Import Path

The first importer should consume the parser IR and create only three kinds of output.

1. Bookmarks

Use bookmarks for class/slot-level evidence that should not be hidden inside comments.

Good first bookmark payloads:

NPCTRIG slot 0x0A body parsed by POC tool
EVENT slot 0x0A body parsed by POC tool
slot 0x13 payload-shape hint = signed_word

2. Plate or decompiler comments on compiled anchors

Use comments on the compiled runtime functions that already consume or materialize the USECODE bodies.

Best current anchors:

000d:51fd = slot value load path
000d:5572 = slot value plus additive word
000d:46ec = context create from slot index
000d:22bc = decoded matrix/pushback consumer
000d:ebe3 = opcode sequence runner

Comment payload should stay short and evidence-heavy, for example:

POC USECODE body anchor: NPCTRIG slot 0x0A -> body 0x00DA..0x024F, raw word 0x013E, payload shape unresolved, parsed via tools/poc_crusader_usecode_parser.py
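
A small formatter can keep such payloads uniform. The sketch below reproduces the example string above; the function names and parameters are hypothetical, and the anchor parser only assumes the segment:offset hex notation used in the anchor list.

```python
def format_anchor_comment(class_name, slot, body_start, body_end, raw_word,
                          payload_shape="unresolved",
                          tool="tools/poc_crusader_usecode_parser.py"):
    """Build a short, evidence-heavy comment string for a compiled anchor."""
    return (
        f"POC USECODE body anchor: {class_name} slot 0x{slot:02X} -> "
        f"body 0x{body_start:04X}..0x{body_end:04X}, raw word 0x{raw_word:04X}, "
        f"payload shape {payload_shape}, parsed via {tool}"
    )


def parse_anchor(text: str) -> tuple:
    """Split a 'segment:offset' hex anchor such as '000d:51fd' into two ints."""
    segment, offset = text.split(":")
    return int(segment, 16), int(offset, 16)
```

For example, `format_anchor_comment("NPCTRIG", 0x0A, 0x00DA, 0x024F, 0x013E)` yields the payload shown above.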

3. Optional comment bundles per runtime family

If a later importer wants to annotate more than one function at once, keep it grouped by runtime family instead of by class name.

Examples:

slot-backed-owner-loaded-body
slot-plus-offset-value-reload
sequencer-matrix-consumer
literal-replay-interpreter-upstream
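
Grouping by family could be as simple as the sketch below. It assumes each hint carries a family key next to its anchor and comment; that field layout is hypothetical, and the note does not yet assign specific anchors to specific families.

```python
from collections import defaultdict


def bundle_by_family(hints):
    """Group annotation hints by runtime family instead of by class name."""
    bundles = defaultdict(list)
    for hint in hints:
        # Assumption: every hint has 'family', 'anchor', and 'comment' keys.
        bundles[hint["family"]].append(
            {"anchor": hint["anchor"], "comment": hint["comment"]}
        )
    # Return a plain dict with sorted family names for stable output.
    return {family: bundles[family] for family in sorted(bundles)}
```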

Why Not A Ghidra Processor Yet

The missing pieces are still too important:

full opcode semantics are incomplete
stack and return discipline are incomplete
the relation between owner-loaded body bytes and the later 000c:fa2f literal/replay lane is still not closed end-to-end
the upstream selector into entity_vm_opcode_sequence_run is still unresolved

So the right order is:

parser
IR
annotation import
only then reconsider a language module

User Workflow

Run the proof-of-concept parser from the repo root.

Example:

c:/Users/Maddo/.PYENV/PYENV-WIN/versions/3.14.3/python.exe tools/poc_crusader_usecode_parser.py --class NPCTRIG --slot 0x0A --emit-text

Recommended first targets:

NPCTRIG slot 0x0A
NPCTRIG slot 0x20
EVENT slot 0x0A
one _BOOT slot 0x10 body as a short repeated-template control sample

What to look for in the output:

exact raw body window
whether the body terminates cleanly at opcode 0x7A
body-local call targets and global-address ops
repeated structural motifs that can be carried back into the VM notes
anchor hints for the compiled runtime functions

Next Extensions

Add the full Crusader intrinsic-name table from Pentagram as hint-only metadata.
Emit repeated-body family diffs directly from the parser instead of only from the extractor reports.
Add a small importer that converts annotation_hints into Ghidra comments and bookmarks.
Extend the IR with control-flow edges only after branch/jump confidence is high enough.
Tie parser output back to the current slot/additive runtime tuples used in the compiled VM lane.
