Crusader_Decomp/docs/usecode-tooling-comparison.md
2026-03-25 23:32:36 +01:00

USECODE Tooling Comparison

Purpose

This note compares three different USECODE-facing tool lines now in use around the Crusader work:

  1. Pentagram's built-in Crusader usecode converter/disassembler
  2. the local crusader-disasm corpus and helper scripts
  3. the current workspace parser/decompiler in tools/poc_crusader_usecode_parser.py

The goal is not to rank them abstractly. The goal is to state what each one is actually good at, what assumptions it bakes in, and why the current local parser had to diverge.

Short version

Pentagram is a game-engine-side disassembler/converter with generic Crusader hooks.

crusader-disasm is mostly a generated disassembly corpus plus small maintenance scripts that mine or preserve information from that corpus.

Our current parser is the first tool in this workspace built explicitly around the validated owner-loaded EUSECODE.FLX structure recovered from the retail binary, and the first to carry that structure through to readable pseudocode export.

Pentagram: what it does

The relevant Pentagram pieces are:

  • convert/crusader/ConvertUsecodeCrusader.h
  • convert/Convert.h
  • tools/disasm/Disasm.cpp
  • usecode/UsecodeFlex.cpp

Pentagram's model

Pentagram is trying to solve a different problem from our current script. It is not primarily a workspace extraction/decompilation pipeline. It is an engine-aware converter/disassembler that sits on top of Pentagram's own USECODE model.

Its Crusader-specific logic provides:

  • an event-name table for slots 0x00..0x1f
  • an intrinsic-name table
  • a Crusader header reader
  • Crusader event-table decoding through readevents
  • Crusader opcode parsing by routing into the generic readOpGeneric(..., crusader=true) path

What Pentagram assumes

Pentagram's class/container assumptions come from its own UsecodeFlex and converter model:

  • class bodies are addressed as object classid + 2
  • class names come from object 1
  • the Crusader base offset is read from bytes 8..11 and then decremented by 1
  • event count is derived as (base_offset + 19) / 6
  • disassembly is driven from the converter header and event table, not from our later owner-loaded extractor outputs

That is close enough to be extremely useful, but it is not the same as the now-validated local owner-loaded reading we use in this repo.
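
The two header derivations in the list above can be sketched as follows. This is a minimal illustration, not Pentagram's code: the function names are hypothetical, and the little-endian, unsigned reading of bytes 8..11 is an assumption that the list itself does not state.

```python
import struct


def crusader_base_offset(class_body: bytes) -> int:
    """Read the 32-bit value at bytes 8..11 and decrement it by 1."""
    # Assumption: the field is little-endian and unsigned.
    (raw,) = struct.unpack_from("<I", class_body, 8)
    return raw - 1


def crusader_event_count(class_body: bytes) -> int:
    """Derive the event count as (base_offset + 19) / 6 with integer division."""
    return (crusader_base_offset(class_body) + 19) // 6


# Synthetic 12-byte header (not real game data): bytes 8..11 hold 0x2A.
header = bytes(8) + struct.pack("<I", 0x2A)
print(crusader_base_offset(header))  # 41
print(crusader_event_count(header))  # 10
```

Keeping the two derivations in separate functions makes it easy to compare each one against the locally validated reading without touching the other.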

What Pentagram outputs well

Pentagram is strong at:

  • linear opcode disassembly
  • printing BP/SP-relative references in a readable way
  • mapping class/slot offsets to event names
  • following opcode 0x5C symbol-info records into trailing local/debug symbol data
  • printing those debug symbols after the code body

The JELYHACK example is a good illustration. Pentagram's disassembly prints:

Func_1 (Event 1) JELYHACK::use():
    0001: 5A init 00
    0003: 5C symbol info offset 001Ch = "JELYHACK"
    000F: 0B push 0207h
    0012: 40 push dword [BP+06h]
    0014: 4C push indirect 02h bytes
    0016: 77 set info
    0017: 78 process exclude
    0018: 5B line number 219 (00DBh)
    001B: 50 ret
00: 01 type=69 (i) [BP+00h] (00) 00 referent
    002A: 7A end

That listing is still one of the clearest pieces of evidence that the post-ret region contains local/debug-style metadata, not active control flow.

Where Pentagram stops short for this repo

Pentagram is not built around our current local needs:

  • it does not consume class_layout_index.tsv, class_event_index.tsv, or the extracted chunk corpus
  • it does not expose a workspace-friendly IR
  • it does not attach our verified runtime anchors from runtime_vm_ir.tsv
  • it does not export batch pseudocode for the whole EUSECODE corpus
  • it still reflects a converter/disassembler view, not a readability-first decompiler view
  • its Crusader intrinsic table is explicitly mixed with Regret-era knowledge, so it is useful as a hint table, not as a rename authority

So Pentagram gave us crucial structure and vocabulary, but not the repo-specific decompilation pipeline we needed.

crusader-disasm: what it does

The local crusader-disasm tree is different again. It is not one coherent parser in the same way Pentagram is. It is a mixture of:

  • a large generated disassembly corpus in crusader_disasm.txt
  • opcode-name tables such as usecode_opcodes.txt
  • small maintenance scripts such as parse_crusader_disasm.py and update_disasm_comments.py
  • handwritten notes and side data gathered over time

What crusader-disasm is strongest at

Its biggest strength is that it is already a rich evidence corpus.

usecode_opcodes.txt gives a full opcode-name vocabulary such as:

  • 0x04 ASSIGN_MEMBER_CHAR
  • 0x10 NEAR_ROUTINE_CALL
  • 0x5C SYMBOL_INFO
  • 0x78 PROCESS_EXCLUDE
  • 0x7A END

That helped verify several names and fill decode gaps in our parser.
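
A name table like this is typically used as a lookup with a visible fallback, so that decode gaps stand out instead of disappearing. The sketch below contains only the five names quoted above; the `opcode_name` helper and the `UNKNOWN_xx` placeholder format are hypothetical, not part of crusader-disasm or of our parser.

```python
# Only the entries quoted from usecode_opcodes.txt above; the real table is larger.
OPCODE_NAMES = {
    0x04: "ASSIGN_MEMBER_CHAR",
    0x10: "NEAR_ROUTINE_CALL",
    0x5C: "SYMBOL_INFO",
    0x78: "PROCESS_EXCLUDE",
    0x7A: "END",
}


def opcode_name(opcode: int) -> str:
    """Return the known name, or a placeholder that keeps a decode gap visible."""
    return OPCODE_NAMES.get(opcode, f"UNKNOWN_{opcode:02X}")


# 0x50 is not in the excerpt above, so it falls through to the placeholder.
for op in (0x5C, 0x78, 0x7A, 0x50):
    print(f"{op:02X} {opcode_name(op)}")
```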

The generated crusader_disasm.txt is also valuable because it shows concrete output form, not just names. It proved things like:

  • how symbol info is rendered
  • where local/debug symbol rows appear
  • what a tiny body like JELYHACK::use looks like in a traditional disassembly listing

What the helper scripts actually do

The helper scripts in crusader-disasm are narrow and pragmatic.

parse_crusader_disasm.py:

  • scans an already-generated crusader_disasm.txt
  • looks for calli lines, nearby add sp, and retval pushes
  • infers rough intrinsic prototypes from the text listing
  • emits a guessed intrinsic table

That means it is not parsing EUSECODE.FLX directly. It is mining structure from a pre-rendered textual disassembly.

update_disasm_comments.py:

  • merges comments from an older disassembly into an updated regenerated one
  • preserves manual annotations when intrinsic names change

So this is again a maintenance aid around a text corpus, not a first-principles byte parser.

Where crusader-disasm stops short for this repo

crusader-disasm is excellent evidence, but weak as a live decompilation pipeline:

  • it does not operate on our extracted owner-loaded chunk/index data
  • it does not produce structured IR
  • it does not know our validated body windows from class_event_index.tsv
  • it does not emit script/pseudocode views
  • it does not integrate runtime-anchor hints from the current RE notes
  • some of its information is annotation-quality and corpus-quality rather than machine-robust parser output

In practice, crusader-disasm has been most useful as a vocabulary/evidence source, not as the final tool we run to generate the readable corpus.

Our current parser/decompiler: what it does differently

The current local tool line is centered on:

  • tools/extract_eusecode_flx.py
  • tools/poc_crusader_usecode_parser.py
  • tools/export_usecode_pseudocode.py

1. It is built around the validated owner-loaded local format

This is the biggest difference.

Our parser does not start from Pentagram's generic converter header model or from a pre-rendered disassembly text file. It starts from the extracted local artifacts and the currently validated retail-binary understanding:

  • class_id + 2 body lookup
  • bytes 8..11 treated as the first code-byte anchor / code_base_minus_one basis
  • 6-byte event rows at +20
  • derived body ranges emitted into class_event_index.tsv
  • chunk files under USECODE/EUSECODE_extracted/chunks/

That is why it can decompile the actual extracted corpus in a repeatable workspace-local way.
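
As a rough sketch of the lookup and row layout listed above: the helper names are hypothetical, the row count is supplied by the caller, and the fields inside each 6-byte row are deliberately left undecoded because this note does not specify them.

```python
from typing import Iterator, Tuple

EVENT_TABLE_START = 20  # event rows begin at +20 in the class body
EVENT_ROW_SIZE = 6      # each event row is 6 bytes


def body_object_index(class_id: int) -> int:
    """Class bodies are looked up at object index class_id + 2."""
    return class_id + 2


def iter_event_rows(class_body: bytes, row_count: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (slot, raw 6-byte row) pairs without interpreting the row fields."""
    for slot in range(row_count):
        start = EVENT_TABLE_START + slot * EVENT_ROW_SIZE
        row = class_body[start:start + EVENT_ROW_SIZE]
        if len(row) < EVENT_ROW_SIZE:
            break  # truncated chunk: stop rather than yield a short row
        yield slot, row


# Synthetic chunk (not real game data): 20 header bytes plus two complete rows.
chunk = bytes(20) + bytes(range(12))
rows = list(iter_event_rows(chunk, 3))
print(len(rows))  # 2, because the third requested row is missing
```

Stopping at the first short row keeps a truncated chunk from producing a misleading partial record.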

2. It separates authoritative IR from readable views

Pentagram and crusader-disasm mostly produce one human-facing linear listing.

Our parser deliberately splits output into layers:

  • JSON IR for machine-facing structure
  • flat text listing for byte-faithful decode
  • script view for stack-machine readability
  • pseudocode view for programming-language-like readability
  • batch export of that pseudocode corpus into USECODE/EUSECODE_extracted/pseudocode

That separation is what let us make JELYHACK readable without losing the exact bytes and trailer structure.
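
The layering idea can be illustrated with a single record type that feeds more than one view. This is a sketch only: the `Instruction` fields and both render functions are hypothetical and do not reflect the actual IR schema of tools/poc_crusader_usecode_parser.py.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Instruction:
    offset: int   # offset within the class body
    opcode: int   # raw opcode byte
    text: str     # decoded mnemonic and operands


def to_json_ir(instructions: List[Instruction]) -> str:
    """Machine-facing view: every field is preserved as structured data."""
    return json.dumps([asdict(i) for i in instructions], indent=2)


def to_listing(instructions: List[Instruction]) -> str:
    """Human-facing view: one byte-faithful line per instruction."""
    return "\n".join(f"{i.offset:04X}: {i.opcode:02X} {i.text}" for i in instructions)


# Two rows taken from the JELYHACK listing earlier in this note.
body = [
    Instruction(0x0017, 0x78, "process exclude"),
    Instruction(0x001B, 0x50, "ret"),
]
print(to_listing(body))
print(to_json_ir(body))
```

Because both views are derived from the same records, a readability change in one view cannot silently alter the bytes reported by the other.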

3. It handles post-ret metadata differently

Pentagram already knew about debug symbols through 0x5C and readDbgSymbols().

The important difference is that our parser had to make that logic safe in the extracted-corpus setting:

  • it now detects ret-anchored debug/local trailers explicitly
  • it avoids mis-decoding those bytes as live opcodes on bodies like NPCTRIG 0x0A
  • it exposes debug symbols in the IR and readable views
  • it now hides dead post-return junk from the human pseudocode when readability matters more than raw listing fidelity

So Pentagram gave the structural clue, but our parser had to adapt it to the owner-loaded extracted corpus and to the readability-first output mode.
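
One way to picture a ret-anchored check uses the JELYHACK numbers from the listing earlier in this note: the 0x5C record prints offset 001Ch, which is the byte immediately after the ret at 001B, and the end marker sits at 002A. The rule below (a ret must end exactly at the symbol-info target) is an illustrative heuristic with hypothetical names, not a description of what the parser actually implements.

```python
from typing import List, Optional, Tuple

RET = 0x50

# (offset, opcode, size) rows; sizes are the gaps between listed offsets.
JELYHACK_OPS: List[Tuple[int, int, int]] = [
    (0x0001, 0x5A, 2),
    (0x0003, 0x5C, 12),
    (0x000F, 0x0B, 3),
    (0x0012, 0x40, 2),
    (0x0014, 0x4C, 2),
    (0x0016, 0x77, 1),
    (0x0017, 0x78, 1),
    (0x0018, 0x5B, 3),
    (0x001B, 0x50, 1),
]


def debug_trailer_range(
    ops: List[Tuple[int, int, int]], symbol_target: int, end_offset: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the trailer if some ret ends exactly at symbol_target."""
    for offset, opcode, size in ops:
        if opcode == RET and offset + size == symbol_target:
            return symbol_target, end_offset
    return None  # not ret-anchored: leave the bytes to the normal decoder


print(debug_trailer_range(JELYHACK_OPS, 0x001C, 0x002A))  # (28, 42)
```

Returning None when no ret lines up with the target keeps the check conservative: bytes are only treated as metadata when both signals agree.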

4. It adds runtime cross-reference hints that the older tools do not

Our parser attaches the verified runtime bridge information from runtime_vm_ir.tsv and related notes, such as:

  • 000d:0988
  • 000d:177c
  • 000d:1acb
  • 000d:208b
  • 000d:21ed
  • 000d:22bc
  • 000d:2104
  • 000d:46ec
  • 000d:ebe3

Neither Pentagram nor crusader-disasm does that kind of live, repo-specific runtime correlation.

5. It is aimed at whole-corpus readability, not only opcode fidelity

This is the most visible practical difference.

Pentagram and crusader-disasm are good at telling you what bytes and opcodes are present.

Our current script is trying to answer a different question too:

What does this class body seem to do, in language a human can scan?

That is why the current parser now:

  • names locals where the debug trailer provides them
  • folds compare ladders into if / else if
  • suppresses dead post-ret tail noise in pseudocode
  • exports the whole decoded corpus into per-class pseudocode files

That is the main place where our script now goes beyond the older tools.
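
The rendering half of compare-ladder folding can be sketched as below. It assumes the ladder has already been recognized and reduced to (condition, statements) pairs. Recognizing the ladder in decoded stack code is not shown, and the function name, input shape, and example identifiers are all hypothetical.

```python
from typing import List, Optional, Tuple


def fold_ladder(
    branches: List[Tuple[str, List[str]]], default: Optional[List[str]] = None
) -> str:
    """Render already-recognized compare branches as an if / else if chain."""
    if not branches:
        return "\n".join(default or [])
    lines: List[str] = []
    for index, (condition, body) in enumerate(branches):
        opener = "if" if index == 0 else "} else if"
        lines.append(f"{opener} ({condition}) {{")
        lines.extend("    " + stmt for stmt in body)
    if default is not None:
        lines.append("} else {")
        lines.extend("    " + stmt for stmt in default)
    lines.append("}")
    return "\n".join(lines)


# Invented identifiers, purely to show the output shape.
print(fold_ladder(
    [("local_1 == 1", ["open_door();"]), ("local_1 == 2", ["close_door();"])],
    default=["return;"],
))
```

Keeping rendering separate from recognition means the heuristic that detects a ladder can change without affecting how the folded result is printed.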

What the older tools still do better

This is not a one-way replacement story.

Pentagram still does some things better than our current script:

  • a broader, more mature generic opcode conversion framework
  • a cleaner historical disassembler path for symbol-info and debug-symbol printing
  • a converter architecture that already knows how to build node-like structures for many ops

crusader-disasm still does some things better too:

  • a richer, long-lived annotation corpus
  • a larger existing body of older naming/vocabulary experiments
  • a direct opcode-name table from a distinct extraction route
  • concrete disassembly output that is sometimes easier to cross-check than a newer heuristic pseudocode layer

So the best current workflow is still hybrid:

  • use Pentagram for structural/reference behavior
  • use crusader-disasm for opcode vocabulary and corpus evidence
  • use the local parser for validated owner-loaded extraction, IR, pseudocode, and batch readability export

Best current summary

Pentagram is a converter/disassembler.

crusader-disasm is a disassembly corpus with helper scripts.

Our script is the first repo-local tool that is explicitly trying to be a readable decompiler over the validated extracted EUSECODE corpus.

That is why the current parser looks less like a classic disassembler and more like a layered RE workbench:

  • extractor-backed local format understanding
  • structured IR
  • byte-faithful listing
  • readability-first script/pseudocode views
  • batch corpus export
  • runtime-annotation hints tied to the current Crusader notes

The tradeoff is that our current script is newer and more heuristic. It is better at producing something a human can read across the whole corpus, but it is not yet as mature or as battle-tested at raw opcode coverage as the older reference tools.