Stuff

2026-03-25 23:32:36 +01:00 · 2026-03-25 23:32:36 +01:00 · f92d1504fa
commit f92d1504fa
parent ee33f94b4b
547 changed files with 37597 additions and 0 deletions
--- a/docs/usecode-tooling-comparison.md
+++ b/docs/usecode-tooling-comparison.md
@@ -0,0 +1,282 @@
+# USECODE Tooling Comparison
+
+## Purpose
+
+This note compares the three USECODE-facing tool lines currently used in the Crusader work:
+
+1. Pentagram's built-in Crusader usecode converter/disassembler
+2. the local `crusader-disasm` corpus and helper scripts
+3. the current workspace parser/decompiler in `tools/poc_crusader_usecode_parser.py`
+
+The goal is not to rank them abstractly. The goal is to state what each one is actually good at, what assumptions it bakes in, and why the current local parser had to diverge.
+
+## Short version
+
+Pentagram is a game-engine-side disassembler/converter with generic Crusader hooks.
+
+`crusader-disasm` is mostly a generated disassembly corpus plus small maintenance scripts that mine or preserve information from that corpus.
+
+Our current parser is the first tool in this workspace built explicitly around the validated owner-loaded `EUSECODE.FLX` structure recovered from the retail binary, and the first to carry that structure through to readable pseudocode export.
+
+## Pentagram: what it does
+
+The relevant Pentagram pieces are:
+
+- `convert/crusader/ConvertUsecodeCrusader.h`
+- `convert/Convert.h`
+- `tools/disasm/Disasm.cpp`
+- `usecode/UsecodeFlex.cpp`
+
+### Pentagram's model
+
+Pentagram is trying to solve a different problem from our current script. It is not primarily a workspace extraction/decompilation pipeline. It is an engine-aware converter/disassembler that sits on top of Pentagram's own USECODE model.
+
+Its Crusader-specific logic provides:
+
+- an event-name table for slots `0x00..0x1f`
+- an intrinsic-name table
+- a Crusader header reader
+- Crusader event-table decoding through `readevents`
+- Crusader opcode parsing by routing into the generic `readOpGeneric(..., crusader=true)` path
+
+### What Pentagram assumes
+
+Pentagram's class/container assumptions come from its own `UsecodeFlex` and converter model:
+
+- class bodies are addressed as object `classid + 2`
+- class names come from object `1`
+- the Crusader base offset comes from bytes `8..11`, then decremented by `1`
+- event count is derived as `(base_offset + 19) / 6`
+- disassembly is driven from the converter header and event table, not from our later owner-loaded extractor outputs
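+
+A minimal sketch of the base-offset and event-count rules listed above. It is an illustration, not Pentagram's code: the function names are hypothetical, and both little-endian byte order and truncating integer division are assumptions this note does not state explicitly.
+
+```python
+import struct
+
+
+def class_base_offset(body: bytes) -> int:
+    """Read bytes 8..11 as an unsigned 32-bit value, then subtract 1.
+
+    Little-endian order is an assumption, not something this note states.
+    """
+    (raw,) = struct.unpack_from("<I", body, 8)
+    return raw - 1
+
+
+def class_event_count(body: bytes) -> int:
+    """Apply the (base_offset + 19) / 6 rule using integer division."""
+    return (class_base_offset(body) + 19) // 6
+
+
+# Synthetic 12-byte header: bytes 8..11 hold the value 30.
+header = bytes(8) + struct.pack("<I", 30)
+print(class_base_offset(header), class_event_count(header))
+```
+
+With this synthetic header the base offset is 29 and the derived event count is 8.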
+
+That is close enough to be extremely useful, but it is not the same as the now-validated local owner-loaded reading we use in this repo.
+
+### What Pentagram outputs well
+
+Pentagram is strong at:
+
+- linear opcode disassembly
+- printing BP/SP-relative references in a readable way
+- mapping class/slot offsets to event names
+- following opcode `0x5C` symbol-info records into trailing local/debug symbol data
+- printing those debug symbols after the code body
+
+The JELYHACK example is a good illustration. Pentagram's disassembly prints:
+
+```text
+Func_1 (Event 1) JELYHACK::use():
+    0001: 5A init 00
+    0003: 5C symbol info offset 001Ch = "JELYHACK"
+    000F: 0B push 0207h
+    0012: 40 push dword [BP+06h]
+    0014: 4C push indirect 02h bytes
+    0016: 77 set info
+    0017: 78 process exclude
+    0018: 5B line number 219 (00DBh)
+    001B: 50 ret
+00: 01 type=69 (i) [BP+00h] (00) 00 referent
+    002A: 7A end
+```
+
+That is still one of the clearest proofs that the post-`ret` region contains local/debug-style metadata, not active control flow.
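+
+The split between live code and the post-`ret` region can be checked mechanically against the listing above. The sketch below is not part of any of the three tools: `split_at_ret` is a hypothetical helper, and the row pattern is inferred only from the JELYHACK output shown here.
+
+```python
+import re
+
+LISTING = """\
+    0001: 5A init 00
+    0003: 5C symbol info offset 001Ch = "JELYHACK"
+    000F: 0B push 0207h
+    0012: 40 push dword [BP+06h]
+    0014: 4C push indirect 02h bytes
+    0016: 77 set info
+    0017: 78 process exclude
+    0018: 5B line number 219 (00DBh)
+    001B: 50 ret
+00: 01 type=69 (i) [BP+00h] (00) 00 referent
+    002A: 7A end
+"""
+
+# Matches indented rows such as "    0012: 40 push dword [BP+06h]".
+ROW = re.compile(r"^\s+([0-9A-F]{4}): ([0-9A-F]{2}) (.+)$")
+
+
+def split_at_ret(listing: str):
+    """Return (code_rows, trailer_lines), splitting after the first `ret`."""
+    code, trailer = [], []
+    seen_ret = False
+    for line in listing.splitlines():
+        if seen_ret:
+            # Everything after `ret` is kept as raw text, not decoded.
+            trailer.append(line.strip())
+            continue
+        match = ROW.match(line)
+        if match is None:
+            continue
+        offset, opcode, text = match.groups()
+        code.append((int(offset, 16), int(opcode, 16), text))
+        seen_ret = opcode == "50"  # 0x50 is printed as `ret` in this listing
+    return code, trailer
+
+
+code_rows, trailer_lines = split_at_ret(LISTING)
+```
+
+Here `code_rows` ends at the `ret` row at `001B`, and `trailer_lines` holds the symbol row and the closing `end` row as undecoded text.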
+
+### Where Pentagram stops short for this repo
+
+Pentagram is not built around our current local needs:
+
+- it does not consume `class_layout_index.tsv`, `class_event_index.tsv`, or the extracted chunk corpus
+- it does not expose a workspace-friendly IR
+- it does not attach our verified runtime anchors from `runtime_vm_ir.tsv`
+- it does not export batch pseudocode for the whole `EUSECODE` corpus
+- it still reflects a converter/disassembler view, not a readability-first decompiler view
+- its Crusader intrinsic table is explicitly mixed with Regret-era knowledge and is useful as a hint table, not rename authority
+
+So Pentagram gave us crucial structure and vocabulary, but not the repo-specific decompilation pipeline we needed.
+
+## crusader-disasm: what it does
+
+The local `crusader-disasm` tree is different again. It is not one coherent parser in the same way Pentagram is. It is a mixture of:
+
+- a large generated disassembly corpus in `crusader_disasm.txt`
+- opcode-name tables such as `usecode_opcodes.txt`
+- small maintenance scripts such as `parse_crusader_disasm.py` and `update_disasm_comments.py`
+- handwritten notes and side data gathered over time
+
+### What `crusader-disasm` is strongest at
+
+Its biggest strength is that it is already a rich evidence corpus.
+
+`usecode_opcodes.txt` gives a full opcode-name vocabulary such as:
+
+- `0x04 ASSIGN_MEMBER_CHAR`
+- `0x10 NEAR_ROUTINE_CALL`
+- `0x5C SYMBOL_INFO`
+- `0x78 PROCESS_EXCLUDE`
+- `0x7A END`
+
+That helped verify several names and fill decode gaps in our parser.
+
+The generated `crusader_disasm.txt` is also valuable because it shows concrete output form, not just names. It proved things like:
+
+- how `symbol info` is rendered
+- where local/debug symbol rows appear
+- what a tiny body like `JELYHACK::use` looks like in a traditional disassembly listing
+
+### What the helper scripts actually do
+
+The helper scripts in `crusader-disasm` are narrow and pragmatic.
+
+`parse_crusader_disasm.py`:
+
+- scans an already-generated `crusader_disasm.txt`
+- looks for `calli` lines, nearby `add sp`, and retval pushes
+- infers rough intrinsic prototypes from the text listing
+- emits a guessed intrinsic table
+
+That means it is not parsing `EUSECODE.FLX` directly. It is mining structure from a pre-rendered textual disassembly.
+
+`update_disasm_comments.py`:
+
+- merges comments from an older disassembly into an updated regenerated one
+- preserves manual annotations when intrinsic names change
+
+So this is again a maintenance aid around a text corpus, not a first-principles byte parser.
+
+### Where `crusader-disasm` stops short for this repo
+
+`crusader-disasm` is excellent evidence, but weak as a live decompilation pipeline:
+
+- it does not operate on our extracted owner-loaded chunk/index data
+- it does not produce structured IR
+- it does not know our validated body windows from `class_event_index.tsv`
+- it does not emit script/pseudocode views
+- it does not integrate runtime-anchor hints from the current RE notes
+- some of its information is annotation-quality and corpus-quality rather than machine-robust parser output
+
+In practice, `crusader-disasm` has been most useful as a vocabulary/evidence source, not as the final tool we run to generate the readable corpus.
+
+## Our current parser/decompiler: what it does differently
+
+The current local tool line is centered on:
+
+- `tools/extract_eusecode_flx.py`
+- `tools/poc_crusader_usecode_parser.py`
+- `tools/export_usecode_pseudocode.py`
+
+### 1. It is built around the validated owner-loaded local format
+
+This is the biggest difference.
+
+Our parser does not start from Pentagram's generic converter header model or from a pre-rendered disassembly text file. It starts from the extracted local artifacts and the currently validated retail-binary understanding:
+
+- `class_id + 2` body lookup
+- bytes `8..11` treated as the first code-byte anchor / `code_base_minus_one` basis
+- 6-byte event rows at `+20`
+- derived body ranges emitted into `class_event_index.tsv`
+- chunk files under `USECODE/EUSECODE_extracted/chunks/`
+
+That is why it can decompile the actual extracted corpus in a repeatable workspace-local way.
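+
+As a rough illustration of the event-table geometry, the sketch below slices raw 6-byte rows starting at `+20`. The function name is hypothetical, the row count is supplied by the caller, and the rows stay undecoded because their field layout is not described in this note.
+
+```python
+EVENT_TABLE_OFFSET = 20  # rows start at +20, per the list above
+EVENT_ROW_SIZE = 6       # each row is 6 bytes
+
+
+def event_rows(body: bytes, count: int) -> list[bytes]:
+    """Slice `count` raw 6-byte event rows out of a class body.
+
+    Rows are returned as opaque bytes; no field layout is assumed.
+    """
+    end = EVENT_TABLE_OFFSET + count * EVENT_ROW_SIZE
+    if end > len(body):
+        raise ValueError("event table runs past the end of the class body")
+    return [
+        body[start:start + EVENT_ROW_SIZE]
+        for start in range(EVENT_TABLE_OFFSET, end, EVENT_ROW_SIZE)
+    ]
+
+
+# Synthetic body: 20 header bytes followed by two recognisable rows.
+body = bytes(20) + bytes(range(12))
+rows = event_rows(body, 2)
+```
+
+Keeping the rows opaque at this stage means the slicing step cannot disagree with whatever decoding the real extractor applies later.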
+
+### 2. It separates authoritative IR from readable views
+
+Pentagram and `crusader-disasm` mostly produce one human-facing linear listing.
+
+Our parser deliberately splits output into layers:
+
+- JSON IR for machine-facing structure
+- flat text listing for byte-faithful decode
+- script view for stack-machine readability
+- pseudocode view for programming-language-like readability
+- batch export of that pseudocode corpus into `USECODE/EUSECODE_extracted/pseudocode`
+
+That separation is what let us make JELYHACK readable without losing the exact bytes and trailer structure.
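+
+The layering idea can be sketched with one in-memory record rendered two ways. This is not the schema used by `tools/poc_crusader_usecode_parser.py`: the `Instr` class, its field names, and both renderers are hypothetical.
+
+```python
+import json
+from dataclasses import asdict, dataclass
+
+
+@dataclass
+class Instr:
+    offset: int
+    opcode: int
+    text: str
+
+
+def to_ir_json(instrs: list[Instr]) -> str:
+    """Machine-facing view: every field kept as structured data."""
+    return json.dumps([asdict(i) for i in instrs])
+
+
+def to_listing(instrs: list[Instr]) -> str:
+    """Flat view: one `OFFSET: OPCODE text` row per instruction."""
+    return "\n".join(f"{i.offset:04X}: {i.opcode:02X} {i.text}" for i in instrs)
+
+
+body = [Instr(0x000F, 0x0B, "push 0207h"), Instr(0x001B, 0x50, "ret")]
+print(to_ir_json(body))
+print(to_listing(body))
+```
+
+Because both views are derived from the same records, a readability change in one renderer cannot alter the bytes reported by the other.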
+
+### 3. It handles post-`ret` metadata differently
+
+Pentagram already knew about debug symbols through `0x5C` and `readDbgSymbols()`.
+
+The important difference is that our parser had to make that logic safe in the extracted-corpus setting:
+
+- it now detects ret-anchored debug/local trailers explicitly
+- it avoids mis-decoding those bytes as live opcodes on bodies like `NPCTRIG 0x0A`
+- it exposes debug symbols in the IR and readable views
+- it now hides dead post-return junk from the human pseudocode when readability matters more than raw listing fidelity
+
+So Pentagram gave the structural clue, but our parser had to adapt it to the owner-loaded extracted corpus and to the readability-first output mode.
+
+### 4. It adds runtime cross-reference hints that the older tools do not
+
+Our parser attaches the verified runtime bridge information from `runtime_vm_ir.tsv` and related notes, such as:
+
+- `000d:0988`
+- `000d:177c`
+- `000d:1acb`
+- `000d:208b`
+- `000d:21ed`
+- `000d:22bc`
+- `000d:2104`
+- `000d:46ec`
+- `000d:ebe3`
+
+Neither Pentagram nor `crusader-disasm` performs that kind of repo-specific runtime correlation.
+
+### 5. It is aimed at whole-corpus readability, not only opcode fidelity
+
+This is the most visible practical difference.
+
+Pentagram and `crusader-disasm` are good at telling you what bytes and opcodes are present.
+
+Our current script is trying to answer a different question too:
+
+`What does this class body seem to do, in language a human can scan?`
+
+That is why the current parser now:
+
+- names locals where the debug trailer provides them
+- folds compare ladders into `if / else if`
+- suppresses dead post-`ret` tail noise in pseudocode
+- exports the whole decoded corpus into per-class pseudocode files
+
+That is the main place where our script now goes beyond the older tools.
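+
+Local naming from the debug trailer can be pictured as a slot-to-name substitution. The sketch below is an assumption-heavy illustration rather than the parser's real logic: the symbol-row pattern is inferred from the single Pentagram row shown earlier in this note, the helper names are hypothetical, and the operand string being rewritten is made up.
+
+```python
+import re
+
+# Matches the slot and trailing name in a row like
+# "00: 01 type=69 (i) [BP+00h] (00) 00 referent".
+SYMBOL_ROW = re.compile(r"(\[BP[+-][0-9A-F]{2}h\]).*\s(\w+)$")
+
+
+def local_names(symbol_rows: list[str]) -> dict[str, str]:
+    """Map BP-relative slots to the names given by debug symbol rows."""
+    names = {}
+    for row in symbol_rows:
+        match = SYMBOL_ROW.search(row.strip())
+        if match:
+            slot, name = match.groups()
+            names[slot] = name
+    return names
+
+
+def name_locals(text: str, names: dict[str, str]) -> str:
+    """Replace each known slot reference in `text` with its local name."""
+    for slot, name in names.items():
+        text = text.replace(slot, name)
+    return text
+
+
+names = local_names(["00: 01 type=69 (i) [BP+00h] (00) 00 referent"])
+renamed = name_locals("push dword [BP+00h]", names)  # made-up operand
+```
+
+Slots without a matching symbol row are left untouched, so the readable view only claims a name where the trailer actually supplies one.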
+
+## What the older tools still do better
+
+This is not a one-way replacement story.
+
+Pentagram still does some things better than our current script:
+
+- a broader, more mature generic opcode conversion framework
+- a cleaner historical disassembler path for symbol-info and debug-symbol printing
+- a converter architecture that already knows how to build node-like structures for many ops
+
+`crusader-disasm` still does some things better too:
+
+- a richer, long-lived annotation corpus
+- a larger existing body of older naming/vocabulary experiments
+- a direct opcode-name table from a distinct extraction route
+- concrete disassembly output that is sometimes easier to cross-check than a newer heuristic pseudocode layer
+
+So the best current workflow is still hybrid:
+
+- use Pentagram for structural/reference behavior
+- use `crusader-disasm` for opcode vocabulary and corpus evidence
+- use the local parser for validated owner-loaded extraction, IR, pseudocode, and batch readability export
+
+## Best current summary
+
+Pentagram is a converter/disassembler.
+
+`crusader-disasm` is a disassembly corpus with helper scripts.
+
+Our script is the first repo-local tool that is explicitly trying to be a readable decompiler over the validated extracted `EUSECODE` corpus.
+
+That is why the current parser looks less like a classic disassembler and more like a layered RE workbench:
+
+- extractor-backed local format understanding
+- structured IR
+- byte-faithful listing
+- readability-first script/pseudocode views
+- batch corpus export
+- runtime-annotation hints tied to the current Crusader notes
+
+The tradeoff is that our current script is newer and more heuristic. It is better at producing something a human can read across the whole corpus, but it is not yet as mature or as battle-tested at raw opcode coverage as the older reference tools.