# Remorse C++ Decompilation Plan

## Goal

Turn the current evidence-backed Remorse decompilation into understandable, maintainable C++ source that can eventually be rebuilt into a working executable.

The important constraint is that this should be treated as a staged lift, not a direct dump of Ghidra pseudocode into a compiler. The shortest path to a recompilable result is to recover the original object model deliberately: class ownership, instance layouts, vtables, calling conventions, segmented-pointer rules, resource formats, and subsystem boundaries.
## Short Answer: Can Ghidra Be Made More Class-Aware?

Yes, but only partially and mostly through explicit modeling.

Ghidra can already represent a lot of what we need:

- class and namespace symbols in the Symbol Tree
- structs and unions in the Data Type Manager
- vtable data and typed function pointers
- method ownership through namespaces/classes
- `this`-pointer style signatures when the calling convention and object layout are known

What it does not do well here is infer all of that automatically from a 16-bit DOS binary with mixed C/C++ patterns, custom memory conventions, and incomplete original type information. For this project, class recovery has to be evidence-driven.
## Why The Shift Is Justified Now

The current notes already contain repeated object-oriented evidence, not just loose procedural code:

- constructor-style helpers that allocate, stamp a vtable, and zero instance state
- destructor or teardown paths that restore a base vtable and free owned buffers
- stable indirect dispatch through known vtable slots
- controller, entity, sprite-node, VM-context, and resource-helper families with repeatable instance fields
- several class-like clusters that already have better behavioral names than generic `FUN_...` placeholders

That is enough to start building a real C++ object model rather than treating the entire program as flat C with random function pointers.

Useful evidence anchors already in the repo include:

- `docs/ne-segment1.md` for entity, projectile, dialog, and sprite-adjacent object lanes
- `docs/raw-0008-000c.md` for constructor families, vtable-backed dispatch entries, VM/runtime helpers, and stateful controller objects
- `docs/raw-000a-000d.md` for loader/resource families, callback brokers, and teardown-heavy object lanes
- `docs/raw-porting-progress.md` for callback-object evidence and cross-segment vtable dispatch patterns
- `docs/far-call-targets.md` for high-frequency ctor/dtor/vtable-slot helpers
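
The constructor and teardown shapes described in this section can be written out in lowered, C-like C++ to make the pattern easier to recognize in decompiler output. This is a minimal sketch under stated assumptions: `Node`, `NodeVtbl`, `derived_create`, and every other name here are hypothetical and do not correspond to recovered Remorse symbols, and native pointers stand in for the original near/far pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical family for illustration only; none of these names are
// recovered Remorse symbols.
struct Node;

struct NodeVtbl {
    void (*teardown)(Node* self);   // slot 0
    int (*kind)(const Node* self);  // slot 1
};

struct Node {
    const NodeVtbl* vtbl;  // first field: the stamped vtable pointer
    std::uint16_t state;
    void* owned_buffer;
};

int base_kind(const Node*) { return 0; }
int derived_kind(const Node*) { return 1; }

void base_teardown(Node* self) {
    std::free(self->owned_buffer);  // free owned buffers
    self->owned_buffer = nullptr;
}

const NodeVtbl kBaseVtbl = {base_teardown, base_kind};

void derived_teardown(Node* self) {
    self->vtbl = &kBaseVtbl;  // restore the base vtable first
    base_teardown(self);      // then run the base teardown
}

const NodeVtbl kDerivedVtbl = {derived_teardown, derived_kind};

// Constructor-style helper: allocate, zero instance state, stamp the vtable.
Node* derived_create(std::size_t buffer_bytes) {
    Node* self = static_cast<Node*>(std::malloc(sizeof(Node)));
    if (self == nullptr) return nullptr;
    std::memset(self, 0, sizeof(Node));
    self->vtbl = &kDerivedVtbl;
    self->owned_buffer = std::malloc(buffer_bytes);
    return self;
}

// Virtual teardown through slot 0, then release the instance itself.
void node_destroy(Node* self) {
    assert(self != nullptr);
    self->vtbl->teardown(self);
    std::free(self);
}
```

A function that writes a different vtable address into the first field before calling into a shared teardown helper is the kind of evidence that supports a base/derived relationship, as opposed to two unrelated structs that happen to share a helper.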
## End State

The real target should be defined more tightly than `nice C++`:

1. major gameplay, rendering, UI, VM, and resource subsystems are expressed as named classes with understandable responsibilities
2. instance layouts and ownership rules are explicit enough that decompiled code stops depending on anonymous offset math for routine work
3. virtual dispatch is expressed through named methods or typed vtables rather than raw slot offsets
4. the source can be rebuilt with a documented toolchain into a working executable or an equivalent working runtime target
5. the rebuilt result is validated by behavior, not by cosmetic similarity to decompiler output
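
To illustrate item 3, the sketch below shows one way to give vtable slots names so that call sites stop carrying raw byte offsets. It is an assumption-laden example: the 4-byte slot width, the `EntitySlot` enum, and the slot names are all hypothetical, and each real vtable would need its slot width confirmed from the binary.

```cpp
#include <cstddef>

// Assumption for this sketch: every vtable slot is a 4-byte far code
// pointer. Real families may use near (2-byte) slots; confirm per vtable.
constexpr std::size_t kSlotBytes = 4;

// Hypothetical slot names; the real map comes from the Ghidra evidence.
enum class EntitySlot : std::size_t {
    Destroy = 0,
    Update = 1,
    Draw = 2,
};

// Byte offset that raw decompiler output would show for a named slot.
constexpr std::size_t slot_byte_offset(EntitySlot slot) {
    return static_cast<std::size_t>(slot) * kSlotBytes;
}

// True when a raw offset lands exactly on a slot boundary.
constexpr bool is_slot_aligned(std::size_t byte_offset) {
    return byte_offset % kSlotBytes == 0;
}

// Slot index for a raw `*(vtbl + offset)` access seen in pseudocode.
constexpr std::size_t slot_index(std::size_t byte_offset) {
    return byte_offset / kSlotBytes;
}
```

With a table like this, a pseudocode access at offset 8 can be reviewed as "slot 2" and renamed once the slot-to-method map for that family is confirmed.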
## Working Assumption About The Rebuild Target

There are two plausible endgames, and the plan should keep them separate from the start:

### Track A: Original-style executable rebuild

Rebuild a DOS executable that preserves the segmented-memory model, calling conventions, packed layouts, and resource/file expectations closely enough to run the original game data.

This is the harder target, but also the most historically direct one. It likely depends on recovering or emulating:

- the original or closest-possible compiler model
- near/far pointer conventions
- packed struct layout and enum sizes
- startup/runtime integration with the Phar Lap environment or an equivalent replacement layer
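
As one concrete example of the near/far pointer item, a far pointer can be kept as an explicit two-word value instead of being silently widened to a native pointer. The sketch below is illustrative only: `FarPtr16` and its helpers are hypothetical names, and it deliberately does not compute a linear address, because the segment word may be a protected-mode selector rather than a real-mode paragraph.

```cpp
#include <cstdint>

// 16-bit far pointer as stored in memory on x86: offset word first,
// then the segment (or selector) word.
struct FarPtr16 {
    std::uint16_t offset;
    std::uint16_t segment;
};
static_assert(sizeof(FarPtr16) == 4, "far pointer must stay 4 bytes");

// Split a raw 32-bit value (segment in the high word) into its parts.
constexpr FarPtr16 far_from_raw(std::uint32_t raw) {
    return FarPtr16{static_cast<std::uint16_t>(raw & 0xFFFFu),
                    static_cast<std::uint16_t>(raw >> 16)};
}

// Recombine the parts into the raw 32-bit form.
constexpr std::uint32_t far_to_raw(FarPtr16 p) {
    return (static_cast<std::uint32_t>(p.segment) << 16) | p.offset;
}

// Null check on both words.
constexpr bool far_is_null(FarPtr16 p) {
    return p.segment == 0 && p.offset == 0;
}
```

Keeping the segment and offset separate preserves the information Track A needs while still letting a Track B build map the pair onto whatever handle or pointer scheme it chooses.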
### Track B: Behaviorally equivalent source port

Rebuild the game logic in modern C++ while preserving data formats and behavior, but not necessarily the original binary ABI.

This is often the faster path to a working recompiled game, but it is a different goal. If the project wants a true executable reconstruction rather than an engine rewrite, Track A has to remain the primary constraint.

For now, the safest planning stance is: recover source in a way that keeps both tracks open for as long as possible.
## Recommended Strategy

### Phase 0: Treat Ghidra As The Truth Database

Use Ghidra as the canonical place where recovered class ownership, vtable slots, field layouts, and method names live.

That means pushing beyond flat rename work into:

- class namespaces for object families
- typed instance structs
- typed vtable structs where the slots are stable enough
- method names that distinguish static helpers from instance methods
- explicit comments recording why a family is believed to be one class and not just one subsystem
### Phase 1: Recover The Object Model Before Chasing Pretty Output

Prioritize families that already have strong OO evidence.

Best early targets:

1. entity families in `seg001` and the raw/live `0007` lanes
2. dispatch-entry / controller objects in `0008` and `000c`
3. sprite-node and UI/menu object families
4. VM runtime, context, owner-resource, and loader helpers
5. callback/resource broker objects around `0x4588`

For each candidate class family, the minimum closure should be:

- candidate class name
- constructor and destructor candidates
- instance size estimate
- confirmed or suspected vtable base
- known slot-to-method map
- field map with confidence levels
- inbound callers that prove object lifetime or ownership
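
The minimum-closure checklist above could be tracked as a small record per family so that incomplete candidates are easy to spot. This is only a sketch of one possible shape: `ClassCandidate`, `Confidence`, and `is_minimally_closed` are hypothetical names, and treating every item as mandatory is an interpretation of the list rather than a rule stated elsewhere in the notes.

```cpp
#include <cstdint>

enum class Confidence : std::uint8_t { Unknown, Suspected, Confirmed };

// Hypothetical record shape, one per candidate class family.
struct ClassCandidate {
    const char* name;                // candidate class name
    std::uint32_t ctor_addr;         // 0 when no constructor candidate yet
    std::uint32_t dtor_addr;         // 0 when no destructor candidate yet
    std::uint16_t size_estimate;     // instance size in bytes, 0 when unknown
    std::uint32_t vtable_addr;       // confirmed or suspected vtable base
    Confidence vtable_confidence;    // how well the vtable base is supported
    std::uint16_t mapped_slots;      // entries in the slot-to-method map
    std::uint16_t mapped_fields;     // entries in the field map
    std::uint16_t lifetime_callers;  // callers proving lifetime or ownership
};

// True when every checklist item has at least some recorded evidence.
constexpr bool is_minimally_closed(const ClassCandidate& c) {
    return c.name != nullptr && c.ctor_addr != 0 && c.dtor_addr != 0 &&
           c.size_estimate != 0 &&
           c.vtable_confidence != Confidence::Unknown &&
           c.mapped_slots > 0 && c.mapped_fields > 0 &&
           c.lifetime_callers > 0;
}
```

A record that fails the check is still useful; it simply marks the family as not yet ready to be promoted into a Ghidra class namespace.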
### Phase 2: Separate Methods From Free Functions

Not every helper touching an object should become a class method.

The conversion rule should be conservative:

- make it a method when the object pointer is clearly the owner, the function acts on instance state, and the function participates in the class lifecycle or virtual surface
- keep it free or subsystem-local when it behaves like a pure helper, allocator utility, serializer, or cross-object coordinator

This matters because over-classing weak evidence will make the source look cleaner while actually reducing correctness.
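
The conservative rule above can be stated as a small predicate, which makes it harder to bend case by case. The following is a hedged sketch: `FunctionEvidence`, `Placement`, and `classify` are hypothetical names, and the boolean fields are a simplification of evidence that in practice comes with confidence levels.

```cpp
// Hypothetical evidence flags gathered for one function during review.
struct FunctionEvidence {
    bool object_pointer_is_owner;   // the object pointer is clearly the owner
    bool acts_on_instance_state;    // reads or writes instance fields
    bool in_lifecycle_or_vtable;    // ctor/dtor path or a virtual slot
    bool looks_like_shared_helper;  // pure helper, allocator utility,
                                    // serializer, or cross-object coordinator
};

enum class Placement { Method, FreeFunction };

// Conservative default: anything short of full evidence stays a free function.
constexpr Placement classify(const FunctionEvidence& e) {
    return (e.object_pointer_is_owner && e.acts_on_instance_state &&
            e.in_lifecycle_or_vtable && !e.looks_like_shared_helper)
               ? Placement::Method
               : Placement::FreeFunction;
}
```

The default falls toward `FreeFunction` on purpose: promoting a helper to a method later is cheap, while undoing a wrong class assignment tends to ripple through names and signatures.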
### Phase 3: Build Stable Type Layers

Before broad C++ emission, define a small number of disciplined type layers:

- ABI layer: exact-width integers, near/far pointer wrappers, packed structs, fixed calling-convention macros
- runtime layer: allocators, file/resource handles, callback tables, event records, dispatch entries
- gameplay layer: entities, actors, projectiles, triggers, controller objects, UI nodes
- VM layer: runtime/context/owner-resource classes, opcode streams, slot/value helpers

The source should compile against these types first, even if some methods still contain low-level or ugly code.
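
A minimal sketch of what the ABI layer might contain is shown here, assuming a compiler that accepts `#pragma pack` (GCC, Clang, and MSVC do). The `abi` namespace, the `PackedExample` record, and its field names and offsets are placeholders invented for illustration, not recovered layout.

```cpp
#include <cstddef>
#include <cstdint>

namespace abi {

// Exact-width aliases so recovered layouts never depend on `int` or `long`.
using u8 = std::uint8_t;
using u16 = std::uint16_t;
using u32 = std::uint32_t;

#pragma pack(push, 1)
// Hypothetical packed record; names and offsets are placeholders.
struct PackedExample {
    u16 vtable_offset;  // +0x00
    u8 flags;           // +0x02
    u16 owner_id;       // +0x03, unaligned on purpose (packed layout)
    u32 callback_far;   // +0x05, raw segment:offset value
};
#pragma pack(pop)

// Layout checks fail the build as soon as packing or field order drifts.
static_assert(sizeof(PackedExample) == 9, "packed size drifted");
static_assert(offsetof(PackedExample, flags) == 2, "flags moved");
static_assert(offsetof(PackedExample, owner_id) == 3, "owner_id moved");
static_assert(offsetof(PackedExample, callback_far) == 5, "callback moved");

}  // namespace abi
```

Higher layers would include this header and never redeclare widths or packing themselves, which keeps toolchain-specific decisions in one place.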
### Phase 4: Land Recompilable C++ In Vertical Slices

Do not wait for the whole game to be class-clean before testing compilation.

Instead, move in subsystem slices:

1. one object family
2. its structs and vtable
3. its constructors/destructors
4. a handful of live methods
5. a compile test for that slice

This is the only realistic way to find layout or calling-convention mistakes early.
### Phase 5: Add Runtime Validation Harnesses

A source-level recompile effort will fail if verification is only manual.

Needed validation layers:

- map/resource load smoke tests
- deterministic startup path checks
- function-level trace comparisons for selected hot methods
- data-layout assertions on recovered structs
- script/VM behavior checks where extracted USECODE already gives a second evidence source
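
For the function-level trace comparison layer, the core operation is finding the first point where a rebuilt method diverges from the original. The sketch below assumes a hypothetical `TraceEvent` record and that both traces have already been captured into arrays; how the original trace is obtained (emulator hooks, debugger scripts, or something else) is outside this example. It uses a loop inside a `constexpr` function, so it needs C++14 or later.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical trace record: one entry per call of a traced hot method.
struct TraceEvent {
    std::uint32_t function_id;  // which method was entered
    std::uint32_t arg_hash;     // digest of the inputs
    std::uint32_t result;       // returned value or state digest
};

constexpr bool same_event(const TraceEvent& a, const TraceEvent& b) {
    return a.function_id == b.function_id && a.arg_hash == b.arg_hash &&
           a.result == b.result;
}

// Returns the index of the first differing event, or `count` when the
// first `count` events of both traces match.
constexpr std::size_t first_divergence(const TraceEvent* original,
                                       const TraceEvent* rebuilt,
                                       std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        if (!same_event(original[i], rebuilt[i])) return i;
    }
    return count;
}
```

Reporting the first divergence rather than a pass/fail flag points directly at the call where a layout or calling-convention mistake first becomes visible.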
### Phase 6: Choose The First Real Rebuild Milestone

The first meaningful source milestone should not be `whole game builds`.

A better first milestone is one of these:

1. compile a library that matches one major subsystem ABI and can run against fixture data
2. rebuild the startup/resource path far enough to load into a title/menu state
3. rebuild one contained gameplay loop such as entity allocation/update/teardown with equivalent traces
## Ghidra/MCP Gaps That Matter For This Plan

The local MCP fork already gives enough read/query power to continue class recovery, but it is still missing key authoring operations for a serious C++ lift:

- create class or namespace symbols through MCP
- move existing functions under class ownership cleanly
- create or update struct and vtable datatypes through MCP
- set `this`-pointer types and method signatures systematically
- analyze a candidate vtable and bind slots to named methods in one operation

Those gaps have been added to `ghidra_mcp_wishlist.md` in this batch.
## First Concrete Work Batches

The most defensible first batches are small and structural.

### Batch 1: Class Inventory Pass

Build a repo-side inventory of the strongest current class candidates:

- class family name
- addresses for ctor/dtor/vtable roots
- known methods
- instance-size estimate
- notes/doc references
### Batch 2: One Fully Modeled Family

Pick one family with low ambiguity and carry it through end to end inside Ghidra and the notes:

- class namespace
- method ownership
- instance struct
- vtable struct
- method-slot table
- short rationale note

Good initial candidates are the `entity_dispatch_entry_*` family, the sprite-node family, or one compact controller object family.
### Batch 3: C++ Skeleton Output

Emit one hand-maintained C++ header/source pair for that family with:

- exact-width field placeholders
- named methods
- comments for unresolved fields or slot semantics
- enough type discipline that the code could later be compiled under a chosen toolchain
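
The header half of such a skeleton might look like the sketch below. Everything in it is a placeholder: `PilotNode`, its fields, the offsets, and the meaning of the flag bit are invented to show the shape, and `#pragma pack` stands in for whatever packing control the chosen toolchain provides.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)
// Hypothetical pilot class; names, offsets, and sizes are placeholders.
struct PilotNode {
    std::uint16_t vtable_near;  // +0x00 vtable pointer as stored (width unconfirmed)
    std::int16_t x;             // +0x02
    std::int16_t y;             // +0x04
    std::uint8_t flags;         // +0x06 bit 0 believed to mean "visible"
    std::uint8_t unk_07[5];     // +0x07 unresolved; kept as raw bytes
    std::uint32_t next_far;     // +0x0C raw far pointer to a sibling node

    // Named methods replace offset math at call sites.
    constexpr bool is_visible() const { return (flags & 0x01u) != 0; }
    constexpr bool has_next() const { return next_far != 0; }
};
#pragma pack(pop)

// The size and key offsets are pinned so later edits cannot shift them silently.
static_assert(sizeof(PilotNode) == 16, "PilotNode size drifted");
static_assert(offsetof(PilotNode, flags) == 6, "flags moved");
static_assert(offsetof(PilotNode, next_far) == 12, "next_far moved");
```

Unresolved regions stay as sized `unk_` byte arrays rather than guessed fields, so the struct keeps its total size while individual fields are promoted as evidence arrives.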
### Batch 4: Toolchain Recon

Establish the most credible compile target and constraints early:

- likely original compiler family or nearest substitute
- calling convention spelling
- memory-model requirements
- struct packing behavior
- import/library expectations

Without this, the source can drift into modernized C++ that reads well but cannot realistically rebuild the game.
## What To Avoid

- Do not mass-convert procedural helpers into methods just to make the output look object-oriented.
- Do not let Ghidra pseudocode naming outrun field-layout evidence.
- Do not assume modern C++ ABI rules match the original compiler.
- Do not mix `behaviorally equivalent port` goals with `original-style executable rebuild` claims in the same milestone.
- Do not wait for perfect global understanding before compiling anything.
## Immediate Next Steps

1. add the missing class/namespace and vtable-authoring MCP endpoints to the local fork when ready
2. make a `class candidate inventory` note from the strongest existing families in the current docs
3. choose one family and model it all the way through as a pilot C++ class
4. decide whether the primary rebuild constraint is original-style DOS/NE compatibility or a behaviorally equivalent C++ port
5. define the first compile/test harness before broad source emission starts
## Success Criteria For This Plan

This plan is working if, after a few batches, the project has all of the following:

- at least one real class family fully modeled in Ghidra and mirrored in source
- repeatable rules for when a function becomes a method
- repeatable rules for vtable and field-layout evidence
- a documented compile target with ABI constraints
- a narrow but real compilation/validation loop

If any of those is missing, the project is still doing useful reverse engineering, but it has not yet truly shifted into a recompilable C++ decompilation lane.