Inferrs Backend Proposal: Shift CUDA/Metal Execution to Dynamic Plugins, Decoupling Core Binary from GPU Dependencies
A new architectural proposal for the Inferrs project aims to fundamentally restructure its backend by moving execution entirely into dynamically loaded plugins. This change would structurally resolve a persistent linking error and decouple the main binary from GPU runtime dependencies such as CUDA and Metal. The core tension lies in moving from a system where the plugin loader only performs hardware detection to one where it fully owns execution, a shift that promises greater deployment flexibility but requires significant implementation work.
The proposal builds on an existing, production-quality plugin loader within `inferrs/src/backend.rs`. Currently, this loader successfully probes for compatible hardware across platforms using a clean ABI. However, after a successful probe, the main binary itself executes inference using its own statically linked libraries (like `candle-core` with CUDA features). This means `libcudart_static.a` is embedded within the main executable. The proposed change would move the actual inference execution logic into the plugin shared objects (e.g., `libinferrs_backend_cuda.so`), eliminating all CUDA and Metal compile-time dependencies from the core binary.
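To make the shape of that change concrete, here is a minimal sketch of what the execution-capable plugin ABI might look like. The names (`BackendVTable`, `inferrs_backend_entry`) and the trivial kernel are illustrative assumptions, not taken from the Inferrs source; a real plugin would compile these symbols into `libinferrs_backend_cuda.so`, and the host would resolve the entry point with `dlsym`-style loading rather than calling it in-process as done here.

```rust
use std::os::raw::c_int;

/// Function-pointer table the plugin hands back to the host.
/// `#[repr(C)]` keeps the layout stable across the dylib boundary.
/// (Hypothetical name; the actual Inferrs ABI may differ.)
#[repr(C)]
pub struct BackendVTable {
    /// Returns nonzero if the backend's hardware is usable.
    pub probe: extern "C" fn() -> c_int,
    /// Runs inference: reads `len` floats from `input`, writes `len` floats
    /// to `output`. Returns 0 on success.
    pub infer: extern "C" fn(input: *const f32, output: *mut f32, len: usize) -> c_int,
}

// Stand-in "plugin" implemented in-process for demonstration; a CUDA
// plugin would ship these functions inside its shared object instead.
extern "C" fn cpu_probe() -> c_int {
    1 // pretend the hardware probe succeeded
}

extern "C" fn cpu_infer(input: *const f32, output: *mut f32, len: usize) -> c_int {
    // Trivial placeholder kernel: double each element.
    unsafe {
        for i in 0..len {
            *output.add(i) = *input.add(i) * 2.0;
        }
    }
    0 // success
}

/// Entry point the host binary would look up by name after loading the
/// shared object (illustrative symbol name).
#[no_mangle]
pub extern "C" fn inferrs_backend_entry() -> BackendVTable {
    BackendVTable { probe: cpu_probe, infer: cpu_infer }
}

fn main() {
    let vtable = inferrs_backend_entry();
    assert_eq!((vtable.probe)(), 1);

    let input = [1.0f32, 2.0, 3.0];
    let mut output = [0.0f32; 3];
    let rc = (vtable.infer)(input.as_ptr(), output.as_mut_ptr(), input.len());
    assert_eq!(rc, 0);
    println!("{:?}", output); // [2.0, 4.0, 6.0]
}
```

The key property this sketch illustrates is that only C-ABI function pointers cross the boundary, so the host binary never links against CUDA symbols at compile time.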
If implemented, this architecture would permanently resolve GitHub issue #145, as the main binary would never reference `-lcudart`, making the specific link error impossible. More strategically, it unlocks a streamlined deployment model: a single, universal binary (like one installed via `brew install inferrs`) that could dynamically support any major version of CUDA through its plugins, without requiring recompilation of the core application. This represents a significant step towards the modular, dependency-light vision initially described for the project's backend system.
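One way a universal binary could support multiple CUDA versions is by ranking candidate plugin filenames at runtime and loading the first one that resolves. The naming scheme below (`libinferrs_backend_cuda<major>.so`, a Metal `.dylib` on macOS, a CPU fallback) is an assumption for illustration; the proposal only names `libinferrs_backend_cuda.so` explicitly.

```rust
/// Build an ordered list of plugin filenames to try, from most to least
/// specific. Naming convention is hypothetical, not from the Inferrs repo.
fn plugin_candidates(cuda_major: Option<u32>, on_macos: bool) -> Vec<String> {
    let mut names = Vec::new();
    if on_macos {
        names.push("libinferrs_backend_metal.dylib".to_string());
    } else if let Some(major) = cuda_major {
        // Prefer a plugin built against the detected CUDA major version,
        // then fall back to the generic CUDA plugin.
        names.push(format!("libinferrs_backend_cuda{major}.so"));
        names.push("libinferrs_backend_cuda.so".to_string());
    }
    // CPU fallback keeps the core binary usable with no GPU runtime at all.
    names.push(if on_macos {
        "libinferrs_backend_cpu.dylib".to_string()
    } else {
        "libinferrs_backend_cpu.so".to_string()
    });
    names
}

fn main() {
    // Linux host with CUDA 12 detected.
    assert_eq!(
        plugin_candidates(Some(12), false),
        vec![
            "libinferrs_backend_cuda12.so",
            "libinferrs_backend_cuda.so",
            "libinferrs_backend_cpu.so",
        ]
    );
    // macOS host: Metal first, CPU fallback second.
    assert_eq!(plugin_candidates(None, true)[0], "libinferrs_backend_metal.dylib");
    println!("candidate order verified");
}
```

Because the core binary only manipulates filenames and function pointers, a new CUDA major version can be supported by shipping one additional plugin, with no rebuild of the `brew`-installed executable.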