GitHub Security Scanner: Evaluating Tree-sitter for AST-Based Vulnerability Detection to Overcome Regex Limitations
The current regex-based `SecurityScanner` has a documented limitation: because it matches text one line at a time, it cannot detect vulnerabilities in which the source and the sink are on different lines. This architectural gap, tracked in issue #735 and tested in PR #736, leaves a blind spot in automated code review. The proposed fix is a significant technical change: evaluating the integration of `tree-sitter` to enable Abstract Syntax Tree (AST)-aware detection.
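To make the limitation concrete, here is a minimal sketch of a line-by-line regex check. It is not the actual `SecurityScanner` code: the pattern, the `scan_lines` helper, and the two sample snippets are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical single-line pattern (an assumption, not one of the scanner's
# real rules): flags os.system(...) called directly on the result of input().
PATTERN = re.compile(r"os\.system\(\s*input\(")


def scan_lines(source: str) -> list[int]:
    """Return 1-based line numbers that match, checking one line at a time."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if PATTERN.search(line)
    ]


# Source and sink on the same line: the regex sees both.
SINGLE_LINE = "import os\nos.system(input())\n"

# Same vulnerability, but the source (input) and sink (os.system)
# are on different lines, so no single line matches the pattern.
MULTI_LINE = "import os\ncmd = input()\nos.system(cmd)\n"

print(scan_lines(SINGLE_LINE))  # [2]
print(scan_lines(MULTI_LINE))   # []
```

The second snippet is functionally identical to the first, yet it produces no finding, which is the kind of false negative the issue describes.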
A hybrid architecture is under consideration: regex would be retained for simple patterns, and a new AST-based scanner would handle complex data-flow analysis. The scope involves adding the `tree-sitter` library and language grammars (e.g., for Rust, Python, JavaScript) as new dependencies, writing an estimated 500-800 lines of new code, and converting 14 existing regex patterns into tree-sitter queries. The shift would move detection from line-by-line text matching to language-aware parsing that understands code syntax.
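The hybrid idea can be sketched roughly as follows. To keep the example runnable without extra dependencies, it uses Python's built-in `ast` module as a stand-in for `tree-sitter`, so it only handles Python input and does not reflect the tree-sitter API or query syntax. The rule names, `REGEX_PATTERNS`, and the functions `regex_findings`, `ast_findings`, and `scan` are all hypothetical.

```python
import ast
import re

# Simple patterns stay as regexes (hypothetical rule, for illustration only).
REGEX_PATTERNS = {"hardcoded-password": re.compile(r"password\s*=\s*['\"]")}


def regex_findings(source: str) -> list[tuple[str, int]]:
    """Line-by-line regex pass for simple, single-line patterns."""
    return [
        (name, lineno)
        for lineno, line in enumerate(source.splitlines(), start=1)
        for name, pattern in REGEX_PATTERNS.items()
        if pattern.search(line)
    ]


def _is_call_to(node: ast.AST, dotted_name: str) -> bool:
    """True if node is a call whose callee is written as dotted_name."""
    return isinstance(node, ast.Call) and ast.unparse(node.func) == dotted_name


def ast_findings(source: str) -> list[tuple[str, int]]:
    """Syntax-tree pass: flag os.system() on values that came from input().

    Deliberately naive: it ignores statement order and scope, so it is
    a sketch of the approach rather than real data-flow analysis.
    """
    tree = ast.parse(source)

    # Pass 1: collect variable names assigned directly from input().
    tainted = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and _is_call_to(node.value, "input"):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    tainted.add(target.id)

    # Pass 2: report os.system calls that receive a tainted argument.
    findings = []
    for node in ast.walk(tree):
        if _is_call_to(node, "os.system"):
            for arg in node.args:
                is_tainted_name = isinstance(arg, ast.Name) and arg.id in tainted
                if is_tainted_name or _is_call_to(arg, "input"):
                    findings.append(("command-injection", node.lineno))
                    break
    return findings


def scan(source: str) -> list[tuple[str, int]]:
    """Hybrid scan: merge regex and syntax-tree findings, ordered by line."""
    return sorted(regex_findings(source) + ast_findings(source), key=lambda f: f[1])


SAMPLE = 'import os\npassword = "hunter2"\ncmd = input()\nos.system(cmd)\n'
print(scan(SAMPLE))  # [('hardcoded-password', 2), ('command-injection', 4)]
```

In this sketch the regex pass still catches the single-line pattern, while the tree-based pass connects the `input()` assignment on line 3 to the `os.system` call on line 4, a link that no line-by-line match can make.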
The primary benefit is the ability to detect vulnerabilities that span multiple lines, which the current scanner fundamentally cannot do. The upgrade would move the development pipeline toward more context-aware security tooling. Because the evaluation targets the scanner's core capability, it affects the security posture of every project that relies on it and could reduce false negatives in critical code reviews.