Critical ReDoS Vulnerability in ULMFiT's URL Parser Exposes Systems to CPU Exhaustion
A critical Regular Expression Denial of Service (ReDoS) vulnerability has been identified in the ULMFiT module of PyThaiNLP, posing a direct threat of CPU exhaustion and service disruption. The flaw resides in `URL_PATTERN`, the regular expression used by the `replace_url` function in `pythainlp/ulmfit/preprocess.py`. The vulnerability is not theoretical: a maliciously crafted string, such as a long run of exclamation marks, is enough to trigger catastrophic backtracking in the regex.
The security issue stems from a nested greedy quantifier, `(?:[^\s()<>{}\[\]]+)+`, within the URL-matching pattern. Because both the inner and the outer `+` can consume the same characters, a run of N matching characters can be split between them in roughly 2^N ways. When the rest of the pattern then fails to match, as it does on an invalid URL suffix, a backtracking engine tries every one of those splits before giving up, so processing time grows exponentially (O(2^N)) and a single request can tie up a CPU core. The vulnerability was flagged by GitHub's CodeQL static analysis tool, which highlighted a concrete path for an attacker to induce a denial-of-service state with specific, easy-to-generate input.
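The growth described above can be observed with a small timing sketch. It is an illustration under stated assumptions, not the library's code: the pattern below is only the nested group from the article, and the trailing `$` anchor, the `NESTED` name, and the `timed_match` helper are hypothetical additions chosen so that a payload ending in a space is certain to fail. Payload lengths are kept deliberately small.

```python
import re
import time

# Assumption: a simplified stand-in for the vulnerable fragment, not the
# library's full URL_PATTERN. The trailing `$` is added here only so that a
# payload ending in a space fails after the group has consumed the "!" run.
NESTED = re.compile(r"(?:[^\s()<>{}\[\]]+)+$")


def timed_match(pattern, text):
    """Return (match_or_None, elapsed_seconds) for one anchored match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in range(10, 21, 2):
        result, elapsed = timed_match(NESTED, "!" * n + " ")
        print(f"n={n:2d}  matched={result is not None}  seconds={elapsed:.4f}")
```

Exact timings depend on the machine and Python version, but the elapsed time should climb steeply as `n` increases, which is why a payload of only a few dozen characters can be enough to stall a worker.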
The fix, implemented via a pull request, removes the inner greedy `+` quantifier, simplifying the group to `(?:[^\s()<>{}\[\]])+`. With only one quantifier left, each character can be consumed in exactly one way, so a failed match attempt is rejected in linear time (O(N)) instead of exploring an exponential number of splits. The correction preserves the original URL-matching logic, since both forms accept the same strings, while eliminating the attack vector. It also underscores the importance of rigorous code review for security-critical string processing functions in widely used machine learning libraries.
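A minimal sketch of the before-and-after comparison follows. As an assumption for illustration, it again isolates the group with a trailing `$` rather than using the library's full `URL_PATTERN`; the names `NESTED`, `FLAT`, `span_or_none`, and `same_result` are hypothetical. It checks that the two forms agree on a few short inputs and that the flattened form copes with a long hostile payload.

```python
import re

# Assumption: simplified stand-ins for the group before and after the fix,
# anchored with `$` for demonstration; not the library's full URL_PATTERN.
NESTED = re.compile(r"(?:[^\s()<>{}\[\]]+)+$")  # before: nested quantifiers
FLAT = re.compile(r"(?:[^\s()<>{}\[\]])+$")     # after: single quantifier


def span_or_none(pattern, text):
    """Return the matched span for an anchored match, or None on failure."""
    match = pattern.match(text)
    return match.span() if match else None


def same_result(text):
    """True when both patterns accept or reject `text` identically."""
    return span_or_none(NESTED, text) == span_or_none(FLAT, text)


if __name__ == "__main__":
    # Short inputs only: the nested form is still exponential on failures.
    for sample in ["example.com/path", "a b", "foo(bar", ""]:
        print(repr(sample), "same result:", same_result(sample))

    # Only the flattened form is run against the long payload.
    hostile = "!" * 100_000 + " "
    print("hostile input matched:", FLAT.match(hostile) is not None)
```

The sketch uses `match`, which tries a single start position. With `search`, the engine retries from every offset, so even the flattened pattern can do more than linear work on a long failing input.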