Scikit-Learn Security Flaw: TfidfVectorizer Leaks Sensitive Training Data in Versions <=1.4.1.post1
A data leakage vulnerability in the widely used Python machine learning library scikit-learn has been patched. The flaw, tracked as CVE-2024-5206, could expose sensitive information from training datasets and was present in the TfidfVectorizer component in all versions up to and including 1.4.1.post1. The security fix is available in the newly released version 1.5.0.
The vulnerability stems from how TfidfVectorizer handles and stores token information. Instead of retaining only the designated stop words, the component's `stop_words_` attribute was inadvertently capturing all tokens present in the training data. As a result, any raw, potentially sensitive text used to fit a model (personal identifiers, proprietary terms, confidential phrases) could be silently stored on, and later extracted from, the fitted vectorizer object, creating a significant data exposure risk.
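The exposure is easy to see on an affected release. The following minimal sketch (the documents and "sensitive" identifiers are illustrative) fits a vectorizer that prunes rare terms and then inspects the fitted object; on versions up to 1.4.1.post1, the pruned raw tokens remain readable in `stop_words_`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative documents containing sensitive-looking identifiers.
docs = [
    "alice moved funds from account 12345",
    "bob moved funds from account 67890",
]

# min_df=2 prunes every token that appears in only one document.
vec = TfidfVectorizer(min_df=2)
vec.fit(docs)

print(sorted(vec.vocabulary_))  # terms kept for the TF-IDF computation
# On affected releases (<=1.4.1.post1), the pruned raw tokens --
# 'alice', 'bob', and both account numbers -- sit in stop_words_
# on the fitted object:
print(sorted(vec.stop_words_))
```

Anything that serializes or ships the fitted vectorizer (pickling a pipeline, saving a model artifact) therefore carries those raw tokens along with it.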
Dependency updates to 1.5.0, even from releases as old as 0.22.2.post1, are being flagged as security upgrades. This vulnerability highlights the hidden risks in machine learning pipelines, where data security is often an afterthought. Organizations and data scientists using scikit-learn for text processing should immediately audit their dependencies and upgrade to version 1.5.0 to mitigate the risk of unintentionally leaking raw training tokens through a standard model attribute.
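Where an immediate upgrade is not possible, the exposure can be reduced before a fitted vectorizer leaves the process. The scikit-learn documentation notes that `stop_words_` exists only for introspection and can safely be deleted or set to `None` before pickling; a sketch (documents are illustrative):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "alice moved funds from account 12345",
    "bob moved funds from account 67890",
]

vec = TfidfVectorizer(min_df=2)  # min_df=2 prunes single-document tokens
vec.fit(docs)

# Drop the attribute before serializing so raw training tokens
# cannot travel with the pickled model artifact.
vec.stop_words_ = None
blob = pickle.dumps(vec)

assert b"alice" not in blob  # pruned raw tokens are gone from the artifact
restored = pickle.loads(blob)
print(restored.transform(docs).shape)  # transform still works: (2, 4)
```

This is a mitigation for the serialized artifact only, not a fix: the correct remediation remains upgrading to 1.5.0.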