Definition
The inadvertent inclusion of information during training that would not be available during real-world prediction.
Detailed Explanation
Data leakage occurs when a model is trained using information that would not be available at prediction time, which leads to overly optimistic performance estimates. Common sources include temporal leakage (future data used to predict the past), target leakage (features that encode the outcome being predicted), and train-test contamination (test data influencing training or preprocessing). Preventing leakage requires careful data pipeline design and validation procedures.
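As an illustration of two of these sources, the following minimal sketch splits time-stamped rows at a cutoff and computes standardization statistics from the training split only. The data, the cutoff, and the helper names (temporal_split, fit_scaler, apply_scaler) are hypothetical and chosen only for this example; a real pipeline would apply the same idea to every fitted preprocessing step.

```python
from statistics import mean, pstdev


def temporal_split(rows, cutoff):
    """Split time-stamped rows so training data strictly precedes the cutoff."""
    train = [r for r in rows if r["t"] < cutoff]
    test = [r for r in rows if r["t"] >= cutoff]
    return train, test


def fit_scaler(values):
    """Compute standardization statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 if the variance is zero
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy series: the last two observations belong to the test period.
rows = [{"t": t, "x": float(x)} for t, x in enumerate([1, 2, 3, 4, 5, 6, 100, 120])]
train, test = temporal_split(rows, cutoff=6)

train_x = [r["x"] for r in train]
test_x = [r["x"] for r in test]

# Leak-free: statistics come from the training split only,
# then the same statistics are reused on the test split.
mu, sigma = fit_scaler(train_x)
train_scaled = apply_scaler(train_x, mu, sigma)
test_scaled = apply_scaler(test_x, mu, sigma)

# Leaky, shown only for comparison: statistics computed over all rows,
# so the test period influences how the training data is scaled.
leaky_mu, leaky_sigma = fit_scaler(train_x + test_x)

print(f"train-only mean: {mu}, leaky mean: {leaky_mu}")
```

The gap between the two means shows how strongly the test period would have shaped the preprocessing if the statistics had been fitted on the full dataset.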
Use Cases
Financial forecasting, clinical trial analysis, predictive maintenance