Definition
Reverse-engineering neural networks to understand their internal computations and algorithms.
Detailed Explanation
A field of research that reverse-engineers trained neural networks to recover the specific computations and algorithms they implement internally. The aim is a detailed, mechanism-level account of *how* a model produces its outputs, not just a description of what those outputs are.
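As a toy illustration of what recovering an internal computation can look like, the sketch below zero-ablates one hidden unit at a time in a tiny two-unit ReLU network and compares the outputs. The weights are set by hand for this example (an assumption, not a trained model), and the names `forward` and `truth_table` are hypothetical helpers introduced here. Zero ablation is only one of several techniques used in practice.

```python
from itertools import product

def relu(v):
    return max(0.0, v)

# Hand-set weights (assumed for illustration): a 2-2-1 ReLU network computing XOR.
W_HIDDEN = [(1.0, 1.0), (1.0, 1.0)]  # both hidden units read x1 + x2
B_HIDDEN = [0.0, -1.0]               # unit 1 only fires when both inputs are 1
W_OUT = [1.0, -2.0]                  # output = h0 - 2 * h1

INPUTS = list(product([0, 1], repeat=2))

def forward(x, ablate=None):
    """Return (output, hidden activations); optionally zero one hidden unit."""
    hidden = [relu(w[0] * x[0] + w[1] * x[1] + b)
              for w, b in zip(W_HIDDEN, B_HIDDEN)]
    if ablate is not None:
        hidden[ablate] = 0.0  # zero ablation: remove this unit's contribution
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden

def truth_table(ablate=None):
    """Map every binary input pair to the network output."""
    return {x: forward(x, ablate)[0] for x in INPUTS}

if __name__ == "__main__":
    for label, unit in [("intact", None), ("ablate h0", 0), ("ablate h1", 1)]:
        print(label, truth_table(unit))
```

Comparing the three tables suggests a reading of the mechanism: unit `h0` counts how many inputs are active, and unit `h1` detects the case where both are active and subtracts it out. With `h1` ablated, the network reverts to plain counting on the input (1, 1).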
Use Cases
Understanding model failures, verifying model safety properties, debugging complex models, discovering novel algorithms learned by neural networks, and supporting AI alignment research.
