Interpretability is the science of understanding how neural networks work internally, and of how modifying their inner mechanisms can shape their behavior--e.g., adjusting a reasoning model's internal concepts to ...