Chapter 1 Introduction

Author: Emanuel Renkl

Supervisor: Christoph Molnar

1.1 Statistical Modeling: The Two Approaches

In statistics, there are two approaches to reaching conclusions from data (see Breiman (2001b)). The first is the data modeling approach, which assumes that the data are generated by a given stochastic data model. More specifically, a proposed model relates the input variables, random noise, and parameters to the response variables. Linear and logistic regression are typical examples. These models allow us to predict the responses to future input variables and give information on how the response variables and input variables are associated. In other words, they are interpretable. The second is the algorithmic modeling approach, which treats the underlying data mechanism as unknown. More precisely, the goal is to find an algorithm, such as a random forest or a neural network, that operates on the input variables to predict the response variables. These algorithms also allow us to predict the responses to future input variables, but they do not give information on how the response variables and input variables are associated. Put differently, these algorithms produce black box models because they do not provide any direct explanation for their predictions. They are therefore not interpretable.
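
To make the contrast concrete, the following minimal sketch fits both kinds of model to the same toy data. The data, the helper names `fit_line` and `knn_predict`, and the use of a nearest-neighbour averager as a stand-in for an algorithmic model are illustrative assumptions and are not taken from Breiman (2001b).

```python
def fit_line(xs, ys):
    """Data modeling: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def knn_predict(xs, ys, x_new, k=2):
    """Algorithmic modeling: average response of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k


# Toy data generated from y = 1 + 2x without noise (assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(xs, ys)
line_prediction = intercept + slope * 2.5
knn_prediction = knn_predict(xs, ys, 2.5, k=2)

print(intercept, slope)                 # parameters that can be read directly
print(line_prediction, knn_prediction)  # both models predict the same value
```

Both models predict 6.0 at x = 2.5, but only the fitted line comes with a parameter (a slope of 2) that states how the response changes with the input. The nearest-neighbour predictor returns a number without any such summary.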

Within the statistics community, the data modeling approach was dominant for a long time (Breiman (2001b)). However, especially in the last decade, the increasing availability of enormous amounts of complex and unstructured data, as well as the increase in processing power of computers, served as a breeding ground for a strong shift to the algorithmic modeling approach, primarily for two reasons. First, the data modeling approach is not applicable to exciting problems like text, speech, and image recognition (Breiman (2001b)). Second, for complex prediction problems, newer algorithms such as random forests and neural nets outperform classical models in prediction accuracy because they can model complex relationships in the data (Breiman (2001b)). For these reasons, more and more researchers have switched from the data modeling approach to the algorithmic modeling approach, which is far better known as machine learning.

But what about interpretability? As described above, machine learning algorithms are black-box models that do not provide any direct explanation for their predictions. So do we even need to know why an algorithm makes a certain prediction? To get a better feeling for this question, it is helpful to understand how algorithms learn to make predictions and which tasks are suited to machine learning.

1.2 Importance of Interpretability

Algorithms learn to make predictions from training data. Thus, algorithms also pick up the biases of the training data and hence may not be robust under certain circumstances. For example, they may perform well on a test set but not in the real world. Such behavior can lead to undesired outcomes.

For instance, consider a simple husky versus wolf classifier that misclassifies some huskies as wolves (see M. T. Ribeiro, Singh, and Guestrin (2016b)). Since the machine learning model does not give any information on how the response and input variables are associated, we do not know why it classified a husky as a wolf. Interpretability can help us to debug the algorithm and see whether this problem is persistent. Using methods that make machine learning algorithms interpretable (which we will discuss later in the booklet), we find that the misclassification was due to the snow in the image. The algorithm learned to use snow as a feature for classifying images as wolves. This might make sense in the training dataset, but not in the real world. Thus, in this example, interpretability helps us to understand how the algorithm arrives at its result, and hence we know in which cases the algorithm is not robust. In this section, we derive the importance of interpretability by focusing on academic and industrial settings.

Broadly speaking, machine learning in academia is used to draw conclusions from data. However, off-the-shelf machine learning algorithms only give predictions without explanations. Thus, they answer only the “what,” but not the “why” of a certain question and therefore do not allow for actual scientific findings. Especially in areas such as the life and social sciences, which aim to identify causal relationships between input and response variables, interpretability is key to scientific discoveries. For example, researchers applying machine learning in a medical study found that patients with pneumonia who have a history of asthma have a lower risk of dying from pneumonia than the general population (Caruana et al. (2015)). This is, of course, counterintuitive. However, it was a true pattern in the training data: pneumonia patients with a history of asthma were usually admitted not only to the hospital but also directly to the Intensive Care Unit. The aggressive care received by asthmatic pneumonia patients was so effective that it lowered their risk of dying from pneumonia compared to the general population. Because the prognosis for these patients was above average, models trained on this data erroneously learned that asthma reduces the risk, whereas asthmatics actually have a much higher risk if they are not hospitalized. In this example, blind trust in the machine learning algorithm would yield misleading results. Thus, interpretability is necessary in research to help identify causal relationships and increase the reliability and robustness of machine learning algorithms. Especially in areas outside of statistics, the adoption of machine learning would be facilitated by making these models interpretable and adding explanations to their predictions.

From Amazon’s Alexa or Netflix’s movie recommendation system to Google’s search algorithm and Facebook’s social media feed, machine learning is a standard component of almost any digital product offered by the industry’s big tech companies. These companies use machine learning to improve their products and business models. However, their machine learning algorithms are also built on training data collected from their users. Thus, in the age of data leaks à la Cambridge Analytica, people want to understand for what purposes their data is collected and how the algorithms work that keep people on streaming platforms or urge them to buy additional products and spend more time on social media. In the digital world, interpretability of machine learning models would lead to a broader understanding of machine learning in society and make the technology more trustworthy and fair. Switching to the analog world, we see a far slower adoption of machine learning systems at scale. This is because decisions made by machine learning systems in the real world can have far more severe consequences than in the digital world. For instance, if the wrong movie is suggested to us, it really doesn’t matter, but if a machine learning system that is deployed in a self-driving car does not recognize a cyclist, it might make the wrong decision with real lives at stake (see Molnar (2019)). In such cases, we need to be confident that the machine learning system works reliably. For example, an explanation might show that the most important feature is to recognize the two wheels of a bicycle, and this explanation helps us to think about certain edge cases, such as bicycles with side bags that partially cover the wheels. Self-driving cars are just one example in which machines are taking over decisions in the real world that were previously made by humans and that can involve severe and sometimes irreversible consequences. Interpretability helps to ensure the reliability and robustness of these systems and thus makes them safer.

To conclude, adding interpretability to machine learning algorithms is necessary in both academic and industrial applications. While we distinguished between academic and industrial settings, the general points of causality, robustness, reliability, trust, and fairness are valid in both worlds. However, for academia, interpretability is especially key to identifying causal relationships and increasing the reliability and robustness of scientific discoveries made with the help of machine learning algorithms. In industrial settings, establishing trust in and fairness of machine learning systems matters most in low-risk environments, whereas robustness and reliability are key in high-risk environments in which machines take over decisions that have far-reaching consequences.

Now that we have established the importance of interpretability, how do we put this into practice? A restriction to machine learning models that are considered interpretable due to their simple structure, such as short decision trees or sparse linear models, has the drawback that better-performing models are excluded before model selection. Hence, should we trade off prediction against information and go back to simpler models? No. We separate the explanations from the machine learning model and apply interpretation methods that analyze the model after training.

1.3 Interpretable Machine Learning

As discussed in the previous section, most machine learning algorithms produce black-box models because they do not provide any direct explanation for their predictions. However, we do not want to restrict ourselves to models that are considered interpretable because of their simple structure and thus trade prediction accuracy for interpretability. Instead, we make machine learning models interpretable by applying methods that analyze the model after it has been trained, that is, we establish post-hoc interpretability. Moreover, we separate the explanations from the machine learning model, that is, we focus on so-called model-agnostic interpretation methods. Post-hoc, model-agnostic explanation systems have several advantages (M. T. Ribeiro, Singh, and Guestrin (2016a)). First, since we separate the underlying machine learning model from its interpretation, developers can work with any model, as the interpretation method is independent of the model. Thus, we establish model flexibility. Second, since the interpretation is independent of the underlying machine learning model, the form of the interpretation also becomes independent. For instance, in some cases it might be useful to have a linear formula, while in other cases a graphic with feature importances is more appropriate. Thus, we establish explanation flexibility.

So what do these explanation systems do? As discussed before, interpretation methods for machine learning algorithms help to identify causal relationships, support robustness and reliability, and establish trust and fairness. More specifically, they do so by shedding light on the following issues (see Molnar (2019)):

  • Algorithm transparency - How does the algorithm create the model?
  • Global, holistic model interpretability - How does the trained model make predictions?
  • Global model interpretability on a modular level - How do parts of the model affect predictions?
  • Local interpretability for a single prediction - Why did the model make a certain prediction for an instance?
  • Local interpretability for a group of predictions - Why did the model make specific predictions for a group of instances?

Now that we have seen that post-hoc, model-agnostic methods provide model and explanation flexibility, and in which ways explanation systems support causal understanding, robustness, and reliability and establish trust and fairness, we can move on to the outline of the booklet and discuss specific interpretation methods and their limitations.

1.4 Outline of the booklet

This booklet introduces current post-hoc, model-agnostic approaches in interpretable machine learning and investigates their limitations. The approaches covered are Partial Dependence Plots (PDP), Accumulated Local Effects (ALE), Permutation Feature Importance (PFI), Leave-One-Covariate-Out (LOCO), and Local Interpretable Model-Agnostic Explanations (LIME). All of these methods can be used to explain the behavior and predictions of trained machine learning models. However, their reliability and compactness deteriorate when, among other things, models use a large number of features or have strong feature interactions and complex feature main effects. In this section, the methods mentioned are introduced and the outline of the booklet is given.

To start with, PDP and ALE are methods that enable a better understanding of the relationship between the outcome and the feature variables of a machine learning model. Common to both methods is that they reduce the prediction function to a function that depends on only one or two features (Molnar (2019)). Both methods reduce the function by averaging the effects of the other features, but they differ in how the averages are calculated. A PDP visualizes whether the relationship between the outcome and a feature variable is, for instance, linear, monotonic, or nonlinear, and hence allows for a straightforward interpretation of the marginal effect of a certain feature on the predicted outcome of a model (Friedman (2001)). However, this only holds as long as the feature in question is not correlated with any other feature of the model. ALE plots are a faster and unbiased alternative to PDP, as they can correctly interpret models containing correlated variables (Molnar (2019)). Chapter 2 of the booklet gives a short introduction to PDP. Next, the limitations of PDP when features are correlated are investigated in Chapter 3. Chapter 4 discusses whether PDPs allow for causal interpretations. Chapter 5 then gives a short introduction to ALE. ALE and PDP are compared in detail in Chapter 6. The choice of intervals, problems with piecewise constant models, and categorical features in the context of ALE are investigated in Chapter 7.
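
The difference in how the two methods average can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation used in later chapters: `black_box` is a hypothetical trained model, the four data points are made up, and the ALE curve is left uncentered for brevity (ALE plots are usually centered so that the mean effect is zero).

```python
def partial_dependence(predict, data, feature, grid):
    """PDP: average prediction with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = list(row)
            modified[feature] = value  # overwrite the feature for every row
            total += predict(modified)
        curve.append(total / len(data))
    return curve


def accumulated_local_effects(predict, data, feature, edges):
    """Uncentered first-order ALE, evaluated at each interval edge."""
    ale = [0.0]
    for lower, upper in zip(edges[:-1], edges[1:]):
        diffs = []
        for row in data:
            value = row[feature]
            # Intervals are (lower, upper]; the first also includes its lower edge.
            inside = lower < value <= upper or (lower == edges[0] and value == lower)
            if inside:
                at_upper = list(row)
                at_upper[feature] = upper
                at_lower = list(row)
                at_lower[feature] = lower
                diffs.append(predict(at_upper) - predict(at_lower))
        local_effect = sum(diffs) / len(diffs) if diffs else 0.0
        ale.append(ale[-1] + local_effect)  # accumulate the local effects
    return ale


def black_box(row):
    """Hypothetical trained model with an interaction between x0 and x1."""
    x0, x1 = row
    return 2.0 * x0 + x0 * x1


# Made-up data in which the two features are perfectly correlated (x1 = x0 - 1).
data = [[0.0, -1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
grid = [0.0, 1.0, 2.0, 3.0]

pdp_curve = partial_dependence(black_box, data, feature=0, grid=grid)
ale_curve = accumulated_local_effects(black_box, data, feature=0, edges=grid)
print(pdp_curve)
print(ale_curve)
```

On this toy data the PDP is [0.0, 2.5, 5.0, 7.5] and the uncentered ALE is [0.0, 1.5, 4.5, 8.5]. The PDP evaluates the model at feature combinations that never occur in the data, such as x0 = 0 together with x1 = 2, whereas the ALE sketch only moves each row within the interval in which it was observed.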

Yet, PDP and ALE do not provide any insight into the extent to which a feature contributes to the predictive power of a model, which we refer to as feature importance in the following. PFI and LOCO are two methods that allow us to compute and visualize the importance of a certain feature for a machine learning model. PFI by Breiman (2001a) measures the importance of a feature by calculating the increase in the model’s prediction error after permuting the feature. LOCO by Lei et al. (2018) requires refitting the model, as the approach is based on leaving features out instead of permuting them (Casalicchio, Molnar, and Bischl (2018)). Chapter 8 gives a short introduction to PFI and LOCO and points out their limitations. Chapter 9 discusses both methods in the case of correlated features. Partial and individual PFI are then discussed in Chapter 10, and Chapter 11 discusses whether feature importance should be computed on training or test data.
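
The two importance measures can be contrasted in a short sketch. Everything in it is an assumption made for illustration: the least-squares learner `fit_linear` stands in for an arbitrary machine learning algorithm, the data are noise-free and made up, mean squared error is used as the error measure, and the function names are hypothetical.

```python
import random


def fit_linear(X, y):
    """Stand-in learner: least squares with intercept; returns predict(row)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Normal equations A w = b with A = X'X and b = X'y.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for k in range(p):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * c for a, c in zip(A[k], A[col])]
                b[k] -= factor * b[col]
    w = [b[i] / A[i][i] for i in range(p)]
    return lambda row: w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))


def mse(predict, X, y):
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, feature, repeats=20, seed=0):
    """PFI: mean increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    increases = []
    for _ in range(repeats):
        column = [r[feature] for r in X]
        rng.shuffle(column)
        X_perm = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(X, column)]
        increases.append(mse(predict, X_perm, y) - baseline)
    return sum(increases) / repeats


def loco_importance(fit, X_train, y_train, X_test, y_test, feature):
    """LOCO: increase in test error after refitting without one feature."""
    def drop(X):
        return [r[:feature] + r[feature + 1:] for r in X]
    full = fit(X_train, y_train)
    reduced = fit(drop(X_train), y_train)  # the model is refitted here
    return mse(reduced, drop(X_test), y_test) - mse(full, X_test, y_test)


# Made-up data: the response depends on feature 0 only (y = 1 + 3 * x0).
X_train = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0], [4.0, 3.0], [5.0, 0.0]]
y_train = [1.0 + 3.0 * r[0] for r in X_train]
X_test = [[0.5, 2.0], [1.5, 0.0], [2.5, 1.0], [3.5, 3.0], [4.5, 2.0]]
y_test = [1.0 + 3.0 * r[0] for r in X_test]

model = fit_linear(X_train, y_train)
pfi = [permutation_importance(model, X_test, y_test, j) for j in range(2)]
loco = [loco_importance(fit_linear, X_train, y_train, X_test, y_test, j)
        for j in range(2)]
print(pfi)
print(loco)
```

In this setup both measures assign a clearly positive importance to feature 0 and an importance of approximately zero to feature 1. Note the different inputs: `permutation_importance` only needs the trained model's prediction function, while `loco_importance` needs the learner itself because it refits the model.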

Finally, LIME is a method that explains individual predictions of a black box machine learning model by locally approximating the prediction using a less complex and interpretable model (Molnar (2019)). These simplifying models are referred to as surrogate models. Consider, for instance, a neural network that is used for a classification task. The neural network itself is of course not interpretable, but certain decision boundaries could, for example, be explained reasonably well by a logistic regression, which yields interpretable coefficients. Referring back to the beginning of the introduction, in this example we use the data modeling approach to explain the algorithmic modeling approach. Chapter 12 gives a short introduction to LIME. Chapter 13 sheds light on the effect of the neighbourhood on LIME’s explanation for tabular data. Chapter 14 deals with the sampling step in LIME and the resulting side effects in terms of feature weight stability of the surrogate model.
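
The idea of a local surrogate can be sketched for a single numeric feature. The code below is a simplified illustration in the spirit of LIME and not the LIME algorithm itself: it uses a regression rather than a classification task and omits steps such as the interpretable feature representation and feature selection. The function `local_surrogate`, its parameters, and the quadratic `black_box` are assumptions chosen for the example.

```python
import math
import random


def local_surrogate(predict, instance, n_samples=1000, sample_sd=1.0,
                    kernel_width=0.5, seed=0):
    """Fit a proximity-weighted line to a black box around one instance."""
    rng = random.Random(seed)
    # Sampling step: perturb the instance and query the black box.
    xs = [instance + rng.gauss(0.0, sample_sd) for _ in range(n_samples)]
    ys = [predict(x) for x in xs]
    # Neighbourhood: an exponential kernel gives nearby samples more weight.
    ws = [math.exp(-((x - instance) ** 2) / (2 * kernel_width ** 2)) for x in xs]
    # Weighted least squares for y = intercept + slope * x.
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def black_box(x):
    """Hypothetical nonlinear model whose global shape is not a line."""
    return x ** 2


_, slope_at_1 = local_surrogate(black_box, 1.0)
_, slope_at_minus_1 = local_surrogate(black_box, -1.0)
print(slope_at_1, slope_at_minus_1)
```

The surrogate slope is close to 2 around x = 1 and close to -2 around x = -1, so the same black box receives different local explanations at different instances. The parameters `kernel_width` and `seed` correspond to the two issues taken up later: the choice of neighbourhood and the dependence of the surrogate's weights on the random sampling step.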

Now that we have introduced the different methods, we can move on to the respective chapters of the booklet, which discuss the methods and their limitations in more detail and provide practical examples.


Breiman, Leo. 2001a. “Random Forests.” Machine Learning 45 (1). Springer: 5–32.

Breiman, Leo. 2001b. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statist. Sci. 16 (3). The Institute of Mathematical Statistics: 199–231. doi:10.1214/ss/1009213726.

Caruana, Rich, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. “Intelligible Models for Healthcare: Predicting Pneumonia Risk and Hospital 30-Day Readmission.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1721–30. ACM.

Casalicchio, Giuseppe, Christoph Molnar, and Bernd Bischl. 2018. “Visualizing the Feature Importance for Black Box Models.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 655–70. Springer.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics. JSTOR, 1189–1232.

Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. 2018. “Distribution-Free Predictive Inference for Regression.” Journal of the American Statistical Association 113 (523). Taylor & Francis: 1094–1111.

Molnar, Christoph. 2019. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016a. “Model-Agnostic Interpretability of Machine Learning.” arXiv Preprint arXiv:1606.05386.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016b. “Why Should I Trust You?: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44. ACM.