Challenge 1: CART Models and Predicting Supreme Court Decisions
This post is part of a series I am writing to translate R scripts I have seen or written into Alteryx workflows. Original post can be found here.
This script comes from MIT’s Data Analytics course (which you can sign up for here). In the section introducing Classification and Regression Trees (CART), they use data on decisions made by US Supreme Court justices and try to predict whether a justice will reverse or uphold the decision of the lower court. Between 1994 and 2005, the same nine justices served on the Supreme Court, one of the longest such periods of unchanged membership in US history. This gives us a richer, more consistent data set than we would have for most other eras. More specifically, this CART model looks at the decisions of Justice John Paul Stevens, and whether several factors can predict his decision to affirm or reverse the decision of the lower court. These factors include the subject of the case, whether the lower court’s decision was liberal or conservative, and the type of petitioner involved in the case.
Although this analysis could be performed with logistic regression, the outcome would not be as easily interpretable. When we create a decision tree in R, if a variable has an effect on the outcome (in this case, Justice Stevens reversing the decision of the lower court), we can easily see where it sits in the tree, both in terms of its effect on the outcome and its relationship to the other significant variables:
Decision tree plot produced by R.
Creating a Decision Tree model in Alteryx uses just two tools:
- The Create Samples tool, to split our data into a training set we will build our model on and a test set we will use to validate it
- The Decision Tree tool, to actually build our model
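Under the hood, the Create Samples tool simply performs a random partition of the records. A minimal sketch of the idea in Python (the case records and the 70/30 split ratio are illustrative assumptions, not the tool's internals):

```python
import random

def train_test_split(rows, train_frac=0.7, seed=123):
    """Randomly partition rows into a training set and a test set.

    Illustrative stand-in for Alteryx's Create Samples tool, which
    splits records by the percentages set in its configuration.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical case records: (lower-court direction, circuit, reversed?)
cases = [("liberal", "9th", 1), ("conservative", "5th", 0)] * 50
train_set, test_set = train_test_split(cases)
print(len(train_set), len(test_set))  # → 70 30
```

Fixing the seed matters for reproducibility: rerunning the workflow with the same seed assigns the same rows to the training and test sets.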
And with that, Alteryx spits out a summary report to show the details of how the model was run, and a visual report that includes the Decision Tree, the significance of each variable in determining the outcome, and a confusion matrix to summarise the accuracy:
Although the Model Comparison tool is not included in the default Alteryx package, it can be found in the Alteryx gallery.
We can use this tool to evaluate the accuracy of our model against a simple baseline that always predicts the most frequent outcome in our test set.
Before looking at the report generated by this tool, we can check the baseline accuracy ourselves: a Summarize tool gives us a count of the 0 and 1 responses, a second Summarize tool gives us both the count of the most frequent outcome and the total number of rows in the data set, and a Formula tool calculates the accuracy as “[Max_Count]/[Sum_Count]”. This gives us a baseline accuracy of 54.7%.
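The same majority-class calculation can be sketched in a few lines of Python (the labels below are made up so that the majority class works out to roughly 54.7%; they are not the actual test-set counts):

```python
from collections import Counter

def baseline_accuracy(outcomes):
    """Accuracy of always predicting the most frequent class:
    the majority outcome's count divided by the total row count,
    mirroring the [Max_Count]/[Sum_Count] formula in the workflow."""
    counts = Counter(outcomes)
    return max(counts.values()) / sum(counts.values())

# Hypothetical test-set labels (1 = reverse, 0 = affirm), chosen so the
# majority class makes up 94 of 172 rows, i.e. about 54.7%.
labels = [1] * 94 + [0] * 78
print(round(baseline_accuracy(labels), 3))  # → 0.547
```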
If we look at the report generated by the Model Comparison tool, we can see that our model’s accuracy is about 67%, a clear improvement on the baseline. The report also shows an AUC of 0.73, which tells us our model is reasonably good at distinguishing a reversal from an affirmation decision by Justice Stevens.
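Both figures have simple definitions worth keeping in mind: accuracy is correct predictions over all predictions (readable straight off the confusion matrix), and AUC is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one. A sketch of both, using made-up scores rather than the model's actual output:

```python
def accuracy_from_confusion(tp, tn, fp, fn):
    """Overall accuracy: correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

def auc(scores_pos, scores_neg):
    """AUC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative one (ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical predicted probabilities of a reversal:
reversed_scores = [0.9, 0.8, 0.55, 0.4]   # cases actually reversed
affirmed_scores = [0.7, 0.3, 0.2, 0.1]    # cases actually affirmed
print(auc(reversed_scores, affirmed_scores))  # → 0.875
```

An AUC of 0.5 would mean the model ranks cases no better than chance, which is why 0.73 indicates genuinely useful discrimination.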
Summary of accuracy and AUC from report generated by Model Comparison tool
The benefit of using tools in Alteryx is that the code has already been written for us, while we still have the ability to change the default parameters in the Decision Tree tool, such as the complexity parameter and the set of independent variables. The Model Comparison tool can also be used to quickly compare the accuracy of several models generated in Alteryx, such as logistic regression and random forest. With just a handful of tools, we are able to create an interpretable model that predicts the decisions of a Supreme Court justice with an accuracy well above baseline. In addition, the report generated by the Model Comparison tool provides assessment plots, such as an ROC curve, that can help in deciding what probability thresholds to use when building our models.
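The threshold choice the ROC curve informs can be sketched directly: sweep a few cutoffs over predicted probabilities and see how accuracy moves. The probabilities and outcomes below are invented for illustration:

```python
def accuracy_at_threshold(probs, labels, cutoff):
    """Accuracy when predicting 'reverse' (1) whenever the model's
    predicted probability meets the cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical predicted probabilities and true outcomes (1 = reverse)
probs  = [0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    1,   0,   0,   0]

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, accuracy_at_threshold(probs, labels, cutoff))
```

In practice the best cutoff depends on which error is costlier, which is exactly the trade-off the ROC curve visualises.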