From Visualization to Statistical Analysis
From Feature Engineering to Feature Selection
From the Best Model Selection to Interpretability
The 7th Swiss Conference on Data Science was held online on the 26th of June 2020, and on the day before, the 25th of June, Claudio G. Giancaterino organized a pre-conference workshop that approached Exploratory Data Analysis from a different point of view.
Usually the goal of Exploratory Data Analysis (EDA) is to understand patterns, detect mistakes, check assumptions and check relationships between variables of a data set with the help of graphical charts and summary statistics.
Instead, the goal of this workshop was to expand the classical EDA journey into a wider pipeline through an experimental, iterative approach that tried to understand, step by step, the impact of each action on the behavior of the models. The result was an Exploratory Data & Models Analysis.
The whole online workshop was conducted in a webinar format in which the 18 attendees could interact with the speaker through a Q&A chat box, asking questions during the presentation. The seminar was hands-on: at almost every step of the journey, attendees had the opportunity either to run Google Colaboratory notebooks on a sample of the data set and look at the results, or to challenge themselves with exercise notebooks by filling in pieces of missing code.
Participants showed interest in the workshop, posting positive feedback at the end of the seminar; during the webinar they asked questions about all the topics, with the minutes ticking by quickly.
Topics covered & discussed:
During the workshop a data set from a data science competition was used, and the goal of the classification task was to develop a model to predict whether or not a mortgage can be funded, based on certain factors in a customer’s application data.
The journey started with a quick look at the data set with the help of a visualization tool: AutoViz. Participants were thrilled with the tool used.
Then the data set was divided into two paths: categorical variables, which went through an encoding step (replacing each category string with a numerical representation), and numerical variables, used to look at the performance of several baseline models.
The Q&A chat box raised questions about issues linked to the use of one-hot encoding (it expands the feature space).
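The expansion of the feature space is easy to see with a small sketch. The column names below are illustrative, not taken from the competition data set; with pandas, `get_dummies` turns each categorical column into one binary column per distinct category:

```python
import pandas as pd

# Hypothetical sample of categorical application data
# (column names are illustrative only).
df = pd.DataFrame({
    "property_type": ["house", "condo", "house", "townhouse"],
    "payment_frequency": ["monthly", "biweekly", "monthly", "monthly"],
})

# One-hot encoding: each category string becomes its own binary column,
# so 2 original columns expand into 5 encoded columns here.
encoded = pd.get_dummies(df)

print(df.shape[1], "->", encoded.shape[1])  # 2 -> 5
```

With high-cardinality categoricals (hundreds of distinct values) this expansion is exactly the issue raised in the chat, and it motivates alternatives such as ordinal or target encoding.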
At this point we handled missing values, replacing them with a data imputation strategy instead of removing the affected rows. Dropping rows carries the risk of discarding relevant information, which is why it is preferable to work with a complete data set; the participants agreed with this choice.
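As a minimal sketch of this imputation strategy (the column names are hypothetical), median imputation with pandas keeps every row while filling the gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical numerical columns with missing values.
df = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 48000.0],
    "credit_score": [690.0, 710.0, np.nan, 650.0],
})

# Median imputation: fill each NaN with the column median,
# keeping all rows instead of dropping the affected ones.
df_imputed = df.fillna(df.median())

print(df_imputed.isna().sum().sum())  # 0
```

For production pipelines, `sklearn.impute.SimpleImputer` offers the same idea with fit/transform semantics, so the statistics learned on the training set can be reused on new data.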
We then applied Exploratory Data Analysis to the data set, using bivariate analysis as a feature selection step to identify the relevant features. One of the questions was about the difference between PCA (born as a dimensionality reduction technique, but often used to create new features) and the approach followed here (used to select the features most predictive of the target variable).
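A bivariate screening of this kind can be sketched as follows, on synthetic stand-in data (the feature names and the 0.2 threshold are assumptions for illustration): each feature is scored against the target on its own, and only features above the threshold are kept.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: one feature correlated with the target, one pure noise.
y = rng.integers(0, 2, size=1000)
X = pd.DataFrame({
    "informative": y + rng.normal(scale=0.3, size=1000),
    "noise": rng.normal(size=1000),
})

# Bivariate feature selection: rank features by |correlation| with the
# target and keep those above a threshold (the threshold is a tuning choice).
scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
selected = scores[scores > 0.2].index.tolist()

print(selected)
```

Unlike PCA, which builds new components from linear combinations of all features, this approach keeps a subset of the original features, so their meaning stays interpretable.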
Before going to the last step, the handling of imbalanced classification, we managed outliers (extreme values that fall far away from the other observations). A logarithmic transformation was applied to correct the skewness of some variables, discretization was applied to mixture distributions, and a new numerical feature was created. As explained, feature engineering can sometimes be frustrating because it generates correlated features that need to be deleted in the preprocessing step, and business knowledge can play a significant role in the application of this methodology.
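The effect of the logarithmic transformation on a skewed variable can be checked numerically. A right-skewed lognormal sample is used here as a stand-in for a variable like property value:

```python
import numpy as np

rng = np.random.default_rng(42)

# A right-skewed variable, e.g. a monetary amount.
x = rng.lognormal(mean=12, sigma=0.8, size=5000)

def skewness(v):
    """Sample skewness: third standardized moment."""
    v = np.asarray(v, dtype=float)
    return ((v - v.mean()) ** 3).mean() / v.std() ** 3

# log1p compresses the long right tail, pulling skewness toward zero.
print(round(skewness(x), 2), "->", round(skewness(np.log1p(x)), 2))
```

`log1p` (i.e. `log(1 + x)`) is used instead of a plain `log` so the transformation stays defined when a variable contains zeros.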
In the last step we discussed some strategies to face imbalanced classes in the classification task and applied some techniques.
· Oversampling: randomly sample (with replacement) the minority class until it reaches the same size as the majority class.
· Undersampling: randomly subset the majority class down to the same size as the minority class.
· SMOTE (Synthetic Minority Over-sampling Technique): an over-sampling method that creates synthetic samples from the minority class instead of creating copies of it.
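The first of these techniques, random oversampling, can be sketched with NumPy alone (libraries such as imbalanced-learn provide ready-made `RandomOverSampler` and `SMOTE` implementations; the toy class sizes below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Imbalanced toy labels: 90 majority (class 0) vs 10 minority (class 1).
y = np.array([0] * 90 + [1] * 10)
X = rng.normal(size=(100, 3))

# Random oversampling: draw minority rows with replacement until the
# minority class matches the majority class size.
minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)
extra = rng.choice(minority_idx,
                   size=len(majority_idx) - len(minority_idx),
                   replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

print(np.bincount(y_bal))  # [90 90]
```

SMOTE differs in the last step: rather than duplicating `X[extra]` rows, it interpolates between each minority sample and its nearest minority neighbors to create new synthetic points.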
In all the steps except the first, we applied a modeling process to evaluate the impact of each action on the performance of the models, and the attendees were immediately interested in which models we used: Logistic Regression, AdaBoost, Gradient Boosting Machine, Bagging, Random Forest and a Neural Network. With the oversampling method, the best models were Gradient Boosting Machine and AdaBoost.
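The repeated evaluation loop can be sketched with scikit-learn on synthetic stand-in data (the real competition data is not reproduced here); the same snippet is re-run after each preprocessing step to measure its impact:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the mortgage data.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

# A subset of the model zoo used in the workshop.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Cross-validated AUC per model; rerun after each preprocessing action
# to see how that action moved the scores.
results = {name: cross_val_score(model, X, y, cv=5,
                                 scoring="roc_auc").mean()
           for name, model in models.items()}

for name, auc in results.items():
    print(f"{name}: {auc:.3f}")
```

ROC AUC is used as the metric here because plain accuracy is misleading on imbalanced classes, which is exactly the issue addressed in the last step.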
From the feature importance analysis using permutation, the feature best able to explain the target values was Property Value for almost all models; using SHAP values with the Gradient Boosting Machine, instead, the best feature was Payment Frequency, and specifically the Monthly Payment. The attendees' curiosity focused on this feature because it was also ranked first for importance in the Logistic Regression; attention also went to the feature created as the product of the interest rate and the loan-to-value ratio, which showed importance as well.
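Permutation importance itself is available directly in scikit-learn; a minimal sketch on synthetic data (the data set and feature names here are stand-ins) shuffles one feature at a time and measures how much the held-out score drops:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with 2 informative features among 5.
X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times on the test set; a large score
# drop means the model relies on that feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking)
```

SHAP values, by contrast, attribute each individual prediction to the features, so the two methods can legitimately disagree on the top feature, as they did in the workshop.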
Look at the repository