Boxplots are mainly used to visualize how the values of a variable in a dataset are distributed. Drawing a boxplot for a variable makes it easy to spot outliers. We can also group the results by another variable in the dataset. Let's see how.
Consider a Loan Prediction dataset. We will analyze the ApplicantIncome and Education variables in this dataset.
Step 1: Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Step 2: Load the dataset
dataset = pd.read_csv("C:/train_loan_prediction.csv")
Step 3: Draw boxplot for ApplicantIncome
dataset.boxplot(column='ApplicantIncome')
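The points a boxplot draws beyond its whiskers are usually flagged with the 1.5 × IQR rule, which is the default whisker setting in matplotlib (the library pandas plots with). The sketch below applies that rule by hand. The income values are made up for illustration and are not taken from the loan dataset.

```python
import pandas as pd

# Hypothetical income values, invented for illustration only.
incomes = pd.Series(
    [2500, 3000, 3200, 3800, 4000, 4200, 4500, 5000, 5800, 20000, 81000],
    name="ApplicantIncome",
)

# The box spans the first to the third quartile.
q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

Applied to the real data, the same steps should work with `dataset['ApplicantIncome']` in place of `incomes`, assuming the column has no missing values that need handling first.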
We can see a lot of outliers/extreme values in the ApplicantIncome column. This suggests considerable income disparity among the applicants. But hold on: we are analyzing the income of all applicants without regard to their education level, which can be misleading. There is a good chance that more educated people have higher incomes than less educated people. Let's segregate the income by education:
dataset.boxplot(column='ApplicantIncome', by='Education')
We can see that there is no substantial difference between the median income of graduates and non-graduates (the line inside each box marks the median, not the mean). But there are more graduates with very high incomes, and these appear as outliers.
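To back up a visual reading like this with numbers, one option is to summarize income per education group. The following is a minimal sketch on a small made-up table. The rows and the group labels "Graduate" and "Not Graduate" are assumptions for illustration, not values read from the loan dataset.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
df = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "ApplicantIncome": [3000, 4000, 5000, 60000, 2800, 3600, 4400],
})

# Median is the line inside each box; mean is pulled upward by extreme values.
summary = df.groupby("Education")["ApplicantIncome"].agg(["median", "mean", "max"])
print(summary)
```

In this toy data the medians of the two groups are close, while the single very high graduate income pulls the graduate mean far above the median. That is the same pattern a grouped boxplot shows as similar boxes with extra outlier points on one side.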