Airline Passenger Satisfaction Analysis

This is coursework for Econometrics - Theory and Applications I. The programming language is Python.
———

Introduction

The airline industry is extremely competitive, and passenger satisfaction is always top of mind for airline companies. Alongside service quality and flight safety, passenger satisfaction has a significant influence on the business. Dissatisfied or disengaged passengers mean fewer return customers and less revenue, which weakens the company. It is important that passengers have an excellent experience every time they travel. On-time flights, good in-flight entertainment, more (and better) snacks, and more leg room might be the obvious contributors to a good experience and greater loyalty. While we might hear about those aspects the most, the customer experience is not just about the flight itself. It is everything from purchasing the ticket on the company’s website or mobile app, to checking bags in at the airport or via a mobile app, to waiting in the terminal. It is therefore interesting to investigate the relationship between customers’ attitudes and the factors recorded in a satisfaction survey.

This paper aims to improve our understanding of what drives passenger satisfaction with an airline by asking the following question: What are the most important factors affecting passenger satisfaction?

I intend to answer this question by examining the relationship between the satisfaction level and the associated attributes using a regression with a binary dependent variable, specifically a probit regression.

Data and Variable Description

The passenger satisfaction dataset for Invistico Airline is from Kaggle (https://www.kaggle.com/sjleshrac/airlines-customer-satisfaction). It contains various features from an airline passenger satisfaction survey: 129,880 observations, 1 dependent variable (satisfaction), and 22 independent variables.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   satisfaction                       129880 non-null  object 
 1   gender                             129880 non-null  object 
 2   customer_type                      129880 non-null  object 
 3   age                                129880 non-null  int64  
 4   type_of_travel                     129880 non-null  object 
 5   class                              129880 non-null  object 
 6   flight_distance                    129880 non-null  int64  
 7   seat_comfort                       129880 non-null  int64  
 8   departure_arrival_time_convenient  129880 non-null  int64  
 9   food_and_drink                     129880 non-null  int64  
 10  gate_location                      129880 non-null  int64  
 11  inflight_wifi_service              129880 non-null  int64  
 12  inflight_entertainment             129880 non-null  int64  
 13  online_support                     129880 non-null  int64  
 14  ease_of_online_booking             129880 non-null  int64  
 15  on_board_service                   129880 non-null  int64  
 16  leg_room_service                   129880 non-null  int64  
 17  baggage_handling                   129880 non-null  int64  
 18  checkin_service                    129880 non-null  int64  
 19  cleanliness                        129880 non-null  int64  
 20  online_boarding                    129880 non-null  int64  
 21  departure_delay_in_minutes         129880 non-null  int64  
 22  arrival_delay_in_minutes           129487 non-null  float64
dtypes: float64(1), int64(17), object(5)
memory usage: 22.8+ MB

The categorical (factor) features: satisfaction, gender, customer type, type of travel, and class.

The numeric features: age, flight distance, seat comfort, departure/arrival time convenient, food and drink, gate location, inflight wifi service, inflight entertainment, online support, ease of online booking, on board service, leg room service, baggage handling, check-in service, cleanliness, online boarding, departure delay in minutes, and arrival delay in minutes. All are stored as integers except arrival delay in minutes, which is stored as a float because it contains missing values.

The following 14 columns are ratings on a scale from 0 (least satisfied) to 5 (most satisfied): seat comfort, departure/arrival time convenient, food and drink, gate location, inflight wifi service, inflight entertainment, online support, ease of online booking, on board service, leg room service, baggage handling, check-in service, cleanliness, online boarding.

In the dependent variable, about 54.7% of the observations are labeled as “satisfied”, and 45.3% are labeled as “dissatisfied”.

After checking the dataset, I found 393 missing values in arrival_delay_in_minutes. Most arrival delays are shorter than 200 minutes, but a few are very long. So that these extreme values do not distort the imputation, I replaced the missing values with the median of this attribute rather than the mean. Figure 1 shows the distribution of counts across ranges of arrival delay in minutes.
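The imputation step described above can be sketched as follows. This is a minimal illustration on a made-up toy column, not the Kaggle data; only the column name arrival_delay_in_minutes comes from the dataset, and the values are assumptions chosen to include one extreme delay.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real column (hypothetical values, one extreme delay).
df = pd.DataFrame(
    {"arrival_delay_in_minutes": [0.0, 5.0, np.nan, 12.0, 480.0, np.nan, 3.0]}
)

# Count the missing values before imputing.
n_missing = int(df["arrival_delay_in_minutes"].isna().sum())

# median() and mean() both skip NaN by default.
median_delay = df["arrival_delay_in_minutes"].median()
mean_delay = df["arrival_delay_in_minutes"].mean()

# Fill the gaps with the median, which is not pulled up by the extreme delay.
df["arrival_delay_in_minutes"] = df["arrival_delay_in_minutes"].fillna(median_delay)

print(f"missing: {n_missing}, median: {median_delay}, mean: {mean_delay}")
```

In this toy column the single very long delay pulls the mean far above the typical value, while the median stays close to it, which is the reason given above for preferring the median.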

Figure 1: Count of arrival delay in minutes

Figure 2: Distribution of customer satisfaction in each categorical attribute

Figure 2 shows that satisfaction is much higher among female passengers, while male passengers tend to be more dissatisfied; a higher proportion of loyal customers than disloyal customers were satisfied; a higher proportion of business travelers than personal travelers were satisfied; and a higher proportion of business class passengers were satisfied than passengers in other classes.

Figure 3: Distribution of customer satisfaction in each numerical attribute

Figure 3 shows that the higher a customer rated an individual service, the more likely the customer was to be satisfied overall.

Figure 4: Correlation matrix

Figure 4 shows that inflight_entertainment, online_support, and ease_of_online_booking are more strongly correlated with satisfaction than the other attributes. Another interesting finding is the very high correlation (0.96) between departure_delay_in_minutes and arrival_delay_in_minutes. Although I had already replaced the missing values of arrival_delay_in_minutes with its median, I excluded the arrival_delay_in_minutes column entirely to avoid imperfect multicollinearity.
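As a rough sketch of the check described above, the snippet below computes the correlation between the two delay columns and drops one of them when it is very high. The data are invented for illustration, and the 0.9 cutoff is an assumed threshold, not a value taken from this analysis.

```python
import pandas as pd

# Hypothetical toy data; arrival delays track departure delays closely.
df = pd.DataFrame(
    {
        "departure_delay_in_minutes": [0, 10, 25, 60, 120],
        "arrival_delay_in_minutes": [0, 8, 30, 55, 125],
        "age": [34, 51, 27, 45, 62],
    }
)

# Pearson correlation between the two delay columns.
delay_corr = df["departure_delay_in_minutes"].corr(df["arrival_delay_in_minutes"])

# Assumed cutoff: treat anything above 0.9 as near-duplicate information.
if delay_corr > 0.9:
    df = df.drop(columns=["arrival_delay_in_minutes"])

print(f"correlation: {delay_corr:.3f}, remaining columns: {list(df.columns)}")
```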

Method of Data Analysis

The dependent variable (satisfaction) is binary: it is either true (1) or false (0). Because of this, a nonlinear model is more appropriate than a linear one, since a linear probability model can produce predicted probabilities below 0 or above 1 for some values of a regressor. That leaves two options, probit or logit regression, to examine the causal effect. Since probit and logit regressions usually give similar results, I use probit here.

I use satisfaction as the dependent variable. As independent variables, I picked inflight_entertainment and ease_of_online_booking because they are highly correlated with passenger satisfaction and may therefore influence it.

The service profit chain (Heskett et al., 1994) describes a relationship between customer satisfaction and customer loyalty; however, the article does not show whether the relationship is causal. I would therefore also like to examine the causal effect between satisfaction and customer type (disloyal customer, loyal customer).

Because the other 19 variables are also important in determining customer satisfaction, I include them as control variables to reduce omitted variable bias.
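The two points in the paragraph above can be illustrated with a small standard-library sketch. The linear probability coefficients (0.2 and 0.3) are made-up numbers, and the factor 1.702 is a commonly used rescaling that lines the logistic curve up with the standard normal one; neither comes from the fitted models in this paper.

```python
import math


def probit_cdf(z: float) -> float:
    """Standard normal CDF, the link function of a probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logit_cdf(z: float) -> float:
    """Logistic CDF, the link function of a logit model."""
    return 1.0 / (1.0 + math.exp(-z))


# A hypothetical linear probability model, evaluated at a regressor value of 4,
# produces a "probability" above 1.
linear_prob = 0.2 + 0.3 * 4

# Largest gap between the probit curve and the rescaled logit curve on [-4, 4].
grid = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(probit_cdf(z) - logit_cdf(1.702 * z)) for z in grid)

print(f"linear prediction: {linear_prob:.2f}, max probit/logit gap: {max_gap:.4f}")
```

Both link functions keep predictions strictly between 0 and 1, and after rescaling they differ by less than one percentage point, which is consistent with the choice between them rarely mattering much.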

What are you testing by using this model/method?

First, I start with a probit regression on the control variables alone. Then I add the independent variables of interest one at a time to see whether there is omitted variable bias. Next, I examine interactions between variables. Finally, I use the Wald statistic to test hypotheses involving several coefficients.

To guard against inconsistent standard errors and keep the hypothesis tests reliable, I use heteroskedasticity-robust standard errors (cov_type='HC1') when fitting the models.

What kind of data transformation needed?

The dependent variable is stored as text (satisfied, dissatisfied), so I mapped it to 1 for satisfied and 0 for dissatisfied. I created dummy variables for the other categorical variables. To avoid perfect multicollinearity, I dropped the first level of each categorical variable, so a variable with two levels yields a single dummy.
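The transformation above might look like the following in pandas. The three rows are invented, and the level labels ("Loyal Customer", "disloyal Customer", "Female", "Male") are assumptions inferred from the variable name customer_type_disloyal_Customer used in the results; the final renaming step is likewise only a guess at how the underscore in that name arose.

```python
import pandas as pd

# Hypothetical toy rows with assumed category labels.
df = pd.DataFrame(
    {
        "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
        "customer_type": ["Loyal Customer", "disloyal Customer", "Loyal Customer"],
        "gender": ["Female", "Male", "Male"],
    }
)

# Map the text outcome to a 0/1 dependent variable.
df["satisfaction"] = df["satisfaction"].map({"satisfied": 1, "dissatisfied": 0})

# One dummy per two-level variable: drop_first removes the first (sorted) level.
df = pd.get_dummies(df, columns=["customer_type", "gender"], drop_first=True, dtype=int)

# Replace spaces so the column names are easier to use in formulas.
df.columns = df.columns.str.replace(" ", "_")

print(df)
```

Dropping one level per variable leaves that level as the reference category, so each remaining dummy coefficient is read relative to it.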

Results

Dependent variable: satisfaction, binary, 129,880 observations
Regressor	(0)	(1)	(2)	(3)	(4)	(5)
customer type = disloyal Customer (Binary, CT)		-1.3342 (0.014)	-1.1619 (0.014)	-1.1182 (0.015)	-1.2570 (0.044)	-0.3066 (0.052)
inflight entertainment (IE)			0.3666 (0.005)	0.3767 (0.005)	0.3679 (0.005)	0.3444 (0.005)
ease of online booking (EOB)				0.1272 (0.007)	0.1269 (0.007)	0.1887 (0.007)
CT x IE					0.0456 (0.013)	0.0549 (0.012)
CT x EOB						-0.3016 (0.010)

The other 19 control variables	Y	Y	Y	Y	Y	Y
Intercept	-2.7010 (0.032)	-2.1724 (0.034)	-2.7572 (0.036)	-2.6470 (0.037)	-2.6224 (0.037)	-2.7536 (0.038)
Wald-Statistics and p-Values Testing Exclusion of Groups of Variables
interactions and CT						6953.72 (0.000)
interactions only						1015.59 (0.000)

pseudo R-square	0.3341	0.3946	0.4340	0.4361	0.4363	0.4425
difference in predicted probability of satisfaction, loyal vs disloyal (%)		48.4%	43.5%	42.1%	46.2%	11.7%

Standard errors are in parentheses next to the coefficients, and p-values are in parentheses next to the Wald statistics. The difference in predicted probabilities of satisfaction is calculated at the sample means of the regressors (other than customer_type_disloyal_Customer).

Model (0): satisfaction ~ The other 19 control variables

All 19 control variables are individually statistically significant.
The pseudo R-square here is 0.3341.

Model (1): satisfaction ~ customer_type_disloyal_Customer + The other 19 control variables

After adding customer_type_disloyal_Customer (a binary variable: 1 indicates a disloyal customer, 0 a loyal customer), its coefficient is -1.3342. The negative sign means that the z-value of a disloyal customer is lower than that of a loyal customer, holding the other regressors constant. The t-statistic is -93.186, which is highly significant.
After adding customer_type_disloyal_Customer, the p-value of gate_location rose from 0.000 to 0.092, so at the 5% significance level we cannot reject the hypothesis that the coefficient of gate_location is 0. Model (1) therefore provides no evidence that gate_location influences customer satisfaction.

Model (2): satisfaction ~ inflight_entertainment + customer_type_disloyal_Customer + The other 19 control variables

The coefficient of inflight_entertainment is the change in the z-value associated with a one-unit change in inflight_entertainment. Here the coefficient is 0.3666, so if inflight_entertainment increases by 1 (holding the other regressors constant), the probability of satisfaction increases by the difference between the standard normal cumulative probabilities at z + 0.3666 and at z, and the size of that difference depends on z. (Note: inflight_entertainment takes the values 0, 1, 2, 3, 4, 5. The higher the number, the more satisfied the customer is with the service.)
The coefficient of inflight_entertainment is positive, which means that a higher inflight_entertainment rating is associated with a higher probability of satisfaction. Its t-statistic is 75.066, which is also highly statistically significant.
After adding inflight_entertainment, the coefficient of customer_type_disloyal_Customer moved toward zero, from -1.3342 to -1.1619. Once inflight_entertainment is accounted for, customer loyalty has a smaller effect on satisfaction than in model (1), which indicates that model (1) suffered from omitted variable bias. However, customer_type_disloyal_Customer is still statistically significant.
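To make the interpretation of the 0.3666 coefficient concrete, the sketch below evaluates the change in predicted probability for a one-unit increase in inflight_entertainment at a few starting z-values. The starting values -2, 0, and 2 are arbitrary illustrations, not indexes computed from the data.

```python
import math

BETA_IE = 0.3666  # coefficient of inflight_entertainment in model (2)


def normal_cdf(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_change(z: float, beta: float = BETA_IE) -> float:
    """Change in P(satisfied) when the probit index moves from z to z + beta."""
    return normal_cdf(z + beta) - normal_cdf(z)


# Arbitrary starting points on the z scale (hypothetical, for illustration only).
changes = {z: prob_change(z) for z in (-2.0, 0.0, 2.0)}

for z, change in changes.items():
    print(f"z = {z:+.1f}: probability rises by {change:.4f}")
```

The same coefficient produces the largest change in probability near z = 0 and much smaller changes in the tails, which is why a probit coefficient cannot be read as a fixed change in probability.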

Model (3): satisfaction ~ ease_of_online_booking + inflight_entertainment + customer_type_disloyal_Customer + The other 19 control variables

After adding ease_of_online_booking, the coefficients of the regressors in model (3) differ little from those in model (2). Although ease_of_online_booking is statistically significant at the 5% significance level, it is less significant than customer_type_disloyal_Customer and inflight_entertainment: its t-statistic is 18.883, smaller in magnitude than the -76.948 of customer_type_disloyal_Customer and the 75.983 of inflight_entertainment. The estimated difference in satisfaction probabilities also changed little from model (2) to model (3).
The pseudo R-square is 0.4361, only a small change from the 0.4340 of model (2).

Model (4): satisfaction ~ disloyalIE + ease_of_online_booking + inflight_entertainment + customer_type_disloyal_Customer + The other 19 control variables

This model examines whether customer loyalty interacts with inflight entertainment. Does the satisfaction of disloyal and loyal customers respond differently to the inflight entertainment rating?

After adding the interaction term, disloyalIE, the coefficients of the regressors did not change much from model (3).
The coefficient of disloyalIE is 0.0456, which indicates that for each one-unit increase in inflight_entertainment, the z-value of a disloyal customer increases by 0.0456 more than that of a loyal customer. In other words, the satisfaction of disloyal customers responds more positively than that of loyal customers when the inflight_entertainment rating increases.

Model (5): satisfaction ~ disloyalEOB + disloyalIE + ease_of_online_booking + inflight_entertainment + customer_type_disloyal_Customer + The other 19 control variables

After adding the interaction term disloyalEOB, the coefficient of customer_type_disloyal_Customer changed substantially, from -1.2570 in model (4) to -0.3066, while the coefficients of the other regressors did not change much. Furthermore, the t-statistic of disloyalEOB is highly significant, which indicates that the previous model suffered from omitted variable bias.
After adding disloyalEOB, the difference in predicted probability of satisfaction between loyal and disloyal customers dropped sharply, from 46.2% in model (4) to 11.7%. This could be caused by the large change in the coefficient of customer_type_disloyal_Customer.
The coefficient of disloyalEOB is -0.3016. This is not easy to interpret in a probit model, but roughly, for each one-unit increase in ease_of_online_booking, the z-value of a disloyal customer increases by 0.3016 less than that of a loyal customer. In other words, the satisfaction of loyal customers responds more positively than that of disloyal customers when ease_of_online_booking increases.
I conducted one Wald test of the joint significance of the interaction terms and another of the interaction terms together with customer_type_disloyal_Customer. Both sets of coefficients are jointly statistically significant at the 5% level. Joint significance of the interaction terms indicates that at least one of them has a nonzero effect on satisfaction.
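One way to read the three loyalty-related coefficients of model (5) together is to compute how far the z-value of a disloyal customer sits from that of a loyal customer with the same ratings. The sketch below does this arithmetic with the reported coefficients; the rating combinations it loops over are arbitrary examples, and it assumes the interaction terms are simple products of the dummy and the rating, as the model formula suggests.

```python
# Coefficients reported for model (5).
B_CT = -0.3066      # customer_type_disloyal_Customer
B_CT_IE = 0.0549    # disloyalIE (CT x inflight_entertainment)
B_CT_EOB = -0.3016  # disloyalEOB (CT x ease_of_online_booking)


def disloyal_index_shift(ie: int, eob: int) -> float:
    """Difference in z-value, disloyal minus loyal, at the same ratings."""
    return B_CT + B_CT_IE * ie + B_CT_EOB * eob


# Hold inflight_entertainment at 3 (an arbitrary choice) and vary the booking rating.
for eob in range(6):
    shift = disloyal_index_shift(ie=3, eob=eob)
    print(f"ease_of_online_booking = {eob}: z-value shift = {shift:.4f}")
```

Under these assumptions the gap between disloyal and loyal customers is small when the booking rating is 0 and widens as the rating rises, so the coefficient of customer_type_disloyal_Customer alone no longer summarizes the loyalty effect in model (5).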

Conclusion

From models (1) to (3), I conclude that customer type, inflight entertainment, and ease of online booking each have a causal effect on satisfaction.

From the significant differences in predicted probability of satisfaction in models (1) to (5), I conclude that loyal customers are more easily satisfied by the services than disloyal customers. However, more hypothesis testing is needed to reinforce this conclusion.

References

Khanal, B. Airlines customer satisfaction. Kaggle. Retrieved December 5, 2021, from https://www.kaggle.com/bimarshakhanal/airlines-customer-satisfaction.
Heskett, J.L., Jones, T.O., Loveman, G.W., Sasser, W.E. Jr and Schlesinger, L.A. (1994), “Putting the service profit chain to work”, Harvard Business Review, March-April, pp. 105-11.
Airlines Customer satisfaction. (n.d.). Kaggle. https://www.kaggle.com/sjleshrac/airlines-customer-satisfaction

Written on December 10, 2021