
Text Analysis for Transforming Southwest Airlines Customer Service
Classification Model | Sentiment Analysis | Web Scraping
In December 2022, Southwest Airlines faced a significant disruption in their flight operations due to a system failure. This led to a surge in customer inquiries on Twitter. Slow and inadequate responses negatively impacted both customer satisfaction and financial performance.
Our goal is to tackle this operational challenge by developing a classification model that can automatically categorize customer inquiries on Twitter and assign them to the right customer service agent.
Key Benefits
-
Enhanced Customer Satisfaction
The automated classification and escalation of inquiries will result in swifter, more accurate responses, leading to heightened customer satisfaction and increased customer retention rates.
-
Improved Operational Efficiency
By automatically categorizing and routing inquiries to specialized customer service teams, Southwest Airlines' overall customer service operations will experience heightened efficiency.
-
Reduced Workload for the Social Media Team
The implementation of an automated classification model will eliminate the need for the Social Media team to manually review and respond to every Twitter inquiry.
Dataset Description
The dataset used for model development comprises tweets tagged with @SouthwestAir, spanning from January to February 2023. Approximately 3,000 tweets were collected from the Twitter API using Tweepy.
As the tweets were unclassified, each was assigned one of the following labels, corresponding to the primary customer inquiry types associated with Southwest Airlines. The project team members labeled the tweets manually in Excel to create a gold standard for model testing.
Customer Inquiry Types
System Design
Module Design
Module 1/
Rule-Based Classification &
Sentiment Analysis
The initial method in this project involved rule-based classification and sentiment analysis using the Afinn package. The primary goal was to use sentiment scores to categorize tweets as either requiring a response or not. If a tweet's sentiment score fell below a threshold of -5, it was classified as "response needed." Otherwise, it was classified as "no response needed."
This approach was followed by dictionary-based classification into one of seven categories. The reason for employing two methods was that the dictionary-based approach determined the tweet's category, while the sentiment analysis determined if a response was necessary. Lower sentiment scores were associated with customer help requests, and higher scores indicated positive reviews.
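The two-step rule described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the small lexicon stands in for AFINN (its values are illustrative, not real AFINN entries), and the category names and keywords are hypothetical, since the project's dictionary is not reproduced here. In practice the score would come from the Afinn package rather than the toy lexicon.

```python
import re

# Toy valence lexicon standing in for AFINN (illustrative values only).
TOY_LEXICON = {
    "terrible": -3, "worst": -3, "lost": -3, "cancelled": -1,
    "thanks": 2, "great": 3, "love": 3,
}

# Hypothetical category keywords; the project's real dictionary is not shown.
CATEGORY_KEYWORDS = {
    "baggage": {"bag", "bags", "luggage", "suitcase"},
    "flight_status": {"cancelled", "delayed", "delay", "rebook"},
    "refund": {"refund", "reimbursement", "voucher"},
}

RESPONSE_THRESHOLD = -5  # threshold quoted in the text


def tokenize(text):
    """Lowercase the tweet and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Sum the valence of every token found in the lexicon."""
    return sum(TOY_LEXICON.get(tok, 0) for tok in tokenize(text))


def needs_response(text, threshold=RESPONSE_THRESHOLD):
    """A tweet needs a response when its score falls below the threshold."""
    return sentiment_score(text) < threshold


def categorize(text):
    """Pick the category with the most keyword hits, or 'other' if none match."""
    tokens = set(tokenize(text))
    hits = {cat: len(tokens & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "other"


def route(text):
    """Combine both rules: the dictionary gives the category, sentiment gives the flag."""
    return {"category": categorize(text), "response_needed": needs_response(text)}
```

For example, `route("Worst airline ever, you lost my luggage and it was terrible")` scores -9 with this toy lexicon, so the tweet is flagged as needing a response and assigned to the hypothetical "baggage" category.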
Module 2/
Feature Engineering &
Machine Learning Algorithm
The second classification method tackled the issue of a skewed dataset in which the majority of records fell into the "no response needed" class. To address this challenge, feature engineering and machine learning were applied.
Multiple models, including SVM and XGBoost, were explored. These models were chosen for their ability to handle skewed datasets effectively. SVM separates classes using a margin-based approach, while XGBoost can be configured to give higher weights to minority-class samples. Combining feature engineering with these models significantly improved classification accuracy.
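To make the feature engineering step concrete, here is a minimal sketch of the two Bag of Words variants named in the results (Frequency and Binary), together with one common way to weight an under-represented class. The inverse-frequency weighting shown is an assumption: the text does not state which weighting scheme the models used, and the function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token in the corpus to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(doc, vocab, binary=False):
    """Return a count vector (Frequency) or a 0/1 presence vector (Binary)."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        vec[vocab[tok]] = 1 if binary else n
    return vec


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare classes receive weights above 1, so their errors cost more in training.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}
```

With three "no response needed" tweets and one "response needed" tweet, `balanced_class_weights` gives the minority class a weight of 2.0 and the majority class a weight of about 0.67. Weights of this kind can be passed to a learner that accepts per-class or per-sample weights.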
Module 3/
Feature Engineering & Sentiment Analysis & Machine Learning Algorithm
The third method aimed to enhance the first approach by including sentiment scores as an additional independent variable in training the machine learning model. We were curious about whether utilizing the sentiment score as a threshold in the first method was optimal, or if incorporating sentiment scores as an independent variable in a machine learning model would be a better approach.
This approach combined sentiment scores with Logistic Regression and Naive Bayes models, the top-performing models from the second method. The goal was to capture the connection between tweet sentiment and the likelihood of requiring a customer service response.
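The idea of treating the sentiment score as one more independent variable can be sketched as below: the score is appended as an extra column next to the Bag of Words features, and the model learns how much weight to give it instead of relying on a fixed cutoff. This is an illustrative sketch only; the lexicon values are made up rather than taken from AFINN, and `featurize` is a hypothetical helper name.

```python
# Toy valence lexicon standing in for AFINN (illustrative values only).
TOY_LEXICON = {"terrible": -3, "worst": -3, "lost": -3, "thanks": 2, "great": 3}


def sentiment_score(tokens):
    """Sum the valence of every token found in the lexicon."""
    return sum(TOY_LEXICON.get(tok, 0) for tok in tokens)


def featurize(text, vocab):
    """Binary Bag of Words over `vocab`, plus the sentiment score as a final column."""
    tokens = text.lower().split()
    present = set(tokens)
    bow = [1 if term in present else 0 for term in vocab]
    return bow + [sentiment_score(tokens)]
```

For the vocabulary `["refund", "delayed", "thanks"]`, the tweet "Flight delayed and terrible service" becomes `[0, 1, 0, -3]`. A classifier such as Logistic Regression would then fit a coefficient for the last column alongside the word features.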
Model Evaluation
In evaluating our classification model, we opted for the Macro-Average F1 Score due to the imbalanced nature of our dataset. This metric computes the F1 score (the harmonic mean of precision and recall) separately for each class and then averages the per-class scores with equal weight, so the minority class counts as much as the majority class.
This balance is essential for imbalanced datasets, where a model can appear accurate simply by predicting the majority class. The macro average instead reflects how well the model distinguishes every class.
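As a reference for how the metric is computed, the following is a small pure-Python sketch of the macro-averaged F1 score. The labels in the usage example are made up for illustration and are not drawn from the project's data.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall (0 if both are 0).
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up example: four "no" tweets and two "yes" tweets.
y_true = ["no", "no", "no", "no", "yes", "yes"]
y_pred = ["no", "no", "no", "yes", "yes", "no"]
score = macro_f1(y_true, y_pred)  # per-class F1: no = 0.75, yes = 0.50
```

Here accuracy would be 4/6, but the macro F1 is 0.625 because the weaker score on the smaller "yes" class pulls the average down as much as the larger class lifts it.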
The dictionary approach with an Afinn threshold of -5 achieved the highest F1 score, 0.76.
Module 1 Performance
Module 2 Performance
The same dictionary approach with a sentiment threshold of 0 and the Naive Bayes model using Bag of Words (Frequency) tied for the second-highest F1 score, 0.72.
Module 3 Performance
The model combining Logistic Regression with Bag of Words (Binary) achieved a slightly lower F1 score of 0.71.
Model Selection
The selection of the model from Method 1, which attained the highest F1 score of 0.76, presents valuable advantages for Southwest Airlines in terms of customer service and tweet management strategies. This model excelled, primarily because it harnessed the frequent repetition of terms in the tweets, making the dictionary approach particularly effective.
-
RANK 1/ Dictionary approach with Afinn threshold of -5
F1-Score 0.76
-
RANK 2/ Dictionary approach with Afinn threshold of 0
F1-Score 0.72
-
RANK 3/ Naive Bayes model with Bag of Words - Frequency
F1-Score 0.72
-
RANK 4/ Logistic Regression and Bag of Words - Binary
F1-Score 0.71
Conclusion
The project has paved the way for Southwest Airlines to enhance its customer service operations by automating the classification of customer inquiries on Twitter. We introduced and tested three distinct methods, ultimately recommending the rule-based classification with sentiment analysis, which captures inquiries requiring a response.
Our project's future direction involves a comprehensive cost-benefit analysis, investigating ensemble methods for improved performance, and potentially integrating advanced natural language processing techniques. With these future steps, Southwest Airlines can make informed decisions regarding system implementation, ensuring that customer satisfaction and operational efficiency are at the forefront of their strategy.
Future Direction
Still, it's crucial to acknowledge that classifying specific inquiry issues remained a challenge due to data limitations. We look forward to the continued evolution of this project to better serve the airline's customer support needs.
Investigate Ensemble Methods
Explore diverse ensemble techniques to enhance overall model performance, particularly in terms of recall for customer inquiries requiring a response.
Consider Advanced Models
Evaluate the feasibility and advantages of advanced natural language processing techniques like BERT for enhanced accuracy in customer inquiry classification.
Cost-Benefit Analysis
Conduct a cost-benefit analysis to assess system implementation's financial impact, considering improved customer support efficiency, resource allocation, and customer satisfaction.