Text Analysis for Transforming Southwest Airlines Customer Service

Classification Model | Sentiment Analysis | Web Scraping

In December 2022, Southwest Airlines faced a significant disruption in their flight operations due to a system failure. This led to a surge in customer inquiries on Twitter. Slow and inadequate responses negatively impacted both customer satisfaction and financial performance.

Our goal is to tackle this operational challenge by developing a classification model that can automatically categorize customer inquiries on Twitter and assign them to the right customer service agent.

Key Benefits

Enhanced Customer Satisfaction

The automated classification and escalation of inquiries will result in swifter, more accurate responses, leading to heightened customer satisfaction and increased customer retention rates.

Improved Operational Efficiency

By automatically categorizing and routing inquiries to specialized customer service teams, Southwest Airlines' overall customer service operations will experience heightened efficiency.

Reduced Workload for the Social Media Team

The implementation of an automated classification model will eliminate the need for the Social Media team to manually review and respond to every Twitter inquiry.

Dataset Description

The dataset used for model development comprises tweets mentioning @SouthwestAir, posted from January to February 2023. Approximately 3,000 tweets were collected from the Twitter API using the Tweepy library.

As the tweets were unclassified, each was assigned one of the following labels, corresponding to the primary customer inquiry types associated with Southwest Airlines. Project team members labeled the tweets manually in Excel to create a gold standard for model testing.

Customer Inquiry Types

System Design

Module Design

Module 1/

Rule-Based Classification & Sentiment Analysis

The initial method in this project involved rule-based classification and sentiment analysis using the Afinn package. The primary goal was to use sentiment scores to categorize tweets as either requiring a response or not. If a tweet's sentiment score fell below a threshold of -5, it was classified as "response-needed." Otherwise, it was considered "no response needed."
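The threshold rule above can be sketched in a few lines. This is a minimal illustration, not the project's code: the small `TOY_LEXICON` and its scores are assumptions standing in for the real Afinn word list, and the function names are hypothetical.

```python
import re

# Toy stand-in for a sentiment lexicon. These words and scores are
# illustrative assumptions, not the actual Afinn word list.
TOY_LEXICON = {
    "cancelled": -2, "stranded": -3, "worst": -3, "lost": -3,
    "terrible": -3, "thanks": 2, "great": 3, "love": 3,
}

RESPONSE_THRESHOLD = -5  # threshold stated in the text


def sentiment_score(text: str) -> int:
    """Sum the lexicon scores of every known word in the tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(tok, 0) for tok in tokens)


def needs_response(text: str, threshold: int = RESPONSE_THRESHOLD) -> str:
    """Label a tweet by comparing its score with the threshold."""
    if sentiment_score(text) < threshold:
        return "response-needed"
    return "no response needed"


print(needs_response("Worst trip ever, flight cancelled and bags lost"))
print(needs_response("Thanks for the great flight"))
```

Because the rule is a strict "below the threshold" comparison, a tweet scoring exactly -5 is labeled "no response needed" in this sketch.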

This approach was followed by dictionary-based classification into one of seven categories. The reason for employing two methods was that the dictionary-based approach determined the tweet's category, while the sentiment analysis determined if a response was necessary. Lower sentiment scores were associated with customer help requests, and higher scores indicated positive reviews.
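A dictionary-based classifier of this kind can be sketched as a keyword lookup. The categories and keywords below are hypothetical placeholders, since the project's seven categories and their dictionaries are not listed in this write-up; the tie-breaking rule is also an assumption.

```python
import re

# Hypothetical categories and keywords for illustration only; the
# project's actual seven categories and dictionaries are not shown here.
CATEGORY_KEYWORDS = {
    "baggage": {"bag", "bags", "baggage", "luggage"},
    "refund": {"refund", "reimbursement", "voucher"},
    "flight_status": {"delay", "delayed", "cancelled", "canceled"},
}
DEFAULT_CATEGORY = "other"


def categorize(text: str) -> str:
    """Return the category whose keywords match the most tokens.

    Ties go to the category listed first in CATEGORY_KEYWORDS, and a
    tweet with no keyword hits falls back to DEFAULT_CATEGORY.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = {
        category: sum(tok in words for tok in tokens)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else DEFAULT_CATEGORY


print(categorize("Where is my luggage? My bags never arrived"))
print(categorize("Nice sunset from the window seat"))
```

In a pipeline like the one described, the category from this step and the response flag from the sentiment threshold would be stored side by side for each tweet.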

Module 2/

Feature Engineering & Machine Learning Algorithm

The second classification method tackled the issue of a skewed dataset with a majority of records falling into the "no-response-needed" class. To address this challenge, feature engineering and machine learning were applied.
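The evaluation section refers to Bag of Words features in both frequency and binary form. As a rough sketch of what that feature engineering step might look like, the standard-library code below builds both variants; the tokenizer and function names are assumptions, not the project's implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(tweets):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tweet in tweets for tok in tokenize(tweet)})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(tweet, vocab, binary=False):
    """Return term counts per vocabulary column, or 0/1 flags if binary."""
    counts = Counter(tokenize(tweet))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = 1 if binary else n
    return vector


tweets = ["delayed delayed again", "bags lost"]
vocab = build_vocabulary(tweets)  # columns: again, bags, delayed, lost
print(bag_of_words(tweets[0], vocab))               # [1, 0, 2, 0]
print(bag_of_words(tweets[0], vocab, binary=True))  # [1, 0, 1, 0]
```

The frequency variant keeps how often a term repeats, while the binary variant records only whether it appears at all.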

Multiple models, including SVM and XGBoost, were explored. These models were chosen for their ability to handle skewed datasets effectively. SVM separates classes using a margin-based approach, while XGBoost can assign higher weights to minority-class samples (for example, through its scale_pos_weight parameter). By combining feature engineering and various models, the classification accuracy was significantly improved.
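To make the idea of weighting the minority class concrete, here is a small sketch of one common heuristic: weights inversely proportional to class frequency, the same formula scikit-learn uses for class_weight="balanced". Whether the project used this exact scheme is not stated, so treat it as an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Illustrative label distribution, not the project's actual data.
labels = ["no response needed"] * 8 + ["response-needed"] * 2
print(balanced_class_weights(labels))
# {'no response needed': 0.625, 'response-needed': 2.5}
```

With these weights, an error on a minority-class tweet costs four times as much as an error on a majority-class tweet, which discourages a model from always predicting the majority class.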

Module 3/

Feature Engineering & Sentiment Analysis & Machine Learning Algorithm

The third method aimed to enhance the first approach by including sentiment scores as an additional independent variable in training the machine learning model. We were curious about whether utilizing the sentiment score as a threshold in the first method was optimal, or if incorporating sentiment scores as an independent variable in a machine learning model would be a better approach.

This approach combined sentiment scores with Logistic Regression and Naive Bayes models, the top-performing models from the second method. The goal was to capture the connection between tweet sentiment and the likelihood of requiring a customer service response.
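One simple way to feed the sentiment score to a model is to append it as an extra column after the Bag of Words counts. The sketch below assumes that layout; the toy lexicon, the fixed vocabulary, and the `featurize` name are all hypothetical.

```python
import re
from collections import Counter

# Illustrative scores only; not the actual Afinn word list.
TOY_LEXICON = {"cancelled": -2, "lost": -3, "thanks": 2, "great": 3}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def featurize(tweet, vocab):
    """Bag-of-words counts followed by one sentiment-score column."""
    tokens = tokenize(tweet)
    counts = Counter(tokens)
    bow = [counts.get(word, 0) for word in vocab]
    score = sum(TOY_LEXICON.get(tok, 0) for tok in tokens)
    return bow + [score]


vocab = ["bags", "cancelled", "flight", "lost"]  # assumed fixed vocabulary
print(featurize("Flight cancelled and bags lost", vocab))  # [1, 1, 1, 1, -5]
```

One design point to keep in mind: multinomial Naive Bayes implementations typically expect non-negative features, so a negative score may need to be shifted or binned before training. How the project handled this is not stated here.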

Model Evaluation

In evaluating our classification model, we opted for the Macro-Average F1 Score due to the imbalanced nature of our dataset. This metric computes the F1 score (the harmonic mean of precision and recall) for each class separately and then averages those scores with equal weight, so the minority class counts as much as the majority class.

By prioritizing a balanced evaluation, particularly essential for imbalanced datasets, we aim to gauge the model's ability to effectively handle diverse class distinctions and make informed decisions across the entire spectrum of classes.
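For readers who want to see the metric spelled out, the following standard-library sketch computes a macro-averaged F1 score from true and predicted labels. The labels in the example are made up for illustration and are not project results.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: four "no" tweets and two "yes" tweets.
y_true = ["no", "no", "no", "no", "yes", "yes"]
y_pred = ["no", "no", "no", "yes", "yes", "no"]
print(macro_f1(y_true, y_pred))  # 0.625
```

In this example the majority class scores an F1 of 0.75 and the minority class 0.5, so the macro average is 0.625, even though 4 of 6 predictions (about 67%) are correct.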

The dictionary approach with an Afinn sentiment threshold of -5 achieved the highest F1 score, 0.76.

Module 1 Performance

Module 2 Performance

The same dictionary approach with a sentiment threshold of 0 and the Naive Bayes model using Bag of Words (Frequency) tied for the second-highest F1 score of 0.72.

Module 3 Performance

The model combining Logistic Regression with Bag of Words (Binary) achieved a slightly lower F1 score of 0.71.

Model Selection

The selection of the model from Method 1, which attained the highest F1 score of 0.76, presents valuable advantages for Southwest Airlines in terms of customer service and tweet management strategies. This model excelled, primarily because it harnessed the frequent repetition of terms in the tweets, making the dictionary approach particularly effective.

RANK 1/ Dictionary approach with Afinn threshold of -5

F1-Score 0.76

RANK 2/ Dictionary approach with Afinn threshold of 0

F1-Score 0.72

RANK 3/ Naive Bayes model with Bag of Words - Frequency

F1-Score 0.72

RANK 4/ Logistic Regression and Bag of Words - Binary

F1-Score 0.71

Conclusion

The project has paved the way for Southwest Airlines to enhance its customer service operations by automating the classification of customer inquiries on Twitter. We introduced and tested three distinct methods, ultimately recommending the rule-based classification with sentiment analysis, which captures inquiries requiring a response.

Our project's future direction involves a comprehensive cost-benefit analysis, investigating ensemble methods for improved performance, and potentially integrating advanced natural language processing techniques. With these future steps, Southwest Airlines can make informed decisions regarding system implementation, ensuring that customer satisfaction and operational efficiency are at the forefront of their strategy.

Future Direction

Still, it's crucial to acknowledge that classifying specific inquiry issues remained a challenge due to data limitations. We look forward to the continued evolution of this project to better serve the airline's customer support needs.

Investigate Ensemble Methods

Explore diverse ensemble techniques to enhance overall model performance, particularly in terms of recall for customer inquiries requiring a response.

Consider Advanced Models

Evaluate the feasibility and advantages of advanced natural language processing techniques like BERT for enhanced accuracy in customer inquiry classification.

Cost-Benefit Analysis

Conduct a cost-benefit analysis to assess system implementation's financial impact, considering improved customer support efficiency, resource allocation, and customer satisfaction.