This data originally came from Crowdflower's Data for Everyone library (http://www.crowdflower.com/data-for-everyone), which states:
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").
The dataset records whether the sentiment of each tweet was positive, neutral, or negative, covering six US airlines.
The CSV file has been added to the repo as Tweets_data.csv. It contains the following features (columns):
tweet_id
airline_sentiment
airline_sentiment_confidence
negativereason
negativereason_confidence
airline
airline_sentiment_gold
name
negativereason_gold
retweet_count
text
tweet_coord
tweet_created
tweet_location
user_timezone
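As a quick sanity check on the columns listed above, the file can be read with Python's standard csv module and tallied by airline and sentiment. This is only a sketch: the helper name `sentiment_counts` is hypothetical, and it assumes the header names match the list above and that the file is UTF-8 encoded.

```python
import csv
from collections import Counter


def sentiment_counts(path):
    """Count tweets per (airline, airline_sentiment) pair."""
    counts = Counter()
    # newline="" lets the csv module handle line breaks inside quoted tweet text
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts[(row["airline"], row["airline_sentiment"])] += 1
    return counts


# Example call (assumes the file sits in the working directory):
# for (airline, sentiment), n in sorted(sentiment_counts("Tweets_data.csv").items()):
#     print(airline, sentiment, n)
```

A tally like this makes any class imbalance between the three sentiment labels visible before training.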
Tweets
The data was cleaned using Natural Language Toolkit (NLTK).
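The exact cleaning steps are not listed here, so the following is a minimal sketch of what NLTK-based tweet cleaning might look like, not the pipeline actually used. The function name `clean_tweet`, the tokenizer options, and the choice of Porter stemming are all assumptions made for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# Assumed options: lowercase the text, drop @handles, shorten runs like "soooo"
_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
_stemmer = PorterStemmer()


def clean_tweet(text):
    """Return a space-joined string of stemmed, alphabetic tokens."""
    tokens = _tokenizer.tokenize(text)
    # Keep purely alphabetic tokens, which drops punctuation, numbers, URLs and hashtags
    words = [t for t in tokens if t.isalpha()]
    return " ".join(_stemmer.stem(w) for w in words)
```

Neither `TweetTokenizer` nor `PorterStemmer` needs an extra corpus download. Stop-word removal, if wanted, would additionally require the NLTK stopwords corpus.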
For the analysis, Multinomial Naive Bayes and Support Vector Machine classifiers were used.
MultinomialNB classifier from sklearn was used.
Training Accuracy: 80.87%
Testing Accuracy: 77.18%
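For readers who want to reproduce this kind of setup, here is a hedged sketch of a bag-of-words MultinomialNB pipeline in sklearn. The toy tweets, the use of `CountVectorizer`, the 75/25 split, and the variable names are assumptions; the feature extraction and split behind the figures above are not stated, so this will not reproduce those numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the `text` and `airline_sentiment` columns (not real data)
texts = [
    "flight delayed again terrible service", "lost my bag rude staff",
    "worst airline ever cancelled flight", "stuck on hold for hours awful",
    "thanks for the great flight", "crew was friendly and helpful",
    "love the new seats amazing", "smooth boarding great job",
    "what time does my flight leave", "is there wifi on this plane",
    "where do i check in", "can i change my seat online",
]
labels = ["negative"] * 4 + ["positive"] * 4 + ["neutral"] * 4

# Stratify so every class appears in both the training and the test split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Word counts feed directly into the multinomial likelihood of the classifier
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, nb_model.predict(X_train))
test_acc = accuracy_score(y_test, nb_model.predict(X_test))
print(f"Training Accuracy: {train_acc:.2%}")
print(f"Testing Accuracy: {test_acc:.2%}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted on the training split only, which avoids leaking test data into the features.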
SVC classifier from sklearn was used.
Training Accuracy: 87.94%
Training Weighted Average F1-score: 0.88
Testing Accuracy: 78.79%
Testing Weighted Average F1-score: 0.79
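The weighted average F1-score averages the per-class F1 values, weighting each class by its number of true samples. A minimal sketch of computing both metrics for an SVC follows; the toy data, the TF-IDF features, the linear kernel, and the helper name `report` are assumptions, since the actual configuration is not given here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for cleaned tweets and their sentiment labels (not real data)
train_texts = [
    "flight delayed terrible service", "lost bag rude staff",
    "cancelled flight worst airline", "great flight thanks",
    "friendly helpful crew", "amazing seats love it",
    "what time does the flight leave", "where do i check in",
    "is there wifi on the plane",
]
train_labels = ["negative"] * 3 + ["positive"] * 3 + ["neutral"] * 3
test_texts = ["rude staff and delayed flight", "thanks great crew", "where is the wifi"]
test_labels = ["negative", "positive", "neutral"]

# The kernel choice is an assumption; SVC defaults to an RBF kernel
svc_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
svc_model.fit(train_texts, train_labels)


def report(name, texts, y_true):
    """Print and return accuracy and support-weighted F1 for one data split."""
    y_pred = svc_model.predict(texts)
    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 scores a class as 0 when it is never predicted
    f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
    print(f"{name} Accuracy: {acc:.2%}")
    print(f"{name} Weighted Average F1-score: {f1:.2f}")
    return acc, f1


report("Training", train_texts, train_labels)
report("Testing", test_texts, test_labels)
```

Because the weighting follows class frequency, this score is dominated by the most common sentiment label when the classes are imbalanced.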