Social media has evolved rapidly and now touches many aspects of our lives. In this assignment, we take a closer look at two of the most popular social media platforms: TikTok and Facebook. The two companies operate in a similar industry but target very different audiences, so the brands and their audiences may differ in communication style and language. Understanding the language used on these platforms can point to business implications and directions for each company.
Our research question examines whether there are differences in sentiment between tweets from the TikTok and Facebook accounts.
Our method of sentiment analysis is text mining with R.
First, after collecting tweets from Twitter, we use basic data-exploration tools to transform, visualize, and examine features of the datasets, such as the source, time, length, and content (e.g., links and pictures) of the tweets. We produce bar charts of the most popular words used by each Twitter account, as well as the most common sentiments associated with the tweets each account produces. A word cloud also helps paint a clearer picture of each company's most commonly used words.
Second, we transform the datasets into tidy text format for sentiment analysis. The two main lexicons that we use are NRC and AFINN.
Finally, we run four different models to predict whether a tweet was posted by Facebook or TikTok. The inputs to these models are the length of the tweet and its sentiment counts (anger, anticipation, disgust, negative, positive, trust, joy, surprise, fear, and sadness).
The first model is a simple decision tree, the second a bagging model, the third a random forest, and the fourth a gradient boosting model.
Our results include a sum-of-squares analysis on the test set to determine which models show the smallest differences between the predicted and actual tweeter, along with confusion matrices on the test set to analyze the prediction accuracy of the four models.
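To make the evaluation metric concrete: with TikTok encoded as 1 and Facebook as 0, the residual sum of squares is simply the sum of squared gaps between each model's predicted value and the true label. A minimal sketch with invented values:
# Illustration only: 'actual' and 'predicted' are made-up values,
# not results from our data.
actual    <- c(1, 0, 1, 1, 0)
predicted <- c(0.9, 0.2, 0.4, 1.0, 0.1)
sum((actual - predicted)^2)  # smaller is better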
# Loading packages
library(rtweet)
library(tidyverse)
library(lubridate)
library(scales)
library(tidytext)
library(wordcloud)
library(textdata)
library(caret) # for general model fitting
library(rpart) # for fitting decision trees
library(rpart.plot)
library(ipred) # for fitting bagged decision trees
library(ranger) # for fitting random forests
library(gbm) # for gradient boosting
library(vip) # for variable importance plots
library(kableExtra)
#Getting tweets
# Run these two lines to get the tweets
# and then save them as a csv for future use
# tiktok <- get_timeline("tiktok_us", n=3200)
# tiktok %>% write_as_csv('tiktok.csv')
#
# facebook <- get_timeline("Facebook", n=3200)
# facebook %>% write_as_csv("facebook.csv")
tiktok <-
  read_csv('tiktok.csv') %>%
  select(status_id, source, text, created_at)

facebook <-
  read_csv('facebook.csv') %>%
  select(status_id, source, text, created_at)
nrc <- read_rds("nrc.rds")
facebook %>% head()
## # A tibble: 6 x 4
## status_id source text created_at
## <chr> <chr> <chr> <dttm>
## 1 x1382020080343~ Twitter W~ "Ramadan Mubarak <U+0001F319>\r\n \r\nThis #Mo~ 2021-04-13 17:17:18
## 2 x1381734429018~ Khoros CX "@MeenalK1 Hi Meenal. Do you have the referenc~ 2021-04-12 22:22:13
## 3 x1381733382632~ Khoros CX "@Afrojalipro Thanks for updating us, Afroj! W~ 2021-04-12 22:18:04
## 4 x1381732668388~ Khoros CX "@CallandManning Hi Calland. If you do not hav~ 2021-04-12 22:15:14
## 5 x1381711376876~ Khoros CX "@BHARTINANDAN4 Hello! Please visit this Help ~ 2021-04-12 20:50:37
## 6 x1381710548476~ Khoros CX "@weathermatt22 Hi Matt. Please visit our Help~ 2021-04-12 20:47:20
facebook %>%
  count(source, hour = hour(with_tz(created_at, "EST"))) %>%
  mutate(percent = n/sum(n)) %>%
  ggplot(aes(x = hour, y = percent, color = source)) +
  labs(x = "Hour of day (EST)", y = "% of tweets", color = "") +
  scale_y_continuous(labels = percent_format()) +
  geom_line() +
  ggtitle('Facebook Source Breakdown by Hour')
tiktok %>%
  count(source, hour = hour(with_tz(created_at, "EST"))) %>%
  mutate(percent = n/sum(n)) %>%
  ggplot(aes(x = hour, y = percent, color = source)) +
  labs(x = "Hour of day (EST)", y = "% of tweets", color = "") +
  scale_y_continuous(labels = percent_format()) +
  geom_line() +
  ggtitle('Tiktok Source Breakdown by Hour')
The figures above break down each account's tweets by source and hour. Across sources, the busiest window on both accounts runs from 12:00 to 20:00 EST. Khoros Publishing produces the most tweets from the Facebook account, peaking around 16:00, while Twitter Web App and Fan Experiences (also peaking around 16:00) are the main sources of the TikTok account's tweets.
fb_wordcounts <-
  facebook %>%
  mutate(tweetLength = str_length(text)) %>%
  filter(tweetLength < 500)

tiktok_wordcounts <-
  tiktok %>%
  mutate(tweetLength = str_length(text)) %>%
  filter(tweetLength < 500)
writeLines(c(paste0("Facebook Mean Tweet Length: ",
mean(fb_wordcounts$tweetLength)),
paste0("TikTok Mean Tweet Length: ",
mean(tiktok_wordcounts$tweetLength))))
## Facebook Mean Tweet Length: 163.370231394622
## TikTok Mean Tweet Length: 112.916405760801
hist(tiktok_wordcounts$tweetLength)
hist(fb_wordcounts$tweetLength)
In terms of tweet length (measured in characters via str_length), a typical tweet from the TikTok account runs from 50 to 100 characters, and tweets longer than 100 characters are less common. A typical tweet from the Facebook account is around 150 characters.
fb_picture_counts <-
  facebook %>%
  filter(!str_detect(text, '^"')) %>%
  count(picture = ifelse(str_detect(text, "t.co"),
                         "Picture/link", "No picture/link"))

tiktok_picture_counts <-
  tiktok %>%
  filter(!str_detect(text, '^"')) %>%
  count(picture = ifelse(str_detect(text, "t.co"),
                         "Picture/link", "No picture/link"))
barplot(fb_picture_counts$n,
names.arg=c("No picture/link", "Picture/link"),
main = "Facebook # of Tweets with and without pics/link")
barplot(tiktok_picture_counts$n,
names.arg=c("No picture/link", "Picture/link"),
main = "Tiktok # of Tweets with and without pics/link")
Facebook tweets that contain pictures or links are more common than ones that do not. For TikTok, there is no remarkable difference between the number of tweets with and without a picture or link.
<- "([^A-Za-z\\d#@']|'(?![A-Za-z\\d#@]))"
reg
# Unnest the text strings into a data frame of words
fb_words <-
  facebook %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text,
                                "https://t.co/[A-Za-z\\d]+|&amp;",
                                "")) %>%
  unnest_tokens(word, text,
                token = "regex",
                pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
tiktok_words <-
  tiktok %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(text = str_replace_all(text,
                                "https://t.co/[A-Za-z\\d]+|&amp;",
                                "")) %>%
  unnest_tokens(word, text,
                token = "regex",
                pattern = reg) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "[a-z]"))
# Inspect the first six rows of tweet_words
head(fb_words)
## # A tibble: 6 x 4
## status_id source created_at word
## <chr> <chr> <dttm> <chr>
## 1 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 ramadan
## 2 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 mubarak
## 3 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 0001f319
## 4 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 #monthofgood
## 5 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 check
## 6 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 kindness
fb_words %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity") +
  ylab("Occurrences") +
  coord_flip()
tiktok_words %>%
  count(word, sort = TRUE) %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_bar(stat = "identity") +
  ylab("Occurrences") +
  coord_flip()
fb_sentiment <-
  inner_join(fb_words, nrc, by = "word") %>%
  group_by(sentiment)

tiktok_sentiment <-
  inner_join(tiktok_words, nrc, by = "word") %>%
  group_by(sentiment)
fb_words %>% head()
## # A tibble: 6 x 4
## status_id source created_at word
## <chr> <chr> <dttm> <chr>
## 1 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 ramadan
## 2 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 mubarak
## 3 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 0001f319
## 4 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 #monthofgood
## 5 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 check
## 6 x1382020080343470082 Twitter Web App 2021-04-13 17:17:18 kindness
Here we compare sentiment between the Facebook and TikTok accounts. It looks like discussions surrounding Facebook use more trust words, while topics about TikTok use more words that reflect anticipation.
fb_sentiment_analysis <- fb_sentiment %>%
  count(word, sentiment) %>%
  group_by(sentiment)
fb_sentiment_analysis %>%
  top_n(15) %>%
  ggplot(aes(x = sentiment, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Frequency") +
  xlab("Sentiment") +
  labs(title = "Facebook Sentiment")
## Selecting by n
tiktok_sentiment_analysis <- tiktok_sentiment %>%
  count(word, sentiment) %>%
  group_by(sentiment)
tiktok_sentiment_analysis %>%
  top_n(15) %>%
  ggplot(aes(x = sentiment, y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  ylab("Frequency") +
  xlab("Sentiment") +
  labs(title = "TikTok Sentiment")
## Selecting by n
fb_sentiment_analysis %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(sentiment = reorder(sentiment, -n),
         word = reorder(word, -n)) %>%
  top_n(10) -> fb_sentiment_analysis2
## Selecting by n
ggplot(fb_sentiment_analysis2, aes(x=word, y=n, fill = n)) +
facet_wrap(~ sentiment, scales = "free")+
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y="count", title="Facebook Sentiment")
tiktok_sentiment_analysis %>%
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate(sentiment = reorder(sentiment, -n),
         word = reorder(word, -n)) %>%
  top_n(10) -> tiktok_sentiment_analysis2
## Selecting by n
ggplot(tiktok_sentiment_analysis2, aes(x=word, y=n, fill = n)) +
facet_wrap(~ sentiment, scales = "free")+
geom_bar(stat ="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y="count", title="Tik Tok Sentiment")
We also visualize common words for Facebook and TikTok with word clouds. The visual depiction indicates that "learn", "center", and "report" are common words for the Facebook account, with secondary common words such as "secure", "page", and "visit". This could be because Facebook users tweet about account issues. TikTok, by contrast, has "top", "tomorrow", and "prizes" as common words, with secondary common words such as "winner", "nominating", and "grand", indicating that the platform likes to promote competitions and giveaways, which makes sense given that its younger demographic might enjoy these types of rewards and games.
facebook_cloud <- fb_words %>% count(word) %>% arrange(-n)
wordcloud(facebook_cloud$word, facebook_cloud$n, max.words = 200,
          colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green",
                     "orange", "blue", "brown"))
tiktok_cloud <- tiktok_words %>% count(word) %>% arrange(-n)
wordcloud(tiktok_cloud$word, tiktok_cloud$n, max.words = 200,
          colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green",
                     "orange", "blue", "brown"))
Next, we examine the text of Facebook and TikTok tweets to compute a positive-negative score using the AFINN sentiment lexicon, a list of English terms manually rated for valence with an integer between -5 (negative) and +5 (positive) by Finn Årup Nielsen between 2009 and 2011.
We use this lexicon to compute mean positivity scores for all words tweeted by each user.
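For a quick look at the lexicon's structure (the lookup words below are our own choice, not part of the analysis; fetching the lexicon requires the textdata package loaded above):
# One row per word, with the integer valence in the `value` column
get_sentiments("afinn") %>%
  filter(word %in% c("win", "abandon"))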
# Run this to get the afinn lexicon and save it as a csv
# get_sentiments("afinn") -> afinn
#
# afinn %>% write_as_csv("afinn.csv")

afinn <- read_csv('afinn.csv')

fb_afinn <-
  inner_join(fb_words,
             afinn, by = "word")

tiktok_afinn <-
  inner_join(tiktok_words,
             afinn, by = "word")
fb_mean_afinn <-
  fb_afinn %>%
  summarise(mean_fb_afinn = mean(value))

tiktok_mean_afinn <-
  tiktok_afinn %>%
  summarise(mean_tt_afinn = mean(value))
cat(paste0("Average AFINN scores for all words by user\n",
"\nFacebook: ", round(fb_mean_afinn, 3),
"\nTikTok: ", round(tiktok_mean_afinn, 3)))
## Average AFINN scores for all words by user
##
## Facebook: 0.785
## TikTok: 1.704
Facebook’s mean AFINN value is 0.785, while TikTok’s mean AFINN value is 1.704. In general, words tweeted by TikTok are more positive than those tweeted by Facebook.
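As an optional sanity check that is not part of the original analysis, a Welch two-sample t-test on the per-word scores would indicate whether this gap is larger than chance variation:
# Hedged add-on: compare the distributions of per-word AFINN scores
t.test(tiktok_afinn$value, fb_afinn$value)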
Here, using the text of a tweet, we attempt to predict the user who tweeted it.
The features we extracted are tweet length, the presence of a picture/link, number of words for each sentiment, and mean AFINN score per tweet.
TikTok is encoded as 1, and Facebook is encoded as 0.
First, we produce a simple decision tree.
fb_piclinks <-
  facebook %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(picture_link = ifelse(str_detect(text, "t.co"),
                               1, 0)) %>%
  select(1, 5)

tiktok_piclinks <-
  tiktok %>%
  filter(!str_detect(text, '^"')) %>%
  mutate(picture_link = ifelse(str_detect(text, "t.co"),
                               1, 0)) %>%
  select(1, 5)
fb_tweet_afinn <-
  fb_afinn %>%
  group_by(status_id) %>%
  summarize(afinn = mean(value))
## `summarise()` ungrouping output (override with `.groups` argument)
tiktok_tweet_afinn <-
  tiktok_afinn %>%
  group_by(status_id) %>%
  summarize(afinn = mean(value))
## `summarise()` ungrouping output (override with `.groups` argument)
fb_sentiment_counts <-
  fb_sentiment %>%
  group_by(status_id) %>%
  count(sentiment) %>%
  pivot_wider(id_cols = status_id,
              names_from = sentiment,
              values_from = n,
              values_fill = 0)
tiktok_sentiment_counts <-
  tiktok_sentiment %>%
  group_by(status_id) %>%
  count(sentiment) %>%
  pivot_wider(id_cols = status_id,
              names_from = sentiment,
              values_from = n,
              values_fill = 0)
tiktok_feature_selection <-
  tiktok_wordcounts %>%
  mutate(user = 1) %>%
  left_join(tiktok_sentiment_counts,
            by = "status_id") %>%
  left_join(tiktok_tweet_afinn,
            by = "status_id") %>%
  left_join(tiktok_piclinks,
            by = "status_id")
facebook_feature_selection <-
  fb_wordcounts %>%
  mutate(user = 0) %>%
  left_join(fb_sentiment_counts,
            by = "status_id") %>%
  left_join(fb_tweet_afinn,
            by = "status_id") %>%
  left_join(fb_piclinks,
            by = "status_id")
both_users <-
  tiktok_feature_selection %>%
  rbind(facebook_feature_selection) %>%
  mutate_if(is.numeric, coalesce, 0)
set.seed(123)

index <-
  createDataPartition(both_users$user,
                      p = 0.8, list = FALSE)

for_decisiontree <-
  both_users %>% select(-1, -2, -3, -4) %>%
  drop_na()

train <- for_decisiontree[index, ]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
test <- for_decisiontree[-index, ]

set.seed(123)
simple_model <- rpart(user ~ .,
                      data = train, method = "class")
rpart.plot(simple_model, yesno = 2)
We produce additional models using the bagging, random forests, and gradient boosting methods.
set.seed(123)
bagging_model <- train(
  user ~ .,
  data = train,
  method = "treebag",
  trControl = trainControl(method = "oob"),
  keepX = T,
  nbagg = 100,
  importance = "impurity",
  control = rpart.control(minsplit = 2, cp = 0)
)

n_features <- length(setdiff(names(train), "user"))

rf_model <- ranger(
  user ~ .,
  data = train,
  mtry = floor(n_features * 0.5),
  respect.unordered.factors = "order",
  importance = "permutation",
  seed = 123
)

set.seed(123) # for reproducibility
gbm_model <- gbm(
  formula = user ~ .,
  data = train,
  distribution = "gaussian", # SSE loss function
  n.trees = 1000,
  shrinkage = 0.05,
  interaction.depth = 5,
  n.minobsinnode = 4,
  cv.folds = 10
)
We also display four variable importance plots to see which variables each model identified as significant.
vip(simple_model, num_features = 30) + ggtitle('Simple Decision Tree - Variable Importance Plot')
vip(bagging_model, num_features = 30) + ggtitle('Bagging - Variable Importance Plot')
vip(rf_model, num_features = 30) + ggtitle('Random Forests - Variable Importance Plot')
vip(gbm_model, num_features = 30) + ggtitle('Gradient Boosting - Variable Importance Plot')
It seems that the simple decision tree placed the most importance on the presence of a picture/link. The bagging model, on the other hand, places no importance on this variable. All four methods identified tweet length as strongly predictive of the user, and all four heavily weighted the anticipation sentiment and AFINN scores.
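For exact numbers rather than a visual read, vip's tabular helper vi() returns the same importances as a data frame; we show it here for the random forest only (this step is our addition, not part of the original report):
vi(rf_model) %>% head(10) # top 10 features by permutation importance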
Now, we produce confusion matrices and show the residual sum of squares for all tree-based methods, first evaluating their performance on the training set and then on the test set. Note again that a TikTok tweet is encoded as 1 and a Facebook tweet as 0. The code is shown for the first test-set matrix only and omitted elsewhere for brevity.
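For reference, the unshown training-set predictions were most likely built along the following lines, mirroring the test-set code shown later (the *_pred_train names are our assumption about the hidden code):
# Assumed reconstruction of the unshown training-set predictions;
# regression-style outputs are rounded to 0/1 before each matrix.
actual_train <- train$user
simple_pred_train <-
  predict(simple_model, newdata = train) %>%
  as_tibble() %>%
  select(2) %>% # probability of class 1 (TikTok)
  unlist() %>%
  as.vector()
bagging_pred_train <- predict(bagging_model, newdata = train)
rf_pred_train <- predict(rf_model, data = train)$predictions
gb_pred_train <- predict(gbm_model, newdata = train)
confusionMatrix(data = factor(round(simple_pred_train)),
                reference = factor(actual_train), mode = "prec_recall")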
Simple Decision Tree - Training Set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2215 583
## 1 332 1984
##
## Accuracy : 0.8211
## 95% CI : (0.8103, 0.8315)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6423
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Precision : 0.7916
## Recall : 0.8697
## F1 : 0.8288
## Prevalence : 0.4980
## Detection Rate : 0.4331
## Detection Prevalence : 0.5471
## Balanced Accuracy : 0.8213
##
## 'Positive' Class : 0
##
Bagging Method - Training Set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2458 57
## 1 89 2510
##
## Accuracy : 0.9715
## 95% CI : (0.9665, 0.9758)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9429
##
## Mcnemar's Test P-Value : 0.0103
##
## Precision : 0.9773
## Recall : 0.9651
## F1 : 0.9712
## Prevalence : 0.4980
## Detection Rate : 0.4806
## Detection Prevalence : 0.4918
## Balanced Accuracy : 0.9714
##
## 'Positive' Class : 0
##
Random Forests Method - Training Set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2410 99
## 1 137 2468
##
## Accuracy : 0.9539
## 95% CI : (0.9477, 0.9594)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9077
##
## Mcnemar's Test P-Value : 0.01602
##
## Precision : 0.9605
## Recall : 0.9462
## F1 : 0.9533
## Prevalence : 0.4980
## Detection Rate : 0.4713
## Detection Prevalence : 0.4906
## Balanced Accuracy : 0.9538
##
## 'Positive' Class : 0
##
Gradient Boosting Method - Training Set:
## Using 806 trees...
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2297 303
## 1 250 2264
##
## Accuracy : 0.8919
## 95% CI : (0.883, 0.9002)
## No Information Rate : 0.502
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7837
##
## Mcnemar's Test P-Value : 0.02702
##
## Precision : 0.8835
## Recall : 0.9018
## F1 : 0.8926
## Prevalence : 0.4980
## Detection Rate : 0.4492
## Detection Prevalence : 0.5084
## Balanced Accuracy : 0.8919
##
## 'Positive' Class : 0
##
Performance Summary and RSS
| type              | total_errors | accuracy  |
|-------------------|--------------|-----------|
| Simple            | 915          | 0.8210794 |
| Bagging           | 146          | 0.9714509 |
| Random Forests    | 236          | 0.9538522 |
| Gradient Boosting | 553          | 0.8918655 |
The rankings for accuracy on the training set are as follows:

1. Bagging method
2. Random forests
3. Gradient boosting method
4. Simple decision tree
We show the residual sum of squares for all four models on the training set below.
rss_simple_train <- sum((actual_train - simple_pred_train)^2)
rss_bagging_train <- sum((actual_train - bagging_pred_train)^2)
rss_rf_train <- sum((actual_train - rf_pred_train)^2)
rss_gb_train <- sum((actual_train - gb_pred_train)^2)

cat(paste0("Residual Sum of Squares on Training Set\n",
           "\nSimple model: ", rss_simple_train,
           "\nBagging model: ", rss_bagging_train,
           "\nRandom forests model: ", rss_rf_train,
           "\nGradient boosting model: ", rss_gb_train))
## Residual Sum of Squares on Training Set
##
## Simple model: 744.823130490417
## Bagging model: 148.955784690615
## Random forests model: 226.599946610086
## Gradient boosting model: 407.58762929458
The bagging model performed the best on the training set, followed by the random forests method, the gradient boosting method, and the simple model in last place.
Now, we show confusion matrices for the test set.
Simple Decision Tree - Test Set:
actual_test <- test$user

simple_pred_test <-
  predict(simple_model, newdata = test) %>%
  as_tibble() %>%
  select(2) %>%
  unlist() %>%
  as.vector()

simple_test_confusion <-
  confusionMatrix(data = factor(round(simple_pred_test)),
                  reference = factor(actual_test), mode = "prec_recall")

simple_test_errors <-
  simple_test_confusion$table[2] +
  simple_test_confusion$table[3]

simple_test_accuracy <-
  as.numeric(simple_test_confusion$overall[1])

simple_test_confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 562 138
## 1 89 489
##
## Accuracy : 0.8224
## 95% CI : (0.8003, 0.843)
## No Information Rate : 0.5094
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6441
##
## Mcnemar's Test P-Value : 0.001443
##
## Precision : 0.8029
## Recall : 0.8633
## F1 : 0.8320
## Prevalence : 0.5094
## Detection Rate : 0.4397
## Detection Prevalence : 0.5477
## Balanced Accuracy : 0.8216
##
## 'Positive' Class : 0
##
Bagging Method - Test Set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 550 92
## 1 101 535
##
## Accuracy : 0.849
## 95% CI : (0.8282, 0.8682)
## No Information Rate : 0.5094
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6979
##
## Mcnemar's Test P-Value : 0.5647
##
## Precision : 0.8567
## Recall : 0.8449
## F1 : 0.8507
## Prevalence : 0.5094
## Detection Rate : 0.4304
## Detection Prevalence : 0.5023
## Balanced Accuracy : 0.8491
##
## 'Positive' Class : 0
##
Random Forests Method - Test Set:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 567 87
## 1 84 540
##
## Accuracy : 0.8662
## 95% CI : (0.8463, 0.8844)
## No Information Rate : 0.5094
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7323
##
## Mcnemar's Test P-Value : 0.8784
##
## Precision : 0.8670
## Recall : 0.8710
## F1 : 0.8690
## Prevalence : 0.5094
## Detection Rate : 0.4437
## Detection Prevalence : 0.5117
## Balanced Accuracy : 0.8661
##
## 'Positive' Class : 0
##
Gradient Boosting Method - Test Set:
## Using 806 trees...
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 568 93
## 1 83 534
##
## Accuracy : 0.8623
## 95% CI : (0.8422, 0.8807)
## No Information Rate : 0.5094
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7244
##
## Mcnemar's Test P-Value : 0.4975
##
## Precision : 0.8593
## Recall : 0.8725
## F1 : 0.8659
## Prevalence : 0.5094
## Detection Rate : 0.4444
## Detection Prevalence : 0.5172
## Balanced Accuracy : 0.8621
##
## 'Positive' Class : 0
##
Performance Summary and RSS
| type              | total_errors | accuracy  |
|-------------------|--------------|-----------|
| Simple            | 227          | 0.8223787 |
| Bagging           | 193          | 0.8489828 |
| Random Forests    | 171          | 0.8661972 |
| Gradient Boosting | 176          | 0.8622848 |
The rankings for accuracy on the test set are as follows:

1. Random forests
2. Gradient boosting method
3. Bagging method
4. Simple decision tree
Now, we show the residual sum of squares for each model with respect to the test set.
rss_simple_test <- sum((actual_test - simple_pred_test)^2)
rss_bagging_test <- sum((actual_test - bagging_pred_test)^2)
rss_rf_test <- sum((actual_test - rf_pred_test)^2)
rss_gb_test <- sum((actual_test - gb_pred_test)^2)

cat(paste0("Residual Sum of Squares on Test Set\n",
           "\nSimple model: ", rss_simple_test,
           "\nBagged model: ", rss_bagging_test,
           "\nRandom forests model: ", rss_rf_test,
           "\nGradient boost model: ", rss_gb_test))
## Residual Sum of Squares on Test Set
##
## Simple model: 185.903238689494
## Bagged model: 137.082023601223
## Random forests model: 120.747112472235
## Gradient boost model: 132.480675055803
Our sentiment analyses indicate that there are differences in sentiment between tweets from the TikTok and Facebook accounts. In general, discussions surrounding Facebook use more trust words, while topics about TikTok use more words that reflect anticipation. Tweets from TikTok are also more positive than those from Facebook. The random forests model performed the best on the test set even though it was only second best on the training set; this may indicate that the bagging model was overfit to the training data, which caused it to perform worse on the test set than the random forests model.
When testing the models on our test data sets, for the simple model we observed 82.24% accuracy, for the bagging model we observed 84.9% accuracy, for the random forest we observed 86.62% accuracy, and for the gradient boosting model we observed 86.23% accuracy. Another good metric to compare is balanced accuracy, which is the average of specificity and sensitivity. For the simple model, balanced accuracy is 82.16%. For the bagging model, balanced accuracy is 84.91%. For the random forest, balanced accuracy is 86.61%. For the gradient boosting model, balanced accuracy is 86.21%.
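For concreteness, balanced accuracy can be recomputed by hand from any of the matrices above; the sketch below plugs in the random forest test-set counts:
# Recomputing balanced accuracy from the random forest test-set matrix
# (the 'positive' class is 0, i.e., Facebook)
tp <- 567; fn <- 84  # Facebook tweets predicted correctly / incorrectly
tn <- 540; fp <- 87  # TikTok tweets predicted correctly / incorrectly
sensitivity <- tp / (tp + fn)    # 0.8710
specificity <- tn / (tn + fp)    # 0.8612
(sensitivity + specificity) / 2  # 0.8661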
In sum, it seems that the best model is the random forests model, with a test set accuracy score of 86.62%.
When analyzing the sum of squares difference between the actual and predicted tweeter on the training set, the results were as follows:
Simple model: 744.823130490417
Bagging model: 148.955784690615
Random forests model: 226.599946610086
Gradient boosting model: 407.58762929458
Looking at the analyses, it seems that the Facebook and TikTok accounts have systematically different Twitter presences. Facebook seems to respond more frequently to user fears, which are associated with words such as "secure" and "trust." TikTok, by contrast, focuses on generating excitement and offering prize giveaways, which is associated with "anticipation" words such as "winning" and "tomorrow." Differences in tweet length also likely reflect the preferences of each target audience: TikTok users are younger and less likely to consume written information (it is a video platform, after all), while the opposite is true for Facebook. In sum, our predictive endeavor was successful, and we unveiled a number of useful insights from it.
Ammar Plumber, Elaina Lin, Kim Nguyen, Meghan Aines, Ryan Karbowicz