During my WhatsApp user experience redesign I started questioning:
Why do people spread misinformation?
The answer is that people would rather trigger an intense emotion than a subtle one.
View My UX Case Study Here
This led me to question whether these strong feelings could be quantitatively analyzed and connected to misclassifying COVID-related claims.
I found a study conducted by Harvard students on South Korean adults and their responses to a COVID-19 survey. The source can be found here: Harvard Study Data
I used their dataset to build a predictive model with principal component analysis to determine whether a person would share misinformation based on:
○ Belief of false COVID statements
○ Age
○ Gender
○ Political Ideology
○ Anger towards virus
○ Anxiety towards virus
○ Education
○ Income
Source: Harvard Study Data
The sample consists of 513 people from South Korea, matched to the proportions of key Korean demographics.
It is a data frame with 513 observations and 37 variables. Responses to the following 12 false statements regarding COVID-19 were encoded as 0 or 1 for both trust and sharing.
The following 12 statements were presented:
Q1: Even if completely cured, COVID-19 patients suffer from life-long lung damage
Q2: A runny nose is a symptom of cold, not COVID-19
Q3: Hair dryers can kill the virus
Q4: Drinking alcohol can kill the virus
Q5: Only certain age groups, races, or ethnicities are vulnerable to the virus
Q6: Antibiotics or flu vaccines can prevent the disease
Q7: You can test yourself for COVID-19 by holding your breath for 10 seconds
Q8: Garlic can prevent infection
Q9: Gargling with salt water can eliminate the virus
Q10: The virus can penetrate into the body through contact
Q11: Smoking can kill the virus
Q12: Drinking hot water or hot tea can reduce the chances of getting infected
FNtrust(1-12) - binary responses for believing each of the 12 statements
FNsharing(1-12) - binary responses for sharing each of the 12 statements
Anger - anger towards the pandemic (6-point Likert scale)
Anxiety - fear towards the pandemic (6-point Likert scale)
Ideology - 7-point scale (1 being extremely conservative, 7 being extremely liberal)
Gender - gender (female is 1)
Age - age in years
Educ - education (1: no elementary school diploma to 12: master's degree or higher)
Income - income (1: less than ₩10,000,000 to 11: more than ₩100,000,000)
st - sum of trust responses per participant
ss - sum of sharing responses per participant
Dataset: Harvard Dataset
Load in the dataset and create a working copy:
library(readxl)
# Read the survey responses from the Excel workbook
harvard <- read_excel("C:/Users/ksree/Desktop/comp_app/harvard.xlsx",
    sheet = "Sheet16")
View(harvard)
# Copy into a plain data frame so the original import stays untouched
data <- as.data.frame(harvard)
Remove the binary answers to statements 1-12 for trust and sharing (I use the per-participant sums st and ss for these variables):
data = subset(data, select = -c(No, FNtrust1,FNtrust2,FNtrust3,FNtrust4,FNtrust5,FNtrust6,FNtrust7,FNtrust8,
FNtrust9,FNtrust10,FNtrust11,FNtrust12,FNtrust13, FNsharing1,FNsharing2,FNsharing3,FNsharing4,FNsharing5,
FNsharing6,FNsharing7,FNsharing8,FNsharing9,FNsharing10,FNsharing11,FNsharing12, FNsharing13))
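The same drop can be written more compactly by matching the column-name prefix rather than listing every column; a minimal sketch of an equivalent call, assuming every per-statement column begins with FNtrust or FNsharing:
# Equivalent drop via a name pattern instead of an explicit column list
drop_cols <- c("No", grep("^FN(trust|sharing)", names(data), value = TRUE))
data <- data[, setdiff(names(data), drop_cols)]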
Profile the data to see any immediate anomalies and check the variables:
View(data)
Needed packages
library(faraway)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:faraway':
##
## logit
Get descriptive statistics for the overall dataset, and identify any visible correlations among variables, outliers, means, and skew:
plot(data)
summary(data)
## anger anxiety ideology gender
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:3.000 1st Qu.:0.0000
## Median :4.000 Median :4.000 Median :4.000 Median :1.0000
## Mean :3.928 Mean :3.926 Mean :4.109 Mean :0.5224
## 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:5.000 3rd Qu.:1.0000
## Max. :6.000 Max. :6.000 Max. :7.000 Max. :1.0000
## age region educ income
## Min. :17.00 Min. : 1.000 Min. : 2.000 Min. : 1.000
## 1st Qu.:35.00 1st Qu.: 3.000 1st Qu.: 8.000 1st Qu.: 3.000
## Median :49.00 Median : 6.000 Median :10.000 Median : 5.000
## Mean :48.19 Mean : 6.665 Mean : 9.158 Mean : 5.374
## 3rd Qu.:62.00 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.: 7.000
## Max. :94.00 Max. :19.000 Max. :12.000 Max. :12.000
## st ss
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 1.000
## Median : 3.000 Median : 3.000
## Mean : 3.639 Mean : 3.238
## 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :13.000 Max. :13.000
describe(data)
Create a correlation plot to visualize the correlations among the variables. The correlations are not as high as hypothesized:
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:faraway':
##
## happy
ggcorr(data, method = c("everything", "pearson"))
Split the data into training and test sets to validate the model at the end:
require(caTools)
## Loading required package: caTools
## Warning: package 'caTools' was built under R version 4.0.3
set.seed(4321)
# Note: sample.split() is usually given the outcome vector (e.g. data$ss);
# passing the whole data frame recycles a column-length logical over the
# rows, which is why the split is 359/154 rather than an exact 75/25.
sample = sample.split(data, SplitRatio = 0.75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
describe(train)
describe(test)
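For reference, the more typical caTools pattern splits on the outcome vector, so the ratio applies per row; a sketch under that assumption (train_alt and test_alt are hypothetical names):
# Conventional usage: split on the outcome so ~75% of rows land in train
split_vec <- sample.split(data$ss, SplitRatio = 0.75)
train_alt <- subset(data, split_vec == TRUE)
test_alt <- subset(data, split_vec == FALSE)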
Standardize and scale the data so that the mean is 0 and the standard deviation is 1 (we are comparing measurements with different units). It is important to scale the data after splitting into train and test sets; strictly speaking, the test set should be scaled with the training means and standard deviations (see the sketch after these two blocks).
train$anger_scaled<-scale(train$anger)
train$anxiety_scaled<-scale(train$anxiety)
train$ideology_scaled<-scale(train$ideology)
train$gender_scaled<-scale(train$gender)
train$age_scaled<-scale(train$age)
train$educ_scaled<-scale(train$educ)
train$income_scaled<-scale(train$income)
train$region_scaled<-scale(train$region)
train$st_scaled<-scale(train$st)
train$ss_scaled<-scale(train$ss)
train = subset(train, select = -c(anger, anxiety,ideology, gender, age, educ, income, region, st, ss))
describe(train)
test$anger_scaled<-scale(test$anger)
test$anxiety_scaled<-scale(test$anxiety)
test$ideology_scaled<-scale(test$ideology)
test$gender_scaled<-scale(test$gender)
test$age_scaled<-scale(test$age)
test$educ_scaled<-scale(test$educ)
test$income_scaled<-scale(test$income)
test$region_scaled<-scale(test$region)
test$st_scaled<-scale(test$st)
test$ss_scaled<-scale(test$ss)
test = subset(test, select = -c(anger, anxiety,ideology, gender, age, educ, income, region, st, ss))
describe(test)
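A stricter alternative is to standardize the test set with the training means and standard deviations, so that no test information enters the transform. A minimal sketch, assuming hypothetical raw copies train_raw and test_raw were kept before the in-place scaling above:
# Reuse the training means/sds on the test set (train_raw/test_raw are
# assumed unscaled copies saved before the scaling above)
cols <- c("anger", "anxiety", "ideology", "gender", "age",
          "educ", "income", "region", "st", "ss")
mu <- sapply(train_raw[cols], mean)
sdv <- sapply(train_raw[cols], sd)
test_scaled <- as.data.frame(Map(function(x, m, s) (x - m) / s,
                                 test_raw[cols], mu, sdv))
names(test_scaled) <- paste0(cols, "_scaled")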
Principal Component Analysis reduces dimensionality through unsupervised learning. The goal is for the model to explain as much variation as possible by transforming a large set of variables into a smaller one that still contains most of the information in the large set:
train.pca <- prcomp(train[,1:10])
train.pca
## Standard deviations (1, .., p=10):
## [1] 1.4255319 1.2077033 1.1529020 1.0887612 0.9731955 0.8938451 0.8922933
## [8] 0.8441184 0.7324568 0.4510464
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## anger_scaled -0.25583327 0.57400924 -0.218816336 0.01419926 0.13902518
## anxiety_scaled -0.26163825 0.58754240 -0.112973270 -0.10224780 0.22324146
## ideology_scaled 0.18730345 -0.13699372 -0.460041912 0.36959192 0.16250685
## gender_scaled 0.11918834 -0.09918661 0.547144849 -0.20349712 0.29557019
## age_scaled -0.25644739 0.11560798 0.503856074 -0.04362025 -0.18781140
## educ_scaled 0.24571311 0.05528468 0.003909389 -0.45788526 0.61382793
## income_scaled 0.17087265 -0.20222358 -0.348418899 -0.50472856 0.02881812
## region_scaled 0.06003516 -0.06929543 0.183675363 0.57769944 0.58003623
## st_scaled -0.57538926 -0.34135434 -0.088042282 -0.06086302 0.19860545
## ss_scaled -0.57165693 -0.34360070 -0.107612108 -0.08706893 0.16819793
## PC6 PC7 PC8 PC9
## anger_scaled -0.102424733 -0.215839558 0.073590176 0.689284240
## anxiety_scaled -0.007255479 -0.030576195 0.172903418 -0.694421526
## ideology_scaled -0.434023289 0.374473811 0.488566733 -0.002866363
## gender_scaled -0.543400395 -0.392750484 0.307983359 0.007919557
## age_scaled 0.244165044 0.538362771 0.505317276 0.158149055
## educ_scaled 0.027212835 0.500412493 -0.288059340 0.123776087
## income_scaled 0.435959716 -0.286391627 0.528033751 0.044457474
## region_scaled 0.489857141 -0.188509564 0.088777883 0.014324610
## st_scaled 0.001721200 0.010591295 -0.077848971 0.008858990
## ss_scaled -0.124158628 0.008914722 0.007234805 -0.001805096
## PC10
## anger_scaled 0.006834423
## anxiety_scaled 0.013440128
## ideology_scaled 0.064748705
## gender_scaled 0.057986670
## age_scaled 0.003875763
## educ_scaled -0.025648413
## income_scaled 0.007340747
## region_scaled -0.064000244
## st_scaled 0.703740331
## ss_scaled -0.701530164
summary(train.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.4255 1.2077 1.1529 1.0888 0.97320 0.8938 0.89229
## Proportion of Variance 0.2032 0.1459 0.1329 0.1185 0.09471 0.0799 0.07962
## Cumulative Proportion 0.2032 0.3491 0.4820 0.6005 0.69524 0.7751 0.85475
## PC8 PC9 PC10
## Standard deviation 0.84412 0.73246 0.45105
## Proportion of Variance 0.07125 0.05365 0.02034
## Cumulative Proportion 0.92601 0.97966 1.00000
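As an aside, prcomp is numerically the singular value decomposition of the centered data matrix, which gives a quick sanity check of the output above:
# prcomp is the SVD of the column-centered data (columns are already sd-1 here)
X <- scale(as.matrix(train[,1:10]), center = TRUE, scale = FALSE)
s <- svd(X)
# s$v matches train.pca$rotation up to column signs, and
# s$d / sqrt(nrow(X) - 1) matches train.pca$sdev
max(abs(abs(s$v) - abs(train.pca$rotation)))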
Visualize how much of the variation each principal component explains:
plot(train.pca)
biplot(train.pca)
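The proportions behind these plots can also be computed directly from the component standard deviations:
# Variance explained per component and the running total
var_explained <- train.pca$sdev^2 / sum(train.pca$sdev^2)
round(var_explained, 3)
round(cumsum(var_explained), 3) # matches the Cumulative Proportion row above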
Reassess correlations using the principal components (which are uncorrelated by construction):
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
cor1 <- cor(train.pca$x, method="pearson")
corrplot::corrplot(cor1, method= "color", order = "hclust", tl.pos = 'n')
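Since the scores produced by prcomp are orthogonal by construction, the plot should show essentially zero off-diagonal correlation; a one-line check:
# Largest absolute off-diagonal correlation among the PC scores (should be ~0)
max(abs(cor1[upper.tri(cor1)]))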
Save the principal component scores as a separate data frame to use for model building:
pcs <- as.data.frame(train.pca$x)
data2 <- cbind(train, pcs)
data2 <- subset(data2, select = -c(anger_scaled, anxiety_scaled, ideology_scaled, gender_scaled, age_scaled, educ_scaled,
income_scaled, region_scaled, st_scaled))
data2
The optimal number of principal components is 3.
Run a regression model using the first 3 principal components:
lmodel <- lm(ss_scaled ~ PC1 + PC2 + PC3, data = data2)
summary(lmodel)
##
## Call:
## lm(formula = ss_scaled ~ PC1 + PC2 + PC3, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31237 -0.23538 0.04115 0.22582 1.31479
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.956e-17 2.041e-02 0.00 1
## PC1 -5.717e-01 1.434e-02 -39.87 < 2e-16 ***
## PC2 -3.436e-01 1.693e-02 -20.30 < 2e-16 ***
## PC3 -1.076e-01 1.773e-02 -6.07 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3868 on 355 degrees of freedom
## Multiple R-squared: 0.8517, Adjusted R-squared: 0.8504
## F-statistic: 679.5 on 3 and 355 DF, p-value: < 2.2e-16
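To interpret the PCR fit in terms of the original standardized inputs, the three PC coefficients can be mapped back through the loading matrix; a minimal sketch using the objects above:
# beta_original = rotation[, 1:3] %*% beta_PC (coefficients on scaled inputs)
beta_pc <- coef(lmodel)[c("PC1", "PC2", "PC3")]
beta_orig <- train.pca$rotation[, 1:3] %*% beta_pc
round(beta_orig, 3)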
Compare to a simple least squares regression
lmod <- lm(ss_scaled ~ ., data = train)
summary(lmod)
##
## Call:
## lm(formula = ss_scaled ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.41925 -0.27647 0.06367 0.26180 2.72794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.984e-17 3.238e-02 0.000 1.0000
## anger_scaled 1.230e-02 3.668e-02 0.335 0.7375
## anxiety_scaled 1.834e-02 3.673e-02 0.499 0.6179
## ideology_scaled 4.701e-02 3.391e-02 1.386 0.1666
## gender_scaled 3.810e-02 3.362e-02 1.133 0.2580
## age_scaled 2.448e-02 3.421e-02 0.716 0.4748
## educ_scaled -5.381e-02 3.342e-02 -1.610 0.1083
## income_scaled 7.737e-03 3.378e-02 0.229 0.8190
## region_scaled -6.915e-02 3.292e-02 -2.101 0.0364 *
## st_scaled 7.847e-01 3.336e-02 23.522 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6135 on 349 degrees of freedom
## Multiple R-squared: 0.6331, Adjusted R-squared: 0.6236
## F-statistic: 66.9 on 9 and 349 DF, p-value: < 2.2e-16
Use forward selection to find the most significant features in the training dataset:
library(MASS)
library(olsrr)
## Warning: package 'olsrr' was built under R version 4.0.3
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:faraway':
##
## hsb
## The following object is masked from 'package:datasets':
##
## rivers
model <- lm(ss_scaled ~ ., data = train)
k <- ols_step_forward_p(model)
ols_step_forward_p(model)
##
## Selection Summary
## -------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -------------------------------------------------------------------------------
## 1 st_scaled 0.6226 0.6216 3.9338 673.9481 0.6152
## 2 region_scaled 0.6269 0.6248 1.8961 671.8867 0.6126
## 3 educ_scaled 0.6294 0.6263 1.4768 671.4311 0.6113
## 4 ideology_scaled 0.6306 0.6264 2.3834 672.3158 0.6112
## 5 gender_scaled 0.6318 0.6266 3.2080 673.1128 0.6111
## -------------------------------------------------------------------------------
plot(k)
Repeat for stepwise selection
k2 <- ols_step_both_p(model)
ols_step_both_p(model)
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------
## 1 st_scaled addition 0.623 0.622 3.9340 673.9481 0.6152
## 2 region_scaled addition 0.627 0.625 1.8960 671.8867 0.6126
## -----------------------------------------------------------------------------------------
plot(k2)
summary(k2)
## Length Class Mode
## orders 2 -none- character
## method 2 -none- character
## steps 1 -none- numeric
## predictors 2 -none- character
## rsquare 2 -none- numeric
## aic 2 -none- numeric
## sbc 2 -none- numeric
## sbic 2 -none- numeric
## adjr 2 -none- numeric
## rmse 2 -none- numeric
## mallows_cp 2 -none- numeric
## indvar 9 -none- character
## betas 5 -none- numeric
## lbetas 2 -none- numeric
## pvalues 5 -none- numeric
## beta_pval 4 data.frame list
## model 12 lm list
View all possible models
library(olsrr)
k3 <- ols_step_all_possible(model)
k3
plot(k3)
Using these variable selection methods, formally write out each model to compare model accuracy:
# PCR: regress sharing on the first three principal components
lmodpcr <- lm(train$ss_scaled ~ train.pca$x[,1:3])
# Simple least squares on all predictors
lmod <- lm(ss_scaled ~ ., data = train)
# Forward selection (note: st_scaled, which entered first in the selection
# summary above, is omitted here; this explains the model's weak fit below)
forward <- lm(ss_scaled ~ region_scaled + educ_scaled + ideology_scaled + gender_scaled, data = train)
# Stepwise selection
stepwise <- lm(ss_scaled ~ st_scaled + region_scaled, data = train)
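Before cross-validating, a quick in-sample comparison of the four fits (lower AIC is better, penalizing for model size):
# In-sample comparison only; the cross-validation below is the real test
AIC(lmodpcr, lmod, forward, stepwise)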
Perform cross validation on these 4 models
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
##
## melanoma
set.seed(13245)
train.control <- trainControl(method = "repeatedcv", number = 10,repeats=20)
model_leastsquares <- train(ss_scaled ~ ., data = train, method = "lm",
trControl = train.control)
model_pcr <- train(ss_scaled ~ PC1 + PC2 + PC3, data = data2, method = "lm",
trControl = train.control)
model_forward <- train(ss_scaled ~ region_scaled + educ_scaled + ideology_scaled + gender_scaled, data = train, method = "lm",
trControl = train.control)
model_stepwise <- train(ss_scaled ~ st_scaled + region_scaled, data = train, method = "lm",
trControl = train.control)
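caret can also compare the four resampling distributions side by side rather than printing each model separately; a short sketch using the fitted objects above:
# Collect the 10-fold x 20-repeat resamples and summarize RMSE / R-squared
cv_results <- resamples(list(ols = model_leastsquares, pcr = model_pcr,
                             forward = model_forward, stepwise = model_stepwise))
summary(cv_results)
bwplot(cv_results, metric = "RMSE") # lattice boxplots of the RMSE spread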
print(model_leastsquares)
## Linear Regression
##
## 359 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 323, 323, 324, 323, 323, 322, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6134204 0.6174199 0.4311321
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(model_pcr)
## Linear Regression
##
## 359 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 323, 324, 324, 323, 324, 323, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3847179 0.850367 0.293216
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(model_forward)
## Linear Regression
##
## 359 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 322, 325, 323, 324, 322, 323, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.9848961 0.04987417 0.7874916
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(model_stepwise)
## Linear Regression
##
## 359 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 323, 324, 322, 324, 323, 323, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6053524 0.6288003 0.4233915
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Validate models on test data set
Set up PCA validation
# Carry over the first 154 rows of PC scores to match the test-set size
testpca <- cbind(data2[1:154,])
testpca
# Loadings of the first three components (kept for reference; the train()
# call below uses the PC1-PC3 columns of testpca, not these vectors)
PC1 <- train.pca$rotation[,1]
PC2 <- train.pca$rotation[,2]
PC3 <- train.pca$rotation[,3]
View(testpca)
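Note that data2[1:154,] reuses the first 154 rows of the training scores rather than projecting the held-out rows. A stricter setup projects the test set through the training rotation with predict(); a minimal sketch, assuming the scaled test columns match those used to fit train.pca:
# Project held-out observations onto the training principal axes
test_scores <- as.data.frame(predict(train.pca, newdata = test))
testpca_strict <- cbind(test_scores[, 1:3], ss_scaled = test$ss_scaled)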
Now fit each model on the test dataset with the same cross-validation control:
model_leastsquares_test <- train(ss_scaled ~ ., data = test, method = "lm", trControl = train.control)
model_PCA_test <- model_pcr_test <- train(ss_scaled ~ PC1 + PC2 + PC3, data = testpca,
method = "lm",
trControl = train.control)
model_forward_test <- train(ss_scaled ~ region_scaled + educ_scaled + ideology_scaled + gender_scaled, data = test, method = "lm", trControl = train.control)
model_stepwise_test <- train(ss_scaled ~ st_scaled + region_scaled, data = test, method = "lm",
trControl = train.control)
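An alternative check is to score the training-fit models on the untouched test rows instead of refitting with cross-validation; a sketch with caret's RMSE and R2 helpers, using the stepwise model as an example:
# Predict the test rows with the model fit on train, then measure the error
pred <- predict(stepwise, newdata = test)
RMSE(pred, test$ss_scaled)
R2(pred, test$ss_scaled)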
model_leastsquares_test
## Linear Regression
##
## 154 samples
## 9 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 139, 139, 140, 138, 138, 139, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6152499 0.6245287 0.4369182
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model_PCA_test
## Linear Regression
##
## 154 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 139, 138, 138, 139, 139, 139, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3604114 0.8871406 0.2763031
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model_forward_test
## Linear Regression
##
## 154 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 138, 138, 138, 140, 138, 140, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.9988511 0.07279995 0.7535439
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model_stepwise_test
## Linear Regression
##
## 154 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 20 times)
## Summary of sample sizes: 138, 138, 140, 139, 139, 138, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.5980537 0.6459783 0.4289118
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Based on cross-validation and performance on the test dataset, the principal component regression is the strongest model. It has a low RMSE and MAE (prediction errors) and a high R-squared (fit of the model to the data).
The least squares line usually fits the data best in-sample because it directly minimizes the sum of squared residuals, but here the principal component regression clearly outperforms even the least squares model.
Through PCA I was able to reduce the dimensionality of the model from 10 variables (in ten-dimensional space) to 3 components, which is much more intuitive.
According to my model, a person's likelihood of sharing COVID-19 misinformation could be estimated through a simple survey of:
○ Belief in 12 false COVID19 claims
○ Anger
○ Anxiety
○ Political Ideology
○ Gender
○ Age
○ Education
○ Income
○ Region
(With very low prediction errors)
This project makes me wonder whether people would be safer answering those questions before sharing information about the pandemic. The user experience side of me says no, but in terms of reducing the incorrect information that is potentially harming people, this appears to be an option.
However, this algorithm makes decisions based on the participant's demographic information in order to output a more accurate result.
The alternative is to use the stepwise model, which yields a less accurate result but utilizes only the first and last features listed above (belief in the false claims and region).
Both are feasible strategies, but it would be up to the organization to decide the tradeoff between accuracy and requiring personal information.