
Motivation

Predict whether someone will spread misinformation regarding COVID-19 using unsupervised machine learning


Background

During my WhatsApp user experience redesign, I started questioning:

Why do people spread misinformation?

The answer is that people would rather trigger an intense emotion than a subtle one.

View My UX Case Study Here

This led me to question whether these strong feelings could be quantitatively analyzed and connected to misjudging COVID-related claims.

I found a study conducted by Harvard students on South Korean adults and their responses to a COVID-19 survey. The source can be found here: Harvard Study Data

My Role

I used their dataset to build a predictive model using principal component analysis to predict whether a person would share misinformation based on:

○ Belief in false COVID statements
○ Age
○ Gender
○ Political Ideology
○ Anger towards the virus
○ Anxiety towards the virus
○ Education
○ Income

Materials

Data

Source: Harvard Study Data

A sample of 513 people from South Korea, matched to the proportions of key Korean demographics.

A data frame with 513 observations and 37 variables. The responses to the following 12 false statements regarding COVID-19 were encoded as 0 or 1 for both trust and sharing.

The following 12 statements were presented:

Q1: Even if completely cured, COVID-19 patients suffer from life-long lung damage

Q2: A runny nose is a symptom of cold, not COVID-19

Q3: Hair dryers can kill the virus

Q4: Drinking alcohol can kill the virus

Q5: Only certain age groups, races, or ethnicities are vulnerable to the virus

Q6: Antibiotics or flu vaccines can prevent the disease

Q7: You can test yourself for COVID-19 by holding your breath for 10 seconds

Q8: Garlic can prevent infection

Q9: Gargling with salt water can eliminate the virus

Q10: The virus can penetrate into the body through contact

Q11: Smoking can kill the virus

Q12: Drinking hot water or hot tea can reduce the chances of getting infected

Variables

FNtrust(1-12) - Responses for believing the 12 statements

FNsharing(1-12) - Responses for sharing the 12 statements

Anger - Anger towards the pandemic (6-point Likert scale)

Anxiety - Fear towards the pandemic (6-point Likert scale)

Ideology - Political ideology on a 7-point scale (1 being extremely conservative and 7 being extremely liberal)

Gender - Gender (female is 1)

Age - Age in years

Educ - Education (1: no elementary school diploma to 12: master's degree or higher)

Income - Income (1: less than ₩10,000,000 to 11: more than ₩100,000,000)

st - Sum of trust responses per participant

ss - Sum of sharing responses per participant
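
For reference, st and ss can be reconstructed directly from the binary columns. A minimal sketch, assuming the data frame is loaded as harvard (as in the Model Building section below) and the binary columns follow the FNtrust*/FNsharing* naming:

trust_cols   <- grep("^FNtrust",   names(harvard), value = TRUE)
sharing_cols <- grep("^FNsharing", names(harvard), value = TRUE)

# Row-wise sums of the 0/1 responses reproduce st and ss
harvard$st <- rowSums(harvard[trust_cols])
harvard$ss <- rowSums(harvard[sharing_cols])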

Model Building

Dataset: Harvard Dataset


Data Cleaning

Load in the dataset and create a working data frame:

library(readxl)

# Read the survey responses from the Excel workbook
harvard <- read_excel("C:/Users/ksree/Desktop/comp_app/harvard.xlsx",
                      sheet = "Sheet16")
View(harvard)

# Keep a working copy so the raw import stays untouched
data <- cbind(harvard)




Remove the binary answers to the trust and sharing statements (the per-participant sums, st and ss, are used for these variables instead)

data = subset(data, select = -c(No, FNtrust1, FNtrust2, FNtrust3, FNtrust4, FNtrust5, FNtrust6,
                                FNtrust7, FNtrust8, FNtrust9, FNtrust10, FNtrust11, FNtrust12,
                                FNtrust13, FNsharing1, FNsharing2, FNsharing3, FNsharing4,
                                FNsharing5, FNsharing6, FNsharing7, FNsharing8, FNsharing9,
                                FNsharing10, FNsharing11, FNsharing12, FNsharing13))
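
The same columns can be dropped more compactly with tidyselect helpers; a sketch, assuming dplyr is installed:

library(dplyr)

# Drop the row index and every binary trust/sharing column in one call
data <- harvard %>%
  select(-No, -starts_with("FNtrust"), -starts_with("FNsharing"))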




Profile the data to spot any immediate anomalies and check the variables

View(data)




Needed packages

library(faraway)
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:faraway':
## 
##     logit

Descriptive Statistics

Get some descriptive statistics for the overall dataset. Identify any visual correlations among variables, visual outliers, means, and skew.

plot(data)

summary(data)
##      anger          anxiety         ideology         gender      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0.0000  
##  Median :4.000   Median :4.000   Median :4.000   Median :1.0000  
##  Mean   :3.928   Mean   :3.926   Mean   :4.109   Mean   :0.5224  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:1.0000  
##  Max.   :6.000   Max.   :6.000   Max.   :7.000   Max.   :1.0000  
##       age            region            educ            income      
##  Min.   :17.00   Min.   : 1.000   Min.   : 2.000   Min.   : 1.000  
##  1st Qu.:35.00   1st Qu.: 3.000   1st Qu.: 8.000   1st Qu.: 3.000  
##  Median :49.00   Median : 6.000   Median :10.000   Median : 5.000  
##  Mean   :48.19   Mean   : 6.665   Mean   : 9.158   Mean   : 5.374  
##  3rd Qu.:62.00   3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.: 7.000  
##  Max.   :94.00   Max.   :19.000   Max.   :12.000   Max.   :12.000  
##        st               ss        
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 2.000   1st Qu.: 1.000  
##  Median : 3.000   Median : 3.000  
##  Mean   : 3.639   Mean   : 3.238  
##  3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :13.000   Max.   :13.000
describe(data)




Create a visual of the correlations among variables with a correlation plot. The correlations are not as high as hypothesized.

library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:faraway':
## 
##     happy
ggcorr(data, method = c("everything", "pearson"))

Split Training and Test Data

Split the data into training and test sets to validate the model at the end

require(caTools)
## Loading required package: caTools
## Warning: package 'caTools' was built under R version 4.0.3
set.seed(4321)
sample = sample.split(data, SplitRatio = 0.75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)

describe(train)
describe(test)
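
One caveat: sample.split() expects an outcome vector, so passing the whole data frame returns a column-length logical that subset() then recycles across rows. A base-R sketch of a row-level split, if that behavior is not intended:

set.seed(4321)
idx   <- sample(seq_len(nrow(data)), size = floor(0.75 * nrow(data)))
train <- data[idx, ]   # 75% of rows for training
test  <- data[-idx, ]  # remaining 25% for testing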

Standardize

Standardize and scale the data so that the mean is 0 and the standard deviation is 1 (the variables are measured in different units). It is important to scale the data after splitting into train/test sets.

train$anger_scaled <- scale(train$anger)
train$anxiety_scaled <- scale(train$anxiety)
train$ideology_scaled <- scale(train$ideology)
train$gender_scaled <- scale(train$gender)
train$age_scaled <- scale(train$age)
train$educ_scaled <- scale(train$educ)
train$income_scaled <- scale(train$income)
train$region_scaled <- scale(train$region)
train$st_scaled <- scale(train$st)
train$ss_scaled <- scale(train$ss)

train = subset(train, select = -c(anger, anxiety, ideology, gender, age, educ, income, region, st, ss))
describe(train)

test$anger_scaled <- scale(test$anger)
test$anxiety_scaled <- scale(test$anxiety)
test$ideology_scaled <- scale(test$ideology)
test$gender_scaled <- scale(test$gender)
test$age_scaled <- scale(test$age)
test$educ_scaled <- scale(test$educ)
test$income_scaled <- scale(test$income)
test$region_scaled <- scale(test$region)
test$st_scaled <- scale(test$st)
test$ss_scaled <- scale(test$ss)

test = subset(test, select = -c(anger, anxiety, ideology, gender, age, educ, income, region, st, ss))
describe(test)
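
Note that scale() above estimates a separate mean and standard deviation on the test set. To keep the test set fully unseen, the training-set parameters can be reused instead; a minimal sketch, assuming the unscaled train/test frames from the split step:

num_cols <- c("anger", "anxiety", "ideology", "gender", "age",
              "educ", "income", "region", "st", "ss")

# Centering/scaling parameters estimated on the training set only
centers <- sapply(train[num_cols], mean)
sds     <- sapply(train[num_cols], sd)

# Apply the training-set parameters to the test set
test_scaled <- as.data.frame(scale(test[num_cols], center = centers, scale = sds))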

PCA

Principal component analysis reduces dimensionality through unsupervised learning. The goal is for the model to explain as much variation as possible by transforming a large set of variables into a smaller one that still contains most of the information in the original set.

train.pca <- prcomp(train[,1:10])
train.pca
## Standard deviations (1, .., p=10):
##  [1] 1.4255319 1.2077033 1.1529020 1.0887612 0.9731955 0.8938451 0.8922933
##  [8] 0.8441184 0.7324568 0.4510464
## 
## Rotation (n x k) = (10 x 10):
##                         PC1         PC2          PC3         PC4         PC5
## anger_scaled    -0.25583327  0.57400924 -0.218816336  0.01419926  0.13902518
## anxiety_scaled  -0.26163825  0.58754240 -0.112973270 -0.10224780  0.22324146
## ideology_scaled  0.18730345 -0.13699372 -0.460041912  0.36959192  0.16250685
## gender_scaled    0.11918834 -0.09918661  0.547144849 -0.20349712  0.29557019
## age_scaled      -0.25644739  0.11560798  0.503856074 -0.04362025 -0.18781140
## educ_scaled      0.24571311  0.05528468  0.003909389 -0.45788526  0.61382793
## income_scaled    0.17087265 -0.20222358 -0.348418899 -0.50472856  0.02881812
## region_scaled    0.06003516 -0.06929543  0.183675363  0.57769944  0.58003623
## st_scaled       -0.57538926 -0.34135434 -0.088042282 -0.06086302  0.19860545
## ss_scaled       -0.57165693 -0.34360070 -0.107612108 -0.08706893  0.16819793
##                          PC6          PC7          PC8          PC9
## anger_scaled    -0.102424733 -0.215839558  0.073590176  0.689284240
## anxiety_scaled  -0.007255479 -0.030576195  0.172903418 -0.694421526
## ideology_scaled -0.434023289  0.374473811  0.488566733 -0.002866363
## gender_scaled   -0.543400395 -0.392750484  0.307983359  0.007919557
## age_scaled       0.244165044  0.538362771  0.505317276  0.158149055
## educ_scaled      0.027212835  0.500412493 -0.288059340  0.123776087
## income_scaled    0.435959716 -0.286391627  0.528033751  0.044457474
## region_scaled    0.489857141 -0.188509564  0.088777883  0.014324610
## st_scaled        0.001721200  0.010591295 -0.077848971  0.008858990
## ss_scaled       -0.124158628  0.008914722  0.007234805 -0.001805096
##                         PC10
## anger_scaled     0.006834423
## anxiety_scaled   0.013440128
## ideology_scaled  0.064748705
## gender_scaled    0.057986670
## age_scaled       0.003875763
## educ_scaled     -0.025648413
## income_scaled    0.007340747
## region_scaled   -0.064000244
## st_scaled        0.703740331
## ss_scaled       -0.701530164
summary(train.pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     1.4255 1.2077 1.1529 1.0888 0.97320 0.8938 0.89229
## Proportion of Variance 0.2032 0.1459 0.1329 0.1185 0.09471 0.0799 0.07962
## Cumulative Proportion  0.2032 0.3491 0.4820 0.6005 0.69524 0.7751 0.85475
##                            PC8     PC9    PC10
## Standard deviation     0.84412 0.73246 0.45105
## Proportion of Variance 0.07125 0.05365 0.02034
## Cumulative Proportion  0.92601 0.97966 1.00000




Visualize how much of the model variation each principal component explains

plot(train.pca)

biplot(train.pca)
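
A quick way to read off how many components are worth keeping is a cumulative-variance plot built from prcomp's sdev field (a sketch; per the summary above, the first 7 PCs pass 80%):

# Proportion of variance per component and its running total
var_explained <- train.pca$sdev^2 / sum(train.pca$sdev^2)
plot(cumsum(var_explained), type = "b",
     xlab = "Principal component", ylab = "Cumulative proportion of variance")
abline(h = 0.8, lty = 2)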




Reassess the correlations among the principal components

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
cor1 <- cor(train.pca$x, method = "pearson")
corrplot::corrplot(cor1, method = "color", order = "hclust", tl.pos = 'n')
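
Because prcomp() returns orthogonal components, the off-diagonal correlations should be essentially zero. A quick numeric check:

# Largest off-diagonal correlation magnitude (should be ~0 up to floating point)
max(abs(cor1[upper.tri(cor1)]))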




Save PCA’s as a seperate data frame to use for model building

pcs <- as.data.frame(train.pca$x)
data2 <- cbind(train, pcs)

# Drop the scaled predictors; ss_scaled (the sharing score) stays on as the response
data2 <- subset(data2, select = -c(anger_scaled, anxiety_scaled, ideology_scaled, gender_scaled,
                                   age_scaled, educ_scaled, income_scaled, region_scaled, st_scaled))
data2
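
From here, data2 can feed a supervised model; a minimal sketch of one possible next step (hypothetical, not part of the analysis above), regressing the scaled sharing score on the leading components:

# Hypothetical: linear regression of ss_scaled on the first four PCs
fit <- lm(ss_scaled ~ PC1 + PC2 + PC3 + PC4, data = data2)
summary(fit)

Note that ss_scaled itself contributed to the PCA inputs above, so a production model would exclude it (and st_scaled) from the columns passed to prcomp().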