Predict if someone would spread misinformation regarding COVID19 using unsupervised machine learning



During my WhatsApp user experience redesign, I started asking:

Why do people spread misinformation?

The answer is that people would rather trigger an intense emotion than a subtle one.

This led me to question whether these strong feelings could be analyzed quantitatively and connected to misclassifying COVID-related claims.

I found a study conducted by Harvard students on South Korean adults and their responses to a COVID19 survey. The source can be found here: Harvard Study Data

My Role

I used their dataset to build a predictive model, using principal component analysis, to predict whether a person would share misinformation based on:

○ Belief of false COVID statements
○ Age
○ Gender
○ Political Ideology
○ Anger towards virus
○ Anxiety towards virus
○ Education
○ Income



Source: Harvard Study Data

The sample consists of 513 people from South Korea, matched to the proportions of key Korean demographics.

The data frame has 513 observations and 37 variables. The responses to the following 12 false statements regarding COVID19 were encoded as 0 or 1 for trust and sharing.

The following 12 statements were presented:

Q1: Even if completely cured, COVID-19 patients suffer from life-long lung damage

Q2: A runny nose is a symptom of cold, not COVID-19

Q3: Hair dryers can kill the virus

Q4: Drinking alcohol can kill the virus

Q5: Only certain age groups, races, or ethnicities are vulnerable to the virus

Q6: Antibiotics or flu vaccines can prevent the disease

Q7: You can test yourself for COVID-19 by holding your breath for 10 seconds

Q8: Garlic can prevent infection

Q9: Gargling with salt water can eliminate the virus

Q10: The virus can penetrate into the body through contact

Q11: Smoking can kill the virus

Q12: Drinking hot water or hot tea can reduce the chances of getting infected


FNtrust(1-12): Responses for believing the 12 statements

FNsharing(1-12): Responses for sharing the 12 statements

Anger: Anger towards the pandemic (6-point Likert scale)

Anxiety: Fear towards the pandemic (6-point Likert scale)

Ideology: 7-point scale (1 being extremely conservative and 7 being extremely liberal)

Gender: Gender (Female is 1)

Age: Age in years

Educ: Education (1: No elementary school diploma to 12: Master’s degree or higher)

Income: Income (1: Less than ₩10,000,000 to 11: More than ₩100,000,000)

st: Sum of trust per participant

ss: Sum of sharing per participant
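The two summary scores can be derived directly from the binary item columns. The following is a minimal sketch, assuming the item columns are named FNtrust1 to FNtrust12 and FNsharing1 to FNsharing12 as listed above; the toy data frame `responses` is made up for illustration and is not the study data.

```r
# Hypothetical toy data: 5 participants, 12 binary trust and 12 binary sharing items.
set.seed(1)
trust_items   <- paste0("FNtrust", 1:12)
sharing_items <- paste0("FNsharing", 1:12)
responses <- as.data.frame(matrix(rbinom(5 * 24, 1, 0.3), nrow = 5))
names(responses) <- c(trust_items, sharing_items)

# Sum the 0/1 answers per participant to get the two scores.
responses$st <- rowSums(responses[, trust_items])
responses$ss <- rowSums(responses[, sharing_items])
```

Each score ranges from 0 (no statement believed or shared) to 12 (all of them).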

Model Building

Dataset: Harvard Dataset

Data Cleaning

Load the dataset and create a working copy of it

  library(readxl)  # provides read_excel()

  harvard <- read_excel("C:/Users/ksree/Desktop/comp_app/harvard.xlsx", 
                        sheet = "Sheet16")

  # Work on a copy so the raw import stays untouched
  data <- cbind(harvard)

Remove the binary answers to statements 1-12 for trust and sharing (I use the per-participant aggregates of these variables instead)

  # Drop the row index and the item-level trust and sharing columns
  data <- subset(data, select = -c(No,
                                   FNtrust1, FNtrust2, FNtrust3, FNtrust4, FNtrust5, FNtrust6, FNtrust7,
                                   FNtrust8, FNtrust9, FNtrust10, FNtrust11, FNtrust12, FNtrust13,
                                   FNsharing1, FNsharing2, FNsharing3, FNsharing4, FNsharing5, FNsharing6, FNsharing7,
                                   FNsharing8, FNsharing9, FNsharing10, FNsharing11, FNsharing12, FNsharing13))

Profile the data to spot any immediate anomalies and check the variables

  library(GGally)  # provides ggcorr()

  # Pairwise Pearson correlations between all variables
  ggcorr(data, method = c("everything", "pearson"))

Split Training and Test data

Split the data into training and test sets so the model can be validated at the end
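The split itself is not shown in this write-up. A minimal sketch of one way to do it follows, assuming a roughly 70/30 random split and that every column is standardized with scale() and given a `_scaled` suffix (the suffix matches the variable names used later). The data frame `toy` is a made-up stand-in for the cleaned survey data.

```r
set.seed(42)  # make the random split reproducible

# Hypothetical stand-in for the cleaned survey data (513 rows).
toy <- data.frame(
  anger   = sample(1:6, 513, replace = TRUE),
  anxiety = sample(1:6, 513, replace = TRUE),
  st      = sample(0:12, 513, replace = TRUE),
  ss      = sample(0:12, 513, replace = TRUE)
)

# Standardize every column and mark it with a _scaled suffix.
scaled <- as.data.frame(scale(toy))
names(scaled) <- paste0(names(toy), "_scaled")

# Hold out about 30% of the rows as the test set.
test_idx <- sample(seq_len(nrow(scaled)), size = round(0.3 * nrow(scaled)))
test  <- scaled[test_idx, ]
train <- scaled[-test_idx, ]
```

In a stricter setup the scaling parameters would be estimated on the training rows only and then applied to the test rows, so that no information from the test set leaks into training.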

Principal Component Analysis (PCA) reduces dimensionality through unsupervised learning. The goal is to explain as much variation as possible by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

train.pca <- prcomp(train[,1:10])

Visualize how much of the variation each principal component explains
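The share of variance captured by each component can be computed from the standard deviations that prcomp() returns. Below is a minimal sketch on made-up data; `toy_train` and `toy_pca` are hypothetical names and not objects from this project.

```r
set.seed(7)
# Hypothetical standardized predictors: 200 rows, 5 columns.
toy_train <- as.data.frame(scale(matrix(rnorm(200 * 5), ncol = 5)))

toy_pca <- prcomp(toy_train)

# The variance of each component is the square of its standard deviation.
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_explained <- cumsum(var_explained)

# Scree-style bar plot of the proportion of variance per component.
barplot(var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Proportion of variance explained")
```

A common rule of thumb is to keep the components before the curve flattens out, or enough of them for `cum_explained` to pass a chosen threshold.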



Reassess correlations with the principal components

Save the principal components as a separate data frame to use for model building

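  # Component scores for each training row
  pcs <- as.data.frame(train.pca$x)
  data2 <- cbind(train, pcs)
  # Drop the original scaled predictors so only ss_scaled and the components remain
  data2 <- subset(data2, select = -c(anger_scaled, anxiety_scaled, ideology_scaled, gender_scaled, age_scaled, educ_scaled,
                                    income_scaled, region_scaled, st_scaled))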

The optimal number of principal components is 3

Run a regression model using the first 3 principal components

lmodel <- lm(ss_scaled ~ PC1 + PC2 + PC3, data = data2)

Least Squares Model

Compare to a simple least squares regression

lmod <- lm(ss_scaled ~ ., data = train)

Use forward selection to find the most significant features in the training dataset

  ## -------------------------------------------------------------------------------
  ##         Variable                         Adj.                                      
  ## Step        Entered        R-Square    R-Square     C(p)       AIC        RMSE     
  ## -------------------------------------------------------------------------------
  ##    1    st_scaled            0.6226      0.6216    3.9338    673.9481    0.6152    
  ##    2    region_scaled        0.6269      0.6248    1.8961    671.8867    0.6126    
  ##    3    educ_scaled          0.6294      0.6263    1.4768    671.4311    0.6113    
  ##    4    ideology_scaled      0.6306      0.6264    2.3834    672.3158    0.6112    
  ##    5    gender_scaled        0.6318      0.6266    3.2080    673.1128    0.6111    
  ## -------------------------------------------------------------------------------
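The idea behind the table above can be illustrated with base R's step(), which adds one variable at a time as long as the AIC improves. This is a swapped-in technique used for illustration only, not necessarily the routine that produced the output above, and the data frame `toy` is made up.

```r
set.seed(3)
# Hypothetical data: y depends strongly on x1, weakly on x2, and not on x3.
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

null_model <- lm(y ~ 1, data = toy)   # start from the intercept-only model
full_scope <- y ~ x1 + x2 + x3        # largest model step() is allowed to reach

# Forward selection by AIC: add the most helpful variable until none helps.
fwd <- step(null_model, scope = full_scope, direction = "forward", trace = 0)

# Names of the selected variables, in the order they were added.
selected <- attr(terms(fwd), "term.labels")
```

Because step() ranks candidates by AIC while other routines use p-values or adjusted R-squared, the selected set can differ slightly between tools.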

View all possible models

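  library(olsrr)  # provides ols_step_all_possible()
  k3 <- ols_step_all_possible(lmod)  # lmod is the full least squares model fitted above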

Using these variable selection methods, formally write out each model to compare model accuracy

  lmodpcr <- lm(train$ss_scaled ~ train.pca$x[,1:3])
  lmod <- lm(ss_scaled ~ ., data = train)
  # Includes st_scaled, the first variable entered by forward selection
  forward <- lm(ss_scaled ~ st_scaled + region_scaled + educ_scaled + ideology_scaled + gender_scaled, data = train)
  stepwise <- lm(ss_scaled ~ st_scaled + region_scaled, data = train)

Cross Validation

Perform cross validation on these 4 models
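The later train() calls reference a `train.control` object that this write-up never defines. A minimal sketch of how such an object could be set up with the caret package is shown below, assuming 10-fold cross-validation (the fold count is an assumption) and using a made-up data frame `toy_train`.

```r
library(caret)

set.seed(123)
# Hypothetical standardized data with one informative predictor.
toy_train <- data.frame(st_scaled = rnorm(200), region_scaled = rnorm(200))
toy_train$ss_scaled <- 0.8 * toy_train$st_scaled + rnorm(200, sd = 0.6)

# 10-fold cross-validation; the number of folds is an assumption.
train.control <- trainControl(method = "cv", number = 10)

# Fit a linear model and evaluate it on each held-out fold.
cv_fit <- train(ss_scaled ~ st_scaled + region_scaled, data = toy_train,
                method = "lm", trControl = train.control)

# Cross-validated RMSE, R-squared and MAE, averaged over the folds.
cv_fit$results[, c("RMSE", "Rsquared", "MAE")]
```

Running the same call once per candidate formula gives comparable error estimates for the four models.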

Validate models on test data set

Set up PCA validation

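testpca <- cbind(data2[1:154,])   # first 154 rows of data2, including its PC columns
PC1 <- train.pca$rotation[,1]     # loadings of the first principal component
PC2 <- train.pca$rotation[,2]     # loadings of the second principal component
PC3 <- train.pca$rotation[,3]     # loadings of the third principal component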
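One common way to score held-out rows on the training components is predict() on the prcomp object, which applies the training centering and loadings to new data. The sketch below uses made-up data and assumes the PCA input contains predictors only; `make_toy`, `toy_train` and `toy_test` are hypothetical names.

```r
set.seed(11)
# Hypothetical generator: four predictors and a response that depends on two of them.
make_toy <- function(n) {
  x <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
  names(x) <- paste0("x", 1:4)
  x$y <- x$x1 - 0.5 * x$x2 + rnorm(n, sd = 0.5)
  x
}
toy_train <- make_toy(300)
toy_test  <- make_toy(100)

predictors <- paste0("x", 1:4)
toy_pca <- prcomp(toy_train[, predictors])  # fit PCA on training predictors only

# Scores for the training rows, and test rows projected with the training loadings.
train_pcs <- data.frame(y = toy_train$y, toy_pca$x[, 1:3])
test_pcs  <- data.frame(y = toy_test$y,
                        predict(toy_pca, newdata = toy_test[, predictors])[, 1:3])

# Principal component regression fitted on training scores, evaluated on test scores.
pcr_fit   <- lm(y ~ PC1 + PC2 + PC3, data = train_pcs)
test_pred <- predict(pcr_fit, newdata = test_pcs)
test_rmse <- sqrt(mean((test_pcs$y - test_pred)^2))
```

Keeping the response out of the PCA input and projecting the test rows this way means the test error reflects rows the model has not seen.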

Validate models on test data set

library(caret)  # provides train(); train.control is assumed to be defined in the cross-validation step

model_leastsquares_test <- train(ss_scaled ~ ., data = test, method = "lm",
                                 trControl = train.control)

model_PCA_test <- model_pcr_test <- train(ss_scaled ~ PC1 + PC2 + PC3, data = testpca,
                                          method = "lm",
                                          trControl = train.control)

# Includes st_scaled, the first variable entered by forward selection
model_forward_test <- train(ss_scaled ~ st_scaled + region_scaled + educ_scaled + ideology_scaled + gender_scaled,
                            data = test, method = "lm", trControl = train.control)

model_stepwise_test <- train(ss_scaled ~ st_scaled + region_scaled, data = test, method = "lm",
                             trControl = train.control)

Best Model

Based on cross validation and testing the models on the test data set, the Principal Component Regression is the strongest model. It has a low RMSE (prediction errors), low MSE, and high R-squared (fit of the model with the data).
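For reference, the three metrics named above can be written as small R functions. This is a sketch using one common definition of R-squared (one minus the ratio of residual to total sum of squares); some packages report the squared correlation between observed and predicted values instead. The vectors `obs` and `pred` are made-up numbers.

```r
# Error metrics for a vector of observed values and a vector of predictions.
mse  <- function(obs, pred) mean((obs - pred)^2)
rmse <- function(obs, pred) sqrt(mse(obs, pred))
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observed values and predictions.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

mse(obs, pred)        # 0.25
rmse(obs, pred)       # 0.5
r_squared(obs, pred)  # 0.8
```

Lower RMSE and MSE mean smaller prediction errors, while an R-squared closer to 1 means the predictions account for more of the variation in the response.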

The least squares line usually has the best fit to the data because it minimizes the sum of squared residuals, but here the Principal Component Regression clearly outperforms even the least squares model.

Through PCA, I was able to reduce the dimensionality of the model from 10 variables (a ten-dimensional space) to 3 components, which is much more intuitive.


According to my model, a person's likelihood of sharing COVID19 misinformation could be estimated from a simple survey of:

○ Belief in 12 false COVID19 claims
○ Anger
○ Anxiety
○ Political Ideology
○ Gender
○ Age
○ Education
○ Income
○ Region

(With very low prediction errors)

This project makes me wonder if people would be safer answering those questions before sharing information about the pandemic. The user experience side of me says no, but as a way to reduce the incorrect information that is potentially harming people, it appears to be an option.

However, this algorithm makes decisions based on the participant's demographic information to output a more accurate result.

The alternative is to use a stepwise model that yields a less accurate result but will utilize only the first and last features listed above.

Both are feasible strategies, but it would be up to the organization to decide the tradeoff between accuracy and requiring personal information.