For one of my upper-level research classes I wrote a paper on how digital media alters personal risk assessment. The findings showed that the shift to a digital environment forces us to process different types of danger, which is worth studying because risk perception is fundamental to keeping ourselves alive.
Read My Paper Here
Related to my paper, I found a dataset online recording different social relationships and how they influence perceived risk. The source can be found here: Risk Alteration Data
A sample of 391 Vietnamese respondents aged 15 to 47 years, from various geographical regions of Vietnam.
A data frame with 391 observations and 15 variables. I added 10 more columns to one-hot encode the response variable (Risk Perception) into classes so the neural net program could fit it as a multi-class classification.
○ ID - Participant ID
○ Age - Age in years
○ Gender - 0 - Male; 1 - Female; 2 - Other or prefer not to answer
○ Religion - 0 - No religion; 1 - Buddhism; 2 - Christian and Protestant; 3 - Other religion
○ Income - Income in millions of VND
○ Career - 1 - Student; 2 - Government officer; 3 - Private sector
○ Family Member - Number of people in the household (excluding the respondent)
○ Social Media Usage Daily - 1 - Less than 1 hour; 2 - 1 to 3 hours; 3 - More than 3 hours
○ Province - 1 - North; 2 - Middle; 3 - South; 4 - Overseas
○ Education - 1 - High school; 2 - University/College; 3 - Postgraduate
○ nCoV-source - 1 - Official information; 2 - Social media and word of mouth; 3 - Others
○ Frequency - Number of times actively searching for coronavirus information
○ Risk Perception - 1 to 10
○ Fake News - 1 to 10: perceived level of fake news about the coronavirus
○ Officialnews - 1 to 10: perceived level of official news about the coronavirus
Dataset: Socioeconomic Dataset
Load in dataset
The last 10 columns were created because a multi-class neural net cannot distinguish between factor levels stored in a single column; each class needs its own indicator column
library(readxl)
# Read the dataset sheet; range b1:y392 skips the ID column in column A
Data_in_brief <- read_excel("C:/Users/ksree/Desktop/senior year/statistics/wh9xk5mp9m-3/Data-in-brief.xlsx",
sheet = "Dataset", range = "b1:y392")
View(Data_in_brief)
str(Data_in_brief)
## tibble [391 x 24] (S3: tbl_df/tbl/data.frame)
## $ Age : num [1:391] 26 21 22 25 22 20 25 35 24 19 ...
## $ Gender : num [1:391] 0 0 0 1 0 1 1 0 2 1 ...
## $ Religion : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
## $ Income : num [1:391] 10 7 8 15 4 8 15 10 7 4 ...
## $ Career : num [1:391] 1 3 3 3 3 1 3 3 3 1 ...
## $ Familymember : num [1:391] 4 6 0 4 4 5 5 4 4 2 ...
## $ Socialmedia : num [1:391] 2 2 2 3 3 3 2 3 2 3 ...
## $ Province : num [1:391] 1 3 3 3 3 1 2 3 3 1 ...
## $ Education : num [1:391] 3 2 2 2 2 2 2 2 2 1 ...
## $ nCoVsource : num [1:391] 1 1 1 1 2 1 2 2 2 2 ...
## $ Frequency : num [1:391] 0 4 2 2 1 4 8 24 2 2 ...
## $ Fakenews : num [1:391] 10 8 8 4 8 10 8 9 4 8 ...
## $ Officialnews : num [1:391] 5 7 5 3 5 5 4 2 5 4 ...
## $ Riskperception: num [1:391] 1 10 9 8 8 7 10 8 6 10 ...
## $ r1 : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
## $ r2 : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
## $ r3 : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
## $ r4 : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
## $ r5 : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
## $ r6 : num [1:391] 0 0 0 0 0 0 0 0 1 0 ...
## $ r7 : num [1:391] 0 0 0 0 0 1 0 0 0 0 ...
## $ r8 : num [1:391] 0 0 0 1 1 0 0 1 0 0 ...
## $ r9 : num [1:391] 0 0 1 0 0 0 0 0 0 0 ...
## $ r10 : num [1:391] 0 1 0 0 0 0 1 0 0 1 ...
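The last ten columns one-hot encode Riskperception; they were added in the spreadsheet, but the same encoding could be produced in R. A minimal sketch (not the original workflow) using nnet::class.ind:
# Encode Riskperception (1-10) as indicator columns r1..r10
onehot <- nnet::class.ind(factor(Data_in_brief$Riskperception, levels = 1:10))
colnames(onehot) <- paste0("r", 1:10)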
Put the data into a variable minus the Riskperception column, because the responses were encoded separately in the last ten columns
data <- subset(Data_in_brief, select = -c(Riskperception))
Profile the data to spot any immediate anomalies and check the variables
View(data)
Load required packages and set seed to reproduce results
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.0.3
library(nnet)
## Warning: package 'nnet' was built under R version 4.0.3
library(psych)
require(ggplot2)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
set.seed(345)
Get some descriptive statistics for the overall dataset
Look for visual correlations among variables, visual outliers, and each variable's mean and skew
plot(data)
summary(data)
## Age Gender Religion Income
## Min. : 0.00 Min. :0.0000 Min. :0.0000 Min. : 0.000
## 1st Qu.:19.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 2.000
## Median :23.00 Median :0.0000 Median :0.0000 Median : 5.000
## Mean :23.23 Mean :0.5754 Mean :0.4834 Mean : 8.747
## 3rd Qu.:28.00 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :47.00 Max. :2.0000 Max. :3.0000 Max. :50.000
## Career Familymember Socialmedia Province
## Min. :1.000 Min. : 0.00 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 1st Qu.: 4.00 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median : 4.00 Median :2.000 Median :3.000
## Mean :1.926 Mean : 4.21 Mean :2.414 Mean :2.315
## 3rd Qu.:3.000 3rd Qu.: 5.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :10.00 Max. :3.000 Max. :3.000
## Education nCoVsource Frequency Fakenews
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.: 7.000
## Median :2.000 Median :2.000 Median : 2.000 Median : 8.000
## Mean :2.056 Mean :1.885 Mean : 4.751 Mean : 7.767
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 8.000 3rd Qu.: 9.500
## Max. :3.000 Max. :3.000 Max. :24.000 Max. :10.000
## Officialnews r1 r2 r3
## Min. : 0.000 Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.: 5.000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median : 6.000 Median :0.000000 Median :0.000000 Median :0.00000
## Mean : 5.795 Mean :0.005115 Mean :0.002557 Mean :0.01535
## 3rd Qu.: 7.000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.000000 Max. :1.000000 Max. :1.00000
## r4 r5 r6 r7 r8
## Min. :0 Min. :0 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0 Median :0 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0 Mean :0 Mean :0.07928 Mean :0.1611 Mean :0.3043
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :0 Max. :0 Max. :1.00000 Max. :1.0000 Max. :1.0000
## r9 r10
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000
## Mean :0.1202 Mean :0.1816
## 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000
describe(data)
Split the data into training and test sets so the model can be validated at the end
set.seed(4321)
# 75/25 train-test split on shuffled row indices
dt <- sort(sample(nrow(data), nrow(data)*.75))
train <- data[dt,]
test <- data[-dt,]
describe(train)
describe(test)
# Min-max scale every column to [0, 1]
scl <- function(x){ (x - min(x))/(max(x) - min(x)) }
train[, 1:23] <- data.frame(lapply(train[, 1:23], scl))
test[, 1:23] <- data.frame(lapply(test[, 1:23], scl))
# Constant columns (e.g. r4 and r5, which are all zero in this sample) divide
# by zero in scl and produce NaN, so replace any NaN with 0
is.nan.data.frame <- function(x)
  do.call(cbind, lapply(x, is.nan))
train[is.nan(train)] <- 0
test[is.nan(test)] <- 0
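One caveat: each split is scaled with its own minimums and maximums, so the training and test sets end up on slightly different scales. A common alternative (a sketch, not what was done above; it assumes unscaled copies train_raw and test_raw were kept before scaling) reuses the training ranges for both:
# Hypothetical: scale the test set with the training set's ranges
rng_min <- sapply(train_raw[, 1:23], min)
rng_span <- sapply(train_raw[, 1:23], max) - rng_min
test[, 1:23] <- sweep(sweep(test_raw[, 1:23], 2, rng_min, "-"), 2, rng_span, "/")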
To build the formula, each class that is encoded in its own column must be listed explicitly as a separate response term
n <- names(train)
f <- as.formula(paste("r1 + r2 + r3 + r4 + r5 + r6 + r7 + r8 + r9 + r10 ~", paste(n[!n %in% c("r1","r2","r3","r4","r5","r6","r7","r8","r9","r10")], collapse = " + ")))
f
## r1 + r2 + r3 + r4 + r5 + r6 + r7 + r8 + r9 + r10 ~ Age + Gender +
## Religion + Income + Career + Familymember + Socialmedia +
## Province + Education + nCoVsource + Frequency + Fakenews +
## Officialnews
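Equivalently (a small sketch, not the original code), the ten response names can be generated with paste0 instead of typing each one:
resp <- paste0("r", 1:10)  # "r1" ... "r10"
f <- as.formula(paste(paste(resp, collapse = " + "), "~",
                      paste(setdiff(n, resp), collapse = " + ")))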
After trial and error, I found settings for the hidden layers (the number of layers and neurons per layer) that minimize error
nntrain <- neuralnet(f,
data = train,
hidden = c(290,270,260,150),
threshold = 0.001,
rep = 1,
act.fct = "logistic",
linear.output = FALSE,
lifesign = "minimal")
## hidden: 290, 270, 260, 150 thresh: 0.001 rep: 1/1 steps: 218 error: 0.50139 time: 1.57 mins
plot(nntrain)
Fit the same architecture on the test dataset to compare its error against training
nntest <- neuralnet(f,
data = test,
hidden = c(290,270,260,150),
threshold = 0.001,
rep = 1,
act.fct = "logistic",
linear.output = FALSE,
lifesign = "minimal")
## hidden: 290, 270, 260, 150 thresh: 0.001 rep: 1/1 steps: 249 error: 0.50073 time: 2.12 mins
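Note that refitting on the test split compares in-sample error on each subset rather than true out-of-sample performance. A minimal sketch (hypothetical, not part of the original analysis) of scoring the trained network directly on the held-out test set:
resp <- paste0("r", 1:10)
# Forward pass of the trained network over the test predictors
pred <- compute(nntrain, test[, !(names(test) %in% resp)])
predicted_class <- max.col(pred$net.result)  # class with the highest activation
actual_class <- max.col(test[, resp])        # decode the one-hot response
mean(predicted_class == actual_class)        # classification accuracy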
My notes on how feed-forward neural nets work, broken down into the basics
View My Notes Here
After trial and error in arranging the hidden layers of perceptrons, I reduced my error to about 0.5 on both the training and testing data, which is a good indicator that the model is not overfitting the training data. However, the run time was long (over a minute per fit), so tuning this model or applying it to a larger dataset will be costly. This was primarily a way for me to understand how to tune a neural net; I also experimented with the learning rate and different starting weights, but I present my basic model here.
I researched the ideal number of hidden neurons, but there does not seem to be a unified answer. One popular rule of thumb is to take the square root of the number of observations and multiply it by the number of classes, but that did not reduce my error.
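For this dataset that rule works out to roughly 198 neurons, for example:
# Rule of thumb: sqrt(observations) x number of classes
round(sqrt(391) * 10)  # sqrt(391) is about 19.8, so about 198 hidden neurons
Below is a visualization of the neural net.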
With this model, a person's risk perception, specifically toward COVID-19, can be classified and predicted.
Since the world has shifted online, the risks we encode are primarily based on our digital environment. Whether this makes us worse at assessing real-world danger is an ongoing debate, but this much is clear:
Our media consumption and relationships affect our decision making.