Motivation Materials Model Building Discussion

Motivation

Understand what influences risk and how social relationships affect risk perception

3d modeling

Background

For one of my upper level research classes I wrote a paper on how digital media alters personal risk assessment. The findings showed that the shift to a digital environment forces different types of dangers to be processed which is worth studying because risk perception is the bottom line for keeping ourselves alive.

Read My Paper Here

Related to my paper, I found a dataset online recording different social relationships and how that influences perceived risk. The source can be found here Risk Alteration Data

Materials

Data

Source : Risk Alteration Data

Sample of 391 Vietnamese respondents aged from 15 to 47 years, various geographical regions in Vietnam.

A data frame with 392 observations and 15 variables. I added in 10 more columns as a way to encode the response variable (Risk Perception) into classes for the neural net program to run as a regression.

Variables

○ ID- Participant ID

○ Age - Age in years

○ Gender- 0 - Male; 1 - Female; 2 - Others or do not want to answer

○ Religion- 0 - Non-religion; 1 - Buddhism; 2 - Christian and Protestian; 3 - Other religion

○ Income- Millions in VND

○ Career- 1 - Students; 2 - Government officers; 3 - Private sectors

○ Family Member- Number of people in your household (except respondents)

○ Social Media Usage Daily- 1 - Less 1 hour; 2 - From 1 to 3 hours; 3 - More than 3 hours

○ Province- 1 - North, 2 - Middle, 3 - South, 4 - Overseas

○ Education- 1 - Highschool, 2 - University/College, 3 - Postgraduation

○ nCoV-source- 1 - Official information, 2 - Social media and word-of-mouth, 3 - Others

○ Frequency- The number of actively searching Corona virus information

○ Risk Perception- 1 to 10

○ Fake News- From 1-10: The level of fake news of virus Corona

○ Officialnews- From 1-10: The level of official news of virus Corona

Model Building

Dataset: Socioeconomic Dataset


Data Profiling

Load in dataset
Last 10 columns were created because a multi-class neural net cannot distinguish between factor inputs within one column

library(readxl)
Data_in_brief <- read_excel("C:/Users/ksree/Desktop/senior year/statistics/wh9xk5mp9m-3/Data-in-brief.xlsx", 
                            sheet = "Dataset", range = "b1:y392")
View(Data_in_brief)

str(Data_in_brief)
## tibble [391 x 24] (S3: tbl_df/tbl/data.frame)
##  $ Age           : num [1:391] 26 21 22 25 22 20 25 35 24 19 ...
##  $ Gender        : num [1:391] 0 0 0 1 0 1 1 0 2 1 ...
##  $ Religion      : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
##  $ Income        : num [1:391] 10 7 8 15 4 8 15 10 7 4 ...
##  $ Career        : num [1:391] 1 3 3 3 3 1 3 3 3 1 ...
##  $ Familymember  : num [1:391] 4 6 0 4 4 5 5 4 4 2 ...
##  $ Socialmedia   : num [1:391] 2 2 2 3 3 3 2 3 2 3 ...
##  $ Province      : num [1:391] 1 3 3 3 3 1 2 3 3 1 ...
##  $ Education     : num [1:391] 3 2 2 2 2 2 2 2 2 1 ...
##  $ nCoVsource    : num [1:391] 1 1 1 1 2 1 2 2 2 2 ...
##  $ Frequency     : num [1:391] 0 4 2 2 1 4 8 24 2 2 ...
##  $ Fakenews      : num [1:391] 10 8 8 4 8 10 8 9 4 8 ...
##  $ Officialnews  : num [1:391] 5 7 5 3 5 5 4 2 5 4 ...
##  $ Riskperception: num [1:391] 1 10 9 8 8 7 10 8 6 10 ...
##  $ r1            : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
##  $ r2            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r3            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r4            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r5            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r6            : num [1:391] 0 0 0 0 0 0 0 0 1 0 ...
##  $ r7            : num [1:391] 0 0 0 0 0 1 0 0 0 0 ...
##  $ r8            : num [1:391] 0 0 0 1 1 0 0 1 0 0 ...
##  $ r9            : num [1:391] 0 0 1 0 0 0 0 0 0 0 ...
##  $ r10           : num [1:391] 0 1 0 0 0 0 1 0 0 1 ...




Put data into a variable minus the Riskperception column because the results were encoded seprately in the last ten columns

data = subset(Data_in_brief, select = -c(Riskperception))




Profile data to see any immediate anomalies and check variables

View(data)




Load required packages and set seed to reproduce results

library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.0.3
library(nnet)
## Warning: package 'nnet' was built under R version 4.0.3
library(psych)
require(ggplot2)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
set.seed(345)

Descriptive Statistics

Get some descriptive statistics for the overall dataset
Identify any visual correlations among variables, visual outliers, mean and skew

plot(data)

summary(data)
##       Age            Gender          Religion          Income      
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:19.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median :23.00   Median :0.0000   Median :0.0000   Median : 5.000  
##  Mean   :23.23   Mean   :0.5754   Mean   :0.4834   Mean   : 8.747  
##  3rd Qu.:28.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :47.00   Max.   :2.0000   Max.   :3.0000   Max.   :50.000  
##      Career       Familymember    Socialmedia       Province    
##  Min.   :1.000   Min.   : 0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.: 4.00   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median : 4.00   Median :2.000   Median :3.000  
##  Mean   :1.926   Mean   : 4.21   Mean   :2.414   Mean   :2.315  
##  3rd Qu.:3.000   3rd Qu.: 5.00   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :10.00   Max.   :3.000   Max.   :3.000  
##    Education       nCoVsource      Frequency         Fakenews     
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 2.000   1st Qu.: 7.000  
##  Median :2.000   Median :2.000   Median : 2.000   Median : 8.000  
##  Mean   :2.056   Mean   :1.885   Mean   : 4.751   Mean   : 7.767  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.: 9.500  
##  Max.   :3.000   Max.   :3.000   Max.   :24.000   Max.   :10.000  
##   Officialnews          r1                 r2                 r3         
##  Min.   : 0.000   Min.   :0.000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.: 5.000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median : 6.000   Median :0.000000   Median :0.000000   Median :0.00000  
##  Mean   : 5.795   Mean   :0.005115   Mean   :0.002557   Mean   :0.01535  
##  3rd Qu.: 7.000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :10.000   Max.   :1.000000   Max.   :1.000000   Max.   :1.00000  
##        r4          r5          r6                r7               r8        
##  Min.   :0   Min.   :0   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0   1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0   Median :0   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0   Mean   :0   Mean   :0.07928   Mean   :0.1611   Mean   :0.3043  
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :0   Max.   :0   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##        r9              r10        
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.1202   Mean   :0.1816  
##  3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000
describe(data)

Split Training and Test Data

Split data into training and test sets to validate model in the end

set.seed(4321)
dt <- sort(sample(nrow(data), nrow(data)*.75))
train <- data[dt,]
test <- data[-dt,]

describe(train)
describe(test)

Scale and Standardize Data

scl <- function(x){ (x - min(x))/(max(x) - min(x)) }
train[, 1:23] <- data.frame(lapply(train[, 1:23], scl))
test[, 1:23] <- data.frame(lapply(test[, 1:23], scl))


is.nan.data.frame <- function(x)
  do.call(cbind, lapply(x, is.nan))

train[is.nan(train)] <- 0
test[is.nan(test)] <- 0

Neural Network

To put in a formula each factor that is encoded in its own column must be put into its own variable

n <- names(train)
f <- as.formula(paste("r1 + r2 + r3 + r4 + r5 + r6  +r7 + r8 + r9 + r10~", paste(n[!n %in% c("r1","r2","r3","r4","r5","r6","r7","r8","r9","r10")], collapse = " + ")))
f
## r1 + r2 + r3 + r4 + r5 + r6 + r7 + r8 + r9 + r10 ~ Age + Gender + 
##     Religion + Income + Career + Familymember + Socialmedia + 
##     Province + Education + nCoVsource + Frequency + Fakenews + 
##     Officialnews




After trail and error I found parameters for the hidden vertices and neurons that minimize error

nntrain <- neuralnet(f,
                data = train,
                hidden = c(290,270,260,150),
                threshold = 0.001,
                rep = 1,
                act.fct = "logistic",
                linear.output = FALSE,
                lifesign = "minimal")
## hidden: 290, 270, 260, 150    thresh: 0.001    rep: 1/1    steps:     218    error: 0.50139  time: 1.57 mins
plot(nntrain)




Test the model on the test dataset

nntest <- neuralnet(f,
                data = train,
                hidden = c(290,270,260,150),
                threshold = 0.001,
                rep = 1,
                act.fct = "logistic",
                linear.output = FALSE,
                lifesign = "minimal")
## hidden: 290, 270, 260, 150    thresh: 0.001    rep: 1/1    steps:     249    error: 0.50073  time: 2.12 mins

Discussion

Mathematical Explanation

My notes on how feed forward neural nets work broken down into the basics

View My Notes Here

Model Evaluation

After trial and error for determining how to bind the hidden layer of perceptrons, I reduced my error to about 0.5 for both the training and testing data which is a good indicator that the model is not overfitting the training data. However, the run time for the program was higher and thus tuning this model or applying it to a larger dataset will be costly. This was primarily a way for me to understand how to tune a neural net, and I played around with the learning rate and different starting weights but presented my basic model.

I researched for the ideal number of hidden neurons but there does not seem to be a unified answer. Some of the popular methods included taking the square root of the number of observations and multiplying it to the number of classes. But that did not reduce my error. Below is a visualization of the neural net.