Sree Kanagala

Data Profiling

Load in dataset
Last 10 columns were created because a multi-class neural net cannot distinguish between factor inputs within one column

library(readxl)
Data_in_brief <- read_excel("C:/Users/ksree/Desktop/senior year/statistics/wh9xk5mp9m-3/Data-in-brief.xlsx", 
                            sheet = "Dataset", range = "b1:y392")
View(Data_in_brief)

str(Data_in_brief)

## tibble [391 x 24] (S3: tbl_df/tbl/data.frame)
##  $ Age           : num [1:391] 26 21 22 25 22 20 25 35 24 19 ...
##  $ Gender        : num [1:391] 0 0 0 1 0 1 1 0 2 1 ...
##  $ Religion      : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
##  $ Income        : num [1:391] 10 7 8 15 4 8 15 10 7 4 ...
##  $ Career        : num [1:391] 1 3 3 3 3 1 3 3 3 1 ...
##  $ Familymember  : num [1:391] 4 6 0 4 4 5 5 4 4 2 ...
##  $ Socialmedia   : num [1:391] 2 2 2 3 3 3 2 3 2 3 ...
##  $ Province      : num [1:391] 1 3 3 3 3 1 2 3 3 1 ...
##  $ Education     : num [1:391] 3 2 2 2 2 2 2 2 2 1 ...
##  $ nCoVsource    : num [1:391] 1 1 1 1 2 1 2 2 2 2 ...
##  $ Frequency     : num [1:391] 0 4 2 2 1 4 8 24 2 2 ...
##  $ Fakenews      : num [1:391] 10 8 8 4 8 10 8 9 4 8 ...
##  $ Officialnews  : num [1:391] 5 7 5 3 5 5 4 2 5 4 ...
##  $ Riskperception: num [1:391] 1 10 9 8 8 7 10 8 6 10 ...
##  $ r1            : num [1:391] 1 0 0 0 0 0 0 0 0 0 ...
##  $ r2            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r3            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r4            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r5            : num [1:391] 0 0 0 0 0 0 0 0 0 0 ...
##  $ r6            : num [1:391] 0 0 0 0 0 0 0 0 1 0 ...
##  $ r7            : num [1:391] 0 0 0 0 0 1 0 0 0 0 ...
##  $ r8            : num [1:391] 0 0 0 1 1 0 0 1 0 0 ...
##  $ r9            : num [1:391] 0 0 1 0 0 0 0 0 0 0 ...
##  $ r10           : num [1:391] 0 1 0 0 0 0 1 0 0 1 ...

Put data into a variable minus the Riskperception column because the results were encoded seprately in the last ten columns

data = subset(Data_in_brief, select = -c(Riskperception))

Profile data to see any immediate anomalies and check variables

View(data)

Load required packages and set seed to reproduce results

library(neuralnet)

## Warning: package 'neuralnet' was built under R version 4.0.3

library(nnet)

## Warning: package 'nnet' was built under R version 4.0.3

library(psych)
require(ggplot2)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

set.seed(345)

Descriptive Statistics

Get some descriptive statistics for the overall dataset
Identify any visual correlations among variables, visual outliers, mean and skew

plot(data)

summary(data)

##       Age            Gender          Religion          Income      
##  Min.   : 0.00   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:19.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median :23.00   Median :0.0000   Median :0.0000   Median : 5.000  
##  Mean   :23.23   Mean   :0.5754   Mean   :0.4834   Mean   : 8.747  
##  3rd Qu.:28.00   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :47.00   Max.   :2.0000   Max.   :3.0000   Max.   :50.000  
##      Career       Familymember    Socialmedia       Province    
##  Min.   :1.000   Min.   : 0.00   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.: 4.00   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median : 4.00   Median :2.000   Median :3.000  
##  Mean   :1.926   Mean   : 4.21   Mean   :2.414   Mean   :2.315  
##  3rd Qu.:3.000   3rd Qu.: 5.00   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :3.000   Max.   :10.00   Max.   :3.000   Max.   :3.000  
##    Education       nCoVsource      Frequency         Fakenews     
##  Min.   :0.000   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 2.000   1st Qu.: 7.000  
##  Median :2.000   Median :2.000   Median : 2.000   Median : 8.000  
##  Mean   :2.056   Mean   :1.885   Mean   : 4.751   Mean   : 7.767  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 8.000   3rd Qu.: 9.500  
##  Max.   :3.000   Max.   :3.000   Max.   :24.000   Max.   :10.000  
##   Officialnews          r1                 r2                 r3         
##  Min.   : 0.000   Min.   :0.000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.: 5.000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median : 6.000   Median :0.000000   Median :0.000000   Median :0.00000  
##  Mean   : 5.795   Mean   :0.005115   Mean   :0.002557   Mean   :0.01535  
##  3rd Qu.: 7.000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :10.000   Max.   :1.000000   Max.   :1.000000   Max.   :1.00000  
##        r4          r5          r6                r7               r8        
##  Min.   :0   Min.   :0   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0   1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0   Median :0   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0   Mean   :0   Mean   :0.07928   Mean   :0.1611   Mean   :0.3043  
##  3rd Qu.:0   3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :0   Max.   :0   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##        r9              r10        
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.1202   Mean   :0.1816  
##  3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000

describe(data)

Split Training and Test Data

Split data into training and test sets to validate model in the end

set.seed(4321)
dt <- sort(sample(nrow(data), nrow(data)*.75))
train <- data[dt,]
test <- data[-dt,]

describe(train)

describe(test)

Scale and Standardize Data

scl <- function(x){ (x - min(x))/(max(x) - min(x)) }
train[, 1:23] <- data.frame(lapply(train[, 1:23], scl))
test[, 1:23] <- data.frame(lapply(test[, 1:23], scl))


is.nan.data.frame <- function(x)
  do.call(cbind, lapply(x, is.nan))

train[is.nan(train)] <- 0
test[is.nan(test)] <- 0

Neural Network

To put in a formula each factor that is encoded in its own column must be put into its own variable

n <- names(train)
f <- as.formula(paste("r1 + r2 + r3 + r4 + r5 + r6  +r7 + r8 + r9 + r10~", paste(n[!n %in% c("r1","r2","r3","r4","r5","r6","r7","r8","r9","r10")], collapse = " + ")))
f

## r1 + r2 + r3 + r4 + r5 + r6 + r7 + r8 + r9 + r10 ~ Age + Gender + 
##     Religion + Income + Career + Familymember + Socialmedia + 
##     Province + Education + nCoVsource + Frequency + Fakenews + 
##     Officialnews

After trail and error I found parameters for the hidden vertices and neurons that minimize error

nntrain <- neuralnet(f,
                data = train,
                hidden = c(290,270,260,150),
                threshold = 0.001,
                rep = 1,
                act.fct = "logistic",
                linear.output = FALSE,
                lifesign = "minimal")

## hidden: 290, 270, 260, 150    thresh: 0.001    rep: 1/1    steps:     218    error: 0.50139  time: 1.57 mins

plot(nntrain)

Test the model on the test dataset

nntest <- neuralnet(f,
                data = train,
                hidden = c(290,270,260,150),
                threshold = 0.001,
                rep = 1,
                act.fct = "logistic",
                linear.output = FALSE,
                lifesign = "minimal")

## hidden: 290, 270, 260, 150    thresh: 0.001    rep: 1/1    steps:     249    error: 0.50073  time: 2.12 mins

Motivation

Understand what influences risk and how social relationships affect risk perception

Background

Materials

Data

Variables

Model Building

Risk Perception Neural Network

Data Profiling

Descriptive Statistics

Split Training and Test Data

Scale and Standardize Data

Neural Network

Discussion

Mathematical Explanation

Model Evaluation

Application