R Shiny Code – Pima Indians Diabetes Data Set
April 16, 2018 By Pascal Schmidt Personal Project R
The Shiny App that was created with the code below can be found here and in our previous blog post.
Shiny UI
library(shiny)
library(ggplot2)
library(rsconnect)
library(here)

diabetes <- read.csv(here::here("docs", "diabetes.csv"))

shinyUI(fluidPage(

  # Create a header
  titlePanel(title = "Checking the Model Assumptions for LDA and QDA for the Diabetes Pima Indians Data Set"),

  sidebarLayout(
    sidebarPanel(
      # dropdown selector
      selectInput("variablex", "Ellipses X - Axis", names(diabetes)[1:8],
                  multiple = FALSE, selected = "Glucose"),
      selectInput("variabley", "Ellipses Y - Axis", names(diabetes)[1:8],
                  multiple = FALSE, selected = "Insulin"),
      selectInput("variableyy", "Boxplot Y - Axis", names(diabetes)[1:8],
                  multiple = FALSE, selected = "Insulin"),
      selectInput("normalq", "Q-Q Plot Normality", names(diabetes)[1:8],
                  multiple = FALSE, selected = "Pregnancies"),
      selectInput("normald", "Distribution Normality", names(diabetes)[1:8],
                  multiple = FALSE, selected = "Pregnancies")
    ),

    # Show a plot of the generated distribution
    mainPanel(
      tabsetPanel(
        type = "tabs",
        tabPanel("Variance-Covariance Ellipses", htmlOutput("text"), plotOutput("plot")),
        tabPanel("Equal Variance", htmlOutput("text.1"), plotOutput("plot.1")),
        navbarMenu(
          "Normality",
          tabPanel("Q-Q Plots", htmlOutput("text.2"), plotOutput("plot.2"), plotOutput("plot.3")),
          tabPanel("Distribution", htmlOutput("text.3"), plotOutput("plot.4"))
        )
      )
    )
  )
))
Shiny Server
library(shiny)
library(ggplot2)
library(rsconnect)

shinyServer(function(input, output) {

  output$plot <- renderPlot({
    diabetes <- read.csv("diabetes.csv")
    diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA # replaces all zero values from column two to six with NA
    diabetes <- na.omit(diabetes) # now we omit all NA values
    diabetes$Outcome <- as.factor(diabetes$Outcome)
    levels(diabetes$Outcome) <- c("No Diabetes", "Diabetes")

    ggplot(diabetes, aes(x = diabetes[, input$variablex], y = diabetes[, input$variabley], col = Outcome)) +
      geom_point() +
      stat_ellipse() +
      ylab(input$variabley) +
      xlab(input$variablex) +
      scale_color_manual(values = c("red", "blue"))
  })

  ######### END OF TAB 1
  ######### START OF TAB 2

  output$plot.1 <- renderPlot({
    diabetes <- read.csv("diabetes.csv")
    diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA # replaces all zero values from column two to six with NA
    diabetes <- na.omit(diabetes) # now we omit all NA values
    diabetes$Outcome <- as.factor(diabetes$Outcome)
    levels(diabetes$Outcome) <- c("No Diabetes", "Diabetes")

    ggplot(diabetes, aes(x = Outcome, y = diabetes[, input$variableyy], col = Outcome, fill = Outcome)) +
      geom_boxplot(alpha = 0.2) +
      ylab(input$variableyy) +
      scale_color_manual(values = c("red", "blue")) +
      scale_fill_manual(values = c("red", "blue"))
  })

  ######### END OF TAB 2
  ######### START OF TAB 3.1

  output$plot.2 <- renderPlot({
    diabetes <- read.csv("diabetes.csv")
    diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA # replaces all zero values from column two to six with NA
    diabetes <- na.omit(diabetes) # now we omit all NA values
    diabetes$Outcome <- as.factor(diabetes$Outcome)
    levels(diabetes$Outcome) <- c("No Diabetes", "Diabetes")

    diab.yes <- subset(diabetes, Outcome == "Diabetes")
    qqnorm(diab.yes[, input$normalq], main = "Diabetes Group \n Normal Q-Q Plot")
    qqline(diab.yes[, input$normalq], col = 2)
  })

  output$plot.3 <- renderPlot({
    diabetes <- read.csv("diabetes.csv")
    diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA # replaces all zero values from column two to six with NA
    diabetes <- na.omit(diabetes) # now we omit all NA values
    diabetes$Outcome <- as.factor(diabetes$Outcome)
    levels(diabetes$Outcome) <- c("No Diabetes", "Diabetes")

    diab.no <- subset(diabetes, Outcome == "No Diabetes")
    qqnorm(diab.no[, input$normalq], main = "Non Diabetes Group \n Normal Q-Q Plot")
    qqline(diab.no[, input$normalq], col = 2)
  })

  ######### END OF TAB 3.1
  ######### START OF TAB 3.2

  output$plot.4 <- renderPlot({
    diabetes <- read.csv("diabetes.csv")
    diabetes[, 2:6][diabetes[, 2:6] == 0] <- NA # replaces all zero values from column two to six with NA
    diabetes <- na.omit(diabetes) # now we omit all NA values
    diabetes$Outcome <- as.factor(diabetes$Outcome)
    levels(diabetes$Outcome) <- c("No Diabetes", "Diabetes")

    ggplot(diabetes, aes(x = diabetes[, input$normald], y = ..density.., col = Outcome, fill = Outcome)) +
      geom_density(aes(y = ..density..), alpha = 0.1) +
      xlab(input$normald) +
      scale_color_manual(values = c("blue", "red")) +
      scale_fill_manual(values = c("blue", "red"))
  })

  # text output for data description
  output$text <- renderUI({
    tags$div(
      HTML('<p style="color:black; font-size: 12pt">
           Here we can experiment with the ellipses of each variable.
           Linear Discriminant Analysis assumes that all groups share the same variance-covariance matrix,
           whereas Quadratic Discriminant Analysis allows a different covariance structure within each group,
           so the covariance ellipses can differ. The covariance sets the shape of the ellipses.
           When we experiment with the variables, we can see that most of the ellipses have the same
           orientation and therefore have a roughly similar covariance structure.
           However, the variances differ a lot between the groups. This is visible in the size of the ellipses.
           Bigger ellipses mean a larger variance.
           </p>
           <p style="color:black; font-size: 12pt">
           Another assumption is that each group is drawn from a multivariate normal distribution.
           We can check this assumption by looking at whether the scatter of the data is actually elliptical.
           For some variables, we notice that this is not always the case.
           </p>')
    )
  })

  output$text.1 <- renderUI({
    tags$div(
      HTML('<p style="color:black; font-size: 12pt">
           With these boxplots, we can check the equal variance assumption across the groups for each variable.
           The bigger the boxplot is, the larger the variance.
           The horizontal line inside each boxplot represents the median, and the lower end and the upper end
           are the first and third quartiles, respectively.
           The points that are visible are considered outliers.
           </p>')
    )
  })

  output$text.2 <- renderUI({
    tags$div(
      HTML('<p style="color:black; font-size: 12pt">
           With the Q-Q plots, we can further check whether the data within each group are drawn from a
           normal distribution. We can assume normality when the points fall on the red line.
           For a lot of variables this is not the case.
           </p>')
    )
  })

  output$text.3 <- renderUI({
    tags$div(
      HTML('<p style="color:black; font-size: 12pt">
           Here, we are comparing the distributions of each class.
           These visualizations help determine whether the variances of the classes are roughly equal.
           In addition, they help determine whether each class is drawn from a multivariate normal distribution.
           From the plots, we can see that a lot of distributions are skewed and that the variances
           of the classes differ.
           </p>
           <p style="color:black; font-size: 12pt">
           Now that we have gone through all the visualizations, we can say that some assumptions are violated
           for the linear discriminant analysis and also for the quadratic discriminant analysis.
           The normality assumption is violated for some variables.
           Moreover, the variances of some variables are not equal.
           However, the orientation of each ellipse was roughly equal, so the covariance structure was similar.
           </p>
           <p style="color:black; font-size: 12pt">
           I hope you enjoyed the visual approach to linear and quadratic discriminant analysis
           and have learned how to check the assumptions with visualizations. Thank you.
           </p>')
    )
  })
})
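The server code above repeats the same data-cleaning steps in every renderPlot() call. As a minimal sketch of how those steps could live in one place, the block below wraps them in a helper. The name clean_diabetes() is hypothetical and not part of the original app, and the toy data frame holds made-up values that only borrow the dataset's column names. It also assumes Outcome is coded as 0 and 1. The tapply() call at the end is one possible numeric counterpart to the boxplots: it reports the variance of a single variable within each group.

```r
# Hypothetical helper (not in the original app): the cleaning steps that the
# server code repeats inside each renderPlot() call.
clean_diabetes <- function(df) {
  # Zeros in columns two to six are treated as missing values
  df[, 2:6][df[, 2:6] == 0] <- NA
  # Drop every row that contains a missing value
  df <- na.omit(df)
  # Assumes Outcome is coded 0 (no diabetes) and 1 (diabetes)
  df$Outcome <- factor(df$Outcome, levels = c(0, 1),
                       labels = c("No Diabetes", "Diabetes"))
  df
}

# Toy data with made-up values, used only so the sketch runs without the CSV
toy <- data.frame(
  Pregnancies              = c(1, 0, 3, 2, 5, 4, 1, 0),
  Glucose                  = c(89, 137, 0, 78, 197, 166, 103, 118),
  BloodPressure            = c(66, 40, 50, 50, 70, 72, 30, 84),
  SkinThickness            = c(23, 35, 32, 32, 45, 19, 38, 47),
  Insulin                  = c(94, 168, 88, 0, 543, 175, 83, 230),
  BMI                      = c(28.1, 43.1, 31.0, 31.0, 30.5, 25.8, 43.3, 45.8),
  DiabetesPedigreeFunction = c(0.167, 2.288, 0.248, 0.248, 0.158, 0.587, 0.183, 0.551),
  Age                      = c(21, 33, 26, 26, 53, 51, 33, 31),
  Outcome                  = c(0, 1, 1, 0, 1, 1, 0, 1)
)

cleaned <- clean_diabetes(toy)

# Number of remaining observations per group
print(table(cleaned$Outcome))

# Variance of Glucose within each group, a numeric complement to the boxplots
glucose_var <- tapply(cleaned$Glucose, cleaned$Outcome, var)
print(glucose_var)
```

In a Shiny app, a helper like this could be called once, for example in a reactive expression or at the top of the server file, so that each plot reuses the cleaned data instead of re-reading the CSV.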