SAS Certification Prep Guide: Statistical Business Analysis Using SAS9

✍ Scribed by Joni Shreve; Donna Dea Holland

Publisher: SAS Institute
Year: 2018
Tongue: English
Leaves: 414
Category: Library


✦ Synopsis

A must-have study guide for the SAS Certified Statistical Business Analyst Using SAS9: Regression and Modeling exam. Written for both new and experienced SAS programmers, the SAS Certification Prep Guide: Statistical Business Analysis Using SAS9 is an in-depth preparation guide for that exam.

✦ Table of Contents

About This Book
What Does This Book Cover?
Requirements and Details
Exam Objectives
Take a Practice Exam
Registering for the Exam
Syntax Conventions
What Should You Know about the Examples?
Software Used to Develop the Book's Content
Example Code and Data
SAS University Edition
Where Are the Exercise Solutions?
We Want to Hear from You
Chapter 1: Statistics and Making Sense of Our World
Introduction
What Is Statistics?
The Two Branches of Statistics
Variable Types and SAS Data Types
Variable Types
Table 1.1 Data for the Study of Diabetes
SAS Data Types
The Data Analytics Process
Defining the Purpose
Table 1.2 Examples of Analyses by Purpose for Various Industries
Data Preparation
Sampling
Cleaning the Data
Exploring the Data
Analyzing the Data and Roadmap to the Book
Table 1.3 Summary of Statistical Models for Business Analysis Certification by Variable Role
Conclusions and Interpretation
Getting Started with SAS
Diabetic Care Management Case
Ames Housing Case
Table 1.4 List of Data Sets Used in the Book by Chapter
Accessing the Data in the SAS Environment
Program 1.1 PROC CONTENTS of the Diabetes Care Management Case Data Set
SAS Log 1.1 PROC CONTENTS of the Diabetes Care Management Case Data Set
Output 1.1 PROC CONTENTS of the Diabetes Care Management Case Data Set
Key Terms
Chapter 2: Summarizing Your Data with Descriptive Statistics
Introduction
Measures of Center
Mean
Figure 2.1 Time to Process Online Orders (in Hours)
Median
Mode
Table 2.1 Number of Deaths for Top Ten Causes – 2014 United States
Measures of Variation
Range
Table 2.2 Time to Process Orders (in Hours) by Retailer
Figure 2.2 Time to Process Orders (in Hours)
Variance
Table 2.3 Descriptive Statistics for Time to Process Orders
Table 2.4 Calculations for Variance as Average Squared Deviations
Standard Deviation
Measures of Shape
Skewness
Figure 2.3 Examples of Symmetric and Asymmetric Distributions
Table 2.5 Sum of Z³ Values for Calculating Skewness
Kurtosis
Figure 2.4 Examples of Kurtosis as Compared to the Normal Distribution
Table 2.6 Sum of Z⁴ Values for Calculating Kurtosis
Other Descriptive Measures
Percentiles, the Five-Number-Summary, and the Interquartile Range (IQR)
Percentiles
The Five-Number-Summary and the Interquartile Range (IQR)
Figure 2.5 Time to Process Online Orders (in Hours) for Retailer 2
Outliers
The MEANS Procedure
Procedure Syntax for PROC MEANS
Program 2.1 PROC MEANS of Process Time and Amount Spent for Retailer 1
Output 2.1 PROC MEANS of Process Time and Amount Spent for Retailer 1
Customizing Output with the VAR Statement and Statistics Keywords
Program 2.2 PROC MEANS with Additional Descriptive Statistics of Process Time for Retailer 1
Output 2.2 PROC MEANS with Additional Descriptive Statistics of Process Time for Retailer 1
Key Words for Generating Desired Statistics
Table 2.7 Keywords for Requesting Statistics in the MEANS Procedure
Comparing Groups Using the CLASS Statement or the BY Statement
PROC MEANS Using the CLASS Statement
Program 2.3 PROC MEANS of Process Time for Retailers 1 and 2 Using the CLASS Statement
Output 2.3 PROC MEANS of Process Time for Retailers 1 and 2 Using the CLASS Statement
PROC MEANS Using the BY Statement
Program 2.4 PROC MEANS of Process Time for Retailers 1 and 2 Using the BY Statement
Output 2.4 PROC MEANS of Process Time for Retailers 1 and 2 Using the BY Statement
Program 2.5 Analysis of Process Time for Retailers 1 and 2 Using BY DESCENDING
Output 2.5 Analysis of Process Time for Retailers 1 and 2 Using BY DESCENDING
Multiple Classes and Customizing Output Using the WAYS and TYPES Statements
Using Multiple Classes in the CLASS Statement
Program 2.6 Three-Way Analysis of Ketones by Diabetes Status, Renal Disease, and Gender
Output 2.6 Three-Way Analysis of Ketones by Diabetes Status, Renal Disease, and Gender
The WAYS Statement for Multiple Classes
Program 2.7 Two-Way Analysis of Ketones by Diabetes Status, Renal Disease, and Gender
Output 2.7 Two-Way Analysis of Ketones by Diabetes Status, Renal Disease, and Gender
The TYPES Statement for Multiple Classes
Program 2.8 One- and Two-Way Analyses of Ketones by Diabetes Status, Renal Disease, and Gender
Output 2.8 One- and Two-Way Analyses of Ketones by Diabetes Status, Renal Disease, and Gender
Saving Your Results Using the OUTPUT Statement
Program 2.9 Ketones for the Diabetic Care Management Case
Output 2.9 Ketones for the Diabetic Care Management Case
The CLASS Statement and the TYPE and FREQ Variables
Program 2.10 Ketones by the Class Controlled_Diabetic
Output 2.10 Ketones by the Class Controlled_Diabetic
Program 2.11 Ketones by the Classes Controlled_Diabetic and Renal_Disease
Output 2.11 Ketones by the Classes Controlled_Diabetic and Renal_Disease
SAS Log 2.1 Ketone Analysis by Two Classes
Program 2.12 Ketones by the Classes Controlled_Diabetic, Renal_Disease, and Gender
Output 2.12 Ketones by the Classes Controlled_Diabetic, Renal_Disease, and Gender
Table 2.8 TYPE Values and the Subgroups Produced by Three-Way Analyses
SAS Log 2.2 Ketone Analysis by the Classes Controlled_Diabetic, Renal_Disease, and Gender
Table 2.9 TYPE, WAYS, Subgroups, and Number of Observations for One-, Two-, and Three-Way Analyses
The CLASS Statement and Filtering the Output Data Set
Program 2.13 Ketone Analysis by Four Classes
SAS Log 2.3 Ketone Analysis by Four Classes
Output 2.13 Filter of Output File for Only One-Way Analyses (TYPE = 1, 2, 4, 8)
The NWAY Option and Comparisons to the WAYS and TYPES Statements
Program 2.14 Three-Way Analysis of Ketones Using the NWAY Option
Output 2.14 Three-Way Analysis of Ketones Using the NWAY Option
Program 2.15 Alternative 1 for Three-Way Analysis of Ketones Using the NWAY Option
Program 2.16 Three Class Variables Connected by the Asterisk (*) in the TYPES Statement
The BY Statement and the TYPE and FREQ Variables
Program 2.17 Ketones by Controlled_Diabetic
Output 2.15 Ketones by Controlled_Diabetic
Program 2.18 Ketones by Controlled_Diabetic for Two Classes
Output 2.16 Ketones by Controlled_Diabetic for Two Classes
Handling Missing Data with the MISSING Option
Program 2.19 The MEANS Procedure of Glucose by AE_DURATION Including Missing Values
Output 2.17a The MEANS Procedure of Glucose by AE_DURATION Including Missing Values
Output 2.17b Glucose by AE_DURATION Including Missing Values
Key Terms
Chapter Quiz
Chapter 3: Data Visualization
Introduction
View and Interpret Categorical Data
Frequency and Crosstabulation Tables Using the FREQ Procedure
Procedure Syntax for PROC FREQ
Figure 3.1 Diabetic Care Management Case Data
Program 3.1 Frequency Tables of GENDER, AGE_RANGE, and CONTROLLED_DIABETIC
Output 3.1 Frequency Tables of GENDER, AGE_RANGE, and CONTROLLED_DIABETIC
PLOTS Options within the TABLES Statement
Program 3.2 Frequency Table and Bar Chart of GENDER
Output 3.2 Frequency Table and Bar Chart of GENDER
Crosstabulations for Illustrating Associations between Two Categorical Variables
Program 3.3 Crosstabulation of Gender by Diabetes Status
Output 3.3a Crosstabulation of Gender by Diabetes Status
Output 3.3b Crosstabulation of Gender by Diabetes Status: Frequency Plots of Gender by Diabetes Status
Program 3.4 Cross Tabs and Frequency Plots of Diabetes Status and Renal Disease
Output 3.4 Cross Tabs and Frequency Plots of Diabetes Status and Renal Disease
MISSING Option within the TABLES Statement
Program 3.5 Crosstabulation of Diabetes Status and Primary Medication with Missing Obs Excluded
Output 3.5 Crosstabulation of Diabetes Status and Primary Medication with Missing Obs Excluded
Program 3.6 Crosstabulation of Diabetes Status and Primary Medication with Missing Obs Included
Output 3.6 Crosstabulation of Diabetes Status and Primary Medication with Missing Obs Included
View and Interpret Numeric Data
Histograms Using the UNIVARIATE Procedure
Figure 3.2 Histogram for Numeric Data
Procedure Syntax for PROC UNIVARIATE
Program 3.7 Univariate Statistics on BMI for 200 Diabetic Patients
Output 3.7 Univariate Statistics on BMI for 200 Diabetic Patients
Table 3.1 Summary Data for the Variable BMI
Program 3.8 Histogram of the Variable BMI
Output 3.8 Histogram of the Variable BMI
Q-Q Plots Using the UNIVARIATE Procedure
Table 3.2 Expected Z-Scores for Number of Texts
Figure 3.3 Q-Q Plot for Number of Texts
Interpreting the Q-Q Plots
Program 3.9 Q-Q Plot for the Variable BMI
Output 3.9 Q-Q Plot for the Variable BMI
Box-and-Whisker Plot Using the UNIVARIATE Procedure
Calculating Quartiles for Five-Number Summary
Figure 3.4 Box Plot for Number of Texts
Interpreting the Box Plot
Program 3.10 Distribution and Probability Plot for BMI
Output 3.10 Distribution and Probability Plot for BMI
UNIVARIATE Procedures Using the INSET Statement
Program 3.11 Histogram with Descriptive Statistics of BMI
Output 3.11 Histogram with Descriptive Statistics of BMI
UNIVARIATE Procedures Using the CLASS Statement
Program 3.12 Histogram of Pounds with Descriptive Statistics by Gender
Output 3.12 Histogram of Pounds with Descriptive Statistics by Gender
Visual Analyses Using the SGPLOT Procedure
Procedure Syntax for PROC SGPLOT
Exploring Bivariate Relationships with Basic Plots, Fits, and Confidence
The SCATTER and REG Statements
Program 3.13 Scatter Plot of Systolic and Diastolic Blood Pressure
Output 3.13 Scatter Plot of Systolic and Diastolic Blood Pressure
Program 3.14 Regression Line and Confidence Limits on Bivariate Scatter Plot
Output 3.14 Regression Line and Confidence Limits on Bivariate Scatter Plot
Program 3.15 Scatter Plot of Price by Quantity Sold
Output 3.15 Scatter Plot of Price by Quantity Sold
Program 3.16 Scatter Plot of Weight and Blood Pressure by Gender
Output 3.16a Scatter Plot of Weight and Blood Pressure by Gender
Output 3.16b Scatter Plot of Weight by Systolic Blood Pressure by Gender
Exploring Other Relationships Using SGPLOT
Program 3.17 Vertical Bar Charts for Diabetes Status
Output 3.17 Vertical Bar Charts for Diabetes Status
Program 3.18 Bar Chart of Diabetes Status by Renal Disease
Output 3.18a Bar Chart of Diabetes Status by Renal Disease
Output 3.18b Numbers with Renal Diseases by Diabetes Status
Program 3.19 Bar Charts for Diastolic and Systolic BP by Diabetes Status
Output 3.19 Bar Charts for Diastolic and Systolic BP by Diabetes Status
Key Terms
Chapter Quiz
Chapter 4: The Normal Distribution and Introduction to Inferential Statistics
Introduction
Continuous Random Variables
Normal Random Variables
Figure 4.1 Distributions of Adult Weights for Three Populations
The Empirical Rule
Figure 4.2 Visualization of the Empirical Rule
Figure 4.3 Empirical Rule Applied to Height of Diabetic Males
Program 4.1 Actual Percentage of Males Having Heights within 1, 2, and 3 Standard Deviations from the Mean
Output 4.1 Actual Percentage of Males Having Heights within 1, 2, and 3 Standard Deviations from the Mean
The Standard Normal Distribution
Figure 4.4 Proportion of Z-values Less Than -1.15, P(Z < -1.15)
Table 4.1 Excerpt from Standard Normal Cumulative Area (for Z ≤ 0)
Figure 4.5 Proportion of Z-values Less Than 1.15, P(Z < +1.15)
Table 4.2 Excerpt from Standard Normal Cumulative Area (for Z ≥ 0)
Figure 4.6 Proportion of Z-Values Greater Than 1.15, P(Z > +1.15)
Figure 4.7 Proportion of Z-Values between -1.00 and +1.00, P(-1.00 < Z < +1.00)
Figure 4.8 Proportion of Z-Values between -1.96 and +1.96, P(-1.96 < Z < +1.96)
Applying the Standard Normal Distribution to Answer Probability Questions
Figure 4.9 Proportion of Americans Exceeding Recommended Daily Sugar Consumption
Figure 4.10 Proportion of College Students Spending More Than 14 Hours Using Digital Devices
The Sampling Distribution of the Mean
Characteristics of the Sampling Distribution of the Mean
Figure 4.11 Distribution of Wait-Times at a Casual-Dining Restaurant
Program 4.2 Description of the Sampling Distribution of Mean Wait-Times
Output 4.2 Description of the Sampling Distribution of Mean Wait-Times
The Central Limit Theorem
Figure 4.12 Sampling Distribution of Average Wait-Times by Sample Size
Application of the Sampling Distribution of the Mean
Figure 4.13 Sample Distribution of the Mean Based upon a Sample Size of 50
Figure 4.14 Probability That Z > +1.77
Effects of Sample Size on the Sampling Distribution
Figure 4.15 Sampling Distribution of the Mean for Two Sample Sizes
Introduction to Hypothesis Testing
Defining the Null and Alternative Hypotheses
Figure 4.16 Rejection Region for a Two-Tailed Test
Figure 4.17 Rejection Region for a Lower-Tailed Test
Figure 4.18 Rejection Region for an Upper-Tailed Test
Defining and Controlling Errors in Hypothesis Testing
Hypothesis Testing for the Population Mean (σ Known)
Two-Tailed Tests for the Population Mean (µ)
Figure 4.19 Rejection Region for a Two-Tailed Test at α = 0.05
Table 4.3 Finding Z-Value Associated with 0.025 Area in the Lower Tail
Figure 4.20 Critical Values for a Two-Tailed Test at α = 0.05
One-Tailed Tests for the Population Mean (µ)
Figure 4.21 Critical Value for a One-Tailed Test at α = 0.05
Figure 4.22 Test Statistic Compared to the Critical Value
Table 4.4 Critical Values Based upon α-Level and One-Tailed versus Two-Tailed Tests
Hypothesis Testing Using the P-Value Approach
Figure 4.23 p-Value for a One-Tailed Test
The P-Value for the Two-Tailed Hypothesis Test
Figure 4.24 p-Value for a Two-Tailed Test
Hypothesis Testing for the Population Mean (σ Unknown)
One-Tailed Tests for the Population Mean (µ)
Figure 4.25 The t-Distribution for Various Sample Sizes
Table 4.5 Descriptive Statistics of BMI for 25 Female Diabetic Patients
Table 4.6 Excerpt from the t-Table
Figure 4.26 t-Test Statistic Compared to the Critical Value
Procedure Syntax for PROC TTEST
Program 4.3 t-Test of BMI for Female Diabetics
Output 4.3 t-Test of BMI for Female Diabetics
Confidence Intervals for Estimating the Population Mean
Confidence Interval for the Population Mean (σ Known)
Figure 4.27 Confidence Intervals as Related to the Sampling Distribution
Effects of Level of Confidence and Sample Size on Confidence Intervals
Confidence Interval for the Population Mean (σ Unknown)
Key Terms
Chapter Quiz
Chapter 5: Analysis of Categorical Variables
Introduction
Testing the Independence of Two Categorical Variables
Hypothesis Testing and the Chi-Square Test
Table 5.1 Expected Frequency Count of Online Shopping by Gender
Table 5.2 Observed and Expected Frequencies Count of Online Shopping by Gender
Figure 5.1 Bivariate Bar Charts of Gender and Online Shopping
The Chi-Square Test Using the FREQ Procedure
Procedure Syntax for PROC FREQ
Program 5.1 Testing Association between Bonus and Kitchen Quality
Output 5.1a Testing Association between Bonus and Kitchen Quality
Output 5.1b Testing Association between Bonus and Kitchen Quality: Bivariate Bar Charts of Bonus and Kitchen Quality
Assumptions
Measuring the Strength of Association between Two Categorical Variables
Cramer’s V
The Odds Ratio
Table 5.3 General Form of the 2x2 Contingency Table
Using Chi-Square Tests for Exploration Prior to Predictive Analytics
Program 5.2 Testing Association between Bonus and Corner Lot
Output 5.2a Testing Association between Bonus and Corner Lot
Output 5.2b Testing Association between Bonus and Corner Lot: Bivariate Bar Charts of Bonus and Corner Lot
Key Terms
Chapter Quiz
Chapter 6: Two-Sample t-Test
Introduction
Independent Samples
The Pooled Variance t-Test
Assumptions
Procedure Syntax for PROC TTEST
Program 6.1 Independent Samples t-Test for Mean Differences in Above Ground Living Area
Output 6.1a Independent Samples t-Test for Ames Housing, Above Ground Living Area by Bonus
Testing the Equal Variance Assumption Using the Folded F-Test
Verifying the Assumptions of a Two-Sample t-Test
Output 6.1b Normal Probability Plots for Above Ground Living Area by Bonus
Supplemental Plots for Data Visualization
Output 6.1c Histograms and Box Plots for Above Ground Living Area by Bonus
Testing the Normality Assumption Using the Kolmogorov-Smirnov Test
Program 6.2 Kolmogorov-Smirnov Test of Normality for Above Ground Living Area by Bonus
Output 6.2 Kolmogorov-Smirnov Test of Normality for Above Ground Living Area by Bonus
Satterthwaite t-Test for Unequal Variances
Program 6.3 Independent Samples t-Test for Mean Differences in Total Basement Area
Output 6.3 Independent Samples t-Test for Ames Housing, Total Basement Area by Bonus
Summary of Steps for the t-Test of Two Independent Populations
Paired Samples
Assumptions
The Paired-Sample t-Test Using the PAIRED Statement in the TTEST Procedure
Table 6.1 Whitley County, Indiana, 2012 and 2016 Tax Assessed Property Values Sample Data
Program 6.4 Kolmogorov-Smirnov Test of Normality Assumption on the Difference Score Using the UNIVARIATE Procedure
Output 6.4a Kolmogorov-Smirnov Test of Normality Assumption on the Difference Score Using the UNIVARIATE Procedure
Output 6.4b Paired t-Test Results for Differences in Tax Assessed Property Values
Output 6.4c Accompanying Plots for the Paired-Sample t-Test
Key Terms
Chapter Quiz
Chapter 7: Analysis of Variance (ANOVA)
Introduction
One-Factor Analysis of Variance
The One-Factor ANOVA Model
Constructing the Test Statistic: Estimating Variance among Groups and Variance within Groups
Table 7.1 Deviations within and across Groups
Table 7.2 Squared Deviations within and across Groups
Figure 7.1 The F-Distribution
Table 7.3 General Form of the Analysis of Variance Table
The GLM Procedure for Investigating Mean Differences
Program 7.1 Descriptive Statistics for Computer Anxiety by Academic Major
Output 7.1 Exploration of Computer Anxiety by Academic Major
Program 7.2 One-Way ANOVA for Testing Differences in Computer Anxiety
Output 7.2 One-Way ANOVA for Testing Differences in Computer Anxiety
Predicted Values and Residuals Using the OUTPUT Statement
Program 7.3 Predicted Values and Residuals for Computer Anxiety Scores
Output 7.3 Predicted Values and Residuals for Computer Anxiety Scores
Measures of Fit
The Normality Assumption and the PLOTS Option
Output 7.4 Fit Diagnostics for the One-Way Analysis of Variance
Levene’s Test for Equal Variances and the MEANS Statement
Program 7.4 The MEANS Statement for Additional Tests of Computer Anxiety Scores
Output 7.5 Levene’s Homogeneity of Variance Test for Computer Anxiety Scores
Post Hoc Tests: The Tukey-Kramer Procedure and the MEANS Statement
Output 7.6 Tukey-Kramer for Testing Pairwise Differences in Computer Anxiety
Other Post Hoc Procedures, the LSMEANS Statement, and the Diffogram
Output 7.7 LSMEANS Statement for Testing Pairwise Differences in Computer Anxiety
Output 7.8 Dunnett Adjustment for Testing Pairwise Differences in Computer Anxiety
Program 7.5 Complete Analysis of Difference in Computer Anxiety Scores Across Academic Majors
The Randomized Block Design
The ANOVA Model for the Randomized Block Design
Example and Interpretation of the Randomized Block Design
Program 7.6 Exploration of Computer Anxiety by Academic Major and Block
Output 7.9 Exploration of Computer Anxiety by Academic Major and Block
Table 7.4 The ANOVA Table for the Randomized Block Design
Program 7.7 Randomized Block Design for Testing Differences in Computer Anxiety
Output 7.10 Randomized Block Design for Testing Differences in Computer Anxiety
Post Hoc Tests Using the LSMEANS Statement
Output 7.11 LSMEANS Statement for Testing Pairwise Differences in Computer Anxiety When Blocking
Assessing the Assumptions of a Randomized Block Design Using the PLOTS Option
Unbalanced Designs, the LSMEANS Statement, and Type III Sums of Squares
Table 7.5 Cell Means and Sample Sizes for Computer Anxiety Scores
Two-Factor Analysis of Variance
The Two-Factor ANOVA Model
Table 7.6 General Form of the Two-Factor ANOVA Table
Example and Interpretation of the Two-Factor ANOVA
Program 7.8 Exploration of Computer Anxiety by Academic Major and Gender
Output 7.12 Descriptive Statistics for Computer Anxiety by Academic Major and Gender
Figure 7.2 Mean Computer Anxiety Scores by Academic Major and Gender
Program 7.9 Two-Factor ANOVA for Testing Differences in Computer Anxiety
Output 7.13a Two-Factor ANOVA for Testing Differences in Computer Anxiety
Output 7.13b Least Squares Means for Major by Gender Interaction Effects
Output 7.13c Diffogram of MAJOR by GENDER Means
Analyzing Simple Effects When Interaction Exists Using the LSMEANS Statement with the SLICE Option
Output 7.13d Analysis of Simple Effects in the Presence of Interaction
Assessing the Assumptions of a Two-Factor Analysis of Variance
Key Terms
Chapter Quiz
Chapter 8: Preparing the Input Variables for Prediction
Introduction
Missing Values
Complete-Case Analysis
Using Imputation with a Missing Value Indicator
Program 8.1 Ames Housing Data with Missing Values
Output 8.1 Ames Housing Data with Missing Values
Program 8.2 Ames Housing with Imputed Data
Output 8.2 Ames Housing with Imputed Data
Categorical Input Variables
Sparse Events and Quasi-Complete Separation
Greenacre’s Method Using the CLUSTER Procedure
Table 8.1 Contingency Table of Bonus by Neighborhood
Program 8.3 Combining Neighborhoods from Ames Housing Data Using Greenacre’s Method
Output 8.3a Chi-square for Bonus by Neighborhood
Output 8.3b Proportion of Houses with Bonus by Neighborhood
Output 8.3c Results of Cluster Analysis on Ames Neighborhoods
Output 8.3d Dendrogram of Cluster Analysis Results by Neighborhoods
Output 8.3e Contents of the Cluster History
Output 8.3f Log P-Value Information and the Cluster History
Output 8.3g Plot of Log P-Value by Number of Clusters
Output 8.3h List of Neighborhoods by Cluster
Table 8.2 Contingency Table of Bonus by Clustered Neighborhoods
Variable Clustering
The VARCLUS Procedure for Variable Reduction
Table 8.3 Correlation Matrix for Variables Q1 through Q6
Procedure Syntax for PROC VARCLUS
Program 8.4 The VARCLUS Procedure for Reducing Ames Housing Inputs
Output 8.4a Summary Information for VARCLUS Procedure for Ames Housing Input Data
Output 8.4b Cluster Summary for 2 Clusters for Ames Housing Input Data
Output 8.4c Cluster Summary for 23 Clusters for Ames Housing Input Data
Output 8.4d R-Squared with Own Cluster and Next Closest Cluster for Ames Housing Input Data
Output 8.4e Summary of Cluster Splitting by Stage
Output 8.4f Dendrogram Illustration of Cluster Splits for Ames Housing Input Data
Cluster Representative and Best Variable Selection
Table 8.4 Reduced Set of Inputs After Deleting Redundant Variables for Ames Housing
Variable Screening
The CORR Procedure for Detecting Associations
Program 8.5 Description of Input Variables Screened for Relevance for Ames Housing Data
Output 8.5a Summary of Input Variables Screened for Relevance for Ames Housing Data
Output 8.5b ODS Output of Spearman Data
Output 8.5c Spearman’s and Hoeffding’s D Correlation Data Sorted by Spearman’s Rank
Output 8.5d Rank of Spearman’s Correlation by Rank of Hoeffding’s D
Using the Empirical Logit to Detect Non-Linear Associations
Program 8.6 Plot of Empirical Logit by Bsmt_Unf_SF
Output 8.6a Value of Bsmt_Unf_SF and Bin Variables for the First Eight Houses in Ames Housing
Output 8.6b Total Frequency, Number of Houses Earning a Bonus, and Average Bsmt_Unf_SF by Bin
Output 8.6c Empirical Logit by the Variable Bsmt_Unf_SF
Key Terms
Chapter Quiz
Chapter 9: Linear Regression Analysis
Introduction
Exploring the Relationship between Two Continuous Variables
Exploring the Relationship between Two Continuous Variables Using a Scatter Plot
Program 9.1 Scatter Plot of Sale Price by Above Ground Living Area
Output 9.1 Scatter Plot of Sale Price and Above Ground Living Area
Program 9.2 Scatter Plot of Sale Price and Age at Time of Sale
Output 9.2 Scatter Plot of Sale Price and Age at Time of Sale
Program 9.3 Scatter Plot of Sale Price and Square Footage
Output 9.3 Scatter Plot of Sale Price and Square Footage
Quantifying the Degree of Association between Two Continuous Variables Using Correlation Statistics
Figure 9.1 Scatter Plot of Perfect Positive, Perfect Negative, and No Relationship
Producing Correlation Coefficients Using the CORR Procedure
Program 9.4 Correlation Coefficient and Descriptive Statistics for Ames Housing
Output 9.4 Correlation Coefficients and Descriptive Statistics for Ames Housing
Program 9.5 Correlation Coefficients with Sale Price for Ames Housing
Output 9.5a Correlation Coefficients with Sale Price for Ames Housing
Output 9.5b Scatter Plots for Sale Price with Potential Predictors
Testing the Hypothesis for a Bivariate Linear Relationship Using the CORR Procedure
Understanding Potential Misuses of the Correlation Coefficient
Simple Linear Regression
Fitting a Simple Linear Regression Model Using the REG Procedure
Figure 9.2 Fitting the Line Closest to All Points
Program 9.6 Linear Regression for Predicting Sale Price with Ground Living Area
Output 9.6 Linear Regression Output for Predicting Sale Price with Ground Living Area
Measures of Fit for the Linear Regression Model
The Coefficient of Determination (R²)
The Standard Error of the Regression (Sₑ)
Using Measures of Fit to Compare Models
Table 9.1 Measures of Fit for Simple Linear Regression
Hypothesis Testing for the Slope
The t-Test for Slope
The F-Test for Slope
Table 9.2 Analysis of Variance (ANOVA) Table for Linear Regression
Producing Confidence Intervals
Program 9.7 Confidence Interval for Effect of Gr_Liv_Area on Sale Price
Output 9.7 Confidence Interval for Effect of Gr_Liv_Area on SalePrice
Multiple Linear Regression
Fitting a Multiple Linear Regression Model Using the REG Procedure
Program 9.8 Multiple Linear Regression for Predicting Sale Price with Six Predictors
Output 9.8 Multiple Linear Regression for Predicting SalePrice with Six Predictors
Measures of Fit for the Multiple Linear Regression Model
Adjusted R-Square
Output 9.9 Multiple Linear Regression for Predicting SalePrice with Five Predictors
Table 9.3 Measures of Fit for Multiple Linear Regression
Quantifying the Relative Impact of a Predictor
Program 9.9 Measures of Relative Predictor Importance in Multiple Linear Regression
Output 9.10 Measure of Relative Predictor Impact in Multiple Linear Regression
Checking for Collinearity Using VIF, COLLIN, and COLLINOINT
The Variance Inflation Factor (VIF) for Detecting Collinearity
The Condition Index (C) for Detecting Collinearity
Program 9.10 VIF and Condition Numbers for Detecting Collinearity
Output 9.11 VIF and Condition Numbers for Detecting Collinearity
Fitting a Simple Linear Regression Model Using the GLM Procedure
Program 9.11 PROC GLM for Prediction Using One Categorical Variable
Output 9.12a PROC GLM for Prediction Using One Categorical Variable
Output 9.12b Tukey Procedure for Detecting Differences in Mean Sale Price
Program 9.12 PROC REG for Prediction Using One Categorical Variable
Output 9.13 PROC REG for Prediction Using One Categorical Variable
Variable Selection Using the REG and GLMSELECT Procedures
The REG Procedure for Variable Selection
All Possible Subsets
Program 9.13 Best Subsets Regression Models Ranked by Adjusted R-Square
Output 9.14 Best Subsets Regression Models Ranked by Adjusted R-Square
Program 9.14 Best Subsets Regression Models Ranked by Mallows’ Cp
Output 9.15a Mallows’ Cp Plot for Variable Selection
Output 9.15b Best Subsets Regression Models Ranked by Mallows’ Cp
Backward Elimination
Program 9.15 Backward Elimination for the Ames Housing Case
Output 9.16a Backward Elimination Step 0
Output 9.16b Backward Elimination Step 1
Output 9.16c Backward Elimination Step 2
Output 9.16d Backward Elimination Step 7
Output 9.16e Summary of Backward Elimination
Output 9.16f Plot of Adjusted R-Square by Backward Elimination Step
Forward Selection
Program 9.16 Forward Selection for the Ames Housing Case
Output 9.17a Forward Selection Step 1
Output 9.17b Forward Selection Step 2
Output 9.17c Forward Selection Step 7
Output 9.17d Summary of Forward Selection
Output 9.17e Plot of Adjusted R-Square by Forward Selection Step
Stepwise Selection
Program 9.17 Stepwise Selection for the Ames Housing Case
Program 9.18 Three Variable Selection Methods for the Ames Housing Case
The GLMSELECT Procedure for Variable Selection
Program 9.19 PROC GLMSELECT with Stepwise Selection for the Ames Housing Case
Output 9.18a PROC GLMSELECT for Stepwise Selection Step 1
Output 9.18b PROC GLMSELECT for Stepwise Selection Step 2
Output 9.18c Summary for Stepwise Selection in PROC GLMSELECT
Output 9.18d The Selected Model from Stepwise Selection in PROC GLMSELECT
Other Features of the GLMSELECT Procedure
Table 9.4 Default SLENTRY and SLSTAY Settings by Model Selection Method
Cautionary Note on Sequential Selection Methods
Assessing the Validity of Results Using Regression Diagnostics
The Assumptions of Linear Regression
Residual Analysis for Checking Assumptions
Figure 9.3 Fit Plot and Residual Plot for Illustrating a Linear Trend with Constant Variance
Figure 9.4 Residual Plot Illustrating a Curvilinear Trend
Figure 9.5 Residual Plot Illustrating Unequal Variance
Figure 9.6 Residual Plot Illustrating Autocorrelation
Program 9.20 Linear Regression Analysis Diagnostics Panel
Output 9.19a Linear Regression on Revenue with Diagnostics Panel
Output 9.19b Predicted Revenue and Residuals Using the Predictor AdExpense
Program 9.21 Linear Regression Analysis Using Transformed Ad Expense (LnAdExp)
Output 9.20 Linear Regression on Revenue Using Transformed Ad Expense (LnAdExp)
Program 9.22 Diagnostics for Multiple Linear Regression
Output 9.21a Multiple Linear Regression for Predicting SalePrice
Output 9.21b Residual by Predicted Plot and Q-Q Plot of Residuals for SalePrice
Output 9.21c Panel of Residual by Regressors for SalePrice
Studentized Residuals
Program 9.23 Residuals and Studentized Residuals by AdExpense for SalePrice
Output 9.22a Residual and Studentized Residuals by AdExpense for SalePrice
Output 9.22b Residuals and Studentized Residuals by AdExpense for SalePrice
Using Statistics to Identify Potential Influential Observations
Program 9.24 Comparing Regression Lines Based on Influence of Obs 15
Output 9.23 Comparing Regression Lines Based on Influence of Obs 15
Leverage (hᵢᵢ)
Discrepancy (RSTUDENTᵢ)
Influence
Program 9.25 Identifying Suspicious Observations Using Measures of Influence
Output 9.24a Linear Regression Output for SalePrice with Influential Observation
Output 9.24b Leverage by RStudent Plot
Output 9.24c Cook’s D and DFFITS Plots for Detecting Influence
Output 9.24d Deletion Statistics for Detecting Influence
Output 9.25 Influence Statistics Using the INFLUENCE Option
Program 9.26 DFBETA Plots for Assessing Local Influence
Output 9.26 DFBETA Plots for Assessing Local Influence
Program 9.27 Regression Diagnostics for the Ames Housing Case
Output 9.27a Influence Panels and Influential Observations for Ames Housing
Output 9.27b Observations Flagged as Influential for Ames Housing
Recommendations for Handling Influential Observations
Concluding Remarks
Key Terms
Chapter Quiz
Chapter 10: Logistic Regression Analysis
Introduction
The Logistic Regression Model
Development of the Logistic Regression Model
Figure 10.1 Scatter Plot of Gr_Liv_Area by Bonus
Program 10.1 Scatter Plot of Binned Living Area by Proportion of Successes
Output 10.1 Scatter Plot of Binned Living Area by Proportion of Successes
The Logit Transformation
Estimating the Logistic Regression Parameters
Syntax for the Logistic Regression Procedure
Program 10.2 Simple Logistic Regression
Output 10.2a Model Information and Response Profile for Simple Logistic Regression
Output 10.2b Model Convergence, Fit Statistics, and Testing Global Null
Output 10.2c Analysis of Maximum Likelihood Estimates
Estimating the Odds Ratio from the Parameter Estimates
Output 10.2d Odds Ratio Estimate for Gr_Liv_Area Based upon Default UNITS=1
Output 10.3 Odds Ratio Estimate for Gr_Liv_Area Based upon UNITS=100
Additional Measures of Fit
Output 10.2e Association of Predicted Probabilities and Observed Responses
Assumptions of Logistic Regression
Plots for Probabilities of an Event and for the Odds Ratios
Figure 10.2 Plot of Gr_Liv_Area by Probability for Bonus=1
Program 10.3 Odds Ratio with 95% Confidence Interval for Gr_Liv_Area (UNITS=100)
Output 10.4 Plot of Odds Ratio with 95% Confidence Interval for Gr_Liv_Area (UNITS=1)
Program 10.4 Odds Ratio with 95% Confidence Interval for Gr_Liv_Area (UNITS=100)
Output 10.5 Plot of Odds Ratio with 95% Confidence Interval for Gr_Liv_Area (UNITS=100)
Program 10.5 UNITS Statement and ODDSRATIO Statement
Logistic Regression with a Categorical Predictor
Effect Coding Parameterization
Program 10.6 Logistic Regression for One Categorical Predictor Using Effect Coding
Output 10.6 Logistic Regression for One Categorical Predictor Using Effect Coding
Reference Cell Coding Parameterization
Program 10.7 Logistic Regression for One Categorical Predictor Using Reference Coding
Output 10.7 Logistic Regression for One Categorical Predictor Using Reference Coding
Program 10.8 CLASS Statement with Dummy Coded Variable
The Multiple Logistic Regression Model
Multiple Logistic Regression by Example
Program 10.9 Multiple Logistic Regression for Ames Housing Using Reference Coding
Output 10.8a Class Level Information Using Reference Coding
Output 10.8b Fit Statistics and Global Null Test for Multiple Logistic Regression
Output 10.8c Type 3 Analysis of Effects for Multiple Logistic Regression
Output 10.8d Maximum Likelihood Estimates and Odds Ratios for Multiple Logistic Regression
Variable Selection
Backward Elimination
Program 10.10 Backward Elimination for Ames Housing
Output 10.9a Effects Eligible for Removal for Step 1 of Backward Elimination
Output 10.9b Effects Eligible for Removal for Step 2 of Backward Elimination
Output 10.9c Effects Eligible for Removal for Steps 3 through 5 of Backward Elimination
Output 10.9d Summary of Effects Removed in Backward Elimination
Forward Selection
Program 10.11 Forward Selection for Ames Housing
Output 10.10a Effects Eligible for Entry for Step 1 of Forward Selection
Output 10.10b Summary of Effects Entered in Forward Selection
Stepwise Selection
Program 10.12 Stepwise Selection for Ames Housing
Output 10.11a Effects Eligible for Entry for Step 1 of Stepwise Selection
Output 10.11b Effects Eligible for Removal After Step 1 of Stepwise Selection
Output 10.11c Effects Eligible for Entry for Step 2 of Stepwise Selection
Output 10.11d Summary of Effects Entered or Removed in Stepwise Selection
Table 10.1 Summary of Effects Entered or Removed in Stepwise Selection
Customized Options within the Sequential Methods
Output 10.12 Summary of Effects Removed in Backward Elimination Using the STOP= Option
Output 10.13 Summary of Effects Entered in Forward Selection Using START= Option
Best Subset Selection
Program 10.13 Score Chi-Square Statistics for the Best Subsets of Size 1 through 8
Output 10.14 Score Chi-Square Statistics for the Best Subsets of Size 1 through 8
Modeling Interaction
Figure 10.3 Mean Plots by Degree and Occupational Area
Program 10.14 Testing Main Effects and Interactions for Ames Housing
Output 10.15 Testing Main Effects and Interactions for Ames Housing
Output 10.16 Example of Failed Model Convergence
Program 10.15 Backward Model Selection for Ames Housing
Output 10.17a Step 0 of Backward Elimination for Main and Interactions Effects
Output 10.17b Interaction Effects Eligible for Removal for Step 1 of Backward Elimination
Output 10.17c Interaction Effects Eligible for Removal for Step 2 of Backward Elimination
Output 10.17d Effects Eligible for Removal for Step 3 of Backward Elimination
Output 10.17e Final Model Selected Using Backward Elimination
Program 10.16 Odds Ratios with Plots for Main Effects and Conditional Effects
Output 10.18a Odds Ratios with Plots for Main Effects and Conditional Effects
Output 10.18b Probabilities for High_Kitchen_Quality by Fullbath_2plus for Overall_Quality=1
Scoring New Data
The SCORE Statement with PROC LOGISTIC
Program 10.17 Predicted Class for New Observations Using the SCORE Statement in PROC LOGISTIC
Output 10.19 Predicted Class for New Observations Using the SCORE Statement in PROC LOGISTIC
Using the PLM Procedure to Call Score Code Created by PROC LOGISTIC
Program 10.18 Predicted Class for New Observations Using PROC PLM with the SCORE Statement
Output 10.20 Predicted Class for New Observations Using PROC PLM with the SCORE Statement
The CODE Statement within PROC LOGISTIC
Program 10.19 Predicted Class for New Observations Using PROC PLM with the SCORE Statement
Output 10.21 Predicted Class for New Observations Using PROC PLM with the SCORE Statement
Program 10.20 SAS Scoring Code Created by the PLM Procedure
The OUTMODEL and INMODEL Options with PROC LOGISTIC
Program 10.21 Model Saved as SAS Data Set Created by the OUTMODEL Option in PROC LOGISTIC
Output 10.22 Model Saved as SAS Data Set Created by the OUTMODEL Option in PROC LOGISTIC
Key Terms
Chapter Quiz
Chapter 11: Measure of Model Performance
Introduction
Preparation for the Modeling Phase
Honest Assessment of a Classifier
PROC SURVEYSELECT for Creating Training and Validation Data Sets
Program 11.1 Partitioning Ames Housing Data into Training and Validation Data Sets
Output 11.1a PROC FREQ on Bonus for Ames Housing Data
Output 11.1b PROC SURVEYSELECT Using Ames Housing Data
Output 11.1c PROC FREQ on Bonus for Ames Training and Validation Data
Log 11.1 Partial Log for PROC SURVEYSELECT Using Ames Housing Data
Recommendations for the Model Preparation Stage
Assessing Classifier Performance
Measures of Performance Using the Classification Table
Table 11.1 General Form of the Classification Table
The CTABLE Option for Producing Classification Results
Program 11.2 Classification Tables for Ames Training and Validation Data Sets
Output 11.2a Classification Table for Ames Training Data
Table 11.2 Classification Table for Ames Training Data
Output 11.2b Classification Table for Ames Validation Data
Assessing the Performance and Generalizability of a Classifier
The Effect of Cutoff Values on Sensitivity and Specificity Estimates
Output 11.3 Classification Table for Multiple Cutoff Values for Ames Training Data
Figure 11.1 Performance Measures by Cutoff Values for Ames Training Data
Program 11.3 Classification Table Using Cutoff=0.20 for Ames Validation Data
Output 11.4 Classification Table for Cutoff = 0.20 for Ames Validation Data
Measure of Performance Using the Receiver-Operator-Characteristic (ROC) Curve
Figure 11.2 ROC Curve for Ames Training Data
Producing an ROC Curve Using the SCORE Statement with the OUTROC Option
Program 11.4 ROC Curves for Ames Housing Training and Validation Data
Output 11.5a Training and Validation ROC Curves for Ames Housing Data
Output 11.5b ROC Information for Ames Validation Data
Model Comparison Using the ROC and ROCCONTRAST Statements
Program 11.5 Comparing Two Models Using Validation ROC Curves for Ames Housing
Output 11.6a ROC Curves for Two Models Applied to Ames Validation Data
Output 11.6b ROC Contrast Results for Two Models Applied to Ames Validation Data
Measures of Performance Using the Gains and Lift Charts
The Gains Chart
Program 11.6 Gains Information for Ames Validation Data
Output 11.7a Gains Information for Ames Validation Data
Output 11.7b Gains Chart for Ames Validation Data
The Lift Chart
Output 11.8 Lift Chart for Ames Validation Data
Adjustment to Performance Estimates When Oversampling Rare Events
The PEVENT Option for Defining Prior Probabilities
Program 11.7 Use of PEVENT Option to Define Prior Probabilities
Output 11.9a The Logistic Regression Model for Ames Training Data
Output 11.9b Classification Table for PEVENT = 0.02 and PEVENT = 0.4053
Table 11.3 Classification Table for Ames Housing Training Data Labeled for Bayes’ Theorem
Manual Adjustment of the Classification Matrix
Table 11.4 General Classification Table Adjusted for Oversampling
Scoring the Validation Data Using Adjusted Posterior Probabilities
Manually Adjusting Posterior Probabilities to Account for Oversampling
Program 11.8 Posterior Probabilities Manually Adjusted for Oversampling
Output 11.10 Classification Table for Ames Housing Validation Data Adjusted for Oversampling
Manually Adjusted Intercept Using the Offset
Program 11.9 Posterior Probabilities Using Manually Adjusted Intercept
Program 11.10 Adjusting the Model Intercept Using the OFFSET Option
Output 11.11 Logistic Regression Model for Ames Training with Intercept Adjusted for Oversampling
Automatically Adjusted Posterior Probabilities to Account for Oversampling
Program 11.11 Comparison of the Three Approaches to Adjusting for Oversampling
Output 11.12 Posterior Probabilities for Ames Validation Data Using Three Approaches
The Use of Decision Theory for Model Selection
Decision Cutoffs and Expected Profits for Model Selection
Table 11.6 Profit Matrix for Classification Decisions
Table 11.7 Profit Matrix for Ames Housing
Program 11.12 Classification Results and Profit Information for Ames Validation Data
Output 11.13a Classification Matrix for Ames Validation Data Based upon 0.10 Cutoff
Output 11.13b Average Expected Profit for Ames Validation Data Based upon 0.10 Cutoff
Output 11.13c Line Listing for Several Houses in the Ames Validation Data Set
Using Estimated Posterior Probabilities to Determine Cutoffs
Program 11.13 Average Profit for Ames Validation Data by Depth and Cutoff
Output 11.14a Average Profit for Ames Validation Data by Depth and Cutoff
Output 11.14b Maximum Average Profit for Ames Validation Data
Key Terms
Chapter Quiz
References

📜 SIMILAR VOLUMES

📁 SAS® Certified Specialist Prep Guide: Base Programming Using SAS® 9.4

✍ SAS 📂 Library 📅 2019 🏛 SAS Institute 🌐 English

📁 SAS Certified Professional Prep Guide: Advanced Programming Using SAS 9.4

✍ SAS 📂 Library 📅 2019 🏛 SAS 🌐 English

📁 SAS Certified Professional Prep Guide: Advanced Programming Using SAS 9.4

✍ SAS; SAS Institute 📂 Library 📅 2019 🏛 SAS Institute 🌐 English

The official guide by the SAS Global Certification Program, SAS Certified Professional Prep Guide: Advanced Programming Using SAS 9.4 prepares you to take the new SAS 9.4 Advanced Programming Performance-Based Exam. New in this edition is a workbook whose sample scenarios require you to write code t

📁 SAS Certified Specialist Prep Guide: Base Programming Using SAS 9.4

✍ SAS Institute 📂 Library 📅 2019 🏛 SAS Institute 🌐 English

The SAS® Certified Specialist Prep Guide: Base Programming Using SAS® 9.4 prepares you to take the new SAS 9.4 Base Programming -- Performance-Based Exam. This is the official guide by the SAS Global Certification Program. This prep guide is for both new and experienced SAS users, and it covers all

📁 SAS Certification Prep Guide: Advanced Programming for SAS 9

✍ SAS Publishing 📂 Library 📅 2007 🏛 SAS Publishing 🌐 English

📁 SAS Certification Prep Guide: Base Programming for SAS 9

✍ SAS 📂 Library 📅 2006 🏛 SAS Publishing 🌐 English

Prepare for the SAS Base Programming for SAS®9 certification exam with the official guide by the SAS® Certified Professional Program. New and experienced SAS users who want to prepare for the SAS Base Programming for SAS®9 certification exam will find this guide an invaluable, convenient and compreh