*Date: Aug 16, 2017*

*- All variables used are log-transformed to improve R square.*

*- In the SAS code, words in bold are SAS programming keywords.*

SAS PROC REG offers various model selection methods; I used **STEPWISE** and **RSQUARE**.

**STEPWISE** is the most popular model selection method in PROC REG. The significance levels for entry (SLE) and for staying (SLS) can be adjusted; I used the default settings.

```sas
PROC REG DATA=PLOTlog;
  MODEL BIO_MG_HAN = Total_retu Elev_minim Elev_maxim Elev_mean Elev_mode
                     Elev_stdde Elev_varia Elev_CV Elev_IQ Elev_kurto Elev_AAD
                     Elev_MAD_m Elev_MAD_1 Elev_L1 Elev_L2 Elev_L_CV Elev_P01
                     Elev_P05 Elev_P10 Elev_P20 Elev_P25 Elev_P30 Elev_P40
                     Elev_P50 Elev_P60 Elev_P70 Elev_P75 Elev_P80 Elev_P90
                     Elev_P95 Elev_P99 Canopy_rel Elev_SQRT_ Elev_CURT_
        / SELECTION=STEPWISE;
RUN;
```

**R square selection (RSQUARE)** always identifies the model with the largest R square for each number of variables considered. It requires much more computing time than the other selection methods; one workaround is to divide the variables into subgroups, find the largest R square within each subgroup first, and then compare the best candidates. Here, however, STOP=1 restricts the search to one-variable models.

```sas
PROC REG DATA=PLOTlog;
  MODEL BIO_MG_HAN = Total_retu Elev_minim Elev_maxim Elev_mean Elev_mode
                     Elev_stdde Elev_varia Elev_CV Elev_IQ Elev_kurto Elev_AAD
                     Elev_MAD_m Elev_MAD_1 Elev_L1 Elev_L2 Elev_L_CV Elev_P01
                     Elev_P05 Elev_P10 Elev_P20 Elev_P25 Elev_P30 Elev_P40
                     Elev_P50 Elev_P60 Elev_P70 Elev_P75 Elev_P80 Elev_P90
                     Elev_P95 Elev_P99 Canopy_rel Elev_SQRT_ Elev_CURT_
        / SELECTION=RSQUARE STOP=1;
RUN;
```
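The search RSQUARE performs is easy to illustrate outside SAS. Below is a minimal Python sketch (with made-up data and predictor positions, not the lidar variables above) of what SELECTION=RSQUARE STOP=1 does: fit every one-variable model and keep the one with the largest R square.

```python
# Illustrative sketch of SELECTION=RSQUARE STOP=1: fit each
# one-variable model and keep the one with the largest R square.
# Data and variable positions are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 4))                        # 4 candidate predictors
y = 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)  # truth uses column 2

def r_square(x, y):
    """R square of a simple regression of y on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - np.sum(resid ** 2) / sst

scores = [r_square(X[:, j], y) for j in range(X.shape[1])]
best = int(np.argmax(scores))
print(best, round(scores[best], 3))  # column 2 wins
```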

Results from the stepwise selection:

Results from R square selection:

Comparison of two models:

```sas
PROC REG DATA=PLOTlog OUTEST=OUT1;
  MODEL BIO_MG_HAN = Elev_mode / AIC BIC PRESS RSQUARE RMSE;
PROC PRINT DATA=OUT1;

PROC REG DATA=PLOTlog OUTEST=OUT2;
  MODEL BIO_MG_HAN = Elev_mode Total_retu Elev_P95 / AIC BIC PRESS RSQUARE RMSE VIF;
PROC PRINT DATA=OUT2;
RUN;
```

#### Model Selection Criteria

1. **Statistical tests** on individual coefficients at a given significance level (0.05). It is desirable that all predictor variables kept in the model are significant.

For the RSQUARE-selected model, Elev_mode is significant.

For the STEPWISE-selected model, Elev_mode is significant, while Total_retu and Elev_P95 are not.

2. **Model coefficient of determination (R square)**. The __larger__, the better.

R square increases with the number of variables in the model.

RSQUARE R square: 0.7837

STEPWISE R square: 0.8223
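That monotone behavior is easy to demonstrate; here is a Python sketch with made-up data showing that adding even a pure-noise predictor cannot decrease R square.

```python
# R square never decreases as predictors are added: sketch with
# made-up data (not the lidar variables from the document).
import numpy as np

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # pure noise, unrelated to y
y = 1.5 * x1 + rng.normal(scale=0.3, size=n)

def r_square(design, y):
    """R square of an OLS fit for a given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones(n)
r2_small = r_square(np.column_stack([ones, x1]), y)
r2_big = r_square(np.column_stack([ones, x1, x2]), y)  # adds the noise column
print(round(r2_small, 4), round(r2_big, 4))            # r2_big >= r2_small
```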

3. **Adjusted R square**. The __larger__, the better.

Compared with R square, Adjusted R square does not always increase with number of variables in the model. It removes the impact of degrees of freedom and gives a quantity that is more comparable than R square over models involving different numbers of parameters.

RSQUARE adjusted R square: 0.7786

STEPWISE adjusted R square: 0.8090
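The adjustment is the standard degrees-of-freedom penalty, adjusted R square = 1 - (1 - R square)(n - 1)/(n - p - 1), with n observations and p predictors. A small Python sketch (illustrative numbers, not the document's models):

```python
# Adjusted R square penalizes additional parameters:
# adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors.
def adjusted_r_square(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Same R square, more predictors -> smaller adjusted R square.
print(adjusted_r_square(0.82, 50, 1))
print(adjusted_r_square(0.82, 50, 3))
```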

4. **Mallows' Cp**. The closer to the number of coefficients (including the intercept), the better.

Not considered here; Mallows' Cp is computed during the SELECTION process.

5. **Predicted residual sum of squares (PRESS)**. The smaller, the better.

The PRESS statistic gives a good indication of the predictive power of the model and can be used in combination with RMSE. RMSE shrinks as the model fits each data point more closely, but chasing a small RMSE can overfit, yielding a model that is neither representative nor predictive. PRESS guards against this by testing how well the model predicts each point when that point is left out of the fit.

RSQUARE PRESS: 2.13015

STEPWISE PRESS: 1.79648
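For linear regression the leave-one-out predictions behind PRESS need not be computed by refitting: PRESS equals the sum of (e_i / (1 - h_ii))^2, where e_i are the ordinary residuals and h_ii the hat-matrix leverages. A Python sketch on made-up data verifying the shortcut against explicit refits:

```python
# PRESS sketch: each point is predicted from a model fit without it.
# For OLS this equals sum((e_i / (1 - h_ii))^2), with h_ii the
# hat-matrix leverages. Made-up data, not the document's lidar plots.
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.4, size=n)
X = np.column_stack([np.ones(n), x])

# Shortcut via leverages.
H = X @ np.linalg.inv(X.T @ X) @ X.T
e = y - H @ y
press_shortcut = np.sum((e / (1.0 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one point out.
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += (y[i] - X[i] @ beta) ** 2

print(round(press_shortcut, 6), round(press_loo, 6))  # identical
```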

6. Model Selection Criteria Based on Information Theory, including **AIC**, AICC, BIC and SBC. The smaller, the better.

AIC is not a test of the model in the sense of hypothesis testing; rather, it is a comparison between models - a tool for model selection. Akaike's rule of thumb: two models are essentially indistinguishable if the difference of their AICs is **less than 2**.
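The rule of thumb can be stated as a tiny helper (the function name and the AIC values are made up for illustration):

```python
# Akaike's rule of thumb as a small helper: models whose AICs differ
# by less than 2 are treated as essentially indistinguishable.
def aic_verdict(aic_a, aic_b, threshold=2.0):
    """Return which model the AIC comparison prefers (smaller is better)."""
    if abs(aic_a - aic_b) < threshold:
        return "indistinguishable"
    return "A" if aic_a < aic_b else "B"

print(aic_verdict(-101.3, -100.1))  # difference 1.2 -> indistinguishable
print(aic_verdict(-110.0, -100.1))  # difference 9.9 -> A preferred
```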

7. **Variance Inflation Factor (VIF)**.

This is used for multicollinearity detection and diagnostics. VIFs indicate which regression coefficients are adversely affected by multicollinearity and to what extent. It is generally held that if any VIF exceeds 10, there is reason for at least some concern about multicollinearity in the data.

The highest VIF in the STEPWISE selection model is 4.99, which is smaller than 10.
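The VIF itself is simple to compute: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A Python sketch with made-up, deliberately collinear data:

```python
# VIF sketch: VIF_j = 1 / (1 - R2_j), where R2_j is the R square from
# regressing predictor j on the other predictors. Made-up data with a
# deliberately collinear pair of columns.
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.1 * rng.normal(size=n),   # strongly collinear pair ...
    z + 0.1 * rng.normal(size=n),   # ... of predictors
    rng.normal(size=n),             # independent predictor
])

def vif(X, j):
    """Variance inflation factor of column j of predictor matrix X."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    target = X[:, j]
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ beta
    r2 = 1.0 - resid @ resid / ((target - target.mean()) @ (target - target.mean()))
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])  # first two large, third near 1
```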

### Summary

The STEPWISE model is better: it has the higher R square and adjusted R square and the smaller PRESS, and its largest VIF (4.99) is well below 10, although Total_retu and Elev_P95 are not individually significant.