* do-file for linear regression exercise # 1 (VER 14.1)

version 18 /* the rest works with versions 14-17, but the stcolor_alt scheme set below requires Stata 18 */
set more off
set scheme stcolor_alt
cd "r:\"

* open the btb_episodes data set
use btb_episodes.dta, clear

* Q1a: distribution of outcome -intvl-
codebook intvl
histogram intvl
summarize intvl, detail

* Q1b: natural log transformed outcome
generate intvl_ln=ln(intvl)
histogram intvl_ln
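
* Added check (a sketch, not part of the original exercise): ln() returns
* missing for zero or negative arguments, so count how many observations
* with a recorded intvl lost their value in the transformation and would
* therefore be dropped from the regressions below.
count if intvl <= 0
count if missing(intvl_ln) & !missing(intvl)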

* Q2: simple linear regressions
regress intvl_ln p_rct
regress intvl_ln p_year
regress intvl_ln hdsize

* Q3: multiple linear regression
regress intvl_ln p_rct p_year hdsize
* added calculation of Pearson correlation between p_rct and hdsize
pwcorr p_rct hdsize
* comparing coefficients from 2 models
regress intvl_ln hdsize
estimates store smpl
regress intvl_ln p_rct p_year hdsize
estimates store mltpl
estimates table smpl mltpl
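* Optional variant (a sketch): the same comparison with standard errors and
* model-level statistics, which makes it easier to see how the hdsize
* coefficient and R-squared change once p_rct and p_year are added.
* The display formats are arbitrary choices.
estimates table smpl mltpl, b(%9.4f) se(%9.4f) stats(N r2 r2_a)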

* Q4: confidence intervals for the mean and prediction intervals for individuals (based only on hdsize)
regress intvl_ln hdsize
* doing the calculations yourself
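predict pv, xb /* fitted values on the log scale */
predict pv_mean_se, stdp /* SE of the predicted mean */
scalar tstar=invttail(e(df_r),.025) /* using DFE stored by regress (2985 here) */
generate pv_mean_u=pv + tstar*pv_mean_se
generate pv_mean_l=pv - tstar*pv_mean_se
predict pv_ind_se, stdf /* SE of the forecast for a new individual */
generate pv_ind_u=pv + tstar*pv_ind_se
generate pv_ind_l=pv - tstar*pv_ind_se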
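* Added check (a sketch): the forecast variance should equal the variance
* of the predicted mean plus the residual variance, stdf^2 = stdp^2 + rmse^2.
* The temporary variable chk (a name introduced here) should be zero up to
* rounding, because predict stores its results as floats.
generate double chk = pv_ind_se^2 - (pv_mean_se^2 + e(rmse)^2)
summarize chk
drop chk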
* sort first so that the line plots connect the points in order of hdsize
sort hdsize
twoway (scatter intvl_ln hdsize, msize(vsmall)) (line pv hdsize) ///
  (line pv_mean_u hdsize) (line pv_mean_l hdsize) ///
  (line pv_ind_u hdsize) (line pv_ind_l hdsize)
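* back-transform the predictions and interval limits to the original scale
* (exp of a log-scale prediction estimates the median of intvl under the
* model's normality assumption, not its arithmetic mean)
generate bt_pv=exp(pv)
generate btpv_mean_u=exp(pv_mean_u)
generate btpv_mean_l=exp(pv_mean_l)
generate btpv_ind_u=exp(pv_ind_u)
generate btpv_ind_l=exp(pv_ind_l)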
sort hdsize
twoway (scatter intvl hdsize, msize(vsmall)) (line bt_pv hdsize) ///
  (line btpv_mean_u hdsize) (line btpv_mean_l hdsize) ///
  (line btpv_ind_u hdsize) (line btpv_ind_l hdsize)

* using the built-in graphics capabilities (simple linear regression only)
twoway (scatter intvl_ln hdsize, sort msize(vsmall)) ///
  (lfitci intvl_ln hdsize, ciplot(rline)) ///
  (lfitci intvl_ln hdsize, stdf ciplot(rline))

* Q5: R-squared of the model without and with p_rct
regress intvl_ln p_year hdsize
regress intvl_ln p_rct p_year hdsize
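* A sketch for comparing the two fits side by side: store each model and
* tabulate N, R-squared and adjusted R-squared. The names q5_a and q5_b are
* arbitrary labels introduced here.
quietly regress intvl_ln p_year hdsize
estimates store q5_a
quietly regress intvl_ln p_rct p_year hdsize
estimates store q5_b
estimates table q5_a q5_b, stats(N r2 r2_a)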
