Applied Statistics With R

John Fox
Department of Sociology, McMaster University

May/June 2006

The statistical programming language and computing environment S has become the de facto standard among statisticians. The S language has two major implementations: the commercial product S-PLUS, and the free, open-source R. Both are available for Windows and Unix/Linux systems; R, in addition, runs on Macintoshes. This course introduces R in the context of applied statistics, covering applications to linear models, generalized linear models (including logit models, probit models, and Poisson-regression models), and structural-equation models. Some attention will also be paid to R fundamentals -- manipulating data, basic programming, and drawing graphs.


A statistical package, such as SPSS or (to a lesser extent) SAS, is primarily oriented toward combining instructions with rectangular case-by-variable datasets to produce (often voluminous) printouts. Such packages make routine data analysis relatively easy, but they make it relatively difficult to do things that are innovative or non-standard, or to add to the built-in capabilities of the package. In contrast, a good statistical computing environment also makes routine data analysis easy, but it additionally supports convenient programming; this means that users can extend the already impressive facilities of S. Statisticians have taken advantage of the extensibility of S to contribute literally hundreds of freely available statistical "packages" of R programs. As well, S is especially capable in the area of statistical graphics, reflecting its origin at Bell Labs, a centre of graphical innovation. In my opinion, R has leap-frogged S-PLUS to become the premier implementation of the S language.


Outline

Optional reading refers to the R and S-PLUS Companion.

1. Getting started with R; data in R; using R packages
   Dates: Tue, May 23, 13:00-16:00
   Optional reading: Ch. 1-3
   Materials: script, exercises, answers, notes, Duncan.txt, Duncan.xls, nations.por, Prestige.txt

2. Linear models
   Dates: Tue, May 30, 13:00-16:00; Fri, June 2, 11:00-14:00
   Optional reading: Ch. 4 & 6
   Materials: script (a, b), exercises (a, b), answers (a, b), notes (a, b), SLID-Ontario.txt

3. Logit and probit models for categorical response variables
   Dates: Fri, June 9, 11:00-14:00; Tue, June 13, 13:00-16:00
   Optional reading: Sec. 5.1, 5.2
   Materials: script, exercises, answers, notes, Powers.txt, WorkingMoms.txt, SLID.txt, BEPS.txt, WVS.txt

4. Other generalized linear models
   Dates: Fri, June 16, 10:30-13:30
   Optional reading: Sec. 5.3-5.5, Sec. 6.6
   Materials: script, exercises, answers, notes, Long.txt

5. Structural-equation models
   Dates: Tue, June 20, 13:00-16:00; Fri, June 23, 11:00-14:00
   Optional reading: Appendix
   Materials: script, exercises, answers, notes, paper in Structural Equation Modeling (preprint), Lincoln.R, Rindfuss.R, Wheaton.R

6. Introduction to programming in R and R graphics
   Dates: Tue, June 27, 13:00-16:00
   Optional reading: Ch. 7 & 8
   Materials: script, notes

Location: PC Lab, UZA 2, Level 5


Acquiring R

I've created a CD-ROM with the installer for the Windows version of R, Windows "binaries" for all of the contributed packages on the Comprehensive R Archive Network (CRAN) web site, and the free Tinn-R programming editor. You can borrow a copy of the CD from me if you don't have a fast Internet connection.

A superior alternative is to download the R Windows installer directly from CRAN; then double-click on the installer to install R as you would any Windows software. You can subsequently download and install only those packages that you want over the Internet from CRAN, via the Packages -> Install packages from CRAN menu in the RGui console. Likewise, the small installer for Tinn-R can be downloaded directly. Macintosh and Unix/Linux versions of R can also be downloaded from CRAN.
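
The menu-driven installation described above also has a command-line equivalent. The following is a minimal sketch, not part of the course materials: it assumes a working Internet connection and a current version of R, and the helper name install_if_missing is hypothetical. It relies only on the standard R functions install.packages() and requireNamespace().

```r
# Hypothetical helper (not part of R or of the course materials):
# install a CRAN package only if it is not already available,
# then report whether it can be loaded.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Downloads the package from CRAN; R may prompt you to choose a mirror.
    install.packages(pkg)
  }
  # TRUE if the package can now be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}
```

For example, calling install_if_missing("car") should fetch the car package associated with the R and S-PLUS Companion if it is not yet installed; library(car) then loads it for the current session.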

Additional information about obtaining, installing, and configuring R is available on the web site for my R and S-PLUS Companion to Applied Regression.


SELECTED BIBLIOGRAPHY

Basic Texts

The principal sources for this workshop are J. Fox, An R and S-PLUS Companion to Applied Regression, Sage, 2002, and J. Fox, Applied Regression Analysis, Generalized Linear Models, and Related Methods, Second Edition (Sage, forthcoming). Additional materials are available on the web site for the R and S-PLUS Companion, including several appendices (on structural-equation models, mixed models, survival analysis, etc.); scripts for the examples in all of the chapters and appendices; information on acquiring and installing R; and more. The book is associated with the car package for R. Alternatively (or additionally), more advanced students may wish to use W. N. Venables and B. D. Ripley, Modern Applied Statistics with S as their principal source.


Manuals

R is distributed with a set of manuals, which are also available at the CRAN web site.

A manual for S-PLUS Trellis Graphics (also useful for the lattice package in R) is also available on the web.


Programming in S

R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2, which forms the basis of the currently used S Versions 3 and 4, as well as R. (Sometimes called the "Blue Book.")

J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language. Not an easy read. (The "Green Book.")

J. M. Chambers and T. J. Hastie, eds., Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992. An edited volume describing the statistical modeling language in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for "backwards compatibility," in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The "White Book.")

R. Ihaka and R. Gentleman, R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now dated but still worth looking at.

W. N. Venables and B. D. Ripley, S Programming. New York: Springer, 2000. The definitive treatment of writing software in the various versions of S-PLUS and R, now slightly dated, particularly with respect to R.


Selected Statistical Methods Programmed in S

A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the sm package (for both S-PLUS and R).

A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the boot package (for S-PLUS and R, written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani.

B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993. Another extensive treatment of bootstrapping by its originator (Efron), also accompanied by an S package, bootstrap (for both S-PLUS and R, but somewhat less usable than boot).

F. E. Harrell, Jr., Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001. Describes an interesting approach to statistical modeling, with frequent references to Harrell's Hmisc and Design packages for S-PLUS and R.

T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990. An accessible treatment of generalized additive models, as implemented in the gam function in S-PLUS and in the gam package in R, and of nonparametric regression analysis in general. [The gam function in the mgcv package in R takes a somewhat different approach; see Wood (2000), below.]

R. Koenker, Quantile Regression. Cambridge: Cambridge University Press, 2005. The definitive text on parametric and nonparametric quantile regression analysis, associated with the quantreg package in R.

C. Loader, Local Likelihood and Regression. New York: Springer, 1999. Another text on nonparametric regression and density estimation, using the locfit package (in S-PLUS and R). Although the text is less readable than Bowman and Azzalini, the locfit software is very capable.

P. Murrell, R Graphics. New York: Chapman and Hall, 2005. Currently the definitive reference on traditional R graphics and "lattice" graphics (the R implementation of Cleveland's Trellis graphics). The figures in the book and R code to produce them are on Murrell's web site.

J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors' nlme package, which is available for both S-PLUS and R. Mixed models are appropriate for various kinds of non-independent (clustered) data, including hierarchical and longitudinal data.

J. L. Schafer, Analysis of Incomplete Multivariate Data. London: Chapman and Hall, 1997. This text presents a broadly applicable Bayesian treatment of missing-data problems, including methods for multiple imputation. The most extensive implementation of the methods in the book is in the missing package in S-PLUS version 6. Schafer's norm, cat, mix, and pan packages are available for earlier versions of S-PLUS and for R.

T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000. An overview of both basic and advanced methods of survival analysis (event-history analysis), with reference to S and SAS software. There are both S-PLUS and R versions of Therneau's state-of-the-art survival package.

W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S. Many of the facilities described in the book are programmed in the associated (and indispensable) MASS, nnet, and spatial packages, available both for S-PLUS and R. This text is more advanced and has a broader focus than my R and S-PLUS Companion.

S. N. Wood, Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society, Series B, 62: 413-428, 2000. Describes the mgcv package in R, which contains a gam function for fitting generalized additive models. The initials "mgcv" stand for multiple generalized cross validation, the method by which Wood selects GAM smoothing parameters. The description of the software in the paper is slightly dated; consult the package documentation for up-to-date information, including additional references. Wood has a new book on GAMs which I haven't yet seen.


Other Sources (Some Free)

See the R web site. R News, the newsletter of the R Project for Statistical Computing, is also a good source of information. CRAN "Task Views" summarize R resources in a number of areas -- econometrics, social statistics, spatial statistics, and so on.


Last change: 2006-06-28 by J. Fox <jfox AT mcmaster.ca>