Package 'HRW'

Title: Datasets, Functions and Scripts for Semiparametric Regression Supporting Harezlak, Ruppert & Wand (2018)
Description: The book "Semiparametric Regression with R" by J. Harezlak, D. Ruppert & M.P. Wand (2018, Springer; ISBN: 978-1-4939-8851-8) makes use of datasets and scripts to explain semiparametric regression concepts. Each of the book's scripts are contained in this package as well as datasets that are not within other R packages. Functions that aid semiparametric regression analysis are also included.
Authors: Jaroslaw Harezlak [aut], David Ruppert [aut], Matt P. Wand [aut, cre]
Maintainer: Matt P. Wand <[email protected]>
License: GPL (>= 2)
Version: 1.0-5
Built: 2025-01-23 04:55:11 UTC
Source: https://github.com/cran/HRW

Help Index


Contraception use in Bangladesh

Description

The BanglaContrac data frame has multilevel data from the 1988 Bangladesh Fertility Survey. There are data on contraceptive use, number of children, age and urban-dweller status of 1,934 women grouped in 60 districts.

Usage

data(BanglaContrac)

Format

This data frame contains the following columns:

districtID

district identification number.

usingContraception

indicator that woman is using contraception:
1 = woman using contraception at time of survey,
0 = woman not using contraception at time of survey.

childCode

numerical code for number of living children:
1 = no living children at time of survey,
2 = one living child at time of survey,
3 = two living children at time of survey,
4 = three or more living children at time of survey.

ageMinusMean

age in years of woman at time of survey, with mean age subtracted

isUrban

indicator that woman lives in an urban region:
1 = woman an urban region dweller at time of survey,
0 = woman a rural region dweller at time of survey.

Source

Huq, N.M. and Cleland, J. (1990). Bangladesh Fertility Survey 1989 (Main Report). Dhaka, Bangladesh: National Institute of Population Research and Training.

Examples

library(HRW) ; data(BanglaContrac)
if (require("lattice"))
   print(xyplot(jitter(usingContraception) ~ ageMinusMean|factor(districtID),
         groups = childCode, data = BanglaContrac))

Berry, Carroll and Ruppert (2002) data

Description

The BCR data frame has data from a 6-week clinical trial of a drug versus a placebo. The data are subject to measurement error and have been transformed and rescaled from their original form. These data are analyzed in the 2002 ‘Journal of the American Statistical Association’ article by Berry, Carroll and Ruppert (full reference below).

Usage

data(BCR)

Format

This data frame contains the following columns:

status

code for status of patient:
control = patient is in placebo group,
treatment=patient is in treatment group.

w

physician-assessed score of patient's mental health at baseline.

y

physician-assessed score of patient's mental health at end of the study.

References

Berry, S.M., Carroll, R.J. and Ruppert, D. (2002). Bayesian smoothing and regression splines for measurement error problems.Journal of the American Statistical Association, 97, 160-169.

Examples

library(HRW) ; data(BCR)
if (require("lattice")) 
   print(xyplot(y ~ w|status,data = BCR))

Mortgage applications in Boston

Description

The full dataset is in the data frame 'Hdma' within the R package Ecdat. This data frame is the subset of mortgage applications during the years 1998-1999.

Usage

data(BostonMortgages)

Format

A data frame with 2,380 observations on the following 13 variables:

dir

ratio of the debt payments to the total income.

hir

ratio of the housing expenses to the total income.

lvr

ratio of the loan size to the assessed value of property.

ccs

a credit score ranging from 1 to 6, where a low value indicates low credit risk.

mcs

a mortgage credit score from 1 to 6, where a low value indicates low credit risk.

pbcr

did the applicant have a public bad credit record?: a factor with levels no and yes.

dmi

was the applicant denied mortgage insurance?: a factor with levels no and yes.

self

was the applicant self-employed?: a factor with levels no and yes.

single

is the applicant single?: a factor with levels no and yes.

uria

1989 Massachusetts unemployment rate in the applicant's industry.

condominium

is the unit a condominium?: a factor with levels no and yes.

black

is the applicant black?: a factor with levels no and yes.

deny

was the mortgage denied?: a factor with levels no and yes.

References

Munnell, A. H., Tootell, G. M. B., Browne, L. E., and McEneaney, J. (1996). Mortgage lending in Boston: Interpreting HDMA data, American Economic Review, 25-53.

Examples

library(HRW) ; data(BostonMortgages)
BostonMortgages$denyBinary <- as.numeric(BostonMortgages$deny == "yes")
fit <- glm(denyBinary ~ black + dir + lvr + pbcr + self + single + as.factor(ccs),
                        family = binomial,data = BostonMortgages)
summary(fit)

Brain image

Description

The brainImage data frame corresponds a functional magnetic image of a coronal slice of a human brain. The data are brain activity on a pixel array.

Usage

data(brainImage)

Format

This data frame is a 80 by 37 array. The entries correspond to brain activity in each of the corresponding pixels. The columns names are c1-c37. These names have no meaning, and are present to ensure that this data frame conforms with R data frame conventions.

Source

Landman, B.A., Huang, A.J., Gifford, A., Vikram, D.S., Lim, I.A.L, Farrell, J.A.D., Bogovic, J.A., Hua, J., Chen, M., Jarso, S., Smith, S.A., Joel, S., Mori, S., Pekar, J.J., Barker, P.B., Prince, J.L. and van Zijl, P.C.M. (2010). Multi-parametric neuroimaging reproducibility: A 3T resource study. NeuroImage, 54, 2854-2866.

Examples

library(HRW) ; data(brainImage) 
image(as.matrix(brainImage))

Stock indices

Description

Daily returns on the Standard & Poors' 500 stock market index, daily rate of the U.S. Treasury bills, and 3 companies' stocks including Microsoft, the General Electric and the Ford Motor Company during the period from November 1, 1993 to March 31, 2003.

Usage

data(capm)

Format

A data frame with 2363 observations on the following 6 variables:

Close.tbill

Daily Treasury bill rate expressed as a percentage.

Close.msft

Daily closing price of the Microsoft stock.

Close.sp500

Daily closing Standard and Poor's 500 index.

Close.ge

Daily closing price of the General Electric stock.

Close.ford

Daily closing price of the Ford Motor Company stock.

Date

Dates from November 1, 1993 to March 31, 2003 (d-Mon-yy and dd-Mon-yr formats).

Source

Federal Reserve Bank of St. Louis U.S.A. (Treasury bill rates) and Yahoo Finance (stock prices).

Examples

# The Capital Asset Pricing Model (CAPM) states that the excess returns on a stock 
# have a linear relationship with the returns on the market. This example investigates
# the CAPM for General Electric stock:

library(HRW) ; data(capm)
n <- dim(capm)[1]
riskfree <- capm$Close.tbill[2:n]/365
elrGE <- diff(log(capm$Close.ge)) - riskfree
elrSP500 <- diff(log(capm$Close.sp500)) - riskfree
plot(elrSP500,elrGE,col = "blue",cex = 0.2)
fitOLS <- lm(elrGE ~ elrSP500)
summary(fitOLS)
par(mfrow = c(2,2)) ; plot(fitOLS)

Cars purchased at auction

Description

The carAuction data frame has data on several variables concerning cars purchased at automobile auctions by automobile dealerships in the United States of America. The origin of these data is a classification competition titled “Don't Get Kicked!” that ran on the ‘kaggle’ platform (https://www.kaggle.com) during 2011-2012. Many of the variables in this data frame have been derived from those in the original data set from https://www.kaggle.com.

Usage

data(carAuction)

Format

This data frame contains the following columns:

RefId

unique number assigned to each vehicles.

IsBadBuy

indicator that the vehicle purchased at auction by an automobile dealership has serious problems that hinder or prevent it being sold - a "bad buy":
1 = the vehicle is a "bad buy"
0 = the vehicle is a "good buy".
All other indicator variables are defined in this way.

purchIn2010

indicator that vehicle was purchased in 2010.

aucEqAdesa

indicator that the auction provider at which the vehicle was purchased was Adesa.

aucEqManheim

indicator that the auction provider at which the vehicle was purchased was Manheim.

vehYearEq03

indicator that the manufacturer's year of the vehicle is 2003.

vehYearEq04

indicator that the manufacturer's year of the vehicle is 2004.

vehYearEq05

indicator that the manufacturer's year of the vehicle is 2005.

vehYearEq06

indicator that the manufacturer's year of the vehicle is 2006.

vehYearEq07

indicator that the manufacturer's year of the vehicle is 2007.

ageAtSale

age of the vehicle in years when sold.

makeEqChevrolet

indicator that the vehicle's manufacturer is Chevrolet.

makeEqFord

indicator that the vehicle's manufacturer is Ford.

makeEqDodge

indicator that the vehicle's manufacturer is Dodge.

makeEqChrysler

indicator that the vehicle's manufacturer is Chrysler.

trimEqBas

indicator that the trim level of the vehicle is 'Bas'.

trimEqLS

indicator that the trim level of the vehicle is 'LS'.

trimEqSE

indicator that the trim level of the vehicle is 'SE'.

subModelEq4DSEDAN

indicator that the submodel of the vehicle is '4DSedan'.

subModelEq4DSEDANLS

indicator that the submodel of the vehicle is '4DSedanLS'.

subModelEq4DSEDANSE

indicator that the submodel of the vehicle is '4DSedanSE'.

colourEqSilver

indicator that the vehicle color is silver.

colourEqWhite

indicator that the vehicle color is white.

colourEqBlue

indicator that the vehicle color is blue.

colourEqGrey

indicator that the vehicle color is grey.

colourEqBlack

indicator that the vehicle color is black.

colourEqRed

indicator that the vehicle color is red.

colourEqGold

indicator that the vehicle color is gold.

colourEqOrange

indicator that the vehicle color is orange.

transEqManual

indicator that the vehicle has manual transmission.

wheelEqAlloy

indicator that the vehicle has alloy wheels.

wheelEqCovers

indicator that the vehicle has covered wheels.

odomRead

the vehicle's odometer reading in miles.

AmericanMade

indicator that the vehicle was manufactured in the United States of America.

otherAsianMade

indicator that the vehicle was manuctured in an Asian nation other than Japan or South Korea.

sizeEqTruck

indicator that the size category of the vehicle is 'truck'.

sizeEqMedium

indicator that the size category of the vehicle is 'medium'.

sizeEqSUV

indicator that the size category of the vehicle is 'SUV'.

sizeEqCompact

indicator that the size category of the vehicle is 'compact'.

sizeEqVan

indicator that the size category of the vehicle is 'van'.

price

acquisition price for this vehicle in average condition at time of purchase in U.S. dollars.

purchInTexas

indicator that the vehicle was purchased in Texas.

purchInFlorida

indicator that the vehicle was purchased in Florida.

purchInCalifornia

indicator that the vehicle was purchased in California.

purchInNorthCarolina

indicator that the vehicle was purchased in North Carolina.

purchInArizona

indicator that the vehicle was purchased in Arizona.

purchInColorado

indicator that the vehicle was purchased in Colorado.

purchInSouthCarolina

indicator that the vehicle was purchased in South Carolina.

costAtPurch

acquisition cost paid for the vehicle at time of purchase.

onlineSale

indicator that the vehicle was purchased online.

warrantyCost

warranty cost in U.S. dollars.

Source

The “Don't Get Kicked” competition, https://www.kaggle.com.

Examples

library(HRW) ; data(carAuction)
## Not run: 
for (colNum in 3:10) 
{
   plot(jitter(carAuction[,colNum]),jitter(carAuction$IsBadBuy),pch = ".",
        xlab = names(carAuction)[colNum],ylab = "is car a bad buy?",col = "blue")
   readline("Hit Enter to continue.\n")
}
for (colNum in 11:51) 
{
   plot(jitter(carAuction[,colNum]),jitter(carAuction$IsBadBuy),pch = ".",
        xlab = names(carAuction)[colNum],ylab = "is car a bad buy?",col = "blue")
   readline("Hit Enter to continue.\n")
}

## End(Not run)

Coronory heart disease

Description

The CHD data frame has data on coronary heart disease status, cholesterol level measurements and age. Further details are given in the 1996 ‘Journal of the American Statistical Association’ article by Roeder, Carroll and Lindsay (full reference below).

Usage

data(CHD)

Format

This data frame contains the following columns:

CHD

indicator of coronary heart disease status:
0 = patient does not have coronary heart disease,
1 = patient has coronary heart disease.

LDL

low density lipoprotein cholesterol level.

TC

total cholesterol level.

age

age of patient in years.

Source

Roeder, K., Carroll, R.J. and Lindsay, B.G. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association, 91, 722-732.

References

Richardson, S., Leblond, L., Jaussent, I. and Green, P.J. (2002). Mixture models in measurement error problems, with reference to epidemiological studies. Journal of the Royal Statistical Society, Series A, 163, 549-566.

Examples

library(HRW) ; data(CHD)
pairs(CHD)

Coral organisms in French Polynesia

Description

The coral data frame has data on initial size, taxonomic identity and alive/death status of coral organisms in French Polynesia.

Usage

data(coral)

Format

This data frame contains the following columns:

siteDepthPeriod

factor with levels corresponding to a code for the site, depth and time period concerning where and when coral organisms were measured.

taxon

factor corresponding to an abbreviation for taxonomic identity:
ACR = Acropora,
POC = Pocillopora,
POR = Porites.

logInitialSizePlusOne

initial size measurement of coral organism transformed according to the log(initial size + 1).

died

indicator that coral organism has died:
1 = coral organism has died,
0 = coral organism still alive.

Source

Kayal, M., Vercelloni, J., Wand, M.P. and Adjeroud, M. (2015). Searching for the best bet in life-strategy: a quantitative population dynamics approach to life history trade-offs in reef-building corals. Ecological Complexity, 23, 73-84.

Examples

library(HRW) ; data(coral)
if (require("lattice"))
   print(xyplot(died ~ logInitialSizePlusOne|siteDepthPeriod*taxon,
                data = coral,layout = c(15,5)))

Boundary polygon creation

Description

Create a boundary polygon corresponding nominally to the effective probability density support of a bivariate dataset via an interactive graphical interface and mouse (or, possibly, touchpad) posititionings and button clicks.

Usage

createBoundary(x,y)

Arguments

x

vector containing the x-coordinates of a bivariate dataset.

y

vector containing the y-coordinates of a bivariate dataset.

Details

After the bivariate dataset is displayed on the screen a boundary polygon is selected by performing left mouse (or, possibly, touchpad) clicks on the screen to specify vertex positions, and then moving around in a clockwise direction until the polygon is completed. Completion is achieved by clicking inside the red octagon surrounding the starting vertex.

Value

A two-column matrix containing the vertices of the selected boundary polygon.

Author(s)

M.P. Wand [email protected]

See Also

pointsInPoly

Examples

library(HRW)
x <- c(4,1,9,8,3,9,7)
y <- c(5,7,5,4,2,1,1)
## Not run: myBoundary <- createBoundary(x,y)

Female spinal bone mineral densities

Description

The femSBMD data frame has longitudinal data on spinal bone mineral, density, age and ethnicity for female youths from a study on bone mineral acquisition.

Usage

data(femSBMD)

Format

This data frame contains the following columns:

idnum

identification number unique to each subject.

spnbmd

spinal bone mineral density in grams per square centimeter.

age

age of subject in years.

ethnicity

factor corresponding to the subject's ethnicity with levels Asian, Black, Hispanic and White.

black

indicator of the subject being black:
0 = subject is black
1 = subject is not black.

hispanic

indicator of the subject being Hispanic:
0 = subject is Hispanic
1 = subject is not Hispanic.

white

indicator of the subject being white:
0 = subject is white
1 = subject is not white.

Source

Bachrach, L.K., Hastie, T., Wang, M.-C., Narasimhan, B. and Marcus, R. (1999). Bone mineral acquisition in healthy Asian, Hispanic, Black, and Caucasian youth: a longitudinal study. Journal of Clinical Endocrinology and Metabolism, 84, 4702-4712.

Examples

library(HRW) ; data(femSBMD)
if (require("lattice"))
   print(xyplot(spnbmd ~ age|factor(ethnicity),groups = idnum,
                data = femSBMD,type = "b"))

Adolescent somatic growth in Indiana, U.S.A.

Description

Data on adolescent somatic growth obtained from a study of the mechanisms of human hypertension development conducted at the Indiana University School of Medicine, Indianapolis, Indiana, U.S.A. The data are restricted to a subset of 216 adolescents in the original study who had at least 9 height measurements. There are a total of 4,123 height measurements taken approximately every 6 months.

Usage

data(growthIndiana)

Format

A data frame with 4,123 observations on the following 5 variables:

idnum

identification numbers of the 216 adolescents.

height

height in centimeters.

age

age in years.

male

indicator of the adolescent being male:
1 = adolescent is male,
0 = adolescent is female.

black

indicator of the adolescent being black:
1 = adolescent is black,
0 = adolescent is not black.

References

Pratt, J.H., Jones, J.J., Miller, J.Z., Wagner, M.A. and Fineberg, N.S. (1989). Racial differences in aldosterone excretion and plasma aldosterone concentrations in children. New England Journal of Medicine, 321, 1152-1157.

Examples

library(HRW) ; data(growthIndiana)
growthINblackMales <- growthIndiana[(growthIndiana$male == 1) & (growthIndiana$black == 1),]
if (require("lattice"))
   xyplot(height ~ age|factor(idnum),data = growthINblackMales)

Respiratory infection in Indonesian children

Description

Indonesian Children's Health Study of respiratory infections for a cohort of 275 Indonesian children. The data are longitudinal with each child having between 1 and 6 repeated measurements.

Usage

data(indonRespir)

Format

A data frame with 1200 observations on the following 12 variables:

idnum

child identification number.

respirInfec

indicator of presence of resipiratory infection.

age

age of the child in years.

vitAdefic

indicator of Vitamin A deficiency:
1 = the child had Vitamin A deficiency,
0 = the child did not have Vitamin A deficiency.

female

indicator of child being female:
1 = the child is female,
0 = the child is male.

height

height of the child in centimeters.

stunted

indicator of the child being "short for his/her age":
1 = the child is "short for his/her age",
0 = the child is not "short for his/her age"

visit2

indicator that the child had exactly 2 clinical visits:
1 = the exact number of clinical visits was 2,
0 = the exact number of clinical visits was not 2.

visit3

indicator that the child had exactly 3 clinical visits:
1 = the exact number of clinical visits was 3,
0 = the exact number of clinical visits was not 3.

visit4

indicator that the child had exactly 4 clinical visits:
1 = the exact number of clinical visits was 4,
0 = the exact number of clinical visits was not 4.

visit5

indicator that the child had exactly 5 clinical visits:
1 = the exact number of clinical visits was 5,
0 = the exact number of clinical visits was not 5.

visit6

indicator that the child had exactly 6 clinical visits:
1 = the exact number of clinical visits was 6,
0 = the exact number of clinical visits was not 6.

Source

Sommer, A. (1982). Nutritional Blindness. New York: Oxford University Press.

References

Diggle, P., Heagerty, P., Liang, K.-L. and Zeger, S. (2002). Analysis of Longitudinal Data (Second Edition). Oxford: Oxford University Press.

Examples

library(HRW) ; data(indonRespir)
if (require("mgcv"))
{
   fit <- gamm(respirInfec ~ s(age) + vitAdefic + female + height
            + stunted + visit2 + visit3 + visit4  + visit5 + visit6,
            random = list(idnum = ~1),family = binomial,data = indonRespir)
   summary(fit$gam)
}

Light detection and ranging

Description

The lidar data frame has 221 pairs from a LIght Detection And Ranging (LIDAR) experiment.

Usage

data(lidar)

Format

This data frame contains the following columns:

range

distance traveled before the light is reflected back to its source.

logratio

logarithm of the ratio of received light from two laser sources.

Source

Sigrist, M. (Ed.) (1994). Air Monitoring by Spectroscopic Techniques (Chemical Analysis Series, vol. 197). New York: Wiley.

Examples

library(HRW) ; data(lidar)
plot(lidar$range,lidar$logratio)

Ozone levels in midwest U.S.A.

Description

This dataset is a subset of the ozone2 dataset in the fields package. It contains the 8-hour average ozone concentration at 147 sites in the midwest region of the U.S.

Usage

data(ozoneSub)

Format

A data frame with 147 observations on the following 3 variables:

longitude

observation longitude.

latitude

observation latitude.

ozone

ozone level.

Source

Aerometric Information Retrieval System, the U.S. Environmental Protection Agency air quality data base.

References

Nychka, D., Furrer, R., Paige, J. and Sain, S. (2017). fields: Tools for spatial data. R package version 9.0. https://www.r-project.org.

Examples

library(HRW) ; data(ozoneSub)
if (require("mgcv"))
{
   fit.ozone.mgcv.tp <- gam(ozone ~ s(longitude,latitude,bs = "tp"),
                            data = ozoneSub,method = "REML")
   plot(fit.ozone.mgcv.tp,scheme = 2,
        main = "ozone concentration",bty = "l")
   points(ozoneSub$longitude,ozoneSub$latitude)
}
if (require("fields"))
   US(add = TRUE,lwd = 2)

Flow cytometric measurements on plankton organisms

Description

The plankton data frame has data on six flow cytometric measurements for 400 plankton organisms categorized into 5 different species. The data are synthetic and were generated to test various machine learning algorithms for plankton species classification. More details are given in the 2001 ‘Cytometry’ article by Boddy, Wilkins and Morris (full reference below).

Usage

data(plankton)

Format

This data frame contains the following columns:

timeFlight

time of flight.

forwScatt

forward-scatter.

sideScatt

side-scatter.

redFluorBlueLight

red fluorescence under blue light.

greenFluorBlueLight

green fluorescence under blue light.

redFluorRedLight

red fluorescence under red light.

species

name of the plankton species, which is either Dunaliella, Hemiselmis, Isochrysis, Pavlova or Pyramimonas.

Source

Boddy, L., Wilkins, M.F. and Morris, C.W. (2001). Pattern recognition in flow cytometry. Cytometry, 44, 195-209.

Examples

library(HRW) ; data(plankton)
pointCols <- c("red","blue","green","orange","purple")
pairs(plankton[,1:6],col = pointCols[plankton$species],pch = ".")

Points inside/outside polygon determination

Description

Determination of whether each member of a set of bivariate points are inside a polygon.

Usage

pointsInPoly(pointsCoords,polygon)

Arguments

pointsCoords

two-column matrix with each row specifying the (x,y) coordinates of a point.

polygon

two-column matrix with each row specifying the (x,y) coordinates of a vertex of a polygon.

Details

Geometric results are used to determine whether each of a set of bivariate points is inside or outside a polygon. A Boolean vector of indicators of whether or not each point is inside the polygon is returned.

Value

A Boolean array with length equal to the number of rows in ‘pointsCoords’ corresponding to whether or not each of the corresponding points in 'pointCoords' are inside 'polygon'.

Author(s)

M.P. Wand [email protected]

See Also

createBoundary

Examples

library(HRW)
myPolygon <- rbind(c(1,9),c(8,8),c(9,3),c(3,2),c(1,9))/10
plot(0:1,0:1,type = "n") ; lines(myPolygon)
xyMat <- cbind(runif(10),runif(10))
inPoly <- pointsInPoly(xyMat,myPolygon) ; print(inPoly)
points(xyMat[,1],xyMat[,2],col = as.numeric(inPoly) + 2)

Protein intake dietery study

Description

The protein data frame has longitudinal data on protein intake, body mass index and age of subjects in a dietary study.

Usage

data(protein)

Format

This data frame contains the following columns:

idnum

identification number unique to each subject.

proteinBioM

logarithm of intake of protein as measured by the biomarker urinary.

age

age of subject in years.

BMI

body mass index.

proteinRecall

logarithm of intake of protein as measured by a 24-hour recall instrument.

female

indicator that subject is female:
1=subject is female,
0=subject is male.

Source

Kipnis, V., Subar, A.F., Midthune, D., Freedman, L.S., Ballard-Barbash, R., Troiano, R., Bingham, S., Schoeller, D.A., Schatzkin, A. and Carroll, R.J. (2003). The structure of dietary measurement error: results of the OPEN biomarker study. American Journal of Epidemiology, 158, 14-21.

Examples

library(HRW) ; data(protein)
if (require("lattice"))
   print(xyplot(proteinBioM ~ BMI|factor(female),groups = idnum,
                data = protein, type = "b"))

Ragweed pollen in Kalamazoo, U.S.A.

Description

The ragweed data frame has data on ragweed levels and meteorological variables for 334 days in Kalamazoo, Michigan, U.S.A.

Usage

data(ragweed)

Format

This data frame contains the following columns:

pollenCount

ragweed pollen count (grains per cubic metre).

year

one of 1991, 1992, 1993 or 1994.

dayInSeason

day number in the current ragweed pollen season.

temperature

temperature for the corresponding day (degrees Fahrenheit).

temperatureResidual

residual from fitting a 5 effective degrees of freedom smoothing splines to temperature versus day number for each annual ragweed pollen season (degrees Fahrenheit).

rain

indicator of significant rain on the corresponding day:
1=at least 3 hours of steady or brief but intense rain,
0=otherwise.

windSpeed

wind speed for the corresponding day (knots).

Source

Stark, P. C., Ryan, L. M., McDonald, J. L. and Burge, H. A. (1997). Using meteorologic data to model and predict daily ragweed pollen levels. Aerobiologia, 13, 177-184.

References

Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). Semiparametric Regression Cambridge University Press.

Examples

library(HRW) ; data(ragweed)
pairs(ragweed,pch = ".")

Scallop abundance off Long Island, U.S.A.

Description

The scallop data frame has 148 triplets concerning scallop abundance; based on a 1990 survey cruise in the Atlantic continental shelf off Long Island, New York, U.S.A.

Usage

data(scallop)

Format

This data frame contains the following columns:

latitude

degrees latitude (north of the Equator).

longitude

degrees longitude (west of Greenwich).

totalCatch

total size of scallop catch at location specified by 'latitude' and 'longitude'.

Source

Ecker, M.D. and Heltshe, J.F. (1994). Geostatistical estimates of scallop abundance. In Case Studies in Biometry. Lange, N., Ryan, L., Billard, L., Brillinger, D., Conquest, L. and Greenhouse, J. (eds.) New York: John Wiley & Sons, 107-124.

References

Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). Semiparametric Regression. Cambridge University Press.

Examples

library(HRW) ; data(scallop)
pairs(scallop)

School results in the United Kingdom

Description

The schoolResults data frame has multilevel data school results and gender for 1,905 school children from 73 schools in United Kingdom.

Usage

data(schoolResults)

Format

This data frame contains the following columns:

schoolID

school identification number.

studentID

student identification number.

female

indicator that child is female:
1=child is female,
0=child is male.

writtenScore

score on traditional written examination paper out of a total of 160.

courseScore

score from projects undertaken during the course and marked by the student's own teacher, out of a total of 108.

Source

Creswell, M. (1991). A multilevel bivariate model. In Data Analysis with ML3 (eds. Prosser, R., Rasbash, J. and Goldstein, H.) London: Institute of Education, London.

Examples

library(HRW) ; data(schoolResults)
if (require("lattice"))
   print(xyplot(writtenScore ~ courseScore|factor(schoolID),
                groups = female,data = schoolResults))

Summarizes Markov chain Monte Carlo (MCMC) samples both graphically and numerically

Description

Given a set of MCMC for possibly several parameters the following summaries are produced: 1. a trace (time series) plot of each MCMC, 2. a plot of each MCMC sample against the 1-lagged sample, 3. estimated autocorrelatio function (acf), 4. Brooks-Gelman-Rubin (BGR) diagnostic plot in cases where multiple chains are inputted, 5. kernel ddensity estimate of the posterior density function based on the MCMC sample, 6. numerical summary consisting of the MCMC-based estimates of posterior means and credible sets for each parameter.

Usage

summMCMC(xList,EPSfileName,PDFfileName,plotInd = 1,parNames,columnHeadings,
         colourVersion = TRUE,credLevel = 0.95,columnHeadCex = NULL,paletteNum = 1,
         numerSummCex = NULL,BGRsttPos = 10,BGRyRange = c(0.95,1.25),BGRtickPos = 1.2,
         BGRlogTransf,BGRlogitTransf,KDExlim,KDEvertLine = TRUE,KDEvertLineCol = "black",
         addTruthToKDE = NULL)

Arguments

xList

list of matrices, where each matrix corresponds to a different chain, and the columns of each matrix correspond to different parameters. The matrices each have dimension "numMCMC" by "numParms"; where "numMCMC" is the size of the MCMC sample and "numParms" is the number of parameters being summarized.

EPSfileName

filename if the summary is to be saved as a (encapsulated) Postscript file. If this argument and 'PDFfileName' are both not specified then the summary is printed to the screen.

PDFfileName

filename if the summary is to be saved as a PDF file. If this argument and 'EPSfileName' are both not specified then the summary is printed to the screen.

plotInd

if "numChains" exceeds 1 then this indicates which chain is summarized in the non-BGR panels. The BGR panels are Brooks-Gelman-Rubin diagnostic plots are use all chains. The default value is 1.

parNames

list containing a vector of character strings for the parameter names. The maximum length of the vector is 3.

columnHeadings

vector containing column headings. The default is: c("parameter","trace","lag 1","acf","BGR","density","summary").

colourVersion

logical flag indicating if summary should be in colour. The default is TRUE.

credLevel

number between 0 and 1 specifying the credible set level. The default is 0.95.

columnHeadCex

positive number specifying character expansion factor for the column headings.

paletteNum

either 1 or 2 to specify the colour palette to be used.

numerSummCex

positive number specifying character expansion factor for the numerical summary (last column).

BGRsttPos

starting position for the Brooks-Gelman-Rubin plots. The default value is 10.

BGRyRange

vertical axis limits for the Brooks-Gelman-Rubin plots. The default value is c(0.95,1.25).

BGRtickPos

position of tick mark on vertical axis for the Brooks-Gelman-Rubin plots. The default value is 1.2.

BGRlogTransf

vector containing indices of those parameters for which the Brooks-Gelman-Rubin plots should be done on a logarithmic scale.

BGRlogitTransf

vector containing indices of those parameters for which the Brooks-Gelman-Rubin plots should be done on a logit scale.

KDExlim

list of vectors of length 2 specifying the horizontal axis limits for the kernel density estimates.

KDEvertLine

logical flag indicating if a vertical line at zero should be added to the kernel density estimates. The default value is TRUE.

KDEvertLineCol

Colour of the vertical line at zero for kernel density estimates. The default value is "black".

addTruthToKDE

Vector indicating ‘true’ values of parameters. The default value is NULL. If 'addTruthToKDE' is non-NULL then dashed vertical lines corresponding to true values (if known from simulation) are added.

Author(s)

Matt Wand [email protected]

Examples

library(HRW)
xListSingleChain <- list(cbind(rnorm(100),rnorm(100),rnorm(100),rnorm(100)))
summMCMC(xListSingleChain,parNames = list("par1","par2","par3","par4"))
xListMultipleChains <- list(chain1 = cbind(rnorm(100),rnorm(100),rnorm(100),rnorm(100)),
                            chain2 = cbind(rnorm(100),rnorm(100),rnorm(100),rnorm(100)))
summMCMC(xListMultipleChains,parNames = list("par1","par2","par3","par4"))

Sydney real estate

Description

The SydneyRealEstate data frame has data on several variables concerning houses sold in Sydney, Australia, during 2001.

Usage

data(SydneyRealEstate)

Format

This data frame contains the following columns:

logSalePrice

natural logarithm of sale price in Australian dollars.

lotSize

lot size in square meters but with some imputation.

longitude

degrees longitude of location of house.

latitude

degrees latitude of location of house.

saleDate

sale date in dd/mm/yyy format.

saleQtr

financial quarter in which sale took place.

infRate

inflation rate measure as a percentage.

postCode

four-digit post code of the suburb in which the house located.

crimeDensity

crime density measure for the suburb in which the house is located.

crimeRate

crime rate measure for the suburb in which the house is located.

income

average weekly income of the suburb in which the house is located.

distToBusStop

distance from house to the nearest bus stop (kilometers).

distToCoastline

distance from house to the nearest coastline location (kilometers).

distToNatPark

distance from house to the nearest national park (kilometers).

distToPark

distance from house to the nearest park (kilometers).

distToRailLine

distance from house to the nearest railway line (kilometers).

distToRailStation

distance from house to the nearest railway station (kilometers).

distToHighway

distance from house to the nearest highway (kilometers).

distToFreeway

distance from house to the nearest freeway (kilometers).

distToTunnel

distance from house to the Sydney Harbour Tunnel (kilometers).

distToMainRoad

distance from house to the nearest main road (kilometers).

distToSealedRoad

distance from house to the nearest sealed road (kilometers).

distToUnsealedRoad

distance from house to the nearest sealed road (kilometers).

airNoise

aircraft noise exposure measure.

foreignerRatio

proportion of foreigners in the suburb in which the house is located.

distToGPO

distance from the house to the General Post Office in Sydney's central business district (kilometers).

NO

nitrous oxide level recorded at the air pollution monitoring station nearest to the house.

NO2

nitrogen dioxide level recorded at the air pollution monitoring station nearest to the house.

ozone

ozone level recorded at the air pollution monitoring station nearest to the house.

neph

nephelometer suspended matter measurement recorded at the air pollution monitoring station nearest to the house.

PM10

particulate matter with a diameter of under 10 micrometers leve recorded at the air pollution monitoring station nearest to the house.

SO2

sulphur dioxide level recorded at the air pollution monitoring station nearest to the house.

distToAmbulance

distance from house to the nearest ambulance station (kilometers).

distToFactory

distance from house to the nearest factory (kilometers).

distToFerry

distance from house to the nearest ferry wharf (kilometers).

distToHospital

distance from house to the nearest hospital (kilometers).

distToMedical

distance from house to the nearest medical services (kilometers).

distToSchool

distance from house to the nearest school (kilometers).

distToUniversity

distance from house to the nearest university (kilometers).

References

Chernih, A. and Sherris, M. (2004). Geoadditive hedonic pricing models. Unpublished manuscript. University of New South Wales, Australia.

Examples

library(HRW) ; data(SydneyRealEstate)
## Not run: 
for (colNum in setdiff((2:39),c(5,8)))
{
   plot(jitter(SydneyRealEstate[,colNum]),SydneyRealEstate$logSalePrice,pch = ".",
        xlab = names(SydneyRealEstate)[colNum],ylab = "log(sale price)",col = "blue")
   readline("Hit Enter to continue.\n")
}

## End(Not run)

Polygonal boundary of Sydney

Description

A two-column data frame containing the (longitude,latitude) pairs for the vertices of a 202-sided polygon. The polygon was created manually using the HRW package function createBdry(). The polygon tightly encompasses the majority of the (longitude,latitude) data of the HRW package 'SydneyRealEstate' data frame and approximately corresponds to the residential parts of Sydney, Australia.

Usage

data(SydneyRealEstateBdry)

Format

A data frame with 203 observations on the following 2 variables (note that the first vertex is repeated at the end of the data frame):

longitude

longitudinal position of a vertex.

latitude

latitudinal position of the same vertex.

See Also

SydneyRealEstate

Examples

library(HRW) ; data(SydneyRealEstateBdry) ; data(SydneyRealEstate)
plot(SydneyRealEstate$longitude,SydneyRealEstate$latitude,cex = 0.1)
lines(SydneyRealEstateBdry,lwd = 5,col = "red")

U.S. Treasury rate

Description

One-month maturity U.S. Treasury rate during the period 2001-2013.

Usage

data(TreasuryRate)

Format

A data frame with 3,117 observations on each of the following 2 variables:

date

days from July 31, 2001 until July 10, 2013 with the occasional missing values due to holidays.

rate

daily one-month maturity U.S. Treasury rate.

Source

Federal Reserve Bank of St. Louis, U.S.A.

Examples

library(HRW) ; data(TreasuryRate)
TRdate <- as.Date(TreasuryRate$date,"%m/%d/%Y")[!is.na(TreasuryRate$rate)]
TRrate <- TreasuryRate$rate[!is.na(TreasuryRate$rate)]
plot(TRdate,TRrate,type = "l",bty = "l",xlab = "date",ylab = "U.S. Treasury rate")

Peak expiratory flow in Utah, U.S.A.

Description

The UtahPEF data frame data contains longitudinal data on peak expiratory flow, air pollution and temperature for a cohort of 41 schoolchildren in the Utah Valley, U.S.A., during 1990-1991.

Usage

data(UtahPEF)

Format

This data frame contains the following columns:

idnum

schoolchild identification number.

devPEF

daily peak expiratory flow measurements for each schoolchild minus the overall average for that schoolchild.

PM10withMA5

5-day moving average of the concentration of particulate matter 10 micrometers or less in diameter.

lowTemp

lowest temperature in degrees Fahrenheit on the day of recording.

timeTrend

day of the study on which the measurements were made.

Source

Pope, C.A., Dockery, D.W., Spengler, J.D. and Raizenne, M.E. (1991). Respiratory health and PM_10 pollution: a daily time series analysis. American Review of Respiratory Disease, 144, 668-674.

Examples

library(HRW) ; data(UtahPEF)
if (require("lattice"))
   print(xyplot(devPEF ~ PM10withMA5|as.factor(idnum),data = UtahPEF))

Apartment prices in Warsaw, Poland

Description

'WarsawApts' is a subset of the data set 'apartments' in the R package PBImisc. This dataset contains the prices of the apartments which were sold in Warsaw, Poland, during the calendar years 2007 to 2009.

Usage

data(WarsawApts)

Format

A data frame with 409 observations on the following 6 variables:

surface

area of the apartment in square meters.

district

a factor corresponding to the district of Warsaw with levels Mokotow, Srodmiescie, Wola and Zoliborz.

n.rooms

number of rooms in the apartment.

floor

floor on which the apartment is located.

construction.date

year that the apartment was constructed.

areaPerMzloty

area in square meters per million zloty.

Source

The Polish real estate web-site https://www.oferty.net.

References

Biecek, P. (2016). PBImisc: A set of datasets in My Classes or in the Book 'Modele Liniowe i Mieszane w R, Wraz z Przykladami w Analizie Danych' 1.0.

Examples

library(HRW) ; data(WarsawApts)
x <- WarsawApts$construction.date
y <- WarsawApts$areaPerMzloty
plot(x,y,bty = "l",col = "dodgerblue")
if (require("mgcv"))
{
   fitGAMcr <- gam(y ~ s(x,bs = "cr",k = 30))
   xg <- seq(min(x),max(x),length = 1001)
   fHatgGAMcr <- predict(fitGAMcr,newdata = data.frame(x = xg))
   lines(xg,fHatgGAMcr,col = "darkgreen")
}

U.S., European and Japanese yield curves

Description

U.S., European and Japanese yield curves. These are functions of maturity. This dataset has 91 columns. The first column is the date, columns 2 to 31 are European yields at maturities from 1 to 30 years, columns 32 to 61 are Japanese yields at these maturities, and columns 62 to 91 are U.S. yields at the same maturities.

Usage

data(yields)

Format

A data frame with 1565 observations on the following 91 variables:

date

date when the yield is measured.

EU01

European yield at a maturity of 1 year.

EU02

European yield at a maturity of 2 years.

EU03

European yield at a maturity of 3 years.

EU04

European yield at a maturity of 4 years.

EU05

European yield at a maturity of 5 years.

EU06

European yield at a maturity of 6 years.

EU07

European yield at a maturity of 7 years.

EU08

European yield at a maturity of 8 years.

EU09

European yield at a maturity of 9 years.

EU10

European yield at a maturity of 10 years.

EU11

European yield at a maturity of 11 years.

EU12

European yield at a maturity of 12 years.

EU13

European yield at a maturity of 13 years.

EU14

European yield at a maturity of 14 years.

EU15

European yield at a maturity of 15 years.

EU16

European yield at a maturity of 16 years.

EU17

European yield at a maturity of 17 years.

EU18

European yield at a maturity of 18 years.

EU19

European yield at a maturity of 19 years.

EU20

European yield at a maturity of 20 years.

EU21

European yield at a maturity of 21 years.

EU22

European yield at a maturity of 22 years.

EU23

European yield at a maturity of 23 years.

EU24

European yield at a maturity of 24 years.

EU25

European yield at a maturity of 25 years.

EU26

European yield at a maturity of 26 years.

EU27

European yield at a maturity of 27 years.

EU28

European yield at a maturity of 28 years.

EU29

European yield at a maturity of 29 years.

EU30

European yield at a maturity of 30 years.

JP01

Japanese yield at a maturity of 1 year.

JP02

Japanese yield at a maturity of 2 years.

JP03

Japanese yield at a maturity of 3 years.

JP04

Japanese yield at a maturity of 4 years.

JP05

Japanese yield at a maturity of 5 years.

JP06

Japanese yield at a maturity of 6 years.

JP07

Japanese yield at a maturity of 7 years.

JP08

Japanese yield at a maturity of 8 years.

JP09

Japanese yield at a maturity of 9 years.

JP10

Japanese yield at a maturity of 10 years.

JP11

Japanese yield at a maturity of 11 years.

JP12

Japanese yield at a maturity of 12 years.

JP13

Japanese yield at a maturity of 13 years.

JP14

Japanese yield at a maturity of 14 years.

JP15

Japanese yield at a maturity of 15 years.

JP16

Japanese yield at a maturity of 16 years.

JP17

Japanese yield at a maturity of 17 years.

JP18

Japanese yield at a maturity of 18 years.

JP19

Japanese yield at a maturity of 19 years.

JP20

Japanese yield at a maturity of 20 years.

JP21

Japanese yield at a maturity of 21 years.

JP22

Japanese yield at a maturity of 22 years.

JP23

Japanese yield at a maturity of 23 years.

JP24

Japanese yield at a maturity of 24 years.

JP25

Japanese yield at a maturity of 25 years.

JP26

Japanese yield at a maturity of 26 years.

JP27

Japanese yield at a maturity of 27 years.

JP28

Japanese yield at a maturity of 28 years.

JP29

Japanese yield at a maturity of 29 years.

JP30

Japanese yield at a maturity of 30 years.

US01

U.S. yield at a maturity of 1 year.

US02

U.S. yield at a maturity of 2 years.

US03

U.S. yield at a maturity of 3 years.

US04

U.S. yield at a maturity of 4 years.

US05

U.S. yield at a maturity of 5 years.

US06

U.S. yield at a maturity of 6 years.

US07

U.S. yield at a maturity of 7 years.

US08

U.S. yield at a maturity of 8 years.

US09

U.S. yield at a maturity of 9 years.

US10

U.S. yield at a maturity of 10 years.

US11

U.S. yield at a maturity of 11 years.

US12

U.S. yield at a maturity of 12 years.

US13

U.S. yield at a maturity of 13 years.

US14

U.S. yield at a maturity of 14 years.

US15

U.S. yield at a maturity of 15 years.

US16

U.S. yield at a maturity of 16 years.

US17

U.S. yield at a maturity of 17 years.

US18

U.S. yield at a maturity of 18 years.

US19

U.S. yield at a maturity of 19 years.

US20

U.S. yield at a maturity of 20 years.

US21

U.S. yield at a maturity of 21 years.

US22

U.S. yield at a maturity of 22 years.

US23

U.S. yield at a maturity of 23 years.

US24

U.S. yield at a maturity of 24 years.

US25

U.S. yield at a maturity of 25 years.

US26

U.S. yield at a maturity of 26 years.

US27

U.S. yield at a maturity of 27 years.

US28

U.S. yield at a maturity of 28 years.

US29

U.S. yield at a maturity of 29 years.

US30

U.S. yield at a maturity of 30 years.

Source

European Central Bank (European yields), Japanese Ministry of Finance (Japanese yields) and U.S. Federal Reserve Board (U.S. yields).

Examples

library(HRW) ; data(yields)
t <- 1:30 ; yieldsCleaned <- na.omit(yields)[,-1]
plot(t,yieldsCleaned[1,61:90], type="l",ylim = c(0,6),lwd = 2,
     xlab = "maturity (years)",ylab = "yield",
     bty = "l",cex.lab = 1.5,cex.axis = 1.5)
for (i in 2:14) lines(t,yieldsCleaned[100*i+1,61:90],col = i,lwd = 2)

O'Sullivan spline design matrices

Description

Constructs a design matrix consisting of O'Sullivan cubic spline functions of an array of abscissae. Typicially the array corresponds to either observed values of a predictor or an abscissa grid for plotting purposes.

Usage

ZOSull(x,range.x,intKnots,drv = 0)

Arguments

x

array of abscissae.

range.x

array of length 2 such that range.x[1] <= min(x) and range.x[2] >= max(x).

intKnots

ordered array of length smaller than that of x and consisting of unique numbers between min(x) and max(x) that specifies the positions of internal knots, that define the spline basis (see the Wand and Ormerod (2008) reference below for full mathematical details).

drv

either 0,1 or 2 with a default value of 0. If drv = 1 then the first derivatives of the O'Sullivan spline basis functions are computed instead. Similarly, if drv = 2 then the second derivatives are computed.

Value

A matrix of with length(x) rows and (length(intKnots) + 2) columns with each column containing a separate O'Sullivan spline basis function (determined by range.x and intKnots) of the abscissae in x. The values of range.x and intKnots are included as attributes of the fit object.

Author(s)

Matt Wand [email protected]

References

O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with discussion). Statistical Science, 1, 505-527.

Wand, M.P. and Ormerod, J.T. (2008). On semiparametric regression with O'Sullivan penalized splines. Australian and New Zealand Journal of Statistics. 50, 179-198.

Examples

library(HRW)
x <- WarsawApts$construction.date
a <- 1.01*min(x) - 0.01*max(x)
b <- 1.01*max(x) - 0.01*min(x) ; numIntKnots <- 23
intKnots <- quantile(unique(x),seq(0,1,length = (numIntKnots + 2))[-c(1,(numIntKnots + 2))])
xg <- seq(a,b,length = 1001)
Zg <- ZOSull(xg,range.x = c(a,b),intKnots = intKnots)
plot(0,type = "n",xlim = range(xg),ylim = range(Zg),
     bty = "l",xlab = "construction date (year)",
     ylab = "spline basis function")
for (k in 1:ncol(Zg)) lines(xg,Zg[,k],col = k,lwd = 2)
   lines(c(min(xg),max(xg)),rep(0,2),col = "darkmagenta")
for (k in 1:numIntKnots)
   points(intKnots[k],0,pch = 18,cex = 2,col = "darkmagenta")