Fred Ramsey, Daniel Schafer
Chapter 9
Multiple Regression
Chapter Questions
Meadowfoam. (a) Write down a multiple regression model with parallel regression lines of flowers on light for the two separate levels of time (using an indicator variable). (b) Add a term to the model in (a) so that the regression lines are not parallel.
Meadowfoam. A model (without interaction) for the mean flowers is estimated to be $71.3058 - .0405$ light $+ 12.1583$ early. For a fixed level of timing, what is the estimated difference between the mean flowers at 600 and 300 $\mu \mathrm{mol} / \mathrm{m}^2 / \mathrm{sec}$ of light intensity?
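As a reminder of the algebra behind this kind of comparison, here is a sketch for a generic parallel-lines model $\mu\{flowers \mid light, early\}=\beta_0+\beta_1\, light+\beta_2\, early$. The symbols $\beta_0$, $\beta_1$, $\beta_2$ are notation assumed here for illustration, not taken from the exercise. With the timing indicator held fixed, the intercept and the timing term cancel, so only the light coefficient matters:

$$\mu\{flowers \mid light=600, early\}-\mu\{flowers \mid light=300, early\}=\beta_1(600-300)=300\,\beta_1$$

The same cancellation would not occur in a model with a light-by-timing interaction, where the difference would depend on the level of timing.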
Meadowfoam. (a) Why were the numbers of flowers from 10 plants averaged to make a response, rather than representing them as 10 different responses? (b) What assumption is assisted by averaging the numbers from the 10 plants?
Mammal Brain Weights. The three-toed sloth has a gestation period of 165 days. The Indian fruit bat has a gestation period of 145 days. From Display 9.14 the estimated model for the mean of log brain weight is $.8548 + .5751$ lbody $+ .4179$ lgest $- .3101$ llitter. Since lgest for the sloth is .1292 more than lgest for the fruit bat, does this imply that an estimate of the mean log brain weight for the sloth is $(.4179)(.1292)$ more than the mean log brain weight for the bat (i.e., the median is $5.5\%$ higher)? Why or why not?
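The numbers quoted in this exercise can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes natural logarithms, and the function name `median_ratio` is made up for illustration. It confirms where .1292 and $5.5\%$ come from; whether the comparison itself is warranted is what the exercise asks.

```python
import math

# Values quoted in the exercise text.
LGEST_COEF = 0.4179          # estimated coefficient of lgest
SLOTH_DAYS, BAT_DAYS = 165, 145

# Difference in log gestation (natural log assumed).
lgest_diff = math.log(SLOTH_DAYS) - math.log(BAT_DAYS)

def median_ratio(coef: float, diff: float) -> float:
    """Multiplicative change in the median response implied by a
    shift of `diff` in one log-scale explanatory variable, with all
    other explanatory variables held fixed."""
    return math.exp(coef * diff)

ratio = median_ratio(LGEST_COEF, lgest_diff)
print(f"difference in lgest:  {lgest_diff:.4f}")
print(f"log-scale difference: {LGEST_COEF * lgest_diff:.4f}")
print(f"implied median ratio: {ratio:.4f}")
```

The ratio is only meaningful under the "all other explanatory variables held fixed" condition stated in the docstring.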
Insulating Fluid (Section 8.1.2). Would it be possible to test for lack of fit to the straight-line model for the regression of log breakdown time on voltage by including a voltage-squared term in the model, and testing whether the coefficient of the squared term is zero?
Island Area and Species. For the island area and number of species data in Section 8.1.1, would it be possible to test for lack of fit to the straight-line model for the regression of log number of species on log island area by including the square of log area in the model and testing whether its coefficient is zero?
Which of the following regression models are linear?
(a) $\mu\{Y \mid X\}=\beta_0+\beta_1 X+\beta_2 X^2+\beta_3 X^3$
(b) $\mu\{Y \mid X\}=\beta_0+\beta_1 10^X$
(c) $\mu\{Y \mid X\}=\left(\beta_0+\beta_1 X\right) /\left(\beta_0+\beta_2 X\right)$
(d) $\mu\{Y \mid X\}=\beta_0 \exp \left(\beta_1 X\right)$.
Describe what $\sigma$ measures in the meadowfoam problem and in the brain weight problem.
Pollen Removal. Reconsider the data on proportion of pollen removed and duration of visit to the flower for bumblebee queens and honeybee workers, in Exercise 3.28. (a) Write down a model that describes the mean proportion of pollen removed as a straight-line function of duration of visit, with separate intercepts and separate slopes for bumblebee queens and honeybee workers. (b) How would you test whether the effect of duration of visit on proportion removed is the same for queens as for workers?
Breast Milk and IQ. In a study, intelligence quotient (IQ) test scores were obtained for 300 8-year-old children who had been part of a study of premature babies in the early 1980s. Because they were premature, all the babies were fed milk by a tube. Some of them received breast milk entirely, some received a prepared formula entirely, and some received some combination of breast milk and formula. The proportion of breast milk in the diet depended on whether the mother elected to provide breast milk and to what extent she was successful in expressing any, or enough, for the baby's diet. The researchers reported the results of the regression of the response variable, IQ at age 8, on social class (ordered from 1, the highest, to 5), mother's education (ordered from 1, the lowest, to 5), an indicator variable taking the value 1 if the child was female and 0 if male, the number of days of ventilation of the baby after birth, and an indicator variable taking the value 1 if there was any breast milk in the baby's diet and 0 if there was none. The estimates are reported in Display 9.16 along with the $p$-values for the tests that each coefficient is zero. (Data from Lucas et al., "Breast Milk and Subsequent Intelligence Quotient in Children Born Preterm," Lancet 339 (1992): 261-64.)
(a) After accounting for the effects of social class, mother's education, whether the child was a female, and days after birth of ventilation, how much higher is the estimated mean IQ for those children who received breast milk than for those who did not?
(b) Is it appropriate to use the variables "Social class" and "Mother's education" in the regression even though in both instances the numbers 1 to 5 do not correspond to anything real but are merely ordered categories?
(c) Does it seem appropriate for the authors to simply report $<.0001$ for the $p$-value of the breast milk coefficient rather than the actual $p$-value?
(d) Previous studies on breast milk and intelligence could not separate out the effects of breast milk and the act of breast feeding (the bonding from which might encourage intellectual development of the child). How is the important confounding variable of whether a child is breast fed dealt with in this study?
(e) Why is it important to have social class and mother's education as explanatory variables?
(f) In a subsidiary analysis the researchers fit the same regression model as above except with the indicator variable for whether the child received breast milk replaced by the percentage of breast milk in the diet (between 0 and $100\%$). The coefficient of that variable turned out to be .09. (i) From this model, how much larger is the estimated mean IQ for children who received $100\%$ breast milk than for those who received $50\%$ breast milk, after accounting for the other explanatory variables? (ii) What is the importance of the percentage of breast milk variable in dealing with confounding variables?
Glasgow Graveyards. Do persons of higher socioeconomic standing tend to live longer? This was addressed by George Davey Smith and colleagues through the relationship of the heights of commemoration obelisks and the life lengths of the corresponding grave site occupants. In burial grounds in Glasgow a certain design of obelisk is quite prevalent, but the heights vary greatly. Since the height would influence the cost of the obelisk, it is reasonable to believe that height is related to socioeconomic status. The researchers recorded obelisk height, year of death, age at death, and gender for 1,349 individuals who died prior to 1921. Although they were interested in the relationship between mean life length and obelisk height, it is important that they included year of construction as an explanatory variable since life lengths tended to increase over the years represented (1801 to 1920). For males, they fit the regression of life length on obelisk height (in meters) and year of obelisk construction and found the coefficient of obelisk height to be 1.93. For females they fit the same regression and found the coefficient of obelisk height to be 2.92. (Data from Smith et al., "Socioeconomic Differentials in Mortality: Evidence from Glasgow Graveyards," British Medical Journal 305 (1992): 1557-60.)
(a) After accounting for year of obelisk construction, each extra meter in obelisk height is associated with $Z$ extra years in mean lifetime. What is the estimated $Z$ for males? What is the estimated $Z$ for females?
(b) Since the coefficients differ significantly from zero, would it be wise for an individual to build an extremely tall obelisk, to ensure a long lifetime?
(c) The data were collected from eight different graveyards in Glasgow. Since there is a potential blocking effect due to the different graveyards, it might be appropriate to include a graveyard effect in the model. How can this be done?
Mammal Brain Weights. (a) Draw a matrix of scatterplots for the mammal brain weight data (Display 9.4) with all variables transformed to their logarithms (to reproduce Display 9.11). (b) Fit the multiple linear regression of log brain weight on log body weight, log gestation, and log litter size, to confirm the estimates in Display 9.15. (c) Draw a matrix of scatterplots as in (a) but with litter size on its natural scale (untransformed). Does the relationship between log brain weight and litter size appear to be any better or any worse (more like a straight line) than the relationship between log brain weight and log litter size?
Meat Processing. One way to check on the adequacy of a linear regression is to try to include an $X$-squared term in the model to see if there is significant curvature. Use this technique on the meat processing data of Section 7.1.2. (a) Fit the multiple regression of pH on hour and hour-squared. Is the coefficient of hour-squared significantly different from zero? What is the $p$-value? (b) Fit the multiple regression of pH on log(hour) and the square of log(hour). Is the coefficient of the squared term significantly different from zero? What is the $p$-value? (c) Does this exercise suggest a potential way of checking the appropriateness of taking the logarithm of $X$ or of leaving it untransformed?
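The mechanics of adding a squared term can be sketched in plain Python. The data below are made up for illustration and are NOT the meat processing data, and the helper names `solve` and `fit_quadratic` are hypothetical. The sketch only produces least squares estimates; the standard errors and $p$-values the exercise asks for would come from a statistical package.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_quadratic(xs, ys):
    """Least squares estimates [b0, b1, b2] for
    mean(y) = b0 + b1*x + b2*x^2, via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]  # design matrix
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve(xtx, xty)

# Made-up illustration data (not from the book).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys_line = [7.0 - 0.2 * x for x in xs]                    # exactly straight
ys_curve = [7.0 - 0.2 * x + 0.05 * x * x for x in xs]    # curved

b_line = fit_quadratic(xs, ys_line)
b_curve = fit_quadratic(xs, ys_curve)
print("squared-term estimate, straight data:", round(b_line[2], 6))
print("squared-term estimate, curved data:  ", round(b_curve[2], 6))
```

With real, noisy data the squared-term estimate is never exactly zero, which is why the exercise asks for a test of that coefficient rather than a look at its value.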
Pace of Life and Heart Disease. Some believe that individuals with a constant sense of time urgency (often called type-A behavior) are more susceptible to heart disease than are more relaxed individuals. Although most studies of this issue have focused on individuals, some psychologists have investigated geographical areas. They considered the relationship of city-wide heart disease rates and general measures of the pace of life in the city.
For each region of the United States (Northeast, Midwest, South, and West) they selected three large metropolitan areas, three medium-size cities, and three smaller cities. In each city they measured three indicators of the pace of life. The variable walk is the walking speed of pedestrians over a distance of 60 feet during business hours on a clear summer day along a main downtown street. Bank is the average time a sample of bank clerks takes to make change for two $\$20$ bills or to give $\$20$ bills for change. The variable talk was obtained by recording responses of postal clerks explaining the difference between regular, certified, and insured mail and by dividing the total number of syllables by the time of their response. The researchers also obtained the age-adjusted death rates from ischemic heart disease (a decreased flow of blood to the heart) for each city (heart). The data in Display 9.17 were read from a graph in the published paper. (Data from R. V. Levine, "The Pace of Life," American Scientist 78 (1990): 450-9.) The variables have been standardized, so there are no units of measurement involved.
(a) Draw a matrix of scatterplots of the four variables. Construct it so that the bottom row of plots all have heart on the vertical axis. If you do not have this facility, draw scatterplots of heart versus each of the other variables individually.
(b) Obtain the least squares fit to the linear regression of heart on bank, walk, and talk.
(c) Plot the residuals versus the fitted values. Is there evidence that the variance of the residuals increases with increasing fitted values or that there are any outliers?
(d) Report a summary of the least squares fit. Write down the estimated equation with standard errors below each estimated coefficient.
Rainfall and Corn Yield. The data on corn yields and rainfall, discussed in Section 9.3.1, appear in Display 9.18. (Data from M. Ezekiel and K. A. Fox, Methods of Correlation and Regression Analysis, New York: John Wiley & Sons, 1959; originally from E. G. Misner, "Studies of the Relationship of Weather to the Production and Price of Farm Products, I. Corn" [mimeographed publication, Cornell University, March 1928].)
(a) Plot corn yield versus rainfall.
(b) Fit the multiple regression of corn yield on rain and rain$^2$.
(c) Plot the residuals versus year. Is there any pattern evident in this plot? What does it mean? (Anything to do, possibly, with advances in technology?)
(d) Fit the multiple regression of corn yield on rain, rain$^2$, and year. Write the estimated model and report standard errors, in parentheses, below estimated coefficients. How do the coefficients of rain and rain$^2$ differ from those in the estimated model in (b)? How does the estimate of $\sigma$ differ? (larger or smaller?) How do the standard errors of the coefficients differ? (larger or smaller?) Describe the effect of an increase of one inch of rainfall on the mean yield over the range of rainfalls and years.
(e) Fit the multiple regression of corn yield on rain, rain$^2$, year, and year $\times$ rain. Is the coefficient of the interaction term significantly different from zero? Could this term be used to say something about technological improvements regarding irrigation?
Pollen Removal. The data in Exercise 3.28 are the proportions of pollen removed and the duration of visits on a flower for 35 bumblebee queens and 12 honeybee workers. It is of interest to understand the relationship between the proportion removed and duration and the relative pollen removal efficiency of queens and workers. (a) Draw a coded scatterplot of proportion of pollen removed versus duration of visit; use different symbols or letters as the plotting codes for queens and workers. Does it appear that the relationship between proportion removed and duration is a straight line? (b) The logit transformation is often useful for proportions between 0 and 1. If $p$ is the proportion then the logit is $\log [p /(1-p)]$. This is the log of the ratio of the amount of pollen removed to the amount not removed. Draw a coded scatterplot of the logit versus duration. (c) Draw a coded scatterplot of the logit versus log duration. From the three plots, which transformations appear to be worthy of pursuing with a regression model? (d) Fit the multiple linear regression of the logit of the proportion of pollen removed on (i) log duration, (ii) an indicator variable for whether the bee is a queen or a worker, and (iii) a product term for the interaction of the first two explanatory variables. By examining the $p$-value of the interaction term, determine whether there is any evidence that the proportion of pollen depends on duration of visit differently for queens than for workers. (e) Refit the multiple regression but without the interaction term. Is there evidence that, after accounting for the amount of time on the flower, queens tend to remove a smaller proportion of pollen than workers? Why is the $p$-value for the significance of the indicator variable so different in this model than in the one with the interaction term?
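The logit transformation described in part (b) can be written as a small helper. This is a minimal sketch; the function name `logit` and the example proportions are illustrative and are not the Exercise 3.28 data.

```python
import math

def logit(p: float) -> float:
    """Log of the ratio p / (1 - p), for a proportion strictly
    between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1.0 - p))

# Illustrative proportions only (not the pollen data).
for p in (0.25, 0.50, 0.75):
    print(f"p = {p:.2f}  logit = {logit(p):+.4f}")
```

The transformation is symmetric about $p = 0.5$, where it equals zero, and it is undefined at exactly 0 or 1, so any proportions at those boundaries would need separate handling before plotting or fitting.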
Old Faithful. With the Old Faithful data from Display 7.14, (a) draw a coded scatterplot of interval versus duration, with different codes for the different days; and (b) using 7 indicator variables for the 8 days, fit the multiple regression of interval on both duration and the factor day. Write the estimated model and show standard errors in parentheses below the estimated coefficients.
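Coding a factor with 8 levels as 7 indicator variables can be sketched as follows. The helper name `indicator_columns`, the choice of day 1 as the reference level, and the day labels are all assumptions made for illustration; they are not the Display 7.14 data, and a statistical package would normally do this coding automatically.

```python
def indicator_columns(levels, reference):
    """Return (names, rows): one 0/1 column for each distinct level
    other than `reference`. Observations at the reference level get
    a row of all zeros."""
    others = sorted(set(levels) - {reference})
    names = [f"day{lvl}" for lvl in others]
    rows = [[1 if lvl == other else 0 for other in others]
            for lvl in levels]
    return names, rows

# Hypothetical day labels for nine observations (not the real data).
days = [1, 1, 2, 3, 8, 5, 4, 6, 7]
names, rows = indicator_columns(days, reference=1)

print(names)
for day, row in zip(days, rows):
    print(day, row)
```

Each indicator's coefficient would then measure the difference in mean interval between that day and the reference day, at a fixed duration. Choosing a different reference day changes the individual coefficients but not the fitted values.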
Speed of Evolution. How fast can evolution occur in nature? Are evolutionary trajectories predictable or idiosyncratic? To answer these questions R. B. Huey et al. ("Rapid Evolution of a Geographic Cline in Size in an Introduced Fly," Science 287 (2000): 308-9) studied the development of a fly, Drosophila subobscura, that had accidentally been introduced from the Old World into North America (NA) around 1980. In Europe (EU), characteristics of the flies' wings follow a "cline," a steady change with latitude. One decade after introduction, the NA population had spread throughout the continent, but no such cline could be found. After two decades, Huey and his team collected flies from 11 locations in western NA and native flies from 10 locations in EU at latitudes ranging from 35 to 55 degrees N. They maintained all samples in uniform conditions through several generations to isolate genetic differences from environmental differences. Then they measured about 20 adults from each group. Display 9.19 shows average wing size in millimeters on a logarithmic scale, and average ratios of basal lengths to wing size.
(a) Construct a scatter plot of average wing size against latitude, in which the four groups defined by continent and sex are coded differently. Do these suggest that the wing sizes of the NA flies have evolved toward the same cline as in EU?
(b) Construct a multiple linear regression model with wing size as the response, with latitude as a linear explanatory variable, and with indicator variables to distinguish the sexes and continents. As there are four groups, you will want to have three indicator variables: the continent indicator, the sex indicator, and the product of the two. Construct the model in such a way that one parameter measures the difference between the slopes of the wing size versus latitude regressions of NA and EU for males, one measures the difference between the NA-EU slope difference for females and that for males, one measures the difference between the intercepts of the regressions of NA and EU for males, and one measures the difference between the NA-EU intercepts' difference for females and that for males.
Depression and Education. Has homework got you depressed? It could be worse. Depression, like other illnesses, is more prevalent among adults with less education than you have.
R. A. Miech and M. J. Shanahan investigated the association of depression with age and education, based on a 1990 nationwide (U.S.) telephone survey of 2,031 adults aged 18 to 90. Of particular interest was their finding that the association of depression with education strengthens with increasing age, a phenomenon they called the "divergence hypothesis."
They constructed a depression score from responses to several related questions. Education was categorized as (i) college degree, (ii) high school degree plus some college, or (iii) high school degree only. (See "Socioeconomic Status and Depression over the Life Course," Journal of Health and Social Behaviour 41(2) (June, 2000): 162-74.)
(a) Construct a multiple linear regression model in which the mean depression score changes linearly with age in all three education categories, with possibly unequal slopes and intercepts. Identify a single parameter that measures the diverging gap between categories (iii) and (i) with age.
(b) Modify the model to specify that the slopes of the regression lines with age are equal in categories (i) and (ii) but possibly different in category (iii). Again identify a single parameter measuring divergence.
(c) This and other studies found evidence that the mean depression is high in the late teens, declines toward middle age, and then increases towards old age. Construct a multiple linear regression model in which the association has these characteristics, with possibly different structures in the three education categories. Can this be done in such a way that a single parameter characterizes the divergence hypothesis?
Winning Speeds at the Kentucky Derby. The Kentucky Derby is a 1.25 mile horse race held annually at the Churchill Downs race track in Louisville, Kentucky. Shown in Display 9.20 are some sample rows of a data set containing the year of the race, the winning horse, the condition of the track, and the average speed (in feet per second) of the winner, for years 1896-2000. The track conditions have been grouped into three categories: fast, good (which includes the official designations "good" and "dusty"), and slow (which includes the designations "slow," "heavy," "muddy," and "sloppy"). Use a statistical computer program to fit a model for the mean winning speed as a function of year and the track condition factor. The data are from www.kentuckyderby.com.