A Guide to Robust Statistical Methods

Rand R. Wilcox
Department of Psychology
University of Southern California
Los Angeles, CA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Consider the collection of classic methods for comparing groups and studying
associations that are routinely taught and used. A fundamental issue is how well
these methods perform when the underlying assumptions are violated. Based on
hundreds of published papers, methods based on the mean perform well when the
groups do not differ in any manner. That is, they have identical distributions. If
the distributions differ, they might continue to perform well, but under general
conditions they can have relatively poor power and can yield inaccurate confidence
intervals. Even a slight departure from a normal distribution can result in poor
power. Fundamental concerns about these methods have been known for over a half
century for reasons that are reviewed at various points in this book. In fact, some
concerns have been known for over two centuries. The main point is that there is
now an extensive collection of new and improved methods that perform well over a
much broader range of situations compared to classic techniques. To be a bit more
precise, there are general conditions where modern methods provide better power
and more accurate confidence intervals. Included are new methods that help provide
a deeper and more nuanced understanding of how groups compare. For instance,
there are now robust heteroscedastic measures of effect size.
One basic concern when using any method based on means is outliers. Outliers
can destroy power and yield misleading information about the typical response.
Dealing with outliers might seem trivial: simply remove any outliers and apply
a conventional method for comparing groups using the remaining data. However,
from a technical point of view, this approach is highly unsatisfactory regardless of
how large the sample sizes might be: this approach results in estimates of standard
errors that are highly inaccurate. Easy-to-use methods for dealing with this issue are
now available. But even when there are no outliers, skewed distributions are another
source of concern. When distributions differ in skewness, this has the potential of yielding
inaccurate confidence intervals even when the sample sizes are moderately large.
Practical concerns get worse when distributions are skewed and outliers are likely
to occur. Modern methods provide substantially better techniques for addressing this
concern.
does not cover the many important advances and insights that provide a deeper and
more accurate sense of what data are trying to tell us. The hope is that this book
helps address this issue.
Contents

1 Introduction ... 1
  1.1 The Normal Distribution ... 2
  1.2 Student's T-Test ... 5
  1.3 Outliers and the Breakdown Point of an Estimator ... 9
  1.4 Homoscedasticity ... 10
  1.5 Detecting Outliers ... 12
  1.6 Strategies for Dealing with Outliers and Violations of Assumptions That Can Be Highly Unsatisfactory ... 16
    1.6.1 Dealing with Outliers ... 16
    1.6.2 Transforming Data ... 17
    1.6.3 Testing Assumptions ... 18
    1.6.4 Standardizing Data and Non-normality ... 18
  1.7 Pearson's Correlation ... 19
  1.8 Robust: From a Statistical Point of View, What Does This Mean? ... 21
  1.9 R Functions and Data Used in This Book ... 21
  1.10 Exercises ... 22
2 The One-Sample Case ... 25
  2.1 Measures of Location ... 25
    2.1.1 Trimmed Means ... 26
    2.1.2 M-Estimators ... 26
    2.1.3 Quantile Estimators ... 27
  2.2 R Functions tmean, mom, onestep, hd, thd, quant, and qno.est ... 30
    2.2.1 Robust Measures of Dispersion ... 31
    2.2.2 R Function pbvar ... 31
  2.3 Computing Confidence Intervals and Testing Hypotheses ... 32
    2.3.1 Trimmed Mean ... 32
    2.3.2 Bootstrap-t Method ... 34
    2.3.3 The Percentile Bootstrap Method ... 36
    2.3.4 Choosing the Number of Bootstrap Samples ... 37
regression estimator suffers from similar issues, and new concerns are introduced.
First, the normal distribution is reviewed plus some other basic concepts that are
important here. Then the impact of non-normality on Student’s t-test is described
and illustrated, which lays a foundation for understanding why other standard
methods for analyzing data are problematic under general conditions. This is not
to say that classic methods are always unsatisfactory. If, for example, groups are
being compared that have identical distributions, conventional methods based on
means perform reasonably well in terms of controlling the probability of a Type I
error. When studying associations, standard inferential methods based on the least
squares regression estimator work well when there is no association. Conventional
methods might continue to perform reasonably well when groups differ or when
there is an association. But under general conditions, this is not the case: they can
completely miss important differences among groups and strong associations among
the bulk of the data as will be seen.
1.1 The Normal Distribution

In 1809, Gauss derived the normal distribution. It is informative to outline
the derivation of the normal distribution and to comment on some of its properties.
Consider a random sample $X_1, \ldots, X_n$ of $n$ participants and let

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad (1.1)$$
denote the sample mean, which is an example of what is called a measure of
location. Any summary of the data is called a measure of location if it satisfies three
properties. First, its value is greater than or equal to the smallest value observed,
and its value is less than or equal to the largest value observed. Second, if all
observations are multiplied by some constant c, its value is multiplied by c as well.
Third, if c is added to every value, its value is increased by the same amount. Often
the sample mean is described as a measure of central tendency, but this term is
misleading because the sample mean can reflect a value that is highly atypical, for
reasons that are described and illustrated in Sect. 1.3.
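The three properties are easy to verify numerically. A minimal sketch (in Python, for illustration; the book's own examples use R) checks them for the sample mean:

```python
# Numerical check that the sample mean satisfies the three defining
# properties of a measure of location.
x = [2.0, 5.0, 1.0, 9.0, 3.0]
c = 4.0

def mean(v):
    return sum(v) / len(v)

m = mean(x)
assert min(x) <= m <= max(x)                              # property 1
assert abs(mean([c * xi for xi in x]) - c * m) < 1e-12    # property 2
assert abs(mean([xi + c for xi in x]) - (m + c)) < 1e-12  # property 3
print(m)  # 4.0
```

The same three checks pass for the median and the trimmed means introduced later, which is why all of them qualify as measures of location.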
Now imagine that a study is repeated many times yielding sample means
$\bar{X}_1, \bar{X}_2, \bar{X}_3, \ldots$ Note that because not all participants were measured, the values of
these sample means will vary. In particular, there is some probability that the sample
mean will be less than or equal to 3, less than or equal to 6, and more generally less
than or equal to any constant c we might pick. These probabilities constitute what
is called the sampling distribution of the mean. As explained in a basic statistics
course, the variation of the sample means is called the squared standard error of the
sample mean. Assuming random sampling only, it can be shown that the squared
standard error of the mean is

$$\mathrm{VAR}(\bar{X}) = \frac{\sigma^2}{n}, \quad (1.2)$$

where $\sigma^2$ is the population variance. Gauss's derivation led to the normal distribution, which has the probability density function

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad (1.3)$$

where $\mu$ is the population mean and $-\infty < x < \infty$. For example, for any constant
c, $P(X \le c)$ is the area under this curve extending from $-\infty$ to c, as explained in
a standard introductory course.
The normal distribution has several properties that make it highly convenient
from a technical point of view. For example, if both X and Y have normal
distributions, $X - Y$ has a normal distribution as well. If we standardize a random
variable X that has a normal distribution, yielding

$$Z = \frac{X - \mu}{\sigma}, \quad (1.4)$$

Z also has a normal distribution, but with mean 0 and variance 1. That is, Z has
what is called a standard normal distribution. Another convenient property is that
regardless of what the mean and standard deviation happen to be, the probability
that a randomly sampled observation is within one standard deviation of the mean
is 0.68. The probability that a randomly sampled observation is within two standard
deviations of the mean is 0.954. These two properties are depicted in Fig. 1.1.
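These two probabilities can be computed directly rather than looked up in a table. In Python, for a standard normal, $P(|Z| \le k)$ equals $\mathrm{erf}(k/\sqrt{2})$:

```python
import math

# For a standard normal distribution, P(|Z| <= k) = erf(k / sqrt(2)).
p1 = math.erf(1 / math.sqrt(2))  # within one standard deviation
p2 = math.erf(2 / math.sqrt(2))  # within two standard deviations
print(round(p1, 3), round(p2, 3))  # 0.683 0.954
```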
Next, consider the sample variance

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \quad (1.5)$$

Because the sample mean is used to compute the sample variance, an obvious
speculation is that $s^2$ and $\bar{X}$ are dependent, and in general, this is true. However,
when dealing with a normal distribution, they are independent. For non-normal
distributions, the dependence between $s^2$ and $\bar{X}$ helps explain some unexpected
properties of Student's t-test, which is reviewed in Sect. 1.2.
Roughly, a heavy-tailed distribution is a distribution for which the tails of the
distribution lie above the tails of the normal distribution. An example is the mixed
normal distribution described in Sect. 1.2.

Fig. 1.1 For all normal distributions, the probability that an observation is within one standard
deviation of the mean is 0.68. The probability of being within two standard deviations is 0.954
Note that the tails of the mixed normal lie slightly above the tails of the normal
distribution. For this reason, as previously indicated, the mixed normal is said to
have heavy tails.
Now, suppose $n = 20$ values are randomly sampled from the mixed normal
distribution. From basic principles, the standard error of the mean is
$\sigma/\sqrt{n} = \sqrt{10.9/20} = 0.738$. In contrast, the standard error of the median is 0.300. This
illustrates the fact that no single measure of location has the smallest standard error.
This is one of the reasons that multiple methods can be needed when attempting to
understand data, as will be illustrated in subsequent chapters.
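The comparison is easy to reproduce by simulation. The sketch below is in Python rather than the book's R, and it assumes the mixed normal is the contaminated normal $0.9\,N(0,1) + 0.1\,N(0,100)$, consistent with the variance 10.9 quoted above:

```python
import random
import statistics

# Monte Carlo comparison of the standard errors of the mean and the
# median for n = 20 observations from a mixed (contaminated) normal.
random.seed(1)

def mixed_normal():
    return random.gauss(0, 10) if random.random() < 0.1 else random.gauss(0, 1)

n, reps = 20, 20000
means, medians = [], []
for _ in range(reps):
    x = [mixed_normal() for _ in range(n)]
    means.append(sum(x) / n)
    medians.append(statistics.median(x))

se_mean = statistics.stdev(means)      # close to the theoretical 0.738
se_median = statistics.stdev(medians)  # close to 0.300
print(round(se_mean, 2), round(se_median, 2))
```

Under a normal distribution the ordering reverses: the mean then has the smaller standard error, which is the point of the remark that no single measure of location always wins.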
$$H_0: \mu = \mu_0, \quad (1.6)$$

where $\mu_0$ is some specified constant that is often labeled the null value.
Suppose the goal is to compute a $1-\alpha$ confidence interval for $\mu$. That is, the goal
is to compute an interval containing $\mu$ with probability $1 - \alpha$. Assuming random
sampling from a normal distribution, it can be shown that
$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \quad (1.7)$$

has a Student's t distribution with $\nu = n - 1$ degrees of freedom. Letting t denote the
$1 - \alpha/2$ quantile of this distribution, a $1 - \alpha$ confidence interval for $\mu$ is

$$\left( \bar{X} - t\frac{s}{\sqrt{n}},\; \bar{X} + t\frac{s}{\sqrt{n}} \right). \quad (1.8)$$

As for testing (1.6), the test statistic is

$$T = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}. \quad (1.9)$$
Suppose it is desired that the probability of a Type I error (rejecting the null
hypothesis when it is true) is $\alpha$. Then reject (1.6) if $|T| \ge t$, where again t is the
$1 - \alpha/2$ quantile of a Student's t distribution with $\nu = n - 1$ degrees of freedom.
Power is the probability of rejecting when the null hypothesis is false. Power
is a function of the choice for $\alpha$, the Type I error probability; the sample size n;
the population standard deviation $\sigma$; and the magnitude of $\mu - \mu_0$, the difference
between the hypothesized value and the true value of the population mean. Of
particular importance here is that as the population standard deviation increases,
with all other factors held constant, power decreases. This property will be seen to
be very important when considering the relative merits of methods that might be
used.
Hypothesis Testing Versus Decision-Making
It is important to describe an issue raised by Tukey (1991). Tukey objected to the
goal of testing for exact equality, arguing that surely $\mu$ differs from $\mu_0$ at some
decimal place. Jones and Tukey (2000) argued that the goal of testing for equality
should be replaced by Tukey's three-decision rule. For the situation at hand, if the
null hypothesis is rejected and $\bar{X} > \mu_0$, decide that the population mean $\mu$ is greater
than $\mu_0$. If the hypothesis is rejected and $\bar{X} < \mu_0$, decide that the population mean
$\mu$ is less than $\mu_0$. If the hypothesis is not rejected, make no decision. This point of
view has interesting implications for a wide range of situations as will be seen.
Comments About P-Values
Another issue that should be discussed is the p-value due to concerns and misinter-
pretations raised, for example, by Kmetz (2019) and Wasserstein et al. (2019). Note
that if the hypothesis is rejected when the Type I error is set to $\alpha = 0.05$, it is unclear
whether the hypothesis would be rejected for $\alpha = 0.025$ or 0.01. The p-value refers
to the smallest $\alpha$ value for which the null hypothesis is rejected. Stigler (1986, p. 152)
notes that the idea of a p-value dates back to a paper published by Laplace in the
year 1823.
In the context of Tukey’s three-decision rule, a p-value reflects the strength of
the empirical evidence that a decision can be made about whether the parameter
of interest is greater than or less than the hypothesized value. This interpretation
is consistent with a view expressed by R. A. Fisher in the 1920s as noted by Biau
et al. (2010). However, a p-value close to zero does not necessarily mean that the
difference between $\mu$ and $\mu_0$ is clinically important. Moreover, a p-value close to
zero does not mean that there is a high probability of rejecting again if the study
is replicated. The probability of rejecting is a power issue. Imagine, for example,
that the p-value is 0.02 and so the null hypothesis is rejected at the $\alpha = 0.05$ level.
Further, imagine that by chance a Type I error was made. That is, the null hypothesis
is true. This means that the probability of rejecting again, if the study is replicated
exactly, is 0.05 assuming normality. As previously noted, power is a function of the
magnitude of $\mu - \mu_0$ as well as the standard error of the mean given by (1.2). A
p-value provides no information about either of these unknown quantities.
There are two fundamental ways that Student's t-test can be unsatisfactory.
The first is that as we move toward a heavy-tailed distribution, the power of
Student's t-test can become relatively poor.

Fig. 1.2 The standard normal distribution and the mixed normal distribution (dashed line)
Example Consider testing $H_0$: $\mu = 0$ with $\alpha = 0.05$ when in fact the true value of
the population mean is 0.8 and sampling is from a normal distribution with $\sigma = 1$.
It can be shown that with $n = 20$, power is 0.93 when using Student's t-test given by
(1.9). Now suppose that sampling is from the mixed normal in Fig. 1.2. Power drops
to 0.39. What is needed is a method that performs about as well as Student's t under
normality but continues to perform well, in terms of power, when dealing with
a heavy-tailed distribution. Such methods have been derived and are described in
subsequent chapters.
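Both power values can be checked by simulation. The sketch below is in Python (the book's examples use R) and assumes the mixed normal is $0.9\,N(0,1) + 0.1\,N(0,100)$; the constant 2.093 is the 0.975 quantile of Student's t distribution with 19 degrees of freedom:

```python
import random
import statistics

# Monte Carlo estimate of the power of Student's t-test (n = 20,
# true mean 0.8, alpha = 0.05) under a normal and a mixed normal.
random.seed(2)
N, MU, T_CRIT = 20, 0.8, 2.093

def power(sampler, reps=4000):
    hits = 0
    for _ in range(reps):
        x = [MU + sampler() for _ in range(N)]
        t = statistics.fmean(x) / (statistics.stdev(x) / N ** 0.5)
        if abs(t) >= T_CRIT:
            hits += 1
    return hits / reps

normal = lambda: random.gauss(0, 1)
mixed = lambda: random.gauss(0, 10) if random.random() < 0.1 else random.gauss(0, 1)
p_normal, p_mixed = power(normal), power(mixed)
print(round(p_normal, 2), round(p_mixed, 2))  # near 0.93 and 0.39
```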
The second problem has to do with skewed distributions. Particularly devastating
are situations where a distribution is skewed with a heavy tail. Figure 1.3 shows what
is called a lognormal distribution. Gleason (1993) argues that this distribution is
light-tailed. Cain et al. (2017) report estimates of skewness based on 1,567 datasets.
The skewness of the lognormal distribution is well within the range of the values
reported by Cain et al. Suppose data are randomly sampled from this distribution,
and the goal is to compute a 0.95 confidence interval for the mean. Further assume
that the confidence interval is considered to be reasonably accurate if the actual
probability coverage is between 0.925 and 0.975 as suggested by Bradley (1978).
To achieve this goal, a sample size $n > 130$ is required. In the context of hypothesis
testing, if the null hypothesis is true, $n > 130$ is required to ensure that the
Type I error probability is between 0.025 and 0.075. For a skewed, heavy-tailed
distribution, an even larger sample size is required.
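The poor probability coverage at small sample sizes is easy to see by simulation. The Python sketch below uses $n = 20$, far below the $n > 130$ just cited; the true mean of the standard lognormal is $e^{0.5}$, and 2.093 is the 0.975 quantile of Student's t with 19 degrees of freedom:

```python
import math
import random
import statistics

# Monte Carlo estimate of the actual coverage of the nominal 0.95
# Student's t confidence interval for the mean of a lognormal
# distribution with n = 20.
random.seed(3)
N, TRUE_MEAN, T_CRIT = 20, math.exp(0.5), 2.093

reps, hits = 4000, 0
for _ in range(reps):
    x = [math.exp(random.gauss(0, 1)) for _ in range(N)]
    xbar = statistics.fmean(x)
    half = T_CRIT * statistics.stdev(x) / N ** 0.5
    if xbar - half <= TRUE_MEAN <= xbar + half:
        hits += 1
coverage = hits / reps
print(round(coverage, 3))  # well below the nominal 0.95 and Bradley's 0.925
```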
When sampling from a skewed distribution, the mean of T is not equal to zero.
This might seem incorrect because, assuming random sampling only, the mean of the
numerator of T, $\bar{X} - \mu$, is zero. However, the sample mean and the sample variance
are dependent, which can be shown to explain this result.
1.3 Outliers and the Breakdown Point of an Estimator

This section provides another perspective on why methods based on means can
have low power. The issue is outliers, roughly meaning values that are unusually
small or large compared to the bulk of the data. The likelihood of encountering
outliers increases as we move from light-tailed distributions toward heavy-tailed
distributions. A concern is that even a single outlier can inflate both the sample
mean and especially the sample variance. Put another way, the sample mean can
poorly reflect the typical response.
Example A total of 2182 undergraduate females were asked how many sexual
partners they desired over the next 30 years. The sample mean is 3.47. But 85.5%
of the values are less than 3.47. That is, the sample mean is estimated to correspond
to the 0.855 quantile of the distribution. (Quantiles are percentiles divided by 100.)
The proportion less than 3 is 79%, indicating that any response greater than 3 is
rather atypical. The most common value was 1.
The breakdown point of the sample mean is the minimum proportion of values
that must be altered to make the sample mean arbitrarily large or small. The
breakdown point of any estimator is intended as a measure of how sensitive it is
to outliers. The breakdown point of the sample mean is only $1/n$. That is, a single
outlier can result in a value for the mean that is highly atypical for the bulk of the
participants. The breakdown point of the sample variance is also $1/n$. That is, even
a single outlier has the potential of destroying the power of Student’s t-test. And
in practice, it is common to encounter more than one outlier, which exacerbates
concerns about Student’s t-test.
Example Consider Student’s t-test given by (1.9). Data were generated from a
normal distribution with mean .μ = 0.5, .n = 25, followed by a test of .H0 : .μ = 0.
The sample mean was 0.4679, and the p-value was 0.030. Next, the largest value,
which was 2.256, was increased by 2, and Student's t-test was applied. This process
was repeated 10 times. That is, the largest value was increased to 4.256 and Student's
t-test was applied; then the largest value was taken to be 6.256 and Student's t-test
was applied; and so on. The resulting estimates of the mean and corresponding
p-values are shown in Table 1.1. Of course, the sample mean increases, suggesting
that there is stronger evidence that the null hypothesis is false, yet the p-value
increases as well.
The reason is that the test statistic T decreases due to the increase in the sample
variance. Of course, a practical issue is whether this concern can be avoided and
whether alternative techniques can make a substantial difference when deciding
whether to reject some hypothesis. This answer is an unequivocal yes as will be
seen.
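The pattern in Table 1.1 is easy to reproduce with a small deterministic example (a Python sketch with made-up data, not the data behind the table): inflating the single largest observation raises the sample mean yet shrinks $|T|$, because $s^2$ grows even faster.

```python
import statistics

# Inflating the largest value raises the mean but weakens the evidence
# against H0: mu = 0, since the test statistic's denominator explodes.
def t_stat(v, mu0=0.0):
    return (statistics.fmean(v) - mu0) / (statistics.stdev(v) / len(v) ** 0.5)

x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
t_before = t_stat(x)   # mean 0.5, small variance: large T
x[-1] += 10            # one extreme value: the mean rises to 1.5
t_after = t_stat(x)
print(round(t_before, 2), round(t_after, 2))  # 6.12 1.43
```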
It is noted that there is an analog of the breakdown point when dealing with parameters
(e.g., Staudte & Sheather, 1990), but the technical details go beyond the scope of
this book. Basically, it can be shown that the population mean and variance have a
breakdown point of zero. Roughly, this means that a slight shift in any distribution
can inflate the variance by an arbitrarily large amount. If, for example, it is assumed
that a distribution is normal, a slight departure from a normal distribution can inflate
the variance substantially. Also, the population mean can be highly atypical.
1.4 Homoscedasticity
There is another assumption that pervades many standard methods that should be
discussed: homoscedasticity. This assumption was adopted over two centuries ago
and is routinely assumed today. It greatly simplifies technical issues, but when
groups differ, violating this assumption creates serious practical concerns. This
section outlines these concerns, and subsequent chapters indicate how to deal with
this issue.
First, consider Student’s t-test for two independent groups. A common goal is to
test
$$H_0: \mu_1 = \mu_2, \quad (1.10)$$
the hypothesis that the two groups have the same mean. In the context of Tukey’s
three-decision rule, the issue is whether it is reasonable to make a decision about
which group has the larger population mean.
The classic Student’s t-test assumes normality and homoscedasticity, meaning
that
σ12 = σ22
.
1.4 Homoscedasticity 11
where .s12 and .s22 are the sample variances for the first and second group, respectively,
and .n1 and .n2 are the corresponding sample sizes. The test statistic is
X̄1 − X̄2
T =
.
, (1.12)
sp n 1 + n 2
2 1 1
$$Y = \beta_0 + \beta_1 X. \quad (1.13)$$

$$\hat{Y}_i = b_0 + b_1 X_i \quad (1.14)$$
($i = 1, \ldots, n$). The least squares approach is to determine values for $b_0$ and $b_1$ that
minimize
$$\sum_{i=1}^{n} r_i^2, \quad (1.15)$$

the sum of the squared residuals, where the residuals are given by $r_i = Y_i - \hat{Y}_i$.
The minimizing values can be shown to be

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \quad (1.16)$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}, \quad (1.17)$$

respectively.
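The two formulas translate directly into code. A minimal Python sketch, computing the slope and intercept from Eqs. (1.16) and (1.17):

```python
# Least squares slope and intercept, computed directly from the
# cross-product and sum-of-squares formulas.
def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    return b0, b1

# Points lying exactly on y = 1 + 2x are recovered exactly.
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```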
The least squares estimator just described has a breakdown point of only $1/n$.
That is, a single unusual point can completely mask the nature of the association
among the bulk of the participants. The standard errors of the slope and intercept
can be substantially higher than those of other estimators covered in Chap. 7. The main point
here is that the conventional method for computing confidence intervals assumes
that the variation of Y , given X, does not depend on X.
Consider, for example, a study where the goal is to understand the association
between the cognitive functioning of children, Y , given that they live in a home
where the level of marital aggression is X. Homoscedasticity means that the
variation of Y given that $X = 8$, say, is the same as the variation of Y given that
$X = 12$ or any other value of X that might occur. If X and Y are independent,
this means that there is homoscedasticity. But when there is an association, there
is no reason to assume that the homoscedasticity assumption is true. If there
is heteroscedasticity, meaning that the homoscedasticity assumption is false, the
conventional method for estimating the standard errors of $b_1$ and $b_0$ is incorrect. That
is, there is the risk of an inaccurate confidence interval regardless of how large the
sample size might be. Chapters 7 and 8 describe methods for dealing with this issue.
1.5 Detecting Outliers

A classic rule, assuming normality, declares X an outlier if it is more than two
standard deviations from the mean, that is, if

$$\frac{|X - \bar{X}|}{s} > 2. \quad (1.18)$$
This method works well when there is only one outlier, and it might work well
when there are two or more outliers. But it suffers from masking: the very presence
of outliers can cause them to be missed.
Example Consider the values

2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1000, 10,000.
It is evident that the last two values are outliers, but (1.18) finds only one outlier:
10,000.
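The masking effect is easy to verify. A Python sketch applying the two-standard deviation rule (1.18) to the example data:

```python
import statistics

# Masking: the huge outlier 10,000 inflates the mean and especially the
# standard deviation, hiding the second outlier 1000 from rule (1.18).
x = [2] * 5 + [3] * 5 + [4] * 5 + [1000, 10000]
xbar, s = statistics.fmean(x), statistics.stdev(x)
flagged = [v for v in x if abs(v - xbar) / s > 2]
print(flagged)  # [10000] -- the outlier 1000 is masked
```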
What is needed is a measure of location and variation that are not sensitive to
outliers. Certainly the best-known method is the boxplot rule, which is based in part
on an estimate of the lower and upper quartiles. To review the meaning of the lower
and upper quartiles, imagine that if all participants could be measured, 25% of the
values would be less than or equal to 36 and that 75% would be less than or equal
to 81. Then 36 and 81 are the 0.25 and 0.75 quartiles, respectively. Put another way,
the lower quartile is the 0.25 quantile, and the upper quartile is the 0.75 quantile.
There are many ways of estimating the quantiles of a distribution. A method for
estimating the quartiles that has been found to perform relatively well, given the
goal of detecting outliers, is called the ideal fourths (Frigge et al., 1989). Consider
the random sample $X_1, \ldots, X_n$ and let

$$X_{(1)} \le \cdots \le X_{(n)}$$

denote these values written in ascending order. The lower ideal fourth is

$$q_1 = (1 - h)X_{(j)} + hX_{(j+1)}, \quad (1.19)$$

where j is the integer portion of $(n/4) + (5/12)$, meaning that j is $(n/4) + (5/12)$
rounded down to the nearest integer, and

$$h = \frac{n}{4} + \frac{5}{12} - j.$$

The upper ideal fourth is

$$q_2 = (1 - h)X_{(k)} + hX_{(k-1)}, \quad (1.20)$$

where $k = n - j + 1$. Computing the ideal fourths can be done via the R function

idealf(x),
assuming the R functions described in Sect. 1.9 have been installed. What is
important here is that the breakdown point of the ideal fourths is 0.25.
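The ideal fourths are straightforward to compute by hand. A Python sketch of the same calculation the R function idealf performs (the book's R function is the authoritative version):

```python
import math

# Ideal fourths (Frigge et al., 1989): interpolated estimates of the
# lower and upper quartiles based on the order statistics.
def ideal_fourths(x):
    xs = sorted(x)
    n = len(xs)
    j = math.floor(n / 4 + 5 / 12)            # integer part of n/4 + 5/12
    h = n / 4 + 5 / 12 - j
    q1 = (1 - h) * xs[j - 1] + h * xs[j]      # lower ideal fourth
    k = n - j + 1
    q2 = (1 - h) * xs[k - 1] + h * xs[k - 2]  # upper ideal fourth
    return q1, q2

q1, q2 = ideal_fourths(range(1, 13))
print(round(q1, 3), round(q2, 3))  # 3.417 9.583, symmetric about 6.5
```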
The boxplot rule declares the value X to be an outlier if

$$X < q_1 - 1.5(q_2 - q_1) \quad (1.21)$$

or

$$X > q_2 + 1.5(q_2 - q_1). \quad (1.22)$$

The R function

outbox(x),

written for this book, checks for outliers using the ideal fourths. The built-in R
function

boxplot(x)

creates a boxplot.
Various modifications of the boxplot rule have been proposed. Carling (2000)
noted that the proportion of points declared outliers via the boxplot rule is a function
of the sample size. He suggested a modification that deals with this issue. His
method is based on an estimate of the median, M. Because there are many ways
of estimating the population median (the 0.5 quantile), it is useful to review how the
usual sample median, used by Carling, is computed.
If the number of observations, n, is odd,

$$M = X_{(m)},$$

where $m = (n + 1)/2$. That is, the sample median is the mth value after the
observations are put in ascending order. If the number of observations, n, is even,
now $m = n/2$ and

$$M = \frac{X_{(m)} + X_{(m+1)}}{2},$$

the average of the mth and $(m + 1)$th observations after putting the observed values
in ascending order. This is the estimator used by the built-in R function

median(x).
Note that M has a breakdown point equal to 0.5, the highest possible value.
Carling (2000) suggests declaring X an outlier if

$$X < M - k(q_2 - q_1) \quad \text{or} \quad X > M + k(q_2 - q_1), \quad (1.23)$$

where

$$k = \frac{17.63n - 23.64}{7.74n - 3.71}. \quad (1.24)$$

Carling's modification can be used via the R function outbox by setting the
argument mbox=TRUE.
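Carling's rule combines the median, the interquartile range, and the sample-size-dependent constant k of (1.24). A Python sketch (the helper ideal_fourths below is a hypothetical reimplementation; the book's R call outbox(x, mbox=TRUE) does all of this directly):

```python
import math
import statistics

def ideal_fourths(x):
    # Interpolated quartile estimates (Frigge et al., 1989).
    xs, n = sorted(x), len(x)
    j = math.floor(n / 4 + 5 / 12)
    h = n / 4 + 5 / 12 - j
    q1 = (1 - h) * xs[j - 1] + h * xs[j]
    k = n - j + 1
    return q1, (1 - h) * xs[k - 1] + h * xs[k - 2]

def carling_outliers(x):
    # Carling (2000): flag values outside M +/- k * (q2 - q1).
    n = len(x)
    m = statistics.median(x)
    q1, q2 = ideal_fourths(x)
    k = (17.63 * n - 23.64) / (7.74 * n - 3.71)
    return [v for v in x if v < m - k * (q2 - q1) or v > m + k * (q2 - q1)]

data = [2] * 5 + [3] * 5 + [4] * 5 + [1000, 10000]
print(carling_outliers(data))  # [1000, 10000]
```

Because the rule is centered at the median rather than midway between the quartiles, it is less sensitive to skewness than the classic boxplot rule.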
Another method for detecting outliers is the MAD-median rule, which is based
in part on a measure of dispersion called the median absolute deviation (MAD)
statistic: the median of

$$|X_1 - M|, \ldots, |X_n - M|.$$

This measure of dispersion has a breakdown point equal to 0.5. The MAD-median
rule declares X an outlier if

$$\frac{|X - M|}{\mathrm{MADN}} > 2.24, \quad (1.25)$$

where MADN is MAD/0.6745. When dealing with a normal distribution, MADN
estimates the standard deviation, so this method is an analog of the two-standard
deviation rule given by (1.18). The value 2.24 in (1.25) stems from Rousseeuw and
van Zomeren (1990). The R function

outpro(x)
applies the MAD-median rule. Because the MAD-median rule has a breakdown
point of 0.5, it is arguably better than the boxplot rule. There are situations where
the boxplot rule appears to be preferable (e.g., Wilcox, 2022a), but the details go
beyond the scope of this book.
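Applied to the earlier masking example, the MAD-median rule flags both extreme values, unlike the two-standard deviation rule. A Python sketch of rule (1.25) (the book's R function outpro applies it directly):

```python
import statistics

# MAD-median rule: both the median M and MAD have breakdown point 0.5,
# so extreme values cannot mask one another.
def mad_median_outliers(x, crit=2.24):
    m = statistics.median(x)
    mad = statistics.median([abs(v - m) for v in x])
    madn = mad / 0.6745  # MADN estimates sigma under normality
    return [v for v in x if abs(v - m) / madn > crit]

data = [2] * 5 + [3] * 5 + [4] * 5 + [1000, 10000]
print(mad_median_outliers(data))  # [1000, 10000] -- no masking
```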
Some books recommend using a histogram to check for outliers, but this
approach can be highly unsatisfactory. What is needed is a method that is specif-
ically designed to detect outliers. One problem with the histogram is that when
sampling data from a heavy-tailed distribution, it can poorly reflect the nature of the
tails of the distribution.
Example One hundred values were generated from the mixed normal distribution
in Fig. 1.2. Figure 1.5 shows the resulting histogram using the built-in R function
hist. As previously noted, for a standard normal distribution, any value greater
than 2 or less than $-2$ would be declared an outlier using the two-standard deviation
rule. This rule corresponds to declaring any value an outlier if it is less than the
0.02275 quantile or greater than the 0.97725 quantile. From this perspective, any values
less than $-2.4$ or greater than 2.4 would be viewed as unusual for the mixed normal.
The right tail of the histogram suggests that values greater than 15 are outliers, which
is correct based on how the data were generated. But in fact, any value greater than
2.4 is unusual, contrary to what is indicated by the histogram. The left tail suggests
that there are no outliers, but in fact any value less than $-2.4$ is highly unusual.
The difficulty is that the default method for creating a histogram provides a poor
estimate of the distribution that generated the data.
The data used in Fig. 1.5 also provide another example of masking. Using the
two-standard deviation rule given by (1.18), observed values less than $-7.82$ and
greater than 8.26 were declared outliers. A total of seven outliers were found. Using
the MAD-median rule, values less than $-2.5$ or greater than 2.5 were declared
outliers. Now 18 values are flagged as outliers.
1.6 Strategies for Dealing with Outliers and Violations of Assumptions That Can Be Highly Unsatisfactory

There is an extensive literature aimed at dealing with the concerns outlined in this
chapter. But there are some methods that are technically unsound, regardless of how
large the sample size might be, and other seemingly natural strategies should be
used with caution or not at all.
1.6.1 Dealing with Outliers

It might seem that dealing with outliers is trivial: simply remove outliers and
proceed using some method based on the means. However, this can result in highly
inaccurate estimates of the standard error. The reason is that the derivation of
the standard error of the mean, given by (1.2), assumes that $X_1, \ldots, X_n$ are
uncorrelated. That is, the derivation requires that for any i and j, $i \ne j$, $\rho_{ij} = 0$,
where $\rho_{ij}$ is Pearson's correlation between $X_i$ and $X_j$.
As previously indicated, X(1) ≤ X(2) ≤ · · · ≤ X(n), called the order statistics,
denote the values X1, ..., Xn written in ascending order. Suppose the probability
that X(1) > 3 is 0.1, but that X(2) = 3. Given that X(2) = 3, it is impossible to
have X(1) > 3.
If they were independent, knowing the value of X(2) would not alter the probability
that X(1) > 3. In fact, the correlation between X(1) and X(2) is greater than zero.
For example, when n = 20 and sampling is from a standard normal distribution,
Pearson's correlation between X(1) and X(2) is approximately 0.6.
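A small simulation gives a sense of this dependence. A Python sketch (the book's examples use R; the replication count here is arbitrary, and `corr` is a hypothetical helper):

```python
import random

def corr(a, b):
    """Pearson's correlation, computed directly from its definition."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = sum((v - ma) ** 2 for v in a) ** 0.5
    sb = sum((v - mb) ** 2 for v in b) ** 0.5
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (sa * sb)

random.seed(1)
smallest, second = [], []
for _ in range(20000):
    x = sorted(random.gauss(0, 1) for _ in range(20))
    smallest.append(x[0])  # X(1), the smallest order statistic
    second.append(x[1])    # X(2), the second smallest

print(round(corr(smallest, second), 2))  # close to the 0.6 reported in the text
```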
Here is the problem. Suppose outliers are removed. That is, some of the lowest
values are removed, and possibly some of the largest values are removed as well.
The result is that the remaining values are correlated. In particular, determining
the variance of the sample mean based on the remaining data requires taking into
account the correlations among the remaining data. Suppose that m values are
left after removing outliers, and let s²m denote the sample variance based on the
remaining data. The point here is that if the squared standard error of the sample
mean is estimated with s²m/m, this estimate can differ substantially from an estimate
that takes the correlations among the remaining data into account.
An early attempt at dealing with non-normality, one that remains popular today, is to
replace the data with the logs of the data. Another possibility is to use the square root
of the data. More involved transformations have been proposed (e.g., Box & Cox,
1964). These transformations can yield more symmetric distributions, but in some
situations, the distribution remains substantially skewed, especially when dealing
with a skewed, heavy-tailed distribution, meaning that outliers tend to be common.
For a discussion of what are called inverse normal transformations, see Beasley et al.
(2009). Grayson (2004) argues that a transformation can transform the construct
being measured. For instance, if the goal is to make inferences about the mean, a
transformation can alter this goal.
Perhaps more importantly, transformations can be ineffective at dealing with
outliers. Outliers can remain, and outliers can appear after taking logs that were
not flagged as outliers before transforming the data. Moreover, after taking logs,
situations are encountered where the standard deviation is increased. Illustrations of
these issues are relegated to the exercises.
There are various statistical methods based on ranks that deal with outliers. That
is, the smallest value gets a rank of 1, the next smallest gets a rank of 2, and so on.
Some of these methods can be very effective at testing the hypothesis that groups
have identical distributions. But many of these methods can be unsatisfactory for
comparing measures of location unless rather restrictive assumptions are made.
A seemingly natural strategy for dealing with assumptions is to test the hypothesis
that the assumption is true. For example, when comparing the means of independent
groups, one might test the assumption that groups have a common variance. If
the test fails to reject, use a method that assumes homoscedasticity. However,
published papers do not support this approach (e.g., Hayes & Cai, 2007; Markowski
& Markowski, 1990; Moser et al., 1989; Wilcox et al., 1986; Zimmerman, 2004).
The basic problem is that the methods used to test for equal variances did not have
enough power to detect situations where there is a violation of the assumption
that is a practical concern. The main message here is that in terms of violating
the normality and homoscedasticity assumptions, there are methods that perform
nearly as well as classic methods when these assumptions are true. Moreover, more
modern methods, to be described, continue to perform well in situations where
classic methods perform poorly.
A common strategy is to standardize the data. That is, convert the data X1, ..., Xn to

Zi = (Xi − X̄)/s,

i = 1, ..., n. But an important point is that this does not make the data more
nearly normal: standardizing alters the location and scale of the data, but not the
shape of their distribution.
[Figure: data standardized to have a mean equal to 0 and a variance equal to 1]
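Because the Z-scores are an affine transformation of the data, shape statistics such as skewness are left exactly unchanged. A Python sketch with invented data (the book's illustrations use R; `skewness` is a hypothetical helper):

```python
import random

random.seed(2)
x = [random.expovariate(1.0) for _ in range(200)]  # clearly skewed data

def skewness(v):
    """Standard moment-based skewness coefficient."""
    n = len(v)
    m = sum(v) / n
    s = (sum((u - m) ** 2 for u in v) / n) ** 0.5
    return sum(((u - m) / s) ** 3 for u in v) / n

m = sum(x) / len(x)
s = (sum((u - m) ** 2 for u in x) / (len(x) - 1)) ** 0.5
z = [(u - m) / s for u in x]  # standardized scores: mean 0, variance 1

print(abs(skewness(x) - skewness(z)) < 1e-8)  # True: the shape is unchanged
```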
In terms of the least squares regression estimate of the slope of a regression line, b1,
discussed in Sect. 1.4,

r = b1 (sx/sy). (1.27)
It can be shown that the value of r², called the coefficient of determination, is the
variance of the predicted Y values given by (1.14), divided by the variance of the
observed Y values:

r² = VAR(Ŷ)/VAR(Y). (1.28)
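Both identities are easy to verify numerically. A Python sketch using least squares on invented data (the variable names are ad hoc):

```python
import random

random.seed(3)
x = [random.gauss(0, 1) for _ in range(100)]
y = [0.5 * u + random.gauss(0, 1) for u in x]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
sxx = sum((u - mx) ** 2 for u in x)
syy = sum((v - my) ** 2 for v in y)

b1 = sxy / sxx                 # least squares slope
r = sxy / (sxx * syy) ** 0.5   # Pearson's correlation
sx, sy = (sxx / (n - 1)) ** 0.5, (syy / (n - 1)) ** 0.5

yhat = [my + b1 * (u - mx) for u in x]          # predicted Y values
var_yhat = sum((v - my) ** 2 for v in yhat) / (n - 1)
var_y = syy / (n - 1)

print(abs(r - b1 * sx / sy) < 1e-12)            # Eq. (1.27): True
print(abs(r ** 2 - var_yhat / var_y) < 1e-12)   # Eq. (1.28): True
```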
[Fig. 1.7 The bivariate distribution of X and Y when both are normal; the two panels
(Correlation = .8 and Correlation = .2) illustrate the impact of increasing ρ from 0.2 to 0.8]
Traditionally, when dealing with statistical issues, the term robust refers to a method
that performs reasonably well in terms of controlling the Type I error probability. In
the statistics literature, such methods are said to be level robust. But over the last 60
years, it has taken on a much broader meaning. Roughly, it refers to methods that
are not overly sensitive to small changes in a distribution or small changes in the
data. For example, if an arbitrarily small change in a distribution can alter the value
of a parameter in an arbitrarily large manner, that parameter is not robust. This chapter has given
some indication that the population mean and variance are not robust based on this
criterion. There is a formal proof that indeed the population mean and variance are
not robust. This result is just part of a well-developed mathematical foundation for
developing and describing robust methods (Hampel et al., 1986; Huber & Ronchetti,
2009; Staudte & Sheather, 1990). A formal proof that Pearson’s correlation .ρ is not
robust was derived by Devlin et al. (1981). These fundamental advances have led to a
wide range of improved techniques for comparing groups and studying associations.
The goal in this book is to provide a relatively nontechnical description of these
advances and to illustrate their practical utility.
It is assumed that the reader is familiar with the basics of the software R. If not, R
can be downloaded from www.R-project.org. A free and very useful interface for R
is R Studio (RStudio Team, 2020) available at www.rstudio.com. Many books are
available that are focused on the basics of R (e.g., Crawley, 2007; Venables & Smith,
2002; Verzani, 2004; Zuur et al., 2009). The book by Verzani (2004) is available on
the web at
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf.
There is an extensive collection of R functions that are used in this book. One way
of gaining access to these functions is to download the file Rallfun, which is stored
at https://osf.io/xhe8u/. The current version is Rallfun-v41. Once downloaded, use
the R command
source(file.choose())
1.10 Exercises
Some of these exercises assume that the file Rallfun has been sourced as described
in Sect. 1.9.
1. A boxplot indicates that the largest two values are outliers. Eliminating these
two outliers and applying Student's t-test is an invalid way of computing a
confidence interval. Why?
2. The file cort_dat.txt contains measures of cortisol levels taken upon awak-
ening and 30–45 minutes later. One way of reading the data into R is with
the command m=read.table(file.choose(),skip=1). The argument
skip=1 ignores the first line, which contains a description of the data.
The R object m is a data frame where m[,2] contains the data in the second
column of the file cort_dat.txt and m[,3] contains the data in the third column.
Create a boxplot for both measures. Next, create boxplots based on logs of the
data. What does this illustrate? Also, compare the standard deviations before
and after transforming the data.
3. Using R, execute the following commands:
set.seed(45)
x=ghdist(50,g=1,h=.2)+3
akerd(x)
This creates what is called an adaptive kernel density estimate of the distribu-
tion. Next, use the command
akerd(log(x))
4. Using R, execute the following commands:
set.seed(45)
x=rnorm(50)
y=rnorm(50)
cor(x,y)
xx=c(x,5)
yy=c(y,5)
cor(xx,yy)
As explained in Chap. 1, outliers are a serious concern when using the sample
mean. This chapter describes two basic strategies for dealing with outliers and
indicates how to compute an estimate of the standard error that is technically sound.
This is followed by a description of inferential methods based on these measures
of location, including a discussion of their relative merits. Some methods aimed
at measuring effect size are described, some of which provide a foundation for
understanding certain measures of effect size covered in Chap. 3. Included are
some recent results related to the probability of success when dealing with binary
data that will help complement some of the methods described in subsequent
chapters. This chapter also discusses methods aimed at estimating quantiles. As
will be illustrated in Chap. 3, situations are encountered where important differences
between two groups occur in the tails of the distributions rather than the center of the
distributions. The quantile estimators described here will play a role in addressing
this issue. Some issues related to skewed distributions are discussed as well.
The immediate goal is to describe estimators aimed at dealing with outliers. The
first general approach is to simply trim a fixed proportion of the lowest and highest
values. The usual sample median described in Chap. 1 is the best-known example of
this approach. The other is to empirically determine which values, if any, should be
eliminated.
Although the median can have a lower standard error than the mean, the reality is
that a less extreme amount of trimming is often beneficial for various reasons, some
of which will be made evident when attention is turned to comparing two or more
groups.
A 20% trimmed mean has been studied extensively (e.g., Wilcox, 2022a).
Basically, the lowest 20% and the highest 20% are trimmed, and the average of
the remaining data constitute the 20% trimmed mean, which will be labeled X̄t.
More precisely, let g denote the value of 0.2n rounded down to the nearest integer.
Then
X̄t = (X(g+1) + X(g+2) + · · · + X(n−g)) / (n − 2g). (2.1)
A 10% trimmed mean is obtained by taking g to be 0.1n rounded down to the nearest
integer.
The 20% trimmed mean has a breakdown point equal to 0.2. That is, at least 20%
of the data need to be altered to make it arbitrarily large. Of course, this estimator is
not always optimal in terms of achieving the smallest standard error. As explained in
Chap. 1, there is no single estimator that always has the smallest standard error, an
issue that will be addressed in later chapters. The point is that often a 20% trimmed
mean is a good compromise between the two extremes of no trimming (the mean)
and the maximum amount of trimming (the median).
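A direct translation of (2.1) into Python (a sketch only; the book's own implementation is the R function tmean described in Sect. 2.2, and `trimmed_mean` is an invented name):

```python
import math

def trimmed_mean(x, tr=0.2):
    """Drop the g smallest and g largest values, where g is tr*n
    rounded down to the nearest integer, and average the rest."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(tr * n)
    return sum(xs[g:n - g]) / (n - 2 * g)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(trimmed_mean(x))  # g = 2; the average of 3,...,8 is 5.5
print(trimmed_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]))  # still 5.5: the outlier is trimmed
```

Replacing 10 with 1000 leaves the estimate unchanged, illustrating the breakdown point of 0.2.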
2.1.2 M-Estimators
A measure of location can be characterized as the value X̃ that minimizes

Σ (Xi − X̃)², (2.2)

the sum of the squared distances from each observed value. This approach is a
special case of the least squares regression estimator mentioned in Chap. 1. The
solution is to take X̃ = X̄, the sample mean. From this perspective, the sample
mean can be unsatisfactory because it gives too much weight to extreme values.
M-estimators deal with this by replacing the squared error used by (2.2) with a
function that gives less weight to extreme values. There are several variations of
this approach (e.g., Wilcox, 2022a), but for simplicity, the focus here is on the one-
step M-estimator, which stems from results summarized in Huber and Ronchetti
(2009).
Let i1 be the number of observations Xi for which (Xi − M)/MADN < −1.28,
and let i2 be the number of observations such that (Xi − M)/MADN > 1.28. The
one-step M-estimator is
X̄os = (1.28(MADN)(i2 − i1) + X(i1+1) + · · · + X(n−i2)) / (n − i1 − i2). (2.3)
As can be seen, it eliminates unusually small and large values, it computes the mean
of the remaining data, and it makes an adjustment when the number of unusually
low values differs from the number of unusually high values. The breakdown point
of this estimator is 0.5, the highest possible value. Moreover, this estimator has
excellent theoretical properties summarized in Hampel et al. (1986) as well as Huber
and Ronchetti (2009).
Based on the breakdown point, the choice between the 20% trimmed mean and
the one-step M-estimator would seem to be clear: use the one-step M-estimator.
Also, the one-step M-estimator includes the possibility of not eliminating any
values. But it turns out that the choice between these two estimators is not simple.
Part of the problem is that different methods are sensitive to different features of the
data. One consequence is that the hypothesis testing method with the most power
depends on the nature of unknown distributions. Also, different methods can be
required in order to get a deep and nuanced understanding of data as argued by
Steegen et al. (2016). This will be illustrated at various points. One limitation of the
M-estimator is that situations are encountered where MADN = 0, in which case the
M-estimator cannot be computed. This occurs, for example, for the sexual attitude
data mentioned in Sect. 1.3.
In terms of achieving a relatively small standard error, the 20% trimmed mean,
MOM, and the one-step M-estimator compete well with the sample mean when
sampling from a normal distribution. But they can have a substantially smaller
standard error when sampling from a heavy-tailed distribution as will be seen.
As previously noted, situations are encountered where information about the tails of
a distribution can be informative as will be demonstrated in Chap. 3. Dealing with
this issue requires methods for estimating quantiles. Consider, for example, a study
where the random variable of interest, X, is a measure of depressive symptoms. For
the population of all adults, there is some value for X, say q, such that P(X ≤ q) =
0.8. That is, q is the 0.8 quantile, meaning that 80% of all adults have a value less
than or equal to q. The goal is to find some way of estimating q. For the special
case where P(X ≤ q) = 0.5, q corresponds to the population median, which is
estimated by the sample median, M.
There are two basic approaches to estimating q. The first is to use a weighted
average of just two of the order statistics. That is, only two values are used to
estimate a quantile; the data determine which two values are used. This
is the strategy used by the sample median, M, when the sample size, n, is even. M
is based on the average of the two middle-order statistics. Hyndman and Fan (1996)
compared eight such estimators. Their recommended estimator can be computed
with the R function quant described in Sect. 2.2.
The other general approach is to use a weighted average of all the order statistics.
That is, choose weights w1, ..., wn such that

q̂ = Σ wi X(i) (2.4)
estimates q. The best-known version of this approach was derived by Harrell and
Davis (1982). All of the weights are greater than zero. That is, wi > 0 for every i,
i = 1, ..., n. When estimating the median, for example, more weight is given to
the values near the center of the order statistics. The extreme values are given a very
small weight. Liu et al. (2022) derived an alternative to the Harrell–Davis estimator
that offers an advantage in some situations, but software for applying their somewhat
complex method is not yet available.
Note that because all of the weights used by the Harrell–Davis estimator are greater
than zero, the breakdown point is only 1/n. The same is true for the broad collection
of related estimators summarized by Liu et al. (2022). Akinshin (2022) derived a
modification of the Harrell–Davis estimator that deals with this possible concern.
This estimator sets some of the weights equal to zero, depending on which quantile
is being estimated. The remaining weights are adjusted to get an estimate of q. In
essence, a trimmed version of the Harrell–Davis estimator is being used, but the data
dictate which values are trimmed.
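A Python sketch of the Harrell–Davis estimator. The weights are the probabilities a Beta((n+1)q, (n+1)(1−q)) random variable assigns to the intervals ((i−1)/n, i/n); the Beta cdf is computed here by simple numerical integration to keep the sketch dependency-free (the book's R functions hd and thd are the authoritative implementations, and `beta_cdf` and `harrell_davis` are invented names):

```python
import math

def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function via the trapezoid rule
    (adequate here, where a, b > 1 so the integrand is bounded)."""
    if x <= 0:
        return 0.0
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    h = x / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        if 0 < t < 1:
            f = math.exp((a - 1) * math.log(t) + (b - 1) * math.log(1 - t) - logB)
        else:
            f = 0.0
        total += f if 0 < k < steps else f / 2
    return total * h

def harrell_davis(x, q=0.5):
    """Harrell-Davis estimate of the qth quantile: a weighted sum
    of all the order statistics."""
    xs = sorted(x)
    n = len(xs)
    a, b = (n + 1) * q, (n + 1) * (1 - q)
    est = 0.0
    for i in range(1, n + 1):
        w = beta_cdf(i / n, a, b) - beta_cdf((i - 1) / n, a, b)
        est += w * xs[i - 1]
    return est

print(round(harrell_davis([1, 2, 3, 4, 5, 6, 7, 8, 9], q=0.5), 3))  # 5.0 by symmetry
```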
To provide some sense of how various estimators compare, a sample of n = 25
values was generated from a standard normal distribution, and six of the methods
just described were used to estimate the population mean. This was repeated 10,000
times. Figure 2.1 shows boxplots of the results. Theory tells us that the sample
mean has the smallest standard error. But note that the improvement of the mean
over the 20% trimmed mean and M-estimator is very small. The median is the least
satisfactory. The standard deviations of these estimates provide an estimate of the
standard error of the estimator. The estimates corresponding to the boxplots 1-6 are
0.199, 0.247, 0.212, 0.206, 0.225, and 0.234, respectively. Notice that the standard
errors of the 20% trimmed mean and M-estimator are nearly equal to the standard
error of the mean.
Figure 2.2 shows boxplots of the estimates when sampling from the mixed
normal distribution described in Chap. 1. As is evident, the sample mean performs
poorly.
[Fig. 2.1 Boxplots based on 10,000 estimates when sampling is from a standard
normal distribution: 1 = mean, 2 = median, 3 = 20% trimmed mean, 4 =
M-estimator, 5 = Harrell–Davis estimator, 6 = trimmed Harrell–Davis estimator]
[Fig. 2.2 Boxplots based on 10,000 estimates when sampling is from the mixed
normal distribution, with the estimators numbered as in Fig. 2.1]
The corresponding estimates of the standard errors are now 0.661, 0.271,
0.241, 0.243, 0.249, and 0.257. The 20% trimmed mean and M-estimator performed
the best, with little separating these two estimators.
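The pattern in Figs. 2.1 and 2.2 can be reproduced with a small simulation. A Python sketch, assuming the mixed normal is the contaminated normal of Chap. 1, generated here as N(0, 1) with probability 0.9 and N(0, 10²) otherwise (the exact numbers will differ from the book's, and the helper names are invented):

```python
import math, random

def trimmed_mean(x, tr=0.2):
    xs = sorted(x); n = len(xs); g = math.floor(tr * n)
    return sum(xs[g:n - g]) / (n - 2 * g)

def sd(v):
    m = sum(v) / len(v)
    return (sum((u - m) ** 2 for u in v) / (len(v) - 1)) ** 0.5

def mixed_normal():
    # Contaminated normal: N(0, 1) with probability 0.9, else N(0, 10^2).
    return random.gauss(0, 1) if random.random() < 0.9 else random.gauss(0, 10)

random.seed(4)
results = {}
for label, gen in [("normal", lambda: random.gauss(0, 1)), ("mixed", mixed_normal)]:
    means, tmeans = [], []
    for _ in range(5000):
        x = [gen() for _ in range(25)]
        means.append(sum(x) / 25)
        tmeans.append(trimmed_mean(x))
    results[label] = (sd(means), sd(tmeans))  # estimated standard errors

print(results)  # under the mixed normal, the mean's standard error is far larger
```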
Next, consider the three quantile estimators previously described and the goal of
estimating the 0.8 quantile when sampling from a standard normal distribution. The
standard errors of the Harrell–Davis estimator, the trimmed Harrell–Davis estimator,
and the estimator recommended by Hyndman and Fan (1996) are estimated to
be 0.255, 0.264, and 0.275, respectively. In this case, the Harrell–Davis estimator
is best. The corresponding medians of the 10,000 estimates were 0.857, 0.887,
and 0.837. The actual value is 0.8416. From this perspective, the Hyndman–Fan
recommended estimator has a bit of an advantage.
But now consider the mixed normal. The standard errors of the Harrell–Davis
estimator, the trimmed Harrell-Davis estimator, and the estimator recommended by
Hyndman and Fan are 0.519, 0.482, and 0.372, respectively. So in this case, an
estimator based on only two order statistics is best. The 0.8 quantile is approximately
equal to 0.95. The medians of the 10,000 estimates are 1.024, 1.028, and
0.951, again indicating that the Hyndman–Fan recommended estimator is best in
this situation. But when making inferences about quantiles and when dealing with
data that are fairly discrete, the Harrell–Davis estimator and the trimmed Harrell–
Davis estimator can perform much better than the Hyndman–Fan estimator as will
be seen in Chap. 3. Overall, no single estimator dominates. A crude rule is that when
dealing with a relatively light-tailed distribution, use the Harrell–Davis estimator.
When dealing with a heavy-tailed distribution that is reasonably continuous, use the
recommended Hyndman–Fan estimator.
This section describes R functions for applying the location estimators described in
the previous section.
The built-in R function mean(x,trim=0) computes the mean by default, but it
can be used to compute a trimmed mean via the argument trim. For example, setting
trim=0.2 results in using the 20% trimmed mean. For convenience, the function
tmean(x,tr=0.2)
defaults to 20% trimming. The function
mom(x)
computes the MOM estimator, and the function
onestep(x)
computes the one-step M-estimator. The function
hd(x, q=0.5)
computes the Harrell–Davis estimate of the quantile indicated by the argument q,
the function
thd(x, q=0.5)
computes the trimmed Harrell–Davis estimate, and the function
quant(x, q=0.5)
computes the estimator recommended by Hyndman and Fan (1996). The argument
q indicates which quantile is to be used and defaults to 0.5, the population median.
The R function
qno.est(x, q=0.5)
There are numerous robust measures of dispersion (Wilcox, 2022a). One that plays
a prominent role when dealing with a trimmed mean is the Winsorized variance,
which is described in the next section. And there are the interquartile range and the
median absolute deviation (MAD) measure described in Sect. 1.5. Based on results
in Lax (1985) and Randal (2008), two others are worth mentioning. Both have a
connection to M-estimators. The first is called the biweight midvariance, which
appears to have a breakdown point equal to 0.5. However, there are some theoretical
concerns about this measure of dispersion that are summarized in Wilcox (2022a).
The other is the percentage bend midvariance estimator. The default version used
here has a breakdown point equal to 0.2. Roughly, it is based on determining whether
a value is unusually large or small using a method that has a certain similarity to the
MAD-median rule for detecting outliers. Complete computational details can be
found in Wilcox (2022a, Section 3.12.3). Under normality, it estimates a measure
of dispersion that is nearly equal to the population variance. It plays a role when
measuring the strength of a linear model, as will be seen in Chap. 9.
The R function
pbvar(x)
computes the percentage bend midvariance.
This section describes methods for making inferences based on the estimators
described in Sect. 2.1. Included are two basic bootstrap methods. Bootstrap methods
have been studied extensively and found to have considerable practical value when
using estimators that have a reasonably high breakdown point (e.g., Wilcox, 2022a).
However, when dealing with the mean, there are situations where serious practical
concerns remain as will be illustrated.
First focus on the 20% trimmed mean. Three basic approaches are discussed. When
working with any estimator, certainly the best-known approach is to use what is
called a pivotal test statistic. These have the general form
Z = (Est − PE)/SE, (2.5)
where Est is some estimator, PE is the parameter being estimated by Est, and SE is
an estimate of the standard error of Est. The T statistic given by Eq. (1.7) in Chap. 1
is an example where Est is the sample mean, PE is the population mean, and SE is
s/√n, an estimate of the standard error of the sample mean. When dealing with a
trimmed mean, there are two issues: determining a technically sound estimate of the
standard error and finding a satisfactory approximation of the distribution of Z.
A technically sound estimate of the standard error of a trimmed mean was first
derived by Tukey and McLaughlin (1963). The method begins by Winsorizing the
data. As noted in Sect. 2.1.1, a trimmed mean removes the g smallest values and the
g largest values. Winsorizing means that, rather than trim the g smallest values, set
the g smallest values equal to the smallest value not trimmed. In a similar manner,
set the g largest values equal to the largest value not trimmed. For example, the 20%
Winsorized values corresponding to
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

are

3, 3, 3, 4, 5, 6, 7, 8, 8, 8.
Letting W1, ..., Wn denote the Winsorized values, the Winsorized sample mean is

W̄ = (1/n) Σ Wi, (2.6)

and the Winsorized sample variance is

s²w = (1/(n − 1)) Σ (Wi − W̄)². (2.7)
Letting G denote the amount of trimming, an estimate of the standard error of the
trimmed mean is

sw / ((1 − 2G)√n). (2.8)

For example, with 20% trimming, G = 0.2 and the standard error of a 20% trimmed
mean is estimated with sw / (0.6√n).
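The Winsorizing step and the standard error estimate (2.8) can be sketched in Python as follows; the example reproduces the Winsorized values shown above (the helper names are invented, and the book's R functions are the authoritative implementations):

```python
import math

def winsorize(x, tr=0.2):
    """Replace the g smallest values by the smallest value kept and the
    g largest by the largest value kept, g = tr*n rounded down."""
    xs = sorted(x)
    n = len(xs)
    g = math.floor(tr * n)
    return [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g

def trimmed_mean_se(x, tr=0.2):
    """Tukey-McLaughlin estimate of the standard error of the trimmed
    mean: sw / ((1 - 2G) sqrt(n)), sw being the Winsorized standard deviation."""
    w = winsorize(x, tr)
    n = len(w)
    wbar = sum(w) / n
    sw = (sum((v - wbar) ** 2 for v in w) / (n - 1)) ** 0.5
    return sw / ((1 - 2 * tr) * math.sqrt(n))

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(winsorize(x))                  # [3, 3, 3, 4, 5, 6, 7, 8, 8, 8]
print(round(trimmed_mean_se(x), 4))  # 1.1453
```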
Let h = n − 2g denote the number of observations left after trimming, and let μt
denote the population trimmed mean. Tukey and McLaughlin (1963) approximate
the distribution of

Tt = (1 − 2G)√n (X̄t − μt)/sw (2.9)

with a Student's t distribution having h − 1 degrees of freedom. The resulting
1 − α confidence interval for μt is

X̄t ± t sw/((1 − 2G)√n), (2.10)

where t is the 1 − α/2 quantile of a Student's t distribution with h − 1 degrees of
freedom. The hypothesis

H0: μt = μ0 (2.11)

is tested with

Tt = (1 − 2G)√n (X̄t − μ0)/sw. (2.12)

With trimming, the Student's t approximation of the distribution of Tt tends to be
more accurate than the approximation associated with no trimming (e.g., Wilcox,
1994). But with a small sample size,
there are situations where there is room for improvement. There are two bootstrap
methods for dealing with this issue. The first is called a bootstrap-t method.
Generate a bootstrap sample X*1, ..., X*n by randomly sampling with replacement
n values from X1, ..., Xn. Based on this bootstrap sample, compute

T* = (1 − 2G)(X̄*t − X̄t) / (s*w/√n), (2.13)
where X̄*t and s*w are the trimmed mean and Winsorized standard deviation based
on the bootstrap sample. Note that in the bootstrap world, the population trimmed
mean is known: it is X̄t, the trimmed mean based on the observed data.
The process just described is repeated B times yielding T*1, ..., T*B, which
provides an estimate of the distribution of Tt. If the goal is to compute a 1 − α
confidence interval, let ℓ = αB/2, rounded to the nearest integer, and let u = B − ℓ.
Let T*(1) ≤ · · · ≤ T*(B) denote the T*1, ..., T*B values written in ascending order.
Then T*(ℓ+1) and T*(u) provide estimates of the α/2 and 1 − α/2 quantiles of the
distribution of Tt, and the resulting 1 − α confidence interval for μt is

(X̄t − T*(u) sw/((1 − 2G)√n), X̄t − T*(ℓ+1) sw/((1 − 2G)√n)).
Note that the lower end of the confidence interval is based on the estimate of the
1 − α/2 quantile. A little algebra demonstrates that this is correct.
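The equal-tailed bootstrap-t interval for a 20% trimmed mean can be sketched in Python as follows (the book's R functions automate this; the sample data and helper names are invented):

```python
import math, random

def trimmed_mean(x, tr=0.2):
    xs = sorted(x); n = len(xs); g = math.floor(tr * n)
    return sum(xs[g:n - g]) / (n - 2 * g)

def win_sd(x, tr=0.2):
    xs = sorted(x); n = len(xs); g = math.floor(tr * n)
    w = [xs[g]] * g + xs[g:n - g] + [xs[n - g - 1]] * g
    wbar = sum(w) / n
    return (sum((v - wbar) ** 2 for v in w) / (n - 1)) ** 0.5

def boot_t_ci(x, tr=0.2, alpha=0.05, B=1000, seed=5):
    """Equal-tailed bootstrap-t confidence interval for the trimmed mean."""
    random.seed(seed)
    n = len(x)
    tm, sw = trimmed_mean(x, tr), win_sd(x, tr)
    se = sw / ((1 - 2 * tr) * math.sqrt(n))
    tstars = []
    for _ in range(B):
        xb = [random.choice(x) for _ in range(n)]
        seb = win_sd(xb, tr) / ((1 - 2 * tr) * math.sqrt(n))
        tstars.append((trimmed_mean(xb, tr) - tm) / seb)
    tstars.sort()
    ell = round(alpha * B / 2)
    u = B - ell
    # Lower end uses the upper quantile of T*, and vice versa:
    return tm - tstars[u - 1] * se, tm - tstars[ell] * se

random.seed(6)
x = [random.gauss(0, 1) for _ in range(30)]
lo, hi = boot_t_ci(x)
print(lo < trimmed_mean(x) < hi)  # True
```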
A test of H0 based on a symmetric confidence interval rejects at the α level if

|Tt| ≥ |T*|(c), (2.16)

where c = (1 − α)B, rounded to the nearest integer, and |T*|(1) ≤ · · · ≤ |T*|(B)
denote the |T*| values written in ascending order.
mean.
methods, but the actual Type I error probability is again greater than 0.075.
With the percentile bootstrap method, compute the trimmed mean for each of B
bootstrap samples, put the B estimates in ascending order, and
label the results X̄*t(1) ≤ · · · ≤ X̄*t(B). An approximate 1 − α confidence interval for
the population trimmed mean is

(X̄*t(ℓ+1), X̄*t(u)), (2.17)

where ℓ and u are computed as done by the equal-tailed bootstrap-t method. Let A
denote the number of bootstrap estimates greater than the hypothesized value μ0, and
let p̂ = A/B. A (generalized) p-value is

2 min{p̂, 1 − p̂}

(Liu & Singh, 1997). That is, the p-value is 2p̂ or 2(1 − p̂), whichever is smaller.
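A Python sketch of the percentile bootstrap for a 20% trimmed mean, returning the confidence interval and the generalized p-value (the sample data and the helper name `pb_trim` are invented; the book's R functions are the authoritative implementations):

```python
import math, random

def trimmed_mean(x, tr=0.2):
    xs = sorted(x); n = len(xs); g = math.floor(tr * n)
    return sum(xs[g:n - g]) / (n - 2 * g)

def pb_trim(x, null_value=0.0, tr=0.2, alpha=0.05, B=2000, seed=7):
    """Percentile bootstrap CI and generalized p-value for a trimmed mean."""
    random.seed(seed)
    n = len(x)
    boots = sorted(trimmed_mean([random.choice(x) for _ in range(n)], tr)
                   for _ in range(B))
    ell = round(alpha * B / 2)
    u = B - ell
    ci = (boots[ell], boots[u - 1])          # (2.17)
    phat = sum(1 for b in boots if b > null_value) / B
    return ci, 2 * min(phat, 1 - phat)       # 2 min{p-hat, 1 - p-hat}

random.seed(8)
x = [random.gauss(1, 1) for _ in range(40)]  # centered near 1, not 0
ci, pval = pb_trim(x, null_value=0.0)
print(ci, pval)
```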
Table 2.2 shows the actual probability of a Type I error when α = 0.05
and when using a 20% trimmed mean. Compared to the results in Table 2.1,
it is evident that a 20% trimmed mean provides better control over the Type I
error probability. For a skewed, light-tailed distribution, the bootstrap-t methods
offer a slight advantage over the percentile bootstrap method. But for heavy-tailed
distributions, the percentile bootstrap method is best in terms of having an actual
Type I error probability close to the nominal level. Note that the percentile bootstrap
method is the most stable of the four methods considered.
Becher et al. (1993) as well as Westfall and Young (1993) established that when
dealing with the mean, the bootstrap-t performs better than the percentile bootstrap.
However, as the amount of trimming increases, at some point the percentile
bootstrap tends to be more satisfactory than the bootstrap-t. This has been found
to be the case with 20% trimming. There are some indications that the percentile
bootstrap competes well with the bootstrap-t when using 10% trimming, but this
issue needs to be studied more. More broadly, when using an estimator with a
reasonably high breakdown point, typically the percentile bootstrap performs well.
There is the practical issue of choosing B, the number of bootstrap samples. Early
studies were focused on controlling the Type I error probability. And there was the
additional problem that computers were substantially slower compared to computers
available today. These early studies found that .B = 500 often sufficed. But there
are at least two reasons for choosing a larger value. First, if a different set of B
bootstrap samples is used, it is desirable that this have very little impact on a
confidence interval or the p-value. Put another way, the R functions in this book
that use bootstrap methods set the seed of the random number generator so that the
results are always duplicated if the function is used again on the same data. But if
another seed for the random number generator in R were used, this could alter the p-
value by a fair amount if B has a relatively small value. The second practical reason
for using a large value for B is that this can increase power (Racine & MacKinnon,
2007).
The R function
performs the Tukey–McLaughlin method for a trimmed mean. The data are assumed
to be stored in the R object x. The argument tr controls the amount of trimming
and defaults to 20%. The null value can be specified by the argument null.value
or nullval. The function
performs the bootstrap-t method. If plotit = TRUE, the function plots the
bootstrap estimates of Tt.
When dealing with the median, special techniques are required. As the amount of
trimming approaches the median, the Tukey–McLaughlin method breaks down. In
particular, the estimate of the standard error of the trimmed mean used by Tukey–
McLaughlin is highly unsatisfactory when using the median. This section deals with
this issue and includes a method for making inferences about any quantile. Chapter 3
elaborates on why quantiles other than the population median can be of interest.
There are in fact quite a few methods that might be used (e.g., Wilcox, 2022a,
Section 4.6). Some of these methods are based on an estimate of the standard error,
there are situations where these methods perform fairly well, but there are situations
where they are unsatisfactory. One serious concern is a situation where tied values
occur. Tied values refer to a situation where some values occur more than once.
That is, there are duplicate values. The practical concern is that when using the
median, tied values can invalidate all known methods for estimating the standard
error. It is possible that with a large sample size and very few tied values, fairly
accurate estimates of standard error can be computed. But there are no satisfactory
guidelines indicating when this is the case. Currently, the safest approach is to use a
method that does not require an estimate of the standard error.
Yet another concern is assuming that the distribution of the median approaches
a normal distribution as the sample size increases. When there are tied values, this
is not necessarily the case. Koenker (2005, p. 150) described situations where the
sample median, M, converges to a discrete distribution. This point is illustrated at
the end of Sect. 2.8.
Here, two methods are described for computing a confidence interval for any
quantile. The first assumes random sampling only. For any i < j, it can be shown
that the probability that the interval (X(i), X(j)) contains the qth quantile is exactly
equal to

Σ (n choose k) q^k (1 − q)^(n−k), k = i, ..., j − 1 (2.18)
(e.g., Arnold et al., 1992). Put another way, consider binary data taking the values 0
or 1, where P(X = 1) = q. Let p(k) denote the probability that the value 1 occurs
k times in n trials. That is, p(k) is given by the binomial probability function that is
covered in a standard introductory course. Then,

Σ p(k), k = i, ..., j − 1, (2.19)
indicates the exact probability that the interval .(X(i) , X(j ) ) contains the qth
quantile. Said yet another way, the exact probability coverage can be determined
with the binomial probability function.
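The exact coverage in (2.18) is a one-line binomial sum. A Python sketch; the choice n = 15 with the interval (X(4), X(12)) for the median is an arbitrary example:

```python
from math import comb

def coverage(n, i, j, q):
    """Exact probability that (X(i), X(j)) contains the qth quantile,
    per Eq. (2.18): a sum of binomial probabilities for k = i, ..., j-1."""
    return sum(comb(n, k) * q ** k * (1 - q) ** (n - k) for k in range(i, j))

print(coverage(15, 4, 12, 0.5))  # 0.96484375
```

Because the binomial distribution is discrete, no choice of i and j yields exactly 0.95, which is the issue the Hettmansperger–Sheather interpolation mentioned below addresses.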
A second approach is to use a percentile bootstrap method. When dealing with
the median, and sampling is from a light-tailed distribution, this approach, coupled
with the Harrell–Davis estimator, can yield shorter confidence intervals compared
to the method just described. However, for the lower and upper quartiles, it can be
unsatisfactory when the sample size is relatively small. At the moment, there are
no good guidelines regarding how large the sample size needs to be when using the
Harrell–Davis estimator.
The R function
computes a confidence interval for the qth quantile using the method given by
(2.18). By default, a confidence interval for the median is computed that has
probability coverage greater than or equal to 0.95. The exact probability coverage
is reported as well. Because the confidence interval is based on the binomial
distribution, which is discrete, an exact 0.95 confidence interval cannot be computed
with this method. When dealing with the median, Hettmansperger and Sheather
(1986) derived an interpolation method that helps correct this problem, which can
be applied via the R function
TRUE)
The R function
qcipb(x,q=0.5,alpha=0.05,nboot=2000,SEED=TRUE,nv=0)
computes a confidence interval for the qth quantile using a percentile bootstrap
method in conjunction with the Harrell–Davis estimator.
There is a rather involved method for estimating the standard error of the one-step
M-estimator (Huber & Ronchetti, 2009), which can be computed via the R function
mestse(x).
This suggests using a pivotal test statistic having the form given by (2.5), which
was used to compute a confidence interval for the trimmed mean. For a moderately
large sample size, this approach performs reasonably well when sampling from a
symmetric distribution. But for skewed distributions, this is not the case. In theory,
this approach would work well for a sufficiently large sample size, but there are
no satisfactory guidelines when this is the case. However, there is a simple way of
dealing with this issue: use a percentile bootstrap method. The percentile bootstrap
method was described in Sect. 2.3.1 in the context of a trimmed mean. But a
percentile bootstrap method can be used with any estimator.
As for the MOM estimator, an expression for its standard error has not been
derived. But the standard error can be estimated using a bootstrap method. Simply
generate a bootstrap sample yielding, say, X̄m. Repeat this process B times, yielding
X̄m1, . . . , X̄mB. The sample variance of the B bootstrap estimates,

(1/(B − 1)) Σ (X̄mb − X̃)², (2.20)

estimates the squared standard error of MOM, where X̃ = Σ X̄mb/B. However,
again there is the issue of how to deal with skewed distributions. Currently, the best
solution is to use a percentile bootstrap method.
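The bootstrap estimate of the standard error in (2.20) applies to any estimator. A minimal Python sketch (illustrative only; the function name is not from the book, and the median stands in here for MOM):

```python
import random
from statistics import median, variance

def boot_se(x, est=median, B=1000, seed=1):
    # Bootstrap estimate of the standard error of `est`, per Eq. (2.20):
    # resample with replacement, apply the estimator B times, and take
    # the square root of the sample variance of the B estimates.
    rng = random.Random(seed)
    n = len(x)
    ests = [est([x[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    return variance(ests) ** 0.5

rng = random.Random(2)
x = [rng.gauss(0, 1) for _ in range(30)]
se = boot_se(x)  # for normal data the true SE of the median is about 1.25/sqrt(n)
```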
The R function
computes a bootstrap estimate of the standard error based on the measure of location
indicated by the argument est.

2.4 Inferences About the Probability of Success

This section deals with binary data taking on the value 0 or 1, where 1 indicates a
success and 0 a failure. Let .p = P (X = 1) denote the probability of a success.
Based on a random sample X1, . . . , Xn, the usual estimate of p is

p̂ = (1/n) Σ Xi, (2.21)
which is just the proportion of successes among n trials or participants. Put another
way, .p̂ is the mean of data that has the value 0 or 1. Inferences about p will help
provide perspective in situations described in subsequent chapters.
A basic goal is computing a confidence interval for p. Certainly the best-known
method stems from Agresti and Coull (1998). Let c denote the .1 − α/2 quantile of
a standard normal distribution, and let X denote the total number of successes. Let
ñ = n + c²,

X̃ = X + c²/2,

and

p̃ = X̃/ñ.

The Agresti–Coull 1 − α confidence interval for the probability of success, p, is

p̃ ± c √(p̃(1 − p̃)/ñ). (2.22)
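In Python (an illustrative sketch; the book's binom.conf handles this in R, and the function name here is not from the book), the interval (2.22) is easily computed:

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull(X, n, alpha=0.05):
    # Agresti-Coull 1 - alpha confidence interval for p, Eq. (2.22):
    # add c^2 pseudo-trials and c^2/2 pseudo-successes, then apply a
    # Wald-style interval to the adjusted proportion.
    c = NormalDist().inv_cdf(1 - alpha / 2)   # 1 - alpha/2 standard normal quantile
    n_tilde = n + c**2
    p_tilde = (X + c**2 / 2) / n_tilde
    half = c * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

lo, hi = agresti_coull(15, 50)   # 15 successes in 50 trials
```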
However, for a small sample size, there are situations where the Agresti–
Coull method can be inaccurate (Wilcox, 2019a). This occurs when the unknown
probability of success is close to 0.16 or 0.84. When .n = 25, the actual value of
1 − α can drop below 0.90 when the goal is to compute a 1 − α = 0.95 confidence
interval. This concern can be addressed using a method derived by Schilling and Doi
(2014). The computational details are rather involved and not described here. Also,
as the sample size increases, execution time for the Schilling–Doi method, using
the R function supplied in the next section, can be an issue. Here, the Schilling–Doi
method is used when n ≤ 35. Otherwise, the Agresti–Coull method is used, which
generally performs well when .n > 35. There are, however, four situations where
the Agresti–Coull method is replaced by a method recommended by Blyth (1986).
To describe them, let .cL denote the lower end of the confidence interval and let .cU
denote the upper end.
• If X = 0,

cL = 0 and cU = 1 − α^(1/n).

• If X = 1,

cL = 1 − (1 − α/2)^(1/n) and cU = 1 − (α/2)^(1/n).

• If X = n − 1,

cL = (α/2)^(1/n) and cU = (1 − α/2)^(1/n).

• If X = n,

cL = α^(1/n) and cU = 1.

For the cases X = 0 and X = n, the method derived by Clopper and Pearson (1934)
replaces α with α/2, which guarantees that the probability coverage is at least 1 − α.
The Schilling–Doi method also guarantees that the probability coverage is at least
1 − α with the added benefit that the length of the confidence interval is as short as
possible.

2.5 R Functions binom.conf and cat.dat.ci

The R function binom.conf computes a confidence interval for p using the methods
just described. To use the Agresti–Coull method when n < 35, set the argument
AUTO=FALSE and the argument method='AC'. When testing some hypothesis,
the argument nullval indicates the null value. To get a p-value when using the
Schilling–Doi method, set PVSD=TRUE.
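The four boundary cases have simple closed forms, sketched here in Python (illustrative only; in R, binom.conf applies them automatically, and the function name below is not from the book):

```python
def blyth_ci(X, n, alpha=0.05):
    # Blyth's (1986) closed-form confidence limits for the boundary
    # cases X = 0, 1, n - 1, and n described above.
    if X == 0:
        return 0.0, 1 - alpha ** (1 / n)
    if X == 1:
        return 1 - (1 - alpha / 2) ** (1 / n), 1 - (alpha / 2) ** (1 / n)
    if X == n - 1:
        return (alpha / 2) ** (1 / n), (1 - alpha / 2) ** (1 / n)
    if X == n:
        return alpha ** (1 / n), 1.0
    raise ValueError("closed forms apply only to the four boundary cases")
```

Note the symmetry: the interval for X = n − 1 is the mirror image of the interval for X = 1.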
Consider data where the sample space consists of very few values. For each
observed value, the R function
.cat.dat.ci(x,alpha=0.05)
computes a confidence interval for the probability that value occurs in the
population under study.
Example The command z=rbinom(50,4,0.4) was used to randomly generate
50 values from a binomial distribution and store the results in the R object z. The
second argument, 4, indicates that the observed number of successes has a value
between 0 and 4, inclusive. The last argument, 0.4, indicates that the probability of
a one is 0.4. The R command cat.dat.ci(z) returned
$output
x Est. ci.low ci.up
[1,] 0 0.14 0.06637434 0.2649959
[2,] 1 0.32 0.20697226 0.4587129
[3,] 2 0.34 0.22389410 0.4789371
[4,] 3 0.12 0.05249712 0.2417271
[5,] 4 0.08 0.02640142 0.1935306
2.6 Effect Size
When comparing groups, effect size is a generic term for quantitative methods
that characterize how the groups differ. For the one-sample case considered here,
they characterize the extent the true distribution differs from the hypothesized
distribution. The goal in this section is to provide some background that will help
set the stage for methods to be described.
Four distinct approaches are considered here. Let .θ denote any measure of
location. The first approach is to simply use .θ − θ0 , the difference between the
true value and the hypothesized value. Of course, estimating this measure of effect
size is straightforward since .θ0 is known. For example, when using the median, use
.M − θ0 .
The second general approach is a standardized measure of effect size based on the
mean and standard deviation:

δ = (μ − μ0)/σ. (2.23)
However, this method is not robust in the sense described in Sect. 1.6.
To illustrate this last point, assume that for a normal distribution, .δ = 0.2, 0.5 and
0.8 are small, medium, and large effect sizes. So for the standard normal distribution
and H0: μ = 0, μ = 0.8 is being viewed as a large effect size. But for the mixed
normal, now the effect size is 0.8/√10.9 = 0.24, which is relatively small. That
is, a small departure from a normal distribution can alter this measure of effect
size substantially. In addition, the estimate of .δ, .δ̂ = (X̄ − μ0 )/s, can be lowered
substantially by outliers.
Following Algina et al. (2005), a more robust analog of δ is to replace the mean
and variance with a 20% trimmed mean and Winsorized standard deviation. That is,

δt = 0.642 (X̄t − μ0)/sw, (2.24)

which has a breakdown point equal to 0.2. For the mixed normal example in the
previous paragraph, δt = 0.71.
An analog of (2.24), based on the median, is simply
δm = (M − μ0)/MADN. (2.25)
If MADN=0, replace MADN with the 0.25 Winsorized standard deviation that is
rescaled to estimate .σ under normality.
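A Python sketch of (2.25) (illustrative only; the function name is not from the book, and the constant 0.6745 rescales MAD so that MADN estimates σ under normality):

```python
from statistics import median

def delta_m(x, null_value):
    # Median-based effect size, Eq. (2.25): (M - mu0) / MADN, where
    # MADN = MAD / 0.6745 estimates sigma under normality.
    M = median(x)
    madn = median(abs(xi - M) for xi in x) / 0.6745
    return (M - null_value) / madn

# For x = 1, ..., 11: M = 6 and MAD = 3, so delta_m(x, 4) = 2 * 0.6745 / 3.
```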
Another approach to measuring effect size is to use the quantiles of the null
distribution. Consider the case where .θ is taken to be the 0.5 quantile, the population
median. The hypothesis is that sampling is from a distribution having a median equal
to .θ0 . This is the null distribution. If the hypothesis is true, .θ is the 0.5 quantile of
the null distribution. But if the hypothesis is false, .θ corresponds to Q, some other
quantile associated with the null distribution. The further Q is from 0.5, the larger
the effect. This approach is an example of what is called a quantile shift measure of
effect size.
Consider a normal distribution with .σ = 1 and the goal of testing .H0 : μ = 0.
If δ = 0.2, then μ = 0.2, and the probability of being less than or equal to 0.2 is 0.58. That
is, .δ = 0.2 indicates that the population mean corresponds to the 0.58 quantile of
the null distribution. In a similar manner, .δ = 0.5 and 0.8 correspond to the 0.69
and 0.79 quantiles. Simplifying a bit, if .δ = 0.2, 0.5 and 0.8 are viewed as small,
medium and large effect sizes, respectively, this corresponds to viewing the 0.6, 0.7,
and 0.8 quantiles of the null distribution as small, medium, and large effect sizes as
well. If .δ = −0.2, .−0.5 and .−0.8, these values correspond to the 0.4, 0.3, and 0.2
quantiles, which again are being viewed as small, medium, and large effect sizes.
Now consider a distribution that is skewed. Note that .δ makes no distinction
based on whether it is positive or negative. For example, .δ = −0.5 and .δ = 0.5
are both being viewed as medium effect sizes. Moreover, in terms of the quantiles
of the null distribution, the above interpretation of .δ, under normality, can be highly
invalid. Consider, for example, the lognormal distribution shown in Fig. 1.3, and
suppose this distribution has been shifted so that its mean is μ − 0.5σ. That is,
δ = −0.5, which supposedly is a medium effect size. It can be shown that μ −
0.5σ = 0.568, which corresponds to the 0.286 quantile of the null distribution. The
mean of a lognormal distribution corresponds to the 0.691 quantile. That is, μ − 0.5σ
reflects a shift from the 0.691 quantile to the 0.286 quantile, a difference of
0.691 − 0.286 = 0.405. But under normality, a large effect size based on δ corresponds to a
shift from the 0.5 quantile to the 0.2 quantile, a difference of only 0.3. That is, in
terms of the quantiles of the null distribution, δ = −0.5 corresponds to a very large
effect size, not a medium effect size. In a similar manner, if δ = 0.5, this represents
a shift to the 0.84 quantile of a lognormal distribution, an increase of only 0.15,
which is viewed as being small when dealing with a normal distribution.
Again let Q denote the quantile of the null distribution corresponding to the
actual value of the population median. Estimating Q is straightforward. First,
compute
Zi = Xi − M + θ0,

i = 1, . . . , n. That is, center the data so that its median is equal to the null value θ0,
in which case an estimate of Q is the proportion of Zi values less than or equal to
M. More formally, let Ii = 1 if Zi ≤ M; otherwise Ii = 0. An estimate of Q is

Q̂ = (1/n) Σ Ii. (2.26)
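A Python sketch of the estimate (2.26) (illustrative only; the function name is not from the book):

```python
from statistics import median

def Q_hat(x, null_value):
    # Quantile shift estimate, Eq. (2.26): center the data so the sample
    # median equals the null value, then take the proportion of centered
    # values less than or equal to the observed median M.
    M = median(x)
    z = [xi - M + null_value for xi in x]
    return sum(zi <= M for zi in z) / len(x)
```

If the null value equals the sample median, the centered data are just the original data and Q̂ is roughly 0.5; the further the null value is from M, the further Q̂ moves from 0.5.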
Often .δm and Q give similar results in terms of their relative magnitude, but
exceptions can occur. If Q indicates a medium effect size, it is likely that .δm will
indicate a medium effect size as well. Confidence intervals for Q and the measure
of effect size given by (2.24) can be computed with a percentile bootstrap method.
Evidently there are no results on how to compute a confidence interval for the
population value of .δm . Until this issue is addressed, Q seems preferable to .δm .
The R function
estimates .δt and computes a confidence interval using a percentile bootstrap method.
The R function
2.8 Plots
There are numerous methods for plotting data. See, for example, Wickham (2016)
and Sievert (2020). This section mentions two basic plots that play a role in this book
beyond a boxplot and a histogram. Additional plots are covered in later chapters.
The first is called an adaptive kernel density estimator, which estimates the
population distribution. There are several variations of kernel density estimators.
The adaptive kernel density estimator used here is motivated by results in Silverman
(1986). The default version of this estimator can provide a better sense of the
underlying distribution than the default version of the histogram.
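To convey the basic idea, here is a fixed-bandwidth Gaussian kernel density estimate in Python (an illustrative sketch, not the adaptive version used by the book's R software; the adaptive estimator additionally varies the bandwidth from point to point):

```python
from math import exp, pi, sqrt
from statistics import stdev

def kde(x, grid, h=None):
    # Fixed-bandwidth Gaussian kernel density estimate evaluated on grid.
    # Default h is Silverman's normal-reference rule, 1.06 * s * n^(-1/5).
    n = len(x)
    if h is None:
        h = 1.06 * stdev(x) * n ** (-1 / 5)
    c = 1 / (n * h * sqrt(2 * pi))
    return [c * sum(exp(-0.5 * ((g - xi) / h) ** 2) for xi in x) for g in grid]
```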
Example Figure 1.5 illustrates that with a sample size of .n = 100, a histogram
can poorly reflect the shape of the true distribution when sampling data from the
mixed normal distribution. Here, another one-hundred values were generated from
the mixed normal distribution. Figure 2.3 shows the resulting histogram using the
R function hist. Notice that the left tail seems to be relatively short, suggesting
that perhaps the distribution is skewed to the right. Figure 2.4 shows the estimate of
the distribution using the adaptive kernel density estimator. Now the two tails appear
to be similar in shape, which is correct. Of course, this one example is not compelling
evidence that generally, the adaptive kernel density estimator provides a more
accurate estimate of a distribution. Indeed, both can fail in some situations. The only
point is that the adaptive kernel density estimator has the potential of providing a
more accurate estimate of a distribution than the default version of the histogram.
One problem with the histogram is that its shape can be impacted by outliers. In
fairness, a histogram might be improved by using more bins. An example is provided
in the exercises.
Fig. 2.3 Shown is a histogram based on 100 values generated from the mixed normal distribution
in Fig. 1.2
When dealing with discrete data where the sample space is relatively small,
plotting the relative frequencies of each observed value can be more informative
than using a histogram or an adaptive kernel density estimator. One way of doing
this is with the R function splot, which plots the relative frequencies of all distinct
values. With op=TRUE, a line connecting points marking the relative frequencies is
added to the plot. The function also returns the frequencies and relative frequencies.
Example Section 2.3.6 noted that as the sample size increases, the sampling
distribution of the median approaches a normal distribution when tied values never
occur. But when there are tied values, this is not necessarily the case. The R function
splot is used to illustrate this point. Consider the probability function shown in
Fig. 2.5. The sample space consists of the integers from 0 to 15. Suppose a sample
of .n = 20 values are randomly sampled based on this probability function and the
median is computed. Further suppose this process is repeated 5000 times, which
yields an estimate of the sampling distribution of the median. The left panel of
Fig. 2.6 shows the relative frequencies of the resulting medians using the R function
splot. The right panel is based on n = 100. Note that the number of unique
values for the median decreased when moving from .n = 20 to 100. As is evident,
Fig. 2.4 Shown is an adaptive kernel density estimate based on the same 100 values used in Fig. 2.3
Fig. 2.5 (Plot of the probability function: Probability versus X, X = 0, 1, . . . , 15)
increasing the sample size to 100 did not result in a plot that looks more normal
compared to when .n = 20 (Fig. 2.6).
2.9 Some Concluding Remarks
There is an issue about standard errors that should be stressed. A correct expression
for the standard error of a location estimator, and hence, a technically sound method
for estimating the standard error, depends crucially on how extreme values are
Fig. 2.6 Shown are estimates of the sampling distribution of the median when sampling from
the distribution in Fig. 2.5. The estimates are based on 5000 medians. The left panel is based on
.n = 20, and the right panel is based on .n = 100
treated. For a 20% trimmed mean, there is a relatively simple way of dealing with
this issue. But when using a one-step M-estimator, the expression for the standard
error is quite involved, the details of which were not covered here. Imagine that
when calculating the one-step M-estimator, it trims 10% from both the lower and
upper tails. This might suggest that an estimate of the standard error, based on a 10%
trimmed mean, could be used. But this is incorrect and can yield a highly inaccurate
estimate of the standard error.
Another point that should be stressed is that using a correct estimate of the
standard error can be crucial. Ignoring this issue can result in an estimate of
the standard error that is highly inaccurate. Imagine that the 20% smallest and
largest values are trimmed and the standard error of the sample mean, based on the
remaining data, is computed. Generally, the resulting estimate is about half of the
correct estimate given by (2.8). An illustration is relegated to the exercises.
Next, imagine that an argument can be made that any value greater than 10 is
erroneous, and so any value greater than 10 is discarded. Now the usual estimate
of the standard error of the mean is valid. For the situation at hand, determining
which values are trimmed does not depend on the observed data. This is in contrast
to trimmed means and M-estimators, where the values declared to be unusually small
or large depend on the data that are available.
As mentioned in Sect. 2.1.2, multiple methods can be needed to get a good
understanding of data. One basic reason is that different methods reflect different
features of the data. The most obvious example occurs when distributions are
skewed, in which case the population mean, 20% trimmed mean, and median all
have different values. An added complication is that different measures of variation
also reflect different features of the data. This adds another level of complexity
when choosing a standardized measure of effect size. Guidelines can be provided
regarding the relative merits of methods in terms of their ability to control the
probability of a Type I error. But even for the relatively simple situation considered
here, more than one perspective can be crucial.
Of course, one could simply check several methods and see whether they paint a
different picture. If this is the case, look more closely at the data to understand why.
However, when testing hypotheses, there is the issue of controlling the probability
of one or more Type I errors. Methods for dealing with this issue will be discussed
at various times in subsequent chapters.
2.10 Exercises
1. Use the read.table command to store the data in the file A1B1C_dat.txt
in the R object A1B1C. The column labeled cort1 contains cortisol measures
taken upon awakening. Next, execute the commands:
par(mfrow=c(2,2))
hist(A1B1C$cort1,freq=FALSE)
hist(A1B1C$cort1,breaks=50,freq=FALSE)
hist(A1B1C$cort1[A1B1C$cort1<2],freq=FALSE)
akerd(A1B1C$cort1)
par(mfrow=c(1,1))
Comment on what this illustrates.
2. For the data used in Exercise 1, compute a confidence interval for the 20%
trimmed mean using trimpb. Next compute a confidence interval for the mean
using trimci with the argument tr=0. Compare the lengths of the confidence
intervals. What explains the difference?
3. A total of 150 females were asked how many sexual partners they desired
over the next thirty years. The data are stored in sexf_dat.txt, which can
be obtained as explained in Sect. 1.9. Read the data into R using the scan
command. First, however, look at the data in the file and note why the argument
skip=1 is needed when using the scan command. Next, compute the mean,
20% trimmed mean, and median. Also compute the one-step M-estimator and
comment on the result that is obtained. How do you explain the result returned
by onestep? Next determine the most common response using the R function
splot. Next, compute confidence intervals using cat.dat.ci and comment
on the confidence interval for the probability of getting the response 1.
4. For the data in Exercise 3, test the hypothesis that the typical response is one
using the mean. Next, test the same hypothesis using the 20% trimmed mean
based on the Tukey–McLaughlin method followed by the percentile bootstrap
method. Comment on the results.
5. For the data in Exercise 3, use the R function D.akp.effect.ci to
test the hypothesis that the effect size given by (2.24) is 0 when the null
value is 1. Next, use the R function sintv2 to test the hypothesis that the
median is 1. Comment on how the result compares to the result obtained by
D.akp.effect.ci.
6. This item deals with the strategy of trimming values and estimating the standard
error of mean based on the remaining data. Imagine that m values remain after
trimming. Let sm denote the standard deviation based on the remaining data, and
suppose sm/√m is used to estimate the standard error. This estimate ignores
the dependence among the remaining data. The issue is the extent to which this approach
gives a different result compared to the estimate given by (2.8), which deals
with the dependence among the remaining data.
Set the seed of the random number generator to 46. The command is
set.seed(46). Generate 50 values from a standard normal distribution with
the R function rnorm and store the results in some R object. Compute the
standard error of the 20% trimmed mean using the R function trimse. Note
that the 20% trimmed mean removes the 10 smallest and 10 largest values
leaving 30 values. Based on the remaining 30 values, compute the sample
variance sm², followed by sm/√30. Comment on the results.
7. The file dana_dat contains reaction time data for two groups of participants.
For the first group, stored in the first column, compute a confidence interval for
the median using the R function qint. Compare the length of the confidence
interval to the length obtained by Student’s t. What explains the difference?
8. For the data used in the last exercise, compute mom, the one-step M-estimator,
and the mean. What explains the discrepancy between the first two and the
mean?
9. Imagine that the mean, 20% trimmed mean, and median have very similar
values. Why is it that the choice of an estimator can still be important?
10. Imagine that for each of B = 1000 bootstrap samples, the 20% trimmed mean is
computed. The null hypothesis is H0 : μt = 6. If 900 of the bootstrap estimates
of the trimmed means are greater than 6, what is the p-value?
11. Describe a type of distribution where the actual Type I error probability tends
to be less than or equal to the nominal level when using Student's t.
12. What types of distributions are a serious concern when using Student’s t?
13. Consider the hypothesis H0 : μt = 8 and imagine that if μt = 10, this is
considered to be an important difference from the hypothesized value. Further
imagine the p-value is 0.01 and that the 0.95 confidence interval is (9, 12).
Interpret the results based on Tukey’s three-decision rule when testing at the
0.05 level.
14. Generally, why is a confidence interval for some measure of effect size more
informative than simply reporting a p-value?
15. Can a small departure from a normal distribution seriously impact the power of
Student’s t-test? Defend your response.
16. Imagine that with B = 300 bootstrap samples, it is known that the Type I error
probability is controlled well. Why might it be advantageous to use a larger
number of bootstrap samples?
(Dixon & Tukey, 1968). When using 20% trimming, in which case g = 0.2n
rounded down to the nearest integer, argue that this estimate of the standard
error of the Winsorized mean is larger than the estimate of the standard error of
the trimmed mean.
18. Comment on the claim that when computing an accurate confidence interval for
the median, a large sample size is needed.
19. Describe a situation where the Type I error probability of Student’s t will be
less than or equal to the nominal level.
Chapter 3
Comparing Two Independent Groups
The previous two chapters provide basic information that is needed for the main
goals in this book: comparing groups and studying associations. This chapter
focuses on comparing two independent groups. There are multiple ways to approach
this problem with different methods providing different perspectives. Here is a
general outline of the strategies that might be used:
1. Compare measures of location. This includes the strategy of comparing all of the
quantiles.
2. Focus on the probability that a randomly sampled value from the first group is
less than a randomly sampled value from the second group.
3. Use the median of the typical difference between two randomly sampled
participants. This approach is related to 2 and differs from 1 as will be seen.
4. Compare groups based on some measure of variation.
5. Compare groups based on a measure of effect size that takes into account both a
measure of location and a measure of variation.
6. Compare the groups based on a quantile shift measure of effect size.
7. For discrete data having a small sample space, compare the probabilities
associated with each observed outcome. Included as a special case is comparing
two groups based on the probability of a success associated with each group.
A general issue is whether these methods paint a similar picture. For example,
do all methods suggest a large effect? If not, why? There are some technical issues
when performing multiple tests that are addressed in this chapter as well.
3.1 Comparing Measures of Location

As noted in Chapter 1, Pratt (1964) established that Student's t-test can be
unsatisfactory when distributions differ in shape. In particular, violating the
homoscedasticity assumption is a serious practical concern, and testing this assumption has
been found to be an ineffective strategy. All of the methods in this section allow
heteroscedasticity. A positive feature of heteroscedastic methods is that they use a
correct estimate of the standard error regardless of whether the null hypothesis is
true or false. Using an incorrect estimate of the standard error can result in poor control
over the Type I error probability and inaccurate confidence intervals.
Let .θj be any measure of location associated with the j th group (.j = 1, 2). Two
related goals are to test
H0: θ1 = θ2 (3.1)

and to compute a confidence interval for θ1 − θ2.
This in turn yields a test statistic that is a simple generalization of the pivotal test
statistic given by Eq. (2.5) in Chap. 2. Given a reasonable test statistic, the next step
is finding a reasonably accurate approximation of the null distribution. The next
section illustrates this process when comparing trimmed means.
This section describes and discusses two methods for comparing trimmed means
that are based in part on an estimate of the standard error of the difference between
the sample trimmed means. The first stems from Yuen (1974) who derived a method
for comparing trimmed means that illustrates the basic strategy just described. Let
nj, X̄tj, and s²wj denote the sample size, the trimmed mean, and the Winsorized variance,
respectively, associated with the jth group. Yuen estimates the squared standard
error of X̄tj with

dj = (nj − 1)s²wj/(hj(hj − 1)), (3.2)

where hj is the number of values left after trimming. Yuen's test statistic is

Ty = (X̄t1 − X̄t2)/√(d1 + d2). (3.3)
As usual, there is the issue of approximating the distribution of .Ty when the null
hypothesis is true. Yuen uses a Student's t distribution with degrees of freedom

ν̂y = (d1 + d2)²/(d1²/(h1 − 1) + d2²/(h2 − 1)).

Letting t denote the 1 − α/2 quantile of a Student's t distribution with ν̂y degrees of
freedom, reject the null hypothesis if |Ty| ≥ t. A 1 − α confidence interval is given
by

(X̄t1 − X̄t2) ± t√(d1 + d2). (3.4)
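A compact Python sketch of Yuen's computations (illustrative only; the book's R function yuen also supplies the p-value and confidence interval, and the helper names below are not from the book):

```python
from math import sqrt
from statistics import variance

def winsorize(z, tr=0.2):
    # Replace the g smallest values by the (g+1)th smallest and the g
    # largest by the (g+1)th largest, where g = floor(tr * n).
    zs = sorted(z)
    n = len(zs)
    g = int(tr * n)
    lo, hi = zs[g], zs[n - g - 1]
    return [min(max(v, lo), hi) for v in zs]

def trimmed_mean(z, tr=0.2):
    zs = sorted(z)
    g = int(tr * len(zs))
    kept = zs[g:len(zs) - g]
    return sum(kept) / len(kept)

def yuen_sketch(x, y, tr=0.2):
    # Yuen's test statistic and degrees of freedom, Eqs. (3.2)-(3.3):
    # d_j = (n_j - 1) s2_wj / (h_j (h_j - 1)), T_y = diff / sqrt(d1 + d2).
    def d_h(z):
        n = len(z)
        h = n - 2 * int(tr * n)            # values left after trimming
        return (n - 1) * variance(winsorize(z, tr)) / (h * (h - 1)), h
    d1, h1 = d_h(x)
    d2, h2 = d_h(y)
    Ty = (trimmed_mean(x, tr) - trimmed_mean(y, tr)) / sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1**2 / (h1 - 1) + d2**2 / (h2 - 1))
    return Ty, df
```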
The hypothesis of equal population trimmed means is rejected if Ty ≤ T*(ℓ+1) or
Ty ≥ T*(u), where Ty is the test statistic based on the observed data.

To compute a symmetric confidence interval, put the absolute values of the
bootstrap test statistics in ascending order, yielding

|T*|(1) ≤ · · · ≤ |T*|(B).

Letting c = (1 − α)B rounded to the nearest integer, reject the null hypothesis if

|Ty| ≥ |T*|(c).
There is a more involved variation of the bootstrap-t method just described that
has been found to perform a bit better in terms of Type I error probabilities. As
suggested by Keselman et al. (2004), a bootstrap-t method is used in conjunction
with a variation of Yuen’s method derived by Guo and Luh (2000). This approach
has been found to be best when using 10% or 15% trimming.
The percentile bootstrap method for two independent groups is based on a simple
generalization of the percentile bootstrap method described in Chap. 2. In principle,
it can be used with any measure of location and is applied as follows. Take a
bootstrap sample from each group, compute the measure of location of interest
for both groups, and let D* denote the difference. For example, when working with a
trimmed mean, D* = X̄*t1 − X̄*t2. Repeat this B times, yielding D*1, . . . , D*B. Put
these B values in ascending order, yielding D*(1) ≤ · · · ≤ D*(B). Then an
approximate 1 − α confidence interval for the difference between the population
trimmed means, μt1 − μt2, is

(D*(ℓ+1), D*(u)), (3.7)
A p-value is given by

2 min{p̂, 1 − p̂}, (3.8)

where p̂ is the proportion of the D* values that are less than zero.
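A Python sketch of the full procedure (illustrative only; in R, medpb2 and pb2gen implement it, and the function name below is not from the book). Ties among the bootstrap differences are counted as half:

```python
import random
from statistics import median

def pb2_pvalue(x, y, est=median, B=2000, seed=1):
    # Percentile bootstrap comparison of two independent groups:
    # resample each group, take the difference in the location estimates,
    # and convert the proportion of negative differences (ties counted
    # as half) into the p-value 2 * min(p-hat, 1 - p-hat) of Eq. (3.8).
    rng = random.Random(seed)
    def boot(z):
        n = len(z)
        return est([z[rng.randrange(n)] for _ in range(n)])
    D = [boot(x) - boot(y) for _ in range(B)]
    p_hat = (sum(d < 0 for d in D) + 0.5 * sum(d == 0 for d in D)) / B
    return 2 * min(p_hat, 1 - p_hat)
```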
Note the close similarity to how a percentile bootstrap p-value was computed in
Chap. 2.
Using the Median, M

When comparing groups with the sample median M, a slight generalization of (3.8)
is needed when there are tied (duplicated) values. Let A denote the number of D*
values less than zero, and let C denote the number of D* values equal to zero. Now

p̂ = A/B + 0.5 C/B
is used in (3.8). This method works well in general when using M, even with a
fairly small sample size, and currently is the only known method that performs well
when there are tied values (Wilcox, 2006). Moreover, when dealing with very small
sample sizes, it is arguably one of the best methods for controlling the Type I error
probability.
The relative merits of the percentile bootstrap method versus the bootstrap-t
method are essentially the same as those mentioned in Chap. 2. With a reasonably
high breakdown point, the percentile bootstrap is better than the bootstrap-t. With no
trimming, the bootstrap-t is best, with the understanding that when means are being
compared, both the bootstrap-t and Welch's method can be unsatisfactory in terms of
controlling the probability of a Type I error or yielding accurate confidence intervals.
Just how large the sample sizes must be to get accurate results when using means
is a complicated function of how unequal the sample sizes happen to be, the degree
to which the groups have different amounts of skewness, and the extent to which
the distributions have heavy tails. A positive feature of methods based on means is
that when comparing two identical distributions, the actual Type I error probability
is less than or equal to the nominal level. That is, in terms of the Type I error
probability, it provides a good test of the hypothesis that distributions are identical.
A negative feature is that the standard errors of the means can be substantially larger
than the standard error of robust estimators, as explained in Chap. 2.
The R function
yuen(x,y,tr=0.2,alpha=0.05)
applies Yuen’s method. By default, 20% trimming is used. The version of the
bootstrap-t method for comparing trimmed means, studied by Keselman et al.
(2004), can be applied with the R function
PV=FALSE)
The R function
uses a percentile bootstrap method for comparing trimmed means. The R function
medpb2(x,y,alpha=0.05,nboot=2000)

compares the population medians, and the R function

pb2gen(x,y,alpha=0.05,nboot=2000,est=onestep,...)

can be used to compare groups based on any measure of location using a percentile
bootstrap method. It uses a one-step M-estimator by default.
Often data are stored in a matrix or data frame where one of the columns contains
the data to be analyzed and another column contains group identification values. One
way of splitting the data into groups is with the R function
fac2list(x, g, pr = TRUE),

where the argument x contains the data to be analyzed and the argument g contains
the corresponding group identification values. For example, if the data to be analyzed
are stored in column 4 of the data frame dat, and column 2 indicates whether a
participant belongs to group A, E, or G, the command

a=fac2list(dat[,4], dat[,2])

would store the data in a, where a[[1]] contains the data for group A, a[[2]]
contains the data for group E, and a[[3]] contains the data for group G.
Example The file A1_dat.txt, which can be downloaded as described in Sect. 1.9,
contains numerous measures stemming from older adults. Assume the data have
been read into the R object A1. The column named edugp indicates level of
education: did not complete high school, graduated from high school, some college
or technical training, 4 years of college, postgraduate study. The column named
CESD contains a measure of depressive symptoms. The command
a=fac2list(A1$CESD,A1$edugp)
separates the data into groups based on education level and stores the results in the
R object a in list mode. The groups are identified numerically where 1 is did not
complete high school, 2 is graduated from high school, and so on. Consequently,
a[[1]] contains the data for the first group, a[[2]] the data for the second group,
and so forth. The command
yuen(a[[1]],a[[5]])
compares the first and last groups using 20% trimmed means. The p-value is 0.063.
Exercise 3 at the end of this chapter demonstrates that most other methods fail to
reject. Exceptions are a method for comparing medians, as well as a method based
on the quantile shift measure of effect size described in Sect. 3.6.2. It is left as an
exercise to verify that none of the other methods in this section reject at the 0.05
level.
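To make the computations behind yuen concrete, here is a minimal Python sketch of Yuen's test statistic and its estimated degrees of freedom, assuming 20% trimming. It is a translation of the standard formulas, not Wilcox's R code, and it omits the p-value and confidence interval:

```python
# Sketch of Yuen's test for two independent trimmed means, assuming 20% trimming.
import math

def trimmed_mean(x, tr=0.2):
    x = sorted(x)
    g = int(tr * len(x))              # number of values trimmed from each tail
    kept = x[g:len(x) - g]
    return sum(kept) / len(kept)

def winsorized_var(x, tr=0.2):
    x = sorted(x)
    g = int(tr * len(x))
    # Winsorize: pull each tail value in to the nearest retained value.
    w = [min(max(v, x[g]), x[len(x) - g - 1]) for v in x]
    m = sum(w) / len(w)
    return sum((v - m) ** 2 for v in w) / (len(w) - 1)

def yuen(x, y, tr=0.2):
    h1 = len(x) - 2 * int(tr * len(x))    # number of values left after trimming
    h2 = len(y) - 2 * int(tr * len(y))
    d1 = (len(x) - 1) * winsorized_var(x, tr) / (h1 * (h1 - 1))
    d2 = (len(y) - 1) * winsorized_var(y, tr) / (h2 * (h2 - 1))
    T = (trimmed_mean(x, tr) - trimmed_mean(y, tr)) / math.sqrt(d1 + d2)
    df = (d1 + d2) ** 2 / (d1 ** 2 / (h1 - 1) + d2 ** 2 / (h2 - 1))
    return T, df
```

A p-value would then be obtained from Student's t distribution with df degrees of freedom.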
3.2 Methods Dealing with P (X1 < X2 ) and the Typical Difference

This section deals with two related methods that are based on the typical difference
between .X1 , a randomly sampled participant from the first group, and .X2 , a
randomly sampled participant from the second group. The first deals with

P = P (X1 < X2 ) + 0.5P (X1 = X2 ).

The last term, .0.5P (X1 = X2 ), is included to deal with situations where tied values
can occur. When there are no tied values, P is simply the probability that a randomly
sampled value from the first distribution is less than a randomly sampled value from
the second distribution. As is evident, this is the same as the probability that .X1 −
X2 < 0. Two basic goals are testing

H0 : P = 0.5

and computing a confidence interval for P. Let .Dij = Xi1 − Xj 2 (.i = 1, . . . , n1 ;
.j = 1, . . . , n2 ) denote the differences between all pairs of observations. The estimate of
.P (X1 < X2 ) is simply the proportion of the .Dij values that are less than zero. And
the estimate of .P (X1 = X2 ) is the proportion of the .Dij values that are equal to
zero. The resulting estimate of P is labeled .P̂ .
Wilcoxon (1945) derived a classic rank-based method for comparing two groups
that is routinely taught. The same method was derived by Mann and Whitney (1947).
In effect, when there are no tied values, it can be viewed as a method for testing
.H0 : P = 0.5, which corresponds to testing the hypothesis that the median of the
distribution of .D = X1 − X2 , the typical difference, is zero.
Let .θD denote the population median of D, which is estimated by .θ̂D , the median
of the .Dij values. Now, the mean of the .Dij values is equal to .X̄1 − X̄2 , the difference
between the sample means. However, in general, .θ̂D ≠ M1 − M2 , the difference
between the medians.
Example To illustrate the last point, .n1 = 10 values were generated from a
standard normal distribution and .n2 = 20 values were generated from the lognormal
distribution in Fig. 1.3. The difference between the sample medians was .M1 −M2 =
−1.086 and the median of the .Dij values was .θ̂D = −1.546.
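The distinction between the median of the pairwise differences and the difference between the medians is easy to check numerically. A small Python sketch with illustrative data (not the generated data from the example):

```python
# The median of all pairwise differences is generally not equal to the
# difference between the medians; both quantities, plus the estimate of P.
def median(v):
    s = sorted(v)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else (s[m - 1] + s[m]) / 2

x, y = [1, 2, 10], [0, 2, 4]
D = [xi - yj for xi in x for yj in y]       # all pairwise differences D_ij
theta_D = median(D)                          # typical difference: 1
med_diff = median(x) - median(y)             # difference of the medians: 0
# Estimate of P, counting ties with weight 0.5:
P_hat = (sum(d < 0 for d in D) + 0.5 * sum(d == 0 for d in D)) / len(D)
```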
The R function

cidv2(x,y,alpha=0.05,plotit=FALSE)

applies Cliff’s method for making inferences about P. The R function

loc2dif.ci(x,y,est=median,alpha=0.05)

computes a confidence interval for .θD based on the measure of location indicated
by the argument est, which defaults to the median.
Example Consider data consisting of skull measurements taken during different
time periods, which include measures of breadth and height. It is left as an exercise to show that all of the methods described in this
section have p-values less than 0.01 when comparing 4000 BC to 150 AD. However,
when the measures of breadth are compared for 1850 BC and 150 AD, none of
the methods in this section reject at the 0.05 level. The p-values using Yuen’s
method with 20% trimming, the percentile bootstrap method with 20% trimming,
the medians using a percentile bootstrap method, and Cliff’s method are 0.281, 0.303,
0.293, and 0.160, respectively. To underscore an important point, compare these
results to the example in Sect. 3.3.3.
3.3 Comparing Quantiles Other than the Median

Several methods have been proposed and studied that are aimed at comparing
quantiles other than the median (e.g., Wilcox, 2022a, Section 5.1.5). For example,
let .θj denote the 0.25 quantile of the j th group. If .θ1 = 6 and .θ2 = 8, this means
that 75% of the participants in the first group have values greater than or equal to 6,
while 75% of the participants in the second group have values greater than or equal
to 8. A goal is to test

H0 : θ1 = θ2 . (3.11)
And there is the related goal of computing a confidence interval for .θ1 − θ2 . The
focus here is on two methods that currently appear to perform relatively well in
terms of controlling the Type I error probability.
3.3.1 Method Q2
The first method, labeled Q2, simply uses a percentile bootstrap method in
conjunction with either the Harrell–Davis estimator or the trimmed Harrell–Davis
estimator. That is, given some quantile of interest, proceed as described in Sect. 3.1.2
but with the trimmed mean replaced by some estimate of the quantile of interest. One
advantage of using these estimators is that, combined with a percentile bootstrap
method, they are able to handle tied values. As mentioned in Sect. 2.1.3, some
quantile estimators are based on a weighted average of only two values. Currently,
however, using the Harrell–Davis estimator or the trimmed Harrell–Davis estimator
appears to be better at controlling the Type I error probability. It is noted, though,
that when dealing with the more extreme quantiles, sample sizes greater than 20 can
be needed to get accurate confidence intervals. For example, when dealing with the
0.9 quantile, both sample sizes should be greater than or equal to 40. With a common
sample size of 30, and when sampling from skewed distributions, the actual Type I
error probability can be greater than 0.075 when testing at the 0.05 level.
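A rough Python sketch of the percentile bootstrap strategy underlying Q2 follows. A plain order-statistic estimator stands in for the Harrell–Davis estimator, so this illustrates the bootstrap logic only, not method Q2 itself:

```python
# Percentile bootstrap for the difference between two groups' q quantiles.
# A simple order-statistic quantile estimator replaces Harrell-Davis here.
import math
import random

def quantile(v, q):
    s = sorted(v)
    return s[min(len(s) - 1, max(0, math.ceil(q * len(s)) - 1))]

def boot_quantile_diff(x, y, q=0.8, nboot=2000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    # Resample each group with replacement; record the estimated difference.
    diffs = sorted(
        quantile(rng.choices(x, k=len(x)), q) - quantile(rng.choices(y, k=len(y)), q)
        for _ in range(nboot)
    )
    lo = diffs[int(alpha / 2 * nboot)]
    hi = diffs[int((1 - alpha / 2) * nboot) - 1]
    # Generalized p-value: proportion of bootstrap differences above zero,
    # counting ties with weight 0.5.
    pstar = (sum(d > 0 for d in diffs) + 0.5 * sum(d == 0 for d in diffs)) / nboot
    pvalue = 2 * min(pstar, 1 - pstar)
    return lo, hi, pvalue
```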
3.3.2 The Shift Function

The second approach, known as a shift function, was derived by Doksum and
Sievers (1976). For convenience, momentarily assume there are no tied values. Then
the smallest value in the first group is an estimate of the .1/n1 quantile of the first
group. The next smallest value is an estimate of the .2/n1 quantile of the first group.
And in general, if the values of the first group are put in ascending order, the ith
value estimates the .i/n1 quantile simply because the proportion of values less than
or equal to ith value is equal to .i/n1 . The idea is to plot the .i/n1 quantile of the first
group, versus the difference between the .i/n1 quantile of the second group minus
the .i/n1 quantile of the first group. For each of the .n1 quantiles, the hypothesis is
that the .i/n1 quantile of the first group is equal to the .i/n1 quantile of the second
group. And there is the related issue of computing confidence intervals. Written a
bit more formally, if .θj represents the qth quantile of the j th group, the goal is to
compute a confidence interval for .θ2 − θ1 for every quantile

q = 1/n1 , 2/n1 , . . . , 1,

and, for each of these quantiles, to test

H0 : θ2 − θ1 = 0. (3.12)
Note that as the number of hypotheses being tested increases, the probability of
committing one or more Type I errors increases as well. This raises the issue of how
to control the probability of one or more Type I errors, a topic that is discussed in
more detail in Chap. 5. A related goal is computing two or more confidence intervals
with the property that simultaneously, all of the confidence intervals contain the
parameter of interest with probability .1 − α. Assuming random sampling only,
the Doksum–Sievers method provides a confidence interval for .θ2 − θ1 for all of the
quantiles. Their method is based on an extension of the Kolmogorov–Smirnov
test, which is a technique for testing the hypothesis that two distributions
are identical. The Doksum–Sievers method provides a confidence interval for
the difference between the medians, the difference between the lower and upper
quartiles (the 0.25 and 0.75 quantiles), the difference between the deciles (the 0.1,
0.2, ..., 0.9 quantiles), and in fact all other quantiles as well. Moreover, the
probability that all of these confidence intervals contain the true differences can
be determined exactly, assuming random sampling only. This includes situations
where there are tied values. The resulting confidence intervals are known as an S
band. Getting confidence intervals where the actual probability coverage is exactly
0.95 is impossible due to the discrete nature of the distribution of the test statistic.
However, a method for determining a critical value, so that the probability coverage
is as close as possible to 0.95, is available using algorithms summarized in Wilcox
(2022a, Section 5.1.1). These algorithms include a technique that deals with tied
values. There are well-known methods for approximating the critical value used
by the Kolmogorov–Smirnov test, which are also used by the S band, but these
approximations are no longer needed.
An advantage of the shift function is that it provides a detailed description
of where and how much two distributions differ. However, a possible concern
with the S band is that it can have poor power when comparing the tails of two
distributions. A weighted version of the S band, basically a weighted variation of
the Kolmogorov–Smirnov test, is one way of addressing this issue. The resulting
confidence intervals are known as a W band. Another approach is to use Q2, which
can have more power than the S band, especially when differences occur in the tails
of the distributions. Evidently, there are no results comparing Q2 to the W band.
The R function

qcomhd(x,y,q=c(0.1,0.25,0.5,0.75,0.9),nboot=2000,plotit=TRUE,MC=FALSE)

applies method Q2. By default the Harrell–Davis estimator is used. To use the
trimmed Harrell–Davis estimator, set the argument est=thd. The argument q
determines which quantiles are compared. If parallel processing is available, setting
MC=TRUE can reduce execution time.
The R function

sband(x,y,plotit=TRUE)

computes the shift function described in Sect. 3.3.2. (Earlier versions of this function
used an approximation of an appropriate critical value, but this approximation is
no longer needed.) When the argument plotit=TRUE, the function creates a
plot with the values stored in the first argument making up the x-axis. The y-axis
indicates the difference between the estimate of the quantiles of the second group
minus the estimates for the first group. For example, if the 0.2 quantile of the first
group is 8, and the estimate of the 0.2 quantile of second group is 10, the plot would
indicate the value .10 − 8 = 2 corresponds to the value 8 on the x-axis.
3.3 Comparing Quantiles Other than the Median 67
The quantiles for the first group are based on the unique values stored in the first
argument, x. For any value stored in x, say c, there is a certain proportion less than
or equal to c, say .q̂. This is taken to mean that c is the .q̂ quantile of the first group.
For example, if .n1 = 20 and the smallest value is 32, then 32 is taken to be the
.1/20 = 0.05 quantile. If the next largest value is 36, then 36 is taken to be the
.2/20 = 0.1 quantile. If the three smallest values are equal to 6, then 6 is taken to be
the .3/20 = 0.15 quantile. The function also indicates how many significant differences
were found, and it reports confidence intervals for each quantile corresponding to
the first group.
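The point estimates underlying the shift function can be sketched as follows. This is an illustration only; the S band's critical values and confidence intervals are not computed here:

```python
# Shift-function point estimates: for each i/n1 quantile of the first group,
# the estimated difference (second group minus first group).
import math

def empirical_quantile(v, q):
    s = sorted(v)
    return s[max(0, math.ceil(q * len(s)) - 1)]

def shift_estimates(x, y):
    xs = sorted(x)
    n1 = len(xs)
    # The ith ordered value of x estimates the i/n1 quantile of group 1.
    return [(xs[i - 1], empirical_quantile(y, i / n1) - xs[i - 1])
            for i in range(1, n1 + 1)]
```

When the second group is a pure shift of the first, every estimated difference equals the shift.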
Example The skull data used in the example at the end of Sect. 3.2.1 are considered
again where the goal is to compare 1850 BC data and 150 AD data, only now method
Q2 is used to compare the 0.2, 0.5 and 0.8 quantiles. The results are
q n1 n2 est.1 est.2 est.1_minus_est.2 ci.low ci.up
[1,] 0.2 30 30 130.9685 131.1832 -0.2147814 -3.870219 3.7772626
[2,] 0.5 30 30 135.4109 136.6778 -1.2669751 -4.175657 1.6764197
[3,] 0.8 30 30 137.4937 140.6865 -3.1928051 -6.571679 -0.1242277
p-value adj.p.value
[1,] 0.9720 0.9720
[2,] 0.2605 0.5210
[3,] 0.0115 0.0345
As can be seen, the p-value when comparing the 0.8 quantiles is 0.0115. In
contrast, when comparing these two groups based on the four measures of location
used in Sect. 3.2.1, the lowest p-value is 0.16. That is, no decision is made at the
0.05 level about which group has the larger median or 20% trimmed mean, and no
decision is made about whether P is greater or less than 0.5. But the data indicate
that for the right tails of the distributions, skull breadth is larger for the year 150
AD. Roughly, there is a sense that larger breadth measures occur in the year 150
AD. The column headed by adj.p.value refers to an adjustment to the p-values so
that the probability of one or more Type I errors is less than or equal
to the nominal Type I error probability indicated by the argument alpha, which
defaults to 0.05. The adjustment is based on Hochberg’s method, which is described
in Chap. 5.
The R function

wband(x,y)

is exactly like the R function sband, only the weighted version is used.
The R function

g5plot(x1,x2=NULL,x3=NULL,x4=NULL,x5=NULL)

can be used to plot up to five distributions. Using the skull data described in
the previous example, Fig. 3.1 shows estimates of the distributions of the breadth
measures; the solid line is the estimate for year 1850 BC. This provides perspective
on where the distributions differ. Note that for the 1850 BC data, the plot stops at
about 140 in contrast to the 150 AD data.
Example The shift function is illustrated with data dealing with weight gain in
newborns who weighed at least 3500 grams at birth (Salk, 1973). The experimental
group consisted of newborns who were continuously exposed to the sound of
a mother’s heartbeat. The data are stored in the file salk_dat.txt. Data for the
experimental group are stored in column one and the data for the control group
are stored in column 2. Figure 3.2 shows the plot returned using the R function
sband. The .+ indicates the median, and the two o’s indicate the lower and upper
quartiles of the first group. The dashed lines indicate the confidence band for the
difference between the quantiles. Based on the critical value that was used, the exact
probability that the dashed lines contain all true differences is reported in the output
labeled pc, which is 0.978 for the data used here. Note that the lower dashed line
stops at about the value .−50 on the x-axis. This is because the lower end of the
confidence intervals extend to .−∞. Where this lower dashed line goes above zero
is the region where a significant result is obtained. In this case, a decision is made
that the difference exceeds zero in the region extending from the 0.389 quantile to
the 0.472 quantile of the first group. The plot suggests that the difference between
the quantiles is largest in the far left of the x-axis and that the difference decreases
moving left to right up to about the value 0. That is, the largest differences are
estimated to occur among newborns in the control group who lose weight. But the
confidence intervals make it clear that there is a great deal of uncertainty about the
extent to which this is true.
Fig. 3.2 The S band for the weight-gain data. The y-axis indicates Est. 2 − Est. 1
3.4 Comparing the Probability of Success

Now consider binary data taking on the values 0 (failure) and 1 (success). The first
goal in this section is to deal with testing

H0 : p1 = p2 , (3.13)
the hypothesis that the probability of success is the same for both groups. Or in terms
of Tukey’s three-decision rule, the goal is to determine whether it is reasonable to
decide which group has the larger probability of success. And there is the goal of
computing a confidence interval for .p1 − p2 . As usual, numerous methods have
been derived (e.g., Wilcox, 2022a, Section 5.8). Here attention is focused on the
two methods that have been found to perform relatively well.
The first was derived by Storer and Kim (1990). Let .Pj (x) be the probability of
exactly x successes in .nj trials from group j (.j = 1, 2). As noted in an introductory
course,
Pj (x) = C(nj , x) pj^x (1 − pj )^(nj − x) , (3.14)

where .C(nj , x) is the binomial coefficient, the number of ways of choosing x items
from .nj .
Let .p̂j denote the proportion of successes corresponding to the j th group. The
Storer–Kim method is based on a simple idea. First, assume that the hypothesis is
true. Next, estimate this assumed common probability of success and label it .p̃. The
estimate is simply the total number of successes in both groups divided by .n1 + n2 ,
the total sample size. Next, assume that the probability of success for both groups is
equal to .p̃. Because the groups are independent, the probability that the first group
has x successes and simultaneously the second group has y successes is .P1 (x)P2 (y)
where now .p1 and .p2 in (3.14) are taken to be .p̃. The strategy is to determine how
unusual it is to observe the value .|p̂1 − p̂2 | when the null hypothesis is true. With
this in mind, let .Sxy = P1 (x)P2 (y) if

|x/n1 − y/n2 | ≥ |p̂1 − p̂2 |.
Otherwise, let .Sxy = 0. Computing .Sxy for all possible values for x and y, and
adding the results, yields an estimate of the probability of getting estimates for .p1
and .p2 that are greater than or equal to .|p̂1 − p̂2 | when the null hypothesis is true.
In essence, a p-value is given by
Σx Σy Sxy . (3.15)
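The Storer–Kim p-value can be computed directly from (3.14) and (3.15). A compact Python sketch (a translation of the formulas, not Wilcox's R code):

```python
# Storer-Kim p-value for H0: p1 = p2, computed by enumerating all (x, y)
# outcomes under the estimated common success probability.
from math import comb

def storer_kim(r1, n1, r2, n2):
    ptilde = (r1 + r2) / (n1 + n2)       # estimated common success probability
    delta = abs(r1 / n1 - r2 / n2)       # observed |p1_hat - p2_hat|
    def pmf(x, n):
        # Binomial probability of x successes in n trials under ptilde.
        return comb(n, x) * ptilde ** x * (1 - ptilde) ** (n - x)
    # Sum P1(x) * P2(y) over outcomes at least as extreme as observed.
    return sum(pmf(x, n1) * pmf(y, n2)
               for x in range(n1 + 1) for y in range(n2 + 1)
               if abs(x / n1 - y / n2) >= delta)
```

With equal observed proportions every outcome qualifies, so the p-value is 1; with maximally different proportions the p-value is essentially zero.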
The second method, derived by Kulinskaya et al. (2010), stems from an arcsine,
variance-stabilizing transformation and yields the .1 − α confidence interval for
.p1 − p2 given by (3.16). Let .rj denote the number of successes in the j th group
and let .N = n1 + n2 . The confidence interval is a function of .z1−α/2 , the .1 − α/2
quantile of a standard normal distribution, together with an estimate of .p1 − p2
given by .(r1 + 0.5)/(n1 + 1) − (r2 + 0.5)/(n2 + 1), as well as the quantities
.ψ̂ = 0.5[(r1 + 0.5)/(n1 + 1) + (r2 + 0.5)/(n2 + 1)], .v̂ = (1 − 2ψ̂)(0.5 − n2 /N ),
and .ŵ = 2uψ̂(1 − ψ̂) + v̂ 2 , where u depends on the relative sample sizes .n1 /N and
.n2 /N . The computational details are not important for present purposes.
Notice that the methods just described provide another way of comparing two
discrete random variables, each having a relatively small sample space. That is, a
limited number of values are possible. Suppose, for example, two groups are asked
to rate a product on a scale from 0 to 4, in which case there are only five possible
responses. Note that the two groups could be compared based on the probability of
getting the response 0. Of course, the same can be done for the responses 1, 2, 3,
and 4. More generally, one can test

H0 : P (X1 = x) = P (X2 = x), (3.17)

where x is any possible value that might occur and .Xj indicates the number of
successes for group j . This can provide details of how the groups compare that
are missed when using a single measure of location. In essence, the goal is to
compare the cell probabilities of two independent multinomial distributions. This
can be accomplished with the methods in this section, which provide an analog of
the shift function when dealing with discrete data.
The R function binom2g tests the hypothesis given by (3.13). By default it uses the
Storer–Kim method, set the argument method=‘SK’.
Example Imagine the first group has 12 successes in 30 trials and the second group
has 20 successes in 35 trials. The R command

binom2g(12,30,20,35)

would test (3.13) using the method derived by Kulinskaya et al. (2010). Setting
the argument method=‘SK’, the Storer–Kim method would be used instead. If
the data are stored in R objects x1 and x2, containing just the values 0 and 1, the
command
binom2g(x=x1,y=x2)
tests the hypothesis given by (3.17). By default, the individual probabilities are
compared using the SK method. Setting the argument KMS=TRUE, method KMS is
used. If plotit=TRUE, the function plots the relative frequencies for all distinct
values found in each of the two groups.
The R function
splotg5(x1,x2=NULL,x3=NULL,x4=NULL,x5=NULL,xlab=‘X’,ylab=‘Rel. Freq.’)
plots the relative frequencies for all distinct values found in two or more groups. The
function is limited to a maximum of five groups. With op=TRUE, a line connecting
the points corresponding to the relative frequencies is formed (Fig. 3.3).
Example A study dealing with a panic disorder is used to illustrate the R function
binband. An experimental group was given clomipramine, and a control group was
given a placebo.
Fig. 3.3 Relative frequencies for the two groups. The solid line corresponds to the
clomipramine group. Note that clomipramine is more effective at avoiding high
panic scores
Fig. 3.3 shows a plot of the results. The solid line is for the experimental group.
The plot indicates that the estimated likelihood of a high panic score is relatively
small for the clomipramine group. That is, the results indicate that clomipramine is
effective at avoiding high panic scores. The left portion of Fig. 3.3 indicates that low
panic scores are more likely for the experimental group. But while a decision can be
made that a panic score of 1 is more likely for the experimental group, no decision
is made about a panic score of 0, at least at the 0.05 level.
3.5 Comparing Measures of Dispersion

Numerous methods have been derived for testing

H0 : σ1² = σ2² , (3.18)

the hypothesis that two independent groups have identical variances. None are
completely satisfactory in terms of controlling the Type I error probability. (See, for
example, Wilcox, 2017a, Section 7.10; and Wilcox, 2022a, Section 5.5.1 for more
details.) Currently, a generalization of what is called the Morgan–Pitman test
(originally designed for comparing the variances of dependent groups) works
reasonably well provided the distributions do not differ substantially in terms of
skewness.
The R function

varcom.IND.MP(x,y,SEED=TRUE)

applies this method.
3.6 Measures of Effect Size

A variety of methods have been proposed for characterizing how groups differ
beyond using differences between measures of location. One approach is to use
the probability that a randomly sampled observation from the first group is less than
a randomly sampled observation from the second group as described in Sect. 3.2.
This section summarizes several other techniques.
The first approach compares groups based on a measure of effect size that is a
function of both a measure of location and dispersion. Certainly the best-known
version of this approach is
Fig. 3.4 For the normal distributions in the left panel, .δ = 0.8. For the mixed
normals in the right panel, .δ = 0.24
δ = (μ1 − μ2 )/σ, (3.19)
where, by assumption, .σ1 = σ2 = σ . That is, homoscedasticity is assumed. Perhaps
more importantly, this measure of effect size is not robust: a small change in the
distributions can result in a relatively large value of .δ being rendered relatively small.
Example The standard example of this last point is based on the mixed normal
distribution introduced in Chap. 1. For illustrative purposes, assume .δ = 0.2, 0.5 and
0.8 are viewed as small, medium and large effect sizes, respectively, as suggested by
Cohen (1988). The left panel of Fig. 3.4 shows two normal distributions, both having
.σ = 1 and means 0 and 0.8, in which case .δ = 0.8, which is being viewed as
large. The right panel shows two mixed normal distributions with the same means.
The distributions have an obvious similarity to normal distributions, but the standard
deviations of the mixed normals are both equal to 3.3. Consequently, .δ = 0.24,
which is being viewed as relatively small.
The usual estimate of .δ, known as Cohen’s d, is

d = (X̄1 − X̄2 )/s, (3.20)

where s is the pooled sample standard deviation. A robust analog, derived by
Algina et al., replaces the sample means with 20% trimmed means and s with a
Winsorized measure of dispersion:

dt = 0.642 (X̄t1 − X̄t2 )/SW , (3.21)

where the constant 0.642 rescales the estimate so that, under normality, .dt
estimates .δ,
and where

SW² = ((n1 − 1)s²w1 + (n2 − 1)s²w2 )/(n1 + n2 − 2),

with .s²wj the Winsorized variance of the j th group.
A heteroscedastic analog is based on

Vw² = ((1 − u)s²w1 + u s²w2 )/(u(1 − u)),

where u depends on the relative sample sizes. The resulting measure of effect size is

δ̂kms.t = 0.642 (X̄t1 − X̄t2 )/Vw . (3.22)
When there is no trimming, .δ̂kms.t reduces to the effect size used by Kulinskaya et al.
(2008). Henceforth, .δ̂kms.t is called the KMS measure of effect size.
Kulinskaya and Staudte (2006, p. 101) note that a natural generalization of .δ to
the heteroscedastic case does not appear to be possible without taking into account
the relative sample sizes. Also, under normality, and when the population variances
and sample sizes are equal, .δ = 2δkms.t , where .δkms.t is the population parameter
being estimated by .δ̂kms.t .
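A Python sketch of the computation behind (3.21), assuming 20% trimming and Winsorizing (a translation of the displayed formulas, not the WRS code):

```python
# AKP-style robust effect size: trimmed means over a pooled Winsorized
# standard deviation, rescaled by 0.642 so the estimate targets delta
# under normality with 20% trimming.
import math

def trimmed_mean(x, tr=0.2):
    s = sorted(x)
    g = int(tr * len(s))
    kept = s[g:len(s) - g]
    return sum(kept) / len(kept)

def winsorized_var(x, tr=0.2):
    s = sorted(x)
    g = int(tr * len(s))
    w = [min(max(v, s[g]), s[len(s) - g - 1]) for v in s]  # Winsorize the tails
    m = sum(w) / len(w)
    return sum((u - m) ** 2 for u in w) / (len(w) - 1)

def akp_effect_size(x, y, tr=0.2):
    n1, n2 = len(x), len(y)
    # Pooled Winsorized variance, as in the definition of SW^2.
    sw2 = ((n1 - 1) * winsorized_var(x, tr) + (n2 - 1) * winsorized_var(y, tr)) \
          / (n1 + n2 - 2)
    return 0.642 * (trimmed_mean(x, tr) - trimmed_mean(y, tr)) / math.sqrt(sw2)
```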
Note that rather than test the hypothesis that two groups have a common trimmed
mean, one could test

H0 : δkms.t = 0. (3.23)
And a confidence interval for .δkms.t is of interest as well. These two goals can be
achieved with a percentile bootstrap method (Wilcox, 2022b).
This section describes an extension of the quantile shift measure of effect size Q
described in Sect. 2.6. It is convenient to focus on .θD , the median of all possible
differences, .Dij , introduced in Sect. 3.2. The basic idea is to determine how unusual
.θD happens to be relative to the distribution where .H0 : θD = 0 is true. This is done
by centering the .Dij values so that they have a median of zero. Let .Iij = 1 if
.Dij − θ̂D ≤ θ̂D ; otherwise .Iij = 0. Then

Q̂ = (1/(n1 n2 )) Σ Iij . (3.24)
That is, .Q̂ is the proportion of difference scores, centered to have a median of zero,
which are less than or equal to the median of the uncentered difference scores.
If when dealing with a normal distribution, Cohen’s .d = 0.2, 0.5 and 0.8 are
considered small, medium, and large effect sizes, this corresponds to .Q = 0.55,
0.64 and 0.71, respectively. In a similar manner, if .d = −0.2, .−0.5 and .−0.8 are
considered small, medium, and large effect sizes, this corresponds to .Q = 0.45,
0.36 and 0.29, respectively. Again, a percentile bootstrap method has been found to
perform well when testing
H0 : Q = 0.5. (3.25)
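The estimate (3.24) is easily computed directly. A Python sketch:

```python
# Quantile shift estimate: the proportion of centered pairwise differences
# that are at or below the median of the uncentered differences.
def median(v):
    s = sorted(v)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else (s[m - 1] + s[m]) / 2

def quantile_shift(x, y):
    D = [xi - yj for xi in x for yj in y]    # all pairwise differences
    theta_D = median(D)
    # Center the differences to have median zero, then compare to theta_D.
    return sum((d - theta_D) <= theta_D for d in D) / len(D)
```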
Another approach is based on what is called explanatory power. Let Y denote a
randomly sampled observation, and let .Ŷ denote the predicted value of Y based on
which group the observation came from. Then explanatory power is

ξ² = VAR(Ŷ )/VAR(Y ). (3.26)
It is the square root of .ξ 2 , .ξ , that is used as a measure of effect size. Under normality,
when .δ given by (3.19) is equal to 0.2, 0.5 and 0.8, the corresponding values for
.ξ , based on a 20% trimmed mean, are 0.14, 0.34, and 0.52, respectively. To add
perspective, it is noted that Cohen (1988) suggests that as a general guide, Pearson’s
correlation .ρ = 0.1, 0.3, and 0.5 correspond to small, medium, and large values,
respectively.
For the situation at hand, if a value is randomly sampled from the j th group, the
predicted value is .θj , where now .θj is any measure of location that is of interest. The
variance based on .θ1 and .θ2 corresponds to the numerator of .ξ 2 . The denominator
is the variance of the pooled distributions.
When the sample sizes are equal, estimating .ξ 2 is straightforward. First,
estimate the measures of location and compute the variance based on these two
values yielding an estimate of the numerator of .ξ 2 . Next, pool the data and estimate
some robust measure of dispersion. When dealing with unequal sample sizes, there
are estimation issues that can be addressed as described in Wilcox (2022a), but the
details are not described here. What is more important here is understanding the
nature of this measure of effect size.
Currently, the version of .ξ that has gotten the most attention is based on a 20%
trimmed mean and a 20% Winsorized variance that has been rescaled to estimate the
population variance when dealing with a normal distribution. Another possibility is
to use the median or an M-estimator coupled with some robust measure of dispersion
such as the percentage bend midvariance in Sect. 2.2.1.
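For equal sample sizes, the estimation steps just described can be sketched as follows. For simplicity, this sketch uses the sample median and a plain variance for the pooled data, whereas in practice a rescaled Winsorized variance or another robust measure of dispersion would be used:

```python
# Equal-n sketch of explanatory power: the variance of the two group
# measures of location over the variance of the pooled data.
def variance(v):
    m = sum(v) / len(v)
    return sum((u - m) ** 2 for u in v) / (len(v) - 1)

def explanatory_power(x, y, loc=lambda v: sorted(v)[len(v) // 2]):
    # Numerator of xi^2: variance of the two predicted values (one per group).
    num = variance([loc(x), loc(y)])
    # Denominator: variance of the pooled data.
    den = variance(x + y)
    return (num / den) ** 0.5            # xi, the square root of xi^2
```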
The R function
ESfun(x,y,QSfun=median,method=c(‘EP’,‘QS’,‘QStr’,‘AKP’,‘WMW’,‘KMS’),tr=0.2,pr=TRUE,SEED=TRUE)
can be used to estimate any of six measures of effect size via the argument method,
which can have any of the following values:
• method=‘EP’: explanatory power.
• method=‘QS’: the quantile shift measure based on the measure of location
indicated by the argument QSfun. The default is the median.
• method=‘QStr’: the quantile shift measure based on a trimmed mean, with the
amount of trimming indicated by the argument tr.
• method=‘AKP’: the Algina et al. measure of effect size given by (3.21).
• method=‘WMW’: the measure of effect size given by (3.9), the probability that
a randomly sampled value from the first group is less than a randomly sampled
value from the second group.
• method=‘KMS’: the heteroscedastic measure of effect size given by (3.22).
The six measures of effect size computed by ESfun can be computed simultaneously
via the R function

ES.summary(x,y,...)
The argument NULL.V indicates the corresponding null values when testing
hypotheses. Consider again the measure of effect size .δ, given by (3.19). As
previously noted, .δ = 0.2, 0.5 and 0.8 are often taken to be small, medium, and large
effect sizes. But presumably there are situations where this is not the case. Suppose,
for example, it is argued that .δ = 0.1, 0.3 and 0.5 are small, medium, and large.
This can be indicated by setting the argument REL.M=c(0.1, 0.3, 0.5).
This raises the issue of what are the corresponding values for the six measures of
effect computed by ES.summary. These values are determined by the function, or
they can be specified via the argument REL.MAG.
The R function

ES.summary.CI(x,y,...)

estimates the same measures of effect size as ES.summary, but it also computes
confidence intervals and tests hypotheses based on the null values indicated by the
argument NULL.V.
Example The file diet.csv contains a collection of measures related to three diets
and weight loss. The data are also stored in the R object diet, which can be accessed
via the WRS2 package. Column 4 contains “A,” “B,” or “C” indicating which diet
was used. The command
a=fac2list(diet[,7],diet[,4])
stores the data in the R object a having list mode. Here a[[1]], a[[2]], and
a[[3]] contain the data for diets A, B and C, respectively. The command
ES.summary.CI(a[[1]],a[[3]],REL.M = c(0.1,0.3,0.5))

computes the measures of effect size, along with confidence intervals, when
comparing diets A and C.
3.7 Exercises
1. Imagine that 0.2, 0.5, and 0.8 are taken to be small, medium, and large values
for the AKP measure of effect size, respectively. If the AKP measure of effect
size is estimated to be 0.5, the p-value is 0.02, and the 0.95 confidence interval is
(0.1, 0.9), interpret the results in terms of Tukey’s three-decision rule.
2. The last example in Sect. 3.1.3 compared two education groups based on
depressive symptoms using Yuen’s method. Compare the same groups using
a one-step M-estimator, medians using a percentile bootstrap method and
measures of effect size using the R function ES.summary.CI.
3. Repeat Exercise 2 only now use the shift function described in Sect. 3.3.2 to
compare groups. How do the results compare to the results in Exercise 1?
4. For the diet data used in the example at the end of Sect. 3.6.4, plot the shift
function when comparing the first and third groups. The plot suggests that as
the amount of weight loss in the first group increases from 4 to 8, the estimated
difference between the quantiles decreases. Explain why it is unjustified to
conclude that this is the case.
5. Execute the following commands using R:
set.seed(58)
x1=ghdist(50,g=1)+0.5
x2=ghdist(75)
ES.summary.CI(x1,x2)
The second command generates data from a lognormal distribution that has
median equal to 0.5. The third command generates data from a standard
normal distribution. Comment on the p-values and effect sizes returned by
ES.summary.CI(x1,x2).
6. The file shoulder_pain.tex contains data on shoulder pain after surgery. The
goal was to compare an active treatment to no treatment. Measures of pain
were taken at three times. The first three columns in the file are times 1–3 for
the active treatment. The final three columns are times for no treatment. Use
binband to compare the outcomes taken at time 3. Comment on the results.
Note: given how the data are stored, the first two lines in the file need to be
skipped when using read.table.
7. A study was performed where the goal was to investigate the degree to which
smokers experience negative emotional responses (such as anger and irritation)
upon being exposed to antismoking warnings on cigarette packets. Smokers
were randomly allocated to view warnings that contained only text, such
as “Smoking Kills,” or warnings that contained text and graphics, such as
18. An example in Sect. 3.3.3 dealt with Salk’s data dealing with newborns.
Assuming the data are stored in the R object salk, the R command
sband(salk[,2],salk[,1]) yielded 14 significant results. What
happens when using the command sband(salk[,1],salk[,2])?
Chapter 4
Comparing Two Dependent Groups
This chapter deals with comparing two dependent groups. For example, participants
could be measured before and after receiving treatment for some medical condition.
As another example, husbands and wives might be compared based on their political
attitudes. In both cases, if .θ̂j is some location estimator for the j th group, there is
the issue of taking into account any association between .θ̂1 and .θ̂2 when making
inferences about how .θ1 and .θ2 , the population measures of location, compare.
There is a simple and rather trivial way of addressing this issue when dealing
with means. Consider two random variables, .X1 and .X2 , that might be dependent.
Let .μ1 and .μ2 denote the population means of .X1 and .X2 , respectively. Let .D =
X1 − X2 . The population mean of D is simply .μD = μ1 − μ2 . That is, despite the
possible dependence between .X1 and .X2 , testing the hypothesis

H0 : μ1 = μ2 (4.1)

is the same as testing

H0 : μD = 0. (4.2)
Based on a random sample of n participants, each measured at two different
times, .Xi1 is the ith measure taken at time 1 and .Xi2 is the ith measure taken at time
2. Let .Di = Xi1 − Xi2 (.i = 1, . . . , n). A convenient feature of the sample mean
is that .D̄ = X̄1 − X̄2 . That is, the sample mean of the difference scores is equal
to the sample mean of the data at time 1 minus the sample mean of data at time 2.
Assuming normality, this leads to the classic paired Student’s t test:
T = √n D̄/sD , (4.3)

where .s²D = Σ(Di − D̄)²/(n − 1) is the sample variance of the difference scores.
As noted in a basic course, if the Type I error probability is set at .α, reject the null
hypothesis if .|T | ≥ t, where t is the .1 − α/2 quantile of Student’s t distribution with
.n − 1 degrees of freedom.
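In other words, the paired test is just a one-sample Student's t test applied to the difference scores. A minimal sketch (in Python rather than R, with made-up data) illustrates the computation:

```python
import math
from statistics import mean, stdev

def paired_t(x1, x2):
    # T = sqrt(n) * Dbar / s_D, computed from D_i = X_i1 - X_i2
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    return math.sqrt(n) * mean(d) / stdev(d), n - 1  # statistic and df

# Hypothetical before/after measurements on n = 8 participants
before = [18, 6, 2, 12, 14, 12, 8, 9]
after = [11, 15, 9, 12, 9, 6, 7, 10]
T, df = paired_t(before, after)
```

Reject at level α when |T| exceeds the 1 − α/2 quantile of Student's t distribution with n − 1 degrees of freedom, exactly as stated above.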
When dealing with trimmed means, one approach is to test

H0 : μt1 = μt2,   (4.4)
the hypothesis that the marginal trimmed means, meaning the trimmed means
associated with .X1 and .X2 , are equal. Under general conditions, this is not the same
as testing
H0 : μtD = 0,   (4.5)
the hypothesis that the population trimmed mean of the difference scores is zero.
As previously noted, .D̄ = X̄1 − X̄2 . There are exceptions, but in general, this
property does not extend to the robust location estimators considered here. For
example, .X̄t1 − X̄t2 , the difference between the trimmed means, is not equal to
.D̄t , the trimmed mean of difference scores. The same is true for the median and the
M-estimator.
Here is another perspective. As was done in Chap. 3, let Dij = Xi1 − Xj2,
i = 1, . . . , n; j = 1, . . . , n. That is, all pairwise differences are being used as
opposed to just the n paired difference scores used by the paired Student's t test.
When dealing with husbands and wives, X11 − X12 is the difference for the first
married couple that was randomly sampled. But here, the difference between all
males and all females is used. Now a goal of interest is testing

H0 : θD = 0,   (4.6)

where θD is some measure of location, such as the median, associated with the typical
difference. Also covered in this chapter are an analog of the shift function described
in Sect. 3.3, measures of effect size, beyond those in Chap. 2, that are particularly
well suited for dependent groups, methods for comparing measures of variation, and
methods for dealing with missing values.
4.1 Methods Based on the Marginal Distributions
This section deals with measures of location associated with the marginal distri-
butions. That is, the goal is to make inferences about some measure of location
associated with X1 and X2, say θ1 and θ2. In particular, there is the goal of testing

H0 : θ1 = θ2   (4.7)

and computing a confidence interval for θ1 − θ2. Given a random sample of n pairs
of values (X11, X12), . . . , (Xn1, Xn2), inferences are made based on θ̂1, an estimate
of θ1 based on X11, . . . , Xn1, and θ̂2, an estimate of θ2 based on X12, . . . , Xn2.
As was the case in Chap. 3, there are bootstrap methods and non-bootstrap methods
for comparing trimmed means. First, a non-bootstrap method is described followed
by a bootstrap-t method. Section 4.1.2 deals with a percentile bootstrap method in
the broader context of estimators that have a reasonably high breakdown point.
Let X̄t1 and X̄t2 denote the sample trimmed means. Because these trimmed
means are possibly dependent, the method in Chap. 3 is invalid. What is needed
is a method that estimates the standard error of X̄t1 − X̄t2 in a manner that takes
this dependence into account. A solution is obtained based in part on what is called
the Winsorized covariance. The data for both groups are Winsorized as is done in
Chap. 3, but in a manner that keeps the dependent observations paired together.
Here is an illustration of what this means using eight pairs of observations.
Xi1 : 18  6  2 12 14 12  8  9
Xi2 : 11 15  9 12  9  6  7 10

Wi1 : 14  6  6 12 14 12  8  9
Wi2 : 11 12  9 12  9  7  7 10
The Winsorized variances and the Winsorized covariance of the paired data are estimated with

dj² = (1/(h(h − 1))) Σ (Wij − W̄j)²,  j = 1, 2,

and

d12 = (1/(h(h − 1))) Σ (Wi1 − W̄1)(Wi2 − W̄2),

where h is the number of observations left after trimming and W̄j is the mean of the
Winsorized values for the j th group. Then √(d1² + d2² − 2d12) estimates the standard
error of X̄t1 − X̄t2, in which case a reasonable test statistic is

Ty = (X̄t1 − X̄t2) / √(d1² + d2² − 2d12).   (4.8)
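These computations are easy to verify directly. The following Python sketch (illustrative only; the book's R function yuend in Sect. 4.3.1 automates the method) reproduces the Winsorized pairs shown above with 20% trimming and computes Ty:

```python
from math import sqrt

def trim_mean(x, tr=0.2):
    # Trimmed mean: drop the g smallest and g largest values
    xs, g = sorted(x), int(tr * len(x))
    return sum(xs[g:len(xs) - g]) / (len(xs) - 2 * g)

def winsorize(x, tr=0.2):
    # Pull the g smallest values up and the g largest values down,
    # keeping each value in its original position so pairs stay paired
    xs, g = sorted(x), int(tr * len(x))
    lo, hi = xs[g], xs[len(xs) - g - 1]
    return [min(max(v, lo), hi) for v in x]

def yuen_dep_stat(x1, x2, tr=0.2):
    n, g = len(x1), int(tr * len(x1))
    h = n - 2 * g                       # observations left after trimming
    w1, w2 = winsorize(x1, tr), winsorize(x2, tr)
    m1, m2 = sum(w1) / n, sum(w2) / n
    d1 = sum((v - m1) ** 2 for v in w1) / (h * (h - 1))
    d2 = sum((v - m2) ** 2 for v in w2) / (h * (h - 1))
    d12 = sum((a - m1) * (b - m2) for a, b in zip(w1, w2)) / (h * (h - 1))
    return (trim_mean(x1, tr) - trim_mean(x2, tr)) / sqrt(d1 + d2 - 2 * d12)

x1 = [18, 6, 2, 12, 14, 12, 8, 9]
x2 = [11, 15, 9, 12, 9, 6, 7, 10]
Ty = yuen_dep_stat(x1, x2)
```

With these eight pairs, the Winsorized values match the Wi1 and Wi2 rows shown above.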
Next, generate a bootstrap sample by resampling with replacement n pairs of points from

(X11, X12), . . . , (Xn1, Xn2),

yielding

(X*11, X*12), . . . , (X*n1, X*n2).
Based on the bootstrap sample, compute C*ij = X*ij − X̄tj. This centers the bootstrap
distributions so that the null hypothesis is true. Let T*y be the value of Ty, given
by (4.8), based on the C*ij values. Repeat this process B times, yielding T*yb, b =
1, . . . , B. Let T*y(1) ≤ · · · ≤ T*y(B) be the T*yb values written in ascending order.
Set ℓ = αB/2 and u = (1 − α/2)B, rounding both to the nearest integer. Then an
estimate of the lower and upper critical values is T*y(ℓ+1) and T*y(u). An equal-tailed
1 − α confidence interval for μt1 − μt2 is

(X̄t1 − X̄t2 − T*y(u) √(d1² + d2² − 2d12), X̄t1 − X̄t2 − T*y(ℓ+1) √(d1² + d2² − 2d12)).   (4.10)

To get a symmetric confidence interval, replace T*yb by |T*yb|, its absolute value, and set
a = (1 − α)B, rounding to the nearest integer, in which case the 1 − α confidence
interval is

X̄t1 − X̄t2 ± T*y(a) √(d1² + d2² − 2d12).
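To make the index bookkeeping concrete, here is a small Python sketch of turning sorted centered bootstrap test statistics into the equal-tailed interval (4.10); the bootstrap values below are artificial, and the book's R functions handle all of this automatically:

```python
def boot_t_ci(est, se, tstar, alpha=0.05):
    # est: the estimate (e.g., difference between trimmed means)
    # se: its estimated standard error, sqrt(d1^2 + d2^2 - 2*d12)
    # tstar: the B centered bootstrap test statistics
    B = len(tstar)
    ts = sorted(tstar)
    ell = round(alpha * B / 2)          # lower index; ts[ell] is T*_(ell+1)
    u = round((1 - alpha / 2) * B)      # upper index; ts[u - 1] is T*_(u)
    return est - ts[u - 1] * se, est - ts[ell] * se

# Artificial example: B = 1000 evenly spaced "bootstrap" values
tstar = [b / 100 - 5 for b in range(1000)]
lo, hi = boot_t_ci(1.0, 0.5, tstar)
```

Note the reversal: the upper bootstrap critical value determines the lower endpoint, and vice versa.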
There is the issue of how to handle missing values. First consider the approach based
on difference scores. There are robust methods for imputing missing values (e.g.,
Branden & Verboven, 2009; Danilov et al., 2012). That is, the existing data are used
to estimate reasonable values for those that are missing. However, when the goal
is to test hypotheses and compute confidence intervals, there are concerns that this
approach can be unsatisfactory (e.g., Liang et al., 2008; Wang & Rao, 2002). A
better approach is to simply remove any pairs of values where one of the values is
missing.
However, when dealing with the marginal measures of location, there are two
methods that use all of the available data assuming that missing values occur at
random. To outline the first, called method M1, suppose there are m1 pairs for which
both values are available. Of course, the test statistic of the method in Sect. 4.1.1 can be
used to compare the trimmed means. Let m2 denote the number of pairs where the
first value is available but not the second, and let .m3 denote the number of pairs
where the first value is unavailable, but the second is available. These two sets of
data consist of independent groups and can be compared with the test statistic in
Sect. 3.1.1. There is a way of combining these two statistics into a single test statistic
for comparing the marginal trimmed means (Wilcox, 2022a, Section 5.9.13, method
M1). The R function rm2miss in Sect. 4.3.1 applies this method.
Note that a trimmed mean for the first group can be computed ignoring the second
group, regardless of whether there are any missing values for the second group. Of
course, the same is true for the second group: a trimmed mean can be computed
regardless of whether any values are missing from the first group. The same is
true when dealing with a bootstrap sample. In essence, missing values are easily
addressed when using a percentile bootstrap method, which here is called method M2.
This approach can be applied with the R function rmmismcp in Sect. 4.3.1.
Section 3.3 described a quantile shift function for comparing two independent
groups. Lombard (2005) derived an analog of the shift function based on the
marginal distributions. That is, the goal is to compare all of the quantiles of the
marginal distributions. Like the method in Sect. 3.3, it assumes random sampling
only. Moreover, the basic description given in Sect. 3.3 applies here. Roughly, the
ith smallest value in the first group is taken to be the .i/n quantile of that group,
which is then compared to an estimate of the .i/n quantile of the second group
(.i = 1, . . . , n). The R function lband in Sect. 4.3.1 applies this method.
Like the other methods in the previous section, a percentile bootstrap method works
well when dealing with the median of the typical difference. That is, the goal is to
test the hypothesis given by (4.6). The R function loc2dif in Sect. 3.2.1 can be
used to estimate .θD , the median of the typical difference, and loc2dif.ci can be
used to test hypotheses and compute a confidence interval.
4.3 The Sign Test
The well-known sign test is yet another way of comparing dependent groups. That
is, the goal is to make inferences about

P = P(X1 < X2),   (4.12)

the probability that for a randomly sampled pair of observations, the first is less than
the second. The method eliminates any pair of values where Xi1 = Xi2, leaving say
m pairs of values. Let I denote the number of times Xi1 < Xi2. That is, I is the
number of successes in m trials, in which case inferences about P can be made with
the methods in Sect. 2.4.
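Because I has a binomial distribution under random sampling, the computation is easy to sketch. Here is an illustrative Python version using an exact two-sided binomial p-value for H0: P = 0.5 (the methods in Sect. 2.4 may differ in detail, particularly when computing a confidence interval for P):

```python
from math import comb

def sign_test(x1, x2):
    # Drop ties, count I = #{X_i1 < X_i2} among the remaining m pairs,
    # and compute an exact two-sided binomial p-value for P = 0.5
    pairs = [(a, b) for a, b in zip(x1, x2) if a != b]
    m = len(pairs)
    i = sum(a < b for a, b in pairs)
    k = min(i, m - i)
    tail = sum(comb(m, j) for j in range(k + 1)) / 2 ** m
    return i, m, min(1.0, 2 * tail)

# Hypothetical paired data; two pairs are tied and get dropped
i, m, p = sign_test([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                    [0, 1, 2, 3, 4, 5, 6, 7, 9, 12, 11])
```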
The R function

yuend(x,y,tr=0.2,alpha=0.05)

tests (4.4), the hypothesis of equal marginal trimmed means, using the non-bootstrap
method in Sect. 4.1.1, and it computes a confidence interval for μt1 − μt2, the
difference between the trimmed means. As usual, the argument tr controls the
amount of trimming.
The R function

tests (4.4) using a bootstrap-t method. The number of bootstrap samples defaults to
nboot=599. Using side=FALSE results in an equal-tailed confidence interval,
while side=TRUE returns a symmetric confidence interval instead. Setting the
argument plotit=TRUE creates a plot of the bootstrap values. As was the case
in Chap. 3, this function is a relatively good choice when the amount of trimming is
relatively small.
Using a percentile bootstrap method, the R function

two.dep.pb(x,y=NULL,alpha=0.05,est=tmean,
method=c('TR','TRPB','MED','HDPB','AD','SIGN'))

can be used to compare dependent groups based on difference scores using any
of six methods: trimmed means based on the Tukey–McLaughlin method in
Sect. 2.3.1 (TR), trimmed means based on a percentile bootstrap method (TRPB),
the median using the method in Sect. 2.3.3 (MED), the median using the Harrell–Davis
estimator (HDPB), the typical difference as described in Sect. 3.2 (AD), and the
sign test (SIGN).
For convenience, the R function

is supplied for performing a sign test. If the argument y is not specified, it is assumed
that x is either a matrix with two columns corresponding to two dependent groups or
that x has list mode. The function computes the differences Xi1 − Xi2 (i = 1, . . . , n)
and then eliminates all differences that are equal to zero. Next, it determines the
number of pairs for which Xi1 < Xi2, and then it calls the R function binom.conf.
The R function Dqcomhd
compares the quantiles associated with the marginal distributions. By default, the
deciles are compared via the Harrell–Davis estimator. The argument q can be used
to specify alternative quantiles. For example, q=c(0.25, 0.5, 0.75) would
compare the quartiles.
The R function lband
applies the shift function for dependent groups. This function gives a detailed
indication of where and how the marginal distributions differ, and it controls the FWE
rate assuming random sampling only, but the R function Dqcomhd is likely to have
more power.
When dealing with missing values, the R function

rm2miss(x,y,tr=0)

applies method M1 for comparing the marginal trimmed means.
[Fig. 4.1: Kernel density estimates of the distribution of CESD scores. The solid line is the distribution before intervention; note that the right tail of the solid line is above the distribution after intervention. The x-axis is CESD (0–50) and the y-axis is Density.]
Chap. 5. The point here is that limiting comparisons to a single method might miss
an important difference detected by some other technique.
Notice that all of the methods aimed at comparing marginal measures of location
failed to reject at the 0.05 level. However, look at Fig. 4.1. Shown are kernel density
estimates of the distributions using the R function g5plot in Sect. 3.3.1. The solid
line is the distribution of the CESD scores before intervention. The centers of the
distributions appear to be fairly similar, but the right tails suggest that there is a
difference when looking at higher CESD scores. Here is the output from the R
function Dqcomhd:
       q  n1  n2    est.1    est.2 est.1_minus_est.2       ci.low    ci.up p-value adj.p.value
[1,] 0.7 326 326 17.47415 15.69255          1.781608 -0.160802301 3.448839   0.072       0.072
[2,] 0.8 326 326 22.20009 20.22397          1.976124  0.001230157 4.335330   0.050       0.072
[3,] 0.9 326 326 28.70465 24.63129          4.073360  1.506739641 6.977370   0.001       0.003
The results suggest that the distributions do differ in the upper quantiles. That is, in
terms of relatively high levels of depressive symptoms, the intervention program is
beneficial.
4.4 Comparing Measures of Dispersion
This section deals with testing

H0 : σ1² = σ2²,   (4.13)
the hypothesis that the marginal distributions have a common variance. The classic
method for dealing with this goal is the Morgan–Pitman test. Let
Ui = Xi1 − Xi2

and

Vi = Xi1 + Xi2

(i = 1, . . . , n), and let ρuv be the population value of Pearson's correlation between
U and V. It can be shown that when (4.13) is true, ρuv = 0. As noted in a basic
statistics course, the standard method for testing

H0 : ρuv = 0   (4.14)

is based on Student's t distribution.
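The classic (unmodified) version is easy to sketch. The following Python code (illustrative, with eight hypothetical pairs) computes Pearson's correlation between U and V and the usual Student's t statistic with n − 2 degrees of freedom; the modified test used in this book replaces this step with a heteroscedasticity-robust method:

```python
from math import sqrt

def morgan_pitman(x1, x2):
    # Classic Morgan-Pitman test: equal marginal variances correspond to
    # zero correlation between U = X1 - X2 and V = X1 + X2
    u = [a - b for a, b in zip(x1, x2)]
    v = [a + b for a, b in zip(x1, x2)]
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    r = suv / (su * sv)
    return r * sqrt((n - 2) / (1 - r ** 2))  # Student's t, n - 2 df

x1 = [18, 6, 2, 12, 14, 12, 8, 9]
x2 = [11, 15, 9, 12, 9, 6, 7, 10]
t = morgan_pitman(x1, x2)
```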
The R function

comdvar(x,y,alpha=0.05)

uses the modified Morgan–Pitman test to test the hypothesis of equal variances.
4.5 Measures of Effect Size
Measures of location and the sign test provide ways of characterizing the extent to which two
dependent groups differ. And when using difference scores, the measures of effect size
in Sect. 2.6 can be used. This section describes a few additional measures.
It can be shown that the variance of X1 − X2 is

τ² = σ1² + σ2² − 2σ12,

where σ12 is the covariance between X1 and X2. The covariance can be shown to
be σ1σ2ρ, where ρ is Pearson's correlation between X1 and X2. A robust version
of τ², τw², is obtained by replacing the variances and the covariance with Winsorized
variances and a Winsorized covariance that have been rescaled to estimate σ1², σ2², and σ12 when
sampling from a normal distribution. This suggests the measure of effect size
ω = √2 (μt1 − μt2) / τw.   (4.17)

The term √2 is included so that under normality, and when there is no association
(i.e., ρ = 0), ω is equal to the effect size given by (2.23) in Sect. 2.6, which is
estimated by Cohen's d.
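The role of the √2 term is easy to verify numerically. The following sketch (Python, with arbitrary values) checks that with ρ = 0 and a common variance, ω reduces to the usual standardized difference (μ1 − μ2)/σ:

```python
from math import sqrt, isclose

mu1, mu2, sigma = 1.3, 0.5, 1.2    # arbitrary illustrative values
rho = 0.0                          # no association
# Variance of the difference: tau^2 = sigma1^2 + sigma2^2 - 2*rho*sigma1*sigma2
tau = sqrt(sigma ** 2 + sigma ** 2 - 2 * rho * sigma * sigma)
omega = sqrt(2) * (mu1 - mu2) / tau
d = (mu1 - mu2) / sigma            # the effect size estimated by Cohen's d
assert isclose(omega, d)
```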
The R function
computes the measure of effect size given by (4.17). Currently, there are no results
on how to compute a confidence interval.
The example in Sect. 4.3.1 noted that based on difference scores, the p-value is
0.02 based on the Tukey–McLaughlin method with a 20% trimmed mean. Using a
percentile bootstrap method, the p-value is 0.013. It is left as an exercise to show
that the effect sizes returned by dep.ES.summary.CI are very small.
4.6 Exercises
scent and when they were not. The columns headed by U.Trial.1, U.Trial.2, and
U.Trial.3 are the times for no scent, which were taken on three different
occasions. Compare U.Trial.1 and U.Trial.3 using yuend, and two.dep.pb with
dif=FALSE as well as dif=TRUE. Comment on the results.
6. Repeat the last exercise only now compare U.Trial.1 to S.Trial.3 and comment
on the results.
7. The file cort_dat.txt contains cortisol measures taken upon awakening and again
30–45 minutes later. The data are contained in columns 2 and 3. Compare these
measures using the same R functions used in the previous two exercises and
comment on the results.
8. If the marginal distributions are identical, what is the shape of the distribution
of the difference scores?
9. The text described three ways of viewing how groups compare based on a robust
measure of location. The first used a measure of location based on the marginal
distributions, the second used difference scores, and the third used all pairwise
differences. Which method is most likely to have the most power?
10. Section 4.4 described a method for comparing variances based on a modification
of the Morgan–Pitman test. Describe a situation where this method fails to
control the Type I error probability.
11. When the marginal distributions are identical, difference scores have a symmetric
distribution about zero. Comment on the relative merits of using the mean
of the difference scores when testing the hypothesis that the population mean is
zero.
12. The sign test is sometimes criticized for having relatively low power. What are
some reasons for using it anyway?
13. Comment on the strategy of imputing missing values.
14. Comment on comparing the marginal medians when using the usual sample
median M in conjunction with a bootstrap-t method.
15. Summarize the relative merits of the R function lband versus Dqcomhd.
Chapter 5
Comparing Multiple Independent
Groups
This chapter extends the methods in Chap. 3 to situations where there are more
than two independent groups. Certainly the best-known and most commonly used
approach is to focus on some measure of location, say .θ . A very common strategy
when dealing with .J > 2 groups is to first test
H0 : θ1 = θ2 = · · · = θJ.   (5.1)

If this global hypothesis is rejected, the next step is typically to test

H0 : θj = θk   (5.2)

for every j < k. That is, do all pairwise comparisons. Note, however, that if each
of these tests is performed at the .α level, even if each test controls the Type I error
probability exactly at the .α level, if (5.1) is true, the probability of at least one Type I
error will be greater than .α. That is, as the number of tests increases, the more likely
it is that a Type I error will be made in the event that the groups have a common
measure of location. This raises the issue of controlling what is known as the family-
wise error (FWE) rate, meaning the probability of one or more Type I errors. A
basic course typically covers some classic methods for dealing with this issue such
as the Tukey–Kramer method, which is based on means assuming normality and
homoscedasticity. One goal here is to describe robust heteroscedastic versions of
these techniques.
Some comments are in order regarding Tukey's argument, in Sect. 1.2, that
surely the measures of location differ at some decimal place. From this point of view,
testing (5.1) is aimed at determining whether the sample sizes are large enough to
establish what is already known. From Tukey’s perspective, what is more interesting
is determining the extent to which it is reasonable to make a decision about whether
θj is less than or greater than θk for every j < k. However, even if this view is
accepted, the global test given by (5.1) is useful in the context of a step-down
multiple comparison procedure described in Sect. 5.3.4. In addition, testing the
global hypothesis given by (5.1) plays a useful role when making inferences about
certain measures of effect size, as will be seen in Sect. 5.1.5.
One more point is worth stressing. There is a tradition that one begins by testing
the global hypothesis (5.1), and if it rejects, perform pairwise comparisons of the
groups using a method aimed at controlling the FWE rate. However, the bulk of the
multiple comparison methods covered here are designed to control the FWE rate
without first rejecting the global hypothesis. Moreover, if these multiple comparison
methods are used only if the global hypothesis is rejected, this impacts their FWE
rate: it tends to go down. For example, if the method used is designed so that the
FWE rate is 0.05, the actual FWE rate will be less than 0.05 if it is used only when the
global hypothesis is rejected first. This in turn can lower the power. This issue was
first pointed out by Bernhardson (1975) in the context of the Tukey–Kramer method
and related techniques aimed at performing all pairwise comparisons.
This chapter begins with a one-way design where the goal is to test (5.1).
Included are measures of effect size that help characterize the extent to which the
groups, taken as a whole, differ. Then two-way designs are considered with a focus
on global hypothesis testing methods plus methods for comparing groups based
on robust measures of effect size. This is followed by a summary of some methods
for a three-way design. Finally, methods for performing multiple comparisons are
covered that include as a special case techniques aimed at controlling the FWE rate
when testing (5.2).
5.1 One-Way Global Tests
This section is focused on heteroscedastic methods for testing (5.1). First, two non-
bootstrap methods are described that are designed for 20% trimming or less. In
theory, these methods could be used with 25% or 30% trimming. The reason for
saying that the methods are designed for 20% trimming or less is that there are no
published results on how well they perform when using slightly more trimming.
However, it is known that these methods are unsatisfactory when comparing
medians because the standard errors are not estimated in an appropriate manner.
Included is a method that uses medians assuming that there are no tied values.
When dealing with the M-estimator or the MOM estimator, all indications are
that a bootstrap method is essential, at least when the sample sizes are small or even
moderately large. Just how large the sample sizes would have to be to justify a non-
bootstrap method is unknown. Included is a method based on medians that deals
with tied values. This section concludes with measures of effect size.
Welch (1951) derived a heteroscedastic method for testing (5.1) based on means, which is
readily extended to trimmed means (Wilcox, 1995a). The test statistic is computed
as follows: For the j th group, let

dj = (nj − 1)s²wj / (hj(hj − 1)),

where hj is the number of observations left after trimming and s²wj is the Winsorized
variance. Compute

wj = 1/dj,

U = Σ wj,

X̃ = (1/U) Σ wj X̄tj,

A = (1/(J − 1)) Σ wj(X̄tj − X̃)²,

and

B = (2(J − 2)/(J² − 1)) Σ (1 − wj/U)²/(hj − 1),

where all sums are over j = 1, . . . , J. The test statistic is

Ft = A/(B + 1),   (5.3)

which is referred to an F distribution with degrees of freedom

ν1 = J − 1

and

ν2 = [ (3/(J² − 1)) Σ (1 − wj/U)²/(hj − 1) ]⁻¹.
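For readers who want to check the formulas, here is an illustrative Python translation (the R function t1way, described below, performs these computations); the p-value step, which requires quantiles of the F distribution, is omitted:

```python
def trim_mean(x, tr=0.2):
    xs, g = sorted(x), int(tr * len(x))
    return sum(xs[g:len(xs) - g]) / (len(xs) - 2 * g)

def win_var(x, tr=0.2):
    # Winsorized variance s_w^2, with divisor n - 1
    xs, g = sorted(x), int(tr * len(x))
    lo, hi = xs[g], xs[len(xs) - g - 1]
    w = [min(max(v, lo), hi) for v in x]
    m = sum(w) / len(w)
    return sum((v - m) ** 2 for v in w) / (len(w) - 1)

def t1way_stat(groups, tr=0.2):
    # Welch-type heteroscedastic test statistic based on trimmed means
    J = len(groups)
    h = [len(x) - 2 * int(tr * len(x)) for x in groups]
    d = [(len(x) - 1) * win_var(x, tr) / (hj * (hj - 1))
         for x, hj in zip(groups, h)]
    w = [1 / dj for dj in d]
    U = sum(w)
    xt = [trim_mean(x, tr) for x in groups]
    xtilde = sum(wj * m for wj, m in zip(w, xt)) / U
    A = sum(wj * (m - xtilde) ** 2 for wj, m in zip(w, xt)) / (J - 1)
    S = sum((1 - wj / U) ** 2 / (hj - 1) for wj, hj in zip(w, h))
    B = 2 * (J - 2) / (J ** 2 - 1) * S
    nu2 = 1 / (3 / (J ** 2 - 1) * S)
    return A / (B + 1), J - 1, nu2      # Ft, nu1, nu2

g1 = [9, 10, 11, 12, 13, 30]
g2 = [8, 10, 12, 14, 16, 40]
g3 = [20, 21, 22, 23, 24, 50]
Ft, nu1, nu2 = t1way_stat([g1, g2, g3])
```

If the J groups were identical, all trimmed means would equal X̃ and Ft would be zero.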
Another method that performs reasonably well was proposed by Lix and
Keselman (1998). Their test statistic is

Fb = Σ hj(X̄tj − X̄t)² / Σ (1 − hj/H)Sj²,   (5.4)

where H = Σ hj, X̄t = Σ hj X̄tj/H, and

Sj² = (nj − 1)s²wj / (hj − 1).

The degrees of freedom are estimated, where

ν̂2 = [Σ (1 − fj)Sj²]² / Σ Sj⁴(1 − fj)²/(hj − 1)

with fj = hj/H. (The companion expression for ν̂1 is given in Lix & Keselman, 1998.)
The R function

t1way(x,tr=0.2)

tests the hypothesis of equal trimmed means using the test statistic Ft. The first
argument x can be any R object having list mode, or it can be a matrix or a data
frame. If x is a matrix or data frame, it is assumed that the columns correspond to
groups.

The R function

box1way(x,tr=0.2,grp=NA)

tests the hypothesis of equal trimmed means using the test statistic Fb.
A convenient feature of the 20% trimmed mean is that non-bootstrap methods have
been derived that perform reasonably well except possibly when the sample sizes
are small. Non-bootstrap methods based on M-estimators are readily developed
based on extant theoretical results, but concerns are encountered when dealing with
skewed distributions. Presumably, such methods would perform reasonably well
with sufficiently large sample sizes, but there are no clear guidelines indicating
when this is the case. As in previous situations, bootstrap methods can provide an
advantage over non-bootstrap methods. As was the case in Chap. 3, when comparing
medians, the only known way of dealing with tied values is to use a bootstrap
method.
[Fig. 5.1: Scatterplot of bootstrap samples; the null point is indicated by *.]
A bootstrap-t method based on Ft, given by (5.3), can give improved control over
the Type I error probability. Section 3.1.1 described how to perform a bootstrap-t
method based on Yuen's method for comparing trimmed means. Here, the method
is applied in virtually the same manner as described in Sect. 3.1.1, except that now
the bootstrap version of Ft is used rather than the test statistic Ty used in Sect. 3.1.1.
The remainder of this section deals with a percentile bootstrap method. First
consider .J = 3 groups. It helps to first describe a generalization of the percentile
bootstrap method that is not quite satisfactory. Consider any measure of location .θ
and focus on the goal of testing the global hypothesis
H0 : θ1 − θ2 = θ1 − θ3 = 0.   (5.5)

Bootstrap estimates of θ1 − θ2 and θ1 − θ3 form a bivariate cloud, and the issue is
how deeply the null point (0, 0) is nested within this cloud. Rather than single out
the first group, all pairwise differences can be used, in which case the goal is to test

H0 : θ1 − θ2 = θ1 − θ3 = θ2 − θ3 = 0.   (5.6)

Now the null point is (0, 0, 0). More generally, for J > 2, all pairwise differences are
used. The issue is quantifying how deeply the null point is nested within a bootstrap
cloud. A basic course on multivariate statistical methods might seem to suggest a
simple solution: use Mahalanobis distance, which is formally defined in Sect. 7.1.
This approach might work, but there is a computational issue that can preclude this
approach, especially when dealing with more than three groups. (The covariance
matrix is singular.)
Wilcox (2022a, Section 6.2.2) provides complete details of how a p-value is
computed. To provide at least some indication of how this is done, consider a
collection of values .X1 , . . . , Xn . The depth of the value .Xi can be characterized
by its standardized distance from the median. The standardized distance used here
is

|Xi − M| / (q2 − q1),   (5.7)

where q2 − q1 is the interquartile range based on the ideal fourths. The smaller the
distance, the deeper is the value Xi among all values.
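A Python sketch of (5.7), using the usual ideal-fourths expressions (an illustration, not the book's R code):

```python
from statistics import median

def ideal_fourths(x):
    # Quartile estimates q1, q2 based on the ideal fourths
    xs, n = sorted(x), len(x)
    j = int(n / 4 + 5 / 12)             # lower index (1-based)
    g = n / 4 + 5 / 12 - j
    q1 = (1 - g) * xs[j - 1] + g * xs[j]
    k = n - j + 1                        # mirror position for the upper fourth
    q2 = (1 - g) * xs[k - 1] + g * xs[k - 2]
    return q1, q2

def std_distance(xi, x):
    # Standardized distance |X_i - M| / (q2 - q1), Eq. (5.7)
    q1, q2 = ideal_fourths(x)
    return abs(xi - median(x)) / (q2 - q1)

data = list(range(1, 13))                # 1, 2, ..., 12
```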
Now consider a multivariate cloud of B bootstrap estimates of .δij = θi − θj for
all .i < j . The immediate goal is to measure the distance of all B points from the
center of the cloud plus the distance of the null vector consisting of all zeros.
A key component of how this is done is based on projections of the data.
Figure 5.2 illustrates this process for a bivariate cloud of points. Consider a line
connecting any point and the center of the cloud, which here is taken to be the
marginal medians. All points are (orthogonally) projected onto the line as indicated
by the arrow for one of the points in the cloud. In general, one can reduce a cloud of
multivariate data to univariate data by projecting all of the points onto a line. Once
this is done, a standardized distance from the median can be computed based on
the projected data. Here, B lines are used. That is, for each point, project the data
onto the line connecting it to the center of the cloud. The result is that every point
has B standardized distances corresponding to the B projections. The maximum of
these distances is taken to be the standardized distance of the point. (In the statistics
literature, if say Db is the projection distance of the bth point, its depth is taken to
be 1/(Db + 1).)
Let Ib = 1 if the standardized distance of the null vector is less than the
standardized distance of the bth bootstrap point; otherwise Ib = 0. The p-value
is

(1/B) Σ Ib.   (5.8)
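The projection steps can be sketched as follows (Python, illustrative; for brevity this version uses simple sample quartiles rather than the ideal fourths, and it includes the null point when computing each projection's quartiles):

```python
from statistics import median

def quartiles(x):
    xs, n = sorted(x), len(x)
    return xs[n // 4], xs[(3 * n) // 4]   # crude quartiles for illustration

def proj_distances(cloud, extra):
    # For each bootstrap point b, project every point onto the line
    # connecting b to the center (the marginal medians); a point's distance
    # is the maximum standardized distance over these B projections.
    center = [median(c) for c in zip(*cloud)]
    pts = cloud + [extra]
    dist = [0.0] * len(pts)
    for b in cloud:
        u = [bi - ci for bi, ci in zip(b, center)]
        norm = sum(ui * ui for ui in u) ** 0.5
        if norm == 0:
            continue
        proj = [sum((pi - ci) * ui for pi, ci, ui in zip(p, center, u)) / norm
                for p in pts]
        q1, q2 = quartiles(proj)
        m = median(proj)
        for i, t in enumerate(proj):
            dist[i] = max(dist[i], abs(t - m) / (q2 - q1))
    return dist[:-1], dist[-1]

def pvalue(cloud, null):
    # Eq. (5.8): proportion of bootstrap points lying farther out than the null
    d, d0 = proj_distances(cloud, null)
    return sum(db > d0 for db in d) / len(d)

cloud = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
```

A null point at the center of this symmetric cloud yields a large p-value, while a null point far outside it yields a p-value of zero.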
[Fig. 5.2: A scatterplot of a bivariate cloud of points (each plotted as *), illustrating how a point is projected onto a line connecting it to the center of the cloud, as indicated by an arrow.]
The R function

pbprotm(x,est=tmean,con=0,alpha=0.05,nboot=2000,
MC=FALSE,SEED=TRUE,na.rm=FALSE, ...)

performs this percentile bootstrap method based on projection distances. The R function

Qanova(x,q=.5,op=3,nboot=2000,MC=FALSE,SEED=TRUE)

deals with the medians via the Harrell–Davis estimator by default. The only
difference from pbprotm is that other quantiles can be used via the argument q.
The R function

boot.TM(x,nboot=599)

performs the method based on the M-estimator derived by Özdemir et al. (2020).
This section describes three robust measures of effect size. The first is a simple
extension of the explanatory power described in Sect. 3.6.3:

ξ² = VAR(Ŷ) / VAR(Y).   (5.9)
The numerator is estimated with the variance of the sample measures of location.
As for the denominator, with equal sample sizes, simply compute the variance of the
pooled data. As in Chap. 3, there are estimation issues when there are unequal sample
sizes. Details are summarized in Wilcox (2022a). The term explanatory power refers
to ξ². But as was done in Chap. 3, it is the square root of ξ², ξ, that is used as a measure
of effect size.
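One simple variant of the estimator can be sketched as follows (Python, illustrative; means rather than trimmed means are used as the measure of location, and equal sample sizes are assumed):

```python
from statistics import pvariance

def explanatory_effect(groups):
    # xi = sqrt(VAR(Yhat) / VAR(Y)): the numerator is the variance of the
    # group measures of location, the denominator the variance of the
    # pooled data (equal sample sizes assumed)
    centers = [sum(g) / len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    return (pvariance(centers) / pvariance(pooled)) ** 0.5
```

Identical groups yield ξ = 0, while well-separated groups yield a value approaching 1.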
A property of .ξ helps provide perspective on its magnitude. Consider the case
where all J groups have a standard normal distribution. Next, suppose the first group
is shifted so that its mean is equal to 0.8. The resulting values for ξ corresponding
to J = 2, . . . , 8 are 0.5390, 0.4248, 0.3673, 0.3310, 0.3057, 0.2903, and 0.2704,
respectively. If for this situation, it is desired to adjust .ξ so that its value remains
relatively constant, use
ξadj = cJ ξ,   (5.10)
where .(c3 , . . . , c8 ) = (1.2687, 1.4671, 1.6282, 1.7631, 1.8566, 1.9933). For exam-
ple, when there are .J = 4 groups, .ξadj = 1.4671ξ .
A percentile bootstrap method is used to compute a confidence interval for .ξ .
Note that with near certainty, based on a bootstrap sample, the estimate of .ξ will be
greater than zero. It would be equal to zero if all of the bootstrap estimates of the
trimmed means happen to have the same value. Consequently, if the hypothesis of
equal trimmed means is not rejected (based on the R function t1way), the lower
end of the confidence interval is taken to be zero.
There are global measures of effect size based on means and variances assuming
homoscedasticity. Zhang and Algina (2008) summarize these methods and report
results on techniques for computing confidence intervals. Kulinskaya and Staudte
(2006) derived a heteroscedastic measure of effect size based on means and
where X̃ = Σ qj X̄tj and s²jwN is the Winsorized variance of the j th group rescaled
to estimate the variance when dealing with a normal distribution. Methods for
computing a confidence interval have not yet been studied.
Consider again the case where all J groups have a standard normal distribution.
Again, suppose the first group is shifted so that its mean is equal to 0.8. For J = 2,
ω = 0.4, which matches the value for the KMS measure of effect size given by
ωadj = cJ ω.   (5.12)
The R function
SEED=TRUE, adj=TRUE)
computes a confidence interval for .ξ , the square root of the explanatory measure of
effect size, using a percentile bootstrap method. As noted in the previous section,
the lower end of the confidence interval is taken to be zero if t1way fails to reject.
When the argument adj=TRUE, the adjusted estimate, given by (5.10), is used.
The R function

KS.ANOVA.ES(x,tr=0.2,adj=TRUE)

computes the adjusted measure of effect size ω, the measure of effect size given by
(5.11), as described in the previous section. To get an unadjusted estimate, set the
argument adj=FALSE.
By default, the R function

ESprodis(x,est=tmean,REP=100,DIF=FALSE,SEED=TRUE,...)

computes a measure of effect size using the projection distance between the
measures of location and the grand mean. To compute an estimate of the measure of
effect size ωpd.a, set the argument DIF=TRUE.
Example Consider again the example at the end of Sect. 5.1.2 where the goal was
to compare five groups, based on education level, using a measure of depressive
symptoms. The adjusted estimate of the measure of effect size ξ is 0.63. The
adjusted version of ω is estimated to be 0.24. The estimate of ωpd.g is 1.05, and the
estimate of ωpd.a is 1.03. Not surprisingly, the relative magnitude of the estimates
can depend on how effect size is measured, as demonstrated here.
5.2 Two-Way and Three-Way Designs
investigated, this would be a 2-by-3 design. To review some basic concepts, it helps
to first focus on a 2-by-2 design.
Table 5.1 depicts the situation where .θ is any measure of location. One could,
of course, simply test the global hypothesis that all measures of location have a
common value. But of interest here is how males compare to females, ignoring
which treatment was used, and how treatment M1 compares to treatment M2,
ignoring gender. Perhaps more importantly, to what extent does the effectiveness of
method M1 differ for males versus females? This latter issue refers to an interaction.
The hypothesis of no interaction is
H0 : θ1 − θ2 = θ3 − θ4.   (5.13)
If this hypothesis is true, then for the situation in Table 5.1, any difference between treatments among males is the
same as any difference between treatments among females. If there is an interaction,
and θ1 > θ2 and simultaneously θ3 > θ4, the interaction is said to be ordinal. In
Table 5.1, treatment M1 is best for both males and females, but the magnitude of
the effect depends on gender. If θ1 < θ2 and simultaneously θ3 < θ4, again the
interaction is ordinal. If θ1 > θ2 and θ3 < θ4, the interaction is disordinal. If θ1 < θ2
and θ3 > θ4, again the interaction is disordinal. That is, one of the treatments is more
beneficial for females but not for males.
Now consider the more general case where the first factor, typically labeled
Factor A, has J levels and the second factor, Factor B, has K levels. It is convenient
to change notation slightly and let .θj k denote some measure of location associated
with the j th level of the first factor and the kth level of the second factor. One way
of comparing the levels of Factor A ignoring Factor B, as well as comparing levels
of Factor B ignoring Factor A, is as follows. For the j th level of Factor A, let
1
K
.θ̄j. = θj k
K
k=1
be the average of the K measures of location among the levels of the Factor B.
Similarly, for the kth level of Factor B,
θ̄.k = (1/J) ∑_{j=1}^{J} θjk
5.2 Two-Way and Three-Way Designs 109
is the average of the J measures of location among the levels of Factor A. The hypothesis of no main effects for Factor A is

H0 : θ̄1. = θ̄2. = · · · = θ̄J. . (5.14)

In Table 5.1, this would mean that there is no difference between males and females ignoring treatment. The hypothesis of no main effects for Factor B is

H0 : θ̄.1 = θ̄.2 = · · · = θ̄.K . (5.15)
There are formal methods for defining no interaction when dealing with a J -by-
K design when J or K or both are greater than 2. But these details are not the main
focus here. What is important is understanding what no interaction means. Consider
any two levels of Factor A, say j and j′, and any two levels of Factor B, say k and k′. No interaction means that

θjk − θjk′ = θj′k − θj′k′ . (5.16)
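For readers who prefer code to subscripts, the marginal averages and condition (5.16) can be checked mechanically. A Python sketch with hypothetical θ values (the book's own software is in R):

```python
# Hypothetical measures of location for a 2-by-3 design: theta[j][k] is the
# value for level j of Factor A and level k of Factor B.
theta = [[10.0, 12.0, 14.0],
         [11.0, 13.0, 15.0]]

J, K = len(theta), len(theta[0])

# Marginal averages as defined above: theta_bar_{j.} and theta_bar_{.k}.
theta_bar_A = [sum(theta[j]) / K for j in range(J)]
theta_bar_B = [sum(theta[j][k] for j in range(J)) / J for k in range(K)]

def no_interaction(theta, tol=1e-12):
    """Check (5.16) for every pair of rows j < j' and columns k < k'."""
    J, K = len(theta), len(theta[0])
    return all(
        abs((theta[j][k] - theta[j][kp]) - (theta[jp][k] - theta[jp][kp])) <= tol
        for j in range(J) for jp in range(j + 1, J)
        for k in range(K) for kp in range(k + 1, K))

print(theta_bar_A)            # [12.0, 13.0]
print(theta_bar_B)            # [10.5, 12.5, 14.5]
print(no_interaction(theta))  # True: the rows differ by an additive constant
```

Here the two rows differ by a constant everywhere, so there are main effects but no interaction.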
Johansen (1980) derived a general heteroscedastic method for dealing with means
that includes two-way designs as a special case. Johansen’s method has been
extended to trimmed means and found to perform relatively well. As was the case
with Yuen’s method, the test statistic is based in part on the Winsorized variances.
The somewhat involved computational details are summarized in Wilcox (2022a,
Section 7.2). The main point here is that this method reduces many of the concerns
associated with any technique based on means, and it eliminates concerns about
methods that assume homoscedasticity.
For the special case where the sample median is used, an alternative to the
generalization of Johansen’s method is needed. Such a method has been derived
assuming there are no tied values (Wilcox, 2022a, Section 7.2.2). Again, when there
are tied values, the best approach at the moment is to use a percentile bootstrap
method, as described in the next section.
When dealing with main effects and interactions, a percentile bootstrap method can
be applied in a manner similar to the approach used to test (5.6). When dealing with
main effects for Factor A, for example, focus on
110 5 Comparing Multiple Independent Groups
θ̄j. − θ̄j′. . (5.17)
As was the case in Sect. 5.1.3, the method generates bootstrap samples from each
group, a measure of location is computed for each bootstrap sample, which in
turn provides bootstrap estimates of all the pairwise differences in (5.17). This is
repeated B times yielding a bootstrap cloud of estimated differences. A p-value can
be computed as outlined in Sect. 5.1.3.
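As a concrete sketch of this bootstrap strategy, the following Python code (an illustration, not the book's R implementation) tests the Factor A main effect in a 2-by-2 design using 20% trimmed means:

```python
import random

def trimmed_mean(x, tr=0.2):
    """Trimmed mean: drop the g = floor(tr * n) smallest and largest
    values, then average what remains."""
    xs = sorted(x)
    g = int(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def boot_main_effect_A(groups, tr=0.2, nboot=2000, seed=1):
    """Percentile bootstrap p-value for the Factor A main effect in a
    2-by-2 design; groups = [x11, x12, x21, x22], with the levels of
    Factor B varying fastest."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(nboot):
        # Resample each group with replacement; estimate its location.
        locs = [trimmed_mean([rng.choice(g) for _ in g], tr) for g in groups]
        # Bootstrap estimate of theta_bar_1. - theta_bar_2.
        diffs.append((locs[0] + locs[1]) / 2 - (locs[2] + locs[3]) / 2)
    # Generalized p-value: twice the smaller tail proportion of the cloud.
    p_star = sum(d > 0 for d in diffs) / nboot
    return 2 * min(p_star, 1 - p_star)

# Hypothetical demo with the Factor A levels clearly separated; every
# bootstrap difference is positive here, so the p-value is 0.0.
p = boot_main_effect_A([[12, 14, 15, 16, 18], [13, 14, 16, 17, 19],
                        [2, 4, 5, 6, 8], [1, 3, 5, 7, 9]], nboot=500)
```

Ties at zero are ignored in this simplified version; the p-value mirrors the computation outlined in Sect. 5.1.3.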
In case it helps, here is a description of a three-way design. Now there are three
factors: A, B, and C. For example, the factors might be gender, method, and ethnic
group.
Let .θj kl represent the measure of location associated with the j th level of Factor
A, the kth level of Factor B, and the lth level of Factor C. Then
θ̄j.. = (1/KL) ∑_{k=1}^{K} ∑_{l=1}^{L} θjkl ,

θ̄.k. = (1/JL) ∑_{j=1}^{J} ∑_{l=1}^{L} θjkl ,

and

θ̄..l = (1/JK) ∑_{j=1}^{J} ∑_{k=1}^{K} θjkl
are the averages of the population measures of location associated with the j th, kth, and lth levels of Factors A, B, and C, respectively.
The hypothesis of no main effects for Factor A is

H0 : θ̄1.. = θ̄2.. = · · · = θ̄J.. .

The hypotheses of no main effects for Factors B and C are

H0 : θ̄.1. = θ̄.2. = · · · = θ̄.K.

and

H0 : θ̄..1 = θ̄..2 = · · · = θ̄..L ,

respectively.
Next, the goal is to describe a two-way interaction. Let
θ̄jk. = (1/L) ∑_{l=1}^{L} θjkl ,

θ̄j.l = (1/K) ∑_{k=1}^{K} θjkl ,

and

θ̄.kl = (1/J) ∑_{j=1}^{J} θjkl .
The notion of no interactions associated with Factors A and B is that for any two levels of Factor A, say j and j′, and any two levels of Factor B, say k and k′,

θ̄jk. − θ̄jk′. = θ̄j′k. − θ̄j′k′. .
The R function
t2way(J, K, x, grp = c(1:p), tr = 0.2, alpha = 0.05)
performs the tests based on trimmed means described in Sect. 5.2.1, where J and K
denote the number of levels associated with Factors A and B, respectively.
For all of the functions in this section for a two-way design, when the data are
stored in list mode, the first K groups are assumed to be the data for the first level of
Factor A, the next K groups are assumed to be data for the second level of Factor A,
and so on. In R notation, x[[1]] is assumed to contain the data for level 1 of factors
A and B, x[[2]] is assumed to contain the data for level 1 of factor A and level 2 of
Factor B, and so forth. If, for example, a 2-by-4 design is being used, the data are
stored as follows:
Factor B
Factor A x[[1]] x[[2]] x[[3]] x[[4]]
x[[5]] x[[6]] x[[7]] x[[8]]
If the data are stored in a matrix or data frame, it is assumed that the first K
columns correspond to the first level of Factor A and the K levels of Factor B. The next
K columns correspond to the second level of Factor A, and so on.
Often data are stored in a file where the dependent variable is stored in a
column of some R object, where the R object is a matrix or a data frame. And
two other columns are used to indicate the levels of the two factors. The R function
fac2list can be used to store the data as expected by the functions in this section
when dealing with a two-way design.
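To clarify what this reshaping involves, here is a hypothetical Python analog of fac2list (the real function is part of the book's R software): it splits the dependent variable into groups, with factor levels sorted alphabetically and the last factor varying fastest, matching the layout described above.

```python
from itertools import product

def fac2list_like(y, factors):
    """Split the dependent variable y into a list of groups, one group per
    combination of factor levels. Levels are sorted alphabetically and the
    last factor varies fastest -- a sketch of what fac2list produces."""
    levels = [sorted(set(f)) for f in factors]
    combos = list(product(*levels))
    groups = {c: [] for c in combos}
    for i, yi in enumerate(y):
        groups[tuple(f[i] for f in factors)].append(yi)
    return [groups[c] for c in combos]

# Hypothetical miniature of the swimming data: ratio with two factors.
ratio = [0.986, 1.108, 1.080, 0.952]
optim = ["Optimists", "Pessimists", "Optimists", "Pessimists"]
sex = ["Male", "Female", "Female", "Male"]
a = fac2list_like(ratio, [optim, sex])
# Order: (Optimists,Female), (Optimists,Male), (Pessimists,Female),
# (Pessimists,Male) -- the alphabetical layout the functions expect.
print(a)  # [[1.08], [0.986], [1.108], [0.952]]
```

Swapping the order of the two factor columns would swap which factor plays the role of Factor A, just as with the R functions.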
Example Exercise 7 in Chap. 3 described a study where all participants were asked to swim their best event as fast as possible, but in each case the time that was reported was falsified to indicate poorer than expected performance (i.e., each swimmer was disappointed). Thirty minutes later, they swam the same event again. The issue was
whether the performance on the second trial among the more pessimistic swimmers
would be worse than on their first trial, whereas optimistic swimmers would do
better. The variable of interest is ratio = Time1/Time2, where a ratio greater than 1
means that a swimmer did better in trial 2. The first few lines of the data are:
Optim Sex Event Ratio
1 Optimists Male Free 0.986
2 Optimists Male Free 1.108
3 Optimists Male Free 1.080
4 Optimists Male Free 0.952
5 Optimists Male Free 0.998
6 Optimists Male Free 1.017
7 Optimists Male Free 1.080
8 Optimists Male Breast 1.026
9 Optimists Male Breast 1.045
Suppose the goal is to compare the 20% trimmed means of ratio where the first
factor is optimism and the second is gender. This can be done with the commands:
a=fac2list(swimming[,4],swimming[,1:2])
t2way(2,2,a)
As explained in Sect. 3.1.3, data are stored in alphabetical order, so a[[1]] contains data for participants who are both optimists and female, a[[2]] contains data for optimists and male, a[[3]] contains data for pessimists and female, and a[[4]] contains data for pessimists and male. Here, Factor
B is gender. If the second argument had been swimming[,c(2,1)], Factor A
would be gender. The p-value for Factor A, optimism, is 0.015, for Factor B, gender,
the p-value is 0.129, and for the hypothesis of no interaction the p-value is 0.017.
The R function

t3way(J, K, L, x, tr = 0.2)

performs the tests based on trimmed means for a three-way design as described in
Sect. 5.2.3. The method used is a generalization of the method in 5.2.1. The data
are assumed to be arranged such that the first L groups correspond to level 1 of
factors A and B (.J = 1 and .K = 1) and the L levels of factor C. The next L groups
correspond to the first level of Factor A, the second level of Factor B, and the L
levels of factor C. So for a 3-by-2-by-4 design, it is assumed that for .J = 1 (the
first level of the first factor), the data are stored in the R variables x[[1]],...,x[[8]] as
follows:
Factor C
Factor B x[[1]] x[[2]] x[[3]] x[[4]]
x[[5]] x[[6]] x[[7]] x[[8]]
The R function
pbad3way(J, K, L, x, est = tmean, alpha = 0.05, nboot = 2000, MC = FALSE)

tests main effects and interactions for a three-way design using a percentile bootstrap method.
Consider a 2-by-2 design. Any of the measures of effect size covered in Chap. 3
can be used to characterize an interaction. To elaborate, first focus on the KMS
heteroscedastic measure of effect size, .δ̂kms.t , given by Eq. (3.17) in Sect. 3.6.1.
Consider the j th level of Factor A (j = 1, 2) and, for notational convenience, let δj denote the population version of δ̂kms.t when comparing the corresponding two levels of Factor B. The hypothesis of no interaction is

H0 : δ1 = δ2 . (5.19)
The quantile shift measure of effect size, also covered in Chap. 3, can be used in the same manner. Letting Qj denote its population version when comparing the two levels of Factor B for the j th level of Factor A, the hypothesis of no interaction is

H0 : Q1 = Q2 . (5.20)
Once more, a percentile bootstrap method performs reasonably when testing (5.20)
or computing a confidence interval for .Q1 − Q2 (Wilcox, 2022c). As was the case
when using KMS, interchanging rows and columns can alter the value of .Q1 − Q2 .
Yet another approach is to use the effect size given by (3.9) in Sect. 3.2 as
suggested by Patel and Hoel (1973). To elaborate, let .Xj k denote a random variable
associated with the j th level of Factor A and the kth level of Factor B. Let
p1 = P (X11 < X12 )

and

p2 = P (X21 < X22 ),

and consider the hypothesis

H0 : p1 = p2 . (5.21)
That is, the probability of an observation being smaller under level 1 of Factor B,
versus level 2, is the same for both levels of Factor A. As noted in Sect. 3.2, a
method derived by Cliff (1996) is one of the more effective methods for making
inferences about .p1 as well as .p2 . Moreover, the method can handle tied values.
An extension of Cliff’s method can be used to test (5.21) as well as computing a
confidence interval for .p1 − p2 (Wilcox, 2022a, Section 7.9.2).
De Neve and Thas (2017) suggest using P (X11 − X12 < X21 − X22 )
as a measure of effect size. That is, for randomly sampled observations from each
of the four groups, focus on the probability that for the first level of Factor A, the
difference between the two values associated with the levels of Factor B is less than
the difference for level 2 of Factor A.
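This probability can be estimated by brute force. A Python sketch (a plug-in illustration based on hypothetical samples, not the authors' estimator) averages over every combination of one observation per group:

```python
def dnt_effect(x11, x12, x21, x22):
    """Plug-in estimate of P(X11 - X12 < X21 - X22): the proportion of
    combinations, one observation per group, for which the level-1
    difference is below the level-2 difference."""
    count = total = 0
    for a in x11:
        for b in x12:
            for c in x21:
                for d in x22:
                    total += 1
                    count += (a - b) < (c - d)
    return count / total

# Hypothetical data: level-2 differences are always larger, so the
# estimate is 1.0.
print(dnt_effect([1, 2], [1, 2], [5, 6], [1, 2]))  # 1.0
```

The nested loops make this O(n^4), which is fine for a sketch; a production version would vectorize or subsample the combinations.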
Finally, the explanatory measure of effect size in Sect. 3.6.3 can be used as
well. However, a percentile bootstrap method does not perform well in terms
of controlling the Type I error probability or computing a confidence interval.
Currently, a method that does perform well is unknown.
The R functions in this section deal with a 2-by-2 design. Functions that deal with a
J -by-K design are described in Sect. 5.4.4.
The R function KMS.inter.pbci computes an estimate of the interaction based on the KMS measure of effect size
and it tests the hypothesis of no interaction corresponding to (5.19). The argument
x is assumed to be a matrix with four columns or to have list mode with length four.
That is, the function is designed for a 2-by-2 design. By default, the KMS measure
of effect size is based on comparing the two levels of Factor B, which is done for
each level of Factor A. Setting the argument SW=TRUE, the KMS measure of effect
size is based on the two levels of Factor A.
A second R function, used in exactly the same manner as KMS.inter.pbci, bases the analysis on the quantile shift measure of effect size. By default, the version based on the median is used. To use a 20% trimmed mean, set the argument locfun=tmean.
This section deals with the problem of controlling the FWE rate (the probability
of one or more Type I errors) when testing two or more hypotheses. This includes
methods for making inferences based on measures of effect size. There are two basic
strategies that are closely related. The first, when testing some hypothesis based on
some appropriate test statistic, is to adjust the critical value so that the FWE rate is
approximately equal to some specified value .α. The second approach is to adjust the
p-values.
The immediate goal is to perform all pairwise comparisons based on some
measure of location. More formally, the goal is to test
H0 : θj = θk . (5.23)
The first approach is simple: for each pair of groups, use Yuen’s test statistic given
by (3.3) in Sect. 3.1.1. Next, determine a critical value based on what is called the
Studentized maximum modulus distribution. This critical value is a function of the
degrees of freedom used by Yuen’s test plus the number of hypotheses being tested,
which in this case is C = (J² − J)/2, where J is the number of groups. (The
critical value is computed with the R function qsmm. The R function psmm is used
to get adjusted p-values.) When there is no trimming, this method reduces to the T3
technique studied by Dunnett (1980).
To provide some sense of the strategy for controlling the FWE rate, let .Tj k denote
Yuen’s test statistic when comparing the j th group to the kth group. The hypothesis
of equal trimmed means is rejected if .|Tj k | is sufficiently large. Let .|Tmax | be the
largest |Tjk | value among all of the pairwise comparisons. If the null hypothesis is true for every pair of groups, then at least one Type I error is made if and only if |Tmax | is larger than the specified critical value. Suppose the distribution of |Tmax | is known
when there are no differences among the trimmed means. In particular, the 0.95
quantile is known to be c. In that case, the FWE rate would be 0.05 if each test
rejected only if .|Tj k | ≥ c. The Studentized maximum modulus distribution is a
method for approximating c.
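The role the Studentized maximum modulus distribution plays can be mimicked by simulation. A Python sketch (a simplified illustration: normal data, Welch's test statistic with no trimming, hypothetical settings) that approximates the 0.95 quantile of |Tmax|:

```python
import math
import random
import statistics

def welch_t(x, y):
    """Welch's heteroscedastic t statistic for two independent groups."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx / nx + vy / ny)

def tmax_quantile(J=4, n=20, nrep=2000, q=0.95, seed=1):
    """Approximate the q quantile of |Tmax| over all pairwise comparisons
    of J groups when every null hypothesis is true (standard normal data).
    The Studentized maximum modulus distribution supplies this quantile
    analytically; here it is simulated."""
    rng = random.Random(seed)
    tmax = []
    for _ in range(nrep):
        groups = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(J)]
        tmax.append(max(abs(welch_t(groups[j], groups[k]))
                        for j in range(J) for k in range(j + 1, J)))
    return sorted(tmax)[int(q * nrep) - 1]

# Using c as the common critical value makes the FWE rate about 1 - q.
c = tmax_quantile()
```

The simulated quantile exceeds the usual per-test critical value, which is exactly how the FWE rate is held at roughly 1 − q across all C comparisons.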
Note that in addition to controlling the FWE rate, there is the issue of computing
confidence intervals so that all of the confidence intervals contain the true difference
between measures of location with probability .1 − α. The T3 method is designed to
accomplish this goal. Rather than use the critical value described in Sect. 3.1.1, use
the critical value based on the Studentized maximum modulus distribution. More
formally, a confidence interval for μtj − μtk is

(X̄tj − X̄tk ) ± tqsmm √(V̂j + V̂k ),

where tqsmm is the critical value based on the Studentized maximum modulus distribution and V̂j is Yuen's estimate of the squared standard error of X̄tj .
5.3 Multiple Pairwise Comparisons for a One-Way Design 119
Percentile bootstrap methods are not based on a test statistic. But the FWE rate can
be controlled by adjusting the p-values. A simple solution is to use the Bonferroni
method. If C hypotheses are to be tested, perform each test at the α/C level, in which case the probability of one or more Type I errors will be less than α, assuming the actual
level of each individual test is indeed equal to .α/C. Several methods have been
derived that perform better than the Bonferroni method. That is, they also result in
an FWE rate less than or equal to .α, but they have more power (e.g., Hochberg,
1988; Holm, 1979; Hommel, 1988; Rom, 1990). There are slight differences among
the methods just cited, but in practice it is very difficult finding a situation where the
choice matters. For this reason, Hochberg’s method is typically used here.
Hochberg’s (1988) method is applied as follows. Let .p1 , . . . , pC be the p-values
associated with the C tests. Put these p-values in descending order, and label the
results .p[1] ≥ p[2] ≥ · · · ≥ p[C] . Beginning with .k = 1 (step 1), reject all
hypotheses if
p[k] ≤ α/k.

That is, reject all hypotheses if the largest p-value is less than or equal to α. If p[1] > α, proceed as follows:
1. Increment k by 1. If p[k] ≤ α/k, stop and reject all hypotheses having a p-value less than or equal to p[k] .
2. If p[k] > α/k, repeat step 1.
3. Repeat steps 1 and 2 until you reject or all C hypotheses have been tested.
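The steps above can be expressed compactly in code. A Python sketch of Hochberg's step-up rule (an illustration, not the book's R implementation):

```python
def hochberg_reject(pvals, alpha=0.05):
    """Hochberg's step-up method: scan the p-values in descending order
    p[1] >= ... >= p[C]; at step k, once p[k] <= alpha/k, reject every
    hypothesis whose p-value is <= p[k]. Returns reject/fail decisions
    aligned with the input order."""
    C = len(pvals)
    order = sorted(range(C), key=lambda i: pvals[i], reverse=True)
    threshold = None
    for k, i in enumerate(order, start=1):   # k = 1 is the largest p-value
        if pvals[i] <= alpha / k:
            threshold = pvals[i]
            break
    if threshold is None:
        return [False] * C
    return [p <= threshold for p in pvals]

print(hochberg_reject([0.04, 0.02, 0.01]))  # [True, True, True]
print(hochberg_reject([0.30, 0.02, 0.01]))  # [False, True, True]
```

In the first call the largest p-value is already at most α, so every hypothesis is rejected at step 1; in the second, the scan stops at k = 2 because 0.02 ≤ 0.05/2.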
A method for testing hypotheses based on a percentile bootstrap method is
straightforward. Simply use the percentile bootstrap method in Sect. 3.1.2 when
comparing the j th and kth groups, then adjust the p-values using Hochberg’s
methods.
It is briefly mentioned that when using a bootstrap-t method, the FWE rate can be
controlled using an analog of the Studentized maximum modulus distribution (e.g.,
Wilcox, 2022a, 7.4.5). Basically, the method uses bootstrap samples to determine
the distribution of .|Tmax | when the null hypotheses are true. Consistent with past
remarks, a bootstrap-t method is preferable to the percentile bootstrap method when
there is little or no trimming.
Rather than control the FWE rate, Benjamini and Hochberg (1995) proposed a
method that controls what is called the false discovery rate. To explain, let Q be the proportion of rejected hypotheses that are in fact true. That is, Q is the proportion of the rejected hypotheses that are Type I errors. The false discovery rate is the expected value of Q. That is, if a study is repeated (infinitely) many times, the false discovery rate is the average proportion of rejections that are Type I errors.
The Benjamini–Hochberg method controls the false discovery rate using a
variation of Hochberg’s method where in step 1 of Hochberg’s method, .p[k] ≤ α/k
is replaced by
p[k] ≤ (C − k + 1)α/C.
A criticism of the Benjamini–Hochberg method is that situations can be found
where some hypotheses are true, some are false, and the probability of at least one
Type I error will exceed .α among the hypotheses that are true (Hommel, 1988).
However, Benjamini and Hochberg (1995) show that their method ensures that the
false discovery rate is less than or equal to .α when performing C independent tests.
For a recent summary of how well the Benjamini–Hochberg method performs, see
Du et al. (2023).
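The variation amounts to a one-line change to the cutoff. A Python sketch (again an illustration, not the book's R code):

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg procedure written as the variation of Hochberg's
    method described above: scanning the p-values in descending order,
    step k uses the cutoff (C - k + 1) * alpha / C instead of alpha / k."""
    C = len(pvals)
    order = sorted(range(C), key=lambda i: pvals[i], reverse=True)
    for k, i in enumerate(order, start=1):
        if pvals[i] <= (C - k + 1) * alpha / C:
            # Reject every hypothesis with a p-value <= this one.
            return [p <= pvals[i] for p in pvals]
    return [False] * C

print(bh_reject([0.001, 0.012, 0.04, 0.20]))  # [True, True, False, False]
```

With C = 4 and α = 0.05, the scan stops at the p-value 0.012, since 0.012 ≤ 2(0.05)/4, so the two smallest p-values are rejected; Hochberg's FWE cutoffs would be stricter at that step.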
All pairs power refers to the probability of detecting all true differences. For the
special case where all pairwise comparisons are to be made, a so-called step-down
method might provide higher all pairs power (Hochberg & Tamhane, 1987; Wilcox,
1991). The method is applied as follows:
1. Test the global hypothesis, at the .αJ = α level, that all J groups have a common
trimmed mean. If .H0 is not rejected, stop and fail to find any differences among
the groups. Otherwise, continue to the next step.
2. For each subset of J − 1 groups, test at the αJ−1 = α level the hypothesis that the J − 1 groups have a common trimmed mean. If all such tests are nonsignificant, stop and fail to detect any differences among the groups; otherwise, continue to the next step.
3. Set p = J − 2.
4. For each subset of p groups, test at the αp = 1 − (1 − α)^(p/J) level the hypothesis that the p groups have a common trimmed mean.
5. Reduce p to p − 1. If p > 2, repeat step 4. If p = 2, go to step 6.
6. The final step consists of testing all pairwise comparisons of the groups at the
α2 = 1 − (1 − α)^(2/J) level. In this final step, when comparing the j th group
to the kth group, either fail to reject, fail to reject by implication from one of
the previous steps, or reject. For example, if the hypothesis that groups 1, 2, and
3 have equal trimmed means is not rejected, then in particular groups 1 and 2
would not be declared significantly different by implication.
Although this step-down method can increase all pairs power, it should be noted
that when comparing means, power can be relatively poor. Consider, for example,
four groups, three of which have normal distributions, while the fourth has a heavy-tailed distribution. Even when the first three groups differ, a few outliers in the
fourth group can destroy the power of a global test based on means. That is, the
first step in the step-down method can fail to reject, in which case no differences are
found. Using a robust measure of location reduces this concern.
The R function

lincon(x, con = 0, tr = 0.2, alpha = 0.05)

performs the T3 method in the previous section. The argument con is explained in Sect. 5.4.4.
The R function linconpb performs all pairwise comparisons using a percentile bootstrap method. The argument method=‘holm’ means that Holm's method is used to control the FWE rate.
For all practical purposes, it gives results identical to Hochberg’s method. Setting
method=‘hoch’, Hochberg’s method would be used. Setting method=‘BH’,
the Benjamini–Hochberg method would be used.
The R function
linconbt(x, con = 0, tr = 0.2, alpha = 0.05, nboot = 599)
applies a bootstrap-t method where the FWE rate is controlled via an analog of the
Studentized maximum modulus distribution.
The R function ESmcp.CI computes measures of effect size for each pair of groups. One of six measures can be used via the argument method. The other choices for method are the same as those listed in Sect. 3.6.4: ‘EP’, ‘QS’, ‘QStr’, ‘AKP’, and ‘WMW’.
Many of the R functions written for this book, aimed at performing multiple
tests, contain an argument method for controlling the FWE rate by adjusting the
p-values. Given a collection of p-values that have not been adjusted to control the FWE rate, adjusted p-values can be computed via the R function p.adjust. As an illustration of lincon and its adjusted p-values, consider the output returned by the command
lincon(a[1:3])
$test
Group Group test crit se df
[1,] 1 2 2.797653 2.428787 1.395728 95.48122
[2,] 1 3 4.499152 2.420304 1.410148 120.01703
[3,] 2 3 1.795417 2.425610 1.358854 103.38463
$psihat
Group Group psihat ci.lower ci.upper p.value Est.1
[1,] 1 2 3.904762 0.5148357 7.294688 6.226504e-03 15.07143
[2,] 1 3 6.344472 2.9314839 9.757460 1.587794e-05 15.07143
[3,] 2 3 2.439710 -0.8563396 5.735760 7.550878e-02 11.16667
Est.2 adj.p.value
[1,] 11.166667 1.851706e-02
[2,] 8.726957 4.763078e-05
[3,] 8.726957 2.086989e-01
5.4 Multiple Comparisons for a Two-Way and Higher Design 123
The results indicate that a decision can be made that group 1 has a higher 20%
trimmed mean than groups 2 and 3 when the FWE rate is set to 0.05. But no decision
is made when comparing groups 2 and 3. It is left as an exercise to show that the
KMS measure of effect size is estimated to be moderately large when comparing
group 1 to groups 2 and 3 based on a common convention mentioned in Chap. 3.
Comparing groups 2 and 3, the KMS measure of effect size is relatively small.
A linear contrast is

Ψ = ∑_{j=1}^{J} cj θj , (5.25)

where c1 , . . . , cJ are specified constants such that ∑ cj = 0.
To illustrate the notation, consider again Table 5.1. As explained, the hypothesis of no interaction is H0 : θ1 − θ2 = θ3 − θ4 . Of course, this is the same as H0 : θ1 − θ2 − θ3 + θ4 = 0. In the context of a linear contrast, the hypothesis of no interaction is H0 : Ψ = 0, where the contrast coefficients are c1 = 1, c2 = −1, c3 = −1, and c4 = 1. For main effects for Factor A (gender in Table 5.1), the hypothesis is H0 : θ1 + θ2 = θ3 + θ4 . This corresponds to the linear contrast coefficients c1 = c2 = 1 and c3 = c4 = −1.
Now consider a 3-by-3 design and denote the measures of location as indicated
in Table 5.2. The number of contrast coefficients is always equal to the number of
groups. For Factor A, there are three main effects of interest. The first is H0 : θ1 + θ2 + θ3 = θ4 + θ5 + θ6 . In terms of a linear contrast, this corresponds to H0 : Ψ = 0, where the contrast coefficients are c1 = c2 = c3 = 1, c4 = c5 = c6 = −1, and c7 = c8 = c9 = 0. There are nine interactions. That is, there are nine ways of focusing on two rows and two columns. The contrast coefficients for the first interaction are (c1 , c2 , c3 , c4 , c5 , c6 , c7 , c8 , c9 ) = (1, −1, 0, −1, 1, 0, 0, 0, 0). The linear contrast is Ψ = θ1 − θ2 − θ4 + θ5 . In this case, testing H0 : Ψ = 0 corresponds to testing H0 : θ1 − θ2 = θ4 − θ5 .
When dealing with trimmed means, a generalization of the T3 method in Sect. 5.3.1
can be used to test
H0 : Ψ = 0. (5.26)
The estimate of Ψ is

Ψ̂ = ∑_{j=1}^{J} cj X̄tj .
An estimate of the squared standard error of Ψ̂ is

A = ∑_{j=1}^{J} dj ,

where

dj = cj² (nj − 1) s²wj / (hj (hj − 1)),

hj is the effective sample size (the sample size after trimming) of the j th group, and s²wj is the Winsorized variance of the j th group.
Let

D = ∑_{j=1}^{J} dj² / (hj − 1),

set

ν̂ = A²/D,

and let t be the 1 − α/2 quantile of Student's t distribution with ν̂ degrees of freedom.
Then an approximate 1 − α confidence interval for Ψ is

Ψ̂ ± t √A. (5.27)
When testing C hypotheses, again the critical value for controlling the FWE rate is
based on the Studentized maximum modulus distribution.
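Assembling the pieces, here is a Python sketch of the computation just described (an illustration, not the book's R code; the dj term uses the Yuen-type form given above):

```python
def trimmed_mean(x, tr=0.2):
    """Trimmed mean: drop the g = floor(tr * n) extreme values per tail."""
    xs = sorted(x)
    g = int(tr * len(xs))
    kept = xs[g:len(xs) - g]
    return sum(kept) / len(kept)

def win_var(x, tr=0.2):
    """Winsorized variance: pull the g smallest (largest) values up (down)
    to the adjacent retained order statistics, then take the usual
    sample variance."""
    xs = sorted(x)
    n, g = len(xs), int(tr * len(xs))
    w = [min(max(v, xs[g]), xs[n - g - 1]) for v in xs]
    m = sum(w) / n
    return sum((v - m) ** 2 for v in w) / (n - 1)

def lincon_pieces(groups, con, tr=0.2):
    """Return (psi_hat, A, nu_hat) for the linear contrast defined by con,
    with d_j = c_j^2 (n_j - 1) s_wj^2 / (h_j (h_j - 1))."""
    psi_hat = sum(c * trimmed_mean(g, tr) for c, g in zip(con, groups))
    A = D = 0.0
    for c, g in zip(con, groups):
        n = len(g)
        h = n - 2 * int(tr * n)              # effective sample size
        d = c * c * (n - 1) * win_var(g, tr) / (h * (h - 1))
        A += d
        D += d * d / (h - 1)
    nu_hat = A * A / D
    # The confidence interval (5.27) is psi_hat +/- t * sqrt(A), with t the
    # 1 - alpha/2 quantile of Student's t on nu_hat degrees of freedom.
    return psi_hat, A, nu_hat
```

With hypothetical data such as two groups shifted by 5, the contrast (1, −1) returns Ψ̂ = 5 along with A and ν̂; the t quantile itself would come from a statistics library rather than this sketch.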
Repeat this process B times, yielding Ψ̂1∗ , . . . , Ψ̂B∗ . A p-value for testing (5.26), as well as a confidence interval for Ψ, is computed as described in Sect. 3.1.2 with D1∗ , . . . , DB∗ in Sect. 3.1.2 replaced by Ψ̂1∗ , . . . , Ψ̂B∗ .
Several R functions for performing multiple tests are described in the next section.
They are aimed at dealing with the more common goals of testing main effects and
interactions. However, a possibility is that other linear contrasts are of interest. If
this is the case, the R functions lincon and linconpb can be useful.
For example, imagine a situation where there are .J = 3 groups, one of which is a
control group. Moreover, the goal is to compare the other two groups to the control
group only. This might be preferred to using all pairwise differences in order to
increase power and simultaneously control the FWE rate. For illustrative purposes, suppose the data are stored in the R object dat and the second group is the control group. First, use the R command
A=conCON(3,2)$conCON
This function creates the contrast coefficients that are needed. The first argument
indicates the number of groups, and the second indicates which group is the control
group. The resulting R object A contains
[,1] [,2]
[1,] 1 0
[2,] -1 -1
[3,] 0 1
The first column contains the contrast coefficients for the first linear contrast to be
tested, which indicates that groups 1 and 2 are compared. The second column is
the second linear contrast, which indicates that groups 2 and 3 are compared. The
command
lincon(dat,con=A)
would perform the two tests. The R function linconpb is used in a similar manner.
More generally, the argument con can be any matrix with J rows (the number of
groups), that contains contrast coefficients in the columns.
The R function mcp2atm(J, K, x, tr = 0.2)
uses method T3 to test all main effects and all interactions when dealing with a
two-way design. Two related R functions are designed specifically for comparing medians: the first is for a two-way design and the second is for a three-way design.
The R function
deals with main effects and interactions based on one or more quantiles when dealing with a 2-by-2 design.
The R function KMSinter.mcp(J, K, x, SW = FALSE) estimates the KMS interaction effect size for all relevant interactions in a J -by-K design.
Basically, the argument SW=TRUE interchanges the rows and columns. To make
sure there is no confusion when interpreting the linear contrast coefficients reported
by the function, consider, for example, a 2-by-3 design. Consistent with previous
functions, it is assumed that the data are stored as follows:
Factor B
Factor A x[[1]] x[[2]] x[[3]]
x[[4]] x[[5]] x[[6]]
Setting SW=TRUE, the data are treated as if they were stored as:

Factor B
Factor A x[[1]] x[[4]]
x[[2]] x[[5]]
x[[3]] x[[6]]
So Factor A is now Factor B and Factor B is now Factor A. The linear contrasts are
reported as:
$con
[,1] [,2] [,3]
[1,] 1 1 0
[2,] -1 -1 0
[3,] -1 0 1
[4,] 1 0 -1
[5,] 0 -1 -1
[6,] 0 1 1
The first column indicates that a measure of effect size for groups 1 and 4 is
compared to a measure of effect size for groups 2 and 5.
The R function QSinter.mcp is like the R function KMSinter.mcp, only the QS measure of effect size is used.
The R function IND.PAIR.ES computes the six measures of effect size estimated by the R functions in Sect. 3.6.4. When the argument con=NULL, the function computes measures of effect size for all pairs of groups. Setting the argument fun=ES.summary.CI, the function tests the hypothesis of no effect for all six measures, and it controls the FWE rate among these six tests with Hochberg's method. This is in contrast to ESmcp.CI, which controls the FWE rate among all pairwise comparisons of J independent groups when the focus is on a single measure of effect size.
It might be desired to compute a global measure of effect size for level 1 of
Factor A based on the K groups associated with Factor B via the R function
KS.ANOVA.ES in Sect. 5.1.6. Of course, this might be done for the other levels
of Factor A as well. And for each level of Factor B, a global measure of effect size
based on the J levels of Factor A can be of interest. The R function
JK.AB.KS.ES(J, K, x)

computes these global measures of effect size. For the swimming example, part of the output looks like this:

$Fac.B[[2]]
[1] 0.4050633
$Fac.B[[3]]
[1] 0.06567751
For example, Fac.B[[1]] reports a global measure of effect size for males versus
females when the focus is on event Back. Controlling for event, effect sizes when
comparing males and females are relatively small for events Back and Free. For
the breast stroke, the effect is 0.405, which is moderately large. This raises the
issue of whether it is reasonable to conclude that this effect is indeed larger than
the effect associated with the other two events. This can be investigated with the
R function KMSinter.mcp, but note that based on how the data are stored, the
default version of KMSinter.mcp would deal with effect sizes between levels of
the second factor, event, not gender. To compare effect sizes for the first factor, one
could swap the roles of the factors by using the command
KMSinter.mcp(2,3,a,SW=TRUE)
For a two-way design, there is an alternative approach to effect sizes for
main effects that might be of interest, which can be applied via the R function
IND.PAIR.ES in conjunction with the argument con. When the argument con
is specified, the function simply pools the data over the levels. For example, for a
2-by-3 design, if the goal is to compare the two levels of the first factor, the function
proceeds as follows. For the first level of Factor A, pool the data corresponding
to the three levels of Factor B. Do the same for the second level of Factor A.
More precisely, given some contrast coefficients, the function pools the data of the
groups having a contrast coefficient equal to 1. The same is done for the groups
having a contrast equal to .−1. The function then computes measures of effect
size for these two groups using the R function ES.summary. When dealing with
interactions, currently the best approach is to use the R functions KMSinter.mcp,
QSinter.mcp, and PHinter.mcp previously described.
Situations can occur where it is convenient to have a function that creates the
contrast coefficients for main effects and interactions. The R function
con2way(J,K)

creates the contrast coefficients for a two-way design. For a three-way design, use the R function

con3way(J,K,L)
$effect.size
$effect.size[[1]]
Est NULL S M L
AKP 0.4112021 0.0 0.20 0.50 0.80
EP 0.3072032 0.0 0.14 0.34 0.52
QS (median) 0.6342531 0.5 0.55 0.64 0.71
QStr 0.6342531 0.5 0.55 0.64 0.71
WMW 0.3587244 0.5 0.45 0.36 0.29
KMS 0.2011279 0.0 0.10 0.25 0.40
$effect.size[[2]]
Est NULL S M L
5.5 Rank-Based Methods 131
Consider, for example, the top of the output, which shows the contrast coefficients
created by con2way for the main effects of the first factor. The first column of the
linear contrast coefficients contains 1, 1, .−1, .−1, 0, 0, 0, 0. This means that the
first two groups, which are males and females in level 1 of Factor A, are pooled.
The same is done for the second level of Factor A. Then measures of effect size are
computed based on these two groups. The next column indicates that the first and
third education groups are compared. The output labeled $effect.size[[1]] contains the effect sizes associated with column one of the contrast coefficients. That is, the first
and second education levels are compared. Similarly, $effect.size[[2]] corresponds
to the effect sizes comparing the first and third education levels. The command
IND.PAIR.ES(a,con=A$conB)
deals with the main effects associated with the second factor.
For completeness, there are rank-based methods for testing the hypothesis that J
independent groups have a common distribution. From Tukey’s (1991) perspective,
it is known that the distributions differ albeit the difference can be trivially small. If
this view is accepted, most rank-based methods deal with determining whether the
sample size is large enough to conclude what is surely true.
The Kruskal–Wallis test is often mentioned in an introductory statistics course,
and various improvements have been derived (e.g., Brunner et al., 2002, 2019;
Wilcox, 2022a, Section 7.8). A method derived by Rust and Fligner (1984) tests
the hypothesis .H0 : pj k = 0.5, where .pj k is the probability that a randomly
sampled value from group j is less than a randomly sampled value from group
k. However, the method makes a highly restrictive assumption: Distributions differ
in terms of a measure of location only. If there is heteroscedasticity, for example, or
the distributions differ in skewness, the method is no longer valid.
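For two groups, pjk is easily estimated from data. A Python sketch (an illustration of the estimand only, not the Rust–Fligner test itself; ties, if any, are counted as 1/2, a common convention):

```python
def phat(x, y):
    """Estimate p = P(X < Y) + 0.5 * P(X = Y) for two independent groups
    by averaging over all (x_i, y_j) pairs; ties are split evenly."""
    total = sum(1.0 if xi < yi else 0.5 if xi == yi else 0.0
                for xi in x for yi in y)
    return total / (len(x) * len(y))

print(phat([1, 2, 3], [4, 5, 6]))  # 1.0: every x value is below every y value
```

An estimate near 0.5 is consistent with H0 : pjk = 0.5; the methods cited above supply the inference that this sketch does not.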
The R function

bdm(x)

performs the rank-based method derived by Brunner, Dette, and Munk.
5.6 Exercises
1. Using the data described in the example at the end of Sect. 5.1.2, use the
first group as the control group and compare the mean of the control group
to each of the other four groups based on the T3 method. That is, use the R
function lincon with the argument tr=0. Take advantage of the R function
conCON. The adjusted p-values are based on the Studentized maximum
modulus distribution. What happens if the p-values are adjusted based on
Hochberg’s method?
2. Using the data in Exercise 1, compare all pairs of groups based on the KMS
measure of effect size as well as the quantile shift measure of effect size via the
R function ESmcp.CI.
3. The file hyp_dat_5g_dat.txt contains hypothetical data for five independent
groups. Test the global hypothesis of equal means using t1way with the
argument tr=0. Next, compare the groups with a 20% trimmed mean, again using
t1way followed by pbprotm. Note that the data are separated by &. That is,
when using the read.table command, include the argument sep=‘&’.
4. Comment on the strategy of testing the hypothesis that there is homoscedastic-
ity and using a homoscedastic method if the test fails to reject.
5. For the data in the file A1B1C.txt, the column named edugp indicates the
amount of education for each participant. Compare these groups based on a
measure of depressive symptoms, labeled CESD, by testing the hypothesis of
equal 20% trimmed means based on the R functions t1way and lincon. In
terms of controlling the FWE rate, what is a concern when using the function
t1way?
6. Using the data in the last exercise, compare the first group to the other groups
using the measures of effect size estimated by the R function IND.PAIR.ES.
Use the R function conCON to specify the linear contrasts and include
confidence intervals for the effect sizes. Comment on the lower ends of the
confidence intervals.
7. For the data in the file A1B1C.txt, the column named racegp indicates a
participant’s reported ethnic group. Compare the first four groups, based
on a measure of depressive symptoms, labeled CESD, with the R function
IND.PAIR.ES. Include confidence intervals in the results. When comparing
groups 1 and 3, what do p-values and confidence intervals indicate? Which two
groups have the largest measures of effect size?
8. Imagine that the rank-based method in Sect. 5.5 rejects. What does this
indicate?
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9_6
6 Comparing Multiple Dependent Groups
As explained in Chap. 4, there are at least two basic approaches when comparing
dependent groups. The first is to focus on some measure of location associated
with the marginal distributions. The second is to focus on a measure of location
associated with the difference scores. This section begins with methods that deal
with global tests based on a measure of location associated with the marginal
distributions. That is, the goal is to test
$$H_0: \theta_1 = \theta_2 = \cdots = \theta_J, \qquad (6.1)$$

where $\theta_j$ ($j = 1, \ldots, J$) are measures of location associated with the marginal
distributions of J groups that are possibly dependent. This is followed by a
global test that deals with difference scores. Multiple comparisons are described
in Sect. 6.6.
First, consider the goal of testing the hypothesis that the marginal trimmed means
are equal. For the situation where there is no trimming, there is a classic ANOVA
F test that is often covered in an introductory statistics course. The method is
based on a rather restrictive assumption, called sphericity, about the nature of the
association among the J random variables. Details about this assumption can be
found in Kirk (1995). The important points here are that violating this assumption is
a practical concern and that methods for avoiding this assumption have been derived.
The method in this section uses a generalization of one such technique aimed at
comparing means, known as the Huynh–Feldt correction.
Let $\bar{X}_{tj}$ denote the trimmed mean of the jth group. The method begins by Winsorizing the data in essentially the same manner as done in Sect. 4.1.1. That is, each column of data is Winsorized keeping the dependent values among the rows in place. This process is illustrated with the following data, where n = 9 participants are measured at three different times. This is done with 20% trimming, in which case g = 1. That is, the smallest and largest values are Winsorized.
18 34 16
9 19 10
23 4 36
54 12 8
19 26 34
26 42 19
33 25 21
21 31 30
18 34 16
18 19 10
23 12 34
33 12 10
19 26 34
26 34 19
33 25 21
21 31 30
For example, in the first column, the lowest value, 9, becomes 18 and the largest
becomes 33. In the second column, the lowest value, 4, becomes 12 and the largest
value, 42, becomes 34.
Let $W_{ij}$ denote the Winsorized value for the ith participant and the jth measure. Let $h = n - 2g$ be the effective sample size, the number of values left after trimming.
The test statistic is computed as follows. Let

$$\bar{X}_t = \frac{1}{J} \sum_{j=1}^{J} \bar{X}_{tj},$$

$$Q_c = (n - 2g) \sum_{j=1}^{J} (\bar{X}_{tj} - \bar{X}_t)^2,$$

$$Q_e = \sum_{j=1}^{J} \sum_{i=1}^{n} (W_{ij} - \bar{W}_{.j} - \bar{W}_{i.} + \bar{W}_{..})^2,$$

where

$$\bar{W}_{.j} = \frac{1}{n} \sum_{i=1}^{n} W_{ij}, \qquad \bar{W}_{i.} = \frac{1}{J} \sum_{j=1}^{J} W_{ij}, \qquad \bar{W}_{..} = \frac{1}{nJ} \sum_{j=1}^{J} \sum_{i=1}^{n} W_{ij}.$$

The test statistic is

$$F = \frac{R_c}{R_e}, \qquad (6.2)$$

where

$$R_c = \frac{Q_c}{J - 1}, \qquad R_e = \frac{Q_e}{(h - 1)(J - 1)}.$$
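A Python sketch of these computations, with the column-wise Winsorizing and trimmed means defined earlier; this is an illustration of the formulas, not the book's implementation:

```python
import numpy as np

def trim_winsor_F(X, g):
    """Compute F = Rc / Re for J dependent groups following the
    formulas above: X is an n-by-J data matrix and g is the number of
    values Winsorized (and trimmed) in each tail of each column."""
    X = np.asarray(X, dtype=float)
    n, J = X.shape
    h = n - 2 * g                       # effective sample size
    W = X.copy()                        # column-wise Winsorizing, rows in place
    for j in range(J):
        srt = np.sort(X[:, j])
        lo, hi = srt[g], srt[-(g + 1)]
        W[W[:, j] < lo, j] = lo
        W[W[:, j] > hi, j] = hi
    # Trimmed means of the columns and their grand mean
    tmeans = np.array([np.sort(X[:, j])[g:n - g].mean() for j in range(J)])
    Qc = (n - 2 * g) * np.sum((tmeans - tmeans.mean()) ** 2)
    # Residuals W_ij - Wbar_.j - Wbar_i. + Wbar_..
    resid = W - W.mean(axis=0) - W.mean(axis=1)[:, None] + W.mean()
    Qe = np.sum(resid ** 2)
    Rc = Qc / (J - 1)
    Re = Qe / ((h - 1) * (J - 1))
    return Rc / Re
```
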
Next, randomly sample with replacement rows of data from the centered data, yielding

$$\begin{pmatrix} C_{11}^* & \ldots & C_{1J}^* \\ \vdots & & \vdots \\ C_{n1}^* & \ldots & C_{nJ}^* \end{pmatrix}. \qquad (6.4)$$
Next, compute the test statistic F based on a bootstrap sample from the centered data
yielding .F ∗ . Repeat this process B times and determine a critical value as described
in Sect. 4.1.1.
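The centering-and-resampling scheme can be sketched as follows. This is a simplified illustration: columns are centered at their means and a simple ANOVA-type statistic stands in for the F statistic of (6.2), whereas the method in the text centers using trimmed means.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_t_critical(X, stat, B=599, alpha=0.05):
    """Center each column of X so the hypothesis of equal measures of
    location holds, resample whole rows with replacement (preserving
    the dependence among columns), and return the (1 - alpha) quantile
    of the B bootstrap statistics as the critical value."""
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    C = X - X.mean(axis=0)              # centered data
    fstar = np.empty(B)
    for b in range(B):
        rows = rng.integers(0, n, n)    # bootstrap sample of rows
        fstar[b] = stat(C[rows])
    fstar.sort()
    u = round((1 - alpha) * B)
    return fstar[min(u, B - 1)]

def fstat(X):
    """Simple ANOVA-type statistic standing in for F of (6.2)."""
    n = X.shape[0]
    means = X.mean(axis=0)
    return n * np.sum((means - means.mean()) ** 2) / max(X.var(axis=0).sum(), 1e-12)

X = rng.normal(size=(30, 3))
crit = bootstrap_t_critical(X, fstat)
```

Resampling whole rows, rather than each column separately, is what preserves the dependence among the J measures.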
Two percentile bootstrap methods have been investigated. Both have the advantage
of being reasonable choices when using an estimator that has a reasonably high
breakdown point.
The first method is based on the test statistic

$$Q = \sum_{j=1}^{J} (\hat{\theta}_j - \bar{\theta})^2, \qquad (6.5)$$

where $\bar{\theta} = \sum \hat{\theta}_j / J$ and $\hat{\theta}_j$ is any location estimator applied to the jth marginal distribution. The strategy is to determine whether Q is unusually large when the null hypothesis is true. This is done by centering the data as done when using the bootstrap-t method with the goal of
estimating the distribution of Q when the null hypothesis is true. Now Q is
computed based on bootstrap samples from the centered data yielding .Q∗1 , . . . , Q∗B .
Put these B values in ascending order yielding .Q∗(1) ≤ · · · ≤ Q∗(B) . Then reject the
hypothesis of equal measures of location if .Q > Q∗(u) , where .u = (1−α)B rounded
to the nearest integer. Note that this method can be used when values are missing at
random.
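A Python sketch of this first percentile bootstrap method, taking Q to be the sum of squared deviations of the location estimates from their average (an assumed form, consistent with the surrounding description):

```python
import numpy as np

rng = np.random.default_rng(2)

def percentile_boot_Q(X, est=np.median, B=500, alpha=0.05):
    """Compute Q on the observed data, then compare it with Q* values
    computed on bootstrap samples (rows resampled with replacement)
    from the column-centered data, where the null hypothesis is true."""
    X = np.asarray(X, dtype=float)
    n, J = X.shape
    theta = np.array([est(X[:, j]) for j in range(J)])
    Q = np.sum((theta - theta.mean()) ** 2)
    C = X - theta                        # center each column at its estimate
    qstar = np.empty(B)
    for b in range(B):
        S = C[rng.integers(0, n, n)]
        t = np.array([est(S[:, j]) for j in range(J)])
        qstar[b] = np.sum((t - t.mean()) ** 2)
    qstar.sort()
    u = round((1 - alpha) * B)
    return Q, qstar[min(u, B - 1)]       # reject H0 if Q > critical value
```
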
The second method is based on bootstrap estimates of the measure of location.
That is, for each bootstrap sample, which is based on the observed data, not the
centered data, compute a measure of location for each of the J levels yielding $\hat{\theta}_1^*, \ldots, \hat{\theta}_J^*$. This process is repeated B times yielding a cloud of bootstrap estimates.
Next, the method estimates the assumed common measure of location using the
grand mean and then focuses on how deeply the grand mean is nested in the
bootstrap cloud. Ma and Wilcox (2013) found that the first percentile bootstrap
method is better at handling missing values, assuming that missing values occur
at random. But neither of the bootstrap methods in this section dominates in terms
of controlling the Type I error probability.
The R function
rmanova(x,tr=0.2,grp=c(1:length(x)))
tests the hypothesis of equal population trimmed means among J dependent groups
using the test statistic given by (6.2) with the null distribution approximated by an
F distribution. The R function

rmanovab(x)

also uses the test statistic given by (6.2), but now a bootstrap estimate of the null distribution of F is used. The R function

bd1way(x, misran=FALSE)

applies the percentile bootstrap method based on the test statistic Q.
For J dependent groups, there are

$$L = \frac{J^2 - J}{2}$$

sets of difference scores $D_{i\ell}$ ($i = 1, \ldots, n$; $\ell = 1, \ldots, L$). If, for example, $J = 4$, then $L = 6$. The goal is to test

$$H_0: \theta_1 = \cdots = \theta_L = 0, \qquad (6.6)$$

where $\theta_\ell$ ($\ell = 1, \ldots, L$) is the population measure of location associated with the $\ell$th set of difference scores, $D_{i\ell}$ ($i = 1, \ldots, n$). Generate a bootstrap sample by resampling with replacement n rows from

$$\begin{pmatrix} D_{11} & \ldots & D_{1L} \\ \vdots & & \vdots \\ D_{n1} & \ldots & D_{nL} \end{pmatrix} \qquad (6.7)$$

yielding

$$\begin{pmatrix} D_{11}^* & \ldots & D_{1L}^* \\ \vdots & & \vdots \\ D_{n1}^* & \ldots & D_{nL}^* \end{pmatrix}. \qquad (6.8)$$
For each of the L columns of the $D^*$ matrix, compute whatever measure of location is of interest, and for the $\ell$th column label the result $\hat{\theta}_\ell^*$ ($\ell = 1, \ldots, L$). Next, repeat this B times yielding $\hat{\theta}_{\ell b}^*$, $b = 1, \ldots, B$, and then determine how deeply the vector $\mathbf{0} = (0, \ldots, 0)$, having length L, is nested within the bootstrap values $\hat{\theta}_{\ell b}^*$. Now consider the matrix of $\hat{\theta}_{\ell b}^*$ values:

$$\begin{pmatrix} \hat{\theta}_{11}^* & \ldots & \hat{\theta}_{1L}^* \\ \vdots & & \vdots \\ \hat{\theta}_{B1}^* & \ldots & \hat{\theta}_{BL}^* \end{pmatrix}. \qquad (6.9)$$
For each row in this matrix, Mahalanobis distance (mentioned in Sect. 5.1) can
be used to measure how far it is from the center of this data cloud, where the
center is taken to be L means associated with the columns in (6.9). Mahalanobis
distance is a standardized distance based on both the means and covariances of
the values depicted by the matrix (6.9). Section 7.1 provides the details of how
Mahalanobis distance is computed. The main point here is that the Mahalanobis
distance of the null vector .(0, . . . , 0), relative to the distance of the other points in
the bootstrap cloud, can be used to compute a p-value (Wilcox, 2022a, Section 8.3).
If, for example, the distance of the null vector is greater than say 85% of the other
distances, the p-value is 0.15.
Note that here, distances are being measured based on means, variances, and
covariances that are not robust. Despite this, the method just outlined has been
found to perform relatively well in simulations when the goal is to compare robust
measures of location. Section 5.1 noted a computational concern with Mahalanobis
distance when testing the global hypothesis that J independent groups have a
common measure of location. For the situation here, this concern is not relevant.
The R function

rmdzero(x, est=mom, ...)
tests the hypothesis given by (6.6) using a percentile bootstrap method. By default,
the modified one-step M-estimator is used.
6.2 Measures of Effect Size

Let Δ denote a measure of effect size for J dependent groups. Extant methods for testing

$$H_0: \Delta = 0 \qquad (6.10)$$

are limited to a 20% trimmed mean. Wilcox (2023c) found that a percentile bootstrap method is not quite satisfactory: the actual level can be well below the nominal level. A better approach is to use a simulation to estimate the distribution of $\hat{\Delta}$ when sampling from a multivariate normal distribution where all J random variables are independent and each of the marginal distributions is standard normal. If $\hat{\Delta}_1^*, \ldots, \hat{\Delta}_B^*$ are B estimates of Δ obtained in this manner, let $\hat{\Delta}_{(1)}^* \leq \cdots \leq \hat{\Delta}_{(B)}^*$ denote these values written in ascending order, and let $c = (1 - \alpha)B$ rounded to the nearest integer. Then reject (6.10) when testing at the α level if $\hat{\Delta} \geq \hat{\Delta}_{(c)}^*$. Extant simulation results indicate that this approach controls the Type I error probability reasonably well when dealing with non-normal distributions, including situations where the J random variables are correlated.

To provide some perspective on the magnitude of Δ, consider a multivariate normal distribution where all of the marginal distributions have a common variance $\sigma^2$ and where all of the means are zero except the first, which is equal to $\delta\sigma$. For $\delta = 0.2$, 0.5, and 0.8, the expected value of Δ is approximately 0.2, 0.5, and 0.7, respectively.
There is a method for testing the hypothesis given by (6.1) based on the projection
distance of the grand mean. The method is based on an estimate of the distribution of
the effect size when sampling from a normal distribution. There are situations where
this method has higher power compared to using F given by (6.2), and there are
situations where F has more power. This is not surprising because each is sensitive
to different features of the data. A speculation is that the test statistic Q given by
(6.5) can have more or less power than F , but this has not been investigated.
When using difference scores and the goal is to test (6.6), a measure of effect size is the projection distance of the null vector $(0, \ldots, 0)$ from the center of the cloud of difference scores corresponding to (6.7). Currently, there are no
simulation results on how one might test the hypothesis of no effect based on this
measure of effect size.
The R function

rmES.pro(x, ND=NULL, iter=2000, ...)

computes a measure of effect size based on the projection distance of the grand mean of the marginal distributions from the center of the data cloud. Setting the argument PV=TRUE causes the function to return a p-value.
The R function

rmES.dif.pro(x,est=tmean,...)

computes an effect size based on the projection distance of the null vector $(0, \ldots, 0)$ from the center of the cloud of difference scores.
Example This example is based on the essay data available via the R package
WRS2. (The data are stored in the R object essays.) The data stem from a study
of the effects of two forms of written corrective feedback on lexico-grammatical
accuracy in the academic writing of English as a foreign language. A portion
consisted of an outcome measured at four different times. There were three groups,
but for illustrative purposes, the focus here is on the four measures (over time) for
the first group. The sample size is .n = 10. The goal is to understand how the four
measures compare over time. First, the hypothesis of identical marginal trimmed
means is tested using the test statistic F given by (6.2), the bootstrap-t method
based on F , and the percentile bootstrap method based on Q. The R commands
are as follows:
b=fac2list(essays[,4],essays[,2:3]) # sort the data into groups

Consequently, b[1:4] contains the four measures for the first group.
rmanova(b[1:4]) # Based on F, p-value = 0.269
rmanovab(b[1:4]) # Using a bootstrap-t method, p-value = 0.182
bd1way(b[1:4]) # Using a percentile bootstrap method, p-value = 0.317
Next, the first four measures are compared based on difference scores and four
measures of location.
rmdzero(b[1:4]) # Using the MOM estimator, p-value = 0.119
rmdzero(b[1:4],est=tmean) # Using a 20% trimmed mean, p-value = 0.050
rmdzero(b[1:4],est=hd) # Using the Harrell–Davis estimate of the median, p-value = 0.026
rmdzero(b[1:4],est=thd) # Using the trimmed Harrell–Davis estimator, p-value = 0.014
As can be seen, the choice between marginal measures of location and using
difference scores can substantially alter the results. Moreover, when dealing with
difference scores, the location estimator used can alter the p-value substantially. If
the p-values based on difference scores are adjusted based on Hochberg’s method,
using the R function p.adjust, the results are 0.119, 0.0998, 0.078, and 0.0560.
If the Benjamini–Hochberg method for controlling the false discovery rate is used,
now the adjusted p-values are 0.119, 0.0665, 0.052, and 0.052, illustrating the extent
to which the Benjamini–Hochberg method can lower the p-values at the expense of
possibly not controlling the FWE rate.
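Both adjustments can be reproduced with a short Python sketch. Because it uses the rounded p-values shown above, the results differ slightly from the book's, which are based on unrounded values:

```python
import numpy as np

def hochberg(p):
    """Hochberg step-up adjusted p-values."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)[::-1]            # largest p first
    adj = np.empty(len(p))
    running = 1.0
    for rank, i in enumerate(order):       # rank 0 -> multiplier 1
        running = min(running, (rank + 1) * p[i])
        adj[i] = running
    return adj

def bh(p):
    """Benjamini-Hochberg adjusted p-values (false discovery rate)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)[::-1]
    adj = np.empty(m)
    running = 1.0
    for rank, i in enumerate(order):
        running = min(running, m / (m - rank) * p[i])
        adj[i] = running
    return adj

pvals = [0.119, 0.05, 0.026, 0.014]        # rounded p-values from the example
```

Here hochberg(pvals) gives 0.119, 0.100, 0.078, 0.056 and bh(pvals) gives 0.119, 0.067, 0.052, 0.052, reproducing the pattern of the adjusted values reported above.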
Finally, effect sizes based on difference scores were estimated by the R function
rmES.dif.pro using three measures of location: 20% trimmed mean, trimmed
Harrell–Davis estimator, and the Harrell–Davis estimator. The estimates were 0.700,
0.514, and 0.623, respectively. These estimates are moderately large based on a
common convention, but the precision of the estimates is not known. That is, no
method for computing a confidence interval has been studied and found to be
reasonably satisfactory. And there is the basic concern that the sample size is small.
6.3 Global Tests for a Between-by-Within Design

This section deals with global tests for a two-way design where the first factor deals with independent groups, and the second factor deals with dependent groups. Note that for each independent group, the dependent measures have unknown variances and covariances. Let $\Sigma_j$ (a K-by-K matrix) denote the variances and covariances for the jth group. Classic methods assume, in addition to normality, that $\Sigma_1 = \cdots = \Sigma_J$. That is, a type of homoscedasticity assumption is made that includes the assumption that the groups have common covariances. This restrictive
assumption can be avoided using results in Johansen (1980), which can be readily
extended to a method based on trimmed means.
Here, a brief outline of the method is provided assuming familiarity with basic
matrix algebra, which is summarized in Appendix A. Complete computational
details are in Wilcox (2022a, Section 8.6.1). The method formulates the hypotheses
of no main effects and no interactions based on a collection of linear contrasts, C.
The null hypothesis is
$$H_0: C\mu_t = \mathbf{0}, \qquad (6.11)$$
the hypothesis that all of the linear contrasts are equal to zero. Let .Sj denote the
Winsorized covariance matrix of the K measures associated with the j th level of
Factor A. Let
$$V_j = \frac{(n_j - 1)S_j}{h_j(h_j - 1)}, \quad j = 1, \ldots, J,$$

and

$$\bar{\theta}_{j.} = \frac{1}{K} \sum_{k=1}^{K} \theta_{jk}.$$
For each level of Factor A, generate B bootstrap samples for the K dependent groups. Let $\bar{\theta}_{j.}^*$ be the bootstrap estimate for the jth level of Factor A. For levels j and j′ of Factor A, $j < j'$, set $\delta_{jj'}^* = \bar{\theta}_{j.}^* - \bar{\theta}_{j'.}^*$. The null hypothesis is rejected based on how deeply $\mathbf{0}$, having length $(J^2 - J)/2$, is nested within the B bootstrap values.
Now consider Factor B. A simple way to proceed is to ignore the levels of Factor A. Let $n_j$ be the sample size for the jth level of Factor A. In effect, test the hypothesis that the K dependent groups have a common measure of location, where the sample size is $N = \sum n_j$. This can be done as described in Sect. 6.2, using some marginal measure of location, or difference scores can be used.
Finally, there is the issue of testing the hypothesis of no interaction. One approach
is as follows. First, consider a 2-by-2 design, and for the first level of Factor A,
let .Di1 = Xi11 − Xi12 , .i = 1, . . . , n1 . Similarly, for level 2 of Factor A let
.Di2 = Xi21 − Xi22 , .i = 1, . . . , n2 , and let .θd1 and .θd2 be the population measure
of location corresponding to the .Di1 and .Di2 values, respectively. The hypothesis
of no interaction is
$$H_0: \theta_{d1} - \theta_{d2} = 0. \qquad (6.14)$$
Again the basic strategy for testing hypotheses is generating bootstrap estimates and
determining how deeply 0 is embedded in the B values that result. For a J -by-K
design, where J or K, or both, are greater than two, there are a total of
$$C = \frac{J^2 - J}{2} \times \frac{K^2 - K}{2}$$
interactions to be tested, one for each pairwise difference among the levels of Factor
B and any two levels of Factor A. The null hypothesis of no interaction is tested
based on the depth of .(0, . . . , 0), a vector of length C, in a bootstrap cloud.
The following R functions are designed to test global hypotheses when dealing with
a between-by-within design. The R function
bwtrim(J,K,x,tr=0.2,grp=c(1:p),p=J*K)
performs global tests based on trimmed means using a generalization of the method
derived by Johansen (1980). As usual the argument tr controls the amount of
trimming. A bootstrap-t method can be used instead via the R function
bwtrimbt(J,K,x,tr=0.2,JK=J*K,grp=c(1:JK),nboot=599)
The next three functions are based on the percentile bootstrap method as described in the previous section. The R function sppba tests the hypothesis of no main effects for Factor A. By default, a 20% trimmed
mean is used, but other measures of location can be used via the argument est. The
argument avg=TRUE indicates that the averages of the measures of location (the .θ̄j.
values) will be used. That is, (6.13) is tested. Otherwise, difference scores are used.
By default, the argument MDIS=TRUE, meaning that the depths of the points in the
bootstrap cloud are based on Mahalanobis distance. Otherwise, a projection distance
is used. If MDIS=FALSE and MC=TRUE, a multicore processor will be used if one
is available.
The R function
$psihat
[1] 0.198452381 0.009880952 -0.188571429
$con
[,1] [,2] [,3]
[1,] 1 1 0
[2,] -1 0 1
[3,] 0 -1 -1
The values labeled psihat are the estimates of the linear contrasts. The hypothesis
is that all three linear contrasts are equal to zero. The values under con are the
contrast coefficients for all pairwise comparisons. The first column indicates that
levels 1 and 2 of Factor A are being compared and that the first value under psihat,
0.198452381, is the estimate of the linear contrast, which in this case is .θ1. − θ2. .
Comments should be made about the contrast coefficients when testing the
hypothesis of no interactions. Here is the output of sppbi for a 2-by-3 design.
$psihat
[1] -0.1994419 0.1090603 0.2907358
$con
The notation θd112 refers to the difference scores between levels 1 and 2 of Factor B when focusing on level 1 of Factor A. In this case, one of the interactions is based on θd112 − θd212. That is, do the difference scores between
levels 1 and 2 of Factor B differ when looking at levels 1 and 2 of Factor A? The
first three rows of $con correspond to the difference scores .θd112 , .θd113 and .θd123 ,
respectively. The first column indicates that .θd112 is being compared to .θd212 .
The R function
estimates the explanatory measure of effect size for Factor A and a projection distance measure of effect size for Factor B that was described in Sect. 6.2.
A within-by-within design refers to situations where both factors deal with depen-
dent groups. For example, brothers and sisters might be measured at two different
times. It is briefly noted that methods used to deal with a between-by-within design
can be extended to a within-by-within design. Readers interested in technical details
are referred to Wilcox (2022a).
The R function wwtrim tests for main effects and interactions in a within-by-within design using trimmed means. The R function
6.5 Global Tests for a Three-Way Design
The methods for testing global hypotheses associated with a two-way design are
readily extended to a three-way design. Currently, the ability of the methods for
a three-way design to control the Type I error probability has not been studied
extensively. There are some indications that methods for testing global hypotheses
based on a 20% trimmed mean control the Type I error probability reasonably well
(Wilcox, 2022a), but extensive simulations have not been reported.
The R function
bbwtrim(J,K,L,x,grp=c(1:p),tr=0.2)
tests all omnibus main effects and interactions associated with a between-by-
between-by-within design. The data are assumed to be stored as described in
Sect. 5.2.4 in conjunction with the R function t3way. For a between-by-within-by-within design, use

bwwtrim(J,K,L,x,grp=c(1:p),tr=0.2)

For a within-by-within-by-within design, use

wwwtrim(J,K,L,x,grp=c(1:p),tr=0.2)
The R functions bbwtrimbt, bwwtrimbt, and wwwtrimbt are the same as the functions bbwtrim, bwwtrim, and wwwtrim, respectively, only a bootstrap-t method is used.
The R function
wwwmed(J,K,L,x, alpha=0.05)
is like the function wwmed, which is based on medians, only adapted to a within-
by-within-by-within design.
6.6 Multiple Comparisons

The measures of effect size related to global tests provide some information about the overall extent to which groups differ. But a more detailed understanding of how groups differ and by how much is needed. As was done in previous sections, methods based
on marginal measures of location, as well as measures based on difference scores,
are covered. This section begins with non-bootstrap methods for trimmed means.
This is followed by a description of bootstrap methods.
This section discusses the special case where pairwise comparisons are made based
on the trimmed means associated with J dependent groups. When dealing with
the marginal distributions, a simple approach is to use the method in Sect. 4.1.1
via the R function yuend and control the FWE rate with Hochberg's method,
or control the false discovery rate using the Benjamini–Hochberg method. The
resulting confidence intervals can be adjusted based on either of these two methods
so that the simultaneous probability coverage is approximately .1 − α. That is,
confidence intervals can be constructed with the goal that all of the confidence
intervals contain the true value of .μtj − μtk (.j < k) with probability .1 − α.
When dealing with the trimmed means of the difference scores, now use the Tukey–
McLaughlin method via the R function trimci. The confidence intervals can be
adjusted in a similar manner. For the special case where the sample median is used
based on difference scores, the method in Sect. 2.3.3 can be used, which assumes
random sampling only.
Note that yet another way of proceeding is to use a sign test for each pair of
groups. And one could perform all pairwise comparisons using the methods in
Sect. 4.5 that are based on measures of effect size.
The R function

rmm.mar(x, ..., ADJ.CI=FALSE)
performs all pairwise comparisons based on the trimmed means of the marginal
distributions. By default, Hochberg’s method is used to control the FWE rate. If
ADJ.CI=TRUE, the function will attempt to adjust the confidence intervals so that
the simultaneous probability coverage is .1 − α. The default is ADJ.CI=FALSE
because it is possible that the method for making the adjustment can encounter a
computational issue.
The R function

rmm.dif(x, ..., ADJ.CI=FALSE)
is like the function rmm.mar, only now trimmed means based on difference scores
are used.
For each pair of groups, the R function
tests the hypothesis that the median of the difference scores is zero using the method
in Sect. 2.3.3 that assumes random sampling only.
The R function deplin.ES.summary.CI tests hypotheses based on four measures of effect size covered in Sect. 4.5 that are
based on difference scores. The four measures are AKP, quantile shift based on the
median, quantile shift based on the 20% trimmed mean, and the sign test.
When dealing with a trimmed mean, simply use the bootstrap methods in Chap. 4
for each pair of groups and use Hochberg’s method to control the FWE rate.
When using the M-estimator or the MOM estimator, based on the marginal
distributions (not the difference scores) an analog of the method in Sect. 5.3.1 can be
used. In fact, the method can be used with any collection of linear contrasts. For any linear contrast $\Psi$, let $\hat{\Psi}$ be the estimate of $\Psi$ and let S denote a bootstrap estimate of the standard error of $\hat{\Psi}$. The estimate of the standard error is based on bootstrap estimates of the relevant variances and covariances, after which one proceeds in an analogous manner to the estimate of the standard error used by (4.8) in Sect. 4.1.1. Then, a reasonable test statistic is

$$T = \frac{\hat{\Psi}}{S}. \qquad (6.15)$$
Basically, this is a bootstrap analog of the approach described in Sect. 6.6.5.
Let $T_{jk}$ denote the value of T when comparing groups j and k. When performing
multiple tests, an analog of methods based on the Studentized maximum modulus
distribution can be used. This is done by first centering the data. That is, compute

$$C_{ij} = X_{ij} - \hat{\theta}_j,$$

where $\hat{\theta}_j$ is the estimate of the measure of location for the jth group. Proceed as
follows:
1. Generate bootstrap samples from the centered data.
2. Based on the bootstrap sample for groups j and k, $j < k$, compute the test statistic and label the result $T_{jk}^*$.
3. Let $T_m^*$ denote the largest $|T_{jk}^*|$ value.

Repeat steps 1–3 B times yielding $T_{mb}^*$, $b = 1, \ldots, B$, which provide an estimate of the null distribution of the maximum absolute test statistic.
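Steps 1–3 can be sketched as follows. The statistic here uses a crude plug-in standard error for the difference scores purely for illustration, whereas the method in the text uses a bootstrap estimate of the standard error of each contrast:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

def maxT_critical(X, est=np.median, B=300, alpha=0.05):
    """Center each column at its location estimate, repeatedly resample
    rows, compute a T-like statistic for every pair of columns, keep
    the largest |T*|, and return the (1 - alpha) quantile of the B
    maxima as the critical value."""
    X = np.asarray(X, dtype=float)
    n, J = X.shape
    C = X - np.array([est(X[:, j]) for j in range(J)])   # centered data
    pairs = list(combinations(range(J), 2))
    tmax = np.empty(B)
    for b in range(B):
        S = C[rng.integers(0, n, n)]
        tvals = []
        for j, k in pairs:
            d = S[:, j] - S[:, k]                        # difference scores
            se = d.std(ddof=1) / np.sqrt(n)              # crude SE, illustrative
            tvals.append(abs(est(d)) / max(se, 1e-12))
        tmax[b] = max(tvals)
    tmax.sort()
    u = round((1 - alpha) * B)
    return tmax[min(u, B - 1)]
```

Any observed $|T_{jk}|$ exceeding this critical value is declared significant, which controls the FWE rate over all pairwise comparisons.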
The R function
The values under n are the sample sizes and the values under N are the sample
sizes when tied values are removed.
Using rmm.dif yields
$test
Group Group p.value p.adjust
[1,] 1 2 0.011616897 0.02323379
[2,] 1 3 0.003947171 0.01184151
[3,] 2 3 0.527672894 0.52767289
$psihat
Group Group est ci.lower ci.upper
[1,] 1 2 10.684615 2.856561 18.512670
[2,] 1 3 12.253846 4.747373 19.760319
[3,] 2 3 1.069231 -2.512494 4.650955
and rmm.mar yields
$test
Group Group p.value p.adjust
[1,] 1 2 0.003766160 0.007532320
[2,] 1 3 0.002051278 0.006153833
[3,] 2 3 0.507752947 0.507752947
$psihat
Group Group est 1 est 2 dif ci.lower ci.upper
[1,] 1 2 55.86154 44.49231 11.369231 4.454327 18.284134
[2,] 1 3 55.86154 43.26154 12.600000 5.588849 19.611151
[3,] 2 3 44.49231 43.26154 1.230769 -2.697101 5.158639
The results comparing measures of effect size via the R function deplin.ES.summary.CI are
$con
[,1] [,2] [,3]
[1,] 1 1 0
[2,] -1 0 1
[3,] 0 -1 -1
$output
$output[[1]]
NULL Est S M L ci.low ci.up p.value
AKP 0.0 0.6943179 0.10 0.30 0.50 0.1874195 1.5715061 0.016
QS (median) 0.5 0.8095238 0.54 0.62 0.69 0.5682958 1.0000000 0.008
QStr 0.5 0.7142857 0.54 0.62 0.69 0.5395288 1.0000000 0.036
SIGN 0.5 0.2857143 0.46 0.38 0.31 0.1355883 0.5021141 0.060
$output[[2]]
NULL Est S M L ci.low ci.up p.value
AKP 0.0 0.8304048 0.10 0.30 0.50 0.46925512 1.5375211 0.000
QS (median) 0.5 0.6666667 0.54 0.62 0.69 0.61904879 1.0000000 0.000
QStr 0.5 0.7619048 0.54 0.62 0.69 0.66666471 1.0000000 0.000
SIGN 0.5 0.1904762 0.46 0.38 0.31 0.07079275 0.4058885 0.007
$output[[3]]
NULL Est S M L ci.low ci.up p.value
AKP 0.0 0.1518562 0.10 0.30 0.50 -0.2918608 0.9331569 0.500
QS (median) 0.5 0.6666667 0.54 0.62 0.69 0.3809236 0.8100217 0.412
QStr 0.5 0.5714286 0.54 0.62 0.69 0.3333330 0.8095512 0.750
SIGN 0.5 0.3000000 0.46 0.38 0.31 0.1431593 0.5212908 0.080
As usual, the contrast coefficients labeled $con indicate which groups are
being compared. For example, the first column indicates that the results labeled
$output[[1]] refer to the results comparing time 1 to time 2. All methods used to
compare times 2 and 3 are consistent in the sense that none have a p-value less
than 0.05. Note, however, that when comparing time 2 to time 3, the sign test
yields a p-value vastly lower than the p-values based on the other methods that
were used, illustrating once again that the method chosen can make a substantial
difference when assessing the strength of the empirical evidence that a decision can
be made about which group has the smaller measure of interest. Moreover, the sign
test estimates the effect to be relatively large compared to the three other measures
of effect size reported by deplin.ES.summary.CI. When comparing the first two times, the p-values range between 0.008 and 0.060.
Comparing the marginal medians using rmm.marpb, by setting est=hd, the
p-values are less than 0.001 comparing time 1 to time 2 and time 1 to time 3.
Comparing times 2 and 3, the p-value is 0.52. Based on difference scores, using
rmm.difpb the p-values are 0.002, 0.000 and 0.28, respectively. Comparing the
0.8 quantiles, including the argument q=0.8, the p-values based on the marginal
distributions are 0.31, 0.008, and 0.304. Based on the difference scores, all three
p-values are less than 0.001.
As was the case in Sect. 5.4, when dealing with two-way designs, a common goal
is to compare levels j and j′ of the first factor for every j < j′. The same is done
for the second factor, and there is the goal of making inferences about each of the
two-by-two interactions. Once again, linear contrasts provide a convenient way of
dealing with these issues, including situations where a three-way design is used. The
main difference from Chap. 5 is that here, a method needs to take into account any
dependence among the measures that might exist. There is a general non-bootstrap
technique for dealing with this issue based on trimmed means (e.g., Wilcox, 2022a
Section 8.6.8). To provide at least some indication of how the method is applied, let
$h = n - 2g$ denote the number of values not trimmed, let

$$d_j^2 = \frac{1}{h(h-1)} \sum_i (W_{ij} - \bar{W}_j)^2, \qquad d_{jk} = \frac{1}{h(h-1)} \sum_i (W_{ij} - \bar{W}_j)(W_{ik} - \bar{W}_k),$$

and $W_{ij}$ denotes the Winsorized data. For J dependent groups, the squared standard error of

$$\hat{\Psi} = \sum_j c_j \bar{X}_j$$

is estimated with

$$S = \sum_{j=1}^{J} \sum_{k=1}^{J} c_j c_k d_{jk}, \qquad (6.17)$$
where $d_{jk} = d_j^2$ when $j = k$, and $c_1, \ldots, c_J$ are linear contrast coefficients described in Sect. 5.4. This suggests using the test statistic

$$T = \frac{\hat{\Psi}}{\sqrt{S}}. \qquad (6.18)$$
The R function bwmcp tests all relevant pairwise comparisons among the main effects and all interactions based on trimmed means and a between-by-within design. It creates all
relevant linear contrasts for main effects and interactions by calling the R function
con2way. The function returns results corresponding to Factor A, Factor B, and
all interactions. The FWE rate is controlled based on the argument method, which
defaults to Hochberg’s method. The R function

bwmcppb.adj(J, K, x, est=tmean, JK = J*K, method = 'hoch')
6.6 Multiple Comparisons 157
This section describes some alternative methods for performing multiple compar-
isons that provide different perspectives on how groups compare when dealing with
a between-by-within design.
Method BWAMCP
One possibility is, for each level of Factor B, to perform all pairwise comparisons
among the levels of Factor A. This can be done using the methods in Sect. 5.3.
Method BWBMCP
A related approach is to ignore the levels of Factor A and perform pairwise
comparisons among the levels of Factor B. That is, the data are pooled over the
levels of Factor A. For example, if .J = 2, .K = 4 and the sample sizes for levels 1
and 2 of Factor A are .n1 and .n2 , treat the data as a matrix having .N = n1 + n2 rows
and .K = 4 columns. Now proceed as described in Sect. 6.6.1.
Method BWIMCP
Section 6.6.5 noted that hypotheses about main effects and interactions can be tested
via appropriate linear contrasts. When dealing with interactions, this approach uses
marginal measures of location among the dependent groups. Another approach is
to use difference scores. For a 2-by-2 design, this means that the goal is to test the
hypothesis given by (6.14). The point here is that for the general case of a J -by-K
design, this can be done for all of the interactions associated with any two levels of
Factor A and any two levels of Factor B.
Method BWIDIF
Consider any two levels of Factor A and any two levels of Factor B. For the first
level of Factor A, imagine that difference scores are computed for the two levels
of Factor B. Further imagine that difference scores are computed for the second
level of Factor A. These two sets of difference scores can be compared with Cliff’s
method described in Sect. 3.2. That is, the goal is to determine the probability that a
difference score for the first level of Factor A is less than the difference score for the
second level of Factor A. Of course, this can be done for any two levels of Factors
A and B.
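The point estimate underlying this approach can be sketched in Python. As an assumption for illustration, ties count as one half, which mirrors common practice; the function name is hypothetical, and the inferential part of Cliff's method is omitted.

```python
def diff_scores(time1, time2):
    # Difference scores for one level of Factor A: paired measures on
    # the two levels of Factor B.
    return [a - b for a, b in zip(time1, time2)]

def prob_less(d1, d2):
    # Estimate P(D1 < D2): the proportion of all (i, j) pairs where a
    # difference score from level 1 of Factor A is less than one from
    # level 2. Ties are counted as one half (an assumption made here).
    total = 0.0
    for a in d1:
        for b in d2:
            if a < b:
                total += 1.0
            elif a == b:
                total += 0.5
    return total / (len(d1) * len(d2))
```

Under no interaction, this probability is 0.5.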
Method BWIPH
Consider again any two levels of Factor A and any two levels of Factor B. Let .p1
denote the probability that for the first level of Factor A, the first level of Factor B
has a value less than the second level of Factor B. For example, Factor B might be
measures at two different times and .p1 is the probability the measure at time 1 is
less than the measure at time 2. Let .p2 denote the corresponding probability for level
two of Factor A. Then no interaction corresponds to p1 − p2 = 0. Inferences
about p1 − p2 can be made using the methods in Sect. 3.4.
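A minimal Python sketch of the point estimates for p1 and p2; the function name is hypothetical, ties are counted as one half (an assumption), and the inferential methods of Sect. 3.4 are not reproduced.

```python
def p_hat(time1, time2):
    # For one level of Factor A, estimate the probability that the
    # time-1 measure is less than the time-2 measure (paired data).
    # Ties are counted as one half, an assumption made for this sketch.
    n = len(time1)
    less = sum(1 for a, b in zip(time1, time2) if a < b)
    ties = sum(1 for a, b in zip(time1, time2) if a == b)
    return (less + 0.5 * ties) / n

def interaction_estimate(a1_t1, a1_t2, a2_t1, a2_t2):
    # No interaction corresponds to p1 - p2 = 0.
    return p_hat(a1_t1, a1_t2) - p_hat(a2_t1, a2_t2)
```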
applies method BWAMCP. That is, it performs all pairwise comparisons among
levels of Factor A using trimmed means. The R function
bwimcp(J, K, x, tr=0.2)
bwiDIF(J, K, x, JK=J*K, grp=c(1:JK), alpha=0.05, SEED=TRUE)
The R function
performs all pairwise comparisons among the levels of Factor B for each level of A.
Finally, the R function
spmcpi(J, K, x, est=tmean, JK=J*K, grp=c(1:JK), alpha=0.05, nboot=NA, SEED=TRUE, pr=TRUE, SR=FALSE, ...)
uses a percentile bootstrap method for testing 2-by-2 interactions based on differ-
ence scores associated with Factor B, the dependent groups.
Effect sizes previously described are readily extended to two-way designs. For
example, for a between-by-within design, when performing all pairwise comparisons
among the levels of the first factor, effect sizes for independent groups, described in
Sect. 3.6, can be computed for each level of Factor B. In a similar manner, effect sizes for pairs
of dependent groups can be computed for each level of Factor A. As for interactions,
consider a 2-by-2 design. For the first level of Factor A, compute difference scores
based on the two levels of Factor B. Do the same for the second level of Factor A,
and compute a measure of effect size based on these two sets of difference scores
using methods in Sect. 3.6. For a J -by-K design, this can be done for any two levels
of Factor A and any two levels of Factor B.
The following R functions provide estimates of effect size when one or two factors
deal with dependent groups. For every level of Factor B, the R function
computes measures of effect size for each pair of levels of Factor A. For every level
of Factor A, the R function
6.8 R Functions bw.es.A, bw.es.B, bw.es.I, bw.2by2.int.es, ww.es, and fac2Mlist 161
computes measures of effect size for pairs of levels of Factor B. Setting the
argument CI=TRUE, confidence intervals are reported. Effect sizes dealing with
interactions are computed by the R function

bw.2by2.int.es(x, CI = FALSE)

which computes several measures of effect size simultaneously that are aimed at
characterizing interactions. For each level of Factor A, the function computes
difference scores based on the two levels of Factor B and then computes measures
of effect size using the ES.summary function in Sect. 3.6.4.
For each level of Factor A, the R function
computes measures of effect size for levels k and k′ of Factor B. This is done
for all k < k′. The results are labeled $Factor A. Then, for each level of Factor
B, the function estimates measures of effect size for Factor A. The results under
$B[[1]]$effect.size[[1]] indicate six measures of effect size, described in Sect. 3.6,
for the two levels of Factor A when focusing on level one of Factor B. Effect sizes
for interactions are labeled INT.
When manipulating data stored in a file, the R function
can be useful. Imagine that there are J independent groups and that a certain column
of a matrix contains group identifications. Also imagine certain columns of the
matrix contain data taken at K different time points. The goal is to store the data
in a manner that can be used with the R functions designed for a between-by-within
design. The function sorts data into groups based on values stored in the column
indicated by the argument grp.col. The columns containing data for times 1
through K are indicated by the argument lev.col. The results are stored in list
mode. If stored in the R object a, for example, a[[1]] would contain a matrix
with K columns.
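The sorting step the function performs can be sketched in Python; fac2mlist_like is a hypothetical stand-in that mimics the grouping only, not the exact output format of fac2Mlist, and it assumes the data are a list of rows with 0-based column indices.

```python
def fac2mlist_like(rows, grp_col, lev_cols):
    # Sort the rows of a data matrix into groups based on the value in
    # column grp_col; for each group keep only the columns in lev_cols
    # (the K time points). Groups are returned in sorted key order.
    groups = {}
    for row in rows:
        key = row[grp_col]
        groups.setdefault(key, []).append([row[c] for c in lev_cols])
    return [groups[k] for k in sorted(groups)]
```

Each element of the result is one group's n-by-K matrix, ready to be handed to a between-by-within analysis.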
a=fac2Mlist(x,5,c(2:4))
would sort the data into two groups based on gender and store the data in list mode.
The R object a[[1]] would contain a matrix with three columns corresponding to
the three CESD measures for males. The command
d=c(listm(a[[1]]),listm(a[[2]]))
would store the data in list mode where d[1:3] contains the three measures for
males and d[4:6] contains the measures for females. Now, for example, the
command
bwtrim(2,3,d)
$B[[1]]$effect.size
$B[[1]]$effect.size[[1]]
Est NULL S M L
AKP -0.6192749 0.0 -0.20 -0.50 -0.80
EP 0.4927472 0.0 0.14 0.34 0.52
QS (median) NA 0.5 0.55 0.64 0.71
QStr NA 0.5 0.55 0.64 0.71
WMW 0.6826923 0.5 0.55 0.64 0.71
KMS -0.3005117 0.0 -0.10 -0.25 -0.40
$B[[2]]
$B[[2]]$con
[,1]
[1,] 1
[2,] -1
$B[[2]]$effect.size
$B[[2]]$effect.size[[1]]
Est NULL S M L
AKP -0.3831136 0.0 -0.20 -0.50 -0.80
EP 0.3488037 0.0 0.14 0.34 0.52
QS (median) NA 0.5 0.55 0.64 0.71
QStr NA 0.5 0.55 0.64 0.71
WMW 0.6057692 0.5 0.55 0.64 0.71
KMS -0.1870502 0.0 -0.10 -0.25 -0.40
$B[[3]]
$B[[3]]$con
[,1]
[1,] 1
[2,] -1
$B[[3]]$effect.size
$B[[3]]$effect.size[[1]]
Est NULL S M L
AKP -0.4820692 0.0 -0.20 -0.50 -0.80
EP 0.3539674 0.0 0.14 0.34 0.52
QS (median) NA 0.5 0.55 0.64 0.71
QStr NA 0.5 0.55 0.64 0.71
WMW 0.6346154 0.5 0.55 0.64 0.71
KMS -0.2352565 0.0 -0.10 -0.25 -0.40
Note that for QS and QStr, NA is reported. This is because the sample size for
smokers is less than 10. The results under B[[1]] are the effect sizes for the
first trial, when comparing smokers to nonsmokers, which are labeled as being
moderately large. For the other two trials, the estimates are smaller. The contrast
coefficients indicate which levels of Factor A are being compared.
6.9 Exercises
1. The sign test is often viewed as having relatively low power. But are there
situations where it has the lowest p-value compared to the p-values based on
other measures of effect size?
2. When using 20% trimmed means, can using difference scores, rather than
marginal measures of location, result in substantially different p-values?
3. When using the R function sppbi, do the contrast coefficients correspond to
the groups or the difference scores?
4. Consider the hypothesis that J dependent groups have a common mean.
The classic F test assumes sphericity. This assumption is satisfied if the J
random variables have a common Pearson correlation. Comment on the strategy
of testing the hypothesis that J random variables have a common Pearson
correlation and assuming sphericity if this test fails to reject.
5. Imagine that inferences are made based on the one-step M-estimator rather than
a 20% trimmed mean. Generally, is it still possible to get a different p-value
comparing the marginal distributions rather than using the difference scores?
6. Section 6.2.1 reports estimated effect sizes, based on the essay data, for the four
measures associated with the first group. The effect sizes were estimated based
on difference scores and found to be moderately large. Compare these results
to a measure of effect size based on the marginal distributions that is reported
by rmES.pro. Comment on how the estimates compare to using difference
scores.
Hint: When using the command fac2list(essays[,4],essays[,2:3])
to sort the data into groups, the data stored in b are character data. To convert
it to numeric data, use the command lapply(b, as.numeric).
7. For the essay data used in the previous exercise, there are three independent
groups measured on four different occasions. Compare the groups using
bwtrim.
8. Repeat the previous exercise, only now use the R function bwmcp. Comment
on the results related to interactions.
9. The R function bwmcp does not report the linear contrast coefficients. Indicate
how to easily determine the contrast coefficients.
10. For the essay data used in Exercise 6, compute the difference scores for level
1 of Factor A (the control group) and the first two levels of Factor B (essay 1
and essay 2) then plot the distribution of the difference scores using akerd.
Comment on the results.
11. Consider a 2-by-2, between-by-within design. Imagine the data are stored in the
R object a having list mode and that difference scores are to be used. The goal
is to compare the 0.25 quantiles of the difference scores associated with the
two levels of Factor A. Indicate some R code that would accomplish this goal
in a manner that takes into account the material covered in Sect. 3.3.
12. Section 6.8 described and illustrated the R function fac2Mlist using data
stored in the file CESDMF123_dat.txt. Duplicate those commands resulting in
the data being stored in the R object d. Next, compare the groups using the R
function bwtrim.
13. Repeat the previous exercise, only now use the R function bwimcp(2,3,d).
Comment on the difference between this result and the results returned by
bwtrim.
14. Repeat Exercise 12, only now compare the 0.75 quantiles using bwmcppb in
conjunction with the trimmed Harrell–Davis estimator.
15. When testing the hypothesis given by (6.11), here are the linear contrast
coefficients that are used when testing the hypothesis of no main effect for the
between factor when dealing with a 3-by-3 design:
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 1 0
[4,] -1 1
[5,] -1 1
[6,] -1 1
[7,] 0 -1
[8,] 0 -1
[9,] 0 -1
What is a possible concern with this approach?
Chapter 7
Robust Regression Estimators
Y = β0 + β1 X.  (7.1)
When the goal is to make inferences about the intercept and slope, an additional
assumption is routinely made:
Y = β0 + β1 X + ε,  (7.2)

where ε has a normal distribution with mean zero and some unknown variance, σ².
Consider, for example, the situation where β0 = 2 and β1 = 1. If X = 3, the
model says that Y has a normal distribution with mean .2 + 3 = 5 and variance .σ 2 .
If .X = 6, Y has a normal distribution with mean .2 + 6 = 8, and again, the variance
is .σ 2 . That is, the variance of Y , given any value for X, does not depend on X.
This is called the homoscedasticity assumption. Violating this assumption can be a
serious concern as will be seen in Chap. 8.
Let .b0 and .b1 denote candidate choices for .β0 and .β1 , respectively. As indicated
in Chap. 1, the best-known approach for determining .b0 and .b1 is via the least
squares estimator.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 167
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9_7
Let

ri = Yi − b0 − b1 Xi  (7.3)
(.i = 1, . . . , n) denote the residuals. As noted in Sect. 1.4, the least squares estimator
determines the values for .b0 and .b1 that minimize
Σ ri² = Σ (Yi − b0 − b1 Xi)².  (7.4)
It can be shown that the resulting estimate of the slope, given by (1.16), can be
written as
b1 = Σ wi Yi,  (7.5)

where

wi = (Xi − X̄)/Σ (Xi − X̄)².
That is, the slope is estimated with a weighted sum of the Y values, where, in
general, all of the weights differ from zero. Translation: the least squares estimate
of the slope has a breakdown point of only .1/n. The intercept is estimated with
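The weighted-sum form of (7.5) is easy to verify numerically. A minimal Python sketch, with a hypothetical function name:

```python
def ls_slope_weighted(x, y):
    # Least squares slope written as the weighted sum b1 = sum(w_i * Y_i),
    # with w_i = (X_i - Xbar) / sum((X_i - Xbar)^2). Because every weight
    # is generally nonzero, a single wild Y value can move the slope by an
    # arbitrary amount: the breakdown point is only 1/n.
    n = len(x)
    xbar = sum(x) / n
    ssx = sum((xi - xbar) ** 2 for xi in x)
    w = [(xi - xbar) / ssx for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y))
```

For x = (1, 2, 3) and y = (2, 4, 6), the weights are (−0.5, 0, 0.5) and the slope is 2; replacing a single Y value by an ever larger number drives the slope without bound, which is the sense in which the breakdown point is 1/n.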
b0 = Ȳ − b1 X̄,  (7.6)
which has a breakdown point of only .1/n as well. Another important point is that
when using the least squares estimator,
Ŷ = b0 + b1 X  (7.7)
Y = β0 + β1 X1 + · · · + βp Xp.  (7.8)
Let .bj be some candidate choice for .βj (.j = 0, 1, . . . p). Then an estimate of some
measure of location associated with Y , for the ith participant, meaning the ith row
of the matrix (7.9), is
Ŷi = b0 + b1 Xi1 + · · · + bp Xip,  (7.10)
and the residuals are

ri = Yi − Ŷi.  (7.11)
Again, the least squares estimator chooses b0, . . . , bp to be the values that minimize
Σ ri², the sum of the squared residuals. The breakdown point is again only 1/n.
One of the main goals in this chapter is to describe a collection of robust
regression estimators and outline their relative merits. But before continuing, it
helps to first describe methods for detecting outliers when dealing with multivariate
data. Methods that deal with this issue play a crucial role when trying to understand
the association between some dependent variable Y and p independent variables,
p ≥ 1.
The focus in this section is on detecting outliers among the independent variables.
For .p = 1, a single independent variable, methods in Sect. 1.5 can be used.
Next, consider the situation where there are two independent variables. The goal
of determining outliers might seem trivial: use the boxplot rule or the MAD-median
rule on both variables. However, this approach does not take into account the overall
structure of the data. Figures 7.1 and 7.2 illustrate what this means. The arrow in
the left panel of Fig. 7.1 indicates a point that appears to be unusual relative to all
of the other points in the plot. The right panel shows boxplots of both variables, and
as can be seen, no outliers are detected. The left panel of Fig. 7.2 shows the
data in the left panel of Fig. 7.1 rotated 45◦. From this perspective, the point noted
in Fig. 7.1 now appears to be a clear outlier. The right panel of Fig. 7.2 verifies that
this is the case and that two additional points are now declared outliers.
A basic course on multivariate methods might seem to suggest a simple solution:
use the Mahalanobis distance to detect outliers. Let Xi = (Xi1, . . . , Xip) denote p ≥
2 measures for the ith participant (.i = 1, . . . , n). Assuming familiarity with basic
matrix algebra, which is summarized in Appendix A, the Mahalanobis distance of
.Xi from the sample mean,
X̄ = (1/n) Σ Xi,

is

Di = √((Xi − X̄)′ S⁻¹ (Xi − X̄)),  (7.12)
Fig. 7.1 The left panel shows a scatterplot of points with one point, indicated by the arrow,
appearing to be a clear outlier. But based on boxplots for the individual variables, no outliers
are found
Fig. 7.2 The left panel shows a scatterplot of the same points in Fig. 7.1, only rotated 45.◦ . The
left boxplot detects the apparent outlier, and the other boxplot detects two additional outliers
7.1 Detecting Multivariate Outliers 171
where S = (sjk) is the variance-covariance matrix and (Xi − X̄)′ is the transpose of
(Xi − X̄). When p = 1, the Mahalanobis distance reduces to
|X − X̄|/s,  (7.13)
which was used in (1.18) to detect outliers.
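For p = 2, the computation in (7.12) can be sketched in Python with the 2-by-2 matrix inverse written out explicitly; the function name is hypothetical.

```python
import math

def mahalanobis_2d(data):
    # Classical Mahalanobis distances for p = 2 variables, using the
    # sample mean and covariance matrix. The 2x2 inverse is expanded by
    # hand inside the quadratic form (X - Xbar)' S^{-1} (X - Xbar).
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    out = []
    for px, py in data:
        dx, dy = px - mx, py - my
        q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        out.append(math.sqrt(q))
    return out
```

Because the mean and covariance matrix both have a breakdown point of 1/n, these distances are subject to masking, which is the motivation for the robust analogs described next.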
There are two important points regarding the Mahalanobis distance. First, points
that are equidistant from the means form an ellipsoid. Second, the Mahalanobis
distance is based on measures of location and scatter that have a breakdown point
of only .1/n. As a result, using the Mahalanobis distance to detect outliers is subject
to masking.
A way of dealing with this limitation is to use an analog of the Mahalanobis
distance where the mean and covariance matrix are replaced by estimators that
have a reasonably high breakdown point. An early approach was to search for the
central half of the data that has the smallest volume, which is called the minimum
volume ellipsoid (MVE) method. Once obtained, compute the mean and covariance
matrix, and scale the covariance matrix so that it estimates the covariance matrix
when dealing with normal distributions (Rousseeuw & van Zomeren, 1990). This
approach has the highest possible breakdown point, 0.5. A possible concern is that
this method might declare too many points as outliers (Fung, 1993).
Another approach that has received considerable attention is the minimum
covariance determinant (MCD) method, which searches for the half of the data that
has the smallest generalized variance. The generalized variance is the determinant
of the covariance matrix, which quantifies the overall dispersion of a cloud of points.
Based on the half of the data that has the smallest generalized variance, the mean
and covariance matrix are computed, and then the covariance matrix is scaled so
that it estimates the covariance matrix when dealing with data that has a multivariate
normal distribution. Let .Di denote the resulting robust Mahalanobis distance of .Xi
from the center of the data. If
Di > c,  (7.14)
declare .Xi an outlier, where c is the square root of the 0.975 quantile of a chi-squared
distribution with p degrees of freedom. There are many other possibilities based on
a robust analog of the Mahalanobis distance (e.g., Wilcox, 2022a). In effect, these
methods assume that the data are sampled from a distribution that is elliptically
contoured.
Another approach to detecting outliers is to use a projection method, which does
not assume that a distribution is elliptically contoured. The method described here
has a close connection to general theoretical results derived by Donoho and Gasko
(1992). To avoid certain technical details, it is assumed that each of the marginal
Fig. 7.3 Projection of a point onto a line. The head of the arrow points to the projection of the point located at the other end of the arrow. Note the point indicated by o. This reflects the projection of the center of the data cloud
distributions has been standardized by subtracting out the median from each value
and then dividing by MAD/0.6745.
Consider any point .Xi , and let .θ̂ = (θ̂1 , . . . , θ̂p ) denote some robust measure
of location. One possibility is where .θ̂j is the marginal median associated with
the j th measure (.j = 1, . . . , p). Arguments can be made for other choices (e.g.,
Wilcox, 2022a), but the default approach is to use the medians to avoid any
computational issues that can arise. When standardizing by subtracting out the
median as previously described, in effect, the data have been transformed to have a
median of zero.
Next, project all of the data onto the line connecting the center of the data and
the point .X1 . This process was illustrated by Fig. 5.2. For convenience, it is also
illustrated by Fig. 7.3. The arrow indicates a point that is (orthogonally) projected
onto the line. Note the point marked by an o. This indicates the center of the data
cloud after projecting the points onto the line.
In effect, the p-variate data have been reduced to a single variable, in which case
a slight adjustment of the boxplot rule or the MAD-median rule can be used to
check for outliers. This process is repeated for each of the remaining n points,
.X2 , . . . , Xn . The point .Xj is flagged as an outlier if it is flagged as an outlier for
any of the n projections. (For an alternative approach that does not assume data
have an elliptically contoured distribution, see Schreurs et al., 2021.)
To provide at least some indication of how the MAD-median rule is used, note
that among the projected data, there are n distances from the projection of the center
of the data cloud. More generally, for the ith projection, there are n distances from
the projected center of the data cloud: .Di1 , . . . , Din . To explain the notation in
a slightly different manner, .Dij is the distance of the j th point based on the ith
projection. For the ith projection, let Mi denote the median of these distance values,
and let MADi denote the corresponding value of MAD. Two decision rules have been
studied for determining whether a point is an outlier. The first declares the j th point
an outlier if, for any i = 1, . . . , n,

Dij > Mi + c(q2 − q1),  (7.15)
where q1 and q2 are the ideal fourths based on the Di values and c is the square
root of the 0.95 quantile of a chi-squared distribution with p degrees of freedom. The second
approach declares the j th point an outlier if for any .i = 1, . . . , n,
Dij > Mi + c MADi/0.6745,  (7.16)
where now c is the square root of the 0.975 quantile of a chi-squared distribution
with p degrees of freedom. Roughly, a point is declared an outlier if it is flagged
as an outlier based on any of n projections. The first decision rule uses a measure
of dispersion that has a breakdown point of 0.25, while the second method uses
a measure of dispersion that has a breakdown point of 0.5. However, there are
indications that the second method can suffer from swamping: it can declare too
many points as outliers.
An alternative approach is to determine the constant c in (7.15) and (7.16) so
that under normality, the expected proportion of points declared an outlier is equal
to some specified proportion of the sample size. The default proportion used here is
five percent. The adjustment is made based on a simulation where data are generated
from a multivariate normal distribution and all of the correlations are equal to zero.
This helps correct any concerns about swamping when using (7.16).
Here is another perspective on the projection method for detecting outliers.
Roughly, a type of standardized distance is assigned to every point. Note that when
using the MAD-median rule to check for outliers, the distance of a point is measured
by how far it is from the median, divided by MADN .= MAD/0.6745. For each
projection, a point has some standardized distance. The projection distance of a
point is taken to be its maximum distance among all n projections.
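A minimal Python sketch of the projection method for p = 2, using decision rule (7.16); the function names are hypothetical, and for two degrees of freedom the 0.975 chi-squared quantile has the closed form −2 ln(0.025).

```python
import math

def median(v):
    s = sorted(v)
    n = len(s)
    m = n // 2
    return s[m] if n % 2 else 0.5 * (s[m - 1] + s[m])

def projection_outliers(points, c=None):
    # Projection-method outlier detection for p = 2, via rule (7.16).
    if c is None:
        # sqrt of the 0.975 quantile of chi-squared with 2 df
        c = math.sqrt(-2.0 * math.log(0.025))
    # Standardize each marginal: subtract the median, divide by MAD/0.6745.
    cols = list(zip(*points))
    std_cols = []
    for col in cols:
        med = median(col)
        madn = median([abs(v - med) for v in col]) / 0.6745
        std_cols.append([(v - med) / madn for v in col])
    pts = list(zip(*std_cols))
    n = len(pts)
    flagged = [False] * n
    for i in range(n):
        # Project all points onto the line through the center (the
        # origin, after standardizing) and point i.
        dx, dy = pts[i]
        norm = math.hypot(dx, dy)
        if norm == 0:
            continue
        ux, uy = dx / norm, dy / norm
        # Distances from the projected center, which sits at 0.
        dist = [abs(x * ux + y * uy) for x, y in pts]
        m_i = median(dist)
        mad_i = median([abs(d - m_i) for d in dist])
        cut = m_i + c * mad_i / 0.6745
        for j in range(n):
            if dist[j] > cut:
                flagged[j] = True
    return flagged
```

A point is declared an outlier as soon as any one of the n projections flags it, which is what gives the method its resistance to masking.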
When n and p are relatively large, execution time might be an issue when using
the projection method for detecting outliers just described. Consequently, an alterna-
tive method might be preferred. One possibility is to use random projections via the
R package DepthProc. Here, this can be done with the R function outpro.depth
described in Sect. 7.1.1, assuming the R package DepthProc has been installed.
Now declare a point an outlier if its projection distance is an outlier based on the
boxplot rule or the MAD-median rule in Sect. 1.5. There are many other options for
measuring depth (Wilcox, 2022a) that might have practical value when checking for
outliers among multivariate data. The relative merits of these alternative methods
need further study.
There is another method based on the generalized variance that appears to
perform relatively well (e.g., Wilcox, 2008). Computational details are summarized
in Wilcox (2022a, Section 6.4.7). Roughly, the method assigns a value to each point
regarding its impact on the generalized variance. Points with unusually high values
are flagged as outliers. In terms of declaring too many points as outliers, this MGV method
has been found to compete well with the projection method. Moreover, the MGV
method competes well with the projection method in terms of detecting true outliers,
but situations can be constructed where this is not the case. When detecting outliers,
both the projection method and the MGV method make less restrictive assumptions
about the nature of the distribution than the MCD and MVE methods. A practical
concern with the MGV method is that as the sample size increases, execution time
might be an issue.
The R function outpro checks for outliers using the projection method described in the previous section.
Here, the argument m is any R object containing data stored in a matrix or data
frame having n rows and p columns. The argument gval can be used to reset the
value of c used in (7.15) and (7.16). The argument MM=FALSE means that (7.15)
is used to detect outliers. Setting MM=TRUE, (7.16) is used instead. The default is
(7.15) because, as previously noted, there is evidence that it is better at dealing with
swamping: declaring too many points as outliers. But there are situations where the
higher breakdown point associated with (7.16) can be important. When .p = 2 and
plotit=TRUE, the data are plotted with outliers indicated by o. The center of the
data cloud can be specified via the argument center. By default, the marginal
medians are used.
The R function
also uses a projection method to detect outliers, but unlike outpro, it does
not determine the value of c based on a chi-squared distribution. Rather, it uses
a simulation to determine c so that the expected proportion of points declared
outliers, when sampling from a normal distribution, is equal to the value indicated
by the argument rate. The function returns the estimate of c, which is labeled
used.gval. If execution time is high, this value can be used to reset c when using
outpro, provided the same n and p apply. For example, if c is 4.1, use outpro
with the argument gval set equal to 4.1.
The R function outpro.depth computes projection distances based on random projections and then declares points
outliers when their projection distance is an outlier based on the boxplot rule or the
MAD-median rule. The extent to which swamping remains an issue when using this function
in conjunction with the MAD-median rule is unknown.
The R function
applies the MGV method. If the argument y contains data, it is combined with the
data stored in the R object x. For example, if both x and y contain data for a single
variable, the data are combined into a matrix with two columns.
Situations are encountered where the goal is to check for outliers among some of
the variables but not others. An example is given in Sect. 7.4.10. The R function
is designed to deal with this issue in a relatively simple manner that is convenient
when using R functions for estimating regression lines described later in this chapter.
By assumption, the argument x is a matrix or data frame with two or more columns.
The argument id indicates which columns are ignored when searching for outliers.
For example, if x has four columns and column 3 indicates gender with a 0 or
1, while the other three columns have variables that are reasonably continuous,
including gender makes little sense when checking for outliers. The command
out.dummy(x,id=3) checks for outliers ignoring column 3.
Linear models are routinely used, and all indications are that they can be highly
useful. But as will be illustrated, simply assuming that a linear model is adequate
can completely miss the nature of the association. This section describes methods
for checking the assumption that a linear model is reasonable. Section 7.3 describes
smoothers that can be very helpful when the linearity assumption is incorrect.
Certainly, one of the better-known and routinely taught methods for checking
the assumption that a linear model is reasonable is to plot the residuals and the
predicted values of the dependent variable, typically labeled .Ŷ . That is, plot the
points .(Ŷ1 , r1 ), . . . (Ŷn , rn ). This approach is also used to check whether there is
homoscedasticity.
Figure 7.4 shows plots of the residuals and the Ŷ values for four situations. In the
upper left panel, Y = X + ε, where both X and ε have standard normal distributions.
That is, the linearity assumption is true, and there is homoscedasticity. The sample
size is n = 200. In the upper right panel, Y = X + (|X| + 1)ε. (The solid lines are
based on a smoother called LOWESS, which is described in Sect. 7.3.2.) Again, the
linearity assumption is true, but now, there is heteroscedasticity: the variance of Y
depends on X and is equal to (|X| + 1)²σ². In the lower left panel, Y = X² + ε, and
the plot correctly suggests that a quadratic term is needed. Note that the lower right
panel also suggests a quadratic term might be needed. This turns out to be incorrect
for reasons to be described in Sect. 7.3.5. See in particular the discussion of Fig. 7.5.
In practical terms, additional methods can be needed to get a reasonable reflection
of the true association.
For completeness, there is a formal method for testing the hypothesis that a linear
model is correct. The method is motivated by results derived by Stute et al. (1998),
which can be generalized to deal with this issue in a more robust fashion (Wilcox,
1999).
Let .r1 , . . . , rn denote the residuals based on some regression estimator that
assumes a linear model is correct. Consider any two rows of data in the matrix
Fig. 7.5 A smooth, based on LOWESS, using the data in the lower right panel of Fig. 7.4. The true association is nonlinear, but not in the sense suggested by the plot shown in Fig. 7.4: there is a positive association up to about 0, after which little or no association appears to be the case, which is correct based on the way the data were generated
given by (7.9). The ith row, Xi, is said to be less than or equal to the j th row, Xj, if
Xik ≤ Xjk for every k = 1, . . . , p. If Xi is less than or equal to Xj, let Ii = 1;
otherwise, Ii = 0.
where

ri = Yi − Ȳ.

One choice for the test statistic is

D = max |Rj|.  (7.18)

Another is

D = (1/n) Σ Rj².  (7.19)
A wild bootstrap method is used to determine an appropriate critical value. Gen-
erate n observations from a uniform distribution, and label the results .U1 , . . . , Un .
Compute

Vi = √12 (Ui − 0.5),

ri∗ = ri Vi,
and

Yi∗ = Ŷi + ri∗

(i = 1, . . . , n), where Ŷi is the predicted value of Y based on Xi. Based on this
bootstrap sample, compute the test statistic D yielding D∗. Repeat this process B
times, and put the resulting B values in ascending order, yielding D∗(1) ≤ · · · ≤ D∗(B).
The critical value is D∗(u), where u = (1 − α)B rounded to the nearest integer. That
is, reject if

D ≥ D∗(u).  (7.20)
Note that this method provides another way of testing the hypothesis that Y and
X are independent. This method performs well in simulations, but it can often be
difficult to determine exactly why it rejects. For example, a smooth can indicate a
virtually straight line, yet this wild bootstrap method rejects.
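To make the mechanics concrete, here is a minimal sketch of the wild bootstrap test for a single independent variable. The book's implementation is the R function lintest described below; the Python function here uses least squares rather than the Theil-Sen estimator purely for brevity, and the name wild_bootstrap_lin_test is hypothetical.

```python
import numpy as np

def wild_bootstrap_lin_test(x, y, nboot=500, alpha=0.05, rng=None):
    """Sketch of the wild bootstrap test of a linear model (p = 1).

    Returns the test statistic D of (7.18), the bootstrap critical
    value of (7.20), and the reject/fail-to-reject decision.
    """
    rng = np.random.default_rng(rng)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)

    def fit_predict(yv):
        # Least squares fit, used here only to keep the sketch short.
        b1, b0 = np.polyfit(x, yv, 1)
        return b0 + b1 * x

    def stat(resid):
        # R_j = (1/sqrt(n)) * sum_i I(x_i <= x_j) r_i;  D = max_j |R_j|
        I = x[None, :] <= x[:, None]          # I[j, i] = (x_i <= x_j)
        R = (I * resid).sum(axis=1) / np.sqrt(n)
        return np.max(np.abs(R))

    yhat = fit_predict(y)
    r = y - yhat
    D = stat(r)

    Dstar = np.empty(nboot)
    for b in range(nboot):
        V = np.sqrt(12.0) * (rng.uniform(size=n) - 0.5)  # mean 0, variance 1
        ystar = yhat + r * V                             # wild bootstrap sample
        rstar = ystar - fit_predict(ystar)
        Dstar[b] = stat(rstar)

    crit = np.sort(Dstar)[round((1 - alpha) * nboot) - 1]
    return D, crit, bool(D >= crit)
```

Replacing the max in stat with np.mean(R ** 2) would give the statistic (7.19) instead.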
If there are .p > 1 independent variables and there are indications that a linear
model is not adequate, there is the issue of getting more details about the nature of
the nonlinearity. A simple method is to examine a plot for each independent variable.
The smoothers described in the next section can be used to do this. This approach is
known as a partial response plot, but Berk and Booth (1995) note that this approach
can be unsatisfactory. They suggest using a partial residual plot instead.
Assuming that the other predictors have a linear association with Y , fit a linear
model to the data ignoring the j th predictor. The partial residual plot simply plots
the resulting residuals versus .Xj . The R function prplot in Sect. 7.3.5 applies this
method.
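The computation behind a partial residual plot can be sketched directly. The sketch below is Python rather than the book's R (the book's function is prplot), uses least squares purely for brevity, and the name partial_residuals is hypothetical.

```python
import numpy as np

def partial_residuals(X, y, j):
    """Residuals from a least squares fit that ignores column j of X.

    A partial residual plot, as described above, plots the returned
    residuals against X[:, j].  (prplot uses a robust fit; least
    squares is used here only to keep the sketch short.)
    """
    X, y = np.asarray(X, float), np.asarray(y, float)
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef
```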
The R function

chk.lin(x,y,regfun=tsreg,xout=FALSE,outfun=outpro,LP=TRUE,...)
plots the points (Ŷ1, r1), . . . , (Ŷn, rn) assuming that a linear model is correct.
The regression estimator used is indicated by the argument regfun, which
defaults to the Theil-Sen estimator described in Sect. 7.4.5. To use the least
squares estimator, set regfun=ols. The argument xout=FALSE means that
leverage points (defined in Sect. 7.3.4) are not removed. Using xout=TRUE,
leverage points are removed based on the outlier detection method indicated by
the argument outfun. The default is outpro, described in Sect. 7.1.1. The R function

lintest(x,y,regfun=tsreg,nboot=500,alpha=0.05)
tests the hypothesis that a linear model is correct using the wild bootstrap method in
the previous section.
7.3 Smoothers
Let .m(x) denote some measure of location associated with Y , given that .X = x.
The linear model is a special case where the predicted value of Y is given by (7.1).
This section describes methods for estimating .m(x) in a more flexible manner that
are generally known as smoothers. An estimate of .m(x), .m̂(x), is called a smooth.
7.3.1 Splines
Seemingly the most obvious approach to dealing with a situation where (7.1) is
inadequate is to include a quadratic term or higher. Still using the least squares
regression estimator, this means that by assumption, the mean of Y , given X, is
estimated to be
Ŷ = b0 + b1X + b2X².  (7.21)

Ŷ = b0 + b1X^a.  (7.22)

Ŷ = b0 + b1X + b2X² + b3X³.  (7.23)
But the idea of using a polynomial model has been criticized due to the global nature
of its fit. That is, there might be a region among the range of X values where some
choice for .b0 , .b1 , .b2 , and .b3 performs well, but for other regions, some other choice
for .b0 , .b1 , .b2 , and .b3 can be needed.
Splines refer to regression estimators that are aimed at dealing with this concern.
Basically, the method attempts to find intervals where a low degree polynomial
regression line gives a good fit to data. The intervals are marked by what are called
knots. This approach was a major breakthrough that remains popular today. There
are in fact several variations of this approach (e.g., James et al., 2017). However,
there are indications that alternative methods are more satisfactory in general (e.g.,
Härdle, 1990; Wilcox, 2022a, Section 11.5.6).
One concern is that generally, splines use a least squares estimator for each
interval, which is not robust. There is a spline method aimed at estimating the
quantiles of Y given X (e.g., He & Ng, 1999; Koenker & Ng, 2005). But using
default settings, it can poorly approximate the true regression line. More specifically,
it might indicate substantially more curvature than is actually present (Wilcox,
2016a). What was found to be more effective is a running interval smoother, which
is described in Sect. 7.3.3.
δi = |Xi − x|.
Next, retain the f n pairs of points that have the smallest δi values, where f is a
number, to be determined, that has a value between 0 and 1. The quantity f is
called the span. Let δm be the largest δi value among the retained points. Let
Qi = |x − Xi|/δm

and

wi = (1 − Qi³)³;  (7.25)
|Xi − x| ≤ f × MAD/0.6745.  (7.26)
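Condition (7.26) is the heart of the running-interval smoother: Yi is used in estimating m(x) when Xi is close to x in the sense of (7.26). A minimal Python sketch for one predictor, using a 20% trimmed mean as the measure of location (paralleling the default est=tmean of the R functions described below); the names and defaults here are illustrative only.

```python
import numpy as np

def mad(v):
    """Median absolute deviation (not rescaled)."""
    v = np.asarray(v, float)
    return np.median(np.abs(v - np.median(v)))

def trimmed_mean(v, tr=0.2):
    """20% trimmed mean by default: drop floor(tr*n) values per tail."""
    v = np.sort(np.asarray(v, float))
    g = int(np.floor(tr * len(v)))
    return v[g:len(v) - g].mean()

def running_interval_smooth(x, y, pts, f=0.8, tr=0.2):
    """Sketch of the running-interval smoother: for each point in pts,
    take a trimmed mean of the Y values whose X satisfies the
    closeness condition (7.26), |Xi - x| <= f * MAD/0.6745."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    madn = mad(x) / 0.6745                 # MADN, the rescaled MAD
    out = []
    for x0 in np.asarray(pts, float):
        close = np.abs(x - x0) <= f * madn
        out.append(trimmed_mean(y[close], tr) if close.any() else np.nan)
    return np.array(out)
```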
Y = β0 + m1(X1) + · · · + mp(Xp) + ε.
A leverage point is a point for which the values of the independent variable are out-
liers. More formally, .(Yi , Xi1 , . . . , Xip ) is a leverage point if .Xi = (Xi1 , . . . , Xip )
is an outlier among .(X1 , . . . , Xn ). It will be seen in Sect. 7.4.1 that when dealing
with a linear model, certain types of leverage points can be beneficial. But other
types of leverage points can result in an estimate of the regression line that
completely masks the nature of the association among the bulk of the data. This
can occur even for estimators with a high breakdown point as will be illustrated in
Sect. 7.4. Smoothers are no exception: it is important to investigate the impact of
removing leverage points using an outlier detection method described in Sect. 7.1,
particularly when there is more than one independent variable. A difficulty is that
data tend to be sparse in regions where outliers occur, making inferences difficult.
That is, there are very few nearest neighbors. Checking the impact of removing
leverage points is easily done with the R functions described in the next section.
The R function
estimates a regression line (or surface) using the running-interval smoother. For
values other than .X1 , . . . , Xn , .m(x) can be computed with the R function
rplot.pred(x,y,pts=NULL,est=tmean,fr=1,nmin=1,xout=FALSE,outfun=outpro,XY.used=FALSE,...)
The R function

qsm(x,y,qval=c(.2,.5,.8),fr=.8,plotit=TRUE,scat=TRUE,pyhat=FALSE,eout=FALSE,xout=FALSE,outfun=out,op=TRUE,LP=TRUE,tr=FALSE,xlab='X',ylab='Y',pch='.')
can be used to accomplish this goal. By default, the 0.2, 0.5, and 0.8 quantiles are
used. Exercise 3 illustrates this function.
Example The lower right panel of Fig. 7.4 shows a plot of predicted Y values and
the residuals based on the least squares estimator. As previously noted, the plot
would seem to suggest using a quadratic term in the regression model. Figure 7.5
correctly captures how the data were generated. The data were generated with a
slope .β1 = 1 when .X < 0 and .β1 = 0 when .X > 0.
Of course, there is the practical issue of whether situations similar to Fig. 7.5
occur in practice. The next example illustrates that the answer is yes.
Fig. 7.6 Shown is the running-interval smooth based on the CAR and a measure of depressive symptoms, CESD. Note that the regression line suggests that the nature of the association differs depending on whether the CAR is positive or negative (x-axis: CAR; y-axis: CESD)
Example This example is based on the Well Elderly data described in Sect. 3.1.3.
Here, the file A3B3C_dat.txt is used, which deals with measures taken after
intervention. The cortisol awakening response (CAR) is the difference between
cortisol measured upon awakening and measured again 30–45 minutes later. The
CAR has been found to be associated with measures of stress. The goal here
is to understand the association between the CAR and a measure of depressive
symptoms (CESD). Figure 7.6 shows the running-interval smooth with leverage
points removed. It is left as an exercise to show that retaining leverage points
completely masks the association shown in Fig. 7.6. Note that the nature of the
association appears to change close to CAR equal to zero. That is, the nature of
the association appears to depend on whether cortisol increases or decreases after
awakening. Fitting a regression line using the data where the CAR is greater than
zero, and testing the hypothesis that the slope is zero using robust methods in
Chap. 8, the p-value is 0.037. (The R function regci was used with xout=TRUE.)
Fitting a straight line to the points where the CAR is less than zero, the hypothesis
of a zero slope is not rejected, and the p-value is 0.706. And the hypothesis that
these two slopes are equal (using the R function reg2ci in Sect. 8.4.3) is rejected
as well; the p-value is less than 0.001.
Example The next example is based on data stored in the R object Leerkes, which
is available via the R package WRS2. The data deal with the relationship between
how girls were raised by their own mother and their later feelings of maternal
self-efficacy. Included is a third measure that reflects self-esteem. All variables are
scored on a continuous scale from 1 to 4. The sample size is .n = 92. To illustrate a
point, the measure of esteem is taken to be the dependent variable. Figure 7.7 shows
the smooth based on the running-interval smoother. The plot suggests that a linear
model is reasonable. Plotting the residuals and the predicted esteem values (not
shown here) again suggests that a linear model is reasonable. It provides a strong
indication that there is heteroscedasticity. Testing the hypothesis that the slope is
zero, the p-values are 0.003 for maternal care and 0.01 for efficacy. (The R function
regci, described in Chap. 8, was used, which allows heteroscedasticity.)

Fig. 7.7 Shown is the running-interval smooth based on the Leerkes data. This plot would seem to suggest that a linear model is reasonable

Fig. 7.8 Shown is the running-interval smooth for the same data used in Fig. 7.7, with leverage points removed
However, there are four leverage points based on the projection method in
Sect. 7.1. (The R function outpro was used.) Figure 7.8 shows the smooth when
these four leverage points are removed. As is evident, this smooth paints a decidedly
different picture of the association than the one obtained when the leverage points
are retained. Figure 7.8 suggests that the association depends on whether maternal care
is less than or greater than 3. For maternal care less than 3, testing the hypothesis
of a zero slope, the p-values are 0.43 and 0.75 for maternal care and efficacy,
respectively. For maternal care greater than 3, now the p-values are both less than
0.001. Methods for comparing the slopes of these two groups are covered in Chap. 8.
Finally, the R function prplot
creates the partial residual plot, described in Sect. 7.2.1, that provides a check on
the linearity assumption when there are two or more independent variables. By
default, it is assumed that curvature is to be checked using the data stored in
the last column of the matrix x. Setting the argument pval=2, for example, the
independent variable stored in column 2 would be used. The argument op=1 means
that the plot will be based on LOWESS; otherwise, the running-interval smoother is
used.
Specialized smoothers have been developed for situations where Y has only one of
two possible values, say 0 and 1. For a single independent variable X, the goal is to
estimate the probability that Y = 1, given that X = x. In more formal terms, the goal
is to estimate

m(x) = P(Y = 1|X = x).

The method described here is based on a slight variation of the method in Hosmer
and Lemeshow (1989, p. 85).
For notational convenience, assume that X1, . . . , Xn have been standardized.
That is, if the observed values of the independent variable are Z1, . . . , Zn, set
Xi = (Zi − Mz)/MADN, where Mz is the median based on Z1, . . . , Zn and
MADN = MAD/0.6745. Let
wi = Ih exp(−(Xi − x)²),

where Ih = 1 if |Xi − x| < h; otherwise, Ih = 0. The Xi values where Ih = 1 are
called the nearest neighbors. By default, h = 1.2 is used. The estimate of m(x) is

m̂(x) = Σ wi Yi / Σ wi.  (7.28)
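A minimal Python sketch of the estimate (7.28), including the standardization step described above. The book's implementation is the R function logSM; the name binary_smooth and the details below are illustrative only.

```python
import numpy as np

def binary_smooth(z, y, pts, h=1.2):
    """Sketch of the smoother (7.28) for a binary Y.

    The predictor is standardized as in the text: Xi = (Zi - Mz)/MADN.
    Weights are wi = exp(-(Xi - x)^2) for the nearest neighbors with
    |Xi - x| < h and zero otherwise; the estimate is sum(wi*Yi)/sum(wi)."""
    z, y = np.asarray(z, float), np.asarray(y, float)
    mz = np.median(z)
    madn = np.median(np.abs(z - mz)) / 0.6745
    xs = (z - mz) / madn
    out = []
    for zp in np.asarray(pts, float):
        x0 = (zp - mz) / madn
        w = np.where(np.abs(xs - x0) < h, np.exp(-(xs - x0) ** 2), 0.0)
        out.append(np.sum(w * y) / np.sum(w) if w.sum() > 0 else np.nan)
    return np.array(out)
```

Because Y takes on only the values 0 and 1, the weighted average is an estimate of the probability that Y = 1 near x.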
The R function
logSM(x,y,pyhat=FALSE,plotit=TRUE,xlab='X',ylab='Y',...)
computes the smooth given by Eq. (7.28) when .p = 1, where the argument fr is the
span, h. When .p = 2, the function plots the regression surface based on the method
in Wilcox (2022a). The function
can be used to estimate the probability of y given that the independent variable has
the values stored in the argument pts. The R function
rplot.bin(x,y,est=mean,scat=TRUE,fr=NULL,plotit=TRUE,pyhat=FALSE,pts=x,theta=50,phi=25,scale=TRUE,expand=0.5,SEED=TRUE,nmin=0,xout=FALSE,outfun=outpro,xlab=NULL,ylab=NULL,zlab='P(Y=1)',pr=TRUE,duplicate='error',...)
By default, a plot is created. Setting the argument pyhat=TRUE, the function returns
estimates of the probability of success for the explanatory values indicated by the
argument pts.
The R function
deals with the situation where Y has a discrete distribution with a relatively small
sample space. It returns a smooth for each unique value stored in the argument y.
In addition, for each possible value for Y , the function estimates the probability of
getting this value for each point stored in the argument pts.
7.4 Robust Regression Estimators for a Linear Model

This section deals with robust regression estimators assuming that a linear model is
correct. There are many such estimators. This section begins by highlighting some
additional concerns with the least squares estimator. This is followed by a descrip-
tion of three robust estimators that have received a great deal of attention. These
are the quantile regression estimator in Sect. 7.4.2, the MM-estimator described in
Sect. 7.4.3, and the Theil-Sen estimator in Sect. 7.4.4. A possible concern with these
estimators is described in Sect. 7.4.5 as well as a method that is designed to avoid
this concern. Section 7.4.6 comments on a collection of other robust methods that
have been proposed.
The introduction to this chapter showed that the least squares estimator has a
breakdown point of only .1/n. But to add perspective about robust regression
estimators as well as their practical advantages, it helps to review some other
properties of the least squares estimator.
Momentarily assume normality and homoscedasticity, and for convenience,
focus on a single independent variable. As indicated in the description of (7.2),
homoscedasticity means that the variance of Y , given that .X = x, is equal to .σ 2 ,
regardless of what the value of x happens to be. Said more succinctly, VAR.(Y |X =
x) = σ 2 . As noted in a basic statistics course, the squared standard error of the least
squares estimator of the slope, assuming homoscedasticity, is
σ² / Σ(Xi − X̄)²,  (7.29)

Leverage points increase Σ(Xi − X̄)², which in turn lowers the estimate of the
standard error of the slope, b1. However, the situation is not this simple. A leverage
point is said to be a good
leverage point if it is reasonably consistent with the regression line associated with
the bulk of the data. Such points can indeed lower the standard error without having
a negative impact on .b1 , the estimate of the slope. A bad leverage point is a point
that is not reasonably consistent with the regression line associated with the bulk of
the data. Bad leverage points can result in an estimate of the slope, .b1 , that poorly
reflects the regression line for the bulk of the data.
Example Rousseeuw and Leroy (1987, p. 27) report data on the logarithm of the
effective temperature at the surface of 47 stars versus the logarithm of their light
intensity. Figure 7.9 shows a scatterplot of the data. The line with a slightly negative
slope is the least squares regression line using all of the data. The line with a
positive slope is the least squares regression line with all leverage points removed.
A concern, however, is that simply removing all leverage points can eliminate good
leverage points. Here, the six lowest surface temperatures are flagged as outliers.
(The four points in the upper left corner are red giants in contrast to the other stars
that were measured.) But note that the point close to (4, 4) is close to the regression
line with the positive slope, suggesting that it is a good leverage point. What is
needed is a method for making a distinction between these two types of leverage
points, particularly when dealing with .p > 1 independent variables. But before
describing how this might be done, other concerns need to be addressed.
There are two general features of data, beyond leverage points, that can result
in the least squares regression estimator having a large standard error relative to
other estimators that might be used. The first is outliers among the dependent
variable Y . The second is heteroscedasticity. The immediate goal is to describe
three regression estimators that deal with outliers among the dependent variable
that play a prominent role in this book. This is followed by a summary of their
relative merits including their ability to achieve a relatively low standard error when
there is heteroscedasticity. There are in fact many other robust regression estimators
(Wilcox, 2022a), some of which are summarized in Sect. 7.4.8.
Rather than choosing values for the slopes and the intercept that minimize the sum
of squared residuals, the least absolute value estimator (or L1 estimator) chooses
values for the slopes and the intercept that minimize

Σ |ri|,  (7.31)
the sum of the absolute values of the residuals. In contrast to the least squares
estimator where the goal is to estimate the mean of Y , given .X, the least absolute
value estimator is designed to estimate the median of Y given .X. This approach
was proposed by R. Boscovich about 50 years before the least squares estimator
was proposed by Legendre. One reason Boscovich’s method gave way to the
least squares estimator is that the least squares estimator reduces substantially
the computational complexity of estimating the slope and intercept. Today, this
issue is no longer relevant due to the speed of modern computers. Also, from a
theoretical point of view, assuming normality, it is easier working with the least
squares estimator given the goal of deriving a method for making inferences about
the slope and intercept. But modern advances have found very effective methods
for making inferences based on the least absolute value estimator as will be seen in
Chap. 8.
An important generalization of the least absolute value estimator was derived by
Koenker and Bassett (1978) that is designed to estimate the qth quantile of Y given
X. If ri < 0, let

ui = (q − 1)ri.

If ri ≥ 0, let

ui = qri.
The Koenker-Bassett estimator chooses values for the slopes and intercept that
minimize

Σ ui.  (7.32)
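In other words, negative and positive residuals are weighted asymmetrically. A small Python sketch of the resulting objective (check_loss is a hypothetical name); note that with q = 0.5 it reduces to half of the least absolute value criterion (7.31).

```python
import numpy as np

def check_loss(resid, q):
    """The Koenker-Bassett objective: each residual contributes
    u_i = (q - 1) r_i when r_i < 0 and u_i = q r_i otherwise.
    Minimizing sum(u_i) with q = 0.5 is equivalent to minimizing
    the sum of absolute residuals, i.e., estimating the median."""
    r = np.asarray(resid, float)
    return np.sum(np.where(r < 0, (q - 1) * r, q * r))
```

With q = 0.25, negative residuals are penalized three times as heavily as positive ones, pushing the fitted line down toward the 0.25 quantile of Y given X.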
The least absolute value estimator guards against outliers among the dependent
variable Y, but its breakdown point is only 1/n. The problem is X, the independent
variables: the least absolute value estimator is not designed to deal with leverage
points.
points. Of course, a simple solution is to simply remove all leverage points, but a
concern is that this approach also removes good leverage points.
7.4.3 MM-Estimator
Section 2.1.2 described M-estimators of location where extreme values are given
less weight. The same idea has been studied extensively when dealing with
regression. Rather than using (7.29) or (7.31) to determine the slopes and intercepts,
M-estimators use the data to determine how extreme the residuals happen to be. The
more extreme a residual happens to be, the less weight it receives. This includes the
possibility that some residuals are given no weight at all.
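The reweighting idea can be sketched with iteratively reweighted least squares using Huber weights. To be clear, this is not Yohai's MM-estimator, which couples a high-breakdown initial fit with a redescending weight function; the Python sketch below (huber_irls is a hypothetical name) only illustrates how extreme residuals are downweighted.

```python
import numpy as np

def huber_irls(x, y, c=1.345, iters=50):
    """Illustration of the M-estimation weighting idea via iteratively
    reweighted least squares with Huber weights and a MAD-based scale.
    NOTE: this is NOT the MM-estimator of Yohai (1987); it is only a
    sketch of how large residuals receive less weight."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    A = np.column_stack([np.ones_like(x), x])        # intercept and slope
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # least squares start
    for _ in range(iters):
        r = y - A @ beta
        s = np.median(np.abs(r - np.median(r))) / 0.6745
        s = s if s > 0 else 1.0                      # guard against zero scale
        u = r / s
        w = np.where(np.abs(u) <= c, 1.0, c / np.abs(u))  # Huber weights
        Aw = A * w[:, None]
        beta = np.linalg.solve(A.T @ Aw, A.T @ (w * y))   # weighted LS step
    return beta  # (intercept, slope)
```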
There are many M-estimators when dealing with regression (Wilcox, 2022a).
The focus here is on the MM-estimator derived by Yohai (1987). It has excellent
theoretical properties including the highest possible breakdown point, 0.5. Under
normality and homoscedasticity, its standard error is nearly as good as the least
squares estimator. When dealing with non-normal distributions, its standard error
can be substantially smaller than the standard error of least squares estimator,
especially when there is heteroscedasticity.
There are, however, two practical concerns. The MM-estimator requires finding
a solution to p+1 equations. There is no simple equation for doing this, but there is
an iterative estimation method that can be used. The first concern is that situations
are encountered where this iterative scheme does not converge. The second issue is
that despite its high breakdown point, it is subject to contamination bias. That is, a
few outliers cannot result in an estimate of the slopes that is arbitrarily large, but a
few bad leverage points can have a substantial impact on the estimate of the slopes
as will be illustrated in Sect. 7.4.5. This does not mean that the MM-estimator will
perform poorly when there are bad leverage points. There are situations where bad
leverage points have virtually no impact on the estimate of the slope. In Fig. 7.9, for
example, the four upper left points are bad leverage points, but they have virtually
no impact on the estimate of the slope. Nevertheless, bad leverage points can be a
serious concern as will be illustrated, so some caution is warranted. Section 7.4.6
describes a method for detecting bad leverage points. A simple alternative is to
remove all leverage points.
7.4.4 The Theil-Sen Estimator

The next estimator was proposed by Theil (1950) and Sen (1968). Consider a single
independent variable. For any two points, .(Xi , Yi ) and .(Xj , Yj ), let
Sij = (Yi − Yj)/(Xi − Xj)

denote the slope of the line connecting these two points, assuming Xi − Xj ≠ 0.
Computing Sij for all i < j yields (n² − n)/2 slopes. The median of these
slopes, b1, is the Theil-Sen estimate of β1. The intercept is taken to be the median of
the values Yi − b1 Xi.
the MM-estimator. Another negative is that, even when .p = 1, there are situations
where a few bad leverage points can alter the estimate of the slope considerably.
Like the MM-estimator, bad leverage points might have little or no impact on the
estimate of the slope. But this should not be taken for granted. Again, a method for
detecting bad leverage points is described in Sect. 7.4.6.
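The Theil-Sen computation described above can be sketched as follows (Python, illustrative; for the intercept, the common convention of the median of Yi − b1 Xi is assumed, and theil_sen is a hypothetical name).

```python
import numpy as np
from itertools import combinations

def theil_sen(x, y):
    """Sketch of the Theil-Sen estimator for one predictor: the slope
    is the median of the (n^2 - n)/2 pairwise slopes S_ij, and the
    intercept is taken here to be the median of y_i - b1 * x_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    slopes = [(y[i] - y[j]) / (x[i] - x[j])
              for i, j in combinations(range(len(x)), 2)
              if x[i] != x[j]]
    b1 = np.median(slopes)
    b0 = np.median(y - b1 * x)
    return b0, b1
```

Because the slope is a median of pairwise slopes, a single aberrant Y value alters only n − 1 of the (n² − n)/2 slopes and typically leaves the estimate unchanged.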
Another way of comparing the MM-estimator to the Theil-Sen estimator is in
terms of their standard errors. Neither dominates as illustrated in Wilcox (2022a,
Table 10.3). There are situations where the Theil-Sen estimator has a much smaller
standard error than the MM-estimator, but there are situations where the MM-
estimator offers an advantage over the Theil-Sen estimator. Which one has the
smaller standard error depends on the nature of the unknown distribution of the
error term . in (7.2) plus the type of heteroscedasticity.
The goal in this section is to illustrate that contamination bias can be an issue
when using the MM-estimator, the Theil-Sen estimator, and the quantile regression
estimator. Then a relatively simple method is described that deals with this issue.
Based on the basic linear model given by (7.2), two situations are considered.
The first is where both X and . have standard normal distributions. The sample
size is taken to be .n = 50. Data were generated with .β0 = β1 = 1. Then
four bad leverage points were added to the data, namely, .(2.3, −2.4), .(2.4, −2.5),
.(2.5, −2.6), and .(2.6, −2.7). The slope was estimated using the MM-estimator, the
Theil-Sen estimator, and the quantile estimator. This process was repeated 1000
times. Figure 7.10 shows boxplots of the results. The medians of the estimated slopes
differ from one for all three estimators. In this case, the MM-estimator is best at
dealing with contamination bias.
Fig. 7.10 Boxplots of the estimates of the slopes, based on the MM-estimator, the Theil-Sen estimator, and the quantile regression estimator, when X and the error term have standard normal distributions, β1 = 1, n = 50, and four bad leverage points are added to the data

Fig. 7.11 Boxplots of the estimates of the slopes for the same situation as in Fig. 7.10, except that the error term has a mixed normal distribution
Consider again the situation shown in Fig. 7.10; only now the error term has
the mixed normal distribution shown in Fig. 1.2. Figure 7.11 shows boxplots of
the results. As can be seen, even for a slight departure from a normal distribution,
the MM-estimator performs much worse, compared to the situation in Fig. 7.10, in
terms of dealing with contamination bias. The main message is that it is prudent to
use a method that takes into account contamination bias. At a minimum, check on
whether bad leverage points are present, or use a method that automatically takes
this possibility into account. These two approaches can be implemented with the
technique in Sect. 7.4.6.
Rousseeuw and van Zomeren (1990) derived a major advance toward detecting bad
leverage points. Their method is based in part on what is called the least median
squares (LMS) regression estimator. This means that the slopes and the intercept
are taken to be the values that minimize the median of the squared residuals.
This estimator has the highest possible breakdown point, 0.5. A concern is that
its standard error does not compete well with other estimators that might be used.
A more relevant concern here is that it can give a poor fit to the bulk of the data
due to contamination bias. The result is that in some situations, it misses bad
leverage points. This section describes a slight modification of their method that
deals with situations where their method fails. (For details about how well this
method performs, see Wilcox & Xu, 2023.)
The method used to check for bad leverage points is applied as follows:
1. Identify any leverage points using the MAD-median rule when p = 1 or the
projection method when p > 1.
2. Estimate the slopes and intercept using the MM-estimator or the Theil-Sen
estimator with any leverage points removed.
3. Based on the fit in step 2, compute the residuals using all of the data.
4. Check for any outliers among the residuals.
5. If a point is flagged as a leverage point and simultaneously the residual
corresponding to this point is an outlier, decide that the point is a bad leverage
point.
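The five steps can be sketched for the case p = 1 as follows. The cutoff 2.24 for the MAD-median rule and all helper names are assumptions of this sketch; the book's implementation is the R function reglev.gen described next.

```python
import numpy as np
from itertools import combinations

def _mad_median_flags(v, crit=2.24):
    """MAD-median rule: flag v_i when |v_i - M|/MADN exceeds crit
    (2.24 is a commonly used cutoff; treat it as an assumption here)."""
    v = np.asarray(v, float)
    m = np.median(v)
    madn = np.median(np.abs(v - m)) / 0.6745
    madn = madn if madn > 0 else np.finfo(float).tiny
    return np.abs(v - m) / madn > crit

def _ts_fit(x, y):
    # Minimal Theil-Sen fit (single predictor), per step 2.
    s = [(y[i] - y[j]) / (x[i] - x[j])
         for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
    b1 = np.median(s)
    return np.median(y - b1 * x), b1

def bad_leverage_flags(x, y):
    """Sketch of the five steps above for p = 1: a point is a bad
    leverage point when it is a leverage point AND its residual, from
    a fit with leverage points removed, is an outlier."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    lev = _mad_median_flags(x)                    # step 1
    b0, b1 = _ts_fit(x[~lev], y[~lev])            # step 2
    r = y - (b0 + b1 * x)                         # step 3
    bad_res = _mad_median_flags(r)                # step 4
    return lev & bad_res                          # step 5
```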
The R function
checks for bad leverage points. By default, it uses the Theil-Sen estimator; the
MM-estimator is a good alternative. When plotit=TRUE and p = 1, the
function returns a scatterplot of the data with bad leverage points marked with an o
and good leverage points marked with an *.
The R function
tational issues. Setting the argument xout=TRUE, leverage points are removed. To
remove only bad leverage points, set the argument outfun=reglev.gen. The
function returns the residuals unless res.vals =FALSE.
The function
can be used to plot one or more quantile regression lines. The quantiles used are
specified by the argument qval, which defaults to the 0.2 and 0.8 quantiles. The R
function
tsreg(x,y,xout=FALSE,outfun=outpro,iter=5,varfun=pbvar,tr=FALSE,corfun=pbcor,plotit=FALSE,WARN=TRUE,HD=FALSE,OPT=FALSE,xlab='X',ylab='Y',...)
the basic sample median in Sect. 1.5. However, when there are tied (duplicated)
values, there can be an advantage to using the Harrell-Davis estimator instead as
will be explained in Chap. 8. This can be done by setting the argument HD=TRUE,
or the function tshdreg can be used instead. To use the trimmed Harrell-Davis
estimator, set the argument tr=TRUE. The argument iter controls how many
iterations are used to estimate the slopes and intercept when there are .p > 1
independent variables. Setting iter=1 reduces execution time but at the risk of
highly inaccurate estimates of the slopes in some situations. Like MMreg, this
function returns estimates of R²pb and Rpb, which are explained in Sect. 9.4.
Example This example is based on data dealing with predicting the reading ability
of children. Data dealing with several measures are stored in the file read_dat.txt.
Fig. 7.12 Bad leverage points are marked with an o; good leverage points are marked with an * (x-axis: RAN1T1; y-axis: WWISST2)
Here, the goal is to understand the association between a measure of speeded naming
for digits (RAN1T1), the independent variable, and a measure of the ability to
identify words (WWISST2). With all leverage points removed, the slope based on
the MM-estimator is .b1 = −0.56. The estimate based on the Theil-Sen estimator
is .−0.6. Figure 7.12 shows the plot created by the R function reglev.gen using
either one of these two regression estimators. The bad leverage points marked by
an o are no longer flagged as bad leverage points when using the LMS estimator
as suggested by Rousseeuw and van Zomeren (1990). The reason is that the
LMS estimate of the slope is nearly equal to zero. A practical issue is whether
it is reasonable to decide that the slope is negative, as suggested by the MM-estimator
and the Theil-Sen estimator. Methods covered in Chap. 8 indicate that the
answer is yes.
Example The next example is based on the Leerkes data described in Sect. 7.3.5.
Here, the goal is to understand the association between the measure of maternal care
(the independent variable) and the measure of esteem. Rather than focusing on the
median of esteem given a value for maternal care, the goal is to estimate the 0.25 and
0.75 quantiles of the distribution of esteem given a value for maternal care. In effect,
the (conditional) interquartile range is being estimated. Figure 7.13 shows a plot of
the regression lines. The plot suggests that there is a type of heteroscedasticity: there
is less variability as the measure of maternal care increases. The R function qhomt
in Sect. 8.3.6 lends support that this is a reasonable conclusion. This function also
provides evidence that the slope for the 0.25 quantile regression line is greater than
the slope of the 0.75 quantile regression line. From a practical point of view, the
0.25 quantile regression line suggests that relatively low esteem measures are less
likely as the measure of maternal care increases.
As noted in Sect. 1.9, the file Rallfun contains a vast collection of R functions
for applying robust methods. Most of the R functions that deal with estimating the
parameters of a linear regression model have been modified to handle situations
where bad leverage points can be eliminated via the R function reglev.gen. But
if any exceptions are encountered, the R function

reg.reglev(x,y,plotit=TRUE,xlab='X',ylab='Y',GEN=TRUE,regfun=tsreg,outfun=outpro,pr=TRUE,...)

can be used.

Fig. 7.13 The 0.25 and 0.75 quantile regression lines based on the quantile regression estimator and the Leerkes data (x-axis: MatCare; y-axis: Esteem)
P(Y = 1|X) = exp(β0 + β1X1 + · · · + βpXp) / (1 + exp(β0 + β1X1 + · · · + βpXp)).  (7.33)
A point worth stressing is that when .p = 1, this model assumes that the probability
of success (.Y = 1) is either monotonic increasing or decreasing as a function of
.X1 . It does not allow the possibility that the probability of success increases over
some range of the .X1 values and decreases over some other region. A goal here is
to suggest that this assumption should not be taken for granted.
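The monotonicity just described is easy to see from a direct computation of (7.33) with p = 1 (a Python sketch; logistic_prob is a hypothetical name).

```python
import numpy as np

def logistic_prob(x, b0, b1):
    """The logistic regression model (7.33) with p = 1:
    P(Y = 1 | X = x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x)).
    For b1 > 0 this is strictly increasing in x, and for b1 < 0
    strictly decreasing; it can never rise and then fall, which is
    the limitation noted above."""
    eta = b0 + b1 * np.asarray(x, float)
    return 1.0 / (1.0 + np.exp(-eta))  # numerically equivalent form
```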
Another issue is that the logistic regression model can be negatively impacted by
leverage points. A simple strategy is to remove leverage points as a partial check on
whether this makes a practical difference. Another possibility is to use an estimator
derived by Croux and Haesbroeck (2003). (It can be applied via the R function
wlogreg.) The extent to which it improves on simply removing leverage points is unknown.
The R function
can be used. The argument pts indicates the values of the independent variables to
be used.
Example The file kyphosis_dat.txt contains data from a study dealing with kyphosis,
a postoperative spinal deformity. The focus here is on the probability of kyphosis
given the age of the participant in months. Figure 7.14
shows the regression line based on the logistic regression model. The regression
line suggests that the likelihood of kyphosis increases slightly with age. However,
look at the smooth based on the R function logSM shown in Fig. 7.15. This plot
suggests that the likelihood of kyphosis increases up to about the age of 100 months
and levels off or possibly decreases. Methods in Chap. 8 will be used to look more
closely at this issue.
As previously mentioned, the logistic regression model assumes that the proba-
bility of success is monotonically increasing or decreasing as .X1 increases. It might
help to illustrate that simply adding a quadratic term can be an unsatisfactory method
for dealing with this limitation.
Example A sample of n = 100 values were generated from a standard normal
distribution, and the probability of success was computed with

P(Y = 1|X) = exp(β1 X²) / (1 + exp(β1 X²)).  (7.34)
200 7 Robust Regression Estimators
[Fig. 7.14 The likelihood of kyphosis, based on the logistic regression model, given a participant's age; x-axis: Age (0 to 200 months), y-axis: Y]

[Fig. 7.15 A smooth of the likelihood of kyphosis given a participant's age, based on the R function logSm; y-axis: Pred.Prob]
Figure 7.16 shows a plot of the probability of success as a function of X. The left
panel of Fig. 7.17 shows a plot of the regression line based on the model given
by (7.34). The figure is correct in the sense that as the value of X² increases, the
probability of success increases as well. A plot simply based on X yields a virtually
straight horizontal line simply because X², not X, plays a role. Including both X
and X² in the model and plotting the results, again, the nature of the association in
Fig. 7.16 is still masked. The right panel of Fig. 7.17 is a plot of the regression line
based on the smoother given by (7.28), which is in agreement with the plot shown
in Fig. 7.16.
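The masking just described is easy to reproduce. The following Python sketch (the book's examples use R) generates data where the probability of success depends on X only through X², so a model that is monotone in X finds almost no association:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)

# Probability of success depends on X^2 only, as in (7.34) with beta1 = 1:
p = np.exp(x**2) / (1 + np.exp(x**2))
y = rng.binomial(1, p)

# A model that is monotone in x sees almost no association, while the
# quadratic term reveals it:
r_linear = np.corrcoef(x, y)[0, 1]
r_quad = np.corrcoef(x**2, y)[0, 1]
```

Here r_linear is essentially zero while r_quad is clearly positive, mirroring the flat line in the left panel of Fig. 7.17.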
A common goal is to fit a linear model where one or more of the independent
variables are categorical. For example, gender might be indicated with a 0 or 1,
[Fig. 7.16 The probability of success as a function of X; y-axis: P(Y|X), x-axis: X (−2 to 2)]
[Fig. 7.17 The left panel is the estimate of the regression line, based on the probabilities in Fig. 7.16, using the logistic regression model; x-axis: X². The right panel is an estimate based on the smoother given by (7.28); x-axis: X. Both y-axes: P(Y=1|X)]
Suppose the goal is to fit this model in a manner that eliminates outliers among the
CAR measures. If the data for the independent variables are stored in the R object X,
with gender and group identification in columns 1 and 3, and the dependent variable
is stored in CESD, the command
tsreg(X,CESD,outfun=out.dummy,id=c(1,3))
accomplishes this goal when using the Theil-Sen estimator. The same can be done
with any of the other regression estimators in this chapter.
There are many robust regression estimators beyond the methods covered here
(Wilcox, 2022a). Although the methods used here have been found to perform
relatively well, there is the possibility that some other estimator offers a practical
advantage.
One approach is generally known as S-estimators. The basic idea is to estimate
the slopes and intercept with values that minimize some robust measure of variation
associated with the residuals. There are many robust measures of variation that
might be used. There are, however, some possible concerns. Hössjer (1992) shows
that S-estimators cannot achieve simultaneously both a high breakdown point and
a standard error that is relatively low. Davies (1993) reports results indicating that
S-estimators are not stable.
A least trimmed squares estimator uses the sum of squared residuals to estimate
the slopes and intercepts, with a specified proportion of the largest residuals ignored.
That is, estimate the unknown parameters with values that minimize

Σ_{i=1}^{h} r²_(i),  (7.36)
variables. This estimator tends to have a relatively high standard error. A variation
is the least absolute value estimator where the squared residuals are replaced by
their absolute value. This estimator tends to have a relatively high standard error as
well.
Skipped estimators use some robust multivariate outlier detection technique to
search for outliers among all of the variables, not just the independent variables.
That is, search for outliers among .(p + 1)-variate data .(Xi , Yi ) (.i = 1, . . . , n).
Any outliers that are found are removed, and a robust estimator is applied to the
remaining data. This approach eliminates bad leverage points, but it eliminates good
leverage points as well.
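A minimal Python sketch of the skipped-estimator idea follows. The book's R functions use more sophisticated projection-type outlier detectors; the componentwise MAD rule below is only a simple stand-in, and Theil-Sen is implemented here in its basic pairwise-slope form:

```python
import numpy as np

def theil_sen(x, y):
    """Theil-Sen: median of all pairwise slopes; intercept via medians."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    b1 = np.median(slopes)
    b0 = np.median(y) - b1 * np.median(x)
    return b0, b1

def skipped_theil_sen(x, y, crit=2.5):
    """Skipped-estimator sketch: flag outliers among the (x, y) pairs,
    remove them, then apply a robust estimator to the remaining data."""
    z = np.column_stack([x, y])
    med = np.median(z, axis=0)
    mad = np.median(np.abs(z - med), axis=0) / 0.6745
    keep = np.all(np.abs(z - med) / mad <= crit, axis=1)
    return theil_sen(x[keep], y[keep])

rng = np.random.default_rng(0)
x = rng.standard_normal(60)
y = 2 * x + rng.standard_normal(60) * 0.1
x[:3], y[:3] = 10.0, -20.0          # three bad leverage points
b0, b1 = skipped_theil_sen(x, y)
assert abs(b1 - 2) < 0.2            # slope recovered despite contamination
```

Note that, as the text warns, this approach also discards good leverage points whenever they are flagged.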
There are robust regression estimators that are based on robust covariances. This
approach performs reasonably well when there is homoscedasticity. But it can give
poor results when there is heteroscedasticity.
Consider a single independent variable. Rousseeuw and Hubert (1999) derived
a method for characterizing how deeply a line is nested in the cloud of points
(X1, Y1), . . . , (Xn, Yn). They estimate the true regression line with the line that is
deepest in the cloud of points. If there are no tied values among X1, . . . , Xn, the
breakdown point is about 0.33. This estimator tends to have a relatively low standard
error. Contamination bias is an issue with this estimator, but this can be addressed
as indicated in Sect. 7.4.6. The method can be extended to .p > 1 independent
variables. Overall, this deepest regression line estimator might have practical value.
Evidently, experience using this estimator is limited.
The R function
mdepreg.orig(x,y,xout=FALSE,outfun=outpro)
computes the deepest regression line. For .p > 1 predictors, it uses an iterative
technique to estimate the slopes and intercept. The function
7.5 Interactions
Consider a situation where there are two independent variables, X1 and X2. A
common goal is understanding how the value of, say, the second independent
variable impacts the nature of the association between Y and X1. Roughly, if the
nature of the association between Y and .X1 depends on the value of .X2 , there is
said to be an interaction.
A common way of modeling an interaction is with the product of the two
independent variables. That is, use the model
Y = β0 + β1 X1 + β2 X2 + β3 X1 X2.  (7.37)
described in this chapter can help assess the extent to which this approach is unsatisfactory.
An example is given in the next section. Chapter 8 elaborates on how interactions
might be studied.
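For concreteness, here is a Python sketch of fitting (7.37) by least squares, with the product term included as an extra predictor; the data-generating coefficients are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.standard_normal((2, n))
# True model has an interaction with coefficient beta3 = 0.8:
y = 1 + 0.5 * x1 - 0.3 * x2 + 0.8 * x1 * x2 + rng.standard_normal(n) * 0.1

# Model (7.37): include the product X1*X2 as a third predictor.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
assert abs(b[3] - 0.8) < 0.1   # estimated interaction coefficient
```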
The R functions

ols.plot.inter

and

reg.plot.inter

provide a partial check on the extent to which (7.37) provides a satisfactory method for
modeling an interaction. Both functions plot the regression surface assuming (7.37)
is true. The first uses the least squares estimator, and the second can be used based
on a robust regression estimator. These plots can be compared to plots created by
LOESS or the running-interval smoother.
Example This example is based on the Well Elderly data that was used in
one of the examples in Sect. 7.3.5. Once again, the goal is to understand the
association between the CAR (the cortisol awakening response) and a measure of
depressive symptoms (CESD); only now a second independent variable is included:
a measure of meaningful activities (MAPA). Figure 7.18 shows a plot of the
regression surface assuming that an interaction can be modeled with (7.37). The
function ols.plot.inter was used with leverage points removed. The function
reg.plot.inter gives very similar results. A cursory look would seem to
suggest that there is no interaction. Knowing the CAR value does not appear to have
any bearing on the association between MAPA and CESD. Now look at Fig. 7.19,
which is based on LOESS. The plot suggests that there is an interaction. The nature
[Fig. 7.18 The regression surface for predicting CESD from the CAR and MAPA, assuming that an interaction can be modeled with (7.37)]

[Fig. 7.19 The regression surface for predicting CESD from the CAR and MAPA, based on LOESS]
of the association between MAPA and CESD appears to depend on whether the
CAR is positive or negative. Chapter 8 will take a closer look at this issue.
7.6 Exercises
1. Using the data in the file A1, use the logistic regression model to plot the
probability that a participant did not finish high school given some CESD
value, which is a measure of depressive symptoms. Education level is stored
in A1$edugrp, and the measure of depressive symptoms is stored in A1$CESD.
Suggestion: First store the data in these two columns in an R object, and then
eliminate any missing values with the R function elimna.
2. Using the data in the previous exercise, use the R function multsm to plot
the likelihood of having no high school diploma, some college or technical
training, and 4 years of college given a participant’s CESD score. These groups
correspond to A1$edugrp equal to 1, 3, and 4, respectively. The solid line in the
resulting plot corresponds to no high school diploma. The dashed line is for
some college or technical training, and the dotted line indicates the probability
As done in Chap. 7, let .m(x) denote some measure of location associated with Y ,
given that an independent variable .X = x. For the moment, the focus is on a single
independent variable. Based on how the running-interval smoother is constructed,
methods in Chap. 2 can be used to make inferences about .m(x). Basically, focus on
the .Yi values for which .Xi is close to x. These Y values are determined as described
in Sect. 7.3.3. Once they are identified, one can compute a confidence interval for
.m(x).
To get an overall sense of the precision of the estimated regression line, it helps to
have a confidence interval for .m(x) for a range of x values. Moreover, it is desirable
to compute these confidence intervals such that simultaneous probability coverage is
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 207
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9_8
208 8 Inferential Methods Based on Robust Regression Estimators
|Xj − Xi| ≤ f × MAD/0.6745,  (8.1)
where MAD is based on .X1 , . . . , Xn and again f is the span. As noted in Chap. 7,
.f = 0.8 generally works well when fitting a regression line to the data, but for the
situation at hand, .f = 0.5 has been found to be more satisfactory.
Note that if .N(Xi ) is too small, this can result in a confidence interval for .m(Xi )
that has unsatisfactory probability coverage. Here, the strategy is to focus on .Xi if
.N(Xi ) > 12.
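The closeness rule (8.1) and the resulting local estimate can be sketched in Python as follows. The helper names are hypothetical; a 20% trimmed mean is used as the measure of location, and the span f = 0.5 follows the text:

```python
import numpy as np

def near_x(xvals, x, f=0.5):
    """Flag the points close to x per (8.1): |x_i - x| <= f * MAD/0.6745."""
    mad = np.median(np.abs(xvals - np.median(xvals)))
    return np.abs(xvals - x) <= f * mad / 0.6745

def trimmed_mean(y, prop=0.2):
    """20% trimmed mean: drop the lowest and highest 20% of the values."""
    ys = np.sort(y)
    g = int(prop * len(ys))
    return ys[g:len(ys) - g].mean()

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.standard_normal(200) * 0.2

flag = near_x(x, 5.0)
est = trimmed_mean(y[flag])    # local estimate of m(5.0)
assert flag.sum() > 12         # enough points to justify an inference
```

A confidence interval for m(x) would then be computed from y[flag] using the Chap. 2 methods the text refers to.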
Imagine that the goal is to make inferences based on K values of the covariate.
Let .Z1 denote the minimum .Xi value such that .N (Xi ) > 12. Next, let .ZK denote
the maximum .Xi value such that .N (Xi ) > 12. Let .Z1 , . . . , ZK denote K values for
the independent variable that are evenly spaced between .Z1 and .ZK . The approach
is to compute a confidence interval based on the Y values corresponding to the
X1, . . . , Xn values that are close to Zk (k = 1, . . . , K). The focus here is on
.K = 10 and .K = 25.
To provide some sense of how confidence intervals are adjusted so that the
simultaneous probability coverage is equal to .1 − α, momentarily focus on a single
point. Note that if a study is repeated many times, different p-values will result.
That is, p-values, when testing some hypothesis about some measure of location .θ ,
have an unknown distribution. As noted in a basic statistics course, a null hypothesis
is rejected at the .α level if the .1 − α confidence interval does not contain the null
value. If the null hypothesis is true, and the probability of getting a p-value less than
or equal to 0.05 is indeed equal to 0.05, this means that a 0.95 confidence interval
has a 0.95 probability of containing the true value of .θ .
Now consider K tests where all K hypotheses are true. Momentarily consider
the situations where there is independence and both Y and X have standard
normal distributions. Let .pk be the p-value for the kth hypothesis, and let .pmin =
min(p1 , . . . , pK ). If the distribution of .pmin were known, adjusted confidence
intervals could be constructed so that the simultaneous probability coverage is .1−α.
For example, if the 0.05 quantile is 0.004, compute a .1 − 0.004 = 0.996 confidence
interval for each of the K tests, in which case the simultaneous probability coverage
is 0.95. (This approach is similar in spirit to how the Studentized maximum modulus
distribution is used to compute confidence intervals that have some specified
simultaneous probability coverage.) The quantiles associated with the distribution
8.1 Inferences Based on the Running-Interval Smoother 209
of .pmin depend on the sample size, n; the number of tests, K; and the desired
simultaneous probability coverage. These quantiles have been determined for .α =
0.05, .K = 10 and 25, and sample sizes 50, 60, 70, 80, 100, 150, 200, 300, 400, 500,
600, 800, and 1000. These quantiles decrease very slowly as n gets large and have
been found to level off around .n = 200 or 300, but there is no proof that this is the
case.
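The adjustment via the distribution of pmin can be illustrated by simulation. The sketch below uses the independent-uniform case discussed above, where the answer is known exactly; the quantiles actually used by the book's functions also account for the dependence among the K tests:

```python
import numpy as np

rng = np.random.default_rng(4)
K, nsim = 10, 20_000

# Under K independent true nulls, each p-value is Uniform(0, 1); estimate
# the 0.05 quantile of p_min = min(p_1, ..., p_K) by simulation:
pmin = rng.uniform(size=(nsim, K)).min(axis=1)
q = np.quantile(pmin, 0.05)

# Performing each of the K tests at level q gives simultaneous coverage
# of about 0.95; in this independent case the exact level is known:
exact = 1 - 0.95 ** (1 / K)
```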
The method just described mimics the approach to Student’s t test in the
following manner. Derive a solution assuming normality, and use simulations to
assess how well it performs when dealing with non-normal distributions as well
as situations where there is a nonlinear association. All indications are that this
approach performs reasonably well, based on the running-interval smoother, when
.n ≥ 50, .K = 10 and 25, the span is taken to be .f = 0.5, and the Tukey-McLaughlin
The R function
is the same as the function rplotCI, except that confidence intervals for the medians
are used, based on the distribution-free method in Sect. 2.3.3.
[Fig. 8.1 The regression line, using the Well Elderly data, based on CAR and CESD measures; x-axis: CAR (−0.4 to 0.4), y-axis: CESD (0 to 50)]
Example The R function rplotCI is illustrated with the Well Elderly data used
to illustrate the R function rplot in Sect. 7.3.5. As before, the goal is to understand
the association between the CAR (the cortisol awakening response) and a measure
of depressive symptoms (CESD). The plot created by rplotCI is shown in
Fig. 8.1. Assuming the data are stored in the R object A3B3C, the R command that
was used is
rplotCI(A3B3C$cort1-A3B3C$cort2,A3B3C$CESD,xout=TRUE,
xlab='CAR',ylab='CESD',LPCI=TRUE)
CESD scores greater than 15 are often taken to indicate mild depression. A score
greater than 21 indicates the possibility of major depression. The plot in Fig. 8.1
suggests that when the CAR is greater than zero, as the CAR increases, CESD scores
increase as well. When the CAR is equal to about 0.2, the estimate of the trimmed
mean of the CESD scores is equal to 15. But the confidence intervals make it clear
that no decision should be made about whether typical CESD scores are greater than
15. Simultaneously, based on the upper ends of the confidence intervals, no decision
should be made regarding whether typical CESD scores are always less than 21.
Assuming a linear model is reasonable, there is the issue of deriving analogs of the
methods in Sect. 8.1. For convenience, a single independent variable is assumed,
but the method described here is readily extended to .p > 1 independent variables.
More formally, the goal is to compute a confidence interval for the typical value of
8.2 Inferences About the Typical Value of Y , Given X, via a Linear Model 211
Y(x) = β0 + β1 x,  (8.2)
where the notation .Y (x) is used to stress that the focus is on the situation where
.X = x. Currently, the best-known method is to first compute an estimate of the
standard error of .Ŷ (x) = b0 + b1 x, where .b0 and .b1 are estimates of the intercept
and slope, respectively, based on one of the regression estimators in Chap. 7. This is
done with a basic percentile bootstrap method.
A bootstrap sample is obtained by randomly sampling with replacement n pairs
of points from (X1, Y1), . . . , (Xn, Yn), yielding (X∗1, Y∗1), . . . , (X∗n, Y∗n). Based on
this bootstrap sample, compute estimates of the slope and the intercept yielding
Ŷ∗(x) = b∗0 + b∗1 x. Repeat B times, yielding Ŷ∗1(x), . . . , Ŷ∗B(x). The estimate of the
squared standard error of Ŷ(x) is

τ̂²(x) = (1/(B − 1)) Σ (Ŷ∗b(x) − Ȳ∗(x))²,  (8.3)

where Ȳ∗(x) = Σ Ŷ∗b(x)/B. Let θ denote the true value of Y(x). Assuming that
W = (Ŷ(x) − θ)/τ̂(x)  (8.4)

has, approximately, a standard normal distribution, a confidence interval for Y(x) is

Ŷ(x) ± c τ̂(x),  (8.5)
where c is some critical value. When there is a single independent variable, and .K >
1 choices for x are being used, the FWE rate is controlled using an approximation
of the distribution of the minimum p-value. If, for example, the 0.05 quantile of this
distribution is 0.004, all K tests would be performed at the 0.004 level. The method
just described can be used with .p ≥ 1 independent variables. Simulation results on
how well the method performs are described in Wilcox (2017b).
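The bootstrap estimate of the squared standard error in (8.3) can be sketched in Python as follows; least squares is used here as a stand-in for the robust estimators of Chap. 7, and the simulated data are for illustration only:

```python
import numpy as np

def fit_line(x, y):
    """Least squares slope and intercept (stand-in for a robust fit)."""
    b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
    return y.mean() - b1 * x.mean(), b1

rng = np.random.default_rng(5)
n, B, x0 = 80, 500, 1.0
x = rng.standard_normal(n)
y = 2 + 3 * x + rng.standard_normal(n)

# Bootstrap Yhat*(x0) = b0* + b1*x0; per (8.3), the sample variance of the
# B bootstrap estimates gives tau_hat^2(x0):
yhat_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)            # resample (X_i, Y_i) pairs
    b0, b1 = fit_line(x[idx], y[idx])
    yhat_star[b] = b0 + b1 * x0
tau2 = yhat_star.var(ddof=1)               # squared standard error, (8.3)
```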
For the special case where Y is binary, there is a standard method for computing
a confidence interval for .P (Y = 1|X = x) based on the logistic regression model
in Sect. 7.4.8. However, if the model is off ever so slightly, the resulting confidence
interval can be highly inaccurate (Wilcox, 2019c). A version of the running-interval
smoother was found to be more satisfactory. Basically, for the subset of Y values,
for which the corresponding covariate values are close to x, use the R function
binom.conf in Sect. 2.5. As was done in Sect. 8.1, .Xi is taken to be close to x if
it satisfies (8.1).
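The strategy for binary Y can be sketched in Python as follows. The Agresti-Coull-style interval below is only a stand-in for whatever method binom.conf actually uses, and the simulated data are for illustration:

```python
import numpy as np
from math import sqrt

def binom_ci(s, n):
    """Agresti-Coull style 0.95 interval for a binomial probability
    (a stand-in for the book's binom.conf method)."""
    z = 1.96
    nt = n + z**2
    pt = (s + z**2 / 2) / nt
    half = z * sqrt(pt * (1 - pt) / nt)
    return max(0.0, pt - half), min(1.0, pt + half)

rng = np.random.default_rng(10)
n = 300
x = rng.uniform(0, 1, n)
y = rng.binomial(1, 0.2 + 0.6 * x)   # true P(Y=1|X=x) = 0.2 + 0.6x

# Estimate P(Y = 1 | X = 0.5): keep the y values whose x satisfies (8.1):
mad = np.median(np.abs(x - np.median(x)))
near = np.abs(x - 0.5) <= 0.5 * mad / 0.6745
lo, hi = binom_ci(y[near].sum(), near.sum())
```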
The R function
outfun=outpro,...)
computes an estimate of .Y (x) for every value of the independent variable stored in
the argument xr. By default, the Theil-Sen estimator is used.
The R function
computes a .1 − α confidence interval for .Y (x) for every value indicated by the
argument pts. By default, a confidence interval is computed for each value stored
in the argument x. If there is a single independent variable, setting the argument
ADJ=TRUE, the confidence intervals are adjusted so that the simultaneous prob-
ability coverage is approximately equal to .1 − α, where .α is controlled via the
argument alpha. If the argument alpha differs from 0.05, an adjusted critical value
can be used, at the expense of possibly high execution time. Execution time can be
reduced by setting the argument MC=TRUE, assuming that a multicore processor is
available.
Suppose it is desired to determine for which values of the independent variable it
is reasonable to decide that .Y (x) is less than or greater than some specified value, .θ0 .
One possible appeal of this function is that, with plotPV=TRUE, it will
plot the p-values when testing H0: Y(x) = θ0. The p-values can be plotted
when .p = 2 provided the R package scatterplot3d has been installed.
The R function
plots the regression line as well as an approximate confidence band for .Y (x). The
argument ADJ = TRUE means that for every unique value stored in x, a confidence
interval for .Y (x) is computed where the simultaneous probability coverage is
approximately .1 − α. Said another way, the probability of one or more Type I errors
is approximately 0.05 when the argument alpha=0.05. Unlike the R function
regYci, this function is restricted to .p = 1. There are indications that with more
than one independent variable, a different adjustment is required, but this issue is
in need of more study before a recommendation can be made. If the argument
ADJ=FALSE, this function computes a confidence interval for values evenly spaced
between the smallest and largest values stored in x, but no adjustment is made so
that the simultaneous probability coverage is .1 − α. The number of points used is
controlled by the argument npts and defaults to 20. The function returns a plot
indicating the lower and upper ends of the confidence intervals. The probability
coverage for each confidence interval is .1 − α.
The R function
computes confidence intervals for P(Y = 1|X = x), assuming that the logistic
regression model is correct. However, this method should be used with caution
because even a slight deviation from the logistic model can result in inaccurate
confidence intervals. Generally, the R function
runbin.CI(x,y,pts=NULL,fr=1.2,xout=FALSE,outfun=outpro)
H0: β1 = β2 = · · · = βp = 0  (8.6)
assuming that some measure of location associated with Y is given by the linear
model
Y = β0 + β1 X1 + · · · + βp Xp.  (8.7)
Classic methods based on the least squares estimator make the additional assump-
tion that
Y = β0 + β1 X1 + · · · + βp Xp + ε,  (8.8)

where the error term ε has a normal distribution with mean zero and some unknown
variance σ². That is, homoscedasticity is assumed as described at the end of
Sect. 1.4. Independence between Y and .X1 , . . . , Xp implies homoscedasticity. But
when there is an association, there is no reason to assume homoscedasticity, and an
argument can be made that some degree of heteroscedasticity exists. A concern is
that methods that assume homoscedasticity are using an incorrect estimate of the
standard errors when in fact there is heteroscedasticity.
Although the least squares estimator is not robust, this section describes two
methods for dealing with heteroscedasticity. This is followed by a description of a
bootstrap method that allows heteroscedasticity, which performs well when using
a robust regression estimator. Other methods have been proposed, but currently the
bootstrap method described here has been found to be the most effective. When
dealing with .p > 1 independent variables, there is a method that has the potential
of increasing power, sometimes by a substantial amount. Details are covered in
Sect. 8.3.3.
Let .b = (b1 , . . . , bp ) denote the least squares estimate of the slopes. In recent years,
several methods have been derived that are aimed at avoiding the homoscedasticity
assumption given the goal of testing (8.6). One approach is to use some test statistic
based on an estimate of the variances and covariances of .b. Two versions of this
approach are described here.
The first is the HC3 method, which is motivated by results in Long and Ervin
(2000). Assuming familiarity with basic matrix algebra, which is summarized in
Appendix A, the HC3 estimate of the variances and covariances of .b is given by
8.3 Global Tests That All Slopes Are Equal to Zero 215
S = (X′X)⁻¹ X′ diag{ rᵢ²/(1 − hᵢᵢ)² } X(X′X)⁻¹,  (8.9)

where

    ⎛ 1  X11  · · ·  X1p ⎞
X = ⎜ 1  X21  · · ·  X2p ⎟
    ⎜ ⋮    ⋮           ⋮ ⎟
    ⎝ 1  Xn1  · · ·  Xnp ⎠
and .Xi is the ith row of .X. If .b0 , . . . , bp are the least squares estimates of
the intercept and slopes, the diagonal elements of the matrix HC3 represent the
estimated squared standard errors.
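The HC3 estimate in (8.9) is straightforward to compute directly. A Python sketch follows; the simulated heteroscedastic data are for illustration only:

```python
import numpy as np

def hc3_cov(X, y):
    """HC3 estimate (8.9) of the covariance matrix of the least squares
    estimates; X must already contain a leading column of ones."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    r = y - X @ b                                  # residuals
    h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)    # leverage values h_ii
    D = np.diag(r**2 / (1 - h)**2)
    return XtX_inv @ X.T @ D @ X @ XtX_inv

rng = np.random.default_rng(6)
n = 100
x = rng.standard_normal(n)
y = 1 + 2 * x + rng.standard_normal(n) * (1 + np.abs(x))  # heteroscedastic
X = np.column_stack([np.ones(n), x])
S = hc3_cov(X, y)
se_slope = np.sqrt(S[1, 1])   # HC3 standard error of the slope
```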
Godfrey (2006) derived an alternative to the HC3 estimator, the HC4 estimator.
Let h̄ = Σ hᵢᵢ/n, eᵢᵢ = hᵢᵢ/h̄, and dᵢᵢ = min(4, eᵢᵢ). The HC4 estimator is
S = (X′X)⁻¹ X′ diag{ rᵢ²/(1 − hᵢᵢ)^dᵢᵢ } X(X′X)⁻¹.  (8.10)
Testing (8.6) can be done as follows. Let .V be the estimate of the variances and
covariances of .b using the HC3 or HC4 method. A reasonable test statistic is
W = n b′V⁻¹b,  (8.11)
When dealing with robust regression estimators, the basic percentile bootstrap
method can be used to test (8.6). This approach mimics the method based on
difference scores used in Sect. 6.1.4.
Here, what is observed is
216 8 Inferential Methods Based on Robust Regression Estimators
⎛ X11 , . . . , X1p , Y1 ⎞
⎜           ⋮           ⎟ .  (8.12)
⎝ Xn1 , . . . , Xnp , Yn ⎠
Bootstrap samples are obtained by resampling with replacement n rows from this
matrix yielding
⎛ X∗11 , . . . , X∗1p , Y∗1 ⎞
⎜             ⋮            ⎟ .  (8.13)
⎝ X∗n1 , . . . , X∗np , Y∗n ⎠
Based on this bootstrap sample, compute the slopes, yielding b∗1 , . . . , b∗p . Repeat this
process B times, yielding

⎛ b∗11 , . . . , b∗1p ⎞
⎜          ⋮         ⎟ .  (8.14)
⎝ b∗B1 , . . . , b∗Bp ⎠
Figure 8.2 shows a scatterplot of bootstrap estimates of two slopes based on the
Theil-Sen estimator. The data were generated where .X1 and .X2 are normal with
Pearson’s correlation equal to 0.5. The error term, ., is normal as well. That is,
the conditional distribution of Y , given a value for X, is normal. The slopes are
.β1 = 1 and .β2 = 0. Note that for each point in the scatterplot, its distance from
the center of the cloud of points can be measured. Here, the Mahalanobis distance
is used. Although the Mahalanobis distance is not robust, simulations indicate that
it performs reasonably well for the situations where a robust regression estimator
is used. Let db denote the distance of the bth point (b = 1, . . . , B), and let
d0 denote the distance of the null point (0, 0). If d0 is unusually large compared to
the other B distances, this suggests that the null hypothesis is false. A p-value is
determined based on the proportion of times .d0 is greater than the other B distances
d1, . . . , dB. Let Ib = 1 if d0 > db; otherwise, Ib = 0, and let A = Σ Ib, in which
case A denotes the number of times d0 is greater than db. A p-value is
1 − A/B.  (8.15)
The polygon shown in Fig. 8.2 contains the central portion of the bootstrap values
and was created with the R function regtest in Section 8.3.4. By default, the
polygon contains the central 95%, which represents a 0.95 confidence region for
(.β1 , β2 ). The p-value is reported to be 0, which means that the null point (0, 0) has
a larger distance from the center than any of the other points in the plot. Notice that
the null point (0, 0) does not even appear in the plot. In this particular instance, the
true value for the slopes is well within the confidence region.
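A Python sketch of this global test follows; least squares replaces the Theil-Sen estimator to keep the code short, and the Mahalanobis distance is computed directly from the bootstrap cloud:

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 60, 400
X = rng.standard_normal((n, 2))
y = 1.0 * X[:, 0] + 0.0 * X[:, 1] + rng.standard_normal(n)

# B bootstrap estimates of the two slopes (least squares as a stand-in):
boot = np.empty((B, 2))
for b in range(B):
    idx = rng.integers(0, n, n)
    Z = np.column_stack([np.ones(n), X[idx]])
    boot[b] = np.linalg.lstsq(Z, y[idx], rcond=None)[0][1:]

# Mahalanobis distance of each point, and of the null point (0, 0),
# from the center of the bootstrap cloud:
center = boot.mean(axis=0)
Sinv = np.linalg.inv(np.cov(boot.T))
def dist(pt):
    d = pt - center
    return d @ Sinv @ d

d0 = dist(np.zeros(2))
pval = 1 - np.mean([d0 > dist(p) for p in boot])   # p-value, per (8.15)
```

With a true slope of 1 for the first predictor, the null point sits far outside the bootstrap cloud and the p-value is essentially zero.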
8.3.3 Collinearity
Collinearity refers to a situation where two or more predictor variables are closely
related to one another. For two variables, some measure of association might be
used to detect collinearity, but it is possible for collinearity to exist between three
or more variables, even if no pair of variables has a particularly high correlation.
This is called multicollinearity. Generally, multicollinearity is a practical concern
because it can result in relatively high standard errors when estimating the slope
parameters of a linear regression model. One possible consequence is low power
when testing (8.6).
Ridge estimators represent methods aimed at avoiding relatively high standard
errors. The basic form of a ridge estimator was derived by Hoerl and Kennard
(1970). The strategy is to find values for .β0 , . . . , βp that minimize
Σ_{i=1}^{n} ( yᵢ − β0 − Σ_{j=1}^{p} βⱼ xᵢⱼ )² + k Σ_{j=1}^{p} βⱼ².  (8.16)
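After centering the variables, minimizing (8.16) has the closed-form solution (X′X + kI)⁻¹X′y. A Python sketch illustrating the resulting shrinkage under near-collinearity; the data and the choice k = 10 are arbitrary:

```python
import numpy as np

def ridge(X, y, k):
    """Minimize (8.16): least squares plus the penalty k * sum(beta_j^2).
    X holds the predictors (no intercept column); predictors and y are
    centered so the intercept can be fit separately, as is standard."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + k * np.eye(p), Xc.T @ yc)
    b0 = y.mean() - X.mean(axis=0) @ beta
    return b0, beta

rng = np.random.default_rng(8)
n = 100
z = rng.standard_normal(n)
# Two nearly collinear predictors:
X = np.column_stack([z + 0.05 * rng.standard_normal(n),
                     z + 0.05 * rng.standard_normal(n)])
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)

_, b_ols = ridge(X, y, k=0.0)    # k = 0 reduces to least squares
_, b_rdg = ridge(X, y, k=10.0)
# Shrinkage: the ridge slopes have smaller norm than least squares.
assert np.linalg.norm(b_rdg) < np.linalg.norm(b_ols)
```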
A property of ridge estimators should be stressed. When the null hypothesis (8.6)
is true, ridge estimators are unbiased. That is, the average estimate for each slope,
over many studies, will be equal to zero. But when the null hypothesis is false, this
is no longer the case: the average estimate of a slope, over many studies, generally
differs from the true value of the slope. A consequence is that computing reasonably
accurate confidence intervals for the slopes, based on a ridge estimator, cannot be
done based on current methods. But testing (8.6) can be done in a manner that
controls the Type I error probability reasonably well with the added bonus of having
as much power or more than the percentile bootstrap method in Sect. 8.2.1.
There are two methods for testing (8.6). When using the ridge estimator based
on the rescaled least squares estimator, let
The term .sjj (k), the j th diagonal element of .S(k), estimates the squared standard
error of .β̂j (k) and suggests testing .H0 : .βj = 0 (.j = 1, . . . , p) using
Tj = β̂j(k) / √sjj(k),  (8.18)
method. If any of these p hypotheses is rejected, reject the global hypothesis given
by (8.6). This approach generally has as much power as using (8.11), and in some
cases, this approach can increase power substantially. But this method does not
yield accurate confidence intervals when the null hypothesis is false due to the bias
associated with the ridge estimator. And it does not reveal which slope parameters
differ from zero.
When using a robust ridge estimator, the test statistic
T²R = (p(n − 1)/(n − p)) β̃′S⁻¹β̃  (8.19)
can be used. The current method for controlling the Type I error probability is to
determine a critical value assuming normality and homoscedasticity. Simulations
indicate that this approach performs reasonably well when dealing with non-normal
distributions, including situations where there is heteroscedasticity (Wilcox, 2019b).
This method has about the same amount of power as the percentile bootstrap
method based on a robust regression estimator when Pearson’s correlation among
the independent variables is zero. As the strength of the association among the
independent variables increases, using a robust ridge estimator can substantially
increase power. But again, this approach provides no details about the individual
slopes and how they compare. Moreover, in terms of power, no single method
dominates.
The R function
tests the hypothesis that all slope parameters are equal to zero based on the least
squares estimator. Theory indicates that with a sufficiently large sample size, this
method will perform reasonably well in terms of controlling the Type I error
probability. But it remains unclear just how large the sample size must be. The
argument pval controls which independent variables will be included in the model.
By default, all are included.
R has a built-in function, lm, which, in conjunction with the R function
summary, can be used to test the hypothesis that all slope parameters are equal
to zero. However, this function uses the least squares estimator, assuming both
normality and homoscedasticity.
The R function
regtest(x,y,regfun=tsreg,nboot=600,alpha=0.05,plotit=
tests the hypothesis that all slopes are equal to zero using a robust regression
estimator in conjunction with a percentile bootstrap method.
The R function
applies the method based on (8.18). If any hypothesis is rejected, reject the global
hypothesis that all of the slopes are equal to zero, but make no decision about which
of the individual slopes differ from zero. The R function
tests the hypothesis that all of the slope parameters are equal to zero using a robust
ridge estimator. The Theil-Sen version is used by default. Execution time is low
when testing at the .α = 0.05 level, which is the default approach. To get a p-
value, set the argument PV=TRUE. This will increase the execution time because
the function must compute an approximation of the null distribution.
All of the functions in this section remove leverage points when the argu-
ment xout=TRUE. To remove only bad leverage points, also set the argument
outfun=outblp.
Let

σ̂² = Σ rᵢ²/n

denote an estimate of the assumed common variance. Let A = Σ(rᵢ² − σ̂²)²/n and
ỹ = Σ ŷᵢ/n. The test statistic is

V = {Σ rᵢ²(ŷᵢ − ỹ)}² / (A Σ(ŷᵢ − ỹ)²),  (8.20)
is

T = d/sd∗,  (8.21)
where .sd∗ is a bootstrap estimate of the standard error of d. This method is limited to
testing at the 0.05 level. That is, a critical value has been determined for this special
case (e.g., Wilcox, 2022a, Section 11.3.1), but it is unknown how best to determine
a critical value when the Type I error probability is set at some value other than 0.05.
Let .ri (.i = 1, . . . , n) denote the residuals based on some regression estimator,
which here are taken to be the residuals based on the running-interval smoother. The
third method is based on the fact that homoscedasticity implies that the regression
line used to predict .|r|, given x, will be a straight horizontal line. Here, testing
whether this is the case is done using the Theil-Sen estimator in conjunction with a
percentile bootstrap method.
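A Python sketch of this third check follows; the basic Theil-Sen fit and the simple flatness cutoff below are simplifications of what the R function actually does:

```python
import numpy as np

def theil_sen_slope(x, y):
    """Median of all pairwise slopes."""
    n = len(x)
    s = [(y[j] - y[i]) / (x[j] - x[i])
         for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    return np.median(s)

rng = np.random.default_rng(9)
n = 150
x = rng.uniform(0, 1, n)
y = 2 * x + rng.standard_normal(n)        # homoscedastic errors
b1 = theil_sen_slope(x, y)
r = y - b1 * x - np.median(y - b1 * x)    # residuals (median intercept)

# Homoscedasticity implies the regression of |r| on x should be flat:
slope_absr = theil_sen_slope(x, np.abs(r))
```

An actual test would bootstrap the slope of the |r|-on-x regression and reject homoscedasticity if its confidence interval excludes zero.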
The R function
khomreg(x, y)

tests the hypothesis that two quantile regression lines have the same slope. The quantiles
that are used can be altered via the argument qval. For example, qval=c(0.25, 0.75)
would test .H0 : .β0.25 = β0.75 . For more than one independent variable, use the R
function
The R function
tests the hypothesis that there is homoscedasticity based on whether the regression
line that predicts .|r|, given x, has a slope equal to zero.
Let S denote the HC3 estimator given by (8.3.1). If b0, …, bp are the least squares estimates of the intercept and slopes, the diagonal elements of S represent the estimated squared standard errors. Let S0², S1², …, Sp² denote the diagonal elements of S. Ng and Wilcox (2009) considered computing confidence intervals with

bj ± tSj, (8.22)

where t is the 1 − α/2 quantile of a Student's t distribution with n − p − 1 degrees of freedom. An alternative is a modified percentile bootstrap method: generate B = 599 bootstrap estimates of βj and put them in ascending order, yielding b*(1) ≤ ⋯ ≤ b*(599). The 0.95 confidence interval is

(b*(a+1), b*(c)), (8.23)

where for n < 40, a = 6 and c = 593; for 40 ≤ n < 80, a = 7 and c = 592; for 80 ≤ n < 180, a = 10 and c = 589; for 180 ≤ n < 250, a = 13 and c = 586; and for n ≥ 250, a = 15 and c = 584, in which case (8.23) reduces to the standard percentile bootstrap confidence interval. This approach currently seems best in terms of computing a 0.95 confidence interval, but it does not yield a p-value, and there is no known adjustment when α differs from 0.05. When p > 1, a basic percentile bootstrap method is used at the α/p level.
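The choice of a and c amounts to a small lookup on the sample size. The sketch below (Python, for illustration; names are ad hoc) assumes the interval is formed from the (a+1)th and cth ordered bootstrap estimates:

```python
def mod_pb_indices(n):
    # (a, c) for the modified percentile bootstrap with B = 599,
    # following the sample-size rules in the text.
    if n < 40:
        return 6, 593
    if n < 80:
        return 7, 592
    if n < 180:
        return 10, 589
    if n < 250:
        return 13, 586
    return 15, 584  # effectively the standard percentile bootstrap

def mod_pb_ci(sorted_boot, n):
    # sorted_boot: the B = 599 bootstrap estimates in ascending order.
    a, c = mod_pb_indices(n)
    return sorted_boot[a], sorted_boot[c - 1]  # the (a+1)th and cth values
```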
When dealing with a robust regression estimator, again, a basic percentile
bootstrap method performs relatively well. For each bootstrap sample, compute
estimates of the slopes and the intercept. This process is repeated B times.
Confidence intervals and p-values are computed in essentially the same manner
as described in Sect. 2.3.1.
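The scheme can be sketched as follows, with the slope estimator passed in as a function. An ordinary least squares slope is plugged in below only to keep the sketch self-contained; in practice a robust estimator such as Theil-Sen would be used. Python for illustration; names are ad hoc:

```python
import random

def ls_slope(x, y):
    # Ordinary least squares slope, used here only as a placeholder estimator.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

def boot_slope(x, y, estimator=ls_slope, nboot=599, alpha=0.05, seed=1):
    rng = random.Random(seed)
    n = len(x)
    boot = []
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample (x, y) pairs
        boot.append(estimator([x[i] for i in idx], [y[i] for i in idx]))
    boot.sort()
    lo = boot[round(alpha / 2 * nboot)]
    hi = boot[round((1 - alpha / 2) * nboot) - 1]
    pstar = sum(b < 0 for b in boot) / nboot        # tail probability of zero
    return lo, hi, 2 * min(pstar, 1 - pstar)        # generalized p-value
```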
The R function lsfitci
8.4 Inferences About the Individual Slopes 223
computes 0.95 confidence intervals for regression parameters, based on the OLS
estimator, using the modified percentile bootstrap confidence interval given by
(8.23). Extant results indicate that this tends to be the most accurate method for computing a 0.95 confidence interval, at the expense of not providing a p-value.
The R function
olshc4(x, y, alpha=0.05, xout=FALSE, outfun=out, HC3=FALSE)
computes 1 − α confidence intervals for each of the p + 1 parameters using the least squares estimator in conjunction with the HC4 estimator. The function returns p-values
as well.
The R function

regci(x, y, regfun=tsreg, nboot=599, alpha=0.05, SEED=TRUE, ...)

is supplied that automatically removes bad leverage points when computing confidence intervals.
Example This example is based on the reading data described in the example in
Sect. 7.4.7. It was noted that when leverage points are removed, there appears to be a
negative association between a measure of speeded naming for digits (RAN1T1), the
independent variable, and a measure of the ability to identify words (WWISST2).
Based on the R function regci, with bad leverage points removed, the slope was
estimated to be .−0.63, and the p-value is less than 0.001. Assuming the data have
been read into the R object doi, this is the command that was used:
regci(doi[,4], doi[,8], xout=TRUE, outfun=outblp)
If leverage points are retained, the estimate of the slope is −0.28, and the
p-value is 0.018. As noted in Sect. 7.4.7, retaining leverage points can have a
substantial impact on the least squares estimator. Retaining leverage points and
using olshc4, the estimate of the slope is .−0.20, and the p-value is 0.81. The
function lsfitci fails to reject as well.
Next, a second independent variable is considered: RAN2T1, a measure of
speeded naming for letters. The R function regtest returns a p-value equal to
0.0017 with bad leverage points removed. The R function regci returns
$regci
ci.low ci.up Estimate S.E. p-value
Intercept 115.57052632 146.0569395 129.76605302 7.60728230 0.0000000
Slope 1 -0.09304521 0.1906200 0.06160697 0.06827075 0.3038397
Slope 2 -0.82450468 -0.4050107 -0.62011347 0.11003112 0.0000000
Note that the first independent variable, a measure of speeded naming for digits, is no longer significant; for the other independent variable, a measure of speeded naming for letters, the p-value for the slope is less than 0.001.
Example A mediation analysis is another way of investigating how one indepen-
dent variable influences the association between another independent variable and
the dependent variable. Extensive details are covered in the books by MacKinnon
(2008), as well as VanderWeele (2015). Briefly, consider some independent variable
X, and suppose the goal is to determine whether another independent variable, .Xm ,
mediates the association between Y and X. A basic version consists of four steps:
1. Establish that there is an association between Y and X. This step establishes that
there is an effect that might be mediated.
2. Establish that there is an association between X and .Xm .
3. Establish that there is an association between Y and .Xm .
4. To establish that .Xm completely mediates the association between X and Y , the
association between X and Y controlling for .Xm should be zero.
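Step 4 amounts to regressing Y on X and Xm jointly and examining the slope for X. A minimal least squares sketch (Python, for illustration; a robust estimator can be substituted, and the function name is ad hoc):

```python
def two_predictor_slopes(x, xm, y):
    # Least squares slopes of Y on (X, Xm) via the centered 2x2 normal equations.
    n = len(y)
    cx = [a - sum(x) / n for a in x]
    cm = [a - sum(xm) / n for a in xm]
    cy = [a - sum(y) / n for a in y]
    sxx = sum(a * a for a in cx)
    smm = sum(a * a for a in cm)
    sxm = sum(a * b for a, b in zip(cx, cm))
    sxy = sum(a * b for a, b in zip(cx, cy))
    smy = sum(a * b for a, b in zip(cm, cy))
    det = sxx * smm - sxm * sxm
    bx = (sxy * smm - smy * sxm) / det  # slope for X, controlling for Xm
    bm = (smy * sxx - sxy * sxm) / det  # slope for Xm, controlling for X
    return bx, bm
```

If steps 1–3 hold and the slope for X is near zero once Xm is controlled for, complete mediation is suggested.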
Extending the last example based on the reading data, imagine that the HC4 method
is used to determine whether there is an association between RAN2T1 (stored in
column 5 of the file used in the previous example) and RAN1T1. It is left as an
exercise to show that the estimate of the slope is 0.064 and the p-value when testing
the hypothesis that the slope is zero is 0.393. Removing bad leverage points, in
which case the sample size drops from 73 to 66, now the slope is estimated to be
0.831, and the p-value is less than 0.001. Using the Theil-Sen estimator, again with
bad leverage points removed, now the slope is estimated to be one, and again the p-
value is less than 0.001. (.B = 1000 bootstrap samples were used.) So removing bad
leverage points makes a difference in step 2 when investigating whether mediation
can be established. This result, coupled with the results in the last example, indicates
that RAN2T1 mediates the association between RAN1T1 and the ability to identify
words (WWISST2).
Example This next example is based on the Well Elderly study. Here, the file B3_dat.txt is used. The goal is to investigate measures of cortisol upon
awakening versus the CAR, the cortisol awakening response, which is the change in
cortisol measured again 30–45 minutes after awakening. The dependent variable is a
measure of meaningful activities, which is the R object with the label MAPAGLOB.
All of the analyses reported are based on removing bad leverage points. Using the
least squares estimator via the R function hc4test, the p-value is 0.08. Using the
Theil-Sen estimator and a percentile bootstrap to test the hypothesis both slopes
are zero, using the R function regtest, the p-value is 0.058. Using a ridge
estimator in conjunction with the test statistic (8.18), the R function ridge.test,
the adjusted p-value is 0.025. Here are the results using the R function regci with
the argument regfun=tshdreg, which was used because there are tied values.
$regci
ci.low ci.up Estimate S.E. p-value
Intercept 29.916754 35.94813963 32.658487 1.526268 0.00000000
Slope 1 -13.566170 -0.01614633 -6.036955 3.396850 0.04674457
Slope 2 -8.303404 5.01824368 -1.556220 3.252014 0.61769616
p.adj
Intercept 0.00000000
Slope 1 0.09348915
Slope 2 0.61769616
Consider two independent groups, assume that a linear model is reasonable for both groups, and let βj1, …, βjp denote the slopes for the jth group (j = 1, 2). The intercepts are denoted by βj0. A common goal is to test

H0: β1k = β2k, (8.24)
The R function
When using this last function, the argument x has list mode, with length J , where
X[[j]] contains the independent variables for group j . In a similar manner,
Y[[j]] contains the data for the dependent variable associated with the j th group.
The R functions
and
pr=TRUE,...)
are like the R functions reg2ci and reg1mcp, respectively; only they are designed to compare parameters based on the least squares regression estimator.
8.5 Grids
While a linear model might suffice, there is the practical concern that even with p = 2 independent variables, the regression surface might be complex to the point that alternative perspectives are needed to understand the nature of the association.
When comparing two independent groups, one way of gaining perspective is to
split the data into groups. For example, split the data into two groups based on the
median of the first independent variable. For each of these two groups, split the data
again based on the median of the second independent variable. Next, use methods in
Chap. 5 that deal with a two-by-two design to study how these regions compare. Of
course, there are various alternative ways of splitting the data. For example, split the
data based on the quartiles for both independent variables yielding a four-by-four
design. This can reveal that a nonsignificant independent variable based on a linear
model actually plays a role in the association as is illustrated in the next section.
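The median-split construction can be sketched as follows (Python, for illustration; names are ad hoc). Each cell of the resulting two-by-two grid would then be compared using the methods in Chap. 5:

```python
def median(v):
    # Sample median.
    s = sorted(v)
    m = len(s)
    return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

def grid_split(x1, x2, y):
    # Split y into the four cells of a two-by-two grid: low/high on each of
    # the two independent variables, relative to their medians.
    m1, m2 = median(x1), median(x2)
    cells = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for a, b, yi in zip(x1, x2, y):
        cells[(int(a > m1), int(b > m2))].append(yi)
    return cells
```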
The R function smgridAB splits the data into groups based on quantiles specified by the arguments Qsplit1
and Qsplit2 and then compares the resulting groups based on trimmed means. By
default, the splits are based on the medians of two of the independent variables.
If the argument x has more than two columns, the columns used to split the
data can be specified via the argument IV. For each row of the first factor (the
splits based on the first independent variable), all pairwise comparisons are made
among the levels of the second factor (the splits based on the second independent
variable). In a similar manner, for each level of the second factor (the splits based
on the second independent variable), all pairwise comparisons among the levels
of the first factor are performed. Setting PB=TRUE, a percentile bootstrap method
is used, which makes it possible to use a robust measure of location other than a
trimmed mean via the argument est. Measures of effect size are returned as well.
To get confidence intervals for the measures of effect size, set the argument fun =
ES.summary.CI.
The R function smgridLC can be used to test hypotheses about linear contrasts. Linear contrast coefficients can
be specified via the argument con. By default, all relevant interactions are tested.
If it is desired to split the data based on a single independent variable, this can be
done with the R function
smgrid(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, PB = FALSE, est = tmean, nboot = 1000, pr = TRUE, xout = FALSE, outfun = outpro, SEED = TRUE, ...).
For example, for a two-by-two design, the data are treated as having four groups,
in which case six tests are performed. In essence, this function treats the data as a
one-way design rather than a two-way design.
If the dependent variable is binary, use the function
smbinAB(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, tr = 0.2, method = ‘KMS’, ...),
which is like the function smgridAB; only the KMS method for comparing two
binomial distributions, described in Sect. 5.2, is used by default. To use method SK,
also described in Sect. 5.2, set the argument method=‘SK’. The R function
smbin.inter(x, y, IV = c(1, 2), Qsplit1 = 0.5, Qsplit2 = 0.5, alpha = 0.05, con = ...)
also deals with a binary dependent variable. By default, all interactions are tested,
but other linear contrasts can be tested via the argument con.
Example This example is based on the Well Elderly data used in the example in
Sect. 8.1.1. Here, data collected prior to intervention are used. The sample size, after
removing missing values, is 333. The two independent variables are the CAR and a
measure of meaningful activities (MAPA), and the dependent variable is a measure
of life satisfaction (LSIZ). The R function regci returns a p-value equal to 0.44
for the first slope (CAR). The p-value for the slope for MAPA is less than 0.001.
Figure 8.3 shows the regression surface based on the R function rplot with
leverage points removed. The plot hints at the possibility that CAR might matter
when MAPA scores are relatively low. And the nature of the association also appears
to depend on whether the CAR is negative or positive. That is, cortisol increasing
or decreasing after awakening might play a role. When MAPA scores are relatively
high, it appears that the CAR plays less of a role.
Here is the output from smgridAB where CAR is the first independent variable:
$est.loc.4.DV
[,1] [,2]
[1,] 16.82759 19.95652
[2,] 14.73077 19.59615
$n
[,1] [,2]
[1,] 94 74
[2,] 84 86
$A
$A[[1]]
Group Group psihat ci.lower ci.upper p.value Est.1
[1,] 1 2 -3.128936 -5.139054 -1.118817 0.002603189 16.82759
Est.2 adj.p.value
[1,] 19.95652 0.002603189
$A[[2]]
Group Group psihat ci.lower ci.upper p.value Est.1
[1,] 1 2 -4.865385 -6.334669 -3.3961 2.384662e-09 14.73077
Est.2 adj.p.value
[1,] 19.59615 2.38465e-09
$B
$B[[1]]
Group Group psihat ci.lower ci.upper p.value Est.1
[1,] 1 2 2.096817 0.1663711 4.027263 0.03356468 16.82759
Est.2 adj.p.value
[1,] 14.73077 0.03356468
$B[[2]]
Group Group psihat ci.lower ci.upper p.value Est.1
[1,] 1 2 0.3603679 -1.215197 1.935933 0.6504541 19.95652
Est.2 adj.p.value
[1,] 19.59615 0.6504541
$A.effect.sizes
$A.effect.sizes[[1]]
Est NULL S M L
AKP -0.4810536 0.0 -0.20 -0.50 -0.80
EP 0.3894225 0.0 0.14 0.34 0.52
$A.effect.sizes[[2]]
Est NULL S M L
AKP -1.0616494 0.0 -0.20 -0.50 -0.80
EP 0.7113924 0.0 0.14 0.34 0.52
QS (median) 0.2386489 0.5 0.45 0.36 0.29
QStr 0.2386489 0.5 0.45 0.36 0.29
WMW 0.7754014 0.5 0.55 0.64 0.71
KMS -0.5307806 0.0 -0.10 -0.25 -0.40
$B.effect.sizes
$B.effect.sizes[[1]]
Est NULL S M L
AKP 0.3305569 0.0 0.20 0.50 0.80
EP 0.2256845 0.0 0.14 0.34 0.52
QS (median) 0.6299392 0.5 0.55 0.64 0.71
QStr 0.5735816 0.5 0.55 0.64 0.71
WMW 0.4119807 0.5 0.45 0.36 0.29
KMS 0.1650360 0.0 0.10 0.25 0.40
$B.effect.sizes[[2]]
Est NULL S M L
AKP 0.07690223 0.0 0.20 0.50 0.80
EP 0.06673028 0.0 0.14 0.34 0.52
QS (median) 0.52168448 0.5 0.55 0.64 0.71
QStr 0.52168448 0.5 0.55 0.64 0.71
WMW 0.48577938 0.5 0.45 0.36 0.29
KMS 0.03833826 0.0 0.10 0.25 0.40
The results labeled $est.loc.4.DV are the 20% trimmed means. The first row deals
with low CAR values, basically CAR values that are negative. The trimmed means
for the two MAPA groups are 16.83 and 19.96. The next row is for high CAR values. For low MAPA scores, the CAR groups have trimmed means shown in
the first column, which are 16.83 and 14.73. The results labeled $A[[1]] are the
results when comparing the two MAPA groups associated with low CAR values. As
can be seen, both p-values are less than 0.003. All six measures of effect size are approximately medium to large. The results labeled $A[[2]] deal with comparing the
two MAPA groups associated with high CAR values. Now the p-value is less than
0.001, and the measures of effect size range between large and very large.
What is particularly interesting are the results labeled $B[[1]], where the two
levels of the first independent variable (CAR) are compared when dealing with
low values of the second independent variable (MAPA). The p-value is 0.034.
This suggests that for low MAPA scores, typical LSIZ scores, when the CAR is
negative (cortisol increases after awakening), are higher compared to the group
where the CAR is positive (cortisol decreases after awakening). The corresponding
measures of effect size range between small and medium. Overall, the data indicate
an association between the CAR and LSIZ for certain regions of the sample space.
The smoother LOESS can be useful, but some caution is warranted because of
the possible impact of outliers among the dependent variable. There is the potential
of getting a substantially different estimate of the typical value of the dependent
variable when using the running interval smoother.
[Fig. 8.3: plot of the estimated regression surface; axes CAR, MAPA, and LSIZ]
[Fig. 8.4: four smooths of Totagg as a function of GPA and Engage; the upper panels are based on LOESS; the lower panels use all of the data; the right columns are when leverage points are removed]
Example Section 1.6.4 described a standardized Totagg score that was highly
skewed with outliers. In the actual study, one goal was to understand the association
between the Totagg score and two independent variables: GPA and a measure of
academic engagement. The data are stored in the file shelley.csv. The sample size
is .n = 336. The first goal here is to illustrate the smooth obtained by LOESS. The
upper left panel of Fig. 8.4 shows the smooth created by the R function lplot
using all of the data. The upper right panel shows the smooth when leverage points
are removed. Leverage points can substantially impact the edges of a smooth as
illustrated here.
The lower left panel of Fig. 8.4 shows the smooth based on the running-interval smoother when leverage points are retained, while in the lower right panel, they are removed. A basic concern is that leverage points typically can impact the edges of a smooth even when using a robust estimator.
The lower right panel of Fig. 8.4 suggests that when using a robust measure
of location, with leverage points removed, a linear model might be reasonable.
Using a percentile bootstrap method in conjunction with the Theil-Sen regression
estimator, via the R function regci, the p-values for the slopes are less than 0.001
for GPA and 0.058 for engage. Using the running-interval smoother again, only
now with the goal of estimating the 0.75 quantile of the Totagg distribution (using
the Harrell-Davis estimator), again, a linear model appears reasonable. Using the
quantile regression estimator via the R function Qregci, both slopes now have a
p-value less than 0.025.
The R function reg.vs.rplot computes Ŷ based on the regression estimator indicated by the argument regfun,
does the same based on the running-interval smoother using the measure of location
indicated by the argument est, and plots the results. By default, a quantile
regression estimator is used, and the measure of location used by the running-
interval smoother is the median. A companion function does the same when the dependent variable is binary. Estimated probabilities
are based on the smoother in Sect. 7.3.6 (using the R function logSMpred) and the
logistic regression model.
[Fig. 8.5: scatterplot based on the R function reg.vs.rplot, where the goal is to predict Totagg scores; axes Reg.Est (horizontal) and Rplot.Est (vertical)]
Example This example is again based on Totagg scores described in Sect. 1.6.4. As
in the last example, the independent variables are GPA and a measure of academic
engagement. Figure 8.5 shows the plot created by the R function reg.vs.rplot.
The dashed line has a slope of one. The solid lines are the 0.25 and 0.75 quantile
regression lines where the .Ŷ values based on a linear model are taken to be the
independent variable and the .Ŷ values based on a smooth are the dependent variable.
As can be seen, the two methods are in fairly close agreement when the estimates
are relatively low. But they diverge substantially for situations where the estimates
tend to be relatively high.
8.6 Interactions
A common way of modeling an interaction between two independent variables is to include a product term:

Y = β0 + β1X1 + β2X2 + β3X1X2. (8.25)

The hypothesis of no interaction is

H0: β3 = 0, (8.26)
which can be tested with methods already covered. Rearranging the terms in (8.25) yields

Y = β0 + (β1 + β3X2)X1 + β2X2.

That is, the model assumes that the slope for X1 changes as a linear function of X2. A possible concern with this approach is that it might not be flexible enough to detect or describe the nature of any interaction. Another concern is that in some
cases, there is severe collinearity when the product term is included in the model.
This point is illustrated in the example in Sect. 8.6.1. See in particular the discussion
of Fig. 8.6.
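For concreteness, fitting the product-term model (8.25) by least squares and reading off the implied conditional slope for X1 can be sketched as follows (Python, illustration only; the R functions discussed here use robust estimators and handle inference, and the names below are ad hoc):

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for the normal equations.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_product_model(x1, x2, y):
    # Least squares fit of Y = b0 + b1*X1 + b2*X2 + b3*X1*X2, as in (8.25).
    X = [[1.0, a, b, a * b] for a, b in zip(x1, x2)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(4)] for i in range(4)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(4)]
    return solve(XtX, Xty)

def slope_for_x1(coefs, x2_value):
    # The rearranged model: the slope for X1 is b1 + b3 * x2.
    return coefs[1] + coefs[3] * x2_value
```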
Smoothers can help assess whether there is an interaction as illustrated in
Sect. 7.5.1. The resulting plot might suggest exploratory methods for checking
and characterizing the nature of an interaction. For example, a plot might suggest
splitting the data based on one of the independent variables and then fitting a linear
regression model to both regions with the goal of determining whether and how
the nature of the associations compare. Grids, described in Sect. 8.5.1, might be useful as well in conjunction with the hypothesis testing methods in Chap. 5.
The R functions olshc4.inter and regci.inter are supplied to help simplify the goal of testing (8.26). These functions merely add the product term to the model. The R function olshc4.inter uses the least squares estimator, and regci.inter uses a robust regression estimator. Both
functions allow heteroscedasticity. The argument x is assumed to have two columns
of data. Note that the R functions ols.plot.inter and reg.plot.inter in
Sect. 7.5.1 complement the two R functions in this section.
Example An example in Sect. 8.5.1 dealt with the Well Elderly data before
intervention. This example is based on data collected after intervention. Here, the
same two independent variables are used: the CAR (described in Sect. 8.1.1) and a
measure of meaningful activities (MAPA). The dependent variable is a measure of
life satisfaction (LSIZ). Now the sample size is .n = 246 after removing missing
values. The goal here is to investigate how the two independent variables interact
when trying to predict LSIZ.
First imagine that the goal is to test for an interaction assuming a linear model is
correct. The data are stored in the file A3B3C_dat.txt. Assuming the data are stored
in the R object A3B3C, here is the command that was used based on the Theil-Sen
estimator:
regci.inter(cbind(A3B3C$cort1-A3B3C$cort2, A3B3C$MAPAGLOB), A3B3C$LSIZ, xout=TRUE, outfun=outblp)
The last argument indicates that bad leverage points are removed. Here is a portion
of the output:
ci.low ci.up Estimate S.E. p-value
Intercept 1.5826789 9.77748100 5.6567935 2.18404452 0.01001669
Slope 1 -2.9746688 1.54708762 -0.8056416 1.18918901 0.52921536
Slope 2 0.2712050 0.51420611 0.3943051 0.06390249 0.00000000
Slope 3 -0.1285668 0.05478402 -0.0244347 0.05015989 0.70951586
The slope for MAPA is significant but not for CAR or the interaction term.
However, here is the output when using the MM-estimator:
ci.low ci.up Estimate S.E. p-value
Intercept -0.1948145 8.7488363 3.9013287 2.29375570 0.06677796
Slope 1 -28.4099474 -4.2663959 -15.4372923 6.28509044 0.01669449
Slope 2 0.2972574 0.5526794 0.4352238 0.06597316 0.00000000
Slope 3 0.1197545 0.8295569 0.4597078 0.18722097 0.02003339
Now all three slopes reject at the 0.05 level; the largest p-value for the slopes is
0.02. Of course, different robust regression estimators can give similar results, but in
this case, the estimates of the slope for CAR differ substantially. Using instead the
quantile regression estimator, by setting the argument regfun=Qreg, gives results
similar to those based on the MM-estimator. A concern here is that there is severe
collinearity as indicated by Fig. 8.6, which shows a scatterplot of CAR versus the
product of CAR and MAPA.
To gain perspective, Fig. 8.7 shows the estimated regression surface based on
the running-interval smoother. Compare this to Fig. 8.8 where the left panel shows
the regression surface based on the Theil-Sen estimator when the product term
model represented by (8.25) is assumed to be true. The right panel is based on
the MM-estimator. As is evident, the running-interval smoother suggests that there
are details about the association that are missed when using the Theil-Sen estimator.
Figure 8.7 suggests that the nature of the association depends on whether CAR is
relatively large or small. That is, there appears to be a bend in the regression surface
where the CAR is approximately equal to its median value. For high CAR values
and low MAPA values, the CAR appears to take on more importance compared
[Fig. 8.6: scatterplot of CAR versus CAR*MAPA, illustrating a strong association between these two measures; that is, there is collinearity]
[Fig. 8.7: plot of the estimated regression surface based on the running-interval smoother; axes CAR, MAPA, and LSIZ]
Fig. 8.8 Estimates of the regression surface assuming the product term model for interactions is
true. The left panel used the Theil-Sen estimator, and the right panel used the MM-estimator
to when the CAR is low. The right panel of Fig. 8.8 suggests that using the MM-
estimator, assuming that (8.25) is true, provides a fit that is more in agreement with
the running-interval smoother. But the running-interval smoother suggests that again
the linear model is missing interesting features of the association.
To explore this possibility, first split the data into two groups based on the median
CAR value, which is equal to .−0.03. Next, fit a linear model to both groups,
and compare the resulting slopes and intercepts using the robust method in
Sect. 8.4.2; the R function reg2ci was used. The output based on the Theil-Sen
estimator is
The estimates of the intercept and slopes when the CAR is low are given under the
column headed by Group 1. The results indicate that it is reasonable to decide that
the slope for the CAR is greater when the CAR is low compared to when it is high.
In addition, the results indicate that the slope for MAPA is larger when the CAR is
high. The estimates based on the MM-estimator follow a similar pattern, but now
only the slope for CAR is significant at the 0.05 level, and the p-value is less than
0.012 based on .B = 2000 bootstrap samples.
Using grids via the R function smgridAB adds perspective. It is left as an
exercise to show that for high CAR values, comparing low to high MAPA groups,
the p-value is less than 0.001. Moreover, effect sizes are estimated to be quite large.
Checking for an interaction via the R function smgridLC, the p-value is 0.001.
The main point here is that multiple methods can be needed to get a reasonably
deep and nuanced understanding of how variables are related.
8.7 Exercises
1. Using the Leerkes data, available via the R package WRS2, suppose that an
esteem measure greater than 3 is considered reasonably high. The goal is to
determine when, given a value for maternity care, it is reasonable to decide
that esteem is greater than 3. Using the R function regYci, address this issue,
based on the confidence intervals, with leverage points removed. Also, plot the
p-values.
2. Next, compute confidence intervals, based on the R function regYband,
which are adjusted so that the simultaneous probability coverage is 0.95. Again,
use the confidence intervals to determine when it is reasonable to decide that
esteem is greater than 3.
3. For the Leerkes data, use the R function logreg.P.ci to compute confidence
intervals for the probability that esteem is greater than 3. Repeat this using the
R function rplot.binCI, and comment on the results.
4. When using the least squares estimator, the HC3 or HC4 estimates of the
standard errors provide an effective way of dealing with heteroscedasticity.
Consider the data used in Sect. 8.5.1 where the dependent variable is the Totagg
score. Assuming a linear model is correct, why is there evidence that this
approach might be unsatisfactory?
5. The example in Sect. 8.6.1 indicated that large measures of effect size are revealed by the R function smgridAB when comparing the low and high
groups associated with MAPA when CAR values are high. Verify that this is
the case.
6. Assume normality, the linear model is correct, and there are two or more
independent variables. What might explain low power other than a small sample
size?
7. The example in Sect. 7.5.1 dealt with understanding the interaction between the
independent variables CAR (the cortisol awakening response) and a measure
of meaningful activities (MAPA), when the goal is to predict a measure of
depressive symptoms (CESD). The data are stored in the file A3B3C_dat.txt.
Use regci.inter, olshc4.inter, and smgridLC to check for an
interaction when leverage points are removed.
8. Look at the plot in Fig. 7.19. As noted in Sect. 7.5.1, it appears that the nature
of the association depends on whether the CAR is positive or negative. Divide
the data into two groups based on whether CAR is positive or negative, and
compare the slopes for these two groups using the R function reg2ci. In
case it helps, here is some code that can be used assuming the data in the file
A3B3C_dat.txt are stored in the R object A3B3C:
z=cbind(A3B3C$MAPAGLOB, A3B3C$cort1-A3B3C$cort2, A3B3C$CESD)
z=elimna(z)
id=z[,2]<0
reg2ci(z[id,1],z[id,3],z[!id,1],z[!id,3])
Comment on how the results compare to the results in the previous exercise.
9. The example in Sect. 8.6.1 dealt with the association between a measure of
life satisfaction (the dependent variable) and two independent variables: the
CAR and MAPA, a measure of meaningful activities. Assuming the linear
model given by (8.25) is reasonable and testing the hypothesis of no interaction
by testing (8.26), the p-value was shown to be 0.71 based on the Theil-Sen
estimator. Check for an interaction using the R function smgridLC. If, as was
done in the previous exercise, the data in the file A3B3C_dat.txt are stored in
the R object A3B3C, here is the R command:
smgridLC(cbind(A3B3C$cort1-A3B3C$cort2, A3B3C$MAPAGLOB), A3B3C$LSIZ)
What does this demonstrate?
10. Assume normality and that the linear model given by (8.25) is correct. What
might explain relatively low power when using this model?
11. Use the file B3.txt to study the association between STRESS (the independent
variable stored in B3$STRESS) and depressive symptoms (CESD). First,
examine a smooth using lplot. Next, use regci to test the hypothesis of
a zero slope followed by regYband to get confidence intervals for the typical
CESD value given a value for STRESS. Plot the 0.2, 0.5, and 0.8 quantile
regression lines using qregplots. A CESD score greater than 15 is an
indication of mild depression. Use logSM to estimate the likelihood of having
mild depression or worse given a value for STRESS. Summarize what these
results tell you.
12. Columns 2 and 3 in the file marital_agg_dat.txt contain measures of aggression
in a home (the independent variable) and measures of the cognitive function
of a child living in the home. Test the hypothesis of a zero slope using the
R function ols, which assumes homoscedasticity. How would you interpret
the results? Now use the function olshc4 which allows heteroscedasticity.
Finally, use the R function regci, and comment on how the results from these
three methods compare. Hint: what does a scatterplot of the data suggest?
13. Describe a possible advantage of increasing the number of bootstrap samples
when testing hypotheses.
14. When testing the hypothesis that a slope associated with the independent
variable X1 is zero, can the result depend on whether other independent
variables are included in the model?
15. Assume a linear model is reasonable. Is it valid to remove leverage points and
then use a heteroscedastic method to test hypotheses based on the least squares
estimator?
16. Assume a linear model is reasonable. Is it valid to remove outliers among
the dependent variable and then use a heteroscedastic (HC4) method to test
hypotheses based on the least squares estimator?
17. Imagine that the hypothesis that all of the slope parameters are equal to zero is
rejected based on a homoscedastic method used in conjunction with the least
squares estimator. That is, the classic F test, covered in an introductory course,
is used. What is a good way of reporting this result?
18. Consider the R function that computes the deterministic version of the MCD
method. By default, it searches for the 75% of the data that are most tightly
clustered together. If these points are used to fit a linear regression model, what
are some concerns about this approach?
19. Assume a linear model is correct. Describe concerns about testing hypotheses
about the slopes using the R functions lm and summary.
Chapter 9
Measures of Association
This chapter deals with measures of association, how inferences about these
measures of association might be made, plus methods for comparing measures
of association. Included is a method for making inferences about which of the
two independent variables is more important when both independent variables
are included in a linear model. There is also the issue of whether, say, the first
two of three independent variables, taken together, have a stronger association with
the dependent variable than does the third independent variable.
The reality is that, among the various robust measures of association that
might be used, the choice of method can matter tremendously. The illustrations in
Sect. 9.4.2 demonstrate this point and help provide some indication of the caution
that must be exercised when characterizing the strength of an association.
Chapter 1 demonstrated that Pearson’s correlation, ρ, is not robust. This section
comments on methods for making inferences about ρ that might be of interest
despite its lack of robustness.
First consider the basic goal of testing

H0: ρ = 0,    (9.1)

the hypothesis that Pearson’s correlation is equal to zero. The method routinely
taught is based on the test statistic

T = r √((n − 2)/(1 − r²)).    (9.2)
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9_9
If at least one of the variables has a normal distribution, and if X and Y are
independent, T has a Student’s t distribution with ν = n − 2 degrees of freedom.
If X and Y are independent, T , given by (9.2), controls the Type I error
probability reasonably well (e.g., Kowalski, 1972; Srivastava and Awan, 1984;
Bishara and Hittner, 2012). Note that independence means in particular that there
is homoscedasticity. Homoscedasticity plays an essential role in the derivation of
T . If there is heteroscedasticity, the wrong standard error is being used. Wilcox
(2017a, Section 6.6) illustrates that when there is heteroscedasticity and ρ = 0,
there are situations where the probability of rejecting, using T , increases as the
sample size n increases. That is, if the goal is to test the hypothesis that there is
independence, T performs reasonably well. But if the goal is to make inferences
about ρ, such as computing a confidence interval for ρ, using T is unsatisfactory.
There are heteroscedastic methods for computing a confidence interval, but all of
the methods studied by Bishara and Hittner (2012) were found to yield inaccurate
confidence intervals in some situations. A commonly made suggestion is to use what
is called Fisher’s r-to-z transformation. But Duncan and Layard (1973) describe
general conditions where this approach fails, so the details of this method are not
provided.
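For reference, the computation in (9.2) can be sketched as follows. The book works in R; Python is used here purely for illustration, and the function names are the author's own invention. Under independence, T would be compared to a Student's t distribution with n − 2 degrees of freedom.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(x, y):
    """Test statistic (9.2): T = r * sqrt((n - 2) / (1 - r^2)).
    Under independence (and normality of at least one variable),
    T has a Student's t distribution with n - 2 degrees of freedom."""
    n = len(x)
    r = pearson_r(x, y)
    return r * math.sqrt((n - 2) / (1 - r ** 2))
```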
Currently, the best method for computing a confidence interval for ρ, which was
not considered by Bishara and Hittner (2012), is to use a bootstrap-t method. The
basic idea is to use the available data to estimate the distribution of

T = (r − ρ) / √V,    (9.3)

where V is the HC4 estimate of the squared standard error of r. The quantiles of this
distribution yield a confidence interval for ρ.
The method is applied as follows. First, compute r. Next, standardize both X and
Y, yielding

Zxi = (Xi − X̄)/sx

and

Zyi = (Yi − Ȳ)/sy.
But the reality is that even a few outliers can result in a highly misleading value
for r. One of the main goals in this chapter is describing methods for dealing
with outliers and discussing their relative merits. One basic approach is to use a
method that deals with outliers among the marginal distributions. Examples of this
approach are described in Sect. 9.2. These methods certainly improve matters, and they
might suffice in a given situation, but situations are encountered where this is not
the case because they do not take into account the overall structure of the data
cloud as described in Sect. 7.1. One could simply exclude points declared outliers
using results in Sect. 7.1 and compute something like Pearson’s correlation using the
remaining data. This raises the issue of how to make inferences about the population
correlation, a topic that is covered in Sect. 7.2. Yet another approach is to assume a
linear model is valid and use a measure of association that deals with bad leverage
points. Section 9.4.1 describes one way this might be done.
Type M correlations refer to correlations that deal with outliers among the marginal
distributions. That is, they guard against the deleterious impact of outliers among
the X values ignoring Y and the impact of outliers among the Y values ignoring X.
Four versions of this type of correlation are described in this section. Two are often
covered in an introductory statistics course, but it is important to understand their
relative merits in terms of their sensitivity to outliers. In practice, the methods in this
section might suffice, but a concern is that they do not take into account the overall
structure of the data cloud when dealing with outliers, an issue that was discussed
in Sect. 7.1. Section 9.4.2 illustrates that ignoring this issue can be a concern.
Kendall’s tau is one of the two best-known methods for dealing with outliers among
the marginal distributions. Roughly, the method characterizes the extent to which Y increases
when X increases. This is done in terms of the extent to which any two points are concordant.
That is, among all possible pairs of points, if a line is drawn between these points,
how does the proportion of times the slope is positive compare to the proportion of
times the slope is negative?
The details are as follows. Consider two pairs of observations: (X1, Y1) and
(X2, Y2). For convenience, assume that X1 < X2. If Y1 < Y2, then these two
pairs of numbers are said to be concordant. That is, if X increases, Y increases as
well. Put another way, if X decreases, Y decreases. If two pairs of observations are
not concordant, meaning that when X increases, Y decreases, they are said to be
discordant. That is, a pair of points is discordant if X1 < X2 but Y1 > Y2.
If the ith and jth pairs of points are concordant, let Kij = 1. If they are
discordant, let Kij = −1. Kendall’s tau is the average of all Kij values for which
i < j. More succinctly, Kendall’s tau is estimated with

τ̂ = 2 Σ_{i<j} Kij / (n(n − 1)),    (9.5)

which has a value between −1 and 1. If τ̂ is positive, there is a tendency for Y to
increase with X – possibly in a nonlinear fashion – and if τ̂ is negative, the reverse
is true. If Y always increases as X increases, τ̂ = 1. If as X increases, Y always
decreases, τ̂ = −1.
The population analog of τ̂ is labeled τ and can be shown to be zero when X and
Y are independent. The classic test of

H0: τ = 0    (9.6)
is based on

Z = τ̂ / στ,    (9.7)

where

στ² = 2(2n + 5) / (9n(n − 1)).

The hypothesis is rejected if

|Z| ≥ z₁₋α/₂,    (9.8)

where z₁₋α/₂ is the 1 − α/2 quantile of a standard normal distribution. This test is
designed to test the hypothesis that X and Y are independent, but it is not well
designed for making inferences about τ. This concern can be addressed by using a
percentile bootstrap method instead, which can be applied via the R function tauci
in Sect. 9.2.5.
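The estimate (9.5) and the classic test statistic (9.7) can be sketched as follows. Python is used here for illustration in place of R, the function names are the author's own, and tied pairs are simply scored as zero in this sketch.

```python
import math
from itertools import combinations

def kendall_tau_hat(x, y):
    """Estimate of Kendall's tau, Eq. (9.5): the average of K_ij over all
    pairs i < j, where K_ij = 1 for a concordant pair and -1 for a
    discordant pair. Ties contribute zero in this sketch."""
    n = len(x)
    total = 0
    for i, j in combinations(range(n), 2):
        d = (x[j] - x[i]) * (y[j] - y[i])
        total += 1 if d > 0 else (-1 if d < 0 else 0)
    return 2 * total / (n * (n - 1))

def z_statistic(x, y):
    """Classic test of H0: tau = 0, Eq. (9.7): Z = tau_hat / sigma_tau,
    with sigma_tau^2 = 2(2n + 5) / (9n(n - 1)). Reject if |Z| >= z_{1-alpha/2}."""
    n = len(x)
    sigma = math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    return kendall_tau_hat(x, y) / sigma
```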
Consider the random sample X1, . . . , Xn. The smallest value is said to have a rank
of 1. The next smallest has a rank of 2, and so on. When there are tied (duplicated)
values, midranks are typically used. Consider, for example, the values 12, 13, 13,
25, 45, and 64. The value 12 gets a rank of 1, but there are two identical values
that would occupy the ranks 2 and 3. The midrank is simply the average of the ranks among
the tied values. Here, the rank assigned to the two values equal to 13 would be
(2 + 3)/2 = 2.5, the average of their corresponding ranks. The ranks for all six
values are then 1, 2.5, 2.5, 4, 5, and 6.
Spearman’s rho, ρs, is Pearson’s correlation computed on the ranks. The classic test of

H0: ρs = 0    (9.9)

is based on

T = rs √((n − 2)/(1 − rs²)),    (9.10)

which is compared to a Student’s t distribution with n − 2 degrees
of freedom. As was the case with Pearson’s correlation and Kendall’s tau, this
approach might be unsatisfactory when making inferences about the corresponding
population measure of association, ρs. A safer approach is to use a percentile
bootstrap method.
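The midrank scheme described above, together with Spearman's rho computed as Pearson's correlation applied to the (mid)ranks, can be sketched as follows. Python is used for illustration; the function names are the author's own.

```python
import math

def midranks(values):
    """Assign ranks, giving tied values the average (midrank) of the
    ranks they would otherwise occupy."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho_hat(x, y):
    """Spearman's rho: Pearson's correlation applied to the midranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

For the six values in the text, midranks returns 1, 2.5, 2.5, 4, 5, 6.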
H0: ρw = 0    (9.11)

Let

ν = n − 2g − 2,
Yet another approach is to standardize the data based in part on a robust measure of
location and a measure of variation that empirically determine which values are outliers, which
are then eliminated. For example, the measure of location could be the M-estimator
in Sect. 2.1.2, and the measure of variation could be MAD. The percentage bend
correlation is based on a variation of this approach. It uses a slight modification
of the M-estimator in Sect. 2.1.2, which here is labeled φ̂. (Precise details are in
Wilcox, 2022a, Section 9.3.1.) The measure of variation that is used is based on a
modification of MAD, namely, ω̂, the 0.8 quantile of |X1 − M|, . . . , |Xn − M|, which
has a breakdown point of 0.2. Let Ui = (Xi − φ̂x)/ω̂x and Vi = (Yi − φ̂y)/ω̂y. Let
Ai = Ui if −1 ≤ Ui ≤ 1; if Ui > 1, Ai = 1, and if Ui < −1, Ai = −1. Similarly,
Bi = Vi if −1 ≤ Vi ≤ 1; if Vi > 1, Bi = 1, and if Vi < −1, Bi = −1. The
percentage bend correlation is then estimated with

rpb = Σ Ai Bi / √(Σ Ai² Σ Bi²).
The hypothesis

H0: ρpb = 0    (9.14)

can be tested with

Tpb = rpb √((n − 2)/(1 − rpb²)),    (9.15)

which is compared to a Student’s t distribution with n − 2 degrees of freedom when
the hypothesis of independence is true.
The R function

pbcor(x,y,beta=0.2)

computes the percentage bend correlation and tests the hypothesis of independence
via (9.15). The argument beta=0.2 means that the 0.8 quantile of |X1 −
M|, . . . , |Xn − M| is used as a measure of variation by default. The power of this test
depends on the value chosen for beta. There is no optimal choice, but the default,
corresponding to the 0.8 quantile, appears to be a good choice in most situations.
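The standardize-and-clip scheme behind the percentage bend correlation can be sketched as follows. This is only a rough illustration in Python: the median stands in here for the M-estimator φ̂, and ω̂ is taken as a simple empirical 0.8 quantile, whereas the R function pbcor uses more refined versions of both, so results will differ in detail.

```python
import math

def pbcor_sketch(x, y, beta=0.2):
    """Rough sketch of the percentage bend correlation. The median is
    used in place of the M-estimator phi-hat, and omega-hat is a crude
    empirical (1 - beta) quantile of |X_i - M|."""
    def median(v):
        s = sorted(v)
        n = len(s)
        return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

    def omega(v, m):
        d = sorted(abs(vi - m) for vi in v)
        idx = min(len(d) - 1, int((1 - beta) * len(d)))
        return d[idx]

    def bend(v):
        m = median(v)
        w = omega(v, m)
        u = [(vi - m) / w for vi in v]
        return [max(-1.0, min(1.0, ui)) for ui in u]  # clip U into [-1, 1]

    a, b = bend(x), bend(y)
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))
    return num / den
```

Because extreme values are clipped at ±1, a single wild point has a bounded influence on the estimate.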
The R function

corb(x,y,corfun=pbcor,nboot=599,...)

tests the hypothesis of a zero correlation using the heteroscedastic percentile bootstrap
method. By default, it uses the percentage bend correlation, but other measures
of association can be used via the argument corfun provided the function labels
the estimate as $cor. For example, corb(x,y,corfun=wincor,tr=0.25)
would use a 25% Winsorized correlation.
The R functions

tau(x,y,alpha=0.05)

spear(x,y)

and

wincor(x,y=NULL,tr=0.2)

estimate Kendall’s tau, Spearman’s rho, and the Winsorized correlation, respectively.
They also test the hypothesis of independence using (9.8), (9.10), and (9.12),
respectively.
Although corb can be used with any of the correlation estimators in this section,
for convenience, the R functions

tauci(x,y=NULL,tr=0.2)

spearci(x,y=NULL,tr=0.2)

and

wincorci(x,y=NULL,tr=0.2)

are provided as well; they make inferences about the corresponding population
measures of association using a percentile bootstrap method.
Type O correlations refer to correlations that deal with outliers in a manner that
takes into account the overall structure of the data cloud. One basic strategy is
to use the MVE or MCD estimators mentioned in Sect. 7.1, which are scaled to
estimate the covariance matrix when dealing with a multivariate normal distribution.
The resulting covariance matrix can be used to compute correlations. For any two
random variables, the resulting correlation is given by (1.26) in Sect. 1.7, where now
sxy, sx, and sy are the rescaled estimates based on the MVE or MCD methods. Here,
the MCD estimator is computed via the method derived by Hubert et al. (2012). The
default version of this estimator, when using the R function DETMCD, is based on
the central 75% of the data rather than the central half.
A variation of this method is to use a robust analog of the Mahalanobis distance
based on the MCD estimator to detect outliers. Next, remove any outliers that are
found, and compute something like Pearson’s correlation using the remaining data.
This approach implicitly assumes that a distribution is elliptically contoured. This is
an example of a skipped estimator. Skipped estimators generally refer to the strategy
of removing any outliers before some estimator is used.
Another approach is to use the projection method outlined in Sect. 7.1 for
detecting outliers. Again, any outliers that are found are removed, and something
like Pearson’s correlation is computed based on the remaining data. This approach
eliminates the assumption that a distribution is elliptically contoured. It is noted that
the term skipped correlation is often taken to mean that outliers are removed using
a projection method.
The R function

mcd.cor(x,y)

computes the MCD correlation for two random variables. A confidence interval can
be computed via the R function corb in Sect. 9.2.5. The R function

MCDCOR(x)
computes the skipped correlation, the correlation after outliers are identified and
removed using a projection method. By default, n projections are used as explained
in Sect. 7.1. If execution time is an issue, one option is to set the argument
RAN=TRUE, in which case random projections are used. Another possibility is to
set MC=TRUE. This uses a multicore processor based on n projections assuming the
R package parallel has been installed. Once outliers are removed, correlations are
computed based on the argument corfun, which defaults to Pearson’s correlation.
The R function
computes a confidence interval and a p-value for the skipped correlation when only
two variables are involved.
Notice that if there are three or more random variables, there are two distinct
approaches when using a skipped correlation. The first is to compute a skipped
correlation for each pair of random variables where outliers are identified for the
pair of variables of interest, ignoring the other variables that are available. This is
in contrast to checking for outliers using the data for the p variables taken together
rather than in pairs.
The R function

scorall(x,outfun=outpro,corfun=pcor,RAN=FALSE,...)

accomplishes this goal. To get p-values and confidence intervals, use the R function
Y = β0 + β1X1 + · · · + βpXp.    (9.16)

When using the least squares estimator, a standard method for characterizing the
strength of the association is with

R² = VAR(Ŷ) / VAR(Y),    (9.17)

where

Ŷ = b0 + b1X1 + · · · + bpXp    (9.18)

and b0, b1, . . . , bp are the least squares estimates of the intercept and slopes. R² is
generally known as the coefficient of determination and contains r², the square of
Pearson’s correlation as described in Chap. 1, as a special case. A technical issue
is that R² is biased. That is, on average, over many studies, its value tends to be
higher than the population version of R². Gonzales and Li (2022) compared R² to
an estimator that deals with this issue. They also considered

f² = R² / (1 − R²),
labeled Strength.Assoc.
When simply removing all leverage points, inferential methods in Chap. 8 remain
valid. The same is true when removing only bad leverage points. This is in contrast
to a skipped correlation that removes outliers among both the dependent and
independent variables. Special techniques are required for dealing with a skipped
correlation.
A closer look at bad leverage points is helpful. For convenience, first focus on
a single independent variable X. Here, the BLP measure of association is based
on first fitting a regression line with bad leverage points removed. For notational
convenience, let (X1, Y1), . . . , (XN, YN) denote the data after bad leverage points
have been removed. Next, estimate the slope and intercept based on some robust
regression estimator, yielding b0 and b1, respectively. Let Ŷi = b0 + b1Xi,
i = 1, . . . , N. Let U² denote some measure of variation based on Ŷi, and let
V² be some measure of variation based on Yi, i = 1, . . . , N. The default measure
of variation here is the percentage bend measure of variation, with the understanding
that in some cases an alternative measure of variation might have some practical
value. Then an analog of R² is

R²blp = U² / V².    (9.19)
Note that R²blp is readily computed when dealing with p ≥ 1 independent variables.
A well-known property of R², given by (9.17), is that it increases whenever a new
independent variable is added to the model. It is noted that this is not necessarily the
case when using R²blp.
For the special case p = 1, an analog of Pearson’s correlation is

rblp = sign(b1) √(R²blp),    (9.20)

where sign(b1) = 1 if the slope, b1, is positive, −1 if the slope is negative, and
0 if the slope is zero. Note that Pearson’s correlation and related measures make
no distinction between the independent variable and the dependent variable. This
is in contrast to the BLP measure of association. If X is taken to be the dependent
variable rather than Y, this can result in a different value for rblp.
The hypothesis

H0: ρblp = 0    (9.21)

can be tested by first computing S, a bootstrap estimate of the standard error of rblp,
and then assuming that

W = rblp / S    (9.22)

has a standard normal distribution when the null hypothesis is true. That is, reject at
the α level if |W| ≥ z₁₋α/₂, where again z₁₋α/₂ is the 1 − α/2 quantile of a standard
normal distribution. A 1 − α confidence interval for ρblp is

rblp ± z₁₋α/₂ S.    (9.23)

It is briefly noted that using a percentile bootstrap has been found to be satisfactory
provided bootstrap samples are based on the entire data set, not the data after bad
leverage points are removed.
The R function

corblp(x,y,regfun=MMreg,varfun=pbvar,plotit=FALSE,...)

computes the correlation rblp when the argument x is a vector, and it
computes R²blp when there are two or more independent variables. The R function

cor7(x,y,regfun=tsreg)

computes seven correlations. Situations are encountered where all seven give very
similar results. But there are exceptions as illustrated next.
Example One of the examples in Sect. 8.4.1 dealt with the random variables
RAN2T1, a measure of speeded naming for letters, and RAN1T1, a measure of
speeded naming for digits, that are part of the reading data. The main goal is
to demonstrate the extent different methods can yield different indications of the
strength of the association. Here are the results using the R function cor7:
Est. p.value ci.low ci.up
Pearson, BT.HC4 0.1061694 1.000000e-02 0.01824913 0.3029565
Winsor 0.4318906 2.000000e-03 0.19359834 0.6513134
Spearman 0.4526175 0.000000e+00 0.20588094 0.6499297
Tau 0.3470320 0.000000e+00 0.16628615 0.5000000
Per. Bend 0.4132662 0.000000e+00 0.14701431 0.6400764
Skip 0.6454276 0.000000e+00 0.45427446 0.8248476
BLP 0.7678046 6.434918e-07 0.46548147 1.0000000
Note that the estimates range between 0.106 and 0.768. Pearson’s correlation is
substantially lower than the estimates based on the other methods considered, and
it has the largest p-value. The Type M methods give fairly similar results. The
skipped estimator (a Type O estimator based on the projection method for detecting
outliers) yields a much higher estimate than estimates based on the Type M methods.
And the method that eliminates bad leverage points yields the largest estimate. The
main point is that how outliers are treated can make a practical difference. Simply
dealing with the outliers among the marginal distributions can miss a much stronger
association among the bulk of the data. A plot of the data, not shown here, makes it
clear that there are bad leverage points that are impacting the Type M methods.
Example This next example is based on the star data described in Sect. 7.4.1. Here
is the output from cor7:
Est. p.value ci.low ci.up
Pearson, BT.HC4 -0.2104133 6.500000e-01 -0.423389100 0.4291123
Winsor 0.3444762 1.360000e-01 -0.187425100 0.6419949
Spearman 0.2951495 1.320000e-01 -0.071237686 0.6197467
Tau 0.2497687 5.400000e-02 -0.002775208 0.4671600
Per. Bend 0.3111173 1.602671e-01 -0.198934092 0.6576393
Skip 0.6821947 0.000000e+00 0.454693015 0.8140987
BLP 0.6068866 2.984392e-06 0.352283940 0.8614893
In this case, the skipped correlation is largest followed by the BLP correlation.
Again, the range of the estimates is quite large. Note the wide range of p-values.
Example This example is based on the chili data stored in the WRS2 R package
under the name chile. The variables are the length of the chili in centimeters and the heat of the chili
measured on a scale from 0 to 11. Here is the output from cor7:
Est. p.value ci.low ci.up
Pearson, BT.HC4 -0.3669241 0.010000000 -0.5680674 -0.13331585
Winsor -0.3147331 0.006000000 -0.5268275 -0.10798394
Spearman -0.3632750 0.000000000 -0.5298448 -0.16914229
Tau -0.2406162 0.000000000 -0.3593838 -0.11092437
Per. Bend -0.3784720 0.003338898 -0.5597648 -0.18402007
Skip -0.4177695 0.012000000 -0.5456365 -0.16742722
BLP -0.2848294 0.004581257 -0.4817384 -0.08792026
X = β11Z + β01 + ε1    (9.24)

and

Y = β12Z + β02 + ε2,    (9.25)

where ε1 and ε2 have some unknown bivariate distribution. The standard approach
estimates the slopes and intercepts via the least squares estimator. Let rij (i =
1, . . . , n; j = 1, 2) denote the corresponding residuals, where j = 1 refers to the
residuals associated with (9.24) and j = 2 refers to the residuals associated with (9.25).
Then the partial correlation is simply Pearson’s correlation based on these residuals.
To get a robust analog, first remove any bad leverage points, and replace the
least squares estimator with some robust regression estimator. Next, based on the
resulting residuals, replace Pearson’s correlation with some robust measure of
association. This approach is called M1, which uses the residuals for all of the data.
Method M2 is a variation of M1, which is applied as follows. Let
(X1, Y1, Z1), . . . , (XN, YN, ZN) denote the data after removing points flagged
as outliers. The hypothesis of interest is

H0: ρxy.z = 0,    (9.26)

where ρxy.z is some robust analog of the partial correlation (Wilcox and Friedemann,
2022). However, when using Spearman’s rho, Kendall’s tau, and the Winsorized
correlation, bad leverage points might still be a source of concern in terms of
controlling the Type I error probability. Using the skipped correlation avoids
problems with bad leverage points at the possible expense of less power.
Example The Leerkes data are used to illustrate the partial correlation coefficient.
The skipped correlation between esteem and maternal care is 0.469, and the 0.95
confidence interval is (0.205, 0.628). The partial correlation, taking into account
efficacy, is 0.345, the 0.95 confidence interval is (0.044, 0.565), and the p-value is
0.027.
The R function
computes a partial correlation based on the data stored in the arguments x and y,
controlling for the data in z, which can be a vector or a matrix with p columns
corresponding to p independent variables. By default, method M2 is used with
a Winsorized correlation and the MM-estimator regression estimator. A bootstrap
method is used to compute a confidence interval for ρxy.z. Setting BOOT = FALSE,
the hypothesis given by (9.26) is tested using methods in Sect. 9.2, in which case no
confidence interval is reported. These non-bootstrap methods are fine for testing the
hypothesis of independence but can be unsatisfactory when computing a confidence
interval. If it is desired to remove all leverage points, set XOUT.blp = FALSE
and XOUT = TRUE.
[Figure: scatterplot of C-peptide level (vertical axis, 3.0–6.0) versus age (horizontal axis, 5–15).]
there appears to be little or no association. For ages less than or equal to 8 months,
testing the hypothesis that the slope of the regression line is zero, using the Theil-
Sen estimator via the R function regci, the p-value is 0.038, and the strength of
the association is estimated to be R = 0.93. Comparing the slope of the regression
line when age is less than or equal to 8 months to the slope of the regression line when
age is greater than 8 months, using the R function reg2ci, the p-value is less than
0.01. Using the logarithms of the C-peptide levels, as done in the actual study, now
R = 0.15 based on LOWESS.
This section deals with methods aimed at comparing measures of association. First,
methods for comparing independent groups are described, followed by methods
where dependent variables are involved.
When testing

H0: ρ1 = ρ2,    (9.27)

the hypothesis that two independent groups have the same Pearson correlation, a
natural strategy aimed at dealing with heteroscedasticity is to use the test statistic

T = (r1 − r2) / √(V1 + V2),    (9.28)

where V1 and V2 are the HC4 estimates of the squared standard errors of r1 and r2,
respectively. As usual, there is the issue of estimating the distribution of T. The
method used here basically mimics the bootstrap-t method in Sect. 9.1.
The strategy for computing a confidence interval is to estimate the distribution of

T = ((r1 − ρ1) − (r2 − ρ2)) / √(V1 + V2).    (9.29)

Briefly, take a bootstrap sample from both groups, and compute Pearson’s correlation
based on these bootstrap samples, yielding r1* and r2*, and let V1* and V2* denote
the HC4 estimates of the squared standard errors of r1* and r2*, respectively. The HC4
estimator is computed as indicated in Sect. 9.1. Let

U* = ((r1* − r1) − (r2* − r2)) / √(V1* + V2*).

Repeat this process B times, and put the values in ascending order, yielding U*(1) ≤
· · · ≤ U*(B). Let ℓ = αB/2, rounded to the nearest integer, and let u = B − ℓ.
Then a 1 − α confidence interval for ρ1 − ρ2 is

((r1 − r2) − U*(u) √(V1 + V2), (r1 − r2) − U*(ℓ+1) √(V1 + V2)).    (9.30)
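Once the B bootstrap values of U* are in hand, assembling the interval (9.30) is mostly an indexing exercise. A sketch in Python (for illustration; the function name is the author's own, and the HC4 and bootstrap computations are assumed to have been done elsewhere):

```python
import math

def boot_t_ci(r1, r2, v1, v2, u_star, alpha=0.05):
    """Assemble the bootstrap-t interval (9.30) for rho_1 - rho_2.
    u_star holds the B bootstrap values of U*; v1 and v2 are the HC4
    estimates of the squared standard errors of r1 and r2."""
    b = len(u_star)
    u_sorted = sorted(u_star)       # U*(1) <= ... <= U*(B)
    ell = round(alpha * b / 2)      # lower index, rounded to nearest integer
    u = b - ell                     # upper index
    se = math.sqrt(v1 + v2)
    diff = r1 - r2
    # The upper quantile of U* sets the lower endpoint, and vice versa.
    return (diff - u_sorted[u - 1] * se, diff - u_sorted[ell] * se)
```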
Consider a linear model with p independent variables. Let ρyj denote Pearson’s
correlation between Y and Xj, j = 1, . . . , p. The goal is to test

H0: ρyj = ρyk.    (9.31)
Because both of these two correlations involve Y , the estimates of these correlations
are dependent, which must be taken into account. The method used here is based on
a modification of a method derived by Zou (2007).
Let (l1, u1) and (l2, u2) be 1 − α confidence intervals for ρy1 and ρy2,
respectively. Here, these confidence intervals are based on (9.4), which uses the
bootstrap HC4 method in Sect. 9.1. Then a 1 − α confidence interval for ρy1 − ρy2
is

(L, U),    (9.32)

where

L = ry1 − ry2 − √((ry1 − l1)² + (u2 − ry2)² − 2 corr(ry1, ry2)(ry1 − l1)(u2 − ry2)),

U = ry1 − ry2 + √((u1 − ry1)² + (ry2 − l2)² − 2 corr(ry1, ry2)(u1 − ry1)(ry2 − l2)),

and corr(ry1, ry2) is an estimate of the correlation between the two estimates ry1 and ry2.
The R function

tests the hypothesis given by (9.31). That is, it compares Pearson correlations for
the overlapping case. To get a p-value, use the R function

twoDcorR(x,y,corfun=wincor,alpha=0.05,nboot=500,SEED=TRUE,MC=FALSE)
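Returning to the interval (9.32): given the two marginal confidence intervals and an estimate of the correlation between the two dependent correlation estimates (passed in here as corr12 rather than computed), the endpoints can be sketched in Python as follows; the function name is the author's own.

```python
import math

def zou_diff_ci(r1, ci1, r2, ci2, corr12):
    """Sketch of the modified Zou (2007) interval (9.32) for the difference
    between two overlapping correlations. ci1 = (l1, u1) and ci2 = (l2, u2)
    are 1 - alpha confidence intervals for the two correlations; corr12 is
    an estimate of the correlation between the two estimates, whose
    computation is not shown here."""
    l1, u1 = ci1
    l2, u2 = ci2
    lower = (r1 - r2) - math.sqrt((r1 - l1) ** 2 + (u2 - r2) ** 2
                                  - 2 * corr12 * (r1 - l1) * (u2 - r2))
    upper = (r1 - r2) + math.sqrt((u1 - r1) ** 2 + (r2 - l2) ** 2
                                  - 2 * corr12 * (u1 - r1) * (r2 - l2))
    return lower, upper
```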
(j = 1, 2). Note that Vj is the numerator of the explanatory power of Xj. The
hypothesis of interest is

H0: τ1² = τ2².    (9.33)

The method can also be used to determine whether, for example, the
explanatory power of X1 and X2, taken together, is larger or smaller than the explanatory power
of X3. However, the method is not appropriate for comparing the explanatory power
of X1 and X2 to the explanatory power of X1 alone. The reason is that it is known,
without any data, that the explanatory power of X1 and X2 is at least as large as the
explanatory power of X1.
The R function
compares the explanatory power for each pair of independent variables. If, for
instance, .p = 3, the importance of .X1 is compared to the importance of .X2 , the
importance of .X1 is compared to the importance of .X3 , and the importance of .X2 is
compared to the importance of .X3 .
The R function
is the same as regIVcom except that the dependent variable is assumed to be binary
and the logistic regression model is assumed.
Example The reading data used in the example in Sect. 8.4.1 dealt with an
independent variable RAN1T1, a measure of speeded naming for digits, and a
dependent variable. The dependent variable was a measure of the ability to identify
words. Testing the hypothesis of a zero slope, the p-value is less than 0.001 when
leverage points are removed. However, adding a second independent variable to the
linear model, a measure of the ability to identify letters, now the p-value for the first
independent variable is 0.304, suggesting that the second independent variable is
more important. To add perspective, the strength of the association for these two
independent variables is compared with the R function regIVcom. With leverage
points removed, the strengths of these two independent variables, R, the square root
of explanatory power, were estimated to be 0.013 and 0.719, respectively. The p-value is
0.042. This result lends strength to the conclusion that the first independent variable
adds very little to the model when the second independent variable is included.
Example The reading data used in the last example are used again; only now the
goal is to include three independent variables:
• Measure of speeded naming for digits
• Accuracy of identifying lowercase letters
• Speed of identifying lowercase letters
Assuming the data are stored in the R object doi, the command

regIVcom(doi[,c(4,6,7)],doi[,8],IV1=1,IV2=c(2,3),xout=TRUE)

compares the strength of the first independent variable to the strength of the other
two independent variables. The p-value is 0.0585. Using instead the command

regIVcom(doi[,c(4,6,7)],doi[,8],IV1=1,IV2=c(2,3),xout=TRUE,outfun=outblp)

which eliminates only bad leverage points, now the p-value is 0.022. The estimated
strengths are 0.102 and 0.578, respectively. Here is the output from the command

regIVcommcp(doi[,c(4,6,7)],doi[,8],xout=TRUE,outfun=outblp)
The results indicate that taken in pairs, the accuracy of identifying lowercase
letters is the most important independent variable.
Example This example uses the same data as the last example; only now the goal
is to compare the conventional coefficient of determination to a robust explanatory
power. The coefficient of determination is estimated to be R² = 0.186 via the R
function ols. Using the R function corblp, R²blp = 0.26, illustrating once again
that the choice of method can make a practical difference.
9.8 Exercises
1. When studying the association between two random variables, what is a good
first step?
2. Suppose the test statistic, given by (9.2), rejects. What conclusion is reason-
able?
3. Assume a linear model is reasonable. Imagine that with a large sample size,
Pearson’s correlation is close to zero and the p-value when using the HC4
method is close to one. Is it reasonable to stop and conclude that there is little
or no association?
4. Note that one could check for bad leverage points, remove any that are found,
and then use a percentage bend correlation based on the remaining data. What
is a possible concern with this approach?
5. The population version of Pearson’s correlation, ρ, is not robust. What does this
mean?
6. Pearson’s correlation makes no distinction about which variable is the depen-
dent variable. Is the same true when using the BLP correlation coefficient?
7. Imagine that the skipped correlation between Y and X1 is 0.4 and the skipped
correlation between Y and X2 is 0.1. Further assume that there is strong
evidence that the first correlation is larger than the second correlation. If both
X1 and X2 are included in a linear model, why would it be inappropriate to
conclude that X2 is less important than X1 ?
8. The file cancer_rate_dat.txt contains data on breast cancer rates and levels
of solar radiation in various cities in the United States. Compute Pearson’s
correlation and the skipped correlation, and verify that the results are identical.
Why is this not surprising?
9. The C-peptide data described in Sect. 9.5 are stored in the file dia-
betes_sockett_dat.txt. Compute a confidence interval for the BLP measure
of association, the skipped correlation, and Kendall’s tau. How do the estimates
compare? Why might it be argued that these methods are unsatisfactory?
10. For the Well Elderly data in the file A3B3C_dat.txt, suppose the variables
STRESS and CESD are used to predict the typical value of life satisfaction
(LSIZ). Can a decision be made about which of these two independent variables
is most important when using regIVcom and testing at the 0.05 level?
11. For the reading data used in Sect. 8.4.1, the data in columns 4 and 5 deal with
a measure of speeded naming for letters and a measure of speeded naming for
digits. Use the R function cor7 to examine the strength of the association.
Next, use the R function corblp.ci instead in conjunction with the deepest
regression estimator using the R function mdepreg.orig. Comment on the
results.
Chapter 10
Comparing Groups When There Is a
Covariate
This chapter deals with the goal of comparing groups when there is a covariate.
Consider, for example, the Well Elderly data described in Sect. 3.1.3. Imagine the
goal is to compare males to females based on a measure of depressive symptoms
(CESD). Using the data in the file A3B3C_dat.txt, no significant difference is found
at the 0.05 level when comparing 20% trimmed means via Yuen’s method, and the
p-value is 0.18. Similar results are obtained using an M-estimator (p-value = 0.257)
or Cliff’s method described in Sect. 3.2 (p-value = 0.09). Of interest here is whether
males and females differ when a covariate, namely, a measure of life satisfaction
(LSIZ), is taken into account. One way of addressing this issue is to fit a regression
line to both groups, where life satisfaction is the independent variable, and then
compare the groups by comparing the slopes and intercepts with the method in
Sect. 8.4.2 via the R function reg2ci in Sect. 8.4.3. The p-value when comparing
the slopes is 0.868. The p-value when comparing the intercepts is 0.604. An issue is
whether males and females differ in some manner that is being missed when simply
comparing the slopes and intercepts. Results presented in Sect. 10.2.5 indicate that
the answer is yes.
This chapter begins with linear models. The initial goal is to review a classic
method when there is a single covariate and to point out some of its limitations.
This is followed by a description of some basic inferential methods that avoid
the limitations of the classic method summarized in Sect. 10.1. Next, methods for
estimating various measures of effect size are described and illustrated. This is
followed by methods that deal with more than one independent variable. Finally,
methods are summarized that deal with nonlinearity via smoothers.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 265
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9_10
To add perspective, it helps to review the classic method for dealing with a covariate.
The method is generally known as an analysis of covariance (ANCOVA). The
method assumes that for the j th group,
Yj = β0j + β1 Xj + ε, (10.1)
where the error term ε is assumed to be normal and homoscedastic, and the groups are assumed to have a common slope β1.
This section deals with linear models, but the slopes are not assumed to be identical,
and no homoscedasticity assumptions are made. Non-normality is addressed by
focusing on a robust regression estimator. First attention is focused on comparing
conditional measures of location followed by measures of effect size.
Imagine the goal is to compare the typical Y value for group 1 to the typical Y value
for group 2 when the value of the covariate is X = x, where x is some specified
value of the covariate that is of interest. Assuming a linear model suffices, for the
jth group (j = 1, 2), the typical value of Y given that X = x is estimated with
Ŷj(x) = b0j + b1j x,
where the slopes and intercepts are estimated based on one of the robust regression
estimators in Chap. 7. The immediate goal is to test
H0 : Y1(x) = Y2(x) (10.4)
and to compute a confidence interval for .Y1 (x) − Y2 (x). In practice, several choices
for x can be needed to get a good sense of how and to what extent the groups differ.
This raises the issue of controlling the familywise error rate, which is addressed as
well.
The method uses a standardized test statistic that is based on a bootstrap estimate
of the standard errors. For the jth group, generate a bootstrap sample as done in
Sect. 8.3.2. Compute the estimates of the slopes and intercepts yielding b1j∗ and b0j∗,
respectively. Let Ŷj∗(x) = b0j∗ + b1j∗ x. Repeat this process B times yielding Ŷjb∗(x)
(b = 1, . . . , B). A bootstrap estimate of the squared standard error of Ŷj(x) is
τ̂j² = (1/(B − 1)) Σ (Ŷjb∗(x) − Ȳj∗(x))², (10.5)
where Ȳj∗(x) = Σ Ŷjb∗(x)/B. A test statistic for testing (10.4) is
W = (Ŷ1(x) − Ŷ2(x)) / √(τ̂1² + τ̂2²), (10.6)
which is assumed to have a standard normal distribution when the null hypothesis is
true. That is, reject if |W| ≥ z, where z is the 1 − α/2 quantile of a standard normal
distribution. A 1 − α confidence interval for Y1(x) − Y2(x) is
Ŷ1(x) − Ŷ2(x) ± z √(τ̂1² + τ̂2²). (10.7)
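As a minimal sketch of the computations in (10.5)–(10.7) — not the book's R implementation — the following Python code bootstraps the standard error of Ŷj(x) and compares two groups at a single covariate value. The Theil-Sen estimator (scipy.stats.theilslopes) stands in for whichever robust regression estimator from Chap. 7 is used, and the names boot_se_yhat and compare_at_x are hypothetical:

```python
import numpy as np
from scipy.stats import theilslopes, norm

def boot_se_yhat(x, y, x0, B=200, seed=0):
    """Bootstrap estimate of the standard error of Yhat(x0), per (10.5):
    resample (x, y) pairs, refit the (Theil-Sen) line, evaluate it at x0."""
    rng = np.random.default_rng(seed)
    n = len(x)
    yhats = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)                # bootstrap sample of pairs
        slope, intercept, *_ = theilslopes(y[idx], x[idx])
        yhats[b] = intercept + slope * x0
    return yhats.std(ddof=1)                       # tau-hat for this group

def compare_at_x(x1, y1, x2, y2, x0, alpha=0.05, B=200):
    """Test H0: Y1(x0) = Y2(x0) via the statistic (10.6) and the CI (10.7)."""
    yhat = []
    for x, y in ((x1, y1), (x2, y2)):
        slope, intercept, *_ = theilslopes(y, x)
        yhat.append(intercept + slope * x0)
    se = np.hypot(boot_se_yhat(x1, y1, x0, B), boot_se_yhat(x2, y2, x0, B))
    diff = yhat[0] - yhat[1]
    w = diff / se                                  # test statistic W
    z = norm.ppf(1 - alpha / 2)
    pval = 2 * (1 - norm.cdf(abs(w)))
    return diff, w, pval, (diff - z * se, diff + z * se)
```

In practice one would call compare_at_x for each covariate value of interest and then adjust for multiple testing as described next.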
Now consider the situation where the goal is to make inferences about .Y1 (x) −
Y2 (x) for multiple x values. When the number of x values is relatively small,
controlling the familywise error (FWE) rate is accomplished via the Studentized
maximum modulus distribution briefly mentioned in Sect. 5.3.1.
When the number of x values, K, is relatively large, say 10 or more, an improve-
ment on the Studentized maximum modulus distribution, as well as Hochberg’s
method, is used. Here is an outline of the method that is used.
Note that testing H0 : Y1(xk) = Y2(xk) for each xk (k = 1, . . . , K) yields K p-values. Consider the
situation where all K hypotheses are true. Based on a random sample from each
group, let .pmin denote the smallest of the K p-values. The strategy is to estimate the
distribution of pmin. That is, the value of pmin varies over many studies, and the goal
is to determine pc, the α quantile of this distribution. This was done via simulations
assuming normality and homoscedasticity, after which the impact of non-normality
and heteroscedasticity was investigated (Wilcox, 2017c). The point is that if all K
hypotheses are true, the probability of one or more Type I errors will not exceed .α if
each test is rejected only if its p-value is less than or equal to .pc . When K is large,
this approach can have more power compared to controlling the FWE rate with the
Studentized maximum modulus distribution.
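The min-p calibration can be illustrated with a hedged Python sketch that estimates pc, the α quantile of the distribution of the smallest of K p-values, by simulating equicorrelated normal test statistics. The actual calibration (Wilcox, 2017c) proceeds via simulation under normality in the same spirit; crit_pmin is a hypothetical name, and the equicorrelation structure is an assumption made here for illustration:

```python
import numpy as np
from scipy.stats import norm

def crit_pmin(K, alpha=0.05, nrep=5000, rho=0.0, seed=1):
    """Estimate p_c, the alpha quantile of the distribution of the smallest
    of K p-values when all K null hypotheses are true. The K test statistics
    are simulated as equicorrelated standard normals (correlation rho)."""
    rng = np.random.default_rng(seed)
    # Z_k = sqrt(rho) * common + sqrt(1 - rho) * unique has unit variance
    # and pairwise correlation rho
    common = rng.standard_normal((nrep, 1))
    unique = rng.standard_normal((nrep, K))
    z = np.sqrt(rho) * common + np.sqrt(1 - rho) * unique
    pvals = 2 * (1 - norm.cdf(np.abs(z)))      # two-sided p-values
    pmin = pvals.min(axis=1)                   # smallest p-value per study
    return np.quantile(pmin, alpha)            # reject when p <= p_c
```

With rho = 0 the result approximates 1 − (1 − α)^{1/K}, which is close to α/K; positive correlation among the tests yields a larger, less conservative pc.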
It is noted that the methods just described are readily extended to situations
where there are two or more covariates. This assumes that the points chosen for the
covariates are well within the cloud of points that are observed. For example, if there
are two covariates and the goal is to compare two groups when the first covariate has
the value 2 and the second covariate has the value 20, this is reasonable provided
that for both groups, the point (2, 20) is nested within the cloud of covariate points
that are observed. If (2, 20) is an outlier, comparing the groups for this particular
point cannot be recommended.
Multiple groups can be compared. In particular, when there are J groups, linear
contrasts can be used. That is, the goal is to make inferences about
Ψ(x) = Σ cj Yj(x), (10.8)
where c1, . . . , cJ are linear contrast coefficients as discussed in Sect. 5.4. This
includes as a special case the situation where all pairwise comparisons are to be made.
This section describes an extension of the KMS measure of effect size, introduced in
Sect. 3.6.1, to situations where there is a covariate. Included is a method for making
inferences about this measure of effect size. The method is based in part on an
estimate of the conditional distribution of Y given that .X = x. This is done via
the Koenker-Bassett quantile regression estimator described in Sect. 7.4.2; see in
particular (7.31). For more details about the methods described here, see Wilcox
(2022e). Note that estimating effect size for a range of x values can help provide
perspective on the role of the covariate, as will be illustrated in Sect. 10.2.5.
First consider a single group, and let Vq(x) = bq1 x + bq0 be the estimate of
the qth quantile of Y , given that X = x. A robust measure of dispersion for the
(conditional) distribution of Y , given that the covariate X = x, is
U(x) = (V0.75(x) − V0.25(x)) / (z0.75 − z0.25), (10.9)
where z0.75 and z0.25 are the 0.75 and 0.25 quantiles, respectively, of a standard
normal distribution. The denominator in (10.9) is included so that under normality,
U(x) estimates the standard deviation of Y given that X = x.
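The rescaled interquantile range in (10.9) is simple to compute once the conditional quantile estimates are in hand. A sketch, where v75 and v25 are assumed to be the 0.75 and 0.25 conditional quantile estimates at the X = x of interest:

```python
from scipy.stats import norm

def U(v75, v25):
    """Rescaled conditional interquartile range, per (10.9). The denominator
    makes U estimate the standard deviation of Y given X = x under normality."""
    return (v75 - v25) / (norm.ppf(0.75) - norm.ppf(0.25))
```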
Next, consider two independent groups. For the jth group, let Mj(x) = b0.5,1 x +
b0.5,0 denote the estimate of the median of Yj given that Xj = x. Let Uj(x) denote
the value of U(x) for the jth group. An analog of the KMS measure of effect size is
η̂(x) = (M1(x) − M2(x)) / φ̂, (10.10)
where φ̂² is a weighted combination of U1²(x) and U2²(x) with weights based on u
and 1 − u, N = n1 + n2, and u = n1/N.
Let .η(x) denote the population measure of effect size being estimated by .η̂(x).
Rather than compare the groups based on a measure of location, another approach
is to test
H0 : η(x) = 0, (10.11)
and there is the goal of computing a confidence interval for .η(x). A percentile
bootstrap method has been found to be unsatisfactory. A much better approach
is to use a bootstrap estimate of the standard error of .η̂(x). This is done in a
manner similar to the approach for estimating a standard error as described in
Sect. 10.2.1. First, generate a bootstrap sample from each group, and compute η̂(x)
based on these bootstrap samples yielding η̂∗(x). Repeat this process B times
yielding η̂1∗(x), . . . , η̂B∗(x). An estimate of the squared standard error of η̂(x) is
S²(x) = (1/(B − 1)) Σ (η̂b∗(x) − η̄∗(x))², (10.12)
where η̄∗(x) = Σ η̂b∗(x)/B. Let
W(x) = η̂(x)/S(x), (10.13)
and reject H0 at the α level if |W(x)| ≥ z1−α/2, where z1−α/2 is the 1 − α/2 quantile
of a standard normal distribution. That is, W(x) is assumed to have a standard
normal distribution when the null hypothesis, given by (10.11), is true. A 1 − α
confidence interval for η(x) is simply
η̂(x) ± z1−α/2 S(x). (10.14)
Extensions of the method, in terms of
accurate probability coverage, are limited to two covariates. When the sample sizes
are relatively small, the actual Type I error probability can drop below 0.025 when
testing at the 0.05 level.
There is the issue of choosing values for the covariate. In practice, there might be
substantive reasons to choose particular values. Another approach is to use a range
of values that avoids extrapolation. That is, if .x = 20, say, is used, but for the first
group this value is well outside the range of the observed values for the covariate,
comparing the groups when x = 20 is dubious at best. Let xjq be an estimate of
the qth quantile, q < 0.5, of the covariate associated with group j. One approach
is to take the first point to be xL = max(x1,q, x2,q). The next value is taken to be
xU = min(x1,1−q, x2,1−q), and a third value is taken to be xM = (xL + xU)/2.
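The three-point strategy just described amounts to a few quantile computations. A minimal sketch (pick_covariate_values is a hypothetical name):

```python
import numpy as np

def pick_covariate_values(x1, x2, q=0.1):
    """Pick xL, xM, xU so that all three values avoid extrapolation for
    either group: xL is the max of the two group q-quantiles, xU the min
    of the two (1 - q)-quantiles, and xM is halfway between."""
    xL = max(np.quantile(x1, q), np.quantile(x2, q))
    xU = min(np.quantile(x1, 1 - q), np.quantile(x2, 1 - q))
    return xL, (xL + xU) / 2, xU
```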
Section 3.2 described methods for making inferences about P , the probability
that a randomly sampled value from the first distribution is less than a randomly
sampled value from the second distribution. This section describes an analog of
those methods given that the covariates X1 = X2 = x. That is, both covariate
values are equal to x. The goal is to make inferences about
P(x) = P(Y1 < Y2 | X1 = X2 = x).
In order to do this, information about the conditional distribution of Yj, given that
Xj = x, is needed. Wilcox (2023a) used the Koenker-Bassett quantile regression
estimator to estimate these conditional distributions, which in turn are used to make
inferences about P(x).
For notational convenience, denote the percentile corresponding to q by u =
100q, and let
Duj(x) = b0jq + b1jq x,
where the intercept and slope are estimated via the Koenker-Bassett quantile
regression estimator. Note that computing Duj(x) for q = 0.01(0.01)0.99 yields
an estimate of the conditional distribution of Yj given that Xj = x. In words,
estimating the conditional quantiles extending from the 0.01 to the 0.99 quantiles
yields information about the conditional distributions. Let (the indicator function)
.Iuv (x) = 1 if .Du1 (x) < Dv2 (x); otherwise, .Iuv (x) = 0. Then an estimate of .P (x)
is simply
P̂(x) = (1/99²) Σu Σv Iuv(x), (10.18)
where u and v each run from 1 to 99.
Roughly, based on 99 values for each conditional distribution, .P̂ (x) is the proportion
of times a value from the first group is less than a value from the second group.
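The double sum in (10.18) is straightforward given the two grids of estimated conditional quantiles. A sketch (phat is a hypothetical name; d1 and d2 are assumed to hold the 99 estimated conditional quantiles for groups 1 and 2):

```python
import numpy as np

def phat(d1, d2):
    """Estimate P(x) per (10.18): the proportion of (u, v) pairs, over the
    99 estimated conditional quantiles of each group, with d1[u] < d2[v]."""
    d1, d2 = np.asarray(d1), np.asarray(d2)
    return (d1[:, None] < d2[None, :]).mean()   # mean of the indicators I_uv(x)
```

Note that with two identical quantile grids the estimate is slightly below 0.5, since ties are counted as zeros.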
As was done in Sect. 10.2.2, a bootstrap estimate of the standard error of .P̂ (x) is
used to make inferences about .P (x). Here, this bootstrap estimate of the standard
error is denoted by .V (x) to distinguish it from .S(x), the estimate used in conjunction
with the KMS measure of effect size in Sect. 10.2.2. The hypothesis
H0 : P(x) = 0.5 (10.19)
is tested with
W(x) = (P̂(x) − 0.5)/V(x), (10.20)
which is assumed to have a standard normal distribution when the null hypothesis
is true. A 1 − α confidence interval is given by
P̂(x) ± z1−α/2 V(x).
This section extends the quantile shift measure of effect, introduced in Sect. 3.6.2, to
situations where there is a covariate. There are two approaches. The first is aimed at
situations where one of the groups is a control group and the other is an experimental
group (Wilcox, 2022f).
Again, let Mj(x) be the estimate of the median of Yj given that Xj = x, where j = 1 corresponds
to the control group and .j = 2 corresponds to the experimental group, and where
the slope and intercept are estimated via the Koenker-Bassett quantile regression
estimator. The idea is to quantify how unusual the estimate of the median for
the experimental group happens to be relative to the distribution of the control
group. For example, if the median of the experimental group corresponds to the
.Q = 0.8 quantile of control group, this is one way of characterizing the extent
the experimental group differs from the control group. In general, if the median
of the experimental group corresponds to the Qth quantile of the control group,
the measure of effect size is taken to be .Qc = Q, where the subscript c is used
to indicate that the control group is being used as the reference group. No effect
is .Qc = 0.5: the median of the experimental group corresponds to the median of
the control group. For the situation at hand, it is better to write .Qc as .Qc (x), to
emphasize that the goal is to estimate .Qc given that the covariate .X = x.
Estimating .Qc (x) requires information about the distribution of the control
group. Again, the Koenker-Bassett quantile regression estimator is used to provide
this information. It is convenient to alter the notation a bit and let
b01q + b11q x
denote the estimate of the qth quantile of the control group given x. The estimate of
Qc(x) is the value q such that
b01q + b11q x = M2(x).
There is no simple equation for determining q, but there are numerical methods that
yield a solution. (The method used here is called the Nelder and Mead algorithm.)
Wilcox (2022f) found that a percentile bootstrap method is relatively effective at
making inferences about .Qc (x).
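Solving for q can be sketched with any bounded scalar optimizer; the text notes that the Nelder and Mead algorithm is used, while this illustration substitutes a bounded minimizer from SciPy. Here control_quantile(q) is assumed to return the control group's estimated qth conditional quantile at the covariate value of interest, and qc_shift is a hypothetical name:

```python
from scipy.optimize import minimize_scalar

def qc_shift(control_quantile, m2x):
    """Find Qc(x): the q whose control-group conditional quantile equals
    M2(x), the experimental group's conditional median. Minimizes the
    squared discrepancy over q in (0.01, 0.99)."""
    obj = lambda q: (control_quantile(q) - m2x) ** 2
    return minimize_scalar(obj, bounds=(0.01, 0.99), method="bounded").x
```

For example, if the control-group conditional distribution were standard normal and M2(x) were its 0.8 quantile, the routine would return a value near 0.8.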
Note that the version of the quantile shift measure of effect size in Sect. 3.6.2 does
not make a distinction between a control group and an experimental group. Rather, it
characterizes the location of the median of the typical difference relative to the null
distribution where the median of the typical difference is zero. An analog of this
version of the quantile shift measure of effect size can be estimated by mimicking
the method in Sect. 10.2.3. That is, estimate both conditional distributions via the
Koenker-Bassett quantile regression estimator, and test
H0 : Q(x) = 0.5. (10.24)
The first two R functions in this section compare measures of location given a value
for a single covariate assuming a linear model is correct. The R function
uses the method in Sect. 10.2.1 where the Studentized maximum modulus distri-
bution is used to control the FWE rate and is a good choice when the number of
covariate values is small. By default, five covariate values are used that are chosen so
that their values are within the range of the observed values. Values for the covariate
can be specified via the argument pts. The R function
tests hypotheses about linear contrasts when dealing with J independent groups.
The arguments x and y are assumed to be matrices with J columns, or they can
have list mode with length J .
The R function
can be used. The function picks values for the covariates that are reasonably well
nested within the cloud of covariate points. When there are two covariates, the
function plots the covariate points that were used. The points marked with a +
indicate the points where a significant result was obtained.
The R function
ancova.KMSci(x1,y1,x2,y2,pts=NULL,alpha=.05,nboot=100,SEED=TRUE,
QM=FALSE,ql=.2,xout=FALSE,outfun=outpro,xlab='Pts',
ylab='Y',method='hoch',plotit=TRUE)
computes confidence intervals for the KMS measure of effect size when there is
a single covariate. By default, it plots the estimate for five points chosen by the
function, coupled with an indication of the confidence intervals. The covariate
values used can be specified by the argument pts. The R function
plots an estimate of the effect size for a range of values for the covariate.
The R function
plots the estimate of effect size for a range of values for the covariate. For two
covariates, use
By default, this function picks 30 covariate points where the groups are compared.
If plotit=TRUE, the function plots the covariate points and indicates which
covariate points were used with *, and any significant result is indicated by o. The
R function
anclin.QS.CIpb(x1,y1,x2,y2,alpha=0.05,pts=NULL,xout=FALSE,
ALL=FALSE,npts=10,outfun=outpro,nboot=200,MC=TRUE,REQMIN=0.01,
SEED=TRUE,...)
deals with the quantile shift (QS) measure of effect size when comparing a control
group to an experimental group. To get a plot, use the R function
pts=NULL,q=0.1,xout=FALSE,ALL=TRUE,npts=10,line=TRUE,
xlab='X',ylab='QS.Effect',outfun=outpro,REQMIN=.001,...).
ancNCE.QS.plot(x1,y1,x2,y2,pts=NULL,q=0.1,xout=FALSE,
ALL=TRUE,npts=10,line=TRUE,xlab='X',ylab='QS.Effect',
outfun=outpro,...)
Comparing the groups via six measures of effect size, using the R function
ES.summary.CI, the p-values range from 0.09 to 0.20. Comparing the 0.1, 0.25,
0.5, 0.75, and 0.9 quantiles using the R function qcomhd, the adjusted p-value for
the 0.25 quantile is 0.0055. The other p-values are greater than 0.24.
Now consider the impact of including as the covariate a measure of life
satisfaction (LSIZ). The intercepts and slopes, based on the Theil-Sen estimator,
are very similar and do not differ significantly based on the R function reg2ciMC.
The corresponding p-values are 0.92 and 0.44 with leverage points removed. Here
is the output using the R function ancJN, with leverage points removed, which
compares a robust conditional measure of location:
$n
[1] 103 223
$intercept.slope.group1
Intercept Slope
27 -1
$intercept.slope.group2
Intercept Slope
27.65 -0.90
$output
X Est1 Est2 DIF TEST se ci.low ci.hi
[1,] 6 21 22.25 -1.25 -0.3550453 3.520677 -10.298140 7.7981403
[2,] 11 16 17.75 -1.75 -0.6989104 2.503897 -8.185016 4.6850163
[3,] 16 11 13.25 -2.25 -1.4306152 1.572750 -6.291967 1.7919673
[4,] 21 6 8.75 -2.75 -2.7400338 1.003637 -5.329348 -0.1706522
[5,] 26 1 4.25 -3.25 -2.3761662 1.367749 -6.765116 0.2651161
p.value adj.p.values
[1,] 0.722555628 0.72255563
[2,] 0.484608013 0.72255563
[3,] 0.152540518 0.45762155
[4,] 0.006143287 0.03071644
[5,] 0.017493583 0.06997433
So there is some indication that for relatively high LSIZ scores, the typical CESD
scores for women are higher than the typical CESD scores for men. For LSIZ values
equal to 21 and 26, ancova.KMSci returns
pts Est. Test.Stat ci.low ci.up p-value
[1,] 21 -0.1670025 -2.318267 -0.3081937 -0.02581128 0.02043483
[2,] 26 -0.2743682 -1.825826 -0.5688934 0.02015702 0.06787649
p.adjusted
[1,] 0.04086967
[2,] 0.06787649
The estimates indicate an effect size that is moderately large for LSIZ=26, but the
confidence intervals do not rule out the possibility that the effect is quite small. Here
is the output from the R function wmw.ancbse using the same LSIZ values used
by ancJN:
pts Est S.E. test.stat ci.low ci.up p.value
[1,] 6 0.5179063 0.09514471 0.1882011 0.3314261 0.7043865 0.850719040
[2,] 11 0.5382104 0.07368076 0.5185938 0.3937987 0.6826220 0.604044038
[3,] 16 0.5666769 0.05132193 1.2991886 0.4660877 0.6672660 0.193879208
[4,] 21 0.6126926 0.03921810 2.8734840 0.5358265 0.6895586 0.004059716
[5,] 26 0.6953372 0.07734862 2.5254131 0.5437367 0.8469377 0.011556237
adj.p.value
[1,] 0.85071904
[2,] 0.85071904
[3,] 0.58163762
[4,] 0.02029858
[5,] 0.04622495
Again, there is an indication that women tend to have higher CESD scores when
LSIZ is relatively high. The estimates for LSIZ=21 and 26 are moderately large
and large, respectively, based on a common convention, but again the confidence
intervals do not rule out a relatively small effect.
Overall, the conditional measures of location and effect sizes provide interesting
details that go beyond the simple strategy of comparing the slopes and intercepts.
There is a consistent indication that women have higher CESD scores when taking
into account LSIZ. The estimated effect size ranges between a moderately large and
relatively large value when the LSIZ score is high, depending to some extent on
which measure of effect size is used. The precision of the estimates, based on the
confidence intervals, indicates that no decision should be made about whether very
large, as well as very small, measures of effect size occur when the LSIZ score is
high.
Example This example is based on data taken from Field et al. (2012, p. 485),
which is fictional data dealing with the effect that wearing a cloak of invisibility has
on people’s tendency to engage in mischief. The data are available via the R package WRS2
and are stored in the R object invisibility. Hidden cameras recorded how many
mischievous acts were conducted over 3 weeks. After 3 weeks, 34 participants were
told that the cameras were switched off so that no one would be able to see what they
were up to. The remaining 46 participants were given a cloak of invisibility. These
people were told not to tell anyone else about their cloak and that they could wear it
whenever they liked. The number of mischievous acts was recorded over the next 3
weeks. Here, the cloak group is compared to the no-cloak group based on the second
3 weeks, with the measures taken during the first 3 weeks serving as the covariate.
Here is the output based on the R function ancova.KMSci with leverage points
removed:
pts Est. Test.Stat ci.low ci.up p-value
[1,] 2 0.49797853 2.1189007 0.03735289 0.9586042 0.03409886
[2,] 4 0.34385875 2.3341948 0.05512930 0.6325882 0.01958552
[3,] 5 0.25986058 1.5522529 -0.06825437 0.5879755 0.12060173
[4,] 6 0.17181206 0.7876831 -0.25570184 0.5993260 0.43088212
[5,] 7 0.08037731 0.2848248 -0.47272276 0.6334774 0.77577836
p.adjusted
[1,] 0.13639544
[2,] 0.09792762
[3,] 0.36180518
[4,] 0.77577836
[5,] 0.77577836
As can be seen, the effect size is quite large when the number of mischievous acts
was low during the first 3 weeks. As the number of mischievous acts increases
during the first 3 weeks, the effect size decreases. The p-values are less than 0.05
for the first two values of the covariate, but their adjusted values are greater than
0.05.
Example This next example is again based on the Well Elderly data after intervention;
only now the goal is to compare males and females based on perceived health,
with the cortisol awakening response (CAR) as the covariate.
For negative CAR values (cortisol increases after awakening), all three of the
adjusted p-values are less than 0.014. For the two positive CAR values, the p-values
are greater than 0.22. Here are the results using the KMS measure of effect size via
the R function ancova.KMSci:
pts Est. Test.Stat ci.low ci.up p-value
[1,] -0.32160000 0.5112410 2.803721 0.1538539 0.8686281 5.051653e-03
[2,] -0.14490072 0.4596717 4.111707 0.2405559 0.6787875 3.927439e-05
[3,] -0.01307176 0.4168761 3.288861 0.1684429 0.6653093 1.005939e-03
[4,] 0.08277492 0.3858397 2.409778 0.0720216 0.6996579 1.596224e-02
[5,] 0.26549698 0.3302561 1.416579 -0.1266828 0.7871951 1.566060e-01
p.adjusted
[1,] 0.015154958
[2,] 0.000196372
[3,] 0.004023754
[4,] 0.031924475
[5,] 0.156606020
Note that now the first four p-values are less than 0.016, but also note that this
function picks different points than those used by ancJN. Here, the first positive
CAR value is 0.083, while for ancJN, it is 0.18. Using ancova.KMSci with
the same CAR values used by ancJN, now only the second and third CAR
values have adjusted p-values less than 0.05. That is, the choice of points can
be crucial. Using instead anclin, which is designed to deal with 25 covariate
values by default, the function reports that any p-value less than or equal to 0.015
is rejected if the FWE rate is set to 0.05. Here, the null hypothesis is rejected for
15 covariate values ranging from .−0.506 to 0.028. If the same 25 covariate values
are used in conjunction with the R function ancova.KMSci, only 8 hypotheses
are rejected after using the Hochberg adjusted p-values. That is, despite testing
more hypotheses, anclin can reject more hypotheses than ancova.KMSci when
dealing with a relatively large number of covariate values.
Example This next example illustrates the output from the R function ancJNPVAL
(described in Sect. 10.3.3) when there are two covariates. Here, the two covariates
are measures of life satisfaction (LSIZ) and stress, again using the Well Elderly data
in the file A3B3C. The dependent variable is a measure of depressive symptoms
[Fig. 10.1: Scatterplot of the covariate points (LSIZ by STRESS) used when comparing participants who did not complete high school and those who did. The covariate points where there was a significant difference are denoted by a +; the others are denoted by a *.]
(CESD). The two groups are participants who did not complete high school and
those that did. As previously indicated, ancJNPVAL picks covariate values that are
reasonably well nested within the data cloud. Figure 10.1 shows the plot created
by the function. The points marked with a + indicate the points where a significant
result was obtained. The function also returns estimates of the measure of location,
confidence intervals, and p-values. It also reports the covariate points that were
used, and it lists the points where there was a significant difference. Here, 71
covariate points were chosen. A significant result was obtained for 43 points. Among
the 71 covariate points, estimates of typical CESD measures were always higher for
participants who did not finish high school. Generally, a significant difference is found
for LSIZ greater than or equal to 16, with a few exceptions when stress is greater
than 6. That is, even when participants have the same relatively high LSIZ score,
the first group tends to have higher CESD scores taking stress into account.
The R function
ancovap2.KMS(x1,y1,x2,y2,pts=NULL,BOTH=TRUE,npts=20,
profun=prodepth,xout=FALSE,outfun=outpro)
computes the KMS measure of effect size when there are two covariates. The
argument profun determines the method used to measure the depth of the
covariate points. The default is to use random projections. To use a deterministic
method, set profun=pdepth. If pts = NULL, the function picks points based
on how deeply nested they happen to be among the combined data stored in x1 and
x2. They range between the deepest point (the median) and the least deep point.
If the argument BOTH=FALSE, only points stored in x1 are used. The argument
npts determines how many covariate points are used. The R function
ancovap2.KMSci(x1,y1,x2,y2,pts=NULL,alpha=.05,nboot=100,
SEED=TRUE,npts=20,profun=prodepth,
plotit=TRUE,xlab='X1',ylab='X2',BOTH=TRUE,
xout=FALSE,outfun=outpro,method='hoch')
The R function
ancovap2.KMS.plot(x1,y1,x2,y2,pts=NULL,xlab='X1',ylab='X2',...)
plots estimates of the KMS measure of effect size based on the covariate points
stored in pts, assuming that there are two covariates. If pts=NULL, the function
uses all of the combined data in x1 and x2.
As noted in Chaps. 7 and 8, situations are encountered where a linear model can
be inadequate. Smoothers provide a more flexible way of studying the association
between a dependent variable and p independent variables. In terms of comparing
groups when there is a covariate, smoothers have the potential of revealing details
about how groups compare that are missed when using the more obvious linear
models. But as usual, no single approach dominates. If a linear model does in fact
reflect the true association, the methods in Sect. 10.2 can have more power than a
method based on a smoother.
The focus in this section is on the running-interval smoother. It provides a simple
and effective way to proceed when dealing with robust measures of location. For
example, it is a simple matter to estimate a trimmed mean of Y , or any other measure
of location that might be of interest, given a value for X. Presumably, situations are
encountered where some other type of smoother provides some advantage over the
running-interval smoother. This issue is in need of further study.
Roughly, given a value x of interest, the running-interval smoother determines
which Xi values are close to x and then applies some measure of location to the
corresponding Yi values. The value Xi is considered to be close to x if it satisfies
(7.25) in Sect. 7.3.3.
Now consider two independent groups, and for the j th group, let .Nj (x) denote
the number of .Xij values that are close to x. Then comparing the groups based on
the dependent variable, given that .X = x, can be accomplished simply by applying
one of the methods in Chap. 3 based on the .Yij values corresponding to the .Xij
values that are close to x.
Basic Methods For convenience, let .Vij denote the .Yij values for which .Xij is close
to x. One could simply use Yuen’s method or a bootstrap method based on these .Vij
values with the goal of testing
H0 : m1(x) = m2(x), (10.25)
the hypothesis that the trimmed mean for the first group, given that X = x, is
equal to the trimmed mean for the second group. When comparing the groups for a
collection of x values, one approach to controlling the FWE rate when using Yuen’s
method is to determine an appropriate critical value via the Studentized maximum
modulus distribution, which is called method Y. As was the case in Sect. 10.2.1,
this approach works well when the number of covariate values is relatively small.
Not surprisingly, bootstrap-t or a percentile bootstrap method can be used instead.
These methods are reasonable provided that .Nj (x) is not too small. The default
convention here is that the two groups can be compared when .Nj (x) ≥ 12, for both
.j = 1 and 2. The method picks five points. The smallest x value is taken to be the
smallest .Xij value such that both .N1 (Xij ) ≥ 12 and .N2 (Xij ) ≥ 12. In a similar
manner, the largest covariate value is taken to be the largest .Xij value such that
both .N1 (Xij ) ≥ 12 and .N2 (Xij ) ≥ 12. Three other covariates spaced between the
smallest and largest values are used as well.
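The basic running-interval comparison can be sketched as follows. This is an illustration rather than the book's R code: scipy.stats.ttest_ind with trim=0.2 and equal_var=False performs Yuen's test, the MAD-based closeness rule merely mimics the spirit of (7.25), and close_points and yuen_at_x are hypothetical names:

```python
import numpy as np
from scipy import stats

def close_points(x, x0, f=1.0):
    """Boolean mask of X values 'close' to x0: within f robust standard
    deviations, using the MAD rescaled to estimate sigma under normality."""
    madn = stats.median_abs_deviation(x, scale="normal")
    return np.abs(x - x0) <= f * madn

def yuen_at_x(x1, y1, x2, y2, x0, f=1.0, trim=0.2, nmin=12):
    """Compare 20% trimmed means of Y given X = x0 by applying Yuen's test
    to the Y values whose X values are close to x0; return None when either
    group has fewer than nmin close points."""
    v1 = y1[close_points(x1, x0, f)]
    v2 = y2[close_points(x2, x0, f)]
    if min(len(v1), len(v2)) < nmin:
        return None
    # ttest_ind with trim= and equal_var=False performs Yuen's test
    return stats.ttest_ind(v1, v2, equal_var=False, trim=trim)
```

The nmin=12 default mirrors the convention described above; covariate values where either group has too few close points are simply skipped.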
Method UB There is a variation of the percentile bootstrap method that has the
potential of more power compared to the basic methods just indicated. The method
consists in part of taking bootstrap samples from all of the data, rather than
resampling from the Vij values. A critical p-value is used that is taken to be pc, the α
quantile of the distribution of the minimum p-value among all p-values when testing
K hypotheses. This is essentially the same approach that was used in Sect. 8.1. This
method is designed for a situation where .K = 5. The method yields adjusted p-
values but no confidence intervals.
Method TAP Method TAP is designed for a large number of covariate values. It
is based on Yuen's test but with a critical p-value, pc, determined as described in
Sect. 8.1. That is, a hypothesis is rejected if its p-value is less than or equal to pc.
The default here is to use a value for pc so that the FWE rate is 0.05. For a single
covariate, the smallest and largest values for the covariate are determined as done
in the basic method previously described. The remaining covariate values are taken
to be evenly spaced between the smallest and largest values that are used.
10.3 Methods Based on Smoothers 283
The current version is designed to deal with K = 25 covariate values. The method
yields both adjusted p-values and confidence intervals.
Effect Size As is probably evident, the measures of effect size in Sect. 3.6 are readily
extended to the situation at hand. Given some value for the covariate, x, simply
compute measures of effect size based on the Vij values. Improvements on the
Wilcoxon-Mann-Whitney method, described in Sect. 3.2, can be used as well.
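For instance, the probability that Y1 is less than Y2, given X = x, can be estimated directly from the Vij values. A Python sketch (a hypothetical helper, not one of the R functions in the text):

```python
import numpy as np

def wmw_effect(v1, v2):
    # Estimate P(Y1 < Y2 | X = x) from the Y values whose covariate is
    # close to x in each group (the V_ij values): the proportion of
    # cross-group pairs with v1 < v2, counting ties as 1/2.
    a = np.asarray(v1, dtype=float)[:, None]
    b = np.asarray(v2, dtype=float)[None, :]
    return (a < b).mean() + 0.5 * (a == b).mean()
```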
Binary Dependent Variable Note that when Y is binary, the running-interval
smoother can again be used. The only difference is that now the methods in Sect. 3.4
would be used.
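The running-interval smoother that underlies these methods estimates a trimmed mean of Y among the points whose covariate value is close to x. A minimal Python sketch, assuming the span convention of f times the normalized MAD (the default span f = 0.8 here is an assumption of the sketch):

```python
import numpy as np

def trimmed_mean(v, tr=0.2):
    # 20% trimmed mean: drop the g smallest and g largest values.
    v = np.sort(np.asarray(v, dtype=float))
    g = int(tr * len(v))
    return v[g:len(v) - g].mean()

def run_interval(x, y, x0, f=0.8, tr=0.2):
    # Running-interval smoother: the trimmed mean of the y values whose
    # x is within f * MADN of x0.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    madn = np.median(np.abs(x - np.median(x))) / 0.6745
    return trimmed_mean(y[np.abs(x - x0) <= f * madn], tr)
```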
Imagine that there is interest in the covariate values x1, ..., xK. Rather than
compare two independent groups by testing (10.25) for each of these K values,
it might be desired to test the global hypothesis that simultaneously,

H0 : m1(xk) = m2(xk) (10.26)

is true for every k = 1, ..., K. Such a method has been derived (Wilcox, 2022a,
Section 12.4), but the details are rather involved. It is based in part on quantifying
how deeply a regression line is nested within the cloud of data. The R function
ancGLOB, described in Sect. 10.3.3, applies the method. A possible criticism stems
from Tukey's argument mentioned in Sect. 1.2: surely at some decimal place,
m1(xk) differs from m2(xk). An issue is whether it is reasonable to make a decision
about whether m1(xk) is less than or greater than m2(xk). Perhaps an analog of the
step-down method in Sect. 5.3.4 has some practical value for the situation at hand,
but this has not been investigated.
ancpb(x1, y1, x2, y2, est = hd, pts = NA, fr1 = 1, fr2 = 1, nboot = NA,
nmin = 12, alpha = 0.05, xout = FALSE, outfun = outpro, plotit = TRUE,
LP = TRUE, xlab = "X", ylab = "Y", pch1 = "*", pch2 = "+", ...)
is like the function ancova; only a percentile bootstrap method is used. The R
function
pts=NA, plotit=TRUE)
can be used. If pts=NA, the function picks K values for the covariate; the argument
npts is used to indicate the value of K. The current version uses Hochberg’s
method to control the FWE rate.
To get several measures of effect size simultaneously, including confidence
intervals, the R function
can be used. The function picks five covariate values. Alternative values can be
specified via the argument pts. Measures of effect size are computed for each value
of the covariate. The R function
ancsm.es(x1, y1, x2, y2, ES = 'KMS', npt = 8, est = tmean, method = 'BH',
fr1 = 1, fr2 = 1, nboot = NA, nmin = 12, alpha = .05, xout = FALSE,
SEED = TRUE, outfun = outpro, plotit = TRUE, LP = FALSE,
xlab = 'X', ylab = 'Effect Size', ...)
and
can be used to plot the regression lines as well. The first uses the running-interval
smoother, and the second uses LOWESS. The R function
plots the estimated difference, m1(x) − m2(x), the difference between the trimmed
means, for each of the covariate values. Confidence intervals, having simultaneous
probability coverage approximately equal to the value of the argument alpha, are
plotted as well. To get a plot of the quantile regression lines, use the R function
For example, setting the argument q=0.75, the 0.75 conditional quantiles are
estimated and plotted for both groups. The estimates are based on the Harrell-Davis
estimator.
The R function
applies method UB. By default, a 20% trimmed mean is used. To compare, for
example, the 0.75 quantiles based on the Harrell-Davis estimator, set the argument
est=hd, and include q=0.75. So the command would look like this:
ancovaUB(x,y,est=hd,q=0.75)
If qpts=TRUE, covariate values are chosen based on the quantiles indicated by the
argument qvals in conjunction with the data in the argument x1. The function
reports p-values and adjusted p-values based on Hochberg’s method. The method
also reports a critical p-value, pc, meaning that if a hypothesis is rejected only when
its p-value is less than or equal to pc, the FWE rate will be approximately 0.05
when all of the hypotheses are true. This might result in more power compared
to Hochberg's method.
The R function
applies method TAP. The argument npts indicates how many covariate values will
be used, which defaults to 25. With plotit=TRUE, the function plots a smooth for
both regression lines. Setting plot.dif=TRUE, the function plots the difference
between the regression lines, m̂1(x) − m̂2(x), based on the covariate values that
were used. A confidence band is also plotted based on the critical p-value, p̂c.
That is, the simultaneous probability coverage among the K confidence intervals is
approximately 1 − α, where α is specified via the argument alpha, which defaults
to 0.05.
The R function
compares multiple groups when there is a single covariate. The arguments x and y
can be matrices with J columns where J is the number of groups, or they can have
list mode with length J . The argument op determines how the groups are compared.
There are four options:
• op = 1: omnibus test for trimmed means, based on the R function t1way, with the
amount of trimming controlled via the argument tr
• op = 2: omnibus test for medians based on the R function med1way. (Not
recommended when there are tied values; use op = 4)
• op = 3: multiple comparisons using trimmed means and a percentile bootstrap
via the R function linconpb
• op = 4: multiple comparisons using medians and percentile bootstrap via the R
function medpb
288 10 Comparing Groups When There Is a Covariate
The results for the first covariate point are returned in the R object $points[[1]],
the results for the second covariate point are in $points[[2]], and so on. By default,
$points[[k]] contains the results for all pairwise comparisons among the J groups
based on the kth covariate point, k = 1, ..., p. The argument con can be used to
specify the linear contrasts of interest. For example, in a 2-by-2 design, the
hypothesis of no interaction can be tested by setting con=con2way(2,2)$conAB.
When using the R function ancova and setting the argument method='WMW',
it reports a measure of effect size that reflects the probability that Y1 is less than Y2,
given that X1 = X2 = x. To actually test the hypothesis that this probability is 0.5,
or to compute a confidence interval for this probability, use the R function
This is done via Cliff’s method that was mentioned in Sect. 3.2. The argument
plotit=TRUE means a smooth of the regression line is created based on the
measure of location indicated by the argument est.
Example This example is based on the same data used in Sect. 10.2.5 where the
goal is to compare males to females based on a measure of depressive symptoms
(CESD) taking into account life satisfaction (LSIZ). Figure 10.2 shows the plot
of the smooths returned by ancova when the argument LP=TRUE. The dashed
Fig. 10.2 Shown is the plot created by the R function ancova when comparing males to females
based on depressive symptoms (CESD), using life satisfaction (LSIZ) as a covariate
Fig. 10.3 Shown is the plot created by the R function ancova using the same data used in
Fig. 10.2; only leverage points have been removed
line corresponds to females. The plot suggests that as LSIZ increases, there is an
increasing difference between males and females. The function picked the covariate
values 7, 14, 19, 22, and 26. For the first two points, the unadjusted p-values are
0.93 and 0.77, respectively. The remaining p-values are 0.043, 0.006, and 0.0017.
Figure 10.3, which is based on the same data used in Fig. 10.2, shows the
plot obtained when leverage points are removed and LP=FALSE. For the females,
values less than 10 were flagged as leverage points. This is why the dashed line
in Fig. 10.3 stops where LSIZ is equal to 10. The same general pattern is obtained
as shown in Fig. 10.2, but removing leverage points alters the overall sense of how
the two groups compare. If LP=TRUE had been used, this would result in smoother
regression lines, but the dotted line would lie slightly above the solid line for the two
lowest covariate values used here, in contrast to the estimate based on the running-
interval smoother. The explanation is that LP=TRUE alters somewhat the estimates
based on the running-interval smoother.
Example This next example illustrates the R functions qhdsm2 and ancpb.
Again, the Well Elderly data are used, but now the covariate is taken to be the
CAR, and the outcome variable is a measure of depressive symptoms. Figure 10.4
shows the plot returned by qhdsm2 when the argument q = 0.75. That is, the goal
is to estimate the 0.75 quantile regression line. CESD scores greater than 15 are
generally taken to indicate mild depression or worse. The horizontal dotted line
indicates a CESD score equal to 15. For negative CAR values (cortisol increases
after awakening), the plot indicates that 25% of the females had a CESD score
greater than 15.
Fig. 10.4 Shown is the plot created by the R function qhdsm2 using the same data used in
Fig. 10.2, but with the covariate LSIZ replaced by the cortisol awakening response. Leverage points
have been removed. The dotted line indicates a CESD score of 15. Scores greater than 15 are taken
to be an indication of mild depression or worse
Example This example again compares the two education groups used in the last
example; only now the goal is to compare the groups based on the probability that
a participant has a CESD score greater than 15, which is taken to indicate mild
depression or worse. Again, LSIZ is taken to be the covariate. Figure 10.5 shows
the plot produced by the function anc.2gbin. Generally, the first group (did not
complete high school) is estimated to be more likely to have mild depression or
worse. Based on the adjusted p-values and an FWE rate of 0.05, a significant result
is obtained at LSIZ scores 14 and 16.
Fig. 10.5 Shown is the plot created by the R function anc.2gbin using the Well Elderly data.
The y-axis is the difference between the probability of depression for the first group and the
probability of depression for the second group. LSIZ is a measure of life satisfaction
MC3 described below. A concern when K is small is that this might miss important
differences had other reasonable covariate points been used.
Imagine that the goal is to test K hypotheses. It is possible to test the global
hypothesis that all K of these hypotheses are true rather than test each individual
hypothesis. This can be done based on the K p-values resulting from testing each
of the K hypotheses. One version uses the average of the p-values. A second
version uses the product of the p-values where some p-values are reset to one if
they are sufficiently large (Wilcox, 2022a, Section 12.3.2). This is called method
MC2, which uses a larger number of covariate points than MC1. Moreover, it is not
limited to using Yuen’s method; it can be used with any hypothesis testing method
that yields a p-value.
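A rough Python sketch of the product-of-p-values idea follows. The cutoff for resetting large p-values to one and the independent-uniform null used to calibrate the test are illustrative assumptions only; method MC2 (Wilcox, 2022a, Section 12.3.2) handles these details differently.

```python
import numpy as np

def global_test_product(pvals, cutoff=0.9, nsim=20000, seed=2):
    # Global test from K individual p-values: the test statistic is the
    # product of the p-values after resetting any p-value above `cutoff`
    # to one. The null distribution is simulated assuming independent
    # uniform p-values, which ignores the dependence among the K tests.
    rng = np.random.default_rng(seed)

    def stat(p):
        return np.prod(np.where(p > cutoff, 1.0, p))

    obs = stat(np.asarray(pvals, dtype=float))
    null = np.array([stat(rng.uniform(size=len(pvals))) for _ in range(nsim)])
    return np.mean(null <= obs)  # global p-value
```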
But a criticism of method MC2 is that it does not yield information about where
significant differences are found. Perhaps MC2 can be used in some version of a
step-down multiple comparison procedure that has a practical advantage, but such
a method has not been established. Method MC3 uses more covariate points than
MC1. The basic strategy for controlling the FWE rate is to use a critical p-value,
.pc , which is determined in a manner similar to the approach in Sect. 8.1.
A basic issue is choosing the covariate points that will be used. Like the R
function ancJNPVAL, the strategy is to choose points that are reasonably well
nested within the cloud of the covariate data. The approach used here is based on
the notion of projection distances. Readers interested in the computational details
are referred to Wilcox (2022a). The only goal here is to note that two versions of
this approach are readily applied with extant R functions. There is even a method for
dealing with three or four covariates provided the sample size is sufficiently large.
With a small sample size, it can be impossible to find a point x such that Nj(x) ≥ 12
for both j = 1 and 2. With n = 80, such points might be available. With n = 150,
a fair number of points are likely to be available.
The R function
pts=NA, plotit=FALSE, FWE=FALSE)
compares two independent groups using method MC1. That is, it compares trimmed
means via Yuen’s method. The arguments x1 and x2 are assumed to be a matrix
or data frame with two columns. The FWE rate is controlled by the Studentized
maximum modulus distribution. If the argument plotit=TRUE, a plot of the
covariate points is produced. If the p-value is less than the value in the argument
alpha, it is indicated in the plot by +. If FWE=TRUE, a covariate point is indicated
by a + if its adjusted p-value is less than or equal to the value in the argument
alpha. The R function
is like ancovamp, only a percentile bootstrap method is used, and any measure of
location can be employed via the argument est.
The R function
can be used with different inferential methods via the argument test. By default,
Yuen’s method is used. Unlike the functions covered so far in this section, this
function can be used with a large number of covariate points. The argument FRAC
controls the proportion of the points that are used. For example, setting FRAC=0.3,
the deepest 70% of the covariate points would be used. The critical value is
known when using FRAC=0.5, the default value. But otherwise the critical value
must be computed, which increases execution time considerably. Setting MC=TRUE
can reduce execution time. If the outcome variable is binary, set the argument
test=binom2g.
When there are .J ≥ 2 groups and two covariates, the R function
performs a global test based on the individual p-values associated with each of the K
hypotheses being tested. By default, Yuen’s method for comparing trimmed means
is used. Setting the argument test = qcomhd, medians would be compared
based on a percentile bootstrap method and the Harrell-Davis estimator. If the
argument plotit=TRUE and PV=FALSE, the function plots m1(x) − m2(x)
as a function of the two covariates using LOESS. If PV=TRUE, the function
creates a plot of the p-values as a function of the two covariates. If the argument
DETAILS=TRUE, all p-values are returned, in which case they can be adjusted
using Hochberg’s or Hommel’s method (via the R function p.adjust) with the
goal of controlling the probability of one or more Type I errors. Setting the argument
cp.value=TRUE, the function returns a p-value based on the test statistic that was
used to test the global hypothesis given by Eq. (10.26). This can increase execution
time considerably. Again, execution time can be reduced by setting MC=TRUE,
assuming that a multicore processor is available.
Method MC3 can be applied via the R function
The R function
compares trimmed means via Yuen’s method and method MC4. By default, all
covariate points are used for which both N1(xi1) and N2(xi1) are greater than or
equal to 12. The function reports which points are used. The covariate points can
be specified via the argument pts. To compute effect sizes for the points that are
significant, use the R function
xout=FALSE, outfun=outpro,...).
When dealing with two covariates, using grids might help provide perspective on
where groups differ. For example, the groups might differ significantly in a region
where both covariates are relatively small, but not otherwise. The R function
applies this approach. Basically, it divides the data into groups as described in
Sect. 8.5 and then compares the trimmed means of the groups based on Yuen's method.
If the argument PB=TRUE, a percentile bootstrap method is used instead, in which
case other measures of location can be used via the argument est. Measures of
effect size are returned as well. If the argument CI=TRUE, confidence intervals for
the measures of effect size are reported. When the dependent variable is binary, use
the R function
If the number of possible values for the dependent variable is greater than two but
small, the R function
H0 : P(Y1 = y) = P(Y2 = y) (10.27)
given that the covariate values are in some specified region. This is done for every
possible y value using the approach described in Sect. 3.4.
First consider difference scores, and for illustrative purposes, imagine that
participants are measured at two different times. For simplicity, the focus is on a
single covariate that is measured at time 1 or at both time 1 and time 2. Now the
data consist of (Xi1, Yi1, Xi2, Yi2) (i = 1, ..., n), where all four random variables
are possibly dependent. Assuming a linear model is adequate, the goal is to make
inferences based on the assumption that some measure of location associated with
Yd = Y1 − Y2 is given by

Yd = β0 + β1 X1 + β2 X2, (10.28)

by the model based on the covariate difference scores Xd = X1 − X2,

Yd = β0 + β1 Xd, (10.29)

or there might be situations where the covariate is limited to a measure taken at time
1, in which case the model is simply

Yd = β0 + β1 X1. (10.30)
For this latter case, the point is that the method in Sect. 8.2 can be used to make
inferences about Yd(x), a measure of location associated with Yd, given that X1 = x.
For two covariates, again the method in Sect. 8.2 can be used, where now Yd(x) is
a measure of location given that X1 = X2 = x. See in particular the R functions
regYci and regYband in Sect. 8.2.2.
Rather than use difference scores, there might be interest in comparing Y1(x) and
Y2(x), the marginal measures of location given a value for the covariate. A method
for testing
10.4 Methods for Dependent Groups 297
H0 : Y1(x) = Y2(x) (10.31)

is to use a simple extension of the method used in Sect. 8.2: use a bootstrap
estimate of the standard error, which in turn can be used to test (10.31) and compute
a confidence interval for Y1(x) − Y2(x). Exercise 8 at the end of this chapter
illustrates these approaches.
There is a technical point that is worth mentioning. At time j, let

Yj = β0j + β1j Xj. (10.32)

Consider (Xi1, Yi1, Xi2, Yi2), i = 1, ..., n. The method used here estimates β01
and β11, the intercept and slope at time 1, using the time 1 data (Xi1, Yi1). The
time 2 data are ignored. In a similar manner, the intercept and slope at time 2 are
estimated with the time 2 data, ignoring the time 1 data. There are robust regression
estimators that take into account the possible association between Y1 and Y2, the
dependent variable measured at times 1 and 2 (e.g., Wilcox, 2022a, Section 10.17).
But the relative merits of these estimators, for the problem at hand, are unknown.
The method used here to test (10.31) mimics the approach used in Sect. 10.2.1.
Based on the bootstrap sample (X*i1, Y*i1, X*i2, Y*i2), i = 1, ..., n, let
Ŷ*j(x) = b*0 + b*1 x, where b*0 and b*1 are estimates of the intercept and slope,
respectively, based on (X*ij, Y*ij). Let

D* = Ŷ*1(x) − Ŷ*2(x).

Repeat this process B times yielding D*b (b = 1, ..., B). An estimate of the squared
standard error of Ŷ1(x) − Ŷ2(x) is

τ̂² = (1/(B − 1)) Σ (D*b − D̄*)²,

where D̄* = Σ D*b / B. The hypothesis given by (10.31) can be tested with

T = (Ŷ1(x) − Ŷ2(x)) / τ̂.
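The resampling scheme can be sketched as follows; ordinary least squares stands in for the robust regression estimator used in the text, so this is only an illustration of how the bootstrap standard error is computed.

```python
import numpy as np

def boot_se_diff(x1, y1, x2, y2, x, B=500, seed=3):
    # Bootstrap estimate of the standard error of Yhat1(x) - Yhat2(x)
    # for dependent groups: rows (X_i1, Y_i1, X_i2, Y_i2) are resampled
    # jointly, and each time a separate line is fit per time point.
    x1, y1 = np.asarray(x1, float), np.asarray(y1, float)
    x2, y2 = np.asarray(x2, float), np.asarray(y2, float)
    rng = np.random.default_rng(seed)
    n = len(y1)

    def yhat(xs, ys, x0):
        b1, b0 = np.polyfit(xs, ys, 1)   # slope, then intercept
        return b0 + b1 * x0

    D = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, n)      # one resample for all four variables
        D[b] = yhat(x1[idx], y1[idx], x) - yhat(x2[idx], y2[idx], x)
    return D.std(ddof=1)                 # tau-hat: sd of the D*_b
```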
As for controlling the FWE rate when dealing with two or more x values, the
Studentized maximum modulus distribution is used. A speculation is that when
The R function
Dancts(x1, y1, x2, y2, pts=NULL, regfun=tsreg, fr1=1, fr2=1,
alpha=.05, plotit=TRUE, xout=FALSE, outfun=out, nboot=100,
SEED=TRUE, xlab='X', ylab='Y', pr=TRUE, ...)
tests the hypothesis given by (10.31), and a confidence interval for .Y1 (x) − Y2 (x) is
returned as well. Covariate points can be specified via the argument pts. By default,
the function picks five values. If it is desired to use the least squares estimator, use
the R function
yielding
⎛ X*11, Y*11, X*12, Y*12 ⎞
⎜           ⋮            ⎟ (10.36)
⎝ X*n1, Y*n1, X*n2, Y*n2 ⎠
Next, based on this bootstrap sample, use the running-interval smoother to compute
an estimate of the marginal measures of location, yielding say Y*1(x) and Y*2(x).
Let D* = Y*1(x) − Y*2(x). Next, repeat this process B times yielding D*1, ..., D*B.
Then a 1 − α confidence interval can be computed as described in Sect. 3.1.2. And
a p-value, when testing (10.25), can be computed as well.
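A sketch of the percentile bootstrap step, given the B bootstrap differences D*1, ..., D*B (a hypothetical helper; the p-value is based on the proportion of bootstrap estimates below zero, as in Sect. 3.1.2):

```python
import numpy as np

def percentile_ci(D, alpha=0.05):
    # Percentile bootstrap confidence interval from the B bootstrap
    # estimates D*_1, ..., D*_B, plus a p-value for H0: estimand = 0
    # based on the proportion of bootstrap estimates below zero.
    D = np.sort(np.asarray(D, dtype=float))
    B = len(D)
    lo = D[int(round(alpha / 2 * B))]
    hi = D[int(round((1 - alpha / 2) * B)) - 1]
    phat = np.mean(D < 0) + 0.5 * np.mean(D == 0)
    pval = 2 * min(phat, 1 - phat)
    return (lo, hi), pval
```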
Again, difference scores can be used instead. Let Yid = Yi1 − Yi2, i = 1, ..., n.
Now a bootstrap sample is generated by sampling with replacement n rows from the
matrix

⎛ X11, X12, Y1d ⎞
⎜       ⋮       ⎟ (10.37)
⎝ Xn1, Xn2, Ynd ⎠

yielding

⎛ X*11, X*12, Y*1d ⎞
⎜        ⋮         ⎟ (10.38)
⎝ X*n1, X*n2, Y*nd ⎠
Based on this bootstrap sample, let Ŷ*d(x) denote the estimate of Yd based on
the running-interval smoother given that the covariates are equal to x. Repeat this
process B times yielding Ŷ*1d(x), ..., Ŷ*Bd(x). These B bootstrap estimates can be
used to compute confidence intervals and to test the hypothesis

H0 : Yd(x) = 0 (10.39)

as indicated in Chap. 2.
When there are two covariates, simple modifications of the methods, previously
described in this section, can be used. Another possibility is to use grids in a manner
similar to the approach mentioned in Sect. 8.5.
Several R functions are available for dealing with a single covariate. Some are
designed for situations where the number of covariate values is relatively small. The
uses non-bootstrap methods for dealing with a trimmed mean. By default, the
argument DIF = FALSE, meaning that the marginal trimmed means are used.
Setting DIF = TRUE, difference scores are used. The argument x2=x1 means
that by assumption, the covariate is measured at time 1 but not at time 2. If a
covariate is measured at time 2 and is stored, say, in R object T2, set x2=T2.
The R function
Dancova.ES.sum(x1, y1, x2=x1, y2, fr1 = 1, fr2 = 1, tr = 0.2, alpha = 0.05,
pts = NA, xout = FALSE, outfun = out, REL.MAG = NULL, SEED = TRUE,
nboot = 1000, ...)

uses a percentile bootstrap method. Bootstrap samples are generated based on the
Vij values, the Yij values for which Xij is close to x. By default, medians are used.
also uses a percentile bootstrap method, but now bootstrap samples are generated
by resampling from (10.35), or resampling is from (10.37) when dealing with
difference scores. (This is method DUB in Wilcox, 2022a.)
In terms of controlling the FWE rate, the R functions described so far are
designed for situations where the number of covariate values is relatively small.
For a relatively large number of covariate values, use the R function
By default, 25 covariate points are used. The function reports an adjusted critical
p-value, pc, to control the FWE rate, meaning that a hypothesis is rejected if its
p-value is less than or equal to pc. The function is fast when the FWE rate, indicated
by the argument alpha, is 0.05. Otherwise, the function computes pc, which can
result in high execution time.
The R function
alpha=0.05, pts=NULL,
SEED=TRUE,DIF=TRUE,cov.fun=skipcov,...)
uses grids, assuming there is a single covariate. The argument method indicates the
method used to compare the two dependent groups. When DIF=TRUE, the
choices are:
• TR (trimmed means using the Tukey-McLaughlin method)
• TRPB (trimmed means using a percentile bootstrap)
• MED (inference based on the median of the difference scores)
• AD (inference based on the median of the distribution of the typical difference;
see Section 3.2)
• SIGN (sign test based on an estimate of the probability that for a random pair,
the first is less than the second)
When DIF=FALSE, only trimmed means are used.
Example This example is based on the Well Elderly data using measures of stress
and depressive symptoms (CESD) taken before intervention and after 6 months of
intervention. The data are stored in the file well_T1_T2_dat.csv. Here, the goal is
to compare the measure of depressive symptoms before and after intervention when
a measure of stress is taken into account. If the R function DancovaUB is used, it
picks three covariate (stress) values and reports that when comparing CESD values
based on a 20% trimmed mean, the smallest p-value is greater than 0.58.
Now look at Figure 10.6, which was created by the R function Dancova. Shown
are the smooths for predicting the 20% trimmed mean of CESD scores given a value
for stress. The solid line is the predicted CESD value before intervention. Here is
the output stemming from Dancova:
$output
X n DIF TEST se ci.low ci.hi
[1,] 0 110 1.2096970 1.7708308 0.6831240 -0.1545958 2.573990
[2,] 2 196 -0.2047458 -0.2832042 0.7229617 -1.6365336 1.227042
[Fig. 10.6: Plot created by the R function Dancova, showing the smooths for predicting the 20% trimmed mean of CESD scores given a value for STRESS; the solid line is the predicted CESD value before intervention.]
Note that the smooths are nearly identical for stress less than or equal to 6. The
stress values chosen by the R function DancovaUB were 2, 4, and 6. Based on
Figure 10.6, it is not surprising that no significant difference was found. Even after
controlling the FWE rate, the R function Dancdet, which uses 25 values for the
covariate, rejects for 2 stress values: 8.3 and 8.75. For stress values ranging from
7.3 to 12.74, the p-values are less than 0.01. For stress=15.97, the p-value is 0.03.
Computing effect sizes, using the R function Dancova.ES.sum for stress equal
to 7, 8, 9, and 10, the estimates range between a medium effect size and a large
effect size. Here are the results for stress equal to 10:
NULL Est S M L ci.low ci.up p.value
AKP 0.0 0.3425973 0.10 0.30 0.50 0.01766226 0.8633059 0.035
QS (median) 0.5 0.6800000 0.54 0.62 0.69 0.51506302 0.8656900 0.040
QStr 0.5 0.6200000 0.54 0.62 0.69 0.46138919 0.8240978 0.082
SIGN 0.5 0.3000000 0.46 0.38 0.31 0.19027070 0.4382683 0.006
10.5 Exercises
5. Using again the data in the file B3_dat.txt and the R function ancova,
compare males to females with PEOP as the covariate and leverage points
removed; only now the variable named pfnbs_s is the dependent variable, which
is a measure of health and wellbeing.
6. For this exercise, use the data in the file B3_dat.txt with two covariates: PEOP
and LSIZ (life satisfaction). The dependent variable is the same dependent
variable used in the previous exercise, pfnbs_s. Compare males to females using
the R function ancJNPVAL with leverage points removed.
7. Imagine there are six covariates. Discuss the possible concerns that arise with
the various methods that might be used.
8. The file PEOP_CESD_PRE_POST.txt contains measures of personal sup-
port (PEOP) and depressive symptoms (CESD) before intervention and after
intervention. First, test the hypothesis given by (10.31) using the R function
Dancts. Next, use difference scores based on both the dependent variable and
the covariate. That is, test (10.29) using the R function regYband. Comment
on how the results differ.
9. Describe a reason why one would expect that using a running-interval smoother
might have less power than using a linear model when a linear model provides
a reasonable fit to the data.
10. When there are two covariates, one could check the adequacy of a linear model
by plotting Ŷ versus the residuals using the R function chk.lin. Describe
a concern with this approach.
11. Section 10.1 summarized the classic ANCOVA method. If this method is used
and it rejects, what would be a reasonable conclusion? Comment on the relative
merits of this conclusion.
12. The R function ancJN tests hypotheses corresponding to a relatively small
number of covariate values. The R function anclin tests hypotheses corre-
sponding to a relatively large number of covariate values. Given the goal of
controlling the FWE rate, does this mean that anclin has less power?
Appendix A
Basic Matrix Algebra
the value in the first row and first column is x11 = 32, and the value in the third
row and second column is x32 = 56.
Example Within statistics, a commonly encountered square matrix is the corre-
lation matrix. That is, for every individual, we have p measures with rij being
Pearson's correlation between the ith and jth measures. Then the correlation matrix
is R = (rij). If p = 3, r12 = .2, r13 = .4, and r23 = .3, then

    ⎛ 1  .2 .4 ⎞
R = ⎜ .2 1  .3 ⎟ .
    ⎝ .4 .3 1  ⎠
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 305
R. R. Wilcox, A Guide to Robust Statistical Methods,
https://doi.org/10.1007/978-3-031-41713-9
The transpose of a matrix is just the matrix obtained when the rth row becomes
the rth column. More formally, the transpose of the matrix X = (xij) is

X' = (xji).

For example, the transpose of

⎛ 23 91 ⎞
⎜ 51 29 ⎟
⎜ 63 76 ⎟
⎝ 11 49 ⎠

is

X' = ⎛ 23 51 63 11 ⎞
     ⎝ 91 29 76 49 ⎠ .
⎛ 1 0 0 ⎞
⎜ 0 1 0 ⎟
⎝ 0 0 1 ⎠

is the identity matrix when r = c = 3. A common notation for the identity matrix
is .I. An identity matrix with p rows and columns is created by the R command
diag(nrow=p).
Two .r × c matrices, .X and .Y, are said to be equal if for every i and j , .xij = yij .
That is, every element in .X is equal to the corresponding element in .Y.
The sum of two matrices having the same number of rows and columns is the
matrix of the element-by-element sums; that is, X + Y = (xij + yij).
When using R, the R command X+Y adds the two matrices, assuming both X and
Y are R variables having matrix mode with the same number of rows and columns.
Example
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
1 3 8 2 9 5
. ⎝ 4 −1 ⎠ + ⎝ 4 9⎠ = ⎝ 8 8⎠.
9 2 1 6 10 8
aX = (axij ).
.
That is, every element of the matrix .X is multiplied by a. Using R, if the R variable
a=3, and X is a matrix, the R command a*X will multiply every element in X by 3.
Example
  ⎛ 8 2 ⎞   ⎛ 16  4 ⎞
2 ⎜ 4 9 ⎟ = ⎜  8 18 ⎟ .
  ⎝ 1 6 ⎠   ⎝  2 12 ⎠
X̄ = (X̄1, ..., X̄p),

the vector of the sample means corresponding to the p measures. That is,

X̄j = (1/n) Σ(i=1 to n) Xij, j = 1, ..., p.
The product of two matrices, $X$ and $Y$, is defined when the number of columns of $X$, say c, is equal to the number of rows of $Y$. The product $Z = XY = (z_{ij})$ has elements
$$
z_{ij} = \sum_{k=1}^{c} x_{ik} y_{kj}.
$$
Example
$$
\begin{pmatrix} 8 & 2 \\ 4 & 9 \\ 1 & 6 \end{pmatrix} \begin{pmatrix} 5 & 3 \\ 2 & 1 \end{pmatrix} = \begin{pmatrix} 44 & 26 \\ 38 & 21 \\ 17 & 9 \end{pmatrix}.
$$
Using R, the command X%*%Y will multiply the two matrices X and Y.
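The summation defining $z_{ij}$ is easy to check by machine as well as by hand. Purely as an illustration of the definition (the book itself relies on R; the helper name `matmul` is ad hoc), the following Python sketch reproduces the product computed in the example above:

```python
def matmul(X, Y):
    """Matrix product: z_ij = sum over k of x_ik * y_kj.
    The number of columns of X must equal the number of rows of Y."""
    assert len(X[0]) == len(Y)
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

X = [[8, 2], [4, 9], [1, 6]]
Y = [[5, 3], [2, 1]]
print(matmul(X, Y))  # [[44, 26], [38, 21], [17, 9]]
```

This matches the example: the (1, 1) entry, for instance, is $8 \times 5 + 2 \times 2 = 44$.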
Example Consider a random sample of n observations, $X_1, \ldots, X_n$, and let $J$ be
a row matrix of ones. That is, $J = (1, 1, \ldots, 1)$. Letting $X$ be a column matrix
containing $X_1, \ldots, X_n$, then
$$
\sum X_i = JX.
$$
Consequently, the sample mean is
$$
\bar{X} = \frac{1}{n} JX.
$$
The sum of the squared observations is
$$
\sum X_i^2 = X'X.
$$
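These identities are just inner products with a vector of ones. As an illustration (a Python sketch with invented data, not code from the book):

```python
def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

X = [2.0, 5.0, 1.0, 4.0]    # the column of observations X_1, ..., X_n
J = [1.0] * len(X)          # the row vector of ones

total = dot(J, X)           # sum of the X_i, i.e., JX
mean = total / len(X)       # the sample mean, (1/n)JX
sum_sq = dot(X, X)          # sum of squared observations, X'X
print(total, mean, sum_sq)  # 12.0 3.0 46.0
```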
A common notation for the sample covariance matrix is $S = (s_{jk})$, where $s_{jk}$ is the covariance between the jth and kth measures.
When $j = k$, $s_{jk}$ is the sample variance corresponding to the jth variable under
study.
For any square matrix $X$, the matrix $X^{-1}$ is said to be the inverse of $X$ if
$$
XX^{-1} = I,
$$
the identity matrix. Using R, the inverse of a matrix can be computed with the command
solve(m),
where m is any R variable having matrix mode with the number of rows equal to the
number of columns.
Example Consider the matrix
$$
\begin{pmatrix} 5 & 3 \\ 2 & 1 \end{pmatrix}.
$$
Its inverse is
$$
\begin{pmatrix} -1 & 3 \\ 2 & -5 \end{pmatrix}.
$$
It is left as an exercise to verify that multiplying these two matrices together yields
$I$, the identity matrix.
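The exercise can also be checked by machine. A minimal Python sketch (the `matmul` helper is ad hoc, written here only to illustrate; the book would use R's %*%):

```python
def matmul(X, Y):
    # z_ij = sum over k of x_ik * y_kj
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

A = [[5, 3], [2, 1]]
A_inv = [[-1, 3], [2, -5]]
print(matmul(A, A_inv))  # [[1, 0], [0, 1]], the identity matrix
```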
Not all square matrices have an inverse; a matrix that has no inverse is said to be singular. The R function solve, applied to such a matrix, reports that
the matrix appears to be singular.
Consider any r-by-c matrix $X$, and consider any square submatrix taken from it. That is,
consider the matrix consisting of any k rows and any k columns taken from $X$. The
rank of $X$ is equal to the largest k for which a k-by-k submatrix is nonsingular.
The notation
$$
\mathrm{diag}\{x_1, \ldots, x_n\}
$$
refers to a diagonal matrix with the values $x_1, \ldots, x_n$ along the diagonal. For
example,
$$
\mathrm{diag}\{4, 5, 2\} = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.
$$
The R command diag(X) returns the diagonal values stored in the R variable X. If
$r < c$, the r rows and the first r columns of the matrix X are used, with the remaining
columns ignored. And if $c < r$, the c columns and the first c rows of the matrix X
are used, with the remaining rows ignored.
The trace of a square matrix is just the sum of the diagonal elements and is often
denoted by tr. For example, if
$$
A = \begin{pmatrix} 5 & 3 \\ 2 & 1 \end{pmatrix},
$$
then
$$
\mathrm{tr}(A) = 5 + 1 = 6.
$$
Using R, the trace of a matrix stored in the R variable X is computed with the command
sum(diag(X)).
A block diagonal matrix refers to a matrix where the diagonal elements are
themselves matrices.
Example If
$$
V_1 = \begin{pmatrix} 9 & 2 \\ 4 & 15 \end{pmatrix}
$$
and
$$
V_2 = \begin{pmatrix} 11 & 32 \\ 14 & 29 \end{pmatrix},
$$
then
$$
\mathrm{diag}(V_1, V_2) = \begin{pmatrix} 9 & 2 & 0 & 0 \\ 4 & 15 & 0 & 0 \\ 0 & 0 & 11 & 32 \\ 0 & 0 & 14 & 29 \end{pmatrix}.
$$
Let $A$ be an $m_1 \times n_1$ matrix, and let $B$ be an $m_2 \times n_2$ matrix. The (right) Kronecker
product of $A$ and $B$ is the $m_1 m_2 \times n_1 n_2$ matrix
$$
A \otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n_1}B \\ a_{21}B & a_{22}B & \cdots & a_{2n_1}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m_1 1}B & a_{m_1 2}B & \cdots & a_{m_1 n_1}B \end{pmatrix}.
$$
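Using R, the Kronecker product is computed with kronecker(A, B) or the operator A %x% B. Purely as an illustration of the block structure in the definition, here is a Python sketch (the function name `kron` is ad hoc):

```python
def kron(A, B):
    """(Right) Kronecker product: block (i, j) of the result is a_ij * B."""
    m2, n2 = len(B), len(B[0])
    return [[A[i // m2][j // n2] * B[i % m2][j % n2]
             for j in range(len(A[0]) * n2)]
            for i in range(len(A) * m2)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
print(kron(A, B))
# [[0, 1, 0, 2], [1, 0, 2, 0], [0, 3, 0, 4], [3, 0, 4, 0]]
```

Note how the result consists of the four blocks $1B$, $2B$, $3B$, and $4B$, as the definition requires.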
Associated with every square matrix is a number called its determinant. The
determinant of a 2-by-2 matrix is easily computed. For the matrix
$$
\begin{pmatrix} a & b \\ c & d \end{pmatrix},
$$
the determinant is $ad - cb$. For the more general case of a p-by-p matrix, algorithms
for computing the determinant are available, but the details are not important here.
(The R function det can be used.) If the determinant of a square matrix is 0, it has
no inverse. That is, it is singular. If the determinant differs from 0, it has an inverse.
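For the 2-by-2 case, the formula $ad - cb$ and its connection to singularity can be illustrated directly. A minimal Python sketch (the second matrix is an invented example of a singular matrix, not one from the text):

```python
def det2(M):
    """Determinant of a 2-by-2 matrix ((a, b), (c, d)): ad - cb."""
    (a, b), (c, d) = M
    return a * d - c * b

print(det2([[5, 3], [2, 1]]))  # -1: nonzero, so this matrix has an inverse
print(det2([[2, 4], [1, 2]]))  # 0: this matrix is singular (no inverse)
```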
Eigenvalues (also called characteristic roots or characteristic values) and eigen-
vectors of a square matrix $X$ are defined as follows. Let $Z$ be a column vector having
length p that differs from $0$. If there is a choice for $Z$ and a scalar $\lambda$ that satisfies
$$
XZ = \lambda Z,
$$
then $\lambda$ is said to be an eigenvalue of $X$ and $Z$ a corresponding eigenvector.
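Using R, eigenvalues and eigenvectors are computed with the function eigen. For a 2-by-2 matrix with real eigenvalues, the eigenvalues are the roots of the characteristic equation $\lambda^2 - \mathrm{tr}(X)\lambda + \det(X) = 0$; a minimal Python sketch:

```python
import math

def eigvals2(M):
    """Eigenvalues of a 2-by-2 matrix: roots of
    lambda^2 - tr(M)*lambda + det(M) = 0 (real-eigenvalue case)."""
    (a, b), (c, d) = M
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)  # assumes the discriminant is nonnegative
    return (tr + disc) / 2, (tr - disc) / 2

lam1, lam2 = eigvals2([[5, 3], [2, 1]])
print(lam1, lam2)  # roughly 6.162 and -0.162
```

As a check, the two eigenvalues sum to the trace (6) and multiply to the determinant (−1).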
3. $XX^{-}X = X$
4. $X^{-}XX^{-} = X^{-}$
Index

A
Agresti–Coull, 41
Agresti-Coull method, 42
All pairs power, 120
ANOVA F
  and homoscedasticity, 11

B
Bad leverage points
  detecting, 194
Between-by-between-by-between, 111
Bias corrected, 36
Binary data
  ANCOVA, 293
  inference based on grids, 229
Binomial
  Agresti-Coull, 42
Bonferroni method, 119
Bootstrap
  BCa, 36
  choosing B, 37
  estimate standard error, 40, 41
  sample, 34
  wild, 177
Bootstrap-t, 34, 242
Boxplot rule, 14
Breakdown point
  defined, 9
BWAMCP, 158
BWBMCP, 158
BWIDIF, 158
BWIMCP, 158
BWIPH, 158

C
CAR, 210
Coefficient of determination, 251
Cohen's d, 75
Contamination bias, 192
Control group
  using R, 125
Correlation
  Kendall's tau, 244
  partial, 255
  skipped, 249
  Spearman, 245
  Winsorized, 245
Covariate, 265
Curse of dimensionality, 182
Curvature
  checks on, 176
  partial response, 178

D
Difference scores, 83
Distribution
  heavy-tailed, 3, 5
  mixed normal, 4
  normal, 3
DUB, 301

E
Effect size
  ANCOVA, 300
  standardized, 44
Eout, 181

F
False discovery rate, 120
Family-wise error (FWE), 97
Fisher's r-to-z, 242

G
Generalized additive model, 182

H
Heteroscedasticity
  defined, 11
Hochberg's method, 119
Homoscedasticity, 10, 18
  defined, 10, 167
  linear model, 12

I
Ideal fourths, 13
Interaction
  three-way, 111
Interquartile range, 14

K
Kendall's tau, 244
KMS
  two binomials, 71
KMS effect size, 76
  and a covariate, 268
  interaction covariate, 270
  two covariates, 280
Kolmogorov–Smirnov test, 65

L
Least median squares (LMS), 194
Least squares regression
  estimation, 12
Least trimmed squares, 202
Least trimmed values, 202
Leerkes, 185
Leverage point
  bad, 190
  good, 190
Linear contrasts
  and a covariate, 268

M
MAD-median rule, 15
MADN, 15
Mahalanobis distance, 103, 169
Mann–Whitney test, 62
Marginal, 84
Masking, 13
MC1, 290
MC2, 291
MC3, 291
MCD, 171
Measure of location
  defined, 2
Median
  and tied values, 38, 58, 87
  typical difference, 62
Median absolute deviation (MAD), 15
Mediation, 224
M-estimator (MOM), 26
  standard error, 40
Method Y, 282
MGV method, 173
Minimum volume ellipsoid (MVE), 171
Missing values, 88, 139
  M1, 88
Moderator analysis, 233
Multinomial
  compared, 71
Multiple comparisons
  Hochberg, 119

N
Nearest neighbors, 180, 187
Normal distribution
  derivation, 3
Null value, 5

O
Order statistics, 17
Outliers
  boxplot rule, 14
  MAD-median rule, 15
  MCD method, 171
  MGV, 173
  MVE method, 171
  projection method, 171
  two standard deviation rule, 12

P
Paired Student's t test, 83
Partial response, 178
Pearson's correlation, 19
Percentage bend, 31
Percentile bootstrap, 36
Permutation methods, 61
Pivotal test statistic, 32
Power
  all pairs, 120
  defined, 6
Projection
  distance, 173
  and outliers, 171
p-value
  concerns, 6
  defined, 6

Q
Q2, 64
Quantile
  confidence interval, 38
  defined, 9
Quantile shift, 45
Quartiles, 13

R
R
  regci.MC, 223
  anc.2gbin, 285
  ancboot, 284
  ancdet, 287
  ancdet2C, 294
  ancdetM4, 294
  ancdifplot, 286
  anc.ES.sum, 285
  anc.grid, 295
  anc.grid.bin, 295
  anc.grid.cat, 295
  ancJN, 273
  ancJN.LC, 274
  ancJNPVAL, 274
  anclin, 274
  anclin.QS.CIpb, 276
  anclin.QS.plot, 276
  ancM.COV.ES, 295
  ancmg, 293
  ancmg1, 287
  ancmppb, 292
  ancNCE.QS.plot, 276
  ancov2COV, 293
  ancova, 284
  ancova.ESci, 275
  ancovamp, 292
  ancovampG, 293
  ancovap2.KMS, 280
  ancovap2.KMSci, 281
  ancovap2.KMS.plot, 281
  ancovap2.wmw.plot, 276
  ancovaUB, 286
  ancovaWMW, 288
  ancpb, 284
  ancsm.es, 285
  bbmcp, 126
  bbmcppb, 126
  bbwmcppb, 157
  bbwtrim, 149
  bbwtrimbt, 149
  bca.mean, 36
  bd1way, 140
  bdm, 132
  binband, 72
  binom.conf.pv, 43
  binom2g, 72
  bmp, 63
  bootse, 41
  box1way, 101
  boxplot, 14
  bwamcp, 159
  bw.2by2.es, 161
  bwbmcp, 159
  bw.es.A, 160
  bw.es.B, 161
  bw.es.I, 161
  bw.es.main, 148
  bwiDIF, 159
  bwimcp, 159
  bwmcp, 156
  bwmcppb.adj, 157
  BWPHmcp, 159
  bwtrim, 146
  bwtrimbt, 146
  bwwmcppb, 157
  bwwtrim, 149
  bwwtrimbt, 150
  cat.dat.ci, 43
  chk.lin, 178
  cidv2, 63
  comdvar, 94
  comvar2, 74
  con2way, 130, 156
  con3way, 130
  conCON, 125
  cor7, 253
  corb, 247
  corblp, 253