6  Two- and Higher-Dimensional Random Variables
In our study of random variables we have, so far, considered only the one-dimensional case. That is, the outcome of the experiment could be recorded as a single number x.
FIGURE 6.1
Note: As in the one-dimensional case, our concern will be not with the functional nature of X(s) and Y(s), but rather with the values which X and Y assume. We shall again speak of the range space of (X, Y), say R_{X×Y}, as the set of all possible values of (X, Y). In the two-dimensional case, for instance, the range space of (X, Y) will be a subset of the Euclidean plane. Each outcome (X(s), Y(s)) may be represented as a point (x, y) in the plane. We will again suppress the functional nature of X and Y by writing, for example, P(X ≤ a, Y ≤ b) instead of P[X(s) ≤ a, Y(s) ≤ b].
As in the one-dimensional case, we shall distinguish between two basic types of random variables: the discrete and the continuous random variables.
(1) p(x_i, y_j) ≥ 0 for all (x_i, y_j),
(2) ∑_{j=1}^∞ ∑_{i=1}^∞ p(x_i, y_j) = 1.    (6.1)
The function p, defined for all (x_i, y_j) in the range space of (X, Y), is called the probability function of (X, Y). The set of triples (x_i, y_j, p(x_i, y_j)), i, j = 1, 2, ..., is sometimes called the probability distribution of (X, Y).
Notes: (a) The analogy to a mass distribution is again clear. We have a unit mass distributed over a region in the plane. In the discrete case, all the mass is concentrated at a finite or countably infinite number of places, with mass p(x_i, y_j) located at (x_i, y_j). In the continuous case, mass is found at all points of some noncountable set in the plane.
(b) Condition 4 states that the total volume under the surface given by the equation
z = f(x, y) equals 1.
(c) As in the one-dimensional case, f(x, y) does not represent the probability of anything. However, for positive Δx and Δy sufficiently small, f(x, y) Δx Δy is approximately equal to P(x ≤ X ≤ x + Δx, y ≤ Y ≤ y + Δy).
(d) As in the one-dimensional case we shall adopt the convention that f(x, y) = 0 if (x, y) ∉ R. Hence we may consider f defined for all (x, y) in the plane, and requirement 4 above becomes ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} f(x, y) dx dy = 1.
(e) We shall again suppress the functional nature of the two-dimensional random variable (X, Y). We should always be writing statements of the form P[X(s) = x_i, Y(s) = y_j], etc. However, if our shortcut notation is understood, no difficulty should arise.
(f) Again, as in the one-dimensional case, the probability distribution of (X, Y) is
actually induced by the probability of events associated with the original sample space S.
However, we shall be concerned mainly with the values of (X, Y) and hence deal directly
with the range space of (X, Y). Nevertheless, the reader should not lose sight of the fact
that if P(A) is specified for all events A ⊂ S, then the probability associated with events in the range space of (X, Y) is determined. That is, if B is in the range space of (X, Y), we have

P(B) = P[(X(s), Y(s)) ∈ B] = P[{s | (X(s), Y(s)) ∈ B}].
This latter probability refers to an event in S and hence determines the probability of B.
In terms of our previous terminology, B and {s I (X(s), Y(s)) E B} are equivalent events
(Fig. 6.2).
FIGURE 6.2
P(B) = ∑∑ p(x_i, y_j),

if (X, Y) is discrete, where the sum is taken over all indices (i, j) for which (x_i, y_j) ∈ B. And

P(B) = ∬_B f(x, y) dx dy,    (6.4)

if (X, Y) is continuous.
EXAMPLE 6.1. Two production lines manufacture a certain type of item. Suppose that the capacity (on any given day) is 5 items for line I and 3 items for line II. Assume that the number of items actually produced by either production line is a random variable. Let (X, Y) represent the two-dimensional random variable yielding the number of items produced by line I and line II, respectively. Table 6.1 gives the joint probability distribution of (X, Y). Each entry represents p(x_i, y_j) = P(X = x_i, Y = y_j). Thus, if B is the event that more items are produced by line I than by line II, then summing the appropriate entries of Table 6.1 we find that P(B) = 0.75.
TABLE 6.1

        X=0    X=1    X=2    X=3    X=4    X=5
Y=0     0      0.01   0.03   0.05   0.07   0.09
Y=1     0.01   0.02   0.04   0.05   0.06   0.08
Y=2     0.01   0.03   0.05   0.05   0.05   0.06
Y=3     0.01   0.02   0.04   0.06   0.06   0.05
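The computations above (and the marginal totals obtained in Example 6.4 below) are easy to verify numerically. The following sketch, assuming Python with NumPy is available, stores Table 6.1 as an array and computes the marginal distributions and P(B) for B = {X > Y}:

```python
import numpy as np

# Joint probability table p(x, y): rows indexed by y = 0..3, columns by x = 0..5 (Table 6.1).
p = np.array([
    [0.00, 0.01, 0.03, 0.05, 0.07, 0.09],   # y = 0
    [0.01, 0.02, 0.04, 0.05, 0.06, 0.08],   # y = 1
    [0.01, 0.03, 0.05, 0.05, 0.05, 0.06],   # y = 2
    [0.01, 0.02, 0.04, 0.06, 0.06, 0.05],   # y = 3
])

assert abs(p.sum() - 1.0) < 1e-12          # condition (6.1): the entries sum to 1

p_x = p.sum(axis=0)                        # marginal distribution of X (column sums)
q_y = p.sum(axis=1)                        # marginal distribution of Y (row sums)

# P(B) where B = {X > Y}: sum the entries with x strictly greater than y.
prob_B = sum(p[y, x] for y in range(4) for x in range(6) if x > y)

print(np.round(p_x, 2))    # 0.03, 0.08, 0.16, 0.21, 0.24, 0.28
print(np.round(q_y, 2))    # 0.25, 0.26, 0.25, 0.24
print(round(prob_B, 2))    # 0.75
```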
FIGURE 6.3 (the lines x = 5000 and x = 10,000)

∫_{-∞}^{+∞} ∫_{-∞}^{+∞} f(x, y) dx dy = (1/(5000)²) ∫_{5000}^{10,000} ∫_{5000}^{10,000} dx dy = 1,

and

P(B) = (1/(5000)²) ∫_{5000}^{10,000} [ ⋯ ] dy = 17/25.
f(x, y) = x² + xy/3,   0 ≤ x ≤ 1, 0 ≤ y ≤ 2,
        = 0,   elsewhere.
∫_{-∞}^{+∞} ∫_{-∞}^{+∞} f(x, y) dx dy = ∫_0^2 ∫_0^1 (x² + xy/3) dx dy
    = ∫_0^2 (1/3 + y/6) dy = [y/3 + y²/12]_0^2 = 2/3 + 1/3 = 1.
Let B = {X + Y ≥ 1}. (See Fig. 6.4.) We shall compute P(B) by evaluating 1 − P(B̄), where B̄ = {X + Y < 1}. Hence

P(B) = 1 − ∫_0^1 ∫_0^{1−x} (x² + xy/3) dy dx
     = 1 − ∫_0^1 [x²(1 − x) + x(1 − x)²/6] dx
     = 1 − 7/72 = 65/72.
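This probability is also easy to confirm numerically; the sketch below (assuming Python with SciPy is available) integrates the joint pdf directly over the region x + y ≥ 1:

```python
from scipy import integrate

def f(y, x):
    # Joint pdf f(x, y) = x^2 + x*y/3 on 0 <= x <= 1, 0 <= y <= 2.
    return x**2 + x * y / 3

# P(X + Y >= 1): for each x in [0, 1], integrate y from max(0, 1 - x) to 2.
prob, _ = integrate.dblquad(f, 0, 1, lambda x: max(0.0, 1 - x), lambda x: 2.0)
print(prob, 65 / 72)   # both approximately 0.9028
```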
In studying one-dimensional random variables, we found that F, the cumulative distribution function, played an important role. In the two-dimensional case we can again define a cumulative distribution function, as follows.

FIGURE 6.4
EXAMPLE 6.4. Let us again consider Example 6.1. In addition to the entries of Table 6.1, let us also compute the "marginal" totals, that is, the sums of the 6 columns and of the 4 rows of the table. (See Table 6.2.)
The probabilities appearing in the row and column margins represent the probability distribution of Y and of X, respectively. For instance, P(Y = 1) = 0.26, P(X = 3) = 0.21, etc. Because of the appearance of Table 6.2 we refer, quite naturally, to these as the marginal distributions of X and of Y.
TABLE 6.2

        X=0    X=1    X=2    X=3    X=4    X=5    Sum
Y=0     0      0.01   0.03   0.05   0.07   0.09   0.25
Y=1     0.01   0.02   0.04   0.05   0.06   0.08   0.26
Y=2     0.01   0.03   0.05   0.05   0.05   0.06   0.25
Y=3     0.01   0.02   0.04   0.06   0.06   0.05   0.24
Sum     0.03   0.08   0.16   0.21   0.24   0.28   1.00
Since the event X = x_i must occur with Y = y_j for some j, and can occur with Y = y_j for only one j, we have

p(x_i) = P(X = x_i) = ∑_{j=1}^∞ p(x_i, y_j).

The function p defined for x_1, x_2, ... represents the marginal probability distribution of X.
These pdf's correspond to the basic pdf's of the one-dimensional random variables X and Y, respectively. For example,

P(c ≤ X ≤ d) = ∫_c^d g(x) dx.
(The units have been adjusted in order to use values between 0 and 1.) The marginal pdf of X is given by
so that the constant equals 1/area(R). (We are assuming that R is a region with finite, nonzero area.) That is,

f(x, y) = 1/area(R),   (x, y) ∈ R.

We find that

f(x, y) = 6,   (x, y) ∈ R,
        = 0,   (x, y) ∉ R.

FIGURE 6.5

The marginal pdf's are then

g(x) = 6(x − x²),   0 ≤ x ≤ 1;
h(y) = 6(√y − y),   0 ≤ y ≤ 1.
EXAMPLE 6.7. Consider again Examples 6.1 and 6.4. Suppose that we want to evaluate the conditional probability P(X = 2 | Y = 2). According to the definition of conditional probability we have

P(X = 2 | Y = 2) = P(X = 2, Y = 2)/P(Y = 2) = 0.05/0.25 = 0.20.
We can carry out such a computation quite generally for the discrete case. We have

p(x_i | y_j) = P(X = x_i | Y = y_j) = p(x_i, y_j)/q(y_j)   if q(y_j) > 0,    (6.5)

q(y_j | x_i) = P(Y = y_j | X = x_i) = p(x_i, y_j)/p(x_i)   if p(x_i) > 0.    (6.6)
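Continuing the sketch begun after Table 6.1 (again assuming Python with NumPy), the conditional distribution of X given Y = 2 is obtained by dividing the Y = 2 row of the table by its row sum, exactly as in Eq. (6.5):

```python
import numpy as np

p = np.array([
    [0.00, 0.01, 0.03, 0.05, 0.07, 0.09],   # y = 0
    [0.01, 0.02, 0.04, 0.05, 0.06, 0.08],   # y = 1
    [0.01, 0.03, 0.05, 0.05, 0.05, 0.06],   # y = 2
    [0.01, 0.02, 0.04, 0.06, 0.06, 0.05],   # y = 3
])

q_y2 = p[2].sum()                 # q(2) = P(Y = 2) = 0.25
p_x_given_y2 = p[2] / q_y2        # p(x | y = 2), Eq. (6.5)

print(p_x_given_y2[2])            # P(X = 2 | Y = 2) = 0.05/0.25 = 0.20
print(p_x_given_y2.sum())         # a conditional distribution sums to 1
```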
FIGURE 6.6 (a) the graph of g(x); (b) the graph of h(y)
Note: For a given j, p(x_i | y_j) satisfies all the conditions for a probability distribution. We have p(x_i | y_j) ≥ 0 and also

∑_{i=1}^∞ p(x_i | y_j) = ∑_{i=1}^∞ p(x_i, y_j)/q(y_j) = q(y_j)/q(y_j) = 1.
g(x | y) = f(x, y)/h(y),   h(y) > 0,    (6.7)

h(y | x) = f(x, y)/g(x),   g(x) > 0.    (6.8)
Notes: (a) The above conditional pdf's satisfy all the requirements for a one-dimensional pdf. Thus, for fixed y, we have g(x | y) ≥ 0 and

∫_{-∞}^{+∞} g(x | y) dx = ∫_{-∞}^{+∞} f(x, y)/h(y) dx = h(y)/h(y) = 1.
An analogous computation may be carried out for h(y I x) . Hence Eqs. (6.7) and (6.8)
define pdf's on Rx and Ry, respectively.
(b) An intuitive interpretation of g(x | y) is obtained if we consider slicing the surface represented by the joint pdf f with the plane y = c, say. The intersection of the plane with the surface z = f(x, y) will result in a one-dimensional pdf, namely the pdf of X for Y = c. This will be precisely g(x | c).
(c) Suppose that (X, Y) represents the height and weight of a person, respectively. Let f be the joint pdf of (X, Y) and let g be the marginal pdf of X (irrespective of Y). Hence ∫_{5.8}^{6} g(x) dx would represent the probability of the event {5.8 ≤ X ≤ 6} irrespective of the weight Y. And ∫_{5.8}^{6} g(x | 150) dx would be interpreted as P(5.8 ≤ X ≤ 6 | Y = 150). Strictly speaking, this conditional probability is not defined in view of our previous convention with conditional probability, since P(Y = 150) = 0. However, we simply use the above integral to define this probability. Certainly on intuitive grounds this ought to be the meaning of this number.
g(x) = ∫_0^2 (x² + xy/3) dy = 2x² + 2x/3,   0 ≤ x ≤ 1,

h(y) = ∫_0^1 (x² + xy/3) dx = 1/3 + y/6,   0 ≤ y ≤ 2.

Hence

h(y | x) = f(x, y)/g(x) = (x² + xy/3)/(2x² + 2x/3) = (3x + y)/(6x + 2),   0 ≤ y ≤ 2, 0 ≤ x ≤ 1,

g(x | y) = f(x, y)/h(y) = (x² + xy/3)/(1/3 + y/6) = (6x² + 2xy)/(2 + y),   0 ≤ x ≤ 1, 0 ≤ y ≤ 2.

As a check,

∫_0^1 g(x | y) dx = ∫_0^1 (6x² + 2xy)/(2 + y) dx = (2 + y)/(2 + y) = 1   for all y.
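These manipulations are easy to check with a computer algebra system; the following sketch (assuming Python with SymPy is available) recomputes the marginals and verifies that each conditional pdf integrates to 1:

```python
import sympy as sp

x, y = sp.symbols("x y", nonnegative=True)
f = x**2 + x * y / 3                     # joint pdf on 0 <= x <= 1, 0 <= y <= 2

g = sp.integrate(f, (y, 0, 2))           # marginal pdf of X: 2*x**2 + 2*x/3
h = sp.integrate(f, (x, 0, 1))           # marginal pdf of Y: y/6 + 1/3

h_given_x = sp.simplify(f / g)           # h(y | x) = (3*x + y)/(6*x + 2)
g_given_y = sp.simplify(f / h)           # g(x | y) = (6*x**2 + 2*x*y)/(y + 2)

print(sp.simplify(sp.integrate(h_given_x, (y, 0, 2))))   # 1
print(sp.simplify(sp.integrate(g_given_y, (x, 0, 1))))   # 1
```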
all pairs of independent random variables. For the X's depend only on the characteristics of source 1 while the Y's depend on the characteristics of source 2, and
there is presumably no reason to assume that the two sources influence each other's
behavior in any way. When we consider the possible independence of X1 and X2,
however, the matter is not so clearcut. Is the number of particles emitted during
the second hour influenced by the number that was emitted during the first hour?
To answer this question we would have to obtain additional information about
the mechanism of emission. We could certainly not assume, a priori, that X1
and X2 are independent.
Let us now make the above intuitive notion of independence more precise.
Note: If we compare the above definition with that given for independent events,
the similarity is apparent: we are essentially requiring that the joint probability (or joint
pdf) can be factored. The following theorem indicates that the above definition is equiva
lent to another approach we might have taken.
EXAMPLE 6.10. Suppose that a machine is used for a particular task in the morning and for a different task in the afternoon. Let X and Y represent the number of times the machine breaks down in the morning and in the afternoon, respectively.
Table 6.3 gives the joint probability distribution of (X, Y).
An easy computation reveals that for all the entries in Table 6.3 we have

p(x_i, y_j) = p(x_i) q(y_j).

Thus X and Y are independent random variables. (See also Example 3.7, for comparison.)
TABLE 6.3
        X=0    X=1    X=2    q(y_j)
EXAMPLE 6.11. Let X and Y be the life lengths of two electronic devices. Suppose that their joint pdf is given by

f(x, y) = e^{−(x+y)},   x ≥ 0, y ≥ 0.

Since we can factor f(x, y) = e^{−x} e^{−y}, the independence of X and Y is established.
{(x, y) | 0 ≤ x ≤ y ≤ 1}
Note: From the definition of the marginal probability distribution (in either the discrete or the continuous case) it is clear that the joint probability distribution determines, uniquely, the marginal probability distribution. That is, from a knowledge of the joint pdf f, we can obtain the marginal pdf's g and h. However, the converse is not true! That is, in general, a knowledge of the marginal pdf's g and h does not determine the joint pdf f. Only when X and Y are independent is this true, for in this case we have f(x, y) = g(x)h(y).
The following result indicates that our definition of independent random vari
ables is consistent with our previous definition of independent events.
a subset of R_Y, the range space of Y.) Then, if X and Y are independent
random variables, we have P(A n B) = P(A)P(B).
experiment and each of which assigns a real number to every s ∈ S, thus yielding the two-dimensional vector (X(s), Y(s)).
Let us now consider Z = H₁(X, Y), a function of the two random variables
X = 0, Y = 2 or X = 0, Y = 3 or X = 1, Y = 0 or X = 2, Y = 0 or X = 3, Y = 0 or X = 4, Y = 0 or X = 5, Y = 0. Hence P(U = 0) = 0.28. The rest of the probabilities associated with U may be obtained in a similar way. Hence the probability distribution of U may be summarized as follows: u: 0, 1, 2, 3; P(U = u): 0.28, 0.30, 0.25, 0.17. The probability distribution of the random variables V and W as defined above may be obtained in a similar way. (See Problem 6.9.)
If (X, Y) is a continuous two-dimensional random variable and if Z = H 1 ( X, Y)
is a continuous function of (X, Y), then Z will be a continuous (one-dimensional)
random variable and the problem of finding its pdf is somewhat more involved.
In order to solve this problem we shall need a theorem which we state and discuss
below. Before doing this, let us briefly outline the basic idea.
In finding the pdf of Z = H₁(X, Y) it is often simplest to introduce a second random variable, say W = H₂(X, Y), and first obtain the joint pdf of Z and W, say k(z, w). From a knowledge of k(z, w) we can then obtain the desired pdf of Z, say g(z), by simply integrating k(z, w) with respect to w. That is,

g(z) = ∫_{-∞}^{+∞} k(z, w) dw.

The remaining problems are (1) how to find the joint pdf of Z and W, and (2)
how to choose the appropriate random variable W = H2(X, Y). To resolve the
latter problem, let us simply state that we usually make the simplest possible choice
for W. In the present context, W plays only an intermediate role, and we are not
really interested in it for its own sake. In order to find the joint pdf of Z and W
we need Theorem 6.3.
Then the joint pdf of (Z, W), say k(z, w), is given by the following expression: k(z, w) = f[G₁(z, w), G₂(z, w)] |J(z, w)|, where J(z, w) is the following 2 × 2 determinant:

J(z, w) = | ∂x/∂z   ∂x/∂w |
          | ∂y/∂z   ∂y/∂w |

This determinant is called the Jacobian of the transformation (x, y) → (z, w) and is sometimes denoted by ∂(x, y)/∂(z, w). We note that k(z, w) will be nonzero for those values of (z, w) corresponding to values of (x, y) for which f(x, y) is nonzero.
FIGURE 6.8
Notes: (a) Although we shall not prove this theorem, we will at least indicate what needs to be shown and where the difficulties lie. Consider the joint cdf of the two-dimensional random variable (Z, W), say
Since f is assumed to be known, the integral on the right-hand side can be evaluated.
Differentiating it with respect to z and w will yield the required pdf. In most texts on
advanced calculus it is shown that these techniques lead to the result as stated in the
above theorem.
(b) Note the striking similarity between the above result and the result obtained in
the one-dimensional case treated in the previous chapter. (See Theorem 5.1.) The
monotonicity requirement for the function y = H(x) is replaced by the assumption that
the correspondence between (x, y) and (z, w) is one to one. The differentiability condition is replaced by certain assumptions about the partial derivatives involved. The
final solution obtained is also very similar to the one obtained in the one-dimensional
case: the variables x and y are simply replaced by their equivalent expressions in terms
of z and w, and the absolute value of dx/dy is replaced by the absolute value of the
Jacobian.
EXAMPLE 6.13. Suppose that we are aiming at a circular target of radius one which has been placed so that its center is at the origin of a rectangular coordinate system (Fig. 6.9). Suppose that the coordinates (X, Y) of the point of impact are uniformly distributed over the circle. That is,

f(x, y) = 1/π,   if (x, y) lies inside (or on) the circle,
        = 0,   elsewhere.

FIGURE 6.9   FIGURE 6.10

Suppose that we are interested in the random variable R representing the distance from the origin. (See Fig. 6.10.) That is, R = √(X² + Y²). We shall find the pdf of R, say g, as follows: Let Φ = tan⁻¹(Y/X). Hence X = H₁(R, Φ) and Y = H₂(R, Φ), where x = H₁(r, φ) = r cos φ and y = H₂(r, φ) = r sin φ. (We are simply introducing polar coordinates.)
The Jacobian is

J = | ∂x/∂r   ∂x/∂φ |   =   | cos φ   −r sin φ |   =   r cos²φ + r sin²φ = r.
    | ∂y/∂r   ∂y/∂φ |       | sin φ    r cos φ |

Under the above transformation the unit circle in the xy-plane is mapped into the rectangle in the φr-plane in Fig. 6.11. Hence the joint pdf of (Φ, R) is given by

g(φ, r) = r/π,   0 ≤ r ≤ 1,  0 ≤ φ ≤ 2π.
Note: This example points out the importance of obtaining a precise representation
of the region of possible values for the new random variables introduced.
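Integrating the joint pdf g(φ, r) = r/π over 0 ≤ φ ≤ 2π gives the marginal pdf of R as 2r, 0 ≤ r ≤ 1, which is easy to check by simulation. The sketch below (assuming Python with NumPy) samples points uniformly in the unit circle by rejection and compares the empirical distribution of R with the implied cdf r²:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample points uniformly over the unit circle by rejection from the square [-1, 1]^2.
pts = rng.uniform(-1, 1, size=(200_000, 2))
pts = pts[(pts ** 2).sum(axis=1) <= 1.0]

r = np.sqrt((pts ** 2).sum(axis=1))        # R = sqrt(X^2 + Y^2)

# If the pdf of R is 2r on [0, 1], then P(R <= r0) = r0^2.
for r0 in (0.25, 0.5, 0.75):
    print(r0, (r <= r0).mean(), r0 ** 2)   # empirical vs. theoretical cdf
```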
q(z) = ∫_{-∞}^{+∞} f(u, z/u) (1/|u|) du.    (6.9)
Note: In evaluating the above integral we may use the fact that
EXAMPLE 6.14. Suppose that we have a circuit in which both the current I and the resistance R vary in some random way. Specifically, assume that I and R are independent continuous random variables with the following pdf's:

g(i) = 2i,   0 ≤ i ≤ 1;     h(r) = r²/9,   0 ≤ r ≤ 3.

Of interest is the voltage E = IR; its pdf is found to be p(e) = (2e/9)(3 − e), 0 ≤ e ≤ 3.
q(z) = ∫_{-∞}^{+∞} g(vz) h(v) |v| dv.    (6.10)

t(z, v) = g(vz) h(v) |v|.

Integrating this joint pdf with respect to v yields the required marginal pdf of Z.
EXAMPLE 6.15. Let X and Y represent the life lengths of two light bulbs manufactured by different processes. Assume that X and Y are independent random variables with the pdf's f and g, respectively, where
Of interest might be the random variable X/Y, representing the ratio of the two life lengths. Let q be the pdf of Z.
By Theorem 6.5 we have q(z) = ∫_{-∞}^{+∞} g(vz) h(v) |v| dv. Since X and Y can assume only nonnegative quantities, the above integration need only be carried out over the positive values of the variable of integration. In addition, the integrand will be positive only when both the pdf's appearing are positive. This implies that we must have v ≥ 0 and vz ≥ 0. Since z > 0, these inequalities imply that v ≥ 0.
q(z) = 2/(z + 2)²,   z ≥ 0.
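The stated result q(z) = 2/(z + 2)² is exactly what Eq. (6.10) yields if one takes, for example, f(x) = e^{−x}, x ≥ 0, for X and h(y) = 2e^{−2y}, y ≥ 0, for Y; assuming those pdf's (an assumption made only for this sketch), a quick Monte Carlo check in Python with NumPy is:

```python
import numpy as np

rng = np.random.default_rng(1)

x = rng.exponential(scale=1.0, size=500_000)   # assumed pdf f(x) = exp(-x)
y = rng.exponential(scale=0.5, size=500_000)   # assumed pdf h(y) = 2*exp(-2y)
z = x / y

# If q(z) = 2/(z + 2)^2 for z >= 0, the cdf is Q(z0) = z0/(z0 + 2).
for z0 in (1.0, 2.0, 5.0):
    print(z0, (z <= z0).mean(), z0 / (z0 + 2))   # empirical vs. theoretical cdf
```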
(a) f(x₁, ..., x_n) ≥ 0 for all (x₁, ..., x_n).
(b) ∫_{-∞}^{+∞} ⋯ ∫_{-∞}^{+∞} f(x₁, ..., x_n) dx₁ ⋯ dx_n = 1.
∫_{-∞}^{+∞} ∫_{-∞}^{+∞} f(x₁, x₂, x₃) dx₁ dx₂ = g(x₃),

where g is the marginal pdf of the one-dimensional random variable X₃, while
where h represents the joint pdf of the two-dimensional random variable (X 1, X 2),
etc. The concept of independent random variables is also extended in a natural
way. We say that (X₁, ..., X_n) are independent random variables if and only if their joint pdf f(x₁, ..., x_n) can be factored into g₁(x₁) ⋯ g_n(x_n).
There are many situations in which we wish to consider n-dimensional random
variables. We shall give a few examples.
(a) Suppose that we study the pattern of precipitation due to a particular storm
system. If we have a network of, say, 5 observing stations and if we let Xi be the
rainfall at station i due to a particular frontal system, we might wish to consider
the five-dimensional random variable (X₁, X₂, X₃, X₄, X₅).
(b) One of the most important applications of n-dimensional random variables
occurs when we deal with repeated measurements on some random variable X.
Suppose that information about the life length, say X, of an electron tube is required. A large number of these tubes are produced by a certain manufacturer, and we test n of these. Let X_i be the life length of the ith tube, i = 1, ..., n. Hence
Problems of this type are studied at a more advanced level. (An excellent reference
for this subject area is "Stochastic Processes" by Emanuel Parzen, Holden-Day,
San Francisco, 1962.)
Note: In some of our discussion we have referred to the concept of "n-space." Let us
summarize a few of the basic ideas required.
With each real number x we may associate a point on the real number line, and con
versely. Similarly, with each pair of real numbers (x₁, x₂) we may associate a point in
the rectangular coordinate plane, and conversely. Finally, with each set of three real
numbers (x1, x2, x3), we may associate a point in the three-dimensional, rectangular
coordinate space, and conversely.
In many of the problems with which we are concerned we deal with a set of n real
numbers, (x1, x2, ... , xn), also called an n-tuple. Although we cannot draw any sketches
∬_A f(x, y) dx dy,

where A is a region in the (x, y)-plane, then the extension of this concept to

∫⋯∫_R f(x₁, ..., x_n) dx₁ ⋯ dx_n,

where R is a region in n-space, should be clear. If f represents the joint pdf of the two-dimensional random variable (X, Y), then

∬_A f(x, y) dx dy

represents P[(X, Y) ∈ A]. Similarly, if f represents the joint pdf of (X₁, ..., X_n), then

∫⋯∫_R f(x₁, ..., x_n) dx₁ ⋯ dx_n

represents P[(X₁, ..., X_n) ∈ R].
PROBLEMS
6.1. Suppose that the following table represents the joint probability distribution of
the discrete random variable (X, Y). Evaluate all the marginal and conditional dis
tributions.
        X=1    X=2    X=3
Y=1     1/12   1/6    0
Y=2     0      1/9    1/5
Y=3     1/18   1/4    2/15
6.2. Suppose that the two-dimensional random variable (X, Y) has joint pdf

f(x, y) = kx(x − y),   0 < x < 2,  −x < y < x,
        = 0,   elsewhere.
6.3. Suppose that the joint pdf of the two-dimensional random variable (X, Y) is
given by
= 0, elsewhere.
(a) P(X > 1/2); (b) P(Y < X); (c) P(Y < 1/2 | X < 1/2).
6.4. Suppose that two cards are drawn at random from a deck of cards. Let X be
the number of aces obtained and let Y be the number of queens obtained.
(a) Obtain the joint probability distribution of (X, Y).
(b) Obtain the marginal distribution of X and of Y.
(c) Obtain the conditional distribution of X (given Y) and of Y (given X).
6.5. For what value of k is f(x, y) = ke^{−(x+y)} a joint pdf of (X, Y) over the region 0 < x < 1, 0 < y < 1?
6.6. Suppose that the continuous two-dimensional random variable (X, Y) is uniformly distributed over the square whose vertices are (1, 0), (0, 1), (−1, 0), and (0, −1). Find the marginal pdf's of X and of Y.
6.7. Suppose that the dimensions, X and Y, of a rectangular metal plate may be considered to be independent continuous random variables with the following pdf's:

X:  g(x) = x − 1,   1 < x < 2,
         = −x + 3,   2 < x < 3,
         = 0,   elsewhere.

Y:  h(y) = 1/2,   2 < y < 4,
         = 0,   elsewhere.
f(x) = 1000/x²,   x > 1000,
     = 0,   elsewhere.
6.12. The intensity of light at a given point is given by the relationship I = C/D², where C is the candlepower of the source and D is the distance that the source is from the given point. Suppose that C is uniformly distributed over (1, 2), while D is a continuous random variable with pdf f(d) = e^{−d}, d > 0. Find the pdf of I, if C and D are independent. [Hint: First find the pdf of D² and then apply the results of this chapter.]
6.13. When a current I (amperes) flows through a resistance R (ohms), the power generated is given by W = I²R (watts). Suppose that I and R are independent random variables with the following pdf's.
0, elsewhere.
Determine the pdf of the random variable W and sketch its graph.
bx + c, then −b/2a represents the value at which a relative maximum or relative minimum occurs.
In the nondeterministic or random mathematical models which we have been
considering, parameters may also be used to characterize the probability distribution. With each probability distribution we may associate certain parameters
which yield valuable information about the distribution (just as the slope of a
line yields valuable information about the linear relationship it represents).
EXAMPLE 7.1. Suppose that X is a continuous random variable with pdf f(x) = ke^{−kx}, x ≥ 0. To check that this is a pdf, note that ∫_0^∞ ke^{−kx} dx = 1 for all k > 0, and that ke^{−kx} > 0 for k > 0. This distribution is called an exponential
distribution, which we shall study in greater detail later. It is a particularly useful
distribution for representing the life length, say X, of certain types of equipment
or components. The interpretation of k, in this context, will also be discussed
subsequently.
EXAMPLE 7.2. Assume that items are produced indefinitely on an assembly line.
The probability of an item being defective is p, and this value is the same for all
items. Suppose also that the successive items are defective (D) or nondefective
(N) independently of each other. Let the random variable X be the number of
items inspected until the first defective item is found. Thus a typical outcome of the
∑_{k=1}^∞ p(1 − p)^{k−1} = p · [1/(1 − (1 − p))] = 1,   provided 0 < |1 − p| < 1.

Thus the parameter p may be any number satisfying 0 < p < 1.
EXAMPLE 7.3. A wire cutting machine cuts wire to a specified length. Due to certain inaccuracies of the cutting mechanism, the length of the cut wire (in inches), say X, may be considered as a uniformly distributed random variable over [11.5, 12.5]. The specified length is 12 inches. If 11.7 ≤ X < 12.2, the wire can be sold for a profit of $0.25. If X ≥ 12.2, the wire can be recut, and an eventual profit of $0.10 is realized. And if X < 11.7, the wire is discarded with a loss of $0.02. An easy computation shows that P(X ≥ 12.2) = 0.3, P(11.7 ≤ X < 12.2) = 0.5, and P(X < 11.7) = 0.2.
Suppose that a large number of wire specimens are cut, say N. Let N_S be the number of specimens for which X < 11.7, N_R the number of specimens for which 11.7 ≤ X < 12.2, and N_L the number of specimens for which X ≥ 12.2. Hence the total profit realized from the production of the N specimens equals T = N_S(−0.02) + N_R(0.25) + N_L(0.10). The total profit per wire cut, say W, equals W = (N_S/N)(−0.02) + (N_R/N)(0.25) + (N_L/N)(0.10). (Note that W is a random variable, since N_S, N_R, and N_L are random variables.)
random variable, since Ns, NR, and N1, are random variables.)
We have already mentioned that the relative frequency of an event is close to
the probability of that event if the number of repetitions on which the relative
frequency is based is large. (We shall discuss this more precisely in Chapter 12.)
Hence, if N is large, we would expect Ns/N to be close to 0.2, NR/N to be close to
0.5, and NL/N to be close to 0.3. Therefore, for large N, W could be approximated
as follows:
W ≈ (0.2)(−0.02) + (0.5)(0.25) + (0.3)(0.10) = $0.151.
Thus, if a large number of wires were produced, we would expect to make a profit
of $0.151 per wire. The number 0.151 is called the expected value of the random
variable W.
if the series ∑_{i=1}^∞ x_i p(x_i) converges absolutely, i.e., if ∑_{i=1}^∞ |x_i| p(x_i) < ∞.
illustrates, strikingly, that E(X) is not the outcome we would expect when X is observed a single time. In fact, in the above situation, E(X) = 3.5 is not even a possible value for X.
Under fairly general conditions the arithmetic mean will be close to E(X) in a probabilistic sense. For example, in the above situation, if we were to throw the die a large number of times and then compute the arithmetic mean of the various outcomes, we would expect this average to become closer to 3.5 the more often the die were tossed.
(c) We should note the similarity between the notion of expected value as defined above (particularly if X may assume only a finite number of values) and the notion of the average of a set of numbers, say z_1, ..., z_n. We usually define z̄ = (1/n) ∑_{i=1}^n z_i as the arithmetic mean of the numbers z_1, ..., z_n. Suppose, furthermore, that we have numbers z′_1, ..., z′_k, where z′_i occurs n_i times, with ∑_{i=1}^k n_i = n. Letting f_i = n_i/n, so that ∑_{i=1}^k f_i = 1, we may write

z̄ = (1/n) ∑_{i=1}^k n_i z′_i = ∑_{i=1}^k f_i z′_i.
EXAMPLE 7.4. A manufacturer produces items such that 10 percent are defective and 90 percent are nondefective. If a defective item is produced, the manufacturer loses $1 while a nondefective item brings a profit of $5. If X is the net profit per item, then X is a random variable whose expected value is computed as E(X) = −1(0.1) + 5(0.9) = $4.40. Suppose that a large number of such items are produced. Then, since the manufacturer will lose $1 about 10 percent of the time and earn $5 about 90 percent of the time, he will expect to gain about $4.40 per item in the long run.
E(X) = np.

Proof: Since P(X = k) = [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}, we have

E(X) = ∑_{k=0}^n k [n!/(k!(n − k)!)] p^k (1 − p)^{n−k} = ∑_{k=1}^n [n!/((k − 1)!(n − k)!)] p^k (1 − p)^{n−k}

(since the term with k = 0 equals zero). Let s = k − 1 in the above sum. As k assumes values from one through n, s assumes values from zero through (n − 1). Replacing k everywhere by (s + 1) we obtain

E(X) = ∑_{s=0}^{n−1} [n!/(s!(n − 1 − s)!)] p^{s+1} (1 − p)^{n−1−s} = np ∑_{s=0}^{n−1} [(n − 1)!/(s!(n − 1 − s)!)] p^s (1 − p)^{n−1−s}.

The sum in the last expression is simply the sum of the binomial probabilities with n replaced by (n − 1) [that is, (p + (1 − p))^{n−1}] and hence equals one. This establishes the result.
Note: The above result certainly corresponds to our intuitive notion. For suppose
that the probability of some event A is, say 0.3, when an experiment is performed. If we
repeat this experiment, say 100 times, we would expect A to occur about 100(0.3) = 30
times. The concept of expected value, introduced above for the discrete random variable,
will shortly be extended to the continuous case.
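A direct numerical check of Theorem 7.1 (a sketch assuming Python is available) simply evaluates the defining sum for a few choices of n and p:

```python
from math import comb

def binomial_mean(n, p):
    # Evaluate E(X) = sum_k k * C(n, k) * p^k * (1-p)^(n-k) directly.
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

for n, p in [(10, 0.3), (25, 0.5), (100, 0.07)]:
    print(n, p, binomial_mean(n, p), n * p)   # agrees with np (up to rounding)
```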
Again it may happen that this (improper) integral does not converge. Hence we say that E(X) exists if and only if

∫_{-∞}^{+∞} |x| f(x) dx  is finite.
Note: We should observe the analogy between the expected value of a random variable and the concept of "center of mass" in mechanics. If a unit mass is distributed along the line at the discrete points x_1, ..., x_n, ... and if p(x_i) is the mass at x_i, then we see that ∑_{i=1}^∞ x_i p(x_i) represents the center of mass (about the origin). Similarly, if a unit mass is distributed continuously over a line, and if f(x) represents the mass density at x, then ∫_{-∞}^{+∞} x f(x) dx may again be interpreted as the center of mass. In the above sense, E(X) can represent "a center" of the probability distribution. Also, E(X) is sometimes called a measure of central tendency and is in the same units as X.
FIGURE 7.1 (the pdf f(x), with x = 1500 and x = 3000 marked)
EXAMPLE 7.6. Let the random variable X be defined as follows. Suppose that
X is the time (in minutes) during which electrical equipment is used at maximum
f(x) = x/(1500)²,   0 ≤ x ≤ 1500,
     = −(x − 3000)/(1500)²,   1500 ≤ x ≤ 3000,
     = 0,   elsewhere.

Thus

E(X) = ∫_{-∞}^{+∞} x f(x) dx = (1/(1500)²) [∫_0^{1500} x² dx − ∫_{1500}^{3000} x(x − 3000) dx] = 1500 minutes.
EXAMPLE 7.7. The ash content in coal (percentage), say X, may be considered as a continuous random variable with the following pdf: f(x) = x²/4875, 10 ≤ x ≤ 25. Hence E(X) = (1/4875) ∫_{10}^{25} x³ dx ≈ 19.5 percent. Thus the expected ash content in the particular coal specimen being considered is 19.5 percent.
Theorem 7.2. Let X be uniformly distributed over the interval [a, b]. Then

E(X) = (a + b)/2.

Proof:

E(X) = ∫_a^b x · [1/(b − a)] dx = [1/(b − a)] (x²/2) |_a^b = (a + b)/2.

(Observe that this represents the midpoint of the interval [a, b], as we would expect intuitively.)
of X. For example, how do we express Eq. (7.1) in terms of the outcomes s ∈ S, assuming S to be finite? Since x_i = X(s) for some s ∈ S, we may write

E(X) = ∑_{s∈S} X(s) P(s),

where P(s) is the probability of the event {s} ⊂ S.
sists of classifying three items as defective (D) or nondefective (N), a sample space for
this experiment would be
E(X) = ∑_{s∈S} X(s) P(s)
     = 0(1/8) + 1(1/8) + 1(1/8) + 1(1/8) + 2(1/8) + 2(1/8) + 2(1/8) + 3(1/8)
     = 3/2.
Of course, this result could have been obtained more easily by applying Eq. (7.1) directly.
However, it is well to remember that in order to use Eq. (7.1) we needed to know the numbers p(x_i), which in turn meant that a computation such as the one used above had to be
carried out. The point is that once the probability distribution over Rx is known [in this
case the values of the numbers p(x;)], we can suppress the functional relationship between
Rx and S.
E(Y) = ∑_{i=1}^∞ y_i q(y_i).    (7.4)
Note: Of course, these definitions are completely consistent with the previous definition given for the expected value of a random variable. In fact, the above simply represents a restatement in terms of Y. One "disadvantage" of applying the above definition
in order to obtain E(Y) is that the probability distribution of Y (that is, the probability
distribution over the range space Ry) is required. We discussed, in the previous chapter,
methods by which we may obtain either the point probabilities q(y;) or g, the pdf of Y.
However, the question arises as to whether we can obtain E(Y) without first finding the
probability distribution of Y, simply from the knowledge of the probability distribution
of X. The answer is in the affirmative as the following theorem indicates.
E(Y) = E(H(X)) = ∑_{i=1}^∞ H(x_i) p(x_i).    (7.6)

(b) If X is a continuous random variable with pdf f, we have

E(Y) = E(H(X)) = ∫_{-∞}^{+∞} H(x) f(x) dx.    (7.7)
Note: This theorem makes the evaluation of E(Y) much simpler, for it means that we
need not find the probability distribution of Y in order to evaluate E(Y). The knowledge
of the probability distribution of X suffices.
Proof: [We shall only prove Eq. (7.6). The proof of Eq. (7.7) is somewhat more intricate.] Consider the sum ∑_{i=1}^∞ H(x_i) p(x_i) = ∑_{j=1}^∞ (∑_i H(x_i) p(x_i)), where the inner sum is taken over all indices i for which H(x_i) = y_j, for some fixed y_j. Hence all the terms H(x_i) are constant in the inner sum. Hence

∑_{i=1}^∞ H(x_i) p(x_i) = ∑_{j=1}^∞ y_j ∑_i p(x_i).

However,

∑_i p(x_i) = ∑ {p(x_i) : H(x_i) = y_j} = q(y_j),

which establishes Eq. (7.6).
EXAMPLE 7.8. Let V be the wind velocity (mph) and suppose that V is uniformly distributed over the interval [0, 10]. The pressure, say W (in lb/ft²), on the surface is given by the relationship W = 0.003V². To evaluate E(W) we can proceed in two ways.
(a) Using Theorem 7.3, we have

E(W) = ∫_0^{10} 0.003v² (1/10) dv = 0.1 lb/ft².
(b) Using the definition of E(W), we first need to find the pdf of W, say g, and then evaluate ∫_{-∞}^{+∞} w g(w) dw. To find g(w), we note that w = 0.003v² is a monotone function of v for v ≥ 0. We obtain (using Theorem 5.1)

g(w) = (1/10) |dv/dw| = (1/2)√(10/3) w^{−1/2},   0 ≤ w ≤ 0.3,
     = 0,   elsewhere.

Hence

E(W) = ∫_0^{0.3} w g(w) dw = 0.1,

after a simple computation. Thus, as the theorem stated, the two evaluations of E(W) yield the same result.
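Both evaluations are easy to reproduce numerically; the sketch below (assuming Python with SciPy) computes E(W) via Theorem 7.3 and via the pdf of W obtained above:

```python
from math import sqrt
from scipy import integrate

# (a) E(W) = integral of 0.003*v^2 * f(v) dv, with V uniform on [0, 10].
e_w_a, _ = integrate.quad(lambda v: 0.003 * v**2 * (1 / 10), 0, 10)

# (b) E(W) = integral of w * g(w) dw, with g(w) = (1/2)*sqrt(10/3)*w**(-1/2) on [0, 0.3].
e_w_b, _ = integrate.quad(lambda w: w * 0.5 * sqrt(10 / 3) * w**-0.5, 0, 0.3)

print(e_w_a, e_w_b)   # both approximately 0.1 lb/ft^2
```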
(b) To evaluate E(Y) using the definition, we need to obtain the pdf of Y = |X|, say g. Let G be the cdf of Y. Hence

G(y) = P(Y ≤ y) = P[|X| ≤ y] = P[−y ≤ X ≤ y] = 2P(0 ≤ X ≤ y),

and therefore

G(y) = 2 ∫_0^y f(x) dx = 2 ∫_0^y (1/2) e^{−x} dx = −e^{−y} + 1.

Thus we have for g, the pdf of Y, g(y) = G′(y) = e^{−y}, y ≥ 0. Hence E(Y) = ∫_0^∞ y e^{−y} dy = 1.
EXAMPLE 7.10. In many problems we can use the expected value of a random
variable in order to make a certain decision in an optimum way.
Suppose that a manufacturer produces a certain type of lubricating oil which
loses some of its special attributes if it is not used within a certain period of time.
Let X be the number of units of oil ordered from the manufacturer during each
year. (One unit equals 1000 gallons.) Suppose that X is a continuous random
variable, uniformly distributed over [2, 4]. Hence the pdf f has the form

f(x) = 1/2,   2 ≤ x ≤ 4,
     = 0,   elsewhere.
Suppose that for each unit sold a profit of $300 is earned, while for each unit not sold (during any specified year) a loss of $100 is taken, since a unit not used will
have to be discarded. Assume that the manufacturer must decide a few months
prior to the beginning of each year how much he will produce, and that he decides
to manufacture Y units. (Y is not a random variable; it is specified by the manufacturer.) Let Z be the profit per year (in dollars). Here Z is clearly a random variable since it is a function of the random variable X. Specifically, Z = H(X), where

H(X) = 300Y   if X ≥ Y,
     = 300X + (−100)(Y − X)   if X < Y.

(The last expression may be written as 400X − 100Y.)
In order for us to obtain E(Z) we apply Theorem 7.3 and write

E(Z) = ∫_{-∞}^{+∞} H(x) f(x) dx = (1/2) ∫_2^4 H(x) dx.

FIGURE 7.2
E(Z) = 300Y   if Y ≤ 2,
     = −100Y² + 700Y − 400   if 2 < Y < 4,
     = 1200 − 100Y   if Y ≥ 4.
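The manufacturer would presumably choose Y so as to maximize E(Z); since −100Y² + 700Y − 400 has its maximum at Y = 3.5, that is the most profitable production level. A small numerical sketch (Python with SciPy) confirming this by integrating H directly:

```python
from scipy import integrate

def expected_profit(Y):
    # E(Z) for a given production level Y, with X uniform on [2, 4].
    H = lambda x: 300 * Y if x >= Y else 400 * x - 100 * Y
    value, _ = integrate.quad(H, 2, 4, points=[Y] if 2 < Y < 4 else None)
    return value / 2          # multiply by the pdf f(x) = 1/2

for Y in (2.0, 3.0, 3.5, 4.0):
    print(Y, expected_profit(Y))   # the maximum, 825, occurs at Y = 3.5
```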
The concepts discussed above for the one-dimensional case also hold for higher
dimensional random variables. In particular, for the two-dimensional case, we
make the following definition.
E(Z) = ∫_{-∞}^{+∞} z q(z) dz,   where q is the pdf of Z = H(X, Y);

and Theorem 7.4 asserts that, if (X, Y) is a continuous random variable with joint pdf f, we have

E(Z) = ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} H(x, y) f(x, y) dx dy.
Note: We shall not prove Theorem 7.4. Again, as in the one-dimensional case, this is
an extremely useful result since it states that we need not find the probability distribution
of the random variable Zin order to evaluate its expectation. We can find E(Z) directly
from the knowledge of the joint distribution of (X, Y).
EXAMPLE 7.11. Let us reconsider Example 6.14 and find E(E), where E = IR. We found that I and R were independent random variables with the following pdf's g and h, respectively:

g(i) = 2i,   0 ≤ i ≤ 1;     h(r) = r²/9,   0 ≤ r ≤ 3.

We also found that the pdf of E is p(e) = (2e/9)(3 − e), 0 ≤ e ≤ 3. Since I and R are independent random variables, the joint pdf of (I, R) is simply the product of the pdf's of I and R: f(i, r) = (2/9) i r², 0 ≤ i ≤ 1, 0 ≤ r ≤ 3. To evaluate E(E) using Theorem 7.4 we have

E(E) = ∫_0^3 ∫_0^1 ir f(i, r) di dr = (2/9) ∫_0^1 i² di ∫_0^3 r³ dr = (2/9)(1/3)(81/4) = 3/2.
given only for the continuous case. The reader should be able to supply the argument for the discrete case by simply replacing integrals by summations.
Proof:

E(X) = ∫_{-∞}^{+∞} C f(x) dx = C ∫_{-∞}^{+∞} f(x) dx = C.

FIGURE 7.4
Note: The meaning of "X equals C" is the following. Since X is a function from the sample space to R_X, the above means that R_X consists of the single value C. Hence X equals C if and only if P[X(s) = C] = 1. This notion is best explained in terms of the cdf of X. Namely, F(x) = 0 if x < C; F(x) = 1 if x ≥ C (Fig. 7.4). Such a random variable is sometimes called degenerate.
Proof: E(CX) = ∫_{-∞}^{+∞} C x f(x) dx = C ∫_{-∞}^{+∞} x f(x) dx = CE(X).
E(Z + W) = E(Z) + E( W ).
Proof
Property 7.4. Let X and Y be any two random variables. Then E(X + Y) =
E(X) + E(Y).
The expectation of a linear function is that same linear function of the expectation. This is not true unless a linear function is involved, and it is a common error to believe otherwise. For instance, E(X²) ≠ (E(X))², E(ln X) ≠ ln E(X), etc. Thus if X assumes the values −1 and +1, each with probability 1/2, then E(X) = 0. However, E(X²) = 1 ≠ (E(X))² = 0.
(b) In general, it is difficult to obtain expressions for E(1/X) or E(X^{1/2}), say, in terms of 1/E(X) or (E(X))^{1/2}. However, some inequalities are available, which are very easy to derive. (See articles by Pleiss, Murthy and Pillai, and Gurland in the February 1966, December 1966, and April 1967 issues, respectively, of The American Statistician.) For instance, we have:
(1) If X assumes only positive values and has finite expectation, then E(1/X) ≥ 1/E(X).
(2) Under the same hypotheses as in (1), E(X^{1/2}) ≤ (E(X))^{1/2}.
Proof (continuous case):

E(XY) = ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} xy f(x, y) dx dy = ∫_{-∞}^{+∞} ∫_{-∞}^{+∞} xy g(x) h(y) dx dy
      = [∫_{-∞}^{+∞} x g(x) dx][∫_{-∞}^{+∞} y h(y) dy] = E(X)E(Y).
P(X₁ = 1) = (1 − p)^k.

Therefore

P(X₁ = k + 1) = 1 − (1 − p)^k,

and hence

E(X₁) = 1 · (1 − p)^k + (k + 1)[1 − (1 − p)^k] = k[1 − (1 − p)^k + k^{−1}].
Thus grouping is preferable to individual testing if E(X₁) < k, that is, if 1 − (1 − p)^k + k^{−1} < 1, which is equivalent to k^{−1} < (1 − p)^k. (The above formula is valid only for k > 1, since for k = 1 it yields E(X₁) = 1 + p.) This cannot occur if (1 − p) < 1/2. For, in that case, (1 − p)^k < (1/2)^k < 1/k, the last inequality following from the fact that 2^k > k. Thus we obtain the following interesting conclusion: If p, the probability of a positive test on any given individual, is greater than 1/2, then it is never preferable to group specimens before testing. (See Problem 7.11b.)
EXAMPLE 7.13. Let us apply some of the above properties to derive (again) the
expectation of a binomially distributed random variable. The method used may
be applied to advantage in many similar situations.
Consider n independent repetitions of an experiment and let X be the number
of times some event, say A, occurs. Let p equal P(A) and assume that this number
is constant for all repetitions considered.
Define the auxiliary random variables Y₁, ..., Y_n as follows: Y_i = 1 if the event A occurs on the ith repetition, and Y_i = 0 otherwise. Hence

X = Y₁ + Y₂ + ⋯ + Y_n,

and, since E(Y_i) = 1(p) + 0(1 − p) = p, Property 7.4 (extended to n terms) gives E(X) = E(Y₁) + ⋯ + E(Y_n) = np.
Note: Let us reinterpret this important result. Consider the random variable X/n. This represents the relative frequency of the event A among the n repetitions of the experiment. Using Property 7.2, we have E(X/n) = (np)/n = p. This is, intuitively, as it should be, for it says that the expected relative frequency of the event A is p, where p = P(A). It represents the first theoretical verification of the fact that there is a connection between the relative frequency of an event and the probability of that event. In a later chapter we shall obtain further results yielding a much more precise relationship between relative frequency and probability.
EXAMPLE 7.14. Suppose that the demand D, per week, of a certain product is a random variable with a certain probability distribution, say P(D = n) = p(n), n = 0, 1, 2, .... Suppose that the cost to the supplier is C₁ dollars per item, while he sells the item for C₂ dollars. Any item which is not sold at the end of the week must be stored at a cost of C₃ dollars per item. If the supplier decides to produce N items at the beginning of the week, what is his expected profit per week? For what value of N is the expected profit maximized? If T is the profit per week, we have

T = NC₂ − NC₁   if D > N,
  = DC₂ − C₁N − C₃(N − D)   if D ≤ N.
For the particular distribution of D being considered we find that

E(T) = 6N + 2[N(N + 1)/2 − N²]   if N ≤ 5,
     = 6N + 2(15 − 5N)   if N > 5,

that is,

E(T) = 7N − N²   if N ≤ 5,
     = 30 − 4N   if N > 5.
FIGURE 7.5 (E(T) as a function of N; the maximum occurs at N = 3.5)
Suppose that for a random variable X we find that E(X) equals 2. What is the
significance of this? It is important that we do not attribute more meaning to this
information than is warranted. It simply means that if we consider a large number
of determinations of X, say x₁, ..., x_n, and average these values of X, this average
would be close to 2 if n is large. However, it is very crucial that we should not put
too much meaning into an expected value. For example, suppose that X represents
the life length of light bulbs being received from a manufacturer, and that E(X) =
1000 hours. This could mean one of several things. It could mean that most of the
bulbs would be expected to last somewhere between 900 hours and 1100 hours. It could also mean that the bulbs being supplied are made up of two entirely different types of bulbs: about half are of very high quality and will last about 1300
hours, while the other half are of very poor quality and will last about 700 hours.
There is an obvious need to introduce a quantitative measure which will distinguish between such situations. Various measures suggest themselves, but the
following is the most commonly used quantity.
Notes: (a) The number V(X) is expressed in square units of X. That is, if X is measured in hours, then V(X) is expressed in (hours)². This is one reason for considering the standard deviation: it is expressed in the same units as X.
(b) Another possible measure might have been E|X − E(X)|. For a number of reasons, one of which is that X² is a "better-behaved" function than |X|, the variance is preferred.
(c) If we interpret E(X) as the center of a unit mass distributed over a line, we may interpret V(X) as the moment of inertia of this mass about a perpendicular axis through the center of mass.
(d) V(X) as defined in Eq. (7.12) is a special case of the following more general notion. The kth moment of the random variable X about its expectation is defined as μ_k = E[X − E(X)]^k.
The evaluation of V(X) may be simplified with the aid of the following result.
Theorem 7.5.

V(X) = E(X²) − [E(X)]².

Proof: Expanding E[X − E(X)]² and using the previously established properties for expectation, we obtain

V(X) = E[X² − 2X E(X) + (E(X))²] = E(X²) − 2E(X)E(X) + (E(X))² = E(X²) − [E(X)]².
EXAMPLE 7.15. The weather bureau classifies the type of sky that is visible in terms of "degrees of cloudiness." A scale of 11 categories is used: 0, 1, 2, ..., 10, where 0 represents a perfectly clear sky, 10 represents a completely overcast sky, while the other values represent various intermediate conditions. Suppose that such a classification is made at a particular weather station on a particular day and time. Let X be the random variable assuming one of the above 11 values. Suppose that the probability distribution of X is

p₀ = p₁₀ = 0.05;
p₁ = p₂ = p₈ = p₉ = 0.15;
p₃ = p₄ = p₅ = p₆ = p₇ = 0.06.

Hence

E(X) = 1(0.15) + 2(0.15) + 3(0.06) + 4(0.06) + 5(0.06) + 6(0.06) + 7(0.06) + 8(0.15) + 9(0.15) + 10(0.05) = 5.0.
FIGURE 7.6
Hence

V(X) = E(X²) − (E(X))² = 35.6 − 25 = 10.6.
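The arithmetic in this example is easy to verify directly (a short sketch in Python):

```python
# Degrees-of-cloudiness distribution from Example 7.15.
p = {0: 0.05, 10: 0.05, 1: 0.15, 2: 0.15, 8: 0.15, 9: 0.15,
     3: 0.06, 4: 0.06, 5: 0.06, 6: 0.06, 7: 0.06}

assert abs(sum(p.values()) - 1.0) < 1e-12

mean = sum(x * px for x, px in p.items())              # E(X) = 5
second_moment = sum(x**2 * px for x, px in p.items())  # E(X^2) = 35.6
variance = second_moment - mean**2                     # V(X) = 10.6

print(mean, second_moment, variance)   # 5, 35.6, 10.6 (up to floating-point rounding)
```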
(See Fig. 7.6.) Because of the symmetry of the pdf, E(X) = 0. (See the Note below.) Furthermore, since E(X) = 0, V(X) = E(X²), which is obtained by a direct integration.
Note: Suppose that a continuous random variable has a pdf which is symmetric about x = 0. That is, f(−x) = f(x) for all x. Then, provided E(X) exists, E(X) = 0, which
Note: This property is intuitively clear, for adding a constant to an outcome X does
not change its variability, which is what the variance measures. It simply "shifts" the
values of X to the right or to the left, depending on the sign of C.
Proof:

V(X + Y) = E(X + Y)² − (E(X + Y))²
         = E(X²) + 2E(XY) + E(Y²) − (E(X))² − 2E(X)E(Y) − (E(Y))²
         = E(X²) − (E(X))² + E(Y²) − (E(Y))²    [since E(XY) = E(X)E(Y)]
         = V(X) + V(Y).
Note: It is important to realize that the variance is not additive, in general, as is the expected value. With the additional assumption of independence, Property 7.9 is valid. Nor does the variance possess the linearity property which we discussed for the expectation; that is, V(aX + b) ≠ aV(X) + b. Instead we have V(aX + b) = a²V(X).
Notes: (a) This is an obvious extension of Theorem 7.5, for by letting a = 0 we obtain Theorem 7.5.
(b) If we interpret V(X) as the moment of inertia and E(X) as the center of a unit
mass, then the above property is a statement of the well-known parallel-axis theorem in
mechanics: The moment of inertia about an arbitrary point equals the moment of inertia
about the center of mass plus the square of the distance of this arbitrary point from the
center of mass.
(c) E[X - a]2 is minimized if a = E(X). This follows immediately from the above
property. Thus the moment of inertia (of a unit mass distributed over a line) about an
axis through an arbitrary point is minimized if this point is chosen as the center of mass.
(E(X))². To compute E(X²) we use the fact that P(X = k) = [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}, k = 0, 1, ..., n. Hence E(X²) = ∑_{k=0}^n k² [n!/(k!(n − k)!)] p^k (1 − p)^{n−k}. This sum may be evaluated fairly easily, but rather than do this, we shall employ a simpler method.
We shall again use the representation of X introduced in Example 7.13, namely X = Y₁ + Y₂ + ⋯ + Y_n. We now note that the Y_i's are independent random variables, since the value of Y_i depends only on the outcome of the ith repetition, and the successive repetitions are assumed to be independent. Hence we may apply Property 7.10 and obtain
E(X²) = ∫_a^b x² · [1/(b − a)] dx = (b³ − a³)/(3(b − a)).

Hence

V(X) = E(X²) − [E(X)]² = (b − a)²/12,

after a simple computation.
Notes: (a) This result is intuitively meaningful. It states that the variance of X does not depend on a and b individually but only on (b − a)², that is, on the square of their difference. Hence two random variables each of which is uniformly distributed over
difference. Hence two random variables each of which is uniformly distributed over
some interval (not necessarily the same) will have equal variances so long as the lengths
of the intervals are the same.
(b) It is a well-known fact that the moment of inertia of a slim rod of mass M and length L about a transverse axis through the center is given by ML²/12.
we need not find the probability distribution of Y, but may work directly with the
probability distribution of X. Similarly, if Z = H(X, Y), we can evaluate E(Z)
and V(Z) without first obtaining the distribution of Z.
If the function His quite involved, the evaluation of the above expectations and
variances may lead to integrations (or summations) which are quite difficult.
Hence the following approximations are very useful.
Theorem 7.6. Let X be a random variable with E(X) = μ and V(X) = σ². Suppose that Y = H(X). Then

E(Y) ≈ H(μ) + (σ²/2) H″(μ),    (7.18)

V(Y) ≈ [H′(μ)]² σ².    (7.19)

(In order to make the above approximations meaningful, we obviously require that H be at least twice differentiable at x = μ.)

Proof (outline only): In order to establish Eq. (7.18), we expand the function H in a Taylor series about x = μ to two terms. Thus

Y = H(μ) + (X − μ)H′(μ) + ((X − μ)²/2) H″(μ) + R₂.

Taking the expectation of both sides and discarding the remainder R₂ yields Eq. (7.18). To obtain Eq. (7.19), we expand H to one term, Y = H(μ) + (X − μ)H′(μ) + R₁; if we discard the remainder and take the variance of both sides, we have Eq. (7.19).
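As a quick illustration of how good these approximations can be, the sketch below (Python with NumPy; the particular choice H(x) = e^x and X uniform on [0, 1] is only an assumption made for the demonstration) compares (7.18) and (7.19) with Monte Carlo estimates:

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma2 = 0.5, 1 / 12          # mean and variance of X ~ uniform(0, 1)
H = np.exp                        # H(x) = e^x, so H'(x) = H''(x) = e^x

x = rng.uniform(0, 1, size=1_000_000)
y = H(x)

approx_mean = H(mu) + 0.5 * H(mu) * sigma2        # Eq. (7.18)
approx_var = (H(mu)) ** 2 * sigma2                # Eq. (7.19)

print(approx_mean, y.mean())      # about 1.717 vs 1.718 (the exact value is e - 1)
print(approx_var, y.var())        # about 0.227 vs 0.242
```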
EXAMPLE 7.19. Under certain conditions, the surface tension of a liquid (dyn/cm) is given by the formula S = 2(1 − 0.005T)^{1.2}, where T is the temperature of the liquid (degrees centigrade). Suppose that T is a continuous random variable with the following pdf:

f(t) = ⋯,   t ≥ 10,
     = 0,   elsewhere.

Hence E(T) = 15 and

V(T) = E(T²) − (15)².

Rather than evaluate these expressions, we shall obtain approximations for E(S) and V(S) by using Eqs. (7.18) and (7.19). In order to use these formulas we have to compute H′(15) and H″(15), where H(t) = 2(1 − 0.005t)^{1.2}. We have

H′(t) = 2(1.2)(−0.005)(1 − 0.005t)^{0.2} = −0.012(1 − 0.005t)^{0.2}.

Hence

H(15) = 1.82,   H′(15) ≈ −0.01.

Similarly,

H″(t) = −0.012(0.2)(−0.005)(1 − 0.005t)^{−0.8} = 0.000012(1 − 0.005t)^{−0.8}.

Therefore

H″(15) = 0.000012/(0.925)^{0.8} ≈ 0⁺.

Thus we have, from Eqs. (7.18) and (7.19), E(S) ≈ H(15) = 1.82 (the second-order term being negligible) and V(S) ≈ [H′(15)]² V(T).
[We shall assume that the various derivatives of H exist at (μ_x, μ_y).] Then, if X and Y are independent, we have

E(Z) ≈ H(μ_x, μ_y) + (1/2)[(∂²H/∂x²) σ_x² + (∂²H/∂y²) σ_y²],

V(Z) ≈ (∂H/∂x)² σ_x² + (∂H/∂y)² σ_y²,

where all the partial derivatives are evaluated at (μ_x, μ_y).

Proof: The proof involves the expansion of H in a Taylor series about the point (μ_x, μ_y) to one and two terms, discarding the remainder, and then taking the expectation and variance of both sides as was done in the proof of Theorem 7.6. We shall leave the details to the reader. (If X and Y are not independent, a slightly more complicated formula may be derived.)
Note: The above result may be extended to a function of n independent random variables, say Z = H(X₁, ..., X_n). If E(X_i) = μ_i and V(X_i) = σ_i², we have the following approximations:

E(Z) ≈ H(μ₁, ..., μ_n) + (1/2) ∑_{i=1}^n (∂²H/∂x_i²) σ_i²,

V(Z) ≈ ∑_{i=1}^n (∂H/∂x_i)² σ_i²,

where all the partial derivatives are evaluated at the point (μ₁, ..., μ_n).
EXAMPLE 7.20. Suppose that we have a simple circuit for which the voltage, say M, is expressed by Ohm's law as M = IR, where I and R are the current and resistance, respectively.
c be any real number. Then, if E(X − c)² is finite and ε is any positive number, we have

P(|X − c| ≥ ε) ≤ (1/ε²) E(X − c)².    (7.20)

Notes: (a) An equivalent form is

P(|X − c| < ε) ≥ 1 − (1/ε²) E(X − c)².    (7.20a)

(b) Choosing c = μ we obtain

P(|X − μ| ≥ ε) ≤ Var(X)/ε².    (7.20b)

(c) Choosing c = μ and ε = kσ, where σ² = Var(X) > 0, we obtain

P(|X − μ| ≥ kσ) ≤ 1/k².    (7.21)

This last form (7.21) is particularly indicative of how the variance measures the "degree of concentration" of probability near E(X) = μ.
Proof: (We shall prove only (7.20), since the others follow as indicated. We shall deal only with the continuous case. In the discrete case the argument is very similar, with integrals replaced by sums. However, some care must be taken with endpoints of intervals.)
Consider

P(|X − c| ≥ ε) = ∫_{|x−c| ≥ ε} f(x) dx.

(The limit on the integral says that we are integrating between −∞ and c − ε and between c + ε and +∞.) Now |x − c| ≥ ε is equivalent to (x − c)²/ε² ≥ 1. Hence the above integral is

≤ ∫_R [(x − c)²/ε²] f(x) dx,

where R = {x : |x − c| ≥ ε}. This, in turn, is

≤ ∫_{-∞}^{+∞} [(x − c)²/ε²] f(x) dx,

which equals (1/ε²) E[X − c]², as was to be shown.
Notes: (a) It is important to realize that the above result is remarkable precisely be
cause so little is assumed about the probabilistic behavior of the random variable X.
(b) As we might suspect, additional information about the distribution of the random
variable X will enable us to improve on the inequality derived. For example, if c = 1/2 we have, from Chebyshev's inequality,
Observe that although the statement obtained from Chebyshev's inequality is consistent
with this result, the latter is a more precise statement. However, in many problems no
assumption concerning the specific distribution of the random variable is justified, and
in such cases Chebyshev's inequality can give us important information about the be
havior of the random variable.
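The gap between the Chebyshev bound and the exact probability is easy to see numerically. The sketch below (Python with NumPy; the exponential distribution is merely an assumed illustration) compares P(|X − μ| ≥ kσ) with the bound 1/k²:

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.exponential(scale=1.0, size=1_000_000)   # E(X) = 1, Var(X) = 1
mu, sigma = 1.0, 1.0

for k in (1.5, 2.0, 3.0):
    actual = (np.abs(x - mu) >= k * sigma).mean()
    print(k, actual, 1 / k**2)   # the actual probability is well below the bound 1/k^2
```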
As we note from Eq. (7.21), if V(X) is small, most of the probability distribution
of X is "concentrated" near E(X). This may be expressed more precisely in the
following theorem.
Notes: (a) This theorem shows that zero variance does imply that all the probability
is concentrated at a single point, namely at E(X).
(b) If E(X) = 0, then V(X) = E(X2), and hence in this case, E(X2) = 0 implies the
same conclusion.
(c) It is in the above sense that we say that a random variable X is degenerate: It
assumes only one value with probability 1.
So far we have been concerned with associating parameters such as E(X) and
V(X) with the distribution of one-dimensional random variables. These param
eters measure, in a sense described previously, certain characteristics of the dis
tribution. If we have a two-dimensional random variable (X, Y), an analogous
problem is encountered. Of course, we may again discuss the one-dimensional
random variables X and Y associated with (X, Y). However, the question arises
whether there is a meaningful parameter which measures in some sense the "degree
of association" between X and Y. This rather vague notion will be made precise
shortly. We state the following formal definition.
Notes: (a) We assume that all the expectations exist and that both V(X) and V(Y) are nonzero. When there is no question as to which random variables are involved we shall simply write ρ instead of ρ_{xy}.
(b) The numerator of ρ, E{[X − E(X)][Y − E(Y)]}, is called the covariance of X and Y, and is sometimes denoted by σ_{xy}.
(c) The correlation coefficient is a dimensionless quantity.
(d) Before the above definition can be very meaningful we must discover exactly what ρ measures. This we shall do by considering a number of properties of ρ.
Theorem 7.9
ρ = [E(XY) − E(X)E(Y)] / √(V(X) V(Y)).
Proof: Consider
E(XY) = E(X)E(Y)
if X and Y are independent.
Note: The converse of Theorem 7.10 is in general not true. (See Problem 7.39.) That is, we may have ρ = 0, and yet X and Y need not be independent. If ρ = 0, we say that X and Y are uncorrelated. Thus, being uncorrelated and being independent are, in general, not equivalent. The following example illustrates this point.*
Let X and Y be any random variables having the same distribution. Let U = X − Y and V = X + Y. Hence E(U) = 0 and cov(U, V) = E[(X − Y)(X + Y)] = E(X² − Y²) = 0. Thus U and V are uncorrelated. Even if X and Y are independent, U and V may be dependent, as the following choice of X and Y indicates. Let X and Y be the numbers appearing on the first and second fair dice, respectively, which have been tossed. We now find, for example, that P[V = 4 | U = 3] = 0 (since if X − Y = 3, X + Y cannot equal 4), while P(V = 4) = 3/36. Thus U and V are dependent.
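A short simulation (Python with NumPy) makes the distinction concrete: U and V have essentially zero correlation, yet knowing U changes the conditional distribution of V:

```python
import numpy as np

rng = np.random.default_rng(4)

x = rng.integers(1, 7, size=1_000_000)   # first die
y = rng.integers(1, 7, size=1_000_000)   # second die
u, v = x - y, x + y

print(np.corrcoef(u, v)[0, 1])           # essentially 0: U and V are uncorrelated
print((v[u == 3] == 4).mean())           # P(V = 4 | U = 3) = 0
print((v == 4).mean())                   # P(V = 4) = 3/36, about 0.083
```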
FIGURE 7.8 (a), (b): the two possible graphs of q(t)
at² + bt + c has the property that q(t) ≥ 0 for all t, it means that its graph touches the t-axis at just one place or not at all, as indicated in Fig. 7.8. This, in turn, means that its discriminant b² − 4ac must be ≤ 0, since b² − 4ac > 0 would mean that q(t) has two distinct real roots. Applying this conclusion to the function q(t) under consideration above, we obtain

4[E(VW)]² − 4E(V²)E(W²) ≤ 0.
* The example in this note is taken from a discussion appearing in an article entitled "Mutually Exclusive Events, Independence and Zero Correlation," by J. D. Gibbons, appearing in The American Statistician, 22, No. 5, December 1968, pp. 31-32.
This implies
Proof: Consider again the function q(t) described in the proof of Theorem 7.11. It is a simple matter to observe in the proof of that theorem that if q(t) > 0 for all t, then ρ² < 1. Hence the hypothesis of the present theorem, namely ρ² = 1, implies that there must exist at least one value of t, say t₀, such that q(t₀) = E(V + t₀W)² = 0. Since V + t₀W = [X − E(X)] + t₀[Y − E(Y)], we have that E(V + t₀W) = 0 and hence variance(V + t₀W) = E(V + t₀W)². Thus we find that the hypothesis of Theorem 7.12 leads to the conclusion that the variance of (V + t₀W) = 0. Hence, from Theorem 7.8 we may conclude that the random variable (V + t₀W) = 0 (with probability 1). Therefore [X − E(X)] + t₀[Y − E(Y)] = 0. Rewriting this, we find that Y = AX + B (with probability 1), as was to be proved.
Note: The converse of Theorem 7.12 also holds as is shown in Theorem 7.13.
Theorem 7.13. Suppose that X and Y are two random variables for which Y = AX + B, where A and B are constants. Then ρ² = 1. If A > 0, ρ = +1; if A < 0, ρ = −1.
Note: Theorems 7.12 and 7.13 establish the following important characteristic of the correlation coefficient: The correlation coefficient is a measure of the degree of linearity between X and Y. Values of ρ near +1 or −1 indicate a high degree of linearity, while values of ρ near 0 indicate a lack of such linearity. Positive values of ρ show that Y tends to increase with increasing X, while negative values of ρ show that Y tends to decrease with increasing values of X. There is considerable misunderstanding about the interpretation of the correlation coefficient. A value of ρ close to zero only indicates the absence of a linear relationship between X and Y. It does not preclude the possibility of some nonlinear relationship.
Hence

ρ = [E(XY) − E(X)E(Y)] / √(V(X) V(Y)).
As we have noted, the correlation coefficient is a dimensionless quantity. Its
value is not affected by a change of scale. The following theorem may easily be
proved. (See Problem 7.41.)
Just as we defined the expected value of a random variable X (in terms of its probability distribution) as ∫_{-∞}^{+∞} x f(x) dx or ∑_{i=1}^∞ x_i p(x_i), so we can define the conditional expectation of a random variable (in terms of its conditional probability distribution) as follows.

E(X | y) = ∫_{-∞}^{+∞} x g(x | y) dx.    (7.23)
Theorem 7.15.

E[E(X | Y)] = E(X),    (7.25)
E[E(Y | X)] = E(Y).    (7.26)

Proof (continuous case): By definition,

E(X | y) = ∫_{-∞}^{+∞} x g(x | y) dx = ∫_{-∞}^{+∞} x [f(x, y)/h(y)] dx.

Hence

E[E(X | Y)] = ∫_{-∞}^{+∞} E(X | y) h(y) dy = ∫_{-∞}^{+∞} [∫_{-∞}^{+∞} x f(x, y)/h(y) dx] h(y) dy.

If all the expectations exist, it is permissible to write the above iterated integral with the order of integration reversed. Thus

E[E(X | Y)] = ∫_{-∞}^{+∞} x [∫_{-∞}^{+∞} f(x, y) dy] dx = ∫_{-∞}^{+∞} x g(x) dx = E(X).

[A similar argument may be used to establish Eq. (7.26).] This theorem is very useful, as the following example illustrates.
EXAMPLE 7.22. Suppose that N, the number of parts arriving each day, is a random variable with the following probability distribution:

n:         10    11    12    13    14    15
P(N = n): 0.05  0.10  0.10  0.20  0.35  0.20

The probability that any particular part is defective is the same for all parts and
equals 0.10. If X is the number of defective parts arriving each day, what is the
expected value of X? For given N = n, X has a binomial distribution. Since
N is itself a random variable, we proceed as follows.
We have E(X) = E[E(X | N)]. However, E(X | N) = 0.10N, since for given
N, X has a binomial distribution. Hence

E(X) = 0.10E(N) = 0.10[10(0.05) + 11(0.10) + 12(0.10) + 13(0.20) + 14(0.35) + 15(0.20)] = 0.10(13.3) = 1.33.
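As an informal check of this use of Theorem 7.15, the following simulation sketch (added here for illustration; it is not part of the original example and assumes NumPy is available) draws N from the distribution above, draws X given N from a binomial distribution with parameter 0.10, and compares the sample mean of X with 0.10E(N) = 1.33:

```python
# Monte Carlo check of E(X) = E[E(X | N)] = 0.10 E(N) for the example above.
import numpy as np

rng = np.random.default_rng(1)
values = np.array([10, 11, 12, 13, 14, 15])
probs = np.array([0.05, 0.10, 0.10, 0.20, 0.35, 0.20])

n = rng.choice(values, size=200_000, p=probs)  # number of parts arriving, N
x = rng.binomial(n, 0.10)                      # defectives: X | N = n ~ binomial(n, 0.10)

print(x.mean())                       # approximately 1.33
print(0.10 * (values * probs).sum())  # exact value 0.10 E(N) = 1.33
```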
Theorem 7.16. Suppose that X and Y are independent random variables. Then E(X | y) = E(X) and E(Y | x) = E(Y).
plied, the company makes a profit of $0.03. If the demand exceeds the supply, the
company gets additional power from another source making a profit on this power
of $0.01 per kilowatt supplied. What is the expected profit during the specified
time considered?
Let T be this profit. We have

T = 0.03Y                    if Y ≤ X,
T = 0.03X + 0.01(Y − X)      if Y > X.

Evaluating the conditional expectation E(T | x) requires splitting the range of integration over the demand Y according to whether or not the demand exceeds x. For 10 < x < 20, for example, the resulting terms 0.015x² − 1.5 + 2 + 0.4x − 0.005x² − 0.02x² combine to −0.01x² + 0.4x + 0.5; the remaining ranges of x are handled in the same way. Therefore the expected profit is E(T) = E[E(T | X)], obtained by averaging E(T | x) over the distribution of X.
FIGURE 7.10  (a) E(Y | x)   (b) E(X | y)
FIGURE 7.11  (the semicircle, extending from x = −1 to x = 1)
EXAMPLE 7.24. Suppose that (X, Y) is uniformly distributed over the semicircle
indicated in Fig. 7.11. Then f(x, y) = 2/π, (x, y) ∈ semicircle. Thus

g(x) = ∫_0^{√(1−x²)} (2/π) dy = (2/π)√(1 − x²),   −1 ≤ x ≤ 1,

h(y) = ∫_{−√(1−y²)}^{+√(1−y²)} (2/π) dx = (4/π)√(1 − y²),   0 ≤ y ≤ 1.

Therefore

g(x | y) = f(x, y)/h(y) = 1/[2√(1 − y²)],   −√(1 − y²) ≤ x ≤ √(1 − y²),

h(y | x) = f(x, y)/g(x) = 1/√(1 − x²),   0 ≤ y ≤ √(1 − x²).

Hence

E(Y | x) = ∫_0^{√(1−x²)} y h(y | x) dy = ∫_0^{√(1−x²)} [y/√(1 − x²)] dy
         = y²/[2√(1 − x²)] evaluated between 0 and √(1 − x²) = √(1 − x²)/2.

Similarly

E(X | y) = ∫_{−√(1−y²)}^{+√(1−y²)} x g(x | y) dx = ∫_{−√(1−y²)}^{+√(1−y²)} [x/(2√(1 − y²))] dx
         = x²/[4√(1 − y²)] evaluated between −√(1 − y²) and +√(1 − y²) = 0.
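The regression function just obtained can be checked by simulation. The sketch below (an added illustration, not part of the text; it assumes NumPy) samples points uniformly over the semicircle by rejection and estimates E(Y | X ≈ x₀) for x₀ = 0.5, which should be close to √(1 − 0.25)/2 ≈ 0.433:

```python
# Monte Carlo estimate of the regression function E(Y | x) of Example 7.24.
import numpy as np

rng = np.random.default_rng(2)

# Sample uniformly over the semicircle x^2 + y^2 <= 1, y >= 0, by rejection.
x = rng.uniform(-1.0, 1.0, size=2_000_000)
y = rng.uniform(0.0, 1.0, size=2_000_000)
keep = x**2 + y**2 <= 1.0
x, y = x[keep], y[keep]

x0, eps = 0.5, 0.01
band = np.abs(x - x0) < eps          # points with X near x0
print(y[band].mean())                # estimate of E(Y | X close to x0)
print(np.sqrt(1.0 - x0**2) / 2.0)    # exact value sqrt(1 - x0^2)/2 = 0.433...
```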
It may happen that either or both of the regression curves are in fact straight
lines (Fig. 7.12). That is, E(Y | x) may be a linear function of x and/or E(X | y)
may be a linear function of y. In this case we say that the regression of the mean
of Y on X (say) is linear.
FIGURE 7.12
EXAMPLE 7.25. Suppose that (X, Y) is uniformly distributed over the triangle
indicated in Fig. 7.13. Then f(x, y) = 1, (x, y) ∈ T. The following expressions
for the marginal and conditional pdf's are easily verified:
g(x) = 2x,  0 ≤ x ≤ 1;    h(y) = (2 − y)/2,  0 ≤ y ≤ 2;

g(x | y) = 2/(2 − y),  y/2 ≤ x ≤ 1;    h(y | x) = 1/(2x),  0 ≤ y ≤ 2x.

Hence

E(X | y) = ∫_{y/2}^{1} x g(x | y) dx = ∫_{y/2}^{1} [2x/(2 − y)] dx = y/4 + 1/2,

and, similarly, E(Y | x) = ∫_0^{2x} y h(y | x) dy = x. Thus both regressions of the
mean are in fact linear. If the regression of the mean of Y on X is linear, say E(Y | x) =
αx + β, then we can easily express the coefficients α and β in terms of certain
parameters of the joint distribution of (X, Y). We have the following theorem.
FIGURE 7.13   FIGURE 7.14  (the triangle T, with vertex at (1, 2), and the regression line E(X | y))

Theorem 7.17. If the regression of the mean of Y on X is linear, then

E(Y | x) = μy + ρ(σy/σx)(x − μx).    (7.27)

If the regression of the mean of X on Y is linear, then

E(X | y) = μx + ρ(σx/σy)(y − μy).    (7.28)

Here μx = E(X), μy = E(Y), σx² = V(X), σy² = V(Y), and ρ is the correlation coefficient of X and Y.
Notes: (a) As is suggested by the above wording, it is possible that one of the regressions of the mean is linear while the other one is not.
(b) Note the crucial role played by the correlation coefficient in the above expressions.
If the regression of X on Y, say, is linear, and if ρ = 0, then we find (again) that E(X | y)
does not depend on y. Also observe that the algebraic sign of p determines the sign of
the slope of the regression line.
(c) If both regression functions are linear, we find, upon solving Eqs. (7.27) and (7.28)
simultaneously, that the regression lines intersect at the "center" of the distribution,
(μx, μy).
As we have noted (Example 7.23, for instance), the regression functions need
not be linear. However we might still be interested in trying to approximate the
regression curve with a linear function. This is usually done by appealing to the
principle of least squares, which in the present context is as follows: Choose the
constants a and b so that E[E(Y IX) - (aX + b)]2 is minimized. Similarly,
choose the constants c and d so that E[E(X I Y) - (c Y + d)]2 is minimized.
The lines y = ax + b and x = cy + d are called the least-squares approximations
to the corresponding regression curves E(Y | x) and E(X | y), respectively.
The following theorem relates these regression lines to those discussed earlier.
Theorem 7.18. Suppose that the regression of the mean of Y on X is linear, say E(Y | x) = a'x + b'. If y = ax + b is the least-squares approximation to E(Y | x) described above, then a = a' and b = b'. An analogous statement holds for the regression of X on Y.
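As an informal check, the sketch below (added for illustration, not part of the text; it assumes NumPy) samples (X, Y) from the triangle of Example 7.25, where E(Y | x) = x, and fits a least-squares line to the sample. Fitting to the raw (X, Y) pairs gives the same line as fitting to E(Y | X), since both criteria lead to the slope cov(X, Y)/V(X); the fitted slope and intercept should therefore be close to 1 and 0:

```python
# Least-squares line for Example 7.25, where the regression E(Y | x) = x is linear.
import numpy as np

rng = np.random.default_rng(3)

u = rng.uniform(size=500_000)
x = np.sqrt(u)                     # X has marginal pdf g(x) = 2x on [0, 1]
y = rng.uniform(0.0, 2.0 * x)      # given X = x, Y is uniform on (0, 2x)

a, b = np.polyfit(x, y, 1)         # least-squares slope and intercept
print(a, b)                        # approximately 1.0 and 0.0, i.e. the line y = x
```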
PROBLEMS
7.2. Show that E(X) does not exist for the random variable X defined in Problem 4.25.
7.3. The following represents the probability distribution of D, the daily demand of
a certain product. Evaluate E(D).
d:          1,   2,   3,   4,   5
P(D = d): 0.1, 0.1, 0.3, 0.3, 0.2
7.5. A certain alloy is formed by combining the melted mixture of two metals. The
resulting alloy contains a certain percent of lead, say X, which may be considered as a
random variable. Suppose that X has the following pdf:
Suppose that P, the net profit realized in selling this alloy (per pound), is the following
function of the percent content of lead: P = C1 + C2X. Compute the expected profit
(per pound).
7.6. Suppose that an electronic device has a life length X (in units of 1000 hours)
which is considered as a continuous random variable with the following pdf:
Suppose that the cost of manufacturing one such item is $2.00. The manufacturer sells
the item for $5.00, but guarantees a total refund if X ≤ 0.9. What is the manufacturer's
expected profit per item?
7.7. The first 5 repetitions of an experiment cost $10 each. All subsequent repetitions
cost $5 each. Suppose that the experiment is repeated until the first successful outcome
occurs. If the probability of a successful outcome always equals 0.9, and if the repetitions
are independent, what is the expected cost of the entire operation?
7.8. A lot is known to contain 2 defective and 8 nondefective items. If these items are
inspected at random, one after another, what is the expected number of items that must
be chosen for inspection in order to remove all the defective ones?
7.9. A lot of 10 electric motors must either be totally rejected or sold, depending on
the outcome of the following procedure: Two motors are chosen at random and inspected.
If one or more are defective, the lot is rejected. Otherwise it is accepted. Suppose
that each motor costs $75 and is sold for $100. If the lot contains 1 defective motor,
what is the manufacturer's expected profit?
7.10. Suppose that D, the daily demand for an item, is a random variable with the
following probability distribution:
P(D = d) = C · 2^d/d!,  d = 1, 2, 3, 4.
7.12. Suppose that X and Y are independent random variables with the following
pdf's:
7.14. A fair die is tossed 72 times. Given that X is the number of times six appears,
evaluate E(X2).
7.15. Find the expected value and variance of the random variables Y and Z of
Problem 5.2.
7.16. Find the expected value and variance of the random variable Y of Problem 5.3.
7.17. Find the expected value and variance of the random variables Y and Z of
Problem 5.5.
7.18. Find the expected value and variance of the random variables Y, Z, and W of
Problem 5.6.
7.19. Find the expected value and variance of the random variables V and S of
Problem 5.7.
7.20. Find the expected value and variance of the random variable Y of Problem 5.10
for each of the three cases.
7.21. Find the expected value and variance of the random variable A of Problem 6.7.
7.22. Find the expected value and variance of the random variable H of Problem 6.11.
7.23. Find the expected value and variance of the random variable W of Problem 6.13.
7.24. Suppose that X is a random variable for which E(X) = 10 and V(X) = 25.
For what positive values of a and b does Y = aX - b have expectation 0 and variance 1?
7.25. Suppose that S, a random voltage, varies between 0 and 1 volt and is uniformly
distributed over that interval. Suppose that the signal S is perturbed by an additive,
independent random noise N which is uniformly distributed between 0 and 2 volts.
(a) Find the expected voltage of the signal, taking noise into account.
(b) Find the expected power when the perturbed signal is applied to a resistor of 2 ohms.
7.26. Suppose that X is uniformly distributed over [-a, 3a]. Find the variance of X.
7.27. A target is made of three concentric circles of radii 1/√3, 1, and √3 feet. Shots
within the inner circle count 4 points, within the next ring 3 points, and within the third
ring 2 points. Shots outside the target count zero. Let R be the random variable
representing the distance of the hit from the center. Suppose that the pdf of R is
f(r) = 2/[π(1 + r²)], r > 0. Compute the expected value of the score after 5 shots.
x ≥ 0.
Let Y = X². Evaluate E(Y):
(a) directly without first obtaining the pdf of Y,
(b) by first obtaining the pdf of Y.
7.29. Suppose that the two-dimensional random variable (X, Y) is uniformly
distributed over the triangle in Fig. 7.15. Evaluate V(X) and V(Y).
7.30. Suppose that (X, Y) is uniformly distributed over the triangle in Fig. 7.16.
(a) Obtain the marginal pdf of X and of Y.
(b) Evaluate V(X) and V(Y).
FIGURE 7.15   FIGURE 7.16  (triangles; the labeled vertices include (−1, 3), (1, 3), (2, 4), and (2, 0))
7.31. Suppose that X and Y are random variables for which E(X) = μx, E(Y) = μy,
V(X) = σx², and V(Y) = σy². Using Theorem 7.7, obtain an approximation for E(Z)
and V(Z), where Z = X/Y.
7.32. Suppose that X and Y are independent random variables, each uniformly
distributed over (1, 2). Let Z = X/Y.
(a) Using Theorem 7.7, obtain approximate expressions for E(Z) and V(Z).
(b) Using Theorem 6.5, obtain the pdf of Z and then find the exact value of E(Z) and
V(Z). Compare with (a).
7.33. Show that if X is a continuous random variable with pdf f having the property
that the graph of f is symmetric about x = a, then E(X) = a, provided that E(X) exists.
(See Example 7.16.)
7.34. (a) Suppose that the random variable X assumes the values −1 and 1 each with
probability 1/2. Consider P[|X − E(X)| ≥ k√V(X)] as a function of k, k > 0. Plot
this function of k and, on the same coordinate system, plot the upper bound of the
above probability as given by Chebyshev's inequality.
(b) Same as (a) except that P(X = -1) = !, P(X = 1) = �·
7.35. Compare the upper bound on the probability P[|X − E(X)| ≥ 2√V(X)]
obtained from Chebyshev's inequality with the exact probability if X is uniformly
distributed over (−1, 3).
7.36. Verify Eq. (7.17).
7.37. Suppose that the two-dimensional random variable (X, Y) is uniformly
distributed over R, where R is defined by {(x, y) | x² + y² ≤ 1, y ≥ 0}. (See Fig. 7.17.)
Evaluate ρxy, the correlation coefficient.
FIGURE 7.17  (the upper half of the unit disk)
7.38. Suppose that the two-dimensional random variable (X, Y) has pdf given by
= 0, elsewhere.
7.39. The following example illustrates that p = 0 does not imply independence.
Suppose that (X, Y) has a joint probability distribution given by Table 7.1.
(a) Show that E(XY) = E(X)E(Y) and hence p = 0.
(b) Indicate why X and Y are not independent.
(c) Show that this example may be generalized as follows. The choice of the number
1 is not crucial. What is important is that all the circled values are the same, all the boxed
values are the same, and the center value equals zero.
TABLE 7.1

 y\x      −1      0       1
 −1     (1/8)   [1/8]   (1/8)
  0     [1/8]     0     [1/8]
  1     (1/8)   [1/8]   (1/8)

(The corner entries are the "circled" values and the edge entries the "boxed" values referred to in part (c).)
7.40. Suppose that A and B are two events associated with an experiment ε. Suppose
that P(A) > 0 and P(B) > 0. Let the random variables X and Y be defined as follows.
thing, using xg(x) and then solve the resulting two equations for A and for B.]
7.47. Suppose that both of the regression curves of the mean are in fact linear.
Specifically, assume that E(Y | x) = −(1/3)x − 2 and E(X | y) = −(1/2)y − 3.
(a) Determine the correlation coefficient p.
(b) Determine E(X) and E(Y).
7.48. Consider weather forecasting with two alternatives: "rain" or "no rain" in the
next 24 hours. Suppose that p = Prob(rain in next 24 hours) > 1/2. The forecaster
scores 1 point if he is correct and 0 points if not. In making n forecasts, a forecaster with
no ability whatsoever chooses at random r days (0 ≤ r ≤ n) to say "rain" and the
remaining n − r days to say "no rain." His total point score is Sₙ. Compute E(Sₙ)
and Var(Sₙ) and find that value of r for which E(Sₙ) is largest. [Hint: Let Xᵢ = 1 or 0
depending on whether the ith forecast is correct or not. Then Sₙ = Σ_{i=1}^{n} Xᵢ. Note
that the Xᵢ's are not independent.]