Introduction to Stata

Li-Pin Juan

Contents

1 Basics of Stata
  1.1 Example Commands
  1.2 Generating Subsets: the Use of by, in and if Qualifiers
  1.3 Group Average
  1.4 Creating Numerical Labels, and Coding Missing Values
  1.5 A Collection of Well-Used Functions
  1.6 Conversion Between Numeric and String Formats: Labeled-Numeric Variable Formation, and String Concatenation
  1.7 Explicit Subscripts of Variables
  1.8 Importing Data in Different File Formats
  1.9 Data Sets Combination
    1.9.1 Vertical combination of datasets
    1.9.2 Horizontal combination of datasets
    1.9.3 Checking duplicates (more details later)
  1.10 Retrieving Commands' Outputs
  1.11 Reshaping a Data Set
    1.11.1 From wide to long format: listing your data in country-year order
    1.11.2 From long to wide format
  1.12 Collapsing Data across Observations
  1.13 Manipulating Date Variables, and Deconstructing Numerical Variables
  1.14 Documenting your code, adding line breaks for readability and so on
  1.15 Variance Decomposition in Linear Regression Model
  1.16 Head Counting (and Displaying Group Average)
  1.17 Handling Duplicated Observations

2 Graph
  2.1 Example Commands
  2.2 Scatter Plot
  2.3 Bubble Plot
  2.4 Combining Two Graphs
  2.5 Combining Three Graphs with Legend
  2.6 Time Series Plot with Annotation
  2.7 Line plot with string markers
  2.8 Subplots in One Figure
  2.9 Horizontal Bar Plots in One Figure
  2.10 Grouped Horizontal Bar Plot
  2.11 Horizontal Bar Plot with Sorted Groups
  2.12 Bar Plot with Subgroups
  2.13 Grouped Bar Plot with Numerical Bar Labels

3 Programming
  3.1 Global and local macros
  3.2 Looping commands
  3.3 Installing User-Written Command
  3.4 Do-File Example: Weighted Sum

Reference
1 Basics of Stata

1.1 Example Commands

use oldfile
    Retrieves the previously saved Stata-format dataset oldfile.dta from disk.

use oldfile, clear
    Used when other data are currently in memory and are to be discarded without saving when loading oldfile.

save newdata, replace
    Saves the dataset in use to disk as newdata.dta. The replace option discards any existing file with the same filename. If no such file exists yet, use save newdata.

compress
    Converts all variables to their most efficient storage types to save memory and disk space. Usually called before the save command.

browse
    Opens the spreadsheet-like Data Browser for viewing the data (no editing allowed). [1]

browse price mileage if year > 1980
    Shows only the variables price and mileage for observations in which year is greater than 1980. The if qualifier is discussed later.

list
    Lists the data in table format. Use list, display to show the data in display format, which suits large datasets.

list x y in 10/50
    Lists the x and y values of the 10th through 50th observations, as currently ordered in memory.

list x y in -4/l
    Lists the x and y values of the fourth-from-last through the last observations. The character after the slash is a lowercase letter l, meaning "last".

[1] In the Data Editor or Data Browser, string variable values appear in red, distinguishing them from numeric (black) or labeled numeric (blue) variables.

tabulate x if y < 65
    Produces a frequency table for x using only observations for which y values are below 65.

tabulate var, gen(dum)
    Generates a whole set of dummy variables dum1, dum2, and so on. You can find out which values the dummy variables stand for from describe.

table var1 var2, c(mean var3)
    Shows a two-way table with the mean of var3 for each cell in the combination of var1 and var2.

table year, c(mean var3)
    Shows a one-way table with the mean of var3 for each year.

tabulate variable, missing
    Includes missing values in the frequency table, and so counts the number of missing observations of variable.

describe
    Describes each variable in Stata's memory, including its storage type, display format, and variable label, in addition to the number of observations.

codebook
    Examines the variable names, labels, and data to produce a codebook describing the dataset. It is a convenient way to get information about string variables.

label variable pop "Population in 1000s, 1982"
    Labels the variable pop.

rename var1 place
    Renames var1 as place.

drop _all
    Discards all the current data from memory.

dropmiss
    Drops from the dataset in memory any variable whose values are all missing. dropmiss is a user-written command and must be installed first.

dropmiss, obs
    Drops from the dataset in memory any observation that has missing values for every variable.

drop in 12/13
    Drops the 12th through the 13th observations.

drop sd_lw87-sd_lw02
    Drops the set of variables sd_lw87, sd_lw88, ..., sd_lw01, and sd_lw02.

drop sd_lw*
    Drops every variable whose name begins with sd_lw; here this is the same set as above.

encode stringvar, generate(numvar)
    Creates a new variable named numvar, with labeled numeric values based on the string variable stringvar.

format price %8.2f
    Establishes a fixed display format for the numeric variable price: 8-character width, with two digits always shown after the decimal point.

generate newvar = (x + y)/10
    Creates a variable named newvar, equal to the sum of x and y divided by 10.

correlate unemp mlife flife
    Gives the correlations among unemp, mlife and flife.
Relational operator    Meaning
==                     is equal to
!=                     is not equal to (~= also works)
>                      is greater than
<                      is less than
>=                     is greater than or equal to
<=                     is less than or equal to

Logical operator       Meaning
&                      and
|                      or
!                      not (~ also works)

drawnorm m1 m2 m3, n(500)
    Creates an artificial dataset with 500 observations, each with three normally distributed random variables.

generate newvar = uniform()
    Creates a variable with values sampled from a uniform distribution.

sample 10
    Drops all the observations in memory except for a 10% random sample.

sample 40, count
    Drops all but a random sample of size n = 40.

replace oldvar = 10 * oldvar
    Replaces the values of oldvar with 10 times their previous values.

set memory 500m
    Allocates 500 megabytes of memory for Stata data. Type clear before using this command.

sort x
    Sorts the data from lowest to highest values of x.

order place unemp mlife flife pop
    Controls the order of variables within a dataset.

drop if regexm(var1, "E.S.") > 0
    Drops observations according to a string match. In this case, we drop those observations whose variable var1 contains E.S. (in a regular expression, each unescaped period matches any single character).

1.2 Generating Subsets: the Use of by, in and if Qualifiers

list if unemp < 8 | (mlife >= 75 & flife >= 81)

tabulate vote if age >= 60 & age < .
    The condition age < . is a general way to select observations with nonmissing values. [2]

drop if place == "Canada" | pop < 150

keep place pop unemp

keep if place != "Canada" & pop >= 150

by var1: egen var2 = mean(var3)
    Equivalent to the command egen var2 = mean(var3), by(var1).

[2] Stata permits up to 27 different missing-value codes. The other 26 codes are represented internally as numbers even larger than the default code ".".
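The three qualifiers can be tried side by side. The following is a minimal sketch, assuming the auto dataset that ships with Stata (loaded with sysuse); the variable names mprice_a and mprice_b are made up for this illustration.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* in: restrict a command to a range of observations
list make price in 1/5

* if: restrict a command to observations meeting a condition;
* rep78 < . excludes missing values, which sort above every number
count if rep78 >= 4 & rep78 < .

* by: repeat a command within groups (bysort sorts the data first)
bysort foreign: egen mprice_a = mean(price)

* the by() option of egen is the equivalent form
egen mprice_b = mean(price), by(foreign)

* both forms should give identical group means
assert mprice_a == mprice_b
```

Writing rep78 >= 4 without the second condition would also count the observations in which rep78 is missing, which is why the age < . style of condition is worth the extra typing.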

Example 1 (Creating a set of dichotomies, each of which is coded either 0 or 1):

use canada2, clear [3]

tabulate type

tabulate type, generate(type)
    The tabulate command creates dummy variables automatically if we add the generate option.

list place pop type type1-type3
    Shows the result of re-expressing the categorical information stored in the labeled-numeric variable type as a set of dummy variables.

generate mlife2 = 0 if mlife < 75
    This command and the next one generate a dummy variable in an alternative way.

replace mlife2 = 1 if mlife >= 75 & mlife < .
    Add the condition mlife < . to retain the information concerning the missing values in the original dataset.

generate mlife3 = autocode(mlife, 3, 70, 76)
    Provides automatic grouping of measurement variables. It creates a new ordinal variable mlife3, which groups values of mlife into three equal-width groups over the interval from 70 to 76.

list place mlife mlife2 mlife3

     +-------------------------------------------------+
     |                 place   mlife   mlife2   mlife3 |
     |-------------------------------------------------|
  1. |                Canada    75.1        1       76 |
  2. |          Newfoundland    73.9        0       74 |
  3. |  Prince Edward Island    74.8        0       76 |
  4. |           Nova Scotia    74.2        0       76 |
  5. |         New Brunswick    74.8        0       76 |
     |-------------------------------------------------|
  6. |                Quebec    74.5        0       76 |
  7. |               Ontario    75.5        1       76 |
  8. |              Manitoba      75        1       76 |
  9. |          Saskatchewan    75.2        1       76 |
 10. |               Alberta    75.5        1       76 |
     |-------------------------------------------------|
 11. |      British Columbia    75.8        1       76 |
 12. |                 Yukon    71.3        0       72 |
 13. | Northwest Territories    70.2        0       72 |
     +-------------------------------------------------+

The table shows how the new dummy variable mlife2 and the ordinal variable mlife3 correspond to values of the original measurement variable mlife.

[3] Source: http://www.stata.com/bookstore/swsdl.html. Filename: Canada3.dta.
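The same three techniques can be reproduced on data that ships with Stata. This is a sketch only, assuming the auto dataset; the cut-off of 25 and the names rep, highmpg and mpg4 are arbitrary choices for illustration, not part of the original example.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* one dummy per category of rep78: creates rep1, rep2, ..., rep5
tabulate rep78, generate(rep)

* hand-built dummy: 0 below the cut-off, 1 at or above it;
* the mpg < . condition keeps missing values missing
generate highmpg = 0 if mpg < 25
replace highmpg = 1 if mpg >= 25 & mpg < .

* four equal-width groups over [10, 50]; each observation
* receives the upper bound of its interval (20, 30, 40 or 50)
generate mpg4 = autocode(mpg, 4, 10, 50)

tabulate mpg4 highmpg
```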

1.3 Group Average

Suppose the data set includes the variables income, denoted by inc, age by age, sex by sex, and year by year.

1. sort year
    Sort the data on the grouping variable before using any command with by.

2. by year: egen inc_male = mean(inc) if age >= 25 & age <= 60 & sex == 1

1.4 Creating Numerical Labels, and Coding Missing Values

Example 1:

1. generate gap = flife - mlife
2. label variable gap "Female minus male life expectancy gap"
3. format gap %4.1f

Example 2:

1. replace age = 29 in 1453
    Corrects an error in the value of age for observation number 1453.

2. replace age = 2008 - born if age >= . | age < 2008 - born
    Replaces values of the variable age with 2008 minus the year of birth, if age is missing or if the reported age is less than 2008 minus the year of birth.

Example 3 (Creating a set of numeric labels):

1. use wbdata, clear
2. generate type = 1
3. replace type = 2 if place == "USA" | place == "Canada"
4. replace type = 3 if place == "USA"
5. label variable type "Country Name"
6. label define typelab 1 "Others" 2 "Canada" 3 "USA"
7. label values type typelab
    label values specifies to which variable these labels apply (it assigns a set of labels, denoted by typelab, to the values of the numerical variable type).
8. list place flife mlife gap type

Example 4 (Coding missing values): [4]

[4] Source: http://www.stata.com/bookstore/swsdl.html. Filename: Granite_06_10s.dta.

1. tabulate novint

 Interest in Nov 2006 |
             election |      Freq.     Percent        Cum.
----------------------+-----------------------------------
 Extremely interested |        102       19.81       19.81
      Very interested |        174       33.79       53.59
  Somewhat interested |        171       33.20       86.80
  Not very interested |         60       11.65       98.45
           Don't know |          5        0.97       99.42
            No answer |          3        0.58      100.00
----------------------+-----------------------------------
                Total |        515      100.00

2. tabulate novint, nolabel
    Shows the numerical values underlying the value labels, without displaying the labels. The table shows that the codes for "Don't know" and "No answer" are 98 and 99. Left as they are, these codes skew the statistics of the dataset.

Interest in |
   Nov 2006 |
   election |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        102       19.81       19.81
          2 |        174       33.79       53.59
          3 |        171       33.20       86.80
          4 |         60       11.65       98.45
         98 |          5        0.97       99.42
         99 |          3        0.58      100.00
------------+-----------------------------------
      Total |        515      100.00

3. generate novint2 = novint
    Creates a copy of novint, named novint2, in which 98 and 99 will be recoded as separate extended missing values.

4. mvdecode novint2, mv(98=.c \ 99=.d)
    Recodes the novint2 values 98 and 99 into the missing-value codes .c and .d. [5] In other cases, you may use a list of variables in place of novint2.

5. tabulate novint2, missing

    novint2 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        102       19.81       19.81
          2 |        174       33.79       53.59
          3 |        171       33.20       86.80
          4 |         60       11.65       98.45
         .c |          5        0.97       99.42
         .d |          3        0.58      100.00
------------+-----------------------------------
      Total |        515      100.00

    In Stata, missing values do not enter into calculations of statistics such as means or correlations, which removes the inflated mean produced by the original coding.

6. summarize
    It is clear that the original variables edlevel, hincome, and favbush contain the same problematic values as novint. The same procedure can be applied to them.

[5] These may stand for different reasons the values are missing, such as responses of "Refused to answer" or "Don't know", and "Not applicable" on a questionnaire.

1.5 A Collection of Well-Used Functions

generate expinc = exp(income)
    Creates a new variable named expinc, equal to the exponential of income.

Table 1: Example Functions

command               function
abs(x)                absolute value of x
exp(x)                exponential (e to the power x)
normal(z)             cumulative standard normal
int(x)                truncating x toward zero
ln(x)                 natural (base e) logarithm
date(s1, s2 [, y])    elapsed date corresponding to s1
mdy(M, D, Y)          elapsed date corresponding to M, D, Y
normalden(x, m, s)    normal density with custom mean m and standard deviation s
sum(x)                running sum of x, treating missing values as zero

display exp(2)+10
    Performs a single calculation and displays the result onscreen.

cond(x, a, b)
    Returns a if x is evaluated to be true, and b if false.

autocode(x, n, xmin, xmax)
    Forms categories from x by partitioning the interval [xmin, xmax] into n equal-distance intervals.

generate y = cond(var1 > var2, var1, var2)
    Creates a variable y as the maximum of var1 and var2.

summarize
    Generates a set of statistics temporarily stored in Stata. To retrieve the values of these statistics, type, for example, r(mean) for the resulting mean.

Example 1 (Retrieve the statistics after summarize): [6]

summarize unemp

return list
    Retrieves the list of statistics after summarize.

generate unempdiff = unemp - r(mean)

[6] Source: http://www.stata.com/bookstore/swsdl.html. Filename: canada1.dta.
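The pattern of running summarize and then reading r(mean) can be checked on any numeric variable. The sketch below assumes the auto dataset that ships with Stata and uses mpg in place of unemp; mpgdiff is a name chosen for illustration.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* summarize leaves its results behind in r()
summarize mpg

* show everything summarize stored: r(N), r(mean), r(sd), r(min), ...
return list

* center mpg on its sample mean
generate mpgdiff = mpg - r(mean)

* the centered variable should average to (numerically) zero
summarize mpgdiff
display r(mean)
```

Because r() is overwritten by the next command that stores results in r(), use the stored values or copy them into a macro or scalar right after summarize.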

Example 2 (Groups and subsets of the data):

egen stdpop = std(pop)
    Creates a new variable named stdpop, equal to the standardized values of pop, for every observation in the sample.

bysort place: egen mlifeMed = median(mlife)
    Calculates a new variable named mlifeMed, equal to the median of mlife within each subgroup sharing the same place value.

egen avg = rowmean(w x y z)
    Creates a variable avg, equal to the row mean of each observation's values on w, x, y, and z. The variables are listed with spaces, not commas.

egen total = rowtotal(w x y z)

egen xrank = rank(x)
    Creates a new variable xrank, holding ranks corresponding to values of x. By default xrank = 1 for the observation with the lowest x; add the field option to have xrank = 1 for the observation with the highest x, xrank = 2 for the second highest, and so on.
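A short run of these egen functions shows what each one stores. This is a sketch under the assumption that the auto dataset shipped with Stata is available; all new variable names are made up for the illustration.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* standardized values: mean 0, standard deviation 1
egen stdprice = std(price)

* group statistic: median mpg within each level of foreign
bysort foreign: egen medmpg = median(mpg)

* row statistic across several variables (space-separated varlist)
egen avgdim = rowmean(length turn)

* default ranking starts from the lowest value (ties share an average rank)
egen rk_low = rank(mpg)

* field ranking starts from the highest value
egen rk_high = rank(mpg), field

list make mpg rk_low rk_high in 1/5
```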

1.6 Conversion Between Numeric and String Formats: Labeled-Numeric Variable Formation, and String Concatenation

Example:

1. use http://www.stata-press.com/data/r10/auto, clear

2. list make foreign
    Dataset auto.dta contains one string variable, make, and also a labeled categorical variable, foreign. Beneath the labels, foreign remains a numeric variable, indicated by a blue font in the Data Browser.

3. list make foreign, nolabel
    Shows the underlying numbers beneath the labels.

4. encode make, generate(makenum)
    Generates a labeled-numeric variable makenum from the string variable make.

5. decode foreign, generate(foreignstr)
    Generates a string variable from a labeled-numeric variable.

6. list make makenum foreign foreignstr

7. list make makenum foreign foreignstr, nolabel

8. summarize make makenum foreign foreignstr
    For calculation purposes, derived string variables do not matter. Only labeled-numeric and numeric variables enter into the calculation.

9. drop if regexm(var1, "E.S.") > 0
    Drops observations based on the outcome of a string comparison.

10. generate var2 = regexr(var1, "[.0-9]+", "")
    Extracts the string part from a combination of strings and numbers.

11. egen mean_score = rowmean(*_score)
    Automatically takes all the variables with the suffix _score into account.

12. tostring year month, replace
    Converts variables from numeric to string format. See 1.11.2 for the instructions.

13. generate date = year + "_0" + month if length(month) == 1
    Concatenates a set of string variables. See 1.11.2 for the instructions.

14. gen var2 = regexr(var1, "[.\}\)\*a-zA-Z]+", "")
    Removes or replaces strings in var1. Here all letters and the listed special characters are replaced with nothing.
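Several of the steps above can be run together on a small scale. The sketch below assumes the auto dataset that ships with Stata; the year and month variables are hypothetical and are generated here only so that the string concatenation has something to work on.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* string -> labeled numeric, and labeled numeric -> string
encode make, generate(makenum)
decode foreign, generate(foreignstr)

* hypothetical numeric date parts, created for illustration only
generate year = 2003
generate month = mod(_n, 12) + 1

* numeric -> string, so that + concatenates instead of adding
tostring year month, replace

* pad one-digit months with a zero: 2003_01, ..., 2003_12
generate date = year + "_0" + month if length(month) == 1
replace date = year + "_" + month if date == ""

list make makenum foreignstr date in 1/3
```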
Example 1 (Converting string variables embedded with numeric characters):

use http://www.ats.ucla.edu/stat/stata/faq/hsbs, clear

list in 6/15
    You will see that the string variable race is more of a numerical variable in which some observations hold non-numerical values. Use real() to convert the string variable race to a numerical variable that Stata can recognize as such.

     +---------------------------------------------------+
     |  id   gender   race   schtyp   read   science     |
     |---------------------------------------------------|
  6. | 113        m      1      pub     44        63     |
  7. |  50        m      3      pub     50        53     |
  8. |  11        m      2      pub     34        39     |
  9. |  84        m      1      pub     63         .     |
 10. |  48        m      3      pub     57        50     |
     |---------------------------------------------------|
 11. |  75        m      1      pub     60        53     |
 12. |  60        m      X      pub     57        63     |
 13. |  95        m      1      pub     73        61     |
 14. | 104        m      1      pub     54        55     |
 15. |  38        m      3      pub     45        31     |
     +---------------------------------------------------+

generate read_n = real(read)
    real() translates numeric values stored as strings into numeric values.

destring read, generate(readnum) force
    The destring command provides a more flexible way of converting string variables to numeric. This line accomplishes the same thing as above.

destring, replace
    Alternatively, you may go all out and reach the same result for all variables except for race, gender and schtyp. Since these variables have characters in them, the destring command leaves them alone.

Note: The difference between real(string_var) and destring is that if the function real() encounters a non-numeric value, it sets the variable equal to missing in that case and moves on. With the ignore() option, destring removes the specified non-numeric characters and moves on, which means that, for example, a4 can be converted to 4. This property of destring comes in handy when one has numeric values stored as strings that occasionally contain things like commas (e.g. 1,234).

destring race, replace ignore("X")
    To make destring do what real() does, we need to tell Stata which character(s) should be ignored by adding ignore("to_be_ignored_characters").
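The contrast between real(), destring with force, and destring with ignore() is easiest to see on a handful of typed-in values. This is a minimal sketch on made-up data; the variable names raw, r, d_force and d_ign are hypothetical.

```stata
* Four made-up string values: one clean number and three problem cases
clear
input str5 raw
"12"
"a4"
"1,234"
"X"
end

* real(): anything that is not a clean number becomes missing
generate r = real(raw)

* destring with force: same outcome as real() for these values
destring raw, generate(d_force) force

* destring with ignore(): strip the listed characters first,
* so a4 becomes 4 and 1,234 becomes 1234; X is left empty, hence missing
destring raw, generate(d_ign) ignore("a,X")

list
```

Stripping characters is convenient for thousands separators, but it also turns a4 into 4 without complaint, so it is worth listing the affected observations before choosing between the two behaviors.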

1.7 Explicit Subscripts of Variables

use http://www.stata-press.com/data/r10/auto, clear

generate obsID = _n
    Creates a new variable obsID, equal to the position number of each observation as presently sorted.

sort obsID
    If the data are later rearranged in another order, we can return to the earlier order recorded in obsID by issuing this command. Creating and saving unique case identification numbers that store the order of observations at an early stage of dataset development greatly facilitates later data management.

display mpg[3]
    Displays the mileage in the 3rd observation.

generate diffvar1 = var1 - var1[_n-1]
    In time series analysis, we may issue a command in a similar format to generate a new variable diffvar1, equal to the change in var1 since the previous day.

1.8 Importing Data in Different File Formats


Sometimes, careful editing of the spreadsheet to be loaded into Stata might be needed. For example, it is desirable to insert a row of variable names (single words), one for each column, atop the first observation in our spreadsheet. In doing so, Stata should automatically recognize that first row as the variable names.

infile variable-list using filename.raw
    Reads into memory a purely numerical ASCII file, such as filename.raw, in which the values are separated by space(s). variable-list is optional, and applies when you want to assign a list of names to the imported variables. infile can handle string values only if they are enclosed in double quotes, whether or not spaces are embedded within them.

infile str18 make price mpg rep78 using auto.raw, clear
    If any string variable exists, a str# statement needs to precede its variable name. Here, str18 indicates that the variable make is a string of up to 18 characters.

infile gender age income using filename.raw, automatic
    If our variables have non-numerical values (say, male or female in variable gender), adding the option automatic will store them as labeled numeric variables.

Table 2: Append

    one.dta     id      var1   var2   var3
                1991
                1992
                1993
       +
    two.dta     id      var1   var2   var3
                1994
                1995
                1996

insheet variable-list using filename.raw, comma names
    Issue this command to load comma-delimited spreadsheet-like data in which the first row of the file contains a single-word variable name for each column (i.e. the column headings). Note that insheet cannot handle a file that uses a mixture of commas and tabs as delimiters.

insheet variable-list using filename.raw, tab
    Issue this command to load tab-delimited data written by spreadsheet programs, such as Excel. If no variable names are given in the first row, Stata automatically assigns the variable names var1, var2, and so on.

infix str make 1-13 mpg 15-16 weight 18-21 price 23-26 using auto.raw, clear
    Use this command if the dataset is in fixed-column format, where the values are not necessarily delimited at all, but occupy predefined column positions. For example,

    AMC Spirit    22 2640 3799
    Buick Century 20 3250 4816

    The str statement preceding the first variable make is needed to tell Stata that this variable is of string format. In a fixed-format dataset, Stata interprets blanks as missing values.

compress
    Once data are loaded in memory, issue this command to ensure that no variable takes up more space than it needs.

1.9 Data Sets Combination

1.9.1 Vertical combination of datasets

As long as the variables in two files are the same and the only thing you need to do is to add observations from one file to the other file, this is vertical combination.

use one
append using two
    (see Table 2)

Table 3: One-to-one merge

    one.dta     id   var1   var2   var3
                2
                .
                .
                .

    two.dta     id   var4   var5   var6
                2
                .
                .
                .

1.9.2 Horizontal combination of datasets

Example: a dangerous one

use one, clear
merge 1:1 _n using two
    (see Table 3)

Never do this, because you are assuming that every observation in one.dta matches perfectly with the corresponding observation in two.dta. It is likely that observation 2 in one.dta is Bob and observation 2 in two.dta is Mary.

Example: including the id variable

use one, clear
merge 1:1 id using two

Example: check uniqueness of id variables

use one, clear            // test 1
sort id
by id: assert _N==1       // verify that every id occurs exactly once
use two, clear            // test 2
sort id
by id: assert _N==1

We should even check whether each combination of a set of id variables occurs only once.

sort id sex
by id sex: assert _N==1

Example: introduce more than one identification variable

If the two datasets have more than one variable in common, it is desirable to use more of them as id variables, to avoid the problem that Bob in one dataset, however he is identified, means Mary in the other. In this case, we can introduce the common variable sex.

use one, clear
merge 1:1 id sex using two    // the id variable is extended to a combination of two variables

It is undesirable to code

merge 1:1 id sex using two, keep(matched)

because it discards the unmatched information needed to conduct test 3 below.

sort id               // test 3: check whether "Bob, female" or "Mary, male" occurs, and thus is counted as a unique observation in the merged file
by id: assert _N==1
keep if _merge==3
generate u = uniform()    // test 4: look at a random sample after the merge (u is an arbitrary scratch variable name)
sort u
list in 1/5
drop u

Example: check for outlandish values in the merged data

gen diff = income2010 - income2009    // the example for time series data

or

gen diff = exports - imports          // the example for cross-section data

sort diff
list in 1/5       // this line and the next are used to check outliers on variable diff
list in -5/l
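The checks above can be rehearsed end to end on two tiny datasets. This is a sketch only: the data are made up, the tempfile names one and two merely echo the file names used in the text, and the code is meant to be run as a do-file so that the local macros created by tempfile stay in scope.

```stata
* First made-up dataset, keyed on id
clear
input id str6 name
1 "Bob"
2 "Mary"
3 "Ann"
end
sort id
by id: assert _N == 1        // test: every id occurs exactly once
tempfile one
save `one'

* Second made-up dataset, also keyed on id
clear
input id income
2 30
3 45
4 50
end
sort id
by id: assert _N == 1
tempfile two
save `two'

* Merge on the key rather than on observation order
use `one', clear
merge 1:1 id using `two'

* Inspect matched and unmatched rows before dropping anything
tabulate _merge
list

* Keep only the ids present in both files
keep if _merge == 3
drop _merge
```

With these inputs, id 1 exists only in the first file and id 4 only in the second, so the tabulation of _merge shows all three categories before the final keep.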
Example: many-to-one merge

use sample, clear                     // test 1
sort personid
by personid: keep if _n==1            // don't skip this step
merge 1:m personid using payroll, keep(master match)
    Checks to see if every personid that appears in sample.dta also appears in payroll.dta.

use sample, clear                     // test 2
sort division date
by division date: keep if _n==1
merge 1:m division date using payroll, keep(master match)
    Checks to see if every division that appears in sample.dta also appears in payroll.dta.

With three key variables, the possible pairs are (personid, date), (personid, division), and (division, date). We may follow the same procedure to examine which pair leads to the data that lose the least information due to the merge, and to spot potential inconsistencies in the files.

use sample, clear
sort division date
by division date: keep if _n==1
merge 1:m division date using payroll, keep(master match)

1.9.3 Checking duplicates (more details later)

egen c = count(_n), by(id year month)
    Counts the number of observations for each combination of id, year, and month.

list if c > 1

browse if c > 1

egen tag = tag(id year month)
    Creates an indicator (a dummy variable) which will be 1 for only one observation per combination of id, year and month, and 0 for all other observations of the same combination.

keep if tag == 1
    Keeps the first observation in each id-year-month combination.

drop tag

1.10 Retrieving Commands' Outputs

return list
    Shows which calculations the summarize command saved.

ereturn list
    Shows the outcome of a regression.

matrix list e(b)
    Lists the vector of estimated coefficients.

matrix list e(V)
    Lists the variance-covariance matrix of the estimates.

display _se[x]
    Shows the standard error of the coefficient on variable x.

display 3*_b[x]
    Shows the result of 3 times the coefficient of variable x.

Figure 1: Raw data to be reshaped, with undefined missing-value symbol

When programming, it can be useful to remember that Stata saves the results of the latest summarize command. Among the ones easily accessible are: _result(1), the number of observations; _result(2), the sum of weights; _result(3), the mean; _result(4), the variance; _result(5), the minimum observation; and _result(6), the maximum observation. Typing display _result(3) after running the summarize command on var1, for example, would tell Stata to display the mean of var1. Besides, when using the count command, we can use r(N) instead of _result(1) to access the count.
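The commands of this section can be tried after any regression. The following is a sketch, assuming the auto dataset that ships with Stata and an arbitrary regression of price on mpg and weight; it reads the summarize results through r(), the form used elsewhere in this document.

```stata
* Assumption: the example dataset auto.dta installed with Stata
sysuse auto, clear

* any estimation command fills e(); this model is only an illustration
regress price mpg weight

* everything the regression stored: e(N), e(r2), e(b), e(V), ...
ereturn list

* coefficient vector and its variance-covariance matrix
matrix list e(b)
matrix list e(V)

* single coefficients and standard errors by variable name
display _b[mpg]
display _se[mpg]
display _b[mpg] / _se[mpg]     // t statistic for mpg

* summarize stores its results in r(), leaving e() untouched
summarize price
display r(mean)
```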

1.11 Reshaping a Data Set

In this section, we reshape datasets prepared by economic organizations, such as the World Bank and the International Monetary Fund. Figure 1 shows the sample file.

1.11.1 From wide to long format: listing your data in country-year order

Before putting these data in Stata, we need to (1) add a character to the column headings, and (2) replace non-numerical records (such as .. indicating a missing value) in any numerical variable with a blank. Both tasks can easily be completed through Excel's Find and Replace dialogue window after highlighting the cells taken by the column headings.

Step 1: Make sure the numbers are numbers. Go to Format Cells, select Number in the Number tab and click OK. Then save the file in *.csv format. We name it gdp.csv (see Figure 2).

Figure 2: Removing missing-value symbols, and adding column headings

insheet using gdp.csv
    It is clear that, due to the inserted blanks under columns x1995 and x1996, the data in both x1995 and x1996 are stored as string variables.

generate ax1995 = real(x1995)
generate ax1996 = real(x1996)
drop x1995 x1996
rename ax1995 x1995
rename ax1996 x1996

Figure 3: Processed data after being converted from long to wide form

Now, to reshape our data from the wide form to the long one, the procedure is as follows:

generate id = _n

reshape long x, i(id) j(year)
    The keyword long specifies going from wide to long format. The variables with the prefix x are to be converted from wide to long. i(id) is a unique identifier for the wide format. j(year) indicates that the suffix of x should be put in a variable called year. If you have more than one variable you can list them as follows: reshape long x y z, i(id) j(year).

Figure 4: Data in long form, with a column variable mixed with var1 and var2

Step 2: What remains is that the column variable (see Figure 4) consists of two groups of string values, var1 and var2. It requires further steps to have each pair of values of var1 and var2 correctly associated with the corresponding combination of year and country. The following procedure separates these two stacked variables.

1. encode variable, gen(numvar)
    Creates a new variable with the (numerical) labels of each variable.

2. label save numvar using varname, replace
    You will find that varname.do is created.

Now, open your text editor to change the file varname.do from

    label define numvar 1 `"var1"', modify
    label define numvar 2 `"var2"', modify

to, by labeling var1 as x1, and var2 as x2,

    label variable x1 `"var1"'
    label variable x2 `"var2"'

so that the labels and corresponding values of the labeled-numeric variable numvar can later be assigned to two variables, var1 and var2, separately.

Step 3:

1. egen id2 = group(country year)
2. move id2 year
3. drop id
4. drop variable
5. reshape wide x, i(id2) j(numvar)
    In doing so, each country has two variables from 1960 to 2006, x1 and x2, holding the values that were originally stored in var1 and var2. Imagine an arrangement that, for each country-year combination, stores the values of var1 under column x1, and those of var2 under column x2.
6. Run varname.do to change the labels for x1 and x2. Figure 5 shows what we finally get.
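The core reshape step can be tested on a toy version of the wide file. This is a minimal sketch with made-up numbers; only the variable names x1995, x1996, id and year follow the text.

```stata
* Toy wide-format data: one row per country, one x column per year
clear
input str10 country x1995 x1996
"Canada" 10 11
"USA"    20 .
end

* reshape needs a variable that uniquely identifies each wide row
generate id = _n

* wide -> long: the suffix of x (1995, 1996) moves into the new variable year
reshape long x, i(id) j(year)
list, sepby(id)

* long -> wide restores the original layout
reshape wide x, i(id) j(year)
list
```

Variables such as country that are constant within id are carried along automatically in both directions.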

1.11.2 From long to wide format

Below is a more complex example in which the data are in long form and we would like to change them to wide form (see Figure 6).

1. use return, clear
   Be sure to drop all the records with id = 5.
2. tostring month year, replace
   Notice that time is separated into years and months (monthly data). We need to combine them into one date variable.
3. generate date = year + "_0" + month if length(month) == 1
   This and the next line create the date variable holding the value of, for example, 2003_01, to indicate that the data are dated January 2003.
4. replace date = year + "_" + month if date == ""
5. drop year month
6. order id date
7. reshape wide r i, i(id) j(date) string
   We add string because date is a string variable.

Figure 5: Data in country-year order

Figure 6: Data set to be converted from long to wide format

[Figure ??]
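The long-to-wide steps above can be sketched on a tiny invented panel. The values of r and i below are assumptions made up for illustration; only the command sequence follows the steps in the text.

```stata
* Toy long data: returns (r) and interest rates (i) by id, year and month
clear
input id year month r i
1 2003 1  2 1.5
1 2003 10 1 1.25
2 2003 1  3 1.5
2 2003 10 0 1.25
end

* Build a string date such as 2003_01 or 2003_10
tostring month year, replace
generate date = year + "_0" + month if length(month) == 1
replace date = year + "_" + month if date == ""
drop year month
order id date

* j(date) is a string variable, so the string option is required
reshape wide r i, i(id) j(date) string
describe
```

The result has one row per id and the columns r2003_01, i2003_01, r2003_10 and i2003_10.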

The resulting dataset shows the return and the interest rate together for the same period. If you want to have all returns together and all interest rates together, you need to take one more step:

1. xpose, clear varname
   Transposes the data and generates a string variable _varname.
2. sort _varname
3. xpose, clear
4. order id r* i*
5. outsheet id r* using testr.csv, comma replace
   Exports the return data into testr.csv.
6. outsheet i* using testr.csv, comma replace
   Then exports the interest rate data.

1.12 Collapsing Data across Observations

Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but you really want classroom data, or you might have weekly data but you want monthly data, etc.

1. use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
2. list

        famid   kidname   birth   age   wt   sex
   1.       1      Beth       1     9   60     f
   2.       1       Bob       2     6   40     m
   3.       1      Barb       3     3   20     f
   4.       2      Andy       1     8   80     m
   5.       2        Al       2     6   50     m
   6.       2       Ann       3     2   20     f
   7.       3      Pete       1     6   60     m
   8.       3       Pam       2     4   40     f
   9.       3      Phil       3     2   20     m

3. collapse age, by(famid)
   Collapses across all the observations to make a single record with the average age of the kids in each family.
4. collapse (mean) avgage=age, by(famid)
   Used instead of the command in step 3, this produces the same thing as above, except that the average of age is named avgage and the computation of the mean is requested explicitly.

        famid     avgage
   1.       1          6
   2.       2   5.333333
   3.       3          4

5. collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
   After reloading the same dataset, this command gets the average of age and wt, and the number of kids numkids, for each family.
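In case the UCLA address is not reachable, the collapse commands above can be reproduced offline. The sketch below types the nine observations from the listing in by hand with input (the string widths str4 and str1 are assumptions), then runs the last collapse command.

```stata
* Re-enter the kids data shown in the listing
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* One record per family: mean age, mean weight, and number of kids
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)
list
```

Note that collapse replaces the data in memory, which is why the text reloads the dataset before each new collapse.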

Suppose we want a count of the number of boys and girls in each family. The procedure is as follows: (1) create a dummy variable for boys and one for girls, each holding 1 if true and 0 if not; (2) the sum of the boy (girl) dummy variable is the number of boys (girls).

1. use http://www.ats.ucla.edu/stat/stata/modules/kids, clear
2. tabulate sex, generate(sexdum)

        famid   sex   sexdum1   sexdum2
   1.       1     f         1         0
   2.       1     m         0         1
   3.       1     f         1         0
   4.       2     m         0         1
   5.       2     m         0         1
   6.       2     f         1         0
   7.       3     m         0         1
   8.       3     f         1         0
   9.       3     m         0         1

3. collapse (count) numkids=birth (sum) girls=sexdum1 boys=sexdum2, by(famid)

        famid   boys   girls   numkids
   1.       1      1       2         3
   2.       2      2       1         3
   3.       3      2       1         3

1.13 Manipulating Date Variables, and Deconstructing Numerical Variables

The trick to inputting dates in Stata is to forget they are dates, treat them as character strings, and later convert them into a Stata date variable. For example, you have dates1.raw looking like

John 1 Jan 1960
Mary 11 Jul 1955
Kate 12 Nov 1962
Mark 8 Jun 1959

1. infix str name 1-4 str bday 6-17 using dates1.raw
   Reads the data above. Since bday is a string variable, you cannot do any kind of date computation with it until you make a date variable from it. You can generate a date version of bday using the date() function.
2. generate birthday = date(bday, "DMY")
   Creates a date variable called birthday from the character variable bday. Dates are actually stored as the number of days from Jan 1, 1960, which is convenient for storing dates and performing date computations. generate birthday = date(bday, "DMY", 2008) tells Stata how to resolve the century of two-digit years: 2008 is the last year in which such a date may fall.
3. format birthday %d
   Tells Stata that birthday should be displayed using the %d format to make it easier for humans to read.
4. generate birth_quarter = qofd(birthday)
   Transforms the date variable above into a quarterly date.
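The same steps can be run without the raw file by typing the four records in with input. This is a minimal sketch; the string widths and the use of %td and %tq (the current spellings of the daily and quarterly display formats) are choices made here, not taken from the text.

```stata
* Enter the dates1.raw records directly
clear
input str4 name str11 bday
"John" "1 Jan 1960"
"Mary" "11 Jul 1955"
"Kate" "12 Nov 1962"
"Mark" "8 Jun 1959"
end

* String to daily date: days elapsed since 1 Jan 1960
generate birthday = date(bday, "DMY")
format birthday %td

* Daily date to quarterly date
generate birth_quarter = qofd(birthday)
format birth_quarter %tq
list
```

John's birthday is stored as 0 because it falls exactly on the base date, and Mary's is negative because it precedes it.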
Even for datasets with messy dates, such as the one below, Stata can handle them well:

John Jan 1 1960
Mary 07/11/1955
Kate 11.12.1962
Mark Jun/8 1959

1. infix str name 1-4 str bday 6-17 using dates2.raw
2. generate birthday = date(bday, "MDY")
3. list
4. format birthday %d

Continue to experiment on a dataset with an even messier collection of string dates:

4-12-1990
4.12.1990
Apr 12, 1990
Apr12,1990
April 12, 1990
4/12.1990
Apr121990

1. infix str bday 1-20 using dates3.raw
2. generate birthday = date(bday, "MDY")
3. list
4. format birthday %d

        bday              birthday
   1.   4-12-1990        12apr1990
   2.   4.12.1990        12apr1990
   3.   Apr 12, 1990     12apr1990
   4.   Apr12,1990       12apr1990
   5.   April 12, 1990   12apr1990
   6.   4/12.1990        12apr1990
   7.   Apr121990                .

Note that Stata was able to handle Apr12,1990 even though there was no delimiter between the month and day. The only date that did not work was Apr121990, and that is because there was no delimiter between the day and year. As you can see, the date() function can handle just about any date as long as there are delimiters separating the month, day, and year.
On the other hand, we may have the month, day, and year stored as numeric variables in a dataset. For example, look at the dataset dates4.raw below:

7 11 1948
1 1 1960
10 15 1970
12 10 1971

1. infix month 1-2 day 4-5 year 7-10 using dates4.raw
2. generate birthday = mdy(month, day, year)
3. format birthday %d

What if the year is stored with its first two digits omitted, for example 70 for 1970? The dataset dates5.raw below contains observations recorded in 1948, 1960, 1970, and 1971:

7 11 48
1 1 60
10 15 70
12 10 71

1. infix month 1-2 day 4-5 year 7-10 using dates5.raw
2. generate birthday = mdy(month, day, year+1900)
3. format birthday %d
4. list
5. gen m = month(birthday)
   Conversely, we can have the month, day, year, and the day of the week returned separately from a given date variable.
6. gen d = day(birthday)
7. gen y = year(birthday)
8. gen weekday = dow(birthday)
9. gen age2000 = (mdy(1,1,2000)-birthday)/365.25
   Calculates everyone's age on January 1, 2000.
10. gen age2000alt = floor(((ym(2000, 1) - ym(year(birthday), month(birthday))) - (1 < day(birthday))) / 12)
   Provides an alternative way to calculate one's age in whole years, counting completed months instead of dividing by an average year length. Note that Stata expressions group terms with parentheses, not square brackets.
11. To deconstruct date variables using substr(), go to the following link for details (scroll up on the web page to see the corresponding example).
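The date functions in the list above can be checked on the dates5.raw records typed in by hand. This is a sketch under the same assumption as the text, namely that every two-digit year belongs to the 1900s.

```stata
* Enter month, day and two-digit year as numbers
clear
input month day year
7 11 48
1 1 60
10 15 70
12 10 71
end

* Build the daily date, then take it apart again
generate birthday = mdy(month, day, year + 1900)
format birthday %td
generate m = month(birthday)
generate d = day(birthday)
generate y = year(birthday)
generate weekday = dow(birthday)    // 0 = Sunday, ..., 6 = Saturday

* Age in completed years on 1 Jan 2000, using whole months
generate age2000 = floor((ym(2000, 1) - ym(y, m) - (1 < d)) / 12)
list birthday m d y weekday age2000
```

The person born on 1 Jan 1960 is exactly 40 on 1 Jan 2000, while the one born on 15 Oct 1970 is still 29.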

1.14 Documenting your code, adding breaking lines for readability and so on

Comments can begin with an asterisk (*) and end with a carriage return, or they can begin with two slashes (//) and end with a carriage return, or they can be bracketed by (/*) and (*/) and span as many lines as needed.

In a do-file (the details are given later), Stata assumes that each command is no more than one line long, and that each line ends with a carriage return (when you press the Enter key, a text editor inserts a carriage return symbol).

If you want to type a command more than one line long, you need to tell Stata to look for a semicolon, with #delimit ;, instead of a carriage return. From that point on, you must end each command, whether one or more lines long, with a semicolon. To switch back to carriage return, use #delimit cr.

There are other ways to continue a single command across more than one line. One way is to comment out the carriage return: type /* at the end of one line, and */ at the beginning of the next line (to end the comment).

Another way, available in version 8 and later, is to end a line with ///, which tells Stata to continue reading the next line as a continuation of this line.
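The comment and continuation styles described above are shown together in the short do-file sketch below. It uses the auto dataset shipped with Stata purely as an example; the choice of variables is arbitrary.

```stata
* A comment that runs to the end of the line
sysuse auto, clear   // a trailing comment after a command

/* A block comment
   spanning two lines */

* Continue one command across two lines with ///
summarize price mpg ///
    weight if foreign == 0

* Or make the semicolon the command delimiter (do-files only)
#delimit ;
regress price mpg
    weight ;
#delimit cr
```

The #delimit lines have no effect when typed interactively, so this block is meant to be saved and run as a do-file.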

set more off asks that the output scroll at full speed without pausing.

1.15 Variance Decomposition in Linear Regression Model

1. use "ENEU_xxxx.dta"
2. tabulate skill, gen(skd)
   creates a set of dummy variables for skill
3. reg inc skd* edad
   runs regression analysis
4. predict r, resid
   assigns prediction residuals to r
5. egen sd_regression = sd(r)
6. gen Educontr = _b[skd1]*skd1 + _b[skd2]*skd2 + _b[skd3]*skd3 + _b[skd4]*skd4
   calculates the fitted values of income when education level is controlled for
7. egen sdEducontr = sd(Educontr)

1.16 Head Counting (and Displaying Group Average)

1. use "ENEU_xxxx.dta"
2. sort year
3. by year: gen totp1_f = 1 if edad >= 25 & edad <= 60 & sex == 2
4. by year: gen empl_f = 1 if edad >= 25 & edad <= 60 & inc > 0 & sex == 2
5. by year: egen sum_tot_f = sum(totp1_f)
6. by year: egen sum_empl_f = sum(empl_f)
7. by year: gen empl_rate_f = sum_empl_f/sum_tot_f
8. gen vmp = empl_rate_f
9. table year, c(mean vmp p90 vmp p50 vmp p10 vmp) f(%9.3f)
   creates a table to show the ratio / mean / percentile just calculated.


Year of   |
Survey    |  mean(sex)
----------+-----------
     1987 |   1.523618
     1988 |   1.524052
     1989 |   1.523699

10. table year, by(sex) c(mean vmp sd vmp) f(%9.3f)

Gender    |
and Year  |
of Survey |  mean(inc)
----------+-----------
1         |
     1987 |   .1715026
     1988 |   343.8696
     1989 |   455.3309
----------+-----------
2         |
     1987 |   .0520051
     1988 |    94.3484
     1989 |   123.9082
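Since the ENEU survey file is not distributed with these notes, the head-counting logic can be tried on a few invented records. Everything in the sketch below is an assumption for illustration: the six observations are made up, indicator variables coded 0/1 replace the 1/missing flags, egen's total() is used where the text uses sum(), and tabstat stands in for table.

```stata
* Invented micro data: survey year, age (edad), sex (2 = female), income
clear
input year edad sex inc
1987 30 2 100
1987 40 2 0
1987 50 1 200
1988 30 2 150
1988 45 2 120
1988 70 2 0
end

* Flag women aged 25 to 60, and those among them with positive income
generate byte in_pop = (edad >= 25 & edad <= 60 & sex == 2)
generate byte employed = (in_pop == 1 & inc > 0)

* Head counts and employment rate by year
bysort year: egen sum_tot_f = total(in_pop)
bysort year: egen sum_empl_f = total(employed)
generate empl_rate_f = sum_empl_f / sum_tot_f

* Display the group average
tabstat empl_rate_f, by(year) statistics(mean) format(%9.3f)
```

In this toy sample, one of the two eligible women in 1987 has income, and both of the eligible women in 1988 do.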

1.17 Handling Duplicated Observations

Link 1 from the blog of stataman: link

Link 2 from the service provided by UCLA: link

Link 3 about the use of the tag subcommand for detecting duplicates from IU: link (see footnote 7)
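As a small self-contained illustration of the tag subcommand mentioned above, the sketch below flags and then drops a repeated row. The four observations are invented, and the name duple for the flag variable is just an example name.

```stata
* Toy data in which the row for id 2 appears twice
clear
input id x
1 10
2 20
2 20
3 30
end

* Summarize how many surplus copies exist
duplicates report

* duple counts the other copies of each observation (0 means unique)
duplicates tag, generate(duple)
list

* Keep only the first copy of each duplicated observation
duplicates drop
```

Without a variable list, duplicates compares observations on all variables; a variable list can be given to compare on a key only.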

2 Graph

2.1 Example Commands

histogram y, frequency
   Draws a histogram of variable y, showing frequencies on the vertical axis instead of density (the fraction).

histogram y, start(0) width(10) norm fraction
   Draws a histogram of y with bins 10 units wide, starting from 0. Adds a normal curve based on the sample mean and standard deviation.

histogram price, by(region, total)
   Creates the distribution of price by region, and includes a "total" histogram showing the distribution for all regions combined.

kdensity x, generate(xpoints xdensity) width(20) biweight
   Graphs the kernel density estimate of the distribution of x. xpoints contains the x values at which the density is estimated; xdensity is the density corresponding to each value of xpoints.

graph twoway scatter y x
   Displays a basic two-variable scatterplot of y against x.

graph twoway scatter y x, mlabel(country)
   Creates a scatterplot with data points (markers) labeled by the values of variable country.

graph twoway lfit y x || scatter y x
   Visualizes the linear regression (lfit) of y on x by overlaying two twoway graphs. To include a 95% confidence band, replace lfit with lfitci. qfitci does the same for a quadratic fit; adding the stdf option bases the band on the standard error of the forecast.

graph twoway (lfit y x)(scatter y x)
   Same as the ||-separator notation above, except that it is called the ()-binding notation. It doesn't matter which notation you use.

graph twoway (lfit y x)(scatter y x, mlabel(make))
   Same as above, except that markers are labeled with the corresponding values of variable make.

7 The tag subcommand along with the generate() option flags duplicate observations by assigning 1 to duplicates in the variable duple.

graph twoway scatter y x, xlabel(0(10)100) ylabel(-3(1)6, angle(horizontal))
   Constructs a scatterplot of y against x, with the x axis labeled at 0, 10, ..., 100. The y axis is labeled at -3, -2, ..., 6, with labels written horizontally instead of vertically (the default).

graph twoway scatter y x1, by(x2)
   In one figure, draws scatterplots for each value of x2.

graph twoway qfitci mpg weight, stdf || scatter mpg weight, by(foreign)
   can be written as graph twoway qfitci mpg weight, stdf by(foreign) || scatter mpg weight

graph twoway scatter y x1 [fweight = population], msymbol(Oh)
   Draws a scatterplot of y against x1. Marker symbols are hollow circles, with their size proportional to the frequency-weight variable population. Type help msymbol for the full range of marker shape options.

graph twoway connected y time
   A time plot of y against time. Data points are shown connected by line segments. Use line instead of connected if you don't want the data point markers to be shown.

graph twoway line y1 y2 time
   Draws a time plot with two y variables that both have the same scale, with connected data points and no data point markers.

graph twoway line y1 time, yaxis(1) || line y2 time, yaxis(2)
   Draws a time plot with two y variables that have different scales, by overlaying two individual line plots.

graph twoway scatter mpg weight, by(foreign, total row(1))
   Produces graphs lined up in a row in one figure. Instead of row, the col(1) option stacks graphs in one column.

2.2 Scatter Plot

sysuse lifeexp, clear
gen gnp000 = gnppc/1000
label var gnp000 "GNP per capita, thousands of dollars"
scatter lexp gnp000, xsca(log) ///
xlabel(.5 2.5 10(10)40, grid)

2.3 Bubble Plot

Bubble plot makes symbol size proportional to a third variable.

sysuse census, clear
gen drate = divorce / pop18p
label var drate "Divorce rate"
scatter drate medage [w=pop18p] if state!="Nevada", msymbol(Oh) ///
note("State data excluding Nevada" ///
"Area of symbol proportional to state's population aged 18+")

Figure 7: Scatter plot

2.4 Combining Two Graphs

sysuse sp500, clear
#delimit ;
twoway
line close date, yaxis(1)
||
bar change date, yaxis(2)
||
in 1/52,
ysca(axis(1) r(1000 1400)) ylab(1200(50)1400, axis(1))
ysca(axis(2) r(-50 300)) ylab(-50 0 50, axis(2))
ytick(-50(25)50, axis(2) grid)
legend(off)
title("S&P 500")
subtitle("January - March 2001")
note("Source: Yahoo!Finance and Commodity Systems, Inc.")
yline(1150, axis(1) lstyle(foreground))
;
#delimit cr

2.5 Combining Three Graphs with Legend

sysuse uslifeexp, clear
gen diff = le_wm - le_bm
label var diff "Difference"
#delimit ;
line le_wm year, yaxis(1 2) xaxis(1 2)
|| line le_bm year
|| line diff year
|| lfit diff year
||,
ylabel(0(5)20, axis(2) gmin angle(horizontal))
ylabel(0 20(10)80, gmax angle(horizontal))
ytitle("", axis(2))
xlabel(1918, axis(2)) xtitle("", axis(2))
ytitle("Life expectancy at birth (years)")
title("White and black life expectancy")
subtitle("USA, 1900-1999")
note("Source: National Vital Statistics, Vol 50, No. 6"
"(1918 dip caused by 1918 Influenza Pandemic)")
legend(label(1 "White males") label(2 "Black males"))
legend(col(1) pos(3))
;
#delimit cr

Figure 8: Bubble plot

Figure 9: Combination of two graphs in one plot

ylabel(0(5)20, axis(2) grid gmin angle(horizontal))
   The attribute axis(2) causes the right axis to have labels at 0, 5, ..., 20. gmin forces the grid line at 0, which Stata would otherwise omit because it does not like to draw grid lines too close to the axis. angle(horizontal) makes the labels horizontally shown.

ylabel(0 20(10)80, gmax angle(horizontal))
   The attribute axis(1) is omitted, because it is what ylabel() assumes.

ytitle("", axis(2))
   Discards the axis title on the right y-axis.

xtitle("", axis(2))
   Discards the title of the second x-axis.

legend(label(1 "White males") label(2 "Black males"))
   Specifies new legend labels, instead of using the variable labels originally assigned to variables le_wm and le_bm.

2.6 Time Series Plot with Annotation

sysuse uslifeexp, clear
#delimit ;
graph twoway line le year || fpfit le year ||,
ytitle("Life Expectancy, years")
xlabel(1900 1918 1940(20)2000)
title("Life Expectancy at Birth")
subtitle("U.S., 1900-1999")
note("Source: National Vital Statistics Report, Vol. 50 No. 6")
legend(label(1 "actual") label(2 "fitted") position(10) ring(0) rows(2))
text(48.5 1923
"The 1918 Influenza Pandemic was the worst epidemic"
"known in the U.S."
"More citizens died than in all combat deaths of the"
"20th century.", box place(se) just(left) margin(l+4 t+1 b+1) width(85))
;
#delimit cr

Figure 10: Use of legend

label(1 "actual")
   Labels the first-named variable actual.

position(2)
   Places the legend at the 2 o'clock position (upper right); the command above uses position(10), the 10 o'clock position (upper left).

ring(0)
   Places the legend within the plot space.

rows(2)
   Organizes the legend to have two rows.

2.7 Line plot with string markers

sysuse lifeexp, clear
keep if region==2 | region==3
replace gnppc = gnppc / 1000
label var gnppc "GNP per capita (thousands of dollars)"
gen lgnp = log(gnppc)
qui reg lexp lgnp
predict hat
label var hat "Linear prediction"
replace country = "Trinidad" if country=="Trinidad and Tobago"
replace country = "Para" if country == "Paraguay"
gen pos = 3
replace pos = 9 if lexp > hat
replace pos = 3 if country == "Colombia"
replace pos = 3 if country == "Para"
replace pos = 3 if country == "Trinidad"
replace pos = 9 if country == "United States"
#delimit ;
twoway
(scatter lexp gnppc, mlabel(country) mlabv(pos))
(line hat gnppc, sort)
, xsca(log) xlabel(.5 5 10 15 20 25 30, grid) legend(off)
title("Life expectancy vs. GNP per capita")
subtitle("North, Central, and South America")
note("Data source: World bank, 1998")
ytitle("Life expectancy at birth (years)")
;
#delimit cr

Figure 11: Time Series with Annotation

scatter lexp gnppc, mlabel(country) mlabv(pos)
   Specifies the label position for each of the data point markers. pos( ) is the abbreviation for position( ).

line hat gnppc, sort
   If the data are already in the order of gnppc, the sort is unnecessary.

xsca(log)
   Specifies the scale of the x axis as a logarithmic scale.

2.8 Subplots in One Figure

sysuse uslifeexp, clear
line le_male year, ylab(,grid) saving(male)
line le_female year, ylab(,grid) saving(female)
gr combine male.gph female.gph, col(1) scale(1)

Figure 12: String markers

2.9 Horizontal Bar Plots in One Figure

sysuse nlsw88, clear
#delimit ;
graph hbar wage, over( occ, axis(off) sort(1) )
blabel( group, pos(base) color(bg) )
ytitle( "" )
by( union,
title("Average Hourly Wage, 1988, Women Aged 34-46")
note("Source: 1988 data from NLS, U.S. Dept. of Labor, Bureau of Labor Statistics") ) ;
#delimit cr

2.10 Grouped Horizontal Bar Plot

sysuse citytemp, clear
#delimit ;
graph bar (mean) tempjuly tempjan,
over(division, label(labsize(*.75)))
over(region)
bargap(-30) nofill
ytitle("Degrees Fahrenheit")
legend( label(1 "July") label(2 "January") )
title("Average July and January temperatures")
subtitle("by region and division of the United States")
note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ;
#delimit cr

Figure 13: Subplots in one figure

Figure 14: Subplots for bar plot

Figure 15: Horizontal bar plot

2.11 Horizontal Bar Plot with Sorted Groups

sysuse educ99gdp, clear
generate total = private + public
#delimit ;
graph hbar (asis) public private,
over(country, sort(total) descending) stack
title("Spending on tertiary education as % of GDP, 1999",
span pos(11))
subtitle(" ")
note("Source: OECD, Education at a Glance 2002", span) ;
#delimit cr

To give an informative illustration, it is desirable to sort total spending in descending or ascending order, and thus we issue the option over(country, sort(total) descending) stack.

Figure 16: Sorted groups within bar plot

2.12 Bar Plot with Subgroups

sysuse citytemp, clear
#delimit ;
graph bar (mean) tempjuly tempjan,
over(division, label(labsize(*.75)))
over(region)
bargap(-30) nofill
ytitle("Degrees Fahrenheit")
legend( label(1 "July") label(2 "January") )
title("Average July and January temperatures")
subtitle("by region and division of the United States")
note("Source: U.S. Census Bureau, U.S. Dept. of Commerce") ;
#delimit cr

graph bar (mean) tempjuly tempjan, over(division) over(region) bargap(-30) nofill
   Creates overlapping bars in each combination of division and region. The comparison of tempjuly and tempjan is made in each combination of division and region. Variable region provides the uppermost groups, which are further decomposed into their subgroups, division, and is therefore written at the end of the command.

If the variation of the data in time is instead captured by a labeled-numeric variable, say, month, so that the data originally separated into tempjuly and tempjan reduce to a single variable, say, temperature, then we can drop bargap(-30) nofill by issuing the command graph bar (mean) temperature, over(month, descend gap(-30)) over(division) over(region) and so on.
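One way to arrive at the single temperature variable and the month variable described above is to collapse and reshape the citytemp data. The sketch below is an assumption about how this could be done, not the author's procedure: month ends up as a string variable holding "jan" and "july" (not a labeled-numeric one), and the over() suboptions are left at their defaults.

```stata
* Division-level mean temperatures, one row per region-division pair
sysuse citytemp, clear
collapse (mean) tempjan tempjuly, by(region division)

* Stack tempjan and tempjuly: the suffix goes into the string variable month
reshape long temp, i(region division) j(month) string
rename temp temperature

* One y variable, with month as the innermost grouping
graph bar (mean) temperature, over(month) over(division) over(region) nofill
```

Because month is now an over() group, the bars for January and July are placed side by side within each division without a separate legend entry per variable.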

2.13 Grouped Bar Plot with Numerical Bar Labels

sysuse citytemp, clear
#delimit ;
graph bar tempjuly tempjan, over(region) bargap(-30)
legend( label(1 "July") label(2 "January") )
ytitle("Degrees Fahrenheit")
title("Average July and January temperatures")
subtitle("by regions of the United States")
note("Source: U.S. Census Bureau, U.S. Dept. of Commerce")
blabel(bar, position(inside) format(%9.1f) color(white)) ;
#delimit cr

Figure 17: Bar plot with subgroups

Figure 18: Bar plot with numerical bar labels

3 Programming

3.1 Global and local macros

A macro is a string of characters that stands for another string of characters. For example, you can use the macro xlist in place of the string "price weight". This substitution can lead to code that is shorter and easier to read.

A global macro is accessible across Stata do-files or throughout a Stata session. A local macro can be accessed only within a given do-file or in the interactive session.

Global macros are defined with the global command. To access what was stored in a global macro, put the character $ immediately before the macro name.

Possible scenarios in which we may call for the use of global macros include:
1. When several different models are to be fitted with the same regressor list, substituting the list with a global macro means a single change updates all instances.
2. When a key parameter is used in several models, we may need to change its value back and forth many times. For example, in the early stages of our analysis, exploratory data analysis might set the parameter to a small value such as 5 to save computational time, whereas the final results set the parameter to an appropriately higher value such as 100.

A macro can be used in place of a scalar so that the macro is not dropped after the execution of the drop all command.
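The first scenario above can be sketched with the auto dataset shipped with Stata. The regressor list and the two models are arbitrary choices for illustration.

```stata
sysuse auto, clear

* Define the regressor list once
global xlist price weight

* Reuse it in several models; editing the global changes all of them
regress mpg $xlist
regress mpg $xlist foreign

* Show what the macro currently holds
display "$xlist"
```

Changing the definition of xlist in one place is then enough to update every model that refers to $xlist.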

Local macros are defined with the local command. To access what was stored in the local macro, enclose the macro name in single quotes. For example, consider a regression on several regressors. We define the local macro xlist and subsequently access its contents by enclosing the name in single quotes as `xlist'.

We can also define a local macro through evaluation of a function. For example,

local z = 2+2
display `z'

Local macros apply only to the current program and have the advantage of no potential conflict with other programs. They are preferred to global macros, unless there is a compelling reason to use global macros.
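The regression example with a local macro mentioned above might look like the following sketch, again using the auto dataset as an arbitrary example. It should be run as a whole from a do-file, since a local macro defined in one interactive run of selected lines is not visible in the next.

```stata
sysuse auto, clear

* Define the regressor list as a local macro
local xlist "price weight"

* Access its contents with a left quote and a right quote
regress mpg `xlist'

* A local can also hold the result of a macro function
local k : word count `xlist'
display "number of regressors: `k'"
```

The opening quote is the backtick (`) and the closing quote is the ordinary apostrophe (').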

3.2 Looping commands

Stata has three looping constructs: foreach, forvalues, and while. The foreach construct loops over items in a list, where the list can be a list of variable names (possibly given in a macro) or a list of numbers. The forvalues construct loops over consecutive values of numbers. A while loop continues until a user-specified condition is not met. For example,

* make artificial dataset of 50 observations on 3 uniform variables
clear
set obs 50
set seed 12345
gen var1x = runiform()
gen var2x = runiform()
gen var3x = runiform()
* manually calculate the summation of the three variables
generate sum = var1x + var2x + var3x
summarize sum
* illustration of foreach loop with a variable list
quietly replace sum = 0
foreach var of varlist var1x var2x var3x {
quietly replace sum = sum + `var'
}
summarize sum
* illustration of forvalues loop that iterates over consecutive values
quietly replace sum = 0
forvalues i = 1/3 {
quietly replace sum = sum + var`i'x
}
* illustration of while loop
quietly replace sum = 0
local i 1
while `i' <= 3 {
quietly replace sum = sum + var`i'x
local i = `i' + 1
}
summarize sum
For the forvalues loop, the choice of the name i for the local macro is arbitrary. We may use other increments, such as i = 1(2)10, where the index starts at 1 and increases in steps of 2 for as long as it does not exceed 10 (that is, 1, 3, 5, 7, 9).
As in other common programming languages, there are ways to leave an iteration early. The continue command ceases execution of the current loop iteration and moves on to the start of the next one, whereas continue, break skips the remaining actions and iterations of the current loop altogether.
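The stepped forvalues range and the two ways of leaving an iteration early can be seen in the short sketch below. The macro names odd_sum and total are hypothetical, chosen only for this example.

```stata
* Sum the index values 1, 3, 5, 7, 9
local odd_sum 0
forvalues i = 1(2)10 {
    local odd_sum = `odd_sum' + `i'
}
display `odd_sum'

* Sum the odd numbers up to 7, skipping evens and stopping once i exceeds 7
local total 0
forvalues i = 1/10 {
    if mod(`i', 2) == 0 {
        continue            // skip to the next iteration
    }
    if `i' > 7 {
        continue, break     // leave the loop entirely
    }
    local total = `total' + `i'
}
display `total'
```

The first loop adds five terms; the second loop stops at i = 9, the first odd value above 7.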

40

3.3 Installing User-Written Commands

For example, we would like to install a method for instrumental-variables estimation. A leading user-written command for IV is ivreg2, and we type findit ivreg2 to get it. This gives the information on IV commands available both within Stata and on the Internet. Left-clicking on the highlighted text st0030_3, you will see a new window with the installation details. By left-clicking on (click here to install), you will install the files into an ado-directory.
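As an alternative to clicking through the findit results, a package hosted on the SSC archive can be installed from the command line. The sketch below assumes an internet connection and installs the ivreg2 copy distributed through SSC, which may differ in version from the st0030_3 package named above.

```stata
* Install (or update) ivreg2 from the SSC archive
ssc install ivreg2, replace

* Show where the ado-file was placed and which version it is
which ivreg2

* List the ado-directories Stata searches
adopath
```

Packages installed this way can later be refreshed with adoupdate.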

3.4 Do-File Example: Weighted Sum

http://www.stata.com/statalist/archive/2008-01/msg00837.html

Reference

Microeconometrics Using Stata, A. Colin Cameron and Pravin K. Trivedi, Stata Press

Statistics with Stata version 10, Lawrence C. Hamilton, Brooks/Cole, Cengage Learning, 2009

Stata Graphics Reference Manual Release 11, Stata Press

Introduction to SAS. UCLA: Academic Technology Services, Statistical Consulting Group. from http://www.ats.ucla.edu/stat/sas/notes2/ (accessed November 24, 2007)

Stata Tutorial. Carolina Population Center, the University of North Carolina at Chapel Hill. from http://www.cpc.unc.edu/research/tools/data_analysis/statatutorial

Data and Statistical Services. Princeton University. from http://www.princeton.edu/~otorres/Stata/

The Stata Project-Oriented Guide, Blog of stataman from http://stataproject.blogspot.com/2007/12/4-thank-god-for-egen-command.html
