STATA Codes - Basic

This document provides an overview of common Stata commands for data management and analysis. It discusses how to (1) import data, (2) distinguish between variable names and labels, (3) open and close log files, (4) create and drop variables, (5) use relational operators, (6) handle missing values, (7) generate descriptive statistics, (8) identify patterns of missing data, and (9) calculate changes between successive values for panel data. The document serves as a handy reference for basic Stata skills.

Uploaded by

蕭得軒

Handout: Stata codes_basic updated: 2019/3/16

*******************************
(1) How to Read data sets into Stata
*******************************
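To read a .csv (comma-separated values) data file directly into Stata, use the insheet command
(superseded by import delimited since Stata 13, but still available).
The first row of the spreadsheet (data file) may contain the variable names; if it does, Stata
uses them to name the variables.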

insheet using F:\path\DATA\CCDATA_2012.csv
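
In Stata 13 and later, import delimited does the same job as insheet. The lines below are a minimal sketch that reuses the file path from the example above; whether you need varnames(1) and clear depends on your own file and session.

```stata
* Read the CSV; row 1 supplies the variable names, and clear drops any data in memory.
import delimited using "F:\path\DATA\CCDATA_2012.csv", varnames(1) clear

* Check what was read: variable names, storage types, and labels.
describe
```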

****************************************************
(2) What is the difference between a variable name and a variable label?
****************************************************
• Stata changes the variable names from upper- to lower-case letters if you use variable names
in uppercase letters in the spreadsheet. Use variable names rather than variable labels in your
commands.
• Numeric vs. string (or text) variable: if the values entered for a variable are numbers, then Stata
assumes that this column is a numeric variable and it will permit only numbers as values.
The numbers should not include any commas, such as 1,000,000. If they do, Stata will consider the
column a "string" variable and generally refuse to do calculations with it.
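
If a column was read as a string because of commas, it can usually be converted afterwards. The sketch below assumes a hypothetical string variable named sales_txt holding values such as "1,000,000"; the new name sales and the label text are also made up for illustration.

```stata
* Strip the commas and store the result as a new numeric variable.
destring sales_txt, generate(sales) ignore(",")

* Attach a descriptive label; commands still refer to the variable by its name, sales.
label variable sales "Sales revenue"

* The output lists the name, the storage type, and the label side by side.
describe sales
```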

*****************************
(3) How to open/close a LOG file?
*****************************
• We generally open a log file at the start to record commands and results tables (but not
graphs). Type,

log using F:\LOGS\cc2012.log, replace


clear

The replace option specifies that the log file be overwritten if it already exists. If replace is
omitted and the specified file already exists, an error message, "file F:\LOGS\cc2012.log
already exists", is issued and logging is not started. The clear command removes any data
currently in memory; if you try to load a new data set without clearing first, Stata refuses and
warns that data in memory would be lost.

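At the end of the session, to close the log file, we type

log close

The log is written to disk as the session proceeds, so no separate save step is needed. The save
command stores the data set in memory (not the log), for example:

save cc2012.dta, replace

Again, if replace is omitted and the file already exists, you will receive an error message and
the file is not overwritten.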
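
Put together, a logged session might look like the sketch below. It reuses the log path from the example above; the text option (plain-text log) and the describe command in the middle are only illustrative choices.

```stata
* Close any log that is still open; capture suppresses the error if none is open.
capture log close

* Start a plain-text log, overwriting an existing file of the same name.
log using "F:\LOGS\cc2012.log", replace text

* Commands whose output should be recorded go here.
describe

* Stop logging; the file on disk is now complete.
log close
```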

********************************************************
(4) How to create or drop variables? Qualifier/logical operators?
********************************************************
• Create a new variable named "CEOPAY" equal to the mid-point of the lower and upper bounds of
CEO pay (ceo_pay_l and ceo_pay_u):

generate CEOPAY=(ceo_pay_l+ceo_pay_u)/2 or
gen CEOPAY=(ceo_pay_l+ceo_pay_u)/2

• To create a new variable equal to the natural log of CEOPAY, we type

gen LCEOPAY=ln(CEOPAY)

• Drop observations. For example, to drop observations with board_size equal to zero, type

drop if board_size==0

• Tabulate number of observations, mean, std dev, min, and max of a variable.

summarize CEOPAY or
sum CEOPAY

• Use "if" qualifier to select observations based on specific conditions.

sum CEOPAY if year < 2011


sum CEOPAY if year >=2011

• Logical operators: Two or more relational operators can be combined within a single if
expression by using logical operators. Stata’s logical operators are listed as follows.

&    and
|    or (a vertical bar, not the number one or letter “el”)
!    not

For example:

sum CEOPAY if (board_size==10|board_size==15) & year>2009 & year<=2011

Note: Parentheses allow us to specify the precedence among multiple operators.
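
The sketch below illustrates why the parentheses matter, using the same variables as above. Stata evaluates & before |, so dropping the parentheses changes which observations are selected. The inlist() and inrange() functions are an optional, more compact way to write the same kind of condition.

```stata
* Without parentheses this reads as: board_size==10 | (board_size==15 & year>2009).
count if board_size==10 | board_size==15 & year>2009

* With parentheses the year condition applies to both board sizes.
count if (board_size==10 | board_size==15) & year>2009

* inlist() tests membership in a list; inrange(year, 2010, 2011) means 2010 <= year <= 2011.
sum CEOPAY if inlist(board_size, 10, 15) & inrange(year, 2010, 2011)
```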

*****************************
(5) What are relational operators?
*****************************
== is equal to (double equal sign used in if expressions indicates a logical test: "Is
the LHS value the same as the RHS value?")
!= is NOT equal to
> is greater than
< is less than
>= is greater than or equal to
<= is less than or equal to

Note: To Stata, a single equals sign means something different: “Make the value on the left
side be the same as the value on the right.” The single equals sign is not a relational operator
and cannot be used within if qualifiers.
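
A short sketch of the two signs side by side. The variable name big_board is hypothetical and chosen only for this illustration.

```stata
* Single = assigns: start the new variable at 0 for every observation.
gen byte big_board = 0

* Relational operators test: the if qualifier picks the observations to change.
replace big_board = 1 if board_size >= 10 & !missing(board_size)
replace big_board = . if missing(board_size)

* Double == tests equality inside an if qualifier.
count if big_board == 1
```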

***********************************************
(6) Missing values, !missing(.), and missing(.) function
***********************************************
• Missing values issue: Stata shows missing values as a period, " . ". However, in an if
expression these same missing values are treated as larger than any number.

sum CEOPAY if board_size >=10

A command such as the above would summarize statistics of CEOPAY not only for firms with
board size greater than or equal to 10, but also for firm-year observations whose board size
values are missing.

To correct this, we may use !missing() to screen out missing values.

sum CEOPAY if board_size >=10 & !missing(board_size)

The missing() function evaluates to 1 if a value is missing, and 0 if it is not. Thus,
alternatively, to exclude the observations with missing board_size, we may type

sum CEOPAY if board_size >=10 & missing(board_size)==0

To summarize statistics of CEOPAY only for those observations that have non-missing values of
CEOPAY and board_size, type

sum CEOPAY if missing(CEOPAY, board_size)==0
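
The effect is easy to see on a tiny made-up data set. The four observations below are invented for illustration only and replace whatever is in memory.

```stata
clear
input board_size CEOPAY
 8 100
12 200
 . 300
15   .
end

* Counts 3: board sizes 12 and 15, plus the missing board_size.
count if board_size >= 10

* Counts 2: the missing board_size is screened out.
count if board_size >= 10 & !missing(board_size)

* Counts 2: only the first two rows have both variables non-missing.
count if !missing(CEOPAY, board_size)
```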

**********************************
(7) How to tabulate descriptive statistics?
**********************************
• To tabulate descriptive statistics for non-missing observations for both CEOPAY and
board_size, we may type

tabstat CEOPAY board_size if missing(CEOPAY, board_size)==0, stats(n mean median sd p25 p75)

• Tabulate descriptive statistics for CEOPAY and board_size for each industry sector, type

tabstat CEOPAY board_size if missing(CEOPAY, board_size)==0, stats(n mean median sd p25 p75) by(ind_tse)

or

by ind_tse, sort: tabstat CEOPAY board_size if missing(CEOPAY, board_size)==0, stats(n mean median sd p25 p75)

• Tabulate descriptive statistics for CEOPAY and board_size for each sample year, type

by year, sort: tabstat CEOPAY board_size if missing(CEOPAY, board_size)==0, stats(n mean median sd p25 p75)
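
Long commands like these are easier to read in a do-file, where /// continues a command on the next line. The sketch below is one possible layout; columns(statistics) and nototal are optional tabstat options added here for illustration.

```stata
* One row per variable within each industry, one column per statistic.
tabstat CEOPAY board_size if !missing(CEOPAY, board_size), ///
    statistics(n mean median sd p25 p75) by(ind_tse) columns(statistics)

* The by() option also prints an overall Total block; nototal suppresses it.
tabstat CEOPAY board_size if !missing(CEOPAY, board_size), ///
    statistics(n mean median sd p25 p75) by(ind_tse) columns(statistics) nototal
```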

*********************************************************
(8) How to find the patterns of missing data for a set of variables?
*********************************************************
"misschk" is used to identify the number and pattern of missing obs. Type "findit
misschk" in the command panel and install this package.

misschk varlist
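
If installing a user-written package is not an option, recent versions of Stata (11 and later) ship a built-in command, misstable, that reports similar information. The sketch below applies it to the two variables used earlier in this handout.

```stata
* Number of missing and non-missing observations for each variable.
misstable summarize CEOPAY board_size

* Which combinations of variables are missing together, and how often.
misstable patterns CEOPAY board_size
```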

********************************************************
(9) How to find the changes between successive values for each firm?
********************************************************
For example, we would like to find the change in sales revenue for two successive years. First,
we need to know how Stata defines "_n" and "_N". The internal Stata variable _n is equal to
the numeric position of an observation and _N is equal to the total number of observations.

For example, in a dataset of 10 observations as shown below, in the first observation, _n is equal
to 1 and _N is equal to 10. In the second observation, _n is equal to 2 and _N is equal to 10, etc.

id     year   _n   _N     rev
----   ----   --   --   ------
1002   2001    1   10     1000
1002   2002    2   10     1500
1002   2003    3   10     1750
1002   2004    4   10     2000
1002   2005    5   10     2150
1003   2003    6   10      500
1003   2004    7   10      750
1003   2005    8   10      860
1004   2004    9   10    10100
1004   2005   10   10    22000

When using the by prefix, _n is equal to 1 for the first observation of the by-group and _N is
equal to the total number of observations in the by-group.

id     year   _n   _N     rev   rev_lag
----   ----   --   --   ------  -------
1002   2001    1    5     1000        .
1002   2002    2    5     1500     1000
1002   2003    3    5     1750     1500
1002   2004    4    5     2000     1750
1002   2005    5    5     2150     2000
1003   2003    1    3      500        .
1003   2004    2    3      750      500
1003   2005    3    3      860      750
1004   2004    1    2    10100        .
1004   2005    2    2    22000    10100

So, in this case,

by id:

creates 3 by-groups (id=1002, 1003, and 1004). The first firm (id=1002) has 5 observations in
the data set, so within its by-group the first observation has _n=1 and _N=5, the second
observation has _n=2 and _N=5, etc.
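
A small sketch of _n and _N inside by-groups, assuming the id, year, and rev variables from the table above are in memory. The new variable names seq and n_obs are hypothetical.

```stata
* by requires sorted data, so sort first.
sort id year

* Position of each observation within its firm, and the firm's number of observations.
by id: gen seq   = _n
by id: gen n_obs = _N

* The last observation of each firm is the one where the two are equal.
list id year rev if seq == n_obs
```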

Now, if we would like to ensure data are sorted by two variables, we type

by varlist1 (varlist2), sort: stata_cmd

Then, we create a new variable equal to the previous year's revenue, for example:

bysort id (year): gen rev_lag = rev[_n-1]

Now, we may find the difference in sales for two consecutive years by typing:

gen ch_rev = rev - rev_lag
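
The sketch below runs the two commands on the ten observations from the table above, so the result can be checked against it. The last part shows an optional alternative: after declaring the panel with xtset, the D. operator gives the first difference directly, and it returns missing when the previous year is absent instead of using the previous observation. The name ch_rev_ts is hypothetical.

```stata
clear
input id year rev
1002 2001  1000
1002 2002  1500
1002 2003  1750
1002 2004  2000
1002 2005  2150
1003 2003   500
1003 2004   750
1003 2005   860
1004 2004 10100
1004 2005 22000
end

* Lag and change within each firm, with observations ordered by year.
bysort id (year): gen rev_lag = rev[_n-1]
gen ch_rev = rev - rev_lag

* Alternative: declare the panel structure and use the difference operator.
xtset id year
gen ch_rev_ts = D.rev

list, sepby(id)
```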

Note:
1. Most Stata commands allow the by prefix, which repeats the command for each group of
observations for which the values of the variables in varlist are the same. by without the
sort option requires that the data be sorted by varlist.
2. The varlist1 (varlist2) syntax is of special use to programmers. It verifies that the
data are sorted by varlist1 varlist2 and then performs a by as if only varlist1
were specified. Please note that if you type bysort id year: gen rev_lag=rev[_n-1],
rev_lag will be missing for every observation since there is only one observation in each
id-year group.
3. rev (rev[_n-1]) is the sales revenue for the current observation (the previous observation of
the same firm), that is, the current year (previous year) when a firm has no gaps in its years.
4. This command is very different from bysort var1 var2: egen var_median=
median(var3). An example is shown as follows.

*********************************************************
(10) How to find the median (or mean) for each industry-year group?
*********************************************************

by ind year, sort: egen CSR_median=median(CSR)

This command generates a new variable CSR_median, the values of which are equal to the
median of CSR scores for each ind-year group. Also, when calculating the median (or mean),
Stata ignores missing values.
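
The same pattern works for the mean. In the sketch below, the variable names CSR_mean and CSR_dev are hypothetical; the point is that egen writes the group statistic to every observation in the group, so it can be combined with the firm's own value.

```stata
* Industry-year mean of CSR, written with bysort instead of by ..., sort:.
bysort ind year: egen CSR_mean = mean(CSR)

* Deviation of each firm-year from its industry-year mean.
gen CSR_dev = CSR - CSR_mean

summarize CSR CSR_mean CSR_dev
```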

*******************************
(11) How to create dummy variables?
*******************************
• Display a frequency distribution table for all non-missing values of a variable. Type

tabulate ind_tse

• Create dummy variables (IND#) for each industry sector. Type

tabulate ind_tse, gen(IND)
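
Year dummies can be created the same way. The sketch below assumes a variable named year and that no variables starting with YR exist yet; n_yr is a hypothetical name used only for the check.

```stata
* One dummy per distinct year: YR1, YR2, ...
tabulate year, gen(YR)
describe YR*

* Check: each observation with a non-missing year should have exactly one dummy equal to 1.
egen byte n_yr = rowtotal(YR*)
tabulate n_yr
```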

******************************************
(12) How to winsorize the values of observations?
******************************************
We may install "winsor2" to conduct the winsorization procedure. 1

ssc install winsor2


winsor2 vars, cuts(1 99) replace
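
To see roughly what winsorizing at the 1st and 99th percentiles does, the procedure can also be written by hand. This is a sketch, not the winsor2 implementation; the names p1_cut, p99_cut, and CEOPAY_w are hypothetical, and the percentiles computed by summarize may differ slightly from those winsor2 uses.

```stata
* Read the 1st and 99th percentiles of CEOPAY; summarize, detail stores them in r().
quietly summarize CEOPAY, detail
scalar p1_cut  = r(p1)
scalar p99_cut = r(p99)

* Copy the variable, then pull both tails in to the cutoffs.
gen double CEOPAY_w = CEOPAY
replace CEOPAY_w = p1_cut  if CEOPAY < p1_cut
replace CEOPAY_w = p99_cut if CEOPAY > p99_cut & !missing(CEOPAY)

* Compare the original and winsorized versions.
summarize CEOPAY CEOPAY_w
```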

******************************************************************
How to perform an ordinary least squares (OLS) regression of variable y on x?
******************************************************************
regress y x

Footnote 1: The contributed commands from the Boston College Statistical Software Components
(SSC) archive, often called the Boston College Archive, are provided by RePEc. The commands
available are implemented as one or more ado-files, and together with their corresponding help
files and any other associated files, they form a package. These packages are available at SSC.
The ssc command allows you to easily download a package. For example, when you type

. ssc install outreg

all of the files associated with the package named outreg are downloaded and installed on your
computer. Packages can easily be uninstalled. You type ado dir to obtain a list of packages that
you have previously installed, and then you type ado uninstall [#] to uninstall the package.

reg y x if year==2000

To include control variables (x1, x2, and x3) and industry and year dummies, type

reg y x x1 x2 x3 IND? IND?? YR?

or

reg y x x1 x2 x3 IND* YR*

A question mark (?) is a wild card that matches exactly one character (a letter, number, or
underscore) in the variable name. The asterisk/star (*) is a wild card that matches zero or more
characters in the variable name.
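
In Stata 11 and later, factor-variable notation is an alternative that avoids creating the dummies first. The sketch below assumes ind_tse and year are numeric variables holding non-negative integers; Stata then omits one category of each as the base level.

```stata
* i. expands a categorical variable into indicator variables on the fly.
regress y x x1 x2 x3 i.ind_tse i.year
```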

****************************************************
How to find/save fitted values and residuals from regression?
****************************************************
• Generates a new variable (here arbitrarily named yhat) equal to the predicted (fitted) values
from the most recent regression.

predict yhat, xb

• Generates a new variable (here arbitrarily named e) equal to the residuals from the most
recent regression.

predict e, resid
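
As a quick check, the residual should equal the observed value minus the fitted value. In the sketch below, y_fit, y_res, and res_check are hypothetical names, and if e(sample) restricts the predictions to the observations used in the regression.

```stata
regress y x

* Fitted values and residuals for the estimation sample only.
predict y_fit if e(sample), xb
predict y_res if e(sample), residuals

* The residual recomputed by hand; the two should agree up to rounding.
gen double res_check = y - y_fit
summarize y_res res_check
```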

• Performs multiple regression of y on three predictor variables, x1, x2 and x3.

regress y x1 x2 x3

To suppress constant term, type

reg y x1 x2 x3, noconstant

• Performs an F test of the null hypothesis that coefficients on x1 and x2 both equal zero in the
most recent regression model.

test x1 x2

Test whether a coefficient equals a specified constant. For example, to test the null
hypothesis that the coefficient on x1 equals 1 (H0: β1 = 1), instead of testing the usual
null hypothesis that it equals 0 (H0: β1 = 0), type

test x1 = 1

Test whether two coefficients are equal. For example, the following command evaluates
the null hypothesis H0: β1 = β2

test x1=x2
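
When the size of the difference is also of interest, lincom reports the estimated difference with its standard error and confidence interval. The sketch below assumes the three-predictor regression from above is the most recent model.

```stata
regress y x1 x2 x3

* F test of H0: β1 = β2.
test x1 = x2

* Point estimate, standard error, and confidence interval for β1 - β2.
lincom x1 - x2
```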

• Calculates robust (Huber/White) estimates of standard errors. See the User’s Guide for details.
The robust option works with many other model fitting commands as well.

regress y x1 x2 x3, robust

• Displays a matrix of Pearson correlations, using pairwise deletion of missing values and
showing probabilities from t tests of null hypothesis H0: ρ = 0, for each correlation.
Statistically significant correlations (in this example, p < .05) are indicated by stars (*).

pwcorr x1 x2 x3 y, sig star(.05)

• To obtain the Spearman rank correlation between x1 and x2, equivalent to the Pearson
correlation if these variables were transformed into ranks, type

spearman x1 x2, star(0.05)
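
The equivalence with ranks can be checked directly. In the sketch below, r1 and r2 are hypothetical names; the ranks are computed only on observations where both variables are non-missing, which is the sample spearman uses for a pair of variables.

```stata
* Rank each variable (ties receive the average rank) on the common sample.
egen r1 = rank(x1) if !missing(x1, x2)
egen r2 = rank(x2) if !missing(x1, x2)

* The Pearson correlation of the ranks should match Spearman's rho.
spearman x1 x2
correlate r1 r2
```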
