STATA Codes - Basic
*******************************
(1) How to read data sets into Stata?
*******************************
To read a .csv (comma-separated values) data file directly into Stata, use the insheet command
(superseded by import delimited in Stata 13 and later).
The first row of the spreadsheet (data file) may contain the variable names.
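As a minimal sketch of reading a .csv file whose first row holds the variable names: the file name demo_cc.csv and its contents are hypothetical, and the first few lines only create that file so the example runs on its own. The export delimited and import delimited commands assume Stata 13 or later.

```stata
* Build a tiny demo .csv (hypothetical name and values) so the example is self-contained
clear
input id year board_size
1002 2001 9
1002 2002 11
end
export delimited using "demo_cc.csv", replace

* Legacy command named in this handout: first row = variable names
insheet using "demo_cc.csv", comma names clear

* Equivalent in Stata 13 and later
import delimited using "demo_cc.csv", varnames(1) clear
describe
```

The clear option discards whatever data are currently in memory before the file is read.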
****************************************************
(2) What is the difference between a variable name and a variable label?
****************************************************
• When reading the file, Stata changes variable names from upper- to lower-case letters if the
spreadsheet uses uppercase names. Use variable names rather than variable labels in your
commands.
• Numeric vs. string (or text) variable: if the values entered for a variable are numbers, Stata
treats the column as a numeric variable and permits only numbers as values. The values
should not include commas, as in 1,000,000; if they do, Stata will treat the column as
a "string" variable and generally refuse to do calculations with it.
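The two points above can be sketched as follows. The variable names sales_txt and sales and the label text are hypothetical; destring with the ignore() option is one way, under these assumptions, to turn a comma-formatted string column into a number.

```stata
clear
input str12 sales_txt
"1,000,000"
"2,500"
end
describe sales_txt                      // storage type str12: a string variable

* Strip the commas and convert to a numeric variable
destring sales_txt, generate(sales) ignore(",")

* The name (sales) is what commands use; the label is only descriptive text
label variable sales "Sales revenue"
summarize sales
```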
*****************************
(3) How to open/close a LOG file?
*****************************
• We generally open a log file at the start of a session to record commands and results tables
(but not graphs). Type,
log using H:\LOGS\cc2012.log, replace
The replace option specifies that the file be overwritten if it already exists. If replace is
omitted and the specified file already exists, an error message, "file H:\LOGS\cc2012.log
already exists", is issued and logging is not started. Similarly, when a command that loads data
is run without the clear option while unsaved data are in memory, Stata stops and warns that
the data in memory would be lost.
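At the end of the session, to close the log file (which also saves it), we type
log close
The save command stores the dataset, not the log. To save the dataset in memory, we type
save cc2012.dta, replace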
********************************************************
(4) How to create or drop variables? Qualifier/logical operators?
********************************************************
• Create a new variable named "CEOPAY" equal to the midpoint of the lower and upper bounds of
CEO pay (ceo_pay_l and ceo_pay_u):
Handout: Stata codes_basic updated: 2019/3/16
generate CEOPAY=(ceo_pay_l+ceo_pay_u)/2 or
gen CEOPAY=(ceo_pay_l+ceo_pay_u)/2
gen LCEOPAY=ln(CEOPAY)
• Drop observations. For example, to drop observations with board_size equal to zero, type
drop if board_size==0
• Display the number of observations, mean, standard deviation, minimum, and maximum of a
variable.
summarize CEOPAY or
sum CEOPAY
• Logical operators: Two or more relational operators can be combined within a single if
expression by using logical operators. Stata’s logical operators are listed as follows.
& and
| or (a vertical bar, not the number one or letter “el”)
! not
For example:
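A minimal sketch of the three logical operators, using a small made-up dataset (the values are hypothetical; the variable names follow this handout):

```stata
clear
input board_size CEOPAY year
 8 100 2011
12 250 2011
15 300 2012
 0  80 2012
end

* & (and): both conditions must hold
summarize CEOPAY if board_size >= 10 & year == 2012

* | (or): at least one condition must hold
count if board_size == 0 | CEOPAY > 275

* ! (not): negates the condition in parentheses
list if !(year == 2011)
```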
*****************************
(5) What are relational operators?
*****************************
== is equal to (double equal sign used in if expressions indicates a logical test: "Is
the LHS value the same as the RHS value?")
!= is NOT equal to
> is greater than
< is less than
>= is greater than or equal to
<= is less than or equal to
Note: To Stata, a single equals sign means something different: “Make the value on the left
side be the same as the value on the right.” The single equals sign is not a relational operator
and cannot be used within if qualifiers.
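The contrast between the two signs can be sketched as follows. The flag name big_board and the data values are hypothetical.

```stata
clear
input board_size
8
12
end

* Single = assigns: the new variable takes the value of the expression on the right
generate big_board = (board_size >= 10)

* Double == tests: only rows where the flag equals 1 are listed
list if big_board == 1
```

Here the expression (board_size >= 10) itself evaluates to 1 when true and 0 when false, so big_board is a 0/1 flag.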
***********************************************
(6) Missing values, !missing(.), and missing(.) function
***********************************************
• Missing values issue: Stata shows missing values as a period, " . ". However, these same
missing values are treated as if they were large positive numbers in the if expression.
For example, a command such as
summarize CEOPAY if board_size>=10
would summarize statistics of CEOPAY not only for firms with board size greater than or equal
to 10, but also for firm-year observations whose board size values are missing.
The missing() function evaluates to 1 if a value is missing, and 0 if it is not. Thus, to set
all missing values of board_size aside, we may add & !missing(board_size) to the if expression.
To summarize statistics of CEOPAY only for those observations that have non-missing values of
both CEOPAY and board_size, we may use the condition !missing(CEOPAY, board_size).
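The behaviour described above can be checked on a small made-up dataset (values are hypothetical). This is only a sketch; count is used so that the number of selected observations is easy to see.

```stata
clear
input board_size CEOPAY
 8 100
12 250
 . 400
15   .
end

* Missing board_size counts as "large", so this selects 3 rows, not 2
count if board_size >= 10

* Setting the missing board_size aside leaves 2 rows
count if board_size >= 10 & !missing(board_size)

* Only rows where both variables are non-missing (2 rows here)
summarize CEOPAY if !missing(CEOPAY, board_size)
```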
**********************************
(7) How to tabulate descriptive statistics?
**********************************
• To tabulate descriptive statistics for non-missing observations for both CEOPAY and
board_size, we may type
• Tabulate descriptive statistics for CEOPAY and board_size for each industry sector, type
or
• Tabulate descriptive statistics for CEOPAY and board_size for each sample year, type
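One possible way to produce the three sets of descriptive statistics above is sketched here with tabstat and the by prefix. The data values are made up, and the industry variable is assumed to be ind_tse and the year variable year; the handout's original commands may have differed.

```stata
clear
input ind_tse year board_size CEOPAY
1 2011  8 100
1 2012 12 250
2 2011  . 400
2 2012 15 300
end

* Overall, restricted to rows where both variables are non-missing
tabstat CEOPAY board_size if !missing(CEOPAY, board_size), statistics(n mean sd min max)

* By industry sector: either the by prefix or tabstat's by() option
bysort ind_tse: summarize CEOPAY board_size
tabstat CEOPAY board_size, by(ind_tse) statistics(n mean sd min max)

* By sample year
tabstat CEOPAY board_size, by(year) statistics(n mean sd min max)
```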
*********************************************************
(8) How to find the patterns of missing data for a set of variables?
*********************************************************
"misschk" is a user-written command that identifies the number and pattern of missing
observations. Type "findit misschk" in the Command window and install the package; then type
misschk varlist
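If installing a package is not convenient, Stata's built-in misstable command gives similar information. This is a substitute technique, not the misschk command itself, and the data below are made up for illustration.

```stata
clear
input board_size CEOPAY year
 8 100 2011
12   . 2011
 . 400 2012
15 300 2012
end

* Number of missing values per variable
misstable summarize CEOPAY board_size

* Which combinations of variables are missing together
misstable patterns CEOPAY board_size

* With the user-written package installed, the handout's command would be:
* misschk CEOPAY board_size
```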
********************************************************
(9) How to find the changes between successive values for each firm?
********************************************************
For example, we would like to find the change in sales revenue for two successive years. First,
we need to know how Stata defines "_n" and "_N". The internal Stata variable _n is equal to
the numeric position of an observation and _N is equal to the total number of observations.
For example, in a dataset of 10 observations as shown below, in the first observation, _n is equal
to 1 and _N is equal to 10. In the second observation, _n is equal to 2 and _N is equal to 10, etc.
id year _n _N rev
---- ---- -- -- ------
1002 2001 1 10 1000
1002 2002 2 10 1500
1002 2003 3 10 1750
1002 2004 4 10 2000
1002 2005 5 10 2150
1003 2003 6 10 500
1003 2004 7 10 750
1003 2005 8 10 860
1004 2004 9 10 10100
1004 2005 10 10 22000
When the by prefix is used, _n is equal to 1 for the first observation of each by-group and _N
is equal to the total number of observations in that by-group.
In this case, the first firm (id=1002) has 5 observations in the data set, so
by id:
creates 3 by-groups. Within the first by-group, the first observation has _n=1 and _N=5, the
second observation has _n=2 and _N=5, etc.
Now, to find the change, we first ensure the data are sorted by two variables (sort id year).
Then, we create a new variable equal to the previous year's revenue within each firm
(bysort id (year): gen rev_lag=rev[_n-1]).
Finally, the difference in sales for two consecutive years is rev minus rev_lag.
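The steps above can be sketched on the 10-observation example shown earlier. The name rev_lag comes from the notes in this section; rev_diff is a hypothetical name for the change in revenue.

```stata
clear
input id year rev
1002 2001  1000
1002 2002  1500
1002 2003  1750
1002 2004  2000
1002 2005  2150
1003 2003   500
1003 2004   750
1003 2005   860
1004 2004 10100
1004 2005 22000
end

* Sort by firm and, within firm, by year
sort id year

* Previous observation's revenue within each firm (missing for each firm's first year)
bysort id (year): generate rev_lag = rev[_n-1]

* Change in revenue between successive observations
generate rev_diff = rev - rev_lag
list id year rev rev_lag rev_diff, sepby(id)
```

Note that rev[_n-1] is the previous observation within the firm, which equals the previous year only when a firm's years are consecutive, as they are in this example.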
Note:
1. Most Stata commands allow the by prefix, which repeats the command for each group of
observations for which the values of the variables in varlist are the same. by without the
sort option requires that the data be sorted by varlist.
2. The varlist1 (varlist2) syntax is of special use to programmers. It verifies that the
data are sorted by varlist1 varlist2 and then performs a by as if only varlist1
were specified. Please note that if you type bysort id year: gen rev_lag=rev[_n-1],
every value of rev_lag will be missing, since there is only one observation in each id-year group.
3. rev (rev[_n-1]) is the sales revenue for the current year (previous year).
4. This command is very different from bysort var1 var2: egen var_median=
median(var3). An example is shown as follows.
*********************************************************
(10) How to find the median (or mean) for each industry-year group?
*********************************************************
Here egen with the median() function, run under the bysort prefix on the industry and year
variables, generates a new variable CSR_median, the values of which are equal to the median of
CSR scores for each ind-year group. Also, when calculating the median (or mean), Stata ignores
missing values.
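A minimal sketch of the group median (and mean), assuming the industry, year, and score variables are named ind, year, and CSR; the data values are made up.

```stata
clear
input ind year CSR
1 2011 10
1 2011 20
1 2011  .
1 2012 30
2 2011 40
2 2011 60
end

* Median and mean of CSR within each industry-year group
bysort ind year: egen CSR_median = median(CSR)
bysort ind year: egen CSR_mean   = mean(CSR)
list, sepby(ind year)
```

In the first group the missing CSR score is ignored, so the median is taken over 10 and 20; the observation with the missing score still receives the group median.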
*******************************
(11) How to create dummy variables?
*******************************
• Display a frequency distribution table for all non-missing values of a variable. Type
tabulate ind_tse
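One common way to turn such a categorical variable into dummy variables is the generate() option of tabulate, sketched below. The stub ind_d, the name y2012, and the data values are hypothetical.

```stata
clear
input ind_tse year
1 2011
2 2011
2 2012
3 2012
end

* One 0/1 dummy per category of ind_tse: ind_d1, ind_d2, ind_d3
tabulate ind_tse, generate(ind_d)

* A single dummy from a logical expression (left missing where year is missing)
generate byte y2012 = (year == 2012) if !missing(year)
list
```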
******************************************
(12) How to winsorize the values of observations?
******************************************
We may install "winsor2" to conduct the winsorization procedure. 1
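To show what winsorization does without requiring an installed package, the sketch below winsorizes by hand at the 1st and 99th percentiles using the built-in _pctile command. This is a substitute for winsor2, not its implementation. The simulated data, the cutoffs, and the names lo_cut, hi_cut, and CEOPAY_w are assumptions.

```stata
clear
set obs 200
set seed 12345
generate CEOPAY = exp(rnormal())        // simulated, right-skewed values

* 1st and 99th percentiles are returned in r(r1) and r(r2)
_pctile CEOPAY, percentiles(1 99)
scalar lo_cut = r(r1)
scalar hi_cut = r(r2)

* Replace values beyond the cutoffs with the cutoffs themselves
generate CEOPAY_w = CEOPAY
replace CEOPAY_w = lo_cut if CEOPAY < lo_cut
replace CEOPAY_w = hi_cut if CEOPAY > hi_cut & !missing(CEOPAY)
summarize CEOPAY CEOPAY_w

* With winsor2 installed, a comparable call would be:
* winsor2 CEOPAY, cuts(1 99) suffix(_w)
```

The !missing() guard matters for the upper tail, because missing values would otherwise satisfy CEOPAY > hi_cut.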
Note 1: The contributed commands from the Boston College Statistical Software Components (SSC)
archive, often called the Boston College Archive, are provided by RePEc. The commands available
are implemented as one or more ado-files, and together with their corresponding help files and
any other associated files, they form a package. These packages are available at SSC, and the
ssc command allows you to easily download a package. For example, to install winsor2, type
ssc install winsor2
******************************************************************
How to perform an ordinary least squares (OLS) regression of variable y on x?
******************************************************************
regress y x
To restrict the regression to a subsample, for example observations from year 2000, type
reg y x if year==2000
To control for other variables (x1, x2, and x3) and for industry and year effects, include them
in the variable list after y.
Question marks (?) can be used as wild cards for single letters or numbers. The
asterisk/star (*) is a wild card that represents many characters in the variable name. They
could be numbers or letters.
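A sketch of a regression with controls plus industry and year dummies, using the wild cards described above. The data are simulated, and the names ind, ind_d, and yr_d are hypothetical. The last line uses factor-variable notation (Stata 11 or later) as an alternative to hand-made dummies.

```stata
clear
set obs 200
set seed 2019
generate ind  = ceil(4 * runiform())          // 4 industries
generate year = 2010 + ceil(3 * runiform())   // 3 sample years
generate x1 = rnormal()
generate x2 = rnormal()
generate x3 = rnormal()
generate y  = 1 + 0.5*x1 - 0.3*x2 + 0.2*x3 + rnormal()

* Dummies: ind_d1-ind_d4 and yr_d1-yr_d3
tabulate ind,  generate(ind_d)
tabulate year, generate(yr_d)

* ind_d* matches every variable starting with ind_d; yr_d? matches yr_d plus one character.
* Stata omits one dummy from each set because of collinearity with the constant.
regress y x1 x2 x3 ind_d* yr_d?

* Same model with factor variables; x? matches x1, x2, and x3
regress y x? i.ind i.year
```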
****************************************************
How to find/save fitted values and residuals from regression?
****************************************************
• Generates a new variable (here arbitrarily named yhat) equal to the predicted (fitted) values
from the most recent regression.
predict yhat, xb
• Generates a new variable (here arbitrarily named e) equal to the residuals from the most
recent regression.
predict e, resid
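A short sketch of both predict commands, using the auto dataset that ships with Stata as a stand-in (price plays the role of y; mpg and weight play the roles of the regressors):

```stata
sysuse auto, clear
regress price mpg weight

* Fitted values and residuals from the regression just estimated
predict yhat, xb
predict e, resid

* Observed value = fitted value + residual (up to rounding)
list price yhat e in 1/5
summarize e
```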
regress y x1 x2 x3
• Performs an F test of the null hypothesis that coefficients on x1 and x2 both equal zero in the
most recent regression model.
test x1 x2
• Test whether a coefficient equals a specified constant. For example, to test the null
hypothesis that the coefficient on x1 equals 1 (H0: β1 = 1), instead of testing the usual
null hypothesis that it equals 0 (H0: β1 = 0), type
test x1 = 1
• Test whether two coefficients are equal. For example, the following command evaluates
the null hypothesis H0: β1 = β2
test x1=x2
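The three forms of test can be tried on the auto dataset that ships with Stata, used here only as a stand-in (price for y; mpg, weight, and length for x1, x2, and x3):

```stata
sysuse auto, clear
regress price mpg weight length

* Joint F test: coefficients on mpg and weight are both zero
test mpg weight

* Coefficient on mpg equals the constant 1
test mpg = 1

* Coefficients on mpg and weight are equal
test mpg = weight
```

After each test, the F statistic and its p-value are available in r(F) and r(p).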
• Calculates robust (Huber/White) estimates of standard errors. See the User’s Guide for details.
The robust option works with many other model fitting commands as well.
• Displays a matrix of Pearson correlations, using pairwise deletion of missing values and
showing probabilities from t tests of null hypothesis H0: ρ = 0, for each correlation.
Statistically significant correlations (in this example, p < .05) are indicated by stars (*).
• To obtain the Spearman rank correlation between x1 and x2, equivalent to the Pearson
correlation if these variables were transformed into ranks, type
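The three items above can be sketched with the auto dataset that ships with Stata, used only as a stand-in for y, x1, and x2:

```stata
sysuse auto, clear

* OLS with robust (Huber/White) standard errors
regress price mpg weight, robust

* Pairwise Pearson correlations with p-values; stars mark p < .05
pwcorr price mpg weight, sig star(.05)

* Spearman rank correlation between two variables
spearman mpg weight
```

The robust option changes only the standard errors and test statistics; the coefficient estimates are the same as without it.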