Azdren's Stata Reference Sheet

last updated: 11/21/22

Disclaimer: This is a work in progress, so there may be errors. If you find one, please let me know via email.

Please feel free to bookmark and share this page.

My most used commands:

help gen

help is a Stata command that shows the documentation for any command. In this example, you will get help for the command “gen.”

clear

this command clears all the data currently loaded in Stata

set obs 50

sets the number of observations to 50

gen newvar = oldvar1^2 * oldvar2

generate a new variable per observation, which has the value of ‘oldvar1’ squared, multiplied by ‘oldvar2’. 

gen newvar = _n

generates a new variable that holds each observation’s number: ‘_n’ is 1 for the first observation and counts up to the total number of observations for the last

egen var2=std(var1)

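‘egen’ provides extended functions for ‘generate’. Here, ‘std()’ creates a standardized copy of ‘var1’ (mean 0, standard deviation 1). To get the standard deviation itself, use ‘sd()’ instead, as in ‘egen var2=sd(var1)’.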
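To see the difference between ‘std()’ and ‘sd()’, here is a small sketch on made-up data. The variable names ‘var1_z’ and ‘var1_sd’ are just placeholders I picked for the illustration.

```stata
clear
set obs 5
* var1 holds the observation number: 1, 2, 3, 4, 5
gen var1 = _n
* std() gives z-scores: (var1 - mean) / standard deviation
egen var1_z = std(var1)
* sd() gives the standard deviation itself, repeated on every row
egen var1_sd = sd(var1)
list
```

The ‘var1_sd’ column is the same number on every row, while ‘var1_z’ is negative below the mean and positive above it.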

replace var1=0 if var1==.

this command will replace all observations of ‘var1’ that have a missing value, with the value ‘0’.

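replace var1=0 if var1==. & (othervar==1 | othervar==3)

this command will replace all observations of ‘var1’ that have a missing value, with the value ‘0’, if and only if ‘othervar’ has a value of ‘1’ or ‘3’. Stata allows only one ‘if’ per command, so the conditions are combined with ‘&’, and the ‘or’ part is wrapped in parentheses.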
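Here is a small sketch of a conditional replace on made-up data. It uses ‘missing()’ and ‘inlist()’, which are built-in Stata functions that read a bit more easily than chains of ‘==’ and ‘|’; the data values are invented for the example.

```stata
clear
input var1 othervar
. 1
. 2
. 3
7 1
end
* only rows where var1 is missing AND othervar is 1 or 3 are changed
replace var1 = 0 if missing(var1) & inlist(othervar, 1, 3)
list
```

The second row keeps its missing value because ‘othervar’ is 2 there, and the last row keeps its 7 because it was never missing.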

recode newvar 1=0 2=0 3=0 4=1 5=1

this command turns all ‘newvar’ values that were 1, 2, or 3 into ‘0’, and all values of ‘4’ and ‘5’ into ‘1’. The same rules can be written more compactly as ‘recode newvar (1/3=0) (4 5=1)’.

drop freq year

this command drops the variables ‘freq’ and ‘year’ from the dataset

keep freq year

this command drops everything except the variables ‘freq’ and ‘year’ from the dataset

rename var1 newVARname

this command renames the variable ‘var1’ to ‘newVARname’

label variable newVARname "New Variable Label"

this command allows you to label the variable ‘newVARname’ the way you want it to appear in graphs and tables

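label define newVARnameLABELS 0 "low" 1 "medium" 2 "high"

this command defines a set of value labels named ‘newVARnameLABELS’ for the values 0, 1, and 2. The labels are not attached to any variable until you run ‘label values’.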

label values newVARname newVARnameLABELS

this command attaches the value labels defined in ‘newVARnameLABELS’ to the variable ‘newVARname’
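The three label commands usually go together, so here is a minimal sketch of the whole sequence on made-up data. Note that the quotes must be plain straight quotes when you type them into Stata.

```stata
clear
input newVARname
0
1
2
1
end
* label the variable itself
label variable newVARname "New Variable Label"
* define the value labels, then attach them to the variable
label define newVARnameLABELS 0 "low" 1 "medium" 2 "high"
label values newVARname newVARnameLABELS
* the table now shows low / medium / high instead of 0 / 1 / 2
tab newVARname
* the stored values are still numbers, so conditions use the numbers
list if newVARname == 1
```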

list in 3/5 if gender==1

lists all observations between observation 3 and 5 where the variable ‘gender’ has a value of ‘1’

order var1 var3

this command moves ‘var1’ and ‘var3’ to the first and second columns of the dataset

sum comp if gender==0 & race==1

this command reports summary statistics (number of observations, mean, standard deviation, minimum, and maximum) for the variable ‘comp’, using only the observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’

sum comp if gender==0 & race==1, detail

this command reports more detailed statistics for ‘comp’ (adding percentiles, variance, skewness, and kurtosis), using only the observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’

des

this command provides a description of the dataset

tab happy trust, col row cell

this command creates a cross-tabulation of the variables ‘happy’ and ‘trust’. The options add three percentages to each cell: ‘col’ gives the percentage within the column, ‘row’ the percentage within the row, and ‘cell’ the percentage of the whole table.

alpha var1 var2 var3, item gen(var_all)

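this command computes Cronbach’s alpha for the scale formed by ‘var1’, ‘var2’, and ‘var3’. The ‘item’ option also reports item-test and item-rest correlations, and ‘gen(var_all)’ saves the resulting scale score in a new variable named ‘var_all’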

gen xvar=runiform(22,303)

this command generates uniformly distributed random numbers between 22 and 303

gen select=round(xvar, 1)

this command generates ‘select’, which is ‘xvar’ rounded to the nearest integer
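Random numbers change every time you run the code unless you fix the seed first. A minimal sketch, assuming you want the same draws on every run (the seed value 12345 is arbitrary):

```stata
clear
set obs 10
* fixing the seed makes the random draws reproducible
set seed 12345
gen xvar = runiform(22, 303)
gen select = round(xvar, 1)
list
```

Running this twice gives the same ‘xvar’ and ‘select’ values both times; remove the ‘set seed’ line and they will differ between runs.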

Data analysis tools

encode varname, gen(newvar)

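this command converts the string variable ‘varname’ into a numeric variable ‘newvar’. Each distinct string gets an integer code (1, 2, 3, and so on, in alphabetical order), and the original text is kept as a value label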

destring year month, replace

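this command converts the string variables ‘year’ and ‘month’ to numbers. The difference from ‘encode’: ‘destring’ is for strings that already contain numbers (the text “2010” becomes the number 2010), while ‘encode’ is for text categories, which it replaces with integer codes plus labels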

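decode varname, gen(newvar)

this command converts the labeled numeric variable ‘varname’ into a string variable ‘newvar’, using its value labels as the text. The ‘gen()’ option is required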

tostring ruledate, replace

this command converts the numeric variable ‘ruledate’ into a string. The difference from ‘decode’: ‘tostring’ writes out the number itself as text, while ‘decode’ writes out the value label attached to the number
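The four conversion commands are easiest to tell apart side by side. This is a small sketch on made-up data; all variable names here are invented for the illustration.

```stata
clear
input str4 year_str str6 region
"2010" "west"
"2011" "east"
"2012" "west"
end
* destring: the text "2010" becomes the number 2010
destring year_str, gen(year_num)
* encode: each distinct string gets an integer code plus a value label
* (alphabetical order, so east = 1 and west = 2)
encode region, gen(region_code)
* decode: back from the labeled codes to text
decode region_code, gen(region_txt)
* tostring: the digits of a number become text
tostring year_num, gen(year_txt)
* nolabel shows the underlying codes instead of the labels
list, nolabel
```

A quick rule of thumb: use ‘destring’ and ‘tostring’ when the content is really a number, and ‘encode’ and ‘decode’ when the content is really a category.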

real(var1)

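this function converts the string ‘var1’ into a number, or returns missing ‘.’ if it can’t be converted. It is a function rather than a command, so use it inside an expression, for example ‘gen num1 = real(var1)’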

strlen(varname)

this function returns the number of characters in the string variable ‘varname’

substr(varname,5,7)

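this function returns part of the string in ‘varname’. The second argument is the starting position and the third is the number of characters to take, so this example returns 7 characters starting at character ‘5’ (characters 5 through 11)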

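strpos(varname, "whatever word")

this function returns the position at which “whatever word” first appears in the string ‘varname’, or ‘0’ if it is not found. For a simple found/not-found indicator, test whether the result is greater than 0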

ustrregexm(s,re,noc)

this function matches a regular expression and returns ‘1’ if the string ‘s’ satisfies the expression ‘re’, otherwise ‘0’. The third argument ‘noc’ is optional; a nonzero value makes the match case-insensitive

substr(long_string,1,4)

if “long_string” = “ABCDEFG”, then this function will return “ABCD”, because you are asking for 4 characters starting at the first position.
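The string functions above are used inside expressions, typically with ‘gen’. A minimal sketch on made-up strings (the variable names and the regular expression are my own choices for the example):

```stata
clear
input str20 s
"ABCDEFG"
"hello world"
"123"
end
* number of characters
gen len = strlen(s)
* 4 characters starting at position 1
gen part = substr(s, 1, 4)
* position of the first match, 0 if not found
gen pos = strpos(s, "world")
* 1 if the whole string is digits, 0 otherwise
gen match = ustrregexm(s, "^[0-9]+$")
* numeric value if the string is a number, missing otherwise
gen num = real(s)
list
```

For the second row, ‘pos’ is 7 because “world” starts at the seventh character of “hello world”; for the other rows it is 0.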

Data management

use usa_demo

this command loads the dataset ‘usa_demo’ into Stata. Furthermore, adding ‘, clear’ at the end will clear any previously loaded data before loading the new data

merge 1:1 household person using usa_demo

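this command does a one-to-one merge of the loaded dataset with the ‘usa_demo’ dataset, matching observations on the variables ‘household’ and ‘person’. (note: together, these variables must uniquely identify each observation in both datasets. With the ‘merge 1:1’ syntax you do not need to sort first; Stata sorts the data if necessary. The new variable ‘_merge’ records whether each observation was matched)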
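Here is a small self-contained sketch of a one-to-one merge, using a temporary file in place of ‘usa_demo’. The data are made up, and the code should be run as a do-file in one go, because the temporary file name ‘demo’ is a local macro that disappears afterwards.

```stata
clear
input household person age
1 1 40
1 2 38
2 1 25
end
* save the first dataset to a temporary file
tempfile demo
save `demo'

clear
input household person income
1 1 50000
2 1 31000
3 1 42000
end
* match on household + person; no sorting needed beforehand
merge 1:1 household person using `demo'
* _merge: 1 = only in loaded data, 2 = only in using data, 3 = matched
tab _merge
list
```

Always look at ‘_merge’ after merging: unmatched observations are kept by default, with missing values for the variables that came from the other dataset.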

append using usa_demo

this command adds (or appends) the observations from ‘usa_demo’ to the end of the currently loaded dataset

expand count

this command blows up the dataset (not like with explosions, but) by replacing each observation with as many copies as the value of the numeric variable ‘count’. For example, an observation with a ‘count’ of 3 ends up as 3 identical observations.

contract pray comp

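this command is essentially the opposite of ‘expand’: it collapses the dataset to one observation per unique combination of ‘pray’ and ‘comp’, and stores the number of observations in each combination in a new variable ‘_freq’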
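A quick round trip shows how ‘expand’ and ‘contract’ mirror each other. The data here are made up for the illustration.

```stata
clear
input pray comp count
1 0 3
2 1 2
end
* 3 copies of the first row and 2 of the second: 5 observations
expand count
list
* back to one row per pray/comp combination, with the counts in _freq
contract pray comp
list
```

After ‘contract’, only ‘pray’, ‘comp’, and ‘_freq’ remain in the dataset, so save your data first if you still need the other variables.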

Exporting to Excel or Word

First, you have to run your regressions. So, for example:

logistic var1 var2

estimates store my_results

logistic var3 var4

estimates store my_results2

 

Then, the first time you run this code, you need to install the package, by simply typing the code:

ssc install outreg2

To get more information on this package, type:

help outreg2

 

To export to Excel, run the following code:

outreg2 [my_results my_results2] using reg_results, replace excel dec(3) label eform e(r2_p ll)

 

Or, to export summary statistics to Word, run the following code:

outreg2 using x.doc, replace sum(log) keep(price mpg turn)

Basic graphing and analyzing data

histogram res

this creates a basic histogram of the variable ‘res’

scatter mins units

this creates a scatterplot with ‘mins’ on the y-axis and ‘units’ on the x-axis

avplots

this creates added-variable plots, one for each predictor (must be done after a regression)

graph twoway (lfit yvar xvar)

this plots the fitted line from a linear regression of ‘yvar’ on ‘xvar’ (a twoway linear prediction plot)

Plotting residuals (or studentized residuals) against y, y-hat. First, run the multiple regression.

Then, predict the studentized residuals:

predict StudsRes, rstudent

predict yhat

graph twoway (lfit StudsRes yhat) (scatter StudsRes yhat)

lfit means linear fit.

Evidence for heteroscedasticity exists when the residuals become more scattered the further you go out along the x-axis (e.g., a shotgun-blast or funnel shape). For example, as the house value increases, the prediction becomes less reliable.

Also plot the histogram.

histogram StudsRes

 

To plot residuals, first run the regression.

predict Res, r

predict yhat

histogram Res

If the distribution of the residuals is not approximately normal (for example, if it is clearly skewed), then the model is probably overshooting or undershooting the prediction systematically
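To try the residual checks from start to finish, here is a sketch using the ‘auto’ dataset that ships with Stata. The choice of ‘price’, ‘mpg’, and ‘weight’ is only for illustration, and the ‘estat hettest’ line is an extra I am adding: it runs a Breusch-Pagan test for heteroscedasticity to go along with the plots.

```stata
sysuse auto, clear
* run the multiple regression first
regress price mpg weight
* fitted values and studentized residuals
predict yhat
predict StudsRes, rstudent
* residuals against fitted values, with a fitted line on top
twoway (scatter StudsRes yhat) (lfit StudsRes yhat)
* histogram of the residuals with a normal curve for comparison
histogram StudsRes, normal
* formal test for heteroscedasticity after regress
estat hettest
```

A fan or funnel shape in the scatterplot, or a small p-value from the test, both point toward heteroscedasticity.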

Solving specific problems in Stata

sort ein

gen hold=ein[_n]-ein[_n-1]

drop if hold==0

drop hold

Use the commands above if you want to keep only the first observation when you have multiple entries with the same ID (here, the numeric ID variable ‘ein’).
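A shorter way to do the same thing is with ‘bysort’, which also works when the ID is a string. This is a sketch on made-up data; the second variable ‘amount’ is hypothetical and is only there to control which entry counts as “first” within each ID.

```stata
clear
input ein amount
101 5
101 9
202 3
303 1
303 4
end
* sort by ein, and by amount within ein, then keep the first row per ein
bysort ein (amount): keep if _n == 1
list
```

Without a second sort variable in the parentheses, the order of the rows within the same ‘ein’ is not guaranteed, so which duplicate survives can vary.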