Azdren's Stata Reference Sheet
last updated: 11/21/22
Disclaimer: This is a work in progress, so there may be errors. If you find a flaw, please let me know via email.
Please feel free to bookmark and share this page.
My most used commands:
help gen
‘help’ is a command in Stata that will show you information about any command. In this example, you will get help for the command ‘gen’.
clear
this command clears all of the loaded data in Stata
set obs 50
sets the number of observations to 50
gen newvar = oldvar1^2 * oldvar2
generate a new variable per observation, which has the value of ‘oldvar1’ squared, multiplied by ‘oldvar2’.
gen newvar = _n
generate a new variable that holds each observation’s number, running from 1 for the first observation up to the total number of observations for the last
egen var2=std(var1)
‘egen’ allows you to do more complex calculations. Here, ‘std()’ creates a standardized version of ‘var1’ (mean 0, standard deviation 1); to store the standard deviation itself, use ‘sd()’ instead.
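To see the commands above working together, here is a minimal sketch that builds a tiny dataset from scratch. The variable names ‘id’, ‘sq’, and ‘z’ are made up for illustration.

```stata
* Start from an empty dataset with 5 observations
clear
set obs 5

* _n is the observation number, so id runs from 1 to 5
gen id = _n
gen sq = id^2

* std() standardizes sq to mean 0 and standard deviation 1
egen z = std(sq)

list
summarize z
```

The ‘summarize’ output should show a mean of (approximately) 0 and a standard deviation of 1 for ‘z’.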
replace var1=0 if var1==.
this command will replace all observations of ‘var1’ that have a missing value, with the value ‘0’.
replace var1=0 if var1==. & (othervar==1 | othervar==3)
this command will replace all observations of ‘var1’ that have a missing value with the value ‘0’, but only where ‘othervar’ has a value of ‘1’ or ‘3’. Stata allows only one ‘if’ per command, so the conditions are combined with ‘&’ (and) and ‘|’ (or).
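A quick way to check a compound condition like this is to try it on a few typed-in rows. This is only a sketch with made-up values; the parentheses matter because ‘&’ is evaluated before ‘|’.

```stata
* Four made-up rows; a period means missing
clear
input var1 othervar
. 1
. 2
5 3
. 3
end

* Only missing values where othervar is 1 or 3 are set to 0
replace var1 = 0 if var1 == . & (othervar == 1 | othervar == 3)

list
```

Rows 1 and 4 become 0, row 2 stays missing (its ‘othervar’ is 2), and row 3 keeps its value of 5.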
recode newvar 1=0 2=0 3=0 4=1 5=1
this command turns all ‘newvar’ values that were 1, 2, or 3, into ‘0’. And all values of ‘4’ and ‘5’ into ‘1’.
drop freq year
this command drops the variables ‘freq’ and ‘year’ from the dataset
keep freq year
this command drops everything except the variables ‘freq’ and ‘year’ from the dataset
rename var1 newVARname
this command renames the variable ‘var1’ to ‘newVARname’
label variable newVARname "New Variable Label"
this command allows you to label the variable ‘newVARname’ the way you want it to appear in graphs and tables
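label define newVARnameLABELS 0 "low" 1 "medium" 2 "high"
this command defines a set of value labels named ‘newVARnameLABELS’, which assigns labels to the values 0, 1, and 2. The labels are not attached to any variable until you run ‘label values’.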
label values newVARname newVARnameLABELS
this command attaches the value labels defined in ‘newVARnameLABELS’ to the variable ‘newVARname’
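The three labeling commands are easiest to understand when run in sequence. Below is a small sketch; the variable ‘level’ and the label set ‘levelLABELS’ are hypothetical names.

```stata
* Three observations with values 0, 1, 2
clear
set obs 3
gen level = _n - 1

* Label the variable itself (shown in graphs and tables)
label variable level "Satisfaction level"

* Define the value labels, then attach them to the variable
label define levelLABELS 0 "low" 1 "medium" 2 "high"
label values level levelLABELS

* The table now shows low/medium/high instead of 0/1/2
tabulate level
```

Note that Stata needs plain straight double quotes around label text; curly quotes pasted from a web page or word processor will cause an error.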
list in 3/5 if gender==1
lists all observations between observation 3 and 5 where the variable ‘gender’ has a value of ‘1’
order var1 var3
this command places ‘var1’ and ‘var3’ as the first and second columns in the dataset
sum comp if gender==0 & race==1
this command shows basic summary statistics for ‘comp’ (number of observations, mean, standard deviation, minimum, and maximum), using only the observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’
sum comp if gender==0 & race==1, detail
this command shows more detailed summary statistics for ‘comp’ (adding percentiles, variance, skewness, and kurtosis), using only the observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’
des
this command provides a description of the dataset
tab happy trust, col row cell
this command creates a cross-tabulation of the variables ‘happy’ and ‘trust’. Alongside the counts, it displays each cell as a percentage of its column (‘col’), of its row (‘row’), and of the whole table (‘cell’).
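If you want to try the summary and tabulation commands without your own data, the sketch below uses the ‘auto’ example dataset that ships with Stata. The variables (‘price’, ‘mpg’, ‘foreign’, ‘rep78’) come from that dataset, not from the examples above.

```stata
* Load the example dataset that ships with Stata
sysuse auto, clear

* Describe the dataset
describe

* Basic, then detailed, summary statistics for a subset
summarize price if foreign == 0 & mpg > 20
summarize price if foreign == 0 & mpg > 20, detail

* Cross-tabulation with row, column, and cell percentages
tabulate foreign rep78, row col cell
```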
alpha var1 var2 var3, item gen(var_all)
this command computes Cronbach’s alpha for the scale made up of ‘var1’, ‘var2’, and ‘var3’. The ‘item’ option shows how each item relates to the scale, and ‘gen(var_all)’ saves the resulting scale as a new variable named ‘var_all’.
gen xvar=runiform(22,303)
this command generates uniformly distributed random numbers between 22 and 303
gen select=round(xvar, 1)
this command generates ‘select’, which is ‘xvar’ rounded to the nearest integer
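Random draws change every time you run them unless you set a seed first. Here is a minimal sketch; the seed value 12345 is arbitrary, and the two-argument form of ‘runiform()’ assumes a reasonably recent version of Stata.

```stata
* 100 empty observations
clear
set obs 100

* Setting the seed makes the random draws reproducible
set seed 12345

* Uniform random numbers between 22 and 303, then rounded
gen xvar = runiform(22, 303)
gen select = round(xvar, 1)

summarize xvar select
```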
Data analysis tools
encode varname, gen(newvar)
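this command converts the string variable ‘varname’ to a numeric variable ‘newvar’. Each distinct string gets an integer code (1, 2, 3, ...), and the original text is kept as a value label.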
destring year month, replace
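this command converts the string variables ‘year’ and ‘month’ to numeric variables. Unlike ‘encode’, which assigns integer codes with value labels to text categories, ‘destring’ reads the digits stored in the string as the number itself, so use it when the strings already contain numbers.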
decode varname, gen(newvar)
this command converts the labeled numeric variable ‘varname’ into a new string variable ‘newvar’, using the text of its value labels
tostring ruledate, replace
this command converts a numeric variable into a string containing the same digits. Unlike ‘decode’, which turns value labels into text, ‘tostring’ writes out the number itself.
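The difference between the four conversion commands is easiest to see side by side. The sketch below uses made-up data: ‘city’ is text with no numeric meaning, while ‘year’ is a number that happens to be stored as a string.

```stata
* Made-up data: both variables start out as strings
clear
input str6 city str4 year
"Boston" "2019"
"Austin" "2020"
"Boston" "2021"
end

* encode: text categories become labeled integer codes
* (codes follow alphabetical order: Austin = 1, Boston = 2)
encode city, gen(city_code)

* destring: the digits in the string become the number itself
destring year, replace

* decode: the value labels go back into a string variable
decode city_code, gen(city_str)

* tostring: the number is written out as a string
tostring year, gen(year_str)

describe
list
```

As a rule of thumb, ‘encode’ and ‘decode’ work through value labels, while ‘destring’ and ‘tostring’ work on the digits themselves.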
real(var1)
this function (used inside an expression, for example with ‘gen’) converts the string ‘var1’ into a real number, or to missing ‘.’ if it can’t be converted
strlen(varname)
this function returns the number of characters in the string variable ‘varname’
substr(varname,5,7)
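this function captures part of the string in the variable ‘varname’: the second argument is the starting position and the third is the number of characters to take, so this returns 7 characters starting at character ‘5’ (not the characters between 5 and 7)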
strpos(varname, "whatever word")
this function returns the position at which "whatever word" first appears in the string ‘varname’, or ‘0’ if it is not found
ustrregexm(s,re,noc)
this function performs a match of a regular expression and returns a ‘1’ if the expression ‘re’ is satisfied by the string ‘s’, otherwise returns a ‘0’. The optional third argument ‘noc’ makes the match case-insensitive when it is not zero.
substr(long_string,1,4)
if “long_string” = “ABCDEFG”, then this function will output “ABCD”, because you are asking to capture 4 characters starting from the first position.
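You can test the string functions above without any data by using ‘display’, which evaluates an expression and prints the result. The strings here are just examples, and ‘ustrregexm()’ assumes Stata 14 or newer.

```stata
* Length of a string
display strlen("ABCDEFG")

* 4 characters starting at position 1: ABCD
display substr("ABCDEFG", 1, 4)

* 7 characters requested from position 5, but only EFG remain
display substr("ABCDEFG", 5, 7)

* Position of the first match (3), or 0 when not found
display strpos("ABCDEFG", "CD")
display strpos("ABCDEFG", "XY")

* Case-insensitive regular expression match: returns 1
display ustrregexm("ABCDEFG", "^abc", 1)

* String to number; text that is not a number becomes missing
display real("12.5")
display real("abc")
```

Inside a dataset, the same functions go on the right-hand side of ‘gen’ or ‘replace’, with a variable name in place of the quoted string.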
Data management
use usa_demo
this command loads the dataset ‘usa_demo’ into Stata. Furthermore, adding ‘, clear’ at the end will clear any previously loaded data before loading the new data
merge 1:1 household person using usa_demo
this command does a one-to-one merge, using the specified variables ‘household’ and ‘person’, merging the loaded dataset with the ‘usa_demo’ dataset. (note: since Stata 11, ‘merge’ sorts the data sets for you, so pre-sorting is only needed with the older merge syntax; it also creates a variable ‘_merge’ that shows where each observation came from)
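Here is a small self-contained sketch of a one-to-one merge. It uses a temporary file (the name ‘demo’ and all values are made up) in place of ‘usa_demo’, and it should be run as a do-file so that the temporary file name stays defined.

```stata
* Build the "using" dataset and save it to a temporary file
clear
input household person income
1 1 100
1 2 200
2 1 300
end
tempfile demo
save `demo'

* Build the "master" dataset
clear
input household person age
1 1 30
1 2 25
3 1 40
end

* Match on household and person
merge 1:1 household person using `demo'

* _merge: 1 = master only, 2 = using only, 3 = matched
tabulate _merge
list
```

In this sketch two observations match, one appears only in the master data, and one appears only in the using data, so the result has four observations.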
append using usa_demo
this command adds (or appends) the observations from ‘usa_demo’ to the end of the currently loaded dataset
expand count
this command blows up the data set (not like with explosions, but) using the numeric variable ‘count’ to set how many copies of each observation the data set should contain; an observation with a ‘count’ of 3 ends up appearing 3 times (2 new copies are added)
contract pray comp
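this command is essentially the opposite of ‘expand’: it collapses the dataset to one observation per unique combination of ‘pray’ and ‘comp’, and adds a variable ‘_freq’ that counts how many observations had each combination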
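A round trip shows how ‘expand’ and ‘contract’ mirror each other. This is a sketch with made-up values.

```stata
* Two rows, with count saying how many copies each should have
clear
input pray comp count
1 0 2
2 1 3
end

* After expand there are 2 + 3 = 5 observations
expand count
list

* contract goes back to one row per pray/comp combination;
* the new variable _freq holds the counts (2 and 3)
contract pray comp
list
```

Note that ‘contract’ keeps only the variables you list plus ‘_freq’, so ‘count’ is no longer in the dataset afterwards.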
Exporting to Excel or Word
First, you have to run your regressions. So, for example:
logistic var1 var2
estimates store my_results
logistic var3 var4
estimates store my_results2
Then, the first time you run this code, you need to install the package by typing:
ssc install outreg2
To get more information on this package, type:
help outreg2
To export to Excel, run the following code:
outreg2 [my_results my_results2] using reg_results, replace excel dec(3) label eform e(r2_p ll)
Or, to export summary statistics to Word, run the following code:
outreg2 using x.doc, replace sum(log) keep(price mpg turn)
Basic graphing and analyzing data
histogram res
this creates a basic histogram of the variable ‘res’
scatter mins units
scatterplot of ‘mins’ and ‘units’
avplots
added-variable plots (must be done after a regression)
graph twoway (lfit yvar xvar)
Twoway linear prediction plots
Plotting residuals (or studentized residuals) against y, y-hat. First, run the multiple regression.
Then, predict the studentized residuals:
predict StudsRes, rstudent
predict yhat
graph twoway (lfit StudsRes yhat) (scatter StudsRes yhat)
lfit means linear fit.
Evidence for heteroscedasticity exists when the further you go out on x, the more scattered the y variable is (e.g., a shotgun blast shape). For example, as the house value increases, the prediction becomes less useful and the residuals spread out.
Also plot the histogram.
histogram StudsRes
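The full residual check can be sketched on the ‘auto’ example dataset that ships with Stata. The regression of ‘price’ on ‘mpg’ and ‘weight’ is only a placeholder model, and ‘estat hettest’ is an extra step not covered above: it runs a formal test for heteroscedasticity after ‘regress’.

```stata
* Placeholder regression on the built-in example data
sysuse auto, clear
regress price mpg weight

* Studentized residuals and fitted values
predict StudsRes, rstudent
predict yhat

* Residuals against fitted values, with a linear fit line
graph twoway (scatter StudsRes yhat) (lfit StudsRes yhat)

* Distribution of the studentized residuals
histogram StudsRes

* Optional: formal test for heteroscedasticity
estat hettest
```

If the scatter fans out as ‘yhat’ grows, or the test reports a small p-value, that points toward heteroscedasticity.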
To plot residuals, first run the regression.
predict Res, residuals
predict yhat
histogram Res
If the distribution in the histogram is not roughly normal, then the model is probably overshooting or undershooting its predictions in a systematic way
Solving specific problems in Stata
sort ein
gen hold=ein[_n]-ein[_n-1]
drop if hold==0
drop hold
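Use this if you want to keep only the first observation when you have multiple entries using the same ID. After sorting, ‘hold’ is 0 whenever an observation has the same ‘ein’ as the one before it, so those repeats are dropped. (This assumes ‘ein’ is a numeric variable.)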
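An alternative sketch of the same idea uses ‘bysort’, which also works when the ID is a string. The data and the variable ‘value’ are made up; putting ‘value’ in parentheses sorts by it within each ‘ein’ so that "first" is well defined.

```stata
* Made-up data with repeated IDs
clear
input ein value
101 1
101 2
102 3
103 4
103 5
end

* Within each ein (sorted by value), keep only the first row
bysort ein (value): keep if _n == 1

list
```

This leaves one row each for 101, 102, and 103. Stata also has a built-in ‘duplicates’ command for reporting and dropping duplicate observations.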