## Azdren's Stata Reference Sheet

*last updated: 11/21/22*

Disclaimer: This is a work in progress, and there may be errors. If you find a flaw, please let me know via email.

Please feel free to bookmark and share this page.

## My most used commands:

**help gen**

‘help’ is a command in Stata that will show you information about any command. In this example, you will get help for the command “gen.”

**clear**

this command clears all of the loaded data in Stata

**set obs 50**

sets the number of observations to 50

**gen newvar = oldvar1^2 * oldvar2**

generate a new variable per observation, which has the value of ‘oldvar1’ squared, multiplied by ‘oldvar2’.

**gen newvar = _n**

generate a new variable that holds each observation's row number: ‘_n’ is 1 for the first observation and runs up to the number of the last observation
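The commands so far can be combined to build a small dataset from scratch. This is a minimal sketch; the variable names ‘id’ and ‘id_sq’ are made up for illustration.

```stata
* Start from an empty dataset with 5 observations
clear
set obs 5
* _n is the row number, so id runs 1, 2, 3, 4, 5
gen id = _n
* Any expression can use existing variables
gen id_sq = id^2
list
```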

**egen var2=std(var1)**

‘egen’ allows you to do more complex calculations. Here, ‘std()’ creates a standardized version of ‘var1’ (mean 0, standard deviation 1). To store the standard deviation itself, use ‘sd()’ instead.
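The two ‘egen’ functions ‘std()’ and ‘sd()’ are easy to mix up, so here is a small side-by-side sketch on toy data (the variable names are hypothetical).

```stata
clear
set obs 5
gen var1 = _n
* std(): standardized values, i.e. (var1 - mean) / standard deviation
egen var1_z = std(var1)
* sd(): the standard deviation itself, repeated on every row
egen var1_sd = sd(var1)
list
```

After this runs, ‘var1_z’ differs on every row, while ‘var1_sd’ holds the same number throughout.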

**replace var1=0 if var1==.**

this command will replace all observations of ‘var1’ that have a missing value, with the value ‘0’.

**replace var1=0 if var1==. & (othervar==1 | othervar==3)**

this command will replace all observations of ‘var1’ that have a missing value, with the value ‘0’, if and only if either ‘othervar’ has a value of ‘1’ or ‘3’.
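A command takes only one ‘if’, so multiple conditions have to be joined with ‘&’ (and) and ‘|’ (or). The sketch below uses made-up data to show the combined condition; the parentheses matter because ‘&’ is evaluated before ‘|’.

```stata
clear
input var1 othervar
. 1
. 2
. 3
7 1
end
* Only rows that are missing AND have othervar equal to 1 or 3 are changed
replace var1 = 0 if var1 == . & (othervar == 1 | othervar == 3)
list
```

An equivalent way to write the same condition is ‘if missing(var1) & inlist(othervar, 1, 3)’.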

**recode newvar 1=0 2=0 3=0 4=1 5=1**

this command turns all ‘newvar’ values that were 1, 2, or 3, into ‘0’. And all values of ‘4’ and ‘5’ into ‘1’.
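Note that ‘recode’ overwrites the variable in place. If you would rather keep the original, a sketch like the following writes the recoded values to a new variable instead (‘newvar_bin’ is a hypothetical name); the ranges ‘1/3’ and ‘4/5’ are shorthand for listing each value.

```stata
clear
set obs 5
gen newvar = _n
* 1, 2, 3 become 0; 4 and 5 become 1; newvar itself is left untouched
recode newvar (1/3 = 0) (4/5 = 1), gen(newvar_bin)
list
```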

**drop freq year**

this command drops the variables ‘freq’ and ‘year’ from the dataset

**keep freq year**

this command drops everything except the variables ‘freq’ and ‘year’ from the dataset

**rename var1 newVARname**

this command renames the variable ‘var1’ to ‘newVARname’

**label variable newVARname "New Variable Label"**

this command allows you to label the variable ‘newVARname’ the way you want it to appear in graphs and tables

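**label define newVARnameLABELS 0 "low" 1 "medium" 2 "high"**

this command defines a set of value labels named ‘newVARnameLABELS’ for the values 0, 1, and 2. The labels are not attached to any variable until you run ‘label values’.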

**label values newVARname newVARnameLABELS**

this command attaches the value labels defined in ‘newVARnameLABELS’ to the variable ‘newVARname’
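The three labeling commands are usually run together. Here is a minimal sketch on toy data; note that Stata needs straight double quotes around the label text.

```stata
clear
input newVARname
0
1
2
1
end
* Label the variable itself
label variable newVARname "New Variable Label"
* Define a label set, then attach it to the variable
label define newVARnameLABELS 0 "low" 1 "medium" 2 "high"
label values newVARname newVARnameLABELS
* Tables now show low/medium/high instead of 0/1/2
tabulate newVARname
```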

**list in 3/5 if gender==1**

lists all observations between observation 3 and 5 where the variable ‘gender’ has a value of ‘1’

**order var1 var3**

this command places ‘var1’ and ‘var3’ as the first and second columns in the dataset

**sum comp if gender==0 & race==1**

this command reports summary statistics for ‘comp’ (number of observations, mean, standard deviation, minimum, and maximum), using only observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’

**sum comp if gender==0 & race==1, detail**

this command reports more detailed statistics for ‘comp’ (adding percentiles, variance, skewness, and kurtosis), using only observations where the variable ‘gender’ has a value of ‘0’ and the variable ‘race’ has a value of ‘1’

**des**

this command provides a description of the dataset

**tab happy trust, col row cell**

this command creates a cross-tabulation of the variables ‘happy’ and ‘trust’ and, alongside the counts, displays each cell as a percentage of its column, of its row, and of the whole table.

**alpha var1 var2 var3, item gen(var_all)**

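this command computes Cronbach’s alpha for the scale formed from the variables ‘var1’, ‘var2’, and ‘var3’. The ‘item’ option adds item-level statistics, and ‘gen(var_all)’ stores the resulting scale score in a new variable named ‘var_all’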

**gen xvar=runiform(22,303)**

this command generates uniformly distributed random numbers (with decimals, not integers) between 22 and 303

**gen select=round(xvar, 1)**

this command generates ‘select’, which is ‘xvar’ rounded to the nearest integer
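Random draws change on every run unless you fix the seed first. The sketch below adds ‘set seed’ (the seed value 12345 is arbitrary) so the same numbers come back each time; the two-argument form ‘runiform(a,b)’ assumes a reasonably recent Stata version.

```stata
clear
set obs 50
* Fix the random-number seed so results are reproducible
set seed 12345
gen xvar = runiform(22, 303)
gen select = round(xvar, 1)
summarize xvar select
```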

## Data analysis tools

**encode varname, gen(newvar)**

this command converts the string variable ‘varname’ to a numeric variable ‘newvar’: each distinct string gets an integer code, and the original text is kept as a value label

**destring year month, replace**

this command converts the string variables ‘year’ and ‘month’ to numbers. Unlike ‘encode’, which assigns an integer code to each distinct string, ‘destring’ is for strings that already contain numerals, so the text “2019” becomes the number 2019

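**decode varname, gen(newvar)**

this command converts the labeled numeric variable ‘varname’ into a new string variable ‘newvar’ that holds the value labels (the reverse of ‘encode’)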

**tostring ruledate, replace**

this command converts the numeric variable ‘ruledate’ into a string that holds the digits themselves (the reverse of ‘destring’), whereas ‘decode’ writes out the value labels
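The difference between the four conversion commands is easiest to see side by side. This is a small sketch on invented data (the variables ‘city’ and ‘yr’ are hypothetical).

```stata
clear
input str6 city str4 yr
"Boston" "2019"
"Austin" "2021"
"Boston" "2020"
end
* encode: each distinct string gets an integer code (alphabetical) plus a value label
encode city, gen(city_num)
* destring: text that already looks like a number becomes that number
destring yr, replace
* decode: labeled codes go back to their label text
decode city_num, gen(city_str)
* tostring: the number itself is written out as text
tostring yr, gen(yr_str)
* nolabel shows the underlying codes rather than the labels
list, nolabel
```

In short, ‘encode’ and ‘decode’ are a pair that work through value labels, while ‘destring’ and ‘tostring’ are a pair that work on the digits themselves.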

**real(var1)**

this function (used inside an expression, e.g. with ‘gen’) converts the string in ‘var1’ into a real number, or to missing ‘.’ if it can’t be converted

**strlen(varname)**

this function returns the number of characters in the string variable ‘varname’

**substr(varname,5,7)**

this function returns up to ‘7’ characters of the string in ‘varname’, starting at character ‘5’. The third argument is a length, not an end position.

**strpos(varname, "whatever word")**

this function returns the position at which “whatever word” first appears in the string ‘varname’, or ‘0’ if it is not found

**ustrregexm(s,re,noc)**

this function matches the string ‘s’ against the regular expression ‘re’ and returns ‘1’ if the expression is satisfied, otherwise ‘0’. If ‘noc’ is given and is not 0, the match is case-insensitive

**substr(long_string,1,4)**

if “long_string” = “ABCDEFG”, then this function will return “ABCD”, because you are asking for 4 characters starting at the first position.
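The string functions above are not commands on their own; they go inside an expression, typically with ‘gen’ or ‘replace’. Here is a hedged sketch on three made-up strings (the regular expression is only an illustration of "digits and dots").

```stata
clear
input str20 s
"ABCDEFG"
"hello world"
"123.5"
end
gen len    = strlen(s)
* 4 characters starting at position 1
gen first4 = substr(s, 1, 4)
* up to 7 characters starting at position 5
gen mid    = substr(s, 5, 7)
* position of the first match, 0 if absent
gen pos    = strpos(s, "world")
* 1 if the whole string is digits and dots, else 0
gen hasnum = ustrregexm(s, "^[0-9.]+$")
* numeric value, or missing if the string is not a number
gen asnum  = real(s)
list
```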

## Data management

**use usa_demo**

this command loads the dataset ‘usa_demo’ into Stata. Furthermore, adding ‘, clear’ at the end will clear any previously loaded data before loading the new data

**merge 1:1 household person using usa_demo**

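this command does a one-to-one merge, using the specified variables ‘household’ and ‘person’, merging the loaded dataset with the ‘usa_demo’ dataset. (note: ‘household’ and ‘person’ together must uniquely identify observations in both datasets. Current versions of ‘merge’ sort the data for you, and the outcome of the match is recorded in a new variable ‘_merge’)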
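To see what a one-to-one merge produces, here is a self-contained sketch that builds two tiny datasets instead of using a real ‘usa_demo’ file. The data and the temporary file name ‘demo’ are hypothetical, and the ‘tempfile’ line assumes you run this from a do-file.

```stata
* Build the "using" dataset and save it to a temporary file
clear
input household person age
1 1 40
1 2 38
2 1 25
end
tempfile demo
save `demo'

* Build the "master" dataset, then merge on the two key variables
clear
input household person income
1 1 50000
2 1 30000
3 1 45000
end
merge 1:1 household person using `demo'
* _merge: 1 = master only, 2 = using only, 3 = matched in both
tabulate _merge
```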

**append using usa_demo**

this command adds (or appends) the observations from ‘usa_demo’ to the end of the currently loaded dataset

**expand count**

this command blows up the data set (not like with explosions, but) using the numeric variable ‘count’: each observation is replaced with ‘count’ copies of itself

**contract pray comp**

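this command is essentially the opposite of ‘expand’: it collapses the dataset to one observation per distinct combination of ‘pray’ and ‘comp’ and stores how many observations were collapsed in a new variable ‘_freq’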
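A quick round trip shows how ‘expand’ and ‘contract’ mirror each other. This sketch uses invented data with a numeric ‘count’ variable.

```stata
clear
input pray comp count
1 0 3
2 1 2
end
* Each row is replaced by count copies of itself: 3 + 2 = 5 rows
expand count
drop count
* Collapse back to one row per pray/comp combination; _freq holds the counts
contract pray comp
list
```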

## Exporting to Excel or Word

First, you have to run your regressions. So, for example:

**logistic var1 var2**

**estimates store my_results**

**logistic var3 var4**

**estimates store my_results2**

Then, the first time you run this code, you need to install the package by typing:

**ssc install outreg2**

To get more information on this package, type:

**help outreg2**

To export to Excel, run the following code:

**outreg2 [my_results my_results2] using reg_results, replace excel dec(3) label eform e(r2_p ll)**

Or, to export summary statistics to Word, run the following code:

**outreg2 using x.doc, replace sum(log) keep(price mpg turn)**

## Basic graphing and analyzing data

**histogram res**

this creates a basic histogram of the variable ‘res’

**scatter mins units**

scatterplot of ‘mins’ and ‘units’

**avplots**

creates added-variable plots, one per regressor (must be run after a regression)

**graph twoway (lfit yvar xvar)**

Twoway linear prediction plots

To plot residuals (or studentized residuals) against y or y-hat, first run the multiple regression.

Then, predict the studentized residuals and the fitted values:

**predict StudsRes, rstudent**

**predict yhat**

**graph twoway (lfit StudsRes yhat) (scatter StudsRes yhat)**

lfit means linear fit.

Evidence for heteroscedasticity exists when the points become more scattered the further you go out on x (e.g., a shotgun blast). For example, as the house value increases, the prediction becomes less precise.

Also plot the histogram.

**histogram StudsRes**

To plot residuals, first run the regression.

**predict Res, residuals**

**predict yhat**

**histogram Res**

If the distribution of the residuals is clearly not normal (for example, skewed to one side), then the model is probably systematically overshooting or undershooting its predictions
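The residual checks above can be run end to end on the ‘auto’ example dataset that ships with Stata. This is only a sketch: the regression of ‘price’ on ‘mpg’ and ‘weight’ is an arbitrary choice for illustration, not a recommended model.

```stata
sysuse auto, clear
regress price mpg weight
* Fitted values (the default for predict), raw and studentized residuals
predict yhat
predict Res, residuals
predict StudsRes, rstudent
* Residuals against fitted values, with a linear fit overlaid
graph twoway (scatter StudsRes yhat) (lfit StudsRes yhat)
* Histogram of the raw residuals with a normal curve for comparison
histogram Res, normal
```

A fan shape in the first graph points toward heteroscedasticity; a lopsided histogram in the second suggests the residuals are not normal.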

## Solving specific problems in Stata

**sort ein**

**gen hold=ein[_n]-ein[_n-1]**

**drop if hold==0**

**drop hold**

Use the commands above if you want to keep only the first observation when multiple entries share the same ID (here, the numeric variable ‘ein’)
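One caveat: ‘sort’ does not promise to keep tied observations in their original order, so which row counts as "first" within an ID can vary. A hedged alternative, sketched on made-up data, records the original row order first (‘order’ is a hypothetical helper variable) and also works when the ID is a string.

```stata
clear
input ein amount
101 5
101 9
202 4
303 7
303 1
303 2
end
* Remember the original row order so "first" is well defined
gen long order = _n
* Within each ein, sorted by original order, keep only the first row
bysort ein (order): keep if _n == 1
drop order
list
```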