* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Introduction to Stata - Part II & OLS                                             *
*                                                                                   *
* Please note, this document is partly based on other web resources on Stata.       *
* (Details are available upon request)                                              *
*                                                                                   *
* Contact: Martin Halla (martin.halla@jku.at)                                       *
*                                                                                   *
* Last Update: 2009/11/30                                                           *
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

* No-hassle do-file *
*********************

/*
version 9
capture log close
cd "C:\..."
log using "filename.log", replace

/* What project this is */
/* Which data is used */
/* Who wrote it */
/* When it was written */
/* Maybe what this do-file does */

clear
set memory XXm
use "datafilename.dta"

*** your commands go here ***

log close
exit
*/

/*
version 9 - Tells Stata to use its version 9 command interpreter. In the
    future, when you are running version 10 or later, this do-file will still
    work even if a command has become obsolete, because Stata will switch to
    the version it was written for.

cd - Change directory: sets the working directory in which Stata looks for
    and saves files.

capture log close - We close the log at the beginning just in case one is
    still open. If it is open, the next command "log using" will fail, so we
    close it first. If no log is open (because it was closed already, or you
    closed it outside the do-file), log close on its own would cause an
    error. Capture intercepts any error that log close might generate and
    allows the do-file to continue.

/* */ - If you document your do-files, you (or the person who inherits your
    do-files on a long-term project) will have a good chance of understanding
    what they do. The more information you put here, the better the chance
    that everyone will understand them later.

set memory XXm - The set memory command tells Stata how much memory (RAM) to
    set aside for your data file. Look at how large the file is on disk, add
    a few megabytes, and replace "XX" with that number.
*/

* Okay, let us continue with a case *
*************************************

version 9
*capture log close
*cd "C:\returns"
*log using "returns.log", replace

***************************************************
* Case: Returns to Education                      *
* Data: Current Population Survey, September 2006 *
***************************************************

clear
set mem 400m
use http://www.econ.jku.at/members/Halla/files/teaching/data/cpssep06_raw.dta

* We want to study the returns to education

/* We need information on
    - employment status
    - wage rate
    - working hours
    - sex
    - age
    - education */

************************
* Generating variables *
************************

/* To generate or modify variables there are 4 important commands
    - rename: to give an existing variable a new name
    - egen: like generate, however, with more options
    - generate: for creating a new variable; you can use numbers, variables
      and functions
    - replace: works when a variable already exists; allows boolean
      expressions

   > with generate and replace you can use
       + -   for addition and subtraction
       * /   for multiplication and division
       ^     for exponents
       ( )   for controlling the order of operations

   > boolean expressions
       Equal to                ==
       Not equal to            !=, ~=
       Greater than            >
       Greater than/equal to   >=
       Less than               <
       Less than/equal to      <=
       And                     &
       Or                      |

    - Missing values are important, but easy to forget
       > .   for numbers
       > ""  for text
*/

*****************
* Data cleaning *
*****************

/*
    - Always necessary to some extent
    - Always use a do-file
    - Never overwrite the original data
    - Re-check your work
    - In particular, watch out for missing values
    - Label as much as you can
*/

* Okay, let's get started:

* age *
*******

ta peage
* top coding!

*VAR: age
gen age=.
replace age=peage
* Stata treats missing (.) as larger than any number, so exclude it explicitly
replace age=80 if age>80 & age<.
replace age=. if age<0

*VAR: age2
gen age2=age^2
sum age age2

* employment status *
*********************

ta pemlr
ta pemlr, nola

*VAR: employed
gen employed=.
replace employed=1 if pemlr==1 | pemlr==2
replace employed=0 if pemlr==3 | pemlr==4 | pemlr==5 | pemlr==6 | pemlr==7
replace employed=0 if pemlr==-1 & age<=14

*VAR: unemployed
gen unemployed=.
replace unemployed=1 if pemlr==3 | pemlr==4
replace unemployed=0 if pemlr==1 | pemlr==2 | pemlr==5 | pemlr==6 | pemlr==7
replace unemployed=0 if pemlr==-1 & age<=14

*VAR: outoflf
gen outoflf=.
replace outoflf=1 if pemlr==5 | pemlr==6 | pemlr==7
replace outoflf=0 if pemlr==1 | pemlr==2 | pemlr==3 | pemlr==4
replace outoflf=1 if pemlr==-1 & age<=14

* wage rate *
*************

* top-coding!
* possible candidates:
sum puernh1c peernh2 peernh1o prernhly
sum puernh1c peernh2 peernh1o prernhly if employed==1

*VAR: wage
rename prernhly wage

* sex *
*******

ta pesex
ta pesex, nola

*VAR: male
gen male=.
replace male=1 if pesex==1
replace male=0 if pesex==2

*VAR: female
gen female=.
replace female=1 if pesex==2
replace female=0 if pesex==1

sum male female

* education *
*************

ta peeduca
ta peeduca, nola

*VAR: educ (ORDINAL)
gen educ=.
replace educ=1 if peeduca==-1 & age<=14
replace educ=1 if peeduca==31
replace educ=2 if peeduca==32
replace educ=3 if peeduca==33
replace educ=4 if peeduca==34
replace educ=5 if peeduca==35
replace educ=6 if peeduca==36
replace educ=7 if peeduca==37
replace educ=8 if peeduca==38
replace educ=9 if peeduca==39
replace educ=10 if peeduca==40
replace educ=11 if peeduca==41
replace educ=12 if peeduca==42
replace educ=13 if peeduca==43
replace educ=14 if peeduca==44
replace educ=15 if peeduca==45
replace educ=16 if peeduca==46

*VAR: school
gen school=.
replace school=0 if educ==1
replace school=2 if educ==2
replace school=5.5 if educ==3
replace school=7.5 if educ==4
replace school=9 if educ==5
replace school=10 if educ==6
replace school=11 if educ==7
replace school=12 if educ==8
replace school=13 if educ==9
replace school=15 if educ==10
replace school=15 if educ==11
replace school=15 if educ==12
replace school=16 if educ==13
replace school=16 if educ==14
replace school=18 if educ==15
replace school=20 if educ==16

********************
* Select variables *
********************

keep employed unemployed outoflf wage male female age age2 educ school

/* There is also a command 'drop' */

*******************
* Select a sample *
*******************

keep if employed==1
keep if wage~=.
drop if wage==0
drop if wage>60

****************
* Inspect data *
****************

des
sum

* Some labels *
***************

label variable wage "Hourly wage rate"
label variable school "Years of schooling"

sum wage school
ta school

* Some histograms *
*******************

histogram wage

#delimit;
histogram wage, density fcolor(navy) lcolor(dknavy)
    xtitle(, size(medium)) xlabel(, labsize(small))
    ytitle(, size(medium)) ylabel(, labsize(small));
#delimit cr

#delimit;
histogram school, discrete percent fcolor(navy) lcolor(dknavy)
    xtitle(, size(medium)) xlabel(, labsize(small))
    ytitle(, size(medium)) ylabel(, labsize(small));
#delimit cr

#delimit;
histogram school, discrete percent fcolor(navy) lcolor(dknavy)
    xlabel( 0 "0" 2 "2" 6 "6" 8 "8" 9 "9" 10 "10" 11 "11" 12 "12" 13 "13"
            15 "15" 16 "16" 18 "18" 20 "20")
    xtitle(, size(medium)) xlabel(, labsize(small))
    ytitle(, size(medium)) ylabel(, labsize(small));
#delimit cr

* Some scatterplots *
*********************

#delimit;
twoway (scatter wage school, mcolor(navy) msize(small)),
    xlabel( 0 "0" 2 "2" 6 "6" 8 "8" 9 "9" 10 "10" 11 "11" 12 "12" 13 "13"
            15 "15" 16 "16" 18 "18" 20 "20")
    xtitle(, size(medium)) xlabel(, labsize(small))
    ytitle(, size(medium)) ylabel(, labsize(small));
#delimit cr

#delimit;
twoway (scatter wage school, mcolor(navy) msize(small))
       (lfit wage school, lcolor(cranberry)),
    xlabel( 0 "0" 2 "2" 6 "6" 8 "8" 9 "9" 10 "10" 11 "11" 12 "12" 13 "13"
            15 "15" 16 "16" 18 "18" 20 "20")
    xtitle(, size(medium)) xlabel(, labsize(small))
    ytitle(, size(medium)) ylabel(, labsize(small));
#delimit cr

***************
* Regressions *
***************

* regress *
***********

/* In Stata you can easily estimate an OLS regression using the 'regress'
   command (shortcut reg). Suppose your dependent variable is depvar and your
   independent variables are listed in varlist. Then the command to run the
   linear regression is 'reg depvar varlist'. */

reg wage school

* Calculate t-value
display _b[school]/_se[school]

* Calculate p-value
* (tprob(df,t) gives the two-sided p-value; df and t are entered by hand here)
display tprob(9023,28)

* Calculate 95% conf. interval
* (1.96 is the normal approximation to the t critical value)
display _b[school]-1.96*_se[school]
display _b[school]+1.96*_se[school]

/* Interpretation: An additional year of schooling increases the hourly wage
   rate by 0.81 US dollars. */

/* Please note that Stata automatically adds a constant to the list of
   independent variables. If you want to exclude it, you can use the option
   nocons: */

reg wage school, nocons

/* As with most other commands in Stata, the regress command can also be
   applied to a subset of the data, for instance by specifying an if
   condition: */

reg wage school if female==1
sum wage school if female==1

/* Interpretation: An additional year of schooling increases the hourly wage
   rate by 0.98 US dollars. */

reg wage school if female==0
sum wage school if female==0

/* Interpretation: An additional year of schooling increases the hourly wage
   rate by 0.76 US dollars. */

reg wage school female

/* Interpretation: Women's hourly wage rate is 2.5 US dollars lower than men's.
*/

* Using logarithmic functional forms *
**************************************

generate lwage=log(wage)
sum wage lwage

* Note: log(0) is missing, so observations with school==0 get a missing
* lschool and drop out of the regressions that use it
generate lschool=log(school)
sum school lschool

* Model: level-log
reg wage lschool

/* Interpretation: A one percent increase in years of schooling increases the
   hourly wage rate by 0.07 US dollars. */

* Model: log-level
reg lwage school

/* Interpretation: An additional year of schooling increases the hourly wage
   rate by approximately 5.50 percent. */

* Model: log-log
reg lwage lschool

/* Interpretation: A one percent increase in years of schooling increases the
   hourly wage rate by 0.46 percent. */

* Using quadratic terms *
*************************

reg wage school age age2
* Turning point of the quadratic age profile: -b[age]/(2*b[age2])
display -1*_b[age]/(2* _b[age2])

reg lwage school age age2
display -1*_b[age]/(2* _b[age2])

* Using interaction terms *
***************************

reg wage school female age age2

gen school_female=school*female
reg wage school school_female female age age2

/* There are several options available with the regress command (check
   'help regress'). For instance, you can easily calculate
   heteroskedasticity-robust standard errors (specifically, the so-called
   Huber/White/sandwich estimator). */

reg wage school age age2, robust

/* Regression tables in publications (e.g. journals, books, etc.) usually
   follow a certain layout. The outreg command, for instance, produces tables
   in a commonly used format. outreg is a user-written command; if it is not
   installed yet, 'ssc install outreg' should fetch it. */

reg wage school age age2
outreg using "C:\test.doc"

/* Useful options: replace, se and 3aster */

reg wage school age age2
outreg using "C:\test.doc", replace se 3aster

/* The append option */

reg wage school age age2 if female==1
outreg using "C:\test.doc", append se 3aster

reg wage school age age2 if female==0
outreg using "C:\test.doc", append se 3aster

/* How to delete the file? */

erase "C:\test.doc"
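
* Addendum: exact t-based p-value and confidence interval *
***********************************************************

/* A minimal sketch, not part of the original session. It revisits the
   hand-computed t-value, p-value and 95% confidence interval from the
   'regress' section, but takes the residual degrees of freedom from e(df_r)
   instead of typing them in, and uses the exact t critical value instead of
   the normal approximation 1.96. It assumes the sample prepared above
   (variables wage and school) is still in memory. The scalar names t_school
   and tcrit are made up for this illustration. */

quietly reg wage school

* t-value of the schooling coefficient
scalar t_school = _b[school]/_se[school]
display "t-value: " t_school

* two-sided p-value: ttail(df,t) is the upper-tail probability of Student's t
display "p-value: " 2*ttail(e(df_r), abs(t_school))

* 95% confidence interval: invttail(df,0.025) is the exact critical value
scalar tcrit = invttail(e(df_r), 0.025)
display "95% CI:  " _b[school]-tcrit*_se[school] " to " _b[school]+tcrit*_se[school]

/* With several thousand degrees of freedom the critical value is very close
   to 1.96, so the interval should be nearly identical to the one computed by
   hand above. */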