júlí 18, 2011

Combining data sets

In many empirical research projects, the raw data to be utilized are stored in a number of separate files; panel data,
time series data extracted from different databases, and the like. Stata only permits a single data set to be accessed at one time. How, then,
do you work with multiple data sets? Several commands are available, including append, merge, and joinby.

The append command combines two Stata-format data sets that possess variables in common, adding observations to the existing
variables. The same variables need not be present in both files, as long as a subset of the variables are common to the “master” and
“using” data sets. It is important to note that “PRICE" and “price” are different variables, and one will not be appended to the other.

Free hosting here

The merge command is very powerful. Like append, it works on a “master” data set—the current contents of memory—and one or more
using” data sets. One or more merge variables are specified, and both master and using data sets must be sorted on those variables.
The distinction between “master” and “using” is important. When the same variable is present in each of the files, Stata’s default behavior is
to hold the master data inviolate and discard the using dataset’s copy of that variable. This may be modified by the update option, which
specifies that non-missing values in the using dataset should replace missing values in the master, and update replace, which specifies
that non-missing values in the using dataset should take precedence.

Free hosting

A “one-to-onemerge specifies that each record in the using data set is to be combined with one record in the master data set. This would
be appropriate if you acquired additional variables for the same observations. A new variable, _merge, takes on integer values
indicating whether an observation appears in the master only, the using only, or appears in both. This may be used to determine whether
the merge has been successful, or to remove those observations which remain unmatched (e.g. merging a set of households from
different cities with a comprehensive list of ZIP codes; one would then discard all the unused ZIP code records). The _merge variable must
be dropped before another merge is performed on this data set. The unique option should be used if you believe that both data sets
should have unique values of the merge key.The unique option should be used if you believe that both data sets should have unique values of the merge key.

The merge command can also do a “match merge”, or “one-to-N” merge, in which each record in the using data set is matched with a
number of records in the master data set. If a number of the households lived in the same ZIP code, then the match would place variables from the ZIP code file on the household records, repeating where necessary. This is a very useful technique to combine aggregate data with disaggregate data without dealing with the details. Although “one-to-N” and “N-to-one” merges are commonplace and very useful, you probably never want to do a “N-to-N” merge, which will yield seemingly random results. To ensure that one data set has unique identifiers, specify the uniqmaster or uniqusing options, or use the isid command to ensure that a dataset has a unique
identifier.

Enjoy free hosting.

júlí 13, 2011

Working with variables

Creating new variables

Variables are created by the command generate

Syntax:  generate [variablename]

generate [newvar] = [expression]

While creating a vairiable, you can too assign an initial value by use of = operator.

eg generate true=1

generate fullname = last + “, “ + first

Renaming variables

This is done to assign a new variable to an existing variable

Syntax: rename [old name] [new name]

Here I am changing the variable name from “old name” to “new name”

Replacing variable values

when you need to replace the existing values of a variable, the re is provision through the command replace

Syntax: replace [variable name] =[value] [conditions]  [options]

e.g replace true=0 in 1/10 this will replace value of true to 0 for all variables 1 to 10

Recoding variables

suppose you had this data

.tab age

Age Frequency
18 2
19 6
20 3
21 4

now you want to recode the variable age into age groups, say 18 to 19, 20 to 21

this will be achieved through this command

.recode age (18 19 = 1 “18 to 19”) ///
(20 21= 2 “20 to 21”) ///
(else=.), generate(agegroups) label(agegroups)

As we see, recode helps reorganize data.

.tab age

Age Frequency
18 to 19 8
20 to 21 7

Labeling variables

to label a variable, first you have to define the label then implement it on the variable.

This is how its done in Stata

Syntax: label define [value/variable] [Lable] …………….

label values [variable] [label]

eg

label define genderlabel 1 male 2 female

label values gendervariable genderlabel

Deleting variables

If you had a temporary variable, or you are doing data cleanup, sometimes its necessary to delete some variables.

Syntax: drop [variable name] [options] Note: If you just execute drop without variable name of options, you will be deleting all variable in the data set

eg. drop age if age==.

júlí 04, 2011

Three way crosstabs

We have seen descriptive statistics and in this post, I am going to highlight how to do a cross-tabulation using more than two variables.

this is achievable by using the tabstat command

One can specify the statistics to show and with the help of bysort command, you can show cross-tabulations involving more than one variable.

Syntax:

tabstat variable[s], statistics(statistics) by(conditional variable)

Example 1:

tabstat age sat score heightin readnews, statistics(mean median sd var count range min max) by(gender)

Eaxmple 2:

bysort age:tab ed_level major- this examples first sorts the records by age and then cross-tabulates the dataset variables(ed_level and major)

Example 3:

bysort studentstatus: tab gender major, sum(sat) –This adds a fourth variable

Get data set here