In many empirical research projects, the raw data to be utilized are stored in a number of separate files; panel data,
time series data extracted from different databases, and the like. Stata only permits a single data set to be accessed at one time. How, then,
do you work with multiple data sets? Several commands are available, including append, merge, and joinby.
The append command combines two Stata-format data sets that possess variables in common, adding observations to the existing
variables. The same variables need not be present in both files, as long as a subset of the variables are common to the “master” and
“using” data sets. It is important to note that “PRICE" and “price” are different variables, and one will not be appended to the other.
The merge command is very powerful. Like append, it works on a “master” data set—the current contents of memory—and one or more
“using” data sets. One or more merge variables are specified, and both master and using data sets must be sorted on those variables.
The distinction between “master” and “using” is important. When the same variable is present in each of the files, Stata’s default behavior is
to hold the master data inviolate and discard the using dataset’s copy of that variable. This may be modified by the update option, which
specifies that non-missing values in the using dataset should replace missing values in the master, and update replace, which specifies
that non-missing values in the using dataset should take precedence.
A “one-to-one” merge specifies that each record in the using data set is to be combined with one record in the master data set. This would
be appropriate if you acquired additional variables for the same observations. A new variable, _merge, takes on integer values
indicating whether an observation appears in the master only, the using only, or appears in both. This may be used to determine whether
the merge has been successful, or to remove those observations which remain unmatched (e.g. merging a set of households from
different cities with a comprehensive list of ZIP codes; one would then discard all the unused ZIP code records). The _merge variable must
be dropped before another merge is performed on this data set. The unique option should be used if you believe that both data sets
should have unique values of the merge key.The unique option should be used if you believe that both data sets should have unique values of the merge key.
The merge command can also do a “match merge”, or “one-to-N” merge, in which each record in the using data set is matched with a
number of records in the master data set. If a number of the households lived in the same ZIP code, then the match would place variables from the ZIP code file on the household records, repeating where necessary. This is a very useful technique to combine aggregate data with disaggregate data without dealing with the details. Although “one-to-N” and “N-to-one” merges are commonplace and very useful, you probably never want to do a “N-to-N” merge, which will yield seemingly random results. To ensure that one data set has unique identifiers, specify the uniqmaster or uniqusing options, or use the isid command to ensure that a dataset has a unique
identifier.