September 28, 2011

Reading in Data

If the data you are interested in is already in Stata format, all you need to know is its file path. Assuming you are on a PC, the data set is named “Coffee.dta”, and it is located in “My Documents”, the Stata command to load it is

use "C:\Documents and Settings\Paul\My Documents\Coffee.dta", clear

Note: Don’t forget the double quotes (“”) around the path of the data set. They are required here because the path contains spaces.

Data set in Excel (or text file) format

If the data set you are interested in is an .xls file, it's still possible to work with it in Stata.

First, from your Excel application, save the data as type Text (Tab delimited). This saves it with a .txt file extension.

Now our coffee data set, coffee.xls, becomes coffee.txt.

From Stata, issue the following command:

insheet using "C:\Documents and Settings\Paul\My Documents\coffee.txt"

The same command is used to import a .csv file:

insheet using "C:\Documents and Settings\Paul\My Documents\coffee.csv"
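To try the round trip without an Excel file at hand, here is a minimal sketch. It assumes Stata's bundled auto example data and a hypothetical file name, auto_sample.txt, written to the current working directory. Newer Stata releases also offer import delimited (and import excel for reading .xls files directly) for the same job.

```stata
* Write a small tab-delimited text file from the bundled example data
sysuse auto, clear
outsheet make price mpg using "auto_sample.txt", replace

* Read it back in; the first line of the file supplies the variable names
insheet using "auto_sample.txt", clear
describe
```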

Note:

Excel data can also be saved in the .csv format (use “Save As”, if it is not already in that format).

You can also enter the data manually (or copy and paste it from Excel or almost any other format).

From Stata’s Command window, type edit.

This opens the Data Editor window. Paste your data here or enter it manually.

If you choose this option, Stata automatically names your variables by column, e.g. the variable in the first column is “var1”. To change this, you can rename your variables by typing
rename var1 price
which renames “var1” to “price”. Renaming is covered in more detail under “Working with variables”.



July 18, 2011

Combining data sets

In many empirical research projects, the raw data to be utilized are stored in a number of separate files; panel data,
time series data extracted from different databases, and the like. Stata only permits a single data set to be accessed at one time. How, then,
do you work with multiple data sets? Several commands are available, including append, merge, and joinby.

The append command combines two Stata-format data sets that possess variables in common, adding observations to the existing
variables. The same variables need not be present in both files, as long as a subset of the variables are common to the “master” and
“using” data sets. It is important to note that variable names are case-sensitive: “PRICE” and “price” are different variables, and one will not be appended to the other.
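A minimal sketch of append, assuming Stata's bundled auto example data; the file name domestic.dta is made up for illustration and is written to the current working directory.

```stata
* Split the bundled auto data into two files, then stack them again
sysuse auto, clear
keep if foreign == 0
save "domestic.dta", replace

sysuse auto, clear
keep if foreign == 1
append using "domestic.dta"

* All cars are back in memory: the foreign ones first, then the domestic ones
tabulate foreign
```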


The merge command is very powerful. Like append, it works on a “master” data set—the current contents of memory—and one or more
“using” data sets. One or more merge variables are specified, and both master and using data sets must be sorted on those variables.
The distinction between “master” and “using” is important. When the same variable is present in each of the files, Stata’s default behavior is
to hold the master data inviolate and discard the using dataset’s copy of that variable. This may be modified by the update option, which
specifies that non-missing values in the using dataset should replace missing values in the master, and update replace, which specifies
that non-missing values in the using dataset should take precedence.


A “one-to-one” merge specifies that each record in the using data set is to be combined with one record in the master data set. This would
be appropriate if you acquired additional variables for the same observations. A new variable, _merge, takes on integer values
indicating whether an observation appears in the master only, the using only, or appears in both. This may be used to determine whether
the merge has been successful, or to remove those observations which remain unmatched (e.g. merging a set of households from
different cities with a comprehensive list of ZIP codes; one would then discard all the unused ZIP code records). The _merge variable must
be dropped before another merge is performed on this data set. The unique option should be used if you believe that both data sets
should have unique values of the merge key.

The merge command can also do a “match merge”, or “one-to-N” merge, in which each record in the using data set is matched with a number of records in the master data set. If a number of the households lived in the same ZIP code, then the match would place variables from the ZIP code file on the household records, repeating where necessary. This is a very useful technique to combine aggregate data with disaggregate data without dealing with the details. Although “one-to-N” and “N-to-one” merges are commonplace and very useful, you probably never want to do an “N-to-N” merge, which will yield seemingly random results. To ensure that one data set has unique identifiers, specify the uniqmaster or uniqusing options, or use the isid command to check that a data set has a unique identifier.
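As a rough illustration of an N-to-one merge, here is a sketch written in the merge m:1 syntax of Stata 11 and later, which names the merge type explicitly and sorts the data for you; older releases use the sort-then-merge form described above. It assumes the bundled auto data, and the small look-up data set (origin_notes.dta, with its made-up variable origin) is purely hypothetical.

```stata
* "Using" data set: one row per value of foreign (made-up look-up table)
clear
input byte foreign str10 origin
0 "Domestic"
1 "Imported"
end

* isid stops with an error if foreign does not uniquely identify the rows
isid foreign
save "origin_notes.dta", replace

* "Master" data set: many cars per value of foreign
sysuse auto, clear
merge m:1 foreign using "origin_notes.dta"

* _merge: 1 = master only, 2 = using only, 3 = matched in both
tabulate _merge
drop _merge
```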


July 13, 2011

Working with variables

Creating new variables

Variables are created with the command generate.

Syntax: generate [newvar] = [expression]

generate always needs an expression: the = operator assigns the new variable its initial value.

e.g. generate true = 1

generate fullname = last + ", " + first
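To see generate at work on real data, here is a small sketch using Stata's bundled auto example data; the new variable names price_k and descr are made up for illustration.

```stata
sysuse auto, clear

* Numeric variable from an expression
generate price_k = price / 1000

* String variable built by concatenation; string() converts a number to text
generate descr = make + ", " + string(mpg) + " mpg"

list make price price_k descr in 1/5
```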

Renaming variables

This is done to give an existing variable a new name.

Syntax: rename [old name] [new name]

This changes the variable name from “old name” to “new name”.

Replacing variable values

When you need to replace the existing values of a variable, use the command replace.

Syntax: replace [variable name] = [expression] [if condition] [in range]

e.g. replace true = 0 in 1/10
This sets the value of true to 0 for observations 1 to 10.
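A short sketch of replace with a condition and with a range, assuming the bundled auto example data; the variable name expensive is made up.

```stata
sysuse auto, clear

* Start with a flag that is 0 everywhere
generate byte expensive = 0

* Set it to 1 only where the condition holds
replace expensive = 1 if price > 10000 & price < .

* Overwrite by position: observations 1 to 10
replace expensive = 0 in 1/10

tabulate expensive
```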

Recoding variables

Suppose you had this data:

.tab age

Age Frequency
18 2
19 6
20 3
21 4

Now you want to recode the variable age into age groups, say 18 to 19 and 20 to 21.

This is achieved with the following command (the /// line continuation works in a do-file; in the Command window, type the command on one line):

.recode age (18 19 = 1 "18 to 19") ///
(20 21 = 2 "20 to 21") ///
(else = .), generate(agegroups) label(agegroups)

As we see, recode helps reorganize data. Tabulating the new variable shows the groups:

.tab agegroups

Age group Frequency
18 to 19 8
20 to 21 7
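The same idea can be tried on Stata's bundled auto example data. This is only a sketch: the grouping of the repair record rep78 and the names repgroup and repgrouplbl are made up for illustration.

```stata
sysuse auto, clear

* Collapse the 1-5 repair record into three labeled groups;
* anything else (including missing) becomes missing
recode rep78 (1 2 = 1 "Poor") (3 = 2 "Average") (4 5 = 3 "Good") (else = .), generate(repgroup) label(repgrouplbl)

* The original variable is untouched; the new one carries the labels
tabulate rep78 repgroup, missing
```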

Labeling variables

To label the values of a variable, first define the value label, then attach it to the variable.

This is how it's done in Stata.

Syntax: label define [label name] [value] [label] [value] [label] ...

label values [variable] [label name]

e.g.

label define genderlabel 1 male 2 female

label values gendervariable genderlabel
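A runnable sketch of the two steps, assuming the bundled auto example data; the indicator highmpg, its cut-off of 25, and the label name highmpglbl are made up for illustration.

```stata
sysuse auto, clear

* 0/1 indicator: 1 when mpg is above 25 (left missing if mpg is missing)
generate byte highmpg = mpg > 25 if !missing(mpg)

* Step 1: define the value label; step 2: attach it to the variable
label define highmpglbl 0 "25 mpg or less" 1 "More than 25 mpg"
label values highmpg highmpglbl

* A variable label describes the variable itself, not its values
label variable highmpg "Fuel economy group"

tabulate highmpg
```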

Deleting variables

If you have a temporary variable, or you are cleaning up data, it is sometimes necessary to delete variables.

Syntax: drop [variable name(s)] deletes variables, and drop if [condition] deletes observations. The two forms cannot be combined in one command. Note: drop _all deletes every variable in the data set.

e.g. drop age deletes the variable age, while drop if age==. deletes the observations in which age is missing.
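The difference between deleting variables and deleting observations can be sketched as follows, assuming the bundled auto example data.

```stata
sysuse auto, clear

* Delete variables (columns)
drop trunk turn

* Delete observations (rows) in which rep78 is missing
drop if rep78 == .

* keep is the mirror image: it names what to retain
keep make price mpg rep78 foreign

describe
```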

July 04, 2011

Three way crosstabs

We have seen descriptive statistics and in this post, I am going to highlight how to do a cross-tabulation using more than two variables.

This is achievable with the tabstat command.

You can specify which statistics to show and, with the help of the bysort prefix, produce cross-tabulations involving more than two variables.

Syntax:

tabstat variable[s], statistics(statistics) by(grouping variable)

Example 1:

tabstat age sat score heightin readnews, statistics(mean median sd var count range min max) by(gender)

Example 2:

bysort age: tab ed_level major
This example first sorts the records by age and then cross-tabulates ed_level and major within each age.

Example 3:

bysort studentstatus: tab gender major, sum(sat)
This adds a fourth variable: each cell shows summary statistics of sat.
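If you do not have the students data at hand, the same pattern can be sketched on the bundled auto example data; the indicator highmpg is a made-up variable created just for this illustration.

```stata
sysuse auto, clear

* Summary statistics of two variables, one block per group
tabstat price mpg, statistics(mean sd min max n) by(foreign)

* Three-way: a rep78 x highmpg table for each value of foreign
generate byte highmpg = mpg > 25 if !missing(mpg)
bysort foreign: tabulate rep78 highmpg

* Fourth variable: mean, standard deviation and frequency of price in each cell
bysort foreign: tabulate rep78 highmpg, summarize(price)
```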

Get data set here

June 21, 2011

General command syntax

Most Stata commands can be abbreviated. For example, instead of typing summarize, Stata will also accept summ, and gen works for generate. The help screen shows how each command
can be abbreviated by underlining the minimum letters in the syntax section of the help.

Stata syntax mostly follows this basic structure:

Syntax:

[by varlist1:] command [varlist2] [=exp] [if] [in] [using filename] [, options]

where square brackets indicate optional qualifiers.

Example:

bysort gender: tabulate age if weight < 50, nolabel

A variable list (varlist) is a list of variable names separated by blanks. There are a number of shorthand conventions to reduce the amount of typing. For instance:

myvar              Just one variable
myvar var1 var2    Three variables
myvar*             All variables starting with myvar
*var               All variables ending with var
my*var             All variables starting with my and ending with var
my~var             A single variable starting with my and ending with var
my?var             All variables starting with my and ending with var, with exactly one character in between
myvar1-myvar6      myvar1, myvar2, ..., myvar6 (probably)
this-that          All variables from this through that, in the order of the Variables window

The * character matches zero or more characters. All variables matching the pattern are returned. The ~ character also matches zero or more characters, but unlike *, only one variable is allowed to match. If more than one variable matches, an error message is returned. The ? character matches exactly one character. All variables matching the pattern are returned.

The - character indicates that all variables in the dataset, starting with the variable to the left of the - and ending with the variable to the right of the -, are to be returned.

Any command that takes a varlist understands the keyword _all to mean all variables. Some commands use all variables by default if none are specified (e.g., summarize shows summary statistics for all variables, and is equivalent to summarize _all).
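A few of these shorthands can be tried on Stata's bundled auto example data; this is only a sketch, and which variables match depends on the data set in memory.

```stata
sysuse auto, clear

* Wildcard: every variable whose name starts with m
describe m*

* Range: price through weight, in data set order
summarize price-weight

* ? stands for exactly one character
describe t?rn

* _all stands for every variable
summarize _all
```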

June 20, 2011

Using by/if/in

So far we have applied the previous commands to the whole data set, in no particular order.

Suppose you want to describe, browse, or summarize only part of the data set? (Did I really intend to ask you this, or should I be answering it?)

It's possible to analyze just a portion of the data set by using the following qualifiers:

  • IF
  • IN
  • BY

IF

Executes a command only on the observations that meet the condition.

Syntax: command [variable(s)] if (condition)

Examples:

  1) summarize age if (age>10)

  2) list make price if foreign==1

  3) list make price if price > 10000 & price <.

Explanation:

  1) Runs summarize on the variable age, using only the observations where age is greater than ten.

  2) Lists the make and price of foreign cars only. (To load this data, enter sysuse auto in the Command window.)

  3) Lists only the cars whose price is greater than 10000 and not missing.

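Other possible operators include less than (<), less than or equal to (<=), greater than or equal to (>=), equal to (==), and not equal to (!=). To test for missing values, compare with the dot, e.g. if age==. or use if missing(age).

Now that I have mentioned “missing”, let me tell you something about it. If a numeric variable has no value, Stata shows a dot (.). Stata treats this missing value as larger than any number, which is why example 3 adds price <. to exclude it. If the variable is a string, a missing value is the empty string ("").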
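The effect of missing values on an if condition can be sketched with the bundled auto example data, where the repair record rep78 is missing for a few cars.

```stata
sysuse auto, clear

* Two equivalent ways to count the observations with a missing rep78
count if rep78 == .
count if missing(rep78)

* A bare "greater than" also picks up the missing values
list make rep78 if rep78 > 4

* so exclude them explicitly
list make rep78 if rep78 > 4 & rep78 < .
```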

IN

Used to specify a range of observations.

Syntax: command [variable] in 1/20

e.g. summarize age in 1/20 displays a summary of age for observations 1 through 20.
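A small sketch of in ranges, assuming the bundled auto example data. Keep in mind that the range refers to the current sort order of the data.

```stata
sysuse auto, clear

* First ten observations
list make price in 1/10

* Negative numbers count from the end: the last five observations
list make price in -5/l

* Sort first to get "the five cheapest cars"
sort price
list make price in 1/5
```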

BY

Repeats a command separately for each group of observations.

As a prefix:

When a command is prefixed with a bylist, it is performed repeatedly for each value of the variable or variables in that list, each of which must be categorical. For instance,
by foreign: summ price
will provide descriptive statistics for both foreign and domestic cars. If the data are not already sorted by the bylist variables, the prefix bysort should be used. The option ,total will add the overall summary.

Examples:

bysort rep78: summ price
bysort rep78 foreign: summ price

As an option:

Some commands, such as tabstat, also accept by() as an option (summarize does not):

tabstat price, by(foreign)
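The prefix and option forms side by side, as a sketch on the bundled auto example data:

```stata
sysuse auto, clear

* by requires sorted data; bysort sorts and runs in one step
bysort foreign: summarize price

* Two grouping variables: one summary per combination
bysort foreign rep78: summarize price

* tabstat takes the grouping variable as an option instead
tabstat price, by(foreign) statistics(mean sd n)
```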

June 15, 2011

I want to summarize a data set (or variable)

With the descriptive statistics road ending, I want to bid farewell with one more important command. The command is summarize. It gives brief summary statistics about all the variables or a specific chosen variable.

From the menu bar, follow these steps

  1. Select Statistics > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics.
  2. Enter or select [Variable] in the Variables field.
  3. Select Display additional statistics.
  4. Click Submit.

Syntax:

.summarize or summ

or

.summarize [variable] or summ [variable]

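This outputs the number of observations, mean, standard deviation, minimum and maximum. Adding the detail option (e.g. summarize [variable], detail), which corresponds to “Display additional statistics” in the menu, also reports percentiles, variance, skewness and kurtosis.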
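A quick sketch of both forms, assuming the bundled auto example data. After summarize, the results are also stored in r() and can be reused.

```stata
sysuse auto, clear

* Basic summary of two variables
summarize price mpg

* Extended output: percentiles, variance, skewness, kurtosis
summarize price, detail

* Stored results from the last summarize
display "Mean price: " r(mean)
display "Median price: " r(p50)
```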

Other similar commands: browse, describe, codebook

Assumptions:

  • You have loaded a data set
  • All commands are typed in the Command window

June 13, 2011

My Codebook

Either type codebook in the Command window and press Enter, or navigate the menus to Data > Describe data > Describe data contents (codebook) and click OK. We get a large amount of output that is worth investigating. Look it over to see how much can be learned from this simple command. You can scroll back in the Results window to see earlier results, if need be. I will focus on a specific variable, here the variable id.

Syntax:

.codebook id (I assume you have loaded our data set)

Output:

[Screenshot: codebook output for the variable id]

The codebook command gives the variable name (id), variable label (ID), type (numeric, string, etc.), number of unique values (here 30), mean, standard deviation, and major percentiles.

Use codebook when you want to know more about a variable.

Other similar commands: describe, summarize, browse
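To try codebook without the students data, here is a sketch on the bundled auto example data.

```stata
sysuse auto, clear

* Full codebook entry for one variable
codebook rep78

* One line per variable for the whole data set
codebook, compact
```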

Describing Data

We have seen how to browse data. Suppose now we want to get into more detail about the data: variables, variable labels, data types, data formats, etc.

This is pretty easy in Stata. With the help of the .describe command, one can catch a glimpse of the whole data set or just a few variables.

Syntax:

.describe

This gives a description of all variables in the data set.

(Assuming we are still using the students data set, imported from Excel, that is already loaded.)

. describe

Output:

[Screenshot: describe output]

For specific variables only:

.describe gender
This describes only the gender variable.

.describe gender state
This describes the gender and state variables only.

Other similar commands: browse, summarize, codebook
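The same commands can be tried on the bundled auto example data, as a rough sketch:

```stata
sysuse auto, clear

* Describe selected variables only
describe make price

* Just the header: number of observations and variables
describe, short

* Wildcards work here too
describe m*
```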

June 06, 2011

Stata is a full-featured statistical programming language for data analysis. Stata is available in several packages, and is also licensed for business, academic, or governmental institutions.

Features and limits 


Stata package                       Small     Intercooled/IC (standard)   SE (extended)
Number of observations              1,200     unlimited                   unlimited
Number of variables                 99        2,047                       32,767
Number of characters in a command   8,697     67,800                      1,081,527
Number of options for a command     70        70                          70
Length of a string variable         244       244                         244
Length of a variable name           32        32                          32

Comparison with other packages

R:

R is a free software package designed for use from the command line only. Being a language is one of R's greatest strengths, but it can make R harder to learn for those without programming experience. Once learnt, however, you are not subject to price increases. The developer community constantly provides add-ons and also ensures that the software will continue to exist. R is extremely versatile in graphics, and generally good for people who really want to find out “what their data have to say”.

SAS:

SAS is the second most costly package. It can be used with both the command line and a graphical user interface (GUI). SAS is particularly strong on data management (especially with large files), and good for cutting edge research. It covers many graphical and statistical tasks. The main focus is now on business customers.

SPSS:

SPSS is the first choice for the occasional user. However, it is the most expensive of the four. SPSS is clearly designed for point-and-click usage on the GUI. A command structure exists, but it is not well defined and sometimes inconsistent. SPSS is good for basic data management and basic statistical analysis, but rather weak in graphics. In the future, SPSS might be the weakest of the four packages with regard to the scope of statistical procedures it offers, due to its main focus on business customers.

Stata:

Stata is designed for use from the command line, but it also offers a GUI that allows for working with menus. The simple and consistent command structure makes it rather easy to learn. It is the cheapest of the packages that entail costs, and it offers additional reductions for the educational sector. Stata is relatively weak on ANOVA, but extraordinary on regression analysis and complex survey designs. Stata is completely focused on scholars. In the future, Stata may have the strongest collection of advanced statistical procedures. You can order it from http://www.stata.com/order/

Graphical Interface

Evolving from a command-driven interface, Stata also operates in a graphical, windowed environment.




Results window
All output appears in this window. Only graphics appear in a separate window.

Command window
This is the command line where commands are typed for execution.

Variables window
All variables in the currently open dataset appear here. By clicking on a variable, its name can be transferred to the Command window.

Review window
Previously used commands are listed here and can be transferred to the Command window by clicking on them.

Major Buttons
The most important button functions are the following:
Open (use): Opens a new data file.
Save: Saves the current data file.
Print results: Prints the content of the Results window.
New Viewer: Opens a new viewer window, e.g. to open log files.
New Do-file Editor: Opens a new instance of the do-file editor (same as doedit).
Data Editor: Opens the data editor window (same as edit).
Data Browser: Opens the data browser (same as browse).
Break: Cancels currently running calculations.

Menu
Almost all commands can be called from the menu. However, we do not recommend learning Stata through the menu commands, since the command line gives the user much better control and allows for a much faster and more exact working process.
