Saturday, November 28, 2015

Global Agro-Ecological Zones (GAEZ) database

The Global Agro-Ecological Zones (GAEZ) database measures the suitability for cultivating each of 27 different crops in 5x5 arc-minute cells across the globe.

The dataset is increasingly popular among economists because it provides exogenous variation in whether or not a particular set of crops is cultivated.

  • Nunn and Qian (2011) pioneered this use, estimating the impact of potato cultivation in Europe on population growth after potatoes were introduced from the New World.
  • Costinot and Donaldson (2012) test the Ricardian theory of comparative advantage by observing the relative productivity of different crops in each location.
  • Costinot, Donaldson, and Smith (forthcoming in JPE) estimate the impact of climate change on agricultural markets by predicting future international agricultural trade based on the relative productivity of different crops in each location under climate change scenarios.
  • Mayshar et al (2015) use the suitability for cereal production to predict the emergence of the state.

Sunday, November 1, 2015

World Bank Governance Indicators

For details, see
Kaufmann, Daniel, Aart Kraay and Pablo Zoido-Lobaton, “Governance Matters,” World Bank Policy Research Working Paper No. 2196, 1999.

Kaufmann, Daniel, Aart Kraay and Pablo Zoido-Lobaton, “Governance Matters II – Updated Indicators for 2000/01,” World Bank Policy Research Working Paper No. 2772, 2002.

Kaufmann, Daniel, Aart Kraay and Massimo Mastruzzi, “Governance Matters V: Aggregate and Individual Governance Indicators for 1996-2005,” World Bank Policy Research Working Paper No. 4012, 2006.

Alesina and Zhuravskaya (2009) "Segregation and the quality of government in a cross-section of countries" find that these governance indicators are negatively correlated with the degree to which ethnic groups are segregated.

Michalopoulos and Papaioannou (2014) find no average impact of the governance indicator on night-time light luminosity in Africa across the borders that split ethnic group homelands.

Wednesday, October 28, 2015

Ethnographic Atlas

Compiled by George Peter Murdock (see his obituary in American Anthropologist, 88(3), pp. 682-6, for background on him) and published in 29 successive instalments in the journal ETHNOLOGY between 1962 and 1980.

The data for 862 societies, as of the time before their encounter with Europeans, were published in Ethnology, volume 6, number 2 (April 1967), and then as a hardcover by the University of Pittsburgh Press.

Since then, more societies were added and some of the 1967 data were revised (see here for detail). The final version, compiled by Patrick Gray in World Cultures, vol. 10, issue 1 (1998), is available as the SPSS data file or R file from this webpage.
  • Stata users can convert the SPSS data with the USESPSS ado (by Sergiy Radyakin).

Sample size

The Patrick Gray version contains 1267 societies. However, there are two duplicated observations (Chilcotin and Tokelau).

  • According to Gray (1998), which lists the revision history for every society included in the data, one of the duplicated entries for both Chilcotin and Tokelau appeared in the April 1967 issue of Ethnology and had no further revision afterwards. The other entry appeared before the April 1967 issue (October 1964 for Chilcotin; January 1967 for Tokelau). 
  • Given this, it seems appropriate to keep the entry that appeared in April 1967.
  • Whether or not each observation appeared in April 1967 is recorded in the variable v89.

Consequently, the sample size is 1265 societies.
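
The rule above (keep, for each duplicated society, the entry that appeared in April 1967) can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names `name` and `in_april_1967` are hypothetical stand-ins (the latter for however v89 is actually coded, which should be checked in the codebook), and the records are toy data rather than actual rows of the Ethnographic Atlas.

```python
def drop_duplicate_societies(records):
    """Keep one record per society name; among duplicates, prefer the
    entry flagged as having appeared in the April 1967 issue."""
    kept = {}
    for rec in records:
        name = rec["name"]
        current = kept.get(name)
        # Keep the first record seen, but let an April 1967 entry
        # replace a previously kept entry that lacks the flag.
        if current is None or (rec["in_april_1967"] and not current["in_april_1967"]):
            kept[name] = rec
    return list(kept.values())


# Toy records (hypothetical, not actual Ethnographic Atlas rows).
sample = [
    {"name": "Chilcotin", "in_april_1967": False},
    {"name": "Chilcotin", "in_april_1967": True},
    {"name": "Tokelau", "in_april_1967": True},
    {"name": "Tokelau", "in_april_1967": False},
    {"name": "Other society", "in_april_1967": False},
]
deduplicated = drop_duplicate_societies(sample)
```

Societies that appear only once are kept regardless of the flag, so the rule only bites for the duplicated names.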


Location (latitude and longitude), major subsistence activities (gathering, hunting, fishing, animal husbandry, or agriculture), mode of marriage (dowry etc.), family organization (extended or nuclear, monogamy or polygyny, etc.), type of agriculture (irrigation, type of crops), number of jurisdictional levels, specialization of economic activities by sex, language group, class structure, form and prevalence of slavery, succession rules for local headman, property inheritance rules, type of dwellings.

Used by

The data on the number of jurisdictional levels beyond the local community is used as an explanatory variable for economic development by Gennaioli and Rainer (2007) and Michalopoulos and Papaioannou (2013). It is also used as the dependent variable by Fenske (2014) and Mayshar et al (2015).

Alesina et al (2013) exploit the data on the use of the plough in agriculture and on the gender-based division of labor in agriculture to show the correlation between the two.

Tuesday, October 20, 2015

Demographic and Health Surveys (DHS)

DHS is a set of cross-country household surveys on health. See Corsi et al (2012) for a succinct overview of the surveys.

The Minnesota Population Center leads the initiative to make the DHS data comparable across countries: see the Integrated DHS website for detail.

HOW TO OBTAIN DATA: Log on to and follow the procedure outlined here. (For some surveys, an extra permission of use is needed.)

LIST OF COUNTRIES: available here.
For the following countries, see also separate posts in this blog: Ethiopia, Lesotho, Paraguay, Peru, Sri Lanka, and Zambia.

SAMPLING METHOD: This is the guideline for all DHS surveys. For the actual sampling method used in each survey, see the DHS Final Report for relevant surveys, which can be downloaded at (Pick the country name and choose "DHS Final Reports" for publication type to search.)

QUESTIONNAIRE: The DHS surveys have evolved over time. There are five phases, each of which contains a different set of questions (some of them are consistent across phases). The Model Questionnaires from all five phases are available here. For the actual questionnaire used in each survey, see the DHS Final Report for the relevant survey, which can be downloaded at (Pick the country name and choose "DHS Final Reports" for publication type to search.)

The questionnaire consists of the core questionnaire (used in every survey) and optional modules (used in some of the surveys). Among these modules is maternal mortality.

CODEBOOK: For the complete list of variables in the datasets, download the DHS Recode Manuals (one for each phase) from here. Beware that it is not easy to map each question in the questionnaire to the variable number in the codebook because the numbering systems are very different. The mapping is nonetheless worth doing: the description of variables in the codebook is sometimes imprecise, so checking exactly what question was asked is essential. Some variables are not available in some surveys. To find out which, see the ".doc" file zipped in the individual recode file for each survey.

Other documentation:
Fieldwork Manuals

Data Processing Manual (see pp. 7-14 for how dates of birth, marriage, etc. are imputed)
Date imputation proceeds in two stages. First, the upper and lower bounds of the date are defined. For the date of birth of children of interviewed women, for example, the age of the children, the age at death (if dead), the dates of vaccination, the duration of breastfeeding, etc. are used to narrow the bounds. The fact that two births cannot be less than 7 months apart (allowing for premature births) is also used to narrow the bounds. Second, the date is imputed by picking one number randomly within the final bounds.
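
The two stages described above can be sketched in Python as follows. This is a minimal illustration, not the DHS processing code: dates are represented as integer month indices, the function names are my own, the bounds in the example are made up, and the uniform draw is an assumption about what "picking one number randomly" means.

```python
import random


def narrow_bounds(lower, upper, constraints):
    """Stage 1: intersect an initial [lower, upper] range with
    additional (low, high) ranges implied by other information."""
    for low, high in constraints:
        lower = max(lower, low)
        upper = min(upper, high)
    if lower > upper:
        raise ValueError("constraints are mutually inconsistent")
    return lower, upper


def impute_date(lower, upper, rng):
    """Stage 2: draw one month index at random within the final bounds
    (assumed uniform here)."""
    return rng.randint(lower, upper)


MIN_BIRTH_SPACING = 7  # months, the minimum spacing between two births

# Hypothetical example: a birth month initially known to lie in [1100, 1120];
# the reported age implies [1105, 1116]; the previous sibling was born in 1102.
previous_birth = 1102
lower, upper = narrow_bounds(
    1100,
    1120,
    [(1105, 1116), (previous_birth + MIN_BIRTH_SPACING, 1120)],
)
imputed = impute_date(lower, upper, random.Random(0))
```

In this toy case the spacing rule is the binding lower bound (1109) and the reported age the binding upper bound (1116), so the imputed month is drawn from eight candidates.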

Description of dataset types (what is household recode, individual recode, etc.)

Description of file types (what is a hierarchical or flat file?)

Description of file names (e.g. what does COIR41FL.ZIP stand for?)
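
As a rough aid for reading such file names, the sketch below splits an eight-character name into four two-character fields. The layout (country code, dataset type, phase/release code, file format) is my understanding of the DHS naming convention and should be treated as an assumption to verify against the linked description; the function and field names are hypothetical.

```python
def split_dhs_filename(filename):
    """Split a DHS-style file name such as 'COIR41FL.ZIP' into
    two-character fields (layout assumed, see the note above)."""
    stem = filename.rsplit(".", 1)[0].upper()
    if len(stem) != 8:
        raise ValueError("expected an 8-character name before the extension")
    return {
        "country": stem[0:2],       # assumed: country code
        "dataset_type": stem[2:4],  # assumed: e.g. IR for individual recode
        "version": stem[4:6],       # assumed: survey phase and release
        "file_format": stem[6:8],   # assumed: e.g. FL for flat file
    }


parts = split_dhs_filename("COIR41FL.ZIP")
```

Under that assumption, COIR41FL.ZIP would be an individual recode dataset in flat-file format.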

What variables are available:

Anthropometry data (height and weight) are now collected for all surveyed women of childbearing age. In the previous rounds of the surveys (until 1999), such data were collected only for women who gave birth within 3 or 5 years before the survey. (The exception is Cote d'Ivoire 1988/89, where all surveyed women were measured for their height and weight.) In the earliest rounds of surveys (until 1989), women's anthropometric data were not collected at all. Deaton (2007) uses the height data from DHS surveys extensively.

Hemoglobin level (for measuring anemia) is collected in the latest rounds of the survey (after 1999). Page 14 of the Model Questionnaire gives some description of this. For details, see the following two documents:

Information on children's food intake is available, though different versions collect the information in different ways. In DHS I and II surveys, what liquid and food was given during the past 24 hours as a complementary diet is collected only for women still breastfeeding (available as variables V409-V414). DHS III surveys collect the same information for all living children born in the past 3 or 5 years (variables M37 or V409-V414 for the last-born child). In addition, DHS III surveys ask what liquid and food was given during the past 7 days (variables M40). DHS IV surveys collect the same information only for the last-born child living with his/her mother, for the last 24 hours (variables M37 or V469) or for the last 7 days (variables M40 or V470).

Information on the treatment of fever and cough (symptoms of malaria and pneumonia, respectively, two of the major killers of children) is not very consistent across different rounds of surveys. First of all, the recall period changed in the DHS-II surveys: in DHS-I, mothers were asked whether their child had had a fever or cough during the last four weeks; in DHS-II, this changed to the last two weeks. Then the question on what treatments were given was dropped in the DHS-III surveys. In DHS-IV, drugs taken for fever were asked about again, but not those for cough. From whom advice or treatment was sought has been collected throughout, but it is not clear whether health professionals were actually present during the visit.

For variables V414 and V414A-D (what foods the baby ate during the last 24 hours) in the DHS II surveys, even though the Recode Manual suggests that V414 is used only if V414A-D are not collected, Burkina Faso's 1992 survey and Niger's 1992 survey use both. In both cases, it seems V414 refers to another food category in addition to V414A and V414D (in the questionnaire, three types of foods were asked about: "bouillie", "Autre aliment specialement préparé pour l'enfant", and "Plat familial").

Data on schooling for all the members of surveyed households is available in the household schedule. See variables HV106-HV110, HV121-HV129.

Data on infrastructure at the household level is available: access to electricity (HV206), telephone (HV221), and tap water (HV231).

Self-reported ethnicity of women is available (variable V131) for some surveys including the Rwanda 1992 survey.

References for understanding the dataset further

To understand the medical background for each variable, read the Model Questionnaires, which also explain the purpose of each question. Do have a look at the earlier versions of the questionnaire, because the newest version often describes the purpose of only the newly added questions.

For further medical background for child health history questions (immunization, treatment of diarrhea, etc.), Gareth Jones et al. (2003) "How Many Child Deaths Can We Prevent This Year?" The Lancet, 362: 65-71 is extremely useful.

I have created this document to help researchers to match subnational districts in Sub-Saharan African countries between different rounds of DHS surveys.

Lubotsky and Wittenberg (2006, Review of Economics & Statistics) propose how to make the best of the asset ownership variables in the DHS data to estimate the impact of household wealth.

Papers using this dataset include:

Oster (2007) to measure sexual behaviour and knowledge of HIV in African countries.

Oster (2005) to measure sex ratios at birth in African countries.

Young (2005) for the data on fertility and education. In some footnotes of this paper, the author reports a couple of peculiarities found in the data (see footnotes 35 and 58).

Dow et al. (1999) use Malawi 1992, Tanzania 1994, Zambia 1992, and Zimbabwe 1994 to investigate the effect of tetanus vaccine on birth weight.

Pitt (1997) uses the datasets for 14 Sub-Saharan African countries to investigate the determinants of child mortality.

Used by demographers to investigate the determinants of immunization uptake. See Desai and Alva (1998) and Gage et al. (1997).

Thomas et al. (1991) use the Brazil 1986 survey to investigate how parental education affects child height, by using the information on how frequently mothers listen to the radio, watch television, and read a newspaper.

Other papers using the DHS surveys can be found here.

Tuesday, October 6, 2015

Land Use Data

Africover is a land use dataset for several countries in East Africa. Click "Web Maps" in the left column to see what kind of data is available for each country.

Global Land Cover 2000 provides land use data for the whole world, produced from daily satellite images from November 1, 1999 to December 31, 2000. It's not clear if the data is available as an ArcGIS shapefile.

Global Land Cover Characteristics (GLCC) database

MODIS Land Cover database

  • Downloadable free of charge here
  • 5'x5' resolution, annually 2001-2012
  • Forests, shrublands, savannas, grasslands, wetlands, croplands, urban areas, etc.
  • see Friedl et al. (2002)

Wednesday, September 23, 2015

World Migration Matrix, 1500-2000

Constructed by Louis Putterman and his colleagues.

See this page for details.

Used by Putterman and Weil (2010).

Michalopoulos (2012) "The Origins of Ethnolinguistic Diversity" uses this data to identify countries in which more than 40% of the current population can trace ancestry within the same country boundary back to 1500 AD, for which variation in agricultural suitability and elevation is found to be a strong predictor of ethnic diversity.
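
The 40% selection rule can be sketched in Python as below. This is an illustration under assumptions, not the author's code: I assume the matrix gives, for each present-day country, the share of its population's ancestors who lived in each source country in 1500, so the "own" entry is the share tracing ancestry within the same boundary. The function name and the toy matrix are hypothetical.

```python
def countries_above_own_ancestry_share(matrix, threshold=0.40):
    """Return countries whose own-country ancestry share exceeds the threshold.

    matrix[c][o] is assumed to be the share of country c's current
    population whose ancestors lived in country o in 1500.
    """
    return sorted(
        country
        for country, shares in matrix.items()
        if shares.get(country, 0.0) > threshold
    )


# Toy matrix with made-up shares (each row sums to one).
toy_matrix = {
    "A": {"A": 0.90, "B": 0.10},
    "B": {"A": 0.70, "B": 0.30},
    "C": {"A": 0.60, "C": 0.40},
}
selected = countries_above_own_ancestry_share(toy_matrix)
```

Country C is excluded because its own share is exactly 40%, not more than 40%.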

Tuesday, September 22, 2015


See also the Terrain Ruggedness Index.


Developed by the U.S. Geological Survey's Center for Earth Resources Observation and Science (EROS) in 1996, GTOPO30 provides elevations at the 30 arc seconds (roughly 1km) grid level. See the USGS/EROS website for detail.
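
To see where "roughly 1km" comes from, the following sketch converts a grid resolution in arc-seconds into approximate kilometres. It treats the earth as a sphere with the equatorial circumference (an approximation chosen here for simplicity), and the function name is my own.

```python
import math

# Approximate length of one degree at the equator (spherical approximation).
KM_PER_DEGREE = 40075.017 / 360.0


def cell_size_km(arc_seconds, latitude_deg=0.0):
    """Return the approximate (north-south, east-west) size in km of a
    grid cell with the given resolution, at the given latitude."""
    degrees = arc_seconds / 3600.0
    north_south = degrees * KM_PER_DEGREE
    # East-west distance shrinks with the cosine of latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


ns_equator, ew_equator = cell_size_km(30)
ns_60, ew_60 = cell_size_km(30, latitude_deg=60)
```

At the equator a 30 arc-second cell is a little over 0.9 km on each side; at 60 degrees latitude its east-west width is about half of that, which matters when cell counts are used as a proxy for area.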

GTOPO30 was used by Deininger and Minten (2002) and Nunn and Puga (2007) to measure the degree of ruggedness of the earth's surface in each country, and by Duflo and Pande (2007) to calculate river gradient in India.
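
As an illustration of how ruggedness can be computed from gridded elevations, here is a sketch of the Terrain Ruggedness Index in the form I understand it to be commonly defined: the square root of the sum of squared elevation differences between a cell and its eight neighbours. Whether a given paper uses exactly this formula should be checked in the paper itself; the function name and the toy grid are hypothetical.

```python
import math


def terrain_ruggedness_index(grid, row, col):
    """Compute the index for an interior cell of a 2D elevation grid
    (a list of equally long rows)."""
    if not (0 < row < len(grid) - 1 and 0 < col < len(grid[0]) - 1):
        raise ValueError("cell must have all eight neighbours")
    center = grid[row][col]
    total = 0.0
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the cell itself
            total += (grid[row + d_row][col + d_col] - center) ** 2
    return math.sqrt(total)


# Toy grid: a single 10-metre bump on an otherwise flat surface.
toy_grid = [
    [100, 100, 100],
    [100, 110, 100],
    [100, 100, 100],
]
tri = terrain_ruggedness_index(toy_grid, 1, 1)
```

Cells on the edge of a tile lack some neighbours, so in practice adjacent tiles need to be joined before computing the index near tile boundaries.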

GTOPO30 is now superseded by Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010).


SRTM3, based on the Shuttle Radar Topography Mission, provides elevations at a higher spatial resolution of 3 arc-seconds (roughly 100m) than GTOPO30. SRTM30 is a version that aggregates SRTM3 to the 30 arc-second resolution. See the USGS/EROS website for more details. SRTM30 is supposed to be better than GTOPO30 (see the description of SRTM30 on this page).

For SRTM3 (version 2.1), the data is available here and the documentation is available here. For SRTM30 (version 2.1), both the data and the documentation are available here.

SRTM30 has been widely used by economists: Taryn Dinkelman's working paper (now published in American Economic Review) "The Effects of Rural Electrification on Employment: New Evidence from South Africa" (to create an instrument for electricity grid placements); Melissa Dell's working paper (now published in Econometrica) "The Persistent Effects of Peru's Mining Mita" (to create control variables); Acemoglu and Dell's paper forthcoming in AEJ Macro "Productivity Differences Within and Between Countries" (to calculate the distance to paved roads that takes into account elevation); Olken (2009) (to obtain the strength of TV signals in each sub-district of Indonesia); and Yanagizawa (2009) "Propaganda and Conflict: Theory and Evidence from the Rwandan Genocide".

How to use GTOPO30 / SRTM30 in ArcGIS

Here is a tip on "How to import GTOPO30 or SRTM30 data into ArcMap (for ArcGIS 9.x)". Step 7 should be skipped for SRTM30 because version 2 uses the value 0 (instead of -9999) for the ocean (see section 1.2 of the documentation).

Elevation data is also available from TerrainBase, constructed by the National Oceanic and Atmospheric Administration (NOAA) and the U.S. National Geophysical Data Center (downloadable at the Atlas of Biosphere). This one is used by Michalopoulos (2008). It is not clear whether it is the same as, better than, or worse than GTOPO30 and SRTM30. However, if the study area is the whole globe, this data is easier to use because it comes in one file. (GTOPO30 and SRTM30 are provided in several files, each of which covers a part of the globe.)
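
For those working outside ArcGIS, the difference in ocean coding can be handled as in the Python sketch below. It assumes that the elevation tiles are stored as big-endian signed 16-bit integers, which should be verified against the documentation of the file actually downloaded; the function names and the sample values are hypothetical.

```python
import struct


def decode_elevations(raw_bytes):
    """Decode a byte string of big-endian signed 16-bit integers
    (assumed storage format) into a list of elevations."""
    count = len(raw_bytes) // 2
    return list(struct.unpack(f">{count}h", raw_bytes[: count * 2]))


def mask_ocean(values, ocean_value=None):
    """Replace the ocean marker with None. With ocean_value=None nothing
    is masked, which is the safe choice when the ocean is coded as 0 and
    therefore cannot be told apart from land at sea level."""
    if ocean_value is None:
        return list(values)
    return [None if value == ocean_value else value for value in values]


# Made-up sample: one ocean cell coded -9999, then three land cells.
raw = struct.pack(">4h", -9999, 0, 12, 350)
gtopo_like = mask_ocean(decode_elevations(raw), ocean_value=-9999)

# With ocean coded as 0, masking is skipped entirely.
srtm_like = mask_ocean(decode_elevations(raw[2:]))
```

Masking 0 in SRTM30-style data would also remove genuine land cells at zero elevation, which is the same reason for skipping the masking step in the ArcMap tip.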