Monday, May 16, 2016

Ethiopian Rural Household Surveys

A panel household survey conducted by Stefan Dercon and his colleagues in 1989, 1994, 1995, 1997, 1999, 2004, and 2009. The details are available on his website.

Data from the 1989-1995 rounds can be downloaded from the above webpage. The 1997 round can be obtained by contacting IFPRI.


15 villages with 1477 households in total were surveyed three times, twice in 1994 and once in 1995. Six of the 15 villages were surveyed in 1989 as well. For more detail, see this documentation. In the 1997 survey, 9 additional villages were surveyed.

Household composition, education, asset ownership, credit, non-food expenditure, off-farm activities, land use, agricultural inputs, crop outputs, livestock ownership, health, health care provision, anthropometrics, food expenditure, female and child activities, etc.

The 1997 survey collects community-level information on electricity and water, sewage and toilet facilities, health services, education, NGO activity, migration, wages, and production and marketing.

The 2004 survey collects information on interviewees' trust in government, perceptions of property rights security, etc., and is used by Zerfu (2006).

Since 2004, subjective well-being measures have also been collected; they are used by Alem and Colmer (2015).

Friday, May 6, 2016


GeoNames allows you to find the geographic coordinates of locations across the world by typing the name of a location.

The website is useful to geo-reference your data if the location name is available.

If you have a lot of observations, however, searching the website one location at a time is costly. A possible way around this is to download the entire GeoNames data. This page lists the zip files, each of which contains all the locations in a particular country (see here to understand which file is for which country). The file can then be merged with your data by location name, which geo-references your data.

The list of variables (called fields) in the data (taken from the bottom of this page):

geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see, char(1)
feature code      : see, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
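The merge by location name described above could be sketched as follows. This is a minimal illustration, not a tested pipeline: it assumes the country file is tab-delimited UTF-8 text without a header row, with columns in the order listed above. The `village` key for your own data is a hypothetical name. Names matching more than one GeoNames record are left without coordinates so that they can be resolved by hand (for example with the admin codes).

```python
import csv

# Column order follows the field list above; the file itself is assumed to have no header row.
GEONAMES_FIELDS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude",
    "longitude", "feature_class", "feature_code", "country_code", "cc2",
    "admin1_code", "admin2_code", "admin3_code", "admin4_code",
    "population", "elevation", "dem", "timezone", "modification_date",
]


def normalize(name):
    """Crude name normalization: trim whitespace and lowercase."""
    return name.strip().lower()


def read_geonames(handle):
    """Read an open country file (assumed tab-delimited) into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(GEONAMES_FIELDS, row)) for row in reader if row]


def build_name_index(records):
    """Map every normalized name, ascii name, and alternate name to coordinates."""
    index = {}
    for rec in records:
        coords = (float(rec["latitude"]), float(rec["longitude"]))
        candidates = [rec["name"], rec["asciiname"]]
        candidates += rec["alternatenames"].split(",")
        for key in {normalize(n) for n in candidates if n.strip()}:
            index.setdefault(key, []).append(coords)
    return index


def georeference(rows, index, name_key="village"):
    """Attach coordinates to each row; ambiguous or missing names get None."""
    out = []
    for row in rows:
        matches = index.get(normalize(row[name_key]), [])
        lat, lon = matches[0] if len(matches) == 1 else (None, None)
        out.append({**row, "latitude": lat, "longitude": lon,
                    "n_matches": len(matches)})
    return out
```

To read a real file, something like `open(path, encoding="utf-8", newline="")` would be passed to `read_geonames`. Exact string matching is the simplest choice here; spelling variants in survey data often need fuzzier matching.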

Wednesday, May 4, 2016

Ethnographic Atlas

Compiled by George Peter Murdock (for who he is, see his obituary in American Anthropologist, 88(3), pp. 682-6) and published in 29 successive instalments in the journal ETHNOLOGY between 1962 and 1980.

The data for 862 societies, describing each at a time before its encounter with Europeans, were published in Ethnology, volume 6, number 2 (April 1967), and then as a hardcover book by the University of Pittsburgh Press.

Since then, more societies were added, and some of the 1967 data were revised (see here for detail). The final version, compiled by Patrick Gray in World Cultures, vol. 10, issue 1 (1998), is available as an SPSS data file or an R file from this webpage.
  • For Stata users, use the USESPSS ado (by Sergiy Radyakin) to convert the SPSS data.

Sample size

The Patrick Gray version contains 1267 societies. However, there are two duplicated observations (Chilcotin and Tokelau).
  • According to Gray (1998), which lists the revision history for every society included in the data, one of the duplicated entries for both Chilcotin and Tokelau appeared in the April 1967 issue of Ethnology and had no further revision afterwards. The other entry appeared before the April 1967 issue (October 1964 for Chilcotin; January 1967 for Tokelau). 
  • Given these, it seems appropriate to keep the entry that appeared in April 1967.  
  • Whether or not each observation appeared in April 1967 is recorded in the variable v89.
Consequently, the sample size is 1265 societies.
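The de-duplication rule above could be sketched as follows. The column name `name` and the value `1` for "appeared in April 1967" are assumptions made for illustration; the actual variable names and the coding of v89 should be checked against the codebook.

```python
def drop_duplicate_societies(societies, name_key="name", flag_key="v89", in_1967=1):
    """Keep one row per society, preferring the row flagged as April 1967.

    name_key and in_1967 are hypothetical; check the codebook for the real coding.
    """
    by_name = {}
    for row in societies:
        by_name.setdefault(row[name_key], []).append(row)

    kept = []
    for name, rows in by_name.items():
        if len(rows) == 1:
            kept.extend(rows)
            continue
        flagged = [r for r in rows if r[flag_key] == in_1967]
        if len(flagged) != 1:
            # Refuse to guess when the flag does not single out one entry.
            raise ValueError(f"cannot choose a unique entry for {name!r}")
        kept.extend(flagged)
    return kept
```

Raising an error on ambiguous duplicates, rather than silently keeping the first row, makes any change in a future release of the data visible.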

Date of observation

Different societies are observed in different years. Fenske (2013) notes that "eight societies [are] observed before 1500 (Ancient Egypt, Aryans, Babylonia, Romans, Icelander, Uzbeg, Khmer and Hebrews)" (page 1366).


The original dataset records the centroid of each society as a pair of integers (longitude and latitude in degrees). To associate each society with geographic characteristics, researchers use different definitions of each society's geographic `territory'.
An alternative approach is to match each society with the map of ethnic groups from another source.
A third approach is to create Thiessen polygons (see Alsan 2015, p. 399).
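A Thiessen polygon for a society is the set of locations that are closer to its centroid than to any other centroid. Assigning a location to a polygon therefore amounts to finding the nearest centroid. The sketch below uses great-circle distance on a sphere. It illustrates the idea only and is not the exact procedure of any of the cited papers, which may clip or bound the polygons.

```python
import math


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_society(lat, lon, centroids):
    """Return the society whose centroid is closest to (lat, lon).

    centroids: dict mapping society name -> (latitude, longitude).
    """
    return min(centroids,
               key=lambda name: great_circle_km(lat, lon, *centroids[name]))
```

Applying `nearest_society` to the centre of every grid cell in a raster gives each society the cells that make up its Thiessen polygon.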


Location (latitude and longitude), major subsistence activities (gathering, hunting, fishing, animal husbandry, or agriculture), mode of marriage (dowry etc.), family organization (extended or nuclear, monogamy or polygyny, etc.), type of agriculture (irrigation, type of crops), number of jurisdictional levels, specialization of economic activities by sex, language group, class structure, form and prevalence of slavery, succession rules for local headman, property inheritance rules, type of dwellings.

Used by

The data on the number of jurisdictional levels beyond the local community are used as the explanatory variable for economic development by Gennaioli and Rainer (2007) and Michalopoulos and Papaioannou (2013). They are also used as the dependent variable by Fenske (2014), Alsan (2015), and Mayshar et al. (2015).

Alesina et al. (2013) exploit the data on the use of the plough in agriculture and on the gender-based division of labor in agriculture, to show the correlation between the two.

Fenske (2013) uses the variables on the presence of land rights and slavery. Page 1368 lists various other studies using the Ethnographic Atlas.

Alsan (2015) uses the variables on domesticated animals, female participation in agriculture, intensive agriculture, indigenous slavery, and jurisdictional hierarchy, to see if these variables are explained by the climate suitability for tsetse flies (which kill domesticated animals in Africa).

Boix (2015, chapter 1) looks at the correlation of foraging societies with the size of settlements, inheritance rules, social stratification, and the number of jurisdictional levels beyond local community.

Friday, April 22, 2016

Ethnic Diversity

See Section 5.2 of Alesina and La Ferrara (2005) for a survey of different measures of ethnic diversity. For more recent discussions in relation to civil wars, see Cederman and Girardin (2007) and Fearon, Kasara, and Laitin (2007).

For sub-national level ethnic diversity, see this post.

1. Ethnolinguistic fragmentation index

First brought into economics by Mauro (1995)

See Easterly and Levine (1997), whose dataset is available here.

Criticised by Alesina et al. (2003), Fearon (2003), and Posner (2004).

2. Ethnic fractionalization index

Proposed by Alesina et al. (2003)

Found to be robustly correlated with the degree of insulation of political leaders (see Aghion et al. 2004)

Criticised by Posner (2004) (see footnote 10)

3. Linguistic fractionalization index

Proposed by Alesina et al. (2003)
  • Used by Alesina et al. (2016) to show that ethnic/linguistic diversity is uncorrelated with GDP per capita once inequality across ethnic groups is controlled for.
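The fractionalization indices listed so far share one functional form: one minus the Herfindahl index of group shares, which is the probability that two randomly drawn individuals belong to different groups. They differ in which groups (ethnic, linguistic) and which source data are used. A minimal sketch, with shares rescaled to sum to one as a convenience:

```python
def fractionalization(group_sizes):
    """One minus the sum of squared group shares.

    group_sizes: population counts or shares of each group; rescaled to sum to one.
    """
    total = sum(group_sizes)
    if total <= 0:
        raise ValueError("group sizes must sum to a positive number")
    return 1.0 - sum((size / total) ** 2 for size in group_sizes)
```

The index is 0 for a single group and approaches 1 as the population splits into many small groups; two equal groups give 0.5.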

4. Ethnic polarization index

Proposed by Montalvo and Reynal-Querol (2005a), who apply the polarization index proposed by Esteban and Ray (1994), and shown to be correlated with incidence of civil wars.

The dataset is available at the American Economic Review website (see Montalvo and Reynal-Querol's article).

Also found to be correlated with the degree of insulation of political leaders by Aghion et al. (2004)
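For comparison with fractionalization, the Montalvo and Reynal-Querol polarization index can be written as 4 Σ s_i² (1 − s_i), where s_i is the population share of group i. A minimal sketch, again rescaling the inputs to shares:

```python
def rq_polarization(group_sizes):
    """Polarization index 4 * sum(s_i^2 * (1 - s_i)) over group shares s_i."""
    total = sum(group_sizes)
    if total <= 0:
        raise ValueError("group sizes must sum to a positive number")
    shares = [size / total for size in group_sizes]
    return 4.0 * sum(s * s * (1.0 - s) for s in shares)
```

Unlike fractionalization, this index peaks at 1 with two equally sized groups and falls as the number of equal-sized groups grows beyond two.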

5. Linguistic heterogeneity index (updated on 24 Nov and 6 Dec 2011)

Desmet, Ortuno-Ortin, and Wacziarg (2012) propose a measure of linguistic diversity at different levels of linguistic aggregation, based on language trees from the Ethnologue data. The data is available from Wacziarg's website (click Research).

  • Used by Alesina et al. (2016) to show that ethnic/linguistic diversity is uncorrelated with GDP per capita once inequality across ethnic groups is controlled for.
Esteban, Mayoral, and Ray (2011) use the same Ethnologue data to calculate the distance between two linguistic groups for obtaining a polarization index that is more general than the one by Montalvo and Reynal-Querol (2005a) described above (in which the distance between any two groups is set to be one).
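A distance-weighted polarization index of this kind can be sketched as Σ_i Σ_j n_i² n_j d_ij, where n_i is the share of group i and d_ij the distance between groups i and j. How d_ij is derived from the language tree is not shown here; the distance matrix is taken as given, and the function name is made up for illustration.

```python
def distance_polarization(shares, distance):
    """Sum over i, j of shares[i]**2 * shares[j] * distance[i][j].

    shares: group population shares (assumed to sum to one).
    distance: square matrix (list of lists) with zeros on the diagonal.
    """
    n = len(shares)
    if len(distance) != n or any(len(row) != n for row in distance):
        raise ValueError("distance must be an n-by-n matrix")
    return sum(
        shares[i] ** 2 * shares[j] * distance[i][j]
        for i in range(n)
        for j in range(n)
    )
```

Setting every off-diagonal distance to one reduces the expression to Σ n_i² (1 − n_i), which is one quarter of the Montalvo and Reynal-Querol index.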

6. Proportion of the population in African countries whose ethnic group lives in other countries as well

Constructed by Englebert (2000)

Treated as the measure of legitimacy of African states over their population

Data and codebook are downloadable at Englebert's website.

7. Ethnopolitical group indices

Developed by Scarritt, James and Shaheen Mozaffar. 1999. “The Specification of Ethnic Cleavages and Ethnopolitical Groups for the Analysis of Democratic Competition in Contemporary Africa.” Nationalism and Ethnic Politics 5(1).

Used by Scarritt, Mozaffar, and Galaich (2003) and Blimpo, Harding, and Wantchekon (2013) "Public Investment in Rural Infrastructure: Some Political Economy Considerations" (forthcoming in Journal of African Economies).

Administrative Boundaries

There are several GIS datasets on administrative boundaries.

National Boundaries

the World Vector Shoreline (WVS)


  • If you're interested in the historical national boundaries since 1945

Sub-national Boundaries

Global Administrative Unit Layers (GAUL)

  • Annual PANEL data from 1990 to 2014.
  • Thus, useful if you're interested in changes of subnational boundaries during this period.
  • Used by Briggs (2015).

Global Administrative Areas (GADM)
  • an alternative to GAUL. Whether it is better or worse is not clear. 
  • mentioned by Gleditsch and Weidmann (2012) in their review of spatial data analysis in political science.
  • Used by the Gridded Population of the World Version 4 (see here).
  • Used by Dreher et al (2015).
    • They mention that GADM does not include the second level administrative boundaries (counties/districts) for Egypt, Equatorial Guinea, Lesotho, Libya, and Swaziland.
  • Also used by Alesina et al. (2016) to measure inequality across subnational administrative regions (which turns out to be negatively correlated with per capita GDP).

The Second Administrative Level Boundaries (SALB) dataset

  • compiled by the United Nations
  • provides the GIS data on second-tier subnational administrative boundaries (i.e., district boundaries).
  • I'm not sure whether the GAUL dataset mentioned above incorporates this or the SALB dataset has its original data.
  • For subnational boundary changes during early years
  • This is the online updated version of the book Administrative Subdivisions of Countries by Gwillim Law (Jefferson, North Carolina: McFarland & Company, 1999).
  • provides the list of administrative regions for every country, past and present. Very useful if you need to match different sub-national or micro datasets based on sub-national regions, especially when a country of your interest has changed the boundaries of sub-national regions quite frequently such as Nigeria and Uganda.

DMSP-OLS Nighttime Lights Time Series Version 4

Downloadable here.

The spatial resolution is 30 arc-second (about 1km).
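The "about 1km" figure holds near the equator. A 30 arc-second step is 1/120 of a degree, and the east-west width of a pixel shrinks with the cosine of latitude. A quick check, assuming a spherical Earth with a mean radius of 6371 km:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; spherical approximation


def pixel_size_km(lat_deg, arcsec=30.0):
    """Return (east-west, north-south) pixel size in km at a given latitude."""
    step_rad = math.radians(arcsec / 3600.0)
    north_south = EARTH_RADIUS_KM * step_rad
    east_west = north_south * math.cos(math.radians(lat_deg))
    return east_west, north_south
```

Under this approximation a pixel is roughly 0.93 km on each side at the equator and about half as wide at 60 degrees latitude. This matters when summing light over areas at very different latitudes.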

Data construction

To understand how this dataset is constructed from the original satellite images and the potential data issues, see Elvidge et al. (2001) and Elvidge et al. (2010). Noor et al. (2008) is also useful to understand this data.


Min et al. (2013) validate this measure against survey-based measures of electricity access in rural Senegal and Mali in 2011. Their conclusions (quoted from Min and Gaba 2014, p. 9512) are:

  • Electrified villages are consistently brighter than unelectrified villages across a variety of nighttime satellite images
  • Electrified villages appear brighter in satellite imagery because of the presence of streetlights, and brightness increases with the number of streetlights.
  • The correlation between light output recorded by the satellite with household electricity use and access is low.

Min and Gaba (2014) conduct the same validation exercise for villages in Vietnam in 2013. They reach the same conclusions except for the last point: in Vietnam, household-level access to electricity is also correlated with nighttime light satellite images.

See also Chen and Nordhaus (2011).

Use in economics research

The data is becoming popular among economists. Recent examples include Henderson et al. (2012), Michalopoulos and Papaioannou (2013, 2014), and Alesina et al. (2012). Hodler and Raschky (2014) exploit the annual panel nature of the data to find that the birthplace of a new national leader becomes brighter after the leader assumes power. Baskaran et al. (2015) relate nighttime light to electoral cycles in India.

Bleakley and Lin (2012) use nighttime light as a measure of the spatial distribution of contemporary economic activity, to see whether portage sites still predict where economic activities are concentrated today, long after their original advantage became obsolete.

The raw data ranges from 0 to 63. To be used in regression analysis, there are several ways to aggregate the raw data.
  • Henderson et al. (2012), Michalopoulos and Papaioannou (2013, 2014), and Hodler and Raschky (2014) use the nighttime light data as a measure of living standards. They use the logarithm of the average within each spatial unit of analysis.
    • Logarithmic transformation is used because the distribution of nighttime light intensity is right-skewed with around 10% of observations being zero.
    • Michalopoulos and Papaioannou (2013, 2014) and Hodler and Raschky (2014) add 0.01 to the average before taking the log, so that the roughly 10% of observations without light are not dropped.
  • Alesina et al. (2016) and Baskaran et al (2015) use the average or sum of light values from all pixels within each spatial unit of analysis divided by population.
  • Baskaran et al. (2015) also measure the proportion of villages with a positive value of nighttime light at the village centroid.
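The aggregation choices above can be sketched as simple functions of the raw pixel values in one spatial unit. The function names are made up for illustration, and extracting the pixels that fall within each unit (a GIS step) is not shown.

```python
import math

MAX_DN = 63  # raw values range from 0 to 63


def _checked(values):
    """Return the values as a list after basic range checks."""
    values = list(values)
    if not values:
        raise ValueError("no pixels in this spatial unit")
    if any(v < 0 or v > MAX_DN for v in values):
        raise ValueError("raw values must lie between 0 and 63")
    return values


def log_mean_light(pixels, offset=0.01):
    """Log of (offset + average light); the offset keeps unlit units in the sample."""
    values = _checked(pixels)
    return math.log(offset + sum(values) / len(values))


def light_per_capita(pixels, population):
    """Sum of light over all pixels in the unit divided by its population."""
    values = _checked(pixels)
    if population <= 0:
        raise ValueError("population must be positive")
    return sum(values) / population


def share_lit(centroid_values):
    """Share of villages whose centroid pixel has a positive light value."""
    values = _checked(centroid_values)
    return sum(1 for v in values if v > 0) / len(values)
```

With `offset=0.01`, a completely dark unit gets log(0.01), about -4.6, rather than being dropped. Results can be sensitive to this constant, so it is worth varying it as a robustness check.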

To use this dataset as panel data, one issue is the compatibility of different satellites in measuring light intensity. Henderson et al. (2012) simply take the average if two satellites provide the data for the same year and control for year fixed effects in regression analysis to account for any differences across years. Alternatively, the following book chapter attempts to calibrate values from different satellites to account for inter-satellite differences and inter-annual sensor decay:
Elvidge, Christopher D., Feng-Chi Hsu, Kimberly E. Baugh and Tilottama Ghosh (2014). "National Trends in Satellite Observed Lighting: 1992-2012." Global Urban Monitoring and Assessment Through Earth Observation. Ed. Qihao Weng. CRC Press.
The calibrated version aggregated to the 0.5x0.5 degree cell level is available as part of the PRIO-GRID data.
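The simple averaging step attributed to Henderson et al. (2012) above, taking the mean when two satellites cover the same year, could look like this. The tuple layout and the satellite labels are hypothetical. The year fixed effects belong to the regression stage and are not shown.

```python
from collections import defaultdict


def average_across_satellites(readings):
    """Average light by (unit, year) over all satellites observing that year.

    readings: iterable of (unit, year, satellite, value) tuples.
    Returns a dict mapping (unit, year) -> mean value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (unit, year) -> [sum, count]
    for unit, year, _satellite, value in readings:
        entry = totals[(unit, year)]
        entry[0] += value
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

This treats satellites as interchangeable within a year. It does not correct for sensor differences, which is what the calibration approach cited above attempts to do.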

World Language Mapping System

Compiled by Global Mapping International.

"This dataset consists of polygons covering most of the world, for each language spoken today. The language group locations are accurate for the approximate years of 1990-1995. The data are based on SIL International’s 15th edition of the Ethnologue linguistics database of languages around the world." (Matuszeski and Schneider 2006, p. 11)

Alesina et al. (2016) note (p. 433) that the coverage is limited for the Americas and Australia because major immigrant groups are not recorded.

Used by