Friday, April 22, 2016

Ethnic Diversity

See Section 5.2 of Alesina and La Ferrara (2005) for a survey of different measures of ethnic diversity. For more recent discussions in relation to civil wars, see Cederman and Girardin (2007) and Fearon, Kasara, and Laitin (2007).

For sub-national level ethnic diversity, see this post.

1. Ethnolinguistic fragmentation index

First brought into economics by Mauro (1995)

See Easterly and Levine (1997), whose dataset is available here.

Criticised by Alesina et al. (2003), Fearon (2003), and Posner (2004).

2. Ethnic fractionalization index

Proposed by Alesina et al. (2003)

Found to be robustly correlated with the degree of insulation of political leaders (see Aghion et al. 2004)

Criticised by Posner (2004) (see footnote 10)

3. Linguistic fractionalization index

Proposed by Alesina et al. (2003)
  • Used by Alesina et al. (2016) to show that ethnic/linguistic diversity is uncorrelated with GDP per capita once inequality across ethnic groups is controlled for.
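As far as I understand, the indices in items 1 to 3 share one formula and differ mainly in how groups are classified: one minus the Herfindahl index of group population shares, that is, the probability that two randomly drawn individuals belong to different groups. Below is a minimal Python sketch; the function name and the shares are made up for illustration.

```python
def fractionalization(shares):
    """Return one minus the Herfindahl index of group population shares."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    # Rescale so that the shares sum to one (an assumption about the input).
    normalized = [s / total for s in shares]
    return 1.0 - sum(s * s for s in normalized)


# Illustrative (invented) population shares.
homogeneous = fractionalization([1.0])
two_equal_groups = fractionalization([0.5, 0.5])
four_equal_groups = fractionalization([0.25, 0.25, 0.25, 0.25])
```

The index is 0 for a fully homogeneous country and approaches 1 as the population splits into many small groups.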

4. Ethnic polarization index

Proposed by Montalvo and Reynal-Querol (2005a), who apply the polarization index proposed by Esteban and Ray (1994), and shown to be correlated with incidence of civil wars.

The dataset is available at the American Economic Review website (see Montalvo and Reynal-Querol's article).

Also found to be correlated with the degree of insulation of political leaders by Aghion et al. (2004)
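For reference, the index of Montalvo and Reynal-Querol is usually written as follows, where \pi_i is the population share of group i and N is the number of groups (the notation here is mine):

```latex
RQ = 1 - \sum_{i=1}^{N} \left( \frac{1/2 - \pi_i}{1/2} \right)^{2} \pi_i
   = 4 \sum_{i=1}^{N} \pi_i^{2} \, (1 - \pi_i)
```

It reaches its maximum of 1 when two groups each hold half of the population.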

5. Linguistic heterogeneity index (updated on 24 Nov and 6 Dec 2011)

Desmet, Ortuno-Ortin, and Wacziarg (2012) propose a measure of linguistic diversity at different levels of linguistic aggregation, based on language trees from the Ethnologue data. The data is available from Wacziarg's website (click Research).
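To illustrate what "different levels of linguistic aggregation" means, here is a small Python sketch that cuts a language tree at a given depth and computes a fractionalization index over the resulting groups. The tree paths, the shares, and the function names are all invented for illustration; this is not the authors' code, and their actual measures should be taken from the paper.

```python
from collections import defaultdict


def fractionalization(shares):
    """Return one minus the Herfindahl index of group shares."""
    total = sum(shares)
    return 1.0 - sum((s / total) ** 2 for s in shares)


def diversity_at_level(languages, level):
    """Group languages by the first `level` nodes of their tree path,
    then compute fractionalization over the aggregated groups."""
    groups = defaultdict(float)
    for path, share in languages:
        groups[path[:level]] += share
    return fractionalization(list(groups.values()))


# Hypothetical country: (path from root family to language, population share).
languages = [
    (("Indo-European", "Germanic", "English"), 0.4),
    (("Indo-European", "Germanic", "German"), 0.2),
    (("Indo-European", "Romance", "French"), 0.3),
    (("Uralic", "Finnic", "Finnish"), 0.1),
]

diversity_by_level = {
    level: diversity_at_level(languages, level) for level in (1, 2, 3)
}
```

Measured diversity rises as the tree is cut at a finer level, because groups that were pooled at a coarse level are counted separately.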

  • Used by Alesina et al. (2016) to show that ethnic/linguistic diversity is uncorrelated with GDP per capita once inequality across ethnic groups is controlled for.
Esteban, Mayoral, and Ray (2011) use the same Ethnologue data to calculate the distance between two linguistic groups, which yields a polarization index more general than the one by Montalvo and Reynal-Querol (2005a) described above (in which the distance between any two groups is set to one).
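The relationship between the two polarization indices can be sketched in Python. I assume the distance-weighted index takes the form of the sum over all pairs (i, j) of n_i^2 * n_j * d_ij, where n_i is a group's population share and d_ij the distance between groups. The factor 4 below is only there to match the scaling of the Montalvo and Reynal-Querol index, and the shares and distance matrix are invented.

```python
def polarization(shares, distance):
    """Distance-weighted polarization: sum over i, j of n_i^2 * n_j * d_ij."""
    size = len(shares)
    return sum(
        shares[i] ** 2 * shares[j] * distance[i][j]
        for i in range(size)
        for j in range(size)
    )


def unit_distance(size):
    """Distance matrix with 1 between any two different groups, 0 otherwise."""
    return [[0.0 if i == j else 1.0 for j in range(size)] for i in range(size)]


# Hypothetical shares and linguistic distances (values between 0 and 1).
shares = [0.5, 0.3, 0.2]
linguistic_distance = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.9],
    [0.9, 0.9, 0.0],
]

with_unit_distance = 4 * polarization(shares, unit_distance(3))
with_tree_distance = 4 * polarization(shares, linguistic_distance)
```

With unit distances the result equals 4 times the sum of n_i^2 * (1 - n_i). Once the two largest groups are treated as linguistically close, measured polarization falls.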

6. Proportion of the population in African countries whose ethnic group lives in other countries as well

Constructed by Englebert (2000)

Treated as the measure of legitimacy of African states over their population

The data and codebook are downloadable at Englebert's website.

7. Ethnopolitical group indices

Developed by Scarritt, James and Shaheen Mozaffar. 1999. “The Specification of Ethnic Cleavages and Ethnopolitical Groups for the Analysis of Democratic Competition in Contemporary Africa.” Nationalism and Ethnic Politics 5(1).

Used by Scarritt, Mozaffar, and Galaich (2003) and Blimpo, Harding, and Wantchekon (2013), "Public Investment in Rural Infrastructure: Some Political Economy Considerations" (forthcoming in Journal of African Economies).

Administrative Boundaries

There are several GIS datasets on administrative boundaries.

National Boundaries

the World Vector Shoreline (WVS)


  • If you're interested in the historical national boundaries since 1945

Sub-national Boundaries

Global Administrative Unit Layers (GAUL)

  • Annual panel data from 1990 to 2014.
  • Thus, useful if you're interested in changes of subnational boundaries during this period.
  • Used by Briggs (2015).

Global Administrative Areas (GADM)
  • an alternative to GAUL. Whether it is better or worse is not clear. 
  • mentioned by Gleditsch and Weidmann (2012) in their review of spatial data analysis in political science.
  • Used by the Gridded Population of the World Version 4 (see here).
  • Used by Dreher et al. (2015).
    • They mention that GADM does not include the second level administrative boundaries (counties/districts) for Egypt, Equatorial Guinea, Lesotho, Libya, and Swaziland.
  • Also used by Alesina et al. (2016) to measure inequality across subnational administrative regions (which turns out to be negatively correlated with per capita GDP).

The Second Administrative Level Boundaries (SALB) dataset

  • compiled by the United Nations
  • provides the GIS data on second-tier subnational administrative boundaries (i.e., district boundaries).
  • I'm not sure whether the GAUL dataset mentioned above incorporates this or the SALB dataset has its original data.
  • For subnational boundary changes during early years
  • This is the online updated version of the book Administrative Subdivisions of Countries by Gwillim Law (Jefferson, North Carolina: McFarland & Company, 1999).
  • provides the list of administrative regions for every country, past and present. Very useful if you need to match different sub-national or micro datasets based on sub-national regions, especially when a country of your interest has changed the boundaries of sub-national regions quite frequently such as Nigeria and Uganda.

DMSP-OLS Nighttime Lights Time Series Version 4

Downloadable here.

The spatial resolution is 30 arc-second (about 1km).

Data construction

To understand how this dataset is constructed from the original satellite images and the potential data issues, see Elvidge et al. (2001) and Elvidge et al. (2010). Noor et al. (2008) is also useful to understand this data.


Min et al. (2013) validate this measure against a survey-based measure of electricity access in rural Senegal and Mali in 2011. Their conclusions (quoted from Min and Gaba 2014, p. 9512) are:

  • Electrified villages are consistently brighter than unelectrified villages across a variety of nighttime satellite images
  • Electrified villages appear brighter in satellite imagery because of the presence of streetlights, and brightness increases with the number of streetlights.
  • The correlation between light output recorded by the satellite with household electricity use and access is low.

Min and Gaba (2014) conduct the same validation exercise for villages in Vietnam in 2013. They reach the same conclusions except for the last point: in Vietnam, household-level access to electricity is also correlated with nighttime light satellite images.

See also Chen and Nordhaus (2011).

Use in economics research

The data is becoming popular among economists. Recent examples include Henderson et al. (2012), Papaioannou and Michalopoulos (2013, 2014), and Alesina et al. (2012). Hodler and Raschky (2014) exploit the annual panel nature of the data to find that the birthplace of a new national leader becomes brighter after the leader assumes power. Baskaran et al. (2015) relate nighttime light to electoral cycles in India.

Bleakley and Lin (2012) use nighttime light as a measure of the spatial distribution of contemporary economic activity, to see whether portage sites still predict where economic activity is concentrated today, long after their original advantage became obsolete.

The raw data ranges from 0 to 63. There are several ways to aggregate the raw data for use in regression analysis.
  • Henderson et al. (2012), Papaioannou and Michalopoulos (2013, 2014), and Hodler and Raschky (2014) use the nighttime light data as a measure of living standards. They use the logarithm of the average within each spatial unit of analysis.
    • The logarithmic transformation is used because the distribution of nighttime light intensity is right-skewed, with around 10% of observations being zero.
    • Papaioannou and Michalopoulos (2013, 2014) and Hodler and Raschky (2014) add 0.01 to the average before taking the log, so that the roughly 10% of observations without light can still be used.
  • Alesina et al. (2016) and Baskaran et al. (2015) use the average or sum of light values from all pixels within each spatial unit of analysis, divided by population.
  • Baskaran et al. (2015) also measure the proportion of villages with a positive value of nighttime light at the village centroid.
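The three aggregation approaches listed above can be sketched in a few lines of Python. The function names and the pixel values are my own illustrations; the papers' actual procedures (for example, how pixels are assigned to spatial units) may differ in detail.

```python
import math


def log_mean_light(pixels, offset=0.01):
    """Log of the average pixel value in a spatial unit, plus a small offset
    so that units with no light at all are not dropped."""
    mean = sum(pixels) / len(pixels)
    return math.log(mean + offset)


def light_per_capita(pixels, population):
    """Sum of pixel values in a spatial unit divided by its population."""
    return sum(pixels) / population


def share_lit(centroid_values):
    """Proportion of villages whose centroid pixel has a positive value."""
    lit = sum(1 for value in centroid_values if value > 0)
    return lit / len(centroid_values)


# Invented pixel values on the raw 0-63 scale.
dark_region = log_mean_light([0, 0, 0, 0])
bright_region = log_mean_light([12, 40, 63, 5])
per_capita = light_per_capita([10, 20, 33], population=1000)
lit_share = share_lit([0, 5, 0, 12])
```

Without the offset, `math.log` would raise an error for the dark region, which is why the 0.01 adjustment matters for the zero-light observations.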

One issue in using this dataset as panel data is the comparability of different satellites in measuring light intensity. Henderson et al. (2012) simply take the average if two satellites provide data for the same year and control for year fixed effects in the regression analysis to account for any differences across years. Alternatively, the following book chapter attempts to calibrate values from different satellites to account for inter-satellite differences and inter-annual sensor decay:
Elvidge, Christopher D., Feng-Chi Hsu, Kimberly E. Baugh and Tilottama Ghosh (2014). "National Trends in Satellite Observed Lighting: 1992-2012." Global Urban Monitoring and Assessment Through Earth Observation. Ed. Qihao Weng. CRC Press.
The calibrated version aggregated to the 0.5x0.5 degree cell level is available as part of the PRIO-GRID data.
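The simpler approach attributed to Henderson et al. (2012) above, averaging across satellites within a year, might look like the following sketch. The data layout (a dictionary keyed by satellite and year) and the values are assumptions for illustration. The year fixed effects belong to the regression stage and are not shown.

```python
from collections import defaultdict


def average_across_satellites(readings):
    """Collapse {(satellite, year): value} to {year: mean value}."""
    by_year = defaultdict(list)
    for (_satellite, year), value in readings.items():
        by_year[year].append(value)
    return {year: sum(values) / len(values) for year, values in sorted(by_year.items())}


# Invented light values for one spatial unit; two satellites overlap in 1994.
readings = {
    ("F10", 1993): 4.5,
    ("F10", 1994): 6.0,
    ("F12", 1994): 4.0,
    ("F12", 1995): 5.5,
}

annual_light = average_across_satellites(readings)
```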

World Language Mapping System

Compiled by Global Mapping International.

"This dataset consists of polygons covering most of the world, for each language spoken today. The language group locations are accurate for the approximate years of 1990-1995. The data are based on SIL International’s 15th edition of the Ethnologue linguistics database of languages around the world." (Matuszeski and Schneider 2006, p. 11)

Alesina et al. (2016) note (p. 433) that the coverage is limited for the Americas and Australia because major immigrant groups are not recorded.

Used by

Geo-referencing of Ethnic Groups (GREG) dataset

Cited from the dataset's website:
Relying on maps and data drawn from the classical Soviet Atlas Narodov Mira, the “Geo-referencing of ethnic groups” (GREG) dataset employs geographic information systems (GIS) to represent group territories as polygons. The GREG dataset consists of 8969 polygons and is provided in ESRI shapefile format.
Used by:

Wednesday, April 20, 2016


See also the Terrain Ruggedness Index.

The best elevation data as of 2016 seems to be WorldDEM, although I haven't seen any application in economics research. It is also not free of charge. Below is a list of other elevation datasets (available free of charge) that have been used by economists in the past.


Developed by the U.S. Geological Survey's Center for Earth Resources Observation and Science (EROS) in 1996, GTOPO30 provides elevations at the 30 arc seconds (roughly 1km) grid level. See the USGS/EROS website for detail.

GTOPO30 was used by Deininger and Minten (2002), by Nunn and Puga (2007) to measure the ruggedness of the earth's surface in each country, and by Duflo and Pande (2007) to calculate river gradients in India.
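To give a sense of how ruggedness can be derived from gridded elevation data, here is a Python sketch of one common definition of a terrain ruggedness index: the square root of the sum of squared elevation differences between a cell and its neighbouring cells. I am not claiming this matches any of the papers above in every detail. The grid and the function name are invented.

```python
import math


def terrain_ruggedness(grid, row, col):
    """Square root of the sum of squared elevation differences between
    the cell at (row, col) and its adjacent cells (up to eight)."""
    center = grid[row][col]
    total = 0.0
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the cell itself
            r, c = row + d_row, col + d_col
            # Cells on the edge of the grid simply have fewer neighbours.
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                total += (grid[r][c] - center) ** 2
    return math.sqrt(total)


# Invented elevations in metres.
flat = [[100, 100, 100], [100, 100, 100], [100, 100, 100]]
pit = [[11, 11, 11], [11, 10, 11], [11, 11, 11]]

flat_index = terrain_ruggedness(flat, 1, 1)
pit_index = terrain_ruggedness(pit, 1, 1)
```

A country-level measure would then average such cell-level values over all cells in the country, possibly with area weights.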

GTOPO30 is now superseded by Global Multi-resolution Terrain Elevation Data 2010 (GMTED2010).


SRTM3 is an updated version of GTOPO30 (I suppose) at a higher spatial resolution of 3 arc-seconds (roughly 100m). SRTM30 is a version that aggregates SRTM3 to the 30 arc second resolution. See the USGS/EROS website for more details. SRTM30 is supposed to be better than GTOPO30 (see the description of SRTM30 on this page).
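Going from 3 arc-seconds to 30 arc-seconds means combining blocks of 10 by 10 cells into one. The Python sketch below assumes a plain block mean, which I have not checked against the exact procedure described in the SRTM30 documentation. The function name and the grid are invented.

```python
def aggregate_blocks(grid, factor):
    """Average non-overlapping factor-by-factor blocks of a 2D grid."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [
                grid[i][j]
                for i in range(r, r + factor)
                for j in range(c, c + factor)
            ]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Small invented example: a 4x4 grid reduced by a factor of 2.
fine = [
    [1, 3, 10, 10],
    [5, 7, 10, 10],
    [0, 0, 2, 4],
    [0, 0, 6, 8],
]
coarse = aggregate_blocks(fine, 2)
```

For the SRTM3 to SRTM30 case the factor would be 10.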

For SRTM3 (version 2.1), the data is available here and the documentation is available here. For SRTM30 (version 2.1), both the data and the documentation are available here. For a graphical interface to download the data, visit here.

SRTM30 has been widely used by economists: Taryn Dinkelman's working paper (now published in American Economic Review) "The Effects of Rural Electrification on Employment: New Evidence from South Africa" (to create an instrument for electricity grid placements); Melissa Dell's working paper (now published in Econometrica) "The Persistent Effects of Peru's Mining Mita" (to create control variables); Acemoglu and Dell's paper forthcoming in AEJ Macro "Productivity Differences Within and Between Countries" (to calculate the distance to paved roads that takes into account elevation); Olken (2009) (to obtain the strength of TV signals in each sub-district of Indonesia); and Yanagizawa (2009) "Propaganda and Conflict: Theory and Evidence from the Rwandan Genocide".

How to use GTOPO30 / SRTM30 in ArcGIS

Here is a tip on "How to import GTOPO30 or SRTM30 data into ArcMap (for ArcGIS 9.x)". Step 7 should be skipped because SRTM30 version 2 uses the value 0 (instead of -9999) for the ocean (see section 1.2 of the documentation).

ASTER Global Digital Elevation Model Version 2

An alternative elevation dataset to SRTM. Rexer and Hirt (2014) validate SRTM and ASTER against elevation data in Australia, concluding that SRTM is superior in general, with ASTER better for mountainous areas.

Downloadable here.

Used by Mariaflavia Harari's working paper entitled "Cities in Bad Shape: Urban Geometry in India".

Elevation data is also available from TerrainBase, constructed by the National Oceanic and Atmospheric Administration (NOAA) and the U.S. National Geophysical Data Center (downloadable at the Atlas of the Biosphere). This one is used by Michalopoulos (2008). It is not clear whether it is the same as, better than, or worse than GTOPO30 and SRTM30. However, if the study area is the whole globe, this data is easier to use because it comes in one file. (GTOPO30 and SRTM30 are provided in several files, each of which covers a part of the globe.)

Tuesday, April 12, 2016

GDELT (Global Dataset on Events, Location and Tone)

From Manacorda and Tesei (2016):
"an open-access database that, through an automated coding of newswires, collects information on the occurrence and location of political events, including protests, worldwide. The dataset contains an average of 8.3 millions fully geo-coded records of daily events per year for the entire world, although the number of observations increases considerably over time. For each event the data report the exact day of occurrence and precise location (latitude and longitude of the centroid) at the level of city or landmark."
Used by Manacorda and Tesei (2016) to measure the location of protests in Africa.
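As a rough idea of how such event records could be turned into a protest measure, here is a Python sketch that counts protest events by grid cell and day. It is not necessarily how Manacorda and Tesei process the data. The field names (`EventRootCode`, `ActionGeo_Lat`, `ActionGeo_Long`, `SQLDATE`) and the use of CAMEO root code 14 for protests follow my reading of the GDELT event table and should be checked against the codebook. The records and the 0.5-degree cell size are invented.

```python
import math
from collections import Counter

PROTEST_ROOT_CODE = "14"  # assumed CAMEO root code for protest events


def count_protests(events, cell_size=0.5):
    """Count protest events per (grid cell, day).

    A grid cell is identified by the integer indices obtained from
    dividing latitude and longitude by cell_size and rounding down."""
    counts = Counter()
    for event in events:
        if event["EventRootCode"] != PROTEST_ROOT_CODE:
            continue
        lat, lon = event["ActionGeo_Lat"], event["ActionGeo_Long"]
        if lat is None or lon is None:
            continue  # skip records without coordinates
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[(cell, event["SQLDATE"])] += 1
    return counts


# Invented records for illustration only.
events = [
    {"EventRootCode": "14", "ActionGeo_Lat": 9.05, "ActionGeo_Long": 7.49, "SQLDATE": "20110125"},
    {"EventRootCode": "14", "ActionGeo_Lat": 9.20, "ActionGeo_Long": 7.30, "SQLDATE": "20110125"},
    {"EventRootCode": "04", "ActionGeo_Lat": 9.05, "ActionGeo_Long": 7.49, "SQLDATE": "20110125"},
    {"EventRootCode": "14", "ActionGeo_Lat": None, "ActionGeo_Long": None, "SQLDATE": "20110125"},
    {"EventRootCode": "14", "ActionGeo_Lat": 9.05, "ActionGeo_Long": 7.49, "SQLDATE": "20110126"},
]

protest_counts = count_protests(events)
```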