DATA SOURCES FOR CROSS-COUNTRY STUDIES

This section provides an overview of data available for multivariate analysis of within-country inequality in an international comparison.

19.3.1 Different Strategies for Multicountry Studies

At the outset, researchers need to decide, although they seldom do so explicitly, on the precise coverage of the country sample to be analyzed.

While this choice may be constrained (but should not be motivated) by the availability of data, two different strategies exist when using multicountry samples to explain variations in inequality. First, the sample may be formed by a set of countries sharing similar systemic characteristics (e.g., the OECD area), a strategy called “most similar design” by Przeworski and Teune (1970). Conversely, the aim can be to test a hypothesis such as the Kuznets-type relationship between development and inequality on a set of countries with a maximum of differing systemic characteristics, a “most dissimilar design” strategy.

While many earlier studies of global causes of inequality aimed to include as many countries as possible in the analyses, they still had an overrepresentation of developed countries in the sample. Coverage of African countries in particular was very low. In a typical study with “universal” coverage of inequality observations in the 1990s and early 2000s, OECD countries represented half to two-thirds of the whole data set. This has changed in more recent studies, but the OECD area still typically makes up a third of all country observations. While this choice is dictated by data availability, the precision and generalizability of the interpretation of empirical results suffer.^[257] Depending on the nature of a research question and following a thorough examination of underlying data and their quality, a reduced sample of countries may be a preferred option, or, as Atkinson and Brandolini (2006a) propose, “A deeper understanding of national sources ... may lead us to analyze a carefully matched subset of countries, rather than to seek to maximize their number.”

In that sense, even the focus on an apparently more homogeneous country panel such as the group of OECD or EU countries may involve interpretational problems, especially if new member countries are included in the analysis.

The results from empirical analyses of the importance of sector dualism and sector bias between agriculture and industry as a driver of inequality (see Section 19.3.2) very much depend on how the OECD area and the EU area are defined. Empirical findings may be blurred if these definitions include not only the “traditional” OECD member countries or the “old” EU member states but also newer member countries such as Poland or Mexico, where the share of agricultural employment is still important (more than half the OECD average) and the dualism model may have some salience. In what follows we go through the “menu” of the available data sets for inequality research.

19.3.2 Data Sources: The Fast Development of Data Availability in the Last Decades

In the concluding remarks of his seminal article on economic growth and income inequality, Simon Kuznets (1955) acknowledged that his “paper is perhaps 5 per cent empirical information and 95 per cent speculation” (p. 26). Until the early 1990s, internationally comparable income inequality data were still scarce. During the past two to three decades, however, a substantial number of household surveys became available, and much progress in distributional data collection and standardization has been made in OECD countries. The situation is still far from ideal, but today’s research and results may perhaps mirror 50% empirical information and 50% speculation.

This section is about the former 50%. It describes the main sources of data on income inequality and other key variables used in cross-country studies of the drivers of inequality. It reviews international data sets of income inequality: ex ante standardized data, ex post standardized data, data standardized on best national sources, and secondary data sets. The review focuses on data sets that include at least most of the group of OECD countries. It will also become clear how some of these new data sources open prospects for new types of research questions and application of new types of analytic methods (notably the use of longitudinal panel data).

19.3.2.1 Standardized Microdata

Despite continuing progress, the availability of comparable primary data sets for inequality research is still limited. The major initial and pioneering effort was launched 30 years ago by the data collection of the Luxembourg Income Study (LIS). In the mid-2000s, the Statistical Office of the European Union (EUROSTAT) launched a harmonized household survey on income and living conditions (EU-SILC), which is available for the 28 member countries and some additional European countries. Equally, since the mid-2000s, the OECD has made available a detailed set of standardized household income and poverty indicators for its 34 member countries.

19.3.2.1.1 Luxembourg Income Study

The LIS, formerly known as the Luxembourg Income Study, is a data archive and research centre dedicated to cross-national analysis (http://www.lisdatacenter.org/). The project collects income microdata from household surveys and standardizes those into a common framework of income, demographic and employment variables. The standardization is undertaken ex post. The key concept is that of disposable income, and detailed income aggregates are available. When the project started, it included data from seven countries. Today, LIS stores microdata for over 40 countries, for 8 points in time, starting with a year around 1980, in approximately 5-year intervals. Access to the LIS microdata is granted to researchers of financially contributing countries and institutions and students worldwide upon registration. Use of the microdata is permitted for scholarly, research or educational purposes but not for commercial purposes.

One of the key assets of the LIS database is that it gives researchers access to the microdata via a remote access system. The scrutiny of the ex post standardization also allows a high degree of comparability of the micro variables. One main disadvantage is the somewhat limited geographical and time coverage, although the recent inclusion of a number of middle-income and emerging countries as well as a more frequent update (3-year rather than 5-year intervals) will allow more extensive panel data analysis (http://www.lisdatacenter.org/our-data/lis-database/documentation/list-of-datasets/).

19.3.2.1.2 EU Statistics on Income and Living Conditions

The EU-SILC is an annual survey that collects microdata on income, poverty, social exclusion and living conditions in the 28 EU member countries and 4 non-EU countries. It has been implemented since 2004 for 15 countries and since 2007 for 32 countries (http://epp.eurostat.ec.europa.eu/portal/page/portal/microdata/eu_silc). The EU-SILC surveys are “output” rather than “input” standardized. This implies that the data are not collected with a single survey across all countries; rather, countries are provided a list of variables that they can collect using national surveys and definitions, and the necessary standardization is made on this basis by EUROSTAT. EU-SILC includes longitudinal information insofar as the surveys are based on a rotational panel (usually with a duration of 4 years). In contrast to most other longitudinal surveys, cross-sectional and longitudinal data are released separately in the EU-SILC.

Access to the anonymized EU-SILC microdata (the so-called user database) is not granted to individuals but only to research institutions (or similar entities) inside the EU and European economic area countries by means of research contracts. For other kinds of organizations inside the EU and organizations outside the EU, approval for access needs to be requested from the European Statistical System Committee, which takes about 6 months. A detailed set of indicators on incomes and other living conditions from these data is available from the EUROSTAT databank (http://epp.eurostat.ec.europa.eu/portal/page/portal/income_social_inclusion_living_conditions/data/database).

The main assets of the EU-SILC are the high degree of standardization, especially with regard to income concepts; the availability of annual data; and the availability of a longitudinal part of the data. One disadvantage for researchers today is simply the fact that the project is still relatively young: microdata are generally available for less than 10 years, thus preventing the analysis of long-term series.^[258] There are also a few remaining problems that have to do with the loss of some information when the wealth of original microdata is transformed into a more restricted final data set for which the underlying methodology of such transformations as well as treatment of data at the national level (e.g., imputation procedures) are not always exhaustively documented.

That said, it has been suggested that most of the latter set of problems can be easily overcome with a greater consistency and clarity in documentation in the years to come (Iacovou et al., 2012).

19.3.2.1.3 OECD Data (Income Distribution Database)

The OECD income distribution database (IDD) builds on regular data collection undertaken by the OECD through a network of national consultants. They provide standard tabulations, based on comparable definitions and methodological approaches, from the national microdata considered the “most appropriate” data source in each country. This is done via a detailed data questionnaire consisting of tabulations on income distribution and poverty indicators, together with standardized terms of reference. The main concept of the data collection is that of equivalized household disposable income, including wages and salaries, self-employment incomes, realized property incomes and cash transfers from the general government less taxes and social security contributions paid by households. The definitions used in calculating these income components are based on the recommendations for household income statistics adopted by the Canberra Group (see http://www.unece.org/stats/groups/cgh.html).

A detailed set of variables for the 34 OECD member countries is available from the OECD “data cube” (http://www.oecd.org/social/income-distribution-database.htm). It includes several summary inequality and poverty measures (on a before and after tax/transfer basis) as well as data on income levels and population breakdowns. Data are available in approximately 5-year intervals back to the mid-1990s and, for a subset of countries, to the mid-1980s and mid-1970s. From the mid-2000s, data are available on a more frequent basis, depending on the underlying surveys but, in general, annually (for 28 of the 34 countries). Access to these data is free.

The method of data collection used by the OECD IDD allows coverage of the entire region of OECD countries with harmonized data that facilitate cross-country comparison, based on information that is both more up to date relative to that available through other statistical sources and better suited for assessing changes in income distribution over time.

However, data are available only on an “equivalized” household basis, which renders comparison with indicators on a “per capita” basis (used in many of the more global data sets) very difficult. The main disadvantage of the OECD database is that it does not allow access to the original microdata, which constrains the analyses that can be performed. In that sense, the OECD income distribution database constitutes its own category between primary and secondary data sets.
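The gap between the two bases can be illustrated with a minimal sketch. It assumes a square-root equivalence scale (household income divided by the square root of household size), which is a common choice but is not specified in the text above; the function names and the figures are hypothetical.

```python
def per_capita_income(household_income: float, household_size: int) -> float:
    """Household income divided equally among all members."""
    return household_income / household_size


def equivalized_income(household_income: float, household_size: int,
                       elasticity: float = 0.5) -> float:
    """Household income divided by household size raised to an elasticity.

    elasticity=0.5 is the square-root scale (an assumption made here);
    elasticity=1.0 reproduces the per capita measure.
    """
    return household_income / household_size ** elasticity


# Two hypothetical households with the same total disposable income.
households = [(30000.0, 1), (30000.0, 4)]
for income, size in households:
    print(size, per_capita_income(income, size), equivalized_income(income, size))
```

For the four-person household the per capita figure is half the equivalized figure, so indicators computed on the two bases rank households differently and are not directly comparable.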

19.3.2.2 Secondary Datasets

In contrast to the data sets described above, secondary datasets are based on a collection of published or otherwise available summary key inequality indicators. These usually include the Gini coefficient, quintile share ratios and/or percentile ratios and, more rarely, other summary measures such as the Theil index. Often, alternative series for the same country and year point are proposed alongside recommendations of “preferred” series, along the lines of, for instance, the A-B-C typology used by Atkinson (2008).^[259] Typically, such data sets aim to collect indicators for the greatest number of countries. The trade-off is that there is necessarily less room available for verification of data quality and consistency, which leads to issues of data comparability between and within countries.
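As a rough illustration of two of these summary indicators, the following sketch computes a quintile share ratio (S80/S20) and a P90/P10 percentile ratio from a list of incomes. The function names and the sample incomes are hypothetical, and percentile conventions differ across data providers; here Python's `statistics.quantiles` is used with its default method.

```python
import statistics


def quintile_share_ratio(incomes):
    """Income received by the richest 20% divided by that of the poorest 20%.

    Assumes the number of observations is a multiple of 5 (no interpolation).
    """
    ordered = sorted(incomes)
    k = len(ordered) // 5
    return sum(ordered[-k:]) / sum(ordered[:k])


def percentile_ratio_p90_p10(incomes):
    """Ratio of the 9th decile cut point to the 1st decile cut point."""
    deciles = statistics.quantiles(incomes, n=10)  # returns 9 cut points
    return deciles[8] / deciles[0]


# Hypothetical incomes: 1, 2, ..., 100.
sample_incomes = list(range(1, 101))
s80_s20 = quintile_share_ratio(sample_incomes)
p90_p10 = percentile_ratio_p90_p10(sample_incomes)
print(round(s80_s20, 2), round(p90_p10, 2))
```

Because such definitional details vary between sources, the same underlying distribution can yield slightly different published ratios.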

19.3.2.2.1 The Deininger-Squire Data Set (Measuring Income Inequality Database)

Klaus Deininger and Lyn Squire brought together a large set of worldwide inequality indicators in 1996. Their data set (DS) compiled Gini coefficients and cumulative quintile shares for 138 developed and developing countries, adding summary information on the nature of the data (population coverage, income or consumption base, net or gross income base). Most of the data cover the period between the 1960s and early 1990s.

In contrast to earlier data compilations,^[260] the DS data set imposed “minimum standards for quality,” namely that indicators are based on household surveys, on comprehensive coverage of the population, and on comprehensive coverage of income sources (Deininger and Squire, 1996, p. 567). On this basis, among the entire data set of 2630 observations, Deininger and Squire identify a subset of “high-quality” observations, with 693 observations for 115 countries. Those observations labelled “accept” in the DS data set nonetheless include indicators based on different definitions and methodologies, which impedes the comparability of these data.^[261]

The DS data set is freely available at the World Bank’s website (http://go.worldbank.org/UVPO9KSJJ0). It became a major data source for international inequality research during the early 2000s, including many of the cross-country panel studies reviewed in this chapter. While there were further developments on the basis of the DS data sets in the frame of follow-up projects (see 19.3.2.2.2), the above-mentioned version has not been updated or revised for corrections.

19.3.2.2.2 UNU-WIDER Database

In the vein of the DS data set and partly based on it, the United Nations University World Institute for Development Economics Research (WIDER) World Income Inequality Database (WIID) collects a secondary inequality data set for developed, developing and transition countries. The project started in the late 1990s and led to a first release of data for 155 countries (WIID1), extending the time frame to the early 2000s and augmenting the number of distributional indicators: calculated and reported Gini coefficients, decile and quintile shares, as well as survey means and medians, along with the income shares of the richest 5% and the poorest 5%. In addition to income and consumption, the data set also includes indicators for earnings.

A second and substantially revised version of WIID was compiled in the mid-2000s and resulted in the release of WIID2. The currently available version—World Income Inequality Database V2.0c (May 2008)—proposes data series up to 2006 and is described by the authors as a “new” rather than “updated” data set. It adds, where possible, a second Gini coefficient estimate calculated using a method developed by Shorrocks and Wan (2008) to estimate the Gini coefficient from decile data. An update of the database to WIID3.0 is pending at the time of writing.
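To give a feel for how a Gini coefficient can be derived from decile data, here is a minimal sketch that approximates the Lorenz curve by straight lines between the decile points. This is not the Shorrocks and Wan (2008) method, which is not described in the text above; the piecewise-linear approach ignores inequality within each decile and therefore gives a lower bound. The function name and the decile shares are hypothetical.

```python
def gini_from_group_shares(shares):
    """Approximate Gini from income shares of equal-sized population groups.

    `shares` must be ordered from poorest to richest group (e.g., 10 deciles).
    The Lorenz curve is treated as linear within each group, so the result
    is a lower bound on the true Gini coefficient.
    """
    total = sum(shares)
    n = len(shares)
    cumulative = 0.0
    area_under_lorenz = 0.0
    for share in shares:
        previous = cumulative
        cumulative += share / total
        # Trapezoid with width 1/n between consecutive Lorenz points.
        area_under_lorenz += (previous + cumulative) / (2 * n)
    return 1.0 - 2.0 * area_under_lorenz


# Hypothetical decile shares in percent, poorest to richest.
decile_shares = [3.0, 5.0, 6.0, 7.0, 8.0, 9.0, 11.0, 13.0, 16.0, 22.0]
print(round(gini_from_group_shares(decile_shares), 3))
```

A comparison of such a grouped-data estimate with the reported Gini for the same observation is one simple consistency check on a secondary series.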

Similar to the DS data set, WIID defines three quality criteria—(i) whether the underlying concepts are known, (ii) coverage of concepts and (iii) survey quality—but provides a more detailed quality ranking from 1 (underlying concepts are known and the quality of the income concept and survey can be judged as sufficient) to 4 (unreliable).^[262]

The WIID dataset is freely available at the UNU/WIDER website (http://www.wider.unu.edu/research/Database/). It has been increasingly used in international inequality research and, with the merge with the former DS data set, constitutes the most widely known secondary inequality data set. One in four of the articles reviewed in Annex Table A19.1 makes use of this data set.

19.3.2.2.3 All the Ginis Data Set

The All the Ginis (ATG) data set has been put together by Branko Milanovic from the World Bank since 2004. It includes combined and harmonized Gini coefficients (but no further inequality indicators) from seven original sources: the LIS, the Socio-Economic Database for Latin America, the EU-SILC, the World Bank Europe and Central Asia data set, the World Income Distribution (WYD), World Bank PovCal, and the WIDER.

The most recent version of the ATG data set was released in 2013 and includes close to 4000 Gini observations for 164 countries for the period from 1950 through 2012. Almost 2000 of these observations have been considered “consistent.” Rather than classifying observations as “accept” (DS) or “reliable” (WIID), this “consistent” classification is based on an approach described as “choice by precedence.” This approach takes the Gini values in overlapping cases in order of preference of the seven data sources, namely in the order as they are listed above.^[263] The ATG data set presents the Gini values along with key dummy variables defining the type of welfare aggregate (income or expenditure, net or gross) and recipient unit (household or individual). Another specific feature of the ATG data set is that it includes a variable that allows the survey year to be distinguished from the income year.
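The “choice by precedence” idea can be sketched as follows. This is a simplified reading of the rule as described here, not the actual ATG construction code: when several sources report a Gini for the same country and year, the value from the source listed first is kept. The short source labels, the record layout and the Gini values are all hypothetical.

```python
# Hypothetical short labels for the seven sources, in the order listed in the text.
SOURCE_PRECEDENCE = ["LIS", "SEDLAC", "EU-SILC", "WB-ECA", "WYD", "PovCal", "WIDER"]


def choose_by_precedence(observations):
    """Keep one observation per (country, year): the one from the preferred source."""
    rank = {source: position for position, source in enumerate(SOURCE_PRECEDENCE)}
    chosen = {}
    for obs in observations:
        key = (obs["country"], obs["year"])
        current = chosen.get(key)
        # A lower rank means a more preferred source.
        if current is None or rank[obs["source"]] < rank[current["source"]]:
            chosen[key] = obs
    return chosen


# Illustrative (made-up) overlapping observations.
observations = [
    {"country": "A", "year": 2000, "source": "WIDER", "gini": 33.0},
    {"country": "A", "year": 2000, "source": "LIS", "gini": 31.5},
    {"country": "B", "year": 2000, "source": "PovCal", "gini": 45.2},
]
selected = choose_by_precedence(observations)
for (country, year), obs in sorted(selected.items()):
    print(country, year, obs["source"], obs["gini"])
```

Under such a rule, a country series can switch source from one year to the next, which is one reason the accompanying dummy variables on welfare aggregate and recipient unit matter.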

The ATG data set is freely available in the form of a Stata file at http://econ.worldbank.org/projects/inequality.

19.3.2.2.4 WYD Data Set (World Bank)

The WYD database was created as part of the World Bank’s work on global income distribution. The objective of this work is to gather and analyze detailed household survey data for as many countries as possible for several benchmark years to calculate estimates of global inequality. Currently, data exist for five benchmark years (1988, 1993, 1998, 2002 and 2005). The objective of the WYD database was to create “rich” (numerous in terms of countries) and “dense” (ventiles or percentiles for each country’s distribution) coverage for the benchmark years, not to maximize the number of Gini observations or provide longer-term series for individual countries. The WYD series are integrated into the ATG data set described earlier.

The WYD data are freely available in the form of a Stata file at http://go.worldbank.org/IVEJIU0FJ0.

19.3.2.2.5 The PovCal Database (World Bank)

The PovCal database covers the period since 1978 and includes 124 low-income, lower-middle-income and upper-middle-income countries, thus excluding higher-income OECD countries. In general, PovCal shares the same underlying survey data sources as WYD. There are over 800 Gini observations, most of which are calculated from direct access to household surveys. The PovCalNet tool is available at http://iresearch.worldbank.org/PovcalNet/index.htm.

19.3.2.2.6 World Development Indicators (World Bank)

The World Development Indicators (WDI) is the primary World Bank collection of development indicators compiled from officially recognized international sources. These also include the Gini index. However, data on OECD countries are scarce, with many countries missing data in all years. A priori, WDI Ginis also should come from the same underlying microdata used by WYD and PovCal. The data are available at http://databank.worldbank.org/data/home.aspx.

19.3.2.2.7 Sociometro-BID (Inter-American Development Bank)

Sociometro-BID is a diverse data set of social indicators derived from national household survey data, covering 21 Latin American and Caribbean countries from 1990 to 2009. While the Sociometro includes traditional global indicators, including those for the Millennium Development Goals, the database also includes information on Gini coefficients for per capita household income. The data are freely available at http://www.iadb.org/research/sociometroBID/tables.cfm?indicator=4&lang=en.

19.3.2.2.8 TRANS-MONEE Database (UNICEF)

The TransMonEE (Transformative Monitoring for Enhanced Equity) database collects a vast range of data relevant to social and economic issues in 28 countries of Central Eastern Europe and the Commonwealth of Independent States. The database was initiated by the UNICEF Innocenti Research Centre in 1992 and is updated annually. The 2012 version of the database contains 180 economic and social indicators divided into 10 topics (population, natality, child and maternal mortality, life expectancy and adult mortality, family formation, health, education, child protection, crime and juvenile justice, economy).

It includes data on Gini coefficients, covering the period 1989-2009. In general, these data are based on interpolated distributions from grouped data from household budget surveys. The data are freely available at http://www.transmonee.org/.

19.3.2.2.9 International Labor Organization Database

Since 2012, the International Labor Organization (ILO) database provides recent data for over 100 indicators and 230 countries. It includes a series of D9/D1 and D9/D5 percentile ratios for earnings for employees (although the precise definition and concept are not clear from the description). The data are freely available from http://www.ilo.org/ilostat/faces/home/statisticaldata. The former ILO database LABORSTA (http://laborsta.ilo.org/) included both decile values and Gini coefficients for selected years up to the early 2000s.

19.3.2.2.10 The GINI Inequality and Poverty Dataset

The GINI Inequality and Poverty Dataset is a very recent outcome of the “Growing Inequalities’ Impacts” (GINI) project completed within the 7th Framework program of the European Commission between 2009 and 2013. The project produced in-depth case studies for the 30 participant countries, which include 25 of the 28 EU countries together with 5 non-European countries: Australia, Canada, Japan, Korea and the United States. The country case studies followed a predetermined template specifying the most important variables to be monitored over a 30-year time span (from 1980 to 2010). The variables related to inequality cover Gini coefficients and relative income poverty. For both Gini coefficients and poverty, the preferred income concept is net/disposable equivalized household income. The income sharing unit is the household, whereas the unit of analysis for the computation of various indexes is the individual member of the household. In each case the figures refer to national coverage and thresholds rather than, for example, regions or specific social groups. For most of the countries and for most of the data points these requirements are met (for further details see Salverda et al., 2014 and Toth, 2014).

19.3.2.2.11 Chartbook of Economic Inequality Data

Atkinson and Morelli (2014) created a chartbook of economic inequality that includes indicators beyond income inequality measures for 25 countries (of which 17 are OECD countries) and covers series for up to 100 years until the present. These refer to earnings inequality (usually D9/D5 ratios for OECD countries) and overall inequality (usually Gini coefficients of household income) as well as poverty, pretax top income shares and wealth. These series are based on “preferred” definitions, which are documented for each country included in the data. The focus of this data collection is on over-time comparability rather than between-country comparability. The underlying data are freely available at www.chartbookofeconomicinequality.com.

19.3.2.2.12 World Top Incomes Database

Long-run data series on pretax top incomes ranging back 80 years or more have been collected and prepared by Facundo Alvaredo, Tony Atkinson, Thomas Piketty, Emmanuel Saez and various collaborators and are available online (http://topincomes.parisschoolofeconomics.eu/). The database includes information on top income levels and top income shares (such as the top 1%, top 0.1% or top 0.01%) for 27 countries, of which 18 are OECD countries.

Two main limitations of these data sets are that they cannot be used to describe the whole distribution (and hence do not include summary inequality measures) and that data refer to pretax incomes. Further limitations of tax data for inequality analysis are that tax-exempt income is typically not reported and consequently is left out of the indicators; cross-country differences (and changes over time) in the concept of income that is measured; the extent of tax planning and tax evasion; and the definition of the tax unit. For a summary of the main results from analyses of these data and a discussion of the underlying data, see, for instance, Atkinson et al. (2011).

19.3.2.3 Secondary Synthetic Data Compilations

Synthetic data compilations are based on regression-based procedures to estimate time series from existing inequality data sets.

19.3.2.3.1 University of Texas Inequality Project

The University of Texas Inequality Project data set, which is associated with the work of James Galbraith, is based on a project concerned with measuring and explaining movements of inequality in wages and earnings and patterns of industrial change around the world. It uses microdata based on industrial statistics from the United Nations Industrial Development Organization. The project establishes a relationship between these measures and the broader concepts of inequality, such as income inequality, which is considered reasonably reliable. The project uses Theil’s T statistic to compute inequality indexes from industrial, regional and sectoral data. It produces data sets on pay inequality at the global level; at the national level, including data for Argentina, Brazil, Cuba, China, India, and Russia; and at the regional level for Europe. Data on pay inequality were used as an instrument to estimate measures of household income inequality for a large panel of countries from 1963 through 2008. This global data set has around 4000 country-year observations. All data sets are available at http://utip.gov.utexas.edu/data.html.
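The kind of calculation involved can be sketched with the standard between-group component of Theil's T, computed from group-level employment and average pay. This is a textbook formula offered as an illustration, not the project's documented procedure; the function name and the sector figures are hypothetical.

```python
import math


def between_group_theil_t(groups):
    """Between-group Theil's T from (employment, average_pay) pairs.

    T = sum over groups of (n_g / N) * (y_g / y_bar) * ln(y_g / y_bar),
    where y_bar is the employment-weighted mean pay across all groups.
    """
    total_employment = sum(n for n, _ in groups)
    mean_pay = sum(n * pay for n, pay in groups) / total_employment
    theil = 0.0
    for n, pay in groups:
        if n > 0 and pay > 0:
            relative_pay = pay / mean_pay
            theil += (n / total_employment) * relative_pay * math.log(relative_pay)
    return theil


# Hypothetical sectors: (number of employees, average pay).
sectors = [(50, 10.0), (50, 30.0)]
print(round(between_group_theil_t(sectors), 4))
```

Because only group totals are needed, a measure of this type can be computed from published industrial statistics without access to individual records.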

19.3.2.3.2 SWIID Database

The SWIID database standardizes the WIDER data (described earlier) and other inequality data while minimizing reliance on problematic assumptions by using as much information as possible from proximate years within the same country.^[264] It uses the data collected by the LIS as the benchmark standard. The SWIID currently incorporates Gini indexes of gross and net income inequality for 173 countries for as many years as possible from 1960 to the present, as well as estimated standard errors for these statistics. The SWIID data and the procedure used to generate them are available at http://myweb.uiowa.edu/fsolt/swiid/swiid.html and are described by Solt (2009).

There are other, more one-off exercises to build synthetic cross-country data compilations from existing inequality data sets, such as the Standardized Income Distribution Database. This database was created by Babones and Alvarez-Rivadulla (2007) on the basis of the UNU-WIDER dataset (WIID) but is not available online. It can be requested from the authors.

19.3.3 Concluding Remarks

There is no single “ideal” data set for international research on the multiple causes of inequality, despite the rapid development of international data sets of primary and secondary inequality data over the past 20 years. Opting for one or the other of the above-described data sets depends on the nature of the research question as well as on the target group of countries that are to be compared. If a study is confined to the group of EU and/or OECD countries, one of the primary data sets may be the first choice because of their higher degree of standardization. For more global country coverage, secondary data sets provide a necessary starting point but great care needs to be taken, and not all series can be integrated in econometric analysis. In particular, compared with primary data sets, generally fewer resources can be devoted by the suppliers of these data sets to ensure data quality and consistency.

Many of the criticisms regarding quality and consistency in secondary income distribution data put forward by Atkinson and Brandolini (2001) and later by Francois and Rojas-Romagosa (2005) are still valid. More generally, it also has been argued that survey estimates that form the basis for both primary and secondary data sets often only partially portray the income distribution (Pyatt, 2003). In addition, the fact that secondary data sets include indicators based on different concepts and definitions is often tackled by applying “dummy variable” adjustments. Atkinson and Brandolini (2001) conclude that such adjustments are not satisfactory because “differences in methodology may affect not only the level but also the trend of variables (so that it may not be sufficient to apply a fixed-effect correction in panel data estimation)” (p. 295). Rigorous sensitivity analyses are therefore required because data choices can affect both levels and trends in distributional indicators, which in turn can greatly affect the identification and interpretation of causal factors in an internationally comparative context. Primary users of the databases discussed above should not take the series collected at face value; they need to carefully examine the downloaded data. In turn, secondary users of the research based on one or the other of these databases (“meta-users”) need to verify to what extent the researchers validated the data they used.
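The point about “dummy variable” adjustments can be made concrete with a stylized panel specification. The notation is an assumption introduced here for illustration and does not come from the studies cited: G_{it} is the inequality indicator for country i in year t, x_{it} a driver of inequality, \alpha_i a country fixed effect, and D_{it} a dummy equal to 1 when the observation uses a different income definition.

```latex
% Level-only adjustment: the definition shifts the indicator by a constant \gamma
\[
G_{it} = \alpha_i + \beta x_{it} + \gamma D_{it} + \varepsilon_{it}
\]

% If the definition also changes the trend, an interaction with time is needed
\[
G_{it} = \alpha_i + \beta x_{it} + \gamma D_{it} + \delta \,(D_{it} \cdot t) + \varepsilon_{it}
\]
```

If \delta is not zero but the first specification is estimated, the omitted term D_{it} \cdot t ends up in the error and can bias the estimate of \beta whenever x_{it} is itself trending, which is why a level correction alone may be insufficient.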


Source: Atkinson Anthony, Bourguignon François. Handbook of Income Distribution. Volume 2B. North Holland, 2014. 2366 p.
