regions R package

Abstract

An open source R package for validating sub-national statistical typologies, re-coding across standard typologies of sub-national statistics, and making valid aggregate level imputation, re-aggregation, re-weighting and projection down to lower hierarchical levels to create meaningful data panels and time series.

Installation

You can install the development version from GitHub with:

devtools::install_github("rOpenGov/regions")

or the released version from CRAN:

install.packages("devtools")

You can review the complete package documentation on regions.danielantal.eu. If you find any problems with the code, please raise an issue on Github. Pull requests are welcome if you agree with the Contributor Code of Conduct

If you use regions in your work, please cite the package.

Motivation

Working with sub-national statistics has many benefits. In policymaking or in social sciences, it is a common practice to compare national statistics, which can be hugely misleading. The United States of America, the Federal Republic of Germany, Slovakia and Luxembourg are all countries, but they differ vastly in size and social homogeneity. Comparing Slovakia and Luxembourg to the federal states or even regions within Germany, or the states of Germany and the United States can provide more adequate insights. Statistically, the similarity of the aggregation level and high number of observations can allow more precise control of model parameters and errors.

The advantages of switching from a national level of the analysis to a sub-national level comes with a huge price in data processing, validation and imputation. The package Regions aims to help this process.

This package is an offspring of the eurostat package on rOpenGov. It started as a tool to validate and re-code regional Eurostat statistics, but it aims to be a general solution for all sub-national statistics. It will be developed parallel with other rOpenGov packages.

Sub-national Statistics Have Many Challenges

  • Frequent boundary changes: as opposed to national boundaries, the territorial units, typologies are often change, and this makes the validation and recoding of observation necessary across time. For example, in the European Union, sub-national typologies change about every three years and you have to make sure that you compare the right French region in time, or, if you can make the time-wise comparison at all.

  • Hierarchical aggregation and special imputation: missingness is very frequent in sub-national statistics, because they are created with a serious time-lag compared to national ones, and because they are often not back-casted after boundary changes. You cannot use standard imputation algorithms because the observations are not similarly aggregated or averaged. Often, the information is seemingly missing, and it is present with an obsolete typology code.

Package functionality

  • Generic vocabulary translation and joining functions for geographically coded data
  • Keeping track of the boundary changes within the European Union between 1999-2021
  • Vocabulary translation and joining functions for standardized European Union statistics
  • Vocabulary translation for the ISO-3166-2 based Google data and the European Union
  • Imputation functions from higher aggregation hierarchy levels to lower ones, for example from NUTS1 to NUTS2 or from ISO-3166-1 to ISO-3166-2 (impute down)
  • Imputation functions from lower hierarchy levels to higher ones (impute up)
  • Aggregation function from lower hierarchy levels to higher ones, for example from NUTS3 to NUTS1 or from ISO-3166-2 to ISO-3166-1 (aggregate; under development)
  • Disaggregation functions from higher hierarchy levels to lower ones, again, for example from NUTS1 to NUTS2 or from ISO-3166-1 to ISO-3166-2 (disaggregate; under development)

Vignettes / Articles

Feedback?

Raise and issue on Github or get in touch. Downloaders from CRAN: metacrandownloads

Related