Reproduction of Spielman et al.’s 2020 Evaluation of the Social Vulnerability Index¶
Authors¶
- Liam Smith*, lwsmith@middlebury.edu, @Liam-W-Smith, Middlebury College
- Joseph Holler, josephh@middlebury.edu , @josephholler, ORCID link, Middlebury College
- Katie Heo, jheo@middlebury.edu, @katieheo, Middlebury College
* Corresponding author and creator
Version 1.1 | Created 2023-07-12 | Last Updated 2023-12-16
Abstract¶
This study is a reproduction of:
Spielman, S. E., Tuccillo, J., Folch, D. C., Schweikert, A., Davies, R., Wood, N., & Tate, E. (2020). Evaluating Social Vulnerability Indicators: Criteria and their Application to the Social Vulnerability Index. Natural Hazards, 100(1), 417–436. https://doi.org/10.1007/s11069-019-03820-z
The Spielman et al. (2020) paper is in turn a replication of:
Cutter, S. L., Boruff, B. J., & Shirley, W. L. (2003). Social vulnerability to environmental hazards. Social Science Quarterly, 84(2), 242–261. https://doi.org/10.1111/1540-6237.8402002
Spielman et al. (2020) developed methods to evaluate the internal consistency and construct validity of the Cutter, Boruff and Shirley (2003) Social Vulnerability Index (SoVI). First, they reproduce a national SoVI model and validate it against an SPSS procedure provided by the original research group (Hazards Vulnerability Research Institute at University of South Carolina). The original SoVI uses 42 independent z-score normalized variables from the U.S. Census, reduces the data to factors using Principal Components Analysis, selects the first eleven factors, inverts factors with inverse relationships to social vulnerability, and sums the factors together to produce a SoVI score. The reproduced SoVI model was slightly different than the original model due to changes in U.S. Census data, using only 28 variables.
Spielman et al. modify the geographic extent of the SoVI calculation by calculating SoVI on a national extent, and then recalculating for each of ten Federal Emergency Management Agency (FEMA) regions, and again for a single state or cluster of states within each of the ten regions, resulting in 21 total indices. Internal consistency is assessed by calculating the spearman rank correlation coefficient of the SoVI score for counties in the state model compared to the FEMA region model and national model. Construct validity is assessed by summing the loadings for each input variable across the PCA factors in each model and calculating the variables sign (positive/negative) and the rank of the variable's total loading compared to the other variables. These signs and ranks are summarized across all 21 versions of the SoVI model with regard to the number of times the sign is different from the national model and the distributions of ranks.
In this reproduction study, we attempt to reproduce identical SoVI model outputs for each of the 21 models in the original study. We will compare these outputs to data files in Spielman et al.'s GitHub repository. We will also attempt to reproduce identical results of internal consistency analysis (figure 1 and table 2) and construct validity analysis (figure 2) from Spielman et al.'s paper. We succeed in reproducing identical SoVI model outputs, but find slight discrepancies in our figures and tables.
The code in this Jupyter notebook report is adapted from Spielman et al.'s GitHub repository. The original study states the intended open source permissions in the acknowledgements: "To facilitate advances to current practice and to allow replication of our results, all of the code and data used in this analysis is open source and available at (https://github.com/geoss/sovi-validity). Funding was provided by the US National Science Foundation (Award No. 1333271) and the U.S. Geological Survey Land Change Science Program."
Keywords¶
Social vulnerability, social indicators, Principal Component Analysis, reproducibility
Study design¶
We computationally reproduce Spielman et al.'s original work using the code provided in their Github repository (https://github.com/geoss/sovi-validity), adapting their code to run in an updated Python environment using current package versions. We make all of our work available online using the HEGSRR reproducible research compendium template.
The original paper was a replication study testing the sensitivity of SoVI to changes in geographic extent. Spielman et al. addressed the following hypotheses in their work:
OR-H1: SoVI is internally inconsistent.
To address this hypothesis, Spielman et al. illustrated that SoVI is not robust to changes in geographic extent by calculating SoVI scores for ten selected states or groups of states on three geographic extents: national, FEMA region, and state(s). The counties within the state(s) of interest were then selected and ranked according to their SoVI score. OR-H1 was tested by calculating Spearman's rank correlation between the state and FEMA region models and between the state and national models.
OR-H2: SoVI is theoretically inconsistent.
To address this hypothesis, Spielman et al. used the same SoVI models as described under OR-H1. For each model, they summed all of the PCA factors together to determine the net influence of each variable in each model. Then they recorded the signs of each variable and calculated the number of deviations of the ten state and FEMA region models from the national model. They also ranked the variables by absolute value for each model and calculated summary statistics regarding the distribution of ranks for each variable amongst all models. Spielman et al. did not use a particular statistical method to test OR-H2, but illustrated substantial disagreements between variable rankings and signs amongst the 21 SoVI models.
For our reproduction, we address the following three hypotheses:
RPr-H1: Reproduced SoVI model scores and other reproduced output datasets are not identical to the original study SoVI model scores and provided output datasets for each of the 21 SoVI models.
RPr-H2: Reproduced figures and tables for the internal consistency analysis are not identical to the figures and tables (figure 1 and table 2) of the original study.
RPr-H3: For the theoretical consistency analysis, reproduced direction reversals and min, average, and max SoVI rank value of 28 demographic variables are not identical to the direction reversals and min, average, and max SoVI rank values shown in figure 2 of the original study.
We answer these questions by working through Spielman et al.'s code line by line in an updated python coding environment.
In our reproduction, we will map the national SoVI levels to clearly visualize social vulnerability nationally and hopefully produce a map that will allow for the identification of geographic patterns and facilitate a comprehensive understanding of the overall vulnerability landscape at a national scale. We will create an output table of factor loadings for national SoVI, which may enhance accuracy by quantifying the influence of different variables on the vulnerability index, providing insight into the relative importance of each factor at a national level. We will also reweight the SoVI by percent variance to evaluate the internal consistency of the SoVI model, which we hope adds clarity to the study and helps researchers assess the reliability and theoretical consistency of the vulnerability index on a national scale.
To improve reproducibility, we reorganize Spielman's repository into the Template for Reproducible and Replicable Research in Human-Environment and Geographical Sciences (doi:10.17605/OSF.IO/W29MQ) and use one Jupyter notebook for the reproduction report and code. We catalogue barriers to reproducibility and make improvements wherever possible.
Disclaimer: we worked with the data and code before writing this report, so there is no pre-registration of the analysis plan. We originally intended to publish only a replication of this study; we did not anticipate publishing a reproduction until we spent some time working with the code.
Spatio-temporal metadata¶
Spatial Coverage
: United States, excluding Puerto RicoSpatial Resolution
: Counties and county equivalentsSpatial Reference System
: EPSG:4269Temporal Coverage
: 2008 - 2012 (data is the 2012 5-year ACS)Temporal Resolution
: One-time measurement, does not address change over time
Materials and procedure¶
Computational environment¶
Currently, we are using a 2020 MacBook Pro running on macOS Ventura 13.3.1. We anticipate collaborators working on the project from different computers and different operating systems, and we seek to containerize the project so that scripts can be run on many different machines.
The original study used Python for their analysis, so we reproduce their results in Python, using a containerized conda environment. This environment consists of Python 3.9.16 and the software packages listed in requirements.txt
To set up this environment on another machine, one should install the correct version of Python and run the cell below to install the correct package versions. If a user wishes to create a self-contained environment, they should explore venv, conda, or pipenv virtual environments.
# report python version and install required packages
# switch if statement from True to False once packages have been installed
if True:
!python -V
!pip install -r ../environment/requirements.txt
# Import modules, define directories
import pygris
import pandas as pd
import geopandas as gpd
from pygris.data import get_census
from pygris import counties
from pyhere import here
import numpy as np
import libpysal as lps
import lxml
import tabulate
from scipy.stats import spearmanr
from scipy.stats.mstats import zscore as ZSCORE
from scipy.stats import rankdata
import mdp as MDP
from operator import itemgetter
import copy
from matplotlib.colors import ListedColormap
from matplotlib import patheffects as pe
import matplotlib.pyplot as plt
from IPython import display
from IPython.display import Markdown, Latex
pd.set_option("chained_assignment", None)
path = {
"dscr": here("data", "scratch"),
"drpub": here("data", "raw", "public", "spielman", "input"),
"drpub2": here("data", "raw", "public"),
"drpriv": here("data", "raw", "private"),
"ddpub": here("data", "derived", "public", "version1"),
"ddpriv": here("data", "derived", "private"),
"rfig": here("results", "figures"),
"roth": here("results", "other"),
"rtab": here("results", "tables"),
"og_out": here("data", "raw", "public", "spielman", "output"),
"dmet": here("data", "metadata")
}
# Switch from False to True to regenerate documentation of computational environment
# Note that this approach is not perfect -- it may miss some packages
# This code may work better from the command prompt
if False:
!pip install pigar
!pigar generate -f ../environment/requirements.txt
# Install folium package for national chloropleth map
!pip install folium
import folium
Requirement already satisfied: folium in c:\users\jheo\.conda\envs\spielman\lib\site-packages (0.15.1) Requirement already satisfied: branca>=0.6.0 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from folium) (0.7.0) Requirement already satisfied: jinja2>=2.9 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from folium) (3.1.2) Requirement already satisfied: numpy in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from folium) (1.21.5) Requirement already satisfied: requests in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from folium) (2.31.0) Requirement already satisfied: xyzservices in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from folium) (2023.10.1) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from jinja2>=2.9->folium) (2.1.1) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from requests->folium) (2.0.4) Requirement already satisfied: idna<4,>=2.5 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from requests->folium) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from requests->folium) (1.26.18) Requirement already satisfied: certifi>=2017.4.17 in c:\users\jheo\.conda\envs\spielman\lib\site-packages (from requests->folium) (2023.11.17)
Data and variables¶
For Spielman et al.'s original study, the data sources were the 2008-2012 5-year American Community Survey and the 2010 decennial census. Spielman et al. downloaded their data from Social Explorer; in our reproduction, we pull our data directly from the census into Python via a census API package known as pygris. These variables are based on the original work by Cutter et al. to create SoVI, and cover a wide range of social and demographic information, the particulars of which are described below.
In order to confirm that our data and Spielman et al.'s data perfectly match each other, we import the names of relevant variables from both datasets here.
# Import data dictionary
acs_vars = pd.read_csv( here("data", "metadata", "ACS_2012_data_dictionary.csv") )
acs_vars.drop(columns=acs_vars.columns[0], axis=1, inplace=True)
acs_variables = list(acs_vars['Reproduction Label'][1:])
spielman_acs_variables = list(acs_vars['Spielman Label'][1:])
(1) 2008-2012 American Community Survey (5-year)¶
Used in both original study and reproduction.
Planned deviation: to enhance reproducibility, we draw the data directly from the census into python using the pygris package.
# Switch from False to True to download fresh data from the Census
if False:
# Acquire attribute data for reproduction
counties_detailed = get_census(dataset = "acs/acs5", # dataset name on the Census API you are connecting to; find datasets at https://api.census.gov/data.html
variables = acs_variables, # string (or list of strings) of desired vars. For the 2021 5-year ACS Data Profile, those variable IDs are found at https://api.census.gov/data/2021/acs/acs5/profile/variables.html
year = 2012, # year of your data (or end-year for a 5-year ACS sample)
params = { # dict of query parameters to send to the API.
"for": "county:*"},
guess_dtypes = True,
return_geoid = True)
# Drop Puerto Rico
counties_detailed = counties_detailed.loc[~counties_detailed['GEOID'].str.startswith('72')]
# Download and save raw data
counties_detailed.to_csv( here(path["drpub2"], "counties_attributes_raw.csv"))
else:
counties_detailed = pd.read_csv( here(path["drpub2"], "counties_attributes_raw.csv"), dtype = {'GEOID': object} )
counties_detailed = counties_detailed.drop(counties_detailed.columns[0],axis=1)
Load data from Spielman et al.'s research repository for validation of the reproduction study.
# Import original data from Spielman et al.
# Import base ACS data
make_strings = {'Geo_FIPS': object, 'Geo_STATE': object, 'Geo_COUNTY': object,
'Geo_TRACT': object, 'Geo_CBSA': object, 'Geo_CSA': object}
acs = pd.read_csv(here(path["drpub"], 'sovi_acs.csv'),
dtype=make_strings, skiprows=1,encoding='latin-1')
# Import, join an ACS supplemental
acs_sup2 = pd.read_csv(here(path["drpub"], 'sovi_acs_kids.csv'),
dtype=make_strings, skiprows=1,encoding='latin-1')
acs = acs.merge(acs_sup2, how = "inner", on='Geo_FIPS')
# Drop Puerto Rico
acs = acs[acs.Geo_STATE_x != '72']
Markdown( here(path["dmet"], "ACS_2012_geographic_metadata.md") )
Title
: American Community Survey 2012 5-year Estimate Demographic VariablesAbstract
: The 5-year ACS provides estimates surrounding demographic information in the USA. These estimates are more reliable than 1-year and 3-year estimates but less reliable than decennial census data. On the other hand, 5-year estimates are less current than 1-year and 3-year estimates because they represent measurements taken over 60 months. See the census website for more details.Spatial Coverage
: United States, excluding Puerto RicoSpatial Resolution
: County and county-equivalentsSpatial Reference System
: None, just attribute dataTemporal Coverage
: 2008-2012Temporal Resolution
: Data averaged over five yearsLineage
: Original data downloaded from Social Explorer and then placed in the original study's GitHub repository: https://github.com/geoss/sovi-validity. Reproduction data obtained directly from the census via API.Distribution
: The reproduction data is distributed via a census API. See the detailed tables on the census website and instructions for drawing census data directly into python on the pygris website. Spielman et al. originally accessed the ACS data with Social Explorer from the following two tables.Constraints
: Census data is available in the public domainData Quality
: Margin of error provided by the Census Bureau for relevant variablesVariables
: See ACS_2012_data_dictionary.csv
acs_vars
Reproduction Label | Spielman Label | Alias | Definition | Type | Domain | Missing Data Value(s) | Missing Data Frequency | |
---|---|---|---|---|---|---|---|---|
0 | GEOID | Geo_FIPS | FIPS code unique identifier | Unique code for every county and county-equiva... | string | 01001 - 56045 | None | 0.0 |
1 | B01002_001E | ACS12_5yr_B01002001 | median age | MEDIAN AGE BY SEX: Estimate!!Median age!!Total | float64 | 21.7 - 63 | NaN | 0.0 |
2 | B03002_001E | ACS12_5yr_B03002001 | total population of respondents to race/ethnicity | HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T... | int64 | 66 - 9840024 | NaN | 0.0 |
3 | B03002_004E | ACS12_5yr_B03002004 | total Black population | HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T... | int64 | 0 - 1267825 | NaN | 0.0 |
4 | B03002_005E | ACS12_5yr_B03002005 | total Native American population | HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T... | int64 | 0 - 59060 | NaN | 0.0 |
5 | B03002_006E | ACS12_5yr_B03002006 | total Asian population | HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T... | int64 | 0 - 1343920 | NaN | 0.0 |
6 | B03002_012E | ACS12_5yr_B03002012 | total Latinx population | HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T... | int64 | 0 - 4694846 | NaN | 0.0 |
7 | B06001_002E | ACS12_5yr_B06001002 | total population under 5 years of age | PLACE OF BIRTH BY AGE IN THE UNITED STATES: Es... | float64 | 0 - 651662 | NaN | 78.0 |
8 | B09020_001E | ACS12_5yr_B09020001 | total population over 65 years of age | RELATIONSHIP BY HOUSEHOLD TYPE (INCLUDING LIVI... | int64 | 5 - 1078555 | NaN | 0.0 |
9 | B01003_001E | ACS12_5yr_B01003001 | total population | TOTAL POPULATION: Estimate!!Total | int64 | 66 - 9840024 | NaN | 0.0 |
10 | B25008_001E | ACS12_5yr_B25008001 | total population in occupied housing units | TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ... | int64 | 62 - 9664175 | NaN | 0.0 |
11 | B25002_002E | ACS12_5yr_B25002002 | total occupied housing units | OCCUPANCY STATUS: Estimate!!Total!!Occupied | int64 | 35 - 3218511 | NaN | 0.0 |
12 | B25003_003E | ACS12_5yr_B25003003 | total renter occupied housing units | TENURE: Estimate!!Total!!Renter occupied | int64 | 14 - 1695180 | NaN | 0.0 |
13 | B25002_001E | ACS12_5yr_B25002001 | total housing units for which occupancy status... | OCCUPANCY STATUS: Estimate!!Total | int64 | 70 - 3441416 | NaN | 0.0 |
14 | B09020_021E | ACS12_5yr_B09020021 | total 65+ living in group quarters | RELATIONSHIP BY HOUSEHOLD TYPE (INCLUDING LIVI... | int64 | 0 - 37611 | NaN | 0.0 |
15 | B01001_026E | ACS12_5yr_B01001026 | total female population | SEX BY AGE: Estimate!!Total!!Female | int64 | 20 - 4987765 | NaN | 0.0 |
16 | B11001_006E | ACS12_5yr_B11001006 | total female-headed family households | HOUSEHOLD TYPE (INCLUDING LIVING ALONE): Estim... | int64 | 0 - 498851 | NaN | 0.0 |
17 | B11001_001E | ACS12_5yr_B11001001 | total households for which household type is k... | HOUSEHOLD TYPE (INCLUDING LIVING ALONE): Estim... | int64 | 35 - 3218511 | NaN | 0.0 |
18 | B25002_003E | ACS12_5yr_B25002003 | total vacant housing units | OCCUPANCY STATUS: Estimate!!Total!!Vacant | int64 | 35 - 245069 | NaN | 0.0 |
19 | B19025_001E | ACS12_5yr_B19025001 | aggregate household income | AGGREGATE HOUSEHOLD INCOME IN THE PAST 12 MONT... | int64 | 1785600 - 263044380000 | NaN | 0.0 |
20 | B23022_025E | ACS12_5yr_B23022025 | total males unemployed for last 12 months | SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US... | int64 | 1 - 726803 | NaN | 0.0 |
21 | B23022_049E | ACS12_5yr_B23022049 | total females unemployed for last 12 months | SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US... | int64 | 0 - 1131737 | NaN | 0.0 |
22 | B23022_001E | ACS12_5yr_B23022001 | total population for which unemployment and se... | SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US... | int64 | 45 - 6658456 | NaN | 0.0 |
23 | B17021_002E | ACS12_5yr_B17021002 | total population below poverty level | POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 M... | int64 | 0 - 1658231 | NaN | 0.0 |
24 | B17021_001E | ACS12_5yr_B17021001 | total population for which poverty information... | POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 M... | int64 | 64 - 9684503 | NaN | 0.0 |
25 | B25024_010E | ACS12_5yr_B25024010 | number of mobile home housing units in structure | UNITS IN STRUCTURE: Estimate!!Total!!Mobile home | int64 | 0 - 85377 | NaN | 0.0 |
26 | B25024_001E | ACS12_5yr_B25024001 | total housing units in structure | UNITS IN STRUCTURE: Estimate!!Total | int64 | 70 - 3441416 | NaN | 0.0 |
27 | C24010_038E | ACS12_5yr_C24010038 | total female employed | SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED PO... | int64 | 12 - 2056023 | NaN | 0.0 |
28 | C24010_001E | ACS12_5yr_C24010001 | total population for which sex and occupation ... | SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED PO... | int64 | 54 - 4495118 | NaN | 0.0 |
29 | B19055_002E | ACS12_5yr_B19055002 | total households with social security income | SOCIAL SECURITY INCOME IN THE PAST 12 MONTHS F... | int64 | 9 - 726298 | NaN | 0.0 |
30 | B19055_001E | ACS12_5yr_B19055001 | total households for which social security inc... | SOCIAL SECURITY INCOME IN THE PAST 12 MONTHS F... | int64 | 35 - 3218511 | NaN | 0.0 |
31 | B09002_002E | ACS12_5yr_B09002002 | total children in married couple families | OWN CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND... | int64 | 0 - 1380977 | NaN | 0.0 |
32 | B09002_001E | ACS12_5yr_B09002001 | total children for which family type and age a... | OWN CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND... | int64 | 0 - 2032147 | NaN | 0.0 |
33 | B19001_017E | ACS12_5yr_B19001017 | total households with over 200k income | HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 201... | int64 | 0 - 208954 | NaN | 0.0 |
34 | B06007_005E | ACS12_5yr_B06007005 | total Spanish-speakers who speak english less ... | PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ... | float64 | 0 - 1695391 | NaN | 78.0 |
35 | B06007_008E | ACS12_5yr_B06007008 | total people who speak another language and sp... | PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ... | float64 | 0 - 743418 | NaN | 78.0 |
36 | B06007_001E | ACS12_5yr_B06007001 | total population with known language spoken at... | PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ... | float64 | 66 - 9188362 | NaN | 78.0 |
37 | B16010_002E | ACS12_5yr_B16010002 | total population with less than a high school ... | EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B... | int64 | 5 - 1508273 | NaN | 0.0 |
38 | B16010_001E | ACS12_5yr_B16010001 | total for which education, employment, languag... | EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B... | int64 | 65 - 6380366 | NaN | 0.0 |
39 | C24050_002E | ACS12_5yr_C24050002 | total population in extractive industries | INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY... | int64 | 0 - 54942 | NaN | 0.0 |
40 | C24050_001E | ACS12_5yr_C24050001 | total population for which industry known | INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY... | int64 | 54 - 4495118 | NaN | 0.0 |
41 | C24050_029E | ACS12_5yr_C24050029 | total people in service occupations | INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY... | int64 | 4 - 837607 | NaN | 0.0 |
42 | B08201_002E | ACS12_5yr_B08201002 | total households with no available vehicle | HOUSEHOLD SIZE BY VEHICLES AVAILABLE: Estimate... | int64 | 0 - 577967 | NaN | 0.0 |
43 | B08201_001E | ACS12_5yr_B08201001 | total households for which vehicle status and ... | HOUSEHOLD SIZE BY VEHICLES AVAILABLE: Estimate... | int64 | 35 - 3218511 | NaN | 0.0 |
44 | B25064_001E | ACS12_5yr_B25064001 | median gross rent | MEDIAN GROSS RENT (DOLLARS): Estimate!!Median ... | int64 | 275 - 1678 | NaN | 0.0 |
45 | B25077_001E | ACS12_5yr_B25077001 | median home value | MEDIAN VALUE (DOLLARS): Estimate!!Median value... | float64 | 19400 - 944100 | NaN | 1.0 |
The above are the metadata files that we wrote for our pygris-acquired version of this data, stored as ACS_2012_data_dictionary.csv and ACS_2012_geographic_metadata.md. The metadata files provided by Spielman et al. are also in our repository, named sovi_acs.txt and sovi_acs_kids.txt.
(2) 2010 Decennial Census¶
Used in Spielman et al.'s original study.
# Import decennial supplemental
dec_sup1 = pd.read_csv(here(path["drpub"],'sovi_decennial_sup1.csv'),
dtype=make_strings,skiprows=1,encoding='latin-1')
Markdown( here(path["dmet"], "dec_2010_metadata.md") )
Title
: 2010 Decennial CensusAbstract
: Collected once every ten years, the decennial census documents demographic and population data in the United States.Spatial Coverage
: United States, excluding Puerto RicoSpatial Resolution
: County and county-equivalentsSpatial Reference System
: None, just attribute dataTemporal Coverage
: 2010Temporal Resolution
: One-time observationsLineage
: Original data downloaded from Social Explorer and then placed in the original study's GitHub repository: https://github.com/geoss/sovi-validity.Distribution
: Visit this URL for accessConstraints
: Census data is available in the public domainData Quality
: Margin of error provided by the Census Bureau for relevant variablesVariables
:
Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|---|
SE_T02A_002 | Land area | Area (Land) in square miles | float64 | ... | 1.998779 - 145504.8 | nan | 0 |
Geo_FIPS | FIPS code unique identifier | Unique code for every county and county-equivalent in USA | string | ... | g01001 - g56045 | None | 0 |
Original metadata file provided by Spielman et al. as sovi_decennial_sup1.txt.
(3) USA Counties Shapefile¶
Used in Spielman et al.'s original study
spielman_geom = gpd.read_file( here(path["drpub"], "USA_Counties_500k.shp") )
Markdown( here(path["dmet"], 'USA_counties_metadata.md') )
Title
: USA Counties Geographic ShapefileAbstract
: No metadata provided, so it is unclear exactly where Spielman et al. acquired this file but they likely downloaded it directly or indirectly from the census. The shapefile provides the geometries of counties and county-equivalents in the United States, with limited attribute information including county name and a unique identifier.Spatial Coverage
: United States, excluding Puerto RicoSpatial Resolution
: County and county-equivalentsSpatial Reference System
: EPSG:4269Temporal Coverage
: UnknownTemporal Resolution
: One-time observationsLineage
: UnknownDistribution
: Unknown. Presumably downloaded from census.Constraints
: Census data is available in the public domainData Quality
: 1:500,000 scaleVariables
: For each variable, enter the following information. If you have two or more variables per data source, you may want to present this information in table form (shown below)
Label | Alias | Definition | Type | Accuracy | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|---|
geoFIPS | FIPS code unique identifer | Unique code for every county and county-equivalent in USA | string | ... | g01001 - g56045 | None | 0 |
No original metadata file provided.
(4) USA Counties Cartographic Boundaries¶
Used in reproduction study.
Planned deviation: to enhance reproducibility, we draw the data directly from the census into python using the pygris package.
# Switch False to True if you wish to acquire data directly from census
if False:
# Acquire geographical data for reproduction
counties_shp = counties(cb = True, year = 2010, cache = True) # year 2012 (and 2011) cartographic boundaries not available
# Save raw data
counties_shp.to_file( here(path["drpub2"], "counties_geometries_raw.gpkg") )
else:
counties_shp = gpd.read_file( here(path["drpub2"], "counties_geometries_raw.gpkg") )
Markdown( here(path["dmet"], "county_geom_2010_metadata.md") )
Title
: USA Counties Cartographic BoundariesAbstract
: The cartographic boundary files provided by the US census are simplified representations of the MAF/TIGER files. We use the 2010 boundary file because the census has not made such a file available for 2012 or 2011 and the original paper also used land area from 2010. This shapefile provides the geometries of counties and county-equivalents in the United States, with limited attribute information including land area.Spatial Coverage
: United States, excluding Puerto RicoSpatial Resolution
: County and county-equivalentsSpatial Reference System
: EPSG:4269Temporal Coverage
: 2010Temporal Resolution
: One-time observationsLineage
: We use pygris to pull the data directly from the census into python.Distribution
: This file is distributed via a census API. See more information on the census website and instructions for drawing census data directly into python on the pygris website.Constraints
: Census data is available in the public domainData Quality
: 1:500,000 scaleVariables
:
Label | Alias | Definition | Type | Domain | Missing Data Value(s) | Missing Data Frequency |
---|---|---|---|---|---|---|
STATE | State-level FIPS code | State-level FIPS code | string | 01 - 56 | None | 0 |
COUNTY | County-level FIPS code | County-level FIPS code | string | 001 - 840 | None | 0 |
CENSUSAREA | land area | land area in square miles | float64 | 1.999 - 145504.789 | NaN | 0 |
The metadata file for this data is stored as county_geom_2010_metadata.md.
Data transformations¶
A workflow diagram for this section is displayed below.
We begin with step P1: joining the geometry and attribute data
# Step P1
# Join geometry and attribute data for reproduction
counties_shp['GEOID'] = counties_shp.STATE + counties_shp.COUNTY
counties = counties_shp.merge(counties_detailed, how = "inner", on = "GEOID")
# Also join Spielman's land area information to the rest of Spielman's data
# (to check that all data is accurate, not for purposes of analysis)
acs = acs.merge(dec_sup1, how = "inner", on='Geo_FIPS')
Planned deviation: Because we decided to acquire our data independently from Spielman et al., we need to check that our data is indeed the same as theirs.
To begin, we define a function that can check that the entries of two pandas DataFrames are equal.
# Define a function that can determine whether every entry in specified columns of two tables match
def equiv(table1, sort1, column1, table2, sort2, column2):
'''
Tests two tables to see whether corresponding columns have equivalent entries.
Parameters:
table1 - the first table
sort1 - the column in the first table to join by (str)
column1 - the column(s) in the first table to test the values of (list of str) (should list analogous columns for columns2)
table2 - the second table
sort2 - the column in the second table to join by (str)
column2 - the column(s) in the second table to test the values of (list of str)
'''
# Sort tables
table1 = table1.sort_values(by = sort1).reset_index()
table2 = table2.sort_values(by = sort2).reset_index()
# Rename column name in table2 to match that in table1
for i in range(len(column1)):
table2 = table2.rename(columns={column2[i]: column1[i]})
# Select the columns to test equivalency of
table1 = table1[column1]
table2 = table2[column1]
# Perform equivalency test
test = table1.eq(table2)
return test
Next, we round our area columns to the nearest integer, just for the purposes of comparing the two columns. These columns came from different sources and we know that they do not match up exactly.
# # Round area column
# acs['SE_T02A_002_check'] = acs.SE_T02A_002.round(0)
# counties['CENSUSAREA_check'] = counties.CENSUSAREA.round(0)
# Add the area variables to the lists of variables
acs_variables.append('CENSUSAREA')
spielman_acs_variables.append('SE_T02A_002')
# Perform equivalency test
test = equiv(counties, "GEOID", acs_variables, acs, "Geo_FIPS", spielman_acs_variables)
matching_cols = pd.DataFrame({"test": test.sum().eq(3143)}) # 3143 matches the number of rows
matching_cols.loc[~matching_cols.test] # Identify the columns that have some data discrepencies
test | |
---|---|
B25077_001E | False |
CENSUSAREA | False |
The following variables have some discrepancy between the original and reproduction data:
- B25077_001E
- CENSUSAREA
# Inspect the data values for B25077_001E at the indices with data discrepancies
# Find the rows for which there are data discrepancies
messed_up_indices = test[["B25077_001E"]].loc[~test.B25077_001E]
# Select the data of interest from tygris
tygris_data = counties.sort_values(by = "GEOID")\
.reset_index().loc[messed_up_indices.index]\
[["GEOID", "B25077_001E"]]
# Select the data of interest from Spielman et al.
spielman_data = acs.sort_values(by = "Geo_FIPS")\
.reset_index().loc[messed_up_indices.index]\
[["Geo_FIPS", "ACS12_5yr_B25077001"]]
# Join and inspect
merged = tygris_data.merge(spielman_data, how = "inner", left_on = "GEOID", right_on = "Geo_FIPS")
merged
GEOID | B25077_001E | Geo_FIPS | ACS12_5yr_B25077001 | |
---|---|---|---|---|
0 | 15005 | NaN | 15005 | NaN |
By inspection, we see that the one disagreement between B25077_001E and ACS12_5yr_B25077001 occurs because of a NaN value in an analogous location in each of the two datasets. Rather than revealing an issue in matching our data, this shows us that we will need to impute a missing value for one NaN in B25077_001E -- median home value (see P3).
# Inspect the data values for CENSUSAREA at the indices with data discrepancies
# Find the rows for which there are data discrepancies
messed_up_indices = test[["CENSUSAREA"]].loc[~test.CENSUSAREA]
# Select the data of interest from tygris
tygris_data = counties.sort_values(by = "GEOID")\
.reset_index().loc[messed_up_indices.index]\
[["GEOID", "CENSUSAREA"]]
# Select the data of interest from Spielman et al.
spielman_data = acs.sort_values(by = "Geo_FIPS")\
.reset_index().loc[messed_up_indices.index]\
[["Geo_FIPS", "SE_T02A_002"]]
# Join and inspect
merged = tygris_data.merge(spielman_data, how = "inner", left_on = "GEOID", right_on = "Geo_FIPS")
merged
GEOID | CENSUSAREA | Geo_FIPS | SE_T02A_002 | |
---|---|---|---|---|
0 | 01001 | 594.436 | 01001 | 594.4361 |
1 | 01005 | 884.876 | 01005 | 884.8763 |
2 | 01007 | 622.582 | 01007 | 622.5823 |
3 | 01009 | 644.776 | 01009 | 644.7759 |
4 | 01011 | 622.805 | 01011 | 622.8047 |
... | ... | ... | ... | ... |
2285 | 55137 | 626.153 | 55137 | 626.1533 |
2286 | 55141 | 793.116 | 55141 | 793.1165 |
2287 | 56013 | 9183.814 | 56013 | 9183.8130 |
2288 | 56023 | 4076.129 | 56023 | 4076.1300 |
2289 | 56037 | 10426.649 | 56037 | 10426.6500 |
2290 rows × 4 columns
There are many disagreements between CENSUSAREA and SE_T02A_002, but they appear to be relatively minor differences. Let us evaluate how large those differences are.
merged["Difference"] = abs(merged["CENSUSAREA"] - merged["SE_T02A_002"])
print("Maximum difference:", merged['Difference'].max(), "\nAverage difference:", merged['Difference'].mean())
Maximum difference: 0.010999999998603016 Average difference: 0.00031027379912322995
The largest discrepency between the two different sources of land area data is just over 0.01 square miles and the average difference (amongst those with a difference) is tiny. With such a minor difference between our data and theirs, for our purposes we will consider our data roughly the same as Spielman et al.'s.
# Step P2
# Calculating the variables used in SoVI
counties['MEDAGE_ACS'] = counties.B01002_001E
counties['BLACK_ACS'] = counties.B03002_004E / (counties.B03002_001E)
counties['QNATAM_ACS'] = counties.B03002_005E / (counties.B03002_001E)
counties['QASIAN_ACS'] = counties.B03002_006E / (counties.B03002_001E)
counties['QHISP_ACS'] = counties.B03002_012E / (counties.B03002_001E)
counties['QAGEDEP_ACS'] = (counties.B06001_002E + counties.B09020_001E) / (counties.B01003_001E)
counties['QPUNIT_ACS'] = counties.B25008_001E / (counties.B25002_002E)
counties['PRENTER_ACS'] = counties.B25003_003E / (counties.B25002_001E)
counties['QNRRES_ACS'] = counties.B09020_021E / (counties.B01003_001E)
counties['QFEMALE_ACS'] = counties.B01001_026E / (counties.B01003_001E)
counties['QFHH_ACS'] = counties.B11001_006E / (counties.B11001_001E)
counties['QUNOCCHU_ACS'] = counties.B25002_003E / (counties.B25002_001E)
counties['QCVLUN'] = (counties.B23022_025E + counties.B23022_049E) / \
counties.B23022_001E
counties['QPOVTY'] = (counties.B17021_002E) / counties.B17021_001E
counties['QMOHO'] = (counties.B25024_010E) / counties.B25024_001E
counties['QFEMLBR'] = (counties.C24010_038E) / counties.C24010_001E
counties['QSSBEN'] = (counties.B19055_002E) / counties.B19055_001E
counties['QFAM'] = (counties.B09002_002E) / counties.B09002_001E
counties['QRICH200K'] = (counties.B19001_017E) / counties.B11001_001E
counties['PERCAP_ALT'] = counties.B19025_001E / (counties.B25008_001E)
counties['QESL_ALT'] = (counties.B06007_005E + counties.B06007_008E) / \
counties.B06007_001E
counties['QED12LES_ALT'] = (counties.B16010_002E) / counties.B16010_001E
counties['QEXTRCT_ALT'] = (counties.C24050_002E) / counties.C24050_001E
counties['QSERV_ALT'] = (counties.C24050_029E) / counties.C24050_001E
counties['QNOAUTO_ALT'] = (counties.B08201_002E) / counties.B08201_001E
counties['MDGRENT_ALT'] = counties.B25064_001E
counties['MHSEVAL_ALT'] = counties.B25077_001E
counties['POPDENS'] = counties.B01003_001E / (counties.CENSUSAREA)
As noted before, B25077_001E is missing a data value. We now perform one final check to see if we need to impute anything else.
# Check for missing data
for i in counties.columns:
x = counties[i].isnull().sum()
if x > 0:
print(i, "contains", x, "missing value(s).")
# Check for infinities
counties_num = counties.select_dtypes(include=['int64','float64'])
for i in counties_num.columns:
xmin = counties_num[i].min()
xmax = counties_num[i].max()
if xmin == -np.inf:
print(i, "contains a negative infinity")
elif xmax == np.inf:
print(i, "contains a positive infinity")
LSAD contains 2 missing value(s). B25077_001E contains 1 missing value(s). QFAM contains 2 missing value(s). MHSEVAL_ALT contains 1 missing value(s).
There are four variables with missing data. LSAD is not used in our analysis, so we may ignore this. B25077_001E and MHSEVAL_ALT are literally identical, so we will ignore B25077_001E and simply impute for MHSEVAL_ALT's one missing value. We also need to impute for QFAM's 2 missing values. We use the same imputation decisions that Spielman et al. employ in their analysis.
Unplanned deviation: When imputing for MHSEVAL_ALT's missing data, we removed a fair amount of extraneous code that was filling in missing spatial lag data with original data for the county. This was unnecessary because we only needed to impute data for one county and that county had spatial lag data. Also note that Spielman et al.'s method for imputing data for MHSEVAL_ALT is a deviation from Cutter's original methodology, in which she imputed a 0 for any missing value. While this is a deviation from the original SoVI methodology, 0 is an unrealistic median home value, so Spielman et al.'s method for imputation seems like a reasonable improvement.
# Step P3
# Replace missing QFAM data with 0
counties.QFAM = counties.QFAM.replace([np.inf, -np.inf, np.nan], 0)
# Replace missing MHSEVAL_ALT data with its spatial lag
# Calculate spatial weights matrix
w = lps.weights.Queen.from_dataframe(counties)
w.transform = 'R'
# Calculate spatial lag
counties['MHSEVAL_ALT_LAG'] = lps.weights.lag_spatial(w, counties.MHSEVAL_ALT)
# Impute for the missing value
counties.MHSEVAL_ALT[np.isnan(counties['MHSEVAL_ALT'])] = counties[["MHSEVAL_ALT_LAG"]][pd.isna(counties['MHSEVAL_ALT'])]
('WARNING: ', 68, ' is an island (no neighbors)') ('WARNING: ', 546, ' is an island (no neighbors)') ('WARNING: ', 547, ' is an island (no neighbors)') ('WARNING: ', 549, ' is an island (no neighbors)') ('WARNING: ', 1226, ' is an island (no neighbors)') ('WARNING: ', 1876, ' is an island (no neighbors)') ('WARNING: ', 2976, ' is an island (no neighbors)')
C:\Users\jheo\.conda\envs\spielman\lib\site-packages\libpysal\weights\weights.py:172: UserWarning: The weights matrix is not fully connected: There are 10 disconnected components. There are 7 islands with ids: 68, 546, 547, 549, 1226, 1876, 2976. warnings.warn(message)
The missing data procedure attempts to fill some counties' missing data with the average values of the surrounding counties. This procedure produces a warning about 10 disconnected components and 7 islands in the weights matrix. The cause of this error is areas of the United States that are not contiguous with one another. Fortunately, these areas are not missing any data, and therefore this error does not affect our procedure to fill gaps in missing data.
Before adjusting directionality, let us check that all of our derived variables match all of Spielman et al.'s derived variables.
# Import Spielman et al.'s derived variables
US_All = pd.read_csv(here("data", "raw", "public", "spielman", "output", "sovi_inputs.csv"))
counties.to_csv(here(path["ddpub"],'counties.csv'))
counties.to_file(here(path["ddpub"],'counties.gpkg'))
counties = pd.read_csv(here(path["ddpub"], "counties.csv"), dtype = {'GEOID': object})
# Select only the relevant columns
# Attribute name and expected influence on vulnerability
input_names = [['MEDAGE_ACS', 'pos', 'person', 'Median Age'],
['BLACK_ACS', 'pos', 'person', 'Pop African-American (%)'],
['QNATAM_ACS', 'pos', 'person', 'Pop Native American (%)'],
['QASIAN_ACS', 'pos', 'person', 'Pop Asian (%)'],
['QHISP_ACS', 'pos', 'person', 'Pop Hispanic (%)'],
['QAGEDEP_ACS', 'pos', 'person', 'Age Dependency (%)'],
['QPUNIT_ACS', 'pos', 'person', 'Persons Per Housing Unit'],
['PRENTER_ACS', 'pos', 'hu', 'Rental Housing (%)'],
['QNRRES_ACS', 'pos', 'person', 'Nursing Home Residents (%)'],
['QFEMALE_ACS', 'pos', 'person', 'Pop Female (%)'],
['QFHH_ACS', 'pos', 'hu', 'Female-Headed Households (%)'],
['QUNOCCHU_ACS', 'pos', 'hu', 'Vacant Housing (%)'],
['PERCAP_ALT', 'neg', 'person', 'Per-Capita Income'],
['QESL_ALT', 'pos', 'person', 'English as Second Language (%)'],
['QCVLUN', 'pos', 'person', 'Unemployment (%)'],
['QPOVTY', 'pos', 'person', 'Poverty (%)'],
['QMOHO', 'pos', 'hu', 'Mobile Homes (%)'],
['QED12LES_ALT', 'pos', 'person',
'Adults Completed <Grade 12 (%)'],
['QFEMLBR', 'pos', 'person', 'Female Employment (%)'],
['QEXTRCT_ALT', 'pos', 'person',
'Extractive Sector Employment (%)'],
['QSERV_ALT', 'pos', 'person', 'Service Sector Employment (%)'],
['QSSBEN', 'pos', 'hu', 'Social Security Income (%)'],
['QNOAUTO_ALT', 'pos', 'hu', 'No Automobile (%)'],
['QFAM', 'neg', 'person', 'Children in Married Families (%)'],
['QRICH200K', 'neg', 'hu', 'Annual Income >$200K (%)'],
['MDGRENT_ALT', 'neg', 'hu', 'Median Rent'],
['MHSEVAL_ALT', 'neg', 'hu', 'Median Home Value'],
['POPDENS', 'pos', 'person', 'Population Density']]
# Get attribute names
attr_names1 = [j[0] for j in input_names] + ['GEOID']
attr_names2 = [j[0] for j in input_names] + ['Geo_FIPS']
# Select only the columns needed to compute SoVI
counties = counties[attr_names1]
US_All = US_All[attr_names2]
counties["GEOID"] = "g" + counties["GEOID"]
counties['stateID'] = counties.GEOID.str.slice(0, 3, 1)
attr_names1.remove('GEOID')
counties = counties.set_index(counties["GEOID"]).sort_index()
US_All['stateID'] = US_All.Geo_FIPS.str.slice(0, 3, 1)
attr_names2.remove('Geo_FIPS')
US_All = US_All.set_index(US_All["Geo_FIPS"]).sort_index()
counties.eq(US_All).sum()
BLACK_ACS 3143 GEOID 0 Geo_FIPS 0 MDGRENT_ALT 3143 MEDAGE_ACS 3143 MHSEVAL_ALT 3143 PERCAP_ALT 3143 POPDENS 853 PRENTER_ACS 3143 QAGEDEP_ACS 3143 QASIAN_ACS 3143 QCVLUN 3143 QED12LES_ALT 3143 QESL_ALT 3143 QEXTRCT_ALT 3143 QFAM 3143 QFEMALE_ACS 3143 QFEMLBR 3143 QFHH_ACS 3143 QHISP_ACS 3143 QMOHO 3143 QNATAM_ACS 3143 QNOAUTO_ALT 3143 QNRRES_ACS 3143 QPOVTY 3143 QPUNIT_ACS 3143 QRICH200K 3143 QSERV_ALT 3143 QSSBEN 3143 QUNOCCHU_ACS 3143 stateID 3143 dtype: int64
Therre are 3143 observations in the dataset, so it appears that all variables match up perfectly except POPDENS. POPDENS is the one variable that was derived from the land area, so this was to be expected. Let us confirm that the differences between the two datasets are minor.
diff = (abs(counties[["POPDENS"]] - US_All[["POPDENS"]]))
print("Maximum difference:", diff.max()[0], "\nAverage difference:", diff.mean()[0])
Maximum difference: 0.949787960271351 Average difference: 0.0015009727346509663
With a maximum difference less than 1 and an average difference less than 0.01, our data is sufficiently close to Spielman et al.'s for our purposes.
Now we proceed to step P4, switching the directionality of variables as needed in order to ensure that higher values of a variable are associated with higher levels of vulnerability.
# Step P4
# Flip signs as needed to ensure that each variable contributes as expected to the final Sovi
for name, sign, sample, hrname in input_names:
if sign == 'neg':
counties[name] = -counties[name].values
print("Inverting variable:", name)
elif sign == 'pos':
pass
else:
print("problem in flipping signs")
raise
Inverting variable: PERCAP_ALT Inverting variable: QFAM Inverting variable: QRICH200K Inverting variable: MDGRENT_ALT Inverting variable: MHSEVAL_ALT
Spielman et al. constructed a class to conduct SPSS-style PCA with varimax rotation in python and validated their procedure against Cutter et al.'s SPSS workflow used to calculate SoVI. Below I include a workflow diagram that shows, without too much detail, the main operations and important outputs of their SPSS_PCA class. After that, I have included their relevant code.
class SPSS_PCA:
'''
A class that integrates most (all?) of the assumptions SPSS imbeds in their
implimnetation of principal components analysis (PCA), which can be found in
thier GUI under Analyze > Dimension Reduction > Factor. This class is not
intended to be a full blown recreation of the SPSS Factor Analysis GUI, but
it does replicate (possibly) the most common use cases. Note that this class
will not produce exactly the same results as SPSS, probably due to differences
in how eigenvectors/eigenvalues and/or singular values are computed. However,
this class does seem to get all the signs to match, which is not really necessary
but kinda nice. Most of the approach came from the official SPSS documentation.
References
----------
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf
http://spssx-discussion.1045642.n5.nabble.com/Interpretation-of-PCA-td1074350.html
http://mdp-toolkit.sourceforge.net/api/mdp.nodes.WhiteningNode-class.html
https://github.com/mdp-toolkit/mdp-toolkit/blob/master/mdp/nodes/pca_nodes.py
Parameters
----------
inputs: numpy array
n x k numpy array; n observations and k variables on each observation
reduce: boolean (default=False)
If True, then use eigenvalues to determine which factors to keep; all
results will be based on just these factors. If False use all factors.
min_eig: float (default=1.0)
If reduce=True, then keep all factors with an eigenvalue greater than
min_eig. SPSS default is 1.0. If reduce=False, then min_eig is ignored.
varimax: boolean (default=False)
If True, then apply a varimax rotation to the results. If False, then
return the unrotated results only.
Attributes
----------
z_inputs: numpy array
z-scores of the input array.
comp_mat: numpy array
Component matrix (a.k.a, "loadings").
scores: numpy array
New uncorrelated vectors associated with each observation.
eigenvals_all: numpy array
Eigenvalues associated with each factor.
eigenvals: numpy array
Subset of eigenvalues_all reflecting only those that meet the
criterion defined by parameters reduce and min_eig.
weights: numpy array
Values applied to the input data (after z-scores) to get the PCA
scores. "Component score coefficient matrix" in SPSS or
"projection matrix" in the MDP library.
comms: numpy array
Communalities
sum_sq_load: numpy array
Sum of squared loadings.
comp_mat_rot: numpy array or None
Component matrix after rotation. Ordered from highest to lowest
variance explained based on sum_sq_load_rot. None if varimax=False.
scores_rot: numpy array or None
Uncorrelated vectors associated with each observation, after
rotation. None if varimax=False.
weights_rot: numpy array or None
Rotated values applied to the input data (after z-scores) to get
the PCA scores. None if varimax=False.
sum_sq_load_rot: numpy array or None
Sum of squared loadings for rotated results. None if
varimax=False.
'''
def __init__(self, inputs, reduce=False, min_eig=1.0, varimax=False):
# Step S1
z_inputs = ZSCORE(inputs) # seems necessary for SPSS "correlation matrix" setting (their default)
# The rest is step S2
# run base SPSS-style PCA to get all eigenvalues
pca_node = MDP.nodes.WhiteningNode() # settings for the PCA
scores = pca_node.execute(z_inputs) # base run PCA
eigenvalues_all = pca_node.d # rename PCA results
# run SPSS-style PCA based on user settings
pca_node = MDP.nodes.WhiteningNode(reduce=reduce, var_abs=min_eig) # settings for the PCA
scores = pca_node.execute(z_inputs) # run PCA (these have mean=0, std_dev=1)
weights = pca_node.v # rename PCA results (these might be a transformation of the eigenvectors)
eigenvalues = pca_node.d # rename PCA results
component_matrix = weights * eigenvalues # compute the loadings
component_matrix = self._reflect(component_matrix) # get signs to match SPSS
communalities = (component_matrix**2).sum(1) # compute the communalities
sum_sq_loadings = (component_matrix**2).sum(0) # note that this is the same as eigenvalues
weights_reflected = component_matrix/eigenvalues # get signs to match SPSS
scores_reflected = np.dot(z_inputs, weights_reflected) # note that abs(scores)=abs(scores_reflected)
if varimax:
# SPSS-style varimax rotation prep
c_normalizer = 1. / MDP.numx.sqrt(communalities) # used to normalize inputs to varimax
c_normalizer.shape = (component_matrix.shape[0],1) # reshape to vectorize normalization
cm_normalized = c_normalizer * component_matrix # normalize component matrix for varimax
# varimax rotation
cm_normalized_varimax = self._varimax(cm_normalized) # run varimax
c_normalizer2 = MDP.numx.sqrt(communalities) # used to denormalize varimax output
c_normalizer2.shape = (component_matrix.shape[0],1) # reshape to vectorize denormalization
cm_varimax = c_normalizer2 * cm_normalized_varimax # denormalize varimax output
# reorder varimax component matrix
sorter = (cm_varimax**2).sum(0) # base the ordering on sum of squared loadings
sorter = zip(sorter.tolist(), range(sorter.shape[0])) # add index to denote current order
sorter = sorted(sorter, key=itemgetter(0), reverse=True) # sort from largest to smallest
sum_sq_loadings_varimax, reorderer = zip(*sorter) # unzip the sorted list
sum_sq_loadings_varimax = np.array(sum_sq_loadings_varimax) # convert to array
cm_varimax = cm_varimax[:,reorderer] # reorder component matrix
# varimax scores
cm_varimax_reflected = self._reflect(cm_varimax) # get signs to match SPSS
varimax_weights = np.dot(cm_varimax_reflected,
np.linalg.inv(np.dot(cm_varimax_reflected.T,
cm_varimax_reflected))) # CM(CM'CM)^-1
scores_varimax = np.dot(z_inputs, varimax_weights)
else:
comp_mat_rot = None
scores_rot = None
weights_rot = None
# assign output variables
self.z_inputs = z_inputs
self.scores = scores_reflected
self.comp_mat = component_matrix
self.eigenvals_all = eigenvalues_all
self.eigenvals = eigenvalues
self.weights = weights_reflected
self.comms = communalities
self.sum_sq_load = sum_sq_loadings
self.comp_mat_rot = cm_varimax_reflected
self.scores_rot = scores_varimax # PCA scores output
self.weights_rot = varimax_weights # PCA weights output
self.sum_sq_load_rot = sum_sq_loadings_varimax
def _reflect(self, cm):
# reflect factors with negative sums; SPSS default
cm = copy.deepcopy(cm)
reflector = cm.sum(0)
for column, measure in enumerate(reflector):
if measure < 0:
cm[:,column] = -cm[:,column]
return cm
def _varimax(self, Phi, gamma = 1.0, q = 100, tol = 1e-6):
# downloaded from http://en.wikipedia.org/wiki/Talk%3aVarimax_rotation
# also here http://stackoverflow.com/questions/17628589/perform-varimax-rotation-in-python-using-numpy
p,k = Phi.shape
R = np.eye(k)
d=0
for i in range(q):
d_old = d
Lambda = np.dot(Phi, R)
u,s,vh = np.linalg.svd(np.dot(Phi.T,np.asarray(Lambda)**3 - (gamma/p) *
np.dot(Lambda, np.diag(np.diag(np.dot(Lambda.T,Lambda))))))
R = np.dot(u,vh)
d = np.sum(s)
if d_old!=0 and d/d_old < 1 + tol:
break
return np.dot(Phi, R)
Basic set-up for storing results¶
# Build FEMA subRegions Dict values= state ID's
FEMA_subs = dict()
FEMA_subs['FEMA_1'] = ['g23g33g25', 'g50', 'g09', 'g44']
FEMA_subs['FEMA_2'] = ['g36', 'g34']
FEMA_subs['FEMA_3'] = ['g42', 'g10', 'g11', 'g24', 'g51', 'g54']
FEMA_subs['FEMA_4'] = ['g21', 'g47', 'g37', 'g28', 'g01', 'g13', 'g45', 'g12']
FEMA_subs['FEMA_5'] = ['g27', 'g55', 'g26', 'g17', 'g18', 'g39']
FEMA_subs['FEMA_6'] = ['g35', 'g48', 'g40', 'g05', 'g22']
FEMA_subs['FEMA_7'] = ['g31', 'g19', 'g20', 'g29']
FEMA_subs['FEMA_8'] = ['g30', 'g38', 'g56', 'g46', 'g49', 'g08']
FEMA_subs['FEMA_9'] = ['g06', 'g32', 'g04']
FEMA_subs['FEMA_10'] = ['g53', 'g41', 'g16']
####################################
# DataFrames to hold US, FEMA region, and state level results
####################################
# Dict to hold variable loadings
# key will be [USA, Fema_region, stateid] depending on level of analysis
varContrib = {}
# National Score
US_Sovi_Score = pd.DataFrame(index=counties.GEOID,
columns=['sovi', 'rank'])
# In the FEMA_Region_Sovi_Score data frame ranks are BY FEMA REGION.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# FEMA Region)
FEMA_Region_Sovi_Score = pd.DataFrame(index=counties.GEOID,
columns=['sovi', 'rank', 'fema_region'])
# Create New England conglomerate of states
# These are the FIPS codes for the states with the letter "g" appended
counties.loc[counties.stateID.isin(['g23', 'g33', 'g25']), 'stateID'] = 'g23g33g25'
# These are the states in the state level analysis
stateList = ['g23g33g25', 'g36', 'g51', 'g13',
'g17', 'g48', 'g29', 'g46', 'g06', 'g16']
# In the State_Sovi_Score data frame ranks are BY STATE.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# state in stateList)
State_Sovi_Score = pd.DataFrame(
index=counties.index[counties.stateID.isin(stateList)],
columns=['sovi', 'rank', 'state_id'])
Calculating SoVI¶
At this stage, we seek to calculate the SoVI ranks and variable weightings on the national, FEMA region, and state-level spatial extents. Below is a workflow for calculating SoVI, followed by the code for each spatial extent.
#######################
# Compute National SoVI
#######################
# compute SoVI
# Step M2
inputData = counties.drop(['GEOID', 'stateID'], axis=1, inplace=False)
# Step M3
inputData_array = inputData.values # Convert DataFrame to NumPy array
# Step M4
pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)
# variance explained for each component is the eigenvalue divided by the sum of all eigenvalues
# the eigenvalue is equivalent to the sum of squared loadings
variance_explained = pca.sum_sq_load_rot / pca.eigenvals_all.sum()
print("Percent variance explained by each component")
print(variance_explained)
Percent variance explained by each component [0.17749433 0.1621382 0.12832775 0.09034435 0.06494221 0.05548509 0.05517805 0.05235274]
Factor loadings for national model
loadings = pd.DataFrame(pca.weights_rot)
loadings
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|
0 | 0.028558 | -0.097043 | 0.263585 | -0.021732 | 0.064951 | -0.047413 | 0.041284 | -0.020852 |
1 | 0.231124 | -0.081235 | -0.032835 | -0.128022 | 0.035158 | -0.172086 | -0.031832 | -0.086503 |
2 | -0.120374 | 0.051713 | 0.034822 | -0.061283 | -0.125527 | 0.649159 | 0.025288 | 0.040600 |
3 | 0.059697 | -0.177355 | 0.004209 | 0.044335 | 0.096726 | -0.020888 | -0.089701 | -0.006796 |
4 | -0.059316 | 0.021661 | 0.036007 | 0.430707 | 0.004060 | -0.100344 | 0.094955 | 0.134917 |
5 | -0.032367 | 0.008297 | 0.231777 | 0.132433 | 0.137210 | 0.043627 | 0.235682 | 0.003748 |
6 | -0.025747 | 0.041537 | -0.025242 | 0.124587 | -0.292591 | 0.168631 | 0.109509 | -0.084228 |
7 | 0.040047 | 0.009589 | -0.232451 | 0.042722 | 0.268396 | -0.015020 | -0.015465 | 0.069733 |
8 | -0.028948 | 0.071364 | -0.074561 | 0.021579 | 0.395471 | -0.046200 | -0.074874 | 0.067630 |
9 | -0.029268 | 0.039311 | 0.079810 | 0.115028 | -0.091273 | 0.036534 | 0.704244 | -0.117437 |
10 | 0.144722 | -0.003821 | -0.043494 | 0.005639 | -0.033313 | 0.016059 | 0.107111 | 0.014858 |
11 | 0.007502 | -0.099646 | 0.285825 | 0.023212 | -0.113704 | 0.177186 | -0.127205 | 0.126578 |
12 | 0.033256 | 0.178163 | -0.038022 | 0.052142 | -0.033767 | 0.034974 | 0.028228 | 0.024147 |
13 | -0.018957 | -0.042262 | 0.030951 | 0.386709 | 0.049780 | -0.036214 | 0.067625 | 0.043922 |
14 | 0.197523 | -0.049893 | 0.087432 | -0.044395 | -0.101308 | -0.093135 | -0.127564 | -0.018783 |
15 | 0.123177 | 0.050919 | -0.003049 | 0.039227 | 0.032417 | 0.060821 | 0.015164 | 0.024725 |
16 | 0.158646 | -0.007006 | 0.098538 | -0.064254 | -0.184658 | -0.079373 | -0.104182 | -0.184974 |
17 | 0.156469 | 0.022291 | 0.038671 | 0.155598 | -0.000912 | -0.104740 | 0.011764 | -0.139073 |
18 | -0.004034 | 0.001923 | 0.010762 | -0.069927 | -0.025838 | 0.005249 | 0.257073 | 0.330415 |
19 | -0.027330 | 0.053829 | 0.036552 | 0.118236 | 0.148751 | 0.117389 | -0.091575 | -0.241253 |
20 | -0.049305 | 0.014880 | 0.040729 | 0.115068 | 0.068498 | 0.006772 | -0.155902 | 0.606872 |
21 | 0.077827 | -0.019041 | 0.244719 | 0.030591 | 0.020558 | -0.038686 | 0.090472 | -0.004187 |
22 | 0.143955 | -0.112268 | 0.029233 | -0.053675 | 0.207615 | 0.352145 | -0.013351 | -0.154331 |
23 | 0.147075 | -0.021550 | -0.012901 | -0.026884 | 0.081235 | -0.000337 | 0.010718 | 0.116535 |
24 | -0.031148 | 0.213393 | -0.076973 | -0.012445 | 0.041138 | 0.006493 | -0.028958 | 0.103793 |
25 | -0.012350 | 0.200516 | -0.058312 | -0.016243 | 0.143973 | 0.045744 | -0.001854 | -0.075001 |
26 | -0.019246 | 0.235548 | -0.107857 | 0.000564 | 0.081574 | -0.005632 | 0.016087 | -0.056065 |
27 | 0.162733 | -0.206821 | 0.063250 | 0.000379 | 0.266078 | 0.132037 | 0.005506 | -0.268089 |
# Step M5
# get calculated Principal Component values from PCA
sovi_actual_us = pca.scores_rot
# weight both loadings and component values by variance explained
for i in range(0, len(variance_explained)):
sovi_actual_us[i] = sovi_actual_us[i] * (variance_explained[i])
loadings[i] = loadings[i] * variance_explained[i]
# Step M6
sovi_actual_us = pd.DataFrame(
sovi_actual_us, index=counties.GEOID)
# sum reweighted components
sovi_actual_us['sovi'] = sovi_actual_us.sum(axis=1)
# Step M8
sovi_actual_us['rank'] = sovi_actual_us['sovi'].rank(
method='average', ascending=False)
US_Sovi_Score.update(sovi_actual_us)
# Step M9
attrib_contribution_us = loadings.sum(1)
varContrib['USA'] = zip(attr_names1, attrib_contribution_us.tolist()) # Generate dictionary for all net loadings by variable for US
# Quick check of ranks: max should equal number of counties in US
try:
US_Sovi_Score['rank'].max() == len(counties)
except:
print("error in ranking check")
raise
# cleanup
del inputData
# del inputData_norm
# del sovi_actual_us
del attrib_contribution_us
national = pd.DataFrame(pca.scores_rot)
national
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|
0 | 0.029446 | -0.024451 | -0.109792 | -0.127427 | -0.258519 | -0.072021 | 0.072921 | -0.109715 |
1 | -0.048314 | -0.111374 | 0.129130 | -0.043986 | -0.179622 | -0.031461 | 0.060592 | 0.033541 |
2 | 0.312285 | 0.034858 | -0.046551 | -0.095825 | -0.003084 | -0.072393 | -0.129102 | -0.003301 |
3 | 0.120573 | 0.047311 | -0.009957 | -0.067153 | -0.152596 | -0.047050 | -0.130647 | -0.159628 |
4 | 0.005660 | 0.035700 | 0.002583 | 0.010332 | -0.075191 | -0.032967 | 0.012418 | -0.083134 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3138 | -0.851567 | -0.299621 | -1.266409 | 0.118841 | -0.843377 | -0.213839 | -1.582643 | -1.401207 |
3139 | -1.635098 | -3.099562 | 0.514943 | 0.475535 | -2.037896 | 0.406036 | -1.526467 | 1.714776 |
3140 | -0.800701 | 0.234603 | -0.946891 | 0.062005 | -1.206986 | 0.121283 | -0.800773 | -0.625157 |
3141 | -1.078669 | 0.255776 | -0.265670 | 0.158393 | 0.625822 | -0.130143 | 0.204333 | -0.122033 |
3142 | -0.592648 | 0.270919 | 0.020170 | -0.486524 | 0.526319 | -0.129719 | -1.766916 | -1.475899 |
3143 rows × 8 columns
###########################
# Compute FEMA Region SoVI
###########################
for i in FEMA_subs:
# Step M1: Subset FEMA subregion
FEMARegionData = counties[counties['stateID'].isin(FEMA_subs[i])]
# Step M2
inputData = FEMARegionData.drop(
['GEOID', 'stateID'], axis=1, inplace=False)
# Step M3
inputData_array = inputData.values # Convert DataFrame to NumPy array
# Step M4
pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)
variance_explained = pca.sum_sq_load_rot / pca.eigenvals_all.sum()
loadings = pd.DataFrame(pca.weights_rot)
# Step M5
sovi_actual_fema = pca.scores_rot
# weight both loadings and component values by variance explained
for c in range(0, len(variance_explained)):
sovi_actual_fema[c] = sovi_actual_fema[c] * (variance_explained[c])
loadings[c] = loadings[c] * variance_explained[c]
# Step M6
sovi_actual_fema = pd.DataFrame(
sovi_actual_fema, index=FEMARegionData.index)
# sum reweighted components
sovi_actual_fema['sovi'] = sovi_actual_fema.sum(axis=1)
# Step M7
sovi_actual_fema['fema_region'] = i
# Step M8
sovi_actual_fema['rank'] = sovi_actual_fema['sovi'].rank(
method='average', ascending=False)
FEMA_Region_Sovi_Score.update(sovi_actual_fema)
# Step M9
attrib_contribution_fema = loadings.sum(1)
varContrib[i] = zip(attr_names1, attrib_contribution_fema.tolist())
# cleanup
del FEMARegionData
del inputData
del sovi_actual_fema
del attrib_contribution_fema
############################
# Compute State Level SoVI
############################
for st in stateList:
# Step M1: Subset FEMA subregion
stateData = counties[counties.stateID == st]
# Step M2
inputData = stateData.drop(['GEOID', 'stateID'], axis=1, inplace=False)
# Step M3
inputData_array = inputData.values # Convert DataFrame to NumPy array
# Step M4
pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)
variance_explained = pca.sum_sq_load_rot / pca.eigenvals_all.sum()
loadings = pd.DataFrame(pca.weights_rot)
# Step M5
sovi_actual = pca.scores_rot
# weight both loadings and component values by variance explained
for c in range(0, len(variance_explained)):
sovi_actual[c] = sovi_actual[c] * (variance_explained[c])
loadings[c] = loadings[c] * variance_explained[c]
# Step M6
sovi_actual = pd.DataFrame(
sovi_actual, index=stateData.index)
# sum reweighted components
sovi_actual['sovi'] = sovi_actual.sum(axis=1)
# Step M7
sovi_actual['state_id'] = st
# Step M8
sovi_actual['rank'] = sovi_actual['sovi'].rank(
method='average', ascending=False)
State_Sovi_Score.update(sovi_actual)
# Step M9
attrib_contribution = pca.weights_rot.sum(1)
varContrib[st] = zip(attr_names1, attrib_contribution.tolist())
# cleanup
del stateData
del inputData
del sovi_actual
del attrib_contribution
Internal consistency analysis¶
Now that we have generated the SoVI scores for the 21 different models, we turn to our analysis of internal consistency.
This analysis checks for consistent SoVI rankings of counties in a region of interest (a state or group of small states) through three versions of a SoVI model, each using a different geographic extent for input data. Those extents are: 1) all counties in the country, 2) all the counties in a FEMA region, and 3) all counties in a single state or group of small states. The SoVI scores for the counties in the region of interest are selected and ranked. The agreement between the three sets of rankings is calculated using the Spearman's Rho rank correlation coefficient. If the model is internally consistent, one could expect a nearly perfect positive rank correlation close to 1, implying that counties have similar levels of social vulnerability vis a vis one another in the region of interest, regardless of how much extraneous information from other counties in the FEMA region or from the whole United States has been included in the SoVI model.
##########################################################################
# Ranks w/ Geographic Extent
# For each county rank within state for US, state, and fema_region sovis
##########################################################################
# Step IC1: Create an empty DataFrame with a column for each SoVI spatial extent and an index of each county FIPS code in the selected state(s) of the 10 FEMA regions
county_in_state_rank = pd.DataFrame(index=State_Sovi_Score.index,
columns=['state_sovi_rank', 'fema_region_sovi_rank', 'us_sovi_rank'])
for st in stateList:
if st == 'g23g33g25':
# Step IC2: Select the index and SoVI scores from national model for Maine, New Hampshire, and Massachusetts
st_cty_scores1 = US_Sovi_Score.loc[['g23' in s for s in US_Sovi_Score.index], 'sovi']
st_cty_scores2 = US_Sovi_Score.loc[['g33' in s for s in US_Sovi_Score.index], 'sovi']
st_cty_scores3 = US_Sovi_Score.loc[['g25' in s for s in US_Sovi_Score.index], 'sovi']
st_cty_scores = pd.concat([st_cty_scores1, st_cty_scores2, st_cty_scores3])
# Step IC3: Re-rank the national SoVI scores but just for the counties in the relevant states
county_in_state_rank.loc[st_cty_scores.index, 'us_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)
# Step IC4: Select the index and SoVI scores from FEMA model for Maine, New Hampshire, and Massachusetts
st_cty_scores1 = FEMA_Region_Sovi_Score.loc[['g23' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
st_cty_scores2 = FEMA_Region_Sovi_Score.loc[['g33' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
st_cty_scores3 = FEMA_Region_Sovi_Score.loc[['g25' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
st_cty_scores = pd.concat([st_cty_scores1, st_cty_scores2, st_cty_scores3])
# Step IC5: Re-rank the FEMA SoVI scores but just for the counties in the relevant states
county_in_state_rank.loc[st_cty_scores.index, 'fema_region_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)
# Step IC6: Pull the state-only SoVI ranks into the same dataframe as the other data
county_in_state_rank.loc[st_cty_scores.index, 'state_sovi_rank'] = State_Sovi_Score.loc[State_Sovi_Score['state_id'] == 'g23g33g25', 'rank']
else:
# Step IC2: select the index and SoVI scores from national model for the relevant state
st_cty_scores = US_Sovi_Score.loc[[st in s for s in US_Sovi_Score.index], 'sovi']
# Step IC3: Re-rank the national SoVI scores but just for the counties in the relevant state
county_in_state_rank.loc[st_cty_scores.index, 'us_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)
# Step IC4: Select the index and SoVI scores from FEMA model for the relevant state
st_cty_scores = FEMA_Region_Sovi_Score.loc[[st in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
# Step IC5: Re-rank the FEMA SoVI scores but just for the counties in the relevant state
county_in_state_rank.loc[st_cty_scores.index, 'fema_region_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)
# Step IC6: Pull the state-only SoVI ranks into the same dataframe as the other data
county_in_state_rank.loc[st_cty_scores.index, 'state_sovi_rank'] = State_Sovi_Score.loc[State_Sovi_Score['state_id'] == st, 'rank']
######################
# CORRELATIONS
######################
# Step IC 7: Create an empty DataFrame to hold Spearman test results
state_corrs = pd.DataFrame(index = stateList, columns = ['spearman_r_st_fema', 'pvalue_st_fema', 'spearman_r_st_us', 'pvalue_st_us'])
for st in stateList:
if st == 'g23g33g25':
# Step IC8: Calculate spearman correlation between state and FEMA, state and national
multi_state_data_tmp1 = county_in_state_rank.loc[['g23' in s for s in county_in_state_rank.index], ]
multi_state_data_tmp2 = county_in_state_rank.loc[['g25' in s for s in county_in_state_rank.index], ]
multi_state_data_tmp3 = county_in_state_rank.loc[['g33' in s for s in county_in_state_rank.index], ]
multi_state_data_tmp = pd.concat([multi_state_data_tmp1, multi_state_data_tmp2, multi_state_data_tmp3])
st_fema_spearman = spearmanr(multi_state_data_tmp[['state_sovi_rank', 'fema_region_sovi_rank']])
st_us_spearman = spearmanr(multi_state_data_tmp[['state_sovi_rank', 'us_sovi_rank']])
state_corrs.loc['g23g33g25', ] = [st_fema_spearman[0], st_fema_spearman[1], st_us_spearman[0], st_us_spearman[1]]
else:
# Step IC8: Calculate spearman correlation between state and FEMA, state and national
st_fema_spearman = spearmanr(county_in_state_rank.loc[[st in s for s in county_in_state_rank.index], ['state_sovi_rank', 'fema_region_sovi_rank']])
st_us_spearman = spearmanr(county_in_state_rank.loc[[st in s for s in county_in_state_rank.index], ['state_sovi_rank', 'us_sovi_rank']])
state_corrs.loc[st, ] = [st_fema_spearman[0], st_fema_spearman[1], st_us_spearman[0], st_us_spearman[1]]
Theoretical consistency analysis¶
Finally, we investigate the questions surrounding theoretical consistency.
This analysis checks for consistent signs and ranks of variables across the same 21 models that were used in the internal consistency analysis. To evaluate the signs and ranks of variables, we sum all components together, producing one vector for each model containing the net effect of each variable on the SoVI score. Theoretical consistency is indicated by little variation amongst all models in the signs and magnitudes of variable weights. Theoretical inconsistency is indicated by substantial variation in the signs and weights of variables and by disagreement between a variable's theoretical influence and modeled influence on vulnerability.
Unplanned deviation: we were unable to find all of the code for this part the analysis, so we wrote much of this code ourselves.
# Step TC1: Create a DataFrame to hold variable contributions values
variable_contributions = pd.DataFrame(index=attr_names1)
# Step TC2: Add variable contributions values to DataFrame
for area in varContrib.keys():
variable_contributions[area] = [x for i, x in varContrib[area]]
# Step TC3: For all SoVI models, rank variables from the greatest to the least magnitudes
rankContrib = abs(variable_contributions).apply(rankdata, axis=0, method='average')
rankContrib = (28-rankContrib) + 1
# Step TC4: Sort variable rankings according to national model's most to least important
rankContrib = rankContrib.sort_values("USA", ascending = True).reset_index()
rankContrib.index = rankContrib["index"]
rankContrib = rankContrib.drop(columns = ["index"])
# Step TC5: Calculate summary statistics for each variable
summary_stats = pd.DataFrame( {"Min": rankContrib.min(axis = 1).round(),
"Max": rankContrib.max(axis = 1).round(),
"Range": rankContrib.max(axis = 1) - rankContrib.min(axis = 1).round(),
"Average": rankContrib.mean(axis = 1).round(2)
} )
# Step TC6: determine signs of USA model
def pos_neg(x):
if x > 0:
return "+"
else:
return "-"
usa = variable_contributions["USA"].apply(pos_neg)
# Step TC7: Determine all positive/negatives
reversals_adj = variable_contributions < 0
# Step TC8: Separate data
reference = reversals_adj[["USA"]]
other_vars = reversals_adj.drop(columns = ["USA"])
# Step TC9: Determine all reversals from expected sign
for i in range(len(other_vars.columns)):
other_vars.iloc[:, i] = reversals_adj["USA"].eq(other_vars.iloc[:, i]).eq(False)
# Step TC10: calculate reversals
reversal_sum = pd.DataFrame( {"Reversals": other_vars.sum(axis = 1)} )
# Step TC11: Join data
summary_stats = summary_stats.merge(reversal_sum, left_index = True, right_index = True)
# Step TC12: Join data
summary_stats = summary_stats.merge(usa, left_index = True, right_index = True)
# Step TC13: Change index labels to reflect any changes prior to SoVI calculation
for name, sign, sample, hrname in input_names:
if sign == 'neg':
summary_stats = summary_stats.rename(index={name: '- '+ name})
else:
pass
# Table aesthetics edits
summary_stats = summary_stats.rename(columns={"USA": "National Model"})
summary_stats = summary_stats.loc[:,['National Model', 'Reversals', 'Min', 'Average', 'Max', 'Range']]
We made a different choice than Spielman et al. in this table. Since we adjusted all variables such that larger values are theoretically associated with a higher degree of vulnerability before calculating SoVI, we would expect all outputs to be positive. In Spielman et al.'s "expected contribution" column, some signs were negative -- they recorded the directionality we would expect of the variables if they had made no adjustments before calculating SoVI. This leads to a misleading table because their "original contribution" column displays the signs of the model output including prior adjustments to directionaity, but the "expected contribution" column displays the signs if the model had not adjusted directionality. To make their "expected contribution" column consistent with their "original contribution" column, they would need all of the signs to be positive in the "expected contribution" column. We choose to simply not include an "expected contribution" column since there would be no variation within it anyway. Additionally, we add a negative sign in front of any variables that we changed the directionality of for the sake of clarity.
Save analysis results¶
#####################################################
# OUTPUT TABLES
#####################################################
US_Sovi_Score.to_csv( here(path["ddpub"], 'US_Sovi_Score_Weighted.csv') )
# In the FEMA_Region_Sovi_Score data frame ranks are BY FEMA REGION.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# FEMA Region)
FEMA_Region_Sovi_Score.to_csv( here(path["ddpub"], 'FEMA_Region_Sovi_Score_Weighted.csv') )
# In the State_Sovi_Score data frame ranks are BY STATE.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# state in stateList)
State_Sovi_Score.to_csv( here(path["ddpub"], 'State_Sovi_Score_Weighted.csv') )
# County rank within state for US, state, and fema_region sovis
county_in_state_rank.to_csv( here(path["ddpub"], 'County_in_State_Rank_Weighted.csv') )
# Variable contributions for sovis at all geographic extents
variable_contributions.to_csv( here(path["ddpub"], 'variable_contributions_Weighted.csv') )
# Correlation of ranks
state_corrs.to_csv( here(path["ddpub"], 'state_fema_us_rank_correlations_Weighted.csv') )
Results¶
Rpr-H1¶
First, we tested RPr-H1, that reproduced SoVI model scores for each county are not identical to the original study SoVI model scores for each county for each of the 21 SoVI models.
We define a function, check_it to check equivalency of the original output files to our reproduced output files.
def check_it(file, rounder = False):
'''
Given a file name, this function finds the corresponding file provided by Spielman et al. and the file produced
by our code and returns the number of matches for each column.
'''
global rpl
global og
global test
rpl = pd.read_csv( here(path["ddpub"], file) )
og = pd.read_csv( here(path["og_out"], file) )
og = og.rename(columns = {"Geo_FIPS": "GEOID"})
if "sovi" in rpl.columns:
rpl["sovi"] = rpl["sovi"].round(2)
og["sovi"] = og["sovi"].round(2)
if "Unnamed: 0" in rpl.columns:
rpl.index = rpl["Unnamed: 0"]
rpl = rpl.drop(columns = ["Unnamed: 0"])
if "Unnamed: 0" in og.columns:
og.index = og["Unnamed: 0"]
og = og.drop(columns = ["Unnamed: 0"])
if rounder != False:
og = og.round(rounder)
rpl = rpl.round(rounder)
test = rpl.eq(og)
if test.sum().eq(len(rpl)).sum() == len(test.sum()):
return print("All values match!")
else:
return test.sum()
US SoVI Scores & Rankings¶
check_it('US_Sovi_Score.csv')
GEOID 3143 sovi 3143 rank 3141 dtype: int64
merged = og.merge(rpl, how = "inner", on = "GEOID")
merged.loc[~test["rank"]]
GEOID | sovi_x | rank_x | sovi_y | rank_y | |
---|---|---|---|---|---|
1150 | g22075 | -4.53 | 2984.0 | -4.53 | 2983.0 |
3120 | g56001 | -4.53 | 2983.0 | -4.53 | 2984.0 |
We have identically reproduced national SoVI scores (rounded to 2 decimal points) for all 3143 counties compared to the original study, but two county ranks are different, probably due to the small differences between our area column and theirs.
FEMA Region SoVI Scores & Rankings¶
check_it('FEMA_Region_Sovi_Score.csv')
GEOID 3143 sovi 3109 rank 3109 fema_region 3109 dtype: int64
Our check it function found potential differences in 34 counties. The following table shows the SoVI scores and ranks for those counties.
merged = og.merge(rpl, how = "inner", on = "GEOID")
merged.loc[~test["rank"] | ~test["sovi"] | ~test["fema_region"]]#.head()
GEOID | sovi_x | rank_x | fema_region_x | sovi_y | rank_y | fema_region_y | |
---|---|---|---|---|---|---|---|
67 | g02013 | NaN | NaN | NaN | NaN | NaN | NaN |
68 | g02016 | NaN | NaN | NaN | NaN | NaN | NaN |
69 | g02020 | NaN | NaN | NaN | NaN | NaN | NaN |
70 | g02050 | NaN | NaN | NaN | NaN | NaN | NaN |
71 | g02060 | NaN | NaN | NaN | NaN | NaN | NaN |
72 | g02068 | NaN | NaN | NaN | NaN | NaN | NaN |
73 | g02070 | NaN | NaN | NaN | NaN | NaN | NaN |
74 | g02090 | NaN | NaN | NaN | NaN | NaN | NaN |
75 | g02100 | NaN | NaN | NaN | NaN | NaN | NaN |
76 | g02105 | NaN | NaN | NaN | NaN | NaN | NaN |
77 | g02110 | NaN | NaN | NaN | NaN | NaN | NaN |
78 | g02122 | NaN | NaN | NaN | NaN | NaN | NaN |
79 | g02130 | NaN | NaN | NaN | NaN | NaN | NaN |
80 | g02150 | NaN | NaN | NaN | NaN | NaN | NaN |
81 | g02164 | NaN | NaN | NaN | NaN | NaN | NaN |
82 | g02170 | NaN | NaN | NaN | NaN | NaN | NaN |
83 | g02180 | NaN | NaN | NaN | NaN | NaN | NaN |
84 | g02185 | NaN | NaN | NaN | NaN | NaN | NaN |
85 | g02188 | NaN | NaN | NaN | NaN | NaN | NaN |
86 | g02195 | NaN | NaN | NaN | NaN | NaN | NaN |
87 | g02198 | NaN | NaN | NaN | NaN | NaN | NaN |
88 | g02220 | NaN | NaN | NaN | NaN | NaN | NaN |
89 | g02230 | NaN | NaN | NaN | NaN | NaN | NaN |
90 | g02240 | NaN | NaN | NaN | NaN | NaN | NaN |
91 | g02261 | NaN | NaN | NaN | NaN | NaN | NaN |
92 | g02270 | NaN | NaN | NaN | NaN | NaN | NaN |
93 | g02275 | NaN | NaN | NaN | NaN | NaN | NaN |
94 | g02282 | NaN | NaN | NaN | NaN | NaN | NaN |
95 | g02290 | NaN | NaN | NaN | NaN | NaN | NaN |
546 | g15001 | NaN | NaN | NaN | NaN | NaN | NaN |
547 | g15003 | NaN | NaN | NaN | NaN | NaN | NaN |
548 | g15005 | NaN | NaN | NaN | NaN | NaN | NaN |
549 | g15007 | NaN | NaN | NaN | NaN | NaN | NaN |
550 | g15009 | NaN | NaN | NaN | NaN | NaN | NaN |
These 34 counties are missing data in both the original study and our reproduction study. The counties and county equivalents are all located in Hawaii (FIPS code 15) and Alaska (FIPS code 02). In Spielman et al.'s code, when they define the states in FEMA region IX, they do not include HI, and when they define the states in FEMA region X, they do not include AK. All differences here arise from missing data in analogous places in both my output and theirs. This result was successfully reproduced.
State SoVI Scores & Rankings¶
check_it('State_Sovi_Score.csv')
All values match!
We have identically reproduced SoVI scores for all state models.
County in State Rank¶
check_it("County_in_State_Rank.csv")
All values match!
We have identically reproduced the SoVI rankings in the state(s) of interest for all 21 models.
Variable Contributions¶
check_it("variable_contributions.csv", rounder = 3)
All values match!
When rounded to 3 decimal places, we have successfully reproduced all variable contributions for all models.
State FEMA US Rank Correlations¶
check_it("state_fema_us_rank_correlations.csv", rounder = 14)
All values match!
.sum_sq_load_rotWhen rounded to 14 decimal places, we have succesfully reproduced all Spearman's rank correlations.
National SoVI Score Map¶
Mapping national SoVI levels will make for a clear visualization of social vulnerability nationally, facilitating a comprehensive understanding of the overall vulnerability landscape at a national scale. The purple reflect high SoVI values while the orange reflect low SoVI scores. Lower SoVI scores can be found near cities in the midwest whereas higher scores can be identified near the South and in Alaska.
# get county shapes
counties = gpd.read_file( here(path["ddpub"], "counties.gpkg") )
# Edit counties GEOID to match other datasets
counties["GEOID"] = "g" + counties["GEOID"]
# Select columns we need
counties_US = counties[["GEOID", "geometry"]]
# Join all datasets
counties_US = counties_US.merge(US_Sovi_Score, on = "GEOID")
# Basemap
map = folium.Map(location=[38.41623851044224, -99.90769524852975], tiles="OpenStreetMap", zoom_start=5)
# Choropleth map
folium.Choropleth(
geo_data=counties_US,
data=counties_US,
columns=['GEOID','sovi'],
key_on= 'feature.properties.GEOID',
fill_color='PuOr',
fill_opacity=0.7,
line_color='white',
line_weight=0.3,
line_opacity=1,
legend_name='National SoVI Scores'
).add_to(map)
map