---
title: "R Users Breakout Session"
author: Matt Gunther - IPUMS PMA Senior Data Analyst
date: 2022-04-06
output:
ioslides_presentation:
widescreen: true
smaller: true
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
R.options = list(width = 100)
)
```
## Setup
R users: remember to select a **.dat (fixed-width text)** data format
You'll receive a compressed **dat.gz** file - no need to decompress!
Save both of those files in the "data" folder of your working directory.
```{r, echo = FALSE}
knitr::include_graphics("images/download.png", dpi = 150)
```
---
You'll need the [ipumsr](https://tech.popdata.org/ipumsr/index.html) package to load them. If not installed, you can download from CRAN.
```{r, eval = FALSE}
install.packages("ipumsr")
```
Each session, load the `ipumsr` library before you import data.
```{r, results='hide', message=FALSE, warning=FALSE}
library(ipumsr)
# Load data into R with `ipumsr`
dat <- read_ipums_micro(
ddi = "data/pma_00093.xml",
data = "data/pma_00093.dat.gz"
)
```
```{r, echo = FALSE}
knitr::include_graphics("images/logo.png", dpi = 150)
```
---
Other useful packages for IPUMS data:
```{r, results='hide', message=FALSE, warning=FALSE}
# General toolkit
library(tidyverse)
# For label manipulation:
library(labelled)
# For survey analysis:
library(survey)
library(srvyr)
```
```{r, echo = FALSE}
knitr::include_graphics("images/logo-ribbon.png", dpi = 150)
```
# 1 - Analytic Sample
---
PMA uses an **open panel design** - women may enter the panel after Phase 1, and they may be lost to follow-up after any phase.
See [RESULTFQ](https://pma.ipums.org/pma-action/variables/RESULTFQ)
Women who enter the panel at Phase 2 are `NA` for all variables at Phase 1.
```{r}
dat %>% count(RESULTFQ_1)
```
---
Women whose households were not found again after Phase 1 are `NA` for all variables at Phase 2.
```{r}
dat %>% count(RESULTFQ_2)
```
---
We will only include women who were available and completed the Female Questionnaire for *both* Phase 1 and Phase 2.
```{r}
dat <- dat %>% filter(RESULTFQ_1 == 1 & RESULTFQ_2 == 1)
dat %>% count(RESULTFQ_1, RESULTFQ_2)
```
---
Additionally, PMA samples are only valid for the *de facto* population: women who
slept in the household the night before the Household interview.
See [RESIDENT](https://pma.ipums.org/pma-action/variables/RESIDENT)
```{r}
dat %>% count(RESIDENT_1)
```
We'll also drop cases where the woman was not part of the *de facto* population in either Phase 1 or Phase 2.
```{r}
dat <- dat %>% filter(RESIDENT_1 %in% c(11, 22) & RESIDENT_2 %in% c(11, 22))
```
---
How many cases remain?
```{r}
dat %>% count(COUNTRY)
```
# 2 - Recoding Independent variables
---
PMA surveys contain many **categorical** variables. These are usually represented as **factors** in R.
In an IPUMS data extract, you won't see factors!
Instead, we generate **labelled** numeric variables (note the label in brackets).
```{r}
dat %>% ipums_var_label(CVINCOMELOSS_2)
dat %>% count(CVINCOMELOSS_2)
```
---
The [ipumsr](https://tech.popdata.org/ipumsr/index.html) package contains tools for working with labelled IPUMS data.
Usually, we handle codes like `99 [NIU (not in universe)]` before transforming other missing data to `NA`.
```{r}
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
**Tip:**
Information the code `NIU (not in universe)` can always be found on a variable's [universe tab](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS#universe_section).
```{r, echo = FALSE}
knitr::include_graphics("images/universe.png", dpi = 300)
```
---
For [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS), `99 [NIU (not in universe)]` may indicate that the household experienced *no income loss in the last year*, or it may indicate that [HHINCOMELOSSAMT_2](https://pma.ipums.org/pma-action/variables/HHINCOMELOSSAMT) is `98 [No response or missing]`.
We should treat the `NIU` women from households without *any* income loss as "No" in [CVINCOMELOSS_2](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS).
```{r}
dat <- dat %>%
mutate(
CVINCOMELOSS_2 = CVINCOMELOSS_2 %>%
labelled::recode_if(HHINCOMELOSSAMT_2 == 1, 0)
)
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
Next, we'll use `NA` to represent the remaining values above `90`:
* `97 [Don't know] ` and
* remaining cases marked `99 [NIU (not in universe)]`
```{r}
dat <- dat %>%
mutate(
CVINCOMELOSS_2 = CVINCOMELOSS_2 %>%
lbl_na_if(~.val > 90)
)
dat %>% count(CVINCOMELOSS_2, HHINCOMELOSSAMT_2)
```
---
Once you're done with labels, we recommend transforming key variables into **factors** with [forcats::as_factor](https://forcats.tidyverse.org/reference/as_factor.html).
The [forcats](https://forcats.tidyverse.org) package is included when you load `library(tidyverse)`.
```{r}
dat <- dat %>% mutate(CVINCOMELOSS_2 = as_factor(CVINCOMELOSS_2))
dat %>% count(CVINCOMELOSS_2)
```
This will make categorical variables easier to use in data visualization and as "dummy" variables in regression analysis.
---
Likert-style questions can be treated as factors, too.
```{r}
dat %>% ipums_var_label(COVIDCONCERN_2)
dat %>% count(COVIDCONCERN_2)
```
---
This time we'll treat codes `5` and above as `NA`.
```{r}
dat <- dat %>%
mutate(
COVIDCONCERN_2 = COVIDCONCERN_2 %>%
lbl_na_if(~.val >= 5) %>%
as_factor()
)
dat %>% count(COVIDCONCERN_2)
```
---
You can apply the same transformation to several variables with help from [dplyr::across](https://dplyr.tidyverse.org/reference/across.html).
[dplyr]() is another package included when you load `library(tidyverse)`.
```{r}
dat <- dat %>%
mutate(
across(
c(COUNTRY, URBAN, WEALTHT_2, EDUCATTGEN_2),
~.x %>% lbl_na_if(~.val >= 90) %>% as_factor()
)
)
```
---
Often, it's important to set a **reference group** against which all dummy variables will be compared.
You can manually specify a **refernece group** when you set factor "levels" with a function like [forcats::fct_relevel](https://forcats.tidyverse.org/reference/fct_relevel.html).
```{r}
dat <- dat %>%
mutate(
AGE_2 = case_when(
AGE_2 < 25 ~ "15-24",
AGE_2 < 35 ~ "25-34",
AGE_2 < 50 ~ "35-49"
),
AGE_2 = AGE_2 %>% fct_relevel("15-24", "25-34", "35-49")
)
```
# 3 - Dependent variables
---
We'll use our recoded variables to model the likelihood of contraceptive method **adoption** and **discontinuation** between phases.
See [CP](https://pma.ipums.org/pma-action/variables/CVINCOMELOSS)
```{r}
dat <- dat %>% filter(CP_1 < 90 & CP_2 < 90)
dat %>% count(CP_1, CP_2)
```
---
A woman has **adopted** a method if she was *not* using one at Phase 1, but then reported using one at Phase 2.
She has **discontinued** a method if she *did* use one at Phase 1, but no longer uses one at Phase 2.
```{r}
dat <- dat %>%
mutate(
FPSTATUS = case_when(
CP_1 == 1 & CP_2 == 1 ~ "User",
CP_1 == 0 & CP_2 == 0 ~ "Non-user",
CP_1 == 1 & CP_2 == 0 ~ "Discontinued",
CP_1 == 0 & CP_2 == 1 ~ "Adopted"
),
FPSTATUS = fct_infreq(FPSTATUS)
)
```
---
Un-weighted sample proportions for `FPSTATUS` can be found with [count](https://dplyr.tidyverse.org/reference/count.html) and [prop.table](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/prop.table):
```{r}
dat_nowt <- dat %>%
group_by(COUNTRY) %>%
count(FPSTATUS) %>%
mutate(prop = prop.table(n))
dat_nowt
```
We'll plot this table with [ggplot2](https://ggplot2.tidyverse.org/index.html) (also included with the [tidyverse](https://tidyverse.tidyverse.org/)).
---
```{r, fig.height=4, fig.width=10}
dat_nowt %>%
ggplot(aes(x = prop, y = FPSTATUS, fill = FPSTATUS)) +
geom_bar(stat = "identity") +
facet_wrap(~COUNTRY) + theme_minimal() +
theme(axis.title = element_blank(), legend.position = "none") +
scale_x_continuous(labels = scales::label_percent())
```
---
For *weighted* population estimates, use [as_survey_design](http://gdfe.co/srvyr/reference/as_survey_design.html) and [survey_mean](http://gdfe.co/srvyr/reference/survey_mean.html) from the [srvyr](http://gdfe.co/srvyr/index.html) package.
Use `prop = TRUE` to adjust standard errors near 0% or 100% for proportions.
```{r}
dat_wtd <- dat %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
group_by(COUNTRY, FPSTATUS) %>%
summarise(survey_mean(prop = TRUE, prop_method = "logit", vartype = "ci"))
dat_wtd
```
---
```{r, fig.height=4, fig.width=10}
dat_wtd %>%
ggplot(aes(x = coef, y = FPSTATUS, fill = FPSTATUS)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(xmin = `_low`, xmax = `_upp`), width = 0.2, alpha = 0.5) +
facet_wrap(~COUNTRY) + theme_minimal() +
theme(axis.title = element_blank(), legend.position = "none") +
scale_x_continuous(labels = scales::label_percent())
```
# 4 - Analysis
---
The same [srvyr](http://gdfe.co/srvyr/index.html) toolkit can be used to model our dependent variables with [survey::svyglm](http://r-survey.r-forge.r-project.org/survey/).
Consider women who were *not* using a method at Phase 1:
```{r}
adopt_glm <- dat %>%
filter(CP_1 == 0) %>%
mutate(adopt = FPSTATUS == "Adopted") %>%
group_by(COUNTRY) %>%
summarise(
adopt = cur_data() %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
svyglm(
adopt ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2,
family = "quasibinomial", design = .
) %>%
broom::tidy(exp = TRUE) %>%
mutate(sig = gtools::stars.pval(p.value)) %>%
list()
)
adopt_glm
```
---
For Phase 1 non-users in Burkina Faso, **very high** levels of concern about becoming infected with COVID-19 are significantly associated with higher chances of adopting a contraceptive method (relative to women who had no such concern).
Lesser levels of concern are not statistically significant, nor is household income loss from COVID-19.
```{r}
adopt_glm %>%
filter(COUNTRY == "Burkina Faso") %>%
unnest(adopt)
```
---
In Kenya, neither of these measures are significantly predictive of adoption among non-users.
```{r}
adopt_glm %>%
filter(COUNTRY == "Kenya") %>%
unnest(adopt)
```
---
What about method **dicontinuation** for women who *were* using a method at Phase 1?
```{r}
stop_glm <- dat %>%
filter(CP_1 == 1) %>%
mutate(stop = FPSTATUS == "Discontinued") %>%
group_by(COUNTRY) %>%
summarise(
stop = cur_data() %>%
as_survey_design(weight = PANELWEIGHT, id = EAID_1, strata = STRATA_1) %>%
svyglm(
stop ~ CVINCOMELOSS_2 + COVIDCONCERN_2 + URBAN + WEALTHT_2 + EDUCATTGEN_2 + AGE_2,
family = "quasibinomial", design = .
) %>%
broom::tidy(exp = TRUE) %>%
mutate(sig = gtools::stars.pval(p.value)) %>%
list()
)
stop_glm
```
---
This time, neither of the COVID-19 measures are significantly associated with **discontinuation** for Phase 1 contraceptive users in Burkina Faso.
```{r}
stop_glm %>%
filter(COUNTRY == "Burkina Faso") %>%
unnest(stop)
```
---
However, higher levels concern with becoming infected with COVID-19 *are* significantly associated with higher odds of discontinuation for Phase 1 contraceptive users in Kenya.
```{r}
stop_glm %>%
filter(COUNTRY == "Kenya") %>%
unnest(stop)
```
---
For more R tips for IPUMS data, check out:
* The [IPUMS PMA blog](https://tech.popdata.org/pma-data-hub/)
* The [ipumsr](https://tech.popdata.org/ipumsr/) documentation website
* The [ipums tutorials](https://www.ipums.org/support/tutorials) page
Thank you!
pFad - Phonifier reborn
Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.
Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.
Alternative Proxies:
Alternative Proxy
pFad Proxy
pFad v3 Proxy
pFad v4 Proxy