rm prev gender encoding stuff from readme
mpadge committed Apr 15, 2022
1 parent 9ebfc2c commit 4ee93ed
Showing 4 changed files with 12 additions and 277 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: genderconsciousrouting
Title: Gender-Conscious Routing Along Streets Named After Women Rather
Than Men
Version: 0.0.1.030
Version: 0.0.1.031
Authors@R: c(
person("Mark", "Padgham", , "mark.padgham@email.com", role = c("aut", "cre")),
person("Jörg", "Michael", role = "cph",
116 changes: 2 additions & 114 deletions README.Rmd
@@ -30,117 +30,5 @@ genders, and which provides only binary definitions of gender. The library
nevertheless allows gender associations in a very wide variety of languages,
far more than any other equivalent library, and so allows the functionality of
this package to be applied to far greater regions of the world that would
other, non-binary alternatives.



## Gender catogorizer

This **R** package includes an internally bundled C library for categorising
the gender of first names, thanks to Michael Jörg (available
[here](https://www.heise.de/ct/ftp/07/17/182/)). The library is extremely fast
and flexible; covers all European languages and a host of others, and
categorises gender very accurately. Here's a test run on a very large data set
of names from the English-speaking world:

```{r load, echo = FALSE, message = FALSE}
devtools::load_all (".", export_all = FALSE)
```



```{r babynames}
u <- "https://github.com/hadley/data-baby-names/raw/master/baby-names.csv"
if (!file.exists ("baby-names.csv"))
    chk <- download.file (u, "baby-names.csv")
n <- read.csv ("baby-names.csv", stringsAsFactors = FALSE)
format (nrow (n), big.mark = ",")
st <- system.time (x <- get_gender (n$name))
st
knitr::kable (table (x$gender))
knitr::kable (table (n$sex))
```

Categorising `r format (nrow (n), big.mark = ",")` names took only `r st [3]`
seconds, or around 100,000 names per second. The following code compares the
accuracy, noting that many names are of course unisex, whereas the "baby-names"
data are direct records of individual names and sex.

```{r babyname-output}
x$gender [x$gender == "IS_MALE"] <- "boy"
x$gender [x$gender == "IS_MOSTLY_MALE"] <- "boy"
x$gender [x$gender == "IS_FEMALE"] <- "girl"
x$gender [x$gender == "IS_MOSTLY_FEMALE"] <- "girl"
index_right <- which (x$gender == n$sex)
message (format (length (index_right), big.mark = ","), " / ",
         format (nrow (x), big.mark = ","),
         " of names correctly classified = ",
         formatC (100 * length (index_right) / nrow (x),
                  format = "f", digits = 1), "%")
```

Noting that the baby name records are structured over time, and include
many repeats of the same names, we can try to create "mostly girl/boy"
categories based on relative proportions.


```{r categorise-sex}
library (dplyr)
categorise_sex <- function (sex, size) {
    # define relative proportions:
    # if > rel_props [2], then category is singular
    # else if > rel_props [1], then category is "mostly" singular,
    # else category is unisex
    rel_props <- c (4, 1000)
    if (length (size) == 1)
        return (sex)
    bi <- which (sex == "boy")
    gi <- which (sex == "girl")
    if (size [bi] > (size [gi] * rel_props [2]))
        return ("boy")
    else if (size [gi] > (size [bi] * rel_props [2]))
        return ("girl")
    else if (size [bi] > (size [gi] * rel_props [1]))
        return ("mostly boy")
    else if (size [gi] > (size [bi] * rel_props [1]))
        return ("mostly girl")
    else
        return ("unisex")
}
n2 <- n |>
    group_by (name, sex) |>
    summarise (size = n ()) |>
    group_by (name) |>
    summarise (category = categorise_sex (sex, size))
knitr::kable (table (n2$category))
```

The above values for relative proportions were selected to give good agreement
with the observed overall distribution of categories as determined by the
internal library. These two more refined data sets can then be compared:

```{r}
n2$gender <- get_gender (n2$name)$gender
n2$gender [n2$gender == "IS_FEMALE"] <- "girl"
n2$gender [n2$gender == "IS_MALE"] <- "boy"
n2$gender [n2$gender == "IS_MOSTLY_FEMALE"] <- "mostly girl"
n2$gender [n2$gender == "IS_MOSTLY_MALE"] <- "mostly boy"
n2$gender [n2$gender == "IS_UNISEX_NAME"] <- "unisex"
```
Some names are simply not found, so we'll remove those from the comparison
before calculating final statistics.
```{r contingency}
n2 <- n2 [which (!n2$gender == "NAME_NOT_FOUND"), ]
knitr::kable (with (n2, table (category, gender)))
```
The accuracy in that case is
```{r contingency2}
ct <- with (n2, table (category, gender))
sum (diag (ct)) / sum (ct)
```
other, non-binary alternatives. See the vignette for further detail on the
gender encoding system used here.
162 changes: 3 additions & 159 deletions README.md
@@ -1,3 +1,4 @@

<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- badges: start -->

@@ -21,162 +22,5 @@ definitions of gender. The library nevertheless allows gender
associations in a very wide variety of languages, far more than any
other equivalent library, and so allows the functionality of this
package to be applied to far greater regions of the world that would
other, non-binary alternatives.

## Gender catogorizer

This **R** package includes an internally bundled C library for
categorising the gender of first names, thanks to Michael Jörg
(available [here](https://www.heise.de/ct/ftp/07/17/182/)). The library
is extremely fast and flexible; covers all European languages and a host
of others, and categorises gender very accurately. Here’s a test run on
a very large data set of names from the English-speaking world:

``` r
u <- "https://github.com/hadley/data-baby-names/raw/master/baby-names.csv"
if (!file.exists ("baby-names.csv"))
    chk <- download.file (u, "baby-names.csv")
n <- read.csv ("baby-names.csv", stringsAsFactors = FALSE)
format (nrow (n), big.mark = ",")
#> [1] "258,000"
st <- system.time (x <- get_gender (n$name))
st
#> user system elapsed
#> 1.389 1.839 3.234
knitr::kable (table (x$gender))
```

| Var1 | Freq |
|:-----------------|-------:|
| IS_FEMALE | 103059 |
| IS_MALE | 95751 |
| IS_MOSTLY_FEMALE | 15919 |
| IS_MOSTLY_MALE | 17290 |
| IS_UNISEX_NAME | 11296 |
| NAME_NOT_FOUND | 14685 |

``` r
knitr::kable (table (n$sex))
```

| Var1 | Freq |
|:-----|-------:|
| boy | 129000 |
| girl | 129000 |

Categorising 258,000 names took only 3.234 seconds, or around 100,000
names per second. The following code compares the accuracy, noting that
many names are of course unisex, whereas the “baby-names” data are
direct records of individual names and sex.

``` r
x$gender [x$gender == "IS_MALE"] <- "boy"
x$gender [x$gender == "IS_MOSTLY_MALE"] <- "boy"
x$gender [x$gender == "IS_FEMALE"] <- "girl"
x$gender [x$gender == "IS_MOSTLY_FEMALE"] <- "girl"

index_right <- which (x$gender == n$sex)
message (format (length (index_right), big.mark = ","), " / ",
         format (nrow (x), big.mark = ","),
         " of names correctly classified = ",
         formatC (100 * length (index_right) / nrow (x),
                  format = "f", digits = 1), "%")
#> 217,630 / 258,000 of names correctly classified = 84.4%
```

Noting that the baby name records are structured over time, and include
many repeats of the same names, we can try to create “mostly girl/boy”
categories based on relative proportions.

``` r
library (dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:testthat':
#>
#> matches
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union

categorise_sex <- function (sex, size) {
    # define relative proportions:
    # if > rel_props [2], then category is singular
    # else if > rel_props [1], then category is "mostly" singular,
    # else category is unisex
    rel_props <- c (4, 1000)

    if (length (size) == 1)
        return (sex)

    bi <- which (sex == "boy")
    gi <- which (sex == "girl")

    if (size [bi] > (size [gi] * rel_props [2]))
        return ("boy")
    else if (size [gi] > (size [bi] * rel_props [2]))
        return ("girl")
    else if (size [bi] > (size [gi] * rel_props [1]))
        return ("mostly boy")
    else if (size [gi] > (size [bi] * rel_props [1]))
        return ("mostly girl")
    else
        return ("unisex")
}
n2 <- n |>
    group_by (name, sex) |>
    summarise (size = n ()) |>
    group_by (name) |>
    summarise (category = categorise_sex (sex, size))
#> `summarise()` has grouped output by 'name'. You can override using the `.groups` argument.
knitr::kable (table (n2$category))
```

| Var1 | Freq |
|:------------|-----:|
| boy | 2764 |
| girl | 3345 |
| mostly boy | 147 |
| mostly girl | 224 |
| unisex | 302 |

The above values for relative proportions were selected to give good
agreement with the observed overall distribution of categories as
determined by the internal library. These two more refined data sets can
then be compared:

``` r
n2$gender <- get_gender (n2$name)$gender
n2$gender [n2$gender == "IS_FEMALE"] <- "girl"
n2$gender [n2$gender == "IS_MALE"] <- "boy"
n2$gender [n2$gender == "IS_MOSTLY_FEMALE"] <- "mostly girl"
n2$gender [n2$gender == "IS_MOSTLY_MALE"] <- "mostly boy"
n2$gender [n2$gender == "IS_UNISEX_NAME"] <- "unisex"
```

Some names are simply not found, so we’ll remove those from the
comparison before calculating final statistics.

``` r
n2 <- n2 [which (!n2$gender == "NAME_NOT_FOUND"), ]
knitr::kable (with (n2, table (category, gender)))
```

| | boy | girl | mostly boy | mostly girl | unisex |
|:------------|-----:|-----:|-----------:|------------:|-------:|
| boy | 1643 | 19 | 92 | 15 | 61 |
| girl | 19 | 2221 | 15 | 89 | 66 |
| mostly boy | 90 | 3 | 36 | 4 | 8 |
| mostly girl | 0 | 173 | 3 | 30 | 7 |
| unisex | 27 | 44 | 65 | 79 | 66 |

The accuracy in that case is

``` r
ct <- with (n2, table (category, gender))
sum (diag (ct)) / sum (ct)
#> [1] 0.8196923
```
other, non-binary alternatives. See the vignette for further detail on
the gender encoding system used here.
9 changes: 6 additions & 3 deletions codemeta.json
@@ -7,7 +7,7 @@
"codeRepository": "https://github.com/mpadge/gender-conscious-routing",
"issueTracker": "https://github.com/mpadge/gender-conscious-routing/issues",
"license": "https://spdx.org/licenses/GPL-3.0",
"version": "0.0.1.30",
"version": "0.0.1.031",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -112,10 +112,13 @@
},
"sameAs": "https://CRAN.R-project.org/package=sf"
},
"SystemRequirements": null
"SystemRequirements": {}
},
"fileSize": "12107.981KB",
"readme": "https://github.com/mpadge/gender-conscious-routing/blob/main/README.md",
"contIntegration": ["https://github.com/mpadge/gender-conscious-routing/actions?query=workflow%3AR-CMD-check", "https://codecov.io/gh/mpadge/gender-conscious-routing"],
"contIntegration": [
"https://github.com/mpadge/gender-conscious-routing/actions?query=workflow%3AR-CMD-check",
"https://codecov.io/gh/mpadge/gender-conscious-routing"
],
"developmentStatus": "http://www.repostatus.org/#concept"
}
