Skip to content
Snippets Groups Projects
README.md 3.77 KiB
Newer Older
Facundo Muñoz's avatar
Facundo Muñoz committed

<!-- README.md is generated from README.Rmd. Please edit that file -->

Facundo Muñoz's avatar
Facundo Muñoz committed
# topomatch

Helper function for matching toponyms from different sources, that can
be written in slightly different ways. Allows to inspect the matching
and act accordingly.
Facundo Muñoz's avatar
Facundo Muñoz committed

``` r
countries1 <- spData::world$name_long
Facundo Muñoz's avatar
Facundo Muñoz committed
countries2 <- unique(maps::world.cities$country.etc)

(country_matches <- topomatch(countries1, countries2))
#> 156 names matched exactly: Fiji, Tanzania, Western Sahara, ... 
Facundo Muñoz's avatar
Facundo Muñoz committed
#> 
#> 15 matches based on similarity: 
#>   1. United States:  United Arab Emirates 
#>   2. Democratic Republic of the Congo:  Congo Democratic Republic 
#>   3. Russian Federation:  Russia 
#>   4. French Southern and Antarctic Lands:  Northern Mariana Islands 
#>   5. Timor-Leste:  East Timor 
#>   6. Côte d'Ivoire:  Cape Verde 
#>   7. The Gambia:  Gambia 
#>   8. United Kingdom:  United Arab Emirates 
#>   9. Brunei Darussalam:  Brunei 
#>   10. Antarctica:  Vatican City 
#>   11. Northern Cyprus:  Northern Mariana Islands 
#>   12. Somaliland:  Swaziland 
#>   13. Serbia:  Serbia and Montenegro 
#>   14. Montenegro:  Serbia and Montenegro 
#>   15. South Sudan:  South Africa 
Facundo Muñoz's avatar
Facundo Muñoz committed
#> 
#> 6 unresolved matches: 
#>   1. Republic of the Congo: Czech Republic, Dominican Republic, ... 
#>   2. eSwatini: Palestine, Estonia 
#>   3. Lao PDR: San Marino, ... 
#>   4. Dem. Rep. Korea: Korea South, Korea North 
#>   5. Republic of Korea: Czech Republic, Dominican Republic, ... 
#>   6. Kosovo: Comoros, Solomon Islands
Facundo Muñoz's avatar
Facundo Muñoz committed
```

Facundo Muñoz's avatar
Facundo Muñoz committed
There are some manual fixes needed for those toponyms that weren’t
correctly matched. Just write the fixes in a named vector. If there is
no correct match for one toponym, give it an `NA`.

``` r
## Inspect the competing candidates for the unmatched countries
(bm <- best_matches(country_matches)[unmatched(country_matches)])
#> $`Republic of the Congo`
#> [1] "Czech Republic"            "Dominican Republic"       
#> [3] "Congo Democratic Republic" "Central African Republic" 
#> 
#> $eSwatini
#> [1] "Palestine" "Estonia"  
#> 
#> $`Lao PDR`
#> [1] "San Marino"               "Central African Republic"
#> [3] "Sao Tome and Principe"   
#> 
#> $`Dem. Rep. Korea`
#> [1] "Korea South" "Korea North"
#> 
#> $`Republic of Korea`
#> [1] "Czech Republic"            "Dominican Republic"       
#> [3] "Congo Democratic Republic" "Central African Republic" 
#> 
#> $Kosovo
#> [1] "Comoros"         "Solomon Islands"

cnames_fixes <- setNames(
  c("Congo Democratic Republic", NA, "Laos", "Korea North",
    "Korea South", NA),
  names(bm)
)

## Fix the incorrectly matches from similarity as well
cnames_fixes <- c(
  cnames_fixes,
  "United States" = "USA",
  "French Southern and Antarctic Lands" = "France",
  "Côte d'Ivoire" = "Ivory Coast",
  "United Kingdom" = "UK",
  "Antarctica" = NA,
  "Northern Cyprus" = "Cyprus",
  "Somaliland" = "Somalia",
  "South Sudan" = "Sudan"
)
```

Now you can `transcribe` the original toponyms to the matched terms.

``` r
translate <- transcribe(country_matches, fixes = cnames_fixes)

translate(c("United Kingdom", "Kosovo"))
#> [1] "UK" NA

## "Translate" all of the original toponyms
countries1_trans <- translate(countries1)

## Only those "fixed" as NA are not found in the second list
countries1[!countries1_trans %in% countries2]
#> [1] "eSwatini"   "Antarctica" "Kosovo"
```

Facundo Muñoz's avatar
Facundo Muñoz committed
## Method
Facundo Muñoz's avatar
Facundo Muñoz committed
Wraps local-global alignment algorithm borrwed from bioConductor package
`Biostrings`. Works better than global alignment and requires less
fine-tuning (although is considerably slower too)
<https://ro-che.info/articles/2016-12-11-local-alignment>.
Facundo Muñoz's avatar
Facundo Muñoz committed

Facundo Muñoz's avatar
Facundo Muñoz committed
## Installation
Facundo Muñoz's avatar
Facundo Muñoz committed

``` r
Facundo Muñoz's avatar
Facundo Muñoz committed
remotes::install_gitlab("umr-astre/topomatch", host = "forgemia.inra.fr")
Facundo Muñoz's avatar
Facundo Muñoz committed
```

<!-- You can install the released version of topomatch from [CRAN](https://CRAN.R-project.org) with: -->
<!-- ``` r -->
<!-- install.packages("topomatch") -->
<!-- ``` -->