topomatch

Helper function for matching toponyms from different sources that may be written in slightly different ways. It lets you inspect the matches and act accordingly.

countries1 <- spData::world$name_long
countries2 <- unique(maps::world.cities$country.etc)

(country_matches <- topomatch(countries1, countries2))
#> 156 names matched exactly: Fiji, Tanzania, Western Sahara, ... 
#> 
#> 15 matches based on similarity: 
#>   1. United States:  United Arab Emirates 
#>   2. Democratic Republic of the Congo:  Congo Democratic Republic 
#>   3. Russian Federation:  Russia 
#>   4. French Southern and Antarctic Lands:  Northern Mariana Islands 
#>   5. Timor-Leste:  East Timor 
#>   6. Côte d'Ivoire:  Cape Verde 
#>   7. The Gambia:  Gambia 
#>   8. United Kingdom:  United Arab Emirates 
#>   9. Brunei Darussalam:  Brunei 
#>   10. Antarctica:  Vatican City 
#>   11. Northern Cyprus:  Northern Mariana Islands 
#>   12. Somaliland:  Swaziland 
#>   13. Serbia:  Serbia and Montenegro 
#>   14. Montenegro:  Serbia and Montenegro 
#>   15. South Sudan:  South Africa 
#> 
#> 6 unresolved matches: 
#>   1. Republic of the Congo: Czech Republic, Dominican Republic, ... 
#>   2. eSwatini: Palestine, Estonia 
#>   3. Lao PDR: San Marino, ... 
#>   4. Dem. Rep. Korea: Korea South, Korea North 
#>   5. Republic of Korea: Czech Republic, Dominican Republic, ... 
#>   6. Kosovo: Comoros, Solomon Islands

Toponyms that were not matched correctly need manual fixes. Write the fixes in a named vector, with the original toponym as the name and the correct match as the value. If a toponym has no correct match, set it to NA.

## Inspect the competing candidates for the unmatched countries
(bm <- best_matches(country_matches)[unmatched(country_matches)])
#> $`Republic of the Congo`
#> [1] "Czech Republic"            "Dominican Republic"       
#> [3] "Congo Democratic Republic" "Central African Republic" 
#> 
#> $eSwatini
#> [1] "Palestine" "Estonia"  
#> 
#> $`Lao PDR`
#> [1] "San Marino"               "Central African Republic"
#> [3] "Sao Tome and Principe"   
#> 
#> $`Dem. Rep. Korea`
#> [1] "Korea South" "Korea North"
#> 
#> $`Republic of Korea`
#> [1] "Czech Republic"            "Dominican Republic"       
#> [3] "Congo Democratic Republic" "Central African Republic" 
#> 
#> $Kosovo
#> [1] "Comoros"         "Solomon Islands"

cnames_fixes <- setNames(
  c("Congo Democratic Republic", NA, "Laos", "Korea North",
    "Korea South", NA),
  names(bm)
)

## Fix the incorrect similarity-based matches as well
cnames_fixes <- c(
  cnames_fixes,
  "United States" = "USA",
  "French Southern and Antarctic Lands" = "France",
  "Côte d'Ivoire" = "Ivory Coast",
  "United Kingdom" = "UK",
  "Antarctica" = NA,
  "Northern Cyprus" = "Cyprus",
  "Somaliland" = "Somalia",
  "South Sudan" = "Sudan"
)

Now you can transcribe the original toponyms to the matched terms.

translate <- transcribe(country_matches, fixes = cnames_fixes)

translate(c("United Kingdom", "Kosovo"))
#> [1] "UK" NA

## "Translate" all of the original toponyms
countries1_trans <- translate(countries1)

## Only those "fixed" as NA are not found in the second list
countries1[!countries1_trans %in% countries2]
#> [1] "eSwatini"   "Antarctica" "Kosovo"
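
The idea behind the named vector of fixes can be sketched in base R. This is only an illustration and not how topomatch itself is implemented: `auto_matches`, `fixes` and `make_translator` are hypothetical names, and the automatic matches are a hand-written subset of the output shown above.

```r
# Hypothetical automatic matches (original name -> matched name)
auto_matches <- c(
  "Russian Federation" = "Russia",
  "United Kingdom"     = "United Arab Emirates",
  "Kosovo"             = "Comoros"
)

# Manual fixes take precedence; NA marks "no correct match"
fixes <- c("United Kingdom" = "UK", "Kosovo" = NA)

# Build a lookup function from the matches, overridden by the fixes
make_translator <- function(matches, fixes) {
  lookup <- matches
  lookup[names(fixes)] <- fixes   # override (or add) entries by name
  function(x) unname(lookup[x])   # names without an entry yield NA
}

tr <- make_translator(auto_matches, fixes)
tr(c("United Kingdom", "Kosovo", "Russian Federation"))
#> [1] "UK"     NA       "Russia"
```

Indexing a named vector by name keeps the lookup vectorised, so a whole column of toponyms can be translated in one call.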

Method

Wraps the local-global alignment algorithm borrowed from the Bioconductor package Biostrings. It works better than global alignment and requires less fine-tuning, although it is considerably slower. See https://ro-che.info/articles/2016-12-11-local-alignment.
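
To give a feel for what local-global alignment scores, here is a minimal base-R sketch. It is not the Biostrings implementation that topomatch wraps: the function name `local_global_score`, the scoring values (match 1, mismatch -1, gap -1) and the linear gap penalty are all assumptions chosen for illustration. The whole of `subject` must be aligned, while any leading or trailing part of `pattern` can be skipped at no cost.

```r
# Best score for aligning all of `subject` against any consecutive
# piece of `pattern`, using a simple linear gap penalty (assumed values).
local_global_score <- function(pattern, subject,
                               match = 1, mismatch = -1, gap = -1) {
  p <- strsplit(pattern, "")[[1]]
  s <- strsplit(subject, "")[[1]]
  n <- length(p)
  m <- length(s)
  # H[i + 1, j + 1]: best score aligning s[1:i] with a piece of p ending at j
  H <- matrix(0, nrow = m + 1, ncol = n + 1)
  H[, 1] <- gap * (0:m)  # subject characters cannot be skipped for free
  # The first row stays 0: skipping a prefix of the pattern is free
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      diag_score <- H[i, j] + if (s[i] == p[j]) match else mismatch
      H[i + 1, j + 1] <- max(diag_score,
                             H[i, j + 1] + gap,   # subject char vs. gap
                             H[i + 1, j] + gap)   # pattern char vs. gap
    }
  }
  max(H[m + 1, ])  # skipping a suffix of the pattern is free
}

local_global_score("Russian Federation", "Russia")
#> [1] 6
```

With these scores a global alignment of the same pair would pay for the twelve unmatched characters of " Federation", which is why pairs such as "Russian Federation" and "Russia" are easier to recover with the local-global variant.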

Installation

remotes::install_gitlab("umr-astre/topomatch", host = "forgemia.inra.fr")