This repository was archived by the owner on Jan 8, 2024. It is now read-only.

madhurima-nath/nlp_fuzzy_match_algorithms


WWCode Data Science: NLP Fuzzy Match Algorithms

Fuzzy string matching is a technique for finding strings that match approximately rather than exactly. It has many applications. This talk covers a few of the algorithms used for approximate string matching.

Link to the Jupyter notebook.

YouTube Link


Outline of the talk:    

  • Introduction to fuzzy matching
  • Applications of fuzzy matching
  • Algorithms used for fuzzy matching
    • Levenshtein distance algorithm
    • Damerau-Levenshtein distance algorithm
    • Bitap algorithm
    • n-gram algorithm
  • Implementation of fuzzy matching on real data
  • Other fuzzy matching algorithms
  • Record Linkage Toolkit: a library for linking records within or between data sources, with tools for deduplication
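
To make the outline a little more concrete, here is a minimal pure-Python sketch of three of the listed ideas: Levenshtein distance, a Damerau-Levenshtein-style distance, and n-gram similarity. It is not taken from the talk's notebook. The function names are made up for illustration, the second function implements the restricted "optimal string alignment" variant rather than the unrestricted Damerau-Levenshtein distance, and the n-gram score is assumed to be the Jaccard overlap of character n-gram sets, which is only one of several possible definitions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # add a character of b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (illustrative name).

    Like Levenshtein, but swapping two adjacent characters counts as one edit.
    This is the restricted variant of Damerau-Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Jaccard similarity between the sets of character n-grams of a and b."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0  # both strings are too short to produce any n-gram
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("abcd", "acbd"))       # 2 (two substitutions)
print(osa_distance("abcd", "acbd"))      # 1 (one adjacent swap)
print(ngram_similarity("night", "nacht"))
```

The distances count edits, so lower means more similar, while the n-gram score lies between 0 and 1, with higher meaning more similar.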

Libraries used:

  • Jellyfish: Refer here for more information
  • Fuzzywuzzy: Refer here for more information
  • Fuzzy_match: Refer here for more information

Implementation on Real Data

Download data here from Kaggle.

The csv file is also here.

The data contains two columns for room type descriptions. Column 1 is the description from Expedia, and column 2 is the associated room type in Booking.com.

Aim: Compare and match these two columns so that the result reflects a human-like understanding that the matched entries are the same.
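
A rough sketch of this kind of column matching is shown below. It is not the notebook's implementation: the room descriptions are invented for illustration (they are not rows from the Kaggle file), `normalize`, `similarity` and `best_match` are hypothetical helper names, and the score comes from the standard library's `difflib.SequenceMatcher` as a stand-in for the libraries listed above.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and sort the tokens so word order matters less."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two descriptions after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate most similar to the query, together with its score."""
    scored = [(candidate, similarity(query, candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


# Invented examples in the spirit of the two columns (assumption, not real data)
expedia = ["Deluxe Room, 1 King Bed", "Standard Room, 2 Queen Beds"]
booking = ["Standard Queen Room with Two Queen Beds", "Deluxe King Room"]

for description in expedia:
    match, score = best_match(description, booking)
    print(f"{description!r} -> {match!r} (score {score:.2f})")
```

In practice a score threshold would usually be applied as well, so that descriptions with no good counterpart stay unmatched instead of being paired with the least bad candidate.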

Snapshot of the data:

[image: snapshot of the data]


References:

  1. Levenshtein, Vladimir I. "Binary codes capable of correcting deletions, insertions, and reversals." In Soviet physics doklady, vol. 10, no. 8, pp. 707-710. 1966.
  2. Damerau, Fred J. "A technique for computer detection and correction of spelling errors." Communications of the ACM 7, no. 3 (1964): 171-176.
  3. Cayrol, M., H. Farreny, and H. Prade. "Fuzzy pattern matching." Kybernetes 11, no. 2 (1982): 103-116.
  4. Ukkonen, Esko. "Algorithms for approximate string matching." Information and control 64, no. 1-3 (1985): 100-118.
  5. GeeksforGeeks - applications of fuzzy string matching
  6. GeeksforGeeks - Bitap algorithm
  7. Stanford slides on n-grams
  8. DataCamp tutorial - fuzzy string matching
  9. Levenshtein distance theory
  10. Article on record linkage and fuzzy matching
  11. Medium post on Levenshtein distance
  12. Stack Overflow - n-gram similarity

Feel free to reach out if you have any questions.
