THE DATA SCIENCE INTERVIEW BOOK
Lexical Processing


In any large text document (say, a hundred thousand words), the word frequencies follow a Zipf distribution: a handful of words (mostly stop words) occur extremely often, while the vast majority of words are rare. Lexical processing therefore typically involves the following steps:

  • Remove stop words, as they have high frequency and in most cases do not carry important information (a combined sketch of these preprocessing steps appears after this list)

  • Tokenize, i.e. break the corpus into words, sentences, etc.

  • Canonicalisation, i.e. reduce words to their base forms:

    • Stemming: a rule-based technique that simply chops off the suffix of a word to get its root form, called the ‘stem’. For example, it converts driving, drive, etc. to driv, but it is not good for words like feet, drove, etc.

    • Lemmatization: a more sophisticated technique that doesn’t just chop off the suffix of a word. Instead, it takes an input word and searches for its base form by going through the dictionary variations of the word. The base word, in this case, is called the ‘lemma’. It is more resource-intensive, and you need to pass the POS tag of the word along with the word to be lemmatized.

    • Phonetic Hashing: certain words have different pronunciations in different languages and, as a result, end up being spelt differently, e.g. Delhi and Dilli. Phonetic hashing (e.g. the Soundex algorithm) maps such variants to the same hash code so they can be treated as the same word.

    • Edit Distance: a way of quantifying how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. There are different methodologies for this, e.g. Hamming distance, Levenshtein distance, Jaro-Winkler, etc. (a minimal Levenshtein implementation is sketched after this list).

    • Pointwise Mutual Information (PMI): phrases like "Massachusetts Institute of Technology" are essentially one term, but tokenization splits them into individual words, which is not desirable. PMI is used to decide whether such a phrase should be represented by a single token or not (see the PMI sketch after this list).

  • Convert the data into tabular form:

    • Bag of Words (BoW): a document-term table recording which word appears in which document; the entries can be either counts or binary indicators (see the vectorization sketch after this list).

    • tf-idf: a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general: $\text{tf-idf} = \frac{\text{freq of term 't' in doc 'd'}}{\text{total terms in 'd'}} \times \log\frac{\text{total number of docs}}{\text{total number of docs having term 't'}}$
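
A minimal sketch of the stop-word removal, tokenization, stemming and lemmatization steps using NLTK (assuming the punkt, stopwords and wordnet resources have been downloaded; the sample sentence is purely illustrative):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The cars were driving faster than the children could run"

# 1. Tokenize the corpus into words
tokens = nltk.word_tokenize(text.lower())

# 2. Remove stop words: very frequent words that carry little information
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 3a. Stemming: rule-based suffix chopping
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# 3b. Lemmatization: dictionary lookup; passing the POS tag ('v' = verb) helps
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])
```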
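
For edit distance, a small self-contained Levenshtein sketch (dynamic programming over insertions, deletions and substitutions); NLTK also provides nltk.edit_distance for the same purpose:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Delhi", "Dilli"))     # 2
```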
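
A rough sketch of how PMI can be computed for a two-word phrase from corpus counts; the toy corpus here is purely illustrative, and in practice the counts would come from a large corpus and a pair with PMI above a chosen threshold would be retokenized as a single token (e.g. machine_learning):

```python
import math
from collections import Counter

# Toy corpus; in practice the counts come from a much larger corpus
docs = [
    "machine learning is fun".split(),
    "machine learning needs data".split(),
    "deep learning is a branch of machine learning".split(),
    "data is everywhere".split(),
]

unigrams = Counter(w for doc in docs for w in doc)
bigrams = Counter(b for doc in docs for b in zip(doc, doc[1:]))
n_words = sum(unigrams.values())
n_bigrams = sum(bigrams.values())

def pmi(w1: str, w2: str) -> float:
    """PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_xy = bigrams[(w1, w2)] / n_bigrams
    p_x = unigrams[w1] / n_words
    p_y = unigrams[w2] / n_words
    return math.log2(p_xy / (p_x * p_y))

# A high PMI suggests the pair behaves like a single term
print(round(pmi("machine", "learning"), 2))
```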
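
A minimal sketch of the Bag of Words and tf-idf representations using scikit-learn (assuming scikit-learn ≥ 1.0 for get_feature_names_out); note that TfidfVectorizer uses a smoothed, normalized variant of the idf term, so its values differ slightly from the formula above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

# Bag of Words: document-term matrix of raw counts
bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())

# Binary BoW: 1 if the word occurs in the document, 0 otherwise
print(CountVectorizer(binary=True).fit_transform(docs).toarray())

# tf-idf: words frequent in a document but rare across the corpus get high weight
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```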

[Figure: Zipf distribution of word frequencies]
[Figure: Soundex algorithm for phonetic hashing]