Here are several techniques commonly used in natural language processing (NLP) to help programs understand natural language:
- Bag of Words: counting the number of occurrences of each word in a sentence or a text.
The downside of this technique is that an important word such as ‘universe’ is weighted the same way as a stop word such as ‘the’ or ‘a’. One way to address this is to analyze word occurrences across all of the texts to see whether a word is common across texts, and to give less weight to words that appear almost everywhere.
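As a rough illustration of both ideas, the sketch below counts words in one text and then scores each word by how rare it is across a small collection. The sample texts and the function names `bag_of_words` and `inverse_document_frequency` are assumptions made for this example, and the log-based score is one common choice (the idea behind TF-IDF weighting) rather than the only one.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def inverse_document_frequency(texts):
    """Score each word by how rare it is across texts (higher means rarer)."""
    bags = [bag_of_words(t) for t in texts]
    # Iterating over a Counter yields each distinct word once per text,
    # so this counts the number of texts that contain the word.
    doc_freq = Counter(word for bag in bags for word in bag)
    return {w: math.log(len(texts) / df) for w, df in doc_freq.items()}

# Made-up sample texts for illustration only.
texts = [
    "the universe is vast",
    "the cat sat on the mat",
    "the dog chased the cat",
]
counts = bag_of_words(texts[1])
idf = inverse_document_frequency(texts)

print(counts["the"])    # 2
print(idf["the"])       # 0.0, because 'the' appears in every text
print(idf["universe"] > idf["cat"])  # True, 'universe' is rarer
```

Multiplying a word's count by this score pushes words like ‘the’ toward zero while leaving rarer words such as ‘universe’ with a higher weight.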
- Tokenization: breaking down a text into sentences and words.
This process usually also removes punctuation such as commas and question marks, and removing these characters can change the meaning. Another problem is that tokens are usually separated by blank spaces, so a token that consists of two or more words (e.g. New York or San Francisco) may be split into separate tokens that carry a different meaning.
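The sketch below shows both problems with a deliberately simple whitespace tokenizer, plus one possible workaround for multi-word names. The `MULTIWORD` list and both function names are hypothetical; real tokenizers handle these cases in more sophisticated ways.

```python
import re

# Hypothetical list of multi-word expressions that should stay together.
MULTIWORD = ["New York", "San Francisco"]

def simple_tokenize(text):
    """Strip punctuation, then split on whitespace."""
    return re.sub(r"[^\w\s]", "", text).split()

def tokenize_keep_phrases(text, phrases=MULTIWORD):
    """Join known phrases with an underscore before tokenizing."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return simple_tokenize(text)

sentence = "I flew from New York to San Francisco, didn't I?"

print(simple_tokenize(sentence))
# ['I', 'flew', 'from', 'New', 'York', 'to', 'San', 'Francisco', 'didnt', 'I']
print(tokenize_keep_phrases(sentence))
# ['I', 'flew', 'from', 'New_York', 'to', 'San_Francisco', 'didnt', 'I']
```

Note how dropping the apostrophe turns ‘didn't’ into ‘didnt’ and dropping the question mark hides the fact that the sentence was a question.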
- Stop Words Removal: removing common words from the text.
Words such as articles, pronouns, and prepositions carry little meaning on their own because of how common they are. Removing them therefore shouldn’t change the meaning much, and it can help save space in the database.
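A minimal sketch of stop-word removal might look like this. The `STOP_WORDS` set is a tiny hand-picked list for illustration; lists used in practice are much longer and vary by language and application.

```python
# A tiny illustrative stop-word list (an assumption for this example).
STOP_WORDS = {"a", "an", "the", "he", "she", "it", "in", "on", "of", "to", "is"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The cat is on the roof of the house".split()
filtered = remove_stop_words(tokens)

print(filtered)  # ['cat', 'roof', 'house']
```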
- Lemmatization: replacing derived words with their base form.
For example, in English, ‘go’, ‘went’, ‘gone’, and ‘going’ all come from the same base word ‘go’. So whenever any of those four words is found in a sentence, the program transforms it to the same word ‘go’.
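The ‘go’ example can be sketched with a plain lookup table. The `LEMMAS` dictionary below is a hand-written assumption covering only this one verb; real lemmatizers rely on large dictionaries and morphological rules.

```python
# Hypothetical lookup table mapping inflected forms to their base form.
LEMMAS = {"went": "go", "gone": "go", "going": "go", "goes": "go"}

def lemmatize(token, lemmas=LEMMAS):
    """Return the base form if known, otherwise the lowercased token."""
    token = token.lower()
    return lemmas.get(token, token)

words = "She went home and he is going too".split()
lemmatized = [lemmatize(w) for w in words]

print(lemmatized)
# ['she', 'go', 'home', 'and', 'he', 'is', 'go', 'too']
```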
- Stemming: replacing words with their root form.
The difference between stemming and lemmatization is that stemming removes the last few characters of a word without considering the context in which it is used, while lemmatization takes the context into account.
For example, lemmatization can convert the word ‘stripes’ to either ‘stripe’ or ‘strip’, depending on the context in which the word is used, while a stemmer will always convert it to the same stem (‘strip’ in this example) regardless of where it is used.
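To make the contrast concrete, here is a small sketch. `naive_stem` is a toy suffix stripper written for this example, not an established algorithm such as Porter's (whose output for a given word may differ), and `LEMMA_BY_POS` is a hypothetical table in which the part of speech stands in for "context".

```python
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix, ignoring how the word is used."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table keyed by (word, part of speech).
LEMMA_BY_POS = {
    ("stripes", "noun"): "stripe",
    ("stripes", "verb"): "strip",
}

def lemmatize_with_pos(word, pos):
    """Look up the base form for this word in this grammatical role."""
    return LEMMA_BY_POS.get((word, pos), word)

print(naive_stem("stripes"))                  # strip
print(lemmatize_with_pos("stripes", "noun"))  # stripe
print(lemmatize_with_pos("stripes", "verb"))  # strip
```

The stemmer gives one answer no matter what, while the lemmatizer needs extra information about the word's role before it can choose.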
- Morphological segmentation: dividing words into units called morphemes.
For example, the word ‘unbreakable’ can be broken down into three morphemes: ‘un’, ‘break’, and ‘able’. The meanings of the individual morphemes can then be combined to get the meaning of the word ‘unbreakable’ itself.
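The ‘unbreakable’ example can be reproduced with a very rough affix-stripping sketch. The `PREFIXES` and `SUFFIXES` tuples are short hypothetical lists, and the function will mis-segment many real words (for instance, it would split ‘under’ into ‘un’ and ‘der’), so treat it as an illustration of the idea only.

```python
# Short, hypothetical affix lists for illustration.
PREFIXES = ("un", "re", "dis")
SUFFIXES = ("able", "ness", "ing")

def segment(word):
    """Split a word into at most one prefix, a stem, and one suffix."""
    morphemes = []
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            morphemes.append(prefix)
            word = word[len(prefix):]
            break
    found_suffix = None
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            found_suffix = suffix
            word = word[: -len(suffix)]
            break
    morphemes.append(word)
    if found_suffix:
        morphemes.append(found_suffix)
    return morphemes

print(segment("unbreakable"))  # ['un', 'break', 'able']
```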
- Word segmentation: dividing a text into smaller units, in this case, words.
- Part-of-speech tagging: assigning a tag to each word to identify its part of speech (e.g. noun, verb, or adjective).
- Parsing: analyzing the text according to the grammatical rules of the language.
- Sentence breaking: deciding the start and the end of a sentence.
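As one example from the list above, sentence breaking can be approximated with a regular expression. This sketch assumes a sentence ends with ‘.’, ‘!’ or ‘?’ followed by whitespace and an uppercase letter; that heuristic fails on abbreviations such as ‘Dr. Smith’, which is part of why the task is harder than it looks. The function name `split_sentences` is made up for this example.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital follow."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

text = "NLP is fun. Is it hard? Not always! It depends."
sentences = split_sentences(text)

print(sentences)
# ['NLP is fun.', 'Is it hard?', 'Not always!', 'It depends.']
```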