Probabilistic sequence models, Markov chains
Week 1 – auto correction
The goal of this week is to build an autocorrection system. It is context-free: it uses edit distance and word probabilities.
- identify misspelled words: a word that is not in the dictionary
- (advanced) take the context (surrounding words) into account
- find candidate words at edit distance 1, 2, 3, …. These candidates may or may not be valid words
- insert letter
- replace letter
- remove letter
- switch two adjacent letters
- Filter candidates
- remove word candidates that are not in the dictionary
- Calculate word probabilities.
- calculate each word's frequency in the corpus (no context is considered)
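The pipeline above can be sketched in Python. This is a minimal sketch on a toy corpus; the names `edits1` and `correct`, and the corpus itself, are my own illustrations, not from the course:

```python
# Sketch of the autocorrect pipeline: generate edit-distance-1 candidates,
# filter against the vocabulary, rank by word frequency.
from collections import Counter

corpus = "i am happy because i am learning to spell".split()
word_counts = Counter(corpus)
vocab = set(word_counts)
total = sum(word_counts.values())

def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away: insert, delete, replace, switch."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts  = [l + c + r     for l, r in splits for c in letters]
    deletes  = [l + r[1:]     for l, r in splits if r]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    switches = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(inserts + deletes + replaces + switches)

def correct(word):
    if word in vocab:                  # not misspelled
        return word
    candidates = edits1(word) & vocab  # filter: keep dictionary words only
    if not candidates:
        return word
    # rank by context-free word probability (frequency / total)
    return max(candidates, key=lambda w: word_counts[w] / total)

print(correct("happpy"))  # deleting one 'p' gives "happy"
```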
How do we consolidate the edit distance and the word probability? We have two values to rank by.
In Course 1 we used Naive Bayes, which assumes each word appears in a text independently.
Now we drop that assumption by using a Markov chain.
Minimum edit distance
cost: insert 1, remove 1, replace 2 (a replace is equivalent to a remove followed by an insert)
D[i,j] = min(D[i-1,j] + 1, D[i,j-1] + 1, D[i-1,j-1] + (0 if A[i] == B[j] else 2))
Use dynamic programming to solve the minimum edit distance problem.
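The recurrence above can be written out directly as a DP table. A sketch, using the course's costs (insert 1, delete 1, replace 2):

```python
# Minimum edit distance via dynamic programming.
# D[i][j] = min cost to turn source[:i] into target[:j].
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                     # delete all of source[:i]
    for j in range(1, n + 1):
        D[0][j] = j                     # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0 if source[i - 1] == target[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,        # delete
                          D[i][j - 1] + 1,        # insert
                          D[i - 1][j - 1] + rep)  # replace / match
    return D[m][n]

print(min_edit_distance("play", "stay"))  # replace p->s and l->t: 2 + 2 = 4
```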
Autocorrection is done with edit distance as the primary criterion and word probability/frequency as the secondary criterion.
Week 2 – Hidden Markov models and part-of-speech tagging
The transition and emission probabilities are both calculated from a tagged corpus.
The hidden Markov model consists of states, a transition matrix, and an emission matrix.
Generating the transition matrix and emission matrix
The transition matrix and emission matrix are generated from the corpus.
Teacher drinks coffee
<s>NN VB NN</s>
I’m going to eat
<s>NN VB O VB</s>
Who is your mother
<s>NN VB NN NN</s>
We use <s> and </s> to denote the beginning and end of a sentence.
Given the corpus above, we can derive the transition matrix:
P(VB|NN) = count(NN → VB) / count(NN) = 3/6 (the probability that the next tag is VB, given that the current tag is NN)
P(NN|VB) = 2/4
P(NN|<s>) = …
Regarding the emission matrix:
P (‘Teacher’|NN) = 1/6
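The counts above can be reproduced from the three example sentences. A sketch (the helper names `p_trans` and `p_emit` are my own):

```python
# Build transition and emission counts from the tagged example corpus.
# <s> is treated as a pseudo-tag so P(tag|<s>) falls out of the same counts.
from collections import defaultdict

tagged = [
    [("<s>", "<s>"), ("Teacher", "NN"), ("drinks", "VB"), ("coffee", "NN")],
    [("<s>", "<s>"), ("I'm", "NN"), ("going", "VB"), ("to", "O"), ("eat", "VB")],
    [("<s>", "<s>"), ("Who", "NN"), ("is", "VB"), ("your", "NN"), ("mother", "NN")],
]

tag_count = defaultdict(int)
trans_count = defaultdict(int)   # (prev_tag, tag) -> count
emit_count = defaultdict(int)    # (tag, word)     -> count

for sent in tagged:
    for (_, pt), (_, t) in zip(sent, sent[1:]):
        trans_count[(pt, t)] += 1
    for w, t in sent:
        tag_count[t] += 1
        if t != "<s>":
            emit_count[(t, w.lower())] += 1

def p_trans(prev, tag):
    return trans_count[(prev, tag)] / tag_count[prev]

def p_emit(tag, word):
    return emit_count[(tag, word.lower())] / tag_count[tag]

print(p_trans("NN", "VB"))       # 3/6 = 0.5
print(p_trans("VB", "NN"))       # 2/4 = 0.5
print(p_emit("NN", "teacher"))   # 1/6
```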
How to use the transition matrix and emission matrix to tag the sentence – Viterbi algorithm
Algorithm definition: given a sentence and the transition/emission matrices, tag the whole sentence so as to maximize the probability.
Use dynamic programming to maximize the probability; it is very similar to minimum edit distance, but with the path preserved.
3 steps in the Viterbi algorithm: initialization, forward pass, backward pass
Initialize two matrices, C and D. C[i,j] stores the maximum probability of choosing tag i (t_i) to emit the word at position j (w_j).
D[i,j] stores which cell in column j-1 yields the maximum probability in C[i,j]; D stores the path for the backward pass.
The forward pass populates C and D:
a_{k,i} is the transition probability from t_k to t_i
b_{i,cindex(w_j)} is the emission probability of emitting w_j from t_i
D[i,j] stores the k that maximizes C[i,j]
The backward pass extracts the tag path from matrix D. We start at the rightmost column, in the row whose corresponding cell in C holds the greatest value.
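The three steps can be sketched on a toy tagset. All the matrix values below are made-up illustrative numbers, not from the course; log probabilities are used to avoid underflow:

```python
# Viterbi sketch: initialization, forward pass, backward pass.
import numpy as np

tags = ["NN", "VB", "O"]
A = np.array([[0.7, 0.2, 0.1],    # transitions from <s>
              [0.3, 0.5, 0.2],    # from NN
              [0.6, 0.1, 0.3],    # from VB
              [0.4, 0.4, 0.2]])   # from O
B = np.array([[0.5, 0.1, 0.4],    # NN emits word 0, 1, 2
              [0.1, 0.8, 0.1],    # VB
              [0.3, 0.3, 0.4]])   # O
words = [0, 1, 2]                 # the sentence, as vocabulary indices

n_tags, n_words = len(tags), len(words)
C = np.full((n_tags, n_words), -np.inf)   # best log probability per (tag, pos)
D = np.zeros((n_tags, n_words), dtype=int)

# initialization: transitions out of <s>, times the first emission
C[:, 0] = np.log(A[0]) + np.log(B[:, words[0]])

# forward pass: each cell takes the best predecessor
for j in range(1, n_words):
    for i in range(n_tags):
        scores = C[:, j - 1] + np.log(A[1:, i]) + np.log(B[i, words[j]])
        D[i, j] = np.argmax(scores)   # remember which k won
        C[i, j] = scores[D[i, j]]

# backward pass: start at the best cell in the last column, follow D
path = [int(np.argmax(C[:, -1]))]
for j in range(n_words - 1, 0, -1):
    path.append(int(D[path[-1], j]))
best_path = [tags[i] for i in reversed(path)]
print(best_path)  # -> ['NN', 'VB', 'NN']
```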
Two implementation issues
We use a Markov chain to tag the sentence. The core idea is that the probability of a word's tag depends on the tags before it; the goal is to pick the tag sequence that maximizes the probability of the whole sentence under the Markov chain.
- The model must be trained on a corpus. For an unseen word, it can only guess the part of speech from affixes (-ize, -tion). If the affixes do not help either, it can only ignore the emission matrix and move to the tag t_i that maximizes the transition probability.
Week 3 – autocompletion and language models
Definition of language model.
Use cases of a language model
An N-gram is a sequence of N words
sequence notation: w^<end index>_<start index> , both inclusive
Given a sentence, what is the probability of it occurring in the language?
From the equation in the N-gram probability above, we get
P(A, B, C, D) = P(D|A, B, C) * P(A, B, C)
= P(D|A, B, C) * P(C|A, B) * P(A, B)
= P(D|A, B, C) * P(C|A, B) * P(B|A) * P(A)
Problem: the corpus probably doesn't contain the exact subsequence (A, B, C), or even (A, B).
Solution: we approximate the probability with the N-gram assumption.
So, using the bigram approximation, we get P(A, B, C, D) ≈ P(A) * P(B|A) * P(C|B) * P(D|C)
How do we calculate P(A)? We cannot use the frequency of A divided by the word count, since here P(A) means the probability that A occurs at the beginning of the sentence.
We add <s> to mark the start of the sentence. Then P(A) becomes P(A|<s>)
Similarly, we add </s> to mark the end of the sentence; otherwise the bigram probabilities do not normalize, because Σ_w C(x w) ≠ C(x) for a word x that ends a sentence.
N-gram language model
count matrix[i][j] stores the count of the bigram (vocab[i], vocab[j])
With this language model built on top of bi-gram, we can calculate sentence probability and do next word prediction. We can also do sentence generation.
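A minimal sketch of such a bigram model, with <s>/</s> markers as described above (the two-sentence corpus is my own toy example):

```python
# Bigram language model: count bigrams, then score a sentence as the
# product of P(w_i | w_{i-1}) along the chain, including <s> and </s>.
from collections import defaultdict

sentences = [["i", "am", "happy"], ["i", "am", "learning"]]

bigram = defaultdict(int)
unigram = defaultdict(int)
for s in sentences:
    toks = ["<s>"] + s + ["</s>"]
    for a, b in zip(toks, toks[1:]):
        bigram[(a, b)] += 1
        unigram[a] += 1      # counts every token that starts a bigram

def p(word, prev):
    """P(word | prev) estimated from counts."""
    return bigram[(prev, word)] / unigram[prev]

def sentence_prob(s):
    toks = ["<s>"] + s + ["</s>"]
    prob = 1.0
    for a, b in zip(toks, toks[1:]):
        prob *= p(b, a)
    return prob

print(sentence_prob(["i", "am", "happy"]))  # 1 * 1 * 0.5 * 1 = 0.5
```

Next-word prediction with the same counts is just argmax over `p(w, prev)` for all w in the vocabulary.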
How to evaluate the quality of a language model?
We calculate the perplexity, which is related to the entropy of the language model: the lower the perplexity, the better the model.
A good model's perplexity is between 20 and 60.
Some researchers use log perplexity to avoid underflow from multiplying many small values.
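In formula form, PP(W) = P(W)^(-1/m) for a test sequence of m words; in log space this becomes an average of logs. A sketch with made-up per-word bigram probabilities:

```python
# Perplexity computed in log space to avoid underflow.
import math

probs = [0.9, 0.5, 0.8, 0.4]   # assumed P(w_i | w_{i-1}) along a test sentence
m = len(probs)

log_pp = -sum(math.log2(p) for p in probs) / m   # log2 perplexity
pp = 2 ** log_pp                                  # back to plain perplexity

print(round(pp, 3))   # equals (0.9 * 0.5 * 0.8 * 0.4) ** (-1/4)
```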
Out of vocabulary words
The vocabulary from the training set might not contain all words in the test set or the real world.
To handle out-of-vocabulary words, we first build the vocabulary using one of the following criteria:
- define a minimum word frequency f, and replace a word with <UNK> in the training set if its frequency is less than f
- define a vocabulary size V, order the words by frequency, and treat the low-frequency words outside the top V as <UNK>
Note: use <UNK> sparingly, and only compare perplexities of LMs with the same vocabulary V.
Thought: essentially we collapse all unknown words into one special token class, and also map low-frequency words in the training set to <UNK>, so the model can learn the pattern of <UNK> itself.
Problem: some n-gram pairs might be missing in the training set.
Solution: Laplacian smoothing (add-1, or add-k smoothing)
The corpus should be large enough that the "+1" does not distort its distribution.
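Add-k smoothing in one line: the bigram count gets +k in the numerator, and the denominator grows by k·|V| so the probabilities still sum to 1. A sketch (the function name is my own):

```python
# Add-k (Laplacian when k=1) smoothed bigram probability:
# P(w | prev) = (C(prev, w) + k) / (C(prev) + k * |V|)
def p_add_k(bigram_count, prev_count, vocab_size, k=1.0):
    return (bigram_count + k) / (prev_count + k * vocab_size)

# An unseen bigram no longer gets zero probability:
print(p_add_k(bigram_count=0, prev_count=10, vocab_size=1000, k=1.0))  # 1/1010
```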
The other solution is backoff: we estimate the N-gram probability from the (N-1)-gram with a discount value (stupid backoff uses the magic number 0.4).
Another alternative to backoff is interpolation.
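Interpolation mixes all the N-gram orders with weights λ that sum to 1, rather than falling back only when the higher order is missing. For a trigram model:

```latex
\hat{P}(w_n \mid w_{n-2}\, w_{n-1}) =
  \lambda_1\, P(w_n \mid w_{n-2}\, w_{n-1})
+ \lambda_2\, P(w_n \mid w_{n-1})
+ \lambda_3\, P(w_n),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```

The λ values are tuned on held-out data.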
Question: if we already know P(chocolate|John drinks), why do we still calculate P^(chocolate|John drinks)?
Week3 final thoughts
- We assume that morphological variation has no or little effect on sentiment (happy, happ, happiness have the same or similar meaning).
Week 4 – word embeddings
One-hot vector embedding: simple, but no semantic relationship is embedded.
Basic Word embedding methods
- continuous bag of words (CBOW) – shallow neural network
- continuous skip-gram / skip-gram with negative sampling (SGNS)
Global Vectors (GloVe) (Stanford, 2014)
fastText (Facebook, 2016)
- supports out-of-vocabulary (OOV) words
Advanced word embedding methods
Deep learning, contextual embeddings
- BERT(Google, 2018)
- ELMo (Allen Institute for AI, 2018)
- GPT-2 (OpenAI, 2018)
Tunable pre-trained models are available on the internet.
Thoughts: word2vec, GloVe, and fastText map a single word to a vector, regardless of its context.
BERT, ELMo, and GPT-2 map a word within a sentence to a vector; with these methods you cannot map a word to a vector without its context.
Continuous bag of words word embedding process
Assumption: semantically similar words are surrounded by similar words.
Cleaning and tokenization matters
- letter case – convert everything to lowercase
- numbers – drop them or convert them to a special token <NUMBER>, depending on your use case
- punctuation – convert all stopping punctuation, such as ? ! !! ??, to '.'
- drop all other non-stopping punctuation, such as " ' ,
- special characters – emoji for example.
Mapping words to vectors
Given the corpus “I am happy because I am learning” and half window size = 2. The context words are “am happy I am” and the center word is “because”.
We average the one hot vectors of the context word as the input vector.
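The averaging step above, on the course's own example sentence (the variable names are my own):

```python
# CBOW input construction: average the one-hot vectors of the context
# words around the center word "because" (half window size = 2).
import numpy as np

corpus = "i am happy because i am learning".split()
vocab = sorted(set(corpus))            # ['am', 'because', 'happy', 'i', 'learning']
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

center_pos, half_window = 3, 2         # center word is "because"
context = corpus[center_pos - half_window:center_pos] + \
          corpus[center_pos + 1:center_pos + 1 + half_window]

x = np.mean([one_hot(w) for w in context], axis=0)   # network input
y = one_hot(corpus[center_pos])                      # target center word

print(context)  # ['am', 'happy', 'i', 'am']
print(x)        # 'am' appears twice -> 0.5; 'happy' and 'i' -> 0.25 each
```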
Question: why use the averaged one-hot vector as the input, rather than a randomly initialized vector?
Cross entropy loss
We only care whether ŷ is high for the correct label y; we want to maximize that ŷ.
The multiclass cross-entropy loss is a general form of the binary cross-entropy.
More detailed explanation can be found here: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#cross-entropy
The multiclass and binary cross-entropy loss formulas look different, but they are the same thing: the binary formula is a special case of the multiclass one.
It is special in that if ŷ is the prediction for class 1, then 1 − ŷ is the prediction for class 2; the binary indicator y_{o,c} decomposes into y and 1 − y.
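The reduction written out: with C = 2 classes, setting y_1 = y, y_2 = 1 − y and ŷ_1 = ŷ, ŷ_2 = 1 − ŷ collapses the multiclass sum to the binary form:

```latex
J = -\sum_{c=1}^{C} y_c \log \hat{y}_c
\quad\xrightarrow{\;C = 2\;}\quad
J = -\bigl( y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \bigr)
```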
Evaluating the word embeddings
Intrinsic evaluation – test the relative relationships between words
Clustering – semantically similar words should be close together; inspect the clusters manually.
Visualizing – manually check whether semantically similar words appear near each other.
Extrinsic evaluation – test the word embeddings on an external task
e.g. named entity recognition, parts-of-speech tagging.
+ evaluates the actual usefulness of the word embeddings
– time consuming
– hard to troubleshoot when the performance is poor
Word embeddings cannot handle polysemy. In reality a word's meaning depends on its context: write the meaning as a function P, the word as W, and the context as C. Word embeddings assume P(W), but in reality it is P(W, C).