bash FT_installation.bash
WARNING: the tool will be installed by default into the "tools" directory located in your home directory.
source ~/.bashrc
fasttext --help
The result should print the help of the fasttext command.
The goal is to train your first word embedding models.
WARNING: Read carefully the documentation available here: https://github.com/facebookresearch/fastText
Exercise 1: Train two models for each dataset, with a vector dimension of 50, 5 iterations of training, and a window size of 15.
Q1: Train a cbow model, explain your command line (1pt)
Q2: Train a skip-gram model, explain your command line (1pt)
Q3: Give an example of the vector file and describe it. (1pt)
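The exact command lines are part of the answers to Q1 and Q2, but as a starting point, here is a hedged sketch using the fastText CLI with the required hyperparameters. The corpus file name data.txt is a placeholder; replace it with your actual dataset.

```shell
# Hypothetical corpus file name -- replace with your actual dataset.
CORPUS=data.txt
if command -v fasttext >/dev/null 2>&1; then
    # CBOW model: -dim 50 (vector size), -epoch 5 (training iterations), -ws 15 (window size)
    fasttext cbow -input "$CORPUS" -output model_cbow -dim 50 -epoch 5 -ws 15
    # Skip-gram model with the same hyperparameters
    fasttext skipgram -input "$CORPUS" -output model_sg -dim 50 -epoch 5 -ws 15
else
    echo "fasttext not found: run the installation step first"
fi
```

Each run produces a binary model (.bin) and a text vector file (.vec), the latter being the file you will load in the next part.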
Loading the WE is not difficult in itself.
To make computations fast, we store the whole set of embeddings in a numpy array of shape (num words, dimensions).
We also build a dictionary (vocab) mapping each word to its row index in this array, and a list (rev_vocab) mapping indices back to word forms.
import numpy as np

def load(filename):
    vocab = {}
    rev_vocab = []
    lines = open(filename).readlines()
    header = lines[0].split(" ")
    vectors = np.zeros((int(header[0]), int(header[1])))
    for i, line in enumerate(lines):
        tokens = line.strip().split(" ")
        if i > 0:
            vocab[tokens[0]] = i - 1
            rev_vocab.append(tokens[0])
            vectors[i - 1] = [float(value) for value in tokens[1:]]
    return vocab, rev_vocab, vectors
Exercise 1: Loading
Q1: Write in Python the script needed to load the WE. (1pt)
Q2: What do "vocab", "rev_vocab" and "vectors" stand for? (1pt)
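To check your loader, you can run it on a tiny hand-written vector file. This sketch assumes the standard fastText .vec text format (a "num_words dimensions" header line, then one "word v1 ... vd" line per word); the file name toy.vec and its contents are made up for the test.

```python
import numpy as np

def load(filename):
    # Parse a fastText .vec file: header "num_words dim", then one word per line.
    vocab = {}          # word -> row index in the vectors array
    rev_vocab = []      # row index -> word form
    lines = open(filename).readlines()
    header = lines[0].split(" ")
    vectors = np.zeros((int(header[0]), int(header[1])))
    for i, line in enumerate(lines):
        tokens = line.strip().split(" ")
        if i > 0:  # skip the header line
            vocab[tokens[0]] = i - 1
            rev_vocab.append(tokens[0])
            vectors[i - 1] = [float(value) for value in tokens[1:]]
    return vocab, rev_vocab, vectors

# Tiny synthetic vector file for testing (2 words, 3 dimensions).
with open("toy.vec", "w") as f:
    f.write("2 3\n")
    f.write("dog 1.0 0.0 0.5\n")
    f.write("cat 0.9 0.1 0.4\n")

vocab, rev_vocab, vectors = load("toy.vec")
print(vectors.shape)  # (2, 3)
```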
Exercise 2: Compute the cosine similarity for each model and each dataset
Q1: Explain what the cosine similarity is. (1pt)
Q2: How do you compute it in Python using numpy? And using scipy? (2pt)
Q3: What is the cosine similarity between the vectors representing "dog" and "cat", and what about "dog" and "dentist"? (1pt)
Q4: What is the closest word to "bank": "river" or "trade"? (1pt)
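As a hint for Q2: cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal sketch with made-up example vectors follows; note that scipy.spatial.distance.cosine returns a *distance*, i.e. 1 minus the similarity.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product normalized by both vector norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 2.0, 1.0])

np_sim = cosine_sim(a, b)  # 10 / 14

try:
    from scipy.spatial.distance import cosine as cosine_dist
    sp_sim = 1 - cosine_dist(a, b)  # scipy gives the distance, so invert it
except ImportError:
    sp_sim = None  # scipy not installed
```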
Now that you have your WE model, you will use a trick to compute all the closest words.
This can be done once and for all by computing the dot product between the matrix containing all
vectors and the transpose of the target word vector (np.dot(vectors, v.T)).
Then we can use a numpy trick to recover the indices of the n highest scores.
def closest(vectors, vector, n=10):
    n = n + 1
    scores = np.dot(vectors, vector.T)
    indices = np.argpartition(scores, -n)[-n:]
    indices = indices[np.argsort(scores[indices])]
    output = []
    for i in [int(x) for x in indices]:
        output.append((scores[i], i))
    return reversed(output)
Exercise 1: Code the function
Q1: What preprocessing do you NEED to apply to all the vectors before using the dot product instead of the cosine? (3pt)
Q2: What does "argpartition" do? (1pt)
Q3: Add comments to the code. (1pt)
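A hint for Q1: the dot product of two vectors equals their cosine similarity only when both are unit-length, so the embedding matrix must be L2-normalized row by row beforehand. The sketch below uses made-up toy vectors and a lightly adapted version of the closest function above (it returns a list instead of an iterator); the normalization step is the key point.

```python
import numpy as np

def closest(vectors, vector, n=10):
    n = n + 1  # the query word itself will appear in the results
    scores = np.dot(vectors, vector.T)
    indices = np.argpartition(scores, -n)[-n:]
    indices = indices[np.argsort(scores[indices])]
    return [(scores[i], int(i)) for i in reversed(indices)]

# Toy embedding matrix (4 words, 3 dimensions) -- hypothetical values.
vectors = np.array([[1.0, 0.0, 0.5],
                    [0.9, 0.1, 0.4],
                    [0.0, 1.0, 0.2],
                    [0.1, 0.9, 0.3]])

# L2-normalize every row so that dot product == cosine similarity.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
vectors = vectors / norms

# Nearest neighbors of row 0: row 0 itself (score 1.0) comes first.
results = closest(vectors, vectors[0], n=2)
```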
Exercise 2: Analysis
Q1: What are the closest words to "apple"? (1pt)
Q2: What about the neighborhoods of other words? (1pt)
Q3: Can you find words that have a strange neighborhood? (1pt)
Q4: Check your answers using scipy.spatial.distance.cosine. Given the scores obtained, what can you conclude about the dot product?
Word analogies can be exposed by translating a word vector in a direction that corresponds to a linguistic or semantic relationship between two other words. If w1 and w2 are in a relation R(w1, w2), we can compute the relation vector r = vec(w2) - vec(w1) and then apply it to the vector of another word, vec(w3). The word closest to vec(w3) + r should exhibit the same relation. The idea is therefore to use the closest function to find vectors similar to vec(w2) - vec(w1) + vec(w3).
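The vector arithmetic above can be sketched on hand-crafted toy embeddings. Everything here (the word list and the vector values) is made up so that the king/queen analogy holds exactly; a real run would use the loaded fastText vectors and the closest function.

```python
import numpy as np

# Toy embeddings chosen so that king - man + woman == queen exactly.
words = ["man", "woman", "king", "queen", "dog"]
vocab = {w: i for i, w in enumerate(words)}
vectors = np.array([[1.0, 0.0, 0.0],   # man
                    [0.0, 1.0, 0.0],   # woman
                    [1.0, 0.0, 1.0],   # king
                    [0.0, 1.0, 1.0],   # queen
                    [0.3, 0.3, 0.0]])  # dog (distractor)

# L2-normalize rows so that dot product == cosine similarity.
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def analogy(w1, w2, w3):
    # "w1 is to w2 what w3 is to ?": r = vec(w2) - vec(w1),
    # then find the word closest to vec(w3) + r,
    # excluding the three input words themselves.
    target = vectors[vocab[w2]] - vectors[vocab[w1]] + vectors[vocab[w3]]
    scores = np.dot(vectors, target)
    for i in np.argsort(-scores):          # indices sorted by descending score
        if words[i] not in (w1, w2, w3):
            return words[i]

print(analogy("man", "king", "woman"))  # queen
```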
Exercise 1: Code the analogy function
Q1: Can you reuse the previous code? To what extent? (1pt)
Exercise 2: Solve the analogies
Q1: "paris" is to "france" what "delhi" is to ... ? (1pt)
Q2: "gates" - "microsoft" + "apple" = ... ? (1pt)
Q3: "king" - "man" + "woman" = ... ? (1pt)
Q4: "slow" - "slower" + "fast" = ... ? (1pt)
Exercise 3: Bonus
Q1: Increase the dimension of the WE to 300 and retrain the models. (1pt)
Q2: What is the impact on the analogies? (1pt)