In this demo you will learn how to perform PoS tagging using a Machine Translation approach.
This is a common way to proceed, since Statistical Machine Translation approaches were widely used for this kind of task.
git clone https://github.com/QwantResearch/OpenNMT-tf.git
Then
cd OpenNMT-tf ; python3 -m pip install --force .
python3 -m pip install subword-nmt
tar xvfz tp3_files.tgz
or
gunzip tp3_files.tgz
and then
tar xvf tp3_files.tar
You should send me your report in PDF format the day before the next session, at the address christophe_dot_servan_at_epita_dot_fr (replace _dot_ and _at_ with "." and "@", respectively).
Remarks:
The main idea is to fit the raw data to the expected format. You will find additional information in the documentation here.
WARNING: Read the documentation carefully. If the file is not in the right format, it will not work.
Bitexts are sentence-aligned texts, generally split into two files. In our case, the number of words per line has to be the same in both files. Here is an example:
==> train.src (source) <==
Confidence in the pound is widely expected to take another sharp dive if trade figures for September , due for release tomorrow , fail to show a substantial improvement from July and August 's near-record deficits .
Chancellor of the Exchequer Nigel Lawson 's restated commitment to a firm monetary policy has helped to prevent a freefall in sterling over the past week .
But analysts reckon underlying support for sterling has been eroded by the chancellor 's failure to announce any new policy measures in his Mansion House speech last Thursday .

==> train.tgt (target) <==
NN IN DT NN VBZ RB VBN TO VB DT JJ NN IN NN NNS IN NNP PUNCT JJ IN NN NN PUNCT VB TO VB DT JJ NN IN NNP CC NNP POS JJ NNS PUNCT
NNP IN DT NNP NNP NNP POS VBN NN TO DT NN JJ NN VBZ VBN TO VB DT NN IN NN IN DT JJ NN PUNCT
CC NNS VBP VBG NN IN NN VBZ VBN VBN IN DT NN POS NN TO VB DT JJ NN NNS IN PRP$ NNP NNP NN JJ NNP PUNCT
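Misaligned line pairs will break training, so it is worth checking the bitext before going further. Here is a minimal sketch of such a check, demonstrated on a deliberately misaligned toy sample (the file names and data below are illustrative, not part of the TP files):

```shell
# Build a tiny, deliberately misaligned sample to demonstrate the check.
printf 'the cat sat\ndogs bark\n' > sample.src
printf 'DT NN VBD\nNNS\n' > sample.tgt

# Report every line whose source and target token counts differ.
paste sample.src sample.tgt | awk -F'\t' '
  { n = split($1, s, " "); m = split($2, t, " ")
    if (n != m) printf "line %d: %d words vs %d tags\n", NR, n, m }'
# prints: line 2: 2 words vs 1 tags
```

On the real files, run the same pipeline with your train, valid and test bitexts; it should print nothing.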
Exercise 1: You have to create bitexts for the train, valid and test sets, composed of words (source) and labels (target).
Q1: Give the bash commands to transform the original data into bitexts; the valid set should be an extract of 1000 lines from the train set. (1pt)
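To illustrate the kind of split this question asks for, here is a sketch using head/tail on a dummy bitext; the file names and the generated data are assumptions, and in the TP you would start from the real train files:

```shell
# Generate a dummy 3000-line bitext just to illustrate the split
# (in the TP, train.src/train.tgt come from the original data).
seq 3000 | sed 's/^/word /' > train.src
seq 3000 | sed 's/^/TAG /'  > train.tgt

# The first 1000 lines become the validation set...
head -n 1000 train.src > valid.src
head -n 1000 train.tgt > valid.tgt
# ...and the remaining lines stay in the training set.
tail -n +1001 train.src > train.filtered.src
tail -n +1001 train.tgt > train.filtered.tgt
```

Always split the source and target files with the same line offsets, otherwise the bitext alignment is lost.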
Q2: Then, extract three vocabularies from the train set using onmt-build-vocab; use the char_tokenization.yml file for the character tokenization. (2pts)
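As a pointer, a possible set of onmt-build-vocab invocations is sketched below; the output file names are assumptions, and the option names should be double-checked against onmt-build-vocab --help in your installed OpenNMT-tf version:

```shell
# Hypothetical invocations: a word vocabulary for the source, a label
# vocabulary for the target, and a character-level source vocabulary
# built with the provided char_tokenization.yml.
onmt-build-vocab --save_vocab src_vocab.txt train.src
onmt-build-vocab --save_vocab tgt_vocab.txt train.tgt
onmt-build-vocab --tokenizer_config char_tokenization.yml \
                 --save_vocab src_char_vocab.txt train.src
```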
Q3: Give details about the amount of train and test data available for your experiments. (1pt)
The main idea is to train your model using the data you processed.
Exercise 1: Set up the training by editing the config file (given in the additional data for OpenNMT), and then launch the training.
Q1: What is the command line needed to launch the training? (1pt)
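For reference, a typical OpenNMT-tf training launch looks like the sketch below; the config file name and the model type are assumptions to adapt to your own setup:

```shell
# Hypothetical training launch: config.yml is the edited configuration
# file from the additional data.
onmt-main --model_type Transformer --config config.yml --auto_config \
          train --with_eval
```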
Q2: How long was your training? (1pt)
Now that you have your DL model, you will evaluate it.
Remark: You will use the script eval.py to evaluate your results.
Exercise 1: Evaluation on the test set
Q1: Give the necessary commands to launch the evaluation of the model on the test set. (3pts)
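As a starting point, inference and scoring might be chained as in the sketch below; the prediction file name and the eval.py argument order are assumptions, so check the script's actual usage in the TP files:

```shell
# Hypothetical evaluation pipeline: tag the test source with the trained
# model, then score the predictions against the reference labels.
onmt-main --config config.yml --auto_config infer \
          --features_file test.src --predictions_file test.hyp
# Argument order is an assumption; check eval.py's real interface.
python3 eval.py test.hyp test.tgt
```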
Q2: Give the evaluation scores produced by the script eval.py and explain them. (5pts)