Demo 3: Part-of-Speech Tagging, part 3: Deep Learning for NLU

Christophe Servan

Introduction

In this demo you will learn how to perform PoS tagging using a Machine Translation approach.

This is a very common way to proceed, since Statistical Machine Translation approaches were widely used.


Toolkits


Data


Report

You should send me your report, in PDF format, the day before the next session, at the address christophe_dot_servan_at_epita_dot_fr (replace _dot_ and _at_ by a dot and an at sign, respectively).

Remarks:


Work to do

      In this demo you will work with an NMT toolkit to perform PoS tagging:

    I. Creating data

      The main idea is to fit the raw data to the expected format. You will find additional information in the documentation here

      WARNING: Read the documentation carefully. If the file is not in the right format, it won't work.

      Bitexts are sentence-aligned texts, generally split into two files. In our case, the number of words per line has to be the same in both files. Here is an example:

      ==> train.src (source) <==
      Confidence in the pound is widely expected to take another sharp dive if trade figures for September , due for release tomorrow , fail to show a substantial improvement from July and August 's near-record deficits . 
      Chancellor of the Exchequer Nigel Lawson 's restated commitment to a firm monetary policy has helped to prevent a freefall in sterling over the past week . 
      But analysts reckon underlying support for sterling has been eroded by the chancellor 's failure to announce any new policy measures in his Mansion House speech last Thursday . 
      		
      ==> train.tgt (target) <==
      NN IN DT NN VBZ RB VBN TO VB DT JJ NN IN NN NNS IN NNP PUNCT JJ IN NN NN PUNCT VB TO VB DT JJ NN IN NNP CC NNP POS JJ NNS PUNCT 
      NNP IN DT NNP NNP NNP POS VBN NN TO DT NN JJ NN VBZ VBN TO VB DT NN IN NN IN DT JJ NN PUNCT 
      CC NNS VBP VBG NN IN NN VBZ VBN VBN IN DT NN POS NN TO VB DT JJ NN NNS IN PRP$ NNP NNP NN JJ NNP PUNCT 
      		
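      As an illustration, here is one way such bitext files could be built. The input format below is an assumption (a CoNLL-like file with one "word<TAB>tag" pair per line and a blank line between sentences); adapt the script to the actual format of the original data.

```shell
# Assumption: CoNLL-like input, one "word<TAB>tag" pair per line,
# blank line between sentences. Build a small sample to illustrate:
printf 'Confidence\tNN\nin\tIN\n\nBut\tCC\nanalysts\tNNS\n' > sample.conll

# Rebuild one sentence per line: words go to train.src, tags to train.tgt.
awk -F'\t' '
  NF == 0 { print w > "train.src"; print t > "train.tgt"; w = t = ""; next }
  { w = w (w ? " " : "") $1; t = t (t ? " " : "") $2 }
  END { if (w != "") { print w > "train.src"; print t > "train.tgt" } }
' sample.conll
```

      Line i of train.src must stay aligned with line i of train.tgt; a fixed-size valid set can then be carved out of such files with standard tools (e.g. head and tail).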

        Exercise 1: You have to create bitexts for the train, valid and test sets, composed of words (source) and labels (target).

          Q1: Give the bash command to transform the original data into bitexts; the valid set should be an extract of 1000 lines from the train set. (1pt)

          Q2: Then, extract three vocabularies from the train set using onmt-build-vocab; use the char_tokenization.yml file for the character tokenization (2pts):

          • words from the source part of the train set
          • characters from the source part of the train set
          • tags from the target part of the train set
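
          For orientation, the vocabulary extraction could be sketched as follows with OpenNMT-tf. The output file names are assumptions; check the onmt-build-vocab options of your installed version.

```shell
# Sketch; output file names are assumptions, adapt them to your setup.
# Word vocabulary from the source part of the train set:
onmt-build-vocab --save_vocab src_word_vocab.txt train.src
# Character vocabulary from the source part, using the provided config:
onmt-build-vocab --tokenizer_config char_tokenization.yml --save_vocab src_char_vocab.txt train.src
# Tag vocabulary from the target part of the train set:
onmt-build-vocab --save_vocab tgt_tag_vocab.txt train.tgt
```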

          Q3: Give details about the amount of train and test data available for your experiments. (1pt)

    II. Training the DL model

      The main idea is to train your model using the data you processed.

        Exercise 1: Set up the training by editing the config file (given in the additional data for OpenNMT), then launch the training.

          Q1: What is your command line needed to launch the training? (1pt)

          Q2: How long did your training take? (1pt)
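
      As a rough orientation, an OpenNMT-tf training run is typically launched along these lines; the config file name and the model type below are assumptions, and your actual command depends on the config file provided with the demo.

```shell
# Sketch; config file name and model type are assumptions.
onmt-main --model_type Transformer --config config.yml --auto_config train --with_eval
```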

    III. Evaluation

      Now that you have your DL model, you will evaluate it on the test set.

      Remark: You will use the script eval.py to evaluate your results.

        Exercise 1: Evaluation on the test set

          Q1: Give the necessary commands to launch the evaluation of the model on the test set. (3pts)

          Q2: Give the evaluation scores produced by the script eval.py and explain them. (5pts)
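
      For orientation, inference followed by scoring could be sketched as follows; the file names, and in particular the interface of the provided eval.py script, are assumptions to be checked against the demo material.

```shell
# Sketch; file names are assumptions.
onmt-main --config config.yml --auto_config infer --features_file test.src --predictions_file test.out
# The argument order of the provided eval.py is an assumption; check its usage.
python eval.py test.out test.tgt
```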



The End.