Practical: Using Linux commands to analyze text documents & introduction to UNICODE
Inspired by "Unix for Poets" by Kenneth Ward Church, IBM Research, kwchurch (AT) us.ibm.com
Program
Tools
Suggestion: always remember to read a tool's man page if you have any doubts about its parameters and functionality. For example, man wc shows how to count only lines (-l) or only words (-w) in a file.
Input / output redirection
By default, most tools read their input (stdin) from the keyboard and write their output (stdout) to the screen. You can redirect input and output to files using the following operators:
<   | read from an input file
>   | write to an output file
|   | send the output of the previous command to the input of the next command (a pipe)
Data
File for exercises: RADIOS.txt or RADIOS.txt.UTF-8 depending on the operating system available here. This file contains transcripts of radio recordings: Franceinter, RFI,...
For each exercise, in addition to answering the questions, you must also explain the sequence of commands used and show an extract of the result using the commands head and/or less.
Exercises
Exercise 0: Transform all RADIOS.txt text into capital letters
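A minimal sketch of one possible answer (a sample string stands in for RADIOS.txt; substitute `< RADIOS.txt` for the printf):

```shell
# uppercase a text with tr; note that the range a-z covers ASCII letters
# only, so accented characters such as "é" are left unchanged
printf 'bonjour le monde\n' | tr 'a-z' 'A-Z'
```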
Exercise 1: Find the sequence of instructions that allows you to count words in a text
Help: use tr, sort and uniq, think of "piping" (|) the instructions
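The hint above can be sketched as the classic "Unix for Poets" pipeline (a sample sentence stands in for RADIOS.txt):

```shell
# put one word per line (tr -sc replaces every run of non-letters with a
# newline), group identical words with sort, count the runs with uniq -c,
# then order by decreasing frequency
printf 'the cat and the hat\n' \
  | tr -sc 'A-Za-z' '\n' \
  | sort \
  | uniq -c \
  | sort -nr
```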
Exercise 2: Sort the words (sort)
See the man page of sort.
Examples
sort -d      | dictionary order
sort -f      | fold case (ignore upper/lower case)
sort -n      | numeric order
sort -nr     | reverse numeric order
sort -k 1    | start at field 1 (fields are numbered from 1)
sort -k 1.50 | start at the 50th character of the line
sort -k 1.5  | start at the 5th character of field 1
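A small illustration of why the -n flag matters when sorting the word counts from Exercise 1 (a hypothetical "count word" table stands in for the real output):

```shell
# numeric descending sort; a plain (lexicographic) sort would place
# "10" before "3" because it compares character by character
printf '3 le\n10 de\n1 radio\n' | sort -nr
```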
Exercise 3: Co-occurrences of words or bigrams
Find and count all the bigrams in the text RADIOS.txt.
Do the same for n-grams (n = 2,3,4).
Help: use the tail and paste commands
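The tail/paste idea can be sketched as follows (a sample sentence stands in for RADIOS.txt; the temporary file names are illustrative):

```shell
# one word per line, then the same list shifted by one word,
# pasted side by side to form the bigrams
printf 'to be or not to be\n' | tr -sc 'A-Za-z' '\n' > /tmp/w1
tail -n +2 /tmp/w1 > /tmp/w2     # drop the first word
paste /tmp/w1 /tmp/w2 \
  | sort | uniq -c | sort -nr    # note: the final word is paired with an empty field
```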
Exercise 4: filters (grep)
See the man page of grep.
Examples
grep '[A-Z]'                          | lines containing a capital letter
grep '^[A-Z]'                         | lines starting with a capital letter
grep '[A-Z]$'                         | lines ending with a capital letter
grep '^[A-Z]*$'                       | lines made up entirely of capital letters
grep '[aeiouAEIOU]'                   | lines containing a vowel
grep '^[aeiouAEIOU]'                  | lines starting with a vowel
grep '[aeiouAEIOU]$'                  | lines ending with a vowel
grep -i '[aeiou]'                     | lines containing a vowel
grep -i '^[aeiou]'                    | lines starting with a vowel
grep -i '[aeiou]$'                    | lines ending with a vowel
grep -i '^[^aeiou]'                   | lines starting with a non-vowel
grep -i '[^aeiou]$'                   | lines ending with a non-vowel
grep -i '[aeiou].*[aeiou]'            | lines with at least two vowels
grep -i '^[^aeiou]*[aeiou][^aeiou]*$' | lines with exactly one vowel
With regular expressions
a             | the letter "a"
[a-z]         | a lowercase letter
[A-Z]         | a capital letter
[0-9]         | a digit
[0123456789]  | a digit
[aeiouAEIOU]  | a vowel
[^aeiouAEIOU] | anything but a vowel
.             | any single character
^             | start of line
$             | end of line
x*            | "x" repeated 0 or more times
x+            | "x" repeated 1 or more times (egrep only)
x|y           | "x" or "y" (egrep only)
(x)           | grouping, so that an operator applies to the whole group (egrep only)
Exercise 5: Awk language
awk is a language whose syntax is similar to C, and which performs operations on the fields of a file where each line has the form "field1 field2 field3 field4 ...".
Example
It is possible to write the program in a file and then call it, for example, by typing SelectPremierChamp.awk <file
The available comparison and logical operators are: >, <, >=, <=, ==, !=, &&, ||
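An inline sketch of the idea (the condition and sample data are illustrative, not the contents of SelectPremierChamp.awk):

```shell
# print field 1 of every line whose numeric field 2 exceeds 5;
# the part before the braces is the condition, the part inside the action
printf 'de 12\nle 3\nradio 7\n' | awk '$2 > 5 {print $1}'
```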
Exercise 6: Replacement with sed
See the man page of sed.
The sed tool allows you to replace text using regular expressions. The command sed 's/exp1/exp2/[options]' substitutes (s) the expression exp1 with the expression exp2. The option is often the letter g, meaning that all occurrences on the same line should be replaced. For example:
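A minimal illustration on a sample line:

```shell
# substitute every (g) occurrence of "radio" with "RADIO";
# without g, only the first occurrence on each line would change
printf 'la radio et la radio\n' | sed 's/radio/RADIO/g'
```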
Exercise 7: Union of files with join
A measure of association is a numerical value which estimates the degree of association between two words. This measure can be used to automatically detect "collocations", i.e. sequences of words which appear more often than one would expect by chance. An often-used measure of association is PMI (pointwise mutual information), defined as follows for a bigram formed by the words w1 and w2:
PMI = c(w1 w2) · log [ c(w1 w2) / E(w1 w2) ]
where c(w1 w2) is the number of occurrences of the bigram w1 w2, as computed in Exercise 3, and E(w1 w2) is the expected number of occurrences of this bigram, defined as:
E(w1 w2) = c(w1) · c(w2) / N
where c(w1) is the number of occurrences of the first word, c(w2) is the number of occurrences of the second word, and N is the total number of words in the text.
For each bigram in the text, calculate its PMI association value according to the given formula. Each line of the output file must have the form "w1 w2 c(w1 w2) c(w1) c(w2) PMI".
Help: use the join command to combine the information from the files of Exercises 2 and 3, but be careful to sort the files on the join field(s). To avoid warnings, set the LC_ALL variable to C before any sort and join command, for example: LC_ALL=C sort RADIOS.hist
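A sketch of the join step on miniature stand-in files (the file names and contents are hypothetical; your histogram and bigram files from Exercises 2 and 3 take their place):

```shell
# a word histogram and a bigram list, both keyed on their first field
printf 'de 10\nle 3\n'        > /tmp/hist
printf 'de la 4\nle chat 1\n' > /tmp/big
# sort both files on the join field with the same C locale as join
LC_ALL=C sort -k 1,1 /tmp/hist -o /tmp/hist
LC_ALL=C sort -k 1,1 /tmp/big  -o /tmp/big
# each output line: key, remaining fields of file 1, remaining fields of file 2
LC_ALL=C join /tmp/hist /tmp/big
```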
Exercise 8: observing the encodings
Open the files testFR.txt.UTF-8, testRU.txt, testGK.txt and testAR.txt with a web browser and change the character encoding used for display (UTF-8, UTF-16, Western ISO-8859-1, Arabic Windows-1256, etc.):
The files can be found here.
Questions:
Open the testFR.txt, testRU.txt, testGK.txt and testAR.txt files with office software such as MS-Word or Libre Office. Same questions.
Exercise 9: analysis of encodings
Compare the different Unicode layers for the file testFR.txt.UTF-8, in particular:
Questions:
Help: to find out which glyph corresponds to a code point, type /usr/bin/printf "\uXXXX\n", replacing XXXX with the code point of the glyph.
Exercise 10: encoding and place occupied
Compare the files testFR.txt.UTF-8 and testFR.txt.iso8859-1
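The comparison can be previewed on a tiny stand-in file (the test files from the exercise replace /tmp/fr.utf8 in practice): "é" occupies two bytes in UTF-8 but only one in ISO-8859-1, so the Latin-1 version is smaller.

```shell
# write a small UTF-8 file, convert it, and compare byte counts with wc -c
printf 'été\n' > /tmp/fr.utf8
iconv -f UTF-8 -t ISO-8859-1 /tmp/fr.utf8 > /tmp/fr.latin1
wc -c /tmp/fr.utf8 /tmp/fr.latin1   # 6 bytes vs 4 bytes
```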
Exercise 11: conversion to UTF-8
Help: for Arabic, the character encoding of the file is in windows format, called cp1256.
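On the real file the conversion would look like `iconv -f CP1256 -t UTF-8 testAR.txt > testAR.txt.UTF-8`. A self-contained way to check that the encoding name is accepted is to round-trip an Arabic string through cp1256:

```shell
# UTF-8 -> cp1256 -> UTF-8; the string comes back unchanged because
# basic Arabic letters all exist in Windows-1256
printf 'سلام\n' | iconv -f UTF-8 -t CP1256 | iconv -f CP1256 -t UTF-8
```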
Exercise 12: conversion from UTF-8
Convert the file testFR.txt.UTF-8 to UTF-16 with iconv. Visualize the change with hexedit.
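The conversion, applied here to a small stand-in file; hexedit is interactive, so od is used below to show the same bytes non-interactively. Each character of "été\n" takes two bytes in UTF-16, and iconv typically prepends a two-byte byte-order mark.

```shell
printf 'été\n' > /tmp/fr.utf8
iconv -f UTF-8 -t UTF-16 /tmp/fr.utf8 > /tmp/fr.utf16
# dump the bytes in hex (offset in hex, one byte per column)
od -A x -t x1 /tmp/fr.utf16
```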