02 - UNIX - Reading: 1 UNIX Commands For Data Scientists
shakespeare.txt
env: filename=./unix/shakespeare.txt
If you are instead running in a shell, you can just define a shell variable named filename with
this syntax:
filename=./unix/shakespeare.txt
Make sure that there are no spaces around the equal sign.
We can verify that the variable is now defined by printing it out with echo. For the rest of this
reading we will use this variable to point to the filename.
./unix/shakespeare.txt
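As a minimal sketch of the assignment rule above, the following defines the variable and prints it. The path is the one used in this reading; nothing here checks that the file actually exists.

```shell
# Define a shell variable: no spaces are allowed around "=".
filename=./unix/shakespeare.txt

# Quote the expansion so that paths containing spaces stay in one piece.
echo "$filename"

# With spaces around "=", the shell would instead try to run a command
# named "filename" with "=" as its first argument:
#   filename = ./unix/shakespeare.txt   # fails: command not found
```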
1.3 head
head prints lines from the top of a file. You can specify how many with -n. What happens
if you don’t specify a number of lines?
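To experiment without the full text, here is a small sketch on a throwaway file. The 12-line sample and the variable names are made up for illustration.

```shell
# Build a hypothetical sample file with 12 numbered lines.
sample=$(mktemp)
seq 1 12 > "$sample"

# Ask explicitly for the first 3 lines.
first_three=$(head -n 3 "$sample")
echo "$first_three"

# Without -n, head falls back to its default of 10 lines.
default_count=$(head "$sample" | wc -l)
echo "$default_count"

rm -f "$sample"
```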
1.4 tail
In [5]: !tail -n 10 $filename
1.5 wc
wc, which stands for word count, prints the number of lines, words and bytes in a file (bytes
are the same as characters for plain ASCII text).
You can specify -l to print only the number of lines. To see the other options, execute (in Git
Bash on Windows or on Linux):
wc --help
or (on Mac or on Linux):
man wc
124505 ./unix/shakespeare.txt
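A short sketch of the counting options on a two-line sample. The sample text and variable names are invented for this example.

```shell
# Hypothetical two-line sample file.
sample=$(mktemp)
printf 'to be or not to be\nthat is the question\n' > "$sample"

# Default output: lines, words, bytes, then the file name.
wc "$sample"

# -l keeps only the line count and -w only the word count.
# Reading from standard input suppresses the file name.
line_count=$(wc -l < "$sample")
word_count=$(wc -w < "$sample")
echo "$line_count lines, $word_count words"

rm -f "$sample"
```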
1.6 cat
You can use pipes, written |, to stream the output of one command into the input of another. This
is useful for composing many tools to produce a more complicated result.
For example cat dumps the content of a file, then we can pipe it to wc:
124505
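The same idea on a tiny made-up file: the pipe and the input redirection below should report the same count, since in both cases wc reads the file content from its standard input.

```shell
# Hypothetical three-line sample file.
sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"

# cat writes the file to its output; the pipe feeds that into wc.
piped=$(cat "$sample" | wc -l)

# Equivalent without cat: the shell connects the file to wc's input.
direct=$(wc -l < "$sample")

echo "$piped $direct"
rm -f "$sample"
```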
1.7 grep
grep is an extremely powerful tool for finding text in one or more files. For example, the next
command looks for all the lines that contain a word. We also specify -i to request
case-insensitive matching, i.e. matching regardless of upper or lower case.
We can combine grep and wc to count the number of lines in a file that contain a specific word:
72
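A sketch of the grep and wc combination on four invented lines. It also shows that a plain pattern matches inside longer words, which -w prevents, and that -c counts matching lines without needing wc.

```shell
# Hypothetical sample: "king" appears in three lines, once inside "Kingdoms".
sample=$(mktemp)
printf 'The king is here\nLong live the KING\nA queen arrives\nKingdoms fall\n' > "$sample"

# Count lines containing "king" in any case, substring matches included.
match_count=$(grep -i 'king' "$sample" | wc -l)

# -c asks grep itself to count the matching lines.
same_count=$(grep -ic 'king' "$sample")

# -w restricts matches to whole words, so "Kingdoms" no longer counts.
whole_word_count=$(grep -iwc 'king' "$sample")

echo "$match_count $same_count $whole_word_count"
rm -f "$sample"
```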
1.8 sed
sed is a powerful stream editor. Like grep, it processes its input line by line and uses regular
expressions, a language for describing text patterns, but it can also modify the text that it
outputs, for example by replacing each match.
For example:
s/from/to/g
means:
• s for substitution
• from is the word to match
• to is the replacement string
• g specifies to apply this to all occurrences on a line, not just the first
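To see what the g flag changes, here is a small sketch on a made-up sentence. The words "cat" and "dog" are illustrative stand-ins for from and to.

```shell
# Hypothetical input line with two occurrences of "cat".
line='the cat sat near the other cat'

# Without g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/cat/dog/')

# With g, every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/cat/dog/g')

echo "$first_only"
echo "$all"
```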
Then we are checking with grep that temp.txt contains the word “manuscript”:
1.9 sort
In [13]: !head -n 5 $filename
We can sort the first 5 lines of the file in alphabetical order. Note that sort compares whole
lines, character by character, starting from the first letter of each line:
In [14]: !head -n 5 $filename | sort
We can instead sort on the second word of each line: -t' ' sets the delimiter to a space, and
-k2 starts the sort key at column 2.
Therefore we are sorting on “is, of, presented, releases”.
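A minimal sketch of the difference between the two orderings, on three invented lines. LC_ALL=C is our addition to make the ordering independent of the local language settings.

```shell
# Hypothetical sample: first and second columns sort differently.
sample=$(mktemp)
printf 'b zebra\nc apple\na mango\n' > "$sample"

# Default: compare whole lines from the first character.
by_line=$(LC_ALL=C sort "$sample")

# Space-delimited, with the key starting at column 2.
by_second=$(LC_ALL=C sort -t' ' -k2 "$sample")

echo "$by_line"
echo "$by_second"
rm -f "$sample"
```

Note that -k2 runs from column 2 to the end of the line; -k2,2 would restrict the key to column 2 alone.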
124505
110834
sed -e 's/ /\n/g' -e 's/\r//g'
with:
In [18]: !sed -e 's/ /\n/g' -e 's/\r//g' $filename | sed '/^$/d'| sort | uniq -c |
23244 the
19542 I
18302 and
15623 to
15551 of
12532 a
10824 my
9576 in
9081 you
7851 is
7531 that
7068 And
6948 not
6722 with
6218 his
sort: write failed: 'standard output': Broken pipe
sort: write error
Do not worry about the Broken pipe error: head closes the pipe after the first 15 lines, and
sort complains that it still has text to write.
sed is making 2 replacements. The first replaces each space with \n, which is the symbol for a
newline character. In effect this splits the text so that each word is on a separate line. See for
yourself below!
The second replacement is more subtle: shakespeare.txt uses the Windows convention of
\r\n to mark the end of a line. \r is a carriage return. We want to get rid of it, so we
replace it with nothing.
This
is
the
100th
Etext
file
presented
by
Project
Gutenberg,
sed: couldn't write 48 items to stdout: Broken pipe
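The two replacements can also be tried on a short input with Windows line endings. This sketch assumes GNU sed, which understands \n in the replacement and \r in the pattern; the tr variant is a portable alternative that we add for comparison, not something the reading uses.

```shell
# Split on spaces and drop carriage returns (GNU sed assumed).
words=$(printf 'This is the\r\n100th Etext\r\n' | sed -e 's/ /\n/g' -e 's/\r//g')

# Portable alternative: tr deletes \r, then turns spaces into newlines.
words_tr=$(printf 'This is the\r\n100th Etext\r\n' | tr -d '\r' | tr ' ' '\n')

echo "$words"
```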
Next we are not interested in counting empty lines, so we can remove them with:
sed '/^$/d'
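In this expression, ^$ matches a line with nothing between its start and its end, and d deletes it. A small sketch on invented input:

```shell
# Hypothetical input: 3 words separated by 3 empty lines.
before=$(printf 'alpha\n\nbeta\n\n\ngamma\n' | wc -l)

# Delete every completely empty line.
cleaned=$(printf 'alpha\n\nbeta\n\n\ngamma\n' | sed '/^$/d')
after=$(printf '%s\n' "$cleaned" | wc -l)

echo "$before lines before, $after lines after"
```

Lines that contain only spaces are not empty in this sense and would be kept.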
In [20]: !sed -e 's/ /\n/g' -e 's/\r//g' $filename | sed '/^$/d' | sort | uniq -c
1 __
9 -
2 ?
1 /
51 .
241 "
1 (~),
1 (_)
1 (*)
14 [
uniq: write error: Broken pipe
Good, we have now counted the words. Next we need to sort the counts, in numeric order
instead of alphabetical order, so we specify -n. We also want reverse order with -r, so that
the biggest counts come first!
And finally we take the first 15 lines:
In [21]: !sed -e 's/ /\n/g' -e 's/\r//g' $filename | sed '/^$/d' | sort | uniq -c |
23244 the
19542 I
18302 and
15623 to
15551 of
12532 a
10824 my
9576 in
9081 you
7851 is
7531 that
7068 And
6948 not
6722 with
6218 his
sort: write failed: 'standard output': Broken pipe
sort: write error
In [22]: !sed -e 's/ /\n/g' -e 's/\r//g' < $filename | sed '/^$/d' | sort | sed '/^
23244 the
19542 I
18302 and
15623 to
15551 of
12532 a
10824 my
9576 in
9081 you
7851 is
7531 that
7068 And
6948 not
6722 with
6218 his
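Putting the pieces together, the word-count steps described in this section can be tried on one short made-up sentence. This is a sketch under assumptions rather than the exact cell from the reading: it assumes GNU sed, and the sample text, the LC_ALL=C setting and the variable names are ours.

```shell
# Hypothetical sample sentence: "the" occurs 3 times, "and" twice.
text='the cat and the dog and the bird'

top=$(printf '%s\n' "$text" \
  | sed -e 's/ /\n/g' -e 's/\r//g' \
  | sed '/^$/d' \
  | LC_ALL=C sort \
  | uniq -c \
  | sort -nr \
  | head -n 2)

# uniq -c prefixes each distinct line with its count, so the first
# column is the count and the second is the word.
echo "$top"

first_count=$(printf '%s\n' "$top" | awk 'NR==1 {print $1}')
first_word=$(printf '%s\n' "$top" | awk 'NR==1 {print $2}')
```

The first sort matters because uniq only merges identical lines that are adjacent.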