Chapter 8 – Extract and Process Data

In this chapter, we are going to learn how to extract and process data from a file on Linux.

To test the examples in this chapter, I will use a file named “data.txt” that contains a list of books, each with its year of publication, author, and country of origin:

In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia

I invite you to create the same file on your local machine and copy the above content.
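If you prefer not to open an editor, the file can also be created straight from the terminal. The snippet below is a minimal sketch that assumes a POSIX-compatible shell such as bash and writes “data.txt” into the current directory using a here-document.

```shell
# Write the five sample records to data.txt in the current directory.
# Note: this overwrites any existing file named data.txt.
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Display the file to confirm its content.
cat data.txt
```

The quotes around EOF tell the shell to copy the lines exactly as written, without trying to expand anything inside them.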

Extract Data

When it comes to extracting data from a file, there are two essential commands that you should know: cut and grep.

Extract Portions of Lines

The “cut” command extracts portions of each line of a given file, depending on the options provided. The resulting text is then sent to the standard output.

You must specify at least one option so that the command knows how to cut each line; otherwise, it exits with an error.

Let’s see how to use two of these options.

Extract by character

The “-c” flag specifies which characters to extract.

Here are a few examples to make things clear.

The following command will output the fifth character of each line:

$ cut -c 5 data.txt
e
s
Q
G
a

This command will output the third, sixth, and eighth character of each line:

$ cut -c 3,6,8 data.txt
 ac
ye,
nux
era
rn 

And finally, this command will output the characters in positions four through ten of each line:

$ cut -c 4-10 data.txt
Search 
sses, 1
 Quixot
 Great 
 and Pe

As you can see from these examples, the “-c” flag does not distinguish between letters, commas, and spaces. Everything is considered a character.
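A range given to “-c” can also be left open on one side. The sketch below recreates the sample file in a scratch directory (the use of mktemp is my own assumption, made only so that nothing on your machine gets overwritten) and then tries both forms.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# "-7" means: from the first character up to the seventh.
cut -c -7 data.txt

# "10-" means: from the tenth character to the end of the line.
cut -c 10- data.txt
```

With our file, the first command should print “In Sear” for the first line, and the second command should print “1922, James Joyce, Ireland” for the second line.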

Extract by field

To extract by field, you need to specify two essential parameters:

  • The delimiter (using the “-d” flag): This is how you tell the “cut” command which character separates the fields on each line. If you omit it, “cut” uses the tab character.
  • The field number (using the “-f” flag): This is where you specify which field to extract.

Once again, a few examples will be more useful to you than plain explanations.

The following command will output the third field from our file (i.e., the author name):

$ cut -d ',' -f 3 data.txt
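 Marcel Proust
 James Joyce
 Miguel De Cervantes
 F. Scott Fitzgerald
 Leo Tolstoy

As you can see, I specified the comma (,) as the separator and selected the third field. Note that each field after the first starts with a space, because the fields in our file are separated by a comma followed by a space, and “cut” splits on the comma only.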

If, instead, we want to retrieve the book titles, we can simply pass the value 1 to the “-f” option.

$ cut -d ',' -f 1 data.txt
In Search of Lost Time
Ulysses
Don Quixote
The Great Gatsby
War and Peace

Now, what if we want to extract both the title of the book and its author?

Well, we can do that as well.

$ cut -d ',' -f 1,3 data.txt
In Search of Lost Time, Marcel Proust
Ulysses, James Joyce
Don Quixote, Miguel De Cervantes
The Great Gatsby, F. Scott Fitzgerald
War and Peace, Leo Tolstoy

I can go on and on with the examples, but I think the idea should be clear to you by now. I invite you to try to extract other fields on your own in order to familiarize yourself with the “cut” command.
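To get you started, here are a few more variations as a hedged sketch: a pair of fields, an open-ended range of fields, and a second “cut” that drops the first character of each extracted field. The scratch directory created with mktemp is an assumption of mine, used only to keep the example self-contained.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Year and country (fields 2 and 4).
cut -d ',' -f 2,4 data.txt

# Everything from the second field to the end of the line.
cut -d ',' -f 2- data.txt

# Author names without the space that follows the comma:
# the second cut keeps each line from its second character onward.
cut -d ',' -f 3 data.txt | cut -c 2-
```

The last command chains two “cut” calls with a pipe: the first one selects the field, and the second one trims the single leading character that the comma-plus-space separator leaves behind.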

Extract Lines

The grep command is a life-saver. I am certain that in the future, you will find yourself using it very often. Not only does it allow you to extract content from a file, but it is also very useful in finding files that contain certain words.

The basic syntax for using “grep” is very simple. Just type the command followed by the string to search for, and then the file to search in.

$ grep 'scott fitzgerald' data.txt

If you run the above command, you shouldn’t get any output. This might be contrary to what you were expecting, especially since our file does contain ‘Scott Fitzgerald’.

The reason for this is that “grep” is, by default, a case-sensitive command. This means that it differentiates between uppercase and lowercase letters. Therefore, grep does not consider ‘Scott Fitzgerald’ to be the same string as ‘scott fitzgerald’.

Thankfully, we can add the flag “-i” to make the grep command case insensitive. By doing this, we can finally get the expected result:

$ grep -i 'scott fitzgerald' data.txt
The Great Gatsby, 1925, F. Scott Fitzgerald, United States

One other way we can use “grep” is by adding the “-v” flag. This will extract the non-matching lines. For instance, in the example below, we retrieved all the lines that don’t contain ‘scott fitzgerald’.

$ grep -vi 'scott fitzgerald' data.txt
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
War and Peace, 1869, Leo Tolstoy, Russia
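The two uses of “grep” mentioned at the start of this section, extracting content and finding files, can be sketched as follows. The example pipes “grep” into “cut” to keep only a title, then uses the “-l” flag to list matching file names. The file “notes.txt” and the scratch directory are hypothetical, added only so that there is more than one file to search.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF
# notes.txt is a hypothetical second file that does not mention Tolstoy.
printf 'Remember to return the library books.\n' > notes.txt

# Print only the title of each book whose line mentions "joyce" (any case).
grep -i 'joyce' data.txt | cut -d ',' -f 1

# Print the names of the .txt files that contain "Tolstoy".
grep -l 'Tolstoy' *.txt
```

Here the first command should print “Ulysses”, and the second should print “data.txt” only, since “notes.txt” has no matching line.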

Before we wrap up this section about grep and data extraction, there is one last thing that makes “grep” special: its support for regular expressions (also known as regex). Regular expressions are widely used to describe search patterns, and they are very useful for finding strings that follow a certain format. However, I am not going to cover them here, as that would make this chapter too long. Rest assured, a future chapter will be dedicated to regex.

Process Data

Sorting Lines

The sort command, as its name implies, allows you to sort a list of strings. It reads lines from the files given as arguments (or from its standard input when no file is given), sorts them, and sends the result to its standard output.

If we run the command on our file “data.txt”, we should get an ordered set of lines.

$ sort data.txt
Don Quixote, 1615, Miguel De Cervantes, Spain
In Search of Lost Time, 1913, Marcel Proust, France
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
Ulysses, 1922, James Joyce, Ireland
War and Peace, 1869, Leo Tolstoy, Russia

By default, the result is sorted alphabetically from A to Z. If we want to reverse the order (from Z to A), we can simply specify the -r flag.

$ sort -r data.txt
War and Peace, 1869, Leo Tolstoy, Russia
Ulysses, 1922, James Joyce, Ireland
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
In Search of Lost Time, 1913, Marcel Proust, France
Don Quixote, 1615, Miguel De Cervantes, Spain

Now, let’s make things more interesting and sort the names of the authors (the third column). To do that, we extract the column with “cut” and pass its output to “sort” through a pipe (|):

$ cut -d "," -f 3 data.txt | sort
 F. Scott Fitzgerald
 James Joyce
 Leo Tolstoy
 Marcel Proust
 Miguel De Cervantes
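The pipeline above keeps only the author names. If you would rather keep the whole lines and merely order them by author, “sort” can pick the sort key itself with the “-t” (field separator) and “-k” (key) options. This is a minimal sketch, again run in a scratch directory that I create with mktemp purely for illustration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Use the comma as field separator and sort on the third field (the author).
sort -t ',' -k 3 data.txt
```

With “-k 3”, the key starts at the third field and runs to the end of the line; writing “-k 3,3” would restrict it to the third field alone. For our file, both forms should list The Great Gatsby first and Don Quixote last.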

Sorting Numbers

To sort by numbers, and not alphabetically, we should add the -n flag. Otherwise, the numbers wouldn’t be sorted properly if they have different numbers of digits.

For example, let’s consider the following file, which I named “numbers.txt”:

254
13
92
543
7
65

If we try to sort it as we did before, the result wouldn’t be correct.

$ sort numbers.txt
13
254
543
65
7
92

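That’s because, as I mentioned earlier, sort orders alphabetically by default: it compares the lines character by character, so “13” comes before “7” because the character “1” sorts before “7”.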

However, with the “-n” flag, we can force it to sort by numbers, as shown in the following example.

$ sort -n numbers.txt
7
13
65
92
254
543
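The “-n” flag can also be applied to a single field. As a hedged sketch, the example below orders the books in “data.txt” by year of publication, using the “-t” and “-k” options of “sort” to select the second field. The scratch directory is an assumption made only to keep the example self-contained.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Oldest book first: sort numerically (n) on the second field only (2,2).
sort -t ',' -k 2,2n data.txt

# Most recent book first: add r to reverse the order of that key.
sort -t ',' -k 2,2nr data.txt
```

All the years in our file have four digits, so an alphabetical sort on that field would happen to give the same order here. The numeric flag matters as soon as the values have different lengths, as in “numbers.txt”.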

Remove duplicates

The “uniq” command removes adjacent duplicate lines from a file. It is very simple to use: just type “uniq” followed by the name of the file.

However, because only matching lines that are next to each other count as duplicates, repeated lines that are scattered throughout the file are left untouched.

So, to remove every duplicate, we need to sort the file first and then provide the result as input to “uniq”.

Here is a file that I named “users.txt”, which contains a list of users with some duplicates added here and there.

David
Alex
Maria
Carlos
Anna
Marco
Ana
Antonio
Daniel
Andrea
David
Laura
Ali
Jose
Sandra
Maria
Sara
Carlos
Ana
Michael

Now, let’s filter out the duplicates using “sort” and “uniq”:

$ sort users.txt | uniq
Alex
Ali
Ana
Andrea
Anna
Antonio
Carlos
Daniel
David
Jose
Laura
Marco
Maria
Michael
Sandra
Sara
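To see why the sorting step matters, the sketch below compares “uniq” with and without it, and then tries two other flags of “uniq”: “-d”, which prints only the repeated lines, and “-c”, which prefixes each line with its number of occurrences. The scratch directory is an assumption of mine, and the line counts are obtained with “wc -l”.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > users.txt <<'EOF'
David
Alex
Maria
Carlos
Anna
Marco
Ana
Antonio
Daniel
Andrea
David
Laura
Ali
Jose
Sandra
Maria
Sara
Carlos
Ana
Michael
EOF

# No two identical names are adjacent, so uniq alone removes nothing.
uniq users.txt | wc -l

# After sorting, identical names sit next to each other and are merged.
sort users.txt | uniq | wc -l

# Show only the names that appear more than once.
sort users.txt | uniq -d

# Show every name preceded by its number of occurrences.
sort users.txt | uniq -c
```

The first count should be 20 (the file is unchanged) and the second 16, while “uniq -d” should list Ana, Carlos, David, and Maria.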

Counting

The “wc” command (short for word count) displays basic statistics about the content of the file given as its argument, or about its standard input when no file is given.

By default, it shows three values in the following order: the number of lines, the number of words, and the number of bytes.

$ wc data.txt
5 37 234 data.txt

If we don’t need all these values, and we’re only interested in the number of lines, we can use the “-l” flag as shown in the example below:

$ wc -l data.txt
5 data.txt

Similarly, specifying the “-w” flag will print the word count.

$ wc -w data.txt
37 data.txt

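And to print out the number of bytes, we can use the “-c” flag. For plain ASCII text such as our file, this is the same as the number of characters; to count characters explicitly, use the “-m” flag instead.

$ wc -c data.txt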
234 data.txt
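Because “wc” can read its standard input, it also works at the end of a pipeline, which makes it handy for counting the results of the other commands in this chapter. Here is a minimal sketch, run in a scratch directory that I create only for the sake of the example.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Reading from standard input: the count is printed without a file name.
wc -l < data.txt

# Count the books published in the 1900s: grep keeps the lines
# containing ", 19" and wc counts them.
grep ', 19' data.txt | wc -l
```

The two counts should be 5 and 3. Some implementations of “wc” pad the number with leading spaces, so the exact layout may differ from one system to another.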

We have reached the end of this chapter. We have covered five new commands to extract and process data in Linux: cut, grep, sort, uniq, and wc. Take some time to practice what you have learned here, and make sure that you are comfortable with each of these commands before jumping into the next chapter.
