In this chapter, we are going to learn how to extract and process data from a file on Linux.
To test the examples given in this chapter, I will create a file that I will name “data.txt” containing a list of books, with their year of publication, author, and country of origin:
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
I invite you to create the same file on your local machine and copy the above content.
When it comes to extracting data from a file, there are two essential commands that you should know: cut and grep.
Extract Portions of Lines
The command “cut” allows you to extract portions of each line of a given file, depending on the options provided. The resulting text is then sent to the standard output.
You should specify at least one option with the command so that it knows how to cut each line. Otherwise, it won’t work.
Let’s see how to use two of these options.
Extract by character
The “-c” flag specifies which characters to extract.
Here are a few examples to make things clear.
The following command will output the fifth character of each line:
$ cut -c 5 data.txt
e
s
Q
G
a
This command will output the third, sixth, and eighth character of each line:
$ cut -c 3,6,8 data.txt
ac
ye,
nux
era
rn
And finally, this command will output the characters of each line positioned between the fourth and tenth position:
$ cut -c 4-10 data.txt
Search
sses, 1
Quixot
Great
and Pe
As you can see from these examples, the “-c” flag doesn’t distinguish between letters, commas, and spaces. Everything counts as a character.
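Besides single positions and closed ranges, “cut” also accepts open-ended ranges: “-N” means from the first character through position N, and “N-” means from position N through the end of the line. Here is a quick sketch, recreating the same data.txt from the beginning of the chapter:

```shell
# Recreate the sample file used throughout this chapter
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# "-c -5": characters 1 through 5 of each line
cut -c -5 data.txt
# prints: In Se / Ulyss / Don Q / The G / War a (one per line)

# "-c 40-": character 40 through the end of each line
# (empty for lines shorter than 40 characters)
cut -c 40- data.txt
```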
Extract by field
To extract by field, you need to specify two essential parameters:
- The delimiter (using the “-d” flag): this is how you tell the “cut” command which character separates the fields in each line.
- The field number (using the “-f” flag): this is where you specify which field to extract.
Once again, a few examples will be more useful to you than plain explanations.
The following command will output the third field from our file (i.e. the author’s name):
$ cut -d ',' -f 3 data.txt
Marcel Proust
James Joyce
Miguel De Cervantes
F. Scott Fitzgerald
Leo Tolstoy
As you can see, I have specified comma (,) as the separator and I have selected the third field to extract.
If instead, we want to retrieve the title of the books, we can simply assign the value of 1 to the “-f” option.
$ cut -d ',' -f 1 data.txt
In Search of Lost Time
Ulysses
Don Quixote
The Great Gatsby
War and Peace
Now, what if we want to extract both the title of the book and its author?
Well, we can do that as well.
$ cut -d ',' -f 1,3 data.txt
In Search of Lost Time, Marcel Proust
Ulysses, James Joyce
Don Quixote, Miguel De Cervantes
The Great Gatsby, F. Scott Fitzgerald
War and Peace, Leo Tolstoy
I can go on and on with the examples, but I think the idea should be clear to you by now. I invite you to try to extract other fields on your own in order to familiarize yourself with the “cut” command.
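One refinement worth knowing if you are on a system with GNU coreutils (which includes most Linux distributions): the “--output-delimiter” option changes the separator printed between extracted fields. A short sketch, assuming GNU cut:

```shell
# Recreate the sample file used throughout this chapter
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Extract title and author, but join them with a semicolon
# instead of the original comma (GNU cut extension)
cut -d ',' -f 1,3 --output-delimiter=';' data.txt
```

Note that this option is a GNU extension; it is not available in BSD/macOS cut.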
Search for Patterns
The grep command is a life-saver. I am certain that in the future, you will find yourself using it very often. Not only does it allow you to extract content from a file, but it is also very useful for finding files that contain certain words.
The basic syntax of “grep” is very simple: just type the command followed by the string to search for, and then the file in which to search.
$ grep 'scott fitzgerald' data.txt
If you run the above command, you shouldn’t get any output. This might be contrary to what you were expecting, especially since our file does contain ‘Scott Fitzgerald’.
The reason for this is that “grep” is, by default, a case-sensitive command. This means that it differentiates between uppercase and lowercase letters. Therefore, grep does not consider ‘Scott Fitzgerald’ to be the same string as ‘scott fitzgerald’.
Thankfully, we can add the “-i” flag to make grep case-insensitive. By doing this, we finally get the expected result:
$ grep -i 'scott fitzgerald' data.txt
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
Another way to use “grep” is with the “-v” flag, which prints the non-matching lines. For instance, the example below retrieves all the lines that don’t contain ‘scott fitzgerald’.
$ grep -vi 'scott fitzgerald' data.txt
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
War and Peace, 1869, Leo Tolstoy, Russia
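Two more flags often come in handy: “-c” prints the number of matching lines instead of the lines themselves, and “-n” prefixes each match with its line number. A quick sketch with our sample file:

```shell
# Recreate the sample file used throughout this chapter
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Count how many lines mention France
grep -c 'France' data.txt
# prints: 1

# Show the match together with its line number
grep -n 'Ireland' data.txt
# prints: 2:Ulysses, 1922, James Joyce, Ireland
```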
Before we wrap up this section about grep and data extraction, there is one last thing I need to mention that makes ‘grep’ special: its support for regular expressions (also known as regex). Regex is very widely used as a way to specify search patterns, and it can be very useful for finding certain string patterns. However, I am not going to cover it here, as that would make for a lengthy chapter. Rest assured, we will have a future chapter dedicated to regex.
Sort Lines
The sort command, as its name implies, allows you to sort a list of lines. It reads a list from its standard input, sorts it, and sends the result to its standard output.
If we run the command on our file “data.txt”, we should get an ordered set of lines.
$ sort data.txt
Don Quixote, 1615, Miguel De Cervantes, Spain
In Search of Lost Time, 1913, Marcel Proust, France
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
Ulysses, 1922, James Joyce, Ireland
War and Peace, 1869, Leo Tolstoy, Russia
By default, the result is sorted alphabetically from A to Z. To reverse the order (from Z to A), we can simply specify the “-r” flag.
$ sort -r data.txt
War and Peace, 1869, Leo Tolstoy, Russia
Ulysses, 1922, James Joyce, Ireland
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
In Search of Lost Time, 1913, Marcel Proust, France
Don Quixote, 1615, Miguel De Cervantes, Spain
Now, let’s make things more interesting and try to sort the names of the authors (the third field).
Let’s try this out:
$ cut -d "," -f 3 data.txt | sort
F. Scott Fitzgerald
James Joyce
Leo Tolstoy
Marcel Proust
Miguel De Cervantes
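As an alternative to piping through “cut”, “sort” can sort on a field directly: “-t” sets the field separator (much like cut’s “-d”), and “-k” picks the key to sort on. For example, here is a sketch that orders the books by year of publication (the second field), sorting that field numerically:

```shell
# Recreate the sample file used throughout this chapter
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# -t ','  : fields are separated by commas
# -k 2,2n : sort on field 2 only, numerically
# Don Quixote (1615) should come first, The Great Gatsby (1925) last
sort -t ',' -k 2,2n data.txt
```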
To sort numerically rather than alphabetically, we should add the “-n” flag. Otherwise, numbers with different numbers of digits won’t be ordered correctly.
For example, let’s consider the following file, which I named “numbers.txt”:
254
13
92
543
7
65
If we try to sort it as we did before, the result wouldn’t be correct.
$ sort numbers.txt
13
254
543
65
7
92
That’s because, as I mentioned earlier, sort orders alphabetically by default.
However, with the “-n” flag, we can force it to sort by numbers, as shown in the following example.
$ sort -n numbers.txt
7
13
65
92
254
543
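These flags combine just as you would expect; for example, “-rn” sorts numerically in descending order:

```shell
# Recreate the numbers file from above
cat > numbers.txt <<'EOF'
254
13
92
543
7
65
EOF

# Numeric sort, largest value first
sort -rn numbers.txt
```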
Remove Duplicate Lines
The command “uniq” is used to remove duplicate lines from a file. It is very simple to use: just type “uniq” followed by the name of the file.
However, the command only removes adjacent duplicates; matching lines are treated as duplicates only when they are next to each other.
So, to deduplicate a whole file, we need to sort it first before providing it as input to “uniq”.
Here is a file that I named “users.txt”, which contains a list of users with some duplicates added here and there.
Now, let’s filter out the duplicates using “sort” and “uniq”:
$ sort users.txt | uniq
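Two related tricks are worth knowing: “uniq -c” prefixes each line with the number of times it occurred, and “sort -u” combines sorting and deduplication in a single step. Here is a sketch using a small hypothetical file (the names below are made-up sample data, not the contents of users.txt):

```shell
# Hypothetical sample file with some duplicate entries
cat > sample_users.txt <<'EOF'
alice
bob
alice
carol
bob
alice
EOF

# Count how many times each line occurs
# (the input must be sorted first, for the same adjacency reason)
sort sample_users.txt | uniq -c

# Equivalent to "sort | uniq": sorted output with duplicates removed
sort -u sample_users.txt
```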
Count Lines, Words, and Bytes
The command “wc” (short for word count) displays basic statistics about its input.
By default, it shows three values in the following order: the number of lines, the number of words, and the number of bytes.
$ wc data.txt
5 37 234 data.txt
If we don’t need all these values, and we’re only interested in the number of lines, we can use the “-l” flag as shown in the example below:
$ wc -l data.txt
5 data.txt
Similarly, specifying the “-w” flag will print the word count.
$ wc -w data.txt
37 data.txt
And to print out the number of bytes, we can use the “-c” flag. (If you need the number of characters instead, which can differ from the byte count when the file contains multi-byte characters, use the “-m” flag.)
$ wc -c data.txt
234 data.txt
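To close the loop, these commands really shine when combined in a pipeline. For instance, here is how we could count how many distinct countries appear in our file by chaining cut, sort, and wc:

```shell
# Recreate the sample file used throughout this chapter
cat > data.txt <<'EOF'
In Search of Lost Time, 1913, Marcel Proust, France
Ulysses, 1922, James Joyce, Ireland
Don Quixote, 1615, Miguel De Cervantes, Spain
The Great Gatsby, 1925, F. Scott Fitzgerald, United States
War and Peace, 1869, Leo Tolstoy, Russia
EOF

# Extract the country field, deduplicate it, and count the lines
cut -d ',' -f 4 data.txt | sort -u | wc -l
# prints: 5
```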
We have reached the end of this chapter. We have covered 5 new commands to extract and process data in Linux: cut, grep, sort, uniq, and wc. Take some time to practice what you have learned here today, and make sure that you are comfortable with each of these commands before jumping into the next chapter.