Intermediate programming with R

Inspecting a file

Learning Objectives

  • Inspect a file from the command line
  • Chain Unix commands using pipes
  • Search with grep
  • Redirect output to a new file

Now let’s explore the data files we have downloaded. Inspecting data files with the Unix shell is a quick and easy way to learn about a data set before attempting to import it with R.

Switch to the data directory.

cd data
ls
counts-norm.txt.gz counts-raw.txt.gz

Inspect the first line of the raw counts file.

head -n 1 counts-raw.txt.gz
��LOdf_all.txt�\�v��v#_��I���}�~x�E=l�ƴ�D��d�%W ��Cj����|��a������#�%٧

It wasn’t very informative because the file is compressed to save space. You could de-compress it with gunzip, but instead use gunzip -c to send the decompressed data to standard out. This allows us to view the contents of the file while still saving disk space. Pass standard out to the head function using the “pipe” command (it’s the vertical bar on your keyboard).

gunzip -c counts-raw.txt.gz | head -n 1
"doi"  "pubDate"    "journal"   "title" "articleType"   "authorsCount"  "f1000Factor"   "backtweetsCount"   "deliciousCount"    "pmid"  "plosSubjectTags"   "plosSubSubjectTags"    "facebookShareCount"    "facebookLikeCount" "facebookCommentCount"  "facebookClickCount"    "mendeleyReadersCount"  "almBlogsCount" "pdfDownloadsCount" "xmlDownloadsCount" "htmlDownloadsCount"    "almCiteULikeCount" "almScopusCount"    "almPubMedCentralCount" "almCrossRefCount"  "plosCommentCount"  "plosCommentResponsesCount" "wikipediaCites"    "year"  "daysSincePublished"    "wosCountThru2010"  "wosCountThru2011"

Now that worked as expected. From this header line, we observe that some columns contain descriptions of the publication, e.g. “journal” and “title”, and others contain the counts for the various metrics, e.g. “wosCountThru2011” is the number of citations the paper received thru 2011 according to Thomson Reuters’ Web of Science.

Now check the number of articles in each of the files using wc.

gunzip -c counts-raw.txt.gz | wc -l
24332
gunzip -c counts-norm.txt.gz | wc -l
21097

The normalized file contains data on fewer publications. According to their publication, they focus only on articles that are labeled “Research Articles”. Confirm that this is the reason for the difference between the two files by inspecting the 5th column, “articleType”. You can select specific columns (aka fields) using cut.

gunzip -c counts-raw.txt.gz | cut -f5 | head
"articleType"
"Research Article"
"Research Article"
"Synopsis"
"Synopsis"
"Research Article"
"Research Article"
"Synopsis"
"Feature"
"Community Page"

You can count the number of occurrences of each “articleType” using the function uniq and passing it the -c flag. However, uniq requires that the data is pre-sorted to work properly. Thus pipe the data through the command sort before passing it to uniq.

gunzip -c counts-raw.txt.gz | cut -f5 | sort | uniq -c | head
      1 "articleType"
      5 "Best Practice"
     57 "Book Review/Science in the Media"
     10 "Case Report"
      1 "Clinical Trial"
     56 "Community Page"
    172 "Correction"
    283 "Correspondence"
     13 "Correspondence and Other Communications"
    189 "Editorial"

We can see that the raw counts file contains many different types of articles.

Perform the same operation on the normalized counts file.

gunzip -c counts-norm.txt.gz | cut -f5 | sort | uniq -c
      1 "articleType"
  21096 "Research Article"

And indeed that is the difference. The normalized counts file only contains data on research articles.

Let’s keep exploring. What is the maximum number of citations for a single paper in this data set? Use the data from 2011 in column 32.

gunzip -c counts-raw.txt.gz | cut -f32 | sort -n | tail -n 1
737

The -n passed to sort is critical because it specifies the data is numeric. By default sort performs alphabetical sorting, in which case 9 would be greater than 100.

The 11th columns contains the PLOS subject tags. How many articles have the subject tag “Evolutionary Biology”? Use grep to search for the term.

gunzip -c counts-raw.txt.gz | cut -f11 | grep "Evolutionary Biology" | wc -l
2864

How many articles have the subject tag “Evolutionary Biology” and “Cell Biology”?

gunzip -c counts-raw.txt.gz | cut -f11 | grep "Evolutionary Biology" | grep "Cell Biology" | wc -l
153

Instead of simply counting the files that match the search criteria, save them to a new file. This is done with the redirection operator, >.

gunzip -c counts-raw.txt.gz | grep "Evolutionary Biology" | grep "Cell Biology" > evo-cell-bio.txt
wc -l evo-cell-bio.txt
170

What could be the reason for the discrepancy in the number of articles in our saved file?

Largest number of Wikipedia cites

What is the largest number of Wikipedia cites that an article in this data set has received? Hint: The counts of Wikipedia cites are in column 28.

Find articles in your field

Choose two PLOS subject tags to search for and save these articles to a new file. How many articles are there?