24 April 2012

Parsing Proteins in the GenBank/GenPept Flat File Format with BioJava 1.8.1

This post describes parsing annotated protein sequences from the RefSeq database. I was unable to find any complete examples for parsing RefSeq protein sequences in .gpff.gz files with Java, so here is a quick and dirty one.

The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. After downloading the latest release from the FTP server, you end up with a lot of .gz files. Some example filenames:

complete.1.1.genomic.fna.gz
complete.1.bna.gz
complete.1.genomic.gbff.gz
complete.10.bna.gz
complete.10.genomic.gbff.gz
complete.100.protein.gpff.gz

The README tells us that the filenames describe the type of information (genomic, protein, dna, rna). Each type is split across many numbered files. We are interested in protein information in the GenPept/GenBank Flat File format. Every file with protein information in this format has a name of the form complete.<number>.protein.gpff.gz.

Oh, and the regular expression for these filenames is:

^complete\.[0-9]+\.protein\.gpff\.gz$
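
To show how such a pattern might be used to pick the protein files out of a directory listing, here is a minimal sketch using only java.util.regex. The class name RefSeqProteinFiles and the helper methods isProteinFile and selectProteinFiles are made up for this illustration. The dots are escaped so that they match a literal '.' rather than any character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of BioJava or the RefSeq tooling.
public class RefSeqProteinFiles {

    // Escaped dots match a literal '.'; an unescaped '.' would match any character.
    private static final Pattern PROTEIN_GPFF =
            Pattern.compile("^complete\\.[0-9]+\\.protein\\.gpff\\.gz$");

    // True if the name has the form complete.<number>.protein.gpff.gz
    static boolean isProteinFile(String name) {
        return PROTEIN_GPFF.matcher(name).matches();
    }

    // Keeps only the names that look like GenPept protein files.
    static List<String> selectProteinFiles(List<String> names) {
        List<String> selected = new ArrayList<String>();
        for (String name : names) {
            if (isProteinFile(name)) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "complete.1.1.genomic.fna.gz",
                "complete.1.bna.gz",
                "complete.1.genomic.gbff.gz",
                "complete.10.bna.gz",
                "complete.10.genomic.gbff.gz",
                "complete.100.protein.gpff.gz");
        // prints [complete.100.protein.gpff.gz]
        System.out.println(selectProteinFiles(names));
    }
}
```

In a real run the list of names would presumably come from something like File.list() on the download directory, which is left out here to keep the sketch self-contained.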

Writing a parser…