Recently, a colleague asked me whether I recognized the following raw data format. It came from a fairly old dataset produced by one of the first next-generation sequencers and by the relatively old software used for base calling:
203K0:1:1:626:335:ATTCCATTCCATTCCATTCCATTCCATTCCAT:[[[[[[[[[[[[[[[[[[[[[UUUUUUUUOUUU
203K0:1:1:119:614:TAAAAACTAGATAGAAGCAATGTCAGAACTTT:[[[[[[[[[[[[[[[W[[[[[UUUUUUUUUUUU
203K0:1:1:114:772:TCCTAGCTAGTTCCCTGCAGCTTTTTATTAAC:[[[[[[[[[[[[[[[[[[WWUUUUUUUCIUUU
203K0:1:1:490:490:GTTGGTGCTTAAAAGTCTTGGATTTTGAAACA:[[[[[[[[[[[[[[W[[[[[UUUUUUOOIUUU
It turned out to be the older Illumina SCARF format, which stores all the information for one read on a single colon-separated line: five fields that identify the read, then the sequence, then the quality string. The read quality scores in the above example are in ASCII Phred64 format, as can be determined by this nice awk script. Before writing our own converter in Perl (which is quite dirty and took about 5 minutes), we tried a nice Perl script that converts almost all older raw data formats to standard fastq. However, it seemed to expect only numeric quality scores. So here is a quick and dirty script that converts multiple SCARF files to standard fastq, keeping the original scores:
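#!/usr/bin/perl -w
use strict;
use File::Basename;
use File::Spec;

# Convert every SCARF file given on the command line to fastq.
# Each output file is written next to its input, with a .fq extension.
my @input = @ARGV;
foreach my $input (@input) {
    my ($base,$dir,$ext) = fileparse($input,'\.[^.]*');
    my $output = File::Spec->catfile($dir,$base.".fq");
    open(SCARF,$input) or die "Cannot open $input: $!";
    open(OUTPUT,">$output") or die "Cannot write $output: $!";
    while (<SCARF>) {
        chomp $_;
        # Columns 0-4 identify the read, 5 is the sequence, 6 the qualities
        my @cols = split(":",$_);
        print OUTPUT "@",join(":",@cols[0..4]),"\n";
        print OUTPUT $cols[5],"\n";
        print OUTPUT "+\n";
        print OUTPUT $cols[6],"\n";
    }
    close(OUTPUT);
    close(SCARF);
}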
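For readers who want to check the Phred64 claim themselves, here is a minimal Python sketch of the usual range-based heuristic. It is not the awk script mentioned above. The function names `guess_offset` and `decode` are made up for this illustration, and the thresholds (below ASCII 59 implies offset 33, above ASCII 74 implies offset 64) are commonly used rules of thumb, not a guarantee.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the observed character range.

    Heuristic assumption: characters below ';' (59) only occur with
    offset 33, and characters above 'J' (74) only occur with offset 64.
    Returns None when the observed range fits both encodings.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        return 33
    if highest > 74:
        return 64
    return None


def decode(quality, offset):
    """Turn a quality string into a list of numeric scores."""
    return [ord(ch) - offset for ch in quality]


# Quality string of the third read in the example above
sample = ["[[[[[[[[[[[[[[[[[[WWUUUUUUUCIUUU"]
offset = guess_offset(sample)
print(offset)                          # 64
print(decode(sample[0], offset)[-6:])  # [21, 3, 9, 21, 21, 21]
```

With an offset of 64, the `[` characters decode to a score of 27 and the `U` characters to 21. A heuristic like this is only as good as the reads it sees, so it is best run over many reads, not just a handful.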