Outline
- Motivation
- Practical Example
- Scanner examples
- Performance considerations
- SQL utilities
- Status and outlook
Abstract:
An often occuring task in nowadys biological/physiological work is analysing,
also called "parsing", of large amounts of biological data. Although tools has
been written many times in many programming languages for such tasks most
applications do not satisfy the need of end users - researchers with limited
programming skills. Installation, speed and output are not suited for easy
usage of tools and concentrating on the scientifc task. Instead of handwriting
applications we use scanner generators like Flex and Jflex to create fast
scanners for biological data which are easy to code, easy to maintain, easy to
install and as standalone applications without any external dependencies easy
to use. The data output of those tools is standard SQL-code not tied to a
certain RDBMS which can be used to fill the database. This allows researches
to use their database of choice and SQL statements to assembl data from
different resources and applications. Some practical examples are provided to
document the advantages of using scanners which can fill databases to combine
results from different scanning procedures.
Motivation
- Biologist
- data mountains
- many output formats
- many converters
- difficulties in making clues from data
- Bioinformaticist
- parsing programs
- library hell
- incompatibilities, updates, headaches
- writing "parsers" again and again
Est Annotation Pipeline
Current Problems
- biologist needs help by an bioinformaticist
- bioinformaticist assembles pipelines of parsers
- output changes by one program requires rewriting of code by an bioinformaticist
- often libraries are not in sync with output changes
- often programs are difficult to compile
- seldom they use databases as the output target
Solution
- standalone scanners
- no installation
- almost no library dependencies
- RDBMS independent sql output
- biologists / bioinformaticists make relations via SQL joins
- output changes of your program, just a new one file application
Orthologs Via Reciprocal Best Hit
-- instead of writing a program
-- you join your data ...
select a.hit_id,a.query_id
from blast_h2e_hits as a
join blast_e2h_hits as b
where a.hit_id = b.query_id
and a.query_id = b.hit_id
and a.hit_id = b.query_id
and a.hit = 1 and b.hit = 1
Scanner Generators
- Flex, Lex, Jflex, Aflex, Ifickle, Plex, Re2c ...
- Writing pattern matching regular expressions and actions for that pattern
- input stream (file,stdin) is translated into tokens
- scanners are often frontends for compilers
- strictly spoken scanners are not parsers
- only together with a grammar and parser generators (yacc,bison) you can create parsers and compilers
- we only need a scanner, not a parser
Scanner Advantages
- Easier to program
- Easier to maintain
- Easier to compile and install
- Easier to keep different versions
- One file = one program
- Often much faster than handwritten scanners
Recent observations by others
- http://portal.open-bio.org/pipermail/biopython-dev/2006-July.txt
- Lexer based Fasta Readers are 20 times faster than BioPython ones
- http://bioinformatics.oxfordjournals.org/cgi/reprint/19/8/1035.pdf
- Paquola et al (2003). Zerg: a very fast BLAST parser library. Flex based parser is 100 times faster than BioPerl ones
Most simple scanner
file: flex.in
%%
compile and run:
flex in.flex
gcc lex.yy.c -o prog -lfl
./prog < lex.yy.c > temp.yy.c
Lex like input files
definitions
%%
rules
%%
user code
- only scanners with such flex like input are investigated
Sample Blast Files
dgroth@zetes>
ls -l ~/project/samples/* | grep 1- | perl -ane '$F[8] =~ s|.+/.+_|Pliv_|;
printf("%55s %6.0f kb\n",$F[8],$F[4]/1024.0);'
Pliv_Clusters-1-10000.fasta.swplus 487614 kb
Pliv_Clusters-1-1000.fasta.swplus 42607 kb
Pliv_Clusters-1-100.fasta.swplus 3967 kb
Pliv_Clusters-1-10.fasta.swplus 351 kb
Pliv_Clusters-1-1.fasta.swplus 45 kb
wc with flex
%{
int nchar, nword, nline ;
%}
%%
\n { nline++;nchar++; }
[^ \t\n]+ { nword++, nchar += yyleng ;}
. { nchar++;}
%%
int main( argc, argv ) int argc; char **argv; {
if ( argc > 1 )
yyin = fopen( argv[1], "r" );
else
yyin = stdin;
yylex();
printf(" %d\t%d\t%d", nchar,nword,nline);
if ( argc > 1 ) printf("\t%s",argv[1]);
printf("\n");
return 0 ;
}
wc with java
package sf.net.bioscanners;
%%
%public
%class Wc
%standalone
%unicode
%{
int nchars = 0; int nwords = 0; int nlines = 0;
%}
%eof{
System.out.println(" "+nlines+"\t"+nwords+"\t"+nchars);
%eof}
%%
[\n] { nlines += 1; nchars += 1; }
[ \t]+ { nchars += yylength() ; }
[^ \t\n]+ { nwords += 1; nchars += yylength(); }
wc with pascal I
%{
uses LexLib;
var numlines : integer = 0;
var numchars : integer = 0;
var numwords : integer = 0;
%}
%%
[ \t]+ begin
numchars += yyleng;
end;
\n begin
numchars += 1; numlines += 1 ;
end;
[!-~]+ begin
numchars += yyleng; numwords += 1;
end;
%%
wc with pascal II
var infile : String = '';
begin
case paramCount of
1 : begin
infile := paramStr(1);
end;
end;
assign(yyinput, infile);
reset(yyinput);
if yylex=0 then ;
writeln(' ',numlines, ' ',numwords,' ', numchars, ' ',infile);
end.
Wc Performance I
| Mode | 1 | 10 | 100 | 1000 | 10000 |
| tcl | 0.401 | 2.388 | 25.914 | 278.242 | nd |
| tcl8564 | 0.190 | 1.090 | 11.698 | 127.907 | nd |
| tcl8532 | 0.267 | 1.640 | 17.873 | 194.538 | nd |
| java | 0.212 | 0.269 | 0.456 | 1.701 | 19.592 |
| java14 | 0.216 | 0.262 | 0.475 | 1.730 | 15.713 |
| javaip | 0.118 | 0.348 | 2.848 | 28.850 | 328.231 |
| javaip14 | 0.117 | 0.348 | 2.789 | 28.705 | 328.136 |
| gcj | 0.087 | 0.306 | 2.889 | 30.625 | nd |
| gcj-exe | 0.095 | 0.124 | 0.464 | 4.096 | 44.106 |
| flex | 0.009 | 0.011 | 0.105 | 1.094 | 12.576 |
Wc Performance II
| flex | 0.009 | 0.011 | 0.105 | 1.094 | 12.576 |
| flexpp | 0.039 | 0.189 | 1.958 | 21.418 | 246.365 |
| re2c | 0.009 | 0.004 | 0.035 | 0.352 | 4.045 |
| plex | 0.028 | 0.041 | 0.457 | 4.644 | 52.825 |
| unix | 0.005 | 0.028 | 0.294 | 3.122 | 34.577 |
| perl-lex | 0.173 | 0.884 | 9.263 | 96.487 | nd |
| perl-hand | 0.006 | 0.021 | 0.164 | 1.616 | 18.369 |
Replacer Flex
%%
[a-z] printf("x");
%%
int main( argc, argv ) int argc; char **argv; {
if ( argc > 1 ) {
yyin = fopen( argv[1], "r" );
yylex();
} else {
printf ("usage: %s <inputfile>\n");
}
return 0 ;
}
Replacer with Tcl
%{
#!/usr/bin/tclsh8.4
%}
%buffersize 1024
%%
[a-z] { puts -nonewline x }
%%
if {[llength $argv] == 0} {
puts stderr "usage replacer.tcl <inputfile>"
exit 0
}
if {[catch {open [lindex $argv 0] r} yyin]} {
puts stderr "Could not open [lindex $argv 0]"
exit 0
}
set sc [replacer #auto -yyin $yyin]
$sc yylex
close $yyin
Replacer performance I
| Mode | 1 | 10 | 100 | 1000 | 10000 |
| tcl | 0.831 | 5.618 | 61.415 | nd | nd |
| tcl8564 | 0.422 | 2.895 | 30.820 | nd | nd |
| java | 0.651 | 2.008 | 17.422 | 195.260 | nd |
| javaip | 1.318 | 9.338 | 104.707 | nd | nd |
| gcj | 0.520 | 3.293 | 37.306 | nd | nd |
| gcj-exe | 0.368 | 2.147 | 23.925 | 270.932 | nd |
| flex | 0.004 | 0.023 | 0.230 | 2.305 | 29.733 |
Replacer performance II
| flex | 0.004 | 0.023 | 0.230 | 2.305 | 29.733 |
| flexpp | 0.015 | 0.096 | 1.038 | 10.787 | 122.601 |
| re2c | 0.008 | 0.053 | 0.576 | 6.451 | 69.661 |
| plex | 0.009 | 0.062 | 0.652 | 6.796 | 78.798 |
| unix | 0.011 | 0.068 | 0.636 | 6.375 | 66.401 |
| perl-hand | 0.007 | 0.035 | 0.267 | 2.564 | 28.897 |
BlastScanner Example Flex
Performance SimpleBlastScanner
| Mode | 1 | 10 | 100 | 1000 | 10000 |
| tcl | 1.750 | 12.658 | 142.109 | nd | nd |
| java | 0.220 | 0.367 | 0.791 | 3.346 | 28.062 |
| flex | 0.003 | 0.017 | 0.159 | 1.716 | 19.939 |
| flex-tcl | 0.005 | 0.019 | 0.170 | 1.793 | 20.238 |
insert into positions (query_id,query_length,position)
values ('Contig1',834 l,0);
insert into hits (query_id,hit_id,hit,score,evalue)
values ('Contig1','SP_TREM:Q4RM58_TETNG',1,167,6e-40);
insert into hits (query_id,hit_id,hit,score,evalue)
values ('Contig1','SP_TREM:Q96IA7_HUMAN',2,166,1e-39);
...
SQL queries
- How many queries have a hit with score 100 or better
- select count(*) from (select distinct query_id from hits where score >= 100)
- How many hits are matched by more than one query with an evalue of smaller than 1e-10
- select count(*) from (select distinct hit_id from hits where evalue < 1e-10)
- Which queries are matching exactly one and not more hits with an score of larger than 100
- select * from (
select count(query_id) as cnt, query_id from (select query_id,score from hits where score > 100) group by query_id
) where cnt == 1
Conclusions I
Scanners are the better Parsers
- easy to program (BlastScanner in 1 day)
- standalone
- should run forever
- fast
- possible to keep different scanner versions as seperate files
- SimpleBlastScanner2.2.4.jar
- not tied to a certain programming language,
- C, C++, Java, Pascal, Tcl and others can be used
Conclusions II
SQL is most reliable to answer your questions
- easy to code
- often fast
- complex tasks can be solved even by biologists
- queries can be stored, shared ...
SimpleBlastScanner h2e.blast h2e | sqlite3 myblast.sqlite
SimpleBlastScanner e2h.blast e2h | sqlite3 myblast.sqlite
cat myviews.sql | sqlite3 myblast.sqlite
sqlite3 myblast.sqlite3 "select * from vw_recip_best_hit" > out.tab
Determine reciprocal best hit orthologs with a single sql statement ...
Est Annotation Pipeline II
Vision
Sourceforge Project
Status
- BlastScanner
- FastaScanner
- Cap3Scanner
- GeneOntolScanner (obo)
- TWikiScanner
Outlook / ToDo
- Complete SimpleBlastScanner Test-Suite
- More Scanners (Java, Flex) as requested by biologists
- Sourceforge project side
- Format converters
- Hire volunteers
- Biologists: request Scanners
- Bioinformaticsts/Programmers: write Scanners
Acknowledgement
- MPIMG Berlin
- Albert Poustka
- Georgia Panopoulou
- Steffen Hennig
- Hans Lehrach
- Uni Potsdam