Developing scanners for biological data with SQL output using lexer generators

Scanner Pipelines for Biologists

Detlef Groth

UP / MPI-MP Bioinformatics

Outline

Abstract: An often occuring task in nowadys biological/physiological work is analysing, also called "parsing", of large amounts of biological data. Although tools has been written many times in many programming languages for such tasks most applications do not satisfy the need of end users - researchers with limited programming skills. Installation, speed and output are not suited for easy usage of tools and concentrating on the scientifc task. Instead of handwriting applications we use scanner generators like Flex and Jflex to create fast scanners for biological data which are easy to code, easy to maintain, easy to install and as standalone applications without any external dependencies easy to use. The data output of those tools is standard SQL-code not tied to a certain RDBMS which can be used to fill the database. This allows researches to use their database of choice and SQL statements to assembl data from different resources and applications. Some practical examples are provided to document the advantages of using scanners which can fill databases to combine results from different scanning procedures.

Motivation

  • Biologist
    • data mountains
    • many output formats
    • many converters
    • difficulties in making clues from data
  • Bioinformaticist
    • parsing programs
    • library hell
    • incompatibilities, updates, headaches
    • writing "parsers" again and again

Est Annotation Pipeline

Current Problems

  • biologist needs help by an bioinformaticist
  • bioinformaticist assembles pipelines of parsers
  • output changes by one program requires rewriting of code by an bioinformaticist
  • often libraries are not in sync with output changes
  • often programs are difficult to compile
  • seldom they use databases as the output target

Solution

  • standalone scanners
  • no installation
  • almost no library dependencies
  • RDBMS independent sql output
  • biologists / bioinformaticists make relations via SQL joins
  • output changes of your program, just a new one file application

Orthologs Via Reciprocal Best Hit

-- instead of writing a program 
-- you join your data ...
select a.hit_id,a.query_id
from blast_h2e_hits as a 
join blast_e2h_hits as b 
where a.hit_id = b.query_id 
and a.query_id = b.hit_id
and a.hit_id = b.query_id
and a.hit = 1 and b.hit = 1

Scanner Generators

  • Flex, Lex, Jflex, Aflex, Ifickle, Plex, Re2c ...
  • Writing pattern matching regular expressions and actions for that pattern
  • input stream (file,stdin) is translated into tokens
  • scanners are often frontends for compilers
  • strictly spoken scanners are not parsers
  • only together with a grammar and parser generators (yacc,bison) you can create parsers and compilers
  • we only need a scanner, not a parser

Scanner Advantages

  • Easier to program
  • Easier to maintain
  • Easier to compile and install
  • Easier to keep different versions
  • One file = one program
  • Often much faster than handwritten scanners

Recent observations by others

http://portal.open-bio.org/pipermail/biopython-dev/2006-July.txt
Lexer based Fasta Readers are 20 times faster than BioPython ones
http://bioinformatics.oxfordjournals.org/cgi/reprint/19/8/1035.pdf
Paquola et al (2003). Zerg: a very fast BLAST parser library. Flex based parser is 100 times faster than BioPerl ones

Most simple scanner

file: flex.in
%%
compile and run:
flex in.flex
gcc lex.yy.c -o prog -lfl
./prog < lex.yy.c > temp.yy.c

Lex like input files

definitions
%%
rules
%%
user code
  • only scanners with such flex like input are investigated

Sample Blast Files

dgroth@zetes> 
ls -l ~/project/samples/* | grep 1- | perl -ane '$F[8] =~ s|.+/.+_|Pliv_|; 
        printf("%55s %6.0f kb\n",$F[8],$F[4]/1024.0);'
Pliv_Clusters-1-10000.fasta.swplus 487614 kb
Pliv_Clusters-1-1000.fasta.swplus  42607 kb
Pliv_Clusters-1-100.fasta.swplus   3967 kb
Pliv_Clusters-1-10.fasta.swplus    351 kb
Pliv_Clusters-1-1.fasta.swplus     45 kb

wc with flex

%{
int nchar, nword, nline ;
%}
%%
\n        { nline++;nchar++; }
[^ \t\n]+ { nword++, nchar += yyleng ;}
.         { nchar++;}
%%
int main( argc, argv ) int argc; char **argv;    {
    if ( argc > 1 )  
       yyin = fopen( argv[1], "r" );
    else   
       yyin = stdin;
    yylex();
    printf("   %d\t%d\t%d", nchar,nword,nline);
    if ( argc > 1 ) printf("\t%s",argv[1]);
    printf("\n");
    return 0 ;
}

wc with java

package sf.net.bioscanners;
%%
%public
%class Wc
%standalone
%unicode
%{
  int nchars = 0;  int nwords = 0;  int nlines = 0;
%}
%eof{
 System.out.println("   "+nlines+"\t"+nwords+"\t"+nchars);
%eof}
%%
[\n] { nlines += 1; nchars += 1;  }
[ \t]+ { nchars += yylength() ; }
[^ \t\n]+ { nwords += 1; nchars += yylength(); }

wc with pascal I

%{
uses LexLib;
var numlines : integer  = 0;
var numchars : integer  = 0;
var numwords : integer  = 0;
%}
%%
[ \t]+ begin     
         numchars += yyleng;   
       end;
\n     begin
         numchars += 1; numlines += 1 ;
       end;  
[!-~]+ begin
         numchars += yyleng; numwords += 1;
       end;
%%

wc with pascal II

var infile : String = '';
begin
    case paramCount of
    1 : begin
          infile   := paramStr(1);
        
        end;
    end;
    assign(yyinput,  infile);
    reset(yyinput);
    if yylex=0 then ;
    writeln('  ',numlines, '  ',numwords,'  ', numchars, '  ',infile);
   
end.

Wc Performance I

Mode 1 10 100 1000 10000
tcl 0.401 2.388 25.914 278.242 nd
tcl8564 0.190 1.090 11.698 127.907 nd
tcl8532 0.267 1.640 17.873 194.538 nd
java 0.212 0.269 0.456 1.701 19.592
java14 0.216 0.262 0.475 1.730 15.713
javaip 0.118 0.348 2.848 28.850 328.231
javaip14 0.117 0.348 2.789 28.705 328.136
gcj 0.087 0.306 2.889 30.625 nd
gcj-exe 0.095 0.124 0.464 4.096 44.106
flex 0.009 0.011 0.105 1.094 12.576

Wc Performance II

flex 0.009 0.011 0.105 1.094 12.576
flexpp 0.039 0.189 1.958 21.418 246.365
re2c 0.009 0.004 0.035 0.352 4.045
plex 0.028 0.041 0.457 4.644 52.825
unix 0.005 0.028 0.294 3.122 34.577
perl-lex 0.173 0.884 9.263 96.487 nd
perl-hand 0.006 0.021 0.164 1.616 18.369

Replacer Flex

%%
[a-z]        printf("x");
%%
int main( argc, argv ) int argc; char **argv;    {
    if ( argc > 1 ) {
       yyin = fopen( argv[1], "r" );
       yylex();
    }  else {
            printf ("usage: %s <inputfile>\n");
    }
    return 0 ;
}

Replacer with Tcl

%{
#!/usr/bin/tclsh8.4
%}
%buffersize 1024
%%
[a-z]        { puts -nonewline x }
%%
if {[llength $argv] == 0} {
   puts stderr "usage replacer.tcl <inputfile>"
   exit 0
}
if {[catch {open [lindex $argv 0] r} yyin]} {
    puts stderr "Could not open [lindex $argv 0]"
    exit 0
}  
set sc [replacer #auto -yyin $yyin]
$sc yylex
close $yyin

Replacer performance I

Mode 1 10 100 1000 10000
tcl 0.831 5.618 61.415 nd nd
tcl8564 0.422 2.895 30.820 nd nd
java 0.651 2.008 17.422 195.260 nd
javaip 1.318 9.338 104.707 nd nd
gcj 0.520 3.293 37.306 nd nd
gcj-exe 0.368 2.147 23.925 270.932 nd
flex 0.004 0.023 0.230 2.305 29.733

Replacer performance II

flex 0.004 0.023 0.230 2.305 29.733
flexpp 0.015 0.096 1.038 10.787 122.601
re2c 0.008 0.053 0.576 6.451 69.661
plex 0.009 0.062 0.652 6.796 78.798
unix 0.011 0.068 0.636 6.375 66.401
perl-hand 0.007 0.035 0.267 2.564 28.897

BlastScanner Example Flex

Performance SimpleBlastScanner

Mode 1 10 100 1000 10000
tcl 1.750 12.658 142.109 nd nd
java 0.220 0.367 0.791 3.346 28.062
flex 0.003 0.017 0.159 1.716 19.939
flex-tcl 0.005 0.019 0.170 1.793 20.238
insert into positions (query_id,query_length,position) 
         values ('Contig1',834 l,0);
insert into hits (query_id,hit_id,hit,score,evalue) 
         values ('Contig1','SP_TREM:Q4RM58_TETNG',1,167,6e-40);
insert into hits (query_id,hit_id,hit,score,evalue) 
         values ('Contig1','SP_TREM:Q96IA7_HUMAN',2,166,1e-39);
...

SQL queries

How many queries have a hit with score 100 or better
select count(*) from (select distinct query_id from hits where score >= 100)
How many hits are matched by more than one query with an evalue of smaller than 1e-10
select count(*) from (select distinct hit_id from hits where evalue < 1e-10)
Which queries are matching exactly one and not more hits with an score of larger than 100
select * from (
select count(query_id) as cnt, query_id from (select query_id,score from hits where score > 100) group by query_id
) where cnt == 1

Conclusions I

Scanners are the better Parsers
  • easy to program (BlastScanner in 1 day)
  • standalone
  • should run forever
  • fast
  • possible to keep different scanner versions as seperate files
  • SimpleBlastScanner2.2.4.jar
  • not tied to a certain programming language,
  • C, C++, Java, Pascal, Tcl and others can be used

Conclusions II

SQL is most reliable to answer your questions
  • easy to code
  • often fast
  • complex tasks can be solved even by biologists
  • queries can be stored, shared ...
SimpleBlastScanner h2e.blast h2e | sqlite3 myblast.sqlite
SimpleBlastScanner e2h.blast e2h | sqlite3 myblast.sqlite
cat myviews.sql | sqlite3 myblast.sqlite
sqlite3 myblast.sqlite3 "select * from vw_recip_best_hit" > out.tab
Determine reciprocal best hit orthologs with a single sql statement ...

dgDBBrowser

http://mini.net/tcl/dgDBBrowser
  • cross platform
    • Win32
    • Linux
    • Solaris
    • OSF1
    • Mac-OSX
  • cross database
    • SQLite
    • ODBC
    • Mysql
    • Postgres
    • Oracle
    • Metakit

Est Annotation Pipeline II

Vision

Sourceforge Project

Status

  • BlastScanner
  • FastaScanner
  • Cap3Scanner
  • GeneOntolScanner (obo)
  • TWikiScanner

Outlook / ToDo

  • Complete SimpleBlastScanner Test-Suite
  • More Scanners (Java, Flex) as requested by biologists
  • Sourceforge project side
  • Format converters
  • Hire volunteers
  • Biologists: request Scanners
  • Bioinformaticsts/Programmers: write Scanners

Acknowledgement

  • MPIMG Berlin
    • Albert Poustka
    • Georgia Panopoulou
    • Steffen Hennig
    • Hans Lehrach
  • Uni Potsdam
    • Joachim Selbig