Oracle Data Mining Application Developer's Guide 10g Release 1 (10.1) Part Number B10699-01 |
|
|
View PDF |
This chapter describes table functions included with ODM that permit you to perform similarity searches against nucleotide and amino acid sequence data stored in an Oracle database. You can use the table functions described in this chapter for ad hoc searches or you can embed them in applications. The inclusion of these table functions in ODM positions Oracle as a platform for bioinformatics.
This chapter discusses the following topics:
The National Center for Biotechnology Information (NCBI) implemented one of the commonly used versions of the Basic Local Alignment Search Tool (BLAST).
Sequence alignments provide a way to compare new sequences with previously characterized sequences. Both functional and evolutionary information can be inferred from well-designed queries and alignments. BLAST provides a method for searching of both nucleotide and protein databases. Since the BLAST algorithm detects local alignments, regions of similarity embedded in otherwise unrelated sequences can be detected.
The BLAST algorithm searches nucleotide and amino acid query sequences against databases of nucleotide and amino acid sequences. Based on the nature of the query and the database sequences, NCBI BLAST provides the following variants:
For more information about NCBI BLAST, see the NCBI BLAST Home Page at http://www.ncbi.nlm.nih.gov/BLAST/
.
The table functions described in this chapter implement several of the variants of NCBI BLAST version 2.0.
This section contains several examples of using the ODM BLAST table functions to perform searches on nucleotide or amino acid sequences.
Most table function parameters have defaults. The defaults were carefully chosen so that users who have limited experience with BLAST will obtain good results.
The BLAST table functions accept the CLOB (Character Large OBject) data type as the query sequence. It is not possible to construct a CLOB in an ad hoc SQL query. One way to construct a CLOB is to create a table and insert the query sequence into the table. Another option is to construct a CLOB using the programmatic interface if the BLAST query is part of a larger program. Suppose that the table query_db
has the schema (sequence_id VARCHAR2(32), sequence CLOB)
. The following SQL query inserts the query sequence into query_db
:
INSERT INTO query_db VALUES ('1', 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');
Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id
, publication date
, modification date
, organism
, and sequence
, among other attributes. There is no required schema for the table that stores the sequences. The only requirement is that the table contain an identifier and the sequence and any number of other optional attributes.
The portion of the database to be used for the search can be specified using SQL. The full power of SQL can be used to perform more sophisticated selections.
The following query does a BLAST search of the given query sequence against the human genome and returns the seq_id,
score
, and expect
value of matches that score > 25:
SELECT t.t_seq_id, t.score, t.expect FROM TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE organism = 'human'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.score > 25;
Note: The parameter value of 0 invokes the default values in most cases. See the syntax for details.
The following query does the BLAST search against all sequences published after Jan 01, 2000:
SELECT t.t_seq_id, t.score, t.expect FROM TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.score > 25;
You can obtain other attributes of the matching sequence by joining the BLAST result with the original sequence table as follows:
SELECT t.t_seq_id, t.score, t.expect, g.publication_date, g.organism FROM GENE_DB g, TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.t_seq_id = g.seq_id AND t.score > 25;
Suppose that the table PROT_DB stores protein sequences. Insert the protein query sequence to be used for the search into query_db
.
The following query does a BLASTP search of the given query sequence against protein sequences in PROT_DB and returns the identifier
, score
, name
, and expect
value of matches that score > 25:
SELECT t.t_seq_id, t.score, t.expect, p.name FROM PROT_DB p, TABLE( BLASTP_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '2'), CURSOR(SELECT seq_id, sequence FROM PROT_DB), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) )t WHERE t.t_seq_id = p.seq_id AND t.score > 25 ORDER BY t.expect;
Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id
, publication date
, modification date
, organism
, and sequence
among other attributes.
The following query does a BLAST search and alignment of the given query sequence against the human genes and returns the publication_date
, organism, and the alignment attributes of matching sequences that score > 25 and where more than 50% of the sequence is conserved in the match:
SELECT t.t_seq_id, t.alignment_length, t.pct_identity, t.q_seq_start, t.q_seq_end, t.t_seq_start, t.t_seq_end, t.score, t.expect, g.publication_date, g.organism FROM GENE_DB g, TABLE ( BLASTN_ALIGN ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.t_seq_id = g.seq_id AND t.score > 25 AND t.pct_identity > 50;
You can use BLASTP_ALIGN and TBLAST_ALIGN in a similar way.
The output of a BLAST query is a table; the output table is described as the output table for the specific query.
Here are two examples of queries and the resulting output tables.
Query 1 is as follows:
select T_SEQ_ID AS seq_id, score, EXPECT as evalue from TABLE( BLASTP_MATCH ( (select sequence from query_db), CURSOR(SELECT seq_id, seq_data FROM swissprot WHERE organism = 'Homo sapiens (Human)'), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) );
The output for query 1 is as follows:
SEQ_ID SCORE EVALUE -------- ---------- ---------- P31946 205 5.8977E-18 Q04917 198 3.8228E-17 P31947 169 8.8130E-14 P27348 198 3.8228E-17 P58107 49 7.24297332
Query 2 is as follows:
select T_SEQ_ID AS seq_id, ALIGNMENT_LENGTH as len, Q_SEQ_START as q_strt, Q_SEQ_END as q_end, Q_FRAME, T_SEQ_START as t_ strt, T_SEQ_END as t_end, T_FRAME, score, EXPECT as evalue from TABLE( BLASTP_ALIGN ( (select sequence from query_db), CURSOR(SELECT seq_id, seq_data FROM swissprot WHERE organism = 'Homo sapiens (Human)' AND creation_date > '01-Jan-90'), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) );
The output for Query 2 is as follows:
SEQ_ID LEN Q_STRT Q_END Q_FRAME T_STRT T_END T_FRAME SCORE EVALUE -------- ---- ------ ----- ------- ------ ----- ------- ------- ---------- P31946 50 0 50 0 13 63 0 205 5.1694E-18 Q04917 50 0 50 0 12 62 0 198 3.3507E-17 P31947 50 0 50 0 12 62 0 169 7.7247E-14 P27348 50 0 50 0 12 62 0 198 3.3507E-17 P58107 21 30 51 0 792 813 0 49 6.34857645
We provide a few sample datasets and queries to test if the BLAST functions work correctly after ODM is installed.
The DM_USER
schema contains the following sequence data tables.
SWISSPROT
: This table contains the sequences in Release 40 of the SwissProt dataset. This table has the sequence identifier, creation_date
, organism
, and sequence_data
attributes. It has 101,602 protein sequences.
SQL> describe SWISSPROT; Name Null? Type --------------------------------------- ------- ------------- SEQ_ID VARCHAR2(32) CREATION_DATE DATE ORGANISM VARCHAR2(256) SEQ_DATA CLOB
PROT_DB
: This table consists of 19 protein sequences from the SwissProt dataset.
SQL> describe prot_db; Name Null? Type ---------------------------------------- ------- ------------- SEQ_ID VARCHAR2(32) SEQ_DATA CLOB
ECOLI10
: This table contains 10 nucleotide sequences from the ecoli dataset.
SQL> describe ECOLI10; Name Null? Type ----------------------------------------- -------- --------------- SEQ_ID VARCHAR2(32) SEQ_DATA CLOB
Table 6-1 displays genetic codes and associated names.
There are several public domain sequence databases available. One of them is the SwissProt database, which is a highly curated collection of protein sequences. The SwissProt database can be downloaded from
ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot40.dat
In addition to the raw sequence data, the SwissProt database contains several other attributes of the sequence including organism, date published, date modified, published literature references, annotations, etc. BLAST requires only the sequence identifier and the sequence data to be stored to perform searches.
Depending on the needs of your specific application, different sets of these attributes may be important. Therefore, the database schema required to store the data needs to be appropriately designed. You can use a scripting language to parse the required fields from the SwissProt data and format the fields so that they can be loaded into an Oracle database.
The following Perl script outputs the sequence identifier, creation_date, organism, and sequence data in the required format for sqlldr
is given below. (sqlldr
is a program to load data into an Oracle database.)
#!/bin/perl #swissprot.pl < input > output #Input: protein db as provided by SWISSPROT # my $string = ""; my $indicator = ""; $sq = 0; $ac = 0; while(<>) { #chop; if ( /^\/\// ) { print "\n"; $sq = 0; $ac = 0; next; } if ($sq == 1) { @words = split; foreach $word (@words) { print "$word"; } next; } if( /^AC(\s+)(\w+);/ ) { if ($ac == 0) { $indicator = $2; print "$indicator|"; $sq = 0; $dt = 0; $ac = 1; next; } } if ( /^OS(\s+)(.*)\./ ) { $organism = $2; print "$organism|"; next; } if ( /^DT(\s+)(\S+)/ ) { if ($dt == 0) { print "$2|"; $dt = 1; } } if ( /^SQ(\s+)/ ) { $sq = "1"; next; } }
Follow these steps to download, parse, and save the SwissProt data in an Oracle database:
sprot40.dat
.swissprot.pl
, type the command
swissprot.pl sprot40.dat > sprot_formatted.txt
This command will read the SwissProt data stored in sprot40.dat
, format it, and write it out to sprot_formatted.txt
.
sqlldr
, you must create a table to hold the data and a control file. Create the table swissprot
using the following SQL statement:
create table swissprot (SEQ_ID VARCHAR2(32), CREATION_DATE DATE, ORGANISM VARCHAR2(256), SEQ_DATA CLOB);
Create a control file named sprot.ctl
with the following contents:
LOAD DATA INFILE sprot40_formatted.txt INTO TABLE swissprot REPLACE FIELDS TERMINATED BY '|' TRAILING NULLCOLS ( seq_id, creation_date, organism, seq_data char(100000) )
sqlldr userid=<user_name>/<passwd> control=sprot.ctl log=sprot.log direct=TRUE data=sprot40_formatted.txt
The SwisProt data is now stored in the Oracle table swissprot
.
The BLAST functionality is available as built-in table functions; these table functions can be used in the FROM clause of a SQL query.
This table function performs a BLASTN search of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are mentioned below. The match returns the identifier of the matched (target) sequence (t_seq_id
) (for example, the NCBI accession number), the score of the match, and the expect value.
function BLASTN_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, expect_value NUMBER default 10, open_gap_cost NUMBER default 5, extend_gap_cost NUMBER default 2, mismatch_cost NUMBER default -3, match_reward NUMBER default 1, word_size NUMBER default 11, xdropoff NUMBER default 30, final_x_dropoff NUMBER default 50) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Table 6-3 describes the input parameters for BLASTN_MATCH; Table 6-4, the output parameters.
Attribute | Description |
---|---|
t_seq_id |
The sequence identifier of the returned match. |
score |
The score of the returned match. |
expect |
The expect value of the returned match. |
This table function performs a BLASTP search of the given amino acid sequence against the portion of the selected protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. We also accept the standard BLAST parameters that are mentioned below. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.
function BLASTP_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Table 6-5 describes the input parameters for BLASTN_MATCH; Table 6-6, the output parameters.
Attribute | Description |
---|---|
t_seq_id |
The sequence identifier of the returned match. |
score |
The score of the returned match. |
expect |
The expect value of the returned match. |
This table function performs BLAST searches involving translations of either the query sequence or the database of sequences. The available options are:
The database can be selected using a standard SQL select and passed into the function as a cursor. We also accept the standard BLAST parameters that are mentioned below. The match returns the identifier of the matched (target) sequence (t_seq_id) (for example, the NCBI accession number), the score of the match, and the expect value.
function TBLAST_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, translation_type VARCHAR2 default 'BLASTX', genetic_code NUMBER default 1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Table 6-7 describes the input parameters for TBLAST_MATCH; Table 6-8, the output parameters.
Parameter | Description |
---|---|
query_seq |
The query sequence to search. This version of ODM BKLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
seqdb_cursor |
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
subsequence_from |
Start position of a region of the query sequence to be used for the search. The default is 1. |
subsequence_to |
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
translation_type |
Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX. |
genetic_code |
Used for translating nucleotide sequences to amino acid sequences. |
filter_low_complexity |
|
mask_lower_case |
|
sub_matrix |
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
expect_value |
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
open_gap_cost |
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
extend_gap_cost |
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
word_size |
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
x_dropoff |
Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
final_x_dropoff |
The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior. |
Attribute | Description |
---|---|
t_seq_id |
The sequence identifier of the returned match. |
score |
The score of the returned match. |
expect |
The expect value of the returned match. |
This table function performs a BLASTN alignment of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are mentioned below.
BLASTN_MATCH returns only the score and expect value of the match. It does not return information about the alignment. BLASTN_MATCH is typically used when you want to follow up a BLAST search with a Smith-Waterman alignment.
BLASTN_ALIGN does the BLAST alignment and returns the information about the alignment.
function BLASTN_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, expect_value NUMBER default 10, open_gap_cost NUMBER default 5, extend_gap_cost NUMBER default 2, mismatch_cost NUMBER default -3, match_reward NUMBER default 1, word_size NUMBER default 11, xdropoff NUMBER default 30, final_x_dropoff NUMBER default 50) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Table 6-9 describes the input parameters for BLASTN_ALIGN; Table 6-10, the output parameters.
This table function performs a BLASTP alignment of the given amino acid sequences against the selected portion of the protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. You can also use the standard BLAST parameters that are mentioned below.
BLASTP_MATCH function returns only the score and expect value of the match. It does not return information about the alignment. The BLASTP_MATCH function will typically be used where the user wants to follow up a BLAST search with a full FASTA or Smith-Waterman alignment.
The BLASTP_ALIGN function does the BLAST alignment and returns the information about the alignment. The schema of the returned alignment is the same as that of BLASTN_ALIGN.
function SYS_BLASTP_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Table 6-11 describes the input parameters for BLASTP_ALIGN; Table 6-12, the output parameters.
This table function performs BLAST alignments involving translations of either the query sequence or the database of sequences. The available translation options are BLASTX, TBLASTN, and TBLASTX. The schema of the returned alignment is the same as that of BLASTN_ALIGN and BLASTP_ALIGN.
function TBLAST_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default 0, translation_type VARCHAR2 default 'BLASTX', genetic_code NUMBER default 1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Table 6-13 describes the input parameters for TBLAST_ALIGN; Table 6-14, the output parameters.
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BKLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX. |
|
Used for translating nucleotide sequences to amino acid sequences. |
|
|
|
|
|
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior. |