Drosophila melanogaster genome annotation
release 3.2.1, date 07212004

DATA CONTENTS

Feature counts in release 3.2 (r321, July 2004; r320, March 2004)
compared to release 3.1 (Dec 2003, r310d and Spring 2003, r310g)

Feature                         r321      r320      r310g
------------------------------------------------------------
BAC                             949       949       --
CDS                             18747     18746     18109
DNA_motif                       5         5         0
EST                             310718+   304257    --
RNA_motif                       1         0         0
aberration_junction             86        87        0
cDNA_clone                      10283+    10204     --
enhancer                        27        27        0
five_prime_UTR                  18621?    13608     --
gene                            13472     13473     13369
golden_path                     437       437       437
insertion_site                  458       424       0
intron                          16153     16199     --
mRNA                            19307?    18810     18109
mRNA_genscan                    19052^    --        --
mRNA_piecegenie                 13740^    --        --
match_HDP                       2448      --        --
match_RNAiHDP                   40        --        --
match_fgenesh                   14838     --        --
mature_peptide                  7         8         0
ncRNA                           65        65        60
oligo                           197330+   193813    --
point_mutation                  485       476       0
polyA_site                      107       101       0
processed_transcript            15113$    16748     --
protein                         162585$   233812    --
protein_binding_site            92        85        0
pseudogene                      40        39        17
rRNA                            96        85        0
region                          30        28        0
regulatory_region               137       136       0
repeat_region                   4051+     3390      --
rescue_fragment                 136       135       0
sequence_variant                232       225       0
signal_peptide                  1         1         0
snRNA                           28        28        28
snoRNA                          28        28        28
so                              14666$    16244     --
tRNA                            288       288       288
tRNA_trnascan                   294^      --        --
three_prime_UTR                 18590?    15493     --
transcription_start_site        35698+    16997     --
transposable_element            1572      1567      1572
transposable_element_inserti..  3257$     4566      --
transposable_element_pred       1572^     --        --
------------------------------------------------------------
--  data not available for this feature
+   Increases in r321, probably from non-gene regions missed in
    r320 XML output
$   Reductions in transposable_element_insertion_site,
    processed_transcript and protein features, due to removal
    of duplicates
?   Cause of changed numbers is uncertain
^   Computed features added for r321 (genscan, genie, trnascan,
    predicted TE)

Cytologically located features

Feature                         r321      r320      r310g
------------------------------------------------------------
cyto_insertion                  16363$    21379     13522
cytobreakpoint_inv              4565      4119      4125
cytobreakpoint_other            791       9198      1810
cytobreakpoint_ttp              6243      3176      3235
cytodeleted_segment             11073     6874      6942
cytoduplicated_segment          880       1279      1328
cytogene                        5671$     6683      6494
------------------------------------------------------------
$   Reductions in cyto_insertion and cytogene due to removal of
    entries that duplicate sequence-located features
    Changes for r321 in other cyto features are due to
    reclassified data

-------
Data are from the Postgres Chado database, release 3.2, v 26, 29 July 2004
Copy at ftp://flybase.net/genomes/Drosophila_melanogaster/
  dmel_r3.2.1_07212004/pgsql/chado_r3_2_26.gz

#--------------------------------------------------------------
#  ??? NO CHANGE FROM r3.2.0 March ???
#--------------------------------------------------------------

WEB FUNCTIONS

Updates to data, with some software changes, for
 -- Gene annotation reports - updated and extended symbols, synonyms,
    IDs, annotation notes. Other features added.
 -- Genome maps (gbrowse) - added new feature types
 -- Sequence reports - new features: mature/signal peptides, etc.
See http://flybase.net/annot/

SYMBOLS and IDS

Symbols and IDs for annotations in this release have been updated to
correspond closely with gene data. The transcript and translation/CDS
symbols and IDs for FlyBase have changed somewhat over the last year.
An annotation has an ID of the form CG00000 (with a corresponding
FBan00000, which is being de-emphasized). Its mRNA and CDS have -Rx and
-Px suffixes respectively, where the letter 'x' extends to as many
variants as are found. In release 3.2, the standard symbols for gene
annotation CG00000 have been replaced with the accepted gene name
(where available); thus CG8094, CG8094-RA, CG8094-PA become
gene 'Hex-C', Hex-C-RA, Hex-C-PA.
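
The suffix convention described above can be illustrated with a short
Python sketch. It is not part of the FlyBase software. The function name
split_feature_name is hypothetical, and the pattern assumes that a variant
suffix is always "-R" or "-P" followed by one or more uppercase letters,
which the text implies but does not state exactly. A gene symbol that
happens to end in such a suffix would be misread, so treat this as a
heuristic only.

```python
import re

# Assumption: a variant suffix is "-R" (mRNA) or "-P" (CDS/translation)
# followed by one or more uppercase letters, e.g. "-RA" or "-PB".
_SUFFIX = re.compile(r"^(?P<base>.+?)-(?P<kind>[RP])(?P<variant>[A-Z]+)$")
_KINDS = {"R": "mRNA", "P": "CDS"}


def split_feature_name(name):
    """Split a symbol or ID into (base, feature kind, variant letters).

    Names without a recognised suffix are reported as genes with
    variant None.
    """
    m = _SUFFIX.match(name)
    if m is None:
        return (name, "gene", None)
    return (m.group("base"), _KINDS[m.group("kind")], m.group("variant"))


if __name__ == "__main__":
    for name in ("CG8094", "CG8094-RA", "Hex-C-RA", "Hex-C-PA"):
        print(name, "->", split_feature_name(name))
```

The same function handles both the CG-style IDs and the gene-name symbols,
since it only inspects the final suffix.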
The CG8094 ID is supported as a more computable alternative to this
symbolic name, but will be less visible than the more consistent and
memorable gene names. The data files are not yet consistent about when
to use 'Hex-C-PA' and when to use CG8094-PA.

BULK FILE SET

See ftp://flybase.net/genomes/Drosophila_melanogaster/current/

  blast/      - updated NCBI blast database set for transcripts,
                translations and transposons
  dna/        - DNA in fasta and/or raw format files per chromosome arm;
                no change from release 3 data
  fasta/      - DNA and protein data per chromosome and feature type
  feats-all/  - intermediate files of all feature locations in
                tabular format
  gff/        - GFF v2 standard feature files per chromosome
  gnomap/     - Gnomap standard feature files per chromosome
                (drive genome map views)
  pgsql/      - Postgres Chado database dump, source of most of
                these files
  srs/        - SRS search indices
  fbobs/      - Acode format annotation object data files for
                web services
  xml-chado/  - Chado format XML database output of genes, DNA and
                other features, per scaffold
  xml-game/   - GAME format XML database output of genes, DNA and
                other features, per scaffold

Bulk files compared to those of release 3.1:

  whole_genome_*    -- create by catenating each chr file set
  heterochromatin_* and (2h,3h,Xh,Yh,U)
                    -- 'heterosomes' to be added
  euchromatin_*     -- create by catenating each chr file set,
                       excluding 'heterosomes'

  per chromosome set
  2L_3_UTR, 2L_5_UTR        == dmel_2L_three_prime_UTR,
                               dmel_2L_five_prime_UTR
  2L_CDS                    == dmel_2L_CDS
  2L_annotation             == catenate dmel_2L_gene with
                               (tRNA,miscRNA,transposon) set
  2L_annotation_extend5000  == dmel_2L_gene_extended5000, minus
                               (tRNA,miscRNA,transposon) set
  2L_annotation_extend2000  .. not planned
  2L_annotation_extend500   .. not planned
  2L_exon                   .. not planned
  2L_genomic                == dmel_2L_chromosome
                               (chromosome arm dna, same as rel3.1)
  2L_genomic_scaffolds      == dmel_2L_scaffolds
                               (segment dna, same as rel3.1)
  2L_intron                 .. not planned
  2L_masked_genomic         .. not planned
  2L_noncoding-gene         == catenate
                               (tRNA,miscRNA,transposon,pseudogene)
  2L_protein-coding-gene    == dmel_2L_gene
  2L_splice_site            .. not planned
  2L_tRNA                   == dmel_2L_tRNA
  2L_transcript             == dmel_2L_transcript
  2L_translation            == dmel_2L_translation
                               (curated translations)
  2L_transposable_element   == dmel_2L_transposon
  2L_unique_intergenic      .. not planned
  2L_unique_intron          .. not planned

  Not in past release:
  dmel_2L_miscRNA
  dmel_2L_pseudogene

File name format: $org_$chr_$feature_$release.$format

  $org in (dmel)
  $chr in (2L 2R 3L 3R X 4), (2h 3h Xh Yh U)
  $feature in (
     gene, mRNA, CDS, CDS-translation,
     transposon/transposable_element, pseudogene, tRNA,
     miscRNA=ncRNA,snRNA,snoRNA,rRNA
     gene-extended5000
     chromosome-arm
     scaffold
     )
  $release in (
     r3.1.0g (gadfly, summer 2003)
     r3.1.0d (chado r3.1.0_12182003)
     r3.2.0a (chado r3.2.0_12052003)
     r3.2.0c (chado r3.2.0_03162004)
     )
  $format in ( .fasta(.gz) .gff(.gz) .chado.xml(.gz) .game.xml(.gz) )

ANNOTATION RELEASE 3.1 HOLD-OVERS

ftp://flybase.net/genomes/Drosophila_melanogaster/dmel_RELEASE3-1/
  Annotations_and_Evidence/
  GFF/
  blastdb/
  FASTA/
  README

Annotations_and_Evidence/
------
  >>> euchromatic scaffolds, updated in r3.2 release
  AE002603.xml.gz .. AE003847.xml.gz

  >>> heterochromatin and centromere scaffolds - no r3.2 equivalent yet
  AABU01000058.xml.gz .. AABU01002775.xml.gz
  2L_wgs3_centromere_extension.xml.gz
  2R_wgs3_centromere_extension.xml.gz
  3L_wgs3_centromere_extension.xml.gz
  3R_wgs3_centromere_extension.xml.gz
  X_wgs3_centromere_extensionB.xml.gz
  linked_1.xml.gz
  linked_2.xml.gz
  linked_3.xml.gz
  linked_4.xml.gz
  linked_5.xml.gz
  linked_6.xml.gz
  linked_7.xml.gz

FASTA
-------------
  Heterochromatin sections are not yet available for r3.2
    2h, 3H, Xh, Yh, U (heterochromatin, unclassified)
  Block dna (fasta) sections are identical for r3.2
    scaffolds, genomic, masked_genomic

CHADO DATABASE LOOKUP SERVICE

SERVICE URL
  http://flybase.net/apollo-cgi/chado2apollo.cgi

Information and software at
  http://bugbane.bio.indiana.edu:7092/apollo/

EXAMPLES
  http://flybase.net/apollo-cgi/chado2apollo.cgi?scaffold=AE003650
  http://flybase.net/apollo-cgi/chado2apollo.cgi?gene=cact
  http://flybase.net/apollo-cgi/chado2apollo.cgi?range=2L:300000-310000
  http://flybase.net/apollo-cgi/chado2apollo.cgi?band=34A

This service supports the Apollo genome browser/editor. It returns
GAME XML gene and genome objects in response to basic queries by
'gene' name/ID, 'scaffold' section, genome base 'range' or cytological
'band'. It currently works well for scaffold chunks of data, which use
pre-generated XML, but it is very slow (5 - 10 minutes) at generating
XML for gene region queries. By default the service now returns
pre-generated scaffolds for any query. We will work to improve this.
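
As an illustration of the four query forms listed under EXAMPLES, the
following Python sketch assembles lookup URLs. It is not FlyBase code.
The function name lookup_url and the check on the query key are
assumptions made for this example, and the sketch only builds the URL
string; it makes no network request and does not assume the service is
still reachable.

```python
from urllib.parse import urlencode

# Service URL and query keys are taken from the EXAMPLES above.
SERVICE_URL = "http://flybase.net/apollo-cgi/chado2apollo.cgi"
QUERY_KEYS = ("scaffold", "gene", "range", "band")


def lookup_url(key, value):
    """Return a lookup URL for one of the documented query keys.

    Raises ValueError for a key that the README does not list.
    """
    if key not in QUERY_KEYS:
        raise ValueError("unsupported query key: %r" % (key,))
    # Keep ':' unescaped so a range reads 2L:300000-310000, as in the examples.
    return SERVICE_URL + "?" + urlencode({key: value}, safe=":")


if __name__ == "__main__":
    print(lookup_url("scaffold", "AE003650"))
    print(lookup_url("gene", "cact"))
    print(lookup_url("range", "2L:300000-310000"))
    print(lookup_url("band", "34A"))
```

Given the generation times noted above, a client that fetches these URLs
would probably want a long timeout for 'gene' and 'range' queries.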