List of Most Popular Bioinformatics Databases
Biological data are dynamic, heterogeneous, vast and incomplete. Many databases have therefore been developed to organise these data and make them reliably accessible.
A bioinformatics database is a computer-readable collection of biological data that is easy to use and speeds up search and retrieval. A good database keeps its records up to date.
Importance of biological databases
Biological databases can be used to retrieve a range of information, such as biological sequences, structures, binding sites, metabolic interactions, molecular action, functional relationships, protein families, motifs and homologs. The main purpose of a biological database is to store and manage biological data and information in computer-readable form.
Primary database vs. secondary database
- A primary database contains only sequence or structural information.
- A secondary database is derived from the analysis or treatment of primary data. It is very important for inferring protein function.
“A biological database is a structured large group of permanent data, typically linked to computerised software that updates, queries and retrieves data components stored in the system. A simple database can be a single file with several records, each with the same details.”
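To make the idea of a single-file database concrete, here is a minimal sketch. The tab-separated layout, the field names (`accession`, `organism`, `sequence`) and the records are all made up for illustration; real databases use their own, much richer formats.

```python
import csv
import io

# Hypothetical flat-file "database": one record per line,
# with the same fields in every record. All values are invented.
FLAT_FILE = (
    "accession\torganism\tsequence\n"
    "DEMO001\tEscherichia coli\tATGGCGTAA\n"
    "DEMO002\tHomo sapiens\tATGCCCGGGTAA\n"
    "DEMO003\tHomo sapiens\tATGAAATAG\n"
)


def load_records(text):
    """Parse the flat file into a list of dicts keyed by field name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def query(records, field, value):
    """Return every record whose field equals value (a simple linear scan)."""
    return [record for record in records if record[field] == value]


records = load_records(FLAT_FILE)
human = query(records, "organism", "Homo sapiens")
print([record["accession"] for record in human])
```

A linear scan like this is fine for a handful of records; larger databases rely on indexes and dedicated query software, as the definition above suggests.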
Some of the common databases are GenBank from NCBI, SwissProt from the Swiss Institute of Bioinformatics, and PIR from the Protein Information Resource.
- GenBank: GenBank, a genetic sequence database, is one of the fastest-growing collections of known genetic sequences.
- EMBL: the EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences, collected from the scientific literature and patent applications and submitted directly by researchers and sequencing groups.
- SwissProt: this is a protein sequence database that offers a high level of integration with other databases and a very low level of redundancy (that is, few identical sequences are present in the database).
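The notion of redundancy mentioned for SwissProt can be illustrated with a small sketch that collapses identical sequences into single entries. The identifiers and sequences are invented, and this is only an illustration of the concept, not a description of how SwissProt is actually curated.

```python
def collapse_identical(entries):
    """Group (identifier, sequence) pairs so each distinct sequence appears once.

    Returns a dict mapping each sequence to the identifiers that share it.
    """
    groups = {}
    for identifier, sequence in entries:
        key = sequence.upper()  # treat upper/lower case as the same sequence
        groups.setdefault(key, []).append(identifier)
    return groups


# Made-up entries: two of the three carry the same sequence.
entries = [
    ("demo_a", "MKTAYIAK"),
    ("demo_b", "MKTAYIAK"),
    ("demo_c", "MSLLTEVE"),
]

nonredundant = collapse_identical(entries)

# Fraction of entries that duplicate a sequence already present.
redundancy = 1 - len(nonredundant) / len(entries)
print(len(nonredundant), round(redundancy, 2))
```

Here three entries reduce to two distinct sequences, so one third of the entries are redundant.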
NCBI, the National Center for Biotechnology Information, maintains a number of useful databases for bioinformatics. Selected databases are described below.
- 1000 Genomes Browser: An interactive graphical viewer that allows users to explore variant calls, genotype calls and supporting evidence (such as aligned sequence reads) that have been produced by the 1000 Genomes Project.
- BLAST: The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
- ClinVar: A resource to provide a public tracked record of reported relationships between human variation and observed health status with supporting evidence.
- dbSNP: dbSNP (Database of Short Genetic Variations) includes single nucleotide variations, microsatellites, and small-scale insertions and deletions. dbSNP contains population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.
- dbVar: dbVar (Database of Genomic Structural Variation) has been developed to archive information associated with large scale genomic variation, including large insertions, deletions, translocations and inversions. In addition to archiving variation discovery, dbVar also stores associations of defined variants with phenotype information.
- dbGaP: Database of Genotypes and Phenotypes (dbGaP) is an archive and distribution centre for the description and results of studies that investigate the interaction of genotype and phenotype. These studies include genome-wide association studies (GWAS), medical resequencing, molecular diagnostic assays, as well as associations between genotype and non-clinical traits.
- Gene: A searchable database of genes, focusing on genomes that have been completely sequenced and that have an active research community to contribute gene-specific data. Information includes nomenclature, chromosomal localization, gene products and their attributes (e.g., protein interactions), associated markers, phenotypes, interactions, and links to citations, sequences, variation details, maps, expression reports, homologs, protein domain content, and external databases.
- Genome: Contains sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and Eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.
- MedGen: Organizes information related to human medical genetics, such as attributes of conditions with a genetic contribution.
- Nucleotide: A collection of nucleotide sequences from several sources, including GenBank, RefSeq, the Third Party Annotation (TPA) database, and PDB. Searching the Nucleotide Database will yield available results from each of its component databases.
- Protein: The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.
- NCBI Develop: NCBI provides a variety of resources that allow developers to access and manipulate NCBI data in their applications. Use this resource for information on APIs, code libraries, and data formats.
- NCBI GitHub Repository: NCBI's public source code repositories on GitHub.
- E-Utilities API for NCBI: The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the National Center for Biotechnology Information (NCBI). The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data. The E-utilities are therefore the structured interface to the Entrez system, which currently includes 38 databases covering a variety of biomedical data, including nucleotide and protein sequences, gene records, three-dimensional molecular structures, and the biomedical literature.
- ENCODE: The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
- UniProt: Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data. The UniProt databases are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
- REACTOME: REACTOME is an open-source, open-access, manually curated and peer-reviewed pathway database. Its goal is to provide intuitive bioinformatics tools for the visualization, interpretation and analysis of pathway knowledge to support basic and clinical research, genome analysis, modelling, systems biology and education.
- Broad Institute of Harvard and MIT: The Broad Institute of Harvard and MIT shares some of the data and software tools it produces with the wider scientific community.
- Genome Analysis Toolkit – Broad Institute: Developed in the Data Sciences Platform at the Broad Institute, the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
- Firehose – Broad Institute: A suite of tools and pipelines developed for processing and analyzing various types of large-scale genomic and proteomic data.
- Firebrowse – Broad Institute: A tool to explore and visualize cancer data generated by Broad GDAC Firehose. It provides graphical tools such as viewGene, to explore expression levels, and iCoMut, to explore a comprehensive mutation analysis of each TCGA disease, as well as an API for programmers.
- KEGG: Kyoto Encyclopedia of Genes and Genomes: KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
- TCGA Computational Tools – National Cancer Institute: The Cancer Genome Atlas (TCGA) catalyzed considerable growth and advancement in the computational biology field by supporting the development of high-throughput genomic characterization technologies, generating a massive quantity of data, and fielding teams of researchers to analyze the data. This resource collects some of the tools that TCGA network researchers and collaborators developed and used to analyze TCGA data.
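The fixed URL syntax described in the E-utilities entry above can be sketched in a few lines. The example below only builds an ESearch URL and does not contact NCBI. The helper name `build_esearch_url` and the search term are assumptions chosen for illustration; the base URL and the `db`, `term`, `retmax` and `retmode` parameters follow NCBI's public E-utilities documentation, which should be checked for current details.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the E-utilities, per NCBI's public documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=5):
    """Build an ESearch URL (hypothetical helper, not part of any NCBI library).

    The URL is the utility name followed by a standard set of query parameters.
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Illustrative query: nucleotide records for a human gene.
url = build_esearch_url(
    "nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]"
)
print(url)

# Round-trip check: the query string decodes back to the same parameters.
decoded = parse_qs(urlparse(url).query)
print(decoded["db"], decoded["term"])
```

The resulting URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(url)`. NCBI asks users to limit their request rate, so a real script would need to throttle its calls accordingly.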
The following are additional suggested literature databases; this is not a complete list of all databases for bioinformatics.
- PubMed: PubMed comprises more than 29 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content from PubMed Central and publisher web sites.
- Embase: Embase is a versatile and up-to-date biomedical research database covering the most important international biomedical literature from 1947 to the present day, with more than 32 million records from 8,200 journals and ‘grey literature’ from over 2.4 million conference abstracts. Embase includes unique non-English content and coverage of the most important types of evidence, such as randomized controlled trials, controlled clinical trials, Cochrane reviews and meta-analyses.
- BioMed Central: BMC has an evolving portfolio of some 300 peer-reviewed, open access journals, sharing discoveries from research communities in science, technology, engineering and medicine.
- BioOne: Not-for-profit collaboration bringing together scientific societies, publishers, and libraries to provide access to critical, peer-reviewed research in the biological, ecological, and environmental sciences.
- PLoS Biology: An open-access, peer-reviewed journal published by the Public Library of Science. It features work in all areas of biological science, including work that interfaces with other disciplines such as chemistry, medicine and mathematics.
- ScienceDirect: ScienceDirect is the website for selected journal titles from the scholarly publisher Elsevier and its affiliates.
- Web of Science: Citations and abstracts from scholarly literature in the sciences, social sciences, arts, and humanities. Includes conference proceedings, symposia, seminars, colloquia, workshops, and conventions. One of the most comprehensive databases of academic research.