New RefSeq Annotations!

In February and March, the NCBI Eukaryotic Genome Annotation Pipeline released thirty-seven new annotations in RefSeq for the following organisms:

  • Belonocnema kinseyi (wasp)
  • Daphnia pulex (common water flea)
  • Daphnia pulicaria (crustacean)
  • Dermatophagoides farinae (American house dust mite)
  • Diprion similis (hymenopteran)
  • Drosophila willistoni (fly)
  • Equus quagga burchellii (Burchell’s zebra)
  • Gallus gallus (chicken)
  • Haliotis rubra (blacklip abalone)
  • Haliotis rufescens (red abalone)
  • Helicoverpa zea (corn earworm)
  • Homalodisca vitripennis (glassy-winged sharpshooter)
  • Hydra vulgaris (swiftwater hydra)
  • Hypomesus transpacificus (delta smelt)
  • Ictalurus punctatus (channel catfish)
  • Ischnura elegans (damselfly)
  • Lolium rigidum (monocot)
  • Lucilia cuprina (Australian sheep blowfly)
  • Lynx rufus (bobcat)
  • Marmota monax (woodchuck)
  • Meles meles (Eurasian badger)
  • Micropterus dolomieu (smallmouth bass)
  • Neodiprion fabricii (hymenopteran)
  • Neodiprion lecontei (redheaded pine sawfly)
  • Neodiprion pinetum (white pine sawfly)
  • Neodiprion virginiana (hymenopteran)
  • Oncorhynchus gorbuscha (pink salmon)
  • Osmia bicornis bicornis (red mason bee)
  • Scatophagus argus (bony fish)
  • Schistocerca americana (American grasshopper)
  • Schistocerca piceifrons (Central American locust)
  • Silurus meridionalis (bony fish)
  • Ursus americanus (American black bear)
  • Vanessa cardui (painted lady)
  • Vespa crabro (European hornet)
  • Vigna umbellata (eudicot)
  • Xenia sp. Carnegie-2017 (soft coral)

View the full list of annotated eukaryotes available in the Genome Data Viewer (GDV) browser.

New feature in the MSA viewer: Search for a short sequence

We’re reading and incorporating your feedback! As requested, you can now search for sequences in our Multiple Sequence Alignment (MSA) Viewer. You can search the anchor or consensus sequence of a multiple alignment for short sequence strings.

MANE is published in Nature!

We are delighted to announce that three and a half years of hard work by the collaborative team behind the Matched Annotation from NCBI and EMBL-EBI (MANE) dataset has culminated in a full article in the April 14 issue of Nature! We invite you to read the online article to learn more about the goals of the MANE collaboration, the MANE offerings and how to access them, and the methods used to generate MANE data. And of course, you now have a paper to cite when you use MANE data!

Morales, J., Pujar, S., Loveland, J.E. et al. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature (2022).  DOI: 10.1038/s41586-022-04558-8

Launched in October 2018, MANE is a collaboration between the National Library of Medicine’s (NLM) National Center for Biotechnology Information (NCBI) and the EMBL’s European Bioinformatics Institute (EMBL-EBI), the two major groups that provide whole-genome annotation for a broad range of organisms including human. Our initial offering, MANE Select, is intended to be used as a universal standard to report clinical variants and for browser display in genome resources. Starting from MANE v0.92, we added MANE Plus Clinical transcripts for a small set of genes where MANE Select alone was not sufficient to report known clinical variants (Figure 1).

Figure 1. The Sequence Viewer showing the MANE Project track and the NCBI Genes track for the human SCN5A gene region on chromosome 3. The MANE track has the MANE Select transcript, NM_000335, and the MANE Plus Clinical transcript, NM_001099404, providing two standard transcripts to represent the gene.

NCBI hidden Markov models (HMM) release 8.0 now available!

Release 8.0 of the NCBI Hidden Markov models (HMM), used by the Prokaryotic Genome Annotation Pipeline (PGAP), is now available for download. You can search this collection against your favorite prokaryotic proteins to identify their function using the HMMER sequence analysis package.
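As a rough illustration of such a search, the sketch below assembles an `hmmsearch` command line (HMMER 3) from Python. The file names are hypothetical placeholders, and using `--cut_tc` assumes the downloaded models carry trusted cutoffs; pass an explicit E-value threshold instead if they do not.

```python
import shlex
from typing import List, Optional


def build_hmmsearch_command(hmm_file: str, protein_fasta: str, table_out: str,
                            evalue: Optional[float] = None,
                            cpus: int = 1) -> List[str]:
    """Assemble an hmmsearch (HMMER 3) command as an argument list."""
    cmd = ["hmmsearch", "--cpu", str(cpus), "--tblout", table_out]
    if evalue is None:
        # Assumption: the models define trusted cutoffs (TC lines).
        cmd.append("--cut_tc")
    else:
        # Otherwise report hits at or below the given E-value.
        cmd += ["-E", str(evalue)]
    # hmmsearch takes the HMM file first, then the sequence database.
    cmd += [hmm_file, protein_fasta]
    return cmd


# Hypothetical file names, for illustration only.
command = build_hmmsearch_command("ncbi_models.hmm", "my_proteins.faa", "hits.tbl")
print(shlex.join(command))
# To run it, pass the list to subprocess.run(command, check=True)
# on a machine where HMMER is installed.
```

The `--tblout` file is a whitespace-delimited table with one line per model-protein hit, which is usually the easiest output to parse downstream.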

The 8.0 release contains 15,358 models, including 160 that are new since 7.0. In addition, we have added better names, EC numbers, Gene Ontology (GO) terms, gene symbols or publications to over 550 existing HMMs. You can search and view the details for these in the Protein Family Model collection, which also includes conserved domain architectures and BlastRules, and find all RefSeq proteins they name.

GO terms associated with HMMs are now propagated to coding sequences and proteins annotated with PGAP. In case you missed it, see our previous blog post on this topic.

BLAST+ 2.13.0 now available with SRA BLAST, ARM Linux executables, and database metadata

BLAST+ 2.13.0 adds several important features: SRA BLAST programs, ARM Linux executables, and the ability to produce database metadata. It also includes some important improvements and a few bug fixes. You can download the new BLAST release from the FTP site.

New features

SRA / WGS BLAST (blastn_vdb, tblastn_vdb)

Beginning with this release, the BLAST distribution includes the SRA BLAST programs blastn_vdb and tblastn_vdb, which can search SRA and WGS projects directly without the need to build a BLAST database. See the BLAST documentation on how to use these programs with WGS projects.

ARM Linux executables

This release also includes executables compiled under ARM Linux for the first time. Please let us know if you find any issues with ARM Linux programs.

Database metadata in JSON format

Starting with BLAST+ 2.13.0, the makeblastdb program generates an additional file with the extension .njs for nucleotide databases or .pjs for protein databases. These files contain BLAST database metadata in JSON format. See the BLAST database metadata section in the BLAST User Manual for an example. Many tools can read this file easily, and it makes the BLAST database more compliant with FAIR principles.
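Because the metadata is plain JSON, a few lines of Python are enough to inspect it. The sketch below makes no assumption about which keys the file contains; it simply lists the scalar top-level fields. The path in the usage comment is a hypothetical example.

```python
import json


def load_blast_db_metadata(path):
    """Read a BLAST database metadata file (.njs or .pjs) into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize(metadata):
    """Return one 'key: value' line per scalar top-level field."""
    return [
        f"{key}: {value}"
        for key, value in metadata.items()
        if not isinstance(value, (list, dict))  # skip nested values such as lists
    ]


# Usage (hypothetical path produced by makeblastdb):
# for line in summarize(load_blast_db_metadata("my_database.njs")):
#     print(line)
```

Skipping nested values keeps the summary short; print the whole dictionary if you need every field.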

See the release notes for more details on improvements and bug fixes for the release.

Important reminder about usage reporting

As we announced previously, BLAST can report limited usage information back to NCBI. This information shows us whether the community is using BLAST+, and therefore whether it is worth maintaining and developing. It also allows us to focus our development efforts on the most used aspects of BLAST+. Please help us improve BLAST by allowing BLAST to share information about your search. The BLAST privacy statement provides details on the information collected, how it is used, and how to opt out of reporting if you don’t want to participate.

Using NCBI resources to research, detect, and treat genetic phenotypes

Clinical Genetics Information at Your Fingertips

NCBI offers a portfolio of medical genetics resources to help you research, diagnose, and treat diseases and conditions. You can easily access our data and tools through the Medical Genetics and Human Variation page of the NCBI website. We also encourage you to join our community of thousands of submitters and share your germline and/or somatic data to advance discovery and optimize clinical care. 

How and why should you use our resources? Consider the example below. 

Your patient is a 40-year-old mother of two presenting with changes in bathroom habits, bleeding, and belly pain. She has a medical history of colonic polyps. Her family history reveals that her maternal grandmother, mother and uncle had several forms of cancers including colon, breast, and endometrium. 

Test Server for the PubMed API (E-utilities) is Now Available

Official update scheduled to launch June 2022 

As previously announced, we will be moving to an updated version of the E-utilities API for PubMed. We are planning to delay this change until June 2022 to give you time to test your API calls on the new service, report issues, and provide your feedback. Don’t wait until launch! A test server is available leading up to the release and ready for you to try! 

How do I use the test server? 

The test server is available through the following URL:

Test server: https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/
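One way to test existing calls is to keep the request parameters unchanged and swap only the base URL. The sketch below builds (but does not send) an E-utilities request URL; the helper name and the search term are illustrative assumptions, while `esearch`, `db`, `term`, and `retmax` are standard E-utilities names.

```python
from urllib.parse import urlencode

PRODUCTION_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
TEST_BASE = "https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, params, base=TEST_BASE):
    """Return the full URL for an E-utilities call such as 'esearch'."""
    return f"{base}{utility}.fcgi?{urlencode(params)}"


# Example: the same PubMed search against both servers.
params = {"db": "pubmed", "term": "MANE transcript", "retmax": 5}
print(build_eutils_url("esearch", params))                   # test server
print(build_eutils_url("esearch", params, PRODUCTION_BASE))  # current service
```

Comparing the responses from the two URLs for a handful of representative queries is a quick way to spot differences worth reporting.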

Introducing ElasticBLAST – BLAST® is now easier, bigger, and faster on the Cloud!

ElasticBLAST is a new tool that helps you run BLAST searches on the cloud. ElasticBLAST is perfect for you if you have thousands to millions of queries to our Basic Local Alignment Search Tool (BLAST ®), or if you want to use cloud infrastructure for your searches. ElasticBLAST can handle large searches that are not appropriate for NCBI web BLAST, and it runs them more quickly than stand-alone BLAST+.

ElasticBLAST works on two of the current NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) partners: Amazon Web Services (AWS) and Google Cloud Platform (GCP). ElasticBLAST works by distributing your searches across multiple cloud instances to process them in parallel. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+. ElasticBLAST can handle millions of queries, and it also supports most BLAST+ options and programs.

Making it easier to run BLAST on the cloud

ElasticBLAST reduces the barrier to using the cloud by creating and managing cloud resources for you. It manages the software and database installation, handles partitioning of the BLAST workload among the various instances and deallocates cloud resources when the searches are done. For example, ElasticBLAST will select the best cloud instance type for your search based on the database metadata that provides database size and memory needs (Figure 1). You can also manually select the instance type if you prefer.

Fig. 1: JSON metadata for the 16S_ribosomal_RNA database. The “bytes-to-cache” information helps ElasticBLAST pick out an instance with the appropriate capacity.
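To make the idea concrete, here is a minimal sketch of memory-based instance selection. It is not ElasticBLAST's actual logic: the catalog, the instance names, and the headroom factor are all hypothetical, and only the “bytes-to-cache” field comes from the metadata described above.

```python
GIB = 1024 ** 3

# Hypothetical catalog of (instance name, memory in bytes).
HYPOTHETICAL_CATALOG = [
    ("small", 8 * GIB),
    ("medium", 32 * GIB),
    ("large", 128 * GIB),
]


def choose_instance(bytes_to_cache, catalog=HYPOTHETICAL_CATALOG, headroom=1.5):
    """Pick the smallest instance whose memory covers the database plus headroom."""
    needed = bytes_to_cache * headroom
    candidates = [(memory, name) for name, memory in catalog if memory >= needed]
    if not candidates:
        raise ValueError("no instance in the catalog has enough memory")
    # min() compares memory first, so the smallest sufficient instance wins.
    return min(candidates)[1]


# A database needing 10 GiB of cache, with 1.5x headroom, requires 15 GiB.
print(choose_instance(10 * GIB))
```

The headroom factor stands in for the memory the search itself needs beyond the cached database; a real selection would also weigh CPU count and price.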

Selecting Databases

ElasticBLAST can access the 28 NCBI databases available on AWS and GCP. These are the same databases that are available from the NCBI FTP site. For instance, databases available on the two cloud providers include the RefSeq Eukaryotic Representative Genomes database, the 16S database based on Targeted Loci, and the human and mouse genome databases.

You can also provide your own databases, and you can produce the metadata needed to select an instance through a Python script that comes with ElasticBLAST.

Example Runs

ElasticBLAST can perform a variety of searches with query sets that range from hundreds to millions of sequences and BLAST databases of all sizes.  Table 1 shows ElasticBLAST searches with query sets that range up to billions of letters using a variety of BLAST databases.

Table 1: Sample ElasticBLAST searches.  This table demonstrates the breadth of searches supported by ElasticBLAST.  Additionally, the first row demonstrates the ability of ElasticBLAST to use many CPUs (3200) on a cloud provider at once to complete a task in hours that would have taken days on a single machine.

Costs

Because ElasticBLAST runs on cloud providers, using it will incur some cost. Based on current cost structures on AWS and GCP, in most cases these costs are quite small. For example, a protein search with a query of about 20 million residues against a database of about 20 billion residues can cost less than $5. Even a larger search with a query of 3-4 billion DNA bases can cost only around $50. Both cloud services include the option to bid on instances for less than full price, which can result in significant savings. ElasticBLAST can be configured to request such instances. Your costs will vary based on many factors, and we encourage you to explore these options with the individual cloud providers. Also, both AWS and GCP offer a free tier or time-limited trial of their cloud services, and you can find information about using ElasticBLAST with the free tiers here.

Welcome to ElasticBLAST!

Go ahead and run your first ElasticBLAST search! We are sure you’ll love how ElasticBLAST accelerates your research.

Your feedback is crucial to the development and support of ElasticBLAST. If you have any questions or suggestions, please reach out to us at [email protected]. We’d love to hear from you.

ElasticBLAST is a cloud-native package developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) with support from the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.

New PMC Website Design is Live!

We have launched a fresh look and feel for the PubMed Central (PMC) website, which marks the first step of an ongoing modernization effort. The updated website will allow us to make continuous enhancements to PMC based on your feedback.

What Has Changed? 

Now when you visit PMC’s homepage, you will see:  

  • A redesigned and reorganized homepage  
  • Easy-to-navigate help documentation  
  • A similar look and feel between features in PMC and PubMed  
  • A streamlined article display  

Figure 1 highlights features of the new PMC article display. You can also find the most up-to-date version of this information in PMC’s User Guide.

RefSeq Release 211 is available

RefSeq Release 211 is now available online, from the FTP site and through NCBI’s Entrez programming utilities, E-utilities.

This full release incorporates genomic, transcript, and protein data available as of March 7, 2022, and contains 308,229,655 records, including 224,211,842 proteins, 43,956,061 transcripts, and sequences from 117,030 organisms. The release is provided in several directories as a complete dataset and also as divided by logical groupings.

This release includes new annotations generated by NCBI’s eukaryotic genome annotation pipeline for 36 species.