Saturday, 27 March 2021

Exploring ONT's experimental basecaller Bonito on the major crop potato

Bonito is an experimental basecaller for Oxford Nanopore reads. I wanted to assess if, and how, this basecaller improves read accuracy on data from one of the crops we work with, namely potato. Our initial focus was on improving basecalling accuracy using data from a nearly homozygous diploid potato cultivar named Solyntus, for which we developed and published the genome sequence. Besides improving read accuracy, which could help to further improve that genome assembly, our main ambition is to improve basecalling accuracy in tetraploid potato, which would allow us to perform phase-specific genome assemblies.

ONT provides a set of experimental models via its rerio repository, which consists of "research release" basecalling models and configuration files.

ONT's Bonito also allows training of (crop-specific) basecalling models.

We tested the current state-of-the-art basecaller (guppy v4.5.2) using the High Accuracy (HAC) model and three rerio models, and compared these to several self-trained potato models built with bonito. Data was obtained from R9.4.1 pores.

Conclusion

Self-trained models are more accurate, but the higher basecalling accuracy comes at the cost of increased processing time.

Accuracy

A subset of 40K ONT reads was taken, basecalled with either guppy or bonito, and mapped against the Solyntus reference genome. Per-read mapping accuracy was measured and plotted.

Fig 1. Density plots showing read mapping accuracy after basecalling with different methods of the same subset of 40K ONT reads.
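The post does not list the exact commands used for this measurement. As a rough sketch, assuming the basecalled reads were mapped with a standard aligner such as minimap2 and that per-read accuracy is computed as 1 - NM / alignment length (the NM tag being the alignment's edit distance; the actual metric used may differ), the calculation per read looks like this:

```python
# Hypothetical sketch of a per-read accuracy calculation from a SAM alignment.
# Assumptions: reads were mapped with minimap2 (or similar), the NM tag holds
# the edit distance, and accuracy = 1 - NM / alignment_length.
import re

def alignment_length(cigar):
    """Count alignment columns: matches/mismatches plus insertions and deletions."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in "MID=X")

def read_accuracy(cigar, nm):
    """Per-read identity: 1 - edit_distance / alignment_length."""
    return 1.0 - nm / alignment_length(cigar)

# Toy example: 98 aligned bases plus a 2 bp insertion, 5 edits in total.
acc = read_accuracy("60M2I38M", 5)
print(round(acc, 3))
```

Collecting this value for every mapped read and plotting the distributions per basecalling model yields density plots like those in Fig 1.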

The reference is the default ONT HAC calling using guppy, with a median accuracy of approximately 93%. The two modified-base rerio models (min_modbases_5dmc and min_modbases_5mC_5hmC) did not perform better. Basecalling* with the rerio crf model (v0.3.2) improved accuracy by approximately 2%. However, the in-house trained models outperformed these models in terms of accuracy. The difference between bonito_05 and bonito_15 is the number of reads used for training (15K and 45K, respectively). Bonito_05 was trained with standard parameters, while bonito_05adjust was trained with encoder features=384 and state len=4. This adjustment slightly reduced accuracy, but sped up basecalling with bonito by a factor of two (Table 1). The model trained on 45K reads (bonito_15) outperformed all other models in terms of accuracy (median 97.5%).
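To put the gain in perspective, a bit of illustrative arithmetic on the median accuracies quoted above (93% for default guppy HAC versus 97.5% for bonito_15) shows the relative reduction in median error rate:

```python
# Illustrative arithmetic only, based on the medians quoted in the text.
hac_err = 1 - 0.93       # ~7% median error, default guppy HAC
bonito_err = 1 - 0.975   # ~2.5% median error, bonito_15
reduction = (hac_err - bonito_err) / hac_err
print(f"{reduction:.0%}")  # relative reduction in median error rate
```

In other words, the self-trained bonito_15 model removes roughly two thirds of the errors made by the default HAC model on these reads.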

 *Note: the crf model required a smaller chunks parameter during basecalling with guppy; otherwise the NVIDIA card (RTX 2080) ran out of memory (8 GB).

Speed

In terms of speed, the guppy models clearly outperform the self-trained bonito models, although the guppy-compatible crf model also increases basecalling time considerably.

Method / model                        Time
Guppy v4.5.2 HAC                      8 min
Guppy v4.5.2 min_modbases_5dmc        8 min
Guppy v4.5.2 min_modbases_5mC_5hmC    8 min
Guppy v4.5.2 crf v0.3.2               40 min
Bonito_05                             140 min
Bonito_05adjust                       74 min
Bonito_15                             140 min

Table 1. Time spent basecalling 40K reads with each model. Runs were performed on an Intel i5-9400 @ 2.9 GHz, 8 cores, 64 GB RAM, and an NVIDIA RTX 2080 GPU (with 8 GB RAM).
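Since each run processed the same 40K reads, Table 1 converts directly into throughput. A quick back-of-the-envelope calculation (illustrative only, using integer reads-per-minute) for a few representative models:

```python
# Throughput implied by Table 1 (all runs basecalled the same 40K reads).
READS = 40_000
times_min = {
    "guppy HAC": 8,
    "guppy crf v0.3.2": 40,
    "bonito_05adjust": 74,
    "bonito_15": 140,
}
for model, t in times_min.items():
    print(f"{model}: {READS // t} reads/min")
```

This makes the trade-off concrete: the most accurate model (bonito_15) basecalls at under 300 reads/min on this hardware, versus 5000 reads/min for default guppy HAC.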