I’ve been taking some biotech courses on Coursera and Galaxy.org the past few months. It’s a field I’ve wanted to get into since graduate school, but I never did because it felt so niche. Learning the language of ATGC and AUGC seemed too different from the usual cash cows of data science.
Fast forward half a decade, and we find ourselves right in the middle of a pandemic. Suddenly, viral RNA and reverse transcription polymerase chain reaction (RT-PCR) are all over the news. The classical methods of clinical trials and wet lab experiments are now complemented by massive amounts of data, statistics and computation. It is a testament to the state of genomics that the SARS-CoV-2 virus was sequenced within a few short weeks.
With my non-biology background, I wanted to immerse myself in this field even more. I’ll share the things I have tried out.
Genomic Course Work

There are a lot of courses and specializations on Coursera. There’s the excellent Johns Hopkins Genomic Data Science Specialization and the State University of New York’s Big Data, Genes, and Medicine. They have the standard quizzes and hands-on exercises in R and Galaxy. Getting certified comes with a price tag, and if that’s your jam, you should definitely get one.
I only audit the courses though. Open for sponsorship. : )
#continuouslearning
Healthcare Course Work
Stanford’s AI in Healthcare Specialization is a holistic take on healthcare and how AI can play a role in its diverse facets. From the economics of healthcare to the datasets you can expect to work with in this domain, the specialization can elevate your understanding of genomics in healthcare.
Bonus: Educational Series
Crash Course Anatomy and Physiology is perfect for-dummies content. For manga fans, Cells at Work does a great job of both entertaining and informing.
Kaggle Competitions

There is a very timely competition on Kaggle: OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction. Given sequence data of mRNA molecules, competitors must predict the degradation rates of these molecules. mRNA molecules, which are a central component of mRNA vaccines, are notoriously fragile. By understanding the degradation rates of these molecules, vaccines can be more easily manufactured, stabilized and shipped around the world.
I started late in this competition but I learned from a few notebooks. I’ll share these starter ones. It’s quite surreal that coursework components are now being applied in the timeliest of real-world problems.
- Rob did an EDA of the problem and revealed key aspects of the challenge. The unique bit about this competition is that the input data is all about sequences. The predictions are also sequences, with 3 target values to predict. That’s something you don’t see every day!
- Konstantin didn’t just feed the data to an LSTM and run with it. Here’s some solid feature engineering work with sequence, structure, loop and error information all thrown in.
- However, as expected, deep learning models such as LSTMs and GRUs rule, since we are in the realm of sequences and their sub-attributes. Tucker’s notebook is well written and achieves a very low error rate.
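To make the "sequences in, sequences out" idea a bit more concrete, here is a minimal sketch of how the sequence, structure and loop information mentioned above could be turned into per-position features before going into an LSTM or GRU. This is my own illustration, not code from any of the notebooks. The vocabularies (`ACGU` bases, dot-bracket structure symbols, single-letter loop codes), the function names and the toy sample are all assumptions.

```python
# Assumed vocabularies (illustrative, not taken from the notebooks):
# RNA bases, dot-bracket structure symbols, single-letter loop-type codes.
BASES = "ACGU"
STRUCTURE = "(.)"
LOOPS = "BEHIMSX"


def one_hot(tokens, vocab):
    """Return one one-hot row (list of floats) per character in `tokens`."""
    index = {ch: i for i, ch in enumerate(vocab)}
    rows = []
    for ch in tokens:
        row = [0.0] * len(vocab)
        row[index[ch]] = 1.0  # raises KeyError on an unknown symbol
        rows.append(row)
    return rows


def encode_sample(sequence, structure, loop_type):
    """Concatenate the three one-hot encodings position by position."""
    if not (len(sequence) == len(structure) == len(loop_type)):
        raise ValueError("all three strings must have the same length")
    return [
        b + s + l
        for b, s, l in zip(
            one_hot(sequence, BASES),
            one_hot(structure, STRUCTURE),
            one_hot(loop_type, LOOPS),
        )
    ]


# Hypothetical toy molecule: 7 bases forming a small hairpin.
features = encode_sample("GGAAACC", "((...))", "SSHHHSS")
print(len(features), len(features[0]))  # 7 positions, 14 features each
```

Each position ends up as one feature vector, so a recurrent model can read the molecule base by base and emit its target values along the same axis.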