Biologists, let your freak flag fly

Compared with physics or mathematics, biology rarely appears in the job descriptions of data scientists. Yet biologists bring quite a few advantages to the field. Prestigious data scientists such as Chris Wiggins, David Heckerman, and even Robert Gentleman, co-originator of the R language, all went through years of training in biomedical science.

Biologists have a deep understanding of biologically inspired algorithms, going well beyond the familiar neural networks and genetic algorithms. They also have a good grasp of abstract concepts in machine learning and artificial intelligence. The central dogma shows that every biological process runs on a universal cellular machinery (DNA to RNA to protein). This idea of using many simple agents to achieve complex tasks relates to many key concepts in data science, including the one-learning-algorithm hypothesis (whose supporters include my favorite MIT professor, Marvin Minsky), ensemble models, and even swarm robotics.

The ability to abstract ideas from complex biological processes prepares biologists to solve data science problems. Biologists see the full life cycle of data science: they collect data, manage data, identify problems, and answer questions with statistics and models. In natural science, biologists start from scratch; at first they have no idea what data to collect or even what the problem is. This resembles identifying a new customer demand from big data when the customers themselves don't know what they want. Additionally, the scale of data in business is comparable to that in population genomics.

Statistical methods (both frequentist and Bayesian) and many advanced techniques, such as MCMC and classification models, are widely applied across the sub-disciplines of biology. It is exhilarating to see so many biologists gravitating towards data science. Step up your game, biologists! You make organic data scientists.

Misunderstanding p-values is a crime committed by lots of people

A p-value is a statistic calculated from absolute effect size, sample size, and variability. It is also the single value most abused, misinterpreted, and overused, even within the scientific community. The p-value itself is rather innocent: it is a useful metric for examining your data and your experimental design, but it is too often misread.

Suppose you want to test the hypothesis that playing video games causes poor vision in mice (this is a made-up example, p < 0.05). You randomly assign mice to two groups: one group is housed as usual, while the other is treated to several hours of video games each day. You then run a two-sample t-test on the vision measurements of the two groups. Now suppose your sample size is reasonable, your method of measuring vision has a small, bounded error, and the t-test gives you p < 0.05. If you think this alone allows you to reject the null hypothesis and, armed with that statistical significance, publish a paper titled "Playing video games reduces the vision of mice", then you are guilty of a p-value felony. If your journal reviewer is happy to see p < 0.05 without asking any further questions about how you conducted the experiment, then your reviewer is your partner in crime!

We must realize that the p-value is not the probability that the null hypothesis is true (see the official statement of the American Statistical Association). In the case above, the p-value only tells us how incompatible our mouse vision data are with the statistical model. Under the null hypothesis, the p-value indicates the probability of observing a vision-lowering effect at least as extreme as the one in our data. Another common misunderstanding is that the p-value measures the strength of the evidence for an effect (P(observation | hypothesis) ≠ P(hypothesis | observation)). To illustrate, extend the video game example: you hypothesize that playing Warcraft causes poor vision while Super Mario has no effect. The Super Mario group gives p = 0.3, while the Warcraft group gives p = 0.03. Here comes the misunderstanding: people assume that the smaller the p-value, the bigger the effect, so Warcraft must harm vision far more than Super Mario. This is completely wrong, yet the habit is widespread among researchers.
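The definition above ("the probability, under the null, of observing an effect at least as extreme as the data") can be made concrete with a toy permutation test. All the numbers below are invented for illustration, in the spirit of the made-up mouse example:

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test: the fraction of random relabelings
    of the pooled data whose mean difference is at least as extreme
    as the observed one (an empirical p-value under the null)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Made-up vision scores for the two groups of mice
control = [1.02, 0.98, 1.05, 1.01, 0.97, 1.03, 0.99, 1.04]
gamers  = [0.91, 0.91, 0.91, 0.96, 0.88, 0.93, 0.92, 0.94]
p = permutation_p_value(control, gamers)  # small: the groups barely overlap
```

A small p here only says the data are surprising under "no group difference"; it says nothing about how large or important the effect is, which is exactly the misunderstanding described above.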

The ASA published official principles for using p-values in 2016. First, a p-value indicates how incompatible the data are with a specified statistical model. In addition, interpreting data often requires more than a single p-value: the goal of any inferential statistical method is to use a sample to draw conclusions about the population. Along this line of thinking, we should be extremely careful when interpreting our data, and scientific journals should not use the p-value as the only measure of a study's worth. In conclusion, let's be neither the criminals nor the victims of p-hacking.

 

This post is inspired by the author of pavopax. Check out his interesting posts about data science and statistics!

Pastime of simply being a reader

Before the age of 29, Haruki Murakami was just an unknown hipster who owned a coffeehouse/jazz bar called Peter Cat. One day he decided to spend some time writing a novel after he closed the bar each night. His first book, Hear the Wind Sing, became a huge success. Here's one thing he taught me: never be afraid of starting late, because people are not born with the will and the ability to achieve something. A paralyzed attitude, not one's age, is what blocks your way to greatness.

Haruki Murakami's most popular book is Norwegian Wood. I love this novel for its Japanese-aesthetic storyline, which I would describe as agreeable melancholy. As the lyrics of "Norwegian Wood" suggest, the book narrates the encounters between Toru Watanabe and two girls, Naoko and Midori. Watanabe develops convoluted relationships with them over his college years and after, and the two offer him contrasting kinds of charm: Naoko is fragile and gentle, while Midori is lively and upbeat. Watanabe is slowly attracted to both, leaving himself in a dilemma; he finds Naoko precious, while Midori brings him happiness. Japanese aesthetics demand tragic beauty, and true to form the story ends with Naoko's suicide, after the agony brought on by the suicides of her boyfriend and sister.

Reading What I Talk About When I Talk About Running made me admire Murakami's personal charm as a strong-willed man. He started running at 32 and has since completed several marathons, triathlons, and an ultramarathon. Though not a natural-born athlete, he strove to train himself as both a runner and a writer, and he draws several analogies between the two acts, both of which require long-term perseverance and passion. "No matter how mundane the act seems to be, keep it up long enough and it becomes a contemplative act." He described his relentless effort to find inspiration as a writer without much talent:

“Like water from a natural spring, the sentences just well up, and with little or no effort these writers can complete a work. Occasionally you’ll find someone like that, but, unfortunately, that category wouldn’t include me. I haven’t spotted any springs nearby. I have to pound the rock with a chisel and dig out a deep hole before I can locate the source of creativity. To write a novel, I have to dredge out another new, deep hole. But as I’ve sustained this kind of life over many years, I’ve become quite efficient, both technically and physically, at opening a hole in the hard rock and locating a new water vein.”

Reading is a lonely act, and so is writing. Every reader's interaction with a writer is quite personal. My words may not convey the experience I have had reading so many of Haruki Murakami's books. But one thing I am certain of, through this adventure of simply being a reader, is the strength I gained from the stories of a strong man.

Using the caret package for hands-on machine learning in R

I've been taking the JHU data science courses on Coursera. The
course focuses on practicing data analysis in R, and I figured I
would post what I have learned. Plus, I've been obsessed with R
Markdown files, and here's a practical fix for generating a WordPress
markdown file from R Markdown so I can write my blog in R.

The overall layout doesn't look very nice, but here is a link to the RPubs version of the following report.

This report summarizes the basic functions of the caret package in R,
using the nottem dataset as an example. The dataset consists of 20
years of monthly temperature measurements at Nottingham Castle, stored
as a ts object (Anderson, O. D. 1976). The report focuses on hands-on
practice for newcomers to machine learning. Most of the material is
derived from the JHU Practical Machine Learning course on Coursera.
This report introduces the following functions of the caret package:

  • preProcess
  • createDataPartition
  • train
  • predict

The nottem dataset: preprocessing a ts object

A quick look at the original data shows the seasonality of the monthly
temperatures. Some sort of normalization may be necessary to remove
the nonlinear trend from the time series. The data are melted into a
data frame with an added variable indicating the time point of each
measurement.

 

data("nottem")
plot(nottem)


 

library(reshape2)
temp.new = data.frame(timePoint = melt(time(nottem)), temp = melt(nottem))
colnames(temp.new) = c('timePoint', 'temp')

Splitting data into training and testing sets

The ratio of training to testing data is usually 3:1.

library(caret); library(kernlab)

## Loading required package: lattice

## Loading required package: ggplot2

##
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
##
## alpha

# partition call reconstructed; the original chunk was garbled in conversion
inTrain = createDataPartition(y = temp.new$temp, p = 0.75, list = FALSE)
training = temp.new[inTrain, ]
testing = temp.new[-inTrain, ]

Model fitting

Here we use a Bayesian Regularized Neural Network model to fit the
training set. The preprocessing method "center" is passed as a
parameter (it subtracts the mean from the training data, similar to a
normalization step).

set.seed(666)
# model call reconstructed; the original chunk was garbled in conversion
modelFit = train(temp ~ timePoint, data = training,
                 method = "brnn", preProcess = "center")

## Loading required package: brnn

## Loading required package: Formula

modelFit

Final model

summary(modelFit$finalModel)

## Length Class Mode
## theta 2 -none- list
## alpha 1 -none- numeric
## beta 1 -none- numeric
## gamma 1 -none- numeric
## Ed 1 -none- numeric
## Ew 1 -none- numeric
## F_history 10 -none- numeric
## reason 1 -none- character
## epoch 1 -none- numeric
## neurons 1 -none- numeric
## p 1 -none- numeric
## n 1 -none- numeric
## npar 1 -none- numeric
## x_normalized 362 -none- numeric
## x_base 2 -none- numeric
## x_spread 2 -none- numeric
## y_base 1 -none- numeric
## y_spread 1 -none- numeric
## y 181 -none- numeric
## normalize 1 -none- logical
## call 4 -none- call
## xNames 2 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 1 -none- logical

Prediction

Comparing the temp values from the testing set (blue crosses) with the
predictions (red circles), we see that most points lie close together,
apart from two or three that don't align.

prediction = predict(modelFit, newdata = testing)

## Loading required package: brnn

## Loading required package: Formula

prediction.frame = data.frame(timePoint = testing$timePoint, temp = prediction)
plot(testing, col = 'navyblue', pch = 4, lwd = 2)
points(prediction.frame, col = 'darkred', lwd = 2)


How does influenza virus change? (old post)

The influenza virus is a pretty fascinating virus, primarily because of its tremendous ability to change. Its versatility has been described as "the chameleon-like power to mutate that keeps them ahead of the game". These creepy RNA viruses, as we know, enjoy negotiating with their hosts. Unlike DNA viruses or bacteria that cause diseases with high mortality, they stay with their hosts while carrying out amazingly high-speed evolution, through both antigenic shift and antigenic drift.

Antigenic drift refers to mutations in the surface glycoproteins of the virus (hemagglutinin and neuraminidase). The high mutation rate (between 1×10⁻³ and 8×10⁻³ substitutions per site per year) is due to the lack of a proofreading mechanism in the viral RNA polymerase. Mutations in the receptor-binding sites of these antigens cause host immune recognition to fail, allowing the virus to evade immunity.

Another mechanism that keeps the influenza virus changing is antigenic shift, the reassortment of genome segments from different strains. The most telling example of antigenic shift, the swine-origin H1N1 of 2009, is the result of triple reassortment: the virus contains four segments from classical American swine influenza virus, two segments from human influenza virus, and two from an avian lineage. Such reassortment enables adaptation to new hosts and the formation of pandemics, posing a great threat to public health.

Here is a review article that depicts the evolution of influenza A viruses:

http://www.ncbi.nlm.nih.gov/pubmed/20542248

References
wikipedia.org
http://pathogenposse.tumblr.com
http://www.ncbi.nlm.nih.gov/pubmed/20542248

Role of M1 Protein in the Process of Assembly and Budding (old post)

 References
Nayak, D. P., Hui, E. K., & Barman, S. (2004). Assembly and budding of influenza virus. Virus Research, 106(2), 147-165. doi:10.1016/j.virusres.2004.08.012
This review discusses the biological process of influenza virus assembly and budding, emphasizing the role M1 protein plays in the virus-host system. The influenza viral particle is composed of a viral envelope, a matrix, and a core (the viral ribonucleoprotein, vRNP). M1 protein is considered a bridge between the viral core and the envelope. During assembly, M1 is produced at cytosolic ribosomes and is then directed to facilitate vRNP export from the nucleus to the cytoplasm, accompanied by NEP. At this stage, the interaction of M1 with the cytoplasmic tails and transmembrane domains of the surface proteins is critical for concentrating viral components at the budding site. M1 is also thought to be important in budding by presenting the L motif (what is that?).
The authors address most of the major activities in virus replication, but some problems need further examination. The mechanism by which M1 and vRNP translocate from the nucleus to the plasma membrane (the assembly site) remains unsolved. Besides viral components like M1, a variety of host factors are involved in viral budding, and most remain to be discovered.

Somatic Recombination (Old post)

It is generally known that during the phase of B cell maturation, various antibodies are generated against specific antigens. You might ask how cells manage to produce 10¹⁵ different types of antibody encoded by only about 10⁵ genes. The current consensus is that antibody diversity is produced by somatic recombination (aka V(D)J recombination) before antibodies are secreted by plasma cells. Immunoglobulin genes are the result of recombining different DNA fragments, which gives them their diversity and specificity. This process happens in every B cell that matures from a naive B cell. (Instant Notes in Biochemistry)
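The arithmetic behind that combinatorial explosion is easy to sketch. The segment counts below are rough, illustrative round numbers (not from the original note), but they show how independent choices of V, D, and J segments, multiplied by heavy/light-chain pairing, blow up into millions of combinations even before junctional diversity is counted:

```python
# Rough, illustrative human gene-segment counts (approximate values)
heavy_V, heavy_D, heavy_J = 40, 25, 6   # heavy-chain V, D, J segments
light_V, light_J = 70, 9                # light-chain (kappa + lambda) V, J

heavy_chains = heavy_V * heavy_D * heavy_J   # distinct heavy-chain combos
light_chains = light_V * light_J             # distinct light-chain combos
antibodies = heavy_chains * light_chains     # heavy/light pairings

print(heavy_chains, light_chains, antibodies)  # 6000 630 3780000
```

Junctional diversity (imprecise joining and nucleotide insertion at the segment boundaries) multiplies this number by several more orders of magnitude, which is how the repertoire climbs toward the figures quoted above.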

Why is there genome instability? – Discovery of Sleeping Beauty (old post)

Transposable elements (TEs) are a large family of DNA elements widely present in the genomes of organisms. These elements may not share exactly the same sequence, but they share features that confer a similar biological function: inserting into the genome and causing target-site duplications. There are two classes of TEs in eukaryotic cells: retrotransposons and DNA transposons.

The first transposon, discovered by Barbara McClintock (work that earned her the Nobel Prize in 1983), is a naughty gene jumping around the genome of maize. She found deletions and insertions in the genome that were related to changes in the color of corn kernels. This transposon belongs to the class II TEs, which operate by a DNA cut & paste system (or ctrl+x & ctrl+v, geeky joke). Class II transposons are composed of an inverted repeat (IR) at each end and a transposase gene in the middle, which determines the autonomy of the transposon. This structure enables class II transposons to insert into the genome, changing its size and leading to genome instability.

Class I transposons are a family related to retroviral genomes (I talked about retrotransposons and viral oncology in a previous post). Interestingly, transposons resembling the retrotransposon structure are found not only in retroviruses but also in yeast cells. The Ty element includes TyA, a gag-like protein, and TyB, a homolog of reverse transcriptase. The structure of the Ty element is shown below.

 

Indication of evolution

Comparing these transposons across different branches of species reveals the role of transposons in development and evolution. Because the unstable nature of TEs leads to mutagenesis, changes in genome size, and eventually deleterious mutations, most biologists regard TEs as "selfish DNA parasites". Selection therefore "gives a harsh time" to TEs, and many organisms have special mechanisms to inhibit TE activity. In the human genome, most transposable elements were inactivated by mutation long ago; very few retain their troublemaking function. For example, the Alu element is the most common transposable element in humans and has been shown to be associated with inherited diseases and cancer. The figure below shows the karyotype of a female human (XX); green labels the hybridized Alu elements widely distributed across the chromosomes.

 

The discovery of TEs also sheds light on the speciation of organisms at the molecular level. Some scientists claim that TEs share a common ancestor and that transposons helped exchange genetic information through horizontal gene transfer. Other researchers believe TEs emerged independently multiple times. Still, apart from our understanding of retroviral genome integration into host genomes, we cannot tell how a transposon from one ancestral species was introduced into another.

Waking Sleeping Beauty

TEs earned the name Sleeping Beauty for their inactivated nature. However, in 1997 Ivics et al. successfully woke up Sleeping Beauty, a Tc1-like transposon reconstructed from salmonid fish sequences. The researchers used a powerful mutagenesis approach to build an active transposon: they mapped several inactivating mutations in the ORF of Sleeping Beauty, replaced them with functional amino acid residues, and finally woke up the beauty. The figure below shows how they performed this elegant experiment.

Reference

Ivics, Z., Hackett, P. B., Plasterk, R. H., & Izsvák, Z. (1997). Molecular reconstruction of Sleeping Beauty, a Tc1-like transposon from fish, and its transposition in human cells. Cell, 91(4), 501–510.

http://en.wikipedia.org/wiki/Transposable_element

 

Latest version of BLAST 2.2.28: makeblastdb, sequence ID, and some other tips (old post)

This is what I did a few weeks ago when I used the BLAST package (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to analyze my Illumina 2×250 MiSeq data. It is a big collection of short reads, about 300 bp per read. All the sequences in the FASTA file look like this:

>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC

I also had a small collection of reference sequences at hand, annotated and published data (about 22 annotated genes in total). The goal is to assign each read an official gene name, based on the similarity of the read to the reference sequences. The first step is to build a local database with makeblastdb:

makeblastdb -dbtype nucl -in database.fa -input_type fasta -out allreads

and a database containing all my reads is built. The funny thing is that after you build the database from FASTA, it is hard to retrieve the sequences again by the accession numbers BLAST assigns, the same problem mentioned by Peter Cock on his blog (http://blastedbio.blogspot.com/2012/10/my-ids-not-good-enough-for-ncbi-blast.html). That's all because of the SUPER COMPLICATED NOMENCLATURE SYSTEM OF THE NCBI DATABASE…

This shows the original database:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "%f"
>1101:13541:2221
CTCCTGCTTCCAGGATCCTGGCCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGTATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGTAGCTCA
GGCACCCTGACCATCACTGGGCTCCAGGCTGAGGATGACGCTGAGCATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC
>1101:13653:2384
CTCCTGCTTCCAGGATCCTGGGCCCAGGCTGTGCTGACTCAGCCGCCCTCTGAGTCTGGGTCCCTGGGCCAGAGGGTCAC
CCTCTCCTGCACTGGGAGCAGCAGCAACATCGGGGGTGGTAACAGTGTGAACTGGTCCCAGCCGCTCCCAGGAAAGGTCC
CCAGATCCGCATTCACTTATGCCAATCTCATGGCTATTGCTGCCCCGGATCAGATCTCTGGCTTCAAGTCTGGCAGCTCA
GGCACCCTGACCATCACGGGGCTCCAGGCTGAGGATGACGCTGAGTATTACTGCACAGCCGGGGGTGACAGCCTCGATGG
CCCCACAGTGCCCCAGGCCAGGGGGCAAGTGAGACCAAAACC

And then when I tried to find the accession numbers/OID/GI in the database, I failed:

x-10-22-15-188:cassieblast Sean$ blastdbcmd -db IGLV1 -entry all -outfmt "OID: %o GI: %g ACC: %a IDENTIFIER: %i"
OID: 0 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 1 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 2 GI: N/A ACC: No ID available IDENTIFIER: No ID available
OID: 3 GI: N/A ACC: No ID available IDENTIFIER: No ID available

So I gave up on extracting sequences with the BLAST commands. Instead, I took a reverse approach: make all the reads the query and all the reference sequences the database. blastn then automatically matches each read with its best hit among the references, and that hit is exactly the official gene name I wanted! There were still some defects in the BLAST output, however, so I wound up using FileMaker Pro to process it, and life became much easier.
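For what it's worth, the matching step in that reverse approach can also be scripted instead of pushed through FileMaker Pro. This is just a sketch, assuming blastn was run with tabular output (-outfmt 6, whose 12 default columns end with the bit score); the read IDs and gene names in the example are made up:

```python
def best_hits(blast_tabular):
    """Map each read (qseqid) to its top-scoring reference gene (sseqid)
    from BLAST -outfmt 6 output, where the bit score is column 12."""
    best = {}
    for line in blast_tabular.strip().splitlines():
        fields = line.split("\t")
        qseqid, sseqid, bitscore = fields[0], fields[1], float(fields[11])
        # keep only the highest-scoring hit per read
        if qseqid not in best or bitscore > best[qseqid][1]:
            best[qseqid] = (sseqid, bitscore)
    return {read: gene for read, (gene, _) in best.items()}

# Toy example: two reads, with made-up hits against hypothetical references
example = (
    "1101:13541:2221\tIGLV1-40\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t540\n"
    "1101:13541:2221\tIGLV1-44\t92.0\t300\t24\t0\t1\t300\t1\t300\t1e-110\t400\n"
    "1101:13653:2384\tIGLV1-44\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-155\t551\n"
)
assignments = best_hits(example)
# {'1101:13541:2221': 'IGLV1-40', '1101:13653:2384': 'IGLV1-44'}
```

Each read ends up labeled with the reference gene of its strongest hit, which is exactly the read-to-gene assignment described above.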