Fascinating World of Molecular Robotics

Download PDF

Note: an extended version of this post is on medium here

On September 21st 2018, I spent the entire day at 9th Annual Wyss International Symposium. The topic of this year was Molecular Robotics. I admit that this is not my area of expertise beyond a synthetic biology course I took couple years back, but I greatly enjoyed the talks. Coming from a computer science background, I was somewhat familiar with the majority of the main concepts except maybe some of the sophisticated bio-chemistry.

The Wyss Institute just released a recap of the event, which is a must read.

The event booklet describes molecular robotics as “encompassing the design of microscopic, self-assembly driven agents that sense, compute and actuate“. Furthermore they state that “Nanoscale robots composed of nucleid acids, proteins or other molecules do not require power or batteries to operate, can be programmed to complete their tasks autonomously and collectively, and offer an unprecedented level of interaction with and control of biological systems“. They mention potential application areas as clinical diagnostics and therapeutics. They also state that it also extends “robotics” into the realm of nanoscale molecular systems with large numbers of individual agents collectively accomplishing tasks at microscopic scales.

The talks were divided into sessions covering the state of the art in sensing, computing and actuating aspects of the field. The final session titled “Translate” aimed to discuss how all this can be translated into applications in real life.

A Bit of the History
Digging a bit deeper into the history of molecular robotics on the internet, it appears that the field started as DNA Nanotechnology (or DNA Nanoengineering). Wikipedia names Dr. Nadrian Seeman of NYU (Dept. of Chemistry) as the founding father of DNA Nanotechnology who laid out the main concepts back in the 80s. However, the field started to attract more attention only in mid-2000s. I realized this connection based on Japanese scientist Dr. Satoshi Murata’s 2009 grant proposal titled “Development of Molecular Robotics based on DNA Nanoengineering“. Furthermore, in this video interview (below), released by the NSF covering the 2010 Nature paper “Molecular robots guided by prescriptive landscapes” by Columbia University researchers, Dr. Milan N. Stojanovic is talking about brief history that led to their paper. And in there he mentions the work of Dr. Seeman of NYU. At that time NSF called molecular robotics as “on the rise”:

The Japanese Government has officially started supporting this field back in 2012. Researchers from Japan published a review of the field at the time in 2013. Later, an initiative in molecular robotics was established at Wyss Intitute in 2016.

Dr. Seeman has published an article in Nature Reviews titled “DNA nanotechnology” in 2017. Interestingly, in that article he does not use the term “Molecular Robotics” at all.

I also sensed little bit of “artificial life” flavor in all this. Steven Levy’s “Artificial Life” and John Holland’s (RIP) “Hidden Order:How Adaptation Builds Complexity” books come to mind.

I should dig deeper into this fascinating field sometime :-)

Posted in Molecular Robotics | Leave a comment

Medical Imaging with Deep Learning (MIDL 2018) Conference: What is Hot?

Download PDF

Last week, I dived into the world of rejected extended abstracts from the first ever deep learning conference focused on medical imaging (MIDL 2018) organized in Amsterdam this summer. This week I am in the mood for the opposite end, namely the glorious world of accepted papers for oral presentation!

According to NVIDIA’s conference highlights slides, acceptance rates were 41% and 35% for papers and abstracts respectively. There were 21 oral presentations and 26 full paper and 35 abstract poster presentations (all presentations, code, slides etc. are here!). Based on these numbers, it seems that around 115 full papers and around 100 extended abstracts were submitted.

As in the case of rejected abstracts, I look at the reviewer comments as well as the papers themselves located here. My goal is to see what topics are hot and what innovations were valued by this community and why.

Some novel(ish) things that were praised:

  • Reinforcement Learning Considering all the fuss about deep reinforcement learning from Deepmind few years ago, one would have thought this was also applied to medical imaging. Alas, the reviewers were pretty impressed with the relative novelty of applying reinforcement learning in medical imaging, specifically in anatomical landmark detection. Video presentation is here
  • Attention U-Net Combining attention mechanism into U-Net has not been done before?? This is very surprising to me given that both ideas have been around for a couple years already. In this case, the architecture is applied to detection of pancreas in large CT abdominal datasets. Video presentation is here.
  • Capsules for Object Segmentation Hmmm..yet another idea that caused quite a buzz in 2017 (Hinton’s talk here) and somehow was applied to image segmentation only recently in 2018! The authors created an capsule-based architecture named SegCaps that reduced the number of parameters of U-Net 95.4% while still providing a better segmentation accuracy. Video presentation is here
  • Neural Conditional Random Fields on top of a CNN. Especially on WSI (digital pathology) domain, images are divided into tiles for processing due to their large sizes. However, such processing can’t adequately model spatial correlations between the tiles. The authors here are proposing to place a CRF on top of the CNN and train it end-to-end for WSI of breast cancer. Although the idea is not brand-new, the application is pretty novel.Video presentation is here.
  • Adversarial training with cycle consistency. Video presentation is here.
  • OBELISK – One Kernel to Solve Nearly Everything: Unified 3D Binary Convolutions for Image Analysis This paper actually won NVIDIA’s best paper award for this conference. Video presentation is here.

Disclaimer: the opinions stated in this blog post are my own. They do not represent my employer’s views or opinions on related topics.

Posted in Deep Learning, Medical Imaging | Tagged , , | Leave a comment

Medical Imaging with Deep Learning (MIDL 2018) Conference: Exploring Rejected Extended Abstracts

Download PDF

Nowadays, everybody and their grandmothers (and grandfathers) are utilizing deep learning for medical imaging problems. This has now become a hot industry. Even though major computer vision (CVPR etc.) and medical imaging conferences (SPIE,ISBI,MICCAI) are currently dominated by deep learning methods, a conference specific to deep learning applied to medical imaging has just emerged in 2018.

We are lucky to have free access to videos and slides from latest research here :-)

I quickly looked into what was rejected. There are certainly interesting things to learn from research that did not quite make it in this round. This post covers the extended abstracts track. There is also a full conference track with lots of rejected papers to look into..but that is for another rainy day…

Judging by the reviews (that were apparently based on 2-3 page abstracts) shared openly, the selection process was competitive. According to openreview.net site(*), 35 papers were accepted and 44 papers were rejected (around 44% acceptance rate):

Couple things stood out for me:

There were some controversial papers that received mixed reviews (some accept, some reject) but eventually rejected:

  • Deep learning & Atlas based segmentation hybrid:This is one area of interest for me. As we all know, deep learning models are dumb and have no clue about what they are looking at. You can train the same model on cat images or cardiac images. It is a cool idea to try and introduce anatomical knowledge into these models. One reviewer clearly accepts the paper even though the validation was not done quantitatively, while the second reviewer is not impressed because “UNet and multi-atlas registration based segmentation is widely used”.
  • Application of geometric deep learning to brain imaging. Spectral Analysis Towards Geometric Auto-Encoding of Subcortical Structures is about statistical shape analysis. As one reviewer puts it, this is a novel technique to apply deep learning to meshes instead of images directly, however, the details were not found to be clear enough to judge. It is based on multi-scale mesh based shape representation. The algorithm seems to be learning a layered representation of these shapes in terms of a hierarchy of shape signatures (based on spectral wavelet transforms). Learning is done layer-wise using unsupervised learning such as k-means or variational Bayes EM. In the past, I was obsessed with content based 3D shape retrieval. The notion of shape signatures was also used there in conjunction with similarity measures. This takes me back to those times.
  • Predicting follow up images to track disease progression . The goal is to develop an unsupervised model to learn static anatomical structures and the dynamic changes of the morphology due to aging or disease progression. The paper Unsupervised Representation Learning of Dynamic Retinal Image Changes by Predicting the Follow-up Image uses almost “4000 OCT images of about 200 patients with macular degeneration who were scanned over 24 months”. One reviewer found the experiments confusing.
  • Cycle-Consistent Generative Adversarial Networks for Image Segmentation. This seems to be an interesting idea to perform Epithelial tissue image segmentation using GANs. According to the authors: “The model consists of two generators, one that maps from the image to the segmentation domain and a second that maps from the segmentation to the image domain, and two associated adversarial discriminators. A so-called cycle-consistency loss regularizes the mapping and enforces a relationship between an image in the segmentation and the image domain”. They find the performance of Cycle-GAN comparable to U-NET that had to be trained on a paired input image-segmentation dataset. As the reviewers point out they only run experiments with the fully annotated training set, therefore did not sufficiently show that it works when we lack annotations. The second paper using the same general idea is on liver lesion segmentation. It proposes an improved U-Net architecture called the polyphase U-Net as a generator in the Cycle-GAN. However, the reviewers pointed out that the performance did not match up to the state-of-the-art in the competition where the dataset came from.
  • Deep Learning-Based 3D Freehand Ultrasound Reconstruction. Even though ultrasound imaging is cheap and safe, it is challenging to construct 3D volumes without external probes or tracking hardware. According to one reviewer, the authors have previously presented a deep learning based method to build 3D volumes from a series of 2D images at MICCAI. In their MIDL submission they have added an inertial measurement unit (IMU) hardware to model more realistic operator hand motions via a gyroscope. It must indeed be challenging to do this solely based on images after all. However, it seems this was no longer novel enough for the reviewers.
  • Reconstruction of sparsely sampled Magnetic Resonance Imaging measurements with a convolutional neural network. This one talks about Compressed Sensing accelerated Magnetic Resonance Imaging (MRI) and how a neural network can be used to decode accelerated, undersampled MR acquisitions, eliminating the need for reconstruction algorithms. One reviewer likes it while the other bashes it questioning methodological novelty and asking for comparisons with other architectures.

There were some interesting ideas (or innovations one might say) borrowed from previous literature but did not quite make it in the end:

  • Improved Deep Learning Model for Bone Age Assessment using Triplet Ranking Loss Strangely this paper received marginally above threshold ratings from both reviewers but ended up being rejected in the end. It uses the idea of regularizing feature embeddings with respect to clustering of similar cases together in the feature space by introducing ranking in the loss function. I guess this triplet ranking loss concept was proposed by Google in 2015 for face recognition problems. To my understanding, the reviewers were saying that the experiments on the 3-stage architecture were not clearly explained or had some issues, so they were not convinced.
  • Curriculum learning from patch to entire image for screening pulmonary abnormal patterns in chest-PA X-ray This paper borrows the notion of curriculum learning from Bengio et al.’s 2009 ICML paper that proposes training gradually from simple to more complex concepts. In this paper the authors first train a RESNET 50 using Imagenet dataset, then train it using lesion extracted patches. Then they fine-tune it using the entire images. Their idea is that it would be difficult to train the network on the more complex whole images where there are other organs etc. exist. Interesting idea. However, the reviewers had issues with clarity of the writing and lack of some details.
  • Detection of Gastric Cancer from Histopathological Image using Deep Learning with Weak Label Weak supervision is a work-around to handle lack of high quality annotated datasets from domain experts where large scale albeit noisy and lower quality annotations are gathered using cheap annotators or programmatically. Apparently the idea is not new in machine learning but has been gaining popularity in deep learning recently. The goal is to predict slide level labels, however since the images are large, it uses a patch based approach where the algorithm ultimately generates a probability map by stitching patch level predictions. Then a random forest algorithm predicts the label for the whole slide using the probability maps.
    This paper uses slide-level weak labels if there are no region-level labels available for the patches. It was not really found novel by the reviewers I am guessing.

This conclusion is not surprising at all, but here it goes: based on common reviewer comments, it seems that having a large dataset is preferable and that is becoming less of an issue with lots of free datasets being available via competitions etc. One rule of thumb however, is to make sure that the method is at least as good as the state-of-the-art if one is using a competition/challenge dataset. Even if the dataset/application domain is novel, some innovation on the architecture along with comparison to established architectures is expected. Of course, in all cases, it is essential to be able to clearly convey all relevant information within the limited number of pages (2 or 3).

*Note: authors are allowed to remove their papers from the site, therefore these numbers might not be accurate. According to NVIDIA’s conference highlights slides, acceptance rates were 41% and 35% for papers and abstracts respectively

Disclaimer: the opinions stated in this blog post are my own. They do not represent my employer’s views or opinions on related topics.

Posted in Deep Learning, Medical Imaging | Tagged , , | Leave a comment

Coursera’s Systems Biology and Biotechnology Specialization – My initial impressions

Download PDF

Couple years ago, Coursera started offering a series of courses in Systems Biology developed by the Icahn School of Medicine at Mount Sinai in New York. The series includes 5 lecture style courses possibly recorded in 2013/2014 (as per the slides) and a final capstone course. At the time of this writing, Coursera was charging $39/month for this specialization and providing a 7-day trial period. In this series of posts, I will share my experience on this specialization.

Coursera has another similar specialization in bioinformatics , however I found the systems biology specialization to be more suitable for my interests and learning goals. The bioinformatics specialization is a series of 7 courses and it is geared more towards transforming you into a bioinformatics software professional; I am still interested in trying that one out–maybe at a later date.

I decided to take a stab at this specialization during 2017 holiday season. It is recommended to start from the Introduction to Systems Biology course and the capstone is the last course. It is recommended to complete the Integrated Analysis in Systems Biology course right before the capstone as it will prepare you for the capstone project. The remaining courses can be taken in any order although Coursera recommends the following sequence:

  • Introduction to Systems Biology
  • Experimental Methods in Systems Biology
  • Network Analysis in Systems Biology
  • Dynamical Modeling Methods for Systems Biology
  • Integrated Analysis in Systems Biology
  • Capstone

The courses run for couple sessions a year and the assignments have certain deadlines that you might want to watch for in order to keep yourself on track. However, they are on-demand and it is possible to work through the courses before their start dates as long as you submit the assignments before the session ends. You can take the courses in a sequence or try multiple ones concurrently.

I choose to work on some courses concurrently. First I completed the first 3 weeks of the introduction course, then switched to the experimental methods course and completed week 1. After that, I worked through first weeks of the network analysis and dynamical modeling courses to form on opinion on what is to come.

The specialization is tough, it is listed as intermediate level on Coursera. Judging by the discussion boards, it seems like these courses were popular on Coursera 2-3 years ago. There has not been too much activity recently and I have noticed that the course staff did not interact with the students at all on the discussion boards.

Weekly quiz questions are difficult and require some thinking and online research. Don’t be surprised if you need to take them multiple times to get a good grade. Moreover, the quiz evaluations only tell you if you answered correctly or not without any further explanation. This specialization requires certain familiarity to cell biology and mathematical tools such as ODEs/PDEs and also statistics. Knowledge of Matlab is recommended as some assignments will be in Matlab. The instructors provide a brief introduction and free license to Matlab. I am quite familiar with the math and statistics, although the biology was a bit challenging as it requires more of memorizing things. I found the following resources useful. If you like animated videos, Youtube has lots of good resources.

Molecular Cell Biology, 4th edition: Free access from NIH site
Molecular Biology of the cell, 4th edition: Free access from NIH site
Systems Biology Textbook,2016: Seems like this is the most up-to-date textbook in print as of 2017. Kindle edition might have some issues with figure quality. Paperback version costs around $80.

In summary, even though the material might be 4-5 years old, this was the the only general training opportunity in systems biology that I was able to find on the internet and it is provided by a top institution in this field. This specialization is certainly not for everyone but if you stick to it, you will certainly learn a lot. Finally, I would like to thank the Icahn School of Medicine at Mount Sinai in New York and Coursera for this training opportunity.

Update as of October 1st 2018:
At this time, sadly I am done with only approximately one thirds of the specialization as life happened ;-( I intend to release a series of posts to review individual courses if I ever get around to finishing them ;-)

Posted in Dynamic Systems, MOOCs, Systems Biology | Leave a comment

Instructional Technology Tool using Adobe Flex

Download PDF

At one point in my academic life, I worked as an Instructional Technology Fellow helping faculty members design online/hybrid courses. When I saw Andrew Ng’s Stanford Machine Learning lectures online, inspired by my job, I wanted to build a tool that would help me study the material by myself. Here is this very primitive tool I wrote using Adobe Flex: http://mindwriting.org/MachineLearningCourse.swf

The application contains 4 panels: topic list and corresponding lecture video on the right, a list of related textbooks and a text area to take notes with ‘save as pdf’ capability on the left-hand side of the screen.


Finally, the code is at : https://github.com/ilknuricke/MLCourseTool
The code was compiled with Adobe Flex 4.6 and works with Adobe Flash player version 11.1

Note that Dr. Ng’s Machine Learning course on Coursera has better quality videos and an easier set of topics supported with Matlab assignments. Nevertheless, this was a cool programming exercise to get introduced to Adobe Flex which seems to have fallen out of favor nowadays…But, it still works!

update: as of 2017, I noticed that video replay from this flash application is no longer working. It seems Youtube is no longer allowing embedding videos within such applications :-(

Posted in Demos, Instructional Technology | Tagged , | Leave a comment

Central Pattern Generators to Synthesize Birdsongs

Download PDF

Songbirds are among the most interesting creatures of nature. Out of 10,000 bird species, around 4,000 are songbirds. They sing for reasons such as territorial ownership and mating. The young birds learn their songs by listening to a tutor. This makes it an attractive research area for those who are interested in understanding how humans learn to speak.

Structure of a Birdsong
A birdsong is a hierarchical structure consisting of syllables and motifs such as the following:

Here is how this waveform plot is generated in matlab:

[y,fs] = wavread('e:\\recordedbird.wav');

And here is how this song sounds like:

What is in a Syllable?
A syllable can contain multiple dynamic tones with up sweeps and down sweeps. The frequency range in a typical birdsong is generally between 2-7Khz [1]. The richness in dynamics makes this an interesting topic for dynamic systems modeling. The following sonogram/spectogram shows how the frequency changes within syllables:


Songbird Anatomy
So, how do birds generate songs? The following figure shows the vocal organs of a songbird.


Located between the lungs and the vocal tract, the syrinx is the sound-producing organ (source) of a songbird. Similar to the larynx in humans, the syrinx contains vibrating membranes that are called labia which oscillate when the air from lungs exerts force on them. The sound waves generated by the syrinx then travel through the vocal tract which serves as the filtering mechanism for producing the vocalizations [2].

What controls the singing is the songbird’s brain. Two distinct forebrain regions were identified as being highly active during birdsong generation. The HVC (high vocal center) projects to the region RA (robust nucleus of the archistriatum) which in turn projects to the brainstem regions that control the syrinx and the respiratory systems.


Based on the observations of the songbird brain activity during singing, it has been proposed that there is a temporal hierarchical organization that controls the bird-song generation. Namely, the HVC neurons were observed to be on a slower timescale than the RA neurons [3].

The final piece of the puzzle is the control of the muscles that actually govern the syrinx. It was proposed that there were 7 muscles controlling the vocal organs, however this meant a 7-dimensional problem for mathematical modeling of song generation. Through experiments, it was deduced that song generation in songbirds could be mathematically modeled with respect to two variables: the stiffness of the labia (the portion of the syrinx that oscillates) and the air pressure generated by the air-flow from the lungs to the vocal tract [2].

syrinx_muscles (image from [2])

Mathematical Modeling of the Syrinx
Sound is created by periodic airflow fluctuations (oscillations). A baseline model for the syrinx was proposed as a very basic spring-mass nonlinear oscillator model which turns out to be a slight variation of a Van der Pol Oscillator (see references for details).

Here is the simplest mathematical model of the syrinx that is supported by experimental evidence. In this scheme, x represents the mid-point of the labia, k is the labia stiffness and p is the air-flow pressure. Note here that these two variables (k and p) are time-varying.


Mathematical Modeling of the Songbird Brain
Here, I look at the model proposed in [4] that models a population of RA neurons using a central pattern generator.

(image from [4])

Here, xk and xp are the neurons that control the syringial and respiratory muscles respectively. The neuron y is the inter-neuron governing the co-ordination between xp and xk. The output of this network controls the two parameters for birdsong generation: the labia elasticity (k) and the air pressure (p). The rho values represent the excitatory connections from the HVC area. In their paper, the authors de fine paths in the
p-k parameter space based on the excitatory input rho2 which creates a di fferent syllable for each value.

Here is the actual mathematical model, where S is the sigmoid function.


In their paper, the authors show that they were able to closely replicate a recorded birdsong using this central pattern generator. The following is my attempt to replicate their result in Matlab using the same model and parameter values from their paper.

Creating Synthetic Birdsongs
Here is how to create a synthetic version of the white-crowned sparrow song shown in [4].

Model for the syrinx:

%The bird syrinx model used in
%From Laje and Mindlin(2002), Diversity within a Birdsong, Physical Review
%pt = interlabial air pressure
%k:labia elasticity
function yprime = syrinx(t,y,flag,tra,p,k)

b=1000;  %(dissipation-friction coefficient) parameter taken from the paper
d=power(10,8);  %(nonlinear dissipation coefficent for bounded motion) parameter taken from the paper

yprime(1) = y(2) ;


yprime(2) = (pt-b)*y(2) - kt*y(1) - d*(power(y(1),2)*y(2));
yprime = yprime';

Model for the Songbird brain (the Central Pattern Generator):

%The CPG (Central Pattern Generator) Neural Network
%Modeling the RA brain region of a songbird
%From Laje and Mindlin(2002), Diversity within a Birdsong, Physical Review
%the parameter rho2 is varied where all other parameters are fixed in
%generating birdsong syllables
function yprime = birdBrain(t,y,flag,rho2)
yprime(1) = 30 * ( - y(1)  + ( 1 / ( 1 + exp(-1 * (rho1  + A * y(1) - B *y(2)))) )) ;
yprime(2) = 30 * ( - y(2)  + ( 1 / ( 1 + exp (-1 * ( rho2 + C * y(1) - D*y(2) + alpha * y(3))))));
yprime(3) = 120 * ( -y(3)  + ( 1 / ( 1 + exp ( -1 * (rho3 + E* y(3) - beta*y(2))))));
yprime = yprime';

Singing a syllable (singSyllable.m):

function [t,song] = singSyllable(rho2)
Fs = 22050;  %sampling
t = 0:1/Fs:0.24; 

[t1 y] = ode15s('birdBrain',t,[0.01 0.01 0.01],[],rho2);

kt=1.4*power(10,9)*y(:,3) + 4.8 * power(10,8);
pt=7000*y(:,1) - 2200;

[m z1] = ode15s('syrinx',t,[0.01 0.01],[],t,pt,kt);

Creating a whole song of syllables a-b-c-c (CreateSong.m)

clear all;
Fs = 22050;  %sampling

%syllable a
disp('Generating Syllable a (rho2=-11)');
[m1 s1]=singSyllable(-11.0);
disp('Syllable a generated');

%play the syllable

%syllable b
disp('Generating Syllable b (rho2=-11.8)');
[m2 s2]=singSyllable(-11.8);
disp('Syllable c generated');
%play the syllable

%syllable c
disp('Generating Syllable c (rho2=-7.1)');
[m3 s3] = singSyllable(-7.1);
disp('Syllable c generated');
%play the syllable

%stich the song up
disp('Stiching the song up a-b-c-c');
song = [ s1(:,1) ;s2(:,1) ;s3(:,1); s3(:,1)];
disp('Saving wav file');

disp('Playing wav file');

Here is how this synthetic birdsong sounds:

This material was prepared for a graduate level course in Advanced Ordinary Differential Equations. The project report can be found in http://mindwriting.org/research/misc/birdsong_dynamics_math330.pdf and the source code can be found here

[1] Gabriel B. Mindlin and Rodrigo Laje. The Physics of Birdsong. Series in Biological and Medical Physics,Biomedical Engineering. Springer, 2005.

[2]Rodrigo Laje, Timothy J. Gardner, and Gabriel B. Mindlin. Neuromuscular control of vocalizations in birdsong: A model. Physical Review E: Statistical,Nonlinear, and Soft Matter Physics,65:051921-8:051921-8, 2002.

[3]Fernando Nottebohm. Birdsong’s clockwork. Nature Neuroscience, 5:925-926, 2002.

[4]Rodrigo Laje and Gabriel B. Mindlin. Diversity within a birdsong. Physical Review Letters, 89:288101-1:288102-4, 2002.

Posted in Algorithms, Demos, Dynamic Systems | Tagged , , , | Leave a comment

Machine Learning web service using Python,Bottle and Scikit-Learn

Download PDF

Software as a service (SAAS) is a nice way to provide analytics capabilities to people who are not experts in machine learning and/or do not have time to build the necessary tools. Here, I implemented a simple web service utilizing the python based machine learning toolkit (scikit-learn) that applies simple dimensionality reduction algorithms (Principal Components Analysis and Linear Discriminant Analysis) to a dataset of user’s choice and returns 2D visualizations of the data.

The implementation is totally python based and it uses the Bottle Web Framework. The service is located at:http://mindwriting.org:8073/

First, it will allow you to upload your dataset as a comma separated .csv file. One restriction of the application is that it needs the class labels to be provided as the last column of the dataset. You will also need to provide a header line with attribute names. An example dataset looks like this:


Then, you submit the dataset and get back the visualization. Here are some examples along with the datasets:
wine dataset (comma-delimited csv)

iris dataset (comma-delimited csv)

digits dataset (comma-delimited csv)

Ripley’s Leptograpsus crabs dataset (comma-delimited csv)


Finally, here is how it is done:
A very simple form (upload.html) to upload the dataset and call the service.

  action="/plot" method="post" 
  Select a file: <input type="file" name="upload" />
  <input type="submit" value="PCA & LDA" />

This is what the service returns: just an image of the 2D visualizations embedded in html:

html = '''
        <img src="data:image/png;base64,{}" />

The main work is done by the plot() function (pca_lda_viz.py) that receives the uploaded .csv file, extracts the attributes and class variable, applies data transformations, creates 2D visualizations.

@route('/plot', method='POST')
def plot():

   # Get the data
   upload = request.files.get('upload')
   mydata = list(csv.reader(upload.file, delimiter=','))

   x = [row[0:-1] for row in mydata[1:len(mydata)]]

   classes =  [row[len(row)-1] for row in mydata[1:len(mydata)]]
   labels = list(set(classes))

   classIndices = np.array([labels.index(myclass) for myclass in classes])

   X = np.array(x).astype('float')
   y = classIndices
   target_names = labels

   #Apply dimensionality reduction
   pca = PCA(n_components=2)
   X_r = pca.fit(X).transform(X)

   lda = LDA(n_components=2)
   X_r2 = lda.fit(X, y).transform(X)

    #Create 2D visualizations
   fig = plt.figure()
   ax=fig.add_subplot(1, 2, 1)
   bx=fig.add_subplot(1, 2, 2)

   fontP = FontProperties()

   colors = np.random.rand(len(labels),3)
   for  c,i, target_name in zip(colors,range(len(labels)), target_names):
       ax.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, 
       ax.legend(loc='upper center', bbox_to_anchor=(1.05, -0.05),
                 fancybox=True,shadow=True, ncol=len(labels),prop=fontP)
       ax.tick_params(axis='both', which='major', labelsize=6)

   for c,i, target_name in zip(colors,range(len(labels)), target_names):
       bx.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, 
       bx.tick_params(axis='both', which='major', labelsize=6)

   # Encode image to png in base64
   io = StringIO()
   fig.savefig(io, format='png')
   data = io.getvalue().encode('base64')

   return html.format(data)

The code above creates random colors to represent each class. However, it sometimes does not generate distinct looking colors. I leave it as an exercise to create a set of N distinct colors.

The largest dataset I have tested is the madelon dataset download here:madelon_training set (500 attributes, 2000 rows) (comma-delimited csv) Response time is pretty good but the data transformations we get from these two simple methods are not interesting at all–this was a feature extraction challenge dataset after all!

The service has not been designed to process very large data files–see Mahout(java) for map-reduce implementation. Therefore, it is not ready for the big data frenzy. Nevertheless, I believe it is a good start that can be extended for big learning.

In order to run the example, just change the hostname and port number in pca_lda_viz.py and start the service as:

> python pca_lda_viz.py

If all goes fine, it should start without any error messages. All done!

Github repository: https://github.com/ilknuricke/bottle_scikit_learn_web_services

Posted in Demos | Tagged , , | Leave a comment

Needleman-Wunsch pairwise DNA sequence alignment RESTful Web Service

Download PDF

Here is a RESTful web service that provides global DNA sequence alignment implementation as a service.

Screenshot from 2014-05-26 13:46:01

See my post on the algorithm itself. Download .war archive with java source code if you would like to install it yourself.

The service can also be called directly: http://mindwriting.org:8080/BioREST/resources/alignment?sequences=GC,C returning the response as text.

This implementation uses gap penalty score=2 and mismatch penalty=1. See here for a more sophisticated implementation (ie. with gap extension penalties and so on) from the The European Bioinformatics Institute.

Posted in Algorithms, Demos | Tagged , , , , | Leave a comment

Needleman-Wunsch algorithm for DNA sequence alignment (2 sequences)

Download PDF

Update (May 26th 2014): See the RESTful web service implementation of this algorithm here

Given two DNA sequences, you are asked to align these two sequences where for each non-matching nucleotide pair you will be penalized by 1 point and for each gap (space) you insert into a sequence to shift a portion of the sequence, you will be penalized by 2 points. The requirement is that you will have to do this alignment with the least amount of penalties as possible (thus minimizing the total edit distance between these two sequences).

Here is an example:

First possible alignment:
edit distance: mismatch(G,C): 1 point penalty + gap (C,-): 2 point penalty = 3

Second possible alignment:
edit distance: gap (G,-): 2 point penalty

Since the second alignment gives us the smaller edit distance (least total penalties), the second alignment is optimal.

This problem can be solved for any pair of sequences of arbitrary sizes using a dynamic programming approach that is known as the Needleman-Wunsch algorithm published in 1970.

This algorithm is used for the global alignment problem in bioinformatics, where the sequences are of more or less the same size and largely similar. The gist of the algorithm is to divide the problem into smaller sub-problems (align prefixes and build the alignment of the whole sequences based on that). The advantage of dynamic programming is that it computes the solution to a sub-problem only once.

If the sequences are of significantly different sizes and/or mostly dissimilar, local alignment algorithms are used to identify the sub-sequences that are similar. See the
Smith-Waterman algorithm for local alignment.

These algorithms are so common in bioinformatics–they are part of many online toolkits. Likewise, many implementations in Java, Python and other languages can be found online. My suggestion is that you implement the algorithm from scratch to appreciate its elegance and to convince yourselves that it indeed works.

Here is my implementation of the Needleman-Wunsch algorithm in Java:

public class NeedlemanWunsch {	
public static boolean verbose=false;

* prints out the score matrix 
public static void dumpMatrix(int[][] matrix, 
                              String row, 
                              String column){
 for (int j =0; j< row.length(); j++){
   System.out.print(String.format("%5s", row.charAt(j)+"  "));
for (int i =0; i< column.length(); i++){
  System.out.print(String.format("%5s",column.charAt(i)+ " "));
  for (int j =0; j< row.length(); j++){
     System.out.print(String.format("%5s", matrix[j][i]+" "));
* Needleman-Wunsch Dynamic Programming Algorithm
* see http://amrita.vlab.co.in/?sub=3&brch=274&sim=1431&cnt=1
* Runtime complexity: O(NxM), 
* Space complexity: O(NxM) where N,M are lengths of the sequences
public static AlignmentResult computeNWAlignment(String seq1, 
                                                 String seq2, 
                       SimpleAlignmentParameters parameters) {
//It is easier to index and faster to use
//integer matrix for fixed length storage
int[][] scoreMatrix = new int[seq1.length()+1][seq2.length()+1];
int gapPenalty=parameters.getGapPenalty();
int substitutePenalty=parameters.getSubstitutePenalty();
AlignmentResult result = new AlignmentResult();
//Initialize the score matrix
//the first row and column are for the gap
//Complexity: O(NxM)
for (int i =0; i< seq1.length()+1; i++){
 for (int j =0; j< seq2.length()+1; j++){
     if (i==0) 
	scoreMatrix[i][j] = gapPenalty*j;
     else if (j==0) 
        scoreMatrix[i][j] = gapPenalty*i;

if (verbose){
   System.out.println("Initial Matrix");
   dumpMatrix(scoreMatrix, "-"+seq1, "-"+seq2);
int similarityCost=0;
int matchCost=0;
int seq1GapCost=0;
int seq2GapCost=0;
//Compute the minimum cost scores between all 
//possible pairs of prefixes
//Complexity: O(NxM)
for (int i =1; i< seq1.length()+1; i++){
for (int j =1; j< seq2.length()+1; j++){
//Case 1: The cost of mistmatch between the two prefixes
similarityCost= (seq2.charAt(j-1)==seq1.charAt(i-1)) ? 0 : 

matchCost = scoreMatrix[i-1][j-1] + similarityCost;
//Case 2: the cost of adding a gap on sequence 2
seq2GapCost = scoreMatrix[i-1][j] + gapPenalty;
//Case 3: the cost of adding a gap on sequence 1
seq1GapCost = scoreMatrix[i][j-1] + gapPenalty;
scoreMatrix[i][j] = Math.min(Math.min(matchCost,seq1GapCost),
if (verbose){
 System.out.println("\nFilled Matrix");
 dumpMatrix(scoreMatrix, "-"+seq1, "-"+seq2);
//Reconstruct the Alignment by backtracking on 
//the score matrix
//Complexity O(N+M)
StringBuilder alignedSequence1= new StringBuilder();
StringBuilder alignedSequence2= new StringBuilder();
int j = seq2.length();
int i = seq1.length();
while (i >0 || j > 0) {
 if (i>0 && j > 0) 
    similarityCost= (seq2.charAt(j-1)==seq1.charAt(i-1)) ? 0 : 
    similarityCost = Integer.MAX_VALUE;
if ( i > 0 && j >0 && 
     scoreMatrix[i][j] == scoreMatrix[i-1][j-1] + similarityCost) { 
 else if ( i > 0 && 
      scoreMatrix[i][j] == scoreMatrix[i-1][j] + gapPenalty){
else if ( j > 0 && 
       scoreMatrix[i][j] == scoreMatrix[i][j-1] + gapPenalty){
} // end of while
 result.setAlignments(new String[] 
 return result;

Helper classes: SimpleAlignmentParameters and AlignmentResult:

 *  Alignment parameters used in Needleman-Wunsch Algorithm 
* with gap penalty
public class SimpleAlignmentParameters {

	private int gapPenalty;
	private int substitutePenalty;

	public int getGapPenalty() {
		return gapPenalty;

	public void setGapPenalty(int gapPenalty) {
		this.gapPenalty = gapPenalty;

	public int getSubstitutePenalty() {
		return substitutePenalty;

	public void setSubstitutePenalty(int substitutePenalty) {
		this.substitutePenalty = substitutePenalty;

	public SimpleAlignmentParameters(int gapPenalty, 
                                  int substitutePenalty) {
		this.gapPenalty = gapPenalty;
		this.substitutePenalty = substitutePenalty;
	public SimpleAlignmentParameters() {
		this.gapPenalty = 2;
		this.substitutePenalty = 1;
 * Class that encapsulates results of a binary Alignment: 
* parameters, edit distance and the actual alignments
public class AlignmentResult {
	private int totalCost=0;
	private int alignmentLength=0;
	public int getAlignmentLength() {
		return alignmentLength;

	public void setAlignmentLength(int alignmentLength) {
		this.alignmentLength = alignmentLength;

	private int matches=0;
	private AlignmentParameters parameters=null;
	private String[] alignments=null;

	public int getMatches() {
		return matches;

	public void setMatches(int matches) {
		this.matches = matches;

	public int getTotalCost() {
		return totalCost;

	public void setTotalCost(int totalCost) {
		this.totalCost = totalCost;

	public AlignmentParameters getParameters() {
		return parameters;

	public void setParameters(AlignmentParameters parameters) {
		this.parameters = parameters;

	public String[] getAlignments() {
		return alignments;

	public void setAlignments(String[] alignments) {
		this.alignments = alignments;
	private int alignmentScore(String seq1, String seq2){
		  int totalCost=0;
		  for (int k=0; k < seq1.length(); k++){
			  if (seq1.charAt(k)!=seq2.charAt(k)) {
		            if ( (seq1.charAt(k)!='-') && 
                                (seq2.charAt(k)!='-')  ) 
			  if ( (seq1.charAt(k)=='-') || 
                               (seq2.charAt(k)=='-')  ) {
		  return totalCost;

Here is the result for a pair of DNA sequences where the gap penalty is 2 and substitution penalty is 1:

Test Alignments using Needleman-Wunsch Algorithm
Original Sequences:
Initial Matrix
       -    G    C    G    C    A    A    T    G  
   -    0    2    4    6    8   10   12   14   16 
   G    2    0    0    0    0    0    0    0    0 
   C    4    0    0    0    0    0    0    0    0 
   C    6    0    0    0    0    0    0    0    0 
   C    8    0    0    0    0    0    0    0    0 
   T   10    0    0    0    0    0    0    0    0 
   A   12    0    0    0    0    0    0    0    0 
   G   14    0    0    0    0    0    0    0    0 
   C   16    0    0    0    0    0    0    0    0 
   G   18    0    0    0    0    0    0    0    0 

Filled Matrix
       -    G    C    G    C    A    A    T    G  
   -    0    2    4    6    8   10   12   14   16 
   G    2    0    2    4    6    8   10   12   14 
   C    4    2    0    2    4    6    8   10   12 
   C    6    4    2    1    2    4    6    8   10 
   C    8    6    4    3    1    3    5    7    9 
   T   10    8    6    5    3    2    4    5    7 
   A   12   10    8    7    5    3    2    4    6 
   G   14   12   10    8    7    5    4    3    4 
   C   16   14   12   10    8    7    6    5    4 
   G   18   16   14   12   10    9    8    7    5 
elapsed time (sec): 0.01765
GCGC-AATG size:9
|| | |  |
Match Score=5, 0.5555556 gaps:1
Edit Distance=5
Alignment Length=9

Here is how the algorithm finds the solution by backtracking the steps on the score matrix:

Needleman-Wunsch backtracking

Needleman-Wunsch backtracking

The algorithm constructs the alignment in reverse order, a diagonal move means match (black arrow) or a mismatch (red arrow) and a vertical or horizontal move (green arrow) means insertion of a gap either on the first (vertical) or the second sequence (horizontal). Note here that, the first sequence is placed horizontally (running from left to right) and the second sequence is placed vertically (running from top to bottom) on the score matrix.

Question: do you think this is the only optimal alignment for this given pair?
show answer

Here is the code that creates the result given above:

public static void testNWAlignment(String seq1, String seq2){
System.out.println("\n\nTest Alignments using Needleman-Wunsch Algorithm");
System.out.println("Original Sequences:");
long start=System.nanoTime();
AlignmentResult result=NeedlemanWunsch.computeNWAlignment(seq1,
                       new SimpleAlignmentParameters());
long stop=System.nanoTime();
System.out.println( "elapsed time (sec): " + 
                    (float)(stop-start)/Math.pow(10, 9)));
String[] alignments= result.getAlignments();
System.out.println(alignments[0] + " size:" + 
int matches=0;
int gaps=0;
for (int k=0; k < alignments[0].length(); k++){
if (alignments[0].charAt(k)==alignments[1].charAt(k)) {
} else System.out.print(" ");
  if ( (alignments[0].charAt(k)=='-') ||
       (alignments[1].charAt(k)=='-')  ) 
System.out.println(alignments[1] + " size:" + 
System.out.println("Match Score=" + matches + ", " + 
                   " gaps:" + gaps);
System.out.println("Edit Distance="+ result.getTotalCost());
System.out.println("Alignment Length="+ 

Test your understanding of the algorithm :
-If all you wanted was the optimal alignment score (edit distance, smallest total penalties) but not the aligned sequences themselves, where in the score matrix would you look to retrieve this score?

-How do you prove that the optimal alignment score computed by this algorithm is indeed optimal (ie. the smallest edit distance between the two sequences) ?

-Can you calculate this optimal alignment score without using O(NxM) space for the score matrix (N,M are lengths of the sequences)? How many rows would you need on the score matrix?

A couple of ideas to explore:
-How do you modify the algorithm to print out all optimal alignments instead of just one?

-How easy is it to parallelize this algorithm? What would be the runtime/space complexity of such an algorithm?

-How to modify the Needleman-Wunsch algorithm to perform local alignment?

-What happens if you modify the gap and substitution penalties? What are biologically plausible ways to extend these penalties for DNA sequence matching in real applications?

-Can you improve the space complexity of the algorithm to be linear instead of quadratic? (Hint: Look up Hirschberg’s algorithm that was published in 1975)

-Can you extend this algorithm to align 3 DNA sequences at once? How about N DNA sequences?

- Check out http://amrita.vlab.co.in/?sub=3&brch=274&sim=1431&cnt=1 for more information on the biology side as well as the algorithm itself

-Dynamic programming and sequence alignment : http://www.ibm.com/developerworks/opensource/library/j-seqalign/index.html

Posted in Algorithms | Tagged , , , | Leave a comment