Machine Learning web service using Python, Bottle and Scikit-Learn

Software as a service (SaaS) is a nice way to provide analytics capabilities to people who are not experts in machine learning or do not have time to build the necessary tools. Here, I implemented a simple web service using scikit-learn, the Python-based machine learning toolkit. It applies two simple dimensionality reduction algorithms, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), to a dataset of the user's choice and returns 2D visualizations of the data.

The implementation is entirely Python-based and uses the Bottle web framework. The service is located at: http://mindwriting.org:8073/

First, it lets you upload your dataset as a comma-separated .csv file. The application requires the class labels to be in the last column of the dataset, and the first row must be a header line with the attribute names. An example dataset looks like this:


example_data
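
As a rough illustration of that layout, the sketch below parses a tiny made-up CSV and splits it into attribute names, attribute values, and class labels. The column names and values are invented for this example and are not taken from any of the datasets in this post.

```python
import csv
import io

# Hypothetical three-row dataset: header line first, class label in the last column.
sample = """height,width,species
1.0,2.0,A
1.5,1.8,B
0.9,2.2,A
"""

# Skip blank lines so a trailing newline does not produce an empty record.
rows = [row for row in csv.reader(io.StringIO(sample)) if row]
header, records = rows[0], rows[1:]

attribute_names = header[:-1]                            # every column except the last
features = [[float(v) for v in r[:-1]] for r in records]  # numeric attributes
labels = [r[-1] for r in records]                        # last column holds the class

print(attribute_names)
print(features)
print(labels)
```

A file that parses cleanly this way should be in the shape the service expects.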

Then, you submit the dataset and get back the visualization. Here are some examples along with the datasets:
wine dataset (comma-delimited csv)
wine_data_result

iris dataset (comma-delimited csv)
iris_data_result

digits dataset (comma-delimited csv)
digits_data_result

Ripley’s Leptograpsus crabs dataset (comma-delimited csv)


crabs_data_result

Finally, here is how it is done. A very simple form (upload.html) uploads the dataset and calls the service:

<form 
  action="/plot" method="post" 
  enctype="multipart/form-data"
>
  Select a file: <input type="file" name="upload" />
  <input type="submit" value="PCA & LDA" />
</form>

The service returns just an image of the 2D visualizations, embedded in HTML:

html = '''
<html>
    <body>
        <img src="data:image/png;base64,{}" />
    </body>
</html>
'''
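
To make the `{}` placeholder concrete: the template expects the PNG bytes as base64 text, which the browser decodes inline through a `data:` URI. Here is a minimal sketch of that step, using a hypothetical helper named `embed_png` and placeholder bytes standing in for a real image:

```python
import base64

# Same shape as the template above, repeated so the snippet runs on its own.
HTML_TEMPLATE = '<html><body><img src="data:image/png;base64,{}" /></body></html>'

def embed_png(png_bytes):
    """Return an HTML page with the PNG embedded as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return HTML_TEMPLATE.format(encoded)

# Placeholder bytes: only the 8-byte PNG signature, not a complete image.
signature = b"\x89PNG\r\n\x1a\n"
page = embed_png(signature)
print(page)
```

Embedding the image this way means the response is a single HTML document, so the service never has to store the plot on disk or serve it from a second URL.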

The main work is done by the plot() function (pca_lda_viz.py), which receives the uploaded .csv file, extracts the attributes and the class variable, applies the data transformations, and creates the 2D visualizations:

# Written for Python 3 and a recent scikit-learn; `html` is the template above.
import base64
import csv
import io

import matplotlib
matplotlib.use('Agg')  # draw off-screen; a server has no display
import matplotlib.pyplot as plt
import numpy as np
from bottle import request, route
from matplotlib.font_manager import FontProperties
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA


@route('/plot', method='POST')
def plot():

    # Get the data: decode the upload and skip blank lines
    upload = request.files.get('upload')
    lines = upload.file.read().decode('utf-8').splitlines()
    mydata = [row for row in csv.reader(lines, delimiter=',') if row]

    # Row 0 is the header; the last column of every other row is the class
    x = [row[:-1] for row in mydata[1:]]
    classes = [row[-1] for row in mydata[1:]]
    labels = sorted(set(classes))

    # Map each class label to an integer index
    classIndices = np.array([labels.index(myclass) for myclass in classes])

    X = np.array(x).astype('float')
    y = classIndices
    target_names = labels

    # Apply dimensionality reduction
    pca = PCA(n_components=2)
    X_r = pca.fit(X).transform(X)

    # Note: LDA yields at most (number of classes - 1) components
    lda = LDA(n_components=2)
    X_r2 = lda.fit(X, y).transform(X)

    # Create 2D visualizations
    fig = plt.figure()
    ax = fig.add_subplot(1, 2, 1)
    bx = fig.add_subplot(1, 2, 2)

    fontP = FontProperties()
    fontP.set_size('small')

    # One random RGB color per class
    colors = np.random.rand(len(labels), 3)

    for c, i, target_name in zip(colors, range(len(labels)), target_names):
        ax.scatter(X_r[y == i, 0], X_r[y == i, 1], color=c, label=target_name)
        bx.scatter(X_r2[y == i, 0], X_r2[y == i, 1], color=c, label=target_name)

    ax.set_title('PCA')
    ax.tick_params(axis='both', which='major', labelsize=6)
    ax.legend(loc='upper center', bbox_to_anchor=(1.05, -0.05),
              fancybox=True, shadow=True, ncol=len(labels), prop=fontP)

    bx.set_title('LDA')
    bx.tick_params(axis='both', which='major', labelsize=6)

    # Encode image to png in base64
    buf = io.BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)  # release the figure so repeated requests do not leak memory
    data = base64.b64encode(buf.getvalue()).decode('ascii')

    return html.format(data)

The plot() function picks a random color for each class, so the colors are not always visually distinct. I leave it as an exercise to create a set of N distinct colors.
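
One possible approach to that exercise, offered only as a sketch: space the hues evenly around the color wheel with the standard-library `colorsys` module. The function name `distinct_colors` and the fixed saturation and brightness values are my own choices, not part of the service's code.

```python
import colorsys

def distinct_colors(n, saturation=0.8, value=0.9):
    """Return n RGB tuples whose hues are spaced evenly around the color wheel."""
    return [colorsys.hsv_to_rgb(i / n, saturation, value) for i in range(n)]

palette = distinct_colors(4)
for rgb in palette:
    print(tuple(round(channel, 2) for channel in rgb))
```

A call such as `distinct_colors(len(labels))` could stand in for the random color array in plot(). Neighbouring hues get harder to tell apart as N grows, so for many classes it may help to vary brightness or marker shape as well.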

The largest dataset I have tested is the madelon dataset; download it here: madelon_training set (500 attributes, 2000 rows) (comma-delimited csv). Response time is pretty good, but the data transformations we get from these two simple methods are not interesting at all; this was a feature extraction challenge dataset after all!
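
For readers wondering what the PCA transformation actually computes, here is a small NumPy-only sketch: center the data, take a singular value decomposition, and project onto the two leading directions. This is an illustration on synthetic data, not scikit-learn's implementation, and the function name `pca_2d` is hypothetical. The result may differ from scikit-learn's output by the sign of each axis.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the two directions of largest variance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)               # synthetic data, for illustration only
X_demo = rng.normal(size=(50, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
X_2d = pca_2d(X_demo)
print(X_2d.shape)
```

Because PCA only looks for directions of high variance and ignores the class labels, a dataset padded with many noisy attributes can easily produce an uninformative 2D picture.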

The service has not been designed to process very large data files; see Mahout (Java) for a map-reduce implementation. It is therefore not ready for the big data frenzy. Nevertheless, I believe it is a good start that can be extended for big learning.

In order to run the example, just change the hostname and port number in pca_lda_viz.py and start the service as:

> python pca_lda_viz.py

If all goes fine, it should start without any error messages. All done!
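
The launch code at the bottom of pca_lda_viz.py is not shown in this post, so the following is only a guess at its general shape. It reads the hostname and port from environment variables instead of hard-coding them. The variable names `VIZ_HOST` and `VIZ_PORT` and both function names are made up for this sketch; `run(host=..., port=...)` is Bottle's standard way to start the built-in server.

```python
import os

def server_settings(env=None):
    """Read host and port from the environment, falling back to local defaults."""
    env = os.environ if env is None else env
    return {
        "host": env.get("VIZ_HOST", "localhost"),   # hypothetical variable name
        "port": int(env.get("VIZ_PORT", "8080")),   # hypothetical variable name
    }

def main():
    # Imported here so the settings helper can be used without Bottle installed.
    from bottle import run
    run(**server_settings())

print(server_settings({"VIZ_HOST": "0.0.0.0", "VIZ_PORT": "8073"}))
```

Calling `main()` at the end of the script would start the server with whatever settings are found, which avoids editing the source for each deployment.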

GitHub repository: https://github.com/ilknuricke/bottle_scikit_learn_web_services
