
CancerDiscover

GSoC Project Proposal - Helikar Lab


Computational Biology at University of Nebraska, Lincoln

Achilles Rasquinha
March 31, 2017

1 Introduction
This proposal is addressed to Helikar Lab, under the organization
Computational Biology at University of Nebraska, Lincoln. I am keen to work
on the project CancerDiscover: a GUI for cancer prediction and biometric
identification using microarray data.

2 Student Information
General Information
Name: Achilles Rasquinha
Alternate Names:
GitHub: github.com/achillesrasquinha
LinkedIn: linkedin.com/in/achillesrasquinha
Email: achillesrasquinha@gmail.com
Time Zone: Asia/Kolkata (UTC +05:30)
Website: Brains from Scratch
Background Information

Education
University of Mumbai
Bachelor's Degree, Computer Engineering (2013 - Present)
Courses: Machine Learning, Artificial Intelligence, Data
Warehouse and Mining, Soft Computing, Distributed Databases,
Software Engineering, Structured and Object-Oriented Analysis
and Design.
St. Xavier's College, Mumbai
Higher Secondary Certificate (2011 - 2013)

Courses: Physics, Applied and Organic Chemistry, Applied
Mathematics, Economics and Biology.
What are your languages of choice?
I'm an eternal Pythonista and currently, a Pythoneer. I'm also
well-versed in C++ (STL and Boost) and Java. I also currently work
on various projects written in JavaScript (Node.js).

Overall, Python and JavaScript are my two default languages of
choice.

Any prior experience with Open Source Development?


I've recently authored bulbea, a Deep Learning based open-source
Python library for stock market prediction and modelling, which has
drawn attention from the Deep Learning community. Currently with
165 stars and 61 forks on GitHub, bulbea aims to be a go-to Python
toolkit for quantitative finance that makes it easy for developers
and users to get going with algorithmic financial trading.
(github.com/achillesrasquinha/bulbea)

I've also authored SnackJS - Android Snackbars for the web, a
Responsive Web Design UI component for notification and feedback
on the web, written in TypeScript and SASS. It currently has 1,236
downloads.
(npmjs.com/package/snackjs)

I've also contributed to various Open Source projects (from minor
bug fixes to enhancements) - stingray (Pull Request 175),
gesture-opencv (Pull Request 5), etc.

I'm a self-proclaimed Python and Agile-Development evangelist,
currently attempting to spread the use of Python and various Open
Source projects as toolkits for research and science at my University.
To that end, I've conducted many workshops and sessions as an
instructor (under the guidance of my Department) at the institute,
where students learn and build applications that help solve
real-world problems.

What do you want to learn this summer?


This summer, I'd like to learn the many applications and potential
that Machine Learning has in the field of Bioinformatics. Given that
I'm currently pursuing core Computer Science, I'd also like to learn
what it takes to solve Machine Learning problems with the guidance
of domain experts (in this case, the mentors). Finally, I'd like to
learn from and overcome the challenges and hurdles I may face, if
given the chance to implement the said project.

Any prior exposure to biology or bioinformatics?


I've previously worked with the Central Institute for Cotton
Research, India, where I built and optimized a classification model
(Support Vector Machine) to classify various germplasm of cotton
from an extremely raw and highly constrained data set, and with
limited domain expertise. Nonetheless, the resulting classification
model reached an accuracy of 75% after considering various
imputation strategies, normalization and cross-validation techniques.
For more details, visit Brains from Scratch - Case Study 1.

Any interest in learning a bit of biology this summer?


Yes, very much. I'm keen to learn various visualization techniques (in
particular, visualizing multi-dimensional biological data) and to
build a smooth pipeline for microarray experiments (integrated with
CancerDiscover) that can be used by biologists worldwide.

3 Project Information
Proposal Title

CancerDiscover: a GUI for cancer prediction and biometric
identification using microarray data.
Proposal Abstract

Problem
As of today, CancerDiscover's build and workflow make it hard for users
to conduct microarray experiments. The current setup requires users to
manually download Affymetrix CEL files and to label them separately. No
visual analysis is provided at any point during the experiment. Users are
also limited to the default parameters for each classifier (during
analysis), which prevents them from building more efficient prediction
models. Finally, installation and usage of the overall workflow are
poorly documented.

Solution
candis - A minimalistic, clean Graphical User Interface integrated with
the current command-line tool will not only provide a smooth build and
workflow for an experiment setup, but also provide remote downloading
of data sets, quality control visualizations and user-defined
parameters during analysis. Such a modular framework will also allow
future extensions with more methods and techniques. Users will also
have access to a well-documented manual that eases the overall use of the
proposed software. The ongoing development of this application
can be viewed at github.com/achillesrasquinha/candis
Proposal Description

candis will revolve around a single instance of a QtWidgets.QMainWindow
subclass (candis.Window) that aims to make the entire experiment
setup extremely simple (from Data Selection to Model Deployment) for
new users, who navigate sequentially across only 6 tabs. Heavily inspired
by the way a machine learning problem is approached, these tabs are named
as follows: Source, Preview, Preprocess, Model, Analysis and Predict.

Each of these tabs shall be linked to a single frame that performs an
independent task over the dataset (loading, preprocessing, visualization,
etc.) and passes its output dataset to the next frame (or task) in the
pipeline.
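
The hand-off between frames can be sketched in plain Python, independent of
Qt. This is only an illustrative sketch: Stage and run_pipeline are
hypothetical names, not part of candis, and the toy tasks stand in for the
real loading and preprocessing steps.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    """One tab/frame in the pipeline (hypothetical helper, not candis API)."""
    name: str
    task: Callable[[Any], Any]  # takes a dataset, returns the dataset for the next stage


def run_pipeline(stages: List[Stage], dataset: Any) -> Any:
    """Pass the dataset through each stage in order, as the tabs would."""
    for stage in stages:
        dataset = stage.task(dataset)
    return dataset


# Toy stand-ins for the real tasks; only the tab names come from the proposal.
stages = [
    Stage("Source", lambda rows: [r for r in rows if r is not None]),
    Stage("Preprocess", lambda rows: [r / max(rows) for r in rows]),
]

result = run_pipeline(stages, [2.0, None, 4.0])
print(result)
```

Keeping each task a plain callable means a frame can be tested without
starting the GUI.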
Source

The first tab in the pipeline, the Source frame, shall provide the
following functionalities:
- A live search for users to select and download raw Affymetrix CEL
  files from the National Center for Biotechnology Information
  website (either via NCBI FTP or via NCBI's preferred API) based
  simply on a user search query - using requests. Downloaded data
  sets will be cached on the user's local disk for future use.
- A dataset loader for datasets available on a local machine.
- An editor to input label names (normal, tumour, etc.) for each
  custom-selected dataset.
- A visual list to display metadata about available datasets and
  to custom select datasets (Data Selection) for the next stages in
  the pipeline.
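
The caching behaviour could look roughly like the sketch below. The function
name cached_download and the hashing scheme are assumptions made for
illustration. The fetch callable is injected so the sketch runs offline; for
HTTP URLs it could be something like `lambda u: requests.get(u).content`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_download(url: str, cache_dir: Path, fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, fetching only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any query string maps to a safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / key
    if not path.exists():
        path.write_bytes(fetch(url))
    return path


# Usage with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"CEL file bytes"


cache = Path(tempfile.mkdtemp()) / ".cache"
url = "https://example.invalid/sample.CEL.gz"  # placeholder URL, not a real data set
first = cached_download(url, cache, fake_fetch)
second = cached_download(url, cache, fake_fetch)
print(first == second, len(calls))
```

The second call returns the cached file without invoking the fetcher again.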

Preview

The next frame in the pipeline, the Preview frame, shall provide
the following functionalities:
- Quality Control statistics and visualizations such as
  box-and-whisker plots (upper quartile, interquartile range and
  lower quartile for each microarray sample versus their intensities
  on a logarithmic scale) and density plots (intensities on a
  logarithmic scale versus their densities) - using matplotlib.
- Any microarray samples recommended for removal will be highlighted
  in red.
- Save visualizations in a format of one's choice (.png, .jpeg, etc.)
  - using matplotlib.
The Preview frame will be designed so that developers can smoothly
register more visualization methods as the software evolves over time.
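
As a rough sketch of the statistics behind the box-and-whisker plots, the
quartiles of log2-transformed intensities can be computed per sample with the
standard library alone. The function name boxplot_summary and the toy
intensity values are assumptions; the criterion for flagging a sample for
removal is not specified here and is left out.

```python
import math
import statistics
from typing import Dict, List


def boxplot_summary(intensities: List[float]) -> Dict[str, float]:
    """Quartiles of log2-transformed intensities for one microarray sample."""
    logged = [math.log2(x) for x in intensities]  # intensities must be > 0
    q1, median, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Toy intensities, not real microarray data.
samples = {
    "sample_a": [1, 2, 4, 8, 16],
    "sample_b": [4, 8, 16, 32, 64],
}
summaries = {name: boxplot_summary(values) for name, values in samples.items()}
print(summaries["sample_a"])
```

These per-sample summaries are the values a box-and-whisker plot would draw,
one box per sample.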

Preprocess

The Preprocess frame shall provide the following functionalities:


- Background Correction choice - defaults to Robust Multiarray
  Average (RMA).
- Normalization choice - Scale Normalization (providing a widget
  to adjust the scale), LOWESS Normalization (and its variants)
  and Quantile Normalization (the default).
- Users can visualize normalized data on the current frame with
  visualization techniques such as box-and-whisker plots, density
  plots and MA plots. Moreover, users can save such visualizations
  in a format of their choice (.png, .jpeg, .tiff, etc.).
- Users can save their normalized data onto the local disk and
  reload it into the pipeline when needed.
The Preprocess frame will be written so that developers can easily
register pre-defined or user-defined Background Correction and
Normalization methods as the software evolves over time. Modularity of
the overall framework will be kept in mind throughout the
development cycle of the said software.
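
One possible shape for such a registry is sketched below, together with a
plain-Python quantile normalization as the registered example. NORMALIZERS
and register are hypothetical names, and ties are broken by position rather
than averaged, which is a simplification.

```python
from typing import Callable, Dict, List

Matrix = List[List[float]]  # one inner list per sample, all of equal length

NORMALIZERS: Dict[str, Callable[[Matrix], Matrix]] = {}


def register(name: str):
    """Decorator that adds a normalization method to the registry."""
    def decorator(func):
        NORMALIZERS[name] = func
        return func
    return decorator


@register("quantile")
def quantile_normalize(samples: Matrix) -> Matrix:
    """Give every sample the same distribution: the mean of the sorted samples."""
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(col) / len(col) for col in zip(*sorted_samples)]
    normalized = []
    for sample in samples:
        # order[r] is the index of the r-th smallest value in this sample
        order = sorted(range(len(sample)), key=sample.__getitem__)
        out = [0.0] * len(sample)
        for rank, index in enumerate(order):
            out[index] = rank_means[rank]
        normalized.append(out)
    return normalized


result = NORMALIZERS["quantile"]([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
print(result)
```

A drop-down in the frame could then be populated from the registry keys, so
adding a method would not require touching the GUI code.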

Model

The next frame is Model. Here, we move from the preprocessing
phase to the modelling and analysis phase. Model shall
provide the following functionalities:

- A custom dialog titled Experiment for users to select:
  - Name: A user-defined microarray experiment name.
  - Description: A user-defined description for the experiment.
  - Classifier: Random Forest (default), Decision Trees, Support
    Vector Machines, k-Nearest Neighbors, etc.
  - Parameters: A JSON (Python dict) editor for users to tweak
    the parameter set belonging to each classifier (e.g. entropy
    or gini as the split criterion if the classifier is a
    Decision Tree).
  - Training Size: A user-defined training size (ratio) within
    the range (0, 1].
  - k-Folds: A user-defined fold count for cross-validation. Users
    can tick a check-box if they wish to consider a validation
    split.
  - Dimensionality Reduction: A choice of feature selection
    technique to be used (defaults to Correlation-based Feature
    Selection).
- Users can view the current experiments (complete or in-progress)
  in a tabular form and choose a model for the next stages in the
  pipeline (Analysis and Prediction). Users can view the training
  time taken for each experiment as well.
- Users can save such experiments (serializing candis.Experiment
  objects into a .cache directory on the user's local disk) and
  reload them into the current frame when needed.
- FUTURE SCOPE: Since the Model frame provides combinations of
  various phases and techniques, a base framework for users to
  create experiments using flow graphs will be implemented. The
  current implementation, however, will be a single parent
  candis.Model.Dialog (the Experiment dialog) and one or more
  child QtWidgets.QDialogs.
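
A minimal sketch of what an experiment object might hold is given below. The
class is a hypothetical stand-in for candis.Experiment: the field names, the
default values and the choice of JSON as the on-disk format are all
assumptions, and only the (0, 1] range for the training size comes from the
list above.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict


@dataclass
class Experiment:
    """Hypothetical stand-in for candis.Experiment; field names are assumptions."""
    name: str
    description: str = ""
    classifier: str = "Random Forest"
    parameters: Dict[str, Any] = field(default_factory=dict)
    training_size: float = 0.8  # assumed default; the proposal does not give one
    k_folds: int = 10           # assumed default

    def __post_init__(self):
        if not 0 < self.training_size <= 1:
            raise ValueError("training_size must lie in (0, 1]")
        if self.k_folds < 2:
            raise ValueError("k_folds must be at least 2")

    def save(self, cache_dir: Path) -> Path:
        """Serialize the experiment as JSON inside the cache directory."""
        cache_dir.mkdir(parents=True, exist_ok=True)
        path = cache_dir / f"{self.name}.json"
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def load(cls, path: Path) -> "Experiment":
        return cls(**json.loads(path.read_text()))


cache = Path(tempfile.mkdtemp()) / ".cache"
original = Experiment("demo", classifier="Decision Tree",
                      parameters={"criterion": "entropy"})
restored = Experiment.load(original.save(cache))
print(restored == original)
```

Validating in `__post_init__` means an out-of-range training size is rejected
both when the dialog creates an experiment and when one is reloaded from disk.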

Analysis

Each trained model can be pushed into the Analysis frame, which
shall provide the following functionalities:
- A complete generalized report of the experiment with necessary
  infographics.
- Visualizations for analysis such as Confusion Matrix, ROC
  Curve, etc.
- Metrics such as Accuracy, Precision, Recall, F1-Score, etc.
- Users can save a complete report in the format of one's choice
  (HTML, PDF, etc.) as well as individual visual plots.
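
For reference, the listed metrics follow directly from the confusion matrix
counts. The sketch below computes them for a two-class problem. The function
name binary_metrics, the label strings and the toy predictions are
illustrative assumptions.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "tumour") -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = ["tumour"] * 4 + ["normal"] * 4
y_pred = ["tumour", "tumour", "tumour", "normal",
          "tumour", "normal", "normal", "normal"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```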

Predict

The last frame in the pipeline, the Predict frame, shall provide the
following functionalities:
- Load a trained model in the pipeline for prediction.
- Load a previously trained experiment from the local disk.
- Perform a prediction based on user input.
- Perform a prediction from an unknown sample.

Requirements
GUI Framework: PyQt5 - A Python binding for Qt5.
Visualization: matplotlib
HTTP requests: requests
Documentation: sphinx
Timeline
Community Bonding - (May 5th to May 30th, 2017)
During the community bonding phase, I would like to get familiar
with the organization, gather and understand as much of the necessary
information and references as possible (in particular, quality
control techniques) and build working prototypes for the same. A
complete wireframe will be laid out during this phase with modularity
in mind. I would also like to prioritize the modules to be implemented
based on inputs and feedback received from mentors. I would further
like to discuss with the mentors the preferred way of reviewing the
latest development and the choice of platform for simultaneous
documentation, continuous integration and code coverage. Finally, I
would like to re-factor (and probably document) the existing
command-line tool, CancerDiscover, and implement functional front-end
prototypes smoothly integrated with the back-end command-line scripts.

Week 1 - 4 - (May 30th to June 30th, 2017)

I shall dedicate Week 1 to Week 3 to implementing the first 3 tabs
in the pipeline (Source, Preview and Preprocess), with one week
dedicated to each tab. This would include candis.DataLoader,
candis.Downloader, etc. At the same time, a good part of the
command-line tool will be re-factored so that it integrates smoothly
with candis and also works independently through an argument parser.
Week 4 will be dedicated to unit testing (using pytest) the
development so far while simultaneously documenting the API for both
users and developers. By the first evaluation period, there will be
a functional pipeline for the first 3 stages.
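
An argument parser for the re-factored command-line tool might look roughly
like this. Every option name and default below is hypothetical and chosen
only for illustration. None of them are existing CancerDiscover flags.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All option names and defaults here are hypothetical placeholders.
    parser = argparse.ArgumentParser(prog="candis")
    parser.add_argument("--data-dir", required=True,
                        help="directory containing CEL files")
    parser.add_argument("--normalization", default="quantile",
                        choices=["scale", "lowess", "quantile"])
    parser.add_argument("--folds", type=int, default=10,
                        help="fold count for cross-validation")
    return parser


args = build_parser().parse_args(
    ["--data-dir", "data/", "--normalization", "scale"])
print(args.data_dir, args.normalization, args.folds)
```

Sharing one parser definition would let the GUI and the standalone tool accept
the same options.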

Possible Hurdles
- Unknown dependency issues that may arise in the first stage of
  development.
- Visual embedding integrations from WEKA's ARFF to pandas
  and matplotlib.
Week 5 - 8 - (July 1st to July 28th, 2017)
Weeks 5, 6 and 7 will be dedicated to implementing the next 2 tabs in
the pipeline (Model and Analysis). The two phases depend on each
other, which requires developing them simultaneously. Weeks 5 and 6
will be dedicated to one tab each, during which candis.Experiment,
an object which abstracts all necessary information related to each
experiment, will be implemented. Week 7 will be dedicated to
integrating candis.Experiment instances with the visual embeddings
proposed for the two tabs. Appropriate unit tests and documentation
will be done during Week 8 (July 24th to July 28th). At the same
time, possible bugs will be handled and feedback from mentors will be
considered for improvements to the latest development. A working but
unstable bleeding-edge prototype will be ready by the completion of
the first 8 weeks.

Possible Hurdles
- CancerDiscover integration with candis's Model and Analysis frames.
Week 9 - 12 - (July 28th to August 28th, 2017)
Weeks 9 and 10 will be dedicated to implementing unit tests for the
previous 2 tabs and to the final tab in the pipeline - Predict. Week
11 will be used for experimenting with candis, rectifying possible
bugs and providing more enhancements. Reducing dependencies,
cross-platform testing and code coverage will also be done during
Week 11, as will completing full-fledged user and developer guides
(accessible online via readthedocs.io). Production-ready software
(versioned 1.0) will be released for the final evaluation.

4 Commitments
When do your classes and exams finish?

My classes for my final semester end on 18th April, 2017. However,
my examinations extend till the end of April. I shall be free full-time
thereafter.
Do you have any other school-related activities scheduled during
the coding period?

None.
Do you have a full or part-time job or internship this summer?

No, which is why I'm very keen on spending my summer contributing to
the research and projects happening at UNL under the GSoC banner.
How many hours per week do you have available for GSoC?

A commitment of a minimum of 40 hours a week will be assured on my
behalf.

5 Additional Information
Résumé: Achilles Rasquinha's Résumé
Contact: +91 9821410251

I'm eager and excited to work with the research team at Helikar Lab and I
hope you will consider me fit for the project to be implemented under the GSoC
banner.
