Machine Learning Engineer - Computational Biology

Center for Computational Biology, IIIT New Delhi - Machine Learning Engineer

December 2018 - December 2019 | New Delhi, India

The Drug Discovery Adventure

Imagine being a detective, but instead of solving crimes, you’re solving the mysteries of how proteins and small molecules interact - and your discoveries could lead to new treatments for arthritis, cancer, dementia, and depression. That’s exactly what I got to do during this incredible year at IIIT New Delhi’s Center for Computational Biology.

The Protein-Ligand Prediction Breakthrough

The crowning achievement of this role was developing a machine learning system that predicted protein-ligand interactions with 95.99% accuracy. Using a combination of SVM, Random Forest, and MLP algorithms, I created a system that could essentially predict how well a potential drug molecule would bind to its target protein.

The Art of Feature Engineering

One of the most fascinating aspects was engineering features from binary and PSSM (position-specific scoring matrix) profiles of non-redundant protein sequences. Think of it as teaching machines to read the “language” of proteins - each amino acid sequence tells a story, and we trained our models to understand these biological narratives.
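
To make the idea of a binary profile concrete, here is a minimal sketch, not the code used in the project. It assumes a common convention: each residue becomes a 21-element 0/1 vector (20 amino acids plus a padding symbol `X`), and each position is described by a fixed-size window centered on it. The function names and the window size are illustrative assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for window positions beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD


def one_hot(residue):
    """Return a 21-element binary vector; unknown residues use the padding slot."""
    index = ALPHABET.find(residue)
    if index == -1:
        index = len(ALPHABET) - 1
    vector = [0] * len(ALPHABET)
    vector[index] = 1
    return vector


def binary_profile(sequence, window=5):
    """Build one flattened binary feature row per residue of the sequence."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = PAD * half + sequence.upper() + PAD * half
    rows = []
    for center in range(len(sequence)):
        pattern = padded[center:center + window]
        row = []
        for residue in pattern:
            row.extend(one_hot(residue))
        rows.append(row)
    return rows


features = binary_profile("MKVLA", window=3)
print(len(features), len(features[0]))
```

Each row has `window * 21` entries with exactly `window` ones, so the classifier sees both the residue and its immediate neighbors.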

Building the SAMbinder Web Server

Beyond just creating accurate models, I deployed an open-source web server that automated the entire process of feature generation and prediction. The server focused on predicting binding of the cofactor S-adenosyl-L-methionine (SAM), which is crucial for developing treatments for serious diseases like arthritis, cancer, dementia, and depression.

The 96% Accuracy Milestone

The SAMbinder system achieved 96% accuracy in predicting SAM binding sites, which was a significant breakthrough for the drug discovery community. This wasn’t just an academic exercise - these predictions could help pharmaceutical companies identify where to target their drug development efforts.

Data Management Mastery

Working with biological data taught me advanced SQL and pandas techniques for data management and munging. Biological datasets are notoriously messy and complex, requiring sophisticated preprocessing and validation pipelines to ensure reliable results.
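
As an illustration of this kind of cleaning step, here is a small sketch using Python's built-in `sqlite3` module. The table name, column names, and validation rule are assumptions made for the example, not the project's actual schema. It also only removes exact duplicate sequences; building a truly non-redundant dataset usually relies on sequence-identity clustering, which is out of scope here.

```python
import sqlite3

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def load_clean_sequences(records):
    """Normalize, validate, and de-duplicate (protein_id, sequence) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proteins (protein_id TEXT, sequence TEXT)")

    # Normalize whitespace and case before validating.
    cleaned = [(pid.strip(), seq.strip().upper()) for pid, seq in records]
    # Keep only non-empty sequences made of the 20 standard amino acids.
    valid = [(pid, seq) for pid, seq in cleaned
             if seq and set(seq) <= VALID_RESIDUES]
    conn.executemany("INSERT INTO proteins VALUES (?, ?)", valid)

    # Collapse identical sequences, keeping the smallest ID as the representative.
    rows = conn.execute(
        "SELECT MIN(protein_id) AS pid, sequence "
        "FROM proteins GROUP BY sequence ORDER BY pid"
    ).fetchall()
    conn.close()
    return rows


raw = [("P1", " mkv "), ("P2", "MKV"), ("P3", "MKX1"), ("P4", "ACD")]
print(load_clean_sequences(raw))
```

The same filtering and de-duplication could be expressed in pandas with boolean masks and `DataFrame.drop_duplicates`; SQL is used here only to keep the sketch dependency-free.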

The Open Science Mission

Creating open-source tools was a core part of this role. The web server with executables made advanced computational biology accessible to researchers worldwide, democratizing access to cutting-edge protein analysis tools.

The Machine Learning Ensemble

Using multiple algorithms (SVM, Random Forest, MLP) and combining their predictions taught me the power of ensemble methods. Each algorithm brought its own strengths, and by combining them intelligently, we achieved accuracy levels that individual models couldn’t reach.
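
One simple way to combine models is soft voting: average the probability each model assigns to the positive class, then apply a threshold. The sketch below is an assumption about how such a combination could look, since the exact scheme used in the project is not described here. The model names, probabilities, weights, and threshold are all illustrative.

```python
def soft_vote(probabilities, weights=None, threshold=0.5):
    """Average per-model positive-class probabilities and threshold the result.

    probabilities: dict mapping a model name to a list of P(binding) values,
    one per sample, with every list in the same sample order.
    """
    names = sorted(probabilities)
    n_samples = len(probabilities[names[0]])
    if any(len(probabilities[name]) != n_samples for name in names):
        raise ValueError("every model must score the same number of samples")
    if weights is None:
        weights = {name: 1.0 for name in names}

    total_weight = sum(weights[name] for name in names)
    averaged = [
        sum(weights[name] * probabilities[name][i] for name in names) / total_weight
        for i in range(n_samples)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


scores = {
    "svm": [0.9, 0.2, 0.6],
    "random_forest": [0.8, 0.4, 0.3],
    "mlp": [0.7, 0.3, 0.5],
}
averaged, labels = soft_vote(scores)
print(labels)
```

With scikit-learn, the `VotingClassifier` class with `voting="soft"` offers the same behavior on fitted estimators; the hand-rolled version above just makes the arithmetic visible.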

Biological Sequence Analysis

Working with Binary/PSSM profiling was like learning a new language - the language of evolutionary biology. PSSM (Position-Specific Scoring Matrix) profiles capture evolutionary information about protein sequences, providing rich features for machine learning models.
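
The sketch below shows one common way to turn a PSSM into model features, under stated assumptions: raw scores are squashed into the range 0 to 1 with a logistic function, and each residue is described by the normalized rows in a window around it, with zero rows as padding at the sequence ends. The normalization, padding choice, and function names are illustrative rather than the project's exact recipe. A real PSSM has 20 columns per residue; the toy matrix here uses 2 to stay readable.

```python
import math


def normalize_pssm(pssm):
    """Map each raw PSSM score to (0, 1) with the logistic function."""
    return [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]


def pssm_window_features(pssm, window=3):
    """Return one flattened window of normalized PSSM rows per residue."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    width = len(pssm[0])
    zero_row = [0.0] * width  # assumed padding for positions past the ends
    normalized = normalize_pssm(pssm)
    padded = [zero_row] * half + normalized + [zero_row] * half
    return [
        [value for row in padded[center:center + window] for value in row]
        for center in range(len(normalized))
    ]


toy_pssm = [[0, 2], [-1, 3], [4, -4]]  # 3 residues x 2 columns (toy example)
features = pssm_window_features(toy_pssm, window=3)
print(len(features), len(features[0]))
```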

Cross-Disciplinary Collaboration

This role required constant collaboration with biologists, chemists, and medical researchers. I learned to translate between the worlds of computer science and life sciences, making complex ML concepts accessible to domain experts.

The Drug Discovery Impact

Every model we built, every web server we deployed, potentially contributed to the discovery of new medications. There’s something deeply satisfying about knowing that your code could eventually help develop treatments for diseases that affect millions of people.

Publications and Recognition

This work resulted in multiple publications and has been cited by researchers worldwide. The combination of high accuracy and practical accessibility made our tools valuable resources for the computational biology community.

Key Achievements

🧬 Scientific Breakthrough

95.99% accuracy in protein-ligand interaction prediction
96% accuracy in SAM binding site prediction
Novel feature engineering based on evolutionary profiles
Multi-algorithm ensemble for robust predictions

🌐 Open Science Impact

Open-source web server for global research community
Automated feature generation tools
Accessible executables for non-technical users
Democratized access to advanced computational tools

💊 Drug Discovery Applications

SAM binding prediction for therapeutic development
Target identification for arthritis, cancer, dementia, depression
Pharmaceutical research acceleration
Clinical relevance validation

🔬 Technical Innovation

Binary/PSSM profiling for sequence analysis
Ensemble machine learning methods
Biological data preprocessing pipelines
Cross-platform deployment strategies

📚 Research Contribution

Multiple publications in peer-reviewed journals
Code repositories shared with research community
Methodology documentation for reproducibility
Benchmarking datasets for future research

Technical Deep Dive

Machine Learning Approaches

Support Vector Machines (SVM): For complex boundary detection
Random Forest: For feature importance and ensemble learning
Multi-Layer Perceptron (MLP): For non-linear pattern recognition
Ensemble Methods: Combining predictions for optimal accuracy

Bioinformatics Techniques

PSSM Profiles: Evolutionary information extraction
Binary Encoding: Sequence representation for ML
Feature Engineering: Domain-specific attribute creation
Cross-validation: Robust model evaluation
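
For the cross-validation item above, a minimal k-fold splitter looks like the following sketch. The number of folds and the seed are placeholder assumptions, since the evaluation protocol is not specified here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[start::k] for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [index
                 for fold_number, fold in enumerate(folds)
                 if fold_number != held_out
                 for index in fold]
        yield train, test


for train, test in k_fold_indices(10, k=5):
    print(len(train), len(test))
```

Every sample is held out exactly once, so each model is always scored on data it did not see during training.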

Software Engineering

Web Server Development: User-friendly interfaces
Database Management: SQL for biological data
Data Processing: Pandas for complex data manipulation
Version Control: Collaborative development practices

This year-long journey was like being a pioneer in a new frontier where biology meets artificial intelligence - every day brought new challenges and the possibility of discoveries that could change how we understand and treat diseases.
