Center for Computational Biology, IIIT New Delhi - Machine Learning Engineer
December 2018 - December 2019 | New Delhi, India
The Drug Discovery Adventure
Imagine being a detective, but instead of solving crimes, you’re solving the mysteries of how proteins and small molecules interact - and your discoveries could lead to new treatments for arthritis, cancer, dementia, and depression. That’s exactly what I got to do during this incredible year at IIIT New Delhi’s Center for Computational Biology.
The Protein-Ligand Prediction Breakthrough
The crowning achievement of this role was developing a machine learning system that predicted protein-ligand interactions with 95.99% accuracy. Using a combination of SVM, Random Forest, and MLP models, I built a system that could essentially predict how well a potential drug molecule would bind to its target protein.
The Art of Feature Engineering
One of the most fascinating aspects was engineering features from binary and PSSM profiles of non-redundant protein sequences. Think of it as teaching machines to read the “language” of proteins - each amino acid sequence tells a story, and we trained our models to understand these biological narratives.
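To make the binary profile idea concrete, here is a minimal sketch of one common way to do it: one-hot encoding a window of residues around each position. The window size, the padding symbol `X`, and the function names are assumptions for illustration, not the actual SAMbinder code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # hypothetical padding symbol for positions past the sequence ends


def binary_profile(window):
    """One-hot encode a residue window into a flat list of 0/1 values."""
    features = []
    for residue in window:
        # Padding or unknown residues become an all-zero block of 20 values.
        features.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return features


def sliding_windows(sequence, size=5):
    """Return one window per residue, centred on it and padded at both ends."""
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    return [padded[i:i + size] for i in range(len(sequence))]


# Toy sequence; each residue yields a vector of length size * 20.
windows = sliding_windows("MKTAY", size=3)
vectors = [binary_profile(w) for w in windows]
```

Each vector can then be fed to a classifier as the feature row for the window's central residue.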
Building the SAMbinder Web Server
Beyond building accurate models, I deployed an open-source web server that automated the entire process of feature generation and prediction. The server focused on predicting binding of the cofactor SAM (S-adenosyl-L-methionine), which is relevant to developing treatments for serious diseases like arthritis, cancer, dementia, and depression.
The 96% Accuracy Milestone
The SAMbinder system achieved 96% accuracy in predicting SAM binding sites, a strong result for the drug discovery community. This wasn’t just an academic exercise - predictions like these could help pharmaceutical companies decide where to focus their drug development efforts.
Data Management Mastery
Working with biological data taught me advanced SQL and pandas techniques for data management and munging. Biological datasets are notoriously messy and complex, requiring sophisticated preprocessing and validation pipelines to ensure reliable results.
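As a rough illustration of the kind of preprocessing and validation step described here, the sketch below normalises sequences, drops entries with non-standard residue codes, and removes exact duplicates with a SQL `GROUP BY`. It uses Python's built-in `sqlite3` module so it runs anywhere; the table name, column names, and sample records are made up, and real non-redundancy filtering usually relies on sequence-identity clustering rather than exact matching.

```python
import sqlite3

VALID = set("ACDEFGHIKLMNPQRSTVWY")  # standard one-letter residue codes


def clean_records(records):
    """Normalise, validate, and de-duplicate (id, sequence) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE proteins (id TEXT, sequence TEXT)")
    for protein_id, sequence in records:
        seq = sequence.strip().upper()
        # Skip empty entries and sequences with non-standard residue codes.
        if seq and set(seq) <= VALID:
            conn.execute("INSERT INTO proteins VALUES (?, ?)", (protein_id, seq))
    # Keep one id per distinct sequence (the alphabetically smallest id).
    rows = conn.execute(
        "SELECT MIN(id), sequence FROM proteins "
        "GROUP BY sequence ORDER BY MIN(id)"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical raw input with a duplicate, an invalid entry, and an empty one.
raw = [("P1", "mkta"), ("P2", "MKTA "), ("P3", "MKZ1"), ("P4", ""), ("P5", "ACDE")]
cleaned = clean_records(raw)
```

The same filter-then-deduplicate pattern carries over to pandas with a boolean mask followed by `drop_duplicates`.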
The Open Science Mission
Creating open-source tools was a core part of this role. The web server and its accompanying executables made cutting-edge protein analysis tools available to researchers worldwide, democratizing access to advanced computational biology.
The Machine Learning Ensemble
Using multiple algorithms (SVM, Random Forest, MLP) and combining their predictions taught me the power of ensemble methods. Each algorithm brought its own strengths, and by combining them intelligently, we achieved accuracy levels that individual models couldn’t reach.
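One simple way to combine several classifiers is soft voting, where each model's predicted probability for the positive class is averaged. The sketch below assumes that scheme purely for illustration; the combination rule actually used in the project is not described here, and the probabilities are made-up numbers standing in for SVM, Random Forest, and MLP outputs.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model positive-class probabilities for each sample."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


def to_labels(scores, threshold=0.5):
    """Turn averaged scores into 0/1 class labels."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up probabilities for three samples from three hypothetical models.
svm_p = [0.9, 0.4, 0.2]
rf_p = [0.7, 0.6, 0.1]
mlp_p = [0.8, 0.3, 0.6]

combined = soft_vote([svm_p, rf_p, mlp_p])
labels = to_labels(combined)
```

Passing `weights` lets a stronger model count for more, which is one way an ensemble can outperform any single member.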
Biological Sequence Analysis
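Working with binary and PSSM profiles was like learning a new language - the language of evolutionary biology. A PSSM (Position-Specific Scoring Matrix) has one row per sequence position and one column per amino acid; each score reflects how strongly that residue is favored at that position across related sequences. The profile therefore captures evolutionary information about a protein sequence and provides rich features for machine learning models.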
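The structure of a PSSM is easiest to see in a toy version. The sketch below builds a log-odds matrix from a tiny hand-made alignment, assuming a uniform background frequency and a fixed pseudocount. In practice PSSMs are typically generated by tools such as PSI-BLAST against large sequence databases, so this is only an illustration of the idea, and `toy_pssm` is a hypothetical helper.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_pssm(aligned, pseudocount=1.0):
    """Build a length x 20 log-odds matrix from equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background, an assumption
    matrix = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        counts = {aa: column.count(aa) for aa in AMINO_ACIDS}
        # Gaps and non-standard symbols are simply not counted.
        denom = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        row = [
            math.log2(((counts[aa] + pseudocount) / denom) / background)
            for aa in AMINO_ACIDS
        ]
        matrix.append(row)
    return matrix


# Made-up alignment of three short sequences.
alignment = ["MKTA", "MKSA", "MRTA"]
pssm = toy_pssm(alignment)
```

Positive scores mark residues seen more often than the background would predict, negative scores mark residues that are rare or absent at that position.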
Cross-Disciplinary Collaboration
This role required constant collaboration with biologists, chemists, and medical researchers. I learned to translate between the worlds of computer science and life sciences, making complex ML concepts accessible to domain experts.
The Drug Discovery Impact
Every model we built, every web server we deployed, potentially contributed to the discovery of new medications. There’s something deeply satisfying about knowing that your code could eventually help develop treatments for diseases that affect millions of people.
Publications and Recognition
This work resulted in multiple publications and has been cited by researchers worldwide. The combination of high accuracy and practical accessibility made our tools valuable resources for the computational biology community.
Key Achievements
🧬 Scientific Breakthrough
- 95.99% accuracy in protein-ligand interaction prediction
- 96% accuracy in SAM binding site prediction
- Novel feature engineering based on evolutionary profiles
- Multi-algorithm ensemble for robust predictions
🌐 Open Science Impact
- Open-source web server for global research community
- Automated feature generation tools
- Accessible executables for non-technical users
- Democratized access to advanced computational tools
💊 Drug Discovery Applications
- SAM binding prediction for therapeutic development
- Target identification for arthritis, cancer, dementia, depression
- Pharmaceutical research acceleration
- Clinical relevance validation
🔬 Technical Innovation
- Binary/PSSM profiling for sequence analysis
- Ensemble machine learning methods
- Biological data preprocessing pipelines
- Cross-platform deployment strategies
📚 Research Contribution
- Multiple publications in peer-reviewed journals
- Code repositories shared with research community
- Methodology documentation for reproducibility
- Benchmarking datasets for future research
Technical Deep Dive
Machine Learning Approaches
- Support Vector Machines (SVM): For complex boundary detection
- Random Forest: For feature importance and ensemble learning
- Multi-Layer Perceptron (MLP): For non-linear pattern recognition
- Ensemble Methods: Combining predictions for optimal accuracy
Bioinformatics Techniques
- PSSM Profiles: Evolutionary information extraction
- Binary Encoding: Sequence representation for ML
- Feature Engineering: Domain-specific attribute creation
- Cross-validation: Robust model evaluation
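For readers unfamiliar with cross-validation, here is a minimal k-fold splitter in plain Python. The fold count, seed, and function name are illustrative assumptions rather than the evaluation protocol used in this work.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten samples split into five folds; each sample is tested exactly once.
splits = list(k_fold_indices(10, k=5))
```

A model is trained on each `train` list and scored on the matching `test` list, and the scores are averaged across folds.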
Software Engineering
- Web Server Development: User-friendly interfaces
- Database Management: SQL for biological data
- Data Processing: Pandas for complex data manipulation
- Version Control: Collaborative development practices
This year-long journey was like being a pioneer on a new frontier where biology meets artificial intelligence - every day brought new challenges and the possibility of discoveries that could change how we understand and treat diseases.