Center for Computational Biology, IIIT New Delhi - Machine Learning Engineer

December 2018 - December 2019 | New Delhi, India

The Drug Discovery Adventure

Imagine being a detective, but instead of solving crimes, you’re solving the mysteries of how proteins and small molecules interact - and your discoveries could lead to new treatments for arthritis, cancer, dementia, and depression. That’s exactly what I got to do during this incredible year at IIIT New Delhi’s Center for Computational Biology.

The Protein-Ligand Prediction Breakthrough

The crowning achievement of this role was developing a machine learning system that predicted protein-ligand interactions with 95.99% accuracy. Using a combination of SVM, Random Forest, and MLP algorithms, I built a system that could essentially predict how well a potential drug molecule would bind to its target protein.

The Art of Feature Engineering

One of the most fascinating aspects was engineering features based on Binary/PSSM profiling of non-redundant protein sequences. Think of it as teaching machines to read the “language” of proteins - each amino acid sequence tells a story, and we trained our models to understand these biological narratives.
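
To make the "binary profile" idea concrete, here is a minimal sketch of one common way to build it: each residue in a sliding window becomes a 20-bit one-hot vector. It is an illustration, not the project's actual code. The window size, the padding symbol, and the function names are all assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # assumed padding symbol for window positions past the sequence ends


def binary_profile(window):
    """One-hot encode a residue window: 20 bits per position, all zeros for padding."""
    vector = []
    for residue in window:
        bits = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:  # unknown or padding residues stay all-zero
            bits[index] = 1
        vector.extend(bits)
    return vector


def residue_windows(sequence, size=5):
    """Yield one window centred on each residue (size should be odd)."""
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    for i in range(len(sequence)):
        yield padded[i:i + size]


# Made-up example sequence, one feature vector per residue
features = [binary_profile(w) for w in residue_windows("MKTAYIAK", size=5)]
print(len(features), len(features[0]))  # 8 windows, 5 * 20 = 100 values each
```

Each feature vector can then be paired with a per-residue binding label and passed to a classifier.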

Building the SAMbinder Web Server

Beyond building accurate models, I deployed an open-source web server that automated the entire process of feature generation and prediction. The server focused on predicting binding of the cofactor SAM (S-adenosyl-L-methionine), which is crucial for developing treatments for serious diseases like arthritis, cancer, dementia, and depression.

The 96% Accuracy Milestone

The SAMbinder system achieved 96% accuracy in predicting SAM binding sites, which was a significant breakthrough for the drug discovery community. This wasn’t just an academic exercise - these predictions could help pharmaceutical companies identify where to target their drug development efforts.

Data Management Mastery

Working with biological data taught me advanced SQL and pandas techniques for data management and munging. Biological datasets are notoriously messy and complex, requiring sophisticated preprocessing and validation pipelines to ensure reliable results.
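
As a small illustration of that kind of cleanup, the sketch below normalizes raw sequences, rejects any that contain non-standard residue codes, and lets a SQL UNIQUE constraint drop exact duplicates. The table layout, record IDs, and sequences are invented for the example. Exact-duplicate removal is also only a stand-in for true redundancy reduction, which usually clusters sequences by identity.

```python
import sqlite3

VALID = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid(sequence):
    """Accept only non-empty sequences made of the 20 standard residues."""
    return bool(sequence) and set(sequence) <= VALID


# Made-up records showing typical problems in raw data
raw_records = [
    ("P1", "MKTAYIAKQR"),
    ("P2", "mktayiakqr "),  # same sequence, different case and trailing space
    ("P3", "MKTBZ"),        # ambiguous residue codes
    ("P4", "GSHMLEDPVA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteins (id TEXT PRIMARY KEY, sequence TEXT UNIQUE)")
for protein_id, sequence in raw_records:
    cleaned = sequence.strip().upper()
    if is_valid(cleaned):
        # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint
        conn.execute(
            "INSERT OR IGNORE INTO proteins VALUES (?, ?)", (protein_id, cleaned)
        )
conn.commit()

kept = [row[0] for row in conn.execute("SELECT id FROM proteins ORDER BY id")]
print(kept)  # ['P1', 'P4']
```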

The Open Science Mission

Creating open-source tools was a core part of this role. The web server with executables made advanced computational biology accessible to researchers worldwide, democratizing access to cutting-edge protein analysis tools.

The Machine Learning Ensemble

Using multiple algorithms (SVM, Random Forest, MLP) and combining their predictions taught me the power of ensemble methods. Each algorithm brought its own strengths, and by combining them intelligently, we achieved accuracy levels that individual models couldn’t reach.
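
One simple way to combine models is soft voting: average each model's predicted probability and threshold the result. The sketch below assumes that scheme purely for illustration. The probabilities are made up, and the combination rule actually used in the project may have been different.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-model positive-class probabilities, per sample."""
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_samples = len(probabilities[0])
    return [
        sum(w * model[i] for w, model in zip(weights, probabilities)) / total
        for i in range(n_samples)
    ]


# Hypothetical positive-class probabilities for four residues
svm_probs = [0.9, 0.4, 0.2, 0.6]
rf_probs = [0.8, 0.6, 0.1, 0.4]
mlp_probs = [0.7, 0.2, 0.3, 0.8]

averaged = soft_vote([svm_probs, rf_probs, mlp_probs])
labels = [1 if p >= 0.5 else 0 for p in averaged]
print(labels)  # [1, 0, 0, 1]
```

Passing `weights` lets a stronger model count for more. With equal weights, a single confident but wrong model is outvoted when the other two disagree with it.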

Biological Sequence Analysis

Working with Binary/PSSM profiling was like learning a new language - the language of evolutionary biology. A PSSM (Position-Specific Scoring Matrix) scores, for every position in a sequence, how strongly each of the 20 amino acids is favored among related sequences. Those scores capture evolutionary information and provide rich features for machine learning models.
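
Here is a rough sketch of how a PSSM can be turned into per-residue features: squash each score into the range 0 to 1 with a logistic function, then flatten the rows in a window around the residue of interest. The normalization, window size, zero padding, and toy matrix are assumptions for illustration, not a record of the exact pipeline.

```python
import math


def sigmoid(score):
    """Map a raw PSSM score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def pssm_window_features(pssm, center, size=3):
    """Flatten the normalised PSSM rows in a window around one residue."""
    half = size // 2
    n_cols = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(sigmoid(s) for s in pssm[pos])
        else:
            features.extend([0.0] * n_cols)  # zero padding beyond the termini
    return features


# Toy 4-residue matrix with 20 columns of made-up integer scores
toy_pssm = [[(r + c) % 5 - 2 for c in range(20)] for r in range(4)]

features = pssm_window_features(toy_pssm, center=0, size=3)
print(len(features))  # 3 rows * 20 columns = 60
```

In practice the matrix would come from a profile search tool rather than being generated by hand.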

Cross-Disciplinary Collaboration

This role required constant collaboration with biologists, chemists, and medical researchers. I learned to translate between the worlds of computer science and life sciences, making complex ML concepts accessible to domain experts.

The Drug Discovery Impact

Every model we built, every web server we deployed, potentially contributed to the discovery of new medications. There’s something deeply satisfying about knowing that your code could eventually help develop treatments for diseases that affect millions of people.

Publications and Recognition

This work resulted in multiple publications and has been cited by researchers worldwide. The combination of high accuracy and practical accessibility made our tools valuable resources for the computational biology community.

Key Achievements

🧬 Scientific Breakthrough

  • 95.99% accuracy in protein-ligand interaction prediction
  • 96% accuracy in SAM binding site prediction
  • Novel feature engineering based on evolutionary profiles
  • Multi-algorithm ensemble for robust predictions

🌐 Open Science Impact

  • Open-source web server for global research community
  • Automated feature generation tools
  • Accessible executables for non-technical users
  • Democratized access to advanced computational tools

💊 Drug Discovery Applications

  • SAM binding prediction for therapeutic development
  • Target identification for arthritis, cancer, dementia, depression
  • Pharmaceutical research acceleration
  • Clinical relevance validation

🔬 Technical Innovation

  • Binary/PSSM profiling for sequence analysis
  • Ensemble machine learning methods
  • Biological data preprocessing pipelines
  • Cross-platform deployment strategies

📚 Research Contribution

  • Multiple publications in peer-reviewed journals
  • Code repositories shared with research community
  • Methodology documentation for reproducibility
  • Benchmarking datasets for future research

Technical Deep Dive

Machine Learning Approaches

  • Support Vector Machines (SVM): For complex boundary detection
  • Random Forest: For feature importance and ensemble learning
  • Multi-Layer Perceptron (MLP): For non-linear pattern recognition
  • Ensemble Methods: Combining predictions for optimal accuracy

Bioinformatics Techniques

  • PSSM Profiles: Evolutionary information extraction
  • Binary Encoding: Sequence representation for ML
  • Feature Engineering: Domain-specific attribute creation
  • Cross-validation: Robust model evaluation
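
The cross-validation step above can be sketched as a plain k-fold split: shuffle the sample indices, deal them into k folds, and hold each fold out once for testing. The fold count and seed below are assumptions, since the number of folds used in the project is not stated here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


splits = list(k_fold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))  # 8 2 on every fold
```

Averaging `accuracy` over the held-out folds gives a steadier estimate than a single train/test split.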

Software Engineering

  • Web Server Development: User-friendly interfaces
  • Database Management: SQL for biological data
  • Data Processing: Pandas for complex data manipulation
  • Version Control: Collaborative development practices

This year-long journey was like being a pioneer in a new frontier where biology meets artificial intelligence - every day brought new challenges and the possibility of discoveries that could change how we understand and treat diseases.