Abhishek Shakya

CTO

Noida, Uttar Pradesh, India · 6 yrs 6 mos experience
AI ML Practitioner · AI Enabled

Key Highlights

  • 8+ years of experience in Biopharma domain.
  • Led teams of 20+ in tech and bioinformatics.
  • Expert in designing scalable data pipelines.
Stackforce AI infers this person is a Bioinformatics and Data Engineering expert with a focus on AI-driven healthcare solutions.

Skills

Core Skills

Data Engineering · AI Solutions · Predictive Modeling · Data Integration · Patient Cohort Analysis · NLP · Data Pipeline Standardization · Data Transformation · Machine Learning · Bioinformatics · Genomic Data Analysis · Statistical Analysis · Technical Support · Data Management · Dashboard Development · Bioinformatics Pipeline Development · Cloud Computing · Data Automation · Information Retrieval · Process Automation · Robotic Process Automation · Document Analysis · Predictive Analytics · Maintenance Management · AI Development · Speech Processing · Fraud Detection · Structural Engineering · Seismic Safety

Other Skills

Python · ETL pipelines · AI-driven solutions · Documentation · Team onboarding · EHR data integration · RapidFuzz · T-SNE · XGBoost · BERT · LSTM · Medspacy · Scispacy · Data pipeline design · EHR data transformation

About

I have 8+ years of experience in the industry, including 5+ years dedicated to the Biopharma domain. I have worked as an individual contributor and have headed teams of 20+ tech developers, bioinformaticians, and wet lab scientists. I have worked with large pharmaceutical customers (billion-dollar-plus revenue) to understand their challenges and areas of focus. I have a fair knowledge of statistics, machine learning, and data exploration. I have designed workflows and pipelines using Python, R, Java, WDL, Bash, and other programming languages. Over the years, I have gained a solid skill set in bioinformatics and an understanding of the schemas, standards, data sources, and tool chains used in NGS and clinical trials. I have worked in the domains of genomics, transcriptomics, proteomics, biochemistry, metabolomics, and cell biology. I maintain timely and thorough documentation and complete exploratory notebooks. I communicate study progress and results effectively, adopting agile practices and offering subject matter expertise. In my spare time, I independently research and keep up with trends, best practices, and new guidelines in bioinformatics approaches via conference attendance, consultation with SMEs, and literature reviews.

Experience

6 yrs 6 mos
Total Experience
2 yrs 2 mos
Average Tenure
--
Current Experience

Autocash.ai

Head of Data Engineering and Platforms

Jan 2025 - Present · 1 yr 5 mos · India · Remote

  • I led financial technology and AI initiatives in the Autocash platform, focusing on automation, scalable architecture, and AI-driven solutions. My work included designing ETL pipelines, financial classification models, and integrating large language models (LLMs) to transform manual processes into efficient workflows. I emphasized documentation, knowledge sharing, and team onboarding for sustainable growth.
  • Problem:
  • Legacy systems and manual processes caused inefficiencies in cashflow categorization and reconciliation.
  • Fragmented ETL pipelines and inconsistent engineering delayed scalability and client onboarding.
  • Limited documentation and knowledge transfer led to repeated support queries.
  • Effort:
  • Designed a three-layer architecture for data pipelines, ensuring scalability and reliability.
  • Developed financial classification models using domain embeddings, BERT, FastText, and fuzzy logic for improved categorization.
  • Integrated LLMs to automate rule creation, categorization, and natural language querying.
  • Created documentation frameworks, including knowledge bases and onboarding materials, to support knowledge transfer.
  • Mentored technical staff to build a strong developer pipeline.
  • Conducted PoCs for features like cashflow categorization and AR tracking.
  • Evaluated tools such as Alteryx, Windmill, and iPaaS to optimize ETL processes.
  • Implemented AWS cost optimization strategies, including resource management and RDS scaling.
  • Developed architecture diagrams and design reviews for project alignment.
  • Impact:
  • Reduced manual processing with AI-based categorization, saving time and resources.
  • Accelerated client onboarding with standardized ETL processes.
  • Enhanced team productivity through documentation and mentorship.
  • Increased stakeholder confidence via transparent reporting.
  • Improved platform scalability and compliance by addressing technical debt.
Python · ETL pipelines · AI-driven solutions · Documentation · Team onboarding · Data Engineering · +1
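
The fuzzy-matching part of the categorization work described in this role could look roughly like the sketch below. It is only an illustration: it uses Python's standard-library `difflib` as a stand-in rather than the BERT, FastText, or fuzzy-logic components named above, and the keyword table, category names, and `cutoff` value are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical keyword-to-category table; real rules would come from the platform.
KEYWORD_CATEGORIES = {
    "amazon web services": "Cloud infrastructure",
    "payroll": "Salaries",
    "office rent": "Rent",
}


def categorize(description, cutoff=0.8):
    """Return the category of the closest keyword, or None when nothing is close enough."""
    matches = get_close_matches(
        description.lower().strip(), list(KEYWORD_CATEGORIES), n=1, cutoff=cutoff
    )
    return KEYWORD_CATEGORIES[matches[0]] if matches else None
```

A misspelled transaction description such as `"AMAZON WEB SERVCES"` would still resolve to its keyword's category, while text that is not close to any keyword returns `None` and could be routed to a model or to manual review.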

Nobias Therapeutics

2 roles

Software Engineer - Platforms

Mar 2024 - Oct 2024 · 7 mos · India · Remote

  • I have led multiple projects enhancing patient cohort analysis, predictive modeling, and data pipeline standardization. My experience spans NLP, EHR data integration, and domain-specific embeddings. Leveraging RapidFuzz, T-SNE, XGBoost, and BERT-based architectures, I delivered solutions transforming raw data into insights. My work includes developing documentation and reporting frameworks, ensuring replicability and clarity for stakeholders.
  • Project 1: Autism Patient Cohort Analysis and Modeling
  • Problem:
  • Fragmented patient records, inconsistent coding, and varied clinical descriptions in EHR data challenged autism phenotype analysis.
  • Needed to stratify idiopathic autism patients to identify biomarkers and treatment targets.
  • Effort:
  • Built patient cohorts (case, control, test) using LSTM and BERT models.
  • Developed unsupervised pipelines with T-SNE and clustering for autism group visualization.
  • Used RapidFuzz and FastText for keyword matching, clustering, and embedding.
  • Created word clouds, overlap analyses, and top-N word frequency tables.
  • Applied Medspacy, Scispacy, and Ascle for data cleaning and text normalization.
  • Evaluated domain-specific embedding models (MedBERT, Clinical-T5, BioClinicalBERT).
  • Documented cohort processes, data-switching, and feature engineering.
  • Impact:
  • Improved accuracy and interpretability of autism models.
  • Reduced overfitting with weight decay and attention normalization.
  • Delivered actionable visualizations for management.
NLP · EHR data integration · Predictive modeling · Data pipeline standardization · RapidFuzz · T-SNE · +4

Software Engineer - Platforms

Feb 2023 - Feb 2024 · 1 yr · India · Remote

  • Project 2: Data Pipeline Standardization and EHR Data Transformation
  • Problem:
  • Encountered fragmented EHR datasets across different sources, leading to inconsistency in patient records and increased data bias.
  • Required transformation pipelines for efficient data integration and embedding generation for clinical NLP tasks.
  • Effort:
  • Designed an unsupervised data pipeline with modules for data cleaning, preprocessing, and table-level feature engineering.
  • Implemented column-level feature extraction and merged similar columns to reduce redundancy.
  • Incorporated spell-checking, stopword filtering, and subword handling for embedding generation.
  • Integrated UMAP and T-SNE for dimensionality reduction and embedding visualization.
  • Applied clustering algorithms (e.g., K-Medoids, hierarchical) on distance matrices for patient grouping.
  • Impact:
  • Streamlined EHR data preparation for downstream NLP tasks, reducing manual cleaning efforts.
  • Enabled consistent embedding generation across various clinical data types, improving model performance.
  • Facilitated faster iteration on cohort analysis and embedding-based tasks through parallelized computations and standardized pipelines.
Data pipeline design · EHR data transformation · Clustering algorithms · UMAP · T-SNE · Data Pipeline Standardization · +1

Aganitha

8 roles

Senior Data Scientist 2

Jan 2022 - Nov 2022 · 10 mos

  • Finding highly enriched AAV templates using Machine learning
  • Problem:
  • Find highly enriched group of AAV template sequences
  • Solution:
  • Tried various featurization techniques for embedding generation, such as one-hot encoding, identity-based encoding, and property-based encoding (BLOSUM62)
  • Collaborated with scientists to interpret clustering plots for AAV sequences using T-SNE and UMAP
  • Trained a random forest model to find patterns in AAV sequences crossing the blood-brain barrier (BBB) across tissues
  • Analyzed AAV sequence interaction with the LY6A protein (in the BBB) using the AlphaFold 2 multimer model
  • Impact:
  • Classify capsid families based on amino acid similarities by identity and properties
  • Scientists able to determine individual capsid variants based on tissue biodistribution profile
  • Analyzing highly enriched sequences for AAV capsid engineering
  • Problem:
  • Natural serotypes displaying finite set of tropisms in Gene therapy
  • AAV sequences getting expressed in off-target tissues
  • Solution:
  • Analyzed mutated library pool and identified capsid variants via Illumina HiSeq next-generation sequencing (NGS)
  • Implemented a collapsing algorithm for denoising clusters of sequences affected by PCR bias (duplicates)
  • Generated Cytoscape-based clustering plots and amino acid frequency bias heatmaps using Python
  • Assembled interactive dashboard with thresholds for comparison of tissue biodistribution across experiments
  • Impact:
  • Scientists able to study enrichment of AAVs across tissues and confidence in NGS data
  • Identify templates with improved tropism, high diversity and immune evasive AAV capsid libraries
  • Enabled Scientists to screen libraries using cell-based assays and in vivo models to select capsids with desired properties
Machine learning · Random forest model · T-SNE · UMAP · Alphafold-2 · Machine Learning · +1
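
As a rough illustration of the collapsing step mentioned in this role, the sketch below counts exact duplicate reads and then folds rare near-duplicates into an abundant parent sequence. The profile does not describe the actual algorithm, so the Hamming-distance rule and the `max_mismatches` and `min_ratio` thresholds are assumptions made for the example.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_reads(reads, max_mismatches=1, min_ratio=5):
    """Collapse exact duplicates, then fold rare near-duplicates into abundant parents.

    max_mismatches and min_ratio are hypothetical thresholds chosen for illustration.
    """
    counts = Counter(reads)
    # Visit sequences from most to least abundant so likely parents are kept first.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    collapsed = {}
    for seq, n in ordered:
        parent = None
        for kept, kept_n in collapsed.items():
            if (len(kept) == len(seq)
                    and hamming(kept, seq) <= max_mismatches
                    and kept_n >= min_ratio * n):
                parent = kept
                break
        if parent is None:
            collapsed[seq] = n          # a distinct sequence in its own right
        else:
            collapsed[parent] += n      # treated as an error copy of the parent
    return collapsed
```

Requiring the parent to be several times more abundant than the candidate keeps genuinely distinct variants that differ by a single residue from being merged when both are well supported.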

Senior Data Scientist 1

Oct 2021 - Dec 2021 · 2 mos

  • Genome-Wide Association Studies (GWAS) regression analysis
  • Problem:
  • Non replicable genetic association studies with few well-validated genetic risk factors peeking above the noise
  • Challenges in highlighting molecular pathways and finding potential targets for therapy
  • Difficulties in understanding variations affecting a person’s response to certain drugs and gene-environment interactions
  • Solution:
  • Engineered a custom library for processing UKBB 200K and 450K WES data using Luigi, Hail, and Spark
  • Added QC steps such as MAF and monomorphic variant filters and ancestry estimation for bias and outlier removal
  • Added visualization charts such as PCA, Manhattan, and QQ plots, along with QC plots such as variant and sample call rate
  • Deployed the NGS pipeline using CI/CD on on-premises and AWS clusters for scalable computing
  • Migrated the pipeline to DNAnexus for processing the UKBB 450K WES data release with gene burden tests
  • Impact:
  • Brought down entire pipeline time (data onboarding, processing, results generation, etc.) from 3 months to ~3 weeks
  • Scientists able to identify genes or novel SNPs associated with a particular disease or trait
  • Pinpoint genes or markers that may contribute to a person’s risk of developing a certain disease
GWAS · Regression analysis · NGS pipeline · Visualization charts · CI/CD · Genomic Data Analysis · +1
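
The MAF and monomorphic-variant filters listed in this role can be sketched in plain Python as follows. This is not the Luigi/Hail/Spark implementation referred to above; the genotype encoding (alternate-allele counts 0, 1, 2, with `None` for missing calls) and the `min_maf` threshold are assumptions for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic variant; genotypes are alt-allele counts (0, 1, 2) or None."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def qc_filter(variants, min_maf=0.01):
    """Keep variants that are polymorphic and meet the (hypothetical) MAF threshold."""
    kept = {}
    for variant_id, genotypes in variants.items():
        maf = minor_allele_frequency(genotypes)
        if maf is None or maf == 0:
            continue  # no calls at all, or monomorphic in this cohort
        if maf >= min_maf:
            kept[variant_id] = maf
    return kept
```

Variants fixed for either allele have a MAF of zero, so one check covers both the all-reference and all-alternate monomorphic cases.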

Senior Data Scientist 1

Promoted

Apr 2021 - Sep 2021 · 5 mos

  • Dashboard for managing Next Generation Sequencing (NGS) pipelines
  • Problem:
  • Unable to manage exploding NGS data generated from pipelines at terabyte scale
  • Inefficient and non-reproducible analysis pipelines with results not accessible to biologists
  • Complexity required for analysis significantly hindering the overall turnaround time for wet-lab scientists
  • Solution:
  • Crafted framework for managing analysis runs from command line interface and GUI
  • Architected Postgres database schema to capture metadata about runs and experiments
  • Designed a dashboard (GUI) for managing, monitoring, and analyzing runs using React JS
  • Added Access Management support for project teams and managers using integration with Active Directory
  • Impact:
  • Able to plug in and use modern as well as legacy tools without much requirement of programming skills
  • Effectively communicate study progress and results via generated reports and integrated genome browser
  • Building apps on DNAnexus for GxP compliance
  • Problem:
  • Lack of reproducible workflows leading to inaccurate and misrepresentative results
  • Missing GxP compliant tools in DNAnexus environment thereby delaying scientific progress
  • Solution:
  • Constructed reproducible workflows for an RNA-seq pipeline using tools like DESeq2 on DNAnexus
  • Defined user, functional and design requirements specifications along with Approval and Traceability matrix
  • Wrote automated unit and integration test scripts for tools like PRSice-2 for Qualification SOP
  • Impact:
  • Scientists capable of analyzing scientific findings while adhering to regulatory compliance
  • Integrated quality management software tools for robust and automatic GxP compliant cloud software
  • Defect tracking and Change control processes ready to produce documented evidence and consistent tools in practice
NGS data management · Postgres database · Dashboard design · React JS · Data Management · Dashboard Development

Senior Data Scientist 1

Sep 2020 - Dec 2021 · 1 yr 3 mos

  • IT support for Bioinformaticians and Wet lab scientists
  • Problem:
  • Scientists debugging errors and getting stuck in installing software
  • Scientists unable to do computational analysis on large scale due to unavailability of cluster
  • Solution:
  • Provided customer support on technical issues for the client tools and workflows
  • Debugged task failures using Kibana and troubleshot incoming requests from clients
  • Set up the UCSC genome browser for visualizing target sites for alternative splicing
  • Installed software using Spack modules, e.g. RStudio, Bioconductor packages, and Geneious
  • Trained junior scientists, recruited new staff, and worked with external collaborators on data harmonization
  • Wrote developer and end-user documentation to achieve full traceability of processes
  • Impact:
  • Scientists able to focus on higher level goals and analyze results
Technical support · Software installation · Documentation · Technical Support · Bioinformatics

Senior Data Scientist 1

Promoted

Sep 2020 - Mar 2021 · 6 mos

  • Contamination Analysis and Vector Genome Integration site discovery
  • Problem:
  • Contaminants in viral preparations and NGS data leading to immunogenic effects
  • Vector integration into the host genome leading to tumour formation
  • Solution:
  • Created a parallelizable, resumable, and modular pipeline for analyzing terabytes of data on an AWS cluster
  • Added QC steps to cut adapters, check paired-end reads, remove chimeric alignments, etc.
  • Aligned vector sequences against the reference genome using the Burrows-Wheeler Aligner (BWA-MEM)
  • Implemented parallel processing and in-memory computation of data to reduce disk space and time
  • Impact:
  • Precise characterization and estimation of contaminants and their frequency
  • Reduced disk usage of the overall pipeline by ~20% at terabyte scale
  • Decreased runtime of the overall pipeline by ~10% on runs measured in hours
  • Developing in silico RNA-seq pipeline for NGS clinical data
  • Problem:
  • Finding genes of importance contributing to disease helping in drug target discovery
  • Difficulty in understanding the biological differences between healthy and diseased states
  • Solution:
  • Wrote an RNA sequencing pipeline in WDL for UKBB 200K clinical data using the Cromwell workflow manager
  • Used bioinformatics tools such as STAR for alignment, Picard for QC, and HTSeq for read counts
  • Added visualization charts such as volcano plots and PCA charts to interpret genes with the highest fold change
  • Integrated with a Slurm backend for testing on on-premises HPC clusters
  • Migrated pipeline to AWS cloud for elastic scalable computing on demand
  • Impact:
  • Able to perform differential gene expression analysis (DGEA) to find genes relevant to disease treatment
  • Scientists unblocked to discover quantitative changes in expression levels between experimental groups
RNA-seq pipeline · WDL · Bioinformatics tools · AWS cloud · Bioinformatics Pipeline Development · Cloud Computing

Data Scientist 2

Promoted

Jun 2019 - Aug 2020 · 1 yr 2 mos

  • Competitive Intelligence platform for Biopharma supply chain
  • Automated information extraction for publicly traded Contract Development and Manufacturing Organizations (CDMOs)
  • Developed assets in form of pricing, suppliers, patents, applications, financial statements etc.
  • Chemical Reaction Extractor for training data generation
  • Built a human-guided chemical reaction extractor tool for extracting data from 1000+ chemical reactions
  • Stored data in SMILES format for training machine learning model for predicting yield of product
  • Information Retrieval System using UMLS database
  • Built a Jupyter notebook for querying diseases from the UMLS database
  • Designed hierarchical plots for identifying hierarchies among multiple ontologies and taxonomies
Information extraction · Data automation · Chemical reaction extraction · Data Automation · Information Retrieval

Data Scientist

Jan 2019 - May 2019 · 4 mos

  • Robotic Process Automation (RPA) of IT Business processes
  • Downloading transaction files and automated report generation
  • Wrote automation scripts to download files from older Citrix UI portals for 20 banks
  • Designed exception handling for outlier cases in case of interruptions and bad data
  • Automated 6 IT Business processes for credit card payments using the UiPath automation tool
  • Deployed 6 processes on-site in production using UiPath server mode for scheduling
  • Email classification model for incoming service requests
  • Deployed Email classification Random forest model in production using Flask server
  • Wrote script for account statement generation to be sent to customer
  • Helped in meeting stringent Turn Around Times (TATs) and Service Level Agreements (SLAs)
Robotic Process Automation · UIPath · Automation scripts · Process Automation

Data Scientist

Jun 2018 - Dec 2018 · 6 mos

  • Geological document identification from scanned mining documents
  • Heuristic approach
  • Challenge: Assign location to document based upon content of document
  • Cleaned up special characters and preprocessed the document using OCR to extract text for NLP
  • Created custom markers like longitude and latitude extraction from maps for identifying location
  • Used an n-gram lookup model to match extracted text against a list of reference locations (gazetteer data)
  • Machine learning approach
  • Encoded hierarchical information of location to generate more features
  • Used part-of-speech (POS) tagging and custom markers from the heuristic model as features
  • Trained Learning to Rank model for identifying probable georeferencing candidates
  • Deduplication and Assay data extraction from scanned documents
  • Removed duplicate documents (having different names) based on content matching using CV and NLP
  • Classified documents into departments (such as HR, compliance, and operations) using a bag-of-words NLP model
  • Scraped assay data (amounts of minerals such as copper and aluminium) into a database using a table extraction module
  • Predictive maintenance for large excavation trucks
  • Challenge: Early warning and scheduling of maintenance tasks to prevent downtime
  • Modelled continuous sensor data and batch maintenance reports data for time series model
  • Encoded features from physical and chemical properties from lubricant and coolant
  • Trained ARIMA time series model to predict next breakdown of excavation trucks
Geological document identification · NLP · Machine learning · Document Analysis · Machine Learning
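
The heuristic n-gram lookup against gazetteer data described in this role might be sketched as below. This is a simplified illustration rather than the original implementation: the tokenization rule, the `max_n` limit, and any location names passed in are assumptions.

```python
import re


def word_ngrams(text, max_n=3):
    """Yield lowercase word n-grams from free text, longest n-grams first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def match_locations(text, gazetteer, max_n=3):
    """Return gazetteer entries found in the text, in the order they are matched."""
    lookup = {name.lower(): name for name in gazetteer}
    found = []
    for gram in word_ngrams(text, max_n):
        name = lookup.get(gram)
        if name is not None and name not in found:
            found.append(name)
    return found
```

Checking longer n-grams first means multi-word place names are reported ahead of any single-word names they contain. Exact lookup like this will miss OCR misspellings, which is one reason a learned ranking model over candidate locations is a natural follow-up.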

Quantiphi, Inc.

Decision Science Analyst

Jul 2017 - May 2018 · 10 mos · Mumbai Area, India

  • At Quantiphi, I was part of a product team named 'AthenasOwl'. Our agenda was to automate content tagging and metadata generation using AI to generate game-changing insights.
  • Speaker Diarization
  • “Who Spoke When” – Speech and Speaker segmentation
  • Dataset: 10 hours Game of Thrones audio; Bi-LSTM network
  • Achieved train accuracy = 92.9% and test DER = 29.3%
  • Logo Recognition
  • Trained CNN model for brands like Nike, Adidas, Coke etc.
  • Dataset: 9000 images from the BelgaLogos and Logo32 datasets
  • Achieved train accuracy = 98.5% and test accuracy = 96.3%
  • Emotion Recognition
  • Trained CNN model for 7 categories of human emotions
  • Dataset: 28709 images of FER2013 and RaFD Kaggle
  • Achieved test accuracy = 87% for Happy and Sad classes
  • Character Recognition
  • Automated face recognition pipeline to label primary clusters
  • Pretrained model: 29-layer Resnet CNN with Batch Norm.
  • Classifier: 4-layer DNN for classifying 128-D feature vectors
  • Achieved test accuracy = 96.5% on F.R.I.E.N.D.S sitcom
  • Team projects: Smoking, NSFW, Locale and Object detection;
  • Commentary transcription and High point detection in matches
Predictive maintenance · Time series modeling · Predictive Analytics · Maintenance Management

Allizhealth

Data Analyst

May 2016 - Jun 2016 · 1 mo · Pune, Maharashtra, India · On-site

  • Job Summary:
  • Developed and deployed advanced insurance claim forecasting models and fraud detection systems, while also building mental health analysis tools. Successfully improved prediction accuracy, identified fraudulent cases, and enhanced the emotional well-being analysis capability for product deployment.
  • Job Description:
  • PROBLEM: High claim ratio diseases were driving insurance costs.
  • EFFORT: Built a predictive model using clustering and Vector-Autoregression in R to forecast claims for top ten diseases.
  • IMPACT: Achieved 85% confidence interval accuracy, enabling better risk management for SecureNow insurance company.
  • PROBLEM: Need to determine optimal lag order for time series modeling.
  • EFFORT: Executed Multivariate Granger-Causality tests to determine lag order with the lowest AIC.
  • IMPACT: Improved model precision and forecasting reliability.
  • PROBLEM: Risk of fraudulent claims impacting insurance payouts.
  • EFFORT: Designed an outlier detector to identify fraudulent cases and detected appendicitis fraud.
  • IMPACT: Enhanced fraud detection, supporting real-world implementation by SecureNow.
  • PROBLEM: Need for a scalable mental health analysis platform.
  • EFFORT: Built an emotional stroop test portal in PHP and MySQL to detect depression and anxiety using lexicon features.
  • IMPACT: Developed a deployable tool to screen for emotional states, ready for company-wide rollout.
  • PROBLEM: Lack of robust emotional state insights among employees.
  • EFFORT: Surveyed 20 employees for 7 days and performed Analysis of Variance for pattern recognition.
  • IMPACT: Created reliable visualizations of mood swings and verified accuracy with cross-validation.
Content tagging · AI · Speech recognition · AI Development · Speech Processing

Technip

EPC Trainee

May 2015 - Jul 2015 · 2 mos · Noida, Uttar Pradesh, India · On-site

  • Job Summary:
  • Performed advanced seismic and wind analysis of critical refinery structures, optimized base plate design for load efficiency, and implemented cost-effective design improvements. Delivered significant reductions in structural thickness and base pressure, enhancing safety, performance, and cost savings.
  • Job Description
  • PROBLEM: Ensuring seismic stability of critical structures under various loading conditions.
  • TASK: Conducted seismic analysis of slug catcher and column base plates per IS codes, using CQC method for static and dynamic earthquake evaluation.
  • IMPACT: Enhanced seismic safety and compliance with national standards.
  • PROBLEM: Refinery heater wall deformation under wind loading.
  • TASK: Analyzed wind-induced deformations and determined bearing length using Blodgett’s method.
  • IMPACT: Ensured structural integrity and reduced risk of wind-related damage.
  • PROBLEM: Outdated design practices in column base plates leading to inefficiencies.
  • TASK: Compared old and new IS 800 codes, and remodeled base plate design using FEM.
  • IMPACT: Achieved 86% reduction in plate thickness, resulting in material and cost savings.
  • PROBLEM: Excessive base pressure affecting structural performance.
  • TASK: Installed flange stiffeners and optimized load distribution.
  • IMPACT: Reduced base pressure by 13%, enhancing overall load capacity and efficiency.
Predictive modeling · Clustering · Time series modeling · Predictive Analytics · Fraud Detection

Jaypee greens - suncourt tower

Management Trainee

May 2014 - Jun 2014 · 1 mo · Noida Area, India · On-site

  • Job Summary:
  • Analyzed construction details, conducted on-site inspections, and ensured adherence to design specifications, leading to improved construction quality and cost-effective solutions.
  • Job Description:
  • PROBLEM: Complex construction details required thorough technical analysis.
  • TASK: Reviewed Aluminium framework Mivan shuttering, Conventional shuttering, Reinforced Concrete, Tremix flooring, and Frost glass interiors using project maps.
  • IMPACT: Improved construction quality and design efficiency.
  • PROBLEM: Needed to align on-site work with design specifications.
  • TASK: Assisted the site engineer during inspections to ensure quality and compliance.
  • IMPACT: Reduced discrepancies and rework, enhancing project delivery.
  • PROBLEM: Coordination between design documentation and site execution.
  • TASK: Collaborated with teams to optimize material usage and maintain design integrity.
  • IMPACT: Strengthened collaboration and reduced project delays.
Seismic analysis · Structural design · Structural Engineering · Seismic Safety

Education

Indian Institute of Technology, Kharagpur

Btech Mtech dual degree — Civil Engineering

Jan 2012 - Jan 2017

Saint Thomas Senior Secondary School

Class XII : Indian School Certificate — Science

Jan 2010 - Jan 2011

Saint Thomas Senior Secondary School

Class X : Indian Certificate of Secondary Education — Science

Jan 2008 - Jan 2009
