Abhishek Shakya — CTO

I have 8+ years of industry experience, including 5+ years dedicated to the Biopharma domain. I have worked as an individual contributor and have headed teams of 20+ tech developers, bioinformaticians, and wet lab scientists. I have worked with large pharmaceutical customers (billion-dollar-plus revenue) to understand their challenges and areas of focus. I have a fair knowledge of statistics, machine learning, and data exploration. I have designed workflows and pipelines using Python, R, Java, WDL, bash, and other programming languages. Over the years, I have gained a solid skill set in bioinformatics and an understanding of the schemas, standards, data sources, and tool chains used in NGS and clinical trials. I have worked in the domains of Genomics, Transcriptomics, Proteomics, Biochemistry, Metabolomics, and Cell Biology. I maintain timely and thorough documentation and complete exploratory notebooks. I communicate study progress and results effectively, adopting agile practices and offering subject matter expertise. In my spare time, I independently research and keep up with trends, best practices, and new guidelines in bioinformatics approaches via conference attendance, consultation with SMEs, and/or literature reviews.

Stackforce AI infers this person is a Bioinformatics and Data Engineering expert with a focus on AI-driven healthcare solutions.

Location: Noida, Uttar Pradesh, India

Experience: 6 yrs 6 mos

Skills

Data Engineering
AI Solutions
Predictive Modeling
Data Integration
Patient Cohort Analysis
NLP
Data Pipeline Standardization
Data Transformation
Machine Learning
Bioinformatics
Genomic Data Analysis
Statistical Analysis
Technical Support
Data Management
Dashboard Development

Career Highlights

8+ years of industry experience, including 5+ years in the Biopharma domain.
Led teams of 20+ in tech and bioinformatics.
Expert in designing scalable data pipelines.

Work Experience

Autocash.ai

Head of Data Engineering and Platforms (1 yr 5 mos)

Nobias Therapeutics

Software Engineer - Platforms (7 mos)

Software Engineer - Platforms (1 yr)

Aganitha

Senior Data Scientist 2 (10 mos)

Senior Data Scientist 1 (2 mos)

Senior Data Scientist 1 (5 mos)

Senior Data Scientist 1 (1 yr 3 mos)

Senior Data Scientist 1 (6 mos)

Data Scientist 2 (1 yr 2 mos)

Data Scientist (4 mos)

Data Scientist (6 mos)

Quantiphi, Inc.

Decision Science Analyst (10 mos)

AllizHealth

Data Analyst (1 mo)

Technip

EPC Trainee (2 mos)

Jaypee Greens - SunCourt Tower

Management Trainee (1 mo)

Education

B.Tech and M.Tech dual degree at Indian Institute of Technology, Kharagpur

Class XII : Indian School Certificate at Saint Thomas Senior Secondary School

Class X : Indian Certificate of Secondary Education at Saint Thomas Senior Secondary School

AI ML Practitioner · AI Enabled

Skills

Core Skills

Data Engineering
AI Solutions
Predictive Modeling
Data Integration
Patient Cohort Analysis
NLP
Data Pipeline Standardization
Data Transformation
Machine Learning
Bioinformatics
Genomic Data Analysis
Statistical Analysis
Technical Support
Data Management
Dashboard Development
Bioinformatics Pipeline Development
Cloud Computing
Data Automation
Information Retrieval
Process Automation
Robotic Process Automation
Document Analysis
Predictive Analytics
Maintenance Management
AI Development
Speech Processing
Fraud Detection
Structural Engineering
Seismic Safety

Other Skills

Python
ETL pipelines
AI-driven solutions
Documentation
Team onboarding
EHR data integration
RapidFuzz
T-SNE
XGBoost
BERT
LSTM
Medspacy
Scispacy
Data pipeline design
EHR data transformation

Experience

Total Experience: 6 yrs 6 mos
Average Tenure: 2 yrs 2 mos

Current Experience

Autocash.ai

Head of Data Engineering and Platforms

Jan 2025 – Present · 1 yr 5 mos · India · Remote

I led financial technology and AI initiatives in the Autocash platform, focusing on automation, scalable architecture, and AI-driven solutions. My work included designing ETL pipelines, financial classification models, and integrating large language models (LLMs) to transform manual processes into efficient workflows. I emphasized documentation, knowledge sharing, and team onboarding for sustainable growth.
Problem:
Legacy systems and manual processes caused inefficiencies in cashflow categorization and reconciliation.
Fragmented ETL pipelines and inconsistent engineering delayed scalability and client onboarding.
Limited documentation and knowledge transfer led to repeated support queries.
Effort:
Designed a three-layer architecture for data pipelines, ensuring scalability and reliability.
Developed financial classification models using domain embeddings, BERT, FastText, and fuzzy logic for improved categorization.
Integrated LLMs to automate rule creation, categorization, and natural language querying.
Created documentation frameworks, including knowledge bases and onboarding materials, to support knowledge transfer.
Mentored technical staff to build a strong developer pipeline.
Conducted PoCs for features like cashflow categorization and AR tracking.
Evaluated tools such as Alteryx, Windmill, and iPaaS to optimize ETL processes.
Implemented AWS cost optimization strategies, including resource management and RDS scaling.
Developed architecture diagrams and design reviews for project alignment.
Impact:
Reduced manual processing with AI-based categorization, saving time and resources.
Accelerated client onboarding with standardized ETL processes.
Enhanced team productivity through documentation and mentorship.
Increased stakeholder confidence via transparent reporting.
Improved platform scalability and compliance by addressing technical debt.
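
The fuzzy-matching part of the categorization work described above can be illustrated with a minimal sketch. It assumes a simple string-similarity lookup and uses Python's standard-library difflib as a stand-in for the embedding, BERT, and FastText models named in the role; the category names, keywords, and the 0.6 threshold are hypothetical and not taken from the Autocash platform.

```python
from difflib import SequenceMatcher

# Hypothetical category keywords for illustration only.
CATEGORY_KEYWORDS = {
    "payroll": ["salary payment", "payroll run"],
    "rent": ["office rent", "lease payment"],
    "utilities": ["electricity bill", "water bill"],
}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def categorize(description: str, threshold: float = 0.6) -> str:
    """Pick the category whose keyword is most similar to the description."""
    best_category, best_score = "uncategorized", 0.0
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            score = similarity(description, keyword)
            if score > best_score:
                best_category, best_score = category, score
    # Fall back to "uncategorized" when nothing is similar enough.
    return best_category if best_score >= threshold else "uncategorized"
```

A misspelled description such as "Salary Paymnt" would still resolve to the payroll category under these assumptions, while unrelated text falls through to "uncategorized" for manual review.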

Python, ETL pipelines, AI-driven solutions, Documentation, Team onboarding, Data Engineering

Nobias Therapeutics

2 roles

Software Engineer - Platforms

Mar 2024 – Oct 2024 · 7 mos · India · Remote

I have led multiple projects enhancing patient cohort analysis, predictive modeling, and data pipeline standardization. My experience spans NLP, EHR data integration, and domain-specific embeddings. Leveraging RapidFuzz, T-SNE, XGBoost, and BERT-based architectures, I delivered solutions transforming raw data into insights. My work includes developing documentation and reporting frameworks, ensuring replicability and clarity for stakeholders.
Project 1: Autism Patient Cohort Analysis and Modeling
Problem:
Fragmented patient records, inconsistent coding, and varied clinical descriptions in EHR data challenged autism phenotype analysis.
Needed to stratify idiopathic autism patients to identify biomarkers and treatment targets.
Effort:
Built patient cohorts (case, control, test) using LSTM and BERT models.
Developed unsupervised pipelines with T-SNE and clustering for autism group visualization.
Used RapidFuzz and FastText for keyword matching, clustering, and embedding.
Created word clouds, overlap analyses, and top-N word frequency tables.
Applied Medspacy, Scispacy, and Ascle for data cleaning and text normalization.
Evaluated domain-specific embedding models (MedBERT, Clinical-T5, BioClinicalBERT).
Documented cohort processes, data-switching, and feature engineering.
Impact:
Improved accuracy and interpretability of autism models.
Reduced overfitting with weight decay and attention normalization.
Delivered actionable visualizations for management.

NLP, EHR data integration, Predictive modeling, Data pipeline standardization, RapidFuzz, T-SNE

Software Engineer - Platforms

Feb 2023 – Feb 2024 · 1 yr · India · Remote

Project 2: Data Pipeline Standardization and EHR Data Transformation
Problem:
Encountered fragmented EHR datasets across different sources, leading to inconsistency in patient records and increased data bias.
Required transformation pipelines for efficient data integration and embedding generation for clinical NLP tasks.
Effort:
Designed an unsupervised data pipeline with modules for data cleaning, preprocessing, and table-level feature engineering.
Implemented column-level feature extraction and merged similar columns to reduce redundancy.
Incorporated spell-checking, stopword filtering, and subword handling for embedding generation.
Integrated UMAP and T-SNE for dimensionality reduction and embedding visualization.
Applied clustering algorithms (e.g., K-Medoids, hierarchical) on distance matrices for patient grouping.
Impact:
Streamlined EHR data preparation for downstream NLP tasks, reducing manual cleaning efforts.
Enabled consistent embedding generation across various clinical data types, improving model performance.
Facilitated faster iteration on cohort analysis and embedding-based tasks through parallelized computations and standardized pipelines.

Data pipeline design, EHR data transformation, Clustering algorithms, UMAP, T-SNE, Data Pipeline Standardization

Aganitha

8 roles

Senior Data Scientist 2

Jan 2022 – Nov 2022 · 10 mos

Finding highly enriched AAV templates using Machine learning
Problem:
Find highly enriched group of AAV template sequences
Solution:
Tried various featurization techniques for embedding generation, such as one-hot encoding, by identity, and by properties (BLOSUM62)
Collaborated with scientists to interpret clustering plots of AAV sequences using T-SNE and UMAP
Trained a random forest model to find patterns in AAV sequences crossing the blood-brain barrier (BBB) across tissues
Analyzed AAV sequence interaction with the LY6A protein (in the BBB) using the AlphaFold 2 multimer model
Impact:
Classify capsid families based on amino acid similarities by identity and properties
Scientists able to determine individual capsid variants based on tissue biodistribution profile
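
To make the one-hot featurization mentioned above concrete, here is a minimal sketch under stated assumptions: it covers only the 20 standard amino acids, treats any other symbol as an all-zero row, and uses hypothetical function names. It is an illustration of the general technique, not the project's actual featurizer.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a peptide as one 20-dimensional indicator row per residue."""
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        # Unknown residues (e.g. 'X') are left as all-zero rows.
        if residue in AA_INDEX:
            row[AA_INDEX[residue]] = 1
        encoded.append(row)
    return encoded


def flatten(matrix: List[List[int]]) -> List[int]:
    """Concatenate the per-residue rows into one feature vector."""
    return [value for row in matrix for value in row]
```

Flattened vectors of equal-length sequences can then be passed to dimensionality reduction or a tree-based classifier.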
Analyzing highly enriched sequences for AAV capsid engineering
Problem:
Natural serotypes displaying finite set of tropisms in Gene therapy
AAV sequences getting expressed in off-target tissues
Solution:
Analyzed the mutated library pool and identified capsid variants via Illumina HiSeq Next-Generation Sequencing (NGS)
Implemented a collapsing algorithm to denoise clusters of sequences arising from PCR bias (duplicates)
Generated Cytoscape-based clustering plots and amino acid frequency bias heatmaps using Python
Assembled interactive dashboard with thresholds for comparison of tissue biodistribution across experiments
Impact:
Scientists able to study enrichment of AAVs across tissues and confidence in NGS data
Identify templates with improved tropism, high diversity and immune evasive AAV capsid libraries
Enabled Scientists to screen libraries using cell-based assays and in vivo models to select capsids with desired properties
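
One common way to collapse near-duplicate reads is a greedy merge of low-count sequences into a more abundant neighbour. The sketch below assumes equal-length sequences and a Hamming-distance threshold of 1; both are illustrative choices, and the algorithm actually used in the project is not specified in this profile.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_reads(reads, max_mismatches=1):
    """Greedily merge low-count sequences into a more abundant near-identical one."""
    counts = Counter(reads)
    collapsed = {}
    # Visit the most abundant sequences first so they become cluster centres.
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centre in collapsed:
            if len(centre) == len(seq) and hamming(centre, seq) <= max_mismatches:
                collapsed[centre] += n  # absorb the likely PCR/sequencing variant
                break
        else:
            collapsed[seq] = n  # no close centre found: start a new cluster
    return collapsed
```

With five copies of ACGT, one ACGA, three TTTT, and one TTTA, this returns two clusters with counts 6 and 4.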

Machine learning, Random forest model, T-SNE, UMAP, Alphafold-2

Senior Data Scientist 1

Oct 2021 – Dec 2021 · 2 mos

Genome-Wide Association Studies(GWAS) regression analysis
Problem:
Non replicable genetic association studies with few well-validated genetic risk factors peeking above the noise
Challenges in highlighting molecular pathways and finding potential targets for therapy
Difficulties in understanding variations affecting a person’s response to certain drugs and gene-environment interactions
Solution:
Engineered a custom library for processing UKBB 200K and 450K WES data using Luigi, Hail, and Spark
Added QC steps such as MAF and monomorphic-variant filters and ancestry estimation for bias and outlier removal
Added visualization charts such as PCA, Manhattan, and QQ plots, along with QC plots such as variant and sample call rate
Deployed the NGS pipeline using CI/CD on on-premise and AWS clusters for scalable computing
Migrated pipeline to DNAnexus for processing UKBB 450K WES data release with gene burden tests
Impact:
Brought down entire pipeline time(data onboarding, processing, results generation etc.) from 3 months to ~3 weeks
Scientists able to identify genes or novel SNPs associated with a particular disease or trait
Pinpoint genes or markers that may contribute to a person’s risk of developing a certain disease
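
The MAF and monomorphic-variant filters mentioned above can be sketched in plain Python. This is an assumption-laden illustration, not the Hail-based implementation: genotypes are coded as alternate-allele counts (0, 1, 2) with None for missing calls, and the variant IDs and the 0.01 threshold are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF from diploid genotypes coded as alt-allele counts (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_variants(variants, min_maf=0.01):
    """Keep variants with MAF >= min_maf; monomorphic sites have MAF 0 and are dropped."""
    return {
        variant_id: genotypes
        for variant_id, genotypes in variants.items()
        if minor_allele_frequency(genotypes) >= min_maf
    }
```

A site where every called sample is homozygous for the same allele has a MAF of 0, so one threshold handles both filters in this simplified form.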

GWAS, Regression analysis, NGS pipeline, Visualization charts, CI/CD, Genomic Data Analysis

Senior Data Scientist 1

Promoted

Apr 2021 – Sep 2021 · 5 mos

Dashboard for managing Next Generation Sequencing(NGS) pipelines
Problem:
Unable to manage exploding NGS data generated from pipelines at terabyte scale
Inefficient and non-reproducible analysis pipelines with results not accessible to biologists
Complexity required for analysis significantly hindering the overall turnaround time for wet-lab scientists
Solution:
Crafted framework for managing analysis runs from command line interface and GUI
Architected Postgres database schema to capture metadata about runs and experiments
Designed dashboard(GUI) for managing, monitoring and analyzing runs using React JS
Added Access Management support for project teams and managers using integration with Active Directory
Impact:
Able to plug in and use modern as well as legacy tools without much requirement of programming skills
Effectively communicate study progress and results via generated reports and integrated genome browser
Building apps on DNAnexus for GxP compliance
Problem:
Lack of reproducible workflows leading to inaccurate and misrepresentative results
Missing GxP compliant tools in DNAnexus environment thereby delaying scientific progress
Solution:
Constructed reproducible workflows for the RNA-seq pipeline using tools like DESeq2 on DNAnexus
Defined user, functional and design requirements specifications along with Approval and Traceability matrix
Wrote automated unit and integration test scripts for tools like PRSice-2 for Qualification SOP
Impact:
Scientists capable of analyzing scientific findings while adhering to regulatory compliance
Integrated quality management software tools for robust and automatic GxP compliant cloud software
Defect tracking and Change control processes ready to produce documented evidence and consistent tools in practice

NGS data management, Postgres database, Dashboard design, React JS, Data Management, Dashboard Development

Senior Data Scientist 1

Sep 2020 – Dec 2021 · 1 yr 3 mos

IT support for Bioinformaticians and Wet lab scientists
Problem:
Scientists debugging errors and getting stuck in installing software
Scientists unable to do computational analysis on large scale due to unavailability of cluster
Solution:
Provided customer support on technical issues for the client tools and workflows
Debugged task failures using Kibana and troubleshot incoming requests from clients
Set up the UCSC Genome Browser for visualizing target sites for alternative splicing
Installed software using Spack modules, e.g. RStudio, Bioconductor packages, Geneious, etc.
Trained junior scientists, recruited new staff, and worked with external collaborators on data harmonization
Wrote developer and end-user documentation to achieve full traceability of processes
Impact:
Scientists able to focus on higher level goals and analyze results

Technical support, Software installation, Documentation, Bioinformatics

Senior Data Scientist 1

Promoted

Sep 2020 – Mar 2021 · 6 mos

Contamination Analysis and Vector Genome Integration site discovery
Problem:
Contaminants in viral preparations and NGS data leading to immunogenic effects
Vector-host integration in host genome leading to tumour
Solution:
Created a parallelizable, resumable, and modular pipeline for analyzing Terabytes of data on AWS cluster
Added QC steps to include cut adapters, check paired-end reads, remove chimeric alignments, etc.
Aligned vector sequences against the reference genome using the Burrows-Wheeler Aligner (BWA-MEM)
Implemented parallel processing and In-memory computation of data to reduce disk space and time
Impact:
Precise characterization and estimation of contaminants and their frequency
Reduced disk space for the overall pipeline by ~20% at terabyte scale
Decreased runtime for the overall pipeline by ~10% at hourly scale
Developing an in silico RNA-seq pipeline for NGS clinical data
Problem:
Finding genes of importance contributing to disease helping in drug target discovery
Difficulty in understanding the biological differences between healthy and diseased states
Solution:
Wrote an RNA sequencing pipeline in WDL for UKBB 200K clinical data using the Cromwell workflow manager
Used bioinformatics tools like STAR for alignment, Picard for QC, HTSeq for read counts, etc.
Added visualization charts like volcano plots and PCA charts to interpret genes with the highest fold change
Integrated with a Slurm backend for testing on on-premise HPC clusters
Migrated pipeline to AWS cloud for elastic scalable computing on demand
Impact:
Able to perform Differential gene expression analysis(DGEA) to find relevant genes for disease cure
Scientists unblocked to discover quantitative changes in expression levels between experimental groups

RNA-seq pipeline, WDL, Bioinformatics tools, AWS cloud, Bioinformatics Pipeline Development, Cloud Computing

Data Scientist 2

Promoted

Jun 2019 – Aug 2020 · 1 yr 2 mos

Competitive Intelligence platform for Biopharma supply chain
Automated information extraction for publicly traded Contract Development and Manufacturing Organizations (CDMOs)
Developed assets in form of pricing, suppliers, patents, applications, financial statements etc.
Chemical Reaction Extractor for training data generation
Built a human-guided chemical reaction extractor tool for extracting data from 1000+ chemical reactions
Stored data in SMILES format for training machine learning model for predicting yield of product
Information Retrieval System using UMLS database
Built in a Jupyter notebook for querying diseases from the UMLS database
Designed hierarchical plots for identifying hierarchies among multiple ontologies and taxonomies

Information extraction, Data automation, Chemical reaction extraction, Information Retrieval

Data Scientist

Jan 2019 – May 2019 · 4 mos

Robotic Process Automation(RPA) of IT Business processes
Downloading transaction files and automated report generation
Wrote automation scripts to download files from older Citrix UI portals for 20 banks
Designed exception handling for outlier cases in case of interruptions and bad data
Automated 6 IT business processes for credit card payments using the UiPath automation tool
Deployed 6 processes on-site in production using UiPath server mode for scheduling
Email classification model for incoming service requests
Deployed Email classification Random forest model in production using Flask server
Wrote script for account statement generation to be sent to customer
Helped in meeting stringent Turn Around Times(TATs) and Service Level Agreements(SLAs)

Robotic Process Automation, UiPath, Automation scripts, Process Automation

Data Scientist

Jun 2018 – Dec 2018 · 6 mos

Geological document identification from scanned mining documents
Heuristic approach
Challenge: Assign location to document based upon content of document
Cleaned up special characters and preprocessed the document using OCR to extract text for NLP
Created custom markers like longitude and latitude extraction from maps for identifying location
Used an n-gram lookup model to match extracted text against a list of reference locations (gazetteer data)
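
The n-gram lookup in the heuristic approach above can be sketched as follows. This is a minimal illustration assuming exact, case-insensitive matches of word n-grams (up to three words) against a small gazetteer; the function names and the example place names are hypothetical.

```python
def word_ngrams(tokens, max_n=3):
    """Yield all word n-grams of length 1..max_n as lowercase strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n]).lower()


def find_locations(text, gazetteer, max_n=3):
    """Return gazetteer entries that appear as word n-grams in the text."""
    names = {name.lower(): name for name in gazetteer}
    # Strip simple punctuation left over from OCR'd sentences.
    tokens = [token.strip(".,;:()") for token in text.split()]
    found = []
    for gram in word_ngrams(tokens, max_n):
        if gram in names and names[gram] not in found:
            found.append(names[gram])
    return found
```

Multi-word names such as "Broken Hill" are matched because bigrams and trigrams are checked as well as single words.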
Machine learning approach
Encoded hierarchical information of location to generate more features
Used Parts of Speech(POS) tagging and custom markers from heuristic model as features
Trained Learning to Rank model for identifying probable georeferencing candidates
Deduplication and Assay data extraction from scanned documents
Removed duplicate documents(having different names) based on content matching using CV and NLP
Classified documents into departments (like HR, compliance, operations etc.) using bag of words NLP model
Scraped assay data(amount of minerals like copper, aluminium etc.) into database using table extraction module
Predictive maintenance for large excavation trucks
Challenge: Early warning and scheduling of maintenance tasks to prevent downtime
Modelled continuous sensor data and batch maintenance reports data for time series model
Encoded features from physical and chemical properties from lubricant and coolant
Trained ARIMA time series model to predict next breakdown of excavation trucks

Geological document identification, NLP, Machine learning, Document Analysis

Quantiphi, Inc.

Decision Science Analyst

Jul 2017 – May 2018 · 10 mos · Mumbai Area, India

At Quantiphi, I was part of a product team named 'AthenasOwl'. Our agenda was to automate content tagging and metadata generation using AI to generate game-changing insights.
Speaker Diarization
“Who Spoke When” – Speech and Speaker segmentation
Dataset: 10 hours Game of Thrones audio; Bi-LSTM network
Achieved train accuracy = 92.9% and test DER = 29.3%
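
For readers unfamiliar with the DER figure above, diarization error rate is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by total reference speech time. The helper below is a minimal sketch of that definition; the function name and any durations passed to it are hypothetical and unrelated to the reported result.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech
```

All four arguments are durations in the same unit, for example seconds.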
Logo Recognition
Trained CNN model for brands like Nike, Adidas, Coke etc.
Dataset: 9000 images from BelgaLogos and Logo32 datasets
Achieved train accuracy = 98.5% and test accuracy = 96.3%
Emotion Recognition
Trained CNN model for 7 categories of human emotions
Dataset: 28709 images of FER2013 and RaFD Kaggle
Achieved test accuracy = 87% for Happy and Sad classes
Character Recognition
Automated face recognition pipeline to label primary clusters
Pretrained model: 29-layer Resnet CNN with Batch Norm.
Classifier: 4-layer DNN for classifying 128-D feature vectors
Achieved test accuracy = 96.5% on F.R.I.E.N.D.S sitcom
Team projects: smoking, NSFW, locale, and object detection; commentary transcription and high-point detection in matches

Predictive maintenance, Time series modeling, Predictive Analytics, Maintenance Management

AllizHealth

Data Analyst

May 2016 – Jun 2016 · 1 mo · Pune, Maharashtra, India · On-site

Job Summary:
Developed and deployed advanced insurance claim forecasting models and fraud detection systems, while also building mental health analysis tools. Successfully improved prediction accuracy, identified fraudulent cases, and enhanced the emotional well-being analysis capability for product deployment.
Job Description:
PROBLEM: High claim ratio diseases were driving insurance costs.
EFFORT: Built a predictive model using clustering and Vector-Autoregression in R to forecast claims for top ten diseases.
IMPACT: Achieved 85% confidence interval accuracy, enabling better risk management for SecureNow insurance company.
PROBLEM: Need to determine optimal lag order for time series modeling.
EFFORT: Executed Multivariate Granger-Causality tests to determine lag order with the lowest AIC.
IMPACT: Improved model precision and forecasting reliability.
PROBLEM: Risk of fraudulent claims impacting insurance payouts.
EFFORT: Designed an outlier detector to identify fraudulent cases and detected appendicitis fraud.
IMPACT: Enhanced fraud detection, supporting real-world implementation by SecureNow.
PROBLEM: Need for a scalable mental health analysis platform.
EFFORT: Built an emotional stroop test portal in PHP and MySQL to detect depression and anxiety using lexicon features.
IMPACT: Developed a deployable tool to screen for emotional states, ready for company-wide rollout.
PROBLEM: Lack of robust emotional state insights among employees.
EFFORT: Surveyed 20 employees for 7 days and performed Analysis of Variance for pattern recognition.
IMPACT: Created reliable visualizations of mood swings and verified accuracy with cross-validation.

Content tagging, AI, Speech recognition, AI Development, Speech Processing

Technip

EPC Trainee

May 2015 – Jul 2015 · 2 mos · Noida, Uttar Pradesh, India · On-site

Job Summary:
Performed advanced seismic and wind analysis of critical refinery structures, optimized base plate design for load efficiency, and implemented cost-effective design improvements. Delivered significant reductions in structural thickness and base pressure, enhancing safety, performance, and cost savings.
Job Description:
PROBLEM: Ensuring seismic stability of critical structures under various loading conditions.
TASK: Conducted seismic analysis of slug catcher and column base plates per IS codes, using CQC method for static and dynamic earthquake evaluation.
IMPACT: Enhanced seismic safety and compliance with national standards.
PROBLEM: Refinery heater wall deformation under wind loading.
TASK: Analyzed wind-induced deformations and determined bearing length using Blodgett’s method.
IMPACT: Ensured structural integrity and reduced risk of wind-related damage.
PROBLEM: Outdated design practices in column base plates leading to inefficiencies.
TASK: Compared old and new IS 800 codes, and remodeled base plate design using FEM.
IMPACT: Achieved 86% reduction in plate thickness, resulting in material and cost savings.
PROBLEM: Excessive base pressure affecting structural performance.
TASK: Installed flange stiffeners and optimized load distribution.
IMPACT: Reduced base pressure by 13%, enhancing overall load capacity and efficiency.

Predictive modeling, Clustering, Time series modeling, Predictive Analytics, Fraud Detection

Jaypee Greens - SunCourt Tower

Management Trainee

May 2014 – Jun 2014 · 1 mo · Noida Area, India · On-site

Job Summary:
Analyzed construction details, conducted on-site inspections, and ensured adherence to design specifications, leading to improved construction quality and cost-effective solutions.
Job Description:
PROBLEM: Complex construction details required thorough technical analysis.
TASK: Reviewed Aluminium framework Mivan shuttering, Conventional shuttering, Reinforced Concrete, Tremix flooring, and Frost glass interiors using project maps.
IMPACT: Improved construction quality and design efficiency.
PROBLEM: Needed to align on-site work with design specifications.
TASK: Assisted the site engineer during inspections to ensure quality and compliance.
IMPACT: Reduced discrepancies and rework, enhancing project delivery.
PROBLEM: Coordination between design documentation and site execution.
TASK: Collaborated with teams to optimize material usage and maintain design integrity.
IMPACT: Strengthened collaboration and reduced project delays.