Nandan Thakur — AI Researcher

I’m Nandan Thakur (https://thakur-nandan.github.io), a fourth-year Ph.D. student at the University of Waterloo, supervised by Prof. Jimmy Lin. My research is in information retrieval, with a focus on heterogeneous benchmarking of retrieval models across specialized domains and languages. I received my bachelor’s degree in Electronics from Birla Institute of Technology and Science (BITS) Pilani in India. During my Ph.D., I have been fortunate to intern at Google Research and to collaborate on research projects with industry partners such as Huawei and start-ups such as Vectara. Before my Ph.D., I worked extensively as a research assistant at the Technical University of Darmstadt in Germany. My long-term ambition is to improve low-resource and efficient neural techniques for retrieval, a practical setting useful to a large audience in both research and industry. I evaluate and optimize the robustness of retrieval models by analyzing popular out-of-domain evaluation techniques, and I provide easy-to-implement code for such models so that anyone with minimal prior experience can deploy them. I also like to contribute to open-source projects such as BEIR, an open-source IR benchmark that was accepted at NeurIPS 2021 and currently has over 1.4K stars on GitHub! Check my CV (PDF attached in the featured section) to better understand my research orientation and experience. Feel free to reach out to chat about NLP research, to collaborate on a project (if you are a student at the University of Waterloo), or just to say hello!

Stackforce AI infers this person is a Data Scientist specializing in Information Retrieval and Machine Learning.

Experience: 8 yrs 1 mo

Skills

Information Retrieval
Data Science
Machine Learning

Career Highlights

Developed a benchmark for zero-shot information retrieval.
Created an enterprise product for efficient content storage.
Contributed to open-source projects with significant community engagement.

Work Experience

Databricks

Research Intern (3 mos)

Vectara

Research Intern (5 mos)

Google

Student Researcher (7 mos)

University of Waterloo

Graduate Researcher (4 yrs 8 mos)

Technische Universität Darmstadt

NLP Research Assistant (1 yr 10 mos)

KNOLSKAPE

Data Scientist (1 yr 2 mos)

EMBL

Research Trainee (2 mos)

Belong.co

Data Science Intern (5 mos)

numberz

Data Analyst Intern (1 mo)

Cellular Operators Association of India - COAI

Research Intern (2 mos)

Education

Doctor of Philosophy - PhD at University of Waterloo

B.E. (Hons.) at Birla Institute of Technology and Science, Pilani

High School at Modern School, Barakhamba Road

Canada · 8 yrs 1 mo experience

Most Likely To Switch: Highly Stable

Skills

Core Skills

Information Retrieval
Data Science
Machine Learning

Other Skills

AWS
Apache Airflow
Bitbucket
C
C++
Data Augmentation
Deduplication Algorithm
Deep Learning
Django
Docker
Elasticsearch
Flask
Git
HTML
LDA

Experience

Total Experience: 8 yrs 1 mo
Average Tenure: 1 yr 6 mos
Current Experience: 4 yrs 8 mos

Databricks

Research Intern

Aug 2024 – Nov 2024 · 3 mos · San Francisco, California, United States · On-site

Internship with the Databricks & MosaicML research group.

Vectara

Research Intern

Feb 2024 – Jul 2024 · 5 mos · Remote

Part-time research collaboration between the University of Waterloo and Vectara.

Google

Student Researcher

Sep 2022 – Apr 2023 · 7 mos · Mountain View, California, United States · On-site

University of Waterloo

Graduate Researcher

Sep 2021 – Present · 4 yrs 8 mos · Waterloo, Ontario, Canada

Technische Universität Darmstadt

NLP Research Assistant

Nov 2019 – Sep 2021 · 1 yr 10 mos · Darmstadt Area, Germany

Supervisor: Prof. Iryna Gurevych (ACL Fellow, 2020)
Mentors: Dr. Nils Reimers (Creator of SBERT), Dr. Johannes Daxenberger (Founder of ArgumenText)
BEIR: A Heterogeneous Benchmark for Zero-shot Information Retrieval (First Author, arXiv 2021)
A diverse IR benchmark covering 15+ datasets and state-of-the-art retrieval systems such as Sentence-BERT, DPR, USE-QA, and Elasticsearch in a zero-shot evaluation setup.
Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (First Author, accepted at NAACL'21)
Developed an effective data augmentation strategy for improving bi-encoder performance when training data is small or unavailable.
SEABERT: Effective and Efficient Adapters for Sentence-BERT
Ongoing research to integrate adapters (https://adapterhub.ml) with Sentence-BERT (sentence-transformers).
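In the zero-shot setup described for BEIR above, a retriever is scored on datasets it was never trained on, with ranking metrics such as nDCG@10. The following is a minimal Python sketch of that metric for illustration only. It is not the BEIR codebase, and the function names, document ids, and relevance judgments are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for a single query.

    ranked_doc_ids: document ids in the order the retriever returned them.
    qrels: dict of doc id -> graded relevance (ids not listed count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy data: one query, two relevant documents.
qrels = {"q1": {"d1": 1, "d3": 1}}
run = {"q1": ["d3", "d2", "d1"]}  # ranking produced by some retriever

scores = {qid: ndcg_at_k(run[qid], qrels[qid], k=10) for qid in run}
mean_ndcg = sum(scores.values()) / len(scores)
```

For the toy ranking, one relevant document sits at rank 1 and the other at rank 3, so the score lands a little below 1.0 (roughly 0.92).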
Engineering Experience:
Single-handedly carried out an industrial research project with Procter & Gamble (P&G) using customer complaints about P&G diaper products.
Developed an argument aggregation and clustering technique within the ArgumenText dashboard, built with Flask, SQL, and VueJS. Also worked extensively with Docker, Elasticsearch, and unit-testing frameworks.
Conducted a crowdsourcing study using the Best-Worst Scaling (BWS) method to annotate pairwise arguments on the Amazon Mechanical Turk platform (developed using HTML and boto3).
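Best-Worst Scaling asks annotators to pick the best and the worst item from a small set. A common way to turn those picks into scores is simple counting: times chosen best minus times chosen worst, divided by times shown. The sketch below assumes that counting scheme. The annotation tuples and argument ids are invented for illustration and do not reflect the study's actual data or its Mechanical Turk code.

```python
from collections import defaultdict


def bws_scores(annotations):
    """Count-based Best-Worst Scaling scores in the range [-1, 1].

    annotations: iterable of (items_shown, best_item, worst_item) tuples,
    one per annotator judgment.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in annotations:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical judgments over four arguments shown together three times.
annotations = [
    (("arg_a", "arg_b", "arg_c", "arg_d"), "arg_a", "arg_d"),
    (("arg_a", "arg_b", "arg_c", "arg_d"), "arg_a", "arg_c"),
    (("arg_a", "arg_b", "arg_c", "arg_d"), "arg_b", "arg_d"),
]
scores = bws_scores(annotations)
ranking = sorted(scores, key=scores.get, reverse=True)
```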

Information Retrieval · Data Augmentation · Flask · SQL · Docker · Elasticsearch · +1

KNOLSKAPE

Data Scientist

Sep 2018 – Nov 2019 · 1 yr 2 mos · Bengaluru Area, India

Research Projects (NLP, ML):
Approximate Deduplication at Scale
Developed a custom deduplication algorithm to detect near-duplicates within a pool of over 10 million content items. Implemented using MinHash and Locality-Sensitive Hashing (LSH) techniques for textual documents and perceptual hashing techniques for images.
Feature-based Segmentation Strategy for Passage Retrieval
Generated custom sentence-level feature cues for audio-visual and textual documents using cue words and tags, capital letters, and length of silence before every sentence (lexical cues), plus Word2vec (semantic cues). Implemented the TextTiling algorithm as an unsupervised task, using these feature cues to merge individual sentences into passages that each cover a single topic (e.g., the way chapters work in a book).
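The idea behind TextTiling-style segmentation can be sketched as follows: measure similarity across each gap between neighbouring sentences and cut where the similarity drops into a deep valley. This is a simplified illustration, not the project's implementation. It uses plain word counts as a stand-in for the lexical and Word2vec cues described above, approximates the depth score with the highest similarity on each side rather than the nearest peak, and the `window` and `min_depth` parameters and sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentences):
    """Lower-cased token counts pooled over a block of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z0-9]+", sentence.lower()))
    return counts


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment(sentences, window=2, min_depth=0.3):
    """Merge sentences into passages by cutting at deep similarity valleys."""
    if len(sentences) < 2:
        return [list(sentences)]
    # Similarity between the blocks on either side of each sentence gap.
    sims = []
    for gap in range(1, len(sentences)):
        left = bag_of_words(sentences[max(0, gap - window):gap])
        right = bag_of_words(sentences[gap:gap + window])
        sims.append(cosine(left, right))
    passages, current = [], [sentences[0]]
    for i, sim in enumerate(sims):
        # Depth: how far this gap sits below the best similarity on each side.
        depth = (max(sims[: i + 1]) - sim) + (max(sims[i:]) - sim)
        if depth >= min_depth:
            passages.append(current)
            current = []
        current.append(sentences[i + 1])
    passages.append(current)
    return passages


sentences = [
    "Dense retrieval models encode queries into vectors.",
    "Retrieval models rank documents by comparing vectors.",
    "Bread dough needs yeast, flour, water.",
    "Knead the bread dough until smooth.",
]
passages = segment(sentences)
```

On this toy input the only deep valley falls between the second and third sentences, so the result is two passages of two sentences each.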
Engineering Responsibilities (Back-End, Front-End):
Developed Krawler.ai, an enterprise product built with Flask and ReactJS. The product is used by more than five mid-level organizations to store and search content efficiently within their knowledge repositories, using deep learning and a detailed understanding of the content’s metadata.
Technical contributions included developing:
Robust, multi-threaded, error-resistant crawlers for retrieving data from various public and private repositories.
Scripts for automatic text, image, and table extraction from popular PDF, DOC, PPT, Excel, and image file types.
A pipeline architecture to index 10 million+ content items from various sources (mainly YouTube, CrossKnowledge, OneDrive, and Google Drive) using Apache Airflow, Microsoft Azure, and Amazon Web Services, along with RabbitMQ and Google Pub/Sub messaging.
The full back-end architecture and model hosting using Flask and Docker.
Relevant indexes for, and management of, our MongoDB (DocumentDB) database, leading to efficient retrieval and storage of data.
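The MinHash and LSH approach named in the deduplication project above can be illustrated with a small pure-Python sketch. It is an assumption-laden toy, not the production algorithm: the shingle size, number of hash functions, band count, MD5-based hashing, and sample documents are all hypothetical choices made for readability.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, n=3):
    """Set of word n-grams (shingles) for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=16):
    """Pairs of documents whose signatures agree on every row of some band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "The quick  brown fox jumps over the lazy DOG",  # same text, different casing
    "c": "information retrieval benchmarks compare many different ranking models",
    "d": "the quick brown fox jumps over the sleepy dog",  # near-duplicate of "a"
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = lsh_candidate_pairs(signatures)
similarity_a_d = estimated_jaccard(signatures["a"], signatures["d"])
```

Only the candidate pairs returned by the banding step need a full comparison, which is what keeps the approach tractable on large collections. Documents "a" and "b" normalize to the same shingles and always collide, while a near-duplicate such as "d" becomes a candidate with a probability that rises with its true Jaccard similarity.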

Deduplication Algorithm · Flask · ReactJS · Deep Learning · Apache Airflow · AWS · +2

EMBL

Research Trainee

Jun 2018 – Aug 2018 · 2 mos · Heidelberg Area, Germany

Supervisor: Dr. Toby J. Gibson
Mentors: Dr. Manjeet Kumar, Dr. Bernd Klaus
Received a fully-funded 3-month research internship at the Gibson Lab, European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. I worked on an interdisciplinary project applying machine learning to computational biology. Developed a logistic-regression-based prediction toolkit using scikit-learn with nested cross-validation. Implemented various structural features (RaptorX, Anchor, DynaMine, MAPKAPK) to predict kinase-substrate site phosphorylation, i.e., how confidently a kinase can phosphorylate a particular substrate site. For non-biologists, this reduces to a classification problem with heavily imbalanced datasets.

Belong.co

Data Science Intern

Jul 2017 – Dec 2017 · 5 mos · Bengaluru Area, India

Active maintainer of the Flashtext module (4k+ stars on GitHub). Most of my work revolved around traditional ML algorithms such as Latent Dirichlet Allocation, topic modeling, and Word2Vec. I single-handedly developed end-to-end machine learning pipelines, from preprocessing training data to deploying machine learning models (using Django). The code has been in production use since 2017.
I performed topic modeling by classifying over 5 million documents into topics using a semi-supervised clustering algorithm, GuidedLDA (Guided Latent Dirichlet Allocation), and trained a word2vec model to reconstruct the linguistic contexts of words in order to measure job skill similarities. Links to open-sourced resources can be found below. (Confidential: Can't release actual production code.)

Numberz

Data Analyst Intern

May 2017 – Jun 2017 · 1 mo · Gurgaon, India

Automated the company's metrics reporting in R, covering product, usage, and marketing metrics, which saved about a day of work each week that had previously been spent updating the metrics manually. I also created interactive visualizations and detailed insights on the company's lead quality and lead generation using ggplot2 in R and macros in Excel, resulting in better conversion rates.

Cellular Operators Association of India - COAI

Research Intern

May 2016 – Jul 2016 · 2 mos · New Delhi Area, India

Conducted research on electromagnetic emissions from telecom towers and their health effects on residents of urban areas in India, to assess safety standards and help raise public awareness. Worked closely with the Department of Telecommunications (DoT) to arrange Electromagnetic Emission Awareness Programs in Dehradun and Hyderabad, India.