Building In-house Semantic Search Engine from Scratch — Fast and Accurate

Photo by Markus Winkler on Unsplash

Bloomberg - Semantic search is a data searching technique in which a search query aims to not only find keywords but to determine the intent and contextual meaning of the words a person is using for search.

Semantics refers to the philosophical study of meaning. It’s true that philosophy rarely rhymes with software engineering, but this concept does help us reach a definition. Indeed, Semantic Search is related to figuring out what your user means.

Semantic search seeks to improve search accuracy by understanding the content of the search query. …

An easy way to re-rank search results and personalize recommendations.

This post continues my previous post, "Semantic Search with S-BERT is all you need".

What we covered in the previous post:

The architecture we have already discussed
  • Types of search: Asymmetric & Symmetric
  • Dot-product and cosine-similarity based models
  • Embeddings Storage (FAISS)
  • Synthetic Query Generation
  • Bi-Encoder Finetuning
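
To make the dot-product versus cosine-similarity distinction from the list above concrete, here is a minimal sketch in plain Python. The vectors and the names `query`, `short_doc`, and `long_doc` are made up for illustration and are not taken from the previous post. It shows that the two scores can rank the same documents differently, because the dot product rewards vector length while cosine similarity only looks at direction.

```python
import math

def dot(u, v):
    """Dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy 2-d "embeddings" (hypothetical values, chosen only for illustration).
query = [1.0, 0.0]
short_doc = [1.0, 0.0]   # same direction as the query, small norm
long_doc = [3.0, 3.0]    # different direction, large norm

dot_scores = {"short_doc": dot(query, short_doc), "long_doc": dot(query, long_doc)}
cos_scores = {"short_doc": cosine(query, short_doc), "long_doc": cosine(query, long_doc)}

# The dot product prefers long_doc; cosine similarity prefers short_doc.
print(max(dot_scores, key=dot_scores.get))
print(max(cos_scores, key=cos_scores.get))
```

This is why a model trained with one scoring function should generally be queried with the same one.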

And what we left hanging there were some important questions:

  • methods for re-ranking results
  • quality of results, i.e. how well they align with the user query
  • how to incorporate user behavior when recommending results
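
As a rough illustration of how the first and last questions can meet, the sketch below re-ranks retrieved candidates by blending a semantic similarity score with a smoothed click-through rate. Everything here is an assumption made for illustration: the function name `rerank`, the weight `alpha`, the smoothing priors, and the toy data are hypothetical, and this is not necessarily the method used later in this post.

```python
def rerank(candidates, clicks, impressions, alpha=0.8,
           prior_clicks=1.0, prior_impressions=10.0):
    """Blend semantic score with a smoothed click-through rate (CTR).

    candidates: list of (doc_id, semantic_score) from the retriever.
    clicks / impressions: dicts mapping doc_id to logged counts.
    alpha: hypothetical weight on the semantic score (0..1).
    The priors keep rarely shown documents from getting an extreme CTR.
    """
    def smoothed_ctr(doc_id):
        return (clicks.get(doc_id, 0) + prior_clicks) / (
            impressions.get(doc_id, 0) + prior_impressions
        )

    scored = [
        (doc_id, alpha * score + (1 - alpha) * smoothed_ctr(doc_id))
        for doc_id, score in candidates
    ]
    # Highest blended score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (made up): "b" is slightly less similar than "a" but clicked far more often.
candidates = [("a", 0.80), ("b", 0.78), ("c", 0.40)]
clicks = {"b": 90}
impressions = {"a": 100, "b": 100}

ranked = rerank(candidates, clicks, impressions)
print([doc_id for doc_id, _ in ranked])
```

With these toy numbers the behavioral signal moves "b" ahead of "a", while the clearly less relevant "c" stays last. How much weight to give to behavior is a tuning decision that depends on how noisy the click logs are.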

Get Started: Overview

Search Ranking and Recommendations are fundamental problems of crucial interest to major Internet companies, including web search engines…



Subir Verma

Senior Data Scientist (NLP)
