Semantic Search with S-BERT is all you need

Building In-house Semantic Search Engine from Scratch — Fast and Accurate

Subir Verma
14 min read · Jun 5, 2021
Photo by Mojahid Mottakin on Unsplash

Bloomberg - Semantic search is a data searching technique in which a search query aims to not only find keywords but to determine the intent and contextual meaning of the words a person is using for a search.

Semantics refers to the philosophical study of meaning. Philosophy rarely rhymes with software engineering, but the concept does help us reach a definition: semantic search is about figuring out what your user means.

Semantic search seeks to improve search accuracy by understanding the content of the search query. In contrast to traditional search engines, which only find documents based on lexical matches, semantic search can also find synonyms.

This type of search makes browsing more complete by understanding almost exactly what the user is trying to ask, instead of simply matching keywords to pages. The idea behind semantic search is to embed all entries in your corpus, which can be sentences, paragraphs, or documents, into a vector space.

At search time, the query is embedded into the same vector space and the closest embedding from your corpus is found. These entries should have a high semantic overlap with the query.
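As a rough sketch of this loop (the model name and toy corpus below are purely illustrative assumptions, not part of the original setup):

from sentence_transformers import SentenceTransformer, util

# Illustrative model and toy corpus; any sentence-embedding model would do
model = SentenceTransformer('paraphrase-distilroberta-base-v1')
corpus = ["How to learn Python on the web?",
          "Python is an interpreted, high-level programming language.",
          "The weather in Paris is mild in spring."]

# Embed the corpus once, then embed each incoming query into the same vector space
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("How to learn Python online?", convert_to_tensor=True)

# Retrieve the closest corpus entries by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)
print(hits[0])   # list of {'corpus_id': ..., 'score': ...}

util.semantic_search here does a plain cosine-similarity scan over the corpus; later in the post we swap this for a FAISS index so the lookup scales to tens of thousands of documents.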

FYI — For this blog, I am sticking with variants of Sentence-Transformers. SentenceTransformers is designed in such a way that fine-tuning your own sentence/text embedding models is easy.
It provides most of the building blocks that you can stick together to tune embeddings for your specific task.
If you are not aware of this concept please read their publications here: https://www.sbert.net/docs/publications.html

Before we begin, let’s ask ourselves these questions and look for their solutions rather than jumping straight to an implementation:

Q1. What sort of embeddings will work?
Q2. How do we store documents and their huge embeddings if we use BERT?
Q3. What if we have long documents like blogs and small pieces of content like product descriptions? How will the approach change?
Q4. How can model fine-tuning give us good results?

Let’s see if we can answer all of these questions. Along the way, the following concepts will be covered:
1. Search types
2. Cosine and dot-product metrics
3. Document embedding techniques
4. FAISS storage and retrieval
5. Synthetic query generation
6. Bi-encoder fine-tuning

I have also shared a detailed video analysis of this blog (YouTube link at the end of the post).

Symmetric vs. Asymmetric Semantic Search

The solution to Q1 & Q3
A critical distinction for your setup is symmetric vs. asymmetric semantic search:

  • For symmetric semantic search, your query and the entries in your corpus are of about the same length and have the same amount of content. An example would be searching for similar questions: Your query could for example be “How to learn Python online?” and you want to find an entry like “How to learn Python on the web?”. For symmetric tasks, you could potentially flip the query and the entries in your corpus.
  • For asymmetric semantic search, you usually have a short query (like a question or some keywords) and you want to find a longer paragraph answering the query. An example would be a query like “What is Python” and you want to find the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”. For asymmetric tasks, flipping the query and the entries in your corpus usually does not make sense.

This distinction is very important: it helps us understand our problem better, so we can use methods developed specifically for the task we have at hand.

You must choose the right model for your type of task. It is mostly distinguished by the type of data it has been trained on. Also, models tuned for cosine similarity will prefer the retrieval of short documents, while models tuned for dot-product will prefer the retrieval of longer documents. Depending on your task, the models of one or the other type are preferable.

1. Tuned for Cosine Similarity:

If the vectors representing documents are normalized (i.e., their magnitudes are adjusted to 1), cosine similarity is not influenced by document length.

Models tuned for cosine similarity can handle documents of varying lengths equally well, and they do not inherently prefer short or long documents.

2. Tuned for Dot Product:

Without normalization, the dot product is influenced by the magnitude of the vectors.

If vectors are not normalized, longer documents might have larger magnitudes, potentially leading to a bias toward longer documents in terms of similarity scores.
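A tiny numeric sketch (with made-up vectors) of why the two scores can rank documents differently:

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 1.0])
short_doc = np.array([1.0, 1.0])    # same direction as the query, small magnitude
long_doc = np.array([3.0, 2.0])     # slightly different direction, large magnitude

print(cosine(query, short_doc), np.dot(query, short_doc))   # 1.00, 2.0
print(cosine(query, long_doc), np.dot(query, long_doc))     # ~0.98, 5.0
# Cosine prefers the perfectly aligned short_doc; dot product prefers the larger long_doc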

Suitable models for symmetric semantic search:

  • paraphrase-distilroberta-base-v1 / paraphrase-xlm-r-multilingual-v1
  • quora-distilbert-base / quora-distilbert-multilingual
  • distiluse-base-multilingual-cased-v2

Suitable models for asymmetric semantic search:

  • msmarco-distilbert-base-v2
  • msmarco-bert-base-v3

More details here: https://www.sbert.net/docs/pretrained_models.html

To understand why S-BERT, not BERT I highly recommend the paper here: https://arxiv.org/pdf/1908.10084.pdf

Solution to Q2
At this point, we have an understanding of our data and accordingly, we have selected the embeddings model. Now next we need to understand how to encode data and what other information we need to store with encodings that will be helpful in retrieving search results.

For storage, we have options like :
(a) ElasticSearch: a very good option if we need to store a lot of meta-information and want to run cross-cluster searches. Having said that, it can cost a lot, and maintaining it in production at scale may require another dedicated expert.
(b) FAISS: (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves the limitations of traditional query search engines that are optimized for hash-based searches and provides more scalable similarity search functions.
(c) Annoy: a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mapped into memory so that many processes may share the same data.

What you choose is completely up to you and your requirements. I chose FAISS for now: it is easy to use (Pythonic), and its GPU support makes it blazing fast.
The only limitation with FAISS is that you have to maintain your metadata separately in some database, with a mapping to your FAISS IDs.
Read here: https://medium.com/swlh/add-similarity-search-to-dynamodb-with-faiss-c68eb6a48b08
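A minimal sketch of that ID-to-metadata mapping; here a plain in-memory dict stands in for a real database (DynamoDB in the linked article), and the document list is made up for illustration:

# Illustrative documents; in practice these come from your corpus
documents = [{"id": 0, "title": "Movie A", "plot": "..."},
             {"id": 1, "title": "Movie B", "plot": "..."}]

# Lightweight mapping from FAISS integer IDs to metadata.
# In production this lookup would live in a real database keyed by the same IDs.
id_to_meta = {doc["id"]: {"title": doc["title"]} for doc in documents}

faiss_id = 1                      # an ID returned by index.search(...)
print(id_to_meta[faiss_id])       # fetch the metadata for that hit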

Get Started — Build Something Quick

DataSet: Wikipedia Movie Plots
Content: The dataset contains descriptions of 34,886 movies from around the world. Column descriptions are listed below:

  • Release Year — The year in which the movie was released
  • Title — Movie title
  • Origin/Ethnicity — Origin of the movie (i.e. American, Bollywood, Tamil, etc.)
  • Director — Director(s)
  • Cast — Main actors and actresses
  • Genre — Movie Genre(s)
  • Wiki Page — URL of the Wikipedia page from which the plot description was scraped
  • Plot — Long-form description of movie plot (WARNING: May contain spoilers!!!)

We want to build a search bar for users to search for movies, and for simplicity let us consider searching the movie plot field only. Users can type in keywords or sentences describing the movie they have in mind, and we return the best matches by closely understanding the query and the movie plots.

Before encoding, let’s see how long this text information is.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Word count of every plot, plus a rough cut-off at mean + 1 std
df['doc_len'] = df['Plot'].apply(lambda words: len(words.split()))
max_seq_len = np.round(df['doc_len'].mean() + df['doc_len'].std()).astype(int)
sns.distplot(df['doc_len'], hist=True, kde=True, color='b', label='doc len')
plt.axvline(x=max_seq_len, color='k', linestyle='--', label='max len')
plt.title('plot length'); plt.legend()
plt.show()
Movie Plot Length Distribution

There are a couple of ways to handle the scenario where the document length exceeds 512 tokens:
(1) The easiest (and scariest): truncate everything beyond 512 tokens.
(2) Run extractive or abstractive summarisation.
(3) Average the pooled embeddings of document chunks (a rough sketch of this option appears after the note below).

For now we will choose the scariest one. I went through some plot samples and concluded that truncating at the maximum sequence length should suffice for the movie search we are building.
As discussed above, we are dealing with asymmetric search and trying to retrieve long passages, so a dot-product model suits our case well. For this experiment, I am choosing the sentence-transformers/msmarco-distilbert-base-dot-prod-v3 model, which performs well on asymmetric semantic search tasks and is considerably faster than BERT because it is a much smaller model. This model was optimized to be used with dot-product as the similarity function between queries and documents.
Note: If you have short descriptions, "distilbert-base-nli-stsb-mean-tokens" works a lot better and faster.
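For completeness, here is a rough sketch of option (3), averaging chunk embeddings; the chunk size of 200 words is an arbitrary assumption, not something tuned for this dataset:

import numpy as np

def embed_long_document(text, model, chunk_words=200):
    # Split the document into fixed-size word chunks, embed each chunk, and mean-pool the result
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    chunk_embeddings = model.encode(chunks)
    return np.mean(chunk_embeddings, axis=0)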

Search Architecture
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')

Since we are using FAISS, storing embeddings is easy once we have settled on the index type and the ID mappings.

import faiss
import numpy as np

# Encode every plot into a 768-dimensional vector and store them in an exact inner-product index
encoded_data = model.encode(df.Plot.tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(df))))
faiss.write_index(index, 'movie_plot.index')

We have encoded our movie plots, with each plot represented by a 768-dimensional vector, and stored the index to disk as movie_plot.index.
Note that we used index.add_with_ids, which adds the vectors in dataframe order and stores the corresponding row indices as FAISS IDs.
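When serving queries from another process (or after a restart), the persisted index can simply be read back from disk, for example:

import faiss

# Load the persisted index back into memory
index = faiss.read_index('movie_plot.index')
print(index.ntotal)   # should equal len(df)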

Let’s write a couple of utility functions that will help encode user queries and fetch similar movies from the FAISS index directory.

import time

def fetch_movie_info(dataframe_idx):
    # Map a FAISS ID back to its dataframe row and keep only the fields we display
    info = df.iloc[dataframe_idx]
    meta_dict = dict()
    meta_dict['Title'] = info['Title']
    meta_dict['Plot'] = info['Plot'][:500]
    return meta_dict

def search(query, top_k, index, model):
    # Encode the user query, search the FAISS index, and map the hits back to movies
    t = time.time()
    query_vector = model.encode([query])
    top_k = index.search(query_vector, top_k)
    print('>>>> Results in Total Time: {}'.format(time.time() - t))
    top_k_ids = top_k[1].tolist()[0]
    top_k_ids = list(np.unique(top_k_ids))
    results = [fetch_movie_info(idx) for idx in top_k_ids]
    return results

Function: search
query: string -> the user-typed query
top_k: integer -> number of results to return
index: FAISS index -> index to query for results
model: SBERT model -> model used to encode the user query

Function: fetch_movie_info
dataframe_idx: integer -> index value returned from movie_plot.index, used to map back to our dataframe and fetch additional information

query = "Artificial Intelligence based action movie"
results = search(query, top_k=5, index=index, model=model)

print("\n")
for result in results:
    print('\t', result)

query = "movie about romance and pain of separation"
results = search(query, top_k=5, index=index, model=model)

print("\n")
for result in results:
    print('\t', result)

I tried running a few fairly vague queries and got some decent results. Not impressive yet, but simply by narrowing down the task the model was trained and fine-tuned on, and matching it to our use case, we already get reasonable results.

Not impressed

Fine Tune

Solution to Q4
We could have easily fine-tuned a sentence-transformers model on our dataset if we had (query, relevant passage) pairs. But you would not have this data when building something from ground zero. (We are not dealing with pre-training of transformer models here; that is expensive, requires a huge amount of data, and is very domain-dependent.)

At this point, we have nothing. But wait, what about all the textual information in the movie plots? Can we devise some unsupervised approach to fine-tune our model on our dataset?

1. Synthetic Query Generation

We use synthetic query generation to achieve our goal. We start with the passages in our document collection and, for each of them, generate possible queries that users might ask or search for.
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models presented a method to learn (or adapt) a model for asymmetric semantic search without requiring training data.

Paper: https://arxiv.org/abs/2104.08663
GitHub: https://github.com/UKPLab/beir

This paper has some amazing ideas on Multi-Task Learning and fine-tuning.
From the paper
GenQ Similar to past work (Liang et al., 2020; Ma et al., 2021), we propose an unsupervised domain-adaption approach for dense retrieval models using synthetic queries. First, we fine-tune a T5-base (Raffel et al., 2020) model to generate queries given a passage. We use the MSMARCO dataset and train for 2 epochs. Then, for a target corpus we generate 5 queries for each document using a combination of top-k and nucleus-sampling (top-k: 25; top-p: 0.95). Due to resource constraints, we cap the maximum number of target documents in each dataset to 100K. We found the T5 model to perform better compared to BART (Lewis et al., 2020) and our decoding setting better as compared to beam-search. For retrieval, we continue to fine-tune the SBERT model (section 4.1.3) on the synthetic queries and document pairs. Note, this creates an independent model for each task.

Inspired by the paper, we will use a similar fine-tuning technique for SBERT. Let’s see the synthetic query generation code first.
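The generation snippet below assumes that a query-generation model, its tokenizer, a device, and a list of paragraphs are already defined. A minimal setup sketch for that (assuming the BeIR/query-gen-msmarco-t5-base-v1 checkpoint used in the SBERT examples, and whole plots as the paragraphs) might look like this:

import torch
from tqdm import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Assumption: the MSMARCO-trained T5 query generator released with the SBERT examples
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-base-v1')
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-base-v1')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)
model.eval()

# paragraphs: the movie-plot chunks we want queries for (simplified here to whole plots)
paragraphs = df.Plot.tolist()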

# Parameters for generation
batch_size = 16               # Batch size
num_queries = 5               # Number of queries to generate for every paragraph
max_length_paragraph = 512    # Max length for paragraph
max_length_query = 64         # Max length for output query

def _removeNonAscii(s):
    return "".join(i for i in s if ord(i) < 128)

with open('generated_queries_all.tsv', 'w') as fOut:
    for start_idx in tqdm(range(0, len(paragraphs), batch_size)):
        sub_paragraphs = paragraphs[start_idx:start_idx + batch_size]
        inputs = tokenizer.prepare_seq2seq_batch(sub_paragraphs, max_length=max_length_paragraph, truncation=True, return_tensors='pt').to(device)
        outputs = model.generate(
            **inputs,
            max_length=max_length_query,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=num_queries)

        for idx, out in enumerate(outputs):
            query = tokenizer.decode(out, skip_special_tokens=True)
            query = _removeNonAscii(query)
            para = sub_paragraphs[int(idx / num_queries)]
            para = _removeNonAscii(para)
            fOut.write("{}\t{}\n".format(query.replace("\t", " ").strip(), para.replace("\t", " ").strip()))

The paragraphs fed to this code are simply chunks of the movie plots, and each chunk gets up to 5 synthetically generated queries. If you think about it for a moment, what we are doing is representing the information present in each paragraph as questions, and then using these (query, paragraph) tuples to fine-tune an S-BERT model so that it captures the semantic and syntactic mapping between them.

2. Bi-Encoder Fine-tuning

SBERT is a siamese bi-encoder using mean pooling for encoding and cosine-similarity for retrieval.
SentenceTransformers was designed in such a way that fine-tuning your own sentence/text embedding models is easy. It provides most of the building blocks that we can stick together to tune embeddings for our specific task.
We can create the network architecture from scratch by defining the individual layers. For example, the following code creates such an architecture:

from sentence_transformers import SentenceTransformer, models

# Now we create a SentenceTransformer model from scratch: transformer encoder + mean pooling
word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

Note: Here I am using the same “sentence-transformers/msmarco-distilbert-base-dot-prod-v3” model for fine-tuning that I used in the first half. No changes there.
Here I want to talk about the loss function we will be using. sentence_transformers.losses defines different loss functions that can be used to fine-tune the network on training data. The loss function plays a critical role when fine-tuning the model: it determines how well our embedding model will work for the specific downstream task. Sadly, there is no “one size fits all” loss function. Which loss function is suitable depends on the available training data and the target task.

MultipleNegativesRankingLoss is a great loss function if you only have positive pairs, for example, only pairs of similar texts like pairs of paraphrases, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Paper: https://arxiv.org/pdf/1705.00652.pdf

This loss expects as input a batch consisting of sentence pairs (a_1, p_1), (a_2, p_2)…, (a_n, p_n) where we assume that (a_i, p_i) are a positive pair and (a_i, p_j) for i!=j a negative pair. For each a_i, it uses all other p_j as negative samples, i.e., for a_i, we have 1 positive example (p_i) and n-1 negative examples (p_j). It then minimizes the negative log-likelihood for softmax normalized scores. This loss function works great to train embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc)) as it will sample in each batch n-1 negative docs randomly. The performance usually increases with increasing batch sizes. You can also provide one or multiple hard negatives per anchor-positive pair by structuring the data like this: (a_1, p_1, n_1), (a_2, p_2, n_2). Here, n_1 is a hard negative for (a_1, p_1). The loss will use for the pair (a_i, p_i) all p_j (j!=i) and all n_j as negatives.
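Conceptually, this loss is just a cross-entropy over the in-batch similarity matrix; the small sketch below illustrates the idea (it is not the library's internal implementation):

import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchor_emb, positive_emb, scale=20.0):
    # Similarity matrix: row i holds sim(a_i, p_j) for every p_j in the batch
    scores = anchor_emb @ positive_emb.T * scale
    # The correct "class" for anchor i is its own positive p_i (the diagonal)
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores, labels)

# toy batch of 4 anchor/positive embedding pairs, dimension 8
a = F.normalize(torch.randn(4, 8), dim=1)
p = F.normalize(torch.randn(4, 8), dim=1)
print(multiple_negatives_ranking_loss(a, p))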

But why not CosineSimilarityLoss? The simplest approach would be to have knowledge tuples annotated with a score indicating their similarity, e.g. on a scale of 0 to 1, and train the network with a siamese architecture. But we do not have labeled scores.
That the softmax loss with NLI data produces (relatively) good sentence embeddings is rather coincidental. MultipleNegativesRankingLoss is much more intuitive and also produces significantly better sentence representations.

The training data for MultipleNegativesRankingLoss consists of sentence pairs [(a1, b1), …, (an, bn)] where we assume that (ai, bi) are similar sentences and (ai, bj) are dissimilar sentences for i != j. The loss minimizes the distance between (ai, bi) while simultaneously maximizing the distance between (ai, bj) for all i != j.

For example in the following picture:

The distance between (a1, b1) is reduced, while the distance between (a1, b2…5) will be increased. The same is done for a2, …, a5.

Using MultipleNegativesRankingLoss with NLI data is rather easy: we define sentences that have an entailment label as positive pairs. For example, we have pairs like (“A soccer game with multiple males playing.”, “Some men are playing a sport.”) and want these pairs to be close in vector space.

from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
from torch import nn
import os

train_examples = []
with open('../input/user-query-data/generated_queries_all (1).tsv') as fIn:
    for line in fIn:
        try:
            query, paragraph = line.strip().split('\t', maxsplit=1)
            train_examples.append(InputExample(texts=[query, paragraph]))
        except:
            pass

# For MultipleNegativesRankingLoss, it is important
# that the batch does not contain duplicate entries, i.e.
# no two equal queries and no two equal paragraphs.
# To ensure this, we use a special data loader
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)

# MultipleNegativesRankingLoss requires input pairs (query, relevant_passage)
# and trains the model so that it is suitable for semantic search
train_loss = losses.MultipleNegativesRankingLoss(model)

The fine-tuning setup is now ready: we have established the model architecture and the loss function, and why we use them.

#Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)

os.makedirs('search', exist_ok=True)
model.save('search/search-model')

Once the tuning is done, do not forget to re-run the FAISS encoding and indexing on the dataset with this fine-tuned model.
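A quick sketch of that re-indexing step with the saved model (the new index filename is just an illustrative choice):

from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Reload the fine-tuned model and rebuild the FAISS index from scratch
model = SentenceTransformer('search/search-model')
encoded_data = np.asarray(model.encode(df.Plot.tolist()), dtype='float32')

index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(df))))
faiss.write_index(index, 'movie_plot_finetuned.index')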

Once done, I fired the same queries at the new model and voilà!! The search results improved a lot. If you are not convinced, go through the returned plots and you will see the difference.

Results for “Artificial intelligence-based action movie” and “movie about romance and the pain of separation”: somewhat impressed now

This is only half the battle won; we have just scratched the surface of search. We still have not talked about retrieval quality, re-ranking, and personalization.

In the next post, I will discuss in detail the strategy of re-ranking and improving search results with user behavior. Till then keep learning.
Please do comment with your thoughts and inputs.

Cheers!!

Want to discuss this article or your semantic search project with me? Talk to me here: https://topmate.io/subir_verma

Youtube Link: https://youtu.be/Yo4NqGPISXQ

Note: All thanks to the SBERT paper and their Documentation.

Also if you’re looking for a Fast-API-based project skeleton for your ML/DL-based project, do check out

RedisAI vector Search demo — https://github.com/RedisAI/vecsim-demo

