Transformer-based Sequential Denoising Auto-Encoder: Job2Vec

Subir Verma
Apr 17, 2022

Using TSDAE to convert job descriptions into vectors for job search

TSDAE is a strong domain-adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like masked language modeling (MLM).

TSDAE trains sentence embeddings by adding a certain type of noise (e.g. deleting or swapping words) to input sentences, encoding the damaged sentences into fixed-sized vectors, and then reconstructing the vectors into the original input.

An important difference from the original transformer encoder-decoder setup presented in Vaswani et al. (2017) is the information available to the decoder: the decoder decodes only from a fixed-size sentence representation produced by the encoder. It does not have access to all contextualized word embeddings from the encoder. This modification introduces a bottleneck that should force the encoder to produce a meaningful sentence representation.

S-BERT Implementation of TSDAE

They add noise to the input text by deleting about 60% of its words. The encoder maps this input to a fixed-size sentence embedding. A decoder then tries to re-create the original text without the noise. Later, the encoder is used as the sentence embedding method.

During training, TSDAE encodes damaged sentences into fixed-sized vectors and requires the decoder to reconstruct the original sentences from these sentence embeddings. For good reconstruction quality, the semantics must be captured well in the sentence embeddings from the encoder. Later, at inference, we only use the encoder for creating sentence embeddings.

How do they add noise?

Below is the magic function that handles the noise (delete) part.
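The delete function that ships with sentence-transformers' DenoisingAutoEncoderDataset looks roughly like this:

```python
import numpy as np
import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer

# needs nltk.download("punkt") once for word_tokenize

def delete(text, del_ratio=0.6):
    words = nltk.word_tokenize(text)
    n = len(words)
    if n == 0:
        return text

    # Keep each word with probability (1 - del_ratio)
    keep_or_not = np.random.rand(n) > del_ratio
    if sum(keep_or_not) == 0:
        keep_or_not[np.random.choice(n)] = True  # guarantee at least one word remains
    return TreebankWordDetokenizer().detokenize(np.array(words)[keep_or_not])
```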

The Treebank detokenizer uses the reverse regex operations corresponding to the Treebank tokenizer’s regexes.

Sample Output of Delete Function
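An illustrative run (the exact words dropped vary from run to run, since deletion is random):

```python
text = "We are looking for a data engineer with strong SQL and Python skills."
print(delete(text, del_ratio=0.6))
# one possible output: 'looking for engineer strong Python'
```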

But why delete only?

From the research paper: the authors compare different noise types (deleting, swapping, and masking words) and different noise ratios, and find that word deletion with a ratio of 0.6 works best.

Loss Function: DenoisingAutoEncoderLoss

This loss expects as input a batch consisting of damaged sentences and the corresponding original ones.

A batch of data is constructed using DenoisingAutoEncoderDataset, which pairs each sentence with a noised copy of itself in the format texts=[noise_fn(sentence), sentence].

- sentences: a list of sentences
- noise_fn: a noise function; given a string, it returns a string with noise, e.g. with words deleted
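A quick way to see this in action (the sentence here is just an example):

```python
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

dataset = DenoisingAutoEncoderDataset(["We need a backend engineer with Go experience."])
example = dataset[0]   # an InputExample
print(example.texts)   # [noised sentence, original sentence]
```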

The diagram below will give you better intuition: instead of masking, we are now denoising.

Let’s get started: Job2Vec

Note: This is not the best way to convert jobs to vectors, since a job description can contain multiple entities relevant to different job requirements. Ideally, an NER layer would also be part of the solution. Still, this approach can work surprisingly well for a lot of use cases.

I have the following dataset of job requirements with their respective titles and departments.
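For the snippets below, assume the data is loaded into a DataFrame (the file name and column names here are placeholders):

```python
import pandas as pd

df = pd.read_csv("jobs.csv")  # assumed columns: title, department, requirements
print(df.head())
```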

I did basic cleanup on the requirements text. Below is the function.
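The original gist isn't reproduced here; a minimal sketch of what such a cleanup function might do:

```python
import re

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
    text = re.sub(r"http\S+", " ", text)                  # strip URLs
    text = re.sub(r"[^A-Za-z0-9.,;:()\-\s]", " ", text)   # drop stray symbols
    return re.sub(r"\s+", " ", text).strip()              # collapse whitespace

df["requirements"] = df["requirements"].astype(str).map(clean_text)
```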

After cleanup, the document-length distribution looks something like this
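A histogram like this takes only a couple of lines:

```python
import matplotlib.pyplot as plt

df["requirements"].str.split().map(len).hist(bins=50)
plt.xlabel("words per requirement text")
plt.ylabel("count")
plt.show()
```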

We can now directly utilize this sentence list as our input to the model.
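The training code follows the standard sentence-transformers TSDAE recipe; a sketch, assuming train_sentences is the cleaned list of requirement texts:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from sentence_transformers.losses import DenoisingAutoEncoderLoss

train_sentences = df["requirements"].tolist()

# Encoder: BERT backbone with CLS pooling (recommended for TSDAE)
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each item is an InputExample(texts=[noisy sentence, original sentence])
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# The decoder is tied to the encoder weights and reconstructs the original text
train_loss = DenoisingAutoEncoderLoss(model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("job2vec-tsdae")
```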

After training it for 5 epochs, this is what the embeddings look like for a few data samples.
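Inspecting the embeddings is a one-liner:

```python
embeddings = model.encode(train_sentences[:5])
print(embeddings.shape)  # (5, 768) for a BERT-base encoder
```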

Not Happy?

I tried indexing the embeddings with FAISS and searching for jobs by requirements similarity. The results look promising to begin with.
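A minimal sketch of the FAISS search (the query string is just an example; model and df come from the earlier steps):

```python
import faiss

corpus = model.encode(train_sentences, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(corpus)                  # inner product on unit vectors == cosine similarity
index = faiss.IndexFlatIP(corpus.shape[1])
index.add(corpus)

query = model.encode(["strong Python, SQL and machine learning background"], convert_to_numpy=True).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {df.iloc[i]['title']}")
```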

Star the code repository here:

Happy learning!!

Follow me for more cool and exciting content ahead.

Check out this video: https://youtu.be/Yo4NqGPISXQ
