In the first part, we learned the fundamentals of neural search. Let's now move on to the hands-on tutorial.
## Which model could be used?
It is ideal to use a model specially trained to determine the closeness of meanings, for example, a model trained on Semantic Textual Similarity (STS) datasets. Current state-of-the-art models can be found on this leaderboard.
However, you are not limited to specially trained models. If a model is trained on a large enough dataset, its internal features can work as embeddings too. For instance, you can take any model pre-trained on ImageNet and cut off its last layer. As a rule, the penultimate layer of a neural network forms the highest-level features, which do not yet correspond to specific classes. The output of this layer can be used as an embedding.
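As a minimal sketch of this idea, here is how one could take a ResNet pre-trained on ImageNet and replace its classification head with an identity, so the forward pass returns penultimate-layer features. The use of torchvision and the image file name are assumptions of this example, not something prescribed by the tutorial:

```python
import torch
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Load a ResNet-18 pre-trained on ImageNet
model = models.resnet18(pretrained=True)
# Replace the final classification layer with an identity,
# so the network outputs penultimate-layer features instead of class scores
model.fc = torch.nn.Identity()
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical file
with torch.no_grad():
    embedding = model(image)  # shape: (1, 512) for ResNet-18
```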
## What tasks is neural search good for?
Neural search has the greatest advantage in areas where the query cannot be formulated precisely. Querying a table in a SQL database is not the best use case for neural search.
Conversely, if the query itself is fuzzy, or cannot be formulated as a set of conditions, neural search can help you. If the search query is a picture, a sound file, or a long text, neural search is almost the only option.
If you want to build a recommendation system, the neural approach can also be useful. A user's actions can be encoded in vector space in the same way as a picture or a text. Having those vectors, it is possible to find semantically similar users and predict the next probable user actions.
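As an illustrative sketch of that last step: once user vectors exist, finding similar users reduces to nearest-neighbor search by cosine similarity. The vectors below are random placeholders; how user actions are actually encoded is a separate modeling question:

```python
import numpy as np

# Placeholder user embeddings; in practice these would come
# from a model that encodes user actions into vectors
user_vectors = np.random.rand(1000, 64).astype(np.float32)
query_user = user_vectors[0]

# Cosine similarity between the query user and every user
norms = np.linalg.norm(user_vectors, axis=1) * np.linalg.norm(query_user)
similarities = user_vectors @ query_user / norms

# Indices of the 5 most similar users, excluding the query user itself
top_similar = np.argsort(-similarities)[1:6]
```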
## Let's build our own
With all that said, let's build our own neural search. As an example, we will make a search engine for startups based on their descriptions. In this demo, we will see cases where full-text search works better and cases where neural search works better.
We will use data from startups-list.com. Each record contains the name, a paragraph describing the company, its location, and a picture. The raw parsed data can be found at this link.
## Prepare data for neural search
To be able to search our descriptions in vector space, we first need to encode them into a vector representation. As the descriptions are textual data, we can use a pre-trained language model. As mentioned above, for the task of text search there is a whole set of pre-trained models specifically tuned for semantic similarity.
One of the easiest libraries for working with pre-trained language models is, in my opinion, sentence-transformers by UKPLab. It provides a convenient way to download and use many pre-trained models, mostly based on the transformer architecture. The transformer is not the only architecture suitable for neural search, but for our task it is quite sufficient.
We will use a model called **distilbert-base-nli-stsb-mean-tokens**. DistilBERT means that the size of this model has been reduced by a special technique (distillation) compared to the original BERT, which is important for the speed of our service and its demand for resources. The word stsb in the name means that the model was fine-tuned for the Semantic Textual Similarity Benchmark task.
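Here is a minimal sketch of loading this model with sentence-transformers and encoding a single description (the example sentence is made up):

```python
from sentence_transformers import SentenceTransformer

# Downloads the model on first use and caches it locally
model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Encode a description into a fixed-size vector
vector = model.encode("SaaS platform for managing remote teams")
print(vector.shape)  # (768,) for this model
```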
The complete code for data preparation, with detailed comments, can be found and run in the Colab notebook.
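For orientation, here is a condensed sketch of what such a preparation step could look like. The input file name, field name, batch size, and output file name are assumptions of this sketch; the notebook linked above is the authoritative version:

```python
import json

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distilbert-base-nli-stsb-mean-tokens")

# Read the raw newline-delimited JSON records and collect descriptions
# (assuming each record has a "description" field)
descriptions = []
with open("startups_demo.json") as fd:  # assumed filename
    for line in fd:
        descriptions.append(json.loads(line)["description"])

# Encode all descriptions into vectors; batching keeps memory usage bounded
vectors = model.encode(descriptions, batch_size=64, show_progress_bar=True)

# Persist the embeddings so the search service can load them later
np.save("startup_vectors.npy", vectors, allow_pickle=False)
```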
To be continued...