Indexing the contents of files not by the terms they contain but by their meaning: that is Nuclia’s remarkable promise. This Spanish startup, which LeMagIT met at an IT Press Tour event earlier this month, claims to have developed an engine that “vectorizes” information, regardless of the languages and file formats a company uses.
“A classic search engine will list the English documents that contain the English words you search for. Ours is able to understand your question and show the answer in the French or Italian documents that you have, or even answer it directly by summarizing the content of the documents that contain the answer,” explains Eudald Camprubí, CEO and co-founder of Nuclia.
He gives an example. In a classic search engine, a search for “Nuclia creation date” yields a list of documents that contain that phrase. Nuclia’s engine gives the exact date and offers links that point directly to the exact location – the paragraph, the sentence – in the documents that mention this information.
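The difference between the two approaches can be illustrated with a toy sketch. It is not Nuclia’s engine: the word vectors below are hand-made assumptions (a real system learns them from data), the two sample documents are invented for the illustration, and the function names are hypothetical. It only shows why a keyword match on an English query misses a French document that a vector comparison can still rank first.

```python
import math
import re

# Toy concept axes: (founding, date, price). Hand-made for illustration only;
# a real engine would learn its vectors from data.
WORD_VECTORS = {
    "creation": (1.0, 0.0, 0.0), "création": (1.0, 0.0, 0.0),
    "fondée": (1.0, 0.0, 0.0),
    "date": (0.0, 1.0, 0.0), "année": (0.0, 1.0, 0.0),
    "price": (0.0, 0.0, 1.0), "prezzo": (0.0, 0.0, 1.0),
}

# Invented sample documents, one French and one Italian.
DOCUMENTS = {
    "fr": "Nuclia a été fondée en 2019, année de sa création à Barcelone.",
    "it": "Il prezzo annuale parte da 5000 euro.",
}

def tokenize(text):
    """Lowercase the text and split it into words (accents are kept)."""
    return re.findall(r"\w+", text.lower())

def embed(text):
    """Sum the toy vectors of all known words; unknown words are ignored."""
    vec = [0.0, 0.0, 0.0]
    for word in tokenize(text):
        for i, value in enumerate(WORD_VECTORS.get(word, (0.0, 0.0, 0.0))):
            vec[i] += value
    return vec

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def keyword_search(query):
    """Classic behavior: keep documents that contain every query term."""
    terms = tokenize(query)
    return [doc_id for doc_id, text in DOCUMENTS.items()
            if all(term in tokenize(text) for term in terms)]

def semantic_search(query):
    """Rank all documents by similarity of meaning, best match first."""
    q = embed(query)
    return sorted(DOCUMENTS, key=lambda d: cosine(q, embed(DOCUMENTS[d])),
                  reverse=True)

print(keyword_search("Nuclia creation date"))   # no document has all terms
print(semantic_search("Nuclia creation date"))  # French document ranks first
```

In this sketch the keyword search returns nothing because the French text never contains the English words “creation” or “date”, while the vector comparison still places the French document ahead of the unrelated Italian one.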
Understanding the meaning of texts, scanned documents and soundtracks
Nuclia’s other strength is that it also works with image files – it has an OCR engine to extract text from them, for example in scanned PDF documents – and with videos. In the latter case, it extracts the soundtrack and converts it to text with an internal Speech-to-Text engine. Better yet, Nuclia’s engine is not limited to locally stored documents. It parses all content accessible through a web address.
“If you reference online spaces, file shares, S3 object storage, or even public YouTube videos in the data pool to be indexed, our engine will parse them and include them in its knowledge base. Thus, among the answers it will give you, you will find links to a paragraph in a Word document, to a page in a PDF document or to a specific sequence in a video,” explains Eudald Camprubí.
However, the content must come down to text. Nuclia’s engine is not capable of interpreting the meaning of a photo or a filmed scene.
Technically, Nuclia primarily consists of a client – Nuclia Desktop – to be installed on a machine that has access to the storage to be indexed. In addition to serving as a local search engine, the client collects the data and hands it over to a data extractor that includes all the file parsers, OCR, audio-to-text conversion, and language translation. A second engine “vectorizes” the information, classifies its findings, and generates summaries.
All results are stored in an internal database, Nuclia DB. This can be queried via API – Nuclia offers SDKs for developing compatible applications yourself, including one for building graphical interfaces – or via natural language queries. Nuclia DB is also available as open source.
All of these modules can run on-premises or online.
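The chain described above (collect, extract, vectorize, store) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Nuclia’s code: the class and function names, the sample URI and the placeholder extractors are all hypothetical, and the OCR and speech-to-text steps are simple stand-ins that return the text they are given. The point is that each stored entry keeps its source and position, which is what allows an answer to link back to a precise paragraph.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    uri: str       # where the content lives (hypothetical example below)
    kind: str      # "text", "scan" or "video"
    payload: str   # stand-in for the raw file content

@dataclass
class Paragraph:
    uri: str
    position: int  # index of the paragraph inside its source
    text: str

def read_text(payload):
    return payload

def placeholder_ocr(payload):
    # Assumption: a real OCR engine would turn pixels into text here.
    return payload

def placeholder_speech_to_text(payload):
    # Assumption: a real engine would transcribe the soundtrack here.
    return payload

EXTRACTORS = {
    "text": read_text,
    "scan": placeholder_ocr,
    "video": placeholder_speech_to_text,
}

def extract(resource):
    """Turn any supported resource into a list of located paragraphs."""
    text = EXTRACTORS[resource.kind](resource.payload)
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    return [Paragraph(resource.uri, i, b) for i, b in enumerate(blocks)]

def build_index(resources, vectorize):
    """Store one (vector, paragraph) pair per extracted paragraph."""
    store = []
    for resource in resources:
        for paragraph in extract(resource):
            store.append((vectorize(paragraph.text), paragraph))
    return store

# Usage with a trivial stand-in vectorizer (text length only).
store = build_index(
    [Resource("s3://example-bucket/report.pdf", "scan", "First part.\n\nSecond part.")],
    vectorize=lambda text: [float(len(text))],
)
for vector, paragraph in store:
    print(paragraph.uri, paragraph.position, vector)
```

Keeping the vectorizer as a parameter reflects the separation between the extractor and the second engine mentioned in the text; any embedding function could be plugged in without touching the extraction step.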
Use cases that go beyond keyword search
“All this technology allows us to address use cases that go beyond the limits of document search. For example, you can automatically detect sensitive data, which falls under the GDPR, and write a script to anonymize it in real time. You can automatically analyze the messages your customers leave on your voicemail or on social networks and quickly trigger responses from your departments, etc.,” says Eudald Camprubí.
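To make the anonymization use case concrete, here is a minimal sketch of what such a script could look like once the text has been extracted. It is an assumption-laden illustration, not Nuclia’s method: it relies on two simple regular expressions (for e-mail addresses and phone numbers) written for this example, whereas detecting names or other personal data would require a trained model. The function name and placeholders are hypothetical.

```python
import re

# Illustrative patterns only; real-world personal data is far more varied.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def anonymize(text):
    """Replace e-mail addresses and phone numbers with neutral placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

message = "Contact Ana at ana@example.com or +34 600 123 456."
print(anonymize(message))
# prints: Contact Ana at [EMAIL] or [PHONE].
```

Note that the first name stays in the output: pattern matching alone does not recognize it as personal data, which is where an engine that understands the meaning of the text would be needed.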
The CEO says he sold early versions of his technology to major US accounts, including Facebook and Electronic Arts, primarily to meet legal needs. It was only then, in 2019, that he decided to found Nuclia and set up its headquarters in Barcelona, under European law. Since then, its clients have included several European administrations, pharmaceutical research centers, entities specializing in customer relations and training centers.
The latter, for example, produce video lessons. Thanks to Nuclia, they can now equip their platform with a search engine that displays the exact passage of a course that relates to the requested subject.
Eudald Camprubí now hopes to expand Nuclia’s clientele among European private companies, with prices ranging from €5,000 to €60,000 per year depending on their size. “In addition to selling our ready-to-use solution, we expect Nuclia DB to become the default database for Hugging Face, the community portal that brings together all developments around artificial intelligence,” he concludes.
The IT Press Tour event during which this meeting took place, in Lisbon, aimed to present to the press only European startups that innovate in storage. These startups should therefore be better placed to respond to the sovereignty needs of EU companies.