Descriptive AI vs Generative AI – Overview of privacy aspects

The recent hype around Generative AI has generated enormous interest in AI overall. Many customers are struggling to understand whether this much-hyped technology is right for them. In this blog series, I will highlight the differences between Generative AI and Descriptive AI. I believe this will help many readers better understand when they should use Generative AI and when Descriptive AI is better suited to the business challenge.

Let’s start with definitions: 

Descriptive AI is an AI model that has been pre-trained and fine-tuned to generate metadata from unstructured input data such as business documents. Typically, Descriptive AI can extract complex metadata with high accuracy and classify documents along multiple dimensions as needed.

Generative AI is an AI model that has been pre-trained and fine-tuned to generate responses that a human accepts as valid communication. Generative AI models are extremely good at creating text snippets for reports and other written communication. They are great writing aids and can be used to augment chatbots to great effect.
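To make the contrast concrete, here is a purely illustrative sketch (the input document, field names, and both outputs are invented for this post, not taken from any real product or API) of what each kind of model returns for the same input:

```python
# Purely illustrative: the document and both outputs are invented to
# show the shape of each model's response.
document = "Invoice 2023-0042 from Acme Oy, due 2023-11-30, total 1,250.00 EUR."

# A descriptive model returns structured metadata describing the input:
descriptive_output = {
    "document_type": "invoice",     # classification
    "invoice_number": "2023-0042",  # extracted metadata
    "supplier": "Acme Oy",
    "due_date": "2023-11-30",
}

# A generative model returns free-form text that a human accepts as
# valid communication:
generative_output = (
    "Dear Acme Oy, we confirm receipt of invoice 2023-0042 and will "
    "settle the amount of 1,250.00 EUR by the due date."
)
```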

The differences between Generative AI and Descriptive AI from the ground up

Let’s investigate the differences between Generative AI and Descriptive AI from the ground up. The first major difference is in how the base model is trained and how this affects privacy. I claim that, by its very nature, Generative AI cannot be made privacy-safe. Ever. Generative AI must be trained with real data containing privacy information so that it can generate such information as needed. Consider the following question: Who was the president of the United States in 1980? Answer: Jimmy Carter. For Generative AI to have this kind of general knowledge, it must be trained on plain text containing privacy information relating to countless individuals, and the model must be able to generate that privacy information.

Descriptive AI like ElinarAI can be made privacy-safe. There are two significant factors in how to do this:

  • Firstly, we can train foundational LLMs using heavily pseudonymized data. Since a Descriptive AI works by describing a given input, the actual instances of privacy information are irrelevant to the model and can be replaced with context-relevant placeholders, as sketched in the example after this list.
  • Secondly, the training process for foundational Descriptive AIs, as well as process-specific fine-tuning, is completely different from that of Generative AI. Generative AI must be trained on sequences of text (“generate this output from this input”), so the models are trained to generate text that contains privacy data. Period. This leads to situations where the model outputs undesired data, and any reinforcement merely decreases the odds of such occurrences. Descriptive AI foundational models, on the other hand, are trained to describe the input, which by nature makes them unable to hallucinate a random individual into the response.
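Here is a minimal sketch of the pseudonymization idea. The regexes and placeholder scheme below are toy illustrations of my own, not Elinar’s actual preprocessing; a real pipeline would rely on NER models and far more robust rules:

```python
import re

# Toy illustration: replace instances of privacy information with
# typed, context-relevant placeholders before the text is used in
# LLM training.
PATTERNS = {
    "PERSON": re.compile(r"\b(Jimmy Carter|Jane Doe|John Smith)\b"),
    "SSN":    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def pseudonymize(text: str) -> str:
    """Replace privacy information with placeholders of the same type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

sample = "Contact Jane Doe (jane.doe@example.com, SSN 123-45-6789)."
print(pseudonymize(sample))
# Contact <PERSON> (<EMAIL>, SSN <SSN>).
```

The point of the placeholders is that the model still learns that a person, an email address, or an ID number belongs in a given context, without ever seeing the real values.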

Elinar has spent a significant amount of R&D effort since 2015 developing a unique way to preprocess training data for LLMs to make them privacy-safe: the datasets are heavily pseudonymized before they are used in LLM training. Our Descriptive AI foundational models have been engineered to output XML fragments describing the input data, and the same applies to the fine-tuning portion of the process. The model is only able to extract entities that exist in the input document. Combined with heavy pseudonymization and validations in the data preprocessing layer, this guarantees that responses are privacy-safe in all cases.
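To illustrate the validation idea, here is a simplified sketch (my own toy version, not Elinar’s actual validation layer) in which an XML extraction result is rejected whenever a returned entity value does not literally occur in the source document:

```python
import xml.etree.ElementTree as ET

def validate_extraction(source_text: str, xml_fragment: str) -> bool:
    """Accept an extraction only if every extracted entity value
    literally occurs in the source document. A toy version of the
    idea that a descriptive model cannot hallucinate entities that
    are absent from the input."""
    root = ET.fromstring(xml_fragment)
    return all(
        elem.text in source_text
        for elem in root.iter()
        if elem.text and elem.text.strip()
    )

document = "Purchase order 4711 was approved by the sales department."
good = "<result><order_id>4711</order_id><unit>sales department</unit></result>"
bad  = "<result><order_id>4711</order_id><approver>John Smith</approver></result>"

print(validate_extraction(document, good))  # True
print(validate_extraction(document, bad))   # False: 'John Smith' is not in the input
```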

My next blog post in this series will be “Descriptive vs. Generative AI: Why Size Matters – an analysis of the effect of model and context size on AI accuracy”, where I will highlight the issues that arise when the model and context are very large and how to cope with them.

Ari Juntunen

Chief Technology Officer