Patent Assist

Data Pre-Processing: Cleaning Text Data Using an LLM (Llama 3.1)

Sukhada Hanumante

In recent years, large language models (LLMs) have dramatically transformed the landscape of Natural Language Processing (NLP). From machine translation and content creation to code generation and sentiment analysis, these models are driving innovation across multiple industries. One such impressive tool is the Llama 3.1 8B-Instruct model, a powerful yet efficient model that I recently used to pre-process a text dataset. In this article, we will explore why this model is useful and share some insights on how to effectively use large language models for advanced text-processing tasks.

What are we doing?

At PatentAssist.AI, we are developing software to help researchers and inventors create patent drafts for registering their inventions. The aim is to fine-tune an LLM so that it generates patent claims from the details of an invention provided as the input prompt. The first step in fine-tuning the LLM is preparing the input data. The raw patent data consists of patents that have been successfully filed. These files consist of different sections such as ‘field of invention,’ ‘summary of invention,’ ‘detailed description of invention,’ and ‘claims.’ The patent data is then converted to JSON objects, with the tag ‘section’ indicating the section of the patent and the tag ‘content’ holding the text of that section.
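
For illustration, a single pre-processed record might look like the sketch below; the ‘section’ and ‘content’ tags are the ones described above, while the content strings are made-up placeholders.

    [
      {"section": "field of invention", "content": "The invention relates to ..."},
      {"section": "summary of invention", "content": "A device comprising ..."},
      {"section": "claims", "content": "1. A device comprising ..."}
    ]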

The patent data, now available as JSON objects, needs to be cleaned and arranged in an acceptable legal format: for example, correcting the formatting of the text, fixing any grammatical errors, and restructuring the text as per legal guidelines. Without a uniform format, the LLM cannot be effectively fine-tuned to correctly generate the claims for filing a patent. But now the question arises: why use an LLM to clean the input dataset? Traditional rule-based text-cleaning methods include regular expressions for manipulating text based on patterns, splitting text into smaller units, and text filtering to remove unwanted information. However, for a vast dataset containing subjective information, it is difficult to implement these methods since they rely on predefined rules and do not understand the overall context. Using LLMs counters this problem, leading to coherent and contextually appropriate modifications.
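
As a point of comparison, a rule-based cleanup step might look like the minimal sketch below (an illustrative example, not our actual pipeline): rules like these can normalize surface noise, but they cannot rephrase a sentence to match legal drafting conventions.

    import re

    def basic_clean(text: str) -> str:
        """A typical rule-based cleanup step: fixes surface noise only."""
        text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
        text = re.sub(r"\[\d+\]", "", text)   # drop bracketed reference markers
        return text.strip()

    print(basic_clean("The  invention [1]  relates to\na purifier ."))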

The Llama 3.1 Series

The Llama series, developed by Meta, is well-known within the AI community for its performance and efficiency. The Llama 3.1 8B-Instruct model is an 8-billion parameter model optimized for instruction-following tasks, making it suitable for tasks like answering questions, summarizing text, and generating well-structured responses based on input prompts.

Here’s why I chose the Llama 3.1 8B-Instruct model:

  1. Scalability: With 8 billion parameters, this model strikes a good balance between size and computational efficiency. It is designed to be more resource-efficient, allowing for more affordable and faster deployment.
  2. Instruction-Following: The "Instruct" variants of the LLMs are fine-tuned to better follow specific instructions, making them effective when you want the model to perform a specific task based on user inputs.
  3. Performance and Usability: The performance of the Llama 3.1 model was compared with that of Google’s Gemma 2 9B model. It was observed that while both are powerful models, the Llama 3.1 8B-Instruct model adhered more closely to the instructions provided in the prompt, such as text restructuring and summarization, whereas Gemma 2 performed better on tasks requiring complex language understanding and nuanced responses.

Below is how to load the Llama 3.1 8B-Instruct model using Hugging Face and create an LLM pipeline; the main pipeline parameters are explained in the list that follows.
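
The snippet below is a minimal sketch, assuming the transformers library (with PyTorch) and the meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint from the Hugging Face Hub; the parameter values shown are illustrative defaults rather than the exact settings used in our pipeline.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

    # Load the tokenizer and the model weights (bfloat16 keeps memory usage manageable)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Build a text-generation pipeline using the parameters described below
    llm = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        do_sample=True,           # enable sampling so temperature/top_p take effect
        temperature=0.3,          # low randomness for consistent cleaning
        max_new_tokens=1024,      # cap the length of the generated response
        top_p=0.9,                # nucleus-sampling threshold
        repetition_penalty=1.1,   # discourage repeated tokens
        return_full_text=False,   # return only the generated continuation
    )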

  • tokenizer: It is responsible for preparing the input to the model. A tokenizer encodes the input text into smaller units called tokens. A unique index number is then assigned to each token using the tokenizer’s vocabulary. When the tokens are passed to the model, the embedding layer converts the tokens into embedding vectors. The transformer layers then analyse the embedding vectors, understand the context, and generate output tokens. In the final step, the output tokens are converted into a human-readable response. Several token calculators, such as the OpenAI Platform Tokenizer, are available online and can be used to get an idea of how tokens are created; a small example using the model’s own tokenizer is shown after this list.

  • temperature: The temperature parameter, typically ranging from 0 to 1, controls the randomness of the response by adjusting the probability distribution of the next word. A lower temperature value means that the model will pick the most probable word, making the output deterministic and repetitive. Conversely, a higher temperature results in a more creative and diverse output.

  • max_new_tokens: This parameter allows users to control the length of the generated response. By limiting the number of tokens generated, one can prevent the model from producing overly long responses, which is particularly useful for summarization.

  • top_p: This parameter limits the set of next possible tokens to the smallest set whose cumulative probability reaches a certain threshold. For example, given “I like to read …” with {books: 0.4, novels: 0.2, newspapers: 0.15, tweets: 0.1}, if top_p is set to 0.60, the model will only consider the options books and novels when predicting the next token. This means a higher top_p allows the model to consider even the less probable tokens, thus generating a more creative output.

  • repetition_penalty: As the name suggests, this parameter penalizes the model when a particular token is repeated frequently.

  • return_full_text: When set to False, the pipeline returns only the generated text and does not include the input query.
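
As mentioned in the tokenizer bullet above, a quick way to see how text is tokenized is to run the tokenizer loaded earlier on a sample sentence (a made-up example):

    # Inspect how a sentence is split into tokens and mapped to vocabulary indices
    text = "The invention relates to a solar-powered water purifier."
    token_ids = tokenizer(text)["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    print(tokens)      # the sub-word pieces the model actually sees
    print(token_ids)   # the index of each token in the tokenizer's vocabulary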

Prompt Engineering

An important aspect of using LLMs effectively is prompt engineering, i.e., how well you can communicate instructions to the model. Here are some key techniques:

Zero-Shot Prompts: In zero-shot learning, the model is asked to perform a task without any prior examples. This approach is useful when you want to see how well the model generalizes to new instructions. For instance, you can prompt the model with: "Write an essay about global warming." The model will generate a response based on its training, without any specific examples provided.

Few-Shot Prompts: In few-shot learning, you provide the model with a few examples of the task you want it to perform. This can improve the accuracy and relevance of the model’s output. An example prompt might look like: “Input 1: Rewrite the given set of text: Tom ate an apple. Output 1: An apple was eaten by Tom. Input 2: Rewrite the given set of text: Maria wrote an email. Output 2: An email was written by Maria. Now rewrite this set of text: Dennis drove a car.”
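
As a rough sketch, such a few-shot prompt can be passed directly to the pipeline created earlier; the examples are the ones from the paragraph above.

    # Build the few-shot prompt from the rewrite examples and generate a response
    few_shot_prompt = (
        "Input 1: Rewrite the given set of text: Tom ate an apple.\n"
        "Output 1: An apple was eaten by Tom.\n"
        "Input 2: Rewrite the given set of text: Maria wrote an email.\n"
        "Output 2: An email was written by Maria.\n"
        "Now rewrite this set of text: Dennis drove a car."
    )
    result = llm(few_shot_prompt)
    print(result[0]["generated_text"])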

System and User Prompts: In advanced applications of LLMs, the system prompt instructs the model on how to behave, while the user prompts are task-specific queries. This helps in using the same LLM to solve multiple tasks within the context of a scenario. For example: System Prompt: “You are an unbiased and fair AI assistant that helps in creating patent drafts.” User Prompt 1: “Summarize the main features of the given invention.” User Prompt 2: “Describe the invention in detail.” User Prompt 3: “Create a draft of claims for the invention.”

The user prompt can be either a zero-shot or a few-shot prompt, depending on the complexity of the task. This approach can help the model stay on track and provide responses that are more aligned with user expectations.
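
Here is a minimal sketch of how a system prompt and a user prompt can be combined using the model’s chat template and the pipeline created earlier; section_text is a hypothetical placeholder for the ‘content’ field of one patent section.

    # Combine a system prompt with a task-specific user prompt via the chat template
    section_text = "..."  # hypothetical placeholder for one section's content

    messages = [
        {"role": "system",
         "content": "You are an unbiased and fair AI assistant that helps in creating patent drafts."},
        {"role": "user",
         "content": "Summarize the main features of the given invention:\n" + section_text},
    ]

    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    result = llm(prompt)
    print(result[0]["generated_text"])  # only the model's reply (return_full_text=False)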

Conclusion

Through our experiments, we observed that it is challenging for the model to retain extensive instructions in memory due to the relatively small size of the model and the large document sizes. Thus, it is necessary to give the model a precise system prompt. We also observed that, given the complexity of the task and the subjective nature of the output, few-shot prompting worked best for the task at hand. Using the above tips and tricks, we have successfully prepared the dataset for fine-tuning. In the next articles, we will explore how LLMs are fine-tuned.

#PatentAssist.AI #LLMs #Llama3.1 #Huggingface #Text_Generation
