IJRAIDS

Augmented Data Science Assistants: LLMs for Data Curation and Cleaning

Haitham A. Moniem

Cite this Article

Haitham A. Moniem, 2025. "Augmented Data Science Assistants: LLMs for Data Curation and Cleaning", International Journal of Research in Artificial Intelligence and Data Science (IJRAIDS), 1(1): 57-72.

The International Journal of Research in Artificial Intelligence and Data Science (IJRAIDS)
© 2025 by IJRAIDS
Volume 1 Issue 1
Year of Publication : 2025
Authors : Haitham A. Moniem
Doi : XXXX XXXX XXXX

Keywords

AI assistants, LLMs, data cleaning, data curation, augmented analytics, data preprocessing, and automated data pipelines.

Abstract

The rapid growth of data has sharply increased the workload of data scientists. Data cleaning and curation are essential but labor-intensive. Traditional methods and hand-written rules often fall short when faced with the volume and variety of real-world data. Recent advances in Large Language Models (LLMs) such as GPT-4 and Claude have opened new ways to automate and improve these tasks. Because these models can understand natural language, code, and combinations of the two, they can serve as intelligent assistants for data preparation. This study examines how LLMs can help data scientists clean and curate their data. We propose a structured approach for incorporating LLMs into data operations, describe their functional roles in schema mapping, missing-value imputation, and outlier detection, and evaluate their efficacy on practical datasets. We also review the technologies that facilitate this integration, assess how well they perform, and discuss the challenges and ethical concerns that arise when LLMs are applied to private data. The study shows that LLMs can speed up and simplify data preprocessing, improve its accuracy, and make data science workflows more flexible and scalable.

Introduction

In today's fast-paced digital world, data is central to sound decision-making, to AI systems, and to business intelligence. Yet the volume, variety, and velocity of that data make it difficult to keep it clean, reliable, and useful for analysis. Data scientists typically have to complete a series of preparation steps before they can carry out any real analysis or build a model. Studies and surveys in the field suggest that more than 80% of the time spent on data science projects goes to preparing data for analysis, including cleaning and curation. These tasks include finding and fixing errors, handling missing values, removing duplicates, reconciling data from different sources, and standardizing formats. Data that is unstructured, multilingual, or drawn from disparate systems is even harder to manage.

Traditional data preparation methods work well in specific situations, but they usually follow rigid rules and demand considerable technical expertise. Such systems can struggle with context-dependent problems when the data is continually changing. They also often need substantial human intervention to adapt when data patterns shift or to carry their results over to new settings. This is a major bottleneck in the data science lifecycle, and it calls for smarter, easier-to-use, and more adaptable solutions that can scale with the demands of modern data workflows.
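To make the preparation tasks listed above concrete, the following is a minimal rule-based sketch that uses only the Python standard library. The records, field names, accepted date layouts, and mean-imputation rule are all hypothetical choices made for illustration; they are not taken from the study. The sketch also hints at why such rules are rigid: every accepted date layout has to be listed by hand.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are illustrative only.
RAW_ROWS = [
    {"id": "1", "name": " Alice ", "signup": "2024-01-05", "age": "34"},
    {"id": "2", "name": "bob", "signup": "05/02/2024", "age": ""},
    {"id": "2", "name": "Bob", "signup": "2024-02-05", "age": ""},
    {"id": "3", "name": "Carol", "signup": "2024/03/09", "age": "40"},
]

# Assumed set of accepted layouts; any layout not listed here is rejected.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def parse_date(text):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    """Standardize formats, drop duplicate ids, and impute missing ages."""
    seen, cleaned = set(), []
    for row in rows:
        if row["id"] in seen:  # keep only the first record per id
            continue
        seen.add(row["id"])
        cleaned.append({
            "id": int(row["id"]),
            "name": row["name"].strip().title(),
            "signup": parse_date(row["signup"]),
            "age": int(row["age"]) if row["age"] else None,
        })
    # Impute missing ages with the mean of observed ages (one simple choice).
    known = [r["age"] for r in cleaned if r["age"] is not None]
    fill = round(sum(known) / len(known)) if known else None
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


CLEANED = clean(RAW_ROWS)
```

A date written as "March 9, 2024" would come back as None here until someone adds another format string, which is the kind of manual upkeep the paragraph above describes.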

Large Language Models (LLMs) such as GPT-4, Claude, and PaLM offer a fundamentally new way to approach these data challenges. These models excel at understanding language, inferring meaning, and writing code, which puts them in a strong position to connect what people want to accomplish with what machines can do to prepare data. Thanks to LLMs, people can interact with computers in plain English. This makes it easier for non-experts to carry out substantial data transformations, and it gives advanced users tools that help them work faster. LLMs can be effective data science assistants because they can grasp context, infer what the user wants, and adapt their responses to the user's feedback. These assistants can read instructions, look for errors or inconsistencies, offer suggestions, and even write or review cleaning plans on the spot.
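One way to picture the assistant workflow described above is sketched below: a plain-English request is combined with a small profile of the dataset and handed to a model. This is only an illustrative sketch, not the approach evaluated in this study. The function names, the prompt wording, and the `ask_llm` parameter are hypothetical; `ask_llm` stands for any callable that takes a prompt string and returns text, so no specific vendor API is assumed.

```python
def profile(rows):
    """Count missing and distinct values per column of a list of dict rows."""
    columns = list(rows[0]) if rows else []
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def build_prompt(rows, instruction):
    """Combine the user's request with the column profile in one prompt."""
    lines = [
        "You are a data cleaning assistant.",
        f"User request: {instruction}",
        f"Rows: {len(rows)}",
        "Column profile:",
    ]
    for col, stats in profile(rows).items():
        lines.append(
            f"- {col}: {stats['missing']} missing, {stats['distinct']} distinct"
        )
    lines.append("Suggest cleaning steps and explain each one.")
    return "\n".join(lines)


def suggest_cleaning(rows, instruction, ask_llm):
    """Send the prompt to ask_llm, a hypothetical str -> str callable."""
    return ask_llm(build_prompt(rows, instruction))
```

Sending a profile rather than the raw rows is a deliberate choice in this sketch: it keeps the prompt short and limits how much of the underlying data reaches the model, which matters when the data is private.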