"

4 ChatGPT for Data Standardization and Entity Resolution

Note. Retrieved from https://www.foxbusiness.com/technology/gm-scouts-ways-integrate-ai-chatgpt-into-everything-report

By: Tanner Giroux and Chantal Branch

March 19, 2024 / University of Ottawa – Telfer School of Business Management

The FinTech Explorer: A Comprehensive Guide

 

Data standardization is the process of transforming data into a common form that both computers and users can interpret consistently. A prime example is standardizing dates across datasets. Some regions and businesses write dates as month/day/year, while others use day/month/year; the two cannot be mixed safely, so analysis tools must convert every date to a single format. Entity resolution is the process of detecting records that refer to the same real-world entity and reconciling them, for example by removing or combining duplicate entries or accounts that share key identifiers. ChatGPT’s ability to standardize data and resolve discrepancies across a range of scenarios has been of growing interest.
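The date problem described above can be sketched in a few lines of Python. This is a minimal illustration, not a method used in the study: the `standardize_date` function and the `mdy`/`dmy` labels are hypothetical names, and the sketch assumes each dataset declares which convention it uses, since the string alone cannot reveal it.

```python
from datetime import datetime

# Hypothetical labels for the two conventions discussed in the text.
SOURCE_FORMATS = {
    "mdy": "%m/%d/%Y",  # month/day/year
    "dmy": "%d/%m/%Y",  # day/month/year
}


def standardize_date(text: str, convention: str) -> str:
    """Parse a date written in the declared convention and return it as YYYY-MM-DD."""
    parsed = datetime.strptime(text.strip(), SOURCE_FORMATS[convention])
    return parsed.date().isoformat()


# The same string refers to two different days depending on the convention.
print(standardize_date("03/04/2024", "mdy"))  # 2024-03-04
print(standardize_date("03/04/2024", "dmy"))  # 2024-04-03
```

Converting everything to a single unambiguous format (here ISO 8601) is one common design choice; any agreed format works as long as it is applied to every record.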

For example, if one bank employee entered a client’s name without a middle name for a transaction and another entered the date in a different format, the resulting inconsistencies would undermine the bank’s ability to provide clients and auditors with accurate data. A standard for data entry would reduce such discrepancies. Entity resolution is the process of determining whether different data entries refer to the same entity (Pocock, 2023). Problems arise when a client is assigned two separate client numbers or an invoice is assigned twice to the same person, which can cause accounting imbalances and confusion. Entity resolution plays a key role in keeping a company’s database accurate and is vital in financial analysis, because data tied to only one of a pair of duplicate accounts may not show the full extent of a client’s or business’s financial activity. Unresolved and unstandardized data also weakens fraud detection, creating opportunities for fraud to go unnoticed and allowing criminals to keep exploiting gaps in businesses’ systems.

 

How Does ChatGPT Play a Role?

Recently, AI systems have been used for data standardization and entity resolution in the form of deep learning models. They are well suited to the task because they rely on neural networks, which are loosely modelled on the human brain but operate many times faster (Wolfe, 2021). ChatGPT and other large language models are primarily used for text analysis, not data filtering and sorting. That said, when the data is primarily textual or the task does not involve mathematical processing, ChatGPT can still be efficient at both standardizing data and detecting anomalies and duplicates.

Four tests were conducted on ChatGPT to score its accuracy, time, and cost across a diverse set of grammar, textual data standardization, and entity identification and resolution tasks. The inputs were designed through “prompt engineering,” using two different methods: Zero-Shot prompting (often referred to as “Direct Entry” throughout this report) and Chain-of-Thought (COT) prompting. Zero-shot prompting means asking the program directly to perform a task, with no more information than required. Chain-of-thought prompting, on the other hand, guides the AI to work through and write out its reasoning step by step. This method allows the user to explain more fully what they want the AI to do, and because ChatGPT outputs its reasoning, it is easier to adjust individual steps in the overall process (Weng, 2023). This can be done by giving examples before the test to reinforce logical steps or by inputting one step at a time so that errors in the outputs are detected quickly (Wei et al., 2023).

These tests compare the performance of three human subjects at different educational levels with that of GPT 3.5 and GPT 4, and determine the most accurate and efficient prompting method for data analysis tasks by measuring accuracy, time to prompt, time to compute, and cost. The human subjects are labelled as follows: SA, a 3rd-year computer science major from Carleton University; JB, a 1st-year youth education major from Algonquin College; and BR, a 4th-year finance student from the University of Ottawa. Each of the three volunteers was given a prompt similar to the Chain-of-Thought prompt given to ChatGPT to reduce discrepancies and eliminate any advantages.
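The difference between the two prompting methods can be illustrated with a short Python sketch that only builds the prompt text. The wording of both prompts, the function names and the worked example are hypothetical; the report does not reproduce the exact prompts that were used.

```python
# Hypothetical worked example shown to the model before the real task.
WORKED_EXAMPLE = (
    "Sentence: I have a apple\n"
    "Reasoning: 'apple' starts with a vowel sound, so the article should be 'an'. "
    "The sentence is also missing its final period.\n"
    "Corrected: I have an apple.\n"
)


def number_lines(sentences):
    """Format the sentences as a numbered list, one per line."""
    return "\n".join(f"{i}. {s}" for i, s in enumerate(sentences, start=1))


def zero_shot_prompt(sentences):
    """Zero-shot: state the task directly, with no examples or reasoning steps."""
    return "Correct the errors in the following sentences:\n" + number_lines(sentences)


def chain_of_thought_prompt(sentences):
    """Chain-of-thought: ask for the reasoning and show one worked example first."""
    return (
        "Correct the errors in the following sentences. For each one, list the "
        "errors you find and explain each fix before giving the corrected sentence.\n\n"
        "Example:\n" + WORKED_EXAMPLE + "\n"
        "Sentences:\n" + number_lines(sentences)
    )


sample = ["the dog hoped over the fence", "Look at that plain fly"]
print(zero_shot_prompt(sample))
print(chain_of_thought_prompt(sample))
```

The chain-of-thought prompt is longer by design, which is consistent with the higher token counts and longer run times reported for COT in the tests.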

 

Test One: Basic Text

The first test determined ChatGPT’s ability to follow very basic prompts by giving it a set of 10 basic English sentences (Table 1) with grammar, spelling, capitalization and punctuation errors. This test served as a baseline for each model’s ability to follow prompts and as a fair benchmark for the human subjects on simple tasks. It also helped us determine which prompts to use as a standard and familiarized us with each model’s basic capabilities.

 

Table 1. Test One: Basic Test Data

Test Data | Corrections
the dog hoped over the fence | The dog hopped over the fence.
Are there hats on backwards? | Are their hats on backwards?
whats up with him? | What’s up with him?
Look at that plain fly | Look at that plane fly.
“Hello there,” said susan. | “Hello there,” said Susan.
I have a apple | I have an apple.
I wood like a cheeseburger please. | I would like a cheeseburger, please.
Youre not very good at this. | You’re not very good at this.
My watch is made of sliver | My watch is made of silver.
Math is a very difficult subject | Math is a very difficult subject.

Given the simple nature of the test, no great differences were expected between the two prompt methods on the same model. GPT 3.5 was unable to score perfectly with either Zero-Shot or Chain-of-Thought prompting, correcting only 9/10 sentences and both times failing to fix the error in the sentence “Are there hats on backwards?”. Zero-shot produced “Are the hats on backwards?”, which is grammatically correct but changes the meaning of the text. COT found no error and returned the sentence unchanged. GPT 4.0 scored perfectly with both Zero-Shot and COT, as expected.
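As a rough sketch of how such a score can be computed, the Python below compares each output with the answer key from Table 1. Exact matching against the key is an assumption made for illustration; the report does not describe the grading procedure in detail. The GPT 3.5 zero-shot outputs are taken to equal the key for the nine sentences it corrected, with the one reported miss substituted in.

```python
# Answer key: the "Corrections" column of Table 1.
EXPECTED = [
    "The dog hopped over the fence.",
    "Are their hats on backwards?",
    "What’s up with him?",
    "Look at that plane fly.",
    "“Hello there,” said Susan.",
    "I have an apple.",
    "I would like a cheeseburger, please.",
    "You’re not very good at this.",
    "My watch is made of silver.",
    "Math is a very difficult subject.",
]


def accuracy(outputs, expected):
    """Return the percentage of outputs that match the answer key exactly."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and answer key must have the same length")
    correct = sum(out.strip() == exp for out, exp in zip(outputs, expected))
    return 100 * correct / len(expected)


# Nine outputs assumed identical to the key; the second is the reported miss.
gpt35_zero_shot = list(EXPECTED)
gpt35_zero_shot[1] = "Are the hats on backwards?"

print(accuracy(gpt35_zero_shot, EXPECTED))  # 90.0
```

Exact matching is strict: an output that is grammatical but changes the meaning, as in the reported miss, counts as an error.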

Both models, regardless of prompt method, scored higher than the human subjects and, with one exception, took significantly less time to complete the task. SA scored 70% in 3:12, slightly faster than GPT 3.5 with COT prompting, which took 3:21. BR took more time (5:40) and scored 80%. JB also scored 80% but took significantly more time (20:43). GPT 4.0 with Zero-Shot was by far the most efficient for the task, as it scored perfectly and took the least time (00:14.89) of any model or prompt method. Table 2 shows the results of the test.

 

Table 2. Test 1 Results

Model (Prompt Method) | Accuracy | Time
GPT 3.5 (Direct Entry) | 90% | 00:16.05
GPT 3.5 (COT & Self Consistency) | 90% | 03:21.49
GPT 4.0 (Direct Entry) | 100% | 00:14.89
GPT 4.0 (COT & Self Consistency) | 100% | 01:09.74

 

Test Two: Advanced Text

The second test built on the previously established task of correcting grammar, spelling, capitalization and punctuation, but the text was now in paragraph form, with more advanced language and a need to understand context in order to correct the sentences (Table 3). The advanced text was used to determine ChatGPT’s ability to capture context across longer passages and retain it over a larger number of tokens, and to benchmark the three human subjects on the same task.

We hypothesized that because the test relies on contextual cues, working through the logic step by step would produce better results from both models. This held in each case. GPT 3.5 scored 82.14% with Zero-Shot and 86.90% with COT. GPT 4.0 scored 86.31% with Zero-Shot, while COT increased it to 95.24%. In each case, taking the logic step by step encouraged ChatGPT to produce more carefully reasoned corrections and keep the context of the paragraphs coherent. In the human testing, SA scored 39.1% in 20:08.26, JB scored 82.74% in 36:41.16 and BR scored 77.38% in 17:21.51. In all instances, ChatGPT took significantly less time. GPT 3.5 and 4.0 with Zero-Shot prompting took only 44.23 seconds and 43.83 seconds, respectively. Their COT counterparts took 3:02.20 and 3:16.18, still much faster than the test subjects. These results suggest that ChatGPT’s efficiency advantage in correcting text grows as the length of the task increases. Table 4 summarizes these results.

 

Table 3. Test Two: Advanced Test Data

Test Data | Corrections
My name is Jay Hammond I am a firefighter. I live in 128 Pine Lane, in Jackson, Mississippi. I have two childs. One is a girl named Clair. The other is boy named Thatcher. His name after my father. I also have a wife named Jenna. She is beutiful. She has long, dark, soft hair. We also got a dog named Buck. He is very obedient but sometimes he barks at night and it upsets our neighbors! | My name is Jay Hammond. I am a firefighter. I live at 128 Pine Lane, in Jackson, Mississippi. I have two children. One is a girl named Clair. The other is a boy named Thatcher. He is named after my father. I also have a wife named Jenna. She is beautiful. She has long, dark, soft hair. We also have a dog named Buck. He is very obedient but sometimes he barks at night and it upsets our neighbors!
Well, its another rainy day. I wonder what I will do? First, I think I’ll take a walk around the neyborhood to stretch my legs. Second I’ll cook a big breakfast with toast fruit eggs and bacon. After that, I might mow my lawn; it’s getting pretty long. I’m not sure what I’ll do after that. I guess I should go see my mother. I think she wants me to go grocery shopping with her. I have no idea why she can’t just go by herself. Or, better still, she could ask my dad to go with her! I doubt he will want to go with her though. He doesn’t like going to the grocery store as much as I do! | Well, it’s another rainy day. I wonder what I will do. First, I think I’ll take a walk around the neighborhood to stretch my legs. Second, I’ll cook a big breakfast with toast, fruit, eggs, and bacon. After that, I might mow my lawn; it’s getting pretty long. I’m not sure what I’ll do after that. I guess I should go see my mother. I think she wants me to go grocery shopping with her. I have no idea why she can’t just go by herself. Or, better still, she could ask my dad to go with her! I doubt he will want to go with her though. He dislikes going to the grocery store as much as I do!
“To be, or not to be…that is the question” This wellknown utterance has been the source of both mystery and wonderment for students around the world since the turn of the 16th century—arguably the zenith of Shakespeare’s creative output. However, the mere ubiquity of this phrase fails to answer some basic questions about it’s rather context. Where did it come from what does it mean? The first of these questions (where does it come from?) can be answered fairly easily: from Shakespeare’s famous play Hamlet. As for the last of the two questions, a complete answer would require a more deep look at Shakespearean culture and nuance. | “To be, or not to be…that is the question.” This well-known utterance has been the source of both mystery and wonderment for students around the world since the turn of the 16th century—arguably the zenith of Shakespeare’s creative output. However, the mere ubiquity of this phrase fails to answer some basic questions about its rather context. Where did it come from? What does it mean? The first of these questions can be answered fairly easily: from Shakespeare’s famous play Hamlet. As for the latter question, a complete answer would require a more in depth investigation of Shakespearean culture and nuance.

 

Table 4. Test 2 Results

Model (Prompt Method) | Accuracy | Time
GPT 3.5 (Zero-Shot) | 82.14% | 00:44.23
GPT 3.5 (COT) | 86.90% | 03:02.20
GPT 4.0 (Zero-Shot) | 86.31% | 00:43.83
GPT 4.0 (COT) | 90.07% | 03:16.18

 

Test Three: Sample Dataset

For the third test, a dataset with the columns “Name,” “Address,” “Province,” “Postal Code,” “Phone Number,” “Email,” “Customer Number,” “Transaction Amount,” and “Transaction Date” was created, with the data formatted at random. The data in the table (Appendix 1A) contained the kinds of formatting errors seen in a real work environment. Names were entered as both “family name, name” and “name, family name.” Address lines abbreviated the street type, such as “st” instead of “street.” Provinces appeared in both long and abbreviated form. Postal codes were in lowercase or missing the space between the forward sortation area (first three characters) and the local delivery unit (last three characters). Phone numbers appeared in several formats, with or without brackets around the area code and with hyphens, spaces or nothing between groups of digits. Amounts contained more than two decimal places, and dates were in both day/month/year and month/day/year format. The data also contained possible duplicate accounts with matching identifying information, such as names, addresses, phone numbers and email addresses.
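For comparison, several of the formatting errors listed above can also be fixed with simple rule-based code. The Python sketch below is illustrative only: the function names, the target formats (for example “(613) 555-0123” for phone numbers) and the small street-type table are assumptions, since the report does not specify the exact standard used in the answer sheet.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical lookup table; a real one would cover many more street types.
STREET_TYPES = {"st": "Street", "ave": "Avenue", "rd": "Road"}


def standardize_postal_code(raw: str) -> str:
    """Uppercase a Canadian postal code and put one space between its two halves."""
    compact = re.sub(r"\s+", "", raw).upper()
    if not re.fullmatch(r"[A-Z]\d[A-Z]\d[A-Z]\d", compact):
        raise ValueError(f"not a postal code: {raw!r}")
    return f"{compact[:3]} {compact[3:]}"


def standardize_phone(raw: str) -> str:
    """Strip all punctuation from a 10-digit number and reformat it."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits: {raw!r}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def standardize_amount(raw: str) -> str:
    """Round a transaction amount to exactly two decimal places."""
    return str(Decimal(raw).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def standardize_address(raw: str) -> str:
    """Expand an abbreviated street type at the end of an address line."""
    words = raw.split()
    key = words[-1].rstrip(".").lower()
    if key in STREET_TYPES:
        words[-1] = STREET_TYPES[key]
    return " ".join(words)


print(standardize_postal_code("k1n6n5"))   # K1N 6N5
print(standardize_phone("613.555.0123"))   # (613) 555-0123
print(standardize_amount("125.4567"))      # 125.46
print(standardize_address("12 Main st"))   # 12 Main Street
```

Rules like these are fast and predictable but must be written separately for every field and every error pattern, which is the manual effort a language model is being tested as a substitute for.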

This test aimed to determine how well ChatGPT can read tabular datasets, to assess its accuracy in standardizing larger amounts of information within a conversation, and to test whether it can identify duplicate account entries and resolve them by merging the accounts. The test also sought to identify the most efficient model among GPT 3.5, GPT 4.0 and the Data Analysis GPT released with the custom GPT update in November 2023. The GPTs are compared on an accuracy score based on how closely the output dataset matched the standardized test set, on how long each model took to reach a final output, and on an estimate of how costly it would be for a firm to run each model based on token prices from November 2023. The models were also compared against post-secondary educated test subjects to gauge whether entry-level data jobs could soon be replaced by LLMs.

Unlike in previous tests, accuracy was measured by how many of the mistakes present were corrected. Each column was scored on how many mistakes remained in the final output, and the average accuracy of the “Name,” “Address,” “Province,” “Postal Code,” “Transaction Amount,” and “Transaction Date” columns dictated the final score for that model and prompt method. This change was made because of ChatGPT’s tendency to hallucinate address lines, names, and numbers when files are in Excel format. Entity resolution was judged on the ability to flag five possible duplicates, three of which are definite, to remove the definite duplicates, and to add their transaction amounts to a single account carrying the most recent transaction date.
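One reading of this scoring rule is sketched below in Python: each column’s accuracy is the share of its original mistakes that were fixed, and the final score is the unweighted mean over the six scored columns. The function names are hypothetical, and the error counts are invented purely to show the arithmetic; they are not the study’s data.

```python
SCORED_COLUMNS = [
    "Name", "Address", "Province",
    "Postal Code", "Transaction Amount", "Transaction Date",
]


def column_accuracy(mistakes_before: int, mistakes_after: int) -> float:
    """Percentage of a column's original mistakes that no longer remain."""
    if mistakes_before == 0:
        return 100.0
    return 100 * (mistakes_before - mistakes_after) / mistakes_before


def final_score(error_counts: dict) -> float:
    """Unweighted mean of the per-column accuracies."""
    scores = [column_accuracy(*error_counts[col]) for col in SCORED_COLUMNS]
    return sum(scores) / len(scores)


# Invented counts: (mistakes in the input, mistakes left in the output).
example_counts = {
    "Name": (10, 0),
    "Address": (8, 2),
    "Province": (5, 0),
    "Postal Code": (10, 1),
    "Transaction Amount": (4, 0),
    "Transaction Date": (10, 5),
}

print(round(final_score(example_counts), 2))  # 85.83
```

Averaging per column rather than per cell gives every column equal weight, so a column with few mistakes influences the final score as much as one with many.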

 

Figure 1. Sample Data for Test 3

Figure 1 is the test sheet used in this experiment; matching colours mark the accounts that could be flagged as duplicates. The accounts named Anthony White should only be flagged and not automatically changed, as only their names match. The other flagged accounts, in contrast, share key identifying information such as address and contact details, which suggests that a data entry error created a new account for an existing entity.
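The flag-versus-merge rule described here can be sketched as follows. This Python example is a simplified illustration under stated assumptions: a shared email or phone number is treated as proof of a definite duplicate, the `resolve` and `normalize` names are hypothetical, and all contact details in the sample records are invented (only the name Anthony White comes from the test sheet).

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(text.lower().split())


def resolve(records):
    """Merge definite duplicates and flag name-only matches for human review."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["name"])].append(record)

    resolved, flagged = [], []
    for group in groups.values():
        if len(group) == 1:
            resolved.append(dict(group[0]))
            continue
        # Assumption: a shared email or phone number marks a definite duplicate.
        definite = any(
            len({normalize(r[field]) for r in group}) == 1
            for field in ("email", "phone")
        )
        if definite:
            merged = dict(max(group, key=lambda r: r["date"]))  # keep the latest record
            merged["amount"] = sum(r["amount"] for r in group)  # add the amounts together
            resolved.append(merged)
        else:
            flagged.append(group[0]["name"])  # same name only: flag, do not change
            resolved.extend(dict(r) for r in group)
    return resolved, flagged


accounts = [
    {"name": "Jane Doe", "email": "jane@example.com", "phone": "6135550100",
     "amount": Decimal("100.00"), "date": date(2023, 10, 1)},
    {"name": "jane  doe", "email": "JANE@example.com", "phone": "6135550100",
     "amount": Decimal("50.00"), "date": date(2023, 11, 5)},
    {"name": "Anthony White", "email": "awhite@example.com", "phone": "6135550111",
     "amount": Decimal("20.00"), "date": date(2023, 9, 9)},
    {"name": "Anthony White", "email": "anthony.w@example.org", "phone": "4165550122",
     "amount": Decimal("75.00"), "date": date(2023, 9, 30)},
]

resolved, flagged = resolve(accounts)
print(len(resolved), flagged)
```

The sketch groups only on exact normalized names and treats each group as a whole, so it would miss misspelled names and mishandle a group in which only some records share contact details; handling those cases is where fuzzy matching or a language model would come in.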

ChatGPT showed great promise in its ability to standardize data in large datasets. It scored 79% or higher in all instances, with both models scoring 100% with Chain-of-Thought prompting. GPT 3.5 with zero-shot prompting reached 79% accuracy in only 30.32 seconds. The entire conversation used ~440 input tokens and ~1220 output tokens, for a cost of approximately $0.006 USD (United States dollars). GPT 3.5 with COT scored a perfect 100% on the standardization tasks in 7:27.06. Because of the nature of the prompt method, it needed more tokens to reach a successful result, using ~704 input tokens and ~2233.33 output tokens for a total of $0.011 USD, about twice as much as its zero-shot counterpart. GPT 4.0 with zero-shot prompting took significantly more time, 2:58.15, to achieve an accuracy of 83.55%, using ~444 input tokens and ~1264 output tokens at a total cost of $0.178 USD. GPT 4.0 with COT also achieved perfect accuracy but took longer than GPT 3.5 with COT, needing 12:37.47 to accomplish the same task. This method also proved the most costly, at ~750.67 input tokens and ~3588 output tokens, totalling $0.476 USD. The three test subjects all scored very high: SA reached an accuracy of 96.88% in 54:29.29, JB scored 95% in 1:32:10.03, and BR managed 98.75% in 30:03.42. For standardizing data, GPT 3.5 with Chain-of-Thought prompting offered the best balance of accuracy, cost and time among the options tested: it outperformed both zero-shot runs on accuracy, matched GPT 4.0 with COT at a fraction of the cost, and was significantly faster than any human subject.
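The cost figures follow from multiplying token counts by a per-token price. The Python sketch below reproduces the arithmetic. The report does not state the rates it used, so the per-1,000-token prices in `RATES` are assumptions, chosen because they reproduce the four reported totals; the token counts are the ones given in the text.

```python
# Assumed USD prices per 1,000 tokens (not stated in the report).
RATES = {
    "GPT 3.5": {"input": 0.003, "output": 0.004},
    "GPT 4.0": {"input": 0.06, "output": 0.12},
}


def conversation_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Cost in USD of one conversation, billed separately for input and output tokens."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1000


# (model, prompt method, input tokens, output tokens) as reported for Test Three.
runs = [
    ("GPT 3.5", "zero-shot", 440, 1220),
    ("GPT 3.5", "COT", 704, 2233.33),
    ("GPT 4.0", "zero-shot", 444, 1264),
    ("GPT 4.0", "COT", 750.67, 3588),
]

for model, method, tokens_in, tokens_out in runs:
    cost = conversation_cost(model, tokens_in, tokens_out)
    print(f"{model} {method}: ${cost:.3f}")
```

Under these assumed rates, output tokens dominate the bill in every run, which is why Chain-of-Thought prompting, with its longer written reasoning, costs more than zero-shot on the same model.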

 

Table 5. Test 3 Results

Model (Prompt Method) | Standardization | ER Detection | ER Resolution
GPT 3.5 (Direct Entry) | 79% | 4/5 | 4/3 (1 over)
GPT 3.5 (COT) | 100% | 4/5 | 0/3
GPT 4.0 (Direct Entry) | 83.55% | 1/5 | 1/3
GPT 4.0 (COT) | 100% | 5/5 | 3/3

Entity resolution results did not follow the trend of the previous tests, in which GPT 4.0 outperformed 3.5 and COT consistently outperformed zero-shot. In this case, results were more erratic. GPT 3.5 detected 4 of the 5 possible duplicates in both the zero-shot and COT tests. Surprisingly, 3.5 with zero-shot removed the three confirmed duplicates and one of the unconfirmed duplicates, which, in a real-world setting, could result in the loss of key data for the client and the bank. GPT 3.5 with COT detected 4 of 5 accounts but did not remove any, becoming uncooperative at the steps that asked it to remove an account and add its transaction amounts to the most recent record for that account. GPT 4.0 with zero-shot could detect and correct only 1 of the five accounts, making it the worst performer. GPT 4.0 with COT detected all five accounts, successfully removed all 3 of the confirmed duplicates, and provided sound reasoning for each removal. The human participants seemingly fared better, though not without errors. SA found all five possible accounts but also corrected all five, 2 of which are not duplicates. Both JB and BR found the three confirmed duplicates, but JB corrected only one account, whereas BR corrected all three.

 

Future of AI in Financial Data Management

This study substantiates the efficacy of ChatGPT, particularly when enhanced with Chain-of-Thought prompting, as a potent tool for data standardization tasks. It adeptly handles textual and numerical corrections, indicating its robust potential to streamline data management tasks traditionally handled by human analysts. While ChatGPT excels in these areas, its capacity for automating complex entity resolution tasks is still evolving. This points to a hybrid model where AI supports human analysts rather than replacing them, optimizing accuracy and efficiency in data management processes.

 

Strategic Implications for Financial Data Integrity

Integrating AI like ChatGPT into data management offers substantial strategic advantages, including significant time savings and cost reductions. These benefits are crucial for financial institutions where data integrity is paramount. By automating routine data standardization tasks, firms can allocate human resources to more complex and strategic initiatives, thus enhancing overall business innovation and competitive edge.

 

Challenges and Recommendations for Implementation

Despite the promising results, AI’s full replacement of human oversight in financial data management is premature. The technology’s limitations in nuanced decision-making and complex problem resolution necessitate continued human involvement. To maximize AI’s benefits, businesses should consider phased implementations, starting with pilot projects that allow for iterative testing and integration. This approach enables organizations to calibrate AI applications according to specific operational needs and adjust strategies in response to evolving AI capabilities.

 

Navigating the Future of AI in Business

As AI technologies like ChatGPT continue to advance, their role in business processes will likely expand, making ongoing investment in AI development and training imperative. Organizations should remain agile, updating and adapting AI strategies to leverage emerging capabilities and ensure alignment with business objectives. Additionally, fostering a culture of innovation and continuous learning will be key to harnessing AI’s full potential.

ChatGPT represents a significant technological advancement with the potential to transform data management practices. By integrating AI responsibly and strategically, businesses can enhance their analytical processes, improve data integrity, and stay ahead in a rapidly evolving digital landscape. Future research should focus on overcoming current limitations and expanding AI’s capabilities toward fully automating complex data analysis tasks, so that businesses can achieve the highest standards of data accuracy and operational efficiency.

 

References

Di Cicco, V., et al. (2019). Interpreting deep learning models for entity resolution: An experience report using LIME. Association for Computing Machinery. Article 8, 1–4. https://doi.org/10.1145/3329859.3329878

Pocock, K. (2023, May 20). What is ChatGPT, and what is it used for? PC Guide. https://www.pcguide.com/apps/what-is-chat-gpt/

Quantexa. (2022, June 23). Entity resolution. Quantexa. https://www.quantexa.com/entity-resolution/#chapter-4

Wei, J., et al. (2023, Jan 10). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Google Brain Team. https://doi.org/10.48550/arXiv.2201.11903

Weng, L. (2023, March 15). Prompt engineering. GitHub. https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/#references

Wolfe, M. (2021, Sept 14). Deep Learning in Data Science. Medium. https://towardsdatascience.com/deep-learning-in-data-science-f34b4b124580