Detect AI Text by Just Looking at It

Words that LLMs regularly use

Fareed Khan
Level Up Coding


Abstract of a Research Paper written using ChatGPT

ChatGPT often produces words that send you to a dictionary, or words that simply sound magical. This isn’t unique to ChatGPT; open-source language models like Mistral do the same. There’s no harm in using AI to help create content, as long as it’s done ethically. But in a science-writing competition for 14–16 year-olds, a judge grew suspicious when he saw the phrase “Labyrinthian mazes” in an essay, which seemed too advanced for a teenage writer. So he checked the essay with four AI-detection tools. Unfortunately, all four gave the same result: almost the entire essay, around 90–96%, appeared to be written by AI rather than a human. Not all of us are professionals, though. If we had come across that phrase, we might have skipped right past it because of our limited awareness.

There is a need for critical thinking skills to identify if AI is the author

The easiest way to spot AI-generated text is to check for words that you rarely use yourself but that are common in ChatGPT output. Consider the NOW Corpus, a massive corpus of over 19 billion English words from blogs, articles, news, and more, updated daily from 2010 to now. I searched it for the word “delve” using a string search algorithm, and it showed up 52,388 times. I then plotted its yearly pattern and noticed something unusual: a ~200% growth in its appearance on the internet from 2022, the same year ChatGPT was released (on November 30th).

Trend of Delve word occurrence in NOW Corpus (by Fareed Khan)
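
The counting step behind a trend like this can be sketched in a few lines of Python. This is only an illustration, not the code used for the plot: the `docs` list is made-up toy data standing in for the corpus, and the names `yearly_counts` and `percent_growth` are hypothetical helpers introduced here.

```python
import re
from collections import Counter


def yearly_counts(docs, word):
    """Count whole-word, case-insensitive occurrences of `word` per year.

    `docs` is an iterable of (year, text) pairs. This layout is an
    assumption for the sketch, not the corpus's real format.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts = Counter()
    for year, text in docs:
        counts[year] += len(pattern.findall(text))
    return counts


def percent_growth(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Toy documents (invented for illustration only).
docs = [
    (2021, "We explore the results. Few writers delve into this."),
    (2022, "Let us delve deeper."),
    (2023, "We delve into the data. Delve again, then delve once more."),
]

counts = yearly_counts(docs, "delve")
for year in sorted(counts):
    print(year, counts[year])
print(f"growth 2021 -> 2023: {percent_growth(counts[2021], counts[2023]):.0f}%")
```

The `\b` word boundaries keep the search from matching longer words such as “delved”; whether to count those inflections as well is a choice you would make depending on what you want the trend to show.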

Other words, like “intricacies” and “unwavering”, show a similar increase. Just like “delve”, they are being used more often lately.

Trend of intricacies and unwavering in NOW Corpus (by Fareed Khan)

This vocabulary is not exclusive to AI, since humans also use a diverse range of words. In academic writing, however, we more often use phrases like “explore” or “discuss in more detail” instead of “delve”. When I ask ChatGPT to rephrase “discuss in more detail …”, its first five suggestions typically include “delve”.

Rephrasing using ChatGPT

I also analyzed the arXiv database, a well-known platform for publishing papers that held more than 2 million papers as of 2023. I searched for the word “delve” in the paper abstracts and plotted its yearly pattern. I was amazed to see how widely the word was used in abstracts in 2023, the same word that ChatGPT offered among its top five suggestions.

Trend of Delve word occurrence in arXiv Database (by Fareed Khan)
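
Because the number of papers differs from year to year, one way to compare years fairly is to look at the share of abstracts that contain the word instead of the raw count. The sketch below is an assumption-laden variation on the analysis above, not the method behind the plot: the record layout (`year`, `abstract`) and the toy abstracts are invented, and matching inflections such as “delving” is an extra choice made here.

```python
import re
from collections import defaultdict

# Matches "delve", "delves", "delved", "delving" as whole words.
# Including inflections is an assumption of this sketch.
DELVE = re.compile(r"\bdelv(?:e|es|ed|ing)\b", re.IGNORECASE)


def share_by_year(records):
    """Return {year: fraction of abstracts that mention the word}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        year = rec["year"]
        totals[year] += 1
        if DELVE.search(rec["abstract"]):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy records (invented; real abstracts would be loaded from a dataset).
records = [
    {"year": 2021, "abstract": "We study graph neural networks."},
    {"year": 2021, "abstract": "A survey of optimization methods."},
    {"year": 2023, "abstract": "We delve into transformer scaling."},
    {"year": 2023, "abstract": "Delving deeper into attention maps."},
    {"year": 2023, "abstract": "A benchmark for code generation."},
    {"year": 2023, "abstract": "We explore reward modeling."},
]

shares = share_by_year(records)
for year, share in shares.items():
    print(year, f"{share:.0%}")
```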

This suggests that academic writers may be using ChatGPT, either for rephrasing or for generating content. The presence of the word “delve” serves as a hint, a reason to suspect, that a document submitted by a student or an online blog post, or at least that paragraph or portion of it, has been rephrased or enhanced using ChatGPT.

Drawing upon my research expertise and two years of experience working with LLMs, I’ve put together a pretty comprehensive list of 100 words you can keep an eye out for in a piece of text to help you figure out if it’s been generated or paraphrased using AI.

But checking for that many words by hand is not easy, so I made a web app that quickly checks your text for you. Just upload your file or paste your text, and it’ll do the rest. Easy peasy!
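
The core of such a check can be sketched roughly as follows. This is a minimal illustration, not the web app’s actual code: `AI_WORDS` holds only the three words named in this article as a stand-in for the full list, and `flag_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Stand-in watchlist: only the three words named in this article.
# The full 100-word list is not reproduced here.
AI_WORDS = {"delve", "intricacies", "unwavering"}


def flag_words(text, watchlist=AI_WORDS):
    """Return a Counter of watchlist words found in `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token in watchlist)


sample = "Let us delve into the intricacies of the model. We delve further."
hits = flag_words(sample)
for word, count in hits.most_common():
    print(f"{word}: {count}")
```

A hit is only a prompt to read the passage more carefully. This sketch matches exact word forms, so “delving” would not be flagged unless it is added to the watchlist.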

If you’re curious, you can find the full list of commonly used AI words and the source code for my web app on my GitHub repository. Here’s the link:

Hope you enjoy the read!
