Big Brother Meets AI: Summarizing the 400 Pages of 1984 with Python and GPT APIs

A story of surprises you might find while summarizing long documents with GPT

Massimiliano Costacurta
Level Up Coding

Photo by Abdul Ahad Sheikh on Unsplash

By now it’s no secret that GPT is a game changer — a technological prodigy that has left everyone in awe of its capabilities. So, I won’t dwell on what an incredible breakthrough it has been for the AI world. Instead, I’d like to shed light on some of the limitations that the current version has displayed, and the challenges I’ve personally faced while using it in my day-to-day activities:

  1. Prompt limit: GPT has a token limit, which means it can only handle a certain amount of text at once. This constraint can make it difficult to work with long documents, as GPT may not be able to process or generate text beyond this limit.
  2. Cutoff in late 2021: GPT’s knowledge is limited to information available up until September 2021. This means any recent developments, research, or events that have occurred since then are outside the scope of its understanding, which can be a problem if you’re working with up-to-date documents.
  3. No knowledge of your personal documents: GPT doesn’t have the ability to access or understand the context of your specific document. This can lead to inaccuracies and inconsistencies in generated summaries, as it may provide information that is unrelated or irrelevant to the document you’re working on.

If you happen to need a summary of a long personal document written after 2021, well, you’ve hit the jackpot — or rather, the anti-jackpot, because you’ve stumbled upon all three limitations at once, making ChatGPT far less useful for your purposes. In this scenario, the token limit hinders GPT’s ability to digest and condense the text, the knowledge cutoff leaves it blind to any post-2021 content, and its lack of context about your specific document may result in a less-than-stellar summary.

One possible way to tackle this problem is by utilizing the recursive task decomposition approach suggested by OpenAI in this blog post. The method described in the post breaks down the summarization process into smaller, more manageable tasks that GPT can handle more effectively. In this article, we’ll explore how to implement this approach using Python, putting it to the test to see what the results will look like.

We’ll test the implementation on a variety of PDF files, ranging from a few dozen pages to more than 400. The longest document in our test set is George Orwell’s immortal masterpiece, 1984. We’ll also analyze the execution times and costs associated with using the OpenAI APIs, and plot both. Oh, by the way, there are three reasons why I included 1984 in the list:

  1. It has a beautiful, long and intricate plot that will likely put stress on the decomposition approach. I suspect it might struggle with texts that have a deep meaning not separated into clear topics.
  2. It’s included in ChatGPT’s prior knowledge, so we’ll be able to compare the results of our implementation with the direct output of ChatGPT.
  3. It’s perfect for a catchy title for this article.

Sounds good? Perfect, so let’s start going through the code.

Import and preprocess the files

The first step in our journey is to import the text from the test files. Since all the files are in PDF format, we’ll use the PyMuPDF library, which allows granular control over the elements of each PDF page. This is where I hit the first problem: the files’ layouts differ from one another, making it difficult to isolate the text. The most challenging task is removing header and footer text, which is not only meaningless but also adds clutter and increases the amount of text we need to process (more tokens mean higher API expenses). To address this, I created a simple yet fairly general function that automatically detects repetitive text within the top and bottom parts of each page. Some files have similar but not identical headers, so the function performs soft matches to identify repetitive text. It is not perfect (and it is O(N²), so be careful if you want to use it), but it handled all of my files.
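A minimal sketch of this kind of soft-match detection, not the original function. It assumes the lines found near the top and bottom of each page have already been collected into one list per page (`page_edge_lines`, a hypothetical name), and it uses `difflib.SequenceMatcher` from the standard library for the soft match. The 0.8 similarity threshold and the `min_pages` cutoff are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def find_repeated_lines(page_edge_lines, similarity=0.8, min_pages=3):
    """Return the edge lines that recur, approximately, on several pages.

    page_edge_lines: one list per page with the text lines found in the
    top and bottom regions of that page (assumed input format).
    """
    repeated = set()
    for i, lines in enumerate(page_edge_lines):
        for line in lines:
            candidate = line.strip()
            if not candidate or candidate in repeated:
                continue
            # Count the pages (this one included) that carry a similar line.
            pages_with_match = 1
            for j, other_lines in enumerate(page_edge_lines):
                if j == i:
                    continue
                if any(
                    SequenceMatcher(None, candidate, other.strip()).ratio() >= similarity
                    for other in other_lines
                ):
                    pages_with_match += 1
            if pages_with_match >= min_pages:
                repeated.add(candidate)
    return repeated
```

Every page is compared against every other page, which is where the quadratic cost comes from. Each variant of a near-identical header (for example, one that differs only in the page number) ends up in the returned set, so removing them later is a plain membership check.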

Next, we can write a simple routine to extract the text from the files:
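What follows is a hedged sketch of such a routine rather than the original code. It relies on PyMuPDF's `page.get_text("blocks")`, which returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The `repeated_lines` parameter (a set of header and footer strings detected beforehand), the helper `clean_page_blocks`, and the 10% margin are assumptions made for illustration.

```python
def clean_page_blocks(blocks, page_height, repeated_lines=frozenset(), margin=0.1):
    """Join the text blocks of one page, dropping known headers and footers."""
    kept = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        text = text.strip()
        # A block sits in the margin if it lies in the top or bottom strip.
        in_margin = y1 <= page_height * margin or y0 >= page_height * (1 - margin)
        if not text or (in_margin and text in repeated_lines):
            continue
        kept.append(text)
    return "\n".join(kept)


def extract_text(pdf_path, repeated_lines=frozenset(), margin=0.1):
    """Extract the text of a PDF page by page, skipping repeated edge text."""
    import fitz  # PyMuPDF, imported lazily so the helper above runs without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            pages.append(
                clean_page_blocks(blocks, page.rect.height, repeated_lines, margin)
            )
    return "\n".join(p for p in pages if p)
```

Only blocks that are both inside a margin and in the repeated set are dropped, so a one-off footnote at the bottom of a page is kept.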

The extract_text function can now be used to extract clean text from our files. The next step is to create a function that breaks the extracted text into manageable chunks. We want the function to provide an option for overlapping segments since, during the summarization process, this overlap will serve as a form of “memory” of the previous chunk. With this in mind, here is the chunking function:
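A minimal sketch of a chunking function with overlap, under the assumption that chunks are measured in whitespace-separated words (the original may well count tokens instead). The name `split_into_chunks` and its default values are illustrative.

```python
def split_into_chunks(text, chunk_size=1000, overlap=0.2):
    """Split text into chunks of `chunk_size` words.

    Consecutive chunks share a fraction `overlap` of their words, so each
    chunk starts with the tail of the previous one.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    words = text.split()
    overlap_words = int(chunk_size * overlap)
    step = max(1, chunk_size - overlap_words)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

With `chunk_size=10` and `overlap=0.2`, each new chunk begins with the last two words of the previous one.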

Alright, now that the boring homework has been taken care of, it’s time to move on to the much more exciting summarization portion of the process.

Recursive task decomposition for summarization

As explained in the original research paper from OpenAI, task decomposition involves breaking down a complex task into simpler subtasks that are easier to manage. For book summarization, the task can be decomposed algorithmically into a tree of summarization tasks, where only the leaf tasks operate on passages of the original book text. The image below shows the approach in action.

Image from the original OpenAI research paper

However, the algorithm that we are going to implement aims to enhance parallelization, differing slightly from the approach suggested by OpenAI. Our goal is to eliminate the need to wait for one summary to complete before starting the next one, thereby optimizing the execution of the overall summarization process. The approach operates as follows:

  • In the initial iteration, we divide the input text into segments with a 20% overlap. This overlap is designed to ensure continuity and context preservation across segments. A summary is then generated for each segment.
  • In the subsequent iterations, we concatenate the summaries from the previous iteration and split this combined summary into new segments, again maintaining a 20% overlap.

This procedure allows each segment to be processed concurrently in every iteration, significantly improving the efficiency of the algorithm.

The function iterative_summarization implements the algorithm, using ThreadPoolExecutor from Python’s concurrent.futures module to manage the parallelization. The process is performed iteratively, combining the summaries until a single summary is obtained and returned.
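The loop can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the OpenAI call is hidden behind `summarize_chunk`, a hypothetical callable that takes a chunk of text and returns its summary, chunks are measured in words, and `max_iterations` is a guard added here in case the summaries stop shrinking.

```python
from concurrent.futures import ThreadPoolExecutor


def _split(words, chunk_size, overlap):
    """Split a list of words into overlapping chunks of text."""
    step = max(1, chunk_size - int(chunk_size * overlap))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def iterative_summarization(text, summarize_chunk, chunk_size=1000,
                            overlap=0.2, max_workers=4, max_iterations=20):
    """Summarize chunks in parallel, then re-chunk the summaries and repeat."""
    words = text.split()
    for _ in range(max_iterations):
        chunks = _split(words, chunk_size, overlap)
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map returns results in the same order as the chunks.
            summaries = list(pool.map(summarize_chunk, chunks))
        if len(summaries) <= 1:
            return summaries[0] if summaries else ""
        # Concatenate this round's summaries and feed them to the next round.
        words = " ".join(summaries).split()
    raise RuntimeError("summaries did not converge to a single chunk")
```

Threads suit this workload because each call spends most of its time waiting on the network, so the number of workers mainly bounds how many requests are in flight at once.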

Once the iterative_summarization function is up and running, creating a summary for one file takes just two lines of code: one call to extract_text to get the clean text, and one call to iterative_summarization to summarize it.

Results

Summarizing ‘1984’ cost me $0.89 and took a total of 352 seconds. The process was slowed down primarily by the response time of the OpenAI calls and by my system’s limited parallelism (only four threads). With a more powerful setup, I’m confident this could be done much quicker.
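For anyone who wants to track similar numbers, here is a small sketch of how time and cost could be measured. It is not how the figures above were obtained: `estimate_cost` and `timed` are hypothetical helpers, and the per-1,000-token prices are parameters you must fill in from the current OpenAI price list.

```python
import time


def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_price_per_1k, completion_price_per_1k):
    """Estimate the API cost from token counts and per-1,000-token prices."""
    return (prompt_tokens / 1000) * prompt_price_per_1k \
        + (completion_tokens / 1000) * completion_price_per_1k


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

The token counts would come from the usage information returned with each API response, summed over every call in every iteration.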

And finally here is the summary I got:

George Orwell’s 1984 is a dystopian novel that warns against the dangers of totalitarianism and emphasizes the importance of individual freedom and democracy. The novel’s protagonist, Winston Smith, rebels against the oppressive Party and longs for love and intimacy, but the Party’s strict rules and conditioning make it impossible. The novel highlights themes of surveillance, control, propaganda, erasure of history, and the longing for a better past.

The concept of war is also explored in the novel, where war is continuous and serves to maintain the ruling class’s grip on power. The citizens of each super-state are kept isolated from foreigners and taught to hate and fear them, with the prevailing philosophy and social structure being similar across all three powers. The nature of war has fundamentally changed in this world, becoming a perpetual state of conflict that serves to maintain the ruling class’s grip on power.

The novel depicts the oppressive nature of the society in which Winston lives, where the Party seeks to convert their enemies to their way of thinking before killing them, rather than creating martyrs. The Party aims to capture Winston’s inner mind and reshape him, burning out all evil and illusion. The Party uses psychological torture and manipulation to control its citizens, with Newspeak being a language created to limit and control thought by eliminating words and concepts that diverge from the principles of Ingsoc, the ruling ideology.

Ultimately, the novel serves as a warning against the dangers of totalitarianism and the importance of individual freedom and independent thought. The passages highlight the power of the government to control and manipulate its citizens, the danger of trusting anyone in a society where even friends and neighbors can turn out to be informants for the government, and the need to be vigilant against the abuse of power by those in authority.

And this is what I got when I gave ChatGPT the same task:

“1984” is a dystopian novel written by George Orwell, published in 1949. The story is set in the totalitarian state of Oceania, where the Party, led by Big Brother, exercises complete control over the lives of its citizens. The novel follows the protagonist, Winston Smith, a low-ranking member of the Party who works at the Ministry of Truth, where he alters historical records to conform with the Party’s propaganda.

The Party controls every aspect of life in Oceania, using surveillance, psychological manipulation, and brutal force to maintain its power. It enforces strict rules, such as the prohibition of individual thought and expression, which is considered “thoughtcrime.” The Party also promotes a language called Newspeak, designed to limit free thought and suppress dissent.

Winston becomes increasingly disillusioned with the oppressive regime and starts a forbidden affair with Julia, a fellow Party member. They engage in acts of rebellion, including reading forbidden literature that questions the Party’s control. Their actions are discovered by the Thought Police, and they are arrested.

Winston is tortured and brainwashed by O’Brien, a high-ranking Party member, who forces him to betray his love for Julia and accept the Party’s ideology. The novel ends with a broken Winston, fully indoctrinated and loyal to the Party, gazing at a poster of Big Brother with tears of joy and love in his eyes.

“1984” serves as a chilling warning about the dangers of totalitarianism, surveillance, and the erosion of individual freedom and privacy.

Not bad, right? Our summary is less focused on the plot and more on the general meaning of the novel, but I believe we could tweak this with some clever prompt engineering. All things considered, the result looks pretty good.

Finally, the visuals. The plots below show the results of my other experiments with files of varying lengths, with both the cost and the time taken for each summarization.

Image by author

Did you enjoy this article? Want to stay updated on future content like this? Don’t forget to follow me on Medium to get notified about my latest articles and insights in AI, machine learning, and more. Let’s continue our learning journey together!

Software engineer turned into product manager. Head of product @Rulex_Analytics, AI enthusiast, 90s rock lover. linkedin.com/in/massimilianocostacurta