You’d Better Chat With This AI Model Rather Than ChatGPT for Scientific Exploration

A quick tour of Galactica, a new large language model for science

Yeyu Huang
Level Up Coding



1. Introduction

Galactica is a large language model (LLM) open-sourced by Meta AI. It is based on the Transformer architecture and is trained mainly on scientific articles and research papers, using the GROBID library to convert documents from PDF to text for its corpus.

Through its website, Galactica offers suggestions for scientific papers and related resources based on your prompt. The supported categories include machine learning, mathematics, computer science, biology, and physics.

In the examples below, you can work with Galactica on scientific terminology, math, and chemical formulas, as well as source code.

Galactica manages multiple scientific tasks with a single model. It can reason, create presentations, predict citations, and more. Its key features include:

  • Comes in five sizes, ranging from 125 million to 120 billion parameters.
  • Uses a context window of 2,048 tokens.
  • Applies specialized tokenization methods to handle scientific data types (see the sketch after this list).
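
For example, here is a sketch of the prompt patterns these tokens enable (the <work> token is used in the reasoning example later in this article; [START_REF] is the citation token described in the Galactica paper):

reasoning_prompt = "A force of 10N acts on a mass of 2kg. What is the acceleration? <work>"  # step-by-step working memory
citation_prompt = "The paper that introduced the Transformer architecture [START_REF]"  # citation prediction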

Galactica models achieve state-of-the-art performance on science-oriented datasets. Compared with GPT-3 and OPT, it produces less toxic output and performs better on the TruthfulQA benchmark, and it is fully open-sourced. In the next sections, I will take you through this large language model in the scientific domain.

2. Installation and Import

Let’s install the Galactica model with the pip command below:

$ pip install git+https://github.com/paperswithcode/galai

Note that Galactica models are compatible with Python versions 3.8 and 3.9. The installation will fail on Python 3.10 and later because of the dependency on the promptsource library.
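
If your default interpreter is newer, one option is to create a virtual environment with a compatible Python first (a sketch, assuming python3.9 is installed on your machine):

$ python3.9 -m venv galai-env
$ source galai-env/bin/activate
$ pip install git+https://github.com/paperswithcode/galai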

Now we can import Galactica and load the model:

import galai as gal
model = gal.load_model("base", num_gpus=1)  # "base" = 1.3B parameters, on one GPU

When loading the model, you can specify which pre-trained version to load. We use the “base” version here, which has 1.3B (1.3 billion) parameters. The other available versions are mini, standard, large, and huge, with parameter counts ranging from 125M to 120B.

Note that larger models require more memory and computing resources. I chose the “base” version to fit my available memory; it consumes about 11GB.

The second parameter, num_gpus, is optional and indicates how many GPUs will be used.
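
For example, if memory is tight, you can drop down to the smallest checkpoint and omit num_gpus (a sketch using the same galai API):

model = gal.load_model("mini")  # 125M parameters, small enough for CPU experiments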

3. Applications

Let’s start using the model. The following is an example of an encyclopedia-style explanation:

model.generate("We can explain Brain as", new_doc=True, top_p=0.7, max_length=200)

The generate method accepts additional parameters, such as max_length to cap the length of the output text, quite similar to the GPT-3 API.

Here is the output from my prompt:

We can explain Brain as a computer program that takes in data from the external world, and produces an output as a result. The Brain is the machine that makes decisions about what to do. The Brain is the part of the brain that is made up of neurons, the basic building blocks of the brain. Neurons are the smallest units of the brain. Each neuron contains a membrane and a set of synapses that allow it to communicate with other neurons.

[IMAGE]

Figure Caption: Figure 10.2.1: Neurons are the smallest units of the brain.

## What are the Functions of Neurons?

Neurons are the basic building blocks of the brain. The brain is the part of the body that is made up of neurons. Neurons communicate with each other using chemical signals called neurotransmitters. The brain has many different types of neurons. The different types of neurons in the brain are called neurons of the different types. Neurons of different types
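
For a shorter, more focused completion, you can lower both sampling parameters (a sketch reusing the same generate call; the prompt is just an illustration):

model.generate("We can explain Transformer as", new_doc=True, top_p=0.5, max_length=60)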

4. Hugging Face + Galactica

We can also use Hugging Face to load and run the Galactica model. Let’s install the accelerate module first:

$ pip install accelerate  # to run with the GPU

Note that the accelerate library is required to run the model on a GPU; you can skip it if you run the model on the CPU only. CPU-only inference is slow, though, so if you have GPU resources, run the model on a GPU.
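
If you are not sure whether PyTorch can see a GPU on your machine, here is a quick check with the standard torch API:

import torch
print(torch.cuda.is_available())  # True if a CUDA GPU is usable
print(torch.cuda.device_count())  # number of visible GPUs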

Next, we import transformers and select a model version; the available sizes on Hugging Face are 125m, 1.3b, 6.7b, 30b, and 120b. We will run the minimal version with 125 million parameters using the following code:

from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = OPTForCausalLM.from_pretrained("facebook/galactica-125m", device_map="auto")
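
For the larger checkpoints, loading in half precision roughly halves the memory footprint. Here is a sketch using the standard torch_dtype argument of from_pretrained (the 1.3b size is just an example; swap in whichever size fits your hardware):

import torch
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained(
    "facebook/galactica-1.3b",
    device_map="auto",          # spread layers across available devices (needs accelerate)
    torch_dtype=torch.float16,  # half precision to reduce memory use
)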

After loading the model, let’s test its reasoning capability. We provide a prompt ending with the <work> token, which asks the model to reason step by step, then generate and decode the output:

input_text = "Car 1 speed is 30km/h and Car 2 speed is 50km/h. Which car travels faster and how much? <work>"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

Even when I use the minimal version of the Galactica model, I still receive the right answer for this reasoning task:

Car 1 travels faster than Car 2 (30km/h vs. 50km/h). calc_1.py result = 30/50 with open(“output.txt”, “w”) as file: file.write(str(round(result)))<> <> 10 So 10 km. Car 1 travels faster than Car 2 (50km/h vs. 30km/h). calc_2.py ```result = 50/30 … Answer: 20

That’s it.

I hope you found something useful in this article. Thank you for reading!

