Diving Deep with Hugging Face: The GitHub of Deep Learning & Large Language Models!

Models, Data Sets, Fine Tuning, Pipelines, Custom Pipelines

Image by the Author

Introduction:

  • 👉 Hugging Face is an AI startup based in New York that was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
  • 👉 The company is best known for its open-source Transformers library, which provides Transformer-based machine learning models for natural language processing, including implementations of popular architectures such as BERT, GPT-2, and T5.
  • 👉 Over 1 million users worldwide use their models and datasets.
  • 👉 Their models are used by more than 5,000 companies, including Amazon, Google, Facebook, and Microsoft.
  • 👉 They have open-sourced more than 100 AI models in PyTorch and TensorFlow for anyone to use for free.
  • 👉 Hugging Face has more than 1,000 contributors to its open-source repositories on GitHub.
  • 👉 Their models have been used to train chatbots, search engines, summarization tools, and other AI applications.
  • 👉 Hugging Face’s goal is to advance and democratize NLP and make state-of-the-art AI accessible to everyone.
Image from Hugging Face Site

Hugging Face is reportedly in the process of a new Series D funding round, aiming for a valuation of $4 billion. The round could raise over $200 million, with Ashton Kutcher’s venture capital firm, Sound Ventures, among the leading investors. CEO Clément Delangue is considering multiple offers, and the final funding could go as high as $300 million. Last year, Hugging Face secured $100 million in a Series C round at a $2 billion valuation. The company, renowned for its open-source AI models, has seen its annual revenue run rate jump to between $30 million and $50 million.

As of June 2023 — Top AI 100

Let's dive deep into Hugging Face.

Contents:

  • 👉 What is NLP?
  • 👉 NLP Resources
  • 👉 Deep Dive into Hugging Face Hub
  • 👉 The key components of Hugging Face
  • 👉 Pipeline
  • 👉 Sentiment Analysis
  • 👉 Generic Workflow of Hugging Face Models
  • 👉 Topic Classification
  • 👉 Text Summarization
  • 👉 Translation
  • 👉 Question Answering Models
  • 👉 Text Generation
  • 👉 Sentence Similarity
  • 👉 Zero Shot Classification
  • 👉 NER
  • 👉 All the important NLP Pipelines
  • 👉 Default Models used in NLP Tasks
  • 👉 Computer Vision Models
  • 👉 Audio Models
  • 👉 Multimodal
  • 👉 Build Custom Pipeline
  • 👉 Fine Tuning
  • 👉 What is a Model Card?
  • 👉 Hugging Face Leaderboard
  • 👉 Conclusion
  • 👉 References

What is NLP?

Image by the Author

Natural language processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language.

Check out my article on basic NLP and how to create an NLP App using Streamlit:

Image by the Author

Basic Components:

  • Tokenization: Splitting text into words or sub-words.
  • Stemming: Reducing words to their root form (e.g., “running” -> “run”).
  • Lemmatization: Similar to stemming, but returns a valid word (e.g., “better” -> “good”).
  • Part-of-speech tagging: Assigning word types (noun, verb, adjective, etc.).
  • Dependency Parsing: Identifying grammatical relationships between words.
Image by the Author
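
To make these components concrete, here is a minimal sketch using NLTK (an assumption — any NLP toolkit would do; dependency parsing is usually handled by a library such as spaCy):

# pip install nltk
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads; exact resource names can vary slightly between NLTK versions
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")

text = "The children were running faster than their parents."

tokens = nltk.word_tokenize(text)                            # Tokenization
stems = [PorterStemmer().stem(t) for t in tokens]            # Stemming: "running" -> "run"
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # Lemmatization
pos_tags = nltk.pos_tag(tokens)                              # Part-of-speech tagging

print(tokens)
print(stems)
print(lemmas)
print(pos_tags)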

Machine Learning in NLP:

  • Supervised Learning: Uses labeled data (input-output pairs) to train models.
  • Unsupervised Learning: Discovers patterns in data without explicit labels.
  • Semi-supervised and Few-shot Learning: Uses small amounts of labeled data with large amounts of unlabeled data.
  • Transfer Learning: Transfers knowledge from one task to another.

Machine Learning Libraries:

Image by the Author

Deep Learning in NLP:

  • Word Embeddings: Dense vector representations of words capturing semantic meaning (e.g., Word2Vec, GloVe).
  • RNN (Recurrent Neural Networks): Processes sequences by maintaining a memory of previous steps.
  • LSTM (Long Short-Term Memory): A type of RNN that can remember long-term dependencies.
  • GRU (Gated Recurrent Units): A simplified LSTM.
  • Attention Mechanism: Allows models to focus on different parts of the input.
  • Transformers: Uses self-attention to capture contextual information; basis for many state-of-the-art NLP models.

Large Language Models (LLMs):

Image by the Author
  • Definition: Very large neural network models trained on vast amounts of text data.
  • Examples: the GPT (Generative Pre-trained Transformer) series by OpenAI, which powers ChatGPT, and BERT (Bidirectional Encoder Representations from Transformers) by Google.
  • Training Approach: Typically trained in two steps: pre-training on large corpora and fine-tuning on specific tasks.
  • Capabilities: Text generation, question-answering, translation, summarization, and more.
  • Advantages: Can achieve state-of-the-art performance with minimal task-specific data due to knowledge transfer.
  • Challenges: Can be computationally expensive, potential for biases in the data, and sometimes produce unpredictable outputs.
Image by the Author

Transformers:

Image by the Author
  • Transformers are a type of neural network architecture used primarily in natural language processing (NLP).
  • Unlike traditional neural networks, Transformers do not need to process data sequentially. This allows them to learn dependencies between words faster.
  • The main components of a Transformer are the encoder and decoder. The encoder reads the input text and generates an encoded representation. The decoder takes this encoded input and generates the output text.
  • Attention is the key mechanism in Transformers. It allows the model to focus on relevant parts of the input when generating the output. This is similar to how humans pay attention to certain words when translating or summarizing.
  • Transformers are trained on large datasets to learn relationships between words. This allows them to understand language contextually and perform tasks like translation more accurately.
  • The success of Transformers is due to their ability to model long-range dependencies in text efficiently compared to older NLP models like RNNs.
  • Overall, Transformers have revolutionized natural language processing through the use of attention mechanisms and large-scale pre-training.

“Attention Is All You Need”

Image credit: “Attention Is All You Need,” Ashish Vaswani et al.

The title of the paper is “Attention Is All You Need,” and it was first published in June 2017. The authors are Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.

Image by the Author
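
As a toy illustration of the scaled dot-product attention at the heart of the paper, here is a minimal PyTorch sketch (the tensor shapes are made up purely for illustration):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # how strongly each position attends to the others
    return weights @ V

# Self-attention over a toy sequence of 4 tokens with 8-dimensional representations
x = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(x, x, x)  # Q, K, V all come from the same input
print(out.shape)  # torch.Size([1, 4, 8])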

NLP Resources:

Image by the Author

You can check out the following resources for NLP:

  1. NLP with Deep Learning with Chris Manning-Stanford University

2. CS324 — Large Language Models: https://stanford-cs324.github.io/winter2022/

3. About Transformers:

4. Hugging Face NLP Course:

5. Illustrated Transformers:

6. The Annotated Transformer:

Deep Dive into Hugging Face Hub:

The Hugging Face Hub hosts:

Image by the Author
Image by the Author

Hugging Face All Courses:

Models:

Image by the Author
  • Hugging Face provides access to thousands of pre-trained models for NLP, vision, audio, and multimodal tasks.
  • Popular model architectures include BERT, GPT-2, T5, ELECTRA for NLP and ViT, DETR, and CLIP for vision.
  • Models are pretrained on large datasets then can be fine-tuned for downstream tasks.
  • Allow easy sharing and reuse of models without training from scratch.

Searching for Models:

  • Homepage: Navigate to Hugging Face Models.
  • Search Bar: Use the search bar at the top. You can type model names, types, or even tasks (e.g., “BERT”, “translation”).
  • Filters: Use the filters on the left side:
  • Model Type: e.g., BERT, GPT-2, RoBERTa.
  • Language: e.g., English, Chinese, Multi-lingual.
  • Datasets: Models fine-tuned on specific datasets.
  • Task: e.g., Text classification, Token classification.
  • Sort By: You can sort the models by “Most downloaded”, “Last updated”, etc.
  • Model Pages: Clicking on a model will show detailed info, including its documentation, use-cases, and code to load it.
Image by the Author
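
The same search can also be done programmatically with the huggingface_hub client (a sketch; the filter value and sort key below are illustrative assumptions):

from huggingface_hub import HfApi

api = HfApi()

# Roughly the programmatic equivalent of filtering by task and sorting by downloads on the website
for model in api.list_models(filter="text-classification", sort="downloads", direction=-1, limit=5):
    print(model.modelId)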

Datasets:

  • Contains numerous benchmark datasets for NLP, vision, speech, etc.
  • Includes labels, loading scripts, and sample code for many datasets.
  • Hosts datasets for model evaluation like GLUE, SuperGLUE for NLP.
  • Makes it easy to replicate research using canonical datasets.

Searching for Datasets:

  • Homepage: Navigate to Hugging Face Datasets.
  • Search Bar: Use the search bar at the top. You can type dataset names or topics (e.g., “SQuAD”, “sentiment analysis”).
  • Filters: Use the filters on the left side:
  • Language: e.g., English, French.
  • Task: e.g., Text classification, Question Answering.
  • License: e.g., MIT, Apache-2.0.
  • Dataset Pages: Clicking on a dataset will show:
  • Overview of the dataset.
  • Statistics about the data.
  • Code snippets to load the dataset.

Spaces for demos and code:

  • Managed hosting platform for sharing and collaborating on ML projects.
  • Supports Jupyter notebooks, web apps, training models, datasets.
  • Integrates with other Hugging Face tools like datasets, models.
  • It helps promote openness and democratize access to AI research.
Image by the Author

The key components of Hugging Face:

Image by the Author

Model Architectures:

  • These are the core deep learning models, often containing millions or billions of parameters.
  • Examples: BertModel, GPT2Model, T5Model, etc.
  • They are usually pretrained on large datasets and can be further fine-tuned for specific tasks.

Tokenizer:

  • Converts text into a format that can be fed into models, typically a sequence of integers.
  • Handles tasks such as splitting text into tokens, mapping tokens to their IDs, and adding any additional tokens required by specific models.
  • Examples: BertTokenizer, GPT2Tokenizer, etc.
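
A quick sketch of what a tokenizer produces (assuming the bert-base-uncased checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("Hugging Face makes NLP easy!")
print(encoded["input_ids"])                                   # token ids, including [CLS] and [SEP]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding tokens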

Pipeline:

  • A high-level component that abstracts away the preprocessing and post-processing steps, allowing users to directly make inferences.
  • Easily perform tasks like sentiment analysis, named entity recognition, or text generation without dealing with tokenization or decoding steps.
  • Examples: pipeline("sentiment-analysis"), pipeline("ner"), etc.

Trainer & TrainingArguments:

  • Trainer is a component that simplifies the training and fine-tuning process.
  • TrainingArguments lets users define training parameters like learning rate, batch size, number of epochs, etc.
  • Provides functionalities like distributed training, mixed-precision training, and more.

Datasets:

  • A separate library integrated with Hugging Face’s ecosystem.
  • Allows users to easily load, process, and work with datasets.
  • Provides tools for data preprocessing, versioning, and sharing.
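
A minimal sketch of loading and inspecting a dataset (using the IMDB dataset that also appears in the fine-tuning example later):

from datasets import load_dataset

dataset = load_dataset("imdb")

print(dataset)              # available splits and number of rows
print(dataset["train"][0])  # a single labeled example: {'text': ..., 'label': ...}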

Inference:

  • Refers to using a trained model to make predictions.
  • Pipelines simplify the inference process, but users can also manually handle preprocessing (tokenization), model prediction, and post-processing (e.g., decoding).

Model Hub:

  • A central repository where pretrained models are stored.
  • Users can easily download, use, or fine-tune any model from the hub.
  • Also supports sharing and publishing user-trained models.

Task-specific Heads/Models:

  • While base models are great for extracting features, for specific tasks like classification or regression, an additional layer or ‘head’ is added.
  • Examples: BertForSequenceClassification, GPT2LMHeadModel, T5ForConditionalGeneration, etc.

Configuration (Config):

  • Contains all the settings and hyperparameters for a model.
  • Useful for understanding model specifics or when instantiating a model from scratch.
  • Examples: BertConfig, GPT2Config, etc.
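
A short sketch of inspecting a configuration and building a model from it (randomly initialized, for illustration):

from transformers import BertConfig, BertModel

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)

# Instantiate a model with the same architecture but randomly initialized weights
model = BertModel(config)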

Pipeline:

Image by the Author
  • 👉 Hugging Face pipelines provide an easy way to use pre-trained models for common NLP tasks.
  • 👉 You first select a pipeline for your task like “text classification” or “question answering”.
  • 👉 The pipeline will load a pretrained model like BERT or RoBERTa and prepare it for your task.
  • 👉 You can then feed text inputs directly to the pipeline and get predictions without writing any model-specific code.
  • 👉 Behind the scenes, the pipeline handles preprocessing the data, passing it to the model, and postprocessing the predictions.
  • 👉 The pipelines abstract away the complexity so you can quickly utilize powerful NLP models.
  • 👉 Some key pipelines provided include sentiment analysis, named entity recognition, text generation, translation, summarization and more.
  • 👉 Pipelines make it easy to get started with NLP and benchmark models without infrastructure or in-depth knowledge.
  • 👉 You can use the default models or specify your own models for the pipelines as needed.
  • 👉 Overall, pipelines democratize access to advanced NLP for non-experts with just a few lines of code.
Image by the Author

What Libraries and Packages to be installed:

%pip install transformers
# Below packages as needed
pip install librosa
pip install soundfile
pip install bitsandbytes
pip install sentencepiece
pip install timm

NLP Models:

Image by the Author

Steps for Sentiment Analysis using Hugging Face:

  1. Install the Required Libraries: First, ensure you have the necessary libraries installed.
pip install transformers torch

2. Import Dependencies: Import the required modules from the transformers library.

from transformers import pipeline

3. Create the Sentiment Analysis Pipeline: Initialize the sentiment analysis pipeline. This automatically downloads and loads the default model and tokenizer for sentiment analysis.

sentiment_pipeline = pipeline("sentiment-analysis")

4. Analyze Sentiment: Provide a piece of text to the pipeline to get the sentiment.

result = sentiment_pipeline("I love using Hugging Face's transformers!")
print(result)

5. The expected output will be something like:

[{'label': 'POSITIVE', 'score': 0.9998}]

6. Interpret the Results:

  • The 'label' indicates the sentiment: either 'POSITIVE' or 'NEGATIVE'.
  • The 'score' is a confidence score, ranging between 0 and 1, indicating the model's certainty in its prediction.

Additional Notes:

  • The default sentiment analysis pipeline uses the distilbert-base-uncased-finetuned-sst-2-english model, which is a version of the DistilBERT model fine-tuned for sentiment analysis.
  • You can easily swap out the default model with any other compatible sentiment analysis model from the Hugging Face model hub by specifying the model and tokenizer arguments when initializing the pipeline.

Here’s an example using the nlptown/bert-base-multilingual-uncased-sentiment model, which is designed to provide sentiment scores for multilingual text:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create the sentiment analysis pipeline with the custom model and tokenizer
sentiment_pipeline_custom = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

# Analyze sentiment using the custom pipeline
result = sentiment_pipeline_custom("Ich liebe es, Hugging Face's Transformer zu benutzen!")
print(result)
[{'label': '5 stars', 'score': 0.7963526248931885}]

For more information on sentiment analysis:

Generic Workflow for all Hugging Face models:

  1. Install the Library: Before anything, you need to install the transformers library.
pip install transformers

2. Choose a Model: Decide which pre-trained model you’d like to use. Hugging Face provides a variety of models like BERT, GPT-2, T5, etc. You can explore available models on the Hugging Face Model Hub.

3. Load the Tokenizer: Tokenization is the process of converting text into tokens (chunks of text) that can be fed into a model. Each model generally requires its own specific tokenizer.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Image by the Author

4. Tokenize the Input: Use the tokenizer to convert your input text into a format that’s suitable for the model. This might involve breaking text down into subwords, encoding it as integers, and adding special tokens.

input_text = "Hello, Hugging Face!" 
encoded_input = tokenizer.encode(input_text, return_tensors='pt')

5. Load the Model: Once you’ve chosen a model and have tokenized your input, load the pre-trained model.

from transformers import BertForMaskedLM 
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

6. Use the Model: With your tokenized input and loaded model, you can now perform inference. The exact method and outputs will vary depending on the model and its design.

output = model(encoded_input)

7. Decoding (if necessary): For certain tasks, you’ll need to decode the model’s output back into human-readable text. This is common for tasks like text generation or sequence-to-sequence tasks.

# For a masked-LM head, logits has shape (batch, sequence_length, vocab_size)
predicted_token_ids = output.logits.argmax(dim=-1)
predicted_text = tokenizer.decode(predicted_token_ids[0])

8. Fine-tuning (Optional): If you’re not just doing inference but also wish to fine-tune the model on your own dataset, you’ll need to set up a training loop, define a loss function, and update the model’s weights using an optimizer. Hugging Face’s Trainer class simplifies this process.

9. Save & Load Fine-Tuned Model (Optional): After fine-tuning, you can save the model and tokenizer for later use.

from transformers import AutoModel, AutoTokenizer

# Save model and tokenizer
model.save_pretrained("./my_model_directory/")
tokenizer.save_pretrained("./my_model_directory/")

# Load them back
model = AutoModel.from_pretrained("./my_model_directory/")
tokenizer = AutoTokenizer.from_pretrained("./my_model_directory/")

10. Using Pipelines (for simplicity): For many standard tasks (e.g., sentiment analysis, named entity recognition), Hugging Face provides the pipeline utility, which abstracts away much of the above process into a simpler API.

from transformers import pipeline

# Use a checkpoint with a fine-tuned sentiment classification head
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
result = classifier("I love Hugging Face!")

This is a general overview of the workflow when using models from Hugging Face’s Transformers library. The exact steps and code might differ depending on the specific model and task.

Topic Classification:

Let's try topic classification.

Task:

Classify news articles into one of three topics: “Sports”, “Politics”, or “Technology”.

# Topic Classification using Hugging Face's Transformers

# Installation:
# Make sure you've installed the required libraries.
# pip install transformers torch

# Import necessary libraries and modules
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

# 1. Load Model & Tokenizer:
# Using a pre-trained BERT model. For real-world usage, you'd ideally fine-tune this on your specific dataset.
model_name = "bert-base-uncased"
# We specify num_labels=3 since we have three topics: "Sports", "Politics", and "Technology".
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)
tokenizer = BertTokenizer.from_pretrained(model_name)

# 2. Create a Classification Pipeline:
topic_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# 3. Predict:
text = "The latest GPU's have caused a surge in PC gaming popularity."
result = topic_classifier(text)
print(result)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'label': 'LABEL_2', 'score': 0.43456608057022095}]

Given the lack of fine-tuning, the output will likely not be meaningful. In practice, you would train the model on labeled data corresponding to the three categories.

Fine-tuning:

To truly make this useful, you’d need to fine-tune the model on a dataset of labeled news articles. This would involve:

  1. Preprocessing your dataset to tokenize and format the news articles correctly.
  2. Setting up a training loop or using Hugging Face’s Trainer to fine-tune the model on your specific dataset.
  3. Evaluating the model’s performance on a separate test set.

Text Summarization:

  • Text summarization involves generating a concise summary retaining the most salient information from a long text document.
  • Hugging Face provides pretrained summarization models like BART, T5, Pegasus and mT5 in its model hub.
  • These models are pretrained on large datasets to generate summaries of input texts.
  • The summarization pipeline in Hugging Face makes it easy to utilize these models out-of-the-box.
  • It handles preprocessing the input, passing it to model and returning the generated summary.
  • Users can fine-tune the summarization models on custom datasets using the Trainer API for better performance.
  • Overall, Hugging Face provides easy access to cutting edge summarization models for research and applications.
from transformers import pipeline

summarizer = pipeline("summarization")

text = """"The Tower of London, officially Her Majesty's Royal Palace and Fortress of the Tower of London, is a historic castle on the north bank of the River Thames in central London. It lies within the London Borough of Tower Hamlets, which is separated from the eastern edge of the square mile of the City of London by the open space known as Tower Hill. It was founded towards the end of 1066 as part of the Norman Conquest of England. The White Tower, which gives the entire castle its name, was built by William the Conqueror in 1078 and was a resented symbol of oppression, inflicted upon London by the new ruling elite. The castle was used as a prison from 1100 until 1952, although that was not its primary purpose. A grand palace early in its history, it served as a royal residence. As a whole, the Tower is a complex of several buildings set within two concentric rings of defensive walls and a moat."""

summary = summarizer(text, max_length=130, min_length=30, do_sample=False)

print(summary[0]['summary_text'])
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.

Tower of London was founded towards the end of 1066 as part of the Norman
Conquest of England . The White Tower, which gives the entire castle its
name, was built by William the Conqueror in 1078 . The castle was used as
a prison from 1100 until 1952, although that was not its primary purpose .

Here’s a basic example using the BartForConditionalGeneration model.

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Provide a sample text that you want to summarize
text = """
The Hubble Space Telescope has made some of the most dramatic discoveries in the history of astronomy.
From its vantage point 370 miles above Earth, Hubble has beamed back images of distant galaxies,
nebulae, and star clusters, shedding light on nearly every aspect of the universe.
"""

# Encode the text and generate the summary ids
# (the "summarize: " prefix is a T5 convention; BART models do not require it)
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
summary_ids = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

# Decode the ids to get the summarized text
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)
The Hubble Space Telescope has made some of the most dramatic discoveries 
in the history of astronomy. From its vantage point 370 miles above Earth,
Hubble has beamed back images of distant galaxies, nebulae, and star clusters.

We can set a few hyperparameters for the generation like max_length, min_length, length_penalty, and num_beams to influence the length and quality of the summary.

Text Summarization DataSets:

Hugging Face’s datasets library provides a collection of datasets that can be readily used for various tasks, including text summarization. Here are some popular datasets for text summarization training:

  1. CNN/Daily Mail:
  • This is a common dataset used for extractive and abstractive summarization. It contains news articles and their respective summaries.
  • Usage:
from datasets import load_dataset 
dataset = load_dataset("cnn_dailymail", "3.0.0")

2. XSum:

  • The Extreme Summarization (XSum) dataset contains BBC articles accompanied by single-sentence summaries.

3. Gigaword:

  • This dataset contains a large number of articles and their respective headlines from various news agencies. It’s typically used for abstractive summarization.

4. MultiNews:

  • MultiNews contains news articles and their summaries, which are created by combining multiple articles on the same topic.

5. SAMSum:

  • The SAMSum dataset consists of dialogue-based data, providing conversations and their respective summaries.

6. BillSum:

  • BillSum contains text from US Congressional and California state bills with human-written summaries.

7. BigPatent:

  • As the name suggests, this dataset contains patent documents and their respective abstracts.
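
These can be loaded the same way by their Hub ids (the ids below are assumptions; check each dataset page on the Hub for the canonical name and any extra dependencies):

from datasets import load_dataset

xsum = load_dataset("xsum")        # BBC articles with single-sentence summaries
billsum = load_dataset("billsum")  # US Congressional and California bills with summaries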

Translation:

Pre-trained Models:

  • Hugging Face hosts numerous state-of-the-art translation models in various languages and language pairs.
  • Examples include MarianMT, T5, and BERT-based models tailored for translation tasks.

Ease of Use with Pipelines:

  • Hugging Face’s pipeline API offers an easy way to perform translation without delving deep into model details.
  • For instance: translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de").

Fine-tuning:

  • The Transformers library enables users to fine-tune existing translation models on custom datasets, enhancing performance for domain-specific applications.

Datasets for Translation:

  • Hugging Face’s datasets library includes numerous datasets suitable for machine translation tasks, such as WMT and Opus.

Example:

  1. Using Default model:
from transformers import pipeline

# Initialize the translation pipeline
translator = pipeline("translation_en_to_fr")

# Provide a sample text that you want to translate
text = "Hello, how are you?"

# Translate the text
translation_output = translator(text)

# Extract the translated text
translated_text = translation_output[0]['translation_text']

print(translated_text)
Bonjour, comment êtes-vous?

2. Use an advanced model for translation:

pip install sentencepiece

from transformers import MarianMTModel, MarianTokenizer

# Define the source language and target language
src_lang = 'en'
tgt_lang = 'de'

# Load the MarianMT model and tokenizer for English to German translation
model_name = "Helsinki-NLP/opus-mt-en-de"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Provide a sample text that you want to translate
text = "Hello, how are you?"

# Tokenize the text and translate
tokenized_text = tokenizer.encode(text, return_tensors="pt")
translated_tokens = model.generate(tokenized_text)
translated_text = tokenizer.decode(translated_tokens[0], skip_special_tokens=True)

print(translated_text)
Hallo, wie geht's?

Question-Answering Models:

  • Question answering involves predicting an answer to a question in text format based on context.
  • Hugging Face provides pretrained QA models like BERT, ALBERT, DistilBERT that can be finetuned for question answering.
  • The models are trained on SQuAD dataset and can answer questions based on a reference text.
  • The question-answering pipeline handles passing question-context inputs to the QA model and extracting the predicted answer.
  • Users only need to provide the question and context to the pipeline to get the extracted answer text.
  • Models can also predict “no answer” if the context does not contain the answer.
  • For unanswerable questions, models fine-tuned on SQuAD 2.0 can return an empty answer instead of an incorrect one.
  • A BERT-base model reaches roughly 88 F1 on SQuAD v1.1, which is close to human performance.
  • Users can fine-tune with Trainer API on custom datasets to improve domain-specific performance.
  1. Example using default model:
from transformers import pipeline

context = r"""The Tower of London, officially Her Majesty's Royal Palace and Fortress of the Tower of London, is a historic castle located on the north bank of the River Thames in central London. It lies within the London Borough of Tower Hamlets, separated from the eastern edge of the square mile of the City of London by the open space known as Tower Hill."""

qa_pipeline = pipeline("question-answering")

question = "Where is the Tower of London located?"

res = qa_pipeline({"question": question, "context": context})

print(res["answer"])
on the north bank of the River Thames in central London

2. Another example using advanced model:

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Load tokenizer and model
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Context and Question
context = ("In its early years, the digital data processing industry was dominated by the IBM 701, "
"then eventually the IBM 704, IBM 709, IBM 7040, 7044, IBM 7090 and IBM 7094.")
question = "Which company dominated the digital data processing industry in its early years?"

# Tokenize input
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get answer
output = model(**inputs)
answer_start = torch.argmax(output.start_logits)
answer_end = torch.argmax(output.end_logits) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

print(answer)
ibm 701

Some important Datasets used for Question-Answering:

  • SQuAD (Stanford Question Answering Dataset)
  • SQuAD 1.1: Contains 100,000+ question-answer pairs based on 500+ Wikipedia articles.
  • SQuAD 2.0: Extends SQuAD 1.1 with questions that do not have an answer in the provided passage, requiring the model to determine when no answer is available.
  1. NewsQA: A challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs.
  2. CoQA (Conversational Question Answering Challenge): Contains 127,000+ questions with answers, collected from 16,000+ conversations.
  3. QuAC (Question Answering in Context): A dataset for modeling, understanding, and participating in information seeking dialog.
  4. MS MARCO: A large-scale dataset for reading comprehension and question answering. It focuses on real-world questions.
  5. Natural Questions: Developed by Google AI Language, it uses naturally occurring questions to extract answers from Wikipedia articles.
  6. RACE: A reading comprehension dataset collected from English examinations in China, which is designed for evaluating machine reading comprehension.
  7. HotpotQA: A dataset with questions that require finding and reasoning over multiple evidence documents to answer.
  8. DROP (Discrete Reasoning Over the content of Paragraphs): A reading comprehension benchmark where answering questions requires performing discrete operations over the content of paragraphs.
  9. DuReader: A large-scale, open-domain Chinese reading comprehension dataset.
  10. BoolQ: Consists of 15942 yes/no questions about short passages from Wikipedia.
  11. BioASQ: A challenge on large-scale biomedical semantic indexing and question answering.
  12. TriviaQA: Contains 650K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents.

To fetch any dataset using the datasets library, you can use:

from datasets import load_dataset

# For example, to load the SQuAD 2.0 dataset:
dataset = load_dataset("squad_v2")

Text Generation:

  • Text generation involves automatically generating coherent text from scratch on a given prompt/topic.
  • Hugging Face provides access to models like GPT-2, GPT-Neo, BART, T5 that can generate text.
  • GPT-2 and GPT-Neo are auto-regressive language models trained to predict the next word in a sequence. T5 and BART are encoder-decoder models that can be tuned for conditional text generation.
  • The TextGenerationPipeline handles prompting the model and generating text.
  • Models are pretrained on huge text corpora like WebText, BooksCorpus, etc.
  • Users can fine-tune models on custom datasets using Trainer API.
  • Generation can be tweaked via parameters like max length, repetition penalty, etc.
  • Allows generating long-form text like stories, articles, content for websites.
  • Text generation has applications in conversational bots, creative writing aid, content creation, etc.
  1. Default Example:
from transformers import pipeline

text_generator = pipeline("text-generation", model="gpt2")

prompt = "In the kingdom of artificial intelligence,"

print(text_generator(prompt, max_length=50)[0]["generated_text"])
In the kingdom of artificial intelligence, the ability to be intelligent is the only real power in the realm of matter.

2. Use an advanced model:

from transformers import GPTNeoForCausalLM, GPT2Tokenizer
model_name = "EleutherAI/gpt-neo-2.7B"
model = GPTNeoForCausalLM.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

prompt = "In a world where AI and humans coexist,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text using beam search
output = model.generate(
    input_ids,
    max_length=150,
    num_return_sequences=3,
    no_repeat_ngram_size=2,
    temperature=0.7,   # only takes effect when do_sample=True
    num_beams=5,       # Using beam search with 5 beams
    early_stopping=True
)

for i, text in enumerate(tokenizer.batch_decode(output)):
    print(f"Generated {i + 1}: {text}")
In this article, we will take a look at how AI can be used to solve some of the most pressing problems in our world today, and how it can help us make the world a better place. We will also explore how we can harness the power of AI to make our lives easier and improve the quality of life for everyone on the planet.
AI and the Internet of Things (IoT) have the potential to revolutionize the way we live, work and interact with each other. It is estimated that by 2020, there will be more than 1.5 billion Internet-connected

Data Sets:

Image by the Author

You can check the Hugging Face datasets:

Sentence Similarity:

  • Sentence similarity refers to quantifying how similar two input sentences are semantically.
  • It has applications in search, FAQ chatbots, duplicate detection, plagiarism checking etc.
  • Hugging Face provides access to pretrained encoders like sentence-transformers/all-MiniLM-L6-v2 model.
  • This model encodes input sentences into fixed-length vectors using a Siamese network architecture.
  • The vectors are compared using cosine similarity to determine closeness between sentences.
  • Scores close to 1 indicate near-identical meaning; a threshold can be used to separate semantic duplicates.
  • The model is trained on Natural Language Inference (NLI) datasets like SNLI, MultiNLI.
  • Fine-tuning on domain-specific data can improve performance for niche applications.
  • The pipeline handles encoding sentences and computing similarity scores automatically.
  • Overall, it enables building semantic search, duplicate detection, document clustering solutions easily.
  • Vector comparisons are faster and more robust compared to rules-based semantic matching.

Check out my article on vector databases for more information:

  1. Basic example:
from sentence_transformers import SentenceTransformer, util
import torch

# Load a pre-trained sentence-transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Example sentences
sentence1 = "The sky is blue."
sentence2 = "Blue is the color of the sky."

# Convert sentences to embeddings
embedding1 = model.encode(sentence1, convert_to_tensor=True)
embedding2 = model.encode(sentence2, convert_to_tensor=True)

# Compute cosine similarity between embeddings
cosine_sim = util.pytorch_cos_sim(embedding1, embedding2)

print(f"Cosine Similarity: {cosine_sim.item()}")
Cosine Similarity: 0.90008145570755

2. Another example using advanced model:

from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import cosine_similarity

# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Define function to convert sentence to embedding
def get_embedding(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings from the last hidden layer to get the sentence embedding
    sentence_embedding = torch.mean(outputs.last_hidden_state[0], dim=0)
    return sentence_embedding

# Example sentences
sentence1 = "The sky is blue."
sentence2 = "Blue is the color of the sky."

# Get embeddings
embedding1 = get_embedding(sentence1, model, tokenizer)
embedding2 = get_embedding(sentence2, model, tokenizer)

# Compute cosine similarity
similarity = cosine_similarity(embedding1.unsqueeze(0), embedding2.unsqueeze(0))

print(f"Cosine Similarity: {similarity.item()}")
Cosine Similarity: 0.7346132397651672

Check the datasets used for sentence similarity.

Zero Shot Classification:

  • Zero-shot classification involves predicting classes that were not seen during model training.
  • Useful when new classes appear at inference time that were not available during training.
  • Works better for some classes than others, depending on how well the candidate label describes the class.
  • Useful for datasets with continuously growing or shifting classes.
  • Avoids retraining model from scratch each time new classes are added.
  • Overall, enables adapting models to new concepts on the fly without explicit training.
  1. Default Model:
from transformers import pipeline

# Initialize the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")

# Define the sequence to classify and potential labels
sequence = "I love hiking in the mountains."
candidate_labels = ["entertainment", "sports", "nature activity"]

# Classify the sequence
result = classifier(sequence, candidate_labels)

# Display result
print("Sequence:", sequence)
print("Predicted label:", result["labels"][0])
print("Confidence scores:", result["scores"])
No model was supplied, defaulted to facebook/bart-large-mnli and revision c626438 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Sequence: I love hiking in the mountains.
Predicted label: nature activity
Confidence scores: [0.9644157886505127, 0.01824183762073517, 0.017342381179332733]

2. Using an Advanced Model:

from transformers import BartForSequenceClassification, BartTokenizer
from transformers import pipeline

# Load the BART model and tokenizer
model_name = "facebook/bart-large-mnli"
model = BartForSequenceClassification.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Initialize the zero-shot classification pipeline using BART
classifier = pipeline("zero-shot-classification", model=model, tokenizer=tokenizer)

# Define the sequence to classify and potential labels
sequence = "I love hiking in the mountains."
candidate_labels = ["entertainment", "sports", "nature activity"]

# Classify the sequence
result = classifier(sequence, candidate_labels)

# Display result
print("Sequence:", sequence)
print("Predicted label:", result["labels"][0])
print("Confidence scores:", result["scores"])
Sequence: I love hiking in the mountains.
Predicted label: nature activity
Confidence scores: [0.9644157886505127, 0.01824183762073517, 0.017342381179332733]

In this example, the BART model fine-tuned on the MultiNLI (MNLI) dataset is loaded explicitly. It is the same checkpoint the default zero-shot pipeline uses, which is why the scores match the previous example.

NER:

  • NER involves identifying and classifying named entities like people, organizations, and locations in text.
  • Useful for extracting structured information from unstructured documents.
  • Hugging Face provides pre-trained NER models like BERT, RoBERTa, XLM-RoBERTa.
  • These models label words/spans in a text into pre-defined entity categories.
  • Common entities annotated are PERSON, ORG, LOCATION, DATE, TIME, MONEY, etc.
  • Models are trained on datasets like CoNLL-2003, OntoNotes, WNUT-17.
  • The TokenClassificationPipeline handles feeding text to the model and extracting entity labels.
  • NER models can be fine-tuned on custom data using Trainer API.
  • Achieve F1 scores of over 90% on common benchmark datasets.
  • Significantly more accurate than older CRF based statistical NER systems.
  • Enables information extraction from text for knowledge bases, chatbots, search etc.
  1. Basic example using the default model:
from transformers import pipeline

ner = pipeline("ner")

text = "My name is Sarah and I live in London, UK."

ner_results = ner(text)

for entity in ner_results:
    print(entity["word"], entity["entity"])
Sarah I-PER
London I-LOC
UK I-LOC

2. Using an advanced model:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Define the model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Create a NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer)

# Provide a sample text
text = "Elon Musk is the CEO of SpaceX and Tesla."

# Get NER results
ner_results = nlp_ner(text)

# Print results
for entity in ner_results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
Entity: El, Label: I-PER, Score: 0.9996
Entity: ##on, Label: I-PER, Score: 0.9990
Entity: Mu, Label: I-PER, Score: 0.9993
Entity: ##sk, Label: I-PER, Score: 0.9985
Entity: Space, Label: I-ORG, Score: 0.9992
Entity: ##X, Label: I-ORG, Score: 0.9986
Entity: Te, Label: I-ORG, Score: 0.9964
Entity: ##sla, Label: I-ORG, Score: 0.9953

All important NLP pipelines:

from transformers import pipeline

# Sentiment Analysis
sentiment_pipeline = pipeline("sentiment-analysis")

# Text Classification
classifier = pipeline("text-classification")

# Token Classification (e.g., Named Entity Recognition)
ner_pipeline = pipeline("ner")

# Question Answering
qa_pipeline = pipeline("question-answering")

# Masked Language Modeling
fill_mask = pipeline("fill-mask")

# Summarization
summarizer = pipeline("summarization")

# Translation (e.g., English to French)
translator = pipeline("translation_en_to_fr")

# Feature Extraction
feature_extraction = pipeline("feature-extraction")

# Text Generation
generator = pipeline("text-generation")

# Zero-shot Classification
zero_shot_classifier = pipeline("zero-shot-classification")

# Conversation
conversational_pipeline = pipeline("conversational")
Image Credit Hugging Face Documentation

Default Models Used in NLP Tasks:

Text Classification:

  • Model: distilbert-base-uncased-finetuned-sst-2-english

Token Classification (e.g., Named Entity Recognition):

  • Model: dslim/bert-base-NER

Text Summarization:

  • Model: sshleifer/distilbart-cnn-12-6

Question Answering:

  • Model: distilbert-base-cased-distilled-squad

Text Generation:

  • Model: gpt2

Text Similarity:

  • Model: sentence-transformers/all-mpnet-base-v2

Translation:

  • Model: t5-base

Fill Mask:

  • Model: distilroberta-base

Zero-Shot Classification:

  • Model: facebook/bart-large-mnli

Named Entity Recognition:

  • Model: dbmdz/bert-large-cased-finetuned-conll03-english

Conversational (e.g., Chatbots):

  • Model: facebook/blenderbot-400M-distill

Language Model (Mask filling):

  • Model: distilbert-base-uncased

Text-to-Text Transfer Transformer (T5) tasks:

  • Model: t5-small

Check out all the notebooks:

Computer Vision Models:

Image by the Author

- Hugging Face provides access to SOTA models for various computer vision tasks:

- Image Classification — ViT, DeiT, ConvNeXT, Swin Transformer

- Object Detection — DETR, Mask R-CNN

- Image Segmentation — MaskFormer, SETR

- Video Classification — TimeSformer, MV-ViT

- Image Generation — Stable Diffusion (via the diffusers library)

- Self-Supervised Models — BEiT, MAE, MaskFeat

- Vision-Language Models — VL-T5, ViLT

- Models are pretrained on large datasets like ImageNet, COCO, Kinetics,Conceptual Captions.

- OpenCV and PIL integrations allow feeding images directly to models.

- Vision pipelines provide OOTB inference for tasks like classification, object detection.

- Model repo READMEs document model architecture, training details.

- Models can be fine-tuned on new datasets using the Trainer API.

- Supports features like batched inference, mixed precision, multi-GPU training.

- Achieve leading metrics across computer vision tasks and datasets.

- Active community support for most popular vision models on Discussions forum.

  • Overall, Hugging Face is becoming a leading hub for transferring and deploying CV models.

Image Classification Model:

from transformers import ViTFeatureExtractor, ViTForImageClassification
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Note: this in21k checkpoint is pretrained without an ImageNet-1k classification head,
# so the predicted label below is not meaningful; try 'google/vit-base-patch16-224' for readable labels
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224-in21k')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224-in21k')

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)

predicted_class_idx = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_class_idx])
LABEL_1
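
Object detection can be handled the same way through a pipeline (a sketch assuming the facebook/detr-resnet-50 checkpoint of the DETR model mentioned above; timm must be installed):

from transformers import pipeline
from PIL import Image
import requests

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

for detection in detector(image):
    print(detection["label"], round(detection["score"], 3), detection["box"])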

Audio Models:

Image by the Author
  • Hugging Face provides access to pretrained models for speech and music audio tasks.
  • Key speech models include Wav2Vec2, HuBERT, Speech2Text, Speech2Text2, Whisper.
  • These models perform speech recognition, speech translation, keyword spotting.
  • Music models like MusicLM and Jukebox generate and classify music.
  • Audio models are pretrained on large labeled datasets like Librispeech, Common Voice, GTZAN music dataset.
  • Speech models encode raw waveform input to word or phoneme representations.
  • Models leverage transformer architectures tailored for audio seq2seq tasks.
  • The AudioClassification pipeline classifies audio clips like environmental sounds.
  • The Speech2Text pipeline transcribes speech to text using ASR models.
  • Low resource speech recognition using adapter-based tuning.
  • Music generation models produce high-fidelity song samples.
  • Active areas of research include multilingual speech recognition, spoken dialog.
  • Audio models enable voice interfaces, transcriptions, sound searching.
  • Powerful capabilities being unlocked by models trained on large labeled audio.
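
A minimal speech-to-text sketch using the automatic-speech-recognition pipeline (assumptions: the openai/whisper-tiny checkpoint, a local sample.wav file, and ffmpeg available for audio decoding):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# "sample.wav" is a placeholder path to any short speech recording
result = asr("sample.wav")
print(result["text"])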

Check out the below Hugging Face Audio course for more information:

Image credit Hugging Face Documentation

Multimodal:

Image by the Author
  • Multimodal models combine different modalities like text, vision, audio within a single model.
  • Enables models to represent information from multiple modalities in a joint embedding space.
  • Examples include VL-BERT, ViLBERT, LXMERT — combining text and image inputs.
  • Other models integrate speech and text modalities; MuST-C, for example, is a multimodal dataset used to train speech translation models.
  • Models pretrained on multimodal datasets like Conceptual Captions, COCO, Visual Genome.
  • Can perform cross-modal tasks like visual question answering, image captioning.
  • Leverage modality-specific encoders combined with cross-attention mechanisms.
  • Joint embedding space allows transfer of knowledge between modalities.
  • Finetuning on downstream tasks provides alignments between modalities.
  • Multimodal models achieve state-of-the-art results on multimodal benchmarks.
  • Provides building blocks for developing multisensory AI systems.
  • Enables models to understand the world and different modalities better.
  • Overall, an emerging area of research for universal multimodal representations.
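
A short visual question answering sketch with ViLT, one of the models listed above (assuming the dandelin/vilt-b32-finetuned-vqa checkpoint):

from transformers import pipeline
from PIL import Image
import requests

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

print(vqa(image=image, question="How many cats are in the picture?"))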

Build Custom Pipeline:

Creating a custom pipeline in Hugging Face is possible and often done when you have specialized processing steps or when you need more control over how data flows through the pipeline. Here are the steps to create a custom pipeline for a Named Entity Recognition (NER) task:

  1. Load Pre-trained Model and Tokenizer: This is the first step where you load your desired model and tokenizer.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

2. Tokenize Input: You’ll use the tokenizer to convert input text into tokens. This returns an encoding that the model can understand.

def tokenize_input(text):
    # Return a dict of tensors (input_ids, attention_mask) that the model expects
    return tokenizer(text, return_tensors="pt")

3. Predict with the Model: Use the model to make predictions on the tokenized input.

def get_predictions(text):
    inputs = tokenize_input(text)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs

4. Process the Output: For NER, you’ll want to extract entity labels from the model’s predictions.

def extract_entities(text):
    outputs = get_predictions(text)
    predictions = torch.argmax(outputs.logits, dim=-1)
    # Convert the encoded ids back to tokens so tokens and predictions stay aligned
    input_ids = tokenize_input(text)["input_ids"][0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    entities = []
    for token, label_id in zip(tokens, predictions[0]):
        if token in tokenizer.all_special_tokens:
            continue  # skip [CLS] and [SEP]
        label = model.config.id2label[label_id.item()]
        entities.append((token, label))
    return entities

5. Create the Custom Pipeline: Finally, wrap everything in a function that can be used just like the default Hugging Face pipeline.

def custom_ner_pipeline(text):
    return extract_entities(text)

6. Usage:

text = "Elon Musk is the CEO of SpaceX."
results = custom_ner_pipeline(text)
print(results)

This gives you full control over each step. You can add custom pre-processing before tokenization, change how predictions are made, or post-process the model’s output differently.

Check the below blog on custom pipelines:

Fine Tuning:

Why We Need to Fine-tune Models:

  • Transfer of Knowledge
  • Adaptation to Specific Domains
  • Resource Efficiency
  • Overcome Data Limitations
  • Task Specificity
  • Improved Performance

For example, Transfer of Knowledge: a model trained on a vast amount of text data (like BERT) has general language understanding. However, for a specific task like sentiment analysis on movie reviews, it still needs to learn the nuances of that particular dataset.

# Install Necessary Libraries:
# pip install transformers datasets torch

# Import Required Modules
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import torch

# Load Dataset
dataset = load_dataset("imdb")

# Tokenization
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare Model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Training Configuration
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_dir="./logs",
    logging_steps=200,
    do_train=True,
    do_eval=True,
    output_dir="./results",
    overwrite_output_dir=True,
    save_steps=10_000,
    eval_steps=200,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

# Fine-tune the Model
trainer.train()

# Evaluate the Model
results = trainer.evaluate()
print(results)
  • The code starts by importing necessary libraries from transformers and datasets.
  • It fetches the IMDB movie reviews dataset and processes it using a BERT tokenizer.
  • Using the Trainer class and specified training configurations, the BERT model is fine-tuned on this sentiment analysis task.
  • Finally, the trained model’s performance is evaluated on a validation set.

Check out for more information on fine-tuning.

What is a Model Card?

A Model Card is the README-style document that accompanies a model on the Hugging Face Hub; it typically covers the following:

Model Details:

  • Name of the model.
  • Model architecture (e.g., BERT, GPT-2).
  • Date of release and last update.

Intended Use:

  • Specific tasks the model is designed for.
  • Application domains where the model performs optimally.

Performance Metrics:

  • Benchmarks on relevant datasets.
  • Comparisons with other models (if applicable).

Training Data:

  • Description of the dataset used for training.
  • Data collection process, sources, and potential biases.

Evaluation Data:

  • Information about datasets used for evaluation/testing.
  • Details about the evaluation setup.

Ethical Considerations:

  • Known model limitations.
  • Potential risks and recommendations for use.

Usage:

  • Instructions for using the model.
  • Sample code or examples.

Licensing:

  • Licensing information about the model and data.
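
Model cards can also be read programmatically with the huggingface_hub library (a sketch; the attribute names assume the ModelCard API):

from huggingface_hub import ModelCard

card = ModelCard.load("distilbert-base-uncased-finetuned-sst-2-english")

print(card.data)         # YAML metadata: language, license, tags, datasets, metrics, ...
print(card.text[:300])   # beginning of the card's markdown body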

How Model Cards Help Developers:

  1. Transparency: Provides clarity on how the model was trained, what data was used, and what its intended purposes are.
  2. Informed Decision Making: With insights on performance metrics and benchmarks, developers can make better decisions when choosing models for specific tasks.
  3. Understanding Limitations: Outlines the known limitations and potential biases in the model, helping developers anticipate issues.
  4. Ethical Deployment: By highlighting ethical considerations and potential misuse cases, Model Cards guide developers towards responsible usage.
  5. Replicability: Model Cards often contain details that could assist researchers and developers in replicating or building upon the model’s results.
  6. Ease of Use: With usage instructions and sample code, developers can quickly integrate and experiment with the model in their applications.
Image-Credit Hugging Face Documentation

Hugging Face Leaderboard:

  • It tracks evaluation results of NLP models on various datasets and benchmark tasks.
  • Covers popular NLP tasks like text classification, named entity recognition, question answering.
  • Includes canonical datasets like GLUE, SQuAD, CONLL-2003 that are used to benchmark model performance.
  • Models evaluated include BERT, RoBERTa, T5, BART, ALBERT and other transformer architectures.
  • Shows single model and ensemble model results for comparison.
  • Metrics tracked include accuracy, F1 score, EM (exact match), etc., depending on the dataset.
  • Allows comparing performance of different models architectures and configurations.
  • Updated frequently as new state-of-the-art models are released.
  • Links to model repository and reference paper for more details.
  • Can download config files of top models for replicating results.

The link is

Image by the Author

Conclusion:

  • Hugging Face has had a tremendous impact on machine learning research and applications in recent years. Their open-source tools have enabled quick experimentation and benchmarking using state-of-the-art models.
  • They have helped democratize access to AI by making it easy for anyone to utilize powerful models like BERT and GPT-2 through their libraries and model hub. Their tools are actively used by students, researchers, startups, and large enterprises alike.
  • Hugging Face has fostered an active community with over 22K GitHub stars and thousands of contributors. Users interact on GitHub discussions, Discourse forums, Twitter to share ideas, get help, and give feedback. This engaged community helps drive the growth of their tools.
  • With continued research advances in fields like NLP, computer vision and multimodal AI, Hugging Face is poised to provide the best-in-class tools to transfer these innovations to real-world applications. Their approach of empowering users through open source tools can go a long way in shaping the future of AI.

References:

Websites:

Papers:

Documentation:

Blog posts and articles:

Videos:


ML/DS — Certified GCP Professional Machine Learning Engineer, Certified AWS Machine Learning Specialty, Certified GCP Professional Data Engineer.