Hugging Face Transformers: A Complete Guide to the Most Powerful NLP Library

Learning Objectives

By the end of this article, you will have mastered the following core skills:

  • A solid understanding of Transformers' positioning, core concepts, and design philosophy
  • Installing Transformers, configuring the environment, and managing dependencies
  • Running inference with the Pipeline API
  • How AutoModel and AutoTokenizer work under the hood
  • The complete fine-tuning workflow
  • Using multimodal models (text, audio, image)
  • The Trainer API and custom training loops
  • Performance optimization and inference acceleration techniques

1. Project Overview

1.1 What It Is

Transformers is Hugging Face's flagship NLP library. It provides APIs and tools for pretrained models, making it easy to download and fine-tune state-of-the-art (SOTA) models.

Core philosophy: make cutting-edge AI accessible to everyone.

1.2 Key Numbers

Metric            Value
GitHub Stars      144k
GitHub Forks      25.8k
Contributors      2,000+
Releases          339
Latest release    v4.49.0 (2025-02)
License           Apache 2.0
Language          Python 95.9%

1.3 Why Choose Transformers

Feature             Description
Rich model zoo      100,000+ pretrained models
Multimodal          Text, audio, image, video
Easy to use         Pipeline reduces inference to one line
Framework support   PyTorch, TensorFlow, JAX
Self-contained      Only a backend framework is required as a dependency
Well documented     Extensive tutorials and examples
Active community    2,000+ contributors

1.4 Supported Task Types

Transformers covers a wide range of tasks:

Modality      Tasks
Text          Classification, named entity recognition, question answering, generation, translation, summarization
Audio         Speech recognition, speech synthesis, audio classification
Image         Classification, segmentation, generation, OCR
Multimodal    VQA, image captioning, document understanding

2. Installation and Environment Setup

2.1 Basic Installation

# Install Transformers only (a backend such as PyTorch must already be installed)
pip install transformers

# Verify the installation
python -c "import transformers; print(transformers.__version__)"

2.2 Full Installation (with backends)

# PyTorch backend
pip install "transformers[torch]"

# TensorFlow backend
pip install "transformers[tf]"

# JAX/Flax backend
pip install "transformers[flax]"

# All backends
pip install "transformers[torch,tf,flax]"

2.3 Deep Learning Framework Version Requirements

Backend      Minimum     Recommended
PyTorch      ≥1.11       ≥2.0
TensorFlow   ≥2.4        ≥2.15
JAX/Flax     ≥0.4.20     ≥0.4.25
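To check installed versions against these minimums at runtime, a small helper along these lines can be used. This is a sketch: `parse_version` is a deliberately simplified parser, not the `packaging` library, and `meets_minimum` is a hypothetical helper name.

```python
from importlib.metadata import version, PackageNotFoundError

def parse_version(v: str) -> tuple:
    """Parse '2.0.1' into (2, 0, 1); ignore local suffixes like '+cu121'."""
    core = v.split("+")[0]
    parts = []
    for p in core.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def meets_minimum(package: str, minimum: str) -> bool:
    """Return True if the installed package is at least `minimum`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False
    return parse_version(installed) >= parse_version(minimum)

# Example: check the PyTorch backend against the table above
print(meets_minimum("torch", "1.11"))
```

Tuple comparison handles the mixed-length cases in the table, e.g. `(1, 11) < (2, 0)`.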

2.4 GPU Setup

# Confirm CUDA is available
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# Confirm the GPU is recognized
python -c "import torch; print(f'GPU: {torch.cuda.get_device_name(0)}')"

3. Pipeline: Quick Inference

3.1 What Is a Pipeline

Pipeline is the simplest inference interface Transformers offers: a single call handles preprocessing, the model forward pass, and postprocessing.

3.2 Text Classification

from transformers import pipeline

# Sentiment analysis (downloads a default English model on first use)
classifier = pipeline("sentiment-analysis")
result = classifier("I love using Hugging Face Transformers!")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

3.3 Named Entity Recognition (NER)

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # grouped_entities=True is the deprecated spelling
result = ner("Hugging Face is based in New York City.")
print(result)
# [{'entity_group': 'ORG', 'score': 0.9987, 'word': 'Hugging Face', 'start': 0, 'end': 12},
#  {'entity_group': 'LOC', 'score': 0.9992, 'word': 'New York City', 'start': 25, 'end': 38}]

3.4 Question Answering

from transformers import pipeline

qa = pipeline("question-answering")
context = """
Transformers is a library developed by Hugging Face.
It provides pretrained models for NLP tasks.
"""
question = "Who developed Transformers?"

result = qa(question=question, context=context)
print(result)
# {'score': 0.9995, 'start': 40, 'end': 52, 'answer': 'Hugging Face'}

3.5 Text Generation

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
# num_return_sequences > 1 requires sampling
result = generator("Once upon a time,", max_length=50, do_sample=True, num_return_sequences=2)
print(result)
# [{'generated_text': 'Once upon a time, there was a young prince who...'},
#  {'generated_text': 'Once upon a time, in a galaxy far, far away...'}]

3.6 Translation

from transformers import pipeline

translator = pipeline("translation_en_to_fr")
result = translator("Hugging Face Transformers is awesome!")
print(result)
# [{'translation_text': "Les Transformers de Hugging Face sont formidables !"}]

3.7 Summarization

from transformers import pipeline

summarizer = pipeline("summarization")
article = """
Transformers have revolutionized natural language processing.
They use self-attention mechanisms to process input sequences.
The architecture was introduced in the paper 'Attention Is All You Need'.
Since then, it has become the foundation for many state-of-the-art models.
"""

result = summarizer(article, max_length=50, min_length=20)
print(result)
# [{'summary_text': 'Transformers use self-attention to process sequences, introduced in Attention Is All You Need.'}]

3.8 Image Classification

from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat.png")
print(result)
# [{'score': 0.9978, 'label': 'Egyptian cat'},
#  {'score': 0.0012, 'label': 'tabby cat'}]

4. AutoClass: Automatic Model Loading

4.1 What Is an AutoClass

An AutoClass infers the model architecture from the checkpoint and loads the matching pretrained weights, so you never have to specify the model class by hand.

4.2 AutoModel

from transformers import AutoModel, AutoTokenizer

# Load the model automatically
model_name = "bert-base-uncased"
model = AutoModel.from_pretrained(model_name)

# Load the matching tokenizer automatically
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Run a forward pass (return_tensors="pt" is required to feed the model)
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
# torch.Size([1, 6, 768])

4.3 Common AutoClasses

Class                                   Purpose
AutoModel                               Load any base model
AutoModelForSequenceClassification      Sequence classification
AutoModelForTokenClassification         Token classification
AutoModelForQuestionAnswering           Question answering
AutoModelForCausalLM                    Causal language modeling
AutoModelForMaskedLM                    Masked language modeling
AutoModelForSeq2SeqLM                   Sequence-to-sequence
AutoTokenizer                           Load the matching tokenizer
AutoConfig                              Load the model configuration
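Under the hood, the Auto classes dispatch on the `model_type` field of the checkpoint's `config.json` and look the architecture up in a registry. A toy sketch of that dispatch pattern (the `Toy*` class names are hypothetical, not the real implementation):

```python
# Minimal registry-based dispatch, mimicking how AutoModel picks a class
MODEL_REGISTRY = {}

def register(model_type):
    """Class decorator that records a model class under its model_type key."""
    def deco(cls):
        MODEL_REGISTRY[model_type] = cls
        return cls
    return deco

@register("bert")
class ToyBertModel:
    def __init__(self, config):
        self.config = config

@register("gpt2")
class ToyGPT2Model:
    def __init__(self, config):
        self.config = config

class ToyAutoModel:
    @staticmethod
    def from_config(config: dict):
        # Dispatch on config["model_type"], like the real AutoModel does
        cls = MODEL_REGISTRY[config["model_type"]]
        return cls(config)

model = ToyAutoModel.from_config({"model_type": "bert", "hidden_size": 768})
print(type(model).__name__)  # ToyBertModel
```

The real library additionally downloads the config from the Hub and maps each `model_type` to many task-specific heads, but the lookup idea is the same.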

4.4 Handling Different Tasks

from transformers import AutoModelForSequenceClassification, AutoModelForQuestionAnswering

# Text classification
clf_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Question answering
qa_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

5. Fine-tuning

5.1 Why Fine-tune

A pretrained model learns general-purpose features from large-scale data; fine-tuning continues training on your task-specific data so the model adapts to your scenario.

5.2 Preparing Data

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("glue", "sst2")
print(dataset)
# DatasetDict({
#     train: Dataset({ features: ['idx', 'sentence', 'label'], num_rows: 67349 })
#     validation: Dataset({ features: ['idx', 'sentence', 'label'], num_rows: 872 })
# })

5.3 Preprocessing

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

5.4 Using the Trainer

import numpy as np
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load the model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2
)

# metric_for_best_model needs a compute_metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    eval_strategy="epoch",  # older versions call this evaluation_strategy
    logging_dir="./logs",
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

5.5 Custom Training Loop

import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup

# Initialization
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Data (reuses tokenize_function from 5.3)
dataset = load_dataset("glue", "sst2")
tokenized = dataset.map(tokenize_function, batched=True)
tokenized = tokenized.rename_column("label", "labels")  # the model expects "labels"
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
train_loader = DataLoader(tokenized["train"], batch_size=16, shuffle=True)

# Optimizer and learning-rate schedule
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
num_epochs = 3
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_epochs * len(train_loader)
)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

6. Tokenizers

6.1 What Is a Tokenizer

A tokenizer converts raw text into the tokens (and token IDs) that the model can process.

6.2 Basic Usage

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode
encoded = tokenizer("Hello, world!")
print(encoded)
# {'input_ids': [101, 7592, 1010, 2088, 999, 102],
#  'token_type_ids': [0, 0, 0, 0, 0, 0],
#  'attention_mask': [1, 1, 1, 1, 1, 1]}

# Decode
decoded = tokenizer.decode(encoded["input_ids"])
print(decoded)
# "[CLS] hello, world! [SEP]"  (the uncased model lowercases its input)

6.3 Batch Encoding

# Batch encoding
sentences = ["Hello, world!", "Transformers is awesome!"]
encoded = tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors="pt")
print(encoded)
# Both sequences are padded to the length of the longest one:
# {'input_ids': tensor of shape (2, max_len), shorter rows padded with 0,
#  'attention_mask': tensor of shape (2, max_len), with 0 over the padding positions}
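What `padding=True` does can be illustrated without the library: pad every sequence to the batch maximum and mark real tokens with 1 in the attention mask. A minimal pure-Python sketch (`pad_batch` is a hypothetical helper, not a Transformers API):

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to equal length and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)        # right-pad with pad_id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 102], [101, 7592, 1010, 2088, 102]])
print(batch["input_ids"])       # [[101, 7592, 102, 0, 0], [101, 7592, 1010, 2088, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is what lets the model's attention layers ignore padding positions.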

6.4 Tokenizer Types

Type            Example models      Characteristics
WordPiece       BERT                Greedy longest-match subwords within word boundaries
Byte-level BPE  GPT-2, RoBERTa      Merge rules learned over raw bytes
SentencePiece   XLNet, ALBERT       Language-agnostic, treats text as a raw stream
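The WordPiece scheme can be sketched in a few lines: greedily match the longest known prefix, and mark word-internal pieces with `##`. This is a toy sketch with a made-up vocabulary, not BERT's real tokenizer:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece style."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces get "##"
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known subword covers this position
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##believ", "##able", "believ", "##s"}
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```

Byte-level BPE differs mainly in that its units are bytes and its splits come from learned merge rules rather than longest-match lookup.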

7. Multimodal Models

7.1 Audio

from transformers import pipeline

# Speech recognition
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
result = asr("audio.wav")
print(result)
# {'text': 'The quick brown fox jumps over the lazy dog.'}

7.2 Images

from transformers import pipeline

# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
result = segmenter("image.jpg")
print(result)
# [{'score': 0.99, 'label': 'person', 'mask': <PIL.Image.Image>}]

7.3 Visual Question Answering

from transformers import pipeline

# "vqa" is an alias for the "visual-question-answering" task;
# trocr-base-handwritten is an OCR model, so a VQA model is used here instead
vqa = pipeline("vqa", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="image.png", question="What is in the image?")
print(result)
# [{'score': 0.99, 'answer': 'A cat'}]

8. Model Optimization and Acceleration

8.1 Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load in 4-bit (requires the bitsandbytes package)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
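The core idea behind quantization can be shown in plain Python: map float weights onto a small integer grid with a scale factor, trading a bounded rounding error for a fraction of the memory. This is a toy symmetric int8 scheme, far simpler than the NF4 format bitsandbytes actually uses:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: w ≈ q * scale with q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [qi * scale for qi in q]

w = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))

print(q)  # [42, -127, 0, 90]
print(max_err <= scale / 2 + 1e-9)  # True: rounding error is at most half a step
```

Each int8 value takes 1 byte instead of 4 (float32), and real schemes additionally quantize per-block to keep `scale` small where weights are small.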

8.2 Flash Attention

# Use Flash Attention 2 (requires the flash-attn package)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16
)

8.3 Inference Optimization Tools

Tool                        Description
Optimum                     Hugging Face's official optimization toolkit
ONNX Runtime                Cross-platform inference acceleration
Text Generation Inference   Optimized LLM serving

# Export and optimize with Optimum
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    export=True
)
model.save_pretrained("optimized_model/")

9. The Transformers Ecosystem

9.1 The Hugging Face Stack

Library        Description
Transformers   Pretrained model library
Datasets       Dataset management
Tokenizers     High-performance tokenizers
PEFT           Parameter-efficient fine-tuning (LoRA, etc.)
Diffusers      Diffusion models
Evaluate       Evaluation metrics
Hub            Model hosting platform

9.2 Using the Hub

from huggingface_hub import snapshot_download

# Download a model
model_dir = snapshot_download("bert-base-uncased")
print(model_dir)
# e.g. '~/.cache/huggingface/hub/models--bert-base-uncased/snapshots/<commit-hash>'

# Upload a model
from huggingface_hub import HfApi
api = HfApi()
api.upload_folder(
    folder_path="./my_model",
    repo_id="my-username/my-model",
    repo_type="model"
)

9.3 Deploying to Spaces

Hugging Face Spaces hosts free inference demos:

# Create a Space (there is no "create-space" subcommand; use repo create)
huggingface-cli repo create my-space --type space --space_sdk gradio

# Deploy
git clone https://huggingface.co/spaces/my-username/my-space
# add your app code (e.g. app.py)
git push origin main

10. Hands-on Examples

10.1 Sentiment Analysis Service

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the model once at startup
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment = pipeline("sentiment-analysis", model=model_name)

@app.post("/predict")
def predict(text: str):
    result = sentiment(text)
    return {"text": text, "sentiment": result[0]["label"], "confidence": result[0]["score"]}

10.2 Batch Inference

from transformers import pipeline

classifier = pipeline("text-classification")

texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay, nothing special.",
    # ... more texts
]

# Pipelines accept a list directly; batch_size controls how many examples
# go through the model per forward pass. This is faster and safer than
# sharing one pipeline across threads, which gives no real parallelism.
results = classifier(texts, batch_size=8)
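If you need to split a long list of texts into fixed-size batches yourself (for example to interleave I/O or report progress), a simple chunking helper is enough. A minimal pure-Python sketch (`chunked` is a hypothetical helper, not a library function):

```python
from typing import Iterator, List

def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches from a list; the last may be short."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

texts = ["a", "b", "c", "d", "e"]
print(list(chunked(texts, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

Each batch can then be passed to the classifier in turn, accumulating the results.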

11. FAQ

11.1 CUDA out of memory

# Reduce the batch size
batch_size = 4  # was 16

# Enable gradient checkpointing (trades compute for memory)
model.gradient_checkpointing_enable()

# Or accumulate gradients across small batches:
# TrainingArguments(..., gradient_accumulation_steps=4)

# Free cached GPU memory
torch.cuda.empty_cache()
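Gradient accumulation works because, for a loss that is a mean over examples, averaging the gradients of several micro-batches equals the full-batch gradient. A tiny linear-model example can verify this in pure Python (no framework; the function names are illustrative):

```python
def grad_mse(w, xs, ys):
    """Gradient of L(w) = mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

# Full-batch gradient
full = grad_mse(w, xs, ys)

# Accumulated: two micro-batches of 2, each weighted by its share of the batch
acc = 0.0
for i in range(0, 4, 2):
    micro = grad_mse(w, xs[i:i + 2], ys[i:i + 2])
    acc += micro * (2 / 4)  # weight = micro_batch_size / total_batch_size

print(abs(full - acc) < 1e-12)  # True: same update, a fraction of the memory
```

This is why a batch of 16 with `gradient_accumulation_steps=4` behaves like a batch of 64 while only ever holding 16 examples' activations in memory.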

11.2 Slow Model Loading

# Use the fast (Rust-based) tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Pre-download the model
from huggingface_hub import snapshot_download
snapshot_download(model_name)

11.3 Token/Text Alignment

# Offset mappings align tokens with the source text (requires a fast tokenizer)
text = "Hello"
encoded = tokenizer(text, return_offsets_mapping=True)
print(encoded["offset_mapping"])
# [(0, 0), (0, 5), (0, 0)]  (special tokens map to (0, 0))

# Map a token back to the source text
start, end = encoded["offset_mapping"][1]
print(f"Token {encoded['input_ids'][1]} corresponds to: '{text[start:end]}'")

12. Summary

Transformers is the cornerstone library of modern NLP:

Strength         Description
Rich ecosystem   Transformers + Datasets + PEFT + Hub
Model zoo        100,000+ pretrained models
Multimodal       Text, audio, image, video
Ease of use      One-line inference with Pipeline
Flexibility      Fine-tuning and full customization
Performance      Quantization, Flash Attention, ONNX

When to Use It

  • Natural language processing (classification, NER, QA, generation)
  • Speech (ASR, TTS)
  • Computer vision (classification, segmentation, detection)
  • Multimodal tasks (VQA, image captioning)
  • Model research and exploring new tasks

Official Resources

  • GitHub: https://github.com/huggingface/transformers
  • Documentation: https://huggingface.co/docs/transformers
  • Model Hub: https://huggingface.co/models
  • Datasets: https://huggingface.co/datasets
  • Spaces: https://huggingface.co/spaces