Initial commit: English-PDF-to-Chinese translation tool

Features:
- Extract text from PDFs and translate it into Chinese
- Markdown/TXT/JSON output formats
- Context-aware, coherent translation
- Automatic retries

Config:
- API: http://192.168.2.5:1234/v1
- Model: qwen/qwen3.5-35b-a3b
README.md (new file, 87 lines)
# English-PDF-to-Chinese Translation Tool

An automatic English PDF translation tool built on a local LLM service (Qwen3.5).

## Features

- ✅ Extracts text from PDFs and translates it into Chinese
- ✅ Preserves the original paragraph structure
- ✅ Supports multiple output formats (Markdown/TXT/JSON)
- ✅ Context-aware, coherent translation
- ✅ Automatic retry on failure

## Quick Start

```bash
# Translate a PDF to Markdown
python translate_pdf.py input.pdf output.md

# Specify the output format
python translate_pdf.py input.pdf output.txt --format txt

# JSON format (includes the original text side by side)
python translate_pdf.py input.pdf output.json --format json

# Test the LLM connection
python translate_pdf.py --test
```

## Configuration

Edit the `CONFIG` section in `translate_pdf.py`:

```python
CONFIG = {
    "api_base": "http://192.168.2.5:1234/v1",  # LLM API address
    "api_key": "sk-lm-fuP5tGU8:Hi7YU87jHyDP6Ay8Tl2j",  # API key
    "model": "qwen/qwen3.5-35b-a3b",  # model name
    "chunk_size": 2000,   # characters per translation request
    "max_tokens": 8000,   # max output tokens (must leave room for reasoning)
    "timeout": 180,       # per-request timeout (seconds)
}
```

## Installation

```bash
pip install pypdf openai
```

## Caveats

⚠️ **Qwen's thinking mode**

Qwen3.5 "thinks" before translating, which consumes a large number of tokens. As a result:

- `max_tokens` must be generous (8000+ recommended)
- Translation is slow (roughly 30 seconds per chunk)
- Best suited for important documents, not quick previews
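The failure mode behind this caveat is an empty completion: the model spends its whole token budget on reasoning and returns no visible text. A minimal guard for that case can be sketched in isolation; the `SimpleNamespace` objects below are hypothetical stand-ins for the OpenAI-compatible client's response, used only for illustration.

```python
# Sketch: guard against empty completions from "thinking" models.
from types import SimpleNamespace

def completion_text(response):
    """Return the completion text, or None if the model spent its
    token budget on reasoning and produced no visible output."""
    content = response.choices[0].message.content
    if not content or not content.strip():
        return None  # caller should retry or raise
    return content.strip()

# Stand-in response objects (not real client output):
ok = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="你好"))])
empty = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(content="  "))])
print(completion_text(ok))     # 你好
print(completion_text(empty))  # None
```

A `None` result is the signal to retry the request rather than write an empty chunk into the output.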
## Example Output

English input:

```
Machine learning is a subset of artificial intelligence (AI)
that enables systems to learn and improve from experience.
```

Chinese output:

```
机器学习是人工智能(AI)的一个子集,
它使系统能够从经验中学习和改进。
```

## File Layout

```
pdf-translator/
├── translate_pdf.py   # main translation script
├── README.md          # usage notes
└── output.md          # sample translation output
```

## Possible Extensions

1. **Batch translation**: loop over multiple PDFs
2. **Progress saving**: resume from the last position after an interruption
3. **Quality checks**: compare source and translated paragraphs
4. **Formatting**: preserve the PDF's original layout
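The first two extension ideas combine naturally: walk a directory of PDFs and record finished files in a checkpoint so an interrupted run can resume. A minimal sketch, with `translate_pdf` stubbed out for illustration (in practice it would be imported from `translate_pdf.py`):

```python
# Sketch of batch translation with resumable progress.
import json
from pathlib import Path

def translate_pdf(pdf_path, out_path, fmt="markdown"):
    """Stub standing in for translate_pdf.py's real function."""
    Path(out_path).write_text(f"translated {pdf_path}", encoding="utf-8")

def batch_translate(src_dir, out_dir, checkpoint="progress.json"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    ckpt = out / checkpoint
    # Load the set of files finished in a previous run, if any.
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        if pdf.name in done:
            continue  # resume: skip already-translated files
        translate_pdf(str(pdf), str(out / (pdf.stem + ".md")))
        done.add(pdf.name)
        # Persist progress after every file so a crash loses at most one.
        ckpt.write_text(json.dumps(sorted(done), ensure_ascii=False))
```

Writing the checkpoint after each file (rather than once at the end) is what makes the interruption recovery work.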
translate_pdf.py (new file, 354 lines)
#!/usr/bin/env python3
"""
English-PDF-to-Chinese translation workflow.
Uses a local LLM service for translation.
"""

import os
import sys
import re
import json
import time
from typing import Callable, List, Optional

from pypdf import PdfReader
from openai import OpenAI

# ==================== Configuration ====================
CONFIG = {
    "api_base": "http://192.168.2.5:1234/v1",
    "api_key": "sk-lm-fuP5tGU8:Hi7YU87jHyDP6Ay8Tl2j",
    "model": "qwen/qwen3.5-35b-a3b",
    "chunk_size": 2000,    # characters per translation request (smaller = faster)
    "max_tokens": 8000,    # max output tokens (must leave room for reasoning)
    "temperature": 0.3,    # low temperature for stable translations
    "retry_times": 3,      # number of retries
    "retry_delay": 2,      # delay between retries (seconds)
    "timeout": 180,        # per-request timeout (seconds)
}

# ==================== LLM client ====================
client = OpenAI(
    api_key=CONFIG["api_key"],
    base_url=CONFIG["api_base"],
)


def translate_text(text: str, context: str = "") -> str:
    """
    Translate text via the LLM.

    Args:
        text: English text to translate
        context: preceding context (keeps the translation coherent)

    Returns:
        The translated Chinese text, or the original text if all retries fail.
    """
    # The prompts are deliberately written in Chinese, since the model
    # is instructed to produce Chinese output.
    system_prompt = """你是一个专业的英译中翻译专家。请遵循以下规则:
1. 保持原文的格式和段落结构
2. 专业术语保持准确性,必要时保留英文原文
3. 语言流畅自然,符合中文表达习惯
4. 不要添加任何解释或注释,只输出翻译结果"""

    user_prompt = f"""请将以下英文翻译成中文。直接输出中文翻译,不要解释。

{f'参考前文:{context[-300:]}' if context else ''}

英文内容:
{text}"""

    for attempt in range(CONFIG["retry_times"]):
        try:
            response = client.chat.completions.create(
                model=CONFIG["model"],
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                max_tokens=CONFIG["max_tokens"],
                temperature=CONFIG["temperature"],
                timeout=CONFIG["timeout"],
            )

            content = response.choices[0].message.content
            # Empty content (with only reasoning_content) means the model
            # spent its token budget thinking and never produced output.
            if not content or content.strip() == "":
                print("  ⚠️ Reasoning did not finish; retrying...")
                time.sleep(1)
                continue

            return content.strip()

        except Exception as e:
            print(f"  ⚠️ Translation failed (attempt {attempt+1}/{CONFIG['retry_times']}): {e}")
            if attempt < CONFIG["retry_times"] - 1:
                time.sleep(CONFIG["retry_delay"])

    return text  # fall back to the original text on failure


def extract_pdf_text(pdf_path: str) -> List[dict]:
    """
    Extract text from a PDF, grouped by page.

    Returns:
        [{"page": 1, "text": "..."}, ...]
    """
    reader = PdfReader(pdf_path)
    pages_text = []

    for i, page in enumerate(reader.pages):
        text = page.extract_text()
        if text and text.strip():
            text = clean_text(text)
            pages_text.append({
                "page": i + 1,
                "text": text
            })

    return pages_text


def clean_text(text: str) -> str:
    """Clean up text extracted from the PDF."""
    # Collapse excess whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    # Strip control characters that confuse the model
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f]', '', text)
    return text.strip()


def chunk_text(text: str, max_size: Optional[int] = None) -> List[str]:
    """
    Split long text into chunks for translation.

    Splits on paragraph boundaries to avoid breaking sentences.
    """
    # Read the size at call time so CLI overrides of CONFIG take effect;
    # a default of CONFIG["chunk_size"] would be frozen at import time.
    if max_size is None:
        max_size = CONFIG["chunk_size"]

    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = ""

    for para in paragraphs:
        if len(current_chunk) + len(para) < max_size:
            current_chunk += para + '\n\n'
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = para + '\n\n'

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


def translate_pdf(
    pdf_path: str,
    output_path: str,
    output_format: str = "markdown",
    progress_callback: Optional[Callable] = None
) -> dict:
    """
    Translate a PDF document.

    Args:
        pdf_path: input PDF path
        output_path: output file path
        output_format: output format (markdown/txt/json)
        progress_callback: progress callback function

    Returns:
        Translation statistics.
    """
    print(f"📖 Processing: {pdf_path}")

    # Extract text
    pages_text = extract_pdf_text(pdf_path)
    total_pages = len(pages_text)
    print(f"📄 {total_pages} pages")

    # Statistics
    stats = {
        "total_pages": total_pages,
        "total_chunks": 0,
        "translated_chunks": 0,
        "failed_chunks": 0,
        "total_chars": 0,
        "translated_chars": 0,
        "time_elapsed": 0,
    }

    start_time = time.time()
    translated_pages = []
    context = ""  # keeps the translation coherent across chunks

    for page_data in pages_text:
        page_num = page_data["page"]
        page_text = page_data["text"]

        print(f"\n📝 Translating page {page_num}/{total_pages}...")

        # Split into chunks
        chunks = chunk_text(page_text)
        stats["total_chunks"] += len(chunks)

        translated_chunks = []
        for i, chunk in enumerate(chunks):
            print(f"  🔄 Chunk {i+1}/{len(chunks)} ({len(chunk)} chars)")

            translated = translate_text(chunk, context)

            # translate_text returns the original text on failure
            if translated != chunk:
                stats["translated_chunks"] += 1
                stats["translated_chars"] += len(translated)
            else:
                stats["failed_chunks"] += 1

            translated_chunks.append(translated)
            context = translated  # update the rolling context

            if progress_callback:
                progress_callback(page_num, total_pages, i+1, len(chunks))

        translated_page_text = '\n\n'.join(translated_chunks)
        translated_pages.append({
            "page": page_num,
            "original": page_text,
            "translated": translated_page_text
        })

        stats["total_chars"] += len(page_text)

    # Write the output
    stats["time_elapsed"] = time.time() - start_time
    save_output(translated_pages, output_path, output_format)

    print("\n✅ Done!")
    print("📊 Statistics:")
    print(f"  - pages: {stats['total_pages']}")
    print(f"  - chunks: {stats['total_chunks']}")
    print(f"  - succeeded: {stats['translated_chunks']}")
    print(f"  - failed: {stats['failed_chunks']}")
    print(f"  - elapsed: {stats['time_elapsed']:.1f}s")
    print(f"  - output: {output_path}")

    return stats


def save_output(
    translated_pages: List[dict],
    output_path: str,
    output_format: str = "markdown"
):
    """Save the translation result. Headings stay in Chinese, since the
    generated document is aimed at Chinese readers."""
    if output_format == "markdown":
        content = "# 英文PDF中文翻译\n\n> 自动翻译生成\n\n---\n\n"
        for page in translated_pages:
            content += f"## 第 {page['page']} 页\n\n"
            content += page["translated"] + "\n\n---\n\n"

    elif output_format == "txt":
        content = ""
        for page in translated_pages:
            content += f"=== 第 {page['page']} 页 ===\n\n"
            content += page["translated"] + "\n\n"

    elif output_format == "json":
        content = json.dumps(translated_pages, ensure_ascii=False, indent=2)

    else:
        raise ValueError(f"Unsupported format: {output_format}")

    # Make sure the output directory exists
    output_dir = os.path.dirname(output_path)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(content)


# ==================== CLI ====================
def main():
    """Command-line entry point."""
    import argparse

    parser = argparse.ArgumentParser(
        description='English-PDF-to-Chinese translation tool',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Translate a PDF to Markdown
  python translate_pdf.py input.pdf output.md

  # Specify the output format
  python translate_pdf.py input.pdf output.txt --format txt

  # JSON output (includes the original text)
  python translate_pdf.py input.pdf output.json --format json
"""
    )

    parser.add_argument('input', nargs='?', help='input PDF path')
    parser.add_argument('output', nargs='?', help='output file path')
    parser.add_argument('--format', '-f',
                        choices=['markdown', 'txt', 'json'],
                        default='markdown',
                        help='output format (default: markdown)')
    parser.add_argument('--chunk-size', '-c',
                        type=int,
                        default=CONFIG['chunk_size'],
                        help=f'translation chunk size (default: {CONFIG["chunk_size"]})')
    parser.add_argument('--test', '-t',
                        action='store_true',
                        help='test the LLM connection')

    args = parser.parse_args()

    # Apply overrides
    if args.chunk_size:
        CONFIG['chunk_size'] = args.chunk_size

    # Connection test
    if args.test:
        test_connection()
        return

    # Both positional arguments are required from here on
    if not args.input or not args.output:
        parser.print_help()
        sys.exit(1)

    # Check the input file
    if not os.path.exists(args.input):
        print(f"❌ File not found: {args.input}")
        sys.exit(1)

    # Run the translation
    translate_pdf(args.input, args.output, args.format)


def test_connection():
    """Test the LLM connection."""
    print("🔗 Testing LLM connection...")
    print(f"  API: {CONFIG['api_base']}")
    print(f"  Model: {CONFIG['model']}")

    try:
        response = client.chat.completions.create(
            model=CONFIG["model"],
            messages=[
                {"role": "user", "content": "请用中文回复:你好"}
            ],
            max_tokens=50,
        )
        print("✅ Connected!")
        print(f"  Reply: {response.choices[0].message.content}")
    except Exception as e:
        print(f"❌ Connection failed: {e}")


if __name__ == "__main__":
    main()
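For programmatic use, the `progress_callback` hook is called as `progress_callback(page, total_pages, chunk, total_chunks)` after each translated chunk. A minimal callback might report overall progress as a percentage; the loop below merely simulates the script's per-chunk calls, so this sketch runs standalone.

```python
# Sketch: a progress_callback for translate_pdf.
def print_progress(page, total_pages, chunk, total_chunks):
    """Report overall progress as a percentage (also returned)."""
    pct = 100 * ((page - 1) + chunk / total_chunks) / total_pages
    print(f"page {page}/{total_pages}, chunk {chunk}/{total_chunks} ({pct:.0f}%)")
    return pct

# Simulate a 2-page document with 2 chunks per page:
for page in (1, 2):
    for chunk in (1, 2):
        print_progress(page, 2, chunk, 2)
```

The percentage assumes every page has the same number of chunks, which is only approximately true for real documents.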