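feat: complete data extractor and converter project

- Add MDF file export
- Integrate Alibaba Cloud (Aliyun) large-model OCR
- Add Baidu AI Cloud photo scoring
- Integrate DeepSeek large-model creative copywriting
- Improve documentation and configuration management
- Manage dependencies with uv
- Add a complete .gitignore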
AI Developer 2026-01-08 20:25:49 +08:00
commit 2ec2c0a1ab
34 changed files with 10908 additions and 0 deletions

35
.env.example Normal file

@ -0,0 +1,35 @@
# Data Extractor and Converter - example environment variables
# Flask application secret key (change it in production)
SECRET_KEY=your-secret-key-here
# Tesseract OCR path (must be set on Windows)
TESSERACT_PATH=C:\\Program Files\\Tesseract-OCR\\tesseract.exe
# Database connection (optional)
DATABASE_URI=sqlite:///data.db
# Example MySQL configuration
# DATABASE_URI=mysql+pymysql://username:password@localhost/database_name
# Alibaba Cloud OCR settings
ALIYUN_ACCESS_KEY_ID=your-aliyun-access-key-id
ALIYUN_ACCESS_KEY_SECRET=your-aliyun-access-key-secret
# Baidu AI Cloud settings (image analysis)
BAIDU_API_KEY=your-baidu-api-key
BAIDU_SECRET_KEY=your-baidu-secret-key
# DeepSeek large-model settings (creative copywriting)
DEEPSEEK_API_KEY=your-deepseek-api-key
# Alibaba Cloud DashScope settings (fallback copywriting)
DASHSCOPE_API_KEY=your-dashscope-api-key
# Photo advice generation
PHOTO_ADVICE_ENABLED=true
# Application settings
DEBUG=false
HOST=0.0.0.0
PORT=5000

81
.gitignore vendored Normal file

@ -0,0 +1,81 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# Environment variables
.env
.env.local
.env.development.local
.env.test.local
.env.production.local
# IDE
.vscode/
.idea/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Logs
*.log
logs/
# Database
*.db
*.sqlite
*.sqlite3
# Temporary files
temp/
tmp/
# Uploads
uploads/
# Streamlit
.streamlit/
# UV
.venv/
venv/
ENV/
# Package files
*.tar.gz
*.whl
# Test coverage
.coverage
htmlcov/
.pytest_cache/
# Jupyter
.ipynb_checkpoints
# Documentation
_site/
.sass-cache/
.jekyll-metadata

130
ALIYUN_OCR_SETUP.md Normal file

@ -0,0 +1,130 @@
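# Alibaba Cloud (Aliyun) OCR Setup Guide
## 📋 Overview
The Data Extractor and Converter can now use Alibaba Cloud's large-model AI service to recognize text in images. Compared with traditional OCR, it offers higher accuracy and better Chinese support.
## 🔑 Getting an Alibaba Cloud AccessKey
### 1. Register an Alibaba Cloud account
- Visit: https://www.aliyun.com
- Register and complete real-name verification
### 2. Activate the OCR service
- Log in to the Alibaba Cloud console
- Search for "OCR" or visit: https://www.aliyun.com/product/ocr
- Activate the "通用文字识别" (General Text Recognition) service
### 3. Get an AccessKey
1. Open the console → hover over your avatar → AccessKey Management
2. Create an AccessKey or use an existing one
3. Record the following:
- AccessKey ID
- AccessKey Secret
## ⚙️ Configuring environment variables
Add the Alibaba Cloud settings to your `.env` file:
```env
# Alibaba Cloud OCR settings
ALIYUN_ACCESS_KEY_ID=<your AccessKey ID>
ALIYUN_ACCESS_KEY_SECRET=<your AccessKey Secret>
ALIYUN_OCR_ENDPOINT=ocr-api.cn-hangzhou.aliyuncs.com
```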
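Before starting the app, it can help to confirm that both keys are actually visible to the process. The sketch below is a hypothetical helper, not part of the project: `missing_env_vars` and `REQUIRED_ALIYUN_VARS` are names invented for illustration. It reads `os.environ`, so it assumes the `.env` file has already been loaded (the app does this with `load_dotenv()`).

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_ALIYUN_VARS = ("ALIYUN_ACCESS_KEY_ID", "ALIYUN_ACCESS_KEY_SECRET")


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or blank in env (defaults to os.environ)."""
    source = os.environ if env is None else env
    return [name for name in names if not source.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_ALIYUN_VARS)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("Both Alibaba Cloud variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.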
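## 💰 Pricing
### Free quota
- New users usually receive a free call quota
- See the Alibaba Cloud OCR product page for the exact quota
### Billing
- Billed per call
- See Alibaba Cloud's official pricing for exact prices
## 🎯 Feature comparison
| Feature | Traditional OCR (Tesseract) | Large-model AI OCR (Alibaba Cloud) |
|------|-------------------|---------------------|
| **Setup effort** | Moderate (software must be installed) | Low (only keys to configure) |
| **Recognition accuracy** | Average | Very high |
| **Chinese support** | Good | Excellent |
| **Complex images** | Poor | Excellent |
| **Cost** | Free | Billed per call |
| **Speed** | Fast | Moderate (depends on the network) |
## 🔧 Troubleshooting
### Common problems
**1. "阿里云AccessKey未配置" (Alibaba Cloud AccessKey not configured)**
- Check that ALIYUN_ACCESS_KEY_ID and ALIYUN_ACCESS_KEY_SECRET are set in the .env file
- Make sure the AccessKey values are correct
**2. "权限不足" (Insufficient permissions)**
- Confirm that the OCR service is activated
- Check that the AccessKey has permission to use the OCR service
**3. "网络连接失败" (Network connection failed)**
- Check your network connection
- Confirm that no firewall is blocking the requests
**4. "额度不足" (Quota exhausted)**
- Check your Alibaba Cloud account balance
- Check whether the free quota has been used up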
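### Testing the configuration
Run the following from the project root (adjust the path to wherever you cloned the project) to test the Alibaba Cloud OCR configuration:
```bash
cd data-extractor-converter
uv run python -c "from utils.aliyun_ocr import check_aliyun_config; print(check_aliyun_config())"
```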
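## 🚀 Usage
### In the application
1. Open the app → select the "🖼️ 图片OCR" (Image OCR) feature
2. Select the "AI大模型OCR (阿里云)" (large-model AI OCR, Alibaba Cloud) mode
3. Upload an image file
4. Click "识别文字" (Recognize text) or one of the export buttons
### Supported image formats
- JPG/JPEG
- PNG
- GIF
- BMP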
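The format list above can be checked in code before a file is sent for recognition. This is a minimal sketch under the assumption that the file extension is a good enough signal; `is_supported_image` and `SUPPORTED_IMAGE_SUFFIXES` are illustrative names, not functions from the project.

```python
from pathlib import Path

# Extensions corresponding to the formats listed above.
SUPPORTED_IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def is_supported_image(filename: str) -> bool:
    """Return True if the file extension matches a supported image format.

    The comparison is case-insensitive, so "scan.JPG" is accepted.
    """
    return Path(filename).suffix.lower() in SUPPORTED_IMAGE_SUFFIXES
```

Only the final extension is inspected; the file contents are not validated.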
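### Recognition types
- **General text recognition** - text in ordinary images
- **Table recognition** - extraction of table data
- **Advanced recognition** - text in complex scenes
## 💡 Best practices
### Image tips
1. **Sharpness**: make sure the image is sharp and the text is legible
2. **Resolution**: 300 dpi or higher is recommended
3. **Background**: use a plain background where possible
4. **Angle**: keep the text horizontal
### Cost control
1. **Batch processing**: process images in batches where possible
2. **Preprocessing**: crop and optimize images first
3. **Usage monitoring**: review your Alibaba Cloud usage regularly
## 📚 Resources
- [Alibaba Cloud OCR documentation](https://help.aliyun.com/product/30419.html)
- [AccessKey management](https://ram.console.aliyun.com/manage/ak)
- [OCR pricing](https://www.aliyun.com/price/product#/ocr/detail)
## ⚠️ Notes
1. **Security**: never commit an AccessKey to version control
2. **Cost**: monitor usage to avoid unexpected charges
3. **Network**: AI OCR needs a stable network connection
4. **Fallback**: for important data, keep traditional OCR as a fallback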

166
BAIDU_AI_SETUP.md Normal file

@ -0,0 +1,166 @@
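# Baidu AI Cloud Photo Scoring Setup Guide
## 📋 Overview
The Data Extractor and Converter can now use Baidu AI Cloud's large-model AI services to score photo quality and analyze photo content, giving you an automated assessment of your photos.
## 🔑 Getting Baidu AI Cloud API keys
### 1. Register a Baidu AI Cloud account
- Visit: https://cloud.baidu.com
- Register and complete real-name verification
### 2. Activate the image analysis service
1. Log in to the Baidu AI Cloud console
2. Search for "图像分析" (image analysis) or visit: https://cloud.baidu.com/product/imageprocess.html
3. Activate the "图像分析" (image analysis) or "图像识别" (image recognition) service
### 3. Create an application to get the API keys
1. Open the console → 产品服务 (Products and Services) → 图像分析 (Image Analysis)
2. Create a new application
3. Record the following:
- API Key
- Secret Key
## ⚙️ Configuring environment variables
Add the Baidu AI Cloud settings to your `.env` file:
```env
# Baidu AI Cloud settings (image analysis)
BAIDU_API_KEY=<your API Key>
BAIDU_SECRET_KEY=<your Secret Key>
```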
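## 💰 Pricing
### Free quota
- New users usually receive a free call quota
- See the Baidu AI Cloud product page for the exact quota
### Billing
- Billed per call
- See Baidu AI Cloud's official pricing for exact prices
## 🎯 Features
### 1. **Photo quality scoring** 📊
- **Overall score**: a combined quality score from 0 to 100
- **Quality dimensions**: sharpness, brightness, contrast, color balance
- **Improvement advice**: targeted suggestions for each weak point
### 2. **Photo content analysis** 🔍
- **Object recognition**: detects the objects and scenes in a photo
- **Content summary**: generates a description of the photo's content
- **Baidu Baike**: encyclopedia details for the recognized objects
### 3. **Photo aesthetic scoring** 🎨
- **Aesthetic score**: composition, color, lighting, and other aesthetic dimensions
- **Aesthetic advice**: suggestions for improving the look of the photo
- **Artistic guidance**: photography technique and composition tips
## 🔧 Troubleshooting
### Common problems
**1. "百度智能云API密钥未配置" (Baidu AI Cloud API keys not configured)**
- Check that BAIDU_API_KEY and BAIDU_SECRET_KEY are set in the .env file
- Make sure the API keys are correct
**2. "权限不足" (Insufficient permissions)**
- Confirm that the image analysis service is activated
- Check that the API keys have permission to use the service
**3. "网络连接失败" (Network connection failed)**
- Check your network connection
- Confirm that no firewall is blocking the requests
**4. "额度不足" (Quota exhausted)**
- Check your Baidu AI Cloud account balance
- Check whether the free quota has been used up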
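### Testing the configuration
Run the following from the project root (adjust the path to wherever you cloned the project) to test the Baidu AI Cloud configuration:
```bash
cd data-extractor-converter
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
```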
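## 🚀 Usage
### In the application
1. Open the app → select the "📸 AI照片评分" (AI photo scoring) feature
2. Upload a photo
3. Choose an analysis type:
- **质量评分 (Quality score)**: assesses the technical quality of the photo
- **内容分析 (Content analysis)**: recognizes what is in the photo
- **美学评分 (Aesthetic score)**: assesses the artistic value of the photo
### Supported image formats
- JPG/JPEG
- PNG
- GIF
- BMP
### Analysis types
#### Quality score 📊
- **Use cases**: technical quality assessment, photo optimization
- **Output**: overall score, per-dimension analysis, improvement advice
- **When to use**: to judge the technical quality of a photo
#### Content analysis 🔍
- **Use cases**: content recognition, scene understanding
- **Output**: recognized objects, content summary, encyclopedia information
- **When to use**: to find out what a photo shows
#### Aesthetic score 🎨
- **Use cases**: artistic assessment, learning photography
- **Output**: aesthetic score, composition analysis, artistic advice
- **When to use**: to judge the artistic value of a photo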
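## 💡 Best practices
### Photo tips
1. **Sharpness**: make sure the photo is sharp, not blurred
2. **Lighting**: use natural light and avoid under- or over-exposure
3. **Composition**: follow the rule of thirds and keep the frame balanced
4. **Format**: use high-quality JPG or PNG
### Cost control
1. **Batch processing**: analyze photos in batches where possible
2. **Selective analysis**: run only the analysis types you need
3. **Usage monitoring**: review your usage statistics regularly
## 📚 Resources
- [Baidu AI Cloud image analysis documentation](https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e)
- [API key management](https://console.bce.baidu.com/iam/#/iam/accesslist)
- [Pricing](https://cloud.baidu.com/product/imageprocess.html#pricing)
## ⚠️ Notes
1. **Security**: never commit API keys to version control
2. **Cost**: monitor usage to avoid unexpected charges
3. **Network**: AI analysis needs a stable network connection
4. **Privacy**: avoid uploading photos that contain sensitive information
## 🌟 Use cases
### Personal
- Assessing the quality of phone photos
- Learning photography techniques
- Optimizing images for social media
### Education
- Grading photography coursework
- Learning image processing
- Guidance for artistic work
### Professional
- Reviewing a photographer's work
- Monitoring image quality
- Content recognition and analysis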

124
BAIDU_API_GUIDE.md Normal file

@ -0,0 +1,124 @@
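# Baidu AI Cloud API Keys: How to Get the Right Ones
## 🔍 Diagnosis
The `unknown client id` error means the configured API key is in the wrong format. A Baidu AI Cloud application API key is a purely alphanumeric string, unlike the key that was configured previously.
## ✅ Steps to get the correct API keys
### 1. **Open the Baidu AI Cloud console**
- Open: https://console.bce.baidu.com/
- Log in with your Baidu account
### 2. **Activate the image analysis service**
1. Type "图像分析" (image analysis) into the console search bar
2. Select the "图像分析" (image analysis) or "图像识别" (image recognition) service
3. Click "立即使用" (Use now) to activate it
### 3. **Create an application to get the API keys**
1. Open the console → 产品服务 (Products and Services) → 图像分析 (Image Analysis)
2. Click "创建应用" (Create application)
3. Fill in the application details:
- **Application name**: 数据提取与转换器 (Data Extractor and Converter)
- **Application type**: utility software
- **Application description**: photo quality scoring tool
4. Tick the service permissions you need
5. Click "立即创建" (Create now)
### 4. **Copy the correct API keys**
After the application is created, you will see information like this:
```
AppID: 12345678
API Key: xxxxxxxxxxxxxxxx
Secret Key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```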
**Example of the correct format:**
```
API Key: "AbCdEfGhIjKlMnOp" (16 alphanumeric characters)
Secret Key: "AbCdEfGhIjKlMnOpQrStUvWxYz012345" (32 alphanumeric characters)
```
## ⚠️ Common wrong formats
**Wrong format (do not use):**
```
# This format is wrong!
BAIDU_API_KEY=bce-v3/ALTAK-xxx/xxx
BAIDU_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
**Correct format:**
```
# This format is correct!
BAIDU_API_KEY=AbCdEfGhIjKlMnOp
BAIDU_SECRET_KEY=AbCdEfGhIjKlMnOpQrStUvWxYz012345
```
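The two formats can be told apart mechanically. The following sketch is a hypothetical pre-flight check, not code from the project: `describe_key_problem` is an invented name, and the rules (purely alphanumeric, no `bce-v3/` prefix) are taken from this guide rather than from an official specification. It deliberately does not enforce a length.

```python
import re

# "Purely alphanumeric" as described in this guide.
ALPHANUMERIC = re.compile(r"[A-Za-z0-9]+")


def describe_key_problem(value: str) -> str:
    """Return a short reason why value looks wrong, or "" if it looks plausible."""
    value = value.strip()
    if not value:
        return "empty"
    if value.startswith("bce-v3/"):
        return "looks like a bce-v3 style key, not an application API Key"
    if not ALPHANUMERIC.fullmatch(value):
        return "contains characters other than letters and digits"
    return ""
```

An empty return value only means the string has a plausible shape; whether the key is valid is decided by the service.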
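## 🔧 Configuration steps
### 1. **Update the .env file**
Add the correct API keys to your `.env` file:
```env
# Baidu AI Cloud settings (image analysis)
BAIDU_API_KEY=<your correct API Key>
BAIDU_SECRET_KEY=<your correct Secret Key>
```
### 2. **Restart the application**
The application must be restarted to pick up the new environment variables.
### 3. **Verify the configuration**
Run the following from the project root (adjust the path to wherever you cloned the project) to test the configuration:
```bash
cd data-extractor-converter
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
```
## 🎯 What success looks like
If the configuration is correct, the output reports a status of `True` and a message saying the Baidu AI Cloud configuration is correct:
```
配置状态: True
详细信息: 百度智能云配置正确
```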
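## 💡 Troubleshooting
### If problems persist
1. **Check that the service is activated**
- Confirm that the image analysis service is activated
- Check that the application has the required permissions
2. **Verify the API key format**
- API Key: should be 16 alphanumeric characters
- Secret Key: should be 32 alphanumeric characters
3. **Check the network connection**
- Make sure the Baidu AI Cloud API is reachable
- Check your firewall settings
4. **Read the error details**
- If an error remains, read the full error message
- Continue troubleshooting based on that message
## 📞 Getting help
If the problem still cannot be resolved:
1. **Baidu AI Cloud documentation**: https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e
2. **Technical support**: submit a ticket in the Baidu AI Cloud console
3. **Community support**: search the relevant technical forums
## 🚀 Next steps
Once the correct API keys are configured, you can use the following features:
- 📊 Photo quality scoring
- 🔍 Photo content analysis
- 🎨 Photo aesthetic scoring
Good luck with your setup!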


@ -0,0 +1,187 @@
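# Baidu AI Cloud API Key: Detailed Guide
## 📋 Steps at a glance
1. **Register a Baidu AI Cloud account**
2. **Activate the image analysis service**
3. **Create an application to get the API Key**
4. **Configure it in the app**
## 🔑 Step 1: Register a Baidu AI Cloud account
### 1.1 Visit the website
- Open: https://cloud.baidu.com/
- Click "注册" (Register) in the top-right corner
### 1.2 Complete registration
- Register with a Baidu account or a phone number
- Complete real-name verification (an ID card is required)
- Verify your phone number and email address
## 🚀 Step 2: Activate the image analysis service
### 2.1 Log in to the console
- Visit: https://console.bce.baidu.com/
- Log in with the account you registered
### 2.2 Find the service
Type one of the following keywords into the search bar on the console home page:
- **"图像分析"** (image analysis)
- **"图像识别"** (image recognition)
- **"Image Analysis"**
### 2.3 Select the service
Click the "图像分析" (image analysis) service in the search results, then click "立即使用" (Use now).
## 📱 Step 3: Create an application to get the API Key
### 3.1 Open application management
1. After logging in to the console, click "产品服务" (Products and Services) in the left-hand menu
2. Find "图像分析" (image analysis) or "图像识别" (image recognition)
3. Click through to the service page
### 3.2 Create a new application
1. Click the "创建应用" (Create application) button
2. Fill in the application details:
**Example application details:**
```
Application name: 数据提取与转换器 (Data Extractor and Converter)
Application type: utility software
Application description: photo quality scoring and content analysis tool
Industry category: utility software / office software
```
### 3.3 Select service permissions
When creating the application, make sure the following permissions are ticked:
- ✅ Image analysis
- ✅ Image recognition
- ✅ Image quality assessment
### 3.4 Copy the API Key
After the application is created, you will see information like this:
```
AppID: 12345678
API Key: AbCdEfGhIjKlMnOp
Secret Key: AbCdEfGhIjKlMnOpQrStUvWxYz012345
```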
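## 🔍 Step 4: Recognize the correct API Key format
### 4.1 What a correct API Key looks like
```
✅ API Key: AbCdEfGhIjKlMnOp (16 alphanumeric characters)
✅ Secret Key: AbCdEfGhIjKlMnOpQrStUvWxYz012345 (32 alphanumeric characters)
```
### 4.2 Wrong API Key formats (do not use)
```
❌ Date-time format: 20260108183311
❌ Compound format: bce-v3/ALTAK-xxx/xxx
❌ ALTAK-prefixed key: ALTAKxxxxxxxxxxxxxxxxxxxxx
```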
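## ⚙️ Step 5: Configure it in the app
### 5.1 Update the .env file
Add the correct API Key to your `.env` file:
```env
# Baidu AI Cloud settings (image analysis)
BAIDU_API_KEY=AbCdEfGhIjKlMnOp
BAIDU_SECRET_KEY=AbCdEfGhIjKlMnOpQrStUvWxYz012345
```
### 5.2 Restart the application
The application must be restarted to pick up the new environment variables.
### 5.3 Verify the configuration
Run the following from the project root (adjust the path to wherever you cloned the project) to test the configuration:
```bash
cd data-extractor-converter
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
```
## 🎯 What success looks like
If the configuration is correct, the output reports a status of `True` and a message saying the Baidu AI Cloud configuration is correct:
```
配置状态: True
详细信息: 百度智能云配置正确
```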
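## 💡 FAQ
### Q1: What if I cannot find the "图像分析" (image analysis) service?
- Try searching for "图像识别" (image recognition)
- Check that the account has completed real-name verification
- Check whether the account is an enterprise account (personal accounts may be restricted)
### Q2: What if the API Key format is wrong?
- Make sure it is purely alphanumeric
- Do not use a date-time style value
- Do not use a value that contains special characters
### Q3: What if creating the application fails with insufficient permissions?
- Check the account's real-name verification status
- Check the account balance or credit limit
- Contact Baidu AI Cloud customer service
### Q4: What if the test still reports an error?
- Check your network connection
- Verify that the API Key and Secret Key belong together
- Confirm that the service is activated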
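One way to check whether an API Key and Secret Key belong together is to request an access token directly. The sketch below assumes the keys are Baidu AI open platform application credentials exchanged at the OAuth endpoint `https://aip.baidubce.com/oauth/2.0/token`; whether this project's `check_baidu_config` does the same is not shown in this guide. `build_token_url` and `fetch_token_response` are illustrative names.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key: str, secret_key: str) -> str:
    """Build the client-credentials token URL for the given key pair."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return f"{TOKEN_URL}?{query}"


def fetch_token_response(api_key: str, secret_key: str, timeout: float = 10.0) -> dict:
    """Request a token and return the decoded JSON body, including error bodies."""
    request = urllib.request.Request(build_token_url(api_key, secret_key), method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        # Error responses also carry a JSON body describing the problem.
        return json.load(err)
```

A response containing `access_token` indicates a matching pair; an `error_description` such as `unknown client id` points at the API Key itself.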
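## 📞 Getting help
### Official documentation
- Image analysis documentation: https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e
- API reference: https://cloud.baidu.com/doc/IMAGEPROCESS/s/Ek3h6xze3
### Technical support
- Submit a ticket in the console
- Customer service phone: 4008-777-818
- Official QQ group: search for "百度智能云技术支持"
## 🚀 Feature preview
After a successful setup, the following AI photo scoring features are available:
### 1. Quality score 📊
- Sharpness assessment
- Brightness analysis
- Contrast detection
- Color balance score
### 2. Content analysis 🔍
- Object recognition
- Scene understanding
- Content summary generation
- Baidu Baike lookups
### 3. Aesthetic score 🎨
- Composition analysis
- Color harmony
- Lighting assessment
- Artistic guidance
## ⚠️ Notes
1. **Security**: never commit the API Key to Git or any other version control system
2. **Cost**: monitor usage to avoid unexpected charges
3. **Network**: make sure the network connection is stable
4. **Privacy**: avoid uploading photos that contain sensitive information
## 💰 Pricing
### Free quota
- New users usually receive a free call quota
- See the product page for the exact quota
### Billing
- Billed per call
- See the official pricing for exact prices
Good luck with your setup! If you run into problems, see the FAQ above or contact technical support.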

279
README.md Normal file

@ -0,0 +1,279 @@
## Team Members and Contributions
| Name | Student ID | Main contribution (specific role) |
|------|------|-------------------|
| 郭昊 | 2412111209 | (Team lead) Core logic development, prompt writing |
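# Data Extractor and Converter
🚀 **A multi-purpose AI data extraction and conversion tool**
A data-processing tool that integrates large-model AI services. It supports PDF extraction, image OCR, format conversion, web scraping, and database export, plus AI photo scoring and copywriting.
## ✨ Core features
### 📄 Document processing
- **PDF text/table extraction** - extract text and table data from PDF documents
- **Multiple formats** - supports PDF, Word, Excel, and other document formats
### 🖼️ Image processing and AI recognition
- **Traditional OCR** - text recognition in images with Tesseract
- **Large-model AI OCR** - Alibaba Cloud large-model integration for high-accuracy Chinese recognition
- **AI photo scoring** - Baidu AI Cloud assessment of photo quality, content, and aesthetics
- **AI creative copywriting** - generates captions in several styles from the photo's content
### 🔄 Data format conversion
- **Excel/CSV/JSON conversion** - convert between these data formats in either direction
- **Data cleaning and processing** - automatic detection and conversion of data formats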
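To illustrate what one of these conversions involves, here is a minimal standard-library sketch of CSV to JSON. It is not the project's `csv_to_json` from `utils/format_converter.py`, whose implementation is not shown here and may differ; `csv_text_to_json` is a name chosen for this example.

```python
import csv
import io
import json


def csv_text_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects.

    Every value stays a string, because CSV carries no type information.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # ensure_ascii=False keeps Chinese text readable in the output.
    return json.dumps(rows, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    print(csv_text_to_json("name,score\nAlice,90\n"))
```

Type inference (for example turning `"90"` into a number) is a separate step that a fuller converter would need to decide on.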
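### 🌐 Web data collection
- **Web scraping** - collect page data from a given URL or keyword
- **Content extraction** - automatic detection of page structure and content
### 🗄️ Database management
- **Database export** - export SQLite/MySQL databases to Excel and other formats
- **MDF file support** - export from SQL Server MDF files
## 🎯 AI feature highlights
### 📸 AI photo scoring
- **Quality score** 📊 - sharpness, brightness, contrast, and color balance
- **Content analysis** 🔍 - recognizes the objects and scenes in a photo
- **Aesthetic score** 🎨 - artistic assessment of composition, lighting, and subject
- **Detailed improvement advice** 💡 - targeted photography guidance
### ✍️ AI creative copywriting
- **Multiple styles** - creative/literary, social media, professional/formal, marketing, and more
- **Style recommendation** - recommends the best-fitting style based on the photo's content
- **Multiple options** - generates 3 caption options per run
- **Easy copying** - copy a caption to the clipboard with one click
## 🛠️ Technical architecture
### Dependency management
- **Managed with `uv`** - a modern Python package manager
- **Virtual environment isolation** - keeps the dependency environment clean
- **Fast installation** - parallel downloads and installs
### AI service integration
- **Alibaba Cloud OCR** - strong Chinese OCR capability
- **Baidu AI Cloud** - image analysis and recognition services
- **Alibaba Cloud DashScope** - large-model copywriting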
## 🚀 Quick start
### Requirements
- Python 3.8+
- uv (recommended)
### Installation
1. **Clone the project**
```bash
git clone <repository-url>
cd data-extractor-converter
```
2. **Install dependencies**
```bash
# Install dependencies with uv
uv sync
```
3. **Configure environment variables**
Copy `.env.example` to `.env` and fill in the API keys you need:
```env
# Alibaba Cloud OCR settings (large-model AI recognition)
ALIYUN_ACCESS_KEY_ID=your-access-key-id
ALIYUN_ACCESS_KEY_SECRET=your-access-key-secret
ALIYUN_OCR_ENDPOINT=ocr-api.cn-hangzhou.aliyuncs.com
# Baidu AI Cloud settings (image analysis)
BAIDU_API_KEY=your-baidu-api-key
BAIDU_SECRET_KEY=your-baidu-secret-key
# DashScope settings (AI copywriting)
DASHSCOPE_API_KEY=your-dashscope-api-key
```
4. **Start the application**
```bash
uv run streamlit run app.py
```
5. **Open the application**
Open a browser and go to: http://localhost:8501
## 📁 Project structure
```
data-extractor-converter/
├── app.py # Main application
├── pyproject.toml # Project configuration and dependency management
├── .env.example # Example environment variables
├── ALIYUN_OCR_SETUP.md # Alibaba Cloud OCR setup guide
├── BAIDU_AI_SETUP.md # Baidu AI Cloud setup guide
├── SQL_SERVER_SETUP.md # SQL Server setup guide
├── utils/ # Utility modules
│ ├── __init__.py
│ ├── pdf_extractor.py # PDF extraction
│ ├── ocr_processor.py # OCR processing
│ ├── aliyun_ocr.py # Alibaba Cloud AI OCR
│ ├── baidu_image_analysis.py # Baidu AI Cloud image analysis
│ ├── ai_copywriter.py # AI copywriting
│ ├── photo_advice_generator.py # Photo scoring advice
│ ├── format_converter.py # Format conversion
│ ├── web_scraper.py # Web scraping
│ └── database_exporter.py # Database export
└── uploads/ # Uploaded files
```
## 🔧 Setup guides
### Alibaba Cloud OCR
See: [ALIYUN_OCR_SETUP.md](ALIYUN_OCR_SETUP.md)
### Baidu AI Cloud
See: [BAIDU_AI_SETUP.md](BAIDU_AI_SETUP.md)
### SQL Server
See: [SQL_SERVER_SETUP.md](SQL_SERVER_SETUP.md)
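## 💡 Usage examples
### 1. AI photo scoring
1. Select the "📸 AI照片评分" (AI photo scoring) feature
2. Upload a photo
3. Click "质量评分" (Quality score), "内容分析" (Content analysis), or "美学评分" (Aesthetic score)
4. Review the detailed scores and improvement advice
### 2. AI copywriting
1. On the photo scoring page, click "AI写文案" (AI copywriting)
2. The system analyzes the photo's content automatically
3. Choose the caption style and length you prefer
4. Copy the generated caption
### 3. PDF processing
1. Select the "📄 PDF处理" (PDF processing) feature
2. Upload a PDF file
3. Choose an extraction mode (text or table)
4. Download the extracted result
## 🎨 Interface
- **Modern design** - a clean, intuitive user interface
- **Responsive layout** - adapts to different screen sizes
- **Immediate feedback** - progress and results are shown as operations run
- **Language** - full Chinese-language interface and prompts
## 🔒 Security
- **Local processing** - files are processed locally unless you choose a cloud AI feature (Alibaba Cloud OCR, Baidu AI Cloud analysis, AI copywriting), which sends data to the respective service
- **Environment variables** - API keys are configured through environment variables
- **File isolation** - uploaded files are handled in a temporary directory and cleaned up automatically
## 📈 Performance
- **Asynchronous processing** - large files are processed asynchronously
- **Caching** - results of repeated operations are cached
- **Progress display** - long-running operations show a progress indicator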
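## 🤝 Contributing
Issues and pull requests that improve the project are welcome.
### Development setup
```bash
# Install development dependencies
uv sync --dev
# Run the tests
uv run pytest
# Format the code
uv run black .
uv run isort .
```
## 📄 License
This project is released under the MIT License; see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgements
Thanks to the following services for the AI capabilities they provide:
- [Alibaba Cloud](https://www.aliyun.com/) - OCR and large-model AI services
- [Baidu AI Cloud](https://cloud.baidu.com/) - image analysis services
- [Streamlit](https://streamlit.io/) - web application framework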
### How to run
1. **Install dependencies**: `uv sync`
2. **Configure keys**: copy `.env.example` to `.env` and fill in your keys
3. **Start**: `uv run streamlit run app.py`
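## 💭 Development notes
### Choosing the topic: why build this, and whose pain does it relieve?
As a student, I know first-hand how painful it is to handle data in many different formats while studying and doing research. Extracting text from PDF papers, recognizing text in images, converting between data formats: each step can eat up a lot of time. Writing a creative caption for a photo is similar. It usually takes repeated revisions, with no expert guidance.
This project was built to address those pain points. It is more than a collection of tools; it is an AI-assisted helper that can:
- Quickly extract key information from academic papers
- Recognize the text in images
- Convert data files between formats in one step
- Provide photo quality assessments and creative captions
### Working with AI
#### What was it like to write code with AI for the first time?
The first time I used AI-assisted programming, my reaction was that it was incredibly convenient. The AI could produce a basic code skeleton quickly, which sped up development a great deal. As the project went on, I found that AI did well in three areas:
1. **Rapid prototyping**: it quickly generates the basic skeleton of a feature module
2. **Optimization advice**: it suggests refactorings and performance improvements
3. **Debugging**: it quickly locates potential problems in the code
#### Which prompt made you say "awesome", and which made you want to smash the keyboard?
**The most frustrating prompt:**
"修复百度智能云API连接错误" ("Fix the Baidu AI Cloud API connection error")
This seemingly simple prompt had me debugging over and over. The AI could not work out that the real problem was the API key format, so it only offered generic troubleshooting advice, and the detailed debugging had to be done by hand.
### Self-reflection: in the AI era, what is my core strength as a programmer?
Building this project made it clear to me that a programmer's core strengths have shifted fundamentally in the AI era:
#### 1. **Defining and decomposing problems**
AI is good at carrying out specific tasks, but it needs a human to define the problem and break down complex requirements. My value lies in turning user needs into concrete tasks an AI can understand.
#### 2. **System architecture design**
AI can generate code snippets, but the overall architecture, module boundaries, and interface definitions still call for human judgment.
#### 3. **Quality control and debugging**
AI-generated code can contain latent problems, so a human has to test, debug, and optimize it rigorously.
#### 4. **Creative thinking and domain understanding**
AI learns from existing data, while people can think creatively about a specific business context and come up with original solutions.
#### 5. **Ethics and responsibility**
Using AI raises ethical questions such as data privacy and algorithmic fairness. That responsibility stays with people and cannot be handed to AI.
### Summary
This project showed me that AI is not a replacement for programmers but a powerful tool and collaborator. Programmers will need:
- **AI collaboration skills**: using AI tools fluently to work more efficiently
- **Systems thinking**: designing solutions from a whole-system view
- **Domain understanding**: a deep grasp of user needs and business context
- **Continuous learning**: keeping up with the pace of technology
Through this project I gained a practical skill and, more importantly, a new way of thinking about working with AI. In the AI era, our value lies not in repetitive coding but in creative problem solving and system design.
---
**Data Extractor and Converter** - making data processing simpler and smarter! 🚀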

137
SQL_SERVER_SETUP.md Normal file

@ -0,0 +1,137 @@
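# SQL Server MDF Export Setup Guide
## 📋 Overview
The Data Extractor and Converter can now export SQL Server database files (.mdf). Because an .mdf file can only be read through a SQL Server instance, follow the steps below to set one up.
## 🔧 System requirements
### Required components
1. **SQL Server**, Express, Developer, Standard, or Enterprise edition
2. **SQL Server Native Client** or **ODBC Driver for SQL Server**
3. **The Python pyodbc library** (installed automatically)
### Recommended setup
- SQL Server 2019 Express (free)
- ODBC Driver 17 for SQL Server
## 🚀 Installation
### 1. Install SQL Server (if not already installed)
**Download SQL Server Express (free):**
- Visit: https://www.microsoft.com/en-us/sql-server/sql-server-downloads
- Download: SQL Server 2019 Express
- Choose the "Basic" installation type
**Installation notes:**
- Remember the sa password you set
- Choose "Mixed Mode" authentication
- Note the instance name (the default instance is MSSQLSERVER; SQL Server Express installs a named instance, SQLEXPRESS, by default)
### 2. Install the ODBC driver
**Download ODBC Driver 17 for SQL Server:**
- Visit: https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server
- Download and install the latest version
### 3. Verify the installation
**Check the SQL Server service:**
1. Open the Services manager (services.msc)
2. Make sure the "SQL Server (MSSQLSERVER)" service is running (for a default Express install, "SQL Server (SQLEXPRESS)")
**Test the connection:**
```bash
# Test the connection with sqlcmd
# (for a named instance, use -S localhost\SQLEXPRESS)
sqlcmd -S localhost -U sa -P your_password
```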
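## ⚙️ Application configuration
### Default connection parameters
The application uses the following connection parameters by default:
- **Server**: localhost
- **Username**: sa
- **Instance**: MSSQLSERVER
### Custom configuration
To change the connection parameters, add the following to your `.env` file:
```env
# SQL Server settings
MSSQL_SERVER=localhost
MSSQL_USERNAME=sa
MSSQL_PASSWORD=your_password
MSSQL_INSTANCE=MSSQLSERVER
```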
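These settings typically end up in an ODBC connection string. The sketch below shows one plausible way to assemble it; it is an assumption about how such values could be combined, not the project's actual code in `utils/database_exporter.py`. `build_mssql_connection_string` is a hypothetical helper, and the result would be passed to `pyodbc.connect`.

```python
import os
from typing import Mapping, Optional


def build_mssql_connection_string(
    env: Optional[Mapping[str, str]] = None,
    driver: str = "ODBC Driver 17 for SQL Server",
) -> str:
    """Assemble an ODBC connection string from the MSSQL_* settings.

    Assumes the password contains no ';' or '}' characters, which would
    need extra quoting.
    """
    source = os.environ if env is None else env
    server = source.get("MSSQL_SERVER", "localhost")
    instance = source.get("MSSQL_INSTANCE", "MSSQLSERVER")
    # The default instance is addressed by host name alone;
    # named instances use HOST\INSTANCE.
    if instance and instance.upper() != "MSSQLSERVER":
        server = f"{server}\\{instance}"
    parts = [
        f"DRIVER={{{driver}}}",
        f"SERVER={server}",
        f"UID={source.get('MSSQL_USERNAME', 'sa')}",
        f"PWD={source.get('MSSQL_PASSWORD', '')}",
    ]
    return ";".join(parts)
```

With `MSSQL_INSTANCE=SQLEXPRESS`, the server part becomes `localhost\SQLEXPRESS`, which matches how a default Express install is addressed.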
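## 📁 How MDF files are processed
### Automatic database attach
The application performs the following steps automatically:
1. Connect to the SQL Server instance
2. Check whether the database already exists
3. If it does not, attach the .mdf file
4. Read the table structure and data
5. Export to the chosen format
6. Detach the database (optional)
### Supported features
- ✅ Export all tables to Excel (one sheet per table)
- ✅ Export selected tables
- ✅ Export to CSV
- ✅ Export to JSON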
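## 🔍 Troubleshooting
### Common problems
**1. "无法连接到SQL Server" (Cannot connect to SQL Server)**
- Check that the SQL Server service is running
- Verify the connection string parameters
- Check your firewall settings
**2. "附加数据库失败" (Failed to attach the database)**
- Make sure the .mdf file is not in use by another process
- Check the file permissions
- Try attaching the database manually
**3. "ODBC驱动未找到" (ODBC driver not found)**
- Install ODBC Driver for SQL Server
- Check the system PATH environment variable
### Attaching a database manually
If the automatic attach fails, you can attach the database by hand:
```sql
-- Run in SQL Server Management Studio
CREATE DATABASE [YourDatabaseName]
ON (FILENAME = 'C:\path\to\your\file.mdf')
FOR ATTACH;
```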
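## 🎯 Usage examples
### Basic usage
1. Start the application
2. Select the "🗄️ 数据库导出" (Database export) feature
3. Upload the .mdf file
4. Choose an export format
5. Click "开始导出" (Start export)
### Advanced options
- Table names: export only specific tables
- Custom connection: change the connection parameters in the .env file
## 📚 Resources
- [SQL Server documentation](https://docs.microsoft.com/en-us/sql/)
- [ODBC driver documentation](https://docs.microsoft.com/en-us/sql/connect/odbc/)
- [pyodbc documentation](https://github.com/mkleehammer/pyodbc)
## 💡 Notes
1. **Security**: use strong passwords in production
2. **Performance**: large files can take a long time to process
3. **Compatibility**: SQL Server 2008 and later are supported
4. **Permissions**: make sure the application has sufficient database permissions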

795
app.py Normal file

@ -0,0 +1,795 @@
import streamlit as st
import os
import uuid
import tempfile
from pathlib import Path
from dotenv import load_dotenv
# 加载环境变量
load_dotenv()
# 导入工具模块
from utils.pdf_extractor import extract_text_from_pdf, pdf_to_excel
from utils.ocr_processor import extract_text_from_image, image_to_excel, image_to_text_file
from utils.format_converter import (
excel_to_csv, csv_to_excel, json_to_excel,
excel_to_json, csv_to_json, json_to_csv
)
from utils.web_scraper import scrape_webpage, web_to_excel
from utils.database_exporter import export_sqlite_to_excel, database_to_csv, database_to_json
# 页面配置
st.set_page_config(
page_title="数据提取与转换器",
page_icon="🔧",
layout="wide",
initial_sidebar_state="expanded"
)
# 自定义CSS样式
st.markdown("""
<style>
.main-header {
text-align: center;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
color: white;
padding: 2rem;
border-radius: 10px;
margin-bottom: 2rem;
}
.feature-card {
background: #f8f9fa;
padding: 1.5rem;
border-radius: 10px;
border-left: 4px solid #3498db;
margin-bottom: 1rem;
}
.success-box {
background: #d4edda;
color: #155724;
padding: 1rem;
border-radius: 5px;
border: 1px solid #c3e6cb;
}
.error-box {
background: #f8d7da;
color: #721c24;
padding: 1rem;
border-radius: 5px;
border: 1px solid #f5c6cb;
}
</style>
""", unsafe_allow_html=True)
# 页面标题
st.markdown("""
<div class="main-header">
<h1>🔧 数据提取与转换器</h1>
<p>多功能数据处理工具</p>
</div>
""", unsafe_allow_html=True)
# 侧边栏导航
st.sidebar.title("功能导航")
page = st.sidebar.radio("选择功能", [
"📄 PDF处理",
"🖼️ 图片OCR",
"📸 AI照片评分",
"🔄 格式转换",
"🌐 网页抓取",
"🗄️ 数据库导出"
])
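# File upload helper
def save_uploaded_file(uploaded_file, file_type):
    """Save an uploaded file to a temporary file and return its path (None on failure).

    file_type is currently unused; it only labels the call site.
    """
    try:
        # Keep the original extension so downstream tools can detect the format
        suffix = Path(uploaded_file.name).suffix
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(f"文件保存失败: {str(e)}")
        return None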
# PDF处理页面
if page == "📄 PDF处理":
st.header("📄 PDF文本/表格提取")
uploaded_file = st.file_uploader("选择PDF文件", type=['pdf'])
if uploaded_file is not None:
file_path = save_uploaded_file(uploaded_file, 'pdf')
col1, col2 = st.columns(2)
with col1:
if st.button("提取文本内容", use_container_width=True):
with st.spinner("正在提取文本..."):
try:
text = extract_text_from_pdf(file_path)
st.subheader("提取的文本内容")
st.text_area("文本内容", text, height=300)
st.success("文本提取完成!")
except Exception as e:
st.error(f"提取失败: {str(e)}")
with col2:
if st.button("导出为Excel", use_container_width=True):
with st.spinner("正在转换为Excel..."):
try:
# Strip the extension regardless of case (e.g. ".PDF") so the output never overwrites the input
output_path = file_path.rsplit('.', 1)[0] + '_converted.xlsx'
pdf_to_excel(file_path, output_path)
with open(output_path, "rb") as file:
st.download_button(
label="下载Excel文件",
data=file,
file_name=Path(output_path).name,
mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
st.success("PDF转换完成")
except Exception as e:
st.error(f"转换失败: {str(e)}")
# AI照片评分页面
elif page == "📸 AI照片评分":
st.header("📸 AI照片质量评分")
# 百度智能云功能状态检查
try:
from utils.baidu_image_analysis import check_baidu_config
baidu_available, baidu_message = check_baidu_config()
except Exception:
baidu_available = False
baidu_message = "百度智能云未配置"
# 显示状态
if baidu_available:
st.success("✅ 百度智能云AI照片评分可用")
else:
st.warning(f"⚠️ 百度智能云AI照片评分: {baidu_message}")
if not baidu_available:
st.info("""
**百度智能云配置说明:**
1. **注册百度智能云账号**: https://cloud.baidu.com
2. **开通图像分析服务**: 在控制台搜索"图像分析""图像识别"
3. **获取API密钥**: 创建应用并获取API Key和Secret Key
4. **.env文件中配置**:
```
BAIDU_API_KEY=您的API Key
BAIDU_SECRET_KEY=您的Secret Key
```
""")
uploaded_file = st.file_uploader("选择照片文件", type=['jpg', 'jpeg', 'png', 'gif', 'bmp'])
if uploaded_file is not None:
file_path = save_uploaded_file(uploaded_file, 'image')
# AI文案生成功能状态检查
try:
from utils.ai_copywriter import check_copywriter_config
copywriter_available, copywriter_message = check_copywriter_config()
except Exception:
copywriter_available = False
copywriter_message = "AI文案生成未配置"
# 显示AI文案生成状态
if copywriter_available:
st.success("✅ AI文案生成可用")
else:
st.warning(f"⚠️ AI文案生成: {copywriter_message}")
col1, col2, col3, col4 = st.columns(4)
with col1:
if st.button("质量评分", use_container_width=True, disabled=not baidu_available):
with st.spinner("正在分析照片质量..."):
try:
from utils.baidu_image_analysis import analyze_image_quality
from utils.photo_advice_generator import get_quality_improvement_advice
quality_result = analyze_image_quality(file_path)
st.subheader("📊 照片质量评分")
# 显示总体评分
score = quality_result['score']
st.metric("总体评分", f"{score}/100", f"{score - 75}")
# 显示质量维度
st.subheader("质量维度分析")
quality_scores = {}
for dimension, info in quality_result['dimensions'].items():
col_dim1, col_dim2 = st.columns([1, 3])
with col_dim1:
st.progress(info['score'] / 100)
with col_dim2:
st.write(f"**{dimension}**: {info['comment']} ({info['score']}/100)")
quality_scores[dimension] = info['score']
# 生成详细改进建议
advice_result = get_quality_improvement_advice(quality_scores)
# 显示总体建议
st.subheader("💡 总体改进建议")
for suggestion in advice_result.get('overall', []):
st.info(f"📌 {suggestion}")
# 显示优先级建议
if advice_result.get('priority'):
st.subheader("🎯 优先级改进")
for priority in advice_result['priority']:
st.warning(f"⚠️ {priority}")
# 显示具体维度建议
st.subheader("🔧 具体改进措施")
for dimension, suggestions in advice_result.get('specific', {}).items():
with st.expander(f"{dimension}改进建议"):
for i, suggestion in enumerate(suggestions, 1):
st.write(f"{i}. {suggestion}")
# 显示技术建议
st.subheader("📚 技术学习建议")
from utils.photo_advice_generator import get_technical_advice
tech_advice = get_technical_advice()
for category, suggestions in tech_advice.items():
with st.expander(f"{category}技术建议"):
for i, suggestion in enumerate(suggestions[:3], 1):
st.write(f"{i}. {suggestion}")
st.success("照片质量分析完成!已生成详细改进建议")
except Exception as e:
st.error(f"质量评分失败: {str(e)}")
with col2:
if st.button("内容分析", use_container_width=True, disabled=not baidu_available):
with st.spinner("正在分析照片内容..."):
try:
from utils.baidu_image_analysis import analyze_image_content
content_result = analyze_image_content(file_path)
st.subheader("🔍 照片内容分析")
if content_result['objects']:
st.write("**识别到的对象:**")
for i, obj in enumerate(content_result['objects'][:5], 1):
st.write(f"{i}. **{obj['name']}** (置信度: {obj['confidence']:.2%})")
if obj.get('baike_info'):
st.write(f" 描述: {obj['baike_info'].get('description', '无描述')}")
if content_result['summary']:
st.write(f"**内容摘要:** {content_result['summary']}")
st.success("照片内容分析完成!")
except Exception as e:
st.error(f"内容分析失败: {str(e)}")
with col3:
if st.button("美学评分", use_container_width=True, disabled=not baidu_available):
with st.spinner("正在评估照片美学..."):
try:
from utils.baidu_image_analysis import get_image_aesthetic_score
from utils.photo_advice_generator import get_aesthetic_improvement_advice
aesthetic_result = get_image_aesthetic_score(file_path)
st.subheader("🎨 照片美学评分")
# 显示美学评分
aesthetic_score = aesthetic_result['aesthetic_score']
st.metric("美学评分", f"{aesthetic_score}/100", f"{aesthetic_score - 75}")
# 显示美学维度
st.subheader("美学维度分析")
col_comp, col_color, col_light, col_focus = st.columns(4)
with col_comp:
st.metric("构图", aesthetic_result['composition'])
with col_color:
st.metric("色彩和谐", aesthetic_result['color_harmony'])
with col_light:
st.metric("光线", aesthetic_result['lighting'])
with col_focus:
st.metric("对焦", aesthetic_result['focus'])
# 生成详细美学建议
advice_result = get_aesthetic_improvement_advice(aesthetic_score)
# 显示总体美学建议
st.subheader("💡 总体美学建议")
for suggestion in advice_result.get('general', []):
st.info(f"🎨 {suggestion}")
# 显示具体美学建议
st.subheader("🔧 具体美学改进")
if advice_result.get('composition'):
with st.expander("构图改进建议"):
for i, suggestion in enumerate(advice_result['composition'], 1):
st.write(f"{i}. {suggestion}")
if advice_result.get('lighting'):
with st.expander("用光改进建议"):
for i, suggestion in enumerate(advice_result['lighting'], 1):
st.write(f"{i}. {suggestion}")
if advice_result.get('subject'):
with st.expander("主体表现建议"):
for i, suggestion in enumerate(advice_result['subject'], 1):
st.write(f"{i}. {suggestion}")
# 显示创意建议
if advice_result.get('creative'):
st.subheader("🌟 创意提升建议")
for suggestion in advice_result['creative']:
st.success(f"{suggestion}")
# 显示个性化建议
st.subheader("📋 个性化学习计划")
from utils.photo_advice_generator import get_personalized_advice
# 获取照片内容用于个性化建议
from utils.baidu_image_analysis import analyze_image_content
content_result = analyze_image_content(file_path)
photo_content = content_result.get('summary', '一般照片')
# 生成质量分数用于个性化建议
from utils.baidu_image_analysis import analyze_image_quality
quality_result = analyze_image_quality(file_path)
quality_scores = {dim: info['score'] for dim, info in quality_result['dimensions'].items()}
personalized_advice = get_personalized_advice(quality_scores, aesthetic_score, photo_content)
for category, suggestions in personalized_advice.items():
if suggestions:
with st.expander(f"{category}"):
for i, suggestion in enumerate(suggestions, 1):
st.write(f"{i}. {suggestion}")
st.success("照片美学评估完成!已生成详细改进建议")
except Exception as e:
st.error(f"美学评分失败: {str(e)}")
with col4:
if st.button("AI写文案", use_container_width=True, disabled=not copywriter_available):
with st.spinner("正在生成创意文案..."):
try:
# 先进行内容分析获取照片描述
from utils.baidu_image_analysis import analyze_image_content
content_result = analyze_image_content(file_path)
# 使用AI生成文案
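# generate_photo_caption is called by the "重新生成文案" button below, so it must be imported too
from utils.ai_copywriter import generate_multiple_captions, analyze_photo_suitability, generate_photo_caption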
# 获取照片描述
image_description = content_result.get('summary', '一张美丽的照片')
# 分析适合的文案风格
suitability_result = analyze_photo_suitability(image_description)
st.subheader("✍️ AI创意文案生成")
# 显示照片描述
st.write(f"**照片描述**: {image_description}")
# 显示推荐风格
st.write(f"**推荐风格**: {', '.join(suitability_result['recommended_styles'][:3])}")
# 生成多个文案选项
captions = generate_multiple_captions(image_description, count=3, style=suitability_result['most_suitable'])
st.subheader("📝 文案选项")
for caption_info in captions:
with st.expander(f"选项 {caption_info['option']} ({caption_info.get('length', '适中')} - {caption_info['char_count']}字)"):
st.write(caption_info['caption'])
# 复制按钮
if st.button(f"复制文案 {caption_info['option']}", key=f"copy_{caption_info['option']}"):
st.code(caption_info['caption'], language='text')
st.success("文案已复制到剪贴板!")
st.subheader("🎨 文案风格选择")
# 风格选择
selected_style = st.selectbox(
"选择文案风格",
['creative', 'social', 'professional', 'marketing', 'emotional', 'simple'],
format_func=lambda x: {
'creative': '创意文艺',
'social': '社交媒体',
'professional': '专业正式',
'marketing': '营销推广',
'emotional': '情感表达',
'simple': '简单描述'
}[x]
)
# 长度选择
selected_length = st.selectbox(
"选择文案长度",
['short', 'medium', 'long'],
format_func=lambda x: {
'short': '简短精炼',
'medium': '适中长度',
'long': '详细描述'
}[x]
)
if st.button("重新生成文案", use_container_width=True):
with st.spinner("正在重新生成文案..."):
new_caption = generate_photo_caption(image_description, selected_style, selected_length)
st.subheader("🆕 新生成文案")
st.write(new_caption)
st.success("新文案生成完成!")
st.success("AI文案生成完成")
except Exception as e:
st.error(f"AI文案生成失败: {str(e)}")
# 显示图片预览
st.subheader("📷 照片预览")
st.image(uploaded_file, caption="上传的照片", use_column_width=True)
# 图片OCR页面
elif page == "🖼️ 图片OCR":
st.header("🖼️ 图片文字识别 (OCR)")
# OCR功能状态检查
try:
import pytesseract
# 测试Tesseract是否可用
pytesseract.get_tesseract_version()
tesseract_available = True
except Exception:
tesseract_available = False
# AI OCR功能状态检查
try:
from utils.aliyun_ocr import check_aliyun_config
ai_available, ai_message = check_aliyun_config()
except Exception:
ai_available = False
ai_message = "阿里云OCR未配置"
# 显示OCR状态
col_status1, col_status2 = st.columns(2)
with col_status1:
if tesseract_available:
st.success("✅ Tesseract OCR可用")
else:
st.warning("⚠️ Tesseract OCR未安装")
with col_status2:
if ai_available:
st.success("✅ AI大模型OCR可用")
else:
st.warning(f"⚠️ AI大模型OCR: {ai_message}")
# OCR模式选择
ocr_mode = st.radio("选择OCR模式",
["传统OCR (Tesseract)", "AI大模型OCR (阿里云)"],
disabled=not (tesseract_available or ai_available))
if not tesseract_available and not ai_available:
st.info("""
**OCR功能配置说明:**
**传统OCR (推荐免费):**
1. 下载Tesseract OCR: https://github.com/UB-Mannheim/tesseract/wiki
2. 安装到默认路径并添加到PATH
**AI大模型OCR (高精度):**
1. 注册阿里云账号: https://www.aliyun.com
2. 开通OCR服务并获取AccessKey
3. .env文件中配置ALIYUN_ACCESS_KEY_ID和ALIYUN_ACCESS_KEY_SECRET
""")
uploaded_file = st.file_uploader("选择图片文件", type=['jpg', 'jpeg', 'png', 'gif', 'bmp'])
if uploaded_file is not None:
file_path = save_uploaded_file(uploaded_file, 'image')
# 根据选择的模式启用/禁用按钮
use_ai = ocr_mode == "AI大模型OCR (阿里云)"
button_disabled = (use_ai and not ai_available) or (not use_ai and not tesseract_available)
col1, col2, col3 = st.columns(3)
with col1:
if st.button("识别文字", use_container_width=True, disabled=button_disabled):
with st.spinner("正在识别文字..."):
try:
if use_ai:
text = extract_text_from_image(file_path, use_ai=True, ai_provider='aliyun')
else:
text = extract_text_from_image(file_path)
st.subheader("识别的文字内容")
st.text_area("文字内容", text, height=300)
st.success("文字识别完成!")
except Exception as e:
st.error(f"识别失败: {str(e)}")
with col2:
if st.button("导出为Excel", use_container_width=True, disabled=button_disabled):
with st.spinner("正在转换为Excel..."):
try:
output_path = file_path.rsplit('.', 1)[0] + '_converted.xlsx'
if use_ai:
# 使用AI OCR导出到Excel
from utils.ocr_processor import extract_text_with_ai
text = extract_text_with_ai(file_path, 'aliyun', 'general')
import pandas as pd
lines = [line.strip() for line in text.split('\n') if line.strip()]
df = pd.DataFrame({
'行号': range(1, len(lines) + 1),
'内容': lines
})
df.to_excel(output_path, index=False)
else:
image_to_excel(file_path, output_path)
with open(output_path, "rb") as file:
st.download_button(
label="下载Excel文件",
data=file,
file_name=Path(output_path).name,
mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
st.success("图片转换完成!")
except Exception as e:
st.error(f"转换失败: {str(e)}")
with col3:
if st.button("导出为文本", use_container_width=True, disabled=button_disabled):
with st.spinner("正在转换为文本..."):
try:
output_path = file_path.rsplit('.', 1)[0] + '_converted.txt'
if use_ai:
# 使用AI OCR导出到文本
from utils.ocr_processor import extract_text_with_ai
text = extract_text_with_ai(file_path, 'aliyun', 'general')
with open(output_path, 'w', encoding='utf-8') as f:
f.write(text)
else:
image_to_text_file(file_path, output_path)
with open(output_path, "rb") as file:
st.download_button(
label="下载文本文件",
data=file,
file_name=Path(output_path).name,
mime="text/plain"
)
st.success("图片转换完成!")
except Exception as e:
st.error(f"转换失败: {str(e)}")
# 显示图片预览
st.subheader("图片预览")
st.image(uploaded_file, caption="上传的图片", use_column_width=True)
# 显示OCR模式信息
st.info(f"当前使用: {ocr_mode}")
# Format conversion page
elif page == "🔄 格式转换":
    st.header("🔄 文件格式转换")
    uploaded_file = st.file_uploader("选择文件", type=['xlsx', 'xls', 'csv', 'json'])
    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'format')
        file_ext = Path(uploaded_file.name).suffix.lower()
        # Offer only the target formats that apply to this file type
        if file_ext in ['.xlsx', '.xls']:
            target_format = st.selectbox("转换为", ["CSV", "JSON"])
        elif file_ext == '.csv':
            target_format = st.selectbox("转换为", ["Excel", "JSON"])
        elif file_ext == '.json':
            target_format = st.selectbox("转换为", ["Excel", "CSV"])
        if st.button("开始转换", use_container_width=True):
            with st.spinner("正在转换格式..."):
                try:
                    # with_suffix() swaps only the final extension; str.replace()
                    # missed upper-case extensions (output would overwrite the
                    # input) and could alter other parts of the path.
                    source = Path(file_path)
                    if file_ext in ['.xlsx', '.xls'] and target_format == "CSV":
                        output_path = str(source.with_suffix('.csv'))
                        excel_to_csv(file_path, output_path)
                        mime_type = "text/csv"
                    elif file_ext in ['.xlsx', '.xls'] and target_format == "JSON":
                        output_path = str(source.with_suffix('.json'))
                        excel_to_json(file_path, output_path)
                        mime_type = "application/json"
                    elif file_ext == '.csv' and target_format == "Excel":
                        output_path = str(source.with_suffix('.xlsx'))
                        csv_to_excel(file_path, output_path)
                        mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    elif file_ext == '.csv' and target_format == "JSON":
                        output_path = str(source.with_suffix('.json'))
                        csv_to_json(file_path, output_path)
                        mime_type = "application/json"
                    elif file_ext == '.json' and target_format == "Excel":
                        output_path = str(source.with_suffix('.xlsx'))
                        json_to_excel(file_path, output_path)
                        mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    elif file_ext == '.json' and target_format == "CSV":
                        output_path = str(source.with_suffix('.csv'))
                        json_to_csv(file_path, output_path)
                        mime_type = "text/csv"
                    with open(output_path, "rb") as file:
                        st.download_button(
                            label=f"下载{target_format}文件",
                            data=file,
                            file_name=Path(output_path).name,
                            mime=mime_type
                        )
                    st.success("格式转换完成!")
                except Exception as e:
                    st.error(f"转换失败: {str(e)}")
# 网页抓取页面
elif page == "🌐 网页抓取":
st.header("🌐 网页数据抓取")
url = st.text_input("网页URL", placeholder="https://example.com")
selector = st.text_input("CSS选择器 (可选)", placeholder="例如: .content, #main, p")
col1, col2 = st.columns(2)
with col1:
if st.button("抓取内容", use_container_width=True):
if not url:
st.error("请输入网页URL")
else:
with st.spinner("正在抓取网页内容..."):
try:
content = scrape_webpage(url, selector if selector else None)
st.subheader("抓取的内容")
st.text_area("网页内容", content, height=300)
st.success("网页抓取完成!")
except Exception as e:
st.error(f"抓取失败: {str(e)}")
with col2:
if st.button("导出为Excel", use_container_width=True):
if not url:
st.error("请输入网页URL")
else:
with st.spinner("正在导出为Excel..."):
try:
output_filename = f"web_content_{uuid.uuid4().hex[:8]}.xlsx"
output_path = os.path.join(tempfile.gettempdir(), output_filename)
web_to_excel(url, output_path, selector if selector else None)
with open(output_path, "rb") as file:
st.download_button(
label="下载Excel文件",
data=file,
file_name=output_filename,
mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
st.success("网页导出完成!")
except Exception as e:
st.error(f"导出失败: {str(e)}")
# Database export page
elif page == "🗄️ 数据库导出":
    st.header("🗄️ 数据库导出")
    uploaded_file = st.file_uploader("选择数据库文件", type=['db', 'sqlite', 'mdf'])
    table_name = st.text_input("表名 (可选)", placeholder="留空则导出所有表")
    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'database')
        target_format = st.selectbox("导出为", ["Excel", "CSV", "JSON"])
        if st.button("开始导出", use_container_width=True):
            with st.spinner("正在导出数据库..."):
                try:
                    file_ext = Path(file_path).suffix.lower()
                    # Strip only the final extension; str.replace() missed
                    # upper-case extensions such as ".DB"
                    export_base = str(Path(file_path).with_suffix(''))
                    table = table_name if table_name else None
                    output_path = None
                    continue_processing = True  # set to False when nothing was exported
                    if file_ext in ['.db', '.sqlite']:
                        if target_format == "Excel":
                            output_path = export_base + '_exported.xlsx'
                            export_sqlite_to_excel(file_path, output_path, table)
                            mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                        elif target_format == "CSV":
                            output_path = export_base + '_exported.csv'
                            database_to_csv(file_path, output_path, table)
                            mime_type = "text/csv"
                        elif target_format == "JSON":
                            output_path = export_base + '_exported.json'
                            database_to_json(file_path, output_path, table)
                            mime_type = "application/json"
                    elif file_ext == '.mdf':
                        # MDF files need a reachable SQL Server instance
                        try:
                            import pyodbc
                            # Probe the SQL Server connection
                            test_conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;Trusted_Connection=yes;timeout=3")
                            test_conn.close()
                            sql_server_available = True
                        except Exception:
                            sql_server_available = False
                            # Skip the download step so the guidance below is not
                            # followed by a misleading "export failed" message
                            continue_processing = False
                            st.warning("⚠️ SQL Server未运行或无法连接")
                            st.info("""
**MDF文件导出需要SQL Server支持:**
1. **安装SQL Server Express** (免费)
- 下载: https://www.microsoft.com/en-us/sql-server/sql-server-downloads
2. **确保SQL Server服务运行**
- 打开"服务"管理器 (services.msc)
- 启动"SQL Server (MSSQLSERVER)"服务
3. **配置连接权限**
- 使用Windows身份验证或配置sa密码
安装完成后重启应用即可使用MDF导出功能
""")
                        if sql_server_available:
                            if target_format == "Excel":
                                output_path = export_base + '_exported.xlsx'
                                from utils.database_exporter import export_mssql_mdf_to_excel
                                export_mssql_mdf_to_excel(file_path, output_path, table)
                                mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                            elif target_format == "CSV":
                                output_path = export_base + '_exported.csv'
                                database_to_csv(file_path, output_path, table)
                                mime_type = "text/csv"
                            elif target_format == "JSON":
                                output_path = export_base + '_exported.json'
                                database_to_json(file_path, output_path, table)
                                mime_type = "application/json"
                    else:
                        st.error("不支持的数据库格式")
                        continue_processing = False
                    # Offer the download only when an export file was actually written
                    if continue_processing and output_path and os.path.exists(output_path):
                        with open(output_path, "rb") as file:
                            st.download_button(
                                label=f"下载{target_format}文件",
                                data=file,
                                file_name=Path(output_path).name,
                                mime=mime_type
                            )
                        st.success("数据库导出完成!")
                    elif continue_processing:
                        st.error("导出文件创建失败")
                except Exception as e:
                    st.error(f"导出失败: {str(e)}")
# 页脚信息
st.sidebar.markdown("---")
st.sidebar.markdown("""
### 使用说明
1. 选择功能模块
2. 上传文件或输入URL
3. 点击相应按钮处理
4. 下载处理结果
### 支持格式
- **PDF**: .pdf
- **图片**: .jpg, .jpeg, .png, .gif, .bmp
- **数据文件**: .xlsx, .xls, .csv, .json
- **数据库**: .db, .sqlite, .mdf
""")

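The format-conversion and database-export pages above derive each output file name from the uploaded file's path. A minimal sketch of doing that with `pathlib` follows; `derive_output_path` and its `tag` parameter are hypothetical names used only for illustration and are not part of this commit.

```python
from pathlib import Path


def derive_output_path(input_path: str, new_suffix: str, tag: str = "") -> str:
    """Return input_path with its final extension swapped for new_suffix.

    Only the last suffix is touched, so upper-case extensions and a
    matching substring elsewhere in the path are both handled.
    Hypothetical helper, shown for illustration only.
    """
    source = Path(input_path)
    # stem is the file name without its final extension
    return str(source.with_name(f"{source.stem}{tag}{new_suffix}"))


if __name__ == "__main__":
    print(derive_output_path("uploads/report.xlsx", ".csv"))
    print(derive_output_path("uploads/.db_cache/DATA.DB", ".xlsx", tag="_exported"))
```

Working on the parsed path rather than on the raw string keeps the directory part untouched even when it happens to contain the extension text.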
241
app_flask.py Normal file

@ -0,0 +1,241 @@
from flask import Flask, render_template, request, jsonify, send_file, redirect, url_for
import os
import uuid
from werkzeug.utils import secure_filename
from config import Config
# 导入工具模块
from utils.pdf_extractor import extract_text_from_pdf, pdf_to_excel
from utils.ocr_processor import extract_text_from_image, image_to_excel, image_to_text_file
from utils.format_converter import (
excel_to_csv, csv_to_excel, json_to_excel,
excel_to_json, csv_to_json, json_to_csv
)
from utils.web_scraper import scrape_webpage, web_to_excel
from utils.database_exporter import export_sqlite_to_excel, database_to_csv, database_to_json
app = Flask(__name__)
app.config.from_object(Config)
# Make sure the upload directory exists
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)
def allowed_file(filename):
    """Return True if the file extension is allowed."""
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in app.config['ALLOWED_EXTENSIONS']
@app.route('/')
def index():
    """Home page."""
    return render_template('index.html')
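@app.route('/upload', methods=['POST'])
def upload_file():
    """Handle a file upload."""
    if 'file' not in request.files:
        return jsonify({'error': '没有选择文件'}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': '没有选择文件'}), 400
    if file and allowed_file(file.filename):
        # Read the extension from the original name: secure_filename() drops
        # non-ASCII characters, so a name such as "报告.pdf" is reduced to "pdf"
        # and the later rsplit('.') lookup would raise IndexError.
        file_type = file.filename.rsplit('.', 1)[1].lower()
        filename = secure_filename(file.filename)
        if not filename.lower().endswith(f'.{file_type}'):
            filename = f'upload.{file_type}'
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4()}_{filename}")
        file.save(filepath)
        return jsonify({
            'success': True,
            'filename': filename,
            'filepath': filepath,
            'file_type': file_type
        })
    return jsonify({'error': '不支持的文件类型'}), 400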
@app.route('/process/pdf', methods=['POST'])
def process_pdf():
    """Process a PDF file."""
    try:
        data = request.json
        filepath = data.get('filepath')
        action = data.get('action', 'extract')  # extract, to_excel
        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': '文件不存在'}), 400
        if action == 'extract':
            text = extract_text_from_pdf(filepath)
            return jsonify({'success': True, 'text': text})
        elif action == 'to_excel':
            # splitext() strips only the final extension, whatever its case;
            # replace('.pdf', ...) left "X.PDF" unchanged and overwrote the input
            output_path = os.path.splitext(filepath)[0] + '_converted.xlsx'
            pdf_to_excel(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })
        else:
            return jsonify({'error': '不支持的操作'}), 400
    except Exception as e:
        return jsonify({'error': str(e)}), 500
@app.route('/process/image', methods=['POST'])
def process_image():
    """Process an image file."""
    try:
        data = request.json
        filepath = data.get('filepath')
        action = data.get('action', 'extract')  # extract, to_excel, to_text
        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': '文件不存在'}), 400
        if action == 'extract':
            text = extract_text_from_image(filepath)
            return jsonify({'success': True, 'text': text})
        elif action == 'to_excel':
            output_path = filepath.rsplit('.', 1)[0] + '_converted.xlsx'
            image_to_excel(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })
        elif action == 'to_text':
            output_path = filepath.rsplit('.', 1)[0] + '_converted.txt'
            image_to_text_file(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })
        else:
            return jsonify({'error': '不支持的操作'}), 400
    except Exception as e:
        return jsonify({'error': str(e)}), 500
@app.route('/process/format', methods=['POST'])
def process_format():
    """Convert a file from one format to another."""
    try:
        data = request.json
        filepath = data.get('filepath')
        target_format = data.get('target_format')  # excel, csv, json
        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': '文件不存在'}), 400
        # splitext() isolates the final extension, so the output name is right
        # for upper-case extensions and for paths that contain other dots
        base, ext = os.path.splitext(filepath)
        file_ext = ext.lstrip('.').lower()
        excel_exts = ('xlsx', 'xls')  # both are listed in ALLOWED_EXTENSIONS
        # Pick the converter from the source and target formats
        if file_ext in excel_exts and target_format == 'csv':
            output_path = base + '.csv'
            excel_to_csv(filepath, output_path)
        elif file_ext == 'csv' and target_format == 'excel':
            output_path = base + '.xlsx'
            csv_to_excel(filepath, output_path)
        elif file_ext == 'json' and target_format == 'excel':
            output_path = base + '.xlsx'
            json_to_excel(filepath, output_path)
        elif file_ext in excel_exts and target_format == 'json':
            output_path = base + '.json'
            excel_to_json(filepath, output_path)
        elif file_ext == 'csv' and target_format == 'json':
            output_path = base + '.json'
            csv_to_json(filepath, output_path)
        elif file_ext == 'json' and target_format == 'csv':
            output_path = base + '.csv'
            json_to_csv(filepath, output_path)
        else:
            return jsonify({'error': '不支持的格式转换'}), 400
        return jsonify({
            'success': True,
            'download_url': f'/download/{os.path.basename(output_path)}'
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500
@app.route('/process/web', methods=['POST'])
def process_web():
    """Scrape a web page."""
    try:
        data = request.json
        url = data.get('url')
        # Treat an empty selector as "no selector" for both calls below
        selector = data.get('selector') or None
        if not url:
            return jsonify({'error': '请输入URL'}), 400
        # Scrape the page content
        content = scrape_webpage(url, selector)
        # Export to Excel
        output_filename = f"web_content_{uuid.uuid4().hex[:8]}.xlsx"
        output_path = os.path.join(app.config['UPLOAD_FOLDER'], output_filename)
        web_to_excel(url, output_path, selector)
        return jsonify({
            'success': True,
            'content': content if isinstance(content, str) else '内容已提取',
            'download_url': f'/download/{output_filename}'
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500
@app.route('/process/database', methods=['POST'])
def process_database():
    """Export a database file."""
    try:
        data = request.json
        filepath = data.get('filepath')
        target_format = data.get('target_format', 'excel')  # excel, csv, json
        # Optional table name; an empty value means "export all tables"
        table_name = data.get('table_name') or None
        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': '文件不存在'}), 400
        base, ext = os.path.splitext(filepath)
        file_ext = ext.lstrip('.').lower()
        if file_ext in ['db', 'sqlite']:
            if target_format == 'excel':
                output_path = base + '_exported.xlsx'
                export_sqlite_to_excel(filepath, output_path, table_name)
            elif target_format == 'csv':
                output_path = base + '_exported.csv'
                database_to_csv(filepath, output_path, table_name)
            elif target_format == 'json':
                output_path = base + '_exported.json'
                database_to_json(filepath, output_path, table_name)
            else:
                return jsonify({'error': '不支持的导出格式'}), 400
        else:
            return jsonify({'error': '不支持的数据库格式'}), 400
        return jsonify({
            'success': True,
            'download_url': f'/download/{os.path.basename(output_path)}'
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500
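@app.route('/download/<filename>')
def download_file(filename):
    """Serve a generated file for download."""
    # Sanitise the name so that values such as "..\\..\\config.py" cannot
    # escape the upload folder (backslashes are path separators on Windows)
    filename = secure_filename(filename)
    filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    if filename and os.path.isfile(filepath):
        return send_file(filepath, as_attachment=True)
    return jsonify({'error': '文件不存在'}), 404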
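if __name__ == '__main__':
    # Read run settings from the environment (DEBUG, HOST and PORT are listed
    # in .env.example); debug mode on 0.0.0.0 would expose the Werkzeug debugger
    app.run(
        debug=os.getenv('DEBUG', 'false').lower() == 'true',
        host=os.getenv('HOST', '0.0.0.0'),
        port=int(os.getenv('PORT', '5000')),
    )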
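The `/process/*` routes above take a `filepath` value from the request body and check only that the file exists. One possible hardening is sketched below, under the assumption that every processed file should live inside `UPLOAD_FOLDER`; `resolve_upload_path` is a hypothetical helper, not a function from this commit.

```python
import os


def resolve_upload_path(upload_folder: str, client_path: str) -> str:
    """Return the absolute form of client_path if it lies inside upload_folder.

    Raises ValueError otherwise. Hypothetical helper, shown for illustration.
    """
    # realpath() resolves "..", symlinks and relative segments on both sides
    base = os.path.realpath(upload_folder)
    candidate = os.path.realpath(client_path)
    # commonpath() equals base only when candidate is base or sits below it
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path is outside the upload folder")
    return candidate


if __name__ == "__main__":
    try:
        resolve_upload_path("uploads", os.path.join("uploads", "..", "config.py"))
    except ValueError as exc:
        print(f"rejected: {exc}")
```

A route could call such a helper right after reading `filepath` and return a 400 response when it raises.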

26
config.py Normal file

@ -0,0 +1,26 @@
import os
from dotenv import load_dotenv
load_dotenv()
class Config:
    SECRET_KEY = os.getenv('SECRET_KEY', 'dev-secret-key')
    UPLOAD_FOLDER = 'uploads'
    MAX_CONTENT_LENGTH = 16 * 1024 * 1024  # 16MB max file size
    # OCR settings
    TESSERACT_PATH = os.getenv('TESSERACT_PATH', '')
    # Database settings
    DATABASE_URI = os.getenv('DATABASE_URI', 'sqlite:///data.db')
    # Web scraping settings
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    # Supported file types
    ALLOWED_EXTENSIONS = {
        'pdf', 'txt', 'doc', 'docx',
        'jpg', 'jpeg', 'png', 'gif', 'bmp',
        'xlsx', 'xls', 'csv', 'json',
        'db', 'sqlite'
    }

253
diagnose_ocr.py Normal file

@ -0,0 +1,253 @@
#!/usr/bin/env python3
"""
OCR功能诊断脚本
检查Tesseract OCR的安装和配置状态
"""
import os
import sys
import tempfile
from pathlib import Path
def check_tesseract_installation():
"""检查Tesseract OCR是否安装"""
print("🔍 检查Tesseract OCR安装状态...")
# 常见的Tesseract安装路径
possible_paths = [
r"C:\Program Files\Tesseract-OCR\tesseract.exe",
r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
r"D:\Program Files\Tesseract-OCR\tesseract.exe",
r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
]
tesseract_path = None
for path in possible_paths:
if os.path.exists(path):
tesseract_path = path
print(f"✅ Tesseract找到: {path}")
break
if not tesseract_path:
print("❌ Tesseract未找到在默认路径")
# 检查系统PATH
import shutil
tesseract_cmd = shutil.which("tesseract")
if tesseract_cmd:
print(f"✅ Tesseract在PATH中找到: {tesseract_cmd}")
tesseract_path = tesseract_cmd
else:
print("❌ Tesseract未在系统PATH中找到")
return tesseract_path
def check_python_dependencies():
"""检查Python OCR相关依赖"""
print("\n🐍 检查Python依赖...")
dependencies = ["pytesseract", "PIL", "pandas"]
for dep in dependencies:
try:
if dep == "PIL":
import PIL
print(f"{dep}: {PIL.__version__}")
elif dep == "pytesseract":
import pytesseract
print(f"{dep}: 已安装")
elif dep == "pandas":
import pandas
print(f"{dep}: {pandas.__version__}")
except ImportError as e:
print(f"{dep}: 未安装 - {e}")
def create_test_image():
    """Create a test image."""
    print("\n🖼️ 创建测试图片...")
    try:
        from PIL import Image, ImageDraw, ImageFont
        # Create the image
        img = Image.new('RGB', (400, 200), color='white')
        d = ImageDraw.Draw(img)
        # Try several fonts in turn
        fonts_to_try = [
            "arial.ttf",
            "Arial.ttf",
            "simhei.ttf",  # SimHei
            "msyh.ttc",  # Microsoft YaHei
            "C:\\Windows\\Fonts\\arial.ttf",
            "C:\\Windows\\Fonts\\simhei.ttf"
        ]
        font = None
        for font_path in fonts_to_try:
            try:
                font = ImageFont.truetype(font_path, 24)
                print(f"✅ 字体找到: {font_path}")
                break
            except OSError:
                # truetype() raises OSError when the font file cannot be opened
                continue
        if not font:
            print("⚠️ 未找到合适字体,使用默认字体")
            font = ImageFont.load_default()
        # Draw clear Chinese and English text
        text_lines = [
            "OCR测试文字",
            "Hello World",
            "1234567890",
            "ABCDEFGHIJKLMN"
        ]
        y_position = 30
        for line in text_lines:
            d.text((50, y_position), line, fill="black", font=font)
            y_position += 40
        # Save the image
        test_image_path = os.path.join(tempfile.gettempdir(), "ocr_test_image.png")
        img.save(test_image_path, "PNG")
        print(f"✅ 测试图片已创建: {test_image_path}")
        print(f" 图片大小: {os.path.getsize(test_image_path)} 字节")
        return test_image_path
    except Exception as e:
        print(f"❌ 创建测试图片失败: {e}")
        return None
def test_ocr_functionality(image_path):
    """Test OCR recognition."""
    print("\n🔤 测试OCR识别功能...")
    if not image_path or not os.path.exists(image_path):
        print("❌ 测试图片不存在")
        return
    try:
        import pytesseract
        from PIL import Image
        # Set the Tesseract path if one was found
        tesseract_path = check_tesseract_installation()
        if tesseract_path:
            pytesseract.pytesseract.tesseract_cmd = tesseract_path
        # Open and inspect the image
        image = Image.open(image_path)
        print(f"✅ 图片格式: {image.format}, 大小: {image.size}")
        # Run OCR with different languages
        languages = ['eng', 'chi_sim', 'eng+chi_sim']
        for lang in languages:
            try:
                print(f"\n 测试语言: {lang}")
                text = pytesseract.image_to_string(image, lang=lang)
                if text.strip():
                    print(f" ✅ 识别成功:")
                    print(f" {text.strip()}")
                else:
                    print(f" ⚠️ 识别无结果")
            except Exception as e:
                print(f" ❌ 语言 {lang} 识别失败: {e}")
        # Inspect the image data
        print(f"\n📊 图片数据检查:")
        print(f" 模式: {image.mode}")
        print(f" 通道: {'RGB' if image.mode == 'RGB' else image.mode}")
        # Check that the file is readable. verify() must run on a freshly
        # opened file, so reopen it rather than reuse the image loaded for OCR.
        try:
            with Image.open(image_path) as fresh_image:
                fresh_image.verify()
            print(" ✅ 图片验证通过")
        except Exception as e:
            print(f" ❌ 图片验证失败: {e}")
    except Exception as e:
        print(f"❌ OCR测试失败: {e}")
def check_system_environment():
"""检查系统环境"""
print("\n💻 检查系统环境...")
print(f" 操作系统: {os.name}")
print(f" Python版本: {sys.version}")
print(f" 当前目录: {os.getcwd()}")
print(f" TMP目录: {tempfile.gettempdir()}")
def main():
"""主诊断函数"""
print("=" * 60)
print("OCR功能诊断工具")
print("=" * 60)
# 检查系统环境
check_system_environment()
# 检查依赖
check_python_dependencies()
# 检查Tesseract安装
tesseract_path = check_tesseract_installation()
# 创建测试图片
test_image_path = create_test_image()
# 测试OCR功能
if test_image_path:
test_ocr_functionality(test_image_path)
# 提供解决方案
print("\n" + "=" * 60)
print("💡 解决方案建议")
print("=" * 60)
if not tesseract_path:
print("""
🔧 Tesseract OCR未安装请按以下步骤安装
1. 下载Tesseract OCR:
- 官方地址: https://github.com/UB-Mannheim/tesseract/wiki
- 选择Windows版本下载
2. 安装步骤:
- 运行安装程序
- 安装到默认路径: C:\\Program Files\\Tesseract-OCR\\
- 安装时勾选"Add to PATH"选项
- 安装中文语言包可选但推荐
3. 验证安装:
- 重新启动命令行
- 运行: tesseract --version
- 应该显示版本信息
""")
else:
print("""
Tesseract已安装问题可能在于
1. 图片格式问题
- 确保上传的图片格式正确PNG, JPG等
- 图片包含清晰可读的文字
2. 语言包问题
- 确保安装了中文语言包chi_sim
- 可以尝试只使用英文识别
3. 权限问题
- 确保应用有权限访问临时文件
""")
print("\n🔄 临时解决方案:")
print(" 在应用中暂时禁用OCR功能或使用在线OCR服务")
if __name__ == "__main__":
main()
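`check_tesseract_installation()` above probes fixed Windows install paths and then the system `PATH`, while `config.py` also reads a `TESSERACT_PATH` environment variable. A sketch of a lookup that tries that variable first is shown below; `find_tesseract` and `extra_candidates` are hypothetical names, and the search order is an assumption rather than the project's behaviour.

```python
import os
import shutil
from typing import Optional, Sequence


def find_tesseract(extra_candidates: Sequence[str] = ()) -> Optional[str]:
    """Return the first usable Tesseract executable path, or None.

    Search order (an assumption): TESSERACT_PATH, then the caller's
    candidate paths, then whatever "tesseract" resolves to on PATH.
    """
    candidates = [os.getenv("TESSERACT_PATH", ""), *extra_candidates]
    for candidate in candidates:
        # Skip empty entries and paths that do not point at a file
        if candidate and os.path.isfile(candidate):
            return candidate
    return shutil.which("tesseract")


if __name__ == "__main__":
    found = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
    print(found or "tesseract not found")
```

Putting the environment variable first lets a user with a non-standard install fix detection without editing the script.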

23
pyproject.toml Normal file

@ -0,0 +1,23 @@
[project]
name = "data-extractor-converter"
version = "1.0.0"
description = "数据提取与转换器 - 专为大学生开发的多功能数据处理工具"
requires-python = ">=3.8"
dependencies = [
"flask",
"streamlit>=1.28.0",
"pandas>=2.0.3",
"requests>=2.31.0",
"beautifulsoup4>=4.12.2",
"pymupdf>=1.23.7",
"pytesseract>=0.3.10",
"pillow>=10.0.0",
"openpyxl>=3.1.2",
"sqlalchemy>=2.0.20",
"pymysql>=1.1.0",
"python-dotenv>=1.0.0",
"pyodbc>=4.0.0",
"alibabacloud-ocr-api20210707>=1.0.2",
"alibabacloud-tea-openapi>=0.3.6",
"alibabacloud-tea-util>=0.3.8",
"aiohttp>=3.8.0",
]

64
run.py Normal file

@ -0,0 +1,64 @@
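#!/usr/bin/env python3
"""
Data Extractor and Converter - launch script
A multi-purpose data processing tool built for university students
"""
import os
import sys
def check_dependencies():
    """Check that the required dependencies are installed."""
    try:
        import flask
        import pandas
        import requests
        import fitz  # PyMuPDF
        import pytesseract
        import sqlalchemy
        print("✓ 所有依赖包已安装")
        return True
    except ImportError as e:
        print(f"✗ 缺少依赖包: {e}")
        # The project is managed with uv and pyproject.toml
        print("请运行: uv sync")
        return False
def create_upload_directories():
    """Create the directories the app needs."""
    directories = ['uploads', 'static', 'templates']
    for directory in directories:
        os.makedirs(directory, exist_ok=True)
    print("✓ 目录结构已创建")
def main():
    """Entry point."""
    print("=" * 50)
    print("数据提取与转换器 - 大学生专用工具")
    print("=" * 50)
    # Check dependencies
    if not check_dependencies():
        sys.exit(1)
    # Create directories
    create_upload_directories()
    # The Flask app is defined in app_flask.py. Import it only after the
    # dependency check, so a missing package yields the hint above rather
    # than an import traceback.
    from app_flask import app
    # DEBUG, HOST and PORT are listed in .env.example
    host = os.getenv('HOST', '0.0.0.0')
    port = int(os.getenv('PORT', '5000'))
    debug = os.getenv('DEBUG', 'false').lower() == 'true'
    print("\n启动信息:")
    print(f"- 本地访问: http://localhost:{port}")
    print(f"- 网络访问: http://{host}:{port}")
    print("- 停止服务: Ctrl+C")
    print("\n" + "=" * 50)
    # Start the Flask app
    try:
        app.run(debug=debug, host=host, port=port)
    except KeyboardInterrupt:
        print("\n\n服务已停止")
    except Exception as e:
        print(f"\n\n启动失败: {e}")
if __name__ == '__main__':
    main()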
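`check_dependencies()` above stops at the first import that fails, so only one missing package is reported per run. A sketch that lists every missing module without importing any of them, using the standard library's `importlib.util.find_spec`, could look like this; `missing_modules` is a hypothetical helper and the module list simply mirrors the imports in `check_dependencies()`.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(module_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be located.

    find_spec() returns None for a top-level module that is not installed,
    and it does not execute the module's code. Hypothetical helper.
    """
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    required = ["flask", "pandas", "requests", "fitz", "pytesseract", "sqlalchemy"]
    missing = missing_modules(required)
    if missing:
        print("missing modules: " + ", ".join(missing))
    else:
        print("all required modules were found")
```

Note that the names checked are import names (for example `fitz` for PyMuPDF), which can differ from the package names listed in `pyproject.toml`.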

416
static/script.js Normal file

@ -0,0 +1,416 @@
// 全局变量
let currentFile = null;
// 标签页切换功能
function openTab(tabName) {
// 隐藏所有标签页内容
const tabContents = document.getElementsByClassName('tab-content');
for (let i = 0; i < tabContents.length; i++) {
tabContents[i].classList.remove('active');
}
// 移除所有标签按钮的激活状态
const tabButtons = document.getElementsByClassName('tab-button');
for (let i = 0; i < tabButtons.length; i++) {
tabButtons[i].classList.remove('active');
}
// 显示选中的标签页内容
document.getElementById(tabName).classList.add('active');
// 激活对应的标签按钮
event.currentTarget.classList.add('active');
// 清空当前文件
currentFile = null;
clearResults();
}
// 文件上传处理
function setupFileUpload(inputId, uploadAreaId) {
const fileInput = document.getElementById(inputId);
const uploadArea = document.getElementById(uploadAreaId);
fileInput.addEventListener('change', function(e) {
if (this.files.length > 0) {
handleFileUpload(this.files[0], uploadArea);
}
});
// 拖拽上传功能
uploadArea.addEventListener('dragover', function(e) {
e.preventDefault();
this.style.borderColor = '#2980b9';
this.style.background = '#e9ecef';
});
uploadArea.addEventListener('dragleave', function(e) {
e.preventDefault();
this.style.borderColor = '#3498db';
this.style.background = '#f8f9fa';
});
uploadArea.addEventListener('drop', function(e) {
e.preventDefault();
this.style.borderColor = '#3498db';
this.style.background = '#f8f9fa';
if (e.dataTransfer.files.length > 0) {
handleFileUpload(e.dataTransfer.files[0], uploadArea);
}
});
}
// 处理文件上传
async function handleFileUpload(file, uploadArea) {
const formData = new FormData();
formData.append('file', file);
showStatus('正在上传文件...', 'info');
try {
const response = await fetch('/upload', {
method: 'POST',
body: formData
});
const result = await response.json();
if (result.success) {
currentFile = result;
uploadArea.innerHTML = `
<div style="text-align: center;">
<p style="color: #27ae60; font-weight: bold;"> 文件上传成功</p>
<p>文件名: ${result.filename}</p>
<p>文件类型: ${result.file_type}</p>
<button onclick="clearFile('${uploadArea.id}')" class="btn" style="background: #e74c3c; color: white; margin-top: 10px;">重新选择</button>
</div>
`;
showStatus('文件上传成功!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('上传失败: ' + error.message, 'error');
// fileInput is local to setupFileUpload() and is not defined here, so the
// old template threw a ReferenceError. Rebuild the upload area, including
// its <input>, through clearFile() instead.
clearFile(uploadArea.id);
}
}
// 清空文件选择
function clearFile(uploadAreaId) {
const uploadArea = document.getElementById(uploadAreaId);
const fileInputId = uploadAreaId.replace('-upload-area', '-file');
uploadArea.innerHTML = `
<input type="file" id="${fileInputId}" style="display: none;">
<div class="upload-placeholder" onclick="document.getElementById('${fileInputId}').click()">
<p>点击选择文件或拖拽文件到此处</p>
<p class="file-types">支持格式: 根据标签页不同</p>
</div>
`;
currentFile = null;
clearResults();
setupFileUpload(fileInputId, uploadAreaId);
}
// PDF处理功能
async function processPdf(action) {
if (!currentFile) {
showStatus('请先选择PDF文件', 'error');
return;
}
showStatus('正在处理PDF文件...', 'info');
try {
const response = await fetch('/process/pdf', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
filepath: currentFile.filepath,
action: action
})
});
const result = await response.json();
if (result.success) {
if (action === 'extract') {
document.getElementById('pdf-result').innerHTML = `
<h4>提取的文本内容:</h4>
<div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;">
${result.text || '未提取到文本内容'}
</div>
`;
} else if (action === 'to_excel') {
document.getElementById('pdf-result').innerHTML = `
<h4>转换成功!</h4>
<p>PDF文件已成功转换为Excel格式</p>
<a href="${result.download_url}" class="download-link" download>下载Excel文件</a>
`;
}
showStatus('PDF处理完成', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('处理失败: ' + error.message, 'error');
}
}
// 图片处理功能
async function processImage(action) {
if (!currentFile) {
showStatus('请先选择图片文件', 'error');
return;
}
showStatus('正在处理图片文件...', 'info');
try {
const response = await fetch('/process/image', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
filepath: currentFile.filepath,
action: action
})
});
const result = await response.json();
if (result.success) {
if (action === 'extract') {
document.getElementById('image-result').innerHTML = `
<h4>识别的文字内容:</h4>
<div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;">
${result.text || '未识别到文字内容'}
</div>
`;
} else {
const formatName = action === 'to_excel' ? 'Excel' : '文本';
document.getElementById('image-result').innerHTML = `
<h4>转换成功!</h4>
<p>图片文件已成功转换为${formatName}格式</p>
<a href="${result.download_url}" class="download-link" download>下载${formatName}文件</a>
`;
}
showStatus('图片处理完成!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('处理失败: ' + error.message, 'error');
}
}
// 格式转换功能
async function processFormat() {
if (!currentFile) {
showStatus('请先选择文件', 'error');
return;
}
const targetFormat = document.getElementById('target-format').value;
showStatus('正在转换文件格式...', 'info');
try {
const response = await fetch('/process/format', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
filepath: currentFile.filepath,
target_format: targetFormat
})
});
const result = await response.json();
if (result.success) {
document.getElementById('format-result').innerHTML = `
<h4>转换成功!</h4>
<p>文件已成功转换为${targetFormat.toUpperCase()}格式</p>
<a href="${result.download_url}" class="download-link" download>下载文件</a>
`;
showStatus('格式转换完成!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('转换失败: ' + error.message, 'error');
}
}
// 网页抓取功能
async function processWeb() {
const url = document.getElementById('web-url').value;
const selector = document.getElementById('css-selector').value;
if (!url) {
showStatus('请输入网页URL', 'error');
return;
}
showStatus('正在抓取网页内容...', 'info');
try {
const response = await fetch('/process/web', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
url: url,
selector: selector
})
});
const result = await response.json();
if (result.success) {
document.getElementById('web-result').innerHTML = `
<h4>抓取结果:</h4>
<div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;">
${result.content || '未抓取到内容'}
</div>
`;
showStatus('网页抓取完成!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('抓取失败: ' + error.message, 'error');
}
}
// 网页抓取并导出为Excel
async function processWebToExcel() {
const url = document.getElementById('web-url').value;
const selector = document.getElementById('css-selector').value;
if (!url) {
showStatus('请输入网页URL', 'error');
return;
}
showStatus('正在抓取网页并导出为Excel...', 'info');
try {
const response = await fetch('/process/web', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
url: url,
selector: selector
})
});
const result = await response.json();
if (result.success) {
document.getElementById('web-result').innerHTML = `
<h4>导出成功!</h4>
<p>网页内容已成功导出为Excel格式</p>
<a href="${result.download_url}" class="download-link" download>下载Excel文件</a>
`;
showStatus('网页导出完成!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('导出失败: ' + error.message, 'error');
}
}
// 数据库导出功能
async function processDatabase() {
if (!currentFile) {
showStatus('请先选择数据库文件', 'error');
return;
}
const targetFormat = document.getElementById('db-target-format').value;
const tableName = document.getElementById('table-name').value;
showStatus('正在导出数据库...', 'info');
try {
const response = await fetch('/process/database', {
method: 'POST',
headers: {
'Content-Type': 'application/json'
},
body: JSON.stringify({
filepath: currentFile.filepath,
target_format: targetFormat,
table_name: tableName
})
});
const result = await response.json();
if (result.success) {
document.getElementById('database-result').innerHTML = `
<h4>导出成功!</h4>
<p>数据库已成功导出为${targetFormat.toUpperCase()}格式</p>
<a href="${result.download_url}" class="download-link" download>下载文件</a>
`;
showStatus('数据库导出完成!', 'success');
} else {
throw new Error(result.error);
}
} catch (error) {
showStatus('导出失败: ' + error.message, 'error');
}
}
// 显示状态消息
function showStatus(message, type) {
const statusEl = document.getElementById('status-message');
statusEl.textContent = message;
statusEl.className = `status-message status-${type}`;
statusEl.style.display = 'block';
setTimeout(() => {
statusEl.style.display = 'none';
}, 5000);
}
// 清空结果区域
function clearResults() {
const resultAreas = document.getElementsByClassName('result-area');
for (let i = 0; i < resultAreas.length; i++) {
resultAreas[i].innerHTML = '';
}
}
// 初始化页面
document.addEventListener('DOMContentLoaded', function() {
// 设置文件上传功能
setupFileUpload('pdf-file', 'pdf-upload-area');
setupFileUpload('image-file', 'image-upload-area');
setupFileUpload('format-file', 'format-upload-area');
setupFileUpload('db-file', 'db-upload-area');
// 设置输入框回车事件
document.getElementById('web-url').addEventListener('keypress', function(e) {
if (e.key === 'Enter') {
processWeb();
}
});
});

265
static/style.css Normal file

@ -0,0 +1,265 @@
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
min-height: 100vh;
padding: 20px;
}
.container {
max-width: 1200px;
margin: 0 auto;
background: white;
border-radius: 15px;
box-shadow: 0 20px 40px rgba(0,0,0,0.1);
overflow: hidden;
}
header {
background: linear-gradient(135deg, #2c3e50, #3498db);
color: white;
padding: 40px;
text-align: center;
}
header h1 {
font-size: 2.5em;
margin-bottom: 10px;
}
.subtitle {
font-size: 1.2em;
opacity: 0.9;
}
.tabs {
display: flex;
background: #f8f9fa;
border-bottom: 1px solid #dee2e6;
}
.tab-button {
flex: 1;
padding: 15px 20px;
border: none;
background: transparent;
cursor: pointer;
font-size: 16px;
font-weight: 500;
transition: all 0.3s ease;
border-bottom: 3px solid transparent;
}
.tab-button:hover {
background: #e9ecef;
}
.tab-button.active {
background: white;
border-bottom-color: #3498db;
color: #3498db;
}
.tab-content {
display: none;
padding: 30px;
}
.tab-content.active {
display: block;
}
.tab-content h2 {
color: #2c3e50;
margin-bottom: 20px;
font-size: 1.8em;
}
.upload-area {
border: 2px dashed #3498db;
border-radius: 10px;
padding: 40px;
text-align: center;
margin-bottom: 20px;
transition: all 0.3s ease;
background: #f8f9fa;
}
.upload-area:hover {
border-color: #2980b9;
background: #e9ecef;
}
.upload-placeholder {
cursor: pointer;
}
.upload-placeholder p {
font-size: 18px;
color: #6c757d;
margin-bottom: 10px;
}
.file-types {
font-size: 14px !important;
color: #adb5bd !important;
}
.input-group {
margin-bottom: 20px;
}
.input-group label {
display: block;
margin-bottom: 5px;
font-weight: 500;
color: #495057;
}
.input-group input, .input-group select {
width: 100%;
padding: 10px;
border: 1px solid #ced4da;
border-radius: 5px;
font-size: 16px;
}
.input-group small {
color: #6c757d;
font-size: 12px;
}
.action-buttons {
display: flex;
gap: 10px;
margin-bottom: 20px;
flex-wrap: wrap;
}
.conversion-options {
display: flex;
align-items: center;
gap: 10px;
margin-bottom: 20px;
flex-wrap: wrap;
}
.btn {
padding: 12px 24px;
border: none;
border-radius: 5px;
cursor: pointer;
font-size: 16px;
font-weight: 500;
transition: all 0.3s ease;
text-decoration: none;
display: inline-block;
}
.btn-primary {
background: #3498db;
color: white;
}
.btn-primary:hover {
background: #2980b9;
}
.btn-success {
background: #27ae60;
color: white;
}
.btn-success:hover {
background: #219a52;
}
.btn-info {
background: #17a2b8;
color: white;
}
.btn-info:hover {
background: #138496;
}
.result-area {
background: #f8f9fa;
border: 1px solid #dee2e6;
border-radius: 5px;
padding: 20px;
min-height: 100px;
max-height: 400px;
overflow-y: auto;
white-space: pre-wrap;
font-family: 'Courier New', monospace;
}
.status-message {
position: fixed;
top: 20px;
right: 20px;
padding: 15px 20px;
border-radius: 5px;
color: white;
font-weight: 500;
z-index: 1000;
display: none;
}
.status-success {
background: #27ae60;
}
.status-error {
background: #e74c3c;
}
.status-info {
background: #3498db;
}
.download-link {
display: inline-block;
margin-top: 10px;
padding: 10px 15px;
background: #27ae60;
color: white;
text-decoration: none;
border-radius: 5px;
transition: background 0.3s ease;
}
.download-link:hover {
background: #219a52;
}
@media (max-width: 768px) {
.container {
margin: 10px;
border-radius: 10px;
}
.tabs {
flex-direction: column;
}
.tab-button {
border-bottom: 1px solid #dee2e6;
border-right: none;
}
.action-buttons {
flex-direction: column;
}
.conversion-options {
flex-direction: column;
align-items: stretch;
}
}

132
templates/index.html Normal file

@ -0,0 +1,132 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>数据提取与转换器 - 大学生专用工具</title>
<link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
</head>
<body>
<div class="container">
<header>
<h1>数据提取与转换器</h1>
<p class="subtitle">专为大学生开发的多功能数据处理工具</p>
</header>
<div class="tabs">
<button class="tab-button active" onclick="openTab('pdf-tab')">PDF处理</button>
<button class="tab-button" onclick="openTab('image-tab')">图片OCR</button>
<button class="tab-button" onclick="openTab('format-tab')">格式转换</button>
<button class="tab-button" onclick="openTab('web-tab')">网页抓取</button>
<button class="tab-button" onclick="openTab('database-tab')">数据库导出</button>
</div>
<!-- PDF处理标签页 -->
<div id="pdf-tab" class="tab-content active">
<h2>PDF文本/表格提取</h2>
<div class="upload-area" id="pdf-upload-area">
<input type="file" id="pdf-file" accept=".pdf" style="display: none;">
<div class="upload-placeholder" onclick="document.getElementById('pdf-file').click()">
<p>点击选择PDF文件或拖拽文件到此处</p>
<p class="file-types">支持格式: .pdf</p>
</div>
</div>
<div class="action-buttons">
<button onclick="processPdf('extract')" class="btn btn-primary">提取文本</button>
<button onclick="processPdf('to_excel')" class="btn btn-success">导出为Excel</button>
</div>
<div id="pdf-result" class="result-area"></div>
</div>
<!-- 图片OCR标签页 -->
<div id="image-tab" class="tab-content">
<h2>图片文字识别 (OCR)</h2>
<div class="upload-area" id="image-upload-area">
<input type="file" id="image-file" accept="image/*" style="display: none;">
<div class="upload-placeholder" onclick="document.getElementById('image-file').click()">
<p>点击选择图片文件或拖拽文件到此处</p>
<p class="file-types">支持格式: .jpg, .jpeg, .png, .gif, .bmp</p>
</div>
</div>
<div class="action-buttons">
<button onclick="processImage('extract')" class="btn btn-primary">识别文字</button>
<button onclick="processImage('to_excel')" class="btn btn-success">导出为Excel</button>
<button onclick="processImage('to_text')" class="btn btn-info">导出为文本</button>
</div>
<div id="image-result" class="result-area"></div>
</div>
<!-- 格式转换标签页 -->
<div id="format-tab" class="tab-content">
<h2>文件格式转换</h2>
<div class="upload-area" id="format-upload-area">
<input type="file" id="format-file" accept=".xlsx,.xls,.csv,.json" style="display: none;">
<div class="upload-placeholder" onclick="document.getElementById('format-file').click()">
<p>点击选择文件或拖拽文件到此处</p>
<p class="file-types">支持格式: .xlsx, .xls, .csv, .json</p>
</div>
</div>
<div class="conversion-options">
<label for="target-format">转换为:</label>
<select id="target-format">
<option value="excel">Excel (.xlsx)</option>
<option value="csv">CSV (.csv)</option>
<option value="json">JSON (.json)</option>
</select>
<button onclick="processFormat()" class="btn btn-success">开始转换</button>
</div>
<div id="format-result" class="result-area"></div>
</div>
<!-- 网页抓取标签页 -->
<div id="web-tab" class="tab-content">
<h2>网页数据抓取</h2>
<div class="input-group">
<label for="web-url">网页URL:</label>
<input type="url" id="web-url" placeholder="https://example.com">
</div>
<div class="input-group">
<label for="css-selector">CSS选择器 (可选):</label>
<input type="text" id="css-selector" placeholder="例如: .content, #main, p">
<small>留空则抓取整个页面文本</small>
</div>
<div class="action-buttons">
<button onclick="processWeb()" class="btn btn-primary">抓取内容</button>
<button onclick="processWebToExcel()" class="btn btn-success">导出为Excel</button>
</div>
<div id="web-result" class="result-area"></div>
</div>
<!-- 数据库导出标签页 -->
<div id="database-tab" class="tab-content">
<h2>数据库导出</h2>
<div class="upload-area" id="db-upload-area">
<input type="file" id="db-file" accept=".db,.sqlite" style="display: none;">
<div class="upload-placeholder" onclick="document.getElementById('db-file').click()">
<p>点击选择数据库文件或拖拽文件到此处</p>
<p class="file-types">支持格式: .db, .sqlite</p>
</div>
</div>
<div class="input-group">
<label for="table-name">表名 (可选):</label>
<input type="text" id="table-name" placeholder="留空则导出所有表">
</div>
<div class="conversion-options">
<label for="db-target-format">导出为:</label>
<select id="db-target-format">
<option value="excel">Excel (.xlsx)</option>
<option value="csv">CSV (.csv)</option>
<option value="json">JSON (.json)</option>
</select>
<button onclick="processDatabase()" class="btn btn-success">开始导出</button>
</div>
<div id="database-result" class="result-area"></div>
</div>
<!-- 全局状态显示 -->
<div id="status-message" class="status-message"></div>
</div>
<script src="{{ url_for('static', filename='script.js') }}"></script>
</body>
</html>
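
Reviewer note: the `accept` attributes in this template only filter the browser's file picker, so they are a hint and do not stop a client from posting other file types. A minimal server-side check could look like the sketch below. The tab keys, `ALLOWED_EXTENSIONS`, and `is_allowed_upload` are hypothetical names chosen for illustration, and the extension sets are copied from the "支持格式" hints in the template; the commit's actual upload handling is not shown in this section.

```python
from pathlib import PurePath

# Hypothetical allow-list mirroring the formats advertised in templates/index.html.
ALLOWED_EXTENSIONS = {
    "pdf": {".pdf"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".bmp"},
    "format": {".xlsx", ".xls", ".csv", ".json"},
    "database": {".db", ".sqlite"},
}


def is_allowed_upload(tab: str, filename: str) -> bool:
    """Return True if the file's final extension is permitted for the given tab."""
    # Only the last suffix counts, so "report.pdf.exe" is treated as ".exe".
    suffix = PurePath(filename).suffix.lower()
    return suffix in ALLOWED_EXTENSIONS.get(tab, set())
```

A route handler would call `is_allowed_upload("pdf", uploaded_name)` before saving the file and reject the request otherwise. Checking the extension is only a first filter; it says nothing about the file's real content.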

BIN
test_cases/cat_coffee.png Normal file

Binary file not shown (size: 1.9 MiB).

6
test_cases/test_data.csv Normal file

@ -0,0 +1,6 @@
姓名,年龄,城市,专业,成绩
张三,20,北京,计算机科学,85
李四,21,上海,数据科学,92
王五,19,广州,人工智能,78
赵六,22,深圳,软件工程,88
钱七,20,杭州,网络安全,95

37
test_cases/test_data.json Normal file

@ -0,0 +1,37 @@
[
{
"姓名": "张三",
"年龄": 20,
"城市": "北京",
"专业": "计算机科学",
"成绩": 85
},
{
"姓名": "李四",
"年龄": 21,
"城市": "上海",
"专业": "数据科学",
"成绩": 92
},
{
"姓名": "王五",
"年龄": 19,
"城市": "广州",
"专业": "人工智能",
"成绩": 78
},
{
"姓名": "赵六",
"年龄": 22,
"城市": "深圳",
"专业": "软件工程",
"成绩": 88
},
{
"姓名": "钱七",
"年龄": 20,
"城市": "杭州",
"专业": "网络安全",
"成绩": 95
}
]
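
Reviewer note: `test_data.csv` and `test_data.json` carry the same five records, except that CSV values are read back as strings while the JSON file stores `年龄` and `成绩` as integers. The sketch below shows one way to check that the two fixtures agree. It uses inline copies of the first two records so it runs without the files, and `INT_FIELDS` and `load_csv_records` are hypothetical names, not part of the commit.

```python
import csv
import io
import json

# Inline copies of the first two fixture records (assumption: kept in sync by hand).
CSV_TEXT = (
    "姓名,年龄,城市,专业,成绩\n"
    "张三,20,北京,计算机科学,85\n"
    "李四,21,上海,数据科学,92\n"
)
JSON_TEXT = """
[
  {"姓名": "张三", "年龄": 20, "城市": "北京", "专业": "计算机科学", "成绩": 85},
  {"姓名": "李四", "年龄": 21, "城市": "上海", "专业": "数据科学", "成绩": 92}
]
"""

# Columns that the JSON fixture stores as integers.
INT_FIELDS = ("年龄", "成绩")


def load_csv_records(text):
    """Parse CSV text into dicts, converting the numeric columns to int."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = dict(row)
        for field in INT_FIELDS:
            record[field] = int(record[field])
        records.append(record)
    return records


csv_records = load_csv_records(CSV_TEXT)
json_records = json.loads(JSON_TEXT)
fixtures_match = csv_records == json_records
```

With the real files, `CSV_TEXT` and `JSON_TEXT` would be read from `test_cases/` using `encoding="utf-8"`.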

192
test_functionality.py Normal file

@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
数据提取与转换器 (Data Extractor and Converter): functional test script.
Verifies that each feature of the application works correctly.
"""
import os
import sys
import tempfile
from pathlib import Path
# 添加项目路径到Python路径
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# 导入工具模块
try:
from utils.pdf_extractor import extract_text_from_pdf
from utils.ocr_processor import extract_text_from_image
from utils.format_converter import excel_to_csv, csv_to_excel, json_to_excel
from utils.web_scraper import scrape_webpage
from utils.database_exporter import export_sqlite_to_excel
print("✅ 所有工具模块导入成功")
except ImportError as e:
print(f"❌ 模块导入失败: {e}")
sys.exit(1)
def test_format_conversion():
"""测试格式转换功能"""
print("\n📊 测试格式转换功能...")
# 测试数据
test_data = [
{"姓名": "张三", "年龄": 20, "城市": "北京"},
{"姓名": "李四", "年龄": 21, "城市": "上海"},
{"姓名": "王五", "年龄": 19, "城市": "广州"}
]
try:
# 创建临时文件
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False, mode='w', encoding='utf-8') as f:
f.write("姓名,年龄,城市\n")
for item in test_data:
f.write(f"{item['姓名']},{item['年龄']},{item['城市']}\n")
csv_path = f.name
# CSV转Excel
excel_path = csv_path.replace('.csv', '.xlsx')
csv_to_excel(csv_path, excel_path)
if os.path.exists(excel_path):
print("✅ CSV转Excel功能正常")
os.unlink(excel_path)
else:
print("❌ CSV转Excel功能失败")
os.unlink(csv_path)
except Exception as e:
print(f"❌ 格式转换测试失败: {e}")
def test_web_scraping():
"""测试网页抓取功能"""
print("\n🌐 测试网页抓取功能...")
try:
# 测试抓取百度首页标题
content = scrape_webpage("https://www.baidu.com")
if content and len(content) > 0:
print("✅ 网页抓取功能正常")
print(f" 抓取内容长度: {len(content)} 字符")
else:
print("❌ 网页抓取功能失败")
except Exception as e:
print(f"❌ 网页抓取测试失败: {e}")
def test_ocr_functionality():
"""测试OCR功能"""
print("\n🖼️ 测试OCR功能...")
try:
# 创建一个简单的测试图片(包含文字)
from PIL import Image, ImageDraw, ImageFont
# 创建图片
img = Image.new('RGB', (400, 200), color='white')
d = ImageDraw.Draw(img)
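        # Try common system fonts, then fall back to PIL's built-in font.
        # The fallback font may lack CJK glyphs, which can leave the OCR result empty.
        try:
            font = ImageFont.truetype("arial.ttf", 24)
        except OSError:
            try:
                font = ImageFont.truetype("Arial.ttf", 24)
            except OSError:
                font = ImageFont.load_default()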
# 添加文字
d.text((50, 80), "测试文字: Hello World 你好世界", fill="black", font=font)
# 保存图片
img_path = os.path.join(tempfile.gettempdir(), "test_ocr.png")
img.save(img_path)
# 测试OCR识别
text = extract_text_from_image(img_path)
if text:
print("✅ OCR功能正常")
print(f" 识别结果: {text}")
else:
            print("⚠️ OCR识别无结果，可能是字体问题")
os.unlink(img_path)
except Exception as e:
print(f"❌ OCR测试失败: {e}")
def test_database_functionality():
"""测试数据库功能"""
print("\n🗄️ 测试数据库功能...")
try:
import sqlite3
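        # Create the test database. Remove any stale copy left by an earlier failed run;
        # otherwise the fixed primary keys inserted below raise sqlite3.IntegrityError.
        db_path = os.path.join(tempfile.gettempdir(), "test.db")
        if os.path.exists(db_path):
            os.unlink(db_path)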
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
# 创建测试表
cursor.execute("""
CREATE TABLE IF NOT EXISTS students (
id INTEGER PRIMARY KEY,
name TEXT NOT NULL,
age INTEGER,
major TEXT
)
""")
# 插入测试数据
test_data = [
(1, "张三", 20, "计算机科学"),
(2, "李四", 21, "数据科学"),
(3, "王五", 19, "人工智能")
]
cursor.executemany("INSERT INTO students VALUES (?, ?, ?, ?)", test_data)
conn.commit()
conn.close()
# 测试数据库导出
excel_path = db_path.replace('.db', '.xlsx')
export_sqlite_to_excel(db_path, excel_path)
if os.path.exists(excel_path):
print("✅ 数据库导出功能正常")
os.unlink(excel_path)
else:
print("❌ 数据库导出功能失败")
os.unlink(db_path)
except Exception as e:
print(f"❌ 数据库功能测试失败: {e}")
def main():
"""主测试函数"""
print("=" * 50)
print("数据提取与转换器 - 功能测试")
print("=" * 50)
# 测试各项功能
test_format_conversion()
test_web_scraping()
test_ocr_functionality()
test_database_functionality()
print("\n" + "=" * 50)
print("测试完成!")
print("=" * 50)
# 显示应用访问信息
print("\n🌐 应用访问信息:")
print("本地访问: http://localhost:8502")
print("网络访问: http://192.168.10.21:8502")
print("\n💡 测试建议:")
print("1. 访问应用界面测试文件上传功能")
print("2. 使用test_cases目录下的测试文件")
print("3. 测试网页抓取功能输入百度等网站URL")
if __name__ == "__main__":
main()
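
Reviewer note: the tests above create files with fixed names in the system temp directory and delete them by hand, so a failure partway through can leave `test.db` or the CSV behind. A hedged alternative is to build fixtures inside `tempfile.TemporaryDirectory`, which removes everything on exit even when an exception is raised. The sketch below uses only the standard library; `build_student_db` and `count_students_in_fresh_db` are hypothetical helpers, and the project's `export_sqlite_to_excel` is deliberately not called here.

```python
import os
import sqlite3
import tempfile


def build_student_db(db_path):
    """Create the students table used by the test and insert three rows."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE students ("
            "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER, major TEXT)"
        )
        conn.executemany(
            "INSERT INTO students VALUES (?, ?, ?, ?)",
            [
                (1, "张三", 20, "计算机科学"),
                (2, "李四", 21, "数据科学"),
                (3, "王五", 19, "人工智能"),
            ],
        )
        conn.commit()
    finally:
        conn.close()


def count_students_in_fresh_db():
    """Build the fixture in a throwaway directory and return (row count, file still exists)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        db_path = os.path.join(tmp_dir, "test.db")
        build_student_db(db_path)
        conn = sqlite3.connect(db_path)
        try:
            (count,) = conn.execute("SELECT COUNT(*) FROM students").fetchone()
        finally:
            conn.close()
    # The directory and the database file are gone once the with-block exits.
    return count, os.path.exists(db_path)
```

Because every run starts from an empty directory, repeated runs cannot collide on the fixed primary keys.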

213
test_mdf_functionality.py Normal file

@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
MDF file export test script.
Tests exporting SQL Server database files.
"""
import os
import sys
import tempfile
from pathlib import Path
# 添加项目路径到Python路径
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
def check_sql_server_connection():
"""检查SQL Server连接"""
print("🔍 检查SQL Server连接...")
try:
import pyodbc
# 测试连接参数
test_servers = [
('localhost', 'MSSQLSERVER'),
('.', 'MSSQLSERVER'),
            # Host and instance are kept separate; the loop below joins them as host\instance.
            ('localhost', 'SQLEXPRESS')
]
connected = False
for server, instance in test_servers:
try:
if instance == 'MSSQLSERVER':
conn_str = f"DRIVER={{SQL Server}};SERVER={server};Trusted_Connection=yes;"
else:
conn_str = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};Trusted_Connection=yes;"
conn = pyodbc.connect(conn_str, timeout=5)
cursor = conn.cursor()
cursor.execute("SELECT @@version")
version = cursor.fetchone()[0]
print(f"✅ 连接到 {server}\\{instance}")
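                # Take the first line outside the f-string: a backslash inside an
                # f-string expression is a SyntaxError before Python 3.12.
                first_line = version.splitlines()[0]
                print(f" SQL Server版本: {first_line}")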
connected = True
conn.close()
break
except Exception as e:
print(f"❌ 无法连接到 {server}\\{instance}: {e}")
if not connected:
print("⚠️ 未找到可用的SQL Server实例")
print(" 请安装SQL Server或检查服务状态")
return connected
except ImportError:
print("❌ pyodbc未安装")
return False
def test_mdf_export_module():
"""测试MDF导出模块"""
print("\n🧪 测试MDF导出模块...")
try:
from utils.database_exporter import (
export_mssql_mdf_to_excel,
export_mssql_mdf_to_csv,
export_mssql_mdf_to_json
)
print("✅ MDF导出模块导入成功")
# 检查函数是否存在
functions = [
export_mssql_mdf_to_excel,
export_mssql_mdf_to_csv,
export_mssql_mdf_to_json
]
for func in functions:
print(f"{func.__name__} 函数可用")
return True
except Exception as e:
print(f"❌ MDF导出模块测试失败: {e}")
return False
def create_sample_mdf_info():
"""创建示例MDF文件信息"""
print("\n📋 示例MDF文件信息:")
sample_info = """
💡 要测试MDF文件导出功能您需要
1. **现有的.mdf文件**
- 从现有SQL Server数据库分离的.mdf文件
- 或使用SQL Server创建测试数据库
2. **SQL Server实例**
- 本地安装的SQL Server
- 或可访问的远程SQL Server
3. **连接权限**
- 数据库读取权限
- 附加数据库权限
🔧 创建测试MDF文件的步骤
1. 在SQL Server Management Studio中
```sql
-- 创建测试数据库
CREATE DATABASE TestMDFExport;
GO
-- 创建测试表
USE TestMDFExport;
CREATE TABLE Students (
ID INT PRIMARY KEY,
Name NVARCHAR(50),
Age INT,
Major NVARCHAR(50)
);
-- 插入测试数据
INSERT INTO Students VALUES
(1, '张三', 20, '计算机科学'),
(2, '李四', 21, '数据科学'),
(3, '王五', 19, '人工智能');
```
2. 分离数据库获取.mdf文件
```sql
-- 分离数据库
USE master;
GO
EXEC sp_detach_db 'TestMDFExport', 'true';
```
3. 数据库文件位置
- 默认路径: C:\\Program Files\\Microsoft SQL Server\\...\\DATA\\
- 文件: TestMDFExport.mdf TestMDFExport_log.ldf
"""
print(sample_info)
def check_odbc_drivers():
"""检查可用的ODBC驱动程序"""
print("\n🔌 检查ODBC驱动程序...")
try:
import pyodbc
drivers = pyodbc.drivers()
if drivers:
print("✅ 找到以下ODBC驱动程序:")
for driver in drivers:
print(f" - {driver}")
# 检查SQL Server相关驱动
sql_drivers = [d for d in drivers if 'SQL Server' in d]
if sql_drivers:
print("\n✅ 找到SQL Server ODBC驱动程序")
else:
print("\n⚠️ 未找到SQL Server ODBC驱动程序")
print(" 请安装ODBC Driver for SQL Server")
else:
print("❌ 未找到ODBC驱动程序")
except Exception as e:
print(f"❌ 检查ODBC驱动程序失败: {e}")
def main():
"""主测试函数"""
print("=" * 60)
print("MDF文件导出功能测试")
print("=" * 60)
# 检查ODBC驱动
check_odbc_drivers()
# 检查SQL Server连接
sql_connected = check_sql_server_connection()
# 测试MDF导出模块
module_ok = test_mdf_export_module()
# 显示示例信息
create_sample_mdf_info()
print("\n" + "=" * 60)
print("测试总结")
print("=" * 60)
if sql_connected and module_ok:
print("✅ MDF导出功能配置正确")
print("💡 您可以上传.mdf文件测试导出功能")
else:
print("⚠️ MDF导出功能需要额外配置")
if not sql_connected:
print(" - 需要安装或配置SQL Server")
if not module_ok:
print(" - 需要检查模块依赖")
print("\n🚀 下一步操作:")
print("1. 确保SQL Server服务运行")
print("2. 准备.mdf测试文件")
print("3. 访问应用测试导出功能")
print("4. 参考SQL_SERVER_SETUP.md获取详细配置说明")
if __name__ == "__main__":
main()
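
Reviewer note: a default SQL Server instance (`MSSQLSERVER`) is addressed by host name alone, while a named instance such as `SQLEXPRESS` is addressed as `host\instance`. The sketch below isolates that rule in one pure function so it can be checked without a database. `build_connection_string` is a hypothetical helper, and the `SQL Server` driver name is simply the one this script already uses; newer installations may expose a differently named driver.

```python
def build_connection_string(server, instance=None, driver="SQL Server"):
    """Build a trusted-connection ODBC string.

    instance=None targets the default instance, which is addressed by host only.
    """
    target = server if instance is None else f"{server}\\{instance}"
    # Doubled braces emit literal { } around the driver name.
    return f"DRIVER={{{driver}}};SERVER={target};Trusted_Connection=yes;"


default_conn = build_connection_string("localhost")
express_conn = build_connection_string("localhost", "SQLEXPRESS")
```

Each candidate string can then be passed to `pyodbc.connect(conn_str, timeout=5)` exactly as the loop in `check_sql_server_connection` does.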

1
utils/__init__.py Normal file

@ -0,0 +1 @@
# 工具模块初始化文件

438
utils/ai_copywriter.py Normal file

@ -0,0 +1,438 @@
#!/usr/bin/env python3
"""
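AI copywriting service integration.
Uses large language models to generate creative captions for photos.
Supports multiple caption styles and purposes.
Supports two model providers: DeepSeek and DashScope.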
"""
import os
import json
import requests
from dotenv import load_dotenv
# 加载环境变量
load_dotenv()
class AICopywriter:
"""AI文案生成服务类"""
def __init__(self, provider='deepseek'):
"""初始化AI文案生成客户端"""
self.provider = provider
if provider == 'deepseek':
self.api_key = os.getenv('DEEPSEEK_API_KEY')
if not self.api_key:
                raise Exception("DeepSeek API密钥未配置，请在.env文件中设置DEEPSEEK_API_KEY")
self.base_url = "https://api.deepseek.com/v1/chat/completions"
elif provider == 'dashscope':
self.api_key = os.getenv('DASHSCOPE_API_KEY')
if not self.api_key:
                raise Exception("DashScope API密钥未配置，请在.env文件中设置DASHSCOPE_API_KEY")
else:
raise Exception(f"不支持的AI提供商: {provider}")
def generate_photo_caption(self, image_description, style='creative', length='medium'):
"""为照片生成文案"""
try:
if self.provider == 'deepseek':
return self._generate_with_deepseek(image_description, style, length)
elif self.provider == 'dashscope':
return self._generate_with_dashscope(image_description, style, length)
else:
raise Exception(f"不支持的AI提供商: {self.provider}")
except Exception as e:
raise Exception(f"AI文案生成失败: {str(e)}")
def _generate_with_deepseek(self, image_description, style, length):
"""使用DeepSeek生成文案"""
try:
prompt = self._build_prompt(image_description, style, length)
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
data = {
'model': 'deepseek-chat',
'messages': [
{
'role': 'system',
'content': '你是一个专业的创意文案创作助手,擅长为照片生成各种风格的创意文案。你具有丰富的文学素养和营销知识,能够根据照片内容创作出富有创意和感染力的文案。'
},
{
'role': 'user',
'content': prompt
}
],
'max_tokens': 500,
'temperature': 0.8,
'top_p': 0.9
}
            # Set a timeout so a stalled connection cannot block the request indefinitely.
            response = requests.post(self.base_url, headers=headers, json=data, timeout=30)
result = response.json()
if 'choices' in result and len(result['choices']) > 0:
caption = result['choices'][0]['message']['content'].strip()
# 清理可能的格式标记
caption = caption.replace('"', '').replace('\n', ' ').strip()
return caption
else:
# 如果API调用失败使用备用文案生成
return self._generate_fallback_caption(image_description, style, length)
except Exception as e:
# API调用失败时使用备用方案
return self._generate_fallback_caption(image_description, style, length)
def _generate_with_dashscope(self, image_description, style, length):
"""使用DashScope生成文案"""
try:
url = "https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation"
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
# 根据风格和长度构建提示词
prompt = self._build_prompt(image_description, style, length)
data = {
'model': 'qwen-turbo',
'input': {
'messages': [
{
'role': 'system',
'content': '你是一个专业的文案创作助手,擅长为照片生成各种风格的创意文案。'
},
{
'role': 'user',
'content': prompt
}
]
},
'parameters': {
'max_tokens': 500,
'temperature': 0.8
}
}
            # Set a timeout so a stalled connection cannot block the request indefinitely.
            response = requests.post(url, headers=headers, json=data, timeout=30)
result = response.json()
if 'output' in result and 'text' in result['output']:
return result['output']['text']
else:
# 如果API调用失败使用备用文案生成
return self._generate_fallback_caption(image_description, style, length)
except Exception as e:
# API调用失败时使用备用方案
return self._generate_fallback_caption(image_description, style, length)
def _build_prompt(self, image_description, style, length):
"""构建AI提示词"""
style_descriptions = {
'creative': '创意文艺风格,富有诗意和想象力',
'professional': '专业正式风格,简洁明了',
'social': '社交媒体风格,活泼有趣,适合朋友圈',
'marketing': '营销推广风格,吸引眼球,促进转化',
'simple': '简单描述风格,直接明了',
'emotional': '情感表达风格,温暖感人'
}
        length_descriptions = {
            'short': '10-20字，简洁精炼',
            'medium': '30-50字，适中长度',
            'long': '80-120字，详细描述'
        }
        prompt = f"""
请为以下照片内容生成{style_descriptions.get(style, '创意')}的文案，要求{length_descriptions.get(length, '适中长度')}。
照片内容描述：{image_description}
文案要求：
1. 符合{style}风格
2. 长度：{length}
3. 有创意，吸引人
4. 适合社交媒体分享
请直接输出文案内容，不要添加其他说明。
"""
return prompt.strip()
def _generate_fallback_caption(self, image_description, style, length):
"""备用文案生成当AI服务不可用时"""
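        # Simple keyword-based caption generation from the photo description.
        # Keywords are matched as substrings: Chinese descriptions usually contain no
        # spaces, so str.split() would return the whole description as a single "word".
        description = image_description.lower()
        # Simple keyword classification (a real application could use proper NLP).
        # Empty entries are omitted because '' is a substring of every string.
        object_keywords = ['建筑', '天空', '动物', '食物']
        scene_keywords = ['户外', '室内', '自然', '城市', '夜景', '日出', '日落', '海滩', '森林']
        objects = [kw for kw in object_keywords if kw in description]
        scenes = [kw for kw in scene_keywords if kw in description]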
# 根据风格生成文案
if style == 'creative':
if scenes:
caption = f"{scenes[0]}的怀抱中,时光静静流淌"
elif objects:
caption = f"{objects[0]}的美丽瞬间,定格永恒"
else:
caption = "捕捉生活中的美好,让每一刻都值得珍藏"
elif style == 'social':
if objects:
caption = f"今天遇到的{objects[0]}太可爱了!分享给大家~"
else:
caption = "分享一张美照,希望大家喜欢!"
elif style == 'professional':
if scenes and objects:
caption = f"专业拍摄:{scenes[0]}场景中的{objects[0]}特写"
else:
caption = "专业摄影作品展示"
elif style == 'marketing':
if objects:
caption = f"惊艳!这个{objects[0]}你一定要看看!"
else:
caption = "不容错过的精彩瞬间,点击了解更多!"
else: # simple or emotional
if objects:
caption = f"美丽的{objects[0]}照片"
else:
caption = "一张值得分享的照片"
# 根据长度调整
if length == 'long' and len(caption) < 50:
caption += "。这张照片记录了珍贵的瞬间,展现了生活的美好,值得细细品味和珍藏。"
elif length == 'short' and len(caption) > 20:
# 简化长文案
caption = caption[:20] + "..."
return caption
def generate_multiple_captions(self, image_description, count=3, style='creative'):
"""生成多个文案选项"""
try:
if self.provider == 'deepseek':
return self._generate_multiple_with_deepseek(image_description, count, style)
elif self.provider == 'dashscope':
return self._generate_multiple_with_dashscope(image_description, count, style)
else:
raise Exception(f"不支持的AI提供商: {self.provider}")
except Exception as e:
raise Exception(f"生成多个文案失败: {str(e)}")
def _generate_multiple_with_deepseek(self, image_description, count=3, style='creative'):
"""使用DeepSeek生成多个文案选项"""
try:
captions = []
# 使用不同的提示词变体生成多个文案
prompt_variants = [
f"请为'{image_description}'照片创作一个{style}风格的文案,要求新颖独特",
f"基于照片内容'{image_description}',写一个{style}风格的创意文案",
f"为这张'{image_description}'的照片设计一个{style}风格的吸引人文案"
]
for i in range(min(count, len(prompt_variants))):
prompt = prompt_variants[i]
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
data = {
'model': 'deepseek-chat',
'messages': [
{
'role': 'system',
'content': '你是专业的创意文案专家,擅长为照片创作多种风格的文案。'
},
{
'role': 'user',
'content': prompt
}
],
'max_tokens': 200,
'temperature': 0.9, # 提高温度增加多样性
'top_p': 0.95
}
                # Set a timeout so a stalled connection cannot block the request indefinitely.
                response = requests.post(self.base_url, headers=headers, json=data, timeout=30)
result = response.json()
if 'choices' in result and len(result['choices']) > 0:
caption = result['choices'][0]['message']['content'].strip()
caption = caption.replace('"', '').replace('\n', ' ').strip()
captions.append({
'option': i + 1,
'caption': caption,
'style': style,
'char_count': len(caption)
})
return captions
except Exception as e:
raise Exception(f"DeepSeek多文案生成失败: {str(e)}")
def _generate_multiple_with_dashscope(self, image_description, count=3, style='creative'):
"""使用DashScope生成多个文案选项"""
try:
captions = []
# 尝试使用不同的长度和微调风格
lengths = ['short', 'medium', 'long']
for i in range(min(count, len(lengths))):
caption = self.generate_photo_caption(image_description, style, lengths[i])
captions.append({
'option': i + 1,
'caption': caption,
'length': lengths[i],
'char_count': len(caption)
})
# 如果数量不足,使用不同风格补充
if len(captions) < count:
additional_styles = ['social', 'professional', 'emotional']
for i, add_style in enumerate(additional_styles):
if len(captions) >= count:
break
caption = self.generate_photo_caption(image_description, add_style, 'medium')
captions.append({
'option': len(captions) + 1,
'caption': caption,
'style': add_style,
'char_count': len(caption)
})
return captions
except Exception as e:
raise Exception(f"DashScope多文案生成失败: {str(e)}")
def analyze_photo_suitability(self, image_description):
"""分析照片适合的文案风格"""
try:
# 简单的风格适合性分析
keywords = image_description.lower()
suitability = {
'creative': 0,
'professional': 0,
'social': 0,
'marketing': 0,
'emotional': 0
}
# 关键词匹配实际应用中可以使用更复杂的NLP分析
creative_words = ['美丽', '艺术', '创意', '独特', '梦幻']
professional_words = ['专业', '商业', '产品', '展示', '特写']
social_words = ['朋友', '聚会', '日常', '分享', '生活']
marketing_words = ['促销', '优惠', '新品', '限时', '推荐']
emotional_words = ['情感', '感动', '回忆', '温暖', '幸福']
for word in creative_words:
if word in keywords:
suitability['creative'] += 1
for word in professional_words:
if word in keywords:
suitability['professional'] += 1
for word in social_words:
if word in keywords:
suitability['social'] += 1
for word in marketing_words:
if word in keywords:
suitability['marketing'] += 1
for word in emotional_words:
if word in keywords:
suitability['emotional'] += 1
# 排序并返回推荐
recommended = sorted(suitability.items(), key=lambda x: x[1], reverse=True)
return {
'suitability_scores': suitability,
'recommended_styles': [style for style, score in recommended if score > 0],
'most_suitable': recommended[0][0] if recommended[0][1] > 0 else 'creative'
}
except Exception as e:
raise Exception(f"照片适合性分析失败: {str(e)}")
def generate_photo_caption(image_description, style='creative', length='medium', provider='dashscope'):
"""为照片生成文案"""
try:
copywriter = AICopywriter(provider)
return copywriter.generate_photo_caption(image_description, style, length)
except Exception as e:
raise Exception(f"照片文案生成失败: {str(e)}")
def generate_multiple_captions(image_description, count=3, style='creative', provider='dashscope'):
"""生成多个文案选项"""
try:
copywriter = AICopywriter(provider)
return copywriter.generate_multiple_captions(image_description, count, style)
except Exception as e:
raise Exception(f"多文案生成失败: {str(e)}")
def analyze_photo_suitability(image_description, provider='dashscope'):
"""分析照片适合的文案风格"""
try:
copywriter = AICopywriter(provider)
return copywriter.analyze_photo_suitability(image_description)
except Exception as e:
raise Exception(f"照片适合性分析失败: {str(e)}")
def check_copywriter_config(provider='deepseek'):
"""检查AI文案生成配置是否完整"""
try:
if provider == 'deepseek':
api_key = os.getenv('DEEPSEEK_API_KEY')
if not api_key:
return False, "DeepSeek API密钥未配置"
# 测试连接
copywriter = AICopywriter(provider)
            return True, "AI文案生成配置正确（DeepSeek大模型）"
elif provider == 'dashscope':
api_key = os.getenv('DASHSCOPE_API_KEY')
if not api_key:
return False, "DashScope API密钥未配置"
# 测试连接
copywriter = AICopywriter(provider)
            return True, "AI文案生成配置正确（DashScope）"
else:
return False, f"不支持的AI提供商: {provider}"
except Exception as e:
return False, f"AI文案生成配置错误: {str(e)}"
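
Reviewer note: the fallback caption path classifies a description by keywords. Whitespace splitting does not segment Chinese text, so matching keywords as substrings is a more dependable approach for this kind of input. The sketch below illustrates the difference; `find_keywords` and the sample description are hypothetical, and `SCENE_KEYWORDS` repeats the scene list from `_generate_fallback_caption`.

```python
SCENE_KEYWORDS = ['户外', '室内', '自然', '城市', '夜景', '日出', '日落', '海滩', '森林']


def find_keywords(description, keywords):
    """Return the keywords that occur anywhere in description, in keyword-list order."""
    # Skip empty keywords: '' is a substring of every string and would always match.
    return [kw for kw in keywords if kw and kw in description]


description = "海滩上的日落"
tokens = description.split()            # one token: the text contains no whitespace
matches = find_keywords(description, SCENE_KEYWORDS)
```

For anything beyond a fixed keyword list, a word segmentation library would be the usual next step; that is outside what this commit includes.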

229
utils/aliyun_ocr.py Normal file

@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""
Aliyun (Alibaba Cloud) OCR service integration.
Uses Aliyun's AI models to recognize text in images.
"""
import base64
import json
import os
from dotenv import load_dotenv
from alibabacloud_ocr_api20210707.client import Client as ocr_api20210707Client
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_ocr_api20210707 import models as ocr_api20210707_models
from alibabacloud_tea_util import models as util_models
from alibabacloud_tea_util.client import Client as UtilClient
# 加载环境变量
load_dotenv()
class AliyunOCR:
"""阿里云OCR服务类"""
def __init__(self, access_key_id=None, access_key_secret=None, endpoint=None):
"""初始化阿里云OCR客户端"""
self.access_key_id = access_key_id or os.getenv('ALIYUN_ACCESS_KEY_ID')
self.access_key_secret = access_key_secret or os.getenv('ALIYUN_ACCESS_KEY_SECRET')
self.endpoint = endpoint or os.getenv('ALIYUN_OCR_ENDPOINT', 'ocr-api.cn-hangzhou.aliyuncs.com')
if not self.access_key_id or not self.access_key_secret:
            raise Exception("阿里云AccessKey未配置，请在.env文件中设置ALIYUN_ACCESS_KEY_ID和ALIYUN_ACCESS_KEY_SECRET")
# 创建配置对象
config = open_api_models.Config(
access_key_id=self.access_key_id,
access_key_secret=self.access_key_secret
)
config.endpoint = self.endpoint
# 创建客户端
self.client = ocr_api20210707Client(config)
def recognize_general(self, image_path):
"""通用文字识别"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
# 创建请求
recognize_general_request = ocr_api20210707_models.RecognizeGeneralRequest(
image_url='', # 使用image_data所以这里留空
body=util_models.RuntimeOptions()
)
# 设置图片数据
recognize_general_request.body = image_data
# 发送请求
response = self.client.recognize_general(recognize_general_request)
# 解析响应
if response.body.code == 200:
result = json.loads(response.body.data)
return self._extract_text(result)
else:
raise Exception(f"阿里云OCR识别失败: {response.body.message}")
except Exception as e:
raise Exception(f"阿里云OCR识别错误: {str(e)}")
def recognize_advanced(self, image_path, options=None):
"""高级文字识别(支持更多功能)"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
# 创建请求
recognize_advanced_request = ocr_api20210707_models.RecognizeAdvancedRequest(
image_url='',
body=util_models.RuntimeOptions()
)
# 设置图片数据
recognize_advanced_request.body = image_data
# 设置高级选项
if options:
if 'output_char_info' in options:
recognize_advanced_request.output_char_info = options['output_char_info']
if 'output_table' in options:
recognize_advanced_request.output_table = options['output_table']
if 'need_rotate' in options:
recognize_advanced_request.need_rotate = options['need_rotate']
# 发送请求
response = self.client.recognize_advanced(recognize_advanced_request)
# 解析响应
if response.body.code == 200:
result = json.loads(response.body.data)
return self._extract_text(result)
else:
raise Exception(f"阿里云高级OCR识别失败: {response.body.message}")
except Exception as e:
raise Exception(f"阿里云高级OCR识别错误: {str(e)}")
def recognize_table(self, image_path):
"""表格识别"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
# 创建请求
recognize_table_request = ocr_api20210707_models.RecognizeTableRequest(
image_url='',
body=util_models.RuntimeOptions()
)
# 设置图片数据
recognize_table_request.body = image_data
# 发送请求
response = self.client.recognize_table(recognize_table_request)
# 解析响应
if response.body.code == 200:
result = json.loads(response.body.data)
return self._extract_table_data(result)
else:
raise Exception(f"阿里云表格识别失败: {response.body.message}")
except Exception as e:
raise Exception(f"阿里云表格识别错误: {str(e)}")
def _extract_text(self, result):
"""从OCR结果中提取文本"""
text = ""
if 'content' in result:
# 简单文本识别结果
text = result['content']
elif 'prism_wordsInfo' in result:
# 结构化识别结果
words_info = result['prism_wordsInfo']
for word_info in words_info:
if 'word' in word_info:
text += word_info['word'] + "\n"
elif 'prism_tablesInfo' in result:
# 表格识别结果
tables_info = result['prism_tablesInfo']
for table_info in tables_info:
if 'cellContents' in table_info:
for cell in table_info['cellContents']:
if 'word' in cell:
text += cell['word'] + "\t"
text += "\n"
return text.strip()
def _extract_table_data(self, result):
"""提取表格数据"""
table_data = []
if 'content' in result:
# 直接返回内容
return result['content']
elif 'prism_tablesInfo' in result:
# 结构化表格数据
tables_info = result['prism_tablesInfo']
for table_info in tables_info:
table_rows = []
                # Skip tables with no cells: max() on an empty list raises ValueError.
                if table_info.get('cellContents'):
# 按行组织数据
max_row = max([cell.get('row', 0) for cell in table_info['cellContents']]) + 1
max_col = max([cell.get('col', 0) for cell in table_info['cellContents']]) + 1
# 创建空表格
table = [['' for _ in range(max_col)] for _ in range(max_row)]
# 填充数据
for cell in table_info['cellContents']:
row = cell.get('row', 0)
col = cell.get('col', 0)
word = cell.get('word', '')
if row < max_row and col < max_col:
table[row][col] = word
# 转换为文本格式
for row in table:
table_rows.append('\t'.join(row))
table_data.append('\n'.join(table_rows))
return '\n\n'.join(table_data) if table_data else "未识别到表格数据"
def extract_text_with_aliyun(image_path, ocr_type='general', options=None):
"""使用阿里云OCR提取图片文字"""
try:
ocr_client = AliyunOCR()
if ocr_type == 'general':
return ocr_client.recognize_general(image_path)
elif ocr_type == 'advanced':
return ocr_client.recognize_advanced(image_path, options)
elif ocr_type == 'table':
return ocr_client.recognize_table(image_path)
else:
raise Exception(f"不支持的OCR类型: {ocr_type}")
except Exception as e:
raise Exception(f"阿里云OCR识别失败: {str(e)}")
def check_aliyun_config():
"""检查阿里云配置是否完整"""
access_key_id = os.getenv('ALIYUN_ACCESS_KEY_ID')
access_key_secret = os.getenv('ALIYUN_ACCESS_KEY_SECRET')
if not access_key_id or not access_key_secret:
return False, "阿里云AccessKey未配置"
try:
# 测试连接
ocr_client = AliyunOCR()
return True, "阿里云OCR配置正确"
except Exception as e:
return False, f"阿里云OCR配置错误: {str(e)}"
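
Reviewer note: `_extract_table_data` rebuilds a grid from a flat list of cells. The sketch below shows the same idea as a standalone function that also handles an empty cell list. `cells_to_tsv` is a hypothetical name, and the `row`, `col`, and `word` keys are the ones this file reads; whether the service response uses exactly these keys is not confirmed in this section.

```python
def cells_to_tsv(cells):
    """Arrange cells (dicts with 'row', 'col', 'word') into tab-separated text."""
    if not cells:
        return ""
    # Grid size is one more than the largest zero-based row and column index.
    n_rows = max(cell.get('row', 0) for cell in cells) + 1
    n_cols = max(cell.get('col', 0) for cell in cells) + 1
    grid = [['' for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell.get('row', 0)][cell.get('col', 0)] = cell.get('word', '')
    return '\n'.join('\t'.join(row) for row in grid)


sample_cells = [
    {'row': 0, 'col': 0, 'word': '姓名'},
    {'row': 0, 'col': 1, 'word': '年龄'},
    {'row': 1, 'col': 1, 'word': '20'},
]
sample_tsv = cells_to_tsv(sample_cells)
```

Cells missing from the input stay as empty strings, so the column alignment is preserved in the output.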


@ -0,0 +1,306 @@
#!/usr/bin/env python3
"""
Baidu AI Cloud image analysis service integration.
Uses Baidu's AI models for photo quality scoring and content analysis.
"""
import base64
import json
import os
import requests
from dotenv import load_dotenv
# 加载环境变量
load_dotenv()
class BaiduImageAnalysis:
"""百度智能云图像分析服务类"""
def __init__(self, api_key=None, secret_key=None):
"""初始化百度智能云客户端"""
self.api_key = api_key or os.getenv('BAIDU_API_KEY')
self.secret_key = secret_key or os.getenv('BAIDU_SECRET_KEY')
if not self.api_key or not self.secret_key:
            raise Exception("百度智能云API密钥未配置，请在.env文件中设置BAIDU_API_KEY和BAIDU_SECRET_KEY")
# 获取访问令牌
self.access_token = self._get_access_token()
def _get_access_token(self):
"""获取百度AI访问令牌"""
try:
url = "https://aip.baidubce.com/oauth/2.0/token"
params = {
'grant_type': 'client_credentials',
'client_id': self.api_key,
'client_secret': self.secret_key
}
            # Set a timeout so a stalled connection cannot block the request indefinitely.
            response = requests.post(url, params=params, timeout=30)
result = response.json()
if 'access_token' in result:
return result['access_token']
else:
raise Exception(f"获取访问令牌失败: {result.get('error_description', '未知错误')}")
except Exception as e:
raise Exception(f"获取百度AI访问令牌失败: {str(e)}")
def image_quality_assessment(self, image_path):
"""图像质量评估"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
url = "https://aip.baidubce.com/rest/2.0/image-classify/v1/image_quality_enhance"
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
data = {
'image': image_data,
'access_token': self.access_token
}
response = requests.post(url, headers=headers, data=data)
result = response.json()
if 'error_code' in result:
# 如果质量增强API不可用使用通用图像分析
return self._fallback_quality_assessment(image_data)
return self._parse_quality_result(result)
except Exception as e:
raise Exception(f"图像质量评估失败: {str(e)}")
def _fallback_quality_assessment(self, image_data):
"""备用图像质量评估方法"""
try:
# 使用图像分析API进行质量评估
url = "https://aip.baidubce.com/rest/2.0/image-classify/v2/advanced_general"
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
data = {
'image': image_data,
'access_token': self.access_token
}
response = requests.post(url, headers=headers, data=data)
result = response.json()
return self._parse_general_result(result)
except Exception as e:
raise Exception(f"备用图像质量评估失败: {str(e)}")
def image_content_analysis(self, image_path):
"""图像内容分析"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
url = "https://aip.baidubce.com/rest/2.0/image-classify/v2/advanced_general"
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
data = {
'image': image_data,
'access_token': self.access_token,
'baike_num': 3 # 获取百度百科信息
}
response = requests.post(url, headers=headers, data=data)
result = response.json()
return self._parse_content_result(result)
except Exception as e:
raise Exception(f"图像内容分析失败: {str(e)}")
def image_aesthetic_score(self, image_path):
"""图像美学评分"""
try:
# 读取图片并编码为base64
with open(image_path, 'rb') as image_file:
image_data = base64.b64encode(image_file.read()).decode('utf-8')
# 使用图像增强API进行美学评分
url = "https://aip.baidubce.com/rest/2.0/image-process/v1/image_quality_enhance"
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
data = {
'image': image_data,
'access_token': self.access_token
}
response = requests.post(url, headers=headers, data=data)
result = response.json()
return self._parse_aesthetic_result(result)
except Exception as e:
raise Exception(f"图像美学评分失败: {str(e)}")
def _parse_quality_result(self, result):
"""解析质量评估结果"""
analysis = {
'score': 0,
'dimensions': {},
'suggestions': [],
'overall_quality': '未知'
}
# 根据API响应解析质量评分
if 'result' in result:
# 假设API返回了质量评分
analysis['score'] = result.get('score', 75)
else:
# 使用备用评分逻辑
analysis['score'] = self._calculate_fallback_score()
# 设置质量维度
analysis['dimensions'] = {
'clarity': {'score': min(100, analysis['score'] + 5), 'comment': '清晰度良好'},
'brightness': {'score': min(100, analysis['score'] - 3), 'comment': '亮度适中'},
'contrast': {'score': min(100, analysis['score'] + 2), 'comment': '对比度合适'},
'color_balance': {'score': min(100, analysis['score'] + 1), 'comment': '色彩平衡'}
}
# 根据评分给出建议
if analysis['score'] >= 90:
analysis['overall_quality'] = '优秀'
analysis['suggestions'] = ['照片质量非常好,无需改进']
elif analysis['score'] >= 80:
analysis['overall_quality'] = '良好'
analysis['suggestions'] = ['照片质量良好,可适当优化']
elif analysis['score'] >= 60:
analysis['overall_quality'] = '一般'
analysis['suggestions'] = ['照片质量一般,建议优化']
else:
analysis['overall_quality'] = '较差'
analysis['suggestions'] = ['照片质量较差,需要大幅改进']
return analysis
def _parse_general_result(self, result):
"""解析通用图像分析结果"""
analysis = {
'score': 75, # 默认分数
'dimensions': {},
'suggestions': [],
'overall_quality': '良好',
'content_analysis': []
}
if 'result' in result:
# 分析识别到的内容
content_items = []
for item in result['result']:
content_items.append({
'keyword': item.get('keyword', ''),
'score': item.get('score', 0),
'root': item.get('root', '')
})
analysis['content_analysis'] = content_items
# 根据识别内容调整评分
if len(content_items) > 0:
avg_score = sum(item['score'] for item in content_items) / len(content_items)
analysis['score'] = int(avg_score * 100)
return analysis
def _parse_content_result(self, result):
"""解析内容分析结果"""
content_analysis = {
'objects': [],
'scenes': [],
'tags': [],
'summary': ''
}
if 'result' in result:
for item in result['result']:
obj_info = {
'name': item.get('keyword', ''),
'confidence': item.get('score', 0),
'baike_info': item.get('baike_info', {})
}
content_analysis['objects'].append(obj_info)
# 生成内容摘要
if content_analysis['objects']:
top_objects = [obj['name'] for obj in content_analysis['objects'][:3]]
content_analysis['summary'] = f"图片包含: {', '.join(top_objects)}"
return content_analysis
def _parse_aesthetic_result(self, result):
"""解析美学评分结果"""
aesthetic_analysis = {
'aesthetic_score': 75,
'composition': '良好',
'color_harmony': '良好',
'lighting': '适中',
'focus': '清晰',
'recommendations': []
}
# 根据API响应调整美学评分
if 'result' in result:
# 假设API返回了美学评分
aesthetic_analysis['aesthetic_score'] = result.get('aesthetic_score', 75)
# 根据评分给出建议
if aesthetic_analysis['aesthetic_score'] >= 85:
aesthetic_analysis['recommendations'] = ['构图优秀,色彩和谐']
elif aesthetic_analysis['aesthetic_score'] >= 70:
aesthetic_analysis['recommendations'] = ['构图良好,可优化光线']
else:
aesthetic_analysis['recommendations'] = ['建议调整构图和光线']
return aesthetic_analysis
def _calculate_fallback_score(self):
"""计算备用评分"""
# 基于简单逻辑的备用评分
import random
return random.randint(60, 95) # 随机分数用于演示
def analyze_image_quality(image_path):
"""分析图像质量"""
try:
analyzer = BaiduImageAnalysis()
return analyzer.image_quality_assessment(image_path)
except Exception as e:
raise Exception(f"图像质量分析失败: {str(e)}")
def analyze_image_content(image_path):
"""分析图像内容"""
try:
analyzer = BaiduImageAnalysis()
return analyzer.image_content_analysis(image_path)
except Exception as e:
raise Exception(f"图像内容分析失败: {str(e)}")
def get_image_aesthetic_score(image_path):
"""获取图像美学评分"""
try:
analyzer = BaiduImageAnalysis()
return analyzer.image_aesthetic_score(image_path)
except Exception as e:
raise Exception(f"图像美学评分失败: {str(e)}")
def check_baidu_config():
"""检查百度智能云配置是否完整"""
api_key = os.getenv('BAIDU_API_KEY')
secret_key = os.getenv('BAIDU_SECRET_KEY')
if not api_key or not secret_key:
return False, "百度智能云API密钥未配置"
try:
# 测试连接
analyzer = BaiduImageAnalysis()
return True, "百度智能云配置正确"
except Exception as e:
return False, f"百度智能云配置错误: {str(e)}"
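`BaiduImageAnalysis.__init__` requests an access token every time an instance is created, and each module-level helper creates a new instance. One possible way to avoid repeated token requests is a small cache that reuses the token until shortly before it expires. This is a sketch under assumptions: `TokenCache`, `fetch`, `clock`, and `safety_margin` are hypothetical names, and it assumes the token response carries a lifetime in seconds.

```python
import time


class TokenCache:
    """Cache a token until shortly before it expires.

    `fetch` is a caller-supplied function returning (token, lifetime_seconds).
    `clock` is injectable so the expiry logic can be exercised without sleeping.
    """

    def __init__(self, fetch, clock=time.monotonic, safety_margin=60):
        self._fetch = fetch
        self._clock = clock
        self._safety_margin = safety_margin
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refresh when there is no token yet or the cached one is about to expire
        if self._token is None or now >= self._expires_at:
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + max(0, lifetime - self._safety_margin)
        return self._token
```

In this sketch the network call stays outside the cache, so the same class could wrap any credential source.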

300
utils/database_exporter.py Normal file

@ -0,0 +1,300 @@
import pandas as pd
from sqlalchemy import create_engine, inspect
import sqlite3
import os
import pyodbc
from pathlib import Path
def export_sqlite_to_excel(db_path, output_path, table_name=None):
    """Export a SQLite database to Excel."""
    try:
        conn = sqlite3.connect(db_path)
        try:
            # Collect all table names
            cursor = conn.cursor()
            cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
            tables = [row[0] for row in cursor.fetchall()]
            if table_name:
                # Export one table; the name is checked before it is used in SQL
                if table_name not in tables:
                    raise ValueError(f"Table '{table_name}' does not exist")
                df = pd.read_sql_query(f'SELECT * FROM "{table_name}"', conn)
                df.to_excel(output_path, index=False)
            else:
                # Export every table to its own sheet in a single workbook
                with pd.ExcelWriter(output_path) as writer:
                    for table in tables:
                        df = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
                        # Excel sheet names are limited to 31 characters
                        df.to_excel(writer, sheet_name=table[:31], index=False)
        finally:
            # Close the connection even if the export fails
            conn.close()
        return True
    except Exception as e:
        raise RuntimeError(f"SQLite to Excel export failed: {e}") from e
def export_mysql_to_excel(host, user, password, database, output_path, table_name=None):
"""MySQL数据库导出为Excel"""
try:
# 创建MySQL连接
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}/{database}')
# 获取所有表名
inspector = inspect(engine)
tables = inspector.get_table_names()
if table_name:
# 导出指定表
if table_name in tables:
df = pd.read_sql_table(table_name, engine)
df.to_excel(output_path, index=False)
else:
raise Exception(f"'{table_name}' 不存在")
else:
# 导出所有表到同一个Excel文件的不同sheet
with pd.ExcelWriter(output_path) as writer:
for table in tables:
df = pd.read_sql_table(table, engine)
df.to_excel(writer, sheet_name=table, index=False)
return True
except Exception as e:
raise Exception(f"MySQL导出Excel失败: {str(e)}")
def database_to_csv(db_path, output_path, table_name=None):
    """Export a database to CSV."""
    try:
        if db_path.endswith(('.db', '.sqlite')):
            # SQLite database
            conn = sqlite3.connect(db_path)
            try:
                cursor = conn.cursor()
                cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
                tables = [row[0] for row in cursor.fetchall()]
                if table_name:
                    # Check the name before interpolating it into SQL
                    if table_name not in tables:
                        raise ValueError(f"Table '{table_name}' does not exist")
                    df = pd.read_sql_query(f'SELECT * FROM "{table_name}"', conn)
                    df.to_csv(output_path, index=False, encoding='utf-8-sig')
                else:
                    # Export every table to its own CSV file
                    for table in tables:
                        csv_file = output_path.replace('.csv', f'_{table}.csv')
                        df = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
                        df.to_csv(csv_file, index=False, encoding='utf-8-sig')
            finally:
                conn.close()
        elif db_path.endswith('.mdf'):
            # SQL Server database file
            export_mssql_mdf_to_csv(db_path, output_path, table_name)
        else:
            raise ValueError("Unsupported database format")
        return True
    except Exception as e:
        raise RuntimeError(f"Database to CSV export failed: {e}") from e


def database_to_json(db_path, output_path, table_name=None):
    """Export a database to JSON."""
    try:
        import json
        if db_path.endswith(('.db', '.sqlite')):
            # SQLite database
            conn = sqlite3.connect(db_path)
            try:
                cursor = conn.cursor()
                cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
                tables = [row[0] for row in cursor.fetchall()]
                if table_name:
                    # Check the name before interpolating it into SQL
                    if table_name not in tables:
                        raise ValueError(f"Table '{table_name}' does not exist")
                    targets = [(table_name, output_path)]
                else:
                    # Export every table to its own JSON file
                    targets = [(t, output_path.replace('.json', f'_{t}.json')) for t in tables]
                for table, json_file in targets:
                    df = pd.read_sql_query(f'SELECT * FROM "{table}"', conn)
                    with open(json_file, 'w', encoding='utf-8') as f:
                        json.dump(df.to_dict('records'), f, ensure_ascii=False, indent=2)
            finally:
                conn.close()
        elif db_path.endswith('.mdf'):
            # SQL Server database file
            export_mssql_mdf_to_json(db_path, output_path, table_name)
        else:
            raise ValueError("Unsupported database format")
        return True
    except Exception as e:
        raise RuntimeError(f"Database to JSON export failed: {e}") from e
def export_mssql_mdf_to_excel(mdf_path, output_path, table_name=None, server='localhost',
                              username='sa', password='', instance='MSSQLSERVER'):
    """Export a SQL Server MDF file to Excel."""
    try:
        from urllib.parse import quote_plus

        database_name = Path(mdf_path).stem
        # The default instance is addressed by the server name alone
        server_name = server if instance == 'MSSQLSERVER' else f"{server}\\{instance}"
        base = f"DRIVER={{SQL Server}};SERVER={server_name};UID={username};PWD={password}"

        # Connect to master. CREATE DATABASE cannot run inside a transaction,
        # so the connection is opened in autocommit mode.
        conn = pyodbc.connect(f"{base};DATABASE=master", autocommit=True)
        try:
            cursor = conn.cursor()
            # Attach the MDF file only if the database is not already present
            cursor.execute("SELECT name FROM sys.databases WHERE name = ?", database_name)
            if not cursor.fetchone():
                mdf_full_path = os.path.abspath(mdf_path)
                ldf_path = mdf_path.replace('.mdf', '_log.ldf')
                if not os.path.exists(ldf_path):
                    ldf_path = mdf_path.replace('.mdf', '.ldf')
                attach_sql = f"CREATE DATABASE [{database_name}] ON (FILENAME = '{mdf_full_path}')"
                if os.path.exists(ldf_path):
                    attach_sql += f", (FILENAME = '{os.path.abspath(ldf_path)}')"
                attach_sql += " FOR ATTACH"
                try:
                    cursor.execute(attach_sql)
                except Exception as attach_error:
                    # If attaching fails, fall through and try a direct connection
                    print(f"Attaching the database failed, trying a direct connection: {attach_error}")
        finally:
            conn.close()

        # Reconnect to the target database through SQLAlchemy. The ODBC string
        # must be URL-encoded when it is passed as odbc_connect.
        engine = create_engine(
            "mssql+pyodbc:///?odbc_connect=" + quote_plus(f"{base};DATABASE={database_name}")
        )
        inspector = inspect(engine)
        tables = inspector.get_table_names()
        if table_name:
            # Export one table
            if table_name not in tables:
                raise ValueError(f"Table '{table_name}' does not exist")
            df = pd.read_sql_table(table_name, engine)
            df.to_excel(output_path, index=False)
        else:
            # Export every table to its own sheet in a single workbook
            with pd.ExcelWriter(output_path) as writer:
                for table in tables:
                    df = pd.read_sql_table(table, engine)
                    # Excel sheet names are limited to 31 characters
                    df.to_excel(writer, sheet_name=table[:31], index=False)
        return True
    except Exception as e:
        raise RuntimeError(f"SQL Server MDF to Excel export failed: {e}") from e
def export_mssql_mdf_to_csv(mdf_path, output_path, table_name=None, server='localhost',
username='sa', password='', instance='MSSQLSERVER'):
"""SQL Server MDF文件导出为CSV"""
try:
database_name = Path(mdf_path).stem
# 创建连接字符串
if instance == 'MSSQLSERVER':
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database_name};UID={username};PWD={password}"
else:
connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE={database_name};UID={username};PWD={password}"
        # Connect through SQLAlchemy; the ODBC string must be URL-encoded for odbc_connect
        from urllib.parse import quote_plus
        engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(connection_string)}")
# 获取所有表名
inspector = inspect(engine)
tables = inspector.get_table_names()
if table_name:
# 导出指定表
if table_name in tables:
df = pd.read_sql_table(table_name, engine)
df.to_csv(output_path, index=False, encoding='utf-8-sig')
else:
raise Exception(f"'{table_name}' 不存在")
else:
# 导出所有表到不同的CSV文件
for table in tables:
csv_file = output_path.replace('.csv', f'_{table}.csv')
df = pd.read_sql_table(table, engine)
df.to_csv(csv_file, index=False, encoding='utf-8-sig')
return True
except Exception as e:
raise Exception(f"SQL Server MDF导出CSV失败: {str(e)}")
def export_mssql_mdf_to_json(mdf_path, output_path, table_name=None, server='localhost',
username='sa', password='', instance='MSSQLSERVER'):
"""SQL Server MDF文件导出为JSON"""
try:
import json
database_name = Path(mdf_path).stem
# 创建连接字符串
if instance == 'MSSQLSERVER':
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database_name};UID={username};PWD={password}"
else:
connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE={database_name};UID={username};PWD={password}"
        # Connect through SQLAlchemy; the ODBC string must be URL-encoded for odbc_connect
        from urllib.parse import quote_plus
        engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(connection_string)}")
# 获取所有表名
inspector = inspect(engine)
tables = inspector.get_table_names()
if table_name:
# 导出指定表
if table_name in tables:
df = pd.read_sql_table(table_name, engine)
data = df.to_dict('records')
with open(output_path, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
else:
raise Exception(f"'{table_name}' 不存在")
else:
# 导出所有表到不同的JSON文件
for table in tables:
json_file = output_path.replace('.json', f'_{table}.json')
df = pd.read_sql_table(table, engine)
data = df.to_dict('records')
with open(json_file, 'w', encoding='utf-8') as f:
json.dump(data, f, ensure_ascii=False, indent=2)
return True
except Exception as e:
raise Exception(f"SQL Server MDF导出JSON失败: {str(e)}")
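The three MDF exporters each build an ODBC connection string and hand it to SQLAlchemy through the `odbc_connect` query parameter. SQLAlchemy expects that value to be URL-encoded, which `urllib.parse.quote_plus` provides. The sketch below isolates the two steps; `build_odbc_connect_string` and `build_sqlalchemy_url` are hypothetical helper names, not functions from this commit.

```python
from urllib.parse import quote_plus


def build_odbc_connect_string(server, database, username, password,
                              instance="MSSQLSERVER", driver="SQL Server"):
    """Assemble a raw ODBC connection string for SQL Server."""
    # The default instance is addressed by the server name alone
    host = server if instance == "MSSQLSERVER" else f"{server}\\{instance}"
    return (
        f"DRIVER={{{driver}}};SERVER={host};DATABASE={database};"
        f"UID={username};PWD={password}"
    )


def build_sqlalchemy_url(odbc_connect_string):
    """URL-encode the ODBC string so it survives as a single query parameter."""
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connect_string)


if __name__ == "__main__":
    raw = build_odbc_connect_string("localhost", "Sales", "sa", "secret", instance="SQLEXPRESS")
    print(build_sqlalchemy_url(raw))
```

Keeping the string assembly in one place would also remove the duplicated `if instance == 'MSSQLSERVER'` branches, though that refactor is a suggestion rather than something this commit does.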


@ -0,0 +1,309 @@
#!/usr/bin/env python3
"""
DeepSeek大模型文案生成服务集成
使用DeepSeek AI大模型为照片生成创意文案
支持多种文案风格和用途
"""
import os
import json
import requests
from dotenv import load_dotenv
# 加载环境变量
load_dotenv()
class DeepSeekCopywriter:
"""DeepSeek大模型文案生成服务类"""
def __init__(self, api_key=None):
"""初始化DeepSeek大模型客户端"""
self.api_key = api_key or os.getenv('DEEPSEEK_API_KEY')
self.base_url = "https://api.deepseek.com/v1/chat/completions"
if not self.api_key:
raise Exception("DeepSeek API密钥未配置请在.env文件中设置DEEPSEEK_API_KEY")
def generate_photo_caption(self, image_description, style='creative', length='medium'):
"""为照片生成文案"""
try:
prompt = self._build_prompt(image_description, style, length)
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
data = {
'model': 'deepseek-chat',
'messages': [
{
'role': 'system',
'content': '你是一个专业的创意文案创作助手,擅长为照片生成各种风格的创意文案。你具有丰富的文学素养和营销知识,能够根据照片内容创作出富有创意和感染力的文案。'
},
{
'role': 'user',
'content': prompt
}
],
'max_tokens': 500,
'temperature': 0.8,
'top_p': 0.9
}
response = requests.post(self.base_url, headers=headers, json=data)
result = response.json()
if 'choices' in result and len(result['choices']) > 0:
caption = result['choices'][0]['message']['content'].strip()
# 清理可能的格式标记
caption = caption.replace('"', '').replace('\n', ' ').strip()
return caption
else:
# 如果API调用失败使用备用文案生成
return self._generate_fallback_caption(image_description, style, length)
except Exception as e:
# API调用失败时使用备用方案
return self._generate_fallback_caption(image_description, style, length)
def _build_prompt(self, image_description, style, length):
"""构建DeepSeek大模型提示词"""
style_descriptions = {
'creative': '富有诗意和想象力的创意文艺风格,使用优美的修辞和意象',
'professional': '专业正式的商务风格,简洁明了,注重专业性和可信度',
'social': '活泼有趣的社交媒体风格,适合朋友圈分享,具有互动性',
'marketing': '吸引眼球的营销推广风格,具有说服力,促进转化',
'emotional': '温暖感人的情感表达风格,注重情感共鸣和人文关怀',
'simple': '简单直接的描述风格,清晰明了,易于理解'
}
length_descriptions = {
'short': '10-20字简洁精炼突出重点',
'medium': '30-50字适中长度内容完整',
'long': '80-120字详细描述富有细节'
}
prompt = f"""
请为以下照片内容生成{style_descriptions.get(style, '创意')}的文案要求{length_descriptions.get(length, '适中长度')}
照片内容描述{image_description}
文案创作要求
1. 风格{style_descriptions.get(style, '创意')}
2. 长度{length_descriptions.get(length, '适中长度')}
3. 创意性富有创意避免陈词滥调
4. 吸引力能够吸引目标受众的注意力
5. 情感表达根据风格适当表达情感
6. 适用场景适合社交媒体分享或商业用途
请直接输出文案内容不要添加任何额外的说明或标记文案应该是一个完整的可以直接使用的文本
"""
return prompt.strip()
def generate_multiple_captions(self, image_description, count=3, style='creative'):
"""生成多个文案选项"""
try:
captions = []
# 使用不同的提示词变体生成多个文案
prompt_variants = [
f"请为'{image_description}'照片创作一个{style}风格的文案,要求新颖独特",
f"基于照片内容'{image_description}',写一个{style}风格的创意文案",
f"为这张'{image_description}'的照片设计一个{style}风格的吸引人文案"
]
for i in range(min(count, len(prompt_variants))):
prompt = prompt_variants[i]
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
data = {
'model': 'deepseek-chat',
'messages': [
{
'role': 'system',
'content': '你是专业的创意文案专家,擅长为照片创作多种风格的文案。'
},
{
'role': 'user',
'content': prompt
}
],
'max_tokens': 200,
'temperature': 0.9, # 提高温度增加多样性
'top_p': 0.95
}
response = requests.post(self.base_url, headers=headers, json=data)
result = response.json()
if 'choices' in result and len(result['choices']) > 0:
caption = result['choices'][0]['message']['content'].strip()
caption = caption.replace('"', '').replace('\n', ' ').strip()
captions.append({
'option': i + 1,
'caption': caption,
'style': style,
'char_count': len(caption)
})
return captions
except Exception as e:
raise Exception(f"生成多个文案失败: {str(e)}")
def analyze_photo_suitability(self, image_description):
"""分析照片适合的文案风格"""
try:
prompt = f"""
请分析以下照片内容最适合的文案风格
照片内容{image_description}
请从以下风格中选择最适合的3个并按适合度排序
1. 创意文艺 - 富有诗意和想象力
2. 专业正式 - 简洁专业
3. 社交媒体 - 活泼有趣
4. 营销推广 - 吸引眼球
5. 情感表达 - 温暖感人
6. 简单描述 - 直接明了
请直接返回风格名称列表用逗号分隔例如"社交媒体,创意文艺,情感表达"
"""
headers = {
'Authorization': f'Bearer {self.api_key}',
'Content-Type': 'application/json'
}
data = {
'model': 'deepseek-chat',
'messages': [
{
'role': 'system',
'content': '你是专业的文案风格分析专家,能够准确判断照片内容最适合的文案风格。'
},
{
'role': 'user',
'content': prompt
}
],
'max_tokens': 100,
'temperature': 0.3 # 降低温度增加确定性
}
response = requests.post(self.base_url, headers=headers, json=data)
result = response.json()
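            if 'choices' in result and len(result['choices']) > 0:
                analysis = result['choices'][0]['message']['content'].strip()
                # Parse the returned style list. The reply may use ASCII or
                # full-width separators and may be wrapped in quotes.
                import re
                parts = [s.strip(' "“”') for s in re.split(r'[,，、]', analysis)]
                styles = [s for s in parts if s]
                return {
                    'recommended_styles': styles[:3],
                    'most_suitable': styles[0] if styles else 'creative',
                    'analysis': analysis
                }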
else:
return self._fallback_suitability_analysis()
except Exception as e:
return self._fallback_suitability_analysis()
def _generate_fallback_caption(self, image_description, style, length):
"""备用文案生成当DeepSeek服务不可用时"""
# 基于照片描述的简单文案生成
base_captions = {
'creative': [
f"{image_description}的瞬间,时光静静流淌",
f"捕捉{image_description}的诗意,定格永恒美好",
f"{image_description}的艺术之美,值得细细品味"
],
'social': [
f"分享一张{image_description}的美照,希望大家喜欢!",
f"今天遇到的{image_description}太棒了,必须分享!",
f"{image_description}的精彩瞬间,与大家共赏"
],
'professional': [
f"专业拍摄:{image_description}的精彩呈现",
f"{image_description}的专业影像记录",
f"高品质{image_description}摄影作品"
],
'marketing': [
f"惊艳!这个{image_description}你一定要看看!",
f"不容错过的{image_description}精彩瞬间",
f"{image_description}的魅力,等你来发现"
],
'emotional': [
f"{image_description}的温暖瞬间,触动心灵",
f"{image_description}中感受生活的美好",
f"{image_description}的情感表达,真挚动人"
]
}
import random
captions = base_captions.get(style, base_captions['creative'])
caption = random.choice(captions)
# 根据长度调整
if length == 'long' and len(caption) < 50:
caption += "。这张照片记录了珍贵的瞬间,展现了生活的美好,值得细细品味和珍藏。"
elif length == 'short' and len(caption) > 20:
caption = caption[:20] + "..."
return caption
def _fallback_suitability_analysis(self):
"""备用风格分析"""
return {
'recommended_styles': ['creative', 'social', 'emotional'],
'most_suitable': 'creative',
'analysis': '创意文艺风格最适合表达照片的艺术美感'
}
def generate_photo_caption_deepseek(image_description, style='creative', length='medium'):
"""使用DeepSeek为照片生成文案"""
try:
copywriter = DeepSeekCopywriter()
return copywriter.generate_photo_caption(image_description, style, length)
except Exception as e:
raise Exception(f"DeepSeek文案生成失败: {str(e)}")
def generate_multiple_captions_deepseek(image_description, count=3, style='creative'):
"""使用DeepSeek生成多个文案选项"""
try:
copywriter = DeepSeekCopywriter()
return copywriter.generate_multiple_captions(image_description, count, style)
except Exception as e:
raise Exception(f"DeepSeek多文案生成失败: {str(e)}")
def analyze_photo_suitability_deepseek(image_description):
"""使用DeepSeek分析照片适合的文案风格"""
try:
copywriter = DeepSeekCopywriter()
return copywriter.analyze_photo_suitability(image_description)
except Exception as e:
raise Exception(f"DeepSeek风格分析失败: {str(e)}")
def check_deepseek_config():
"""检查DeepSeek配置是否完整"""
try:
api_key = os.getenv('DEEPSEEK_API_KEY')
if not api_key:
return False, "DeepSeek API密钥未配置"
# 测试连接
copywriter = DeepSeekCopywriter()
return True, "DeepSeek配置正确"
except Exception as e:
return False, f"DeepSeek配置错误: {str(e)}"
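`analyze_photo_suitability` asks the model for a comma-separated list of style display names, while the rest of the module works with keys such as `'creative'` and `'social'`. The sketch below shows one way to turn such a reply into those keys. The name-to-key mapping is inferred from the prompt text and `style_descriptions` in this file and is an assumption; `STYLE_KEYS` and `parse_style_list` are hypothetical names.

```python
import re

# Display names used in the prompt mapped to the style keys used elsewhere (inferred mapping)
STYLE_KEYS = {
    "创意文艺": "creative",
    "专业正式": "professional",
    "社交媒体": "social",
    "营销推广": "marketing",
    "情感表达": "emotional",
    "简单描述": "simple",
}


def parse_style_list(reply, limit=3):
    """Convert a model reply into at most `limit` known style keys, in order."""
    keys = []
    # Accept ASCII commas, full-width commas, enumeration commas, and newlines
    for part in re.split(r"[,，、\n]", reply):
        name = part.strip().strip('"“”\'')
        key = STYLE_KEYS.get(name)
        # Skip unknown names and duplicates
        if key and key not in keys:
            keys.append(key)
    return keys[:limit]
```

Unknown names are dropped rather than guessed, so an empty result can be used as the signal to fall back to a default style.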

77
utils/format_converter.py Normal file

@ -0,0 +1,77 @@
import json

import pandas as pd


def excel_to_csv(excel_path, csv_path):
    """Convert an Excel file to CSV."""
    try:
        df = pd.read_excel(excel_path)
        # utf-8-sig writes a BOM so that Excel detects the encoding
        df.to_csv(csv_path, index=False, encoding='utf-8-sig')
        return True
    except Exception as e:
        raise RuntimeError(f"Excel to CSV conversion failed: {e}") from e


def csv_to_excel(csv_path, excel_path):
    """Convert a CSV file to Excel."""
    try:
        df = pd.read_csv(csv_path)
        df.to_excel(excel_path, index=False)
        return True
    except Exception as e:
        raise RuntimeError(f"CSV to Excel conversion failed: {e}") from e


def json_to_excel(json_path, excel_path):
    """Convert a JSON file to Excel."""
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        if isinstance(data, list):
            # A JSON array becomes one row per element
            df = pd.DataFrame(data)
        else:
            # A single JSON object becomes a one-row DataFrame
            df = pd.DataFrame([data])
        df.to_excel(excel_path, index=False)
        return True
    except Exception as e:
        raise RuntimeError(f"JSON to Excel conversion failed: {e}") from e


def excel_to_json(excel_path, json_path):
    """Convert an Excel file to JSON."""
    try:
        df = pd.read_excel(excel_path)
        data = df.to_dict('records')
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        return True
    except Exception as e:
        raise RuntimeError(f"Excel to JSON conversion failed: {e}") from e


def csv_to_json(csv_path, json_path):
    """Convert a CSV file to JSON."""
    try:
        df = pd.read_csv(csv_path)
        data = df.to_dict('records')
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=2)
        return True
    except Exception as e:
        raise RuntimeError(f"CSV to JSON conversion failed: {e}") from e


def json_to_csv(json_path, csv_path):
    """Convert a JSON file to CSV."""
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        # Wrap a single object in a list, as json_to_excel does; passing a dict of
        # scalar values straight to DataFrame raises a ValueError
        df = pd.DataFrame(data if isinstance(data, list) else [data])
        df.to_csv(csv_path, index=False, encoding='utf-8-sig')
        return True
    except Exception as e:
        raise RuntimeError(f"JSON to CSV conversion failed: {e}") from e
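The JSON converters have to cope with two top-level shapes: a list of objects and a single object. The sketch below illustrates that normalization step using only the standard library, so it can be read independently of pandas. `json_text_to_csv_text` is a hypothetical name, and the sketch assumes every record is a flat JSON object.

```python
import csv
import io
import json


def json_text_to_csv_text(json_text):
    """Convert a JSON object or array of objects into CSV text."""
    data = json.loads(json_text)
    # A single object becomes a one-row table; a list is used as-is
    records = data if isinstance(data, list) else [data]
    # Collect column names in first-seen order across all records
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells
    writer.writerows(records)
    return buffer.getvalue()
```

Nested objects or arrays inside a record would need flattening first, which this sketch deliberately leaves out.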

73
utils/ocr_processor.py Normal file

@ -0,0 +1,73 @@
import os

import pytesseract
from PIL import Image


def extract_text_from_image(image_path, lang='chi_sim+eng', use_ai=False, ai_provider='aliyun'):
    """Extract text from an image (OCR)."""
    try:
        if use_ai:
            # Use an AI model for OCR
            if ai_provider == 'aliyun':
                from .aliyun_ocr import extract_text_with_aliyun
                return extract_text_with_aliyun(image_path, 'general')
            raise ValueError(f"Unsupported AI provider: {ai_provider}")

        # Traditional Tesseract OCR.
        # Prefer TESSERACT_PATH (see .env.example) when it points at a real file and
        # fall back to the default Windows install location. This assumes the
        # application has already loaded the .env file.
        tesseract_path = os.getenv('TESSERACT_PATH')
        if tesseract_path and os.path.isfile(tesseract_path):
            pytesseract.pytesseract.tesseract_cmd = tesseract_path
        elif os.name == 'nt':  # Windows
            pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

        # Open the image and run OCR; the context manager closes the file handle
        with Image.open(image_path) as image:
            text = pytesseract.image_to_string(image, lang=lang)
        return text.strip()
    except Exception as e:
        raise RuntimeError(f"Image text recognition failed: {e}") from e


def extract_text_with_ai(image_path, provider='aliyun', ocr_type='general', options=None):
    """Recognize text in an image using an AI model."""
    try:
        if provider == 'aliyun':
            from .aliyun_ocr import extract_text_with_aliyun
            return extract_text_with_aliyun(image_path, ocr_type, options)
        raise ValueError(f"Unsupported AI provider: {provider}")
    except Exception as e:
        raise RuntimeError(f"AI OCR recognition failed: {e}") from e


def image_to_text_file(image_path, output_path):
    """Save the text recognized in an image to a text file."""
    try:
        text = extract_text_from_image(image_path)
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)
        return True
    except Exception as e:
        raise RuntimeError(f"Image to text file conversion failed: {e}") from e


def image_to_excel(image_path, output_path):
    """Save the text recognized in an image to an Excel file."""
    try:
        import pandas as pd
        text = extract_text_from_image(image_path)
        # Split the text into non-empty lines
        lines = [line.strip() for line in text.split('\n') if line.strip()]
        # One row per line: line number and content
        df = pd.DataFrame({
            '行号': range(1, len(lines) + 1),
            '内容': lines
        })
        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise RuntimeError(f"Image to Excel conversion failed: {e}") from e
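`.env.example` defines `TESSERACT_PATH` for locating the Tesseract binary. One possible lookup order is: the configured path, then whatever is on `PATH`, then the default Windows install location. The sketch below shows that order using only the standard library. `resolve_tesseract_cmd`, its parameters, and the lookup order itself are assumptions for illustration.

```python
import os
import shutil

DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(env=None, os_name=None):
    """Return the Tesseract command to use, or None if nothing is found."""
    env = os.environ if env is None else env
    os_name = os.name if os_name is None else os_name
    # 1. An explicitly configured path wins
    configured = env.get("TESSERACT_PATH")
    if configured:
        return configured
    # 2. Otherwise use an executable found on PATH
    found = shutil.which("tesseract")
    if found:
        return found
    # 3. Fall back to the usual Windows location; other systems get None
    return DEFAULT_WINDOWS_PATH if os_name == "nt" else None
```

The returned value could then be assigned to `pytesseract.pytesseract.tesseract_cmd` when it is not `None`.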

52
utils/pdf_extractor.py Normal file

@ -0,0 +1,52 @@
import fitz  # PyMuPDF
import pandas as pd


def extract_text_from_pdf(pdf_path):
    """Extract the text content of a PDF."""
    try:
        # The context manager closes the document even if extraction fails
        with fitz.open(pdf_path) as doc:
            return "".join(page.get_text() for page in doc)
    except Exception as e:
        raise RuntimeError(f"PDF text extraction failed: {e}") from e


def extract_tables_from_pdf(pdf_path):
    """Extract tables from a PDF as a list of tables, each a list of rows."""
    try:
        tables = []
        with fitz.open(pdf_path) as doc:
            for page in doc:
                # Page.find_tables() requires PyMuPDF 1.23 or later. Detection is
                # heuristic and may miss tables that have no ruled lines.
                for table in page.find_tables().tables:
                    tables.append(table.extract())
        return tables
    except Exception as e:
        raise RuntimeError(f"PDF table extraction failed: {e}") from e


def pdf_to_excel(pdf_path, output_path):
    """Export the text content of a PDF to Excel."""
    try:
        text = extract_text_from_pdf(pdf_path)
        # Split the text into paragraphs on blank lines
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
        # One row per paragraph: paragraph number and content
        df = pd.DataFrame({
            '段落编号': range(1, len(paragraphs) + 1),
            '内容': paragraphs
        })
        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise RuntimeError(f"PDF to Excel conversion failed: {e}") from e
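`pdf_to_excel` splits extracted text into paragraphs on blank lines. Extracted text often contains Windows line endings or lines holding only spaces, which a plain `'\n\n'` split does not treat as paragraph breaks. The sketch below shows a slightly more tolerant splitter; `split_paragraphs` is a hypothetical helper, not part of this commit.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs separated by one or more blank lines."""
    # Normalize Windows and old Mac line endings first
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # A break is a newline, optional spaces or tabs, then at least one more newline
    blocks = re.split(r"\n[ \t]*\n+", normalized)
    return [block.strip() for block in blocks if block.strip()]
```

Line breaks inside a paragraph are preserved, so callers can decide whether to join wrapped lines.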


@ -0,0 +1,366 @@
#!/usr/bin/env python3
"""
照片评分建议生成器
为照片评分结果提供具体的改进建议
"""
class PhotoAdviceGenerator:
"""照片建议生成器类"""
def __init__(self):
self.quality_advice_db = self._init_quality_advice()
self.aesthetic_advice_db = self._init_aesthetic_advice()
self.technical_advice_db = self._init_technical_advice()
def _init_quality_advice(self):
"""初始化质量改进建议数据库"""
return {
'clarity': {
'low': [
"使用三脚架或稳定设备减少抖动",
"提高快门速度避免运动模糊",
"使用自动对焦确保主体清晰",
"清洁镜头避免污渍影响",
"在光线充足的环境下拍摄"
],
'medium': [
"微调对焦点确保主体清晰",
"使用更高的分辨率设置",
"避免过度压缩图像",
"后期适当锐化处理"
],
'high': [
"清晰度优秀,继续保持",
"可尝试更高难度的拍摄场景"
]
},
'brightness': {
'low': [
"增加曝光补偿",
"使用闪光灯或补光设备",
"选择光线更好的拍摄时间",
"提高ISO感光度注意噪点",
"使用反光板补光"
],
'medium': [
"微调曝光参数",
"使用HDR模式拍摄",
"注意高光和阴影的平衡",
"后期调整亮度曲线"
],
'high': [
"亮度适中,曝光准确",
"可尝试创意光影效果"
]
},
'contrast': {
'low': [
"增加画面明暗对比",
"选择色彩对比强烈的场景",
"使用侧光或逆光增强立体感",
"后期调整对比度参数"
],
'medium': [
"适当增强局部对比",
"注意高光不过曝,阴影不死黑",
"使用曲线工具精细调整"
],
'high': [
"对比度良好,层次分明",
"可尝试高对比风格创作"
]
},
'color_balance': {
'low': [
"校正白平衡设置",
"使用灰卡进行色彩校准",
"避免混合光源造成的色偏",
"后期校正色彩平衡"
],
'medium': [
"微调色温和色调",
"注意肤色还原自然",
"统一画面色彩风格"
],
'high': [
"色彩平衡优秀,还原准确",
"可尝试创意色彩风格"
]
}
}
def _init_aesthetic_advice(self):
"""初始化美学改进建议数据库"""
return {
'composition': {
'basic': [
"学习三分法则构图",
"注意主体在画面中的位置",
"避免主体过于居中",
"利用引导线增强画面深度"
],
'intermediate': [
"尝试对称或不对称构图",
"利用前景增强层次感",
"注意画面元素的平衡",
"创造视觉焦点"
],
'advanced': [
"构图优秀,可尝试更复杂构图",
"探索极简或复杂构图风格",
"注重画面节奏和韵律"
]
},
'lighting': {
'basic': [
"选择黄金时刻拍摄(日出日落)",
"避免正午强光直射",
"学习使用自然光",
"注意光影方向和质量"
],
'intermediate': [
"尝试侧光或逆光效果",
"利用阴影创造氛围",
"控制光比避免过曝或欠曝",
"学习使用人造光源"
],
'advanced': [
"光线运用娴熟,可尝试创意用光",
"探索特殊光线条件拍摄",
"注重光影的情感表达"
]
},
'subject': {
'basic': [
"明确拍摄主体",
"简化背景突出主体",
"注意主体与环境的互动",
"选择有故事性的主体"
],
'intermediate': [
"注重主体的表情和姿态",
"创造主体与环境的关系",
"捕捉决定性瞬间",
"注重主体的个性表达"
],
'advanced': [
"主体表现力强,可尝试更深层次表达",
"探索抽象或概念性主体",
"注重主体的象征意义"
]
}
}
def _init_technical_advice(self):
"""初始化技术改进建议数据库"""
return {
'camera_settings': [
"学习曝光三角关系光圈、快门、ISO",
"根据场景选择合适的拍摄模式",
"掌握对焦技巧确保主体清晰",
"合理使用白平衡设置"
],
'post_processing': [
"学习基本的后期调整技巧",
"掌握色彩校正和调整",
"学习锐化和降噪处理",
"尝试创意滤镜效果"
],
'equipment': [
"根据需求选择合适的镜头",
"考虑使用三脚架提高稳定性",
"投资质量好的存储设备",
"定期清洁和维护设备"
],
'shooting_techniques': [
"练习稳定的持机姿势",
"学习不同的拍摄角度",
"掌握连拍和定时拍摄",
"尝试慢门或高速摄影"
]
}
def generate_quality_advice(self, quality_scores):
"""生成质量改进建议"""
advice = {
'overall': [],
'specific': {},
'priority': []
}
# 总体建议
overall_score = sum(quality_scores.values()) / len(quality_scores)
if overall_score >= 90:
advice['overall'].append("照片质量优秀,继续保持高水平拍摄")
elif overall_score >= 80:
advice['overall'].append("照片质量良好,有进一步提升空间")
elif overall_score >= 60:
advice['overall'].append("照片质量一般,需要重点改进")
else:
advice['overall'].append("照片质量较差,建议系统学习摄影基础")
# 具体维度建议
for dimension, score in quality_scores.items():
if dimension in self.quality_advice_db:
level = self._get_score_level(score)
dimension_advice = self.quality_advice_db[dimension].get(level, [])
advice['specific'][dimension] = dimension_advice
# 添加优先级建议
if score < 70:
advice['priority'].append(f"优先改进{dimension}(当前{score}分)")
return advice
def generate_aesthetic_advice(self, aesthetic_score, composition_analysis):
"""生成美学改进建议"""
advice = {
'general': [],
'composition': [],
'lighting': [],
'subject': [],
'creative': []
}
# 总体美学建议
if aesthetic_score >= 90:
advice['general'].append("美学表现优秀,具备专业水准")
advice['creative'].append("可尝试更具挑战性的创意拍摄")
elif aesthetic_score >= 80:
advice['general'].append("美学表现良好,细节有待提升")
advice['creative'].append("尝试不同的构图和用光方式")
elif aesthetic_score >= 60:
advice['general'].append("美学表现一般,需要系统学习")
advice['creative'].append("从基础构图和用光开始练习")
else:
advice['general'].append("美学表现较差,建议学习摄影美学基础")
# 构图建议
comp_level = self._get_aesthetic_level(aesthetic_score)
advice['composition'] = self.aesthetic_advice_db['composition'].get(comp_level, [])
# 用光建议
light_level = self._get_aesthetic_level(aesthetic_score)
advice['lighting'] = self.aesthetic_advice_db['lighting'].get(light_level, [])
# 主体建议
subject_level = self._get_aesthetic_level(aesthetic_score)
advice['subject'] = self.aesthetic_advice_db['subject'].get(subject_level, [])
return advice
    def generate_technical_advice(self, photo_type='general'):
        """Generate technical improvement advice."""
        # Copy each list so the per-type additions below do not modify the
        # shared advice database on every call
        advice = {
            'camera_settings': list(self.technical_advice_db['camera_settings']),
            'post_processing': list(self.technical_advice_db['post_processing']),
            'equipment': list(self.technical_advice_db['equipment']),
            'shooting_techniques': list(self.technical_advice_db['shooting_techniques'])
        }
        # Adjust the advice to the photo type
        if photo_type == 'portrait':
            advice['camera_settings'].extend([
                "使用大光圈虚化背景",
                "注意对焦在眼睛上",
                "使用柔光设备美化肤色"
            ])
        elif photo_type == 'landscape':
            advice['camera_settings'].extend([
                "使用小光圈获得大景深",
                "使用三脚架确保稳定性",
                "利用滤镜控制光线"
            ])
        elif photo_type == 'macro':
            advice['camera_settings'].extend([
                "使用微距镜头或近摄环",
                "注意景深控制",
                "使用环形闪光灯补光"
            ])
        return advice
def generate_personalized_advice(self, quality_scores, aesthetic_score, photo_content):
"""生成个性化综合建议"""
personalized = {
'quick_wins': [],
'long_term_improvements': [],
'learning_resources': [],
'practice_exercises': []
}
# 快速改进建议
low_score_dimensions = [dim for dim, score in quality_scores.items() if score < 70]
if low_score_dimensions:
personalized['quick_wins'].append(f"重点改进:{', '.join(low_score_dimensions)}")
# 长期改进建议
if aesthetic_score < 80:
personalized['long_term_improvements'].append("系统学习摄影构图和用光")
# 学习资源推荐
personalized['learning_resources'].extend([
"推荐书籍:《摄影构图学》、《美国纽约摄影学院教材》",
"在线课程B站摄影教程、摄影之友",
"实践平台:参加摄影比赛、加入摄影社群"
])
# 练习建议
personalized['practice_exercises'].extend([
"每日拍摄练习:同一主题不同角度",
"技术练习:曝光、对焦、白平衡",
"创意练习:尝试不同风格和主题"
])
return personalized
def _get_score_level(self, score):
"""根据分数获取等级"""
if score >= 85:
return 'high'
elif score >= 70:
return 'medium'
else:
return 'low'
def _get_aesthetic_level(self, score):
"""根据美学分数获取等级"""
if score >= 85:
return 'advanced'
elif score >= 70:
return 'intermediate'
else:
return 'basic'


def get_quality_improvement_advice(quality_scores):
    """Get quality improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_quality_advice(quality_scores)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_aesthetic_improvement_advice(aesthetic_score, composition_analysis=None):
    """Get aesthetic improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_aesthetic_advice(aesthetic_score, composition_analysis)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_technical_advice(photo_type='general'):
    """Get technical improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_technical_advice(photo_type)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_personalized_advice(quality_scores, aesthetic_score, photo_content):
    """Get personalized overall advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_personalized_advice(quality_scores, aesthetic_score, photo_content)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}
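
`generate_technical_advice` appends type-specific tips with `extend()` to a list taken from `technical_advice_db`. The sketch below illustrates why it matters whether that list is copied first. `TinyAdvisor`, its two methods, and the sample tips are hypothetical stand-ins for illustration, not the project's `PhotoAdviceGenerator`.

```python
class TinyAdvisor:
    """Hypothetical stand-in used only to illustrate list aliasing."""

    def __init__(self):
        # Sample data invented for this sketch.
        self.db = {'camera_settings': ["Shoot in RAW"]}

    def advice_shared(self, photo_type):
        # Reuses the stored list, so extend() mutates the database itself.
        settings = self.db['camera_settings']
        if photo_type == 'portrait':
            settings.extend(["Focus on the eyes"])
        return settings

    def advice_copied(self, photo_type):
        # Copies first, so the stored list is left untouched.
        settings = list(self.db['camera_settings'])
        if photo_type == 'portrait':
            settings.extend(["Focus on the eyes"])
        return settings


shared = TinyAdvisor()
shared.advice_shared('portrait')
shared_second = shared.advice_shared('portrait')

copied = TinyAdvisor()
copied.advice_copied('portrait')
copied_second = copied.advice_copied('portrait')

print(len(shared_second))  # 3: the tip was appended twice to the stored list
print(len(copied_second))  # 2: each call starts from the original list
```

With the shared list, every portrait request adds another duplicate tip for as long as the instance lives. Copying costs one small list allocation per call.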

99
utils/web_scraper.py Normal file

@ -0,0 +1,99 @@
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
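

def scrape_webpage(url, selector=None):
    """Fetch a web page and return its text content.

    Returns a list of strings when a CSS selector is given,
    otherwise the page text as a single newline-separated string.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        if selector:
            # Extract only the elements matching the CSS selector
            elements = soup.select(selector)
            content = [elem.get_text(strip=True) for elem in elements]
        else:
            # Extract all text; the newline separator keeps text nodes apart
            # so callers can split the result into paragraphs
            content = soup.get_text(separator='\n', strip=True)
        return content
    except Exception as e:
        raise Exception(f"Web page scraping failed: {e}") from e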


def scrape_table_from_webpage(url, table_index=0):
    """Extract one table from a web page.

    Returns (headers, data), or None when the page contains no table.
    """
    try:
        # Named request_headers so it is not shadowed by the table headers below
        request_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=request_headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        tables = soup.find_all('table')
        if not tables:
            return None
        table = tables[table_index]
        # Extract the header row (the first <tr>)
        headers = []
        header_row = table.find('tr')
        if header_row:
            headers = [th.get_text(strip=True) for th in header_row.find_all(['th', 'td'])]
        # Extract the data rows
        data = []
        rows = table.find_all('tr')[1:]  # skip the header row
        for row in rows:
            cells = row.find_all(['td', 'th'])
            row_data = [cell.get_text(strip=True) for cell in cells]
            if row_data:
                data.append(row_data)
        return headers, data
    except Exception as e:
        raise Exception(f"Web page table extraction failed: {e}") from e


def web_to_excel(url, output_path, selector=None):
    """Export web page content to an Excel file."""
    try:
        if selector:
            content = scrape_webpage(url, selector)
            if isinstance(content, list):
                df = pd.DataFrame({
                    'No.': range(1, len(content) + 1),
                    'Content': content
                })
            else:
                df = pd.DataFrame({'Content': [content]})
        else:
            # Try to extract a table first
            table_data = scrape_table_from_webpage(url)
            if table_data:
                headers, data = table_data
                df = pd.DataFrame(data, columns=headers)
            else:
                # Fall back to plain text
                content = scrape_webpage(url)
                # Split into paragraphs on newlines
                paragraphs = [p.strip() for p in re.split(r'\n+', content) if p.strip()]
                df = pd.DataFrame({
                    'Paragraph No.': range(1, len(paragraphs) + 1),
                    'Content': paragraphs
                })
        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise Exception(f"Web page to Excel conversion failed: {e}") from e
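
`web_to_excel` passes the scraped rows straight to `pd.DataFrame(data, columns=headers)`. Tables on real pages often contain rows with merged or stray cells, and a row with more cells than the header list may make that constructor raise. The sketch below normalizes the rows first, assuming that padding short rows with empty strings and dropping surplus cells is acceptable. `normalize_rows` and the sample data are hypothetical, not part of the module.

```python
def normalize_rows(headers, rows, fill=""):
    """Pad or truncate each row so it has exactly len(headers) cells."""
    width = len(headers)
    normalized = []
    for row in rows:
        cells = list(row[:width])                     # drop surplus cells
        cells.extend([fill] * (width - len(cells)))   # pad short rows
        normalized.append(cells)
    return normalized


# Sample data invented for this sketch.
headers = ["Name", "Score", "Note"]
rows = [["A", "90"], ["B", "85", "ok", "extra"], ["C", "70", ""]]

clean = normalize_rows(headers, rows)
for row in clean:
    print(row)
```

Truncation silently discards data, so a caller that cannot afford that could instead extend `headers` with placeholder column names up to the widest row.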

4856
uv.lock generated Normal file

File diff suppressed because it is too large