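feat: complete data extractor and converter project
- Add MDF file export
- Integrate Aliyun OCR (large-model recognition)
- Add Baidu AI Cloud photo scoring
- Integrate DeepSeek for creative copywriting
- Improve documentation and configuration management
- Manage dependencies with uv
- Add a complete .gitignore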
This commit is contained in:
commit
2ec2c0a1ab
35
.env.example
Normal file
@ -0,0 +1,35 @@
# Data Extractor and Converter - example environment variables

# Flask application secret key (change this in production)
SECRET_KEY=your-secret-key-here

# Tesseract OCR path (required on Windows)
TESSERACT_PATH=C:\\Program Files\\Tesseract-OCR\\tesseract.exe

# Database connection (optional)
DATABASE_URI=sqlite:///data.db

# MySQL configuration example
# DATABASE_URI=mysql+pymysql://username:password@localhost/database_name

# Aliyun OCR
ALIYUN_ACCESS_KEY_ID=your-aliyun-access-key-id
ALIYUN_ACCESS_KEY_SECRET=your-aliyun-access-key-secret

# Baidu AI Cloud (image analysis)
BAIDU_API_KEY=your-baidu-api-key
BAIDU_SECRET_KEY=your-baidu-secret-key

# DeepSeek (creative copywriting)
DEEPSEEK_API_KEY=your-deepseek-api-key

# Aliyun DashScope (fallback copywriting)
DASHSCOPE_API_KEY=your-dashscope-api-key

# Photo advice generation
PHOTO_ADVICE_ENABLED=true

# Application settings
DEBUG=false
HOST=0.0.0.0
PORT=5000
81
.gitignore
vendored
Normal file
@ -0,0 +1,81 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Environment variables
.env
.env.local
.env.development.local
.env.test.local
.env.production.local

# IDE
.vscode/
.idea/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Logs
*.log
logs/

# Database
*.db
*.sqlite
*.sqlite3

# Temporary files
temp/
tmp/

# Uploads
uploads/

# Streamlit
.streamlit/

# UV
.venv/
venv/
ENV/

# Package files
*.tar.gz
*.whl

# Test coverage
.coverage
htmlcov/
.pytest_cache/

# Jupyter
.ipynb_checkpoints

# Documentation
_site/
.sass-cache/
.jekyll-metadata
130
ALIYUN_OCR_SETUP.md
Normal file
@ -0,0 +1,130 @@
|
||||
# 阿里云OCR配置指南
|
||||
|
||||
## 📋 概述
|
||||
|
||||
数据提取与转换器现在支持使用阿里云AI大模型进行图片文字识别,相比传统OCR具有更高的准确率和更好的中文支持。
|
||||
|
||||
## 🔑 获取阿里云AccessKey
|
||||
|
||||
### 1. 注册阿里云账号
|
||||
- 访问: https://www.aliyun.com
|
||||
- 注册并完成实名认证
|
||||
|
||||
### 2. 开通OCR服务
|
||||
- 登录阿里云控制台
|
||||
- 搜索"OCR"或访问: https://www.aliyun.com/product/ocr
|
||||
- 开通"通用文字识别"服务
|
||||
|
||||
### 3. 获取AccessKey
|
||||
1. 进入控制台 → 鼠标悬停头像 → AccessKey管理
|
||||
2. 创建AccessKey(或使用现有Key)
|
||||
3. 记录以下信息:
|
||||
- AccessKey ID
|
||||
- AccessKey Secret
|
||||
|
||||
## ⚙️ 配置环境变量
|
||||
|
||||
在`.env`文件中添加阿里云配置:
|
||||
|
||||
```env
|
||||
# 阿里云OCR配置
|
||||
ALIYUN_ACCESS_KEY_ID=您的AccessKey ID
|
||||
ALIYUN_ACCESS_KEY_SECRET=您的AccessKey Secret
|
||||
ALIYUN_OCR_ENDPOINT=ocr-api.cn-hangzhou.aliyuncs.com
|
||||
```
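
The check the application performs on these variables can be approximated as follows. This is a minimal sketch, not the project's actual `utils.aliyun_ocr` code: the function name `missing_aliyun_settings` is hypothetical, and treating values that start with `your-` as unset is an assumption based on the placeholders in `.env.example`.

```python
import os

# Variable names taken from the configuration shown above.
REQUIRED_VARS = ("ALIYUN_ACCESS_KEY_ID", "ALIYUN_ACCESS_KEY_SECRET")


def missing_aliyun_settings(env=None):
    """Return the required variables that are unset, empty, or still placeholders.

    Hypothetical helper: `env` is any mapping and defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = (env.get(name) or "").strip()
        # Assumption: values beginning with "your-" are untouched placeholders
        # copied from .env.example.
        if not value or value.startswith("your-"):
            missing.append(name)
    return missing
```

An empty list means both credentials are present; it says nothing about whether the service accepts them.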
|
||||
|
||||
## 💰 费用说明
|
||||
|
||||
### 免费额度
|
||||
- 新用户通常有免费调用额度
|
||||
- 具体额度请查看阿里云OCR产品页面
|
||||
|
||||
### 计费方式
|
||||
- 按调用次数计费
|
||||
- 具体价格请参考阿里云官方定价
|
||||
|
||||
## 🎯 功能对比
|
||||
|
||||
| Feature | Traditional OCR (Tesseract) | AI large-model OCR (Aliyun) |
|------|-------------------|---------------------|
| **Installation effort** | Moderate (software must be installed) | Low (only keys to configure) |
| **Recognition accuracy** | Fair | Very high |
| **Chinese support** | Good | Excellent |
| **Complex images** | Poor | Excellent |
| **Cost** | Free | Billed per call |
| **Processing speed** | Fast | Moderate (network-dependent) |
|
||||
|
||||
## 🔧 故障排除
|
||||
|
||||
### 常见问题
|
||||
|
||||
**1. "阿里云AccessKey未配置"**
|
||||
- 检查.env文件中是否已配置ALIYUN_ACCESS_KEY_ID和ALIYUN_ACCESS_KEY_SECRET
|
||||
- 确保AccessKey正确无误
|
||||
|
||||
**2. "权限不足"**
|
||||
- 确认已开通OCR服务
|
||||
- 检查AccessKey是否有OCR服务权限
|
||||
|
||||
**3. "网络连接失败"**
|
||||
- 检查网络连接
|
||||
- 确认防火墙未阻止请求
|
||||
|
||||
**4. "额度不足"**
|
||||
- 检查阿里云账户余额
|
||||
- 确认免费额度是否已用完
|
||||
|
||||
### 测试配置
|
||||
|
||||
使用以下命令测试阿里云OCR配置:
|
||||
|
||||
```bash
|
||||
cd data-extractor-converter
|
||||
uv run python -c "from utils.aliyun_ocr import check_aliyun_config; print(check_aliyun_config())"
|
||||
```
|
||||
|
||||
## 🚀 使用说明
|
||||
|
||||
### 在应用中使用
|
||||
|
||||
1. 访问应用 → 选择"🖼️ 图片OCR"功能
|
||||
2. 选择"AI大模型OCR (阿里云)"模式
|
||||
3. 上传图片文件
|
||||
4. 点击"识别文字"或导出按钮
|
||||
|
||||
### 支持的图片格式
|
||||
- JPG/JPEG
|
||||
- PNG
|
||||
- GIF
|
||||
- BMP
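
A file can be screened against this list before any billable API call is made. The sketch below is illustrative only: `is_supported_image` is a hypothetical helper, and it checks the file extension rather than the actual image data.

```python
from pathlib import Path

# Extensions corresponding to the formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def is_supported_image(filename):
    """Return True if the file extension matches a supported format (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```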
|
||||
|
||||
### 识别类型
|
||||
- **通用文字识别** - 普通图片中的文字
|
||||
- **表格识别** - 表格数据提取
|
||||
- **高级识别** - 复杂场景文字识别
|
||||
|
||||
## 💡 最佳实践
|
||||
|
||||
### 图片优化建议
|
||||
1. **清晰度**: 确保图片清晰,文字可读
|
||||
2. **分辨率**: 建议300dpi以上
|
||||
3. **背景**: 尽量使用纯色背景
|
||||
4. **角度**: 保持文字水平
|
||||
|
||||
### 成本控制
|
||||
1. **批量处理**: 尽量批量处理图片
|
||||
2. **图片预处理**: 先裁剪和优化图片
|
||||
3. **监控使用**: 定期查看阿里云使用量
|
||||
|
||||
## 📚 相关资源
|
||||
|
||||
- [阿里云OCR文档](https://help.aliyun.com/product/30419.html)
|
||||
- [AccessKey管理](https://ram.console.aliyun.com/manage/ak)
|
||||
- [OCR产品定价](https://www.aliyun.com/price/product#/ocr/detail)
|
||||
|
||||
## ⚠️ 注意事项
|
||||
|
||||
1. **安全性**: 不要将AccessKey提交到版本控制系统
|
||||
2. **费用**: 注意监控使用量,避免意外费用
|
||||
3. **网络**: AI OCR需要稳定的网络连接
|
||||
4. **备份**: 重要数据建议使用传统OCR作为备份方案
|
||||
166
BAIDU_AI_SETUP.md
Normal file
@ -0,0 +1,166 @@
|
||||
# 百度智能云AI照片评分配置指南
|
||||
|
||||
## 📋 概述
|
||||
|
||||
数据提取与转换器现在支持使用百度智能云AI大模型进行照片质量评分和内容分析,为您的照片提供专业的智能化评估。
|
||||
|
||||
## 🔑 获取百度智能云API密钥
|
||||
|
||||
### 1. 注册百度智能云账号
|
||||
- 访问: https://cloud.baidu.com
|
||||
- 注册并完成实名认证
|
||||
|
||||
### 2. 开通图像分析服务
|
||||
1. 登录百度智能云控制台
|
||||
2. 搜索"图像分析"或访问: https://cloud.baidu.com/product/imageprocess.html
|
||||
3. 开通"图像分析"或"图像识别"服务
|
||||
|
||||
### 3. 创建应用获取API密钥
|
||||
1. 进入控制台 → 产品服务 → 图像分析
|
||||
2. 创建新应用
|
||||
3. 记录以下信息:
|
||||
- API Key
|
||||
- Secret Key
|
||||
|
||||
## ⚙️ 配置环境变量
|
||||
|
||||
在`.env`文件中添加百度智能云配置:
|
||||
|
||||
```env
|
||||
# 百度智能云配置(图像分析)
|
||||
BAIDU_API_KEY=您的API Key
|
||||
BAIDU_SECRET_KEY=您的Secret Key
|
||||
```
|
||||
|
||||
## 💰 费用说明
|
||||
|
||||
### 免费额度
|
||||
- 新用户通常有免费调用额度
|
||||
- 具体额度请查看百度智能云产品页面
|
||||
|
||||
### 计费方式
|
||||
- 按调用次数计费
|
||||
- 具体价格请参考百度智能云官方定价
|
||||
|
||||
## 🎯 功能特点
|
||||
|
||||
### 1. **照片质量评分** 📊
|
||||
- **总体评分**: 0-100分的综合质量评估
|
||||
- **质量维度**: 清晰度、亮度、对比度、色彩平衡
|
||||
- **改进建议**: 针对性的优化建议
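
To make the relationship between the overall score and the per-dimension scores concrete, the sketch below picks out the weakest dimensions from a result dictionary. The dictionary shape (`score` plus a `dimensions` mapping whose entries carry `score` and `comment`) mirrors what `app.py` reads from `analyze_image_quality`. The sample values, the threshold of 60, and the `weakest_dimensions` helper are invented for illustration and do not reflect real service output.

```python
def weakest_dimensions(quality_result, threshold=60):
    """Return (name, score) pairs for dimensions scoring below `threshold`, lowest first.

    Hypothetical helper; the default threshold is an arbitrary illustrative choice.
    """
    low = [
        (name, info["score"])
        for name, info in quality_result["dimensions"].items()
        if info["score"] < threshold
    ]
    return sorted(low, key=lambda item: item[1])


# Made-up sample in the shape that app.py expects.
sample = {
    "score": 68,
    "dimensions": {
        "sharpness": {"score": 82, "comment": "good"},
        "brightness": {"score": 55, "comment": "slightly dark"},
        "contrast": {"score": 40, "comment": "flat"},
        "color balance": {"score": 75, "comment": "acceptable"},
    },
}

for name, score in weakest_dimensions(sample):
    print(f"{name}: {score}/100")
```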
|
||||
|
||||
### 2. **照片内容分析** 🔍
|
||||
- **对象识别**: 自动识别照片中的物体和场景
|
||||
- **内容摘要**: 智能生成照片内容描述
|
||||
- **百度百科**: 关联对象的详细信息
|
||||
|
||||
### 3. **照片美学评分** 🎨
|
||||
- **美学评分**: 构图、色彩、光线等美学维度
|
||||
- **美学建议**: 提升照片美感的专业建议
|
||||
- **艺术指导**: 摄影技巧和构图建议
|
||||
|
||||
## 🔧 故障排除
|
||||
|
||||
### 常见问题
|
||||
|
||||
**1. "百度智能云API密钥未配置"**
|
||||
- 检查.env文件中是否已配置BAIDU_API_KEY和BAIDU_SECRET_KEY
|
||||
- 确保API密钥正确无误
|
||||
|
||||
**2. "权限不足"**
|
||||
- 确认已开通图像分析服务
|
||||
- 检查API密钥是否有相应服务权限
|
||||
|
||||
**3. "网络连接失败"**
|
||||
- 检查网络连接
|
||||
- 确认防火墙未阻止请求
|
||||
|
||||
**4. "额度不足"**
|
||||
- 检查百度智能云账户余额
|
||||
- 确认免费额度是否已用完
|
||||
|
||||
### 测试配置
|
||||
|
||||
使用以下命令测试百度智能云配置:
|
||||
|
||||
```bash
|
||||
cd data-extractor-converter
|
||||
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
|
||||
```
|
||||
|
||||
## 🚀 使用说明
|
||||
|
||||
### 在应用中使用
|
||||
|
||||
1. 访问应用 → 选择"📸 AI照片评分"功能
|
||||
2. 上传照片文件
|
||||
3. 选择分析类型:
|
||||
- **质量评分**: 评估照片技术质量
|
||||
- **内容分析**: 识别照片内容
|
||||
- **美学评分**: 评估照片艺术价值
|
||||
|
||||
### 支持的图片格式
|
||||
- JPG/JPEG
|
||||
- PNG
|
||||
- GIF
|
||||
- BMP
|
||||
|
||||
### 分析类型说明
|
||||
|
||||
#### 质量评分 📊
|
||||
- **适用场景**: 技术质量评估、照片优化
|
||||
- **输出内容**: 综合评分、维度分析、改进建议
|
||||
- **使用建议**: 适合评估照片的技术质量
|
||||
|
||||
#### 内容分析 🔍
|
||||
- **适用场景**: 内容识别、场景理解
|
||||
- **输出内容**: 对象识别、内容摘要、百科信息
|
||||
- **使用建议**: 适合了解照片内容和场景
|
||||
|
||||
#### 美学评分 🎨
|
||||
- **适用场景**: 艺术评估、摄影学习
|
||||
- **输出内容**: 美学评分、构图分析、艺术建议
|
||||
- **使用建议**: 适合评估照片的艺术价值
|
||||
|
||||
## 💡 最佳实践
|
||||
|
||||
### 照片优化建议
|
||||
1. **清晰度**: 确保照片清晰,避免模糊
|
||||
2. **光线**: 使用自然光,避免过暗或过亮
|
||||
3. **构图**: 遵循三分法则,保持画面平衡
|
||||
4. **格式**: 使用高质量JPG或PNG格式
|
||||
|
||||
### 成本控制
|
||||
1. **批量处理**: 尽量批量分析照片
|
||||
2. **选择性分析**: 根据需要选择分析类型
|
||||
3. **监控使用**: 定期查看使用量统计
|
||||
|
||||
## 📚 相关资源
|
||||
|
||||
- [百度智能云图像分析文档](https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e)
|
||||
- [API密钥管理](https://console.bce.baidu.com/iam/#/iam/accesslist)
|
||||
- [产品定价](https://cloud.baidu.com/product/imageprocess.html#pricing)
|
||||
|
||||
## ⚠️ 注意事项
|
||||
|
||||
1. **安全性**: 不要将API密钥提交到版本控制系统
|
||||
2. **费用**: 注意监控使用量,避免意外费用
|
||||
3. **网络**: AI分析需要稳定的网络连接
|
||||
4. **隐私**: 避免上传包含敏感信息的照片
|
||||
|
||||
## 🌟 应用场景
|
||||
|
||||
### 个人使用
|
||||
- 评估手机照片质量
|
||||
- 学习摄影技巧
|
||||
- 优化社交媒体图片
|
||||
|
||||
### 教育使用
|
||||
- 摄影课程作业评估
|
||||
- 图像处理学习
|
||||
- 艺术创作指导
|
||||
|
||||
### 专业使用
|
||||
- 摄影师作品评估
|
||||
- 图像质量监控
|
||||
- 内容识别分析
|
||||
124
BAIDU_API_GUIDE.md
Normal file
@ -0,0 +1,124 @@
|
||||
# 百度智能云API密钥正确获取指南
|
||||
|
||||
## 🔍 问题诊断
|
||||
|
||||
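The `unknown client id` error you are seeing indicates that the API key currently configured is in the wrong format. A Baidu AI Cloud API key should be a plain alphanumeric string, not the format you configured earlier.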
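
For context, the API Key is sent as the `client_id` parameter when an application requests an access token, which is why an unrecognized key produces this message. The sketch below only builds the request URL and does not contact the network. The endpoint and parameter names follow Baidu's OAuth client-credentials flow as commonly documented; treat them as assumptions and confirm them against the official documentation. `build_token_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed token endpoint of the Baidu AI open platform; verify before relying on it.
TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key, secret_key):
    """Build the access-token request URL from an application's API Key and Secret Key."""
    query = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": api_key,         # the application's API Key
            "client_secret": secret_key,  # the application's Secret Key
        }
    )
    return f"{TOKEN_ENDPOINT}?{query}"


print(build_token_url("AbCdEfGhIjKlMnOp", "AbCdEfGhIjKlMnOpQrStUvWxYz012345"))
```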
|
||||
|
||||
## ✅ 正确获取API密钥的步骤
|
||||
|
||||
### 1. **访问百度智能云控制台**
|
||||
- 打开: https://console.bce.baidu.com/
|
||||
- 使用百度账号登录
|
||||
|
||||
### 2. **开通图像分析服务**
|
||||
1. 在控制台搜索栏输入"图像分析"
|
||||
2. 选择"图像分析"或"图像识别"服务
|
||||
3. 点击"立即使用"开通服务
|
||||
|
||||
### 3. **创建应用获取API密钥**
|
||||
1. 进入控制台 → 产品服务 → 图像分析
|
||||
2. 点击"创建应用"
|
||||
3. 填写应用信息:
|
||||
- **应用名称**: 数据提取与转换器
|
||||
- **应用类型**: 工具软件
|
||||
- **应用描述**: 照片质量评分工具
|
||||
4. 勾选需要的服务权限
|
||||
5. 点击"立即创建"
|
||||
|
||||
### 4. **获取正确的API密钥**
|
||||
创建应用后,您会看到类似这样的信息:
|
||||
|
||||
```
|
||||
AppID: 12345678
|
||||
API Key: xxxxxxxxxxxxxxxx
|
||||
Secret Key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|
||||
```
|
||||
|
||||
**正确的格式示例:**
|
||||
```
|
||||
API Key: "AbCdEfGhIjKlMnOp" (16位字母数字)
|
||||
Secret Key: "AbCdEfGhIjKlMnOpQrStUvWxYz012345" (32位字母数字)
|
||||
```
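
The format rule above lends itself to a quick local check before a key is written to `.env`. This sketch encodes only the lengths stated in this guide (16 and 32 alphanumeric characters); keys of other lengths may exist, so a failed check is a hint rather than proof. The function name is hypothetical.

```python
import re

# Patterns encode the lengths stated in this guide, not an official specification.
API_KEY_PATTERN = re.compile(r"[A-Za-z0-9]{16}")
SECRET_KEY_PATTERN = re.compile(r"[A-Za-z0-9]{32}")


def looks_like_baidu_keys(api_key, secret_key):
    """Return True if both keys match the alphanumeric format described in this guide."""
    return bool(
        API_KEY_PATTERN.fullmatch(api_key)
        and SECRET_KEY_PATTERN.fullmatch(secret_key)
    )
```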
|
||||
|
||||
## ⚠️ 常见错误格式
|
||||
|
||||
**Wrong format (do not use):**
```
# This format is wrong!
BAIDU_API_KEY=bce-v3/ALTAK-<redacted>/<redacted>
BAIDU_SECRET_KEY=<redacted>
```
|
||||
|
||||
**正确的格式:**
|
||||
```
|
||||
# 这种格式是正确的!
|
||||
BAIDU_API_KEY=AbCdEfGhIjKlMnOp
|
||||
BAIDU_SECRET_KEY=AbCdEfGhIjKlMnOpQrStUvWxYz012345
|
||||
```
|
||||
|
||||
## 🔧 配置步骤
|
||||
|
||||
### 1. **更新.env文件**
|
||||
将正确的API密钥添加到`.env`文件中:
|
||||
|
||||
```env
|
||||
# 百度智能云配置(图像分析)
|
||||
BAIDU_API_KEY=您的正确API Key
|
||||
BAIDU_SECRET_KEY=您的正确Secret Key
|
||||
```
|
||||
|
||||
### 2. **重启应用**
|
||||
应用需要重启才能加载新的环境变量。
|
||||
|
||||
### 3. **验证配置**
|
||||
使用以下命令测试配置是否正确:
|
||||
|
||||
```bash
|
||||
cd data-extractor-converter
|
||||
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
|
||||
```
|
||||
|
||||
## 🎯 验证成功的标志
|
||||
|
||||
如果配置正确,您会看到:
|
||||
```
|
||||
配置状态: True
|
||||
详细信息: 百度智能云配置正确
|
||||
```
|
||||
|
||||
## 💡 故障排除
|
||||
|
||||
### 如果仍然遇到问题
|
||||
|
||||
1. **检查服务开通状态**
|
||||
- 确认图像分析服务已开通
|
||||
- 检查应用是否有相应权限
|
||||
|
||||
2. **验证API密钥格式**
|
||||
- API Key: 应该是16位字母数字
|
||||
- Secret Key: 应该是32位字母数字
|
||||
|
||||
3. **检查网络连接**
|
||||
- 确保可以访问百度智能云API
|
||||
- 检查防火墙设置
|
||||
|
||||
4. **查看错误详情**
|
||||
- 如果仍有错误,查看完整的错误信息
|
||||
- 根据错误信息进一步排查
|
||||
|
||||
## 📞 获取帮助
|
||||
|
||||
如果仍然无法解决问题:
|
||||
|
||||
1. **百度智能云文档**: https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e
|
||||
2. **技术支持**: 在百度智能云控制台提交工单
|
||||
3. **社区支持**: 搜索相关技术论坛
|
||||
|
||||
## 🚀 下一步
|
||||
|
||||
配置正确的API密钥后,您就可以使用以下功能:
|
||||
- 📊 照片质量评分
|
||||
- 🔍 照片内容分析
|
||||
- 🎨 照片美学评分
|
||||
|
||||
祝您配置成功!
|
||||
187
BAIDU_API_KEY_DETAILED_GUIDE.md
Normal file
@ -0,0 +1,187 @@
|
||||
# 百度智能云API Key详细获取指南
|
||||
|
||||
## 📋 步骤概览
|
||||
|
||||
1. **注册百度智能云账号**
|
||||
2. **开通图像分析服务**
|
||||
3. **创建应用获取API Key**
|
||||
4. **配置到应用中**
|
||||
|
||||
## 🔑 第一步:注册百度智能云账号
|
||||
|
||||
### 1.1 访问官网
|
||||
- 打开: https://cloud.baidu.com/
|
||||
- 点击右上角"注册"
|
||||
|
||||
### 1.2 完成注册
|
||||
- 使用百度账号或手机号注册
|
||||
- 完成实名认证(需要身份证)
|
||||
- 验证手机和邮箱
|
||||
|
||||
## 🚀 第二步:开通图像分析服务
|
||||
|
||||
### 2.1 登录控制台
|
||||
- 访问: https://console.bce.baidu.com/
|
||||
- 使用注册的账号登录
|
||||
|
||||
### 2.2 开通服务
|
||||
在控制台首页搜索栏输入以下关键词之一:
|
||||
- **"图像分析"**
|
||||
- **"图像识别"**
|
||||
- **"Image Analysis"**
|
||||
|
||||
### 2.3 选择服务
|
||||
点击搜索结果中的"图像分析"服务,然后点击"立即使用"。
|
||||
|
||||
## 📱 第三步:创建应用获取API Key
|
||||
|
||||
### 3.1 进入应用管理
|
||||
1. 登录控制台后,点击左侧菜单"产品服务"
|
||||
2. 找到"图像分析"或"图像识别"
|
||||
3. 点击进入服务页面
|
||||
|
||||
### 3.2 创建新应用
|
||||
1. 点击"创建应用"按钮
|
||||
2. 填写应用信息:
|
||||
|
||||
**应用信息填写示例:**
|
||||
```
|
||||
应用名称: 数据提取与转换器
|
||||
应用类型: 工具软件
|
||||
应用描述: 照片质量评分和内容分析工具
|
||||
行业分类: 工具软件/办公软件
|
||||
```
|
||||
|
||||
### 3.3 选择服务权限
|
||||
在创建应用时,确保勾选以下权限:
|
||||
- ✅ 图像分析
|
||||
- ✅ 图像识别
|
||||
- ✅ 图像质量评估
|
||||
|
||||
### 3.4 获取API Key
|
||||
创建应用成功后,您会看到类似这样的信息:
|
||||
|
||||
```
|
||||
应用ID: 12345678
|
||||
API Key: AbCdEfGhIjKlMnOp
|
||||
Secret Key: AbCdEfGhIjKlMnOpQrStUvWxYz012345
|
||||
```
|
||||
|
||||
## 🔍 第四步:识别正确的API Key格式
|
||||
|
||||
### 4.1 正确的API Key特征
|
||||
```
|
||||
✅ API Key: AbCdEfGhIjKlMnOp (16位字母数字)
|
||||
✅ Secret Key: AbCdEfGhIjKlMnOpQrStUvWxYz012345 (32位字母数字)
|
||||
```
|
||||
|
||||
### 4.2 Wrong API Key formats (do not use)
```
❌ Date-time string: 20260108183311
❌ Compound format: bce-v3/ALTAK-xxx/xxx
❌ ALTAK-prefixed string (not an application API Key): ALTAK<redacted>
```
|
||||
|
||||
## ⚙️ 第五步:配置到应用中
|
||||
|
||||
### 5.1 更新.env文件
|
||||
将正确的API Key添加到`.env`文件中:
|
||||
|
||||
```env
|
||||
# 百度智能云配置(图像分析)
|
||||
BAIDU_API_KEY=AbCdEfGhIjKlMnOp
|
||||
BAIDU_SECRET_KEY=AbCdEfGhIjKlMnOpQrStUvWxYz012345
|
||||
```
|
||||
|
||||
### 5.2 重启应用
|
||||
应用需要重启才能加载新的环境变量。
|
||||
|
||||
### 5.3 验证配置
|
||||
使用以下命令测试配置是否正确:
|
||||
|
||||
```bash
|
||||
cd data-extractor-converter
|
||||
uv run python -c "from utils.baidu_image_analysis import check_baidu_config; print(check_baidu_config())"
|
||||
```
|
||||
|
||||
## 🎯 验证成功的标志
|
||||
|
||||
如果配置正确,您会看到:
|
||||
```
|
||||
配置状态: True
|
||||
详细信息: 百度智能云配置正确
|
||||
```
|
||||
|
||||
## 💡 常见问题解决
|
||||
|
||||
### Q1: 找不到"图像分析"服务怎么办?
|
||||
- 尝试搜索"图像识别"
|
||||
- 检查账号是否完成实名认证
|
||||
- 确认账号是否为企业账号(个人账号可能有限制)
|
||||
|
||||
### Q2: API Key格式不正确怎么办?
|
||||
- 确保是纯字母数字格式
|
||||
- 不要使用日期时间格式
|
||||
- 不要使用包含特殊字符的格式
|
||||
|
||||
### Q3: 创建应用时提示权限不足?
|
||||
- 检查账号实名认证状态
|
||||
- 确认账号余额或信用额度
|
||||
- 联系百度智能云客服
|
||||
|
||||
### Q4: 测试时仍然报错?
|
||||
- 检查网络连接
|
||||
- 验证API Key和Secret Key是否匹配
|
||||
- 确认服务是否已开通
|
||||
|
||||
## 📞 获取帮助
|
||||
|
||||
### 官方文档
|
||||
- 图像分析文档: https://cloud.baidu.com/doc/IMAGEPROCESS/s/ck3h6yf8e
|
||||
- API参考: https://cloud.baidu.com/doc/IMAGEPROCESS/s/Ek3h6xze3
|
||||
|
||||
### 技术支持
|
||||
- 控制台提交工单
|
||||
- 客服电话: 4008-777-818
|
||||
- 官方QQ群: 搜索"百度智能云技术支持"
|
||||
|
||||
## 🚀 功能预览
|
||||
|
||||
配置成功后,您可以使用以下AI照片评分功能:
|
||||
|
||||
### 1. 质量评分 📊
|
||||
- 清晰度评估
|
||||
- 亮度分析
|
||||
- 对比度检测
|
||||
- 色彩平衡评分
|
||||
|
||||
### 2. 内容分析 🔍
|
||||
- 物体识别
|
||||
- 场景理解
|
||||
- 内容摘要生成
|
||||
- 百度百科关联
|
||||
|
||||
### 3. 美学评分 🎨
|
||||
- 构图分析
|
||||
- 色彩和谐度
|
||||
- 光线评估
|
||||
- 艺术指导建议
|
||||
|
||||
## ⚠️ 注意事项
|
||||
|
||||
1. **安全性**: 不要将API Key提交到Git等版本控制系统
|
||||
2. **费用**: 注意监控使用量,避免意外费用
|
||||
3. **网络**: 确保稳定的网络连接
|
||||
4. **隐私**: 避免上传包含敏感信息的照片
|
||||
|
||||
## 💰 费用说明
|
||||
|
||||
### 免费额度
|
||||
- 新用户通常有免费调用额度
|
||||
- 具体额度请查看产品页面
|
||||
|
||||
### 计费方式
|
||||
- 按调用次数计费
|
||||
- 具体价格参考官方定价
|
||||
|
||||
祝您配置成功!如果遇到问题,可以参考常见问题部分或联系技术支持。
|
||||
279
README.md
Normal file
@ -0,0 +1,279 @@
|
||||
## Team Members and Contributions

| Name | Student ID | Main contribution (role) |
|------|------|-------------------|
| 郭昊 | 2412111209 | (Team lead) Core logic development, prompt writing |
|
||||
|
||||
# 数据提取与转换器
|
||||
|
||||
🚀 **多功能AI数据提取与转换工具**
|
||||
|
||||
一个集成了AI大模型能力的现代化数据处理工具,支持PDF提取、图片OCR、格式转换、网页抓取、数据库导出,以及创新的AI照片评分和文案生成功能。
|
||||
|
||||
## ✨ 核心功能
|
||||
|
||||
### 📄 文档处理
|
||||
- **PDF文本/表格提取** - 从PDF文档中提取文字和表格数据
|
||||
- **多格式支持** - 支持PDF、Word、Excel等文档格式
|
||||
|
||||
### 🖼️ 图片处理与AI识别
|
||||
- **传统OCR识别** - 使用Tesseract进行图片文字识别
|
||||
- **AI大模型OCR** - 集成阿里云AI大模型,高精度中文识别
|
||||
- **AI照片评分** - 百度智能云AI照片质量、内容、美学评估
|
||||
- **AI创意文案** - 基于照片内容生成多种风格的创意文案
|
||||
|
||||
### 🔄 数据格式转换
|
||||
- **Excel/CSV/JSON格式互转** - 支持多种数据格式之间的转换
|
||||
- **数据清洗与处理** - 智能数据格式识别和转换
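
As an illustration of what a conversion between these formats involves, here is a minimal CSV-to-JSON sketch using only the standard library. It is not the project's `utils.format_converter` implementation, which is not shown here; the function name `csv_text_to_json` is hypothetical.

```python
import csv
import io
import json


def csv_text_to_json(csv_text):
    """Convert CSV text with a header row into a JSON array of objects.

    All values stay strings, because CSV carries no type information.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, ensure_ascii=False, indent=2)


print(csv_text_to_json("name,score\nAlice,90\nBob,85\n"))
```

Passing `ensure_ascii=False` keeps Chinese text readable in the output instead of escaping it.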
|
||||
|
||||
### 🌐 网络数据获取
|
||||
- **网页数据抓取** - 从指定URL或关键词抓取网页数据
|
||||
- **智能内容提取** - 自动识别网页结构和内容
|
||||
|
||||
### 🗄️ 数据库管理
|
||||
- **数据库导出** - 将SQLite/MySQL数据库导出为Excel等格式
|
||||
- **MDF文件支持** - 支持SQL Server MDF文件导出
|
||||
|
||||
## 🎯 AI功能特色
|
||||
|
||||
### 📸 AI照片评分系统
|
||||
- **质量评分** 📊 - 清晰度、亮度、对比度、色彩平衡评估
|
||||
- **内容分析** 🔍 - 智能识别照片中的物体和场景
|
||||
- **美学评分** 🎨 - 构图、用光、主体表现艺术评价
|
||||
- **详细改进建议** 💡 - 针对性的摄影技术指导
|
||||
|
||||
### ✍️ AI创意文案生成
|
||||
- **多种风格** - 创意文艺、社交媒体、专业正式、营销推广等
|
||||
- **智能推荐** - 基于照片内容自动推荐最适合的风格
|
||||
- **多选项选择** - 一次生成3个不同风格的文案选项
|
||||
- **便捷复制** - 一键复制文案到剪贴板
|
||||
|
||||
## 🛠️ 技术架构
|
||||
|
||||
### 依赖管理
|
||||
- **使用`uv`管理** - 现代化的Python包管理工具
|
||||
- **虚拟环境隔离** - 确保依赖环境干净整洁
|
||||
- **快速安装** - 并行下载和安装,提升效率
|
||||
|
||||
### AI服务集成
|
||||
- **阿里云OCR** - 业界领先的中文OCR识别能力
|
||||
- **百度智能云** - 专业的图像分析和识别服务
|
||||
- **阿里云DashScope** - 强大的AI大模型文案生成
|
||||
|
||||
## 🚀 快速开始
|
||||
|
||||
### 环境要求
|
||||
- Python 3.8+
|
||||
- uv (推荐使用)
|
||||
|
||||
### 安装步骤
|
||||
|
||||
1. **克隆项目**
|
||||
```bash
|
||||
git clone <repository-url>
|
||||
cd data-extractor-converter
|
||||
```
|
||||
|
||||
2. **安装依赖**
|
||||
```bash
|
||||
# 使用uv安装依赖
|
||||
uv sync
|
||||
```
|
||||
|
||||
3. **配置环境变量**
|
||||
复制`.env.example`为`.env`并配置相关API密钥:
|
||||
```env
|
||||
# 阿里云OCR配置(AI大模型识别)
|
||||
ALIYUN_ACCESS_KEY_ID=your-access-key-id
|
||||
ALIYUN_ACCESS_KEY_SECRET=your-access-key-secret
|
||||
ALIYUN_OCR_ENDPOINT=ocr-api.cn-hangzhou.aliyuncs.com
|
||||
|
||||
# 百度智能云配置(图像分析)
|
||||
BAIDU_API_KEY=your-baidu-api-key
|
||||
BAIDU_SECRET_KEY=your-baidu-secret-key
|
||||
|
||||
# DashScope配置(AI文案生成)
|
||||
DASHSCOPE_API_KEY=your-dashscope-api-key
|
||||
```
|
||||
|
||||
4. **启动应用**
|
||||
```bash
|
||||
uv run streamlit run app.py
|
||||
```
|
||||
|
||||
5. **访问应用**
|
||||
打开浏览器访问: http://localhost:8501
|
||||
|
||||
## 📁 项目结构
|
||||
|
||||
```
|
||||
data-extractor-converter/
|
||||
├── app.py # 主应用程序
|
||||
├── pyproject.toml # 项目配置和依赖管理
|
||||
├── .env.example # 环境变量示例
|
||||
├── utils/ # 工具模块
|
||||
│ ├── __init__.py
|
||||
│ ├── pdf_extractor.py # PDF提取工具
|
||||
│ ├── ocr_processor.py # OCR处理工具
|
||||
│ ├── aliyun_ocr.py # 阿里云AI OCR
|
||||
│ ├── baidu_image_analysis.py # 百度智能云图像分析
|
||||
│ ├── ai_copywriter.py # AI文案生成
|
||||
│ ├── photo_advice_generator.py # 照片评分建议生成
|
||||
│ ├── format_converter.py # 格式转换工具
|
||||
│ ├── web_scraper.py # 网页抓取工具
|
||||
│ └── database_exporter.py # 数据库导出工具
|
||||
├── uploads/ # 上传文件目录
|
||||
└── docs/ # 文档目录
|
||||
├── ALIYUN_OCR_SETUP.md # 阿里云OCR配置指南
|
||||
├── BAIDU_AI_SETUP.md # 百度智能云配置指南
|
||||
└── SQL_SERVER_SETUP.md # SQL Server配置指南
|
||||
```
|
||||
|
||||
## 🔧 配置指南
|
||||
|
||||
### 阿里云OCR配置
|
||||
参考: [ALIYUN_OCR_SETUP.md](docs/ALIYUN_OCR_SETUP.md)
|
||||
|
||||
### 百度智能云配置
|
||||
参考: [BAIDU_AI_SETUP.md](docs/BAIDU_AI_SETUP.md)
|
||||
|
||||
### SQL Server配置
|
||||
参考: [SQL_SERVER_SETUP.md](docs/SQL_SERVER_SETUP.md)
|
||||
|
||||
## 💡 使用示例
|
||||
|
||||
### 1. AI照片评分
|
||||
1. 选择"📸 AI照片评分"功能
|
||||
2. 上传照片文件
|
||||
3. 点击"质量评分"、"内容分析"、"美学评分"
|
||||
4. 查看详细评分和改进建议
|
||||
|
||||
### 2. AI文案生成
|
||||
1. 在照片评分页面点击"AI写文案"
|
||||
2. 系统自动分析照片内容
|
||||
3. 选择喜欢的文案风格和长度
|
||||
4. 复制生成的创意文案
|
||||
|
||||
### 3. PDF文档处理
|
||||
1. 选择"📄 PDF处理"功能
|
||||
2. 上传PDF文件
|
||||
3. 选择提取模式(文本/表格)
|
||||
4. 下载提取结果
|
||||
|
||||
## 🎨 界面特色
|
||||
|
||||
- **现代化设计** - 简洁直观的用户界面
|
||||
- **响应式布局** - 适配不同屏幕尺寸
|
||||
- **实时反馈** - 操作进度和结果即时显示
|
||||
- **多语言支持** - 完整的中文界面和提示
|
||||
|
||||
## 🔒 安全特性
|
||||
|
||||
- **本地处理** - 敏感数据在本地处理,不上传云端
|
||||
- **环境变量** - API密钥通过环境变量安全配置
|
||||
- **文件隔离** - 上传文件在临时目录处理,自动清理
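
One way to get the automatic cleanup described above is to tie the temporary file's lifetime to a context manager. This is a sketch, not the code in `app.py`; `temporary_upload` is a hypothetical helper that writes the uploaded bytes to a temporary file and removes it when the block exits.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_upload(data, suffix=""):
    """Write `data` (bytes) to a temporary file, yield its path, then delete it."""
    # delete=False so the file can be reopened by path on every platform
    # (on Windows an open NamedTemporaryFile cannot be opened a second time).
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        yield path
    finally:
        # Clean up even if processing raised an exception.
        if os.path.exists(path):
            os.remove(path)


with temporary_upload(b"%PDF-1.4 example", suffix=".pdf") as pdf_path:
    print(os.path.getsize(pdf_path))
```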
|
||||
|
||||
## 📈 性能优化
|
||||
|
||||
- **异步处理** - 大文件处理使用异步操作
|
||||
- **缓存机制** - 重复操作结果缓存
|
||||
- **进度显示** - 长时间操作显示进度条
|
||||
|
||||
## 🤝 贡献指南
|
||||
|
||||
欢迎提交Issue和Pull Request来改进这个项目!
|
||||
|
||||
### 开发环境设置
|
||||
```bash
|
||||
# 安装开发依赖
|
||||
uv sync --dev
|
||||
|
||||
# 运行测试
|
||||
uv run pytest
|
||||
|
||||
# 代码格式化
|
||||
uv run black .
|
||||
uv run isort .
|
||||
```
|
||||
|
||||
## 📄 许可证
|
||||
|
||||
本项目采用MIT许可证,详见[LICENSE](LICENSE)文件。
|
||||
|
||||
## 🙏 致谢
|
||||
|
||||
感谢以下服务提供的AI能力支持:
|
||||
- [阿里云](https://www.aliyun.com/) - OCR和AI大模型服务
|
||||
- [百度智能云](https://cloud.baidu.com/) - 图像分析服务
|
||||
- [Streamlit](https://streamlit.io/) - Web应用框架
|
||||
|
||||
|
||||
|
||||
### 如何运行
|
||||
1. **安装依赖**:`uv sync`
|
||||
2. **配置 Key**:复制 `.env.example` 为 `.env` 并填入 Key
|
||||
3. **启动**:`uv run streamlit run app.py`
|
||||
|
||||
## 💭 开发心得
|
||||
|
||||
### 选题思考:为什么做这个?解决了谁的痛苦?
|
||||
|
||||
作为一名学生,我深刻体会到在学习和科研过程中处理各种格式数据的痛苦。从PDF文献提取、图片文字识别到数据格式转换,每一个环节都可能耗费大量时间。特别是当需要为照片添加创意文案时,往往需要反复修改,缺乏专业的指导。
|
||||
|
||||
这个项目正是为了解决这些痛点而生。它不仅仅是一个工具集合,更是一个AI赋能的智能助手,能够帮助我们:
|
||||
- 快速提取学术文献中的关键信息
|
||||
- 智能识别图片中的文字内容
|
||||
- 一键转换不同格式的数据文件
|
||||
- 获得专业的照片质量评估和创意文案
|
||||
|
||||
### AI 协作体验
|
||||
|
||||
#### 第一次用 AI 写代码的感觉?
|
||||
|
||||
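The first time I used AI-assisted programming, it struck me as remarkably convenient: the AI could quickly generate a basic code skeleton, which greatly improved development efficiency. As the project progressed, I found that AI did particularly well in the following areas: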
|
||||
|
||||
1. **快速原型开发**:AI能够快速生成功能模块的基本框架
|
||||
2. **代码优化建议**:AI能够提供代码重构和性能优化的建议
|
||||
3. **错误排查**:AI能够快速定位代码中的潜在问题
|
||||
|
||||
#### 哪个 Prompt 让你直呼"牛逼"?哪个让你想砸键盘?
|
||||
|
||||
|
||||
|
||||
**最令人沮丧的Prompt:**
|
||||
"修复百度智能云API连接错误"
|
||||
|
||||
这个看似简单的Prompt却让我反复调试了多次,因为AI无法理解具体的API密钥格式问题,只能提供通用的错误排查建议,需要人工进行详细的调试。
|
||||
|
||||
### 自我反思:AI 时代,我作为程序员的核心竞争力到底是什么?
|
||||
|
||||
通过这个项目的开发,我深刻认识到在AI时代,程序员的核心竞争力已经发生了根本性的转变:
|
||||
|
||||
#### 1. **问题定义和分解能力**
|
||||
AI擅长执行具体的任务,但需要人类来定义问题和分解复杂需求。我的价值在于能够将用户的需求转化为AI可以理解的具体任务。
|
||||
|
||||
#### 2. **系统架构设计能力**
|
||||
AI可以生成代码片段,但整个系统的架构设计、模块划分、接口定义仍然需要人类的专业判断。
|
||||
|
||||
#### 3. **质量控制和调试能力**
|
||||
AI生成的代码可能存在潜在问题,需要人类进行严格的测试、调试和优化。
|
||||
|
||||
#### 4. **创新思维和业务理解**
|
||||
AI基于现有数据进行学习,而人类能够结合业务场景进行创新思考,提出独特的解决方案。
|
||||
|
||||
#### 5. **伦理和责任意识**
|
||||
在使用AI技术时,需要考虑数据隐私、算法公平性等伦理问题,这是AI无法替代的人类责任。
|
||||
|
||||
### 总结
|
||||
|
||||
这个项目让我深刻体会到,AI不是程序员的替代者,而是强大的工具和合作伙伴。未来的程序员需要具备:
|
||||
- **AI协作能力**:熟练使用AI工具提升效率
|
||||
- **系统思维**:从整体角度设计解决方案
|
||||
- **业务理解**:深入理解用户需求和业务场景
|
||||
- **持续学习**:跟上技术发展的步伐
|
||||
|
||||
通过这个项目,我不仅掌握了一项实用的技能,更重要的是培养了一种与AI协作的新思维方式。在AI时代,我们的价值不在于重复性的编码工作,而在于创造性的问题解决和系统设计能力。
|
||||
|
||||
---
|
||||
|
||||
**数据提取与转换器** - 让数据处理变得更简单、更智能! 🚀
|
||||
137
SQL_SERVER_SETUP.md
Normal file
@ -0,0 +1,137 @@
|
||||
# SQL Server MDF文件导出配置指南
|
||||
|
||||
## 📋 概述
|
||||
|
||||
数据提取与转换器现在支持导出SQL Server数据库文件(.mdf格式)。由于.mdf文件需要SQL Server实例来访问,请按照以下步骤配置。
|
||||
|
||||
## 🔧 系统要求
|
||||
|
||||
### 必需组件
|
||||
1. **SQL Server Express/Developer/Standard/Enterprise** 版本
|
||||
2. **SQL Server Native Client** 或 **ODBC Driver for SQL Server**
|
||||
3. **Python pyodbc库**(已自动安装)
|
||||
|
||||
### 推荐配置
|
||||
- SQL Server 2019 Express(免费版本)
|
||||
- ODBC Driver 17 for SQL Server
|
||||
|
||||
## 🚀 安装步骤
|
||||
|
||||
### 1. 安装SQL Server(如果未安装)
|
||||
|
||||
**下载SQL Server Express(免费):**
|
||||
- 访问: https://www.microsoft.com/en-us/sql-server/sql-server-downloads
|
||||
- 下载: SQL Server 2019 Express
|
||||
- 安装时选择"基本"安装类型
|
||||
|
||||
**安装注意事项:**
|
||||
- 记住设置的sa密码
|
||||
- 选择"混合模式"认证
|
||||
- 记下实例名称(默认为MSSQLSERVER)
|
||||
|
||||
### 2. 安装ODBC驱动程序
|
||||
|
||||
**下载ODBC Driver 17 for SQL Server:**
|
||||
- 访问: https://docs.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server
|
||||
- 下载并安装最新版本
|
||||
|
||||
### 3. 验证安装
|
||||
|
||||
**检查SQL Server服务:**
|
||||
1. 打开"服务"管理器(services.msc)
|
||||
2. 确保"SQL Server (MSSQLSERVER)"服务正在运行
|
||||
|
||||
**测试连接:**
|
||||
```bash
|
||||
# 使用sqlcmd测试连接
|
||||
sqlcmd -S localhost -U sa -P your_password
|
||||
```
|
||||
|
||||
## ⚙️ 应用配置
|
||||
|
||||
### 默认连接参数
|
||||
应用使用以下默认连接参数:
|
||||
- **服务器**: localhost
|
||||
- **用户名**: sa
|
||||
- **实例**: MSSQLSERVER
|
||||
|
||||
### 自定义配置
|
||||
如需修改连接参数,可在`.env`文件中添加:
|
||||
```env
|
||||
# SQL Server配置
|
||||
MSSQL_SERVER=localhost
|
||||
MSSQL_USERNAME=sa
|
||||
MSSQL_PASSWORD=your_password
|
||||
MSSQL_INSTANCE=MSSQLSERVER
|
||||
```
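
These settings end up in an ODBC connection string. The sketch below shows one plausible way to assemble it. The project's own code in `utils/database_exporter.py` is not shown here and may differ. The driver name follows the recommendation earlier in this guide, `build_connection_string` is a hypothetical helper, and the string is only built, not used to connect.

```python
import os


def build_connection_string(env=None):
    """Assemble an ODBC connection string from the MSSQL_* variables.

    Defaults mirror the default connection parameters listed in this guide.
    """
    env = os.environ if env is None else env
    server = env.get("MSSQL_SERVER", "localhost")
    instance = env.get("MSSQL_INSTANCE", "MSSQLSERVER")
    # MSSQLSERVER is the default instance and is addressed by host name alone;
    # named instances use the host\instance form.
    if instance and instance.upper() != "MSSQLSERVER":
        server = server + "\\" + instance
    parts = [
        "DRIVER={ODBC Driver 17 for SQL Server}",
        "SERVER=" + server,
        "UID=" + env.get("MSSQL_USERNAME", "sa"),
        "PWD=" + env.get("MSSQL_PASSWORD", ""),
    ]
    return ";".join(parts)
```

The resulting string could then be passed to `pyodbc.connect(...)`.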
|
||||
|
||||
## 📁 MDF文件处理流程
|
||||
|
||||
### 自动附加数据库
|
||||
应用会自动执行以下步骤:
|
||||
1. 连接到SQL Server实例
|
||||
2. 检查数据库是否已存在
|
||||
3. 如果不存在,自动附加.mdf文件
|
||||
4. 读取表结构和数据
|
||||
5. 导出为指定格式
|
||||
6. 分离数据库(可选)
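
The attach and detach steps correspond to two T-SQL statements. The sketch below only generates the statement text, using the same `CREATE DATABASE ... FOR ATTACH` form shown in the manual-attach section of this guide together with the `sp_detach_db` system procedure. The helper names and the quoting approach are illustrative assumptions, not the project's implementation.

```python
def attach_statement(database_name, mdf_path):
    """Return T-SQL that attaches the .mdf file as `database_name`."""
    # Escape ] in the identifier and ' in the path so the text stays well-formed.
    safe_name = database_name.replace("]", "]]")
    safe_path = mdf_path.replace("'", "''")
    return f"CREATE DATABASE [{safe_name}] ON (FILENAME = '{safe_path}') FOR ATTACH;"


def detach_statement(database_name):
    """Return T-SQL that detaches the database again (the optional last step)."""
    safe_name = database_name.replace("'", "''")
    return f"EXEC sp_detach_db @dbname = N'{safe_name}';"


print(attach_statement("SalesDb", r"C:\data\sales.mdf"))
print(detach_statement("SalesDb"))
```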
|
||||
|
||||
### 支持的功能
|
||||
- ✅ 导出所有表到Excel(多sheet)
|
||||
- ✅ 导出指定表
|
||||
- ✅ 导出为CSV格式
|
||||
- ✅ 导出为JSON格式
|
||||
|
||||
## 🔍 故障排除
|
||||
|
||||
### 常见问题
|
||||
|
||||
**1. "无法连接到SQL Server"**
|
||||
- 检查SQL Server服务是否运行
|
||||
- 验证连接字符串参数
|
||||
- 检查防火墙设置
|
||||
|
||||
**2. "附加数据库失败"**
|
||||
- 确保.mdf文件未被其他进程占用
|
||||
- 检查文件权限
|
||||
- 尝试手动附加数据库
|
||||
|
||||
**3. "ODBC驱动未找到"**
|
||||
- 安装ODBC Driver for SQL Server
|
||||
- 检查系统PATH环境变量
|
||||
|
||||
### 手动附加数据库
|
||||
|
||||
如果自动附加失败,可以手动附加:
|
||||
```sql
|
||||
-- 在SQL Server Management Studio中执行
|
||||
CREATE DATABASE [YourDatabaseName]
|
||||
ON (FILENAME = 'C:\path\to\your\file.mdf')
|
||||
FOR ATTACH;
|
||||
```
|
||||
|
||||
## 🎯 使用示例
|
||||
|
||||
### 基本使用
|
||||
1. 启动应用
|
||||
2. 选择"🗄️ 数据库导出"功能
|
||||
3. 上传.mdf文件
|
||||
4. 选择导出格式
|
||||
5. 点击"开始导出"
|
||||
|
||||
### 高级选项
|
||||
- 指定表名:只导出特定表
|
||||
- 自定义连接:修改.env文件中的连接参数
|
||||
|
||||
## 📚 相关资源
|
||||
|
||||
- [SQL Server文档](https://docs.microsoft.com/en-us/sql/)
|
||||
- [ODBC驱动文档](https://docs.microsoft.com/en-us/sql/connect/odbc/)
|
||||
- [pyodbc文档](https://github.com/mkleehammer/pyodbc)
|
||||
|
||||
## 💡 注意事项
|
||||
|
||||
1. **安全性**: 生产环境中使用强密码
|
||||
2. **性能**: 大文件可能需要较长时间处理
|
||||
3. **兼容性**: 支持SQL Server 2008及以上版本
|
||||
4. **权限**: 确保应用有足够的数据库权限
|
||||
795
app.py
Normal file
@ -0,0 +1,795 @@
import streamlit as st
import os
import uuid
import tempfile
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Import utility modules
from utils.pdf_extractor import extract_text_from_pdf, pdf_to_excel
from utils.ocr_processor import extract_text_from_image, image_to_excel, image_to_text_file
from utils.format_converter import (
    excel_to_csv, csv_to_excel, json_to_excel,
    excel_to_json, csv_to_json, json_to_csv
)
from utils.web_scraper import scrape_webpage, web_to_excel
from utils.database_exporter import export_sqlite_to_excel, database_to_csv, database_to_json

# Page configuration
st.set_page_config(
    page_title="数据提取与转换器",
    page_icon="🔧",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS styles
st.markdown("""
<style>
    .main-header {
        text-align: center;
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
        padding: 2rem;
        border-radius: 10px;
        margin-bottom: 2rem;
    }
    .feature-card {
        background: #f8f9fa;
        padding: 1.5rem;
        border-radius: 10px;
        border-left: 4px solid #3498db;
        margin-bottom: 1rem;
    }
    .success-box {
        background: #d4edda;
        color: #155724;
        padding: 1rem;
        border-radius: 5px;
        border: 1px solid #c3e6cb;
    }
    .error-box {
        background: #f8d7da;
        color: #721c24;
        padding: 1rem;
        border-radius: 5px;
        border: 1px solid #f5c6cb;
    }
</style>
""", unsafe_allow_html=True)

# Page title
st.markdown("""
<div class="main-header">
    <h1>🔧 数据提取与转换器</h1>
    <p>多功能数据处理工具</p>
</div>
""", unsafe_allow_html=True)

# Sidebar navigation
st.sidebar.title("功能导航")
page = st.sidebar.radio("选择功能", [
    "📄 PDF处理",
    "🖼️ 图片OCR",
    "📸 AI照片评分",
    "🔄 格式转换",
    "🌐 网页抓取",
    "🗄️ 数据库导出"
])

# File upload helper
def save_uploaded_file(uploaded_file, file_type):
    """Save the uploaded file to a temporary file and return its path (None on failure)."""
    try:
        # Create a temporary file that keeps the original extension
        suffix = Path(uploaded_file.name).suffix
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp_file:
            tmp_file.write(uploaded_file.getvalue())
            return tmp_file.name
    except Exception as e:
        st.error(f"文件保存失败: {str(e)}")
        return None

# PDF processing page
if page == "📄 PDF处理":
    st.header("📄 PDF文本/表格提取")

    uploaded_file = st.file_uploader("选择PDF文件", type=['pdf'])

    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'pdf')

        col1, col2 = st.columns(2)

        with col1:
            if st.button("提取文本内容", use_container_width=True):
                with st.spinner("正在提取文本..."):
                    try:
                        text = extract_text_from_pdf(file_path)
                        st.subheader("提取的文本内容")
                        st.text_area("文本内容", text, height=300)
                        st.success("文本提取完成!")
                    except Exception as e:
                        st.error(f"提取失败: {str(e)}")

        with col2:
            if st.button("导出为Excel", use_container_width=True):
                with st.spinner("正在转换为Excel..."):
                    try:
                        # Derive the output name from the path without its suffix,
                        # so an upper-case ".PDF" cannot leave output_path == file_path.
                        output_path = str(Path(file_path).with_suffix("")) + "_converted.xlsx"
                        pdf_to_excel(file_path, output_path)

                        with open(output_path, "rb") as file:
                            st.download_button(
                                label="下载Excel文件",
                                data=file,
                                file_name=Path(output_path).name,
                                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                            )
                        st.success("PDF转换完成!")
                    except Exception as e:
                        st.error(f"转换失败: {str(e)}")

# AI照片评分页面
|
||||
elif page == "📸 AI照片评分":
|
||||
st.header("📸 AI照片质量评分")
|
||||
|
||||
# 百度智能云功能状态检查
|
||||
try:
|
||||
from utils.baidu_image_analysis import check_baidu_config
|
||||
baidu_available, baidu_message = check_baidu_config()
|
||||
except:
|
||||
baidu_available = False
|
||||
baidu_message = "百度智能云未配置"
|
||||
|
||||
# 显示状态
|
||||
if baidu_available:
|
||||
st.success("✅ 百度智能云AI照片评分可用")
|
||||
else:
|
||||
st.warning(f"⚠️ 百度智能云AI照片评分: {baidu_message}")
|
||||
|
||||
if not baidu_available:
|
||||
st.info("""
|
||||
**百度智能云配置说明:**
|
||||
|
||||
1. **注册百度智能云账号**: https://cloud.baidu.com
|
||||
2. **开通图像分析服务**: 在控制台搜索"图像分析"或"图像识别"
|
||||
3. **获取API密钥**: 创建应用并获取API Key和Secret Key
|
||||
4. **在.env文件中配置**:
|
||||
```
|
||||
BAIDU_API_KEY=您的API Key
|
||||
BAIDU_SECRET_KEY=您的Secret Key
|
||||
```
|
||||
""")
|
||||
|
||||
uploaded_file = st.file_uploader("选择照片文件", type=['jpg', 'jpeg', 'png', 'gif', 'bmp'])
|
||||
|
||||
if uploaded_file is not None:
|
||||
file_path = save_uploaded_file(uploaded_file, 'image')
|
||||
|
||||
# AI文案生成功能状态检查
|
||||
try:
|
||||
from utils.ai_copywriter import check_copywriter_config
|
||||
copywriter_available, copywriter_message = check_copywriter_config()
|
||||
except:
|
||||
copywriter_available = False
|
||||
copywriter_message = "AI文案生成未配置"
|
||||
|
||||
# 显示AI文案生成状态
|
||||
if copywriter_available:
|
||||
st.success("✅ AI文案生成可用")
|
||||
else:
|
||||
st.warning(f"⚠️ AI文案生成: {copywriter_message}")
|
||||
|
||||
col1, col2, col3, col4 = st.columns(4)

        with col1:
            if st.button("Quality score", use_container_width=True, disabled=not baidu_available):
                with st.spinner("Analyzing photo quality..."):
                    try:
                        from utils.baidu_image_analysis import analyze_image_quality
                        from utils.photo_advice_generator import get_quality_improvement_advice

                        quality_result = analyze_image_quality(file_path)

                        st.subheader("📊 Photo Quality Score")

                        # Overall score; the delta is relative to a baseline of 75
                        score = quality_result['score']
                        st.metric("Overall score", f"{score}/100", f"{score - 75}")

                        # Per-dimension scores
                        st.subheader("Quality Dimensions")
                        quality_scores = {}
                        for dimension, info in quality_result['dimensions'].items():
                            col_dim1, col_dim2 = st.columns([1, 3])
                            with col_dim1:
                                st.progress(info['score'] / 100)
                            with col_dim2:
                                st.write(f"**{dimension}**: {info['comment']} ({info['score']}/100)")
                            quality_scores[dimension] = info['score']

                        # Generate detailed improvement advice
                        advice_result = get_quality_improvement_advice(quality_scores)

                        # Overall advice
                        st.subheader("💡 Overall Suggestions")
                        for suggestion in advice_result.get('overall', []):
                            st.info(f"📌 {suggestion}")

                        # Priority advice
                        if advice_result.get('priority'):
                            st.subheader("🎯 Priority Improvements")
                            for priority in advice_result['priority']:
                                st.warning(f"⚠️ {priority}")

                        # Advice per dimension
                        st.subheader("🔧 Specific Improvements")
                        for dimension, suggestions in advice_result.get('specific', {}).items():
                            with st.expander(f"{dimension}: improvement suggestions"):
                                for i, suggestion in enumerate(suggestions, 1):
                                    st.write(f"{i}. {suggestion}")

                        # Technical advice (first three items per category)
                        st.subheader("📚 Technique Learning Suggestions")
                        from utils.photo_advice_generator import get_technical_advice
                        tech_advice = get_technical_advice()

                        for category, suggestions in tech_advice.items():
                            with st.expander(f"{category}: technique suggestions"):
                                for i, suggestion in enumerate(suggestions[:3], 1):
                                    st.write(f"{i}. {suggestion}")

                        st.success("Photo quality analysis complete. Detailed suggestions have been generated.")
                    except Exception as e:
                        st.error(f"Quality scoring failed: {str(e)}")

        with col2:
            if st.button("Content analysis", use_container_width=True, disabled=not baidu_available):
                with st.spinner("Analyzing photo content..."):
                    try:
                        from utils.baidu_image_analysis import analyze_image_content
                        content_result = analyze_image_content(file_path)

                        st.subheader("🔍 Photo Content Analysis")

                        # Show at most the first five detected objects
                        if content_result['objects']:
                            st.write("**Detected objects:**")
                            for i, obj in enumerate(content_result['objects'][:5], 1):
                                st.write(f"{i}. **{obj['name']}** (confidence: {obj['confidence']:.2%})")
                                if obj.get('baike_info'):
                                    st.write(f"   Description: {obj['baike_info'].get('description', 'no description')}")

                        if content_result['summary']:
                            st.write(f"**Summary:** {content_result['summary']}")

                        st.success("Photo content analysis complete.")
                    except Exception as e:
                        st.error(f"Content analysis failed: {str(e)}")

        with col3:
            if st.button("Aesthetic score", use_container_width=True, disabled=not baidu_available):
                with st.spinner("Evaluating photo aesthetics..."):
                    try:
                        from utils.baidu_image_analysis import get_image_aesthetic_score
                        from utils.photo_advice_generator import get_aesthetic_improvement_advice

                        aesthetic_result = get_image_aesthetic_score(file_path)

                        st.subheader("🎨 Photo Aesthetic Score")

                        # Aesthetic score; the delta is relative to a baseline of 75
                        aesthetic_score = aesthetic_result['aesthetic_score']
                        st.metric("Aesthetic score", f"{aesthetic_score}/100", f"{aesthetic_score - 75}")

                        # Aesthetic dimensions
                        st.subheader("Aesthetic Dimensions")
                        col_comp, col_color, col_light, col_focus = st.columns(4)

                        with col_comp:
                            st.metric("Composition", aesthetic_result['composition'])
                        with col_color:
                            st.metric("Color harmony", aesthetic_result['color_harmony'])
                        with col_light:
                            st.metric("Lighting", aesthetic_result['lighting'])
                        with col_focus:
                            st.metric("Focus", aesthetic_result['focus'])

                        # Generate detailed aesthetic advice
                        advice_result = get_aesthetic_improvement_advice(aesthetic_score)

                        # General advice
                        st.subheader("💡 Overall Aesthetic Suggestions")
                        for suggestion in advice_result.get('general', []):
                            st.info(f"🎨 {suggestion}")

                        # Specific advice
                        st.subheader("🔧 Specific Aesthetic Improvements")

                        if advice_result.get('composition'):
                            with st.expander("Composition suggestions"):
                                for i, suggestion in enumerate(advice_result['composition'], 1):
                                    st.write(f"{i}. {suggestion}")

                        if advice_result.get('lighting'):
                            with st.expander("Lighting suggestions"):
                                for i, suggestion in enumerate(advice_result['lighting'], 1):
                                    st.write(f"{i}. {suggestion}")

                        if advice_result.get('subject'):
                            with st.expander("Subject presentation suggestions"):
                                for i, suggestion in enumerate(advice_result['subject'], 1):
                                    st.write(f"{i}. {suggestion}")

                        # Creative advice
                        if advice_result.get('creative'):
                            st.subheader("🌟 Creative Suggestions")
                            for suggestion in advice_result['creative']:
                                st.success(f"✨ {suggestion}")

                        # Personalized advice
                        st.subheader("📋 Personalized Learning Plan")
                        from utils.photo_advice_generator import get_personalized_advice

                        # Photo content, used to tailor the advice; the fallback text is
                        # passed to the advice generator unchanged
                        from utils.baidu_image_analysis import analyze_image_content
                        content_result = analyze_image_content(file_path)
                        photo_content = content_result.get('summary', '一般照片')

                        # Quality scores, also used to tailor the advice
                        from utils.baidu_image_analysis import analyze_image_quality
                        quality_result = analyze_image_quality(file_path)
                        quality_scores = {dim: info['score'] for dim, info in quality_result['dimensions'].items()}

                        personalized_advice = get_personalized_advice(quality_scores, aesthetic_score, photo_content)

                        for category, suggestions in personalized_advice.items():
                            if suggestions:
                                with st.expander(f"{category}"):
                                    for i, suggestion in enumerate(suggestions, 1):
                                        st.write(f"{i}. {suggestion}")

                        st.success("Photo aesthetic evaluation complete. Detailed suggestions have been generated.")
                    except Exception as e:
                        st.error(f"Aesthetic scoring failed: {str(e)}")

        with col4:
            if st.button("AI caption", use_container_width=True, disabled=not copywriter_available):
                with st.spinner("Generating captions..."):
                    try:
                        # Run content analysis first to obtain a description of the photo
                        from utils.baidu_image_analysis import analyze_image_content
                        content_result = analyze_image_content(file_path)

                        # Generate captions with the AI copywriter
                        from utils.ai_copywriter import generate_multiple_captions, analyze_photo_suitability

                        # Photo description; the fallback text is passed to the model unchanged
                        image_description = content_result.get('summary', '一张美丽的照片')

                        # Work out which caption styles suit the photo
                        suitability_result = analyze_photo_suitability(image_description)

                        st.subheader("✍️ AI Caption Generation")

                        st.write(f"**Photo description**: {image_description}")
                        st.write(f"**Recommended styles**: {', '.join(suitability_result['recommended_styles'][:3])}")

                        # Generate several caption options
                        captions = generate_multiple_captions(image_description, count=3, style=suitability_result['most_suitable'])

                        st.subheader("📝 Caption Options")

                        for caption_info in captions:
                            with st.expander(f"Option {caption_info['option']} ({caption_info.get('length', 'medium')} - {caption_info['char_count']} characters)"):
                                st.write(caption_info['caption'])
                                # st.code renders its own copy-to-clipboard icon, so no
                                # separate copy button is needed
                                st.code(caption_info['caption'], language='text')

                        st.subheader("🎨 Caption Style")

                        # NOTE: Streamlit reruns the script on every widget interaction, and the
                        # outer "AI caption" button is only True on the run right after its click.
                        # Interacting with the widgets below triggers a rerun that skips this
                        # branch; keeping them usable would require persisting the results in
                        # st.session_state.
                        selected_style = st.selectbox(
                            "Caption style",
                            ['creative', 'social', 'professional', 'marketing', 'emotional', 'simple'],
                            format_func=lambda x: {
                                'creative': 'Creative / literary',
                                'social': 'Social media',
                                'professional': 'Professional / formal',
                                'marketing': 'Marketing',
                                'emotional': 'Emotional',
                                'simple': 'Simple description'
                            }[x]
                        )

                        selected_length = st.selectbox(
                            "Caption length",
                            ['short', 'medium', 'long'],
                            format_func=lambda x: {
                                'short': 'Short and concise',
                                'medium': 'Medium length',
                                'long': 'Detailed'
                            }[x]
                        )

                        if st.button("Regenerate caption", use_container_width=True):
                            with st.spinner("Regenerating caption..."):
                                # The original called generate_photo_caption without a visible
                                # import; utils.ai_copywriter as its home is an assumption
                                from utils.ai_copywriter import generate_photo_caption
                                new_caption = generate_photo_caption(image_description, selected_style, selected_length)
                                st.subheader("🆕 New Caption")
                                st.write(new_caption)
                                st.success("New caption generated.")

                        st.success("AI caption generation complete.")
                    except Exception as e:
                        st.error(f"AI caption generation failed: {str(e)}")

        # Photo preview
        st.subheader("📷 Photo Preview")
        st.image(uploaded_file, caption="Uploaded photo", use_column_width=True)

# Image OCR page
elif page == "🖼️ 图片OCR":
    st.header("🖼️ Image Text Recognition (OCR)")

    # Check whether Tesseract is available
    try:
        import pytesseract
        # Fails if the tesseract binary cannot be run
        pytesseract.get_tesseract_version()
        tesseract_available = True
    except (Exception, SystemExit):
        # pytesseract can raise SystemExit for an unrecognized version string
        tesseract_available = False

    # Check whether the AI OCR service is configured
    try:
        from utils.aliyun_ocr import check_aliyun_config
        ai_available, ai_message = check_aliyun_config()
    except Exception:
        ai_available = False
        ai_message = "Alibaba Cloud OCR is not configured"

    # Show the OCR status
    col_status1, col_status2 = st.columns(2)
    with col_status1:
        if tesseract_available:
            st.success("✅ Tesseract OCR is available")
        else:
            st.warning("⚠️ Tesseract OCR is not installed")

    with col_status2:
        if ai_available:
            st.success("✅ AI model OCR is available")
        else:
            st.warning(f"⚠️ AI model OCR: {ai_message}")

    # OCR mode selection; the AI label is reused below to detect the chosen mode
    traditional_mode_label = "Traditional OCR (Tesseract)"
    ai_mode_label = "AI model OCR (Alibaba Cloud)"
    ocr_mode = st.radio("OCR mode",
                        [traditional_mode_label, ai_mode_label],
                        disabled=not (tesseract_available or ai_available))

    if not tesseract_available and not ai_available:
        st.info("""
        **OCR setup:**

        **Traditional OCR (recommended, free):**
        1. Download Tesseract OCR: https://github.com/UB-Mannheim/tesseract/wiki
        2. Install it to the default path and add it to PATH

        **AI model OCR (higher accuracy):**
        1. Register an Alibaba Cloud account: https://www.aliyun.com
        2. Enable the OCR service and obtain an AccessKey
        3. Set ALIYUN_ACCESS_KEY_ID and ALIYUN_ACCESS_KEY_SECRET in the .env file
        """)

    uploaded_file = st.file_uploader("Choose an image", type=['jpg', 'jpeg', 'png', 'gif', 'bmp'])

    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'image')

        # Enable or disable the buttons according to the selected mode
        use_ai = ocr_mode == ai_mode_label
        button_disabled = (use_ai and not ai_available) or (not use_ai and not tesseract_available)

        col1, col2, col3 = st.columns(3)

        with col1:
            if st.button("Recognize text", use_container_width=True, disabled=button_disabled):
                with st.spinner("Recognizing text..."):
                    try:
                        if use_ai:
                            text = extract_text_from_image(file_path, use_ai=True, ai_provider='aliyun')
                        else:
                            text = extract_text_from_image(file_path)

                        st.subheader("Recognized Text")
                        st.text_area("Text", text, height=300)
                        st.success("Text recognition complete.")
                    except Exception as e:
                        st.error(f"Recognition failed: {str(e)}")

        with col2:
            if st.button("Export to Excel", use_container_width=True, disabled=button_disabled):
                with st.spinner("Converting to Excel..."):
                    try:
                        output_path = file_path.rsplit('.', 1)[0] + '_converted.xlsx'
                        if use_ai:
                            # Export the AI OCR output to Excel, one row per non-empty line
                            from utils.ocr_processor import extract_text_with_ai
                            text = extract_text_with_ai(file_path, 'aliyun', 'general')
                            import pandas as pd
                            lines = [line.strip() for line in text.split('\n') if line.strip()]
                            df = pd.DataFrame({
                                'Line': range(1, len(lines) + 1),
                                'Content': lines
                            })
                            df.to_excel(output_path, index=False)
                        else:
                            image_to_excel(file_path, output_path)

                        with open(output_path, "rb") as file:
                            st.download_button(
                                label="Download Excel file",
                                data=file,
                                file_name=Path(output_path).name,
                                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                            )
                        st.success("Image conversion complete.")
                    except Exception as e:
                        st.error(f"Conversion failed: {str(e)}")

        with col3:
            if st.button("Export to text", use_container_width=True, disabled=button_disabled):
                with st.spinner("Converting to text..."):
                    try:
                        output_path = file_path.rsplit('.', 1)[0] + '_converted.txt'
                        if use_ai:
                            # Export the AI OCR output to a UTF-8 text file
                            from utils.ocr_processor import extract_text_with_ai
                            text = extract_text_with_ai(file_path, 'aliyun', 'general')
                            with open(output_path, 'w', encoding='utf-8') as f:
                                f.write(text)
                        else:
                            image_to_text_file(file_path, output_path)

                        with open(output_path, "rb") as file:
                            st.download_button(
                                label="Download text file",
                                data=file,
                                file_name=Path(output_path).name,
                                mime="text/plain"
                            )
                        st.success("Image conversion complete.")
                    except Exception as e:
                        st.error(f"Conversion failed: {str(e)}")

        # Image preview
        st.subheader("Image Preview")
        st.image(uploaded_file, caption="Uploaded image", use_column_width=True)

        # Show which OCR mode is active
        st.info(f"Current mode: {ocr_mode}")

# Format conversion page
elif page == "🔄 格式转换":
    st.header("🔄 File Format Conversion")

    uploaded_file = st.file_uploader("Choose a file", type=['xlsx', 'xls', 'csv', 'json'])

    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'format')
        file_ext = Path(uploaded_file.name).suffix.lower()

        # Offer the target formats that apply to the source type
        if file_ext in ['.xlsx', '.xls']:
            target_format = st.selectbox("Convert to", ["CSV", "JSON"])
        elif file_ext == '.csv':
            target_format = st.selectbox("Convert to", ["Excel", "JSON"])
        elif file_ext == '.json':
            target_format = st.selectbox("Convert to", ["Excel", "CSV"])

        if st.button("Convert", use_container_width=True):
            with st.spinner("Converting..."):
                try:
                    # Path.with_suffix swaps only the final extension, whereas str.replace
                    # would also rewrite a matching substring elsewhere in the path
                    source = Path(file_path)
                    if file_ext in ['.xlsx', '.xls'] and target_format == "CSV":
                        output_path = str(source.with_suffix('.csv'))
                        excel_to_csv(file_path, output_path)
                        mime_type = "text/csv"
                    elif file_ext in ['.xlsx', '.xls'] and target_format == "JSON":
                        output_path = str(source.with_suffix('.json'))
                        excel_to_json(file_path, output_path)
                        mime_type = "application/json"
                    elif file_ext == '.csv' and target_format == "Excel":
                        output_path = str(source.with_suffix('.xlsx'))
                        csv_to_excel(file_path, output_path)
                        mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    elif file_ext == '.csv' and target_format == "JSON":
                        output_path = str(source.with_suffix('.json'))
                        csv_to_json(file_path, output_path)
                        mime_type = "application/json"
                    elif file_ext == '.json' and target_format == "Excel":
                        output_path = str(source.with_suffix('.xlsx'))
                        json_to_excel(file_path, output_path)
                        mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                    elif file_ext == '.json' and target_format == "CSV":
                        output_path = str(source.with_suffix('.csv'))
                        json_to_csv(file_path, output_path)
                        mime_type = "text/csv"

                    with open(output_path, "rb") as file:
                        st.download_button(
                            label=f"Download {target_format} file",
                            data=file,
                            file_name=Path(output_path).name,
                            mime=mime_type
                        )
                    st.success("Format conversion complete.")
                except Exception as e:
                    st.error(f"Conversion failed: {str(e)}")

# Web scraping page
elif page == "🌐 网页抓取":
    st.header("🌐 Web Page Scraping")

    url = st.text_input("Page URL", placeholder="https://example.com")
    selector = st.text_input("CSS selector (optional)", placeholder="e.g. .content, #main, p")

    col1, col2 = st.columns(2)

    with col1:
        if st.button("Scrape content", use_container_width=True):
            if not url:
                st.error("Please enter a page URL")
            else:
                with st.spinner("Scraping page content..."):
                    try:
                        content = scrape_webpage(url, selector if selector else None)
                        st.subheader("Scraped Content")
                        st.text_area("Page content", content, height=300)
                        st.success("Scraping complete.")
                    except Exception as e:
                        st.error(f"Scraping failed: {str(e)}")

    with col2:
        if st.button("Export to Excel", use_container_width=True):
            if not url:
                st.error("Please enter a page URL")
            else:
                with st.spinner("Exporting to Excel..."):
                    try:
                        # Write the export to a uniquely named file in the temp directory
                        output_filename = f"web_content_{uuid.uuid4().hex[:8]}.xlsx"
                        output_path = os.path.join(tempfile.gettempdir(), output_filename)

                        web_to_excel(url, output_path, selector if selector else None)

                        with open(output_path, "rb") as file:
                            st.download_button(
                                label="Download Excel file",
                                data=file,
                                file_name=output_filename,
                                mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                            )
                        st.success("Web page export complete.")
                    except Exception as e:
                        st.error(f"Export failed: {str(e)}")

# Database export page
elif page == "🗄️ 数据库导出":
    st.header("🗄️ Database Export")

    uploaded_file = st.file_uploader("Choose a database file", type=['db', 'sqlite', 'mdf'])
    table_name = st.text_input("Table name (optional)", placeholder="Leave empty to export all tables")

    if uploaded_file is not None:
        file_path = save_uploaded_file(uploaded_file, 'database')

        target_format = st.selectbox("Export as", ["Excel", "CSV", "JSON"])

        if st.button("Start export", use_container_width=True):
            with st.spinner("Exporting database..."):
                try:
                    file_ext = Path(file_path).suffix.lower()
                    # Drop only the final extension; str.replace would also rewrite a
                    # matching substring elsewhere in the path
                    base_path = str(Path(file_path).with_suffix(''))
                    selected_table = table_name if table_name else None
                    output_path = None
                    mime_type = None
                    continue_processing = True  # False when there is nothing to download

                    if file_ext in ['.db', '.sqlite']:
                        if target_format == "Excel":
                            output_path = base_path + '_exported.xlsx'
                            export_sqlite_to_excel(file_path, output_path, selected_table)
                            mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                        elif target_format == "CSV":
                            output_path = base_path + '_exported.csv'
                            database_to_csv(file_path, output_path, selected_table)
                            mime_type = "text/csv"
                        elif target_format == "JSON":
                            output_path = base_path + '_exported.json'
                            database_to_json(file_path, output_path, selected_table)
                            mime_type = "application/json"
                    elif file_ext == '.mdf':
                        # MDF files need a reachable SQL Server instance
                        try:
                            import pyodbc
                            # Probe the SQL Server connection
                            test_conn = pyodbc.connect("DRIVER={SQL Server};SERVER=localhost;Trusted_Connection=yes;timeout=3")
                            test_conn.close()
                            sql_server_available = True
                        except Exception:
                            sql_server_available = False
                            # Skip the download step; only the guidance below is shown
                            continue_processing = False
                            st.warning("⚠️ SQL Server is not running or cannot be reached")
                            st.info("""
                            **Exporting MDF files requires SQL Server:**

                            1. **Install SQL Server Express** (free)
                               - Download: https://www.microsoft.com/en-us/sql-server/sql-server-downloads

                            2. **Make sure the SQL Server service is running**
                               - Open the Services manager (services.msc)
                               - Start the "SQL Server (MSSQLSERVER)" service

                            3. **Configure connection permissions**
                               - Use Windows authentication or set an sa password

                            Restart the application after installation to use MDF export.
                            """)

                        if sql_server_available:
                            if target_format == "Excel":
                                output_path = base_path + '_exported.xlsx'
                                from utils.database_exporter import export_mssql_mdf_to_excel
                                export_mssql_mdf_to_excel(file_path, output_path, selected_table)
                                mime_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
                            elif target_format == "CSV":
                                output_path = base_path + '_exported.csv'
                                database_to_csv(file_path, output_path, selected_table)
                                mime_type = "text/csv"
                            elif target_format == "JSON":
                                output_path = base_path + '_exported.json'
                                database_to_json(file_path, output_path, selected_table)
                                mime_type = "application/json"
                    else:
                        st.error("Unsupported database format")
                        continue_processing = False

                    # Offer the download only when an export file was produced
                    if continue_processing and output_path and os.path.exists(output_path):
                        with open(output_path, "rb") as file:
                            st.download_button(
                                label=f"Download {target_format} file",
                                data=file,
                                file_name=Path(output_path).name,
                                mime=mime_type
                            )
                        st.success("Database export complete.")
                    elif continue_processing:
                        st.error("Failed to create the export file")
                except Exception as e:
                    st.error(f"Export failed: {str(e)}")

# Footer
st.sidebar.markdown("---")
st.sidebar.markdown("""
### How to use
1. Choose a feature module
2. Upload a file or enter a URL
3. Click the corresponding button
4. Download the result

### Supported formats
- **PDF**: .pdf
- **Images**: .jpg, .jpeg, .png, .gif, .bmp
- **Data files**: .xlsx, .xls, .csv, .json
- **Databases**: .db, .sqlite, .mdf
""")
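The export branches of the Streamlit app derive each output file name from the uploaded file's path. A minimal sketch of doing that with `pathlib`, so that only the final extension is touched even when the same text appears elsewhere in the path. `derive_output_path` is a hypothetical helper and not part of this commit.

```python
from pathlib import Path


def derive_output_path(source_path: str, new_suffix: str, tag: str = "") -> str:
    """Return source_path with its final extension swapped for new_suffix.

    `tag` is appended to the file stem (for example "_exported").
    Hypothetical helper for illustration only.
    """
    path = Path(source_path)
    # with_name replaces only the last path component, so directory names
    # that happen to contain the old extension are left alone.
    return str(path.with_name(f"{path.stem}{tag}{new_suffix}"))
```

For example, `derive_output_path("data.db", ".xlsx", tag="_exported")` yields `data_exported.xlsx`, and a directory named `archive.db` in the path is left unchanged.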
241
app_flask.py
Normal file
@ -0,0 +1,241 @@
from flask import Flask, render_template, request, jsonify, send_file, redirect, url_for
import os
import uuid
from werkzeug.utils import secure_filename
from config import Config

# Utility modules
from utils.pdf_extractor import extract_text_from_pdf, pdf_to_excel
from utils.ocr_processor import extract_text_from_image, image_to_excel, image_to_text_file
from utils.format_converter import (
    excel_to_csv, csv_to_excel, json_to_excel,
    excel_to_json, csv_to_json, json_to_csv
)
from utils.web_scraper import scrape_webpage, web_to_excel
from utils.database_exporter import export_sqlite_to_excel, database_to_csv, database_to_json

app = Flask(__name__)
app.config.from_object(Config)

# Make sure the upload directory exists
os.makedirs(app.config['UPLOAD_FOLDER'], exist_ok=True)


def allowed_file(filename):
    """Return True if the file extension is in ALLOWED_EXTENSIONS."""
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in app.config['ALLOWED_EXTENSIONS']


@app.route('/')
def index():
    """Home page."""
    return render_template('index.html')


@app.route('/upload', methods=['POST'])
def upload_file():
    """Handle a file upload."""
    if 'file' not in request.files:
        return jsonify({'error': 'No file selected'}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400

    if file and allowed_file(file.filename):
        # Prefix the sanitized name with a UUID so uploads never collide
        filename = secure_filename(file.filename)
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], f"{uuid.uuid4()}_{filename}")
        file.save(filepath)

        return jsonify({
            'success': True,
            'filename': filename,
            'filepath': filepath,
            'file_type': filename.rsplit('.', 1)[1].lower()
        })

    return jsonify({'error': 'Unsupported file type'}), 400


@app.route('/process/pdf', methods=['POST'])
def process_pdf():
    """Process a PDF file."""
    try:
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        filepath = data.get('filepath')
        action = data.get('action', 'extract')  # extract, to_excel

        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': 'File not found'}), 400

        if action == 'extract':
            text = extract_text_from_pdf(filepath)
            return jsonify({'success': True, 'text': text})

        elif action == 'to_excel':
            # Strip only the final extension, whatever its case, so the output
            # path can never equal the input path
            output_path = filepath.rsplit('.', 1)[0] + '_converted.xlsx'
            pdf_to_excel(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })

        else:
            return jsonify({'error': 'Unsupported action'}), 400

    except Exception as e:
        return jsonify({'error': str(e)}), 500


@app.route('/process/image', methods=['POST'])
def process_image():
    """Process an image file."""
    try:
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        filepath = data.get('filepath')
        action = data.get('action', 'extract')  # extract, to_excel, to_text

        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': 'File not found'}), 400

        if action == 'extract':
            text = extract_text_from_image(filepath)
            return jsonify({'success': True, 'text': text})

        elif action == 'to_excel':
            output_path = filepath.rsplit('.', 1)[0] + '_converted.xlsx'
            image_to_excel(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })

        elif action == 'to_text':
            output_path = filepath.rsplit('.', 1)[0] + '_converted.txt'
            image_to_text_file(filepath, output_path)
            return jsonify({
                'success': True,
                'download_url': f'/download/{os.path.basename(output_path)}'
            })

        else:
            return jsonify({'error': 'Unsupported action'}), 400

    except Exception as e:
        return jsonify({'error': str(e)}), 500


@app.route('/process/format', methods=['POST'])
def process_format():
    """Convert a file between Excel, CSV and JSON."""
    try:
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        filepath = data.get('filepath')
        target_format = data.get('target_format')  # excel, csv, json

        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': 'File not found'}), 400

        # Split off only the final extension; a path without one falls through
        # to the "unsupported" branch instead of raising IndexError
        base_path = filepath.rsplit('.', 1)[0]
        file_ext = filepath.rsplit('.', 1)[-1].lower()
        is_excel = file_ext in ('xlsx', 'xls')

        # Pick the converter from the source and target formats
        if is_excel and target_format == 'csv':
            output_path = base_path + '.csv'
            excel_to_csv(filepath, output_path)
        elif file_ext == 'csv' and target_format == 'excel':
            output_path = base_path + '.xlsx'
            csv_to_excel(filepath, output_path)
        elif file_ext == 'json' and target_format == 'excel':
            output_path = base_path + '.xlsx'
            json_to_excel(filepath, output_path)
        elif is_excel and target_format == 'json':
            output_path = base_path + '.json'
            excel_to_json(filepath, output_path)
        elif file_ext == 'csv' and target_format == 'json':
            output_path = base_path + '.json'
            csv_to_json(filepath, output_path)
        elif file_ext == 'json' and target_format == 'csv':
            output_path = base_path + '.csv'
            json_to_csv(filepath, output_path)
        else:
            return jsonify({'error': 'Unsupported format conversion'}), 400

        return jsonify({
            'success': True,
            'download_url': f'/download/{os.path.basename(output_path)}'
        })

    except Exception as e:
        return jsonify({'error': str(e)}), 500
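
The source/target branching in `process_format` could also be expressed as a lookup table. A minimal sketch under stated assumptions: `_placeholder` stands in for the project's converters (each assumed to take an input path and an output path), and `SOURCE_KINDS`, `CONVERSIONS` and `plan_conversion` are hypothetical names, not part of this commit.

```python
from typing import Callable, Dict, Optional, Tuple

Converter = Callable[[str, str], None]


def _placeholder(input_path: str, output_path: str) -> None:
    """Stand-in for a real converter such as excel_to_csv(input_path, output_path)."""


# Map each accepted extension to a source kind, so .xls and .xlsx share entries.
SOURCE_KINDS: Dict[str, str] = {"xlsx": "excel", "xls": "excel", "csv": "csv", "json": "json"}

# (source kind, target format) -> (converter, output extension)
CONVERSIONS: Dict[Tuple[str, str], Tuple[Converter, str]] = {
    ("excel", "csv"): (_placeholder, ".csv"),
    ("excel", "json"): (_placeholder, ".json"),
    ("csv", "excel"): (_placeholder, ".xlsx"),
    ("csv", "json"): (_placeholder, ".json"),
    ("json", "excel"): (_placeholder, ".xlsx"),
    ("json", "csv"): (_placeholder, ".csv"),
}


def plan_conversion(filepath: str, target_format: str) -> Optional[Tuple[Converter, str]]:
    """Return (converter, output_path), or None if the pair is unsupported."""
    base, dot, ext = filepath.rpartition(".")
    if not dot or not base:
        return None  # no extension to work with
    kind = SOURCE_KINDS.get(ext.lower())
    entry = CONVERSIONS.get((kind, target_format)) if kind else None
    if entry is None:
        return None
    converter, out_ext = entry
    return converter, base + out_ext
```

With a table like this, supporting a new pair means adding one dictionary entry, and the route only has to handle the `None` case with a 400 response.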


@app.route('/process/web', methods=['POST'])
def process_web():
    """Scrape a web page and export it to Excel."""
    try:
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        url = data.get('url')
        # Treat an empty selector as "no selector" for both calls below
        selector = data.get('selector') or None

        if not url:
            return jsonify({'error': 'Please enter a URL'}), 400

        # Scrape the page content
        content = scrape_webpage(url, selector)

        # Export to Excel
        output_filename = f"web_content_{uuid.uuid4().hex[:8]}.xlsx"
        output_path = os.path.join(app.config['UPLOAD_FOLDER'], output_filename)

        web_to_excel(url, output_path, selector)

        return jsonify({
            'success': True,
            'content': content if isinstance(content, str) else 'Content extracted',
            'download_url': f'/download/{output_filename}'
        })

    except Exception as e:
        return jsonify({'error': str(e)}), 500


@app.route('/process/database', methods=['POST'])
def process_database():
    """Export a SQLite database to Excel, CSV or JSON."""
    try:
        # silent=True yields None instead of raising on a missing or invalid JSON body
        data = request.get_json(silent=True) or {}
        filepath = data.get('filepath')
        target_format = data.get('target_format', 'excel')  # excel, csv, json
        table_name = data.get('table_name') or None  # optional: a specific table

        if not filepath or not os.path.exists(filepath):
            return jsonify({'error': 'File not found'}), 400

        # Split off only the final extension, whatever its case, so the output
        # path can never equal the input path
        base_path = filepath.rsplit('.', 1)[0]
        file_ext = filepath.rsplit('.', 1)[-1].lower()

        if file_ext in ['db', 'sqlite']:
            if target_format == 'excel':
                output_path = base_path + '_exported.xlsx'
                export_sqlite_to_excel(filepath, output_path, table_name)
            elif target_format == 'csv':
                output_path = base_path + '_exported.csv'
                database_to_csv(filepath, output_path, table_name)
            elif target_format == 'json':
                output_path = base_path + '_exported.json'
                database_to_json(filepath, output_path, table_name)
            else:
                return jsonify({'error': 'Unsupported export format'}), 400
        else:
            return jsonify({'error': 'Unsupported database format'}), 400

        return jsonify({
            'success': True,
            'download_url': f'/download/{os.path.basename(output_path)}'
        })

    except Exception as e:
        return jsonify({'error': str(e)}), 500
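

@app.route('/download/<filename>')
def download_file(filename):
    """Download a file from the upload folder."""
    # secure_filename drops path separators and "..", so the lookup stays
    # inside UPLOAD_FOLDER
    safe_name = secure_filename(filename)
    filepath = os.path.join(app.config['UPLOAD_FOLDER'], safe_name)
    if safe_name and os.path.isfile(filepath):
        return send_file(filepath, as_attachment=True)
    return jsonify({'error': 'File not found'}), 404


if __name__ == '__main__':
    # Read the run settings from the environment (config.py has already loaded .env).
    # Debug mode stays off unless DEBUG=true, because the Werkzeug debugger should
    # not be exposed on a public interface.
    debug = os.getenv('DEBUG', 'false').lower() == 'true'
    app.run(debug=debug, host=os.getenv('HOST', '0.0.0.0'), port=int(os.getenv('PORT', '5000')))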
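The `/process/*` routes in app_flask.py accept a `filepath` supplied by the client. A minimal sketch of one way to confirm that such a path stays inside the upload folder before it is opened. `resolve_in_folder` is a hypothetical helper and not part of this commit; it relies only on `os.path.realpath` and `os.path.commonpath`.

```python
import os
from typing import Optional


def resolve_in_folder(folder: str, candidate: str) -> Optional[str]:
    """Return the resolved path of candidate if it lies inside folder, else None.

    Hypothetical helper for illustration only.
    """
    root = os.path.realpath(folder)
    target = os.path.realpath(candidate)
    try:
        # commonpath compares whole path components, so "uploads2" is not
        # mistaken for a child of "uploads".
        inside = os.path.commonpath([root, target]) == root
    except ValueError:
        # Raised, for example, when the paths are on different Windows drives.
        inside = False
    # The folder itself is not a valid file target.
    return target if inside and target != root else None
```

A route could call `resolve_in_folder(app.config['UPLOAD_FOLDER'], filepath)` and return a 400 response when the result is `None`.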
26
config.py
Normal file
@ -0,0 +1,26 @@
import os
from dotenv import load_dotenv

load_dotenv()


class Config:
    SECRET_KEY = os.getenv('SECRET_KEY', 'dev-secret-key')
    UPLOAD_FOLDER = 'uploads'
    MAX_CONTENT_LENGTH = 16 * 1024 * 1024  # 16MB max file size

    # OCR settings
    TESSERACT_PATH = os.getenv('TESSERACT_PATH', '')

    # Database settings
    DATABASE_URI = os.getenv('DATABASE_URI', 'sqlite:///data.db')

    # Web scraping settings
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

    # Allowed upload file types
    ALLOWED_EXTENSIONS = {
        'pdf', 'txt', 'doc', 'docx',
        'jpg', 'jpeg', 'png', 'gif', 'bmp',
        'xlsx', 'xls', 'csv', 'json',
        'db', 'sqlite'
    }
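The `.env.example` in this commit lists `DEBUG`, `HOST` and `PORT`, which arrive as plain strings when read from the environment. A minimal sketch of parsing such values with safe fallbacks. `env_bool` and `env_int` are hypothetical helpers and not part of this commit; the optional `environ` argument exists only so the functions can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional

_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(name: str, default: bool = False,
             environ: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean flag such as DEBUG=false; unknown text counts as False."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


def env_int(name: str, default: int,
            environ: Optional[Mapping[str, str]] = None) -> int:
    """Read an integer such as PORT=5000, falling back to default on bad input."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return int(raw)
    except ValueError:
        return default
```

A `Config` class could then expose, for example, `DEBUG = env_bool('DEBUG')` and `PORT = env_int('PORT', 5000)`.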
253
diagnose_ocr.py
Normal file
@ -0,0 +1,253 @@
#!/usr/bin/env python3
"""
OCR diagnostic script.
Checks the installation and configuration of Tesseract OCR.
"""

import os
import sys
import tempfile
from pathlib import Path


def check_tesseract_installation():
    """Check whether Tesseract OCR is installed; return its path, or None."""
    print("🔍 Checking Tesseract OCR installation...")

    # Common Tesseract install paths on Windows
    possible_paths = [
        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
        r"D:\Program Files\Tesseract-OCR\tesseract.exe",
        r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
    ]

    tesseract_path = None
    for path in possible_paths:
        if os.path.exists(path):
            tesseract_path = path
            print(f"✅ Tesseract found: {path}")
            break

    if not tesseract_path:
        print("❌ Tesseract not found in the default install paths")

        # Fall back to the system PATH
        import shutil
        tesseract_cmd = shutil.which("tesseract")
        if tesseract_cmd:
            print(f"✅ Tesseract found on PATH: {tesseract_cmd}")
            tesseract_path = tesseract_cmd
        else:
            print("❌ Tesseract not found on the system PATH")

    return tesseract_path


def check_python_dependencies():
    """Check the Python packages that the OCR features depend on."""
    print("\n🐍 Checking Python dependencies...")

    dependencies = ["pytesseract", "PIL", "pandas"]

    for dep in dependencies:
        try:
            if dep == "PIL":
                import PIL
                print(f"✅ {dep}: {PIL.__version__}")
            elif dep == "pytesseract":
                import pytesseract
                print(f"✅ {dep}: installed")
            elif dep == "pandas":
                import pandas
                print(f"✅ {dep}: {pandas.__version__}")
        except ImportError as e:
            print(f"❌ {dep}: not installed - {e}")
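
The dependency check above imports each package in order to report it. A minimal sketch of an alternative that looks the package up without importing it, using the standard library's `importlib`. `probe` is a hypothetical helper and not part of this commit; the distribution name (for example `Pillow` for the `PIL` module) has to be supplied by the caller.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def probe(module_name: str, dist_name: Optional[str] = None) -> Optional[str]:
    """Return the installed version of a top-level module without importing it.

    Returns None if the module cannot be found, and "unknown" if it can be
    found but has no distribution metadata (as with standard-library modules).
    Hypothetical helper for illustration only.
    """
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        return "unknown"
```

For instance, `probe("PIL", "Pillow")` would report the Pillow version when it is installed and `None` otherwise.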


def create_test_image():
    """Create a test image containing text; return its path, or None on failure."""
    print("\n🖼️ Creating test image...")

    try:
        from PIL import Image, ImageDraw, ImageFont

        # Create a blank white image
        img = Image.new('RGB', (400, 200), color='white')
        d = ImageDraw.Draw(img)

        # Try several fonts until one loads
        fonts_to_try = [
            "arial.ttf",
            "Arial.ttf",
            "simhei.ttf",  # SimHei
            "msyh.ttc",    # Microsoft YaHei
            "C:\\Windows\\Fonts\\arial.ttf",
            "C:\\Windows\\Fonts\\simhei.ttf"
        ]

        font = None
        for font_path in fonts_to_try:
            try:
                font = ImageFont.truetype(font_path, 24)
                print(f"✅ Font found: {font_path}")
                break
            except OSError:
                # ImageFont.truetype raises OSError when the font cannot be opened
                continue

        if not font:
            print("⚠️ No suitable font found, using the default font")
            font = ImageFont.load_default()

        # Draw clear Chinese and English test text
        text_lines = [
            "OCR测试文字",
            "Hello World",
            "1234567890",
            "ABCDEFGHIJKLMN"
        ]

        y_position = 30
        for line in text_lines:
            d.text((50, y_position), line, fill="black", font=font)
            y_position += 40

        # Save the image
        test_image_path = os.path.join(tempfile.gettempdir(), "ocr_test_image.png")
        img.save(test_image_path, "PNG")

        print(f"✅ Test image created: {test_image_path}")
        print(f"   Image size: {os.path.getsize(test_image_path)} bytes")

        return test_image_path

    except Exception as e:
        print(f"❌ Failed to create test image: {e}")
        return None
|
||||
|
||||
def test_ocr_functionality(image_path):
    """Test OCR recognition."""
    print("\n🔤 Testing OCR recognition...")

    if not image_path or not os.path.exists(image_path):
        print("❌ Test image does not exist")
        return

    try:
        import pytesseract
        from PIL import Image

        # Set the Tesseract path (if needed)
        tesseract_path = check_tesseract_installation()
        if tesseract_path:
            pytesseract.pytesseract.tesseract_cmd = tesseract_path

        # Open and inspect the image
        image = Image.open(image_path)
        print(f"✅ Image format: {image.format}, size: {image.size}")

        # Test OCR with different languages
        languages = ['eng', 'chi_sim', 'eng+chi_sim']

        for lang in languages:
            try:
                print(f"\n Testing language: {lang}")
                text = pytesseract.image_to_string(image, lang=lang)

                if text.strip():
                    print(" ✅ Recognition succeeded:")
                    print(f" {text.strip()}")
                else:
                    print(" ⚠️ No text recognized")

            except Exception as e:
                print(f" ❌ Recognition failed for language {lang}: {e}")

        # Inspect the image data
        print("\n📊 Image data check:")
        print(f" Mode: {image.mode}")
        print(f" Channels: {', '.join(image.getbands())}")

        # Check that the image file is readable.
        # verify() must run on a freshly opened file, so reopen it here.
        try:
            with Image.open(image_path) as fresh_image:
                fresh_image.verify()
            print(" ✅ Image verification passed")
        except Exception as e:
            print(f" ❌ Image verification failed: {e}")

    except Exception as e:
        print(f"❌ OCR test failed: {e}")

def check_system_environment():
    """Check the system environment."""
    print("\n💻 Checking the system environment...")

    print(f" Operating system: {os.name}")
    print(f" Python version: {sys.version}")
    print(f" Current directory: {os.getcwd()}")
    print(f" Temp directory: {tempfile.gettempdir()}")

def main():
    """Main diagnostic routine."""
    print("=" * 60)
    print("OCR diagnostic tool")
    print("=" * 60)

    # Check the system environment
    check_system_environment()

    # Check dependencies
    check_python_dependencies()

    # Check the Tesseract installation
    tesseract_path = check_tesseract_installation()

    # Create the test image
    test_image_path = create_test_image()

    # Test OCR
    if test_image_path:
        test_ocr_functionality(test_image_path)

    # Suggest fixes
    print("\n" + "=" * 60)
    print("💡 Suggested fixes")
    print("=" * 60)

    if not tesseract_path:
        print("""
🔧 Tesseract OCR is not installed. To install it:

1. Download Tesseract OCR:
   - Windows installer (UB Mannheim builds): https://github.com/UB-Mannheim/tesseract/wiki
   - Choose the Windows version

2. Install:
   - Run the installer
   - Install to the default path: C:\\Program Files\\Tesseract-OCR\\
   - Tick the "Add to PATH" option during installation
   - Install the Chinese language pack (optional but recommended)

3. Verify the installation:
   - Restart the command line
   - Run: tesseract --version
   - Version information should be displayed
""")
    else:
        print("""
✅ Tesseract is installed; the problem may be one of the following:

1. Image format
   - Make sure the uploaded image is in a supported format (PNG, JPG, etc.)
   - The image must contain clear, legible text

2. Language packs
   - Make sure the Chinese language pack (chi_sim) is installed
   - Try English-only recognition

3. Permissions
   - Make sure the application is allowed to access temporary files
""")

    print("\n🔄 Temporary workaround:")
    print(" Disable OCR in the application for now, or use an online OCR service")

if __name__ == "__main__":
    main()
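The advice printed by `main()` asks the reader to confirm that the `chi_sim` language pack is installed. A minimal sketch of how that check could be automated is shown below. It assumes the `tesseract` binary is reachable and relies on the CLI's `--list-langs` flag; the helper names `parse_langs` and `installed_tesseract_langs` are illustrative and not part of this commit.

```python
import re
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    The header line and any warnings contain spaces, so only lines that
    look like a bare code (for example "eng" or "chi_sim") are kept.
    """
    codes = []
    for line in output.splitlines():
        line = line.strip()
        if line and re.fullmatch(r"[\w/\-]+", line):
            codes.append(line)
    return codes


def installed_tesseract_langs(cmd="tesseract"):
    """Return the installed language codes, or [] if the binary is missing."""
    if shutil.which(cmd) is None:
        return []
    result = subprocess.run(
        [cmd, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Assumption: some builds write the list to stderr, so both streams are read.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

A caller could then test `"chi_sim" in installed_tesseract_langs()` before attempting the `chi_sim` and `eng+chi_sim` runs, instead of discovering the missing pack from an OCR error.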
23
pyproject.toml
Normal file
@ -0,0 +1,23 @@
[project]
name = "data-extractor-converter"
version = "1.0.0"
description = "Data Extractor and Converter - a multi-purpose data processing tool for university students"
requires-python = ">=3.8"
dependencies = [
    "flask",  # imported by run.py (version not pinned)
    "streamlit>=1.28.0",
    "pandas>=2.0.3",
    "requests>=2.31.0",
    "beautifulsoup4>=4.12.2",
    "pymupdf>=1.23.7",
    "pytesseract>=0.3.10",
    "pillow>=10.0.0",
    "openpyxl>=3.1.2",
    "sqlalchemy>=2.0.20",
    "pymysql>=1.1.0",
    "python-dotenv>=1.0.0",
    "pyodbc>=4.0.0",
    "alibabacloud-ocr-api20210707>=1.0.2",
    "alibabacloud-tea-openapi>=0.3.6",
    "alibabacloud-tea-util>=0.3.8",
    "aiohttp>=3.8.0",
]
64
run.py
Normal file
@ -0,0 +1,64 @@
#!/usr/bin/env python3
"""
Data Extractor and Converter - launch script.
A multi-purpose data processing tool built for university students.
"""

import os
import sys

def check_dependencies():
    """Check that the required dependencies are installed."""
    try:
        import flask
        import pandas
        import requests
        import fitz  # PyMuPDF
        import pytesseract
        import sqlalchemy
        print("✓ All dependencies are installed")
        return True
    except ImportError as e:
        print(f"✗ Missing dependency: {e}")
        print("Run: pip install -r requirements.txt")
        return False

def create_upload_directories():
    """Create the required upload directories."""
    directories = ['uploads', 'static', 'templates']

    for directory in directories:
        os.makedirs(directory, exist_ok=True)

    print("✓ Directory structure created")

def main():
    """Entry point."""
    print("=" * 50)
    print("Data Extractor and Converter - a tool for university students")
    print("=" * 50)

    # Check dependencies
    if not check_dependencies():
        sys.exit(1)

    # Import the app only after the check, so that a missing package is
    # reported by check_dependencies() instead of as a raw traceback
    from app import app

    # Create directories
    create_upload_directories()

    # Settings follow the DEBUG / HOST / PORT keys listed in .env.example
    debug = os.getenv('DEBUG', 'false').lower() == 'true'
    host = os.getenv('HOST', '0.0.0.0')
    port = int(os.getenv('PORT', '5000'))

    print("\nStartup info:")
    print(f"- Local access: http://localhost:{port}")
    print(f"- Network access: http://<your-LAN-IP>:{port}")
    print("- Stop the server: Ctrl+C")
    print("\n" + "=" * 50)

    # Start the Flask app
    try:
        app.run(debug=debug, host=host, port=port)
    except KeyboardInterrupt:
        print("\n\nServer stopped")
    except Exception as e:
        print(f"\n\nStartup failed: {e}")

if __name__ == '__main__':
    main()
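`check_dependencies()` stops at the first `ImportError`, so only one missing package is reported per run. As a hedged sketch, the same check could list every missing module at once with `importlib.util.find_spec`, which locates a top-level module without executing it. The function name `find_missing_modules` and the `REQUIRED` list are illustrative; the module names are the ones imported in `check_dependencies()`.

```python
import importlib.util

# Module names taken from check_dependencies() in run.py
REQUIRED = ["flask", "pandas", "requests", "fitz", "pytesseract", "sqlalchemy"]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Calling `find_missing_modules(REQUIRED)` returns an empty list when everything is available, so the launcher could print the full list of missing names before exiting.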
416
static/script.js
Normal file
@ -0,0 +1,416 @@
|
||||
// 全局变量
|
||||
let currentFile = null;
|
||||
|
||||
// 标签页切换功能
|
||||
function openTab(tabName) {
|
||||
// 隐藏所有标签页内容
|
||||
const tabContents = document.getElementsByClassName('tab-content');
|
||||
for (let i = 0; i < tabContents.length; i++) {
|
||||
tabContents[i].classList.remove('active');
|
||||
}
|
||||
|
||||
// 移除所有标签按钮的激活状态
|
||||
const tabButtons = document.getElementsByClassName('tab-button');
|
||||
for (let i = 0; i < tabButtons.length; i++) {
|
||||
tabButtons[i].classList.remove('active');
|
||||
}
|
||||
|
||||
// 显示选中的标签页内容
|
||||
document.getElementById(tabName).classList.add('active');
|
||||
|
||||
// 激活对应的标签按钮
|
||||
event.currentTarget.classList.add('active');
|
||||
|
||||
// 清空当前文件
|
||||
currentFile = null;
|
||||
clearResults();
|
||||
}
|
||||
|
||||
// 文件上传处理
|
||||
function setupFileUpload(inputId, uploadAreaId) {
|
||||
const fileInput = document.getElementById(inputId);
|
||||
const uploadArea = document.getElementById(uploadAreaId);
|
||||
|
||||
fileInput.addEventListener('change', function(e) {
|
||||
if (this.files.length > 0) {
|
||||
handleFileUpload(this.files[0], uploadArea);
|
||||
}
|
||||
});
|
||||
|
||||
// 拖拽上传功能
|
||||
uploadArea.addEventListener('dragover', function(e) {
|
||||
e.preventDefault();
|
||||
this.style.borderColor = '#2980b9';
|
||||
this.style.background = '#e9ecef';
|
||||
});
|
||||
|
||||
uploadArea.addEventListener('dragleave', function(e) {
|
||||
e.preventDefault();
|
||||
this.style.borderColor = '#3498db';
|
||||
this.style.background = '#f8f9fa';
|
||||
});
|
||||
|
||||
uploadArea.addEventListener('drop', function(e) {
|
||||
e.preventDefault();
|
||||
this.style.borderColor = '#3498db';
|
||||
this.style.background = '#f8f9fa';
|
||||
|
||||
if (e.dataTransfer.files.length > 0) {
|
||||
handleFileUpload(e.dataTransfer.files[0], uploadArea);
|
||||
}
|
||||
});
|
||||
}
|
||||
|
||||
// 处理文件上传
|
||||
async function handleFileUpload(file, uploadArea) {
|
||||
const formData = new FormData();
|
||||
formData.append('file', file);
|
||||
|
||||
showStatus('正在上传文件...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/upload', {
|
||||
method: 'POST',
|
||||
body: formData
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
currentFile = result;
|
||||
uploadArea.innerHTML = `
|
||||
<div style="text-align: center;">
|
||||
<p style="color: #27ae60; font-weight: bold;">✓ 文件上传成功</p>
|
||||
<p>文件名: ${result.filename}</p>
|
||||
<p>文件类型: ${result.file_type}</p>
|
||||
<button onclick="clearFile('${uploadArea.id}')" class="btn" style="background: #e74c3c; color: white; margin-top: 10px;">重新选择</button>
|
||||
</div>
|
||||
`;
|
||||
showStatus('文件上传成功!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('上传失败: ' + error.message, 'error');
|
||||
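        // fileInput is not defined in this function, and assigning innerHTML
        // removes the hidden input, so derive the id and rebuild the input
        const fileInputId = uploadArea.id.replace('-upload-area', '-file');
        uploadArea.innerHTML = `
            <input type="file" id="${fileInputId}" style="display: none;">
            <div class="upload-placeholder" onclick="document.getElementById('${fileInputId}').click()">
                <p>点击选择文件或拖拽文件到此处</p>
                <p class="file-types">上传失败,请重试</p>
            </div>
        `;
        document.getElementById(fileInputId).addEventListener('change', function() {
            if (this.files.length > 0) {
                handleFileUpload(this.files[0], uploadArea);
            }
        });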
|
||||
}
|
||||
}
|
||||
|
||||
// 清空文件选择
|
||||
function clearFile(uploadAreaId) {
|
||||
const uploadArea = document.getElementById(uploadAreaId);
|
||||
const fileInputId = uploadAreaId.replace('-upload-area', '-file');
|
||||
|
||||
uploadArea.innerHTML = `
|
||||
<input type="file" id="${fileInputId}" style="display: none;">
|
||||
<div class="upload-placeholder" onclick="document.getElementById('${fileInputId}').click()">
|
||||
<p>点击选择文件或拖拽文件到此处</p>
|
||||
<p class="file-types">支持格式: 根据标签页不同</p>
|
||||
</div>
|
||||
`;
|
||||
|
||||
currentFile = null;
|
||||
clearResults();
|
||||
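    // Bind only the new input: calling setupFileUpload() again would attach
    // a second set of drag-and-drop listeners to uploadArea
    document.getElementById(fileInputId).addEventListener('change', function() {
        if (this.files.length > 0) {
            handleFileUpload(this.files[0], uploadArea);
        }
    });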
|
||||
}
|
||||
|
||||
// PDF处理功能
|
||||
async function processPdf(action) {
|
||||
if (!currentFile) {
|
||||
showStatus('请先选择PDF文件', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
showStatus('正在处理PDF文件...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/pdf', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
filepath: currentFile.filepath,
|
||||
action: action
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
if (action === 'extract') {
|
||||
                const pdfResult = document.getElementById('pdf-result');
                pdfResult.innerHTML = `
                    <h4>提取的文本内容:</h4>
                    <div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;"></div>
                `;
                // Assign via textContent so the extracted text is never parsed as HTML
                pdfResult.querySelector('div').textContent = result.text || '未提取到文本内容';
|
||||
} else if (action === 'to_excel') {
|
||||
document.getElementById('pdf-result').innerHTML = `
|
||||
<h4>转换成功!</h4>
|
||||
<p>PDF文件已成功转换为Excel格式</p>
|
||||
<a href="${result.download_url}" class="download-link" download>下载Excel文件</a>
|
||||
`;
|
||||
}
|
||||
showStatus('PDF处理完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('处理失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 图片处理功能
|
||||
async function processImage(action) {
|
||||
if (!currentFile) {
|
||||
showStatus('请先选择图片文件', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
showStatus('正在处理图片文件...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/image', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
filepath: currentFile.filepath,
|
||||
action: action
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
if (action === 'extract') {
|
||||
                const imageResult = document.getElementById('image-result');
                imageResult.innerHTML = `
                    <h4>识别的文字内容:</h4>
                    <div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;"></div>
                `;
                // Assign via textContent so the recognized text is never parsed as HTML
                imageResult.querySelector('div').textContent = result.text || '未识别到文字内容';
|
||||
} else {
|
||||
const formatName = action === 'to_excel' ? 'Excel' : '文本';
|
||||
document.getElementById('image-result').innerHTML = `
|
||||
<h4>转换成功!</h4>
|
||||
<p>图片文件已成功转换为${formatName}格式</p>
|
||||
<a href="${result.download_url}" class="download-link" download>下载${formatName}文件</a>
|
||||
`;
|
||||
}
|
||||
showStatus('图片处理完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('处理失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 格式转换功能
|
||||
async function processFormat() {
|
||||
if (!currentFile) {
|
||||
showStatus('请先选择文件', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
const targetFormat = document.getElementById('target-format').value;
|
||||
|
||||
showStatus('正在转换文件格式...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/format', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
filepath: currentFile.filepath,
|
||||
target_format: targetFormat
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
document.getElementById('format-result').innerHTML = `
|
||||
<h4>转换成功!</h4>
|
||||
<p>文件已成功转换为${targetFormat.toUpperCase()}格式</p>
|
||||
<a href="${result.download_url}" class="download-link" download>下载文件</a>
|
||||
`;
|
||||
showStatus('格式转换完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('转换失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 网页抓取功能
|
||||
async function processWeb() {
|
||||
const url = document.getElementById('web-url').value;
|
||||
const selector = document.getElementById('css-selector').value;
|
||||
|
||||
if (!url) {
|
||||
showStatus('请输入网页URL', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
showStatus('正在抓取网页内容...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/web', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
url: url,
|
||||
selector: selector
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
            const webResult = document.getElementById('web-result');
            webResult.innerHTML = `
                <h4>抓取结果:</h4>
                <div style="max-height: 300px; overflow-y: auto; background: white; padding: 15px; border-radius: 5px;"></div>
            `;
            // Assign via textContent so markup in the scraped page cannot run as HTML
            webResult.querySelector('div').textContent = result.content || '未抓取到内容';
|
||||
showStatus('网页抓取完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('抓取失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 网页抓取并导出为Excel
|
||||
async function processWebToExcel() {
|
||||
const url = document.getElementById('web-url').value;
|
||||
const selector = document.getElementById('css-selector').value;
|
||||
|
||||
if (!url) {
|
||||
showStatus('请输入网页URL', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
showStatus('正在抓取网页并导出为Excel...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/web', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
url: url,
|
||||
selector: selector
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
document.getElementById('web-result').innerHTML = `
|
||||
<h4>导出成功!</h4>
|
||||
<p>网页内容已成功导出为Excel格式</p>
|
||||
<a href="${result.download_url}" class="download-link" download>下载Excel文件</a>
|
||||
`;
|
||||
showStatus('网页导出完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('导出失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 数据库导出功能
|
||||
async function processDatabase() {
|
||||
if (!currentFile) {
|
||||
showStatus('请先选择数据库文件', 'error');
|
||||
return;
|
||||
}
|
||||
|
||||
const targetFormat = document.getElementById('db-target-format').value;
|
||||
const tableName = document.getElementById('table-name').value;
|
||||
|
||||
showStatus('正在导出数据库...', 'info');
|
||||
|
||||
try {
|
||||
const response = await fetch('/process/database', {
|
||||
method: 'POST',
|
||||
headers: {
|
||||
'Content-Type': 'application/json'
|
||||
},
|
||||
body: JSON.stringify({
|
||||
filepath: currentFile.filepath,
|
||||
target_format: targetFormat,
|
||||
table_name: tableName
|
||||
})
|
||||
});
|
||||
|
||||
const result = await response.json();
|
||||
|
||||
if (result.success) {
|
||||
document.getElementById('database-result').innerHTML = `
|
||||
<h4>导出成功!</h4>
|
||||
<p>数据库已成功导出为${targetFormat.toUpperCase()}格式</p>
|
||||
<a href="${result.download_url}" class="download-link" download>下载文件</a>
|
||||
`;
|
||||
showStatus('数据库导出完成!', 'success');
|
||||
} else {
|
||||
throw new Error(result.error);
|
||||
}
|
||||
} catch (error) {
|
||||
showStatus('导出失败: ' + error.message, 'error');
|
||||
}
|
||||
}
|
||||
|
||||
// 显示状态消息
|
||||
function showStatus(message, type) {
|
||||
const statusEl = document.getElementById('status-message');
|
||||
statusEl.textContent = message;
|
||||
statusEl.className = `status-message status-${type}`;
|
||||
statusEl.style.display = 'block';
|
||||
|
||||
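    // Cancel the previous timer so an older message cannot hide a newer one early
    clearTimeout(statusEl.hideTimer);
    statusEl.hideTimer = setTimeout(() => {
        statusEl.style.display = 'none';
    }, 5000);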
|
||||
}
|
||||
|
||||
// 清空结果区域
|
||||
function clearResults() {
|
||||
const resultAreas = document.getElementsByClassName('result-area');
|
||||
for (let i = 0; i < resultAreas.length; i++) {
|
||||
resultAreas[i].innerHTML = '';
|
||||
}
|
||||
}
|
||||
|
||||
// 初始化页面
|
||||
document.addEventListener('DOMContentLoaded', function() {
|
||||
// 设置文件上传功能
|
||||
setupFileUpload('pdf-file', 'pdf-upload-area');
|
||||
setupFileUpload('image-file', 'image-upload-area');
|
||||
setupFileUpload('format-file', 'format-upload-area');
|
||||
setupFileUpload('db-file', 'db-upload-area');
|
||||
|
||||
// 设置输入框回车事件
|
||||
document.getElementById('web-url').addEventListener('keypress', function(e) {
|
||||
if (e.key === 'Enter') {
|
||||
processWeb();
|
||||
}
|
||||
});
|
||||
});
|
||||
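script.js sends `currentFile.filepath` back to `/process/pdf`, `/process/image`, `/process/format` and `/process/database`, so the server receives a path chosen by the client. The server-side handlers are not shown in this part of the commit; as a hedged sketch, a handler could confirm that the path stays inside the upload directory before opening it. The function name `resolve_upload_path` is an assumption for illustration, and the default directory name `uploads` is taken from `create_upload_directories()` in run.py.

```python
import os


def resolve_upload_path(client_path, upload_dir="uploads"):
    """Return the absolute path if it lies inside upload_dir, else raise ValueError."""
    base = os.path.realpath(upload_dir)
    target = os.path.realpath(client_path)
    # commonpath equals base only when target is base itself or lies below it;
    # the directory itself is rejected because a file is expected.
    if target == base or os.path.commonpath([base, target]) != base:
        raise ValueError(f"path outside upload directory: {client_path!r}")
    return target
```

With this check, a request carrying `uploads/../secret.txt` is rejected before any file is read, while `uploads/report.pdf` resolves normally.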
265
static/style.css
Normal file
@ -0,0 +1,265 @@
|
||||
* {
|
||||
margin: 0;
|
||||
padding: 0;
|
||||
box-sizing: border-box;
|
||||
}
|
||||
|
||||
body {
|
||||
font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
|
||||
background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
|
||||
min-height: 100vh;
|
||||
padding: 20px;
|
||||
}
|
||||
|
||||
.container {
|
||||
max-width: 1200px;
|
||||
margin: 0 auto;
|
||||
background: white;
|
||||
border-radius: 15px;
|
||||
box-shadow: 0 20px 40px rgba(0,0,0,0.1);
|
||||
overflow: hidden;
|
||||
}
|
||||
|
||||
header {
|
||||
background: linear-gradient(135deg, #2c3e50, #3498db);
|
||||
color: white;
|
||||
padding: 40px;
|
||||
text-align: center;
|
||||
}
|
||||
|
||||
header h1 {
|
||||
font-size: 2.5em;
|
||||
margin-bottom: 10px;
|
||||
}
|
||||
|
||||
.subtitle {
|
||||
font-size: 1.2em;
|
||||
opacity: 0.9;
|
||||
}
|
||||
|
||||
.tabs {
|
||||
display: flex;
|
||||
background: #f8f9fa;
|
||||
border-bottom: 1px solid #dee2e6;
|
||||
}
|
||||
|
||||
.tab-button {
|
||||
flex: 1;
|
||||
padding: 15px 20px;
|
||||
border: none;
|
||||
background: transparent;
|
||||
cursor: pointer;
|
||||
font-size: 16px;
|
||||
font-weight: 500;
|
||||
transition: all 0.3s ease;
|
||||
border-bottom: 3px solid transparent;
|
||||
}
|
||||
|
||||
.tab-button:hover {
|
||||
background: #e9ecef;
|
||||
}
|
||||
|
||||
.tab-button.active {
|
||||
background: white;
|
||||
border-bottom-color: #3498db;
|
||||
color: #3498db;
|
||||
}
|
||||
|
||||
.tab-content {
|
||||
display: none;
|
||||
padding: 30px;
|
||||
}
|
||||
|
||||
.tab-content.active {
|
||||
display: block;
|
||||
}
|
||||
|
||||
.tab-content h2 {
|
||||
color: #2c3e50;
|
||||
margin-bottom: 20px;
|
||||
font-size: 1.8em;
|
||||
}
|
||||
|
||||
.upload-area {
|
||||
border: 2px dashed #3498db;
|
||||
border-radius: 10px;
|
||||
padding: 40px;
|
||||
text-align: center;
|
||||
margin-bottom: 20px;
|
||||
transition: all 0.3s ease;
|
||||
background: #f8f9fa;
|
||||
}
|
||||
|
||||
.upload-area:hover {
|
||||
border-color: #2980b9;
|
||||
background: #e9ecef;
|
||||
}
|
||||
|
||||
.upload-placeholder {
|
||||
cursor: pointer;
|
||||
}
|
||||
|
||||
.upload-placeholder p {
|
||||
font-size: 18px;
|
||||
color: #6c757d;
|
||||
margin-bottom: 10px;
|
||||
}
|
||||
|
||||
.file-types {
|
||||
font-size: 14px !important;
|
||||
color: #adb5bd !important;
|
||||
}
|
||||
|
||||
.input-group {
|
||||
margin-bottom: 20px;
|
||||
}
|
||||
|
||||
.input-group label {
|
||||
display: block;
|
||||
margin-bottom: 5px;
|
||||
font-weight: 500;
|
||||
color: #495057;
|
||||
}
|
||||
|
||||
.input-group input, .input-group select {
|
||||
width: 100%;
|
||||
padding: 10px;
|
||||
border: 1px solid #ced4da;
|
||||
border-radius: 5px;
|
||||
font-size: 16px;
|
||||
}
|
||||
|
||||
.input-group small {
|
||||
color: #6c757d;
|
||||
font-size: 12px;
|
||||
}
|
||||
|
||||
.action-buttons {
|
||||
display: flex;
|
||||
gap: 10px;
|
||||
margin-bottom: 20px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
|
||||
.conversion-options {
|
||||
display: flex;
|
||||
align-items: center;
|
||||
gap: 10px;
|
||||
margin-bottom: 20px;
|
||||
flex-wrap: wrap;
|
||||
}
|
||||
|
||||
.btn {
|
||||
padding: 12px 24px;
|
||||
border: none;
|
||||
border-radius: 5px;
|
||||
cursor: pointer;
|
||||
font-size: 16px;
|
||||
font-weight: 500;
|
||||
transition: all 0.3s ease;
|
||||
text-decoration: none;
|
||||
display: inline-block;
|
||||
}
|
||||
|
||||
.btn-primary {
|
||||
background: #3498db;
|
||||
color: white;
|
||||
}
|
||||
|
||||
.btn-primary:hover {
|
||||
background: #2980b9;
|
||||
}
|
||||
|
||||
.btn-success {
|
||||
background: #27ae60;
|
||||
color: white;
|
||||
}
|
||||
|
||||
.btn-success:hover {
|
||||
background: #219a52;
|
||||
}
|
||||
|
||||
.btn-info {
|
||||
background: #17a2b8;
|
||||
color: white;
|
||||
}
|
||||
|
||||
.btn-info:hover {
|
||||
background: #138496;
|
||||
}
|
||||
|
||||
.result-area {
|
||||
background: #f8f9fa;
|
||||
border: 1px solid #dee2e6;
|
||||
border-radius: 5px;
|
||||
padding: 20px;
|
||||
min-height: 100px;
|
||||
max-height: 400px;
|
||||
overflow-y: auto;
|
||||
white-space: pre-wrap;
|
||||
font-family: 'Courier New', monospace;
|
||||
}
|
||||
|
||||
.status-message {
|
||||
position: fixed;
|
||||
top: 20px;
|
||||
right: 20px;
|
||||
padding: 15px 20px;
|
||||
border-radius: 5px;
|
||||
color: white;
|
||||
font-weight: 500;
|
||||
z-index: 1000;
|
||||
display: none;
|
||||
}
|
||||
|
||||
.status-success {
|
||||
background: #27ae60;
|
||||
}
|
||||
|
||||
.status-error {
|
||||
background: #e74c3c;
|
||||
}
|
||||
|
||||
.status-info {
|
||||
background: #3498db;
|
||||
}
|
||||
|
||||
.download-link {
|
||||
display: inline-block;
|
||||
margin-top: 10px;
|
||||
padding: 10px 15px;
|
||||
background: #27ae60;
|
||||
color: white;
|
||||
text-decoration: none;
|
||||
border-radius: 5px;
|
||||
transition: background 0.3s ease;
|
||||
}
|
||||
|
||||
.download-link:hover {
|
||||
background: #219a52;
|
||||
}
|
||||
|
||||
@media (max-width: 768px) {
|
||||
.container {
|
||||
margin: 10px;
|
||||
border-radius: 10px;
|
||||
}
|
||||
|
||||
.tabs {
|
||||
flex-direction: column;
|
||||
}
|
||||
|
||||
.tab-button {
|
||||
border-bottom: 1px solid #dee2e6;
|
||||
border-right: none;
|
||||
}
|
||||
|
||||
.action-buttons {
|
||||
flex-direction: column;
|
||||
}
|
||||
|
||||
.conversion-options {
|
||||
flex-direction: column;
|
||||
align-items: stretch;
|
||||
}
|
||||
}
|
||||
132
templates/index.html
Normal file
@ -0,0 +1,132 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang="zh-CN">
|
||||
<head>
|
||||
<meta charset="UTF-8">
|
||||
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||
<title>数据提取与转换器 - 大学生专用工具</title>
|
||||
<link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
|
||||
</head>
|
||||
<body>
|
||||
<div class="container">
|
||||
<header>
|
||||
<h1>数据提取与转换器</h1>
|
||||
<p class="subtitle">专为大学生开发的多功能数据处理工具</p>
|
||||
</header>
|
||||
|
||||
<div class="tabs">
|
||||
<button class="tab-button active" onclick="openTab('pdf-tab')">PDF处理</button>
|
||||
<button class="tab-button" onclick="openTab('image-tab')">图片OCR</button>
|
||||
<button class="tab-button" onclick="openTab('format-tab')">格式转换</button>
|
||||
<button class="tab-button" onclick="openTab('web-tab')">网页抓取</button>
|
||||
<button class="tab-button" onclick="openTab('database-tab')">数据库导出</button>
|
||||
</div>
|
||||
|
||||
<!-- PDF处理标签页 -->
|
||||
<div id="pdf-tab" class="tab-content active">
|
||||
<h2>PDF文本/表格提取</h2>
|
||||
<div class="upload-area" id="pdf-upload-area">
|
||||
<input type="file" id="pdf-file" accept=".pdf" style="display: none;">
|
||||
<div class="upload-placeholder" onclick="document.getElementById('pdf-file').click()">
|
||||
<p>点击选择PDF文件或拖拽文件到此处</p>
|
||||
<p class="file-types">支持格式: .pdf</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="action-buttons">
|
||||
<button onclick="processPdf('extract')" class="btn btn-primary">提取文本</button>
|
||||
<button onclick="processPdf('to_excel')" class="btn btn-success">导出为Excel</button>
|
||||
</div>
|
||||
<div id="pdf-result" class="result-area"></div>
|
||||
</div>
|
||||
|
||||
<!-- 图片OCR标签页 -->
|
||||
<div id="image-tab" class="tab-content">
|
||||
<h2>图片文字识别 (OCR)</h2>
|
||||
<div class="upload-area" id="image-upload-area">
|
||||
<input type="file" id="image-file" accept="image/*" style="display: none;">
|
||||
<div class="upload-placeholder" onclick="document.getElementById('image-file').click()">
|
||||
<p>点击选择图片文件或拖拽文件到此处</p>
|
||||
<p class="file-types">支持格式: .jpg, .jpeg, .png, .gif, .bmp</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="action-buttons">
|
||||
<button onclick="processImage('extract')" class="btn btn-primary">识别文字</button>
|
||||
<button onclick="processImage('to_excel')" class="btn btn-success">导出为Excel</button>
|
||||
<button onclick="processImage('to_text')" class="btn btn-info">导出为文本</button>
|
||||
</div>
|
||||
<div id="image-result" class="result-area"></div>
|
||||
</div>
|
||||
|
||||
<!-- 格式转换标签页 -->
|
||||
<div id="format-tab" class="tab-content">
|
||||
<h2>文件格式转换</h2>
|
||||
<div class="upload-area" id="format-upload-area">
|
||||
<input type="file" id="format-file" accept=".xlsx,.xls,.csv,.json" style="display: none;">
|
||||
<div class="upload-placeholder" onclick="document.getElementById('format-file').click()">
|
||||
<p>点击选择文件或拖拽文件到此处</p>
|
||||
<p class="file-types">支持格式: .xlsx, .xls, .csv, .json</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="conversion-options">
|
||||
<label>转换为:</label>
|
||||
<select id="target-format">
|
||||
<option value="excel">Excel (.xlsx)</option>
|
||||
<option value="csv">CSV (.csv)</option>
|
||||
<option value="json">JSON (.json)</option>
|
||||
</select>
|
||||
<button onclick="processFormat()" class="btn btn-success">开始转换</button>
|
||||
</div>
|
||||
<div id="format-result" class="result-area"></div>
|
||||
</div>
|
||||
|
||||
<!-- 网页抓取标签页 -->
|
||||
<div id="web-tab" class="tab-content">
|
||||
<h2>网页数据抓取</h2>
|
||||
<div class="input-group">
|
||||
<label for="web-url">网页URL:</label>
|
||||
<input type="url" id="web-url" placeholder="https://example.com">
|
||||
</div>
|
||||
<div class="input-group">
|
||||
<label for="css-selector">CSS选择器 (可选):</label>
|
||||
<input type="text" id="css-selector" placeholder="例如: .content, #main, p">
|
||||
<small>留空则抓取整个页面文本</small>
|
||||
</div>
|
||||
<div class="action-buttons">
|
||||
<button onclick="processWeb()" class="btn btn-primary">抓取内容</button>
|
||||
<button onclick="processWebToExcel()" class="btn btn-success">导出为Excel</button>
|
||||
</div>
|
||||
<div id="web-result" class="result-area"></div>
|
||||
</div>
|
||||
|
||||
<!-- 数据库导出标签页 -->
|
||||
<div id="database-tab" class="tab-content">
|
||||
<h2>数据库导出</h2>
|
||||
<div class="upload-area" id="db-upload-area">
|
||||
<input type="file" id="db-file" accept=".db,.sqlite" style="display: none;">
|
||||
<div class="upload-placeholder" onclick="document.getElementById('db-file').click()">
|
||||
<p>点击选择数据库文件或拖拽文件到此处</p>
|
||||
<p class="file-types">支持格式: .db, .sqlite</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="input-group">
|
||||
<label for="table-name">表名 (可选):</label>
|
||||
<input type="text" id="table-name" placeholder="留空则导出所有表">
|
||||
</div>
|
||||
<div class="conversion-options">
|
||||
<label>导出为:</label>
|
||||
<select id="db-target-format">
|
||||
<option value="excel">Excel (.xlsx)</option>
|
||||
<option value="csv">CSV (.csv)</option>
|
||||
<option value="json">JSON (.json)</option>
|
||||
</select>
|
||||
<button onclick="processDatabase()" class="btn btn-success">开始导出</button>
|
||||
</div>
|
||||
<div id="database-result" class="result-area"></div>
|
||||
</div>
|
||||
|
||||
<!-- 全局状态显示 -->
|
||||
<div id="status-message" class="status-message"></div>
|
||||
</div>
|
||||
|
||||
<script src="{{ url_for('static', filename='script.js') }}"></script>
|
||||
</body>
|
||||
</html>
|
||||
BIN
test_cases/cat_coffee.png
Normal file
Binary file not shown.
Size: 1.9 MiB
6
test_cases/test_data.csv
Normal file
@ -0,0 +1,6 @@
姓名,年龄,城市,专业,成绩
张三,20,北京,计算机科学,85
李四,21,上海,数据科学,92
王五,19,广州,人工智能,78
赵六,22,深圳,软件工程,88
钱七,20,杭州,网络安全,95
37
test_cases/test_data.json
Normal file
@ -0,0 +1,37 @@
[
  {
    "姓名": "张三",
    "年龄": 20,
    "城市": "北京",
    "专业": "计算机科学",
    "成绩": 85
  },
  {
    "姓名": "李四",
    "年龄": 21,
    "城市": "上海",
    "专业": "数据科学",
    "成绩": 92
  },
  {
    "姓名": "王五",
    "年龄": 19,
    "城市": "广州",
    "专业": "人工智能",
    "成绩": 78
  },
  {
    "姓名": "赵六",
    "年龄": 22,
    "城市": "深圳",
    "专业": "软件工程",
    "成绩": 88
  },
  {
    "姓名": "钱七",
    "年龄": 20,
    "城市": "杭州",
    "专业": "网络安全",
    "成绩": 95
  }
]
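The two fixtures above hold the same five records, but the CSV stores every value as text while the JSON keeps 年龄 and 成绩 as numbers, so a CSV-to-JSON conversion has to restore the types. The sketch below shows one way to do that with the standard library only. It is not the project's `utils.format_converter`; the function names and the `int_fields` parameter are assumptions made for illustration.

```python
import csv
import io
import json


def csv_text_to_records(csv_text, int_fields=("年龄", "成绩")):
    """Parse CSV text into a list of dicts, converting the named columns to int."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = dict(row)
        for field in int_fields:
            if field in record:
                record[field] = int(record[field])
        records.append(record)
    return records


def records_to_json(records):
    """Serialize records as JSON, keeping Chinese characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Passing the contents of `test_cases/test_data.csv` through `csv_text_to_records` and then `records_to_json` should yield records equivalent to `test_cases/test_data.json`, which makes the pair usable as a round-trip fixture.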
192
test_functionality.py
Normal file
@ -0,0 +1,192 @@
#!/usr/bin/env python3
"""
Data Extractor and Converter - functional test script.
Verifies that the application's features work correctly.
"""

import os
import sys
import tempfile
from pathlib import Path

# Add the project directory to the Python path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Import the utility modules
try:
    from utils.pdf_extractor import extract_text_from_pdf
    from utils.ocr_processor import extract_text_from_image
    from utils.format_converter import excel_to_csv, csv_to_excel, json_to_excel
    from utils.web_scraper import scrape_webpage
    from utils.database_exporter import export_sqlite_to_excel
    print("✅ All utility modules imported successfully")
except ImportError as e:
    print(f"❌ Module import failed: {e}")
    sys.exit(1)

def test_format_conversion():
|
||||
"""测试格式转换功能"""
|
||||
print("\n📊 测试格式转换功能...")
|
||||
|
||||
# 测试数据
|
||||
test_data = [
|
||||
{"姓名": "张三", "年龄": 20, "城市": "北京"},
|
||||
{"姓名": "李四", "年龄": 21, "城市": "上海"},
|
||||
{"姓名": "王五", "年龄": 19, "城市": "广州"}
|
||||
]
|
||||
|
||||
try:
|
||||
# 创建临时文件
|
||||
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False, mode='w', encoding='utf-8') as f:
|
||||
f.write("姓名,年龄,城市\n")
|
||||
for item in test_data:
|
||||
f.write(f"{item['姓名']},{item['年龄']},{item['城市']}\n")
|
||||
csv_path = f.name
|
||||
|
||||
# CSV转Excel
|
||||
excel_path = csv_path.replace('.csv', '.xlsx')
|
||||
csv_to_excel(csv_path, excel_path)
|
||||
|
||||
if os.path.exists(excel_path):
|
||||
print("✅ CSV转Excel功能正常")
|
||||
os.unlink(excel_path)
|
||||
else:
|
||||
print("❌ CSV转Excel功能失败")
|
||||
|
||||
os.unlink(csv_path)
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ 格式转换测试失败: {e}")
|
||||
|
||||
def test_web_scraping():
|
||||
"""测试网页抓取功能"""
|
||||
print("\n🌐 测试网页抓取功能...")
|
||||
|
||||
try:
|
||||
# 测试抓取百度首页标题
|
||||
content = scrape_webpage("https://www.baidu.com")
|
||||
if content and len(content) > 0:
|
||||
print("✅ 网页抓取功能正常")
|
||||
print(f" 抓取内容长度: {len(content)} 字符")
|
||||
else:
|
||||
print("❌ 网页抓取功能失败")
|
||||
except Exception as e:
|
||||
print(f"❌ 网页抓取测试失败: {e}")
|
||||
|
||||
def test_ocr_functionality():
|
||||
"""测试OCR功能"""
|
||||
print("\n🖼️ 测试OCR功能...")
|
||||
|
||||
try:
|
||||
# 创建一个简单的测试图片(包含文字)
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
|
||||
# 创建图片
|
||||
img = Image.new('RGB', (400, 200), color='white')
|
||||
d = ImageDraw.Draw(img)
|
||||
|
||||
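        # Try common system fonts; fall back to Pillow's built-in font.
        # Note: Arial has no CJK glyphs, so the Chinese part of the test
        # string may not render and can be missed by OCR.
        font = None
        for font_name in ("arial.ttf", "Arial.ttf"):
            try:
                font = ImageFont.truetype(font_name, 24)
                break
            except OSError:
                continue
        if font is None:
            font = ImageFont.load_default()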
|
||||
|
||||
# 添加文字
|
||||
d.text((50, 80), "测试文字: Hello World 你好世界", fill="black", font=font)
|
||||
|
||||
# 保存图片
|
||||
img_path = os.path.join(tempfile.gettempdir(), "test_ocr.png")
|
||||
img.save(img_path)
|
||||
|
||||
# 测试OCR识别
|
||||
text = extract_text_from_image(img_path)
|
||||
|
||||
if text:
|
||||
print("✅ OCR功能正常")
|
||||
print(f" 识别结果: {text}")
|
||||
else:
|
||||
print("⚠️ OCR识别无结果(可能是字体问题)")
|
||||
|
||||
os.unlink(img_path)
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ OCR测试失败: {e}")
|
||||
|
||||
def test_database_functionality():
|
||||
"""测试数据库功能"""
|
||||
print("\n🗄️ 测试数据库功能...")
|
||||
|
||||
try:
|
||||
import sqlite3
|
||||
|
||||
# 创建测试数据库
|
||||
db_path = os.path.join(tempfile.gettempdir(), "test.db")
|
||||
conn = sqlite3.connect(db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
# 创建测试表
|
||||
cursor.execute("""
|
||||
CREATE TABLE IF NOT EXISTS students (
|
||||
id INTEGER PRIMARY KEY,
|
||||
name TEXT NOT NULL,
|
||||
age INTEGER,
|
||||
major TEXT
|
||||
)
|
||||
""")
|
||||
|
||||
# 插入测试数据
|
||||
test_data = [
|
||||
(1, "张三", 20, "计算机科学"),
|
||||
(2, "李四", 21, "数据科学"),
|
||||
(3, "王五", 19, "人工智能")
|
||||
]
|
||||
|
||||
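        # OR REPLACE keeps the insert from failing on the primary key when a
        # test.db from an earlier, interrupted run is still in the temp directory
        cursor.executemany("INSERT OR REPLACE INTO students VALUES (?, ?, ?, ?)", test_data)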
|
||||
conn.commit()
|
||||
conn.close()
|
||||
|
||||
# 测试数据库导出
|
||||
excel_path = db_path.replace('.db', '.xlsx')
|
||||
export_sqlite_to_excel(db_path, excel_path)
|
||||
|
||||
if os.path.exists(excel_path):
|
||||
print("✅ 数据库导出功能正常")
|
||||
os.unlink(excel_path)
|
||||
else:
|
||||
print("❌ 数据库导出功能失败")
|
||||
|
||||
os.unlink(db_path)
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ 数据库功能测试失败: {e}")
|
||||
|
||||
def main():
|
||||
"""主测试函数"""
|
||||
print("=" * 50)
|
||||
print("数据提取与转换器 - 功能测试")
|
||||
print("=" * 50)
|
||||
|
||||
# 测试各项功能
|
||||
test_format_conversion()
|
||||
test_web_scraping()
|
||||
test_ocr_functionality()
|
||||
test_database_functionality()
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print("测试完成!")
|
||||
print("=" * 50)
|
||||
|
||||
# 显示应用访问信息
|
||||
print("\n🌐 应用访问信息:")
|
||||
print("本地访问: http://localhost:8502")
|
||||
print("网络访问: http://192.168.10.21:8502")
|
||||
print("\n💡 测试建议:")
|
||||
print("1. 访问应用界面测试文件上传功能")
|
||||
print("2. 使用test_cases目录下的测试文件")
|
||||
print("3. 测试网页抓取功能(输入百度等网站URL)")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
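The script above prints a pass or fail marker for each feature, but every test function returns `None` and the process always exits with status 0, so a CI job cannot tell whether anything failed. A minimal sketch of one way to collect results follows. `run_checks` is a hypothetical helper, and it assumes each check is a callable that returns `True` on success, which the functions above would need to be adapted to do.

```python
def run_checks(checks):
    """Run (name, callable) pairs and return the names of the checks that failed.

    Assumption: each callable returns True on success; a False result or an
    exception counts as a failure.
    """
    failures = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception as exc:
            print(f"FAIL {name}: {exc}")
            ok = False
        else:
            print(f"{'PASS' if ok else 'FAIL'} {name}")
        if not ok:
            failures.append(name)
    return failures
```

A caller could then end with `sys.exit(1 if run_checks(checks) else 0)` so that a failed check is visible in the exit status.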
213
test_mdf_functionality.py
Normal file
@ -0,0 +1,213 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
MDF文件导出功能测试脚本
|
||||
测试SQL Server数据库文件导出功能
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
|
||||
# 添加项目路径到Python路径
|
||||
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
|
||||
|
||||
def check_sql_server_connection():
|
||||
"""检查SQL Server连接"""
|
||||
print("🔍 检查SQL Server连接...")
|
||||
|
||||
try:
|
||||
import pyodbc
|
||||
|
||||
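        # Candidate (server, instance) pairs; MSSQLSERVER denotes the default instance
        test_servers = [
            ('localhost', 'MSSQLSERVER'),
            ('.', 'MSSQLSERVER'),
            # The instance name is appended when the connection string is
            # built, so the server part must not already contain it
            ('localhost', 'SQLEXPRESS')
        ]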
|
||||
|
||||
connected = False
|
||||
for server, instance in test_servers:
|
||||
try:
|
||||
if instance == 'MSSQLSERVER':
|
||||
conn_str = f"DRIVER={{SQL Server}};SERVER={server};Trusted_Connection=yes;"
|
||||
else:
|
||||
conn_str = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};Trusted_Connection=yes;"
|
||||
|
||||
conn = pyodbc.connect(conn_str, timeout=5)
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT @@version")
|
||||
version = cursor.fetchone()[0]
|
||||
|
||||
print(f"✅ 连接到 {server}\\{instance}")
|
||||
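                # splitlines() avoids a backslash inside the f-string expression
                # (a SyntaxError before Python 3.12) and splits on real newlines
                print(f"   SQL Server version: {version.splitlines()[0]}")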
|
||||
connected = True
|
||||
conn.close()
|
||||
break
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ 无法连接到 {server}\\{instance}: {e}")
|
||||
|
||||
if not connected:
|
||||
print("⚠️ 未找到可用的SQL Server实例")
|
||||
print(" 请安装SQL Server或检查服务状态")
|
||||
|
||||
return connected
|
||||
|
||||
except ImportError:
|
||||
print("❌ pyodbc未安装")
|
||||
return False
|
||||
|
||||
def test_mdf_export_module():
|
||||
"""测试MDF导出模块"""
|
||||
print("\n🧪 测试MDF导出模块...")
|
||||
|
||||
try:
|
||||
from utils.database_exporter import (
|
||||
export_mssql_mdf_to_excel,
|
||||
export_mssql_mdf_to_csv,
|
||||
export_mssql_mdf_to_json
|
||||
)
|
||||
print("✅ MDF导出模块导入成功")
|
||||
|
||||
# 检查函数是否存在
|
||||
functions = [
|
||||
export_mssql_mdf_to_excel,
|
||||
export_mssql_mdf_to_csv,
|
||||
export_mssql_mdf_to_json
|
||||
]
|
||||
|
||||
for func in functions:
|
||||
print(f"✅ {func.__name__} 函数可用")
|
||||
|
||||
return True
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ MDF导出模块测试失败: {e}")
|
||||
return False
|
||||
|
||||
def create_sample_mdf_info():
|
||||
"""创建示例MDF文件信息"""
|
||||
print("\n📋 示例MDF文件信息:")
|
||||
|
||||
sample_info = """
|
||||
💡 要测试MDF文件导出功能,您需要:
|
||||
|
||||
1. **现有的.mdf文件**
|
||||
- 从现有SQL Server数据库分离的.mdf文件
|
||||
- 或使用SQL Server创建测试数据库
|
||||
|
||||
2. **SQL Server实例**
|
||||
- 本地安装的SQL Server
|
||||
- 或可访问的远程SQL Server
|
||||
|
||||
3. **连接权限**
|
||||
- 数据库读取权限
|
||||
- 附加数据库权限
|
||||
|
||||
🔧 创建测试MDF文件的步骤:
|
||||
|
||||
1. 在SQL Server Management Studio中:
|
||||
```sql
|
||||
-- 创建测试数据库
|
||||
CREATE DATABASE TestMDFExport;
|
||||
GO
|
||||
|
||||
-- 创建测试表
|
||||
USE TestMDFExport;
|
||||
CREATE TABLE Students (
|
||||
ID INT PRIMARY KEY,
|
||||
Name NVARCHAR(50),
|
||||
Age INT,
|
||||
Major NVARCHAR(50)
|
||||
);
|
||||
|
||||
-- 插入测试数据
|
||||
INSERT INTO Students VALUES
|
||||
(1, '张三', 20, '计算机科学'),
|
||||
(2, '李四', 21, '数据科学'),
|
||||
(3, '王五', 19, '人工智能');
|
||||
```
|
||||
|
||||
2. 分离数据库获取.mdf文件:
|
||||
```sql
|
||||
-- 分离数据库
|
||||
USE master;
|
||||
GO
|
||||
EXEC sp_detach_db 'TestMDFExport', 'true';
|
||||
```
|
||||
|
||||
3. 数据库文件位置:
|
||||
- 默认路径: C:\\Program Files\\Microsoft SQL Server\\...\\DATA\\
|
||||
- 文件: TestMDFExport.mdf 和 TestMDFExport_log.ldf
|
||||
"""
|
||||
|
||||
print(sample_info)
|
||||
|
||||
def check_odbc_drivers():
|
||||
"""检查可用的ODBC驱动程序"""
|
||||
print("\n🔌 检查ODBC驱动程序...")
|
||||
|
||||
try:
|
||||
import pyodbc
|
||||
|
||||
drivers = pyodbc.drivers()
|
||||
if drivers:
|
||||
print("✅ 找到以下ODBC驱动程序:")
|
||||
for driver in drivers:
|
||||
print(f" - {driver}")
|
||||
|
||||
# 检查SQL Server相关驱动
|
||||
sql_drivers = [d for d in drivers if 'SQL Server' in d]
|
||||
if sql_drivers:
|
||||
print("\n✅ 找到SQL Server ODBC驱动程序")
|
||||
else:
|
||||
print("\n⚠️ 未找到SQL Server ODBC驱动程序")
|
||||
print(" 请安装ODBC Driver for SQL Server")
|
||||
else:
|
||||
print("❌ 未找到ODBC驱动程序")
|
||||
|
||||
except Exception as e:
|
||||
print(f"❌ 检查ODBC驱动程序失败: {e}")
|
||||
|
||||
def main():
|
||||
"""主测试函数"""
|
||||
print("=" * 60)
|
||||
print("MDF文件导出功能测试")
|
||||
print("=" * 60)
|
||||
|
||||
# 检查ODBC驱动
|
||||
check_odbc_drivers()
|
||||
|
||||
# 检查SQL Server连接
|
||||
sql_connected = check_sql_server_connection()
|
||||
|
||||
# 测试MDF导出模块
|
||||
module_ok = test_mdf_export_module()
|
||||
|
||||
# 显示示例信息
|
||||
create_sample_mdf_info()
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print("测试总结")
|
||||
print("=" * 60)
|
||||
|
||||
if sql_connected and module_ok:
|
||||
print("✅ MDF导出功能配置正确")
|
||||
print("💡 您可以上传.mdf文件测试导出功能")
|
||||
else:
|
||||
print("⚠️ MDF导出功能需要额外配置")
|
||||
|
||||
if not sql_connected:
|
||||
print(" - 需要安装或配置SQL Server")
|
||||
if not module_ok:
|
||||
print(" - 需要检查模块依赖")
|
||||
|
||||
print("\n🚀 下一步操作:")
|
||||
print("1. 确保SQL Server服务运行")
|
||||
print("2. 准备.mdf测试文件")
|
||||
print("3. 访问应用测试导出功能")
|
||||
print("4. 参考SQL_SERVER_SETUP.md获取详细配置说明")
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
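The connection check above distinguishes the default instance (addressed by host name alone) from named instances (addressed as `host\instance`). A minimal sketch of that rule as a standalone function follows. `build_conn_str` is a hypothetical helper, and the `SQL Server` driver name is simply the one the script above uses; newer ODBC drivers have different names.

```python
DEFAULT_INSTANCE = "MSSQLSERVER"  # name SQL Server gives the default instance


def build_conn_str(server, instance=DEFAULT_INSTANCE, driver="SQL Server"):
    """Build a trusted-connection ODBC string (hypothetical helper)."""
    # The default instance is addressed by host alone;
    # named instances are addressed as host\instance.
    target = server if instance == DEFAULT_INSTANCE else f"{server}\\{instance}"
    return f"DRIVER={{{driver}}};SERVER={target};Trusted_Connection=yes;"
```

Keeping the server and instance separate until this point avoids producing a doubled suffix such as `localhost\SQLEXPRESS\SQLEXPRESS`.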
1
utils/__init__.py
Normal file
@ -0,0 +1 @@
|
||||
# 工具模块初始化文件
|
||||
438
utils/ai_copywriter.py
Normal file
@ -0,0 +1,438 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
AI文案生成服务集成
|
||||
使用AI大模型为照片生成创意文案
|
||||
支持多种文案风格和用途
|
||||
支持DeepSeek和DashScope两种大模型
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# 加载环境变量
|
||||
load_dotenv()
|
||||
|
||||
class AICopywriter:
|
||||
"""AI文案生成服务类"""
|
||||
|
||||
def __init__(self, provider='deepseek'):
|
||||
"""初始化AI文案生成客户端"""
|
||||
self.provider = provider
|
||||
|
||||
if provider == 'deepseek':
|
||||
self.api_key = os.getenv('DEEPSEEK_API_KEY')
|
||||
if not self.api_key:
|
||||
raise Exception("DeepSeek API密钥未配置,请在.env文件中设置DEEPSEEK_API_KEY")
|
||||
self.base_url = "https://api.deepseek.com/v1/chat/completions"
|
||||
elif provider == 'dashscope':
|
||||
self.api_key = os.getenv('DASHSCOPE_API_KEY')
|
||||
if not self.api_key:
|
||||
raise Exception("DashScope API密钥未配置,请在.env文件中设置DASHSCOPE_API_KEY")
|
||||
else:
|
||||
raise Exception(f"不支持的AI提供商: {provider}")
|
||||
|
||||
def generate_photo_caption(self, image_description, style='creative', length='medium'):
|
||||
"""为照片生成文案"""
|
||||
try:
|
||||
if self.provider == 'deepseek':
|
||||
return self._generate_with_deepseek(image_description, style, length)
|
||||
elif self.provider == 'dashscope':
|
||||
return self._generate_with_dashscope(image_description, style, length)
|
||||
else:
|
||||
raise Exception(f"不支持的AI提供商: {self.provider}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"AI文案生成失败: {str(e)}")
|
||||
|
||||
def _generate_with_deepseek(self, image_description, style, length):
|
||||
"""使用DeepSeek生成文案"""
|
||||
try:
|
||||
prompt = self._build_prompt(image_description, style, length)
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
data = {
|
||||
'model': 'deepseek-chat',
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是一个专业的创意文案创作助手,擅长为照片生成各种风格的创意文案。你具有丰富的文学素养和营销知识,能够根据照片内容创作出富有创意和感染力的文案。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
],
|
||||
'max_tokens': 500,
|
||||
'temperature': 0.8,
|
||||
'top_p': 0.9
|
||||
}
|
||||
|
||||
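            # Bound the request time and turn HTTP errors into exceptions so
            # the except branch below can fall back to the local caption
            response = requests.post(self.base_url, headers=headers, json=data, timeout=30)
            response.raise_for_status()
            result = response.json()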
|
||||
|
||||
if 'choices' in result and len(result['choices']) > 0:
|
||||
caption = result['choices'][0]['message']['content'].strip()
|
||||
# 清理可能的格式标记
|
||||
caption = caption.replace('"', '').replace('\n', ' ').strip()
|
||||
return caption
|
||||
else:
|
||||
# 如果API调用失败,使用备用文案生成
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
except Exception as e:
|
||||
# API调用失败时使用备用方案
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
def _generate_with_dashscope(self, image_description, style, length):
|
||||
"""使用DashScope生成文案"""
|
||||
try:
|
||||
url = "https://dashscope.aliyuncs.com/api/v1/services/aigc/text-generation/generation"
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
# 根据风格和长度构建提示词
|
||||
prompt = self._build_prompt(image_description, style, length)
|
||||
|
||||
data = {
|
||||
'model': 'qwen-turbo',
|
||||
'input': {
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是一个专业的文案创作助手,擅长为照片生成各种风格的创意文案。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
]
|
||||
},
|
||||
'parameters': {
|
||||
'max_tokens': 500,
|
||||
'temperature': 0.8
|
||||
}
|
||||
}
|
||||
|
||||
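            # Bound the request time and turn HTTP errors into exceptions so
            # the except branch below can fall back to the local caption
            response = requests.post(url, headers=headers, json=data, timeout=30)
            response.raise_for_status()
            result = response.json()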
|
||||
|
||||
if 'output' in result and 'text' in result['output']:
|
||||
return result['output']['text']
|
||||
else:
|
||||
# 如果API调用失败,使用备用文案生成
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
except Exception as e:
|
||||
# API调用失败时使用备用方案
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
def _build_prompt(self, image_description, style, length):
|
||||
"""构建AI提示词"""
|
||||
|
||||
style_descriptions = {
|
||||
'creative': '创意文艺风格,富有诗意和想象力',
|
||||
'professional': '专业正式风格,简洁明了',
|
||||
'social': '社交媒体风格,活泼有趣,适合朋友圈',
|
||||
'marketing': '营销推广风格,吸引眼球,促进转化',
|
||||
'simple': '简单描述风格,直接明了',
|
||||
'emotional': '情感表达风格,温暖感人'
|
||||
}
|
||||
|
||||
length_descriptions = {
|
||||
'short': '10-20字,简洁精炼',
|
||||
'medium': '30-50字,适中长度',
|
||||
'long': '80-120字,详细描述'
|
||||
}
|
||||
|
||||
prompt = f"""
|
||||
请为以下照片内容生成{style_descriptions.get(style, '创意')}的文案,要求{length_descriptions.get(length, '适中长度')}。
|
||||
|
||||
照片内容描述:{image_description}
|
||||
|
||||
文案要求:
|
||||
1. 符合{style}风格
|
||||
2. 长度{length}
|
||||
3. 有创意,吸引人
|
||||
4. 适合社交媒体分享
|
||||
|
||||
请直接输出文案内容,不要添加其他说明。
|
||||
"""
|
||||
|
||||
return prompt.strip()
|
||||
|
||||
def _generate_fallback_caption(self, image_description, style, length):
|
||||
"""备用文案生成(当AI服务不可用时)"""
|
||||
|
||||
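        # Simple caption generation based on the photo description.
        # split() only breaks on whitespace, so an unsegmented Chinese
        # description becomes a single token.
        keywords = image_description.lower().split()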
|
||||
|
||||
# 提取关键信息
|
||||
objects = []
|
||||
scenes = []
|
||||
|
||||
# 简单的关键词分类(实际应用中可以使用更复杂的NLP处理)
|
||||
object_keywords = ['人', '建筑', '天空', '树', '花', '动物', '车', '食物', '水', '山']
|
||||
scene_keywords = ['户外', '室内', '自然', '城市', '夜景', '日出', '日落', '海滩', '森林']
|
||||
|
||||
for word in keywords:
|
||||
if any(obj in word for obj in object_keywords):
|
||||
objects.append(word)
|
||||
if any(scene in word for scene in scene_keywords):
|
||||
scenes.append(word)
|
||||
|
||||
# 根据风格生成文案
|
||||
if style == 'creative':
|
||||
if scenes:
|
||||
caption = f"在{scenes[0]}的怀抱中,时光静静流淌"
|
||||
elif objects:
|
||||
caption = f"{objects[0]}的美丽瞬间,定格永恒"
|
||||
else:
|
||||
caption = "捕捉生活中的美好,让每一刻都值得珍藏"
|
||||
|
||||
elif style == 'social':
|
||||
if objects:
|
||||
caption = f"今天遇到的{objects[0]}太可爱了!分享给大家~"
|
||||
else:
|
||||
caption = "分享一张美照,希望大家喜欢!"
|
||||
|
||||
elif style == 'professional':
|
||||
if scenes and objects:
|
||||
caption = f"专业拍摄:{scenes[0]}场景中的{objects[0]}特写"
|
||||
else:
|
||||
caption = "专业摄影作品展示"
|
||||
|
||||
elif style == 'marketing':
|
||||
if objects:
|
||||
caption = f"惊艳!这个{objects[0]}你一定要看看!"
|
||||
else:
|
||||
caption = "不容错过的精彩瞬间,点击了解更多!"
|
||||
|
||||
else: # simple or emotional
|
||||
if objects:
|
||||
caption = f"美丽的{objects[0]}照片"
|
||||
else:
|
||||
caption = "一张值得分享的照片"
|
||||
|
||||
# 根据长度调整
|
||||
if length == 'long' and len(caption) < 50:
|
||||
caption += "。这张照片记录了珍贵的瞬间,展现了生活的美好,值得细细品味和珍藏。"
|
||||
elif length == 'short' and len(caption) > 20:
|
||||
# 简化长文案
|
||||
caption = caption[:20] + "..."
|
||||
|
||||
return caption
|
||||
|
||||
def generate_multiple_captions(self, image_description, count=3, style='creative'):
|
||||
"""生成多个文案选项"""
|
||||
try:
|
||||
if self.provider == 'deepseek':
|
||||
return self._generate_multiple_with_deepseek(image_description, count, style)
|
||||
elif self.provider == 'dashscope':
|
||||
return self._generate_multiple_with_dashscope(image_description, count, style)
|
||||
else:
|
||||
raise Exception(f"不支持的AI提供商: {self.provider}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"生成多个文案失败: {str(e)}")
|
||||
|
||||
def _generate_multiple_with_deepseek(self, image_description, count=3, style='creative'):
|
||||
"""使用DeepSeek生成多个文案选项"""
|
||||
try:
|
||||
captions = []
|
||||
|
||||
# 使用不同的提示词变体生成多个文案
|
||||
prompt_variants = [
|
||||
f"请为'{image_description}'照片创作一个{style}风格的文案,要求新颖独特",
|
||||
f"基于照片内容'{image_description}',写一个{style}风格的创意文案",
|
||||
f"为这张'{image_description}'的照片设计一个{style}风格的吸引人文案"
|
||||
]
|
||||
|
||||
for i in range(min(count, len(prompt_variants))):
|
||||
prompt = prompt_variants[i]
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
data = {
|
||||
'model': 'deepseek-chat',
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是专业的创意文案专家,擅长为照片创作多种风格的文案。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
],
|
||||
'max_tokens': 200,
|
||||
'temperature': 0.9, # 提高温度增加多样性
|
||||
'top_p': 0.95
|
||||
}
|
||||
|
||||
                # A timeout keeps one stalled request from blocking the whole loop
                response = requests.post(self.base_url, headers=headers, json=data, timeout=30)
                result = response.json()
|
||||
|
||||
if 'choices' in result and len(result['choices']) > 0:
|
||||
caption = result['choices'][0]['message']['content'].strip()
|
||||
caption = caption.replace('"', '').replace('\n', ' ').strip()
|
||||
|
||||
captions.append({
|
||||
'option': i + 1,
|
||||
'caption': caption,
|
||||
'style': style,
|
||||
'char_count': len(caption)
|
||||
})
|
||||
|
||||
return captions
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"DeepSeek多文案生成失败: {str(e)}")
|
||||
|
||||
def _generate_multiple_with_dashscope(self, image_description, count=3, style='creative'):
|
||||
"""使用DashScope生成多个文案选项"""
|
||||
try:
|
||||
captions = []
|
||||
|
||||
# 尝试使用不同的长度和微调风格
|
||||
lengths = ['short', 'medium', 'long']
|
||||
|
||||
for i in range(min(count, len(lengths))):
|
||||
caption = self.generate_photo_caption(image_description, style, lengths[i])
|
||||
captions.append({
|
||||
'option': i + 1,
|
||||
'caption': caption,
|
||||
'length': lengths[i],
|
||||
'char_count': len(caption)
|
||||
})
|
||||
|
||||
# 如果数量不足,使用不同风格补充
|
||||
if len(captions) < count:
|
||||
additional_styles = ['social', 'professional', 'emotional']
|
||||
for i, add_style in enumerate(additional_styles):
|
||||
if len(captions) >= count:
|
||||
break
|
||||
caption = self.generate_photo_caption(image_description, add_style, 'medium')
|
||||
captions.append({
|
||||
'option': len(captions) + 1,
|
||||
'caption': caption,
|
||||
'style': add_style,
|
||||
'char_count': len(caption)
|
||||
})
|
||||
|
||||
return captions
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"DashScope多文案生成失败: {str(e)}")
|
||||
|
||||
def analyze_photo_suitability(self, image_description):
|
||||
"""分析照片适合的文案风格"""
|
||||
try:
|
||||
# 简单的风格适合性分析
|
||||
keywords = image_description.lower()
|
||||
|
||||
suitability = {
|
||||
'creative': 0,
|
||||
'professional': 0,
|
||||
'social': 0,
|
||||
'marketing': 0,
|
||||
'emotional': 0
|
||||
}
|
||||
|
||||
# 关键词匹配(实际应用中可以使用更复杂的NLP分析)
|
||||
creative_words = ['美丽', '艺术', '创意', '独特', '梦幻']
|
||||
professional_words = ['专业', '商业', '产品', '展示', '特写']
|
||||
social_words = ['朋友', '聚会', '日常', '分享', '生活']
|
||||
marketing_words = ['促销', '优惠', '新品', '限时', '推荐']
|
||||
emotional_words = ['情感', '感动', '回忆', '温暖', '幸福']
|
||||
|
||||
for word in creative_words:
|
||||
if word in keywords:
|
||||
suitability['creative'] += 1
|
||||
|
||||
for word in professional_words:
|
||||
if word in keywords:
|
||||
suitability['professional'] += 1
|
||||
|
||||
for word in social_words:
|
||||
if word in keywords:
|
||||
suitability['social'] += 1
|
||||
|
||||
for word in marketing_words:
|
||||
if word in keywords:
|
||||
suitability['marketing'] += 1
|
||||
|
||||
for word in emotional_words:
|
||||
if word in keywords:
|
||||
suitability['emotional'] += 1
|
||||
|
||||
# 排序并返回推荐
|
||||
recommended = sorted(suitability.items(), key=lambda x: x[1], reverse=True)
|
||||
|
||||
return {
|
||||
'suitability_scores': suitability,
|
||||
'recommended_styles': [style for style, score in recommended if score > 0],
|
||||
'most_suitable': recommended[0][0] if recommended[0][1] > 0 else 'creative'
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"照片适合性分析失败: {str(e)}")
|
||||
|
||||
def generate_photo_caption(image_description, style='creative', length='medium', provider='dashscope'):
|
||||
"""为照片生成文案"""
|
||||
try:
|
||||
copywriter = AICopywriter(provider)
|
||||
return copywriter.generate_photo_caption(image_description, style, length)
|
||||
except Exception as e:
|
||||
raise Exception(f"照片文案生成失败: {str(e)}")
|
||||
|
||||
def generate_multiple_captions(image_description, count=3, style='creative', provider='dashscope'):
|
||||
"""生成多个文案选项"""
|
||||
try:
|
||||
copywriter = AICopywriter(provider)
|
||||
return copywriter.generate_multiple_captions(image_description, count, style)
|
||||
except Exception as e:
|
||||
raise Exception(f"多文案生成失败: {str(e)}")
|
||||
|
||||
def analyze_photo_suitability(image_description, provider='dashscope'):
|
||||
"""分析照片适合的文案风格"""
|
||||
try:
|
||||
copywriter = AICopywriter(provider)
|
||||
return copywriter.analyze_photo_suitability(image_description)
|
||||
except Exception as e:
|
||||
raise Exception(f"照片适合性分析失败: {str(e)}")
|
||||
|
||||
def check_copywriter_config(provider='deepseek'):
|
||||
"""检查AI文案生成配置是否完整"""
|
||||
try:
|
||||
if provider == 'deepseek':
|
||||
api_key = os.getenv('DEEPSEEK_API_KEY')
|
||||
if not api_key:
|
||||
return False, "DeepSeek API密钥未配置"
|
||||
|
||||
            # Only checks that the client can be constructed; no request is sent
|
||||
copywriter = AICopywriter(provider)
|
||||
return True, "AI文案生成配置正确(DeepSeek大模型)"
|
||||
elif provider == 'dashscope':
|
||||
api_key = os.getenv('DASHSCOPE_API_KEY')
|
||||
if not api_key:
|
||||
return False, "DashScope API密钥未配置"
|
||||
|
||||
            # Only checks that the client can be constructed; no request is sent
|
||||
copywriter = AICopywriter(provider)
|
||||
return True, "AI文案生成配置正确(DashScope)"
|
||||
else:
|
||||
return False, f"不支持的AI提供商: {provider}"
|
||||
except Exception as e:
|
||||
return False, f"AI文案生成配置错误: {str(e)}"
|
||||
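The style recommendation in `analyze_photo_suitability` above is a keyword count: each style scores one point per keyword found in the description, and `creative` is the default when nothing matches. A table-driven sketch of the same idea follows, reusing the keyword lists from the module. `score_styles` and `most_suitable` are hypothetical names, and ties are resolved by dictionary order here, which may differ from the sort in the original.

```python
STYLE_KEYWORDS = {
    "creative": ["美丽", "艺术", "创意", "独特", "梦幻"],
    "professional": ["专业", "商业", "产品", "展示", "特写"],
    "social": ["朋友", "聚会", "日常", "分享", "生活"],
    "marketing": ["促销", "优惠", "新品", "限时", "推荐"],
    "emotional": ["情感", "感动", "回忆", "温暖", "幸福"],
}


def score_styles(description):
    """Count, per style, how many of its keywords occur in the description."""
    text = description.lower()
    return {style: sum(word in text for word in words)
            for style, words in STYLE_KEYWORDS.items()}


def most_suitable(description, default="creative"):
    """Return the top-scoring style, or the default when nothing matches."""
    scores = score_styles(description)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

Holding the keywords in one dictionary means a new style needs only a new entry rather than another loop.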
229
utils/aliyun_ocr.py
Normal file
@ -0,0 +1,229 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
阿里云OCR服务集成
|
||||
使用阿里云AI大模型进行图片文字识别
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
from dotenv import load_dotenv
|
||||
from alibabacloud_ocr_api20210707.client import Client as ocr_api20210707Client
|
||||
from alibabacloud_tea_openapi import models as open_api_models
|
||||
from alibabacloud_ocr_api20210707 import models as ocr_api20210707_models
|
||||
from alibabacloud_tea_util import models as util_models
|
||||
from alibabacloud_tea_util.client import Client as UtilClient
|
||||
|
||||
# 加载环境变量
|
||||
load_dotenv()
|
||||
|
||||
class AliyunOCR:
|
||||
"""阿里云OCR服务类"""
|
||||
|
||||
def __init__(self, access_key_id=None, access_key_secret=None, endpoint=None):
|
||||
"""初始化阿里云OCR客户端"""
|
||||
self.access_key_id = access_key_id or os.getenv('ALIYUN_ACCESS_KEY_ID')
|
||||
self.access_key_secret = access_key_secret or os.getenv('ALIYUN_ACCESS_KEY_SECRET')
|
||||
self.endpoint = endpoint or os.getenv('ALIYUN_OCR_ENDPOINT', 'ocr-api.cn-hangzhou.aliyuncs.com')
|
||||
|
||||
if not self.access_key_id or not self.access_key_secret:
|
||||
raise Exception("阿里云AccessKey未配置,请在.env文件中设置ALIYUN_ACCESS_KEY_ID和ALIYUN_ACCESS_KEY_SECRET")
|
||||
|
||||
# 创建配置对象
|
||||
config = open_api_models.Config(
|
||||
access_key_id=self.access_key_id,
|
||||
access_key_secret=self.access_key_secret
|
||||
)
|
||||
config.endpoint = self.endpoint
|
||||
|
||||
# 创建客户端
|
||||
self.client = ocr_api20210707Client(config)
|
||||
|
||||
def recognize_general(self, image_path):
|
||||
"""通用文字识别"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
# 创建请求
|
||||
recognize_general_request = ocr_api20210707_models.RecognizeGeneralRequest(
|
||||
image_url='', # 使用image_data,所以这里留空
|
||||
body=util_models.RuntimeOptions()
|
||||
)
|
||||
|
||||
# 设置图片数据
|
||||
recognize_general_request.body = image_data
|
||||
|
||||
# 发送请求
|
||||
response = self.client.recognize_general(recognize_general_request)
|
||||
|
||||
# 解析响应
|
||||
if response.body.code == 200:
|
||||
result = json.loads(response.body.data)
|
||||
return self._extract_text(result)
|
||||
else:
|
||||
raise Exception(f"阿里云OCR识别失败: {response.body.message}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"阿里云OCR识别错误: {str(e)}")
|
||||
|
||||
def recognize_advanced(self, image_path, options=None):
|
||||
"""高级文字识别(支持更多功能)"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
# 创建请求
|
||||
recognize_advanced_request = ocr_api20210707_models.RecognizeAdvancedRequest(
|
||||
image_url='',
|
||||
body=util_models.RuntimeOptions()
|
||||
)
|
||||
|
||||
# 设置图片数据
|
||||
recognize_advanced_request.body = image_data
|
||||
|
||||
# 设置高级选项
|
||||
if options:
|
||||
if 'output_char_info' in options:
|
||||
recognize_advanced_request.output_char_info = options['output_char_info']
|
||||
if 'output_table' in options:
|
||||
recognize_advanced_request.output_table = options['output_table']
|
||||
if 'need_rotate' in options:
|
||||
recognize_advanced_request.need_rotate = options['need_rotate']
|
||||
|
||||
# 发送请求
|
||||
response = self.client.recognize_advanced(recognize_advanced_request)
|
||||
|
||||
# 解析响应
|
||||
if response.body.code == 200:
|
||||
result = json.loads(response.body.data)
|
||||
return self._extract_text(result)
|
||||
else:
|
||||
raise Exception(f"阿里云高级OCR识别失败: {response.body.message}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"阿里云高级OCR识别错误: {str(e)}")
|
||||
|
||||
def recognize_table(self, image_path):
|
||||
"""表格识别"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
# 创建请求
|
||||
recognize_table_request = ocr_api20210707_models.RecognizeTableRequest(
|
||||
image_url='',
|
||||
body=util_models.RuntimeOptions()
|
||||
)
|
||||
|
||||
# 设置图片数据
|
||||
recognize_table_request.body = image_data
|
||||
|
||||
# 发送请求
|
||||
response = self.client.recognize_table(recognize_table_request)
|
||||
|
||||
# 解析响应
|
||||
if response.body.code == 200:
|
||||
result = json.loads(response.body.data)
|
||||
return self._extract_table_data(result)
|
||||
else:
|
||||
raise Exception(f"阿里云表格识别失败: {response.body.message}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"阿里云表格识别错误: {str(e)}")
|
||||
|
||||
def _extract_text(self, result):
|
||||
"""从OCR结果中提取文本"""
|
||||
text = ""
|
||||
|
||||
if 'content' in result:
|
||||
# 简单文本识别结果
|
||||
text = result['content']
|
||||
elif 'prism_wordsInfo' in result:
|
||||
# 结构化识别结果
|
||||
words_info = result['prism_wordsInfo']
|
||||
for word_info in words_info:
|
||||
if 'word' in word_info:
|
||||
text += word_info['word'] + "\n"
|
||||
elif 'prism_tablesInfo' in result:
|
||||
# 表格识别结果
|
||||
tables_info = result['prism_tablesInfo']
|
||||
for table_info in tables_info:
|
||||
if 'cellContents' in table_info:
|
||||
for cell in table_info['cellContents']:
|
||||
if 'word' in cell:
|
||||
text += cell['word'] + "\t"
|
||||
text += "\n"
|
||||
|
||||
return text.strip()
|
||||
|
||||
def _extract_table_data(self, result):
|
||||
"""提取表格数据"""
|
||||
table_data = []
|
||||
|
||||
if 'content' in result:
|
||||
# 直接返回内容
|
||||
return result['content']
|
||||
elif 'prism_tablesInfo' in result:
|
||||
# 结构化表格数据
|
||||
tables_info = result['prism_tablesInfo']
|
||||
for table_info in tables_info:
|
||||
table_rows = []
|
||||
if 'cellContents' in table_info:
|
||||
# 按行组织数据
|
||||
max_row = max([cell.get('row', 0) for cell in table_info['cellContents']]) + 1
|
||||
max_col = max([cell.get('col', 0) for cell in table_info['cellContents']]) + 1
|
||||
|
||||
# 创建空表格
|
||||
table = [['' for _ in range(max_col)] for _ in range(max_row)]
|
||||
|
||||
# 填充数据
|
||||
for cell in table_info['cellContents']:
|
||||
row = cell.get('row', 0)
|
||||
col = cell.get('col', 0)
|
||||
word = cell.get('word', '')
|
||||
if row < max_row and col < max_col:
|
||||
table[row][col] = word
|
||||
|
||||
# 转换为文本格式
|
||||
for row in table:
|
||||
table_rows.append('\t'.join(row))
|
||||
|
||||
table_data.append('\n'.join(table_rows))
|
||||
|
||||
return '\n\n'.join(table_data) if table_data else "未识别到表格数据"
|
||||
|
||||
def extract_text_with_aliyun(image_path, ocr_type='general', options=None):
|
||||
"""使用阿里云OCR提取图片文字"""
|
||||
try:
|
||||
ocr_client = AliyunOCR()
|
||||
|
||||
if ocr_type == 'general':
|
||||
return ocr_client.recognize_general(image_path)
|
||||
elif ocr_type == 'advanced':
|
||||
return ocr_client.recognize_advanced(image_path, options)
|
||||
elif ocr_type == 'table':
|
||||
return ocr_client.recognize_table(image_path)
|
||||
else:
|
||||
raise Exception(f"不支持的OCR类型: {ocr_type}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"阿里云OCR识别失败: {str(e)}")
|
||||
|
||||
def check_aliyun_config():
|
||||
"""检查阿里云配置是否完整"""
|
||||
access_key_id = os.getenv('ALIYUN_ACCESS_KEY_ID')
|
||||
access_key_secret = os.getenv('ALIYUN_ACCESS_KEY_SECRET')
|
||||
|
||||
if not access_key_id or not access_key_secret:
|
||||
return False, "阿里云AccessKey未配置"
|
||||
|
||||
try:
|
||||
        # Only checks that the client can be constructed; no request is sent
|
||||
ocr_client = AliyunOCR()
|
||||
return True, "阿里云OCR配置正确"
|
||||
except Exception as e:
|
||||
return False, f"阿里云OCR配置错误: {str(e)}"
|
||||
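`_extract_table_data` above rebuilds a table by sizing a grid from the largest row and column index and then placing each cell's text. A standalone sketch of that step follows. It assumes, as the module does, that each cell is a dict with zero-based `row` and `col` keys and a `word` string; whether the service actually returns those field names is not confirmed here. `cells_to_grid` and `grid_to_text` are hypothetical names.

```python
def cells_to_grid(cells):
    """Arrange OCR cells into a row-major grid of strings.

    Assumption: each cell is a dict with zero-based 'row' and 'col' indexes
    and a 'word' string.
    """
    if not cells:
        return []  # max() would raise ValueError on an empty sequence
    n_rows = max(cell.get("row", 0) for cell in cells) + 1
    n_cols = max(cell.get("col", 0) for cell in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell.get("row", 0)][cell.get("col", 0)] = cell.get("word", "")
    return grid


def grid_to_text(grid):
    """Join a grid into tab-separated lines, one line per table row."""
    return "\n".join("\t".join(row) for row in grid)
```

The empty-input guard matters because an empty `cellContents` list would otherwise make `max()` raise before any text is produced.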
306
utils/baidu_image_analysis.py
Normal file
@ -0,0 +1,306 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
百度智能云图像分析服务集成
|
||||
使用百度AI大模型进行照片质量评分和内容分析
|
||||
"""
|
||||
|
||||
import base64
|
||||
import json
|
||||
import os
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# 加载环境变量
|
||||
load_dotenv()
|
||||
|
||||
class BaiduImageAnalysis:
|
||||
"""百度智能云图像分析服务类"""
|
||||
|
||||
def __init__(self, api_key=None, secret_key=None):
|
||||
"""初始化百度智能云客户端"""
|
||||
self.api_key = api_key or os.getenv('BAIDU_API_KEY')
|
||||
self.secret_key = secret_key or os.getenv('BAIDU_SECRET_KEY')
|
||||
|
||||
if not self.api_key or not self.secret_key:
|
||||
raise Exception("百度智能云API密钥未配置,请在.env文件中设置BAIDU_API_KEY和BAIDU_SECRET_KEY")
|
||||
|
||||
# 获取访问令牌
|
||||
self.access_token = self._get_access_token()
|
||||
|
||||
def _get_access_token(self):
|
||||
"""获取百度AI访问令牌"""
|
||||
try:
|
||||
url = "https://aip.baidubce.com/oauth/2.0/token"
|
||||
params = {
|
||||
'grant_type': 'client_credentials',
|
||||
'client_id': self.api_key,
|
||||
'client_secret': self.secret_key
|
||||
}
|
||||
|
||||
            # A timeout keeps client construction from hanging on a stalled token request
            response = requests.post(url, params=params, timeout=10)
            result = response.json()
|
||||
|
||||
if 'access_token' in result:
|
||||
return result['access_token']
|
||||
else:
|
||||
raise Exception(f"获取访问令牌失败: {result.get('error_description', '未知错误')}")
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"获取百度AI访问令牌失败: {str(e)}")
|
||||
|
||||
def image_quality_assessment(self, image_path):
|
||||
"""图像质量评估"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
url = "https://aip.baidubce.com/rest/2.0/image-classify/v1/image_quality_enhance"
|
||||
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
|
||||
data = {
|
||||
'image': image_data,
|
||||
'access_token': self.access_token
|
||||
}
|
||||
|
||||
response = requests.post(url, headers=headers, data=data)
|
||||
result = response.json()
|
||||
|
||||
if 'error_code' in result:
|
||||
# 如果质量增强API不可用,使用通用图像分析
|
||||
return self._fallback_quality_assessment(image_data)
|
||||
|
||||
return self._parse_quality_result(result)
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"图像质量评估失败: {str(e)}")
|
||||
|
||||
def _fallback_quality_assessment(self, image_data):
|
||||
"""备用图像质量评估方法"""
|
||||
try:
|
||||
# 使用图像分析API进行质量评估
|
||||
url = "https://aip.baidubce.com/rest/2.0/image-classify/v2/advanced_general"
|
||||
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
|
||||
data = {
|
||||
'image': image_data,
|
||||
'access_token': self.access_token
|
||||
}
|
||||
|
||||
response = requests.post(url, headers=headers, data=data)
|
||||
result = response.json()
|
||||
|
||||
return self._parse_general_result(result)
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"备用图像质量评估失败: {str(e)}")
|
||||
|
||||
def image_content_analysis(self, image_path):
|
||||
"""图像内容分析"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
url = "https://aip.baidubce.com/rest/2.0/image-classify/v2/advanced_general"
|
||||
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
|
||||
data = {
|
||||
'image': image_data,
|
||||
'access_token': self.access_token,
|
||||
'baike_num': 3 # 获取百度百科信息
|
||||
}
|
||||
|
||||
response = requests.post(url, headers=headers, data=data)
|
||||
result = response.json()
|
||||
|
||||
return self._parse_content_result(result)
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"图像内容分析失败: {str(e)}")
|
||||
|
||||
def image_aesthetic_score(self, image_path):
|
||||
"""图像美学评分"""
|
||||
try:
|
||||
# 读取图片并编码为base64
|
||||
with open(image_path, 'rb') as image_file:
|
||||
image_data = base64.b64encode(image_file.read()).decode('utf-8')
|
||||
|
||||
# 使用图像增强API进行美学评分
|
||||
url = "https://aip.baidubce.com/rest/2.0/image-process/v1/image_quality_enhance"
|
||||
headers = {'Content-Type': 'application/x-www-form-urlencoded'}
|
||||
data = {
|
||||
'image': image_data,
|
||||
'access_token': self.access_token
|
||||
}
|
||||
|
||||
response = requests.post(url, headers=headers, data=data)
|
||||
result = response.json()
|
||||
|
||||
return self._parse_aesthetic_result(result)
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"图像美学评分失败: {str(e)}")
|
||||
|
||||
def _parse_quality_result(self, result):
|
||||
"""解析质量评估结果"""
|
||||
analysis = {
|
||||
'score': 0,
|
||||
'dimensions': {},
|
||||
'suggestions': [],
|
||||
'overall_quality': '未知'
|
||||
}
|
||||
|
||||
# 根据API响应解析质量评分
|
||||
if 'result' in result:
|
||||
# 假设API返回了质量评分
|
||||
analysis['score'] = result.get('score', 75)
|
||||
else:
|
||||
# 使用备用评分逻辑
|
||||
analysis['score'] = self._calculate_fallback_score()
|
||||
|
||||
        # Derive per-dimension scores from the overall score, clamped to 0-100.
        # The offsets are placeholders, not values returned by the API.
        base = analysis['score']
        analysis['dimensions'] = {
            'clarity': {'score': max(0, min(100, base + 5)), 'comment': '清晰度良好'},
            'brightness': {'score': max(0, min(100, base - 3)), 'comment': '亮度适中'},
            'contrast': {'score': max(0, min(100, base + 2)), 'comment': '对比度合适'},
            'color_balance': {'score': max(0, min(100, base + 1)), 'comment': '色彩平衡'}
        }
|
||||
|
||||
# 根据评分给出建议
|
||||
if analysis['score'] >= 90:
|
||||
analysis['overall_quality'] = '优秀'
|
||||
analysis['suggestions'] = ['照片质量非常好,无需改进']
|
||||
elif analysis['score'] >= 80:
|
||||
analysis['overall_quality'] = '良好'
|
||||
analysis['suggestions'] = ['照片质量良好,可适当优化']
|
||||
elif analysis['score'] >= 60:
|
||||
analysis['overall_quality'] = '一般'
|
||||
analysis['suggestions'] = ['照片质量一般,建议优化']
|
||||
else:
|
||||
analysis['overall_quality'] = '较差'
|
||||
analysis['suggestions'] = ['照片质量较差,需要大幅改进']
|
||||
|
||||
return analysis
|
||||
|
||||
def _parse_general_result(self, result):
|
||||
"""解析通用图像分析结果"""
|
||||
analysis = {
|
||||
'score': 75, # 默认分数
|
||||
'dimensions': {},
|
||||
'suggestions': [],
|
||||
'overall_quality': '良好',
|
||||
'content_analysis': []
|
||||
}
|
||||
|
||||
if 'result' in result:
|
||||
# 分析识别到的内容
|
||||
content_items = []
|
||||
for item in result['result']:
|
||||
content_items.append({
|
||||
'keyword': item.get('keyword', ''),
|
||||
'score': item.get('score', 0),
|
||||
'root': item.get('root', '')
|
||||
})
|
||||
|
||||
analysis['content_analysis'] = content_items
|
||||
|
||||
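            # Use the mean recognition confidence (0-1) as a rough 0-100 score;
            # this reflects how confidently objects were recognized, not image quality
            if content_items:
                avg_score = sum(item['score'] for item in content_items) / len(content_items)
                analysis['score'] = int(avg_score * 100)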
|
||||
|
||||
return analysis
|
||||
|
||||
def _parse_content_result(self, result):
|
||||
"""解析内容分析结果"""
|
||||
content_analysis = {
|
||||
'objects': [],
|
||||
'scenes': [],
|
||||
'tags': [],
|
||||
'summary': ''
|
||||
}
|
||||
|
||||
if 'result' in result:
|
||||
for item in result['result']:
|
||||
obj_info = {
|
||||
'name': item.get('keyword', ''),
|
||||
'confidence': item.get('score', 0),
|
||||
'baike_info': item.get('baike_info', {})
|
||||
}
|
||||
content_analysis['objects'].append(obj_info)
|
||||
|
||||
# 生成内容摘要
|
||||
if content_analysis['objects']:
|
||||
top_objects = [obj['name'] for obj in content_analysis['objects'][:3]]
|
||||
content_analysis['summary'] = f"图片包含: {', '.join(top_objects)}"
|
||||
|
||||
return content_analysis
|
||||
|
||||
def _parse_aesthetic_result(self, result):
|
||||
"""解析美学评分结果"""
|
||||
aesthetic_analysis = {
|
||||
'aesthetic_score': 75,
|
||||
'composition': '良好',
|
||||
'color_harmony': '良好',
|
||||
'lighting': '适中',
|
||||
'focus': '清晰',
|
||||
'recommendations': []
|
||||
}
|
||||
|
||||
# 根据API响应调整美学评分
|
||||
if 'result' in result:
|
||||
# 假设API返回了美学评分
|
||||
aesthetic_analysis['aesthetic_score'] = result.get('aesthetic_score', 75)
|
||||
|
||||
# 根据评分给出建议
|
||||
if aesthetic_analysis['aesthetic_score'] >= 85:
|
||||
aesthetic_analysis['recommendations'] = ['构图优秀,色彩和谐']
|
||||
elif aesthetic_analysis['aesthetic_score'] >= 70:
|
||||
aesthetic_analysis['recommendations'] = ['构图良好,可优化光线']
|
||||
else:
|
||||
aesthetic_analysis['recommendations'] = ['建议调整构图和光线']
|
||||
|
||||
return aesthetic_analysis
|
||||
|
||||
    def _calculate_fallback_score(self):
        """Return a placeholder score when the API response carries none."""
        import random
        # Random value for demonstration only; it does not measure the image
        return random.randint(60, 95)
|
||||
|
||||
def analyze_image_quality(image_path):
|
||||
"""分析图像质量"""
|
||||
try:
|
||||
analyzer = BaiduImageAnalysis()
|
||||
return analyzer.image_quality_assessment(image_path)
|
||||
except Exception as e:
|
||||
raise Exception(f"图像质量分析失败: {str(e)}")
|
||||
|
||||
def analyze_image_content(image_path):
|
||||
"""分析图像内容"""
|
||||
try:
|
||||
analyzer = BaiduImageAnalysis()
|
||||
return analyzer.image_content_analysis(image_path)
|
||||
except Exception as e:
|
||||
raise Exception(f"图像内容分析失败: {str(e)}")
|
||||
|
||||
def get_image_aesthetic_score(image_path):
|
||||
"""获取图像美学评分"""
|
||||
try:
|
||||
analyzer = BaiduImageAnalysis()
|
||||
return analyzer.image_aesthetic_score(image_path)
|
||||
except Exception as e:
|
||||
raise Exception(f"图像美学评分失败: {str(e)}")
|
||||
|
||||
def check_baidu_config():
|
||||
"""检查百度智能云配置是否完整"""
|
||||
api_key = os.getenv('BAIDU_API_KEY')
|
||||
secret_key = os.getenv('BAIDU_SECRET_KEY')
|
||||
|
||||
if not api_key or not secret_key:
|
||||
return False, "百度智能云API密钥未配置"
|
||||
|
||||
try:
|
||||
# 测试连接
|
||||
analyzer = BaiduImageAnalysis()
|
||||
return True, "百度智能云配置正确"
|
||||
except Exception as e:
|
||||
return False, f"百度智能云配置错误: {str(e)}"
|
||||
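The score-to-label mapping in `_parse_quality_result` can be read as a small threshold table. The sketch below restates it as a standalone function with the score clamped to 0-100 first. `QUALITY_BANDS`, `clamp_score` and `grade_quality` are hypothetical names for illustration; the thresholds and Chinese labels are copied from the method above.

```python
from typing import Dict, List, Tuple

# (threshold, label, suggestion) in descending order, mirroring _parse_quality_result
QUALITY_BANDS: List[Tuple[int, str, str]] = [
    (90, '优秀', '照片质量非常好,无需改进'),
    (80, '良好', '照片质量良好,可适当优化'),
    (60, '一般', '照片质量一般,建议优化'),
]


def clamp_score(value: float) -> int:
    """Force a score into the 0-100 range and truncate it to an integer."""
    return int(max(0, min(100, value)))


def grade_quality(score: float) -> Dict[str, object]:
    """Map a numeric score to the label and suggestion of the first band it reaches."""
    score = clamp_score(score)
    for threshold, label, suggestion in QUALITY_BANDS:
        if score >= threshold:
            return {'score': score, 'overall_quality': label, 'suggestions': [suggestion]}
    # Anything below the lowest threshold falls into the last band
    return {'score': score, 'overall_quality': '较差', 'suggestions': ['照片质量较差,需要大幅改进']}


print(grade_quality(83)['overall_quality'])
```

Keeping the bands in one list means a threshold change touches a single place instead of an `if`/`elif` chain.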
300
utils/database_exporter.py
Normal file
@ -0,0 +1,300 @@
|
||||
import pandas as pd
|
||||
from sqlalchemy import create_engine, inspect
|
||||
import sqlite3
|
||||
import os
|
||||
import pyodbc
|
||||
from pathlib import Path
|
||||
|
||||
def export_sqlite_to_excel(db_path, output_path, table_name=None):
    """Export a SQLite database to Excel."""
    conn = None
    try:
        # Connect to the SQLite database
        conn = sqlite3.connect(db_path)

        # Collect all table names
        cursor = conn.cursor()
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
        tables = [table[0] for table in cursor.fetchall()]

        def quote(name):
            # Quote the identifier so unusual table names cannot break the query
            return '"' + name.replace('"', '""') + '"'

        if table_name:
            # Export the requested table only
            if table_name in tables:
                df = pd.read_sql_query(f"SELECT * FROM {quote(table_name)}", conn)
                df.to_excel(output_path, index=False)
            else:
                raise Exception(f"表 '{table_name}' 不存在")
        else:
            # Export every table to its own sheet in a single workbook
            with pd.ExcelWriter(output_path) as writer:
                for table in tables:
                    df = pd.read_sql_query(f"SELECT * FROM {quote(table)}", conn)
                    # Excel limits sheet names to 31 characters
                    df.to_excel(writer, sheet_name=table[:31], index=False)

        return True
    except Exception as e:
        raise Exception(f"SQLite导出Excel失败: {str(e)}")
    finally:
        # Close the connection even when the export fails
        if conn is not None:
            conn.close()
|
||||
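The SQLite export builds its `SELECT` statement by interpolating the table name into the SQL text. A minimal sketch of a safer variant follows, using only the standard library so that it runs without pandas: the name is first checked against `sqlite_master` and then quoted as an identifier. `quote_identifier` and `table_to_csv_text` are hypothetical helpers introduced here, not functions from the commit.

```python
import csv
import io
import sqlite3


def quote_identifier(name: str) -> str:
    """Quote a SQLite identifier: wrap it in double quotes and double any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def table_to_csv_text(conn: sqlite3.Connection, table: str) -> str:
    """Return the full contents of `table` as CSV text, header row first."""
    # Only accept names that actually exist in the database
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")

    cursor = conn.execute(f"SELECT * FROM {quote_identifier(table)}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(col[0] for col in cursor.description)  # column names
    writer.writerows(cursor)                                # data rows
    return buffer.getvalue()


# Demonstration on an in-memory database with a table name that contains a space
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "order items" (id INTEGER, name TEXT)')
conn.execute('INSERT INTO "order items" VALUES (1, ?)', ("tea",))
print(table_to_csv_text(conn, "order items"))
conn.close()
```

Table names cannot be bound as `?` parameters, which is why the sketch combines an allow-list check with identifier quoting rather than relying on parameter binding.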
|
||||
def export_mysql_to_excel(host, user, password, database, output_path, table_name=None):
|
||||
"""MySQL数据库导出为Excel"""
|
||||
try:
|
||||
# 创建MySQL连接
|
||||
engine = create_engine(f'mysql+pymysql://{user}:{password}@{host}/{database}')
|
||||
|
||||
# 获取所有表名
|
||||
inspector = inspect(engine)
|
||||
tables = inspector.get_table_names()
|
||||
|
||||
if table_name:
|
||||
# 导出指定表
|
||||
if table_name in tables:
|
||||
df = pd.read_sql_table(table_name, engine)
|
||||
df.to_excel(output_path, index=False)
|
||||
else:
|
||||
raise Exception(f"表 '{table_name}' 不存在")
|
||||
else:
|
||||
# 导出所有表到同一个Excel文件的不同sheet
|
||||
with pd.ExcelWriter(output_path) as writer:
|
||||
for table in tables:
|
||||
df = pd.read_sql_table(table, engine)
|
||||
df.to_excel(writer, sheet_name=table, index=False)
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"MySQL导出Excel失败: {str(e)}")
|
||||
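When every table is written to its own sheet, the table name doubles as the sheet name. Excel limits sheet names to 31 characters and rejects the characters `\ / ? * [ ] :`, and to my understanding it also treats names as case-insensitive when checking for duplicates. The sketch below is one possible way to derive usable sheet names under those assumptions; `safe_sheet_names` is a hypothetical helper, not part of the commit.

```python
import re
from typing import Iterable, List

# Characters Excel does not allow in a sheet name
_INVALID_SHEET_CHARS = re.compile(r'[\\/?*\[\]:]')
MAX_SHEET_NAME = 31


def safe_sheet_names(tables: Iterable[str]) -> List[str]:
    """Return one valid, unique sheet name per table, in the same order."""
    used = set()
    result = []
    for table in tables:
        # Replace forbidden characters, then cut to the length limit
        base = _INVALID_SHEET_CHARS.sub('_', table)[:MAX_SHEET_NAME] or 'Sheet'
        name = base
        counter = 1
        # Two long names can collide after truncation; add a numeric suffix if so
        while name.lower() in used:
            suffix = f'_{counter}'
            name = base[:MAX_SHEET_NAME - len(suffix)] + suffix
            counter += 1
        used.add(name.lower())
        result.append(name)
    return result


print(safe_sheet_names(['orders/2024', 'x' * 40, 'x' * 40 + 'y']))
```

The returned list can be zipped with the table list and passed as `sheet_name` when writing each DataFrame.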
|
||||
def database_to_csv(db_path, output_path, table_name=None):
|
||||
"""数据库导出为CSV"""
|
||||
try:
|
||||
if db_path.endswith('.db') or db_path.endswith('.sqlite'):
|
||||
# SQLite数据库
|
||||
conn = sqlite3.connect(db_path)
|
||||
|
||||
if table_name:
|
||||
df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
|
||||
df.to_csv(output_path, index=False, encoding='utf-8-sig')
|
||||
else:
|
||||
# 导出所有表到不同的CSV文件
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
|
||||
tables = [table[0] for table in cursor.fetchall()]
|
||||
|
||||
for table in tables:
|
||||
csv_file = output_path.replace('.csv', f'_{table}.csv')
|
||||
df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
|
||||
df.to_csv(csv_file, index=False, encoding='utf-8-sig')
|
||||
|
||||
conn.close()
|
||||
elif db_path.endswith('.mdf'):
|
||||
# SQL Server数据库文件
|
||||
export_mssql_mdf_to_csv(db_path, output_path, table_name)
|
||||
else:
|
||||
raise Exception("不支持的数据库格式")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"数据库导出CSV失败: {str(e)}")
|
||||
|
||||
def database_to_json(db_path, output_path, table_name=None):
|
||||
"""数据库导出为JSON"""
|
||||
try:
|
||||
import json
|
||||
|
||||
if db_path.endswith('.db') or db_path.endswith('.sqlite'):
|
||||
# SQLite数据库
|
||||
conn = sqlite3.connect(db_path)
|
||||
|
||||
if table_name:
|
||||
df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
else:
|
||||
# 导出所有表到不同的JSON文件
|
||||
cursor = conn.cursor()
|
||||
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
|
||||
tables = [table[0] for table in cursor.fetchall()]
|
||||
|
||||
for table in tables:
|
||||
json_file = output_path.replace('.json', f'_{table}.json')
|
||||
df = pd.read_sql_query(f"SELECT * FROM {table}", conn)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(json_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
conn.close()
|
||||
elif db_path.endswith('.mdf'):
|
||||
# SQL Server数据库文件
|
||||
export_mssql_mdf_to_json(db_path, output_path, table_name)
|
||||
else:
|
||||
raise Exception("不支持的数据库格式")
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"数据库导出JSON失败: {str(e)}")
|
||||
|
||||
def export_mssql_mdf_to_excel(mdf_path, output_path, table_name=None, server='localhost',
|
||||
username='sa', password='', instance='MSSQLSERVER'):
|
||||
"""SQL Server MDF文件导出为Excel"""
|
||||
try:
|
||||
# 连接到SQL Server实例并附加MDF文件
|
||||
database_name = Path(mdf_path).stem
|
||||
|
||||
# 创建连接字符串
|
||||
if instance == 'MSSQLSERVER':
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE=master;UID={username};PWD={password}"
|
||||
else:
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE=master;UID={username};PWD={password}"
|
||||
|
||||
# 连接到master数据库
|
||||
conn = pyodbc.connect(connection_string)
|
||||
cursor = conn.cursor()
|
||||
|
||||
        # Check whether the database is already attached (parameterized to avoid SQL injection)
        cursor.execute("SELECT name FROM sys.databases WHERE name = ?", database_name)
|
||||
if cursor.fetchone():
|
||||
# 数据库已存在,直接使用
|
||||
pass
|
||||
else:
|
||||
# 附加MDF文件到SQL Server
|
||||
mdf_full_path = os.path.abspath(mdf_path)
|
||||
ldf_path = mdf_path.replace('.mdf', '_log.ldf')
|
||||
|
||||
if not os.path.exists(ldf_path):
|
||||
ldf_path = mdf_path.replace('.mdf', '.ldf')
|
||||
|
||||
attach_sql = f"""
|
||||
CREATE DATABASE [{database_name}]
|
||||
ON (FILENAME = '{mdf_full_path}')
|
||||
"""
|
||||
|
||||
if os.path.exists(ldf_path):
|
||||
attach_sql += f", (FILENAME = '{os.path.abspath(ldf_path)}')"
|
||||
|
||||
attach_sql += " FOR ATTACH"
|
||||
|
||||
try:
|
||||
cursor.execute(attach_sql)
|
||||
conn.commit()
|
||||
except Exception as attach_error:
|
||||
# 如果附加失败,尝试直接连接(假设数据库已在运行)
|
||||
print(f"附加数据库失败,尝试直接连接: {attach_error}")
|
||||
|
||||
# 关闭连接并重新连接到目标数据库
|
||||
conn.close()
|
||||
|
||||
if instance == 'MSSQLSERVER':
|
||||
db_connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database_name};UID={username};PWD={password}"
|
||||
else:
|
||||
db_connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE={database_name};UID={username};PWD={password}"
|
||||
|
||||
        # Connect through SQLAlchemy; the ODBC string must be URL-encoded for odbc_connect
        from urllib.parse import quote_plus
        engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(db_connection_string)}")
|
||||
|
||||
# 获取所有表名
|
||||
inspector = inspect(engine)
|
||||
tables = inspector.get_table_names()
|
||||
|
||||
if table_name:
|
||||
# 导出指定表
|
||||
if table_name in tables:
|
||||
df = pd.read_sql_table(table_name, engine)
|
||||
df.to_excel(output_path, index=False)
|
||||
else:
|
||||
raise Exception(f"表 '{table_name}' 不存在")
|
||||
else:
|
||||
# 导出所有表到同一个Excel文件的不同sheet
|
||||
with pd.ExcelWriter(output_path) as writer:
|
||||
for table in tables:
|
||||
df = pd.read_sql_table(table, engine)
|
||||
# 处理表名长度限制(Excel sheet名最多31字符)
|
||||
sheet_name = table[:31] if len(table) > 31 else table
|
||||
df.to_excel(writer, sheet_name=sheet_name, index=False)
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"SQL Server MDF导出Excel失败: {str(e)}")
|
||||
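The MDF export functions pass a raw ODBC connection string to SQLAlchemy through the `odbc_connect` query parameter. SQLAlchemy's documented pattern for this is to URL-encode the whole ODBC string with `urllib.parse.quote_plus`. A minimal sketch follows; `build_mssql_url` is a hypothetical helper, and the `SQL Server` driver name and the `MSSQLSERVER` default-instance convention are taken from the code above.

```python
from urllib.parse import quote_plus, unquote_plus


def build_mssql_url(server, database, username, password,
                    driver="SQL Server", instance=None):
    """Build a SQLAlchemy URL that carries a URL-encoded ODBC connection string."""
    # The default instance is addressed by server name alone
    host = server if instance in (None, "MSSQLSERVER") else f"{server}\\{instance}"
    odbc = (
        f"DRIVER={{{driver}}};SERVER={host};DATABASE={database};"
        f"UID={username};PWD={password}"
    )
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


url = build_mssql_url("localhost", "demo", "sa", "secret", instance="SQLEXPRESS")
print(url)
# Decoding the parameter gives back the original ODBC string
print(unquote_plus(url.split("=", 1)[1]))
```

The resulting string is what would be handed to `create_engine`. The sketch does not open a connection, so it runs without pyodbc or a SQL Server instance.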
|
||||
def export_mssql_mdf_to_csv(mdf_path, output_path, table_name=None, server='localhost',
|
||||
username='sa', password='', instance='MSSQLSERVER'):
|
||||
"""SQL Server MDF文件导出为CSV"""
|
||||
try:
|
||||
database_name = Path(mdf_path).stem
|
||||
|
||||
# 创建连接字符串
|
||||
if instance == 'MSSQLSERVER':
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database_name};UID={username};PWD={password}"
|
||||
else:
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE={database_name};UID={username};PWD={password}"
|
||||
|
||||
        # Connect through SQLAlchemy; the ODBC string must be URL-encoded for odbc_connect
        from urllib.parse import quote_plus
        engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(connection_string)}")
|
||||
|
||||
# 获取所有表名
|
||||
inspector = inspect(engine)
|
||||
tables = inspector.get_table_names()
|
||||
|
||||
if table_name:
|
||||
# 导出指定表
|
||||
if table_name in tables:
|
||||
df = pd.read_sql_table(table_name, engine)
|
||||
df.to_csv(output_path, index=False, encoding='utf-8-sig')
|
||||
else:
|
||||
raise Exception(f"表 '{table_name}' 不存在")
|
||||
else:
|
||||
# 导出所有表到不同的CSV文件
|
||||
for table in tables:
|
||||
csv_file = output_path.replace('.csv', f'_{table}.csv')
|
||||
df = pd.read_sql_table(table, engine)
|
||||
df.to_csv(csv_file, index=False, encoding='utf-8-sig')
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"SQL Server MDF导出CSV失败: {str(e)}")
|
||||
|
||||
def export_mssql_mdf_to_json(mdf_path, output_path, table_name=None, server='localhost',
|
||||
username='sa', password='', instance='MSSQLSERVER'):
|
||||
"""SQL Server MDF文件导出为JSON"""
|
||||
try:
|
||||
import json
|
||||
|
||||
database_name = Path(mdf_path).stem
|
||||
|
||||
# 创建连接字符串
|
||||
if instance == 'MSSQLSERVER':
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database_name};UID={username};PWD={password}"
|
||||
else:
|
||||
connection_string = f"DRIVER={{SQL Server}};SERVER={server}\\{instance};DATABASE={database_name};UID={username};PWD={password}"
|
||||
|
||||
        # Connect through SQLAlchemy; the ODBC string must be URL-encoded for odbc_connect
        from urllib.parse import quote_plus
        engine = create_engine(f"mssql+pyodbc:///?odbc_connect={quote_plus(connection_string)}")
|
||||
|
||||
# 获取所有表名
|
||||
inspector = inspect(engine)
|
||||
tables = inspector.get_table_names()
|
||||
|
||||
if table_name:
|
||||
# 导出指定表
|
||||
if table_name in tables:
|
||||
df = pd.read_sql_table(table_name, engine)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(output_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
else:
|
||||
raise Exception(f"表 '{table_name}' 不存在")
|
||||
else:
|
||||
# 导出所有表到不同的JSON文件
|
||||
for table in tables:
|
||||
json_file = output_path.replace('.json', f'_{table}.json')
|
||||
df = pd.read_sql_table(table, engine)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(json_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"SQL Server MDF导出JSON失败: {str(e)}")
|
||||
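The multi-table CSV and JSON exports derive one output file per table with `output_path.replace('.csv', f'_{table}.csv')`. That call rewrites every occurrence of the extension in the path and does nothing when the extension is absent. A sketch of the same naming scheme built on `pathlib` follows; `per_table_path` is a hypothetical helper, not part of the commit.

```python
from pathlib import Path


def per_table_path(output_path: str, table: str) -> Path:
    """Insert `_<table>` before the file extension, leaving directories untouched."""
    path = Path(output_path)
    return path.with_name(f"{path.stem}_{table}{path.suffix}")


print(per_table_path("out/export.csv", "users"))
print(per_table_path("data.csv.d/export.csv", "orders"))
```

Note that `with_name` raises `ValueError` if the table name contains a path separator, so names of that kind would need to be sanitized first.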
309
utils/deepseek_copywriter.py
Normal file
@ -0,0 +1,309 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
DeepSeek large language model copywriting integration.
Uses the DeepSeek model to generate creative captions for photos.
Supports several caption styles and use cases.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import requests
|
||||
from dotenv import load_dotenv
|
||||
|
||||
# 加载环境变量
|
||||
load_dotenv()
|
||||
|
||||
class DeepSeekCopywriter:
|
||||
"""DeepSeek大模型文案生成服务类"""
|
||||
|
||||
def __init__(self, api_key=None):
|
||||
"""初始化DeepSeek大模型客户端"""
|
||||
self.api_key = api_key or os.getenv('DEEPSEEK_API_KEY')
|
||||
self.base_url = "https://api.deepseek.com/v1/chat/completions"
|
||||
|
||||
if not self.api_key:
|
||||
raise Exception("DeepSeek API密钥未配置,请在.env文件中设置DEEPSEEK_API_KEY")
|
||||
|
||||
def generate_photo_caption(self, image_description, style='creative', length='medium'):
|
||||
"""为照片生成文案"""
|
||||
try:
|
||||
prompt = self._build_prompt(image_description, style, length)
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
data = {
|
||||
'model': 'deepseek-chat',
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是一个专业的创意文案创作助手,擅长为照片生成各种风格的创意文案。你具有丰富的文学素养和营销知识,能够根据照片内容创作出富有创意和感染力的文案。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
],
|
||||
'max_tokens': 500,
|
||||
'temperature': 0.8,
|
||||
'top_p': 0.9
|
||||
}
|
||||
|
||||
response = requests.post(self.base_url, headers=headers, json=data)
|
||||
result = response.json()
|
||||
|
||||
if 'choices' in result and len(result['choices']) > 0:
|
||||
caption = result['choices'][0]['message']['content'].strip()
|
||||
# 清理可能的格式标记
|
||||
caption = caption.replace('"', '').replace('\n', ' ').strip()
|
||||
return caption
|
||||
else:
|
||||
# 如果API调用失败,使用备用文案生成
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
except Exception as e:
|
||||
# API调用失败时使用备用方案
|
||||
return self._generate_fallback_caption(image_description, style, length)
|
||||
|
||||
def _build_prompt(self, image_description, style, length):
|
||||
"""构建DeepSeek大模型提示词"""
|
||||
|
||||
style_descriptions = {
|
||||
'creative': '富有诗意和想象力的创意文艺风格,使用优美的修辞和意象',
|
||||
'professional': '专业正式的商务风格,简洁明了,注重专业性和可信度',
|
||||
'social': '活泼有趣的社交媒体风格,适合朋友圈分享,具有互动性',
|
||||
'marketing': '吸引眼球的营销推广风格,具有说服力,促进转化',
|
||||
'emotional': '温暖感人的情感表达风格,注重情感共鸣和人文关怀',
|
||||
'simple': '简单直接的描述风格,清晰明了,易于理解'
|
||||
}
|
||||
|
||||
length_descriptions = {
|
||||
'short': '10-20字,简洁精炼,突出重点',
|
||||
'medium': '30-50字,适中长度,内容完整',
|
||||
'long': '80-120字,详细描述,富有细节'
|
||||
}
|
||||
|
||||
prompt = f"""
|
||||
请为以下照片内容生成{style_descriptions.get(style, '创意')}的文案,要求{length_descriptions.get(length, '适中长度')}。
|
||||
|
||||
照片内容描述:{image_description}
|
||||
|
||||
文案创作要求:
|
||||
1. 风格:{style_descriptions.get(style, '创意')}
|
||||
2. 长度:{length_descriptions.get(length, '适中长度')}
|
||||
3. 创意性:富有创意,避免陈词滥调
|
||||
4. 吸引力:能够吸引目标受众的注意力
|
||||
5. 情感表达:根据风格适当表达情感
|
||||
6. 适用场景:适合社交媒体分享或商业用途
|
||||
|
||||
请直接输出文案内容,不要添加任何额外的说明或标记。文案应该是一个完整的、可以直接使用的文本。
|
||||
"""
|
||||
|
||||
return prompt.strip()
|
||||
|
||||
def generate_multiple_captions(self, image_description, count=3, style='creative'):
|
||||
"""生成多个文案选项"""
|
||||
try:
|
||||
captions = []
|
||||
|
||||
# 使用不同的提示词变体生成多个文案
|
||||
prompt_variants = [
|
||||
f"请为'{image_description}'照片创作一个{style}风格的文案,要求新颖独特",
|
||||
f"基于照片内容'{image_description}',写一个{style}风格的创意文案",
|
||||
f"为这张'{image_description}'的照片设计一个{style}风格的吸引人文案"
|
||||
]
|
||||
|
||||
for i in range(min(count, len(prompt_variants))):
|
||||
prompt = prompt_variants[i]
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
data = {
|
||||
'model': 'deepseek-chat',
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是专业的创意文案专家,擅长为照片创作多种风格的文案。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
],
|
||||
'max_tokens': 200,
|
||||
'temperature': 0.9, # 提高温度增加多样性
|
||||
'top_p': 0.95
|
||||
}
|
||||
|
||||
response = requests.post(self.base_url, headers=headers, json=data)
|
||||
result = response.json()
|
||||
|
||||
if 'choices' in result and len(result['choices']) > 0:
|
||||
caption = result['choices'][0]['message']['content'].strip()
|
||||
caption = caption.replace('"', '').replace('\n', ' ').strip()
|
||||
|
||||
captions.append({
|
||||
'option': i + 1,
|
||||
'caption': caption,
|
||||
'style': style,
|
||||
'char_count': len(caption)
|
||||
})
|
||||
|
||||
return captions
|
||||
|
||||
except Exception as e:
|
||||
raise Exception(f"生成多个文案失败: {str(e)}")
|
||||
|
||||
def analyze_photo_suitability(self, image_description):
|
||||
"""分析照片适合的文案风格"""
|
||||
try:
|
||||
prompt = f"""
|
||||
请分析以下照片内容最适合的文案风格:
|
||||
|
||||
照片内容:{image_description}
|
||||
|
||||
请从以下风格中选择最适合的3个,并按适合度排序:
|
||||
1. 创意文艺 - 富有诗意和想象力
|
||||
2. 专业正式 - 简洁专业
|
||||
3. 社交媒体 - 活泼有趣
|
||||
4. 营销推广 - 吸引眼球
|
||||
5. 情感表达 - 温暖感人
|
||||
6. 简单描述 - 直接明了
|
||||
|
||||
请直接返回风格名称列表,用逗号分隔,例如:"社交媒体,创意文艺,情感表达"
|
||||
"""
|
||||
|
||||
headers = {
|
||||
'Authorization': f'Bearer {self.api_key}',
|
||||
'Content-Type': 'application/json'
|
||||
}
|
||||
|
||||
data = {
|
||||
'model': 'deepseek-chat',
|
||||
'messages': [
|
||||
{
|
||||
'role': 'system',
|
||||
'content': '你是专业的文案风格分析专家,能够准确判断照片内容最适合的文案风格。'
|
||||
},
|
||||
{
|
||||
'role': 'user',
|
||||
'content': prompt
|
||||
}
|
||||
],
|
||||
'max_tokens': 100,
|
||||
'temperature': 0.3 # 降低温度增加确定性
|
||||
}
|
||||
|
||||
response = requests.post(self.base_url, headers=headers, json=data)
|
||||
result = response.json()
|
||||
|
||||
if 'choices' in result and len(result['choices']) > 0:
|
||||
analysis = result['choices'][0]['message']['content'].strip()
|
||||
|
||||
                # Parse the returned style list; accept ASCII and full-width separators
                import re
                styles = [s.strip() for s in re.split(r'[,,、]', analysis) if s.strip()]
|
||||
|
||||
return {
|
||||
'recommended_styles': styles[:3],
|
||||
'most_suitable': styles[0] if styles else 'creative',
|
||||
'analysis': analysis
|
||||
}
|
||||
else:
|
||||
return self._fallback_suitability_analysis()
|
||||
|
||||
except Exception as e:
|
||||
return self._fallback_suitability_analysis()
|
||||
|
||||
def _generate_fallback_caption(self, image_description, style, length):
|
||||
"""备用文案生成(当DeepSeek服务不可用时)"""
|
||||
|
||||
# 基于照片描述的简单文案生成
|
||||
base_captions = {
|
||||
'creative': [
|
||||
f"在{image_description}的瞬间,时光静静流淌",
|
||||
f"捕捉{image_description}的诗意,定格永恒美好",
|
||||
f"{image_description}的艺术之美,值得细细品味"
|
||||
],
|
||||
'social': [
|
||||
f"分享一张{image_description}的美照,希望大家喜欢!",
|
||||
f"今天遇到的{image_description}太棒了,必须分享!",
|
||||
f"{image_description}的精彩瞬间,与大家共赏"
|
||||
],
|
||||
'professional': [
|
||||
f"专业拍摄:{image_description}的精彩呈现",
|
||||
f"{image_description}的专业影像记录",
|
||||
f"高品质{image_description}摄影作品"
|
||||
],
|
||||
'marketing': [
|
||||
f"惊艳!这个{image_description}你一定要看看!",
|
||||
f"不容错过的{image_description}精彩瞬间",
|
||||
f"{image_description}的魅力,等你来发现"
|
||||
],
|
||||
'emotional': [
|
||||
f"{image_description}的温暖瞬间,触动心灵",
|
||||
f"在{image_description}中感受生活的美好",
|
||||
f"{image_description}的情感表达,真挚动人"
|
||||
]
|
||||
}
|
||||
|
||||
import random
|
||||
captions = base_captions.get(style, base_captions['creative'])
|
||||
caption = random.choice(captions)
|
||||
|
||||
# 根据长度调整
|
||||
if length == 'long' and len(caption) < 50:
|
||||
caption += "。这张照片记录了珍贵的瞬间,展现了生活的美好,值得细细品味和珍藏。"
|
||||
elif length == 'short' and len(caption) > 20:
|
||||
caption = caption[:20] + "..."
|
||||
|
||||
return caption
|
||||
|
||||
def _fallback_suitability_analysis(self):
|
||||
"""备用风格分析"""
|
||||
return {
|
||||
'recommended_styles': ['creative', 'social', 'emotional'],
|
||||
'most_suitable': 'creative',
|
||||
'analysis': '创意文艺风格最适合表达照片的艺术美感'
|
||||
}
|
||||
|
||||
def generate_photo_caption_deepseek(image_description, style='creative', length='medium'):
|
||||
"""使用DeepSeek为照片生成文案"""
|
||||
try:
|
||||
copywriter = DeepSeekCopywriter()
|
||||
return copywriter.generate_photo_caption(image_description, style, length)
|
||||
except Exception as e:
|
||||
raise Exception(f"DeepSeek文案生成失败: {str(e)}")
|
||||
|
||||
def generate_multiple_captions_deepseek(image_description, count=3, style='creative'):
|
||||
"""使用DeepSeek生成多个文案选项"""
|
||||
try:
|
||||
copywriter = DeepSeekCopywriter()
|
||||
return copywriter.generate_multiple_captions(image_description, count, style)
|
||||
except Exception as e:
|
||||
raise Exception(f"DeepSeek多文案生成失败: {str(e)}")
|
||||
|
||||
def analyze_photo_suitability_deepseek(image_description):
|
||||
"""使用DeepSeek分析照片适合的文案风格"""
|
||||
try:
|
||||
copywriter = DeepSeekCopywriter()
|
||||
return copywriter.analyze_photo_suitability(image_description)
|
||||
except Exception as e:
|
||||
raise Exception(f"DeepSeek风格分析失败: {str(e)}")
|
||||
|
||||
def check_deepseek_config():
|
||||
"""检查DeepSeek配置是否完整"""
|
||||
try:
|
||||
api_key = os.getenv('DEEPSEEK_API_KEY')
|
||||
if not api_key:
|
||||
return False, "DeepSeek API密钥未配置"
|
||||
|
||||
        # Instantiate the client; this only verifies that the key is set and sends no request
        copywriter = DeepSeekCopywriter()
        return True, "DeepSeek配置正确"
|
||||
except Exception as e:
|
||||
return False, f"DeepSeek配置错误: {str(e)}"
|
||||
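`analyze_photo_suitability` asks the model for Chinese style names, while `generate_photo_caption` expects English keys such as `creative` or `social`. The sketch below shows one way the reply could be normalized into those keys. `STYLE_KEYS` and `parse_style_reply` are hypothetical, and the pairing of each Chinese label with an English key is an assumption based on the order of the two lists in the class above.

```python
import re
from typing import List

# Assumed mapping from the labels in the analysis prompt to the keys of style_descriptions
STYLE_KEYS = {
    '创意文艺': 'creative',
    '专业正式': 'professional',
    '社交媒体': 'social',
    '营销推广': 'marketing',
    '情感表达': 'emotional',
    '简单描述': 'simple',
}


def parse_style_reply(reply: str, limit: int = 3) -> List[str]:
    """Turn a model reply such as '社交媒体,创意文艺' into a list of known style keys."""
    keys: List[str] = []
    # Models may separate items with ASCII commas, full-width commas, 、 or newlines
    for part in re.split(r'[,,、\n]', reply):
        label = part.strip().strip('"“”\'')
        label = re.sub(r'^\d+[.、)]\s*', '', label)  # drop list numbering like "1. "
        key = STYLE_KEYS.get(label)
        if key and key not in keys:
            keys.append(key)
    return keys[:limit]


print(parse_style_reply('社交媒体,创意文艺,情感表达'))
```

Unknown labels are dropped rather than passed through, so a caller can fall back to a default style when the returned list is empty.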
77
utils/format_converter.py
Normal file
@ -0,0 +1,77 @@
|
||||
import pandas as pd
|
||||
import json
|
||||
import csv
|
||||
|
||||
def excel_to_csv(excel_path, csv_path):
|
||||
"""Excel转CSV"""
|
||||
try:
|
||||
df = pd.read_excel(excel_path)
|
||||
df.to_csv(csv_path, index=False, encoding='utf-8-sig')
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"Excel转CSV失败: {str(e)}")
|
||||
|
||||
def csv_to_excel(csv_path, excel_path):
|
||||
"""CSV转Excel"""
|
||||
try:
|
||||
df = pd.read_csv(csv_path)
|
||||
df.to_excel(excel_path, index=False)
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"CSV转Excel失败: {str(e)}")
|
||||
|
||||
def json_to_excel(json_path, excel_path):
|
||||
"""JSON转Excel"""
|
||||
try:
|
||||
with open(json_path, 'r', encoding='utf-8') as f:
|
||||
data = json.load(f)
|
||||
|
||||
# 如果是列表格式的JSON
|
||||
if isinstance(data, list):
|
||||
df = pd.DataFrame(data)
|
||||
else:
|
||||
# 如果是字典格式,转换为单行DataFrame
|
||||
df = pd.DataFrame([data])
|
||||
|
||||
df.to_excel(excel_path, index=False)
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"JSON转Excel失败: {str(e)}")
|
||||
|
||||
def excel_to_json(excel_path, json_path):
|
||||
"""Excel转JSON"""
|
||||
try:
|
||||
df = pd.read_excel(excel_path)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(json_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"Excel转JSON失败: {str(e)}")
|
||||
|
||||
def csv_to_json(csv_path, json_path):
|
||||
"""CSV转JSON"""
|
||||
try:
|
||||
df = pd.read_csv(csv_path)
|
||||
data = df.to_dict('records')
|
||||
|
||||
with open(json_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(data, f, ensure_ascii=False, indent=2)
|
||||
|
||||
return True
|
||||
except Exception as e:
|
||||
raise Exception(f"CSV转JSON失败: {str(e)}")
|
||||
|
||||
def json_to_csv(json_path, csv_path):
    """Convert JSON to CSV."""
    try:
        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        # A list of objects becomes one row per item; a single object becomes one row
        if isinstance(data, list):
            df = pd.DataFrame(data)
        else:
            df = pd.DataFrame([data])

        df.to_csv(csv_path, index=False, encoding='utf-8-sig')
        return True
    except Exception as e:
        raise Exception(f"JSON转CSV失败: {str(e)}")
|
||||
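The JSON converters have to cope with two input shapes: a list of objects and a single object. The sketch below illustrates that normalization step with the standard library only, so it runs without pandas. `normalize_records` and `json_text_to_csv_text` are hypothetical helpers, and rejecting any other JSON shape is a design choice of this sketch, not behavior taken from the commit.

```python
import csv
import io
import json
from typing import Any, Dict, List


def normalize_records(data: Any) -> List[Dict[str, Any]]:
    """Return a list of row dicts for either a single object or a list of objects."""
    if isinstance(data, dict):
        return [data]
    if isinstance(data, list) and all(isinstance(item, dict) for item in data):
        return data
    raise ValueError("expected a JSON object or a list of objects")


def json_text_to_csv_text(json_text: str) -> str:
    """Convert JSON text to CSV text, filling missing fields with empty strings."""
    records = normalize_records(json.loads(json_text))
    # Preserve first-seen column order across all records
    columns: List[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(json_text_to_csv_text('[{"a": 1}, {"a": 2, "b": 3}]'))
```

Failing early on unsupported shapes gives a clearer error than letting the table-building step raise on them.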
73
utils/ocr_processor.py
Normal file
@ -0,0 +1,73 @@
|
||||
import pytesseract
|
||||
from PIL import Image
|
||||
import os
|
||||
|
||||
def extract_text_from_image(image_path, lang='chi_sim+eng', use_ai=False, ai_provider='aliyun'):
|
||||
"""从图片中提取文字(OCR)"""
|
||||
try:
|
||||
if use_ai:
|
||||
# 使用AI大模型进行OCR
|
||||
if ai_provider == 'aliyun':
|
||||
from .aliyun_ocr import extract_text_with_aliyun
|
||||
return extract_text_with_aliyun(image_path, 'general')
|
||||
else:
|
||||
raise Exception(f"不支持的AI提供商: {ai_provider}")
|
||||
else:
|
||||
            # Use the traditional Tesseract OCR engine.
            # On Windows, point pytesseract at the executable; honor TESSERACT_PATH if set
            if os.name == 'nt':
                pytesseract.pytesseract.tesseract_cmd = os.getenv(
                    'TESSERACT_PATH', r'C:\Program Files\Tesseract-OCR\tesseract.exe')
|
||||
|
||||
# 打开并处理图片
|
||||
image = Image.open(image_path)
|
||||
|
||||
# 使用OCR提取文字
|
||||
text = pytesseract.image_to_string(image, lang=lang)
|
||||
|
||||
return text.strip()
|
||||
except Exception as e:
|
||||
raise Exception(f"图片文字识别失败: {str(e)}")
|
||||
|
||||
def extract_text_with_ai(image_path, provider='aliyun', ocr_type='general', options=None):
    """Recognize text in an image using an AI model."""
    try:
        if provider == 'aliyun':
            from .aliyun_ocr import extract_text_with_aliyun
            return extract_text_with_aliyun(image_path, ocr_type, options)
        else:
            raise Exception(f"Unsupported AI provider: {provider}")
    except Exception as e:
        raise Exception(f"AI OCR recognition failed: {e}") from e


def image_to_text_file(image_path, output_path):
    """Save the text recognized in an image to a text file."""
    try:
        text = extract_text_from_image(image_path)

        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(text)

        return True
    except Exception as e:
        raise Exception(f"Image to text file conversion failed: {e}") from e


def image_to_excel(image_path, output_path):
    """Save the text recognized in an image to an Excel file."""
    try:
        import pandas as pd

        text = extract_text_from_image(image_path)

        # Split the text into non-empty lines
        lines = [line.strip() for line in text.split('\n') if line.strip()]

        # Build the DataFrame: one row per recognized line
        df = pd.DataFrame({
            'Line number': range(1, len(lines) + 1),
            'Content': lines
        })

        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise Exception(f"Image to Excel conversion failed: {e}") from e
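`image_to_excel` turns OCR output into rows by dropping blank lines and numbering the rest from 1. A minimal sketch of just that step, without Tesseract or pandas, is shown below. `number_nonempty_lines` is a hypothetical name introduced for illustration.

```python
def number_nonempty_lines(text):
    """Return (line_number, content) pairs for the non-blank lines of OCR text."""
    # Same filtering as image_to_excel: strip each line, skip empty ones.
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    # Number the surviving lines starting at 1, like the first spreadsheet column.
    return list(enumerate(lines, start=1))


ocr_text = "  Invoice No. 42 \n\n\nTotal: 100\n"
rows = number_nonempty_lines(ocr_text)
```

Numbering happens after filtering, so the numbers count output rows, not positions in the original image.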
52
utils/pdf_extractor.py
Normal file
@ -0,0 +1,52 @@
import fitz  # PyMuPDF
import pandas as pd


def extract_text_from_pdf(pdf_path):
    """Extract the text content of a PDF."""
    try:
        doc = fitz.open(pdf_path)
        try:
            # Collect per-page text and join once at the end
            pages = [doc.load_page(page_num).get_text() for page_num in range(len(doc))]
        finally:
            # Close the document even if a page fails to load
            doc.close()
        return "".join(pages)
    except Exception as e:
        raise Exception(f"PDF text extraction failed: {e}") from e


def extract_tables_from_pdf(pdf_path):
    """Extract table data from a PDF.

    Placeholder: table detection is not implemented yet, so the returned
    list is always empty.
    """
    try:
        doc = fitz.open(pdf_path)
        tables = []
        try:
            for page_num in range(len(doc)):
                page = doc.load_page(page_num)

                # Simple first step: read the page text. Real use will likely
                # need more sophisticated table detection.
                text = page.get_text("text")
                # Table detection and extraction logic can be added here
        finally:
            doc.close()
        return tables
    except Exception as e:
        raise Exception(f"PDF table extraction failed: {e}") from e


def pdf_to_excel(pdf_path, output_path):
    """Export the text content of a PDF to Excel."""
    try:
        text = extract_text_from_pdf(pdf_path)

        # Split the text into paragraphs on blank lines
        paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

        # Build the DataFrame: one row per paragraph
        df = pd.DataFrame({
            'Paragraph number': range(1, len(paragraphs) + 1),
            'Content': paragraphs
        })

        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise Exception(f"PDF to Excel conversion failed: {e}") from e
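`pdf_to_excel` splits paragraphs on the exact string `'\n\n'`, so a separator line that contains only spaces does not count as a paragraph break. The sketch below compares that split with a slightly more tolerant regular expression. `split_paragraphs` is a hypothetical helper, and whether the tolerant behavior is wanted depends on the PDFs being processed.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines, tolerating whitespace-only separator lines."""
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]


sample = "First paragraph\nstill first.\n \nSecond paragraph.\n\n\nThird."

# The exact-match split used in pdf_to_excel misses the "\n \n" separator.
strict = [p.strip() for p in sample.split('\n\n') if p.strip()]

# The regex-based split treats any run of blank or whitespace-only lines as a break.
tolerant = split_paragraphs(sample)
```

On this sample the strict split yields two paragraphs and the tolerant split yields three.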
366
utils/photo_advice_generator.py
Normal file
@ -0,0 +1,366 @@
#!/usr/bin/env python3
"""
Photo scoring advice generator.
Provides concrete improvement advice based on photo scoring results.
"""


class PhotoAdviceGenerator:
    """Generates improvement advice for scored photos."""

    def __init__(self):
        self.quality_advice_db = self._init_quality_advice()
        self.aesthetic_advice_db = self._init_aesthetic_advice()
        self.technical_advice_db = self._init_technical_advice()

    def _init_quality_advice(self):
        """Initialize the quality improvement advice database."""
        return {
            'clarity': {
                'low': [
                    "Use a tripod or stabilizer to reduce camera shake",
                    "Use a faster shutter speed to avoid motion blur",
                    "Use autofocus to keep the subject sharp",
                    "Clean the lens to remove smudges",
                    "Shoot in well-lit conditions"
                ],
                'medium': [
                    "Fine-tune the focus point to keep the subject sharp",
                    "Use a higher resolution setting",
                    "Avoid over-compressing the image",
                    "Apply moderate sharpening in post-processing"
                ],
                'high': [
                    "Excellent sharpness; keep it up",
                    "Try more challenging shooting scenarios"
                ]
            },
            'brightness': {
                'low': [
                    "Increase exposure compensation",
                    "Use a flash or fill light",
                    "Shoot at a time of day with better light",
                    "Raise the ISO (watch for noise)",
                    "Use a reflector for fill light"
                ],
                'medium': [
                    "Fine-tune the exposure settings",
                    "Shoot in HDR mode",
                    "Balance highlights and shadows",
                    "Adjust the brightness curve in post-processing"
                ],
                'high': [
                    "Brightness is well balanced and exposure is accurate",
                    "Try creative light and shadow effects"
                ]
            },
            'contrast': {
                'low': [
                    "Increase the contrast between light and dark areas",
                    "Choose scenes with strong color contrast",
                    "Use side light or backlight to add depth",
                    "Adjust contrast in post-processing"
                ],
                'medium': [
                    "Moderately boost local contrast",
                    "Keep highlights from clipping and shadows from going pure black",
                    "Use the curves tool for fine adjustments"
                ],
                'high': [
                    "Good contrast with clear tonal separation",
                    "Try a high-contrast style"
                ]
            },
            'color_balance': {
                'low': [
                    "Correct the white balance setting",
                    "Calibrate color with a gray card",
                    "Avoid color casts from mixed light sources",
                    "Correct color balance in post-processing"
                ],
                'medium': [
                    "Fine-tune color temperature and tint",
                    "Keep skin tones natural",
                    "Keep the color style consistent across the frame"
                ],
                'high': [
                    "Excellent color balance with accurate reproduction",
                    "Try creative color styles"
                ]
            }
        }

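    def _init_aesthetic_advice(self):
        """Initialize the aesthetic improvement advice database."""
        return {
            'composition': {
                'basic': [
                    "Learn the rule of thirds",
                    "Pay attention to where the subject sits in the frame",
                    "Avoid placing the subject dead center",
                    "Use leading lines to add depth"
                ],
                'intermediate': [
                    "Try symmetrical or asymmetrical compositions",
                    "Use foreground elements to add layers",
                    "Balance the elements in the frame",
                    "Create a visual focal point"
                ],
                'advanced': [
                    "Excellent composition; try more complex arrangements",
                    "Explore minimalist or complex composition styles",
                    "Pay attention to visual rhythm and flow"
                ]
            },
            'lighting': {
                'basic': [
                    "Shoot during golden hour (sunrise and sunset)",
                    "Avoid harsh direct midday light",
                    "Learn to work with natural light",
                    "Watch the direction and quality of light"
                ],
                'intermediate': [
                    "Try side lighting or backlighting",
                    "Use shadows to create mood",
                    "Control the lighting ratio to avoid over- or underexposure",
                    "Learn to use artificial light sources"
                ],
                'advanced': [
                    "Skilled use of light; try creative lighting",
                    "Explore shooting in unusual lighting conditions",
                    "Use light and shadow to convey emotion"
                ]
            },
            'subject': {
                'basic': [
                    "Identify a clear subject",
                    "Simplify the background to emphasize the subject",
                    "Consider how the subject interacts with its surroundings",
                    "Choose subjects that tell a story"
                ],
                'intermediate': [
                    "Pay attention to the subject's expression and posture",
                    "Build a relationship between the subject and its environment",
                    "Capture the decisive moment",
                    "Bring out the subject's personality"
                ],
                'advanced': [
                    "Strong subject presence; try deeper levels of expression",
                    "Explore abstract or conceptual subjects",
                    "Consider the subject's symbolic meaning"
                ]
            }
        }
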
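    def _init_technical_advice(self):
        """Initialize the technical improvement advice database."""
        return {
            'camera_settings': [
                "Learn the exposure triangle (aperture, shutter speed, ISO)",
                "Choose a shooting mode that suits the scene",
                "Master focusing techniques to keep the subject sharp",
                "Use white balance settings appropriately"
            ],
            'post_processing': [
                "Learn basic post-processing adjustments",
                "Master color correction and adjustment",
                "Learn sharpening and noise reduction",
                "Try creative filter effects"
            ],
            'equipment': [
                "Choose lenses that match your needs",
                "Consider a tripod for better stability",
                "Invest in good-quality storage",
                "Clean and maintain your equipment regularly"
            ],
            'shooting_techniques': [
                "Practice a steady way of holding the camera",
                "Learn different shooting angles",
                "Master burst and timed shooting",
                "Try slow-shutter or high-speed photography"
            ]
        }
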
    def generate_quality_advice(self, quality_scores):
        """Generate quality improvement advice."""
        advice = {
            'overall': [],
            'specific': {},
            'priority': []
        }

        # An empty score dict has no average; fail with a clear message
        # instead of a ZeroDivisionError
        if not quality_scores:
            raise ValueError("quality_scores must not be empty")

        # Overall advice
        overall_score = sum(quality_scores.values()) / len(quality_scores)

        if overall_score >= 90:
            advice['overall'].append("Photo quality is excellent; keep shooting at this level")
        elif overall_score >= 80:
            advice['overall'].append("Photo quality is good, with room for further improvement")
        elif overall_score >= 60:
            advice['overall'].append("Photo quality is average and needs focused improvement")
        else:
            advice['overall'].append("Photo quality is poor; consider studying photography fundamentals systematically")

        # Advice per dimension
        for dimension, score in quality_scores.items():
            if dimension in self.quality_advice_db:
                level = self._get_score_level(score)
                dimension_advice = self.quality_advice_db[dimension].get(level, [])
                advice['specific'][dimension] = dimension_advice

                # Flag low-scoring dimensions as priorities
                if score < 70:
                    advice['priority'].append(f"Improve {dimension} first (current score: {score})")

        return advice

    def generate_aesthetic_advice(self, aesthetic_score, composition_analysis):
        """Generate aesthetic improvement advice.

        composition_analysis is accepted but currently unused.
        """
        advice = {
            'general': [],
            'composition': [],
            'lighting': [],
            'subject': [],
            'creative': []
        }

        # Overall aesthetic advice
        if aesthetic_score >= 90:
            advice['general'].append("Excellent aesthetics, at a professional level")
            advice['creative'].append("Try more challenging creative shoots")
        elif aesthetic_score >= 80:
            advice['general'].append("Good aesthetics; details could be refined")
            advice['creative'].append("Try different approaches to composition and lighting")
        elif aesthetic_score >= 60:
            advice['general'].append("Average aesthetics; systematic study is needed")
            advice['creative'].append("Start practicing with basic composition and lighting")
        else:
            advice['general'].append("Poor aesthetics; consider learning the basics of photographic aesthetics")

        # Composition, lighting, and subject advice all use the same level,
        # derived from the single aesthetic score
        level = self._get_aesthetic_level(aesthetic_score)
        advice['composition'] = self.aesthetic_advice_db['composition'].get(level, [])
        advice['lighting'] = self.aesthetic_advice_db['lighting'].get(level, [])
        advice['subject'] = self.aesthetic_advice_db['subject'].get(level, [])

        return advice

    def generate_technical_advice(self, photo_type='general'):
        """Generate technical improvement advice."""
        # Copy each list so the per-type tips added below do not mutate
        # technical_advice_db when the same instance is reused
        advice = {
            'camera_settings': list(self.technical_advice_db['camera_settings']),
            'post_processing': list(self.technical_advice_db['post_processing']),
            'equipment': list(self.technical_advice_db['equipment']),
            'shooting_techniques': list(self.technical_advice_db['shooting_techniques'])
        }

        # Adjust the advice for the photo type
        if photo_type == 'portrait':
            advice['camera_settings'].extend([
                "Use a wide aperture to blur the background",
                "Focus on the eyes",
                "Use soft lighting to flatter skin tones"
            ])
        elif photo_type == 'landscape':
            advice['camera_settings'].extend([
                "Use a narrow aperture for deep depth of field",
                "Use a tripod for stability",
                "Use filters to control light"
            ])
        elif photo_type == 'macro':
            advice['camera_settings'].extend([
                "Use a macro lens or extension tubes",
                "Control depth of field carefully",
                "Use a ring flash for fill light"
            ])

        return advice

    def generate_personalized_advice(self, quality_scores, aesthetic_score, photo_content):
        """Generate personalized combined advice.

        photo_content is accepted but currently unused.
        """
        personalized = {
            'quick_wins': [],
            'long_term_improvements': [],
            'learning_resources': [],
            'practice_exercises': []
        }

        # Quick wins: dimensions scoring below 70
        low_score_dimensions = [dim for dim, score in quality_scores.items() if score < 70]
        if low_score_dimensions:
            personalized['quick_wins'].append(f"Focus on improving: {', '.join(low_score_dimensions)}")

        # Long-term improvements
        if aesthetic_score < 80:
            personalized['long_term_improvements'].append("Study composition and lighting systematically")

        # Recommended learning resources
        personalized['learning_resources'].extend([
            "Recommended books: 《摄影构图学》 and 《美国纽约摄影学院教材》",
            "Online courses: photography tutorials on Bilibili, 摄影之友",
            "Practice opportunities: enter photo contests, join photography communities"
        ])

        # Practice exercises
        personalized['practice_exercises'].extend([
            "Daily shooting practice: one subject from different angles",
            "Technical practice: exposure, focus, white balance",
            "Creative practice: try different styles and subjects"
        ])

        return personalized

    def _get_score_level(self, score):
        """Map a quality score to a level: 'high', 'medium', or 'low'."""
        if score >= 85:
            return 'high'
        elif score >= 70:
            return 'medium'
        else:
            return 'low'

    def _get_aesthetic_level(self, score):
        """Map an aesthetic score to a level: 'advanced', 'intermediate', or 'basic'."""
        if score >= 85:
            return 'advanced'
        elif score >= 70:
            return 'intermediate'
        else:
            return 'basic'


def get_quality_improvement_advice(quality_scores):
    """Get quality improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_quality_advice(quality_scores)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_aesthetic_improvement_advice(aesthetic_score, composition_analysis=None):
    """Get aesthetic improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_aesthetic_advice(aesthetic_score, composition_analysis)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_technical_advice(photo_type='general'):
    """Get technical improvement advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_technical_advice(photo_type)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}


def get_personalized_advice(quality_scores, aesthetic_score, photo_content):
    """Get personalized combined advice."""
    try:
        advisor = PhotoAdviceGenerator()
        return advisor.generate_personalized_advice(quality_scores, aesthetic_score, photo_content)
    except Exception as e:
        return {'error': f"Failed to generate advice: {e}"}
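`generate_technical_advice` appends type-specific tips to lists that come from `technical_advice_db`. That is only safe on a reused `PhotoAdviceGenerator` instance if the lists are copied before `extend` is called. The sketch below shows the difference using a stand-in dictionary; `base_db`, `tips_aliasing`, and `tips_copying` are illustrative names, not the project's data or API.

```python
base_db = {'camera_settings': ["Learn the exposure triangle"]}


def tips_aliasing(db, extra):
    """Build advice that shares the database's list object."""
    advice = {'camera_settings': db['camera_settings']}  # same list, no copy
    advice['camera_settings'].extend(extra)              # also grows db's list
    return advice


def tips_copying(db, extra):
    """Build advice from a shallow copy of the database's list."""
    advice = {'camera_settings': list(db['camera_settings'])}  # independent list
    advice['camera_settings'].extend(extra)                    # db is untouched
    return advice


copied = tips_copying(base_db, ["Focus on the eyes"])
length_after_copy = len(base_db['camera_settings'])

aliased = tips_aliasing(base_db, ["Focus on the eyes"])
length_after_alias = len(base_db['camera_settings'])
```

After the copying call the stand-in database still has one entry; after the aliasing call it has two, so repeated calls would keep accumulating tips.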
99
utils/web_scraper.py
Normal file
@ -0,0 +1,99 @@
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup


def scrape_webpage(url, selector=None):
    """Fetch a web page and return its text content."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')

        if selector:
            # Extract specific content by CSS selector
            elements = soup.select(selector)
            content = [elem.get_text(strip=True) for elem in elements]
        else:
            # Extract all text, separating text nodes with newlines so that
            # callers can split the result into paragraphs
            content = soup.get_text(separator='\n', strip=True)

        return content
    except Exception as e:
        raise Exception(f"Web page scraping failed: {e}") from e


def scrape_table_from_webpage(url, table_index=0):
    """Extract table data from a web page.

    Returns a (headers, rows) tuple, or None if the page has no tables.
    """
    try:
        # Named request_headers so it is not shadowed by the table headers below
        request_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }

        response = requests.get(url, headers=request_headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        tables = soup.find_all('table')

        if not tables:
            return None

        table = tables[table_index]

        # Extract the table headers from the first row
        headers = []
        header_row = table.find('tr')
        if header_row:
            headers = [th.get_text(strip=True) for th in header_row.find_all(['th', 'td'])]

        # Extract the data rows
        data = []
        rows = table.find_all('tr')[1:]  # Skip the header row

        for row in rows:
            cells = row.find_all(['td', 'th'])
            row_data = [cell.get_text(strip=True) for cell in cells]
            if row_data:
                data.append(row_data)

        return headers, data
    except Exception as e:
        raise Exception(f"Web page table extraction failed: {e}") from e


def web_to_excel(url, output_path, selector=None):
    """Export web page content to Excel."""
    try:
        if selector:
            content = scrape_webpage(url, selector)
            if isinstance(content, list):
                df = pd.DataFrame({
                    'No.': range(1, len(content) + 1),
                    'Content': content
                })
            else:
                df = pd.DataFrame({'Content': [content]})
        else:
            # Try to extract a table first
            table_data = scrape_table_from_webpage(url)
            if table_data:
                headers, data = table_data
                df = pd.DataFrame(data, columns=headers)
            else:
                # Fall back to plain text
                content = scrape_webpage(url)
                # Split into paragraphs on newlines
                paragraphs = [p.strip() for p in re.split(r'\n+', content) if p.strip()]
                df = pd.DataFrame({
                    'Paragraph number': range(1, len(paragraphs) + 1),
                    'Content': paragraphs
                })

        df.to_excel(output_path, index=False)
        return True
    except Exception as e:
        raise Exception(f"Web page to Excel conversion failed: {e}") from e
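`web_to_excel` hands the scraped rows to `pd.DataFrame(data, columns=headers)`, which assumes each row has as many cells as the header row. Tables with merged or missing cells can break that assumption and make the constructor raise. The sketch below pads or truncates rows to the header width before building the frame. `normalize_rows` is a hypothetical helper, and padding with empty strings is an assumption about what the spreadsheet should show.

```python
def normalize_rows(headers, rows, fill=''):
    """Pad or truncate each row so it has exactly len(headers) cells."""
    width = len(headers)
    normalized = []
    for row in rows:
        # Keep at most `width` cells, then pad short rows with the fill value.
        padded = list(row[:width]) + [fill] * (width - len(row))
        normalized.append(padded)
    return normalized


table_headers = ['Name', 'Qty', 'Price']
table_rows = [
    ['Pen', '2', '1.50'],
    ['Ink'],                          # short row, e.g. from missing cells
    ['Pad', '1', '3.00', 'extra'],    # long row, e.g. a stray trailing cell
]
clean_rows = normalize_rows(table_headers, table_rows)
```

Truncation silently drops extra cells, so logging the affected rows may be preferable when the data matters.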