Overview
Dume.ai offers two types of web scraping capabilities: the Basic Scraper for quick HTML extraction in chat, and the Advanced Scraper for comprehensive data extraction in workflows. The Basic Scraper returns raw HTML content, while the Advanced Scraper provides structured data with change tracking, metadata, and multiple output formats.
Scraper Types
- Basic Scraper
- Advanced Scraper
Available in: Chat interface only
Activation: @web_scraper
Output: Raw HTML content from the webpage
Best for: Quick content checks, simple data extraction, chat-based research
How It Works
Basic Scraper (Chat Interface)
1. Activate the Tool
Type @web_scraper in the chat interface to activate the basic web scraping functionality.
2. Specify Your Target
Provide the website URL and describe what you want to extract. The basic scraper returns raw HTML content that you can analyze directly in the chat.
3. Review Results
The tool returns the HTML source code for manual analysis and extraction.
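Once the raw HTML arrives in chat, you can also parse it yourself. A minimal sketch using only Python's standard library; the HTML string below is a stand-in for what the basic scraper might return, not actual tool output:

```python
from html.parser import HTMLParser

# Minimal parser that collects the page title and all link targets
# from raw HTML, the kind of content the basic scraper returns.
class LinkAndTitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Illustrative HTML standing in for a basic scraper response.
raw_html = ('<html><head><title>Acme Pricing</title></head>'
            '<body><a href="/plans">Plans</a><a href="/contact">Contact</a></body></html>')

parser = LinkAndTitleParser()
parser.feed(raw_html)
print(parser.title)   # Acme Pricing
print(parser.links)   # ['/plans', '/contact']
```

The same pattern scales to any element or attribute you need to pull out of the returned source.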
Advanced Scraper (Workflows)
The Advanced Scraper is available in workflows and provides comprehensive data extraction with structured output.
Advanced Scraper Output Schema
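The exact schema is defined by the workflow runtime. As an illustration only, a structured result covering the features described below (multiple formats, metadata, change tracking) might look like the following; every field name here is an assumption, not the documented schema:

```python
# Hypothetical Advanced Scraper result for illustration.
# Field names are assumptions, not the official Dume.ai schema.
advanced_result = {
    "url": "https://example.com/pricing",
    "formats": {
        "html": "<html>...</html>",
        "markdown": "# Pricing\n...",
        "json": {"plans": [{"name": "Pro", "price": "$29/mo"}]},
    },
    "metadata": {"title": "Pricing", "scraped_at": "2024-01-15T10:00:00Z"},
    "changes": {"changed": True, "diff": "- $25/mo\n+ $29/mo"},
}

# Downstream workflow steps can read structured fields directly
# instead of parsing raw HTML.
print(advanced_result["metadata"]["title"])  # Pricing
```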
Use Cases
- Basic Scraper Uses
- Advanced Scraper Uses
- Market Research
- Lead Generation
Quick Analysis
- Check website structure and content
- Verify HTML elements and tags
- Debug web page issues
- Extract simple text content
- Read article content quickly
- Check meta tags and SEO elements
- Analyze page structure
- Get contact information
Example Commands
Basic Scraper (Chat Interface)
Basic HTML Extraction
Advanced Scraper (Workflows Only)
Structured Data Extraction
The Advanced Scraper in workflows provides comprehensive output with change tracking:
Example Advanced Output
Data Extraction Comparison
| Feature | Basic Scraper | Advanced Scraper |
|---|---|---|
| Availability | Chat interface only | Workflows only |
| Output Format | Raw HTML | HTML + Markdown + JSON |
| Change Tracking | ❌ None | ✅ Full tracking with diffs |
| Metadata | ❌ Manual extraction | ✅ Automatic extraction |
| Structured Data | ❌ Manual parsing needed | ✅ Auto-parsed JSON |
| Historical Data | ❌ No history | ✅ Previous scrape tracking |
| Integration | Manual analysis | Workflow automation |
Both scrapers respect robots.txt files and implement rate limiting for responsible scraping.
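For a sense of what respecting robots.txt and rate limiting means in practice, here is a sketch using Python's standard library; the robots.txt body, user-agent string, and delay value are illustrative, and the scrapers handle this for you automatically:

```python
import time
import urllib.robotparser

# Parse a robots.txt body (normally fetched from the target site)
# and check whether a path may be scraped.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("dume-scraper", "/public/page"))   # True
print(rp.can_fetch("dume-scraper", "/private/data"))  # False

def polite_fetch(urls, fetch, delay_seconds=1.0):
    # Simple rate limiting: a fixed pause between requests so the
    # target server is not overloaded.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)
    return results
```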
Scraping Strategies
Single Page Scraping
Extract data from a specific webpage.
Best for: Specific data collection, one-time extractions, targeted research
Multi-Page Scraping
Collect data across multiple pages or sections.
Best for: Comprehensive data collection, catalog building, large-scale research
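The multi-page pattern can be sketched as a pagination loop: keep requesting numbered pages until one comes back empty. The `fetch_page` function here is a hypothetical stand-in for whatever retrieval step your workflow uses:

```python
# Multi-page collection sketch: walk numbered pages until a
# hypothetical fetch function reports no more results.
def scrape_all_pages(fetch_page, max_pages=100):
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)  # a list of items, or [] when exhausted
        if not batch:
            break
        items.extend(batch)
    return items

# Simulated three-page catalog for illustration.
fake_pages = {1: ["item-a", "item-b"], 2: ["item-c"], 3: []}
print(scrape_all_pages(lambda p: fake_pages.get(p, [])))  # ['item-a', 'item-b', 'item-c']
```

The `max_pages` cap is a safety bound so a misbehaving site cannot trap the loop.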
Real-Time Monitoring
Track changes and updates on websites.
Best for: Competitive monitoring, market tracking, opportunity alerts
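Change tracking of this kind boils down to comparing the current scrape against the previous one. A minimal stdlib sketch, similar in spirit to the Advanced Scraper's diff output (the real diff format may differ):

```python
import difflib
import hashlib

def content_changed(previous, current):
    # Cheap change check: compare content hashes rather than full text.
    return (hashlib.sha256(previous.encode()).hexdigest()
            != hashlib.sha256(current.encode()).hexdigest())

def change_diff(previous, current):
    # Line-level unified diff between two scrapes.
    return "\n".join(difflib.unified_diff(
        previous.splitlines(), current.splitlines(), lineterm=""))

old = "Price: $25/mo\nSeats: 5"
new = "Price: $29/mo\nSeats: 5"
print(content_changed(old, new))  # True
print(change_diff(old, new))
```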
Data Mining
Extract insights and patterns from web data.
Best for: Market research, sentiment analysis, trend identification
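A toy mining pass over scraped text shows the idea: pull out structured values (here, prices) with a pattern, and count recurring keywords to spot trends. The text and patterns are illustrative, not product output:

```python
import re
from collections import Counter

# Illustrative scraped text for a simple data-mining pass.
scraped_text = """
Acme Pro costs $29/mo. Acme Basic costs $9/mo.
Competitor X launched a new AI feature. AI adoption is rising.
"""

# Extract monthly prices and count keyword frequency.
prices = re.findall(r"\$\d+(?:\.\d{2})?/mo", scraped_text)
keywords = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", scraped_text))

print(prices)          # ['$29/mo', '$9/mo']
print(keywords["ai"])  # 2
```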
Output Formats
- Basic Scraper Output
- Advanced Scraper Output
Raw HTML
Best Practices
Scraper Selection Guide
- Basic Scraper: Quick checks, research, manual analysis
- Advanced Scraper: Automation, monitoring, data pipelines
- URLs: Always provide complete, valid URLs
- Specificity: Be clear about what data you need
- Rate Limiting: Respect website performance
Responsible Scraping
1. Check Terms of Service
Review the website's terms of service and robots.txt file before scraping.
2. Respect Rate Limits
Allow reasonable delays between requests to avoid overloading servers.
3. Use Appropriate Data
Only extract publicly available information for legitimate business purposes.
4. Store Data Securely
Ensure extracted data is stored and handled according to privacy regulations.
Common Applications
Legal & Ethical Considerations
Always ensure compliance with website terms of service, data protection regulations (GDPR, CCPA), and copyright laws when scraping web content.
Recommended Guidelines:
- ✅ Scrape publicly available information only
- ✅ Respect robots.txt and rate limiting
- ✅ Use for legitimate business research and analysis
- ✅ Comply with data protection regulations
- ✅ Attribute sources when required
Avoid These Practices:
- ❌ Scraping private or protected content
- ❌ Overloading servers with excessive requests
- ❌ Violating copyright or intellectual property
- ❌ Collecting personal data without consent
- ❌ Bypassing authentication or access controls
Use web scraping as a powerful tool for legitimate business intelligence while respecting digital property rights and privacy.