
Overview

Dume.ai offers two types of web scraping capabilities: Basic Scraper for quick HTML extraction in chat, and Advanced Scraper for comprehensive data extraction in workflows.
The Basic Scraper provides simple HTML content, while the Advanced Scraper offers structured data with change tracking, metadata, and multiple output formats.

Scraper Types

  • Basic Scraper
  • Advanced Scraper

Basic Scraper
Available in: Chat interface only
Activation: @web_scraper
Output: Raw HTML content from the webpage
Best for: Quick content checks, simple data extraction, chat-based research

How It Works

Basic Scraper (Chat Interface)

1. Activate the Tool

Type @web_scraper in the chat interface to activate the basic web scraping functionality.

2. Specify Your Target

Provide the website URL and describe what you want to extract.
@web_scraper Get the HTML content from https://example-store.com/products
Note: The Basic Scraper returns raw HTML content that you can analyze directly in the chat.

3. Review Results

The tool returns the HTML source code for manual analysis and extraction.

Advanced Scraper (Workflows)

The Advanced Scraper is available in workflows and provides comprehensive data extraction with structured output:
Advanced Scraper Output Schema
{
  markdown: string,                    // Clean markdown version
  html: string | null,                // Full HTML content  
  metadata: any,                      // Page metadata (title, description, etc.)
  changeTracking: {
    previousScrapeAt: string | null,   // ISO datetime of last scrape
    changeStatus: 'new' | 'same' | 'changed' | 'removed',
    visibility: 'visible' | 'hidden',
    diff: string | null,              // Text diff if changed
    json: any,                        // Structured JSON data
  }
}
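
A workflow step written in TypeScript could model the schema above as named types, with a runtime check before any field is read. This is a minimal sketch under stated assumptions: `AdvancedScraperOutput`, `ChangeTracking`, and `isAdvancedScraperOutput` are illustrative names rather than Dume.ai identifiers, and `unknown` stands in for the schema's `any`.

```typescript
// Hypothetical type names; field names and value sets mirror the schema above.
type ChangeStatus = "new" | "same" | "changed" | "removed";
type Visibility = "visible" | "hidden";

interface ChangeTracking {
  previousScrapeAt: string | null; // ISO datetime of last scrape
  changeStatus: ChangeStatus;
  visibility: Visibility;
  diff: string | null; // Text diff if changed
  json: unknown; // Structured JSON data
}

interface AdvancedScraperOutput {
  markdown: string;
  html: string | null;
  metadata: unknown;
  changeTracking: ChangeTracking;
}

const CHANGE_STATUSES: readonly string[] = ["new", "same", "changed", "removed"];
const VISIBILITIES: readonly string[] = ["visible", "hidden"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

function isStringOrNull(value: unknown): value is string | null {
  return value === null || typeof value === "string";
}

// Runtime guard: checks only the fields whose types the schema pins down.
// `metadata` and `changeTracking.json` are left unchecked because they are `any`.
function isAdvancedScraperOutput(value: unknown): value is AdvancedScraperOutput {
  if (!isRecord(value)) return false;
  const tracking = value["changeTracking"];
  if (!isRecord(tracking)) return false;
  const status = tracking["changeStatus"];
  const visibility = tracking["visibility"];
  return (
    typeof value["markdown"] === "string" &&
    isStringOrNull(value["html"]) &&
    isStringOrNull(tracking["previousScrapeAt"]) &&
    typeof status === "string" &&
    CHANGE_STATUSES.indexOf(status) !== -1 &&
    typeof visibility === "string" &&
    VISIBILITIES.indexOf(visibility) !== -1 &&
    isStringOrNull(tracking["diff"])
  );
}
```

Guarding at the boundary lets later steps rely on the narrowed type instead of repeating null and shape checks.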

Use Cases

  • Basic Scraper Uses
  • Advanced Scraper Uses
  • Market Research
  • Lead Generation

Basic Scraper Uses

Quick Analysis
  • Check website structure and content
  • Verify HTML elements and tags
  • Debug web page issues
  • Extract simple text content
Research Tasks
  • Read article content quickly
  • Check meta tags and SEO elements
  • Analyze page structure
  • Get contact information

Example Commands

Basic Scraper (Chat Interface)

@web_scraper Get HTML from https://company.com/about
Output: Raw HTML content that you can analyze in the chat interface.

Advanced Scraper (Workflows Only)

The Advanced Scraper in workflows provides comprehensive output with change tracking:
Example Advanced Output
{
  "markdown": "# Product Page\n\n**Price:** $99.99\n**Availability:** In Stock\n\n## Description\nHigh-quality wireless headphones...",
  "html": "<html><head><title>Product</title></head>...",
  "metadata": {
    "title": "Wireless Headphones - TechStore",
    "description": "Premium wireless headphones with noise cancellation",
    "keywords": ["headphones", "wireless", "audio"],
    "author": "TechStore",
    "publishedDate": "2025-09-26T10:00:00Z"
  },
  "changeTracking": {
    "previousScrapeAt": "2025-09-25T10:00:00Z",
    "changeStatus": "changed", 
    "visibility": "visible",
    "diff": "Price changed from $109.99 to $99.99",
    "json": {
      "price": "$99.99",
      "availability": "In Stock",
      "rating": 4.5,
      "reviews": 1247
    }
  }
}
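
A later workflow step might branch on `changeTracking.changeStatus` to decide whether anything is worth reporting. The sketch below is illustrative only: `ScrapeResult` and `describeChange` are hypothetical names, the `metadata.title` field is assumed from the example above (the schema types `metadata` as `any`), and the reading of `new` and `removed` is an assumption based on the value names.

```typescript
// Minimal structural type covering only the fields this sketch reads.
interface ScrapeResult {
  metadata: { title?: string };
  changeTracking: {
    previousScrapeAt: string | null;
    changeStatus: "new" | "same" | "changed" | "removed";
    visibility: "visible" | "hidden";
    diff: string | null;
  };
}

// Returns a one-line notification, or null when there is nothing to report.
function describeChange(result: ScrapeResult): string | null {
  const title = result.metadata.title ?? "(untitled page)";
  const { changeStatus, diff, previousScrapeAt } = result.changeTracking;
  switch (changeStatus) {
    case "same":
      return null; // unchanged since the previous scrape
    case "new":
      return `First scrape of "${title}"`;
    case "removed":
      return `"${title}" is no longer available`;
    case "changed":
      return `"${title}" changed since ${previousScrapeAt ?? "the last scrape"}: ${diff ?? "no diff provided"}`;
  }
}
```

Applied to the example output above, `describeChange` returns `"Wireless Headphones - TechStore" changed since 2025-09-25T10:00:00Z: Price changed from $109.99 to $99.99`.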

Data Extraction Comparison

Feature         | Basic Scraper            | Advanced Scraper
----------------|--------------------------|----------------------------
Availability    | Chat interface only      | Workflows only
Output Format   | Raw HTML                 | HTML + Markdown + JSON
Change Tracking | ❌ None                  | ✅ Full tracking with diffs
Metadata        | ❌ Manual extraction     | ✅ Automatic extraction
Structured Data | ❌ Manual parsing needed | ✅ Auto-parsed JSON
Historical Data | ❌ No history            | ✅ Previous scrape tracking
Integration     | Manual analysis          | Workflow automation

Both scrapers respect robots.txt files and implement rate limiting for responsible scraping.
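
For a manual pre-check of a target path against a site's robots.txt, a simplified matcher might look like the sketch below. `isPathAllowed` is a hypothetical helper and not part of Dume.ai. It reads only `User-agent: *` groups and plain path prefixes, ignores `*` and `$` wildcards, and applies the longest-match rule with Allow winning ties.

```typescript
// Simplified robots.txt check (assumption-laden sketch, not a full parser).
// Returns true when `path` is allowed for the generic `*` user agent.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  let inStarGroup = false;
  let lastWasUserAgent = false;
  let best = { length: -1, allow: true }; // no matching rule means allowed

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group of rules.
      inStarGroup = (lastWasUserAgent && inStarGroup) || value === "*";
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;

    if (!inStarGroup || (field !== "allow" && field !== "disallow")) continue;
    // An empty value matches nothing; otherwise require a path-prefix match.
    if (value === "" || path.indexOf(value) !== 0) continue;

    const allow = field === "allow";
    if (value.length > best.length || (value.length === best.length && allow)) {
      best = { length: value.length, allow };
    }
  }
  return best.allow;
}
```

A production check would also need wildcard support and agent-specific groups, so treat this as a reading aid rather than a compliance tool.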

Scraping Strategies

Extract data from a specific webpage:
@web_scraper Get all customer testimonials from https://company.com/testimonials
@web_scraper Extract FAQ questions and answers from https://support-site.com/faq
Best for: Specific data collection, one-time extractions, targeted research

Collect data across multiple pages or sections:
@web_scraper Scrape all blog post titles and dates from https://blog.com (all pages)
@web_scraper Extract product catalog from https://ecommerce.com/categories/electronics
Best for: Comprehensive data collection, catalog building, large-scale research

Track changes and updates on websites:
@web_scraper Monitor price changes on https://competitor.com/pricing-page
@web_scraper Track new job postings on https://company.com/careers
Best for: Competitive monitoring, market tracking, opportunity alerts

Extract insights and patterns from web data:
@web_scraper Analyze customer review sentiment from https://product-reviews.com
@web_scraper Extract trending topics from https://industry-forum.com
Best for: Market research, sentiment analysis, trend identification

Output Formats

  • Basic Scraper Output
  • Advanced Scraper Output

Basic Scraper Output: Raw HTML
<!DOCTYPE html>
<html>
<head>
  <title>Product Page</title>
  <meta name="description" content="Product description">
</head>
<body>
  <h1>Wireless Headphones</h1>
  <p class="price">$99.99</p>
  <p class="stock">In Stock</p>
</body>
</html>
Format: Raw HTML content only
Analysis: Manual parsing required
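
Since the Basic Scraper leaves parsing to you, a small helper can pull individual values out of simple markup like the sample above. This is a sketch only: `textByClass` and `titleOf` are hypothetical helpers, and the regex approach assumes well-formed HTML, double-quoted class attributes, and no tags nested inside the target element. A real HTML parser is the safer choice for anything more complex.

```typescript
// Escapes characters that have special meaning inside a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the trimmed text of the first element whose class attribute
// contains `className`, or null when no such element is found.
function textByClass(html: string, className: string): string | null {
  const pattern = new RegExp(
    `<([a-z][a-z0-9]*)\\b[^>]*\\bclass="[^"]*\\b${escapeRegExp(className)}\\b[^"]*"[^>]*>([^<]*)</\\1>`,
    "i"
  );
  const match = pattern.exec(html);
  return match ? match[2].trim() : null;
}

// Returns the trimmed contents of the <title> element, or null if absent.
function titleOf(html: string): string | null {
  const match = /<title[^>]*>([^<]*)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}
```

With the sample HTML above, `titleOf(html)` yields `Product Page`, `textByClass(html, "price")` yields `$99.99`, and `textByClass(html, "stock")` yields `In Stock`.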

Best Practices

Scraper Selection Guide
  • Basic Scraper: Quick checks, research, manual analysis
  • Advanced Scraper: Automation, monitoring, data pipelines
  • URLs: Always provide complete, valid URLs
  • Specificity: Be clear about what data you need
  • Rate Limiting: Respect website performance
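
The "complete, valid URLs" guideline can be checked before a command is sent. The sketch below uses the standard `URL` constructor; `isCompleteUrl` is a hypothetical helper name, and restricting the check to `http:` and `https:` is an assumption about what the scrapers accept.

```typescript
// Returns true when `input` is an absolute http(s) URL with a hostname.
function isCompleteUrl(input: string): boolean {
  try {
    const url = new URL(input);
    const isWeb = url.protocol === "http:" || url.protocol === "https:";
    return isWeb && url.hostname !== "";
  } catch {
    // `new URL` throws on relative or malformed input.
    return false;
  }
}
```

For instance, `https://company.com/about` passes, while `company.com/about` (no scheme) and `/about` (relative path) do not.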

Responsible Scraping

1. Check Terms of Service

Review the website’s terms of service and robots.txt file before scraping.

2. Respect Rate Limits

Allow reasonable delays between requests to avoid overloading servers.

3. Use Appropriate Data

Only extract publicly available information for legitimate business purposes.

4. Store Data Securely

Ensure extracted data is stored and handled according to privacy regulations.
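
The "Respect Rate Limits" step can be sketched as a loop that processes URLs one at a time with a fixed pause between them. Everything here is illustrative: `runWithDelay` and `sleep` are hypothetical helpers, `task` is a placeholder for whatever actually issues the request, and a suitable delay depends on the target site.

```typescript
// Resolves after the given number of milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs `task` for each URL sequentially, waiting `delayMs` between calls.
// Results are returned in the same order as the input URLs.
async function runWithDelay<T>(
  urls: readonly string[],
  delayMs: number,
  task: (url: string) => Promise<T>
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no pause before the first request
    results.push(await task(urls[i]));
  }
  return results;
}
```

Running requests sequentially keeps at most one in flight per site, which is the simplest way to avoid bursts. A fixed delay is a deliberate simplification over adaptive backoff.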

Common Applications

@web_scraper Monitor competitor prices and product availability on https://competitor-store.com/category/electronics
{
  "extraction_results": {
    "url": "https://competitor-store.com/category/electronics",
    "timestamp": "2025-09-26T10:30:00Z",
    "data": [
      {
        "product_name": "Wireless Headphones",
        "price": "$129.99",
        "brand": "TechBrand",
        "rating": "4.5/5",
        "reviews_count": 1247,
        "availability": "In Stock"
      }
    ],
    "total_items": 156
  }
}
Always ensure compliance with website terms of service, data protection regulations (GDPR, CCPA), and copyright laws when scraping web content.

Recommended Practices:

  • ✅ Scrape publicly available information only
  • ✅ Respect robots.txt and rate limiting
  • ✅ Use for legitimate business research and analysis
  • ✅ Comply with data protection regulations
  • ✅ Attribute sources when required

Avoid These Practices:

  • ❌ Scraping private or protected content
  • ❌ Overloading servers with excessive requests
  • ❌ Violating copyright or intellectual property
  • ❌ Collecting personal data without consent
  • ❌ Bypassing authentication or access controls
Use web scraping as a powerful tool for legitimate business intelligence while respecting digital property rights and privacy.