Web Scraper Tool - Intelligent Data Extraction

Overview

Dume.ai offers two web scrapers: the Basic Scraper for quick HTML extraction in chat, and the Advanced Scraper for comprehensive data extraction in workflows.

The Basic Scraper provides simple HTML content, while the Advanced Scraper offers structured data with change tracking, metadata, and multiple output formats.

Scraper Types

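Basic Scraper

Available in: Chat interface only
Activation: @web_scraper
Output: Raw HTML content from the webpage
Best for: Quick content checks, simple data extraction, chat-based research

Advanced Scraper

Available in: Workflows only
Output: HTML, Markdown, and JSON, with metadata and change tracking
Best for: Automation, monitoring, data pipelines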

How It Works

Basic Scraper (Chat Interface)

Activate the Tool

Type @web_scraper in the chat interface to activate the basic web scraping functionality.

Specify Your Target

Provide the website URL and describe what you want to extract.

@web_scraper Get the HTML content from https://example-store.com/products

The Basic Scraper returns raw HTML content that you can analyze directly in the chat.

Review Results

The tool returns the HTML source code for manual analysis and extraction.

Advanced Scraper (Workflows)

The Advanced Scraper is available in workflows and provides comprehensive data extraction with structured output:

Advanced Scraper Output Schema

{
  markdown: string,                    // Clean markdown version
  html: string | null,                // Full HTML content  
  metadata: any,                      // Page metadata (title, description, etc.)
  changeTracking: {
    previousScrapeAt: string | null,   // ISO datetime of last scrape
    changeStatus: 'new' | 'same' | 'changed' | 'removed',
    visibility: 'visible' | 'hidden',
    diff: string | null,              // Text diff if changed
    json: any,                        // Structured JSON data
  }
}
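As one way to consume this schema from TypeScript, the sketch below restates it as named types and adds a runtime guard for untyped workflow payloads. The names `AdvancedScraperOutput`, `ChangeTracking`, and `isAdvancedScraperOutput` are illustrative assumptions, not part of Dume.ai's API, and `any` is narrowed to `unknown` here purely as a stylistic choice.

```typescript
// Hypothetical type names; the fields mirror the schema above.
type ChangeStatus = 'new' | 'same' | 'changed' | 'removed';
type Visibility = 'visible' | 'hidden';

interface ChangeTracking {
  previousScrapeAt: string | null; // ISO datetime of last scrape
  changeStatus: ChangeStatus;
  visibility: Visibility;
  diff: string | null;             // Text diff if changed
  json: unknown;                   // Structured JSON data
}

interface AdvancedScraperOutput {
  markdown: string;
  html: string | null;
  metadata: unknown;
  changeTracking: ChangeTracking;
}

const CHANGE_STATUSES: string[] = ['new', 'same', 'changed', 'removed'];
const VISIBILITIES: string[] = ['visible', 'hidden'];

function isStringOrNull(value: unknown): value is string | null {
  return value === null || typeof value === 'string';
}

// Runtime check that an untyped payload has the documented shape.
function isAdvancedScraperOutput(value: unknown): value is AdvancedScraperOutput {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  const markdown = v['markdown'];
  const tracking = v['changeTracking'];
  if (typeof markdown !== 'string' || !isStringOrNull(v['html'])) return false;
  if (typeof tracking !== 'object' || tracking === null) return false;
  const c = tracking as Record<string, unknown>;
  const status = c['changeStatus'];
  const visibility = c['visibility'];
  return (
    isStringOrNull(c['previousScrapeAt']) &&
    typeof status === 'string' &&
    CHANGE_STATUSES.indexOf(status) !== -1 &&
    typeof visibility === 'string' &&
    VISIBILITIES.indexOf(visibility) !== -1 &&
    isStringOrNull(c['diff'])
  );
}
```

A guard like this lets later workflow steps rely on the typed fields instead of re-checking each one.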

Use Cases

Basic Scraper Uses

Quick Analysis

Check website structure and content
Verify HTML elements and tags
Debug web page issues
Extract simple text content

Research Tasks

Read article content quickly
Check meta tags and SEO elements
Analyze page structure
Get contact information

Example Commands

Basic Scraper (Chat Interface)

Basic HTML Extraction

@web_scraper Get HTML from https://company.com/about

Output: Raw HTML content that you can analyze in the chat interface.

Advanced Scraper (Workflows Only)

Structured Data Extraction

The Advanced Scraper in workflows provides comprehensive output with change tracking:

Example Advanced Output

{
  "markdown": "# Product Page\n\n**Price:** $99.99\n**Availability:** In Stock\n\n## Description\nHigh-quality wireless headphones...",
  "html": "<html><head><title>Product</title></head>...",
  "metadata": {
    "title": "Wireless Headphones - TechStore",
    "description": "Premium wireless headphones with noise cancellation",
    "keywords": ["headphones", "wireless", "audio"],
    "author": "TechStore",
    "publishedDate": "2025-09-26T10:00:00Z"
  },
  "changeTracking": {
    "previousScrapeAt": "2025-09-25T10:00:00Z",
    "changeStatus": "changed", 
    "visibility": "visible",
    "diff": "Price changed from $109.99 to $99.99",
    "json": {
      "price": "$99.99",
      "availability": "In Stock",
      "rating": 4.5,
      "reviews": 1247
    }
  }
}
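A workflow step that receives output like the example above might branch on `changeStatus` to decide what to report. The following is a minimal sketch under assumptions: `TrackedScrape` and `describeChange` are hypothetical names, only the fields used are declared, and the wording attached to each status is an interpretation of the enum values rather than documented behavior.

```typescript
// Minimal local type; only the fields used below are declared.
interface TrackedScrape {
  metadata: { title?: string };
  changeTracking: {
    previousScrapeAt: string | null;
    changeStatus: 'new' | 'same' | 'changed' | 'removed';
    diff: string | null;
  };
}

// Turn a scrape result into a one-line summary, e.g. for a notification step.
function describeChange(result: TrackedScrape): string {
  const title = result.metadata.title || 'Untitled page';
  const tracking = result.changeTracking;
  const since = tracking.previousScrapeAt || 'an unknown time';
  switch (tracking.changeStatus) {
    case 'new':
      return `${title}: first scrape, nothing to compare against.`;
    case 'same':
      return `${title}: no change since ${since}.`;
    case 'changed':
      return `${title}: changed since ${since}. ${tracking.diff || 'No diff text available.'}`;
    case 'removed':
      return `${title}: page reported as removed.`;
  }
}

// Values taken from the example output above.
const exampleScrape: TrackedScrape = {
  metadata: { title: 'Wireless Headphones - TechStore' },
  changeTracking: {
    previousScrapeAt: '2025-09-25T10:00:00Z',
    changeStatus: 'changed',
    diff: 'Price changed from $109.99 to $99.99'
  }
};
const exampleSummary = describeChange(exampleScrape);
```

For the example data, `exampleSummary` is "Wireless Headphones - TechStore: changed since 2025-09-25T10:00:00Z. Price changed from $109.99 to $99.99".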

Data Extraction Comparison

Feature	Basic Scraper	Advanced Scraper
Availability	Chat interface only	Workflows only
Output Format	Raw HTML	HTML + Markdown + JSON
Change Tracking	❌ None	✅ Full tracking with diffs
Metadata	❌ Manual extraction	✅ Automatic extraction
Structured Data	❌ Manual parsing needed	✅ Auto-parsed JSON
Historical Data	❌ No history	✅ Previous scrape tracking
Integration	Manual analysis	Workflow automation

Both scrapers respect robots.txt files and implement rate limiting for responsible scraping.
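To illustrate what a robots.txt check involves, for example when reviewing a site by hand before scraping it, here is a deliberately simplified sketch. It is not how Dume.ai implements the check: the function names are hypothetical, only `Disallow` rules in the `User-agent: *` group are considered, and `Allow` rules and wildcard patterns are ignored.

```typescript
// Collect the Disallow path prefixes that apply to all crawlers ("User-agent: *").
function disallowedPrefixes(robotsTxt: string): string[] {
  const prefixes: string[] = [];
  let inWildcardGroup = false;
  let lastWasUserAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const colon = line.indexOf(':');
    if (colon === -1) continue; // blank or malformed line
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasUserAgent) inWildcardGroup = false;
      if (value === '*') inWildcardGroup = true;
      lastWasUserAgent = true;
    } else {
      lastWasUserAgent = false;
      // An empty Disallow value means "nothing is disallowed".
      if (field === 'disallow' && inWildcardGroup && value !== '') {
        prefixes.push(value);
      }
    }
  }
  return prefixes;
}

// True when no wildcard-group Disallow prefix matches the start of `path`.
function isPathAllowed(robotsTxt: string, path: string): boolean {
  for (const prefix of disallowedPrefixes(robotsTxt)) {
    if (path.indexOf(prefix) === 0) return false;
  }
  return true;
}

// Hypothetical robots.txt content for illustration.
const sampleRobots = [
  'User-agent: *',
  'Disallow: /admin/',
  'Disallow: /cart',
  '',
  'User-agent: BadBot',
  'Disallow: /'
].join('\n');
```

With `sampleRobots`, `/products` is allowed while `/admin/users` and `/cart` are not; the `Disallow: /` rule applies only to `BadBot` and is skipped.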

Scraping Strategies

Single Page Scraping

Extract data from a specific webpage:

@web_scraper Get all customer testimonials from https://company.com/testimonials
@web_scraper Extract FAQ questions and answers from https://support-site.com/faq

Best for: Specific data collection, one-time extractions, targeted research

Multi-Page Scraping

Collect data across multiple pages or sections:

@web_scraper Scrape all blog post titles and dates from https://blog.com (all pages)
@web_scraper Extract product catalog from https://ecommerce.com/categories/electronics

Best for: Comprehensive data collection, catalog building, large-scale research

Real-Time Monitoring

Track changes and updates on websites:

@web_scraper Monitor price changes on https://competitor.com/pricing-page
@web_scraper Track new job postings on https://company.com/careers

Best for: Competitive monitoring, market tracking, opportunity alerts

Data Mining

Extract insights and patterns from web data:

@web_scraper Analyze customer review sentiment from https://product-reviews.com
@web_scraper Extract trending topics from https://industry-forum.com

Best for: Market research, sentiment analysis, trend identification

Output Formats

Basic Scraper Output

Raw HTML

<!DOCTYPE html>
<html>
<head>
  <title>Product Page</title>
  <meta name="description" content="Product description">
</head>
<body>
  <h1>Wireless Headphones</h1>
  <p class="price">$99.99</p>
  <p class="stock">In Stock</p>
</body>
</html>

Format: Raw HTML content only
Analysis: Manual parsing required
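Manual parsing of the raw HTML could look like the sketch below, which pulls a few fields out of the sample page with regular expressions. The function and type names are hypothetical, and the patterns assume the exact markup shown above; regex-based extraction is brittle, so a real HTML parser is usually a better fit for arbitrary pages.

```typescript
interface ProductSnippet {
  title: string | null;
  description: string | null;
  price: string | null;
  stock: string | null;
}

// Return the first capture group of `pattern` in `html`, trimmed, or null if absent.
function firstMatch(html: string, pattern: RegExp): string | null {
  const match = pattern.exec(html);
  const value = match ? match[1] : undefined;
  return value === undefined ? null : value.trim();
}

// Extract the fields present in the sample page; missing fields come back as null.
function parseProductHtml(html: string): ProductSnippet {
  return {
    title: firstMatch(html, /<title>([^<]*)<\/title>/i),
    description: firstMatch(html, /<meta\s+name="description"\s+content="([^"]*)"/i),
    price: firstMatch(html, /<p\s+class="price">([^<]*)<\/p>/i),
    stock: firstMatch(html, /<p\s+class="stock">([^<]*)<\/p>/i)
  };
}

// The sample HTML from above, joined into one string.
const sampleHtml = [
  '<!DOCTYPE html>',
  '<html>',
  '<head>',
  '  <title>Product Page</title>',
  '  <meta name="description" content="Product description">',
  '</head>',
  '<body>',
  '  <h1>Wireless Headphones</h1>',
  '  <p class="price">$99.99</p>',
  '  <p class="stock">In Stock</p>',
  '</body>',
  '</html>'
].join('\n');

const parsedSample = parseProductHtml(sampleHtml);
```

For the sample page, `parsedSample` holds the title "Product Page", the description "Product description", the price "$99.99", and the stock status "In Stock".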

Best Practices

Scraper Selection Guide

Basic Scraper: Quick checks, research, manual analysis
Advanced Scraper: Automation, monitoring, data pipelines
URLs: Always provide complete, valid URLs
Specificity: Be clear about what data you need
Rate Limiting: Respect website performance

Responsible Scraping

Check Terms of Service

Review the website’s terms of service and robots.txt file before scraping.

Respect Rate Limits

Allow reasonable delays between requests to avoid overloading servers.

Use Appropriate Data

Only extract publicly available information for legitimate business purposes.

Store Data Securely

Ensure extracted data is stored and handled according to privacy regulations.
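For the "Respect Rate Limits" step, the spacing between requests can be reduced to a small calculation. This is a sketch under assumptions: the function names and the two-second interval are illustrative and do not describe Dume.ai's built-in rate limiting.

```typescript
// Milliseconds to wait before the next request to the same site may be sent.
function waitBeforeNextRequest(
  lastRequestAtMs: number | null,
  nowMs: number,
  minIntervalMs: number
): number {
  if (lastRequestAtMs === null) return 0; // first request: no wait needed
  const elapsed = nowMs - lastRequestAtMs;
  return Math.max(0, minIntervalMs - elapsed);
}

interface PlannedRequest {
  url: string;
  startAfterMs: number; // offset from the start of the batch
}

// Space a batch of URLs so consecutive requests are at least `minIntervalMs` apart.
function planSchedule(urls: string[], minIntervalMs: number): PlannedRequest[] {
  return urls.map((url, index) => ({ url, startAfterMs: index * minIntervalMs }));
}

// Illustrative batch with an assumed two-second minimum interval.
const plannedBatch = planSchedule(
  [
    'https://example-store.com/products',
    'https://example-store.com/products?page=2',
    'https://example-store.com/products?page=3'
  ],
  2000
);
```

Here the three requests are planned at offsets of 0, 2000, and 4000 milliseconds. A longer interval is the safer default when a site publishes no guidance.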

Common Applications

@web_scraper Monitor competitor prices and product availability on https://competitor-store.com/category/electronics

{
  "extraction_results": {
    "url": "https://competitor-store.com/category/electronics",
    "timestamp": "2025-09-26T10:30:00Z",
    "data": [
      {
        "product_name": "Wireless Headphones",
        "price": "$129.99",
        "brand": "TechBrand",
        "rating": "4.5/5",
        "reviews_count": 1247,
        "availability": "In Stock"
      }
    ],
    "total_items": 156
  }
}
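Results shaped like the sample above can be filtered in a later step. The sketch below assumes the field names from that sample and US-style price strings; `ExtractedProduct`, `parsePrice`, and `inStockUnder` are hypothetical helpers, not part of the tool.

```typescript
// Only the fields used below; names follow the sample output above.
interface ExtractedProduct {
  product_name: string;
  price: string;        // e.g. "$129.99"
  availability: string; // e.g. "In Stock"
}

// Convert a price string such as "$1,299.99" to a number; NaN if nothing parseable remains.
function parsePrice(price: string): number {
  return parseFloat(price.replace(/[^0-9.]/g, ''));
}

// Names of in-stock products priced at or below `maxPrice`.
function inStockUnder(products: ExtractedProduct[], maxPrice: number): string[] {
  return products
    .filter((p) => p.availability === 'In Stock' && parsePrice(p.price) <= maxPrice)
    .map((p) => p.product_name);
}

// The single item from the sample output.
const sampleProducts: ExtractedProduct[] = [
  { product_name: 'Wireless Headphones', price: '$129.99', availability: 'In Stock' }
];
const affordable = inStockUnder(sampleProducts, 150);
```

With a limit of 150, `affordable` contains "Wireless Headphones"; with a limit of 100 it would be empty.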

Legal & Ethical Considerations

Always ensure compliance with website terms of service, data protection regulations (GDPR, CCPA), and copyright laws when scraping web content.

Recommended Guidelines:

✅ Scrape publicly available information only
✅ Respect robots.txt and rate limiting
✅ Use for legitimate business research and analysis
✅ Comply with data protection regulations
✅ Attribute sources when required

Avoid These Practices:

❌ Scraping private or protected content
❌ Overloading servers with excessive requests
❌ Violating copyright or intellectual property
❌ Collecting personal data without consent
❌ Bypassing authentication or access controls

Use web scraping as a powerful tool for legitimate business intelligence while respecting digital property rights and privacy.
