Sosse

Build a searchable web archive. This open-source crawler indexes dynamic pages, downloads files, and integrates with external services via powerful webhooks.

Sosse is an open-source search engine and crawler that you can host yourself. It is designed to index, archive, and search websites, with a focus on modern, dynamic pages that rely heavily on JavaScript. Because it uses browser-based crawling, it can capture JavaScript-rendered content that simpler tools often miss, producing a searchable archive of web content for research or data projects.

Its capabilities make it a versatile tool for a wide range of tasks:

  • Advanced Web Crawling: Schedule recurring crawls at fixed intervals or let the tool adapt the frequency based on how often a page's content changes. It can even authenticate to access private pages.
  • Comprehensive Archiving: Save the full HTML content of pages, automatically rewriting links for local use and downloading the assets each page needs.
  • Powerful Search: Perform full-text searches across all your archived content with advanced queries.
  • Flexible Integration: Use webhooks to connect with external services, such as AI platforms for data extraction, summarization, or auto-tagging.
  • Content Organization: Apply tags to crawled pages to easily filter and manage your archive.
  • Feed Generation: Create Atom feeds for any website, allowing you to get updates when new pages matching your keywords are found.
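
As an illustration of the webhook integration mentioned above, the following is a minimal sketch of a receiving service, written with Python's standard library only. The payload keys (`url`, `title`, `content`), the port, and the names `summarize_page`, `WebhookHandler`, and `run` are assumptions made for this example; they are not taken from Sosse's documentation, so check the actual payload format before relying on it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_page(payload: dict) -> dict:
    """Reduce a crawled-page payload to a small summary record.

    The "url", "title" and "content" keys are hypothetical field names
    chosen for this sketch, not a documented Sosse payload format.
    """
    content = payload.get("content", "")
    return {
        "url": payload.get("url", ""),
        "title": payload.get("title", ""),
        "word_count": len(content.split()),
    }


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept a JSON POST and reply with the summary as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "Invalid JSON")
            return
        if not isinstance(payload, dict):
            self.send_error(400, "Expected a JSON object")
            return
        body = json.dumps(summarize_page(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(host: str = "127.0.0.1", port: int = 9000) -> None:
    """Serve until interrupted; host and port are arbitrary example values."""
    HTTPServer((host, port), WebhookHandler).serve_forever()
```

Calling `run()` starts the listener. In a real setup, the summary step would be replaced by whatever the external service does, such as tagging or summarization.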

Directory Structure

sosse/
├── data/
├── .env
└── docker-compose.yml

docker-compose.yml

services:
  sosse:
    image: biolds/sosse:1.14
    container_name: sosse
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - ./data:/app/data
    environment:
      - ADMIN_PASSWORD=${ADMIN_PASSWORD}

.env

ADMIN_PASSWORD=your_super_secret_password
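
The value shown in `.env` is a placeholder and should be replaced before deployment. One possible way to produce a random value is sketched below with Python's standard `secrets` module. The function name `make_env_line` and the 32-byte length are illustrative choices, not requirements of Sosse.

```python
import secrets


def make_env_line(num_bytes: int = 32) -> str:
    """Return an ADMIN_PASSWORD line with a random URL-safe value."""
    # token_urlsafe only emits letters, digits, '-' and '_', so the value
    # contains no '$' or quote characters that a .env file could misread.
    return f"ADMIN_PASSWORD={secrets.token_urlsafe(num_bytes)}"


print(make_env_line())
```

The printed line can be pasted into `.env` in place of the placeholder.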