Scraperr
Self-hosted webscraper.
**Scraperr** is a self-hosted web scraper. The project is written primarily in TypeScript, distributed under the MIT License, and was first published in 2024. It has 4,898 stars and 241 forks on GitHub. Key topics include docker, helm, kubernetes, opensource, and playwright.
A powerful self-hosted web scraping solution
<div> <img src="https://img.shields.io/badge/MongoDB-%234ea94b.svg?style=for-the-badge&logo=mongodb&logoColor=white" alt="MongoDB" /> <img src="https://img.shields.io/badge/FastAPI-005571?style=for-the-badge&logo=fastapi" alt="FastAPI" /> <img src="https://img.shields.io/badge/Next-black?style=for-the-badge&logo=next.js&logoColor=white" alt="Next JS" /> <img src="https://img.shields.io/badge/tailwindcss-%2338B2AC.svg?style=for-the-badge&logo=tailwind-css&logoColor=white" alt="TailwindCSS" /> </div>
📋 Overview
Scrape websites without writing a single line of code.
<div align="center"> <img src="https://github.com/jaypyles/www-scrape/blob/master/docs/main_page.png" alt="Scraperr Main Interface" width="800px"> </div>
📚 Check out the docs for a comprehensive quickstart guide and detailed information.
✨ Key Features
- XPath-Based Extraction: Precisely target page elements
- Queue Management: Submit and manage multiple scraping jobs
- Domain Spidering: Option to scrape all pages within the same domain
- Custom Headers: Add JSON headers to your scraping requests
- Media Downloads: Automatically download images, videos, and other media
- Results Visualization: View scraped data in a structured table format
- Data Export: Export your results in Markdown and CSV formats
- Notification Channels: Send completion notifications through various channels
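For readers new to XPath, the following is a minimal sketch of what an XPath selector does. It is not Scraperr's own code: the sample markup, the `extract_text` helper, and the selectors are all hypothetical, and it uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """\
<html>
  <body>
    <div class="product">
      <h2 class="title">Widget</h2>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <h2 class="title">Gadget</h2>
      <span class="price">19.50</span>
    </div>
  </body>
</html>
"""


def extract_text(markup: str, xpath: str) -> list[str]:
    """Return the stripped text of every element matched by the XPath expression."""
    root = ET.fromstring(markup)
    return [(element.text or "").strip() for element in root.findall(xpath)]


# Select every <h2> whose class attribute is "title".
titles = extract_text(SAMPLE_PAGE, ".//h2[@class='title']")

# Select price spans that are direct children of a product div.
prices = extract_text(SAMPLE_PAGE, ".//div[@class='product']/span[@class='price']")

print(titles)
print(prices)
```

The same kind of expression, entered in the Scraperr interface, identifies which page elements a job should capture. Real-world pages are rarely well-formed XML, so a full XPath engine or a browser-based tool is usually needed in practice.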
🚀 Getting Started
Docker
```bash
make up
```
Helm
Refer to the docs for helm deployment: https://scraperr-docs.pages.dev/guides/helm-deployment
⚖️ Legal and Ethical Guidelines
When using Scraperr, please remember to:
- Respect `robots.txt`: Always check a website's `robots.txt` file to verify which pages permit scraping
- Terms of Service: Adhere to each website's Terms of Service regarding data extraction
- Rate Limiting: Implement reasonable delays between requests to avoid overloading servers
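The first and last guidelines can be checked programmatically before submitting a job. The sketch below is independent of Scraperr and uses Python's standard-library `urllib.robotparser`. The `robots.txt` content, the `example-bot` user agent, the `example.com` URLs, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch it from the target site.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str, default_seconds: float = 1.0) -> float:
    """Return the site's Crawl-delay for this agent, or a default if none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


parser = build_parser(EXAMPLE_ROBOTS)

# Paths under /private/ are disallowed for every user agent in this example.
allowed_public = parser.can_fetch("example-bot", "https://example.com/public/page")
allowed_private = parser.can_fetch("example-bot", "https://example.com/private/data")

# Number of seconds to wait between requests to this site.
delay_seconds = polite_delay(parser, "example-bot")

print(allowed_public, allowed_private, delay_seconds)
```

A scraping loop would skip any URL for which `can_fetch` returns `False` and call `time.sleep(delay_seconds)` between requests. Passing `robots.txt` does not replace reading the site's Terms of Service.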
Disclaimer: Scraperr is intended for use only on websites that explicitly permit scraping. The creator accepts no responsibility for misuse of this tool.
💬 Join the Community
Get support, report bugs, and chat with other users and contributors.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
👏 Contributions
Development made easier with the webapp template.
To get started, run `make build up-dev`.
