[Crawler Resources] A roundup of major crawler resources: building our own awesome series

Published: 2019-06-23

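The rise of big data has, to some extent, driven the rise of web crawlers. Some companies do not produce data themselves, so they can only crawl it from the web. I have followed this area for some time and have written several series on crawlers. I have now collected the good frameworks here, hoping to provide an index for anyone who is interested in crawlers or wants to study them further. You are welcome to star, fork, or open an issue at any time, so that we can improve this awesome series together.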

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages.

Python

- A fast high-level screen scraping and web crawling framework.

- A powerful spider system.

- A distributed crawling framework.

- PyQuery-based scraping micro-framework.

- Universal feed parser.

- Site scraping framework.

- A Python library for automating interaction with websites.

- Visual scraping for Scrapy.

- Pythonic crawling / scraping framework based on non-blocking I/O operations.

- A simple, Pythonic library for browsing the web without a standalone web browser.

- A simple, easy spider using gevent and JS rendering.

Java

- Highly extensible, highly scalable web crawler for production environments.

- Simple and lightweight web crawler.

- Scrapes, parses, manipulates and cleans HTML.

- Website-Specific Processors for HTML INformation eXtraction.

- A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.

- An easy-to-use lightweight web crawler.

- Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.

- A scalable crawler framework.

- Extensible, web-scale, archival-quality web crawler project.

- An agile, distributed crawler framework.

C#

- Built with C# 3.5. It contains a simple extension of a web content categorizer, which can separate web pages depending on their content.