[Crawler Resources] A roundup of crawler resources: building our own awesome series
Published: 2019-06-23


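  The rise of big data has, to some extent, driven the rise of web crawlers. Some companies do not produce data themselves, so they have to crawl it from the web. I have followed this area for some time and have written several series on crawlers. Here I am collecting good frameworks as an index for anyone who is interested in crawlers or wants to study them further. You are welcome to star, fork, or open an issue at any time so that we can improve this awesome series together.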

Awesome-crawler

A collection of awesome web crawlers, spiders, and resources in different languages.

Python

  • - A fast high-level screen scraping and web crawling framework.
  • - A powerful spider system.
  • - A distributed crawling framework.
  • - PyQuery-based scraping micro-framework.
  • - Universal feed parser.
  • - Site scraping framework.
  • - A Python library for automating interaction with websites.
  • - Visual scraping for Scrapy.
  • - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
  • - A simple, Pythonic library for browsing the web without a standalone web browser.
  • - A simple, easy spider using gevent and JS rendering.

Java

  • - Highly extensible, highly scalable web crawler for production environment.
  • - Simple and lightweight web crawler.
  • - Scrapes, parses, manipulates and cleans HTML.
  • - Website-Specific Processors for HTML INformation eXtraction.
  • - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
  • - An easy-to-use lightweight web crawler.
  • - Simple interfaces for crawling the Web; you can set up a multi-threaded web crawler in less than 5 minutes.
  • - A scalable crawler framework.
  • - Extensible, web-scale, archival-quality web crawler project.
  • - An agile, distributed crawler framework.

C#

  • - Built with C# 3.5. It contains a simple extension, a web content categorizer, which can separate web pages depending on their content.
  • - Simple spider based on multithreading and regular expressions.
  • - C# web crawler built for speed and flexibility.
  • - Advanced Crawler and ETL tool written in C#/WPF.

JavaScript

  • - Event driven web crawler.
  • - Node-crawler has a clean, simple API.
  • - Web crawler for Node.js; both HTTP and HTTPS are supported.

PHP

  • - A screen scraping and web crawling library for PHP.
    • - Laravel 5 Facade for Goutte.
  • - The DomCrawler component eases DOM navigation for HTML and XML documents.
  • - Parallel web crawler written in PHP.
  • - A configurable and extensible PHP web spider.

C++

  • - A distributed open source search engine and spider/crawler written in C/C++.

Ruby

  • - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
  • - RubyRetriever is a Web Crawler, Scraper & File Harvester.

Go

  • - Polite, slim and concurrent web crawler.
  • - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
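To make "follows the robots.txt policies and crawl delays" concrete, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. It is not taken from any project in this list. The robots.txt text, the `example-bot` user agent, the `example.com` URLs, and the helper names `build_parser` and `may_fetch` are all assumptions made for illustration. The rules are parsed from an in-memory string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of fetching a real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
# Hypothetical user agent and URLs, for illustration only.
print(may_fetch(parser, "example-bot", "https://example.com/index.html"))
print(may_fetch(parser, "example-bot", "https://example.com/private/data"))
# Number of seconds the site asks crawlers to wait between requests, or None.
print(parser.crawl_delay("example-bot"))
```

A polite crawler would call something like `may_fetch` before every request and sleep for the reported crawl delay between requests to the same host.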

Scala

  • - Scala DSL for web crawling.
  • - Scala crawler (spider) framework, inspired by Scrapy.
  • - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.

This list is still being updated. For the latest resources, see the Git repository:

Reposted from: https://www.cnblogs.com/codefish/p/5947165.html
