Highly Concurrent Scraping of HTML Images with Golang

待兔

All rights reserved. When reposting, please credit: http://www.lenggirl.com/language/go-picture.html

Preparation

1. Install Golang

2. Download the crawler packages

go get -v github.com/hunterhug/marmot/expert
go get -v github.com/hunterhug/marmot/miner
go get -v github.com/hunterhug/parrot/util 

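Program

This program can only fetch images that appear in the HTML as src="http...": the URL must carry the http(s) scheme. Images referenced in other ways, such as through data-src or obfuscated inside JavaScript, cannot be fetched.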

See: https://github.com/hunterhug/marmot/blob/master/example/lesson/lesson6.go

package main

import (
    "errors"
    "fmt"
    "net/url"
    "strings"

    "github.com/hunterhug/marmot/expert"
    "github.com/hunterhug/marmot/miner"
    "github.com/hunterhug/parrot/util"
)

// MinerNum is the number of workers that download pictures concurrently.
var MinerNum = 5

// ProxyAddress is the proxy used by every worker; leave it nil for a direct connection.
var ProxyAddress interface{}

func main() {
    // You can Proxy!
    // ProxyAddress = "socks5://127.0.0.1:1080"

    fmt.Println(`Welcome: Input "url" and picture keep "dir"`)
    for {
        fmt.Println("---------------------------------------------")
        url := util.Input(`URL(Like: "http://publicdomainarchive.com")`, "http://publicdomainarchive.com")
        dir := util.Input(`DIR(Default: "./picture")`, "./picture")
        fmt.Printf("You will keep %s picture in dir %s\n", url, dir)
        fmt.Println("---------------------------------------------")

        // Start Catch
        err := CatchPicture(url, dir)
        if err != nil {
            fmt.Println("Error:" + err.Error())
        }
    }
}

// CatchPicture downloads every picture found in the page at picture_url into dir.
func CatchPicture(picture_url string, dir string) error {
    // Check valid
    _, err := url.Parse(picture_url)
    if err != nil {
        return err
    }

    // Make dir!
    err = util.MakeDir(dir)
    if err != nil {
        return err
    }

    // Create a worker to fetch the page
    worker, err := miner.New(ProxyAddress)
    if err != nil {
        return err
    }

    result, err := worker.SetUrl(picture_url).SetUa(miner.RandomUa()).Get()
    if err != nil {
        return err
    }

    // Find all picture
    pictures := expert.FindPicture(string(result))

    // Empty, What a pity!
    if len(pictures) == 0 {
        return errors.New("empty")
    }

    // Divide the pictures among several workers
    xxx, _ := util.DevideStringList(pictures, MinerNum)

    // Buffered channel on which every picture reports that it is done
    chs := make(chan int, len(pictures))

    // Go at the same time
    for num, imgs := range xxx {

        // Get pool miner
        worker_picture, ok := miner.Pool.Get(util.IS(num))
        if !ok {
            // No? set one!
            worker_temp, _ := miner.New(ProxyAddress)
            worker_picture = worker_temp
            worker_temp.SetUa(miner.RandomUa())
            miner.Pool.Set(util.IS(num), worker_temp)
        }

        // Go save picture!
        go func(imgs []string, worker *miner.Worker, num int) {
            for _, img := range imgs {

                // Skip unparsable URLs, but still report so the final count matches
                _, err := url.Parse(img)
                if err != nil {
                    chs <- 0
                    continue
                }

                // Change Name of our picture
                filename := strings.Replace(util.ValidFileName(img), "#", "_", -1)

                // Exist?
                if util.FileExist(dir + "/" + filename) {
                    fmt.Println("File Exist:" + dir + "/" + filename)
                    chs <- 0
                } else {

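                    // Not there yet: download it
                    imgsrc, e := worker.SetUrl(img).Get()
                    if e != nil {
                        fmt.Println("Download " + img + " error:" + e.Error())
                        chs <- 0
                        // Keep going: returning here would leave the rest of this
                        // batch unreported and block the receive loop forever
                        continue
                    }

                    // Save it!
                    e = util.SaveToFile(dir+"/"+filename, imgsrc)
                    if e != nil {
                        fmt.Println("Save " + filename + " error:" + e.Error())
                        chs <- 0
                        continue
                    }
                    fmt.Printf("SP%d: Keep in %s/%s\n", num, dir, filename)
                    chs <- 1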
                }
            }
        }(imgs, worker_picture, num)
    }

    // Every picture should return
    for i := 0; i < len(pictures); i++ {
        <-chs
    }

    return nil
} 

The comments in the code explain each step. Running it produces:

superpika@superpika-chen-110:~/code/src/github.com/hunterhug/marmot/example/lesson$ go run lesson6.go 

        Welcome: Input "url" and picture keep "dir"



---------------------------------------------
URL(Like: "http://publicdomainarchive.com")

DIR(Default: "./picture")

You will keep http://publicdomainarchive.com picture in dir ./picture
---------------------------------------------
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_modern.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_google_dark.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-003-667x1000-192684_667x675.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_powered-by-wp-engine.png
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_divi.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-002-1000x667.jpg
SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_public-domain-mark.png
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos008-1000x625.jpg
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_09_03_Weekly.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-054-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_10_03_instagram_dark.png
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_vintage.jpg
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-001-1000x667.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-070-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_twitter02_dark.png
SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos001-1000x750-167066_1000x675.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-035-1000x667.jpg
SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_facebook_dark.png
SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-060-1000x667.jpg
SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-013-1000x667.jpg
---------------------------------------------
URL(Like: "http://publicdomainarchive.com")


