
Quickly configure your "little crawler" with WebMagic

冬天只爱早晨
2017.10.16 22:05 · 440 words

Preface

It has been a month since my last post. I spent that time optimizing a feature module and was leaving work late every night, so nothing new got published.

After the National Day holiday I took over several crawlers: some I wrote myself to scrape job listings from the three BAT companies, plus crawlers for the three big job boards. The latter had already been written with WebMagic, so I only had to maintain them, which was fairly relaxed overall. The exception was the last one, Liepin, which has to be scraped through proxy IPs; I spent some time finding a reasonably good free proxy IP site. Below I've extracted the WebMagic setup so it can be reused quickly next time.

A brief introduction to WebMagic

http://webmagic.io/docs/zh/ is the official Chinese documentation. It is quite thorough, so I won't repeat it here; I only want to highlight a few points:

  • It suits most list-plus-detail sites, e.g. a CSDN post list and its corresponding article pages
  • You need some basic regular expressions and XPath (if you don't know XPath, a Chrome XPath plugin is also a decent option)
  • Don't rely too heavily on this one framework
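As a quick illustration of the regular-expression skill the second point refers to, here is a JDK-only sketch that pulls a salary range out of a fragment of page text (the input string is made up for illustration; on a real page you would usually narrow down to the fragment with XPath first):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Made-up fragment of a job page's text
        String text = "职位月薪:8001-10000元/月 工作地点:北京";
        // Capture the lower and upper bounds of the monthly salary range
        Pattern p = Pattern.compile("(\\d+)-(\\d+)元/月");
        Matcher m = p.matcher(text);
        if (m.find()) {
            System.out.println(m.group(1) + ".." + m.group(2)); // prints 8001..10000
        }
    }
}
```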

A simple configuration for scraping Zhaopin job listings

Write a model class and annotate each field with @ExtractBy, using XPath expressions to populate the properties:

import org.apache.commons.codec.digest.DigestUtils;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.model.AfterExtractor;
import us.codecraft.webmagic.model.annotation.ExtractBy;
import us.codecraft.webmagic.model.annotation.ExtractByUrl;
import us.codecraft.webmagic.model.annotation.HelpUrl;
import us.codecraft.webmagic.model.annotation.TargetUrl;

@TargetUrl({"http://jobs.zhaopin.com/*.htm?*"})
@HelpUrl({"http://sou.zhaopin.com/jobs/searchresult.ashx?*"})
public class ZhilianJobInfo implements AfterExtractor {

    @ExtractBy("//h1/text()")
    private String title = "";

    @ExtractBy("//html/body/div[6]/div[1]/ul/li[1]/strong/text()")
    private String salary = "";

    @ExtractBy("//html/body/div[5]/div[1]/div[1]/h2/a/text()")
    private String company = "";

    @ExtractBy("//html/body/div[6]/div[1]/div[1]/div/div[1]/allText()")
    private String description = "";

    private String source = "zhilian.com";

    @ExtractByUrl
    private String url = "";

    private String urlMd5 = "";

    @ExtractBy("//html/body/div[6]/div[1]/ul/li[2]/strong/a/text()")
    private String dizhi = "";

    @ExtractBy("//html/body/div[6]/div[1]/ul/li[5]/strong/text()")
    private String qualifications = "";

    @ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[3]/strong/a/text()")
    private String companycategory = "";

    @ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[1]/strong/text()")
    private String companyscale = "";

    @ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[2]/strong/text()")
    private String companytype = "";

    @ExtractBy("//html/body/div[6]/div[2]/div[1]/ul/li[4]/strong/text()")
    private String companyaddress;

    public String getTitle() {
        return this.title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getCompany() {
        return this.company;
    }

    public void setCompany(String company) {
        this.company = company;
    }

    public String getDescription() {
        return this.description;
    }

    public void setDescription(String description) {
        if (description != null)
            this.description = description;
    }

    public String getSource() {
        return this.source;
    }

    public void setSource(String source) {
        this.source = source;
    }

    public String getUrl() {
        return this.url;
    }

    public void setUrl(String url) {
        this.url = url;
        this.urlMd5 = DigestUtils.md5Hex(url);
    }

    public String getSalary() {
        return this.salary;
    }

    public void setSalary(String salary) {
        this.salary = salary;
    }

    public String getUrlMd5() {
        return this.urlMd5;
    }

    public void setUrlMd5(String urlMd5) {
        this.urlMd5 = urlMd5;
    }

    public String getDizhi() {
        return this.dizhi;
    }

    public void setDizhi(String dizhi) {
        this.dizhi = dizhi;
    }

    public String getQualifications() {
        return this.qualifications;
    }

    public void setQualifications(String qualifications) {
        this.qualifications = qualifications;
    }

    public String getCompanycategory() {
        return this.companycategory;
    }

    public void setCompanycategory(String companycategory) {
        this.companycategory = companycategory;
    }

    public String getCompanyscale() {
        return this.companyscale;
    }

    public void setCompanyscale(String companyscale) {
        this.companyscale = companyscale;
    }

    public String getCompanytype() {
        return this.companytype;
    }

    public void setCompanytype(String companytype) {
        this.companytype = companytype;
    }

    public String getCompanyaddress() {
        return this.companyaddress;
    }

    public void setCompanyaddress(String companyaddress) {
        this.companyaddress = companyaddress;
    }

    @Override
    public String toString() {
        return "JobInfo{title='" + this.title + '\''
                + ", salary='" + this.salary + '\''
                + ", company='" + this.company + '\''
                + ", description='" + this.description + '\''
                + ", source='" + this.source + '\''
                + ", url='" + this.url + '\'' + '}';
    }

    // AfterExtractor hook, called after field extraction; nothing extra needed here
    public void afterProcess(Page page) {
    }
}
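Note that setUrl above derives urlMd5 with commons-codec's DigestUtils.md5Hex for deduplication. If that dependency is not on the classpath, the same hex digest can be produced with the JDK alone; this is a sketch, and the helper name md5Hex here is mine:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Demo {
    // JDK-only equivalent of DigestUtils.md5Hex(String)
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be present in every JDK
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("http://jobs.zhaopin.com/example.htm"));
    }
}
```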

Next, implement the crawler's Pipeline, which decides how the extracted data is stored:

import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.model.PageModelPipeline;

public class ZhilianModelPipeline implements PageModelPipeline<ZhilianJobInfo> {

    @Override
    public void process(ZhilianJobInfo zhilianJobInfo, Task task) {
        // save info to db
        System.out.println(zhilianJobInfo);
    }
}
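The process method above only prints each record, and it will see duplicates whenever the same posting is reachable from several list pages. Before persisting, a small in-memory guard keyed on urlMd5 keeps the downstream store clean; this is a sketch of one possible approach, and the class and method names are mine:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DedupGuard {
    // Thread-safe set of urlMd5 values already persisted;
    // WebMagic runs pipelines on multiple threads, so a concurrent set is needed
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Returns true exactly once per key: the caller persists only when this is true
    public boolean firstTime(String urlMd5) {
        return seen.add(urlMd5);
    }

    public static void main(String[] args) {
        DedupGuard guard = new DedupGuard();
        System.out.println(guard.firstTime("abc123")); // true: persist it
        System.out.println(guard.firstTime("abc123")); // false: skip the duplicate
    }
}
```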

Finally, the crawler configuration and entry point (an IP proxy pool was added today):

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.model.OOSpider;
import us.codecraft.webmagic.proxy.Proxy;
import us.codecraft.webmagic.proxy.SimpleProxyProvider;

public class Crawler {
    public static void main(String[] args) {
        // IP proxy pool
        HttpClientDownloader httpClientDownloader = new HttpClientDownloader();
        try {
            List<Proxy> proxies = buildProxyIP();
            System.out.println("Fetched proxy IPs: " + proxies);
            httpClientDownloader.setProxyProvider(new SimpleProxyProvider(proxies));
        } catch (IOException e) {
            e.printStackTrace();
        }

        OOSpider.create(Site.me()
                .setSleepTime(5)
                .setRetrySleepTime(10)
                .setCycleRetryTimes(3),
                new ZhilianModelPipeline(),ZhilianJobInfo.class)
                .addUrl("http://sou.zhaopin.com/jobs/searchresult.ashx?jl=765&bj=7002000&sj=463")
                .thread(60)
                .setDownloader(httpClientDownloader)
                .run();
    }


    /**
     * A decent free proxy IP site: www.89ip.cn
     *
     * @return proxies parsed from the site's ip:port listing
     */
    private static List<Proxy> buildProxyIP() throws IOException {
        Document parse = Jsoup.parse(new URL("http://www.89ip.cn/tiqv.php?sxb=&tqsl=50&ports=&ktip=&xl=on&submit=%CC%E1++%C8%A1"), 5000);
        String pattern = "(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+):(\\d+)";
        Pattern r = Pattern.compile(pattern);
        Matcher m = r.matcher(parse.toString());
        List<Proxy> proxies = new ArrayList<Proxy>();
        while (m.find()) {
            String[] group = m.group().split(":");
            int port = Integer.parseInt(group[1]);
            proxies.add(new Proxy(group[0], port));
        }
        return proxies;
    }
}
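One caveat: buildProxyIP trusts every match it finds, but free proxy lists routinely contain malformed entries and out-of-range ports. A defensive parse is worth a few extra lines; this JDK-only sketch (independent of WebMagic's Proxy class, names are mine) shows the idea:

```java
import java.util.ArrayList;
import java.util.List;

public class ProxyParser {
    // Parses "ip:port" entries into {ip, port} pairs, silently skipping anything malformed
    static List<String[]> parse(List<String> raw) {
        List<String[]> out = new ArrayList<>();
        for (String entry : raw) {
            String[] parts = entry.split(":");
            if (parts.length != 2) continue;            // not ip:port at all
            try {
                int port = Integer.parseInt(parts[1].trim());
                if (port < 1 || port > 65535) continue; // port out of range
                out.add(new String[]{parts[0].trim(), String.valueOf(port)});
            } catch (NumberFormatException e) {
                // non-numeric port: skip the entry
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("1.2.3.4:8080", "garbage", "5.6.7.8:99999");
        System.out.println(parse(raw).size()); // prints 1: only the first entry survives
    }
}
```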

With that, you can start crawling in no time. The code is at https://github.com/vector4wang/webmagic-quick

That's it; another quick post.
