Main-Text Extraction: Fetching Web Pages with curl

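I have recently been writing a main-text (article body) extraction program in C++ on Linux. The overall pipeline is: fetch the page --> parse the page --> build a variant DOM tree --> run the main-text extraction algorithm --> produce structured output.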

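The first stage is done, and I am now debugging the second and third. Many pages on the Internet are written by "unlicensed" programmers and are far from standards-compliant, so the parser needs a good deal of fault tolerance, which takes time. I also knew little about character encodings before this, so converting between encodings during parsing is another hard problem for me. I expect to work through it as I keep learning, which will also fill a gap in my skills.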

Fetching a page can be done with the curl library (libcurl) and is quite simple. There are four main functions:

1. CURL *curl_easy_init();

This function must be the first function to call, and it returns a CURL easy handle that you must use as input to other easy-functions.

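In other words, call this function before anything else; the CURL pointer it returns is the easy handle that you pass to the other easy functions.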

CURL *curl;
curl = curl_easy_init();

2. CURLcode curl_easy_setopt(CURL *handle, CURLoption option, parameter);

curl_easy_setopt() is used to tell libcurl how to behave. By using the appropriate options to curl_easy_setopt, you can change libcurl’s behavior.

In other words, this function tells libcurl how to handle the request; setting different options produces different behavior.

CURLcode code;
code = curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, error);      /* use error as the buffer for error messages */
curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);                     /* nonzero makes libcurl print detailed information about the transfer */
code = curl_easy_setopt(curl, CURLOPT_URL, url);                 /* the URL to fetch */
code = curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);       /* nonzero makes libcurl follow any "Location: " header the server sends as part of the HTTP headers */
code = curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, writer);   /* writer is called for each response header received */
code = curl_easy_setopt(curl, CURLOPT_WRITEHEADER, &header);     /* last argument passed to the header callback (older name for CURLOPT_HEADERDATA) */
code = curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);    /* writer is called for each chunk of the response body */
code = curl_easy_setopt(curl, CURLOPT_WRITEDATA, &content);      /* last argument passed to the body callback */
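The snippet above registers a callback named writer but does not show its body. A minimal sketch of what such a callback might look like follows, assuming that header and content are both std::string objects (the post does not say what type they are). libcurl hands the callback a chunk of size * nmemb bytes and expects the number of bytes handled as the return value.

```cpp
#include <cstddef>
#include <string>

// Hypothetical body for the `writer` callback; the original post does not show it.
// `userdata` is whatever was registered with CURLOPT_WRITEDATA or
// CURLOPT_WRITEHEADER, assumed here to be a std::string*.
size_t writer(char *data, size_t size, size_t nmemb, void *userdata)
{
    if (userdata == nullptr) {
        // Returning a count different from the bytes received
        // makes libcurl abort the transfer.
        return 0;
    }
    std::string *buffer = static_cast<std::string *>(userdata);
    const size_t bytes = size * nmemb;
    buffer->append(data, bytes);  // chunks arrive in order, so appending rebuilds the stream
    return bytes;
}
```

Because the callback may be invoked many times for one response, it appends rather than assigns; the buffer holds the complete headers or body only after curl_easy_perform returns.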

3. CURLcode curl_easy_perform(CURL *handle);

This function is called after the init and all the curl_easy_setopt(3) calls are made, and will perform the transfer as described in the options.

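This function is called after the curl_easy_setopt calls and carries out the transfer according to the options that were set (adding HTTP header fields, invoking the response-header and response-body callbacks, and so on).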

code = curl_easy_perform(curl);
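Once curl_easy_perform returns, the buffer registered for the header callback holds the raw response headers. Since the page encoding matters for the parsing stage, one possible use of that buffer is to read the charset from the Content-Type header. The sketch below is only an illustration: charset_from_headers is a hypothetical helper, not part of libcurl, and it assumes the headers were collected into a std::string. Pages that declare their encoding only in an HTML meta tag would need separate handling.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: returns the lowercase charset named in the Content-Type
// response header, or an empty string if none is declared.
// `headers` is assumed to be the raw header text collected by the header callback.
std::string charset_from_headers(const std::string &headers)
{
    // Lowercase a copy so that the search is case-insensitive.
    std::string lower(headers);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // With redirects followed, several responses accumulate in the buffer;
    // rfind picks the Content-Type of the last one.
    const std::string::size_type line_start = lower.rfind("content-type:");
    if (line_start == std::string::npos) {
        return "";
    }

    // Restrict the search to the Content-Type line itself.
    std::string::size_type line_end = lower.find_first_of("\r\n", line_start);
    if (line_end == std::string::npos) {
        line_end = lower.size();
    }
    const std::string line = lower.substr(line_start, line_end - line_start);

    const std::string key = "charset=";
    std::string::size_type pos = line.find(key);
    if (pos == std::string::npos) {
        return "";
    }
    pos += key.size();

    // The value ends at ';', whitespace, or the end of the line.
    const std::string::size_type end = line.find_first_of("; \t", pos);
    std::string value = (end == std::string::npos) ? line.substr(pos)
                                                   : line.substr(pos, end - pos);

    // Strip optional quotes around the value.
    value.erase(std::remove(value.begin(), value.end(), '"'), value.end());
    return value;
}
```

The returned name could then be handed to whatever conversion routine the parser uses; an empty result means the header gave no hint.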

4. void curl_easy_cleanup(CURL *handle);

This function must be the last function to call for an easy session. It is the opposite of the curl_easy_init(3) function and must be called with the same handle as input that the curl_easy_init call returned.

That is, it must be the last call of every easy session. It is the counterpart of curl_easy_init and must be given the same handle that curl_easy_init returned.

curl_easy_cleanup(curl);

CURLcode curl_easy_getinfo(CURL *curl, CURLINFO info, ... );

Request internal information from the curl session with this function.

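This function retrieves internal information from a curl session. Call it after curl_easy_perform and before curl_easy_cleanup, because the handle is no longer valid once it has been cleaned up.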

long retcode = 0;  /* CURLINFO_RESPONSE_CODE expects a pointer to long */
code = curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &retcode);
if ( (code == CURLE_OK) && retcode == 200 )
{
    ....
}

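That is roughly how curl is used. The main-text extraction work itself is under way and the overall approach is already showing some results, but there is still work to do before it reaches a 100% extraction rate and 90% extraction accuracy.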

Still working on it...

Original link: https://www.cnblogs.com/geekma/archive/2012/08/15/2640270.html
