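Web Data Scraping Core Techniques Series (3): How Do You Break CAPTCHAs? Image Analysis? Feature Matching? Artificial Intelligence? Third-Party Integration? … Which Is the Most Powerful?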

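At the request of enthusiastic readers I have set up a QQ group: 254764602. Everyone is welcome to join, discuss, and learn from each other.

When applying, enter the passphrase “数据采集” (data scraping); requests without it will not be approved.

Let's get straight to the topic. This one is fairly big and fairly hard, so a single post probably won't cover it. I'll write one post first and follow up with a second based on your feedback.

As the saying goes, when the defence rises a foot, the attack must rise ten. With CAPTCHAs, making the defence a little stronger is easy, while building an attack that stays ahead of it is very hard. So this post discusses ordinary CAPTCHAs only and does not go deep into unusual or outright nasty ones.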

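An ordinary CAPTCHA is usually an image containing a few characters, plus a background colour, a foreground colour, stray dots (commonly called noise) and interference lines. The characters may be slanted, warped, touching, deformed, or even handwritten. Breaking it comes down to one sentence: remove the interference, simplify the features, match the features, read off the code.

I'm not writing a textbook and can't cover everything, so let's start with the simple cases and work from the figure below. As the figure shows, there are three ways to do the last step, guessing the code: simple image analysis plus feature matching; neural-network-based feature matching; and integrating a third-party Google component. More powerful approaches rely on integrating several third-party libraries (including C and C++ code) and are more complex. To keep things easy to follow, we start with the first approach.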

[Figure: flowchart of the CAPTCHA-breaking steps; the final recognition step branches into the three approaches listed above]

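Step 1: get the CAPTCHA image

In web data scraping, if you fetch the page source, extract the image URL, and then fetch the image, there should be no problem. If you instead load the page in a browser, read the CAPTCHA URL from it, and then request the image separately, you will run into trouble: CAPTCHA URLs usually generate a fresh image on every request, so the second request returns a different CAPTCHA from the one the browser displayed. If you are driving a browser, take the image directly from the browser.

Step 2: undo the distortion

For slanted or deformed characters, further processing is very difficult unless the distortion is undone first, so this step is mandatory. How to undo it depends on the kind of distortion; no single method cures everything, so we discuss it case by case. For a slant, for example, find the four corners of the character region, compute the slant of its edges, and then stretch the image in the opposite direction. (I did not study computer science or graphics algorithms; this is all personal experience, so please correct me where I'm wrong.)

Here is some code: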

// Fragment of a method body. "input" is the binarized CAPTCHA
// (black characters on a white background is assumed).
// Work on a copy so the input is returned unchanged when no slant is found.
Bitmap output = new Bitmap(input);
int x = input.Width;
int y = input.Height;
int startPointsCount = 10; // a row belongs to the text once it has more black points than this
int[] yBlackCount = new int[y];

Point leftTop = new Point(0, 0);
Point leftBottom = new Point(0, 0);
Point rightTop = new Point(0, 0);
Point rightBottom = new Point(0, 0);

// count the black points in every row
for (int j = 0; j < y; j++)
{
    for (int i = 0; i < x; i++)
    {
        if (input.GetPixel(i, j).R == 0)
        {
            yBlackCount[j]++;
        }
    }
}

for (int j = 1; j < y - 1; j++)
{
    // leftmost and rightmost black point of this row
    Point letterStart = new Point(0, 0);
    Point letterEnd = new Point(0, 0);
    for (int i = 1; i < x - 1; i++)
    {
        if (input.GetPixel(i, j).R == 0)
        {
            letterStart = new Point(i, j);
            break;
        }
    }
    for (int i = x - 2; i > 0; i--)
    {
        if (input.GetPixel(i, j).R == 0)
        {
            letterEnd = new Point(i, j);
            break;
        }
    }
    if (yBlackCount[j] > startPointsCount && yBlackCount[j + 1] > yBlackCount[j] && leftTop.Y == 0)
    {
        //top of letters
        leftTop = letterStart;
        rightTop = letterEnd;
    }
    if (leftTop.Y > 0 && yBlackCount[j + 1] < startPointsCount && yBlackCount[j] > yBlackCount[j + 1] && leftBottom.Y == 0)
    {
        //bottom of letters
        leftBottom = letterStart;
        rightBottom = letterEnd;
    }
}
if (leftTop.Y != 0 && leftBottom.Y != 0)
{
    // horizontal drift of the left and right edges, projected over the full image height
    int lDistince = ((leftBottom.X - leftTop.X) * y) / (leftBottom.Y - leftTop.Y);
    int rDistince = ((rightBottom.X - rightTop.X) * y) / (rightBottom.Y - rightTop.Y);
    // clamp the correction to at most 20 pixels either way
    lDistince = Math.Max(-20, Math.Min(20, lDistince));
    rDistince = Math.Max(-20, Math.Min(20, rDistince));

    using (Graphics g = Graphics.FromImage(output))
    {
        // start from a white canvas (white background is an assumption)
        g.Clear(Color.White);

        // map the original onto a parallelogram that leans the other way
        Point[] destinationPoints = {
            new Point(lDistince, 0),        // destination for upper-left point of original
            new Point(x + rDistince, 0),    // destination for upper-right point of original
            new Point(0, y)};               // destination for lower-left point of original
        g.DrawImage(input, destinationPoints);
    }
}

return output;
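The shear correction above can also be tried without GDI+. The following is a minimal sketch, not the author's code: it assumes the image is already a `bool[,]` grid (rows by columns, `true` meaning black), and the class and method names are made up for illustration. `EdgeOffset` repeats the integer arithmetic used for `lDistince` and `rDistince`; `Unshear` shifts each row in proportion to its distance from the top row, which is a simplified stand-in for the parallelogram drawing.

```csharp
using System;

// Hypothetical helper names; the bool[,] representation is an assumption.
public static class ShearSketch
{
    // Horizontal drift of an edge running from (topX, topY) to (bottomX, bottomY),
    // projected over "rows" row steps. Returns 0 for a degenerate (flat) edge.
    public static int EdgeOffset(int topX, int topY, int bottomX, int bottomY, int rows)
    {
        int dy = bottomY - topY;
        if (dy == 0) return 0;
        return (bottomX - topX) * rows / dy;
    }

    // Shifts every row of a binary image (true = black) so that an edge with the
    // given total drift becomes vertical. Row 0 stays in place; the last row is
    // moved by the full offset. Pixels pushed outside the image are dropped.
    public static bool[,] Unshear(bool[,] image, int offset)
    {
        int h = image.GetLength(0), w = image.GetLength(1);
        var result = new bool[h, w];
        for (int row = 0; row < h; row++)
        {
            int shift = h > 1 ? offset * row / (h - 1) : 0;
            for (int col = 0; col < w; col++)
            {
                int target = col - shift;
                if (image[row, col] && target >= 0 && target < w)
                    result[row, target] = true;
            }
        }
        return result;
    }
}
```

For a three-row image with a diagonal stroke at columns 1, 2, 3, calling `EdgeOffset(1, 0, 3, 2, 2)` and passing the result to `Unshear` moves all three pixels into column 1.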

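Other kinds of distortion are not covered in depth here; each has to be handled on its own terms.

3. Continuing with the simple CAPTCHA pipeline. Honestly, any single point in web data scraping could fill a series of its own. A powerful scraping system is not something one person builds in a month or two; you only appreciate how hard it is once you have actually built one and handed it to a customer to run. If every expert selflessly open-sourced their best solutions and source code, programmers' lives would be much easier. We are all in the same boat, so why not help each other. Sorry for the digression. Back to the topic: grayscale conversion. There is plenty of code for this online, but I'll paste mine for convenience. If you think code this simple isn't worth posting, I'll leave it out next time.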

// Weighted luminance: green contributes most, blue least.
protected static Color Gray(Color c)
{
    int rgb = Convert.ToInt32((double)(((0.3 * c.R) + (0.59 * c.G)) + (0.11 * c.B)));
    return Color.FromArgb(rgb, rgb, rgb);
}

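4. Convert to a black-and-white image, commonly called binarization. Steps 3 and 4 both exist to simplify the features and lay the groundwork for later processing. The key to binarization is choosing the threshold, that is, deciding which pixels count as black and which as white. Here is the code: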

            Bitmap output = new Bitmap(input.Width, input.Height);
            // ComputeThresholdValue is not shown in this post
            int tv = ComputeThresholdValue(input);
            int x = input.Width;
            int y = input.Height;
            int blackCount = 0;
            int whiteCount = 0;
            for (int i = 0; i < x; i++)
            {
                for (int j = 0; j < y; j++)
                {
                    //suppose the background is white, set the border to white
                    if (i == 0 || i == input.Width - 1 || j == 0 || j == input.Height - 1)
                    {
                        output.SetPixel(i, j, Color.White);
                        whiteCount++;
                        continue;
                    }
                    //white point, background
                    if (input.GetPixel(i, j).R >= tv)
                    {
                        output.SetPixel(i, j, Color.White);
                        whiteCount++;
                    }
                    //black point, char
                    else
                    {
                        output.SetPixel(i, j, Color.Black);
                        blackCount++;
                    }
                }
            }
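The post does not show how `ComputeThresholdValue` picks its value. One common choice is Otsu's method, sketched below; this is a swapped-in technique, not necessarily what the author used. The sketch assumes the grayscale pixels have already been copied into a `byte[]`, and the class and method names are hypothetical. The returned threshold follows the same convention as the code above: a pixel with value `>= t` is treated as white.

```csharp
using System;

// Hypothetical helper; Otsu's method is an assumption, not the author's ComputeThresholdValue.
public static class ThresholdSketch
{
    // Picks the threshold t that maximises the between-class variance of
    // the pixels below t and the pixels at or above t.
    // Returns 0 when the image has a single gray level (no split possible).
    public static int Otsu(byte[] grayPixels)
    {
        var histogram = new int[256];
        foreach (byte p in grayPixels) histogram[p]++;

        long total = grayPixels.Length;
        long sumAll = 0;
        for (int v = 0; v < 256; v++) sumAll += (long)v * histogram[v];

        long countBelow = 0, sumBelow = 0;
        double bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 1; t < 256; t++)
        {
            // move gray level t - 1 into the "below" class
            countBelow += histogram[t - 1];
            sumBelow += (long)(t - 1) * histogram[t - 1];
            long countAbove = total - countBelow;
            if (countBelow == 0 || countAbove == 0) continue;

            double meanBelow = (double)sumBelow / countBelow;
            double meanAbove = (double)(sumAll - sumBelow) / countAbove;
            double diff = meanBelow - meanAbove;
            // proportional to the between-class variance
            double variance = (double)countBelow * countAbove * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }
}
```

For an image whose pixels are either 10 or 200, any threshold between 11 and 200 separates the two groups equally well, and the sketch returns the first such value it meets.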

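5. Segmentation. The goal is to pick the individual characters out of the string, since the features of a single character are much simpler to handle. The principle is to locate the character boundaries and then cut the image. After the steps above, the image consists of black-and-white characters. Assuming a white background and black characters, counting the black pixels along the X and Y axes gives the character boundaries.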

        /// <summary>
        /// Splits the picture into single-character bitmaps.
        /// </summary>
        /// <param name="map">Binarized CAPTCHA image (black characters on white).</param>
        /// <returns>One 16x16 bitmap per detected character, left to right.</returns>
        public static List<Bitmap> Split(Bitmap map)
        {
            List<Bitmap> resultList = new List<Bitmap>();

            int x = map.Width;
            int y = map.Height;
            int maxNoisyWidth = 4; //a run narrower than 4 columns is treated as noise
            int maxNoisyCount = 4; //a row or column with fewer than 4 black points is treated as blank

            //black is char
            //black points count per column
            int[] xBlackCount = new int[x];
            for (int i = 0; i < x; i++)
            {
                for (int j = 0; j < y; j++)
                {
                    if (map.GetPixel(i, j).R == 0)
                    {
                        xBlackCount[i]++;
                    }
                }
            }
            //black points count per row
            int[] yBlackCount = new int[y];
            for (int j = 0; j < y; j++)
            {
                for (int i = 0; i < x; i++)
                {
                    if (map.GetPixel(i, j).R == 0)
                    {
                        yBlackCount[j]++;
                    }
                }
            }

            //split picture
            bool charFlag = false;
            int xStart = 0;
            int xEnd = 0;
            int yStart = 0;
            int yEnd = 0;
            for (int j = 0; j < yBlackCount.Length; j++)
            {
                if (yBlackCount[j] >= maxNoisyCount && charFlag == false)
                {
                    //start to scan the top of all char
                    yStart = j;
                    charFlag = true;
                }
                if (yBlackCount[j] < maxNoisyCount && charFlag == true)
                {
                    //end of scan the bottom of all char
                    yEnd = j;
                    charFlag = false;
                }
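                if (yStart != 0 && yEnd != 0)
                {
                    //got the top and bottom of all chars
                    break;
                }
            }
            //text that reaches the last row never hits the "end" branch above
            if (charFlag)
            {
                yEnd = y - 1;
                charFlag = false;
            }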
            for (int i = 0; i < xBlackCount.Length; i++)
            {
                if (xBlackCount[i] >= maxNoisyCount && charFlag == false)
                {
                    //start to scan a char
                    xStart = i;
                    charFlag = true;
                }
                if (xBlackCount[i] < maxNoisyCount && charFlag == true)
                {
                    //end of scan a char
                    xEnd = i;
                    charFlag = false;
                }
                if (xStart != 0 && xEnd != 0)
                {
                    //got the start and end of a char,check whether it's noise
                    if (xEnd - xStart < maxNoisyWidth)
                    {
                        //reset start and end
                        xStart = 0;
                        xEnd = 0;
                        continue;
                    }
                    //create new map for a char
                    Bitmap newMap = new Bitmap(xEnd - xStart + 1, yEnd - yStart + 1);
                    for (int ni = xStart; ni <= xEnd; ni++)
                    {
                        for (int nj = yStart; nj <= yEnd; nj++)
                        {
                            newMap.SetPixel(ni - xStart, nj - yStart, map.GetPixel(ni, nj));
                        }
                    }
                    newMap = new Bitmap(newMap, 16, 16);
                    resultList.Add(newMap);
                    //reset start and end
                    xStart = 0;
                    xEnd = 0;
                }
            }
            return resultList;
        }
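The column scan in `Split` can be isolated into a small, testable routine. The sketch below is an illustration under assumptions, not the author's code: it works on a `bool[,]` grid (rows by columns, `true` meaning black), the names are hypothetical, and it differs from `Split` in two ways. It uses `-1` instead of `0` as the "not inside a character" marker, so a character starting in column 0 is still found, and the reported end is the last column of the run rather than the first blank column after it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; the bool[,] representation is an assumption.
public static class ProjectionSketch
{
    // Returns the inclusive [Start, End] column range of every run of columns
    // whose black-pixel count reaches minCount. Runs narrower than minWidth
    // columns are dropped as noise.
    public static List<(int Start, int End)> ColumnRuns(bool[,] black, int minCount, int minWidth)
    {
        int h = black.GetLength(0), w = black.GetLength(1);

        // vertical projection: black pixels per column
        var counts = new int[w];
        for (int row = 0; row < h; row++)
            for (int col = 0; col < w; col++)
                if (black[row, col]) counts[col]++;

        var runs = new List<(int Start, int End)>();
        int start = -1; // -1 means "not inside a character"
        // iterate one past the last column so a run touching the right edge is closed
        for (int col = 0; col <= w; col++)
        {
            bool inChar = col < w && counts[col] >= minCount;
            if (inChar && start < 0)
            {
                start = col;
            }
            else if (!inChar && start >= 0)
            {
                if (col - start >= minWidth) runs.Add((start, col - 1));
                start = -1;
            }
        }
        return runs;
    }
}
```

Each returned range can then be cropped and resized to 16x16 as `Split` does.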

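6. After segmentation we have a set of images, one per character, and we compute features for each. The idea is to turn the image into a matrix first. Don't know what a matrix is? Look it up; it's worth it. Then use the power method to find the largest eigenvalue of the square matrix and its corresponding eigenvector. Don't know what a vector is either? Then I'd bet you're like me and never passed university maths. Haha.

Next, compare that vector with the vectors in our knowledge base and compute the distance to each. The knowledge base is a set of vectors, each mapped to a character, built by training the program by hand: you tell it that this vector is 2, that one is 3 (more on this later). The vector with the smallest distance has the most similar features, meaning the two characters look alike, so we treat them as the same character, and that gives the recognition result.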

input = new Bitmap(input, 16, 16);

            Double[,] doublemap = new Double[input.Width, input.Height];
            for (int i = 0; i < input.Width; i++)
            {
                for (int j = 0; j < input.Height; j++)
                {
                    if (input.GetPixel(i, j).R == 255)
                    {
                        doublemap[i, j] = Convert.ToDouble(1);
                    }
                    else
                    {
                        doublemap[i, j] = Convert.ToDouble(0);
                    }
                }
            }

            // MatrixLab, SampleVector and _metric are not .NET types and are not shown in this post
            Double[] W = new double[input.Width];
            Double max = 0;
            MatrixLab mat = new MatrixLab(input.Width, 0.001, doublemap);
            mat.returnResult(ref W, ref max);

            SampleVector vector = new SampleVector(W, "");
            Double minDistance = Double.MaxValue;
            SampleVector similarVector = null;
            // nearest neighbour: keep the trained sample closest to this character's vector
            foreach (SampleVector target in this._studyList)
            {
                double distance = _metric.Compute(vector, target);
                if (distance < minDistance)
                {
                    similarVector = target;
                    output = target.Code;
                    minDistance = distance;
                }
            }
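Since `MatrixLab` and the distance metric are not shown, here is a minimal sketch of the two ideas this step relies on: power iteration for the dominant eigenvector, and nearest-neighbour lookup by distance. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the distance is plain Euclidean distance, and the stopping rule (change in the eigenvalue estimate below a tolerance) may differ from whatever the `0.001` passed to `MatrixLab` means. Power iteration is also not guaranteed to converge for every matrix, so the loop is capped by `maxIterations`.

```csharp
using System;

// Hypothetical stand-ins for the author's MatrixLab and metric classes.
public static class PowerMethodSketch
{
    static double[] Multiply(double[,] m, double[] v)
    {
        int n = v.Length;
        var result = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                result[i] += m[i, j] * v[j];
        return result;
    }

    static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Power iteration: multiply a vector by the matrix and normalise, repeatedly.
    // The eigenvalue estimate is the Rayleigh quotient of the current unit vector.
    public static (double Value, double[] Vector) Dominant(double[,] m, double tolerance, int maxIterations)
    {
        int n = m.GetLength(0);
        var v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 1.0 / Math.Sqrt(n);
        double value = 0;
        for (int iter = 0; iter < maxIterations; iter++)
        {
            var next = Multiply(m, v);
            double norm = Math.Sqrt(Dot(next, next));
            if (norm == 0) break; // v was mapped to zero, nothing to normalise
            for (int i = 0; i < n; i++) next[i] /= norm;
            double nextValue = Dot(next, Multiply(m, next));
            bool converged = Math.Abs(nextValue - value) < tolerance;
            v = next;
            value = nextValue;
            if (converged) break;
        }
        return (value, v);
    }

    // Euclidean distance between two vectors of equal length.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.Sqrt(sum);
    }

    // Index of the sample closest to the query, or -1 if there are no samples.
    public static int Nearest(double[] query, double[][] samples)
    {
        int best = -1;
        double bestDistance = double.MaxValue;
        for (int k = 0; k < samples.Length; k++)
        {
            double d = Distance(query, samples[k]);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = k;
            }
        }
        return best;
    }
}
```

In the pipeline described above, `Dominant` would be run on the 16x16 pixel matrix of each character, and `Nearest` would look the resulting vector up among the trained vectors; the index then maps back to the character that sample was labelled with.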

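That's it for the simplest principles. The next post goes deeper. As I said at the start, I'm writing one post first and will write the second based on your feedback. Discussion is welcome.

This series, Web Data Scraping Core Techniques, is about sharing ideas; all the code is there to support the explanation. If you want to know how to build a complete scraping system, please be patient; later posts will cover it. If you don't care about the ideas and only want to copy code, press F5, and click to scrape data, I ask for your understanding.

PS: My abilities are limited. Although I have worked in web data scraping for years, I cannot offer the best solution and approach for every aspect of it. Please critique and correct in the spirit of mutual learning and growing together. Comments are welcome.

Original link: https://www.cnblogs.com/keven1006/archive/2012/08/06/2625343.html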

Original articles are protected by copyright. When reposting, please credit the source: https://www.ccppcoding.com/archives/58326