Installing Tesseract-OCR for Image Text Recognition

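## Introduction ##

Tesseract is an open-source OCR engine that can recognize more than 100 languages (Chinese, English, Korean, Japanese, German, French, and so on), but its recognition of handwriting is relatively poor.

Source code: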

[https://github.com/tesseract-ocr/tesseract/releases][1]

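## Installing on macOS ##

`brew install tesseract`

![Figure][2]

Once the installation succeeds, go to the tessdata directory:

`cd /usr/local/Cellar/tesseract/4.0.0/share/tessdata/`

The version number in this path differs from release to release. If the language files you need are already in this directory, there is no need to download them separately.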

![Figure][3]
 
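## Downloading language data ##

If the files are missing, download them from GitHub: [https://github.com/tesseract-ocr/tessdata][4]

Pick the language data you need. Here we want Simplified Chinese, so the file is `chi_sim.traineddata`.
Copy the file into the `/usr/local/Cellar/tesseract/4.0.0/share/tessdata` directory.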
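
Because the version folder in the Cellar path changes between releases, it can be convenient to look the directory up instead of typing it. The following is a minimal sketch, assuming a Homebrew install under `/usr/local/Cellar/tesseract`; the helper name `find_tessdata` and the `~/Downloads` location are illustrative and not part of Tesseract or Homebrew.

```bash
#!/usr/bin/env bash
# find_tessdata (hypothetical helper): print the first tessdata directory
# found under the given Cellar root, whatever the version folder is called.
find_tessdata() {
  local cellar_root="${1:-/usr/local/Cellar/tesseract}"
  local dir
  for dir in "$cellar_root"/*/share/tessdata; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Usage: report the detected directory, then copy a downloaded file into it.
# The download location in the commented cp line is an assumption.
if tessdata_dir="$(find_tessdata)"; then
  echo "tessdata directory: $tessdata_dir"
  # cp ~/Downloads/chi_sim.traineddata "$tessdata_dir/"
else
  echo "no tessdata directory found; adjust the Cellar root" >&2
fi
```

If several versions are installed side by side, this sketch simply returns the first match, so check the printed path before copying.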

### Language data name to language table ###

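## Using Tesseract ##

Run this in a terminal: `tesseract --help`

Basic usage:

```bash
# Uses the eng language data by default. imgName is the path to the image;
# result is the output base name, so the text is written to result.txt.
tesseract imgName result
```

Specifying a language:

```bash
# Use Simplified Chinese
tesseract ./G.jpg result -l chi_sim
# List the language data installed locally
tesseract --list-langs
```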

Specifying multiple languages:

```bash
# Several languages can be combined with a + sign
tesseract ./G.jpg result -l chi_sim+eng
```
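
When the language list comes from a script rather than being typed by hand, the `+`-separated value can be assembled from individual codes. A small sketch; `join_langs` is a hypothetical helper name, not a Tesseract command.

```bash
# join_langs (hypothetical helper): join language codes with "+"
# to build the value passed to tesseract's -l option.
join_langs() {
  local IFS='+'
  printf '%s\n' "$*"
}

langs="$(join_langs chi_sim eng)"
# Show the command that would be run with the joined value.
echo "tesseract ./G.jpg result -l $langs"
```

The helper relies on `"$*"` joining arguments with the first character of `IFS`, which is limited to the function by `local`.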

One thing needs special attention: the `--psm` (page segmentation mode) option.

```bash
# List the available psm values
tesseract --help-psm

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
```

Choose the psm value that fits the image. This matters: an unsuitable value can make recognition fail.
For example:

Example image: 1234.png

![Figure][5]

`tesseract ./1234.png result -l chi_sim+eng`

![Figure][6]

```bash
# Without a psm value
➜  Downloads tesseract ./1234.png result -l chi_sim+eng
Tesseract Open Source OCR Engine v4.0.0 with Leptonica
Empty page!!
Empty page!!

# With a psm value
➜  Downloads tesseract ./1234.png result -l chi_sim+eng --psm 6
Tesseract Open Source OCR Engine v4.0.0 with Leptonica

# Recognition succeeds:
cat result.txt
一二三四
一二三四
```

![Figure][7]
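
Since the right psm value depends on the image, one way to find it is to run the same image through a few candidates and compare the results. The sketch below assumes `tesseract` is on the PATH; `try_psm_modes` is a hypothetical helper name, and the modes in the example call are only a starting guess.

```bash
# try_psm_modes (hypothetical helper): run tesseract on one image with several
# --psm values and print whatever each mode recognizes.
try_psm_modes() {
  local image="$1" langs="$2"
  shift 2
  local psm
  for psm in "$@"; do
    echo "== psm $psm =="
    # Using "stdout" as the output base prints the text instead of writing
    # result.txt; 2>/dev/null hides tesseract's diagnostic output.
    tesseract "$image" stdout -l "$langs" --psm "$psm" 2>/dev/null \
      || echo "(tesseract failed for psm $psm)"
  done
}

# Example call, using the image and languages from above:
# try_psm_modes ./1234.png chi_sim+eng 3 6 7
```

A mode that prints nothing under its header is the equivalent of the "Empty page!!" case shown above.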

### macOS installation reference ###

[https://blog.csdn.net/u010670689/article/details/78374623][8]

### Official installation documentation ###

[https://github.com/tesseract-ocr/tesseract/wiki][9]

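## Installing on Ubuntu ##

Tesseract 4.0 is an open-source OCR engine; version 4.0 introduced machine learning into recognition.

Tesseract 4.0 depends on Leptonica 1.74.2 or later, plus a few other libraries.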

```bash
sudo apt-get install g++ # or clang++ (presumably)
sudo apt-get install autoconf automake libtool
sudo apt-get install autoconf-archive
sudo apt-get install pkg-config
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev
```

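### Dependency conflict when installing libtiff5-dev ###

aptitude can resolve dependency problems more intelligently than apt-get. Run `sudo aptitude install` followed by the package name,

then choose a suitable solution from the ones it offers, and the problem is resolved.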

### Libraries needed for training ###

- `sudo apt-get install libicu-dev`
- `sudo apt-get install libpango1.0-dev`
- `sudo apt-get install libcairo2-dev`

Leptonica 1.74.2 or later has to be compiled from source. The source can be downloaded from:

[https://github.com/DanBloomberg/leptonica/releases][10]

[http://www.leptonica.com/download.html][11]

Download the version you need, extract it, and run:

```bash
./configure

make

sudo make install
```

Next, compile Tesseract 4.0. The source can be downloaded from:

[https://github.com/tesseract-ocr/tesseract/releases][12]

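Extract the archive and run:

`./configure`

It failed with this error:

`error while loading shared libraries: liblept.so.5: cannot open shared object file: No such file or directory`

`ll /usr/local/lib/liblept.so.5` shows that the file does exist, so the dynamic linker is simply not searching that directory. Add the line `/usr/local/lib` to `/etc/ld.so.conf`,

then run as root:

`ldconfig`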

[scode type="share"]
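ldconfig is the command that manages shared libraries. For a shared library to be usable system-wide, ldconfig has to be run. It searches the default directories (/lib and /usr/lib) and the directories listed in the configuration file /etc/ld.so.conf for shared libraries (files named like lib*.so*), and creates the links and the cache file that the dynamic loader (ld.so) needs. The cache file defaults to /etc/ld.so.cache and holds a sorted list of shared library names. ldconfig normally runs at system startup; after installing a new shared library, you need to run it by hand.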

[/scode]
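
The edit to `/etc/ld.so.conf` can also be scripted so that running it twice does not add the line twice. A minimal sketch; `add_lib_dir` is a hypothetical helper name, and the real commands are left commented out because they need root.

```bash
# add_lib_dir (hypothetical helper): append a directory to a linker config
# file only if that exact line is not there yet.
add_lib_dir() {
  local dir="$1" conf="$2"
  if ! grep -qxF "$dir" "$conf" 2>/dev/null; then
    printf '%s\n' "$dir" >> "$conf"
  fi
}

# Run as root on the real system:
# add_lib_dir /usr/local/lib /etc/ld.so.conf
# ldconfig
# ldconfig -p | grep liblept   # check that the cache now lists liblept
```

`grep -x -F` matches the whole line literally, so a longer path such as `/usr/local/lib64` does not count as already present.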

Run `./configure` again and the error is gone.

Then build and install:

`make`

`make install`

Download the officially trained data from: [https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400][13]

Put the files in `/usr/local/share/tessdata/`, the directory a default `/usr/local` build searches for language data.

Test command: `tesseract eng.png result -l eng`

[https://blog.csdn.net/u012384044/article/details/77979803][14]

## Installing on CentOS 7 ##

Install the dependencies that may be needed.

```bash
yum -y update

yum -y install libstdc++ autoconf automake libtool autoconf-archive pkg-config gcc gcc-c++ make libjpeg-devel libpng-devel libtiff-devel zlib-devel

yum group install -y "Development Tools"

```
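The pkg-config package is not necessarily available in your yum repositories. If you get the error `Leptonica 1.74 or higher is required.`, pkg-config must be installed.

### Installing pkg-config from source ###

Download the latest release: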

[https://pkg-config.freedesktop.org/releases/][15]

```bash
wget https://pkg-config.freedesktop.org/releases/pkg-config-0.29.2.tar.gz

tar -xf pkg-config-0.29.2.tar.gz

cd pkg-config-0.29.2/
./configure --with-internal-glib  # the flag can be omitted

make
make install
```

If the error `Leptonica 1.74 or higher is required.` appears, install Leptonica from source:

```bash
wget https://github.com/DanBloomberg/leptonica/releases/download/1.78.0/leptonica-1.78.0.tar.gz
tar -zxvf leptonica-1.78.0.tar.gz
cd leptonica-1.78.0/
./configure
make -j
make install
cd ../

```
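
Before moving on to Tesseract itself, it may be worth confirming that pkg-config now reports a new enough Leptonica, since that is what the `Leptonica 1.74 or higher is required.` check depends on. A sketch, assuming Leptonica was installed under `/usr/local` as above; `version_ge` is a hypothetical helper name.

```bash
# version_ge (hypothetical helper): succeed when version $1 >= version $2.
# Relies on "sort -V" from GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Ask pkg-config for the installed Leptonica version; "lept" is the
# module name of the .pc file that Leptonica installs.
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
if lept_version="$(pkg-config --modversion lept 2>/dev/null)"; then
  if version_ge "$lept_version" 1.74; then
    echo "Leptonica $lept_version is new enough"
  else
    echo "Leptonica $lept_version is older than 1.74" >&2
  fi
else
  echo "pkg-config cannot find Leptonica; check PKG_CONFIG_PATH" >&2
fi
```

If the last message appears even though Leptonica is installed, the `.pc` file is probably in a directory that is not listed in `PKG_CONFIG_PATH`.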

### Installing Tesseract ###

```bash
wget https://github.com/tesseract-ocr/tesseract/archive/4.0.0.tar.gz
tar -zxvf 4.0.0.tar.gz
cd tesseract-4.0.0/
./autogen.sh
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig LIBLEPT_HEADERSDIR=/usr/local/include ./configure --with-extra-includes=/usr/local/include --with-extra-libraries=/usr/local/lib
LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make -j
make install
ldconfig
cd ..

wget https://github.com/tesseract-ocr/tessdata/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/equ.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata
wget https://github.com/tesseract-ocr/tessdata/raw/master/chi_sim.traineddata

mv *.traineddata /usr/local/share/tessdata
```
Add `/usr/local/bin` to the `PATH` environment variable so that the command can be found:

`vim /etc/profile`

![Figure][16]

`source /etc/profile`
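
The exact lines added to `/etc/profile` are only shown in the screenshot, so the following is an assumption about what such an entry could look like rather than a copy of it. `add_to_path` is a hypothetical helper that avoids duplicate entries when the file is sourced more than once.

```bash
# add_to_path (hypothetical helper): append a directory to PATH unless it is
# already listed, so sourcing the file repeatedly does not grow PATH.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
}

add_to_path /usr/local/bin
export PATH

# Confirm the shell can now find the binary.
command -v tesseract || echo "tesseract not found on PATH" >&2
```

Wrapping `PATH` in colons before matching keeps a directory such as `/usr/local/bin2` from being mistaken for `/usr/local/bin`.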

## Linux installation reference ##
https://www.cnblogs.com/zhangbei/p/9950789.html

## PHP library ##
https://packagist.org/packages/thiagoalessio/tesseract_ocr

## Windows usage reference ##
https://www.cnblogs.com/wzben/p/5930538.html

## PHP code reference ##
http://developer.51cto.com/art/201601/504121.htm

[1]: https://github.com/tesseract-ocr/tesseract/releases
  [2]: https://echo.solo90.com/usr/uploads/2020/01/2842879501.png
  [3]: https://echo.solo90.com/usr/uploads/2020/01/1834259868.png
  [4]: https://github.com/tesseract-ocr/tessdata
  [5]: https://echo.solo90.com/usr/uploads/2020/01/185027377.png
  [6]: https://echo.solo90.com/usr/uploads/2020/01/2666568328.png
  [7]: https://echo.solo90.com/usr/uploads/2020/01/3567574623.png
  [8]: https://blog.csdn.net/u010670689/article/details/78374623
  [9]: https://github.com/tesseract-ocr/tesseract/wiki
  [10]: https://github.com/DanBloomberg/leptonica/releases
  [11]: http://www.leptonica.com/download.html
  [12]: https://github.com/tesseract-ocr/tesseract/releases
  [13]: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400
  [14]: https://blog.csdn.net/u012384044/article/details/77979803
  [15]: https://pkg-config.freedesktop.org/releases/
  [16]: https://echo.solo90.com/usr/uploads/2020/01/52081406.jpg
