Overview / Survival Guide
In this article I show a simple way to get images from SharePoint and run OCR on them using Tessnet2, a .NET 2.0 OCR assembly.
OCR stands for Optical Character Recognition, a technology that recognizes characters in an image file or bitmap. With OCR you can scan a sheet of printed text and get an editable text file.
Media Type/Task
Tessnet2 needs a folder containing its core processing libraries (the language data); in this case I have English and Portuguese. We also have to add the 64-bit DLL to the project, since I'm using SharePoint 2010.
Screenshots: http://206.72.115.36/img/tess1.jpg, http://206.72.115.36/img/tess2.jpg
In the first part of this article I read the files of a SharePoint document library and save them to the hard drive in "c:\temp\images".
The SharePoint Process
Note that I process each file immediately inside the foreach. If we want to handle a document differently depending on whether it is checked out (online or offline), we can use the switch included in the procedure.
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;
using System.Text;
using Microsoft.SharePoint;
using System.IO;
try
{
    string ImagePath = @"c:\temp\images\";
    // Replace "SPSite" and "SPList" with your own site URL and library name.
    // SPSite and SPWeb are disposable, so wrap them in using blocks.
    using (SPSite mysite = new SPSite("SPSite"))
    using (SPWeb myweb = mysite.OpenWeb())
    {
        SPFolder mylibrary = myweb.Folders["SPList"];
        SPFileCollection files = mylibrary.Files;
        foreach (SPFile item in files)
        {
            // Copy the document's bytes to the local image folder.
            byte[] binfile2 = item.OpenBinary();
            using (FileStream fstream = new FileStream(ImagePath + item.Name,
                FileMode.Create,
                FileAccess.ReadWrite))
            {
                fstream.Write(binfile2, 0, binfile2.Length);
            }
            // Branch here if checked-out documents need special handling.
            switch (item.CheckOutType)
            {
                case SPFile.SPCheckOutType.None:
                    break;
                case SPFile.SPCheckOutType.Offline:
                    break;
                case SPFile.SPCheckOutType.Online:
                    break;
                default:
                    break;
            }
        }
    }
}
catch (Exception ex)
{
    //Whatever;
}
I'm using a method that receives the path to the image and returns a StringBuilder, because I find it much faster than working with a string[] array. The method appends the recognized text to the StringBuilder word by word, adding a space after each word, and the RemoveDiacriticals method strips the diacritics, which removes some of the OCR garbage:
private StringBuilder ProcessOcr(string imagePath)
{
    StringBuilder sb = new StringBuilder();
    using (Bitmap image = new Bitmap(imagePath))
    {
        using (tessnet2.Tesseract tessocr = new tessnet2.Tesseract())
        {
            // Language data folder, language code ("por" = Portuguese), numeric mode off.
            tessocr.Init(@"c:\temp\tessdata", "por", false);
            // Rectangle.Empty means the whole image is processed.
            List<tessnet2.Word> result = tessocr.DoOCR(image, Rectangle.Empty);
            foreach (tessnet2.Word word in result)
            {
                sb.Append(RemoveDiacriticals(word.Text) + " ");
            }
            return sb;
        }
    }
}
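private string RemoveDiacriticals(string txt)
{
    // Decompose each accented character into base letter + combining marks.
    string nfd = txt.Normalize(NormalizationForm.FormD);
    StringBuilder retval = new StringBuilder(nfd.Length);
    foreach (char ch in nfd)
    {
        // Skip characters in the Unicode combining-mark ranges.
        if (ch >= '\u0300' && ch <= '\u036f') continue; // Combining Diacritical Marks
        if (ch >= '\u1dc0' && ch <= '\u1de6') continue; // Combining Diacritical Marks Supplement
        if (ch >= '\ufe20' && ch <= '\ufe26') continue; // Combining Half Marks
        if (ch >= '\u20d0' && ch <= '\u20f0') continue; // Combining Diacritical Marks for Symbols
        retval.Append(ch);
    }
    return retval.ToString();
}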
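To see what this step does to Portuguese text, here is a minimal, self-contained sketch. It is not the method above: instead of hard-coded ranges it tests each character's Unicode category, which is a common alternative. The class name DiacriticsDemo and the method StripMarks are hypothetical names used only for illustration.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Hypothetical helper: decomposes the text, then drops every
    // character whose Unicode category is NonSpacingMark.
    public static string StripMarks(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char ch in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(ch);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(StripMarks("informação")); // prints "informacao"
        Console.WriteLine(StripMarks("coração"));    // prints "coracao"
    }
}
```

Keep in mind that either approach also removes the legitimate accents of Portuguese words, so the stored text is an accent-free version of what was recognized.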
Now go to the directory where I put the pictures taken from SharePoint. In this example I'm processing only .jpg files and extracting the OCR text.
I call GC.Collect() before each image in order to release memory.
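private string VamosNessa()
{
    // ImagePath is assumed to be a class-level field holding the image folder.
    DirectoryInfo di = new DirectoryInfo(ImagePath);
    FileInfo[] rgFiles = di.GetFiles("*.jpg");
    StringBuilder all = new StringBuilder();
    foreach (FileInfo fi in rgFiles)
    {
        GC.Collect();
        // Collect the text of every image instead of returning after the first one.
        all.AppendLine(ProcessOcr(fi.FullName).ToString());
    }
    return all.ToString();
}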
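If the text is later written back to individual documents, it may help to keep the result of each image separate. The following is only a sketch under that assumption: OcrBatch and ProcessFolder are hypothetical names, and the OCR step is passed in as a delegate so the example runs without Tessnet2 (in the article's code the delegate would call ProcessOcr).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrBatch
{
    // Hypothetical helper: runs 'ocr' over every *.jpg file in 'folder'
    // and returns the recognized text keyed by file name.
    public static Dictionary<string, string> ProcessFolder(string folder, Func<string, string> ocr)
    {
        Dictionary<string, string> results =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string path in Directory.GetFiles(folder, "*.jpg"))
        {
            results[Path.GetFileName(path)] = ocr(path);
        }
        return results;
    }

    public static void Main()
    {
        // Build a throwaway folder with one .jpg and one unrelated file.
        string folder = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(folder);
        File.WriteAllBytes(Path.Combine(folder, "a.jpg"), new byte[0]);
        File.WriteAllBytes(Path.Combine(folder, "b.txt"), new byte[0]);

        // Stand-in for the real OCR call.
        Dictionary<string, string> texts =
            ProcessFolder(folder, p => "text of " + Path.GetFileName(p));
        foreach (KeyValuePair<string, string> pair in texts)
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
        Directory.Delete(folder, true);
    }
}
```

Since the download step saves each document under item.Name, the file name used as the key can be matched back to the document it came from.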
If you want to upload the OCR text to a field in a list, you need to know the document's link in SharePoint, which can be kept in one of the previous methods. Then call CheckOut(), Update() and CheckIn(). Be sure to check the file's SPCheckOutType first, because we may not want to touch anything that is checked out or not approved; that decision is up to you.
We will use two fields: a Boolean that tells me whether the OCR has been processed, and a multi-line text field to hold the OCR text.
item.File.CheckOut();
item["OCR"] = VamosNessa();
item["BOOL"] = "1";
item.Update();
item.File.CheckIn("Ok");