Unleash the power of image-to-text conversion with Python! This comprehensive tutorial explores OCR (Optical Character Recognition) and machine learning techniques. Learn how to extract text from images, enhance accuracy with ML models, and unlock the potential of image processing. Elevate your data extraction capabilities and explore exciting project possibilities with this hands-on tutorial.
In this project tutorial, we will use the pytesseract module for Optical Character Recognition (OCR) to extract the text from images and the re module to extract specific fields from that text.
Project Information
The project uses the pytesseract module to convert an image into text and regular expressions to extract specific fields from the extracted text.
Download the pytesseract OCR source files here
Import Modules
import matplotlib.pyplot as plt
import PIL
import pytesseract
import re
%matplotlib inline
matplotlib - used for data visualization and graphical plotting
PIL - the Python Imaging Library (installed as Pillow), used to open and manipulate images in different formats
re - the regular expression module, used to find particular patterns in text and process them
pytesseract - a Python wrapper around the Tesseract OCR engine, used to recognize the text in an image
These are the prerequisites for using the pytesseract module
# prerequisites
!pip install pytesseract
# install the desktop version of Tesseract OCR (pytesseract is only a wrapper around it)
Download link: https://sourceforge.net/projects/tesseract-ocr-alt/files/
Alternate Download link: https://digi.bib.uni-mannheim.de/tesseract/
Load the Image
Now we will open and display the test image to view the text data
img = PIL.Image.open('test.JPG')
plt.imshow(img)
Here is a test image for example
The image contains several kinds of text: letters, digits and special characters.
Convert Image to Text
Now we will convert the image into text
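# config
import os
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'
# Tesseract reads TESSDATA_PREFIX from the environment, so a plain Python variable has no effect
os.environ['TESSDATA_PREFIX'] = 'C:/Program Files/Tesseract-OCR'
This configuration tells the pytesseract module where the Tesseract executable is installed
TESSDATA_PREFIX tells Tesseract where to find its language data files. Adjust both paths to match your installation; depending on the Tesseract version, the prefix may need to point at the tessdata folder itself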
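The hard-coded path above only works for one particular Windows installation. As a rough sketch of a more portable setup, the helper below first looks for tesseract on the PATH and then falls back to a list of candidate locations. The name find_tesseract and the fallback path (the tutorial's folder with an assumed .exe suffix) are illustrative choices, not part of pytesseract.

```python
import os
import shutil

# Assumption: the default install folder used in this tutorial, with ".exe" added.
WINDOWS_DEFAULT = "C:/Program Files/Tesseract-OCR/tesseract.exe"


def find_tesseract(candidates=(WINDOWS_DEFAULT,)):
    """Return a path to the tesseract executable, or None if it is not found."""
    # Prefer whatever is already reachable through the PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # Otherwise try each known install location in order.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


cmd = find_tesseract()
print(cmd)
# If cmd is not None, it can be assigned to pytesseract.pytesseract.tesseract_cmd
```

If the helper returns None, Tesseract is most likely not installed yet, which is easier to diagnose than an error raised later during extraction.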
Now we process the image and get the output data
text_data = pytesseract.image_to_string(img.convert('RGB'), lang='eng')
img.convert('RGB') - Converts the image to three-channel RGB mode before it is passed to Tesseract
lang='eng' - Sets the language of the text to extract; Tesseract uses English by default
To extract text in another language, download the corresponding trained data file and pass its language code in lang
Enlarging the image before extraction can improve the results
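Regarding the lang setting described above: Tesseract also accepts several language codes joined with a plus sign, such as eng+fra. The small helper below is a sketch of building that string; build_lang_arg is a hypothetical name, and the fra code assumes the French trained data file has been installed.

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into the 'a+b' form accepted by lang=."""
    # Drop empty entries and surrounding whitespace.
    cleaned = [c.strip() for c in codes if c and c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


lang = build_lang_arg(["eng", "fra"])
print(lang)  # eng+fra
# The result can then be passed on, for example:
# text_data = pytesseract.image_to_string(img.convert('RGB'), lang=lang)
```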
print(text_data)
Name: Sample Unique Policy Number: 12345 Amount: 100000 Start Date: 1/10/2019 End Date: 1/11/2019 Geo-Coordinates: 13.89,83.49
Text extracted from the image
Comparing the extracted text with the displayed image shows that the extraction was successful and no data was left out
If the text in the image is too small, resize the image to a higher resolution for better extraction
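The resizing step mentioned above could look roughly like this. It is a minimal sketch: scaled_size and upscale are hypothetical helper names, the factor of 2 is an arbitrary starting point, and the resize call relies on Pillow's default resampling filter.

```python
def scaled_size(width, height, factor):
    """Return the (width, height) obtained by scaling both sides by factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Never let a side collapse to zero pixels.
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def upscale(img, factor=2.0):
    """Return a resized copy of a PIL image, keeping the aspect ratio."""
    return img.resize(scaled_size(img.width, img.height, factor))


print(scaled_size(640, 480, 2.0))  # (1280, 960)
# With the image loaded earlier:
# bigger = upscale(img, 2.0)
# text_data = pytesseract.image_to_string(bigger.convert('RGB'), lang='eng')
```

Larger factors make the image slower to process, so it is worth trying a small factor first and comparing the extracted text.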
Extract Specific Fields
Now we will extract specific fields from the text data
m = re.search(r"Name: (\w+)", text_data)
name = m[1]
name
'Sample'
m = re.search(r"Start Date: (\S+)", text_data)
start_date = m[1]
start_date
'1/10/2019'
m = re.search(r"Geo-Coordinates: (\S+)", text_data)
coordinates = m[1]
coordinates
'13.89,83.49'
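(\w+) - Capture group that matches one or more word characters (letters, digits, underscore) after the specified field name
m[n] - Returns the text captured by the nth group; m[1] is the first captured group and m[0] is the whole match
(\S+) - Capture group that matches one or more non-whitespace characters, so it also captures special characters such as / and ,
You can use the split function to break a captured value into smaller parts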
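The patterns above can be collected into one helper that also copes with a field that is missing from the OCR output, where re.search returns None and m[1] would fail. This is a sketch only: extract_fields, parse_coordinates and the End Date pattern are additions for illustration, and the sample string is the output printed earlier.

```python
import re

# Field name -> pattern with one capture group for the value.
FIELD_PATTERNS = {
    "name": r"Name: (\w+)",
    "start_date": r"Start Date: (\S+)",
    "end_date": r"End Date: (\S+)",
    "coordinates": r"Geo-Coordinates: (\S+)",
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of captured values; fields that do not match map to None."""
    fields = {}
    for key, pattern in patterns.items():
        m = re.search(pattern, text)
        fields[key] = m[1] if m else None
    return fields


def parse_coordinates(value):
    """Split 'lat,lon' on the comma and convert both parts to floats."""
    lat, lon = value.split(",")
    return float(lat), float(lon)


sample = ("Name: Sample Unique Policy Number: 12345 Amount: 100000 "
          "Start Date: 1/10/2019 End Date: 1/11/2019 Geo-Coordinates: 13.89,83.49")
fields = extract_fields(sample)
print(fields["name"])  # Sample
print(parse_coordinates(fields["coordinates"]))  # (13.89, 83.49)
```

Keeping the patterns in a dictionary makes it easy to add another field without repeating the search and indexing code.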
Final Thoughts
You can filter and further process the extracted text, for example by running sentiment analysis or by plotting the most frequent words.
You can preprocess other images with a layout similar to the test image and use the tesseract module to extract their text; compare the output with the image, since OCR accuracy depends on image quality.
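As one way to follow up on the frequent-words idea, the sketch below counts words in a piece of extracted text with the standard library. top_words is a hypothetical helper, and treating only runs of ASCII letters as words is a simplifying assumption.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most common lowercase words as (word, count) pairs."""
    # Keep only runs of ASCII letters; digits and punctuation are ignored.
    words = re.findall(r"[A-Za-z]+", text.lower())
    return Counter(words).most_common(n)


sample = "Start Date: 1/10/2019 End Date: 1/11/2019"
print(top_words(sample, 1))  # [('date', 2)]
```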
In this project tutorial, we have explored image to text conversion and field extraction using OCR. This method is very practical and can be implemented in other projects to analyze and process the text data from an image.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm