Hackers Realm

Image to Text Conversion & Extraction using Python | OCR | Machine Learning Project Tutorial

Updated: May 31, 2023

Unleash the power of image-to-text conversion with Python! This comprehensive tutorial explores OCR (Optical Character Recognition) and machine learning techniques. Learn how to extract text from images, enhance accuracy with ML models, and unlock the potential of image processing. Elevate your data extraction capabilities and explore exciting project possibilities with this hands-on tutorial. #ImageToText #OCR #MachineLearning #Python #DataExtraction #ImageProcessing

Image to Text Conversion & Extraction using OCR

In this project tutorial, we will use the pytesseract module for Optical Character Recognition (OCR) to extract text from images, and the re module to extract specific fields from that text.


You can watch the step by step explanation video tutorial down below


Project Information

The project uses the pytesseract module to convert an image into text, and regular expressions to extract specific fields from the extracted text.


Download the pytesseract OCR source files here



Import Modules


import matplotlib.pyplot as plt
import PIL
import pytesseract
import re
%matplotlib inline
  • matplotlib - used for data visualization and graphical plotting

  • PIL - Python Imaging Library for image manipulation in different image formats

  • re - regular expression module, used to find particular patterns in text and process the matches

  • pytesseract - Python wrapper around the Tesseract OCR engine, used to recognize the text in an image


The following prerequisites are needed to use the pytesseract module

# prerequisites
!pip install pytesseract
# also install the Tesseract OCR engine (desktop application); pytesseract is only a wrapper around it


Load the image


Now we will open and display the test image to view the text data

img = PIL.Image.open('test.JPG')
plt.imshow(img)
Image contains Text Data
  • Here is a test image for example

  • We can see various types of text data like alphanumerical and special characters.



Convert Image to Text


Now we will convert the image into text

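# config
import os
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'
os.environ['TESSDATA_PREFIX'] = 'C:/Program Files/Tesseract-OCR'
  • This configuration is needed when the Tesseract executable is not on the system PATH, which is common on Windows

  • tesseract_cmd tells pytesseract where the Tesseract executable is installed

  • TESSDATA_PREFIX must be set as an environment variable for Tesseract to see it; a plain Python variable has no effect. Depending on the Tesseract version, it should point either to the install folder or to its tessdata subfolder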
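A hard-coded path only works on machines where Tesseract is installed in that exact location. Here is a minimal sketch of a more portable lookup, assuming the tesseract executable may already be on the system PATH. The helper name find_tesseract and the fallback list are illustrative assumptions and are not part of pytesseract.

```python
import os
import shutil

# Hypothetical fallback location; it mirrors the Windows path used in this tutorial
FALLBACK_PATHS = ('C:/Program Files/Tesseract-OCR/tesseract.exe',)

def find_tesseract(fallbacks=FALLBACK_PATHS):
    """Return a path to the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in PATH
    path = shutil.which('tesseract')
    if path:
        return path
    # otherwise try the known install locations one by one
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None

tesseract_path = find_tesseract()
print(tesseract_path)
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd. If the result is None, the Tesseract engine probably still needs to be installed.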


Now we process the image and get the output data

text_data = pytesseract.image_to_string(img.convert('RGB'), lang='eng')
  • img.convert('RGB') - Converts the image to a three-channel RGB image, which drops any alpha channel or palette before OCR

  • lang='eng' - Sets the language used to recognize the text; Tesseract uses English by default

  • If you want to extract the text in a specific language you must download the corresponding files and set the language

  • You can resize the image into a larger format for better extraction
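Tesseract can use several languages at once when their codes are joined with a + sign, for example eng+deu. Below is a minimal sketch, assuming the matching language data files are already installed. The names build_lang and ocr_image are hypothetical helpers made up for this example.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'eng+deu' form used by the lang argument."""
    return '+'.join(codes)

def ocr_image(img, codes=('eng',)):
    """Run OCR on a PIL image with the given language codes."""
    # imported here so that build_lang can be tried without pytesseract installed
    import pytesseract
    return pytesseract.image_to_string(img.convert('RGB'), lang=build_lang(codes))

print(build_lang(['eng', 'deu']))  # eng+deu
```

Calling ocr_image(img, ['eng', 'deu']) would then run the same extraction as above with both languages enabled.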



print(text_data)

Name: Sample
Unique Policy Number: 12345
Amount: 100000
Start Date: 1/10/2019
End Date: 1/11/2019
Geo-Coordinates: 13.89,83.49

  • Text extracted from the image

  • Comparing the extracted text with the displayed image shows that the extraction was successful and no data was left out

  • If the text in the image is too small, you can resize the image to a higher resolution for better extraction
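The resizing tip can be sketched as follows. This is only an illustration: the helper names scaled_size and upscale are made up for this example, and the factor of 2 is an arbitrary starting point rather than a recommended value.

```python
def scaled_size(width, height, factor):
    """Return the (width, height) of an image enlarged by factor, in whole pixels."""
    return (max(1, round(width * factor)), max(1, round(height * factor)))

def upscale(img, factor=2):
    """Return an enlarged copy of a PIL image; the original is left unchanged."""
    # img.size is a (width, height) tuple; resize uses Pillow's default resampling filter
    return img.resize(scaled_size(*img.size, factor))

print(scaled_size(640, 480, 2))  # (1280, 960)
```

The enlarged image can be passed to pytesseract in place of the original, for example pytesseract.image_to_string(upscale(img).convert('RGB'), lang='eng').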


Extract Specific Fields


Now we will extract specific fields from the text data

m = re.search(r"Name: (\w+)", text_data)
name = m[1]
name

'Sample'

m = re.search(r"Start Date: (\S+)", text_data)
start_date = m[1]
start_date

'1/10/2019'


m = re.search(r"Geo-Coordinates: (\S+)", text_data)
coordinates = m[1]
coordinates

'13.89,83.49'

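  • (\w+) - Capture group that matches one or more word characters (letters, digits and underscore) after the specified field name

  • m[n] - Returns the text captured by the nth group; m[0] is the whole match and m[1] is the first group

  • (\S+) - Capture group that matches one or more non-whitespace characters, so it also includes special characters such as / and ,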

  • You may use the split function to receive any other specific data from the text
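The same approach extends to all the fields at once. The sketch below is one possible way to do it, not the notebook's own code: the sample text, the PATTERNS dictionary and the extract_fields helper are assumptions made for illustration. It also guards against re.search returning None when a field is missing, and uses split to separate the coordinates.

```python
import re

# Sample text with the same fields as the OCR output shown above
sample_text = """Name: Sample
Unique Policy Number: 12345
Amount: 100000
Start Date: 1/10/2019
End Date: 1/11/2019
Geo-Coordinates: 13.89,83.49"""

# field name -> pattern with exactly one capture group
PATTERNS = {
    'name': r"Name: (\w+)",
    'policy_number': r"Unique Policy Number: (\d+)",
    'amount': r"Amount: (\d+)",
    'start_date': r"Start Date: (\S+)",
    'end_date': r"End Date: (\S+)",
    'coordinates': r"Geo-Coordinates: (\S+)",
}

def extract_fields(text, patterns=PATTERNS):
    """Return a dict of captured values; fields that are not found map to None."""
    fields = {}
    for key, pattern in patterns.items():
        m = re.search(pattern, text)
        # re.search returns None when nothing matches, so check before indexing
        fields[key] = m[1] if m else None
    return fields

fields = extract_fields(sample_text)
print(fields['name'], fields['amount'])

# split the coordinate string on the comma and convert both parts to float
# (the order latitude, longitude is an assumption; the image does not say)
lat, lon = (float(part) for part in fields['coordinates'].split(','))
print(lat, lon)
```

Returning None for missing fields keeps the code from failing with a TypeError when OCR misreads a label.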


Final Thoughts

  • You can filter and further process the extracted text, for example with sentiment analysis or by plotting the most frequent words.

  • You can preprocess images in various formats, similar to the test image, before passing them to Tesseract to reduce the loss of information during extraction.
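As one example of preprocessing, here is a minimal sketch of simple global thresholding with Pillow, which turns the image into pure black and white before OCR. This technique is not used in the tutorial itself. The helper names and the threshold of 128 are assumptions, and the best threshold depends on the image.

```python
def threshold_table(threshold=128):
    """Build a 256-entry lookup table mapping each gray level to black (0) or white (255)."""
    return [255 if level > threshold else 0 for level in range(256)]

def binarize(img, threshold=128):
    """Convert a PIL image to grayscale, then to pure black and white."""
    # convert('L') gives 8-bit grayscale; point() applies the lookup table to every pixel
    return img.convert('L').point(threshold_table(threshold))

table = threshold_table(128)
print(table[128], table[129])  # 0 255
```

The result of binarize(img) can be passed to pytesseract.image_to_string in the same way as the original image.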


In this project tutorial, we have explored the Image to Text Conversion and Extraction project using OCR. This method is very practical and can be implemented in other projects to analyze and process the text data from an image.


Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
