I would like to use page segmentation from Tesseract without running the OCR, as I have my own custom OCR model, and it takes too long to run page segmentation AND OCR. I tried using the --psm 2 mode both on the Tesseract command line and in pytesseract, and it didn't work as promised.
I'm working in Linux, and am coding in Python 3.10.
I currently use the tesseract-ocr-api from layoutparser (Documentation). The code looks like the following:
import layoutparser as lp
ocr_agent = lp.TesseractAgent()
res = ocr_agent.detect(img_path, return_response=True)
layout_info = res['data']
The layout_info is then a pd.DataFrame that contains layout information at the level of blocks, paragraphs, lines and words, as well as the OCR output. The problem is that this is very slow: on my machine it takes 7 s per image, and I don't actually need the OCR. Hence, I want page segmentation (sometimes also called layout detection) only.
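To make clear which part of that output I actually need: Tesseract's TSV data tags every row with a level (1 = page, 2 = block, 3 = paragraph, 4 = line, 5 = word), and only the word rows carry recognized text. Below is a minimal sketch of separating the layout rows from the word rows, assuming the standard TSV column names. The names `layout_rows` and `SAMPLE_TSV`, and the sample values, are made up for illustration. It only filters the result after the fact, so it does not remove the OCR cost.

```python
import csv
import io

# Level codes used in Tesseract's TSV output.
LEVEL_NAMES = {1: "page", 2: "block", 3: "paragraph", 4: "line", 5: "word"}


def layout_rows(tsv_text, max_level=4):
    """Return bounding boxes for rows with level <= max_level (no text)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    boxes = []
    for row in reader:
        level = int(row["level"])
        if level > max_level:
            continue  # skip word rows, which hold the OCR text
        boxes.append({
            "type": LEVEL_NAMES[level],
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
        })
    return boxes


# Hypothetical sample in the TSV layout (values invented for illustration).
_HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
           "word_num", "left", "top", "width", "height", "conf", "text"]
_ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""],
    ["2", "1", "1", "0", "0", "0", "10", "10", "200", "50", "-1", ""],
    ["3", "1", "1", "1", "0", "0", "10", "10", "200", "50", "-1", ""],
    ["4", "1", "1", "1", "1", "0", "10", "10", "200", "20", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "10", "80", "20", "96.5", "Hello"],
]
SAMPLE_TSV = "\n".join("\t".join(fields) for fields in [_HEADER] + _ROWS)

if __name__ == "__main__":
    for box in layout_rows(SAMPLE_TSV):
        print(box)
```

Passing `max_level=2` would keep only the page and block rows, which is roughly the granularity I am after.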
According to the Tesseract documentation, there is the --psm 2 mode, "Automatic page segmentation, but no OSD, or OCR". When I try this on the command line, it does not produce an output file (even if the output type is specified):
tesseract img.png outfile --psm 2
tesseract img.png outfile --psm 2 tsv
I also tried the Python wrapper pytesseract, but it is quite slow, and it again returns the pd.DataFrame with the layout AND OCR data, despite --psm 2 being specified:
import cv2
import pytesseract
img = cv2.imread(img_path)
layout_info = pytesseract.image_to_data(img, config='tsv --psm 2', output_type='data.frame')
I'm using pytesseract==0.3.10 and tesseract 5.3.3-30-gea0b.
Do you have any ideas on how I can achieve page segmentation without OCR with Tesseract (or at least speed up the processing time of page segmentation + OCR)?
source https://stackoverflow.com/questions/77704558/how-to-do-only-page-segmentation-layout-detection-with-tesseract-mode-psm-2