Search in PDF

aras
Posts: 1
Joined: Mon Oct 24, 2011 6:00 am

Post by aras »

Hi all,
My client's requirement is a PDF search (non-English) in the Search module of his e-learning website. When I try to extract the contents of a PDF for indexing, some of the characters are dropped during extraction (I see empty spaces in those places when I view the indexed contents in Luke). I am getting this problem for languages like Tamil and Hindi.
The client is adamant that he wants PDF search.
What is the solution for this? Please give me a ray of light or some guidelines.
Thanks and Regards,

aras


Claudiu (Softland)
Posts: 1565
Joined: Thu May 23, 2013 7:19 am

Post by Claudiu (Softland) »

Hello,
Unfortunately, the simple fonts that PDF uses by default cover only Latin characters through single-byte encodings. Text in other scripts, such as Tamil or Hindi, is added to the PDF as embedded CID font subsets with Unicode CMaps. To extract all the characters correctly, you have to use a text search module that can read this type of text from PDF files.
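To illustrate what reading this type of text involves, here is a minimal Python sketch of how glyph codes are turned back into Unicode through such a CMap. The CMap fragment, the glyph codes and the function names `parse_tounicode` and `decode_codes` are made up for this example and do not come from a real file or library. It handles only the simplest `bfchar` and `bfrange` entries, so treat it as an illustration rather than a complete extractor.

```python
import re

# Hypothetical, hand-written CMap fragment (not taken from a real PDF).
# It maps a few glyph codes to a space and to Devanagari letters.
SAMPLE_CMAP = """
beginbfchar
<0003> <0020>
<0024> <0905>
endbfchar
beginbfrange
<0030> <0032> <0915>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Return a dict that maps glyph codes (int) to Unicode strings."""
    mapping = {}
    # bfchar entries: one source code, one UTF-16BE destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries: consecutive source codes map to consecutive code points.
    # Simplification: the destination is assumed to be a single code point,
    # and the array form of bfrange is not handled.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_codes(codes, mapping, missing="\ufffd"):
    """Translate glyph codes to text; unmapped codes become `missing`."""
    return "".join(mapping.get(code, missing) for code in codes)


mapping = parse_tounicode(SAMPLE_CMAP)
text = decode_codes([0x0030, 0x0003, 0x0099], mapping)
```

In this sketch the code `0x0099` has no entry in the map, so it comes out as the replacement character. An extractor that ignores these maps, or a PDF that lacks them, loses characters in the same way, which would match the empty spaces seen in the index.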
Thank you for understanding.
