Search in PDF

You can ask general questions and share opinions or advice about doPDF.
Posts: 1
Joined: Mon Oct 24, 2011 6:00 am

Postby aras » Mon Oct 24, 2011 1:12 pm

Hi all,

My client's requirement is to support PDF search (for non-English content) in the search module of his e-learning website. When I try to extract the contents of a PDF for indexing, some of the characters are dropped during extraction (there are empty spaces in their place when I view the indexed contents in Luke). I am getting this problem for languages like Tamil and Hindi.

The client is adamant that he wants PDF search.

What is the solution for this? Please give me a ray of light or some guidelines.

Thanks and Regards,


Claudiu (Softland)
Posts: 1495
Joined: Thu May 23, 2013 7:19 am

Postby Claudiu (Softland) » Mon Oct 24, 2011 5:55 pm


Unfortunately, by default the PDF format supports only Latin characters. Other characters are added to the PDF as embedded CID font subsets with Unicode CMaps. To extract all the characters correctly, you have to use a text-extraction module that can read this type of text from PDF files.
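
To illustrate what such a module has to do, here is a minimal Python sketch of the lookup step: it reads the `bfchar` and `bfrange` entries of a Unicode CMap and uses them to turn the character codes stored in the page content into Unicode text. The CMap fragment and the names `SAMPLE_CMAP`, `parse_tounicode` and `decode_hex_string` are made up for this example and are not taken from any particular library or from a real PDF. A real extractor also has to locate and decompress the CMap stream and handle the array form of `bfrange`, which this sketch skips.

```python
import re

# Hand-written CMap fragment, for illustration only (not from a real PDF).
# Codes 0001-0005 map to Tamil letters, codes 0010-0012 to Devanagari letters.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
5 beginbfchar
<0001> <0BA4>
<0002> <0BAE>
<0003> <0BBF>
<0004> <0BB4>
<0005> <0BCD>
endbfchar
1 beginbfrange
<0010> <0012> <0915>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def utf16_hex_to_text(hex_value):
    """CMap destinations are UTF-16BE code units written as hex digits."""
    return bytes.fromhex(hex_value).decode("utf-16-be")


def parse_tounicode(cmap_text):
    """Build a {character code: Unicode string} table from a CMap."""
    mapping = {}
    # bfchar: one source code maps to one destination string.
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, section):
            mapping[int(src, 16)] = utf16_hex_to_text(dst)
    # bfrange (simple form): consecutive codes map to consecutive characters.
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        pattern = HEX + r"\s*" + HEX + r"\s*" + HEX
        for lo, hi, dst in re.findall(pattern, section):
            base = utf16_hex_to_text(dst)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = base[:-1] + chr(ord(base[-1]) + offset)
    return mapping


def decode_hex_string(hex_codes, mapping, code_bytes=2):
    """Decode a hex string of fixed-width codes; unmapped codes become U+FFFD."""
    step = code_bytes * 2
    chars = []
    for i in range(0, len(hex_codes), step):
        code = int(hex_codes[i:i + step], 16)
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


if __name__ == "__main__":
    table = parse_tounicode(SAMPLE_CMAP)
    print(decode_hex_string("00010002000300040005", table))
    print(decode_hex_string("001000110012", table))
```

An extractor that ignores this table has no way to turn the codes into Tamil or Hindi characters, which would match the empty spaces described above.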

Thank you for understanding.
