Digitization of books in Wikisource using DjVu

The content of this blog post has moved to a new domain. Please visit http://shijualex.in/digitization-of-books-in-wikisource-using-djvu/ to read it.

10 Responses to Digitization of books in Wikisource using DjVu

  1. Abhiram C says:

    A really good post and a learning curve for most Indic language Wikilibrarians.

  2. su says:

    I don’t understand, doesn’t ABBYY work with Malayalam?
    What’s the output of texts like http://archive.org/search.php?query=language%3A%28mal%29 ?

  3. bawolff says:

    Awesome to hear about what the non-english projects are doing :)

    BTW: it appears there’s something wrong with how your blog is set up for en.planet.wikimedia.org. The links on the en planet for your post don’t go to your blog.

  4. Shiju Alex says:

    su said
    //I don’t understand, doesn’t ABBYY work with Malayalam?//

    No. ABBYY does not support Malayalam. In fact, it doesn’t support any Indic language. See the list of languages ABBYY supports here: http://finereader.abbyy.com/recognition_languages/?width=850&height=600px&blocks=Content

    bawolff said

    //BTW: it appears there’s something wrong with how your blog is set up for en.planet.wikimedia.org . The links on the en planet for your post don’t go to your blog//

    I do not have any idea about the settings there. Hope the blog planet admins will fix the issues.

  5. P A Alberti says:

    Are you aware that Tesseract (an open source OCR engine) does support Malayalam? I believe there’s both built-in support in the newest beta version (3.02, which hasn’t been officially released, but you can compile it from the source code at https://code.google.com/p/tesseract-ocr/ if you’re feeling technical) and a community-driven project (https://code.google.com/p/parichit/ for version 3.01). There’s also support for some other Indian languages. I can’t vouch for the quality though, as I can’t read any of them.

  6. raama0803 says:

    Is there any OCR available for the Sanskrit language as well? If so, where can I get it?

  7. P A Alberti says:

    Sorry, but I don’t know of any support for Sanskrit. It might be worth contacting the folks behind the Parichit project to ask what they are missing in order to support Sanskrit and whether there’s a way you can help. Their stated goal is to support each of the Indian languages, so maybe, if we’re lucky, they have something on the way. (While training Tesseract for new languages is rather tricky, it does not require programming skills.)

  8. raama0803 says:

    Dear Alberti, it would be good if I could get contact details for the Parichit project people, so that I can help them with the language part. But the PDFs of many Sanskrit works might be in a font which may not be open source. So how do we overcome this issue?

  9. P A Alberti says:

    The only contact information I have for the Parichit project people is in the link I gave above. There’s a Google Groups mailing list, though it doesn’t seem to be very active.
    Training Tesseract requires images of text; it does not use fonts directly (there’s a technical guide at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 ). So it is possible to use scans of old texts, although that is probably tricky for connected scripts. And since an image of text generated using a proprietary font is usually not a derivative work of that font (except perhaps when the font is particularly decorative), it is also possible to use fonts that aren’t open source.

