Posted

I'm working on a little side project on Japanese kantei books/articles, using optical character recognition (OCR) to detect text and digitize it. I have been testing with some images found online of Juyo/TokuJu NBTHK publications. Unfortunately, there don't seem to be many high-resolution images of the text side of these books beyond the 2 or 3 I found in about 10 minutes of Google searching. I'm looking to see if any members have any of these books in their libraries, or saved photos, that they would mind sharing so I can keep testing my code with different images.

 

The idea behind all of this is to convert images/scans into digitized text that non-Japanese speakers can then index and search for common nihonto-related words (school, mei, dimensions, and other sword attributes), using a compiled index of useful kanji terms and their English translations. So far, on the few high-res pictures I could find online, the digitization seems to be >95% accurate in detecting the proper Japanese kanji and hiragana characters. The one extremely high-resolution image I have came out 100% accurate. I've got a basic index of 50 or so basic nihonto kanji terms and was able to take the scanned images and tag each occurrence of the items in the index.
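To make the tagging idea concrete, here is a minimal sketch of what "tag each occurrence of the items in the index" could look like. It is not the project's actual code: the `INDEX` dictionary, the `tag_terms` function, and the sample string are all hypothetical, and a plain substring search is only one possible way to do the lookup.

```python
# Hypothetical mini-index: kanji term -> English gloss.
# A real index would be loaded from a spreadsheet export rather than hard-coded.
INDEX = {
    "太刀": "Tachi",
    "刀": "Katana",
    "身幅": "Mihaba",
    "反り": "Sori - depth of curvature",
}


def tag_terms(text, index):
    """Return (term, gloss, position) for every occurrence of every index term.

    Positions are 0-based character offsets into `text` (newlines count).
    """
    hits = []
    for term, gloss in index.items():
        start = text.find(term)
        while start != -1:
            hits.append((term, gloss, start))
            # Keep searching just past the previous hit.
            start = text.find(term, start + 1)
    return hits


# Sample OCR output (two short lines of Japanese text).
sample = "三月号の誌上鑑定刀の答えは、源正行(清麻\n同人)の刀でした。"

for term, gloss, pos in tag_terms(sample, INDEX):
    print(term, gloss, pos)
```

Because every term is searched independently, a short term such as 刀 will also be reported inside longer terms such as 太刀; the sketch makes no attempt to resolve that.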

 

Maybe this is all just crazy and there isn't much of an audience for it, but I'd like to grab a few more reference images if any of the members here could provide some and see where it takes me. Or you guys can talk me down from the ledge if this truly is madness. I might listen!

 

I've attached an example of an image that I've been using as reference from a Juyo Token article. 

 

Thanks!

Yamato_Shizu.jpg

Posted

Great project. It has been done a few times, with different levels of completion and different outcomes: once in full, followed by a secret sale to subscribers; once half-done on purpose (the setsumei were not translated); and once with full translation, but of only a few volumes.

Good luck!

Posted
7 hours ago, Rivkin said:

Great project. It has been done a few times, with different levels of completion and different outcomes: once in full, followed by a secret sale to subscribers; once half-done on purpose (the setsumei were not translated); and once with full translation, but of only a few volumes.

Good luck!

I'm not planning on charging anything. Just trying to find out if anything like this would be useful to anybody besides myself. I don't know how well it could detect characters on the tang from pictures, since the contrast between white and black is one of the key inputs to the OCR engine (Tesseract) that I'm using. If the setsumei is printed on the paperwork (as on most origami I've seen), there should be no issue pulling out some key terms.
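To illustrate why contrast matters, here is a rough sketch of Otsu's method, a common way to choose the cut-off that turns a grayscale image into pure black and white before OCR. This is an illustration only, not necessarily what this pipeline does; `otsu_threshold` and `binarize` are hypothetical helper names, and the pixel lists are made-up data standing in for a real image.

```python
def otsu_threshold(pixels):
    """Pick the 0-255 threshold that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # pixels with value <= t
        if w_dark == 0:
            continue
        w_light = total - w_dark   # pixels with value > t
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (black) or 255 (white)."""
    return [255 if p > threshold else 0 for p in pixels]


# Made-up "printed page": dark ink (10) on light paper (200).
page = [10] * 50 + [200] * 50
t = otsu_threshold(page)
print(t, binarize([10, 200], t))
```

On a printed page the ink and paper values sit far apart, so almost any cut-off between them works. On a photo of a tang, the chiselled strokes and the surrounding steel have similar brightness, and no single threshold separates them cleanly.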

 

The biggest crux here is the quality of the index. So far I've only compiled a simple index of terms, but as it becomes more comprehensive, with more terms to search against, the resulting tags should hopefully become more useful. My idea for a finished page would include a printout of the Japanese text, the terms that matched the index, and their locations in the original script. While not a translation (that is exceedingly difficult; Google has spent billions and its Japanese-to-English is still abysmal for a topic as niche as nihonto), it could reveal a little more about what is written about the sword to those who don't read Japanese.

 

Part of the experimentation, and the reason I need photos, is to test the paragraph/text block detection. Shown below are the different stages of detecting and grouping blocks of text into paragraphs, so that the OCR engine doesn't read across multiple paragraphs and turn the output into garbage. The colors are inverted, then a boundary is drawn around the characters to gauge proximity, and that proximity is used to draw green rectangles that split the image into sub-images. These are all fed into the engine, and the output has been accurate so far, provided you have good reference images.

 

[Attached images: three stages of the text block detection]
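The grouping step can be sketched in plain Python on a tiny binary grid, where 1 stands for an ink pixel after inversion. This is an assumption-laden toy, not the code used here: `dilate` and `find_blocks` are hypothetical names, a real pipeline would normally use an image library on full-size scans, and the `radius` parameter stands in for whatever proximity measure is actually used.

```python
from collections import deque


def dilate(grid, radius):
    """Grow every ink pixel (1) into a square so nearby marks touch."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if grid[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def find_blocks(grid, radius=1):
    """Return inclusive boxes (top, left, bottom, right), one per group of
    ink pixels whose dilated shapes are connected."""
    h, w = len(grid), len(grid[0])
    grown = dilate(grid, radius)
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not grown[y][x] or seen[y][x]:
                continue
            # Flood-fill one connected region of the dilated image.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = h, w, -1, -1
            while queue:
                cy, cx = queue.popleft()
                if grid[cy][cx]:  # only original ink shapes the box
                    top, left = min(top, cy), min(left, cx)
                    bottom, right = max(bottom, cy), max(right, cx)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and grown[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy page: a loose cluster on the left, a solid block on the right.
page = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
print(find_blocks(page, radius=1))
```

Each returned box plays the role of one green rectangle: cropping the page to a box gives a sub-image that can be passed to the OCR engine on its own. A larger `radius` merges marks that are further apart, so it would need tuning to the line and paragraph spacing of the scans.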

 

The engine output for the large green rectangle in this test becomes:

 

三月号の誌上鑑定刀の答えは、源正行(清麻
同人)の刀でした。

本作は刀工銘を指裏(太刀銘)に魚っており、
長寸で反りも高いことから、古刀の太刀と見た
札もありましたが、身幅に比して錦幅が狭く、
平肉が目立たず、外が延びてふくらも枯れ、乱
れの足が長く、刃先に抜けるなどの特色から、
時代を新々刀と捉えることが出来ます。

重ねの厚い作品が多い新々刀期において、正

 

And then passing it through my very basic index (I'll need help on this later to know what terms to search for) I get the following:

0   cvtest-3.jpg    太刀                      Tachi     42
1   cvtest-3.jpg    太刀                      Tachi     70
2   cvtest-3.jpg     刀                     Katana      8
3   cvtest-3.jpg     刀                     Katana     25
4   cvtest-3.jpg     刀                     Katana     35
5   cvtest-3.jpg     刀                     Katana     43
6   cvtest-3.jpg     刀                     Katana     68
7   cvtest-3.jpg     刀                     Katana     71
8   cvtest-3.jpg     刀                     Katana    147
9   cvtest-3.jpg     刀                     Katana    174
10  cvtest-3.jpg     幅                       Haba     86
11  cvtest-3.jpg     幅                       Haba     92
12  cvtest-3.jpg    身幅                     Mihaba     85
13  cvtest-3.jpg    反り  Sori - depth of curvature     57
14  cvtest-3.jpg     厚                   Atsu(ku)    165

 

I can already see overlaps between single-kanji terms and terms of two or more kanji (for example, 刀 is tagged inside 太刀), and that will have to be sorted out later. Maybe some members of the community could pitch in and help edit the index (it's just a Google Sheets doc). The number is the position where the term appears in the text; later the translation could be inserted into the text at that location, or there could be a tag-type system, etc. I haven't thought that far ahead yet.
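One possible way to handle both points, sketched under assumptions: scan the text left to right and prefer the longest index term at each position, then insert the English gloss right after each match. The names `tag_longest` and `annotate`, the bracket format, and the four-entry `INDEX` are all hypothetical; this is one option among several, not a settled design.

```python
# Hypothetical mini-index: kanji term -> English gloss.
INDEX = {"太刀": "Tachi", "刀": "Katana", "身幅": "Mihaba", "幅": "Haba"}


def tag_longest(text, index):
    """Scan left to right, taking the longest index term at each position.

    Returns (position, term, gloss) tuples that never overlap.
    """
    terms = sorted(index, key=len, reverse=True)  # longest terms first
    hits, i = [], 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                hits.append((i, term, index[term]))
                i += len(term)  # skip past the match so 刀 in 太刀 is not re-tagged
                break
        else:
            i += 1  # no term starts here
    return hits


def annotate(text, hits):
    """Insert "[gloss]" directly after each matched term."""
    out, last = [], 0
    for pos, term, gloss in hits:
        end = pos + len(term)
        out.append(text[last:end])
        out.append(f"[{gloss}]")
        last = end
    out.append(text[last:])
    return "".join(out)


line = "太刀と見た札もありましたが、身幅に比して"
hits = tag_longest(line, INDEX)
print(hits)
print(annotate(line, hits))
```

Longest-match does not fix every case. A compound that is missing from the index, such as 古刀, would still have its 刀 tagged as Katana, so the quality of the result still depends on how complete the index is.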
