[tlug] searching for kanji strings, ignore punctuation and end of lines: Perl Solution and comments

Date: Wed, 18 Jan 2006 17:28:06 +0900
From: David Riggs <dariggs@example.com>
Subject: [tlug] searching for kanji strings, ignore punctuation and end of lines: Perl Solution and comments

Thanks for help from Edward, Steven, Josh et al.

I have a solution to my surprisingly pesky problem (see the solution
after the >>). To recap the problem:

<<review---------

I have a quote that is just a
string of kanji, and I am looking for where it came from. I do have an
etext version of the canon (several hundred megabytes and thousands of
files), in utf8, which most likely contains this phrase.

The problem is that the etext inserts a special "space" or a maru
(i.e. a Unicode period, a little circle) at random places, trying to
make it easier to read, and breaks lines at unlikely places. Together
these make the phrase impossible to find with grep.

I can assume that two lines are enough to look at, and there is
actually no ASCII white space, just those two Unicode characters that
get in the way.

Example, using ABCDEF for a six kanji phrase I am looking for, and 
"ghijklmnopq..." for other kanji that happen to be on the line.  And "." 
for the maru:

p0001a05(00)-ghi.jklmn.op.rs.AB.
p0001a06(00)-CD.EFtuvw.xyz.

If you are set to unicode, here is a real snippet from the CBETA canon:

p0001b16(00)|　念彌勒佛緣　念佛三昧緣
p0001b17(00)|　　普敬述意緣第一
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。

And I am searching for e.g. 揚之德故十 , which goes over line breaks and 
maru.

 >>end of review----------------------

Solution:

Following Steven's (and others') general approach, I simply make a search
argument with optional punctuation, newline, and line number characters
between each and every kanji (the $w = '[...]*' below). The hard part was
that newline processing is not taken care of by perl in quite such an 
easy way. In fact, the text is from DOS and hence has DOS \015\012 line 
breaks. Looking at this in emacs it shows as simple \012, but perl sees 
and insists on having both \015 and \012 specified. As has so often 
happened to me, I get all twisted up with new lines, especially when 
crossing platforms. Once I figured that out, it was just a matter of 
learning enough perl to figure out the syntax.
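
To make the line-ending point concrete, here is a minimal sketch. The
sample string is made up, with A-D standing in for kanji as in the
example above. A pattern that allows only \012 between two characters
fails on DOS text, while one that also allows an optional \015 matches:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: two lines joined by a DOS line break (\015\012).
my $dos_text = "AB\015\012CD";

# Allowing only \012 between B and C fails, because \015 is in the way.
my $lf_only = ($dos_text =~ /B\012C/) ? 1 : 0;

# Allowing an optional \015 before the \012 matches DOS and Unix text alike.
my $crlf_ok = ($dos_text =~ /B\015?\012C/) ? 1 : 0;

print "LF only: $lf_only, CR?LF: $crlf_ok\n";
```

Putting both \n and \015 inside the ignore-this character class, as the
script below does, covers the same ground.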

I slurp in the whole file with -0777 (thanks Edward), set my special
ignore-this string to $w in the BEGIN block, then loop over all the
files globbed (thanks to the -n switch). The /xo switches on the perl
match are there so I can put in white space for readability, and so the
search argument is not recompiled each time for the $w variable. I do
have to remember to print out the name of the current file, $ARGV!
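
For anyone who prefers to see the switches spelled out, here is a rough
sketch of what -0777 and -n arrange, written as an ordinary script. The
names slurp_file and files_matching are made up for illustration, and
the pattern (a bare DOS line break) is only a stand-in for the real
search:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read one whole file into a single string,
# which is what -0777 arranges for each file under -n.
sub slurp_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path: $!";
    local $/;                  # undefined record separator = slurp mode
    my $text = <$fh>;
    close($fh);
    return $text;
}

# Hypothetical helper: return the names of the files whose whole
# contents match the given pattern.
sub files_matching {
    my ($re, @files) = @_;
    return grep { slurp_file($_) =~ $re } @files;
}

# Roughly what -n does: visit every file named on the command line.
# Stand-in pattern: report files that contain a DOS line break.
print "$_\n" for files_matching(qr/\015\012/, @ARGV);
```

The one-liner switches do the same job with much less typing, which is
why I use them.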

Here is my little perl-lette, already set for a particular search (not 
the one above, sorry).  Put in my ~/bin and invoked with: -> cbsearch 
fileglob

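#!/usr/bin/perl -0777 -n
# $w: any run of line-number characters, maru, ideographic space, CR or LF
BEGIN {$w = '[0-9pabc()|。　\n\015]*'}
if (/\n$w.*
相$w弟$w子$w有$w稱$w揚$w之$w德$w
故$w十
.*/xo){print $ARGV, $&, "\n";}   # $& begins with \n, so the name gets its own line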

It prints out (for this example), the file name and the text in the file:

t54n2123.txt
p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。
p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。

This kind of thing, to put it mildly, is fabulously useful to me.


The ugly part is that I have to go edit the perl script file each time, 
and do a little emacs deal to insert the $w between each kanji. Still, 
it works!
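
The hand editing could probably be avoided by building the pattern from
the phrase itself. What follows is only a sketch under some
assumptions: build_pattern is a made-up helper, the text is handled as
decoded characters (hence "use utf8"), and the sample is the CBETA
snippet from the review with DOS line breaks added by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # literals below are UTF-8 source text
binmode(STDOUT, ':encoding(UTF-8)');

# Same filler as $w in my script: line-number characters, maru,
# ideographic space, and the \012 / \015 line-break characters.
my $w = '[0-9pabc()|。　\012\015]*';

# Hypothetical helper: allow the filler between every pair of
# characters of an already-decoded phrase.
sub build_pattern {
    my ($phrase) = @_;
    my $body = join $w, map { quotemeta } split //, $phrase;
    return qr/$body/;
}

# The CBETA snippet from the review, with DOS line breaks added.
my $sample = "p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。\015\012"
           . "p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。\015\012";

my $re    = build_pattern('揚之德故十');
my $found = ($sample =~ $re) ? 1 : 0;
print $found ? "found: $&\n" : "not found\n";
```

To take the phrase from the command line it would first need decoding
(for instance with decode('UTF-8', ...) from the Encode module), and the
files would have to be read as UTF-8 as well.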

But hmm, slow. A good 60 seconds for the above example, on my three year 
old Toshiba laptop.

Any suggestions about speeding up would be appreciated. I have looked at
Namazu a bit, but it's not clear to me that it is set up for this kind of
thing. It's not really words we are talking about here, and the point is
to ignore punctuation, not use it to make syntactic units like Namazu
does. (These texts are not punctuated in the original, and old writers
quote them either without punctuation or making up some of their own.)
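
One direction that might be worth trying, sketched here purely as an
assumption and not measured on the canon: delete the ignore-this
characters first, then do a plain substring search with index. This is
a different technique from the regex above, it only says whether a file
contains the phrase, and contains_phrase is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # literals below are UTF-8 source text

# Hypothetical helper: strip line-number characters, maru, ideographic
# space and line breaks, then look for the phrase as a plain substring.
sub contains_phrase {
    my ($text, $phrase) = @_;
    (my $clean = $text) =~ s/[0-9pabc()|。　\012\015]+//g;
    return index($clean, $phrase) >= 0 ? 1 : 0;
}

# The CBETA snippet from the review, with DOS line breaks added.
my $sample = "p0001b18(00)|夫大聖有平等之相。弟子有稱揚之德。\015\012"
           . "p0001b19(02)|故十方諸佛。同出於淤泥之濁。三身正覺。\015\012";

my $hit  = contains_phrase($sample, '揚之德故十');
my $miss = contains_phrase($sample, '德揚之故十');
print "hit: $hit, miss: $miss\n";
```

Files that pass this check could then be handed to the slower regex to
get the actual lines and line numbers.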

Steven, are you serious, can you do something like this with egrep and 
elisp? That would be great. I would love to hear more.

Thanks everyone, especially all the perl from Edward.

David Riggs, Kyoto
