Detecting syllables programmatically

I know I’ve posted repeatedly about @accidentalhaiku, but random three-line poetry from strangers appeals to my sense of humor. For anyone who hasn’t seen it, the service scans the Twitter public timeline for tweets that fit the 5/7/5 syllable format of a haiku and retweets them. A selection from the last three hours:

RT @mz_jonez87 mercy i so hate / this crap and i want you to / be the first to know!

RT @Carlyyz trying to save a / little mouse that was attacked / by my neighbor’s cat 😦 #haiku

RT @Honeybee1971 sorry my spelling, / but i really hate not / having my cell phone. #haiku

RT @mjcostajr done done done done done / done! submitting my dental / school applications… #haiku

So I was wondering how to programmatically detect syllable boundaries, and it turns out to be pretty hard. After two minutes of trying to figure out the rule, it initially looks like a problem of vowels and consonants, with a few exceptions for word endings and double characters. It seems, though, that the exception is more often the rule, and a brief trawl of the web turns up the definitive paper by Frank M. Liang, whose hyphenation algorithm is implemented in TeX and just about every open text rendering system of the last 25 years. It uses a compressed codebook to store the rules – for English it’s about 25KB. There are several open source implementations of the original algorithm – I found a client-side JavaScript version which is good for reference.
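To make the "vowels and consonants plus exceptions" idea concrete, here is a minimal Python sketch of the naive rule. It is not Liang's algorithm and not what @accidentalhaiku uses; the function name `count_syllables` and the single silent-"e" exception are my own assumptions, chosen only to show how quickly the exceptions pile up.

```python
import re

# A run of consecutive vowels (treating "y" as a vowel) is assumed to be one syllable.
VOWEL_GROUP = re.compile(r"[aeiouy]+")
# Consonant + "le" at the end of a word ("little", "people") keeps its syllable.
SYLLABIC_LE = re.compile(r"[^aeiouy]le$")


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, then patch one word ending."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return 0
    count = len(VOWEL_GROUP.findall(w))
    # A trailing "e" is usually silent ("hate", "phone", "mouse"),
    # unless it belongs to a consonant + "le" ending.
    if w.endswith("e") and count > 1 and not SYLLABIC_LE.search(w):
        count -= 1
    return max(count, 1)


if __name__ == "__main__":
    for word in ["hate", "little", "mouse", "applications", "attacked"]:
        print(word, count_syllables(word))
```

This gets "hate", "little", "mouse" and "applications" right, but it counts "attacked" as three syllables instead of two, which is exactly the kind of exception that pushes you toward a rule codebook or a dictionary.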

Anyone have any other ideas for auto-detected accidental poetry?


3 responses to this post.

  1. Posted by Mariya Genzel on June 21, 2009 at 8:33 pm

    Hey, @accidentalhaiku here. I’m glad you enjoy it. I must say, the script was definitely harder to write than I thought it would be, because there are just lots of exceptions and I didn’t want any non-17-syllable-things to slip through. So I went with a 99.999999% approach, which is a dictionary approach. It is not 100% either, because certain words have different syllable counts depending on context (e.g., i like “it” 1-syll vs. i work in “IT” 2-syll). Other issues I had to deal with: making sure to exclude any non-English tweets, since any syllable count is language-dependent; making sure that each line ends at the end of a word and not in the middle of it, gah. Obviously, my bar for haiku is quite low. I’m considering adding scanning for a “seasonal” word (that functionality is already there, I’m just not sure whether decreasing the number of “haikus” shown is in fact a better thing for people other than haiku purists). The more important issue is semantic line break, and that is just way too complex a problem for me to deal with in a plaything script. I mean that each line should be a separate phrase; there are some basic heuristics but they produce tons of false positives & false negatives. I’m still pondering.


    • Hi Mariya – yes, you’ve set the bar pretty high and as there is so much data in the public timeline you can afford as many false negatives as you want to maintain the quality. I probably laugh out loud (no, really) at these for the following reasons:

      * author is using twitter as a means to convey teen angst, or the new form of this, political angst, yet ends up writing a short poem for the amusement of others.
      * author is tweeting about the lameness of twitter, or about not wanting to tweet – itself laughable, but not entertaining unless formed in 17 syllables.
      * author uses unexpected, perhaps repeated, words as in the “done done done” example above. I liked your example of “it” vs. “IT”.

      I think you should just add some kind of “w00t” for accidental seasonal haiku – the gold star in unexpected twitter poetry. As for the phrase breaking – perhaps we could get someone from Microsoft to donate their grammar engine?


  2. Posted by Mariya Genzel on June 22, 2009 at 3:00 pm

    “woot” method is what i was actually considering. will get right on it 🙂
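For anyone curious what the dictionary approach from the comments might look like, here is a rough Python sketch. It is a guess at the general shape, not Mariya's actual script: the `SYLLABLES` table and the `split_haiku` function are hypothetical, and the tiny dictionary only covers one of the tweets quoted above. It rejects any tweet containing an unknown word, and any tweet where a word would straddle a 5/7/5 line boundary.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical hand-built dictionary; a real one would hold many thousands of words.
SYLLABLES: Dict[str, int] = {
    "trying": 2, "to": 1, "save": 1, "a": 1,
    "little": 2, "mouse": 1, "that": 1, "was": 1, "attacked": 2,
    "by": 1, "my": 1, "neighbor's": 2, "cat": 1,
}


def split_haiku(
    text: str,
    syllables: Dict[str, int] = SYLLABLES,
    pattern: Sequence[int] = (5, 7, 5),
) -> Optional[List[str]]:
    """Return the three lines if text fits the pattern on word boundaries, else None."""
    lines: List[str] = []
    current: List[str] = []
    total = 0
    for word in text.split():
        key = word.lower().strip(".,!?\"'")
        if key not in syllables:
            return None  # unknown word: reject rather than guess
        if len(lines) == len(pattern):
            return None  # words left over after the last line
        current.append(word)
        total += syllables[key]
        target = pattern[len(lines)]
        if total == target:
            lines.append(" ".join(current))
            current, total = [], 0
        elif total > target:
            return None  # a word would be split across two lines
    # Accept only if every line was filled and nothing is left dangling.
    return lines if len(lines) == len(pattern) and not current else None


if __name__ == "__main__":
    tweet = "trying to save a little mouse that was attacked by my neighbor's cat"
    print(split_haiku(tweet))
```

Rejecting on any unknown word costs a lot of false negatives, but with the whole public timeline as input that seems like the right trade for keeping non-haiku out.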
