.NET Discussion

.NET Issues, Problems, Code Samples, and Fixes

My Foray into Regular Expressions with ASP.NET

I know it’s been a while since I posted here. I’ve been working pretty diligently on my newest creation, Quotidian Word, which will essentially be a “word-a-day” site that encourages users to actually learn the word and use it actively in everyday speech or writing. Right now it’s only an email harvester so that I can send an email to those interested when it’s ready, but I’m nearing the launch point every day 🙂

Anyways, the point of this post is to discuss my dip into the world of Regular Expressions. I needed to come up with a way to find the word I wanted in a sentence. Easy enough, right? Just mySentence.Contains("myword") and there you go. It would be nice if that’s all I had to do, but English is a funny language. Every word can assume many forms. For instance, if I want to find the word “entry” in a sentence, I also would like to find “entries”. Or more simply, if I’m looking for “dog” I also want to find “dogs”. Using the mySentence.Contains() method, I would not find “entries” at all, and I would find only the “dog” in “dogs” rather than the whole word.
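To make the problem concrete, here is a small sketch of that substring check. It uses Python's "in" operator as a rough stand-in for .NET's String.Contains (Python is used here purely for illustration), and the sample sentences are made up.

```python
# Made-up sample sentences for illustration.
plural_sentence = "She made two entries in the log."
dogs_sentence = "The dogs barked all night."

# Plain substring search, roughly what String.Contains does in .NET.
found_entry = "entry" in plural_sentence
found_dog = "dog" in dogs_sentence

print(found_entry)  # False: "entries" never contains the letters "entry"
print(found_dog)    # True, but it only hit the "dog" inside "dogs"

# The substring check returns just the letters searched for, not the whole word.
start = dogs_sentence.find("dog")
print(dogs_sentence[start:start + len("dog")])  # dog
```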

Enter regular expressions.

I had previously shied away from them because when one looks at a regular expression (aka “regex”), it can be rather intimidating. Take this stock regular expression that comes with Visual Studio as a default for finding an email address:

\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

Your reaction may be the same as mine was: WTF. But when you try to enter an email in a textbox validated by this regex, it knows whether what you typed is an email address or not. So my problem was still, how do I find all forms of any word I choose? I took a couple of factors into account: I will mostly be using more obscure words with less common roots, so that will help a bit, and I won’t be using too many really short words that can be blended into other words in a sentence.

My first thought was ok, how about I first try to find the whole word, then the whole word plus a suffix, then the whole word minus a letter plus a suffix, then the whole word minus two letters plus a suffix:

\bentry\b|((\bentry|\bentr|\bent)(s|es|ies)\b)

I used the word “entry” as an example here. “\b” matches a word boundary, meaning either the beginning or the end of a word, and “|” means “or”. The rest is just grouped by parens. This worked out ok for a while until I realized that the English language has something like a thousand suffixes. I knew regexes were more powerful than that. There had to be an easier way.
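Here is a quick sketch of that first attempt in action, using Python's re module as a stand-in for the .NET regex engine (the two behave the same for this pattern, as far as I know). The helper name find_form and the sample sentences are made up for illustration.

```python
import re

# The first-attempt pattern from above, unchanged.
FIRST_TRY = re.compile(r"\bentry\b|((\bentry|\bentr|\bent)(s|es|ies)\b)")

def find_form(sentence):
    """Return the first matched form of "entry", or None (hypothetical helper)."""
    m = FIRST_TRY.search(sentence)
    return m.group(0) if m else None

print(find_form("Read the first entry."))        # entry
print(find_form("Read all of the entries."))     # entries
print(find_form("The entryway was blocked."))    # None: "way" is not a listed suffix
```

Every new suffix means another entry in that last group, which is exactly the part that doesn't scale.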

And there was! I was sort of on the right track with the losing of the last two letters. In English, most words either simply append a suffix (dog -> dogs), drop one letter and append a suffix (happy -> happiness) or drop two letters and append a suffix (cactus -> cacti). Aside from the word “person” (person -> people) I could not think of an instance where a word dropped three or more letters and would still provide me with enough remaining information to actually distinguish it from other words in the sentence. If you can, let me know.
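The "drop the last two letters" idea can be checked with a few lines. This is only a sketch in Python (used for illustration); root is a made-up helper name, and the word pairs are the ones mentioned above.

```python
def root(word):
    """Return the word minus its last two letters (hypothetical helper)."""
    return word[:-2]

pairs = [("dog", "dogs"), ("happy", "happiness"),
         ("cactus", "cacti"), ("person", "people")]

for base, form in pairs:
    # A form is "findable" here if it still starts with the base word's root.
    print(base, "->", form, "| root:", root(base), "| kept:", form.startswith(root(base)))
```

The first three pairs keep their root, while “people” does not start with “pers”, which is the exception noted above. It also shows why very short words are risky: the root of “dog” is just “d”.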

So after a bit more research, and the testing from a handy regex tool called Expresso, my regex eventually evolved into: (?:ent(?:r|ry)?).+?\b

Explanation: “ent” is what I called the “root” of the word, essentially the letters remaining after the last two have been stripped off. “(?:” opens a non-capturing group, which means “treat whatever’s in here as one unit, but don’t save it as a separate sub-match”. This way “ent” or “r” or “ry” don’t get captured on their own; they’re just pieces of the one overall match. The clause “(?:r|ry)?” means find “r” OR “ry” (in addition to what comes before it, so “entr” or “entry”). The “?” at the end means the whole clause before it is optional, meaning if it’s not there, it’s ok. The “.+” means find any character after the previous clause, one or more times, so for instance, if the word in the sentence is “entries” it will first find the “ent” then the “r” then the characters that follow, “ies”, up until the end of the word, “\b”. The “?” at the end of the “.+” makes it lazy: take as few characters as possible before the next clause, which is the “\b”, can match.
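A short sketch of that pattern at work, again with Python's re module standing in for the .NET engine. The helper first_match and the sentences are made up; the last line shows the kind of false hit that short roots can produce when they hide inside another word.

```python
import re

# The root-based pattern from above, unchanged.
ENTRY_FORMS = re.compile(r"(?:ent(?:r|ry)?).+?\b")

def first_match(sentence):
    """Return the text of the first match, or None (hypothetical helper)."""
    m = ENTRY_FORMS.search(sentence)
    return m.group(0) if m else None

print(first_match("Read the first entry."))     # entry
print(first_match("Read all of the entries."))  # entries

# There is no "\b" in front of the root, so "ent" inside "went" also matches,
# and the lazy ".+?" then swallows the following space.
print(repr(first_match("She went home.")))      # 'ent '
```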

Whew! That’s a lot. I found that this caught about 99% of the words that I would be using in all their various forms. But then I got to thinking, what about prefixes? What if someone used something like “anti” or “pre” or something in front of a word to change it just slightly? Hence, my (nearly) final product:

(?:(?:\b\w+)?{1}(?:{2}|{3})?).+?\b

where {1} is the word minus the last two letters, {2} is the penultimate letter, and {3} is the last two letters. The optional clause at the beginning takes care of any prefix if it happens to exist.
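Filling in that template for an arbitrary word might look something like the sketch below. It is written in Python for illustration and is not the actual site code; build_pattern is a made-up name, and the length check and the escaping of the word's letters are my own assumptions.

```python
import re

# The template from above, with {1}, {2} and {3} as placeholders.
TEMPLATE = r"(?:(?:\b\w+)?{1}(?:{2}|{3})?).+?\b"

def build_pattern(word):
    """Fill the template for one word (hypothetical helper)."""
    if len(word) < 3:
        # Assumption: shorter words leave no root to search for.
        raise ValueError("need at least three letters to leave a root")
    root = re.escape(word[:-2])        # {1}: the word minus the last two letters
    penultimate = re.escape(word[-2])  # {2}: the penultimate letter
    last_two = re.escape(word[-2:])    # {3}: the last two letters
    return (TEMPLATE.replace("{1}", root)
                    .replace("{2}", penultimate)
                    .replace("{3}", last_two))

pattern = build_pattern("entry")
print(pattern)  # (?:(?:\b\w+)?ent(?:r|ry)?).+?\b

# The optional "\b\w+" in front picks up a prefix such as "re".
print(re.search(pattern, "Several reentries were logged.").group(0))  # reentries
```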

Great! All done. It can pull it out of a string, no problem. Now if someone enters the word in a sentence in my textbox it will find…it… crap. It doesn’t work as is with the RegularExpressionValidator in ASP.NET on textboxes. Why? Because the validator is looking at the whole string in the textbox to see if it fits the regular expression. For instance, the email regular expression assumes that whatever you enter into that textbox is going to be an email, nothing else. In the same way, if you enter a sentence into a textbox, the validator assumes that the entire sentence, start to finish, is going to fit the regex.

In order to counter this problem, I put in a simple fix: I prepended and appended my regex for textboxes with “.*”, which means that it will find any characters before or after the word we’re looking for. Done! … Right?
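The difference between finding the word somewhere in a string and requiring the whole string to fit can be sketched like this. Python's re.fullmatch is used here as an approximation of the validator's whole-string check, which is an assumption on my part about how closely the two line up; the sentence is made up.

```python
import re

# The pattern for "entry", with the template placeholders filled in.
CORE = r"(?:(?:\b\w+)?ent(?:r|ry)?).+?\b"
sentence = "I made an entry in my journal today."

# Searching anywhere in the string finds the word.
found_anywhere = re.search(CORE, sentence) is not None

# Requiring the WHOLE string to fit the pattern fails...
fits_bare = re.fullmatch(CORE, sentence) is not None

# ...until ".*" soaks up whatever comes before and after the word.
fits_wrapped = re.fullmatch(".*" + CORE + ".*", sentence) is not None

print(found_anywhere, fits_bare, fits_wrapped)  # True False True
```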

So I thought, until I realized that when people enter sentences, usually they do so in multiline textboxes, as those make it easier to see everything you’re typing. This regex works until the user hits the “Enter” key and inserts a new line. After much research and hair pulling, I eventually found the solution:

^(.|\r|\n)*(?:(?:\b\w+)?{1}(?:{2}|{3})?).+?\b(.|\r|\n)*$

with {1}, {2}, and {3} meaning the same thing. The interesting thing about the “.” character in regexes is it means “match any character... except new lines” (and, in the JavaScript engine the validator uses in the browser, carriage returns too), which means if you’re using multiline textboxes, you have to account for that. So I had to prepend “^(.|\r|\n)*” and append “(.|\r|\n)*$” to my already unwieldy regex. “\r” is a carriage return and “\n” is a new line. The “^” symbol means start at the beginning of the string, and the “$” means continue to the end of the string, with the “*” symbols meaning repeat as often as necessary.
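Here is a small sketch of the newline problem and the fix, once more in Python for illustration. Python's “.” skips only “\n” by default, so it is not identical to every engine, but the failure and the cure look the same. The text is made up.

```python
import re

# The pattern for "entry", with the template placeholders filled in.
CORE = r"(?:(?:\b\w+)?ent(?:r|ry)?).+?\b"
SINGLE_LINE = ".*" + CORE + ".*"
MULTI_LINE = r"^(.|\r|\n)*" + CORE + r"(.|\r|\n)*$"

# What a multiline textbox might send: lines separated by "\r\n".
text = "Dear diary,\r\nI made an entry today.\r\nGood night."

# ".*" cannot step over the "\n", so the whole-string check fails.
single_ok = re.fullmatch(SINGLE_LINE, text) is not None

# "(.|\r|\n)*" explicitly allows line breaks on both sides of the word.
multi_ok = re.fullmatch(MULTI_LINE, text) is not None

print(single_ok, multi_ok)  # False True
```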

To finalize the product, I simply plugged all of this information into a shared function that returns the proper regex (based on a parameter that determines if I’m using it on a regular string, a textbox, or a multiline textbox) and set my validator’s ValidationExpression at runtime based on the word I was looking for. It actually works pretty well… so far…. 🙂
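For the curious, a function along those lines could be sketched as follows. This is not the actual VB.NET function, just a Python approximation of what is described above; the name word_regex, the target values, and the length check are all made up for illustration.

```python
import re

def word_regex(word, target="string"):
    """Build the regex for a word (hypothetical stand-in for the shared function).

    target is "string", "textbox", or "multiline".
    """
    if len(word) < 3:
        # Assumption: shorter words leave no root to search for.
        raise ValueError("need at least three letters to leave a root")
    root = re.escape(word[:-2])        # {1}
    penultimate = re.escape(word[-2])  # {2}
    last_two = re.escape(word[-2:])    # {3}
    core = rf"(?:(?:\b\w+)?{root}(?:{penultimate}|{last_two})?).+?\b"

    if target == "string":
        return core
    if target == "textbox":
        return ".*" + core + ".*"
    if target == "multiline":
        return r"^(.|\r|\n)*" + core + r"(.|\r|\n)*$"
    raise ValueError(f"unknown target: {target!r}")

print(word_regex("entry", "textbox"))  # .*(?:(?:\b\w+)?ent(?:r|ry)?).+?\b.*
```

The returned string is what would be assigned to the validator's ValidationExpression for the word of the day.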

All in all, I think I like regexes. They are undoubtedly powerful and can save many programming hours if you know how to use them. I know the initial function I was using to find words in a sentence took me maybe two hours to write and it didn’t even work all the time. It was also about 200ish lines of code. My new function using regexes is 17 lines of actual code, and 7 of those are a Select statement for string/textbox/multiline textbox.

While the initial learning curve is quite steep for regexes, if you’re serious about programming, I highly suggest that you take a day or two to learn them, as that day or two investment could save you weeks or months down the line of your programming career.


What’s your craziest regex?


May 15, 2009 Posted by | ASP.NET, Javascript, Regex, VB.NET, Visual Studio.NET | 1 Comment