Tuesday 12 April 2011

.NET: Break a String into Words Using the Char.IsLetter(c) Method

Question:
How do I split a string into words in C#?
How can I tell whether a Unicode char is a letter or an alphanumeric character?


Answer:
To break a string into words, you must first decide which chars are word breakers and which are not.
One way (there are many others) to decide this is to use the Char.IsLetter() method, which returns true if "the specified Unicode character is categorized as a Unicode letter".

In the following example I use Char.IsLetterOrDigit(c) instead. That also counts digits as part of a word, which might come in handy.
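To make the difference between the two checks concrete, here is a minimal sketch that classifies a few sample characters. The class name CharCategoryDemo and the helper Describe are hypothetical names made up for this illustration; only Char.IsLetter, Char.IsDigit and Char.IsLetterOrDigit come from the framework.

```csharp
using System;

// Hypothetical demo class, not part of the original post.
public static class CharCategoryDemo
{
    // Returns "letter", "digit" or "other" for a single char.
    public static string Describe(char c)
    {
        if (Char.IsLetter(c)) return "letter";   // any Unicode letter category
        if (Char.IsDigit(c)) return "digit";     // decimal digits
        return "other";                          // punctuation, whitespace, symbols, ...
    }

    public static void Main()
    {
        // ASCII letter, e-acute, Greek capital omega, digit, hyphen, underscore, space
        char[] samples = { 'a', '\u00E9', '\u03A9', '7', '-', '_', ' ' };
        foreach (char c in samples)
        {
            Console.WriteLine("U+{0:X4}: {1}, IsLetterOrDigit={2}",
                (int)c, Describe(c), Char.IsLetterOrDigit(c));
        }
    }
}
```

The accented and Greek characters count as letters, the digit is only accepted by Char.IsLetterOrDigit, and '-' and '_' are rejected by both checks, which is why a word splitter has to test them separately if they should stay inside words.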

Here is an example (it needs the System and System.Text namespaces):
private string[] BreakWords(string p)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in p)
    {
        // we do not want '-' and '_' to break words either, so we test for them too
        if (!Char.IsLetterOrDigit(c) && c != '-' && c != '_')
        {
            sb.Append(" ");
        }
        else
        {
            sb.Append(c);
        }
    }
    string replStr = sb.ToString();
    string[] parts = replStr.Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
    return parts;
}
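As a usage sketch, the method above can be dropped into a small console program. The wrapper class WordBreakerDemo and the sample sentence are assumptions made up for this illustration, and the method is made public static here only so that Main can call it directly.

```csharp
using System;
using System.Text;

// Hypothetical wrapper class for trying out the method from the post.
public static class WordBreakerDemo
{
    public static string[] BreakWords(string p)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in p)
        {
            // Letters, digits, '-' and '_' stay; everything else becomes a space.
            if (!Char.IsLetterOrDigit(c) && c != '-' && c != '_')
            {
                sb.Append(" ");
            }
            else
            {
                sb.Append(c);
            }
        }
        // Split on the spaces and drop the empty entries left by runs of separators.
        return sb.ToString().Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string[] words = BreakWords("Hello, world! It's a well-known_fact: 42 is the answer.");
        foreach (string w in words)
        {
            Console.WriteLine(w);
        }
    }
}
```

For this input the program prints ten words, one per line: Hello, world, It, s, a, well-known_fact, 42, is, the, answer. Note that the apostrophe is treated as a word breaker, so "It's" comes out as "It" and "s"; if that is not wanted, the apostrophe can be added to the test alongside '-' and '_'.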


Happy programming ...
