regex - Regular expression that finds same words within 'n' words of each other

  • 2014-07-08 (post meridiem)

    I'm trying to put together a regular expression search that finds any two (or more) occurrences of the same word that are within n (e.g., more than 1, less than 5) words of each other. The goal is to search over a prose text and find unneeded repetitions of words close to each other.

    Example: in the following text, the search should identify "package":

    The postman delivered a package, and the package was heavy.

    The challenge is that the repeated word can be any word, but both occurrences must be the same word. I've been trying to figure out a way to work with * or + (I'm fairly new to regular expressions), but of course, wildcards would match every word, so they don't work. Is there any search structure like "$1 within n of $1" that would translate to regex?

  • Answers
  • slhck

    I don't think a regex is what you need here; you cannot express that unless you know the words beforehand.

    So, I guess you could extract every word from the text and build a list of distinct words (e.g., by sorting them and then removing duplicates). Then you run the following regular expression once for every word found (here, the word is foo):

    \bfoo\W+(?:\w+\W+){1,5}?foo\b
    

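    Here, \b is a word boundary, followed by the actual word. After that, \W+ matches one or more non-word characters (spaces, punctuation). Then comes a group, surrounded by (), that matches one word plus the non-word characters after it (\w+\W+). The group can occur 1 to 5 times ({1,5}), and the trailing ? makes the repetition lazy, so the shortest gap is tried first. The (?: prefix means the group is not captured. Finally, the word must appear again, followed by another word boundary.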

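
    The per-word approach described above could be sketched in Python roughly as follows. This is only an illustration, not code from the answer: the function name find_close_repeats, its min_gap/max_gap parameters, and the choice to match case-sensitively are all assumptions made for the example.

```python
import re

def find_close_repeats(text, min_gap=1, max_gap=5):
    """Map each word to the first span where it recurs within the gap.

    Hypothetical helper: min_gap and max_gap are the number of other
    words allowed between the two occurrences (the {1,5} in the answer).
    """
    # Distinct words in the text (sorted, duplicates removed).
    words = sorted(set(re.findall(r"\w+", text)))
    # Lazy run of "word + separator" groups between the two occurrences.
    gap = r"(?:\w+\W+){%d,%d}?" % (min_gap, max_gap)
    hits = {}
    for word in words:
        w = re.escape(word)
        # Same shape as \bfoo\W+(?:\w+\W+){1,5}?foo\b, built per word.
        match = re.search(r"\b" + w + r"\W+" + gap + w + r"\b", text)
        if match:
            hits[word] = match.group(0)
    return hits

sample = "The postman delivered a package, and the package was heavy."
print(find_close_repeats(sample))
```

    Because the match is case-sensitive in this sketch, "The" and "the" count as different words; passing re.IGNORECASE to re.search would treat them as the same word if that is preferred.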


  • Related Question

    regex - how to do an OR on two words in a regular expression
  • Here Be Wolves

    I want to write a regex that matches either:

    insert #TempTable
    

    or

    update #TempTable
    

    How do I do it? I guess an imperfect way to go about it would be:

    [insertupda]{6} #TempTable
    

    While this does work in my situation, I want to know the right way to do this.

    Thanks :)


  • Related Answers
  • livibetter
    (insert|update) #TempTable
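
    To see why alternation is the safer choice, here is a small Python comparison of the two patterns from this thread. The sample strings are made up for illustration; "assert" is included because all of its letters happen to be in the character class.

```python
import re

# Character class from the question: any 6 letters drawn from "insertupda".
loose = re.compile(r"[insertupda]{6} #TempTable")
# Alternation from the answer: exactly "insert" or "update".
strict = re.compile(r"(insert|update) #TempTable")

samples = [
    "insert #TempTable",
    "update #TempTable",
    "assert #TempTable",  # false positive for the character class
    "delete #TempTable",
]

for s in samples:
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)))
```

    The character class only constrains which letters may appear, not their order, so it also accepts strings such as "assert #TempTable"; the alternation accepts only the two intended keywords.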