regex - Regular Expressions: How is group matching useful?

2014-07
  • sammyg

    I've decided to learn some regular expression basics. I am using the Regex One lessons online and I was stuck at lesson 11 for a while, but I think I got it now.

    This was the task.

    "Write a regular expression that matches only the filenames (not including extension) of the PDF files below."

    task            text                     capture
    capture text    file_a_record_file.pdf   file_a_record_file
    capture text    file_yesterday.pdf       file_yesterday
    skip text       testfile_fake.pdf.tmp
    

    There is an input field where you type in the pattern to complete the task. After some trial and error, this is what I came up with.

    ^(file_a_record_file)\.pdf$
    

    This will match the file name file_a_record_file.pdf but only "capture" the file_a_record_file. What's the difference?... between matching and "capturing"? And how is this useful? How is this "group matching"?

    Now this does work for the first file, but not for the second file. The task says I need to make a pattern that will match and capture the file name of both files, excluding the extension. So this is what I came up with next.

    ^(file_.*)\.pdf$
    

    Since both file names start with file_ I thought it would be a good idea to match against that, then tell it to match any characters that follow, then close the group with a parenthesis (the "group" is what's inside the parentheses, right?), escape the dot with a backslash, and end with the file name extension.

    Can this be described in a tighter way? The correct solutions are not given on the website, so I have nothing to check my answers against. It's a pity because I think this is a good introduction to regular expressions. The examples given for each lesson are sometimes hard to understand.

    And again, how is this useful? He mentions something about the command line. I think he means that it can be used to re-use commands or something... well, I don't really get what he's saying.

    Imagine that we have a command line tool that copies each file in a directory up to a server only if it doesn't exist there already, and prints each filename as a result. Now if I want to do another task on each of those filenames, then I will not only need a regular expression that will match the filename, but also some way to extract that information.

    Extracting information? What is he talking about? Can someone please tell me how this is useful and give me real world example?

  • Answers
  • terdon

    In the lesson you linked to, you are asked to write a regex that captures the file name of these two files

    file_a_record_file.pdf
    file_yesterday.pdf
    

    and skips

    testfile_fake.pdf.tmp
    

    The simplest regex to do that is

    (.*)\.pdf$
    

    This means: match any name that ends in .pdf, but capture only the part before the extension (the parentheses mark what gets captured).
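
    To make the difference between matching and capturing concrete, here is a minimal sketch in Python (a swapped-in language, not the lesson's own tool). It applies the same pattern to the three file names from the task; the variable names are only illustrative.

```python
import re

# Same pattern as above: capture everything before a trailing ".pdf".
pattern = re.compile(r"(.*)\.pdf$")

names = ["file_a_record_file.pdf", "file_yesterday.pdf", "testfile_fake.pdf.tmp"]

captured = {}
for name in names:
    m = pattern.search(name)
    # group(0) is the whole match; group(1) is only what the parentheses captured.
    captured[name] = (m.group(0), m.group(1)) if m else None

for name, result in captured.items():
    print(name, "->", result)
```

    The whole match still includes the .pdf extension, while the captured group holds only the bare file name. The .tmp file produces no match at all.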

    So, why is capturing useful? That depends on the program you are using these regexes with. Capturing patterns allows you to save what you have captured as a variable. For example, using Perl, the first captured pattern is $1, the second $2 etc:

    echo "Hello world" | perl -ne '/(.+) (.+)/; print "$2 $1\n"'
    

    This will print "world Hello" because the first pair of parentheses captured Hello and the second captured world. Since we then print $2 $1, the two captures come out swapped.
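
    The same idea can be sketched in Python, assuming its standard re module, where the captures are read from the match object instead of from $1 and $2:

```python
import re

m = re.search(r"(.+) (.+)", "Hello world")

# group(1) and group(2) play the role of Perl's $1 and $2.
first, second = m.group(1), m.group(2)

swapped = f"{second} {first}"
print(swapped)
```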

    Other regex implementations allow you to refer to the captured patterns using \1, \2 etc. For example, GNU sed:

    echo "Hello world" | sed 's/\(.*\) \(.*\)/\2 \1/'
    
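
    For comparison, a rough Python equivalent of the sed command uses re.sub, which also accepts \1 and \2 in the replacement string. The second call is an illustrative extra, not part of the lesson: it reuses the captured file name to build a new name with a different extension.

```python
import re

# \2 and \1 in the replacement refer back to the second and first captured groups.
swapped = re.sub(r"(.*) (.*)", r"\2 \1", "Hello world")
print(swapped)

# Reuse the captured base name to build a new file name (hypothetical example).
renamed = re.sub(r"^(.*)\.pdf$", r"\1.txt", "file_yesterday.pdf")
print(renamed)
```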

    So, in general, capturing patterns is useful when you need to refer to the captured text later on. This is known as back-referencing and is briefly explained a little later in the tutorials you are doing.


  • Related Question

    Windows: File copy/move with filename regular expressions?
  • Ian Boyd

    I basically want to run:

    C:\>xcopy [0-9]{13}\.(gif|jpg|png) s:\TargetFolder /s
    

    I know xcopy doesn't support regular-expression filename searches.

    I can't work out whether PowerShell has a cmdlet to copy files and, if it does, whether it supports regular-expression filename matching.

    Can anyone think of a way to perform a recursive file copy/move with regex filename matching?


  • Related Answers
  • Doltknuckle

    I like using only PowerShell commands when I can. After a bit of testing, this is the best I can do.

    $source = "C:\test" 
    $destination = "C:\test2" 
    $filter = [regex] "^[0-9]{6}\.(jpg|gif)"
    
    $bin = Get-ChildItem -Path $source | Where-Object {$_.Name -match $filter} 
    foreach ($item in $bin) {Copy-Item -Path $item.FullName -Destination $destination}
    

    The first three lines are just to make this easier to read; you can define the variables inside the actual commands if you want. The key to this code sample is the "Where-Object" command, which is a filter that accepts regular expression matching. It should be noted that regular expression support is a little weird. I found a PDF reference card here that has the supported characters on the left side.
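
    The filter-then-copy approach is not specific to PowerShell. As a rough sketch, here is the same idea in Python using only the standard library. The function name copy_matching and the temporary demo directories are hypothetical, and re.IGNORECASE is passed because PowerShell's -match is case-insensitive by default.

```python
import re
import shutil
import tempfile
from pathlib import Path


def copy_matching(source, destination, pattern):
    """Copy files directly inside `source` whose names match `pattern`.

    Returns the list of copied file names. Subfolders are not visited.
    """
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source.iterdir()):
        # Filter on the file name, like Where-Object {$_.Name -match $filter}.
        if item.is_file() and pattern.search(item.name):
            shutil.copy2(item, destination / item.name)
            copied.append(item.name)
    return copied


# Same filter as in the PowerShell sample above.
name_filter = re.compile(r"^[0-9]{6}\.(jpg|gif)", re.IGNORECASE)

# Demo in throwaway directories so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "test", Path(tmp) / "test2"
    src.mkdir()
    for name in ["123456.jpg", "654321.gif", "notes.txt", "12345.jpg"]:
        (src / name).touch()
    copied = copy_matching(src, dst, name_filter)
    destination_names = sorted(p.name for p in dst.iterdir())

print(copied)
```

    Only the two names with exactly six leading digits are copied. Like the PowerShell filter, the pattern is not anchored at the end, so a name such as 123456.jpg.bak would also pass.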

    [EDIT]

    As "@Johannes Rössel" mentioned, you can also reduce the last two lines down to a single line.

    ((Get-ChildItem -Path $source) -match $filter) | Copy-Item -Destination $destination
    

    The main difference is that Johannes's way does object filtering and my way does text filtering. When working with PowerShell, it's almost always better to use objects.

    [EDIT2]

    As @smoknheap mentioned, the above scripts will flatten out the folder structure and put all your files in one folder. I'm not sure if there is a switch that retains folder structure. I tried the -Recurse switch and it doesn't help. The only way I got this to work is to go back to string manipulation and add in folders to my filter.

    $bin = Get-ChildItem -Path $source -Recurse | Where-Object {($_.Name -match $filter) -or ($_.PSIsContainer)}
    foreach ($item in $bin) {
        Copy-Item -Path $item.FullName -Destination $item.FullName.ToString().Replace($source,$destination).Replace($item.Name,"")
        }
    

    I'm sure that there is a more elegant way to do this, but from my tests it works. It gathers everything and then filters for both name matches and folder objects. I had to use the ToString() method to gain access to the string manipulation.
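
    One way to avoid the string replacement is to compute each file's path relative to the source folder and re-create that path under the destination. Below is a minimal sketch of that idea in Python rather than PowerShell. The function name copy_tree_matching and the demo folder layout are hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path


def copy_tree_matching(source, destination, pattern):
    """Recursively copy files whose names match `pattern`, keeping subfolders.

    Returns the copied paths relative to `source`, using forward slashes.
    """
    copied = []
    for item in source.rglob("*"):
        if item.is_file() and pattern.search(item.name):
            # Path relative to the source root, e.g. sub/654321.gif
            relative = item.relative_to(source)
            target = destination / relative
            # Create the matching subfolder under the destination if needed.
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(item, target)
            copied.append(relative.as_posix())
    return copied


name_filter = re.compile(r"^[0-9]{6}\.(jpg|gif)", re.IGNORECASE)

# Demo in throwaway directories so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "test", Path(tmp) / "test2"
    (src / "sub" / "deeper").mkdir(parents=True)
    for rel in ["123456.jpg", "sub/654321.gif", "sub/deeper/111111.jpg", "sub/readme.txt"]:
        (src / rel).touch()
    copied = copy_tree_matching(src, dst, name_filter)
    preserved = (dst / "sub" / "deeper" / "111111.jpg").is_file()
    skipped = not (dst / "sub" / "readme.txt").exists()

print(sorted(copied))
```

    Because the target is built from the relative path, a folder name that happens to contain the file name or the source string cannot be mangled, which is a risk with chained Replace calls.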

    [EDIT3]

    If you want to print the paths to make sure everything is correct, you can use the "Write-Host" command. Here's code that will give you some hints as to what's going on.

    cls
    $source = "C:\test" 
    $destination = "C:\test2" 
    $filter = [regex] "^[0-9]{6}\.(jpg|gif)"
    
    $bin = Get-ChildItem -Path $source -Recurse | Where-Object {($_.Name -match $filter) -or ($_.PSIsContainer)}
    foreach ($item in $bin) {
        # Destination folder: swap the source root for the destination root, then drop the file name
        $target = $item.FullName.ToString().Replace($source,$destination).Replace($item.Name,"")
        # $( ) expands a property inside a double-quoted string
        Write-Host "
    ----
    Obj: $item
    Path: $($item.FullName)
    Destination: $target"
        Copy-Item -Path $item.FullName -Destination $target
        }
    

    This should print the relevant strings. If one of them comes out empty, you'll know which item is causing problems.

    Hope this helps

  • armannvg

    PowerShell is an excellent tool for that task. You can use the Copy-Item cmdlet for the copying process, and you can pipeline it with other cmdlets for complex copy commands. Here is someone that did exactly that :)

    Regular expressions use the .NET Regex class from the System.Text.RegularExpressions namespace; there are quick how-tos on this class.

    PowerShell also has the -match and -replace operators, which can be used when pipelining with Copy-Item.

    There are also tools to help you create the regex itself, e.g. RegexBuddy.

  • user33788

    Just an idea, but it needs some work:

    dir -r | ?{$_ -match '[0-9]{13}\.(gif|jpg|png)'} | %{xcopy $_.fullname c:\temp}