shell script - Retrival of read names from fastq file for a particular seq

06
2014-04
  • user3138373

    I have a fastq file.The reads are of length 75bp. This is how a fastq file looks like

    @ERR030899.2391 HWI-BRUNOP16X_0001:4:1:16426:3738#0/1
    NNCGTGTCAGTGGCCAGCAGCCCACACTGCGCATGGCTGATACTGTGGACCCCCTGGACTGGCTTTTTGGGGAGT
    +
    ###########################################################################
    @ERR030899.2392 HWI-BRUNOP16X_0001:4:1:16582:3744#0/1
    NNAAAGTCCTGCGCTGCGGAGGACAGGAAGCACCCCCTCAATAGCCAGCACCCACAGTGAGCTGAGCACTTACAG
    +
    ##'(''((((5/333**+)(10-11+1,,,,1/1/F<<<<FF:FFFFFF=FFFFFFFFFFFFFFFFFFFFFFFF#
    @ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1
    NNGGATTTAGCGGGGTGATGCCTGTTGGGGGCCCGTGCCCTCCTACTTGGGGGGCAGGGGCTAGGCTGCAGAGGT
    +
    ###########################################################################
    @ERR030899.2394 HWI-BRUNOP16X_0001:4:1:18194:3739#0/1
    NNACAAGCAATTTAGTGATAATGTCCAGAGGGCCAAGGATGCGGACCACCTTTTGCAGAACTCATATCTCGAGCA
    +
    ##*+*)'+++FFFFFFFFFF58588==?:?FFFFFFFFFFFFFF<FFFFFFFFFF=FFFFFFFFFFFFFF6=??;
    

    I have a small nucleotide sequence of around 30bp.

    Lets say this is my nucleotide seq

    CTGTTGGGGGCCCGTGC

    What I want to do is search for this sequence in the fastq file and extract the corresponding read name in which the sequence exists. So the read name in this case would be

    @ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1

    Also I want to allow a mismatch rate of 1 in the sequence.

    Any script or command line ?

    Thanks

    Regards

  • Answers
  • glenn jackman
    awk -v seq="CTGTTGGGGGCCCGTGC" '
      NR%4 == 1 {name = $0}
      NR%4 == 2 && index($0, seq) {print name}
    ' filename
    

    If, by "mismatch rate of 1", you want to be able to match any 29 out of those 30 (such as CTG.TGGGGGCCCGTGC, that's quite a bit more complex.

    ...

    Eh, not that much more complex:

    awk -v seq="CTGTTGGGGGCCCGTGC" '
      NR%4 == 1 {name=$0}
      NR%4 == 2 {
        if (index($0, seq))
          print "found seq \"" seq "\" in " name
        else
          for (i=1; i<=length(seq); i++) {
            patt = substr(seq, 1, i-1) "." substr(seq, i+1)
            if (match($0, patt)) {
              print "found pattern \"" patt "\" in " name
              break
            }
          }
      }
    ' filename
    

  • Related Question

    verify ASCII file with file command by shell scrript
  • jennifer

    With the file command I need to verify many files if they ASCII or other format

    Sometimes I get from file command:

      file1: ASCII English text
    

    And sometimes I get different answer from file command

      file2: Non-ISO extended-ASCII English text, with very long lines
    

    I am really not sure if there are other answers with different syntax

    My question is:

    I write the follwing ksh syntax to verify if file is a ASCII but I not sure if the

    following syntax is the optimal syntax in order to verify ASCII format?

       [[ ` file  $some_file | grep –c ASCII ` = 1 ]] && print "you have ascii file for sure"
    

    If someone have other suggestion to verify ASCII format for sure!, I will very glad to see that


  • Related Answers
  • ephemient
    if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
        echo "file contains non-ascii characters"
    else
        echo "file contains ascii characters only"
    fi
    
  • grawity

    How about...

    if file -ib "$file" | grep -Eqs '^text/plain(;|$)'; then
        echo "It's text/plain."
    fi
    

    I don't know how common is --mime-type; if it's standard, use

    if file -b --mime-type "$file" | grep -qs '^text/plain$'; then
    

    Alternatively grep -qs '^text/' for any text type.

  • Ian C.

    Since you're parsing the output with code I'd suggest using the -i option on file so it outputs MIME types instead human-friendly strings. The MIME type output is more regular and that makes it a little easier to deal with in code.

    As for the output types a look at man file says that:

    /usr/share/file/magic
        Default list of magic numbers
    
    /usr/share/file/magic.mime
        Default list of magic numbers, used to output  mime types
        when the -i option is specified.
    

    Take a look at those files for all the MIME types it can report to determine which types you'll care about when parsing the output from file. I suspect all you'll care is that the MIME type starts with text/.