Answers: shell script - Retrival of read names from fastq file for a particular seq

user3138373

I have a fastq file.The reads are of length 75bp. This is how a fastq file looks like

@ERR030899.2391 HWI-BRUNOP16X_0001:4:1:16426:3738#0/1
NNCGTGTCAGTGGCCAGCAGCCCACACTGCGCATGGCTGATACTGTGGACCCCCTGGACTGGCTTTTTGGGGAGT
+
###########################################################################
@ERR030899.2392 HWI-BRUNOP16X_0001:4:1:16582:3744#0/1
NNAAAGTCCTGCGCTGCGGAGGACAGGAAGCACCCCCTCAATAGCCAGCACCCACAGTGAGCTGAGCACTTACAG
+
##'(''((((5/333**+)(10-11+1,,,,1/1/F<<<<FF:FFFFFF=FFFFFFFFFFFFFFFFFFFFFFFF#
@ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1
NNGGATTTAGCGGGGTGATGCCTGTTGGGGGCCCGTGCCCTCCTACTTGGGGGGCAGGGGCTAGGCTGCAGAGGT
+
###########################################################################
@ERR030899.2394 HWI-BRUNOP16X_0001:4:1:18194:3739#0/1
NNACAAGCAATTTAGTGATAATGTCCAGAGGGCCAAGGATGCGGACCACCTTTTGCAGAACTCATATCTCGAGCA
+
##*+*)'+++FFFFFFFFFF58588==?:?FFFFFFFFFFFFFF<FFFFFFFFFF=FFFFFFFFFFFFFF6=??;

I have a small nucleotide sequence of around 30bp.

Lets say this is my nucleotide seq

CTGTTGGGGGCCCGTGC

What I want to do is search for this sequence in the fastq file and extract the corresponding read name in which the sequence exists. So the read name in this case would be

@ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1

Also I want to allow a mismatch rate of 1 in the sequence.

Any script or command line ?

Thanks

Regards

glenn jackman

awk -v seq="CTGTTGGGGGCCCGTGC" '
  NR%4 == 1 {name = $0}
  NR%4 == 2 && index($0, seq) {print name}
' filename

If, by "mismatch rate of 1", you want to be able to match any 29 out of those 30 (such as CTG.TGGGGGCCCGTGC, that's quite a bit more complex.

...

Eh, not that much more complex:

awk -v seq="CTGTTGGGGGCCCGTGC" '
  NR%4 == 1 {name=$0}
  NR%4 == 2 {
    if (index($0, seq))
      print "found seq \"" seq "\" in " name
    else
      for (i=1; i<=length(seq); i++) {
        patt = substr(seq, 1, i-1) "." substr(seq, i+1)
        if (match($0, patt)) {
          print "found pattern \"" patt "\" in " name
          break
        }
      }
  }
' filename

verify ASCII file with file command by shell scrript

shell-script

jennifer

With the file command I need to verify many files if they ASCII or other format

Sometimes I get from file command:

  file1: ASCII English text

And sometimes I get different answer from file command

  file2: Non-ISO extended-ASCII English text, with very long lines

I am really not sure if there are other answers with different syntax

My question is:

I write the follwing ksh syntax to verify if file is a ASCII but I not sure if the

following syntax is the optimal syntax in order to verify ASCII format?

   [[ ` file  $some_file | grep –c ASCII ` = 1 ]] && print "you have ascii file for sure"

If someone have other suggestion to verify ASCII format for sure!, I will very glad to see that

Related Answers

ephemient

if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
    echo "file contains non-ascii characters"
else
    echo "file contains ascii characters only"
fi

grawity

How about...

if file -ib "$file" | grep -Eqs '^text/plain(;|$)'; then
    echo "It's text/plain."
fi

I don't know how common is --mime-type; if it's standard, use

if file -b --mime-type "$file" | grep -qs '^text/plain$'; then

Alternatively grep -qs '^text/' for any text type.

Ian C.

Since you're parsing the output with code I'd suggest using the -i option on file so it outputs MIME types instead human-friendly strings. The MIME type output is more regular and that makes it a little easier to deal with in code.

As for the output types a look at man file says that:

/usr/share/file/magic
    Default list of magic numbers

/usr/share/file/magic.mime
    Default list of magic numbers, used to output  mime types
    when the -i option is specified.

Take a look at those files for all the MIME types it can report to determine which types you'll care about when parsing the output from file. I suspect all you'll care is that the MIME type starts with text/.

Home

shell script - Retrival of read names from fastq file for a particular seq