shell script - Retrival of read names from fastq file for a particular seq
2014-04
I have a fastq file.The reads are of length 75bp. This is how a fastq file looks like
@ERR030899.2391 HWI-BRUNOP16X_0001:4:1:16426:3738#0/1
NNCGTGTCAGTGGCCAGCAGCCCACACTGCGCATGGCTGATACTGTGGACCCCCTGGACTGGCTTTTTGGGGAGT
+
###########################################################################
@ERR030899.2392 HWI-BRUNOP16X_0001:4:1:16582:3744#0/1
NNAAAGTCCTGCGCTGCGGAGGACAGGAAGCACCCCCTCAATAGCCAGCACCCACAGTGAGCTGAGCACTTACAG
+
##'(''((((5/333**+)(10-11+1,,,,1/1/F<<<<FF:FFFFFF=FFFFFFFFFFFFFFFFFFFFFFFF#
@ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1
NNGGATTTAGCGGGGTGATGCCTGTTGGGGGCCCGTGCCCTCCTACTTGGGGGGCAGGGGCTAGGCTGCAGAGGT
+
###########################################################################
@ERR030899.2394 HWI-BRUNOP16X_0001:4:1:18194:3739#0/1
NNACAAGCAATTTAGTGATAATGTCCAGAGGGCCAAGGATGCGGACCACCTTTTGCAGAACTCATATCTCGAGCA
+
##*+*)'+++FFFFFFFFFF58588==?:?FFFFFFFFFFFFFF<FFFFFFFFFF=FFFFFFFFFFFFFF6=??;
I have a small nucleotide sequence of around 30bp.
Lets say this is my nucleotide seq
CTGTTGGGGGCCCGTGC
What I want to do is search for this sequence in the fastq file and extract the corresponding read name in which the sequence exists. So the read name in this case would be
@ERR030899.2393 HWI-BRUNOP16X_0001:4:1:16911:3745#0/1
Also I want to allow a mismatch rate of 1 in the sequence.
Any script or command line ?
Thanks
Regards
awk -v seq="CTGTTGGGGGCCCGTGC" '
NR%4 == 1 {name = $0}
NR%4 == 2 && index($0, seq) {print name}
' filename
If, by "mismatch rate of 1", you want to be able to match any 29 out of those 30 (such as CTG.TGGGGGCCCGTGC
, that's quite a bit more complex.
...
Eh, not that much more complex:
awk -v seq="CTGTTGGGGGCCCGTGC" '
NR%4 == 1 {name=$0}
NR%4 == 2 {
if (index($0, seq))
print "found seq \"" seq "\" in " name
else
for (i=1; i<=length(seq); i++) {
patt = substr(seq, 1, i-1) "." substr(seq, i+1)
if (match($0, patt)) {
print "found pattern \"" patt "\" in " name
break
}
}
}
' filename
With the file command I need to verify many files if they ASCII or other format
Sometimes I get from file command:
file1: ASCII English text
And sometimes I get different answer from file command
file2: Non-ISO extended-ASCII English text, with very long lines
I am really not sure if there are other answers with different syntax
My question is:
I write the follwing ksh syntax to verify if file is a ASCII but I not sure if the
following syntax is the optimal syntax in order to verify ASCII format?
[[ ` file $some_file | grep –c ASCII ` = 1 ]] && print "you have ascii file for sure"
If someone have other suggestion to verify ASCII format for sure!, I will very glad to see that
if LC_ALL=C grep -q '[^[:print:][:space:]]' file; then
echo "file contains non-ascii characters"
else
echo "file contains ascii characters only"
fi
How about...
if file -ib "$file" | grep -Eqs '^text/plain(;|$)'; then
echo "It's text/plain."
fi
I don't know how common is --mime-type
; if it's standard, use
if file -b --mime-type "$file" | grep -qs '^text/plain$'; then
Alternatively grep -qs '^text/'
for any text type.
Since you're parsing the output with code I'd suggest using the -i
option on file
so it outputs MIME types instead human-friendly strings. The MIME type output is more regular and that makes it a little easier to deal with in code.
As for the output types a look at man file says that:
/usr/share/file/magic
Default list of magic numbers
/usr/share/file/magic.mime
Default list of magic numbers, used to output mime types
when the -i option is specified.
Take a look at those files for all the MIME types it can report to determine which types you'll care about when parsing the output from file
. I suspect all you'll care is that the MIME type starts with text/
.