Nafiz Hamid    Blog

Solving the cd-hit -d flag problem

When you do cd-hit to filter for identical or near identical sequences in any bioinformatics pipeline, you get a .clstr file. Sometimes you need to parse that file for getting the sequences in the individual clusters. By default, in the cd-hit command, the -d flag is set to 0, which means you can only see the fasta sequence identifier until you hit a whitespace in your .clstr file. By putting in your preferred number with the -d flag you can extend the identifier to that many characters.

As of, May 5, 2016, the master branch of cd-hit in the github repo does not conform to the -d flag. As I needed the full fasta record description for each sequence in a cluster, this was indeed a headache. So, I searched for this problem. There was not much but an issue was indeed opened regarding the same problem in the github repo of cd-hit.

I found a solution there which involves changing one line in the cdhit-common.c++ file. You can find it in your cdhit-master directory which is basically the master branch of cd-hit in github. There, in the void SequenceDB::Read function you have the following lines -

int i = 0;
if( des.data[i] == '>' || des.data[i] == '@' || des.data[i] == '+' ) i += 1;
if( des.data[i] == ' ' or des.data[i] == '\t' ) i += 1;
if( options.des_len and options.des_len < des.size ) des.size = options.des_len;
while( i < des.size and ! isspace( des.data[i] ) ) i += 1;
des.data[i] = 0;
one.identifier = dummy.identifier = des.data;
} else {
    one += buffer;
}

Here, you need to change the 5th line which is a while loop. Delete that and insert the following line there -

while( i < des.size and ( des.data[i] != '\n') ) i += 1;  

Now, with the -d flag you can put the number as you wish and in the .clstr file you should get the fasta sequence identifiers up to that number. Here is a portion my .clstr file with the -d 300 flag -

>Cluster 0
0       702aa, >32.2 | product: bacteriocin immunity protein | note: no note | protein_id: CJQ31854.1 | CLSP01000002:19049-21157 | immunity... *
1       702aa, >34.2 | product: bacteriocin immunity protein | note: no note | protein_id: CJV34097.1 | CMAH01000025:21097-23205 | immunity... at 99.57%
2       702aa, >27.2 | product: bacteriocin immunity protein | note: no note | protein_id: COH18396.1 | CRGD01000006:80660-82768 | immunity... at 99.43%
3       702aa, >27.2 | product: bacteriocin immunity protein | note: no note | protein_id: COI25170.1 | CRIA01000001:244000-246108 | immunity... at 99.86%
4       702aa, >28.2 | product: bacteriocin immunity protein | note: no note | protein_id: COH18396.1 | CRGD01000006:80660-82768 | immunity... at 99.43%
5       702aa, >28.2 | product: bacteriocin immunity protein | note: no note | protein_id: CJN26788.1 | CLNU01000003:29923-32031 | immunity... at 99.72%
6       702aa, >28.2 | product: bacteriocin immunity protein | note: no note | protein_id: COI25170.1 | CRIA01000001:244000-246108 | immunity... at 99.86%
7       702aa, >28.2 | product: bacteriocin immunity protein | note: no note | protein_id: COJ04657.1 | CRJL01000001:149015-151123 | immunity... at 99.72%
8       702aa, >28.2 | product: bacteriocin immunity protein | note: no note | protein_id: COM73718.1 | CROP01000007:61006-63114 | immunity... at 99.72%
9       702aa, >39.2 | product: bacteriocin immunity protein | note: no note | protein_id: COA67976.1 | CHPO01000002:36267-38375 | immunity... at 97.58%
10      702aa, >36.2 | product: bacteriocin immunity protein | note: no note | protein_id: CJQ31854.1 | CLSP01000002:19049-21157 | immunity... at 100.00%
11      702aa, >35.2 | product: bacteriocin immunity protein | note: no note | protein_id: CIS40931.1 | CHNO01000004:74173-76281 | immunity... at 99.72%
12      702aa, >30.2 | product: bacteriocin immunity protein | note: no note | protein_id: CIS40931.1 | CHNO01000004:74173-76281 | immunity... at 99.72%
13      702aa, >30.2 | product: bacteriocin immunity protein | note: no note | protein_id: CJV34097.1 | CMAH01000025:21097-23205 | immunity... at 99.57%

Before, I was only getting up to the first whitespace in these seq descriptions like 0 702aa, >32.2.... Hope that helps you save some precious time while hammering your head against the wall.

comments powered by Disqus