Protocol Online logo
Top : New Forum Archives (2009-): : Bioinformatics and Biostatistics

excel - shortcut (Sep/07/2009 )


I have an excel file contaning the names of 2000 genes. I have another file containing 100 genes. Is there a short cut in excel that can quickly do a 'find' or 'search' to let me know if those 2000 genes contain the 100 genes?



I don't know how to do it in Excel -- at least not quickly -- but if you save both lists as text files containing one gene name per line, you can compare the two lists using Perl:

#!/usr/bin/perl -w
use strict;

# Union: all elements in both lists excluding duplicates
# Intersection: items that are in both lists
# Difference (symmetric): members of FIRST or SECOND but not of FIRST and SECOND
# Difference (simple): members of FIRST but not SECOND

open (FIRST, "gene_list_100.txt") or die;
open (SECOND, "gene_list_2000.txt") or die;

my @first = (<FIRST>);
chomp (@first);

my @second = (<SECOND>);
chomp (@second);

my @union = my @isect = my @sym_diff = ();
my %union = my %isect = my %count = ();

foreach my $e (@first, @second) {

foreach my $e (keys %count) {
push(@union, $e);
if ($count{$e} == 2) {
push @isect, $e;
} else {
push @sym_diff, $e;

my %seen;
my @first_only;
@seen{@second} = ();

foreach my $item (@first) {
push (@first_only, $item) unless exists $seen{$item};

@union = sort (@union);
@isect = sort (@isect);
@sym_diff = sort (@sym_diff);
@first_only = sort (@first_only);

print "Union: " . scalar(@union) . " " . join (" ", @union) . "\n";
print "Intersection: " . scalar(@isect) . " " . join (" ", @isect) . "\n";
print "Symmetrical Diff: " . scalar(@sym_diff) . " " . join (" ", @sym_diff) . "\n";
print "Simple Diff: " . scalar(@first_only) . " " . join (" ", @first_only) . "\n";

I guess in Excel you could have both gene lists in a single column, and then sort the column by name. Next to the column, you could enter a formula like:

=IF(A1=A2,"present"," ")

Drag the formula so that it covers the whole column, and count the number of times "present" appears. You could keep track of the two gene lists by changing the font color of the shorter one before sorting them on name.

Perl seems a better solution, though, unless someone better at Excel than me (read: almost anyone) has a better way...


How I do it...

Excel 2007 - A little late but place each list of genes in separate columns. Highlight each column and then go to conditional formatting and scroll down to highlight, with color the duplicates. Now go to your list of 2000 genes and sort via cell color. Now you will have the genes that match, highlighted and at the top.

Linux - cat file1 file2 | sort | uniq -d
This concatenates both files, sorts the information, and then prints out the duplicates.


If your 2000 genes are all in a same column, you can use the formula "MATCH". It will give you the raw where you can find your gene of interest...