Please help with volcano plots! Desperate - (Aug/17/2018 )

Can someone please advise me how to generate a volcano plot easily in software I am able to use, e.g. Excel or Prism?

 

I am really stuck!

 

Some step by step instructions would be great.

 

I have already calculated p values and log2 fold changes for my proteomics data in Excel. I just don't know where to generate my volcano plot. Some simple step-by-step advice for easy-to-use software would be really appreciated. I cannot use R - it's impossible.

 

Thanks in advance.

-Natalia KM-

I'm pretty sure that neither Excel nor GraphPad Prism can do volcano plots easily (though, just by googling, I found a tutorial for doing them in Excel using a paid add-in called XLSTAT). Why do you need a volcano plot? If you are doing microarray analysis, then CLC Genomics Workbench has them built in.

 

Note that if you want to get into the bioinformatics field in any significant way, then R really is the way to go. It takes a bit of fiddling to learn, but there are many tutorials and examples on the web, and lots of community support on many sites too.

-bob1-

I downloaded the trial version of XLSTAT but the instructions don't make much sense to me when I've opened their demo file.

 

I don't need to do an awful lot of bioinformatics but I do need to generate a few volcano plots for my proteomics data to show significance and fold change between different treatments. 

 

I was hoping I could find simple software where I could just input my calculated p values and log2 fold changes from Excel. I've looked at the online R tutorials but even the most basic functions require so much fiddling around. I'm not even sure how I can input the vast amount of data I have into R from Excel.

 

Any further help would be appreciated, even if it's a step-by-step guide to R.

-Natalia KM-

R isn't that hard - it just looks it at first. It helps to know that there are tons of people doing this, so you can just google the problem you have and see if the answers work for you. If you want to go this route, get R from your local mirror, and get the desktop version of RStudio; it will make your life easier.

 

Once you have installed and opened it, you should see 4 windows - two on the left and two on the right. The two on the right are for data and outputs, while the two on the left are for input commands. The top left is where you write a program (which you want to do, so that you use the same commands for all your analysis and don't have to remember exactly what you typed each time); the bottom left is the "execution" area, where the commands are run.

 

If you have an Excel file and want to get it into RStudio... On the top right window there should be a button saying "import dataset". Click on it and it'll give you some options. Use the one appropriate to your file format and browse to your file; it should show a preview of the file. Then click import (bottom right).

 

This should open a version of the file in the top left and run some commands in the bottom left. Blank cells in the file should be filled with "NA". The command should look something like:  your_file_name <- read_excel("//path/to/your/file.xlsx"). If it says something like "

 

The arrow (<-) is the assignment operator. You'll use it a lot in R: it tells R to store whatever is on the right under the name on the left. In this case it creates an object within R (a data table held in memory, not a file on disk) called "your_file_name" that contains the data from your Excel sheet. You will work with "your_file_name" for the next step(s). Note that names are case-sensitive in R - Gene and gene would be two different objects, so pay attention to capitalization.
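
As a small illustration of assignment and case sensitivity (the names gene and Gene are just placeholders, not anything from your data), something like this can be typed into the bottom left:

```r
# '<-' stores the value on the right under the name on the left
gene <- 5
Gene <- 10   # a different object: R treats upper and lower case as distinct

gene + Gene  # uses both objects, so the result is 15
```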

 

Now, here's where it gets a little more tricky - you will need some packages (these are add-ons for R that do particular jobs). Under the "Tools" menu, choose "Install Packages", and type "tidyverse" into the "Packages" box. Tidyverse is a collection of packages that help arrange data in R, and it includes a nice way of graphing things called "ggplot2".

 

For the next bit, I'm going to assume you have done the stats and have columns labeled "protein", "log2foldchange", "pvalue", and "padjusted". It is also helpful to have a column called "threshold" that marks which rows are statistically significant: TRUE (p < 0.05) or FALSE (p >= 0.05) - remember capitals are important!
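
If the sheet only has raw p values, the two extra columns could also be made in R once the data is loaded. This is only a sketch on a made-up table called toy (the numbers are invented), and it assumes you want a Benjamini-Hochberg adjustment with the 0.05 cutoff applied to the adjusted value; swap in pvalue if you would rather threshold on the raw p value.

```r
# Toy data standing in for the imported sheet (values are made up)
toy <- data.frame(
  protein        = c("P1", "P2", "P3", "P4"),
  log2foldchange = c(2.1, -0.3, -1.8, 0.1),
  pvalue         = c(0.001, 0.40, 0.004, 0.90)
)

# Benjamini-Hochberg adjustment; other methods are listed in ?p.adjust
toy$padjusted <- p.adjust(toy$pvalue, method = "BH")

# TRUE where the adjusted p value is below the 0.05 cutoff, FALSE otherwise
toy$threshold <- toy$padjusted < 0.05

toy  # print the table to check the new columns
```

With your own data you would use the name of your loaded table in place of toy.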

 

In the bottom left, type the two lines below (copy and paste if you prefer), and hit enter after each line if nothing happens. This loads the packages into R so it can use them. You will need to run these lines each time you start R, but once the packages are loaded you don't need to do it again in that session. The first line loads the tidyverse and the second loads readxl (which is used to read your file). Text following "#" is a comment - it can contain anything, and is used to help remember what you have done and why! Also note that R doesn't like spaces in object names or column titles. You can use something like "my_file" but not "my file". If you have spaces in your Excel column names, it is easiest to remove them before loading; depending on the import function, R may either replace each space with a "." or keep the space, which makes the column awkward to refer to. You can also type/copy & paste this into the top left and then save (it is then a mini-program!)

library(tidyverse)
library(readxl)

Below that put:

file_user <- file.choose() # choose file for reading
path_user <- dirname(file_user) # folder that contains the chosen file
setwd(path_user) # use that folder as the working directory

This should pop open a window to choose your working file and then set the working directory to where that file is, so that any outputs (e.g. graphs) will be returned to the same location as your initial file - it's bad programming form, but useful for you.

 

Now load the file into R. Change "your_file" and "file.xlsx" to something memorable for you ("file.xlsx" needs to be the name of your Excel file). You will need to change these names each time you analyze a new file, so that you don't overwrite the data:

your_file <- read_excel("file.xlsx") # read the Excel file into an object called your_file

Now look at the (top 6 rows of) data:

head(your_file)
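
Before plotting, it can help to check that the column names in R match exactly what the plotting commands expect, since capitals and spaces both matter. The sketch below uses a made-up table called toy with deliberately awkward names; the clean-up rule (strip spaces, lower-case everything) is just one possible choice, not something R requires.

```r
# Toy table with awkward column names (made up for illustration)
toy <- data.frame(
  "Protein"          = c("P1", "P2"),
  "log2 Fold Change" = c(1.5, -2.0),
  "P adjusted"       = c(0.01, 0.20),
  check.names = FALSE   # keep the names exactly as typed, spaces included
)

names(toy)   # shows the names R is actually using

# Remove spaces and make everything lower case
names(toy) <- tolower(gsub(" ", "", names(toy)))

names(toy)   # now "protein", "log2foldchange", "padjusted"
```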

Now start the graphing...

your_volcano_plot <- ggplot(your_file) +
  # column names must match your sheet exactly, including capitals
  geom_point(aes(x = log2foldchange, y = -log10(padjusted), colour = threshold)) +
  ggtitle("TITLE OF YOUR PLOT") +
  xlab("log2 fold change") +
  ylab("-log10 adjusted p-value") +
  theme(legend.position = "none",
        plot.title = element_text(size = rel(1.5), hjust = 0.5),
        axis.title = element_text(size = rel(1.25)))

This makes a plot object called "your_volcano_plot" (change the name as necessary) and assigns it a bunch of features based on your column names. Change "TITLE OF YOUR PLOT" to whatever you want to appear as the title on your plot. You can't see this plot just yet, though...

your_volcano_plot

This should open the plot viewer in bottom right, so that you can have a look at your volcano plot. There is one more step so that you have the plot output as you want it - a file that you can work with to view in another program.
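
Many volcano plots also show dashed lines at the cutoffs. That is optional and not part of the commands I copied; as a rough sketch on made-up data, assuming cutoffs of adjusted p < 0.05 and a 2-fold change (log2 of 1 or -1), it could look like this:

```r
library(ggplot2)

# Toy data (made up) using the column names assumed in this post
toy <- data.frame(
  log2foldchange = c(2.1, -0.3, -1.8, 0.1, 1.2),
  padjusted      = c(0.004, 0.53, 0.008, 0.90, 0.03)
)
toy$threshold <- toy$padjusted < 0.05

toy_volcano <- ggplot(toy) +
  geom_point(aes(x = log2foldchange, y = -log10(padjusted), colour = threshold)) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +  # p cutoff
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +      # 2-fold change
  theme(legend.position = "none")

toy_volcano  # show the plot in the viewer
```

The two geom_ lines with "dashed" are the only additions; they could be pasted into your own plot commands in the same position.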

ggsave(
  "your_volcano_plot.pdf",
  plot = your_volcano_plot, # the plot object made above
  width = 6,
  height = 4,
  units = "in"
)

This outputs the file to your working directory (remember setting it near the start) - it'll be in pdf format. You can change as you want to jpg, png, tiff, bmp etc. I have specified a size, you can adjust those as needed. These are in inches currently, change as you like.
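
For example, saving as a PNG image instead might look like the sketch below. The plot and file name here are placeholders built from made-up numbers, and dpi = 300 is just an assumed resolution that you can change.

```r
library(ggplot2)

# Toy plot (made-up numbers) standing in for your_volcano_plot
toy_plot <- ggplot(data.frame(x = c(-2, 0, 2), y = c(3, 0.2, 2.5))) +
  geom_point(aes(x = x, y = y))

# The file extension picks the format; dpi sets the image resolution
ggsave(
  "toy_volcano_plot.png",
  plot = toy_plot,
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)
```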

 

If you saved this as a program in the top left, then return the cursor to the top line of the program (before the line starting "library"). Click the "run" button at the top right of the program window. You will need to keep clicking run as it steps through the program, until it reaches the bottom. You should now have a volcano plot. If it doesn't work - google the errors or let me know and I'll try to help.

 

Note: I used the plotting commands from here: https://hbctraining.github.io/Training-modules/Visualization_in_R/lessons/03_advanced_visualizations.html, I've never made a volcano plot before...

I also tested it on the data from here, which is gene data, and it was fine with an excel sheet and an added column for threshold, but protein should work too.

-bob1-