|
|
Presentation Details
|
Monday 10/12/2007
|
| 2:30-3:30 PM |
Biology in an Hour |
Matthew Wakefield |
|
A one hour refresher/introduction to genetics and molecular biology covering
the principles of genetic inheritance, DNA, RNA, protein, the central dogma,
replication, transcription, translation, molecular cloning and the polymerase
chain reaction.
|
|
| 4:00-5:00 PM |
Genomic Age Biology |
Matthew Wakefield |
|
A one hour lecture on biology as seen with genome-age tools. Topics will
include discoveries from the FANTOM and ENCODE projects and other genome-scale
studies, including miRNAs, the complex transcriptional landscape, RNA processing
and trafficking in the cell, transcriptional factories and subnuclear
localization, and a look forward at personalized medicine and the $1000 genome.
|
|
|
Slides |
|
|
Tuesday 11/12/2007
|
| 9:00-10:00 AM |
Variations on Ye Olde Sequence Alignment Problem |
Lloyd Allison |
|
The sequence alignment problem seems to be straightforward but
there are many variations on it and on its intended data:
choice of alphabet (and hence the alphabet's properties),
minimize a cost or maximize a score,
the form of the gap cost (score) function,
global or local alignment,
optimal alignment or "relatedness",
the number of sequences.
A surprising number of algorithms have been devised to deal with
variations such as those listed above. We will look at how the
algorithms are influenced by the data and the question asked.
One of the variations is modelling-alignment (m-alignment) which
incorporates a population model into the alignment algorithm.
It gives a natural significance-test for the relatedness problem over
non-random (biased, compressible) sequences, and is an alternative to the
standard method (e.g., PRSS from the FASTA package) based on shuffling and
re-alignment of sequences. There are versions of m-alignment for
local alignment, global alignment, optimal alignment, and relatedness.
Modelling-alignment shows higher accuracy than PRSS or BLAST, giving
fewer false positives and false negatives in ROC curves and other results
from tests on real and artificial data.
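
As an illustration of two of the variations listed above (minimising a cost globally versus maximising a score locally), here is a minimal Python sketch of the classic dynamic programming recurrences with a linear gap function. The function names and the cost and score parameters are assumptions chosen for the example; this is not the modelling-alignment method described in the talk.

```python
def global_alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum cost of a global alignment of a and b (linear gap cost)."""
    # prev[j] holds the cost of aligning the first i-1 characters of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            curr.append(min(sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score of any local alignment of a and b (linear gap score)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

Both functions keep only two rows of the table, so they return the optimal value but not the alignment itself; recovering the alignment would need a traceback or a divide-and-conquer scheme.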
|
|
|
Slides |
|
| 10:30-11:30 AM |
Comparative analysis of long DNA sequences by per element information content using different contexts |
Lloyd Allison |
|
Features of a DNA sequence can be found by compressing the
sequence under a suitable model; good compression implies low
information content. Good DNA compression models consider
repetition, differences between repeats, and base distributions.
From a linear DNA sequence, a compression model can produce
a linear information sequence. Linear space complexity is
important when exploring long DNA sequences of the order of
millions of bases. Compressing a sequence in isolation will
include information on self-repetition, whereas compressing
a sequence Y in the context of another sequence X can find what new
information X gives about Y. This leads to a methodology for
performing comparative analysis to find features exposed
by such models.
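
The idea of a per-element information sequence can be sketched with a deliberately simple model. The Python below uses an adaptive order-k Markov model with add-one counts, which is an assumption made for illustration and much weaker than the repeat-aware compression models the abstract refers to. The function name and the `context_seq` parameter are hypothetical.

```python
import math
from collections import defaultdict


def information_sequence(seq, order=2, context_seq="", alphabet="ACGT"):
    """Return the information content (bits) of each element of seq.

    The model is adaptive: counts are updated after each element is coded.
    If context_seq is given, the model first learns from it, so the result
    is the information in seq given context_seq.
    Every character must belong to alphabet, otherwise a KeyError is raised.
    """
    # Each context starts with a count of 1 for every symbol (add-one prior)
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))

    def code(s, i):
        context = s[max(0, i - order):i]
        c = counts[context]
        bits = -math.log2(c[s[i]] / sum(c.values()))
        c[s[i]] += 1
        return bits

    for i in range(len(context_seq)):
        code(context_seq, i)
    return [code(seq, i) for i in range(len(seq))]
```

Time and space are linear in the sequence length for a fixed order. Comparing `information_sequence(y)` with `information_sequence(y, context_seq=x)` element by element shows where x makes y cheaper to describe.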
|
|
|
Slides |
|
| 11:30 AM-12:30 PM |
Computational methods for the prediction of protein function |
Rafael Najmanovich |
|
Vast amounts of sequence and structural data have been amassed over the years.
Reliable computational methods for the prediction of protein function are
increasingly necessary to make sense of these data. The vast majority of
current methods, and certainly the more successful ones, do not predict the
function of a protein directly but try to detect homologies to proteins of
known function and transfer such annotation. This lecture will provide an
overview of the various types of data and methods used to predict protein function.
|
|
| 4:00-5:00 PM |
Analysis of ligand-protein binding within and across protein families via the detection of binding site 3D atomic similarities |
Rafael Najmanovich |
|
Determining which small molecules may bind a protein provides valuable clues to
protein function, particularly in cases where other sequence- and
structure-based methods are not helpful. In the present work, we
exploit binding site atomic similarities as a means to obtain clues about
ligand similarities. To this end we use a graph-matching-based method for
the detection of binding site 3D atomic similarities. We present the results of
applying this method to the analysis of small-molecule
binding/inhibition profiles in two human protein families: 1.
Cytosolic Sulfotransferases and 2. Protein Kinases. Finally, we also use the
method to determine if binding site atomic similarities provide positive clues
about ligand similarities in the absence of homologies.
|
|
|
Wednesday 12/12/2007
|
| 9:00-10:00 AM |
Statistical Phylogenetics 1: Statistical phylogenetic estimation |
Allen Rodrigo |
|
In this lecture, I describe explicitly statistical
methods of phylogenetic reconstruction. These include
both likelihood-based and Bayesian methods. I will
discuss the concept of hypothesis testing in
phylogenetics, and I also introduce the Likelihood
Ratio Test, one of the most widely used tests in
statistical inference.
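
For readers new to the Likelihood Ratio Test, a minimal Python sketch follows. It assumes two nested models whose maximised log-likelihoods are already known, and it uses the standard chi-square approximation for the test statistic. The function name is hypothetical, and only 1 or 2 degrees of freedom are handled so that the example needs nothing beyond the standard library.

```python
import math


def likelihood_ratio_test(loglik_null, loglik_alt, df):
    """Return (statistic, p-value) for nested models.

    loglik_null: maximised log-likelihood of the simpler (null) model
    loglik_alt:  maximised log-likelihood of the richer (alternative) model
    df:          number of extra free parameters in the alternative (1 or 2)
    """
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    if df == 1:
        # Chi-square survival function with 1 degree of freedom
        p_value = math.erfc(math.sqrt(stat / 2.0))
    elif df == 2:
        # Chi-square survival function with 2 degrees of freedom
        p_value = math.exp(-stat / 2.0)
    else:
        raise ValueError("this sketch only handles df = 1 or df = 2")
    return stat, p_value
```

With made-up log-likelihoods of -1000.0 for a null model and -998.0 for an alternative with one extra parameter, the statistic is 4.0. The chi-square approximation is only appropriate when the null model is nested in the alternative and the usual regularity conditions hold.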
|
|
| 10:30-11:30 AM |
Statistical Phylogenetics 2: Phylogenetic hypothesis tests |
Allen Rodrigo |
|
I describe how we can test (1) the quality of our data, (2) different models of
evolution, (3) for a molecular clock, and (4) alternative topological
hypotheses.
|
|
|
Slides |
|
| 11:30 AM-12:30 PM |
Inferring phylogenetic trees from un-aligned molecular sequences |
Mark Ragan |
|
It is commonly believed that to infer a phylogenetic tree that adequately
represents the evolutionary history of a set of homologous molecular sequences,
one must first align these sequences, i.e. arrange them relative to each other
in a way that presents the best available hypothesis of homology at each and
every sequence position (alignment column). Indeed, under a wide range of
biologically relevant situations, sub-optimal alignment diminishes the accuracy
of the resulting tree. Unfortunately, multiple sequence alignment is NP-hard.
This raises the question of whether sufficient homology information can be
inexpensively extracted from un-aligned sequences to support the inference of
arbitrarily accurate phylogenetic trees. In this presentation I describe a
range of approaches to alignment-free phylogenetic inference, describing the
effects of inference method (distance-based or Bayesian), statistical variables
(alphabet, word length and degeneracy), and one feature of evolving
biomolecular sequences (across-sites rate variation). The best alignment-free
methods perform surprisingly well for biologically relevant problems, and can
be superior to alignment-based approaches for certain problems. As a bonus, we
find an apparent regularity in statistical parameterisation.
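
To make the notion of an alignment-free distance concrete, here is a minimal Python sketch that compares two sequences by their word (k-mer) frequency profiles. The Euclidean distance and the default word length are assumptions chosen for simplicity, and the function names are hypothetical; the talk compares a range of more refined statistics.

```python
import math
from collections import Counter


def kmer_profile(seq, k):
    """Relative frequency of each overlapping word of length k in seq."""
    n = len(seq) - k + 1
    if n < 1:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(seq[i:i + k] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def kmer_distance(a, b, k=3):
    """Euclidean distance between the k-mer frequency profiles of a and b."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    words = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(w, 0.0) - pb.get(w, 0.0)) ** 2 for w in words))
```

A matrix of such pairwise distances could then be passed to a distance-based tree-building method such as neighbour joining, with no alignment step at all.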
|
|
|
Slides |
|
| 4:00-5:00 PM |
Whole-genome duplication in vertebrates: source of evolutionary novelty and success? |
Karin Kassahn |
|
The duplication of individual genes is thought to be an
important source of genetic and evolutionary novelty. However, vertebrates have
experienced not only recurrent duplication of individual genes giving rise to
large gene families, but also several rounds of whole-genome duplication when
their complete suite of genes was duplicated. Therefore, these whole-genome
duplication events may have had a large impact on vertebrate evolution. In
fact, the diversification of teleost fishes, which today constitute the most
speciose vertebrate lineage with some 20,000 extant species, has been
attributed to an additional whole-genome duplication that occurred in the
ancestor of all teleost fishes and after their divergence from tetrapods,
approximately 450 Mya. What role have these whole-genome duplication events
really played in the evolution and diversification of vertebrates? Evolutionary
theory predicts that following whole-genome duplication most duplicate gene
copies rapidly become non-functional due to the accumulation of random
mutations. However, a small set of gene duplicates may potentially be retained
as functional copies and these could either maintain the original function,
acquire new functions or partition previous functions. In this lecture, I
describe different approaches to identify teleost gene duplicates retained from
the last whole-genome duplication event, in particular approaches based on
chromosomal location and phylogenetic inference. These analyses are based on
five completely sequenced teleost genomes. I further comment on other types of
genomic variation in vertebrates that may be associated with the evolution of
new genes and function.
|
|
|
Slides |
|
| 6:00-7:00 PM |
The human genome: more complex than we imagined |
John Mattick |
|
The human genome comprises 3 billion base pairs of DNA sequence information. It
programs the development of a precisely sculptured individual of about 100
trillion cells with hundreds of different muscles, bones and organs, as well as
the brain. It contains about 20,000 protein-coding genes, surprisingly about
the same number and in large part with similar functions as those in sea
urchins or even tiny worms that have only 1,000 cells. This raises the
question: where is the genetic information that programs our complexity? The
answer, it seems, lies in the so-called "junk" DNA which resides within and
between our genes, and contains most of the genetic differences between
individuals and species. These sequences are not translated into proteins but
are copied into RNA in a developmentally regulated manner. Increasing evidence
suggests that noncoding RNAs form a massive hidden network of regulatory
information that directs the exquisitely precise patterns of gene expression
during our growth and development. It also appears that RNA is central to brain
development, learning and memory, and that we have developed sophisticated RNA
editing systems to overwrite hardwired genetic information in response to
environmental signals. Thus, it seems that evolution discovered the power of
digital communication and control systems a billion years before we did.
Moreover, what was dismissed as junk because it was not understood may well hold
the secrets of human complexity.
|
|
|
Thursday 13/12/2007
|
| 9:00-10:00 AM |
The human genome as an RNA machine |
John Mattick |
|
It appears likely that the genetic programming of humans and other higher
organisms has been fundamentally misunderstood for the past 50 years, because
of the presumption (largely true in prokaryotes, but not in complex eukaryotes)
that most genetic information is expressed as and transacted by proteins.
While only a tiny minority encodes protein, it is now evident that the majority
of the mammalian genome is transcribed, much on both strands, apparently in a
developmentally regulated manner, and that most complex genetic phenomena in
the higher organisms are RNA-directed. Evidence will be presented (i) that
there are thousands of non-protein-coding transcripts in mouse that are
dynamically expressed during differentiation and development, including in
embryonal stem cell, neuronal cell and muscle differentiation, male and female
gonadal ridge differentiation, and T-cell and macrophage activation, among
others, many of which show precise expression patterns and subcellular
localization in the brain; (ii) that there are large numbers of
small RNAs expressed from the human and mouse genomes; and (iii) that much, if
not most, of the mammalian genome may not be evolving neutrally, but rather
comprises different types of sequences (including transposon-derived sequences)
that are evolving at different rates under different selection pressures and
different structure-function constraints. Taken together with others, these
observations suggest that the majority of the human genome and those of other
complex organisms is devoted to a hidden and highly sophisticated RNA
regulatory system that directs the trajectories of differentiation and
development by controlling chromatin architecture and epigenetic memory,
transcription, splicing, RNA modification and editing, mRNA translation and RNA
stability.
|
|
| 10:30-11:30 AM |
Borrowing strength in microarray data analysis |
Gordon Smyth |
|
At a molecular biology research institute, most microarray experiments are
differential expression studies. Such studies can range from simple in design,
perhaps comparing just two groups, to more complex designs involving multiple
levels of several treatments. Whether simple or complex, these experiments
invariably involve only a small number of biological replicates. This means
that creative ways to borrow strength between genes and between samples are
essential to the statistical analysis. This talk will describe an approach
which has proved popular and effective, using linear models to borrow strength
between samples, and empirical Bayes methods to borrow strength between genes.
This approach leads to an elegant generalization of t-tests and F-tests. The
talk will go on to consider information borrowing ideas for experiments with
multiple error strata, and gene set tests which conduct hypothesis tests for
ensembles of genes.
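
The way empirical Bayes borrows strength between genes can be sketched with the moderated t-statistic, in which each gene's sample variance is shrunk towards a prior variance shared by all genes. In the Python below the prior variance and prior degrees of freedom are assumed to be given; in practice they would be estimated from the full set of genes. The function and parameter names are hypothetical.

```python
import math


def moderated_t(effect, s2_gene, df_gene, unscaled_var, s2_prior, df_prior):
    """Return (moderated t-statistic, degrees of freedom) for one gene.

    effect:       estimated log fold change from the linear model
    s2_gene:      the gene's own residual variance, on df_gene degrees of freedom
    unscaled_var: variance multiplier of the estimate (1/n1 + 1/n2 for two groups)
    s2_prior:     prior variance shared across genes (assumed known here)
    df_prior:     prior degrees of freedom (assumed known here)
    """
    # Posterior variance is a weighted average of the prior and gene-wise variances
    s2_post = (df_prior * s2_prior + df_gene * s2_gene) / (df_prior + df_gene)
    t = effect / math.sqrt(s2_post * unscaled_var)
    return t, df_prior + df_gene
```

Setting `df_prior` to zero recovers the ordinary t-statistic, and larger values pull unusually small gene-wise variances towards the shared value, which tempers spuriously large statistics when there are few replicates.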
|
|
|
Slides |
|
| 11:30 AM-12:30 PM |
Gene expression microarrays: how they work |
Conrad Burden |
|
Gene expression microarrays are designed to enable the simultaneous testing for
the presence of expressed genes in prepared mRNA samples. Commercially produced
oligonucleotide microarrays test tens of thousands of genes in a single
experiment. Various algorithms with names like MAS5, RMA and PLIER are
available to convert the raw fluorescence intensity data from these experiments
to data in the form of "expression measures" for use by biologists. But are
these expression measures truly an indication of the specific mRNA
concentration present in the target sample? By analysing data from spike-in
experiments, we can see that current algorithms do not correct well for the
effects of cross hybridisation or saturation, and are not a measure of absolute
target concentrations. So how might we do better? To answer this question, and
with the goal of developing a practical algorithm to estimate absolute target
concentrations accurately, we consider the physics, chemistry and statistics of
oligonucleotide microarrays. In this talk I will describe the mechanics of
what is happening at the microarray surface during hybridisation experiments.
|
|
|
Slides |
|
| 4:00-5:00 PM |
Natural metrics for the calibration of microarray intensities |
Hans Binder |
|
Building on the preceding lecture about the basic mechanism of microarray
hybridization, I will discuss the problem of how one gets from an adequate physical
model to a feasible method of data calibration which transforms up to a million
raw intensity values per chip into expression measures in units of the
RNA concentration of tens of thousands of transcripts. Our "Hook-curve" technique
corrects raw microarray intensity data for the effect of (i) sequence-specific
affinities; (ii) mismatches; (iii) cross-hybridization and (iv) saturation in a
single-chip fashion, i.e. without comparison with the other chips of a series.
It is based on a physically-motivated metric system for GeneChip data which
uses intrinsic relations between matched and mismatched probes. The
hybridization model describes specific and non-specific hybridization in terms
of the competitive two-species Langmuir isotherm and the formation of
probe/target duplexes in terms of position-dependent base-pair interactions.
I will present examples for different hybridizations (DNA-RNA, DNA-DNA,
mismatches, different chip generations etc.) and also applications/adaptations
of the method to different chip types such as expression-, tiling- and
SNP-arrays.
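
The competitive two-species Langmuir isotherm mentioned above can be sketched in a few lines of Python. The version below is the textbook form, in which specific and non-specific targets compete for the same probe sites; the function name, the parameter names, and the linear mapping from occupancy to intensity are assumptions for illustration and not the Hook-curve method itself.

```python
def langmuir_intensity(c_specific, c_nonspecific, k_specific, k_nonspecific,
                       i_max=1.0, background=0.0):
    """Expected probe intensity under a competitive two-species Langmuir model.

    c_*: concentrations of specific and non-specific targets
    k_*: corresponding binding constants (probe affinities)
    i_max: intensity of a fully occupied probe spot
    background: optical background added to every probe
    """
    x = k_specific * c_specific + k_nonspecific * c_nonspecific
    occupancy = x / (1.0 + x)  # fraction of probe sites bound by either species
    return background + i_max * occupancy
```

At low concentrations the response is nearly linear in the target concentration, while at high concentrations it flattens towards `i_max`. That flattening is the saturation effect that a calibration method has to correct for.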
|
|
|
Slides |
|
|
Friday 14/12/2007
|
| 9:00-10:00 AM |
ChIP-chip and high density tiling arrays: issues in design and analysis |
Terry Speed |
|
The combination of chromatin immunoprecipitation (ChIP) and
high-density tiling microarrays permits high resolution, genome wide
localization of protein-DNA interaction in vivo.
ChIP has been combined with spotted arrays for 7 or 8 years, but the
move to high density array platforms -- which permit fine-grained
interrogation of essentially all non-repetitive sequence, even in
higher eukaryotes -- is a recent one. To address the additional
challenges posed by this platform, a variety of analysis methods and
experimental designs have been proposed. In this talk, I will consider
(i) the use of different types of control data, (ii) statistical issues
common to all proposed analysis methods, and (iii) how lessons learned
from the analysis of gene expression data on high density arrays can be
applied in the ChIP-chip tiling array context.
|
|
| 10:30-11:30 AM |
What ChIP-chip data can tell us about transcriptional regulation in eukaryotic cells |
Terry Speed |
|
Suppose that we have processed ChIP-chip data for one or more
proteins along the lines of the first talk in this pair. That is, suppose
we have estimates of the genomic locations at which our protein(s) bound,
calling these (putative) binding sites. What next? Of obvious interest is
predicting which genes are regulated by our protein. A start on this is
made by relating the location of these binding sites to known or
predicted genes. Many will be in promoter regions of genes, but many
others will not. This task is best carried out in conjunction with other
assays (RT-PCR, luciferase), as it cannot be solved by computation alone.
Then we'll be interested in the sequence content of the regions around
the binding sites: are they enriched for known or novel sequence motifs?
There are usually enough regions for one primary and several secondary
binding motifs to emerge, with interest focussing on ones not previously
known to be relevant to the protein(s) under study, and on the
combinations found. ChIP-chip data are almost always obtained on systems
(cellular contexts, proteins) for which there are associated microarray
gene expression assays. This permits a broad class of biological questions
to be explored by relating the binding data to expression and genomic
sequence data. In this talk I will explain and illustrate most of the
issues mentioned above with data from experiments on fruit fly and human
cell lines.
|
|
| 11:30 AM-12:30 PM |
Elucidating the Genetic Architecture of Gene Expression
and Its Impact on Networks Associated with Obesity in Human Blood and Adipose
Tissue Samples
|
Eric Schadt |
|
The identification of genetic variants that affect gene expression may help
unravel the complexity of common human diseases. Here I present the analysis
of the expression of 23,720 transcripts in a population-based blood and adipose
tissue sampling from large familial cohorts assessed for biometric traits
related to obesity. In contrast to blood, we demonstrate a striking
correlation between gene expression in adipose tissue and obesity related
traits. Genome-wide linkage and association mapping reveal a highly
significant genetic component to gene expression traits, including a strong
genetic effect of proximal (cis) signals and a weaker effect of distal (trans) signals.
An extensive co-expression network constructed from the human adipose data
exhibits significant overlap with similar network modules in mouse adipose data
found to be causally associated with obesity related traits. Combined, these
data highlight that common human diseases like obesity and diabetes are
emergent properties of networks, requiring the integration of multiple
different types of data (like genetic, expression and clinical) to elucidate
the networks underlying them, beyond what could be achieved by examining the
different types of data on their own.
|
|
| 4:00-5:00 PM |
Integrating Large-Scale Functional Genomic Data to Dissect the Complexity of Regulatory Networks
|
Eric Schadt |
|
A primary aim in systems biology research is the construction of networks
capable of predicting complex system behavior. DNA variation and transcription
factor binding site (TFBS) data have been exploited as systematic perturbation
sources to facilitate inferring causal relationships among genes and between
genes and higher-order phenotypes, while protein-protein interaction (PPI) and
gene expression data have been leveraged to construct large-scale interaction
(or association) networks. Here I describe a method to combine multiple types
of large-scale molecular data, including genotypic, gene expression, TFBS and
PPI data to reconstruct causal probabilistic gene networks. I establish the
importance of incorporating systematic sources of perturbations to infer causal
relationships among genes by reconstructing whole gene networks based on
different types of data. These different networks are compared using a number
of metrics devised to assess the predictive power of any given network. A
network reconstructed by integrating genotypic, PPI, gene expression, and TFBS
data from previously published large-scale yeast experiments is shown to
provide superior predictive power compared to networks constructed from the
expression data alone. I demonstrate that this network enables direct
identification of genes responsible for hot spots of gene expression activity
under the control of common genetic loci in a segregating yeast population. I
further demonstrate that for many of these predictions the network elucidates a
putative mechanistic understanding of how the causal regulators give rise to
larger-scale changes in gene expression activity. Importantly, a number of
predictions based on this network are prospectively tested and validated
experimentally, providing direct experimental evidence that predictive networks
can be constructed via the integration of multiple, appropriate data types.
|
|
|