The Australian National University
Mathematical Sciences Institute (MSI)
Australian Mathematical Sciences Institute Summer Symposium

International Centre of Excellence for Education in Mathematics

Presentation Details

Monday 10/12/2007
2:30-3:30 PM Biology in an Hour Matthew Wakefield
A one-hour refresher/introduction to genetics and molecular biology covering the principles of genetic inheritance, DNA, RNA, protein, the central dogma, replication, transcription, translation, molecular cloning and the polymerase chain reaction.
4:00-5:00 PM Genomic Age Biology Matthew Wakefield
A one-hour lecture on biology as seen with genome-age tools. Topics will include discoveries from the FANTOM and ENCODE projects and other genome-scale studies, including miRNAs, the complex transcriptional landscape, RNA processing and trafficking in the cell, transcriptional factories and subnuclear localization, and a look forward at personalized medicine and the $1000 genome.
Slides
Tuesday 11/12/2007
9:00-10:00 AM Variations on Ye Olde Sequence Alignment Problem Lloyd Allison
The sequence alignment problem seems to be straightforward but there are many variations on it and on its intended data: choice of alphabet (and hence the alphabet's properties), minimize a cost or maximize a score, the form of the gap cost (score) function, global or local alignment, optimal alignment or "relatedness", the number of sequences. A surprising number of algorithms have been devised to deal with variations such as those listed above. We will look at how the algorithms are influenced by the data and the question asked. One of the variations is modelling-alignment (m-alignment), which incorporates a population model into the alignment algorithm. It gives a natural significance test for the relatedness problem over non-random (biased, compressible) sequences, and is an alternative to the standard method (e.g., PRSS from the FASTA package) based on shuffling and re-alignment of sequences. There are versions of m-alignment for local alignment, global alignment, optimal alignment, and relatedness. Modelling-alignment shows higher accuracy than PRSS or BLAST, giving fewer false positives and false negatives in ROC curves and other results from tests on real and artificial data.
Slides
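As a concrete reference point for the alignment variations listed in the abstract above, the following is a minimal sketch of just one of them: global alignment that minimizes a cost with a linear gap function. The function name, the unit costs and the example strings are assumptions made for illustration. It is the textbook dynamic-programming recurrence, not the speaker's modelling-alignment method.

```python
def global_alignment_cost(a, b, mismatch=1, gap=1):
    """Minimum cost of globally aligning strings a and b.

    Assumptions for this sketch: matches cost 0, mismatches cost
    `mismatch`, and each gap character costs `gap` (linear gap cost).
    """
    # prev[j] is the best cost of aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else mismatch)
            delete = prev[j] + gap      # a[i-1] aligned to a gap
            insert = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_cost("ACGT", "AGT"))
```

Keeping only two rows of the table makes the space requirement linear in the length of the second sequence. Changing the question (local rather than global, a score rather than a cost, a non-linear gap function) changes the recurrence, which is the point the abstract makes about algorithms depending on the variation chosen.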
10:30-11:30 AM Comparative analysis of long DNA sequences by per element information content using different contexts Lloyd Allison
Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another sequence X can find what new information X gives about Y. This leads to a methodology for performing comparative analysis to find features exposed by such models.
Slides
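The idea of a per-element information sequence in the abstract above can be illustrated with a deliberately simple model. The sketch below assumes an adaptive order-k Markov model with add-one smoothing over the alphabet ACGT; the function name, the model and its parameters are assumptions for illustration and are far simpler than the repeat-aware compression models the talk describes.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases


def information_sequence(seq, order=2, context=""):
    """Return the information content (bits) of each base in seq.

    The model is adaptive: each base is coded using counts of what followed
    the same preceding `order` bases earlier. If `context` is given, the
    counts are first primed with that sequence, so the result measures the
    information in seq given the context sequence.
    """
    counts = defaultdict(lambda: defaultdict(int))

    # Prime the model with the context sequence (sequence X in the abstract).
    for i, base in enumerate(context):
        counts[context[max(0, i - order):i]][base] += 1

    info = []
    for i, base in enumerate(seq):
        seen = counts[seq[max(0, i - order):i]]
        total = sum(seen.values())
        # Add-one smoothing keeps every base possible in every context.
        p = (seen[base] + 1) / (total + len(ALPHABET))
        info.append(-math.log2(p))
        seen[base] += 1  # learn from the base only after coding it
    return info


if __name__ == "__main__":
    y = "ACGTACGTACGT"
    alone = sum(information_sequence(y))
    given_x = sum(information_sequence(y, context=y))
    print(round(alone, 2), round(given_x, 2))
```

The output list has one value per base, so it is itself a linear sequence that can be plotted along the genome. Low values mark bases the model predicted well, for example inside a repeat, and comparing the two totals shows how much the context sequence explains.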
11:30 AM-12:30 PM Computational methods for the prediction of protein function Rafael Najmanovich
Vast amounts of sequence and structural data have been amassed over the years. Reliable computational methods for the prediction of protein function are increasingly necessary to make sense of these data. The vast majority of current methods, and certainly the more successful ones, do not predict the function of a protein directly; instead, they try to detect homologies to proteins of known function and transfer that annotation. This lecture will provide an overview of the various types of data and methods used to predict protein function.
4:00-5:00 PM Analysis of ligand-protein binding within and across protein families via the detection of binding site 3D atomic similarities Rafael Najmanovich
Determining which small molecules may bind a protein provides valuable clues about protein function, particularly in cases where other sequence- and structure-based methods are not helpful. In the present work, we want to exploit binding site atomic similarities as a means to obtain clues about ligand similarities. To this end we use a graph-matching-based method for the detection of binding site 3D atomic similarities. We present the results of applying this method to the analysis of small-molecule binding/inhibition profiles in two human protein families: 1. Cytosolic Sulfotransferases and 2. Protein Kinases. Finally, we also use the method to determine whether binding site atomic similarities provide positive clues about ligand similarities in the absence of homologies.
Wednesday 12/12/2007
9:00-10:00 AM Statistical Phylogenetics 1: Statistical phylogenetic estimation Allen Rodrigo
In this lecture, I describe explicitly statistical methods of phylogenetic reconstruction. These include both likelihood-based and Bayesian methods. I will discuss the concept of hypothesis testing in phylogenetics, and I also introduce the Likelihood Ratio Test, one of the most widely used tests in statistical inference.
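The Likelihood Ratio Test mentioned above compares the maximized log-likelihoods of two nested models. The sketch below is a minimal illustration under stated assumptions: the models are nested, they differ by exactly one free parameter, and the usual asymptotic chi-square approximation with one degree of freedom applies. The function name and the example log-likelihood values are hypothetical and are not taken from the lecture.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, p_value) for a one-degree-of-freedom LRT.

    lnl_null: maximized log-likelihood of the simpler (null) model.
    lnl_alt:  maximized log-likelihood of the more general model.
    Assumes the null model is nested in the alternative and that the
    two models differ by a single free parameter.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods for the same tree under two models.
    stat, p = likelihood_ratio_test(-1000.0, -998.0)
    print(stat, round(p, 4))
```

For models that differ by more than one parameter the reference distribution has correspondingly more degrees of freedom, and the chi-square approximation is not appropriate for every phylogenetic comparison; comparing alternative tree topologies, for instance, does not involve nested models.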
10:30-11:30 AM Statistical Phylogenetics 2: Phylogenetic hypothesis tests Allen Rodrigo
I describe how we can test (1) the quality of our data, (2) different models of evolution, (3) for a molecular clock, and (4) alternative topological hypotheses.
Slides
11:30 AM-12:30 PM Inferring phylogenetic trees from un-aligned molecular sequences Mark Ragan
It is commonly believed that to infer a phylogenetic tree that adequately represents the evolutionary history of a set of homologous molecular sequences, one must first align these sequences, i.e. arrange them relative to each other in a way that presents the best available hypothesis of homology at each and every sequence position (alignment column). Indeed, under a wide range of biologically relevant situations, sub-optimal alignment diminishes the accuracy of the resulting tree. Unfortunately, multiple sequence alignment is NP-hard. This raises the question of whether sufficient homology information can be inexpensively extracted from un-aligned sequences to support the inference of arbitrarily accurate phylogenetic trees. In this presentation I describe a range of approaches to alignment-free phylogenetic inference, describing the effects of inference method (distance-based or Bayesian), statistical variables (alphabet, word length and degeneracy), and one feature of evolving biomolecular sequences (across-sites rate variation). The best alignment-free methods perform surprisingly well for biologically relevant problems, and can be superior to alignment-based approaches for certain problems. As a bonus, we find an apparent regularity in statistical parameterisation.
Slides
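To make the notion of alignment-free comparison in the abstract above concrete, here is a minimal sketch of one simple word-based distance: each sequence is summarized by the relative frequencies of its overlapping words of length k, and two sequences are compared by the Euclidean distance between those profiles. The function names and the choice of distance are assumptions for illustration and do not reproduce the specific methods evaluated in the talk.

```python
import math
from collections import Counter


def word_frequencies(seq, k=3):
    """Relative frequencies of the overlapping length-k words in seq."""
    words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    total = len(words)
    return {word: count / total for word, count in Counter(words).items()}


def word_distance(a, b, k=3):
    """Euclidean distance between the word-frequency profiles of a and b."""
    fa, fb = word_frequencies(a, k), word_frequencies(b, k)
    return math.sqrt(
        sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in set(fa) | set(fb))
    )


if __name__ == "__main__":
    seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGAAC", "s3": "TTTTGGGGCC"}
    names = sorted(seqs)
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            print(x, y, round(word_distance(seqs[x], seqs[y]), 3))
```

The pairwise distances form a matrix that a distance-based tree-building method could take as input. The word length k corresponds to one of the statistical variables the abstract lists, and no alignment is computed at any point.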
4:00-5:00 PM Whole-genome duplication in vertebrates: source of evolutionary novelty and success? Karin Kassahn
The duplication of individual genes is thought to be an important source of genetic and evolutionary novelty. However, vertebrates have experienced not only recurrent duplication of individual genes giving rise to large gene families, but also several rounds of whole-genome duplication when their complete suite of genes was duplicated. Therefore, these whole-genome duplication events may have had a large impact on vertebrate evolution. In fact, the diversification of teleost fishes, which today constitute the most speciose vertebrate lineage with some 20,000 extant species, has been attributed to an additional whole-genome duplication that occurred in the ancestor of all teleost fishes and after their divergence from tetrapods, approximately 450 Mya. What role have these whole-genome duplication events really played in the evolution and diversification of vertebrates? Evolutionary theory predicts that following whole-genome duplication most duplicate gene copies rapidly become non-functional due to the accumulation of random mutations. However, a small set of gene duplicates may potentially be retained as functional copies and these could either maintain the original function, acquire new functions or partition previous functions. In this lecture, I describe different approaches to identify teleost gene duplicates retained from the last whole-genome duplication event, in particular approaches based on chromosomal location and phylogenetic inference. These analyses are based on five completely sequenced teleost genomes. I further comment on other types of genomic variation in vertebrates that may be associated with the evolution of new genes and function.
Slides
6:00-7:00 PM The human genome: more complex than we imagined John Mattick
The human genome comprises 3 billion base pairs of DNA sequence information. It programs the development of a precisely sculptured individual of about 100 trillion cells with hundreds of different muscles, bones and organs, as well as the brain. It contains about 20,000 protein-coding genes, surprisingly about the same number and in large part with similar functions as those in sea urchins or even tiny worms that have only 1,000 cells. This raises the question: where is the genetic information that programs our complexity? The answer, it seems, lies in the so-called "junk" DNA which resides within and between our genes, and contains most of the genetic differences between individuals and species. These sequences are not translated into proteins but are copied into RNA in a developmentally regulated manner. Increasing evidence suggests that noncoding RNAs form a massive hidden network of regulatory information that directs the exquisitely precise patterns of gene expression during our growth and development. It also appears that RNA is central to brain development, learning and memory, and that we have developed sophisticated RNA editing systems to overwrite hardwired genetic information in response to environmental signals. Thus, it seems that evolution discovered the power of digital communication and control systems a billion years before we did. Moreover, what was dismissed as junk because it was not understood may well hold the secrets of human complexity.
Thursday 13/12/2007
9:00-10:00 AM The human genome as an RNA machine John Mattick
It appears likely that the genetic programming of humans and other higher organisms has been fundamentally misunderstood for the past 50 years, because of the presumption (largely true in prokaryotes, but not in complex eukaryotes) that most genetic information is expressed as and transacted by proteins. While only a tiny minority encodes protein, it is now evident that the majority of the mammalian genome is transcribed, much on both strands, apparently in a developmentally regulated manner, and that most complex genetic phenomena in the higher organisms are RNA-directed. Evidence will be presented (i) that there are thousands of non-protein-coding transcripts in mouse that are dynamically expressed during differentiation and development, including in embryonal stem cell, neuronal cell and muscle differentiation, male and female gonadal ridge differentiation, and T-cell and macrophage activation, among others, many of which show precise expression patterns and subcellular localization in the brain; (ii) that there are large numbers of small RNAs expressed from the human and mouse genomes; and (iii) that much, if not most, of the mammalian genome may not be evolving neutrally, but rather comprises different types of sequences (including transposon-derived sequences) that are evolving at different rates under different selection pressures and different structure-function constraints. Taken together with others, these observations suggest that the majority of the human genome and those of other complex organisms is devoted to a hidden and highly sophisticated RNA regulatory system that directs the trajectories of differentiation and development by controlling chromatin architecture and epigenetic memory, transcription, splicing, RNA modification and editing, mRNA translation and RNA stability.
10:30-11:30 AM Borrowing strength in microarray data analysis Gordon Smyth
At a molecular biology research institute, most microarray experiments are differential expression studies. Such studies can range from simple in design, perhaps comparing just two groups, to more complex designs involving multiple levels of several treatments. Whether simple or complex, these experiments invariably involve only a small number of biological replicates. This means that creative ways to borrow strength between genes and between samples are essential to the statistical analysis. This talk will describe an approach which has proved popular and effective, using linear models to borrow strength between samples, and empirical Bayes methods to borrow strength between genes. This approach leads to an elegant generalization of t-tests and F-tests. The talk will go on to consider information borrowing ideas for experiments with multiple error strata, and gene set tests which conduct hypothesis tests for ensembles of genes.
Slides
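Borrowing strength between genes, as described in the abstract above, is commonly done by shrinking each gene's sample variance towards a prior value shared by all genes. The sketch below shows that shrinkage and the resulting moderated t-statistic for a single gene. The function names are hypothetical, and the prior variance and prior degrees of freedom are passed in as plain assumptions; in a real empirical Bayes analysis they would be estimated from the variances of all genes, a step this sketch omits.

```python
import math


def moderated_variance(s2, df, prior_s2, prior_df):
    """Shrink a gene's sample variance s2 (with df degrees of freedom)
    towards the prior variance prior_s2 (worth prior_df degrees of freedom)."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


def moderated_t(effect, unscaled_se, s2, df, prior_s2, prior_df):
    """Return (t, degrees_of_freedom) for one gene.

    effect:      estimated coefficient from the gene's linear model.
    unscaled_se: standard error of the coefficient divided by the
                 residual standard deviation.
    """
    s2_post = moderated_variance(s2, df, prior_s2, prior_df)
    t = effect / (unscaled_se * math.sqrt(s2_post))
    return t, prior_df + df


if __name__ == "__main__":
    # A gene whose variance looks implausibly small with only 2 residual df.
    ordinary_t = 1.0 / (1.0 * math.sqrt(0.01))
    t, df = moderated_t(1.0, 1.0, s2=0.01, df=2, prior_s2=0.25, prior_df=4)
    print(ordinary_t, round(t, 3), df)
```

With few replicates, an unusually small sample variance can inflate an ordinary t-statistic. The shrunken variance damps that effect, and the prior contributes extra degrees of freedom to the test.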
11:30 AM-12:30 PM Gene expression microarrays: how they work Conrad Burden
Gene expression microarrays are designed to enable the simultaneous testing for the presence of expressed genes in prepared mRNA samples. Commercially produced oligonucleotide microarrays test tens of thousands of genes in a single experiment. Various algorithms with names like MAS5, RMA and PLIER are available to convert the raw fluorescence intensity data from these experiments to data in the form of "expression measures" for use by biologists. But are these expression measures truly an indication of the specific mRNA concentration present in the target sample? By analysing data from spike-in experiments, we can see that current algorithms do not correct well for the effects of cross hybridisation or saturation, and are not a measure of absolute target concentrations. So how might we do better? To answer this question, and with the goal of developing a practical algorithm to estimate absolute target concentrations accurately, we consider the physics, chemistry and statistics of oligonucleotide microarrays. In this talk I will describe the mechanics of what is happening at the microarray surface during hybridisation experiments.
Slides
4:00-5:00 PM Natural metrics for the calibration of microarray intensities Hans Binder
Based on the preceding lecture about the basic mechanism of microarray hybridization, I will discuss the problem of how one gets from an adequate physical model to a feasible method of data calibration which transforms up to a million raw intensity values per chip into expression measures in units of the RNA concentration of tens of thousands of transcripts. Our "Hook-curve" technique corrects raw microarray intensity data for the effect of (i) sequence-specific affinities; (ii) mismatches; (iii) cross-hybridization and (iv) saturation in a single-chip fashion, i.e. without comparison with the other chips of a series. It is based on a physically motivated metric system for GeneChip data which uses intrinsic relations between matched and mismatched probes. The hybridization model describes specific and non-specific hybridization in terms of the competitive two-species Langmuir isotherm and the formation of probe/target duplexes in terms of position-dependent base-pair interactions. I will present examples for different hybridizations (DNA-RNA, DNA-DNA, mismatches, different chip generations etc.) and also applications/adaptations of the method to different chip types such as expression, tiling and SNP arrays.
Slides
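The competitive two-species Langmuir isotherm named in the abstract above can be written down in a few lines. The sketch below computes the fraction of probe sites occupied by specific and by non-specific targets at equilibrium. The function name, the parameter names and the numerical values are assumptions for illustration; the position-dependent affinities and the calibration procedure of the talk are not modelled here.

```python
def langmuir_coverage(c_specific, c_nonspecific, k_specific, k_nonspecific):
    """Fractions of probe sites bound by specific and non-specific targets.

    c_*: target concentrations; k_*: binding constants (in reciprocal
    concentration units). Both species compete for the same probe sites.
    """
    x_s = k_specific * c_specific
    x_n = k_nonspecific * c_nonspecific
    denominator = 1.0 + x_s + x_n
    return x_s / denominator, x_n / denominator


if __name__ == "__main__":
    # Hypothetical values: raise the specific concentration a thousandfold at
    # each step and watch the specific coverage level off below 1.
    for concentration in (1e-3, 1.0, 1e3):
        specific, nonspecific = langmuir_coverage(concentration, 1.0, 1.0, 0.1)
        print(concentration, round(specific, 4), round(nonspecific, 4))
```

If measured intensity is taken to be proportional to total coverage, which is an assumption of this sketch, the model exhibits two of the effects listed in the abstract: a non-specific background that dominates at low specific concentration, and saturation as coverage approaches one at high concentration.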
Friday 14/12/2007
9:00-10:00 AM ChIP-chip and high density tiling arrays: issues in design and analysis Terry Speed
The combination of chromatin immunoprecipitation (ChIP) and high-density tiling microarrays permits high-resolution, genome-wide localization of protein-DNA interaction in vivo. ChIP has been combined with spotted arrays for 7 or 8 years, but the move to high-density array platforms -- which permit fine-grained interrogation of essentially all non-repetitive sequence, even in higher eukaryotes -- is a recent one. To address the additional challenges posed by this platform, a variety of analysis methods and experimental designs have been proposed. In this talk, I will consider (i) the use of different types of control data, (ii) statistical issues common to all proposed analysis methods, and (iii) how lessons learned from the analysis of gene expression data on high-density arrays can be applied in the ChIP-chip tiling array context.
10:30-11:30 AM What ChIP-chip data can tell us about transcriptional regulation in eukaryotic cells Terry Speed
Suppose that we have processed ChIP-chip data for one or more proteins along the lines of the first talk in this pair. That is, suppose we have estimates of the genomic locations at which our protein(s) bound, calling these (putative) binding sites. What next? Of obvious interest is predicting which genes are regulated by our protein. A start on this is made by relating the location of these binding sites to known or predicted genes. Many will be in promoter regions of genes, but many others will not. This task is best carried out in conjunction with other assays (RT-PCR, luciferase), as it cannot be solved by computation alone. Then we'll be interested in the sequence content of the regions around the binding sites: are they enriched for known or novel sequence motifs? There are usually enough regions for one primary and several secondary binding motifs to emerge, with interest focussing on ones not previously known to be relevant to the protein(s) under study, and on the combinations found. ChIP-chip data are almost always obtained on systems (cellular contexts, proteins) for which there are associated microarray gene expression assays. This permits a broad class of biological questions to be explored by relating the binding data to expression and genomic sequence data. In this talk I will explain and illustrate most of the issues mentioned above with data from experiments on fruit fly and human cell lines.
11:30 AM-12:30 PM Elucidating the Genetic Architecture of Gene Expression and Its Impact on Networks Associated with Obesity in Human Blood and Adipose Tissue Samples Eric Schadt
The identification of genetic variants that affect gene expression may help unravel the complexity of common human diseases. Here I present the analysis of the expression of 23,720 transcripts in a population-based blood and adipose tissue sampling from large familial cohorts assessed for biometric traits related to obesity. In contrast to blood, we demonstrate a striking correlation between gene expression in adipose tissue and obesity-related traits. Genome-wide linkage and association mapping reveal a highly significant genetic component to gene expression traits, including a strong genetic effect of proximal (cis) signals and a weaker effect of distal (trans) signals. An extensive co-expression network constructed from the human adipose data exhibits significant overlap with similar network modules in mouse adipose data found to be causally associated with obesity-related traits. Combined, these data highlight that common human diseases like obesity and diabetes are emergent properties of networks, requiring the integration of multiple different types of data (like genetic, expression and clinical) to elucidate the networks underlying them, beyond what could be achieved by examining the different types of data on their own.
4:00-5:00 PM Integrating Large-Scale Functional Genomic Data to Dissect the Complexity of Regulatory Networks Eric Schadt
A primary aim in systems biology research is the construction of networks capable of predicting complex system behavior. DNA variation and transcription factor binding site (TFBS) data have been exploited as systematic perturbation sources to facilitate inferring causal relationships among genes and between genes and higher-order phenotypes, while protein-protein interaction (PPI) and gene expression data have been leveraged to construct large-scale interaction (or association) networks. Here I describe a method to combine multiple types of large-scale molecular data, including genotypic, gene expression, TFBS and PPI data to reconstruct causal probabilistic gene networks. I establish the importance of incorporating systematic sources of perturbations to infer causal relationships among genes by reconstructing whole gene networks based on different types of data. These different networks are compared using a number of metrics devised to assess the predictive power of any given network. A network reconstructed by integrating genotypic, PPI, gene expression, and TFBS data from previously published large-scale yeast experiments is shown to provide for superior predictive power compared to networks constructed from the expression data alone. I demonstrate that this network enables direct identification of genes responsible for hot spots of gene expression activity under the control of common genetic loci in a segregating yeast population. I further demonstrate that for many of these predictions the network elucidates a putative mechanistic understanding of how the causal regulators give rise to larger-scale changes in gene expression activity. Importantly, a number of predictions based on this network are prospectively tested and validated experimentally, providing direct experimental evidence that predictive networks can be constructed via the integration of multiple, appropriate data types.