Large-scale Genome Sequence Processing

✍ Scribed by Masahiro Kasahara, Shinichi Morishita

Publisher: Imperial College Press; Distributed by World Scientific
Year: 2006
Language: English
Pages: 250
Edition: 1
Category: Library

✦ Synopsis

Efficient computer programs have made it possible to elucidate and analyze large-scale genomic sequences. Fundamental tasks, such as the assembly of numerous whole-genome shotgun fragments, the alignment of complementary DNA sequences with a long genome, and the design of gene-specific primers or oligomers, require efficient algorithms and state-of-the-art implementation techniques. This textbook emphasizes basic software implementation techniques for processing large-scale genome sequences and provides executable sample programs.

✦ Table of Contents

Preface
1.1 Storing a String in an Array
1.2 Brute-Force String Search
1.3 Encoding Strings into Integers
1.4 Sorting k-mer Integers and a Binary Search
1.5 Binary Search for the Boundaries of Blocks
2.1 Insertion Sort
2.2 Merge Sort
2.3 Worst-Case Time Complexity of Algorithms
2.4 Heap Sort
2.5 Randomized Quick Sort
2.6 Improving the Performance of Quick Sort
2.7 Ternary Split Quick Sort
2.8 Radix Sort
3.1 Direct-Address Tables
3.2 Hash Tables
3.3 Table Size
3.4 Using the Frequencies of k-mers
3.5 Techniques for Reducing Table Size
4 Suffix Arrays
4.1 Suffix Trees
4.2 Suffix Arrays
4.3 Binary Search of Suffix Arrays for Queries
4.4 Using the Longest Common Prefix Information to Accelerate the Search
4.5 Computing the Longest Common Prefixes
4.5.1 Application to Occurrence Frequencies of k-mers
4.5.2 Application to the Longest Common Factors
4.6 Suffix Array Construction - Doubling
4.7 Larsson-Sadakane Algorithm
4.8 Linear-Time Suffix Array Construction
4.9 A Note on Practical Performance
5.1 Rabin-Karp Algorithm
5.2 Accelerating the Brute-Force String Search
5.3 Knuth-Morris-Pratt Algorithm
5.4 Bad Character Heuristics
6 Approximate String Search
6.1 Edit Operations and Alignments
6.2 Edit Graphs and Dynamic Programming
6.3 Needleman-Wunsch Algorithm
6.4 Smith-Waterman Algorithm for Computing Local Alignments
6.5 Overlap Alignments
6.6 Alignment of cDNA Sequences with Genomes and Affine Gap Penalties
6.7 Gotoh's Algorithm for Affine Gap Penalties
6.8 Hirschberg's Space Reduction Technique
7 Seeded Alignments
7.1 Sensitivity and Specificity
7.2 Computing Sensitivity and Specificity
7.3 Multiple Hits of Seeds
7.4 Gapped Seeds
7.5 Chaining and Comparative Genomics
7.6 Design of Highly Specific Oligomers
7.7.1 Naive Algorithm
7.7.2 BYP Method
7.8 Partially Matching Seeds
7.9 Overlapping Partially Matching Seeds
8 Whole Genome Shotgun Sequencing
8.1 Sanger Method
8.1.1 Improvements to the Sequencing Method
8.2 Cloning Genomic DNA Fragments
8.3 Basecalling
8.4 Overview of Shotgun Sequencing
8.5 Lander-Waterman Statistics
8.6 Double-Stranded Assembly
8.7.1 Overlap
8.7.2 Layout
8.7.3 Consensus
8.8 Practical Whole Genome Shotgun Assembly
8.8.1 Vector Masking
8.8.2 Quality Trimming
8.8.3 Contamination Removal
8.8.4.1 Seed and Extend
8.8.4.2 Seeding
8.8.4.3 Greedy Merging Approach
8.8.4.4 Longer Repeat Sequence
8.8.4.5 Iterative Improvements
8.8.4.6 Accelerating Overlap Detection
8.8.4.7 Repeat Sequence Detection
8.8.4.9 Repeat Separation
8.8.4.10 Maximal Read Heuristics
8.8.4.11 Paired Pair Heuristics
8.8.4.12 Parallelization
8.8.4.13 Eliminating Chimeric Reads
8.8.5 Scaffolding
8.8.5.1 How Many Mate Pairs Are Needed for Scaffolding?
8.8.5.2 Iterative Improvements
8.8.6 Consensus
8.9 Quality Assessment
8.10 Past and Future
Software Availability
Bibliography
Index

📜 SIMILAR VOLUMES

📁 Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing

✍ Veli Mäkinen et al. 📂 Library 📅 2015 🏛 Cambridge University Press 🌐 English

High-throughput sequencing has revolutionised the field of biological sequence analysis. Its application has enabled researchers to address important biological questions, often for the first time. This book provides an integrated presentation of the fundamental algorithms and data structures that p

📁 Genome-Scale Algorithm Design. Biological Sequence Analysis in the Era of High-Throughput Sequencing

✍ Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, Alexandru I. Tomescu 📂 Library 📅 2015 🏛 Cambridge 🌐 English

📁 Integrating Large-Scale Genomic Information into Clinical Practice: Workshop Summary

✍ Institute of Medicine; Board on Health Sciences Policy; Roundtable on Translatin 📂 Library 📅 2012 🏛 National Academies Press 🌐 English

The initial sequencing of the human genome, carried out by an international group of experts, took 13 years and $2.7 billion to complete. In the decade since that achievement, sequencing technology has evolved at such a rapid pace that today a consumer can have his or her entire genome sequenced by

📁 Large-Scale Graph Processing Using Apache Giraph

✍ Sherif Sakr, Faisal Moeen Orakzai, Ibrahim Abdelaziz, Zuhair Khayyat (auth.) 📂 Library 📅 2016 🏛 Springer International Publishing 🌐 English

This book takes its reader on a journey through Apache Giraph, a popular distributed graph processing platform designed to bring the power of big data processing to graph data. Designed as a step-by-step self-study guide for everyone interested in large-scale graph processing, it

📁 Large-Scale Transport Processes in Oceans and Atmosphere

✍ Maurice L. Blackmon (auth.), J. Willebrand, D. L. T. Anderson (eds.) 📂 Library 📅 1986 🏛 Springer Netherlands 🌐 English

One of the major experiments in earth science at the present time is about to begin: the World Climate Research Program (WCRP). The objectives of WCRP are to determine the extent to which climate change can be predicted, and the extent to which human activities (such as increasing the level of CO

📁 Large-Scale Data Streaming, Processing, and Blockchain Security

✍ Geetanjali Rathee (editor), Dinesh Kumar Saini (editor) 📂 Library 📅 2020 🏛 IGI Global 🌐 English

Data has cemented itself as a building block of daily life. However, surrounding oneself with great quantities of information heightens risks to one's personal privacy. Additionally, the presence of massive amounts of information prompts researchers into how best to handle and disseminate it