Work Log 07/19

Liang Bo Wang
July 19, 2013

Transcript

  1. Work Log 07/19: 3’ UTR Extraction Using R/Bioconductor

    2013.07, Bioinformatics and Biostatistics Core, NTU Center of Genomic Medicine, National Taiwan University. Slides by Liang Bo Wang. © 2008 MIKI Yoshihito, CC BY 2.0
  2. 3’ UTR Extraction Using R / Bioconductor

    • Gene/Transcript naming system
    • Extraction from UCSC
    • 3’ UTR region duplication, referring to RefSeq Gene ID
  3. Popular Gene Naming Systems

    • Gene (using TP53 as an example)
      – Gene Symbol (TP53)
      – Entrez Gene ID (7157) http://www.ncbi.nlm.nih.gov/gene/7157
      – UniGene ID (409807) http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?UGID=409807
      – Ensembl Gene ID (ENSG00000141510) http://asia.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000141510
      – GeneCards gives a summary report http://www.genecards.org/cgi-bin/carddisp.pl?gene=TP53
  4. Popular Transcript Naming Systems

    • Transcripts (using a transcript of TP53 as an example)
      – UCSC Known Gene ID (uc010cni.1) http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19&hgg_gene=uc010cni.1
        • guaranteed to be unique
      – Ensembl Transcript ID (ENST00000420246) http://asia.ensembl.org/Homo_sapiens/Transcript/Summary?t=ENST00000420246
        • guaranteed to be unique
      – RefSeq Transcript ID (NM_001276696) http://www.ncbi.nlm.nih.gov/nuccore/NM_001276696.1
        • verified by researchers
        • UCSC also adopts this naming system
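As a rough illustration of how the naming systems on the last two slides differ in shape, the sketch below guesses an identifier's system from its pattern. The regular expressions are heuristics inferred from the example IDs above, not official specifications, and `classify_id` is a hypothetical helper name.

```python
import re

# Heuristic patterns inferred from the example IDs on the slides.
# Plain numbers are ambiguous (Entrez Gene and UniGene IDs are both numeric).
ID_PATTERNS = [
    ("RefSeq transcript", re.compile(r"^(NM|NR|XM|XR)_\d+(\.\d+)?$")),
    ("Ensembl gene", re.compile(r"^ENSG\d{11}$")),
    ("Ensembl transcript", re.compile(r"^ENST\d{11}$")),
    ("UCSC Known Gene", re.compile(r"^uc\d{3}[a-z]{3}\.\d+$")),
    ("numeric ID (Entrez Gene or UniGene)", re.compile(r"^\d+$")),
]


def classify_id(identifier):
    """Return a best-guess naming system for an identifier string."""
    for name, pattern in ID_PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown (possibly a gene symbol)"


for example in ["TP53", "7157", "ENSG00000141510", "uc010cni.1", "NM_001276696"]:
    print(example, "->", classify_id(example))
```

A pattern check like this only tells which system an ID probably belongs to; it says nothing about whether the ID still exists in that database.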
  5. Notes About Naming Systems

    • A gene can have multiple transcripts
      – one gene ID corresponds to multiple transcript IDs
    • Conversion between ID systems, for both genes and transcripts, is not trivial
      – one gene can have multiple symbol names: TP53 = BCC7 = LFS1 = TRP53 = P53
      – two transcripts may map to the same entry in another system: uc001asy.1 = uc001asz.3 in Ensembl
      – not all transcripts have names in every naming system: ENST00000503591 = ?? in RefSeq and Known Gene
      – even when a gene ID exists, it is sometimes not retrievable and shows up in the data as NA
      – databases are not always up to date: NM_015209 was previously known as XM_048825
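The conversion pitfalls above can be sketched with plain dictionaries. The alias list comes from the slide. The transcript-to-gene table is a tiny hypothetical stand-in for a real annotation database, and `normalize_symbol` and `gene_of` are illustrative names only.

```python
from collections import defaultdict

# Alias symbols listed on the slide, all pointing at the symbol TP53.
ALIASES = {"TP53": "TP53", "BCC7": "TP53", "LFS1": "TP53",
           "TRP53": "TP53", "P53": "TP53"}

# Hypothetical one-row lookup table standing in for an annotation database.
TRANSCRIPT_TO_GENE = {"NM_001276696": "TP53"}


def normalize_symbol(symbol):
    """Map an alias to its main symbol; None plays the role of NA."""
    return ALIASES.get(symbol.upper())


def gene_of(transcript_id):
    """Look up the gene of a transcript; None when the ID is not in the table."""
    return TRANSCRIPT_TO_GENE.get(transcript_id)


# Invert the table: one gene may collect several transcripts.
gene_to_transcripts = defaultdict(list)
for tx, gene in TRANSCRIPT_TO_GENE.items():
    gene_to_transcripts[gene].append(tx)

print(normalize_symbol("p53"))       # an alias resolves to TP53
print(gene_of("ENST00000503591"))    # missing in this table, so None (NA)
```

Returning None instead of raising keeps missing conversions visible in the output, which mirrors the NA values mentioned on the slide.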
  6. [Screenshot: UCSC Genome Browser]
  7. Back to Our Goal

    • We need to extract all 3’ UTR sequences from known human transcripts
    • How do we define known human transcripts?
      – in the end we chose RefSeq as our identifier database
        • because most of its transcripts have been well validated
      – all information is first unified under the RefSeq ID
      – if we want to know more about a transcript (e.g., gene symbol, gene ID), we query that information by RefSeq ID
  8. How to retrieve all known transcripts?

    • There are two main databases:
      – UCSC Table Browser
      – Ensembl / BioMart
    • UCSC* works well with RefSeq IDs
    • Ensembl* works well with its own Ensembl IDs
    • One can obtain the data by:
      – direct SQL query
      – downloading the whole table in FASTA, BED, or GFF format
      – using third-party interfaces (e.g., BioMart, R/Bioconductor)
    • We use the two highlighted ways (whole-table download and R/Bioconductor)
  9. Data Processing

    • Implementation details are skipped here
    • To download the whole table from the UCSC Table Browser,
      – use the web interface and download the result as a FASTA file
      – each record contains the transcript ID and its location on the genome
      – parse the records using Python (dirty work here)
    • To obtain the table through R/Bioconductor,
      – many packages are involved, but they are easy to pick up
      – full functionality for table fetching, sequence retrieval, and ID conversion
      – efficiency is the main concern (one needs to know how to write faster R code)
      – along the way, some information may be dropped (reason still unknown)
      – all in all, suitable for a first try and for most situations
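The Python parsing step is not shown on the slides. A minimal sketch of what it could look like follows, assuming each FASTA header has the form `>hg19_refGene_<accession> range=<chrom>:<start>-<end> ... strand=<+/->`. This header layout, the function name `parse_fasta`, and the toy sequence are assumptions, not taken from the slides.

```python
import re

# Assumed header layout (not shown on the slides), e.g.
# >hg19_refGene_NM_001130716 range=chr4:84011211-84015839 5'pad=0 3'pad=0 strand=- repeatMasking=none
HEADER_RE = re.compile(
    r"^>\S*?_(?P<acc>N[MR]_\d+)\s+"
    r"range=(?P<chrom>\S+):(?P<start>\d+)-(?P<end>\d+)"
    r".*?strand=(?P<strand>[+-])"
)


def parse_fasta(lines):
    """Yield one dict per FASTA record: accession, location, strand, sequence."""
    record = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record is not None:
                yield record
            match = HEADER_RE.match(line)
            if match is None:
                raise ValueError("unrecognised header: " + line)
            record = {
                "acc": match.group("acc"),
                "chrom": match.group("chrom"),
                "start": int(match.group("start")),
                "end": int(match.group("end")),
                "strand": match.group("strand"),
                "seq": "",
            }
        elif record is not None:
            record["seq"] += line  # sequence lines may wrap across rows
    if record is not None:
        yield record


# Toy input: the header follows the assumed layout, the bases are made up.
sample = [
    ">hg19_refGene_NM_001130716 range=chr4:84011211-84015839 "
    "5'pad=0 3'pad=0 strand=- repeatMasking=none",
    "acgtac",
    "gt",
]
records = list(parse_fasta(sample))
print(records[0]["acc"], records[0]["chrom"], len(records[0]["seq"]))
```

Raising on an unrecognised header, rather than skipping it, makes silent record loss less likely when the file format differs from the assumption.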
  10. UCSC Table Browser Web Interface

    [Screenshot: UCSC Table Browser form, with the following settings visible]
    • clade: Mammal
    • genome: Human
    • assembly: Feb. 2009 (GRCh37/hg19)
    • group: Genes and Gene Prediction Tracks
    • track: UCSC Genes
    • table: knownGene
    • position: chr17:7569703-7588839
    • output format: all fields from selected table
  11. Results of the Two Methods Mismatch

    • As mentioned in last week’s progress report,
      – from the same UCSC Table Browser resource
      – I retrieved ~65K accessions
        • I made a mistake here by using Known Gene (UCSCKG) as the identifier
        • based on RefSeq ID, the number of accessions drops to ~38K
      – 建樂 (a senior labmate) retrieved ~40K accessions
    • The total numbers of accessions are close but not the same. How come?
    • Multiple records for one transcript ID make no sense
      – case 1: multiple genomic location regions in one accession
      – case 2: multiple accessions of the same transcript ID
  12. Case 1: Multiple Genomic Locations in an Accession

    • Example: NM_001130716 (PLAC8)
    • Through R/Bioconductor, 2 entries are returned
    • Through the UCSC Table Browser, one FASTA record is returned
  13. Case 1: Multiple Genomic Locations (cont’d)

    • Through R/Bioconductor,
      – extract the sequence from these two regions on chr4
        • [84015831, 84015839] = 9 (bp)
        • [84011211, 84012124] = 914 (bp), total length = 923 (bp)
    • Through the UCSC Table Browser,
      – from the record’s sequence, we count 923 bp
        • which matches the sequence from R/Bioconductor
      – from the record’s description, it should span
        • [84011211, 84015839] = 4,629 (bp)
        • = the joint range of all reported genomic locations
    • So why does a 3’ UTR span more than one region, with a gap in between?
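The length bookkeeping for NM_001130716 can be checked with a few lines of arithmetic. This is a sketch that counts both endpoints of each interval, the same convention that gives 9 bp and 914 bp for the two blocks. Under that convention the joint range covers 4,629 bp. The variable names are illustrative.

```python
def inclusive_length(start, end):
    """Length of a closed interval [start, end] in bp."""
    return end - start + 1


# The two 3' UTR blocks reported through R/Bioconductor (chr4).
utr_blocks = [(84011211, 84012124), (84015831, 84015839)]

block_lengths = [inclusive_length(s, e) for s, e in utr_blocks]
spliced_length = sum(block_lengths)

# Joint range, as in the UCSC Table Browser record description.
span_start = min(s for s, _ in utr_blocks)
span_end = max(e for _, e in utr_blocks)
span_length = inclusive_length(span_start, span_end)

# Whatever the joint range covers beyond the blocks is the gap between them.
gap_length = span_length - spliced_length

print(block_lengths, spliced_length, span_length, gap_length)
```

The difference between the span and the spliced length is exactly the gap between the two blocks, which is the intron discussed on the following slides.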
  14. Take a Look at the UCSC Genome Browser

    [Screenshot: UCSC Genome Browser view of a transcript on the reverse (−) strand, annotated with: 3’ UTR coordinate by UCSC Table Browser; 3’ UTR coordinates by R/Bioconductor; remaining exon; 5’ UTR; "Here’s another example"]
  15. Case 1: Summary

    • In fact, a 3’ UTR may span multiple exons
    • The UCSC Table Browser returns the coordinates of the joint region covering these exons
      – but returns the sequence with introns removed
      – this explains the mismatch between the length of the genomic region and the length of the sequence
    • Reference: UCSC’s mailing list https://lists.soe.ucsc.edu/pipermail/genome/2010-April/021840.html
    • Figure source: Untranslated Region/Wikipedia
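To make "sequence with introns removed" concrete, here is a minimal sketch that joins exon blocks from a toy chromosome string and reverse-complements the result for minus-strand transcripts. The toy sequence, the coordinates, and the function name `spliced_sequence` are made up for illustration. Real data would come from a genome FASTA or a Bioconductor genome package.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def spliced_sequence(chrom_seq, blocks, strand):
    """Concatenate 1-based, inclusive (start, end) blocks, skipping the gaps.

    For the minus strand the joined sequence is reverse-complemented.
    """
    pieces = [chrom_seq[start - 1:end] for start, end in sorted(blocks)]
    seq = "".join(pieces)
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy chromosome: two 4-bp "exon" blocks separated by an 8-bp "intron".
toy_chrom = "ACGT" + "T" * 8 + "GGCA"
toy_blocks = [(1, 4), (13, 16)]

plus = spliced_sequence(toy_chrom, toy_blocks, "+")
minus = spliced_sequence(toy_chrom, toy_blocks, "-")
print(len(toy_chrom), len(plus), plus, minus)
```

The joint region here is 16 bp while the spliced sequence is 8 bp, the same kind of mismatch as 4,629 bp versus 923 bp in the PLAC8 example.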
  16. Case 2: Multiple Accessions of the Same Transcript ID

    • Example: NM_032454 (STK19)
    • Among all 3’ UTR records, this transcript has 2 records on chr6
      – the sequences of the 2 records match
      – take a look at the genome browser
    • These are likely repeated regions on the genome
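One way to find such cases systematically is to group records by accession and check whether the sequences at the different locations agree. The sketch below assumes records are dicts with `acc`, `chrom`, `start`, and `seq` keys. The coordinates and bases in the toy data are placeholders, not the real STK19 records.

```python
from collections import defaultdict


def find_multi_location(records):
    """Report accessions with more than one record and whether sequences match."""
    by_acc = defaultdict(list)
    for rec in records:
        by_acc[rec["acc"]].append(rec)

    report = {}
    for acc, recs in by_acc.items():
        if len(recs) > 1:
            distinct = {r["seq"].upper() for r in recs}
            report[acc] = {
                "n_records": len(recs),
                "identical_sequence": len(distinct) == 1,
            }
    return report


# Placeholder records: coordinates and bases are made up.
toy_records = [
    {"acc": "NM_032454", "chrom": "chr6", "start": 100, "seq": "ACGTAC"},
    {"acc": "NM_032454", "chrom": "chr6", "start": 900, "seq": "acgtac"},
    {"acc": "NM_001130716", "chrom": "chr4", "start": 500, "seq": "GGGTTT"},
]
dup_report = find_multi_location(toy_records)
print(dup_report)
```

Accessions whose copies carry identical sequence are candidates for genomic repeats, while copies with differing sequence would need a closer look.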
  17. Genome Repeat Regions

    [Screenshot: UCSC Genome Browser on Human Feb. 2009 (GRCh37/hg19) Assembly, chr6:31,945,000-31,985,000 (40,001 bp)]
  18. Case 2: Multiple Accessions (cont’d)

    • Example: NM_001080141 (CT47A6)
      – a member of the Cancer/Testis CT47 family
  19. Case 2: CT47A6 on Genome Browser

    [Screenshot: UCSC Genome Browser on Human Feb. 2009 (GRCh37/hg19) Assembly, chrX:120,067,695-120,117,629 (49,935 bp)]
  20. Summary of 3’ UTR Retrieval

    • From UCSC TB
      • Original total 40,047
        – # in chr1, …, 22, X, Y, (M) = 38,009
          • # NM (transcripts) = 34,299
            – # unique = 33,729
          • # NR (non-coding RNA, e.g. lncRNA) = 3,710
            – # unique = 3,549
        – # in other chromosomes = 2,038
    • From R/Bioconductor
      • Original total 36,155
        – # in chr1, …, 22, X, Y, (M) = 34,360
          • # NM = 34,360
            – # unique = 33,787
          • # NR = 0
        – # in other chromosomes = 1,795
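The counts on this slide can be cross-checked with simple arithmetic. The sketch below verifies that the parts add up to the totals and derives how many NM records are repeats of an accession already seen. The dictionary keys and the `check_summary` helper are illustrative names.

```python
# Counts copied from the slide.
ucsc = {"total": 40047, "main_chroms": 38009, "other_chroms": 2038,
        "nm": 34299, "nm_unique": 33729, "nr": 3710, "nr_unique": 3549}
bioc = {"total": 36155, "main_chroms": 34360, "other_chroms": 1795,
        "nm": 34360, "nm_unique": 33787, "nr": 0, "nr_unique": 0}


def check_summary(summary):
    """Check that the parts add up; return the number of repeated NM records."""
    assert summary["main_chroms"] + summary["other_chroms"] == summary["total"]
    assert summary["nm"] + summary["nr"] == summary["main_chroms"]
    return summary["nm"] - summary["nm_unique"]


ucsc_repeats = check_summary(ucsc)
bioc_repeats = check_summary(bioc)
print(ucsc_repeats, bioc_repeats)
```

Both sources show a few hundred NM records beyond the unique accessions, which is consistent with the multi-location cases described earlier.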
  21. Summary (cont’d)

    • UCSC TB reports 33,729 unique transcript 3’ UTRs
    • R/Bioconductor reports 33,787
      – slightly more than UCSC TB, by 61 records
      – these records require further manual inspection
        • if some records are valid, add them back
    • After this analysis, we hope to know more about the genome reference
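The unique counts differ by 58 (33,787 − 33,729), whereas the slide mentions 61 records. One possible reading, which is only an assumption here, is that 61 accessions appear only in the R/Bioconductor result while a few appear only in the UCSC TB result. A set comparison like the sketch below would show both directions. The accession lists are toy placeholders and `compare_accessions` is a hypothetical helper.

```python
def compare_accessions(ucsc_ids, bioc_ids):
    """Split accessions into shared, R/Bioconductor-only, and UCSC-only."""
    ucsc_set, bioc_set = set(ucsc_ids), set(bioc_ids)
    return {
        "shared": sorted(ucsc_set & bioc_set),
        "only_bioc": sorted(bioc_set - ucsc_set),
        "only_ucsc": sorted(ucsc_set - bioc_set),
    }


# Toy placeholder accessions, not real data.
toy_ucsc = ["NM_0001", "NM_0002", "NM_0003"]
toy_bioc = ["NM_0002", "NM_0003", "NM_0004", "NM_0005"]

result = compare_accessions(toy_ucsc, toy_bioc)
net_difference = len(result["only_bioc"]) - len(result["only_ucsc"])
print(result["only_bioc"], result["only_ucsc"], net_difference)
```

The net difference in counts can be smaller than the number of records found by only one method, so the two lists of source-specific accessions are the ones worth inspecting by hand.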