fetch study metadata #1

cmungall · 2020-08-12T00:49:48Z

Study metadata often has richer text that can be used for NLP.

wdduncan · 2020-09-09T22:00:30Z

@cmungall, should we merge this ticket with #7? I am already generating a lot of data for that ticket.

wdduncan · 2020-09-16T17:19:15Z

Merging this into #7 and closing.

cmungall · 2020-10-05T23:05:54Z

@wdduncan, can you assign @hrshdhgd?

cmungall · 2020-10-09T23:28:18Z

https://ftp.ncbi.nlm.nih.gov/bioproject/bioproject.xml

The BioProject database has information on all studies. It also links to samples, e.g.:

    <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044053">E0Y83</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044054">E0Y84</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044055">E0Y85</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044056">E0Y86</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044057">E0Y87</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044058">E0Y88</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044059">E0Y89</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044060">E0Y90</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044061">E0Y91</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044062">E0Y92</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044063">E0Y93</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044064">E0Y94</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044065">E0Y95</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044066">E0Y96</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044067">E0Y97</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044068">E0Y98</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044069">E0Y99</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044070">E0Z00</LocusTagPrefix>
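
The study-to-sample links shown above could be pulled out along these lines. This is a minimal sketch using only the standard library. The `<Project>` wrapper, the third element without a `biosample_id`, and the helper name `biosample_ids` are assumptions added for illustration; they are not taken from bioproject.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <Project> wrapper only makes it well-formed,
# and the last element is made up to show the filter on missing attributes.
SNIPPET = """
<Project>
  <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
  <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
  <LocusTagPrefix>NOSAMPLE</LocusTagPrefix>
</Project>
"""


def biosample_ids(element):
    """Return the biosample_id of every LocusTagPrefix descendant that has one."""
    return [
        tag.get("biosample_id")
        for tag in element.iter("LocusTagPrefix")
        if tag.get("biosample_id") is not None
    ]


root = ET.fromstring(SNIPPET)
ids = biosample_ids(root)
print(ids)
```

Using `iter()` rather than a fixed path keeps the sketch independent of where `LocusTagPrefix` sits inside a project record.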

cmungall · 2020-10-09T23:31:34Z

Example text to be mined:

<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA13694" archive="NCBI" id="13694"/>
      </ProjectID>
      <ProjectDescr>
        <Name>marine metagenome</Name>
        <Title>Metagenomic analysis of marine microbes isolated during the Global Ocean Sampling Expedition</Title>
        <Description>A broad objective of the Global Ocean Sampling (GOS) Expedition is to assess the genetic diversity in marine microbial communities and understand their role in fundamental processes in nature. Marine microbes influence the cycling of carbon (and other elements) in the world's oceans, acting as a biological conduit that transports carbon dioxide from the surface to the deep oceanic realms. By sequestering carbon from the atmosphere, marine microorganisms (eukaryotes, prokaryotes and viruses) may significantly affect global climate. However, we know little about the physiological processes and complex interactions of communities that impact global carbon cycles and ocean productivity, and our attempts to study their activities are limited by our inability to culture the vast majority of them. These uncultured marine microorganisms are also a rich repository of novel genes and molecular structures that have potential in the development of biocatalysts for industrial and medical applications.
&lt;p&gt;
One avenue of exploration is to sequence the genomes of marine microbes using a metagenomics approach. In 2003, the J. Craig Venter Institute conducted a whole environment shotgun sequencing project to study marine microorganisms in the nutrient-poor Sargasso Sea near Bermuda. This study revealed a remarkable breadth and depth of microbial diversity - about 1,800 different prokaryotic species encoding over 1.2 million genes were discovered. Notably, this study expanded our knowledge of ocean photobiology, microbial diversity and evolution. Results from the pilot study were reported in Science in 2004.
&lt;p&gt;
This pilot study served as the springboard for launching a more comprehensive survey of the bacterial, archaeal and viral diversity of the world's oceans. A global circumnavigation aboard the Sorcerer II sailing yacht began in August 2003, starting in Halifax, Canada and samples were collected at sites along the U.S. east coast, Gulf of Mexico, Galapagos Islands, central and south Pacific Oceans, Australia, Indian Ocean, South Africa, across the Atlantic back to the U.S., and was completed in January 2006. An initial analysis of the microbial sequence data from the first leg of the trip - Halifax to the Galapagos Islands was reported in a special issue of PLoS Biology on Ocean Meganomics in March 2007 (see &lt;a href="http://collections.plos.org/plosbiology/gos-2007"&gt;http://collections.plos.org/plosbiology/gos-2007&lt;/a&gt;). Additional data from the Indian Ocean was released in March 2008.  Shotgun sequencing and deep sequencing of 16S and 18S rRNA is currently underway on additional samples.
&lt;p&gt;
Collectively these studies have produced the largest catalogue of genes to date from thousands of new species, with no apparent slowing of the rate of discovery of novel gene families. These data have potentially far-reaching implications for biological energy production, bioremediation, and creating solutions for reduction/management of greenhouse gas levels in our biosphere. The complete set of data and bioinformatic analysis tools from the &lt;a href="http://web.camera.calit2.net/cameraweb/gwt/org.jcvi.camera.web.gwt.download.BrowseProjectsPage/BrowseProjectsPage.oa?projectSymbol=CAM_PROJ_GOS"&gt;GOS project&lt;/a&gt; is available through the &lt;a href="http://camera.calit2.net/"&gt;CAMERA&lt;/a&gt; metagenomics repository.  These studies have been supported by The Department of Energy, The Gordon and Betty Moore Foundation, and the J. Craig Venter Institute.

&lt;p&gt;
The WGS project and sequences deposited into the Trace Archive can be found using the Project data link.</Description>
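
The Description above contains escaped HTML (`&lt;p&gt;`, `&lt;a href=...&gt;`), which would probably need to be removed before the text goes to an NLP tool. A rough sketch of that cleanup follows. The function name `clean_description` and the sample string are hypothetical, and the regex is a crude tag stripper that assumes the text has no literal `<` or `>` characters of its own.

```python
import html
import re


def clean_description(raw):
    """Turn a Description string into plain text with single spaces."""
    # Decode entities such as &lt;p&gt;. An XML parser has usually done this
    # already, in which case this step changes little.
    text = html.unescape(raw)
    # Crude assumption: anything between < and > is markup (<p>, <a ...>, </a>).
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse newlines and repeated spaces left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample, shortened from the kind of text shown above.
sample = (
    "is currently underway.\n&lt;p&gt;\nSee the "
    '&lt;a href="http://camera.calit2.net/"&gt;CAMERA&lt;/a&gt; repository.'
)
cleaned = clean_description(sample)
print(cleaned)
```

Link targets are discarded here. If the URLs matter for downstream curation, they would need to be captured before the tags are stripped.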

hrshdhgd · 2020-10-16T16:41:14Z

This is my first stab at parsing the study description XML and writing the output to a TSV file. The file has 5 columns:

['StudyId', 'Name', 'Title', 'Description', 'BiosampleId'].

The 'Description' column will feed the NLP pipeline, which may yield supplemental information.
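
For readers without the notebook, a parser producing those five columns might look roughly like the sketch below. It is not the notebook's code. It makes several assumptions: `StudyId` is the `accession` attribute of `ArchiveID`, multiple biosample IDs are joined with `|` in one cell, whitespace inside text fields is collapsed, the root element is `PackageSet`, and the sample record is made up (values shortened and stitched together from the snippets in this thread).

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["StudyId", "Name", "Title", "Description", "BiosampleId"]

# Made-up record for illustration only.
SAMPLE = """<PackageSet>
<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA13694" archive="NCBI" id="13694"/>
      </ProjectID>
      <ProjectDescr>
        <Name>marine metagenome</Name>
        <Title>Metagenomic analysis of marine microbes</Title>
        <Description>Line one.
Line two.</Description>
        <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
        <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
      </ProjectDescr>
    </Project>
  </Project>
</Package>
</PackageSet>"""


def child_text(parent, tag):
    """Text of a direct child, with whitespace collapsed; '' if absent."""
    if parent is None:
        return ""
    return " ".join(parent.findtext(tag, default="").split())


def package_to_row(package):
    """Map one <Package> element to a dict keyed by COLUMNS."""
    archive = package.find(".//ProjectID/ArchiveID")
    descr = package.find(".//ProjectDescr")
    samples = [
        tag.get("biosample_id")
        for tag in package.iter("LocusTagPrefix")
        if tag.get("biosample_id")
    ]
    return {
        "StudyId": archive.get("accession", "") if archive is not None else "",
        "Name": child_text(descr, "Name"),
        "Title": child_text(descr, "Title"),
        "Description": child_text(descr, "Description"),
        # Assumption: all sample IDs of a study share one cell, '|'-separated.
        "BiosampleId": "|".join(samples),
    }


def write_tsv(xml_text, out):
    """Write one TSV row per <Package> found in xml_text to the file-like out."""
    root = ET.fromstring(xml_text)
    writer = csv.DictWriter(
        out, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for package in root.iter("Package"):
        writer.writerow(package_to_row(package))


buffer = io.StringIO()
write_tsv(SAMPLE, buffer)
tsv = buffer.getvalue()
print(tsv)
```

Loading the whole document with `ET.fromstring` is only reasonable for small inputs. For the full bioproject.xml download, a streaming approach such as `ET.iterparse` that clears each `Package` after use would likely be needed.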

Pushing notebook and output file. addresses #1

cmungall assigned wdduncan Aug 12, 2020

wdduncan closed this as completed Sep 16, 2020

cmungall reopened this Oct 5, 2020

wdduncan assigned hrshdhgd and unassigned wdduncan Oct 5, 2020

cmungall mentioned this issue Oct 5, 2020

run NER/CR over all textual metadata fields #31

Open

hrshdhgd added a commit that referenced this issue Oct 16, 2020

Pushing notebook and output file. addresses #1

4177733

hrshdhgd added a commit that referenced this issue Oct 16, 2020

Merge pull request #33 from INCATools/xmlParsing_hhegde

36ac94e

turbomam pushed a commit that referenced this issue Jul 2, 2021

Pushing notebook and output file. addresses #1

3daf5c2

turbomam pushed a commit that referenced this issue Jul 2, 2021

Merge pull request #33 from INCATools/xmlParsing_hhegde

b8ca764