fetch study metadata #1

Open
cmungall opened this issue Aug 12, 2020 · 6 comments

@cmungall
Collaborator

Study-level metadata often has richer text than the sample records, which makes it useful input for NLP.

@wdduncan
Collaborator

wdduncan commented Sep 9, 2020

@cmungall should we merge this ticket with #7? I am already generating a lot of data for that ticket.

@wdduncan
Collaborator

Merging this with #7 and closing.

cmungall reopened this Oct 5, 2020
@cmungall
Collaborator Author

cmungall commented Oct 5, 2020

@wdduncan can you assign @hrshdhgd?

@cmungall
Collaborator Author

cmungall commented Oct 9, 2020

https://ftp.ncbi.nlm.nih.gov/bioproject/bioproject.xml

The BioProject database has info on all studies. It also links to samples, e.g.:

    <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044053">E0Y83</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044054">E0Y84</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044055">E0Y85</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044056">E0Y86</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044057">E0Y87</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044058">E0Y88</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044059">E0Y89</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044060">E0Y90</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044061">E0Y91</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044062">E0Y92</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044063">E0Y93</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044064">E0Y94</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044065">E0Y95</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044066">E0Y96</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044067">E0Y97</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044068">E0Y98</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044069">E0Y99</LocusTagPrefix>
    <LocusTagPrefix biosample_id="SAMN11044070">E0Z00</LocusTagPrefix>
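A minimal sketch of how those sample links could be pulled out, assuming the `LocusTagPrefix` elements look like the ones above. The `Wrapper` root element and the `biosample_links` helper are hypothetical names used only to make the fragment parseable on its own.

```python
import xml.etree.ElementTree as ET

# Two of the elements quoted above, wrapped in a hypothetical root element
# so that the fragment is well-formed XML.
snippet = """<Wrapper>
  <LocusTagPrefix biosample_id="SAMN11044051">E0Y81</LocusTagPrefix>
  <LocusTagPrefix biosample_id="SAMN11044052">E0Y82</LocusTagPrefix>
</Wrapper>"""


def biosample_links(xml_text):
    """Return (biosample_id, locus_tag_prefix) pairs found in xml_text."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("biosample_id"), (el.text or "").strip())
        for el in root.iter("LocusTagPrefix")
        if el.get("biosample_id")  # skip prefixes with no sample link
    ]


links = biosample_links(snippet)
```

Elements without a `biosample_id` attribute are skipped here rather than reported, which is a design choice and not something the thread specifies.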

@cmungall
Collaborator Author

cmungall commented Oct 9, 2020

Example text to be mined

<Package>
  <Project>
    <Project>
      <ProjectID>
        <ArchiveID accession="PRJNA13694" archive="NCBI" id="13694"/>
      </ProjectID>
      <ProjectDescr>
        <Name>marine metagenome</Name>
        <Title>Metagenomic analysis of marine microbes isolated during the Global Ocean Sampling Expedition</Title>
        <Description>A broad objective of the Global Ocean Sampling (GOS) Expedition is to assess the genetic diversity in marine microbial communities and understand their role in fundamental processes in nature. Marine microbes influence the cycling of carbon (and other elements) in the world's oceans, acting as a biological conduit that transports carbon dioxide from the surface to the deep oceanic realms. By sequestering carbon from the atmosphere, marine microorganisms (eukaryotes, prokaryotes and viruses) may significantly affect global climate. However, we know little about the physiological processes and complex interactions of communities that impact global carbon cycles and ocean productivity, and our attempts to study their activities are limited by our inability to culture the vast majority of them. These uncultured marine microorganisms are also a rich repository of novel genes and molecular structures that have potential in the development of biocatalysts for industrial and medical applications.
&lt;p&gt;
One avenue of exploration is to sequence the genomes of marine microbes using a metagenomics approach. In 2003, the J. Craig Venter Institute conducted a whole environment shotgun sequencing project to study marine microorganisms in the nutrient-poor Sargasso Sea near Bermuda. This study revealed a remarkable breadth and depth of microbial diversity - about 1,800 different prokaryotic species encoding over 1.2 million genes were discovered. Notably, this study expanded our knowledge of ocean photobiology, microbial diversity and evolution. Results from the pilot study were reported in Science in 2004.
&lt;p&gt;
This pilot study served as the springboard for launching a more comprehensive survey of the bacterial, archaeal and viral diversity of the world's oceans. A global circumnavigation aboard the Sorcerer II sailing yacht began in August 2003, starting in Halifax, Canada and samples were collected at sites along the U.S. east coast, Gulf of Mexico, Galapagos Islands, central and south Pacific Oceans, Australia, Indian Ocean, South Africa, across the Atlantic back to the U.S., and was completed in January 2006. An initial analysis of the microbial sequence data from the first leg of the trip - Halifax to the Galapagos Islands was reported in a special issue of PLoS Biology on Ocean Meganomics in March 2007 (see &lt;a href="http://collections.plos.org/plosbiology/gos-2007"&gt;http://collections.plos.org/plosbiology/gos-2007&lt;/a&gt;). Additional data from the Indian Ocean was released in March 2008.  Shotgun sequencing and deep sequencing of 16S and 18S rRNA is currently underway on additional samples.
&lt;p&gt;
Collectively these studies have produced the largest catalogue of genes to date from thousands of new species, with no apparent slowing of the rate of discovery of novel gene families. These data have potentially far-reaching implications for biological energy production, bioremediation, and creating solutions for reduction/management of greenhouse gas levels in our biosphere. The complete set of data and bioinformatic analysis tools from the &lt;a href="http://web.camera.calit2.net/cameraweb/gwt/org.jcvi.camera.web.gwt.download.BrowseProjectsPage/BrowseProjectsPage.oa?projectSymbol=CAM_PROJ_GOS"&gt;GOS project&lt;/a&gt; is available through the &lt;a href="http://camera.calit2.net/"&gt;CAMERA&lt;/a&gt; metagenomics repository.  These studies have been supported by The Department of Energy, The Gordon and Betty Moore Foundation, and the J. Craig Venter Institute.

&lt;p&gt;
The WGS project and sequences deposited into the Trace Archive can be found using the Project data link.</Description>
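The description above carries escaped HTML (`&lt;p&gt;`, `&lt;a href=...&gt;`), which an XML parser hands back as literal `<p>` and `<a>` tags inside the text. Below is a rough sketch of one way to reduce such a string to plain text before mining it, using only the standard library. `description_to_text` is a hypothetical helper, not something from the project notebook.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and drop every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def description_to_text(description):
    """Strip embedded HTML tags and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(description)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Shortened, made-up input in the same shape as the Description text.
cleaned = description_to_text(
    'First paragraph.\n<p>\nSee the <a href="http://example.org">GOS project</a> page.'
)
```

This sketch flattens paragraph breaks into single spaces and keeps only the link text, not the URL. If the NLP step needs paragraph boundaries or link targets, they would have to be preserved separately.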

hrshdhgd added a commit that referenced this issue Oct 16, 2020
Pushing notebook and output file. addresses #1
@hrshdhgd
Collaborator

hrshdhgd commented Oct 16, 2020

This is my first stab at parsing the study description XML and writing the result to a TSV file. The file has 5 columns:

['StudyId', 'Name', 'Title', 'Description', 'BiosampleId'].

The 'Description' column will be the input to the NLP pipeline, which may give us supplemental information.
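For readers without access to the notebook, here is a hedged sketch of what an XML-to-TSV step with these five columns might look like. It is not the committed code. The element paths, the `PackageSet` root, the choice to join several biosample IDs with `|`, and the IDs in `sample` are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

COLUMNS = ["StudyId", "Name", "Title", "Description", "BiosampleId"]


def iter_study_rows(source):
    """Stream one row per <Package> so a large file is never fully in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "Package":
            continue
        archive = elem.find(".//ProjectID/ArchiveID")
        descr = elem.find(".//ProjectDescr")
        if archive is not None and descr is not None:
            biosamples = [
                e.get("biosample_id")
                for e in elem.iter("LocusTagPrefix")
                if e.get("biosample_id")
            ]
            description = descr.findtext("Description", default="")
            yield {
                "StudyId": archive.get("accession", ""),
                "Name": descr.findtext("Name", default=""),
                "Title": descr.findtext("Title", default=""),
                # Collapse whitespace so each study stays on one TSV line.
                "Description": " ".join(description.split()),
                # Assumption: several sample IDs share one cell, "|"-separated.
                "BiosampleId": "|".join(biosamples),
            }
        elem.clear()  # release the subtree that was just processed


def write_tsv(rows, out):
    """Write rows to a text stream as tab-separated values with a header."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up record in the same shape as the excerpt quoted earlier.
sample = b"""<PackageSet><Package><Project><Project>
<ProjectID><ArchiveID accession="PRJNA000000" archive="NCBI" id="0"/></ProjectID>
<ProjectDescr><Name>example name</Name><Title>Example title</Title>
<Description>Example
description</Description>
<LocusTagPrefix biosample_id="SAMN00000000">ABC12</LocusTagPrefix>
</ProjectDescr></Project></Project></Package></PackageSet>"""

rows = list(iter_study_rows(io.BytesIO(sample)))
buffer = io.StringIO()
write_tsv(rows, buffer)
```

`iter_study_rows` also accepts a file path, so under these assumptions it could be pointed at a local copy of `bioproject.xml`. Emitting one row per biosample instead of one row per study would be an equally reasonable layout.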

turbomam pushed a commit that referenced this issue Jul 2, 2021
Pushing notebook and output file. addresses #1