Provide long-term hosting solutions for RDF/Linked Data #81
Two comments inline.
On 5 Mar 2021, at 10:34, Christian Chiarcos wrote:
This is especially for academic data and demonstrators, and it is less about technology or formalisms than about politics.
Long-term availability is a problem for Linked Data technology in general. Infrastructure providers (e.g., libraries) can be hesitant to provide SPARQL endpoints, and long-term maintenance for such services can probably not be expected at all. Case in point: the infamous British Museum SPARQL endpoint, once hailed in the Digital Humanities, but now largely defunct. There are challenges here that can be addressed on a technical level, but this is not what this issue is about.
The best way to get people to maintain Linked Data offerings is for them to get used, and used for things that the provider thinks are useful and that enhance things.
And also for it to form part of the organisation's workflows.
How to get their endpoints used?
Oh, let's make RDF Easy, so that people consume the data, organisations build it into their workflows, and they can access a wealth of RDF developer talent.
Ah, now where did I start?
A minimal requirement to ensure long-term (re-)usability of Linked Data and the feasibility of federation (as unique selling points for RDF technology) would be to guarantee the accessibility of the data itself, ideally in a way that allows resolvable URIs.
So we don't forget:- if the URIs are not resolvable, it isn't Linked Data.
Best
Hugh
… Indeed, major infrastructures for hosting research data have emerged in different fields, e.g., Zenodo, DSpace, or CLARIN, and these could provide exactly that, at least for the academic sector. However, they currently do not seem to support RDF formats as mime types for deposited data (for metadata, they do). See zenodo/zenodo#1515 for the discussion on Zenodo. So the mime type that RDF data is published under is normally text/plain. This means that applications need to guess the format if they attempt to resolve URIs against a resource. This can work, but it is unreliable. In particular, it will fail if URIs do not include the file ending (as recommended, because content negotiation should make that unnecessary, except that it does not work here), or if the data URI carries any flags after the file ending (e.g., "...?download=1").
Example:
FAILURE (using Zenodo-provided data link)
http://www.sparql.org/sparql?query=SELECT+*%0D%0AFROM+%3Chttps%3A%2F%2Fzenodo.org%2Frecord%2F4444132%2Ffiles%2Fcrmtex.owl%3Fdownload%3D1%3E%0D%0AWHERE+%7B+%3Fa+%3Fb+%3Fc+%7D+LIMIT+10&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl
SUCCESS (skipping download flags, providing full file ending)
http://www.sparql.org/sparql?query=SELECT+*%0D%0AFROM+%3Chttps%3A%2F%2Fzenodo.org%2Frecord%2F4444132%2Ffiles%2Fcrmtex.owl%3E%0D%0AWHERE+%7B+%3Fa+%3Fb+%3Fc+%7D+LIMIT+10&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl
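For anyone hitting this from client code, a minimal sketch of the workaround follows (assuming Python with rdflib, and assuming the deposited file is RDF/XML, as the .owl ending suggests): because the repository declares the file as text/plain, you cannot rely on the server-sent media type and have to state the serialization yourself.

from rdflib import Graph

# Zenodo-hosted ontology from the example above. It is served as text/plain,
# so format auto-detection from the HTTP Content-Type is unreliable, and the
# "?download=1" variant of the URL hides the file ending as well.
DATA_URL = "https://zenodo.org/record/4444132/files/crmtex.owl"

g = Graph()
g.parse(DATA_URL, format="xml")  # workaround: state the format explicitly (assumed RDF/XML)
print(len(g), "triples loaded from", DATA_URL)

This works for a one-off local load, but it is exactly the kind of out-of-band knowledge that resolvable Linked Data URIs are supposed to make unnecessary.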
Actually, this is very easy to fix: we just need to petition the maintainers and developers of such infrastructures, repeatedly and massively, to make data declarable as text/turtle (etc.) rather than just text/plain.
The point here is that we have to distinguish the provider, the host and the consumer. For most Linked Open Data, the gain is with the consumer, sometimes with the provider, but usually not with the host.

In an academic context (where much LOD is being produced), the provider (the scientist/project) has an interest in dissemination and re-use. This can help build a career, especially if coupled with attribution, and it's a good signal to send to funders. But the provider is not the host. The provider could be the host: he can put it on some institute's website, where it will remain for, say, 5 years, and then silently disappear. He has probably moved on since and doesn't care either. Or, more likely, he just doesn't have the resources for a reasonable hosting solution. In any case, whatever project created that data will have run out of funding. Still, there may be consumers interested in the data. This can be commercial, or it could be research. And indeed, they may continue to operate on local copies. Except that these copies aren't Linked Data anymore, because they don't resolve.

The British Museum scenario is similar, but not quite the same. Their LOD portal has been a publicity or political thing, I guess, possibly encouraged by the push of Europeana towards LOD (who still run their LOD infrastructure, because it's compatible with their vision). But it was never used internally, because museums, libraries, etc., tend to be conservative (for good reasons, but sometimes this means working with pre-1990 technology, think of MARC or PICA3). However, it was used by the community. Quite a lot, as far as I can tell, which is why there are copies around (we have one, too). But they're not Linked Data anymore, as the URIs won't resolve, and they will never be again unless the British Museum decides to do something about it. But there is no gain in there for them, so they won't. (And there is a cost associated with it, i.e., making sure that their internal catalogue and the LOD version stay synchronized. Which is also why just republishing their old data with new, resolving URIs is not a good solution either, because the data cannot be trusted anymore, and if I recall correctly, it lacks any versioning information, so we cannot even say for which point in time this snapshot was valid.)

This is one scenario only. There are others. In my case, I struggle to find a sustainable hosting solution for some thousand bilingual dictionaries in RDF. As far as I can tell, this is the largest collection of machine-readable open-source bilingual dictionaries in existence. I could get them to Zenodo or just keep them on GitHub (where the smaller ones are, right now), but not with resolving URIs and/or under the right media type. Which is why they aren't Linked Data yet. This massively reduces their potential to be re-used.

But now we actually do have hosting solutions for scientific data, except that they don't support resolvable URIs. If that doesn't change, Linked Data (in academia) is basically dead. Which is a pity, because it's the ideal technology to achieve FAIRness (https://www.go-fair.org/fair-principles/), and this has become a major criterion in scientific grant approval. So, there is a problem (how to improve data re-use in science), there is a demand (achieving FAIRness), there is a technology (Linked Data), there even are hosting services specifically designed to support that (FAIRness), except that they don't (provide resolvable URIs, or, at least, not with the right mime type).
Exactly my point. But the media type issue also extends to non-LOD RDF data, so this is not an exclusive LOD problem.
Data point: In this case it's
@chiarcos have you checked https://w3id.org/? I think it could work for your vocabularies.
Nitpicking: w3id.org is not a host, it is a redirection service. It does not directly solve the problem @chiarcos is raising. However, it can indeed be part of the solution: decoupling the hosting from the naming helps make the dataset robust. If the British Museum had used w3id.org IRIs, then those IRIs could now redirect to one of the mirrors, and this would still be bona fide Linked Data. My 2¢.
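To make the decoupling concrete, here is a minimal sketch of what a client sees when resolving a persistent IRI (the w3id.org path is made up for illustration; assuming Python with requests): the registry owner can repoint the redirect to a mirror at any time without the IRI itself ever changing.

import requests

# Hypothetical persistent IRI; the w3id.org registry would redirect it to
# whichever host currently serves the data (Zenodo, an institutional mirror, ...).
IRI = "https://w3id.org/example/crmtex"

resp = requests.get(IRI, headers={"Accept": "text/turtle"}, allow_redirects=True)
print("redirect chain:", [r.url for r in resp.history])
print("final location:", resp.url)
print("served as:", resp.headers.get("Content-Type"))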
True, and that's how it should work. But then, we still need either some institution (or a mechanism; not a single person) that updates the redirect if the preferred mirror fails, or we need sustainable hosting. I guess Zenodo (for example) is, but URIs will only resolve in a robust fashion if the right media type is declared (or if the file extension/persistent URI indicates the format -- but this is a client-side solution, not necessarily robust, and often considered a design flaw from the side of data modelling). What I want is for Zenodo (etc.) to support text/turtle (etc.). In combination with redirection (be it via w3id.org or another service), this is the gold solution.

As for automatically updating the redirect, DOI comes with the possibility to specify multiple redirection targets and with a selection mechanism among them, but as far as I know, real-world DOI resolution systems require manual selection if multiple URL targets are provided, so this is not a LOD-compliant solution either. (The resolution spec has more options, but "use second URL if the first fails" is not among them.)
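In that gold-solution scenario, a consumer could check mechanically whether an IRI still resolves as Linked Data, i.e., whether the final response after redirection declares an RDF media type rather than text/plain. A rough sketch (assuming Python with requests; the accepted media types listed are illustrative, not exhaustive):

import requests

RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/ld+json",
}

def resolves_as_linked_data(iri: str) -> bool:
    # Follow redirects and check whether the final Content-Type is an RDF media type.
    resp = requests.head(iri, headers={"Accept": "text/turtle"}, allow_redirects=True)
    media_type = resp.headers.get("Content-Type", "").split(";")[0].strip().lower()
    return resp.ok and media_type in RDF_MEDIA_TYPES

# As things stand, this comes back False for the Zenodo file from the example
# above, because (as discussed in this thread) it is served as text/plain.
print(resolves_as_linked_data("https://zenodo.org/record/4444132/files/crmtex.owl"))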