<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Data Curation | Antal Dániel honlapja</title><link>https://danielantal.eu/hu/tag/data-curation/</link><atom:link href="https://danielantal.eu/hu/tag/data-curation/index.xml" rel="self" type="application/rss+xml"/><description>Data Curation</description><generator>Wowchemy (https://wowchemy.com)</generator><language>hu</language><lastBuildDate>Tue, 20 Feb 2024 16:15:00 +0000</lastBuildDate><image><url>https://danielantal.eu/media/icon_hub9491570ac57158c0eeecc95c95b13e5_20247_512x512_fill_lanczos_center_3.png</url><title>Data Curation</title><link>https://danielantal.eu/hu/tag/data-curation/</link></image><item><title>IDCC24 Lightning Talk Session</title><link>https://danielantal.eu/hu/event/2024-02-20_idcc24/</link><pubDate>Tue, 20 Feb 2024 16:15:00 +0000</pubDate><guid>https://danielantal.eu/hu/event/2024-02-20_idcc24/</guid><description>&lt;p>&lt;a href="https://openmuse.eu/" target="_blank" rel="noopener">Open Music Europe&lt;/a> is a Horizon Europe project that aims to build a working prototype of the planned European Music Observatory.&lt;/p>
&lt;div class="alert alert-note">
&lt;div>
Interested in our &lt;a href="https://reprex.nl/documents/observatory-replication.pdf" target="_blank" rel="noopener">automated data observatories&lt;/a>? Let us meet in Edinburgh at the &lt;a href="https://dcc.ac.uk/events/idcc24/programme" target="_blank" rel="noopener">18th International Digital Curation Conference&lt;/a> and discuss how you could use our open-source, collaborative data infrastructure and know-how.
&lt;/div>
&lt;/div>
&lt;p>The EU, UN, or other international bodies have recognised or initiated at least 60 data observatories that carry out long-term data collection on various domains; we have not found any good policies or practices on how to place these observatories on data infrastructures that are interoperable towards open science and open government. We are creating a data management and governance model and a working MVP that coordinates data collection and statistical data production among scientific, private and official statistical actors.&lt;/p>
&lt;p>Our most crucial pilot project wants to showcase a best practice for using privately-held data, i.e., data of music organisations and surveys carried out by scientific and business actors, to improve the quality of government statistics. We show how the guidelines on using private data as an &amp;lsquo;administrative data source&amp;rsquo; and an ex-ante harmonisation of governmental surveys with open scientific surveys can result in high-quality datasets that fully complement the pre-existing official statistical products and commercial products.&lt;/p>
&lt;p>As a coordination tool, we started developing a Data Management Plan to increase transparency from the outset. Apart from applying Horizon Europe&amp;rsquo;s OpenAIRE recommendations and FAIR requirements, we use the Open Policy Analysis Guidelines to bring open science transparency into the less standardised policy analysis area. We implement this following various UN/EU Guidelines on statistical production, creating a three-way reconciliation and interoperability, i.e., scientific research, public policy design and official statistics.&lt;/p>
&lt;td style="text-align: center;">
&lt;figure id="figure-click-through-to-our-working-paperhttpsmusicdataobservatoryeudocumentsopen_music_europeslovakiaslovak-cult-stat-pilothtml-available-in-pdf-epub-and-html">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="img/blogposts_2023/slovak-cult-stat-pilot_screenshot2.webp" alt="Click through to our [working paper](https://music.dataobservatory.eu/documents/open_music_europe/slovakia/slovak-cult-stat-pilot.html) (available in PDF, epub, and html)." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Click through to our &lt;a href="https://music.dataobservatory.eu/documents/open_music_europe/slovakia/slovak-cult-stat-pilot.html" target="_blank" rel="noopener">working paper&lt;/a> (available in PDF, epub, and html).
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;p>Our work contributes to sharing outputs earlier using Open Research platforms because we are building a framework supported by research automation that integrates open science, business, and official governmental data. We develop a software ecosystem complementing the R statistical environment and language, the lingua franca of official and scientific statistics, to make the data curation, pre-processing, processing, and eventual quality-controlled statistical data release open, transparent, and much timelier.&lt;/p>
&lt;p>Our project follows an open collaboration framework that we design so that private music NGOs and enterprises, statistical offices and open science research groups can work together on the curation, design, production, release and use of data assets in the cultural domain. By opening the statistical infrastructure with our open-source production code and implementing the statistical data and metadata exchange standards simultaneously with other metadata standards and standardisation techniques like ex-ante and retrospective survey harmonisation, we hope to combine these data assets in novel ways while making them available sooner.&lt;/p>
&lt;div class="alert alert-note">
&lt;div>
&lt;a href="https://music.dataobservatory.eu/documents/open_music_europe/slovakia/slovak-cult-stat-pilot.html" target="_blank" rel="noopener">https://music.dataobservatory.eu/documents/open_music_europe/slovakia/slovak-cult-stat-pilot.html&lt;/a>
&lt;/div>
&lt;/div>
&lt;p>Our showcase product will be a twin, linked open data resource: the &lt;code>Slovak Comprehensive Music Database&lt;/code>. It will connect in unprecedented detail information about musical works and their sound recordings and notations in music libraries, heritage organisations and individual and collective rights management organisations. We will derive the Slovak Music Industry Registry from this linked open resource that we will convert into a structural business register satellite as an interface between the privately-held data of music management and music heritage institutions and the national/satellite account system of the Slovak Republic, particularly the Slovak Cultural and Creative Satellite Accounts.&lt;/p>
&lt;p>Let&amp;rsquo;s &lt;a href="https://reprex.nl/contact/" target="_blank" rel="noopener">get in touch&lt;/a> if you are interested.&lt;/p></description></item><item><title>How We Add Value to Public Data With Better Curation And Documentation?</title><link>https://danielantal.eu/hu/post/2021-11-08-indicator_findable/</link><pubDate>Mon, 08 Nov 2021 09:00:00 +0000</pubDate><guid>https://danielantal.eu/hu/post/2021-11-08-indicator_findable/</guid><description>&lt;p>In this example, we show a simple indicator: the &lt;em>Turnover in Radio Broadcasting Enterprises&lt;/em> in many European countries. This is an important demand driver in the &lt;em>Music economy&lt;/em> pillar of our &lt;a href="https://music.dataobservatory.eu/" target="_blank" rel="noopener">Digital Music Observatory&lt;/a>, and an important indicator in our more general &lt;a href="https://ccsi.dataobservatory.eu/" target="_blank" rel="noopener">Cultural &amp;amp; Creative Sectors and Industries Observatory&lt;/a>. Of course, if you work with competition policy or antitrust, then any industry may be interesting to you&amp;ndash;but not all of them are well-served with data.&lt;/p>
&lt;p>This dataset comes from a public data source, the data warehouse of the
European statistical agency, Eurostat. Yet it is not trivial to use:
unless you are familiar with national accounts, you will not find &lt;a href="https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sbs_na_1a_se_r2&amp;amp;lang=en" target="_blank" rel="noopener">this dataset&lt;/a> on the Eurostat website.&lt;/p>
&lt;td style="text-align: center;">
&lt;figure id="figure-the-data-can-be-retrieved-from-the-annual-detailed-enterprise-statistics-for-services-nace-rev2-h-n-and-s95-eurostat-folder">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://danielantal.eu/img/blogposts_2021/eurostat_radio_broadcasting_turnover.png" alt="The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
The data can be retrieved from the Annual detailed enterprise statistics for services NACE Rev.2 H-N and S95 Eurostat folder.
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;p>Our version of this statistical indicator is documented following the &lt;a href="https://www.go-fair.org/fair-principles/" target="_blank" rel="noopener">FAIR principles&lt;/a>: our data assets
are findable, accessible, interoperable, and reusable. While the
Eurostat data warehouse partly fulfills these important data quality
expectations, we can improve them significantly. We can also
improve the dataset itself, as we will show in the &lt;a href="https://danielantal.eu/post/2021-11-06-indicator_value_added/">next blogpost&lt;/a>.&lt;/p>
&lt;details class="toc-inpage d-print-none " open>
&lt;summary class="font-weight-bold">Table of Contents&lt;/summary>
&lt;nav id="TableOfContents">
&lt;ul>
&lt;li>&lt;a href="#findable-data">Findable Data&lt;/a>&lt;/li>
&lt;li>&lt;a href="#accessible-data">Accessible Data&lt;/a>&lt;/li>
&lt;li>&lt;a href="#interoperability">Interoperability&lt;/a>&lt;/li>
&lt;li>&lt;a href="#reuse">Reuse&lt;/a>&lt;/li>
&lt;/ul>
&lt;/nav>
&lt;/details>
&lt;h2 id="findable-data">Findable Data&lt;/h2>
&lt;p>Our data observatories add value by curating the data&amp;ndash;we bring this
indicator to light with a more descriptive name, and we place it in a domain-specific context with our &lt;a href="https://music.dataobservatory.eu/" target="_blank" rel="noopener">Digital Music Observatory&lt;/a> and &lt;a href="https://ccsi.dataobservatory.eu/" target="_blank" rel="noopener">Cultural &amp;amp; Creative Sectors and Industries Observatory&lt;/a> and a policy-specific context with our &lt;em>Competition Data Observatory&lt;/em> and &lt;em>Green Deal Data Observatory&lt;/em>. While many people may need this dataset in the creative sectors, or among cultural policy designers, most of them have no training in working with
national accounts, which involves deciphering national account data codes in records that measure economic activity at a national level. Our curated data observatories bring together much of the available data around important domains. Our &lt;code>Digital Music Observatory&lt;/code>, for example, aims to form an ecosystem of music data users and producers.&lt;/p>
&lt;td style="text-align: center;">
&lt;figure id="figure-we-added-descriptive-metadatahttpszenodoorgrecord5652113yykvbwdmkuk-that-help-you-find-our-data-and-match-it-with-other-relevant-data-sources">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://danielantal.eu/img/blogposts_2021/zenodo_metadata_eurostat_radio_broadcasting_turnover.png" alt="We [added descriptive metadata](https://zenodo.org/record/5652113#.YYkVBWDMKUk) that help you find our data and match it with other relevant data sources." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
We &lt;a href="https://zenodo.org/record/5652113#.YYkVBWDMKUk" target="_blank" rel="noopener">added descriptive metadata&lt;/a> that help you find our data and match it with other relevant data sources.
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;p>We added descriptive metadata that help you find our data and match it
with other relevant data sources. For example, we add keywords and
standardized metadata identifiers from the Library of Congress Linked
Data Services, probably the world’s largest standardized knowledge
library description. This ensures that you can find relevant data
around the same key term (&amp;quot;&lt;a href="https://id.loc.gov/authorities/subjects/sh85110448.html" target="_blank" rel="noopener">Radio broadcasting&lt;/a>&amp;quot;)
in addition to our turnover data. This allows connecting our dataset unambiguously
with other information sources that use the same concept, but may be listed under
different keywords, such as &lt;em>Radio–Broadcasting&lt;/em>, or &lt;em>Radio industry and
trade&lt;/em>, or maybe &lt;em>Hörfunkveranstalter&lt;/em> in German, or &lt;em>Emitiranje
radijskog programa&lt;/em> in Croatian or &lt;em>Actividades de radiodifusão&lt;/em> in
Portuguese.&lt;/p>
&lt;h2 id="accessible-data">Accessible Data&lt;/h2>
&lt;p>Our data is accessible in two forms: in &lt;code>csv&lt;/code> tabular format (which can be
read with Excel, OpenOffice, Numbers, SPSS and many similar spreadsheet
or statistical applications) and in &lt;code>JSON&lt;/code> for automated importing into
your databases. We can also provide our users with SQLite databases,
which are fully functional, single-user relational databases.&lt;/p>
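&lt;p>As an illustration only, the following minimal Python sketch shows how a tidy &lt;code>csv&lt;/code> export could be read, serialised to &lt;code>JSON&lt;/code>, and loaded into a SQLite database using nothing but the standard library. The column names (&lt;code>geo&lt;/code>, &lt;code>year&lt;/code>, &lt;code>turnover&lt;/code>), the table name and all values are hypothetical and do not reproduce our published files.&lt;/p>

```python
import csv
import io
import json
import sqlite3

# Hypothetical CSV content with made-up values; a real file would be
# opened with open(path, newline="") instead of io.StringIO.
CSV_TEXT = """geo,year,turnover
AT,2018,100.0
AT,2019,110.0
BE,2018,200.0
"""


def read_observations(csv_text):
    """Parse CSV text into a list of typed observation dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "geo": record["geo"],
            "year": int(record["year"]),
            "turnover": float(record["turnover"]),
        }
        for record in reader
    ]


observations = read_observations(CSV_TEXT)

# The same observations serialised as JSON for automated importing.
json_text = json.dumps(observations)

# Load the observations into an in-memory, single-user SQLite database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE indicator (geo TEXT, year INTEGER, turnover REAL)"
)
connection.executemany(
    "INSERT INTO indicator VALUES (:geo, :year, :turnover)", observations
)

# Query one country's time series, ordered by year.
austria = connection.execute(
    "SELECT year, turnover FROM indicator WHERE geo = ? ORDER BY year",
    ("AT",),
).fetchall()
print(austria)
```

&lt;p>Because every row is one observation, the same records map directly onto a JSON array and onto a relational table without any reshaping.&lt;/p>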
&lt;p>Tidy datasets are easy to manipulate, model and visualize, and have a
specific structure: each variable is a column, each observation is a
row, and each type of observational unit is a table. This makes the data
easier to clean, and far easier to use in a much wider range of
applications than the original data we used. In theory, this is a simple objective,
yet we find that even governmental statistical agencies&amp;ndash;and even scientific
publications&amp;ndash;often publish untidy data. This poses a significant problem that implies
productivity losses: tidying data requires long hours of work, and if
a reproducible workflow is not used, data integrity can also be compromised:
chances are that the process of tidying will overwrite, delete, or omit a data point or a label.&lt;/p>
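&lt;p>To make the tidy structure concrete, here is a minimal Python sketch, assuming a hypothetical wide table with one row per country and one column per year. The function &lt;code>to_tidy&lt;/code>, the column names and the values are invented for this illustration and are not our production code. It reshapes the table so that each row is a single observation, and it keeps missing values explicit instead of silently dropping them.&lt;/p>

```python
# Hypothetical, made-up values: one row per country, one column per year.
wide_rows = [
    {"geo": "AT", "2018": 100.0, "2019": 110.0},
    {"geo": "BE", "2018": 200.0, "2019": None},
]


def to_tidy(rows, id_column, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row
            tidy.append(
                {
                    id_column: row[id_column],
                    "year": int(column),  # the column header becomes a variable
                    value_name: value,  # missing values stay explicit as None
                }
            )
    return tidy


tidy_rows = to_tidy(wide_rows, id_column="geo", value_name="turnover")
for observation in tidy_rows:
    print(observation)
```

&lt;p>Scripting the reshaping step like this keeps it reproducible: the original table is never edited by hand, so no value or label can be overwritten unnoticed.&lt;/p>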
&lt;td style="text-align: center;">
&lt;figure id="figure-tidy-datasetshttpsr4dshadconztidy-datahtml-are-easy-to-manipulate-model-and-visualize-and-have-a-specific-structure-each-variable-is-a-column-each-observation-is-a-row-and-each-type-of-observational-unit-is-a-table">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://danielantal.eu/img/blogposts_2021/tidy-8.png" alt="[Tidy datasets](https://r4ds.had.co.nz/tidy-data.html) are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
&lt;a href="https://r4ds.had.co.nz/tidy-data.html" target="_blank" rel="noopener">Tidy datasets&lt;/a> are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;p>While the original data source, the Eurostat data warehouse, is
accessible, too, we added value by bringing the data into a &lt;a href="https://www.jstatsoft.org/article/view/v059i10" target="_blank" rel="noopener">tidy
format&lt;/a>. Tidy data can
immediately be imported into a statistical application like SPSS or
Stata, or into your own database. It is immediately available for
plotting in Excel, OpenOffice or Numbers.&lt;/p>
&lt;h2 id="interoperability">Interoperability&lt;/h2>
&lt;p>Our data can be easily imported alongside, or joined with, data from other internal or external sources.&lt;/p>
&lt;td style="text-align: center;">
&lt;figure id="figure-all-our-indicators-come-with-standardized-descriptive-metadata-and-statistical-processing-metadata-see-our-apihttpsapimusicdataobservatoryeudatabasemetadata">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://danielantal.eu/img/observatory_screenshots/DMO_API_metadata_table.png" alt="All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our [API](https://api.music.dataobservatory.eu/database/metadata/) " loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
All our indicators come with standardized descriptive metadata, and statistical (processing) metadata. See our &lt;a href="https://api.music.dataobservatory.eu/database/metadata/" target="_blank" rel="noopener">API&lt;/a>
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;p>All our indicators come with standardized descriptive metadata,
following two important standards, the &lt;a href="https://dublincore.org/" target="_blank" rel="noopener">Dublin Core&lt;/a> and
&lt;a href="https://datacite.org/" target="_blank" rel="noopener">DataCite&lt;/a>–implementing not only the mandatory,
but the recommended descriptions, too. This will make it far easier to
connect the data with other data sources, e.g. turnover with the number of radio broadcasting enterprises or radio stations within specific territories.&lt;/p>
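&lt;p>A minimal Python sketch of such a connection, assuming two hypothetical tidy tables that share the keys &lt;code>geo&lt;/code> and &lt;code>year&lt;/code>: the helper &lt;code>inner_join&lt;/code>, the column names and all values are invented for illustration only. It joins turnover with the number of enterprises and derives the turnover per enterprise.&lt;/p>

```python
# Hypothetical tidy tables with made-up values that share the keys geo and year.
turnover = [
    {"geo": "AT", "year": 2019, "turnover": 110.0},
    {"geo": "BE", "year": 2019, "turnover": 200.0},
]
enterprises = [
    {"geo": "AT", "year": 2019, "enterprises": 55},
    {"geo": "HU", "year": 2019, "enterprises": 40},
]


def inner_join(left, right, keys):
    """Join two lists of rows, keeping only rows whose keys match in both."""
    index = {tuple(row[key] for key in keys): row for row in right}
    joined_rows = []
    for row in left:
        match = index.get(tuple(row[key] for key in keys))
        if match is not None:
            joined_rows.append({**row, **match})
    return joined_rows


joined = inner_join(turnover, enterprises, keys=("geo", "year"))

# Derive a new indicator from the two joined sources.
for row in joined:
    row["turnover_per_enterprise"] = row["turnover"] / row["enterprises"]

print(joined)
```

&lt;p>The join only works unambiguously because both tables use the same standardized codes for territory and time, which is what shared metadata standards are meant to guarantee.&lt;/p>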
&lt;p>Our passion for documentation standards and best practices goes much further: our data uses &lt;a href="https://sdmx.org/?page_id=3215/" target="_blank" rel="noopener">Statistical Data and Metadata eXchange&lt;/a> standardized codebooks, unit descriptions and other statistical and administrative metadata.&lt;/p>
&lt;td style="text-align: center;">
&lt;figure id="figure-we-participate-in-scientific-workhttpsreprexnlpublicationeuropean_visibilitiy_2021-related-to-data-interoperability">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://danielantal.eu/img/reports/european_visbility_publication.png" alt="We participate in [scientific work](https://reprex.nl/publication/european_visibilitiy_2021/) related to data interoperability." loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
We participate in &lt;a href="https://reprex.nl/publication/european_visibilitiy_2021/" target="_blank" rel="noopener">scientific work&lt;/a> related to data interoperability.
&lt;/figcaption>&lt;/figure>&lt;/td>
&lt;h2 id="reuse">Reuse&lt;/h2>
&lt;p>All our datasets come with standardized information about reusability.
We add citation, attribution data, and licensing terms. Most of our
datasets can be used without commercial restriction after acknowledging
the source, but we sometimes work with less permissive data licenses.&lt;/p>
&lt;p>In the case presented here, we added further value to encourage re-use. In addition to tidying, we significantly increased the usability of public data by handling
missing cases. This is the subject of our &lt;a href="https://danielantal.eu/post/2021-11-06-indicator_value_added/">next blogpost&lt;/a>.&lt;/p>
&lt;details class="spoiler " id="spoiler-6">
&lt;summary>Are you a data user? How could we serve you better?&lt;/summary>
&lt;p>&lt;em>Shall we do some further automatic data enhancements with our datasets? Document with different metadata? Link more information for business, policy, or academic use? Please get in touch with &lt;a href="https://reprex.nl/#contact" target="_blank" rel="noopener">us&lt;/a>!&lt;/em>&lt;/p>
&lt;/details></description></item></channel></rss>