Data Repositories: Data Sharing Best Practices

Planning for Data Sharing

Data Management Planning

Effective data sharing begins long before publication. Start with a comprehensive data management plan that addresses:

  • What data will be collected or generated
  • How the data will be managed during the research process
  • Who will have access to the data during the project
  • What will happen to the data after the project concludes
  • How the data will be preserved and shared

Data Management Plan Tools

  • DMPTool: Web-based tool that walks researchers through creating comprehensive data management plans tailored to specific funder requirements
  • DMPonline: UK-based tool from the Digital Curation Centre, similar to DMPTool

Timing Considerations

  • Ideally, plan for data sharing before data collection begins
  • Consider embargo periods that balance publication priorities with data sharing requirements
  • Set up regular data management check-ins throughout your research project

Data Organization and Documentation

File Organization Best Practices

Hierarchical Folder Structure
  • Create a logical hierarchy that separates raw data, processed data, and analysis
  • Use consistent, meaningful folder names
  • Document the folder structure in a README file
File Naming Conventions
  • Use consistent, descriptive file names
  • Include version information when appropriate
  • Avoid spaces and special characters
  • Consider including date information in YYYY-MM-DD format
  • Example: patientdata_diabetes_wave2_2023-10-15_v3.csv
Version Control
  • Establish a clear versioning system
  • Document changes between versions
  • Consider using tools like Git for code and documentation (though not typically for large datasets)
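
The naming conventions above can be enforced with a short script. The following is a minimal Python sketch, assuming a hypothetical convention of lowercase descriptive parts, an ISO date, and a version number joined by underscores. The pattern and the function names build_name and follows_convention are illustrative, not a standard.

```python
import re
from datetime import date

# Hypothetical convention: <part>_<part>_..._<YYYY-MM-DD>_v<N>.<ext>
NAME_PATTERN = re.compile(
    r"^[a-z0-9]+(?:_[a-z0-9]+)*_\d{4}-\d{2}-\d{2}_v\d+\.[a-z0-9]+$"
)


def build_name(parts, day, version, ext):
    """Join descriptive parts, an ISO date, and a version into one file name."""
    # Lowercase each part and drop spaces and special characters.
    cleaned = [re.sub(r"[^a-z0-9]", "", part.lower()) for part in parts]
    return f"{'_'.join(cleaned)}_{day.isoformat()}_v{version}.{ext}"


def follows_convention(name):
    """Return True if the file name matches the pattern above."""
    return NAME_PATTERN.match(name) is not None


name = build_name(["patientdata", "diabetes", "wave2"], date(2023, 10, 15), 3, "csv")
print(name)  # patientdata_diabetes_wave2_2023-10-15_v3.csv
print(follows_convention(name))  # True
print(follows_convention("Patient Data (final).xlsx"))  # False
```

A check like follows_convention could be run over a project folder before deposit to flag files that drifted from the agreed pattern.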

Comprehensive Documentation

Study-Level Documentation
  • Study protocols
  • Data collection instruments
  • Recruitment materials
  • IRB/ethics approvals (with sensitive information redacted)
  • Data processing workflows
  • Analytical methods
Dataset-Level Documentation
  • Data dictionaries defining all variables
  • Codebooks explaining coding schemes and classifications
  • README files explaining dataset contents and organization
  • Relationships between multiple datasets
Recommended Documentation Elements
  • Variable names and definitions
  • Units of measurement
  • Methodology for derived variables
  • Missing data codes and explanations
  • Quality assurance procedures
  • Known limitations or biases
  • Temporal and spatial coverage
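
As an illustration of the documentation elements above, a data dictionary can be kept as a simple table and checked against the dataset it describes. This is a minimal Python sketch. The variable names, definitions, and missing-data codes are invented for the example, and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical variables for illustration; replace with your own.
DATA_DICTIONARY = [
    {"variable": "participant_id", "definition": "Study-assigned identifier",
     "units": "", "missing_code": ""},
    {"variable": "age_years", "definition": "Age at enrollment",
     "units": "years", "missing_code": "-99"},
    {"variable": "hba1c", "definition": "Glycated hemoglobin at baseline",
     "units": "percent", "missing_code": "-99"},
]


def dictionary_to_csv(entries):
    """Serialize the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(entries[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(header, entries):
    """List dataset columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in header if column not in documented]


print(dictionary_to_csv(DATA_DICTIONARY))
print(undocumented_columns(["participant_id", "age_years", "bmi"], DATA_DICTIONARY))
```

Saving the dictionary in an open format such as CSV keeps it readable alongside the data it documents.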


Data Preparation and Formatting

File Formats for Sharing
Preferred Formats
  • Non-proprietary, open formats whenever possible
  • Machine-readable formats
  • Formats with long-term support and broad compatibility
Format Recommendations by Data Type
  • Tabular data: CSV, TSV, or ODS rather than Excel
  • Text: Plain text, PDF/A, or XML
  • Images: TIFF, PNG, JPEG 2000, or PDF/A
  • Audio: FLAC or WAVE instead of MP3
  • Statistical analysis: Export syntax files from SPSS, SAS, or Stata
  • Genomic data: FASTQ, BAM, or VCF formats

Data Cleaning Best Practices

  • Document all cleaning processes performed
  • Maintain original raw data files separately from cleaned versions
  • Provide cleaning scripts or protocols when possible
  • Check for inconsistencies, outliers, and missing values
  • Standardize variable names and formats
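
A cleaning script is one way to document these steps. The sketch below, in Python, assumes a small CSV with inconsistent column names and several missing-value tokens. The tokens, the sample rows, and the function names are assumptions made for illustration. The raw text is read but never modified.

```python
import csv
import io
import re


def standardize_name(name):
    """Lowercase a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def clean_rows(raw_text, missing_tokens=("", "NA", "n/a", "-99")):
    """Return (cleaned_rows, missing_counts); the raw input is left untouched."""
    reader = csv.DictReader(io.StringIO(raw_text))
    cleaned, missing = [], {}
    for row in reader:
        new_row = {}
        for key, value in row.items():
            column = standardize_name(key)
            value = (value or "").strip()
            if value in missing_tokens:
                missing[column] = missing.get(column, 0) + 1
                value = ""  # one consistent marker for missing values
            new_row[column] = value
        cleaned.append(new_row)
    return cleaned, missing


# Fabricated example rows, not real data.
RAW = "Patient ID,Age (years),HbA1c %\n001, 54 ,7.1\n002,NA,6.4\n003,61,-99\n"
rows, missing = clean_rows(RAW)
print(rows[0])
print(missing)
```

Sharing a script like this with the dataset lets others see exactly how the cleaned version was derived from the raw files.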

Sensitive Data Considerations

Types of Identifiers
  • Direct identifiers: Names, addresses, phone numbers, email addresses, medical record numbers, etc.
  • Indirect identifiers: Combinations of attributes that together could identify an individual (e.g., rare diagnosis + zip code + age)
De-identification Techniques
  • Removal of direct identifiers
  • Aggregation of variables (e.g., age ranges instead of exact ages)
  • Generalization (e.g., first 3 digits of zip code)
  • Perturbation (adding noise to values)
  • k-anonymity (ensuring each record shares its combination of indirect identifiers with at least k-1 other records)
Safe Harbor Method (HIPAA)
  • Removal of 18 specific identifier types
  • No actual knowledge that the remaining information could be used to identify subjects
  • Note that HIPAA compliance may not be sufficient for true anonymization
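
To make generalization and k-anonymity concrete, here is a minimal Python sketch. It assumes records with hypothetical fields age, zip, and diagnosis, coarsens age into 10-year bands and zip codes to their first three digits, and then reports the size of the smallest group of records sharing the same indirect identifiers. The records are fabricated. A check like this is a starting point only and does not by itself establish that data are de-identified.

```python
from collections import Counter


def generalize(record):
    """Coarsen indirect identifiers: 10-year age bands and 3-digit zip prefixes."""
    low = (record["age"] // 10) * 10
    return {
        "age_band": f"{low}-{low + 9}",
        "zip3": record["zip"][:3],
        "diagnosis": record["diagnosis"],
    }


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same identifier values."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Fabricated example records, not real data.
records = [
    {"age": 34, "zip": "11203", "diagnosis": "T2D"},
    {"age": 37, "zip": "11226", "diagnosis": "T2D"},
    {"age": 52, "zip": "11203", "diagnosis": "T1D"},
    {"age": 58, "zip": "11210", "diagnosis": "T2D"},
]

released = [generalize(record) for record in records]
print(smallest_group(records, ["age", "zip"]))  # 1: every raw record is unique
k = smallest_group(released, ["age_band", "zip3"])
print(k)  # 2: each generalized record matches at least one other
```

If the smallest group size falls below the k chosen for the project, the identifiers can be generalized further or the affected records suppressed.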

Metadata Standards and Practices

The Importance of Metadata

Metadata—data about data—is crucial for discovery, understanding, and reuse. Without good metadata, your dataset may be technically available but practically unusable.

Disciplinary Metadata Standards

Medical and Health Sciences Standards
  • CDISC: Clinical Data Interchange Standards Consortium models
  • MIAME: Minimum Information About a Microarray Experiment
  • BIDS: Brain Imaging Data Structure
  • HL7 FHIR: Fast Healthcare Interoperability Resources
  • DICOM: Digital Imaging and Communications in Medicine

Generic Metadata Standards

Creating Effective Metadata

  • Include information about the what, who, when, where, and how of your data
  • Balance comprehensiveness with usability
  • Consider both human and machine readability
  • Use controlled vocabularies when available
  • Provide context to make data meaningful
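
One way to cover the what, who, when, where, and how is to keep a machine-readable record alongside the human-readable documentation. The Python sketch below uses hypothetical field names chosen for illustration. A real deposit should follow the schema of the chosen repository or discipline.

```python
import json

# Hypothetical field names; use your repository's schema in practice.
REQUIRED_FIELDS = [
    "title", "creators", "description",
    "date_collected", "location", "methods", "keywords",
]


def missing_fields(record):
    """List required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Fabricated example record.
record = {
    "title": "Example diabetes cohort, wave 2",            # what
    "creators": ["Researcher, Example A."],                # who
    "description": "Survey and lab measures from wave 2.",
    "date_collected": "2023-10-15",                        # when
    "location": "",                                        # where (not yet filled in)
    "methods": "Self-administered questionnaire.",         # how
    "keywords": ["diabetes", "cohort"],
}

print(json.dumps(record, indent=2))  # machine-readable form
print(missing_fields(record))  # ['location']
```

Running a completeness check before submission helps catch empty fields that would make the dataset harder to find or interpret.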

Data Repository Selection and Submission

Evaluating Repositories

Technical Considerations
  • File format support
  • Size limitations
  • Versioning capabilities
  • API availability
  • Integration with other tools and services
Trust and Sustainability
Visibility and Accessibility
  • Indexing in major search engines
  • Domain-specific vs. general visibility
  • Access controls if needed
  • Usage metrics and statistics

Submission Best Practices

  • Review repository-specific guidelines before submission
  • Prepare all materials according to repository standards
  • Complete all metadata fields thoroughly
  • Consider including analysis code alongside data
  • Understand the review process and timeline
  • Test dataset usability from a potential user's perspective

Data Citation and Credit

Creating Citable Datasets
  • Ensure your dataset has a persistent identifier (preferably a DOI)
  • Include recommended citation format in your documentation
  • Consider publishing a data paper that describes your dataset in detail
Tracking Data Impact
  • Monitor citations of your dataset
  • Track usage statistics provided by repositories
  • Consider including your datasets in your ORCID profile
  • Mention datasets in your CV and biosketches
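
A recommended citation can be generated from the same details recorded in the dataset's metadata. The Python sketch below assumes a common pattern of creators, year, title, version, publisher, and identifier. The function name is hypothetical, the DOI shown is a placeholder rather than a real identifier, and the repository's own recommended format should take precedence where one exists.

```python
def format_citation(creators, year, title, version, publisher, doi):
    """Build a dataset citation string from its basic descriptive elements."""
    names = "; ".join(creators)
    return (
        f"{names} ({year}). {title}. Version {version}. "
        f"{publisher}. https://doi.org/{doi}"
    )


# Fabricated example values; the DOI is a placeholder.
citation = format_citation(
    creators=["Researcher, E. A.", "Colleague, B."],
    year=2023,
    title="Example diabetes cohort, wave 2",
    version=3,
    publisher="Example Repository",
    doi="10.1234/example",
)
print(citation)
```

Placing the resulting string in the README gives users a ready-made citation to copy.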


Ethical and Legal Considerations

Informed Consent for Data Sharing
  • Include data sharing information in consent forms
  • Be transparent about potential future uses
  • Consider broad consent models when appropriate
  • Document sharing limitations based on consent
Sample Consent Language for Data Sharing

"The data collected in this study will be stored in a secure database. After removing all information that could identify you, we plan to share the data with other researchers who may use it for different research questions. These other researchers may be at [institution] or at other research centers. By agreeing to participate in this study, you agree to the future use of your de-identified data for other research purposes."


Data Licensing

Common License Types

Choosing a License

  • Consider institutional policies
  • Balance openness with necessary restrictions
  • Ensure compatibility with data repository policies
  • Document license choice clearly in metadata

Addressing Cultural Sensitivity

  • Consider the perspectives of the communities represented in your data
  • Acknowledge community contributions appropriately
  • Be aware of potential harms from data misuse or misinterpretation
  • Follow relevant guidelines for research with Indigenous communities

Making Data Truly FAIR

Practical Steps Toward FAIR Data

  • Findable: Use persistent identifiers, rich metadata, and register in searchable resources
  • Accessible: Store in a trusted repository with clear access conditions
  • Interoperable: Use standard formats and vocabularies
  • Reusable: Include detailed documentation and clear licenses

Assessing FAIR Compliance

  • Use self-assessment tools to evaluate your data against the FAIR principles
  • Choose repositories with features that support FAIR data
  • Treat FAIRness as continuous improvement rather than a one-time check
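
The practical steps listed for each FAIR principle can be turned into a simple checklist. The Python sketch below is an informal illustration with hypothetical item names drawn from those steps. It is not one of the established FAIR assessment tools and does not produce an official score.

```python
# Hypothetical yes/no items, grouped by FAIR principle.
CHECKLIST = {
    "Findable": [
        "has_persistent_identifier",
        "has_rich_metadata",
        "registered_in_searchable_resource",
    ],
    "Accessible": ["in_trusted_repository", "access_conditions_stated"],
    "Interoperable": ["uses_standard_formats", "uses_standard_vocabularies"],
    "Reusable": ["has_detailed_documentation", "has_clear_license"],
}


def fair_gaps(answers):
    """Return, for each principle, the checklist items not yet satisfied."""
    gaps = {}
    for principle, items in CHECKLIST.items():
        unmet = [item for item in items if not answers.get(item, False)]
        if unmet:
            gaps[principle] = unmet
    return gaps


# Example: everything is in place except the license.
answers = {item: True for items in CHECKLIST.values() for item in items}
answers["has_clear_license"] = False
print(fair_gaps(answers))  # {'Reusable': ['has_clear_license']}
```

Repeating a check like this at each project milestone supports the continuous improvement approach described above.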

Beyond FAIR: Building Data Communities

  • Participate in community standards development
  • Engage with data users to improve sharing practices
  • Advocate for recognition of data sharing contributions
  • Mentor others in effective data sharing practices
