CSVS File Format

This document specifies the CSVS file format.

"CSVS" stands for "Comma-Separated Value Store".

The CSVS file format is a subset of RFC 4180. Where this document contradicts the RFC, the RFC takes precedence and this document should be corrected.

A CSVS file MUST be encoded in UTF-8.

A CSVS file MUST have the .csv file extension.

grammar

newline: either Carriage Return 0x0D \r, Line Feed 0x0A \n, or both CR LF 0x0D 0x0A \r\n
string: sequence of any UTF-8 characters; newlines MUST be escaped
key: string
value: string
file: [[key][,[value]]newline]

each line in a csvs file MUST represent a relation between values of two collections

each line MUST contain zero, one, or two values separated by a comma

a value that contains a comma, a newline, or a double quote MUST be escaped with double quotes ""
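
The quoting rule above can be sketched with Python's standard csv module. This is a minimal illustration, not a reference implementation: the function name format_line is hypothetical, and the sketch assumes that RFC 4180 style quoting (enclose the value in double quotes, double any embedded double quote) is what the rule intends.

```python
import csv
import io


def format_line(key, value):
    """Serialize one key/value pair as a single CSVS line (hypothetical helper)."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes only the fields that need it: those containing
    # a comma, a double quote, or the line terminator character.
    # Embedded double quotes are doubled, as in RFC 4180.
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([key, value])
    return buffer.getvalue()
```

For example, format_line("2", "bob,alice") produces a line whose second field is wrapped in double quotes, while format_line("1", "bob") is written without any quotes.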

omitted value MUST represent an empty string

all characters between the first unescaped comma and an unescaped newline MUST be read as part of the second value

multiple identical lines MUST represent multiple unique relations between identical values

an exact duplicate of a line MUST represent two unique relations

a line that consists only of a newline character MUST represent a relation between one empty string value "" and another empty string value ""

the file in csvs format CAN be called a "tablet"

the first column CAN be called a "key"

the second column CAN be called a "value"

empty lines \n MUST be ignored.

a trailing newline \n MUST be ignored.

these are equivalent

,\n
"",\n
"",""\n

a line CAN have no comma.

these are equivalent

2024-01-01\n
2024-01-01,""\n

these are equivalent

"\n"\n
"\n",\n

examples

1,bob\n: key is 1, value is bob
1,bob\\n\n: key is 1, value is bob\n
,bob\n: key is "", value is bob
1,\n: key is 1, value is ""
1\n: key is 1, value is ""
\n: key is "", value is ""
2,bob,alice\n: key is 2, value is bob,alice
3,apple\n3,pear\n: key is 3, values are apple and pear
3,apple\n3,apple\n: key is 3, values are apple and apple
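
The examples above can be reproduced with a short reader built on Python's standard csv module. This is a sketch under stated assumptions: the name parse_tablet is hypothetical, blank lines are skipped (following the rule that empty lines are ignored, although the example list also reads a bare newline as a pair of empty strings), and extra commas are handled by re-joining the remaining fields, which does not preserve quotes that appeared after the first comma. Backslash sequences such as \\n are left untouched.

```python
import csv
import io


def parse_tablet(text):
    """Return the (key, value) pairs of a CSVS tablet given as text."""
    pairs = []
    # newline="" leaves line endings untranslated so the csv module can
    # handle quoted newlines itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    for row in reader:
        if not row:
            continue  # assumption: blank lines are ignored
        key = row[0]
        # Everything after the first comma belongs to the value;
        # a missing value becomes the empty string.
        value = ",".join(row[1:])
        pairs.append((key, value))
    return pairs
```

Duplicate lines are kept as separate pairs, so parse_tablet("3,apple\n3,apple\n") yields two relations rather than one.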

CSVS Dataset Format

This document specifies the CSVS dataset format.

"CSVS" stands for "Comma-Separated Value Store".

a csvs dataset represents relationships between collection values

terminology

each collection CAN be called a "branch", plural "branches"

a collection without attributes CAN be called a "twig", plural "twigs"

a collection with attributes CAN be called a "trunk", plural "trunks"

a collection that is an attribute of another collection CAN be called a "leaf", plural "leaves"

a collection that is not an attribute of any other collection CAN be called a "root", plural "roots"

a dataset CAN have multiple roots

a branch CAN have multiple trunks

a branch CAN have multiple leaves
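
The terminology above can be illustrated by classifying collections from a list of (collection, attribute) pairs. This is a minimal sketch: the function name classify_branches and the input shape are assumptions made for illustration, and a collection that is only an attribute of itself is treated as a root, since the definitions speak of "another" collection.

```python
def classify_branches(relations):
    """Classify collections given (collection, attribute) pairs."""
    branches = set()
    trunks = set()
    leaves = set()
    for collection, attribute in relations:
        branches.update((collection, attribute))
        trunks.add(collection)  # it has at least one attribute
        if collection != attribute:
            leaves.add(attribute)  # it is an attribute of another collection
    return {
        "branches": branches,
        "trunks": trunks,
        "twigs": branches - trunks,  # collections without attributes
        "leaves": leaves,
        "roots": branches - leaves,  # not an attribute of any other collection
    }
```

With the relations event,date and event,description, "event" is both a trunk and a root, while "date" and "description" are both twigs and leaves.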

.csvs.csv

a dataset MUST contain a tablet named .csvs.csv which describes the dataset

this tablet is for metadata

the .csvs.csv tablet MUST have the line csvs,0.0.2

this line is to support future breaking changes to the format.
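
A reader could use the version line to decide whether it understands a dataset. The sketch below assumes the content of .csvs.csv has already been read into a string; the function name read_format_version is hypothetical, and how a reader should react to an unknown version is not defined here.

```python
import csv
import io


def read_format_version(metadata_text):
    """Return the value of the first "csvs" line in .csvs.csv, or None."""
    for row in csv.reader(io.StringIO(metadata_text, newline="")):
        # The version line has the key "csvs" and the version as its value.
        if len(row) >= 2 and row[0] == "csvs":
            return row[1]
    return None
```

For a metadata tablet containing the line csvs,0.0.2 the function returns the string 0.0.2; for a tablet without that line it returns None.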

`_-_.csv`

a dataset SHOULD contain a tablet named _-_.csv which describes relationships between collections

reserved technical implementation details

underscore-dash-underscore

examples:

_-_.csv: event,date - dataset has an "event" collection with an attribute "date"

if there is no _-_.csv tablet, dataset MUST be considered empty
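
Reading the schema tablet can be sketched as follows. The function name parse_schema is an assumption, as is the choice to represent a missing _-_.csv by passing None, which yields an empty schema in line with the rule above.

```python
import csv
import io


def parse_schema(schema_text):
    """Map each collection to the list of its attributes from _-_.csv text."""
    schema = {}
    if schema_text is None:
        return schema  # no _-_.csv tablet: the dataset is considered empty
    for row in csv.reader(io.StringIO(schema_text, newline="")):
        if len(row) < 2:
            continue  # assumption: lines without an attribute are skipped
        collection, attribute = row[0], row[1]
        schema.setdefault(collection, []).append(attribute)
    return schema
```

For the line event,date this returns a mapping from "event" to a list containing "date".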

a collection name MUST NOT be "_".

a collection name MUST NOT include the following characters: [/\<>':"```|?*-.,[];{}$&].

a collection name CAN include any of the following: [azAZ09_%+@], whitespace and other unicode characters
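
The naming rules can be checked with a small validator. This is a sketch: the function name is_valid_collection_name is hypothetical, the run of backticks in the forbidden list is read as a single backtick character, and the empty name is not treated specially because the rules do not mention it.

```python
# Characters a collection name must not contain, per the list above.
FORBIDDEN_CHARACTERS = set("/\\<>':\"`|?*-.,[];{}$&")


def is_valid_collection_name(name):
    """Check a collection name against the naming rules."""
    if name == "_":
        return False  # "_" is reserved
    return not any(character in FORBIDDEN_CHARACTERS for character in name)
```

Under these assumptions "text_hash" is accepted, while "event-date" is rejected because the dash separates collection names in tablet file names.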

NOTE: when there's no _-_.csv file, list directory and deduce relations from tablet names.

collection-collection.csv

underscore is like SQL table? underscore is not like SQL table? underscore is like MongoDB collection? underscore is not like MongoDB collection?

a dataset CAN have a tablet named {collection1}-{collection2}.csv which describes relationships between values of two collections

contains values of two collections

"went to groceries" is an identifier here.

examples:

description-date.csv: went to groceries,2024-01-01
description-date.csv: went to groceries,2003-01-01
{ _: description, description: "went to groceries", date: [2024-01-01, 2003-01-01]}
{ _: date, date: "2024-01-01"}
{ _: date, date: "2003-01-01"}

event-description.csv: 0acab,went to groceries\n0abac,went to groceries
event-date.csv: 0acab,2024-01-01\n0abac,2003-01-01
{ _: event, event: "0acab", description: "went to groceries", date: "2024-01-01"}
{ _: event, event: "0abac", description: "went to groceries", date: "2003-01-01"}
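
The records in these examples can be assembled from tablets roughly as follows. This is a sketch, not the defined query procedure: the function name build_records, the in-memory shape of tablets (a mapping from tablet name to a list of key/value pairs), and the choice to turn repeated attributes into a list are assumptions drawn from the examples.

```python
def build_records(base, leaves, tablets):
    """Assemble one record per value of the base collection."""
    records = {}
    for leaf in leaves:
        # Each relation lives in the tablet named {base}-{leaf}.csv.
        for key, value in tablets.get(f"{base}-{leaf}.csv", []):
            record = records.setdefault(key, {"_": base, base: key})
            if leaf not in record:
                record[leaf] = value
            elif isinstance(record[leaf], list):
                record[leaf].append(value)
            else:
                # A second value for the same attribute turns it into a list.
                record[leaf] = [record[leaf], value]
    return list(records.values())
```

Called with base "event" and leaves "description" and "date", the event tablets above give two separate records that share the same description text, because each record is identified by its own event value.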

how to create two different values with the same text

a relation between collections MUST be listed in _-_.csv

a relation between collections CAN be recursive.

examples:

collection "person" CAN have an attribute "person".
collection "product" CAN have an attribute "competitor" which has an attribute "product".

notes for the dataset maintainer

to remove a value of a collection from the dataset, prune that value from the tablet of each leaf of the collection, {collection}-{leaf}.csv
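
Pruning can be sketched on tablets held in memory. The function name prune_value and the representation of tablets as a mapping from file name to a list of key/value pairs are assumptions for illustration; writing the result back to disk is left out.

```python
def prune_value(tablets, collection, leaves, value):
    """Return a copy of tablets with every line keyed by value removed."""
    pruned = dict(tablets)
    for leaf in leaves:
        name = f"{collection}-{leaf}.csv"
        if name in pruned:
            # Keep only the lines whose key is not the pruned value.
            pruned[name] = [(key, val) for key, val in pruned[name] if key != value]
    return pruned
```

The input mapping is left unchanged, so a maintainer can compare the tablets before and after pruning.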

a csvs dataset SHOULD be version controlled. credentials and other sensitive data SHOULD be stored separately from the dataset, e.g. in .git/config or in another csvs dataset under access control.

binary blobs SHOULD be stored in a folder inside the dataset directory, and each blob SHOULD be associated with a value of the collection "filename". In datasets version controlled by git, the asset directory SHOULD be filtered with git-lfs or git-index.

collection "text" with large multiline string values CAN be refactored into two collections - "text_hash" where each value is a hash of a text value, to create a content-addressable index of text records.
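
One way to build such an index is sketched below. The hash algorithm is not named above, so SHA-256 is an assumption, as are the function name index_texts and the idea of storing the resulting pairs in a tablet keyed by the hash.

```python
import hashlib


def index_texts(texts):
    """Return (text_hash, text) pairs for a content-addressable index."""
    pairs = []
    for text in texts:
        # Assumption: SHA-256 over the UTF-8 bytes of the text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        pairs.append((digest, text))
    return pairs
```

Identical texts always map to the same hash, so other tablets can refer to a long multiline text by its short hash value instead of repeating the text.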