Getting Started

This documentation describes a database called "comma-separated value store" or csvs.

The goal of csvs is to be accessible and approachable. An engineer should be able to write a csvs library in an evening, and a child should be able to glean the contents of the database by inspecting them with a text reader. A dataset should still contain valuable data after corruption and be easy to repair. Such transparency and naïveté take priority over processing and memory efficiency.

A csvs dataset is a directory that contains plain text files in the "comma-separated value" format, or CSV. Any directory that contains a .csvs.csv file is a valid CSVS dataset. Each CSV file represents a table with two columns and is called a "tablet". The first column is a key, and the second column is a value. You can store records by appending lines to the tablets. To represent complex objects and connect the tablets to each other, specify the relationships between values in the schema file _-_.csv.

Here's an example of the simplest CSVS dataset that contains a record about visiting Japan in 2001.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date

event-date.csv

visited-japan,2001-01-01
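
The three files above can also be written programmatically. The following is a minimal Python sketch, not part of csvs itself: the function names `write_simplest_dataset` and `is_dataset` are hypothetical, and the dataset is created in a temporary directory so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

def write_simplest_dataset(root: Path) -> None:
    """Write the three tablets from the example above into `root`."""
    tablets = {
        ".csvs.csv": "csvs,0.0.2\n",
        "_-_.csv": "event,date\n",
        "event-date.csv": "visited-japan,2001-01-01\n",
    }
    for name, text in tablets.items():
        (root / name).write_text(text, encoding="utf-8")

def is_dataset(root: Path) -> bool:
    """A directory counts as a dataset if it contains a .csvs.csv file."""
    return (root / ".csvs.csv").is_file()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = is_dataset(root)   # the directory is still empty here
    write_simplest_dataset(root)
    after = is_dataset(root)    # now it holds .csvs.csv
    print(before, after)
```

Because a dataset is only a directory of text files, no csvs-specific tooling is needed to create one.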

To learn more about csvs, see the Tutorial and the User guides.

Tutorial

Here's an example of the simplest CSVS dataset that contains a record about visiting Japan in 2001.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date

event-date.csv

visited Japan,2001-01-01

Technically, this dataset represents three records:

event record that says visited Japan in 2001-01-01
date record that says 2001-01-01
a _ record, pronounced schema record, that says "event" has a "date"

Let's add another event about climbing Everest in 2003.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

Now, the dataset represents five records:

event record that says visited Japan in 2001-01-01
event record that says climbed Everest in 2003-03-03
date record that says 2001-01-01
date record that says 2003-03-03
a _ record, pronounced schema record, that says "event" has a "date"

Let's add another value to the database to show that events happened to different people.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

Finally, the dataset represents seven records:

event record that says Donell visited Japan in 2001-01-01
event record that says Eva climbed Everest in 2003-03-03
date record that says 2001-01-01
date record that says 2003-03-03
name record that says Eva
name record that says Donell
a _ record, pronounced schema record, that says "event" has a "date" and a "name"
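
To make the link between tablets and records concrete, here is a rough Python sketch that assembles the two event records from the tablets above. It is an illustration under assumptions: `parse_tablet` and `records_of` are hypothetical names, the tablets are held in memory instead of read from disk, and the shape of the resulting dictionaries (with a `_` entry naming the branch) is only one possible representation.

```python
import csv
import io

# The tablets from the example above, held in memory for brevity.
TABLETS = {
    "_-_.csv": "event,date\nevent,name\n",
    "event-date.csv": "visited Japan,2001-01-01\nclimbed Everest,2003-03-03\n",
    "event-name.csv": "visited Japan,Donell\nclimbed Everest,Eva\n",
}

def parse_tablet(text):
    """Return (key, value) pairs; a missing value becomes an empty string."""
    rows = csv.reader(io.StringIO(text))
    return [(row[0], row[1] if len(row) > 1 else "") for row in rows if row]

def records_of(trunk, tablets):
    """Build one dict per key of `trunk`, with one entry per leaf in the schema."""
    leaves = [leaf for t, leaf in parse_tablet(tablets["_-_.csv"]) if t == trunk]
    records = {}
    for leaf in leaves:
        for key, value in parse_tablet(tablets[f"{trunk}-{leaf}.csv"]):
            record = records.setdefault(key, {"_": trunk, trunk: key})
            # Simplification: a repeated key overwrites the earlier value.
            record[leaf] = value
    return list(records.values())

events = records_of("event", TABLETS)
for event in events:
    print(event)
```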

To remove records from the dataset, delete corresponding lines from the tablets.

Learn more about csvs in the User guides.

User Guides

To learn more about csvs, see Design and Requirements.

Nested Records

Branches can depend on each other. A branch is a named piece of structure inside a dataset. Imagine that a dataset is a grove of trees, and each tree is made up of branches connected to each other. A branch is called a trunk if it has leaves, that is, branches that describe it. A branch without leaves is called a twig and sits at the very top of the tree. A branch that does not describe any other branch, and thus does not have a trunk, is called a root and sits at the very bottom of the tree.

Let's add an age to the name of each person who experienced an event.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
name,age

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

name-age.csv

Donell,35
Eva,70

Now, let's add a favorite quote of each person, and the author of each quote.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
name,age
name,quote
quote,author

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

name-age.csv

Donell,35
Eva,70

name-quote.csv

Donell,The only way to do great work is to love what you do
Eva,"Sometimes you need to scorch everything to the ground, and start over"

quote-author.csv

The only way to do great work is to love what you do,Donovan
"Sometimes you need to scorch everything to the ground, and start over",Celeste Ng

You can even define a recursive relation to specify the parents of each person.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
name,age
name,parent

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

name-parent.csv

Donell,Jack
Donell,Jacqueline
Jack,Rona
Jack,Bernard
Jacqueline,Leif
Jacqueline,Fatuma
Eva,Ismail
Eva,Hauwa
Ismail,Nelson
Ismail,Dennis
Hauwa,Rabi
Hauwa,Louis
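
A recursive relation can be walked like any other graph. The sketch below is a hedged Python illustration, not a csvs API: `ancestors` is a hypothetical helper, and the tablet text is a shortened copy of name-parent.csv from the example above.

```python
NAME_PARENT = """\
Donell,Jack
Donell,Jacqueline
Jack,Rona
Jack,Bernard
Jacqueline,Leif
Jacqueline,Fatuma
"""

def parse_pairs(text):
    """Split each non-empty line at its first comma into (key, value)."""
    pairs = []
    for line in text.splitlines():
        if line:
            key, _, value = line.partition(",")
            pairs.append((key, value))
    return pairs

def ancestors(person, pairs):
    """Return every parent, grandparent, and so on, nearest generation first."""
    parents = {}
    for child, parent in pairs:
        parents.setdefault(child, []).append(parent)
    found, seen = [], set()
    queue = list(parents.get(person, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # guards against cycles in the data
        seen.add(current)
        found.append(current)
        queue.extend(parents.get(current, []))
    return found

print(ancestors("Donell", parse_pairs(NAME_PARENT)))
```

The `seen` set matters here because nothing in a tablet prevents a line such as `Rona,Donell`, which would otherwise make the walk loop forever.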

Lists of Values

Repeat a line with the same key to represent a list of values. For example, let's say Donell visited Japan every year for three years.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
branch,description

event-date.csv

visited Japan,2001-01-01
visited Japan,2002-02-02
visited Japan,2003-03-03
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

Notice that the date 2003-03-03 appears twice: once in the list of dates for the Japan event, and once for the Everest event. Be careful when you add new branches that describe the date "2003-03-03", because they might apply to both mentions!

To avoid conflicts, make sure to use unique identifiers when you want singleton values. For example, use a unique identifier as a key.
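
As an illustration of both points, this Python sketch groups repeated keys into lists and then looks up which keys share one value. The helper names `group_values` and `keys_for_value` are hypothetical, and lines are split at the first comma without handling quoted fields.

```python
from collections import defaultdict

EVENT_DATE = [
    "visited Japan,2001-01-01",
    "visited Japan,2002-02-02",
    "visited Japan,2003-03-03",
    "climbed Everest,2003-03-03",
]

def group_values(lines):
    """Collect the values of repeated keys into one list per key."""
    grouped = defaultdict(list)
    for line in lines:
        if not line:
            continue
        key, _, value = line.partition(",")
        grouped[key].append(value)
    return dict(grouped)

def keys_for_value(lines, value):
    """Return every key that is connected to `value`."""
    return [key for key, values in group_values(lines).items() if value in values]

print(group_values(EVENT_DATE)["visited Japan"])
print(keys_for_value(EVENT_DATE, "2003-03-03"))
```

The second call returns both events, which is exactly the ambiguity that unique identifiers avoid.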

An empty value is still a value: a line that contains only a comma relates an empty key to an empty value.

An empty line is not a value.

Dataset Settings

If you want to store arbitrary settings, try to write them as branches.

You can also write new key-value pairs to the .csvs.csv file.

Don't store credentials and other sensitive data in the dataset next to other data.

You can make a separate dataset for sensitive data with stricter security practices.

If you use git for version control, you can write sensitive key-value pairs to .git/config.

Asset Storage

You can store media files in a folder in the dataset and address them in one of the tablets.

.csvs.csv

csvs,0.0.2

_-_.csv

event,file

event-file.csv

visited Japan,IMG_0890.jpeg

img/IMG_0890.jpeg

               )\         O_._._._A_._._._O         /(
                \`--.___,'=================`.___,--'/
                 \`--._.__                 __._,--'/
                   \  ,. l`~~~~~~~~~~~~~~~'l ,.  /
       __            \||(_)!_!_!_.-._!_!_!(_)||/            __
       \\`-.__        ||_|____!!_|;|_!!____|_||        __,-'//
        \\    `==---='-----------'='-----------`=---=='    //
        | `--.                                         ,--' |
         \  ,.`~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~',.  /
           \||  ____,-------._,-------._,-------.____  ||/
            ||\|___!`======="!`======="!`======="!___|/||
            || |---||--------||-| | |-!!--------||---| ||
  __O_____O_ll_lO_____O_____O|| |'|'| ||O_____O_____Ol_ll_O_____O__
  o H o o H o o H o o H o o |-----------| o o H o o H o o H o o H o
 ___H_____H_____H_____H____O =========== O____H_____H_____H_____H___
                          /|=============|\
()______()______()______() '==== +-+ ====' ()______()______()______()
||{_}{_}||{_}{_}||{_}{_}/| ===== |_| ===== |\{_}{_}||{_}{_}||{_}{_}||
||      ||      ||     / |==== s(   )s ====| \     ||      ||      ||
======================()  =================  ()======================
----------------------/| ------------------- |\----------------------
                     / |---------------------| \
-'--'--'           ()  '---------------------'  ()
                   /| ------------------------- |\    --'--'--'
       --'--'     / |---------------------------| \    '--'
                ()  |___________________________|  ()           '--'-
  --'-          /| _______________________________  |\
 --' gpyy      / |__________________________________| \
Art by Glory Moon

If you version control the dataset with git, you can add the asset folder to Large File Storage.

git lfs install

git lfs track "img/**"

git add .gitattributes

git add img/IMG_0890.jpeg

git commit -m "add picture of Japan"

Branch Metadata

If you want to describe each branch in detail, you can create a set of tablets for the "branch" branch.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
branch,description

branch-description.csv

event,something that happened
date,something happened at this time
name,something happened to this person

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

By default, a csvs dataset represents lists of objects of strings. You can define custom value types based on details about each branch. For example, define a tablet called branch-datatype.csv and specify an age branch with type number. Just make sure to check that age values are really numbers.

.csvs.csv

csvs,0.0.2

_-_.csv

event,date
event,name
name,age
branch,datatype

branch-datatype.csv

age,number

event-date.csv

visited Japan,2001-01-01
climbed Everest,2003-03-03

event-name.csv

visited Japan,Donell
climbed Everest,Eva

name-age.csv

Donell,35
Eva,70
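
Checking that age values are really numbers could look like the Python sketch below. It is only an assumption about how a client might use branch-datatype.csv: `invalid_values` is a hypothetical helper, only the "number" type is handled, and the line `Jack,unknown` is made up to show a failing value.

```python
def parse_pairs(text):
    """Split each non-empty line at its first comma into (key, value)."""
    pairs = []
    for line in text.splitlines():
        if line:
            key, _, value = line.partition(",")
            pairs.append((key, value))
    return pairs

def is_number(text):
    """True if Python can read the text as a float (this also accepts inf and nan)."""
    try:
        float(text)
    except ValueError:
        return False
    return True

def invalid_values(branch, pairs, datatypes):
    """Return the (key, value) pairs whose value breaks the declared datatype."""
    if datatypes.get(branch) != "number":
        return []  # only the "number" type is checked in this sketch
    return [(key, value) for key, value in pairs if not is_number(value)]

datatypes = dict(parse_pairs("age,number\n"))
# "Jack,unknown" is a hypothetical bad line, not part of the example dataset.
ages = parse_pairs("Donell,35\nEva,70\nJack,unknown\n")
print(invalid_values("age", ages, datatypes))
```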

Writing a Client

Let's write a simple client for csvs. We will use pseudocode so you can follow along in your favorite language.

A client library defines three functions that mirror SQL's SELECT, UPDATE and DELETE commands.

# finds a branch key that is connected to `value`
select branch value =
  # find a leaf of the branch
  schema = parse "_-_.csv"
  relation = find line in schema where line starts with "$branch,"
  tokens = split relation ","
  leaf = tokens[1]
  # find a line in the tablet where the leaf value matches `value`
  tablet = parse "$branch-$leaf.csv"
  relation = find line in tablet where line ends with ",$value"
  tokens = split relation ","
  key = tokens[0]
  return key

# adds a trunk key connected to a leaf value
update trunk leaf key value =
  line = "$key,$value"
  append "$trunk-$leaf.csv" line

# removes the branch key and the connected leaf value
delete branch key =
  # find a leaf of the branch
  schema = parse "_-_.csv"
  relation = find line in schema where line starts with "$branch,"
  tokens = split relation ","
  leaf = tokens[1]
  # find a line in the tablet where the key matches `key`
  tablet = parse "$branch-$leaf.csv"
  relation = find line in tablet where line starts with "$key,"
  # write the tablet back without that line
  tablet = filter tablet where line is not relation
  write "$branch-$leaf.csv" tablet

This can be implemented in an evening. A more carefully written client could be much more robust and efficient - try to write your own!
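
For readers who prefer something runnable, here is one possible Python rendering of the same three functions. It is a sketch under assumptions, not a reference client: it leans on Python's standard `csv` module for quoting, it looks at every leaf of a branch instead of only the first, and the helper names `read_tablet`, `write_tablet` and `leaves_of` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

def read_tablet(path):
    """Return the (key, value) pairs of a tablet, or [] if the file is missing."""
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as f:
        return [(r[0], r[1] if len(r) > 1 else "") for r in csv.reader(f) if r]

def write_tablet(path, pairs):
    """Overwrite a tablet with the given pairs."""
    with path.open("w", newline="", encoding="utf-8") as f:
        csv.writer(f, lineterminator="\n").writerows(pairs)

def leaves_of(root, branch):
    """Look up the leaves of `branch` in the schema tablet."""
    return [leaf for trunk, leaf in read_tablet(root / "_-_.csv") if trunk == branch]

def select(root, branch, value):
    """Return the keys of `branch` that are connected to `value` in any leaf."""
    keys = []
    for leaf in leaves_of(root, branch):
        for key, leaf_value in read_tablet(root / f"{branch}-{leaf}.csv"):
            if leaf_value == value and key not in keys:
                keys.append(key)
    return keys

def update(root, trunk, leaf, key, value):
    """Append one line to the tablet that connects `trunk` and `leaf`."""
    path = root / f"{trunk}-{leaf}.csv"
    with path.open("a", newline="", encoding="utf-8") as f:
        csv.writer(f, lineterminator="\n").writerow([key, value])

def delete(root, branch, key):
    """Remove every line with `key` from each leaf tablet of `branch`."""
    for leaf in leaves_of(root, branch):
        path = root / f"{branch}-{leaf}.csv"
        if path.exists():
            write_tablet(path, [(k, v) for k, v in read_tablet(path) if k != key])

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_tablet(root / ".csvs.csv", [("csvs", "0.0.2")])
    write_tablet(root / "_-_.csv", [("event", "date"), ("event", "name")])
    update(root, "event", "date", "visited Japan", "2001-01-01")
    update(root, "event", "name", "visited Japan", "Donell")
    update(root, "event", "name", "climbed Everest", "Eva")
    found = select(root, "event", "Donell")
    delete(root, "event", "visited Japan")
    remaining = read_tablet(root / "event-name.csv")
    print(found, remaining)
```

Note that `update` only appends, so calling it twice with the same key produces a list of values instead of replacing the old one.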

Design

plain-text relational database

competes: recutils, sql

interacts: filesystem, clients, text editors

constitutes: a set of csv files

includes: config, schema, data

resembles: recutils

patterns:

stakeholders: fetsorn

Specification

0.0.2
0.0.1 (deprecated)

Also see the Requirements.

CSVS File Format

This document specifies the CSVS file format.

"CSVS" stands for "Comma-Separated Value Store"

CSVS file format is a subset of RFC 4180. In cases where this document contradicts the RFC, RFC takes precedence and this document should be corrected.

A CSVS file MUST have UTF-8 encoding.

A CSVS file MUST have .csv file extension.

grammar

newline: either Carriage Return 0x0D \r, Line Feed 0x0A \n, or both CR LF 0x0D 0x0A \r\n
string: sequence of any utf8 characters, newlines MUST be escaped
key: string
value: string
file: [[key][,[value]]newline]

each line in csvs file MUST represent a relation between values of two collections

each line MUST contain zero, one, or two values separated by a comma

value that contains a comma, a newline or a double quote MUST be escaped with double quotes ""

omitted value MUST represent an empty string

all characters between the first unescaped comma and an unescaped newline MUST be read as part of the second value

multiple identical lines MUST represent multiple unique relations between identical values

an exact duplicate of a line MUST represent two unique relations

a line that consists only of a newline character MUST represent a relation between one empty string value "" and another empty string value ""

the file in csvs format CAN be called a "tablet"

the first column CAN be called a "key"

the second column CAN be called a "value"

empty lines \n MUST be ignored.

a trailing newline \n MUST be ignored.

these are equivalent

,\n
"",\n
"",""\n

a line CAN have no comma.

these are equivalent

2024-01-01\n
2024-01-01,""\n

these are equivalent

"\n"\n
"\n",\n

examples

1,bob\n: key is 1, value is bob
1,bob\\n\n: key is 1, value is bob\n
,bob\n: key is "", value is bob
1,\n: key is 1, value is ""
1\n: key is 1, value is ""
\n: key is "", value is ""
2,bob,alice\n: key is 2, value is bob,alice
3,apple\n3,pear\n: key is 3, values are apple and pear
3,apple\n3,apple\n: key is 3, values are apple and apple
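
The examples above can be reproduced with a small line parser. The following Python sketch is one reading of the rules, not a normative implementation: `split_line` and `unquote` are hypothetical names, the input is a single line with its newline already removed, and backslash escapes such as \\n are left untouched in the value.

```python
def unquote(field):
    """Strip one pair of surrounding double quotes and collapse "" into "."""
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field

def split_line(line):
    """Split a line at the first comma that is outside double quotes."""
    in_quotes = False
    for index, char in enumerate(line):
        if char == '"':
            in_quotes = not in_quotes  # an escaped "" toggles twice
        elif char == "," and not in_quotes:
            return unquote(line[:index]), unquote(line[index + 1:])
    return unquote(line), ""  # no comma: the value is an empty string

for line in ["1,bob", ",bob", "1,", "1", "2,bob,alice", '"",""']:
    print(repr(line), "->", split_line(line))
```

Everything after the first unquoted comma is kept as the value, which is why `2,bob,alice` yields the value `bob,alice`.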

CSVS Dataset Format

This document specifies the CSVS dataset format.

"CSVS" stands for "Comma-Separated Value Store"

a csvs dataset represents relationships between collection values

terminology

each collection CAN be called a "branch", plural "branches"

a collection without attributes CAN be called a "twig", plural "twigs"

a collection with attributes CAN be called a "trunk", plural "trunks"

a collection that is an attribute of another collection CAN be called a "leaf", plural "leaves"

a collection that is not an attribute of any other collection CAN be called a "root", plural "roots"

a dataset CAN have multiple roots

a branch CAN have multiple trunks

a branch CAN have multiple leaves
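
The terminology above can be derived mechanically from the lines of a schema. This Python sketch is an illustration only: `classify` is a hypothetical helper, and the schema pairs are taken from the quote example in the user guides.

```python
def classify(schema_pairs):
    """Sort branches into roots, trunks, leaves and twigs, in order of appearance."""
    branches = []
    has_leaves = set()   # branches that appear as a trunk (first column)
    has_trunk = set()    # branches that appear as a leaf (second column)
    for trunk, leaf in schema_pairs:
        for branch in (trunk, leaf):
            if branch not in branches:
                branches.append(branch)
        has_leaves.add(trunk)
        has_trunk.add(leaf)
    return {
        "roots": [b for b in branches if b not in has_trunk],
        "trunks": [b for b in branches if b in has_leaves],
        "leaves": [b for b in branches if b in has_trunk],
        "twigs": [b for b in branches if b not in has_leaves],
    }

SCHEMA = [
    ("event", "date"),
    ("event", "name"),
    ("name", "age"),
    ("name", "quote"),
    ("quote", "author"),
]

for role, names in classify(SCHEMA).items():
    print(role, names)
```

A branch can carry several roles at once: in this schema "name" is both a leaf of "event" and a trunk of "age" and "quote".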

.csvs.csv

a dataset MUST contain a tablet named .csvs.csv which describes the dataset

this tablet is for metadata

.csvs.csv tablet MUST have a line csvs,0.0.2

this line is to support future breaking changes to the format.
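
A client could use this line to refuse datasets it does not understand. The Python sketch below shows one way to do that; `dataset_version`, `is_supported` and the `SUPPORTED` set are hypothetical and only illustrate the idea.

```python
SUPPORTED = {"0.0.2"}  # versions this hypothetical client can read

def dataset_version(text):
    """Return the value of the first `csvs` line in .csvs.csv, or None."""
    for line in text.splitlines():
        key, _, value = line.partition(",")
        if key == "csvs":
            return value
    return None

def is_supported(text):
    """True if the dataset declares a version this client knows."""
    return dataset_version(text) in SUPPORTED

print(dataset_version("csvs,0.0.2\n"), is_supported("csvs,0.0.2\n"))
```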

`_-_.csv`

a dataset SHOULD contain a tablet named _-_.csv which describes relationships between collections

reserved technical implementation details

underscore-dash-underscore

examples:

_-_.csv: event,date - dataset has an "event" collection with an attribute "date"

if there is no _-_.csv tablet, dataset MUST be considered empty

a collection name MUST NOT be "_".

a collection name MUST NOT include the following characters: [/\<>':"```|?*-.,[];{}$&].

a collection name CAN include any of the following: [azAZ09_%+@], whitespace and other unicode characters
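
The naming rules can be checked with a few lines of code. This Python sketch is a hedged reading of the rules above: `is_valid_collection_name` is a hypothetical helper, the forbidden set is copied from the list of characters with the backtick counted once, and rejecting an empty name is an extra assumption the rules do not state.

```python
# Characters that a collection name must not include, per the list above.
FORBIDDEN = set("/\\<>':\"`|?*-.,[];{}$&")

def is_valid_collection_name(name):
    """Check a collection name against the reserved name and forbidden characters."""
    if name == "" or name == "_":
        return False  # "_" is reserved; the empty name is rejected by assumption
    return not any(char in FORBIDDEN for char in name)

for name in ["event", "text_hash", "first name", "_", "event-date", "a.b"]:
    print(repr(name), is_valid_collection_name(name))
```

The dash is forbidden because tablet names join two collection names with a dash, so a dash inside a name would make `{collection1}-{collection2}.csv` ambiguous.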

NOTE: when there's no _-_.csv file, list directory and deduce relations from tablet names.

collection-collection.csv

underscore is like SQL table? underscore is not like SQL table? underscore is like MongoDB collection? underscore is not like MongoDB collection?

a dataset CAN have a tablet named {collection1}-{collection2}.csv which describes relationships between values of two collections

contains values of two collections

"went to groceries" is an identifier here. examples:

description-date.csv: went to groceries,2024-01-01
description-date.csv: went to groceries,2003-01-01
{ _: description, description: "went to groceries", date: [2024-01-01, 2003-01-01]}
{ _: date, date: "2024-01-01"}
{ _: date, date: "2003-01-01"}
event-description.csv: 0acab,went to groceries\n0abac,went to groceries
event-date.csv: 0acab,2024-01-01\n0abac,2003-01-01
{ _: event, event: "0acab", description: "went to groceries", date: "2024-01-01"}
{ _: event, event: "0abac", description: "went to groceries", date: "2003-01-01"}

how to create two different values with the same text

a relation between collections MUST be listed in _-_.csv

a relation between collections CAN be recursive.

examples:

collection "person" CAN have an attribute "person".
collection "product" CAN have an attribute "competitor" which has an attribute "product".

notes for the dataset maintainer

to remove the value of collection from the dataset, prune the collection value from the tablet for each leaf of collection, {collection}-{leaf}.csv

csvs dataset SHOULD be version controlled. credentials and other sensitive data SHOULD be stored separately from the dataset, e.g. in .git/config or in another csvs dataset under access control.

binary blobs SHOULD be stored in a folder inside the dataset directory, and each blob SHOULD be associated with a value of a filename collection. In datasets version controlled by git, the asset directory SHOULD be filtered with git-lfs or git-index.

collection "text" with large multiline string values CAN be refactored into two collections - "text_hash" where each value is a hash of a text value, to create a content-addressable index of text records.

CSVS File Format

This document specifies the CSVS file format.

"CSVS" stands for "Comma-Separated Value Store"

grammar: [(no comma) [comma (any utf8)] newline]

A CSVS file MUST have UTF-8 encoding.

A CSVS file MUST have .csv file extension.

A CSVS file MUST NOT start with a comma.

A CSVS file MUST consist of lines separated by a newline, either Carriage Return 0x0D \r, Line Feed 0x0A \n, or both CR LF 0x0D 0x0A \r\n.

Each line in a CSVS file CAN have no commas.

Each line in a CSVS file CAN have one comma, first comma in each line separates a KEY from VALUE.

A CSVS file has only two columns, and the second column is optional.

All commas after the first one are part of the value.

Unless specified in an extension, keys SHOULD be unique. If non-unique keys are found, only the first value MUST be used, and the remaining lines with a matching key are to be discarded.
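
The rule for non-unique keys can be sketched in a few lines of Python. This is an illustration under assumptions: `first_values` is a hypothetical helper, and lines are split at the first comma without any quoting rules.

```python
def first_values(lines):
    """Keep the first value seen for each key and discard later duplicates."""
    values = {}
    for line in lines:
        if not line:
            continue
        key, _, value = line.partition(",")
        values.setdefault(key, value)  # later lines with the same key are ignored
    return values

print(first_values(["3,apple", "3,pear", "4,plum,fig", "5"]))
```

This differs from the list semantics in the other file format description above, where repeated keys accumulate values instead of being discarded.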