Hugo Hacker News

Time to retire the CSV?

Sunspark 2021-08-18 15:14:00 +0000 UTC [ - ]

I don't agree with giving up csvs until the following conditions are met:

1) A truly open format is available and accessible. Csvs are textfiles. There is no system around that cannot open a textfile. If the format is binary or requires patents or whatever, then it's a non-starter.

2) Applications have a speed increase from using csvs. To wit, I loved csvs because often they finish preparing much faster than a "formatted" output in excel, etc. and sometimes I just want to see the numbers, not have everything colored or weird features like merged cells in excel, etc.

3) The new format should not be grossly larger than the one it is replacing. Extracts in Excel format are orders of magnitude larger than csv in filesize. This affects the run-time to prepare the extract and to open it (memory constraints, etc.)

Is there truly a problem? The author is not forced to use csvs.

derefr 2021-08-18 15:25:45 +0000 UTC [ - ]

You're comparing CSVs to other spreadsheet document formats. But a CSV is not a spreadsheet. A CSV is raw data. (It's data that is restricted to a shape that enables it to be easily imported into a spreadsheet—but data nevertheless.) As such, it should be compared to other data formats—e.g. YAML, JSON Lines, etc.

These other data formats all win on your #2 against CSV, as CSV is actually horrible at parse-time vs. other data formats — the fact that both of its separators (newlines and commas) can appear as-is inside column values, with a different meaning, if those column-values are quoted, means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically. You actually need to fully parse it (serially), and only then will you know where the row boundaries are. If you've ever dealt with trying to write ETL logic for datasets that exist as multi-GB CSV files, vs. as multi-GB any-other-data-format files, you'll have experienced the pain.
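
To make that concrete, here's a tiny sketch with Python's standard csv module (the sample data is invented):

  import csv, io

  data = 'id,comment\n1,"line one\nline two"\n2,plain\n'

  # Naive chunking by newline splits the quoted field in half:
  print(data.splitlines())                    # 4 "lines" for only 3 records

  # A CSV parser has to scan serially to find the real record boundaries:
  print(list(csv.reader(io.StringIO(data))))  # 3 records, newline preserved inside the field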

> The new format should not be grossly larger than the one it is replacing.

Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small. General-purpose compressors like deflate/LZMA/etc. are very good at shearing away the duplication of self-describing rows.

As such, IMHO, the ideal format to replace ".csv" is ".jsonl.gz" (or, more conveniently, just ".jsonl" but with the expectation that backends will offer Content-Encoding compression and your computer will use filesystem compression to store it — with this being almost the perfect use-case for both features.)
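
A rough sketch of what that looks like in practice, with Python's standard gzip and json modules (the file name and rows are invented):

  import gzip, json

  rows = [{"name": "alice", "score": 1}, {"name": "bob", "score": 2}]

  # Write one JSON object per line, gzip-compressed
  with gzip.open("example.jsonl.gz", "wt", encoding="utf-8") as f:
      for row in rows:
          f.write(json.dumps(row) + "\n")

  # Read it back as a stream, one record per line
  with gzip.open("example.jsonl.gz", "rt", encoding="utf-8") as f:
      for line in f:
          print(json.loads(line))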

-----

There's also Avro, which fails your point #1 (it's a binary format), but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.

The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)

jerf 2021-08-18 16:40:36 +0000 UTC [ - ]

"Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small."

For CSV file replacements, I'd expect something like "one JSON array per line, all values must be JSON scalars". In that case, it's not much larger than a CSV, especially one using quotes already for string values.
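
Concretely, something like this (a made-up example, not an existing spec):

  ["id","name","joined"]
  [1,"Ada Lovelace","1833-06-05"]
  [2,"Grace Hopper","1906-12-09"]

Each line is valid JSON on its own, so any off-the-shelf JSON parser can read it a record at a time.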

But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object? Is it wrapped in a top-level array or not? If it is objects, do the objects have to be one line? If it is an object, where do we put field order? The whole problem we're trying to solve with CSV is that it's not a format, it's a family of formats, but without some authority coming in and declaring a specialized JSON format we end up with a family of JSON formats to replace CSV as well. I'd still say it's a step up; at least the family of JSON formats is unambiguously parseable and the correct string values will pop out. But it's less of a full solution than I'd like.

(It wouldn't even have to be that much of an authority necessarily, but certainly more than "The HN user named jerf declares it to be thus." Though I suppose if I registered "csvjson.org" or some obvious variant and put up a suitably professional-looking page that might just do the trick. I know of a few other "standards" that don't seem to be much more than that. Technically, even JSON itself wasn't much more than that for a lot of its run, though it is an IETF standard now.)

turtles_ 2021-08-18 17:47:29 +0000 UTC [ - ]

Indeed, this has already been done: http://ndjson.org/

To be fair it's not an objectionable format. Using line breaks to separate objects makes it streamable, and you don't need to enclose the whole thing in an array to make it a valid JSON document.

jerf 2021-08-18 19:49:15 +0000 UTC [ - ]

That is not quite a CSV replacement. I use it for things with objects and stuff all the time. To be a CSV replacement you really need to add that each line must be a JSON array, and that it can only have scalars in it (no sub-arrays or objects). That would be a decent enough replacement for CSV itself. Not perfect, but the CSV "standard" is already a nightmare at the edge anyhow and honestly a lot of it can't be fixed anyway, so, this is probably as good as it could get.

derefr 2021-08-18 22:52:35 +0000 UTC [ - ]

> that it can only have scalars in it (no sub-arrays or objects)

I see CSV files that contain JSON arrays/objects in their fields all the time. Mainly from exporting Postgres tables that contain json/jsonb-typed columns. Are you saying that these aren't valid CSVs?

zerocrates 2021-08-19 04:01:06 +0000 UTC [ - ]

They're saying that a CSV equivalent should be strictly 2-dimensional, with "flat" values.

Such a format could contain arbitrary JSON in a "cell", but simply as text, in the same way as putting the same data in a CSV.

shaftoe 2021-08-19 11:20:36 +0000 UTC [ - ]

These are strings containing JSON

cdcarter 2021-08-18 21:08:17 +0000 UTC [ - ]

JSON Lines is a specified format that answers those questions. https://jsonlines.org/ Seems like it qualifies to the level of authority you're requiring.

da_chicken 2021-08-18 17:04:36 +0000 UTC [ - ]

> But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object?

How is that not a problem with every data serialization format? It does me no real good if I have an XML schema and a corresponding file. If I don't know what those elements and attributes represent I'm not really any better off.

It's not like JSON or XML can meaningfully be marshaled back into objects for use generically without knowledge of what is represented. There are generic JSON and XML readers that allow you to parse the data elements sure, but so, too, do generic CSV readers like C#'s CsvHelper or Python's csv. In all cases you have to know what the object turns into in the application before the serialized data is useful.

And, yes, CSV has slightly differing formats, but so does JSON. Date formats are conventionally ISO 8601, but that's not in the spec. That's why Microsoft got away with proprietary date formats in System.Text.Json. XML isn't really any better.

cpx86 2021-08-18 20:26:36 +0000 UTC [ - ]

> That's why Microsoft got away with proprietary date formats in System.Text.Json.

What's proprietary in it? It follows ISO 8601-1:2019 and RFC 3339 according to the docs.

da_chicken 2021-08-18 21:20:42 +0000 UTC [ - ]

Sorry, that should be System.Runtime.Serialization.Json. System.Text.Json is the newer class that replaced it.

In .Net Framework 4.6 and earlier, the only built-in JSON serializer in the .Net Framework was System.Runtime.Serialization.Json.DataContractJsonSerializer.

You can still see it. If you're on Windows 10, run Windows Powershell v5.1 and run:

  Get-Item C:\Windows\System32\notepad.exe | Select-Object -Property Name, LastWriteTime | ConvertTo-Json
You'll see this output:

  {
    "Name":  "notepad.exe",
    "LastWriteTime":  "\/Date(1626957326200)\/"
  }
Microsoft didn't fix their weird JSON serialization until quite late. They may have back ported it to the .Net Framework, but they've deleted that documentation. Powershell v6 and v7 include the newer classes that are properly behaved. This is why Json.NET used to be so popular and ubiquitous for C# and ASP applications. It generated JSON like most web applications do, not the way Microsoft's wonky class did. Indeed, I believe it may be what System.Text.Json is based on.

cpx86 2021-08-19 16:02:31 +0000 UTC [ - ]

Oh that one - yeah I've always steered clear of DataContractJsonSerializer. Never understood why they did it so weird.

To be fair, RFC 3339 wasn't even published back when this class was implemented (in .NET 3.5) so I guess they just went with whatever worked for their needs. ¯\_(ツ)_/¯

a9h74j 2021-08-19 03:45:34 +0000 UTC [ - ]

I have wondered about a file format where a parser could be specified for [at the start of] each line. You could even have different json parsers with different well-characterized limits and relative speeds. Formats could change over time on a line-by-line basis, without being locked into a full-file IDL or similar.

hermitcrab 2021-08-18 19:53:56 +0000 UTC [ - ]

>the fact that both of its separators (newlines and commas) can appear as-is inside column values, with a different meaning, if those column-values are quoted, means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically

Yes, this is a major pain. It can be avoided by using Tab separated value (TSV) files, which don't use escaping. But then you can't store Tabs or carriage returns in your data. Also there is no way to store metadata in TSV.

JSON is far from ideal for storing 2D data tables as it is a tree. This means it is much more verbose than it needs to be. The same is also true for XML.

divbzero 2021-08-18 21:01:40 +0000 UTC [ - ]

ASCII has special delimiters 0x1E Record Separator and 0x1F Unit Separator to avoid conflicting with values, but they have never gained widespread adoption.
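
For what it's worth, using them programmatically is trivial; a minimal Python sketch (values invented):

  US, RS = "\x1f", "\x1e"   # Unit Separator, Record Separator

  rows = [["id", "note"], ["1", "has, commas"], ["2", "and\nnewlines"]]

  # Encode: no quoting or escaping needed, the separators never occur in normal text
  blob = RS.join(US.join(fields) for fields in rows)

  # Decode: plain splits recover the original structure
  assert [record.split(US) for record in blob.split(RS)] == rows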

hermitcrab 2021-08-18 21:22:02 +0000 UTC [ - ]

While these have clear advantages over comma and CR for parsing (no more escaping!), they aren't at all convenient for manual editing.

mavhc 2021-08-18 21:41:14 +0000 UTC [ - ]

If they were popular they'd be supported in your editor though.

hermitcrab 2021-08-18 22:03:33 +0000 UTC [ - ]

Maybe. It is a bit of a chicken-and-egg problem though. They aren't likely to become popular until editors support them.

Also, the fact that they have been around for years and never taken off implies that they probably never will.

sli 2021-08-19 00:25:20 +0000 UTC [ - ]

More generally than editor support, how is the average person expected to type them? I could easily add them to my keyboard, but if "editor support" turned out to mean "Edit > Insert > Record Separator Character" and I was now bound only to editors that support this, absolutely nobody is going to bother. They specifically need to be trivially typable in exactly the way commas and newlines are.

Realistically I don't think anyone (at least on HN) would presume that editor support would be so weirdly obtuse, but it's not exactly well-defined here, either.

I'm mostly just thinking out loud, here, not really critiquing anyone.

mavhc 2021-08-19 06:58:02 +0000 UTC [ - ]

Commas and Newlines are only trivial if your data contains no commas or newlines.

Who's writing CSV files by hand though?

I'd expect some kind of CSV mode in my text editor where Tab and Newlines are turned into keys to insert those characters

kevin_thibedeau 2021-08-19 00:54:52 +0000 UTC [ - ]

> how is the average person expected to type them?

[modifier]-[C,D,E,F]

epmos 2021-08-19 10:43:42 +0000 UTC [ - ]

Actually, there is already a standard way for someone to type these with the control key. To be specific, they are:

  FS: ^\ GS: ^] RS: ^^ US: ^_
Since these delimiters are not much used, on many terminals ^\ is mapped to send SIGQUIT. Use 'stty quit undef' and you can then type this at the keyboard. This works on Linux and OS X, and likely most other systems. Note that with most keyboards you are also using shift for RS and US. For emacs, prefix with C-q so that it inserts the literal.

Other posters have noted that they have seen ASCII separators recently; I have too, as recently as 2018, internally at a FAANG. So they are not completely unknown in the wild.

xkgt 2021-08-19 05:28:29 +0000 UTC [ - ]

which modifier would that be, Alt/Ctrl or Cmd? Also there is a chance that this combo will not work on some platforms, some may already be in use as a hotkey.

a1369209993 2021-08-19 01:31:51 +0000 UTC [ - ]

> but they have never gained widespread adoption.

Yes, because that just kicks the can down the road: how do you represent the one-byte string "\x1E"?

BenjiWiebe 2021-08-19 02:12:12 +0000 UTC [ - ]

At least very few humans type that into their spreadsheet cell.

nicoburns 2021-08-18 21:14:46 +0000 UTC [ - ]

I worked on a project that used these in a proprietary file format. It was actually quite nice. There was no escaping required because they simply stripped those characters out of the actual content.

Fwirt 2021-08-19 07:14:42 +0000 UTC [ - ]

The other advantage not being discussed here is that once you have a csv parser, it can be trivially generalized to use any character as a delimiter, and you can usually find a character that isn't found in the data you're working with unless you're working with binary data. I work on a legacy system where passing delimited files around is usually the integration method of the day, and if you can't delimit with commas (common) then you just delimit with tabs. If you can't delimit with tabs or commas, then delimit with pipes, which are rarely if ever used outside of code. Or use tildes, or grave accents, or...

Doing this avoids quoting, which avoids an entire class of problems. And yes, if you're passing multi-megabyte files around on a regular basis, or transferring giant blobs of text or binary data, or looking for a storage format with metadata then csvs are an absolutely awful choice, but as with any problem you pick the right tool for the job, and csvs are a great tool when you want to easily move some structured alphanumeric data around in a format that's trivial to parse in any language and widely supported.
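
For example, Python's standard csv module takes the delimiter as a parameter, so switching to pipes is a one-character change (sample data invented):

  import csv, io

  pipe_data = "id|name|city\n1|Ada|London\n2|Grace|New York\n"

  # Same parser, different delimiter
  for row in csv.reader(io.StringIO(pipe_data), delimiter="|"):
      print(row)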

hermitcrab 2021-08-19 07:40:13 +0000 UTC [ - ]

>it can be trivially generalized to use any character as a delimiter

But then you have the additional problem of trying to work out what the delimiter is. You can take an educated guess, but this isn't going to be 100% reliable.


osigurdson 2021-08-18 15:49:23 +0000 UTC [ - ]

Take an Excel file and change the extension to .zip, then extract the contents. You will see that it is a collection of XML files. Therefore it should be reasonable to conclude that this approach can work for Excel sized datasets.

However it is not particularly readable/diff-able if this is part of your use case.
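
A quick way to peek inside without renaming anything, using Python's standard zipfile module (the file name is a placeholder):

  import zipfile

  # An .xlsx file is just a ZIP archive of XML parts
  with zipfile.ZipFile("workbook.xlsx") as z:
      for name in z.namelist():
          print(name)   # e.g. xl/workbook.xml, xl/worksheets/sheet1.xml, ...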

ectopod 2021-08-18 20:30:29 +0000 UTC [ - ]

Excel actually defines a simple XML-based alternative to CSV:

https://en.wikipedia.org/wiki/Microsoft_Excel#XML_Spreadshee...


unnouinceput 2021-08-18 17:52:03 +0000 UTC [ - ]

Correction: the new xlsx is a zip file, the old xls format is true binary.

dtparr 2021-08-18 20:57:28 +0000 UTC [ - ]

To be fair, xlsx came out with Office 2007, so it's not exactly 'new' anymore. Perhaps at this point it's reasonable for 'excel file' to mean the one that's been the default for 14 years?

shakna 2021-08-18 23:21:52 +0000 UTC [ - ]

The xlsx document is based on the OpenDocument format, which actually came out in '05, so it's a wee bit older than that.

mavhc 2021-08-18 21:42:42 +0000 UTC [ - ]

So not the one the uk used to track covid infections, sigh.

meepmorp 2021-08-19 00:34:05 +0000 UTC [ - ]

Do you mean the OLE-based format, or the really old one?


anigbrowl 2021-08-18 21:52:50 +0000 UTC [ - ]

That's useful to know.

stickfigure 2021-08-18 17:29:01 +0000 UTC [ - ]

JSON/JSONL is not self-describing for many of the most common tabular data use cases. There's no date type or currency type.

If you are going to displace a standard, it has to be significantly better than the old.

derefr 2021-08-18 17:54:58 +0000 UTC [ - ]

CSV does not have date or currency types. Those are an Excel conventional microformat overlaid on top of CSV.

The semantics of CSV — how it decodes, if you don't have an application making additional configured assumptions or using heuristics to detect microformats — is just N columns: one row of header names, then rows of N raw text cells. Nothing more, nothing less. No schema defining what a column's type is.
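
Any stock CSV reader shows this; for instance, Python's csv module hands back nothing but strings (sample data invented):

  import csv, io

  data = "id,price,when\n1,9.99,2021-08-18\n"

  for row in csv.reader(io.StringIO(data)):
      print([type(v).__name__ for v in row])   # ['str', 'str', 'str'] every time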

stickfigure 2021-08-18 18:38:29 +0000 UTC [ - ]

You misunderstood my point. CSV, for all its faults (and I am intimately aware of them), is ubiquitous. You're not going to replace a widespread standard with something that only offers a tiny incremental improvement. Ubiquitous is a feature, and generally trumps all other features.

If you want something better than CSV to take off, at the very least it needs to solve the common pain points of CSV. JSON doesn't - all it brings to the table is the ability to distinguish number vs text. That's a yawn.

breck 2021-08-18 20:30:49 +0000 UTC [ - ]

> If you want something better than CSV to take off...JSON doesn't...

Agreed. I don't ever see something better than CSV taking off (having worked on this problem for better part of a decade) unless you could get to the point where you've got a largely similar notation, but a richer semantic grammar powered by something like Schema.org.

In other words, a strongly typed CSV with a community backed schema. I think GPT-3 and the like are a big wild card here. I can't describe how, but I expect in 20 years we will be using something better for storing data and will look back and say "CSV used to be the best but then GPT-N _____________".

lmm 2021-08-19 04:02:00 +0000 UTC [ - ]

> If you want something better than CSV to take off, at the very least it needs to solve the common pain points of CSV. JSON doesn't - all it brings to the table is the ability to distinguish number vs text. That's a yawn.

It brings the ability to parse in parallel - that's a big deal. And while number vs text might not be a huge difference in theory, in practice it eliminates what, 95% of real-world parse problems?

stickfigure 2021-08-19 07:10:12 +0000 UTC [ - ]

At least in my experience, I don't see a lot of trouble with number vs string in CSV data. You convert the string to a number or you don't. The pain points are usually dates or currencies - the same problem I usually have with JSON, because there's no standard format.

I think you could fix most of the pain of CSV simply by adding a second header row which defines the type of the column, using a common vocabulary of types. TEXT, NUMBER, ISO8601-DATE, etc.
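
A rough sketch of what reading such a file could look like (the two-row header convention and the type names are hypothetical, not an existing standard):

  import csv, io
  from datetime import date

  data = ("name,amount,signed_up\n"
          "TEXT,NUMBER,ISO8601-DATE\n"
          "Ada,12.5,2021-08-18\n")

  # Hypothetical type vocabulary for the second header row
  CASTS = {"TEXT": str, "NUMBER": float, "ISO8601-DATE": date.fromisoformat}

  rows = csv.reader(io.StringIO(data))
  names, types = next(rows), next(rows)
  for row in rows:
      print({n: CASTS[t](v) for n, t, v in zip(names, types, row)})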

vertere 2021-08-19 09:08:20 +0000 UTC [ - ]

I envy your life of pristine CSVs and compatible implementations, where JSON doesn't solve any problems. But CSVs have a broad range of uses, and there are, in fact, plenty of implementations that do not follow RFC 4180, and even if everything did there would still be issues like the header row being optional (without an in-band way of telling). JSON is, in practice, much more well-defined.

On text vs numbers, at least some widely-used software (e.g. R, Excel) will try to guess for you. It should be obvious how this might cause problems. Maybe one should turn auto-conversion off (or not use things that don't let you turn it off) and specify which columns are numbers. Some datasets have a lot of columns, so this can be a PITA, even if you do know which ones should be numbers. But the bigger problem is if you have to deal with other people, or the files that they've touched. There are always going to be people that edit data in excel, don't use the right options when they import, etc.

derefr 2021-08-18 19:11:22 +0000 UTC [ - ]

JSON solves tons and tons of pain points around CSV... for implementing anything that reads/imports/normalizes CSV data. Just the fact that CSV has multiple standards around the quote/escape/separator/NULL character-sequences, with the choice not getting encoded into the document, means that any format that doesn't have that problem is instantly 100x better in terms of not experiencing a disaster of data-garbling-on-import.

Sunspark 2021-08-18 15:54:57 +0000 UTC [ - ]

The formats you describe sound interesting, but in the big corporate/government world the barriers to adoption are high. For example, the software package Cognos which is owned by IBM and connects to database servers. To extract data from it, you have a choice of csv, Excel or pdf. The pdf is only used if you have done a perfect job creating a report in Cognos and don't need to edit/adjust it further in Excel. So really, only 2 output formats.

I realize that this is a specific use case here, but I was on Cognos for years, and then when they started shifting over to Tableau, it wasn't any better.. csv, MS formats, proprietary formats, etc.

cbsmith 2021-08-18 16:23:26 +0000 UTC [ - ]

Cognos is a BI tool, and the formats you are talking about are for rendering reports, not data. If you're picking data formats based on being able to load & edit them in to Excel, you're already doomed.

tw04 2021-08-18 16:42:04 +0000 UTC [ - ]

The entire article is about replacing CSVs for exchanging data exported from Excel... so why wouldn't he be picking data formats based on being able to load and edit them in Excel? If you're trying to solve this problem in a way that EXCLUDES Excel, you're already doomed. The business world will laugh at you and continue on their merry CSV way.

>The biggest and most thorny problem to solve is the people problem: how do you convince people to stop creating new CSVs when they’ve never done things any other way? Fortunately - for this problem, anyway - most of the world’s business data is born in one of a handful of programs that are owned by an increasingly small number of companies. If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course. Of course, it’s debatable whether that kind of a change is in the best interest of those companies, but I’m cautiously optimistic.

cbsmith 2021-08-18 16:50:08 +0000 UTC [ - ]

> The entire article is about replacing CSVs for exchanging data exported from Excel...

No, it's not. It's about replacing CSVs for exchanging data. It mentions that CSVs often are the product of someone exporting data from a spreadsheet or doing a table dump, and how just doing that tends to create a ton of problems, but Excel is an example, not the subject matter of the article.

> The business world will laugh at you and continue on their merry CSV way.

The business world pays me a lot of money to teach them not to use CSVs.

masklinn 2021-08-18 17:20:28 +0000 UTC [ - ]

> The business world pays me a lot of money to teach them not to use CSVs.

Could you teach them better and faster? I don't think they're getting it. You have my blessing to use violence.

barbarbar 2021-08-19 03:26:47 +0000 UTC [ - ]

This is the right suggestion.

cbsmith 2021-08-18 17:42:59 +0000 UTC [ - ]

> Could you teach them better and faster? I don't think they're getting it. You have my blessing to use violence.

I'm trying man. I'm trying.

wombatpm 2021-08-18 18:06:25 +0000 UTC [ - ]

Of course there is an old solution in the ASCII character set: the File, Record, Group and Unit separator characters.

hermitcrab 2021-08-18 20:09:46 +0000 UTC [ - ]

Yes. You could get a long way with a text format in which:

-the first line is always a header

-fields are separated by Unit separator characters

-records are separated by Record separator characters

-encoding is UTF8

If you wanted to get fancy you could also have:

-comment lines

-column metadata (e.g. column 0 is an ISO date, column 2 is text, column 3 is an integer)

Both the above could start with a Unicode character unlikely to be used for anything else.

I think that would avoid 99% of the pain of CSV files. The downside is that the use of things like the Unit separator mean that it wouldn't be easy to create/edit manually.

I don't suppose it will ever happen though.

cbsmith 2021-08-18 21:52:33 +0000 UTC [ - ]

> it wouldn't be easy to create/edit manually

I mean, you'd have to be using a pretty terrible tool for it not to be able to handle that, and I suspect if such an approach were to become prevalent, that tool would either fix the glitch or become largely unused.

hermitcrab 2021-08-18 22:00:58 +0000 UTC [ - ]

Are there any editors that let you insert a Unit separator character as easily as a comma?

cbsmith 2021-08-18 22:05:54 +0000 UTC [ - ]

All of the programmable ones? ;-)

hermitcrab 2021-08-18 22:15:54 +0000 UTC [ - ]

So that's a no then. ;0P

cbsmith 2021-08-19 02:12:51 +0000 UTC [ - ]

? Quite the contrary. It's more, "all the ones that any craftsman should be using".

lanstin 2021-08-19 04:34:14 +0000 UTC [ - ]

A lot of excels and csvs are made by people other than programming crafters.

cbsmith 2021-08-19 11:08:53 +0000 UTC [ - ]

Yes indeed. It often means they do extra work because they're using the wrong tools.

But you gotta admit that excel is a pretty terrible tool to hand edit a CSV with... ;-)

scbrg 2021-08-19 08:33:16 +0000 UTC [ - ]

As with pretty much everything else computing, the world suffers because Microsoft has been dumping terrible tools on it for decades, and people just take their garbage as the way things have to be.

cbsmith 2021-08-19 11:09:43 +0000 UTC [ - ]

Excel has its advantages, but it is funny the tools people choose to work with. The number of times Access would make more sense...

atatatat 2021-08-19 14:28:59 +0000 UTC [ - ]

Didn't they sunset Access in lieu of Power BI Apps or whatever the hell they're calling their shot at no code these days?

barbarbar 2021-08-19 03:19:52 +0000 UTC [ - ]

But it sounds like a very very good approach.

cbsmith 2021-08-18 19:14:41 +0000 UTC [ - ]

Yup. I mean, if you're going to go with a text encoding, you might want to, you know, use the features of the text encoding that were put there explicitly for said purpose...

...or you could invent abominations like CSV, TSV, etc. ;-)

wombatpm 2021-08-18 18:03:54 +0000 UTC [ - ]

TSV solves a lot of the pain

hermitcrab 2021-08-18 20:10:50 +0000 UTC [ - ]

As long as you don't need to store tabs or carriage returns in your data. ;0)

cbsmith 2021-08-18 19:13:10 +0000 UTC [ - ]

...or, you know, you could use the ASCII characters specifically defined for separating records and units. ;-)

derefr 2021-08-18 20:07:55 +0000 UTC [ - ]

If only I could type them on my keyboard. (I think this is a big part of why CSV is the way it is — people want to be able to hand-edit it, or at least hand-produce small test datasets to test the systems on the other end.)

cbsmith 2021-08-18 20:16:20 +0000 UTC [ - ]

The funny thing is you can type any character on a keyboard. It's the same weird disconnect about "text file formats are human-readable". Either way you need a tool that can render & read the format in question. It doesn't much matter what format you actually store the data in, because you don't read & write the bytes directly.

cbsmith 2021-08-18 16:11:18 +0000 UTC [ - ]

> There's also Avro, which fails your point #1 (it's a binary format), but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.

Avro is not a lossless alternate encoding of what's canonically a JSON document. Yes Avro supports a JSON encoding, but it's not canonical.

In general though, your point about Avro being able to be represented as text is valid, and applies to practically any binary format, which is why the whole "but it needs to be a text format" argument is garbage.

> The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)

I wish that the data-warehouse ecosystem standardized on anything. ;-)

That said, there are plenty of good reasons why a data-warehouse standard would not be advisable for spreadsheets.

derefr 2021-08-18 17:59:52 +0000 UTC [ - ]

> Yes Avro supports a JSON encoding, but it's not canonical.

To be clear, I'm not talking about using an Avro library to encode data to JSON. I'm saying that when you decode an Avro document, the result that comes out — presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them — is a JSON document.

Where, by "JSON document" here, I don't mean "a JSON-encoded text string", but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less. The sum-type of (JSONArray | JSONObject | String | Integer | Float | true | false | null). The "top" expression type recognized by a JSON parser. Some might call such a document a "JSON DOM." But usually it's a "JSON document", same as how the ADT you get by parsing XML is usually referred to as an "XML document."

Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data. They're all alternative encodings that have equivalent lossless encoding power over the same supported inputs.

cbsmith 2021-08-18 18:29:04 +0000 UTC [ - ]

> I'm saying that when you decode an Avro document, the result that comes out (presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them) is a JSON document.

Semantic point: it's not a "document".

There are tools which will decode Avro and output the data in JSON (typically using the JSON encoding of Avro: https://avro.apache.org/docs/current/spec.html#json_encoding), but the ADT that is created is by no means a JSON document. The ADT that is created has more complex semantics than JSON; JSON is not the canonical representation.

> By which I don't mean JSON-encoded text, but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less.

Except Avro has data types that are not the exact set of types that exist in JSON. The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.

> Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data

BSON, by design, was meant to be a more efficient way to encode JSON data, so yes, it is a way to encode JSON-typed data. Avro, however, was not defined as a way to encode JSON data. It was defined as a way to encode data (with a degree of specialization for the case of Hadoop sequence files, where you are generally storing a large number of small records in one file).

A simple counter example: Avro has a "float" type, which is a 32-bit IEEE 754 floating point number. Neither JSON nor BSON have that type.

Technically, JSON doesn't really have types, it has values, but even if you pretend that JavaScript's types are JSON's types, there's nothing "canonical" about JavaScript's types for Avro.

Yes, you can represent JSON data in Avro, and Avro in JSON, much as you can represent data in two different serialization formats. Avro's data model is very much defined independently of JSON's data model (as you'd expect).

derefr 2021-08-18 19:42:55 +0000 UTC [ - ]

> The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.

My understanding was always:

1. that the "primitive Avro types" are Avro's wire types, which are separate from its representable domain types. (Sort of like how RLE-ified data has wire types of "literal string" and "repeat literal N times".)

2. that any data that would not be valid as input to a JSON encoder, is not valid as input to an Avro encoder, because its wire types are defined in terms of a mapping from a set of domain types that are exactly the set of domain types accepted by JSON encoders (whether they're explicitly noted as being those or not.)

Or, to put that another way: an Avro schema is — besides a validation step that constrains your data into a slightly-more-normalized/cleaned format — mostly a big fat hint for how to most-efficiently pack an (IMHO strictly JSONly-typed) value into a binary encoding. Differences between "long" and "int" on the wire aren't meant to decode to different domain types (at least, by default); they're just meant to restrict the data's allowed values (like a SQL DOMAIN constraint) in ways that allow it to be more predictable, and so to be wire-encoded more optimally.

Let me lay out some evidence for that assertion:

• Avro supports specifying e.g. "bytes" vs. {"array": "byte"} — there's literally no domain-type difference in those! But one is a wire-encoding optimization over the other.

• Avro has a "default" property, and this property—as part of the JSON-typed schema—can only take on JSON-typed values. Do you think this is an implementation constraint, or a design choice?

• Avro's enum type's "symbols" array? Once again, defined by (and therefore limited to) JSON string values.

• Avro doesn't implement an arbitrary-precision integer type, even though its wire-encoding for integers would support one just fine. Why? Seemingly only because JSON doesn't have an arbitrary-precision integer type (because JavaScript doesn't have a native BigNum type); nor does JavaScript/JSON have any obvious type to O(1)-efficiently deserialize a BigNum out into. (Deserializing BigNums to strings wouldn't be O(1).) Every other language offers a clean 1:1 mapping for bignums, but JavaScript doesn't, so JSON didn't, so Avro doesn't.

• And why do you think Avro schemas are stored as embedded explicitly-defined-to-be-JSON documents within the root-level record / .avsc file, anyway? This means that you are required to have a JSON decoder around (either at decode time, or at decoder codegen time) to decode Avro documents. Why would this be, if not because the Avro implementation is (or at least originally was) expected to decode the Avro document's wire types into the JSON library's already-defined ADTs, relying on e.g. having those "default"-parameter values already loaded in in JSON-value format from the schema's decode-output, ready to be dropped seamlessly into the resulting Avro decode-output?

And the biggest knock-down argument I'm aware of:

• Avro "string" doesn't support "\u0000". Why not? Because as you've said, Avro has a "JSON encoding", which specifies one-to-one mapping for strings; and JSON doesn't support "\u0000" in strings. (Just ask Postgres's jsonb type about that.) Since an Avro string containing "\u0000" wouldn't round-trip losslessly between the JSON and binary wire-encodings, it's not allowed in strings in the binary encoding.

cbsmith 2021-08-18 20:47:28 +0000 UTC [ - ]

Since it is a serialization format, Avro's types are its wire types. However, the primitive types are just a subset of the types that Avro supports.

Based on these comments, my best guess is you got the idea that Avro was for encoding JSON because the schema declaration is encoded in JSON, but that's not nearly the same as the data model. There are some terrible implementations of Avro libraries out there that use JSON as some kind of middleware, but that's not how Avro actually works.

If there's a type model it is derived from at all, it's the Java type model.

"byte" is not a valid type in Avro. There is only "bytes", and the domain model reflects this. You can't work with individual "byte" of a "bytes" object.

Default values are encoded in the schema, and so that does limit what kind of default values you can have, but again this is a limitation derived from the schema being defined in JSON, and how the schema language was defined in general. So your defaults have to be represented as JSON literals, but they don't even necessarily share the type of the JSON literal (e.g. a field defined as: '{"name": "foo", "type": "long", "default":1}' does not have the same default value as '{"name": "bar", "type": "int", "default":1}', because "foo" has a default value that is a long while "bar" has one that is an "int"). Note that "default values" are a property of the type, and only apply to elements inside complex data types. JSON has no such equivalent concept.

Avro's type model does have an arbitrary precision type that doesn't correlate to anything in JSON: the "decimal" logical type.

You aren't required to use a JSON decoder to decode Avro documents, nor are you required to use a .avsc file. The Avro schema file is just the standard way to represent a schema. If you have the schema, you don't need the file. JSON schema files are one of the poorer choices in the Avro design, but you'll notice that the schema is defined the way it is specifically so that it can cover a type model well outside of JSON. You'll also notice the names of types in Avro don't directly correlate to names of types in JSON.

* The \u0000 thing is a bug in avro tools, but there is nothing about the spec that prohibits having \u0000 in your strings.

I feel like in general this is like a retcon exercise, where you've reached a conclusion and are drawing evidence to prove it, while ignoring the information that contradicts. I spoke with Cutting a fair bit when he came up with Avro, and I can assure you, while the schema language does very intentionally use JSON, Avro is not a representation for JSON types.

amyjess 2021-08-18 22:37:26 +0000 UTC [ - ]

As someone who's used Avro [0], it's a pain because Avro records must be ordered, but JSON by definition is unordered. Avro's "JSON" format is more of a JSON-like format. At one point, when I was writing a script that would ingest an Avro record and then output a new one, I had to fiddle with things to make Python use an OrderedDict [1] so the new record would be output in the right order.

[0] though the last time I touched it was in 2014, so things might've changed since

[1] as per [0], this was before Python dicts were ordered by default

cbsmith 2021-08-19 02:20:38 +0000 UTC [ - ]

I'm not sure what you mean by Avro records "must" be ordered. If you mean that the serialization format specifies the ordering of the fields, then yes, that is true, but that's an advantage in terms of compactness and processing efficiency (https://avro.apache.org/docs/current/spec.html#order). If you don't like it though, there are other formats like protobuf and thrift that have no such requirement, at the cost of ~3 bits per field, which can be a comparatively efficient trade off.

nonameiguess 2021-08-18 16:06:08 +0000 UTC [ - ]

It's not possible to read ahead and chunk with 100% assurance it will always work, but libraries like pandas and R's data.table do a reasonable job of reading in the first X rows, doing some usually correct type inference on the columns, and then reading ahead and chunking the rest of the rows.

For what it's worth, I totally agree something like compressed json lines is a better data exchange format, but part of why csv remains as universal and supported as it is is that so many existing data storage applications export to either csv or excel and that's about it. So any ETL system that can't strictly control the source of its input data has no choice but to support csv.
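
Coming back to the chunked-read approach, it looks something like this (a sketch assuming pandas; the file name and the process() step are placeholders):

  import pandas as pd

  # Infer column types from a sample of the file...
  sample = pd.read_csv("big.csv", nrows=10000)
  dtypes = sample.dtypes.to_dict()

  # ...then stream the rest in fixed-size chunks, reusing the inferred types
  for chunk in pd.read_csv("big.csv", dtype=dtypes, chunksize=1000000):
      process(chunk)   # process() is a placeholder for the actual ETL step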

breck 2021-08-18 20:24:46 +0000 UTC [ - ]

> means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically.

There's a subset of CSV that forbids escapes and is super fast to parse. All fast CSV parsers I'm aware of take advantage of this subset. I try to never ever publish a CSV that has quotes, and always aim for a more restrictive grammar: cleaner, better thought out data.
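
In other words, once you can guarantee there is no quoting, the whole parser collapses into splits. A minimal sketch (the guard is the important part):

  def parse_simple_csv(blob):
      # Fast path: only valid when the file never uses quoting or escaping
      if '"' in blob:
          raise ValueError("quoted fields present, use a full CSV parser")
      return [line.split(",") for line in blob.splitlines() if line]

  print(parse_simple_csv("a,b,c\n1,2,3\n"))   # [['a', 'b', 'c'], ['1', '2', '3']]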

grumple 2021-08-18 17:28:58 +0000 UTC [ - ]

Commas, quotes, colons, newlines, and hyphens (JSON and YAML separators) can occur in content as well. How is that any different from separators occurring in csvs?

In practice, I used TSVs a lot, as tabs do not usually occur in most data. Alternatively, you could use pipes (|) or control characters as field or row separators.

kaiju0 2021-08-18 16:05:20 +0000 UTC [ - ]

Most data warehouses I have worked with use character delimited formats.

*shudder

gsich 2021-08-18 16:24:54 +0000 UTC [ - ]

If you interpret "CSV" as purely comma seperated values then maybe. But in my bubble "CSV" means textfiles that are separated by some separator. Be it tabs, spaces, commas, or any other ASCII character. Some are more usable then others, if you have commas in your data then use tabs. If you have tabs use Form Feed or Record Separator or vertical tabs ... and so on.

Of course this is not always applicable, since you sometimes don't control the format you get your data in.

jolmg 2021-08-18 16:44:35 +0000 UTC [ - ]

> But in my bubble "CSV" means textfiles that are separated by some separator. Be it tabs, spaces, commas, or any other ASCII character. Some are more usable than others: if you have commas in your data then use tabs. If you have tabs use Form Feed or Record Separator or vertical tabs ... and so on. Of course this is not always applicable, since you sometimes don't control the format you get your data in.

When I see CSV parsers like https://www.papaparse.com/ that even try to support comments and empty lines in the format, I wonder if it'd really be that bad to just raise an error on anything that doesn't fit RFC 4180[1], with explanation on where and how the file is corrupted. Push the issue to the writers of such files.

Then I remember the Robustness Principle[2] and I chill a little.

[1] https://datatracker.ietf.org/doc/html/rfc4180

[2] https://en.wikipedia.org/wiki/Robustness_principle

hermitcrab 2021-08-18 20:14:44 +0000 UTC [ - ]

It seems quite common in some European countries to use semi-colons as the delimiter instead of commas (because they use commas as the decimal separator?), adding a new level of fun to parsing. In Easy Data Transform we count the number of commas, semi-colons, tabs in the file to make an educated guess as the delimiter (which the user can override).
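
Python's standard library even ships a similar heuristic as csv.Sniffer, and the counting guess itself is a one-liner (sample text invented):

  import csv

  sample = "name;value\nalpha;1,5\nbeta;2,5\n"   # ';' delimiter, ',' as decimal comma

  # Stdlib heuristic
  print(csv.Sniffer().sniff(sample, delimiters=",;\t").delimiter)   # ';'

  # Or the simple count-the-candidates guess
  print(max(",;\t", key=sample.count))                              # ';'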

unnah 2021-08-19 06:43:44 +0000 UTC [ - ]

Semicolon-separated "CSV" is a quirk of Microsoft Excel, which they have never fixed. I figure Microsoft would prefer that people use XLS instead of CSV.

worik 2021-08-19 01:28:13 +0000 UTC [ - ]

Pipe '|' here

asdff 2021-08-18 17:17:04 +0000 UTC [ - ]

Those other things have different names like TSV

jhbadger 2021-08-18 17:58:25 +0000 UTC [ - ]

In theory, yes, to be pedantic, but for example, LibreOffice saves its exported CSVs by default as tab delimited. You actually have to manually specify you want commas to get those.

jolmg 2021-08-18 18:59:48 +0000 UTC [ - ]

> LibreOffice saves its exported CSVs by default as tab delimited

Maybe it's actually presenting what you've selected last?

It's giving me comma as the default separator, and it's the first option in the dropdown. Tab is the 3rd option.

>> Those other things have different names like TSV

That depends on the writer. I've gotten what should be named PTVs (pipe-terminated values) as CSVs. I can understand how it happened. If the underlying software outputs PTVs, you don't want to bother converting that because you're working in a legacy language that's a pain to work with (the type where identifiers can't be longer than 4 chars), and you want the user to be able to double-click on it and have it open in a spreadsheet without prior configuration, you just push the issue to the reader of the file, since by tradition readers are already quite tolerant of format differences...

Of course, there'll always be the case where the reader is simply not tolerant enough, like when the escaping syntax differs. There doesn't seem to be a way to get LO Calc to interpret "foo|bar\|baz|" as cells "foo" and "bar|baz", for example.

derefr 2021-08-18 20:09:29 +0000 UTC [ - ]

And then Excel imports them all using heuristics when you select "Type: CSV". So you'll never train anyone on the demand end of these documents that they're called anything besides CSVs.


gsich 2021-08-18 22:04:38 +0000 UTC [ - ]

Technically true, but like I said, "CSV" is more a term for human readable data with some delimiter in it. Maybe it's a comma, maybe not. In every case you need someone to look at it. If you want a machine-machine data protocol you can use XML or JSON if it needs to be somewhat human readable.


veltas 2021-08-18 20:57:14 +0000 UTC [ - ]

>CSV is actually horrible at parse-time vs. other data formats

I find this really hard to believe given it's a simple enough syntax. And parsing is usually not the limiting factor, usually fast enough to not be noticed alongside interpreting or loading the source data. Every (much more sophisticated) compiler I can think of uses a linear parser based on this assumption.

mortehu 2021-08-18 21:07:41 +0000 UTC [ - ]

Just extracting string views can be fast, but converting numbers to and from text is very slow.

veltas 2021-08-19 12:10:14 +0000 UTC [ - ]

But significant compared to time to load the source in the first place? Or any processing that will come after?

worik 2021-08-19 01:29:47 +0000 UTC [ - ]

It is a text format. So you will always have to do that conversion.

Binary formats have their own pain points - more of them actually

glogla 2021-08-18 17:14:17 +0000 UTC [ - ]

Avro and Parquet and Arrow are almost lossless - they can't handle timestamp with timezone which really sucks for processing world-wide data.

I'm still looking for something that can do it.

(Of course, Spark doesn't support timestampz which is probably why the formats don't.)

7thaccount 2021-08-18 20:54:02 +0000 UTC [ - ]

CSV is far from perfect, but it's nice that I can easily work with them without needing any libraries. All I need is file I/O and the ability to split strings. It doesn't get much simpler.

I'll admit though that "import JSON" and then being able to essentially convert the entire file into a dictionary is nice if the data has more structure to it.

jrumbut 2021-08-19 01:05:34 +0000 UTC [ - ]

The real advantage of CSV, in my mind, is that if the CSV is valid and normal then it's going to be a rectangular dataset (ignoring semantics within the dataset).

If I import JSON data I have no idea what shape the result will be in, and it requires a separate standard to let me know about columns and rows and validation can get complicated.

justinator 2021-08-19 03:36:08 +0000 UTC [ - ]

CSV is that way too. There's nothing that says each row has to have the same number of what maps out to columns, or anything that tells me what those columns really represent (there's no schema). You could use the first line of the CSV doc to say what each tuple is named, but that's not a standard or anything. And without a schema, it certainly could be easy to lose the metadata of the info the CSV file is trying to represent. Is this column just a bunch of numbers or a date format? (for example). CSV is OK for importing and exporting data across systems that know what the format is without the help of a schema, but anything else and you run into a pile of edge cases. Even using a CSV file to import into a spreadsheet works usually but context is often lost.

Frankly, I love the format.

emmelaich 2021-08-19 00:15:50 +0000 UTC [ - ]

It's funny to complain about CSV when JSON is also a minefield: http://seriot.ch/parsing_json.php

CSV is fine .. usually

sgolestane 2021-08-19 01:05:25 +0000 UTC [ - ]

Nice as long as JSON is valid and not too big.

dheera 2021-08-18 21:18:00 +0000 UTC [ - ]

CSV is still easier to parse because the C++ dudes still refuse to implement some kind of nice operator-overloaded interface like

    #include <json>

    std::json myjson("{\"someArray\": [1,2,3,4,{\"a\": \"b\"}]}");
    std::cout << (std::string)myjson["someArray"][4]["a"];
and the result is we have 50 different rogue JSON libraries instead of an STL solution. Until the STL folks wake up, boost::split can deal with the CSV.

mivade 2021-08-18 21:19:43 +0000 UTC [ - ]

meepmorp 2021-08-18 23:59:51 +0000 UTC [ - ]

Thank you!

dheera 2021-08-18 21:25:24 +0000 UTC [ - ]

ooh this is nice. STL should adopt it

RodgerTheGreat 2021-08-19 00:43:24 +0000 UTC [ - ]

A string split function is a poor choice if there's any possibility the CSV file contains quoted fields. Robust handling of both CSV and JSON requires a parser. In my experience, CSV can actually be trickier than JSON to parse because there are so many edge cases, alternatives, and ambiguities.


08-15 2021-08-18 23:30:41 +0000 UTC [ - ]

> All I need is file I/O and the ability to split strings.

...until there is a newline inside a field.

The moronic quoting mechanism of CSV is one half of the problem; people like you, who try to parse it by "just splitting strings" is the other half. The third half is that it's locale dependent and after 30+ years, people still don't use Unicode.

bsg75 2021-08-19 00:18:11 +0000 UTC [ - ]

Never write your own parser, especially as just string.split(), and when possible don't use C(omma)SV formats, but C(haracter)SV, aka DSV: https://en.wikipedia.org/wiki/Delimiter-separated_values

There are non-printable, non-typable characters specifically defined as separators (ASCII 28-31) with UTF equivalents.

7thaccount 2021-08-19 03:17:58 +0000 UTC [ - ]

As printed above, I'm not writing software where that would be a problem. In general if I was writing a commercial application or something in production where I don't control all the inputs I would agree 100%, but I'm lucky to not have those problems. I could use a bunch of libraries and additional code to try to catch non-existent errors I don't have, or fix my actual problems and move on.

I appreciate the perspective though for sure. I'm guilty of the same thing on HN, where I assume people have the same uses as me when they're writing code that has to be extremely robust or blazingly fast.

7thaccount 2021-08-19 03:10:58 +0000 UTC [ - ]

You've assumed an awful lot about my use cases. The data I deal with in .CSV form is always pre-processed and doesn't have any of the minefield occurrences you've mentioned. There can't be a newline or anything like that in an input. In my decade of using .CSV files daily, I've only had one tertiary system where that is a problem.

Also, when doing interactive work, it's a bit different than writing production IT software.

karteum 2021-08-18 17:39:52 +0000 UTC [ - ]

I personally believe that at least SQLite matches all those criteria :

"1) A truly open format is available" : sqlite is open-source, MIT-licensed, and well specified (even though I am usually not so happy with its weak typing approach, yet in this case this precisely enables a 100% correspondance between CSV and sqlite since CSV has also no typing at all...)

"2) Applications have a speed increase from using csvs" : I think it should be obvious to everyone that this is the case...

"3) The new format should not be grossly larger than the one it is replacing" : this is also the case

anigbrowl 2021-08-18 22:05:38 +0000 UTC [ - ]

It's worth mentioning that SQLite is committed by its publishers to be backward-compatible out to 2050 and is recommended by the Library of Congress as a safe long-term storage format (as are XML, JSON, and CSV). https://www.sqlite.org/locrsf.html

For single tables a database is probably overkill, but it's nice to have around when you need something reasonably powerful without being overly complex or hard to get started with.

dolni 2021-08-19 03:54:50 +0000 UTC [ - ]

I suppose if I was to recommend some "safe long-term storage formats" I, too, would choose things that are readable in plain ASCII and/or have open source roots.

Shoutout to the Library of Congress for doing the real heavy lifting here?

unnouinceput 2021-08-18 17:50:30 +0000 UTC [ - ]

sqlite files are binary files, so a non-starter. If I open them in Notepad I can't read the values like I can with CSV.

euroderf 2021-08-19 06:37:42 +0000 UTC [ - ]

How about an app that opens an SQLite DB so that strings and numbers can easily be edited, and then on file save, SQLite's internal pointers and markers are updated ? Maybe an Emacs mode ?

postalrat 2021-08-18 18:09:22 +0000 UTC [ - ]

"If I open them in Notepad"

Well there's your problem.

munk-a 2021-08-18 20:55:20 +0000 UTC [ - ]

There is a definite demand for some import format that you can trivially edit - CSV excels at this right now and JSON isn't that bad. Binary dump formats definitely do have a time and a place but there is also a separate need for trivially human readable formats.

postalrat 2021-08-18 22:00:43 +0000 UTC [ - ]

I guess my point is that CSVs or any text files aren't really "human readable". You still need some application to view them.

icelancer 2021-08-19 01:09:11 +0000 UTC [ - ]

I can - and do - read a CSV by opening it in Notepad++ to view basic data. This isn't possible with a SQLite file.

creata 2021-08-19 05:04:39 +0000 UTC [ - ]

In some incredibly pedantic sense, sure. But the difference is that every computer comes with what a human needs to read a CSV file. That's not true of an SQLite database.

atatatat 2021-08-19 14:46:06 +0000 UTC [ - ]

Could it be?

HanaShiratori 2021-08-19 01:02:45 +0000 UTC [ - ]

So what's the alternative? Good old pen and paper?

vertere 2021-08-19 07:16:26 +0000 UTC [ - ]

Doesn't count, you still need light to make it human-readable.

/s

dragonwriter 2021-08-18 16:05:04 +0000 UTC [ - ]

> The author is not forced to use csvs.

The author appears to be a consultant selling data prep/transformation services. As long as the market is using CSVs, he’s forced to use CSVs, at least as end-of-pipeline inputs and outputs.

Of course, “people optimize their workflows for something other than making my job easy” is a common complaint, but rarely a persuasive one in motivating action from others with different jobs.

dylan604 2021-08-18 16:23:10 +0000 UTC [ - ]

The consultant should be less concerned with what makes their own life/job easier and more with what makes their client's job easier by working with the consultant. If the client has to jump through hoops to serve consultant1 data in a format they are not used to, but consultant2 will accept the data they already have available or can export in a format familiar to them, the client will prefer consultant2.

dragonwriter 2021-08-18 16:26:46 +0000 UTC [ - ]

Sure, but if everyone is blocked from using the format consultant1 opposes because the industry wakes up one day and abandons it utterly, then consultant2 no longer has an advantage, which is why someone who would be consultant1 except that they don't want to lose business to consultant2 lobbies for it.

dylan604 2021-08-18 18:35:11 +0000 UTC [ - ]

My point is that the consultant needs to understand the capabilities of the client more so than dictating terms just because it is convenient to the consultant. The scenario you describe is not out of the realm of possibility, but it's also not unheard of to have a client with such antiquated systems that there is no using of modern anything. Hence, why CSV will never die.

lallysingh 2021-08-18 22:38:41 +0000 UTC [ - ]

Really, just use the right delimiters to start: https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text

ASCII 31 (unit separator) instead of commas, ASCII 30 (record separator) instead of newlines. Now commas and newlines can go into your values.

If that's no good, zstd-compressed proto.
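
For anyone curious, a minimal sketch of that idea in Python (the helper names are mine, nothing below comes from a library beyond str.join/split):

    US, RS = "\x1f", "\x1e"  # ASCII unit separator (31) and record separator (30)

    def write_records(rows):
        # Join fields with the unit separator and records with the record separator.
        return RS.join(US.join(fields) for fields in rows)

    def read_records(blob):
        # Splitting is unambiguous because the separators never appear in the data.
        return [record.split(US) for record in blob.split(RS)]

    rows = [["id", "note"], ["1", "commas, and\nnewlines, are fine here"]]
    assert read_records(write_records(rows)) == rows

The trade-off, as others note in this thread, is that most editors can't easily type or display those control characters.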

cls59 2021-08-18 15:55:14 +0000 UTC [ - ]

To point 1, I'd argue that a SQLite database is a great next step beyond CSV despite being a binary format.

da_chicken 2021-08-18 16:13:36 +0000 UTC [ - ]

No, SQLite's dynamic data types would silently coerce data just like opening a CSV directly with Excel does.

The advantage of CSV is that it's as accurate as your plain text representation of your data can be. Since binary data can be represented by character data, that's 100% accurate. As soon as you introduce a storage format that has made assumptions about the type of data being stored, you've lost flexibility.

SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.

Tagbert 2021-08-18 18:24:33 +0000 UTC [ - ]

Excel's default insistence on trying to coerce the data into what it expects the data to be is the source of many of the problems of people using CSV. Excel will even ignore quotes around values if it thinks that the value inside might be a number. I often work with CSV files that contain 17 digit IDs that use numeric characters "01233998742021925" which Excel will convert lossily into scientific notation rather than leaving it as a string. There are ways to override that but they are tedious and people don't do it by default.

I'd be satisfied if Excel was just a little less aggressive about its data coercion on import.

hermitcrab 2021-08-18 20:18:41 +0000 UTC [ - ]

Excel trying to be helpful and silently converting data is a source of endless pain. In fact it is so bad, that geneticists have renamed some genes: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

masklinn 2021-08-18 16:28:37 +0000 UTC [ - ]

> No, SQLite's dynamic data types would silently coerce data just like opening a CSV directly with Excel does.

SQLite's "dynamic data types" coerce data on input, not output. Once the data is in sqlite the way you wanted it, excel has no interpretation to perform, except insofar as really really wanting dates.

> The advantage of CSV is that it's as accurate as your plain text representation of your data can be.

Yeah nah.

> SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.

That's completely, absolutely, utterly, objectively, nonsensical.

Data analysis, exchange, container, and archiving are literally all use cases listed as "appropriate uses for sqlite" in "Situations Where SQLite Works Well" on the official sqlite website: https://www.sqlite.org/whentouse.html

edmundsauto 2021-08-18 16:43:15 +0000 UTC [ - ]

How is a csv not the most accurate representation of the data? If you trust the other agent encoded it properly in the db, then sure. Your flippant dismissal was inappropriate in tone and detracted from the rest of your opinion.

Cockiness tells me that you’re insecure about your knowledge, not that you know more than GP.

X6S1x6Okd1st 2021-08-18 17:15:25 +0000 UTC [ - ]

There's plenty of CSVs that have been produced or will be parsed by

    for row in rows: out.write(','.join(row) + '\n')   # naive writer: no quoting or escaping
    for line in input: fields = line.split(',')        # naive reader: splits on every comma

It's not exactly a problem with "CSV" specifically, but the environment in which it exists.

edmundsauto 2021-08-18 21:06:49 +0000 UTC [ - ]

True, but if you don't trust the code generating the data, then any and every data format is suspect.

masklinn 2021-08-18 17:15:02 +0000 UTC [ - ]

> How is a csv not the most accurate representation of the data? If you trust the other agent encoded it properly in the db, then sure.

The idea that a CSV would be more likely to be correctly encoded than a DB is hilarious, thanks for the laugh. But that you were confident enough to seriously put it in writing shows how little experience you have with CSV.

edmundsauto 2021-08-18 21:05:23 +0000 UTC [ - ]

Yep, you got me. I'm actually a trapeze artist moonlighting on these forums.

A CSV file represents the exact digits that are to be stored. You have unlimited precision. You could even store irrational numbers, equations, or mix data types in a column. OTOH, you have to make sure the delimiting character is not present in the data - that can be pretty easy, if you use the field-separator character in ASCII, or even just a \t. I've even seen people terminate fields with ^8675309| because they felt confident no data would contain Jenny's number.

A database, like Excel, likes to conform data. This is usually awesome! But sometimes, it's not.

da_chicken 2021-08-18 16:53:53 +0000 UTC [ - ]

> SQLite's "dynamic data types" coerce data on input, not output.

That's not relevant. If a data format coerces data when it's stored, it's still not a data format suitable for serialization.

dasyatidprime 2021-08-18 17:58:34 +0000 UTC [ - ]

It's strictly dependent on the table definitions you use. A column with no type (the simplest way to use them) will take data in any of the natively supported types—whatever you actually insert in your SQL statement—and will not attempt any translation.¹

If you do add column types to your tables as in traditional databases, then of course you should expect some varying behavior based on that.

I can imagine some potential for issues from people expecting to insert exact decimal numbers and choosing floating-point by accident, perhaps, or using integer types for actually-text numeric ID fields. If you go all-strings on input with no column affinity, then you'll get the same strings on output, though, so that's isomorphic to good-CSV.

¹ https://www.sqlite.org/datatype3.html – §3.1 “If the declared type for a column contains the string "BLOB" or if no type is specified then the column has affinity BLOB.” + §3 “A column with affinity BLOB does not prefer one storage class over another and no attempt is made to coerce data from one storage class into another.”
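
A quick way to see that no-coercion behaviour, as a minimal sketch using Python's built-in sqlite3 module (the table and values are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id, label)")  # no declared types -> BLOB affinity, no coercion
    con.execute("INSERT INTO t VALUES (?, ?)", ("01233998742021925", "as text"))
    con.execute("INSERT INTO t VALUES (?, ?)", (42, "as integer"))
    for row in con.execute("SELECT id, typeof(id) FROM t"):
        print(row)
    # ('01233998742021925', 'text')  <- leading zero preserved, no scientific notation
    # (42, 'integer')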

lmilcin 2021-08-18 15:57:25 +0000 UTC [ - ]

SQLite also requires an external dependency and a special execution environment, and it is not as portable as CSV.

srcreigh 2021-08-18 16:00:50 +0000 UTC [ - ]

SQLite is included in most operating systems by default. It's also on macOS and iOS.

It also runs on all kinds of embedded devices in addition to personal computers and servers.

https://en.wikipedia.org/wiki/SQLite#Operating_systems

lmilcin 2021-08-18 17:39:30 +0000 UTC [ - ]

Just because most popular operating systems have the libraries doesn't mean it is portable.

Can I click on it and open it in Excel?

If not then it is not portable for me and for a lot of other people.

The main reason I use CSV is to produce reports that I can either open myself or send to other people so that they can click on it and open themselves and immediately start hacking away.

Excel is still corporate lingua franca and until this changes CSV is not going to retire.

CSV means for me that I can send it to anybody and everybody will know how to use it.

Try sending SQLite binary to people you work with, to your accountant, to your wife. Then share your experience.

wombatpm 2021-08-18 18:13:40 +0000 UTC [ - ]

Try using tabs. They still open in Excel, with fewer problems.

lmilcin 2021-08-18 18:30:01 +0000 UTC [ - ]

I find files where the exact type of whitespace used is meaningful to be a sick joke.

Not all editors make it easy for you to use a tab character.

If only there was some other solution that would not require you to use a character that a lot of editors refuse to insert or even try to convert to spaces...

ab111111111 2021-08-18 16:31:16 +0000 UTC [ - ]

I'd add:

4) Changes to the replacement format should be human-readable in a diff

jonnycomputer 2021-08-18 16:39:49 +0000 UTC [ - ]

Agree. But a lot of this is tooling though: whether it's adequate and whether it's ubiquitously available. E.g. you can diff two directories in Linux, which is super useful, but directories are not text files. The issue is whether the differences can be reasonably represented in text-formatted output.

masklinn 2021-08-18 17:21:47 +0000 UTC [ - ]

I'll take a good format and a diff filter thank you very much.

It's easy to write a diff filter for e.g. xlsx, it's not possible to make CSV any good.

red_admiral 2021-08-19 07:57:32 +0000 UTC [ - ]

For most of my use cases, I also need

4) can easily interact with Excel.

A lot of things 'around here' run off spreadsheets, because relational databases weren't invented here and no-one ever needs more than a few million rows apparently.

I seem to spend half my job on the current project exporting stuff to CSV, running it through my code, then opening the resulting CSV in excel and formatting it a bit and saving as .xlsx again.

Still, at least I don't have to use Visual Basic that way.

osacial 2021-08-18 18:56:59 +0000 UTC [ - ]

Besides, CSV can be written by a human, and initially that was how data was entered and read, because it is a text-file database. If there is a need for something more convenient, then there are other file types. Nonissue, IMO.

danellis 2021-08-19 04:37:07 +0000 UTC [ - ]

> The author is not forced to use csvs.

But some people are. There are entire industries built around the exchange of CSV files, and the producers and consumers don't necessarily talk to each other.

joelthelion 2021-08-18 17:34:28 +0000 UTC [ - ]

Parquet (and a few other competitors) have open-source implementations in many languages, are much faster to load than csv, are natively compressed, are strongly typed and don't require parsing...

There are few reasons to continue using csv in this day and age.

cbsmith 2021-08-18 16:04:26 +0000 UTC [ - ]

1. open(2) doesn't care about the file format

2. Yes, mixing formatting with data slows down data processing, don't do it.

3. Excel is not the replacement for CSV, and CSV is not a compact format. I mean, maybe if you are used to XML it is, but otherwise, just no.

Yes, there is truly a problem.

da_chicken 2021-08-18 16:30:02 +0000 UTC [ - ]

> CSV is not a compact format

That's only true if you're trying to send all your data in a single, monolithic CSV.

If you're sending multiple CSVs, you're capable of representing data as well as a relational data store. Which is to say, you're representing your data using a system of data normalization specifically designed to minimize data duplication. A single CSV represents a single table, and in most cases with intelligent delimiter selection you can represent an entire data set with no more than one character spent between fields or records.

Yes, you do have situations where you're losing data density due to using plain text strings, but that's not a limitation particularly unique to CSV among data serialization formats. Additionally, it is a problem that can largely be mitigated by simple text compression. Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.

If you're working with an API, object by object, then JSON is certainly going to be better, yes, because you can use the application's object representation. If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring an entire system, you're not going to do much better than CSV.

cbsmith 2021-08-18 16:44:33 +0000 UTC [ - ]

> That's only true if you're trying to send all your data in a single, monolithic CSV.

No, that's true in general. There are exceptional cases where it might not be true, but it's true in general.

> Yes, you do have situations where you're storing losing data density due to using plain text strings, but that's not a limitation particularly unique to CSV for data serialization formats.

That there are other inefficient data serialization formats does not mean that it is not an inefficient data serialization format.

> Additionally, it is a problem that can largely be mitigated by simple text compression.

"Simple text compression"? Not sure what you are referring to there. If you are speaking of generalized compression, there's a bunch of information theory pointing out the problem with that thinking.

> Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.

It's pretty easy to represent arbitrary precision decimal numbers in non-text formats and they mean the same thing regardless of the system on the other end, but let's say you were right. Great, now represent arbitrary precision binary numbers (you know, the kind that computers work with), in text formats.

> If you're working with an API, object by object, then JSON is certainly going to be better,

Let me stop you right there. It is not "certainly going to be better". JSON is generally worse.

> If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring and entire system, you're not going to do much better than CSV.

There's a whole world out there of formats that are better than CSV at this job, and in many cases they are A LOT better than CSV. Even Avro, for all its pitfalls, is MUCH better.

sicariusnoctis 2021-08-18 22:06:01 +0000 UTC [ - ]

> > Additionally, it is a problem that can largely be mitigated by simple text compression.

> "Simple text compression"? Not sure what you are referring to there. If you are speaking of generalized compression, there's a bunch of information theory pointing out the problem with that thinking.

For most natural text-only data, .csv.gz is probably close to the limits of what we can do. For numerical or categorical text containing patterns or structure, I agree that better formats may exist (e.g. H.264 for CSV files containing video frames). Not sure how a more general format will solve anything you couldn't already do by converting to a more specific format and tarring everything afterwards.

da_chicken 2021-08-18 16:52:55 +0000 UTC [ - ]

> No, that's true in general. There are exceptional cases where it might not be true, but it's true in general.

Okay, put your money where your mouth is. Prove it.

cbsmith 2021-08-18 17:46:59 +0000 UTC [ - ]

CSV uses decimal representations of numeric data, which means you are getting 3.5 bits of data for every 8 bits of storage space (and that's assuming you are using a reasonably compact text encoding... if you are using UTF-16, it's 16 bits). Using a binary representation you can store 8 bits of data for every 8 bits of storage space.

CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
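
For a rough sense of that size gap, a back-of-the-envelope sketch in Python (the sample values are arbitrary):

    import struct

    n = 4503599627370496                      # an arbitrary large integer
    print(len(str(n).encode("ascii")))        # 16 bytes as decimal text
    print(len(struct.pack("<q", n)))          # 8 bytes as a binary int64

    iso = "2021-08-18T16:44:33.123Z"          # a typical CSV timestamp
    millis = 1629305073123                    # the same instant as epoch milliseconds
    print(len(iso.encode("ascii")))           # 24 bytes as text
    print(len(struct.pack("<q", millis)))     # 8 bytes as a binary int64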

CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).

Then there's floating point data...

Honestly, this is a silly argument. CSV is, by design, a format that trades space for other factors. It should not be debatable whether it is a compact format or not.

da_chicken 2021-08-18 22:17:38 +0000 UTC [ - ]

> CSV uses decimal representations of numeric data, which means you are getting 3.5 bits of data for every 8 bits of storage space (and that's assuming you are using a reasonably compact text encoding... if you are using UTF-16, it's 16 bits). Using a binary representation you can store 8 bits of data for every 8 bits of storage space.

XML, JSON, and YAML all have this issue, too.

> CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.

This is also identical to XML, YAML and JSON.

And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.

> CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).

This is also identical to XML (escaping XML entities, sometimes having to resort to CDATA), YAML (escaping dashes) and JSON (escaping double quotes).

All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those are three formats specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.

This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign, separated by great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?

Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right? Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF. If you don't like the fact that two systems can format text differently, you're going to have a whole lot of fun when you see how they can screw up binary formatting. Nevermind things like, "Hey, here's a binary file from 25 years ago... and nothing can read it and nobody alive knows the format," that you just don't get with plain text.

Yes, you do end up with wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.

cbsmith 2021-08-19 11:37:06 +0000 UTC [ - ]

> XML, JSON, and YAML all have this issue, too.

Yes. Though to their credit, some of those work with hex numbers, which at least gets you 4 bits out of every 8 bits.

> And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.

I'm not sure what argument you thought I was making, or why that comment is relevant.

> All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.

I'm not sure what you mean by "eliminate", or why you think it matters that there are other formats with the same design trade-offs.

> This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign separating great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?

CSV is used for a variety of purposes. The context of the article is using it for data transfer.

The claim was that it was a compact format for data transfer, which is demonstrably not true.

> Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right?

I'm not sure what "straight binary" means to you. JSON is, for the most part, a binary encoding standard (just not a particularly good one).

You've got the heritage a bit wrong, as XML was not originally designed for data transfer at all. It was an attempt to simplify the SGML document markup language, and the data transfer aspects were subsequently grafted on. JSON & YAML have a slightly more complicated heritage, but neither was intended as a data transfer format. They've all been pressed into service for that purpose, for a variety of reasons that can charitably be described as tactically advantageous but strategically flawed.

> Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF.

I don't know how to break this to you, but text formats can have endianness (in fact, insanely, UTF-8 does!), and systems can be confused about whether they are at EOF as well.

> Yes, you do end up with a wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.

Wouldn't ZIP be a binary format, with all the problems and concerns you have with binary formats?

So to summarize what you are saying... "CSV is a compact format because you can compress it if you are concerned about all the space it wastes".

Would it be fair to say then that any binary format is a text format because you can convert the binary into a text representation of the data? ;-)

Twisell 2021-08-18 15:18:48 +0000 UTC [ - ]

OP and you gave me an idea :

"The only true successor of CSV should be forward/backward compatible with any existing CSV variant"

If we manage to write a spec that meets this criteria, we'll have a powerful standard with easy adoption.

Macha 2021-08-18 15:55:03 +0000 UTC [ - ]

The issue is CSV is not compatible with CSV. It's not possible to write a spec that covers all CSV files in the world. CSV means things that are mutually incompatible in the less common cases, and the only way to really parse them correctly is to know which variant generated it. But you can't even tag that variant in the file by your criteria, as existing CSV parsers won't understand it.

godshatter 2021-08-18 16:36:53 +0000 UTC [ - ]

Couldn't tools that read CSV files scan them first and see which variant best matches the file? The questions that arise are about which end-of-line character(s) are used and how double-quotes and commas are handled. There can't be that many ways to escape them, and there are three sets of end-of-line characters mentioned - just see which one is used (i.e. don't assume only \n if you run into \r\n or \r alone).

The assumption that most software uses is that the import file will be in the same variant of the format as what that tool exports. That seems to be more of a problem than anything else.
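
Python's standard library even ships a heuristic for this; a minimal sketch (the file name is hypothetical), with the caveat that the sniff is only a guess and can fail or be wrong:

    import csv

    with open("unknown.csv", newline="") as f:    # hypothetical input file
        sample = f.read(64 * 1024)
        dialect = csv.Sniffer().sniff(sample)     # guesses delimiter/quoting; raises csv.Error if it can't
        f.seek(0)
        rows = list(csv.reader(f, dialect))

    print(repr(dialect.delimiter), repr(dialect.quotechar))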

masklinn 2021-08-18 16:38:22 +0000 UTC [ - ]

> Couldn't tools that read CSV files scan them first and see which variant best matches the file?

Sure, and they sometimes do that if they have to ingest CSVs whose origin they don't control (although not every system implementor cares enough to do it).

But that's still just a bunch of shitty fallible heuristics which would not be necessary if the format was not so horrible.

Macha 2021-08-18 16:40:01 +0000 UTC [ - ]

It also doesn't prevent a human or other system doing:

cat input1.csv input2.csv > output.csv

resulting in a single file containing multiple formats.

Also, what variant is this:

    1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?

Is this a CSV file without quoting? Then it's

    Here is a string "" that does stuff
Or is it a CSV file with double quote escaping? Then it's

    Here is a string " that does stuff
This is fundamentally undecidable without knowledge of what the format is.

You can decide to just assume RFC compliant CSVs in the event of ambiguity, but then you absolutely will get bugs from users with non-RFC compliant CSV files.
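
The same ambiguity shows up in Python's csv module if you feed a fully quoted variant of that line to readers that assume different dialects (just an illustration, not tied to any particular producer):

    import csv, io

    line = '1,5,"Here is a string "" that does stuff",2021-1-1'

    # A reader that assumes RFC 4180-style quoting collapses "" to a single quote:
    print(next(csv.reader(io.StringIO(line))))
    # ['1', '5', 'Here is a string " that does stuff', '2021-1-1']

    # A reader that assumes no quoting keeps every character literally:
    print(next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)))
    # ['1', '5', '"Here is a string "" that does stuff"', '2021-1-1']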

godshatter 2021-08-18 20:46:47 +0000 UTC [ - ]

That's true. You could scan the file and see if there are any other types of double quote escaping happening, but if there isn't any that wouldn't help either. It's also negated by the multiple formats in the same file point.

So, yeah. Can't really be done without making too many assumptions that will break later.

hermitcrab 2021-08-18 20:24:20 +0000 UTC [ - ]

>Couldn't tools that read CSV files scan them first and see which variant best matches the file?

Yes. And my software does that. But it is always going to be a guess which the user needs to be able to override.

Twisell 2021-08-18 16:33:21 +0000 UTC [ - ]

By this argument UTF8 can't exist. And yet here it is.

PS: I never said 100% forward/backward compatible with all variants at the same time and without any noticeable artifact. I meant compatible in a non-blocking way.

masklinn 2021-08-18 16:37:38 +0000 UTC [ - ]

What are you talking about? UTF8 is a single well-defined specification, and detecting that data is definitely not UTF8 is trivial.

Twisell 2021-08-18 16:43:22 +0000 UTC [ - ]

And yet it is forward/backward compatible with ASCII and non-blocking against all its ill-defined variants.

Macha 2021-08-18 16:49:26 +0000 UTC [ - ]

ASCII was well defined, CSV was not. Therefore they could take the highest bit, which they knew was unused per the ASCII spec, and use that to encode their extra UTF-8 information.

Also, UTF-8/ASCII compatibility is unidirectional. A tool that understands ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that tools that only understand ASCII won't mangle UTF-8 is limited - sure, dumb passthroughs are fine, but if it manipulates the text at all, then you're out of luck - what does it mean to uppercase the first byte of a flag emoji?

tsimionescu 2021-08-18 18:08:53 +0000 UTC [ - ]

To be fair, there is basically no way to manipulate arbitrary text at all without mangling it, UTF-8-aware or not. What does it mean to take the first 7 characters of a UTF-8 string which might contain combining characters and left-to-right special chars? What if the text uses special shaping chars, such as arranging hieroglyphs in cartouches? You basically need a text-rendering-aware library to manipulate arbitrary strings.

masklinn 2021-08-18 15:23:20 +0000 UTC [ - ]

Given that the infinity of existing CSV variants is the very reason why CSV blows goats, creating a spec compatible with that is

1. a fool's errand, CSV "variants" are not compatible with one another and regularly contradict one another (one needs not look any further than Excel's localised CSVs)

2. resulting in getting CSV anyway, which is a lot of effort to do nothing

dragonwriter 2021-08-18 16:11:52 +0000 UTC [ - ]

> If we manage to write a spec that meets this criteria, we'll have a powerful standard with easy adoption.

So, a binary format consisting of: (1) a text data segment, (2) an end-of-file character, (3) a second text data segment with structured metadata describing the layout of the first text data segment, which can range from something as simple (in terms of meaning; the structure should be more constrained for machine readability) as “It's some kind of CSV, yo!” to a description of specific CSV variations (headers? column data types? escaping mechanisms? etc.), or even specify that the main body is JSON, YAML, XML, etc. (which would probably often be detectable by inspection, but this removes any ambiguity).
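
A rough sketch of what writing such a file could look like (purely illustrative; the function name and metadata fields are invented, and the EOF-mark convention is this proposal's, not an existing standard):

    EOF_MARK = b"\x1a"  # the old DOS end-of-file character the proposal leans on

    def write_csv_with_trailer(path, csv_text, metadata_json):
        # Plain CSV first, so naive text readers see a normal file; the structured
        # metadata lives after the EOF mark where (ideally) they never look.
        with open(path, "wb") as f:
            f.write(csv_text.encode("utf-8"))
            f.write(EOF_MARK)
            f.write(metadata_json.encode("utf-8"))

    write_csv_with_trailer(
        "report.csv",
        "name,amount\nwidget,3\n",
        '{"variant": "rfc4180", "header": true, "delimiter": ","}',
    )

Whether existing readers actually stop at the EOF mark is the open question, as the replies below point out.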

Twisell 2021-08-18 16:22:14 +0000 UTC [ - ]

You got my vibe

Almost any current CSV parser, even the bad ones, tolerate a header line.

So it should be possible to define a compact and standardized syntax that is appended before the real header of the first cell (separator, encoding, decimal separator (often disregarded by most parsers but crucial outside the USA), quote character, escape character, etc...). Following headers would just use special notation to inform on (data type, length, comment).

Newest parsers would use these clues, older ones would just see some manageable junk appended to the headers.

Macha 2021-08-18 16:54:38 +0000 UTC [ - ]

So someone opens this CSV in Excel and there's garbage in A1?

Does this really count as compatible? You will get user bugs for this.

dragonwriter 2021-08-18 17:01:59 +0000 UTC [ - ]

> So someone opens this CSV in Excel and there's garbage in A1?

Yeah, that's why I chose the “thing that looks like a text file—including optionally CSV—but has additional metadata after the EOF mark” approach instead of stuffing additional metadata in the CSV; there's no way to guarantee that existing implementations will safely ignore any metadata added to the main CSV body. (My mechanism has some risk in that there are probably CSV readers that treat the file as a binary byte stream and use the file size rather than a text stream that ends at EOF, but I expect it's far fewer than will do the wrong thing with additional metadata before the first header.)

Macha 2021-08-18 17:16:52 +0000 UTC [ - ]

If by EOF char you mean Ctrl-Z, Python's `csv` module is at least one case where it will read past the EOF char and you'll get rows of garbage data for any content in the file after that.

sigzero 2021-08-19 00:34:50 +0000 UTC [ - ]

No, not binary.

ivanbakel 2021-08-18 15:26:10 +0000 UTC [ - ]

That's ignoring two big points in the OP: that CSV variants are ill-defined and conflicting; and that existing CSV data is too poorly-specified.

There is no reason to try to be "backwards-compatible" with existing CSV files - we don't have a single definition of correctness to use to check that the compatibility is correct. Every attempt to parse existing data would result in unexpected results or even data loss for some CSVs in the wild, because there is no way to reconcile all the different expectations and specifications that people have for their own CSV data.

wernercd 2021-08-18 16:02:30 +0000 UTC [ - ]

obligatory: https://xkcd.com/927/

sicariusnoctis 2021-08-18 21:56:37 +0000 UTC [ - ]

This is true. The best way would be if everyone generated CSV files that were compatible with one or two already existing CSV formats (e.g. no-quote-plain-old-csv or quotes-forced-csv). Hopefully, it would be easy to distinguish between the two in some well-defined manner. Maybe call it .csvs (CSV standardized). Or just introduce new extensions like .csvq ("CSV quoted").

Note the difference is that I am suggesting reducing the number of standards in-use by using only one already existing CSV format. :)

AtlasBarfed 2021-08-18 16:37:38 +0000 UTC [ - ]

There are mature CSV libraries for most major languages that handle 99% of the problems of CSV.

CSV should be better standardized, but ... whatever, what should be done to "fix" CSV is to advertise the proper use of the libraries and the nontrivial aspects of a superficially trivial format.

A format that is trivially useful in 99% of cases is far better than many other "worse is better" things in computing.

masklinn 2021-08-18 17:26:12 +0000 UTC [ - ]

> There are mature CSV libraries for most major languages that handle 99% of the problems of CSV.

They really don't. In fact I'd go further and confidently state that they really can't, because fixing tons of mis-parsed CSVs comes down to heuristic judgement calls, and those tools don't really have the ability to make those calls.

I've never seen a "mature CSV library for most major language" which'd guess encoding, separators, quoting/escaping, jaggedness, … to say nothing of being able to fix issues like mojibake.

burntsushi 2021-08-18 22:20:51 +0000 UTC [ - ]

No, they really do. But 99% isn't 100%. Compare Python's csv parser with Go's. The former will prefer a parse for any input, and this is done in a fairly consistent way that at least gives one the opportunity to deal with "malformed" csv. The latter is a strict parser and coughs up an error upon almost any deviation from RFC 4180. I couldn't use the latter in practice because of it. It would just choke on too many csv files I had.

masklinn 2021-08-19 06:00:02 +0000 UTC [ - ]

> No, they really do. But 99% isn't 100%.

Neither is 0. Python’s csv library does not handle any problem, it applies the separators / quoting / escapes it was given and that’s that. It gives the developer the opportunity to handle things, but if the developer isn’t aware nothing is handled.

burntsushi 2021-08-19 16:56:24 +0000 UTC [ - ]

If you're just going to quote me out of context and not bother responding to the actual substance of my comment (which was a high level but concrete comparison of two different CSV parsers, demonstrating an important qualitative difference with respect to csv parsers solving problems), then please, don't bother responding.

adgjlsfhk1 2021-08-18 17:09:43 +0000 UTC [ - ]

The problem is that even the fastest CSV readers are 10-100x slower than readers of better defined binary formats (like Arrow)

joelbondurant 2021-08-19 02:53:49 +0000 UTC [ - ]

I would add 4) editable in vim.

dragontamer 2021-08-18 15:51:18 +0000 UTC [ - ]

> A truly open format is available and accessible.

Sqlite?

> Applications have a speed increase from using csvs.

Sqlite?

> The new format should not be grossly larger than the one it is replacing

Sqlite it is.

--------

Oh, you mean something that Excel can open? Oh yeah, I guess CSV then. But let's not pretend #1 (openness), #2 (speed), and #3 (size) are the issues.

ARandomerDude 2021-08-18 15:57:40 +0000 UTC [ - ]

I'd add the constraint that it needs to be understandable by most people. A typical non-tech office worker can use CSV without any trouble. Sqlite has an extremely steep learning curve for someone unfamiliar with databases.

dragontamer 2021-08-18 16:06:03 +0000 UTC [ - ]

Well yeah, that's kind of my point though.

We programmers get stuck into thinking that openness, speed, or other performance characteristics matter. But CSVs prove otherwise.

The reason why CSVs win in many situations is

1. Because they're convenient to edit.

2. Because they work in Microsoft Excel

CSVs have a myriad of issues: too many to list. But that doesn't matter, because convenience + Excel compatibility are king in practice.

To muse about other benefits is ignoring the core reason why CSV is used.

salamandersauce 2021-08-18 16:03:45 +0000 UTC [ - ]

Exactly. Worst case a CSV can be edited and viewed or even created in MS Notepad or any other text editor which is pretty much guaranteed to be on a system. Something to deal with Sqlite? No such luck and the barrier is much higher.

masklinn 2021-08-18 16:35:19 +0000 UTC [ - ]

> Worst case a CSV can be edited and viewed or even created in MS Notepad or any other text editor which is pretty much guaranteed to be on a system.

Good luck getting that garbage to be ingestible by anything. Odds are the system you're trying to shove it into will start by choking on the UTF8 BOM notepad insists on, then it will choke on the record separators, then on the incorrect escaping (or quoting).

> Something to deal with Sqlite? No such luck and the barrier is much higher.

SQLite is already present in most if not all environments (usually multiple times too, you have at least one instance of sqlite per browser on your machine, and probably a system one as well, and a few others), so it's really at most a matter of providing SQLite clients by default.

sigzero 2021-08-19 00:37:11 +0000 UTC [ - ]

SQLite is not the answer. At all.

js8 2021-08-18 16:45:59 +0000 UTC [ - ]

If you have this problem, maybe you just need to go to https://sqlitebrowser.org/

radmuzom 2021-08-18 15:14:29 +0000 UTC [ - ]

Every few years an article like this pops up. I find it tiring - because they are primarily from a software engineer's viewpoint who is probably trying to write a parser and needs to handle the edge cases. As a data scientist, I receive and process around 75GB of CSV every day - of course I don't process it manually. Our processes have been running a few years now and millions of dollars of revenue rides on it. I don't see any need to retire anything, and am pretty sure our processes will be working and running 10 years from now. The only thing which will change are some statistical models which make use of this data.

shadowwolf007 2021-08-18 17:19:21 +0000 UTC [ - ]

Yeah - I used to lead a department that would process somewhere around 10TB of CSV formatted data per day.

The edge cases are a hassle but they don't become less of a hassle from a business perspective by switching to json or really any other format. We tried an experiment of using more json and eventually gave it up because it wasn't saving any time at a holistic level because the "data schema" conversations massively dominated the entirety of the development and testing time.

Obviously being able to jam out some json helped quite a bit initially, but then on the QA side we started to run in to problems with tooling not really being designed to handle massive json files. Basically, when something was invalid (such as the first time we encountered an invalid quote) it was not enjoyable to figure out where that was in a 15GB file.

That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really really sucks. But, to solve that, we would output to a more columnar storage format like Parquet or something. This would let us fully encode and manage the data how we wanted while letting our clients continue working their processes.

What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all.
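
Parquet (mentioned above) gets partway there, since column types live in the file metadata and are enforced at write time; a small sketch assuming pyarrow is installed (the file and column names are hypothetical):

    import pyarrow.parquet as pq

    # The column types live in the file footer, so no row data needs to be scanned.
    schema = pq.read_schema("events.parquet")   # hypothetical file
    print(schema.field("user_id").type)         # e.g. int64, guaranteed at write time

It only guarantees declared types, though, not richer constraints like ranges or enums.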

anigbrowl 2021-08-18 22:27:47 +0000 UTC [ - ]

Really appreciate the insight from you and the GP here. I have been struggling with data format decisions around a personal project that will only be used by a few people, being unsure about the extent I should try to make it bulletproof (but harder to maintain and modify) or just keep it simple (but primitive). It's helpful to see an experienced professional perspective showing that you can fall into a tooling rabbit hole at any scale.

breck 2021-08-18 20:38:14 +0000 UTC [ - ]

> "data schema" conversations massively dominated the entirety of the development and testing time.

Agreed. JSON lets me know something is a number. That's great, but I still have to check for min/max, zero, etc. A string? That's great, but I've got to check it against a set of enums, and so forth. Basically, the "types" JSON gives you are about 20% of the work, and you're going to have to parse things into your own types anyway.

> What I would really like to see is a file format where the validity of the file could be established by only using the header.

Are you saying something like a checksum so not only is a schema provided but some method to verify that the data obeys the schema?

If you're talking about just some stronger shared ontology, I think that's a direction things will go. I call this concept "Type the world" or "World Wide Types". I'm starting to think something like GPT-N will be the primary author, rather than a committee of humans like Schema.org.

shadowwolf007 2021-08-19 01:12:12 +0000 UTC [ - ]

Honestly with the schema thing I'd probably be fine with either/or!

A checksum would be crude and user-hostile, only being able to say "you did it wrong" but not really good at tell you what it means to do it right.

If I understand the concepts correctly then it seems like a shared ontology could potentially solve the problem in a non-hostile way.

Plus, it makes me happy because I feel like types are a real-world problem, so it is always nice if the type system could enforce that real-world-ness and all the messiness that comes along for the ride.

radus 2021-08-18 21:10:34 +0000 UTC [ - ]

Would DuckDB (https://duckdb.org/) work as your file format with enforced column types?

shadowwolf007 2021-08-19 01:20:07 +0000 UTC [ - ]

We looked at it and there were a few problems with where it would force us to put VMs, which we just weren't super comfortable with due to the in-process-ness.

More a byproduct of decisions made 5 - 7 years ago when the company was in raw startup mode versus a more mature roadmap.

smhenderson 2021-08-18 16:35:34 +0000 UTC [ - ]

That's pretty much how I feel as well. People that complain about all the variations in CSV "formats" out there aren't wrong but in practice, for me anyway, most of the files I work with in any given situation come from the same source or a small set of sources and use the same variant.

If/when I get a data source introduced into my workflow that differs from this variant I come up with a routine to normalize it, integrate that into my workflow and move on.

vertere 2021-08-19 09:53:19 +0000 UTC [ - ]

Pretty much the opposite here. Dealing with lots of different formats from different sources, often once off. There's all kinds of crazy things people will call "data" (CSV 'variants' are the tip of the iceberg).

okareaman 2021-08-18 16:43:12 +0000 UTC [ - ]

Programmers are always trying to solve general problems and forget that often a specific solution is the right solution.

turtlebits 2021-08-18 15:58:45 +0000 UTC [ - ]

It'll work until someone upstream upgrades their CSV writing library and your process breaks. But it's still writing CSV.

nightski 2021-08-18 16:06:47 +0000 UTC [ - ]

So you can constrain what type of CSV you will allow and if this happens it will bail. It's that simple. There is nothing wrong with having additional constraints on top of just saying it must be "CSV" especially in these scenarios.

I'm in a similar situation, we've been using CSV for over a decade to move billions of dollars worth of product each year. It just works.

turtlebits 2021-08-18 18:59:43 +0000 UTC [ - ]

I'm pretty sure most devs are going to use whatever CSV library that comes with their language. When that breaks, it's generally not a simple fix.

kcartlidge 2021-08-18 20:08:45 +0000 UTC [ - ]

> I'm pretty sure most devs are going to use whatever CSV library that comes with their language. When that breaks, it's generally not a simple fix.

Call me a yak-shaver, but in every language I've worked with I've written my own csv parsing library when I needed one.

It's such a trivial thing for the majority of cases (varying of delimiters, line-endings, ASCII/UTF-8, quoting, escaping, and embedding of delimiters/line-endings) that it takes barely any time at all after you've done it once in another language. Of course there are edge cases and special cases depending upon specific workloads, but if your team has its own parser (which is a small amount of obvious code) then it does indeed usually become a simple fix.

Sounds good using someone else's library, but below a certain complexity it's rarely worth it in the medium to long term except for when doing proof of concept or demo code, or if the domain space is complex.

magicalhippo 2021-08-18 21:45:29 +0000 UTC [ - ]

We've got at least a few dozen customer integrations that parse CSV-ish files, and they all have a custom parser. Many of these have been chugging for over a decade, sending "mission critical" data back and forth.

It's dead simple to whip up, and we can easily tweak it to whatever the customers software spits out, like one field suddenly being UTF-8 encoded in an otherwise Windows-1252 file.

proverbialbunny 2021-08-18 21:22:42 +0000 UTC [ - ]

The same argument could be made for all other data formats.

CSV is like the C of data formats. It's incredibly stable yet simple enough you can make your own variant if you need to.

nojito 2021-08-18 16:27:19 +0000 UTC [ - ]

That’s a good thing.

We validate on ingestion and if there are changes upstream we can immediately triage without polluting our data warehouse.

gpvos 2021-08-18 18:28:31 +0000 UTC [ - ]

Indeed, why try to write a parser? There are dozens already for every programming language you can think of. Yours is likely to be worse.

systemvoltage 2021-08-18 17:17:47 +0000 UTC [ - ]

100%. Don’t fix if it ain’t broken. This is the pragmatic approach that often comes across to new devs as “unsexy”. I got some news for “unsexy” software - it works, and it brings the revenue.

izietto 2021-08-18 15:05:46 +0000 UTC [ - ]

"In favour of what?", that is the matter.

CSV is a format more for humans and less for machines, but that is the use case: a format that is good enough to be compiled by humans and read by machines. At the moment there aren't many alternatives.

masklinn 2021-08-18 15:11:46 +0000 UTC [ - ]

And CSV can be consumed and produced by spreadsheet software, which is useful to critical in many contexts where CSV is part of the pipeline. There the alternative would be to use xlsx files which… isn't necessarily any better.

Until someone gets excel to ingest and produce something in a better format, we're pretty much stuck.

Sanzig 2021-08-18 15:16:58 +0000 UTC [ - ]

As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop. I definitely can't do that with .xlsx. There is almost zero friction to bolting CSV export capability to an existing system, which is part of why it's so popular.

turtlebits 2021-08-18 16:02:38 +0000 UTC [ - ]

You can write what "looks" like CSV to you, but there are no guarantees it will import correctly.

The problem is 10x worse when you get CSV from one source and rely on another process to load it. I fought this problem for several days going from NetSuite to Snowflake via CSV.

asdff 2021-08-18 18:03:50 +0000 UTC [ - ]

Can you give an example? The rules for CSV files are so simple I'm struggling to imagine a case where something looks correct but in fact isn't correct.

turtlebits 2021-08-18 19:04:16 +0000 UTC [ - ]

Non standard delimiters. Escaping delimiters in fields - sometimes with a \, sometimes doubled (""), sometimes not at all. Double new lines.

Poor handling from standard CSV libraries. Either unable to read or unable to create for some downstream process.

rendall 2021-08-18 20:49:31 +0000 UTC [ - ]

That sounds like the problem of badly formatted CSV, not a problem with CSV per se.

If you stick to one delimiter, and that delimiter is a comma, and escape the delimiter in the data with double-quotes around the entry, and escape double quotes with two double-quotes, well, you have written CSV that is correct and looks correct and will be parsed correctly by literally every CSV parser.
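
Those rules fit in a few lines; a minimal writer-side sketch (the function name is mine, not from any library):

    def csv_field(value: str) -> str:
        # RFC 4180-style: quote the field if it contains a comma, quote, CR or LF,
        # and double any embedded quotes.
        if any(c in value for c in ',"\r\n'):
            return '"' + value.replace('"', '""') + '"'
        return value

    row = ["Smith, John", 'a"b', "plain"]
    print(",".join(csv_field(f) for f in row))
    # "Smith, John","a""b",plain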

adzm 2021-08-19 04:14:13 +0000 UTC [ - ]

Pretty much the rfc. https://datatracker.ietf.org/doc/html/rfc4180

Parsers are trickier if you want to be lenient, but exporters are dead simple.

masklinn 2021-08-19 06:04:13 +0000 UTC [ - ]

> That sounds like the problem of badly formatted CSV

That’s what CSV is. That’s what happens when you ingest CSVs whose production you don’t control.

> If you [ignore everything people literally clamour for in these comments and praise csv for]

Yes i also like ponies.

rendall 2021-08-19 09:49:25 +0000 UTC [ - ]

> "That’s what CSV is."

That's really not a serious argument against CSV. Since you paraphrase in a silly way, I can do it too! Your "argument" is "Badly formatted files exist, therefore CSV bad".

Everyone "against CSV" seems to be arguing against badly formatted CSV, and leaping to the conclusion that "CSV is just bad" without much more to say about it. I'm sorry that badly formatted CSV gave you a bad time, but the format is fine and gets its job done.

"It doesn't have x, y or z feature therefore no one should be using it ever" is kind of a dumb argument, honestly.

anigbrowl 2021-08-18 22:31:40 +0000 UTC [ - ]

Me, a naive idiot: CSV is simple I will write my own exporter because I am clever

Me, 20 minutes later: Heh that was easy I am a genius

Me, 21 minutes later: Unicode is ruining my life T_T

Don't get me wrong, I really like CSV because it's so primitive and works so well if you are disciplined about it. But it's easy to get something working on a small dataset and forget all the other possibilities only to faceplant as soon as you step outside your front door. In the case above my experience with dealing with CSV data from other people made me arrogant, when I should have just taken a few minutes to learn my way around a mature library.

crazygringo 2021-08-19 05:01:17 +0000 UTC [ - ]

How does Unicode present a problem?

In UTF-8, the bytes for a comma and a quote only exist as those characters. They don't exist as parts of multibyte sequences, by design.

If you have Unicode problems, then you have Unicode problems, but they wouldn't seem to be CSV problems...? Unless you're being incredibly sloppy in your programming and outputting double-byte UTF-16 strings surrounded by single-byte commas and quotes or something...?

selfhoster11 2021-08-19 01:01:21 +0000 UTC [ - ]

Lots of edge cases that aren't always handled to spec on both sides of the import/export. That's my experience, at least.

jacobsenscott 2021-08-19 00:01:48 +0000 UTC [ - ]

  name,position
  "Smith, John"‚Manager

woodrowbarlow 2021-08-18 18:26:20 +0000 UTC [ - ]

how're,you,handling,quotes?'

astine 2021-08-18 20:53:14 +0000 UTC [ - ]

If you're manually generating your own CSV files, you probably know what kind of data you are generating and consequently whether your data is going to contain commas. If commas and newlines don't exist in your data, then you can safely ignore quoting rules when generating CSV files. I know that I've generated CSVs in the past and rather than figuring out the correct way to quote the strings, I just removed any inconvenient characters without any loss to the data at all. Obviously this is not "correct" but you don't have to implement cases if you know they won't show up.

politician 2021-08-19 00:28:45 +0000 UTC [ - ]

You use the right ANSI characters - the record and field separators (30,31) - and avoid hacks like comma, pipe, and newline.

masklinn 2021-08-18 16:24:43 +0000 UTC [ - ]

> I fought this problem for several days going from NetSuite to Snowflake via CSV.

Yeah if you see a CSV import feature without a billion knobs you know you're in for a world of hurt.

If you see a CSV import feature with a billion knobs, you're probably still in a world of hurt.

romwell 2021-08-18 23:14:30 +0000 UTC [ - ]

> but there are no guarantees it will import correctly.

What do you mean, there are "no guarantees"? You are in charge! You know what data you're dumping, you can see if it imports well. You can tailor your use case.

That's not the same as getting a CSV from some dump, where you have limited (if any) control over the behavior.

SpicyLemonZest 2021-08-18 17:27:40 +0000 UTC [ - ]

This is true, but a lot of data processing takes place in a context where frictionless export functionality is more important than a 100% guarantee of import compatibility. I'd rather ingest city = ",CHANGSHA,HUNAN" (real example!) than ingest nothing at all because my vendor doesn't have time to integrate a JSON serializer.

masklinn 2021-08-18 15:19:34 +0000 UTC [ - ]

> As I mentioned down-thread, I can generate a CSV with a couple of fprintf statements and a loop.

And usually generate garbage for anything but the most trivial case, which really nobody gives a shit about. That's the main reason why CSV absolutely sucks too: you have to waste months diagnosing the broken shit you're given to implement the workarounds necessary to deal with it.

> I definitely can't do that with .xlsx.

You probably can though. An xlsx file is just a bunch of XML files in a zip.

Sanzig 2021-08-18 15:27:21 +0000 UTC [ - ]

Define "garbage." If I know what my data looks like, I can anticipate the edge cases ahead of time. Plenty of CSV exports work this way, they don't need to be general if the schema is already imposed by the system.

Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.

I know this is HN and we like to pretend we're all data scientists working on clusters with eleventy billion gigs of RAM, but us embedded systems folks exist too.

masklinn 2021-08-18 15:42:57 +0000 UTC [ - ]

> Define "garbage."

Incorrect encoding, incorrect separators (record and field both), incorrect escaping / quoting, etc…

> If I know what my data looks like

If you control the entirety of the pipeline, the format you're using is basically irrelevant. You can pick whatever you want and call it however you want.

> Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.

You can pretty literally do that with XML and zip files: write the uncompressed data, keep track of the amount of data (for the bits which are not fixed-size), write the file header, done. You just need to keep track of your file sizes and offsets in order to write the central directory. And the reality is, if you're replacing a CSV file the only dynamic part will be the one worksheet; everything else will be constant.

theamk 2021-08-19 00:59:23 +0000 UTC [ - ]

> If you control the entirety of the pipeline, the format you're using is basically irrelevant.

I think you are missing the point -- you only need to know about the generator to know about the format.

Since the parent poster was talking embedded, here is one example: a data logger on a tiny embedded system records tuples of (elapsed-time, voltage, current). You need this to be readable in the widest variety of programs possible. What format do you use?

I think the answer is pretty clear: CSV. It is compatible with every programming language and spreadsheet out there, and in a pinch, you can even open it in text editor and manually examine the data.

Using something like XLSX here would be total craziness: it will make code significantly bigger, and it will severely decrease compatibility.
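
For that kind of constrained, fully controlled output, the writer really can be a handful of prints; a minimal sketch (the readings are made up):

    readings = [(0.0, 3.31, 0.012), (1.0, 3.30, 0.013)]  # (elapsed-time, voltage, current)

    print("elapsed_s,voltage_v,current_a")
    for t, v, i in readings:
        print(f"{t},{v},{i}")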

pdonis 2021-08-18 15:22:53 +0000 UTC [ - ]

> An xlsx file is just a bunch of XML files in a zip.

A bunch of XML files with opaque formats that MS constantly makes changes to to make its competitors have to keep chasing the format.

masklinn 2021-08-18 15:25:10 +0000 UTC [ - ]

Even if it were true it wouldn't matter a whit to the production side of the format, which is what "produce CSVs using fprintf" is: excel can consume them all.

megous 2021-08-18 16:22:26 +0000 UTC [ - ]

As someone who tried, Excel's handling of CSV files was the reason to abandon the idea and generate XLSX.

Libreoffice handles normal UTF-8 encoded, quoted value CSV files fine. Excel not so much.

masklinn 2021-08-18 16:23:49 +0000 UTC [ - ]

Did you misread / misunderstand my comment somehow? Hint: my comment is not about generating CSV for excel to consume.

marcosdumay 2021-08-18 16:00:32 +0000 UTC [ - ]

As soon as you open it in Excel, it's garbage anyway, since it will replace date-like items with nonsense, drop number digits, convert anything it can, reencode monetary units, and so on.

If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.

But neither is going anywhere anyway.

masklinn 2021-08-18 16:43:24 +0000 UTC [ - ]

> If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.

No, you can not. Because the CSV format is so fuzzy you can very easily parse incorrectly and end up with a valid parse full of garbage.

Trivially: incorrect separator, file happens to not contain that separator at all, you end up with a single column. That's a completely valid file, and might even make sense for the system. Also trivially: incorrect encoding, anything ascii-compatible will parse fine as iso-8859-*. Also trivially: incorrect quoting / escaping, might not break the parse, will likely corrupt the data (because you will not be stripping the quotes or applying the escapes and will store them instead).

It's like you people have never had to write ingestion pipelines for CSVs coming from randos.

romwell 2021-08-18 23:17:53 +0000 UTC [ - ]

>It's like you people have never had to write ingestion pipelines for CSVs coming from randos.

That's because this is not what this thread is about.

The comment you're responding to is not about CSVs coming from "randos". It's for the case where that rando is you, so you can make sure the problems you mention don't happen on the generation side of CSVs.

ribosometronome 2021-08-18 20:48:27 +0000 UTC [ - ]

Why is the solution to create an entirely new format rather than try to more rigidly enforce a single CSV standard?

masklinn 2021-08-19 06:09:41 +0000 UTC [ - ]

Because there is no way to perform that enforcement, because you say “CSV” and people understand “my old garbage” and you can’t fix that; and because millions of incorrect documents will yield a valid but nonsensical parse.

theamk 2021-08-19 00:43:10 +0000 UTC [ - ]

I have a script which generates a CSV file using a bunch of print statements. The columns are hostnames and some numbers, so it is never going to contain commas or newlines. This will be perfectly valid CSV every time.

That’s why CSV is absolutely beautiful - there is a huge number of applications that people really care about, and their data is constrained enough that there is no need to care about CSV escaping or for any third-party libraries.

Creating an XLSX file by hand is possible, but it would be a large amount of code and I wouldn’t include it in my script; it would need to be a separate library - which means build system support, learning the API, etc...

yrds96 2021-08-18 15:33:32 +0000 UTC [ - ]

People don't care which format it is, as long as it simply works in any spreadsheet software.

Yeah, I can generate any file with a bunch of printf, but with CSV I don't have to read a specification; it's possible to read it with a bunch of reads without having to use an XML or XLSX library.

masklinn 2021-08-18 15:46:14 +0000 UTC [ - ]

> Yeah, I can generate any file with a bunch of printf, but with CSV I don't have to read a specification; it's possible to read it with a bunch of reads

The only thing funnier than producing broken CSV is consuming broken CSV.

slunk 2021-08-18 16:11:02 +0000 UTC [ - ]

> That's the main reason why CSV absolutely sucks too [...]

Is it? I think you're absolutely right that naive points of view like the one you're responding to will lead to avoidable bugs, but I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.

masklinn 2021-08-18 16:23:20 +0000 UTC [ - ]

> I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.

The simplicity of CSV is what tells people that they can parse and generate it without a library, and even more so that that's a feature of CSV. You just had to read the comments disagreeing with me to see exactly that.

acwan93 2021-08-18 15:17:39 +0000 UTC [ - ]

Also, "will the other side adopt it?"

After working with so many retailers and online sales channels, things that are considered "legacy" or "outdated" by the HN crowd don't seem like they will go away unless both sides make a change. There are numerous articles posted on HN about how "FTP is dead" or no one uses it anymore, when that's far from the case.

Even Amazon's marketplace and vendor files are still using SFTP and EDI files. They've recently made changes, but it's been slow and hasn't had widespread adoption.

There's also the universality and "simplicity" CSV provides to the non-computer-literate, and convincing them to move to a new standard presents some non-technical challenges of its own. CSV is a bad standard, but it's the best one given what it does and its flexibility.

danellis 2021-08-19 04:30:28 +0000 UTC [ - ]

What are all these hypothetical humans doing with CSV files? I've worked with them a lot, and the only time I read one manually is to find out why it isn't parsing properly. I'm not writing them by hand. I'm not extracting data from them by hand. What are other people doing?

crazygringo 2021-08-19 04:58:24 +0000 UTC [ - ]

Mainly Ctrl+F and diffs, in my case.

Sometimes you need to look up a single code in a 50 MB file. And sometimes you need a quick check to see whether one line or a million lines changed.

It's "exception not the rule" type stuff... but it sure comes in handy to be able to check this stuff quick with basic text tools than have to run it through some binary parser. Same as JSON. But unlike protobufs for example.

danellis 2021-08-19 07:27:59 +0000 UTC [ - ]

Interesting. I would reach for `grep` and `diff` in those cases, so it's not much of a leap to imagine similarly using a tool for a binary format. After all, I only need the textual representation of the thing I'm searching for, not the entire dataset.

crazygringo 2021-08-19 16:36:44 +0000 UTC [ - ]

Sure, but those specific tools need to be written for each specific binary format.

And honestly they usually simply don't exist. In the vast majority of cases, in my experience.

grenoire 2021-08-18 15:09:46 +0000 UTC [ - ]

I think this issue can be bypassed if a better format such as Arrow or Parquet can be used in Excel.

xtracto 2021-08-18 15:25:43 +0000 UTC [ - ]

Interesting that the two main "alternatives" mentioned here are columnar data storages. The problem I see with both Arrow and Parquet is that they are binary-based files. Sure, binary files can be VERY efficient. But the "magic" behind files like CSV, JSON or YAML is that they are "sufficiently" both human- and machine-readable.

Maybe a CSV killer would be a human readable columnar based file format.

Nevertheless, the article basically discusses the issues encountered with the manual "editability" of CSV files, not so much with its performance. It also mentions Parquet and Arrow and concedes that they require specialized tools to read/write. If we are going down that path, then there are a lot of options, such as the SQLite format or BerkeleyDB (used by some cryptocurrency projects), among plenty of others.

cryptonym 2021-08-18 15:15:54 +0000 UTC [ - ]

That doesn't qualify as "good enough to be compiled by humans". We need something dumb and easy you can author from scratch with any text editor, and read/understand with just `cat`.

masklinn 2021-08-18 15:13:17 +0000 UTC [ - ]

Ideally excel would just speak sqlite. Not import/export or connect via odbc, straight be an sqlite client.

pdonis 2021-08-18 15:21:12 +0000 UTC [ - ]

CSV isn't supported in Excel because Microsoft prefers it; MS prefers its own Excel format. Excel supports CSV because a large enough majority of Excel users want it, which means MS includes support for it. For MS to remove CSV support in favor of Arrow or Parquet support would not benefit MS at all, and would make many of its customers push back, so I don't see it happening.

mr_toad 2021-08-19 01:06:21 +0000 UTC [ - ]

Even if they did support Parquet they’d use the same embrace and extend mechanisms they do with CSV files.

habitue 2021-08-18 15:09:45 +0000 UTC [ - ]

Objectively, CSV is terrible for humans despite being a plaintext format. No one reads CSVs: they're incomprehensible since the columns are not aligned with the headings. (You might be drawing an analogy with JSON, which is often human readable because it puts the keys right there next to the values).

The best that can be said for its simplicity is that it's easy to write code that can dump data out in CSV format (and to a lesser extent, it's easy to parse, though watch out for those variants). This is not a really strong argument: most everyone is going to use a library for serialization, and there's no reason to write your own unless it's for learning.

fighterpilot 2021-08-18 15:15:00 +0000 UTC [ - ]

I'm a human and I love CSVs. What other format can I open in a spreadsheet, access through cat/grep/awk, and easily load into any programming language? Any other format has to sacrifice one of these three things, and that's bad. It's this trifecta that makes it so versatile and human friendly.

xtracto 2021-08-18 15:26:49 +0000 UTC [ - ]

> access through cat/grep/awk, and easily load into any programming language

Until the CSV fields contain commas themselves. Even if fields are surrounded by "".

megous 2021-08-18 16:24:31 +0000 UTC [ - ]

Then you just switch to a 3-line php script (fgetcsv() in a loop) and continue on... :)
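
That comment names PHP's fgetcsv(); the equivalent quote-aware loop in Python is about as short (the file name is a placeholder):

    import csv

    with open("export.csv", newline="") as f:   # newline="" lets the csv module handle embedded newlines
        for row in csv.reader(f):               # quoted commas and doubled quotes are handled for you
            print(row)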

the_only_law 2021-08-18 20:49:20 +0000 UTC [ - ]

I recall one of my old phones could dump SMS messages as CSV. I wanted to build an interface that would let me search through my messages.

Turns out when you have data that contains both commas and quotes, things get screwy real quick. You could have quoted data like:

123,ABC,”,””,456

Where ,” is column data.

I think the standard method is to double-quote the field, but the dump sure wasn’t doing that for me; plus, what happens when the data is something like abc””,?

rendall 2021-08-18 21:15:33 +0000 UTC [ - ]

> 123,ABC,”,””,456

That is badly formatted CSV, assuming ” is a double-quote. Double-quote (") is how you escape commas, so the 3rd entry will make a parser barf, since there is no close quote.

It should be 123,ABC,"","",456

> Turns out when you have data that contains both commas and quotes things get screwey real quick.

Not really. It's pretty logical. If the entry has even one comma, put quotes around it. If it also has quotes, double each of them. Otherwise don't worry about it.

> what happens when the data is something like abc””,?

abc"", becomes "abc"""","

breck 2021-08-18 20:41:48 +0000 UTC [ - ]

Some people like me never generate CSVs like that. If someone sends me a CSV with quotes, I reject it and say "clean up your data". It's okay to say, "we're not going to allow escape characters in our CSV grammar. We're going to have a higher standard"

danellis 2021-08-19 04:33:47 +0000 UTC [ - ]

Not everyone has the option of rejecting it, though. Sometimes you need that data and have no control over how it's being generated.

madaxe_again 2021-08-18 15:18:36 +0000 UTC [ - ]

Developer friendly. Not human friendly. I used to receive CSVs of product data from clients, which they often handcrafted or manipulated by hand, and inevitably, and I mean inevitably, broke.

Of course, excel will make just as much of a mess - when the client hands back their sheet with all of their UPCs expressed as exponents and æ€ ligatures jammed next to every apostrophe, there’s no guessing as to what happened.

In theory, it’s a great and simple format.

In practice, it causes much pain.

bobthepanda 2021-08-18 20:28:30 +0000 UTC [ - ]

Anything written by hand will be screwed up. Chances are if they're writing it by hand and it's badly formatted, they're not using something that has input or output validation anyways, and they're probably not going to budge from whatever workflow they're using.

asdff 2021-08-18 18:06:16 +0000 UTC [ - ]

Honestly though, if people are doing that with CSVs there is no system where they won't also go off the reservation and do something insane with an excel file or sqlite or whatever. At least with CSVs there are common lines of code all around the internet to wrangle all sorts of issues.

rutthenut 2021-08-18 15:21:38 +0000 UTC [ - ]

Surprised you would think that JSON is more human-readable, as its layout makes it visually quite appalling - not so bad when run through a pretty-printer utility, admittedly, but you may as well go for XML as another use case

anigbrowl 2021-08-18 22:55:34 +0000 UTC [ - ]

For ages I would just dump json files in a browser and let it auto-pretty print it for a first look.

A while back I discovered https://dadroit.com/ which is small (in feature set) but perfectly formed (nicely designed and extremely fast, even on huge files).

godshatter 2021-08-18 16:55:14 +0000 UTC [ - ]

I try to use tab-delimited files when writing out columnar text to be loaded into something else (assuming it can parse them). It's even easier to read since you don't have extraneous commas and double quotes everywhere and most viewers will line things up to some degree via the tabs. Tabs are less common in data fields (at least when compared to double quotes and commas), at least in my experience.

izietto 2021-08-18 15:18:11 +0000 UTC [ - ]

I like CSV instead, it's pretty intuitive to me to provide a list of records separating each column by comma and each record by newline.

asdff 2021-08-18 18:08:42 +0000 UTC [ - ]

The columns are aligned with the headings if you use the tool for the job, like R or python or awk or insert favorite cli parsing method here.

wutwutwutwut 2021-08-18 15:16:09 +0000 UTC [ - ]

Which format would you propose if I am writing a .NET tool to produce a file to be consumed by Excel?

habitue 2021-08-18 15:20:01 +0000 UTC [ - ]

I would expect .NET to have excellent .xlsx capabilities? Most libraries have an easy way to write those from tabular data.

wutwutwutwut 2021-08-19 15:51:58 +0000 UTC [ - ]

Why would you expect that?

psanford 2021-08-18 15:52:22 +0000 UTC [ - ]

CSV files are terrible, but I love them. I love sites that offer an "Export to CSV" option, because I know I can take that export and start working with it immediately. I can give that CSV file to my Dad, who can open it in Excel, or I can run a single command[0] to import it into a sqlite database.

It is a lowest common denominator format. That type of thing is incredibly hard to kill unless you can replace it with something that is simpler. Good luck with that.

[0]: https://github.com/psanford/csv2sqlite

lmilcin 2021-08-18 18:11:25 +0000 UTC [ - ]

I call this "criticizing something without understanding why it succeeded in the first place". Also "writing blog posts while blind to other users' use cases".

Do you really think there were no other "portable" formats to exchange data and SQLite is the first one?

The reason CSV is so popular has nothing to do with technical superiority which it obviously lacks.

The reason CSV is so popular is because it is dead simple and extremely easy to integrate within practically any conceivable workflow.

And because it is a text format which you can trust, even if you have no other tools, you can inspect and edit it in a text editor, which is exactly the reason why other formats like INI, XML, JSON or YAML are so popular.

How would you:

a) Run a shell command on a locked-down (can't install anything) PROD server to find the files and write a report of the file location, date of change and file size so that it can be easily processed by something else? Return the data as a stream over the same SSH connection?

b) Quickly add a line to an existing startup script to get a report of when the script is starting and stopping so you can import the data to Excel. You just need it for a couple of days, then you delete it.

c) Produce a report you can send to your coworker knowing they don't know how to program? And you don't want to spend time explaining to them how to make use of your superior file format?

d) You talk to another team. You need to propose a format and specify a report with as little effort as possible. You know they are not advanced technically so you keep things simple?

e) You are solving an emergency and have to cross-check some data with another team. You need to produce a file that they will be able to read and process. You are pressed for time, so you don't want to start a committee right now to figure out a shared technology. What would be a safe choice without asking the other team for options?

f) A manager asked you for some data. You want to earn quick kudos. How will you send the data to get your kudos rather than irritating him?

These are all real life situations where CSV is being used.

As much as I understand technical arguments for SQLite, not every person is a developer. And all that technical superiority is worth nothing when they struggle to make use of the file you sent them.

I would say, the news of CSV's death are greatly exaggerated.

qwerty456127 2021-08-18 15:59:50 +0000 UTC [ - ]

> It's Time to Retire the CSV

> This column obviously contains dates, but which dates? Most of the world

It's time to retire locale-specific date formats and always write YYYY-MM-DD (which is both the international and the Swedish standard, and the most convenient for parsing and sorting).

> A third major piece of metadata missing from CSVs is information about the file’s character encoding.

It's bloody time to retire all the character encodings and always use UTF-8 (and update all the standards like ISO, RFC etc. to require UTF-8). The last time I checked, common e-mail clients like Thunderbird and Outlook created new e-mails in ANSI/ISO codepages by default (although they are perfectly capable of using UTF-8) - this infuriated me.

> If not CSV, then what? ... HDF5

Indeed! Since the moment I discovered HDF5 I have wondered why it is not the default format for spreadsheet apps. It could just store the data, the metadata, the formulae, the formatting details and the file-level properties in different dimensions of its structure to make a perfect spreadsheet file. Nevertheless, spreadsheet apps like LibreOffice Calc and MS Excel don't even let you import from HDF5.

> An enormous amount of structured information is stored in SQLite databases

Yet still very underused. It ought to be more popular. In fact, every time I get CSV data I import it into SQLite to store and process it, but most people (non-developers) have never heard of it. IMHO it also begs to be supported (for easy import and export at least) by the spreadsheet apps. A caveat here is that it still uses strings to store dates, so the dates can still be in any imaginable format. Fortunately, most developers conventionally use a variation of ISO 8601.

And by the way, almost every application-specific file format could be replaced by SQLite or HDF5 for good. IMHO the only cases where custom formats make good sense are streaming and extremely resource-limited embedded solutions.

titzer 2021-08-18 17:35:39 +0000 UTC [ - ]

I, too, am a fan of ISO 8601. In an insane world of date formats, it's the only sane choice.

https://en.wikipedia.org/wiki/ISO_8601

qwerty456127 2021-08-18 20:37:32 +0000 UTC [ - ]

Almost nobody is a fan of the actual ISO 8601. It requires a big 'T' to separate the date from the time - people consider this ugly and rarely implement this in the wild. People mostly use RFC 3339 which allows using a space instead of the 'T'.

Both also require colons to separate hours and minutes and this makes it impossible to use in file names if you want to support accessing them from Windows.

I personally use the actual ISO 8601 (with the "T") wherever I can, simple YYYY-MM-DD-HH-mm-SS-ffffff where I need to support saving to the file system (but this is slightly harder for a human to read) and mostly RFC 3339 (with a space instead of the "T") wherever I need to display or to interop with tools written by other people. As for SQLite - I usually store every field (years, months,... seconds etc) in a separate integer column and create a view which adds an automatically generated RFC 3339 date/time column for simpler querying.
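
A sketch of what that per-field layout could look like; the table, view and column names here are invented, and SQLite's built-in printf() does the formatting:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE events (
            year INTEGER, month INTEGER, day INTEGER,
            hour INTEGER, minute INTEGER, second INTEGER
        );
        CREATE VIEW events_readable AS
        SELECT *,
               printf('%04d-%02d-%02d %02d:%02d:%02d',
                      year, month, day, hour, minute, second) AS ts_rfc3339
        FROM events;
    """)
    con.execute("INSERT INTO events VALUES (2021, 8, 18, 15, 14, 0)")
    print(con.execute("SELECT ts_rfc3339 FROM events_readable").fetchone()[0])
    # 2021-08-18 15:14:00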

Evidlo 2021-08-18 22:48:04 +0000 UTC [ - ]

Are ISO standards ever updated? Is there any chance we might see an ISO 8601 date variant which is meant for filenames?

qwerty456127 2021-08-18 23:13:12 +0000 UTC [ - ]

> Are ISO standards ever updated?

Yes, ISO standards are updated as regularly as it makes sense for standards to change (obviously they wouldn't really be standards if they were updated more often).

Wikipedia says ISO 8601 was first published in 1988, with updates in 1991, 2000, 2004, and 2019.

The C language is another example. The latest stable revision (C17) is ISO/IEC 9899:2018.

> Is there any chance we might see an ISO 8601 date variant which is meant for filenames?

Hardly.

anodyne33 2021-08-19 00:24:38 +0000 UTC [ - ]

I had a great one today. I got a load file in (e-discovery world) that had my sent/rev'd/sort dates in "mm.dd.yyyy hh:mm" while my created and modified dates were "dd.mm.yyyy hh:mm". I can't fathom what piece of software sent that to me, and it wasn't rocket science to fix in Excel. I'm all for standardizing date format but with the # of applications that will only spit out data in one particular format I don't see it ever happening.

hatmatrix 2021-08-18 21:06:12 +0000 UTC [ - ]

CSV is accessible to programmers and non-programmers alike. HDF5 and SQLite require some programming ability and special software to interact with the data as they are not just text files like CSV.

qwerty456127 2021-08-18 21:23:27 +0000 UTC [ - ]

This is only because no common software supports them (I don't know why, the libraries are totally free - BSD and Public Domain).

And by the way, many (if not an overwhelming majority) of the non-programmers don't even understand what "just text files" actually means, how text files differ in nature from DOC files, or how CSV files are different from XLS files. They can only use CSV because Excel and LibreOffice support it OOTB, and they consider CSV just a weird XLS cousin needed for import/export purposes.

jedimastert 2021-08-18 23:19:36 +0000 UTC [ - ]

I don't need a library to work with CSV in any language. That alone is a deal breaker in plenty of situations, no matter how widespread the format becomes.

qwerty456127 2021-08-19 00:00:05 +0000 UTC [ - ]

Isn't importing and using the SQLite library utterly trivial in modern versions of the most of the programming languages? Even in Fortran and COBOL it's just `use :: sqlite` and `set proc-ptr to entry "sqlite3.dll"` respectively.

CobaltFire 2021-08-18 16:29:01 +0000 UTC [ - ]

Honestly I constantly see dates argued about and people state various formats that are still confusing. 4-2-2 of any variety can be confused. Why not 2-3-4 or 4-3-2 (DD-MMM-YYYY or YYYY-MMM-DD)? I’ve never understood why that isn’t more widely used.

qayxc 2021-08-18 16:35:18 +0000 UTC [ - ]

What would MMM look like? 02-005-2022?

Usually MMM refers to the 3-letter shorthand of the month, e.g. "APR" or "OCT", but I guess that's not what you meant because it couldn't be used internationally.

CobaltFire 2021-08-18 17:21:01 +0000 UTC [ - ]

I do mean with the letters, but international issues makes sense.

I should know better but just never thought about it.

qwerty456127 2021-08-18 16:40:36 +0000 UTC [ - ]

> Why not ... DD-MMM-YYYY

Because sorting. You can just sort a collection of dates stored as YYYY-MM-DD strings alphabetically and the result will always be in accordance with the actual time line.
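
A one-liner shows why: with YYYY-MM-DD, plain string sorting is already chronological sorting.

    dates = ["2021-12-01", "2021-02-28", "2020-07-04"]
    print(sorted(dates))   # ['2020-07-04', '2021-02-28', '2021-12-01'] -- already in time order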

kstrauser 2021-08-18 17:43:59 +0000 UTC [ - ]

This has so many happy side effects. Yesterday I was upgrading some code to use a newer version of an API. The old version stored timestamps as int seconds-since-epoch, but the new one stored them as ISO8601 strings. I replaced a line of code like:

  if row['timestamp'] < cutoff_time_as_int: ...
with:

  if row['isotimestamp'] < cutoff_time_as_string: ...
and everything kept working exactly as before. There was no expensive and error-prone date parsing involved, just some simple string comparison. It wasn't quite as cheap as comparing ints, but it's not performance-critical code, so eh, good enough.

CobaltFire 2021-08-18 17:21:30 +0000 UTC [ - ]

This is the answer I needed.

Thanks!

pwinnski 2021-08-18 16:53:37 +0000 UTC [ - ]

YYYY-MM-DD with numeric months sorts in order. Nothing else does.

mr_toad 2021-08-19 01:18:06 +0000 UTC [ - ]

If someone gives you a date like 10-04-2021 you need to know whether they’re American or not. Because it could be October 4th, or April 10th.

(In theory people could write yyyy-dd-mm, but I’ve never seen anyone actually do that).

_moof 2021-08-18 16:05:54 +0000 UTC [ - ]

This is an example of a genre I like to describe as: programmer objects to a solution that meets everyone's requirements because it doesn't make them feel like a beautiful code-poet.

I like elegance as much as anyone. And I think it's a good proxy for other important qualities. But don't prioritize it above building something that actually does the job. Be an engineer.

qayxc 2021-08-18 16:40:28 +0000 UTC [ - ]

Ah yes, and then there's https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

Not to mention the mess that is exchanging documents between different locales. It's all sunshine and roses until you get your CSVs from an office in a different country (which happens a lot in Europe).

CSV gets the job done until it doesn't.

brushfoot 2021-08-18 16:50:14 +0000 UTC [ - ]

That's not a problem inherent to CSV the format, though, any more than using Word to edit JSON means smart quotes are a problem with JSON. It's about knowing or not knowing the tools you're using.

Unexpected behavior is a potential problem with any tool or format, certainly no less so with the kinds of solutions the article is proposing.

qayxc 2021-08-18 17:10:40 +0000 UTC [ - ]

But the behaviour isn't unexpected at all, e.g. the user wanted to store a number in column 3 and that's exactly what the tool did. The behaviour is both expected and reproducible; the issue is that it's not communicated by the format.

The problem arises from the fact that CSV doesn't contain any metadata, so while the same tool reads and writes the same data, the results will be different because the format doesn't specify what a number looks like.

The formats proposed in the article don't have this problem because they specify both the type and the formatting of the file content so there's no confusion.

That's the difference. JSON in particular simply cannot have this problem (though it's not a suggested solution): a number has a well-defined format and that's that. So do strings. This is not the case with CSV:

  Number,Also a number,A String
  1.0,1e15,"Hello there!"
  "1,0",1e15,"Bienvenue á la maison!"
The first row is what you'd expect from a customer in Ireland and the second row might be sent from someone in Switzerland.

Both parties agreed to use comma-separated files for data exchange and quoted strings, yet importing the first row using a French locale would fail to parse the first number and same goes for the second row and an English locale.

Neither row contains unexpected or malformed data from the respective user's POV. And that's the problem with CSV.
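
A short Python sketch of that failure mode, using the sample rows above: the CSV parser happily returns both values as text, and only the downstream number conversion reveals the locale mismatch.

    import csv, io

    doc = 'Number,Also a number,A String\n1.0,1e15,"Hello there!"\n"1,0",1e15,"Bienvenue á la maison!"\n'
    rows = list(csv.reader(io.StringIO(doc)))

    print(rows[1][0], rows[2][0])   # '1.0' '1,0' -- both parse "successfully" as strings
    try:
        float(rows[2][0])           # fails under the C/English convention
    except ValueError:
        print("'1,0' is only a number if you already know the sender's locale")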

p_j_w 2021-08-18 17:41:34 +0000 UTC [ - ]

>CSV gets the job done until it doesn't.

Then you take it out of use where it doesn't get the job done.

dragonwriter 2021-08-18 16:01:41 +0000 UTC [ - ]

Yes CSV sucks, but beyond that this rant is...quite sloppy. In dismissing YAML and JSON it seems to confuse probability of corruption when using an editor that isn't structure-aware with “lack of precision”, and then treats binary formats as better in this regard. Not only is the terminology bizarre, but the application of the criteria is nonsensical. Binary formats are not less likely to be corrupted when using an editor that isn't specifically aware of the format, and JSON and YAML (especially when accompanied by explicit schemata) are no less amenable to structure (and, when used, schema) aware editors than binary formats are.

> To usurp CSV as the path of least resistance, its successor must have equivalent or superior support for both editing and viewing

Since the author has already ruled (on spurious grounds) human readability and thus the usability of ubiquitous text editors as incompatible with requirements for a successor format, this is simply impossible.

> If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course.

Yeah, sorry, you’ve put the cart before the horse. Neither of those firms is going to do that until the vast majority of users have already migrated off of CSV-based workflows; it would be an insane, user-hostile move that would be a bigger threat to their established dominance in their respective markets than anything any potential competitor is likely to do in the foreseeable future.

worik 2021-08-19 01:26:27 +0000 UTC [ - ]

What a crock.

The author completely misses the point of what CSV files are useful for.

They are useful when both ends of the communication understand the context. They know what the data types are, they know what the character encoding is etcetera.

This is a very common situation. CSV files are easy to process, easy to generate, and can be read by a human without too much bother (they are not "for people" as the author so irritatingly asserts). I have written so many CSV (and other delimiters, not just ',') generators and processors I lost count decades ago. It is so easy.

And what does "retire CSV" really mean? Just stop generating them then! The consumers of your data will probably insist you go back to them as writing all that code just to satisfy your fetish with metadata is not useful

Now I am whinging.....

Antrikshy 2021-08-19 02:00:32 +0000 UTC [ - ]

I stopped reading, so I may have missed it, but do they even address situations where humans interact with systems by supplying them with tabular data?

einpoklum 2021-08-19 12:43:44 +0000 UTC [ - ]

I don't agree with the author's thesis, but you're wrong.

1. It is a very common use-case to load CSV data generated by an opaque system which did not specify the format exactly to you.

2. CSV files are _difficult_ to process I: Fields are of variable width, which is not specified elsewhere. So you can't start processing different records in parallel, since you have to process previous lines to figure out where they end. You can't even go by some clear line termination pattern, since the quoting and escaping rules, and the inconsistencies the author mentions, make it so that you may not be able to determine with certainty that what you're reading is an actual end-of-record or end-of-field. Maybe it's all just a dream^H^H^H^H^H quoted string?

3. CSV files are _difficult_ to process II: For the same reasons as the above, it is non-trivial and time consuming to recover from CSV generation mistakes, corruptions during transmission, or missing parts of the data.

However - if you get some guarantees about field width and about the non-use of field and record separators in quoted strings, then your life becomes much easier.

pwinnski 2021-08-18 16:51:43 +0000 UTC [ - ]

This reads like a joke.

If you think you can do better than CSV, let's see your proposal. Hint: you probably can't, and if you could, you probably couldn't get Excel to export it, so you still probably can't.

"The status quo is bad, more recent popular formats aren't good enough either, but I don't actually have a specific proposal that's better than all of the above" is a lot faster to read than that article, and says the same thing.

AussieWog93 2021-08-18 21:55:17 +0000 UTC [ - ]

>If you think you can do better than CSV, let's see your proposal.

I've got one! It's basically the same as regular CSV, but everything is UTF-8, the columns and lines are delineated by dedicated UTF-8 "delineator" codepoints (if they aren't defined in the spec, find reasonable surrogates and use them), and therefore nothing ever needs to be escaped.

More human readable than regular CSV, less prone to error and just as easy to write to in a for loop (easier, in fact, as there are no escapes).

Depending on how Excel handles delineators, it should be able to import it too.

jedimastert 2021-08-18 23:32:07 +0000 UTC [ - ]

fun fact: ASCII (and UTF-8) has separators! There's a "file separator", "group separator", "record separator", and "unit separator".
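
A tiny sketch of what using them looks like; because the separators are banned from field values, nothing ever needs quoting or escaping:

    US, RS = "\x1f", "\x1e"    # ASCII unit separator (31) and record separator (30)

    rows = [["id", "name", "note"],
            ["1", "Doe, Jane", 'said "hi"\nthen left']]   # commas, quotes, newlines are all fine

    encoded = RS.join(US.join(fields) for fields in rows)
    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows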

politician 2021-08-19 00:33:29 +0000 UTC [ - ]

This whole “problem” with CSV boils down to ignorance of the purpose of the ASCII control characters below 32. This has been a solved problem for decades.

proverbialbunny 2021-08-18 21:34:32 +0000 UTC [ - ]

I wouldn't mind an updated standard for CSV that has the type information in the first line with the column labels. I feel like this is all that CSV is really missing. As a format where you can easily see and interact with the data in plain text if you need to, nothing really beats CSV, but I believe it could be made better, including strengthening standardization. E.g., a CSV standards version header could be put at the top of the file to minimize import/export difficulty.

hatmatrix 2021-08-18 21:04:54 +0000 UTC [ - ]

Just another confirmation for Betteridge's Law of Headlines

flowerlad 2021-08-18 21:45:44 +0000 UTC [ - ]

CSV isn't going away any time soon. It is the only format that is supported by all data-oriented tools. Getting all those tools to agree on an alternative isn't going to happen, especially given that there are all these old tools that are no longer being updated, that produce and consume CSV.

Instead of finding a replacement for CSV it might be easier to standardize it and enhance it. Excel's version of CSV is the de-facto standard. If you want a written-down spec, that is available too [1]. To this we need to add enhancements such as a way to specify metadata (e.g., the data type of each field). No need to find an alternative to CSV!

[1] https://datatracker.ietf.org/doc/html/rfc4180

einpoklum 2021-08-19 12:46:12 +0000 UTC [ - ]

> Excel's version of CSV is the de-facto standard.

If you're working someplace that uses Excel, then maybe. Otherwise, it isn't.

Anyway, if more tools, including Excel, properly implemented RFC4180, we'd at least have something.

Another option is some sort of meta-data file alongside a large CSV describing its format somehow - which, if missing, you have to fall back on RFC4180 or your parser's "try to figure out the quirks" mode.

rjh29 2021-08-18 17:03:09 +0000 UTC [ - ]

There isn't enough love for TSV imo. Unlike commas, tabs rarely appear in input data, so you don't need complicated quoting or escaping rules. You just have tab and newline as special characters. Processing this data is extremely fast. Grepping for particular values is also very fast, as you can use the tabs as anchors when searching.

TSV allows you to do stuff on a single machine and GNU parallel that people would normally create a Hadoop cluster or 128GB database for.

shock-value 2021-08-18 18:27:27 +0000 UTC [ - ]

Strong agree. TSV is better than CSV always (unless you need to interface with an external system that doesn't accept it).

For those not aware, TSV and CSV differ by more than just the delimiting character. TSV has a dead-simple specification: https://www.iana.org/assignments/media-types/text/tab-separa.... CSV does not have a standard spec and implementations differ quite a bit, but often in subtle ways.

segmondy 2021-08-18 23:01:50 +0000 UTC [ - ]

There's no difference between TSV and CSV but the separator; most good libraries allow you to use any character you wish as the separator.

Here's the RFC for CSV - https://datatracker.ietf.org/doc/html/rfc4180

Hasnep 2021-08-19 00:28:23 +0000 UTC [ - ]

The other key difference is that instead of having complex quoting rules, TSV just disallows the tab character in the data.

shock-value 2021-08-19 01:49:03 +0000 UTC [ - ]

> There's no difference between TSV and CSV but the separator

This is not true, though almost everyone (understandably) assumes this based on the names.

alexpw 2021-08-18 17:10:46 +0000 UTC [ - ]

1) CSV can often be human-readable at a glance (more so than most other formats, depending on the data), and that makes it appear deceptively simple and compact. Possibly due to that, I'd bet most of us have been bitten by a writer/reader that doesn't respect the RFC rules.

2) I ask for TSV, whenever convenient. It's been more reliable, and I don't have a comprehensive why, but I think it's slightly more resilient to writer/reader inconsistencies, for me. It may be that there's just less need for escaping and quoting, so you might dodge a smart quotes debacle when asking for a one-off from an Excel user, for example.

3) Despite the issues raised, the notion we'd retire it makes me hug it tight, because for the majority of my requirements, it hits a sweet spot. I still reserve the right to raise my fist in frustration when someone does: ",".join(mylist).

shock-value 2021-08-18 18:22:32 +0000 UTC [ - ]

TSV is a dead simple and fully-specified format. Here it is: https://www.iana.org/assignments/media-types/text/tab-separa...

CSV is a mish-mash of different, complicated, and under-specified standards and/or implementations.

yuy910616 2021-08-18 15:11:13 +0000 UTC [ - ]

CSV is bad - but convenience is hard to beat. It's the same thing with excel. People abuse it, but you just can't beat the fact that your programmer, CEO, analyst, and secretary can all contribute to the same file.

fuhrysteve 2021-08-18 15:25:27 +0000 UTC [ - ]

Yeah, this is pretty much it. The author complains about CSVs being "notoriously inconsistent" as though switching to some other format would magically change that. They're only inconsistent because sometimes lazy programmers do ",".join(mylist) instead of using an RFC4180 compliant CSV writer. Lazy programmers will just use non-compliant methods of creating whatever magic format OP is dreaming about. Case in point: trailing commas in JSON objects, and other ridiculous things that people have come up with such as encoding a date in JSON like this: "\/Date(628318530718)\/" https://docs.microsoft.com/en-us/previous-versions/dotnet/ar...
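
The difference is easy to demonstrate: the naive join silently changes the column count as soon as a value contains a comma, while the stdlib writer quotes it (values invented for illustration).

    import csv, io

    row = ["ACME, Inc.", 'says "hello"', "2021-08-18"]

    print(",".join(row))            # ACME, Inc.,says "hello",2021-08-18  -> re-parses as 4 fields
    buf = io.StringIO()
    csv.writer(buf).writerow(row)   # "ACME, Inc.","says ""hello""",2021-08-18
    print(buf.getvalue(), end="")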

CSVs also are great because you can parse them one row at a time. This makes for a very scale-able and memory-efficient way of processing very large files containing millions of rows.

Let there be no mistake: Everyone reading this today will retire long before CSVs retire. And that's just fine by me.

ivanbakel 2021-08-18 15:34:23 +0000 UTC [ - ]

>CSVs also are great because you can parse them one row at a time. This makes for a very scale-able and memory-efficient way of processing very large files containing millions of rows.

Even RFC4180-compliant CSVs can be incredibly memory-inefficient to parse. If you encounter a quoted field, you must continue to the next unescaped quote to discover how large the field is, since all newlines you encounter are part of the field contents. Field sizes (and therefore row sizes) are unbounded, and much harder to determine than simply looking for newlines - if you were to naively treat CSV as a "memory-efficient" format to parse, you would create a parser that would be easy to blow up with a trivial large file.
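
A small example of why "one record per line" breaks down: the same bytes are four lines to a naive splitter but only three records once quoting is honoured.

    import csv, io

    data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

    print(len(data.splitlines()))                    # 4 -- naive line splitting
    print(len(list(csv.reader(io.StringIO(data)))))  # 3 -- header plus two records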

newbie2020 2021-08-19 05:15:40 +0000 UTC [ - ]

Hah, I use the join function when I write them. Can you elaborate upon why that’s bad?

bernardv 2021-08-18 20:24:28 +0000 UTC [ - ]

At the end of the day the purpose of software is to serve users, not satisfy developers’ sense of perfection.

rendall 2021-08-18 21:35:00 +0000 UTC [ - ]

The article's strongest criticism of CSV is that it's easy for someone to mangle it when manually editing. This is true. It's also true for every format.

It was weakest when it implied there is no real standard. There is, and it's robust for representing data, even data that includes any combination of commas and double-quotes.

The algorithm for creating well-formed CSV from data is straightforward and almost trivial: if the datum has no comma in it, leave it alone. It's good to go. If it has even one comma, wrap the datum in double quotes; and if it also contains double quotes, then double them.

Not complicated and covers every edge case. CSV is going nowhere. 1000 years from now, computers will still be using CSV.

The answer to his objections could be to extend the format to include metadata. Perhaps a second row that holds type data.

  fruit,price,expiration
  string,$0.00,MM/DD/YYYY
  apple,$0.45,01/24/2022
  durian,$1.34,08/20/2021
etc
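
One way a reader could consume that hypothetical second row (the convention is invented here, not an existing standard):

    import csv, io

    doc = "fruit,price,expiration\nstring,$0.00,MM/DD/YYYY\napple,$0.45,01/24/2022\ndurian,$1.34,08/20/2021\n"

    reader = csv.reader(io.StringIO(doc))
    header, formats = next(reader), next(reader)   # second row carries per-column format hints
    for row in reader:
        print(dict(zip(header, row)))              # the hints in `formats` would drive real parsing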

hnick 2021-08-19 00:00:27 +0000 UTC [ - ]

You didn't mention handling new lines as data. Excel for example will include these as-is and double quote the cell.

It is very easy to overlook edge cases in CSV.

rendall 2021-08-19 03:06:50 +0000 UTC [ - ]

> It is very easy to overlook edge cases in CSV.

That was not a criticism from the original article, and isn't even true.

If you have newlines in your original data, and you "overlooked" this "edge case", then neither JSON nor YAML nor any other format will save you. The same fix applies to them all.

This is really a very poor criticism.

hnick 2021-08-19 03:41:25 +0000 UTC [ - ]

The only reason I replied was that you said:

> The algorithm for creating well-formed CSV from data is straightforward and almost trivial: if the datum has no comma in it, leave it alone. It's good to go.

> Not complicated and covers every edge case

Maybe more context was implied, but I didn't want anyone to think it's that simple. I have received and had to process CSV data with unexpected new lines. It gets nearly everyone the first time when they try to process a CSV file line by line.

rendall 2021-08-19 09:45:21 +0000 UTC [ - ]

That is a fair point :)

jedimastert 2021-08-18 23:23:44 +0000 UTC [ - ]

I wonder why escape dquotes with another dquote instead of a backslash or something?

contravariant 2021-08-18 23:35:46 +0000 UTC [ - ]

Because then you'd have to escape backslashes.

RobLach 2021-08-18 15:25:35 +0000 UTC [ - ]

I can generate a CSV that can be read in Excel, etc., extremely easily in a terminal.

Just yesterday I was trying to add some metadata to a list of images that I handed off to a designer and I just piped a single "find . -fprintf" command into a file.

To "retire" something so simple and utilitarian makes no sense. There's plenty of formats that deal with all the issues described in this. It's like saying we need to retire plain text files because it's confusing to know how they should be displayed.

javajosh 2021-08-18 15:10:22 +0000 UTC [ - ]

More correctly titled, "I Don't Like CSV".

fridif 2021-08-18 15:14:03 +0000 UTC [ - ]

Hahahah. Allow me to submit mine as well:

"I Cannot Fathom A Use Case For CSV So I Would Like To Ban Everyone From Using It For Any Reason"

cogman10 2021-08-18 15:22:31 +0000 UTC [ - ]

There are probably valid uses for CSV, but more often than not, it's the wrong choice.

As soon as data has any form of structure to it (and most data does), CSV complicates everything. Even for unstructured data, the problem of escape characters often rears its ugly head. The moment your data contains a comma, tab, or space, you run into a nasty mess that, in the best case, makes your system fail, and in the worst case silently adds corrupt data into the system.

Neither JSON nor XML suffer from that problem and both can easily be used in any scenario you'd use CSV. The only argument against either format is they are a bit more bulky than CSV.

fxleach 2021-08-18 15:36:30 +0000 UTC [ - ]

Just an FYI, JSON and XML both have characters that need to be escaped properly.

cogman10 2021-08-18 16:03:50 +0000 UTC [ - ]

Correct, and those methods of escaping are clearly defined in both specs. In CSV, there is no universal escaping system. It's all over the board.

Any variable length text data format is going to run into issues when special characters are used in the data. CSV has no definition of what should happen when that occurs.

fxleach 2021-08-19 16:25:14 +0000 UTC [ - ]

The characters to escape are clearly defined in both specs, but that doesn't mean there's a universal escaping system. Just google "how to escape JSON" and you will see the wealth of users struggling to understand what to escape and how to accomplish it. There will always be people struggling with escape characters; I don't see it being any harder or easier in any format.

Demiurge 2021-08-18 16:27:43 +0000 UTC [ - ]

Retiring CSV is as futile as retiring Unix pipes. It's a text format for text manipulations, and there is nothing simpler and more ubiquitous than a newline- and comma-delimited table. Simplicity and flexibility have their drawbacks, but the advantages clearly outweigh them. So it will never retire, as long as we use text and blob filesystems that don't have interoperable structured metadata across the board.

jonnycomputer 2021-08-18 16:48:13 +0000 UTC [ - ]

CSV in general is terrible. CSV in particular is great. Most people only deal with CSVs in particular: this set of CSVs output from that process.

Writing code that can accept and parse arbitrary CSV files is a whole different thing. If I had to do that, I'd be yelling too.

A similar story can be told for JSON, Markdown, etc., where the standard is inadequate, non-existent, or in mutual competition.

kaiken1987 2021-08-18 15:14:46 +0000 UTC [ - ]

The whole point of CSV is that it's simple. You don't need any special libraries. If you can write hello world to a file you can make a CSV, and more often than not it'll just work. Yes, some things, like DNA, shouldn't be CSV files. But don't blame the tool for bad craftsmanship. There are more things that it works for than doesn't. As a side note, dates should be ISO 8601.

kryptiskt 2021-08-18 16:16:15 +0000 UTC [ - ]

I have always wondered why there is so little use of the separators in ASCII (the file, group, record and unit separators with codes 28-31). They seem perfect for the job and it would be easy to forbid inclusion of those characters in fields.

Ekaros 2021-08-19 08:54:06 +0000 UTC [ - ]

It's sad that so much ascii space is wasted on characters that are used... Really should replace that standard with something better...

derbOac 2021-08-18 19:51:35 +0000 UTC [ - ]

I didn't know they existed, but now I do, and am wondering the same thing!

_jal 2021-08-18 15:18:44 +0000 UTC [ - ]

The more interesting question is to try to tease apart what led to CSV's survival, despite all the warts. Clearly a combination of things, including accident. But I'll point out some advantages:

- Repairability by mere mortals. If you are competent enough to use Excel and are given bad data, you are competent enough to fix it. (Whether the dataset is too big for mere mortals to find the problem is a different issue.)

- Trivial serialization. It is super-easy to dump things to CSV. (This is probably also why there are so many annoying variants.)

- Usable by other tools. You don't need special libraries to read it, so pipelines involving the usual suspects are possible. Importantly, there are a ton of different ways to do this sort of thing, so non-experts can frequently find something that works for them, even if it looks wonky to programmers.

All that said, I hate dealing with them, too.

bhawks 2021-08-18 20:52:56 +0000 UTC [ - ]

CSV is the embodiment of the worse-is-better axiom.

Anyone that has dealt with more than 1 CSV is aware of many of the aspects of their horrible nature.

The more interesting question is why they're so damn successful.

1: network effect - not supporting CSV in a product is practically silly. Everyone can do it, why can't you?

2: ease of producing / consuming (not saying that you do it correctly in all cases :))

3: data is transparent (or feels transparent)

4: accessible - if you can write language X you can parse or generate reasonable CSV in a matter of minutes. No knowledge of any libraries or tech required

That said it's horrible - but it will always be with us.

tannhaeuser 2021-08-18 16:09:38 +0000 UTC [ - ]

CSV (well, TSV) as a format is as simple as it gets: one special code as field separator, and another one as line/record terminator. Only that CSV/TSV use chars available in text editors rather than the proper (archaic) ASCII C0 control codes. Whatever the author has against CSV, I feel like starting a war against CSV is crazy, since any alternative is going to be worse when the problem isn't the format as such, but folks misusing it or using it wrongly (eg. broken XML in RSS, broken JSON, whatever).

shock-value 2021-08-18 18:30:43 +0000 UTC [ - ]

Despite the similar names, TSV and CSV are very different, in more ways than just the delimiter. In particular, CSV tries to handle escape sequences but its various implementations do so in different ways. TSV just disallows tab characters within fields, so its specification extremely simple: https://www.iana.org/assignments/media-types/text/tab-separa....

ziml77 2021-08-18 16:19:32 +0000 UTC [ - ]

And then a field needs to contain one of those characters and everything breaks. And they're designed to be human editable, so the solution is to complicate parsing by using quoted fields instead of escape characters. Although someone might choose to use escape characters because there's no spec so they can do whatever they want.

mongol 2021-08-18 16:21:23 +0000 UTC [ - ]

I think if only editors and shells supported control codes better and gave them a sane visual representation, we could see a revival for them. Feels like a waste to have them and not to be used.

thibran 2021-08-18 16:44:46 +0000 UTC [ - ]

For a start, I would be happy if the German version of Excel would, by default, separate CSV values by comma and not semicolon. This "bug" alone cost us hundreds of support hours. We fixed it by switching to XLS, but still CSV is out there and causing headaches (Mailjet for example can't create correct CSV files - luckily SQLite can read them).

Microsoft -> The name of the format is 'comma separated values' not 'semicolon separated values'!

mongol 2021-08-18 17:22:23 +0000 UTC [ - ]

Problem is, many countries use comma as decimal separator. Which makes comma inconvenient as CSV separator.

unnah 2021-08-18 18:46:15 +0000 UTC [ - ]

Just use a decimal point instead. Switching the delimiters depending on locale makes it extremely impractical to use CSV for data interchange with continental European Excel users.

Microsoft Excel is the only program I know that insists on locale-specific interpretations of CSV. To read a standard CSV into Excel in a continental European locale, you need to either switch your locale at the Windows level, or to manually set every delimiter in the Excel text import wizard (requiring multiple mouse clicks for each column).

thibran 2021-08-18 17:26:12 +0000 UTC [ - ]

This can easily be solved in CSV by escaping or quoting.

MrPowers 2021-08-18 18:21:54 +0000 UTC [ - ]

Here are the main advantages of Parquet over CSV (post author made these points more abstractly):

* Parquet stores schema in the metadata, so schema inference isn't required (schema inference is expensive for big datasets)

* Parquet files are columnar so individual columns can be grabbed for analyses (Spark does this automatically). This is a huge performance improvement.

* Row groups contain min/max info for each column, which allows for predicate pushdown filtering and data skipping. Highly recommend playing with PyArrow + Parquet to see the metadata that's available.

* Columnar file formats are easier to compress. Binary files are way smaller than text files like CSV even without compression.

I wrote a blog post that shows how a Parquet query that leverages column pruning and predicate pushdown filtering can be 85x faster than an equivalent CSV query: https://coiled.io/parquet-column-pruning-predicate-pushdown/

CSVs are great for small datasets when human readability is an important feature. They're also great when you need to mutate the file. Parquet files are immutable.

It's easy to convert CSVs => Parquet/Delta/Avro with Pandas/Dask/Spark.
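
For example, with pandas (file and column names are placeholders, and a Parquet engine such as pyarrow must be installed):

    import pandas as pd

    pd.read_csv("events.csv").to_parquet("events.parquet")   # one-line CSV -> Parquet conversion

    # Column pruning: only the named columns are read from disk.
    subset = pd.read_parquet("events.parquet", columns=["timestamp", "user_id"])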

The world is already shifting to different file formats. We just need to show folks how Parquet is easy to use and will greatly increase their analysis speeds & they'll be happy to start using it.

Small nit: the article implies Apache Arrow is a file format. It's an in-memory format.

podiki 2021-08-18 15:11:29 +0000 UTC [ - ]

As long as it is text-based. At least you can actually look at a CSV and see what is going (wrong), as well as use all the text tools we have. Not that there aren't problems as the article points out. But some binary file based on the whims of a proprietary program...no thanks.

nolok 2021-08-18 15:14:01 +0000 UTC [ - ]

XLSX (the Office format for almost a decade now) is zip and XML all the way. Not fun to look at, but totally readable by a human.

daxuak 2021-08-18 15:20:24 +0000 UTC [ - ]

Humans can't efficiently write or parse XML or JSON, though. In some scenarios CSV hits the right spot of being accessible to humans and computers, and the table can be laid out so that one can sort/grep/awk to quickly gain some insight.

cogman10 2021-08-18 15:26:24 +0000 UTC [ - ]

As someone that works with maven and npm.... what?

This is valid JSON

    [["bob", "jones", 1, 22],
    ["frank", "was", 32, 45]]
That's unreadable and unparsable by a human?

Not only is JSON often more parsable, because it's structured it also becomes a lot easier to query.

I grep and awk xml and json stuff all the time. I also have the added bonus of being able to use `jq` for json content.

shapefrog 2021-08-18 16:00:07 +0000 UTC [ - ]

That's just a CSV with extra steps.

cogman10 2021-08-18 16:10:46 +0000 UTC [ - ]

Nope, that's CSV without the drawbacks of CSV. That's CSV that can have special characters and doesn't suffer from delimiter problems.

When someone says "Maybe we can fix CSV" this is what you should do instead of trying to "fix" CSV.

mongol 2021-08-18 17:18:44 +0000 UTC [ - ]

Interesting. This is JSON++ somehow. What should it be called? Line-oriented JSON? Row-JSON?

cogman10 2021-08-18 17:27:03 +0000 UTC [ - ]

It's just JSON. It's not an extension to the spec, it's a subset of the spec. If anything, you could say it's JSON--.

mongol 2021-08-18 17:41:40 +0000 UTC [ - ]

Well, to work as a CSV-like JSON, i.e. greppable similarly to a CSV, it needs formatting with line breaks as above. It is still completely valid JSON. The ++ is the formatting convention.

cogman10 2021-08-18 17:54:25 +0000 UTC [ - ]

Gotcha.

Yeah, it's never really come up as it's generally trivial to convert into this format from either a text editor or sed (Look for the `],[` and put a new line after the comma).

dullgiulio 2021-08-18 15:19:06 +0000 UTC [ - ]

Readable but not comprehensible. CSV is hard to beat in that sense, being somewhat "natural" like a table written on paper.

throwawayswede 2021-08-18 16:30:06 +0000 UTC [ - ]

Also, it's way more difficult to parse than CSV. I'm not saying that CSV is better (or worse), just that each has its job! Wtf is this about retiring a format because it doesn't fit someone's expectations!

cogman10 2021-08-18 15:18:33 +0000 UTC [ - ]

JSON and XML both make it easy to see what is going wrong and don't have near the same amount of drawbacks that CSV has.

Semaphor 2021-08-18 15:48:28 +0000 UTC [ - ]

Both are also not a good fit for columnar data at all.

cogman10 2021-08-18 16:12:31 +0000 UTC [ - ]

I disagree. Json, in particular, can be nearly as compact as CSV by storing the data as an array of arrays.

    [[1,2,3],
     [4,5,6]]
It's easy to make a structured data interchange format mimic an unstructured format. It's impossible to go the other way around without severe problems.

goatlover 2021-08-18 17:07:40 +0000 UTC [ - ]

It can be, but what's ensuring that format when you read in a JSON file? JSON is one of the formats Pandas can read, but it has to be structured in a format the python library can read in as tabular data. Excel would have the same issue as would any program that is consuming tabular data. At least with CSVs, you know the data is tabular.

cogman10 2021-08-18 17:23:23 +0000 UTC [ - ]

> It can be, but what's ensuring that format when you read in a JSON file?

What's ensuring the format of data in a CSV file?

Format comes from the same place it comes from for CSV. It's part of whatever data contract you are making with whatever system is providing the data. If someone shoves YAML into a Json file you've got problems, just like you have problems if someone starts talking about cars when you expected a file about airplanes.

At some point, you've got to turn file data into application data regardless of format. CSV offers no help in shaping the data and has a bunch of footguns to boot.

> JSON is one of the formats Pandas can read, but it has to be structured in a format the python library can read in as tabular data.

Pandas is highly flexible in its ability to read in JSON data. What I showed would be trivially loadable by Pandas, as would other formats (including more traditional JSON data).

Turning structured data into tabular data is trivial. What isn't trivial is turning tabular data into structured data.
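
For instance, the array-of-arrays example above loads in a couple of lines (column names invented for the illustration):

    import json
    import pandas as pd

    doc = '[["bob", "jones", 1, 22], ["frank", "was", 32, 45]]'
    df = pd.DataFrame(json.loads(doc), columns=["first", "last", "x", "y"])
    print(df)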

bluedino 2021-08-18 16:32:17 +0000 UTC [ - ]

Except that won't open in Excel

WorldMaker 2021-08-18 18:20:19 +0000 UTC [ - ]

Excel actually has really good JSON import… it's just "hidden" in the Data tab on the ribbon and users don't want to learn how to use it.

I feel so much of what keeps CSV in use is that it's a format you can generate relatively easily without pulling in a lot of library dependencies, and relatively quickly produce something that users can just double-click and have open in Excel.

What the Excel team could really gift to developer humanity at this point is some dumb file extension Excel registers like XLJSON that you could just rename a JSON file to and Excel opens it like a spreadsheet on double click.

cogman10 2021-08-18 16:53:11 +0000 UTC [ - ]

Correct. That'd be moving the goalposts. The original claim was that JSON doesn't work well with columnar data.

If the primary use case is to take data and give it to a non-programmer to evaluate, then CSV isn't terrible. Add a transformer that's run when someone manually requests to see the data.

However, for machine to machine communication, CSV should never be used.

dbreunig 2021-08-18 18:06:31 +0000 UTC [ - ]

They answer their own question:

>Most of us don’t use punch-cards anymore, but that ease of authorship remains one of CSV’s most attractive qualities. CSVs can be read and written by just about anything, even if that thing doesn’t know about the CSV format itself.

Yes, we keep CSVs. If you care about metadata and incredibly strict spec compliance, then yes: avro, parquet, json, whatever. But most CSV usage is small data, where the ease of usage, creation, and evaluation wins.

One of the problems with CSVs he cites is a great reason why I like CSVs:

>CSVs often begin life as exported spreadsheets or table dumps from legacy databases, and often end life as a pile of undifferentiated files in a data lake, awaiting the restoration of their precious metadata so they can be organized and mined for insights.

A benefit of a CSV is that a skilled or unskilled operator can evaluate these piles of aging data. Parquet? Even SQLite? Not so much.

For small-to-medium sized datasets, CSV is great and accessible to a wider user base. If you're relying on CSV to preserve structure and meta, then meh.

SavantIdiot 2021-08-18 18:42:35 +0000 UTC [ - ]

Every year this question comes up, and every year the arguments are the same.

It's an exercise in hubris because, while CSV has one glaring fault, it has far fewer than other competing methods for the simplest possible exchange of textual data.

No one can decide to retire CSV because it is ubiquitous. Of the three issues the OP complains about - line delimiter, field delimiter, and header - only one is really a problem for anyone who has used them for any period of time.

1. Every file format suffers from Windows/NonWindows CRLF issues.

2. Metadata has been an issue since forever, and there are plenty of painful formats that support it (looking at you XML), complete with ginormous parsers and even more issues.

3. Escaping, as OP points out, was pretty clearly defined in RFC 4180.

So yes, it has a wart: escaping.

Learn what your system expects, and modify accordingly. Because using a CSV will be much faster and simpler than any other format you can try, which is why it has been so pervasive for longer than most of HN has been alive.

ozim 2021-08-18 21:28:28 +0000 UTC [ - ]

Problem is not with CSV, problem is with people.

People want to read and interpret arbitrary CSV ... well you cannot read and interpret any format that is arbitrary.

*As a consultant, I’ve written more than one internal system that attempts to reconstruct the metadata of a CSV of unknown provenance using a combination of heuristics and brute force. In short, CSV is a scourge that has followed me throughout my career.*

Ideally this should not happen, because you should talk with the other party and agree on a common format - someone should explain what each field means and what it should contain, or at least provide some documentation for the file, not just "it's a CSV". But of course it is always more complicated than that.

Garbage in - Garbage out, even in other formats you still can have the same problem.

Traster 2021-08-18 17:41:06 +0000 UTC [ - ]

CSV is ubiquitous. That's a massive advantage. If you argue against using CSV you might make it less ubiquitous, but you're never going to make anything as ubiquitous as CSV, and that's amply demonstrated by the fact that when you look at alternatives you immediately name 7 different things. And some of the advantages aren't even necessarily advantages - a file format storing typed values is a non-trivial consideration when you're thinking about cross-tool support. Pandas is cursed with this even in CSV - there wasn't an optional int, so any column with missing values must be float, and floats don't necessarily handle large integers - have fun with your nanosecond timestamps going funny when stored as optional ints.

CSV is ubiquitous because it's not trying to solve difficult problems. The second you try to solve the difficult problems you necessarily fragment your audience.
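
For illustration, a small Pandas sketch of that footgun; the column names and values are made up, and the point is only that one missing value silently turns an integer column into float64:

    from io import StringIO
    import pandas as pd

    csv_text = "id,ts\n1,1629300000000000001\n2,\n"   # one row is missing ts
    df = pd.read_csv(StringIO(csv_text))

    print(df["ts"].dtype)       # float64: the missing value forces the column to float
    print(int(df["ts"][0]))     # no longer equals 1629300000000000001
                                # (float64 is only exact up to 2**53)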

zabzonk 2021-08-18 15:21:01 +0000 UTC [ - ]

As the author of a CSV munging tool (CSVfix) I think most of the problems with CSV could be fixed if people producing CSV output, and people reading CSV input obeyed the rules of the RFC. Sadly, most people don't, and any textual output or input is routinely described as CSV, when it is nothing of the sort - even to the extent of not being comma-separated!

pessimizer 2021-08-18 16:51:01 +0000 UTC [ - ]

If you're referring to http://csvfix.byethost5.com/csvfix15/csvfix.html, thank you from the bottom of my heart. You were my pick for best tool for dealing with CSVs when a bunch of others (some of which sadly ended up with better adoption) were competing for the prize.

zabzonk 2021-08-18 19:52:42 +0000 UTC [ - ]

Yep, that's me. I'm afraid, because of mental and physical health problems, and family troubles I have stopped supporting it. Also, when bitbucket stopped supporting Mercurial I simply could not be arsed porting it to git. Not that I have anything against git.

hermitcrab 2021-08-18 22:09:26 +0000 UTC [ - ]

I have a commercial question for you about CSVfix (yes, I know it is no longer maintained). Email me if interested (see bio for email).

javajosh 2021-08-18 16:05:34 +0000 UTC [ - ]

Is there something like CSVlint? It would be useful particularly in the output stage, as a kind of functional runtime test to make sure your program's output is valid and consumable.

zabzonk 2021-08-18 16:14:20 +0000 UTC [ - ]

Validating CSV syntax is kind of hard - in the worst case you would have to read the entire CSV file.

hermitcrab 2021-08-18 21:31:02 +0000 UTC [ - ]

Even then it seems that there aren't that many checks you can do, other than:

-warning if different rows have different numbers of columns

-warning if the file ends inside an unclosed quoted field

Without metadata you can't check data types.

A missing escape character might result in a valid CSV (just not the one intended).
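
For illustration, a minimal lint sketch along those lines in Python, assuming UTF-8 input and RFC 4180-style quoting; the unclosed-quote check is approximated by counting quote characters, since Python's csv reader silently accepts a field left open at end of file:

    import csv
    import sys

    def lint_csv(path):
        # Check 1: ragged rows (differing column counts).
        widths = set()
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                widths.add(len(row))
        if len(widths) > 1:
            print(f"warning: rows have differing column counts: {sorted(widths)}")

        # Check 2 (heuristic): a well-formed RFC 4180 file has an even number of
        # quote characters, so an odd count suggests an unterminated field.
        with open(path, encoding="utf-8") as f:
            if f.read().count('"') % 2 != 0:
                print("warning: odd number of quote characters; possible unclosed field")

    if __name__ == "__main__":
        lint_csv(sys.argv[1])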

masklinn 2021-08-18 15:48:17 +0000 UTC [ - ]

> Sadly, most people don't

And the important thing to remember is that you cannot and will not make them.

zabzonk 2021-08-18 16:04:13 +0000 UTC [ - ]

No, but you could possibly suggest to them that what they are producing is not CSV. Otherwise, any sort of description becomes meaningless.

AdmiralAsshat 2021-08-18 15:16:01 +0000 UTC [ - ]

CSV was a thing long before I was born, so I'm not privy to how it came about. But at least in day-to-day work, the single biggest drawback of CSV in my experience is the fact that the comma and most of the other common delimiters occur regularly in real data, forcing all of the cumbersome escape sequences. To say nothing of someone misplacing a quote somewhere and throwing off the cell count.

So, question to the greybeards: when this format was coming about, why didn't we use one of the dedicated Data Structure separator control codes (e.g. File/Group/Record/Unit separator) that were part of the ASCII standard?

https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Basic_...

It seems like it would've saved us all several decades of headache.

wanderingstan 2021-08-18 15:39:46 +0000 UTC [ - ]

Some people did.

But commas can be seen, edited, and typed with ease in any text editor.

As per comment above: your CEO, SWE, or secretary can all use or contribute to a csv file. And using an easily recognizable and typeable separator has proven to be worth the downsides.

rutthenut 2021-08-18 15:20:07 +0000 UTC [ - ]

Agree that the comma delimiter is a pain, as it is a frequently used character in the content. Then adding escape-type chars or quotes around fields just makes things more and more messy.

Consider also that IBM EBCDIC and other non-ASCII character sets were (and are) in common use, so C0/C1 may not have made sense 'back in the day'.

_moof 2021-08-18 16:10:56 +0000 UTC [ - ]

> why didn't we use one of the dedicated Data Structure separator control codes (e.g. File/Group/Record/Unit separator) that were part of the ASCII standard?

Because you can't see them. CSV is, at its core, a text format. Using FS/GS/RS/US would effectively make it a binary format.

Also, what happens if one of those bytes appears in data?

The simple fact is you will never be able to come up with a text-based format that can handle all possible values without escape sequences, quoting, or length-encoding. And really, that's not that big a deal. It just means you have to write a simple state machine instead of using regex or your language's equivalent of split().
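
For illustration, a minimal sketch of such a state machine in Python; it handles RFC 4180-style doubled quotes but assumes plain \n record separators and ignores details like CRLF and BOMs:

    def parse_csv(text):
        rows, row, field = [], [], []
        in_quotes = False
        i = 0
        while i < len(text):
            c = text[i]
            if in_quotes:
                if c == '"':
                    if i + 1 < len(text) and text[i + 1] == '"':
                        field.append('"')          # doubled quote -> literal quote
                        i += 1
                    else:
                        in_quotes = False          # closing quote
                else:
                    field.append(c)                # commas/newlines are data here
            elif c == '"':
                in_quotes = True
            elif c == ',':
                row.append(''.join(field)); field = []
            elif c == '\n':
                row.append(''.join(field)); field = []
                rows.append(row); row = []
            else:
                field.append(c)
            i += 1
        if field or row:                           # last record without newline
            row.append(''.join(field)); rows.append(row)
        return rows

    print(parse_csv('a,"b,1"\n"say ""hi""",c\n'))
    # [['a', 'b,1'], ['say "hi"', 'c']]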

NyxWulf 2021-08-18 15:23:18 +0000 UTC [ - ]

CSV is only one form of delimited file. TSV, or tab-separated values, is another. If you have either field delimiters or row delimiters (newlines) in your data, change your delimiters and process appropriately.

pmelendez 2021-08-18 17:26:51 +0000 UTC [ - ]

Several things in this article resonated with me negatively. But I think these two are the biggest ones:

>values stored in the files are typed.

>most importantly, these formats trade human readability and writability for precision

Those two properties are advantageous only when the file is meant to be parsed by machines and used as a data transfer format. But in reality, CSV files are used in many different contexts. For instance, data scientists love to leverage and chain Unix tools to create a subset of data to test models. It is also often used as an export format to validate outputs quickly.

I think the problem is that CSV is often the subject of abuse. But in much the same way that a spreadsheet is a subpar database, I don't believe we are anywhere close to the time when we will retire Excel.

slaymaker1907 2021-08-19 03:52:36 +0000 UTC [ - ]

I think a key element for any potential replacement of CSV is parser simplicity. Writing a basic CSV parser by hand is not very difficult (i.e. ignoring stuff like escaped characters and quotes) compared to writing a SQLite parser. Regardless of whether it is binary or not, it should be possible to print out a usable grammar on a single sheet of paper, and said grammar should be LL(1) so at most you would need to write a recursive descent parser.

Suppose someone wants to do some data processing client side in the browser with an input file. How would you do this for SQLite, even a SQLite file with a known schema such as would be the case most of the time for CSV file uploads? There is sql.js which compiles SQLite to JS or WASM, but even the WASM version weighs in around 400kb.

mdellavo 2021-08-18 15:11:27 +0000 UTC [ - ]

please don't - CSV is a perfectly fine textual file format for many things that can be used by the technical and non with a wide range of support in applications and programming languages.

karmakaze 2021-08-18 18:16:47 +0000 UTC [ - ]

The premise is that CSVs are in widespread use and not suitable for the purposes they are used for.

The conclusion is that there are many binary formats which are more suitable, except that none are used widely enough, and the post even goes on to not make a recommendation. Finally it says that we have to accept this sad state:

"Ultimately, even if it could muster the collective will, the industry doesn’t have to unify around a single successor format. We only need to move to some collection of formats that are built with machine-readability and clarity as first-order design principles. But how do we get there?"

ho_schi 2021-08-18 16:52:23 +0000 UTC [ - ]

XML? Which is often not human readable, overly complex and therefore requires enormous amounts of processing power to be parsed.

JSON - and also CSV - are ubiquitous because they are actually human readable and usually quick to process. JSON is strictly defined; it just doesn't offer comments or rich structure like XML. CSV is loosely defined but better than another competing standard ;)

A good data interchange format must be human readable, JSON and CSV are. Proprietary stuff not. And if you feel the need for speed? Seriously? Okay, then think about a binary format.

I prefer also the .conf format (so called INI) over other complex stuff for application settings.

bazzargh 2021-08-18 16:37:24 +0000 UTC [ - ]

I'm kinda surprised this references RFC-4180 as the most formal spec when there is a _much better_, and more realistic, spec developed by the W3C csv on the web working group (https://www.w3.org/2013/csvw/wiki/Main_Page)

https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/#...

... at the time RFC-4180 came out, it didn't even accurately describe how to read csv produced by Excel - which was already inconsistent in line endings and character set between office versions and platforms. The w3c spec at least tried to offer a model which would parse junk csv if you could guess the metadata (by eg scanning for the BOM, \" vs ""[^,], and so on)

When I worked on this stuff early 2010s, if you wanted to produce a non-ascii csv _download_ that could be opened by all office/openoffice variants you were out of luck. UTF-16LE-with-BOM, as I recall, would work in _most_ office variants but not consistently even across minor version changes in Office for OSX - so it was just a roll of the dice. We offered multiple download formats which _could_ handle this but csv was required by some customers.

Anyone saying csv is easy never worked with it in an international context.

ineedasername 2021-08-19 02:19:59 +0000 UTC [ - ]

Under many scenarios it's not currently possible. As one example, I use SQL Developer extensively, and neither its XML nor its JSON exports include metadata, so that's no more useful than CSV. Excel is a no-go for anything with > 1M rows. Exporting large datasets leaves me with CSV since it's easier for recipients to parse than XML or JSON without additional work. And querying Oracle from within Python or R suffers a big performance hit over doing it directly from SQL Developer, so for one-off projects I don't bother building a pipeline. (Repeat projects I'll build and schedule for downtime, or on a different PC.)

kazinator 2021-08-18 18:04:10 +0000 UTC [ - ]

Let me define a sane CSV standard:

1. Encoding is UTF-8.

2. Header line with column names is mandatory. If there are no column names, the first line must be blank.

3. Each record is a line. A line is defined according to the operating platform's text file format.

4. In the light of (3) CSV does not dictate line endings and does not address the conversion issue of text files from one platform being transferred to another platform for processing. This consideration is a general text issue, off-topic to CSV.

5. CSV consists of items separated by commas. An item may be:

5. a) a JSON number, surrounded by optional whitespace. Such an object may be specially recognized as a number by the CSV-processing implementation.

5. b) the symbol true, false or nil, optionally surrounded by whitespace. These symbols may have a distinct meaning from "true", "false" or "nil" strings in the CSV-processing implementation.

5. c) a JSON string literal

5. d) any sequence of printable and whitespace characters, other than comma or quote, including empty sequence.

6. In the case of (5) (d), the sequence is interpreted as a character string, after the removal of leading and trailing whitespace. (To preserve leading and trailing whitespace in a datum, a JSON literal must be used.)

7. In (5), whitespace refers to the ASCII space (32) and TAB (9) character. If there are any other control characters, the processing behavior is implementation-defined. Arbitrary character codes may be encoded using JSON literals.
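
For illustration, a rough Python sketch of how a single item might be classified under rules 5-7 above; splitting a record into items is assumed to have happened already, and the number pattern is only an approximation of JSON's grammar:

    import json
    import re

    _NUMBER = re.compile(r'-?\d+(\.\d+)?([eE][+-]?\d+)?$')
    _SYMBOLS = {"true": True, "false": False, "nil": None}

    def decode_item(item):
        s = item.strip(" \t")                 # rule 7: strip space and tab only
        if _NUMBER.match(s):
            return json.loads(s)              # rule 5a: JSON number
        if s in _SYMBOLS:
            return _SYMBOLS[s]                # rule 5b: true / false / nil
        if s.startswith('"'):
            return json.loads(s)              # rule 5c: JSON string literal
        return s                              # rules 5d/6: bare string, trimmed

    print([decode_item(x) for x in ['42', ' true ', '"a, b"', ' plain text ']])
    # [42, True, 'a, b', 'plain text']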

ArsenArsen 2021-08-19 00:54:54 +0000 UTC [ - ]

AFAICT, XML fulfils all of the requirements of this.

It is self describing, rigorously defined for the machine, highly flexible, highly extensible, human readable, text-based and an open format, and it is only mentioned as a storage format in the post, not even nearly doing it justice!

XML is complex, but there's already very established libraries for handling it, so that shouldn't be an issue, and that's the only drawback XML has.

Hell, even alternatives like JSON/YAML + JSONSchema exist, still providing incredibly rigorous validation, but staying human readable and ubiquitous.

P.S. I think a bigger failure here is trying to generalize the concept of spreadsheets, rather than using a common encoding (XML) for domain-specific data formatting.

mr_toad 2021-08-19 01:02:51 +0000 UTC [ - ]

Too easy to produce bloated files. People do dumb things like put field attributes and metadata on every row of the output, so you end up with files that can be several times the size of the raw data.

Parsing times are often horrible.

There’s no standard for tabular data. You invariably need some overly complicated XML map, because people can’t resist the temptation to over-engineer.

ArsenArsen 2021-08-19 01:11:52 +0000 UTC [ - ]

I had anticipated the overhead of XML being brought up, hence the other suggestions there, which is anything + jsonschema. jsonschema was used for illustrative purposes; it's a lot more powerful than most use cases call for, and it pays for that with a very verbose and long-winded syntax. It'd certainly be an alternative with less overhead, though.

I've not benchmarked XML parsing times in a very long time, I'd be interested in seeing the numbers now.

Tabular data is barely data, honestly, hence my PS, I don't think spreadsheets in general are a very good way to store anything.

And yes, absolutely, overengineering is bound to happen, which is unfortunate, but I'm not sure if it really can be avoided while still keeping many of those upsides (especially the rigorous definitions)

selfhoster11 2021-08-19 00:58:53 +0000 UTC [ - ]

Most XML is not human-readable. There's just too much line noise, and most of what's emitted has a weird schema that's difficult to parse using Human Brain 1.0.

ArsenArsen 2021-08-19 01:08:40 +0000 UTC [ - ]

I don't think it's much worse than CSV, with its inflexible structure that doesn't allow any formatting to be inserted. As for weird schemas, that really is an implementation-specific issue. This example was sketched up in minutes and already reads better than most spreadsheets I've seen over the years!

  <bankstatement>
      <transaction date="2021-08-19" description="Hello, world!">
          <debit commodity="USD">100</debit>
          <balance commodity="USD">237.87</balance>
      </transaction>
  </bankstatement>

selfhoster11 2021-08-19 03:10:54 +0000 UTC [ - ]

While weird schemas are an implementation-specific issue, the fact remains that most XML that I've seen in commercial settings is exactly as unreadable as I described. It's harder to go against the grain of the status quo when very few people accompany you on the journey.

simonw 2021-08-18 18:00:31 +0000 UTC [ - ]

One of my goals with https://datasette.io is to offer a better alternative for publishing data than sharing a link to a CSV file.

The trick is that if you compile data into a SQLite file and then deploy the Datasette web application with a bundled copy of that database file, users who need CSV can still have it: every Datasette table and query offers a CSV export.

But... you can also get the data out as JSON. Or you can reshape it (rename columns etc) with a SQL query and export the new shape.

Or you can install plugins like https://datasette.io/plugins/datasette-yaml or https://datasette.io/plugins/datasette-ics or https://datasette.io/plugins/datasette-atom to enable other formats.

kthejoker2 2021-08-18 18:59:58 +0000 UTC [ - ]

As an avid Datasette (and Simon Willison!) fan,

> then deploy the Datasette web application

Is a huge hurdle for non-technical folks holding on to their CSV workflows.

I think there is no barrier low enough that CSV cannot limbo beneath it.

simonw 2021-08-18 19:06:49 +0000 UTC [ - ]

Yeah I've put a ton of effort into that, but it's still nowhere near as easy as I would like.

hestefisk 2021-08-18 15:20:10 +0000 UTC [ - ]

20 years ago, clever consultants said the same when XML was the “next big thing”. I remember reading this book called “The XML Revolution”, which boldly stated that the age of CSV and other “obscure” formats was finally over. Here we are 20 years later with the same rhetoric. CSV isn’t pretty, no, but it’s universally acceptable and works quite nicely on a Unix command line with cat, grep, and cut, so I am sure it will stick around for longer.

crdrost 2021-08-18 16:32:54 +0000 UTC [ - ]

CSV parsing is not as bad as it sounds and can be a decent fizzbuzzy interview question.

The real problem with CSV is the lack of vision of data formats in general. Inside your programming language you need to say that you want to read a CSV? You need to import a different library and change both the parse code and your consumption code to instead do JSON? You need to change everything to use futures in order to stream the results? Are you out of your damn mind?

So now that disk space is so cheap it would make sense for any file format to just begin with a single few-kilobytes line that defines the parser for the coming file. Could be sandboxed, WASM-y or something, could be made printable of course... Sure, it might not be possible to get the full SQLite library in there but you could at least make it free to switch between JSON and BSON and CSV without having to recompile the software or force someone to design modular file input systems. Somehow the only flexible containers that hold different data structures are video container formats that do not care about what codec you used. They "get it"— can the rest of us?

hcarvalhoalves 2021-08-18 15:39:23 +0000 UTC [ - ]

1. I wouldn't bet Avro, Parquet, Arrow, or other formats that require a library to parse are going to _entirely_ replace CSV. Those are broader in scope, but I can't say they are "replacements" if they don't share the simplicity of plain text data following a few conventions – much like how PDF is not a "replacement" for Markdown, but enables higher-fidelity content.

2. I would feel safer storing longer-term data in CSV than in a binary format w/ a complicated spec. Having to make sense of compressed, columnar padded data sounds worse than parsing CSV.

3. Although it's easy to point at corner cases, I don't remember the last time I couldn't figure out how to parse a file because of inconsistent quoting or exotic char encoding – and I've spent a good amount of the past 13 years exchanging CSVs and TSVs w/ 3rd parties full of horrible legacy. Asking those 3rd parties to send me an Avro/Parquet/whatever file would've made the project fail or take 10x longer.

There's a reason why CSV stuck around so long, and the alternatives make different trade-offs but miss the pros.

lwn 2021-08-18 17:32:49 +0000 UTC [ - ]

  U+FEFF"aaa","b CRLF
  bb","cc"c" CRLF
  zzz,yyy,xxx
Just an example of the wonderful world of CSV \(.)/ which I've used quite a lot, because it enables quick and easy (dirty) data dumps/ exchanges. But using it as a data exchange format between multiple parties often leads to problems, such as above.

rexreed 2021-08-18 15:15:11 +0000 UTC [ - ]

CSV will never go away. But surely we can build better systems to handle all the complexities and variability in CSV? If we can handle image recognition with all its complexities, we can handle the more constrained variety in CSVs. There are only so many variations to CSV files, and perhaps a better preprocessor is the answer here versus asking for the near-impossible task of "retiring" CSV.

_trampeltier 2021-08-18 20:59:06 +0000 UTC [ - ]

CSV is perfectly fine for bringing data from one app to another. It is easy. Everybody can fix things with any editor. You can paste files together, you can separate files after x lines. Of course not everything is perfect, and Microsoft made it even worse in the German Office: the separator is not even ",", it is ";". The author writes about problems with dates; I think we have all seen Excel translate dates to funny floats and so on. But his problem with the date format is not a CSV problem - the official standard is just not accepted in daily life, and Americans have the most terrible ways of writing dates anyway; there, all is lost. CSV is almost perfect for the use case of bringing data from one app to another without trouble.

hermitcrab 2021-08-18 19:50:27 +0000 UTC [ - ]

I feel the author's pain. I have written a CSV parser from scratch and despaired at the lack of a decent standard for line endings, escaping, encoding etc (Excel and Numbers don't even agree on how to import the same CSV file). But CSVs aren't going away any time soon and I think the author is naive to suggest vendors should stop supporting CSV.

For a start, think about it from a game theory perspective. If I keep CSV import/export in my Easy Data Transform product, while all my competitors with data transformation software (Alteryx, Knime etc) remove it from theirs, it gives me a big sales advantage. What advantage do I get for removing already working CSV import/export code? Nothing.

kcartlidge 2021-08-18 19:39:53 +0000 UTC [ - ]

CSV files can be made much nicer by the simple switch to tab-delimited. It doesn't solve all the issues by any means, but makes just enough of a difference to be 'okay'.

Whatever is used, I like that tab-delimited (or even csv) is a human-readable and always-machine-readable long-term data format in the same way as ASCII (or Markdown etc) is for text. A hundred years from now, assuming the storage medium is still usable, the content should be easily recoverable. That may not be the case with spreadsheet files or other structured data formats.

lamontcg 2021-08-18 20:14:01 +0000 UTC [ - ]

> CSVs are built for humans, not machines

And that's why they're not going away.

Trying to add typing to them is probably missing the point that >95% of humans that use them don't need or want to understand that.

MarkLowenstein 2021-08-18 23:36:42 +0000 UTC [ - ]

All I know is that when people give me a CSV I'm really happy and I get to work on their request right away.

When they give me literally anything else I get pissed off and push their task lower in my queue.

ittan 2021-08-19 00:12:52 +0000 UTC [ - ]

I might have a different perspective than most, but I think csv as a raw access method is far better than any structured format.

The contract on what is transferred, and how it's mapped in your CSV, is developer to developer and not in code.

This is a problem and also a benefit.

As long as both systems are designed to read and write the CSV correctly (therein also lies the problem of verification), CSV works flawlessly and retains its simplicity.

ubermonkey 2021-08-19 00:04:28 +0000 UTC [ - ]

Yeah, that's a dumb idea.

Sometimes -- frequently, even -- you have to go to some least-common-denominator format to move data around. CSV is often a candidate.

When I was younger I thought we'd eventually abandon "primitive" formats like this, but in my middle years I realize their extreme utility. CSV absolutely has a place, and absolutely does a job that no other format does as quickly or as easily.

bastardoperator 2021-08-18 16:41:46 +0000 UTC [ - ]

I'm going to get downvoted, but I'll ask: who cares? You may have perfectly valid reasons to not like CSV, but it's a choice at the end of the day to use it or not. Maybe you didn't make the choice and inherited a system that does, but I'd much rather read an article about how you changed minds and/or switched to something more advantageous for you and your system than what amounts to a rant. It's not time to retire CSV; it's time to retire the mentality that just because you don't like something, it's wrong or needs to go away.

donretag 2021-08-18 16:45:37 +0000 UTC [ - ]

It is pure clickbait

jqcoffey 2021-08-18 18:35:48 +0000 UTC [ - ]

I’m going to cite “worse is better” in disagreeing completely with the author. CSV has won because it’s easy to generate, easy to read as a human, easy to parse with standard tools (when generated correctly), and really the last point is the salient one.

The format isn’t inherently flawed, though much like other less than perfect standards (I’m looking at you SMTP), the implementations frequently are.

And also, much like the author, I’ve been “professionally” dealing with data in all (most?) its forms for the length of my career (~25 years).

SPBS 2021-08-18 18:27:46 +0000 UTC [ - ]

CSV is often used as a database-agnostic data dump file. Since the author suggested sqlite as a possible alternative to CSV, I think that using sqlite as a database-agnostic data dump file actually sounds pretty good. For starters, all data is kept in one file instead of being spread out over multiple CSV files. Also, the data schema is already well defined inside an sqlite file. All you need is a tool that can read the sqlite file and write the data out accordingly to other databases depending on dialect.

edoceo 2021-08-18 15:59:11 +0000 UTC [ - ]

Why not simply use the ASCII characters dedicated for this? STX,ETX, RS, GS, etc

robbmorganf 2021-08-19 01:05:12 +0000 UTC [ - ]

I love CSV. Only change I want is some way to indicate a comment. Then I can write a human readable description of what the data means (more than just column names, but how it is calculated). It would be cool if the comment mechanism also could have a machine readable metadata system.

anodyne33 2021-08-19 00:29:08 +0000 UTC [ - ]

In e-discovery our metadata is exchanged in CSV format almost exclusively, with one caveat: the de facto delimiters are pilcrow and thorn. We've solved the quote/comma-in-data problem, but as soon as we start mixing character sets and someone writes a load file in ANSI, all hell breaks loose.

progre 2021-08-18 15:27:52 +0000 UTC [ - ]

The least fun CSV-like format I have seen was mandated by a Big Spanish Bank. It was originally a fixed-width flat file. Numerals were left-padded with zeros and had a fixed decimal (the decimal point not written). Except that someone decided they needed to open it in Excel, so semicolons were added at the end of each field. Then they discovered that some data didn't actually fit in the fixed-width fields, so certain fields were variable width. The end result was a file that looked like shit in Excel and couldn't be parsed by a fixed-width parser.

osrec 2021-08-18 16:26:14 +0000 UTC [ - ]

I can't think of another more efficient format for tabular data that's also human readable with a simple text editor... I'm personally not retiring the CSV any time soon!

PaulHoule 2021-08-18 15:16:03 +0000 UTC [ - ]

XLSX covers a lot of ground in this area and works with the market-leading spreadsheet program, and there are also great libraries to encode and decode it in Python, Java and other languages.

andylynch 2021-08-18 15:30:19 +0000 UTC [ - ]

I agree, xlsx is pretty easy to deal with outside Excel (at least until you hit 1M rows!).

politician 2021-08-19 00:22:41 +0000 UTC [ - ]

The only reason CSV is hard to parse is that text editors made it hard to type ASCII 29, 30, and 31: the group, record, and unit (field) separators. So folks had to resort to comma or pipe and newline.

If you use the right characters built specifically for this purpose, then the problems go away.
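
For illustration, a tiny Python sketch of that approach, using the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between records; escaping is deliberately ignored, which is exactly the bet this scheme makes:

    US, RS = "\x1f", "\x1e"    # ASCII 31 (unit/field) and 30 (record) separators

    rows = [["id", "name"], ["1", "Doe, Jane"]]          # commas are now harmless
    encoded = RS.join(US.join(fields) for fields in rows)

    decoded = [record.split(US) for record in encoded.split(RS)]
    assert decoded == rows
    print(decoded)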

pharmakom 2021-08-18 16:40:39 +0000 UTC [ - ]

What are good alternatives to CSV that provide all of these features:

- easy to parse

- easy to edit with a generic text editor

- easy to edit with a widely available GUI, like LibreOffice

- allow adding more data with only append operations

mongol 2021-08-18 17:27:09 +0000 UTC [ - ]

I recall ESR's book about Unix conventions and culture mentioned something I think he called stanza-based. A file format where each line is a value and records separated by empty lines. Don't think Office tools can work with that though, except as regular text files.

dredmorbius 2021-08-18 19:55:51 +0000 UTC [ - ]

Awk will eat that for breakfast no problem.

The usual problem is when you discover that what you thought was a value is in fact a record of its own, with internal structure.

The problem isn't the CSV. The problem is that data parsing is actually a Hard Problem.

The author of TFA is ... naive and misguided.

hobs 2021-08-19 03:30:04 +0000 UTC [ - ]

Gotta post https://github.com/secretGeek/csvz whenever csvs come up, csvs but with metadata!

jayd16 2021-08-18 16:13:05 +0000 UTC [ - ]

I would love to drop csv for something else but I won't trade human readable for a binary format.

In fact, I will go even further and say I want something mergeable by source control.

btilly 2021-08-18 17:05:20 +0000 UTC [ - ]

When I see someone complaining about the loss of metadata, I have to wonder if they've ever had to deal with a swamp of data from different organizations, all of which disagreed on when to use what piece of metadata in what way.

Bad CSV is a PITA, but usually you can make reasonable sense of it. Merging bad/inconsistent/conflicting metadata tends to be an open-ended nightmare with no good resolution at the end.

jojo2000 2021-08-18 20:13:12 +0000 UTC [ - ]

It's time to reinvent the wheel, but more complex and more failure-prone? No thanks! Joking aside, good alternatives exist, but they are far less ubiquitous. A binary format means problems reading it and compatibility issues... Look at how HDF5 is a complex format with a lengthy spec. Checksum and compress your CSV files; it will solve most issues.

zmmmmm 2021-08-18 23:45:21 +0000 UTC [ - ]

I guess it's nice we've reached the generation who can rant about this and not even mention that XML solved nearly all the problems mentioned and everybody hated it.

It turns out people hate bloat and complexity more than they hate all the ills of CSV put together!

Thiez 2021-08-19 03:05:12 +0000 UTC [ - ]

I'll happily take xml over csv any day, but my coworkers and our customers won't!

jszymborski 2021-08-18 16:09:04 +0000 UTC [ - ]

You will pry CSVs from my cold, dead hands.

When I have lots of data in multiple related tables where I would benefit from defined data types, I reach for HDF5 or SQLite, but there are so many nice things to say about a simple CSV for lots of simply structured data.

Super simple to stream compress/decompress, the ability to use great CLI tools like xsv, and being able to peek at the data by simply calling less or head... it's just hard to beat.

MonadIsPronad 2021-08-18 15:09:53 +0000 UTC [ - ]

Surely the successor should be SQLite. Tabular data like CSV, easy to view on different mediums, free and open source, single file databases. It has to be this, right?

puppet-master 2021-08-18 15:20:28 +0000 UTC [ - ]

Replacing an almost trivial open format with one that requires a specific 85 kLOC codebase to access does not seem like an improvement.

SQLite is wonderful, but it's definitely not a replacement for CSV

masklinn 2021-08-18 15:32:02 +0000 UTC [ - ]

> Replacing an almost trivial open format

The main feature of CSV is that it's always broken, not that it's almost trivial.

> with one that requires a specific 85 kLOC codebase to access

sqlite is available essentially everywhere. Every browser uses sqlite extensively internally, as does tons of other software on your standard desktop machine (to say nothing of mobile OSes, where it's ubiquitous). Using sqlite is not any sort of constraint.

> SQLite is wonderful, but it's definitely not a replacement for CSV

Only in the sense that it's nowhere near as bad and annoying as CSV.

JohnHaugeland 2021-08-18 15:13:34 +0000 UTC [ - ]

yeah, why not replace a dense human readable easily generated format with a third party format that requires installation of compiled binaries and an over the wire protocol

tzs 2021-08-18 15:46:15 +0000 UTC [ - ]

An SQLite dump is human readable and the data in the dump actually is in a CSV format, just surrounded with information on types and relationships. Here's the first few lines for example of a dump of an SQLite DB that I have of some temperature sensor data:

  PRAGMA foreign_keys=OFF;
  BEGIN TRANSACTION;
  CREATE TABLE temperature
  (
    ts integer,
    Tc real,
    Tf real,
    src_id integer
  );
  INSERT INTO temperature VALUES(1615165342,5.5999999999999996447,42.099999999999999644,1);
  INSERT INTO temperature VALUES(1615165350,0.6,32.999999999999998223,3);
  INSERT INTO temperature VALUES(1615165404,5.5,41.899999999999995026,1);
  INSERT INTO temperature VALUES(1615165410,-17.199999999999999289,1.0,2);
  INSERT INTO temperature VALUES(1615165435,5.5,41.899999999999995026,1);
SQLite dumps would actually be a pretty good data exchange format. As mentioned there is a CSV inside there. It's not hard to extract that if you need an unadorned CSV--some grep and sed, or grep and cut, or a few lines of scripting.

Or, if you have SQLite installed, it is easy to have SQLite itself read the dump and export it as CSV, letting it deal with things like quoting that can be a pain with grep and sed/cut. This way also makes it easy to rearrange or omit columns and to do some filtering so that data you don't care about doesn't get into the CSV.
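
For illustration, a small Python sketch of that route using only the standard library; the dump filename is a placeholder, and the table and columns are taken from the example above:

    import csv
    import sqlite3

    # Load the dump into a throwaway in-memory database, then let the csv
    # module handle quoting on the way out.
    conn = sqlite3.connect(":memory:")
    with open("dump.sql") as f:
        conn.executescript(f.read())

    with open("temperature.csv", "w", newline="") as out:
        writer = csv.writer(out)
        cursor = conn.execute("SELECT ts, Tc, Tf, src_id FROM temperature")
        writer.writerow([col[0] for col in cursor.description])   # header row
        writer.writerows(cursor)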

coding123 2021-08-18 15:09:48 +0000 UTC [ - ]

As long as its replacement opens in Numbers and Excel.

MeteorMarc 2021-08-18 15:12:15 +0000 UTC [ - ]

CSV does not open in Excel, unless you play the usual tricks first...

NBJack 2021-08-18 15:18:47 +0000 UTC [ - ]

Not sure what you're basing that claim on. It is a natively supported format, even amidst its inconsistencies, and has been for decades.

Are you referring to a particular format variant?

rutthenut 2021-08-18 15:25:05 +0000 UTC [ - ]

Excel definitely opens CSV files, and Windows file associations tend to set that up by default. However, Excel then goes ahead with changing the content to often wrongly-assumed formats. My favourite hate is how it will convert mobile numbers from a string of digits to an unusable exponential-format number. Hey, it doesn't even set column widths to make the content visible!

nojito 2021-08-18 16:29:34 +0000 UTC [ - ]

You should use Power Query to load CSVs into Excel; that way you can control the parsing of columns.

nolok 2021-08-18 15:15:44 +0000 UTC [ - ]

If Jane from accounting and Bobby from marketing know those tricks (and they do), then CSV does open in Excel, and its replacement has to do at least as well.

oytis 2021-08-18 15:16:31 +0000 UTC [ - ]

Just curious - what tricks are required? localc opens csv without an issue.

poorman 2021-08-18 15:18:23 +0000 UTC [ - ]

I would agree, CSV is probably not the way to go. But try to implement a Parquet reader and then see how nice processing a streaming gzipped CSV file can be.

johnrob 2021-08-18 15:15:24 +0000 UTC [ - ]

I find JSON “array of arrays” to be a better solution while also remaining human-readable. Especially when formatted as one array per line, analogous to CSV.

xtracto 2021-08-18 15:39:09 +0000 UTC [ - ]

Right on. And with the inclusion of a JSON-schema header, it makes data parsing a breeze:

    { "$schema": "http://json-schema.org/draft-07/schema#", "type": "array", "items": {"type":"array","items": [ { "type": "number" }, { "type": "string" }, { "enum": ["Street", "Avenue", "Boulevard"] },{ "enum": ["NW", "NE", "SW", "SE"] } ]} }
    [[3,"some", "Street", "NE"],
    [4,"other", "Avenue", "SE"],
    [5,"some", "Boulevard", "SW"]]

Existenceblinks 2021-08-18 17:50:21 +0000 UTC [ - ]

With draft 2020-12, you can even have module-like composition based on the [Compound Document spec](https://json-schema.org/draft/2020-12/json-schema-core.html#...)

Now you have tables and relations. Only a few people currently understand the usefulness of Compound Documents.

lufte 2021-08-18 18:14:37 +0000 UTC [ - ]

Parsing a CSV can be done on a line-by-line basis, while an array of arrays in JSON is not valid until you reach the end.

How would any existing JSON parser handle 75GB of data in a single array of arrays?

WorldMaker 2021-08-18 18:36:38 +0000 UTC [ - ]

Many JSON parsers already have an incremental forward-read mode (think SAX-style parsers if you are familiar with XML parser styles) where you can ask for each array inside that array of arrays one at a time as it reads them. If something unexpected happens at the end of the large outer array, such as a syntax error, you can decide at that point whether to roll back what you've already read/operated on or not.

JSON.parse() in the browser isn't that sort of parser obviously, but there are options in many languages; for example, both major .NET JSON libraries (Newtonsoft and System.Text.Json) have ways to parse that way. Similar examples exist in other languages too.
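
For a Python example of that incremental style, the third-party ijson package can walk a huge top-level array one element at a time; the filename here is a placeholder and the file is assumed to look like [[1,2,3],[4,5,6],...]:

    import ijson  # pip install ijson -- streaming JSON parser

    with open("rows.json", "rb") as f:
        # "item" selects each element of the top-level array, yielded lazily,
        # so the whole file never has to fit in memory at once.
        for row in ijson.items(f, "item"):
            print(row)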

lufte 2021-08-18 20:07:21 +0000 UTC [ - ]

Interesting. I don't know of any similar option in Python's json module, which is the one that I mostly use, but good to know this exists.

temp8964 2021-08-18 19:37:23 +0000 UTC [ - ]

I use CSVY (CSV with YAML frontmatter) for my work. It is supported in R data.table fread()/fwrite().

It is just CSV with a little bit of information at the top. Great for defining column types, so that your reader does not need to guess them.

See: https://csvy.org/

EthicalHacker03 2021-08-18 23:02:11 +0000 UTC [ - ]

Use fixed width files instead... along with a good old COBOL copybook. Nothing beats parsing fixed width records performance-wise. You can auto-generate staging and persistence tables, ETL jobs and quality control with just the copybook as input.

dasil003 2021-08-18 15:27:37 +0000 UTC [ - ]

Interesting thought experiment that feels good after struggling with some thorny data corruption issues (or whatever inspired this), but this is what they call a boil-the-ocean problem. There's no point in thinking about CSV beyond the impact on things that you directly touch, as it will never rise above the background noise of the problems faced by all the people whom you hope to inspire.

jti107 2021-08-19 00:11:00 +0000 UTC [ - ]

Never gonna happen. The company I work for has CSV files from tests on products they conducted almost two decades ago. They went to a proprietary solution for a few years, that company went bankrupt, and we can no longer open certain files because the license was restricted to Sun servers. We've learned the hard way that open systems, even if suboptimal, are usually the right choice, especially when you're trying to store/archive data.

We've resorted to CSV files + readme + JSON files in git for version control.

punnerud 2021-08-18 15:26:10 +0000 UTC [ - ]

There is a CSV standard (RFC 4180); the problem is that a lot of programs don't follow it. DataGrip doesn't quote all text fields, and Python removes some of the \r\n or \n\r (I don't remember which).

As long as you are following RFC4180 it works.

I ended up exporting from Oracle DB using JSON and converting to CSV for one of our contractors to be able to import the data to MongoDB.

philistine 2021-08-18 16:37:20 +0000 UTC [ - ]

Getting rid of human-readability is doomed to fail at getting any traction. CSV being human-readable is the sole reason for its pervasiveness.

What we need is something akin to Strict Markdown. Something that qualifies every edge case to produce a strict CSV that can encompass human-readable metadata within the strict delimiters.

robochat 2021-08-18 20:28:18 +0000 UTC [ - ]

I know that there's an xkcd comic about this, but I think that the best solution would be to create a new text-based format with a new extension, i.e. "bsv" - better separated values. This new format would either be much stricter than CSV or have a compulsory header that defines the file's format (like the text encoding, separator character etc).

If it is stricter, it would have one type of field separator that is not the comma, since some locales use commas as decimal separators (I'm looking at you, France), but something like '|'. It would insist that dates were ISO 8601. It could define how fields can be escaped and quoted - although I would prefer if quoting was kept to a minimum. The format should also allow for comments, i.e. #, so that people can comment their datasets inside the same file.

Alternatively or in addition, it could have some header lines:

1) A header that defines the encoding, separator, decimal separator, quote character, escape character, line ending character, date format ...

2) A header that defines each column's name

3) A header that defines each column's data type and formatting

4) A header that defines each column's unit like m/s or kg - ok, this is a bit of a stretch but it would be great to have.

or some variation of the above.

Fundamentally, this bsv format would still be csv and most programs would still be able to read it with the parsers that already exist or be quickly adapted to read it. It could still be easily edited by hand but the metadata would be present.

I suspect that this is just a pipe dream because people would find hundreds of ways to break it, but TOML took off and that didn't exist so long ago.

breck 2021-08-18 20:46:46 +0000 UTC [ - ]

I've played around with this (https://jtree.treenotation.org/designer/#standard%20iris)

I don't think you can make a breakthrough through syntax alone. I think you've got to integrate some type of live semantic schema, something like Schema.org. If I used "bsv" and didn't just get a slightly better parsing experience but also got data augmentation for free, or suggested data transformations/visualizations, et cetera, then I could see a community building.

I think perhaps a GPT-N will be able to write its own Schema.org thing, using all the world's content, and then a BSV format could come out of that.

41209 2021-08-18 19:19:48 +0000 UTC [ - ]

Why not use YAML?

YAML is my go-to format when I need something human readable and somewhat editable (though it's very easy to ruin YAML spacing).

What's the real motivation behind an article like this? I don't imagine anyone has serious trouble with CSVs; they tend to just work.

Sn0wCoder 2021-08-18 16:07:47 +0000 UTC [ - ]

While it’s great to dream I hope the CSV file stays around until I retire. Still the best way to get ‘simple’ data out of excel and uploaded on the front end (papa parse). More complicated data most likely should be using another format. I guess there is always TSV if the CSV goes away :)

jrm4 2021-08-18 17:10:47 +0000 UTC [ - ]

Ha. Not gonna happen; for the same reason Python took off over more "flexible" languages, for the same reason people move from JSON to YAML for configs, etc.

Languages and formats are not fundamentally about computers or efficiency, they're about people. Carry on.

andylynch 2021-08-18 15:24:22 +0000 UTC [ - ]

The author seems to be advocating for something like ISO 20022, which is great if you actually need it, but complete overkill if you aren't doing the kind of things it's made for (and even then, it's still a bit painful). CSV is at the other extreme, but this is also why it's so useful.

zz865 2021-08-18 22:35:16 +0000 UTC [ - ]

As an aside, XML really was great though; it's a pity it fell out of fashion. Schemas, XPath - so many great solutions to problems that we're still messing with in JSON, YAML, etc.

anigbrowl 2021-08-18 23:08:44 +0000 UTC [ - ]

I really wanted to like XML but every book I picked up about it was so awful, and every XML browser program I tried so unbelievably slow, that I just gave up trying to figure it out when it was still current. I didn't need it for work; I just thought the concept was cool and it'd be a fun thing to know. Perhaps people were so eager to prove it was powerful that they ended up making it seem like something for masochists.

bernardv 2021-08-18 20:21:15 +0000 UTC [ - ]

If you want to design yet another ‘better’ file format, go to it. But CSV files are fine the way they are, thanks. If I need to preserve metadata, there are many tried and true options available to me.

mybrid 2021-08-18 22:30:18 +0000 UTC [ - ]

It will be easier to get all the OS creators to normalize new line to one thing. Then pick a text delimiter that is reserved solely for delimiting, like say '|'.

joelthelion 2021-08-18 17:36:31 +0000 UTC [ - ]

No love for parquet and/or feather?

They have open-source implementations in many languages, are much faster to load than csv, are natively compressed, are strongly typed and don't require parsing...

There are few reasons to continue using csv in this day and age.

mrvenkman 2021-08-18 16:14:52 +0000 UTC [ - ]

> While many programs can’t read or write Excel spreadsheets, almost anything can read and write CSVs, and a human can open a CSV file in any text editor and understand roughly what it contains.

LibreOffice works on Windows.

Google Sheets does a great job too, no?

osigurdson 2021-08-18 15:37:58 +0000 UTC [ - ]

HDF5 is efficient for large datasets and has its use cases, but being binary (and fairly easy to corrupt) it isn't a reasonable general replacement for CSV. It also has limitations, such as that it cannot be read while it is being written.

mongol 2021-08-18 16:15:54 +0000 UTC [ - ]

I never thought of SQLite as a data transfer format before. I know it is frequently used as an application storage format, but to use it as a data transfer format is a really interesting use case. How much is it used like that in practice?

laichzeit0 2021-08-18 17:19:04 +0000 UTC [ - ]

Lack of types is my biggest gripe with CSV. I think just being able to specify types in column headers would be a win. E.g. “column1:int,column2:string,column3:datetime”. Type inference has bit me too many times.
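
For illustration, a rough Python sketch of that idea; the name:type header syntax and the handful of supported types are hypothetical:

    from datetime import datetime

    CONVERTERS = {
        "int": int,
        "string": str,
        "datetime": datetime.fromisoformat,
    }

    def parse_header(header_line):
        specs = []
        for col in header_line.split(","):
            name, _, typ = col.strip().partition(":")
            specs.append((name, CONVERTERS[typ or "string"]))   # default: string
        return specs

    specs = parse_header("column1:int,column2:string,column3:datetime")
    row = ["42", "hello", "2021-08-18T17:19:04"]
    typed = {name: convert(value) for (name, convert), value in zip(specs, row)}
    print(typed)   # {'column1': 42, 'column2': 'hello', 'column3': datetime(...)}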

7952 2021-08-18 17:29:39 +0000 UTC [ - ]

Maybe have a format based on postgres types? They are fairly comprehensive, well defined and already have a text representation.

Evidlo 2021-08-18 22:39:40 +0000 UTC [ - ]

Many of the author's complaints are addressed by CSVY:

https://csvy.org/

tbenst 2021-08-18 22:29:40 +0000 UTC [ - ]

Another beauty of CSV is its compatibility with git / versioning.

SQLite databases are in contrast a nightmare, and git merges often corrupt files.

spiritplumber 2021-08-19 00:30:39 +0000 UTC [ - ]

No, not time to retire the CSV. I can open a CSV with an Arduino with a SD card socket soldered onto the pins.

danellis 2021-08-19 04:35:10 +0000 UTC [ - ]

The worst I've seen is CSV in which some of the fields contain CSV.

jiggawatts 2021-08-18 23:44:26 +0000 UTC [ - ]

My problem with CSV is that it has a logical, obvious structure that is practically self-evident. The very act of having to write an encoder and then a parser makes the various corner-cases inescapable, which means that it is almost impossible to imagine getting it wrong.

Yet, everybody gets it wrong. Everybody. Thoroughly, fantastically, almost unimaginably wrong.

For example, in the Microsoft world: I come across CSVs in about 5 scenarios, all of them very common, most of them designed to interact: Excel, PowerShell's Export-CSV, SQL Server, Power BI, and the Azure Portal. None of these are obscure. None of these use CSV infrequently. Yet, they're all mutually incompatible!!!

That just blows my mind.

For example, SQL Server will output string ",NULL," to represent a null field instead of just a pair of commas (",,"), so every other tool will convert this to the string "NULL", which is not a null.

PowerShell helpfully outputs the "type" of the value it is outputting, like so:

    PS C:\> dir | ConvertTo-Csv
    #TYPE System.IO.DirectoryInfo
    "PSPath","PSParentPath","PSChildName","PSDrive","PSProvider",...
Excel can't open such files! It is entirely unfathomable that the PowerShell team wrote this code and never once double-clicked the resulting CSV to see if it opens successfully in Excel or not. Absolutely mindblowing!

It just goes on and on.

Different quoting rules. Random ability/inability to handle new lines. Different ways of handling quoted versus unquoted string values. Encoding is UTF-8 by default or not. Handling of quote characters within strings. Etc, etc...

Basically, within one vendor's ecosystem, flagship applications have at best a 50:50 chance of opening arbitrary CSV files.

Don't even get me started on the inherent ambiguities of the format, like interpreting dates, times, or high precision decimals, etc...

Oh, and before I forget: the SQL Server Integration Services team wrote lengthy articles on how their engine can process CSV files faster than the competition. Why is this a feature? Because CSV is a woefully inefficient format and processing it fast is an achievement.

Lastly: By default, SQL Server cannot export table data to a flat data file and round-trip it with full fidelity, in any format. Exchanging just a couple of tables between servers is... not fun.

So yes, a replacement format with an efficient, high-fidelity binary format is long overdue.

csours 2021-08-18 16:00:28 +0000 UTC [ - ]

I think there are two conversations here: known, working CSVs as part of a workflow, and CSVs from random, unknown sources.

There's no such thing as schema-less - there's undefined schema

mrlonglong 2021-08-18 16:31:38 +0000 UTC [ - ]

CSV files are great, but only if things like dates are parsed correctly. Always validate, validate and validate again; that's the critical thing when dealing with CSV files.

tester756 2021-08-19 00:01:26 +0000 UTC [ - ]

What if CSV had an optional type-name hint next to the column name, with the default being text?

Like

Id(int), Name, LastName, BornAt(date)

1, asd, dsa, 1992-07-12

shezi 2021-08-18 17:50:59 +0000 UTC [ - ]

This is a pet peeve of mine: CSV was a bad format already when it was created, because it intermixes data with metadata, i.e. it uses characters that can appear as either data or control characters, leading to escaping issues. This is in addition to being underspecified as to which characters are used as control characters.

It was outdated from the start because ASCII already has specific characters for file, group, record, and unit separation. Using these would give broad compatibility and a wider feature set (e.g. more than one table in a file) while retaining all the benefits of CSV.

kissgyorgy 2021-08-18 15:55:21 +0000 UTC [ - ]

Once my CSV parser blew up because there was a monetary value bigger than $1000 (I mean bigger than $1,000) in a field and the field values were not quoted.

bluedino 2021-08-18 16:33:18 +0000 UTC [ - ]

Garbage in, garbage out.

s_dev 2021-08-18 17:50:44 +0000 UTC [ - ]

CSVs will never go away. They're simply very versatile and very simple. That magic combination is extremely rare in any technology.

m1234_ 2021-08-18 20:57:05 +0000 UTC [ - ]

As someone who works for a company in the finance sector that provides services that several large banks, funds and financial institutions use - I can say with almost certainty that CSVs won't go away anytime soon. Most of our data sharing from and to these companies is via CSV files.

worik 2021-08-19 01:08:45 +0000 UTC [ - ]

Whinging a bit

There are plenty of formats available. CSV is useful. Not perfect.

TrackerFF 2021-08-18 15:07:28 +0000 UTC [ - ]

I agree with many of the problems - but luckily many providers (of data) use the first lines to provide a header of metadata.

habitue 2021-08-18 15:07:12 +0000 UTC [ - ]

It's probably not news to anyone who works with data that CSVs are inferior to other file formats like parquet.

I think the valuable insight here is that there needs to be a meme / movement that CSV is bad or deprecated. That's what's actually going to put the nails in its coffin, not private griping from developers when they get CSVs.

I'm all for it. Down with CSV :)

Sanzig 2021-08-18 15:14:37 +0000 UTC [ - ]

The thing is, I need a hefty library to write parquet. I can write a CSV with a couple fprintf statements and a loop. A parser isn't much harder if we can make assumptions about CSV dialect ahead of time.

If I'm trying to make an exportable format for a data logger with an SD card running on an ARM microcontroller, it doesn't get much easier than CSV. Sure, I could save space by rolling my own binary format, but then I have to provide a PC application to read it (and realistically, the user is probably just going to want that application to dump to a CSV anyway!).

I agree that for many use cases there are much better alternatives, but one of the reasons CSV is so popular is because it's so simple. It shouldn't be used for multi-gigabyte datasets, but for many simple use cases it works great.

dingosity 2021-08-18 16:43:32 +0000 UTC [ - ]

No, you can't write CSV with a couple of fprintf statements in a loop. If your data has a comma or a double quote in it, the field needs to be surrounded by double quotes (and the embedded double quote needs to be doubled). Ditto if you have leading or trailing spaces.

That is... if you're following the RFC 4180 guidance for CSV. Microsoft has another internal standard that occasionally gets referenced.
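
A minimal sketch of the quoting rule described above (in Python rather than C, but the logic is the same), covering commas, double quotes, and line breaks:

    def csv_field(value: str) -> str:
        """Quote a field per RFC 4180: wrap it in double quotes if it contains
        a comma, a double quote, or a line break, doubling embedded quotes."""
        if any(c in value for c in ',"\r\n'):
            return '"' + value.replace('"', '""') + '"'
        return value

    def csv_row(fields) -> str:
        return ",".join(csv_field(f) for f in fields) + "\r\n"

    csv_row(["plain", 'say "hi"', "1,000"])
    # 'plain,"say ""hi""","1,000"\r\n'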

drewmate 2021-08-18 15:10:22 +0000 UTC [ - ]

I'm with you! But what does a hot new startup do when a rich old legacy corp sends them a data dump that isn't a custom SQLite database? Slowly but surely, the startup begrudgingly uses CSV "just this once" until they grow into the legacy corp that sends out files in CSV because "we've done it this way forever."

And the cycle continues.

newbamboo 2021-08-18 15:05:29 +0000 UTC [ - ]

“I’m not going to advocate for any one successor file format here, but there are certainly a lot of contenders. The big data ecosystem gave us formats like Avro, Parquet, and Arrow that are widely used as an intermediate representation when transferring data between systems. HDF5 is widely used in the scientific computing community.”

Xkcd: https://m.xkcd.com/927/

croo 2021-08-18 15:14:43 +0000 UTC [ - ]

Yeah, okay, funny, but widely used? I've never heard of any of these formats. Can someone give a short and biased opinion on which one is the best?

xtracto 2021-08-18 15:09:17 +0000 UTC [ - ]

USB-C (alt text)

dataminded 2021-08-18 15:14:29 +0000 UTC [ - ]

I've been moving my organization to SQLite and Parquet, depending on the use case.

recursivedoubts 2021-08-18 16:28:11 +0000 UTC [ - ]

no

CSV is the A-10 Warthog of data formats.

anshargal 2021-08-18 17:33:32 +0000 UTC [ - ]

The CSVJ open standard addresses many common problems of CSV. It is basically a CSV where each row is encoded as JSON.

Unfortunately, it is not widely adopted yet, and language support still needs to improve.
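
I'm not sure of the exact spec, but a minimal sketch of the general idea in Python, assuming each row is simply a JSON array on its own line:

    import io
    import json

    # Each row is a JSON array on its own line, so quoting, escaping, nulls,
    # and numbers are defined by JSON rather than by a CSV dialect.
    def write_rows(rows, fh):
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    def read_rows(fh):
        return [json.loads(line) for line in fh if line.strip()]

    buf = io.StringIO()
    write_rows([["name", "price"], ["widget, large", 1000.5], ["gadget", None]], buf)
    print(buf.getvalue(), end="")
    # ["name", "price"]
    # ["widget, large", 1000.5]
    # ["gadget", null]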

mark-r 2021-08-18 15:36:54 +0000 UTC [ - ]

They complain about the loss of metadata, but never get into specifics. I'm surprised they didn't mention the biggest example of this. Excel is forced to guess the type of input data it's reading, and sometimes it guesses wrong and messes up your data. A number of genes had to be renamed because too many genetic markers were being interpreted as dates: https://www.theverge.com/2020/8/6/21355674/human-genes-renam...

unixhero 2021-08-18 16:41:15 +0000 UTC [ - ]

No

Kivutar 2021-08-18 16:15:57 +0000 UTC [ - ]

No.

JohnHaugeland 2021-08-18 15:14:09 +0000 UTC [ - ]

LOL, no

The only appropriate replacement for CSV would be a better-defined CSV that gave hard requirements for things like field quoting.

zepolen 2021-08-18 15:12:47 +0000 UTC [ - ]

CSVs are plain text, and that's huge, plus they have a few other advantages:

1. You can read them without any special software

2. Streamable row by row

3. Compress well

To be honest, most of the points in this article could be addressed by standardizing a method of defining a schema for CSVs. It could even be backward compatible by appending the definition as metadata to the header columns or as a separate file.

One thing that would be good to have is a standardized method of indexing CSVs so that random access is possible too, though that would be more involved.
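
For files with no embedded newlines inside quoted fields, such an index can be as simple as a list of byte offsets. A minimal sketch in Python (the helper names are hypothetical):

    import csv
    import json

    def build_row_index(csv_path, index_path):
        """Record the byte offset of every row so a reader can seek straight
        to row N. Assumes no embedded newlines inside quoted fields; otherwise
        the offsets must come from a real CSV parser."""
        offsets = []
        with open(csv_path, "rb") as fh:
            while True:
                offsets.append(fh.tell())
                if not fh.readline():
                    offsets.pop()  # the last offset points at EOF
                    break
        with open(index_path, "w") as out:
            json.dump(offsets, out)

    def read_row(csv_path, index_path, n):
        with open(index_path) as idx:
            offsets = json.load(idx)
        with open(csv_path, "rb") as fh:
            fh.seek(offsets[n])
            return next(csv.reader([fh.readline().decode("utf-8")]))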

jslaby 2021-08-18 16:29:06 +0000 UTC [ - ]

I think an ideal solution would be CSV with JSON metadata elements as the header. If a program supports the header, then you get the metadata. If not, no big deal; you just get an ugly header.

zepolen 2021-08-18 16:58:48 +0000 UTC [ - ]

Perhaps, though maybe something simpler and more readable would be easier to get people to use:

    name::string,number of legs::int,height in meters::float,date of birth::date(MM/DD/YYYY),email address::email,website::url
    joe,2,1.76,12/12/1999,joe@joe.com,https://www.joe.com
    bob,1,1.84,12/12/1944,bob@vietnam.com,null

j45 2021-08-19 04:09:55 +0000 UTC [ - ]

CSV isn’t going anywhere, for better or worse.

mastrsushi 2021-08-19 03:05:20 +0000 UTC [ - ]

Are there really new projects stretching the limits of CSV out in production?

Everyone knows it’s not capable of storing anything more than table data. It was never meant to be much more.

It doesn’t really store objects well. Writing the values of a composite object would require following some agreed-upon field order. It would go past human readability and just be a bloated way of writing bytes with readable characters.

turtletontine 2021-08-18 20:49:52 +0000 UTC [ - ]

> Time to retire the CSV?

no ♥

anothernewdude 2021-08-19 01:11:56 +0000 UTC [ - ]

If CSV has been giving you such problems over the years, then I have no confidence in your opinion.

soheil 2021-08-19 01:07:24 +0000 UTC [ - ]

Literally, only two characters are required to create a CSV. Anything that replaces it will necessarily be more complex. So let’s keep CSV and create something better at the same time.

thrower123 2021-08-18 16:36:11 +0000 UTC [ - ]

CSVs would be fine if Excel didn't import them in the most brain-dead way imaginable by default.

throwawayswede 2021-08-18 16:29:02 +0000 UTC [ - ]

Meh. Author probably messed up some important deadline or emailed (or got emailed) some crappy CSV file and then threw a temper tantrum and decided to write this.

Every single one of their points is extremely subjective and very wrong. If you have an Excel sheet full of equations and colorful cells and then decide to export it as CSV and open it in Notepad, you really can't complain that the CSV is bad. I mean, it's obvious that each format has a set of strengths and a set of weaknesses, and also, obviously, what could be a strength to someone is a weakness to someone else. The fact that CSV is so simple to parse (almost every modern language can very easily read/write a CSV) makes it a fantastic data transfer format for every single use case I've had (this doesn't mean that other formats are less important, of course). Sure, you'll lose your Google Sheets or Excel metadata, but this is NOT what CSV is for.

What I find fascinating, though, is that the author decided it's a good idea to make a blanket statement like "Time to retire?" just because they themselves have an issue with SOME use case. I mean, the idea of using your personal needs not being met to justify arguing that we should ALL stop using CSV is so very bizarre to me; it's way beyond selfish.

parksy 2021-08-18 15:56:42 +0000 UTC [ - ]

There are better formats, but the change needs to come from the top down; I can't see it working from the grass roots. Put it this way: I'm not going to turn around and refuse a non-technical client who volunteers at a non-profit because they can only give me data in Excel rather than whatever I decide the superior format is, or tell them it's impossible when the next developer will click a few buttons and have me taken for a fool, nor will I push hard to upsell them on a data mapping service or cloud pipeline if it's well outside their needs or budget. Unlike web accessibility, which is easy to argue on morals and ethics, or mobile-first, which is easy to argue on market share and audience reach, refusing to work with an organisation's data because you don't like the very simple, very established format is a very tough sell. If Excel could natively export to JSON, that's what I'd ask for, or if their custom-built website from the '00s had a clean REST API, then it wouldn't even be a discussion.

There are workarounds, for sure: VB macros, unzipping the XLSX and parsing the XML, scripts that automate Excel-to-whatever using DCOM, or importing into an intermediary service and re-exporting into the format du jour. But that all takes time and costs money too, and often causes confusion when even spoken about. Asking for a CSV takes a couple of seconds and is easily understood by most people. Anyone experienced with importing or manipulating CSV data can deal with the variance in delimiters and escape sequences without major issues. It's a headache at times, but easier than the alternative of alienating or confusing clients who are looking for a simple solution to whatever issue they have today.

On the other hand, once the data is in the target system, if they're still asking for CSV exports, I do probe to ask why and try to figure out if there's a better way. Reporting is the usual reason, and plenty of the CRMs I work with have built-in reporting that can replicate and improve upon whatever spreadsheet they are using, and have APIs that can interface with cloud-based reporting services. But there's a lot of inertia against change in most small-to-medium organisations; not everyone is a data expert, and you can't sell someone something they can't use or understand. Ultimately people win, and the solution ends up being a balance of hopefully incremental technological improvements that they can still integrate into their day-to-day. Not every organisation is able to undergo a full digital transformation with the time, budget, and skills at hand.

I agree with the sentiment but I'd hate to see a future where everyone's locked into proprietary ecosystems - not that that is what's advocated for in the post, but we have CSV because that's what the big platforms seem to allow, not because they don't know there are better options. There's no technical reason Excel couldn't export to WordPress or SuiteCRM. Take CSV away and it gets harder, not easier, to move between platforms.

gigatexal 2021-08-19 04:36:03 +0000 UTC [ - ]

Let’s replace them with SQLite databases.
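
A minimal sketch of the one-way conversion with Python's standard library (file and table names are hypothetical; a header row and consistent column counts are assumed):

    import csv
    import sqlite3

    def csv_to_sqlite(csv_path, db_path, table):
        """Load a CSV with a header row into a SQLite table, all columns untyped."""
        with open(csv_path, newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            cols = ", ".join(f'"{c}"' for c in header)
            marks = ", ".join("?" for _ in header)
            con = sqlite3.connect(db_path)
            with con:  # commits on success
                con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
                con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
            con.close()

    # csv_to_sqlite("people.csv", "people.db", "people")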

kofejnik 2021-08-18 19:34:41 +0000 UTC [ - ]

no

dingosity 2021-08-18 16:38:02 +0000 UTC [ - ]

When I worked at Amazon, I had a t-shirt and bumper sticker made up that said "Show me again on the doll where CSV hurt you."

I can't remember a time when such a simple file format had such wild inconsistencies. Not to mention that some CSV export functions just ignore them. Put a comma inside a CSV field? Why not! Put a single double quote in a CSV field? Sure! Insist that leading (or trailing) spaces in a CSV field are semantically important? OF COURSE!

If CSV were used consistently, it wouldn't be that bad. But its apparent simplicity lulls developers into a false sense of security, which is part of what the original author seems to be saying.

moderncsv-dev 2021-08-19 00:39:02 +0000 UTC [ - ]

I'd rather not.

six0h 2021-08-19 02:20:01 +0000 UTC [ - ]

I've never seen an article say so little in so many words.

six0h 2021-08-19 02:18:48 +0000 UTC [ - ]

I'll never get the time back I wasted hoping that this article had something useful to say

bbertelsen 2021-08-18 15:58:52 +0000 UTC [ - ]

EMBRACE PARQUET.