Hugo Hacker News

A Straightforward Way to Extend CSV with Metadata

dec0dedab0de 2021-08-19 17:02:38 +0000 UTC

If we're going to change the format, why not just use the record separator and field separator characters? They've been around since ASCII.
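
A minimal sketch of what that could look like, assuming the characters meant are the ASCII record separator (RS, 0x1E) and unit separator (US, 0x1F), the latter being the field-level delimiter. The helper names `dumps` and `loads` are made up for illustration, and the scheme only works as long as no field contains either control character.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # US: separates fields within a record
RECORD_SEP = "\x1e"  # RS: separates records


def dumps(rows):
    """Join rows of string fields using US between fields and RS between records."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)


def loads(text):
    """Split text back into rows; assumes fields never contain US or RS."""
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes and line breaks inside a field need no escaping here.
rows = [["name", "note"], ["Ada", 'said "hi", then left\non a new line']]
encoded = dumps(rows)
decoded = loads(encoded)
```

The trade-off is that the result is no longer readable or editable in most text editors, which is a large part of why people reach for CSV in the first place.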

dreyfan 2021-08-19 17:07:53 +0000 UTC

It's not changing the format, that's the entire point. It's including a separate, optional, metadata file. Systems that implement the metadata can take advantage; everything else is exactly the same as it always has been.

bitwize 2021-08-19 17:07:30 +0000 UTC

Single file with an open standard binary format based on Protobuf or Cap'n Proto. This idea of using text files by default is ancient Unix cruft. Use binary formats unless you have a damn good reason not to, always, and release open-source tools for reading and processing files in your format. Then you can do some semblance of enforcing structure and types on your data.

iask 2021-08-19 16:52:36 +0000 UTC

This just makes it more complex… and then what? We end up with multiple specifications like an X12 document? No, it’s not time to retire it. It’s just another option. Just like any other delimited file format. Remember .INI files? They’re still being used when there is a need for simplicity.

If you think CSV is too complicated for your app's requirements, choose a different delimiter like pipe, or else look at other alternatives. Simple as that.

I’ve spent years building parsers for different documents in the retail juggernaut businesses.
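
For what it's worth, switching to a pipe delimiter is a one-argument change in most CSV libraries. A rough sketch using Python's standard `csv` module; the sample rows are invented for illustration.

```python
import csv
import io

rows = [["sku", "description"], ["1001", "Widget, large"]]

# Write with "|" as the delimiter; a comma inside a field needs no quoting.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
writer.writerows(rows)
pipe_text = buffer.getvalue()

# Read it back with the same delimiter.
parsed = list(csv.reader(io.StringIO(pipe_text), delimiter="|"))
```

The catch is the same as with commas: a field that contains a pipe still has to be quoted, so the delimiter choice only moves the problem to a rarer character.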

karmakaze 2021-08-19 16:24:00 +0000 UTC

Anything that requires multiple files to be passed around together is a non-starter. Even with just one file it's so hard for people to keep track of the 'current' one.

hinkley 2021-08-19 16:51:03 +0000 UTC

I think that's why the archive is part of the format. But then you have a process change where you expect people to work with the archive and not the contents. That can be done but it requires some institutional will.

For instance in a code signing situation it was socialized that only the archive was 'safe' and you used one of several tools to build them or crack them open. However that company was already used to thinking about physical shipping manifests and so learning by analogy worked, after a fashion.

swader999 2021-08-19 16:47:08 +0000 UTC

Agreed, but there are whole industries that work this way. Looking at you, banking and finance.

2021-08-19 16:26:48 +0000 UTC

41209 2021-08-19 16:14:51 +0000 UTC

Am I insane here, or does Excel generally do a very good job of handling almost all CSV files?

I don't really see a need for a metadata file, nor would I ever see Excel or other tools accepting it. The main problem is adoption: CSV isn't perfect, but it's what we have. Now if you wrote this as a member of the Excel team at Microsoft, and then Excel had the option of exporting CSV files with a metadata file, then I'd be a bit more excited.

VenTatsu 2021-08-19 17:06:05 +0000 UTC

Excel does a fairly bad job when moving data between two computers that aren't configured the same way, which kind of defeats the purpose of using a data interchange format.

Where I work we have offices in the US and in Europe, where installing a localized version of Windows will swap ',' and '.' as the group and decimal separators. When loading the value 100,002, Excel in the US will see one hundred thousand and two; in some parts of Europe it will see one hundred and two thousandths.
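
To make the ambiguity concrete, here is a small sketch that parses the same text under both conventions. `parse_number` is a hypothetical helper, not anything Excel exposes; the point is only that the separators have to be stated somewhere outside the value itself.

```python
from decimal import Decimal


def parse_number(text, group_sep, decimal_sep):
    """Parse a numeric string given explicit group and decimal separators."""
    normalized = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(normalized)


# US convention: "," groups digits, "." marks the decimal point.
us_value = parse_number("100,002", group_sep=",", decimal_sep=".")

# Convention used in parts of Europe: "." groups digits, "," marks the decimal point.
eu_value = parse_number("100,002", group_sep=".", decimal_sep=",")
```

Here `us_value` is 100002 and `eu_value` is 100.002, from byte-identical input.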

Character set handling can be just as bad: there is no good way to get Excel to auto-open a CSV file as UTF-8 that won't break every other CSV parser in existence. The only cross-platform option is ASCII. Excel will happily load your local OS encoding, likely some variant of ISO-8859, but any other encoding requires jumping through hoops.
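
A sketch of the usual hoop, assuming it is the UTF-8 byte order mark (BOM) that is prepended so Excel detects the encoding. A parser that decodes as plain UTF-8 then sees the BOM as part of the first header name, while Python's `utf-8-sig` codec strips it. The sample data is invented.

```python
import csv
import io

BOM = b"\xef\xbb\xbf"
raw = BOM + "name,city\nZoë,Zürich\n".encode("utf-8")

# Plain "utf-8" keeps the BOM as U+FEFF glued to the first header name.
naive_header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# "utf-8-sig" strips a leading BOM if present and is harmless if absent.
clean_header = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
```

So `naive_header[0]` is `"\ufeffname"` rather than `"name"`, which is the kind of breakage a consumer hits when it looks columns up by name.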

IanCal 2021-08-19 16:29:17 +0000 UTC

Excel does an atrocious job of handling CSV files. It regularly alters data, messes up encodings and either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.

Source: dealing with CSV files people exported from Excel and the horrors that flowed from there.

bitwize 2021-08-19 17:09:18 +0000 UTC

> either can't or couldn't (I haven't checked in a few years) open CSV files that start with a capital letter I.

It can't do this because it confuses such files with files in SYLK format, which was YET ANOTHER attempt to standardize spreadsheet data interchange, dating from the 80s.

swader999 2021-08-19 16:45:59 +0000 UTC

I've donated hours of my life to resolving Excel-corrupted CSV files. They haven't been satisfying hours.

Tagbert 2021-08-19 16:27:49 +0000 UTC

Excel has some major problems with CSV ingestion, though I don’t think that this proposal will address those problems.

Here are a couple of cases that I run into frequently:

* Excel is very aggressive about forcing type conversion based on its own assumptions. It will convert strings to dates or numbers, even if data is lost in the process. It will ignore quotes to convert long numeric IDs into scientific notation, which truncates the ID unrecoverably.

* Excel cannot deal with quoted strings containing line breaks. It treats them as separate records and you get truncated records and partial records on separate rows.
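
For comparison, a sketch of how a parser that respects quoting treats both cases, using Python's standard `csv` module and an invented two-line file. The float round trip at the end is only an illustration of how digits get lost once a long ID is treated as a number; it is not a model of Excel's internals.

```python
import csv
import io

# One header row plus one record whose second field contains a line break.
text = 'id,comment\n"12345678901234567890","line one\nline two"\n'

rows = list(csv.reader(io.StringIO(text, newline="")))
record_id, comment = rows[1]

# Kept as a string, the ID survives intact. Forced through a float, it does not,
# because a double cannot represent all 20 digits.
round_tripped = str(int(float(record_id)))
```

The reader returns two rows, not three, and `record_id` is still the original 20-character string, while `round_tripped` differs from it.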

isoprophlex 2021-08-19 16:39:34 +0000 UTC

No, Excel does about the worst possible job of handling CSVs.

You can't even hope to keep a file intact upon opening...

Hackbraten 2021-08-19 16:28:52 +0000 UTC

Can we please have a third file in the zip?

* format.txt

* mydata.csv

* .DS_Store

It will be there if the zip was created on a Mac, so we might as well include it in the standard.

kaeruct 2021-08-19 16:40:33 +0000 UTC

I cannot tell if this is a joke or not.

If you are serious, then what about just ignoring any files in the zip that are not specified in the standard?
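
Something like the following, as a rough sketch. It assumes the archive holds a metadata file called `format.txt` (the name used in the parent comment) and exactly one `.csv` file; the `delimiter=,` content and the `read_archive` helper are invented for illustration and say nothing about what the actual proposal specifies.

```python
import io
import zipfile

KNOWN_METADATA = "format.txt"  # name from the parent comment; an assumption about the spec


def read_archive(data):
    """Return (metadata_text, csv_name, csv_text), skipping unrecognized members."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        csv_names = [n for n in names if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError("expected exactly one CSV file in the archive")
        metadata = None
        if KNOWN_METADATA in names:
            metadata = archive.read(KNOWN_METADATA).decode("utf-8")
        csv_text = archive.read(csv_names[0]).decode("utf-8")
        return metadata, csv_names[0], csv_text


# Build a small archive in memory, including a stray .DS_Store entry.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("format.txt", "delimiter=,\n")  # hypothetical metadata content
    archive.writestr("mydata.csv", "a,b\n1,2\n")
    archive.writestr(".DS_Store", b"\x00\x01")

metadata, csv_name, csv_text = read_archive(buffer.getvalue())
```

The `.DS_Store` entry is never read, so it cannot affect the result.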

swader999 2021-08-19 16:49:20 +0000 UTC

Some notion of control data should be added to this spec. Total lines to parse and so on could be included. Perhaps a hash...
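
As a sketch of what I mean, assuming the control data is just a line count plus a SHA-256 digest of the raw bytes. The field names `lines` and `sha256` and both helpers are made up here, not part of any spec.

```python
import hashlib


def control_data(csv_bytes):
    """Compute a physical line count and SHA-256 digest for the raw CSV bytes."""
    return {
        "lines": len(csv_bytes.splitlines()),
        "sha256": hashlib.sha256(csv_bytes).hexdigest(),
    }


def verify(csv_bytes, expected):
    """Return True only if both the line count and the digest match."""
    return control_data(csv_bytes) == expected


payload = b"a,b\n1,2\n3,4\n"
expected = control_data(payload)

# A file cut off in transit fails the check.
truncated_ok = verify(payload[:-4], expected)
```

Note that this counts physical lines, which is not the same as the record count when quoted fields contain line breaks, so a real spec would have to say which one it means.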

delusional 2021-08-19 16:43:40 +0000 UTC

Amazing, a standard to define the characters we couldn't standardize on.

If you can decide this file format, couldn't you just normalize the CSV file instead?

barbazoo 2021-08-19 16:48:54 +0000 UTC

I think they're simply acknowledging the fact that there are many permutations of formats out there and what they propose is a way to communicate what your particular format is without abandoning CSV completely.

I'm assuming it's in response to the post yesterday that outlined all the things that are wrong with CSV and how other formats like Parquet are better.

hinkley 2021-08-19 16:57:06 +0000 UTC

The chaos of CSV files is older than XML. And while many of the same issues have played out in HTML, I have a strong suspicion that the particular levels of hell familiar to anyone using CSV files professionally informed some of the opinionated nature of XML. Especially since XML delves into data transport instead of just typesetting of text meant for human consumption.

I know I had specific conversations about CSV versus XML and those referred to a substantial body of literature on the topic.

2021-08-19 16:41:01 +0000 UTC

prpl 2021-08-19 16:21:48 +0000 UTC

There is CSVW