Time to retire the CSV?
radmuzom 2021-08-18 15:14:29 +0000 UTC [ - ]
shadowwolf007 2021-08-18 17:19:21 +0000 UTC [ - ]
The edge cases are a hassle but they don't become less of a hassle from a business perspective by switching to json or really any other format. We tried an experiment of using more json and eventually gave it up because it wasn't saving any time at a holistic level because the "data schema" conversations massively dominated the entirety of the development and testing time.
Obviously being able to jam out some json helped quite a bit initially, but then on the QA side we started to run into problems with tooling not really being designed to handle massive json files. Basically, when something was invalid (such as the first time we encountered an invalid quote) it was not enjoyable to figure out where that was in a 15GB file.
That said, I fully concur with the general premise that CSV doesn't let you encode the solutions to these problems, which really really sucks. But, to solve that, we would output to a more columnar storage format like Parquet or something. This would let us fully encode and manage the data how we wanted while letting our clients continue working their processes.
What I would really like to see is a file format where the validity of the file could be established by only using the header. E.g. I could validate that all the values in a specific column were integers without having to read them all.
anigbrowl 2021-08-18 22:27:47 +0000 UTC [ - ]
breck 2021-08-18 20:38:14 +0000 UTC [ - ]
Agreed. JSON lets me know something is a number. That's great, but I still have to check for min/max, zero, etc. A string? That's great, but I've got to check it against a set of enums, and so forth. Basically, the "types" JSON gives you are about 20% of the work, and you're going to have to parse things into your own types anyway.
> What I would really like to see is a file format where the validity of the file could be established by only using the header.
Are you saying something like a checksum so not only is a schema provided but some method to verify that the data obeys the schema?
If you're talking about just some stronger shared ontology, I think that's a direction things will go. I call this concept "Type the world" or "World Wide Types". I'm starting to think something like GPT-N will be the primary author, rather than a committee of humans like Schema.org.
shadowwolf007 2021-08-19 01:12:12 +0000 UTC [ - ]
A checksum would be crude and user-hostile, only being able to say "you did it wrong" but not really good at telling you what it means to do it right.
If I understand the concepts correctly then it seems like a shared ontology could potentially solve the problem in a non-hostile way.
Plus, it makes me happy because I feel like types are a real-world problem, so it is always nice if the type system could enforce that real-world-ness and all the messiness that comes along for the ride.
radus 2021-08-18 21:10:34 +0000 UTC [ - ]
shadowwolf007 2021-08-19 01:20:07 +0000 UTC [ - ]
More a byproduct of decisions made 5 - 7 years ago when the company was in raw startup mode versus a more mature roadmap.
smhenderson 2021-08-18 16:35:34 +0000 UTC [ - ]
If/when I get a data source introduced into my workflow that differs from this variant I come up with a routine to normalize it, integrate that into my workflow and move on.
vertere 2021-08-19 09:53:19 +0000 UTC [ - ]
okareaman 2021-08-18 16:43:12 +0000 UTC [ - ]
turtlebits 2021-08-18 15:58:45 +0000 UTC [ - ]
nightski 2021-08-18 16:06:47 +0000 UTC [ - ]
I'm in a similar situation, we've been using CSV for over a decade to move billions of dollars worth of product each year. It just works.
turtlebits 2021-08-18 18:59:43 +0000 UTC [ - ]
kcartlidge 2021-08-18 20:08:45 +0000 UTC [ - ]
Call me a yak-shaver, but in every language I've worked with I've written my own csv parsing library when I needed one.
It's such a trivial thing for the majority of cases (varying delimiters, line-endings, ascii/utf8, quoting, escaping, and embedding of delimiters/line-endings) that it takes barely any time at all after you've done it once in another language. Of course there are edge cases and special cases depending upon specific workloads, but if your team has its own parser (which is a small amount of obvious code) then it does indeed usually become a simple fix.
Using someone else's library sounds good, but below a certain complexity it's rarely worth it in the medium to long term, except when doing proof-of-concept or demo code, or if the domain space is complex.
magicalhippo 2021-08-18 21:45:29 +0000 UTC [ - ]
It's dead simple to whip up, and we can easily tweak it to whatever the customer's software spits out, like one field suddenly being UTF-8 encoded in an otherwise Windows-1252 file.
proverbialbunny 2021-08-18 21:22:42 +0000 UTC [ - ]
CSV is like the C of data formats. It's incredibly stable yet simple enough you can make your own variant if you need to.
nojito 2021-08-18 16:27:19 +0000 UTC [ - ]
We validate on ingestion and if there are changes upstream we can immediately triage without polluting our data warehouse.
gpvos 2021-08-18 18:28:31 +0000 UTC [ - ]
systemvoltage 2021-08-18 17:17:47 +0000 UTC [ - ]
izietto 2021-08-18 15:05:46 +0000 UTC [ - ]
CSV is a format more for humans and less for machines, but that is the use case: a format that is good enough to be written by humans and read by machines. At the moment there aren't many alternatives.
masklinn 2021-08-18 15:11:46 +0000 UTC [ - ]
Until someone gets excel to ingest and produce something in a better format, we're pretty much stuck.
Sanzig 2021-08-18 15:16:58 +0000 UTC [ - ]
turtlebits 2021-08-18 16:02:38 +0000 UTC [ - ]
The problem is 10x worse when you get CSV from one source and rely on another process to load it. I fought this problem for several days going from NetSuite to Snowflake via CSV.
asdff 2021-08-18 18:03:50 +0000 UTC [ - ]
turtlebits 2021-08-18 19:04:16 +0000 UTC [ - ]
Poor handling from standard CSV libraries. Either unable to read or unable to create for some downstream process.
rendall 2021-08-18 20:49:31 +0000 UTC [ - ]
If you stick to one delimiter, and that delimiter is a comma, and escape the delimiter in the data with double-quotes around the entry, and escape double quotes with two double-quotes, well, you have written CSV that is correct and looks correct and will be parsed correctly by literally every CSV parser.
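For what it's worth, that rule fits in a few lines; a Python sketch (which also quotes fields containing bare quotes or newlines, per RFC 4180):

  def csv_field(value):
      # Quote the field if it contains a comma, a double quote, or a newline,
      # and escape embedded double quotes by doubling them.
      if any(c in value for c in ',"\r\n'):
          return '"' + value.replace('"', '""') + '"'
      return value

  def csv_row(values):
      return ",".join(csv_field(v) for v in values)

  print(csv_row(["Smith, John", 'He said "hi"', "plain"]))
  # "Smith, John","He said ""hi""",plain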
adzm 2021-08-19 04:14:13 +0000 UTC [ - ]
Parsers are trickier if you want to be lenient, but exporters are dead simple.
masklinn 2021-08-19 06:04:13 +0000 UTC [ - ]
That’s what CSV is. That’s what happens when you ingest CSVs whose production you don’t control.
> If you [ignore everything people literally clamour for in these comments and praise csv for]
Yes i also like ponies.
rendall 2021-08-19 09:49:25 +0000 UTC [ - ]
That's really not a serious argument against CSV. Since you paraphrase in a silly way, I can do it too! Your "argument" is "Badly formatted files exist, therefore CSV bad".
Everyone "against CSV" seems to be arguing against badly formatted CSV, and leaping to the conclusion that "CSV is just bad" without much more to say about it. I'm sorry that badly formatted CSV gave you a bad time, but the format is fine and gets its job done.
"It doesn't have x, y or z feature therefore no one should be using it ever" is kind of a dumb argument, honestly.
anigbrowl 2021-08-18 22:31:40 +0000 UTC [ - ]
Me, 20 minutes later: Heh that was easy I am a genius
Me, 21 minutes later: Unicode is ruining my life T_T
Don't get me wrong, I really like CSV because it's so primitive and works so well if you are disciplined about it. But it's easy to get something working on a small dataset and forget all the other possibilities only to faceplant as soon as you step outside your front door. In the case above my experience with dealing with CSV data from other people made me arrogant, when I should have just taken a few minutes to learn my way around a mature library.
crazygringo 2021-08-19 05:01:17 +0000 UTC [ - ]
In UTF-8, the bytes for a comma and a quote only ever appear as those characters. They can't occur as parts of multibyte sequences, by design.
If you have Unicode problems, then you have Unicode problems, but they wouldn't seem to be CSV problems...? Unless you're being incredibly sloppy in your programming and outputting double-byte UTF-16 strings surrounded by single-byte commas and quotes or something...?
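A quick way to convince yourself of this (a Python sketch; the sample string is arbitrary):

  s = 'naïve, "日本語"'
  data = s.encode("utf-8")
  # Every byte of a multi-byte UTF-8 sequence has the high bit set (>= 0x80),
  # so the ASCII bytes for ',' (0x2C) and '"' (0x22) never appear inside one.
  for ch in s:
      if ord(ch) > 127:
          assert all(b >= 0x80 for b in ch.encode("utf-8"))
  # Scanning the raw bytes therefore finds exactly the real commas and quotes.
  assert data.count(b",") == s.count(",")
  assert data.count(b'"') == s.count('"')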
selfhoster11 2021-08-19 01:01:21 +0000 UTC [ - ]
jacobsenscott 2021-08-19 00:01:48 +0000 UTC [ - ]
name,position
"Smith, John"‚Manager
asdff 2021-08-19 03:31:38 +0000 UTC [ - ]
woodrowbarlow 2021-08-18 18:26:20 +0000 UTC [ - ]
astine 2021-08-18 20:53:14 +0000 UTC [ - ]
politician 2021-08-19 00:28:45 +0000 UTC [ - ]
masklinn 2021-08-18 16:24:43 +0000 UTC [ - ]
Yeah if you see a CSV import feature without a billion knobs you know you're in for a world of hurt.
If you see a CSV import feature with a billion knobs, you're probably still in a world of hurt.
romwell 2021-08-18 23:14:30 +0000 UTC [ - ]
What do you mean, there are "no guarantees"? You are in charge! You know what data you're dumping, you can see if it imports well. You can tailor your use case.
That's not the same as getting a CSV from some dump, where you have limited (if any) control over the behavior.
SpicyLemonZest 2021-08-18 17:27:40 +0000 UTC [ - ]
masklinn 2021-08-18 15:19:34 +0000 UTC [ - ]
And usually generate garbage for anything but the most trivial case, which really nobody gives a shit about. That's the main reason why CSV absolutely sucks too: you have to waste months diagnosing the broken shit you're given to implement the workarounds necessary to deal with it.
> I definitely can't do that with .xlsx.
You probably can though. An xlsx file is just a bunch of XML files in a zip.
Sanzig 2021-08-18 15:27:21 +0000 UTC [ - ]
Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
I know this is HN and we like to pretend we're all data scientists working on clusters with eleventy billion gigs of RAM, but us embedded systems folks exist too.
masklinn 2021-08-18 15:42:57 +0000 UTC [ - ]
Incorrect encoding, incorrect separators (record and field both), incorrect escaping / quoting, etc…
> If I know what my data looks like
If you control the entirety of the pipeline, the format you're using is basically irrelevant. You can pick whatever you want and call it however you want.
> Have you ever worked in embedded systems? Writing XML files and then zipping them on a platform with 32 kilobytes of RAM would be hell. CSV is easy, I can write the file a line at a time through a lightweight microcontroller-friendly filesystem library like FatFS.
You can pretty literally do that with XML and zip files: write the uncompressed data, keep track of the amount of data (for the bits which are not fixed-size), write the file header, done. You just need to keep track of your file sizes and offsets in order to write the central directory. And the reality is, if you're replacing a CSV file, the only dynamic part will be the one worksheet; everything else will be constant.
theamk 2021-08-19 00:59:23 +0000 UTC [ - ]
I think you are missing the point -- you only need to know about the generator to know about the format.
Since the parent poster was talking embedded, here is one example: a data logger on a tiny embedded board records tuples of (elapsed-time, voltage, current). You need this to be readable in the widest variety of programs possible. What format do you use?
I think the answer is pretty clear: CSV. It is compatible with every programming language and spreadsheet out there, and in a pinch, you can even open it in text editor and manually examine the data.
Using something like XLSX here would be total craziness: it will make code significantly bigger, and it will severely decrease compatibility.
pdonis 2021-08-18 15:22:53 +0000 UTC [ - ]
A bunch of XML files with opaque formats that MS constantly changes so that its competitors have to keep chasing the format.
masklinn 2021-08-18 15:25:10 +0000 UTC [ - ]
megous 2021-08-18 16:22:26 +0000 UTC [ - ]
Libreoffice handles normal UTF-8 encoded, quoted value CSV files fine. Excel not so much.
masklinn 2021-08-18 16:23:49 +0000 UTC [ - ]
marcosdumay 2021-08-18 16:00:32 +0000 UTC [ - ]
If you don't open it in Excel, you can have as strict a parser as you want, just like any other format.
But neither is going anywhere anyway.
masklinn 2021-08-18 16:43:24 +0000 UTC [ - ]
No, you can not. Because the CSV format is so fuzzy you can very easily parse incorrectly and end up with a valid parse full of garbage.
Trivially: incorrect separator, file happens to not contain that separator at all, you end up with a single column. That's a completely valid file, and might even make sense for the system. Also trivially: incorrect encoding, anything ascii-compatible will parse fine as iso-8859-*. Also trivially: incorrect quoting / escaping, might not break the parse, will likely corrupt the data (because you will not be stripping the quotes or applying the escapes and will store them instead).
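The wrong-separator case in particular fails silently; a minimal sketch with Python's csv module (the sample data is made up):

  import csv, io

  # Semicolon-separated data, as produced by e.g. a European Excel install
  data = "name;position\nSmith, John;Manager\n"

  # Parsed with the default comma delimiter: no error, just wrong columns
  rows = list(csv.reader(io.StringIO(data)))
  print(rows)  # [['name;position'], ['Smith', ' John;Manager']]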
It's like you people have never had to write ingestion pipelines for CSVs coming from randos.
romwell 2021-08-18 23:17:53 +0000 UTC [ - ]
That's because this is not what this thread is about.
The comment you're responding to is not about CSVs coming from "randos". It's for the case where that rando is you, so you can make sure the problems you mention don't happen on the generation side of CSVs.
ribosometronome 2021-08-18 20:48:27 +0000 UTC [ - ]
masklinn 2021-08-19 06:09:41 +0000 UTC [ - ]
theamk 2021-08-19 00:43:10 +0000 UTC [ - ]
That’s why CSV is absolutely beautiful - there is a huge number of applications that people really care about, and their data is constrained enough that there is no need to worry about CSV escaping or to pull in any third-party libraries.
Creating an XLSX file by hand is possible, but it would be a large amount of code and I wouldn't include it in my script; it would need to be a separate library - which means build system support, learning the API, etc...
yrds96 2021-08-18 15:33:32 +0000 UTC [ - ]
Yeah, I can generate any file with a bunch of printfs, but with CSV I don't have to read a specification, and it's possible to read it with a bunch of reads without having to use an XML or XLSX library.
masklinn 2021-08-18 15:46:14 +0000 UTC [ - ]
The only thing funnier than producing broken CSV is consuming broken CSV.
slunk 2021-08-18 16:11:02 +0000 UTC [ - ]
Is it? I think you're absolutely right that naive points of view like the one you're responding to will lead to avoidable bugs, but I'm not so sure the problem is CSV so much as people who assume CSV is simple enough to parse or generate without using a library.
masklinn 2021-08-18 16:23:20 +0000 UTC [ - ]
The simplicity of CSV is what tells people that they can parse and generate it without a library, and even more so that that's a feature of CSV. You just have to read the comments disagreeing with me to see exactly that.
acwan93 2021-08-18 15:17:39 +0000 UTC [ - ]
After working with so many retailers and online sales channels, things that are considered "legacy" or "outdated" by the HN crowd don't seem like they will go away unless both sides make a change. There are numerous articles posted on HN about how "FTP is dead" or no one uses it anymore, when that's far from the case.
Even Amazon's marketplace and vendor files are still using SFTP and EDI files. They've recently made changes, but it's been slow and hasn't had widespread adoption.
There's also the universality and "simplicity" CSV provides to the non-computer literate, and convincing them to make a change to a new standard provides itself some non-technical challenges. CSV is a bad standard, but it's the best one given what it does and its flexibility.
danellis 2021-08-19 04:30:28 +0000 UTC [ - ]
crazygringo 2021-08-19 04:58:24 +0000 UTC [ - ]
Sometimes you need to look up a single code in a 50 MB file. And sometimes you need a quick check to see if one line or a million lines changes.
It's "exception not the rule" type stuff... but it sure comes in handy to be able to check this stuff quick with basic text tools than have to run it through some binary parser. Same as JSON. But unlike protobufs for example.
danellis 2021-08-19 07:27:59 +0000 UTC [ - ]
crazygringo 2021-08-19 16:36:44 +0000 UTC [ - ]
And honestly they usually simply don't exist. In the vast majority of cases, in my experience.
grenoire 2021-08-18 15:09:46 +0000 UTC [ - ]
xtracto 2021-08-18 15:25:43 +0000 UTC [ - ]
Maybe a CSV killer would be a human readable columnar based file format.
Nevertheless, the article basically discusses the issues encountered with the manual "editability" of CSV files, not so much with its performance. It also mentions Parquet or Arrow and concedes that they require specialized tooling to read/write. If we are looking at that, then there are a lot of options such as the SQLite format or BerkeleyDB (used by some cryptocurrency projects), among plenty of others.
cryptonym 2021-08-18 15:15:54 +0000 UTC [ - ]
masklinn 2021-08-18 15:13:17 +0000 UTC [ - ]
pdonis 2021-08-18 15:21:12 +0000 UTC [ - ]
mr_toad 2021-08-19 01:06:21 +0000 UTC [ - ]
habitue 2021-08-18 15:09:45 +0000 UTC [ - ]
The best that can be said for its simplicity is that it's easy to write code that can dump data out in CSV format (and to a lesser extent, it's easy to parse it, though watch out for those variants). This is not a really strong argument; most everyone is going to use a library for serialization, and there's no reason to write your own unless it's for learning.
fighterpilot 2021-08-18 15:15:00 +0000 UTC [ - ]
xtracto 2021-08-18 15:26:49 +0000 UTC [ - ]
Until the CSV fields contain commas themselves. Even if fields are surrounded by "".
megous 2021-08-18 16:24:31 +0000 UTC [ - ]
the_only_law 2021-08-18 20:49:20 +0000 UTC [ - ]
Turns out when you have data that contains both commas and quotes things get screwy real quick. You could have quoted data like:
123,ABC,”,””,456
Where ,” is column data.
I think the standard method is to double-quote the field, but the dump sure wasn’t doing that for me, plus what happens when the data is something like abc””,?
rendall 2021-08-18 21:15:33 +0000 UTC [ - ]
That is badly formatted CSV, assuming ” is a double-quote. Double-quote (") is how you escape commas, so the 3rd entry will make a parser barf, since there is no close quote.
It should be 123,ABC,"","",456
> Turns out when you have data that contains both commas and quotes things get screwey real quick.
Not really. It's pretty logical. If the entry has even one comma, put quotes around it. If it also has quotes, double each of them. Otherwise don't worry about it.
> what happens when the data is something like abc””,?
abc"", becomes "abc"""","
breck 2021-08-18 20:41:48 +0000 UTC [ - ]
danellis 2021-08-19 04:33:47 +0000 UTC [ - ]
madaxe_again 2021-08-18 15:18:36 +0000 UTC [ - ]
Of course, excel will make just as much of a mess - when the client hands back their sheet with all of their UPCs expressed as exponents and æ€ ligatures jammed next to every apostrophe, there’s no guessing as to what happened.
In theory, it’s a great and simple format.
In practice, it causes much pain.
bobthepanda 2021-08-18 20:28:30 +0000 UTC [ - ]
asdff 2021-08-18 18:06:16 +0000 UTC [ - ]
rutthenut 2021-08-18 15:21:38 +0000 UTC [ - ]
anigbrowl 2021-08-18 22:55:34 +0000 UTC [ - ]
A while back I discovered https://dadroit.com/ which is small (in feature set) but perfectly formed (nicely designed and extremely fast, even on huge files).
godshatter 2021-08-18 16:55:14 +0000 UTC [ - ]
izietto 2021-08-18 15:18:11 +0000 UTC [ - ]
asdff 2021-08-18 18:08:42 +0000 UTC [ - ]
wutwutwutwut 2021-08-18 15:16:09 +0000 UTC [ - ]
habitue 2021-08-18 15:20:01 +0000 UTC [ - ]
psanford 2021-08-18 15:52:22 +0000 UTC [ - ]
It is a lowest common denominator format. That type of thing is incredibly hard to kill unless you can replace it with something that is simpler. Good luck with that.
lmilcin 2021-08-18 18:11:25 +0000 UTC [ - ]
Do you really think there were no other "portable" formats to exchange data and SQLite is the first one?
The reason CSV is so popular has nothing to do with technical superiority which it obviously lacks.
The reason CSV is so popular is because it is dead simple and extremely easy to integrate within practically any conceivable workflow.
And because it is a text format you can trust: even if you have no other tools, you can inspect and edit it in a text editor, which is exactly the reason why other formats like INI, XML, JSON or YAML are so popular.
How would you:
a) Run a shell command on a locked (can't install anything) PROD server to find the files and write a report of the file location, date of change and file size so that it can be easily processed by something else? Return the data as a stream on the same SSH connection?
b) Quickly add a line to an existing startup script to get a report of when the script is starting and stopping so you can import the data to Excel. You just need it for a couple of days, then you delete it.
c) Produce a report you can send to your coworker knowing they don't know how to program? And you don't want to spend time explaining to them how to make use of your superior file format?
d) You talk to another team. You need to propose a format and specify a report with as little effort as possible. You know they are not advanced technically so you keep things simple?
e) You are solving an emergency and have to cross-check some data with another team. You need to produce a file that they will be able to read and process. You are pressed for time so you don't want to start a committee right now to find a shared technology. What would be a safe choice without asking the other team for options?
f) A manager asked you for some data. You want to earn quick kudos. How will you send the data to get your kudos rather than irritating him?
These are all real life situations where CSV is being used.
As much as I understand technical arguments for SQLite, not every person is a developer. And all that technical superiority is worth nothing when they struggle to make use of the file you sent them.
I would say, the news of CSV's death are greatly exaggerated.
qwerty456127 2021-08-18 15:59:50 +0000 UTC [ - ]
> This column obviously contains dates, but which dates? Most of the world
It's time to retire local formats and always write YYYY-MM-DD (which is both the international and the Swedish standard, and the most convenient for parsing and sorting).
> A third major piece of metadata missing from CSVs is information about the file’s character encoding.
It's bloody time to retire all the character encodings and always use UTF-8 (and update all the standards like ISO, RFC etc. to require UTF-8). The last time I checked, common e-mail clients like Thunderbird and Outlook created new e-mails in ANSI/ISO codepages by default (although they are perfectly capable of using UTF-8) - this infuriated me.
> If not CSV, then what? ... HDF5
Indeed! Since the moment I discovered HDF5 I have wondered why it is not the default format for spreadsheet apps. It could just store the data, the metadata, the formulae, the formatting details and the file-level properties in different dimensions of its structure to make a perfect spreadsheet file. Nevertheless spreadsheet apps like LibreOffice Calc and MS Excel don't even let you import from HDF5.
> An enormous amount of structured information is stored in SQLite databases
Yet still very underused. It ought to be more popular. In fact every time I get CSV data I import it to SQLite to store and process but most of the people (non-developers) have never heard of it. IMHO it also begs to be supported (for easy import and export at least) by the spreadsheet apps. A caveat here is it still uses strings to store dates so the dates still can be in any imaginable format. Fortunately most of the developers use a variation of ISO 8601 conventionally.
And by the way, almost every application-specific file format could be replaced by SQLite or HDF5 for good. IMHO the only cases where custom formats make good sense are streaming and extremely resource-limited embedded solutions.
titzer 2021-08-18 17:35:39 +0000 UTC [ - ]
qwerty456127 2021-08-18 20:37:32 +0000 UTC [ - ]
Both also require colons to separate hours and minutes, and this makes them impossible to use in file names if you want to support accessing them from Windows.
I personally use the actual ISO 8601 (with the "T") wherever I can, simple YYYY-MM-DD-HH-mm-SS-ffffff where I need to support saving to the file system (but this is slightly harder for a human to read) and mostly RFC 3339 (with a space instead of the "T") wherever I need to display or to interop with tools written by other people. As for SQLite - I usually store every field (years, months,... seconds etc) in a separate integer column and create a view which adds an automatically generated RFC 3339 date/time column for simpler querying.
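A minimal sketch of that table-plus-view arrangement (table, column and value names are made up):

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.executescript("""
      CREATE TABLE events (y INT, mo INT, d INT, h INT, mi INT, s INT, value REAL);
      CREATE VIEW events_v AS
          SELECT printf('%04d-%02d-%02d %02d:%02d:%02d', y, mo, d, h, mi, s) AS ts,
                 value
          FROM events;
  """)
  con.execute("INSERT INTO events VALUES (2021, 8, 18, 15, 14, 29, 42.0)")
  print(con.execute("SELECT * FROM events_v").fetchall())
  # [('2021-08-18 15:14:29', 42.0)]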
Evidlo 2021-08-18 22:48:04 +0000 UTC [ - ]
qwerty456127 2021-08-18 23:13:12 +0000 UTC [ - ]
Yes, ISO standards are updated as regularly as it makes sense for standards to change (obviously they wouldn't really be standards if they were updated more often).
Wikipedia says ISO 8601 was first published in 1988, with updates in 1991, 2000, 2004, and 2019.
The C language is another example. The latest stable revision (C17) is ISO/IEC 9899:2018.
> Is there any chance we might see an ISO 8601 date variant which is meant for filenames?
Hardly.
anodyne33 2021-08-19 00:24:38 +0000 UTC [ - ]
hatmatrix 2021-08-18 21:06:12 +0000 UTC [ - ]
qwerty456127 2021-08-18 21:23:27 +0000 UTC [ - ]
And by the way, many (if not an overwhelming majority) of the non-programmers don't even understand what "just text files" actually means, how text files differ in nature from DOC files and how CSV files are different from XLS files. They can only use CSV because Excel and LibreOffice support it OOTB and consider CSV just a weird XLS cousin needed for import/export purposes.
jedimastert 2021-08-18 23:19:36 +0000 UTC [ - ]
qwerty456127 2021-08-19 00:00:05 +0000 UTC [ - ]
CobaltFire 2021-08-18 16:29:01 +0000 UTC [ - ]
qayxc 2021-08-18 16:35:18 +0000 UTC [ - ]
Usually MMM refers to the 3-letter shorthand of the month, e.g. "APR" or "OCT", but I guess that's not what you meant because it couldn't be used internationally.
CobaltFire 2021-08-18 17:21:01 +0000 UTC [ - ]
I should know better but just never thought about it.
qwerty456127 2021-08-18 16:40:36 +0000 UTC [ - ]
Because sorting. You can just sort a collection of dates stored as YYYY-MM-DD strings alphabetically and the result will always be in accordance with the actual time line.
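For example (Python, but any language's plain string sort behaves the same way):

  dates = ["2021-12-01", "2021-02-28", "2020-07-04"]
  # Lexicographic order of YYYY-MM-DD strings is already chronological order
  assert sorted(dates) == ["2020-07-04", "2021-02-28", "2021-12-01"]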
kstrauser 2021-08-18 17:43:59 +0000 UTC [ - ]
if row['timestamp'] < cutoff_time_as_int: ...
with: if row['isotimestamp'] < cutoff_time_as_string: ...
and everything kept working exactly as before. There was no expensive and error-prone date parsing involved, just some simple string comparison. It wasn't quite as cheap as comparing ints, but it's not performance critical code so eh, good enough.
pwinnski 2021-08-18 16:53:37 +0000 UTC [ - ]
mr_toad 2021-08-19 01:18:06 +0000 UTC [ - ]
(In theory people could write yyyy-dd-mm, but I’ve never seen anyone actually do that).
_moof 2021-08-18 16:05:54 +0000 UTC [ - ]
I like elegance as much as anyone. And I think it's a good proxy for other important qualities. But don't prioritize it above building something that actually does the job. Be an engineer.
qayxc 2021-08-18 16:40:28 +0000 UTC [ - ]
Not to mention the mess that is exchanging documents between different locales. It's all sunshine and roses until you get your CSVs from an office in a different country (which happens a lot in Europe).
CSV gets the job done until it doesn't.
brushfoot 2021-08-18 16:50:14 +0000 UTC [ - ]
Unexpected behavior is a potential problem with any tool or format, certainly no less so with the kinds of solutions the article is proposing.
qayxc 2021-08-18 17:10:40 +0000 UTC [ - ]
The problem arises from the fact that CSV doesn't contain any metadata, so while the same tool reads and writes the same data, the results will be different because the format doesn't specify what a number looks like.
The formats proposed in the article don't have this problem because they specify both the type and the formatting of the file content so there's no confusion.
That's the difference. JSON in particular simply cannot have this problem (though it's not a suggested solution): a number has a well-defined format and that's that. So do strings. This is not the case with CSV:
Number,Also a number,A String
1.0,1e15,"Hello there!"
"1,0",1e15,"Bienvenue á la maison!"
The first row is what you'd expect from a customer in Ireland and the second row might be sent from someone in Switzerland. Both parties agreed to use comma-separated files for data exchange and quoted strings, yet importing the first row using a French locale would fail to parse the first number, and the same goes for the second row and an English locale.
Neither row contains unexpected or malformed data from the respective user's POV. And that's the problem with CSV.
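For the curious, here is roughly how that plays out in code (a sketch using Python's locale module; it assumes the en_US and fr_FR locales are installed on the system):

  import locale

  value = "1,0"  # "one point zero" as written by the Swiss sender

  locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")
  print(locale.atof(value))  # 1.0 -- the comma is read as the decimal point

  locale.setlocale(locale.LC_NUMERIC, "en_US.UTF-8")
  print(locale.atof(value))  # 10.0 -- the comma is silently treated as a thousands separator

  # A locale-naive parser fails outright instead: float("1,0") raises ValueError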
p_j_w 2021-08-18 17:41:34 +0000 UTC [ - ]
Then you take it out of use where it doesn't get the job done.
dragonwriter 2021-08-18 16:01:41 +0000 UTC [ - ]
> To usurp CSV as the path of least resistance, its successor must have equivalent or superior support for both editing and viewing
Since the author has already ruled (on spurious grounds) human readability and thus the usability of ubiquitous text editors as incompatible with requirements for a successor format, this is simply impossible.
> If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course.
Yeah, sorry, you’ve put the cart before the horse. Neither of those firms are going to do that until the vast majority of users have already migrated off of CSV-based workflows, it would be an insane, user hostile move that would be a bigger threat to their established dominance in their respective markets than anything any potential competitor is likely to do in the foreseeable future.
worik 2021-08-19 01:26:27 +0000 UTC [ - ]
The author completely misses the point of what CSV files are useful for.
They are useful when both ends of the communication understand the context. They know what the data types are, they know what the character encoding is etcetera.
This is a very common situation. CSV files are easy to process, easy to generate, and can be read by a human without too much bother (they are not "for people" as the author so irritatingly asserts). I have written so many CSV (and other delimiters, not just ',') generators and processors I lost count decades ago. It is so easy.
And what does "retire CSV" really mean? Just stop generating them then! The consumers of your data will probably insist you go back to them as writing all that code just to satisfy your fetish with metadata is not useful
Now I am whinging.....
Antrikshy 2021-08-19 02:00:32 +0000 UTC [ - ]
einpoklum 2021-08-19 12:43:44 +0000 UTC [ - ]
1. It is a very common use-case to load CSV data generated by an opaque system which did not specify the format exactly to you.
2. CSV files are _difficult_ to process I: Fields are of variable width, which is not specified elsewhere. So you can't start processing different records in parallel, since you have to process previous lines to figure out where they end. You can't even go by some clear line termination pattern, since the quoting and escaping rules, and the inconsistencies the author mentions, make it so that you may not be able to determine with certainty that what you're reading is an actual end-of-record or end-of-field. Maybe it's all just a dream^H^H^H^H^H quoted string?
3. CSV files are _difficult_ to process II: For the same reasons as the above, it is non-trivial and time consuming to recover from CSV generation mistakes, corruptions during transmission, or missing parts of the data.
However - if you get some guarantees about field width and about the non-use of field and record separators in quoted strings, then your life becomes much easier.
pwinnski 2021-08-18 16:51:43 +0000 UTC [ - ]
If you think you can do better than CSV, let's see your proposal. Hint: you probably can't, and if you could, you probably couldn't get Excel to export it, so you still probably can't.
"The status quo is bad, more recent popular formats aren't good enough either, but I don't actually have a specific proposal that's better than all of the above" is a lot faster to read than that article, and says the same thing.
AussieWog93 2021-08-18 21:55:17 +0000 UTC [ - ]
I've got one! It's basically the same as regular CSV, but everything is UTF-8, the columns and lines are delimited by dedicated "delimiter" codepoints (if they aren't defined in the spec, find reasonable surrogates and use them), and therefore nothing ever needs to be escaped.
More human readable than regular CSV, less prone to error and just as easy to write to in a for loop (easier, in fact, as there are no escapes).
Depending on how Excel handles delimiters, it should be able to import it too.
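ASCII (and therefore UTF-8) already reserves codepoints for roughly this: the unit separator (U+001F) and record separator (U+001E). A sketch of the idea:

  FIELD_SEP = "\x1f"   # ASCII unit separator
  RECORD_SEP = "\x1e"  # ASCII record separator

  rows = [["name", "position"], ["Smith, John", 'said "hi"\nwith a newline']]
  blob = RECORD_SEP.join(FIELD_SEP.join(fields) for fields in rows)

  # No quoting or escaping rules needed, as long as the data itself
  # never contains these control characters.
  parsed = [record.split(FIELD_SEP) for record in blob.split(RECORD_SEP)]
  assert parsed == rows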
jedimastert 2021-08-18 23:32:07 +0000 UTC [ - ]
politician 2021-08-19 00:33:29 +0000 UTC [ - ]
proverbialbunny 2021-08-18 21:34:32 +0000 UTC [ - ]
hatmatrix 2021-08-18 21:04:54 +0000 UTC [ - ]
flowerlad 2021-08-18 21:45:44 +0000 UTC [ - ]
Instead of finding a replacement for CSV it might be easier to standardize it and enhance it. Excel's version of CSV is the de-facto standard. If you want a written-down spec that is available too [1]. To this we need to add enhancements such as a way to specify metadata (i.e., data type of each field). No need to find an alternative to CSV!
einpoklum 2021-08-19 12:46:12 +0000 UTC [ - ]
If you're working someplace that uses Excel, then maybe. Otherwise, it isn't.
Anyway, if more tools, including Excel, properly implemented RFC4180, we'd at least have something.
Another option is some sort of meta-data file alongside a large CSV describing its format somehow - which, if missing, you have to fall back on RFC4180 or your parser's "try to figure out the quirks" mode.
rjh29 2021-08-18 17:03:09 +0000 UTC [ - ]
TSV allows you to do stuff on a single machine with GNU parallel that people would normally create a Hadoop cluster or a 128GB database for.
shock-value 2021-08-18 18:27:27 +0000 UTC [ - ]
For those not aware, TSV and CSV differ by more than just the delimiting character. TSV has a dead-simple specification: https://www.iana.org/assignments/media-types/text/tab-separa.... CSV does not have a standard spec and implementations differ quite a bit, but often in subtle ways.
segmondy 2021-08-18 23:01:50 +0000 UTC [ - ]
Here's the RFC for CSV - https://datatracker.ietf.org/doc/html/rfc4180
Hasnep 2021-08-19 00:28:23 +0000 UTC [ - ]
shock-value 2021-08-19 01:49:03 +0000 UTC [ - ]
This is not true, though almost everyone (understandably) assumes this based on the names.
alexpw 2021-08-18 17:10:46 +0000 UTC [ - ]
2) I ask for TSV, whenever convenient. It's been more reliable, and I don't have a comprehensive why, but I think it's slightly more resilient to writer/reader inconsistencies, for me. It may be that there's just less need for escaping and quoting, so you might dodge a smart quotes debacle when asking for a one-off from an Excel user, for example.
3) Despite the issues raised, the notion we'd retire it makes me hug it tight, because for the majority of my requirements, it hits a sweet spot. I still reserve the right to raise my fist in frustration when someone does: ",".join(mylist).
shock-value 2021-08-18 18:22:32 +0000 UTC [ - ]
CSV is a mish-mash of different, complicated, and under-specified standards and/or implementations.
yuy910616 2021-08-18 15:11:13 +0000 UTC [ - ]
fuhrysteve 2021-08-18 15:25:27 +0000 UTC [ - ]
CSVs also are great because you can parse them one row at a time. This makes for a very scale-able and memory-efficient way of processing very large files containing millions of rows.
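That row-at-a-time pattern takes only a few lines with Python's stdlib csv module (the file name here is hypothetical):

  import csv

  with open("millions_of_rows.csv", newline="", encoding="utf-8") as f:
      reader = csv.reader(f)
      header = next(reader)
      count = sum(1 for _ in reader)  # rows are streamed, never all in memory at once
  print(count, "data rows")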
Let there be no mistake: Everyone reading this today will retire long before CSVs retire. And that's just fine by me.
ivanbakel 2021-08-18 15:34:23 +0000 UTC [ - ]
Even RFC4180-compliant CSVs can be incredibly memory-inefficient to parse. If you encounter a quoted field, you must continue to the next unescaped quote to discover how large the field is, since all newlines you encounter are part of the field contents. Field sizes (and therefore row sizes) are unbounded, and much harder to determine than simply looking for newlines - if you were to naively treat CSV as a "memory-efficient" format to parse, you would create a parser that is easy to blow up with a trivially crafted large file.
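To make that concrete, here is a sketch of a quoted field spanning many physical lines:

  import csv, io

  # One record whose second field contains embedded newlines
  evil = 'id,comment\n1,"' + ("blah blah\n" * 5) + '"\n'

  print(len(evil.splitlines()))                    # many "lines"...
  print(len(list(csv.reader(io.StringIO(evil)))))  # ...but only 2 CSV records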
newbie2020 2021-08-19 05:15:40 +0000 UTC [ - ]
bernardv 2021-08-18 20:24:28 +0000 UTC [ - ]
rendall 2021-08-18 21:35:00 +0000 UTC [ - ]
It was weakest when it implied there is no real standard. There is, and it's robust for representing data, even data that includes any combination of commas and double-quotes.
The algorithm for creating well-formed CSV from data is straightforward and almost trivial: if the datum has no comma in it, leave it alone. It's good to go. If it has even one comma, wrap the datum in double quotes; and if it also contains double quotes, then double them.
Not complicated and covers every edge case. CSV is going nowhere. 1000 years from now, computers will still be using CSV.
The answer to his objections could be to extend the format to include metadata. Perhaps a second row that holds type data.
fruit,price,expiration
string,$0.00,MM/DD/YYYY
apple,$0.45,01/24/2022
durian,$1.34,08/20/2021
etc
hnick 2021-08-19 00:00:27 +0000 UTC [ - ]
It is very easy to overlook edge cases in CSV.
rendall 2021-08-19 03:06:50 +0000 UTC [ - ]
That was not a criticism from the original article, and isn't even true.
If you have newlines in your original data, and you "overlooked" this "edge case", then neither JSON nor YAML nor any other format will save you. The same fix applies to them all.
This is really a very poor criticism.
hnick 2021-08-19 03:41:25 +0000 UTC [ - ]
> The algorithm for creating well-formed CSV from data is straightforward and almost trivial: if the datum has no comma in it, leave it alone. It's good to go.
> Not complicated and covers every edge case
Maybe more context was implied, but I didn't want anyone to think it's that simple. I have received and had to process CSV data with unexpected new lines. It gets nearly everyone the first time when they try to process a CSV file line by line.
jedimastert 2021-08-18 23:23:44 +0000 UTC [ - ]
RobLach 2021-08-18 15:25:35 +0000 UTC [ - ]
Just yesterday I was trying to add some metadata to a list of images that I handed off to a designer and I just piped a single "find . -fprintf" command into a file.
To "retire" something so simple and utilitarian makes no sense. There's plenty of formats that deal with all the issues described in this. It's like saying we need to retire plain text files because it's confusing to know how they should be displayed.
javajosh 2021-08-18 15:10:22 +0000 UTC [ - ]
fridif 2021-08-18 15:14:03 +0000 UTC [ - ]
"I Cannot Fathom A Use Case For CSV So I Would Like To Ban Everyone From Using It For Any Reason"
cogman10 2021-08-18 15:22:31 +0000 UTC [ - ]
As soon as data has any form of structure to it (and most data does), CSV complicates everything. Even for unstructured data, the problem of escape characters often rears its ugly head. The moment your data contains a comma, tab, or space, you run into a nasty mess that, in the best case, makes your system fail, and in the worst case silently adds corrupt data into the system.
Neither JSON nor XML suffer from that problem and both can easily be used in any scenario you'd use CSV. The only argument against either format is they are a bit more bulky than CSV.
fxleach 2021-08-18 15:36:30 +0000 UTC [ - ]
cogman10 2021-08-18 16:03:50 +0000 UTC [ - ]
Any variable length text data format is going to run into issues when special characters are used in the data. CSV has no definition of what should happen when that occurs.
fxleach 2021-08-19 16:25:14 +0000 UTC [ - ]
Demiurge 2021-08-18 16:27:43 +0000 UTC [ - ]
jonnycomputer 2021-08-18 16:48:13 +0000 UTC [ - ]
Writing code that can accept and parse arbitrary CSV files is a whole different thing. If I had to do that, I'd be yelling too.
A similar story can be told for JSON, markdown, etc., where the standard is inadequate, non-existent, or in mutual competition.
kaiken1987 2021-08-18 15:14:46 +0000 UTC [ - ]
kryptiskt 2021-08-18 16:16:15 +0000 UTC [ - ]
_jal 2021-08-18 15:18:44 +0000 UTC [ - ]
- Repairability by mere mortals. If you are competent enough to use Excel and are given bad data, you are competent enough to fix it. (Whether the dataset is too big for mere mortals to find the problem is a different issue.)
- Trivial serialization. It is super-easy to dump things to CSV. (This is probably also why there are so many annoying variants.)
- Usable by other tools. You don't need special libraries to read it, so pipelines involving the usual suspects is possible. Importantly, there are a ton of different ways to do this sort of thing, so non-experts can frequently find something that works for them, even if it looks wonky to programmers.
All that said, I hate dealing with them, too.
bhawks 2021-08-18 20:52:56 +0000 UTC [ - ]
Anyone that has dealt with more than 1 CSV is aware of many of the aspects of their horrible nature.
The more interesting reason is why they're so damn successful.
1: network effect - not supporting CSV in a product is practically silly. everyone can do it why can't you?
2: ease of producing / consuming (not saying that you do it correctly in all cases :))
3: data is transparent (or feels transparent)
4: accessible - if you can write language X you can parse or generate reasonable CSV in a matter of minutes. No knowledge of any libraries or tech required
That said it's horrible - but it will always be with us.
tannhaeuser 2021-08-18 16:09:38 +0000 UTC [ - ]
shock-value 2021-08-18 18:30:43 +0000 UTC [ - ]
ziml77 2021-08-18 16:19:32 +0000 UTC [ - ]
mongol 2021-08-18 16:21:23 +0000 UTC [ - ]
thibran 2021-08-18 16:44:46 +0000 UTC [ - ]
Microsoft -> The name of the format is 'comma separated values' not 'semicolon separated values'!
mongol 2021-08-18 17:22:23 +0000 UTC [ - ]
unnah 2021-08-18 18:46:15 +0000 UTC [ - ]
Microsoft Excel is the only program I know that insists on locale-specific interpretations of CSV. To read a standard CSV into Excel in a continental European locale, you need to either switch your locale at the Windows level, or to manually set every delimiter in the Excel text import wizard (requiring multiple mouse clicks for each column).
thibran 2021-08-18 17:26:12 +0000 UTC [ - ]
MrPowers 2021-08-18 18:21:54 +0000 UTC [ - ]
* Parquet stores schema in the metadata, so schema inference isn't required (schema inference is expensive for big datasets)
* Parquet files are columnar so individual columns can be grabbed for analyses (Spark does this automatically). This is a huge performance improvement.
* Row groups contain min/max info for each column, which allows for predicate pushdown filtering and data skipping. Highly recommend playing with PyArrow + Parquet to see the metadata that's available (see the sketch after this list).
* Columnar file formats are easier to compress. Binary files are way smaller than text files like CSV even without compression.
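A quick sketch of poking at that metadata with PyArrow (file and column names are made up; statistics may be absent if the writer didn't record them):

  import pyarrow.parquet as pq

  pf = pq.ParquetFile("events.parquet")
  print(pf.schema_arrow)                   # schema comes from the footer, no data scanned
  print(pf.metadata.num_rows, pf.metadata.num_row_groups)

  stats = pf.metadata.row_group(0).column(0).statistics
  if stats is not None:
      print(stats.min, stats.max)          # per-row-group min/max used for data skipping

  # Column pruning + predicate pushdown in one call
  table = pq.read_table("events.parquet",
                        columns=["user_id"],
                        filters=[("ts", ">=", "2021-01-01")])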
I wrote a blog post that shows how a Parquet query that leverages column pruning and predicate pushdown filtering can be 85x faster than an equivalent CSV query: https://coiled.io/parquet-column-pruning-predicate-pushdown/
CSVs are great for small datasets when human readability is an important feature. They're also great when you need to mutate the file. Parquet files are immutable.
It's easy to convert CSVs => Parquet/Delta/Avro with Pandas/Dask/Spark.
The world is already shifting to different file formats. We just need to show folks how Parquet is easy to use and will greatly increase their analysis speeds & they'll be happy to start using it.
Small nit: the article implies Apache Arrow is a file format. It's a memory format.
podiki 2021-08-18 15:11:29 +0000 UTC [ - ]
nolok 2021-08-18 15:14:01 +0000 UTC [ - ]
daxuak 2021-08-18 15:20:24 +0000 UTC [ - ]
cogman10 2021-08-18 15:26:24 +0000 UTC [ - ]
This is valid JSON
[["bob", "jones", 1, 22],
["frank", "was", 32, 45]]
That's unreadable and unparsable by a human?
Not only is JSON often more parsable, but because it's structured it also becomes a lot easier to query.
I grep and awk xml and json stuff all the time. I also have the added bonus of being able to use `jq` for json content.
shapefrog 2021-08-18 16:00:07 +0000 UTC [ - ]
cogman10 2021-08-18 16:10:46 +0000 UTC [ - ]
When someone says "Maybe we can fix CSV" this is what you should do instead of trying to "fix" CSV.
mongol 2021-08-18 17:18:44 +0000 UTC [ - ]
cogman10 2021-08-18 17:27:03 +0000 UTC [ - ]
mongol 2021-08-18 17:41:40 +0000 UTC [ - ]
cogman10 2021-08-18 17:54:25 +0000 UTC [ - ]
Yeah, it's never really come up as it's generally trivial to convert into this format from either a text editor or sed (Look for the `],[` and put a new line after the comma).
dullgiulio 2021-08-18 15:19:06 +0000 UTC [ - ]
throwawayswede 2021-08-18 16:30:06 +0000 UTC [ - ]
cogman10 2021-08-18 15:18:33 +0000 UTC [ - ]
Semaphor 2021-08-18 15:48:28 +0000 UTC [ - ]
cogman10 2021-08-18 16:12:31 +0000 UTC [ - ]
[[1,2,3],
[4,5,6]]
It's easy to make a structured data interchange format mimic an unstructured format. It's impossible to go the other way around without severe problems.
goatlover 2021-08-18 17:07:40 +0000 UTC [ - ]
cogman10 2021-08-18 17:23:23 +0000 UTC [ - ]
What's ensuring the format of data in a CSV file?
Format comes from the same place it comes from for CSV. It's part of whatever data contract you are making with whatever system is providing the data. If someone shoves YAML into a Json file you've got problems, just like you have problems if someone starts talking about cars when you expected a file about airplanes.
At some point, you've got to turn file data into application data regardless of format. CSV offers no help in shaping the data and has a bunch of footguns to boot.
> JSON is one of the formats Pandas can read, but it has to be structured in a format the python library can read in as tabular data.
Pandas is highly flexible in its ability to read in JSON data. What I showed would be trivially loadable by Pandas, as would other formats (including more traditional JSON data).
Turning structured data into tabular data is trivial. What isn't trivial is turning tabular data into structured data.
bluedino 2021-08-18 16:32:17 +0000 UTC [ - ]
WorldMaker 2021-08-18 18:20:19 +0000 UTC [ - ]
I feel so much of what keeps CSV in use is that it's a format that you can relatively easily generate without pulling in a lot of library dependencies and relatively quickly generate something that users can just "double click it opens in Excel".
What the Excel team could really gift to developer humanity at this point is some dumb file extension Excel registers like XLJSON that you could just rename a JSON file to and Excel opens it like a spreadsheet on double click.
cogman10 2021-08-18 16:53:11 +0000 UTC [ - ]
If the primary use case is to take data and give it to a non-programmer to evaluate, then CSV isn't terrible. Add a transformer that's run when someone manually requests to see the data.
However, for machine to machine communication, CSV should never be used.
dbreunig 2021-08-18 18:06:31 +0000 UTC [ - ]
>Most of us don’t use punch-cards anymore, but that ease of authorship remains one of CSV’s most attractive qualities. CSVs can be read and written by just about anything, even if that thing doesn’t know about the CSV format itself.
Yes, we keep CSVs. If you care about metadata and incredibly strict spec compliance, then yes: avro, parquet, json, whatever. But most CSV usage is small data, where the ease of usage, creation, and evaluation wins.
One of the problems with CSVs he cites is a great reason why I like CSVs:
>CSVs often begin life as exported spreadsheets or table dumps from legacy databases, and often end life as a pile of undifferentiated files in a data lake, awaiting the restoration of their precious metadata so they can be organized and mined for insights.
A benefit of a CSV is a skilled or unskilled operator can evaluate these piles of aging data. Parquet? Even SQLite? Not so much.
For small-to-medium sized datasets, CSV is great and accessible to a wider user base. If you're relying on CSV to preserve structure and meta, then meh.
SavantIdiot 2021-08-18 18:42:35 +0000 UTC [ - ]
It's an exercise in hubris because while CSV has one glaring fault, it has far fewer than other competing methods of the simplest possible exchange of textual data.
No one can decide to retire CSV because it is ubiquitous. Of the three issues OP complains about - line delimiter, field delimiter, and header - only one is really a problem for anyone that has used them for any period of time.
1. Every file format suffers from Windows/NonWindows CRLF issues.
2. Metadata has been an issue since forever, and there are plenty of painful formats that support it (looking at you XML), complete with ginormous parsers and even more issues.
3. Escaping, as OP points out, was pretty clearly defined in RFC 4180.
So yes, it has a wart: escaping.
Learn what your system expects, and modify accordingly. Because using a CSV will be much faster and simpler than any other format you can try, which is why it has been so pervasive for longer than most of HN has been alive.
ozim 2021-08-18 21:28:28 +0000 UTC [ - ]
People want to read and interpret arbitrary CSV ... well you cannot read and interpret any format that is arbitrary.
*As a consultant, I’ve written more than one internal system that attempts to reconstruct the metadata of a CSV of unknown provenance using a combination of heuristics and brute force. In short, CSV is a scourge that has followed me throughout my career.*
Ideally this should not happen, because you should talk with the other party and agree on a common format. Someone should explain what each field means and what it should contain, or at least provide some documentation for the file, not just that it is a CSV. But of course it always is more complicated than that.
Garbage in - Garbage out, even in other formats you still can have the same problem.
Traster 2021-08-18 17:41:06 +0000 UTC [ - ]
CSV is ubiqituous because it's not trying to solve difficult problems. The second you try to solve the difficult problems you necessarily fragment your audience.
zabzonk 2021-08-18 15:21:01 +0000 UTC [ - ]
pessimizer 2021-08-18 16:51:01 +0000 UTC [ - ]
zabzonk 2021-08-18 19:52:42 +0000 UTC [ - ]
hermitcrab 2021-08-18 22:09:26 +0000 UTC [ - ]
javajosh 2021-08-18 16:05:34 +0000 UTC [ - ]
zabzonk 2021-08-18 16:14:20 +0000 UTC [ - ]
hermitcrab 2021-08-18 21:31:02 +0000 UTC [ - ]
-warning if different rows have different numbers of columns
-warning if the file ends without closing escaping
Without meta data you can't check data types.
A missing escape character might result in a valid CSV (just not the one intended).
masklinn 2021-08-18 15:48:17 +0000 UTC [ - ]
And the important thing to remember is that you can not and will not make them.
zabzonk 2021-08-18 16:04:13 +0000 UTC [ - ]
AdmiralAsshat 2021-08-18 15:16:01 +0000 UTC [ - ]
So, question to the greybeards: when this format was coming about, why didn't we use one of the dedicated Data Structure separator control codes (e.g. File/Group/Record/Unit separator) that were part of the ASCII standard?
https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Basic_...
It seems like it would've saved us all several decades of headache.
wanderingstan 2021-08-18 15:39:46 +0000 UTC [ - ]
But commas can be seen, edited, and typed with ease in any text editor.
As per comment above: your CEO, SWE, or secretary can all use or contribute to a csv file. And using an easily recognizable and typeable separator has proven to be worth the downsides.
rutthenut 2021-08-18 15:20:07 +0000 UTC [ - ]
Consider also that IBM EBCDIC and other non-ASCII character sets were (and are) in common use, so C0/C1 may not have made sense 'back in the day'
_moof 2021-08-18 16:10:56 +0000 UTC [ - ]
Because you can't see them. CSV is, at its core, a text format. Using FS/GS/RS/US would effectively make it a binary format.
Also, what happens if one of those bytes appears in data?
The simple fact is you will never be able to come up with a text-based format that can handle all possible values without escape sequences, quoting, or length-encoding. And really, that's not that big a deal. It just means you have to write a simple state machine instead of using regex or your language's equivalent of split().
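For illustration, a minimal sketch of such a state machine for comma/quote CSV (handles quoted fields, doubled quotes, and embedded separators and newlines; it is not a full RFC 4180 validator):

  def parse_csv(text, sep=","):
      rows, row, field = [], [], []
      in_quotes = False
      i, n = 0, len(text)
      while i < n:
          c = text[i]
          if in_quotes:
              if c == '"':
                  if i + 1 < n and text[i + 1] == '"':
                      field.append('"')  # doubled quote -> literal quote
                      i += 1
                  else:
                      in_quotes = False  # closing quote
              else:
                  field.append(c)        # anything goes inside quotes, even newlines
          else:
              if c == '"':
                  in_quotes = True
              elif c == sep:
                  row.append("".join(field)); field = []
              elif c == "\n":
                  row.append("".join(field)); field = []
                  rows.append(row); row = []
              elif c != "\r":            # tolerate CRLF line endings
                  field.append(c)
          i += 1
      if field or row:                   # flush a final record with no trailing newline
          row.append("".join(field)); rows.append(row)
      return rows

  assert parse_csv('a,"b,""c""",d\n') == [['a', 'b,"c"', 'd']]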
NyxWulf 2021-08-18 15:23:18 +0000 UTC [ - ]
pmelendez 2021-08-18 17:26:51 +0000 UTC [ - ]
> values stored in the files are typed.
> most importantly, these formats trade human readability and writability for precision
Those two properties are advantageous only when CSV is used in cases meant to be just parsed and as a data transfer format. But in reality, CSV files are being used in many different contexts. For instance, data scientists love to leverage and chain Unix tools to create a subset of data to test models. Also, it is often used as an export format to validate outputs quickly.
I think the problem is that CSV is often the subject of abuse. In the same way that a spreadsheet is a subpar database, I don't believe we are anywhere close to the time when we will retire Excel.
slaymaker1907 2021-08-19 03:52:36 +0000 UTC [ - ]
Suppose someone wants to do some data processing client side in the browser with an input file. How would you do this for SQLite, even a SQLite file with a known schema such as would be the case most of the time for CSV file uploads? There is sql.js which compiles SQLite to JS or WASM, but even the WASM version weighs in around 400kb.
mdellavo 2021-08-18 15:11:27 +0000 UTC [ - ]
karmakaze 2021-08-18 18:16:47 +0000 UTC [ - ]
The conclusion is that there are many binary formats which are more suitable, except that none are used widely enough, and the post even goes on to not make a recommendation. Finally it says that we have to accept this sad state:
"Ultimately, even if it could muster the collective will, the industry doesn’t have to unify around a single successor format. We only need to move to some collection of formats that are built with machine-readability and clarity as first-order design principles. But how do we get there?"
ho_schi 2021-08-18 16:52:23 +0000 UTC [ - ]
JSON - and also CSV - are ubiquitous because they are actually human readable and usually quick to process. JSON is strictly defined; it just doesn't offer comments or structures like XML does. CSV is loosely defined but better than another competing standard ;)
A good data interchange format must be human readable; JSON and CSV are, proprietary stuff is not. And if you feel the need for speed? Seriously? Okay, then think about a binary format.
I prefer also the .conf format (so called INI) over other complex stuff for application settings.
bazzargh 2021-08-18 16:37:24 +0000 UTC [ - ]
https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/#...
... at the time RFC-4180 came out, it didn't even accurately describe how to read csv produced by Excel - which was already inconsistent in line endings and character set between office versions and platforms. The w3c spec at least tried to offer a model which would parse junk csv if you could guess the metadata (by eg scanning for the BOM, \" vs ""[^,], and so on)
When I worked on this stuff early 2010s, if you wanted to produce a non-ascii csv _download_ that could be opened by all office/openoffice variants you were out of luck. UTF-16LE-with-BOM, as I recall, would work in _most_ office variants but not consistently even across minor version changes in Office for OSX - so it was just a roll of the dice. We offered multiple download formats which _could_ handle this but csv was required by some customers.
Anyone saying csv is easy never worked with it in an international context.
ineedasername 2021-08-19 02:19:59 +0000 UTC [ - ]
kazinator 2021-08-18 18:04:10 +0000 UTC [ - ]
1. Encoding is UTF-8.
2. Header line with column names is mandatory. If there are no column names, the first line must be blank.
3. Each record is a line. A line is defined according to the operating platform's text file format.
4. In the light of (3) CSV does not dictate line endings and does not address the conversion issue of text files from one platform being transferred to another platform for processing. This consideration is a general text issue, off-topic to CSV.
5. CSV consists of items separated by commas. An item may be:
5. a) a JSON number, surrounded by optional whitespace. Such an item may be specially recognized as a number by the CSV-processing implementation.
5. b) the symbol true, false or nil, optionally surrounded by whitespace. These symbols may have a distinct meaning from "true", "false" or "nil" strings in the CSV-processing implementation.
5. c) a JSON string literal
5. d) any sequence of printable and whitespace characters, other than comma or quote, including empty sequence.
6. In the case of (5) (d), the sequence is interpreted as a character string, after the removal of leading and trailing whitespace. (To preserve leading and trailing whitespace in a datum, a JSON literal must be used.)
7. In (5), whitespace refers to the ASCII space (32) and TAB (9) character. If there are any other control characters, the processing behavior is implementation-defined. Arbitrary character codes may be encoded using JSON literals.
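A minimal Python sketch of how an implementation might classify a single item under rules 5(a) through 5(d); the function and regex names are mine, not part of the proposal, and splitting on commas is assumed to happen elsewhere:

import json
import re

# Rough JSON number grammar; used only to decide whether rule 5(a) applies.
NUMBER = re.compile(r'^\s*-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?\s*$')
SYMBOLS = {"true": True, "false": False, "nil": None}

def parse_item(item):
    stripped = item.strip(" \t")
    if NUMBER.match(item):            # 5(a): JSON number with optional whitespace
        return json.loads(stripped)
    if stripped in SYMBOLS:           # 5(b): bare true / false / nil
        return SYMBOLS[stripped]
    if stripped.startswith('"'):      # 5(c): JSON string literal
        return json.loads(stripped)
    return stripped                   # 5(d): plain text, outer whitespace removed

print(parse_item(' 42 '))       # 42
print(parse_item('true'))       # True
print(parse_item('" a,b "'))    # ' a,b ' (whitespace kept via the JSON literal)
print(parse_item('  hello  '))  # 'hello'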
ArsenArsen 2021-08-19 00:54:54 +0000 UTC [ - ]
It is self describing, rigorously defined for the machine, highly flexible, highly extensible, human readable, text-based and an open format, and it is only mentioned as a storage format in the post, not even nearly doing it justice!
XML is complex, but there are already very established libraries for handling it, so that shouldn't be an issue, and that's the only drawback XML has.
Hell, even alternatives like JSON/YAML + JSONSchema exist, still providing incredibly rigorous validation, but staying human readable and ubiquitous.
P.S. I think a bigger failure here is trying to generalize the concept of spreadsheets, rather than using a common encoding (XML) for domain-specific data formatting.
mr_toad 2021-08-19 01:02:51 +0000 UTC [ - ]
Parsing times are often horrible.
There’s no standard for tabular data. You invariably need some overly complicated XML map, because people can’t resist the temptation to over-engineer.
ArsenArsen 2021-08-19 01:11:52 +0000 UTC [ - ]
I've not benchmarked XML parsing times in a very long time, I'd be interested in seeing the numbers now.
Tabular data is barely data, honestly, hence my PS, I don't think spreadsheets in general are a very good way to store anything.
And yes, absolutely, overengineering is bound to happen, which is unfortunate, but I'm not sure if it really can be avoided while still keeping many of those upsides (especially the rigorous definitions)
selfhoster11 2021-08-19 00:58:53 +0000 UTC [ - ]
ArsenArsen 2021-08-19 01:08:40 +0000 UTC [ - ]
<bankstatement>
<transaction date="2021-08-19" description="Hello, world!">
<debit commodity="USD">100</debit>
<balance commodity="USD">237.87</balance>
</transaction>
</bankstatement>
selfhoster11 2021-08-19 03:10:54 +0000 UTC [ - ]
simonw 2021-08-18 18:00:31 +0000 UTC [ - ]
The trick is that if you compile data into a SQLite file and then deploy the Datasette web application with a bundled copy of that database file, users who need CSV can still have it: every Datasette table and query offers a CSV export.
But... you can also get the data out as JSON. Or you can reshape it (rename columns etc) with a SQL query and export the new shape.
Or you can install plugins like https://datasette.io/plugins/datasette-yaml or https://datasette.io/plugins/datasette-ics or https://datasette.io/plugins/datasette-atom to enable other formats.
kthejoker2 2021-08-18 18:59:58 +0000 UTC [ - ]
> then deploy the Datasette web application
Is a huge hurdle for non-technical folks holding on to their CSV workflows.
I think there is no barrier low enough that CSV cannot limbo beneath it.
simonw 2021-08-18 19:06:49 +0000 UTC [ - ]
hestefisk 2021-08-18 15:20:10 +0000 UTC [ - ]
crdrost 2021-08-18 16:32:54 +0000 UTC [ - ]
The real problem with CSV is the lack of vision of data formats in general. Inside your programming language you need to say that you want to read a CSV? You need to import a different library and change both the parse code and your consumption code to instead do JSON? You need to change everything to use futures in order to stream the results? Are you out of your damn mind?
So now that disk space is so cheap it would make sense for any file format to just begin with a single few-kilobytes line that defines the parser for the coming file. Could be sandboxed, WASM-y or something, could be made printable of course... Sure, it might not be possible to get the full SQLite library in there but you could at least make it free to switch between JSON and BSON and CSV without having to recompile the software or force someone to design modular file input systems. Somehow the only flexible containers that hold different data structures are video container formats that do not care about what codec you used. They "get it"— can the rest of us?
hcarvalhoalves 2021-08-18 15:39:23 +0000 UTC [ - ]
2. I would feel safer storing longer-term data in CSV than in a binary format w/ a complicated spec. Having to make sense of compressed, columnar padded data sounds worse than parsing CSV.
3. Although it's easy to point at corner cases, I don't remember the last time I couldn't figure out how to parse a file because of inconsistent quoting or exotic char encoding – and I've spent a good amount of the past 13 years exchanging CSVs and TSVs w/ 3rd parties full of horrible legacy. Asking those 3rd parties to send me an Avro/Parquet/whatever file would've made the project fail or take 10x longer.
There's a reason why CSV stuck around so long, and the alternatives make different trade-offs but miss the pros.
lwn 2021-08-18 17:32:49 +0000 UTC [ - ]
U+FEFF"aaa","b CRLF
bb","cc"c" CRLF
zzz,yyy,xxx
Just an example of the wonderful world of CSV \(.)/ which I've used quite a lot, because it enables quick and easy (dirty) data dumps/exchanges. But using it as a data exchange format between multiple parties often leads to problems, such as above.
rexreed 2021-08-18 15:15:11 +0000 UTC [ - ]
_trampeltier 2021-08-18 20:59:06 +0000 UTC [ - ]
hermitcrab 2021-08-18 19:50:27 +0000 UTC [ - ]
For a start, think about it from a game theory perspective. If I keep CSV import/export in my Easy Data Transform product, while all my competitors with data transformation software (Alteryx, Knime etc) remove it from theirs, it gives me a big sales advantage. What advantage do I get for removing already working CSV import/export code? Nothing.
kcartlidge 2021-08-18 19:39:53 +0000 UTC [ - ]
Whatever is used, I like that tab-delimited (or even csv) is a human-readable and always-machine-readable long-term data format in the same way as ASCII (or Markdown etc) is for text. A hundred years from now, assuming the storage medium is still usable, the content should be easily recoverable. That may not be the case with spreadsheet files or other structured data formats.
lamontcg 2021-08-18 20:14:01 +0000 UTC [ - ]
And that's why they're not going away.
Trying to add typing to them is probably missing the point that >95% of humans that use them don't need or want to understand that.
MarkLowenstein 2021-08-18 23:36:42 +0000 UTC [ - ]
When they give me literally anything else I get pissed off and push their task lower in my queue.
ittan 2021-08-19 00:12:52 +0000 UTC [ - ]
The contract on what is transferred or how it's mapped in your csv is developer to developer and not in code.
This is a problem and also a benefit.
As long as both systems are designed to read and write the csv correctly (therein also lies the problem of verification), csv works flawlessly and retains its simplicity.
ubermonkey 2021-08-19 00:04:28 +0000 UTC [ - ]
Sometimes -- frequently, even -- you have to go to some least-common-denominator format to move data around. CSV is often a candidate.
When I was younger I thought we'd eventually abandon "primitive" formats like this, but in my middle years I realize their extreme utility. CSV absolutely has a place, and absolutely does a job that no other format does as quickly or as easily.
bastardoperator 2021-08-18 16:41:46 +0000 UTC [ - ]
jqcoffey 2021-08-18 18:35:48 +0000 UTC [ - ]
The format isn’t inherently flawed, though much like other less than perfect standards (I’m looking at you SMTP), the implementations frequently are.
And also, much like the author, I’ve been “professionally” dealing with data in all (most?) its forms for the length of my career (~25 years).
SPBS 2021-08-18 18:27:46 +0000 UTC [ - ]
edoceo 2021-08-18 15:59:11 +0000 UTC [ - ]
robbmorganf 2021-08-19 01:05:12 +0000 UTC [ - ]
anodyne33 2021-08-19 00:29:08 +0000 UTC [ - ]
progre 2021-08-18 15:27:52 +0000 UTC [ - ]
osrec 2021-08-18 16:26:14 +0000 UTC [ - ]
PaulHoule 2021-08-18 15:16:03 +0000 UTC [ - ]
andylynch 2021-08-18 15:30:19 +0000 UTC [ - ]
politician 2021-08-19 00:22:41 +0000 UTC [ - ]
If you use the right characters built specifically for this purpose, then the problems go away.
pharmakom 2021-08-18 16:40:39 +0000 UTC [ - ]
- easy to parse
- easy to edit with a generic text editor
- easy to edit with a widely available GUI, like LibreOffice
- allow adding more data with only append operations
mongol 2021-08-18 17:27:09 +0000 UTC [ - ]
dredmorbius 2021-08-18 19:55:51 +0000 UTC [ - ]
The usual problem is when you discover that what you thought was a value is in fact a record of its own, with internal structure.
The problem isn't the CSV. The problem is that data parsing is actually a Hard Problem.
The author of TFA is ... naive and misguided.
hobs 2021-08-19 03:30:04 +0000 UTC [ - ]
jayd16 2021-08-18 16:13:05 +0000 UTC [ - ]
In fact, I will go even further and say I want something mergeable by source control.
btilly 2021-08-18 17:05:20 +0000 UTC [ - ]
Bad CSV is a PITA, but usually you can make reasonable sense of it. Merging bad/inconsistent/conflicting metadata tends to be an open-ended nightmare with no good resolution at the end.
jojo2000 2021-08-18 20:13:12 +0000 UTC [ - ]
zmmmmm 2021-08-18 23:45:21 +0000 UTC [ - ]
It turns out people hate bloat and complexity more than they hate all the ills of CSV put together!
Thiez 2021-08-19 03:05:12 +0000 UTC [ - ]
jszymborski 2021-08-18 16:09:04 +0000 UTC [ - ]
When I have lots of data in multiple related tables where I would benefit from defined data types, I reach for HDF5 or SQLite, but there are so many nice things to say about a simple CSV for lots of simply structured data.
Super simple to stream compress/decompress, and the ability to use great CLI tools like xsv, and being able to peek at the data by simply calling less or head... it's just hard to beat.
MonadIsPronad 2021-08-18 15:09:53 +0000 UTC [ - ]
puppet-master 2021-08-18 15:20:28 +0000 UTC [ - ]
SQLite is wonderful, but it's definitely not a replacement for CSV
masklinn 2021-08-18 15:32:02 +0000 UTC [ - ]
The main feature of CSV is that it's always broken, not that it's almost trivial.
> with one that requires a specific 85 kLOC codebase to access
sqlite is available essentially everywhere. Every browser uses sqlite extensively internally, so do tons of other software on your standard desktop machine (to say nothing of mobile OS where it's ubiquitous). Using sqlite is not any sort of constraint.
> SQLite is wonderful, but it's definitely not a replacement for CSV
Only in the sense that it's nowhere near as bad and annoying as CSV.
JohnHaugeland 2021-08-18 15:13:34 +0000 UTC [ - ]
tzs 2021-08-18 15:46:15 +0000 UTC [ - ]
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE temperature
(
ts integer,
Tc real,
Tf real,
src_id integer
);
INSERT INTO temperature VALUES(1615165342,5.5999999999999996447,42.099999999999999644,1);
INSERT INTO temperature VALUES(1615165350,0.6,32.999999999999998223,3);
INSERT INTO temperature VALUES(1615165404,5.5,41.899999999999995026,1);
INSERT INTO temperature VALUES(1615165410,-17.199999999999999289,1.0,2);
INSERT INTO temperature VALUES(1615165435,5.5,41.899999999999995026,1);
SQLite dumps would actually be a pretty good data exchange format. As mentioned, there is a CSV inside there. It's not hard to extract that if you need an unadorned CSV--some grep and sed, or grep and cut, or a few lines of scripting. Or if you have SQLite installed, it is easy to have SQLite itself read the dump and export it as CSV, letting it deal with things like quoting that can be a pain with grep and sed/cut. This way also makes it easy to rearrange or omit columns and to do some filtering so that data you don't care about doesn't get into the CSV.
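For instance (a sketch, assuming the dump above is saved as dump.sql; the file names are placeholders), Python's built-in sqlite3 and csv modules can replay the dump and write a clean CSV, with quoting handled for you and columns rearranged or omitted as needed:

import csv
import sqlite3

# Replay the SQLite dump into an in-memory database.
with open("dump.sql") as f:
    dump_sql = f.read()

conn = sqlite3.connect(":memory:")
conn.executescript(dump_sql)

# Export one table as CSV, omitting the Tf column and sorting by timestamp.
with open("temperature.csv", "w", newline="") as out:
    writer = csv.writer(out)
    cursor = conn.execute("SELECT ts, Tc, src_id FROM temperature ORDER BY ts")
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)                                  # data rows

conn.close()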
coding123 2021-08-18 15:09:48 +0000 UTC [ - ]
MeteorMarc 2021-08-18 15:12:15 +0000 UTC [ - ]
NBJack 2021-08-18 15:18:47 +0000 UTC [ - ]
Are you referring to a particular format variant?
rutthenut 2021-08-18 15:25:05 +0000 UTC [ - ]
nojito 2021-08-18 16:29:34 +0000 UTC [ - ]
nolok 2021-08-18 15:15:44 +0000 UTC [ - ]
oytis 2021-08-18 15:16:31 +0000 UTC [ - ]
poorman 2021-08-18 15:18:23 +0000 UTC [ - ]
johnrob 2021-08-18 15:15:24 +0000 UTC [ - ]
xtracto 2021-08-18 15:39:09 +0000 UTC [ - ]
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "array", "items": {"type":"array","items": [ { "type": "number" }, { "type": "string" }, { "enum": ["Street", "Avenue", "Boulevard"] },{ "enum": ["NW", "NE", "SW", "SE"] } ]} }
[[3,"some", "Street", "NE"],
[4,"other", "Avenue", "SE"],
[5,"some", "Boulevard", "SW"]]
Existenceblinks 2021-08-18 17:50:21 +0000 UTC [ - ]
Now you have tables and relations. Only a few people currently understand the usefulness of Compound Documents.
lufte 2021-08-18 18:14:37 +0000 UTC [ - ]
How would any existing JSON parser handle 75GB of data in a single array of arrays?
WorldMaker 2021-08-18 18:36:38 +0000 UTC [ - ]
JSON.Parse() in the browser isn't that sort of parser obviously, but there are options in many languages: both major .NET JSON libraries (Newtonsoft and System.Text.Json), for example, have ways to parse that way. Similar examples exist in other languages too.
lufte 2021-08-18 20:07:21 +0000 UTC [ - ]
temp8964 2021-08-18 19:37:23 +0000 UTC [ - ]
It is just CSV with a little bit of information on top. Great for defining column types, so that your reader does not need to guess them.
See: https://csvy.org/
EthicalHacker03 2021-08-18 23:02:11 +0000 UTC [ - ]
dasil003 2021-08-18 15:27:37 +0000 UTC [ - ]
jti107 2021-08-19 00:11:00 +0000 UTC [ - ]
we've resorted to csv files + readme + json files in git for version control.
punnerud 2021-08-18 15:26:10 +0000 UTC [ - ]
As long as you are following RFC4180 it works.
I ended up exporting from Oracle DB using JSON and converting to CSV for one of our contractors to be able to import the data to MongoDB.
philistine 2021-08-18 16:37:20 +0000 UTC [ - ]
What we need is something akin to Strict Markdown. Something that qualifies every edge case to produce a strict CSV that can encompass human-readable metadata within the strict delimiters.
robochat 2021-08-18 20:28:18 +0000 UTC [ - ]
If it were stricter, it would have one type of field separator that is not the comma, since some locales use commas as decimal separators (I'm looking at you, France), but something like '|'. It would insist that dates were ISO 8601. It could define how fields can be escaped and quoted - although I would prefer if quoting was kept to a minimum. The format should also allow for comments, e.g. lines starting with #, so that people can comment their datasets inside the same file.
Alternatively or in addition, it could have some header lines:
1) A header that defines the encoding, separator, decimal separator, quote character, escape character, line ending character, date format ...
2) A header that defines each column's name
3) A header that defines each column's data type and formatting
4) A header that defines each column's unit like m/s or kg - ok, this is a bit of a stretch but it would be great to have.
or some variation of the above.
Fundamentally, this bsv format would still be csv and most programs would still be able to read it with the parsers that already exist or be quickly adapted to read it. It could still be easily edited by hand but the metadata would be present.
I suspect that this is just a pipe dream because people would find hundreds of ways to break it, but TOML took off and that didn't exist so long ago.
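A sketch of what reading such a file could look like; the '#!' directive prefix, the key names, and the sample data are all made up for illustration:

import csv
import io

sample = """\
#! separator=| dateformat=iso8601
#! types=int,string,float
# free-form comment: wind tunnel run 42
1|probe A|3.5
2|probe B|4.25
"""

meta, data_lines = {}, []
for line in io.StringIO(sample):
    if line.startswith("#!"):                      # metadata header line
        for pair in line[2:].split():
            key, _, value = pair.partition("=")
            meta[key] = value
    elif not line.startswith("#"):                 # skip comment lines
        data_lines.append(line)

reader = csv.reader(data_lines, delimiter=meta.get("separator", ","))
print(meta["types"])    # 'int,string,float'
print(list(reader))     # [['1', 'probe A', '3.5'], ['2', 'probe B', '4.25']]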
breck 2021-08-18 20:46:46 +0000 UTC [ - ]
I don't think you can make a breakthrough through syntax alone. I think you've got to integrate some type of live semantic schema, something like Schema.org. If I used "bsv" and didn't just get a slightly better parsing experience but also got data augmentation for free, or suggested data transformations/visualizations, et cetera, then I could see a community building.
I think perhaps a GPT-N will be able to write its own Schema.org thing, using all the world's content, and then a BSV format could come out of that.
41209 2021-08-18 19:19:48 +0000 UTC [ - ]
YAML is my go-to format when I need something human readable and somewhat editable (it's very easy to ruin YAML spacing).
What's the real motivation behind an article like this? I don't imagine anyone has serious trouble with CSVs; they tend to just work.
Sn0wCoder 2021-08-18 16:07:47 +0000 UTC [ - ]
jrm4 2021-08-18 17:10:47 +0000 UTC [ - ]
Languages and formats are not fundamentally about computers or efficiency, they're about people. Carry on.
andylynch 2021-08-18 15:24:22 +0000 UTC [ - ]
zz865 2021-08-18 22:35:16 +0000 UTC [ - ]
anigbrowl 2021-08-18 23:08:44 +0000 UTC [ - ]
bernardv 2021-08-18 20:21:15 +0000 UTC [ - ]
mybrid 2021-08-18 22:30:18 +0000 UTC [ - ]
joelthelion 2021-08-18 17:36:31 +0000 UTC [ - ]
They have open-source implementations in many languages, are much faster to load than csv, are natively compressed, are strongly typed and don't require parsing...
There are few reasons to continue using csv in this day and age.
mrvenkman 2021-08-18 16:14:52 +0000 UTC [ - ]
LibreOffice works on Windows.
Google Sheets does a great job too, no?
osigurdson 2021-08-18 15:37:58 +0000 UTC [ - ]
mongol 2021-08-18 16:15:54 +0000 UTC [ - ]
laichzeit0 2021-08-18 17:19:04 +0000 UTC [ - ]
7952 2021-08-18 17:29:39 +0000 UTC [ - ]
tbenst 2021-08-18 22:29:40 +0000 UTC [ - ]
SQLite databases are in contrast a nightmare, and git merges often corrupt files.
spiritplumber 2021-08-19 00:30:39 +0000 UTC [ - ]
danellis 2021-08-19 04:35:10 +0000 UTC [ - ]
jiggawatts 2021-08-18 23:44:26 +0000 UTC [ - ]
Yet, everybody gets it wrong. Everybody. Thoroughly, fantastically, almost unimaginably wrong.
For example, in the Microsoft world: I come across CSVs in about 5 scenarios, all of them very common, most of them designed to interact: Excel, PowerShell's Export-CSV, SQL Server, Power BI, and the Azure Portal. None of these are obscure. None of these use CSV infrequently. Yet, they're all mutually incompatible!!!
That just blows my mind.
For example, SQL Server will output the string ",NULL," to represent a null field instead of just a pair of commas (",,"), so every other tool will convert this to the string "NULL", which is not a null.
PowerShell helpfully outputs the "type" of the value it is outputting, like so:
PS C:\> dir | ConvertTo-Csv
#TYPE System.IO.DirectoryInfo
"PSPath","PSParentPath","PSChildName","PSDrive","PSProvider",...
Excel can't open such files! It is entirely unfathomable that the PowerShell team wrote this code and never once double-clicked the resulting CSV to see if it opens successfully in Excel or not. Absolutely mindblowing! It just goes on and on.
Different quoting rules. Random ability/inability to handle new lines. Different ways of handling quoted versus unquoted string values. Encoding is UTF-8 by default or not. Handling of quote characters within strings. Etc, etc...
Basically, within one vendor's ecosystem, flagship applications have at best a 50:50 chance of opening arbitrary CSV files.
Don't even get me started on the inherent ambiguities of the format, like interpreting dates, times, or high precision decimals, etc...
Oh, and before I forget: the SQL Server Integration Services team wrote lengthy articles on how their engine can process CSV files faster than the competition. Why is this a feature? Because CSV is a woefully inefficient format and processing it fast is an achievement.
Lastly: By default, SQL Server cannot export table data to a flat data file and round-trip it with full fidelity, in any format. Exchanging just a couple of tables between servers is... not fun.
So yes, a replacement format with an efficient, high-fidelity binary format is long overdue.
csours 2021-08-18 16:00:28 +0000 UTC [ - ]
There's no such thing as schema-less - there's undefined schema
mrlonglong 2021-08-18 16:31:38 +0000 UTC [ - ]
tester756 2021-08-19 00:01:26 +0000 UTC [ - ]
Like
Id(int), Name, LastName, BornAt(date)
1, asd, dsa, 1992-07-12
shezi 2021-08-18 17:50:59 +0000 UTC [ - ]
It was outdated from the start because ASCII already has specific characters for file, group, record, and unit separation. Using these would give broad compatibility and a wider feature set (eg. more than one table in a file) while retaining all the benefits of csv.
kissgyorgy 2021-08-18 15:55:21 +0000 UTC [ - ]
s_dev 2021-08-18 17:50:44 +0000 UTC [ - ]
m1234_ 2021-08-18 20:57:05 +0000 UTC [ - ]
worik 2021-08-19 01:08:45 +0000 UTC [ - ]
There are plenty of formats available. CSV is useful. Not perfect.
TrackerFF 2021-08-18 15:07:28 +0000 UTC [ - ]
habitue 2021-08-18 15:07:12 +0000 UTC [ - ]
I think the valuable insight here is that there needs to be a meme / movement that CSV is bad or deprecated. That's what's actually going to put the nails in its coffin, not private griping from developers when they get CSVs.
I'm all for it. Down with CSV :)
Sanzig 2021-08-18 15:14:37 +0000 UTC [ - ]
If I'm trying to make an exportable format for a data logger with an SD card running on an ARM microcontroller, it doesn't get much easier than CSV. Sure, I could save space by rolling my own binary format, but then I have to provide a PC application to read it (and realistically, the user is probably just going to want that application to dump to a CSV anyway!).
I agree that for many use cases there are much better alternatives, but one of the reasons CSV is so popular is because it's so simple. It shouldn't be used for multi-gigabyte datasets, but for many simple use cases it works great.
dingosity 2021-08-18 16:43:32 +0000 UTC [ - ]
that is... if you're following the RFC 4180 guidance for CSV. microsoft has another internal standard that occasionally gets referenced.
drewmate 2021-08-18 15:10:22 +0000 UTC [ - ]
And the cycle continues.
newbamboo 2021-08-18 15:05:29 +0000 UTC [ - ]
Xkcd: https://m.xkcd.com/927/
dataminded 2021-08-18 15:14:29 +0000 UTC [ - ]
anshargal 2021-08-18 17:33:32 +0000 UTC [ - ]
Unfortunately it is not widely adopted yet, and language support still needs to improve.
mark-r 2021-08-18 15:36:54 +0000 UTC [ - ]
JohnHaugeland 2021-08-18 15:14:09 +0000 UTC [ - ]
The only appropriate replacement for CSV would be a better defined CSV that gave hard requirements for things like field quoting
zepolen 2021-08-18 15:12:47 +0000 UTC [ - ]
1. You can read them without any software
2. Streamable row by row
3. Compress well
To be honest most of the points in this article could be addressed by standardizing a method of defining a schema for the CSVs. It could even be backward compatible by appending the definition as metadata on the header column or as a separate file.
One thing that would be good to have is a standardized method of indexing CSVs so that random access is possible too, though that would be more involved.
jslaby 2021-08-18 16:29:06 +0000 UTC [ - ]
zepolen 2021-08-18 16:58:48 +0000 UTC [ - ]
name::string,number of legs::int,height in meters::float,date of birth::date(MM/DD/YYYY),email address::email,website::url
joe,2,1.76,12/12/1999,joe@joe.com,https://www.joe.com
bob,1,1.84,12/12/1944,bob@vietnam.com,null
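A rough sketch of how a reader could consume that kind of typed header; the tiny type vocabulary below only covers the example above, and the date/email/url handling is deliberately simplified:

import csv
import io
from datetime import datetime

data = """\
name::string,number of legs::int,height in meters::float,date of birth::date(MM/DD/YYYY),email address::email,website::url
joe,2,1.76,12/12/1999,joe@joe.com,https://www.joe.com
bob,1,1.84,12/12/1944,bob@vietnam.com,null
"""

def converter(type_spec):
    # Map a declared type to a conversion function; unknown types stay strings.
    if type_spec == "int":
        return int
    if type_spec == "float":
        return float
    if type_spec.startswith("date("):
        return lambda s: datetime.strptime(s, "%m/%d/%Y").date()
    return lambda s: None if s == "null" else s

reader = csv.reader(io.StringIO(data))
names, types = zip(*(col.split("::", 1) for col in next(reader)))
convert = [converter(t) for t in types]

for row in reader:
    print({name: fn(value) for name, fn, value in zip(names, convert, row)})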
mastrsushi 2021-08-19 03:05:20 +0000 UTC [ - ]
Everyone knows it’s not capable of storing anything more than table data. It was never meant to be much more.
It doesn’t really store objects well. To write the values of a composite object would require following some format order. It would go past human readability and just be a bloated way of writing bytes with readable characters.
anothernewdude 2021-08-19 01:11:56 +0000 UTC [ - ]
soheil 2021-08-19 01:07:24 +0000 UTC [ - ]
thrower123 2021-08-18 16:36:11 +0000 UTC [ - ]
throwawayswede 2021-08-18 16:29:02 +0000 UTC [ - ]
Every single one of their points is extremely subjective and very wrong. If you have an Excel sheet full of equations and colorful cells and then decide to export it as CSV and open it in Notepad, you really can't complain that the CSV is bad. I mean it's obvious that each format has a set of strengths and a set of weaknesses, and also obviously what could be a strength to someone is a weakness to someone else. The fact that CSV is so simple to parse (almost every modern language can very easily read/write a CSV) makes it a fantastic data transfer format for every single use case I've had (this doesn't mean that other formats are less important of course). Sure, you'll lose your Google Sheet or Excel metadata, but this is NOT what CSV is for.
What I find fascinating though is that the author decided it's a good idea to make a blanket statement like "Time to retire?" just because they themselves have an issue with SOME use case. I mean, the idea of using the fact that your personal needs aren't met to justify arguing that we should ALL stop using CSV is so very bizarre to me; it's way beyond selfish.
parksy 2021-08-18 15:56:42 +0000 UTC [ - ]
There are workarounds for sure: VB macros, unzip the XLSX and parse the XML, write scripts to automate Excel-to-whatever using DCOM, or import into an intermediary service and re-export into the format du jour, but that all takes time and costs money too, and often causes confusion when even spoken about. Asking for a CSV takes a couple of seconds and is easily understood by most people. Anyone experienced with importing / manipulating CSV data can deal with the variance in delimiters and escape sequences without major issues. It's a headache at times but easier than the alternative of alienating or confusing clients who are looking for a simple solution to whatever issue they have today.
On the other hand once the data is in the target system, if they're still asking for CSV exports I do probe to ask why, and try and figure out if there's a better way. Reporting is the usual reason, and plenty of the CRMs I work with have built-in reporting that can replicate and improve upon whatever spreadsheet they are using, and have APIs that can interface with cloud-based reporting services. But there's a lot of inertia against change in most small-to-medium organisations, not everyone is a data expert and you can't sell someone something they can't use or understand. Ultimately people win, and the solution ends up being a balance of hopefully incremental technological improvements that they can still integrate into their day-to-day. Not every organisation is able to undergo a full digital transformation with time, budgets, and skills at hand.
I agree with the sentiment but I'd hate to see a future where everyone's locked into proprietary ecosystems - not that that is what's advocated for in the post, but we have CSV because that's what the big platforms seem to allow, not because they don't know there are better options. There's no technical reason Excel couldn't export to WordPress or SuiteCRM. Take CSV away and it gets harder, not easier, to move between platforms.
dingosity 2021-08-18 16:38:02 +0000 UTC [ - ]
I can't remember a time that such a simple file format had such wild inconsistencies. Not to mention some CSV export functions just ignore them. Put a comma inside a CSV field? why not? Put a single double quote in a CSV field? sure! Insist that leading (or trailing) spaces in a CSV field are semantically important? OF COURSE!
If CSV was used consistently, it wouldn't be that bad. But its apparent simplicity lulls developers into a false sense of security, which is part of what the original author seems to be saying.
six0h 2021-08-19 02:20:01 +0000 UTC [ - ]
six0h 2021-08-19 02:18:48 +0000 UTC [ - ]
Sunspark 2021-08-18 15:14:00 +0000 UTC [ - ]
1) A truly open format is available and accessible. Csvs are textfiles. There is no system around that cannot open a textfile. If the format is binary or requires patents or whatever, then it's a non-starter.
2) Applications have a speed increase from using csvs. To wit, I loved csvs because often they finish preparing much faster than a "formatted" output in excel, etc. and sometimes I just want to see the numbers, not have everything colored or weird features like merged cells in excel, etc.
3) The new format should not be grossly larger than the one it is replacing. Extracts in Excel format are orders of magnitude larger than csv in filesize. This affects the run-time to prepare the extract and to open it (memory constraints, etc.)
Is there truly a problem? The author is not forced to use csvs.
derefr 2021-08-18 15:25:45 +0000 UTC [ - ]
These other data formats all win on your #2 against CSV, as CSV is actually horrible at parse-time vs. other data formats — the fact that both of its separators (newlines and commas) can appear as-is inside column values, with a different meaning, if those column-values are quoted, means that there's no way to parallelize CSV processing, because there's no way to read-ahead and "chunk" a CSV purely lexically. You actually need to fully parse it (serially), and only then will you know where the row boundaries are. If you've ever dealt with trying to write ETL logic for datasets that exist as multi-GB CSV files, vs. as multi-GB any-other-data-format files, you'll have experienced the pain.
> The new format should not be grossly larger than the one it is replacing.
Self-describing formats like JSON Lines are big... but when you compress them, they go back to being small. General-purpose compressors like deflate/LZMA/etc. are very good at shearing away the duplication of self-describing rows.
As such, IMHO, the ideal format to replace ".csv" is ".jsonl.gz" (or, more conveniently, just ".jsonl" but with the expectation that backends will offer Transport-Encoding and your computer will use filesystem compression to store it — with this being almost the perfect use-case for both features.)
-----
There's also Avro, which fails your point #1 (it's a binary format) but that binary format is a lossless alternate encoding of what's canonically a JSON document, and there are both simple CLI tools / and small, free, high-quality libraries that can map back and forth between the "raw" JSON document and the Avro-encoded file. At any time, you can decode the Avro-encoded file to text, to examine/modify it in a text editor.
The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)
jerf 2021-08-18 16:40:36 +0000 UTC [ - ]
For CSV file replacements, I'd expect something like "one JSON array per line, all values must be JSON scalars". In that case, it's not much larger than a CSV, especially one using quotes already for string values.
But this demonstrates the problem with JSON for CSV, I suppose. Is each line an object? Is it wrapped in a top-level array or not? If it is objects, do the objects have to be one line? If it is an object, where do we put field order? The whole problem we're trying to solve with CSV is that it's not a format, it's a family of formats, but without some authority coming in and declaring a specialized JSON format we end up with a family of JSON formats to replace CSV as well. I'd still say it's a step up; at least the family of JSON formats is unambiguously parseable and the correct string values will pop out. But it's less of a full solution than I'd like.
(It wouldn't even have to be that much of an authority necessarily, but certainly more than "The HN user named jerf declares it to be thus." Though I suppose if I registered "csvjson.org" or some obvious variant and put up a suitably professional-looking page that might just do the trick. I know of a few other "standards" that don't seem to be much more than that. Technically, even JSON itself wasn't much more than that for a lot of its run, though it is an IETF standard now.)
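For what it's worth, the "one JSON array per line, all values JSON scalars" flavour needs nothing beyond a stock JSON library. A sketch, with the file name as a placeholder:

import json

rows = [
    ["id", "name", "height_m"],   # header as the first line, mirroring CSV convention
    [1, "joe", 1.76],
    [2, "bob", None],
]

# Write: one JSON array per line, all values scalars.
with open("people.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read: each line parses independently, so a reader can split on newlines
# and hand chunks to parallel workers, which RFC 4180 CSV cannot guarantee.
with open("people.jsonl") as f:
    for line in f:
        print(json.loads(line))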
turtles_ 2021-08-18 17:47:29 +0000 UTC [ - ]
To be fair it's not an objectionable format. Using line breaks to separate objects makes it streamable, and you don't need to enclose the whole thing in an array to make it a valid JSON document.
jerf 2021-08-18 19:49:15 +0000 UTC [ - ]
derefr 2021-08-18 22:52:35 +0000 UTC [ - ]
I see CSV files that contain JSON arrays/objects in their fields all the time. Mainly from exporting Postgres tables that contain json/jsonb-typed columns. Are you saying that these aren't valid CSVs?
zerocrates 2021-08-19 04:01:06 +0000 UTC [ - ]
Such a format could contain arbitrary JSON in a "cell", but simply as text, in the same way as putting the same data in a CSV.
shaftoe 2021-08-19 11:20:36 +0000 UTC [ - ]
cdcarter 2021-08-18 21:08:17 +0000 UTC [ - ]
da_chicken 2021-08-18 17:04:36 +0000 UTC [ - ]
How is that not a problem with every data serialization format? It does me no real good if I have an XML schema and a corresponding file. If I don't know what those elements and attributes represent I'm not really any better off.
It's not like JSON or XML can meaningfully be marshaled back into objects for use generically without knowledge of what is represented. There are generic JSON and XML readers that allow you to parse the data elements sure, but so, too, do generic CSV readers like C#'s CsvHelper or Python's csv. In all cases you have to know what the object turns into in the application before the serialized data is useful.
And, yes, CSV has slightly differing formats, but so does JSON. Date formats are conventionally ISO 8601, but that's not in the spec. That's why Microsoft got away with proprietary date formats in System.Text.Json. XML isn't really any better.
cpx86 2021-08-18 20:26:36 +0000 UTC [ - ]
What's proprietary in it? It follows ISO 8601-1:2019 and RFC 3339 according to the docs.
da_chicken 2021-08-18 21:20:42 +0000 UTC [ - ]
In .Net Framework 4.6 and earlier, the only built-in JSON serializer in the .Net Framework was System.Runtime.Serialization.Json.DataContractJsonSerializer.
You can still see it. If you're on Windows 10, run Windows Powershell v5.1 and run:
You'll see this output: Microsoft didn't fix their weird JSON serialization until quite late. They may have back-ported it to the .Net Framework, but they've deleted that documentation. PowerShell v6 and v7 include the newer classes that are properly behaved. This is why Json.NET used to be so popular and ubiquitous for C# and ASP applications. It generated JSON like most web applications do, not the way Microsoft's wonky class did. Indeed, I believe it may be what System.Text.Json is based on.
cpx86 2021-08-19 16:02:31 +0000 UTC [ - ]
To be fair, RFC 3339 wasn't even published back when this class was implemented (in .NET 3.5) so I guess they just went with whatever worked for their needs. ¯\_(ツ)_/¯
a9h74j 2021-08-19 03:45:34 +0000 UTC [ - ]
hermitcrab 2021-08-18 19:53:56 +0000 UTC [ - ]
Yes, this is a major pain. It can be avoided by using Tab separated value (TSV) files, which don't use escaping. But then you can't store Tabs or carriage returns in your data. Also there is no way to store metadata in TSV.
JSON is far from ideal for storing 2D data tables as it is a tree. This means it is much more verbose than it needs to be. The same is also true for XML.
divbzero 2021-08-18 21:01:40 +0000 UTC [ - ]
hermitcrab 2021-08-18 21:22:02 +0000 UTC [ - ]
mavhc 2021-08-18 21:41:14 +0000 UTC [ - ]
hermitcrab 2021-08-18 22:03:33 +0000 UTC [ - ]
Also, the fact that they have been around for years and never taken off implies that they probably never will.
sli 2021-08-19 00:25:20 +0000 UTC [ - ]
Realistically I don't think anyone (at least on HN) would presume that editor support would be so weirdly obtuse, but it's not exactly well-defined here, either.
I'm mostly just thinking out loud, here, not really critiquing anyone.
mavhc 2021-08-19 06:58:02 +0000 UTC [ - ]
Who's writing CSV files by hand though?
I'd expect some kind of CSV mode in my text editor where Tab and Newlines are turned into keys to insert those characters
kevin_thibedeau 2021-08-19 00:54:52 +0000 UTC [ - ]
[modifier]-[C,D,E,F]
epmos 2021-08-19 10:43:42 +0000 UTC [ - ]
Other posters have noted that they have seen ASCII separators recently; I have too, as recently as 2018, internally at a FAANG. So they are not completely unknown in the wild.
xkgt 2021-08-19 05:28:29 +0000 UTC [ - ]
a1369209993 2021-08-19 01:31:51 +0000 UTC [ - ]
Yes, because that just kicks the can down the road: how do you represent the one-byte string "\x1E"?
BenjiWiebe 2021-08-19 02:12:12 +0000 UTC [ - ]
nicoburns 2021-08-18 21:14:46 +0000 UTC [ - ]
Fwirt 2021-08-19 07:14:42 +0000 UTC [ - ]
Doing this avoids quoting, which avoids an entire class of problems. And yes, if you're passing multi-megabyte files around on a regular basis, or transferring giant blobs of text or binary data, or looking for a storage format with metadata then csvs are an absolutely awful choice, but as with any problem you pick the right tool for the job, and csvs are a great tool when you want to easily move some structured alphanumeric data around in a format that's trivial to parse in any language and widely supported.
hermitcrab 2021-08-19 07:40:13 +0000 UTC [ - ]
But then you have the additional problem of trying to work out what the delimiter is. You can take an educated guess, but this isn't going to be 100% reliable.
2021-08-18 23:38:41 +0000 UTC [ - ]
osigurdson 2021-08-18 15:49:23 +0000 UTC [ - ]
However it is not particularly readable/diff-able if this is part of your use case.
ectopod 2021-08-18 20:30:29 +0000 UTC [ - ]
https://en.wikipedia.org/wiki/Microsoft_Excel#XML_Spreadshee...
2021-08-18 21:04:53 +0000 UTC [ - ]
unnouinceput 2021-08-18 17:52:03 +0000 UTC [ - ]
dtparr 2021-08-18 20:57:28 +0000 UTC [ - ]
shakna 2021-08-18 23:21:52 +0000 UTC [ - ]
mavhc 2021-08-18 21:42:42 +0000 UTC [ - ]
meepmorp 2021-08-19 00:34:05 +0000 UTC [ - ]
2021-08-18 17:55:36 +0000 UTC [ - ]
anigbrowl 2021-08-18 21:52:50 +0000 UTC [ - ]
stickfigure 2021-08-18 17:29:01 +0000 UTC [ - ]
If you are going to displace a standard, it has to be significantly better than the old.
derefr 2021-08-18 17:54:58 +0000 UTC [ - ]
The semantics of CSV — how it decodes, if you don't have an application making additional configured assumptions or using heuristics to detect microformats — is just N columns, one of header names, N of raw text cells. Nothing more, nothing less. No schema defining what a column's type is.
stickfigure 2021-08-18 18:38:29 +0000 UTC [ - ]
If you want something better than CSV to take off, at the very least it needs to solve the common pain points of CSV. JSON doesn't - all it brings to the table is the ability to distinguish number vs text. That's a yawn.
breck 2021-08-18 20:30:49 +0000 UTC [ - ]
Agreed. I don't ever see something better than CSV taking off (having worked on this problem for better part of a decade) unless you could get to the point where you've got a largely similar notation, but a richer semantic grammar powered by something like Schema.org.
In other words, a strongly typed CSV with a community backed schema. I think GPT-3 and the like are a big wild card here. I can't describe how, but I expect in 20 years we will be using something better for storing data and will look back and say "CSV used to be the best but then GPT-N _____________".
lmm 2021-08-19 04:02:00 +0000 UTC [ - ]
It brings the ability to parse in parallel - that's a big deal. And while number vs text might not be a huge difference in theory, in practice it eliminates what, 95% of real-world parse problems?
stickfigure 2021-08-19 07:10:12 +0000 UTC [ - ]
I think you could fix most of the pain of CSV simply by adding a second header row which defines the type of the column, using a common vocabulary of types. TEXT, NUMBER, ISO8601-DATE, etc.
vertere 2021-08-19 09:08:20 +0000 UTC [ - ]
On text vs numbers, at least some widely-used software (e.g. R, Excel) will try to guess for you. It should be obvious how this might cause problems. Maybe one should turn auto-conversion off (or not use things that don't let you turn it off) and specify which columns are numbers. Some datasets have a lot of columns, so this can be a PITA, even if you do know which ones should be numbers. But the bigger problem is if you have to deal with other people, or the files that they've touched. There are always going to be people that edit data in excel, don't use the right options when they import, etc.
derefr 2021-08-18 19:11:22 +0000 UTC [ - ]
Sunspark 2021-08-18 15:54:57 +0000 UTC [ - ]
I realize that this is a specific use case here, but I was on Cognos for years, and then when they started shifting over to Tableau, it wasn't any better.. csv, MS formats, proprietary formats, etc.
cbsmith 2021-08-18 16:23:26 +0000 UTC [ - ]
tw04 2021-08-18 16:42:04 +0000 UTC [ - ]
>The biggest and most thorny problem to solve is the people problem: how do you convince people to stop creating new CSVs when they’ve never done things any other way? Fortunately - for this problem, anyway - most of the world’s business data is born in one of a handful of programs that are owned by an increasingly small number of companies. If Microsoft and Salesforce were somehow convinced to move away from CSV support in Excel and Tableau, a large portion of business users would move to a successor format as a matter of course. Of course, it’s debatable whether that kind of a change is in the best interest of those companies, but I’m cautiously optimistic.
cbsmith 2021-08-18 16:50:08 +0000 UTC [ - ]
No, it's not. It's about replacing CSVs for exchanging data. It mentions that CSVs often are the product of someone exporting data from a spreadsheet or doing a table dump, and how just doing that tends to create a ton of problems, but Excel is an example, not the subject matter of the article.
> The business world will laugh at you and continue on their merry CSV way.
The business world pays me a lot of money to teach them not to use CSVs.
masklinn 2021-08-18 17:20:28 +0000 UTC [ - ]
Could you teach them better and faster? I don't think they're getting it. You have my blessing to use violence.
barbarbar 2021-08-19 03:26:47 +0000 UTC [ - ]
cbsmith 2021-08-18 17:42:59 +0000 UTC [ - ]
I'm trying man. I'm trying.
wombatpm 2021-08-18 18:06:25 +0000 UTC [ - ]
hermitcrab 2021-08-18 20:09:46 +0000 UTC [ - ]
-the first line is always a header
-fields are separated by Unit separator characters
-records are separated by Record separator characters
-encoding is UTF8
If you wanted to get fancy you could also have:
-comment lines
-column metadata (e.g. column 0 is an ISO date, column 2 is text, column 3 is an integer)
Both the above could start with a Unicode character unlikely to be used for anything else.
I think that would avoid 99% of the pain of CSV files. The downside is that the use of things like the Unit separator mean that it wouldn't be easy to create/edit manually.
I don't suppose it will ever happen though.
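A minimal sketch of the encode/decode side of such a format, using the ASCII Unit (0x1F) and Record (0x1E) separators and UTF-8 as described; the comment and column-metadata extensions are left out:

UNIT_SEP, RECORD_SEP = "\x1f", "\x1e"

def dumps(rows):
    # No quoting or escaping: US/RS are control characters ordinary text data never contains.
    return RECORD_SEP.join(UNIT_SEP.join(str(field) for field in row) for row in rows)

def loads(blob):
    return [record.split(UNIT_SEP) for record in blob.split(RECORD_SEP)]

table = [
    ["date", "note", "amount"],                        # first record is the header
    ["2021-08-18", "multi-line\nnote, with a comma", "42.50"],
]

encoded = dumps(table).encode("utf-8")
print(loads(encoded.decode("utf-8")))  # round-trips commas and newlines untouched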
cbsmith 2021-08-18 21:52:33 +0000 UTC [ - ]
I mean, you'd have to be using a pretty terrible tool for it not to be able to handle that, and I suspect if such an approach were to become prevalent, that tool would either fix the glitch or become largely unused.
hermitcrab 2021-08-18 22:00:58 +0000 UTC [ - ]
cbsmith 2021-08-18 22:05:54 +0000 UTC [ - ]
hermitcrab 2021-08-18 22:15:54 +0000 UTC [ - ]
cbsmith 2021-08-19 02:12:51 +0000 UTC [ - ]
lanstin 2021-08-19 04:34:14 +0000 UTC [ - ]
cbsmith 2021-08-19 11:08:53 +0000 UTC [ - ]
But you gotta admit that excel is a pretty terrible tool to hand edit a CSV with... ;-)
scbrg 2021-08-19 08:33:16 +0000 UTC [ - ]
cbsmith 2021-08-19 11:09:43 +0000 UTC [ - ]
atatatat 2021-08-19 14:28:59 +0000 UTC [ - ]
cbsmith 2021-08-19 16:44:39 +0000 UTC [ - ]
barbarbar 2021-08-19 03:19:52 +0000 UTC [ - ]
cbsmith 2021-08-18 19:14:41 +0000 UTC [ - ]
...or you could invent abominations like CSV, TSV, etc. ;-)
wombatpm 2021-08-18 18:03:54 +0000 UTC [ - ]
hermitcrab 2021-08-18 20:10:50 +0000 UTC [ - ]
cbsmith 2021-08-18 19:13:10 +0000 UTC [ - ]
derefr 2021-08-18 20:07:55 +0000 UTC [ - ]
cbsmith 2021-08-18 20:16:20 +0000 UTC [ - ]
cbsmith 2021-08-18 16:11:18 +0000 UTC [ - ]
Avro is not a lossless alternate encoding of what's canonically a JSON document. Yes Avro supports a JSON encoding, but it's not canonical.
In general though, your point about Avro being able to be represented as text is valid, and applies to practically any binary format, which is why the whole "but it needs to be a text format" argument is garbage.
> The data-warehouse ecosystem already standardized on Avro as its data interchange format. And spreadsheets are just tiny data warehouses. So why not? ;)
I wish that the data-warehouse ecosystem standardized on anything. ;-)
That said, there are plenty of good reasons why a data-warehouse standard would not be advisable for spreadsheets.
derefr 2021-08-18 17:59:52 +0000 UTC [ - ]
To be clear, I'm not talking about using an Avro library to encode data to JSON. I'm saying that when you decode an Avro document, the result that comes out — presuming you don't tell the Avro decoder anything special about custom types your runtime supports and how it should map them — is a JSON document.
Where, by "JSON document" here, I don't mean "a JSON-encoded text string", but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less. The sum-type of (JSONArray | JSONObject | String | Integer | Float | true | false | null). The "top" expression type recognized by a JSON parser. Some might call such a document a "JSON DOM." But usually it's a "JSON document", same as how the ADT you get by parsing XML is usually referred to as an "XML document."
Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data. They're all alternative encodings that have equivalent lossless encoding power over the same supported inputs.
cbsmith 2021-08-18 18:29:04 +0000 UTC [ - ]
Semantic point: it's not a "document".
There are tools which will decode Avro and output the data in JSON (typically using the JSON encoding of Avro: https://avro.apache.org/docs/current/spec.html#json_encoding), but the ADT that is created is by no means a JSON document. The ADT that is created has more complex semantics than JSON; JSON is not the canonical representation.
> By which I don't mean JSON-encoded text, but rather an in-memory ADT that has the exact set of types that exist in JSON, no more and no less.
Except Avro has data types that are not the exact set of types that exist in JSON. The first clue on this might be that the Avro spec includes mappings that list how primitive Avro types are mapped to JSON types.
> Or, to put that another way, Avro is a way to encode JSON-typed data, just as "JSON text", or https://bsonspec.org/, is a way to encode JSON-typed data
BSON, by design, was meant to be a more efficient way to encode JSON data, so yes, it is a way to encode JSON-typed data. Avro, however, was not defined as a way to encode JSON data. It was defined as a way to encode data (with a degree of specialization for the case of Hadoop sequence files, where you are generally storing a large number of small records in one file).
A simple counter example: Avro has a "float" type, which is a 32-bit IEEE 754 floating point number. Neither JSON nor BSON have that type.
Technically, JSON doesn't really have types, it has values, but even if you pretend that JavaScript's types are JSON's types, there's nothing "canonical" about JavaScript's types for Avro.
Yes, you can represent JSON data in Avro, and Avro in JSON, much as you can represent data in two different serialization formats. Avro's data model is very much defined independently of JSON's data model (as you'd expect).
derefr 2021-08-18 19:42:55 +0000 UTC [ - ]
My understanding was always:
1. that the "primitive Avro types" are Avro's wire types, which are separate from its representable domain types. (Sort of like how RLE-ified data has wire types of "literal string" and "repeat literal N times".)
2. that any data that would not be valid as input to a JSON encoder is not valid as input to an Avro encoder, because its wire types are defined in terms of a mapping from a set of domain types that are exactly the set of domain types accepted by JSON encoders (whether they're explicitly noted as being those or not).
Or, to put that another way: an Avro schema is — besides a validation step that constrains your data into a slightly-more-normalized/cleaned format — mostly a big fat hint for how to most-efficiently pack an (IMHO strictly JSONly-typed) value into a binary encoding. Differences between "long" and "int" on the wire aren't meant to decode to different domain types (at least, by default); they're just meant to restrict the data's allowed values (like a SQL DOMAIN constraint) in ways that allow it to be more predictable, and so to be wire-encoded more optimally.
Let me lay out some evidence for that assertion:
• Avro supports specifying e.g. "bytes" vs. {"array": "byte"} — there's literally no domain-type difference in those! But one is a wire-encoding optimization over the other.
• Avro has a "default" property, and this property—as part of the JSON-typed schema—can only take on JSON-typed values. Do you think this is an implementation constraint, or a design choice?
• Avro's enum type's "symbols" array? Once again, defined by (and therefore limited to) JSON string values.
• Avro doesn't implement an arbitrary-precision integer type, even though its wire-encoding for integers would support one just fine. Why? Seemingly only because JSON doesn't have an arbitrary-precision integer type (because JavaScript doesn't have a native BigNum type); nor does JavaScript/JSON have any obvious type to O(1)-efficiently deserialize a BigNum out into. (Deserializing BigNums to strings wouldn't be O(1).) Every other language offers a clean 1:1 mapping for bignums, but JavaScript doesn't, so JSON didn't, so Avro doesn't.
• And why do you think Avro schemas are stored as embedded explicitly-defined-to-be-JSON documents within the root-level record / .avsc file, anyway? This means that you are required to have a JSON decoder around (either at decode time, or at decoder codegen time) to decode Avro documents. Why would this be, if not because the Avro implementation is (or at least originally was) expected to decode the Avro document's wire types into the JSON library's already-defined ADTs, relying on e.g. having those "default"-parameter values already loaded in JSON-value format from the schema's decode-output, ready to be dropped seamlessly into the resulting Avro decode-output?
And the biggest knock-down argument I'm aware of:
• Avro "string" doesn't support "\u0000". Why not? Because as you've said, Avro has a "JSON encoding", which specifies one-to-one mapping for strings; and JSON doesn't support "\u0000" in strings. (Just ask Postgres's jsonb type about that.) Since an Avro string containing "\u0000" wouldn't round-trip losslessly between the JSON and binary wire-encodings, it's not allowed in strings in the binary encoding.
cbsmith 2021-08-18 20:47:28 +0000 UTC [ - ]
Based on these comments, my best guess is you got the idea that Avro was for encoding JSON because the schema declaration is encoded in JSON, but that's not nearly the same as the data model. There are some terrible implementations of Avro libraries out there that use JSON as some kind of middleware, but that's not how Avro actually works.
If there's a type model it is derived from at all, it's the Java type model.
"byte" is not a valid type in Avro. There is only "bytes", and the domain model reflects this. You can't work with individual "byte" of a "bytes" object.
Default values are encoded in the schema, and so that does limit what kind of default values you can have, but again this is a limitation derived from the schema being defined in JSON, and how the schema language was defined in general. So your defaults have to be represented as JSON literals, but they don't even necessarily share the type of the JSON literal (e.g. a field defined as '{"name": "foo", "type": "long", "default":1}' does not have the same default value as '{"name": "bar", "type": "int", "default":1}', because "foo" has a default value that is a long while "bar" has one that is an "int"). Note that "default values" are a property of the type, and only apply to elements inside complex data types. JSON has no such equivalent concept.
Avro's type model does have an arbitrary precision type that doesn't correlate to anything in JSON: the "decimal" logical type.
You aren't required to use a JSON decoder to decode Avro documents, nor are you required to use a .avsc file. The Avro schema file is just the standard way to represent a schema. If you have the schema, you don't need the file. JSON schema files are one of the poorer choices in the Avro design, but you'll notice that the schema is defined the way it is specifically so that it can cover a type model well outside of JSON. You'll also notice the names of types in Avro don't directly correlate to names of types in JSON.
* The \u0000 thing is a bug in avro tools, but there is nothing about the spec that prohibits having \u0000 in your strings.
I feel like in general this is like a retcon exercise, where you've reached a conclusion and are drawing evidence to prove it, while ignoring the information that contradicts. I spoke with Cutting a fair bit when he came up with Avro, and I can assure you, while the schema language does very intentionally use JSON, Avro is not a representation for JSON types.
amyjess 2021-08-18 22:37:26 +0000 UTC [ - ]
[0] though the last time I touched it was in 2014, so things might've changed since
[1] as per [0], this was before Python dicts were ordered by default
cbsmith 2021-08-19 02:20:38 +0000 UTC [ - ]
nonameiguess 2021-08-18 16:06:08 +0000 UTC [ - ]
For what it's worth, I totally agree something like compressed json lines is a better data exchange format, but part of why csv remains as universal and supported as it is is that so much existing data storage applications export to either csv or excel and that's about it. So any ETL system that can't strictly control the source of its input data has no choice but to support csv.
breck 2021-08-18 20:24:46 +0000 UTC [ - ]
There's a subset of CSV that forbids escapes that is super fast to parse. All fast CSV parsers I'm aware of take advantage of this subset. I try to never ever publish a CSV that has quotes, and always aim for a more restrictive grammar that is cleaner, better thought out data.
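A sketch of why that restricted subset is so cheap to handle: with quotes forbidden, every newline is a record boundary and every comma a field boundary, so a parser (or a chunking step for parallel parsing) only ever needs to split:

def parse_quote_free_csv(text):
    # Only valid for the restricted subset: no quotes, so no embedded commas or newlines.
    if '"' in text:
        raise ValueError("quoted fields present; fall back to a full RFC 4180 parser")
    return [line.split(",") for line in text.splitlines()]

print(parse_quote_free_csv("ts,Tc\n1615165342,5.6\n1615165350,0.6"))
# [['ts', 'Tc'], ['1615165342', '5.6'], ['1615165350', '0.6']]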
grumple 2021-08-18 17:28:58 +0000 UTC [ - ]
In practice, I used TSVs a lot, as tabs do not usually occur in most data. Alternatively, you could use pipes (|) or control characters as field or row separators.
kaiju0 2021-08-18 16:05:20 +0000 UTC [ - ]
*shudder
gsich 2021-08-18 16:24:54 +0000 UTC [ - ]
Of course this is not always applicable, since you sometimes don't control the format you get your data.
jolmg 2021-08-18 16:44:35 +0000 UTC [ - ]
When I see CSV parsers like https://www.papaparse.com/ that even try to support comments and empty lines in the format, I wonder if it'd really be that bad to just raise an error on anything that doesn't fit RFC 4180[1], with explanation on where and how the file is corrupted. Push the issue to the writers of such files.
Then I remember the Robustness Principle[2] and I chill a little.
[1] https://datatracker.ietf.org/doc/html/rfc4180
[2] https://en.wikipedia.org/wiki/Robustness_principle
hermitcrab 2021-08-18 20:14:44 +0000 UTC [ - ]
unnah 2021-08-19 06:43:44 +0000 UTC [ - ]
worik 2021-08-19 01:28:13 +0000 UTC [ - ]
asdff 2021-08-18 17:17:04 +0000 UTC [ - ]
jhbadger 2021-08-18 17:58:25 +0000 UTC [ - ]
jolmg 2021-08-18 18:59:48 +0000 UTC [ - ]
Maybe it's actually presenting what you've selected last?
It's giving me comma as the default separator, and it's the first option in the dropdown. Tab is the 3rd option.
>> Those other things have different names like TSV
That depends on the writer. I've gotten what should be named PTVs (pipe-terminated values) as CSVs. I can understand how it happened: if the underlying software outputs PTVs, you don't want to bother converting them because you're working in a legacy language that's a pain to work with (the type where identifiers can't be longer than 4 chars), and you want the user to be able to double-click the file and have it open in a spreadsheet without prior configuration, so you just push the issue to the reader of the file, since by tradition readers are already quite tolerant of format differences...
Of course, there'll always be the case where the reader is simply not tolerant enough, like when the escaping syntax differs. There doesn't seem to be a way to get LO Calc to interpret "foo|bar\|baz|" as cells "foo" and "bar|baz", for example.
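A programmatic reader can at least be told about that dialect even when Calc can't; a sketch with Python's csv module, assuming backslash-escaped pipes and no quoting:

  import csv, io

  data = "foo|bar\\|baz|\n"   # i.e. the characters foo|bar\|baz| from above
  reader = csv.reader(io.StringIO(data), delimiter="|",
                      escapechar="\\", quoting=csv.QUOTE_NONE)
  next(reader)
  # ['foo', 'bar|baz', ''] -- the trailing '' appears because csv treats '|' as a
  # separator rather than a terminator; a truly pipe-terminated reader would drop it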
derefr 2021-08-18 20:09:29 +0000 UTC [ - ]
2021-08-18 21:13:55 +0000 UTC [ - ]
gsich 2021-08-18 22:04:38 +0000 UTC [ - ]
2021-08-18 22:07:59 +0000 UTC [ - ]
veltas 2021-08-18 20:57:14 +0000 UTC [ - ]
I find this really hard to believe given it's a simple enough syntax. Parsing is usually not the limiting factor; it's generally fast enough not to be noticed alongside interpreting or loading the source data. Every (much more sophisticated) compiler I can think of uses a linear parser based on this assumption.
mortehu 2021-08-18 21:07:41 +0000 UTC [ - ]
veltas 2021-08-19 12:10:14 +0000 UTC [ - ]
worik 2021-08-19 01:29:47 +0000 UTC [ - ]
Binary formats have their own pain points - more of them actually
glogla 2021-08-18 17:14:17 +0000 UTC [ - ]
I'm still looking for something that can do it.
(Of course, Spark doesn't support timestampz which is probably why the formats don't.)
7thaccount 2021-08-18 20:54:02 +0000 UTC [ - ]
I'll admit though that "import JSON" and then being able to essentially convert the entire file into a dictionary is nice if the data has more structure to it.
jrumbut 2021-08-19 01:05:34 +0000 UTC [ - ]
If I import JSON data I have no idea what shape the result will be in, and it requires a separate standard to let me know about columns and rows and validation can get complicated.
justinator 2021-08-19 03:36:08 +0000 UTC [ - ]
Frankly, I love the format.
emmelaich 2021-08-19 00:15:50 +0000 UTC [ - ]
CSV is fine .. usually
sgolestane 2021-08-19 01:05:25 +0000 UTC [ - ]
dheera 2021-08-18 21:18:00 +0000 UTC [ - ]
mivade 2021-08-18 21:19:43 +0000 UTC [ - ]
meepmorp 2021-08-18 23:59:51 +0000 UTC [ - ]
dheera 2021-08-18 21:25:24 +0000 UTC [ - ]
RodgerTheGreat 2021-08-19 00:43:24 +0000 UTC [ - ]
2021-08-19 05:55:15 +0000 UTC [ - ]
08-15 2021-08-18 23:30:41 +0000 UTC [ - ]
...until there is a newline inside a field.
The moronic quoting mechanism of CSV is one half of the problem; people like you, who try to parse it by "just splitting strings", are the other half. The third half is that it's locale-dependent and that, after 30+ years, people still don't use Unicode.
bsg75 2021-08-19 00:18:11 +0000 UTC [ - ]
There are non-printable, non-typable characters specifically defined as separators (ASCII 28-31) with UTF equivalents.
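A sketch using two of them, the unit separator (31) between fields and the record separator (30) between rows, which sidesteps quoting entirely as long as the data itself never contains those bytes:

  FS = "\x1f"  # ASCII 31, unit separator: between fields
  RS = "\x1e"  # ASCII 30, record separator: between rows

  def dumps(rows):
      return RS.join(FS.join(fields) for fields in rows) + RS

  def loads(text):
      return [record.split(FS) for record in text.split(RS) if record]

  rows = [["name", "note"], ["alice", 'has "quotes", commas,\nand a newline']]
  assert loads(dumps(rows)) == rows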
7thaccount 2021-08-19 03:17:58 +0000 UTC [ - ]
I appreciate the perspective though for sure. I'm guilty of the same thing on HN, where I assume people have the same uses as me when they're writing code that has to be extremely robust or blazingly fast.
7thaccount 2021-08-19 03:10:58 +0000 UTC [ - ]
Also, when doing interactive work, it's a bit different than writing production IT software.
karteum 2021-08-18 17:39:52 +0000 UTC [ - ]
"1) A truly open format is available" : sqlite is open-source, public-domain, and well specified (even though I am usually not so happy with its weak-typing approach, in this case that is precisely what enables a 100% correspondence between CSV and sqlite, since CSV also has no typing at all...)
"2) Applications have a speed increase from using csvs" : I think it should be obvious to everyone that this is the case...
"3) The new format should not be grossly larger than the one it is replacing" : this is also the case
anigbrowl 2021-08-18 22:05:38 +0000 UTC [ - ]
For single tables a database is probably overkill, but it's nice to have around when you need something reasonably powerful without being overly complex or hard to get started with.
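Getting started really is small; loading a CSV into a single-table SQLite file is a few lines of standard-library Python (a sketch; the file names are placeholders, and the untyped columns keep values exactly as the CSV had them):

  import csv, sqlite3

  with open("input.csv", newline="") as f:      # placeholder file name
      rows = list(csv.reader(f))
  header, data = rows[0], rows[1:]

  con = sqlite3.connect("output.sqlite")        # placeholder file name
  cols = ", ".join(f'"{name}"' for name in header)   # untyped columns: no coercion
  marks = ", ".join("?" for _ in header)
  con.execute(f"CREATE TABLE data ({cols})")
  con.executemany(f"INSERT INTO data VALUES ({marks})", data)
  con.commit()
  con.close()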
dolni 2021-08-19 03:54:50 +0000 UTC [ - ]
Shoutout to the Library of Congress for doing the real heavy lifting here?
unnouinceput 2021-08-18 17:50:30 +0000 UTC [ - ]
euroderf 2021-08-19 06:37:42 +0000 UTC [ - ]
postalrat 2021-08-18 18:09:22 +0000 UTC [ - ]
Well there's your problem.
munk-a 2021-08-18 20:55:20 +0000 UTC [ - ]
postalrat 2021-08-18 22:00:43 +0000 UTC [ - ]
icelancer 2021-08-19 01:09:11 +0000 UTC [ - ]
creata 2021-08-19 05:04:39 +0000 UTC [ - ]
atatatat 2021-08-19 14:46:06 +0000 UTC [ - ]
HanaShiratori 2021-08-19 01:02:45 +0000 UTC [ - ]
vertere 2021-08-19 07:16:26 +0000 UTC [ - ]
/s
dragonwriter 2021-08-18 16:05:04 +0000 UTC [ - ]
The author appears to be a consultant selling data prep/transformation services. As long as the market is using CSVs, he’s forced to use CSVs, at least as end-of-pipeline inputs and outputs.
Of course, “people optimize their workflows for something other than making my job easy” is a common complaint, but rarely a persuasive one when it comes to motivating action from others with different jobs.
dylan604 2021-08-18 16:23:10 +0000 UTC [ - ]
dragonwriter 2021-08-18 16:26:46 +0000 UTC [ - ]
dylan604 2021-08-18 18:35:11 +0000 UTC [ - ]
lallysingh 2021-08-18 22:38:41 +0000 UTC [ - ]
Ascii #31 instead of commas, Ascii #30 instead of newlines. Now those characters can go into your values.
If that's no good, zstd-compressed proto.
cls59 2021-08-18 15:55:14 +0000 UTC [ - ]
da_chicken 2021-08-18 16:13:36 +0000 UTC [ - ]
The advantage of CSV is that it's as accurate as your plain text representation of your data can be. Since binary data can be represented by character data, that's 100% accurate. As soon as you introduce a storage format that has made assumptions about the type of data being stored, you've lost flexibility.
SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.
Tagbert 2021-08-18 18:24:33 +0000 UTC [ - ]
I'd be satisfied if Excel was just a little less aggressive about its data coercion on import.
hermitcrab 2021-08-18 20:18:41 +0000 UTC [ - ]
masklinn 2021-08-18 16:28:37 +0000 UTC [ - ]
SQLite's "dynamic data types" coerce data on input, not output. Once the data is in sqlite the way you wanted it, excel has no interpretation to perform, except insofar as really really wanting dates.
> The advantage of CSV is that it's as accurate as your plain text representation of your data can be.
Yeah nah.
> SQLite is not intended for data serialization. It's intended for data storage to be read back by essentially the same application in the same environment.
That's completely, absolutely, utterly, objectively, nonsensical.
Data analysis, exchange, container, and archiving, are literally all use cases listed as "appropriate uses for sqlite" in "Situations Where SQLite Works Well" on the official sqlite website: https://www.sqlite.org/whentouse.html
edmundsauto 2021-08-18 16:43:15 +0000 UTC [ - ]
Cockiness tells me that you’re insecure about your knowledge, not that you know more than GP.
X6S1x6Okd1st 2021-08-18 17:15:25 +0000 UTC [ - ]
for line in input: output.write(','.join(line) + '\n')  # naive CSV writer: no quoting or escaping anywhere
It's not exactly a problem with "CSV" specifically, but the environment in which it exists.
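The failure mode is easy to show, and so is what that bare join skips (using Python's csv.writer for comparison):

  import csv, io

  row = ["ACME, Inc.", 'say "hi"']

  ",".join(row)
  # 'ACME, Inc.,say "hi"'  -- splits back into three fields, data silently corrupted

  buf = io.StringIO()
  csv.writer(buf).writerow(row)
  buf.getvalue()
  # '"ACME, Inc.","say ""hi"""\r\n'  -- quoted and escaped, round-trips to two fields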
edmundsauto 2021-08-18 21:06:49 +0000 UTC [ - ]
masklinn 2021-08-18 17:15:02 +0000 UTC [ - ]
The idea that a CSV would be more likely to be correctly encoded than a DB is hilarious, thanks for the laugh. But that you were confident enough to seriously put it in writing shows how little experience you have with CSV.
edmundsauto 2021-08-18 21:05:23 +0000 UTC [ - ]
A CSV file represents the exact digits that are to be stored. You have unlimited precision. You could even store irrational numbers, equations, or mix data types in a column. OTOH, you have to make sure the delimiting character is not present in the data - that can be pretty easy, if you use the field-separator character in ASCII, or even just a \t. I've even seen people terminate fields with ^8675309| because they felt confident no data would contain Jenny's number.
A database, like Excel, likes to conform data. This is usually awesome! But sometimes, it's not.
da_chicken 2021-08-18 16:53:53 +0000 UTC [ - ]
That's not relevant. If a data format coerces data when it's stored, it's still not a data format suitable for serialization.
dasyatidprime 2021-08-18 17:58:34 +0000 UTC [ - ]
If you do add column types to your tables as in traditional databases, then of course you should expect some varying behavior based on that.
I can imagine some potential for issues from people expecting to insert exact decimal numbers and choosing floating-point by accident, perhaps, or using integer types for actually-text numeric ID fields. If you go all-strings on input with no column affinity, then you'll get the same strings on output, though, so that's isomorphic to good-CSV.
¹ https://www.sqlite.org/datatype3.html – §3.1 “If the declared type for a column contains the string "BLOB" or if no type is specified then the column has affinity BLOB.” + §3 “A column with affinity BLOB does not prefer one storage class over another and no attempt is made to coerce data from one storage class into another.”
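Concretely, the difference the affinity rules make (a small sqlite3 sketch):

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE typed (x INTEGER)")   # INTEGER affinity
  con.execute("CREATE TABLE untyped (x)")         # no declared type -> BLOB affinity
  con.execute("INSERT INTO typed VALUES ('007')")
  con.execute("INSERT INTO untyped VALUES ('007')")

  con.execute("SELECT x FROM typed").fetchone()    # (7,)      text coerced to an integer
  con.execute("SELECT x FROM untyped").fetchone()  # ('007',)  stored and returned verbatim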
lmilcin 2021-08-18 15:57:25 +0000 UTC [ - ]
srcreigh 2021-08-18 16:00:50 +0000 UTC [ - ]
It also runs on all kinds of embedded devices in addition to personal computers and servers.
https://en.wikipedia.org/wiki/SQLite#Operating_systems
lmilcin 2021-08-18 17:39:30 +0000 UTC [ - ]
Can I click on it and open it in Excel?
If not then it is not portable for me and for a lot of other people.
The main reason I use CSV is to produce reports that I can either open myself or send to other people so that they can click on it and open themselves and immediately start hacking away.
Excel is still corporate lingua franca and until this changes CSV is not going to retire.
CSV means for me that I can send it to anybody and everybody will know how to use it.
Try sending SQLite binary to people you work with, to your accountant, to your wife. Then share your experience.
wombatpm 2021-08-18 18:13:40 +0000 UTC [ - ]
lmilcin 2021-08-18 18:30:01 +0000 UTC [ - ]
Not all editors make it easy for you to use a tab character.
If only there was some other solution that would not require you to use a character that a lot of editors refuse to insert or even try to convert to spaces...
2021-08-18 17:35:33 +0000 UTC [ - ]
ab111111111 2021-08-18 16:31:16 +0000 UTC [ - ]
4) Changes to the replacement format should be human-readable in a diff
jonnycomputer 2021-08-18 16:39:49 +0000 UTC [ - ]
masklinn 2021-08-18 17:21:47 +0000 UTC [ - ]
It's easy to write a diff filter for e.g. xlsx, it's not possible to make CSV any good.
red_admiral 2021-08-19 07:57:32 +0000 UTC [ - ]
4) can easily interact with Excel.
A lot of things 'around here' run off spreadsheets, because relational databases weren't invented here and no-one ever needs more than a few million rows apparently.
I seem to spend half my job on the current project exporting stuff to CSV, running it through my code, then opening the resulting CSV in excel and formatting it a bit and saving as .xlsx again.
Still, at least I don't have to use Visual Basic that way.
2021-08-18 21:32:45 +0000 UTC [ - ]
osacial 2021-08-18 18:56:59 +0000 UTC [ - ]
danellis 2021-08-19 04:37:07 +0000 UTC [ - ]
But some people are. There are entire industries built around the exchange of CSV files, and the producers and consumers don't necessarily talk to each other.
joelthelion 2021-08-18 17:34:28 +0000 UTC [ - ]
There are few reasons to continue using csv in this day and age.
cbsmith 2021-08-18 16:04:26 +0000 UTC [ - ]
2. Yes, mixing formatting with data slows down data processing, don't do it.
3. Excel is not the replacement for CSV, and CSV is not a compact format. I mean, maybe if you are used to XML it is, but otherwise, just no.
Yes, there is truly a problem.
da_chicken 2021-08-18 16:30:02 +0000 UTC [ - ]
That's only true if you're trying to send all your data in a single, monolithic CSV.
If you're sending multiple CSVs, you're capable of representing data as well as a relational data store. Which is to say, you're representing your data using a system of data normalization specifically designed to minimize data duplication. A single CSV represents a single table, and in most cases, with intelligent delimiter selection, you can represent an entire data set with no more than one character spent between fields or records.
Yes, you do have situations where you're losing data density due to using plain-text strings, but that's not a limitation particularly unique to CSV among data serialization formats. Additionally, it is a problem that can largely be mitigated by simple text compression. Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary-precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.
If you're working with an API, object by object, then JSON is certainly going to be better, yes, because you can use the application's object representation. If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring an entire system, you're not going to do much better than CSV.
cbsmith 2021-08-18 16:44:33 +0000 UTC [ - ]
No, that's true in general. There are exceptional cases where it might not be true, but it's true in general.
> Yes, you do have situations where you're storing losing data density due to using plain text strings, but that's not a limitation particularly unique to CSV for data serialization formats.
That there are other inefficient data serialization formats does not mean that it is not an inefficient data serialization format.
> Additionally, it is a problem that can largely be mitigated by simple text compression.
"Simple text compression"? Not sure what you are referring to there. If you are speaking of generalized compression, there's a bunch of information theory pointing out the problem with that thinking.
> Furthermore, once you switch to a non-text representation, you're limiting yourself to whatever that data representation is. It's easy to represent an arbitrary precision decimal number in plain text. It's hard to find a binary representation that universally represents the same data regardless of the system on the other end. Again, that's not a problem unique to CSVs.
It's pretty easy to represent arbitrary precision decimal numbers in non-text formats and they mean the same thing regardless of the system on the other end, but let's say you were right. Great, now represent arbitrary precision binary numbers (you know, the kind that computers work with), in text formats.
> If you're working with an API, object by object, then JSON is certainly going to be better,
Let me stop you right there. It is not "certainly going to be better". JSON is generally worse.
> If you're working with bulk data of many disparate, unrelated, complex objects, however, or where you're transferring and entire system, you're not going to do much better than CSV.
There's a whole world out there of formats that are better than CSV at this job, and in many cases they are A LOT better than CSV. Even Avro, for all its pitfalls, is MUCH better.
sicariusnoctis 2021-08-18 22:06:01 +0000 UTC [ - ]
For most natural text-only data, .csv.gz is probably close to the limits of what we can do. For numerical or categorical text containing patterns or structure, I agree that better formats may exist (e.g. H.264 for CSV files containing video frames). Not sure how a more general format will solve anything you couldn't already do by converting to a more specific format and tarring everything afterwards.
da_chicken 2021-08-18 16:52:55 +0000 UTC [ - ]
Okay, put your money where your mouth is. Prove it.
cbsmith 2021-08-18 17:46:59 +0000 UTC [ - ]
CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
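The arithmetic, for anyone who wants it spelled out:

  import struct

  text = "2021-08-18T15:14:29.123Z"
  millis = 1629299669123                  # the same instant, as milliseconds since the epoch

  len(text.encode("ascii"))               # 24 bytes in the textual form
  len(struct.pack(">q", millis))          # 8 bytes as a signed 64-bit integer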
CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).
Then there's floating point data...
Honestly, this is a silly argument. CSV is, by design, a format that trades space for other factors. It should not be debatable whether it is a compact format or not.
da_chicken 2021-08-18 22:17:38 +0000 UTC [ - ]
XML, JSON, and YAML all have this issue, too.
> CSV uses a variety of date-time formats, but a prevalent one is YYYY-MM-DDThh:mm:ss.sssZ. I'll leave it as an exercise for the reader to determine whether that is as compact as an 8-byte millis since the epoch value.
This is also identical to XML, YAML and JSON.
And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
> CSV also requires escaping of separator characters, or quoting of strings (and escaping of quotes), despite ASCII (and therefore UTF-8) having a specific unit separator character already reserved. So you're wasting space for each escape, and effectively wasting symbol space as well (and that's ignoring the other bits of space for record separators, group separators, etc.).
This is also identical to XML (escaping XML entities, sometimes having to resort to CDATA), YAML (escaping dashes) and JSON (escaping double quotes).
All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats were specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign, separated by great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right? Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF. If you don't like the fact that two systems can format text differently, you're going to have a whole lot of fun when you see how they can screw up binary formatting. Nevermind things like, "Hey, here's a binary file from 25 years ago... and nothing can read it and nobody alive knows the format," that you just don't get with plain text.
Yes, you do end up with wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
cbsmith 2021-08-19 11:37:06 +0000 UTC [ - ]
Yes. Though to their credit, some of those work with hex numbers, which at least gets you 4 bits out of every 8 bits.
> And I know what you're about to argue, but JSON's datetime format is not in the spec. The common JSON datetime format is convention, not standard.
I'm not sure what argument you thought I was making, or why that comment is relevant.
> All you've shown is that CSV has the same limitations that XML, YAML, and JSON have, and those three formats specifically designed and intended for data serialization. Yes, the other formats do have other advantages, but they don't eliminate those three limitations, either.
I'm not sure what you mean by "eliminate", or why you think it matters that there are other formats with the same design trade-offs.
> This is for data serialization, which means it's going to potentially be used with data systems that are wholly foreign separating great distances or great timespans. What data serialization format are you comparing CSV to? What do you think CSV is actually used for?
CSV is used for a variety of purposes. The context of the article is using it for data transfer.
The claim was that it was a compact format for data transfer, which is demonstrably not true.
> Are you arguing for straight binary? You know that CSV, XML, YAML and JSON all grew out of the reaction to how inscrutable both binary files and fixed width files were in the 80s and 90s, right?
I'm not sure what "straight binary" means to you. JSON is, for the most part, a binary encoding standard (just not a particularly good one).
You've got the heritage a bit wrong, as XML was not originally designed for data transfer at all. It was an attempt to simplify the SGML document markup language, and the data transfer aspects were subsequently grafted on. JSON & YAML have a slightly more complicated heritage, but neither was intended as a data transfer format. They've all been pressed into service for that purpose, for a variety of reasons that can charitably be described as tactically advantageous but strategically flawed.
> Binary has all sorts of lovely problems you get to work with like endianness and some systems getting confused if they encounter a mid-file EOF.
I don't know how to break this to you, but text formats can have endianness (in fact, insanely, UTF-8 does!), and systems can get confused about whether they are at EOF as well.
> Yes, you do end up with a wasted space, but the file is in plain text and ZIP compression is a thing if that's actually a concern.
Wouldn't ZIP be a binary format, with all the problems and concerns you have with binary formats?
So to summarize what you are saying... "CSV is a compact format because you can compress it if you are concerned about all the space it wastes".
Would it be fair to say then that any binary format is a text format because you can convert the binary into a text representation of the data? ;-)
Twisell 2021-08-18 15:18:48 +0000 UTC [ - ]
"The only true successor of CSV should be forward/backward compatible with any existing CSV variant"
If we manage to write a spec that meet this criteria we'll have a powerful standard with easy adoption.
Macha 2021-08-18 15:55:03 +0000 UTC [ - ]
godshatter 2021-08-18 16:36:53 +0000 UTC [ - ]
The assumption that most software uses is that the import file will be in the same variant of the format as what that tool exports. That seems to be more of a problem than anything else.
masklinn 2021-08-18 16:38:22 +0000 UTC [ - ]
Sure, and they sometimes do that if they have to ingest CSVs whose origin they don't control (although not every system implementor cares enough to do it).
But that's still just a bunch of shitty fallible heuristics which would not be necessary if the format was not so horrible.
Macha 2021-08-18 16:40:01 +0000 UTC [ - ]
cat input1.csv input2.csv > output.csv
resulting in a single file containing multiple formats.
Also, take a line whose quoting is ambiguous: what is the value of the third column? Is this a CSV file without quoting? Then it's one value. Or is it a CSV file with double-quote escaping? Then it's another.
This is fundamentally undecidable without knowledge of what the format is. You can decide to just assume RFC-compliant CSVs in the event of ambiguity, but then you absolutely will get bugs from users with non-RFC-compliant CSV files.
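To make that concrete with a made-up line (my example, not one from any particular file), a line like a,"b,c",d parses two different ways depending on which dialect you assume:

  import csv, io

  line = 'a,"b,c",d\n'   # hypothetical ambiguous line

  next(csv.reader(io.StringIO(line)))
  # ['a', 'b,c', 'd']        -- RFC-style quoting: the third column is 'd'

  next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE))
  # ['a', '"b', 'c"', 'd']   -- quotes treated as literal text: the third column is 'c"'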
godshatter 2021-08-18 20:46:47 +0000 UTC [ - ]
So, yeah. Can't really be done without making too many assumptions that will break later.
hermitcrab 2021-08-18 20:24:20 +0000 UTC [ - ]
Yes. And my software does that. But it is always going to be a guess which the user needs to be able to override.
Twisell 2021-08-18 16:33:21 +0000 UTC [ - ]
PS: I never said 100% forward/backward compatible with all variants at the same time and without any noticeable artifact. I meant compatible in a non-blocking way.
masklinn 2021-08-18 16:37:38 +0000 UTC [ - ]
2021-08-18 16:40:04 +0000 UTC [ - ]
Twisell 2021-08-18 16:43:22 +0000 UTC [ - ]
Macha 2021-08-18 16:49:26 +0000 UTC [ - ]
Also UTF-8/ascii compatibility is unidirectional. A tool that understands ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that tools that only understand ASCII won't mangle UTF-8 is limited - sure dumb passthroughs are fine, but if it manipulates the text at all, then you're out of luck - what does it mean to uppercase the first byte of a flag emoji?
tsimionescu 2021-08-18 18:08:53 +0000 UTC [ - ]
masklinn 2021-08-18 15:23:20 +0000 UTC [ - ]
1. a fool's errand, CSV "variants" are not compatible with one another and regularly contradict one another (one need not look any further than Excel's localised CSVs)
2. resulting in getting CSV anyway, which is a lot of efforts to do nothing
dragonwriter 2021-08-18 16:11:52 +0000 UTC [ - ]
So, a binary format consisting of: (1) a text data segment (2) an end-of-file character (3) a second text data segment with structured metadata describing the layout of the first text data segment, which can range from something as simple (in terms of meaning; the structure should be more constrained for machine readability) as “It’s some kind of CSV, yo!” to a description of specific CSV variations (headers? column data types? escaping mechanisms? etc.), or even specify that the main body is JSON, YAML, XML, etc. (which would probably often be detectable by inspection, but this removes any ambiguity).
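A purely hypothetical sketch of that layout, assuming the old DOS-style 0x1A substitute character as the end-of-file mark (the choice of divider and the metadata keys are mine, not part of any spec):

  import json

  EOF = b"\x1a"   # assumed divider: the DOS/CP-M "substitute" byte treated as EOF in text mode

  def write_described(path, text_body, metadata):
      with open(path, "wb") as f:
          f.write(text_body.encode("utf-8"))
          f.write(EOF)
          f.write(json.dumps(metadata).encode("utf-8"))

  def read_described(path):
      with open(path, "rb") as f:
          body, _, meta = f.read().partition(EOF)
      return body.decode("utf-8"), (json.loads(meta) if meta else None)

  write_described("report.csv", "a,b\n1,2\n",
                  {"format": "csv", "delimiter": ",", "headers": True})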
Twisell 2021-08-18 16:22:14 +0000 UTC [ - ]
Almost any current CSV parser, even the bad ones, tolerates a header line.
So it should be possible to define a compact and standardized syntax that is prepended to the real header in the first cell (separator, encoding, decimal separator (often disregarded by most parsers but crucial outside the USA), quote character, escape character, etc.). Following headers would just use a special notation to carry (data type, length, comment).
Newer parsers would use these clues; older ones would just end up with some manageable junk appended to their headers (a rough sketch of the idea follows).
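A purely hypothetical illustration of the idea (the @@...@@ prefix syntax is invented for this sketch, not a real proposal): a new-style reader strips the clues off the first header cell, while an old-style one just sees a slightly ugly first column name.

  # Invented prefix carrying the dialect, glued onto the first header cell.
  first_line = '@@sep=,;quote=";dec=.@@id,name,price'

  def read_dialect(line):
      if line.startswith("@@"):
          prefix, _, rest = line[2:].partition("@@")
          return dict(item.split("=", 1) for item in prefix.split(";")), rest
      return {}, line   # legacy file: no clues, back to guessing

  read_dialect(first_line)
  # ({'sep': ',', 'quote': '"', 'dec': '.'}, 'id,name,price')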
Macha 2021-08-18 16:54:38 +0000 UTC [ - ]
Does this really count as compatible? You will get user bugs for this.
dragonwriter 2021-08-18 17:01:59 +0000 UTC [ - ]
Yeah, that's why I chose the “thing that looks like a text file (including, optionally, CSV) but has additional metadata after the EOF mark” approach instead of stuffing additional metadata in the CSV; there's no way to guarantee that existing implementations will safely ignore any added metadata in the main CSV body. (My mechanism has some risk in that there are probably CSV readers that treat the file as a binary byte stream and use the file size rather than a text stream that ends at EOF, but I expect it's far fewer than would do the wrong thing with additional metadata before the first header.)
Macha 2021-08-18 17:16:52 +0000 UTC [ - ]
sigzero 2021-08-19 00:34:50 +0000 UTC [ - ]
ivanbakel 2021-08-18 15:26:10 +0000 UTC [ - ]
There is no reason to try to be "backwards-compatible" with existing CSV files - we don't have a single definition of correctness to use to check that the compatibility is correct. Every attempt to parse existing data would result in unexpected results or even data loss for some CSVs in the wild, because there is no way to reconcile all the different expectations and specifications that people have for their own CSV data.
wernercd 2021-08-18 16:02:30 +0000 UTC [ - ]
sicariusnoctis 2021-08-18 21:56:37 +0000 UTC [ - ]
Note the difference is that I am suggesting reducing the number of standards in-use by using only one already existing CSV format. :)
AtlasBarfed 2021-08-18 16:37:38 +0000 UTC [ - ]
CSV should be better standardized, but ... whatever, what should be done to "fix" CSV is to advertise the proper use of the libraries and the nontrivial aspects of a superficially trivial format.
A format that is trivially useful in 99% of cases is far better than many other "worse is better" things in computing.
masklinn 2021-08-18 17:26:12 +0000 UTC [ - ]
They really don't. In fact I'd go further and confidently state that they really can't, because fixing tons of mis-parsed CSVs requires heuristic judgement calls, and those tools don't really have the ability to make those calls.
I've never seen a "mature CSV library for most major language" which'd guess encoding, separators, quoting/escaping, jaggedness, … to say nothing of being able to fix issues like mojibake.
burntsushi 2021-08-18 22:20:51 +0000 UTC [ - ]
masklinn 2021-08-19 06:00:02 +0000 UTC [ - ]
Neither is 0. Python’s csv library does not handle any problem, it applies the separators / quoting / escapes it was given and that’s that. It gives the developer the opportunity to handle things, but if the developer isn’t aware nothing is handled.
burntsushi 2021-08-19 16:56:24 +0000 UTC [ - ]
adgjlsfhk1 2021-08-18 17:09:43 +0000 UTC [ - ]
joelbondurant 2021-08-19 02:53:49 +0000 UTC [ - ]
dragontamer 2021-08-18 15:51:18 +0000 UTC [ - ]
Sqlite?
> Applications have a speed increase from using csvs.
Sqlite?
> The new format should not be grossly larger than the one it is replacing
Sqlite it is.
--------
Oh, you mean something that Excel can open? Oh yeah, I guess CSV then. But lets not pretend #1 (openness), #2 (speed), and #3 (size) are the issues.
ARandomerDude 2021-08-18 15:57:40 +0000 UTC [ - ]
dragontamer 2021-08-18 16:06:03 +0000 UTC [ - ]
We programmers get stuck into thinking that openness, speed, or other performance characteristics matter. But CSVs prove otherwise.
The reason why CSVs win in many situations is
1. Because they're convenient to edit.
2. Because they work in Microsoft Excel
CSVs have a myriad of issues: too many to list. But that doesn't matter, because convenience + Excel compatibility are king in practice.
To muse about other benefits is ignoring the core reason why CSV is used.
salamandersauce 2021-08-18 16:03:45 +0000 UTC [ - ]
masklinn 2021-08-18 16:35:19 +0000 UTC [ - ]
Good luck getting that garbage to be ingestible by anything. Odds are the system you're trying to shove it into will start by choking on the UTF-8 BOM notepad insists on, then it will choke on the record separators, then on the incorrect escaping (or quoting).
> Something to deal with Sqlite? No such luck and the barrier is much higher.
SQLite is already present in most if not all environments (usually multiple times too, you have at least one instance of sqlite per browser on your machine, and probably a system one as well, and a few others), so it's really at most a matter of providing SQLite clients by default.
sigzero 2021-08-19 00:37:11 +0000 UTC [ - ]
js8 2021-08-18 16:45:59 +0000 UTC [ - ]