Hugo Hacker News

Autocorrect errors in Excel still creating genomics headache

nnmg 2021-08-17 19:05:41 +0000 UTC [ - ]

Excel is used as a database/storage/interchange format, especially after the initial analysis by someone who uses python or R. Bioinformatician does the analysis, then the PI wants to see it so they can Ctrl-F for genes they are interested in, so out comes an excel document.

And really, even if you know python or R, are you really going to fire up a jupyter notebook, load the data, and run pandas queries every time someone in lab meeting or after a talk asks you about this gene or that gene in your data?

I think the important question is why is date conversion a default? Would it really break backwards compatibility for MS Excel users if date conversions were explicit instead of automatic? Turning that off by default would fix a lot of this.

BugsJustFindMe 2021-08-17 20:36:46 +0000 UTC [ - ]

> Excel is used as a database/storage/interchange format, especially after the initial analysis by someone who uses python or R. Bioinformatician does the analysis

Sometimes, but the situation is in reality worse than that. Excel is also used as the gold standard database/storage/interchange format of record for random shit that clinical researchers have typed in by hand whether directly or transcribed from other notes, often when that data isn't actually fundamentally tabular in nature because people really like working in grids. Even when grids hurt more than help.

A big secret in genetic research is that the MDs, grad students, project managers, and coordinators running the research programs are often not super focused on what well-structured data looks like and don't know what things like "key-value store" or "nested tree-like structure" mean, and even if they did there aren't good GUI tools for entering them anyway, and it leads to countless errors that maybe (here I speculate) they just assume will wash out as noise.

> I think the important question is why is date conversion a default?

Yes, why any kind of conversion is ever the default is a real money question.

quantified 2021-08-17 21:18:35 +0000 UTC [ - ]

For the finance and business office worker, it seems to have traction. Just like auto-creating an emoji when you type a : character. Excel is for offices, not specializations of scientists. Bummer.

netizen-936824 2021-08-18 13:39:48 +0000 UTC [ - ]

So maybe we need better software for scientists? Sounds like a hole in the market

gaze 2021-08-19 04:48:05 +0000 UTC [ - ]

The market for scientific software is a bit iffy. Scientific software also needs to be super super flexible since the users are, somewhat by definition, not doing something that's been done before. Hard market.

SonicScrub 2021-08-17 20:20:24 +0000 UTC [ - ]

I don't work in bioinformatics, but what you are describing is a completely accurate description of what I experienced working in manufacturing quality control. Raw data came in from suppliers in the form of spreadsheets, and management wanted to see results in spreadsheets. Meaning all our quality data was subjected to these issues. The date formatting issue was a particularly annoying "gotcha", particularly when features were defined with a XX-XX numeric code. The number of times I had to deal with someone in a meeting saying "hey, why is this feature called October-13?!" Super frustrating.

If I could choose the tools used by the whole process involving multiple different companies and departments, hey I would! It would be python all the way down. But I was but a cog in a massive organization.

dragonwriter 2021-08-17 20:25:29 +0000 UTC [ - ]

> A lot of "safety culture" is composed of things like checklists and hazard warnings which are more geared towards shifting the blame for accidents onto somebody else than actually preventing those accidents,

If you stay in spreadsheets these problems mostly don’t occur (that is, once data entry is squared away so that the initial spreadsheet has what you want it doesn't tend to get lost), its when you move in and out of spreadsheets via text and take the path of least resistance [0] to do the transition that the problem occurs.

[0] and to be fair, there is a lot of resistance off that path.

SonicScrub 2021-08-17 20:40:21 +0000 UTC [ - ]

The process I had to deal with was filling out spreadsheets with data from a python-driven 3D inspection program that exported out data files in CSV format. Needless to say, these errors were inevitable for exactly the reasons you've stated. Why we didn't bypass the large, poorly formatted cumbersome spreadsheets and just directly export data via pandas? All the inspection was done by Python anyways. You tell me! Also, it did not help that the spreadsheets were not created by me, or any colleagues in my department.

God I hated working in old-school engineering/manufacturing. "That's not how we do things" is the answer to everything. I

dragonwriter 2021-08-17 23:57:43 +0000 UTC [ - ]

Sorry about the misplaced quote. Meant to be a quote from the immediate upthread comment. Looking back, it probably wasn't even needed, the response works fine against the comment as a whole.

dec0dedab0de 2021-08-17 20:09:32 +0000 UTC [ - ]

And really, even if you know python or R, are you really going to fire up a jupyter notebook, load the data, and run pandas queries every time someone in lab meeting or after a talk asks you about this gene or that gene in your data?

I don't do any scientific research, but I have been using jupyter as a replacement for excel since it was called the ipython notebook. I don't really use pandas all that often, I just find it easier to read and edit data in python. Though I first learned ipython added the notebook from a talk Wes McKinney gave about Pandas.

tcskeptic 2021-08-17 16:43:29 +0000 UTC [ - ]

We need a clippy type feature "It looks like you are working on genomics data, should I turn off all the data destroying auotcorrect features?"

UnFleshedOne 2021-08-17 17:52:26 +0000 UTC [ - ]

"It looks like you are working on genomics data, should I direct you to a search engine to find appropriate tooling instead?"

KMnO4 2021-08-18 02:48:14 +0000 UTC [ - ]

Heh, sure, if you have $10k+ per user.

(Maybe genomics isn’t as bad as proteomics)

guitarbill 2021-08-17 16:52:40 +0000 UTC [ - ]

It's easy to hate on Excel or Microsoft, a "dumb" program and a faceless mega corporation. Don't get me wrong, I'm no fan. But the real question is what are journals doing? It seems like the obvious solution is for editors and reviewers to do a quick search for these errors. To me, this is more of an indication of how poor most journals really are.

(You could argue that supplementary data should be in an open format, so you could automate this checking. I'm not against this, either.)

nnmg 2021-08-17 19:33:48 +0000 UTC [ - ]

This is an important point. What have the journals done? Raised their prices and business as usual.

Scientific editors do nothing for data validation. There is no accountability, even after retractions.

Scientific journal editors are glorified gatekeepers for "high impact" work (read: flashy), and then use free reviewer labor to cover themselves so they can call it 'peer reviewed'. In the rare cases when journals do require supporting data, they explicitly ask for excel spreadsheets :(

2021-08-17 17:18:47 +0000 UTC [ - ]

breakfastduck 2021-08-17 17:39:44 +0000 UTC [ - ]

So we blame the software instead of blaming the downright lazy people who can't be fucked learning an appropriate tool even though that's their entire career.

Honestly these kind of things really wind me up.

ironmagma 2021-08-17 17:41:55 +0000 UTC [ - ]

If a problem is re-occurring and between lots of people, it's probably the fault of the system that powers it. The system in this case is the software, so blaming the software is appropriate.

breakfastduck 2021-08-17 17:48:53 +0000 UTC [ - ]

No it isn't. This isn't a gotcha. This is probably one of the most well known software behaviors on the planet.

You KNOW the problem exists, yet you STILL choose to use the software, THEN expect it to change its behaviors because its inconvenient for your very specific use case. Microsoft aren't going to fix this behavior because there are likely millions if not billions of spreadsheets that rely on it.

That is not a software problem.

ironmagma 2021-08-17 17:56:59 +0000 UTC [ - ]

People don't choose to use Microsoft products, the software is chosen for them by their organization. For a problem like this, the existence of the article may very well be the first step to switching to another solution. The first step in that process is raising awareness that there is a problem.

bserge 2021-08-17 20:41:03 +0000 UTC [ - ]

If only the organization was run by people.

ironmagma 2021-08-18 04:50:07 +0000 UTC [ - ]

To a first-order approximation, all people are not management.

cyanydeez 2021-08-18 00:23:09 +0000 UTC [ - ]

we stand on the shoulders of excel users, thats the reality.

breakfastduck 2021-08-17 19:03:27 +0000 UTC [ - ]

A convenient cop out there.

"I didn't install it, I only chose to use it for something it isnt suited for instead of looking for alternatives, not my fault"

ironmagma 2021-08-17 19:26:38 +0000 UTC [ - ]

People working in large organizations are frequently not even allowed to install software that isn’t pre-approved by N layers of management. The bottom line though is that Excel is made for data entry. If it’s not suitable for that use case, that’s a big problem with the product-market fit.

breakfastduck 2021-08-17 20:32:01 +0000 UTC [ - ]

I cannot imagine a single piece of software that is more widespread and used without issue by huge swathes of people than excel.

But one specific use case for genetics means its totally unsuitable & and big problem with the product. Got it.

setr 2021-08-18 06:47:34 +0000 UTC [ - ]

?

Excel documents almost always have issues, independent of genomics[0]. I can probably just skim through my emails and find a bunch of problem spreadsheets all over.

Fundamentally, the problem is threefold.

1) excel tries to be smart, and this works in the easy case and breaks down in the edge cases. The edge cases also tend to appear with larger data entry as various one-offs, so it’s particularly insidious

2. Excel is difficult to validate (formulas are hidden, naming things properly is a bit of a hack, no real typechecking, data constraints can be set but it’s hard to check what’s set to what across all cells, etc)

3. The typical user / target market isn’t trained to use what facilities excel does happen to offer.

Ultimately, the problem is that excel doesn’t “scale”, and by default you get the wrong thing.

And even if you know how to set things up properly, it’s near impossible to be sure it stays proper.

[0] https://www.marketwatch.com/story/88-of-spreadsheets-have-er...

quantified 2021-08-17 21:20:49 +0000 UTC [ - ]

Right. I doubt genomics registered as much of a market segment, and probably still doesn’t register.

AmericanChopper 2021-08-17 20:47:21 +0000 UTC [ - ]

Excel isn’t a data entry tool. It’s a rather sophisticated analysis tool for tabular data. It’s success is derived from the fact that it’s very good at this, and is so easy for users to pick up. Mind boggling amounts of commerce and administration are driven my Excel.

It is however most suited to accounting and logistics tasks. For any more novel use case, you’re going to have to put some thought into how you use it, or use a more suitable alternative. Even for the tasks that it’s very well suited to, there will always be more sophisticated alternatives. But I’d consider the way it lowers the the technical barriers to entry an overwhelming positive, even if it doesn’t completely absolve the users from learning about how their data works. I have the same frustrations with developers who have been trained by ORMs to have no idea how RDBMSes work. But I consider it their fault for not learning their trade properly, rather than the fault of the people who make nice tools to help them.

breakfastduck 2021-08-17 23:52:33 +0000 UTC [ - ]

I really like the comparison to ORMs

Closi 2021-08-17 17:52:00 +0000 UTC [ - ]

> So we blame the software instead of blaming the downright lazy people who can't be fucked learning an appropriate tool even though that's their entire career.

Further, the software would be fine if you use it properly for this sort of use case (i.e. bring the data in via PowerQuery which has enforced types).

BugsJustFindMe 2021-08-17 20:44:33 +0000 UTC [ - ]

In reality the appropriate tool for research data actually doesn't exist (make it and become very rich). Excel is only sort of close if you squint.

IAmEveryone 2021-08-17 20:22:40 +0000 UTC [ - ]

There is one problem people have for their use case, so it’s inappropriate? Have you ever created an issue on some GitHub project? I guess you were using the wrong tool?

People choose their tools for some reason(s), and there is no law of man or nature that says that using a common tool for some esoteric purpose is somehow wrong, just because you consider it too easy.

breakfastduck 2021-08-17 23:53:24 +0000 UTC [ - ]

That does not apply to the windows office suite, literally built on megatonnes of legacy & bug level compatibility. Or windows for that matter.

psychometry 2021-08-17 20:08:11 +0000 UTC [ - ]

compbio is not a field you'd want to go in if you want to use well-written tools exclusively. There's tons of absolutely abominable Perl and R code powering critical pipelines everywhere you look.

What makes matters worse is that scientists, who are already not great coders, have to work with even less technical people (i.e. the physicians). These collaborative processes will inevitably involve software like Excel. Need your M.D. co-collaborator to annotate a gene list or whatever? Your CSV's getting re-saved in Excel whether you like it or not.

tssva 2021-08-17 21:38:21 +0000 UTC [ - ]

This isn't a Microsoft or Excel issue. It is a choosing a spreadsheet issue. LibreOffice and Google Sheets for instance by default also try to determine what type of data is being entered.

The issue is choosing the incorrect tool or if you have no choice but to use a spreadsheet failing to learn how your tools work.

cyanydeez 2021-08-18 00:22:02 +0000 UTC [ - ]

unfortunately, the reality is backwards compatibility with everyone who came before you, sets your budgets and expects to review data with minimal upstart.

people like spreadsheets because it provides very immediate feedback

delosrogers 2021-08-17 16:45:01 +0000 UTC [ - ]

This exact problem burned me when I started doing transcriptomics, one of the things I found that helps mitigate the problem is to always keep both the gene symbol and a ID like an Ensemble or Entrez ID for every data point because those don’t get mangled by Excel

OptionX 2021-08-17 16:24:54 +0000 UTC [ - ]

Can anyone with first hand experience on this tell shed some light on why stick with Excel versus using python, R or what have you to treat the data?

an_d_rew 2021-08-17 16:28:16 +0000 UTC [ - ]

Because an absolutely enormous number of people involved in all ends of genomics and biomedical science have absolutely no inkling whatsoever about programming.

Furthermore, Microsoft office is installed everywhere everywhere everywhere… like it leave it love it or hate it, it doesn’t matter… it’s true (at least for huge swaths of the demographic).

So the first thing most people learn is excel and the last thing most people use is excel. It’s a database, it’s a spreadsheet, you can even use it as a word processor if you want to. And some have.

I’m not saying you could and I’m not saying you should, I’m just answering as to “why“.

OptionX 2021-08-17 16:49:31 +0000 UTC [ - ]

Even newcomers?

I've had my education in a engineering focused campus so that may skew it, but even not IT related area had programming courses to, at least, learn the basics.

Do the newer researcher still show reluctance to move from Excel or is it an old guard kind of deal?

jasode 2021-08-17 17:23:31 +0000 UTC [ - ]

>why stick with Excel versus using python, R or what have you to treat the data? [...] Do the newer researcher still show reluctance to move from Excel

The feature of Excel that's underestimated/overlooked by skilled programmers of Python/R is that spreadsheets immediately expose an editable visual spatial canvas datagrid GUI.

I have decades of programming C/C++/C#/Python/etc and yet I create new spreadsheets and use them every day.

Spreadsheets are much faster than wiring up a datagrid in C# or C++ Qt gui widget editor or any other "true" programming language. Spreadsheets are also faster than using Python package Pandas to import a csv into a dataframe and view it in a Jupyter notebook. And last time I checked, displaying an output grid of cells in Jupyter is read-only and not 2-way editable like Excel.

And yes, the complaint is that "MS Excel calculations are not auditable, repeatable, version controlled, etc". All true, but it still doesn't change the fact that Python doesn't have an instant datagrid GUI. UI affordances also matter in viewing science data sets as well as financial budgets.

OptionX 2021-08-18 00:36:40 +0000 UTC [ - ]

I'm not saying Excel is without its benefits or a bad tool, just perhaps shouldn't be the only one.

And I was wondering if people like in the non-engineering but scientific fields, as the ones discussed in post, are exposed to the alternatives during their formation years.

If one chooses to use Excel because its the best tool great, but if its because its the perceived only option that's probably counterproductive.

int_19h 2021-08-17 20:45:32 +0000 UTC [ - ]

Besides, these are not mutually exclusive - you can use Excel as the front-end to your Python code via something like PyXLL.

fartcannon 2021-08-17 16:40:17 +0000 UTC [ - ]

Yeah. At some point though, people who want to do something that requires python or R, but refuse to learn it, and then use Excel (and make mistakes like the article mentions) should reconsider their choices. This is really out of character for me, but in this instance, relying on excel to do everything is a mistake by the user, not Microsoft's ridiculous data destroying import function.

Use the right tool.

qayxc 2021-08-17 16:51:19 +0000 UTC [ - ]

> Use the right tool.

Step one should be learn your tool, no matter the tool.

There's several ways to avoid the issue in Excel. If someone working with Excel isn't able to learn that (heck, a simple template would suffice), I have no hope the same demographic would have any success with R or Python.

da_chicken 2021-08-17 17:57:33 +0000 UTC [ - ]

> There's several ways to avoid the issue in Excel.

No, not really. There are several ways to help reduce the issue, but none of them eliminate what it does.

Example that comes to mind is the data file for College Board's SAT test. The data formats for student reports for schools come in two formats: PDF (one page per student), CSV, and fixed-width. That is the comprehensive list of your options. College Board doesn't care about you as a customer. They're too big. Any request you submit will be black holed.

Some of the columns in the file indicate a range, usually in the format "X-Y". Excel will try to coerce that into a date, if it's valid.

Other columns indicate a ratio, expressed as "X/Y". Excel will coerce that into a date, if it's valid.

Other columns indicate an ID number, expressed as a large, fixed-digit number, zero-padded. Excel will coerce that into an integer, or, if it's too long, into scientific notation discarding digits.

It doesn't matter how you format the CSV. Excel will do the above.

Here's an example CSV:

  ID,Range,Ratio
  12345678901234567890,8-9,"7/15"
  "00000000000000012345","9-10",21/35
I open that with Excel and immediately save it as a CSV. I look at the file in a text editor and I see:

  ID,Range,Ratio
  1.23457E+19,9-Aug,15-Jul
  12345,10-Sep,21/35
Do you have any idea how fun it is to explain to teachers and school administrators what happened here?

The correct way to work with this data file is: Do not, under any circumstances, open it with Excel if you expect to use it for anything else.

The problem is, there are very few applications that work well with CSV files. I know of CsvEd and Delimit. There are several text editors with a CSV column mode that makes the file look like a table (with varying degrees of success). All of these vary between "godawful" and "a complete nightmare" in terms of performance and usability compared to Excel.

dragonwriter 2021-08-17 18:39:51 +0000 UTC [ - ]

> It doesn't matter how you format the CSV. Excel will do the above.

Only if you open the file, so that Exel assumes defaults for formatting. If you import the CSV (Data -> From Text in the current UI) you can specify the format with the Import Text Wizard as described on the College Board instructions for using the file. Except...

Unfortunately, the College Board has outdated instructions on their website; Excel used to offer the Import Text Wizard on opening a csv or text file, rather than making default assumptions, and the College Board instructions page, while it does tell you to use that Import Text Wizard and provide all the details of what to plug into that wizard, tells you to open the file to get it, which bypasses the wizard.

Things like this is why clerical and other low-level positions involving excel test candidates on proficiency with specific versions of excel. Higher-level workers are expected to be able to figure out these kind of changes themselves, though (or consume lower-level staff time, if they are in management.

da_chicken 2021-08-17 19:56:14 +0000 UTC [ - ]

Yeah, that's also obnoxious. The Data From Text wizard lets you easily import the data... but if you do that then the first row no longer defaults to headers. So if you open it that way and want to filter or sort the data, your headers will disappear. You have to go into full blown Power Query to do both prevent reformatting and use column headers, and you have to reassign the data types on a much more complex interface.

We have now changed a 1 second automated process that anybody could do, into a 2 minute manual process that requires knowledge of data types. We've gone from a non-technical task to a highly technical task. To open a file without completely munging the data.

It also will sometimes do weird things like exclusive lock the file on disk until you close every open window of Excel because it creates data connections to the file.

dragonwriter 2021-08-17 20:03:19 +0000 UTC [ - ]

> The Data From Text wizard lets you easily import the data... but if you do that then the first row no longer defaults to headers. So if you open it that way and want to filter or sort the data, your headers will disappear. You have to go into full blown Power Query to do both prevent reformatting and use column headers

You don't, because while the wizard doesn't default to headers, the first screen of the wizard has a checkbox for it, so the option is there without Power Query.

da_chicken 2021-08-17 23:33:15 +0000 UTC [ - ]

Not on my Excel 2019. You pick "scan first 200" and it sets first rows to headers. You set it to all text, and it disables first row headers.

There is no checkbox in the wizard.

occamrazor 2021-08-17 18:27:47 +0000 UTC [ - ]

If you open the csv file with Excel you get those errors. If instead one uses Get&Transform (aka Power Query) then it is possible to specify the type of each column.

pletnes 2021-08-17 17:19:33 +0000 UTC [ - ]

As someone who has programmed in python for a decade, and no idea how to do anything right in Excel, I would beg to differ.

qayxc 2021-08-17 19:03:55 +0000 UTC [ - ]

Do you use Excel on a daily basis for productive work as well?

If not then your personal experience is irrelevant in this context.

CobaltFire 2021-08-17 18:00:18 +0000 UTC [ - ]

Not specifically this topic but I ran into an issue that would have been trivial in any real programming language this week, but I’m not able to USE anything that’s not on my work computer due to data restrictions.

So I get to cobble together some really ugly spreadsheets in excel when I’d MUCH rather use something more appropriate. Sometimes it’s not the person, it’s the org. In this case, government.

jpeloquin 2021-08-17 16:53:54 +0000 UTC [ - ]

As a biomedical researcher who prefers using python & R over Excel:

(1) Graduate students (who are the bulk of the scientific labor force) usually have some training in python, R, or Matlab, but are seldom practiced in applying it to real-world work. So if a lab standardizes on python, R, etc. the senior people have to do a lot of extra work to support the programming, rather than doing work with intrinsic scientific value. It's easier to teach the Excel quirks than to teach effective programming practices. Excel quirks are concrete, and "good practices" in programming are vague and situational.

(2) Python & R have their own pitfalls. It's easy to apply the wrong filters to a data set and compare subsets that aren't what you think you're comparing. Although when they go wrong, they go very wrong, and this is easier for a supervisor to notice.

(3) Excel has rich text formatting, graph embedding, and data types, so it's very useful as a human-readable data interchange & summary format even if you never do computation in it.

IMO neither programming languages nor Excel are a great fit for data analysis. Something purpose-made, like JMP or even GraphPad, is probably the better choice in most situations. Programming gets you automation but at the cost of high complexity. Since you still need to look at and think about the data there's a limit to how much time automation can save (Amdahl's law).

cm2187 2021-08-17 16:45:31 +0000 UTC [ - ]

When you have done programming for a while and it is mostly muscle memory, it is easy to forget how much it is an impenetrable black box to someone who has never done it.

Try to tell someone who just needs to get a task done that he needs to spend the next 3 months learning programming from scratch vs using excel.

breakfastduck 2021-08-17 17:37:03 +0000 UTC [ - ]

Tough shit, though. Thats what they're being paid for.

"A bad workman always blames his tools" has never been truer than this situation.

gerdesj 2021-08-17 16:30:23 +0000 UTC [ - ]

"That’s something Purdie says she doesn’t have time for. She has adapted to Excel’s quirks, adding apostrophes before commonly affected genes to prevent the conversion, or pre-formatting spreadsheet cells before importing data. “It’s one of those things that I just accept,” she says."

qayxc 2021-08-17 16:52:26 +0000 UTC [ - ]

Wait, if she knows how to avoid the issue, how is it still an issue?

acomjean 2021-08-18 04:43:54 +0000 UTC [ - ]

I work in bioinformatics and have seen the issue.

You get data from a lot of sources, and people (like me) just sometimes forget to check the spreadsheet. Often a csv file so it’s not clear it’s been generated by excel. It’s also only a problem for gene symbols (not entrez gene IDs or flybase gene numbers). Also the datasets are big, so there are 15 genes in error over the whole dataset it can be hard to spot.

Honestly genes are a constantly moving target and are kinda a pain to write software for. They get new names, split and merge over time…

qayxc 2021-08-18 09:20:47 +0000 UTC [ - ]

What really interests me is why your field is still using untyped CSV data for exchanging information, instead of say XML.

And why hasn't there been a collaboration between professionals to fix this on an organisational level (like agreeing on a spreadsheet template or sane method for data exchange)?

Is it really every lab for themselves with MS Excel and CSV of all things being the lowest common denominator?

acomjean 2021-08-19 12:26:19 +0000 UTC [ - ]

File sizes are an issue and xml won’t help. The file formats are wierd (fasta?) but they are at least kinda standard.

There are groups dedicated to organizing/ annotating genes. They’re quite organized and have highly structured databases.

http://gmod.org/wiki/Introduction_to_Chado

A lot of the model orgainism species have gotten together to make thinks more uniform.

https://www.alliancegenome.org/

Also most people in this field are using free software.

gpapilion 2021-08-17 16:40:15 +0000 UTC [ - ]

I took a programming for math majors course in college, and the largest challenge for the class was understanding procedural looping to process data sets. Understanding a for loop or a while loop really was a hard concept for someone to understand. We didn't cover any real data structures, and everything was a 2 dimensional array.

Excel, and most spreadsheets for that, avoid some of the basic control structures beginners find challenging. Many functions are basically map and reduce functions so you typically are getting the same number of cells back, or one. Often maps are done visually in the spreadsheet with a formula being applied to every cell in a column.

delosrogers 2021-08-17 16:40:33 +0000 UTC [ - ]

From my experience Python/R can be great when you have large scale analysis to do but if you have a task that requires more manual fiddling with the data then it’s much nicer to use Excel.

enaaem 2021-08-17 16:31:24 +0000 UTC [ - ]

It might be easier to teach people how to use excel correctly with for example power query.

taeric 2021-08-17 18:26:55 +0000 UTC [ - ]

The same line of questioning can easily land in "why use csv?"

It is a terrible format that has very few redeeming qualities. It happens to also be the most widely used one.

happytoexplain 2021-08-17 16:36:05 +0000 UTC [ - ]

Is there any reason to suspect that the primary factor is something other than the difference between an extremely featured end user GUI and programming?

minikites 2021-08-17 16:28:01 +0000 UTC [ - ]

Given that these are relatively unskilled users, would switching away from Excel increase or decrease errors? I don't think it's a given that errors would decrease by switching away from Excel.

blackbear_ 2021-08-17 16:38:47 +0000 UTC [ - ]

Anecdotically, I was talking to an acquaintance who decided to pick up R as a replacement for excel for data analysis and they were positively blown away by the possibilities opened by the new mindset that came with programming.

So yes, maybe these "relatively unskilled users" will struggle for a few weeks/months, but it will be a net positive change as many of them will see immense benefit after some tinkering.

instagraham 2021-08-17 16:31:14 +0000 UTC [ - ]

I think there is a difference between the errors. Excel autocorrect errors are easy to miss as it is the software making an arbritrary decision over your own.

Errors made from using a new , presumably neutral software, would be the kind you are supposed to catch as a researcher.

There is obviously scope for error in everything but Excel's user-hostile methods are not justified by this.

an_d_rew 2021-08-17 16:32:14 +0000 UTC [ - ]

Switch to what?

Excel is everywhere.

People are familiar with it. The same people are often not familiar with, and have never used, and may even be confused by “notepad.exe”.

ltbarcly3 2021-08-17 16:39:51 +0000 UTC [ - ]

These aren't someone's 85 year old grandparent getting a computer for the first time to look at pictures of their great grandchildren, these are people who have used a computer every day since they were 14 and have advanced STEM degrees. I think they are probably re-trainable.

Iwan-Zotow 2021-08-17 16:45:32 +0000 UTC [ - ]

XXX YYY ZZZ, PhD in genomics, probably re-trainable

codetrotter 2021-08-17 16:58:32 +0000 UTC [ - ]

That’s the price we pay.

A reasonable alternative would be for example statically typing each column or row in the sheet.

So in one sheet column A contains only floats, col B contains only text, col C contains only dates etc. And in another sheet col A may be date type, col B date type also, col C and D text, and col E monetary amounts.

But this would come at the cost of not being able to mix types in column, and a lot of people want to be able to mix.

And that’s why we’re stuck

wtallis 2021-08-17 17:16:34 +0000 UTC [ - ]

> But this would come at the cost of not being able to mix types in column, and a lot of people want to be able to mix.

I feel like a lot of the desire for this comes from Excel's desire to fill the screen with a single infinite spreadsheet. Apple's Numbers and some other programs make it more natural to have multiple separate tables on screen, each of which can have their own header rows and columns and separate type and format rules for those columns/rows. Excel more or less forces people to emulate this capability by just using a separate region of the same table, which gets in the way of applying consistent formatting and typing rules to whole columns or rows.

eitland 2021-08-17 17:14:29 +0000 UTC [ - ]

Nah, much of this is the same as the dumbness of many autocorrecting keyboards:

automatically "correcting" something that was correct based on a dumb hunch

It is really simple to remove the most embarrassing autocomplete errors on phones:

- just disable autocorrect,

- or preferably use another keyboard that just proposes fixes but leaves it to you to select the suggestion (I use SwiftKey and it does this, others probably exist too).

The same is not possible in Excel as far as I am aware:

You cannot turn of autoformatting.

The whole thing could be easily (on a ux level at least ;-) solved by introducing a single checkbox: "stop embarrassing me" with help text that says "turn off autoformatting".

Checking that box would leave values like you typed or pasted them in.

No need to select on a column basis.

Edit: It is also almost at the level of modern search engines that replace my very specific queries with generic queries about something unrelated or vaguely related.

Edit 2: it is already more than a year ago I guess since iPhones were caught changing the names of correctly spelled medications to the name of another unrelated medication. Sooner or later this incessant drive to be smarter than the user is going to cost lives I'm afraid.

When it comes to search it is already costing me many minutes lost a day, and that is just first order effects, before we get into all the things that doesn't get done because we give up on finding it.

jjk166 2021-08-17 17:34:25 +0000 UTC [ - ]

Right click on top left corner

Select format cells

Select "Text"

This will disable excel's autoformatting. Given the rarity that someone types "SEPT4" into excel and isn't typing a date, putting the option one level deep in a menu seems more appropriate than a top level button.

eitland 2021-08-17 19:22:44 +0000 UTC [ - ]

> Given the rarity that someone types "SEPT4" into excel and isn't typing a date, putting the option one level deep in a menu seems more appropriate than a top level button.

Given the rarity that someone has pasted 100 000 rows into and want them to be autoborked disabling autoborking, at least for pasted data seems appropriate enough for me ;-)

jjk166 2021-08-17 20:00:15 +0000 UTC [ - ]

I would expect dates to show up in large datasets pasted into excel orders of magnitude more often than genes. For the people who use excel, autoformatting is overwhelmingly the desired default behavior.

eitland 2021-08-18 06:59:07 +0000 UTC [ - ]

> For the people who use excel, autoformatting is overwhelmingly the desired default behavior.

Seriously? Also for pasted series? Also when the rest of the column doesn't match?

jjk166 2021-08-18 14:59:18 +0000 UTC [ - ]

Let's say you have 100,000 rows of data, and in a column all but 3 of those values look like dates. In a real office, which is more likely: that you are working on some set of data that just happens to look like a date 99.997% of the time but isn't a date, or some guy mis-entered or modified 0.003% of the data?

The I-want-to-put-data-in-my-table-which-closely-resembles-an-incredibly-common-format-but-is-problematic-if-mistaken-for-that-format-oh-and-I-can-not-be-bothered-to-google-how-to-turn-off-autoformat niche is not very large. I don't work for microsoft and I don't have copies of user data, but I would be willing to bet my entire life savings that over 95% of the time someone pastes a series which appears to have mismatched data types in the columns, the series actually does have mismatched data types in the columns.

ironmagma 2021-08-17 17:43:22 +0000 UTC [ - ]

Note how nowhere in those instructions is the phrase "autocorrect."

jjk166 2021-08-17 17:46:10 +0000 UTC [ - ]

Why would it? Excel doesn't have autocorrect, it has autoformat - a feature controlled in its formatting menu.

Closi 2021-08-17 17:53:56 +0000 UTC [ - ]

> A reasonable alternative would be for example statically typing each column or row in the sheet.

You can actually statically type the columns now for imported data via the PowerQuery editor (which is built into Excel), although not a lot of people know how to use it.

zozbot234 2021-08-17 17:11:03 +0000 UTC [ - ]

> A reasonable alternative would be for example statically typing each column or row in the sheet.

Excel actually supports this out of the box. But it's one more option to set, and people get lazy.

dspillett 2021-08-17 17:23:33 +0000 UTC [ - ]

Also those properties are lost if you transfer the data to another format then back, or don't properly survive copy/cut/paste operations (often cutting will remove everything about a cell including options such as these).

As well as being easy to be lazy and not use the options, it is easy to accidentally undo them also.

Tagbert 2021-08-18 19:40:04 +0000 UTC [ - ]

If an excel spreadsheet is setup properly, this conversion doesn't happen. If the cells are designated as "text", then conversion doesn't happen.

This is definitely a problem if you are just opening a raw CSV file and editing it as the default cell type does allow autoconversion.

djbebs 2021-08-19 09:00:49 +0000 UTC [ - ]

Except from personal experience I can tell you that even when you go out of your way to set the correct data types, if you try to copy paste any data into those cells excel will ignore everything you set and try to automatically change the data types.

If you know how to get it to never do that sort of thing, please let me know, because this is causing major issues in production software...

gpapilion 2021-08-17 16:31:05 +0000 UTC [ - ]

Autocorrect(and predictive text) has reached an interesting milestone, good enough that you mainly rely on it, but occasionally it distorts the data you intended to send. Disengaging the technology isn't easy, or even understanding field types it shouldn't correct(everyday my work VPN auth window tries to autocorrect my username). There is so much context that is unavailable to the computer, and there is no good way to provide it.

I think it has interesting implications for self-driving, since as far as I can tell we're at an earlier but similar stage in its development; works for simple situations, likely to drive you into a wall for no reason. We build so much trust in these systems, we stop paying attention to what they are doing, and then pay the price when it behaves in an unexpected way to the user.

soco 2021-08-17 20:18:19 +0000 UTC [ - ]

I get bitten by Excel's auto-correct all the time and I work nothing related to genomics. It's in my opinion just a dumb decision of Microsoft to not let you disable all those smarty feats (like, thinking you have a date if your string kinda looks like one)

PedroBatista 2021-08-17 17:27:36 +0000 UTC [ - ]

I generally am very critical of my own gang ( programmers ) for being so into their own bubble they completely lose perspective of the real World, and with that ignorance comes the usual arrogance.

But this time I feel we should give the business to the geneticists. The Excel "problem" is known and has been known for a long time, finance guys and policy makers were maybe the first big ones to "discover" this for slightly different reasons.

Then WHY on Earth these top professionals, elites, "creme de la creme" people continue on this path? And it's not like they love Excel, quite the contrary. Yes, the main reason is they are not programmers so they need to find a way to hack something to solve their problem. The problem with their "problem" solving is they are being slobs in a work that requires the EXACT opposite.

"Oh now we need to hire a programmer too?" - Yes, you do.

"But we don't have the budget for it" - Yes, you do. You see, money is never enough for anything, it's a question of priorities and when you blow millions of dollars with fancy experiments to ruin your career because of Excel, maybe that Python/R/Julia guy was a bargain after all.

I know some of these people, and under the aura of lab coats and distinguished professors it's just some guy/gal trying to crank that paper ASAP to keep the hamster wheel spinning AND they are being sloppy about the data and process AND most of them know it.

Sorry, but no excuses.

Closi 2021-08-17 17:50:30 +0000 UTC [ - ]

The underlying assumption here is that when transferred into R or Julia, with an equivalent level of development effort and testing, that there will be no/fewer bugs. I'm not entirely sure that's true.

I think we are comparing something that is knocked out in an hour in excel to someone spending 4-10x of the time doing it in Python / Julia, but I don't think that's a like-for-like comparison.

breakfastduck 2021-08-17 17:35:47 +0000 UTC [ - ]

Yeah. They use poor/ unsuitable tooling known for unusual data quirks with absolutely no effort to look at alternatives or pay for an expert.

No sympathy from me. Embarrassing. I certainly wouldn't get away with such negligence at my workplace.

koksik202 2021-08-18 14:11:21 +0000 UTC [ - ]

I remember I was helping my wife with her Uni assignment data split between excel and matlab couldn’t believe the errors we got from misinterpretation looks like lab science needs to catch-up with IT standards for handling data with correct integrity

planet-and-halo 2021-08-17 16:37:18 +0000 UTC [ - ]

Serious question, is this a viable area for a startup product? Obviously nothing will easily displace Excel for general spreadsheet work, but I could see room for a domain-specific spreadsheet that has core features and supports just a basic set of functions relevant to genomics, + a "we won't fuck up your data" feature.

mceoin 2021-08-17 20:13:17 +0000 UTC [ - ]

Hi - I'm building a networked spreadsheet product and had heard of this auto-correct problem in genomics research through word of mouth a couple of times, so have asked this question myself. (I was astonished to learn just how much of genomics work involves sending Excel files back and forth!) Here's the conclusion that I have come to:

- It's definitely possible to build a custom spreadsheet product with a small team, even one targeted at such a "niche" user group. So it's an idea worth testing.

- Product can be "backwards compatible" - you can export to xlxs and import from xlxs - so you don't have to change behaviour of the entire industry on day one to get this to work, only a single genomics researcher or lab.

- Pricing, unit economics, etc are unknown to me (I have no background in genomics or scientific research). But presumably you could leverage standard SaaS models and build a viable model up from there using a few case studies. There's definitely schleppy behaviour going on here that can be solved.

- Even a "lifestyle business" has significant upside beyond the financial: improving genomics research improves genomic research!

- The product advantage over time presumably involve building more custom tooling into the genomics / data ecosystem. "Not creating typos" is just the beachhead.

I've never actually interviewed genomics people about their need here, but if anyone knows people with this problem I would love to talk to them: @mceoin on twitter. (DMs open)

cm2187 2021-08-17 16:42:23 +0000 UTC [ - ]

That’s an expensive workaround to just formatting your cells as text before entering the values

dspillett 2021-08-17 17:20:18 +0000 UTC [ - ]

You'd be surprised how much effort can be saved by not relying on human interaction to be reliable.

Though convincing people to switch to a new app for that reason, even if it were free and Free, would still be an uphill struggle.

> just formatting your cells as text before entering the values

It is a little more than that. I work with finance people, and there a lot of data is manipulated in Excel but passed around as CSV files for compatibility elsewhere. This causes no end of problems because fixes by setting cell properties are obviously lost in transcription, and date errors creep in as things are moved back & forth between people in the US and those in locales that do dates properly.

masklinn 2021-08-17 17:30:22 +0000 UTC [ - ]

Probably not: people use excel because it's available, it's what they know, and it's what their peers use.

It's been going for more than half a decade now, and genomicists apparently would rather rename genes than stop using excel...

prionassembly 2021-08-17 17:16:22 +0000 UTC [ - ]

The easier solution is coming up with a 1:1 human-readable-hashing scheme so that MARCH4 translates to "funny-blue-smell-tuesday". Then you keep the safe key in a column next to the unsafe key.

NullPrefix 2021-08-17 16:40:02 +0000 UTC [ - ]

>a "we won't fuck up your data" feature

That's a hard to implement feature. Hand waving and buzzword lingo will not be enough for people to believe it.

masklinn 2021-08-17 17:31:13 +0000 UTC [ - ]

> That's a hard to implement feature.

It really isn't: just remove the data autodetection anything and you're done.

Closi 2021-08-17 17:28:22 +0000 UTC [ - ]

I think the size and scope of Excel is hard to replicate, however it sounds like it would make a great plugin.

HPsquared 2021-08-17 17:40:38 +0000 UTC [ - ]

They might move to a relational database, but then the gene named '; DROP TABLE genes; will start causing problems.