Why you can’t parse CSV with a regular expression

Regular expressions are a very useful tool in a programmer’s toolbox. But they can’t do everything. And one of the things they can’t do is reliably parse CSV (comma separated value) files. This is because a regular expression doesn’t store state. You need a state machine (or something equivalent) to parse a CSV file.

For example, consider this (very short) CSV file (3 double quotes + 1 comma + 3 double quotes):

""","""

This is correctly interpreted as:

quote to start the data value + escaped quote + comma + escaped quote + quote to end the data value

I.e. a single value of:

","

How each character is interpreted depends on what characters come before and after it. E.g. the first quote puts you into an ‘inside data’ state. The second quote puts you into a ‘might be an escape for the following character or might be the end of the data’ state. The third quote puts you back into an ‘inside data’ state.
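The states described above can be written down directly. What follows is a minimal sketch in Python, not a complete CSV implementation: the names `State` and `parse_csv` are made up for this illustration, and it assumes comma delimiters, double-quote escaping and plain `\n` line endings.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()  # at the start of a data value
    UNQUOTED = auto()     # inside a value that did not start with a quote
    IN_QUOTED = auto()    # the 'inside data' state
    QUOTE_SEEN = auto()   # 'might be an escape or might be the end of the data'


def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of string values."""
    rows, row, field = [], [], []
    state = State.FIELD_START

    def end_field():
        row.append("".join(field))
        field.clear()

    def end_row():
        end_field()
        rows.append(list(row))
        row.clear()

    for ch in text:
        if state is State.IN_QUOTED:
            if ch == '"':
                state = State.QUOTE_SEEN
            else:
                field.append(ch)  # commas and newlines are just data here
        elif state is State.QUOTE_SEEN and ch == '"':
            field.append('"')     # doubled quote: an escaped quote
            state = State.IN_QUOTED
        elif ch == ",":
            end_field()
            state = State.FIELD_START
        elif ch == "\n":
            end_row()
            state = State.FIELD_START
        elif ch == '"' and state is State.FIELD_START:
            state = State.IN_QUOTED
        else:
            field.append(ch)
            state = State.UNQUOTED

    # Flush the last row if the text did not end with a newline.
    if state is not State.FIELD_START or field or row:
        end_row()
    return rows


print(parse_csv('""","""'))
```

The same character (a double quote) is handled in three different ways depending on the current state, which is the point of the example. The sketch is lenient about malformed input, e.g. it does not reject stray characters after a closing quote.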

No matter how complicated a regex you come up with, it will always be possible to create a CSV file that your regex can’t correctly parse. And once the parsing goes wrong, everything after that point is likely to be garbage.

You can write a regex that can handle CSV files where you are guaranteed there are no commas, quotes or carriage returns in the data values. But commas, quotes and carriage returns in the data values are perfectly valid in CSV files. So it is only ever going to handle a subset of all the possible well-formed CSV files.
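To make the "subset" point concrete, here is a small illustration in Python. The helper names `naive_regex_parse` and `csv_module_parse` and the sample data are invented for this example. The regex approach simply splits on line breaks and commas, and it is compared against Python's standard `csv` module.

```python
import csv
import io
import re


def naive_regex_parse(text):
    """Split on line breaks, then split each line on commas with a regex."""
    return [re.split(r",", line) for line in text.splitlines()]


def csv_module_parse(text):
    """Parse with the standard library's CSV reader for comparison."""
    return list(csv.reader(io.StringIO(text, newline="")))


# No commas, quotes or line breaks inside values: both approaches agree.
simple = "a,b,c\n1,2,3\n"

# A comma, escaped quotes and a line break inside quoted values.
tricky = 'name,notes\n"Smith, J","said ""hi""\non two lines"\n'

print(naive_regex_parse(simple) == csv_module_parse(simple))
print(naive_regex_parse(tricky))
print(csv_module_parse(tricky))
```

On the simple input the two results match. On the tricky input the regex version splits the quoted value at its comma and treats the embedded line break as a new row, so every value from that point on is wrong.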

Note that you can parse a TSV (tab separated value) file with a regex, as TSV files are (generally!) not allowed to contain tabs or carriage returns in the data and therefore don’t need escaping.
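As a rough sketch of why TSV is easier, the following Python snippet parses TSV with two regex splits. The function name `parse_tsv` is made up for this example, and it assumes the data really contains no tabs or line breaks inside values and uses no quoting or escaping.

```python
import re


def parse_tsv(text):
    """Parse TSV text, assuming no tabs or line breaks inside values."""
    lines = re.split(r"\r?\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing line break does not start a new row
    return [re.split(r"\t", line) for line in lines]


print(parse_tsv('name\tnotes\nSmith, J\tsaid "hi"\n'))
```

Commas and quotes in the values need no special treatment here, because neither has any meaning in this format. No state has to be tracked from one character to the next.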

See also on Stack Overflow:

Using regular expressions to parse HTML: why not?