Best way to parse a text document

I'm trying to parse a plain text document in PHP but have no idea how to do it correctly. I want to separate each word, assign each one an ID, and save the result in JSON format.

Sample text:

"Hello, how are you (today)"

This is what I'm doing at the moment:

// Split on single spaces; punctuation stays attached to the words
$document_array = explode(' ', $document_text);
echo json_encode($document_array);

The resulting JSON is

[["Hello,"],["how"],["are"],["you"],["(today)"]]

How do I ensure that spaces are kept in place and that symbols are separated from the words, like this:

[["Hello"],[", "],["how"],[" "],["are"],[" "],["you"],[" ("],["today"],[")"]]

I'm sure some sort of regex is required, but I have no idea what kind of pattern would deal with all cases. Any suggestions?
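
To make the goal concrete, here is a minimal sketch of the kind of regex split I have in mind. It assumes a "word" is any run of Unicode letters or digits and that everything else (spaces, punctuation) is grouped into separator tokens. The function name tokenize_text is hypothetical, and the output is a flat array whose indexes stand in for the IDs, not the nested shape shown above.

```php
<?php
// Hypothetical helper: split text into alternating word and separator tokens.
// Assumption: a word is a run of letters/digits; anything else is a separator run.
function tokenize_text(string $text): array
{
    // First alternative matches a run of letters/digits,
    // second alternative matches a run of everything else.
    preg_match_all('/[\p{L}\p{N}]+|[^\p{L}\p{N}]+/u', $text, $matches);

    // $matches[0] holds the full matches in document order.
    return $matches[0];
}

$tokens = tokenize_text('Hello, how are you (today)');

// Array indexes (0, 1, 2, ...) can serve as the token IDs.
echo json_encode($tokens);
// ["Hello",", ","how"," ","are"," ","you"," (","today",")"]
```

Because every character falls into one of the two alternatives, joining the tokens back together with implode('', $tokens) reproduces the original text. Under this assumption, apostrophes and hyphens count as separators, so "don't" would come out as three tokens; I'm not sure that is the right behavior for all cases.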

asked by meagar on 13 April 2011 at 13:41