Joomla! Issue Tracker | Joomla! CMS #11702 - [RFC] New generic content parser.


Closed
10 Apr 2018
Medium
Build: staging
# 11702
chrisdavenport:content-parser

Failure

Success continuous-integration/travis-ci/pr The Travis CI build passed Details
Error continuous-integration/jenkins/pr-merge This commit cannot be built Details


chrisdavenport
20 Aug 2016

Work on the new Custom Fields feature for Joomla 3.7 has highlighted the need to define a common "core supported" syntax for embedding codes within content that can be replaced dynamically using the content plugins. At present, the loadmodule plugin is the only content plugin in the core distribution that does this kind of tag replacement. Since the loadmodule plugin has been around for many years, it has been extensively copied by third-party developers and many, although not all, have stayed with, or close to, the informally defined syntax that loadmodule supports.

For example, to embed a module position called "myposition" into an article using the "mychrome" style, you would insert the following string into the article:

{loadposition myposition,mychrome}

The new Custom Fields feature requires a more sophisticated syntax and initially this was achieved by importing the Mustache (sic) library (https://mustache.github.io/). However, this has a notably different syntax from the one established by the loadmodule precedent and the question arose as to whether that was the right direction for the Joomla project to follow.

In my opinion, it would be better to try to stay close to the existing syntax established by loadmodule, adding only backwards-compatible extensions to the syntax to support the new custom fields feature.

Most third-party developers tend to follow "core standards", so the core distribution tends to set a precedent which then becomes a de facto, albeit often undocumented, standard. So it is important that the syntax that we come up with meets some basic objectives:

The syntax should be simple enough for non-technical end-users to grasp in the majority of use-cases.
The API should be simple enough for modestly skilled developers to easily create content plugins that implement the standard.
The implementation should cater for multi-byte character sets.
The implementation should be efficient. In practice this means minimal use of regular expressions.

Summary of Changes

This pull request is offered as a potential solution that meets these objectives. It comprises some additional classes in the Joomla string library and a refactoring of the loadmodule plugin to make use of it.

Testing Instructions

If you fancy testing it then please check that the loadmodule plugin behaves exactly as it did previously. In particular, please test with multi-byte characters to make sure I have that handled properly.

Please also try to create your own content plugins using the library and see how you get on.

Documentation Changes Required

The syntax supported by the library is described below and this description could be used as the basis for the documentation should the PR be accepted.

Simple tokens

To use the parser you follow these steps in your content plugin code:

instantiate a JStringParser object.
make one or more calls to the registerToken method so as to associate with each token a callback function that will return the string that will replace the token whenever it is encountered in the content. This is explained in more detail below.
call the translate method which will replace all the tokens encountered in the content by calling the relevant callbacks.

For example, the following code appears in the loadmodule plugin:

// Get a content parser.
$parser = new JStringParser;

// Register the loadposition token.
$parser->registerToken(
    'loadposition',
    function(JStringToken $token)
    {
        $tokenParams = $token->getParams();
        $position = trim($tokenParams[0]);
        $style = isset($tokenParams[1]) ? trim($tokenParams[1]) : $this->params->def('style', 'none');

        return addcslashes($this->_load($position, $style), '\\$');
    }
);

The callback function takes a JStringToken object which allows access to the token definition that was registered by the registerToken method as well as to the specific parameters associated with the token in the input.

Notice that the token parameters are assumed to be a comma-separated list, but the parser makes no further assumptions about the syntax of the parameters. The parameters are made available to the callback function through the $token->getParams() array.

With this setup out of the way, the following code performs the actual content parsing and translation:

// Parse the content.
$article->text = $parser->translate($article->text);
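
The comma-separated parameter handling described above can be sketched in isolation. The helper below is hypothetical and is not code from this pull request; the name sketchSplitToken and the shape of its return value are assumptions made purely for illustration. It relies on the mbstring functions rather than regular expressions, in line with the multi-byte and efficiency objectives, and it assumes a well-formed token with single-character delimiters.

```php
<?php
// Hypothetical helper (not part of the pull request): splits a simple token
// such as "{name param1,param2}" into a lower-cased name and raw parameters.
function sketchSplitToken($token, $separator = ',')
{
    // Strip the surrounding braces: "{name params}" becomes "name params".
    $inner = mb_substr($token, 1, mb_strlen($token, 'UTF-8') - 2, 'UTF-8');

    // The name runs up to the first space; the rest is the parameter list.
    $space = mb_strpos($inner, ' ', 0, 'UTF-8');

    if ($space === false)
    {
        // No space, so the token has a name but no parameters.
        return array('name' => mb_strtolower($inner, 'UTF-8'), 'params' => array());
    }

    $name   = mb_substr($inner, 0, $space, 'UTF-8');
    $params = mb_substr($inner, $space + 1, null, 'UTF-8');

    return array(
        'name'   => mb_strtolower($name, 'UTF-8'),
        // The parameters are split but otherwise left uninterpreted.
        'params' => explode($separator, $params),
    );
}

$result = sketchSplitToken('{loadposition myposition,mychrome}');
// $result['name'] is 'loadposition'
// $result['params'] is array('myposition', 'mychrome')
```

As in the loadposition callback, trimming and interpreting each parameter is left to the caller.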

Block tokens

The parser also supports a block syntax that is rich enough to support the custom fields extension. Here's an example:

<li>{field alias=something}{field-label}: {field-value}{/field}</li>

In this case we have a couple of simple tokens, field-label and field-value, surrounded by a begin-end pair of block tokens. The begin block token takes a single argument, which in this case contains an equals sign, although the parser does not attempt to understand it and will simply pass it to the callback function as the string "alias=something" in $token->getParams()[0].

Here is some pseudo-code that will handle the above syntax:

// Define a context variable.
$context = '';

// Get a content parser.
$parser = new JStringParser();

// Register the simple tokens.
$parser->registerToken(
    'field-label',
    function(JStringToken $token) use (&$context)
    {
        // Set $label to the label for the current field defined by $context.
        return $label;
    }
);

$parser->registerToken(
    'field-value',
    function(JStringToken $token) use (&$context)
    {
        // Set $value to the value of the current field defined by $context.
        return $value;
    }
);

// Register the block token. Passing false as the third argument
// marks the token as a block token rather than a simple one.
$parser->registerToken(
    'field',
    function(JStringToken $token, $content) use (&$context)
    {
        $tokenParams = $token->getParams();

        // Set the context for any contained tokens.
        $context = $tokenParams[0];

        return $content;
    },
    false
);

// Parse the content.
$article->text = $parser->translate($article->text);

The first point to notice about the above code is that you indicate whether a token is a simple token or a block token by the third argument passed to the registerToken method. By default this is true, meaning that the token is a simple one. If you want to register a block token you must pass false as the third argument.

The second point to notice is that the callback function for the block token ("field") takes an additional argument. The first argument is a JStringToken as before, but the second argument will be passed the already processed string extracted from between the begin and end tokens. For example, suppose we have this content:

This is a field: {field article-id}Article ID{/field}

Then $content will be passed the string "Article ID" when the callback is called. However, if the string between the begin and end tokens contains other tokens, then these will already have been processed before the callback is called. So if we have this content:

This is a field: {field article-id}{field-label}: {field-value}{/field}

And if we assume the callback for the "field-label" token always returns the string "Label" and the callback for the "field-value" token always returns the string "Value", then $content will be passed the string "Label: Value" rather than "{field-label}: {field-value}".

The third point to note is the use of the $context variable to pass context from the block token callback to any callbacks handling tokens within the block. In the above example, the {field-label} and {field-value} will presumably depend on the "article-id" parameter in the {field article-id} token, so the $context variable is used to pass that context between the callbacks. Although a simple string variable is shown in the example code above, the $context variable can be anything. For example, if you need to support nested block tokens it would make sense for $context to be a stack of contexts; perhaps an array which is pushed and popped appropriately. You may, of course, use multiple variables in the use clauses if you need to pass more context information around.
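
The stack of contexts suggested above might look like the following. This is only a sketch of the bookkeeping: the closure names ($enter, $leave, $current) are hypothetical, and when they would be called from the token callbacks depends on the parser, which this sketch does not model.

```php
<?php
// Hypothetical context stack shared between callbacks by reference.
$contexts = array();

// Push a context when a block is entered.
$enter = function ($context) use (&$contexts) {
    $contexts[] = $context;
};

// Pop the innermost context when a block is left.
$leave = function () use (&$contexts) {
    return array_pop($contexts);
};

// Read the innermost context without removing it (null if the stack is empty).
$current = function () use (&$contexts) {
    return $contexts ? end($contexts) : null;
};

$enter('article-id');
$enter('author-id');
$inner = $current();   // 'author-id'
$leave();
$outer = $current();   // 'article-id'
```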

Other notes

Token names are case-insensitive.
If you call registerToken with a token name that is already registered, then your definition will replace the earlier one.
You can define your own start-of-token and end-of-token strings by passing options in an array to the translator. This could actually be used to define a "micro-syntax" that exists only within an outer pair of block tokens. You can also define your own parameter separator. See the code for the details.
The implementation tries hard to ignore obviously incorrect token usage. For example, a begin block token without a matching end block token should result in the begin block token being ignored, but the content is otherwise "unharmed" and no errors or exceptions are thrown. Similarly, unmatched opening or closing braces, and unregistered tokens are passed unchanged into the output.
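
To illustrate the tolerant behaviour described in these notes, here is a minimal sketch of a translator for simple tokens only. It is a hypothetical stand-in, not the JStringParser implementation: the function name sketchTranslate and the callback signature (an array of raw parameters) are assumptions. It scans with strpos rather than regular expressions, which is safe for UTF-8 content because the delimiters are single ASCII bytes.

```php
<?php
// Hypothetical stand-in for the parser: handles simple tokens only.
function sketchTranslate($text, array $callbacks)
{
    $output = '';
    $offset = 0;

    while (($start = strpos($text, '{', $offset)) !== false)
    {
        $end = strpos($text, '}', $start);

        if ($end === false)
        {
            // Unmatched opening brace: leave the rest of the text untouched.
            break;
        }

        // Copy the literal text that precedes the token.
        $output .= substr($text, $offset, $start - $offset);

        $inner = substr($text, $start + 1, $end - $start - 1);
        $parts = explode(' ', $inner, 2);

        // Token names are case-insensitive (ASCII names assumed here).
        $name = strtolower($parts[0]);

        if (isset($callbacks[$name]))
        {
            $params  = isset($parts[1]) ? explode(',', $parts[1]) : array();
            $output .= call_user_func($callbacks[$name], $params);
        }
        else
        {
            // Unregistered token: pass it through unchanged.
            $output .= substr($text, $start, $end - $start + 1);
        }

        $offset = $end + 1;
    }

    return $output . substr($text, $offset);
}

$callbacks = array(
    'hello' => function (array $params) {
        return 'Hi ' . trim($params[0]);
    },
);

echo sketchTranslate('A {HELLO Bob} B {unknown x} C {', $callbacks);
// A Hi Bob B {unknown x} C {
```

Passing unknown tokens through unchanged is what would allow several content plugins to run over the same content one after another, each replacing only the tokens it has registered.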

Formal syntax definition

The parser supports the syntax defined by the following production rules:

list       ::= string | string token list
token      ::= simple | beginBlock list endBlock
simple     ::= startOfToken name endOfToken | startOfToken name space params endOfToken
beginBlock ::= startOfToken name endOfToken | startOfToken name space params endOfToken
endBlock   ::= startOfToken / name endOfToken
params     ::= param | param paramSeparator params
string     ::= any sequence of zero or more characters not including startOfToken
name       ::= any sequence of at least one non-space character
param      ::= any sequence of zero or more characters except paramSeparator and endOfToken

3af74bd 20 Aug 2016

New generic content parser.

joomla-cms-bot - change - 20 Aug 2016

Revised Documentation

Simple tokens

To use the parser you follow these steps in your content plugin code:

instantiate a JStringParser object.
make one or more calls to the register method to declare a token name with a value, an optional callback and an optional layout, that will replace the token whenever it is encountered in the content. This is explained in more detail below.
call the translate method which will replace all the tokens encountered in the content.

For example, this code

// Get a content parser.
$parser = new JStringParser;

// Register the "mytoken" token.
$parser->register('mytoken', new JStringTokenSimple('contains my token'));

echo $parser->translate('This string {mytoken}.');

will output the string

This string contains my token.

IMPORTANT: Token names are case-insensitive.

You can, of course, register a variable to provide the replacement string:

$myString = 'Walrus';
echo (new JStringParser)
    ->register('character', new JStringTokenSimple($myString))
    ->translate('The time has come, the {character} said.')
    ;

// The time has come, the Walrus said.

NOTE: If you call register with a token name that is already registered, then your definition will replace the earlier one. You can unregister a token using the parser's unregister method.

You can register a callback function that will be called whenever the token is encountered. The callback should return the string that will replace the token. The callback function takes a JStringToken object which allows access to the token definition provided by the register method.

echo (new JStringParser)
    ->register(
        'simple',
        (new JStringTokenSimple)->callback(
            function(JStringToken $token) {
                return '[' . strtoupper($token->getName()) . ']';
                }
            )
        )
    ->translate('This string contains a {simple} callback token.')
    ;

// This string contains a [SIMPLE] callback token.

You can pass parameters in the token and these are also available in the JStringToken object. For example, the following code appears in the loadmodule plugin:

// Get a content parser.
$parser = new JStringParser;

// Register the loadposition token.
// Syntax: {loadposition <module-position>[,<style>]}
$parser->register(
    'loadposition',
    (new JStringTokenSimple)->callback(
        function(JStringToken $token)
        {
            $tokenParams = $token->getParams();
            $position = trim($tokenParams[0]);
            $style = isset($tokenParams[1]) ? trim($tokenParams[1]) : $this->params->def('style', 'none');

            return addcslashes($this->_load($position, $style), '\\$');
        }
    )
);

// Parse the content.
$article->text = $parser->translate($article->text);

You can also assign a JLayout object to process the data before rendering. In this example, a date-of-birth is formatted before being replaced into the output:

echo (new JStringParser)
    ->register(
        'dob',
        (new JStringTokenSimple(array('text' => '18 July 1918')))->layout(
            new JLayoutFile('plugins.user.profile.fields.dob')
            )
        )
    ->translate('Nelson Mandela was born on {dob}.')
    ;

// Nelson Mandela was born on 18 July 1918.

The value assigned to the token in the JStringTokenSimple constructor is passed as the third argument to the callback function, if there is one. The result is then passed to the layout's render method, if there is one, before being substituted into the content string.

Block tokens

The parser also supports a block syntax in which a begin block token is paired with an end block token. A token is defined as being a block token rather than a simple token by declaring it with JStringTokenBlock instead of JStringTokenSimple. The string between the begin and end tokens is passed to the callback function as the second argument. Here's an example:

echo (new JStringParser)
    ->register(
        'shout',
        (new JStringTokenBlock)->callback(
            function(JStringToken $token, $content) {
                return strtoupper($content);
                }
            )
        )
    ->translate('Using all capitals is {shout}known as shouting{/shout} and should be avoided.')
    ;

// Using all capitals is KNOWN AS SHOUTING and should be avoided.

The exact same comments about parameters, callbacks and layouts apply to block tokens as well as simple tokens.

Block tokens can be nested and may also include simple tokens. When nesting, the inner content will be translated before being made available to the outer token.
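
The inside-out order of evaluation can be sketched with a small recursive function. This is a hypothetical illustration rather than the code in this pull request: the name sketchTranslateBlocks and the callback signatures are assumptions, and for brevity it pairs a begin token with the first matching end token, so it does not handle a block nested inside another block of the same name.

```php
<?php
// Hypothetical sketch: simple and block callbacks are passed as two arrays.
function sketchTranslateBlocks($text, array $simple, array $block)
{
    $out = '';
    $pos = 0;

    while (($start = strpos($text, '{', $pos)) !== false
        && ($end = strpos($text, '}', $start)) !== false)
    {
        // Copy the literal text before the token, then split name and params.
        $out   .= substr($text, $pos, $start - $pos);
        $parts  = explode(' ', substr($text, $start + 1, $end - $start - 1), 2);
        $name   = strtolower($parts[0]);
        $params = isset($parts[1]) ? explode(',', $parts[1]) : array();
        $pos    = $end + 1;

        if (isset($simple[$name]))
        {
            $out .= call_user_func($simple[$name], $params);
        }
        elseif (isset($block[$name])
            && ($close = stripos($text, '{/' . $name . '}', $pos)) !== false)
        {
            // Translate the inner content first, then pass it to the block callback.
            $inner = sketchTranslateBlocks(substr($text, $pos, $close - $pos), $simple, $block);
            $out  .= call_user_func($block[$name], $params, $inner);

            // Skip past the end token: "{/" + name + "}".
            $pos = $close + strlen($name) + 3;
        }
        else
        {
            // Unregistered token or unmatched begin token: pass through unchanged.
            $out .= substr($text, $start, $end - $start + 1);
        }
    }

    return $out . substr($text, $pos);
}

$simple = array('name' => function () { return 'world'; });
$block  = array(
    'shout' => function (array $params, $content) { return strtoupper($content); },
    'em'    => function (array $params, $content) { return '<em>' . $content . '</em>'; },
);

echo sketchTranslateBlocks('Say {shout}hello {em}{name}{/em}{/shout}!', $simple, $block);
// Say HELLO <EM>WORLD</EM>!
```

Because the inner {em} block and the {name} token are translated before the {shout} callback runs, the outer callback sees, and upper-cases, the markup they produced.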

Loops

It is possible to implement simple loops by defining a new parser inside the callback function of another. For example, suppose we have an array of field data like this:

$fieldset = array(
    array(
        'label' => 'field1',
        'value' => 'val1',
        ),
    array(
        'label' => 'field2',
        'value' => 'val2',
        ),
    array(
        'label' => 'field3',
        'value' => 'val3',
        ),
    );

Then the following code can be used to generate a list of field labels and values:

echo (new JStringParser)
    ->register(
        'fieldset',
        (new JStringTokenBlock($fieldset))->callback(
            function(JStringToken $token, $content, $value) {

                $return = '';
                $parser = new JStringParser;

                foreach ($value as $field)
                {
                    $parser
                        ->register('label', new JStringTokenSimple($field['label']))
                        ->register('value', new JStringTokenSimple($field['value']))
                        ;
                    $return .= $parser->translate($content);
                }

                return $return;
                }
            )
        )
    ->translate('<ol>{fieldset}<li><strong>{label}</strong>: {value}</li>{/fieldset}</ol>')
    ;

// <ol><li><strong>field1</strong>: val1</li><li><strong>field2</strong>: val2</li><li><strong>field3</strong>: val3</li></ol>

Other notes

You can define your own start-of-token and end-of-token strings by passing options in an array to the translator. This could actually be used to define a "micro-syntax" that exists only within an outer pair of block tokens. You can also define your own parameter separator. See the code for the details.
The implementation tries hard to ignore obviously incorrect token usage. For example, a begin block token without a matching end block token should result in the begin block token being ignored, but the content is otherwise "unharmed" and no errors or exceptions are thrown. Similarly, unmatched opening or closing braces, and unregistered tokens are passed unchanged into the output.

Formal syntax definition

The parser supports the syntax defined by the following production rules:

list       ::= string | string token list
token      ::= simple | beginBlock list endBlock
simple     ::= startOfToken name endOfToken | startOfToken name space params endOfToken
beginBlock ::= startOfToken name endOfToken | startOfToken name space params endOfToken
endBlock   ::= startOfToken / name endOfToken
params     ::= param | param paramSeparator params
string     ::= any sequence of zero or more characters not including startOfToken
name       ::= any sequence of at least one non-space character
param      ::= any sequence of zero or more characters except paramSeparator and endOfToken

c634812 28 Aug 2016

Updated comments.

ba6d35e 28 Aug 2016

Code style fixes.

phproberto - comment - 29 Aug 2016

I really appreciate your work here @chrisdavenport but don't you think you are trying to reinvent the wheel?

What I would do:

Integrate Twig through composer in core.
Create an event onTwigLoadExtensions that would allow anybody to load additional extensions.
Create the tags you want. You can even use the same ones you are proposing here, including loadposition, etc. Ideally fields would be a Twig entity that contains all the methods/properties that are exposed through its public API.
Save the effort of writing documentation for, and maintaining, a new parser. Also save the lines of code needed to develop what you need.
Enjoy extensions/filters that Twig already provides like date, etc.
Prepare a bigger plan for a gradual migration to Twig as main renderer.
Send a message to joomla developers. Joomla is moving to Twig and you can start using it in your own extensions since v3.X.
Get social engagement from PHP community, etc. thanks to using a commonly used library.

But above all: avoid doing things in the Joomla way.

As always I'm not criticizing your job. I respect the time you have invested contributing this. What I try is to avoid big mistakes in the direction Joomla is moving to.

I offer my help for anything you need including integrating everything I exposed here. Obviously I won't do it if nobody is listening up there in the PLT.

chrisdavenport - comment - 29 Aug 2016

I looked at Twig and Mustache and a few others before embarking on writing my own. They are all fine packages in their own way, but I was looking for something that would support the really simple syntax that has become established for Joomla content plugins, remembering that this is syntax that ordinary users, with absolutely no coding experience, must be able to handle. I also wanted something that would minimise use of regular expressions since these can be rather slow and many content plugins are typically run over the content of a page before it gets sent to the browser.

My specific reasons for rejecting Twig were:

The syntax that ordinary users would need to learn is quite complex. Whilst it might just be acceptable to have users cope with "{{" instead of "{", the syntax for passing variables is something I know would be quite beyond some of the users I encounter!
It makes use of some scarily complex regular expressions. Take a look at the lexer here: https://github.com/twigphp/Twig/blob/v1.24.1/lib/Twig/Lexer.php#L42 To be fair, I haven't benchmarked it against my code, so maybe the difference isn't worth worrying about.
I didn't investigate further but it appears that Twig is intended to parse the content once and once only. I wasn't clear on how you could use it when multiple plugins are parsing the same content consecutively, which is what will happen if we have more than one content plugin using the same parser. We need each plugin to ignore tokens/variables it doesn't understand. Maybe there is a way to achieve that with Twig; I couldn't see it.

Of course, there is nothing to stop third-party developers including Twig, Mustache or any other template engine of choice in their own plugins.

Thanks for your feedback @roberto. I really appreciate the time you have spent thinking about this and the voice of your experience is always worth listening to. Personally, I'm not (yet) convinced that Twig is the right answer (although it might be for Joomla X), but I'm going to defer to others to make the decision as I'm obviously too closely involved to be objective.

mahagr - comment - 29 Aug 2016

As I've worked with twig in both Grav and Gantry, I can say that it is fast -- people have benchmarked Gantry against other frameworks including Joomla itself and Gantry is generally almost as fast if not faster than Joomla itself. But twig is really more for templating than for content and compiles to PHP code to make it fast. This is also where I think it shouldn't be used for all content as it defeats the purpose of storing articles into database.

When it comes to the syntax, I actually like {{ and {% more than using a single { for everything; parsing something that doesn't occur naturally in the text also makes it faster to parse. Twig is pretty easy to learn and people seem to love it once they get it, but it does require some learning and coding skills to master.

Twig can easily be used by multiple plugins, but writing a token parser for your own tags means that everyone creating a new syntax needs to create a class that reads the tokens and generates PHP code based on them. It is a pretty involved task and needs some basic knowledge of how compilers work.

In summary: I don't think that twig should be used for this purpose, even though I'd love to see Joomla using twig as its primary templating language (instead of PHP files). I'm also not sure how you could use twig without introducing better and more general models for articles, categories etc which you could use to load arbitrary data from Joomla. Creating twig TokenParsers for everything just doesn't feel to be the right way to go...

If you want to see how Twig could be used in Joomla, please see: https://github.com/gantry/gantry5/blob/develop/engines/joomla/nucleus/particles/contentarray.html.twig which basically replaces most article modules in Joomla. But to make something like this to work, you really need to redo all the models as right now the models in Joomla work only in a single context (usually inside a single component).

Here are my models for Joomla articles allowing me to load and display Joomla articles from anywhere by using a simple API:
https://github.com/gantry/gantry5/tree/develop/src/platforms/joomla/Joomla/Content

It's documented (for Twig) here:
http://docs.gantry.org/gantry5/advanced/content-in-particles

@chrisdavenport I've meant to contact you on these classes; I think they'd be really useful for your services work.

phproberto - comment - 29 Aug 2016

Thanks for the fast reply @chrisdavenport and for taking the time to reply and discuss things.

> The syntax that ordinary users would need to learn is quite complex. Whilst it might just be acceptable to have users cope with "{{" instead of "{", the syntax for passing variables is something I know would be quite beyond some of the users I encounter!

About {{ and {: I don't think that's really an issue. You can keep B/C for those tags but introduce new ones that will always use {{. In fact it is probably better, because you know that {{ always means Twig.

About passing variables I think that's because you haven't used Twig and you really think you need a custom token for everything. The main plan should be to write Entities that would be used internally by Twig. Let's take an example: Article twig entity which should be our future goal.

That class will only contain those methods that are publicly available for templates. So if you have a module that is displaying information about one article, its layout will receive an Article entity from which you can do whatever you want. Imagine that you need to get the author of the article. You could do something like:

<span class="article-author">{{ article.getAuthor().getName() }}</span>

or:

<span class="article-author">{{ article.author.name }}</span>

Because Twig already searches for getters automatically. What does that mean?

article.getAuthor() will retrieve a User Twig entity which already contains its own methods usable by templaters.
Author information is only retrieved when it is required, so you don't need to get all the information every time you pass an article to a layout.
The Twig entities define methods available and are a very nice abstraction layer for users that allows us to change the logic without affecting templaters work. Imagine that you change the name db column to title. That can be done transparently because the entity will retrieve the new data + ensure that getName() method still returns the right information.
Entities are easy to document automatically with something like phpDocumentor which takes the data directly from the entity class.

> It makes use of some scarily complex regular expressions. Take a look at the lexer here: https://github.com/twigphp/Twig/blob/v1.24.1/lib/Twig/Lexer.php#L42 To be fair, I haven't benchmarked it against my code, so maybe the difference isn't worth worrying about.

Twig is used everywhere and it has been available for years now. I don't think reliability is a real issue.

> I didn't investigate further but it appears that Twig is intended to parse the content once and once only. I wasn't clear on how you could use it when multiple plugins are parsing the same content consecutively, which is what will happen if we have more than one content plugin using the same parser. We need each plugin to ignore tokens/variables it doesn't understand. Maybe there is a way to achieve that with Twig; I couldn't see it.

Plugins don't need to parse the same content recursively because plugins will load the tags, functions and filters into the main Twig environment and the content/template will be processed once.

mahagr - comment - 29 Aug 2016

@phproberto I've already implemented Article classes which can be used for this. See my links above...

phproberto - comment - 29 Aug 2016

Thanks @mahagr. I hope that helps to explain the behavior I'm trying to describe, and that we don't need to register 123123 custom tags. We just need Twig entities that are passed to layouts, that allow one entity to be retrieved from another, etc., and that will serve as our API for templates.

brianteeman - change - 29 Oct 2016

Labels

Removed: ?

brianteeman - comment - 4 Jan 2018

Is anything happening with this RFC? It's been over a year.

phproberto - comment - 5 Jan 2018

Hey @brianteeman,

lately I feel rejected by the system as my vision for Joomla clearly goes in the opposite way than the decisions taken by the leadership teams. I decided to stop losing my time "fighting" the system to contribute things nobody wants. Instead I release my own packages.

You can find what I suggested here in my Joomla-Twig package:
https://phproberto.github.io/joomla-twig/

100% unit tested, 100% based on plugins and with public docs.

laoneo - comment - 8 Jan 2018

Just for the record, a fields plugin was implemented with pr #13814 using a similar syntax as the loadmodule content plugin.

brianteeman - comment - 10 Apr 2018

Closing this as it has clearly been abandoned.

brianteeman - change - 10 Apr 2018

Status	Discussion	⇒	Closed
Closed_Date	0000-00-00 00:00:00	⇒	2018-04-10 12:48:25
Closed_By		⇒	brianteeman

brianteeman - close - 10 Apr 2018
