PDD 20: Lexical Variables

Abstract

This document defines the requirements and implementation strategy for lexically scoped variables.

Synopsis

    .sub 'foo'
        .lex "$a", $P0
        $P1 = new 'Integer'
        $P1 = 13013
        store_lex "$a", $P1
        print $P0            # prints 13013
    .end

    .sub 'bar' :outer('foo')
        $P0 = find_lex "$a"
    .end

    .sub 'baz'
        $P0 = find_lex "$a"  # guaranteed to fail: no .lex, no :outer()
    .end

    .sub 'corge'
        print "hi"
    .end # no .lex and no :lex, thus: no LexInfo, no LexPad


    # Lexical behavior varies by HLL.  For example,
    # Tcl's lexicals are not declared at compile time.

    .HLL "Tcl"
    .loadlib 'tcl_group'

    .sub grault :lex         # without ":lex", Tcl subs have no lexicals
        $P0 = find_lex "x"   # FAILS

        box $P0, 42          # really TclInteger
        store_lex "x", $P0   # creates lexical "x"

        $P0 = find_lex "x"   # SUCCEEDS
    .end

Description

For Parrot purposes, "lexical variables" are variables stored in a hash (or hash-like) PMC associated with a subroutine invocation, a.k.a. a call frame.

Conceptual Model

LexInfo PMC

LexInfo PMCs contain what is known at compile time about lexical variables of a given subroutine: their names (for most languages), perhaps their types, etc. They are the interface through which the PIR compiler stores and validates compile-time information about lexical variables.

At compile time, each newly created Subroutine (or Subroutine derivative, e.g. Closure) that uses lexical variables will be populated with a PMC of HLL-mapped type LexInfo. (Note that this type may actually be Null in some HLLs, e.g. Tcl.)

LexPad PMC

LexPads hold what becomes known at run time about lexical variables of a given invocation of a given subroutine: their values, of course, and for some languages (e.g. Tcl) their names. They are the interface through which the Parrot runtime stores and fetches lexical variables.

At run time, each call frame for a Sub (or Sub derivative) that uses lexical variables will be populated with a PMC of HLL-mapped type LexPad. Note that call frames for subroutines without lexical variables will omit the LexPad.

From the interface perspective, LexPads are basically Hashes, with strings as keys and PMCs as values. They extend the basic Hash interface with specialized initialization (requiring a reference to an associated LexInfo) and the query METHOD get_lexinfo() (to return it).

LexPad keys are unique. Therefore, in each subroutine, there can be only one lexical variable with a given name.

In the normal use case, LexPads are not exposed to user code (not for any special reason; it just worked out that way). Instead, specialized opcodes implement the common use cases. Specialized opcodes are particularly a Good Idea here because most lexical usage involves searching more than one LexPad, so a single LexPad reference wouldn't be as useful as one might expect. And, of course, opcodes can cheat ... er, can be written in optimized C. :-)

Nested Subroutines Have Outies; the ":outer" attribute

For HLLs that support nested subroutines, Parrot provides a way to denote that a given subroutine is conceptually "inside" another. Lookup for lexical variables starts at the current call frame and proceeds through call frames of the "outer" subroutines. The specific meaning of "outer" is defined below, but it's designed to support the common linguistic structure of nested subroutines where inner subs refer to lexical variables contained in outer blocks.

Note that "outer" and "caller" are very different concepts! For example, given the Perl 6 code:

   sub foo {
      my $a = 1;
      my sub a { eval '$a' }
      return &a;
   }

The &foo subroutine is the outer subroutine of &a, but it is not the caller of &a.

In the above example, the definition of the Parrot subroutine implementing &a must include a notation that it is textually enclosed within &foo. This is normally a static attribute of a Sub, but can be changed dynamically using the set_outer method.

    .sub 'a' :outer('foo')
       # ...
    .end

The value of :outer identifies a subroutine by its :subid flag; subroutine definitions that do not have an explicit :subid flag have the name of the Sub as the :subid.

Note that the outer sub must be compiled first; in other words, "foo" must appear before "a" in the source text. Compilers can easily do this via preorder traversal of lexically-nested subs.

Capturing the lexical environment

The capture_lex opcode is used to attach the current lexical environment to any subroutines that are lexically nested within the current sub. This is normally done either when the outer sub is entered or just prior to invoking the inner sub.

    .sub 'a'
        .lex '$a', $P0
        # ...
        # capture current lexical environment for inner sub 'foo'
        .const 'Sub' $P0 = 'foo'
        capture_lex $P0
        # invoke inner sub 'foo'
        'foo'()
    .end

The newclosure opcode will clone a subroutine and then perform capture_lex on the newly cloned sub.

    .sub 'a'
        .lex '$a', $P0

        # invoke inner sub 'foo'
        .const 'Sub' $P0 = 'foo'
        $P1 = newclosure $P0
        $P1()
    .end

Lexical Lookup Algorithm

When a subroutine is invoked, its newly created call frame points to the outer sub's context that was established for the sub by the capture_lex opcode above. When Parrot is asked to access a lexical variable named '$a' (e.g., via the find_lex opcode), Parrot starts with the current call frame and follows the chain of outer lexical call frames until it finds one containing the requested lexical. If none of the outer lexical environments define such a variable, an exception is thrown.

Autoclose semantics

If an inner subroutine is invoked that hasn't had a capture_lex operation performed on it, then Parrot will attempt to dynamically perform the lexical capture using the call from its outer sub. If the outer sub doesn't have a call frame, as might occur when jumping directly to the inner sub without previously invoking the outer, then Parrot creates a dummy call frame for the outer sub to be used for its inner lexical sub captures (until the outer sub is invoked, at which point it receives a new call frame).

Note that the dummy call frame created for the outer sub will be attached to its outer call frame, which may require creating dummy call frames for additional outer contexts (until an invoked outer sub is located, or the top-level outer lexical context is reached).

LexPad and LexInfo are optional; the ":lex" attribute

Parrot does not assume that every subroutine needs lexical variables. Therefore, Parrot defaults to not creating LexInfo or LexPad PMCs. It only creates a LexInfo when it first encounters a ".lex" directive in the subroutine. If no such directive is found, Parrot does not create a LexInfo for it at compile time, and therefore cannot create a LexPad for it at run time.

However, an absence of ".lex" directives is normal for some languages (e.g. Tcl) which lack compile-time knowledge of lexicals. For these languages, the additional Subroutine attribute ":lex" should be specified. It forces Parrot to create LexInfo and LexPads.

HLL Type Mapping

The implementation of lexical variables in the PIR compiler depends on two new PMCs: LexPad and LexInfo. However, the default Parrot LexPad and LexInfo PMCs will not meet the needs of all languages. They should suit Perl 6, for example, but not Tcl.

Therefore, it is expected that HLLs will map the LexPad and LexInfo types to something more appropriate (e.g. TclLexPad and TclLexInfo). That mapping will automatically occur when the appropriate ".HLL" directive is in force.

Using Tcl as an extreme example: TclLexPad will likely be a thin veneer on PMCHash. Meanwhile, TclLexInfo will likely map to Null: Tcl provides no reliable compile-time information about lexicals; without any compile-time information to store, there's no need for TclLexInfo to do anything interesting.

Required Interfaces: LexPad, LexInfo

LexInfo

Below are the standard LexInfo methods that all HLL LexInfo PMCs may support. Each LexInfo PMC should only define the methods that it can usefully implement, so the compiler can use method lookup failure to generate useful diagnostics (e.g. "register aliasing not supported by Tcl lexicals").

Each language's LexInfo will implement methods that are helpful to that language's LexPad. In the extreme case, LexInfo can be Null -- but if it is, the given HLL should not generate any ".lex*" directives.

void init_pmc(PMC *sub)
Called exactly once.
PMC *get_sub()
Return the associated Subroutine.
void declare_lex_preg(STRING *name, INTVAL preg)
Declare a lexical variable that is an alias for a PMC register. The PIR compiler calls this method in response to a .lex STRING, PREG directive. For example, given this preamble:
    .lex "$a", $P0
    $P1 = new 'Integer'
These two opcodes have an identical effect:
    $P0 = $P1
    store_lex "$a", $P1
And these two opcodes also have an identical effect:
    $P1 = $P0
    $P1 = find_lex "$a"

LexPad

LexPads start by implementing the Hash interface: variable names are string keys, and variable values are PMCs.

In addition, LexPads must implement the following methods:

void init_pmc(PMC *lexinfo)
Called exactly once. Note that Parrot guarantees that this method will be called after the new Context object is made current. It is recommended that any LexPad that aliases registers take a pointer to the current Context at init_pmc() time.
PMC *get_lexinfo()
Return the associated LexInfo.

Default Parrot LexPad and LexInfo

The default LexInfo supports lexicals only as aliases for PMC registers. It therefore implements declare_lex_preg(). (Internally, it could be a Hash of some kind, where keys are String variable names and values are integer register numbers.)

The default LexPad (like all LexPads) implements the Hash interface. When asked to look up a variable, it finds the corresponding register number by querying its associated LexInfo. It then gets or sets the given numbered register in its associated Parrot Context structure.

Introspection without Call Frame PMCs

Due to implementation concerns, it will not be until late in Parrot development -- if ever -- that call frames will be available to user code as PMCs. Until then, the interpreter and continuation PMCs will be the interface to use to get frame info.

For example, to get the immediate caller's LexPad, use:

    $P0 = getinterp
    $P1 = $P0["lexpad"; 1]

To access a sub's :outer subroutine, use the get_outer() method:

    .include "interpinfo.pasm"
    interpinfo $P1, .INTERPINFO_CURRENT_SUB
    $P2 = $P1."get_outer"()

Here, $P1 contains information on the current subroutine. $P2 will contain $P1's outer subroutine.

To get $P2's outer subroutine (if any), the same method can be used on $P2 itself:

    $P3 = $P2."get_outer"()

Using the interpinfo instruction is one way to do it. Another way is this:

    $P0 = getinterp
    $P1 = $P0["outer"; "sub"]
    $P2 = $P0["outer"; "sub"; 2] # get the outer sub of the current's outer
                                 # subroutine

It is also possible to get the :outer sub's LexPad, as above:

    $P0 = getinterp
    $P1 = $P0["outer"; "lexpad"]

See [1] for an example.

It's likely that this interface will continue to be available even once call frames become visible as PMCs.

Implementation

TK.

References

t/op/lexicals.t