October 2011 GNU Toolchain Update

Hi Guys,

  Quite a lot of things have happened in the last month.  Here are the highlights:
 
    * Support for the Tilera TILEPro and TILE-Gx architectures has been added to the binutils.

    * Readelf can now decode Sparc hardware attributes.

    * The binutils 2.22 branch has been created, so a new release should be out soon.

    * GCC now supports vector comparison with the standard C comparison operators: ==, !=, <, <=, >, >=.  Comparison operands can be vector expressions of integer-type or real-type.  Comparison between integer-type vectors and real-type vectors is not supported.  The result of the comparison is a vector of the same width and number of elements as the comparison operands with a signed integral element type.

      Vectors are compared element-wise, producing 0 when the comparison is false and -1 (a constant of the appropriate type with all bits set) otherwise.  Consider the following example:

        typedef int v4si __attribute__ ((vector_size (16)));

        v4si a = {1, 2, 3, 4};
        v4si b = {3, 2, 1, 4};
        v4si c;

        c = a >  b;     /* The result would be {0, 0,-1, 0}  */
        c = a == b;     /* The result would be {0,-1, 0,-1}  */
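
      Because a true element has all of its bits set, the comparison result can be used directly as a bit mask.  The following is a small sketch of that idea, not something taken from the GCC documentation: vmax is a made-up helper name, and it computes an element-wise maximum without any branches.

```c
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: element-wise maximum of two vectors.  */
static v4si
vmax (v4si a, v4si b)
{
  /* Each element of mask is -1 (all bits set) where a > b, and 0 elsewhere.  */
  v4si mask = a > b;

  /* Keep the element of a where the mask is set, the element of b otherwise.  */
  return (a & mask) | (b & ~mask);
}
```

      Calling vmax on {1, 7, 3, 4} and {3, 2, 1, 4} should give {3, 7, 3, 4}.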



    * GCC now supports vector shuffling using two builtin functions:
   
        __builtin_shuffle (vec, mask)
        __builtin_shuffle (vec0, vec1, mask)


      The functions construct an output vector from selected elements of either one or two input vectors.  The output vector is always of the same type as the input vector(s).

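      The mask is an integral vector with the same width and element count as the output vector.  Each element in the mask specifies which element from the input vector(s) should be selected for the corresponding position in the output vector.  Numbering starts at 0 and is computed modulo the total number of elements in the input vector(s).  With two input vectors the elements of vec1 are numbered after those of vec0, so for four-element vectors indices 0 to 3 select from vec0 and indices 4 to 7 select from vec1.  For example: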

        typedef int v4si __attribute__ ((vector_size (16)));

        v4si a     = {1, 2, 3, 4};
        v4si b     = {5, 6, 7, 8};
        v4si mask1 = {0, 1, 1, 3};
        v4si mask2 = {0, 4, 2, 8};
        v4si res;

        res = __builtin_shuffle (a, mask1);       /* res is {1,2,2,4}  */
        res = __builtin_shuffle (a, b, mask2);    /* res is {1,5,3,1}  */
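
      Two common uses of shuffling are reversing a vector and interleaving two vectors.  Here is a rough sketch of both, assuming a GCC that provides __builtin_shuffle; the helper names reverse and interleave_low are invented for this illustration.

```c
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper: return the elements of v in reverse order.  */
static v4si
reverse (v4si v)
{
  const v4si mask = {3, 2, 1, 0};
  return __builtin_shuffle (v, mask);
}

/* Hypothetical helper: interleave the first two elements of a and b.
   Indices 0..3 select from a, indices 4..7 select from b.  */
static v4si
interleave_low (v4si a, v4si b)
{
  const v4si mask = {0, 4, 1, 5};
  return __builtin_shuffle (a, b, mask);
}
```

      With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, reverse (a) should give {4, 3, 2, 1} and interleave_low (a, b) should give {1, 5, 2, 6}.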



    * GCC has a new, somewhat useless feature for the C language:

        -fallow-parameterless-variadic-functions

      This allows variadic functions without named parameters.  Although it is possible to define such a function, it is not very useful, as there is no named parameter to pass to va_start() and so it is not possible to read the arguments.  The option only applies to C, because C++ already allows this construct.


    * A couple of new warnings have been added as well:

        -Wunused-local-typedefs

      Warns when a typedef locally defined in a function is not used.


        -Wvector-operation-performance

      Warns if a vector operation is not implemented via the SIMD capabilities of the architecture.  Mainly useful for performance tuning.
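
      As a small illustration of the first of these warnings, here is a sketch of a function that should trigger -Wunused-local-typedefs when compiled with that option.  The names twice and scratch_t are made up for the example.

```c
/* Hypothetical example: doubles its argument.  */
int
twice (int x)
{
  /* This typedef is local to the function and never used, which is
     what -Wunused-local-typedefs is meant to report.  */
  typedef long scratch_t;

  return x * 2;
}
```

      The code still compiles and behaves normally; the warning only points out that the typedef can be removed.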


    * Four new optimizations have been added to GCC as well:

        -fno-fat-lto-objects

      Fat LTO objects are object files that contain both the intermediate language and the object code.  This makes them usable for both LTO linking and normal linking, and they are the default when -flto is used.  -fno-fat-lto-objects improves compilation time over plain LTO by not storing the object code in the object files, but it requires the complete toolchain to be aware of LTO.  As a minimum, this means that the linker must have plugin support.  Additionally, nm, ar and ranlib need to support linker plugins in order to allow a full-featured build environment (capable of building static libraries etc).

     
        -foptimize-strlen

      Enables string length optimizations.  It attempts to track string lengths and optimize various C string functions like strlen(), strchr(), strcpy(), strcat() and stpcpy() into faster alternatives.  This pass is enabled by default at -O2 and above, unless optimizing for size.  For example, this optimization can change:

        char *
        append_slash (const char * a)
        {
          size_t l = strlen (a) + 2;
          char * p = malloc (l);
          if (p == NULL)
            return p;
          strcpy (p, a);
          strcat (p, "/");
          return p;
        }


      into:

        char *
        append_slash (const char * a)
        {
          size_t tmp = strlen (a);
          char * p = malloc (tmp + 2);
          if (p == NULL)
            return p;
          memcpy (p, a, tmp);
          memcpy (p + tmp, "/", 2);
          return p;
        }


      The next optimization will be especially useful for improving the scores in synthetic benchmarks like Dhrystone or CoreMark:
   
        -fshrink-wrap

      This makes GCC emit function prologues only before parts of the function that need it, rather than at the top of the function.  This feature is enabled by default at -O and higher.  For example in a function like this:

        extern int bar (int *, int);

        int
        foo (int arg)
        {
          if (arg)
            return arg * 2;
          else
            {
              int array[4] = {1,2,3,4};
              return bar (array, arg);
            }
        }


      A stack frame is only needed if arg is zero.  Otherwise foo() can act just like a leaf function, and no stack space, function prologues or epilogues are needed.

      Lastly there is:
    
        -ftree-tail-merge
      

      This looks for identical code sequences at the end of functions. When found it replaces one with a jump to the other.  This optimization is enabled by default at -O2 and higher.


    * Some target specific GCC features have been added as well:
  
        -mtune=generic-<arch>    [For x86 targets]

      This specifies that GCC should tune the performance for a blend of processors within architecture <arch>.  The aim is to generate code that runs well on the currently most popular processors, balancing optimizations that benefit some CPUs in the range against the performance pitfalls of other CPUs.

        -mpid             [For the RX target]

      This enables the generation of position independent data (but not code).  When enabled, any access to constant data will be done via an offset from a base address held in a register.  This allows the location of constant data to be determined at run-time without requiring the executable to be relocated, which is a benefit to embedded applications with tight memory constraints.  Data that can be modified is not affected by this option.

        -munaligned-access    [For ARM targets]
      
      Enables unaligned word and halfword accesses to packed data.  This is enabled by default for all ARMv6, ARMv7-A, ARMv7-R, and ARMv7-M architecture-based processors, and disabled for other ARM architectures.
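
      To show the kind of code this affects, here is a minimal sketch of a packed structure whose 32-bit field is not word-aligned.  The structure and function names are invented for the example.  When unaligned access is enabled the compiler is free to read the field with a single word load; when it is disabled the value has to be assembled from smaller accesses.

```c
#include <stdint.h>

/* Hypothetical packed record: no padding is inserted after tag,
   so value sits at offset 1 and is not word-aligned.  */
struct __attribute__ ((packed)) record
{
  uint8_t  tag;
  uint32_t value;
};

/* Reads the unaligned field.  The result is the same either way;
   only the instructions used to load it differ.  */
uint32_t
read_value (const struct record *r)
{
  return r->value;
}
```

      The C code is portable across GCC targets; only the generated load sequence on ARM depends on the option.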

Cheers
  Nick
