Re: Incredible!


Subject: Re: Incredible!
From: Mac Man (macman@onetel.net.uk)
Date: Wed Sep 12 2001 - 16:03:01 MDT


Hi Nathan,

Well, your explanation seems to match what is actually happening. My
code is hard C, implementing algorithms designed on a modelling
principle of using as few iterations as possible. I'm dealing with a
matrix representation of a sporadic group called the Monster group.

What was actually happening was that the kernel was kicking in and
using the extra PPC AltiVec 128-bit registers, which, together with
what Sean said about the 64-bit FPU, results in the incredible
performance difference. As to the person who was asking about getting a
copy of the program: it's designed specifically for a dual-processor
machine. Without the second processor, on your Intel PIII 700 it would
take a hell of a lot longer, as each matrix would have to go through
the same processor again and again (as opposed to the two processors
working in parallel). As silly as it will sound to you, I ran this on a
single 1 GHz Athlon and it took 17 minutes, whereas the dual P4 takes
about 3 minutes and the dual PPC took 13 seconds. The code was
optimized as far as possible, including the -unroll-loops option (which
causes the most incredible performance difference of the lot).
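
For readers unfamiliar with loop unrolling, here is a minimal sketch of
what such an option asks the compiler to do automatically. It is not
Alex's code: the dot-product functions and their names are made up for
illustration, and the assumption is that the compiler was GCC, whose
flag for this is spelled -funroll-loops.

```c
#include <stddef.h>

/* Plain dot product: one multiply-add and one loop test per element. */
double dot_plain(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical hand-unrolled version: four elements per loop test,
 * with four independent partial sums the CPU can work on in parallel. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    double sum = (s0 + s1) + (s2 + s3);

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Both functions compute the same mathematical result. The unrolled one
adds the terms in a different order, so with general floating-point
data the last few bits may differ. Whether it is faster depends on the
compiler and the processor.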

Also, when I profiled how the multithreading is working, the PPC
result was somewhere in the region of 60% more efficient. I am no
expert, as I'm very new to the Mac, but so far it has outperformed my
PC in everything except the price of software.
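
As a rough illustration of the kind of two-processor split described
above, the sketch below divides a matrix-vector product between two
POSIX threads. It assumes pthreads and a row-major square matrix; the
struct and function names are hypothetical and are not taken from the
actual program.

```c
#include <pthread.h>
#include <stddef.h>

/* One worker's share of y = M * x: rows [row_begin, row_end). */
struct slice {
    const double *m;    /* row-major n x n matrix */
    const double *x;    /* input vector, length n */
    double *y;          /* output vector, length n */
    size_t n;
    size_t row_begin;
    size_t row_end;
};

static void *multiply_rows(void *arg)
{
    const struct slice *s = arg;

    for (size_t r = s->row_begin; r < s->row_end; r++) {
        double sum = 0.0;
        for (size_t c = 0; c < s->n; c++)
            sum += s->m[r * s->n + c] * s->x[c];
        s->y[r] = sum;  /* each thread writes only its own rows */
    }
    return NULL;
}

/* Split the rows between two threads; returns 0 on success. */
int matvec_two_threads(const double *m, const double *x, double *y, size_t n)
{
    struct slice lo = { m, x, y, n, 0, n / 2 };
    struct slice hi = { m, x, y, n, n / 2, n };
    pthread_t worker;

    if (pthread_create(&worker, NULL, multiply_rows, &hi) != 0)
        return -1;
    multiply_rows(&lo);  /* the calling thread handles the first half */
    if (pthread_join(worker, NULL) != 0)
        return -1;
    return 0;
}
```

Because the two halves touch disjoint rows of y, no locking is needed.
On older toolchains this has to be linked with -pthread.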

Regards,

Alex

P.S. How compatible are the RPMs between Red Hat and YDL?

On Wednesday, September 12, 2001, at 10:24 pm, Nathan Buck wrote:

> Mac Man wrote:
>
>> I just got through running an alpha of the modelling program I am
>> working on, under both RedHat and YDL, and I don't understand it.
>> Could someone who is technically minded explain to me why this G4,
>> Dual Processor 800 MHz, goes about 15 times faster than the Dual
>> Processor Pentium 4, 1.8 GHz? Both machines have 1.5 Gigs of memory.
>> I'm a mathematician, not a computer guru, and my math skills say
>> that 1.8 GHz >> 0.8 GHz.
>
>
> Alex:
>
> The simple yet technical explanation of why you are seeing such a
> huge difference between the two systems:
>
> First off, hardware-wise: the Pentium 4 is an incredibly inefficient
> processor. In fact, a 1 GHz Pentium III can in many situations
> outperform a P4 clocked at 1.2 GHz. The reasons are as follows:
>
> x86 processors are based on Complex Instruction Set Computer (CISC)
> design, which says, "Have lots of instructions, so that each task
> takes fewer computational cycles to complete." This is versus Reduced
> Instruction Set Computer (RISC) design, which says, "Have few
> instructions, making each instruction run faster." Ultimately
> everything that gets sent to the processor is a matter of addition,
> so a command on an x86 that might take 3 encoded chip functions to
> process can take 8 to 200 steps on a RISC processor. The
> architectures are then built around those concepts to try to move
> data as fast as possible.
>
> In x86 chips nowadays, in order to ramp up the clock speeds (which
> just measure how fast the internal bus runs), they build the command
> pipelines long and thin and use what's called predictive processing,
> where they guess which branch an instruction is going to take. RISC
> chips use a similar predictive processing method, but because they
> have few instructions, they can build shorter and wider command
> pipelines. When a wrong prediction is made, that entire channel has
> to clear and then processing has to be attempted again. Obviously, a
> shorter pipeline helps when you have a problem like that because it
> clears faster.
>
> The P4 and the G4 have these extended "special" additions meant to
> aid in certain kinds of processing. The G4 uses AltiVec, and the P4
> uses SSE (Intel's successor to MMX). These are supposed to help
> immensely when you are performing matrix-intensive math processes
> (like modeling and graphics rendering). You do have to have compiler
> optimization for these.
>
> A point in favor of the G4 here is that the kernel itself also
> supports AltiVec processing, so certain calls you make in your
> program automatically utilize the AltiVec unit, though not
> efficiently, since they are being passed through the kernel. On the
> P4 there is no SSE optimization in the kernel because of the
> specialized nature of the unit. It's supposed to be utilized
> automatically, but you'll see limited increases without compiler
> optimization in that case.
>
> Additionally, the P4 includes another special set of instructions
> that are part of Intel's next generation of predictive processing.
> But these are entirely compiler dependent. At compile time, when
> using a compiler designed to use these instruction sets, certain
> instructions are generated that greatly (and I mean GREATLY) increase
> the efficiency of the P4's branch prediction module. The downside to
> this optimization is that the P4's branch prediction while running
> non-optimized code blows chunks, even compared to the PIII, which is
> why a PIII will beat a P4 running non-optimized code in most cases
> (unless you're running code that doesn't involve branch prediction).
>
> Yet another point in Apple's favor in this case is the multi-threaded
> design of their main board and processors, which enables a "slightly"
> more efficient SMP operation. SMP on Intel/x86-based hardware has
> from the beginning been considered a "specialty" item, in place of
> which most people would probably buy a REAL server, so it's not
> designed very well at the hardware level.
>
> If you take into account the possibility that you have not written
> your code very well and made a few "oversights" that cause a high
> rate of misses in branch prediction on the P4, you've got an answer
> right there.
>
> Remember, clock speeds are a very small matter when it comes to
> processors, especially now. Other, and usually more important, points
> are: what kind of processing the CPU is optimized for; whether your
> compiler is optimized for the processor; and, if SMP, how well you
> are writing your multi-threading code (because if you aren't, then
> it's all the kernel's job to translate it to multiple threads,
> probably using the AltiVec units to help, another bonus). Remember, a
> Ferrari with a pedal that only depresses by 25 degrees can still be
> beaten by a Ford pickup truck with a 75 degree pedal.
>
>
>
>
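
To make the branch-prediction point in the quoted message a little more
concrete, here is a small sketch of the kind of code where it matters.
The functions are hypothetical and are not from either poster's code.
The first has a data-dependent `if` that a predictor will often guess
wrong on random input; the second does the same job without a
conditional jump in the source. A modern compiler may well turn either
form into the other, so treat this as an illustration rather than a
guaranteed speed-up.

```c
#include <stddef.h>

/* Branching version: on random data the `if` is hard to predict. */
long sum_above_branchy(const int *v, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > threshold)
            sum += v[i];
    }
    return sum;
}

/* Branch-free version: the comparison yields 0 or 1, which is turned
 * into a mask of all zeros or all ones and applied to the value. */
long sum_above_branchless(const int *v, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        long keep = -(long)(v[i] > threshold);  /* 0 or -1 (all ones) */
        sum += (long)v[i] & keep;
    }
    return sum;
}
```

On a long pipeline each wrong guess in the first loop throws away more
in-flight work, which is the effect described in the quoted message.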


