Sciencemadness Discussion Board
Author: Subject: Speed of C programs, relative to Java
woelen
Super Administrator
*********




Posts: 7977
Registered: 20-8-2005
Location: Netherlands
Member Is Offline

Mood: interested

[*] posted on 11-12-2023 at 04:20
Speed of C programs, relative to Java


I am working on a library for polynomial root finding (soon, more will follow on that on sciencemadness). The software is written in Java, but I also make ports of the software to C, which is fairly easy (porting the other way around is more cumbersome).

The reason for making the C-ports is twofold. One reason is that it makes the software available to a larger group of people, and the second is that C-programs are said to perform better than Java programs in terms of computational speed.

However, I notice that my C ports consistently run slower than the Java ports. It hardly matters what software I port: the C code always runs between 1.5 and 2 times slower than the Java code with exactly the same input.

I run the software on a Core i7-13700 with DDR4-3200 memory, running Ubuntu 22.04. For comparison of speed, I run all software in a single thread, just to make comparison simpler. None of the software is really memory intensive, it all runs in a few MByte at most.

The Java environment is a standard JDK, 11.0.x, supplied with Ubuntu, or Oracle's HotSpot JDK, also for Java 11. I compile the programs, using Netbeans. No additional frameworks besides standard Java SE are used.

The C environment is GCC, as supplied with Ubuntu 22.04, with the standard library for I/O and math. The code uses math-library calls sparingly, though; most of it is just in-language multiplication, division, addition, and subtraction. I compile the programs with optimization level -O2, -O3, or even -O6, but between those there is hardly any difference. Without optimization (-O0), the C software is nearly twice as slow as with -O2. The software is standard C, not C++, and no additional frameworks are used; it runs with just the standard stdio.h, stdlib.h, and math.h.

Porting of the code is done in a really straightforward way. The Java code works on arrays of doubles, ints, etc. The C code does the same, using the same indexing scheme. No pointer arithmetic is used in the C code; the Java code is copied into the C code nearly 1 to 1.

I have the impression that I am missing something. Many websites on the internet claim that C is much faster than Java in all kinds of applications, but I see things the other way around. My C ports are consistently at least 1.5 times slower than the Java code, even when the C code is compiled with heavy optimization.

Could it be that my type of code (numerical mathematics, containing a lot of computation, but hardly any I/O, no database interaction, and hardly any heavy data-shifting through memory) is very atypical, and is actually faster in Java due to JIT-compiler optimization?

Any experts out there, who could shed some light on this?




The art of wondering makes life worth living...
Want to wonder? Look at https://woelen.homescience.net
Tsjerk
International Hazard
*****




Posts: 3022
Registered: 20-4-2005
Location: Netherlands
Member Is Offline

Mood: Mood

[*] posted on 11-12-2023 at 05:23


Are you promoting a lot of ints to longs? That takes a lot of time; it might be optimized away by starting out with long instead of int.
woelen
[*] posted on 11-12-2023 at 06:12


No, I do not convert ints to longs. It is mostly floating-point (double) arithmetic, and a little integer arithmetic (e.g. computing indexes into arrays, that kind of thing).



Sulaiman
International Hazard
*****




Posts: 3561
Registered: 8-2-2015
Location: 3rd rock from the sun
Member Is Offline


[*] posted on 11-12-2023 at 06:31


I expect that if you wrote the programme in C first and then ported it to any other language, the C version would run better.
That applies to all programming languages.




CAUTION : Hobby Chemist, not Professional or even Amateur
Keras
National Hazard
****




Posts: 775
Registered: 20-8-2018
Location: (48, 2)
Member Is Offline


[*] posted on 11-12-2023 at 12:48


Quote: Originally posted by woelen  
I am working on a library for polynomial root finding (soon, more will follow on that on sciencemadness). The software is written in Java, but I also make ports of the software to C, which is fairly easy (porting the other way around is more cumbersome).

However, I notice that my C ports consistently run slower than the Java ports. It hardly matters what software I port: the C code always runs between 1.5 and 2 times slower than the Java code with exactly the same input.


This is absolutely not normal. C code is, barring assembly, the closest to the metal you can get with a ‘high-level’ programming language. In comparison, Java is interpreted, which means that the program is not transformed into machine code directly ingestible by the CPU, but run by a separate process (the ‘interpreter’), itself usually written in C (as is the case for both Java and Python).

First of all, why Java? Why not Python? With Python + SciPy, you have your root-finding function already coded, and probably better than anything you can code yourself.

If you insist on writing C code, maybe as a didactical task, I encourage you to get a copy of the book Numerical Recipes in C, which is a sort of Bible for a lot of mathematical algorithms.

There are, however, not many good algorithms to find polynomial roots, especially when those roots are multiple. The bisection method is slow and prone to roundoff accumulation. Newton's method can converge quadratically, but it needs a good estimate to start with. Your best bet is to use the QR method to get those estimates, then use Newton's method to increase their precision.

What version of the GCC compiler do you use?

[Edited on 11-12-2023 by Keras]
clearly_not_atara
International Hazard
*****




Posts: 2696
Registered: 3-11-2013
Member Is Offline

Mood: Big

[*] posted on 11-12-2023 at 14:12


Unfortunately we do not have much hope of answering your questions without seeing the code! If you are observing Java outperforming C, there are a few possible reasons:

- Your code is memory-bound, and your manual memory management is poor. The Java GC is mediocre for small tasks because it is optimized for large ones; on such loads it will beat C that is not written to a high standard. This sometimes happens when a lot of data is involved.

- Your code is parallel, and the load is distributed more efficiently across threads in the Java version than in the C version. This is relatively easy to test: use a process monitor and see whether your program utilizes all of your cores.

- Your code uses an efficient internal Java function for an important computation, but you used an inferior version in C, possibly one you wrote yourself.

- You failed to turn on compiler optimizations (unlikely).

- Your C code has mistakes.

The fix will depend on where the problem is.





[Edited on 04-20-1969 by clearly_not_atara]
woelen
[*] posted on 11-12-2023 at 14:46


@Keras: I actually am in the process of improving some of the best polynomial root finders available today. I have specialized code for polynomials of degree 2, 3, 4, and 5, and I know of no code which beats what I achieve, both in terms of speed and accuracy. I tested it with an extensive test suite, using all kinds of nasties, including polynomials with multiple roots and closely clustered roots. A web page and paper will follow soon.
For general polynomials, I used the Madsen-Reid, Jenkins-Traub, and Aberth-Ehrlich methods. Best results are obtained with MR and AE; JT is quite disappointing, and many polynomials simply cannot be solved at all by that code. I ported the MR code to Java and C (the original code is written in Fortran, modules PA16 and PA17) and made many modifications to it, in order to solve issues with certain classes of polynomials which the original MR algorithm cannot handle. Aberth-Ehrlich I also ported to Java, with some small modifications to make it somewhat more robust. It is exceptionally useful for sparse polynomials, and in particular trinomials; nothing can beat it in terms of speed and accuracy. To achieve this, I again modified the original algorithm, by selecting better starting values.

I use GCC version 11.4.0 (from Ubuntu 22). I also tried on other older versions of GCC with similar results.


@clearly_not_atara: I do not need any memory management in my code. It is basically numerical algorithms, with the needed memory either simply on the stack, or pre-allocated before running the algorithm; there is no need to dynamically allocate and release memory. So this is not an issue, and the same is true for both Java and C. Polynomial root finding is not memory-intensive at all (e.g. for solving a degree 100 polynomial I only need storage for a few hundred numerical values).
The code also is not parallel. Solving one polynomial equation takes microseconds, or milliseconds at most; there is no need to parallelize the algorithms for that. Of course, I can run the algorithm concurrently on different polynomials, but inside one polynomial-solving 'session' I use no parallel code.
As I wrote in the opening post, my code does not use any external library, nor is it dominated by some internal function. It mainly uses standard arithmetic and comparisons, which are part of the language (e.g. +, -, *, /, <, >, and an occasional sqrt()).
When I release the code on publishing the upcoming web page, then you can have a look at it.

I use Java because of its good performance in many cases. It indeed produces byte code, but with modern JVMs the byte code in turn is compiled to native machine code dynamically, while the code runs. This is done by the so-called JIT compiler, which on modern JVMs does an amazing job; it is said that it can even outperform well-crafted hand-written assembly code. Also, many low-level Java functions nowadays are implemented by means of so-called intrinsics: specialized, optimized pieces of assembly code, which the JIT compiler either calls or, if small, inlines into the code. But indeed, I expected optimized C code to be faster, especially as it does not call external functions from a fixed, pre-compiled library.


I am actually very content with the performance of my Java code. E.g. 4th-degree equations can be solved reliably in only 400 ns or so in a single thread on a standard recent Core i5 or Core i7 processor (2.5 million such polynomials per second, at around the theoretically attainable accuracy, also for ill-conditioned ones). Polynomials of degrees between 12 and 25 can be solved accurately in approximately 8 seconds per million polynomials on the same CPU in one thread.

I also wrote a simple library which allows me to do computations at 105 bits of precision (almost twice the precision of a standard double). This code is MUCH faster than stuff like GMP or MPFR in C, or BigDecimal in Java. Of course, its precision also is more limited, but it fills a gap between full high-precision arithmetic and the limited precision of doubles. Many of the algorithms I also implemented in Java using this 105-bit code. I do not have a C port of this; maybe I will make one in the future.

I expected more from my C port, but was disappointed by its performance. It might be that there is some mistake in it, but given that it is a nearly 1-to-1 copy of the Java code (just omitting all the OO stuff), I can hardly imagine there is a real bug: the numerical outcomes of the C and Java code are frequently identical down to the bit level. Sometimes they differ slightly, because of the use of a different random generator for test data, or another implementation of a math function, which may differ by one or two ulps in its output.

[Edited on 11-12-23 by woelen]




Rainwater
National Hazard
****




Posts: 800
Registered: 22-12-2021
Member Is Offline

Mood: indisposition to activity

[*] posted on 11-12-2023 at 16:53


I would suggest running a profiler to see where the bottlenecks are.
This will direct you to where improvement can be made.

If you can add some macros before and after sections of your code and
time them to see how long they take, you will know where to start.
Assuming you're not using std::cerr for anything else and your program
is console based, you can pipe the err output to a file for review later.
Writing on my phone, code might contain errors.
It's the simple things that make programs fast.
Thread blocking is also another major porting issue.
Again, Java has optimizations in place to make this easier.
I would be happy to take a look,
even add to my stack of NDAs if need be.
Code:
myProgram 2> debug_file.txt

Code:
#include <stdint.h>
#include <iostream>

#ifdef _WIN32
// Windows
#include <intrin.h>
uint64_t rdtsc() { return __rdtsc(); }
#else
// Linux/GCC
uint64_t rdtsc() {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#endif

#define DEBUG_CPU_CYCLES std::cerr << __FILE__ << ":" << __LINE__ << " " << rdtsc();

code source

The issue will be a simple one. C/C++ is as fast as you get.
The evolution of Java has removed so many design hurdles, aimed at making it easier to work with, that porting from Java down to C will likely result in these optimizations being omitted.
For example, an array in Java is comparable to a std::vector in C++, but the implementation is very different.
To the user they both function the same, but in C++ the vector's storage can end up fragmented in places, acting kind of like a doubly linked list, making lookups slower. Or the heap can become fragmented by not calling std::vector::reserve() before pushing items into it.
Java optimized this years ago, using a tree structure for large arrays to make the lookups quicker.

[Edited on 12-12-2023 by Rainwater]





"You can't do that" - challenge accepted
clearly_not_atara
[*] posted on 11-12-2023 at 18:20


Quote:
or using another implementation of a math-function, which may have one or two ulp's of difference in the output.

If you are noticing slight differences in the last few bits of floating-point results between C and Java, and the C code is slower, the first thing I would try is -ffast-math. I would be surprised if Java didn't always stick to IEEE-compliant floating point, but differences like that in the output suggest that something is not. The goal of IEEE floating point is to ensure that computations are always exactly reproducible, while the unsafe operations enabled by -ffast-math will avoid things like correcting the least significant bit, in favor of using fewer cycles and getting very close. Most of the standard math library is also IEEE-defined and should always be exact.

Also, there is no -O6:

https://gcc.gnu.org/legacy-ml/gcc-help/2011-06/msg00450.html
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 will not generally enable -ffast-math because gcc's authors believe (correctly) that general compiler optimization flags should not alter the output of a program, while -ffast-math can definitely alter the output of the program. See:

https://kristerw.github.io/2021/10/19/fast-math/

EDIT: There is also -Ofast which enables all unsafe optimizations.

Also, modern gcc will typically use all available intrinsics, although this may not always be true on Windows or the newest processors, where MSVC or ICC (free for educational use IIRC) can be faster. But the difference between GCC and ICC usually isn't that big. ICC sucks on AMD processors usually (Intel's weird thing).

[Edited on 12-12-2023 by clearly_not_atara]




Keras
[*] posted on 11-12-2023 at 22:43


Okay, I apologise. I assumed (wrongly) that you were simply dabbling with root finding; I see you’re venturing into advanced territory. Since you’re there, I would advise you to take a close look at rounding errors and their accumulation during the computations.

For your performance issue, if you want, I can benchmark your code on Clang/LLVM on ARM, which I use daily as a macOS developer. Also, try to analyse whether there is any possibility to vectorise your code; even though the gains may be minimal, they may amount to something when the code is called repeatedly. It may be that Java is using some highly optimised routines that are unavailable at C level, because GCC is unable to optimise the code beyond a certain point.

Isn’t 128-bit arithmetic already available at ASM level? Why 105 bits? That seems an odd figure to me.

[Edited on 12-12-2023 by Keras]
Rainwater
[*] posted on 12-12-2023 at 03:50


User interface output can also be a major time consumer.
Java spawns a separate thread to handle writing to the OS,
so output calls can return immediately; C does not do this.
Using platform-specific code, you can redirect std::cout, or in C the stdout handle,
to a buffer and reproduce the same optimization.
Thread safety is achieved by pausing the program thread,
swapping out the buffer, then
resuming the program thread while the buffer is printed.

[Edited on 12-12-2023 by Rainwater]




"You can't do that" - challenge accepted
View user's profile View All Posts By User
woelen
[*] posted on 12-12-2023 at 07:40


After searching a lot on the internet, I think I found a possible answer to my question. It seems that the latest incarnations of the JVM are capable of detecting situations where the use of AVX instructions is beneficial.

A simple case is something like


Code:
for (int i=0; i<n; i++) { a[i] += b[i]*c[i]; }


or

Code:
double zb = ....;
double sum = 0.0;
for (int i = 0; i < n; i++) { sum += Math.abs(zb - z[i]); }



The JIT compiler can translate this into a loop which uses AVX or AVX2 instructions, which take 2 or 4 operands at once and thus do the same piece of work 2 or 4 times in parallel. This reduces computational time: not by a factor of 4 (due to e.g. limitations on memory bandwidth), but it certainly can be significant (a factor 1.5 speedup is plausible). The newer JIT compilers also recognize more complicated loop structures, e.g. situations where multiple lines of code are executed after each other in a loop. For this reason, there is less and less need for specialized vectorizing code in Java (Java has a special SIMD API, but its use is quite low-level and makes code much harder to understand).

Most likely this is not the complete answer, but I can imagine that it at least partly explains the effect I observe. On older CPUs without these SIMD instructions, this may not be the case. Unfortunately, I have no really old CPUs lying around on which I can test this (the oldest I have is a 2012 3rd-generation Core i5; dusty, but most likely still working).

@Keras: Why do I use 105 bits and not 128 bits? Because of performance. Beautiful algorithms exist for combining the precision of two doubles with 53-bit mantissas into a single numerical entity with a 106-bit mantissa. In practice, due to the loss of one or two ulps, this leads to appr. 105 bits of accuracy: https://www.davidhbailey.com/dhbpapers/qd.pdf
This works for combining two doubles to get 105 bits of precision, or four doubles to get 211 bits of precision, on any IEEE 754 or IEEE 854 architecture which adheres strictly to the standard. Java does (not yet Java 11, unless you use strictfp on methods and classes; Java 17 does this out of the box). Using FMA (fused multiply-add) instructions, this can be made even faster, and that is what I do in my Java implementation. This is MUCH faster than any multiprecision library I have seen, and 105 bits of precision is good enough for polynomials even up to degree 100. With the standard 53 bits of precision, polynomial root finding works well for polynomials with just a few terms, but for general dense polynomials the limits are reached somewhere around degree 50, especially if the polynomial is somewhat ill-conditioned.

I also have a general full-precision implementation, using MPSOLVE, with GMP for multiprecision and JNI for use from Java code. This can be used to solve polynomial equations with tens of thousands of terms, but of course, this is much slower than using basic floating point with limited precision.

I also experimented with real 128-bit floating-point arithmetic on GCC (it is available as a special floating-point type), but this is prohibitively slow. The arithmetic is done in integer/long operations, using inline assembly. It is MUCH slower than the 105-bit algorithm described in the paper mentioned above, mainly because of all the bit manipulation needed for exponent management, checking for overflow/underflow/NaN generation, and other bit-stuffing to represent floating-point numbers.




[Edited on 12-12-23 by woelen]




clearly_not_atara
[*] posted on 12-12-2023 at 07:44


^right, GCC will not enable this optimization without -fassociative-math (which is part of -ffast-math), even with -O3, because floating-point arithmetic is not associative, and vectorizing it can change the final result (however slightly).

Compilers are very strict about consistency!




woelen
[*] posted on 12-12-2023 at 07:55


I'll try compiling with the -fassociative-math option or the -ffast-math option. Sounds interesting. I'll come back with the results.



Keras
[*] posted on 12-12-2023 at 08:06


Did you experiment with fixed point arithmetic?
woelen
[*] posted on 12-12-2023 at 11:23


I tried the -ffast-math option. It makes the C-code somewhat faster.

Solving 100M quartics, plus analysing the error, relative to theory, takes 38 seconds with -ffast-math -O3 and it takes 42 seconds with only -O3. I used gcc version 11.4.0.
In Java, solving the 100 million quartics, plus doing the same analysis, on JDK 17.0.7, takes 34 seconds. This is with strict IEEE mathematics (since Java 17, strictfp is the standard behavior; loose IEEE conformance is not allowed anymore).

So, using the -ffast-math optimization has some effect, but it is not dramatic. Besides that, it introduces loose IEEE conformance.

@Keras: Fixed-point arithmetic is not interesting at all for polynomial solving. I need big ranges for storing intermediate results. Even for polynomials with small roots in the range 0.1 to 10, the coefficients can become HUGE at degree 50 or so. Think of (x+1)^50 and see what coefficients you get; for anything with larger or smaller roots, the effect only gets worse.
I even consider extending my 105-bit floating-point library with large exponents (e.g. a Java int for storing exponents in the range -2 billion to +2 billion). This would make certain algorithms more robust, although at additional computational cost. Polynomial root finding really is a science of its own ;)

[Edited on 12-12-23 by woelen]




clearly_not_atara
[*] posted on 12-12-2023 at 13:22


A hundred million quartics? Are you sure that your application isn't spending the difference generating the quartics?

Maybe in C you call a function 100 million times, and each call really creates and destroys a stack frame, while Java inlines it and just runs the test program as one big loop? It does seem like there is room for some "non-arithmetic" optimization here, because the actual math (a Newton iteration?) happens very quickly.

Also, doesn't the C rand() only generate integers? Are you doing something like
Code:
double x = rand() / ((double) RAND_MAX);


I just ran some code that generates 500 million random floating-point numbers and returns their sum, and it took 3.4 seconds to run in C on my laptop. Since a quartic has five coefficients, this could well be affecting your runtime for 100 million quartics. I was not able to beat this with pointer twiddling, but I didn't try very hard.




Keras
[*] posted on 13-12-2023 at 00:31


Quote: Originally posted by woelen  
Polynomial root finding really is a science on its own ;)


I don’t dispute that a single second! ;)

Your pointer to the extended precision library is very interesting. I wonder if I shouldn't go and try to implement it in ARM assembly. I wish you would release your code; it would make benchmarking much easier.

Also, do all your methods work with multiple roots? I know you quickly answered that before, but I'm really curious about it, especially for the A-E method you seem to especially cherish :)
woelen
[*] posted on 13-12-2023 at 01:02


I only timed the actual solving of the generated quartics and the analysis of the error. The time needed for generating the quartics is not included in the numbers reported above.

I now timed the generation of the quartics as well. Generating 100M quartics takes 4 seconds in C and appr. 2 seconds in Java, but this comparison is not fair. In C I use drand48(), and in Java a highly optimized random generator which I wrote myself, based on the new Xoshiro256 class of pseudo-random generators: https://prng.di.unimi.it/ In both cases, however, the generator is a function call into a library (I put Xoshiro256 in a separate library, where it is implemented as a subclass of Java's standard Random class).
In both environments, I scale the random generator output such that the coefficients are in some range [coefMin .. coefMax], each term with its own range:
Code:
double x = coefMin + drand48()*(coefMax - coefMin);


The quartic solver does not use Newton iteration. It is based on a hybrid analytical/numerical method, built on the LDL^T decomposition of a 3x3 matrix. This method is very fast, and can also be made very accurate, even in all kinds of corner cases. The latter I describe in my paper, which will soon appear on my website.






woelen
[*] posted on 13-12-2023 at 01:27


Quote: Originally posted by Keras  

Your pointer to the extended precision library is very interesting. I wonder if I’m not going and try implement it in ARM assembly. I wish you could release your code, it would make benchmarking much easier.

I can share the code for the extended precision; I'll see if I can put it somewhere here this evening. It is Java code, but it should be easy to port to C.

Quote: Originally posted by Keras  

Also, do all your methods work with multiple roots? I know you quickly answered that before, but I'm really curious about it, especially for the A-E method you seem to especially cherish :)

The methods I use also work well for multiple roots. Accuracy for multiple roots goes down, though, but that is not a flaw of the software; it is a basic mathematical property. For example, if you have a limited precision, such that 1 and 1+eps are just not distinguishable anymore, then you cannot distinguish a multiple root at 1 from a cluster of separate roots which differ from 1 by a margin of absolute value eps^(1/k), where k is the multiplicity. For IEEE floating point, eps is around 10^-16. This means that for double roots, the accuracy is limited to around 8 digits; for a triple root it is a little over 5 digits, etc.

For a quadratic (x+1-delta)*(x+1+delta) you get the polynomial x^2+2x+1-delta^2. For delta equal to 10^-8, the deviation delta^2 equals 10^-16, which just cannot be detected anymore with standard IEEE accuracy. For a cubic polynomial with three equal roots, the coefficients deviate by delta^3; for these, the detectable error bound is appr. 5*10^-6.




Keras
[*] posted on 13-12-2023 at 03:39


Quote: Originally posted by woelen  
The methods I use also work well for multiple roots. Accuracy for multiple roots goes down though, but that is not a flaw of the software, but it is a basic mathematical property. […]


Yeah, I get that. Does your software return the root and its multiplicity?

And thanks for the multiprecision code! :)

[Edited on 13-12-2023 by Keras]
woelen
[*] posted on 13-12-2023 at 06:17


The software returns the roots. It is possible to detect multiple roots: one can compute guaranteed error bounds for each of the roots, which are small disks around the found roots. If disks overlap, the corresponding roots can be considered multiple; the multiple root then is the average of the cluster of roots with overlapping disks of uncertainty.

An example is a polynomial with a double root 1.2345 besides some other roots. The solver finds roots 1.23450001987... and 1.23449998012...
The average of these is 1.23449999999....
The error bound for the individual roots is 3e-8 or something close to that, while the error bound for the average is 1e-15 or so.
In this way, one can detect multiple roots. I did not implement this in the root finders I have, because it reduces the overall speed by 30% or so and also makes usage of the software more complicated. Just computing error bounds on single roots and leaving out the cluster analysis is simple, though; I did this in my test programs to validate the accuracy of the software.
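For real roots, the overlapping-disk idea can be sketched in a few lines; `RootClusters`, the method name, and the radii used below are illustrative assumptions, not the actual library code:

```java
import java.util.ArrayList;
import java.util.List;

public class RootClusters {
    /** Groups (real) roots whose error disks overlap and replaces each
     *  group by its average. Roots must be given in ascending order. */
    public static List<Double> mergeClusters(double[] roots, double[] radius) {
        List<Double> result = new ArrayList<>();
        int i = 0;
        while (i < roots.length) {
            int j = i;
            // Extend the cluster while consecutive disks overlap.
            while (j + 1 < roots.length
                    && roots[j] + radius[j] >= roots[j + 1] - radius[j + 1]) {
                j++;
            }
            double sum = 0.0;
            for (int k = i; k <= j; k++) sum += roots[k];
            result.add(sum / (j - i + 1));   // average of the cluster
            i = j + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // The double-root example from above: two computed roots near 1.2345,
        // each with an error disk of radius ~3e-8. The disks overlap, so the
        // roots collapse to a single averaged root close to 1.2345.
        double[] roots  = {1.23449998012, 1.23450001987};
        double[] radius = {3e-8, 3e-8};
        System.out.println(mergeClusters(roots, radius));
    }
}
```

For complex roots one would compare |z_j - z_k| against the sum of the two disk radii instead of using interval endpoints, but the merge-and-average idea is the same.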
Things can become much more complicated if there are no true multiple roots, but clusters of very close roots. A nasty situation can be a real polynomial with the following roots:

z1 = 1.234
z2 = 4.567
z3 = 6.123
z4 = 6.123 + 0.001*i
z5 = 6.123 - 0.001*i
z6 = 6.124
z7 = -1.256 - 5.612*i
z8 = -1.256 + 5.612*i
This polynomial has 4 well-separated roots, which have relative error bounds in the order of magnitude of 1e-15 with standard IEEE floating point arithmetic. The remaining 4 roots form a tight cluster, but they are not multiple roots. The software may have difficulty separating these roots, and then a seemingly random mix of roots is produced, with values close to 6.123 and random deviations of magnitude 0.001 or so. The average of these 4 computed roots, however, is close to the average of the given 4 clustered roots.

Even polynomials with well-separated roots may pose serious problems. A notorious example is Wilkinson's polynomial (x-1)*(x-2)*(x-3)*...*(x-20). It has 20 well-separated roots, but with standard double precision (53-bit mantissa) one cannot compute all of these roots at much better than 10% accuracy. Just search the internet for "Wilkinson's polynomial" for more info.
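The information loss is easy to demonstrate: expand Wilkinson's polynomial into double coefficients and evaluate it at the exact root x = 20 (a quick sketch, not the library code; the class name is made up for the example):

```java
public class Wilkinson {
    /** Expands (x-1)(x-2)...(x-n) into monomial coefficients, constant term first. */
    public static double[] expand(int n) {
        double[] c = new double[n + 1];
        c[0] = 1.0;
        for (int r = 1; r <= n; r++) {                    // multiply by (x - r)
            for (int k = r; k >= 1; k--) c[k] = c[k - 1] - r * c[k];
            c[0] = -r * c[0];
        }
        return c;
    }

    /** Horner evaluation of the expanded polynomial. */
    public static double horner(double[] c, double x) {
        double s = 0.0;
        for (int k = c.length - 1; k >= 0; k--) s = s * x + c[k];
        return s;
    }

    public static void main(String[] args) {
        double[] c = expand(20);
        // x = 20 is an exact root of the product form, but the expanded
        // coefficients (magnitudes up to ~1e19, beyond the 53-bit integer
        // range of a double) have already lost information, so Horner
        // evaluation lands far from 0:
        System.out.println("P(20) via expanded coefficients = " + horner(c, 20.0));
    }
}
```

The residual at x = 20 is enormous compared to 0 even though every step was done in correct double arithmetic; the damage was done the moment the polynomial was expanded.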




Keras
[*] posted on 13-12-2023 at 09:27


BTW, don’t you have access to 128-bit IEEE 754 floats? The long double type?

About the Wilkinson polynomial. Yeah, I knew that example. But you can still use a QR method to find the roots as eigenvalues of a specially constructed matrix, right?

[Edited on 13-12-2023 by Keras]
woelen
[*] posted on 13-12-2023 at 12:10


In Java there is no 128-bit IEEE 754 arithmetic, at least not in standard code. There may be some software library out there, but it will be slow. In C, things are no different: GCC has a 128-bit floating point type, but it is prohibitively slow. That is OK for simple use cases, where small amounts of work need to be done, but not useful for serious numerical work.

Using matrix eigenvalue methods for polynomial root finding does not solve the issue of inherent ill-conditioning of the polynomial. The Wilkinson polynomials simply are ill-conditioned: once expanded, their coefficients have already lost information, which leads to bad approximations of the zeros, even with a perfect solver working at standard 64-bit floating point precision. This loss of accuracy is a mathematical property of the polynomial itself, and no piece of software can magically restore the lost information.
Matrix eigenvalue methods can still be interesting for polynomial root finding, as a relatively stable means of solving, but they have some serious drawbacks. A polynomial has only N+1 coefficients, while the matrix methods work with N^2 elements. For somewhat larger polynomials (e.g. degree 50), this leads to much more computational effort for solving the equation. So, in modern software, the eigenvalue methods are no longer the preferred methods; much faster methods, which are also stable, exist nowadays.
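For reference, this is the kind of matrix those eigenvalue methods work with; a minimal sketch (the class and method names are made up for the example):

```java
public class Companion {
    /** Builds the n x n companion matrix of a monic polynomial
     *  x^n + c[n-1]*x^(n-1) + ... + c[0]; its eigenvalues are the roots. */
    public static double[][] companion(double[] c) {
        int n = c.length;
        double[][] m = new double[n][n];            // n^2 entries from n+1 coefficients
        for (int i = 1; i < n; i++) m[i][i - 1] = 1.0;    // subdiagonal of ones
        for (int i = 0; i < n; i++) m[i][n - 1] = -c[i];  // last column holds -c[i]
        return m;
    }

    public static void main(String[] args) {
        // x^2 - 3x + 2 = (x-1)(x-2): c = {2, -3}, so the roots are 1 and 2.
        double[][] m = companion(new double[]{2.0, -3.0});
        // A QR eigenvalue solver applied to this matrix returns the roots,
        // at O(n^2) storage and, for a dense solver, O(n^3) work -- which is
        // exactly the overhead mentioned above.
        System.out.println(java.util.Arrays.deepToString(m));
    }
}
```

The companion matrix is mostly zeros (upper Hessenberg), which is also what the sparse-matrix remark further down alludes to.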

Two files are available for download. One is the code used for my polynomial solvers with 105-bit precision; the other is a very small Rnd package, which contains my own replacement for Java's Random class (much faster, and much better quality random numbers).


Attachment: Rnd.tgz (31kB)
This file has been downloaded 57 times

Attachment: polarith.tgz (4.2MB)
This file has been downloaded 63 times





Keras
[*] posted on 14-12-2023 at 01:46


I’m going to have a look, thanks for that! And yeah, 128-bit IEEE is handled in software, so I guess that doesn’t make performance skyrocket. Concerning the eigenvalue method, I’m surprised you cannot use sparse matrix methods, since the matrices used to solve polynomials are inherently almost all zeros.