\documentclass[b5paper]{book}
\usepackage{hyperref}
\usepackage{makeidx}
\usepackage{amssymb}
\usepackage{color}
\usepackage{alltt}
\usepackage{graphicx}
\usepackage{layout}
\def\union{\cup}
\def\intersect{\cap}
\def\getsrandom{\stackrel{\rm R}{\gets}}
\def\cross{\times}
\def\cat{\hspace{0.5em} \| \hspace{0.5em}}
\def\catn{$\|$}
\def\divides{\hspace{0.3em} | \hspace{0.3em}}
\def\nequiv{\not\equiv}
\def\approx{\raisebox{0.2ex}{\mbox{\small $\sim$}}}
\def\lcm{{\rm lcm}}
\def\gcd{{\rm gcd}}
\def\log{{\rm log}}
\def\ord{{\rm ord}}
\def\abs{{\mathit abs}}
\def\rep{{\mathit rep}}
\def\mod{{\mathit\ mod\ }}
\renewcommand{\pmod}[1]{\ ({\rm mod\ }{#1})}
\newcommand{\floor}[1]{\left\lfloor{#1}\right\rfloor}
\newcommand{\ceil}[1]{\left\lceil{#1}\right\rceil}
\def\Or{{\rm\ or\ }}
\def\And{{\rm\ and\ }}
\def\iff{\hspace{1em}\Longleftrightarrow\hspace{1em}}
\def\implies{\Rightarrow}
\def\undefined{{\rm ``undefined"}}
\def\Proof{\vspace{1ex}\noindent {\bf Proof:}\hspace{1em}}
\let\oldphi\phi
\def\phi{\varphi}
\def\Pr{{\rm Pr}}
\newcommand{\str}[1]{{\mathbf{#1}}}
\def\F{{\mathbb F}}
\def\N{{\mathbb N}}
\def\Z{{\mathbb Z}}
\def\R{{\mathbb R}}
\def\C{{\mathbb C}}
\def\Q{{\mathbb Q}}
\definecolor{DGray}{gray}{0.5}
\newcommand{\emailaddr}[1]{\mbox{$<${#1}$>$}}
\def\twiddle{\raisebox{0.3ex}{\mbox{\tiny $\sim$}}}
\def\gap{\vspace{0.5ex}}
\makeindex
\begin{document}
\frontmatter
\pagestyle{empty}
\title{Multi--Precision Math}
\author{\mbox{
%\begin{small}
\begin{tabular}{c}
Tom St Denis \\
Algonquin College \\
\\
Mads Rasmussen \\
Open Communications Security \\
\\
Greg Rose \\
QUALCOMM Australia \\
\end{tabular}
%\end{small}
}
}
\maketitle
This text has been placed in the public domain.  This text corresponds to the v0.39 release of the
LibTomMath project.

\begin{alltt}
Tom St Denis
111 Banning Rd
Ottawa, Ontario
K2L 1C3
Canada

Phone: 1-613-836-3160
Email: tomstdenis@gmail.com
\end{alltt}

This text is formatted to the international B5 paper size of 176mm wide by 250mm tall using the \LaTeX{}
{\em book} macro package and the Perl {\em booker} package.

\tableofcontents
\listoffigures
\chapter*{Prefaces}
When I tell people about my LibTom projects and that I release them as public domain they are often puzzled.
They ask why I did it and especially why I continue to work on them for free.  The best I can explain it is ``Because I can.''
Which seems odd and perhaps too terse for adult conversation.  I often qualify it with ``I am able, I am willing,'' which
perhaps explains it better.  I am the first to admit there is not anything that special about what I have done.  Perhaps
others can see that too and then we would have a society to be proud of.  My LibTom projects are what I am doing to give
back to society in the form of tools and knowledge that can help others in their endeavours.

I started writing this book because it was the most logical task to further my goal of open academia.  The LibTomMath source
code itself was written to be easy to follow and learn from.  There are times, however, when pure C source code does not
explain the algorithms properly.  Hence this book.  The book literally starts with the foundation of the library and works
itself outwards to the more complicated algorithms.  The use of both pseudo--code and verbatim source code provides a duality
of ``theory'' and ``practice'' that the computer science students of the world shall appreciate.  I never deviate too far
from relatively straightforward algebra and I hope that this book can be a valuable learning asset.

This book and indeed much of the LibTom projects would not exist in their current form if it were not for a plethora
of kind people donating their time, resources and kind words to help support my work.  Writing a text of significant
length (along with the source code) is a tiresome and lengthy process.  Currently the LibTom project is four years old
and comprises literally thousands of users and over 100,000 lines of source code, \TeX{} and other material.  People like Mads and Greg
were there at the beginning to encourage me to work well.  It is amazing how timely validation from others can boost morale to
continue the project.  Definitely my parents were there for me by providing room and board during the many months of work in 2003.

To my many friends whom I have met through the years I thank you for the good times and the words of encouragement.  I hope I
honour your kind gestures with this project.

Open Source.  Open Academia.  Open Minds.

\begin{flushright} Tom St Denis \end{flushright}

\newpage
I found the opportunity to work with Tom appealing for several reasons: not only could I broaden my own horizons, but I could also
contribute to educating others facing the problem of having to handle big number mathematical calculations.

This book is Tom's child and he has been caring for and fostering the project ever since the beginning with a clear idea of
how he wanted the project to turn out.  I have helped by proofreading the text and we have had several discussions about
the layout and language used.

I hold a master's degree in cryptography from the University of Southern Denmark and have always been interested in the
practical aspects of cryptography.

Having worked in the security consultancy business for several years in S\~{a}o Paulo, Brazil, I have been in touch with a
great deal of work in which multiple precision mathematics was needed. Understanding the possibilities for speeding up
multiple precision calculations is often very important since we deal with outdated machine architecture where modular
reductions, for example, become painfully slow.

This text is for people who stop and wonder when examining algorithms such as RSA for the first time and ask
themselves, ``You tell me this is only secure for large numbers, fine; but how do you implement these numbers?''

\begin{flushright}
Mads Rasmussen

S\~{a}o Paulo - SP

Brazil
\end{flushright}

\newpage
It's all because I broke my leg. That just happened to be at about the same time that Tom asked for someone to review the section of the book about
Karatsuba multiplication. I was laid up, alone and immobile, and thought ``Why not?'' I vaguely knew what Karatsuba multiplication was, but not
really, so I thought I could help, learn, and stop myself from watching daytime cable TV, all at once.

At the time of writing this, I've still not met Tom or Mads in meatspace. I've been following Tom's progress since his first splash on the
sci.crypt Usenet news group. I watched him go from a clueless newbie, to the cryptographic equivalent of a reformed smoker, to a real
contributor to the field, over a period of about two years. I've been impressed with his obvious intelligence, and astounded by his productivity.
Of course, he's young enough to be my own child, so he doesn't have my problems with staying awake.

When I reviewed that single section of the book, in its very earliest form, I was very pleasantly surprised. So I decided to collaborate more fully,
and at least review all of it, and perhaps write some bits too. There's still a long way to go with it, and I have watched a number of close
friends go through the mill of publication, so I think that the way to go is longer than Tom thinks it is. Nevertheless, it's a good effort,
and I'm pleased to be involved with it.

\begin{flushright}
Greg Rose, Sydney, Australia, June 2003.
\end{flushright}

\mainmatter
\pagestyle{headings}
\chapter{Introduction}
\section{Multiple Precision Arithmetic}

\subsection{What is Multiple Precision Arithmetic?}
When we think of long-hand arithmetic such as addition or multiplication we rarely consider the fact that we instinctively
raise or lower the precision of the numbers we are dealing with.  For example, in decimal we can almost immediately
reason that $7$ times $6$ is $42$.  However, $42$ has two digits of precision as opposed to the one digit we started with.
A further multiplication by, say, $3$ results in the larger precision result $126$.  In these few examples we have multiple
precisions for the numbers we are working with.  Despite the various levels of precision a single subset\footnote{With the occasional optimization.}
 of algorithms can be designed to accommodate them.

By way of comparison a fixed or single precision operation would lose precision on various operations.  For example, in
the decimal system with a fixed precision of one digit $6 \cdot 7 = 2$, since only the least significant digit of $42$ is kept.

Essentially at the heart of computer based multiple precision arithmetic are the same long-hand algorithms taught in
schools to manually add, subtract, multiply and divide.

\subsection{The Need for Multiple Precision Arithmetic}
The most prevalent need for multiple precision arithmetic, often referred to as ``bignum'' math, is within the implementation
of public-key cryptography algorithms.   Algorithms such as RSA \cite{RSAREF} and Diffie-Hellman \cite{DHREF} require
integers of significant magnitude to resist known cryptanalytic attacks.  For example, at the time of this writing a
typical RSA modulus would be greater than $10^{309}$.  However, modern programming languages such as ISO C \cite{ISOC} and
Java \cite{JAVA} only provide intrinsic support for integers which are relatively small and single precision.

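To put that magnitude in terms of machine words, the following back-of-the-envelope estimate converts the decimal bound into a
bit length.  It assumes only that $\log_2 10$ is about $3.32$; the comparison with a 64-bit type is an illustration and not a
statement about any particular platform.

\begin{eqnarray*}
\log_2 10^{309} & = & 309 \cdot \log_2 10 \\
                & \approx & 309 \cdot 3.32 \\
                & \approx & 1026 \mbox{ bits}
\end{eqnarray*}

An integer of this size is roughly sixteen times wider than a 64-bit integer type.
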
\begin{figure}[!h]
\begin{center}
\begin{tabular}{|r|c|}
\hline \textbf{Data Type} & \textbf{Range} \\
\hline char  & $-128 \ldots 127$ \\
\hline short & $-32768 \ldots 32767$ \\
\hline long  & $-2147483648 \ldots 2147483647$ \\
\hline long long & $-9223372036854775808 \ldots 9223372036854775807$ \\
\hline
\end{tabular}
\end{center}
\caption{Typical Data Types for the C Programming Language}
\label{fig:ISOC}
\end{figure}

The largest data type guaranteed to be provided by the ISO C programming
language\footnote{As per the ISO C standard.  However, each compiler vendor is allowed to augment the precision as they
see fit.}  can only represent values up to $10^{19}$ as shown in figure \ref{fig:ISOC}.  On its own the C language is
insufficient to accommodate the magnitude required for the problem at hand.  An RSA modulus of magnitude $10^{19}$ could be
trivially factored\footnote{A Pollard-Rho factoring would take only $2^{16}$ time.} on the average desktop computer,
rendering any protocol based on the algorithm insecure.  Multiple precision algorithms solve this very problem by
extending the range of representable integers while using single precision data types.

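The $2^{16}$ figure in the footnote can be reproduced with a rough estimate.  The sketch below assumes the usual heuristic
that Pollard-Rho finds a prime factor $p$ of $n$ after about $\sqrt{p}$ steps, and uses the fact that the smaller prime factor
of a modulus $n = pq$ is at most $\sqrt{n}$.

\begin{eqnarray*}
n        & <   & 10^{19} \approx 2^{63.1} \\
p        & \le & \sqrt{n} < 2^{31.6} \\
\sqrt{p} & <   & 2^{15.8} \approx 2^{16}
\end{eqnarray*}
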
Most advancements in fast multiple precision arithmetic stem from the need for faster and more efficient cryptographic
primitives.  Faster modular reduction and exponentiation algorithms such as Barrett's algorithm, which have appeared in
various cryptographic journals, can render algorithms such as RSA and Diffie-Hellman more efficient.  In fact, several
major companies such as RSA Security, Certicom and Entrust have built entire product lines on the implementation and
deployment of efficient algorithms.

However, cryptography is not the only field of study that can benefit from fast multiple precision integer routines.
Another auxiliary use of multiple precision integers is high precision floating point data types.
The basic IEEE \cite{IEEE} standard floating point type is made up of an integer mantissa $q$, an exponent $e$ and a sign bit $s$.
Numbers are given in the form $n = q \cdot b^e \cdot (-1)^s$ where $b = 2$ is the most common base for IEEE.  Since IEEE
floating point is meant to be implemented in hardware the precision of the mantissa is often fairly small
(\textit{23, 52 and 64 bits}).  The mantissa is merely an integer and a multiple precision integer could be used to create
a mantissa of much larger precision than hardware alone can efficiently support.  This approach could be useful where
scientific applications must minimize the total output error over long calculations.

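As a small worked instance of this form, take the values $q = 5$, $e = -1$ and $s = 1$ with $b = 2$.  These values are
chosen here only for the example.

\[ n = 5 \cdot 2^{-1} \cdot (-1)^{1} = -2.5 \]

A wider mantissa $q$ increases the number of significant digits that survive each operation, which is where a multiple
precision integer would come in.
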
Yet another use for large integers is within arithmetic on polynomials of large characteristic (i.e. $GF(p)[x]$ for large $p$).
In fact the library discussed within this text has already been used to form a polynomial basis library\footnote{See \url{http://poly.libtomcrypt.org} for more details.}.

\subsection{Benefits of Multiple Precision Arithmetic}
\index{precision}
The benefit of multiple precision representations over single or fixed precision representations is that
no precision is lost while representing the result of an operation which requires excess precision.  For example,
the product of two $n$-bit integers requires at least $2n$ bits of precision to be represented faithfully.  A multiple
precision algorithm would augment the precision of the destination to accommodate the result while a single precision system
would truncate excess bits to maintain a fixed level of precision.

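The $2n$-bit bound follows from squaring the largest $n$-bit value.  The concrete case $n = 8$ below is an illustrative
choice and not something specific to the library.

\begin{eqnarray*}
(2^n - 1)^2   & = & 2^{2n} - 2^{n+1} + 1 < 2^{2n} \\
255 \cdot 255 & = & 65025 < 2^{16} = 65536
\end{eqnarray*}

Since $65025 \ge 2^{15}$ all sixteen bits are needed, whereas truncating the product to eight bits would leave only
$65025 \mbox{ mod } 2^8 = 1$.
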
It is possible to implement algorithms which require large integers with fixed precision algorithms.  For example, elliptic
curve cryptography (\textit{ECC}) is often implemented on smartcards by fixing the precision of the integers to the maximum
size the system will ever need.  Such an approach can lead to vastly simpler algorithms which can accommodate the
integers required even if the host platform cannot natively accommodate them\footnote{For example, the average smartcard
processor has an 8 bit accumulator.}.  However, as efficient as such an approach may be, the resulting source code is not
normally very flexible.  It cannot, at runtime, accommodate inputs of higher magnitude than the designer anticipated.

Multiple precision algorithms have the most overhead of any style of arithmetic.  For the most part the
overhead can be kept to a minimum with careful planning, but overall, they are not well suited for most memory starved
platforms.  However, multiple precision algorithms do offer the most flexibility in terms of the magnitude of the
inputs.  That is, the same algorithms based on multiple precision integers can accommodate any reasonable size input
without the designer's explicit forethought.  This leads to lower cost of ownership for the code as it only has to
be written and tested once.

\section{Purpose of This Text}
The purpose of this text is to instruct the reader regarding how to implement efficient multiple precision algorithms.
That is, to explain not only a limited subset of the core theory behind the algorithms but also the various ``housekeeping''
elements that are neglected by authors of other texts on the subject.  Several well renowned texts \cite{TAOCPV2,HAC}
give considerably detailed explanations of the theoretical aspects of algorithms and often very little information
regarding the practical implementation aspects.

In most cases how an algorithm is explained and how it is actually implemented are two very different concepts.  For
example, the Handbook of Applied Cryptography (\textit{HAC}), algorithm 14.7 on page 594, gives a relatively simple
algorithm for performing multiple precision integer addition.  However, the description lacks any discussion concerning
the fact that the two integer inputs may be of differing magnitudes.  As a result the implementation is not as simple
as the text would lead people to believe.  Similarly the division routine (\textit{algorithm 14.20, p. 598}) does not
discuss how to handle sign or handle the dividend's decreasing magnitude in the main loop (\textit{step \#3}).

Neither text discusses several key optimal algorithms required, such as ``Comba'' and Karatsuba multipliers
and fast modular inversion, which we consider practical oversights.  These optimal algorithms are vital to achieve
any form of useful performance in non-trivial applications.

To solve this problem the focus of this text is on the practical aspects of implementing a multiple precision integer
package.  As a case study the ``LibTomMath''\footnote{Available at \url{http://math.libtomcrypt.com}} package is used
to demonstrate algorithms with real implementations\footnote{In the ISO C programming language.} that have been field
tested and work very well.  The LibTomMath library is freely available on the Internet for all uses and this text
discusses a very large portion of the inner workings of the library.

The algorithms that are presented will always include at least one ``pseudo-code'' description followed
by the actual C source code that implements the algorithm.  The pseudo-code can be used to implement the same
algorithm in other programming languages as the reader sees fit.

This text shall also serve as a walkthrough of the creation of multiple precision algorithms from scratch, showing
the reader how the algorithms fit together as well as where to start on various tasks.

\section{Discussion and Notation}
\subsection{Notation}
A multiple precision integer of $n$ digits shall be denoted as $x = (x_{n-1}, \ldots, x_1, x_0)_{ \beta }$ and represents
the integer $x \equiv \sum_{i=0}^{n-1} x_i\beta^i$.  The elements of the array $x$ are said to be the radix $\beta$ digits
of the integer.  For example, $x = (1,2,3)_{10}$ would represent the integer
$1\cdot 10^2 + 2\cdot10^1 + 3\cdot10^0 = 123$.

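The same digit array denotes a different integer under a different radix.  As an illustrative case, with the radix $16$
chosen here only for the example, reading the digits $(1,2,3)$ in hexadecimal gives

\[ (1,2,3)_{16} = 1\cdot 16^2 + 2\cdot 16^1 + 3\cdot 16^0 = 291. \]
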
\index{mp\_int}
The term ``mp\_int'' shall refer to a composite structure which contains the digits of the integer it represents, as well
as auxiliary data required to manipulate the data.  These additional members are discussed further in section
\ref{sec:MPINT}.  For the purposes of this text a ``multiple precision integer'' and an ``mp\_int'' are assumed to be
synonymous.  When an algorithm is specified to accept an mp\_int variable it is assumed the various auxiliary data members
are present as well.  An expression of the type \textit{variablename.item} implies that it should evaluate to the
member named ``item'' of the variable.  For example, a string of characters may have a member ``length'' which would
evaluate to the number of characters in the string.  If the string $a$ equals ``hello'' then it follows that
$a.length = 5$.

For certain discussions more generic algorithms are presented to help the reader understand the final algorithm used
to solve a given problem.  When an algorithm is described as accepting an integer input it is assumed the input is
a plain integer with no additional multiple-precision members.  That is, algorithms that use integers as opposed to
mp\_ints as inputs do not concern themselves with the housekeeping operations required such as memory management.  These
algorithms will be used to establish the relevant theory which will subsequently be used to describe a multiple
precision algorithm to solve the same problem.

\subsection{Precision Notation}
The variable $\beta$ represents the radix of a single digit of a multiple precision integer and
must be of the form $q^p$ for $q, p \in \Z^+$.  A single precision variable must be able to represent integers in
the range $0 \le x < q \beta$ while a double precision variable must be able to represent integers in the range
$0 \le x < q \beta^2$.  The extra radix-$q$ factor allows additions and subtractions to proceed without truncation of the
carry.  Since all modern computers are binary, it is assumed that $q$ is two.

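To make these ranges concrete, suppose $q = 2$ and $p = 28$, so that $\beta = 2^{28}$.  This digit size is an assumption
made here for illustration only; the bounds follow directly from the definitions above.

\begin{eqnarray*}
0 \le x & < & q\beta = 2 \cdot 2^{28} = 2^{29} \quad \mbox{(single precision)} \\
0 \le x & < & q\beta^2 = 2 \cdot 2^{56} = 2^{57} \quad \mbox{(double precision)} \\
(\beta - 1) + (\beta - 1) + 1 & = & 2\beta - 1 < q\beta
\end{eqnarray*}

The last line shows why the extra factor of $q$ matters: the sum of two digits and an incoming carry still fits in a
single precision variable.
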
\index{mp\_digit} \index{mp\_word}
Within the source code that will be presented for each algorithm, the data type \textbf{mp\_digit} will represent
a single precision integer type, while the data type \textbf{mp\_word} will represent a double precision integer type.  In
several algorithms (notably the Comba routines) temporary results will be stored in arrays of double precision mp\_words.
For the purposes of this text $x_j$ will refer to the $j$'th digit of a single precision array and $\hat x_j$ will refer to
the $j$'th digit of a double precision array.  Whenever an expression is to be assigned to a double precision
variable it is assumed that all single precision variables are promoted to double precision during the evaluation.
Expressions that are assigned to a single precision variable are truncated to fit within the precision of a single
precision data type.

For example, if $\beta = 10^2$ a single precision data type may represent a value in the
range $0 \le x < 10^3$, while a double precision data type may represent a value in the range $0 \le x < 10^5$.  Let
$a = 23$ and $b = 49$ represent two single precision variables.  The single precision product shall be written
as $c \leftarrow a \cdot b$ while the double precision product shall be written as $\hat c \leftarrow a \cdot b$.
In this particular case, $\hat c = 1127$ and $c = 127$.  The most significant digit of the product would not fit
in a single precision data type and as a result $c \ne \hat c$.

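Written out step by step, the two assignments evaluate as follows.  This sketch assumes that truncation to single
precision amounts to reduction modulo the single precision range $q\beta = 10^3$, an interpretation adopted here for
illustration.

\begin{eqnarray*}
\hat c & \leftarrow & 23 \cdot 49 = 1127 < 10^5 \\
c      & \leftarrow & 1127 \mbox{ mod } 10^3 = 127
\end{eqnarray*}
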
\subsection{Algorithm Inputs and Outputs}
Within the algorithm descriptions all variables are assumed to be scalars of either single or double precision
as indicated.  The only exception to this rule is when variables have been indicated to be of type mp\_int.  This
distinction is important as scalars are often used as array indices and various other counters.

\subsection{Mathematical Expressions}
The $\lfloor \mbox{ } \rfloor$ brackets imply an expression truncated to an integer not greater than the expression
itself.  For example, $\lfloor 5.7 \rfloor = 5$.  Similarly the $\lceil \mbox{ } \rceil$ brackets imply an expression
rounded to an integer not less than the expression itself.  For example, $\lceil 5.1 \rceil = 6$.  Typically when
the $/$ division symbol is used the intention is to perform an integer division with truncation.  For example,
$5/2 = 2$ which will often be written as $\lfloor 5/2 \rfloor = 2$ for clarity.  When an expression is written as a
fraction a real value division is implied, for example ${5 \over 2} = 2.5$.

The norm of a multiple precision integer, for example $\vert \vert x \vert \vert$, will be used to represent the number of digits in the representation
of the integer.  For example, $\vert \vert 123 \vert \vert = 3$ and $\vert \vert 79452 \vert \vert = 5$.

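As a sketch, and assuming the decimal representation used in the two examples together with $x > 0$, the norm can be
written in closed form as

\[ \vert \vert x \vert \vert = \lfloor \log_{10} x \rfloor + 1. \]

For instance, $\lfloor \log_{10} 79452 \rfloor + 1 = 4 + 1 = 5$.  For a general radix $\beta$ the logarithm is taken to
base $\beta$ instead.
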
\subsection{Work Effort}
\index{big-Oh}
To measure the efficiency of the specified algorithms, a modified big-Oh notation is used.  In this system all
single precision operations are considered to have the same cost\footnote{Except where explicitly noted.}.
That is, a single precision addition, multiplication and division are assumed to take the same time to
complete.  While this is generally not true in practice, it will simplify the discussions considerably.

Some algorithms have slight advantages over others which is why some constants will not be removed in
the notation.  For example, a normal baseline multiplication (section \ref{sec:basemult}) requires $O(n^2)$ work while a
baseline squaring (section \ref{sec:basesquare}) requires $O({{n^2 + n}\over 2})$ work.  In standard big-Oh notation these
would both be said to be equivalent to $O(n^2)$.  However,
in the context of this text this is not the case as the magnitude of the inputs will typically be rather small.  As a
result small constant factors in the work effort will make an observable difference in algorithm efficiency.

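To see the size of the constant factor, the two work estimates can be evaluated for a couple of small digit counts.  The
values of $n$ below are arbitrary illustrative choices.

\begin{eqnarray*}
n = 16 & : & n^2 = 256, \quad {{n^2 + n}\over 2} = 136 \\
n = 32 & : & n^2 = 1024, \quad {{n^2 + n}\over 2} = 528
\end{eqnarray*}

In both cases the squaring estimate is a little over half of the multiplication estimate.
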
All of the algorithms presented in this text have a polynomial time work level.  That is, the work is of the form
$O(n^k)$ for $n, k \in \Z^{+}$.  This will help make useful comparisons in terms of the speed of the algorithms and how
various optimizations will help pay off in the long run.

\section{Exercises}
Within the more advanced chapters a section will be set aside to give the reader some challenging exercises related to
the discussion at hand.  These exercises are not designed to be prize winning problems, but instead to be thought
provoking.  Wherever possible the problems are forward minded, stating problems that will be answered in subsequent
chapters.  The reader is encouraged to finish the exercises as they appear to get a better understanding of the
subject material.

That being said, the problems are designed to affirm knowledge of a particular subject matter.  Students in particular
are encouraged to verify they can answer the problems correctly before moving on.

*ebfedea0SLionel SambucSimilar to the exercises of \cite[pp. ix]{TAOCPV2} these exercises are given a scoring system based on the difficulty of
*ebfedea0SLionel Sambucthe problem.  However, unlike \cite{TAOCPV2} the problems do not get nearly as hard.  The scoring of these
exercises ranges from one (the easiest) to five (the hardest).  The following table summarizes the
*ebfedea0SLionel Sambucscoring system used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|l|}
\hline $\left [ 1 \right ]$ & An easy problem that should only take the reader a matter of \\
*ebfedea0SLionel Sambuc                            & minutes to solve.  Usually does not involve much computer time \\
*ebfedea0SLionel Sambuc                            & to solve. \\
*ebfedea0SLionel Sambuc\hline $\left [ 2 \right ]$ & An easy problem that involves a marginal amount of computer \\
*ebfedea0SLionel Sambuc                     & time usage.  Usually requires a program to be written to \\
*ebfedea0SLionel Sambuc                     & solve the problem. \\
*ebfedea0SLionel Sambuc\hline $\left [ 3 \right ]$ & A moderately hard problem that requires a non-trivial amount \\
*ebfedea0SLionel Sambuc                     & of work.  Usually involves trivial research and development of \\
*ebfedea0SLionel Sambuc                     & new theory from the perspective of a student. \\
*ebfedea0SLionel Sambuc\hline $\left [ 4 \right ]$ & A moderately hard problem that involves a non-trivial amount \\
*ebfedea0SLionel Sambuc                     & of work and research, the solution to which will demonstrate \\
*ebfedea0SLionel Sambuc                     & a higher mastery of the subject matter. \\
*ebfedea0SLionel Sambuc\hline $\left [ 5 \right ]$ & A hard problem that involves concepts that are difficult for a \\
*ebfedea0SLionel Sambuc                     & novice to solve.  Solutions to these problems will demonstrate a \\
*ebfedea0SLionel Sambuc                     & complete mastery of the given subject. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Exercise Scoring System}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucProblems at the first level are meant to be simple questions that the reader can answer quickly without programming a solution or
*ebfedea0SLionel Sambucdevising new theory.  These problems are quick tests to see if the material is understood.  Problems at the second level
*ebfedea0SLionel Sambucare also designed to be easy but will require a program or algorithm to be implemented to arrive at the answer.  These
*ebfedea0SLionel Sambuctwo levels are essentially entry level questions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucProblems at the third level are meant to be a bit more difficult than the first two levels.  The answer is often
fairly obvious but arriving at an exact solution requires some thought and skill.  These problems will almost always
*ebfedea0SLionel Sambucinvolve devising a new algorithm or implementing a variation of another algorithm previously presented.  Readers who can
*ebfedea0SLionel Sambucanswer these questions will feel comfortable with the concepts behind the topic at hand.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucProblems at the fourth level are meant to be similar to those of the level three questions except they will require
*ebfedea0SLionel Sambucadditional research to be completed.  The reader will most likely not know the answer right away, nor will the text provide
*ebfedea0SLionel Sambucthe exact details of the answer until a subsequent chapter.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucProblems at the fifth level are meant to be the hardest
*ebfedea0SLionel Sambucproblems relative to all the other problems in the chapter.  People who can correctly answer fifth level problems have a
*ebfedea0SLionel Sambucmastery of the subject matter at hand.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOften problems will be tied together.  The purpose of this is to start a chain of thought that will be discussed in future chapters.  The reader
is encouraged to answer the follow-up problems and to work out how the problems relate to one another.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Introduction to LibTomMath}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{What is LibTomMath?}
*ebfedea0SLionel SambucLibTomMath is a free and open source multiple precision integer library written entirely in portable ISO C.  By portable it
*ebfedea0SLionel Sambucis meant that the library does not contain any code that is computer platform dependent or otherwise problematic to use on
*ebfedea0SLionel Sambucany given platform.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe library has been successfully tested under numerous operating systems including Unix\footnote{All of these
*ebfedea0SLionel Sambuctrademarks belong to their respective rightful owners.}, MacOS, Windows, Linux, PalmOS and on standalone hardware such
*ebfedea0SLionel Sambucas the Gameboy Advance.  The library is designed to contain enough functionality to be able to develop applications such
*ebfedea0SLionel Sambucas public key cryptosystems and still maintain a relatively small footprint.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Goals of LibTomMath}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLibraries which obtain the most efficiency are rarely written in a high level programming language such as C.  However,
*ebfedea0SLionel Sambuceven though this library is written entirely in ISO C, considerable care has been taken to optimize the algorithm implementations within the
*ebfedea0SLionel Sambuclibrary.  Specifically the code has been written to work well with the GNU C Compiler (\textit{GCC}) on both x86 and ARM
*ebfedea0SLionel Sambucprocessors.  Wherever possible, highly efficient algorithms, such as Karatsuba multiplication, sliding window
*ebfedea0SLionel Sambucexponentiation and Montgomery reduction have been provided to make the library more efficient.
*ebfedea0SLionel Sambuc
Even with the nearly optimal and specialized algorithms that have been included, the Application Programming Interface
*ebfedea0SLionel Sambuc(\textit{API}) has been kept as simple as possible.  Often generic place holder routines will make use of specialized
*ebfedea0SLionel Sambucalgorithms automatically without the developer's specific attention.  One such example is the generic multiplication
*ebfedea0SLionel Sambucalgorithm \textbf{mp\_mul()} which will automatically use Toom--Cook, Karatsuba, Comba or baseline multiplication
*ebfedea0SLionel Sambucbased on the magnitude of the inputs and the configuration of the library.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucMaking LibTomMath as efficient as possible is not the only goal of the LibTomMath project.  Ideally the library should
*ebfedea0SLionel Sambucbe source compatible with another popular library which makes it more attractive for developers to use.  In this case the
MPI library was used as an API template for all the basic functions.  MPI was chosen because it is another library that fits
*ebfedea0SLionel Sambucin the same niche as LibTomMath.  Even though LibTomMath uses MPI as the template for the function names and argument
*ebfedea0SLionel Sambucpassing conventions, it has been written from scratch by Tom St Denis.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe project is also meant to act as a learning tool for students, the logic being that no easy-to-follow ``bignum''
*ebfedea0SLionel Sambuclibrary exists which can be used to teach computer science students how to perform fast and reliable multiple precision
*ebfedea0SLionel Sambucinteger arithmetic.  To this end the source code has been given quite a few comments and algorithm discussion points.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Choice of LibTomMath}
*ebfedea0SLionel SambucLibTomMath was chosen as the case study of this text not only because the author of both projects is one and the same but
*ebfedea0SLionel Sambucfor more worthy reasons.  Other libraries such as GMP \cite{GMP}, MPI \cite{MPI}, LIP \cite{LIP} and OpenSSL
*ebfedea0SLionel Sambuc\cite{OPENSSL} have multiple precision integer arithmetic routines but would not be ideal for this text for
*ebfedea0SLionel Sambucreasons that will be explained in the following sub-sections.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Code Base}
*ebfedea0SLionel SambucThe LibTomMath code base is all portable ISO C source code.  This means that there are no platform dependent conditional
*ebfedea0SLionel Sambucsegments of code littered throughout the source.  This clean and uncluttered approach to the library means that a
*ebfedea0SLionel Sambucdeveloper can more readily discern the true intent of a given section of source code without trying to keep track of
*ebfedea0SLionel Sambucwhat conditional code will be used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe code base of LibTomMath is well organized.  Each function is in its own separate source code file
*ebfedea0SLionel Sambucwhich allows the reader to find a given function very quickly.  On average there are $76$ lines of code per source
file which makes the source very easy to follow.  By comparison MPI and LIP are single file projects making code tracing
*ebfedea0SLionel Sambucvery hard.  GMP has many conditional code segments which also hinder tracing.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhen compiled with GCC for the x86 processor and optimized for speed the entire library is approximately $100$KiB\footnote{The notation ``KiB'' means $2^{10}$ octets, similarly ``MiB'' means $2^{20}$ octets.}
*ebfedea0SLionel Sambuc which is fairly small compared to GMP (over $250$KiB).  LibTomMath is slightly larger than MPI (which compiles to about
*ebfedea0SLionel Sambuc$50$KiB) but LibTomMath is also much faster and more complete than MPI.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{API Simplicity}
*ebfedea0SLionel SambucLibTomMath is designed after the MPI library and shares the API design.  Quite often programs that use MPI will build
*ebfedea0SLionel Sambucwith LibTomMath without change. The function names correlate directly to the action they perform.  Almost all of the
*ebfedea0SLionel Sambucfunctions share the same parameter passing convention.  The learning curve is fairly shallow with the API provided
*ebfedea0SLionel Sambucwhich is an extremely valuable benefit for the student and developer alike.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe LIP library is an example of a library with an API that is awkward to work with.  LIP uses function names that are often ``compressed'' to
*ebfedea0SLionel Sambucillegible short hand.  LibTomMath does not share this characteristic.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe GMP library also does not return error codes.  Instead it uses a POSIX.1 \cite{POSIX1} signal system where errors
*ebfedea0SLionel Sambucare signaled to the host application.  This happens to be the fastest approach but definitely not the most versatile.  In
effect a math error (e.g. invalid input, heap error, etc.) can cause a program to stop functioning which is definitely
undesirable in many situations.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Optimizations}
*ebfedea0SLionel SambucWhile LibTomMath is certainly not the fastest library (GMP often beats LibTomMath by a factor of two) it does
*ebfedea0SLionel Sambucfeature a set of optimal algorithms for tasks such as modular reduction, exponentiation, multiplication and squaring.  GMP
*ebfedea0SLionel Sambucand LIP also feature such optimizations while MPI only uses baseline algorithms with no optimizations.  GMP lacks a few
*ebfedea0SLionel Sambucof the additional modular reduction optimizations that LibTomMath features\footnote{At the time of this writing GMP
*ebfedea0SLionel Sambuconly had Barrett and Montgomery modular reduction algorithms.}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLibTomMath is almost always an order of magnitude faster than the MPI library at computationally expensive tasks such as modular
*ebfedea0SLionel Sambucexponentiation.  In the grand scheme of ``bignum'' libraries LibTomMath is faster than the average library and usually
*ebfedea0SLionel Sambucslower than the best libraries such as GMP and OpenSSL by only a small factor.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Portability and Stability}
*ebfedea0SLionel SambucLibTomMath will build ``out of the box'' on any platform equipped with a modern version of the GNU C Compiler
*ebfedea0SLionel Sambuc(\textit{GCC}).  This means that without changes the library will build without configuration or setting up any
*ebfedea0SLionel Sambucvariables.  LIP and MPI will build ``out of the box'' as well but have numerous known bugs.  Most notably the author of
*ebfedea0SLionel SambucMPI has recently stopped working on his library and LIP has long since been discontinued.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucGMP requires a configuration script to run and will not build out of the box.   GMP and LibTomMath are still in active
*ebfedea0SLionel Sambucdevelopment and are very stable across a variety of platforms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Choice}
*ebfedea0SLionel SambucLibTomMath is a relatively compact, well documented, highly optimized and portable library which seems only natural for
*ebfedea0SLionel Sambucthe case study of this text.  Various source files from the LibTomMath project will be included within the text.  However,
*ebfedea0SLionel Sambucthe reader is encouraged to download their own copy of the library to actually be able to work with the library.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Getting Started}
*ebfedea0SLionel Sambuc\section{Library Basics}
*ebfedea0SLionel SambucThe trick to writing any useful library of source code is to build a solid foundation and work outwards from it.  First,
*ebfedea0SLionel Sambuca problem along with allowable solution parameters should be identified and analyzed.  In this particular case the
inability to accommodate multiple precision integers is the problem.  Furthermore, the solution must be written
*ebfedea0SLionel Sambucas portable source code that is reasonably efficient across several different computer platforms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter a foundation is formed the remainder of the library can be designed and implemented in a hierarchical fashion.
*ebfedea0SLionel SambucThat is, to implement the lowest level dependencies first and work towards the most abstract functions last.  For example,
*ebfedea0SLionel Sambucbefore implementing a modular exponentiation algorithm one would implement a modular reduction algorithm.
*ebfedea0SLionel SambucBy building outwards from a base foundation instead of using a parallel design methodology the resulting project is
*ebfedea0SLionel Sambuchighly modular.  Being highly modular is a desirable property of any project as it often means the resulting product
*ebfedea0SLionel Sambuchas a small footprint and updates are easy to perform.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucUsually when I start a project I will begin with the header files.  I define the data types I think I will need and
*ebfedea0SLionel Sambucprototype the initial functions that are not dependent on other functions (within the library).  After I
*ebfedea0SLionel Sambucimplement these base functions I prototype more dependent functions and implement them.   The process repeats until
*ebfedea0SLionel SambucI implement all of the functions I require.  For example, in the case of LibTomMath I implemented functions such as
*ebfedea0SLionel Sambucmp\_init() well before I implemented mp\_mul() and even further before I implemented mp\_exptmod().  As an example as to
why this design works note that the Karatsuba and Toom--Cook multipliers were written \textit{after} the
*ebfedea0SLionel Sambucdependent function mp\_exptmod() was written.  Adding the new multiplication algorithms did not require changes to the
*ebfedea0SLionel Sambucmp\_exptmod() function itself and lowered the total cost of ownership (\textit{so to speak}) and of development
*ebfedea0SLionel Sambucfor new algorithms.  This methodology allows new algorithms to be tested in a complete framework with relative ease.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFIGU,design_process,Design Flow of the First Few Original LibTomMath Functions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOnly after the majority of the functions were in place did I pursue a less hierarchical approach to auditing and optimizing
*ebfedea0SLionel Sambucthe source code.  For example, one day I may audit the multipliers and the next day the polynomial basis functions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIt only makes sense to begin the text with the preliminary data types and support algorithms required as well.
This chapter discusses the core algorithms of the library on which every other algorithm depends.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{What is a Multiple Precision Integer?}
*ebfedea0SLionel SambucRecall that most programming languages, in particular ISO C \cite{ISOC}, only have fixed precision data types that on their own cannot
*ebfedea0SLionel Sambucbe used to represent values larger than their precision will allow. The purpose of multiple precision algorithms is
*ebfedea0SLionel Sambucto use fixed precision data types to create and manipulate multiple precision integers which may represent values
*ebfedea0SLionel Sambucthat are very large.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAs a well known analogy, school children are taught how to form numbers larger than nine by prepending more radix ten digits.  In the decimal system
*ebfedea0SLionel Sambucthe largest single digit value is $9$.  However, by concatenating digits together larger numbers may be represented.  Newly prepended digits
*ebfedea0SLionel Sambuc(\textit{to the left}) are said to be in a different power of ten column.  That is, the number $123$ can be described as having a $1$ in the hundreds
*ebfedea0SLionel Sambuccolumn, $2$ in the tens column and $3$ in the ones column.  Or more formally $123 = 1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0$.  Computer based
*ebfedea0SLionel Sambucmultiple precision arithmetic is essentially the same concept.  Larger integers are represented by adjoining fixed
*ebfedea0SLionel Sambucprecision computer words with the exception that a different radix is used.
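
The positional idea can be sketched in a few lines of C.  The function below is a hypothetical helper, not a LibTomMath
routine; it assumes the digits are stored least significant first and that the result fits in an \textbf{unsigned long},
which is only true for small illustrative inputs.

\begin{verbatim}
/* Hypothetical helper: evaluate `used' digits, stored least
 * significant first, in the given radix.  Assumes the value fits
 * in an unsigned long. */
unsigned long digits_to_value(const unsigned *digit, int used,
                              unsigned radix)
{
   unsigned long value = 0;
   int i;
   /* Horner's rule, starting from the most significant digit */
   for (i = used - 1; i >= 0; i--) {
      value = value * radix + digit[i];
   }
   return value;
}
\end{verbatim}

With the digits $\lbrace 3, 2, 1 \rbrace$ and a radix of ten the function returns $123$.  The same routine with a radix
of sixteen and the digits $\lbrace 10, 9 \rbrace$ returns $154$, which shows that only the radix changes and not the method.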
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhat most people probably do not think about explicitly are the various other attributes that describe a multiple precision
*ebfedea0SLionel Sambucinteger.  For example, the integer $154_{10}$ has two immediately obvious properties.  First, the integer is positive,
*ebfedea0SLionel Sambucthat is the sign of this particular integer is positive as opposed to negative.  Second, the integer has three digits in
its representation.  There is an additional property that the integer possesses that does not concern pencil-and-paper
arithmetic.  The third property is how many digit placeholders are available to hold the integer.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe human analogy of this third property is ensuring there is enough space on the paper to write the integer.  For example,
*ebfedea0SLionel Sambucif one starts writing a large number too far to the right on a piece of paper they will have to erase it and move left.
*ebfedea0SLionel SambucSimilarly, computer algorithms must maintain strict control over memory usage to ensure that the digits of an integer
*ebfedea0SLionel Sambucwill not exceed the allowed boundaries.  These three properties make up what is known as a multiple precision
*ebfedea0SLionel Sambucinteger or mp\_int for short.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The mp\_int Structure}
*ebfedea0SLionel Sambuc\label{sec:MPINT}
*ebfedea0SLionel SambucThe mp\_int structure is the ISO C based manifestation of what represents a multiple precision integer.  The ISO C standard does not provide for
*ebfedea0SLionel Sambucany such data type but it does provide for making composite data types known as structures.  The following is the structure definition
*ebfedea0SLionel Sambucused within LibTomMath.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\index{mp\_int}
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc%\begin{verbatim}
*ebfedea0SLionel Sambuc\begin{tabular}{|l|}
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuctypedef struct \{ \\
*ebfedea0SLionel Sambuc\hspace{3mm}int used, alloc, sign;\\
*ebfedea0SLionel Sambuc\hspace{3mm}mp\_digit *dp;\\
*ebfedea0SLionel Sambuc\} \textbf{mp\_int}; \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc%\end{verbatim}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{The mp\_int Structure}
*ebfedea0SLionel Sambuc\label{fig:mpint}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe mp\_int structure (fig. \ref{fig:mpint}) can be broken down as follows.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item The \textbf{used} parameter denotes how many digits of the array \textbf{dp} contain the digits used to represent
*ebfedea0SLionel Sambuca given integer.  The \textbf{used} count must be positive (or zero) and may not exceed the \textbf{alloc} count.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\item The \textbf{alloc} parameter denotes how
*ebfedea0SLionel Sambucmany digits are available in the array to use by functions before it has to increase in size.  When the \textbf{used} count
*ebfedea0SLionel Sambucof a result would exceed the \textbf{alloc} count all of the algorithms will automatically increase the size of the
*ebfedea0SLionel Sambucarray to accommodate the precision of the result.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\item The pointer \textbf{dp} points to a dynamically allocated array of digits that represent the given multiple
*ebfedea0SLionel Sambucprecision integer.  It is padded with $(\textbf{alloc} - \textbf{used})$ zero digits.  The array is maintained in a least
*ebfedea0SLionel Sambucsignificant digit order.  As a pencil and paper analogy the array is organized such that the right most digits are stored
*ebfedea0SLionel Sambucfirst starting at the location indexed by zero\footnote{In C all arrays begin at zero.} in the array.  For example,
*ebfedea0SLionel Sambucif \textbf{dp} contains $\lbrace a, b, c, \ldots \rbrace$ where \textbf{dp}$_0 = a$, \textbf{dp}$_1 = b$, \textbf{dp}$_2 = c$, $\ldots$ then
*ebfedea0SLionel Sambucit would represent the integer $a + b\beta + c\beta^2 + \ldots$
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\index{MP\_ZPOS} \index{MP\_NEG}
*ebfedea0SLionel Sambuc\item The \textbf{sign} parameter denotes the sign as either zero/positive (\textbf{MP\_ZPOS}) or negative (\textbf{MP\_NEG}).
*ebfedea0SLionel Sambuc\end{enumerate}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Valid mp\_int Structures}
*ebfedea0SLionel SambucSeveral rules are placed on the state of an mp\_int structure and are assumed to be followed for reasons of efficiency.
*ebfedea0SLionel SambucThe only exceptions are when the structure is passed to initialization functions such as mp\_init() and mp\_init\_copy().
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item The value of \textbf{alloc} may not be less than one.  That is \textbf{dp} always points to a previously allocated
*ebfedea0SLionel Sambucarray of digits.
*ebfedea0SLionel Sambuc\item The value of \textbf{used} may not exceed \textbf{alloc} and must be greater than or equal to zero.
*ebfedea0SLionel Sambuc\item The value of \textbf{used} implies the digit at index $(used - 1)$ of the \textbf{dp} array is non-zero.  That is,
*ebfedea0SLionel Sambucleading zero digits in the most significant positions must be trimmed.
*ebfedea0SLionel Sambuc   \begin{enumerate}
*ebfedea0SLionel Sambuc   \item Digits in the \textbf{dp} array at and above the \textbf{used} location must be zero.
*ebfedea0SLionel Sambuc   \end{enumerate}
*ebfedea0SLionel Sambuc\item The value of \textbf{sign} must be \textbf{MP\_ZPOS} if \textbf{used} is zero;
*ebfedea0SLionel Sambucthis represents the mp\_int value of zero.
*ebfedea0SLionel Sambuc\end{enumerate}
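
The rules above can be expressed as a short validity check.  The following is a minimal sketch only: the type sk\_int, the
digit type sk\_digit, the sign constants and the function sk\_int\_is\_valid() are hypothetical stand-ins that mirror the
layout of the mp\_int structure, and LibTomMath itself does not perform such a check at run time.

\begin{verbatim}
#include <stddef.h>

typedef unsigned long sk_digit;       /* hypothetical digit type  */
enum { SK_ZPOS = 0, SK_NEG = 1 };     /* hypothetical sign values */

typedef struct {
   int used, alloc, sign;
   sk_digit *dp;
} sk_int;                             /* mirrors the mp_int layout */

/* Returns 1 if a obeys the four rules, 0 otherwise. */
int sk_int_is_valid(const sk_int *a)
{
   int i;
   /* rule 1: an array of at least one digit is allocated */
   if (a->dp == NULL || a->alloc < 1)          return 0;
   /* rule 2: 0 <= used <= alloc */
   if (a->used < 0 || a->used > a->alloc)      return 0;
   /* rule 3: the most significant used digit is non-zero */
   if (a->used > 0 && a->dp[a->used - 1] == 0) return 0;
   /* rule 3(a): digits at and above used are zero */
   for (i = a->used; i < a->alloc; i++) {
      if (a->dp[i] != 0)                       return 0;
   }
   /* rule 4: zero is always stored with a positive sign */
   if (a->used == 0 && a->sign != SK_ZPOS)     return 0;
   return 1;
}
\end{verbatim}

A check of this kind is useful in a test harness or while debugging a new algorithm, where the cost of scanning the
unused digits does not matter.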
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Argument Passing}
*ebfedea0SLionel SambucA convention of argument passing must be adopted early on in the development of any library.  Making the function
*ebfedea0SLionel Sambucprototypes consistent will help eliminate many headaches in the future as the library grows to significant complexity.
*ebfedea0SLionel SambucIn LibTomMath the multiple precision integer functions accept parameters from left to right as pointers to mp\_int
*ebfedea0SLionel Sambucstructures.  That means that the source (input) operands are placed on the left and the destination (output) on the right.
*ebfedea0SLionel SambucConsider the following examples.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{verbatim}
*ebfedea0SLionel Sambuc   mp_mul(&a, &b, &c);   /* c = a * b */
*ebfedea0SLionel Sambuc   mp_add(&a, &b, &a);   /* a = a + b */
*ebfedea0SLionel Sambuc   mp_sqr(&a, &b);       /* b = a * a */
*ebfedea0SLionel Sambuc\end{verbatim}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe left to right order is a fairly natural way to implement the functions since it lets the developer read aloud the
*ebfedea0SLionel Sambucfunctions and make sense of them.  For example, the first function would read ``multiply a and b and store in c''.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucCertain libraries (\textit{LIP by Lenstra for instance}) accept parameters the other way around, to mimic the order
*ebfedea0SLionel Sambucof assignment expressions.  That is, the destination (output) is on the left and arguments (inputs) are on the right.  In
*ebfedea0SLionel Sambuctruth, it is entirely a matter of preference.  In the case of LibTomMath the convention from the MPI library has been
*ebfedea0SLionel Sambucadopted.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAnother very useful design consideration, provided for in LibTomMath, is whether to allow argument sources to also be a
*ebfedea0SLionel Sambucdestination.  For example, the second example (\textit{mp\_add}) adds $a$ to $b$ and stores in $a$.  This is an important
feature to implement since it allows the calling function to cut down on the number of variables it must maintain.
*ebfedea0SLionel SambucHowever, to implement this feature specific care has to be given to ensure the destination is not modified before the
*ebfedea0SLionel Sambucsource is fully read.
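
The care required can be illustrated with a simplified digit-wise addition.  The sketch below uses a hypothetical function,
digits\_add(), operating on plain arrays of equal length rather than on mp\_int structures.  It is safe to pass the same
array as both a source and the destination because each pair of source digits is read before the corresponding
destination digit is written.

\begin{verbatim}
/* Hypothetical helper: c = a + b over n digits in the given radix,
 * least significant digit first.  Returns the final carry.  The
 * array c may be the same array as a or b. */
unsigned digits_add(const unsigned *a, const unsigned *b,
                    unsigned *c, int n, unsigned radix)
{
   unsigned carry = 0;
   int i;
   for (i = 0; i < n; i++) {
      /* read both source digits before touching c[i] */
      unsigned sum = a[i] + b[i] + carry;
      carry = sum / radix;
      /* only now is the destination digit overwritten */
      c[i] = sum % radix;
   }
   return carry;
}
\end{verbatim}

Calling digits\_add(a, b, a, n, radix) therefore behaves like the second example above and stores $a + b$ back into $a$.
An algorithm that needs a source digit again after the matching destination digit has been written, as a multiplier does,
cannot rely on this ordering and would typically compute into a temporary first.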
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Return Values}
*ebfedea0SLionel SambucA well implemented application, no matter what its purpose, should trap as many runtime errors as possible and return them
*ebfedea0SLionel Sambucto the caller.  By catching runtime errors a library can be guaranteed to prevent undefined behaviour.  However, the end
*ebfedea0SLionel Sambucdeveloper can still manage to cause a library to crash.  For example, by passing an invalid pointer an application may
*ebfedea0SLionel Sambucfault by dereferencing memory not owned by the application.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn the case of LibTomMath the only errors that are checked for are related to inappropriate inputs (division by zero for
*ebfedea0SLionel Sambucinstance) and memory allocation errors.  It will not check that the mp\_int passed to any function is valid nor
*ebfedea0SLionel Sambucwill it check pointers for validity.  Any function that can cause a runtime error will return an error code as an
\textbf{int} data type with one of the following values (fig. \ref{fig:errcodes}).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\index{MP\_OKAY} \index{MP\_VAL} \index{MP\_MEM}
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|l|l|}
*ebfedea0SLionel Sambuc\hline \textbf{Value} & \textbf{Meaning} \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_OKAY} & The function was successful \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_VAL}  & One of the input value(s) was invalid \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_MEM}  & The function ran out of heap memory \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{LibTomMath Error Codes}
*ebfedea0SLionel Sambuc\label{fig:errcodes}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhen an error is detected within a function it should free any memory it allocated, often during the initialization of
*ebfedea0SLionel Sambuctemporary mp\_ints, and return as soon as possible.  The goal is to leave the system in the same state it was when the
*ebfedea0SLionel Sambucfunction was called.  Error checking with this style of API is fairly simple.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{verbatim}
*ebfedea0SLionel Sambuc   int err;
*ebfedea0SLionel Sambuc   if ((err = mp_add(&a, &b, &c)) != MP_OKAY) {
*ebfedea0SLionel Sambuc      printf("Error: %s\n", mp_error_to_string(err));
*ebfedea0SLionel Sambuc      exit(EXIT_FAILURE);
*ebfedea0SLionel Sambuc   }
*ebfedea0SLionel Sambuc\end{verbatim}
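
From the point of view of the function being called, the same convention might look as follows.  This is a minimal sketch
on ordinary \textbf{int} values; the function sk\_div() and the codes SK\_OKAY and SK\_VAL are hypothetical names chosen for
illustration, and their numeric values are not those used by LibTomMath.

\begin{verbatim}
enum { SK_OKAY = 0, SK_VAL = 1 };   /* hypothetical error codes */

/* Hypothetical helper: q = a / b, reporting a zero divisor to the
 * caller instead of faulting. */
int sk_div(int a, int b, int *q)
{
   if (b == 0) {
      /* invalid input: return at once and leave *q untouched */
      return SK_VAL;
   }
   *q = a / b;
   return SK_OKAY;
}
\end{verbatim}

Because the destination is only written on success, a failed call leaves the caller's variables in the state they were
in before the call.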
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe GMP \cite{GMP} library uses C style \textit{signals} to flag errors which is of questionable use.  Not all errors are fatal
*ebfedea0SLionel Sambucand it was not deemed ideal by the author of LibTomMath to force developers to have signal handlers for such cases.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Initialization and Clearing}
*ebfedea0SLionel SambucThe logical starting point when actually writing multiple precision integer functions is the initialization and
*ebfedea0SLionel Sambucclearing of the mp\_int structures.  These two algorithms will be used by the majority of the higher level algorithms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucGiven the basic mp\_int structure an initialization routine must first allocate memory to hold the digits of
*ebfedea0SLionel Sambucthe integer.  Often it is optimal to allocate a sufficiently large pre-set number of digits even though
*ebfedea0SLionel Sambucthe initial integer will represent zero.  If only a single digit were allocated quite a few subsequent re-allocations
*ebfedea0SLionel Sambucwould occur when operations are performed on the integers.  There is a tradeoff between how many default digits to allocate
*ebfedea0SLionel Sambucand how many re-allocations are tolerable.  Obviously allocating an excessive amount of digits initially will waste
*ebfedea0SLionel Sambucmemory and become unmanageable.
*ebfedea0SLionel Sambuc
If the memory for the digits has been successfully allocated then the rest of the members of the structure must
be initialized.  Since the initial state of an mp\_int is to represent the zero integer, the allocated digits must be set
to zero.  The \textbf{used} count is set to zero and \textbf{sign} is set to \textbf{MP\_ZPOS}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Initializing an mp\_int}
*ebfedea0SLionel SambucAn mp\_int is said to be initialized if it is set to a valid, preferably default, state such that all of the members of the
*ebfedea0SLionel Sambucstructure are set to valid values.  The mp\_init algorithm will perform such an action.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\index{mp\_init}
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_init}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Allocate memory and initialize $a$ to a known valid mp\_int state.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Allocate memory for \textbf{MP\_PREC} digits. \\
*ebfedea0SLionel Sambuc2.  If the allocation failed return(\textit{MP\_MEM}) \\
*ebfedea0SLionel Sambuc3.  for $n$ from $0$ to $MP\_PREC - 1$ do  \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $a_n \leftarrow 0$\\
*ebfedea0SLionel Sambuc4.  $a.sign \leftarrow MP\_ZPOS$\\
*ebfedea0SLionel Sambuc5.  $a.used \leftarrow 0$\\
*ebfedea0SLionel Sambuc6.  $a.alloc \leftarrow MP\_PREC$\\
*ebfedea0SLionel Sambuc7.  Return(\textit{MP\_OKAY})\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_init}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_init.}
The purpose of this function is to initialize an mp\_int structure so that the rest of the library can properly
manipulate it.  It is assumed that the input may not have had any of its members previously initialized, which is certainly
a valid assumption if the input resides on the stack.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBefore any of the members such as \textbf{sign}, \textbf{used} or \textbf{alloc} are initialized the memory for
*ebfedea0SLionel Sambucthe digits is allocated.  If this fails the function returns before setting any of the other members.  The \textbf{MP\_PREC}
*ebfedea0SLionel Sambucname represents a constant\footnote{Defined in the ``tommath.h'' header file within LibTomMath.}
*ebfedea0SLionel Sambucused to dictate the minimum precision of newly initialized mp\_int integers.  Ideally, it is at least equal to the smallest
*ebfedea0SLionel Sambucprecision number you'll be working with.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAllocating a block of digits at first instead of a single digit has the benefit of lowering the number of usually slow
*ebfedea0SLionel Sambucheap operations later functions will have to perform in the future.  If \textbf{MP\_PREC} is set correctly the slack
*ebfedea0SLionel Sambucmemory and the number of heap operations will be trivial.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOnce the allocation has been made the digits have to be set to zero as well as the \textbf{used}, \textbf{sign} and
*ebfedea0SLionel Sambuc\textbf{alloc} members initialized.  This ensures that the mp\_int will always represent the default state of zero regardless
*ebfedea0SLionel Sambucof the original condition of the input.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Remark.}
This function introduces the idiosyncrasy that all iterative loops, commonly initiated with the ``for'' keyword, iterate incrementally
when the ``to'' keyword is placed between two expressions.  For example, ``for $a$ from $b$ to $c$ do'' means that
a subsequent expression (or body of expressions) is to be evaluated up to $c - b + 1$ times so long as $b \le c$.  In each
iteration the variable $a$ takes the next integer value that lies inclusively between $b$ and $c$.  If $b > c$ occurred
the loop would not iterate.  By contrast if the ``downto'' keyword were used in place of ``to'' the loop would iterate
decrementally.
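As a rough guide, the loop notation of this remark maps onto C as sketched below.  The function names are hypothetical and
exist only to count iterations; the inclusive lower bound used for ``downto'' is an assumption made by analogy with ``to''.

\begin{verbatim}
/* "for a from b to c do": visits b, b+1, ..., c; no iterations when b > c. */
int sketch_count_to(int b, int c)
{
   int a, iterations = 0;
   for (a = b; a <= c; a++) {
      ++iterations;              /* the loop body would go here */
   }
   return iterations;
}

/* "for a from b downto c do": visits b, b-1, ..., c; no iterations when b < c. */
int sketch_count_downto(int b, int c)
{
   int a, iterations = 0;
   for (a = b; a >= c; a--) {
      ++iterations;
   }
   return iterations;
}
\end{verbatim}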
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_init.c
*ebfedea0SLionel Sambuc
One immediate observation of this initialization function is that it does not return a pointer to an mp\_int structure.  It
is assumed that the caller has already allocated memory for the mp\_int structure, typically on the application stack.  The
call to mp\_init() is used only to initialize the members of the structure to a known default state.
*ebfedea0SLionel Sambuc
Here we see (line @23,XMALLOC@) that the memory allocation is performed first.  This allows us to exit cleanly and quickly
if there is an error.  If the allocation fails the routine will return \textbf{MP\_MEM} to the caller to indicate there
was a memory error.  The function XMALLOC is what actually allocates the memory.  Technically XMALLOC is not a function
but a macro defined in ``tommath.h''.  By default, XMALLOC will evaluate to malloc() which is the C library's built-in
memory allocation routine.
*ebfedea0SLionel Sambuc
In order to ensure the mp\_int is in a known state the digits must be set to zero.  On most platforms this could have been
accomplished by using calloc() instead of malloc().  However, to correctly initialize an integer type to a given value in a
portable fashion you have to actually assign the value.  The for loop (line @28,for@) performs this required
operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the memory has been successfully initialized the remainder of the members are initialized
*ebfedea0SLionel Sambuc(lines @29,used@ through @31,sign@) to their respective default states.  At this point the algorithm has succeeded and
*ebfedea0SLionel Sambuca success code is returned to the calling function.  If this function returns \textbf{MP\_OKAY} it is safe to assume the
*ebfedea0SLionel Sambucmp\_int structure has been properly initialized and is safe to use with other functions within the library.
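The order of operations just described (allocate, test, zero the digits, then set the remaining members) can be sketched
as follows.  This is not the library source: sketch\_int, sketch\_digit, sketch\_init and the value chosen for SKETCH\_PREC are
simplified stand-ins introduced here so that the fragment compiles on its own.

\begin{verbatim}
#include <stdlib.h>

/* Simplified stand-ins for the library's types and constants. */
typedef unsigned long sketch_digit;
typedef struct {
   int used, alloc, sign;
   sketch_digit *dp;
} sketch_int;

enum { SKETCH_OKAY = 0, SKETCH_MEM = -2, SKETCH_ZPOS = 0, SKETCH_PREC = 32 };

int sketch_init(sketch_int *a)
{
   int i;

   /* allocate first so that a failure leaves nothing to undo */
   a->dp = malloc(sizeof(sketch_digit) * SKETCH_PREC);
   if (a->dp == NULL) {
      return SKETCH_MEM;
   }

   /* assign zero to every digit explicitly */
   for (i = 0; i < SKETCH_PREC; i++) {
      a->dp[i] = 0;
   }

   /* the remaining members describe the integer zero */
   a->used  = 0;
   a->alloc = SKETCH_PREC;
   a->sign  = SKETCH_ZPOS;
   return SKETCH_OKAY;
}
\end{verbatim}

A caller would typically declare the structure on the stack, pass its address, and use it only if the success code came back.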
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Clearing an mp\_int}
*ebfedea0SLionel SambucWhen an mp\_int is no longer required by the application, the memory that has been allocated for its digits must be
*ebfedea0SLionel Sambucreturned to the application's memory pool with the mp\_clear algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_clear}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The memory for $a$ shall be deallocated.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a$ has been previously freed then return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  for $n$ from 0 to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc3.  Free the memory allocated for the digits of $a$. \\
*ebfedea0SLionel Sambuc4.  $a.used \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  $a.alloc \leftarrow 0$ \\
*ebfedea0SLionel Sambuc6.  $a.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc7.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_clear}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_clear.}
*ebfedea0SLionel SambucThis algorithm accomplishes two goals.  First, it clears the digits and the other mp\_int members.  This ensures that
*ebfedea0SLionel Sambucif a developer accidentally re-uses a cleared structure it is less likely to cause problems.  The second goal
*ebfedea0SLionel Sambucis to free the allocated memory.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe logic behind the algorithm is extended by marking cleared mp\_int structures so that subsequent calls to this
*ebfedea0SLionel Sambucalgorithm will not try to free the memory multiple times.  Cleared mp\_ints are detectable by having a pre-defined invalid
*ebfedea0SLionel Sambucdigit pointer \textbf{dp} setting.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOnce an mp\_int has been cleared the mp\_int structure is no longer in a valid state for any other algorithm
*ebfedea0SLionel Sambucwith the exception of algorithms mp\_init, mp\_init\_copy, mp\_init\_size and mp\_clear.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_clear.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe algorithm only operates on the mp\_int if it hasn't been previously cleared.  The if statement (line @23,a->dp != NULL@)
*ebfedea0SLionel Sambucchecks to see if the \textbf{dp} member is not \textbf{NULL}.  If the mp\_int is a valid mp\_int then \textbf{dp} cannot be
*ebfedea0SLionel Sambuc\textbf{NULL} in which case the if statement will evaluate to true.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe digits of the mp\_int are cleared by the for loop (line @25,for@) which assigns a zero to every digit.  Similar to mp\_init()
*ebfedea0SLionel Sambucthe digits are assigned zero instead of using block memory operations (such as memset()) since this is more portable.
*ebfedea0SLionel Sambuc
The digits are deallocated off the heap via the XFREE macro.  Similar to XMALLOC the XFREE macro actually evaluates to
a standard C library function, in this case the free() function.  Since free() only deallocates the memory the pointer
still has to be reset to \textbf{NULL} manually (line @33,NULL@).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucNow that the digits have been cleared and deallocated the other members are set to their final values (lines @34,= 0@ and @35,ZPOS@).
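The clearing steps discussed above can be summarized in a short sketch.  The type and function names are again hypothetical
stand-ins rather than the library's own, and the function is written to return nothing, which is a choice made for this
illustration.

\begin{verbatim}
#include <stdlib.h>

typedef unsigned long sketch_digit;
typedef struct {
   int used, alloc, sign;
   sketch_digit *dp;
} sketch_int;

enum { SKETCH_ZPOS = 0, SKETCH_NEG = 1 };

void sketch_clear(sketch_int *a)
{
   int i;

   /* a NULL digit pointer marks an already cleared structure */
   if (a->dp != NULL) {
      /* wipe the digits that were in use before releasing them */
      for (i = 0; i < a->used; i++) {
         a->dp[i] = 0;
      }
      free(a->dp);

      /* reset the members so that a second call is harmless */
      a->dp    = NULL;
      a->alloc = a->used = 0;
      a->sign  = SKETCH_ZPOS;
   }
}
\end{verbatim}

Calling sketch\_clear twice on the same structure is safe because the second call finds the NULL pointer and does nothing.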
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Maintenance Algorithms}
*ebfedea0SLionel Sambuc
The previous sections describe how to initialize and clear an mp\_int structure.  To further support operations
that are to be performed on mp\_int structures (such as addition and multiplication) the dependent algorithms must be
able to augment the precision of an mp\_int and initialize mp\_ints with differing initial conditions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThese algorithms complete the set of low level algorithms required to work with mp\_int structures in the higher level
*ebfedea0SLionel Sambucalgorithms such as addition, multiplication and modular exponentiation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Augmenting an mp\_int's Precision}
When storing a value in an mp\_int structure, a sufficient number of digits must be available to accommodate the entire
result of an operation without loss of precision.  Quite often the size of the array given by the \textbf{alloc} member
is large enough to simply increase the \textbf{used} digit count.  However, when the size of the array is too small it
must be re-sized appropriately to accommodate the result.  The mp\_grow algorithm will provide this functionality.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_grow}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ and an integer $b$. \\
\textbf{Output}.  $a$ is expanded to accommodate $b$ digits. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  if $a.alloc \ge b$ then return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc2.  $u \leftarrow b\mbox{ (mod }MP\_PREC\mbox{)}$ \\
*ebfedea0SLionel Sambuc3.  $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\
*ebfedea0SLionel Sambuc4.  Re-allocate the array of digits $a$ to size $v$ \\
*ebfedea0SLionel Sambuc5.  If the allocation failed then return(\textit{MP\_MEM}). \\
6.  for $n$ from $a.alloc$ to $v - 1$ do  \\
*ebfedea0SLionel Sambuc\hspace{+3mm}6.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  $a.alloc \leftarrow v$ \\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_grow}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_grow.}
*ebfedea0SLionel SambucIt is ideal to prevent re-allocations from being performed if they are not required (step one).  This is useful to
*ebfedea0SLionel Sambucprevent mp\_ints from growing excessively in code that erroneously calls mp\_grow.
*ebfedea0SLionel Sambuc
The requested digit count is padded up to the next multiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} (steps two and three).
This helps prevent many trivial reallocations that would grow an mp\_int by trivially small values.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIt is assumed that the reallocation (step four) leaves the lower $a.alloc$ digits of the mp\_int intact.  This is much
*ebfedea0SLionel Sambucakin to how the \textit{realloc} function from the standard C library works.  Since the newly allocated digits are
*ebfedea0SLionel Sambucassumed to contain undefined values they are initially set to zero.
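The padding arithmetic of steps two and three is easy to check in isolation.  In the sketch below the function name and
the precision argument are hypothetical; the value 32 used in the example is only an assumed setting for \textbf{MP\_PREC}.
With that value a request for $b = 33$ digits gives $u = 1$ and $v = 33 + 64 - 1 = 96$.

\begin{verbatim}
/* v = b + 2*prec - (b mod prec), as in steps two and three.
   prec must be positive and b non-negative. */
int sketch_padded_size(int b, int prec)
{
   int u = b % prec;
   return b + 2 * prec - u;
}
\end{verbatim}

The result is always a multiple of the precision and always exceeds the request by more than one full block of digits.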
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_grow.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucA quick optimization is to first determine if a memory re-allocation is required at all.  The if statement (line @24,alloc@) checks
*ebfedea0SLionel Sambucif the \textbf{alloc} member of the mp\_int is smaller than the requested digit count.  If the count is not larger than \textbf{alloc}
*ebfedea0SLionel Sambucthe function skips the re-allocation part thus saving time.
*ebfedea0SLionel Sambuc
When a re-allocation is performed it is turned into an optimal request to save time in the future.  The requested digit count is
padded upwards to the second multiple of \textbf{MP\_PREC} above the requested count (line @25, size@).  The XREALLOC function is used
to re-allocate the memory.  As per the other functions XREALLOC is actually a macro which evaluates to realloc by default.  The realloc
function leaves the base of the allocation intact which means the first \textbf{alloc} digits of the mp\_int are the same as before
the re-allocation.  All that is left is to clear the newly allocated digits and return.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucNote that the re-allocation result is actually stored in a temporary pointer $tmp$.  This is to allow this function to return
*ebfedea0SLionel Sambucan error with a valid pointer.  Earlier releases of the library stored the result of XREALLOC into the mp\_int $a$.  That would
*ebfedea0SLionel Sambucresult in a memory leak if XREALLOC ever failed.
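The temporary pointer technique can be sketched as follows.  The names are hypothetical stand-ins, SKETCH\_PREC is given a
small assumed value to keep the example readable, and the fragment follows the description in the text rather than
reproducing the library source.

\begin{verbatim}
#include <stdlib.h>

typedef unsigned long sketch_digit;
typedef struct {
   int used, alloc, sign;
   sketch_digit *dp;
} sketch_int;

enum { SKETCH_OKAY = 0, SKETCH_MEM = -2, SKETCH_PREC = 8 };

int sketch_grow(sketch_int *a, int size)
{
   int i;
   sketch_digit *tmp;

   if (a->alloc < size) {
      /* pad the request as in steps two and three */
      size += (SKETCH_PREC * 2) - (size % SKETCH_PREC);

      /* keep the old pointer until realloc is known to have succeeded */
      tmp = realloc(a->dp, sizeof(sketch_digit) * size);
      if (tmp == NULL) {
         return SKETCH_MEM;      /* a->dp is still valid and still owned by a */
      }
      a->dp = tmp;

      /* zero only the newly added digits */
      for (i = a->alloc; i < size; i++) {
         a->dp[i] = 0;
      }
      a->alloc = size;
   }
   return SKETCH_OKAY;
}
\end{verbatim}

Had the result of realloc been written straight into the digit pointer, a failure would have overwritten the only reference
to the old block with NULL, which is the leak the text describes.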
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Initializing Variable Precision mp\_ints}
*ebfedea0SLionel SambucOccasionally the number of digits required will be known in advance of an initialization, based on, for example, the size
*ebfedea0SLionel Sambucof input mp\_ints to a given algorithm.  The purpose of algorithm mp\_init\_size is similar to mp\_init except that it
*ebfedea0SLionel Sambucwill allocate \textit{at least} a specified number of digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_init\_size}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ and the requested number of digits $b$. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a$ is initialized to hold at least $b$ digits. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $u \leftarrow b \mbox{ (mod }MP\_PREC\mbox{)}$ \\
*ebfedea0SLionel Sambuc2.  $v \leftarrow b + 2 \cdot MP\_PREC - u$ \\
*ebfedea0SLionel Sambuc3.  Allocate $v$ digits. \\
*ebfedea0SLionel Sambuc4.  for $n$ from $0$ to $v - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  $a.sign \leftarrow MP\_ZPOS$\\
*ebfedea0SLionel Sambuc6.  $a.used \leftarrow 0$\\
*ebfedea0SLionel Sambuc7.  $a.alloc \leftarrow v$\\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY})\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_init\_size}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_init\_size.}
*ebfedea0SLionel SambucThis algorithm will initialize an mp\_int structure $a$ like algorithm mp\_init with the exception that the number of
*ebfedea0SLionel Sambucdigits allocated can be controlled by the second input argument $b$.  The input size is padded upwards so it is a
*ebfedea0SLionel Sambucmultiple of \textbf{MP\_PREC} plus an additional \textbf{MP\_PREC} digits.  This padding is used to prevent trivial
*ebfedea0SLionel Sambucallocations from becoming a bottleneck in the rest of the algorithms.
*ebfedea0SLionel Sambuc
Like algorithm mp\_init, the mp\_int structure is initialized to a default state representing the integer zero.  This
particular algorithm is useful if the approximate size of the input is known ahead of time.  If the approximation is
correct no further memory re-allocations are required to work with the mp\_int.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_init_size.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe number of digits $b$ requested is padded (line @22,MP_PREC@) by first augmenting it to the next multiple of
*ebfedea0SLionel Sambuc\textbf{MP\_PREC} and then adding \textbf{MP\_PREC} to the result.  If the memory can be successfully allocated the
*ebfedea0SLionel Sambucmp\_int is placed in a default state representing the integer zero.  Otherwise, the error code \textbf{MP\_MEM} will be
*ebfedea0SLionel Sambucreturned (line @27,return@).
*ebfedea0SLionel Sambuc
The digits are allocated and set to zero at the same time with the calloc() function (line @25,XCALLOC@).  The
\textbf{used} count is set to zero, the \textbf{alloc} count set to the padded digit count and the \textbf{sign} flag set
to \textbf{MP\_ZPOS} to achieve a default valid mp\_int state (lines @29,used@, @30,alloc@ and @31,sign@).  If the function
returns successfully then it is correct to assume that the mp\_int structure is in a valid state for the remainder of the
functions to work with.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Multiple Integer Initializations and Clearings}
*ebfedea0SLionel SambucOccasionally a function will require a series of mp\_int data types to be made available simultaneously.
*ebfedea0SLionel SambucThe purpose of algorithm mp\_init\_multi is to initialize a variable length array of mp\_int structures in a single
*ebfedea0SLionel Sambucstatement.  It is essentially a shortcut to multiple initializations.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_init\_multi}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Variable length array $V_k$ of mp\_int variables of length $k$. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The array is initialized such that each mp\_int of $V_k$ is ready to use. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  for $n$ from 0 to $k - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.1.  Initialize the mp\_int $V_n$ (\textit{mp\_init}) \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.2.  If initialization failed then do \\
*ebfedea0SLionel Sambuc\hspace{+6mm}1.2.1.  for $j$ from $0$ to $n$ do \\
*ebfedea0SLionel Sambuc\hspace{+9mm}1.2.1.1.  Free the mp\_int $V_j$ (\textit{mp\_clear}) \\
*ebfedea0SLionel Sambuc\hspace{+6mm}1.2.2.   Return(\textit{MP\_MEM}) \\
*ebfedea0SLionel Sambuc2.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_init\_multi}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_init\_multi.}
*ebfedea0SLionel SambucThe algorithm will initialize the array of mp\_int variables one at a time.  If a runtime error has been detected
*ebfedea0SLionel Sambuc(\textit{step 1.2}) all of the previously initialized variables are cleared.  The goal is an ``all or nothing''
*ebfedea0SLionel Sambucinitialization which allows for quick recovery from runtime errors.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_init_multi.c
*ebfedea0SLionel Sambuc
This function initializes a variable length list of mp\_int structure pointers.  However, instead of having the mp\_int
structures in an actual C array they are simply passed as arguments to the function.  This function makes use of the
``...'' argument syntax of the C programming language.  The list is terminated with a final \textbf{NULL} argument
appended on the right.
*ebfedea0SLionel Sambuc
The function uses the ``stdarg.h'' \textit{va} functions to step portably through the arguments to the function.  A count
$n$ of successfully initialized mp\_int structures is maintained (line @47,n++@) such that if a failure does occur,
the algorithm can backtrack and free the previously initialized structures (lines @27,if@ to @46,}@).
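The ``all or nothing'' walk over a NULL terminated argument list can be sketched as shown below.  Every name is a
hypothetical stand-in, the single-structure helpers are deliberately minimal, and the failure counter exists only so that
the backtracking path can be exercised; none of it is taken from the library source.

\begin{verbatim}
#include <stdarg.h>
#include <stdlib.h>

typedef struct {
   int used, alloc, sign;
   unsigned long *dp;
} sketch_int;

enum { SKETCH_OKAY = 0, SKETCH_MEM = -2, SKETCH_PREC = 8 };

static int sketch_fail_at = 0;   /* hypothetical hook: fail the Nth init from now */

static int sketch_init(sketch_int *a)
{
   if (sketch_fail_at > 0 && --sketch_fail_at == 0) {
      a->dp = NULL;              /* simulated allocation failure */
   } else {
      a->dp = calloc(SKETCH_PREC, sizeof(unsigned long));
   }
   if (a->dp == NULL) {
      return SKETCH_MEM;
   }
   a->used = 0; a->alloc = SKETCH_PREC; a->sign = 0;
   return SKETCH_OKAY;
}

static void sketch_clear(sketch_int *a)
{
   free(a->dp);
   a->dp = NULL;
   a->used = a->alloc = 0;
}

/* Initialize every pointer in a NULL terminated argument list, or none. */
int sketch_init_multi(sketch_int *first, ...)
{
   int         n = 0;            /* structures initialized so far */
   sketch_int *cur = first;
   va_list     args;

   va_start(args, first);
   while (cur != NULL) {
      if (sketch_init(cur) != SKETCH_OKAY) {
         /* restart the walk and clear the first n structures */
         va_end(args);
         cur = first;
         va_start(args, first);
         while (n-- > 0) {
            sketch_clear(cur);
            cur = va_arg(args, sketch_int *);
         }
         va_end(args);
         return SKETCH_MEM;
      }
      n++;
      cur = va_arg(args, sketch_int *);
   }
   va_end(args);
   return SKETCH_OKAY;
}
\end{verbatim}

A call would look like sketch\_init\_multi(\&a, \&b, \&c, (sketch\_int *)NULL).  The cast on the terminator is a precaution
for platforms where a bare NULL is an integer constant and would not be passed as a pointer through ``...''.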
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Clamping Excess Digits}
When a function anticipates a result will be $n$ digits it is simpler to assume this is true within the body of
the function instead of checking during the computation.  For example, a multiplication of an $i$ digit number by a
$j$ digit number produces a result of at most $i + j$ digits.  It is entirely possible that the result is $i + j - 1$
digits though, with no final carry into the last position.  However, suppose the destination had to be first expanded
(\textit{via mp\_grow}) to accommodate $i + j - 1$ digits and then further expanded to accommodate the final carry.
That would be a considerable waste of time since heap operations are relatively slow.
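A small worked case makes the two possible lengths concrete.  Radix ten is assumed here purely for readability, with
$i = j = 2$:

\begin{eqnarray*}
99 \cdot 99 & = & 9801 \\
10 \cdot 10 & = & 100
\end{eqnarray*}

The first product fills all $i + j = 4$ digits, while the second needs only $i + j - 1 = 3$ digits because no carry reaches
the last position.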
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe ideal solution is to always assume the result is $i + j$ and fix up the \textbf{used} count after the function
*ebfedea0SLionel Sambucterminates.  This way a single heap operation (\textit{at most}) is required.  However, if the result was not checked
*ebfedea0SLionel Sambucthere would be an excess high order zero digit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, suppose the product of two integers was $x_n = (0x_{n-1}x_{n-2}...x_0)_{\beta}$.  The leading zero digit
*ebfedea0SLionel Sambucwill not contribute to the precision of the result.  In fact, through subsequent operations more leading zero digits would
*ebfedea0SLionel Sambucaccumulate to the point the size of the integer would be prohibitive.  As a result even though the precision is very
*ebfedea0SLionel Sambuclow the representation is excessively large.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe mp\_clamp algorithm is designed to solve this very problem.  It will trim high-order zeros by decrementing the
*ebfedea0SLionel Sambuc\textbf{used} count until a non-zero most significant digit is found.  Also in this system, zero is considered to be a
*ebfedea0SLionel Sambucpositive number which means that if the \textbf{used} count is decremented to zero, the sign must be set to
*ebfedea0SLionel Sambuc\textbf{MP\_ZPOS}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_clamp}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Any excess leading zero digits of $a$ are removed \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  while $a.used > 0$ and $a_{a.used - 1} = 0$ do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.1  $a.used \leftarrow a.used - 1$ \\
*ebfedea0SLionel Sambuc2.  if $a.used = 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}2.1  $a.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_clamp}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_clamp.}
*ebfedea0SLionel SambucAs can be expected this algorithm is very simple.  The loop on step one is expected to iterate only once or twice at
*ebfedea0SLionel Sambucthe most.  For example, this will happen in cases where there is not a carry to fill the last position.  Step two fixes the sign for
*ebfedea0SLionel Sambucwhen all of the digits are zero to ensure that the mp\_int is valid at all times.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_clamp.c
*ebfedea0SLionel Sambuc
Note on line @27,while@ how the test for the \textbf{used} count is made on the left of the \&\& operator.  In the C programming
language the terms to \&\& are evaluated left to right with a boolean short-circuit if any condition fails.  This is
important since if \textbf{used} is zero the test on the right would fetch below the array.  That is obviously
undesirable.  The parentheses on line @28,a->used@ are used to make sure the \textbf{used} count is decremented and not
the pointer ``a''.
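The two steps of the clamping algorithm, including the ordering of the two tests, can be sketched as follows.  The type and
function names are hypothetical stand-ins used so that the fragment compiles on its own.

\begin{verbatim}
typedef unsigned long sketch_digit;
typedef struct {
   int used, alloc, sign;
   sketch_digit *dp;
} sketch_int;

enum { SKETCH_ZPOS = 0, SKETCH_NEG = 1 };

void sketch_clamp(sketch_int *a)
{
   /* the used test comes first: when used is 0 the digit fetch is skipped */
   while (a->used > 0 && a->dp[a->used - 1] == 0) {
      --(a->used);
   }

   /* zero is always stored as a positive value */
   if (a->used == 0) {
      a->sign = SKETCH_ZPOS;
   }
}
\end{verbatim}

For a negative value whose digits are $5, 0, 0$ (least significant first) with a used count of three, the call leaves a used
count of one and keeps the sign.  If every digit is zero the used count drops to zero and the sign is forced positive.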
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section*{Exercises}
*ebfedea0SLionel Sambuc\begin{tabular}{cl}
*ebfedea0SLionel Sambuc$\left [ 1 \right ]$ & Discuss the relevance of the \textbf{used} member of the mp\_int structure. \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 1 \right ]$ & Discuss the consequences of not using padding when performing allocations.  \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ]$ & Estimate an ideal value for \textbf{MP\_PREC} when performing 1024-bit RSA \\
*ebfedea0SLionel Sambuc                     & encryption when $\beta = 2^{28}$.  \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 1 \right ]$ & Discuss the relevance of the algorithm mp\_clamp.  What does it prevent? \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 1 \right ]$ & Give an example of when the algorithm  mp\_init\_copy might be useful. \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc%%%
*ebfedea0SLionel Sambuc% CHAPTER FOUR
*ebfedea0SLionel Sambuc%%%
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Basic Operations}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Introduction}
In the previous chapter a series of low level algorithms were established that dealt with initializing and maintaining
mp\_int structures.  This chapter will discuss another set of seemingly non-algebraic algorithms which will form the low
level basis of the entire library.  While these algorithms are relatively trivial it is important to understand how they
work before proceeding since these algorithms will be used almost intrinsically in the following chapters.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe algorithms in this chapter deal primarily with more ``programmer'' related tasks such as creating copies of
*ebfedea0SLionel Sambucmp\_int structures, assigning small values to mp\_int structures and comparisons of the values mp\_int structures
*ebfedea0SLionel Sambucrepresent.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Assigning Values to mp\_int Structures}
*ebfedea0SLionel Sambuc\subsection{Copying an mp\_int}
*ebfedea0SLionel SambucAssigning the value that a given mp\_int structure represents to another mp\_int structure shall be known as making
*ebfedea0SLionel Sambuca copy for the purposes of this text.  The copy of the mp\_int will be a separate entity that represents the same
*ebfedea0SLionel Sambucvalue as the mp\_int it was copied from.  The mp\_copy algorithm provides this functionality.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_copy}. \\
\textbf{Input}.  The mp\_ints $a$ and $b$. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Store a copy of $a$ in $b$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b.alloc < a.used$ then grow $b$ to $a.used$ digits.  (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc2.  for $n$ from 0 to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $b_{n} \leftarrow a_{n}$ \\
*ebfedea0SLionel Sambuc3.  for $n$ from $a.used$ to $b.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $b_{n} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc4.  $b.used \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc5.  $b.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc6.  return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_copy}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_copy.}
This algorithm copies the mp\_int $a$ such that upon successful termination of the algorithm the mp\_int $b$ will
represent the same integer as the mp\_int $a$.  The mp\_int $b$ shall be a complete and distinct copy of the
mp\_int $a$ meaning that the mp\_int $a$ can be modified and it shall not affect the value of the mp\_int $b$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf $b$ does not have enough room for the digits of $a$ it must first have its precision augmented via the mp\_grow
algorithm.  The digits of $a$ are copied over the digits of $b$ and any excess digits of $b$ are set to zero (steps two
and three).  The \textbf{used} and \textbf{sign} members of $a$ are finally copied over the respective members of
*ebfedea0SLionel Sambuc$b$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Remark.}  This algorithm also introduces a new idiosyncrasy that will be used throughout the rest of the
*ebfedea0SLionel Sambuctext.  The error return codes of other algorithms are not explicitly checked in the pseudo-code presented.  For example, in
*ebfedea0SLionel Sambucstep one of the mp\_copy algorithm the return of mp\_grow is not explicitly checked to ensure it succeeded.  Text space is
limited so it is assumed that if an algorithm fails it will clear all temporarily allocated mp\_ints and return
*ebfedea0SLionel Sambucthe error code itself.  However, the C code presented will demonstrate all of the error handling logic required to
*ebfedea0SLionel Sambucimplement the pseudo-code.
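
As a minimal sketch of this convention, the C fragment below shows one way a copy routine might pass a failure of its
grow step straight back to the caller.  The names toy\_int, toy\_grow, toy\_copy, TOY\_OKAY and TOY\_MEM are hypothetical
stand-ins invented for this illustration; they are not the library's own definitions.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_OKAY 0
#define TOY_MEM  (-2)   /* hypothetical out-of-memory code */

typedef unsigned long toy_digit;

/* hypothetical stand-in for mp_int */
typedef struct {
    int used, alloc, sign;
    toy_digit *dp;
} toy_int;

/* grow a to hold at least size digits, zeroing the newly added ones */
static int toy_grow(toy_int *a, int size)
{
    if (a->alloc < size) {
        toy_digit *tmp = realloc(a->dp, sizeof(toy_digit) * (size_t)size);
        if (tmp == NULL) {
            return TOY_MEM;              /* a is left untouched on failure */
        }
        memset(tmp + a->alloc, 0, sizeof(toy_digit) * (size_t)(size - a->alloc));
        a->dp    = tmp;
        a->alloc = size;
    }
    return TOY_OKAY;
}

/* copy a into b, returning any error from the grow step unchanged */
static int toy_copy(const toy_int *a, toy_int *b)
{
    int n, res;

    if (a == b) {
        return TOY_OKAY;
    }
    if (b->alloc < a->used) {
        if ((res = toy_grow(b, a->used)) != TOY_OKAY) {
            return res;                  /* the check the pseudo-code leaves implicit */
        }
    }
    for (n = 0; n < a->used; n++) {
        b->dp[n] = a->dp[n];
    }
    for (; n < b->used; n++) {
        b->dp[n] = 0;
    }
    b->used = a->used;
    b->sign = a->sign;
    return TOY_OKAY;
}
```

If toy\_grow fails in this sketch, toy\_copy returns the same code and leaves $b$ unmodified, which is the behaviour the
pseudo-code takes for granted.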
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_copy.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOccasionally a dependent algorithm may copy an mp\_int effectively into itself such as when the input and output
*ebfedea0SLionel Sambucmp\_int structures passed to a function are one and the same.  For this case it is optimal to return immediately without
*ebfedea0SLionel Sambuccopying digits (line @24,a == b@).
*ebfedea0SLionel Sambuc
The mp\_int $b$ must have enough digits to accommodate the used digits of the mp\_int $a$.  If $b.alloc$ is less than
*ebfedea0SLionel Sambuc$a.used$ the algorithm mp\_grow is used to augment the precision of $b$ (lines @29,alloc@ to @33,}@).  In order to
*ebfedea0SLionel Sambucsimplify the inner loop that copies the digits from $a$ to $b$, two aliases $tmpa$ and $tmpb$ point directly at the digits
*ebfedea0SLionel Sambucof the mp\_ints $a$ and $b$ respectively.  These aliases (lines @42,tmpa@ and @45,tmpb@) allow the compiler to access the digits without first dereferencing the
*ebfedea0SLionel Sambucmp\_int pointers and then subsequently the pointer to the digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the aliases are established the digits from $a$ are copied into $b$ (lines @48,for@ to @50,}@) and then the excess
*ebfedea0SLionel Sambucdigits of $b$ are set to zero (lines @53,for@ to @55,}@).  Both ``for'' loops make use of the pointer aliases and in
*ebfedea0SLionel Sambucfact the alias for $b$ is carried through into the second ``for'' loop to clear the excess digits.  This optimization
allows the alias to stay in a machine register fairly easily between the two loops.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Remarks.}  The use of pointer aliases is an implementation methodology first introduced in this function that will
be used considerably in other functions.  Technically, a pointer alias is simply a shorthand used to lower the
*ebfedea0SLionel Sambucnumber of pointer dereferencing operations required to access data.  For example, a for loop may resemble
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{alltt}
*ebfedea0SLionel Sambucfor (x = 0; x < 100; x++) \{
*ebfedea0SLionel Sambuc    a->num[4]->dp[x] = 0;
*ebfedea0SLionel Sambuc\}
*ebfedea0SLionel Sambuc\end{alltt}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis could be re-written using aliases as
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{alltt}
mp_digit *tmpa;
tmpa = a->num[4]->dp;
for (x = 0; x < 100; x++) \{
    *tmpa++ = 0;
\}
*ebfedea0SLionel Sambuc\end{alltt}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn this case an alias is used to access the
*ebfedea0SLionel Sambucarray of digits within an mp\_int structure directly.  It may seem that a pointer alias is strictly not required
*ebfedea0SLionel Sambucas a compiler may optimize out the redundant pointer operations.  However, there are two dominant reasons to use aliases.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first reason is that most compilers will not effectively optimize pointer arithmetic.  For example, some optimizations
*ebfedea0SLionel Sambucmay work for the Microsoft Visual C++ compiler (MSVC) and not for the GNU C Compiler (GCC).  Also some optimizations may
*ebfedea0SLionel Sambucwork for GCC and not MSVC.  As such it is ideal to find a common ground for as many compilers as possible.  Pointer
*ebfedea0SLionel Sambucaliases optimize the code considerably before the compiler even reads the source code which means the end compiled code
*ebfedea0SLionel Sambucstands a better chance of being faster.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe second reason is that pointer aliases often can make an algorithm simpler to read.  Consider the first ``for''
*ebfedea0SLionel Sambucloop of the function mp\_copy() re-written to not use pointer aliases.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{alltt}
*ebfedea0SLionel Sambuc    /* copy all the digits */
*ebfedea0SLionel Sambuc    for (n = 0; n < a->used; n++) \{
*ebfedea0SLionel Sambuc      b->dp[n] = a->dp[n];
*ebfedea0SLionel Sambuc    \}
*ebfedea0SLionel Sambuc\end{alltt}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhether this code is harder to read depends strongly on the individual.  However, it is quantifiably slightly more
*ebfedea0SLionel Sambuccomplicated as there are four variables within the statement instead of just two.
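
For comparison, both loops of a copy can be sketched with aliases, the destination alias being carried from the first
loop into the second.  The function toy\_copy\_digits and the type toy\_digit are hypothetical names used only for this
illustration, and the sketch assumes the destination array already has room for every digit it writes.

```c
typedef unsigned long toy_digit;   /* hypothetical digit type */

/* Copy a_used digits from a to b, then zero the digits of b from a_used up to b_used - 1. */
static void toy_copy_digits(const toy_digit *a, int a_used, toy_digit *b, int b_used)
{
    const toy_digit *tmpa = a;     /* alias for the source digits */
    toy_digit       *tmpb = b;     /* alias for the destination digits */
    int n;

    for (n = 0; n < a_used; n++) {
        *tmpb++ = *tmpa++;
    }

    /* tmpb already points at digit a_used, so it is reused without being reset */
    for (; n < b_used; n++) {
        *tmpb++ = 0;
    }
}
```

Each loop body now names only the aliases, and the second loop needs no index arithmetic to find where the first one stopped.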
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Nested Statements}
*ebfedea0SLionel SambucAnother commonly used technique in the source routines is that certain sections of code are nested.  This is used in
*ebfedea0SLionel Sambucparticular with the pointer aliases to highlight code phases.  For example, a Comba multiplier (discussed in chapter six)
*ebfedea0SLionel Sambucwill typically have three different phases.  First the temporaries are initialized, then the columns calculated and
*ebfedea0SLionel Sambucfinally the carries are propagated.  In this example the middle column production phase will typically be nested as it
*ebfedea0SLionel Sambucuses temporary variables and aliases the most.
*ebfedea0SLionel Sambuc
The nesting also simplifies the source code as variables that are nested are only valid for their scope.  As a result
*ebfedea0SLionel Sambucthe various temporary variables required do not propagate into other sections of code.
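
The scoping effect can be sketched with a much smaller routine than a Comba multiplier.  The fragment below adds two
digit arrays of equal length; toy\_add\_digits, TOY\_DIGIT\_BIT and TOY\_MASK are hypothetical names, and the 28-bit digit
size is an assumption made only for this example.

```c
#include <stdint.h>

#define TOY_DIGIT_BIT 28                                   /* assumed digit size */
#define TOY_MASK      (((uint32_t)1 << TOY_DIGIT_BIT) - 1)

/* add the n-digit arrays a and b into c, which must hold n + 1 digits */
static void toy_add_digits(const uint32_t *a, const uint32_t *b, uint32_t *c, int n)
{
    uint32_t u = 0;   /* carry, shared by both phases */

    /* phase one: digit sums; the aliases exist only inside this block */
    {
        const uint32_t *tmpa = a, *tmpb = b;
        uint32_t *tmpc = c;
        int i;

        for (i = 0; i < n; i++) {
            uint32_t t = *tmpa++ + *tmpb++ + u;
            *tmpc++ = t & TOY_MASK;
            u       = t >> TOY_DIGIT_BIT;
        }
    }

    /* phase two: store the final carry; tmpa, tmpb and tmpc are out of scope here */
    c[n] = u;
}
```

Only the carry is declared at function scope because it is the only value both phases need.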
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Creating a Clone}
*ebfedea0SLionel SambucAnother common operation is to make a local temporary copy of an mp\_int argument.  To initialize an mp\_int
and then copy another existing mp\_int into the newly initialized mp\_int will be known as creating a clone.  This is
*ebfedea0SLionel Sambucuseful within functions that need to modify an argument but do not wish to actually modify the original copy.  The
*ebfedea0SLionel Sambucmp\_init\_copy algorithm has been designed to help perform this task.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_init\_copy}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ and $b$\\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a$ is initialized to be a copy of $b$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Init $a$.  (\textit{mp\_init}) \\
*ebfedea0SLionel Sambuc2.  Copy $b$ to $a$.  (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc3.  Return the status of the copy operation. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_init\_copy}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_init\_copy.}
*ebfedea0SLionel SambucThis algorithm will initialize an mp\_int variable and copy another previously initialized mp\_int variable into it.  As
*ebfedea0SLionel Sambucsuch this algorithm will perform two operations in one step.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_init_copy.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis will initialize \textbf{a} and make it a verbatim copy of the contents of \textbf{b}.  Note that
*ebfedea0SLionel Sambuc\textbf{a} will have its own memory allocated which means that \textbf{b} may be cleared after the call
*ebfedea0SLionel Sambucand \textbf{a} will be left intact.
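
The independence of the clone can be sketched as follows.  The names toy\_int, toy\_init\_copy and toy\_clear are
hypothetical, and for brevity the sketch sizes the new allocation directly from the source instead of calling separate
initialization and copy routines as the algorithm above does.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_OKAY 0
#define TOY_MEM  (-2)   /* hypothetical out-of-memory code */

typedef unsigned long toy_digit;

/* hypothetical stand-in for mp_int */
typedef struct {
    int used, alloc, sign;
    toy_digit *dp;
} toy_int;

/* give a its own storage and copy the used digits of b into it */
static int toy_init_copy(toy_int *a, const toy_int *b)
{
    int size = (b->used > 0) ? b->used : 1;

    a->dp = calloc((size_t)size, sizeof(toy_digit));
    if (a->dp == NULL) {
        return TOY_MEM;
    }
    if (b->used > 0) {
        memcpy(a->dp, b->dp, sizeof(toy_digit) * (size_t)b->used);
    }
    a->used  = b->used;
    a->alloc = size;
    a->sign  = b->sign;
    return TOY_OKAY;
}

/* release the storage owned by a */
static void toy_clear(toy_int *a)
{
    free(a->dp);
    a->dp   = NULL;
    a->used = a->alloc = a->sign = 0;
}
```

Because the clone owns a separate digit array, later changes to the source, including clearing it, do not reach the clone.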
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Zeroing an Integer}
Resetting an mp\_int to the default state is a common step in many algorithms.  The mp\_zero algorithm will be the algorithm used to
*ebfedea0SLionel Sambucperform this task.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_zero}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Zero the contents of $a$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $a.used \leftarrow 0$ \\
*ebfedea0SLionel Sambuc2.  $a.sign \leftarrow$ MP\_ZPOS \\
*ebfedea0SLionel Sambuc3.  for $n$ from 0 to $a.alloc - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_zero}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_zero.}
This algorithm simply resets an mp\_int to the default state.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_zero.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the function is completed, all of the digits are zeroed, the \textbf{used} count is zeroed and the
*ebfedea0SLionel Sambuc\textbf{sign} variable is set to \textbf{MP\_ZPOS}.
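
A minimal sketch of the same three steps, using the hypothetical names toy\_int, toy\_zero, TOY\_ZPOS and TOY\_NEG in place
of the library's definitions, might read:

```c
#define TOY_ZPOS 0   /* hypothetical sign codes */
#define TOY_NEG  1

typedef unsigned long toy_digit;

/* hypothetical stand-in for mp_int */
typedef struct {
    int used, alloc, sign;
    toy_digit *dp;
} toy_int;

/* reset a to zero: no used digits, positive sign, every allocated digit cleared */
static void toy_zero(toy_int *a)
{
    toy_digit *tmp = a->dp;
    int n;

    a->used = 0;
    a->sign = TOY_ZPOS;
    for (n = 0; n < a->alloc; n++) {
        *tmp++ = 0;
    }
}
```

Note that the loop runs over all \textbf{alloc} digits and not merely the \textbf{used} ones, matching step three of the algorithm.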
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Sign Manipulation}
*ebfedea0SLionel Sambuc\subsection{Absolute Value}
*ebfedea0SLionel SambucWith the mp\_int representation of an integer, calculating the absolute value is trivial.  The mp\_abs algorithm will compute
*ebfedea0SLionel Sambucthe absolute value of an mp\_int.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_abs}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Computes $b = \vert a \vert$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Copy $a$ to $b$.  (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc2.  If the copy failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc3.  $b.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc4.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_abs}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_abs.}
This algorithm computes the absolute value of an mp\_int input.  First it copies $a$ over $b$.  This is an example of an
algorithm where the check in mp\_copy that determines if the source and destination are equal proves useful.  This allows,
for instance, the developer to pass the same mp\_int as the source and destination to this function without additional
logic to handle it.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_abs.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis fairly trivial algorithm first eliminates non--required duplications (line @27,a != b@) and then sets the
*ebfedea0SLionel Sambuc\textbf{sign} flag to \textbf{MP\_ZPOS}.
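
The in-place case can be sketched with a simplified type.  Here toy\_int, toy\_abs and the TOY\_ constants are hypothetical,
and the digits live in a fixed-size array so that the copy is a plain structure assignment that cannot fail, unlike the
real copy.

```c
#define TOY_ZPOS 0   /* hypothetical sign codes */
#define TOY_NEG  1
#define TOY_MAX  8   /* assumed fixed digit capacity */

/* hypothetical stand-in for mp_int with inline digit storage */
typedef struct {
    int used, sign;
    unsigned long dp[TOY_MAX];
} toy_int;

/* b = |a|; a and b may be the same object */
static void toy_abs(const toy_int *a, toy_int *b)
{
    if (a != b) {
        *b = *a;          /* skipped when source and destination coincide */
    }
    b->sign = TOY_ZPOS;   /* forcing the sign is all that remains */
}
```

Calling toy\_abs with the same pointer for both arguments simply clears the sign of that integer in place.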
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Integer Negation}
*ebfedea0SLionel SambucWith the mp\_int representation of an integer, calculating the negation is also trivial.  The mp\_neg algorithm will compute
*ebfedea0SLionel Sambucthe negative of an mp\_int input.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_neg}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Computes $b = -a$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Copy $a$ to $b$.  (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc2.  If the copy failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc3.  If $a.used = 0$ then return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc4.  If $a.sign = MP\_ZPOS$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $b.sign = MP\_NEG$. \\
*ebfedea0SLionel Sambuc5.  else do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $b.sign = MP\_ZPOS$. \\
*ebfedea0SLionel Sambuc6.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_neg}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_neg.}
*ebfedea0SLionel SambucThis algorithm computes the negation of an input.  First it copies $a$ over $b$.  If $a$ has no used digits then
*ebfedea0SLionel Sambucthe algorithm returns immediately.  Otherwise it flips the sign flag and stores the result in $b$.  Note that if
*ebfedea0SLionel Sambuc$a$ had no digits then it must be positive by definition.  Had step three been omitted then the algorithm would return
*ebfedea0SLionel Sambuczero as negative.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_neg.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLike mp\_abs() this function avoids non--required duplications (line @21,a != b@) and then sets the sign.  We
*ebfedea0SLionel Sambuchave to make sure that only non--zero values get a \textbf{sign} of \textbf{MP\_NEG}.  If the mp\_int is zero
then the \textbf{sign} is hard--coded to \textbf{MP\_ZPOS}.
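
The special handling of zero can be sketched as below.  As before toy\_int, toy\_neg and the TOY\_ constants are
hypothetical names, and a fixed-size digit array is assumed so that copying cannot fail.

```c
#define TOY_ZPOS 0   /* hypothetical sign codes */
#define TOY_NEG  1
#define TOY_MAX  8   /* assumed fixed digit capacity */

/* hypothetical stand-in for mp_int with inline digit storage */
typedef struct {
    int used, sign;
    unsigned long dp[TOY_MAX];
} toy_int;

/* b = -a; zero always keeps the positive sign */
static void toy_neg(const toy_int *a, toy_int *b)
{
    if (a != b) {
        *b = *a;
    }
    if (b->used != 0) {
        b->sign = (a->sign == TOY_ZPOS) ? TOY_NEG : TOY_ZPOS;
    } else {
        b->sign = TOY_ZPOS;   /* never produce a negative zero */
    }
}
```

Without the test on the \textbf{used} count, negating zero would yield a value with no digits and a negative sign.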
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Small Constants}
*ebfedea0SLionel Sambuc\subsection{Setting Small Constants}
Often an mp\_int must be set to a relatively small value such as $1$ or $2$.  For these cases the mp\_set algorithm is useful.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_set}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ and a digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Make $a$ equivalent to $b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Zero $a$ (\textit{mp\_zero}). \\
*ebfedea0SLionel Sambuc2.  $a_0 \leftarrow b \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc3.  $a.used \leftarrow  \left \lbrace \begin{array}{ll}
*ebfedea0SLionel Sambuc                              1 &  \mbox{if }a_0 > 0 \\
*ebfedea0SLionel Sambuc                              0 &  \mbox{if }a_0 = 0
*ebfedea0SLionel Sambuc                              \end{array} \right .$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_set}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_set.}
This algorithm sets an mp\_int to a small single digit value.  Step number 1 ensures that the integer is reset to the default state.  The
*ebfedea0SLionel Sambucsingle digit is set (\textit{modulo $\beta$}) and the \textbf{used} count is adjusted accordingly.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_set.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFirst we zero (line @21,mp_zero@) the mp\_int to make sure that the other members are initialized for a
*ebfedea0SLionel Sambucsmall positive constant.  mp\_zero() ensures that the \textbf{sign} is positive and the \textbf{used} count
*ebfedea0SLionel Sambucis zero.  Next we set the digit and reduce it modulo $\beta$ (line @22,MP_MASK@).  After this step we have to
*ebfedea0SLionel Sambuccheck if the resulting digit is zero or not.  If it is not then we set the \textbf{used} count to one, otherwise
*ebfedea0SLionel Sambucto zero.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWe can quickly reduce modulo $\beta$ since it is of the form $2^k$ and a quick binary AND operation with
*ebfedea0SLionel Sambuc$2^k - 1$ will perform the same operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOne important limitation of this function is that it will only set one digit.  The size of a digit is not fixed, meaning source that uses
*ebfedea0SLionel Sambucthis function should take that into account.  Only trivially small constants can be set using this function.
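
The reduction modulo $\beta$ and its consequence for oversized constants can be sketched as follows.  The names toy\_int,
toy\_set, TOY\_DIGIT\_BIT and TOY\_MASK are hypothetical, and a 28-bit digit ($\beta = 2^{28}$) is assumed purely for
illustration.

```c
#define TOY_ZPOS      0
#define TOY_MAX       8                               /* assumed fixed digit capacity */
#define TOY_DIGIT_BIT 28                              /* assumed digit size */
#define TOY_MASK      ((1UL << TOY_DIGIT_BIT) - 1UL)  /* 2^k - 1 */

/* hypothetical stand-in for mp_int with inline digit storage */
typedef struct {
    int used, sign;
    unsigned long dp[TOY_MAX];
} toy_int;

/* a = b mod beta, stored as a single digit */
static void toy_set(toy_int *a, unsigned long b)
{
    int n;

    /* zero the integer first so that sign and the upper digits are well defined */
    for (n = 0; n < TOY_MAX; n++) {
        a->dp[n] = 0;
    }
    a->sign = TOY_ZPOS;

    a->dp[0] = b & TOY_MASK;            /* AND with 2^k - 1 reduces modulo 2^k */
    a->used  = (a->dp[0] != 0) ? 1 : 0;
}
```

Under these assumptions toy\_set applied to $2^{28} + 3$ stores only $3$, which is why a caller must know the digit size
before relying on a function of this kind for anything but trivially small constants.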
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Setting Large Constants}
*ebfedea0SLionel SambucTo overcome the limitations of the mp\_set algorithm the mp\_set\_int algorithm is ideal.  It accepts a ``long''
*ebfedea0SLionel Sambucdata type as input and will always treat it as a 32-bit integer.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_set\_int}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An mp\_int $a$ and a ``long'' integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Make $a$ equivalent to $b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Zero $a$ (\textit{mp\_zero}) \\
*ebfedea0SLionel Sambuc2.  for $n$ from 0 to 7 do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $a \leftarrow a \cdot 16$ (\textit{mp\_mul2d}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  $u \leftarrow \lfloor b / 2^{4(7 - n)} \rfloor \mbox{ (mod }16\mbox{)}$\\
*ebfedea0SLionel Sambuc\hspace{3mm}2.3  $a_0 \leftarrow a_0 + u$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.4  $a.used \leftarrow a.used + 1$ \\
*ebfedea0SLionel Sambuc3.  Clamp excess used digits (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_set\_int}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_set\_int.}
*ebfedea0SLionel SambucThe algorithm performs eight iterations of a simple loop where in each iteration four bits from the source are added to the
mp\_int.  Step 2.1 will multiply the current result by sixteen making room for four more bits in the less significant positions.  In steps 2.2 and 2.3 the
next four bits from the source are extracted and added to the mp\_int.  Step 2.4 then increments the \textbf{used} digit count
unconditionally: if the leading digits were zero the mp\_int would otherwise have zero digits used and the newly added
four bits would be ignored.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucExcess zero digits are trimmed in steps 2.1 and 3 by using higher level algorithms mp\_mul2d and mp\_clamp.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_set_int.c
*ebfedea0SLionel Sambuc
This function sets four bits of the number at a time to handle all practical \textbf{DIGIT\_BIT} sizes.  The seemingly odd
increment on line @38,a->used@ ensures that the newly added bits are counted among the used digits.  The digit
counter does not grow exceedingly large because of the shift on line @27,mp_mul_2d@
as well as the call to mp\_clamp() on line @40,mp_clamp@.  Both functions will clamp excess leading digits which keeps
the number of used digits low.
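
The interplay between the unconditional increment and the clamping can be sketched in a self-contained form.  All names
below (toy\_int, toy\_clamp, toy\_mul16, toy\_set\_int) are hypothetical, the digit size of 28 bits is an assumption, and
the shift helper is a simplified stand-in that only multiplies by sixteen.

```c
#include <stdint.h>

#define TOY_DIGIT_BIT 28                                   /* assumed digit size */
#define TOY_MASK      (((uint32_t)1 << TOY_DIGIT_BIT) - 1)
#define TOY_MAX       4                                    /* ample room for a 32-bit value */

/* hypothetical stand-in for mp_int with inline digit storage */
typedef struct {
    int used;
    uint32_t dp[TOY_MAX];
} toy_int;

/* drop leading zero digits from the used count */
static void toy_clamp(toy_int *a)
{
    while (a->used > 0 && a->dp[a->used - 1] == 0) {
        a->used--;
    }
}

/* a = a * 16, moving the top four bits of each digit into the next one */
static void toy_mul16(toy_int *a)
{
    uint32_t carry = 0;
    int n;

    for (n = 0; n < a->used; n++) {
        uint32_t next = a->dp[n] >> (TOY_DIGIT_BIT - 4);
        a->dp[n] = ((a->dp[n] << 4) | carry) & TOY_MASK;
        carry    = next;
    }
    if (carry != 0 && a->used < TOY_MAX) {
        a->dp[a->used++] = carry;
    }
    toy_clamp(a);
}

/* a = b, fed in four bits at a time starting from the most significant nibble */
static void toy_set_int(toy_int *a, uint32_t b)
{
    int x;

    for (x = 0; x < TOY_MAX; x++) {
        a->dp[x] = 0;
    }
    a->used = 0;

    for (x = 0; x < 8; x++) {
        toy_mul16(a);                  /* make room for four more bits */
        a->dp[0] |= (b >> 28) & 15;    /* top four bits of what remains */
        b <<= 4;
        a->used += 1;                  /* count digit zero even if used was 0 */
    }
    toy_clamp(a);                      /* remove any excess counted digits */
}
```

In this sketch the \textbf{used} count never exceeds three, since every call to toy\_mul16 clamps it before the next increment.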
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Comparisons}
\subsection{Unsigned Comparisons}
*ebfedea0SLionel SambucComparing a multiple precision integer is performed with the exact same algorithm used to compare two decimal numbers.  For example,
*ebfedea0SLionel Sambucto compare $1,234$ to $1,264$ the digits are extracted by their positions.  That is we compare $1 \cdot 10^3 + 2 \cdot 10^2 + 3 \cdot 10^1 + 4 \cdot 10^0$
*ebfedea0SLionel Sambucto $1 \cdot 10^3 + 2 \cdot 10^2 + 6 \cdot 10^1 + 4 \cdot 10^0$ by comparing single digits at a time starting with the highest magnitude
positions.  At the first position where the digits differ, the integer holding the greater digit is the greater integer.
*ebfedea0SLionel Sambuc
The first comparison routine that will be developed is the unsigned magnitude compare which will perform a comparison based on the digits of two
*ebfedea0SLionel Sambucmp\_int variables alone.  It will ignore the sign of the two inputs.  Such a function is useful when an absolute comparison is required or if the
*ebfedea0SLionel Sambucsigns are known to agree in advance.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucTo facilitate working with the results of the comparison functions three constants are required.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|r|l|}
*ebfedea0SLionel Sambuc\hline \textbf{Constant} & \textbf{Meaning} \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_GT} & Greater Than \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_EQ} & Equal To \\
*ebfedea0SLionel Sambuc\hline \textbf{MP\_LT} & Less Than \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Comparison Return Codes}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_cmp\_mag}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$.  \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Unsigned comparison results ($a$ to the left of $b$). \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a.used > b.used$ then return(\textit{MP\_GT}) \\
*ebfedea0SLionel Sambuc2.  If $a.used < b.used$ then return(\textit{MP\_LT}) \\
*ebfedea0SLionel Sambuc3.  for n from $a.used - 1$ to 0 do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}3.1  if $a_n > b_n$ then return(\textit{MP\_GT}) \\
*ebfedea0SLionel Sambuc\hspace{+3mm}3.2  if $a_n < b_n$ then return(\textit{MP\_LT}) \\
*ebfedea0SLionel Sambuc4.  Return(\textit{MP\_EQ}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_cmp\_mag}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_cmp\_mag.}
*ebfedea0SLionel SambucBy saying ``$a$ to the left of $b$'' it is meant that the comparison is with respect to $a$, that is if $a$ is greater than $b$ it will return
*ebfedea0SLionel Sambuc\textbf{MP\_GT} and similar with respect to when $a = b$ and $a < b$.  The first two steps compare the number of digits used in both $a$ and $b$.
*ebfedea0SLionel SambucObviously if the digit counts differ there would be an imaginary zero digit in the smaller number where the leading digit of the larger number is.
If both have the same number of digits then the actual digits themselves must be compared starting at the leading digit.

By step three both inputs must have the same number of digits so it is safe to start from either $a.used - 1$ or $b.used - 1$ and count down to
*ebfedea0SLionel Sambucthe zero'th digit.  If after all of the digits have been compared, no difference is found, the algorithm returns \textbf{MP\_EQ}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_cmp_mag.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe two if statements (lines @24,if@ and @28,if@) compare the number of digits in the two inputs.  These two are
*ebfedea0SLionel Sambucperformed before all of the digits are compared since it is a very cheap test to perform and can potentially save
*ebfedea0SLionel Sambucconsiderable time.  The implementation given is also not valid without those two statements.  $b.alloc$ may be
*ebfedea0SLionel Sambucsmaller than $a.used$, meaning that undefined values will be read from $b$ past the end of the array of digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Signed Comparisons}
*ebfedea0SLionel SambucComparing with sign considerations is also fairly critical in several routines (\textit{division for example}).  Based on an unsigned magnitude
*ebfedea0SLionel Sambuccomparison a trivial signed comparison algorithm can be written.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_cmp}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Signed Comparison Results ($a$ to the left of $b$) \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  if $a.sign = MP\_NEG$ and $b.sign = MP\_ZPOS$ then return(\textit{MP\_LT}) \\
*ebfedea0SLionel Sambuc2.  if $a.sign = MP\_ZPOS$ and $b.sign = MP\_NEG$ then return(\textit{MP\_GT}) \\
*ebfedea0SLionel Sambuc3.  if $a.sign = MP\_NEG$ then \\
*ebfedea0SLionel Sambuc\hspace{+3mm}3.1  Return the unsigned comparison of $b$ and $a$ (\textit{mp\_cmp\_mag}) \\
4.  Otherwise \\
*ebfedea0SLionel Sambuc\hspace{+3mm}4.1  Return the unsigned comparison of $a$ and $b$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_cmp}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_cmp.}
*ebfedea0SLionel SambucThe first two steps compare the signs of the two inputs.  If the signs do not agree then it can return right away with the appropriate
*ebfedea0SLionel Sambuccomparison code.  When the signs are equal the digits of the inputs must be compared to determine the correct result.  In step
three the unsigned comparison flips the order of the arguments since they are both negative.  For instance, if $-a > -b$ then
*ebfedea0SLionel Sambuc$\vert a \vert < \vert b \vert$.  Step number four will compare the two when they are both positive.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_cmp.c
*ebfedea0SLionel Sambuc
The two if statements (lines @22,if@ and @26,if@) perform the initial sign comparison.  If the signs are not equal then whichever
*ebfedea0SLionel Sambuchas the positive sign is larger.   The inputs are compared (line @30,if@) based on magnitudes.  If the signs were both
*ebfedea0SLionel Sambucnegative then the unsigned comparison is performed in the opposite direction (line @31,mp_cmp_mag@).  Otherwise, the signs are assumed to
*ebfedea0SLionel Sambucbe both positive and a forward direction unsigned comparison is performed.
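
The sign logic can be sketched on its own by collapsing the magnitude to a single unsigned value.  The type toy\_sm and
the functions toy\_cmp\_mag and toy\_cmp are hypothetical, as are the numeric values chosen for the TOY\_ constants.

```c
#define TOY_LT   (-1)   /* hypothetical return codes */
#define TOY_EQ   0
#define TOY_GT   1
#define TOY_ZPOS 0      /* hypothetical sign codes */
#define TOY_NEG  1

/* hypothetical sign-magnitude value with a one-word magnitude */
typedef struct {
    int sign;
    unsigned long mag;
} toy_sm;

/* unsigned comparison of the magnitudes only */
static int toy_cmp_mag(const toy_sm *a, const toy_sm *b)
{
    if (a->mag > b->mag) {
        return TOY_GT;
    }
    if (a->mag < b->mag) {
        return TOY_LT;
    }
    return TOY_EQ;
}

/* signed comparison built on top of the unsigned one */
static int toy_cmp(const toy_sm *a, const toy_sm *b)
{
    if (a->sign != b->sign) {
        /* signs differ: the negative input is the smaller one */
        return (a->sign == TOY_NEG) ? TOY_LT : TOY_GT;
    }
    if (a->sign == TOY_NEG) {
        /* both negative: the larger magnitude is the smaller value */
        return toy_cmp_mag(b, a);
    }
    return toy_cmp_mag(a, b);
}
```

For example, comparing $-5$ with $-3$ swaps the arguments and asks whether $3$ exceeds $5$ in magnitude, which yields TOY\_LT.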
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section*{Exercises}
*ebfedea0SLionel Sambuc\begin{tabular}{cl}
*ebfedea0SLionel Sambuc$\left [ 2 \right ]$ & Modify algorithm mp\_set\_int to accept as input a variable length array of bits. \\
*ebfedea0SLionel Sambuc                     & \\
$\left [ 3 \right ]$ & Give the probability that algorithm mp\_cmp\_mag will have to compare $k$ digits  \\
                     & of two random integers (of equal magnitude) before a difference is found. \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 1 \right ]$ & Suggest a simple method to speed up the implementation of mp\_cmp\_mag based  \\
*ebfedea0SLionel Sambuc                     & on the observations made in the previous problem. \\
*ebfedea0SLionel Sambuc                     &
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Basic Arithmetic}
*ebfedea0SLionel Sambuc\section{Introduction}
*ebfedea0SLionel SambucAt this point algorithms for initialization, clearing, zeroing, copying, comparing and setting small constants have been
*ebfedea0SLionel Sambucestablished.  The next logical set of algorithms to develop are addition, subtraction and digit shifting algorithms.  These
algorithms make use of the lower level algorithms and are the crucial building blocks for the multiplication algorithms.  It is very important
*ebfedea0SLionel Sambucthat these algorithms are highly optimized.  On their own they are simple $O(n)$ algorithms but they can be called from higher level algorithms
*ebfedea0SLionel Sambucwhich easily places them at $O(n^2)$ or even $O(n^3)$ work levels.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucMARK,SHIFTS
*ebfedea0SLionel SambucAll of the algorithms within this chapter make use of the logical bit shift operations denoted by $<<$ and $>>$ for left and right
*ebfedea0SLionel Sambuclogical shifts respectively.  A logical shift is analogous to sliding the decimal point of radix-10 representations.  For example, the real
number $0.9345$ is equivalent to $93.45\%$ which is found by sliding the decimal point two places to the right (\textit{multiplying by $\beta^2 = 10^2$}).
*ebfedea0SLionel SambucAlgebraically a binary logical shift is equivalent to a division or multiplication by a power of two.
*ebfedea0SLionel SambucFor example, $a << k = a \cdot 2^k$ while $a >> k = \lfloor a/2^k \rfloor$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOne significant difference between a logical shift and the way decimals are shifted is that digits below the zero'th position are removed
from the number.  For example, consider $1101_2 >> 1$.  Sliding the radix point as in decimal notation would produce $110.1_2$.  However, with a logical shift the
*ebfedea0SLionel Sambucresult is $110_2$.
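
The equivalence can be checked with fixed precision integers.  In the sketch below toy\_mul\_pow2 and toy\_div\_pow2 are
hypothetical helpers that multiply or divide by $2^k$ through repeated doubling or halving, so that their results can be
compared against the shift operators; it is assumed that $k < 32$ and that the product fits in 32 bits.

```c
#include <stdint.h>

/* a * 2^k by repeated doubling (the caller keeps the result within 32 bits) */
static uint32_t toy_mul_pow2(uint32_t a, unsigned k)
{
    while (k-- > 0) {
        a *= 2;
    }
    return a;
}

/* floor(a / 2^k) by repeated halving; the remainders are discarded */
static uint32_t toy_div_pow2(uint32_t a, unsigned k)
{
    while (k-- > 0) {
        a /= 2;
    }
    return a;
}
```

With these helpers toy\_mul\_pow2(5, 3) agrees with the C expression 5 << 3, and toy\_div\_pow2(13, 1) agrees with 13 >> 1,
the latter being $110_2$ as in the example above.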
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Addition and Subtraction}
In common two's complement fixed precision arithmetic negative numbers are easily represented by subtraction from the modulus.  For example, with 32-bit integers
*ebfedea0SLionel Sambuc$a - b\mbox{ (mod }2^{32}\mbox{)}$ is the same as $a + (2^{32} - b) \mbox{ (mod }2^{32}\mbox{)}$  since $2^{32} \equiv 0 \mbox{ (mod }2^{32}\mbox{)}$.
*ebfedea0SLionel SambucAs a result subtraction can be performed with a trivial series of logical operations and an addition.
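
As a small sketch of that identity, the hypothetical function toy\_sub\_via\_add below forms $2^{32} - b$ as the bitwise
complement of $b$ plus one and then performs an ordinary addition.  It relies on the C rule that unsigned 32-bit
arithmetic wraps modulo $2^{32}$.

```c
#include <stdint.h>

/* a - b (mod 2^32) computed as a + (2^32 - b), where 2^32 - b is ~b + 1 */
static uint32_t toy_sub_via_add(uint32_t a, uint32_t b)
{
    uint32_t neg_b = (uint32_t)(~b + 1u);   /* logical operation plus an increment */
    return (uint32_t)(a + neg_b);           /* the addition wraps modulo 2^32 */
}
```

The result matches the built-in subtraction for every pair of inputs, including the case $a < b$ where the difference wraps around.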
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucHowever, in multiple precision arithmetic negative numbers are not represented in the same way.  Instead a sign flag is used to keep track of the
*ebfedea0SLionel Sambucsign of the integer.  As a result signed addition and subtraction are actually implemented as conditional usage of lower level addition or
*ebfedea0SLionel Sambucsubtraction algorithms with the sign fixed up appropriately.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe lower level algorithms will add or subtract integers without regard to the sign flag.  That is they will add or subtract the magnitude of
*ebfedea0SLionel Sambucthe integers respectively.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Low Level Addition}
*ebfedea0SLionel SambucAn unsigned addition of multiple precision integers is performed with the same long-hand algorithm used to add decimal numbers.  That is to add the
*ebfedea0SLionel Sambuctrailing digits first and propagate the resulting carry upwards.  Since this is a lower level algorithm the name will have a ``s\_'' prefix.
*ebfedea0SLionel SambucHistorically that convention stems from the MPI library where ``s\_'' stood for static functions that were hidden from the developer entirely.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_add}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The unsigned addition $c = \vert a \vert + \vert b \vert$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  if $a.used > b.used$ then \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.1  $min \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.2  $max \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}1.3  $x   \leftarrow a$ \\
*ebfedea0SLionel Sambuc2.  else  \\
*ebfedea0SLionel Sambuc\hspace{+3mm}2.1  $min \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}2.2  $max \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}2.3  $x   \leftarrow b$ \\
*ebfedea0SLionel Sambuc3.  If $c.alloc < max + 1$ then grow $c$ to hold at least $max + 1$ digits (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc4.  $oldused \leftarrow c.used$ \\
*ebfedea0SLionel Sambuc5.  $c.used \leftarrow max + 1$ \\
*ebfedea0SLionel Sambuc6.  $u \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  for $n$ from $0$ to $min - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}7.1  $c_n \leftarrow a_n + b_n + u$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}7.2  $u \leftarrow c_n >> lg(\beta)$ \\
*ebfedea0SLionel Sambuc\hspace{+3mm}7.3  $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc8.  if $min \ne max$ then do \\
*ebfedea0SLionel Sambuc\hspace{+3mm}8.1  for $n$ from $min$ to $max - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{+6mm}8.1.1  $c_n \leftarrow x_n + u$ \\
*ebfedea0SLionel Sambuc\hspace{+6mm}8.1.2  $u \leftarrow c_n >> lg(\beta)$ \\
*ebfedea0SLionel Sambuc\hspace{+6mm}8.1.3  $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc9.  $c_{max} \leftarrow u$ \\
10.  if $oldused > max$ then \\
*ebfedea0SLionel Sambuc\hspace{+3mm}10.1  for $n$ from $max + 1$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{+6mm}10.1.1  $c_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc11.  Clamp excess digits in $c$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc12.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_add}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm s\_mp\_add.}
*ebfedea0SLionel SambucThis algorithm is loosely based on algorithm 14.7 of HAC \cite[pp. 594]{HAC} but has been extended to allow the inputs to have different magnitudes.
*ebfedea0SLionel SambucCoincidentally the description of algorithm A in Knuth \cite[pp. 266]{TAOCPV2} shares the same deficiency as the algorithm from \cite{HAC}.  Even the
*ebfedea0SLionel SambucMIX pseudo  machine code presented by Knuth \cite[pp. 266-267]{TAOCPV2} is incapable of handling inputs which are of different magnitudes.
*ebfedea0SLionel Sambuc
The first thing that has to be accomplished is to sort out which of the two inputs is the larger.  The addition logic
will simply add all of the smaller input to the larger input and store that first part of the result in the
destination.  Then it will apply a simpler addition loop to the excess digits of the larger input.
*ebfedea0SLionel Sambuc
The first two steps sort the inputs such that $min$ and $max$ hold the digit counts of the two
inputs.  The variable $x$ will be an mp\_int alias for the larger input or the second input $b$ if they have the
same number of digits.  After the inputs are sorted the destination $c$ is grown as required to accommodate the sum
of the two inputs.  The original \textbf{used} count of $c$ is saved in $oldused$ and \textbf{used} is then set to the new count $max + 1$.
*ebfedea0SLionel Sambuc
At this point the first addition loop will go through the digit positions that both inputs have in common.  The carry
variable $u$ is set to zero outside the loop.  Inside the loop an ``addition'' step requires three statements to produce
one digit of the sum.  First
two digits from $a$ and $b$ are added together along with the carry $u$.  The carry of this step is extracted and stored
in $u$ and finally the digit of the result $c_n$ is truncated within the range $0 \le c_n < \beta$.
*ebfedea0SLionel Sambuc
Now all of the digit positions that both inputs have in common have been exhausted.  If $min \ne max$ then $x$ is an alias
for the input that has more digits.  A simplified addition loop is then used to essentially copy the remaining digits
and the carry to the destination.
*ebfedea0SLionel Sambuc
The final carry is stored in $c_{max}$ and digits above $max$ up to $oldused$ are zeroed which completes the addition.
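
The two addition loops described above can be sketched on plain digit arrays.  This is only an illustration and not the LibTomMath source: it assumes $\beta = 2^{28}$ with digits stored in \texttt{uint32\_t}, uses the hypothetical name \texttt{add\_digits}, and leaves out the growing, zeroing and clamping of the destination.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* lg(beta), assumed */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* Adds the na-digit magnitude a to the nb-digit magnitude b.
 * c must have room for max(na, nb) + 1 digits.  Returns the number of
 * digits written, including the final carry digit (which may be zero). */
size_t add_digits(const uint32_t *a, size_t na,
                  const uint32_t *b, size_t nb, uint32_t *c)
{
   const uint32_t *x;          /* alias for the longer input */
   size_t min, max, n;
   uint32_t u = 0;             /* carry, always 0 or 1 */

   if (na > nb) { min = nb; max = na; x = a; }
   else         { min = na; max = nb; x = b; }

   for (n = 0; n < min; n++) {         /* positions both inputs have */
      c[n] = a[n] + b[n] + u;
      u    = c[n] >> DIGIT_BIT;        /* extract the carry */
      c[n] &= DIGIT_MASK;              /* reduce modulo beta */
   }
   for (; n < max; n++) {              /* remaining digits of x */
      c[n] = x[n] + u;
      u    = c[n] >> DIGIT_BIT;
      c[n] &= DIGIT_MASK;
   }
   c[max] = u;                         /* final carry */
   return max + 1;
}
```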
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_s_mp_add.c
*ebfedea0SLionel Sambuc
We first sort (lines @27,if@ to @35,}@) the inputs based on magnitude and determine the $min$ and $max$ variables.
Note that $x$ is a pointer to an mp\_int assigned to the larger input, in effect it is a local alias.  Next we
grow the destination (@37,init@ to @42,}@) to ensure that it can accommodate the result of the addition.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSimilar to the implementation of mp\_copy this function uses the braced code and local aliases coding style.  The three aliases that are on
*ebfedea0SLionel Sambuclines @56,tmpa@, @59,tmpb@ and @62,tmpc@ represent the two inputs and destination variables respectively.  These aliases are used to ensure the
*ebfedea0SLionel Sambuccompiler does not have to dereference $a$, $b$ or $c$ (respectively) to access the digits of the respective mp\_int.
*ebfedea0SLionel Sambuc
The initial carry $u$ will be cleared (line @65,u = 0@), note that $u$ is of type mp\_digit which ensures type
compatibility within the implementation.  The initial addition (lines @66,for@ to @75,}@) adds digits from
both inputs until the smaller input runs out of digits.  Similarly the conditional addition loop
(lines @81,for@ to @90,}@) adds the remaining digits from the larger of the two inputs.  The addition is finished
with the final carry being stored in $tmpc$ (line @94,tmpc++@).  Note the ``++'' operator within the same expression.
After line @94,tmpc++@, $tmpc$ will point to the $c.used$'th digit of the mp\_int $c$.  This is useful
for the next loop (lines @97,for@ to @99,}@) which sets any old upper digits to zero.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Low Level Subtraction}
The low level unsigned subtraction algorithm is very similar to the low level unsigned addition algorithm.  The principal difference is that the
*ebfedea0SLionel Sambucunsigned subtraction algorithm requires the result to be positive.  That is when computing $a - b$ the condition $\vert a \vert \ge \vert b\vert$ must
*ebfedea0SLionel Sambucbe met for this algorithm to function properly.  Keep in mind this low level algorithm is not meant to be used in higher level algorithms directly.
*ebfedea0SLionel SambucThis algorithm as will be shown can be used to create functional signed addition and subtraction algorithms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucMARK,GAMMA
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor this algorithm a new variable is required to make the description simpler.  Recall from section 1.3.1 that a mp\_digit must be able to represent
*ebfedea0SLionel Sambucthe range $0 \le x < 2\beta$ for the algorithms to work correctly.  However, it is allowable that a mp\_digit represent a larger range of values.  For
*ebfedea0SLionel Sambucthis algorithm we will assume that the variable $\gamma$ represents the number of bits available in a
*ebfedea0SLionel Sambucmp\_digit (\textit{this implies $2^{\gamma} > \beta$}).
*ebfedea0SLionel Sambuc
For example, the default for LibTomMath is to use an ``unsigned long'' for the mp\_digit ``type'' while $\beta = 2^{28}$.  In ISO C an ``unsigned long''
*ebfedea0SLionel Sambucdata type must be able to represent $0 \le x < 2^{32}$ meaning that in this case $\gamma \ge 32$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_sub}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$ ($\vert a \vert \ge \vert b \vert$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The unsigned subtraction $c = \vert a \vert - \vert b \vert$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $min \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc2.  $max \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc3.  If $c.alloc < max$ then grow $c$ to hold at least $max$ digits.  (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc4.  $oldused \leftarrow c.used$ \\
*ebfedea0SLionel Sambuc5.  $c.used \leftarrow max$ \\
*ebfedea0SLionel Sambuc6.  $u \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  for $n$ from $0$ to $min - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $c_n \leftarrow a_n - b_n - u$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  $u   \leftarrow c_n >> (\gamma - 1)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc8.  if $min < max$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  for $n$ from $min$ to $max - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}8.1.1  $c_n \leftarrow a_n - u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}8.1.2  $u   \leftarrow c_n >> (\gamma - 1)$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}8.1.3  $c_n \leftarrow c_n \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc9. if $oldused > max$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  for $n$ from $max$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}9.1.1  $c_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc10. Clamp excess digits of $c$.  (\textit{mp\_clamp}). \\
*ebfedea0SLionel Sambuc11. Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_sub}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm s\_mp\_sub.}
*ebfedea0SLionel SambucThis algorithm performs the unsigned subtraction of two mp\_int variables under the restriction that the result must be positive.  That is when
*ebfedea0SLionel Sambucpassing variables $a$ and $b$ the condition that $\vert a \vert \ge \vert b \vert$ must be met for the algorithm to function correctly.  This
*ebfedea0SLionel Sambucalgorithm is loosely based on algorithm 14.9 \cite[pp. 595]{HAC} and is similar to algorithm S in \cite[pp. 267]{TAOCPV2} as well.  As was the case
*ebfedea0SLionel Sambucof the algorithm s\_mp\_add both other references lack discussion concerning various practical details such as when the inputs differ in magnitude.
*ebfedea0SLionel Sambuc
The initial sorting of the inputs is trivial in this algorithm since $a$ is guaranteed to have at least the same magnitude as $b$.  Steps 1 and 2
*ebfedea0SLionel Sambucset the $min$ and $max$ variables.  Unlike the addition routine there is guaranteed to be no carry which means that the final result can be at
*ebfedea0SLionel Sambucmost $max$ digits in length as opposed to $max + 1$.  Similar to the addition algorithm the \textbf{used} count of $c$ is copied locally and
*ebfedea0SLionel Sambucset to the maximal count for the operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe subtraction loop that begins on step seven is essentially the same as the addition loop of algorithm s\_mp\_add except single precision
*ebfedea0SLionel Sambucsubtraction is used instead.  Note the use of the $\gamma$ variable to extract the carry (\textit{also known as the borrow}) within the subtraction
*ebfedea0SLionel Sambucloops.  Under the assumption that two's complement single precision arithmetic is used this will successfully extract the desired carry.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, consider subtracting $0101_2$ from $0100_2$ where $\gamma = 4$ and $\beta = 2$.  The least significant bit will force a carry upwards to
the third bit which will be set to zero after the borrow.  After the very first bit has been subtracted $4 - 1 \equiv 0011_2$ will remain.  When the
*ebfedea0SLionel Sambucthird bit of $0101_2$ is subtracted from the result it will cause another carry.  In this case though the carry will be forced to propagate all the
*ebfedea0SLionel Sambucway to the most significant bit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRecall that $\beta < 2^{\gamma}$.  This means that if a carry does occur just before the $lg(\beta)$'th bit it will propagate all the way to the most
*ebfedea0SLionel Sambucsignificant bit.  Thus, the high order bits of the mp\_digit that are not part of the actual digit will either be all zero, or all one. All that
*ebfedea0SLionel Sambucis needed is a single zero or one bit for the carry.  Therefore a single logical shift right by $\gamma - 1$ positions is sufficient to extract the
*ebfedea0SLionel Sambuccarry.  This method of carry extraction may seem awkward but the reason for it becomes apparent when the implementation is discussed.
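
The borrow extraction can be sketched as follows.  This is a simplified illustration under stated assumptions and not the library routine: digits are 28 bits wide and held in \texttt{uint32\_t} (so $\gamma = 32$), the name \texttt{sub\_digits} is hypothetical, and growing, zeroing and clamping of the destination are omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* lg(beta), assumed */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */
#define GAMMA      (CHAR_BIT * sizeof(uint32_t))       /* bits in the digit type */

/* Computes c = a - b on digit arrays, assuming |a| >= |b| and therefore
 * na >= nb.  c must have room for na digits. */
void sub_digits(const uint32_t *a, size_t na,
                const uint32_t *b, size_t nb, uint32_t *c)
{
   size_t n;
   uint32_t u = 0;                     /* borrow, always 0 or 1 */

   assert(na >= nb);
   for (n = 0; n < nb; n++) {
      c[n] = a[n] - b[n] - u;          /* may wrap around modulo 2^GAMMA */
      u    = c[n] >> (GAMMA - 1);      /* top bit is set only after a borrow */
      c[n] &= DIGIT_MASK;              /* reduce modulo beta */
   }
   for (; n < na; n++) {               /* propagate the borrow through a */
      c[n] = a[n] - u;
      u    = c[n] >> (GAMMA - 1);
      c[n] &= DIGIT_MASK;
   }
}
```

Because every input digit is below $\beta = 2^{28}$, a wrapped difference always has its upper four bits set, so the single shift by $\gamma - 1$ yields exactly the borrow.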
*ebfedea0SLionel Sambuc
If $b$ has a smaller magnitude than $a$ then step 8 will force the carry and copy operation to propagate through the larger input $a$ into $c$.  Step
9 will ensure that any leading digits of $c$ above the $max$'th position are zeroed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_s_mp_sub.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLike low level addition we ``sort'' the inputs.  Except in this case the sorting is hardcoded
*ebfedea0SLionel Sambuc(lines @24,min@ and @25,max@).  In reality the $min$ and $max$ variables are only aliases and are only
*ebfedea0SLionel Sambucused to make the source code easier to read.  Again the pointer alias optimization is used
*ebfedea0SLionel Sambucwithin this algorithm.  The aliases $tmpa$, $tmpb$ and $tmpc$ are initialized
*ebfedea0SLionel Sambuc(lines @42,tmpa@, @43,tmpb@ and @44,tmpc@) for $a$, $b$ and $c$ respectively.
*ebfedea0SLionel Sambuc
The first subtraction loop (lines @47,u = 0@ through @61,}@) subtracts digits from both inputs until the smaller of
*ebfedea0SLionel Sambucthe two inputs has been exhausted.  As remarked earlier there is an implementation reason for using the ``awkward''
*ebfedea0SLionel Sambucmethod of extracting the carry (line @57, >>@).  The traditional method for extracting the carry would be to shift
*ebfedea0SLionel Sambucby $lg(\beta)$ positions and logically AND the least significant bit.  The AND operation is required because all of
*ebfedea0SLionel Sambucthe bits above the $\lg(\beta)$'th bit will be set to one after a carry occurs from subtraction.  This carry
*ebfedea0SLionel Sambucextraction requires two relatively cheap operations to extract the carry.  The other method is to simply shift the
*ebfedea0SLionel Sambucmost significant bit to the least significant bit thus extracting the carry with a single cheap operation.  This
optimization only works on two's complement machines which is a safe assumption to make.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf $a$ has a larger magnitude than $b$ an additional loop (lines @64,for@ through @73,}@) is required to propagate
*ebfedea0SLionel Sambucthe carry through $a$ and copy the result to $c$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{High Level Addition}
*ebfedea0SLionel SambucNow that both lower level addition and subtraction algorithms have been established an effective high level signed addition algorithm can be
*ebfedea0SLionel Sambucestablished.  This high level addition algorithm will be what other algorithms and developers will use to perform addition of mp\_int data
*ebfedea0SLionel Sambuctypes.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRecall from section 5.2 that an mp\_int represents an integer with an unsigned mantissa (\textit{the array of digits}) and a \textbf{sign}
*ebfedea0SLionel Sambucflag.  A high level addition is actually performed as a series of eight separate cases which can be optimized down to three unique cases.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_add}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$  \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The signed addition $c = a + b$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  if $a.sign = b.sign$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c.sign \leftarrow a.sign$  \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add})\\
*ebfedea0SLionel Sambuc2.  else do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  if $\vert a \vert < \vert b \vert$ then do (\textit{mp\_cmp\_mag})  \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.1.1  $c.sign \leftarrow b.sign$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.1.2  $c \leftarrow \vert b \vert - \vert a \vert$ (\textit{s\_mp\_sub}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  else do \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.1  $c.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.2  $c \leftarrow \vert a \vert - \vert b \vert$ \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_add}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_add.}
*ebfedea0SLionel SambucThis algorithm performs the signed addition of two mp\_int variables.  There is no reference algorithm to draw upon from
*ebfedea0SLionel Sambuceither \cite{TAOCPV2} or \cite{HAC} since they both only provide unsigned operations.  The algorithm is fairly
*ebfedea0SLionel Sambucstraightforward but restricted since subtraction can only produce positive results.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|c|c|}
\hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\
*ebfedea0SLionel Sambuc\hline $+$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $+$ & $+$ & No  & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $-$ & No  & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline &&&&\\
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\hline $+$ & $-$ & No  & $c = b - a$ & $b.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $+$ & No  & $c = b - a$ & $b.sign$ \\
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\hline &&&&\\
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\hline $+$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Addition Guide Chart}
*ebfedea0SLionel Sambuc\label{fig:AddChart}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
Figure~\ref{fig:AddChart} lists all of the eight possible input combinations and is sorted to show that only three
specific cases need to be handled.  The return codes of the unsigned operations at steps 1.2, 2.1.2 and 2.2.2 are
forwarded to step three to check for errors.  This simplifies the description of the algorithm considerably and best
follows how the implementation was actually achieved.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAlso note how the \textbf{sign} is set before the unsigned addition or subtraction is performed.  Recall from the descriptions of algorithms
*ebfedea0SLionel Sambucs\_mp\_add and s\_mp\_sub that the mp\_clamp function is used at the end to trim excess digits.  The mp\_clamp algorithm will set the \textbf{sign}
*ebfedea0SLionel Sambucto \textbf{MP\_ZPOS} when the \textbf{used} digit count reaches zero.
*ebfedea0SLionel Sambuc
For example, consider performing $-a + a$ with algorithm mp\_add.  By the description of the algorithm the sign is set to \textbf{MP\_NEG} which would
produce a result of $-0$.  However, since the sign is set first and the unsigned subtraction is performed afterwards, the subsequent usage of algorithm mp\_clamp
within algorithm s\_mp\_sub will force $-0$ to become $0$.
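
The three cases and the treatment of $-0$ can be illustrated with a small sign-magnitude sketch.  Everything here is an assumption made for the example: the type \texttt{sm\_int} holds a single 64-bit magnitude rather than a digit array, \texttt{ZPOS} and \texttt{NEG} stand in for \textbf{MP\_ZPOS} and \textbf{MP\_NEG}, and \texttt{sm\_clamp} only mimics the sign fix-up that mp\_clamp performs.

```c
#include <stdint.h>

enum { ZPOS = 0, NEG = 1 };          /* stand-ins for MP_ZPOS and MP_NEG */

typedef struct {
   int      sign;                    /* ZPOS or NEG */
   uint64_t mag;                     /* single-word magnitude (assumption) */
} sm_int;

/* Mimics the sign fix-up of mp_clamp: a zero magnitude is never negative. */
static void sm_clamp(sm_int *c)
{
   if (c->mag == 0) c->sign = ZPOS;
}

/* c = a + b, following the three cases of the guide chart.
 * Overflow of the 64-bit magnitude is not handled in this sketch. */
void sm_add(const sm_int *a, const sm_int *b, sm_int *c)
{
   int      sa = a->sign, sb = b->sign;   /* read first so c may alias a or b */
   uint64_t ma = a->mag,  mb = b->mag;

   if (sa == sb) {                   /* same sign: add the magnitudes */
      c->sign = sa;
      c->mag  = ma + mb;
   } else if (ma < mb) {             /* |a| < |b|: result takes the sign of b */
      c->sign = sb;
      c->mag  = mb - ma;
   } else {                          /* |a| >= |b|: result takes the sign of a */
      c->sign = sa;
      c->mag  = ma - mb;
   }
   sm_clamp(c);                      /* turns -0 into 0 */
}
```

In the sketch the sign is assigned before the magnitude is computed, so the final fix-up is the only place where a zero result can lose its negative sign.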
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_add.c
*ebfedea0SLionel Sambuc
The source code follows the algorithm fairly closely.  The most notable new source code addition is the usage of the $res$ integer variable which
is used to pass the result of the unsigned operations forward.  Unlike in the algorithm, the variable $res$ is merely returned as is without
explicitly checking it and returning the constant \textbf{MP\_OKAY}.  The observation is that this algorithm will succeed or fail only if the lower
level functions do so.  Returning their return code is sufficient.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{High Level Subtraction}
*ebfedea0SLionel SambucThe high level signed subtraction algorithm is essentially the same as the high level signed addition algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_sub}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two mp\_ints $a$ and $b$  \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The signed subtraction $c = a - b$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  if $a.sign \ne b.sign$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $c \leftarrow \vert a \vert + \vert b \vert$ (\textit{s\_mp\_add}) \\
*ebfedea0SLionel Sambuc2.  else do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  if $\vert a \vert \ge \vert b \vert$ then do (\textit{mp\_cmp\_mag}) \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.1.1  $c.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.1.2  $c \leftarrow \vert a \vert  - \vert b \vert$ (\textit{s\_mp\_sub}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  else do \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.1  $c.sign \leftarrow  \left \lbrace \begin{array}{ll}
*ebfedea0SLionel Sambuc                              MP\_ZPOS &  \mbox{if }a.sign = MP\_NEG \\
*ebfedea0SLionel Sambuc                              MP\_NEG  &  \mbox{otherwise} \\
*ebfedea0SLionel Sambuc                              \end{array} \right .$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.2  $c \leftarrow \vert b \vert  - \vert a \vert$ \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_sub}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_sub.}
*ebfedea0SLionel SambucThis algorithm performs the signed subtraction of two inputs.  Similar to algorithm mp\_add there is no reference in either \cite{TAOCPV2} or
\cite{HAC}.  Also this algorithm is restricted by algorithm s\_mp\_sub.  Figure~\ref{fig:SubChart} lists the eight possible inputs and
*ebfedea0SLionel Sambucthe operations required.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Sign of $a$} & \textbf{Sign of $b$} & \textbf{$\vert a \vert \ge \vert b \vert $} & \textbf{Unsigned Operation} & \textbf{Result Sign Flag} \\
*ebfedea0SLionel Sambuc\hline $+$ & $-$ & Yes & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $+$ & $-$ & No  & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $+$ & Yes & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $+$ & No  & $c = a + b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline &&&& \\
*ebfedea0SLionel Sambuc\hline $+$ & $+$ & Yes & $c = a - b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $-$ & Yes & $c = a - b$ & $a.sign$ \\
*ebfedea0SLionel Sambuc\hline &&&& \\
*ebfedea0SLionel Sambuc\hline $+$ & $+$ & No  & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
*ebfedea0SLionel Sambuc\hline $-$ & $-$ & No  & $c = b - a$ & $\mbox{opposite of }a.sign$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Subtraction Guide Chart}
*ebfedea0SLionel Sambuc\label{fig:SubChart}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSimilar to the case of algorithm mp\_add the \textbf{sign} is set first before the unsigned addition or subtraction.  That is to prevent the
*ebfedea0SLionel Sambucalgorithm from producing $-a - -a = -0$ as a result.
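
As a brief worked illustration of step 2.2 (the values are chosen for this example only), take $a = 3$ and $b = 5$.  Both signs are \textbf{MP\_ZPOS} and $\vert a \vert < \vert b \vert$, so
\begin{displaymath}
c.sign \leftarrow MP\_NEG, \qquad c \leftarrow \vert b \vert - \vert a \vert = 5 - 3 = 2
\end{displaymath}
which gives $3 - 5 = -2$.  With $a = -3$ and $b = -5$ the same step yields $c.sign \leftarrow MP\_ZPOS$ and $c = 2$, which matches $-3 - (-5) = 2$.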
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_sub.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucMuch like the implementation of algorithm mp\_add the variable $res$ is used to catch the return code of the unsigned addition or subtraction operations
*ebfedea0SLionel Sambucand forward it to the end of the function.  On line @38, != MP_LT@ the ``not equal to'' \textbf{MP\_LT} expression is used to emulate a
*ebfedea0SLionel Sambuc``greater than or equal to'' comparison.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Bit and Digit Shifting}
*ebfedea0SLionel SambucMARK,POLY
*ebfedea0SLionel SambucIt is quite common to think of a multiple precision integer as a polynomial in $x$, that is $y = f(\beta)$ where $f(x) = \sum_{i=0}^{n-1} a_i x^i$.
*ebfedea0SLionel SambucThis notation arises within discussion of Montgomery and Diminished Radix Reduction as well as Karatsuba multiplication and squaring.
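
The polynomial view can be made concrete by evaluating $f(\beta)$ with Horner's rule.  The sketch below is illustrative only: it assumes $\beta = 2^{28}$, uses the hypothetical name \texttt{digits\_to\_u64}, and is restricted to values that fit in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* lg(beta), assumed */

/* Evaluates f(beta) = sum of a[i] * beta^i by Horner's rule.
 * Only valid while the value fits in 64 bits, i.e. for n <= 2 digits here. */
uint64_t digits_to_u64(const uint32_t *a, size_t n)
{
   uint64_t y = 0;

   assert(n * DIGIT_BIT <= 64);
   while (n-- > 0) {
      y = (y << DIGIT_BIT) + a[n];   /* y = y * beta + a[n] */
   }
   return y;
}
```

Multiplying by $\beta$ is a shift by whole digits, which is why digit shifting corresponds to multiplying or dividing the polynomial by $x$.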
*ebfedea0SLionel Sambuc
In order to facilitate operations on polynomials in $x$ as above a series of simple ``digit'' algorithms have to be established.  That is to shift
the digits left or right as well as to shift individual bits of the digits left and right.  It is important to note that not all ``shift'' operations
are on radix-$\beta$ digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Multiplication by Two}
*ebfedea0SLionel Sambuc
In a binary system where the radix is a power of two, multiplication by two not only arises often in other algorithms but is also a fairly efficient
operation to perform.  A single precision logical shift left is sufficient to multiply a single digit by two.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_mul\_2}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b = 2a$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b.alloc < a.used + 1$ then grow $b$ to hold $a.used + 1$ digits.  (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc2.  $oldused \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc3.  $b.used \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc4.  $r \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  for $n$ from 0 to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $rr \leftarrow a_n >> (lg(\beta) - 1)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $b_n \leftarrow (a_n << 1) + r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $r \leftarrow rr$ \\
*ebfedea0SLionel Sambuc6.  If $r \ne 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $b_{n + 1} \leftarrow r$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  $b.used \leftarrow b.used + 1$ \\
7.  If $b.used < oldused$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  for $n$ from $b.used$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.1.1  $b_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc8.  $b.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}).\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_mul\_2}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_mul\_2.}
This algorithm will quickly multiply a mp\_int by two provided $\beta$ is a power of two.  Neither \cite{TAOCPV2} nor \cite{HAC} describes such
an algorithm despite the fact it arises often in other algorithms.  The algorithm is set up much like the lower level algorithm s\_mp\_add since
it is for all intents and purposes equivalent to the operation $b = \vert a \vert + \vert a \vert$.
*ebfedea0SLionel Sambuc
Step 1 grows the destination $b$ as required to accommodate the maximum number of \textbf{used} digits in the result.  Step 2 saves the original
\textbf{used} count of $b$ and the initial \textbf{used} count is set to $a.used$ at step 3.  Only if there is a final carry will the \textbf{used} count require adjustment.

Step 5 is an optimized implementation of the addition loop for this specific case.  That is since the two values being added together
are the same there is no need to perform two reads from the digits of $a$.  Step 5.1 performs a single precision shift on the current digit $a_n$ to
obtain what will be the carry for the next iteration.  Step 5.2 calculates the $n$'th digit of the result as single precision shift of $a_n$ plus
the previous carry.  Recall from ~SHIFTS~ that $a_n << 1$ is equivalent to $a_n \cdot 2$.  An iteration of the addition loop is finished with
forwarding the carry to the next iteration.

Step 6 takes care of any final carry by setting the $a.used$'th digit of the result to the carry and augmenting the \textbf{used} count of $b$.
Step 7 clears any leading digits of $b$ in case it originally had a larger magnitude than $a$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_mul_2.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis implementation is essentially an optimized implementation of s\_mp\_add for the case of doubling an input.  The only noteworthy difference
*ebfedea0SLionel Sambucis the use of the logical shift operator on line @52,<<@ to perform a single precision doubling.
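
As a rough illustration of the doubling loop, the following sketch models the magnitude of an mp\_int as a plain array of 28-bit digits held in \texttt{uint32\_t}.  The name \texttt{sketch\_mul\_2}, the constants \texttt{DIGIT\_BIT} and \texttt{DIGIT\_MASK} and the fixed digit size are assumptions made for this example only.  It is not the LibTomMath source, and it leaves out growing the destination, clearing stale digits and copying the sign.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* Doubles the n-digit magnitude a[] into b[], which must have room for
 * n + 1 digits.  Returns the number of digits used in b[]. */
size_t sketch_mul_2(const uint32_t *a, size_t n, uint32_t *b)
{
    uint32_t r = 0;                                /* carry into the current digit */
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        uint32_t rr = a[i] >> (DIGIT_BIT - 1);     /* top bit is the next carry */
        b[i] = ((a[i] << 1) | r) & DIGIT_MASK;     /* shift, add carry, reduce mod beta */
        r = rr;                                    /* forward the carry */
    }
    if (r != 0) {                                  /* final carry adds one digit */
        b[n] = r;
        n++;
    }
    return n;
}
```

Each digit of $a$ is read exactly once, which is the saving over a general addition.  Doubling the two digit value $(2^{27}, 2^{28} - 1)_{\beta}$ with $\beta = 2^{28}$, for instance, produces a carry out of both digits and therefore a three digit result.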
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Division by Two}
*ebfedea0SLionel SambucA division by two can just as easily be accomplished with a logical shift right as multiplication by two can be with a logical shift left.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_div\_2}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b = a/2$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b.alloc < a.used$ then grow $b$ to hold $a.used$ digits.  (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc2.  If the reallocation failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc3.  $oldused \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc4.  $b.used \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc5.  $r \leftarrow 0$ \\
*ebfedea0SLionel Sambuc6.  for $n$ from $b.used - 1$ to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $rr \leftarrow a_n \mbox{ (mod }2\mbox{)}$\\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  $b_n \leftarrow (a_n >> 1) + (r << (lg(\beta) - 1)) \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  $r \leftarrow rr$ \\
*ebfedea0SLionel Sambuc7.  If $b.used < oldused - 1$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  for $n$ from $b.used$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.1.1  $b_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc8.  $b.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc9.  Clamp excess digits of $b$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc10.  Return(\textit{MP\_OKAY}).\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_div\_2}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_div\_2.}
This algorithm will divide an mp\_int by two using logical shifts to the right.  Like mp\_mul\_2, it uses a modified low level addition
core as the basis of the algorithm.  Unlike mp\_mul\_2, the shift operations work from the leading digit to the trailing digit.  The algorithm
could be written to work from the trailing digit to the leading digit; however, it would have to stop one short of $a.used - 1$ digits to prevent
reading past the end of the array of digits.
*ebfedea0SLionel Sambuc
Essentially, the loop at step 6 is similar to that of mp\_mul\_2 except that the logical shifts go in the opposite direction and the carry is taken from the
least significant bit, not the most significant bit.
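
The loop at step 6 can be sketched in the same simplified model as an array of 28-bit digits stored in \texttt{uint32\_t}.  The name \texttt{sketch\_div\_2} and the constant \texttt{DIGIT\_BIT} are hypothetical and exist only for this example; memory management and the sign are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta) */

/* Halves the n-digit magnitude a[] into b[] (room for n digits) and
 * returns the number of digits used after clamping leading zeros. */
size_t sketch_div_2(const uint32_t *a, size_t n, uint32_t *b)
{
    uint32_t r = 0;                          /* bit shifted out of the digit above */
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = n; i-- > 0; ) {                 /* leading digit down to digit zero */
        uint32_t rr = a[i] & 1;              /* low bit is the carry for the next digit */
        b[i] = (a[i] >> 1) | (r << (DIGIT_BIT - 1));
        r = rr;                              /* forward the carry downwards */
    }
    while (n > 0 && b[n - 1] == 0) {         /* clamp excess digits */
        n--;
    }
    return n;
}
```

Halving $\beta + 1$, stored as the digits $(1, 1)_{\beta}$, moves the low bit of the leading digit into the top bit of the trailing digit and leaves the single digit $2^{27}$.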
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_div_2.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Polynomial Basis Operations}
*ebfedea0SLionel SambucRecall from ~POLY~ that any integer can be represented as a polynomial in $x$ as $y = f(\beta)$.  Such a representation is also known as
the polynomial basis \cite[p. 48]{ROSE}.  Given such a notation, a multiplication or division by $x$ amounts to shifting whole digits a single
*ebfedea0SLionel Sambucplace.  The need for such operations arises in several other higher level algorithms such as Barrett and Montgomery reduction, integer
*ebfedea0SLionel Sambucdivision and Karatsuba multiplication.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucConverting from an array of digits to polynomial basis is very simple.  Consider the integer $y \equiv (a_2, a_1, a_0)_{\beta}$ and recall that
*ebfedea0SLionel Sambuc$y = \sum_{i=0}^{2} a_i \beta^i$.  Simply replace $\beta$ with $x$ and the expression is in polynomial basis.  For example, $f(x) = 8x + 9$ is the
*ebfedea0SLionel Sambucpolynomial basis representation for $89$ using radix ten.  That is, $f(10) = 8(10) + 9 = 89$.
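
The conversion can be checked numerically.  The sketch below evaluates a polynomial whose coefficients are the digits of $y$, stored least significant first, at a chosen radix.  Horner's rule is used purely as a convenient way to evaluate $f(x)$ and is a choice made for this example.  The function name \texttt{sketch\_poly\_eval} is hypothetical, and the result is assumed to fit in 64 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Evaluates f(x) = a[n-1] x^(n-1) + ... + a[1] x + a[0] by Horner's rule.
 * The caller must ensure the result fits in a uint64_t. */
uint64_t sketch_poly_eval(const uint32_t *a, size_t n, uint64_t x)
{
    uint64_t y = 0;
    size_t i;

    assert(a != NULL);
    for (i = n; i-- > 0; ) {
        y = y * x + a[i];   /* fold in the next lower coefficient */
    }
    return y;
}
```

With the coefficients $(8, 9)$ and $x = 10$ this yields $89$, matching $f(10)$ above.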
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Multiplication by $x$}
*ebfedea0SLionel Sambuc
Given a polynomial in $x$ such as $f(x) = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_0$, multiplying by $x$ amounts to shifting the coefficients up one
degree.  In this case $f(x) \cdot x = a_n x^{n+1} + a_{n-1} x^n + \ldots + a_0 x$.  From a scalar basis point of view, multiplying by $x$ is equivalent to
multiplying by the integer $\beta$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_lshd}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ and an integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a \leftarrow a \cdot \beta^b$ (equivalent to multiplication by $x^b$). \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b \le 0$ then return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  If $a.alloc < a.used + b$ then grow $a$ to at least $a.used + b$ digits.  (\textit{mp\_grow}). \\
*ebfedea0SLionel Sambuc3.  If the reallocation failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc4.  $a.used \leftarrow a.used + b$ \\
*ebfedea0SLionel Sambuc5.  $i \leftarrow a.used - 1$ \\
*ebfedea0SLionel Sambuc6.  $j \leftarrow a.used - 1 - b$ \\
*ebfedea0SLionel Sambuc7.  for $n$ from $a.used - 1$ to $b$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $a_{i} \leftarrow a_{j}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  $i \leftarrow i - 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  $j \leftarrow j - 1$ \\
*ebfedea0SLionel Sambuc8.  for $n$ from 0 to $b - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_lshd}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_lshd.}
This algorithm multiplies an mp\_int by the $b$'th power of $x$.  This is equivalent to multiplying by $\beta^b$.  The algorithm differs
from the other algorithms presented so far as it performs the operation in place instead of storing the result in a separate location.  The
motivation behind this change is the way this function is typically used.  Algorithms such as mp\_add store the result in an optionally
different third mp\_int because the original inputs are often still required.  Algorithm mp\_lshd (\textit{and similarly algorithm mp\_rshd}) is
typically used on values where the original value is no longer required.  The algorithm will return success immediately if
$b \le 0$ since the rest of the algorithm is only valid when $b > 0$.
*ebfedea0SLionel Sambuc
First the destination $a$ is grown as required to accommodate the result.  The counters $i$ and $j$ are used to form a \textit{sliding window} over
the digits of $a$ of length $b$.  The head of the sliding window is at $i$ (\textit{the leading digit}) and the tail at $j$ (\textit{the trailing digit}).
The loop in step 7 copies the digit from the tail to the head.  In each iteration the window is moved down one digit.   The last loop in
step 8 sets the lower $b$ digits to zero.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage
*ebfedea0SLionel SambucFIGU,sliding_window,Sliding Window Movement
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_lshd.c
*ebfedea0SLionel Sambuc
The if statement (line @24,if@) ensures that the $b$ variable is greater than zero since we do not interpret negative
shift counts properly.  The \textbf{used} count is incremented by $b$ before the copy loop begins.  This eliminates
the need for an additional variable in the for loop.  The variable $top$ (line @42,top@) is an alias
for the leading digit while $bottom$ (line @45,bottom@) is an alias for the trailing edge.  The aliases form a
window of exactly $b$ digits over the input.
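
The sliding window can be sketched on a bare digit array as follows.  The function name \texttt{sketch\_lshd} is hypothetical, the caller is assumed to have reserved room for the extra $b$ digits already, and a zero input is returned unchanged so that the example never produces leading zero digits.  None of this is taken from the LibTomMath source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shifts the used-digit magnitude a[] up by b whole digits in place.
 * a[] must have room for used + b digits.  Returns the new used count. */
size_t sketch_lshd(uint32_t *a, size_t used, size_t b)
{
    size_t i;

    assert(a != NULL);
    if (b == 0 || used == 0) {
        return used;                 /* nothing to move */
    }
    used += b;
    for (i = used; i-- > b; ) {      /* head runs from used - 1 down to b */
        size_t j = i - b;            /* tail sits b digits below the head */
        a[i] = a[j];                 /* copy from the tail to the head */
    }
    for (i = 0; i < b; i++) {
        a[i] = 0;                    /* the vacated low digits become zero */
    }
    return used;
}
```

Copying from the top down matters because the source and destination ranges overlap.  Working upwards would overwrite digits before they had been read whenever $b$ is smaller than the number of digits.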
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Division by $x$}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucDivision by powers of $x$ is easily achieved by shifting the digits right and removing any that will end up to the right of the zero'th digit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_rshd}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ and an integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a \leftarrow a / \beta^b$ (Divide by $x^b$). \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b \le 0$ then return. \\
*ebfedea0SLionel Sambuc2.  If $a.used \le b$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  Zero $a$.  (\textit{mp\_zero}). \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  Return. \\
*ebfedea0SLionel Sambuc3.  $i \leftarrow 0$ \\
*ebfedea0SLionel Sambuc4.  $j \leftarrow b$ \\
*ebfedea0SLionel Sambuc5.  for $n$ from 0 to $a.used - b - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $a_i \leftarrow a_j$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $i \leftarrow i + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $j \leftarrow j + 1$ \\
*ebfedea0SLionel Sambuc6.  for $n$ from $a.used - b$ to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $a_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  $a.used \leftarrow a.used - b$ \\
*ebfedea0SLionel Sambuc8.  Return. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_rshd}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_rshd.}
This algorithm divides the input in place by the $b$'th power of $x$.  It is analogous to dividing by $\beta^b$ but much quicker since
it does not require single precision division.  This algorithm does not actually return an error code as it cannot fail.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf the input $b$ is less than one the algorithm quickly returns without performing any work.  If the \textbf{used} count is less than or equal
*ebfedea0SLionel Sambucto the shift count $b$ then it will simply zero the input and return.
*ebfedea0SLionel Sambuc
After the trivial cases of inputs have been handled the sliding window is set up.  Much like the case of algorithm mp\_lshd, a sliding window that
is $b$ digits wide is used to copy the digits.  Unlike mp\_lshd, the window slides in the opposite direction, from the trailing to the leading digit.
Also the digits are copied from the leading to the trailing edge.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucOnce the window copy is complete the upper digits must be zeroed and the \textbf{used} count decremented.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_rshd.c
*ebfedea0SLionel Sambuc
The only noteworthy element of this routine is the lack of a return type since it cannot fail.  Like mp\_lshd() we
form a sliding window except we copy in the other direction.  After the window copy (line @59,for (;@) we zero
the upper digits of the input to make sure the result is correct.
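
A minimal sketch of the same steps on a bare digit array is given below.  The name \texttt{sketch\_rshd} is hypothetical.  Unlike the routine described above, the sketch returns the new \textbf{used} count because it has no mp\_int structure in which to store it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shifts the used-digit magnitude a[] down by b whole digits in place.
 * Returns the new used count. */
size_t sketch_rshd(uint32_t *a, size_t used, size_t b)
{
    size_t i;

    assert(a != NULL);
    if (b == 0) {
        return used;                     /* nothing to do */
    }
    if (used <= b) {                     /* every digit is shifted out */
        for (i = 0; i < used; i++) {
            a[i] = 0;
        }
        return 0;
    }
    for (i = 0; i < used - b; i++) {
        a[i] = a[i + b];                 /* copy from the leading to the trailing edge */
    }
    for (i = used - b; i < used; i++) {
        a[i] = 0;                        /* zero the vacated upper digits */
    }
    return used - b;
}
```

Here the copy runs upwards, the mirror image of the left shift, so that each digit is read before the window reaches it.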
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Powers of Two}
*ebfedea0SLionel Sambuc
Now that algorithms for moving single bits as well as whole digits exist, algorithms for moving the ``in between'' distances are required.  For
example, to quickly multiply by $2^k$ for any $k$ without using a full multiplier algorithm would prove useful.  Instead of performing single
shifts $k$ times to achieve a multiplication by $2^{\pm k}$, a mixture of whole digit shifting and partial digit shifting is employed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Multiplication by Power of Two}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_mul\_2d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ and an integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow a \cdot 2^b$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $c \leftarrow a$.  (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc2.  If $c.alloc < c.used + \lfloor b / lg(\beta) \rfloor + 2$ then grow $c$ accordingly. \\
*ebfedea0SLionel Sambuc3.  If the reallocation failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc4.  If $b \ge lg(\beta)$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $c \leftarrow c \cdot \beta^{\lfloor b / lg(\beta) \rfloor}$ (\textit{mp\_lshd}). \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.2  If step 4.1 failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc5.  $d \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
*ebfedea0SLionel Sambuc6.  If $d \ne 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $mask \leftarrow 2^d$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  $r \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  for $n$ from $0$ to $c.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.1  $rr \leftarrow c_n >> (lg(\beta) - d) \mbox{ (mod }mask\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.2  $c_n \leftarrow (c_n << d) + r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.3  $r \leftarrow rr$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.4  If $r > 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.4.1  $c_{c.used} \leftarrow r$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.4.2  $c.used \leftarrow c.used + 1$ \\
*ebfedea0SLionel Sambuc7.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_mul\_2d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_mul\_2d.}
*ebfedea0SLionel SambucThis algorithm multiplies $a$ by $2^b$ and stores the result in $c$.  The algorithm uses algorithm mp\_lshd and a derivative of algorithm mp\_mul\_2 to
*ebfedea0SLionel Sambucquickly compute the product.
*ebfedea0SLionel Sambuc
First the algorithm will multiply $a$ by $x^{\lfloor b / lg(\beta) \rfloor}$ which will ensure that the remaining multiplicand is less than
$\beta$.  For example, if $b = 37$ and $\beta = 2^{28}$ then this step will multiply by $x$ leaving a multiplication by $2^{37 - 28} = 2^{9}$
left.
*ebfedea0SLionel Sambuc
After the digits have been shifted appropriately at most $lg(\beta) - 1$ shifts are left to perform.  Step 5 calculates the number of remaining shifts
required.  If it is non-zero a modified shift loop is used to calculate the remaining product.
Essentially the loop is a generic version of algorithm mp\_mul\_2 designed to handle any shift count in the range $1 \le d < lg(\beta)$.  The $mask$
variable is used to extract the upper $d$ bits to form the carry for the next iteration.
*ebfedea0SLionel Sambuc
This algorithm is loosely measured as an $O(2n)$ algorithm, which means that if the input has $n$ digits it takes $2n$ ``time'' to
complete.  It is possible to optimize this algorithm down to an $O(n)$ algorithm at a cost of making the algorithm slightly harder to follow.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_mul_2d.c
*ebfedea0SLionel Sambuc
The shifting is performed in--place which means the first step (line @24,a != c@) is to copy the input to the
destination.  The call to mp\_copy() is skipped when the input and the destination are the same mp\_int.  The destination then
has to be grown (line @31,grow@) to accommodate the result.
*ebfedea0SLionel Sambuc
If the shift count $b$ is at least $lg(\beta)$ then a call to mp\_lshd() is used to handle all of the multiples
of $lg(\beta)$, leaving only a remaining shift of $lg(\beta) - 1$ or fewer bits.  Inside the actual shift
loop (lines @45,if@ to @76,}@) we make use of pre--computed values $shift$ and $mask$.   These are used to
extract the carry bit(s) to pass into the next iteration of the loop.  The $r$ and $rr$ variables form a
chain between consecutive iterations to propagate the carry.
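
The two phases can be sketched on a bare array of 28-bit digits as shown below.  The names \texttt{sketch\_mul\_2d}, \texttt{DIGIT\_BIT} and \texttt{DIGIT\_MASK} are assumptions for this example, the caller is assumed to have reserved enough digits, and the whole digit shift is written inline instead of calling mp\_lshd.  Note that the sketch uses $2^d - 1$ as a bitwise AND mask, which is the same reduction as taking the value modulo $2^d$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* Multiplies the used-digit magnitude c[] by 2^b in place.  c[] must have
 * room for used + b / DIGIT_BIT + 1 digits.  Returns the new used count. */
size_t sketch_mul_2d(uint32_t *c, size_t used, unsigned b)
{
    size_t whole = b / DIGIT_BIT;        /* number of whole digit shifts */
    unsigned d = b % DIGIT_BIT;          /* remaining bit shift, 0 <= d < lg(beta) */
    size_t i;

    assert(c != NULL);
    if (used == 0) {
        return 0;                        /* zero stays zero */
    }
    if (whole > 0) {                     /* multiply by x^whole */
        for (i = used + whole; i-- > whole; ) {
            c[i] = c[i - whole];
        }
        for (i = 0; i < whole; i++) {
            c[i] = 0;
        }
        used += whole;
    }
    if (d != 0) {
        uint32_t mask = (((uint32_t)1) << d) - 1;   /* selects d bits */
        uint32_t r = 0;                             /* carry into the current digit */
        for (i = 0; i < used; i++) {
            uint32_t rr = (c[i] >> (DIGIT_BIT - d)) & mask;  /* upper d bits */
            c[i] = ((c[i] << d) | r) & DIGIT_MASK;
            r = rr;
        }
        if (r != 0) {                    /* final carry adds one digit */
            c[used++] = r;
        }
    }
    return used;
}
```

For the example of $b = 37$ and $\beta = 2^{28}$, the value one becomes the two digit number $(2^{9}, 0)_{\beta}$: one whole digit shift followed by a shift of nine bits.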
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Division by Power of Two}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_div\_2d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ and an integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow \lfloor a / 2^b \rfloor, d \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b \le 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c \leftarrow a$ (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $d \leftarrow 0$ (\textit{mp\_zero}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.3  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  $c \leftarrow a$ \\
*ebfedea0SLionel Sambuc3.  $d \leftarrow a \mbox{ (mod }2^b\mbox{)}$ (\textit{mp\_mod\_2d}) \\
*ebfedea0SLionel Sambuc4.  If $b \ge lg(\beta)$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $c \leftarrow \lfloor c/\beta^{\lfloor b/lg(\beta) \rfloor} \rfloor$ (\textit{mp\_rshd}). \\
*ebfedea0SLionel Sambuc5.  $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
*ebfedea0SLionel Sambuc6.  If $k \ne 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $mask \leftarrow 2^k$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  $r \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  for $n$ from $c.used - 1$ to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.1  $rr \leftarrow c_n \mbox{ (mod }mask\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.2  $c_n \leftarrow (c_n >> k) + (r << (lg(\beta) - k))$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.3.3  $r \leftarrow rr$ \\
*ebfedea0SLionel Sambuc7.  Clamp excess digits of $c$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_div\_2d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_div\_2d.}
*ebfedea0SLionel SambucThis algorithm will divide an input $a$ by $2^b$ and produce the quotient and remainder.  The algorithm is designed much like algorithm
*ebfedea0SLionel Sambucmp\_mul\_2d by first using whole digit shifts then single precision shifts.  This algorithm will also produce the remainder of the division
*ebfedea0SLionel Sambucby using algorithm mp\_mod\_2d.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_div_2d.c
*ebfedea0SLionel Sambuc
The implementation of algorithm mp\_div\_2d differs slightly from what the algorithm specifies.  The remainder $d$ may be optionally
ignored by passing \textbf{NULL} as the pointer to the mp\_int variable.    The temporary mp\_int variable $t$ is used to hold the
result of the remainder operation until the end.  This allows $d$ and $a$ to represent the same mp\_int without modifying $a$ before
the quotient is obtained.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe remainder of the source code is essentially the same as the source code for mp\_mul\_2d.  The only significant difference is
*ebfedea0SLionel Sambucthe direction of the shifts.
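
The quotient half of the algorithm can be sketched in the same simplified model.  The name \texttt{sketch\_div\_2d} and the constant \texttt{DIGIT\_BIT} are hypothetical, the whole digit shift is written inline instead of calling mp\_rshd, and the remainder is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta) */

/* Replaces the used-digit magnitude c[] with floor(c / 2^b) in place.
 * Returns the new used count. */
size_t sketch_div_2d(uint32_t *c, size_t used, unsigned b)
{
    size_t whole = b / DIGIT_BIT;        /* number of whole digit shifts */
    unsigned k = b % DIGIT_BIT;          /* remaining bit shift */
    size_t i;

    assert(c != NULL);
    if (whole >= used) {                 /* every digit is shifted out */
        for (i = 0; i < used; i++) {
            c[i] = 0;
        }
        return 0;
    }
    if (whole > 0) {                     /* divide by x^whole */
        for (i = 0; i + whole < used; i++) {
            c[i] = c[i + whole];
        }
        for (; i < used; i++) {
            c[i] = 0;
        }
        used -= whole;
    }
    if (k != 0) {
        uint32_t mask = (((uint32_t)1) << k) - 1;   /* selects the low k bits */
        uint32_t r = 0;                             /* bits from the digit above */
        for (i = used; i-- > 0; ) {
            uint32_t rr = c[i] & mask;              /* bits that fall out of this digit */
            c[i] = (c[i] >> k) | (r << (DIGIT_BIT - k));
            r = rr;
        }
    }
    while (used > 0 && c[used - 1] == 0) {          /* clamp excess digits */
        used--;
    }
    return used;
}
```

Dividing $2^{37}$, stored as $(2^{9}, 0)_{\beta}$, by $2^{37}$ undoes one whole digit shift and nine bit shifts and leaves the single digit one.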
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Remainder of Division by Power of Two}
*ebfedea0SLionel Sambuc
The last algorithm in the series of polynomial basis power of two algorithms is calculating the remainder of division by $2^b$.  This
algorithm benefits from the fact that in two's complement arithmetic $a \mbox{ (mod }2^b\mbox{)}$ is the same as $a$ AND $2^b - 1$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_mod\_2d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   One mp\_int $a$ and an integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow a \mbox{ (mod }2^b\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b \le 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c \leftarrow 0$ (\textit{mp\_zero}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  If $b > a.used \cdot lg(\beta)$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $c \leftarrow a$ (\textit{mp\_copy}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  Return the result of step 2.1. \\
*ebfedea0SLionel Sambuc3.  $c \leftarrow a$ \\
*ebfedea0SLionel Sambuc4.  If step 3 failed return(\textit{MP\_MEM}). \\
5.  for $n$ from $\lceil b / lg(\beta) \rceil$ to $c.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $c_n \leftarrow 0$ \\
*ebfedea0SLionel Sambuc6.  $k \leftarrow b \mbox{ (mod }lg(\beta)\mbox{)}$ \\
*ebfedea0SLionel Sambuc7.  $c_{\lfloor b / lg(\beta) \rfloor} \leftarrow c_{\lfloor b / lg(\beta) \rfloor} \mbox{ (mod }2^{k}\mbox{)}$. \\
*ebfedea0SLionel Sambuc8.  Clamp excess digits of $c$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_mod\_2d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_mod\_2d.}
This algorithm will quickly calculate the value of $a \mbox{ (mod }2^b\mbox{)}$.  First if $b$ is less than or equal to zero the
result is set to zero.  If $b$ is greater than the number of bits in $a$ then it simply copies $a$ to $c$ and returns.  Otherwise, $a$
is copied to $c$, leading digits are removed and the remaining leading digit is trimmed to the exact bit count.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_mod_2d.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWe first avoid cases of $b \le 0$ by simply mp\_zero()'ing the destination in such cases.  Next if $2^b$ is larger
*ebfedea0SLionel Sambucthan the input we just mp\_copy() the input and return right away.  After this point we know we must actually
*ebfedea0SLionel Sambucperform some work to produce the remainder.
*ebfedea0SLionel Sambuc
Recalling that reducing modulo $2^k$ and a binary ``and'' with $2^k - 1$ are numerically equivalent, we can quickly reduce
the number.  First we zero any digits above the last digit in $2^b$ (line @41,for@).  Next we mask the
remaining leading digit (line @45,&=@) and then call mp\_clamp().
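
The same reduction can be sketched on a bare array of 28-bit digits.  The name \texttt{sketch\_mod\_2d} and the constant \texttt{DIGIT\_BIT} are assumptions for this example, and the sketch reduces its argument in place instead of copying it to a separate destination first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28   /* assumed lg(beta) */

/* Reduces the used-digit magnitude c[] modulo 2^b in place.
 * Returns the new used count. */
size_t sketch_mod_2d(uint32_t *c, size_t used, unsigned b)
{
    size_t whole = b / DIGIT_BIT;                     /* digits kept in full */
    unsigned k = b % DIGIT_BIT;                       /* bits kept in the next digit */
    size_t first_clear = whole + (k != 0 ? 1 : 0);    /* ceil(b / lg(beta)) */
    size_t i;

    assert(c != NULL);
    if (b == 0) {                                     /* anything mod 1 is zero */
        for (i = 0; i < used; i++) {
            c[i] = 0;
        }
        return 0;
    }
    if (whole >= used) {
        return used;                                  /* 2^b exceeds the value */
    }
    for (i = first_clear; i < used; i++) {
        c[i] = 0;                                     /* drop digits above bit b */
    }
    if (k != 0) {
        c[whole] &= (((uint32_t)1) << k) - 1;         /* AND with 2^k - 1 */
    }
    while (used > 0 && c[used - 1] == 0) {            /* clamp excess digits */
        used--;
    }
    return used;
}
```

With $b = 30$, for example, the first digit is kept whole, the second digit keeps only its low two bits and every higher digit is cleared.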
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section*{Exercises}
*ebfedea0SLionel Sambuc\begin{tabular}{cl}
*ebfedea0SLionel Sambuc$\left [ 3 \right ] $ & Devise an algorithm that performs $a \cdot 2^b$ for generic values of $b$ \\
*ebfedea0SLionel Sambuc                      & in $O(n)$ time. \\
*ebfedea0SLionel Sambuc                      &\\
$\left [ 3 \right ] $ & Devise an efficient algorithm to multiply by small low Hamming  \\
                      & weight values such as $3$, $5$ and $9$.  Extend it to handle all values \\
                      & up to $64$ with a Hamming weight less than three. \\
*ebfedea0SLionel Sambuc                      &\\
*ebfedea0SLionel Sambuc$\left [ 2 \right ] $ & Modify the preceding algorithm to handle values of the form \\
*ebfedea0SLionel Sambuc                      & $2^k - 1$ as well. \\
*ebfedea0SLionel Sambuc                      &\\
*ebfedea0SLionel Sambuc$\left [ 3 \right ] $ & Using only algorithms mp\_mul\_2, mp\_div\_2 and mp\_add create an \\
*ebfedea0SLionel Sambuc                      & algorithm to multiply two integers in roughly $O(2n^2)$ time for \\
*ebfedea0SLionel Sambuc                      & any $n$-bit input.  Note that the time of addition is ignored in the \\
*ebfedea0SLionel Sambuc                      & calculation.  \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 5 \right ] $ & Improve the previous algorithm to have a working time of at most \\
*ebfedea0SLionel Sambuc                      & $O \left (2^{(k-1)}n + \left ({2n^2 \over k} \right ) \right )$ for an appropriate choice of $k$.  Again ignore \\
*ebfedea0SLionel Sambuc                      & the cost of addition. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ] $ & Devise a chart to find optimal values of $k$ for the previous problem \\
*ebfedea0SLionel Sambuc                      & for $n = 64 \ldots 1024$ in steps of $64$. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ] $ & Using only algorithms mp\_abs and mp\_sub devise another method for \\
*ebfedea0SLionel Sambuc                      & calculating the result of a signed comparison. \\
*ebfedea0SLionel Sambuc                      &
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Multiplication and Squaring}
*ebfedea0SLionel Sambuc\section{The Multipliers}
*ebfedea0SLionel SambucFor most number theoretic problems including certain public key cryptographic algorithms, the ``multipliers'' form the most important subset of
algorithms of any multiple precision integer package.  The set of multiplier algorithms includes integer multiplication, squaring and modular reduction
*ebfedea0SLionel Sambucwhere in each of the algorithms single precision multiplication is the dominant operation performed.  This chapter will discuss integer multiplication
*ebfedea0SLionel Sambucand squaring, leaving modular reductions for the subsequent chapter.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe importance of the multiplier algorithms is for the most part driven by the fact that certain popular public key algorithms are based on modular
*ebfedea0SLionel Sambucexponentiation, that is computing $d \equiv a^b \mbox{ (mod }c\mbox{)}$ for some arbitrary choice of $a$, $b$, $c$ and $d$.  During a modular
*ebfedea0SLionel Sambucexponentiation the majority\footnote{Roughly speaking a modular exponentiation will spend about 40\% of the time performing modular reductions,
*ebfedea0SLionel Sambuc35\% of the time performing squaring and 25\% of the time performing multiplications.} of the processor time is spent performing single precision
*ebfedea0SLionel Sambucmultiplications.
*ebfedea0SLionel Sambuc
For centuries general purpose multiplication has required a lengthy $O(n^2)$ process, whereby each digit of one multiplicand has to be multiplied
against every digit of the other multiplicand.  Traditional long-hand multiplication is based on this process;  while the techniques can differ the
overall algorithm used is essentially the same.  Only ``recently'' have faster algorithms been studied.  First Karatsuba multiplication was discovered in
1962.  This algorithm can multiply two numbers with considerably fewer single precision multiplications when compared to the long-hand approach.
This technique led to the discovery of polynomial basis algorithms and subsequently Fourier Transform based solutions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Multiplication}
*ebfedea0SLionel Sambuc\subsection{The Baseline Multiplication}
*ebfedea0SLionel Sambuc\label{sec:basemult}
*ebfedea0SLionel Sambuc\index{baseline multiplication}
Computing the product of two integers in software can be achieved using a trivial adaptation of the standard $O(n^2)$ long-hand multiplication
algorithm that school children are taught.  The algorithm is considered an $O(n^2)$ algorithm since for two $n$-digit inputs $n^2$ single precision
multiplications are required.  More specifically, for an $m$-digit and an $n$-digit input, $m \cdot n$ single precision multiplications are required.  To
simplify most discussions, it will be assumed that the inputs have a comparable number of digits.
*ebfedea0SLionel Sambuc
The ``baseline multiplication'' algorithm is designed to act as the ``catch-all'' algorithm, only to be used when the faster algorithms cannot be
used.  This algorithm does not use any particularly interesting optimizations and should ideally be avoided if possible.    One important
facet of this algorithm is that it has been modified to only produce a limited number of output digits.  The importance of this
modification will become evident during the discussion of Barrett modular reduction.  Recall that for an $n$-digit and an $m$-digit input the product
will be at most $n + m$ digits.  Therefore, this algorithm can be reduced to a full multiplier by having it produce $n + m$ digits of the product.
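
To make the idea of a truncated product concrete, the sketch below performs long-hand multiplication on bare arrays of 28-bit digits and keeps only the lowest $digs$ digits.  The name \texttt{sketch\_mul\_digs}, the constants and the use of \texttt{uint64\_t} as the double width type are assumptions for this example.  It is a simplified model, not the library routine: the output is written directly to a caller supplied array that must not overlap either input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28                                  /* assumed lg(beta) */
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1)  /* beta - 1 */

/* Computes the low digs digits of a * b into t[], which must hold digs
 * digits and must not overlap a[] or b[].  Returns the used count of t[]. */
size_t sketch_mul_digs(const uint32_t *a, size_t na,
                       const uint32_t *b, size_t nb,
                       uint32_t *t, size_t digs)
{
    size_t ix, iy;

    assert(a != NULL && b != NULL && t != NULL);
    for (ix = 0; ix < digs; ix++) {
        t[ix] = 0;
    }
    for (ix = 0; ix < na && ix < digs; ix++) {
        uint64_t u = 0;                                   /* carry along this row */
        size_t pb = (nb < digs - ix) ? nb : digs - ix;    /* digits of b that still matter */
        for (iy = 0; iy < pb; iy++) {
            uint64_t r = (uint64_t)t[ix + iy] + (uint64_t)a[ix] * b[iy] + u;
            t[ix + iy] = (uint32_t)(r & DIGIT_MASK);      /* low digit of the column */
            u = r >> DIGIT_BIT;                           /* carry to the next column */
        }
        if (ix + iy < digs) {
            t[ix + iy] = (uint32_t)u;                     /* row carry, if it is wanted */
        }
    }
    while (digs > 0 && t[digs - 1] == 0) {                /* clamp excess digits */
        digs--;
    }
    return digs;
}
```

Squaring $\beta^2 - 1$ with $digs = 4$ gives the full product $\beta^4 - 2\beta^2 + 1$, while $digs = 2$ stops early and returns only that product modulo $\beta^2$, which is one.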
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRecall from ~GAMMA~ the definition of $\gamma$ as the number of bits in the type \textbf{mp\_digit}.  We shall now extend the variable set to
*ebfedea0SLionel Sambucinclude $\alpha$ which shall represent the number of bits in the type \textbf{mp\_word}.  This implies that $2^{\alpha} > 2 \cdot \beta^2$.  The
*ebfedea0SLionel Sambucconstant $\delta = 2^{\alpha - 2lg(\beta)}$ will represent the maximal weight of any column in a product (\textit{see ~COMBA~ for more information}).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_mul\_digs}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If min$(a.used, b.used) < \delta$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  Calculate $c = \vert a \vert \cdot \vert b \vert$ by the Comba method (\textit{see algorithm~\ref{fig:COMBAMULT}}).  \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  Return the result of step 1.1 \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucAllocate and initialize a temporary mp\_int. \\
*ebfedea0SLionel Sambuc2.  Init $t$ to be of size $digs$ \\
*ebfedea0SLionel Sambuc3.  If step 2 failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc4.  $t.used \leftarrow digs$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCompute the product. \\
*ebfedea0SLionel Sambuc5.  for $ix$ from $0$ to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $u \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $pb \leftarrow \mbox{min}(b.used, digs - ix)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  If $pb < 1$ then goto step 6. \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.4  for $iy$ from $0$ to $pb - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.1  $\hat r \leftarrow t_{iy + ix} + a_{ix} \cdot b_{iy} + u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.2  $t_{iy + ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.3  $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.5  if $ix + pb < digs$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.5.1  $t_{ix + pb} \leftarrow u$ \\
*ebfedea0SLionel Sambuc6.  Clamp excess digits of $t$. \\
*ebfedea0SLionel Sambuc7.  Swap $c$ with $t$ \\
*ebfedea0SLionel Sambuc8.  Clear $t$ \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_mul\_digs}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm s\_mp\_mul\_digs.}
*ebfedea0SLionel SambucThis algorithm computes the unsigned product of two inputs $a$ and $b$, limited to an output precision of $digs$ digits.  While it may seem
*ebfedea0SLionel Sambuca bit awkward to modify the function from its simple $O(n^2)$ description, the usefulness of partial multipliers will arise in a subsequent
*ebfedea0SLionel Sambucalgorithm.  The algorithm is loosely based on algorithm 14.12 from \cite[pp. 595]{HAC} and is similar to Algorithm M of Knuth \cite[pp. 268]{TAOCPV2}.
*ebfedea0SLionel SambucAlgorithm s\_mp\_mul\_digs differs from these cited references since it can produce a variable output precision regardless of the precision of the
*ebfedea0SLionel Sambucinputs.
*ebfedea0SLionel Sambuc
The first thing this algorithm checks for is whether a Comba multiplier can be used instead.  If the smaller of the two input digit counts
is less than $\delta$, then the Comba method is used and its result is returned.  Once the Comba method has been ruled out, the baseline algorithm begins.  A
temporary mp\_int variable $t$ is used to hold the intermediate result of the product.  This allows the algorithm to be used to
compute products when either $a = c$ or $b = c$ without overwriting the inputs.
*ebfedea0SLionel Sambuc
All of step 5 is the infamous $O(n^2)$ multiplication loop slightly modified to only produce up to $digs$ digits of output.  The $pb$ variable
is given the count of digits to read from $b$ inside the nested loop.  If $pb < 1$ then no more output digits can be produced and the algorithm
will exit the loop.  The best way to think of the loops is as a series of $pb \times 1$ multiplications.  That is, in each complete run of the
innermost loop $a_{ix}$ is multiplied against $b$ and the result is added (\textit{with an appropriate shift}) to $t$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, consider multiplying $576$ by $241$.  That is equivalent to computing $10^0(1)(576) + 10^1(4)(576) + 10^2(2)(576)$ which is best
*ebfedea0SLionel Sambucvisualized in the following table.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|c|c|c|l|}
*ebfedea0SLionel Sambuc\hline   &&          & 5 & 7 & 6 & \\
*ebfedea0SLionel Sambuc\hline   $\times$&&  & 2 & 4 & 1 & \\
*ebfedea0SLionel Sambuc\hline &&&&&&\\
*ebfedea0SLionel Sambuc  &&          & 5 & 7 & 6 & $10^0(1)(576)$ \\
*ebfedea0SLionel Sambuc  &2 &   3    & 6 & 1 & 6 & $10^1(4)(576) + 10^0(1)(576)$ \\
*ebfedea0SLionel Sambuc  1 & 3 & 8 & 8 & 1 & 6 &   $10^2(2)(576) + 10^1(4)(576) + 10^0(1)(576)$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Long-Hand Multiplication Diagram}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
Each row of the product is added to the result after being shifted to the left (\textit{multiplied by a power of the radix}) by the appropriate
count.  That is, in pass $ix$ of the outer loop the product is added starting at the $ix$'th digit of the result.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucStep 5.4.1 introduces the hat symbol (\textit{e.g. $\hat r$}) which represents a double precision variable.  The multiplication on that step
*ebfedea0SLionel Sambucis assumed to be a double wide output single precision multiplication.  That is, two single precision variables are multiplied to produce a
*ebfedea0SLionel Sambucdouble precision result.  The step is somewhat optimized from a long-hand multiplication algorithm because the carry from the addition in step
*ebfedea0SLionel Sambuc5.4.1 is propagated through the nested loop.  If the carry was not propagated immediately it would overflow the single precision digit
*ebfedea0SLionel Sambuc$t_{ix+iy}$ and the result would be lost.
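
The claim that a double precision $\hat r$ suffices can be checked directly.  Assuming, as a sketch, that the digits $t_{iy + ix}$, $a_{ix}$, $b_{iy}$ and
the carry $u$ are all at most $\beta - 1$ when step 5.4.1 is entered, the largest value the step can produce is

\begin{equation}
\hat r \le (\beta - 1) + (\beta - 1)^2 + (\beta - 1) = \beta^2 - 1
\end{equation}

Hence $\hat r < \beta^2 < 2^{\alpha}$, and the new carry $\lfloor \hat r / \beta \rfloor$ is again at most $\beta - 1$, so the assumption holds for the
next iteration.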
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt step 5.5 the nested loop is finished and any carry that was left over should be forwarded.  The carry does not have to be added to the $ix+pb$'th
*ebfedea0SLionel Sambucdigit since that digit is assumed to be zero at this point.  However, if $ix + pb \ge digs$ the carry is not set as it would make the result
*ebfedea0SLionel Sambucexceed the precision requested.
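
The steps above can be sketched in portable C as follows.  This is an illustrative sketch only and not the LibTomMath source: the function name
\texttt{baseline\_mul\_digs}, the bare digit arrays (\textit{least significant digit first}) and the explicit \texttt{radix} parameter are assumptions
made so that the long-hand example with $\beta = 10$ can be reproduced.  The Comba shortcut of step 1 and the memory handling of steps 2 to 4 are omitted.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: low `digs` digits of a * b.  Each digit is
 * less than `radix`; t needs room for `digs` digits and must not
 * overlap a or b. */
static void baseline_mul_digs(const uint32_t *a, size_t na,
                              const uint32_t *b, size_t nb,
                              uint32_t *t, size_t digs,
                              uint32_t radix)
{
    size_t ix, iy, pb;

    memset(t, 0, digs * sizeof(*t));
    /* ix < digs plays the role of the "pb < 1" exit in step 5.3 */
    for (ix = 0; ix < na && ix < digs; ix++) {
        uint32_t u = 0;                 /* carry, step 5.1 */
        pb = (nb < digs - ix) ? nb : digs - ix;
        for (iy = 0; iy < pb; iy++) {
            /* double precision sum, at most radix^2 - 1 */
            uint64_t r = (uint64_t)t[ix + iy]
                       + (uint64_t)a[ix] * b[iy] + u;
            t[ix + iy] = (uint32_t)(r % radix);
            u = (uint32_t)(r / radix);
        }
        if (ix + pb < digs) {
            t[ix + pb] = u;             /* digit is still zero */
        }
    }
}
\end{verbatim}

With the digits of $576$ and $241$, $\beta = 10$ and $digs = 6$ the sketch yields the digits of $138816$, while $digs = 3$ yields only the three
least significant digits $816$.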
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_s_mp_mul_digs.c
*ebfedea0SLionel Sambuc
First we determine (line @30,if@) whether the Comba method can be used, since it is faster.  The conditions for
using the Comba routine are that min$(a.used, b.used) < \delta$ and the number of digits of output is less than
\textbf{MP\_WARRAY}.  This new constant is used to control the stack usage in the Comba routines.  By default it is
set to $\delta$ but can be reduced when memory is at a premium.
*ebfedea0SLionel Sambuc
If we cannot use the Comba method we proceed to set up the baseline routine.  We allocate the destination mp\_int
$t$ (line @36,init@) to the exact size of the output to avoid further re--allocations.  At this point we now
begin the $O(n^2)$ loop.
*ebfedea0SLionel Sambuc
This implementation of multiplication has the distinguishing feature that it can be trimmed to produce only a requested number of
digits as output.  In each iteration of the outer loop the $pb$ variable is set (line @48,MIN@) to the maximum
number of inner loop iterations.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucInside the inner loop we calculate $\hat r$ as the mp\_word product of the two mp\_digits and the addition of the
*ebfedea0SLionel Sambuccarry from the previous iteration.  A particularly important observation is that most modern optimizing
*ebfedea0SLionel SambucC compilers (GCC for instance) can recognize that a $N \times N \rightarrow 2N$ multiplication is all that
*ebfedea0SLionel Sambucis required for the product.  In x86 terms for example, this means using the MUL instruction.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEach digit of the product is stored in turn (line @68,tmpt@) and the carry propagated (line @71,>>@) to the
*ebfedea0SLionel Sambucnext iteration.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Faster Multiplication by the ``Comba'' Method}
*ebfedea0SLionel SambucMARK,COMBA
*ebfedea0SLionel Sambuc
One of the huge drawbacks of the ``baseline'' algorithms is that at the $O(n^2)$ level the carry must be
computed and propagated upwards.  This makes the nested loop very sequential and hard to unroll and implement
in parallel.  The ``Comba'' \cite{COMBA} method is named after the little-known (\textit{in cryptographic venues}) Paul G.
Comba, who described a method of implementing fast multipliers that do not require nested carry fixup operations.  As an
interesting aside, it seems that Paul Barrett describes a similar technique in his 1986 paper \cite{BARRETT}, written
five years before.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt the heart of the Comba technique is once again the long-hand algorithm.  Except in this case a slight
*ebfedea0SLionel Sambuctwist is placed on how the columns of the result are produced.  In the standard long-hand algorithm rows of products
*ebfedea0SLionel Sambucare produced then added together to form the final result.  In the baseline algorithm the columns are added together
*ebfedea0SLionel Sambucafter each iteration to get the result instantaneously.
*ebfedea0SLionel Sambuc
In the Comba algorithm the columns of the result are produced entirely independently of each other.  That is, at
the $O(n^2)$ level a simple multiplication and addition step is performed.  The carries of the columns are propagated
after the nested loop to reduce the amount of work required.  Succinctly, the first step of the algorithm is to compute
the product vector $\vec x$ as follows.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\vec x_n = \sum_{i+j = n} a_ib_j, \forall n \in \lbrace 0, 1, 2, \ldots, i + j \rbrace
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
Where $\vec x_n$ is the $n$'th column of the output vector.  Consider the following example which computes the vector $\vec x$ for the multiplication
of $576$ and $241$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|c|c|c|}
*ebfedea0SLionel Sambuc  \hline &          & 5 & 7 & 6 & First Input\\
*ebfedea0SLionel Sambuc  \hline $\times$ & & 2 & 4 & 1 & Second Input\\
*ebfedea0SLionel Sambuc\hline            &                        & $1 \cdot 5 = 5$   & $1 \cdot 7 = 7$   & $1 \cdot 6 = 6$ & First pass \\
*ebfedea0SLionel Sambuc                  &  $4 \cdot 5 = 20$      & $4 \cdot 7+5=33$  & $4 \cdot 6+7=31$  & 6               & Second pass \\
*ebfedea0SLionel Sambuc   $2 \cdot 5 = 10$ &  $2 \cdot 7 + 20 = 34$ & $2 \cdot 6+33=45$ & 31                & 6             & Third pass \\
*ebfedea0SLionel Sambuc\hline 10 & 34 & 45 & 31 & 6 & Final Result \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Comba Multiplication Diagram}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
At this point the vector $\vec x = \left < 10, 34, 45, 31, 6 \right >$ is the result of the first step of the Comba multiplier.
Now the columns must be fixed by propagating the carry upwards.  The resultant vector will have one extra dimension over the input vector, which is
equivalent to adding a leading zero digit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Comba Fixup}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Vector $\vec x$ of dimension $k$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  Vector $\vec x$ such that the carries have been propagated. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  for $n$ from $0$ to $k - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1 $\vec x_{n+1} \leftarrow \vec x_{n+1} + \lfloor \vec x_{n}/\beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2 $\vec x_{n} \leftarrow \vec x_{n} \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc2.  Return($\vec x$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Comba Fixup}
*ebfedea0SLionel Sambuc\end{figure}
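
As an illustrative trace (\textit{worked by hand, not taken from the LibTomMath source}), consider the column vector from the Comba multiplication
diagram with $\beta = 10$.  Indexing the columns from the least significant end gives $\vec x_0 = 6$, $\vec x_1 = 31$, $\vec x_2 = 45$,
$\vec x_3 = 34$, $\vec x_4 = 10$ and the extra leading column $\vec x_5 = 0$.  Each pass of step 1 then proceeds as follows.

\begin{center}
\begin{tabular}{|c|c|c|c|}
\hline $n$ & Carry $\lfloor \vec x_n / \beta \rfloor$ & $\vec x_{n+1}$ after step 1.1 & $\vec x_n$ after step 1.2 \\
\hline $0$ & $0$ & $31$ & $6$ \\
\hline $1$ & $3$ & $48$ & $1$ \\
\hline $2$ & $4$ & $38$ & $8$ \\
\hline $3$ & $3$ & $13$ & $8$ \\
\hline $4$ & $1$ & $1$  & $3$ \\
\hline
\end{tabular}
\end{center}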
*ebfedea0SLionel Sambuc
With that algorithm and $k = 5$ and $\beta = 10$ the following vector is produced $\vec x= \left < 1, 3, 8, 8, 1, 6 \right >$.  In this case
$241 \cdot 576$ is in fact $138816$ and the procedure succeeded.  If the algorithm is correct and, as will be demonstrated shortly, more
efficient than the baseline algorithm, why not simply always use this algorithm?
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Column Weight.}
At the nested $O(n^2)$ level the Comba method adds the product of two single precision variables to each column of the output
independently.  A serious obstacle arises if the carry is lost, due to lack of precision, before the algorithm has a chance to fix
the carries.  For example, in the multiplication of two three-digit numbers the third column of output will be the sum of
three single precision multiplications.  If the accumulator for the output digits cannot represent values up to $3 \cdot (\beta - 1)^2$ then
an overflow can occur and the carry information will be lost.  For any $m$ and $n$ digit inputs the maximum weight of any column is
min$(m, n)$, since no column can contain more products than the shorter input has digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe maximum number of terms in any column of a product is known as the ``column weight'' and strictly governs when the algorithm can be used.  Recall
*ebfedea0SLionel Sambucfrom earlier that a double precision type has $\alpha$ bits of resolution and a single precision digit has $lg(\beta)$ bits of precision.  Given these
*ebfedea0SLionel Sambuctwo quantities we must not violate the following
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuck \cdot \left (\beta - 1 \right )^2 < 2^{\alpha}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhich reduces to
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuck \cdot \left ( \beta^2 - 2\beta + 1 \right ) < 2^{\alpha}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $\rho = lg(\beta)$ represent the number of bits in a single precision digit.  By further re-arrangement of the equation the final solution is
*ebfedea0SLionel Sambucfound.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuck  < {{2^{\alpha}} \over {\left (2^{2\rho} - 2^{\rho + 1} + 1 \right )}}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
The defaults for LibTomMath are $\beta = 2^{28}$ and $\alpha = 64$ which means that $k$ is bounded by $k < 257$.  In this configuration
the smaller input may not have more than $256$ digits if the Comba method is to be used.  This is quite satisfactory for most applications since
$256$ digits would allow for numbers in the range of $0 \le x < 2^{7168}$, which is much larger than most public key cryptographic algorithms require.
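
The bound can be checked by substituting the default values $\rho = 28$ and $\alpha = 64$ into the inequality for $k$.

\begin{equation}
k < {{2^{64}} \over {2^{56} - 2^{29} + 1}} = {{2^{8}} \over {1 - 2^{-27} + 2^{-56}}} \approx 256.000002
\end{equation}

The largest integer that satisfies the inequality is therefore $k = 256$, which agrees with $\delta = 2^{\alpha - 2lg(\beta)} = 2^{64 - 56} = 256$.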
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{fast\_s\_mp\_mul\_digs}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$, mp\_int $b$ and an integer $digs$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow \vert a \vert \cdot \vert b \vert \mbox{ (mod }\beta^{digs}\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucPlace an array of \textbf{MP\_WARRAY} single precision digits named $W$ on the stack. \\
*ebfedea0SLionel Sambuc1.  If $c.alloc < digs$ then grow $c$ to $digs$ digits. (\textit{mp\_grow}) \\
*ebfedea0SLionel Sambuc2.  If step 1 failed return(\textit{MP\_MEM}).\\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc3.  $pa \leftarrow \mbox{MIN}(digs, a.used + b.used)$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc4.  $\_ \hat W \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  for $ix$ from 0 to $pa - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $ty \leftarrow \mbox{MIN}(b.used - 1, ix)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $tx \leftarrow ix - ty$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.4  for $iz$ from 0 to $iy - 1$ do \\
\hspace{6mm}5.4.1  $\_ \hat W \leftarrow \_ \hat W + a_{tx+iz}b_{ty-iz}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.5  $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$\\
*ebfedea0SLionel Sambuc\hspace{3mm}5.6  $\_ \hat W \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc6.  $oldused \leftarrow c.used$ \\
*ebfedea0SLionel Sambuc7.  $c.used \leftarrow digs$ \\
*ebfedea0SLionel Sambuc8.  for $ix$ from $0$ to $pa$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $c_{ix} \leftarrow W_{ix}$ \\
*ebfedea0SLionel Sambuc9.  for $ix$ from $pa + 1$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1 $c_{ix} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc10.  Clamp $c$. \\
*ebfedea0SLionel Sambuc11.  Return MP\_OKAY. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm fast\_s\_mp\_mul\_digs}
*ebfedea0SLionel Sambuc\label{fig:COMBAMULT}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm fast\_s\_mp\_mul\_digs.}
*ebfedea0SLionel SambucThis algorithm performs the unsigned multiplication of $a$ and $b$ using the Comba method limited to $digs$ digits of precision.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe outer loop of this algorithm is more complicated than that of the baseline multiplier.  This is because on the inside of the
*ebfedea0SLionel Sambucloop we want to produce one column per pass.  This allows the accumulator $\_ \hat W$ to be placed in CPU registers and
*ebfedea0SLionel Sambucreduce the memory bandwidth to two \textbf{mp\_digit} reads per iteration.
*ebfedea0SLionel Sambuc
The $ty$ variable is set to the minimum of $ix$ and $b.used - 1$, the index of the last digit in $b$.  That way if $a$ has more digits than
$b$ this will be limited to $b.used - 1$.  The $tx$ variable is set to the distance (\textit{or zero}) by which the variable
$ix$ is past $b.used - 1$.  This is used for the immediately subsequent statement where we find $iy$.
*ebfedea0SLionel Sambuc
The variable $iy$ is the number of digits we can read from $a$ and $b$ before running out of either.  Computing one column at a time
means we have to scan one integer upwards and the other downwards.  $a$ starts at $tx$ and $b$ starts at $ty$.  In each
pass we are producing the $ix$'th output column and we note that $tx + ty = ix$.  As we move $tx$ upwards we have to
move $ty$ downwards so the equality remains valid.  The $iy$ variable is the number of iterations until
$tx \ge a.used$ or $ty < 0$ occurs.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter every inner pass we store the lower half of the accumulator into $W_{ix}$ and then propagate the carry of the accumulator
*ebfedea0SLionel Sambucinto the next round by dividing $\_ \hat W$ by $\beta$.
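
The column-wise scan described above can be sketched in C as follows.  This is a minimal sketch under stated assumptions and not the LibTomMath
routine: the name \texttt{comba\_mul\_digs}, the bare digit arrays (\textit{least significant digit first}) and the \texttt{radix} parameter are
hypothetical, both inputs are assumed to have at least one digit, and the caller is assumed to respect the column weight restriction so that
the 64-bit accumulator cannot overflow.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: low `digs` digits of a * b, one output
 * column per pass of the outer loop.  Requires na >= 1, nb >= 1
 * and room for `digs` digits in w. */
static void comba_mul_digs(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb,
                           uint32_t *w, size_t digs,
                           uint32_t radix)
{
    uint64_t acc = 0;            /* the accumulator _W */
    size_t pa = (digs < na + nb) ? digs : na + nb;
    size_t ix, iz;

    for (ix = 0; ix < pa; ix++) {
        /* b is scanned downwards from ty, a upwards from tx */
        size_t ty = (nb - 1 < ix) ? nb - 1 : ix;
        size_t tx = ix - ty;
        size_t iy = (na - tx < ty + 1) ? na - tx : ty + 1;

        for (iz = 0; iz < iy; iz++) {
            acc += (uint64_t)a[tx + iz] * b[ty - iz];
        }
        w[ix] = (uint32_t)(acc % radix);  /* column digit */
        acc /= radix;                     /* carry forward */
    }
    for (; ix < digs; ix++) {
        w[ix] = 0;               /* unused high digits */
    }
}
\end{verbatim}

For the digits of $576$ and $241$ with $\beta = 10$ the accumulator takes the values $6$, $31$, $48$, $38$, $13$ and $1$ before each store,
which gives the digits of $138816$.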
*ebfedea0SLionel Sambuc
To measure the benefits of the Comba method over the baseline method consider the number of operations that are required.  If the
cost in terms of time of a multiply and addition is $p$ and the cost of a carry propagation is $q$ then a baseline multiplication would require
$O \left ((p + q)n^2 \right )$ time to multiply two $n$-digit numbers.  The Comba method requires only $O(pn^2 + qn)$ time; in practice, however,
the observed speed increase is actually much larger.  With $O(n)$ space the algorithm can be reduced to $O(pn + qn)$ time by implementing the $n$ multiply
and addition operations in the nested loop in parallel.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_fast_s_mp_mul_digs.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAs per the pseudo--code we first calculate $pa$ (line @47,MIN@) as the number of digits to output.  Next we begin the outer loop
*ebfedea0SLionel Sambucto produce the individual columns of the product.  We use the two aliases $tmpx$ and $tmpy$ (lines @61,tmpx@, @62,tmpy@) to point
*ebfedea0SLionel Sambucinside the two multiplicands quickly.
*ebfedea0SLionel Sambuc
The inner loop (lines @70,for@ to @72,}@) of this implementation is where the tradeoff comes into play.  Originally this Comba
implementation was ``row--major'' which means it adds to each of the columns in each pass.  After the outer loop it would then fix
the carries.  This was very fast except it had an annoying drawback.  You had to read an mp\_word and two mp\_digits and write
one mp\_word per iteration.  On processors such as the Athlon XP and P4 this did not matter much since the cache bandwidth
is very high and it can keep the ALU fed with data.  It did, however, matter on older and embedded CPUs where cache is often
slower and sometimes does not exist at all.  This new algorithm only performs two reads per iteration under the assumption that the
compiler has aliased $\_ \hat W$ to a CPU register.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the inner loop we store the current accumulator in $W$ and shift $\_ \hat W$ (lines @75,W[ix]@, @78,>>@) to forward it as
*ebfedea0SLionel Sambuca carry for the next pass.  After the outer loop we use the final carry (line @82,W[ix]@) as the last digit of the product.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Polynomial Basis Multiplication}
*ebfedea0SLionel SambucTo break the $O(n^2)$ barrier in multiplication requires a completely different look at integer multiplication.  In the following algorithms
*ebfedea0SLionel Sambucthe use of polynomial basis representation for two integers $a$ and $b$ as $f(x) = \sum_{i=0}^{n} a_i x^i$ and
*ebfedea0SLionel Sambuc$g(x) = \sum_{i=0}^{n} b_i x^i$ respectively, is required.  In this system both $f(x)$ and $g(x)$ have $n + 1$ terms and are of the $n$'th degree.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe product $a \cdot b \equiv f(x)g(x)$ is the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$.  The coefficients $w_i$ will
*ebfedea0SLionel Sambucdirectly yield the desired product when $\beta$ is substituted for $x$.  The direct solution to solve for the $2n + 1$ coefficients
*ebfedea0SLionel Sambucrequires $O(n^2)$ time and would in practice be slower than the Comba technique.
*ebfedea0SLionel Sambuc
However, numerical analysis theory indicates that only $2n + 1$ distinct points in $W(x)$ are required to determine the values of the $2n + 1$ unknown
coefficients.   This means by finding $\zeta_y = W(y)$ for $2n + 1$ small values of $y$ the coefficients of $W(x)$ can be found with
Gaussian elimination.  This technique is also occasionally referred to as the \textit{interpolation technique} (\textit{references please...}) since in
effect an interpolation based on $2n + 1$ points will yield a polynomial equivalent to $W(x)$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe coefficients of the polynomial $W(x)$ are unknown which makes finding $W(y)$ for any value of $y$ impossible.  However, since
*ebfedea0SLionel Sambuc$W(x) = f(x)g(x)$ the equivalent $\zeta_y = f(y) g(y)$ can be used in its place.  The benefit of this technique stems from the
*ebfedea0SLionel Sambucfact that $f(y)$ and $g(y)$ are much smaller than either $a$ or $b$ respectively.  As a result finding the $2n + 1$ relations required
*ebfedea0SLionel Sambucby multiplying $f(y)g(y)$ involves multiplying integers that are much smaller than either of the inputs.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhen picking points to gather relations there are always three obvious points to choose, $y = 0, 1$ and $ \infty$.  The $\zeta_0$ term
*ebfedea0SLionel Sambucis simply the product $W(0) = w_0 = a_0 \cdot b_0$.  The $\zeta_1$ term is the product
*ebfedea0SLionel Sambuc$W(1) = \left (\sum_{i = 0}^{n} a_i \right ) \left (\sum_{i = 0}^{n} b_i \right )$.  The third point $\zeta_{\infty}$ is less obvious but rather
*ebfedea0SLionel Sambucsimple to explain.  The $2n + 1$'th coefficient of $W(x)$ is numerically equivalent to the most significant column in an integer multiplication.
*ebfedea0SLionel SambucThe point at $\infty$ is used symbolically to represent the most significant column, that is $W(\infty) = w_{2n} = a_nb_n$.  Note that the
*ebfedea0SLionel Sambucpoints at $y = 0$ and $\infty$ yield the coefficients $w_0$ and $w_{2n}$ directly.
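
As a worked illustration (\textit{using the earlier example rather than anything from the LibTomMath source}), take $576 \cdot 241$ with $\beta = 10$ and
$n = 2$, so that $f(x) = 5x^2 + 7x + 6$ and $g(x) = 2x^2 + 4x + 1$.  The three obvious points are then

\begin{eqnarray}
\zeta_0        & = & 6 \cdot 1 = 6 = w_0 \nonumber \\
\zeta_1        & = & (5 + 7 + 6)(2 + 4 + 1) = 18 \cdot 7 = 126 = w_4 + w_3 + w_2 + w_1 + w_0 \nonumber \\
\zeta_{\infty} & = & 5 \cdot 2 = 10 = w_4 \nonumber
\end{eqnarray}

These values agree with the column vector $\left < 10, 34, 45, 31, 6 \right >$ found by the Comba method, whose entries sum to $126$.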
*ebfedea0SLionel Sambuc
If more points are required they should be of small values and powers of two such as $2^q$ and the related \textit{mirror points}
$\left (2^q \right )^{2n}  \cdot \zeta_{2^{-q}}$ for small values of $q$.  The term ``mirror point'' stems from the fact that
$\left (2^q \right )^{2n}  \cdot \zeta_{2^{-q}}$ can be calculated in the exact opposite fashion as $\zeta_{2^q}$.  For
example, when $n = 2$ and $q = 1$ the following two equations are equivalent to the point $\zeta_{2}$ and its mirror.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{eqnarray}
\zeta_{2}                  & = & f(2)g(2) = (4a_2 + 2a_1 + a_0)(4b_2 + 2b_1 + b_0) \nonumber \\
16 \cdot \zeta_{1 \over 2} & = & 4f({1\over 2}) \cdot 4g({1 \over 2}) = (a_2 + 2a_1 + 4a_0)(b_2 + 2b_1 + 4b_0)
*ebfedea0SLionel Sambuc\end{eqnarray}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucUsing such points will allow the values of $f(y)$ and $g(y)$ to be independently calculated using only left shifts.  For example, when $n = 2$ the
*ebfedea0SLionel Sambucpolynomial $f(2^q)$ is equal to $2^q((2^qa_2) + a_1) + a_0$.  This technique of polynomial representation is known as Horner's method.
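
The shift-only evaluation can be sketched in C as follows.  The helper names \texttt{eval\_pow2} and \texttt{eval\_mirror} are hypothetical, and for
brevity the coefficients are assumed to be small enough that the results fit in 64 bits; in a multiple precision setting each coefficient would
itself be an mp\_int.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers.  coef[0..n] holds a_0 .. a_n. */

/* f(2^q) by Horner's method: shifts and additions only. */
static uint64_t eval_pow2(const uint32_t *coef, size_t n,
                          unsigned q)
{
    uint64_t acc = coef[n];
    size_t i;

    for (i = n; i-- > 0; ) {
        acc = (acc << q) + coef[i];
    }
    return acc;
}

/* Mirror point (2^q)^n * f(2^-q): the same recurrence with the
 * coefficients taken in the opposite order. */
static uint64_t eval_mirror(const uint32_t *coef, size_t n,
                            unsigned q)
{
    uint64_t acc = coef[0];
    size_t i;

    for (i = 1; i <= n; i++) {
        acc = (acc << q) + coef[i];
    }
    return acc;
}
\end{verbatim}

For $f(x) = 5x^2 + 7x + 6$, $g(x) = 2x^2 + 4x + 1$ and $q = 1$ this gives $f(2) = 40$ and $g(2) = 17$, hence $\zeta_2 = 680$, while the mirror values
are $43$ and $14$, hence $16 \cdot \zeta_{1 \over 2} = 602$.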
*ebfedea0SLionel Sambuc
As a general rule of the algorithm, when the inputs are split into $n$ parts each, there are $2n - 1$ multiplications.  Each multiplication is of
multiplicands that have $n$ times fewer digits than the inputs.  The asymptotic running time of this algorithm is
$O \left ( k^{lg_n(2n - 1)} \right )$ for $k$ digit inputs (\textit{assuming they have the same number of digits}).  Figure~\ref{fig:exponent}
summarizes the exponents for various values of $n$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Split into $n$ Parts} & \textbf{Exponent}  & \textbf{Notes}\\
*ebfedea0SLionel Sambuc\hline $2$ & $1.584962501$ & This is Karatsuba Multiplication. \\
*ebfedea0SLionel Sambuc\hline $3$ & $1.464973520$ & This is Toom-Cook Multiplication. \\
*ebfedea0SLionel Sambuc\hline $4$ & $1.403677461$ &\\
*ebfedea0SLionel Sambuc\hline $5$ & $1.365212389$ &\\
*ebfedea0SLionel Sambuc\hline $10$ & $1.278753601$ &\\
*ebfedea0SLionel Sambuc\hline $100$ & $1.149426538$ &\\
*ebfedea0SLionel Sambuc\hline $1000$ & $1.100270931$ &\\
*ebfedea0SLionel Sambuc\hline $10000$ & $1.075252070$ &\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Asymptotic Running Time of Polynomial Basis Multiplication}
*ebfedea0SLionel Sambuc\label{fig:exponent}
*ebfedea0SLionel Sambuc\end{figure}
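
The exponent can be motivated with a short sketch, under the assumption that for a fixed $n$ the evaluation of the points and the solution of the
linear system cost only $O(k)$ time.  Let $T(k)$ denote the time to multiply two $k$ digit inputs.  Each level of recursion replaces one
multiplication with $2n - 1$ multiplications of $k/n$ digits, so

\begin{equation}
T(k) = (2n - 1) \cdot T \left ( k / n \right ) + O(k)
\end{equation}

After $lg_n(k)$ levels the inputs are single digits and the number of single precision multiplications is

\begin{equation}
(2n - 1)^{lg_n(k)} = k^{lg_n(2n - 1)}
\end{equation}

For $n = 2$ this gives the exponent $lg_2(3) \approx 1.585$ shown in the first row of the table.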
*ebfedea0SLionel Sambuc
At first it may seem like a good idea to choose $n = 1000$ since the exponent is approximately $1.1$.  However, the overhead
of solving for the $2n - 1 = 1999$ terms of $W(x)$ will certainly consume any savings the algorithm could offer for all but exceedingly large
numbers.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Cutoff Point}
*ebfedea0SLionel SambucThe polynomial basis multiplication algorithms all require fewer single precision multiplications than a straight Comba approach.  However,
*ebfedea0SLionel Sambucthe algorithms incur an overhead (\textit{at the $O(n)$ work level}) since they require a system of equations to be solved.  This makes the
*ebfedea0SLionel Sambucpolynomial basis approach more costly to use with small inputs.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $m$ represent the number of digits in the multiplicands (\textit{assume both multiplicands have the same number of digits}).  There exists a
*ebfedea0SLionel Sambucpoint $y$ such that when $m < y$ the polynomial basis algorithms are more costly than Comba, when $m = y$ they are roughly the same cost and
*ebfedea0SLionel Sambucwhen $m > y$ the Comba methods are slower than the polynomial basis algorithms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe exact location of $y$ depends on several key architectural elements of the computer platform in question.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item  The ratio of clock cycles for single precision multiplication versus other simpler operations such as addition, shifting, etc.  For example
*ebfedea0SLionel Sambucon the AMD Athlon the ratio is roughly $17 : 1$ while on the Intel P4 it is $29 : 1$.  The higher the ratio in favour of multiplication the lower
*ebfedea0SLionel Sambucthe cutoff point $y$ will be.
*ebfedea0SLionel Sambuc
\item  The complexity of the linear system of equations (\textit{for the coefficients of $W(x)$}).  Generally speaking, as the number of splits
grows the complexity grows substantially.  Ideally solving the system will only involve addition, subtraction and shifting of integers.  This
directly reflects on the ratio previously mentioned.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\item  To a lesser extent memory bandwidth and function call overheads.  Provided the values are in the processor cache this is less of an
*ebfedea0SLionel Sambucinfluence over the cutoff point.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\end{enumerate}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucA clean cutoff point separation occurs when a point $y$ is found such that all of the cutoff point conditions are met.  For example, if the point
*ebfedea0SLionel Sambucis too low then there will be values of $m$ such that $m > y$ and the Comba method is still faster.  Finding the cutoff points is fairly simple when
*ebfedea0SLionel Sambuca high resolution timer is available.
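
As a rough illustration of how these elements interact, consider a deliberately simplified cost model.  The symbols $r$ and $c$ are assumptions introduced only for this sketch and are not measured values: $r$ is the cost of a single precision multiplication relative to a simple operation, and $c \cdot m$ is the number of simple operations one level of a two-way split spends on forming and solving its system of equations.  A baseline multiplication then costs about $r m^2$, while one level of splitting costs about $3r (m/2)^2 + cm$.  Setting the two equal gives the break-even point.

\begin{eqnarray*}
r m^2 & = & {3 \over 4} r m^2 + c m \\
{1 \over 4} r m & = & c \\
m & = & {4c \over r}
\end{eqnarray*}

Under this model the cutoff $y$ is proportional to $c / r$.  Doubling the relative cost of multiplication halves $y$, while a more complex system of equations (\textit{a larger $c$}) raises it, which matches the first two elements listed above.  Real cutoffs also depend on recursion depth and memory effects that the model ignores.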
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Karatsuba Multiplication}
*ebfedea0SLionel SambucKaratsuba \cite{KARA} multiplication when originally proposed in 1962 was among the first set of algorithms to break the $O(n^2)$ barrier for
*ebfedea0SLionel Sambucgeneral purpose multiplication.  Given two polynomial basis representations $f(x) = ax + b$ and $g(x) = cx + d$, Karatsuba proved with
*ebfedea0SLionel Sambuclight algebra \cite{KARAP} that the following polynomial is equivalent to multiplication of the two integers the polynomials represent.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucf(x) \cdot g(x) = acx^2 + ((a + b)(c + d) - (ac + bd))x + bd
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
Since the products $ac$ and $bd$ are re-used in the middle term, only three half sized multiplications are required to produce the product.  Applying
this algorithm recursively, the work factor becomes $O(n^{lg(3)})$ which is substantially better than the work factor $O(n^2)$ of the Comba technique.  What
Karatsuba did not know, or at least did not publish, is that this is simply polynomial basis multiplication with the points
$\zeta_0$, $\zeta_{\infty}$ and $\zeta_{1}$.  Consider the resultant system of equations.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{rcrcrcrc}
*ebfedea0SLionel Sambuc$\zeta_{0}$ &      $=$ &  &  &  & & $w_0$ \\
*ebfedea0SLionel Sambuc$\zeta_{1}$ &      $=$ & $w_2$ & $+$ & $w_1$ & $+$ & $w_0$ \\
*ebfedea0SLionel Sambuc$\zeta_{\infty}$ & $=$ & $w_2$ &  & &  & \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc
By subtracting the first and last equations from the equation in the middle the term $w_1$ can be isolated and all three coefficients solved for.  The simplicity
of this system of equations has made Karatsuba fairly popular.  In fact the cutoff point is often fairly low\footnote{With LibTomMath 0.18 it is 70 and 109 digits for the Intel P4 and AMD Athlon respectively.}
making it an ideal algorithm to speed up certain public key cryptosystems such as RSA and Diffie-Hellman.
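
To make the connection with the Karatsuba polynomial concrete, the three points can be written out for $f(x) = ax + b$ and $g(x) = cx + d$.  This is only a restatement of the system in terms of the inputs, taking $\zeta_{\infty}$ to mean the product of the leading coefficients.

\begin{eqnarray*}
\zeta_{0} & = & f(0) \cdot g(0) = bd = w_0 \\
\zeta_{\infty} & = & ac = w_2 \\
\zeta_{1} & = & f(1) \cdot g(1) = (a + b)(c + d) \\
w_1 & = & \zeta_{1} - \zeta_{0} - \zeta_{\infty} = (a + b)(c + d) - (ac + bd) = ad + bc
\end{eqnarray*}

The last line is the middle coefficient of the Karatsuba polynomial, obtained with the single extra multiplication $(a + b)(c + d)$.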
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_karatsuba\_mul}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and mp\_int $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow \vert a \vert \cdot \vert b \vert$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Init the following mp\_int variables: $x0$, $x1$, $y0$, $y1$, $t1$, $x0y0$, $x1y1$.\\
2.  If step 1 failed then return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucSplit the input.  e.g. $a = x1 \cdot \beta^B + x0$ \\
*ebfedea0SLionel Sambuc3.  $B \leftarrow \mbox{min}(a.used, b.used)/2$ \\
*ebfedea0SLionel Sambuc4.  $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\
*ebfedea0SLionel Sambuc5.  $y0 \leftarrow b \mbox{ (mod }\beta^B\mbox{)}$ \\
*ebfedea0SLionel Sambuc6.  $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\
*ebfedea0SLionel Sambuc7.  $y1 \leftarrow \lfloor b / \beta^B \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCalculate the three products. \\
*ebfedea0SLionel Sambuc8.  $x0y0 \leftarrow x0 \cdot y0$ (\textit{mp\_mul}) \\
*ebfedea0SLionel Sambuc9.  $x1y1 \leftarrow x1 \cdot y1$ \\
*ebfedea0SLionel Sambuc10.  $t1 \leftarrow x1 + x0$ (\textit{mp\_add}) \\
*ebfedea0SLionel Sambuc11.  $x0 \leftarrow y1 + y0$ \\
*ebfedea0SLionel Sambuc12.  $t1 \leftarrow t1 \cdot x0$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCalculate the middle term. \\
*ebfedea0SLionel Sambuc13.  $x0 \leftarrow x0y0 + x1y1$ \\
*ebfedea0SLionel Sambuc14.  $t1 \leftarrow t1 - x0$ (\textit{s\_mp\_sub}) \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCalculate the final product. \\
*ebfedea0SLionel Sambuc15.  $t1 \leftarrow t1 \cdot \beta^B$ (\textit{mp\_lshd}) \\
*ebfedea0SLionel Sambuc16.  $x1y1 \leftarrow x1y1 \cdot \beta^{2B}$ \\
*ebfedea0SLionel Sambuc17.  $t1 \leftarrow x0y0 + t1$ \\
*ebfedea0SLionel Sambuc18.  $c \leftarrow t1 + x1y1$ \\
*ebfedea0SLionel Sambuc19.  Clear all of the temporary variables. \\
*ebfedea0SLionel Sambuc20.  Return(\textit{MP\_OKAY}).\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_karatsuba\_mul}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_karatsuba\_mul.}
*ebfedea0SLionel SambucThis algorithm computes the unsigned product of two inputs using the Karatsuba multiplication algorithm.  It is loosely based on the description
*ebfedea0SLionel Sambucfrom Knuth \cite[pp. 294-295]{TAOCPV2}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\index{radix point}
In order to split the two inputs into their respective halves, a suitable \textit{radix point} must be chosen.  The radix point chosen must
be used for both of the inputs meaning that it must be smaller than the smallest input.  Step 3 chooses the radix point $B$ as half of the
smallest input \textbf{used} count.  After the radix point is chosen the inputs are split into lower and upper halves.  Steps 4 and 5
compute the lower halves.  Steps 6 and 7 compute the upper halves.
*ebfedea0SLionel Sambuc
After the halves have been computed the three intermediate half-size products must be computed.  Steps 8 and 9 compute the trivial products
$x0 \cdot y0$ and $x1 \cdot y1$.  Steps 10 through 12 compute the third product $(x1 + x0) \cdot (y1 + y0)$.  The mp\_int $x0$ is used as a
temporary variable after $x1 + x0$ has been computed.  By using $x0$ instead of an additional temporary variable, the algorithm can avoid an
additional memory allocation operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe remaining steps 13 through 18 compute the Karatsuba polynomial through a variety of digit shifting and addition operations.
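
The steps can be traced by hand on a small case.  The following sketch assumes a toy radix of $\beta = 10$ (\textit{far smaller than any radix the library would use}) and the inputs $a = 1234$ and $b = 5678$, both chosen only for illustration.

\begin{center}
\begin{tabular}{ll}
Step 3 & $B = 2$ \\
Steps 4--7 & $x0 = 34$, $y0 = 78$, $x1 = 12$, $y1 = 56$ \\
Steps 8--9 & $x0y0 = 34 \cdot 78 = 2652$, $x1y1 = 12 \cdot 56 = 672$ \\
Steps 10--12 & $t1 = 12 + 34 = 46$, $x0 = 56 + 78 = 134$, $t1 = 46 \cdot 134 = 6164$ \\
Steps 13--14 & $x0 = 2652 + 672 = 3324$, $t1 = 6164 - 3324 = 2840$ \\
Steps 15--16 & $t1 = 2840 \cdot 10^2 = 284000$, $x1y1 = 672 \cdot 10^4 = 6720000$ \\
Steps 17--18 & $t1 = 2652 + 284000 = 286652$, $c = 286652 + 6720000 = 7006652$ \\
\end{tabular}
\end{center}

The result agrees with $1234 \cdot 5678 = 7006652$, and only three two-digit multiplications were performed instead of four.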
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_karatsuba_mul.c
*ebfedea0SLionel Sambuc
The new coding element in this routine, not seen in previous routines, is the usage of goto statements.  The conventional
wisdom is that goto statements should be avoided.  This is generally true; however, when every single function call can fail, it makes sense
to handle error recovery with a single piece of code.  Lines @61,if@ to @75,if@ handle initializing all of the temporary variables
required.  Note how each of the if statements goes to a different label in case of failure.  This allows the routine to correctly free only
the temporaries that have been successfully allocated so far.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe temporary variables are all initialized using the mp\_init\_size routine since they are expected to be large.  This saves the
*ebfedea0SLionel Sambucadditional reallocation that would have been necessary.  Also $x0$, $x1$, $y0$ and $y1$ have to be able to hold at least their respective
*ebfedea0SLionel Sambucnumber of digits for the next section of code.
*ebfedea0SLionel Sambuc
The first algebraic portion of the algorithm is to split the two inputs into their halves.  However, instead of using mp\_mod\_2d and mp\_rshd
to extract the halves, the respective code has been placed inline within the body of the function.  To initialize the halves, the \textbf{used} and
\textbf{sign} members are copied first.  The first for loop on line @98,for@ copies the lower halves.  Since they are both the same magnitude it
is simpler to calculate both lower halves in a single loop.  The for loops on lines @104,for@ and @109,for@ calculate the upper halves $x1$ and
$y1$ respectively.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy inlining the calculation of the halves, the Karatsuba multiplier has a slightly lower overhead and can be used for smaller magnitude inputs.
*ebfedea0SLionel Sambuc
When line @152,err@ is reached, the algorithm has completed successfully.  The ``error status'' variable $err$ is set to \textbf{MP\_OKAY} so that
the same code that handles errors can be used to clear the temporary variables and return.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Toom-Cook $3$-Way Multiplication}
Toom-Cook $3$-Way \cite{TOOM} multiplication is essentially the polynomial basis algorithm for $n = 2$ except that the points are
chosen such that $\zeta$ is easy to compute and the resulting system of equations is easy to reduce.  Here, the points $\zeta_{0}$,
$16 \cdot \zeta_{1 \over 2}$, $\zeta_1$, $\zeta_2$ and $\zeta_{\infty}$ make up the five required points to solve for the coefficients
of $W(x)$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWith the five relations that Toom-Cook specifies, the following system of equations is formed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{rcrcrcrcrcr}
*ebfedea0SLionel Sambuc$\zeta_0$                    & $=$ & $0w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $1w_0$  \\
*ebfedea0SLionel Sambuc$16 \cdot \zeta_{1 \over 2}$ & $=$ & $1w_4$ & $+$ & $2w_3$ & $+$ & $4w_2$ & $+$ & $8w_1$ & $+$ & $16w_0$  \\
*ebfedea0SLionel Sambuc$\zeta_1$                    & $=$ & $1w_4$ & $+$ & $1w_3$ & $+$ & $1w_2$ & $+$ & $1w_1$ & $+$ & $1w_0$  \\
*ebfedea0SLionel Sambuc$\zeta_2$                    & $=$ & $16w_4$ & $+$ & $8w_3$ & $+$ & $4w_2$ & $+$ & $2w_1$ & $+$ & $1w_0$  \\
*ebfedea0SLionel Sambuc$\zeta_{\infty}$             & $=$ & $1w_4$ & $+$ & $0w_3$ & $+$ & $0w_2$ & $+$ & $0w_1$ & $+$ & $0w_0$  \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucA trivial solution to this matrix requires $12$ subtractions, two multiplications by a small power of two, two divisions by a small power
*ebfedea0SLionel Sambucof two, two divisions by three and one multiplication by three.  All of these $19$ sub-operations require less than quadratic time, meaning that
*ebfedea0SLionel Sambucthe algorithm can be faster than a baseline multiplication.  However, the greater complexity of this algorithm places the cutoff point
*ebfedea0SLionel Sambuc(\textbf{TOOM\_MUL\_CUTOFF}) where Toom-Cook becomes more efficient much higher than the Karatsuba cutoff point.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_toom\_mul}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and mp\_int $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow  a  \cdot  b $ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucSplit $a$ and $b$ into three pieces.  E.g. $a = a_2 \beta^{2k} + a_1 \beta^{k} + a_0$ \\
*ebfedea0SLionel Sambuc1.  $k \leftarrow \lfloor \mbox{min}(a.used, b.used) / 3 \rfloor$ \\
*ebfedea0SLionel Sambuc2.  $a_0 \leftarrow a \mbox{ (mod }\beta^{k}\mbox{)}$ \\
*ebfedea0SLionel Sambuc3.  $a_1 \leftarrow \lfloor a / \beta^k \rfloor$, $a_1 \leftarrow a_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
*ebfedea0SLionel Sambuc4.  $a_2 \leftarrow \lfloor a / \beta^{2k} \rfloor$, $a_2 \leftarrow a_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
5.  $b_0 \leftarrow b \mbox{ (mod }\beta^{k}\mbox{)}$ \\
6.  $b_1 \leftarrow \lfloor b / \beta^k \rfloor$, $b_1 \leftarrow b_1 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
7.  $b_2 \leftarrow \lfloor b / \beta^{2k} \rfloor$, $b_2 \leftarrow b_2 \mbox{ (mod }\beta^{k}\mbox{)}$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucFind the five equations for $w_0, w_1, ..., w_4$. \\
*ebfedea0SLionel Sambuc8.  $w_0 \leftarrow a_0 \cdot b_0$ \\
*ebfedea0SLionel Sambuc9.  $w_4 \leftarrow a_2 \cdot b_2$ \\
*ebfedea0SLionel Sambuc10. $tmp_1 \leftarrow 2 \cdot a_0$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_2$ \\
*ebfedea0SLionel Sambuc11. $tmp_2 \leftarrow 2 \cdot b_0$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_2$ \\
*ebfedea0SLionel Sambuc12. $w_1 \leftarrow tmp_1 \cdot tmp_2$ \\
*ebfedea0SLionel Sambuc13. $tmp_1 \leftarrow 2 \cdot a_2$, $tmp_1 \leftarrow a_1 + tmp_1$, $tmp_1 \leftarrow 2 \cdot tmp_1$, $tmp_1 \leftarrow tmp_1 + a_0$ \\
*ebfedea0SLionel Sambuc14. $tmp_2 \leftarrow 2 \cdot b_2$, $tmp_2 \leftarrow b_1 + tmp_2$, $tmp_2 \leftarrow 2 \cdot tmp_2$, $tmp_2 \leftarrow tmp_2 + b_0$ \\
*ebfedea0SLionel Sambuc15. $w_3 \leftarrow tmp_1 \cdot tmp_2$ \\
*ebfedea0SLionel Sambuc16. $tmp_1 \leftarrow a_0 + a_1$, $tmp_1 \leftarrow tmp_1 + a_2$, $tmp_2 \leftarrow b_0 + b_1$, $tmp_2 \leftarrow tmp_2 + b_2$ \\
*ebfedea0SLionel Sambuc17. $w_2 \leftarrow tmp_1 \cdot tmp_2$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucContinued on the next page.\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_toom\_mul}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_toom\_mul} (continued). \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and mp\_int $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow a \cdot  b $ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucNow solve the system of equations. \\
18. $w_1 \leftarrow w_1 - w_4$, $w_3 \leftarrow w_3 - w_0$ \\
*ebfedea0SLionel Sambuc19. $w_1 \leftarrow \lfloor w_1 / 2 \rfloor$, $w_3 \leftarrow \lfloor w_3 / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc20. $w_2 \leftarrow w_2 - w_0$, $w_2 \leftarrow w_2 - w_4$ \\
*ebfedea0SLionel Sambuc21. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\
*ebfedea0SLionel Sambuc22. $tmp_1 \leftarrow 8 \cdot w_0$, $w_1 \leftarrow w_1 - tmp_1$, $tmp_1 \leftarrow 8 \cdot w_4$, $w_3 \leftarrow w_3 - tmp_1$ \\
*ebfedea0SLionel Sambuc23. $w_2 \leftarrow 3 \cdot w_2$, $w_2 \leftarrow w_2 - w_1$, $w_2 \leftarrow w_2 - w_3$ \\
*ebfedea0SLionel Sambuc24. $w_1 \leftarrow w_1 - w_2$, $w_3 \leftarrow w_3 - w_2$ \\
*ebfedea0SLionel Sambuc25. $w_1 \leftarrow \lfloor w_1 / 3 \rfloor, w_3 \leftarrow \lfloor w_3 / 3 \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucNow substitute $\beta^k$ for $x$ by shifting $w_0, w_1, ..., w_4$. \\
*ebfedea0SLionel Sambuc26. for $n$ from $1$ to $4$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}26.1  $w_n \leftarrow w_n \cdot \beta^{nk}$ \\
*ebfedea0SLionel Sambuc27. $c \leftarrow w_0 + w_1$, $c \leftarrow c + w_2$, $c \leftarrow c + w_3$, $c \leftarrow c + w_4$ \\
*ebfedea0SLionel Sambuc28. Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_toom\_mul (continued)}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_toom\_mul.}
*ebfedea0SLionel SambucThis algorithm computes the product of two mp\_int variables $a$ and $b$ using the Toom-Cook approach.  Compared to the Karatsuba multiplication, this
*ebfedea0SLionel Sambucalgorithm has a lower asymptotic running time of approximately $O(n^{1.464})$ but at an obvious cost in overhead.  In this
*ebfedea0SLionel Sambucdescription, several statements have been compounded to save space.  The intention is that the statements are executed from left to right across
*ebfedea0SLionel Sambucany given step.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe two inputs $a$ and $b$ are first split into three $k$-digit integers $a_0, a_1, a_2$ and $b_0, b_1, b_2$ respectively.  From these smaller
*ebfedea0SLionel Sambucintegers the coefficients of the polynomial basis representations $f(x)$ and $g(x)$ are known and can be used to find the relations required.
*ebfedea0SLionel Sambuc
The first two relations $w_0$ and $w_4$ are the points $\zeta_{0}$ and $\zeta_{\infty}$ respectively.  The relations $w_1, w_2$ and $w_3$ correspond
to the points $16 \cdot \zeta_{1 \over 2}, \zeta_{1}$ and $\zeta_{2}$ respectively.  These are found using logical shifts to independently find
$f(y)$ and $g(y)$ which significantly speeds up the algorithm.
*ebfedea0SLionel Sambuc
After the five relations $w_0, w_1, \ldots, w_4$ have been computed, the system they represent must be solved in order for the unknown coefficients
$w_1, w_2$ and $w_3$ to be isolated.  The steps 18 through 25 perform the system reduction required as previously described.  Each step of
the reduction represents the comparable matrix operation that would be performed had this been done by hand.  For example, step 18 indicates
that row $4$ must be subtracted from row $1$ and simultaneously row $0$ subtracted from row $3$.
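
The reduction can be checked symbolically.  In the sketch below $q_i$ denotes the true coefficient of $x^i$ in $W(x)$, a name introduced here only to distinguish the unknowns from the working variables, and step 18 is taken to subtract $w_4$ from $w_1$ and $w_0$ from $w_3$.  Throughout, $w_0 = q_0$ and $w_4 = q_4$.

\begin{center}
\begin{small}
\begin{tabular}{lll}
After step & Contents of $w_1$ & Contents of $w_3$ \\
\hline
17 & $q_4 + 2q_3 + 4q_2 + 8q_1 + 16q_0$ & $16q_4 + 8q_3 + 4q_2 + 2q_1 + q_0$ \\
18 & $2q_3 + 4q_2 + 8q_1 + 16q_0$ & $16q_4 + 8q_3 + 4q_2 + 2q_1$ \\
19 & $q_3 + 2q_2 + 4q_1 + 8q_0$ & $8q_4 + 4q_3 + 2q_2 + q_1$ \\
21 & $q_2 + 3q_1 + 8q_0$ & $8q_4 + 3q_3 + q_2$ \\
22 & $q_2 + 3q_1$ & $3q_3 + q_2$ \\
24 & $3q_1$ & $3q_3$ \\
25 & $q_1$ & $q_3$ \\
\end{tabular}
\end{small}
\end{center}

Step 20 leaves $w_2 = q_3 + q_2 + q_1$, and step 23 reduces it to $3(q_3 + q_2 + q_1) - (q_2 + 3q_1) - (3q_3 + q_2) = q_2$.  The divisions by two and by three are therefore exact.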
*ebfedea0SLionel Sambuc
Once the coefficients have been isolated, the polynomial $W(x) = \sum_{i=0}^{2n} w_i x^i$ is known.  By substituting $\beta^{k}$ for $x$, the integer
result $a \cdot b$ is produced.
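
A small numeric trace shows the whole algorithm at work.  It assumes a toy radix of $\beta = 10$ with $a = 123$ and $b = 456$, so that $k = 1$; these values are chosen only for illustration.  Step 18 is taken here as subtracting $w_4$ from $w_1$ and $w_0$ from $w_3$.

\begin{center}
\begin{small}
\begin{tabular}{ll}
Steps 2--7 & $a_2 = 1$, $a_1 = 2$, $a_0 = 3$, $b_2 = 4$, $b_1 = 5$, $b_0 = 6$ \\
Steps 8--9 & $w_0 = 3 \cdot 6 = 18$, $w_4 = 1 \cdot 4 = 4$ \\
Steps 10--12 & $tmp_1 = 17$, $tmp_2 = 38$, $w_1 = 17 \cdot 38 = 646$ \\
Steps 13--15 & $tmp_1 = 11$, $tmp_2 = 32$, $w_3 = 11 \cdot 32 = 352$ \\
Steps 16--17 & $tmp_1 = 6$, $tmp_2 = 15$, $w_2 = 6 \cdot 15 = 90$ \\
Step 18 & $w_1 = 646 - 4 = 642$, $w_3 = 352 - 18 = 334$ \\
Step 19 & $w_1 = 321$, $w_3 = 167$ \\
Step 20 & $w_2 = 90 - 18 - 4 = 68$ \\
Step 21 & $w_1 = 321 - 68 = 253$, $w_3 = 167 - 68 = 99$ \\
Step 22 & $w_1 = 253 - 144 = 109$, $w_3 = 99 - 32 = 67$ \\
Step 23 & $w_2 = 204 - 109 - 67 = 28$ \\
Step 24 & $w_1 = 109 - 28 = 81$, $w_3 = 67 - 28 = 39$ \\
Step 25 & $w_1 = 27$, $w_3 = 13$ \\
\end{tabular}
\end{small}
\end{center}

Substituting $\beta^k = 10$ gives $18 + 27 \cdot 10 + 28 \cdot 10^2 + 13 \cdot 10^3 + 4 \cdot 10^4 = 56088$, which agrees with $123 \cdot 456$.  Note that the coefficients may exceed $\beta^k$; the final additions absorb the overlap.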
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_toom_mul.c
*ebfedea0SLionel Sambuc
The first obvious thing to note is that this algorithm is complicated.  The complexity is worth it if you are multiplying very
large numbers.  For example, a 10,000 digit multiplication takes approximately 99,282,205 fewer single precision multiplications with
Toom--Cook than a Comba or baseline approach (this is a savings of more than 99$\%$).  For most ``crypto'' sized numbers this
algorithm is not practical as Karatsuba has a much lower cutoff point.
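
The quoted figure appears to come from comparing the two work factors directly for $n = 10^4$ digits, ignoring constant factors and the overhead of solving the system.  This is an inference from the numbers and not something the text states.

\begin{eqnarray*}
n^2 & = & \left (10^4 \right )^2 = 10^8 \\
n^{1.464} & = & \left (10^4 \right )^{1.464} = 10^{5.856} \\
n^2 - n^{1.464} & = & 10^8 - 10^{5.856}
\end{eqnarray*}

Since $10^{5.856}$ is roughly $7.18 \cdot 10^5$, the difference is roughly $9.928 \cdot 10^7$, in line with the figure above and a saving of a little over 99$\%$.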
*ebfedea0SLionel Sambuc
First we split $a$ and $b$ into three roughly equal portions.  This has been accomplished (lines @40,mod@ to @69,rshd@) with
combinations of mp\_rshd() and mp\_mod\_2d() function calls.  At this point $a = a2 \cdot \beta^{2k} + a1 \cdot \beta^{k} + a0$ and similarly
for $b$.
*ebfedea0SLionel Sambuc
Next we compute the five points $w0, w1, w2, w3$ and $w4$.  Recall that $w0$ and $w4$ can be computed directly from the portions so
we get those out of the way first (lines @72,mul@ and @77,mul@).  Next we compute $w1, w2$ and $w3$ using Horner's method.
*ebfedea0SLionel Sambuc
After this point we solve for the actual values of $w1, w2$ and $w3$ by reducing the $5 \times 5$ system which is relatively
straightforward.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Signed Multiplication}
Now that algorithms to handle multiplications of every useful dimension have been developed, a rather simple finishing touch is required.  So far all
of the multiplication algorithms have been unsigned multiplications which leaves only a signed multiplication algorithm to be established.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_mul}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and mp\_int $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c \leftarrow a \cdot b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a.sign = b.sign$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $sign = MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc2.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $sign = MP\_ZNEG$ \\
*ebfedea0SLionel Sambuc3.  If min$(a.used, b.used) \ge TOOM\_MUL\_CUTOFF$ then  \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $c \leftarrow a \cdot b$ using algorithm mp\_toom\_mul \\
*ebfedea0SLionel Sambuc4.  else if min$(a.used, b.used) \ge KARATSUBA\_MUL\_CUTOFF$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $c \leftarrow a \cdot b$ using algorithm mp\_karatsuba\_mul \\
*ebfedea0SLionel Sambuc5.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $digs \leftarrow a.used + b.used + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  If $digs < MP\_ARRAY$ and min$(a.used, b.used) \le \delta$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.2.1  $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm fast\_s\_mp\_mul\_digs.  \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  else \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.3.1  $c \leftarrow a \cdot b \mbox{ (mod }\beta^{digs}\mbox{)}$ using algorithm s\_mp\_mul\_digs.  \\
*ebfedea0SLionel Sambuc6.  $c.sign \leftarrow sign$ \\
*ebfedea0SLionel Sambuc7.  Return the result of the unsigned multiplication performed. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_mul}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_mul.}
*ebfedea0SLionel SambucThis algorithm performs the signed multiplication of two inputs.  It will make use of any of the three unsigned multiplication algorithms
*ebfedea0SLionel Sambucavailable when the input is of appropriate size.  The \textbf{sign} of the result is not set until the end of the algorithm since algorithm
*ebfedea0SLionel Sambucs\_mp\_mul\_digs will clear it.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_mul.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe implementation is rather simplistic and is not particularly noteworthy.  Line @22,?@ computes the sign of the result using the ``?''
*ebfedea0SLionel Sambucoperator from the C programming language.  Line @37,<<@ computes $\delta$ using the fact that $1 << k$ is equal to $2^k$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Squaring}
*ebfedea0SLionel Sambuc\label{sec:basesquare}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSquaring is a special case of multiplication where both multiplicands are equal.  At first it may seem like there is no significant optimization
*ebfedea0SLionel Sambucavailable but in fact there is.  Consider the multiplication of $576$ against $241$.  In total there will be nine single precision multiplications
*ebfedea0SLionel Sambucperformed which are $1\cdot 6$, $1 \cdot 7$, $1 \cdot 5$, $4 \cdot 6$, $4 \cdot 7$, $4 \cdot 5$, $2 \cdot  6$, $2 \cdot 7$ and $2 \cdot 5$.  Now consider
*ebfedea0SLionel Sambucthe multiplication of $123$ against $123$.  The nine products are $3 \cdot 3$, $3 \cdot 2$, $3 \cdot 1$, $2 \cdot 3$, $2 \cdot 2$, $2 \cdot 1$,
*ebfedea0SLionel Sambuc$1 \cdot 3$, $1 \cdot 2$ and $1 \cdot 1$.  On closer inspection some of the products are equivalent.  For example, $3 \cdot 2 = 2 \cdot 3$
*ebfedea0SLionel Sambucand $3 \cdot 1 = 1 \cdot 3$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor any $n$-digit input, there are ${{\left (n^2 + n \right)}\over 2}$ possible unique single precision multiplications required compared to the $n^2$
*ebfedea0SLionel Sambucrequired for multiplication.  The following diagram gives an example of the operations required.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{ccccc|c}
*ebfedea0SLionel Sambuc&&1&2&3&\\
*ebfedea0SLionel Sambuc$\times$ &&1&2&3&\\
*ebfedea0SLionel Sambuc\hline && $3 \cdot 1$ & $3 \cdot 2$ & $3 \cdot 3$ & Row 0\\
*ebfedea0SLionel Sambuc       & $2 \cdot 1$  & $2 \cdot 2$ & $2 \cdot 3$ && Row 1 \\
*ebfedea0SLionel Sambuc         $1 \cdot 1$  & $1 \cdot 2$ & $1 \cdot 3$ &&& Row 2 \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Squaring Optimization Diagram}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucMARK,SQUARE
*ebfedea0SLionel SambucStarting from zero and numbering the columns from right to left a very simple pattern becomes obvious.  For the purposes of this discussion let $x$
*ebfedea0SLionel Sambucrepresent the number being squared.  The first observation is that in row $k$ the $2k$'th column of the product has a $\left (x_k \right)^2$ term in it.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe second observation is that every column $j$ in row $k$ where $j \ne 2k$ is part of a double product.  Every non-square term of a column will
*ebfedea0SLionel Sambucappear twice hence the name ``double product''.  Every odd column is made up entirely of double products.  In fact every column is made up of double
*ebfedea0SLionel Sambucproducts and at most one square (\textit{see the exercise section}).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe third and final observation is that for row $k$ the first unique non-square term, that is, one that hasn't already appeared in an earlier row,
*ebfedea0SLionel Sambucoccurs at column $2k + 1$.  For example, on row $1$ of the previous squaring, column one is part of the double product with column one from row zero.
*ebfedea0SLionel SambucColumn two of row one is a square and column three is the first unique column.
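
The three observations can be checked against the squaring of $123$ from the diagram, where $x_0 = 3$, $x_1 = 2$ and $x_2 = 1$.  Collecting the terms by column gives the following.

\begin{center}
\begin{tabular}{cll}
Column & Terms & Sum \\
\hline
0 & $3 \cdot 3$ & $9$ \\
1 & $2 \cdot (3 \cdot 2)$ & $12$ \\
2 & $2 \cdot (3 \cdot 1) + 2 \cdot 2$ & $10$ \\
3 & $2 \cdot (2 \cdot 1)$ & $4$ \\
4 & $1 \cdot 1$ & $1$ \\
\end{tabular}
\end{center}

The squares appear only in the even columns $0$, $2$ and $4$, and every other term is doubled.  Weighting each column sum by its power of ten gives $9 + 12 \cdot 10 + 10 \cdot 10^2 + 4 \cdot 10^3 + 1 \cdot 10^4 = 15129 = 123^2$.  Only six distinct single precision products were needed, which matches ${{\left (3^2 + 3 \right)}\over 2} = 6$.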
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The Baseline Squaring Algorithm}
*ebfedea0SLionel SambucThe baseline squaring algorithm is meant to be a catch-all squaring algorithm.  It will handle any of the input sizes that the faster routines
*ebfedea0SLionel Sambucwill not handle.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_sqr}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b \leftarrow a^2$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Init a temporary mp\_int of at least $2 \cdot a.used +1$ digits.  (\textit{mp\_init\_size}) \\
*ebfedea0SLionel Sambuc2.  If step 1 failed return(\textit{MP\_MEM}) \\
*ebfedea0SLionel Sambuc3.  $t.used \leftarrow 2 \cdot a.used + 1$ \\
*ebfedea0SLionel Sambuc4.  For $ix$ from 0 to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}Calculate the square. \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $\hat r \leftarrow t_{2ix} + \left (a_{ix} \right )^2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.2  $t_{2ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}Calculate the double products after the square. \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.3  $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.4  For $iy$ from $ix + 1$ to $a.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.4.1  $\hat r \leftarrow 2 \cdot a_{ix}a_{iy} + t_{ix + iy} + u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.4.2  $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.4.3  $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}Set the last carry. \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.5  While $u > 0$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.5.1  $iy \leftarrow iy + 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.5.2  $\hat r \leftarrow t_{ix + iy} + u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.5.3  $t_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.5.4  $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc5.  Clamp excess digits of $t$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc6.  Exchange $b$ and $t$. \\
*ebfedea0SLionel Sambuc7.  Clear $t$ (\textit{mp\_clear}) \\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_sqr}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm s\_mp\_sqr.}
*ebfedea0SLionel SambucThis algorithm computes the square of an input using the three observations on squaring.  It is based fairly faithfully on  algorithm 14.16 of HAC
*ebfedea0SLionel Sambuc\cite[pp.596-597]{HAC}.  Similar to algorithm s\_mp\_mul\_digs, a temporary mp\_int is allocated to hold the result of the squaring.  This allows the
*ebfedea0SLionel Sambucdestination mp\_int to be the same as the source mp\_int.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe outer loop of this algorithm begins on step 4. It is best to think of the outer loop as walking down the rows of the partial results, while
*ebfedea0SLionel Sambucthe inner loop computes the columns of the partial result.  Step 4.1 and 4.2 compute the square term for each row, and step 4.3 and 4.4 propagate
*ebfedea0SLionel Sambucthe carry and compute the double products.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe requirement that a mp\_word be able to represent the range $0 \le x < 2 \beta^2$ arises from this
*ebfedea0SLionel Sambucvery algorithm.  The product $a_{ix}a_{iy}$ will lie in the range $0 \le x \le \beta^2 - 2\beta + 1$ which is obviously less than $\beta^2$ meaning that
*ebfedea0SLionel Sambucwhen it is multiplied by two, it can be properly represented by a mp\_word.
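
The bound can be followed through the whole accumulation, not just the doubled product.  The sketch below assumes every stored digit satisfies $0 \le t_j \le \beta - 1$ and argues by induction on the carry $u$.

\begin{eqnarray*}
\mbox{Step 4.1:} \hspace{1em} \hat r & \le & (\beta - 1) + (\beta - 1)^2 = \beta^2 - \beta, \hspace{1em} \mbox{so } u \le \beta - 1 \\
\mbox{Step 4.4.1:} \hspace{1em} \hat r & \le & 2(\beta - 1)^2 + (\beta - 1) + (2\beta - 1) = 2\beta^2 - \beta \\
\mbox{Step 4.4.3:} \hspace{1em} u & \le & \floor{(2\beta^2 - \beta) / \beta} = 2\beta - 1
\end{eqnarray*}

If the carry entering step 4.4.1 is at most $2\beta - 1$ then so is the carry leaving step 4.4.3, and the accumulator never reaches $2\beta^2$.  This is why a mp\_word that can represent $0 \le x < 2\beta^2$ is sufficient.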
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSimilar to algorithm s\_mp\_mul\_digs, after every pass of the inner loop, the destination is correctly set to the sum of all of the partial
*ebfedea0SLionel Sambucresults calculated so far.  This involves expensive carry propagation which will be eliminated in the next algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_s_mp_sqr.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucInside the outer loop (line @32,for@) the square term is calculated on line @35,r =@.  The carry (line @42,>>@) has been
*ebfedea0SLionel Sambucextracted from the mp\_word accumulator using a right shift.  Aliases for $a_{ix}$ and $t_{ix+iy}$ are initialized
*ebfedea0SLionel Sambuc(lines @45,tmpx@ and @48,tmpt@) to simplify the inner loop.  The doubling is performed using two
*ebfedea0SLionel Sambucadditions (line @57,r + r@) since it is usually faster than shifting, if not at least as fast.
*ebfedea0SLionel Sambuc
The important observation is that the inner loop does not begin at $iy = 0$ as it does for multiplication.  As such the inner loops
*ebfedea0SLionel Sambucget progressively shorter as the algorithm proceeds.  This is what leads to the savings compared to using a multiplication to
*ebfedea0SLionel Sambucsquare a number.
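
As a rough count of the savings (a sketch that tallies only single precision multiplications and ignores additions and carry handling), row $ix$ of the baseline squaring performs one square and $n - 1 - ix$ double products for an $n$-digit input, whereas baseline multiplication of two $n$-digit numbers performs $n$ products in every row.

\begin{center}
\begin{tabular}{rcl}
$\sum_{ix=0}^{n-1} \left ( 1 + (n - 1 - ix) \right )$ & $=$ & ${{n^2 + n} \over 2}$ \\
\end{tabular}
\end{center}

For $n = 4$ this is $10$ single precision multiplications instead of $16$.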
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Faster Squaring by the ``Comba'' Method}
*ebfedea0SLionel SambucA major drawback to the baseline method is the requirement for single precision shifting inside the $O(n^2)$ nested loop.  Squaring has an additional
*ebfedea0SLionel Sambucdrawback that it must double the product inside the inner loop as well.  As for multiplication, the Comba technique can be used to eliminate these
*ebfedea0SLionel Sambucperformance hazards.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first obvious solution is to make an array of mp\_words which will hold all of the columns.  This will indeed eliminate all of the carry
*ebfedea0SLionel Sambucpropagation operations from the inner loop.  However, the inner product must still be doubled $O(n^2)$ times.  The solution stems from the simple fact
*ebfedea0SLionel Sambucthat $2a + 2b + 2c = 2(a + b + c)$.  That is the sum of all of the double products is equal to double the sum of all the products.  For example,
*ebfedea0SLionel Sambuc$ab + ba + ac + ca = 2ab + 2ac = 2(ab + ac)$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucHowever, we cannot simply double all of the columns, since the squares appear only once per row.  The most practical solution is to have two
*ebfedea0SLionel Sambucmp\_word arrays.  One array will hold the squares and the other array will hold the double products.  With both arrays the doubling and
*ebfedea0SLionel Sambuccarry propagation can be moved to a $O(n)$ work level outside the $O(n^2)$ level.  In this case, we have an even simpler solution in mind.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{fast\_s\_mp\_sqr}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b \leftarrow a^2$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucPlace an array of \textbf{MP\_WARRAY} mp\_digits named $W$ on the stack. \\
*ebfedea0SLionel Sambuc1.  If $b.alloc < 2a.used + 1$ then grow $b$ to $2a.used + 1$ digits.  (\textit{mp\_grow}). \\
*ebfedea0SLionel Sambuc2.  If step 1 failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc3.  $pa \leftarrow 2 \cdot a.used$ \\
*ebfedea0SLionel Sambuc4.  $\hat W1 \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  for $ix$ from $0$ to $pa - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $\_ \hat W \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $ty \leftarrow \mbox{MIN}(a.used - 1, ix)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $tx \leftarrow ix - ty$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.4  $iy \leftarrow \mbox{MIN}(a.used - tx, ty + 1)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.5  $iy \leftarrow \mbox{MIN}(iy, \lfloor \left (ty - tx + 1 \right )/2 \rfloor)$ \\
\hspace{3mm}5.6  for $iz$ from $0$ to $iy - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.6.1  $\_ \hat W \leftarrow \_ \hat W + a_{tx + iz}a_{ty - iz}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.7  $\_ \hat W \leftarrow 2 \cdot \_ \hat W  + \hat W1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.8  if $ix$ is even then \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.8.1  $\_ \hat W \leftarrow \_ \hat W + \left ( a_{\lfloor ix/2 \rfloor}\right )^2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.9  $W_{ix} \leftarrow \_ \hat W (\mbox{mod }\beta)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.10  $\hat W1 \leftarrow \lfloor \_ \hat W / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel Sambuc6.  $oldused \leftarrow b.used$ \\
*ebfedea0SLionel Sambuc7.  $b.used \leftarrow 2 \cdot a.used$ \\
*ebfedea0SLionel Sambuc8.  for $ix$ from $0$ to $pa - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $b_{ix} \leftarrow W_{ix}$ \\
*ebfedea0SLionel Sambuc9.  for $ix$ from $pa$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  $b_{ix} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc10.  Clamp excess digits from $b$.  (\textit{mp\_clamp}) \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm fast\_s\_mp\_sqr}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm fast\_s\_mp\_sqr.}
*ebfedea0SLionel SambucThis algorithm computes the square of an input using the Comba technique.  It is designed to be a replacement for algorithm
*ebfedea0SLionel Sambucs\_mp\_sqr when the number of input digits is less than \textbf{MP\_WARRAY} and less than $\delta \over 2$.
*ebfedea0SLionel SambucThis algorithm is very similar to the Comba multiplier except with a few key differences we shall make note of.
*ebfedea0SLionel Sambuc
First, we have separate accumulator and carry variables, $\_ \hat W$ and $\hat W1$ respectively.  This is because the inner loop
products are to be doubled.  If we had added the previous carry into the accumulator before doubling, we would be doubling too much.  Next we perform an
additional MIN condition on $iy$ (step 5.5) to prevent overlapping digits.  For example, $a_3 \cdot a_5$ is equal to
$a_5 \cdot a_3$.  In the multiplication case both products would be computed, since the conditions $5 < a.used$ and $3 \ge 0$ are maintained.
Because we double the sum of the products just outside the inner loop, we have to avoid computing both.  This is also a good thing since we perform
fewer multiplications and the routine ends up being faster.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFinally the last difference is the addition of the ``square'' term outside the inner loop (step 5.8).  We add in the square
*ebfedea0SLionel Sambuconly to even outputs and it is the square of the term at the $\lfloor ix / 2 \rfloor$ position.
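
As an illustration, the loop bounds of steps 5.2 through 5.5 can be traced by hand for a three digit input ($a.used = 3$, so $pa = 6$).  The table below is a sketch worked out from the pseudo-code above, not output from the library.  The $iy$ column holds the value after step 5.5.

\begin{center}
\begin{tabular}{|c|c|c|c|l|l|}
\hline $ix$ & $ty$ & $tx$ & $iy$ & Products doubled (step 5.7) & Square added (step 5.8) \\
\hline 0 & 0 & 0 & 0 & none & $(a_0)^2$ \\
\hline 1 & 1 & 0 & 1 & $a_0a_1$ & none \\
\hline 2 & 2 & 0 & 1 & $a_0a_2$ & $(a_1)^2$ \\
\hline 3 & 2 & 1 & 1 & $a_1a_2$ & none \\
\hline 4 & 2 & 2 & 0 & none & $(a_2)^2$ \\
\hline 5 & 2 & 3 & 0 & none & none \\
\hline
\end{tabular}
\end{center}

Every product $a_ia_j$ with $i < j$ appears exactly once and is doubled, each square appears once in an even column, and the final column receives only the carry from the column below it.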
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_fast_s_mp_sqr.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis implementation is essentially a copy of Comba multiplication with the appropriate changes added to make it faster for
*ebfedea0SLionel Sambucthe special case of squaring.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Polynomial Basis Squaring}
The same algorithm that performs optimal polynomial basis multiplication can be used to perform polynomial basis squaring.  The minor exception
is that $\zeta_y = f(y)g(y)$ is actually equivalent to $\zeta_y = f(y)^2$ since $f(y) = g(y)$.  Instead of performing $2n + 1$
multiplications to find the $\zeta$ relations, $2n + 1$ squaring operations are performed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Karatsuba Squaring}
*ebfedea0SLionel SambucLet $f(x) = ax + b$ represent the polynomial basis representation of a number to square.
*ebfedea0SLionel SambucLet $h(x) = \left ( f(x) \right )^2$ represent the square of the polynomial.  The Karatsuba equation can be modified to square a
*ebfedea0SLionel Sambucnumber with the following equation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuch(x) = a^2x^2 + \left ((a + b)^2 - (a^2 + b^2) \right )x + b^2
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucUpon closer inspection this equation only requires the calculation of three half-sized squares: $a^2$, $b^2$ and $(a + b)^2$.  As in
*ebfedea0SLionel SambucKaratsuba multiplication, this algorithm can be applied recursively on the input and will achieve an asymptotic running time of
*ebfedea0SLionel Sambuc$O \left ( n^{lg(3)} \right )$.
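
As a small decimal illustration (the numbers are chosen here purely for the example), consider squaring $1234$ with $x = 100$, so that $a = 12$ and $b = 34$.

\begin{center}
\begin{tabular}{rcl}
$a^2$ & $=$ & $144$ \\
$b^2$ & $=$ & $1156$ \\
$(a + b)^2$ & $=$ & $46^2 = 2116$ \\
$(a + b)^2 - (a^2 + b^2)$ & $=$ & $2116 - 1300 = 816$ \\
$h(100)$ & $=$ & $144 \cdot 100^2 + 816 \cdot 100 + 1156 = 1522756$ \\
\end{tabular}
\end{center}

The result agrees with $1234^2 = 1522756$, and only three half-sized squares were needed.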
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf the asymptotic times of Karatsuba squaring and multiplication are the same, why not simply use the multiplication algorithm
*ebfedea0SLionel Sambucinstead?  The answer to this arises from the cutoff point for squaring.  As in multiplication there exists a cutoff point, at which the
*ebfedea0SLionel Sambuctime required for a Comba based squaring and a Karatsuba based squaring meet.  Due to the overhead inherent in the Karatsuba method, the cutoff
*ebfedea0SLionel Sambucpoint is fairly high.  For example, on an AMD Athlon XP processor with $\beta = 2^{28}$, the cutoff point is around 127 digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucConsider squaring a 200 digit number with this technique.  It will be split into two 100 digit halves which are subsequently squared.
*ebfedea0SLionel SambucThe 100 digit halves will not be squared using Karatsuba, but instead using the faster Comba based squaring algorithm.  If Karatsuba multiplication
*ebfedea0SLionel Sambucwere used instead, the 100 digit numbers would be squared with a slower Comba based multiplication.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_karatsuba\_sqr}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b \leftarrow a^2$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  Initialize the following temporary mp\_ints:  $x0$, $x1$, $t1$, $t2$, $x0x0$ and $x1x1$. \\
*ebfedea0SLionel Sambuc2.  If any of the initializations on step 1 failed return(\textit{MP\_MEM}). \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucSplit the input.  e.g. $a = x1\beta^B + x0$ \\
*ebfedea0SLionel Sambuc3.  $B \leftarrow \lfloor a.used / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc4.  $x0 \leftarrow a \mbox{ (mod }\beta^B\mbox{)}$ (\textit{mp\_mod\_2d}) \\
5.  $x1 \leftarrow \lfloor a / \beta^B \rfloor$ (\textit{mp\_rshd}) \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCalculate the three squares. \\
*ebfedea0SLionel Sambuc6.  $x0x0 \leftarrow x0^2$ (\textit{mp\_sqr}) \\
*ebfedea0SLionel Sambuc7.  $x1x1 \leftarrow x1^2$ \\
*ebfedea0SLionel Sambuc8.  $t1 \leftarrow x1 + x0$ (\textit{s\_mp\_add}) \\
*ebfedea0SLionel Sambuc9.  $t1 \leftarrow t1^2$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCompute the middle term. \\
*ebfedea0SLionel Sambuc10.  $t2 \leftarrow x0x0 + x1x1$ (\textit{s\_mp\_add}) \\
*ebfedea0SLionel Sambuc11.  $t1 \leftarrow t1 - t2$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucCompute final product. \\
*ebfedea0SLionel Sambuc12.  $t1 \leftarrow t1\beta^B$ (\textit{mp\_lshd}) \\
*ebfedea0SLionel Sambuc13.  $x1x1 \leftarrow x1x1\beta^{2B}$ \\
*ebfedea0SLionel Sambuc14.  $t1 \leftarrow t1 + x0x0$ \\
*ebfedea0SLionel Sambuc15.  $b \leftarrow t1 + x1x1$ \\
*ebfedea0SLionel Sambuc16.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_karatsuba\_sqr}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_karatsuba\_sqr.}
*ebfedea0SLionel SambucThis algorithm computes the square of an input $a$ using the Karatsuba technique.  This algorithm is very similar to the Karatsuba based
*ebfedea0SLionel Sambucmultiplication algorithm with the exception that the three half-size multiplications have been replaced with three half-size squarings.
*ebfedea0SLionel Sambuc
The radix point for squaring is simply placed exactly in the middle of the digits when the input has an even number of digits, otherwise it is
placed just below the middle.  Steps 3, 4 and 5 compute the two halves required using $B$
as the radix point.  The first two squares in steps 6 and 7 are rather straightforward while the last square is of a more compact form.
*ebfedea0SLionel Sambuc
By expanding $\left (x1 + x0 \right )^2$, the $x1^2$ and $x0^2$ terms in the middle disappear, that is $(x0 + x1)^2 - (x1^2 + x0^2)  = 2 \cdot x0 \cdot x1$.
Now if $5n$ single precision additions and a squaring of $n$ digits are faster than multiplying two $n$-digit numbers and doubling, then
*ebfedea0SLionel Sambucthis method is faster.  Assuming no further recursions occur, the difference can be estimated with the following inequality.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $p$ represent the cost of a single precision addition and $q$ the cost of a single precision multiplication both in terms of time\footnote{Or
*ebfedea0SLionel Sambucmachine clock cycles.}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc5pn +{{q(n^2 + n)} \over 2} \le pn + qn^2
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, on an AMD Athlon XP processor $p = {1 \over 3}$ and $q = 6$.  This implies that the following inequality should hold.
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{rcl}
*ebfedea0SLionel Sambuc${5n \over 3} + 3n^2 + 3n$     & $<$ & ${n \over 3} + 6n^2$ \\
*ebfedea0SLionel Sambuc${5 \over 3} + 3n + 3$     & $<$ & ${1 \over 3} + 6n$ \\
*ebfedea0SLionel Sambuc${13 \over 9}$     & $<$ & $n$ \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis results in a cutoff point around $n = 2$.  As a consequence it is actually faster to compute the middle term the ``long way'' on processors
*ebfedea0SLionel Sambucwhere multiplication is substantially slower\footnote{On the Athlon there is a 1:17 ratio between clock cycles for addition and multiplication.  On
*ebfedea0SLionel Sambucthe Intel P4 processor this ratio is 1:29 making this method even more beneficial.  The only common exception is the ARMv4 processor which has a
*ebfedea0SLionel Sambucratio of 1:7.  } than simpler operations such as addition.
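
The same cutoff can be expressed for general $p$ and $q$.  The following rearrangement is a sketch that uses only the inequality above and assumes $n > 0$ and $q > 0$.

\begin{center}
\begin{tabular}{rcl}
$5pn + {{q(n^2 + n)} \over 2}$ & $\le$ & $pn + qn^2$ \\
$4pn + {{qn} \over 2}$ & $\le$ & ${{qn^2} \over 2}$ \\
$8p + q$ & $\le$ & $qn$ \\
$1 + {{8p} \over q}$ & $\le$ & $n$ \\
\end{tabular}
\end{center}

With $p = {1 \over 3}$ and $q = 6$ the bound is $1 + {8 \over 18} = {13 \over 9}$, which matches the value found above.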
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_karatsuba_sqr.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis implementation is largely based on the implementation of algorithm mp\_karatsuba\_mul.  It uses the same inline style to copy and
*ebfedea0SLionel Sambucshift the input into the two halves.  The loop from line @54,{@ to line @70,}@ has been modified since only one input exists.  The \textbf{used}
*ebfedea0SLionel Sambuccount of both $x0$ and $x1$ is fixed up and $x0$ is clamped before the calculations begin.  At this point $x1$ and $x0$ are valid equivalents
*ebfedea0SLionel Sambucto the respective halves as if mp\_rshd and mp\_mod\_2d had been used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy inlining the copy and shift operations the cutoff point for Karatsuba multiplication can be lowered.  On the Athlon the cutoff point
*ebfedea0SLionel Sambucis exactly at the point where Comba squaring can no longer be used (\textit{128 digits}).  On slower processors such as the Intel P4
*ebfedea0SLionel Sambucit is actually below the Comba limit (\textit{at 110 digits}).
*ebfedea0SLionel Sambuc
This routine uses the same error trap coding style as mp\_karatsuba\_mul.  As the temporary variables are initialized, errors are
*ebfedea0SLionel Sambucredirected to the error trap higher up.  If the algorithm completes without error the error code is set to \textbf{MP\_OKAY} and
*ebfedea0SLionel Sambucmp\_clears are executed normally.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Toom-Cook Squaring}
*ebfedea0SLionel SambucThe Toom-Cook squaring algorithm mp\_toom\_sqr is heavily based on the algorithm mp\_toom\_mul with the exception that squarings are used
*ebfedea0SLionel Sambucinstead of multiplication to find the five relations.  The reader is encouraged to read the description of the latter algorithm and try to
*ebfedea0SLionel Sambucderive their own Toom-Cook squaring algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{High Level Squaring}
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_sqr}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $b \leftarrow a^2$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a.used \ge TOOM\_SQR\_CUTOFF$ then  \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $b \leftarrow a^2$ using algorithm mp\_toom\_sqr \\
*ebfedea0SLionel Sambuc2.  else if $a.used \ge KARATSUBA\_SQR\_CUTOFF$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $b \leftarrow a^2$ using algorithm mp\_karatsuba\_sqr \\
*ebfedea0SLionel Sambuc3.  else \\
\hspace{3mm}3.1  $digs \leftarrow 2 \cdot a.used + 1$ \\
\hspace{3mm}3.2  If $digs < MP\_WARRAY$ and $a.used \le \delta$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}3.2.1  $b \leftarrow a^2$ using algorithm fast\_s\_mp\_sqr.  \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.3  else \\
*ebfedea0SLionel Sambuc\hspace{6mm}3.3.1  $b \leftarrow a^2$ using algorithm s\_mp\_sqr.  \\
*ebfedea0SLionel Sambuc4.  $b.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc5.  Return the result of the unsigned squaring performed. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_sqr}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_sqr.}
*ebfedea0SLionel SambucThis algorithm computes the square of the input using one of four different algorithms.  If the input is very large and has at least
*ebfedea0SLionel Sambuc\textbf{TOOM\_SQR\_CUTOFF} or \textbf{KARATSUBA\_SQR\_CUTOFF} digits then either the Toom-Cook or the Karatsuba Squaring algorithm is used.  If
*ebfedea0SLionel Sambucneither of the polynomial basis algorithms should be used then either the Comba or baseline algorithm is used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_sqr.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section*{Exercises}
*ebfedea0SLionel Sambuc\begin{tabular}{cl}
*ebfedea0SLionel Sambuc$\left [ 3 \right ] $ & Devise an efficient algorithm for selection of the radix point to handle inputs \\
*ebfedea0SLionel Sambuc                      & that have different number of digits in Karatsuba multiplication. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ] $ & In ~SQUARE~ the fact that every column of a squaring is made up \\
*ebfedea0SLionel Sambuc                      & of double products and at most one square is stated.  Prove this statement. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 3 \right ] $ & Prove the equation for Karatsuba squaring. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 1 \right ] $ & Prove that Karatsuba squaring requires $O \left (n^{lg(3)} \right )$ time. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ] $ & Determine the minimal ratio between addition and multiplication clock cycles \\
*ebfedea0SLionel Sambuc                      & required for equation $6.7$ to be true.  \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc$\left [ 3 \right ] $ & Implement a threaded version of Comba multiplication (and squaring) where you \\
*ebfedea0SLionel Sambuc                      & compute subsets of the columns in each thread.  Determine a cutoff point where \\
*ebfedea0SLionel Sambuc                      & it is effective and add the logic to mp\_mul() and mp\_sqr(). \\
*ebfedea0SLionel Sambuc                      &\\
*ebfedea0SLionel Sambuc$\left [ 4 \right ] $ & Same as the previous but also modify the Karatsuba and Toom-Cook.  You must \\
*ebfedea0SLionel Sambuc                      & increase the throughput of mp\_exptmod() for random odd moduli in the range \\
*ebfedea0SLionel Sambuc                      & $512 \ldots 4096$ bits significantly ($> 2x$) to complete this challenge. \\
*ebfedea0SLionel Sambuc                      & \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Modular Reduction}
*ebfedea0SLionel SambucMARK,REDUCTION
*ebfedea0SLionel Sambuc\section{Basics of Modular Reduction}
*ebfedea0SLionel Sambuc\index{modular residue}
*ebfedea0SLionel SambucModular reduction is an operation that arises quite often within public key cryptography algorithms and various number theoretic algorithms,
*ebfedea0SLionel Sambucsuch as factoring.  Modular reduction algorithms are the third class of algorithms of the ``multipliers'' set.  A number $a$ is said to be \textit{reduced}
*ebfedea0SLionel Sambucmodulo another number $b$ by finding the remainder of the division $a/b$.  Full integer division with remainder is a topic to be covered
*ebfedea0SLionel Sambucin~\ref{sec:division}.
*ebfedea0SLionel Sambuc
Modular reduction is equivalent to solving for $r$ in the equation $a = bq + r$ where $q = \lfloor a/b \rfloor$.  The result
*ebfedea0SLionel Sambuc$r$ is said to be ``congruent to $a$ modulo $b$'' which is also written as $r \equiv a \mbox{ (mod }b\mbox{)}$.  In other vernacular $r$ is known as the
*ebfedea0SLionel Sambuc``modular residue'' which leads to ``quadratic residue''\footnote{That's fancy talk for $b \equiv a^2 \mbox{ (mod }p\mbox{)}$.} and
*ebfedea0SLionel Sambucother forms of residues.
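
For example (with values chosen here only for illustration), reducing $a = 47$ modulo $b = 9$ gives $q = \lfloor 47/9 \rfloor = 5$ and the following decomposition.

\begin{center}
$47 = 9 \cdot 5 + 2$
\end{center}

The residue is therefore $r = 2$, that is $2 \equiv 47 \mbox{ (mod }9\mbox{)}$.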
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucModular reductions are normally used to create either finite groups, rings or fields.  The most common usage for performance driven modular reductions
*ebfedea0SLionel Sambucis in modular exponentiation algorithms.  That is to compute $d = a^b \mbox{ (mod }c\mbox{)}$ as fast as possible.  This operation is used in the
RSA and Diffie-Hellman public key algorithms, for example.  Modular multiplication and squaring also appear as fundamental operations in
elliptic curve cryptographic algorithms.  As will be discussed in the subsequent chapter there exist fast algorithms for computing modular
exponentiations without having to perform (\textit{in this example}) $b - 1$ multiplications.  These algorithms will produce partial results in the
range $0 \le x < c^2$ which can be taken advantage of to create several efficient algorithms.   Modular reductions have also been used to create redundancy check
algorithms known as CRCs, error correction codes such as Reed-Solomon and to solve a variety of number theoretic problems.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{The Barrett Reduction}
*ebfedea0SLionel SambucThe Barrett reduction algorithm \cite{BARRETT} was inspired by fast division algorithms which multiply by the reciprocal to emulate
division.  Barrett's observation was that the residue $c$ of $a$ modulo $b$ is equal to
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucc = a - b \cdot \lfloor a/b \rfloor
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSince algorithms such as modular exponentiation would be using the same modulus extensively, typical DSP\footnote{It is worth noting that Barrett's paper
*ebfedea0SLionel Sambuctargeted the DSP56K processor.}  intuition would indicate the next step would be to replace $a/b$ by a multiplication by the reciprocal.  However,
*ebfedea0SLionel SambucDSP intuition on its own will not work as these numbers are considerably larger than the precision of common DSP floating point data types.
*ebfedea0SLionel SambucIt would take another common optimization to optimize the algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Fixed Point Arithmetic}
The trick used to optimize the above equation is based on a technique of emulating floating point data types with fixed precision integers.  Fixed
point arithmetic would become very popular as it greatly optimized the ``3d-shooter'' genre of games in the mid 1990s when floating point units were
fairly slow if not unavailable.   The idea behind fixed point arithmetic is to take a normal $k$-bit integer data type and break it into a $p$-bit
integer part and a $q$-bit fraction part (\textit{where $p+q = k$}).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn this system a $k$-bit integer $n$ would actually represent $n/2^q$.  For example, with $q = 4$ the integer $n = 37$ would actually represent the
*ebfedea0SLionel Sambucvalue $2.3125$.  To multiply two fixed point numbers the integers are multiplied using traditional arithmetic and subsequently normalized by
*ebfedea0SLionel Sambucmoving the implied decimal point back to where it should be.  For example, with $q = 4$ to multiply the integers $9$ and $5$ they must be converted
*ebfedea0SLionel Sambucto fixed point first by multiplying by $2^q$.  Let $a = 9(2^q)$ represent the fixed point representation of $9$ and $b = 5(2^q)$ represent the
*ebfedea0SLionel Sambucfixed point representation of $5$.  The product $ab$ is equal to $45(2^{2q})$ which when normalized by dividing by $2^q$ produces $45(2^q)$.
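
As a further illustration with a fractional operand (the values are chosen here for the example only), let $q = 4$ and multiply $2.3125$ by $1.5$.  The fixed point representations are $37$ and $24$ respectively.

\begin{center}
\begin{tabular}{rcl}
$37 \cdot 24$ & $=$ & $888$ \\
$\lfloor 888 / 2^4 \rfloor$ & $=$ & $55$ \\
$55 / 2^4$ & $=$ & $3.4375$ \\
\end{tabular}
\end{center}

The exact product is $3.46875$.  The difference arises because the normalizing shift discards the low $q$ bits of the double width product.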
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis technique became popular since a normal integer multiplication and logical shift right are the only required operations to perform a multiplication
*ebfedea0SLionel Sambucof two fixed point numbers.  Using fixed point arithmetic, division can be easily approximated by multiplying by the reciprocal.  If $2^q$ is
equivalent to one then $2^q/b$ is equivalent to the fixed point approximation of $1/b$ using real arithmetic.  Using this fact dividing an integer
*ebfedea0SLionel Sambuc$a$ by another integer $b$ can be achieved with the following expression.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\lfloor a / b \rfloor \mbox{ }\approx\mbox{ } \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
The precision of the division is proportional to the value of $q$.  If the divisor $b$ is used frequently, as is the case with
modular exponentiation, pre-computing $2^q/b$ will allow a division to be performed with a multiplication and a right shift.  Both operations
are considerably faster than division on most processors.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucConsider dividing $19$ by $5$.  The correct result is $\lfloor 19/5 \rfloor = 3$.  With $q = 3$ the reciprocal is $\lfloor 2^q/5 \rfloor = 1$ which
*ebfedea0SLionel Sambucleads to a product of $19$ which when divided by $2^q$ produces $2$.  However, with $q = 4$ the reciprocal is $\lfloor 2^q/5 \rfloor = 3$ and
*ebfedea0SLionel Sambucthe result of the emulated division is $\lfloor 3 \cdot 19 / 2^q \rfloor = 3$ which is correct.  The value of $2^q$ must be close to or ideally
*ebfedea0SLionel Sambuclarger than the dividend.  In effect if $a$ is the dividend then $q$ should allow $0 \le \lfloor a/2^q \rfloor \le 1$ in order for this approach
to work correctly.  Plugging this form of division into the original equation the following modular residue equation arises.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucc = a - b \cdot \lfloor (a \cdot \lfloor 2^q / b \rfloor)/2^q \rfloor
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucUsing the notation from \cite{BARRETT} the value of $\lfloor 2^q / b \rfloor$ will be represented by the $\mu$ symbol.  Using the $\mu$
*ebfedea0SLionel Sambucvariable also helps re-inforce the idea that it is meant to be computed once and re-used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucc = a - b \cdot \lfloor (a \cdot \mu)/2^q \rfloor
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucProvided that $2^q \ge a$ this algorithm will produce a quotient that is either exactly correct or off by a value of one.  In the context of Barrett
*ebfedea0SLionel Sambucreduction the value of $a$ is bound by $0 \le a \le (b - 1)^2$ meaning that $2^q \ge b^2$ is sufficient to ensure the reciprocal will have enough
*ebfedea0SLionel Sambucprecision.
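
The bound on the error can be seen from a short argument, given here as a sketch.  Write $2^q = b\mu + s$ with $0 \le s < b$, which is simply the division that defines $\mu$ (the symbol $s$ is introduced only for this illustration).

\begin{center}
\begin{tabular}{rcl}
${{a \cdot \mu} \over {2^q}}$ & $=$ & ${{a(2^q - s)} \over {b \cdot 2^q}}$ \\
 & $=$ & ${a \over b} - {{a \cdot s} \over {b \cdot 2^q}}$ \\
\end{tabular}
\end{center}

Since $0 \le a \le 2^q$ and $0 \le s < b$, the subtracted term lies in the range $0 \le x < 1$.  The estimate therefore satisfies
$\lfloor a/b \rfloor - 1 \le \lfloor (a \cdot \mu)/2^q \rfloor \le \lfloor a/b \rfloor$.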
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $n$ represent the number of digits in $b$.  This algorithm requires approximately $2n^2$ single precision multiplications to produce the quotient and
*ebfedea0SLionel Sambucanother $n^2$ single precision multiplications to find the residue.  In total $3n^2$ single precision multiplications are required to
*ebfedea0SLionel Sambucreduce the number.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, if $b = 1179677$ and $q = 41$ ($2^q > b^2$), then the reciprocal $\mu$ is equal to $\lfloor 2^q / b \rfloor = 1864089$.  Consider reducing
*ebfedea0SLionel Sambuc$a = 180388626447$ modulo $b$ using the above reduction equation.  The quotient using the new formula is $\lfloor (a \cdot \mu) / 2^q \rfloor = 152913$.
*ebfedea0SLionel SambucBy subtracting $152913b$ from $a$ the correct residue $a \equiv 677346 \mbox{ (mod }b\mbox{)}$ is found.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Choosing a Radix Point}
*ebfedea0SLionel SambucUsing the fixed point representation a modular reduction can be performed with $3n^2$ single precision multiplications.  If that were the best
*ebfedea0SLionel Sambucthat could be achieved a full division\footnote{A division requires approximately $O(2cn^2)$ single precision multiplications for a small value of $c$.
*ebfedea0SLionel SambucSee~\ref{sec:division} for further details.} might as well be used in its place.  The key to optimizing the reduction is to reduce the precision of
*ebfedea0SLionel Sambucthe initial multiplication that finds the quotient.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $a$ represent the number of which the residue is sought.  Let $b$ represent the modulus used to find the residue.  Let $m$ represent
*ebfedea0SLionel Sambucthe number of digits in $b$.  For the purposes of this discussion we will assume that the number of digits in $a$ is $2m$, which is generally true if
*ebfedea0SLionel Sambuctwo $m$-digit numbers have been multiplied.  Dividing $a$ by $b$ is the same as dividing a $2m$ digit integer by a $m$ digit integer.  Digits below the
*ebfedea0SLionel Sambuc$m - 1$'th digit of $a$ will contribute at most a value of $1$ to the quotient because $\beta^k < b$ for any $0 \le k \le m - 1$.  Another way to
express this is by re-writing $a$ as two parts.  If $a' \equiv a \mbox{ (mod }\beta^{m-1}\mbox{)}$ and $a'' = a - a'$ then
${a \over b} = {{a' + a''} \over b}$ which is equivalent to ${a' \over b} + {a'' \over b}$.  Since $a'$ is bound to be less than $b$ the quotient
*ebfedea0SLionel Sambucis bound by $0 \le {a' \over b} < 1$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSince the digits of $a'$ do not contribute much to the quotient the observation is that they might as well be zero.  However, if the digits
*ebfedea0SLionel Sambuc``might as well be zero'' they might as well not be there in the first place.  Let $q_0 = \lfloor a/\beta^{m-1} \rfloor$ represent the input
*ebfedea0SLionel Sambucwith the irrelevant digits trimmed.  Now the modular reduction is trimmed to the almost equivalent equation
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucc = a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
Note that the original divisor $2^q$ has been replaced with $\beta^{m+1}$ where in this case $q$ is a multiple of $lg(\beta)$. Also note that the
exponent on the divisor ($m + 1$), when added to the number of digits $q_0$ was shifted by ($m - 1$), equals $2m$.  If the optimization had not been performed the divisor
would have the exponent $2m$ so in the end the exponents do ``add up''. Using the above equation the quotient
$\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ can be off from the true quotient by at most two.  The original fixed point quotient can be off
by as much as one (\textit{provided the radix point is chosen suitably}) and now that the lower irrelevant digits have been trimmed the quotient
can be off by an additional value of one for a total of at most two.  This implies that
$0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$.  By first subtracting $b$ times the quotient and then conditionally subtracting
$b$ once or twice the residue is found.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe quotient is now found using $(m + 1)(m) = m^2 + m$ single precision multiplications and the residue with an additional $m^2$ single
*ebfedea0SLionel Sambucprecision multiplications, ignoring the subtractions required.  In total $2m^2 + m$ single precision multiplications are required to find the residue.
*ebfedea0SLionel SambucThis is considerably faster than the original attempt.
*ebfedea0SLionel Sambuc
For example, let $\beta = 10$ represent the radix of the digits.  Let $b = 9999$ represent the modulus which implies $m = 4$. Let $a = 99929878$
represent the value of which the residue is desired.  In this case $q = 8$ since $10^7 < 9999^2 < 10^8$ meaning that $\mu = \lfloor \beta^{q}/b \rfloor = 10001$.
With the new observation the multiplicand for the quotient is equal to $q_0 = \lfloor a / \beta^{m - 1} \rfloor = 99929$.  The quotient is then
$\lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor = 9993$.  Subtracting $9993b$ from $a$ yields the correct residue $a \equiv 9871 \mbox{ (mod }b\mbox{)}$.
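
The trimmed reduction equation can be traced in code for the radix ten example.  The sketch below assumes $\beta = 10$ and single machine words; the function names are hypothetical and serve only to expose the intermediate values $q_0$ and the estimated quotient.

```c
#include <assert.h>
#include <stdint.h>

/* Return 10^e; the caller keeps e small enough to avoid overflow. */
static uint64_t pow10_u64(unsigned e)
{
    uint64_t p = 1;
    while (e-- > 0) {
        p *= 10;
    }
    return p;
}

/* q0 = floor(a / beta^(m-1)): the input with the low m - 1 digits trimmed. */
uint64_t trimmed_input(uint64_t a, unsigned m)
{
    assert(m >= 1);
    return a / pow10_u64(m - 1);
}

/* Estimated quotient floor((q0 * mu) / beta^(m+1)). */
uint64_t trimmed_quotient(uint64_t a, unsigned m, uint64_t mu)
{
    return (trimmed_input(a, m) * mu) / pow10_u64(m + 1);
}

/* c = a - b * quotient, followed by the conditional subtractions of b. */
uint64_t trimmed_reduce(uint64_t a, uint64_t b, unsigned m, uint64_t mu)
{
    assert(b > 1);
    uint64_t r = a - b * trimmed_quotient(a, m, mu);
    while (r >= b) {          /* expected to run at most twice */
        r -= b;
    }
    return r;
}
```

For $a = 99929878$, $b = 9999$, $m = 4$ and $\mu = 10001$ the three functions return $99929$, $9993$ and $9871$ respectively, matching the example.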
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Trimming the Quotient}
*ebfedea0SLionel SambucSo far the reduction algorithm has been optimized from $3m^2$ single precision multiplications down to $2m^2 + m$ single precision multiplications.  As
*ebfedea0SLionel Sambucit stands now the algorithm is already fairly fast compared to a full integer division algorithm.  However, there is still room for
*ebfedea0SLionel Sambucoptimization.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the first multiplication inside the quotient ($q_0 \cdot \mu$) the value is shifted right by $m + 1$ places effectively nullifying the lower
*ebfedea0SLionel Sambuchalf of the product.  It would be nice to be able to remove those digits from the product to effectively cut down the number of single precision
*ebfedea0SLionel Sambucmultiplications.  If the number of digits in the modulus $m$ is far less than $\beta$ a full product is not required for the algorithm to work properly.
*ebfedea0SLionel SambucIn fact the lower $m - 2$ digits will not affect the upper half of the product at all and do not need to be computed.
*ebfedea0SLionel Sambuc
The value of $\mu$ is an $m$-digit number and $q_0$ is an $m + 1$ digit number.  Using a full multiplier $(m + 1)(m) = m^2 + m$ single precision
multiplications would be required.  Using a multiplier that will only produce digits at and above the $m - 1$'th digit reduces the number
of single precision multiplications to ${m^2 + m} \over 2$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Trimming the Residue}
After the quotient has been calculated it is used to reduce the input.  As previously noted the algorithm is not exact and it can be off by a small
multiple of the modulus, that is $0 \le a - b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor < 3b$.  If $b$ is $m$ digits then the
result of the reduction equation is a value of at most $m + 1$ digits (\textit{provided $3 < \beta$}) implying that the upper $m - 1$ digits are
implicitly zero.
*ebfedea0SLionel Sambuc
The next optimization arises from this very fact.  Instead of computing $b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ using a full
$O(m^2)$ multiplication algorithm only the lower $m+1$ digits of the product have to be computed.  Similarly the value of $a$ can
be reduced modulo $\beta^{m+1}$ before the multiple of $b$ is subtracted which simplifies the subtraction as well.  A multiplication that produces
only the lower $m+1$ digits requires ${m^2 + 3m - 2} \over 2$ single precision multiplications.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWith both optimizations in place the algorithm is the algorithm Barrett proposed.  It requires $m^2 + 2m - 1$ single precision multiplications which
*ebfedea0SLionel Sambucis considerably faster than the straightforward $3m^2$ method.
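
As a check on the total, and taking the two trimmed multiplication counts quoted above as given, the costs add as follows.

\begin{center}
\begin{tabular}{rcl}
${{m^2 + m} \over 2} + {{m^2 + 3m - 2} \over 2}$ & $=$ & ${{2m^2 + 4m - 2} \over 2}$ \\
 & $=$ & $m^2 + 2m - 1$ \\
\end{tabular}
\end{center}

For instance, with $m = 32$ this gives $1087$ single precision multiplications compared to $3072$ for the $3m^2$ method.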
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The Barrett Algorithm}
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_reduce}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$, mp\_int $b$ and $\mu = \lfloor \beta^{2m}/b \rfloor, m = \lceil lg_{\beta}(b) \rceil, (0 \le a < b^2, b > 1)$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a \mbox{ (mod }b\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucLet $m$ represent the number of digits in $b$.  \\
*ebfedea0SLionel Sambuc1.  Make a copy of $a$ and store it in $q$.  (\textit{mp\_init\_copy}) \\
*ebfedea0SLionel Sambuc2.  $q \leftarrow \lfloor q / \beta^{m - 1} \rfloor$ (\textit{mp\_rshd}) \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucProduce the quotient. \\
*ebfedea0SLionel Sambuc3.  $q \leftarrow q \cdot \mu$  (\textit{note: only produce digits at or above $m-1$}) \\
*ebfedea0SLionel Sambuc4.  $q \leftarrow \lfloor q / \beta^{m + 1} \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucSubtract the multiple of modulus from the input. \\
*ebfedea0SLionel Sambuc5.  $a \leftarrow a \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{mp\_mod\_2d}) \\
*ebfedea0SLionel Sambuc6.  $q \leftarrow q \cdot b \mbox{ (mod }\beta^{m+1}\mbox{)}$ (\textit{s\_mp\_mul\_digs}) \\
*ebfedea0SLionel Sambuc7.  $a \leftarrow a - q$ (\textit{mp\_sub}) \\
*ebfedea0SLionel Sambuc\\
Add $\beta^{m+1}$ if a borrow occurred. \\
*ebfedea0SLionel Sambuc8.  If $a < 0$ then (\textit{mp\_cmp\_d}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $q \leftarrow 1$ (\textit{mp\_set}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.2  $q \leftarrow q \cdot \beta^{m+1}$ (\textit{mp\_lshd}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.3  $a \leftarrow a + q$ \\
*ebfedea0SLionel Sambuc\\
Now subtract the modulus if the residue is too large (i.e. the quotient was too small). \\
9.  While $a \ge b$ do (\textit{mp\_cmp}) \\
\hspace{3mm}9.1  $a \leftarrow a - b$ \\
*ebfedea0SLionel Sambuc10.  Clear $q$. \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_reduce}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_reduce.}
*ebfedea0SLionel SambucThis algorithm will reduce the input $a$ modulo $b$ in place using the Barrett algorithm.  It is loosely based on algorithm 14.42 of HAC
*ebfedea0SLionel Sambuc\cite[pp.  602]{HAC} which is based on the paper from Paul Barrett \cite{BARRETT}.  The algorithm has several restrictions and assumptions which must
*ebfedea0SLionel Sambucbe adhered to for the algorithm to work.
*ebfedea0SLionel Sambuc
First the modulus $b$ is assumed to be positive and greater than one.  If the modulus were less than or equal to one then subtracting
*ebfedea0SLionel Sambuca multiple of it would either accomplish nothing or actually enlarge the input.  The input $a$ must be in the range $0 \le a < b^2$ in order
*ebfedea0SLionel Sambucfor the quotient to have enough precision.  If $a$ is the product of two numbers that were already reduced modulo $b$, this will not be a problem.
*ebfedea0SLionel SambucTechnically the algorithm will still work if $a \ge b^2$ but it will take much longer to finish.  The value of $\mu$ is passed as an argument to this
*ebfedea0SLionel Sambucalgorithm and is assumed to be calculated and stored before the algorithm is used.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRecall that the multiplication for the quotient on step 3 must only produce digits at or above the $m-1$'th position.  An algorithm called
*ebfedea0SLionel Sambuc$s\_mp\_mul\_high\_digs$ which has not been presented is used to accomplish this task.  The algorithm is based on $s\_mp\_mul\_digs$ except that
*ebfedea0SLionel Sambucinstead of stopping at a given level of precision it starts at a given level of precision.  This optimal algorithm can only be used if the number
*ebfedea0SLionel Sambucof digits in $b$ is very much smaller than $\beta$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhile it is known that
*ebfedea0SLionel Sambuc$a \ge b \cdot \lfloor (q_0 \cdot \mu) / \beta^{m+1} \rfloor$ only the lower $m+1$ digits are being used to compute the residue, so an implied
*ebfedea0SLionel Sambuc``borrow'' from the higher digits might leave a negative result.  After the multiple of the modulus has been subtracted from $a$ the residue must be
*ebfedea0SLionel Sambucfixed up in case it is negative.  The invariant $\beta^{m+1}$ must be added to the residue to make it positive again.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe while loop at step 9 will subtract $b$ until the residue is less than $b$.  If the algorithm is performed correctly this step is
performed at most twice, and on average once. However, if $a \ge b^2$ then it will iterate substantially more times than it should.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_reduce.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first multiplication that determines the quotient can be performed by only producing the digits from $m - 1$ and up.  This essentially halves
*ebfedea0SLionel Sambucthe number of single precision multiplications required.  However, the optimization is only safe if $\beta$ is much larger than the number of digits
*ebfedea0SLionel Sambucin the modulus.  In the source code this is evaluated on lines @36,if@ to @44,}@ where algorithm s\_mp\_mul\_high\_digs is used when it is
*ebfedea0SLionel Sambucsafe to do so.
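
To see steps 2 through 9 of algorithm mp\_reduce without the multiple precision machinery, the following single word sketch may help.  It assumes a radix of $\beta = 2^8$ and a modulus of at most three digits so that every intermediate value fits in 64 bits; barrett\_reduce is a hypothetical name and this is not the library routine.  Full products are used in place of the trimmed multipliers since only the arithmetic is being illustrated.

```c
#include <assert.h>
#include <stdint.h>

#define RADIX_BITS 8u   /* beta = 2^8, chosen only to keep the sketch in one word */

/* Barrett reduction of a modulo b, following steps 2 to 9 of mp_reduce.
 * b has exactly m radix-2^8 digits (1 <= m <= 3), mu = floor(beta^(2m) / b)
 * and 0 <= a < b^2. */
uint64_t barrett_reduce(uint64_t a, uint64_t b, unsigned m, uint64_t mu)
{
    assert(m >= 1 && m <= 3 && b > 1);
    assert(b < (UINT64_C(1) << (RADIX_BITS * m)));
    assert((b >> (RADIX_BITS * (m - 1))) != 0);
    assert(a < b * b);

    /* steps 2-4: q = floor(floor(a / beta^(m-1)) * mu / beta^(m+1)) */
    uint64_t q = a >> (RADIX_BITS * (m - 1));
    q = (q * mu) >> (RADIX_BITS * (m + 1));

    /* steps 5-8: subtract modulo beta^(m+1); masking the wrapped difference
     * has the same effect as adding beta^(m+1) after a borrow */
    uint64_t mask = (UINT64_C(1) << (RADIX_BITS * (m + 1))) - 1;
    uint64_t r = ((a & mask) - ((q * b) & mask)) & mask;

    /* step 9: at most two subtractions are expected */
    while (r >= b) {
        r -= b;
    }
    return r;
}
```

The call barrett\_reduce(99929878, 9999, 2, 429539) returns $9871$, where $429539 = \lfloor 2^{32}/9999 \rfloor$ and $9999$ occupies two radix $2^8$ digits.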
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The Barrett Setup Algorithm}
*ebfedea0SLionel SambucIn order to use algorithm mp\_reduce the value of $\mu$ must be calculated in advance.  Ideally this value should be computed once and stored for
*ebfedea0SLionel Sambucfuture use so that the Barrett algorithm can be used without delay.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_reduce\_setup}. \\
\textbf{Input}.   mp\_int $b$ ($b > 1$) of $m$ digits \\
\textbf{Output}.  $\mu \leftarrow \lfloor \beta^{2m}/b \rfloor$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $\mu \leftarrow 2^{2 \cdot lg(\beta) \cdot  m}$ (\textit{mp\_2expt}) \\
*ebfedea0SLionel Sambuc2.  $\mu \leftarrow \lfloor \mu / b \rfloor$ (\textit{mp\_div}) \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_reduce\_setup}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_reduce\_setup.}
*ebfedea0SLionel SambucThis algorithm computes the reciprocal $\mu$ required for Barrett reduction.  First $\beta^{2m}$ is calculated as $2^{2 \cdot lg(\beta) \cdot  m}$ which
*ebfedea0SLionel Sambucis equivalent and much faster.  The final value is computed by taking the integer quotient of $\lfloor \mu / b \rfloor$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_reduce_setup.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis simple routine calculates the reciprocal $\mu$ required by Barrett reduction.  Note the extended usage of algorithm mp\_div where the variable
which would receive the remainder is passed as NULL.  As will be discussed in~\ref{sec:division} the division routine allows both the quotient and the
*ebfedea0SLionel Sambucremainder to be passed as NULL meaning to ignore the value.
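
The two steps of the setup algorithm can also be sketched on machine words.  As before this assumes $\beta = 2^8$ and a modulus of at most three digits; radix\_digits and barrett\_setup are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define RADIX_BITS 8u   /* beta = 2^8, an assumption of this sketch */

/* Number of radix-2^8 digits needed to represent b. */
unsigned radix_digits(uint64_t b)
{
    unsigned m = 0;
    while (b != 0) {
        m++;
        b >>= RADIX_BITS;
    }
    return m;
}

/* mu = floor(beta^(2m) / b) where m is the digit count of b.
 * b is limited to three digits so that beta^(2m) = 2^(16m) fits in 64 bits. */
uint64_t barrett_setup(uint64_t b)
{
    unsigned m = radix_digits(b);
    assert(b > 1 && m <= 3);

    uint64_t mu = UINT64_C(1) << (2u * RADIX_BITS * m);   /* step 1: beta^(2m) as a power of two */
    return mu / b;                                        /* step 2: integer quotient */
}
```

For $b = 9999$ the digit count is $m = 2$ and the result is $\lfloor 2^{32}/9999 \rfloor = 429539$.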
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{The Montgomery Reduction}
*ebfedea0SLionel SambucMontgomery reduction\footnote{Thanks to Niels Ferguson for his insightful explanation of the algorithm.} \cite{MONT} is by far the most interesting
form of reduction in common use.  It computes a modular residue which is not actually equal to the residue of the input but is instead equal to a
residue times a constant.  However, as perplexing as this may sound the algorithm is relatively simple and very efficient.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThroughout this entire section the variable $n$ will represent the modulus used to form the residue.  As will be discussed shortly the value of
*ebfedea0SLionel Sambuc$n$ must be odd.  The variable $x$ will represent the quantity of which the residue is sought.  Similar to the Barrett algorithm the input
*ebfedea0SLionel Sambucis restricted to $0 \le x < n^2$.  To begin the description some simple number theory facts must be established.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Fact 1.}  Adding $n$ to $x$ does not change the residue since in effect it adds one to the quotient $\lfloor x / n \rfloor$.  Another way
*ebfedea0SLionel Sambucto explain this is that $n$ is (\textit{or multiples of $n$ are}) congruent to zero modulo $n$.  Adding zero will not change the value of the residue.
*ebfedea0SLionel Sambuc
\textbf{Fact 2.}  If $x$ is even then performing a division by two in $\Z$ is congruent to $x \cdot 2^{-1} \mbox{ (mod }n\mbox{)}$.  Actually
this is an application of the fact that if $x$ is evenly divisible by any $k \in \Z$ that is invertible modulo $n$ then division in $\Z$ will be congruent to
multiplication by $k^{-1}$ modulo $n$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFrom these two simple facts the following simple algorithm can be derived.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Montgomery Reduction}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $x$, $n$ and $k$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  for $t$ from $1$ to $k$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  If $x$ is odd then \\
*ebfedea0SLionel Sambuc\hspace{6mm}1.1.1  $x \leftarrow x + n$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $x \leftarrow x/2$ \\
*ebfedea0SLionel Sambuc2.  Return $x$. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Montgomery Reduction}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
The algorithm reduces the input one bit at a time using the two congruences stated previously.  Inside the loop $n$, which is odd, is
*ebfedea0SLionel Sambucadded to $x$ if $x$ is odd.  This forces $x$ to be even which allows the division by two in $\Z$ to be congruent to a modular division by two.  Since
*ebfedea0SLionel Sambuc$x$ is assumed to be initially much larger than $n$ the addition of $n$ will contribute an insignificant magnitude to $x$.  Let $r$ represent the
*ebfedea0SLionel Sambucfinal result of the Montgomery algorithm.  If $k > lg(n)$ and $0 \le x < n^2$ then the final result is limited to
*ebfedea0SLionel Sambuc$0 \le r < \lfloor x/2^k \rfloor + n$.  As a result at most a single subtraction is required to get the residue desired.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|l|}
*ebfedea0SLionel Sambuc\hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} \\
*ebfedea0SLionel Sambuc\hline $1$ & $x + n = 5812$, $x/2 = 2906$ \\
*ebfedea0SLionel Sambuc\hline $2$ & $x/2 = 1453$ \\
*ebfedea0SLionel Sambuc\hline $3$ & $x + n = 1710$, $x/2 = 855$ \\
*ebfedea0SLionel Sambuc\hline $4$ & $x + n = 1112$, $x/2 = 556$ \\
*ebfedea0SLionel Sambuc\hline $5$ & $x/2 = 278$ \\
*ebfedea0SLionel Sambuc\hline $6$ & $x/2 = 139$ \\
*ebfedea0SLionel Sambuc\hline $7$ & $x + n = 396$, $x/2 = 198$ \\
*ebfedea0SLionel Sambuc\hline $8$ & $x/2 = 99$ \\
*ebfedea0SLionel Sambuc\hline $9$ & $x + n = 356$, $x/2 = 178$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Example of Montgomery Reduction (I)}
*ebfedea0SLionel Sambuc\label{fig:MONT1}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
Consider the example in figure~\ref{fig:MONT1} which reduces $x = 5555$ modulo $n = 257$ when $k = 9$ (note $2^k = 512$ which is larger than $n$).  The result of
the algorithm $r = 178$ is congruent to the value of $2^{-9} \cdot 5555 \mbox{ (mod }257\mbox{)}$.  When $r$ is multiplied by $2^9$ modulo $257$ the correct residue
$x \equiv 158 \mbox{ (mod }257\mbox{)}$ is produced.
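
The bit-wise algorithm translates almost directly into C.  The sketch below assumes the values fit in a 64-bit word; mont\_reduce\_bitwise is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Return a value congruent to 2^(-k) * x modulo the odd modulus n.
 * Assumption of this sketch: x + n never overflows 64 bits. */
uint64_t mont_reduce_bitwise(uint64_t x, uint64_t n, unsigned k)
{
    assert((n & 1u) == 1u);          /* n must be odd */

    for (unsigned t = 1; t <= k; t++) {
        if (x & 1u) {
            x += n;                  /* Fact 1: residue unchanged, x becomes even */
        }
        x >>= 1;                     /* Fact 2: exact division by two */
    }
    return x;
}
```

The call mont\_reduce\_bitwise(5555, 257, 9) returns $178$ as in figure~\ref{fig:MONT1}, and $178 \cdot 2^9 \equiv 158 \mbox{ (mod }257\mbox{)}$ recovers the residue of $5555$.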
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLet $k = \lfloor lg(n) \rfloor + 1$ represent the number of bits in $n$.  The current algorithm requires $2k^2$ single precision shifts
*ebfedea0SLionel Sambucand $k^2$ single precision additions.  At this rate the algorithm is most certainly slower than Barrett reduction and not terribly useful.
*ebfedea0SLionel SambucFortunately there exists an alternative representation of the algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Montgomery Reduction} (modified I). \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $x$, $n$ and $k$ ($2^k > n$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $2^{-k}x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  for $t$ from $1$ to $k$ do \\
\hspace{3mm}1.1  If the $t$'th bit of $x$ (counting from one at the least significant bit) is one then \\
\hspace{6mm}1.1.1  $x \leftarrow x + 2^{t-1}n$ \\
2.  Return $x/2^k$. \\
\hline
\end{tabular}
\end{center}
\end{small}
\caption{Algorithm Montgomery Reduction (modified I)}
\end{figure}

This algorithm is equivalent since $2^{t-1}n$ is a multiple of $n$ and the lower $k$ bits of $x$ are zero by step 2.  The number of single
precision shifts has now been reduced from $2k^2$ to $k^2 + k$ which is only a small improvement.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|l|r|}
*ebfedea0SLionel Sambuc\hline \textbf{Step number ($t$)} & \textbf{Result ($x$)} & \textbf{Result ($x$) in Binary} \\
*ebfedea0SLionel Sambuc\hline -- & $5555$ & $1010110110011$ \\
*ebfedea0SLionel Sambuc\hline $1$ & $x + 2^{0}n = 5812$ &  $1011010110100$ \\
*ebfedea0SLionel Sambuc\hline $2$ & $5812$ & $1011010110100$ \\
*ebfedea0SLionel Sambuc\hline $3$ & $x + 2^{2}n = 6840$ & $1101010111000$ \\
*ebfedea0SLionel Sambuc\hline $4$ & $x + 2^{3}n = 8896$ & $10001011000000$ \\
*ebfedea0SLionel Sambuc\hline $5$ & $8896$ & $10001011000000$ \\
*ebfedea0SLionel Sambuc\hline $6$ & $8896$ & $10001011000000$ \\
*ebfedea0SLionel Sambuc\hline $7$ & $x + 2^{6}n = 25344$ & $110001100000000$ \\
*ebfedea0SLionel Sambuc\hline $8$ & $25344$ & $110001100000000$ \\
\hline $9$ & $x + 2^{8}n = 91136$ & $10110010000000000$ \\
*ebfedea0SLionel Sambuc\hline -- & $x/2^k = 178$ & \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Example of Montgomery Reduction (II)}
*ebfedea0SLionel Sambuc\label{fig:MONT2}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFigure~\ref{fig:MONT2} demonstrates the modified algorithm reducing $x = 5555$ modulo $n = 257$ with $k = 9$.
*ebfedea0SLionel SambucWith this algorithm a single shift right at the end is the only right shift required to reduce the input instead of $k$ right shifts inside the
loop.  Note that in the iterations $t = 2, 5, 6$ and $8$ the result $x$ is not changed.  In those iterations the $t$'th bit of $x$ is
*ebfedea0SLionel Sambuczero and the appropriate multiple of $n$ does not need to be added to force the $t$'th bit of the result to zero.
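
The modified algorithm can be sketched in the same way.  Bits are counted from one at the least significant end, matching the step numbers in figure~\ref{fig:MONT2}, so the multiple added in iteration $t$ is $n$ shifted left by $t - 1$ places.  The name mont\_reduce\_deferred is hypothetical and the values are assumed to fit in a 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* Clear the low k bits of x by adding shifted multiples of the odd
 * modulus n, then perform one right shift by k at the end.
 * Assumption of this sketch: x + (n << k) never overflows 64 bits. */
uint64_t mont_reduce_deferred(uint64_t x, uint64_t n, unsigned k)
{
    assert((n & 1u) == 1u && k < 32);

    for (unsigned t = 1; t <= k; t++) {
        if ((x >> (t - 1)) & 1u) {       /* t'th bit of x, counting from one */
            x += n << (t - 1);           /* forces that bit to zero */
        }
    }
    return x >> k;                       /* the only right shift required */
}
```

The call mont\_reduce\_deferred(5555, 257, 9) passes through the intermediate values listed in the figure and returns $178$.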
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Digit Based Montgomery Reduction}
Instead of computing the reduction on a bit-by-bit basis it is actually much faster to compute it on a digit-by-digit basis.  Consider the
*ebfedea0SLionel Sambucprevious algorithm re-written to compute the Montgomery reduction in this new fashion.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Montgomery Reduction} (modified II). \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $x$, $n$ and $k$ ($\beta^k > n$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  for $t$ from $0$ to $k - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $x \leftarrow x + \mu n \beta^t$ \\
*ebfedea0SLionel Sambuc2.  Return $x/\beta^k$. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Montgomery Reduction (modified II)}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
The value $\mu n \beta^t$ is a multiple of the modulus $n$ meaning that it will not change the residue.  If the $t$'th digit of
the value $\mu n \beta^t$ (the least significant digit of $\mu n$) equals the negative (modulo $\beta$) of the $t$'th digit of $x$ then the addition will result in a zero digit.  This
problem breaks down to solving the following congruence.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{rcl}
*ebfedea0SLionel Sambuc$x_t + \mu n_0$ & $\equiv$ & $0 \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc$\mu n_0$ & $\equiv$ & $-x_t \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc$\mu$ & $\equiv$ & $-x_t/n_0 \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn each iteration of the loop on step 1 a new value of $\mu$ must be calculated.  The value of $-1/n_0 \mbox{ (mod }\beta\mbox{)}$ is used
*ebfedea0SLionel Sambucextensively in this algorithm and should be precomputed.  Let $\rho$ represent the negative of the modular inverse of $n_0$ modulo $\beta$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, let $\beta = 10$ represent the radix.  Let $n = 17$ represent the modulus which implies $k = 2$ and $\rho \equiv 7$.  Let $x = 33$
*ebfedea0SLionel Sambucrepresent the value to reduce.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Step ($t$)} & \textbf{Value of $x$} & \textbf{Value of $\mu$} \\
*ebfedea0SLionel Sambuc\hline --                 & $33$ & --\\
*ebfedea0SLionel Sambuc\hline $0$                 & $33 + \mu n = 50$ & $1$ \\
*ebfedea0SLionel Sambuc\hline $1$                 & $50 + \mu n \beta = 900$ & $5$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Example of Montgomery Reduction}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
The value $900$ is then divided by $\beta^k$ to produce the final result $9$.  The first observation is that $9 \nequiv x \mbox{ (mod }n\mbox{)}$
*ebfedea0SLionel Sambucwhich implies the result is not the modular residue of $x$ modulo $n$.  However, recall that the residue is actually multiplied by $\beta^{-k}$ in
*ebfedea0SLionel Sambucthe algorithm.  To get the true residue the value must be multiplied by $\beta^k$.  In this case $\beta^k \equiv 15 \mbox{ (mod }n\mbox{)}$ and
*ebfedea0SLionel Sambucthe correct residue is $9 \cdot 15 \equiv 16 \mbox{ (mod }n\mbox{)}$.
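
The radix ten example can be reproduced with the sketch below.  It assumes $\beta = 10$ and word sized values, and finds $\rho$ by exhaustive search purely for illustration; mont\_rho and mont\_reduce\_digits10 are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define RADIX 10u   /* beta = 10 to match the worked example */

/* rho = -1/n_0 (mod beta) found by exhaustive search.
 * Returns 0 if n_0 has no inverse, i.e. gcd(n_0, beta) != 1. */
unsigned mont_rho(uint64_t n)
{
    unsigned n0 = (unsigned)(n % RADIX);
    for (unsigned rho = 1; rho < RADIX; rho++) {
        if ((n0 * rho + 1u) % RADIX == 0) {
            return rho;
        }
    }
    return 0;
}

/* Return a value congruent to beta^(-k) * x modulo n. */
uint64_t mont_reduce_digits10(uint64_t x, uint64_t n, unsigned k, unsigned rho)
{
    uint64_t shift = 1;                                  /* beta^t */

    for (unsigned t = 0; t < k; t++) {
        unsigned xt = (unsigned)((x / shift) % RADIX);   /* t'th digit of x */
        unsigned mu = (xt * rho) % RADIX;                /* mu = x_t * rho (mod beta) */
        x += (uint64_t)mu * n * shift;                   /* zeroes the t'th digit */
        shift *= RADIX;
    }
    return x / shift;                                    /* divide by beta^k */
}
```

Here mont\_rho(17) returns $7$ and mont\_reduce\_digits10(33, 17, 2, 7) returns $9$, with $\mu$ taking the values $1$ and $5$ in the two iterations.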
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Baseline Montgomery Reduction}
The baseline Montgomery reduction algorithm will produce the residue for any size input.  It is designed to be a catch-all algorithm for
*ebfedea0SLionel SambucMontgomery reductions.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_montgomery\_reduce}. \\
\textbf{Input}.   mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $digs \leftarrow 2n.used + 1$ \\
2.  If $digs < MP\_ARRAY$ and $n.used < \delta$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  Use algorithm fast\_mp\_montgomery\_reduce instead. \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucSetup $x$ for the reduction. \\
*ebfedea0SLionel Sambuc3.  If $x.alloc < digs$ then grow $x$ to $digs$ digits. \\
*ebfedea0SLionel Sambuc4.  $x.used \leftarrow digs$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucEliminate the lower $k$ digits. \\
*ebfedea0SLionel Sambuc5.  For $ix$ from $0$ to $k - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $\mu \leftarrow x_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $u \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  For $iy$ from $0$ to $k - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.3.1  $\hat r \leftarrow \mu n_{iy} + x_{ix + iy} + u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.3.2  $x_{ix + iy} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.3.3  $u \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.4  While $u > 0$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.1  $iy \leftarrow iy + 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.2  $x_{ix + iy} \leftarrow x_{ix + iy} + u$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.3  $u \leftarrow \lfloor x_{ix+iy} / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.4.4  $x_{ix + iy} \leftarrow x_{ix+iy} \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucDivide by $\beta^k$ and fix up as required. \\
*ebfedea0SLionel Sambuc6.  $x \leftarrow \lfloor x / \beta^k \rfloor$ \\
*ebfedea0SLionel Sambuc7.  If $x \ge n$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $x \leftarrow x - n$ \\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_montgomery\_reduce}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_montgomery\_reduce.}
*ebfedea0SLionel SambucThis algorithm reduces the input $x$ modulo $n$ in place using the Montgomery reduction algorithm.  The algorithm is loosely based
*ebfedea0SLionel Sambucon algorithm 14.32 of \cite[pp.601]{HAC} except it merges the multiplication of $\mu n \beta^t$ with the addition in the inner loop.  The
*ebfedea0SLionel Sambucrestrictions on this algorithm are fairly easy to adapt to.  First $0 \le x < n^2$ bounds the input to numbers in the same range as
*ebfedea0SLionel Sambucfor the Barrett algorithm.  Additionally if $n > 1$ and $n$ is odd there will exist a modular inverse $\rho$.  $\rho$ must be calculated in
*ebfedea0SLionel Sambucadvance of this algorithm.  Finally the variable $k$ is fixed and a pseudonym for $n.used$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucStep 2 decides whether a faster Montgomery algorithm can be used.  It is based on the Comba technique meaning that there are limits on
*ebfedea0SLionel Sambucthe size of the input.  This algorithm is discussed in ~COMBARED~.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucStep 5 is the main reduction loop of the algorithm.  The value of $\mu$ is calculated once per iteration in the outer loop.  The inner loop
*ebfedea0SLionel Sambuccalculates $x + \mu n \beta^{ix}$ by multiplying $\mu n$ and adding the result to $x$ shifted by $ix$ digits.  Both the addition and
*ebfedea0SLionel Sambucmultiplication are performed in the same loop to save time and memory.  Step 5.4 will handle any additional carries that escape the inner loop.
*ebfedea0SLionel Sambuc
By quick inspection this algorithm requires $k$ single precision multiplications for the outer loop and $k^2$ single precision multiplications
in the inner loop, where $k = n.used$ is the number of digits in the modulus.  The total of $k^2 + k$ single precision multiplications compares
favourably to Barrett at $k^2 + 2k - 1$ single precision multiplications.
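
The loop structure described above can be sketched in a few lines of Python.  This is an illustrative sketch only and not the LibTomMath source: the function name \texttt{montgomery\_reduce}, the use of Python integers for $x$ and $n$, and the default of 28-bit digits are assumptions made for the example.  The value $\rho$ is assumed to have been computed in advance.

```python
def montgomery_reduce(x, n, rho, digit_bits=28):
    """Return x * beta^(-k) mod n for 0 <= x < n*n and odd n > 1.

    rho must equal -1/n_0 (mod beta); k is the digit count of n (n.used).
    Names and the 28-bit default are assumptions made for this sketch.
    """
    beta = 1 << digit_bits
    mask = beta - 1
    k = max(1, -(-n.bit_length() // digit_bits))
    n_digits = [(n >> (digit_bits * i)) & mask for i in range(k)]
    # x needs room for 2k + 1 digits while the multiples of n are added
    xd = [(x >> (digit_bits * i)) & mask for i in range(2 * k + 1)]

    for ix in range(k):
        # mu is chosen so that digit ix of x + mu * n * beta^ix becomes zero
        mu = (xd[ix] * rho) & mask
        carry = 0
        for iy in range(k):
            # multiply by mu and add to x in the same pass
            r = mu * n_digits[iy] + xd[ix + iy] + carry
            xd[ix + iy] = r & mask
            carry = r >> digit_bits
        # handle any carry that escapes the inner loop
        j = ix + k
        while carry:
            r = xd[j] + carry
            xd[j] = r & mask
            carry = r >> digit_bits
            j += 1

    # the k low digits are now zero, so dividing by beta^k just drops them
    result = sum(d << (digit_bits * i) for i, d in enumerate(xd[k:]))
    if result >= n:
        result -= n
    return result
```

The multiplication by $\mu$ in the outer loop runs $k$ times and the inner loop body runs $k^2$ times, which matches the multiplication count given above.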
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_montgomery_reduce.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis is the baseline implementation of the Montgomery reduction algorithm.  Lines @30,digs@ to @35,}@ determine if the Comba based
*ebfedea0SLionel Sambucroutine can be used instead.  Line @47,mu@ computes the value of $\mu$ for that particular iteration of the outer loop.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe multiplication $\mu n \beta^{ix}$ is performed in one step in the inner loop.  The alias $tmpx$ refers to the $ix$'th digit of $x$ and
*ebfedea0SLionel Sambucthe alias $tmpn$ refers to the modulus $n$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Faster ``Comba'' Montgomery Reduction}
*ebfedea0SLionel SambucMARK,COMBARED
*ebfedea0SLionel Sambuc
The Montgomery reduction requires fewer single precision multiplications than a Barrett reduction; however, it is much slower due to the serial
*ebfedea0SLionel Sambucnature of the inner loop.  The Barrett reduction algorithm requires two slightly modified multipliers which can be implemented with the Comba
*ebfedea0SLionel Sambuctechnique.  The Montgomery reduction algorithm cannot directly use the Comba technique to any significant advantage since the inner loop calculates
*ebfedea0SLionel Sambuca $k \times 1$ product $k$ times.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe biggest obstacle is that at the $ix$'th iteration of the outer loop the value of $x_{ix}$ is required to calculate $\mu$.  This means the
*ebfedea0SLionel Sambuccarries from $0$ to $ix - 1$ must have been propagated upwards to form a valid $ix$'th digit.  The solution as it turns out is very simple.
*ebfedea0SLionel SambucPerform a Comba like multiplier and inside the outer loop just after the inner loop fix up the $ix + 1$'th digit by forwarding the carry.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWith this change in place the Montgomery reduction algorithm can be performed with a Comba style multiplication loop which substantially increases
*ebfedea0SLionel Sambucthe speed of the algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{fast\_mp\_montgomery\_reduce}. \\
\textbf{Input}.   mp\_int $x$, mp\_int $n$ and a digit $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hspace{11.5mm}($0 \le x < n^2, n > 1, (n, \beta) = 1, \beta^k > n$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $\beta^{-k}x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel SambucPlace an array of \textbf{MP\_WARRAY} mp\_word variables called $\hat W$ on the stack. \\
*ebfedea0SLionel Sambuc1.  if $x.alloc < n.used + 1$ then grow $x$ to $n.used + 1$ digits. \\
*ebfedea0SLionel SambucCopy the digits of $x$ into the array $\hat W$ \\
*ebfedea0SLionel Sambuc2.  For $ix$ from $0$ to $x.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $\hat W_{ix} \leftarrow x_{ix}$ \\
*ebfedea0SLionel Sambuc3.  For $ix$ from $x.used$ to $2n.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $\hat W_{ix} \leftarrow 0$ \\
Eliminate the lower $k$ digits. \\
*ebfedea0SLionel Sambuc4.  for $ix$ from $0$ to $n.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $\mu \leftarrow \hat W_{ix} \cdot \rho \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.2  For $iy$ from $0$ to $n.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}4.2.1  $\hat W_{iy + ix} \leftarrow \hat W_{iy + ix} + \mu \cdot n_{iy}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.3  $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\
*ebfedea0SLionel SambucPropagate carries upwards. \\
*ebfedea0SLionel Sambuc5.  for $ix$ from $n.used$ to $2n.used + 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $\hat W_{ix + 1} \leftarrow \hat W_{ix + 1} + \lfloor \hat W_{ix} / \beta \rfloor$ \\
*ebfedea0SLionel SambucShift right and reduce modulo $\beta$ simultaneously. \\
*ebfedea0SLionel Sambuc6.  for $ix$ from $0$ to $n.used + 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $x_{ix} \leftarrow \hat W_{ix + n.used} \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel SambucZero excess digits and fixup $x$. \\
*ebfedea0SLionel Sambuc7.  if $x.used > n.used + 1$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  for $ix$ from $n.used + 1$ to $x.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.1.1  $x_{ix} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc8.  $x.used \leftarrow n.used + 1$ \\
*ebfedea0SLionel Sambuc9.  Clamp excessive digits of $x$. \\
*ebfedea0SLionel Sambuc10.  If $x \ge n$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}10.1  $x \leftarrow x - n$ \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm fast\_mp\_montgomery\_reduce}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm fast\_mp\_montgomery\_reduce.}
*ebfedea0SLionel SambucThis algorithm will compute the Montgomery reduction of $x$ modulo $n$ using the Comba technique.  It is on most computer platforms significantly
*ebfedea0SLionel Sambucfaster than algorithm mp\_montgomery\_reduce and algorithm mp\_reduce (\textit{Barrett reduction}).  The algorithm has the same restrictions
on the input as the baseline reduction algorithm.  An additional two restrictions are imposed on this algorithm.  The number of digits $k$ in
the modulus $n$ must not violate $MP\_WARRAY > 2k +1$ and $k < \delta$.   When $\beta = 2^{28}$ this algorithm can be used to reduce modulo
a modulus of at most $3{,}556$ bits in length.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAs in the other Comba reduction algorithms there is a $\hat W$ array which stores the columns of the product.  It is initially filled with the
contents of $x$ with the excess digits zeroed.  The reduction loop is very similar to the baseline loop at heart.  The multiplication on step
4.1 can be single precision only since $ab \equiv (a \mbox{ mod }\beta)(b \mbox{ mod }\beta) \mbox{ (mod }\beta\mbox{)}$.  Some multipliers such
as those on the ARM processors take a variable length of time to complete depending on the number of bytes of result they must produce.  By performing
a single precision multiplication instead, half the amount of time is spent.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAlso note that digit $\hat W_{ix}$ must have the carry from the $ix - 1$'th digit propagated upwards in order for this to work.  That is what step
4.3 will do.  In effect over the $n.used$ iterations of the outer loop the lower $n.used$ columns all have their carries propagated forwards.  Note
*ebfedea0SLionel Sambuchow the upper bits of those same words are not reduced modulo $\beta$.  This is because those values will be discarded shortly and there is no
*ebfedea0SLionel Sambucpoint.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucStep 5 will propagate the remainder of the carries upwards.  On step 6 the columns are reduced modulo $\beta$ and shifted simultaneously as they are
*ebfedea0SLionel Sambucstored in the destination $x$.
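
The steps of the figure above can be mirrored in a short Python sketch.  It is a sketch under stated assumptions rather than the library code: the function name \texttt{fast\_montgomery\_reduce} and the 28-bit digit default are hypothetical, and because Python integers never overflow the limits imposed by the size of an mp\_word (\textbf{MP\_WARRAY} and $\delta$) are not modelled.

```python
def fast_montgomery_reduce(x, n, rho, digit_bits=28):
    """Comba style Montgomery reduction: returns x * beta^(-k) mod n.

    Requires 0 <= x < n*n, odd n > 1 and rho = -1/n_0 (mod beta).
    Names and the 28-bit default are assumptions made for this sketch.
    """
    beta = 1 << digit_bits
    mask = beta - 1
    k = max(1, -(-n.bit_length() // digit_bits))
    n_digits = [(n >> (digit_bits * i)) & mask for i in range(k)]
    # W holds the columns: the digits of x followed by zeroes
    W = [(x >> (digit_bits * i)) & mask for i in range(2 * k + 2)]

    for ix in range(k):
        # only the low digit of the column matters when forming mu
        mu = ((W[ix] & mask) * rho) & mask
        # accumulate mu * n into the columns with no carries in between
        for iy in range(k):
            W[ix + iy] += mu * n_digits[iy]
        # forward the carry so that W[ix + 1] is valid for the next mu
        W[ix + 1] += W[ix] >> digit_bits

    # propagate the remaining carries through the upper columns
    for ix in range(k, 2 * k + 1):
        W[ix + 1] += W[ix] >> digit_bits

    # shift right by k digits and reduce each column modulo beta
    result = sum((W[ix + k] & mask) << (digit_bits * ix) for ix in range(k + 1))
    if result >= n:
        result -= n
    return result
```

Note that the lower $k$ columns are never reduced modulo $\beta$ after their carries have been forwarded, since they are discarded by the final shift.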
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_fast_mp_montgomery_reduce.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe $\hat W$ array is first filled with digits of $x$ on line @49,for@ then the rest of the digits are zeroed on line @54,for@.  Both loops share
*ebfedea0SLionel Sambucthe same alias variables to make the code easier to read.
*ebfedea0SLionel Sambuc
The value of $\mu$ is calculated in an interesting fashion.  First the value $\hat W_{ix}$ is reduced modulo $\beta$ and cast to an mp\_digit.  This
*ebfedea0SLionel Sambucforces the compiler to use a single precision multiplication and prevents any concerns about loss of precision.   Line @101,>>@ fixes the carry
*ebfedea0SLionel Sambucfor the next iteration of the loop by propagating the carry from $\hat W_{ix}$ to $\hat W_{ix+1}$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe for loop on line @113,for@ propagates the rest of the carries upwards through the columns.  The for loop on line @126,for@ reduces the columns
*ebfedea0SLionel Sambucmodulo $\beta$ and shifts them $k$ places at the same time.  The alias $\_ \hat W$ actually refers to the array $\hat W$ starting at the $n.used$'th
*ebfedea0SLionel Sambucdigit, that is $\_ \hat W_{t} = \hat W_{n.used + t}$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Montgomery Setup}
*ebfedea0SLionel SambucTo calculate the variable $\rho$ a relatively simple algorithm will be required.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_montgomery\_setup}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $n$ ($n > 1$ and $(n, 2) = 1$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $\rho \equiv -1/n_0 \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $b \leftarrow n_0$ \\
*ebfedea0SLionel Sambuc2.  If $b$ is even return(\textit{MP\_VAL}) \\
*ebfedea0SLionel Sambuc3.  $x \leftarrow (((b + 2) \mbox{ AND } 4) << 1) + b$ \\
*ebfedea0SLionel Sambuc4.  for $k$ from 0 to $\lceil lg(lg(\beta)) \rceil - 2$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $x \leftarrow x \cdot (2 - bx)$ \\
*ebfedea0SLionel Sambuc5.  $\rho \leftarrow \beta - x \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc6.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_montgomery\_setup}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_montgomery\_setup.}
*ebfedea0SLionel SambucThis algorithm will calculate the value of $\rho$ required within the Montgomery reduction algorithms.  It uses a very interesting trick
*ebfedea0SLionel Sambucto calculate $1/n_0$ when $\beta$ is a power of two.
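
The trick can be illustrated with a small Python sketch.  The initial value from step 3 satisfies $bx \equiv 1 \mbox{ (mod }2^4\mbox{)}$ for any odd $b$, and each application of $x \leftarrow x \cdot (2 - bx)$ doubles the number of correct low order bits.  The function name \texttt{montgomery\_setup}, the explicit bit counter and the 28-bit digit default are assumptions made for this example; the library source unrolls the iterations instead.

```python
def montgomery_setup(n, digit_bits=28):
    """Return rho = -1/n_0 (mod beta) for an odd modulus n.

    The function name and the bit counting loop are assumptions of this
    sketch; beta = 2^digit_bits.
    """
    beta = 1 << digit_bits
    b = n & (beta - 1)                 # n_0, the least significant digit
    if b % 2 == 0:
        raise ValueError("modulus must be odd")
    # b * x == 1 (mod 2^4) holds for every odd b
    x = (((b + 2) & 4) << 1) + b
    bits = 4
    # each step doubles the number of correct low order bits of the inverse
    while bits < digit_bits:
        x = (x * (2 - b * x)) % beta
        bits *= 2
    # negate the inverse modulo beta
    return (beta - x) % beta
```

A returned value can be checked directly, since $\rho \cdot n_0 + 1$ must be divisible by $\beta$.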
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_montgomery_setup.c
*ebfedea0SLionel Sambuc
This source code computes the value of $\rho$ required to perform Montgomery reduction.  It has been modified to avoid performing excess
multiplications when the digit size is not the default 28 bits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{The Diminished Radix Algorithm}
*ebfedea0SLionel SambucThe Diminished Radix method of modular reduction \cite{DRMET} is a fairly clever technique which can be more efficient than either the Barrett
*ebfedea0SLionel Sambucor Montgomery methods for certain forms of moduli.  The technique is based on the following simple congruence.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc(x \mbox{ mod } n) + k \lfloor x / n \rfloor \equiv x \mbox{ (mod }(n - k)\mbox{)}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
This observation was used in the MMB \cite{MMB} block cipher to create a diffusion primitive.  It used the fact that if $n = 2^{31}$ and $k=1$
then an x86 multiplier could produce the 62-bit product and use the ``shrd'' instruction to perform a double-precision right shift.  The proof
of the above equation is very simple.  First write $x$ in the product form.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucx = qn + r
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucNow reduce both sides modulo $(n - k)$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucx \equiv qk + r  \mbox{ (mod }(n-k)\mbox{)}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe variable $n$ reduces modulo $n - k$ to $k$.  By putting $q = \lfloor x/n \rfloor$ and $r = x \mbox{ mod } n$
*ebfedea0SLionel Sambucinto the equation the original congruence is reproduced, thus concluding the proof.  The following algorithm is based on this observation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Diminished Radix Reduction}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $x$, $n$, $k$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $x \mbox{ mod } (n - k)$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $q \leftarrow \lfloor x / n \rfloor$ \\
*ebfedea0SLionel Sambuc2.  $q \leftarrow k \cdot q$ \\
*ebfedea0SLionel Sambuc3.  $x \leftarrow x \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc4.  $x \leftarrow x + q$ \\
*ebfedea0SLionel Sambuc5.  If $x \ge (n - k)$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $x \leftarrow x - (n - k)$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  Goto step 1. \\
*ebfedea0SLionel Sambuc6.  Return $x$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Diminished Radix Reduction}
*ebfedea0SLionel Sambuc\label{fig:DR}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
This algorithm will reduce $x$ modulo $n - k$ and return the residue.  If $0 \le x < (n - k)^2$ then the algorithm will almost always loop
once or twice and occasionally three times.  For simplicity's sake the value of $x$ is bounded by the following simple polynomial.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc0 \le x < n^2 + k^2 - 2nk
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe true bound is  $0 \le x < (n - k - 1)^2$ but this has quite a few more terms.  The value of $q$ after step 1 is bounded by the following.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
q < n - 2k + k^2/n
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSince $k^2$ is going to be considerably smaller than $n$ that term will always be zero.  The value of $x$ after step 3 is bounded trivially as
*ebfedea0SLionel Sambuc$0 \le x < n$.  By step four the sum $x + q$ is bounded by
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc0 \le q + x < (k + 1)n - 2k^2 - 1
*ebfedea0SLionel Sambuc\end{equation}
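
As a short sketch of where this bound comes from, the earlier steps can be chained together, treating the $k^2/n$ term as negligible as above.  After step 1 the quotient satisfies $q < n - 2k$, so the product formed in step 2 satisfies

\begin{equation}
kq < kn - 2k^2
\end{equation}

After step 3 the remainder satisfies $x \le n - 1$.  Adding the two gives

\begin{equation}
kq + x < kn - 2k^2 + n - 1 = (k + 1)n - 2k^2 - 1
\end{equation}

which is the bound on the sum in step 4, where $q$ there denotes the product left by step 2.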
*ebfedea0SLionel Sambuc
With a second pass $q$ will be loosely bounded by $0 \le q < k^2$ after step 2 while $x$ will still be loosely bounded by $0 \le x < n$ after step 3.  After the second pass it is highly unlikely that the
*ebfedea0SLionel Sambucsum in step 4 will exceed $n - k$.  In practice fewer than three passes of the algorithm are required to reduce virtually every input in the
*ebfedea0SLionel Sambucrange $0 \le x < (n - k - 1)^2$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|l|}
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc$x = 123456789, n = 256, k = 3$ \\
*ebfedea0SLionel Sambuc\hline $q \leftarrow \lfloor x/n \rfloor = 482253$ \\
*ebfedea0SLionel Sambuc$q \leftarrow q*k = 1446759$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x \mbox{ mod } n = 21$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x + q = 1446780$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x - (n - k) = 1446527$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc$q \leftarrow \lfloor x/n \rfloor = 5650$ \\
*ebfedea0SLionel Sambuc$q \leftarrow q*k = 16950$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x \mbox{ mod } n = 127$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x + q = 17077$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x - (n - k) = 16824$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc$q \leftarrow \lfloor x/n \rfloor = 65$ \\
*ebfedea0SLionel Sambuc$q \leftarrow q*k = 195$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x \mbox{ mod } n = 184$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x + q = 379$ \\
*ebfedea0SLionel Sambuc$x \leftarrow x - (n - k) = 126$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Example Diminished Radix Reduction}
*ebfedea0SLionel Sambuc\label{fig:EXDR}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
Figure~\ref{fig:EXDR} demonstrates the reduction of $x = 123456789$ modulo $n - k = 253$ when $n = 256$ and $k = 3$.  Note that even though $x$
*ebfedea0SLionel Sambucis considerably larger than $(n - k - 1)^2 = 63504$ the algorithm still converges on the modular residue exceedingly fast.  In this case only
*ebfedea0SLionel Sambucthree passes were required to find the residue $x \equiv 126$.
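
The generic algorithm of figure~\ref{fig:DR} is short enough to transcribe directly.  The following Python sketch, in which the function name \texttt{dr\_reduce} is a hypothetical choice, follows the figure step by step and can be used to reproduce the example.

```python
def dr_reduce(x, n, k):
    """Reduce x modulo (n - k), assuming 0 < k < n and x >= 0.

    The function name is an assumption of this sketch; the steps follow
    the generic Diminished Radix figure.
    """
    while True:
        q = x // n           # step 1
        q = k * q            # step 2
        x = x % n            # step 3
        x = x + q            # step 4
        if x >= n - k:       # step 5
            x -= n - k       # step 5.1
            continue         # step 5.2, go back to step 1
        return x             # step 6
```

Calling \texttt{dr\_reduce(123456789, 256, 3)} walks through the intermediate values $1446527$, $16824$ and $126$ shown in the example figure before a final pass confirms that no further reduction is needed.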
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Choice of Moduli}
On the surface this algorithm looks very expensive.  Each pass requires a division, a multiplication and a further modular reduction,
followed by an addition and possibly a subtraction.  The usefulness of this algorithm becomes exceedingly clear when an appropriate modulus is chosen.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucDivision in general is a very expensive operation to perform.  The one exception is when the division is by a power of the radix of representation used.
*ebfedea0SLionel SambucDivision by ten for example is simple for pencil and paper mathematics since it amounts to shifting the decimal place to the right.  Similarly division
*ebfedea0SLionel Sambucby two (\textit{or powers of two}) is very simple for binary computers to perform.  It would therefore seem logical to choose $n$ of the form $2^p$
*ebfedea0SLionel Sambucwhich would imply that $\lfloor x / n \rfloor$ is a simple shift of $x$ right $p$ bits.
*ebfedea0SLionel Sambuc
However, there is one operation related to division by powers of two that is even faster than this.  If $n = \beta^p$ then the division may be
*ebfedea0SLionel Sambucperformed by moving whole digits to the right $p$ places.  In practice division by $\beta^p$ is much faster than division by $2^p$ for any $p$.
*ebfedea0SLionel SambucAlso with the choice of $n = \beta^p$ reducing $x$ modulo $n$ merely requires zeroing the digits above the $p-1$'th digit of $x$.
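
To illustrate why the choice $n = \beta^p$ is so cheap, the following Python sketch represents an integer as a list of digits with the least significant digit first.  The helper names \texttt{to\_digits}, \texttt{from\_digits} and \texttt{split\_at\_digit} are assumptions made for this example and are not part of the library.

```python
def to_digits(x, digit_bits=28):
    """Split a non-negative integer into base-beta digits, least significant first."""
    mask = (1 << digit_bits) - 1
    digits = []
    while x:
        digits.append(x & mask)
        x >>= digit_bits
    return digits


def from_digits(digits, digit_bits=28):
    """Rebuild the integer from its digit list."""
    return sum(d << (digit_bits * i) for i, d in enumerate(digits))


def split_at_digit(digits, p):
    """Return (floor(x / beta^p), x mod beta^p) as digit lists.

    No arithmetic is performed on the digits themselves.
    """
    quotient = digits[p:]     # whole digits moved right by p places
    remainder = digits[:p]    # digits at position p and above dropped
    return quotient, remainder
```

Both results are obtained by moving or discarding whole digits, with no shifting inside a digit.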
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThroughout the next section the term ``restricted modulus'' will refer to a modulus of the form $\beta^p - k$ whereas the term ``unrestricted
*ebfedea0SLionel Sambucmodulus'' will refer to a modulus of the form $2^p - k$.  The word ``restricted'' in this case refers to the fact that it is based on the
*ebfedea0SLionel Sambuc$2^p$ logic except $p$ must be a multiple of $lg(\beta)$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Choice of $k$}
Now that division and reduction (\textit{steps 1 and 3 of figure~\ref{fig:DR}}) have been optimized to simple digit operations the multiplication by $k$
*ebfedea0SLionel Sambucin step 2 is the most expensive operation.  Fortunately the choice of $k$ is not terribly limited.  For all intents and purposes it might
*ebfedea0SLionel Sambucas well be a single digit.  The smaller the value of $k$ is the faster the algorithm will be.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Restricted Diminished Radix Reduction}
*ebfedea0SLionel SambucThe restricted Diminished Radix algorithm can quickly reduce an input modulo a modulus of the form $n = \beta^p - k$.  This algorithm can reduce
*ebfedea0SLionel Sambucan input $x$ within the range $0 \le x < n^2$ using only a couple passes of the algorithm demonstrated in figure~\ref{fig:DR}.  The implementation
*ebfedea0SLionel Sambucof this algorithm has been optimized to avoid additional overhead associated with a division by $\beta^p$, the multiplication by $k$ or the addition
*ebfedea0SLionel Sambucof $x$ and $q$.  The resulting algorithm is very efficient and can lead to substantial improvements over Barrett and Montgomery reduction when modular
*ebfedea0SLionel Sambucexponentiations are performed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_dr\_reduce}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $x$, $n$ and a mp\_digit $k = \beta - n_0$ \\
*ebfedea0SLionel Sambuc\hspace{11.5mm}($0 \le x < n^2$, $n > 1$, $0 < k < \beta$) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $x \mbox{ mod } n$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $m \leftarrow n.used$ \\
*ebfedea0SLionel Sambuc2.  If $x.alloc < 2m$ then grow $x$ to $2m$ digits. \\
*ebfedea0SLionel Sambuc3.  $\mu \leftarrow 0$ \\
*ebfedea0SLionel Sambuc4.  for $i$ from $0$ to $m - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $\hat r \leftarrow k \cdot x_{m+i} + x_{i} + \mu$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.2  $x_{i} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.3  $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc5.  $x_{m} \leftarrow \mu$ \\
*ebfedea0SLionel Sambuc6.  for $i$ from $m + 1$ to $x.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $x_{i} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  Clamp excess digits of $x$. \\
*ebfedea0SLionel Sambuc8.  If $x \ge n$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $x \leftarrow x - n$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.2  Goto step 3. \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_dr\_reduce}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_dr\_reduce.}
This algorithm will perform the Diminished Radix reduction of $x$ modulo $n$.  It has similar restrictions to that of the Barrett reduction
*ebfedea0SLionel Sambucwith the addition that $n$ must be of the form $n = \beta^m - k$ where $0 < k <\beta$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis algorithm essentially implements the pseudo-code in figure~\ref{fig:DR} except with a slight optimization.  The division by $\beta^m$, multiplication by $k$
*ebfedea0SLionel Sambucand addition of $x \mbox{ mod }\beta^m$ are all performed simultaneously inside the loop on step 4.  The division by $\beta^m$ is emulated by accessing
*ebfedea0SLionel Sambucthe term at the $m+i$'th position which is subsequently multiplied by $k$ and added to the term at the $i$'th position.  After the loop the $m$'th
digit is set to the carry and the upper digits are zeroed.  Steps 5 and 6 emulate the reduction modulo $\beta^m$ that should have happened to
*ebfedea0SLionel Sambuc$x$ before the addition of the multiple of the upper half.
*ebfedea0SLionel Sambuc
At step 8 if $x$ is still greater than or equal to $n$ another pass of the algorithm is required.  First $n$ is subtracted from $x$ and then the algorithm resumes
*ebfedea0SLionel Sambucat step 3.
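
The combined loop can be sketched in Python as follows.  This is an illustration under stated assumptions and not the library routine: the function name \texttt{dr\_reduce\_restricted} and the 28-bit digit default are hypothetical, and the comparison and subtraction of step 8 are done on Python integers for brevity rather than digit by digit.

```python
def dr_reduce_restricted(x, n, k, digit_bits=28):
    """Reduce 0 <= x < n*n modulo n = beta^m - k with 0 < k < beta.

    The function name and the use of Python integers for the final
    comparison are assumptions of this sketch.
    """
    beta = 1 << digit_bits
    mask = beta - 1
    m = -(-n.bit_length() // digit_bits)          # step 1: m = n.used
    # step 2: x is held in 2m digits, least significant first
    xd = [(x >> (digit_bits * i)) & mask for i in range(2 * m)]

    while True:
        mu = 0                                    # step 3
        for i in range(m):                        # step 4
            # reading digit m + i emulates the division by beta^m
            r = k * xd[m + i] + xd[i] + mu
            xd[i] = r & mask
            mu = r >> digit_bits
        xd[m] = mu                                # step 5: final carry
        for i in range(m + 1, 2 * m):             # step 6: zero upper digits
            xd[i] = 0
        value = sum(d << (digit_bits * i) for i, d in enumerate(xd))
        if value < n:                             # step 8
            return value
        value -= n                                # step 8.1, then back to step 3
        xd = [(value >> (digit_bits * i)) & mask for i in range(2 * m)]
```

Since $k < \beta$ the carry $\mu$ never exceeds $k$, so it always fits in the single digit written at step 5.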
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_dr_reduce.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first step is to grow $x$ as required to $2m$ digits since the reduction is performed in place on $x$.  The label on line @49,top:@ is where
the algorithm will resume if further reduction passes are required.  In theory it could be placed at the top of the function; however, the size of
the modulus and the question of whether $x$ is large enough are invariant after the first pass, meaning that it would be a waste of time.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe aliases $tmpx1$ and $tmpx2$ refer to the digits of $x$ where the latter is offset by $m$ digits.  By reading digits from $x$ offset by $m$ digits
*ebfedea0SLionel Sambuca division by $\beta^m$ can be simulated virtually for free.  The loop on line @61,for@ performs the bulk of the work (\textit{corresponds to step 4 of algorithm 7.11})
*ebfedea0SLionel Sambucin this algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy line @68,mu@ the pointer $tmpx1$ points to the $m$'th digit of $x$ which is where the final carry will be placed.  Similarly by line @71,for@ the
*ebfedea0SLionel Sambucsame pointer will point to the $m+1$'th digit where the zeroes will be placed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSince the algorithm is only valid if both $x$ and $n$ are greater than zero an unsigned comparison suffices to determine if another pass is required.
*ebfedea0SLionel SambucWith the same logic at line @82,sub@ the value of $x$ is known to be greater than or equal to $n$ meaning that an unsigned subtraction can be used
*ebfedea0SLionel Sambucas well.  Since the destination of the subtraction is the larger of the inputs the call to algorithm s\_mp\_sub cannot fail and the return code
*ebfedea0SLionel Sambucdoes not need to be checked.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Setup}
To set up the restricted Diminished Radix algorithm the value $k = \beta - n_0$ is required.  This algorithm is not really complicated but is provided for
completeness.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_dr\_setup}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $n$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $k = \beta - n_0$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $k \leftarrow \beta - n_0$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_dr\_setup}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_dr_setup.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Modulus Detection}
Another useful algorithm is one that detects a restricted Diminished Radix modulus.  An integer is said to be
of restricted Diminished Radix form if all of the digits are equal to $\beta - 1$ except the least significant digit which may be any value.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_dr\_is\_modulus}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $n$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $1$ if $n$ is in D.R form, $0$ otherwise \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc1.  If $n.used < 2$ then return($0$). \\
*ebfedea0SLionel Sambuc2.  for $ix$ from $1$ to $n.used - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  If $n_{ix} \ne \beta - 1$ return($0$). \\
*ebfedea0SLionel Sambuc3.  Return($1$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_dr\_is\_modulus}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_dr\_is\_modulus.}
*ebfedea0SLionel SambucThis algorithm determines if a value is in Diminished Radix form.  Step 1 rejects obvious cases where fewer than two digits are
*ebfedea0SLionel Sambucin the mp\_int.  Step 2 tests all but the first digit to see if they are equal to $\beta - 1$.  If the algorithm manages to get to
*ebfedea0SLionel Sambucstep 3 then $n$ must be of Diminished Radix form.
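
The test is simple enough to state in a few lines of Python.  In this sketch the modulus is assumed to be given as a list of digits with the least significant digit first and no leading zero digits; the function name \texttt{dr\_is\_modulus} and the 28-bit digit default are assumptions made for the example.

```python
def dr_is_modulus(n_digits, digit_bits=28):
    """Return True if the digit list is in restricted Diminished Radix form.

    n_digits holds the digits least significant first; the name and the
    list representation are assumptions of this sketch.
    """
    top = (1 << digit_bits) - 1          # beta - 1
    if len(n_digits) < 2:                # step 1: need at least two digits
        return False
    # step 2: every digit except the least significant must equal beta - 1
    return all(d == top for d in n_digits[1:])
```

For example, with $\beta = 2^{28}$ the modulus $\beta^3 - 5$ has the digits $\beta - 5$, $\beta - 1$ and $\beta - 1$ and is therefore accepted.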
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_dr_is_modulus.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Unrestricted Diminished Radix Reduction}
*ebfedea0SLionel SambucThe unrestricted Diminished Radix algorithm allows modular reductions to be performed when the modulus is of the form $2^p - k$.  This algorithm
*ebfedea0SLionel Sambucis a straightforward adaptation of algorithm~\ref{fig:DR}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn general the restricted Diminished Radix reduction algorithm is much faster since it has considerably lower overhead.  However, this new
*ebfedea0SLionel Sambucalgorithm is much faster than either Montgomery or Barrett reduction when the moduli are of the appropriate form.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_reduce\_2k}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $n$.  mp\_digit $k$  \\
*ebfedea0SLionel Sambuc\hspace{11.5mm}($a \ge 0$, $n > 1$, $0 < k < \beta$, $n + k$ is a power of two) \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a \mbox{ (mod }n\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc1.  $p \leftarrow \lceil lg(n) \rceil$  (\textit{mp\_count\_bits}) \\
*ebfedea0SLionel Sambuc2.  While $a \ge n$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $q \leftarrow \lfloor a / 2^p \rfloor$ (\textit{mp\_div\_2d}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  $a \leftarrow a \mbox{ (mod }2^p\mbox{)}$ (\textit{mp\_mod\_2d}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.3  $q \leftarrow q \cdot k$ (\textit{mp\_mul\_d}) \\
\hspace{3mm}2.4  $a \leftarrow a + q$ (\textit{s\_mp\_add}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.5  If $a \ge n$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.5.1  $a \leftarrow a - n$ \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_reduce\_2k}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_reduce\_2k.}
*ebfedea0SLionel SambucThis algorithm quickly reduces an input $a$ modulo an unrestricted Diminished Radix modulus $n$.  Division by $2^p$ is emulated with a right
*ebfedea0SLionel Sambucshift which makes the algorithm fairly inexpensive to use.
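
A minimal Python sketch of the same idea is given below.  By the congruence at the start of this section the quotient, scaled by $k$, is added back to the remainder.  The function name \texttt{reduce\_2k} is hypothetical, Python integers stand in for the mp\_int type, and the shift and mask stand in for mp\_div\_2d and mp\_mod\_2d.

```python
def reduce_2k(a, n, k):
    """Reduce a >= 0 modulo n, where n + k is a power of two and 0 < k < n.

    The function name is an assumption of this sketch.
    """
    p = n.bit_length()               # number of bits in n, so n = 2^p - k
    while a >= n:
        q = a >> p                   # floor(a / 2^p) via a right shift
        a = a & ((1 << p) - 1)       # a mod 2^p
        q = q * k                    # can be skipped when k == 1
        a = a + q                    # a is congruent to (a mod 2^p) + k*q
        if a >= n:
            a = a - n
    return a
```

When $k = 1$ the multiplication leaves $q$ unchanged, which is why reductions modulo $2^p - 1$ need no multiplications at all.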
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_reduce_2k.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe algorithm mp\_count\_bits calculates the number of bits in an mp\_int which is used to find the initial value of $p$.  The call to mp\_div\_2d
*ebfedea0SLionel Sambucon line @31,mp_div_2d@ calculates both the quotient $q$ and the remainder $a$ required.  By doing both in a single function call the code size
*ebfedea0SLionel Sambucis kept fairly small.  The multiplication by $k$ is only performed if $k > 1$. This allows reductions modulo $2^p - 1$ to be performed without
*ebfedea0SLionel Sambucany multiplications.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe unsigned s\_mp\_add, mp\_cmp\_mag and s\_mp\_sub are used in place of their full sign counterparts since the inputs are only valid if they are
*ebfedea0SLionel Sambucpositive.  By using the unsigned versions the overhead is kept to a minimum.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Unrestricted Setup}
To set up this reduction algorithm the value of $k = 2^p - n$ is required.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_reduce\_2k\_setup}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $n$   \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $k = 2^p - n$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc1.  $p \leftarrow \lceil lg(n) \rceil$  (\textit{mp\_count\_bits}) \\
*ebfedea0SLionel Sambuc2.  $x \leftarrow 2^p$ (\textit{mp\_2expt}) \\
*ebfedea0SLionel Sambuc3.  $x \leftarrow x - n$ (\textit{mp\_sub}) \\
*ebfedea0SLionel Sambuc4.  $k \leftarrow x_0$ \\
*ebfedea0SLionel Sambuc5.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_reduce\_2k\_setup}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_reduce\_2k\_setup.}
*ebfedea0SLionel SambucThis algorithm computes the value of $k$ required for the algorithm mp\_reduce\_2k.  By making a temporary variable $x$ equal to $2^p$ a subtraction
*ebfedea0SLionel Sambucis sufficient to solve for $k$.  Alternatively if $n$ has more than one digit the value of $k$ is simply $\beta - n_0$.
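
As a small worked example, take $n = 251$.  Then $p = 8$, $x = 2^8 = 256$ and $k = 256 - 251 = 5$.  For the multi-digit shortcut assume, purely
for illustration, a tiny radix $\beta = 16$ so that $n = (15, 11)_{16}$ has the least significant digit $n_0 = 11$.

\begin{equation}
k = \beta - n_0 = 16 - 11 = 5
\end{equation}

Both routes give the same value of $k$.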
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_reduce_2k_setup.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Unrestricted Detection}
An integer $n$ is a valid unrestricted Diminished Radix modulus if either of the following is true.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item  The number has only one digit.
\item  The number has more than one digit and every bit from the $lg(\beta)$'th to the most significant is one.
*ebfedea0SLionel Sambuc\end{enumerate}
*ebfedea0SLionel Sambuc
If either condition is true then there is a power of two $2^p$ such that $0 < 2^p - n < \beta$.   If the input is only
one digit then it will always be of the correct form.  Otherwise all of the bits above the first digit must be one.  This arises from the fact
that there will be a value of $k$ that when added to the modulus causes a carry in the first digit which propagates all the way to the most
significant bit.  The resulting sum will be a power of two.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_reduce\_is\_2k}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $n$   \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $1$ if of proper form, $0$ otherwise \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc1.  If $n.used = 0$ then return($0$). \\
*ebfedea0SLionel Sambuc2.  If $n.used = 1$ then return($1$). \\
*ebfedea0SLionel Sambuc3.  $p \leftarrow \lceil lg(n) \rceil$  (\textit{mp\_count\_bits}) \\
4.  for $x$ from $lg(\beta)$ to $p - 1$ do \\
\hspace{3mm}4.1  If the ($x \mbox{ mod }lg(\beta)$)'th bit of the $\lfloor x / lg(\beta) \rfloor$'th digit of $n$ is zero then return($0$). \\
*ebfedea0SLionel Sambuc5.  Return($1$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_reduce\_is\_2k}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_reduce\_is\_2k.}
*ebfedea0SLionel SambucThis algorithm quickly determines if a modulus is of the form required for algorithm mp\_reduce\_2k to function properly.
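
For example, assume a small radix $\beta = 16$ (\textit{so $lg(\beta) = 4$}) purely for illustration.  The modulus $n = 251 = 11111011_2$ has two
digits and bits four through seven are all one, so the algorithm returns $1$; indeed $2^8 - 251 = 5 < \beta$.  The modulus $n = 235 = 11101011_2$
has a zero in bit four, so the algorithm returns $0$; here $2^8 - 235 = 21 \ge \beta$.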
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_reduce_is_2k.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Algorithm Comparison}
So far three very different algorithms for modular reduction have been discussed.  Each of the algorithms has its own strengths and weaknesses,
which makes having such a selection very useful.  The following table summarizes the three algorithms along with comparisons of work factors.  Since
all three algorithms have the restriction that $0 \le x < n^2$ and $n > 1$ those limitations are not included in the table.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|c|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Method} & \textbf{Work Required} & \textbf{Limitations} & \textbf{$m = 8$} & \textbf{$m = 32$} & \textbf{$m = 64$} \\
*ebfedea0SLionel Sambuc\hline Barrett    & $m^2 + 2m - 1$ & None              & $79$ & $1087$ & $4223$ \\
*ebfedea0SLionel Sambuc\hline Montgomery & $m^2 + m$      & $n$ must be odd   & $72$ & $1056$ & $4160$ \\
*ebfedea0SLionel Sambuc\hline D.R.       & $2m$           & $n = \beta^m - k$ & $16$ & $64$   & $128$  \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn theory Montgomery and Barrett reductions would require roughly the same amount of time to complete.  However, in practice since Montgomery
*ebfedea0SLionel Sambucreduction can be written as a single function with the Comba technique it is much faster.  Barrett reduction suffers from the overhead of
*ebfedea0SLionel Sambuccalling the half precision multipliers, addition and division by $\beta$ algorithms.
*ebfedea0SLionel Sambuc
For almost every cryptographic algorithm Montgomery reduction is the algorithm of choice.  The one class of algorithms where Diminished Radix reduction truly
shines is those based on the discrete logarithm problem, such as Diffie-Hellman \cite{DH} and ElGamal \cite{ELGAMAL}.  In these algorithms
primes of the form $\beta^m - k$ can be found and shared amongst users.  These primes allow the Diminished Radix algorithm to be used in
modular exponentiation to greatly speed up the operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section*{Exercises}
*ebfedea0SLionel Sambuc\begin{tabular}{cl}
*ebfedea0SLionel Sambuc$\left [ 3 \right ]$ & Prove that the ``trick'' in algorithm mp\_montgomery\_setup actually \\
*ebfedea0SLionel Sambuc                     & calculates the correct value of $\rho$. \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 2 \right ]$ & Devise an algorithm to reduce modulo $n + k$ for small $k$ quickly.  \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc$\left [ 4 \right ]$ & Prove that the pseudo-code algorithm ``Diminished Radix Reduction'' \\
*ebfedea0SLionel Sambuc                     & (\textit{figure~\ref{fig:DR}}) terminates.  Also prove the probability that it will \\
*ebfedea0SLionel Sambuc                     & terminate within $1 \le k \le 10$ iterations. \\
*ebfedea0SLionel Sambuc                     & \\
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Exponentiation}
*ebfedea0SLionel SambucExponentiation is the operation of raising one variable to the power of another, for example, $a^b$.  A variant of exponentiation, computed
*ebfedea0SLionel Sambucin a finite field or ring, is called modular exponentiation.  This latter style of operation is typically used in public key
*ebfedea0SLionel Sambuccryptosystems such as RSA and Diffie-Hellman.  The ability to quickly compute modular exponentiations is of great benefit to any
*ebfedea0SLionel Sambucsuch cryptosystem and many methods have been sought to speed it up.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Exponentiation Basics}
*ebfedea0SLionel SambucA trivial algorithm would simply multiply $a$ against itself $b - 1$ times to compute the exponentiation desired.  However, as $b$ grows in size
*ebfedea0SLionel Sambucthe number of multiplications becomes prohibitive.  Imagine what would happen if $b$ $\approx$ $2^{1024}$ as is the case when computing an RSA signature
with a $1024$-bit key.  Such a calculation could never be completed as it would simply take far too long.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFortunately there is a very simple algorithm based on the laws of exponents.  Recall that $lg_a(a^b) = b$ and that $lg_a(a^ba^c) = b + c$ which
*ebfedea0SLionel Sambucare two trivial relationships between the base and the exponent.  Let $b_i$ represent the $i$'th bit of $b$ starting from the least
significant bit.  If $b$ is a $k$-bit integer then the following equation is true.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuca^b = \prod_{i=0}^{k-1} a^{2^i \cdot b_i}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy taking the base $a$ logarithm of both sides of the equation the following equation is the result.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucb = \sum_{i=0}^{k-1}2^i \cdot b_i
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe term $a^{2^i}$ can be found from the $i - 1$'th term by squaring the term since $\left ( a^{2^i} \right )^2$ is equal to
*ebfedea0SLionel Sambuc$a^{2^{i+1}}$.  This observation forms the basis of essentially all fast exponentiation algorithms.  It requires $k$ squarings and on average
*ebfedea0SLionel Sambuc$k \over 2$ multiplications to compute the result.  This is indeed quite an improvement over simply multiplying by $a$ a total of $b-1$ times.
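
As a small worked example, let $b = 13 = 1101_2$ so that $k = 4$.

\begin{equation}
a^{13} = a^{8 \cdot 1} \cdot a^{4 \cdot 1} \cdot a^{2 \cdot 0} \cdot a^{1 \cdot 1} = a^8 \cdot a^4 \cdot a
\end{equation}

The terms $a^2$, $a^4$ and $a^8$ are obtained by successive squarings of $a$ and the selected terms are then multiplied together, instead of the
twelve multiplications the trivial method would need.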
*ebfedea0SLionel Sambuc
While this current method is a considerable speed up there are further improvements to be made.  For example, the $a^{2^i}$ term does not need to
be computed in an auxiliary variable.  Consider the following equivalent algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Left to Right Exponentiation}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $a$, $b$ and $k$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = a^b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $c \leftarrow 1$ \\
*ebfedea0SLionel Sambuc2.  for $i$ from $k - 1$ to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $c \leftarrow c^2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  $c \leftarrow c \cdot a^{b_i}$ \\
*ebfedea0SLionel Sambuc3.  Return $c$. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Left to Right Exponentiation}
*ebfedea0SLionel Sambuc\label{fig:LTOR}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis algorithm starts from the most significant bit and works towards the least significant bit.  When the $i$'th bit of $b$ is set $a$ is
*ebfedea0SLionel Sambucmultiplied against the current product.  In each iteration the product is squared which doubles the exponent of the individual terms of the
*ebfedea0SLionel Sambucproduct.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFor example, let $b = 101100_2 \equiv 44_{10}$.  The following chart demonstrates the actions of the algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Value of $i$} & \textbf{Value of $c$} \\
*ebfedea0SLionel Sambuc\hline - & $1$ \\
*ebfedea0SLionel Sambuc\hline $5$ & $a$ \\
*ebfedea0SLionel Sambuc\hline $4$ & $a^2$ \\
*ebfedea0SLionel Sambuc\hline $3$ & $a^4 \cdot a$ \\
*ebfedea0SLionel Sambuc\hline $2$ & $a^8 \cdot a^2 \cdot a$ \\
*ebfedea0SLionel Sambuc\hline $1$ & $a^{16} \cdot a^4 \cdot a^2$ \\
*ebfedea0SLionel Sambuc\hline $0$ & $a^{32} \cdot a^8 \cdot a^4$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Example of Left to Right Exponentiation}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
When the product $a^{32} \cdot a^8 \cdot a^4$ is simplified it is equal to $a^{44}$ which is the desired exponentiation.  This particular algorithm is
called ``Left to Right'' because it reads the exponent in that order.  All of the exponentiation algorithms that will be presented are of this nature.
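
The left to right algorithm can be sketched on machine words as follows.  The function name ltor\_pow and the use of 64-bit integers (\textit{so results
wrap modulo $2^{64}$}) are assumptions of this sketch and are not part of the library.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch: c = a^b, reading the k low bits of b from
 * bit k-1 down to bit 0.  Requires k <= 64; wraps modulo 2^64. */
uint64_t ltor_pow(uint64_t a, uint64_t b, unsigned k)
{
    uint64_t c = 1;                 /* step 1 */
    for (unsigned i = k; i-- > 0; ) {
        c *= c;                     /* step 2.1: square */
        if ((b >> i) & 1u) {
            c *= a;                 /* step 2.2: multiply when b_i = 1 */
        }
    }
    return c;
}
\end{verbatim}
\end{small}

Calling ltor\_pow(2, 44, 6) follows the same six iterations as the chart above with $a = 2$ and yields $2^{44}$.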
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Single Digit Exponentiation}
*ebfedea0SLionel SambucThe first algorithm in the series of exponentiation algorithms will be an unbounded algorithm where the exponent is a single digit.  It is intended
*ebfedea0SLionel Sambucto be used when a small power of an input is required (\textit{e.g. $a^5$}).  It is faster than simply multiplying $b - 1$ times for all values of
*ebfedea0SLionel Sambuc$b$ that are greater than three.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_expt\_d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and mp\_digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = a^b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $g \leftarrow a$ (\textit{mp\_init\_copy}) \\
*ebfedea0SLionel Sambuc2.  $c \leftarrow 1$ (\textit{mp\_set}) \\
*ebfedea0SLionel Sambuc3.  for $x$ from 1 to $lg(\beta)$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $c \leftarrow c^2$ (\textit{mp\_sqr}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.2  If $b$ AND $2^{lg(\beta) - 1} \ne 0$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}3.2.1  $c \leftarrow c \cdot g$ (\textit{mp\_mul}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.3  $b \leftarrow b << 1$ \\
*ebfedea0SLionel Sambuc4.  Clear $g$. \\
*ebfedea0SLionel Sambuc5.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_expt\_d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_expt\_d.}
*ebfedea0SLionel SambucThis algorithm computes the value of $a$ raised to the power of a single digit $b$.  It uses the left to right exponentiation algorithm to
*ebfedea0SLionel Sambucquickly compute the exponentiation.  It is loosely based on algorithm 14.79 of HAC \cite[pp. 615]{HAC} with the difference that the
*ebfedea0SLionel Sambucexponent is a fixed width.
*ebfedea0SLionel Sambuc
A copy of $a$ is made first to allow the destination variable $c$ to be the same as the source variable $a$.  The result is set to the initial value of
$1$ in the subsequent step.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucInside the loop the exponent is read from the most significant bit first down to the least significant bit.  First $c$ is invariably squared
*ebfedea0SLionel Sambucon step 3.1.  In the following step if the most significant bit of $b$ is one the copy of $a$ is multiplied against $c$.  The value
of $b$ is shifted left one bit to make the next bit down from the most significant bit the new most significant bit.  In effect each
*ebfedea0SLionel Sambuciteration of the loop moves the bits of the exponent $b$ upwards to the most significant location.
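
The same shifting technique can be illustrated on machine words.  In this sketch the exponent is assumed to be a 32-bit ``digit'' and the
result a 64-bit word that wraps on overflow; the name expt\_d\_u32 is hypothetical and only mirrors the structure of the pseudo-code.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch of the fixed-width loop: c = a^b where b is a
 * 32-bit "digit".  The top bit of b is tested, then b is shifted left. */
uint64_t expt_d_u32(uint64_t a, uint32_t b)
{
    uint64_t c = 1;                      /* step 2 */
    for (int x = 0; x < 32; x++) {       /* step 3: one pass per bit */
        c *= c;                          /* step 3.1 */
        if (b & UINT32_C(0x80000000)) {  /* step 3.2: top bit set?   */
            c *= a;                      /* step 3.2.1 */
        }
        b <<= 1;                         /* step 3.3 */
    }
    return c;
}
\end{verbatim}
\end{small}

Leading zero bits of the exponent only square the value $1$, so they cost time but do not change the result.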
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_expt_d.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLine @29,mp_set@ sets the initial value of the result to $1$.  Next the loop on line @31,for@ steps through each bit of the exponent starting from
*ebfedea0SLionel Sambucthe most significant down towards the least significant. The invariant squaring operation placed on line @333,mp_sqr@ is performed first.  After
*ebfedea0SLionel Sambucthe squaring the result $c$ is multiplied by the base $g$ if and only if the most significant bit of the exponent is set.  The shift on line
*ebfedea0SLionel Sambuc@47,<<@ moves all of the bits of the exponent upwards towards the most significant location.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{$k$-ary Exponentiation}
*ebfedea0SLionel SambucWhen calculating an exponentiation the most time consuming bottleneck is the multiplications which are in general a small factor
slower than squaring.  Recall from the previous algorithm that $b_{i}$ refers to the $i$'th bit of the exponent $b$.  Suppose instead it referred to
the $i$'th $k$-bit digit of the exponent $b$.  For $k = 1$ the definitions are synonymous and for $k > 1$ algorithm~\ref{fig:KARY}
computes the same exponentiation.  A group of $k$ bits from the exponent is called a \textit{window}.  That is, it is a small window on only a
portion of the entire exponent.  Consider the following modification to the basic left to right exponentiation algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{$k$-ary Exponentiation}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $a$, $b$, $k$ and $t$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = a^b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $c \leftarrow 1$ \\
*ebfedea0SLionel Sambuc2.  for $i$ from $t - 1$ to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $c \leftarrow c^{2^k} $ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  Extract the $i$'th $k$-bit word from $b$ and store it in $g$. \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.3  $c \leftarrow c \cdot a^g$ \\
*ebfedea0SLionel Sambuc3.  Return $c$. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{$k$-ary Exponentiation}
*ebfedea0SLionel Sambuc\label{fig:KARY}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe squaring on step 2.1 can be calculated by squaring the value $c$ successively $k$ times.  If the values of $a^g$ for $0 < g < 2^k$ have been
*ebfedea0SLionel Sambucprecomputed this algorithm requires only $t$ multiplications and $tk$ squarings.  The table can be generated with $2^{k - 1} - 1$ squarings and
*ebfedea0SLionel Sambuc$2^{k - 1} + 1$ multiplications.  This algorithm assumes that the number of bits in the exponent is evenly divisible by $k$.
*ebfedea0SLionel SambucHowever, when it is not the remaining $0 < x \le k - 1$ bits can be handled with algorithm~\ref{fig:LTOR}.
*ebfedea0SLionel Sambuc
Suppose $k = 4$ and $t = 100$.  This modified algorithm will require $109$ multiplications and $407$ squarings to compute the exponentiation.  The
original algorithm would on average have required $200$ multiplications and $400$ squarings to compute the same value.  The total number of squarings
*ebfedea0SLionel Sambuchas increased slightly but the number of multiplications has nearly halved.
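
A fixed window version can be sketched on machine words as shown below.  The name kary\_pow, the limit $k \le 4$ and the 64-bit wrap-around
arithmetic are assumptions of this sketch.  For brevity the table is filled by repeated multiplication by $a$ rather than by the mix of squarings and
multiplications counted above.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch: c = a^b where b is read as t windows of k bits.
 * Requires 1 <= k <= 4 and k*t <= 64; results wrap modulo 2^64. */
uint64_t kary_pow(uint64_t a, uint64_t b, unsigned k, unsigned t)
{
    uint64_t table[16];                 /* table[g] = a^g */
    const unsigned size = 1u << k;
    uint64_t c = 1;

    table[0] = 1;
    for (unsigned g = 1; g < size; g++) {
        table[g] = table[g - 1] * a;
    }

    for (unsigned i = t; i-- > 0; ) {
        for (unsigned j = 0; j < k; j++) {
            c *= c;                     /* step 2.1: c = c^(2^k) */
        }
        unsigned g = (unsigned)((b >> (i * k)) & (size - 1)); /* 2.2 */
        c *= table[g];                  /* step 2.3 */
    }
    return c;
}
\end{verbatim}
\end{small}

A window equal to zero still costs one multiplication by $a^0 = 1$ in this sketch; skipping it is an obvious refinement.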
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Optimal Values of $k$}
*ebfedea0SLionel SambucAn optimal value of $k$ will minimize $2^{k} + \lceil n / k \rceil + n - 1$ for a fixed number of bits in the exponent $n$.  The simplest
*ebfedea0SLionel Sambucapproach is to brute force search amongst the values $k = 2, 3, \ldots, 8$ for the lowest result.  Table~\ref{fig:OPTK} lists optimal values of $k$
*ebfedea0SLionel Sambucfor various exponent sizes and compares the number of multiplication and squarings required against algorithm~\ref{fig:LTOR}.
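
For instance, with $n = 128$ the expression evaluates as follows for three neighbouring window sizes.

\begin{center}
\begin{small}
\begin{tabular}{|c|c|}
\hline \textbf{$k$} & \textbf{$2^{k} + \lceil n / k \rceil + n - 1$} \\
\hline $3$ & $8 + 43 + 127 = 178$ \\
\hline $4$ & $16 + 32 + 127 = 175$ \\
\hline $5$ & $32 + 26 + 127 = 185$ \\
\hline
\end{tabular}
\end{small}
\end{center}

The minimum is reached at $k = 4$, which agrees with the entry for a $128$-bit exponent in table~\ref{fig:OPTK}.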
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
\begin{tabular}{|c|c|c|c|}
\hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with~\ref{fig:LTOR}} \\
*ebfedea0SLionel Sambuc\hline $16$ & $2$ & $27$ & $24$ \\
*ebfedea0SLionel Sambuc\hline $32$ & $3$ & $49$ & $48$ \\
*ebfedea0SLionel Sambuc\hline $64$ & $3$ & $92$ & $96$ \\
*ebfedea0SLionel Sambuc\hline $128$ & $4$ & $175$ & $192$ \\
*ebfedea0SLionel Sambuc\hline $256$ & $4$ & $335$ & $384$ \\
*ebfedea0SLionel Sambuc\hline $512$ & $5$ & $645$ & $768$ \\
*ebfedea0SLionel Sambuc\hline $1024$ & $6$ & $1257$ & $1536$ \\
*ebfedea0SLionel Sambuc\hline $2048$ & $6$ & $2452$ & $3072$ \\
*ebfedea0SLionel Sambuc\hline $4096$ & $7$ & $4808$ & $6144$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Optimal Values of $k$ for $k$-ary Exponentiation}
*ebfedea0SLionel Sambuc\label{fig:OPTK}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Sliding-Window Exponentiation}
A simple modification to the previous algorithm is to only generate the upper half of the table in the range $2^{k-1} \le g < 2^k$.  Essentially
*ebfedea0SLionel Sambucthis is a table for all values of $g$ where the most significant bit of $g$ is a one.  However, in order for this to be allowed in the
*ebfedea0SLionel Sambucalgorithm values of $g$ in the range $0 \le g < 2^{k-1}$ must be avoided.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucTable~\ref{fig:OPTK2} lists optimal values of $k$ for various exponent sizes and compares the work required against algorithm~\ref{fig:KARY}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{small}
\begin{tabular}{|c|c|c|c|}
\hline \textbf{Exponent (bits)} & \textbf{Optimal $k$} & \textbf{Work at $k$} & \textbf{Work with~\ref{fig:KARY}} \\
*ebfedea0SLionel Sambuc\hline $16$ & $3$ & $24$ & $27$ \\
*ebfedea0SLionel Sambuc\hline $32$ & $3$ & $45$ & $49$ \\
*ebfedea0SLionel Sambuc\hline $64$ & $4$ & $87$ & $92$ \\
*ebfedea0SLionel Sambuc\hline $128$ & $4$ & $167$ & $175$ \\
*ebfedea0SLionel Sambuc\hline $256$ & $5$ & $322$ & $335$ \\
*ebfedea0SLionel Sambuc\hline $512$ & $6$ & $628$ & $645$ \\
*ebfedea0SLionel Sambuc\hline $1024$ & $6$ & $1225$ & $1257$ \\
*ebfedea0SLionel Sambuc\hline $2048$ & $7$ & $2403$ & $2452$ \\
*ebfedea0SLionel Sambuc\hline $4096$ & $8$ & $4735$ & $4808$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Optimal Values of $k$ for Sliding Window Exponentiation}
*ebfedea0SLionel Sambuc\label{fig:OPTK2}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Sliding Window $k$-ary Exponentiation}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Integer $a$, $b$, $k$ and $t$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = a^b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $c \leftarrow 1$ \\
*ebfedea0SLionel Sambuc2.  for $i$ from $t - 1$ to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  If the $i$'th bit of $b$ is a zero then \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.1.1   $c \leftarrow c^2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  else do \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.1  $c \leftarrow c^{2^k}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.2  Extract the $k$ bits from $(b_{i}b_{i-1}\ldots b_{i-(k-1)})$ and store it in $g$. \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.3  $c \leftarrow c \cdot a^g$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}2.2.4  $i \leftarrow i - k$ \\
*ebfedea0SLionel Sambuc3.  Return $c$. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Sliding Window $k$-ary Exponentiation}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSimilar to the previous algorithm this algorithm must have a special handler when fewer than $k$ bits are left in the exponent.  While this
algorithm requires the same number of squarings it can potentially have fewer multiplications.  The pre-computed table $a^g$ is also half
the size of the previous table.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucConsider the exponent $b = 111101011001000_2 \equiv 31432_{10}$ with $k = 3$ using both algorithms.  The first algorithm will divide the exponent up as
*ebfedea0SLionel Sambucthe following five $3$-bit words $b \equiv \left ( 111, 101, 011, 001, 000 \right )_{2}$.  The second algorithm will break the
exponent as $b \equiv \left ( 111, 101, 0, 110, 0, 100, 0 \right )_{2}$.  The single $0$ digits in the second representation mark where
a single squaring took place instead of a squaring and multiplication.  In total the first method requires $10$ multiplications and $18$
*ebfedea0SLionel Sambucsquarings.  The second method requires $8$ multiplications and $18$ squarings.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn general the sliding window method is never slower than the generic $k$-ary method and often it is slightly faster.
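
The sliding window idea can be sketched on machine words as follows.  The name sliding\_pow, the limit $k \le 4$ and the 64-bit wrap-around
arithmetic are assumptions of this sketch.  Only window values with the top bit set are stored, and when fewer than $k$ bits remain the sketch
falls back to plain square and multiply steps.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch: c = a^b over the low `bits` bits of b.
 * Requires 1 <= k <= 4 and bits <= 64; results wrap modulo 2^64. */
uint64_t sliding_pow(uint64_t a, uint64_t b, unsigned k, unsigned bits)
{
    uint64_t upper[8];               /* upper[j] = a^(2^(k-1) + j) */
    const unsigned half = 1u << (k - 1);
    uint64_t c = 1;
    int i = (int)bits - 1;

    upper[0] = a;
    for (unsigned j = 1; j < k; j++) {
        upper[0] *= upper[0];        /* k-1 squarings: a^(2^(k-1)) */
    }
    for (unsigned j = 1; j < half; j++) {
        upper[j] = upper[j - 1] * a;
    }

    while (i >= 0) {
        if (((b >> i) & 1u) == 0) {
            c *= c;                  /* zero bit: single squaring */
            i--;
        } else if (i + 1 >= (int)k) {
            for (unsigned j = 0; j < k; j++) {
                c *= c;              /* c = c^(2^k) */
            }
            unsigned g = (unsigned)((b >> (i - (int)k + 1))
                                    & ((1u << k) - 1));
            c *= upper[g - half];    /* top bit of g is always set */
            i -= (int)k;
        } else {
            c *= c;                  /* fewer than k bits left and  */
            c *= a;                  /* this bit is one             */
            i--;
        }
    }
    return c;
}
\end{verbatim}
\end{small}

Because a window only starts on a one bit, the index $g - 2^{k-1}$ into the half-sized table is never negative.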
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Modular Exponentiation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucModular exponentiation is essentially computing the power of a base within a finite field or ring.  For example, computing
*ebfedea0SLionel Sambuc$d \equiv a^b \mbox{ (mod }c\mbox{)}$ is a modular exponentiation.  Instead of first computing $a^b$ and then reducing it
*ebfedea0SLionel Sambucmodulo $c$ the intermediate result is reduced modulo $c$ after every squaring or multiplication operation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis guarantees that any intermediate result is bounded by $0 \le d \le c^2 - 2c + 1$ and can be reduced modulo $c$ quickly using
*ebfedea0SLionel Sambucone of the algorithms presented in ~REDUCTION~.
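
The effect of reducing after every operation can be sketched on machine words.  The name modexp\_u32 and the restriction to 32-bit operands are
assumptions of this sketch; with a 32-bit modulus every intermediate product is at most $(c - 1)^2$ and therefore fits in 64 bits.  The plain
remainder operator stands in for the reduction algorithms of the previous chapter.

\begin{small}
\begin{verbatim}
#include <stdint.h>

/* Hypothetical sketch: d = a^b (mod c) with c > 0, left to right.
 * Every square and multiply is followed by a reduction modulo c. */
uint32_t modexp_u32(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t d = 1 % c;              /* handles c = 1 */
    const uint64_t base = a % c;

    for (int i = 31; i >= 0; i--) {
        d = (d * d) % c;             /* square, then reduce   */
        if ((b >> i) & 1u) {
            d = (d * base) % c;      /* multiply, then reduce */
        }
    }
    return (uint32_t)d;
}
\end{verbatim}
\end{small}

For example, modexp\_u32(4, 13, 497) returns $445$ without ever forming the full value of $4^{13}$.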
*ebfedea0SLionel Sambuc
Before the actual modular exponentiation algorithm can be written a wrapper algorithm must be written first.  This algorithm
will allow the exponent $b$ to be negative, in which case the result is computed as $d \equiv \left (1 / a \right )^{\vert b \vert} \mbox{ (mod }c\mbox{)}$. The
value of $(1/a) \mbox{ mod }c$ is computed using the modular inverse (\textit{see \ref{sec;modinv}}).  If no inverse exists the algorithm
terminates with an error.
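
As a small worked example, let $a = 3$, $b = -2$ and $c = 7$.  The inverse of $3$ modulo $7$ is $5$ since $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.

\begin{equation}
d \equiv \left ( 1 / 3 \right )^{\vert -2 \vert} \equiv 5^2 \equiv 25 \equiv 4 \mbox{ (mod }7\mbox{)}
\end{equation}

As a check, $3^2 \cdot 4 = 36 \equiv 1 \mbox{ (mod }7\mbox{)}$.  By contrast $a = 2$ has no inverse modulo $c = 4$, so a negative exponent would be rejected in that case.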
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_exptmod}. \\
\textbf{Input}.   mp\_int $g$, $x$ and $p$ \\
\textbf{Output}.  $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
\hline \\
1.  If $p.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\
2.  If $x.sign = MP\_NEG$ then \\
\hspace{3mm}2.1  $g' \leftarrow g^{-1} \mbox{ (mod }p\mbox{)}$ \\
\hspace{3mm}2.2  $x' \leftarrow \vert x \vert$ \\
\hspace{3mm}2.3  Compute $y \equiv g'^{x'} \mbox{ (mod }p\mbox{)}$ via recursion. \\
*ebfedea0SLionel Sambuc3.  if $p$ is odd \textbf{OR} $p$ is a D.R. modulus then \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm mp\_exptmod\_fast. \\
*ebfedea0SLionel Sambuc4.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  Compute $y \equiv g^{x} \mbox{ (mod }p\mbox{)}$ via algorithm s\_mp\_exptmod. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_exptmod}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_exptmod.}
*ebfedea0SLionel SambucThe first algorithm which actually performs modular exponentiation is algorithm s\_mp\_exptmod.  It is a sliding window $k$-ary algorithm
*ebfedea0SLionel Sambucwhich uses Barrett reduction to reduce the product modulo $p$.  The second algorithm mp\_exptmod\_fast performs the same operation
*ebfedea0SLionel Sambucexcept it uses either Montgomery or Diminished Radix reduction.  The two latter reduction algorithms are clumped in the same exponentiation
*ebfedea0SLionel Sambucalgorithm since their arguments are essentially the same (\textit{two mp\_ints and one mp\_digit}).
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_exptmod.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn order to keep the algorithms in a known state the first step on line @29,if@ is to reject any negative modulus as input.  If the exponent is
*ebfedea0SLionel Sambucnegative the algorithm tries to perform a modular exponentiation with the modular inverse of the base $G$.  The temporary variable $tmpG$ is assigned
the modular inverse of $G$ and $tmpX$ is assigned the absolute value of $X$.  The algorithm will recurse with these new values with a positive
*ebfedea0SLionel Sambucexponent.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf the exponent is positive the algorithm resumes the exponentiation.  Line @63,dr_@ determines if the modulus is of the restricted Diminished Radix
form.  If it is not, line @65,reduce@ attempts to determine if it is of an unrestricted Diminished Radix form.  The integer $dr$ will take on one
*ebfedea0SLionel Sambucof three values.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item $dr = 0$ means that the modulus is not of either restricted or unrestricted Diminished Radix form.
*ebfedea0SLionel Sambuc\item $dr = 1$ means that the modulus is of restricted Diminished Radix form.
*ebfedea0SLionel Sambuc\item $dr = 2$ means that the modulus is of unrestricted Diminished Radix form.
*ebfedea0SLionel Sambuc\end{enumerate}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLine @69,if@ determines if the fast modular exponentiation algorithm can be used.  It is allowed if $dr \ne 0$ or if the modulus is odd.  Otherwise,
*ebfedea0SLionel Sambucthe slower s\_mp\_exptmod algorithm is used which uses Barrett reduction.
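
The dispatch rule described above can be summarised in a few lines of C.  This is only a sketch of the decision, not the library source: the
enumeration and the function name \texttt{choose\_exptmod} are hypothetical, and the parity of the modulus is passed in as a flag instead of
being read from an mp\_int.

\begin{verbatim}
#include <stdbool.h>

/* Hypothetical labels for the two exponentiation back ends. */
typedef enum { USE_BARRETT, USE_FAST } exptmod_path;

/* dr: 0 = no Diminished Radix form, 1 = restricted, 2 = unrestricted.
 * modulus_is_odd: true when the modulus p is odd. */
static exptmod_path choose_exptmod(int dr, bool modulus_is_odd)
{
    if (dr != 0 || modulus_is_odd) {
        return USE_FAST;     /* Montgomery or Diminished Radix reduction */
    }
    return USE_BARRETT;      /* s_mp_exptmod with Barrett reduction */
}
\end{verbatim}

Only an even modulus that is of neither Diminished Radix form falls through to the Barrett based algorithm.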
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Barrett Modular Exponentiation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_exptmod}. \\
\textbf{Input}.   mp\_int $g$, $x$ and $p$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $k \leftarrow lg(x)$ \\
*ebfedea0SLionel Sambuc2.  $winsize \leftarrow  \left \lbrace \begin{array}{ll}
*ebfedea0SLionel Sambuc                              2 &  \mbox{if }k \le 7 \\
*ebfedea0SLionel Sambuc                              3 &  \mbox{if }7 < k \le 36 \\
*ebfedea0SLionel Sambuc                              4 &  \mbox{if }36 < k \le 140 \\
*ebfedea0SLionel Sambuc                              5 &  \mbox{if }140 < k \le 450 \\
*ebfedea0SLionel Sambuc                              6 &  \mbox{if }450 < k \le 1303 \\
*ebfedea0SLionel Sambuc                              7 &  \mbox{if }1303 < k \le 3529 \\
*ebfedea0SLionel Sambuc                              8 &  \mbox{if }3529 < k \\
*ebfedea0SLionel Sambuc                              \end{array} \right .$ \\
*ebfedea0SLionel Sambuc3.  Initialize $2^{winsize}$ mp\_ints in an array named $M$ and one mp\_int named $\mu$ \\
*ebfedea0SLionel Sambuc4.  Calculate the $\mu$ required for Barrett Reduction (\textit{mp\_reduce\_setup}). \\
*ebfedea0SLionel Sambuc5.  $M_1 \leftarrow g \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\\
Setup the table of small powers of $g$.  First find $g^{2^{winsize - 1}}$ and then the successive powers that follow it. \\
*ebfedea0SLionel Sambuc6.  $k \leftarrow 2^{winsize - 1}$ \\
*ebfedea0SLionel Sambuc7.  $M_{k} \leftarrow M_1$ \\
*ebfedea0SLionel Sambuc8.  for $ix$ from 0 to $winsize - 2$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $M_k \leftarrow \left ( M_k \right )^2$ (\textit{mp\_sqr})  \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.2  $M_k \leftarrow M_k \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\
*ebfedea0SLionel Sambuc9.  for $ix$ from $2^{winsize - 1} + 1$ to $2^{winsize} - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  $M_{ix} \leftarrow M_{ix - 1} \cdot M_{1}$ (\textit{mp\_mul}) \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.2  $M_{ix} \leftarrow M_{ix} \mbox{ (mod }p\mbox{)}$ (\textit{mp\_reduce}) \\
*ebfedea0SLionel Sambuc10.  $res \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucStart Sliding Window. \\
*ebfedea0SLionel Sambuc11.  $mode \leftarrow 0, bitcnt \leftarrow 1, buf \leftarrow 0, digidx \leftarrow x.used - 1, bitcpy \leftarrow 0, bitbuf \leftarrow 0$ \\
*ebfedea0SLionel Sambuc12.  Loop \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.1  $bitcnt \leftarrow bitcnt - 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.2  If $bitcnt = 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.2.1  If $digidx = -1$ goto step 13. \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.2.2  $buf \leftarrow x_{digidx}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.2.3  $digidx \leftarrow digidx - 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.2.4  $bitcnt \leftarrow lg(\beta)$ \\
*ebfedea0SLionel SambucContinued on next page. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_exptmod}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{s\_mp\_exptmod} (\textit{continued}). \\
\textbf{Input}.   mp\_int $g$, $x$ and $p$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $y \equiv g^x \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.3  $y \leftarrow (buf >> (lg(\beta) - 1))$ AND $1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.4  $buf \leftarrow buf << 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.5  if $mode = 0$ and $y = 0$ then goto step 12. \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.6  if $mode = 1$ and $y = 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.6.1  $res \leftarrow res^2$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.6.2  $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.6.3  Goto step 12. \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.7  $bitcpy \leftarrow bitcpy + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.8  $bitbuf \leftarrow bitbuf + (y << (winsize - bitcpy))$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.9  $mode \leftarrow 2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.10  If $bitcpy = winsize$ then do \\
*ebfedea0SLionel Sambuc\hspace{6mm}Window is full so perform the squarings and single multiplication. \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.10.1  for $ix$ from $0$ to $winsize -1$ do \\
*ebfedea0SLionel Sambuc\hspace{9mm}12.10.1.1  $res \leftarrow res^2$ \\
*ebfedea0SLionel Sambuc\hspace{9mm}12.10.1.2  $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.10.2  $res \leftarrow res \cdot M_{bitbuf}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.10.3  $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}Reset the window. \\
*ebfedea0SLionel Sambuc\hspace{6mm}12.10.4  $bitcpy \leftarrow 0, bitbuf \leftarrow 0, mode \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucNo more windows left.  Check for residual bits of exponent. \\
*ebfedea0SLionel Sambuc13.  If $mode = 2$ and $bitcpy > 0$ then do \\
\hspace{3mm}13.1  for $ix$ from $0$ to $bitcpy - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.1.1  $res \leftarrow res^2$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.1.2  $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.1.3  $bitbuf \leftarrow bitbuf << 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.1.4  If $bitbuf$ AND $2^{winsize} \ne 0$ then do \\
*ebfedea0SLionel Sambuc\hspace{9mm}13.1.4.1  $res \leftarrow res \cdot M_{1}$ \\
*ebfedea0SLionel Sambuc\hspace{9mm}13.1.4.2  $res \leftarrow res \mbox{ (mod }p\mbox{)}$ \\
*ebfedea0SLionel Sambuc14.  $y \leftarrow res$ \\
15.  Clear $res$, $\mu$ and the $M$ array. \\
*ebfedea0SLionel Sambuc16.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm s\_mp\_exptmod (continued)}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm s\_mp\_exptmod.}
*ebfedea0SLionel SambucThis algorithm computes the $x$'th power of $g$ modulo $p$ and stores the result in $y$.  It takes advantage of the Barrett reduction
*ebfedea0SLionel Sambucalgorithm to keep the product small throughout the algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first two steps determine the optimal window size based on the number of bits in the exponent.  The larger the exponent the
*ebfedea0SLionel Sambuclarger the window size becomes.  After a window size $winsize$ has been chosen an array of $2^{winsize}$ mp\_int variables is allocated.  This
table will hold the values of $g^i \mbox{ (mod }p\mbox{)}$ for $2^{winsize - 1} \le i < 2^{winsize}$, along with $M_1 \equiv g \mbox{ (mod }p\mbox{)}$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the table is allocated the first power of $g$ is found.  Since $g \ge p$ is allowed it must be first reduced modulo $p$ to make
the rest of the algorithm more efficient.  The first element of the table at $2^{winsize - 1}$ is found by squaring $M_1$ successively $winsize - 1$
times.  The rest of the table elements are found by multiplying the previous element by $M_1$ modulo $p$.
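
The window size selection and the table setup can be sketched on machine integers as follows.  This is an illustration under stated assumptions
and not the library code: the names \texttt{window\_size} and \texttt{build\_table} are hypothetical, the modulus is assumed to be non-zero and
below $2^{32}$ so that every product fits in 64 bits, and the \texttt{\%} operator stands in for Barrett reduction.

\begin{verbatim}
#include <stdint.h>

/* Window size from the bit length k of the exponent (step 2). */
static int window_size(int k)
{
    if (k <= 7)    return 2;
    if (k <= 36)   return 3;
    if (k <= 140)  return 4;
    if (k <= 450)  return 5;
    if (k <= 1303) return 6;
    if (k <= 3529) return 7;
    return 8;
}

/* Fill M[1] and M[2^(w-1)] .. M[2^w - 1] with powers of g modulo p
 * (steps 5 to 9).  M must have room for 2^w entries. */
static void build_table(uint64_t *M, int w, uint64_t g, uint64_t p)
{
    int half = 1 << (w - 1);
    int ix;

    M[1] = g % p;
    M[half] = M[1];
    for (ix = 0; ix < w - 1; ix++) {           /* w - 1 squarings */
        M[half] = (M[half] * M[half]) % p;
    }
    for (ix = half + 1; ix < (1 << w); ix++) { /* one multiply each */
        M[ix] = (M[ix - 1] * M[1]) % p;
    }
}
\end{verbatim}

With $winsize = 3$ and $g = 3$ the upper half of the table holds $3^4, 3^5, 3^6$ and $3^7$ reduced modulo $p$.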
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucNow that the table is available the sliding window may begin.  The following list describes the functions of all the variables in the window.
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item The variable $mode$ dictates how the bits of the exponent are interpreted.
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc   \item When $mode = 0$ the bits are ignored since no non-zero bit of the exponent has been seen yet.  For example, if the exponent were simply
*ebfedea0SLionel Sambuc         $1$ then there would be $lg(\beta) - 1$ zero bits before the first non-zero bit.  In this case bits are ignored until a non-zero bit is found.
*ebfedea0SLionel Sambuc   \item When $mode = 1$ a non-zero bit has been seen before and a new $winsize$-bit window has not been formed yet.  In this mode leading $0$ bits
*ebfedea0SLionel Sambuc         are read and a single squaring is performed.  If a non-zero bit is read a new window is created.
*ebfedea0SLionel Sambuc   \item When $mode = 2$ the algorithm is in the middle of forming a window and new bits are appended to the window from the most significant bit
*ebfedea0SLionel Sambuc         downwards.
*ebfedea0SLionel Sambuc\end{enumerate}
\item The variable $bitcnt$ indicates how many bits of the current exponent digit remain to be read.  When it reaches zero a new digit
      is fetched from the exponent.
*ebfedea0SLionel Sambuc\item The variable $buf$ holds the currently read digit of the exponent.
\item The variable $digidx$ is an index into the exponent's digits.  It starts at the leading digit $x.used - 1$ and moves towards the trailing digit.
*ebfedea0SLionel Sambuc\item The variable $bitcpy$ indicates how many bits are in the currently formed window.  When it reaches $winsize$ the window is flushed and
*ebfedea0SLionel Sambuc      the appropriate operations performed.
*ebfedea0SLionel Sambuc\item The variable $bitbuf$ holds the current bits of the window being formed.
*ebfedea0SLionel Sambuc\end{enumerate}
*ebfedea0SLionel Sambuc
All of step 12 is the window processing loop.  It will iterate while there are digits available from the exponent to read.  The first step
inside this loop is to extract a new digit if no more bits are available in the current digit.  If there are no bits left a new digit is
read and if there are no digits left then the loop terminates.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter a digit is made available step 12.3 will extract the most significant bit of the current digit and move all other bits in the digit
*ebfedea0SLionel Sambucupwards.  In effect the digit is read from most significant bit to least significant bit and since the digits are read from leading to
*ebfedea0SLionel Sambuctrailing edges the entire exponent is read from most significant bit to least significant bit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt step 12.5 if the $mode$ and currently extracted bit $y$ are both zero the bit is ignored and the next bit is read.  This prevents the
algorithm from having to perform trivial squaring and reduction operations before the first non-zero bit is read.  Steps 12.6 and 12.7 through 12.10 handle
the two cases of $mode = 1$ and $mode = 2$ respectively.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFIGU,expt_state,Sliding Window State Diagram
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy step 13 there are no more digits left in the exponent.  However, there may be partial bits in the window left.  If $mode = 2$ then
*ebfedea0SLionel Sambuca Left-to-Right algorithm is used to process the remaining few bits.
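
The complete state machine of steps 11 through 13 can be followed in the sketch below, which works on machine integers.  It is an
illustration only and makes several assumptions that are not part of the algorithm above: the function name \texttt{sliding\_exptmod} is
hypothetical, the window size is fixed at three bits, the exponent digits are 32 bits wide and stored least significant first, the modulus
is non-zero and below $2^{32}$, and the \texttt{\%} operator replaces Barrett reduction.

\begin{verbatim}
#include <stdint.h>

#define DIGIT_BITS 32   /* lg(beta) in this sketch */
#define WINSIZE    3    /* fixed window size in this sketch */
#define HALF       (1 << (WINSIZE - 1))

static uint64_t sliding_exptmod(uint64_t g, const uint32_t *x,
                                int used, uint64_t p)
{
    uint64_t M[1 << WINSIZE];
    uint64_t res = 1 % p;
    uint32_t buf = 0;
    int mode = 0, bitcnt = 1, digidx = used - 1;
    int bitcpy = 0, bitbuf = 0, ix, y;

    /* table: M[1] and M[HALF] .. M[2^WINSIZE - 1] */
    M[1] = g % p;
    M[HALF] = M[1];
    for (ix = 0; ix < WINSIZE - 1; ix++)
        M[HALF] = (M[HALF] * M[HALF]) % p;
    for (ix = HALF + 1; ix < (1 << WINSIZE); ix++)
        M[ix] = (M[ix - 1] * M[1]) % p;

    for (;;) {
        if (--bitcnt == 0) {                  /* steps 12.1 and 12.2 */
            if (digidx == -1) break;
            buf = x[digidx--];
            bitcnt = DIGIT_BITS;
        }
        y = (int)((buf >> (DIGIT_BITS - 1)) & 1u);      /* step 12.3 */
        buf <<= 1;                                      /* step 12.4 */

        if (mode == 0 && y == 0) continue;              /* step 12.5 */
        if (mode == 1 && y == 0) {                      /* step 12.6 */
            res = (res * res) % p;
            continue;
        }
        bitcpy++;                                       /* step 12.7 */
        bitbuf |= y << (WINSIZE - bitcpy);              /* step 12.8 */
        mode = 2;                                       /* step 12.9 */
        if (bitcpy == WINSIZE) {                        /* step 12.10 */
            for (ix = 0; ix < WINSIZE; ix++)
                res = (res * res) % p;
            res = (res * M[bitbuf]) % p;
            bitcpy = 0; bitbuf = 0; mode = 1;
        }
    }

    if (mode == 2 && bitcpy > 0) {                      /* step 13 */
        for (ix = 0; ix < bitcpy; ix++) {
            res = (res * res) % p;
            bitbuf <<= 1;
            if (bitbuf & (1 << WINSIZE))
                res = (res * M[1]) % p;
        }
    }
    return res;
}
\end{verbatim}

For the exponent $13 = 1101_2$ the first three bits fill one window ($bitbuf = 6$) and the final bit is handled by the residual loop of step 13.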
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_s_mp_exptmod.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLines @31,if@ through @45,}@ determine the optimal window size based on the length of the exponent in bits.  The window divisions are sorted
*ebfedea0SLionel Sambucfrom smallest to greatest so that in each \textbf{if} statement only one condition must be tested.  For example, by the \textbf{if} statement
on line @37,if@ the bit length of $x$ is already known to be greater than $140$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe conditional piece of code beginning on line @42,ifdef@ allows the window size to be restricted to five bits.  This logic is used to ensure
*ebfedea0SLionel Sambucthe table of precomputed powers of $G$ remains relatively small.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe for loop on line @60,for@ initializes the $M$ array while lines @71,mp_init@ and @75,mp_reduce@ through @85,}@ initialize the reduction
*ebfedea0SLionel Sambucfunction that will be used for this modulus.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc-- More later.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Quick Power of Two}
Calculating $a = 2^b$ can be performed much more quickly than with any of the previous algorithms.  Recall that a logical shift left $m << k$ is
*ebfedea0SLionel Sambucequivalent to $m \cdot 2^k$.  By this logic when $m = 1$ a quick power of two can be achieved.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_2expt}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $a \leftarrow 2^b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $a \leftarrow 0$ \\
*ebfedea0SLionel Sambuc2.  If $a.alloc < \lfloor b / lg(\beta) \rfloor + 1$ then grow $a$ appropriately. \\
*ebfedea0SLionel Sambuc3.  $a.used \leftarrow \lfloor b / lg(\beta) \rfloor + 1$ \\
*ebfedea0SLionel Sambuc4.  $a_{\lfloor b / lg(\beta) \rfloor} \leftarrow 1 << (b \mbox{ mod } lg(\beta))$ \\
*ebfedea0SLionel Sambuc5.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_2expt}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_2expt.}
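
The steps of the figure amount to clearing the number and setting a single bit in the digit at index $\lfloor b / lg(\beta) \rfloor$.  The
sketch below shows this on a plain digit array.  It is not the library routine: the name \texttt{two\_expt} is hypothetical, a digit width of
28 bits is assumed, and a fixed caller supplied buffer replaces the grow operation of step two.

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define DIGIT_BITS 28   /* assumed lg(beta) */

/* Writes 2^b into digits[0..ndigits-1], least significant digit first.
 * Returns the number of digits used, or -1 if b is negative or the
 * buffer is too small. */
static int two_expt(uint32_t *digits, int ndigits, int b)
{
    int used;

    if (b < 0) return -1;
    used = b / DIGIT_BITS + 1;                       /* step 3 */
    if (used > ndigits) return -1;

    memset(digits, 0, (size_t)ndigits * sizeof(*digits));      /* step 1 */
    digits[b / DIGIT_BITS] = (uint32_t)1 << (b % DIGIT_BITS);  /* step 4 */
    return used;
}
\end{verbatim}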
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_2expt.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Higher Level Algorithms}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis chapter discusses the various higher level algorithms that are required to complete a well rounded multiple precision integer package.  These
*ebfedea0SLionel Sambucroutines are less performance oriented than the algorithms of chapters five, six and seven but are no less important.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe first section describes a method of integer division with remainder that is universally well known.  It provides the signed division logic
*ebfedea0SLionel Sambucfor the package.  The subsequent section discusses a set of algorithms which allow a single digit to be the 2nd operand for a variety of operations.
*ebfedea0SLionel SambucThese algorithms serve mostly to simplify other algorithms where small constants are required.  The last two sections discuss how to manipulate
various representations of integers.  For example, converting from an mp\_int to a string of characters.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Integer Division with Remainder}
*ebfedea0SLionel Sambuc\label{sec:division}
*ebfedea0SLionel Sambuc
Integer division, aside from modular exponentiation, is the most intensive algorithm to compute.  Like addition, subtraction and multiplication
*ebfedea0SLionel Sambucthe basis of this algorithm is the long-hand division algorithm taught to school children.  Throughout this discussion several common variables
*ebfedea0SLionel Sambucwill be used.  Let $x$ represent the divisor and $y$ represent the dividend.  Let $q$ represent the integer quotient $\lfloor y / x \rfloor$ and
*ebfedea0SLionel Sambuclet $r$ represent the remainder $r = y - x \lfloor y / x \rfloor$.  The following simple algorithm will be used to start the discussion.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Radix-$\beta$ Integer Division}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   integer $x$ and $y$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $q = \lfloor y/x\rfloor, r = y - xq$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $q \leftarrow 0$ \\
*ebfedea0SLionel Sambuc2.  $n \leftarrow \vert \vert y \vert \vert - \vert \vert x \vert \vert$ \\
*ebfedea0SLionel Sambuc3.  for $t$ from $n$ down to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  Maximize $k$ such that $kx\beta^t$ is less than or equal to $y$ and $(k + 1)x\beta^t$ is greater. \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.2  $q \leftarrow q + k\beta^t$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.3  $y \leftarrow y - kx\beta^t$ \\
*ebfedea0SLionel Sambuc4.  $r \leftarrow y$ \\
*ebfedea0SLionel Sambuc5.  Return($q, r$) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Radix-$\beta$ Integer Division}
*ebfedea0SLionel Sambuc\label{fig:raddiv}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
As children we are taught this very simple algorithm for the case of $\beta = 10$.  Almost instinctively several optimizations are taught whose
reasons for existing are never explained.  For this example let $y = 5471$ represent the dividend and $x = 23$ represent the divisor.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucTo find the first digit of the quotient the value of $k$ must be maximized such that $kx\beta^t$ is less than or equal to $y$ and
*ebfedea0SLionel Sambucsimultaneously $(k + 1)x\beta^t$ is greater than $y$.  Implicitly $k$ is the maximum value the $t$'th digit of the quotient may have.  The habitual method
*ebfedea0SLionel Sambucused to find the maximum is to ``eyeball'' the two numbers, typically only the leading digits and quickly estimate a quotient.  By only using leading
*ebfedea0SLionel Sambucdigits a much simpler division may be used to form an educated guess at what the value must be.  In this case $k = \lfloor 54/23\rfloor = 2$ quickly
arises as a possible solution.  Indeed $2x\beta^2 = 4600$ is less than $y = 5471$ and simultaneously $(k + 1)x\beta^2 = 6900$ is larger than $y$.
As a result $k\beta^2$ is added to the quotient which now equals $q = 200$ and $4600$ is subtracted from $y$ to give a remainder of $y = 871$.

Again this process is repeated to produce the quotient digit $k = 3$ which makes the quotient $q = 200 + 3\beta = 230$ and the remainder
$y = 871 - 3x\beta = 181$.  Finally the last iteration of the loop produces $k = 7$ which leads to the quotient $q = 230 + 7 = 237$ and the
*ebfedea0SLionel Sambucremainder $y = 181 - 7x = 20$.  The final quotient and remainder found are $q = 237$ and $r = y = 20$ which are indeed correct since
*ebfedea0SLionel Sambuc$237 \cdot 23 + 20 = 5471$ is true.
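
The same procedure can be written out on machine integers.  The sketch below follows the steps of algorithm~\ref{fig:raddiv} but is not
library code: the name \texttt{radix\_div} is hypothetical, the divisor is assumed to be non-zero, and the operands are assumed small enough
that none of the products overflow 64 bits.

\begin{verbatim}
#include <stdint.h>

/* Radix-beta long division.  Returns the quotient of y / x and stores
 * the remainder in *r.  Requires x > 0 and beta >= 2. */
static uint64_t radix_div(uint64_t y, uint64_t x, uint64_t beta,
                          uint64_t *r)
{
    uint64_t q = 0, scale = 1;

    /* largest power of beta for which x * scale <= y */
    while (x * scale <= y / beta) scale *= beta;

    for (;;) {
        uint64_t k = 0;
        while ((k + 1) * x * scale <= y) k++;   /* step 3.1 */
        q += k * scale;                         /* step 3.2 */
        y -= k * x * scale;                     /* step 3.3 */
        if (scale == 1) break;
        scale /= beta;
    }
    *r = y;                                     /* step 4 */
    return q;
}
\end{verbatim}

Calling it with $y = 5471$, $x = 23$ and $\beta = 10$ produces the digits $2$, $3$ and $7$ in turn and leaves a remainder of $20$.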
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Quotient Estimation}
*ebfedea0SLionel Sambuc\label{sec:divest}
*ebfedea0SLionel SambucAs alluded to earlier the quotient digit $k$ can be estimated from only the leading digits of both the divisor and dividend.  When $p$ leading
*ebfedea0SLionel Sambucdigits are used from both the divisor and dividend to form an estimation the accuracy of the estimation rises as $p$ grows.  Technically
speaking the estimation is based on assuming the lower $\vert \vert y \vert \vert - p$ digits of the dividend and the lower
$\vert \vert x \vert \vert - p$ digits of the divisor are zero.
*ebfedea0SLionel Sambuc
The value of the estimation may be off by a few values in either direction and in general is fairly correct.  A simplification \cite[pp. 271]{TAOCPV2}
of the estimation technique is to use $t + 1$ digits of the dividend and $t$ digits of the divisor, in particular when $t = 1$.  The estimate
using this technique is never too small.  For the following proof let $t = \vert \vert y \vert \vert - 1$ and $s = \vert \vert x \vert \vert - 1$
be the indices of the most significant digits of the dividend and divisor respectively.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Proof.}\textit{  The quotient $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ is greater than or equal to
*ebfedea0SLionel Sambuc$k = \lfloor y / (x \cdot \beta^{\vert \vert y \vert \vert - \vert \vert x \vert \vert - 1}) \rfloor$. }
*ebfedea0SLionel SambucThe first obvious case is when $\hat k = \beta - 1$ in which case the proof is concluded since the real quotient cannot be larger.  For all other
cases $\hat k = \lfloor (y_t\beta + y_{t-1}) / x_s \rfloor$ and $\hat k x_s \ge y_t\beta + y_{t-1} - x_s + 1$.  The latter portion of the inequality
$-x_s + 1$ arises from the fact that the remainder of a truncated integer division by $x_s$ is at most $x_s - 1$.  Next a series of
*ebfedea0SLionel Sambucinequalities will prove the hypothesis.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucy - \hat k x \le y - \hat k x_s\beta^s
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
This is trivially true since $x \ge x_s\beta^s$.  Next we replace $\hat kx_s\beta^s$ by the previous inequality for $\hat kx_s$ scaled by $\beta^s$,
where $s = t - 1$ since the dividend has one more digit than the divisor.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
y - \hat k x \le y_t\beta^t + \ldots + y_0 - (y_t\beta^t + y_{t-1}\beta^{t-1} - x_s\beta^s + \beta^s)
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy simplifying the previous inequality the following inequality is formed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucy - \hat k x \le y_{t-2}\beta^{t-2} + \ldots + y_0 + x_s\beta^s - \beta^s
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSubsequently,
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucy_{t-2}\beta^{t-2} + \ldots +  y_0  + x_s\beta^s - \beta^s < x_s\beta^s \le x
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
Which proves that $y - \hat kx < x$ and by consequence $\hat k \ge k$ which concludes the proof.  \textbf{QED}
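
As a small numeric illustration, which is not part of the proof, take $\beta = 10$ with the dividend $y = 181$ and the divisor $x = 23$ from the
last iteration of the earlier long division example.  Here $t = 2$, $s = 1$, $y_t = 1$, $y_{t-1} = 8$ and $x_s = 2$.

\begin{equation}
\hat k = \lfloor (1 \cdot 10 + 8) / 2 \rfloor = 9 \ge k = \lfloor 181 / 23 \rfloor = 7
\end{equation}

The estimate is not too small, as the proof requires, but it is too large by two because the leading digit of the divisor is small compared to $\beta$.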
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Normalized Integers}
For the purposes of division a normalized input is when the divisor's leading digit $x_n$ is greater than or equal to $\beta / 2$.  By multiplying both
*ebfedea0SLionel Sambuc$x$ and $y$ by $j = \lfloor (\beta / 2) / x_n \rfloor$ the quotient remains unchanged and the remainder is simply $j$ times the original
*ebfedea0SLionel Sambucremainder.  The purpose of normalization is to ensure the leading digit of the divisor is sufficiently large such that the estimated quotient will
*ebfedea0SLionel Sambuclie in the domain of a single digit.  Consider the maximum dividend $(\beta - 1) \cdot \beta + (\beta - 1)$ and the minimum divisor $\beta / 2$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
{{\beta^2 - 1} \over { \beta / 2}} = 2\beta - {2 \over \beta}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt most the quotient approaches $2\beta$, however, in practice this will not occur since that would imply the previous quotient digit was too small.
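
When $\beta$ is a power of two the normalization can be carried out with a shift instead of a multiplication.  The sketch below only
illustrates the definition of a normalized leading digit and is not taken from the library: the name \texttt{norm\_shift} is hypothetical, a
digit width of 28 bits is assumed, and the leading digit passed in must be non-zero and less than $\beta$.

\begin{verbatim}
#include <stdint.h>

#define DIGIT_BITS 28                          /* assumed lg(beta) */
#define BETA ((uint32_t)1 << DIGIT_BITS)

/* Number of left shifts needed so that a non-zero leading digit
 * becomes greater than or equal to beta / 2. */
static int norm_shift(uint32_t lead)
{
    int norm = 0;

    while (lead < BETA / 2) {
        lead <<= 1;
        norm++;
    }
    return norm;
}
\end{verbatim}

Shifting both the divisor and the dividend left by the returned count multiplies them by the same power of two, so the quotient is unchanged
and the remainder is scaled by that power of two.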
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Radix-$\beta$ Division with Remainder}
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_div}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a, b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = \lfloor a/b \rfloor$, $d = a - bc$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b = 0$ return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc2.  If $\vert a \vert < \vert b \vert$ then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $d \leftarrow a$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  $c \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.3  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucSetup the quotient to receive the digits. \\
*ebfedea0SLionel Sambuc3.  Grow $q$ to $a.used + 2$ digits. \\
*ebfedea0SLionel Sambuc4.  $q \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  $x \leftarrow \vert a \vert , y \leftarrow \vert b \vert$ \\
*ebfedea0SLionel Sambuc6.  $sign \leftarrow  \left \lbrace \begin{array}{ll}
*ebfedea0SLionel Sambuc                              MP\_ZPOS &  \mbox{if }a.sign = b.sign \\
*ebfedea0SLionel Sambuc                              MP\_NEG  &  \mbox{otherwise} \\
*ebfedea0SLionel Sambuc                              \end{array} \right .$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucNormalize the inputs such that the leading digit of $y$ is greater than or equal to $\beta / 2$. \\
*ebfedea0SLionel Sambuc7.  $norm \leftarrow (lg(\beta) - 1) - (\lceil lg(y) \rceil \mbox{ (mod }lg(\beta)\mbox{)})$ \\
*ebfedea0SLionel Sambuc8.  $x \leftarrow x \cdot 2^{norm}, y \leftarrow y \cdot 2^{norm}$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucFind the leading digit of the quotient. \\
*ebfedea0SLionel Sambuc9.  $n \leftarrow x.used - 1, t \leftarrow y.used - 1$ \\
*ebfedea0SLionel Sambuc10.  $y \leftarrow y \cdot \beta^{n - t}$ \\
*ebfedea0SLionel Sambuc11.  While ($x \ge y$) do \\
*ebfedea0SLionel Sambuc\hspace{3mm}11.1  $q_{n - t} \leftarrow q_{n - t} + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}11.2  $x \leftarrow x - y$ \\
*ebfedea0SLionel Sambuc12.  $y \leftarrow \lfloor y / \beta^{n-t} \rfloor$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucContinued on the next page. \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_div}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_div} (continued). \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a, b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = \lfloor a/b \rfloor$, $d = a - bc$ \\
*ebfedea0SLionel Sambuc\hline \\
Now find the remainder of the digits. \\
*ebfedea0SLionel Sambuc13.  for $i$ from $n$ down to $(t + 1)$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.1  If $i > x.used$ then jump to the next iteration of this loop. \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.2  If $x_{i} = y_{t}$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.2.1  $q_{i - t - 1} \leftarrow \beta - 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.3  else \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.3.1  $\hat r \leftarrow x_{i} \cdot \beta + x_{i - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.3.2  $\hat r \leftarrow \lfloor \hat r / y_{t} \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.3.3  $q_{i - t - 1} \leftarrow \hat r$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.4  $q_{i - t - 1} \leftarrow q_{i - t - 1} + 1$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucFixup quotient estimation. \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.5  Loop \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.1  $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.2  t$1 \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.3  t$1_0 \leftarrow y_{t - 1}, $ t$1_1 \leftarrow y_t,$ t$1.used \leftarrow 2$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.4  $t1 \leftarrow t1 \cdot q_{i - t - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.5  t$2_0 \leftarrow x_{i - 2}, $ t$2_1 \leftarrow x_{i - 1}, $ t$2_2 \leftarrow x_i, $ t$2.used \leftarrow 3$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.5.6  If $\vert t1 \vert > \vert t2 \vert$ then goto step 13.5. \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.6  t$1 \leftarrow y \cdot q_{i - t - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.7  t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.8  $x \leftarrow x - $ t$1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.9  If $x.sign = MP\_NEG$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.10  t$1 \leftarrow y$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.11  t$1 \leftarrow $ t$1 \cdot \beta^{i - t - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.12  $x \leftarrow x + $ t$1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}13.13  $q_{i - t - 1} \leftarrow q_{i - t - 1} - 1$ \\
*ebfedea0SLionel Sambuc\\
*ebfedea0SLionel SambucFinalize the result. \\
*ebfedea0SLionel Sambuc14.  Clamp excess digits of $q$ \\
*ebfedea0SLionel Sambuc15.  $c \leftarrow q, c.sign \leftarrow sign$ \\
*ebfedea0SLionel Sambuc16.  $x.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc17.  $d \leftarrow \lfloor x / 2^{norm} \rfloor$ \\
*ebfedea0SLionel Sambuc18.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_div (continued)}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_div.}
This algorithm calculates the quotient and remainder of an integer division given a dividend and divisor.  The algorithm is a signed
*ebfedea0SLionel Sambucdivision and will produce a fully qualified quotient and remainder.
*ebfedea0SLionel Sambuc
First the divisor $b$ must be non-zero which is enforced in step one.  If the divisor is larger than the dividend then the quotient is implicitly
*ebfedea0SLionel Sambuczero and the remainder is the dividend.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter the first two trivial cases of inputs are handled the variable $q$ is setup to receive the digits of the quotient.  Two unsigned copies of the
*ebfedea0SLionel Sambucdivisor $y$ and dividend $x$ are made as well.  The core of the division algorithm is an unsigned division and will only work if the values are
*ebfedea0SLionel Sambucpositive.  Now the two values $x$ and $y$ must be normalized such that the leading digit of $y$ is greater than or equal to $\beta / 2$.
*ebfedea0SLionel SambucThis is performed by shifting both to the left by enough bits to get the desired normalization.
*ebfedea0SLionel Sambuc
At this point the division algorithm can begin producing digits of the quotient.  Recall that the maximum value of the estimation used is
*ebfedea0SLionel Sambuc$2\beta - {2 \over \beta}$ which means that a digit of the quotient must be first produced by another means.  In this case $y$ is shifted
*ebfedea0SLionel Sambucto the left (\textit{step ten}) so that it has the same number of digits as $x$.  The loop on step eleven will subtract multiples of the
*ebfedea0SLionel Sambucshifted copy of $y$ until $x$ is smaller.  Since the leading digit of $y$ is greater than or equal to $\beta/2$ this loop will iterate at most two
*ebfedea0SLionel Sambuctimes to produce the desired leading digit of the quotient.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucNow the remainder of the digits can be produced.  The equation $\hat q = \lfloor {{x_i \beta + x_{i-1}}\over y_t} \rfloor$ is used to fairly
accurately approximate the true quotient digit.  The estimate can in theory be as high as $2\beta - {2 \over \beta}$ but by
*ebfedea0SLionel Sambucinduction the upper quotient digit is correct (\textit{as established on step eleven}) and the estimate must be less than $\beta$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRecall from section~\ref{sec:divest} that the estimation is never too low but may be too high.  The next step of the estimation process is
*ebfedea0SLionel Sambucto refine the estimation.  The loop on step 13.5 uses $x_i\beta^2 + x_{i-1}\beta + x_{i-2}$ and $q_{i - t - 1}(y_t\beta + y_{t-1})$ as a higher
*ebfedea0SLionel Sambucorder approximation to adjust the quotient digit.
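
The two estimation phases described above can be sketched on plain machine words.  The function below is only an illustration and not the code of mp\_div: the name \texttt{estimate\_digit}, the restriction $\beta \le 2^{16}$ (so that every intermediate fits in 64 bits) and the cap of $\beta - 1$ when $x_i = y_t$ (the usual textbook treatment) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate and refine one quotient digit in radix beta (2 <= beta <= 65536).
 * xi, xi1, xi2 are the digits x_i, x_{i-1}, x_{i-2} of the dividend and
 * yt, yt1 are the digits y_t, y_{t-1} of the normalized divisor.
 * All digits are assumed to be smaller than beta. */
uint64_t estimate_digit(uint64_t beta, uint64_t xi, uint64_t xi1, uint64_t xi2,
                        uint64_t yt, uint64_t yt1)
{
    uint64_t q;

    assert(beta >= 2 && beta <= 65536 && yt != 0 && xi <= yt);

    /* first phase: two digits of x divided by the leading digit of y */
    if (xi == yt) {
        q = beta - 1;                    /* the estimate would reach beta, so cap it */
    } else {
        q = (xi * beta + xi1) / yt;
    }

    /* second phase: compare q * (two digits of y) against three digits of x */
    while (q * (yt * beta + yt1) > (xi * beta + xi1) * beta + xi2) {
        q--;
    }
    return q;
}
```

For instance, with $\beta = 10$, $x = 721$ and $y = 84$ the first phase gives $\lfloor 72 / 8 \rfloor = 9$ and the second phase lowers it to $8$, since $9 \cdot 84 = 756 > 721$.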
*ebfedea0SLionel Sambuc
After both phases of estimation the quotient digit may still be off by a value of one\footnote{This is similar to the error introduced
by optimizing Barrett reduction.}.  Steps 13.6 and 13.7 subtract the multiple of the divisor from the dividend (\textit{similar to step 3.3 of
algorithm~\ref{fig:raddiv}}) and then subsequently add a multiple of the divisor if the quotient was too large.
*ebfedea0SLionel Sambuc
Now that the quotient has been determined, finalizing the result is a matter of clamping the quotient, fixing the sizes and de-normalizing the
remainder.  An important aspect of this algorithm seemingly overlooked in other descriptions such as that of Algorithm 14.20 HAC \cite[p. 598]{HAC}
is that when the estimations are being made (\textit{inside the loop on step 13.5}) the digits $y_{t-1}$, $x_{i-2}$ and $x_{i-1}$ may lie
outside their respective boundaries.  For example, if $t = 0$ or $i \le 1$ then the digits would be undefined.  In those cases the missing digits
should be replaced with zero.
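
One way to picture the boundary rule above is a small accessor that returns zero for any digit index outside the number.  This is a hypothetical helper written for this text (the name \texttt{digit\_or\_zero} and the plain array representation are assumptions); the library itself handles the cases inline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return digit idx of the number stored in dp[0..used-1], or zero when
 * idx lies below digit 0 or above the most significant digit. */
uint32_t digit_or_zero(const uint32_t *dp, size_t used, long idx)
{
    assert(dp != NULL);
    if (idx < 0 || (size_t)idx >= used) {
        return 0;
    }
    return dp[idx];
}
```

With such a helper, a request for $y_{t-1}$ when $t = 0$ simply becomes \texttt{digit\_or\_zero(y, used, -1)} and yields zero.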
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_div.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe implementation of this algorithm differs slightly from the pseudo code presented previously.  In this algorithm either of the quotient $c$ or
*ebfedea0SLionel Sambucremainder $d$ may be passed as a \textbf{NULL} pointer which indicates their value is not desired.  For example, the C code to call the division
*ebfedea0SLionel Sambucalgorithm with only the quotient is
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{verbatim}
*ebfedea0SLionel Sambucmp_div(&a, &b, &c, NULL);  /* c = [a/b] */
*ebfedea0SLionel Sambuc\end{verbatim}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLines @108,if@ and @113,if@ handle the two trivial cases of inputs which are division by zero and dividend smaller than the divisor
*ebfedea0SLionel Sambucrespectively.  After the two trivial cases all of the temporary variables are initialized.  Line @147,neg@ determines the sign of
*ebfedea0SLionel Sambucthe quotient and line @148,sign@ ensures that both $x$ and $y$ are positive.
*ebfedea0SLionel Sambuc
The number of bits in the leading digit is calculated on line @151,norm@.  Implicitly an mp\_int with $r$ digits will require $lg(\beta)(r-1) + k$ bits
*ebfedea0SLionel Sambucof precision which when reduced modulo $lg(\beta)$ produces the value of $k$.  In this case $k$ is the number of bits in the leading digit which is
*ebfedea0SLionel Sambucexactly what is required.  For the algorithm to operate $k$ must equal $lg(\beta) - 1$ and when it does not the inputs must be normalized by shifting
*ebfedea0SLionel Sambucthem to the left by $lg(\beta) - 1 - k$ bits.
*ebfedea0SLionel Sambuc
Throughout, the variables $n$ and $t$ will represent the index of the highest digit of $x$ and $y$ respectively.  These are first used to produce the
*ebfedea0SLionel Sambucleading digit of the quotient.  The loop beginning on line @184,for@ will produce the remainder of the quotient digits.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe conditional ``continue'' on line @186,continue@ is used to prevent the algorithm from reading past the leading edge of $x$ which can occur when the
*ebfedea0SLionel Sambucalgorithm eliminates multiple non-zero digits in a single iteration.  This ensures that $x_i$ is always non-zero since by definition the digits
above the $i$'th position of $x$ must be zero in order for the quotient to be precise\footnote{Precise as far as integer division is concerned.}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLines @214,t1@, @216,t1@ and @222,t2@ through @225,t2@ manually construct the high accuracy estimations by setting the digits of the two mp\_int
*ebfedea0SLionel Sambucvariables directly.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Single Digit Helpers}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis section briefly describes a series of single digit helper algorithms which come in handy when working with small constants.  All of
the helper functions assume the single digit input is positive and will treat it as such.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Single Digit Addition and Subtraction}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBoth addition and subtraction are performed by ``cheating'' and using mp\_set followed by the higher level addition or subtraction
algorithms.   As a result these algorithms are substantially simpler with a slight cost in performance.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_add\_d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and a mp\_digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = a + b$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $t \leftarrow b$ (\textit{mp\_set}) \\
*ebfedea0SLionel Sambuc2.  $c \leftarrow a + t$ \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_add\_d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_add\_d.}
This algorithm initializes a temporary mp\_int with the value of the single digit and uses algorithm mp\_add to add the two values together.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_add_d.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucClever use of the letter 't'.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Subtraction}
*ebfedea0SLionel SambucThe single digit subtraction algorithm mp\_sub\_d is essentially the same except it uses mp\_sub to subtract the digit from the mp\_int.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Single Digit Multiplication}
Single digit multiplication arises enough in division and radix conversion that it ought to be implemented as a special case of the baseline
*ebfedea0SLionel Sambucmultiplication algorithm.  Essentially this algorithm is a modified version of algorithm s\_mp\_mul\_digs where one of the multiplicands
*ebfedea0SLionel Sambuconly has one digit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_mul\_d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and a mp\_digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = ab$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $pa \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc2.  Grow $c$ to at least $pa + 1$ digits. \\
*ebfedea0SLionel Sambuc3.  $oldused \leftarrow c.used$ \\
*ebfedea0SLionel Sambuc4.  $c.used \leftarrow pa + 1$ \\
*ebfedea0SLionel Sambuc5.  $c.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc6.  $\mu \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  for $ix$ from $0$ to $pa - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $\hat r \leftarrow \mu + a_{ix}b$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  $c_{ix} \leftarrow \hat r \mbox{ (mod }\beta\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  $\mu \leftarrow \lfloor \hat r / \beta \rfloor$ \\
*ebfedea0SLionel Sambuc8.  $c_{pa} \leftarrow \mu$ \\
9.  for $ix$ from $pa + 1$ to $oldused - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  $c_{ix} \leftarrow 0$ \\
*ebfedea0SLionel Sambuc10.  Clamp excess digits of $c$. \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_mul\_d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_mul\_d.}
This algorithm quickly multiplies an mp\_int by a small single digit value.  It is specially tailored to the job and has minimal overhead.
Unlike the full multiplication algorithms this algorithm does not require any significant temporary storage or memory allocations.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_mul_d.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn this implementation the destination $c$ may point to the same mp\_int as the source $a$ since the result is written after the digit is
*ebfedea0SLionel Sambucread from the source.  This function uses pointer aliases $tmpa$ and $tmpc$ for the digits of $a$ and $c$ respectively.
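
The carry loop of steps 6 through 8 can be sketched on a bare digit array.  The code below is a minimal illustration, not the library source: it assumes 28-bit digits stored in 32-bit words, and the names \texttt{mul\_d} and \texttt{DIGIT\_MASK} are local to this sketch.  Growing, sign handling and clamping are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1u)

/* c[0..n] = a[0..n-1] * b.  c must have room for n + 1 digits and may be
 * the same array as a, because a[ix] is read before c[ix] is written. */
void mul_d(const uint32_t *a, size_t n, uint32_t b, uint32_t *c)
{
    uint64_t mu = 0;                               /* carry between columns */
    size_t   ix;

    assert(a != NULL && c != NULL && b <= DIGIT_MASK);

    for (ix = 0; ix < n; ix++) {
        uint64_t r = mu + (uint64_t)a[ix] * b;     /* step 7.1 */
        c[ix] = (uint32_t)(r & DIGIT_MASK);        /* step 7.2: r mod beta */
        mu    = r >> DIGIT_BIT;                    /* step 7.3: floor(r / beta) */
    }
    c[n] = (uint32_t)mu;                           /* step 8 */
}
```

Since the product of two 28-bit digits plus a carry fits comfortably in 64 bits, no intermediate value can overflow in this setting.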
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Single Digit Division}
*ebfedea0SLionel SambucLike the single digit multiplication algorithm, single digit division is also a fairly common algorithm used in radix conversion.  Since the
*ebfedea0SLionel Sambucdivisor is only a single digit a specialized variant of the division algorithm can be used to compute the quotient.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_div\_d}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and a mp\_digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = \lfloor a / b \rfloor, d = a - cb$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b = 0$ then return(\textit{MP\_VAL}).\\
*ebfedea0SLionel Sambuc2.  If $b = 3$ then use algorithm mp\_div\_3 instead. \\
*ebfedea0SLionel Sambuc3.  Init $q$ to $a.used$ digits.  \\
*ebfedea0SLionel Sambuc4.  $q.used \leftarrow a.used$ \\
*ebfedea0SLionel Sambuc5.  $q.sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc6.  $\hat w \leftarrow 0$ \\
*ebfedea0SLionel Sambuc7.  for $ix$ from $a.used - 1$ down to $0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $\hat w \leftarrow \hat w \beta + a_{ix}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  If $\hat w \ge b$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.2.1  $t \leftarrow \lfloor \hat w / b \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.2.2  $\hat w \leftarrow \hat w \mbox{ (mod }b\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  else\\
*ebfedea0SLionel Sambuc\hspace{6mm}7.3.1  $t \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.4  $q_{ix} \leftarrow t$ \\
*ebfedea0SLionel Sambuc8.  $d \leftarrow \hat w$ \\
*ebfedea0SLionel Sambuc9.  Clamp excess digits of $q$. \\
*ebfedea0SLionel Sambuc10.  $c \leftarrow q$ \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_div\_d}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_div\_d.}
*ebfedea0SLionel SambucThis algorithm divides the mp\_int $a$ by the single mp\_digit $b$ using an optimized approach.  Essentially in every iteration of the
*ebfedea0SLionel Sambucalgorithm another digit of the dividend is reduced and another digit of quotient produced.  Provided $b < \beta$ the value of $\hat w$
*ebfedea0SLionel Sambucafter step 7.1 will be limited such that $0 \le \lfloor \hat w / b \rfloor < \beta$.
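
A minimal sketch of the main loop (step 7) on a bare digit array follows.  It assumes 28-bit digits held in 32-bit words and a 64-bit accumulator for $\hat w$; the name \texttt{div\_d} and the array interface are assumptions of this illustration, and sign handling and clamping are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1u)

/* q[0..n-1] = floor(a / b), returns a mod b.  q may alias a. */
uint32_t div_d(const uint32_t *a, size_t n, uint32_t b, uint32_t *q)
{
    uint64_t w = 0;                      /* running remainder, always < b before the shift */
    size_t   ix;

    assert(a != NULL && q != NULL);
    assert(b != 0 && b <= DIGIT_MASK);

    for (ix = n; ix-- > 0; ) {           /* from the most significant digit down */
        w = (w << DIGIT_BIT) | a[ix];    /* step 7.1: w = w * beta + a_ix */
        if (w >= b) {
            q[ix] = (uint32_t)(w / b);   /* step 7.2.1, the result is < beta */
            w    %= b;                   /* step 7.2.2 */
        } else {
            q[ix] = 0;                   /* step 7.3.1 */
        }
    }
    return (uint32_t)w;                  /* step 8: the remainder */
}
```

Passing the same array for \texttt{a} and \texttt{q} divides in place, since each source digit is consumed before the matching quotient digit is stored.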
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIf the divisor $b$ is equal to three a variant of this algorithm is used which is called mp\_div\_3.  It replaces the division by three with
*ebfedea0SLionel Sambuca multiplication by $\lfloor \beta / 3 \rfloor$ and the appropriate shift and residual fixup.  In essence it is much like the Barrett reduction
*ebfedea0SLionel Sambucfrom chapter seven.
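
The multiply, shift and fixup idea can be sketched as follows.  This is an illustration under the assumption of 28-bit digits and a 64-bit accumulator, with the hypothetical name \texttt{div\_3}; it is not a copy of the library routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIGIT_BIT 28

/* q[0..n-1] = floor(a / 3), returns a mod 3.  q may alias a. */
uint32_t div_3(const uint32_t *a, size_t n, uint32_t *q)
{
    const uint64_t b = (((uint64_t)1) << DIGIT_BIT) / 3;   /* floor(beta / 3) */
    uint64_t w = 0, t;
    size_t   ix;

    assert(a != NULL && q != NULL);

    for (ix = n; ix-- > 0; ) {
        w = (w << DIGIT_BIT) | a[ix];
        if (w >= 3) {
            t  = (w * b) >> DIGIT_BIT;   /* multiply and shift, never above floor(w / 3) */
            w -= 3 * t;                  /* residual left by the low estimate */
            while (w >= 3) {             /* fixup: at most a few iterations */
                t++;
                w -= 3;
            }
        } else {
            t = 0;
        }
        q[ix] = (uint32_t)t;
    }
    return (uint32_t)w;
}
```

Because $\lfloor \beta / 3 \rfloor \le \beta / 3$ the estimate can only be too small, which is why the fixup loop only ever adds to the quotient digit.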
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_div_d.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLike the implementation of algorithm mp\_div this algorithm allows either of the quotient or remainder to be passed as a \textbf{NULL} pointer to
*ebfedea0SLionel Sambucindicate the respective value is not required.  This allows a trivial single digit modular reduction algorithm, mp\_mod\_d to be created.
*ebfedea0SLionel Sambuc
The division and remainder on lines @44,/@ and @45,%@ can often be replaced by a single division on most processors.  For example, the 32-bit x86 based
*ebfedea0SLionel Sambucprocessors can divide a 64-bit quantity by a 32-bit quantity and produce the quotient and remainder simultaneously.  Unfortunately the GCC
*ebfedea0SLionel Sambuccompiler does not recognize that optimization and will actually produce two function calls to find the quotient and remainder respectively.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Single Digit Root Extraction}
*ebfedea0SLionel Sambuc
Finding the $n$'th root of an integer is fairly easy as far as numerical analysis is concerned.  Algorithms such as the Newton-Raphson iteration
(\ref{eqn:newton}) will typically converge very quickly to a root of a differentiable function $f(x)$, provided the starting point is reasonable.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucx_{i+1} = x_i - {f(x_i) \over f'(x_i)}
*ebfedea0SLionel Sambuc\label{eqn:newton}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucIn this case the $n$'th root is desired and $f(x) = x^n - a$ where $a$ is the integer of which the root is desired.  The derivative of $f(x)$ is
simply $f'(x) = nx^{n - 1}$.  Of particular importance is that this algorithm will be used over the integers, not over a continuous domain
such as the real numbers.  As a result the root found can be above the true root by a few units and must be manually adjusted.  Ideally at the end of the
algorithm the $n$'th root $b$ of an integer $a$ is desired such that $b^n \le a$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_n\_root}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and a mp\_digit $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c^b \le a$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b$ is even and $a.sign = MP\_NEG$ return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc2.  $sign \leftarrow a.sign$ \\
*ebfedea0SLionel Sambuc3.  $a.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc4.  t$2 \leftarrow 2$ \\
*ebfedea0SLionel Sambuc5.  Loop \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  t$1 \leftarrow $ t$2$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  t$3 \leftarrow $ t$1^{b - 1}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  t$2 \leftarrow $ t$3 $ $\cdot$ t$1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.4  t$2 \leftarrow $ t$2 - a$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.5  t$3 \leftarrow $ t$3 \cdot b$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.6  t$3 \leftarrow \lfloor $t$2 / $t$3 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.7  t$2 \leftarrow $ t$1 - $ t$3$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.8  If t$1 \ne $ t$2$ then goto step 5.  \\
*ebfedea0SLionel Sambuc6.  Loop \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  t$2 \leftarrow $ t$1^b$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  If t$2 > a$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.1  t$1 \leftarrow $ t$1 - 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.2  Goto step 6. \\
*ebfedea0SLionel Sambuc7.  $a.sign \leftarrow sign$ \\
*ebfedea0SLionel Sambuc8.  $c \leftarrow $ t$1$ \\
*ebfedea0SLionel Sambuc9.  $c.sign \leftarrow sign$  \\
*ebfedea0SLionel Sambuc10.  Return(\textit{MP\_OKAY}).  \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_n\_root}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_n\_root.}
*ebfedea0SLionel SambucThis algorithm finds the integer $n$'th root of an input using the Newton-Raphson approach.  It is partially optimized based on the observation
*ebfedea0SLionel Sambucthat the numerator of ${f(x) \over f'(x)}$ can be derived from a partial denominator.  That is at first the denominator is calculated by finding
*ebfedea0SLionel Sambuc$x^{b - 1}$.  This value can then be multiplied by $x$ and have $a$ subtracted from it to find the numerator.  This saves a total of $b - 1$
*ebfedea0SLionel Sambucmultiplications by t$1$ inside the loop.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe initial value of the approximation is t$2 = 2$ which allows the algorithm to start with very small values and quickly converge on the
*ebfedea0SLionel Sambucroot.  Ideally this algorithm is meant to find the $n$'th root of an input where $n$ is bounded by $2 \le n \le 5$.
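
The reuse of the partial denominator and the final downward adjustment can be illustrated on machine words.  The sketch below deliberately departs from the figure in one respect: instead of starting at $2$ it starts from a power of two that is known to be at or above the root, so that $x^b - a$ never becomes negative and every intermediate fits in 64 bits.  That starting value, the 32-bit input and the names \texttt{ipow\_u64} and \texttt{nth\_root\_u32} are assumptions of this sketch and not the library's method.

```c
#include <assert.h>
#include <stdint.h>

/* x^e by repeated multiplication; the caller keeps the result below 2^64. */
uint64_t ipow_u64(uint64_t x, unsigned e)
{
    uint64_t r = 1;
    while (e-- > 0) {
        r *= x;
    }
    return r;
}

/* Largest c with c^b <= a, for 2 <= b <= 5. */
uint32_t nth_root_u32(uint32_t a, unsigned b)
{
    unsigned bits = 0;
    uint64_t x, prev;

    assert(b >= 2 && b <= 5);

    while (bits < 32 && (a >> bits) != 0) {
        bits++;                                   /* bit length of a */
    }
    x = ((uint64_t)1) << ((bits + b - 1) / b);    /* x^b >= 2^bits > a */

    do {
        uint64_t pw  = ipow_u64(x, b - 1);        /* partial denominator x^(b-1) */
        uint64_t num = pw * x - a;                /* f(x)  = x^b - a, reusing pw */
        uint64_t den = pw * b;                    /* f'(x) = b * x^(b-1) */
        prev = x;
        x    = prev - num / den;                  /* Newton step with a floored quotient */
    } while (x != prev);

    while (ipow_u64(x, b) > a) {                  /* manual adjustment from above */
        x--;
    }
    return (uint32_t)x;
}
```

For $a = 1000$ and $b = 3$ the iterates are $16, 12, 11$, at which point the floored step is zero; the adjustment loop then lowers the result to $10$.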
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_n_root.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Random Number Generation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucRandom numbers come up in a variety of activities from public key cryptography to simple simulations and various randomized algorithms.  Pollard-Rho
factoring, for example, can make use of random values as starting points to find factors of a composite integer.  In this case the algorithm presented
*ebfedea0SLionel Sambucis solely for simulations and not intended for cryptographic use.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_rand}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   An integer $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  A pseudo-random number of $b$ digits \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $a \leftarrow 0$ \\
*ebfedea0SLionel Sambuc2.  If $b \le 0$ return(\textit{MP\_OKAY}) \\
*ebfedea0SLionel Sambuc3.  Pick a non-zero random digit $d$. \\
*ebfedea0SLionel Sambuc4.  $a \leftarrow a + d$ \\
5.  for $ix$ from 1 to $b - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $a \leftarrow a \cdot \beta$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  Pick a random digit $d$. \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $a \leftarrow a + d$ \\
*ebfedea0SLionel Sambuc6.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_rand}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_rand.}
*ebfedea0SLionel SambucThis algorithm produces a pseudo-random integer of $b$ digits.  By ensuring that the first digit is non-zero the algorithm also guarantees that the
final result has at least $b$ digits.  It relies heavily on a third-party random number generator which should ideally generate uniformly all of
*ebfedea0SLionel Sambucthe integers from $0$ to $\beta - 1$.
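
As a rough sketch, the same idea can be written against the standard C \texttt{rand()} function.  The helper names, the 28-bit digit size and the combination of two \texttt{rand()} calls per digit (since \texttt{RAND\_MAX} may be as small as $32767$) are assumptions of this illustration; like the algorithm above it is not suitable for cryptographic use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DIGIT_BIT  28
#define DIGIT_MASK ((((uint32_t)1) << DIGIT_BIT) - 1u)

/* One pseudo-random digit in the range 0 .. beta - 1. */
uint32_t rand_digit(void)
{
    /* two calls are mixed because rand() may deliver only 15 bits */
    return ((((uint32_t)rand()) << 14) ^ (uint32_t)rand()) & DIGIT_MASK;
}

/* Fill a[0..b-1] (least significant digit first) with pseudo-random
 * digits such that the leading digit a[b-1] is non-zero. */
void rand_digits(uint32_t *a, size_t b)
{
    size_t ix;

    assert(a != NULL || b == 0);
    if (b == 0) {
        return;                          /* step 2: nothing to produce */
    }
    do {                                 /* step 3: non-zero leading digit */
        a[b - 1] = rand_digit();
    } while (a[b - 1] == 0);

    for (ix = b - 1; ix-- > 0; ) {       /* step 5: the remaining b - 1 digits */
        a[ix] = rand_digit();
    }
}
```

Seeding with \texttt{srand()} before the first call is left to the caller.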
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_rand.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Formatted Representations}
*ebfedea0SLionel SambucThe ability to emit a radix-$n$ textual representation of an integer is useful for interacting with human parties.  For example, the ability to
*ebfedea0SLionel Sambucbe given a string of characters such as ``114585'' and turn it into the radix-$\beta$ equivalent would make it easier to enter numbers
*ebfedea0SLionel Sambucinto a program.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Reading Radix-n Input}
For the purposes of this text we will assume that a simple lower ASCII map (\ref{fig:ASC}) is used to map the values from $0$ to $63$ to
printable characters.  For example, when the character ``N'' is read it represents the integer $23$.  The first $16$ characters of the
map are for the common representations up to hexadecimal.  Taken as a whole the map uses the same $64$ characters as the ``base64'' encoding scheme,
although in a different order, and they are suitably chosen such that they are printable.  While outputting as base64 may not be too helpful for
human operators it does allow communication via non-binary media.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[here]
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{cc|cc|cc|cc}
*ebfedea0SLionel Sambuc\hline \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} & \textbf{Value} & \textbf{Char} &  \textbf{Value} & \textbf{Char} \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc0 & 0 & 1 & 1 & 2 & 2 & 3 & 3 \\
*ebfedea0SLionel Sambuc4 & 4 & 5 & 5 & 6 & 6 & 7 & 7 \\
*ebfedea0SLionel Sambuc8 & 8 & 9 & 9 & 10 & A & 11 & B \\
*ebfedea0SLionel Sambuc12 & C & 13 & D & 14 & E & 15 & F \\
*ebfedea0SLionel Sambuc16 & G & 17 & H & 18 & I & 19 & J \\
*ebfedea0SLionel Sambuc20 & K & 21 & L & 22 & M & 23 & N \\
*ebfedea0SLionel Sambuc24 & O & 25 & P & 26 & Q & 27 & R \\
*ebfedea0SLionel Sambuc28 & S & 29 & T & 30 & U & 31 & V \\
*ebfedea0SLionel Sambuc32 & W & 33 & X & 34 & Y & 35 & Z \\
*ebfedea0SLionel Sambuc36 & a & 37 & b & 38 & c & 39 & d \\
*ebfedea0SLionel Sambuc40 & e & 41 & f & 42 & g & 43 & h \\
*ebfedea0SLionel Sambuc44 & i & 45 & j & 46 & k & 47 & l \\
*ebfedea0SLionel Sambuc48 & m & 49 & n & 50 & o & 51 & p \\
*ebfedea0SLionel Sambuc52 & q & 53 & r & 54 & s & 55 & t \\
*ebfedea0SLionel Sambuc56 & u & 57 & v & 58 & w & 59 & x \\
*ebfedea0SLionel Sambuc60 & y & 61 & z & 62 & $+$ & 63 & $/$ \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Lower ASCII Map}
*ebfedea0SLionel Sambuc\label{fig:ASC}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_read\_radix}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   A string $str$ of length $sn$ and radix $r$. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The radix-$\beta$ equivalent mp\_int. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc2.  $ix \leftarrow 0$ \\
*ebfedea0SLionel Sambuc3.  If $str_0 =$ ``-'' then do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $ix \leftarrow ix + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.2  $sign \leftarrow MP\_NEG$ \\
*ebfedea0SLionel Sambuc4.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc5.  $a \leftarrow 0$ \\
*ebfedea0SLionel Sambuc6.  for $iy$ from $ix$ to $sn - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  Let $y$ denote the position in the map of $str_{iy}$. \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  If $str_{iy}$ is not in the map or $y \ge r$ then goto step 7. \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  $a \leftarrow a \cdot r$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.4  $a \leftarrow a + y$ \\
*ebfedea0SLionel Sambuc7.  If $a \ne 0$ then $a.sign \leftarrow sign$ \\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_read\_radix}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_read\_radix.}
*ebfedea0SLionel SambucThis algorithm will read an ASCII string and produce the radix-$\beta$ mp\_int representation of the same integer.  A minus symbol ``-'' may precede the
string to indicate the value is negative, otherwise it is assumed to be positive.  The algorithm will read up to $sn$ characters from the input
and will stop early as soon as it reads a character it cannot map or whose value is not below the radix $r$.  This allows numbers to be embedded
*ebfedea0SLionel Sambucas part of larger input without any significant problem.
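
The steps of the figure can be sketched for values that fit in a 64-bit integer.  The function name \texttt{read\_radix\_i64}, the NUL-terminated input and the return codes are assumptions of this illustration; overflow of the accumulator is not checked, which the multiple precision version does not need to worry about.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The map of the figure: position in the string = value of the character. */
static const char read_map[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+/";

/* Parse str in radix r into *out.  Returns 0 on success, -1 for a bad radix.
 * Reading stops at the first character that is unmapped or not below r. */
int read_radix_i64(const char *str, int r, int64_t *out)
{
    int64_t a   = 0;
    int     neg = 0;

    assert(str != NULL && out != NULL);
    if (r < 2 || r > 64) {
        return -1;                               /* step 1 */
    }
    if (*str == '-') {                           /* step 3 */
        neg = 1;
        str++;
    }
    for (; *str != '\0'; str++) {                /* step 6 */
        const char *p = strchr(read_map, *str);
        if (p == NULL || (p - read_map) >= r) {
            break;                               /* step 6.2 */
        }
        a = a * r + (int64_t)(p - read_map);     /* steps 6.3 and 6.4 */
    }
    *out = neg ? -a : a;                         /* step 7 */
    return 0;
}
```

Because reading simply stops at the first unusable character, an input such as ``12 apples'' in radix $10$ yields $12$.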
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_read_radix.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Generating Radix-$n$ Output}
*ebfedea0SLionel SambucGenerating radix-$n$ output is fairly trivial with a division and remainder algorithm.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_toradix}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   A mp\_int $a$ and an integer $r$\\
*ebfedea0SLionel Sambuc\textbf{Output}.  The radix-$r$ representation of $a$ \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $r < 2$ or $r > 64$ return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc2.  If $a = 0$ then $str = $ ``$0$'' and return(\textit{MP\_OKAY}).  \\
*ebfedea0SLionel Sambuc3.  $t \leftarrow a$ \\
*ebfedea0SLionel Sambuc4.  $str \leftarrow$ ``'' \\
*ebfedea0SLionel Sambuc5.  if $t.sign = MP\_NEG$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $str \leftarrow str + $ ``-'' \\
\hspace{3mm}5.2  $t.sign \leftarrow MP\_ZPOS$ \\
*ebfedea0SLionel Sambuc6.  While ($t \ne 0$) do \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $d \leftarrow t \mbox{ (mod }r\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  $t \leftarrow \lfloor t / r \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  Look up $d$ in the map and store the equivalent character in $y$. \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.4  $str \leftarrow str + y$ \\
*ebfedea0SLionel Sambuc7.  If $str_0 = $``$-$'' then \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  Reverse the digits $str_1, str_2, \ldots str_n$. \\
*ebfedea0SLionel Sambuc8.  Otherwise \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  Reverse the digits $str_0, str_1, \ldots str_n$. \\
*ebfedea0SLionel Sambuc9.  Return(\textit{MP\_OKAY}).\\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_toradix}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_toradix.}
This algorithm computes the radix-$r$ representation of an mp\_int $a$.  The ``digits'' of the representation are extracted by reducing
the successive quotients $\lfloor a / r^k \rfloor$ of the input modulo $r$ until $r^k > a$.  Note that instead of actually dividing by $r^k$ in
each iteration the quotient $\lfloor a / r \rfloor$ is saved for the next iteration.  As a result a series of trivial $n \times 1$ divisions
*ebfedea0SLionel Sambucare required instead of a series of $n \times k$ divisions.  One design flaw of this approach is that the digits are produced in the reverse order
*ebfedea0SLionel Sambuc(see~\ref{fig:mpradix}).  To remedy this flaw the digits must be swapped or simply ``reversed''.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{|c|c|c|}
*ebfedea0SLionel Sambuc\hline \textbf{Value of $a$} & \textbf{Value of $d$} & \textbf{Value of $str$} \\
*ebfedea0SLionel Sambuc\hline $1234$ & -- & -- \\
*ebfedea0SLionel Sambuc\hline $123$  & $4$ & ``4'' \\
*ebfedea0SLionel Sambuc\hline $12$   & $3$ & ``43'' \\
*ebfedea0SLionel Sambuc\hline $1$    & $2$ & ``432'' \\
*ebfedea0SLionel Sambuc\hline $0$    & $1$ & ``4321'' \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\caption{Example of Algorithm mp\_toradix.}
*ebfedea0SLionel Sambuc\label{fig:mpradix}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_toradix.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\chapter{Number Theoretic Algorithms}
*ebfedea0SLionel SambucThis chapter discusses several fundamental number theoretic algorithms such as the greatest common divisor, least common multiple and Jacobi
*ebfedea0SLionel Sambucsymbol computation.  These algorithms arise as essential components in several key cryptographic algorithms such as the RSA public key algorithm and
*ebfedea0SLionel Sambucvarious Sieve based factoring algorithms.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Greatest Common Divisor}
The greatest common divisor of two integers $a$ and $b$, often denoted as $(a, b)$, is the largest integer $k$ that divides
both $a$ and $b$.  That is, $k$ is the largest integer such that $0 \equiv a \mbox{ (mod }k\mbox{)}$ and $0 \equiv b \mbox{ (mod }k\mbox{)}$ hold
simultaneously.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe most common approach (cite) is to reduce one input modulo another.  That is if $a$ and $b$ are divisible by some integer $k$ and if $qa + r = b$ then
*ebfedea0SLionel Sambuc$r$ is also divisible by $k$.  The reduction pattern follows $\left < a , b \right > \rightarrow \left < b, a \mbox{ mod } b \right >$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Greatest Common Divisor (I)}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two positive integers $a$ and $b$ greater than zero. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The greatest common divisor $(a, b)$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  While ($b > 0$) do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $r \leftarrow a \mbox{ (mod }b\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $a \leftarrow b$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.3  $b \leftarrow r$ \\
*ebfedea0SLionel Sambuc2.  Return($a$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Greatest Common Divisor (I)}
*ebfedea0SLionel Sambuc\label{fig:gcd1}
*ebfedea0SLionel Sambuc\end{figure}
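
As a brief illustration (the inputs are chosen here purely as an example), applying algorithm~\ref{fig:gcd1} to $a = 1071$ and $b = 462$ produces the following sequence of pairs.

\begin{eqnarray}
\left < 1071, 462 \right > & \rightarrow & \left < 462, 1071 \mbox{ mod } 462 \right > = \left < 462, 147 \right > \nonumber \\
 & \rightarrow & \left < 147, 462 \mbox{ mod } 147 \right > = \left < 147, 21 \right > \nonumber \\
 & \rightarrow & \left < 21, 147 \mbox{ mod } 21 \right > = \left < 21, 0 \right > \nonumber
\end{eqnarray}

The loop stops once $b = 0$ and returns $a = 21$, which is indeed $(1071, 462)$ since $1071 = 3^2 \cdot 7 \cdot 17$ and $462 = 2 \cdot 3 \cdot 7 \cdot 11$.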
*ebfedea0SLionel Sambuc
This algorithm will quickly converge on the greatest common divisor since the residue $r$ tends to diminish rapidly.  However, divisions are
relatively expensive operations to perform and should ideally be avoided.  There is another approach based on a similar relationship of
greatest common divisors.  It avoids division and is based on the observation that if $k$ divides both $a$ and $b$ it will also divide $b - a$.
In particular, we would like the difference $b - a$ to be non-negative and no larger than $b$, which requires that $b \ge a$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Greatest Common Divisor (II)}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two positive integers $a$ and $b$ greater than zero. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The greatest common divisor $(a, b)$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  While ($b > 0$) do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  $b \leftarrow b - a$ \\
*ebfedea0SLionel Sambuc2.  Return($a$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Greatest Common Divisor (II)}
*ebfedea0SLionel Sambuc\label{fig:gcd2}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
\textbf{Proof} \textit{Algorithm~\ref{fig:gcd2} will return the greatest common divisor of $a$ and $b$.}
The algorithm in figure~\ref{fig:gcd2} will eventually terminate: since $b \ge a > 0$ after step 1.1, the subtraction in step 1.2 always produces a value less than $b$.  In other
words, in every iteration the tuple $\left < a, b \right >$ decreases in magnitude until eventually $a = b$.  Every common divisor of $a$ and $b$ also divides
$b - a$, and every common divisor of $a$ and $b - a$ also divides $b$, so the greatest common divisor of the pair is unchanged by each iteration.  The last
iteration begins with $b = a$ and ends with $b = 0$, and clearly $(a, a) = a$, which concludes the proof.  \textbf{QED}.
*ebfedea0SLionel Sambuc
As a matter of practicality algorithm~\ref{fig:gcd2} decreases far too slowly to be useful, especially if $b$ is much larger than $a$ such that
$b - a$ is still very much larger than $a$.  A simple addition to the algorithm is to divide $b - a$ by a power of some integer $p$ which does
not divide the greatest common divisor but will divide $b - a$.  In this case ${b - a} \over p$ is also an integer and still divisible by
the greatest common divisor.
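
To make the slowdown concrete, consider the illustrative inputs $a = 12$ and $b = 126$ (chosen only for this example).  Algorithm~\ref{fig:gcd2} subtracts $12$ from $b$ ten times before the pair is swapped,

\begin{equation}
\left < 12, 126 \right > \rightarrow \left < 12, 114 \right > \rightarrow \ldots \rightarrow \left < 12, 6 \right > \rightarrow \left < 6, 6 \right > \rightarrow \left < 6, 0 \right >
\end{equation}

for a total of twelve subtractions, whereas algorithm~\ref{fig:gcd1} needs only the two reductions $126 \mbox{ mod } 12 = 6$ and $12 \mbox{ mod } 6 = 0$.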
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucHowever, instead of factoring $b - a$ to find a suitable value of $p$ the powers of $p$ can be removed from $a$ and $b$ that are in common first.
*ebfedea0SLionel SambucThen inside the loop whenever $b - a$ is divisible by some power of $p$ it can be safely removed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{Greatest Common Divisor (III)}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   Two positive integers $a$ and $b$ greater than zero. \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The greatest common divisor $(a, b)$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $k \leftarrow 0$ \\
*ebfedea0SLionel Sambuc2.  While $a$ and $b$ are both divisible by $p$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $a \leftarrow \lfloor a / p \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  $b \leftarrow \lfloor b / p \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.3  $k \leftarrow k + 1$ \\
*ebfedea0SLionel Sambuc3.  While $a$ is divisible by $p$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}3.1  $a \leftarrow \lfloor a / p \rfloor$ \\
*ebfedea0SLionel Sambuc4.  While $b$ is divisible by $p$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $b \leftarrow \lfloor b / p \rfloor$ \\
*ebfedea0SLionel Sambuc5.  While ($b > 0$) do \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  Swap $a$ and $b$ such that $a$ is the smallest of the two. \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $b \leftarrow b - a$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  While $b$ is divisible by $p$ do \\
*ebfedea0SLionel Sambuc\hspace{6mm}5.3.1  $b \leftarrow \lfloor b / p \rfloor$ \\
*ebfedea0SLionel Sambuc6.  Return($a \cdot p^k$). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm Greatest Common Divisor (III)}
*ebfedea0SLionel Sambuc\label{fig:gcd3}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc
This algorithm is based on algorithm~\ref{fig:gcd2} except it removes powers of $p$ first and inside the main loop to ensure the tuple $\left < a, b \right >$
decreases more rapidly.  The first loop on step two removes powers of $p$ that are in common.  A count, $k$, is kept which will represent a common
divisor of $p^k$.  After step two the remaining common divisor of $a$ and $b$ cannot be divisible by $p$ (assuming $p$ is prime).  This means that $p$ can be safely
divided out of the difference $b - a$ so long as the division leaves no remainder.
*ebfedea0SLionel Sambuc
In particular the value of $p$ should be chosen such that the division on step 5.3.1 occurs often.  It also helps that division by $p$ be easy
to compute.  The ideal choice of $p$ is two since division by two amounts to a right logical shift.  Another important observation is that by
step five both $a$ and $b$ are odd.  Therefore, the difference $b - a$ must be even which means that each iteration removes at least one bit from the
larger of the pair.
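
The following C fragment is a minimal sketch of algorithm~\ref{fig:gcd3} with $p = 2$, written for fixed-width unsigned integers rather than mp\_ints.  The function name \texttt{binary\_gcd} and the early return for zero inputs are assumptions made for this illustration and are not part of LibTomMath.

\begin{verbatim}
#include <stdint.h>

/* Sketch of Algorithm Greatest Common Divisor (III) with p = 2.
 * Hypothetical helper for illustration only; not a LibTomMath routine. */
uint64_t binary_gcd(uint64_t a, uint64_t b)
{
    unsigned k = 0;
    uint64_t t;

    /* The algorithm assumes positive inputs; zero is handled here so the
     * shifting loops below always terminate. */
    if (a == 0) return b;
    if (b == 0) return a;

    /* Step 2: remove the common factors of two and count them. */
    while (((a | b) & 1) == 0) { a >>= 1; b >>= 1; ++k; }
    /* Steps 3 and 4: remove the remaining factors of two from each. */
    while ((a & 1) == 0) a >>= 1;
    while ((b & 1) == 0) b >>= 1;

    /* Step 5: both values are odd on entry to every iteration. */
    while (b > 0) {
        if (a > b) { t = a; a = b; b = t; }     /* 5.1: make a the smaller */
        b -= a;                                 /* 5.2: even or zero       */
        while (b > 0 && (b & 1) == 0) b >>= 1;  /* 5.3: strip factors of 2 */
    }
    /* Step 6: restore the common power of two. */
    return a << k;
}
\end{verbatim}

Only shifts, comparisons and subtractions appear in the main loop, which is the property that makes the choice $p = 2$ attractive.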
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Complete Greatest Common Divisor}
*ebfedea0SLionel SambucThe algorithms presented so far cannot handle inputs which are zero or negative.  The following algorithm can handle all input cases properly
*ebfedea0SLionel Sambucand will produce the greatest common divisor.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_gcd}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The greatest common divisor $c = (a, b)$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a = 0$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c \leftarrow \vert b \vert $ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  If $b = 0$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $c \leftarrow \vert a \vert $ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc3.  $u \leftarrow \vert a \vert, v \leftarrow \vert b \vert$ \\
*ebfedea0SLionel Sambuc4.  $k \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  While $u.used > 0$ and $v.used > 0$ and $u_0 \equiv v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $k \leftarrow k + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $u \leftarrow \lfloor u / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.3  $v \leftarrow \lfloor v / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc6.  While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $u \leftarrow \lfloor u / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc7.  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc8.  While $v.used > 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  If $\vert u \vert > \vert v \vert$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}8.1.1  Swap $u$ and $v$. \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.2  $v \leftarrow \vert v \vert - \vert u \vert$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.3  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}8.3.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc9.  $c \leftarrow u \cdot 2^k$ \\
*ebfedea0SLionel Sambuc10.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_gcd}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_gcd.}
*ebfedea0SLionel SambucThis algorithm will produce the greatest common divisor of two mp\_ints $a$ and $b$.  The algorithm was originally based on Algorithm B of
*ebfedea0SLionel SambucKnuth \cite[pp. 338]{TAOCPV2} but has been modified to be simpler to explain.  In theory it achieves the same asymptotic working time as
*ebfedea0SLionel SambucAlgorithm B and in practice this appears to be true.
*ebfedea0SLionel Sambuc
The first two steps handle the cases where either one of or both inputs are zero.  If either input is zero the greatest common divisor is the
absolute value of the other input, which is zero if they are both zero.  If the inputs are not trivial then $u$ and $v$ are assigned the absolute values of
$a$ and $b$ respectively and the algorithm will proceed to reduce the pair.
*ebfedea0SLionel Sambuc
Step five will divide out any common factors of two and keep track of the count in the variable $k$.  After this step, two is no longer a
factor of the remaining greatest common divisor between $u$ and $v$ and can be safely divided out of either whenever it is even.  Steps
six and seven ensure that $u$ and $v$ respectively have no more factors of two.  At most one of the two while--loops will iterate since
$u$ and $v$ cannot both be even.
*ebfedea0SLionel Sambuc
By step eight both $u$ and $v$ are odd which is required for the inner logic.  First the pair are swapped such that $v$ is equal to
or greater than $u$.  This ensures that the subtraction on step 8.2 will always produce a non-negative and even result.  Step 8.3 removes any
factors of two from the difference $v$ to ensure that in the next iteration of the loop both are once again odd.
*ebfedea0SLionel Sambuc
Once $v = 0$ the variable $u$ holds the greatest common divisor of the pair $\left < u, v \right >$ as it stood just after step five; steps six and seven
only remove factors of two that are not shared and so do not change it.  The result must be adjusted by multiplying by the common factors of two ($2^k$) removed earlier.
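
As an illustrative trace (the inputs are chosen only for this example), let $a = -48$ and $b = 180$.  Neither input is zero so step three sets $u = 48$ and $v = 180$.

\begin{center}
\begin{tabular}{|l|c|c|c|}
\hline \textbf{Step} & $u$ & $v$ & $k$ \\
\hline 5 (two iterations) & 12 & 45 & 2 \\
\hline 6 (two iterations) & 3 & 45 & 2 \\
\hline 7 (no iterations) & 3 & 45 & 2 \\
\hline 8, first pass & 3 & $42 \rightarrow 21$ & 2 \\
\hline 8, second pass & 3 & $18 \rightarrow 9$ & 2 \\
\hline 8, third pass & 3 & $6 \rightarrow 3$ & 2 \\
\hline 8, fourth pass & 3 & 0 & 2 \\
\hline
\end{tabular}
\end{center}

Step nine then produces $c = 3 \cdot 2^2 = 12$, which agrees with $(48, 180) = 12$.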
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_gcd.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThis function makes use of the macros mp\_iszero and mp\_iseven.  The former evaluates to $1$ if the input mp\_int is equivalent to the
*ebfedea0SLionel Sambucinteger zero otherwise it evaluates to $0$.  The latter evaluates to $1$ if the input mp\_int represents a non-zero even integer otherwise
*ebfedea0SLionel Sambucit evaluates to $0$.  Note that just because mp\_iseven may evaluate to $0$ does not mean the input is odd, it could also be zero.  The three
*ebfedea0SLionel Sambuctrivial cases of inputs are handled on lines @23,zero@ through @29,}@.  After those lines the inputs are assumed to be non-zero.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucLines @32,if@ and @36,if@ make local copies $u$ and $v$ of the inputs $a$ and $b$ respectively.  At this point the common factors of two
*ebfedea0SLionel Sambucmust be divided out of the two inputs.  The block starting at line @43,common@ removes common factors of two by first counting the number of trailing
*ebfedea0SLionel Sambuczero bits in both.  The local integer $k$ is used to keep track of how many factors of $2$ are pulled out of both values.  It is assumed that
the number of factors will not exceed the maximum value of a C ``int'' data type\footnote{Strictly speaking no array in C may have more
entries than are accessible by an ``int'' so this is not a limitation.}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt this point there are no more common factors of two in the two values.  The divisions by a power of two on lines @60,div_2d@ and @67,div_2d@ remove
any independent factors of two such that both $u$ and $v$ are guaranteed to be odd integers before hitting the main body of the algorithm.  The while loop
*ebfedea0SLionel Sambucon line @72, while@ performs the reduction of the pair until $v$ is equal to zero.  The unsigned comparison and subtraction algorithms are used in
*ebfedea0SLionel Sambucplace of the full signed routines since both values are guaranteed to be positive and the result of the subtraction is guaranteed to be non-negative.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Least Common Multiple}
The least common multiple of a pair of integers is their product divided by their greatest common divisor.  For two integers $a$ and $b$ the
least common multiple is normally denoted as $[ a, b ]$ and numerically equivalent to ${ab} \over {(a, b)}$.  For example, if $a = 2 \cdot 2 \cdot 3 = 12$
and $b = 2 \cdot 3 \cdot 3 \cdot 7 = 126$ the least common multiple is ${{12 \cdot 126} \over {(12, 126)}} = {1512 \over 6} = 252$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe least common multiple arises often in coding theory as well as number theory.  If two functions have periods of $a$ and $b$ respectively they will
*ebfedea0SLionel Sambuccollide, that is be in synchronous states, after only $[ a, b ]$ iterations.  This is why, for example, random number generators based on
Linear Feedback Shift Registers (LFSR) tend to use registers with periods which are co-prime (\textit{i.e. their greatest common divisor is one}).
Similarly in number theory if a composite $n$ is the product of two distinct primes $p$ and $q$ then the maximal order of any unit of $\Z/n\Z$ will be $[ p - 1, q - 1] $.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_lcm}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $b$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The least common multiple $c = [a, b]$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $c \leftarrow (a, b)$ \\
*ebfedea0SLionel Sambuc2.  $t \leftarrow a \cdot b$ \\
*ebfedea0SLionel Sambuc3.  $c \leftarrow \lfloor t / c \rfloor$ \\
*ebfedea0SLionel Sambuc4.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_lcm}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_lcm.}
*ebfedea0SLionel SambucThis algorithm computes the least common multiple of two mp\_int inputs $a$ and $b$.  It computes the least common multiple directly by
*ebfedea0SLionel Sambucdividing the product of the two inputs by their greatest common divisor.
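
A minimal C sketch of the same computation for fixed-width unsigned integers is shown below.  The names \texttt{gcd\_u64} and \texttt{lcm\_u64} are hypothetical, and unlike the pseudo-code the sketch divides by the greatest common divisor before multiplying.  That ordering is an assumption made here only to keep the intermediate value small; the result is the same since $(a, b)$ divides $a$ exactly.

\begin{verbatim}
#include <stdint.h>

/* Euclid's algorithm (Greatest Common Divisor (I)); hypothetical helper. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
    while (b > 0) {
        uint64_t r = a % b;
        a = b;
        b = r;
    }
    return a;
}

/* Least common multiple [a, b] = ab / (a, b).  Returns 0 if either input
 * is zero.  The caller must ensure the result fits in 64 bits. */
uint64_t lcm_u64(uint64_t a, uint64_t b)
{
    if (a == 0 || b == 0) return 0;
    return (a / gcd_u64(a, b)) * b;
}
\end{verbatim}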
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_lcm.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Jacobi Symbol Computation}
To explain the Jacobi symbol we shall first discuss the Legendre function (more commonly called the Legendre symbol) upon which the Jacobi symbol is
defined.  The Legendre function indicates whether or not an integer $a$ is a quadratic residue modulo an odd prime $p$.  Numerically it is
equivalent to equation \ref{eqn:legendre}.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuca^{(p-1)/2} \equiv \begin{array}{rl}
*ebfedea0SLionel Sambuc                              -1 &  \mbox{if }a\mbox{ is a quadratic non-residue.} \\
                              0  &  \mbox{if }p\mbox{ divides }a\mbox{.} \\
*ebfedea0SLionel Sambuc                              1  &  \mbox{if }a\mbox{ is a quadratic residue}.
*ebfedea0SLionel Sambuc                              \end{array} \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\label{eqn:legendre}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\textbf{Proof.} \textit{Equation \ref{eqn:legendre} correctly identifies the residue status of an integer $a$ modulo a prime $p$.}
*ebfedea0SLionel SambucAn integer $a$ is a quadratic residue if the following equation has a solution.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucx^2 \equiv a \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\label{eqn:root}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucConsider the following equation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc0 \equiv x^{p-1} - 1 \equiv \left \lbrace \left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \right \rbrace + \left ( a^{(p-1)/2} - 1 \right ) \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\label{eqn:rooti}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucWhether equation \ref{eqn:root} has a solution or not equation \ref{eqn:rooti} is always true.  If $a^{(p-1)/2} - 1 \equiv 0 \mbox{ (mod }p\mbox{)}$
*ebfedea0SLionel Sambucthen the quantity in the braces must be zero.  By reduction,
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{eqnarray}
*ebfedea0SLionel Sambuc\left (x^2 \right )^{(p-1)/2} - a^{(p-1)/2} \equiv 0  \nonumber \\
*ebfedea0SLionel Sambuc\left (x^2 \right )^{(p-1)/2} \equiv a^{(p-1)/2} \nonumber \\
*ebfedea0SLionel Sambucx^2 \equiv a \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\end{eqnarray}
*ebfedea0SLionel Sambuc
As a result there must be a solution to the quadratic equation and in turn $a$ must be a quadratic residue.  If $p$ does not divide $a$ and $a$
is not a quadratic residue then the only other value $a^{(p-1)/2}$ may be congruent to is $-1$ since
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc0 \equiv a^{p - 1} - 1 \equiv (a^{(p-1)/2} + 1)(a^{(p-1)/2} - 1) \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel SambucOne of the terms on the right hand side must be zero.  \textbf{QED}
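
As a small numerical check of equation \ref{eqn:legendre} (the modulus is chosen only for illustration), let $p = 7$ so that $(p-1)/2 = 3$.  The non-zero squares modulo $7$ are $1^2 \equiv 1$, $2^2 \equiv 4$ and $3^2 \equiv 2$, so the quadratic residues are $1$, $2$ and $4$.

\begin{eqnarray}
2^3 = 8 & \equiv & 1 \mbox{ (mod }7\mbox{)} \nonumber \\
3^3 = 27 & \equiv & -1 \mbox{ (mod }7\mbox{)} \nonumber \\
14^3 = 2744 & \equiv & 0 \mbox{ (mod }7\mbox{)} \nonumber
\end{eqnarray}

This agrees with $2$ being a quadratic residue, $3$ being a quadratic non-residue and $14$ being a multiple of $7$.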
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Jacobi Symbol}
The Jacobi symbol is a generalization of the Legendre function to any odd modulus $p$ greater than 2, whether prime or not.  If $p = \prod_{i=0}^n p_i$ is the
factorization of $p$ into (not necessarily distinct) primes then the Jacobi symbol $\left ( { a \over p } \right )$ is given by the following equation.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\left ( { a \over p } \right ) = \left ( { a \over p_0} \right ) \left ( { a \over p_1} \right ) \ldots \left ( { a \over p_n} \right )
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy inspection if $p$ is prime the Jacobi symbol is equivalent to the Legendre function.  The following facts\footnote{See HAC \cite[pp. 72-74]{HAC} for
*ebfedea0SLionel Sambucfurther details.} will be used to derive an efficient Jacobi symbol algorithm.  Where $p$ is an odd integer greater than two and $a, b \in \Z$ the
*ebfedea0SLionel Sambucfollowing are true.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{enumerate}
*ebfedea0SLionel Sambuc\item $\left ( { a \over p} \right )$ equals $-1$, $0$ or $1$.
*ebfedea0SLionel Sambuc\item $\left ( { ab \over p} \right ) = \left ( { a \over p} \right )\left ( { b \over p} \right )$.
\item If $a \equiv b \mbox{ (mod }p\mbox{)}$ then $\left ( { a \over p} \right ) = \left ( { b \over p} \right )$.
\item $\left ( { 2 \over p} \right )$ equals $1$ if $p \equiv 1$ or $7 \mbox{ (mod }8\mbox{)}$.  Otherwise, it equals $-1$.
\item If $a$ is odd and positive then $\left ( { a \over p} \right ) = \left ( { p \over a} \right ) \cdot (-1)^{(p-1)(a-1)/4}$.  More specifically
$\left ( { a \over p} \right ) = \left ( { p \over a} \right )$ unless $p \equiv a \equiv 3 \mbox{ (mod }4\mbox{)}$, in which case $\left ( { a \over p} \right ) = -\left ( { p \over a} \right )$.
*ebfedea0SLionel Sambuc\end{enumerate}
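
One caution follows directly from the definition and fact four: for a composite modulus a Jacobi symbol of $1$ does not imply that $a$ is a quadratic residue.  For example (values chosen only for illustration),

\begin{equation}
\left ( { 2 \over 15 } \right ) = \left ( { 2 \over 3 } \right ) \left ( { 2 \over 5 } \right ) = (-1)(-1) = 1
\end{equation}

yet $x^2 \equiv 2 \mbox{ (mod }15\mbox{)}$ has no solution since $2$ is not a square modulo $3$.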
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucUsing these facts if $a = 2^k \cdot a'$ then
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{eqnarray}
*ebfedea0SLionel Sambuc\left ( { a \over p } \right ) = \left ( {{2^k} \over p } \right ) \left ( {a' \over p} \right ) \nonumber \\
*ebfedea0SLionel Sambuc                               = \left ( {2 \over p } \right )^k \left ( {a' \over p} \right )
*ebfedea0SLionel Sambuc\label{eqn:jacobi}
*ebfedea0SLionel Sambuc\end{eqnarray}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy fact five,
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\left ( { a \over p } \right ) = \left ( { p \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucSubsequently by fact three since $p \equiv (p \mbox{ mod }a) \mbox{ (mod }a\mbox{)}$ then
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\left ( { a \over p } \right ) = \left ( { {p \mbox{ mod } a} \over a } \right ) \cdot (-1)^{(p-1)(a-1)/4}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy putting both observations into equation \ref{eqn:jacobi} the following simplified equation is formed.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambuc\left ( { a \over p } \right ) = \left ( {2 \over p } \right )^k \left ( {{p\mbox{ mod }a'} \over a'} \right )  \cdot (-1)^{(p-1)(a'-1)/4}
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucThe value of $\left ( {{p \mbox{ mod }a'} \over a'} \right )$ can be found by using the same equation recursively.  The value of
*ebfedea0SLionel Sambuc$\left ( {2 \over p } \right )^k$ equals $1$ if $k$ is even otherwise it equals $\left ( {2 \over p } \right )$.  Using this approach the
*ebfedea0SLionel Sambucfactors of $p$ do not have to be known.  Furthermore, if $(a, p) = 1$ then the algorithm will terminate when the recursion requests the
*ebfedea0SLionel SambucJacobi symbol computation of $\left ( {1 \over a'} \right )$ which is simply $1$.
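
As an illustrative trace (the inputs are chosen only for this example), consider $\left ( { 6 \over 35 } \right )$.  Here $6 = 2^1 \cdot 3$ so $k = 1$ and $a' = 3$.  Since $35 \equiv 3 \mbox{ (mod }8\mbox{)}$ fact four gives $\left ( { 2 \over 35 } \right ) = -1$, the sign term is $(-1)^{(34)(2)/4} = (-1)^{17} = -1$ and $35 \mbox{ mod } 3 = 2$.

\begin{equation}
\left ( { 6 \over 35 } \right ) = \left ( { 2 \over 35 } \right )^1 \left ( { 2 \over 3 } \right ) \cdot (-1)^{17} = (-1) \left ( { 2 \over 3 } \right ) (-1) = \left ( { 2 \over 3 } \right )
\end{equation}

The recursion then evaluates $\left ( { 2 \over 3 } \right )$ with $k = 1$ and $a' = 1$; since $3 \equiv 3 \mbox{ (mod }8\mbox{)}$ the result is $-1$, hence $\left ( { 6 \over 35 } \right ) = -1$.  This matches the definition, since $\left ( { 6 \over 5 } \right ) \left ( { 6 \over 7 } \right ) = \left ( { 1 \over 5 } \right ) \left ( { -1 \over 7 } \right ) = (1)(-1) = -1$.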
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_jacobi}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $p$, $a \ge 0$, $p \ge 3$, $p \equiv 1 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The Jacobi symbol $c = \left ( {a \over p } \right )$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $a = 0$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $c \leftarrow 0$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  If $a = 1$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.1  $c \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}2.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc3.  $a' \leftarrow a$ \\
*ebfedea0SLionel Sambuc4.  $k \leftarrow 0$ \\
*ebfedea0SLionel Sambuc5.  While $a'.used > 0$ and $a'_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.1  $k \leftarrow k + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}5.2  $a' \leftarrow \lfloor a' / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc6.  If $k \equiv 0 \mbox{ (mod }2\mbox{)}$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $s \leftarrow 1$ \\
*ebfedea0SLionel Sambuc7.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $r \leftarrow p_0 \mbox{ (mod }8\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  If $r = 1$ or $r = 7$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.2.1  $s \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  else \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.3.1  $s \leftarrow -1$ \\
*ebfedea0SLionel Sambuc8.  If $p_0 \equiv a'_0 \equiv 3 \mbox{ (mod }4\mbox{)}$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $s \leftarrow -s$ \\
*ebfedea0SLionel Sambuc9.  If $a' \ne 1$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  $p' \leftarrow p \mbox{ (mod }a'\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.2  $s \leftarrow s \cdot \mbox{mp\_jacobi}(p', a')$ \\
*ebfedea0SLionel Sambuc10.  $c \leftarrow s$ \\
*ebfedea0SLionel Sambuc11.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_jacobi}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_jacobi.}
This algorithm computes the Jacobi symbol for an arbitrary non-negative integer $a$ with respect to an odd integer $p$ no smaller than three.  The algorithm
is based on algorithm 2.149 of HAC \cite[pp. 73]{HAC}.
*ebfedea0SLionel Sambuc
Steps one and two handle the trivial cases of $a = 0$ and $a = 1$ respectively.  Step five determines the number of factors of two in the
input $a$.  If $k$ is even then the term $\left ( { 2 \over p } \right )^k$ must always evaluate to one.  If $k$ is odd then the term evaluates to one
if $p_0$ is congruent to one or seven modulo eight, otherwise it evaluates to $-1$. After the $\left ( { 2 \over p } \right )^k$ term is handled
the $(-1)^{(p-1)(a'-1)/4}$ term is computed and multiplied against the current product $s$.  The latter term evaluates to negative one if both $p$ and $a'$
are congruent to three modulo four, otherwise it evaluates to one.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucBy step nine if $a'$ does not equal one a recursion is required.  Step 9.1 computes $p' \equiv p \mbox{ (mod }a'\mbox{)}$ and will recurse to compute
*ebfedea0SLionel Sambuc$\left ( {p' \over a'} \right )$ which is multiplied against the current Jacobi product.
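
The recursion can be sketched in C for fixed-width unsigned integers as follows.  This is a minimal illustration of the pseudo-code, not the LibTomMath routine; the name \texttt{jacobi\_u64} is hypothetical and the caller is assumed to pass an odd $p \ge 3$.

\begin{verbatim}
#include <stdint.h>

/* Sketch of algorithm mp_jacobi for 64-bit unsigned integers.
 * Hypothetical helper; the caller must pass an odd p >= 3. */
int jacobi_u64(uint64_t a, uint64_t p)
{
    unsigned k = 0;
    int s;

    if (a == 0) return 0;                 /* step 1 */
    if (a == 1) return 1;                 /* step 2 */

    /* Step 5: write a = 2^k * a1 with a1 odd (a now holds a1). */
    while ((a & 1) == 0) { a >>= 1; ++k; }

    /* Steps 6 and 7: s = (2/p)^k. */
    if ((k & 1) == 0) {
        s = 1;
    } else {
        unsigned r = (unsigned)(p & 7);
        s = (r == 1 || r == 7) ? 1 : -1;
    }

    /* Step 8: the sign flips when p = a1 = 3 (mod 4). */
    if ((p & 3) == 3 && (a & 3) == 3) s = -s;

    /* Step 9: recurse on the symbol (p mod a1 / a1). */
    if (a != 1) s *= jacobi_u64(p % a, a);
    return s;
}
\end{verbatim}

With this sketch \texttt{jacobi\_u64(6, 35)} evaluates to $-1$ and \texttt{jacobi\_u64(2, 15)} evaluates to $1$.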
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_jacobi.c
*ebfedea0SLionel Sambuc
As a matter of practicality the variable $a'$ of the pseudo-code is represented by the variable $a1$ since the $'$ symbol is not a valid character in a C
variable name.
*ebfedea0SLionel Sambuc
The two simple cases of $a = 0$ and $a = 1$ are handled at the very beginning to simplify the algorithm.  If the input is non-trivial the algorithm
has to proceed to compute the Jacobi symbol.  The variable $s$ is used to hold the current Jacobi product.  Note that $s$ is merely a C ``int'' data type since
the values it may take are merely $-1$, $0$ and $1$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAfter a local copy of $a$ is made all of the factors of two are divided out and the total stored in $k$.  Technically only the least significant
*ebfedea0SLionel Sambucbit of $k$ is required, however, it makes the algorithm simpler to follow to perform an addition. In practice an exclusive-or and addition have the same
*ebfedea0SLionel Sambucprocessor requirements and neither is faster than the other.
*ebfedea0SLionel Sambuc
Lines @59, if@ through @70, }@ determine the value of $\left ( { 2 \over p } \right )^k$.  If the least significant bit of $k$ is zero then
$k$ is even and the value is one.  Otherwise, the value of $s$ depends on which residue class $p$ belongs to modulo eight.  The value of
$(-1)^{(p-1)(a'-1)/4}$ is computed and multiplied against $s$ on lines @73, if@ through @75, }@.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucFinally, if $a1$ does not equal one the algorithm must recurse and compute $\left ( {p' \over a'} \right )$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Modular Inverse}
*ebfedea0SLionel Sambuc\label{sec:modinv}
The modular inverse of a number actually refers to the modular multiplicative inverse.  Essentially for any integer $a$ such that $(a, p) = 1$ there
exists another integer $b$ such that $ab \equiv 1 \mbox{ (mod }p\mbox{)}$.  The integer $b$ is called the multiplicative inverse of $a$ which is
denoted as $b = a^{-1}$.  Technically speaking modular inversion is a well defined operation for any finite ring or field not just for rings and
fields of integers.  However, only the integer case will be discussed here.
*ebfedea0SLionel Sambuc
The simplest approach is to compute the algebraic inverse of the input.  That is to compute $b \equiv a^{\Phi(p) - 1} \mbox{ (mod }p\mbox{)}$.  If $\Phi(p)$ is the
order of the multiplicative group modulo $p$ then $b$ must be the multiplicative inverse of $a$.  The proof of which is trivial.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucab \equiv a \left (a^{\Phi(p) - 1} \right ) \equiv a^{\Phi(p)} \equiv a^0 \equiv 1 \mbox{ (mod }p\mbox{)}
*ebfedea0SLionel Sambuc\end{equation}
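
For instance (values chosen only for illustration), with the prime $p = 7$ the group order is $\Phi(7) = 6$, and for $a = 3$

\begin{equation}
b \equiv 3^{6 - 1} = 243 \equiv 5 \mbox{ (mod }7\mbox{)}
\end{equation}

which is indeed the inverse since $3 \cdot 5 = 15 \equiv 1 \mbox{ (mod }7\mbox{)}$.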
*ebfedea0SLionel Sambuc
However, as simple as this approach may be it has two serious flaws.  It requires that the value of $\Phi(p)$ be known which, if $p$ is composite,
requires all of the prime factors.  This approach also becomes very slow as the size of $p$ grows.
*ebfedea0SLionel Sambuc
A more practical approach is based on the observation that solving for the multiplicative inverse is equivalent to solving the linear
Diophantine\footnote{See LeVeque \cite[pp. 40-43]{LeVeque} for more information.} equation
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel Sambucab + pq = 1
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
where $a$, $b$, $p$ and $q$ are all integers.  If such a pair of integers $ \left < b, q \right >$ exists then $b$ is the multiplicative inverse of
*ebfedea0SLionel Sambuc$a$ modulo $p$.  The extended Euclidean algorithm (Knuth \cite[pp. 342]{TAOCPV2}) can be used to solve such equations provided $(a, p) = 1$.
*ebfedea0SLionel SambucHowever, instead of using that algorithm directly a variant known as the binary Extended Euclidean algorithm will be used in its place.  The
*ebfedea0SLionel Sambucbinary approach is very similar to the binary greatest common divisor algorithm except it will produce a full solution to the Diophantine
*ebfedea0SLionel Sambucequation.
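For example (with values chosen here purely for illustration), for $a = 3$ and $p = 7$ the pair $\left < b, q \right > = \left < 5, -2 \right >$ is a solution since
\begin{equation}
3 \cdot 5 + 7 \cdot (-2) = 15 - 14 = 1
\end{equation}
and therefore $b = 5$ is the multiplicative inverse of $3$ modulo $7$.  The solution is not unique: adding a multiple of $p$ to $b$ while subtracting the same
multiple of $a$ from $q$, as in $\left < 12, -5 \right >$, gives another solution, and $12 \equiv 5 \mbox{ (mod }7\mbox{)}$.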
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{General Case}
*ebfedea0SLionel Sambuc\newpage\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_invmod}. \\
\textbf{Input}.   mp\_int $a$ and $b$, $(a, b) = 1$, $b \ge 2$, $0 < a < b$.  \\
*ebfedea0SLionel Sambuc\textbf{Output}.  The modular inverse $c \equiv a^{-1} \mbox{ (mod }b\mbox{)}$. \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  If $b \le 0$ then return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc2.  If $b_0 \equiv 1 \mbox{ (mod }2\mbox{)}$ then use algorithm fast\_mp\_invmod. \\
*ebfedea0SLionel Sambuc3.  $x \leftarrow \vert a \vert, y \leftarrow b$ \\
*ebfedea0SLionel Sambuc4.  If $x_0 \equiv y_0  \equiv 0 \mbox{ (mod }2\mbox{)}$ then return(\textit{MP\_VAL}). \\
5.  $u \leftarrow x, v \leftarrow y, A \leftarrow 1, B \leftarrow 0, C \leftarrow 0, D \leftarrow 1$ \\
*ebfedea0SLionel Sambuc6.  While $u.used > 0$ and $u_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $u \leftarrow \lfloor u / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  If ($A.used > 0$ and $A_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($B.used > 0$ and $B_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.1  $A \leftarrow A + y$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.2  $B \leftarrow B - x$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  $A \leftarrow \lfloor A / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.4  $B \leftarrow \lfloor B / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc7.  While $v.used > 0$ and $v_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.1  $v \leftarrow \lfloor v / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.2  If ($C.used > 0$ and $C_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) or ($D.used > 0$ and $D_0 \equiv 1 \mbox{ (mod }2\mbox{)}$) then \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.2.1  $C \leftarrow C + y$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}7.2.2  $D \leftarrow D - x$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.3  $C \leftarrow \lfloor C / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}7.4  $D \leftarrow \lfloor D / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc8.  If $u \ge v$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.1  $u \leftarrow u - v$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.2  $A \leftarrow A - C$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}8.3  $B \leftarrow B - D$ \\
*ebfedea0SLionel Sambuc9.  else \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.1  $v \leftarrow v - u$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.2  $C \leftarrow C - A$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}9.3  $D \leftarrow D - B$ \\
*ebfedea0SLionel Sambuc10.  If $u \ne 0$ goto step 6. \\
*ebfedea0SLionel Sambuc11.  If $v \ne 1$ return(\textit{MP\_VAL}). \\
*ebfedea0SLionel Sambuc12.  While $C \le 0$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}12.1  $C \leftarrow C + b$ \\
*ebfedea0SLionel Sambuc13.  While $C \ge b$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}13.1  $C \leftarrow C - b$ \\
*ebfedea0SLionel Sambuc14.  $c \leftarrow C$ \\
*ebfedea0SLionel Sambuc15.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
\caption{Algorithm mp\_invmod}
\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_invmod.}
*ebfedea0SLionel SambucThis algorithm computes the modular multiplicative inverse of an integer $a$ modulo an integer $b$.  This algorithm is a variation of the
extended binary Euclidean algorithm from HAC \cite[p. 608]{HAC}.  It has been modified to only compute the modular inverse and not a complete
*ebfedea0SLionel SambucDiophantine solution.
*ebfedea0SLionel Sambuc
If $b \le 0$ then the modulus is invalid and MP\_VAL is returned.  Similarly, if both $a$ and $b$ are even then there cannot be a multiplicative
inverse for $a$ and the error is reported.
*ebfedea0SLionel Sambuc
The astute reader will observe that steps six through nine are very similar to the binary greatest common divisor algorithm mp\_gcd.  In this case
the remaining variables of the Diophantine equation are solved for as well.  The algorithm terminates when $u = 0$ in which case the solution is
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{equation}
*ebfedea0SLionel SambucCa + Db = v
*ebfedea0SLionel Sambuc\end{equation}
*ebfedea0SLionel Sambuc
If $v$, the greatest common divisor of $a$ and $b$, is not equal to one then the algorithm will report an error as no inverse exists.  Otherwise, $C$
is the modular inverse of $a$.  The actual value of $C$ is congruent to, but not necessarily equal to, the ideal modular inverse which should lie
within $1 \le a^{-1} < b$.  Steps twelve and thirteen adjust the inverse until it is in range.  If the original input $a$ is within $0 < a < b$
then only a couple of additions or subtractions will be required to adjust the inverse.
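
As an illustration, the following trace (an example constructed here, not taken from the library) follows the algorithm for $a = 3$ and $b = 10$, an even
modulus so that the general case applies.  It assumes the working values start as $u = x = 3$ and $v = y = 10$.  Every row satisfies $Ax + By = u$ and
$Cx + Dy = v$.

\begin{small}
\begin{center}
\begin{tabular}{|l|c|c|c|c|c|c|}
\hline \textbf{After} & $u$ & $v$ & $A$ & $B$ & $C$ & $D$ \\
\hline Step 5 (initial values) & $3$ & $10$ & $1$ & $0$ & $0$ & $1$ \\
\hline Step 7 (first pass)     & $3$ & $5$  & $1$ & $0$ & $5$ & $-1$ \\
\hline Step 9 (first pass)     & $3$ & $2$  & $1$ & $0$ & $4$ & $-1$ \\
\hline Step 7 (second pass)    & $3$ & $1$  & $1$ & $0$ & $7$ & $-2$ \\
\hline Step 8 (second pass)    & $2$ & $1$  & $-6$ & $2$ & $7$ & $-2$ \\
\hline Step 6 (third pass)     & $1$ & $1$  & $-3$ & $1$ & $7$ & $-2$ \\
\hline Step 8 (third pass)     & $0$ & $1$  & $-10$ & $3$ & $7$ & $-2$ \\
\hline
\end{tabular}
\end{center}
\end{small}

The loop ends with $v = 1$ and $C = 7$, which already lies in the range $1 \le C < 10$, so $c = 7$ and indeed $3 \cdot 7 = 21 \equiv 1 \mbox{ (mod }10\mbox{)}$.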
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_invmod.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsubsection{Odd Moduli}
*ebfedea0SLionel Sambuc
When the modulus $b$ is odd the variables $A$ and $C$ are fixed and are not required to compute the inverse.  In particular, by attempting to solve
the Diophantine equation $Cb + Da = 1$ only $B$ and $D$ are required to find the inverse of $a$.

The algorithm fast\_mp\_invmod is a direct adaptation of algorithm mp\_invmod with all steps involving either $A$ or $C$ removed.  This
optimization will halve the time required to compute the modular inverse.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\section{Primality Tests}
*ebfedea0SLionel Sambuc
An integer $a > 1$ is said to be prime if it is not divisible by any positive integer other than one and itself.  For example, $a = 7$ is prime
since none of the integers $2 \ldots 6$ evenly divide $a$.  By contrast, $a = 6$ is not prime since $a = 6 = 2 \cdot 3$.
*ebfedea0SLionel Sambuc
Prime numbers arise frequently in cryptography as they allow finite fields to be formed.  The ability to quickly determine whether an integer is prime or
not has been an active subject in cryptography and number theory for a considerable time.  The algorithms that will be presented are all
probabilistic algorithms in that when they report an integer is composite it must be composite.  However, when the algorithms report an integer is
prime the algorithm may be incorrect.
*ebfedea0SLionel Sambuc
As will be discussed, it is possible to limit the probability of error so well that for practical purposes the probability of error might as
well be zero.  For the purposes of these discussions let $n$ represent the candidate integer whose primality is in question.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{Trial Division}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucTrial division means to attempt to evenly divide a candidate integer by small prime integers.  If the candidate can be evenly divided it obviously
*ebfedea0SLionel Sambuccannot be prime.  By dividing by all primes $1 < p \le \sqrt{n}$ this test can actually prove whether an integer is prime.  However, such a test
*ebfedea0SLionel Sambucwould require a prohibitive amount of time as $n$ grows.
*ebfedea0SLionel Sambuc
Instead of dividing by every prime, a smaller, more manageable set of primes may be used.  By performing trial division with only a subset
of the primes less than $\sqrt{n} + 1$ the algorithm cannot prove that a candidate is prime.  However, it can often prove that a candidate is not prime.
*ebfedea0SLionel Sambuc
The benefit of this test is that trial division by small values is fairly efficient, especially compared to the other algorithms that will be
discussed shortly.  The probability that this approach correctly identifies a composite candidate when tested with all primes up to $q$ is given by
$1 - {1.12 \over \ln(q)}$.  The graph (\ref{pic:primality}, will be added later) demonstrates the probability of success for the range
$3 \le q \le 100$.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucAt approximately $q = 30$ the gain of performing further tests diminishes fairly quickly.  At $q = 90$ further testing is generally not going to
*ebfedea0SLionel Sambucbe of any practical use.  In the case of LibTomMath the default limit $q = 256$ was chosen since it is not too high and will eliminate
*ebfedea0SLionel Sambucapproximately $80\%$ of all candidate integers.  The constant \textbf{PRIME\_SIZE} is equal to the number of primes in the test base.  The
*ebfedea0SLionel Sambucarray \_\_prime\_tab is an array of the first \textbf{PRIME\_SIZE} prime numbers.
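
To put numbers to the estimate $1 - {1.12 \over \ln(q)}$, the table below simply evaluates the formula at the three limits mentioned, rounded to two decimal
places.  The figures were computed here for illustration and are not measurements.

\begin{small}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline $q$ & $\ln(q)$ & $1 - {1.12 \over \ln(q)}$ \\
\hline $30$  & $3.40$ & $0.67$ \\
\hline $90$  & $4.50$ & $0.75$ \\
\hline $256$ & $5.55$ & $0.80$ \\
\hline
\end{tabular}
\end{center}
\end{small}

Tripling the limit from $q = 30$ to $q = 90$ gains only about eight percentage points, and going on to $q = 256$ gains about five more, which agrees with
the $80\%$ figure for the default limit.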
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_prime\_is\_divisible}. \\
\textbf{Input}.   mp\_int $n$ \\
\textbf{Output}.  $c = 1$ if $n$ is divisible by a small prime, otherwise $c = 0$.  \\
\hline \\
1.  for $ix$ from $0$ to $PRIME\_SIZE - 1$ do \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.1  $d \leftarrow n \mbox{ (mod }\_\_prime\_tab_{ix}\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}1.2  If $d = 0$ then \\
*ebfedea0SLionel Sambuc\hspace{6mm}1.2.1  $c \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}1.2.2  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc2.  $c \leftarrow 0$ \\
*ebfedea0SLionel Sambuc3.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_prime\_is\_divisible}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_prime\_is\_divisible.}
*ebfedea0SLionel SambucThis algorithm attempts to determine if a candidate integer $n$ is composite by performing trial divisions.
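For instance (an example chosen here for illustration), with $n = 91$ the loop computes $91 \equiv 1 \mbox{ (mod }2\mbox{)}$, $91 \equiv 1 \mbox{ (mod }3\mbox{)}$,
$91 \equiv 1 \mbox{ (mod }5\mbox{)}$ and finally $91 \equiv 0 \mbox{ (mod }7\mbox{)}$, at which point $c \leftarrow 1$ is returned since $91 = 7 \cdot 13$.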
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_prime_is_divisible.c
*ebfedea0SLionel Sambuc
The algorithm defaults to a return of $0$ in case an error occurs.  The values in the prime table are all specified to be in the range of an
mp\_digit.  The table \_\_prime\_tab is defined in the following file.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_prime_tab.c
*ebfedea0SLionel Sambuc
Note that there are two possible tables.  When an mp\_digit is 7 bits long only the primes up to $127$ may be included, otherwise the primes
up to $1619$ are used.  Note that the value of \textbf{PRIME\_SIZE} is a constant dependent on the size of an mp\_digit.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The Fermat Test}
The Fermat test is probably one of the oldest tests to have a non-trivial probability of success.  It is based on the fact that if $n$ is in
fact prime then $a^{n} \equiv a \mbox{ (mod }n\mbox{)}$ for all $0 < a < n$.  The reason being that if $n$ is prime then the order of
the multiplicative subgroup is $n - 1$.  Any base $a$ must have an order which divides $n - 1$ and as such $a^n$ is equivalent to
$a^1 = a$.
*ebfedea0SLionel Sambuc
If $n$ is composite then a given base $a$ does not have to have an order which divides $n - 1$, in which case
it is possible that $a^n \nequiv a \mbox{ (mod }n\mbox{)}$.  However, this test is not absolute as it is possible that the order
of a base will divide $n - 1$, in which case $n$ would be reported as prime.  Such a base yields what is known as a Fermat pseudo-prime.  Certain
composite integers known as Carmichael numbers are pseudo-primes to all valid bases.  Fortunately such numbers become extremely rare as $n$ grows
in size.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_prime\_fermat}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$.  \\
*ebfedea0SLionel Sambuc\textbf{Output}.  $c = 1$ if $b^a \equiv b \mbox{ (mod }a\mbox{)}$, otherwise $c = 0$.  \\
*ebfedea0SLionel Sambuc\hline \\
*ebfedea0SLionel Sambuc1.  $t \leftarrow b^a \mbox{ (mod }a\mbox{)}$ \\
2.  If $t = b$ then \\
\hspace{3mm}2.1  $c \leftarrow 1$ \\
3.  else \\
\hspace{3mm}3.1  $c \leftarrow 0$ \\
*ebfedea0SLionel Sambuc4.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_prime\_fermat}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_prime\_fermat.}
This algorithm determines whether an mp\_int $a$ passes the Fermat test to the base $b$, that is, whether $a$ is either prime or a Fermat pseudo-prime
to that base.  It uses a single modular exponentiation to determine the result.
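
As a worked example (the numbers are chosen here for illustration), let $a = 15$ and $b = 2$.  Since $2^4 = 16 \equiv 1 \mbox{ (mod }15\mbox{)}$,
\begin{equation}
t \equiv 2^{15} \equiv \left (2^4 \right )^3 \cdot 2^3 \equiv 8 \mbox{ (mod }15\mbox{)}
\end{equation}
and as $t \ne b$ the algorithm sets $c = 0$, correctly flagging $15$ as composite.  By contrast, for the composite $a = 341 = 11 \cdot 31$ and $b = 2$ we have
$2^{10} = 1024 = 3 \cdot 341 + 1 \equiv 1 \mbox{ (mod }341\mbox{)}$, so $2^{341} \equiv \left (2^{10} \right )^{34} \cdot 2 \equiv 2 \mbox{ (mod }341\mbox{)}$ and
the algorithm sets $c = 1$ even though $341$ is not prime.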
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_prime_fermat.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\subsection{The Miller-Rabin Test}
The Miller-Rabin (citation) test is another primality test which has tighter error bounds than the Fermat test, specifically with sequentially chosen
candidate integers.  The algorithm is based on the observation that if $n$ is prime, $n - 1 = 2^kr$ with $r$ odd, and $b^r \nequiv \pm 1$, then after up to $k - 1$ squarings the
value must be equal to $-1$.  The squarings are stopped as soon as $-1$ is observed.  If the value of $1$ is observed first it means that
some value not congruent to $\pm 1$ when squared equals one, which cannot occur if $n$ is prime.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\begin{figure}[!here]
*ebfedea0SLionel Sambuc\begin{small}
*ebfedea0SLionel Sambuc\begin{center}
*ebfedea0SLionel Sambuc\begin{tabular}{l}
*ebfedea0SLionel Sambuc\hline Algorithm \textbf{mp\_prime\_miller\_rabin}. \\
*ebfedea0SLionel Sambuc\textbf{Input}.   mp\_int $a$ and $b$, $a \ge 2$, $0 < b < a$.  \\
\textbf{Output}.  $c = 1$ if $a$ is a Miller-Rabin prime to the base $b$, otherwise $c = 0$.  \\
\hline
1.  $a' \leftarrow a - 1$ \\
2.  $r  \leftarrow a'$    \\
*ebfedea0SLionel Sambuc3.  $c \leftarrow 0, s  \leftarrow 0$ \\
*ebfedea0SLionel Sambuc4.  While $r.used > 0$ and $r_0 \equiv 0 \mbox{ (mod }2\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.1  $s \leftarrow s + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}4.2  $r \leftarrow \lfloor r / 2 \rfloor$ \\
*ebfedea0SLionel Sambuc5.  $y \leftarrow b^r \mbox{ (mod }a\mbox{)}$ \\
*ebfedea0SLionel Sambuc6.  If $y \nequiv \pm 1$ then \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.1  $j \leftarrow 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.2  While $j \le (s - 1)$ and $y \nequiv a'$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.1  $y \leftarrow y^2 \mbox{ (mod }a\mbox{)}$ \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.2  If $y = 1$ then goto step 8. \\
*ebfedea0SLionel Sambuc\hspace{6mm}6.2.3  $j \leftarrow j + 1$ \\
*ebfedea0SLionel Sambuc\hspace{3mm}6.3  If $y \nequiv a'$ goto step 8. \\
*ebfedea0SLionel Sambuc7.  $c \leftarrow 1$\\
*ebfedea0SLionel Sambuc8.  Return(\textit{MP\_OKAY}). \\
*ebfedea0SLionel Sambuc\hline
*ebfedea0SLionel Sambuc\end{tabular}
*ebfedea0SLionel Sambuc\end{center}
*ebfedea0SLionel Sambuc\end{small}
*ebfedea0SLionel Sambuc\caption{Algorithm mp\_prime\_miller\_rabin}
*ebfedea0SLionel Sambuc\end{figure}
*ebfedea0SLionel Sambuc\textbf{Algorithm mp\_prime\_miller\_rabin.}
This algorithm performs one trial round of the Miller-Rabin algorithm to the base $b$.  It will set $c = 1$ if the algorithm cannot determine
if $a$ is composite or $c = 0$ if $a$ is provably composite.  The values of $s$ and $r$ are computed such that $a' = a - 1 = 2^sr$.
*ebfedea0SLionel Sambuc
If the value $y \equiv b^r$ is congruent to $\pm 1$ then the algorithm cannot prove if $a$ is composite or not.  Otherwise, the algorithm will
square $y$ up to $s - 1$ times stopping only when $y \equiv -1$.  If $y^2 \equiv 1$ and $y \nequiv \pm 1$ then the algorithm can report that $a$
is provably composite.  If the algorithm performs $s - 1$ squarings and $y \nequiv -1$ then $a$ is provably composite.  If $a$ is not provably
composite then it is \textit{probably} prime.
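
As a worked example (constructed here for illustration), take $a = 341 = 11 \cdot 31$ and $b = 2$, a composite for which $2^{341} \equiv 2 \mbox{ (mod }341\mbox{)}$ and
which therefore passes the Fermat test to the base $2$.  Here $a' = 340 = 2^2 \cdot 85$ so $s = 2$ and $r = 85$.  Since $2^{10} = 1024 \equiv 1 \mbox{ (mod }341\mbox{)}$,
\begin{equation}
y \equiv 2^{85} \equiv \left (2^{10} \right )^8 \cdot 2^5 \equiv 32 \mbox{ (mod }341\mbox{)}
\end{equation}
which is not congruent to $\pm 1$.  The first squaring gives $y \equiv 32^2 = 1024 \equiv 1 \mbox{ (mod }341\mbox{)}$, so a value other than $\pm 1$ has squared to
one and the algorithm returns $c = 0$.  For the prime $a = 13$ with $b = 2$ we have $a' = 12 = 2^2 \cdot 3$, $y \equiv 2^3 = 8$, and one squaring gives
$y \equiv 64 \equiv 12 \equiv -1 \mbox{ (mod }13\mbox{)}$, so the algorithm returns $c = 1$.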
*ebfedea0SLionel Sambuc
*ebfedea0SLionel SambucEXAM,bn_mp_prime_miller_rabin.c
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\backmatter
*ebfedea0SLionel Sambuc\appendix
*ebfedea0SLionel Sambuc\begin{thebibliography}{ABCDEF}
*ebfedea0SLionel Sambuc\bibitem[1]{TAOCPV2}
*ebfedea0SLionel SambucDonald Knuth, \textit{The Art of Computer Programming}, Third Edition, Volume Two, Seminumerical Algorithms, Addison-Wesley, 1998
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[2]{HAC}
*ebfedea0SLionel SambucA. Menezes, P. van Oorschot, S. Vanstone, \textit{Handbook of Applied Cryptography}, CRC Press, 1996
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[3]{ROSE}
*ebfedea0SLionel SambucMichael Rosing, \textit{Implementing Elliptic Curve Cryptography}, Manning Publications, 1999
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[4]{COMBA}
*ebfedea0SLionel SambucPaul G. Comba, \textit{Exponentiation Cryptosystems on the IBM PC}. IBM Systems Journal 29(4): 526-538 (1990)
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[5]{KARA}
A. Karatsuba, Doklady Akad. Nauk SSSR 145 (1962), pp. 293-294
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[6]{KARAP}
*ebfedea0SLionel SambucAndre Weimerskirch and Christof Paar, \textit{Generalizations of the Karatsuba Algorithm for Polynomial Multiplication}, Submitted to Design, Codes and Cryptography, March 2002
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[7]{BARRETT}
*ebfedea0SLionel SambucPaul Barrett, \textit{Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor}, Advances in Cryptology, Crypto '86, Springer-Verlag.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[8]{MONT}
*ebfedea0SLionel SambucP.L.Montgomery. \textit{Modular multiplication without trial division}. Mathematics of Computation, 44(170):519-521, April 1985.
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[9]{DRMET}
*ebfedea0SLionel SambucChae Hoon Lim and Pil Joong Lee, \textit{Generating Efficient Primes for Discrete Log Cryptosystems}, POSTECH Information Research Laboratories
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[10]{MMB}
*ebfedea0SLionel SambucJ. Daemen and R. Govaerts and J. Vandewalle, \textit{Block ciphers based on Modular Arithmetic}, State and {P}rogress in the {R}esearch of {C}ryptography, 1993, pp. 80-89
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[11]{RSAREF}
*ebfedea0SLionel SambucR.L. Rivest, A. Shamir, L. Adleman, \textit{A Method for Obtaining Digital Signatures and Public-Key Cryptosystems}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[12]{DHREF}
*ebfedea0SLionel SambucWhitfield Diffie, Martin E. Hellman, \textit{New Directions in Cryptography}, IEEE Transactions on Information Theory, 1976
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[13]{IEEE}
*ebfedea0SLionel SambucIEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[14]{GMP}
*ebfedea0SLionel SambucGNU Multiple Precision (GMP), \url{http://www.swox.com/gmp/}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[15]{MPI}
*ebfedea0SLionel SambucMultiple Precision Integer Library (MPI), Michael Fromberger, \url{http://thayer.dartmouth.edu/~sting/mpi/}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[16]{OPENSSL}
*ebfedea0SLionel SambucOpenSSL Cryptographic Toolkit, \url{http://openssl.org}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[17]{LIP}
*ebfedea0SLionel SambucLarge Integer Package, \url{http://home.hetnet.nl/~ecstr/LIP.zip}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[18]{ISOC}
*ebfedea0SLionel SambucJTC1/SC22/WG14, ISO/IEC 9899:1999, ``A draft rationale for the C99 standard.''
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\bibitem[19]{JAVA}
*ebfedea0SLionel SambucThe Sun Java Website, \url{http://java.sun.com/}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\end{thebibliography}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\input{tommath.ind}
*ebfedea0SLionel Sambuc
*ebfedea0SLionel Sambuc\end{document}