Have programming languages driven hardware development?













76















Programming language development has been influenced by hardware design. One example from this answer mentions how C pointers were, at least in part, influenced by the design of the PDP-11. Has the reverse taken place, where a construct provided by a language drove the development of hardware?



To be clear, I'm wondering about core language constructs, like pointers for example, rather than industry consortiums coming up with something like OpenGL then being implemented in hardware. Perhaps hardware floating-point support?



























  • 17





    As a cautionary tale we should perhaps mention the iAPX 432

    – another-dave
    Jan 22 at 0:36






  • 9





    I'm pretty sure that many of the RISC processors were designed specifically to be good targets for compilers; I'm seeking a source.

    – Tommy
    Jan 22 at 1:23






  • 7





    @Tommy RISC CPUs are an interesting case. They were designed in an era when it was obvious that they'd be almost exclusively running compiled code, so they were designed with compilers in mind. On the other hand, many RISC CPUs were designed with certain features and limitations that required compilers to implement things they didn't have to previously. For example, branch delay slots and pipelining required the compiler to reorder instructions in order to generate the most efficient code. (In the end it proved that CPUs can do a better job of reordering instructions than compilers.)

    – Ross Ridge
    Jan 22 at 1:42






  • 21





    Quite surprised no one has mentioned Lisp machines or Java smartcards.

    – chrylis
    Jan 22 at 3:53







  • 4





    Both of those are simply hardware implementations of interpreters or virtual machines; I assumed the question was more about programming languages that changed the general evolution of hardware design. (Like, did array support lead to greater use of on-chip cache to support efficient array access?)

    – chepner
    Jan 22 at 19:49















hardware software c






asked Jan 21 at 22:53









Nathan








18 Answers
























42














Interesting question with an interesting answer.



First let me get one thing out of the way:




One example from this answer mentions how C pointers were, at least in part, influenced by the design of the PDP-11




It's a myth to suggest C's design is based on the PDP-11. People often quote, for example, the increment and decrement operators because they have an analogue in the PDP-11 instruction set. This is, however, a coincidence. Those operators were invented before the language was ported to the PDP-11.



There are actually two answers to this question:



  • processors that are targeted to a specific high level language

  • processors that include features that a high level language might find useful.

In the former category we have most of the interesting eventual dead ends in computer hardware history. Perhaps one of the earliest examples of a CPU architecture being targeted at a high level language is the Burroughs B5000 and its successors. This is a family of machines targeted at Algol. In fact, there wasn't really a machine language as such that you could program in.



The B5000 had a lot of hardware features designed to support the implementation of Algol. It had a hardware stack and all data manipulations were performed on the stack. It used tagged descriptors for data so the CPU had some idea of what it was dealing with. It had a series of registers called display registers that were used to model static scope* efficiently.



Other examples of machines targeted at specific languages include the Lisp machine already mentioned, arguably the Cray series of supercomputers for Fortran - or even just Fortran loops, the ICL 2900 series (also block structured high level languages), some machines targeted at the Java virtual machine (some ARM processors have hardware JVM support) and many others.



One of the drivers behind creating RISC architectures was the observation that compilers tended to use only a small subset of the available combinations of instructions and addressing modes available on most CPU architectures, so RISC designers ditched the unused ones and filled the space previously used for complex decoding logic with more registers.



In the second category, we have individual features in processors targeted at high level languages. For example, the hardware stack is a useful feature for an assembly language programmer, but more or less essential for any language that allows recursive function calls. The processor may build features on top of that; for example, many CPUs have an instruction to create a stack frame (the data structure on the stack that represents a function's parameters and local variables).



*Algol allowed you to declare functions inside other functions. Static scope reflects the way functions were nested in the program source - an inner function could access the variables and functions defined in it, in the scope in which it was defined, and in the scope in which that scope was defined, all the way up to global scope.
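
As a concrete (if anachronistic) illustration of such nesting and uplevel references, here is a minimal sketch using GNU C nested functions - a GCC extension, since standard C has no nested functions; the names are invented purely for illustration:

    /* Build with GCC; nested functions are a GNU extension. */
    #include <stdio.h>

    int outer(int base)
    {
        int offset = 10;            /* local of the enclosing scope */

        int inner(int x)            /* function nested inside outer */
        {
            /* Uplevel references: inner reads locals of outer.
               A display (or a chain of static links) is what lets
               the generated code locate base and offset at run time. */
            return base + offset + x;
        }

        return inner(1) + inner(2);
    }

    int main(void)
    {
        printf("%d\n", outer(100)); /* prints 223 */
        return 0;
    }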


























  • 1





    Like others mentioned, there's an obvious chicken and egg problem here. Regarding the hardware stack, which side drove development? Was the hardware stack initially implemented, and then taken advantage of? Or were stacks implemented in software, and the hardware stack implemented to address this?

    – Nathan
    Jan 22 at 14:54






  • 2





    @Nathan I think it is most likely that the hardware stack was invented after the need for efficient stacks was identified but this was probably fairly early in the history of digital computers.

    – JeremyP
    Jan 22 at 19:23











  • @JeremyP - since you mentioned the 2900, I'll add MU5, an ancestor of sorts to the 2900.

    – another-dave
    Jan 22 at 23:57


















53














Simply yes.



And not just a few instructions, but whole CPUs have been developed with languages in mind. Perhaps the most prominent is Intel's 8086. Already the basic CPU was designed to support the way high level languages handle memory management, especially stack allocation and usage. With BP, a separate register for stack frames and their addressing was added, in conjunction with short encodings for stack-related addressing, to make HLL programs perform well. The 80186/286 went further in this direction by adding Enter/Leave instructions for stack frame handling.



While it can be said that the base 8086 was geared more toward languages like Pascal or PL/M (*1,2), later incarnations added many ways to support the now prevalent C primitives - not least scaling factors for indices.




Since many answers now pile up various details of CPUs whose instructions may match up (or not), there are maybe two other CPUs worth mentioning: the Pascal Microengine and Rockwell's 65C19 (as well as the RTX2000).



Pascal Microengine, a CPU made for generic HLL implementations



The Pascal Microengine was a WD MCP1600 chipset (*3) based implementation of the virtual 16-bit UCSD p-code processor. Contrary to what the name suggests, it wasn't tied to Pascal as a language, but was a generic stack machine tailored to support concepts for HLL operations. Besides a rather simple, stack-based execution model, the most important part was a far-reaching and comfortable management of local memory structures for functions and function linking as well as data. Modern-day Java bytecode and native bytecode CPUs to run it (e.g. PicoJava) aren't in any way a new idea (*4).



R65C19 and N4000, CPUs enhanced or custom-made for a specific language



The Rockwell R65C19 is a 6500 variant with added support for Forth. Its 10 new threaded code instructions (*5) implemented the core functions (like Next) of a Forth system as single machine instructions.
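
To see why a single-instruction NEXT matters, here is a rough C sketch of a threaded-code inner interpreter - the function names and the tiny stack are invented, and a real Forth uses indirect threading rather than plain function pointers; the point is only that the fetch-advance-dispatch step (NEXT) runs between every single word:

    #include <stdio.h>

    /* A "word" is a routine; a compiled program is an array of pointers to
       words (a thread). ip walks the thread; the loop body below is NEXT:
       fetch the next word, advance ip, execute it. */
    typedef void (*word_t)(void);

    static word_t const *ip;                 /* Forth's instruction pointer */
    static int stack[16], *sp = stack;       /* tiny data stack             */

    static void lit_five(void) { *sp++ = 5; }
    static void lit_two(void)  { *sp++ = 2; }
    static void plus(void)     { sp--; sp[-1] += sp[0]; }
    static void dot(void)      { printf("%d\n", *--sp); }
    static void bye(void)      { ip = NULL; }

    int main(void)
    {
        /* Threaded code for the Forth program: 5 2 + . */
        static word_t const thread[] = { lit_five, lit_two, plus, dot, bye };

        ip = thread;
        while (ip) {                         /* NEXT, done in software */
            word_t w = *ip++;
            w();
        }
        return 0;                            /* prints 7 */
    }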



Forth as a language was developed with a keen eye on the way it is executed. It has more in common with assemblers than many other HLLs do (*6). So it's no surprise that, already in 1983, its inventor Charles Moore created a Forth CPU called N4000 (*7).




*1 - Most remarkable here are the string functions, which only make sense in languages supporting strings as a discrete data type.



*2 - Stephen Morse's 8086 primer is still a good read - especially when he talks about the finer details. Similarly recommended is his 2008 interview about the 8086's creation, where he describes his approach as mostly HLL-driven.



*3 - Which makes it basically a LSI-11 with different microcode.



*4 - As IT historians, we have seen each and every implementation already before, haven't we? So let's play a round of Zork.



*5 - There are other nice additions as well, like mathematical operations that ease filter programming - after all, the 65C19/29/39 series was the heart of many modems.



*6 - The distinction of assembler as not being an HLL, and miles apart from one, becomes quite blurry when looking closely anyway.



*7 - Later sold to Harris, who developed it into the RTX2000 series - with radiation hardened versions that power several deep space probes.


























  • 2





    Intel 8086 certainly had relatively high-level instructions, but I'd chalk that to the opposite you're saying - it was so that you could write (relatively) high-level code in assembly. There's huge amounts of instructions that are worthless for a "high-level" language like C, but great for assembly programmers. RISC CPUs were designed for compiled languages, CISC CPUs were primarily designed for assembly programmers. Of course, in practice, compilers weren't all that good at that point, so there was a lot of blurry lines between assembly programming and something like C.

    – Luaan
    Jan 22 at 11:29











  • @Luaan Calling C an HLL is always borderline. And here it's even less important, as C wasn't anywhere relevant during the late 70s. Pascal and similar languages were seen as the way to go. Further, that nice complex instructions are also a good idea for assembler doesn't make them made for assembler. Each and every compiler produces assembly in the end, thus directly benefiting from such operations. You may really want to read the interview.

    – Raffzahn
    Jan 22 at 11:42











  • I'm not saying that compilers couldn't use the more complex instructions - I'm saying that those instructions were redundant. You could do the same job faster with a couple of instructions, which compilers could easily do, but assembly programmers didn't really bother with unless necessary (though granted, this was a lot more important on 386 and 486 and outright silly on Pentium+, rather than the 8086). And while I've always been on the Pascal side in the great fight, I don't think there were many differences in the required instructions between Pascal and C.

    – Luaan
    Jan 22 at 11:52











  • As for the interview, I don't see them talking about high-level languages and the design of 8086. Would you care to point it out more specifically?

    – Luaan
    Jan 22 at 11:52











  • Add the Hobbit which was designed to run C (and similar CPUs like the Bellmac and CRISP that led to Hobbit). Intel 432 was similar in concept but added tagged memory and other support for built-in Multics/Unix-like security hierarchies.

    – Maury Markowitz
    Jan 22 at 15:29


















42














One interesting example of programming languages driving hardware development is the Lisp machine. Since "normal" computers of the time period couldn't execute Lisp code efficiently, and there was a high demand for the language in academia and research, dedicated hardware was built with the sole purpose of executing Lisp code. Although Lisp machines were initially developed for MIT's AI lab, they also saw success in computer animation.



These computers provided increased speed by using a stack machine instead of the typical register based design, and had native support for type checking lisp types. Some other important hardware features aided in garbage collection and closures. Here's a series of slides that go into more detail on the design: Architecture of Lisp Machines (PDF) (archive).



The architecture of these computers is specialized enough that in order to run C code, the C source is transpiled into Lisp and then run normally. The Vacietis compiler is an example of such a system.


























  • 2





    It's notable that there have been a few talks about CPUs specifically designed to run the intermediate languages of platforms like Java or .NET, though to my knowledge the attempts were mostly failures. And of course, then there are the VLIW architectures that were entirely designed for smart compilers.

    – Luaan
    Jan 22 at 11:32






  • 2





    Interesting to hear about C being transpiled into lisp! Thanks!

    – Nathan
    Jan 22 at 14:44






  • 2





    "and there was a high demand for the language" - citation needed. :-) My impression is more that there was a loud academic lisp fandom pushing for hardware that would make their language relevant to the real world, and very little interest from the real world in buying such machines or using lisp.

    – R..
    Jan 24 at 0:00











  • @R.. I clarified where that high demand was coming from. Better?

    – Soupy
    Jan 24 at 1:34











  • en.wikipedia.org/wiki/Jazelle "Jazelle DBX (Direct Bytecode eXecution) is an extension that allows some ARM processors to execute Java bytecode in hardware"

    – Mooing Duck
    Jan 25 at 19:36



















20














Yes. Case in point, the VAX. The instruction set design was influenced by the requirements of the compiled languages of the day. For example, the orthogonality of the ISA; the provision of instructions that map to language constructs such as 'case' statements (in the numbered-case sense of Hoare's original formulation, not the labelled-case of C), loop statements, and so on.



VAX Architecture Ref - see the Introduction.
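
To illustrate the sort of source construct such a 'case' instruction targets, here is a dense C switch (the command numbers and function names are invented); a compiler can turn it into a bounds check plus a table of code addresses, which is roughly what a single table-dispatch instruction packages up:

    #include <stdio.h>

    static void cmd_stop(void)  { puts("stop");  }
    static void cmd_start(void) { puts("start"); }
    static void cmd_reset(void) { puts("reset"); }

    /* Dense, consecutively numbered cases: subtract the lower bound,
       range-check, then index a table of code addresses. */
    void dispatch(int cmd)
    {
        switch (cmd) {
        case 10: cmd_stop();  break;
        case 11: cmd_start(); break;
        case 12: cmd_reset(); break;
        default: puts("unknown");   /* out-of-range values land here */
        }
    }

    int main(void)
    {
        dispatch(11);               /* prints "start"   */
        dispatch(99);               /* prints "unknown" */
        return 0;
    }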



I am not claiming the VAX is unique in this respect, just an example I know a little about. As a second example, I'll mention the Burroughs B6500 'display' registers. A display, in 1960s speak, is a mechanism for efficient uplevel references. If your language, such as Algol60, permits declaration of nested procedures to any depth, then arbitrary references to the local variables of different levels of enclosing procedure require special handling. The mechanism used (the 'display') was invented for KDF9 Whetstone Algol by Randell and Russell, and described in their book Algol 60 Implementation. The B6500 incorporates that into hardware:




The B6500/B7500 contains a network of Display Registers (D0 through D31) which are caused to point at the appropriate MSCW (Fig. 5). The local variables of all procedures global to the current procedure are addressed in the B6500/B7500 relative to the Display Registers.
































  • as well as more registers for local variables, and a large, regular address space rather than smaller banks that can be activated one at a time.

    – Erik Eidt
    Jan 21 at 23:47







  • 1





    My understanding is that VAX's powerful addressing modes and flexible instruction operands were mostly useful for human programmers at the time; VAX was a poor compiler target because compilers at the time weren't smart enough to take advantage of the funky stuff, and used more but simpler instructions (which VAX couldn't run faster). I haven't used VAX, but providing high-level-like functionality in hardware instructions directly (so your asm can "look" like high-level control constructs) sounds much more beneficial to humans than compilers. Or maybe they thought it would be good, but wasn't.

    – Peter Cordes
    Jan 23 at 16:19











  • i.e. maybe VAX architects hoped that compilers could be made that would translate high-level code to similar-looking asm using these features, but the discovery that that wasn't the case is what led to RISC designs, which were good at chewing through the multiple simple instructions compilers most naturally generated.

    – Peter Cordes
    Jan 23 at 16:25











  • I wasn't thinking so much of the 'fancy' instructions so much as I was thinking of the orthogonality of the instruction set; pretty much any operand specifiers could be used with any instruction. That makes compilation easier because fewer special cases. The main language I used on VMS, BLISS-32, had a compiler that was way better at using funky addressing modes than I was. You can interpret that how you like :-)

    – another-dave
    Jan 24 at 3:13


















14














Some manufacturers have directly admitted such



The first page of the Intel 8086 data sheet lists the processor's features, which include




  • Architecture Designed for Powerful Assembly Language and Efficient High Level Languages



In particular, C and other high-level languages use the stack for arguments and local variables. The 8086 has both a stack pointer (SP) and a frame pointer (BP) which address memory using the stack segment (SS) rather than other segments (CS, DS, ES).
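
To make that concrete, here is a small C function together with the sort of BP-framed code a compiler might emit for it - the assembly in the comments is a typical sequence assuming 16-bit ints and a near call, shown purely as an illustration, not actual compiler output:

    /* Caller pushes c, b, a (right to left), then CALLs sum3.    */
    int sum3(int a, int b, int c)   /*   push bp                  */
    {                               /*   mov  bp, sp              */
        int t = a + b;              /*   mov  ax, [bp+4]  ; a     */
                                    /*   add  ax, [bp+6]  ; b     */
        return t + c;               /*   add  ax, [bp+8]  ; c     */
    }                               /*   pop  bp                  */
                                    /*   ret                      */

    int main(void) { return sum3(1, 2, 3) - 6; }    /* exit code 0 */

With more locals than fit in registers, the prologue would also subtract from SP to reserve frame space below BP - the very pattern the ENTER/LEAVE instructions mentioned below fold into one instruction each.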



The datasheet for the 8087 co-processor has the following section:




PROGRAMMING LANGUAGE SUPPORT



Programs for the 8087 can be written in Intel's high-level languages for 8086/8088 and 80186/80188 systems; ASM-86 (the 8086, 8088 assembly language), PL/M-86, FORTRAN-86, and PASCAL-86.




The 80286 added several instructions to the architecture to aid high-level languages. PUSHA, POPA, ENTER, and LEAVE help with subroutine calls. The BOUND instruction was useful for array bounds checking and switch-style control statements. Other instructions unrelated to high-level languages were added as well.



The 80386 added bitfield instructions, which are used in C.




The Motorola MC68000 Microprocessor User's Manual states:




2.2.2 Structured Modular Programming



[...]
The availability of advanced, structured assemblers and block-structured high-level languages such as Pascal simplifies modular programming. Such concepts are virtually useless, however, unless parameters are easily transferred between and within software modules that operate on a re-entrant and recursive basis.
[...]
The MC68000 provides architectural features that allow efficient re-entrant modular programming. Two complementary instructions, link and allocate (LINK) and unlink (UNLK), reduce subroutine call overhead by manipulating linked lists of data areas on the stack. The move multiple register instruction (MOVEM) also reduces subroutine call programming overhead.
[...]
Other instructions that support modern structured programming techniques are push effective address (PEA), load effective address (LEA), return and restore (RTR), return from exception (RTE), jump to subroutine (JSR), branch to subroutine (BSR), and return from subroutine (RTS).




The 68020 added bitfield instructions, which are used in C.
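
For illustration, the sort of C bit-field access such extract/insert instructions (the 68020's BFEXTU/BFINS, for instance) map onto - the field names and widths here are invented:

    #include <stdio.h>

    /* A packed, register-like layout described with C bit-fields.
       Reading or writing a member is a bit-field extract or insert. */
    struct status {
        unsigned ready   : 1;
        unsigned error   : 1;
        unsigned channel : 4;
        unsigned count   : 10;
    };

    int main(void)
    {
        struct status s = { .ready = 1, .channel = 5, .count = 300 };
        s.count += 1;                            /* extract, add, insert back */
        printf("ch=%u count=%u\n", s.channel, s.count);   /* ch=5 count=301 */
        return 0;
    }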




Whereas the above processors added instructions to support programming languages, Reduced Instruction-Set Computers (RISC) took the opposite approach. By analyzing which instructions compilers actually used, they were able to discard many complex instructions that weren't being used. This allowed the architecture to be simplified, shorten the instruction cycle length, and reduce instructions to one cycle, speeding up processors significantly.































  • I interpret the quote from the 8087 manual in a way that the languages listed there (e.g. "PASCAL-86") have been designed to support the (already existing) 8087; the 8087 was not designed in a way to work well with the already existing "PASCAL-86" language.

    – Martin Rosenau
    Jan 22 at 9:14











  • Interesting to hear about the 80286 instructions. Would these have been added in anticipation of their use, or to improve the constructs already being used?

    – Nathan
    Jan 22 at 14:48











  • bound is not useful for switch statements: the docs for bound say it raises a #BR exception if the index is out of bounds. But in C, even if there's no default: case label, an input that doesn't match any case is not undefined-behaviour, it just doesn't run any of the cases. What if I don't write default in switch case?

    – Peter Cordes
    Jan 23 at 16:00











  • The 386 "bitfield" instructions are actually bitmap instructions, for testing or bts test-and-setting/resetting/flipping a single bit in a register or in memory. They're mostly not useful for C struct bitfield members. x86 still has crap-all for bitfield insert/extract compared to ARM ubfx / sbfx ([un]signed bitfield extract of an immediate range of bits), or PowerPC's very powerful rotate-and-mask immediate instructions like rlwinm that can extract a field from anywhere and shift it anywhere into another register. x86 BMI2 did add pdep, but that needs a mask in a register.

    – Peter Cordes
    Jan 23 at 16:04












  • @MartinRosenau: I think it's generally agreed that one of the HLLs 8086 architects had in mind was Pascal. en.wikipedia.org/wiki/Intel_8086#The_first_x86_design says "According to principal architect Stephen P. Morse, this [instructions supporting nested procedures] was a result of a more software-centric approach than in the design of earlier Intel processors (the designers had experience working with compiler implementations)."

    – Peter Cordes
    Jan 23 at 16:10


















12














Over time, there have been various language-specific CPUs, some so dedicated that it would be awkward to use them for a different language. For example, the Harris RTX-2000 was designed to run Forth. One could almost say it and other Forth CPUs were the language in hardware form. I'm not saying they understand the language, but they are designed to execute it at the "assembler" level.



Early on, Forth was known for being extremely memory efficient, fast, and, for programmers who could think bass-ackwards, quick to develop in. Having a CPU that could execute it almost directly was a no-brainer. However, memory got cheaper, CPU's got quicker, and programmers who liked thinking Forth got scarcer. (How many folks still use calculators in reverse Polish notation mode?)































  • What does "bass-ackwards" mean?

    – Wilson
    Jan 22 at 13:13






  • 1





    @Wilson Move the 'b' 4 characters to the right. Logic in Forth is opposite to most other languages. Parameters are generally on the stack and used in an object, object, action order. Compare this to the action(object1, object2) order of most other languages. I mentioned RPN calculators because they are the same way. Instead of 3*(2+5)=, you would use 3 2 5 + *.

    – RichF
    Jan 22 at 14:28






  • 2





    I was thinking of Forth almost as soon as I saw the question, but I didn't know a specific Forth-optimized processor so I didn't post an answer.

    – manassehkatz
    Jan 22 at 15:20






  • 2





    @Wilson, bass-ackwards is a non technical joke. It translates to "ass backwards", implying doing things in a strange reverse order such as walking backwards, with your ass in front. If you type it bass-ackwards, the joke becomes self referential.

    – BoredBsee
    Jan 22 at 18:41






  • 1





    For other bass-ackwards computers, see the English Electric KDF9. Rather than 'an accumulator' in the manner of its contemporaries, it had a 16-deep nesting store ('stack' to the youth of today) for expression evaluation. The arithmetic orders were zero-address. Note this was not a full stack in the manner of the B5000; it was more limited in applicability.

    – another-dave
    Jan 22 at 23:48


















10














Arguably, VLIW architectures were designed mainly for smart compilers. They rely on efficient building of individual very complex instructions (a single "instruction" can do many things at the same time), and while it's not impossible to write the code manually, the idea was that you could get better performance for your applications by using a better compiler, rather than having to upgrade your CPU.



In principle, the difference between e.g. an x86 superscalar CPU and something like SHARC or i860 is that x86 achieves instruction level parallelism at runtime, while SHARC is a very simple CPU design (comparatively) that relies on the compiler. In both cases, there are many tricks to reorder instructions, rename registers etc. to allow multiple instructions to run at the same time, while still appearing to execute them sequentially. The VLIW approach would be especially handy in theory for platforms like JVM or .NET, which use a just-in-time compiler - every update to .NET or JVM could make all your applications faster by allowing better optimizations. And of course, during compilation, the compiler has a lot better idea of what all of your application is trying to do, while the runtime approach only ever has a small subset to work with, and has to rely on techniques like statistical branch prediction.



In practice, the approach of having the CPU decide won out. This does make the CPUs incredibly complex, but it's a lot easier to just buy a new better CPU than to recompile or update all your applications; and frankly, it's a lot easier to sell a compatible CPU that just runs your applications faster :)
























  • 2





    RE: your last point. Yes, they're easy to sell, but darn hard to deliver!

    – Wilson
    Jan 22 at 13:09






  • 1





    JIT compilers need to run quickly, and probably don't have time to find / create as much instruction-level parallelism as possible. If you've looked at x86 asm created by current JVM jitters, well there's a reason Intel and AMD design their CPUs to be able to chew through lots of instructions quickly. Modern JITs create extra ILP by wasting instructions :P Why is 2 * (i * i) faster than 2 * i * i in Java? is hopefully at the lower end of HotSpot JVM code quality, but it misses huge amounts of asm-level micro-optimizations.

    – Peter Cordes
    Jan 23 at 15:39






  • 1





    @PeterCordes Well, I didn't mean we'd use those compilers just in time - .NET supports native image precompilation. The point is that you could do recompile all the native images after updating the runtime (ideally in the background, of course), without requiring the company that produces the application to provide a new build of their software. That said, the latest JIT compiler for .NET is pretty impressive in both speed and the code it produces - though it certainly does have its quirks (my 3D software renderer was a bit impeded by the floating point overhead :D).

    – Luaan
    Jan 23 at 20:27











  • Yeah that works, and might be a way for closed-source software to effectively get the benefit of being ahead-of-time compiled with -O3 -march=native to take full advantage of the target CPU. Like what open-source software in any language can do, e.g. Gentoo Linux compiles packages from source when you install.

    – Peter Cordes
    Jan 23 at 20:40






  • 1





    In the late 90s I worked with a VLIW system based on custom silicon (the forerunner is somewhat described at computer.org/csdl/proceedings/visual/1992/2897/00/00235189.pdf) and it had a custom compiler with some interesting features. All variables (unless otherwise qualified) were static which led to some interesting discussions on occasion. The system had up to 320 parallel processors doing video on demand for up to 2000 streams.

    – Peter Smith
    Jan 25 at 15:16


















10














Some ARM CPUs used to have partial support for executing Java bytecode in hardware with https://en.wikipedia.org/wiki/Jazelle Direct Bytecode eXecution (DBX).



With modern JITing JVMs, that became obsolete, so there was later a variant of Thumb2 mode (compact 16-bit instructions) called ThumbEE designed as a JIT target for managed languages like Java and C# https://en.wikipedia.org/wiki/ARM_architecture#Thumb_Execution_Environment_(ThumbEE)



Apparently ThumbEE has automatic NULL-pointer checks, and an array-bounds instruction. But that was deprecated, too, in 2011.




































    9














    Arguably, the relatively simple logical structure of DO loops in Fortran motivated the development of vector hardware on the early Cray and Cyber supercomputers. There may be some "chicken and egg" between hardware and software development though, since CDC Fortran included array slicing operations to encourage programmers to write "logic-free" loops long before that syntax became standardized in Fortran 90.



    Certainly the Cray XMP hardware enhancements compared with the Cray 1, such as improved "chaining" (i.e. overlapping in time) of vector operations, vector mask instructions, and gather/scatter vector addressing, were aimed at improving the performance of typical code written in "idiomatic" Fortran.



    The need to find a way to overcome the I/O bottlenecks caused by faster computation, without the prohibitive expense of large fast memory, led to the development of the early Cray SSD storage devices as an intermediate level between main memory and conventional disk and tape storage devices. Fortran I/O statements made it easy to read and write a random-access file as if it were a large two-dimensional array of data.



    See section 2 of http://www.chilton-computing.org.uk/ccd/supercomputers/p005.htm for a 1988 paper by the head of the Cray XMP design team.



    There was a downside to this, in that the performance of the first Cray C compilers (and hence the first implementation of the Unix-like Cray operating system UNICOS) was abysmal, since the hardware had no native character-at-a-time instructions, and there was little computer science theory available to attempt to vectorize idiomatic "C-style" loops with a relatively unstructured combination of pointer manipulation and logical tests, compared with Fortran's more rigidly structured DO loops.
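
    To illustrate the difference in loop shape (a sketch in C rather than Fortran, with invented function names): the first loop below is the "logic-free", fixed-trip-count form that maps well onto vector hardware, while the second, with pointer chasing and a data-dependent exit, gives a vectorizer nothing to work with:

        #include <stddef.h>

        /* Fixed trip count, independent iterations - the shape a
           vectorizing compiler (or a Fortran DO loop on a Cray) likes. */
        void saxpy(size_t n, float a, const float *x, float *y)
        {
            for (size_t i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }

        /* Pointer chasing with a data-dependent exit: each address and
           the trip count are only known one step at a time, so nothing
           here can be turned into vector operations. */
        struct node { int key; struct node *next; };

        const struct node *find(const struct node *p, int key)
        {
            while (p && p->key != key)
                p = p->next;
            return p;
        }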


























    • 1





      If the hardware didn't have character at a time instructions, leading to poor C compiler performance, then would it have been the case that the next iterations of the hardware added such instructions to improve compiler performance?

      – Nathan
      Jan 22 at 14:50











    • I've used TI DSPs which include a hardware instruction that behaves very much like a "do" loop. The hardware has three relevant registers and a flag: loop count, loop end, loop start, and loop enable. The "RPT" instruction loads the "loop end" register with the value an immediate operand, loads "loop start" with the address of the following instruction, and sets the "loop enable" flag. User code must set "loop count" beforehand. Any time the program counter equals "loop end" and "loop enable" is set, the DSP will decrement "loop count" and load the program counter with "loop start".

      – supercat
      Jan 22 at 22:16











    • If the loop counter gets decremented past zero, the loop enable flag will be cleared, so the loop will exit after the next repetition. TI's compiler is configurable to use the looping feature to improve performance (interrupts would only need to safe/restore its state if they use the feature themselves, so it's usually better to simply disable the feature within interrupt service routines than to save/restore the necessary registers).

      – supercat
      Jan 22 at 22:19






    • 1





      @Nathan: DEC Alpha AXP did that: first 2 silicon versions didn't have byte load / byte store. Instead, string functions were expected to operate a word at a time, if necessary using bitfield instructions to unpack / repack but the designers argued they wanted to encourage more efficient strlen and similar functions that checked a whole 64-bit chunk for any zero bytes with bithacks. 3rd-gen Alpha added byte load/store, but user-space code still generally avoided it in case it would run on older machines and be emulated. en.wikipedia.org/wiki/DEC_Alpha#Byte-Word_Extensions_(BWX)

      – Peter Cordes
      Jan 23 at 15:50











    • Some modern DSPs only have word-addressable memory. C implementations on those usually use CHAR_BIT=32 (or 16 for 16-bit machines), instead of doing software read-modify-write of a 32-bit word to implement an 8-bit char. But DSPs don't need to run general-purpose text-processing software like Unix systems.

      – Peter Cordes
      Jan 23 at 15:52


















    9














    Another specific example of hardware design to match the languages was the Rekursiv, which was designed to implement object-oriented language features in hardware.



    Our Rekursiv has been preserved in a museum.



    See https://en.wikipedia.org/wiki/Rekursiv





























    • Ha, I'd never heard of the Rekursiv, but the name made me immediately think that it had to have something to do with Linn.

      – another-dave
      Jan 23 at 3:05











    • how did the hardware optimize for OO? The only thing I can think of is maybe perfecting the vtable call

      – sudo rm -rf slash
      Jan 27 at 13:19


















    9














    Null-terminated strings



    When C was invented, many different string forms were in use at the same time. String operations were probably handled mostly in software, therefore people could use whatever format they wanted. The null-terminated string wasn't a new idea; however, special hardware support, if any, wasn't necessarily meant for it.



    But later due to the domination of C, other platforms began adding accelerated instructions for the null-terminated format:




    This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.



    https://en.wikipedia.org/wiki/Null-terminated_string#History




    On x86, Intel introduced many instructions for text processing in SSE4.2, which do things in parallel until the first null-termination character. Before that there was SCAS - the scan string instruction - which can be used to look for the position of the termination character:




    mov ecx, 100 ; search up to 100 characters
    xor eax, eax ; search for 0
    mov edi, offset string ; search this string
    repne scas byte ptr [edi] ; scan bytes looking for 0 (find end of string)
    jnz toolong ; not found
    sub edi, (offset string) + 1 ; calculate length


    https://blogs.msdn.microsoft.com/oldnewthing/20190130-00/?p=100825
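
    At the C level, that scan is essentially what strlen does - a minimal sketch (not the optimized library code, which checks a word at a time or uses the SSE4.2 instructions mentioned above):

        #include <stdio.h>

        /* Walk bytes until the terminating '\0', return how many were passed. */
        static unsigned long my_strlen(const char *s)
        {
            const char *p = s;
            while (*p != '\0')
                p++;
            return (unsigned long)(p - s);
        }

        int main(void)
        {
            printf("%lu\n", my_strlen("hello"));   /* prints 5 */
            return 0;
        }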




    We all know that nowadays it's a bad idea. Unfortunately it was baked into C, hence used by every modern platform and can't be changed anymore. Luckily we have std::string in C++




    The use of a string with some termination character seems to have already existed on the PDP-7, where the user could choose the termination character:




    The string is terminated by the second occurrence of the delimiting character chosen by the user



    http://bitsavers.trailing-edge.com/pdf/dec/pdp7/PDP-7_AsmMan.pdf




    However, a real null-termination character can be seen in use on the PDP-8 (see the last line in the code block). Then the ASCIZ keyword was introduced in the assembly language for the PDP-10/11




    Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.




    The B language, which appeared in 1969 and became the precursor to C, might have been influenced by that and uses a special character for termination, although I'm not sure which one was chosen




    In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e



    Dennis M. Ritchie, Development of the C Language




    Further reading



    • What's the rationale for null terminated strings?

    • How did the PDP-8 handle strings?

    • What's the rationale for null terminated strings? on Reddit

    • https://en.wikipedia.org/wiki/String_(computer_science)#Representations

























    • 3





      If you argue "C uses it because..." you're not providing an answer to the question, but rather the opposite

      – tofro
      Jan 22 at 9:10







    • 3





      @tofro PDP affects C design decision, but C affects development of other hardware platforms. Those are 2 separate processes

      – phuclv
      Jan 22 at 9:41






    • 1





      @JeremyP But they existed very specific on DEC platforms. Can't find any auto increment on a /360 or most other computers - similar character based moves with flags set according to the character transfered.

      – Raffzahn
      Jan 22 at 10:27






    • 1





      @Raffzahn C was not developed with specific features of the PDP 11 in mind. For example, the ++ and -- operators were introduced before C was ported to the PDP 11. Read it from the man himself.

      – JeremyP
      Jan 22 at 10:42






    • 2





      @JeremyP Erm, have you tried to read my comment before replying ? I didn't say PDP-11, but DEC. But when it comes to characters it's the PDP-11 where the zero termination happened. After all, the char type was only introduced into B when it got retargeted to produce PDP-11 code. Which also was before C was even fully defined - definition ( and move move from B to C) happened on the PDP-11. Not on any other machine. Further retargeting happened thereafter. Hint: Read the very link you gave :))

      – Raffzahn
      Jan 22 at 11:56


















    8














    I recall, back in the 80’s, and referenced in the Wikipedia article, Bellmac 32 CPU, which became the AT&T Western Electric WE32100 CPU was supposedly designed for the C programming language.



    This CPU was used by AT&T in their 3B2 line of Unix systems. There was also a single board VME bus version of it that some third parties used. Zilog also came out with a line of Unix systems using this chip - I think they were a second source for it for AT&T.



    I did a lot of work with these in the 80’s and probably early 90’s. It was pretty much a dog in terms of performance, if I remember.




































      8














      Yes, definitely. A very good example is how Motorola moved from the 68k architecture to the (somewhat) compatible ColdFire range of CPUs. (It is also an example of how such an evolution might go wrong, but that is another story).



      The Motorola Coldfire range of CPUs and microcontrollers was basically a 68000 CPU32 core with lots of instructions and addressing modes removed that "normal" C compilers wouldn't use frequently (like arithmetic instructions on byte and word operands, some complex addressing modes, and addressing modes that act only on memory in favor of registers, ...). They also simplified the supervisor mode model and removed some rarely used instructions completely. The whole instruction set was "optimized for C and C++ compilers" (this is how Motorola put it) and the freed-up chip space was used to tune the CPUs for performance (like adding larger data and instruction caches).



      In the end, the changes made the CPUs quite a bit too incompatible for customers to stay within the product family, and the MC68k range of CPUs went towards its demise.





























      • I couldn't quite figure out what you meant by "addressing modes that act only on memory"; these are very useful to a C compiler?

        – Wilson
        Jan 22 at 9:47











      • The 68k has a number of very complex addressing modes, like "memory indirect with register offsets and postincrement" that contemporary C compilers didn't/couldn't/wouldn't use according to Motorola and were removed in the Coldfire architecture..

        – tofro
        Jan 22 at 10:40












      • That sounds like *ptr[offset++] to me?

        – Wilson
        Jan 22 at 11:52






      • 2





        that's more like array[x++ + sizeof(array[0]) * y]. Compilers of that time apparently weren't able to make use of such complex addressing modes. Might be different with today's compilers.

        – tofro
        Jan 22 at 19:29


















      7














      Another interesting example are Java processors, CPUs that execute (a subset of) Java Virtual Machine bytecode as their instruction sets.



      If you’re interested enough to ask this question, you might want to read one of the later editions of Andrew Tanenbaum’s textbook, Structured Computer Organization¹, in which he walks the reader step-by-step through the design of a simple Java processor.



      ¹ Apparently not the third edition or below. It’s in Chapter 4 of the fifth edition.


























      • 1





        You must be thinking of a particular edition of Tanenbaum's textbook. I have the third edition which doesn't mention Java anywhere -- not surprisingly since it was published 6 years before Java 1.0. (The copyright dates in the third edition are 1976, 1984, 1990).

        – Henning Makholm
        Jan 22 at 15:28











      • @HenningMakholm Yes, you're correct. Let me check.

        – Davislor
        Jan 22 at 18:13






      • 2





        While I don't see much use for a processor running Java bytecode vs. using a JIT, garbage-collected languages could benefit from some features that I've not seen supported in hardware such as support for an cache-efficient lazy bit-set operation with semantics that if synchronization is forced after multiple threads have independently set various bits in the same word, all threads would see all bits that are set. Bits could only be cleared at a time when all threads that might set them were stopped. Such a feature would greatly improve the efficiency of maintaining "dirty" flags.

        – supercat
        Jan 25 at 20:36


















      7














      Another example is the decline of binary-coded decimal instructions



      In the past it was common for computers to be decimal or have instructions for decimal operations. For example, x86 has AAM, AAD, AAA, FBLD... for operating on packed, unpacked and 10-byte BCD values. Many other classic architectures also have similar features. However they're rarely used, since modern languages often don't have a way to access those instructions: they either lack a decimal integer type completely (like C or Pascal), or don't have a decimal type that can map cleanly to BCD instructions
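
      As an illustration of what such instructions saved, here is a sketch in C of adding two packed-BCD bytes in software - the nibble fix-ups are essentially the adjustment x86's DAA performed in hardware after a plain binary ADD (the function itself is invented for illustration):

          #include <stdio.h>

          /* Add two packed-BCD bytes (two decimal digits each).
             Returns the packed-BCD sum; *carry reports overflow past 99. */
          static unsigned char bcd_add(unsigned char a, unsigned char b, int *carry)
          {
              unsigned int sum = a + b;                 /* plain binary add   */

              if ((sum & 0x0F) > 9 || ((a & 0x0F) + (b & 0x0F)) > 0x0F)
                  sum += 0x06;                          /* fix the low digit  */
              if (sum > 0x99)
                  sum += 0x60;                          /* fix the high digit */

              *carry = (sum > 0xFF);
              return (unsigned char)sum;
          }

          int main(void)
          {
              int carry;
              unsigned char r = bcd_add(0x38, 0x47, &carry);   /* 38 + 47 */
              printf("%02X carry=%d\n", r, carry);             /* 85 carry=0 */
              return 0;
          }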



      The result is that BCD instructions started to disappear. In x86 they're micro-coded, therefore very slow, which makes people avoid them even more. Later AMD removed BCD instructions in x86-64. Other manufacturers did the same in newer versions of their architectures. Having said that, a remnant of BCD operations is still there in the FLAGS register in x86-64 and many other platforms that use flags: the half-carry flag. Newly implemented architectures like ARM, MIPS, Sparc, RISC-V also didn't get any BCD instructions and most of them don't use a flag register



      In fact C and C++ allow float, double and long double to be decimal; however, none of the implementations use it for the default floating-point types, because modern computers are all binary and are bad at decimal operations. Very few architectures have decimal floating-point support.



      Many C and C++ compilers do have decimal floating-point types as an extension, such as gcc with _Decimal32, _Decimal64, and _Decimal128. Similarly some other modern languages also have decimal types, however those are mostly big floating-point types for financial or scientific problems and not an integer BCD type. For example decimal in C# is a floating-point type with the mantissa stored in binary, thus BCD instructions would be no help here. Arbitrary-precision decimal types like BigInteger in C# and BigDecimal in Ruby or Java also store the mantissa as binary instead of decimal for performance. A few languages do have a fixed-point decimal monetary type, but the significant part is also in binary



      That said, a few floating-point formats can still be stored in BCD or a related form. For example the mantissa in IEEE-754 decimal floating-point types can be stored in either binary or DPD (a highly-packed decimal format which can then be converted to BCD easily). However I doubt that decimal IEEE-754 libraries use BCD instructions, because they often don't exist at all on modern computers, and where they do exist they'd be extremely slow




      BCD was used in many early decimal computers, and is implemented in the instruction set of machines such as the IBM System/360 series and its descendants, Digital Equipment Corporation's VAX, the Burroughs B1700, and the Motorola 68000-series processors. Although BCD per se is not as widely used as in the past and is no longer implemented in newer computers' instruction sets (such as ARM; x86 does not support its BCD instructions in long mode any more), decimal fixed-point and floating-point formats are still important and continue to be used in financial, commercial, and industrial computing, where subtle conversion and fractional rounding errors that are inherent in floating point binary representations cannot be tolerated.



      https://en.wikipedia.org/wiki/Binary-coded_decimal#Other_computers_and_BCD




      • Does a compiler use all x86 instructions?

      • Does a compiler use all x86 instructions? (2010)

      • Do modern x86 processors have native support for decimal floating point arithmetic?






























      • Was it a case that these instructions were rarely used to begin with, or just that alternative implementations (if used) like you mentioned better suited to modern languages?

        – Nathan
        Jan 22 at 18:40






      • 1





        Bizarrely, some microcontroller designers are incorporating BCD-based real-time-clock-calendar circuitry into ARM devices. That might be handy if for some reason software wanted to read a time and show it as a hex dump, but it adds silly overhead when doing almost anything else.

        – supercat
        Jan 22 at 22:12






      • 2





        They were never common on high level languages. The only notable language with BCD support is COBOL, which is still in use today by financial institutions. However they're used a lot in hand-written assembly in the past. Even nowadays BCD is still commonly used in microcontrollers as @supercat said, because they don't have a fast multiplier/divider and with BCD they can do arithmetic in decimal directly (like outputting numbers to 7-segment LEDs) instead of messing around with divide by 10 to convert to decimal

        – phuclv
        Jan 23 at 1:13






      • 1





        @phuclv: The only place I've seen any new microcontroller designs use BCD is in real-time clock/calendar modules. Even in applications that need to display large numbers, it's more practical to use one byte per digit or use an optimized function whose behavior is equivalent to int temp=256*remainder + *p; *p=temp/10; remainder = temp % 10; *p++; which can be implemented efficiently even on platforms that lack "divide" instructions, and can be used to convert any size binary number to decimal fairly easily.

        – supercat
        Jan 23 at 2:18
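        For illustration, a short sketch of the byte-wise divide-by-10 approach described above (the function names and sample value are mine, not supercat's code): each pass over the big-endian byte array divides the whole number by 10 and yields one decimal digit, using nothing wider than 16-bit arithmetic and no divide instruction wider than that.

        #include <stdio.h>

        /* Divide the big-endian number num[0..len-1] by 10 in place and return
           the remainder, i.e. the next decimal digit (least significant first). */
        static unsigned div10(unsigned char *num, size_t len)
        {
            unsigned remainder = 0;
            for (size_t i = 0; i < len; i++) {
                unsigned temp = 256u * remainder + num[i];
                num[i] = (unsigned char)(temp / 10);
                remainder = temp % 10;
            }
            return remainder;
        }

        static int is_zero(const unsigned char *num, size_t len)
        {
            for (size_t i = 0; i < len; i++)
                if (num[i]) return 0;
            return 1;
        }

        int main(void)
        {
            unsigned char num[4] = { 0x00, 0x01, 0xE2, 0x40 };  /* 123456, big-endian */
            char digits[16];
            int n = 0;

            do                                   /* collect digits, least significant first */
                digits[n++] = (char)('0' + div10(num, sizeof num));
            while (!is_zero(num, sizeof num));

            while (n--) putchar(digits[n]);      /* print most significant first: 123456 */
            putchar('\n');
            return 0;
        }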






      • 1





        Fun-fact: AAM is literally just 8-bit division by an immediate, and is just as fast as div r/m8 even on modern CPUs (because AAM can use the same divider hardware, but the dividend is only 8-bit instead of 16), It puts the quotient and remainder in opposite registers from div, though, so actually it costs one extra uop more than div r/m8 on some CPUs, presumably doing a div and then swapping. But on some Intel CPUs it's still slightly faster than div r8, especially if you count mov bl, imm8 against div. Of course if you care about perf not size, you use a multiplicative inverse.

        – Peter Cordes
        Jan 23 at 15:28


















      6














      Yet another example: The Prime minicomputer had a segmented architecture, and anything in segment 07777 was a null pointer (Prime used octal, and the top 4 bits of the 16-bit word had other uses). Segment 0 was a kernel segment, and just loading 0 into a segment register in user code was an access violation. This would have been fine in properly written C (int* p = 0; stored bit pattern 7777/0 into p).



      However, it turns out that a lot of C code assumes that if you memset a block of memory to all-bits-zero, any contained pointers will have been set to NULL. They eventually had to add a whole new instruction called TCNP (Test C Null Pointer).
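      For illustration, a small sketch of the two idioms (the struct and function names are mine): the memset version assumes a null pointer is all-bits-zero, which the C standard does not guarantee and which was false on the Prime, while the explicit version is portable.

      #include <string.h>
      #include <stddef.h>

      struct node { struct node *next; int value; };

      void init_nonportable(struct node *n)
      {
          memset(n, 0, sizeof *n);   /* only yields NULL where null pointers are all-bits-zero */
      }

      void init_portable(struct node *n)
      {
          n->next  = NULL;           /* always stores the machine's real null representation */
          n->value = 0;
      }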






      share|improve this answer


















      • 1





        C is just too flexible for its own good, and/or its standard library sucks. e.g. memset can only take a repeating pattern 1 char wide, so the only portable way to fill an array with NULL is an actual loop. But that's not efficient on any implementation that doesn't recognize it as a memset idiom or auto-vectorize it. So basically C doesn't give you the tools to conveniently deal with all the possible unusual implementations it allows, making it really easy and common to write non-portable code.

        – Peter Cordes
        Jan 23 at 16:35











      • @PeterCordes: You can write static const foo_t zero_foo = 0; and then memcpy from that - but that does waste a chunk of memory, and doesn't help if people allocate stuff with calloc (and it means that zeroing the memory needs twice the memory bandwidth).

        – Martin Bonner
        Jan 23 at 16:40











      • Ok, so not the only portable way, but there's no portable way that's as efficient as memset on typical implementations. Having arbitrary-sized arrays of NULL pointers of each type is totally impractical. (C does allow different object-representations for different pointer types, right? So even an arbitrary-sized array of (void*)NULL isn't enough.) Or you could loop in chunks of memcpying 1k at a time so you can have bounded sizes for your copy sources, but then you probably bloat each zeroing site with a loop around memcpy if the size isn't provably smaller than 4kiB at runtime.

        – Peter Cordes
        Jan 23 at 16:51











      • And dirtying cache for these stupid source arrays is just horrible.

        – Peter Cordes
        Jan 23 at 16:51






      • 1





        "C does allow different object-representations for different pointer types, right?" - Yes, and the Prime used this. A pointer was segment + word offset, unless you needed a character pointer (or void ptr) - when one of the bits in the segment word was set, indicating that there was a third word which was the bit offset (which was always 0 or 8). It's a lovely machine for violating people's assumptions about "how all computers work". It used ASCII, but ASCII is seven-bit, and they always set the top bit!

        – Martin Bonner
        Jan 23 at 16:54


















      5














      Some more examples of programming languages affecting hardware design:



      The MIPS RISC ISA often seems strange to newcomers: instructions like ADD generate exceptions on signed integer overflow. It is necessary to use ADDU, add unsigned, to get the usual 2’s complement wrap.



      Now, I wasn’t there at the time, but I conjecture that MIPS provided this behavior because it was designed around the Stanford benchmarks - which were originally written in Pascal, a language that requires overflow detection.



      The C programming language does not require overflow traps. The new MIPSr6 ISA (circa 2012) gets rid of the integer overflow trapping instructions - at least those with a 16-bit immediate - in order to free up opcode space. I was there when this was done.
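      As a rough sketch of why: since signed overflow is undefined behaviour in C, compilers typically lower a plain int addition to the wrapping ADDU/ADDIU, and code that does want detection uses an explicit check (here via a GCC/Clang builtin) rather than a trapping ADD. This example is mine, not any particular compiler's output:

      #include <stdio.h>
      #include <limits.h>

      int main(void)
      {
          int a = INT_MAX, b = 1, sum;

          if (__builtin_add_overflow(a, b, &sum))   /* GCC/Clang extension: add + branch, no trap */
              puts("overflow detected in software");
          else
              printf("sum = %d\n", sum);
          return 0;
      }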





      I can testify that programming languages influenced many modern x86 features, at least from P6 onwards. Mainly C/C++; to some extent JavaScript.






      share|improve this answer




















      • 1





        Indeed, the wiki says that the integer overflow trapping instructions with a 16-bit immediate were removed. That's good, since no one uses add or sub in MIPS anyway.

        – phuclv
        Jan 23 at 16:04






      • 2





        Nobody uses MIPS’ trap on overflow instructions. Certainly not C code. But JavaScript requires overflow detection. MIPSr6 added a branch on overflow instruction. But that requires extra code. I conjecture that if OS and JavaScript were defined together, trap on overflow might be better. However, JavaScript engines that want to be portable would not do this.

        – Krazy Glew
        Jan 23 at 16:05












      • autocorrect typo: s/oppose/opcode/.

        – Peter Cordes
        Jan 23 at 16:29











      • There are many purposes for which having a program terminate abnormally would be far preferable to having it produce incorrect output that seems valid. There are other purposes for which it is useful for a program which receives a mixture of valid and invalid data produce a corresponding mixture of correct and meaningless output, so the presence of invalid data doesn't prevent the program from correctly producing the portions of the output associated with the correct data. Having both instructions available facilitates both kinds of programming in languages that expose them.

        – supercat
        Jan 25 at 20:18












      • @supercat and your point is? Trapping on overflow does not necessarily mean terminating the program. In the bad old days we used to have trap handlers that actually did things - like look up the trap PC, and decide what to do. I.e. trapping can be the exact equivalent of an IF statement - the tradeoff being that the IF statement requires an extra instruction, slowing down the case where IF-overflow does not happen, whereas the trap slows down the case where the IF-overflow happens. And requires cooperation between the compiler and the part of the runtime that handles traps.

        – Krazy Glew
        Jan 25 at 20:23


















      2














      Yes.
      More recently, the TPUs (Tensor Processing Units) designed by Google to accelerate AI work are built to efficiently execute workloads written against their TensorFlow framework.






      share|improve this answer































        42














        Interesting question with an interesting answer.



        First let me get one thing out of the way:




        One example from this answer mentions how C pointers were, at least in part, influenced by the design of the PDP-11




        It's a myth to suggest C's design is based on the PDP-11. People often quote, for example, the increment and decrement operators because they have an analogue in the PDP-11 instruction set. This is, however, a coincidence. Those operators were invented before the language was ported to the PDP-11.



        There are actually two answers to this question



        • processors that are targeted to a specific high level language

        • processors that include features that a high level language might find useful.

        In the former category we have most of the interesting eventual dead ends in computer hardware history. Perhaps one of the earliest examples of a CPU architecture being targeted at a high level language is the Burroughs B5000 and its successors, a family of machines targeted at Algol. In fact, there wasn't really a machine language as such that you could program in.



        The B5000 had a lot of hardware features designed to support the implementation of Algol. It had a hardware stack and all data manipulations were performed on the stack. It used tagged descriptors for data so the CPU had some idea of what it was dealing with. It had a series of registers called display registers that were used to model static scope* efficiently.



        Other examples of machines targeted at specific languages include the Lisp machine already mentioned, arguably the Cray series of supercomputers for Fortran - or even just Fortran loops, the ICL 2900 series (also block structured high level languages), some machines targeted at the Java virtual machine (some ARM processors have hardware JVM support) and many others.



        One of the drivers behind creating RISC architectures was the observation that compilers tended to use only a small subset of the available combinations of instructions and addressing modes available on most CPU architectures, so RISC designers ditched the unused ones and filled the space previously used for complex decoding logic with more registers.



        In the second category, we have individual features in processors targeted at high level languages. For example, the hardware stack is a useful feature for an assembly language programmer, but more or less essential for any language that allows recursive function calls. The processor may build features on top of that; for example, many CPUs have an instruction to create a stack frame (the data structure on the stack that holds a function's parameters and local variables).



        *Algol allowed you to declare functions inside other functions. Static scope reflects the way functions were nested in the program source - an inner function could access the variables and functions defined in itself, in the scope in which it was defined, in the scope enclosing that one, and so on all the way up to global scope.
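        As a rough modern analogue (my sketch, using GCC's nested-function extension rather than Algol, and not standard C), this is the kind of lexical reach-up that the display registers made cheap:

        #include <stdio.h>

        int main(void)
        {
            int outer = 10;                /* lives in main's stack frame        */

            void inner(void)               /* nested function: a GNU C extension */
            {
                printf("%d\n", outer);     /* reaches one lexical level up, much
                                              like following one display register */
            }

            inner();                       /* prints 10 */
            return 0;
        }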






        share|improve this answer




















        • 1





          Like others mentioned, there's an obvious chicken and egg problem here. Regarding the hardware stack, which side drove development? Was the hardware stack initially implemented, and then taken advantage of? Or were stacks implemented in software, and the hardware stack implemented to address this?

          – Nathan
          Jan 22 at 14:54






        • 2





          @Nathan I think it is most likely that the hardware stack was invented after the need for efficient stacks was identified but this was probably fairly early in the history of digital computers.

          – JeremyP
          Jan 22 at 19:23











        • @JeremyP - since you mentioned the 2900, I'll add MU5, an ancestor of sorts to the 2900.

          – another-dave
          Jan 22 at 23:57


























        53














        Simply yes.



        And not just a few instructions, but whole CPUs have been developed with languages in mind. Perhaps the most prominent is Intel's 8086. Even the base CPU was designed to support the way high level languages handle memory management, especially stack allocation and usage. With BP, a separate register for stack frames and addressing was added, in conjunction with short encodings for stack-related addressing, to make HLL programs perform well. The 80186/286 went further in this direction by adding Enter/Leave instructions for stack frame handling.
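        To illustrate what that frame support is for (the function is mine, and the generated code in the comment is a paraphrase of typical 16-bit small-model output, not verbatim compiler output):

        /* A typical 16-bit compiler would open this function with
               push bp / mov bp,sp / sub sp,2     (or a single  ENTER 2,0  on the 80186+),
           address the parameter 'a' as [bp+4] and the local as [bp-2],
           and close it with
               mov sp,bp / pop bp / ret           (or LEAVE / ret).                        */
        int twice_plus(int a, int b)
        {
            int local = a + a;   /* local variable lives in the BP-addressed stack frame */
            return local + b;    /* parameters are reached at positive offsets from BP   */
        }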



        While it can be said that the base 8086 was geared more toward languages like Pascal or PL/M (*1,2), later incarnations added many ways to support the now prevalent C primitives - not least scaling factors for indices.




        Since many answers are now piling up various details of CPUs whose instructions may (or may not) match up, there are maybe two other CPUs worth mentioning: the Pascal Microengine and Rockwell's 65C19 (as well as the RTX2000).



        Pascal Microengine, a CPU made for generic HLL implementations



        The Pascal Microengine was an implementation of the virtual 16-bit UCSD p-code processor based on the WD MCP1600 chipset (*3). Contrary to what the name suggests, it wasn't tied to Pascal as a language, but was a generic stack machine tailored to support the concepts behind HLL operations. Besides rather simple stack-based execution, the most important part was far-reaching and comfortable management of local memory structures for functions, function linkage, and data. Today's Java bytecode and its execution by a native bytecode CPU (e.g. PicoJava) is in no way a new idea (*4).



        R65C19 and N4000, CPUs enhanced or custom-made for a specific language



        The Rockwell R65C19 is a 6500 variant with added support for Forth. Its 10 new threaded code instructions (*5) implemented the core functions (like Next) of a Forth system as single machine instructions.
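        For a feel of what such an instruction replaces, here is my own much-simplified C model of a threaded-code interpreter (function-pointer dispatch standing in for real Forth threading; none of this is Rockwell's code) - the while loop below is the NEXT operation that the R65C19 collapses into a single machine instruction:

        #include <stdio.h>

        typedef void (*prim_t)(void);           /* a "word" is the address of a primitive */

        static const prim_t *ip;                /* Forth instruction pointer              */
        static int stack[16], *sp = stack;      /* tiny data stack                        */

        static void lit5(void) { *sp++ = 5; }               /* push the literal 5         */
        static void dup_(void) { *sp = sp[-1]; sp++; }      /* duplicate top of stack     */
        static void add_(void) { sp--; sp[-1] += *sp; }     /* add top two items          */
        static void dot_(void) { printf("%d\n", *--sp); }   /* pop and print              */
        static void bye_(void) { ip = NULL; }               /* stop the interpreter       */

        int main(void)
        {
            static const prim_t thread[] = { lit5, dup_, add_, dot_, bye_ };

            ip = thread;
            while (ip) {            /* NEXT: fetch the next word's code address... */
                prim_t w = *ip++;
                w();                /* ...and jump to it                           */
            }
            return 0;               /* prints 10                                   */
        }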



        Forth as a language was developed with a keen eye on the way it is executed. It has more in common with assemblers than many other HLLs do (*6). So it's no surprise that, already in 1983, its inventor Charles Moore created a Forth CPU called the N4000 (*7).




        *1 - Most remarkable here are the string functions, which only make sense in languages that support strings as a discrete data type.



        *2 - Stephen Morse's 8086 primer is still a good read - especially when he talks about the finer details. Similarly recommended is his 2008 interview about the creation of the 8086, in which he describes his approach as mostly HLL-driven.



        *3 - Which makes it basically a LSI-11 with different microcode.



        *4 - As IT historians, we have seen each and every implementation already before, haven't we? So let's play a round of Zork.



        *5 - There are other nice additions as well, like mathematical operations that ease filter programming - after all, the 65C19/29/39 series was the heart of many modems.



        *6 - The distinction that assembler is not an HLL and is miles apart from one becomes quite blurry on close inspection anyway.



        *7 - Later sold to Harris, who developed it into the RTX2000 series - with radiation hardened versions that power several deep space probes.






        share|improve this answer




















        • 2





          Intel 8086 certainly had relatively high-level instructions, but I'd chalk that to the opposite you're saying - it was so that you could write (relatively) high-level code in assembly. There's huge amounts of instructions that are worthless for a "high-level" language like C, but great for assembly programmers. RISC CPUs were designed for compiled languages, CISC CPUs were primarily designed for assembly programmers. Of course, in practice, compilers weren't all that good at that point, so there was a lot of blurry lines between assembly programming and something like C.

          – Luaan
          Jan 22 at 11:29











        • @Luaan Calling C a HLL is always borderline. And here it's even less important, as C wasn't anywhere near relevant during the late 70s; Pascal and similar languages were seen as the way to go. Further, the fact that nice complex instructions are also a good idea for assembler doesn't mean they were made for assembler. Each and every compiler produces assembly in the end, thus directly benefiting from such operations. You may really want to read the interview.

          – Raffzahn
          Jan 22 at 11:42











        • I'm not saying that compilers couldn't use the more complex instructions - I'm saying that those instructions were redundant. You could do the same job faster with a couple of instructions, which compilers could easily do, but assembly programmers didn't really bother with unless necessary (though granted, this was a lot more important on 386 and 486 and outright silly on Pentium+, rather than the 8086). And while I've always been on the Pascal side in the great fight, I don't think there were many differences in the required instructions between Pascal and C.

          – Luaan
          Jan 22 at 11:52











        • As for the interview, I don't see them talking about high-level languages and the design of 8086. Would you care to point it out more specifically?

          – Luaan
          Jan 22 at 11:52











        • Add the Hobbit, which was designed to run C (and similar CPUs like the Bellmac and CRISP that led to the Hobbit). The Intel 432 was similar in concept but added tagged memory and other support for built-in Multics/Unix-like security hierarchies.

          – Maury Markowitz
          Jan 22 at 15:29


























        42














        One interesting example of programming languages driving hardware development is the LISP machine. Since "normal" computers of the time period couldn't execute Lisp code efficiently, and there was high demand for the language in academia and research, dedicated hardware was built with the sole purpose of executing Lisp code. Although Lisp machines were initially developed for MIT's AI lab, they also saw success in computer animation.



        These computers provided increased speed by using a stack machine instead of the typical register-based design, and had native support for type-checking Lisp types. Other important hardware features aided garbage collection and closures. Here's a series of slides that goes into more detail on the design: Architecture of Lisp Machines (PDF) (archive).
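        To make the type-checking point concrete, here is my own rough sketch (plain C, not any real Lisp machine's layout) of what a software Lisp on a conventional CPU has to encode and test on every operation - the check that Lisp machine hardware performed for free on its tag bits:

        #include <stdio.h>

        typedef enum { TAG_FIXNUM, TAG_CONS, TAG_SYMBOL } tag_t;

        typedef struct lispobj {
            tag_t tag;                                /* type tag, checked on every operation */
            union { long fixnum; struct lispobj *car; } u;
        } lispobj;

        static long checked_fixnum(const lispobj *o)
        {
            if (o->tag != TAG_FIXNUM) {               /* the test a Lisp machine did in hardware */
                fprintf(stderr, "type error\n");
                return 0;
            }
            return o->u.fixnum;
        }

        int main(void)
        {
            lispobj five = { TAG_FIXNUM, { .fixnum = 5 } };
            printf("%ld\n", checked_fixnum(&five));   /* prints 5 */
            return 0;
        }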



        The architecture of these computers is specialized enough that, in order to run C code, the C source is transpiled into Lisp and then run normally. The Vacietis compiler is an example of such a system.






        share|improve this answer




















        • 2





          It's notable that there have been a few attempts at CPUs specifically designed to run the intermediate languages of platforms like Java or .NET, though to my knowledge they were mostly failures. And of course, then there are the VLIW architectures that were entirely designed around smart compilers.

          – Luaan
          Jan 22 at 11:32






        • 2





          Interesting to hear about C being transpiled into lisp! Thanks!

          – Nathan
          Jan 22 at 14:44






        • 2





          "and there was a high demand for the language" - citation needed. :-) My impression is more that there was a loud academic lisp fandom pushing for hardware that would make their language relevant to the real world, and very little interest from the real world in buying such machines or using lisp.

          – R..
          Jan 24 at 0:00











        • @R.. I clarified where that high demand was coming from. Better?

          – Soupy
          Jan 24 at 1:34











        • en.wikipedia.org/wiki/Jazelle "Jazelle DBX (Direct Bytecode eXecution) is an extension that allows some ARM processors to execute Java bytecode in hardware"

          – Mooing Duck
          Jan 25 at 19:36
















        answered Jan 22 at 6:01









        Soupy








        20














        Yes. Case in point, the VAX. The instruction set design was influenced by the requirements of the compiled languages of the day. For example, the orthogonality of the ISA; the provision of instructions that map to language constructs such as 'case' statements (in the numbered-case sense of Hoare's original formulation, not the labelled-case of C), loop statements, and so on.



        VAX Architecture Ref - see the Introduction.



        I am not claiming the VAX is unique in this respect, just an example I know a little about. As a second example, I'll mention the Burroughs B6500 'display' registers. A display, in 1960s speak, is a mechanism for efficient uplevel references. If your language, such as Algol60, permits declaration of nested procedures to any depth, then arbitrary references to the local variables of different levels of enclosing procedure require special handling. The mechanism used (the 'display') was invented for KDF9 Whetstone Algol by Randell and Russell, and described in their book Algol 60 Implementation. The B6500 incorporates that into hardware:




        The B6500/B7500 contains a network of Display Registers (D0 through
        D31) which are caused to point at the appropriate MSCW (Fig. 5). The
        local variables of all procedures global to the current procedure are
        addressed in the B6500/B7500 relative to the Display Registers.
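
        As a concrete illustration of what the display buys, here is a minimal C sketch (C has no nested procedures, so the lexical levels and the display array are simulated by hand, and all names are invented for this example). An uplevel reference becomes a single indexed access through the display, which is the addressing the B6500/B7500 performs in hardware through its D registers.

            #include <stdio.h>

            /* Simulated activation records for an Algol-like program:
               level 0 = the outer procedure, level 1 = a procedure nested inside it.
               The "display" is an array of frame pointers, one per lexical level,
               so an uplevel reference is simply display[level][offset]. */

            enum { MAX_LEVELS = 4 };
            static int *display[MAX_LEVELS];

            static void inner(void)                /* lexical level 1 */
            {
                int locals1[2] = { 30, 40 };
                display[1] = locals1;

                /* Uplevel reference to a level-0 local: one indexed access
                   through the display, not a walk up a chain of static links. */
                printf("inner sees outer's x = %d, its own y = %d\n",
                       display[0][0], locals1[1]);
            }

            static void outer(void)                /* lexical level 0 */
            {
                int locals0[2] = { 10, 20 };
                display[0] = locals0;
                inner();
            }

            int main(void)
            {
                outer();
                return 0;
            }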
































        • as well as more registers for local variables, and a large, regular address space rather than smaller banks that can be activated one at a time.

          – Erik Eidt
          Jan 21 at 23:47







        • 1





          My understanding is that VAX's powerful addressing modes and flexible instruction operands were mostly useful for human programmers at the time; VAX was a poor compiler target because compilers at the time weren't smart enough to take advantage of the funky stuff, and used more but simpler instructions (which VAX couldn't run faster). I haven't used VAX, but providing high-level-like functionality in hardware instructions directly (so your asm can "look" like high-level control constructs) sounds much more beneficial to humans than compilers. Or maybe they thought it would be good, but wasn't.

          – Peter Cordes
          Jan 23 at 16:19











        • i.e. maybe VAX architects hoped that compilers could be made that would translate high-level code to similar-looking asm using these features, but the discovery that that wasn't the case is what led to RISC designs, which were good at chewing through the multiple simple instructions compilers most naturally generated.

          – Peter Cordes
          Jan 23 at 16:25











        • I wasn't thinking so much of the 'fancy' instructions as of the orthogonality of the instruction set; pretty much any operand specifier could be used with any instruction. That makes compilation easier because there are fewer special cases. The main language I used on VMS, BLISS-32, had a compiler that was way better at using funky addressing modes than I was. You can interpret that how you like :-)

          – another-dave
          Jan 24 at 3:13















        answered Jan 21 at 23:15









        another-dave













        14














        Some manufacturers have admitted as much directly.



        The first page of the Intel 8086 data sheet lists the processor's features, which include




        • Architecture Designed for Powerful Assembly Language and Efficient High Level Languages



        In particular, C and other high-level languages use the stack for arguments and local variables. The 8086 has both a stack pointer (SP) and a frame pointer (BP) which address memory using the stack segment (SS) rather than other segments (CS, DS, ES).
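
        For example, an ordinary C routine with parameters and locals maps naturally onto BP-relative addressing. The function below is plain portable C; the comment sketches the kind of 8086 frame a small-model compiler of that era would typically produce (the exact instruction sequence varies by compiler and calling convention, so treat it as an illustration rather than the output of any particular tool).

            /* Ordinary C; a 1980s small-model 8086 compiler would typically emit:
             *
             *   push bp          ; save the caller's frame pointer
             *   mov  bp, sp      ; establish this frame: a at [bp+4], b at [bp+6]
             *   sub  sp, 2       ; reserve space for 'sum' at [bp-2]
             *   ...              ; body uses BP-relative, SS-based addressing
             *   mov  sp, bp
             *   pop  bp
             *   ret
             */
            int add_twice(int a, int b)
            {
                int sum = a + b;   /* local at a negative offset from BP          */
                return sum + sum;  /* 16-bit int result conventionally left in AX */
            }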



        The datasheet for the 8087 co-processor has the following section:




        PROGRAMMING LANGUAGE SUPPORT



        Programs for the 8087 can be written in Intel's high-level languages for 8086/8088 and 80186/80188 systems; ASM-86 (the 8086, 8088 assembly language), PL/M-86, FORTRAN-86, and PASCAL-86.




        The 80286 added several instructions to the architecture to aid high-level languages. PUSHA, POPA, ENTER, and LEAVE help with subroutine calls. The BOUND instruction was useful for array bounds checking and switch-style control statements. Other instructions unrelated to high-level languages were added as well.



        The 80386 added bitfield instructions, which are used in C.




        The Motorola MC68000 Microprocessor User's Manual states:




        2.2.2 Structured Modular Programming



        [...]
        The availability of advanced, structured assemblers and block-structured high-level languages such as Pascal simplifies modular programming. Such concepts are virtually useless, however, unless parameters are easily transferred between and within software modules that operate on a re-entrant and recursive basis.
        [...]
        The MC68000 provides architectural features that allow efficient re-entrant modular programming. Two complementary instructions, link and allocate (LINK) and unlink (UNLK), reduce subroutine call overhead by manipulating linked lists of data areas on the stack. The move multiple register instruction (MOVEM) also reduces subroutine call programming overhead.
        [...]
        Other instructions that support modern structured programming techniques are push effective address (PEA), load effective address (LEA), return and restore (RTR), return from exception (RTE), jump to subroutine (JSR), branch to subroutine (BSR), and return from subroutine (RTS).




        The 68020 added bitfield instructions, which are used in C.
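
        As a concrete C case, reading or writing a struct bit-field means extracting or inserting a run of bits, which on most machines compiles to a shift-and-mask sequence; a 68020-style bitfield extract can in principle do it in one instruction (whether a given compiler actually emits it is another matter). The struct below is an invented illustration, not taken from any real device.

            #include <stdio.h>

            /* A packed, hardware-style record using C bit-fields. Reading 'channel'
               means extracting a 6-bit field from the word: shift + mask on a plain
               CPU, a single bitfield-extract on a 68020-class machine. */
            struct status_word {
                unsigned int ready   : 1;
                unsigned int error   : 3;
                unsigned int channel : 6;
                unsigned int count   : 22;
            };

            int main(void)
            {
                struct status_word s = { .ready = 1, .error = 2, .channel = 37, .count = 1000 };
                printf("channel=%u count=%u\n", s.channel, s.count);   /* channel=37 count=1000 */
                return 0;
            }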




        Whereas the above processors added instructions to support programming languages, Reduced Instruction-Set Computers (RISC) took the opposite approach. By analyzing which instructions compilers actually emitted, their designers were able to discard many complex instructions that weren't being used. This allowed the architecture to be simplified, the instruction cycle to be shortened, and most instructions to execute in a single cycle, speeding up processors significantly.































        • I read the quote from the 8087 manual as saying that the languages listed there (e.g. "PASCAL-86") were designed to support the (already existing) 8087, not that the 8087 was designed to work well with an already existing "PASCAL-86" language.

          – Martin Rosenau
          Jan 22 at 9:14











        • Interesting to hear about the 80286 instructions. Would these have been added in anticipation of their use, or to improve the constructs already being used?

          – Nathan
          Jan 22 at 14:48











        • bound is not useful for switch statements: the docs for bound say it raises a #BR exception if the index is out of bounds. But in C, even if there's no default: case label, an input that doesn't match any case is not undefined-behaviour, it just doesn't run any of the cases. What if I don't write default in switch case?

          – Peter Cordes
          Jan 23 at 16:00











        • The 386 "bitfield" instructions are actually bitmap instructions, for testing or bts test-and-setting/resetting/flipping a single bit in a register or in memory. They're mostly not useful for C struct bitfield members. x86 still has crap-all for bitfield insert/extract compared to ARM ubfx / sbfx ([un]signed bitfield extract of an immediate range of bits), or PowerPC's very powerful rotate-and-mask immediate instructions like rlwinm that can extract a field from anywhere and shift it anywhere into another register. x86 BMI2 did add pdep, but that needs a mask in a register.

          – Peter Cordes
          Jan 23 at 16:04












        • @MartinRosenau: I think it's generally agreed that one of the HLLs 8086 architects had in mind was Pascal. en.wikipedia.org/wiki/Intel_8086#The_first_x86_design says "According to principal architect Stephen P. Morse, this [instructions supporting nested procedures] was a result of a more software-centric approach than in the design of earlier Intel processors (the designers had experience working with compiler implementations)."

          – Peter Cordes
          Jan 23 at 16:10















        answered Jan 22 at 2:39









        Dr Sheldon













        12














        Over time, there have been various language-specific CPUs, some so dedicated that it would be awkward to use them for a different language. For example, the Harris RTX-2000 was designed to run Forth. One could almost say it and other Forth CPUs were the language in hardware form. I'm not saying they understand the language, but they are designed to execute it at the "assembler" level.



        Early on, Forth was known for being extremely memory efficient, fast, and, for programmers who could think bass-ackwards, quick to develop in. Having a CPU that could execute it almost directly was a no-brainer. However, memory got cheaper, CPUs got quicker, and programmers who liked thinking in Forth got scarcer. (How many folks still use calculators in reverse Polish notation mode?)
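
        To make the "language in hardware form" point concrete: a Forth-style CPU is essentially a data-stack machine, so evaluating 3 2 5 + * is just push, push, push, add, multiply. The C sketch below is an invented, minimal model of that execution style (it is not RTX-2000 code; the RTX-2000 keeps the stack in hardware and executes each such operation directly, typically one per cycle).

            #include <stdio.h>

            /* Minimal data-stack interpretation of the Forth-ish program "3 2 5 + *".
               A Forth CPU such as the Harris RTX-2000 holds this stack in hardware
               and executes each token as a machine instruction. */
            int main(void)
            {
                int stack[16];
                int sp = 0;                                  /* next free slot */

            #define PUSH(x) (stack[sp++] = (x))
            #define POP()   (stack[--sp])

                PUSH(3);
                PUSH(2);
                PUSH(5);
                { int b = POP(), a = POP(); PUSH(a + b); }   /* +  : 2 5 +  -> 7  */
                { int b = POP(), a = POP(); PUSH(a * b); }   /* *  : 3 7 *  -> 21 */

                printf("%d\n", POP());                       /* prints 21 */
                return 0;
            }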































        • What does "bass-ackwards" mean?

          – Wilson
          Jan 22 at 13:13






        • 1





          @Wilson Move the 'b' 4 characters to the right. Logic in Forth is opposite to most other languages. Parameters are generally on the stack and used in an object object action order. Compare this to action(object1, object2) order of most other languages. I mentioned RPN calculators because they are the same way. Instead of 3*(2+5)=, you would use 3 2 5 + *.

          – RichF
          Jan 22 at 14:28






        • 2





          I was thinking of Forth almost as soon as I saw the question, but I didn't know a specific Forth-optimized processor so I didn't post an answer.

          – manassehkatz
          Jan 22 at 15:20






        • 2





          @Wilson, bass-ackwards is a non technical joke. It translates to "ass backwards", implying doing things in a strange reverse order such as walking backwards, with your ass in front. If you type it bass-ackwards, the joke becomes self referential.

          – BoredBsee
          Jan 22 at 18:41






        • 1





          For other bass-ackwards computers, see the English Electric KDF9. Rather than 'an accumulator' in the manner of its contemporaries, it had a 16-deep nesting store ('stack' to the youth of today) for expression evaluation. The arithmetic orders were zero-address. Note this was not a full stack in the manner of the B5000; it was more limited in applicability.

          – another-dave
          Jan 22 at 23:48















        12














        Over time, there have been various language-specific CPUs, some so dedicated that it would be awkward to use them for a different language. For example, the Harris RTX-2000 was designed to run Forth. One could almost say it and other Forth CPUs were the language in hardware form. I'm not saying they understand the language, but they are designed to execute it at the "assembler" level.



        Early on, Forth was known for being extremely memory efficient, fast, and, for programmers who could think bass-ackwards, quick to develop in. Having a CPU that could execute it almost directly was a no-brainer. However, memory got cheaper, CPU's got quicker, and programmers who liked thinking Forth got scarcer. (How many folks still use calculators in reverse Polish notation mode?)






        edited Jan 22 at 16:22
        answered Jan 22 at 7:04
        RichF












        • What does "bass-ackwards" mean?

          – Wilson
          Jan 22 at 13:13






        • 1





          @Wilson Move the 'b' 4 characters to the right. Logic in Forth is opposite to most other languages. Parameters are generally on the stack and used in an object, object, action order. Compare this to the action(object1, object2) order of most other languages. I mentioned RPN calculators because they are the same way. Instead of 3*(2+5)=, you would use 3 2 5 + *.

          – RichF
          Jan 22 at 14:28






        • 2





          I was thinking of Forth almost as soon as I saw the question, but I didn't know a specific Forth-optimized processor so I didn't post an answer.

          – manassehkatz
          Jan 22 at 15:20






        • 2





          @Wilson, bass-ackwards is a non technical joke. It translates to "ass backwards", implying doing things in a strange reverse order such as walking backwards, with your ass in front. If you type it bass-ackwards, the joke becomes self referential.

          – BoredBsee
          Jan 22 at 18:41






        • 1





          For other bass-ackwards computers, see the English Electric KDF9. Rather than 'an accumulator' in the manner of its contemporaries, it had a 16-deep nesting store ('stack' to the youth of today) for expression evaluation. The arithmetic orders were zero-address. Note this was not a full stack in the manner of the B5000; it was more limited in applicability.

          – another-dave
          Jan 22 at 23:48

















        10














        Arguably, VLIW architectures were designed mainly for smart compilers. They rely on the compiler efficiently building individual, very complex instructions (a single "instruction" can do many things at the same time), and while it's not impossible to write the code manually, the idea was that you could get better performance for your applications by using a better compiler, rather than having to upgrade your CPU.



        In principle, the difference between e.g. an x86 superscalar CPU and something like SHARC or i860 is that x86 achieves instruction-level parallelism at runtime, while SHARC is a comparatively simple CPU design that relies on the compiler. In both cases, there are many tricks to reorder instructions, rename registers etc. to allow multiple instructions to run at the same time, while still appearing to execute them sequentially. The VLIW approach would be especially handy in theory for platforms like the JVM or .NET, which use a just-in-time compiler - every update to .NET or the JVM could make all your applications faster by allowing better optimizations. And of course, during compilation, the compiler has a much better idea of what all of your application is trying to do, while the runtime approach only ever has a small subset to work with, and has to rely on techniques like statistical branch prediction.



        In practice, the approach of having the CPU decide won out. This does make the CPUs incredibly complex, but it's a lot easier to just buy a new, better CPU than to recompile or update all your applications; and frankly, it's a lot easier to sell a compatible CPU that just runs your applications faster :)
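
        As a hedged illustration of what is being traded, consider the C fragment below. The operations are independent, so a superscalar x86 discovers the parallelism at runtime, while a VLIW compiler must prove the independence itself and pack the operations into explicit wide instructions (the bundle shown in the comment is hypothetical, not any real VLIW encoding):

            /* Four independent multiply-adds a compiler could schedule together.
               On a superscalar CPU the hardware finds this parallelism at runtime;
               a VLIW compiler must find it at compile time and fill the slots of a
               wide instruction explicitly - any slot it cannot fill becomes a NOP,
               which is why VLIW code density depends so heavily on the compiler. */
            void axpy4(float *restrict y, const float *restrict x, float a)
            {
                y[0] = a * x[0] + y[0];
                y[1] = a * x[1] + y[1];
                y[2] = a * x[2] + y[2];
                y[3] = a * x[3] + y[3];
                /* Hypothetical bundle: { fmul | fmul | load | store } per cycle. */
            }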






        answered Jan 22 at 11:42
        Luaan







        • 2





          RE: your last point. Yes, they're easy to sell, but darn hard to deliver!

          – Wilson
          Jan 22 at 13:09






        • 1





          JIT compilers need to run quickly, and probably don't have time to find / create as much instruction-level parallelism as possible. If you've looked at x86 asm created by current JVM jitters, well there's a reason Intel and AMD design their CPUs to be able to chew through lots of instructions quickly. Modern JITs create extra ILP by wasting instructions :P Why is 2 * (i * i) faster than 2 * i * i in Java? is hopefully at the lower end of HotSpot JVM code quality, but it misses huge amounts of asm-level micro-optimizations.

          – Peter Cordes
          Jan 23 at 15:39






        • 1





          @PeterCordes Well, I didn't mean we'd use those compilers just in time - .NET supports native image precompilation. The point is that you could recompile all the native images after updating the runtime (ideally in the background, of course), without requiring the company that produces the application to provide a new build of their software. That said, the latest JIT compiler for .NET is pretty impressive in both speed and the code it produces - though it certainly does have its quirks (my 3D software renderer was a bit impeded by the floating point overhead :D).

          – Luaan
          Jan 23 at 20:27











        • Yeah that works, and might be a way for closed-source software to effectively get the benefit of being ahead-of-time compiled with -O3 -march=native to take full advantage of the target CPU. Like what open-source software in any language can do, e.g. Gentoo Linux compiles packages from source when you install.

          – Peter Cordes
          Jan 23 at 20:40






        • 1





          In the late 90s I worked with a VLIW system based on custom silicon (the forerunner is somewhat described at computer.org/csdl/proceedings/visual/1992/2897/00/00235189.pdf) and it had a custom compiler with some interesting features. All variables (unless otherwise qualified) were static which led to some interesting discussions on occasion. The system had up to 320 parallel processors doing video on demand for up to 2000 streams.

          – Peter Smith
          Jan 25 at 15:16












        10














        Some ARM CPUs used to have partial support for executing Java bytecode in hardware with Jazelle Direct Bytecode eXecution (DBX): https://en.wikipedia.org/wiki/Jazelle



        With modern JITing JVMs, that became obsolete, so there was later a variant of Thumb2 mode (compact 16-bit instructions) called ThumbEE, designed as a JIT target for managed languages like Java and C#: https://en.wikipedia.org/wiki/ARM_architecture#Thumb_Execution_Environment_(ThumbEE)



        Apparently ThumbEE has automatic NULL-pointer checks, and an array-bounds instruction. But that was deprecated, too, in 2011.
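
        To see what those two features buy a JIT, here is a hedged C sketch of the checks a managed-language JIT otherwise has to emit around every array access; ThumbEE's implicit null check on loads and its bounds-check instruction were meant to replace exactly this kind of inline test. The object layout is made up for illustration, and returning -1 stands in for throwing the exception:

            #include <stddef.h>

            /* Hypothetical layout of a managed array object (illustration only). */
            struct jarray {
                size_t length;
                int    data[];
            };

            /* What a JIT normally open-codes around "a[i]" in Java or C#. */
            int checked_load(const struct jarray *a, size_t i)
            {
                if (a == NULL)          /* NullPointerException path           */
                    return -1;
                if (i >= a->length)     /* ArrayIndexOutOfBoundsException path */
                    return -1;
                return a->data[i];
            }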






        answered Jan 22 at 15:25
        Peter Cordes





















                9














                Arguably, the relatively simple logical structure of DO loops in Fortran motivated the development of vector hardware on the early Cray and Cyber supercomputers. There may be some "chicken and egg" between hardware and software development though, since CDC Fortran included array slicing operations to encourage programmers to write "logic-free" loops long before that syntax became standardized in Fortran 90.



                Certainly the Cray XMP hardware enhancements compared with the Cray 1, such as improved "chaining" (i.e. overlapping in time) of vector operations, vector mask instructions, and gather/scatter vector addressing, were aimed at improving the performance of typical code written in "idiomatic" Fortran.



                The need to find a way to overcome the I/O bottlenecks caused by faster computation, without the prohibitive expense of large fast memory, led to the development of the early Cray SSD storage devices as an intermediate level between main memory and conventional disk and tape storage devices. Fortran I/O statements made it easy to read and write a random-access file as if it were a large two-dimensional array of data.



                See section 2 of http://www.chilton-computing.org.uk/ccd/supercomputers/p005.htm for a 1988 paper by the head of the Cray XMP design team.



                There was a downside to this, in that the performance of the first Cray C compilers (and hence the first implementation of the Unix-like Cray operating system UNICOS) was abysmal, since the hardware had no native character-at-a-time instructions, and there was little computer science theory available to attempt to vectorize idiomatic "C-style" loops with a relatively unstructured combination of pointer manipulation and logical tests, compared with Fortran's more rigidly structured DO loops.
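
                As a hedged sketch of that contrast (in C rather than Fortran or Cray assembly, purely for illustration): the first loop below has the "logic-free" shape of a Fortran DO loop - a known trip count, no aliasing, no early exit - and is straightforward for a vectorizing compiler to map onto vector add/multiply hardware, while the second, idiomatic C-style loop walks a pointer until a data-dependent test fires, which gives a vectorizer of that era almost nothing to work with:

                    #include <stddef.h>

                    /* DO-loop shape: fixed trip count, restrict rules out aliasing,
                       no early exits - easy to turn into vector operations. */
                    void daxpy(double *restrict y, const double *restrict x,
                               double a, size_t n)
                    {
                        for (size_t i = 0; i < n; i++)
                            y[i] = a * x[i] + y[i];
                    }

                    /* Idiomatic C of the era: byte-at-a-time pointer walking with a
                       data-dependent exit - the kind of loop the first Cray C
                       compilers handled so poorly. */
                    size_t c_strlen(const char *s)
                    {
                        const char *p = s;
                        while (*p)
                            p++;
                        return (size_t)(p - s);
                    }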






                edited Jan 21 at 23:58
                answered Jan 21 at 23:52
                alephzero







                • 1





                  If the hardware didn't have character at a time instructions, leading to poor C compiler performance, then would it have been the case that the next iterations of the hardware added such instructions to improve compiler performance?

                  – Nathan
                  Jan 22 at 14:50











                • I've used TI DSPs which include a hardware instruction that behaves very much like a "do" loop. The hardware has three relevant registers and a flag: loop count, loop end, loop start, and loop enable. The "RPT" instruction loads the "loop end" register with the value of an immediate operand, loads "loop start" with the address of the following instruction, and sets the "loop enable" flag. User code must set "loop count" beforehand. Any time the program counter equals "loop end" and "loop enable" is set, the DSP will decrement "loop count" and load the program counter with "loop start".

                  – supercat
                  Jan 22 at 22:16











                • If the loop counter gets decremented past zero, the loop enable flag will be cleared, so the loop will exit after the next repetition. TI's compiler is configurable to use the looping feature to improve performance (interrupts would only need to save/restore its state if they use the feature themselves, so it's usually better to simply disable the feature within interrupt service routines than to save/restore the necessary registers).

                  – supercat
                  Jan 22 at 22:19






                • 1





                  @Nathan: DEC Alpha AXP did that: the first 2 silicon versions didn't have byte load / byte store. Instead, string functions were expected to operate a word at a time, if necessary using bitfield instructions to unpack / repack, but the designers argued they wanted to encourage more efficient strlen and similar functions that checked a whole 64-bit chunk for any zero bytes with bithacks (sketched below). 3rd-gen Alpha added byte load/store, but user-space code still generally avoided it in case it would run on older machines and be emulated. en.wikipedia.org/wiki/DEC_Alpha#Byte-Word_Extensions_(BWX)

                  – Peter Cordes
                  Jan 23 at 15:50











                • Some modern DSPs only have word-addressable memory. C implementations on those usually use CHAR_BIT=32 (or 16 for 16-bit machines), instead of doing software read-modify-write of a 32-bit word to implement an 8-bit char. But DSPs don't need to run general-purpose text-processing software like Unix systems.

                  – Peter Cordes
                  Jan 23 at 15:52
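
                For reference, the whole-word zero-byte test mentioned in the Alpha comment above is the classic bithack; a minimal C sketch, assuming 64-bit unsigned words (not actual Alpha library code):

                    #include <stdint.h>

                    /* Nonzero exactly when some byte of w is 0x00. Subtracting 0x01
                       from every byte turns a zero byte into 0xFF; masking with ~w
                       and the per-byte high bits throws away bytes that were nonzero.
                       (Bits can also appear just above a zero byte, which does not
                       matter for a yes/no test.) This is how word-at-a-time strlen
                       was written on machines without byte loads, like early Alpha. */
                    static inline uint64_t has_zero_byte(uint64_t w)
                    {
                        return (w - UINT64_C(0x0101010101010101))
                               & ~w
                               & UINT64_C(0x8080808080808080);
                    }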












                9














                Another specific example of hardware designed to match a language was the Rekursiv, which was designed to implement object-oriented language features in hardware.



                Our Rekursiv has been preserved in a museum.



                See https://en.wikipedia.org/wiki/Rekursiv
















                answered Jan 22 at 9:27









Brian Tompsett - 汤莱恩













                • Ha, I'd never heard of the Rekursiv, but the name made me immediately think that it had to have something to do with Linn.

                  – another-dave
                  Jan 23 at 3:05











• how did the hardware optimize for OO? The only thing I can think of is maybe perfecting the vtable call

                  – sudo rm -rf slash
                  Jan 27 at 13:19
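For readers wondering what that means concretely, here is a hand-rolled vtable in C - a sketch of what C++ virtual dispatch compiles down to, not a description of the Rekursiv's actual mechanism. "Perfecting the vtable call" would mean making the two dependent loads plus the indirect call cheap.

#include <stdio.h>

struct shape;                                    /* forward declaration */
struct shape_vtbl { double (*area)(const struct shape *); };
struct shape { const struct shape_vtbl *vtbl; double w, h; };

static double rect_area(const struct shape *s) { return s->w * s->h; }
static const struct shape_vtbl rect_vtbl = { rect_area };

int main(void)
{
    struct shape r = { &rect_vtbl, 3.0, 4.0 };
    /* The "vtable call": load r.vtbl, load its area slot, call indirectly. */
    printf("%f\n", r.vtbl->area(&r));            /* prints 12.000000 */
    return 0;
}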




























                9














                Null-terminated strings



When C was invented, many different string formats were in use at the same time. String operations were mostly handled in software, so people could use whatever format they wanted. The null-terminated string wasn't a new idea; however, whatever special hardware support existed wasn't necessarily designed for it.



But later, due to the dominance of C, other platforms began adding accelerated instructions for the null-terminated format:




                This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the NUL-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992.



                https://en.wikipedia.org/wiki/Null-terminated_string#History




On x86, Intel introduced several instructions for text processing in SSE4.2, which operate in parallel up to the first null-terminating character. Before that there was SCAS - the scan-string instruction - which can be used to find the position of the terminating character:




                mov ecx, 100 ; search up to 100 characters
                xor eax, eax ; search for 0
                mov edi, offset string ; search this string
                repe scas byte ptr [edi] ; scan bytes looking for 0 (find end of string)
                jnz toolong ; not found
                sub edi, (offset string) + 1 ; calculate length


                https://blogs.msdn.microsoft.com/oldnewthing/20190130-00/?p=100825




We all know that nowadays it's considered a bad idea. Unfortunately it was baked into C, hence is used by every modern platform and can't be changed anymore. Luckily we have std::string in C++.
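A minimal C sketch of the trade-off being described (the counted-string struct below is invented for illustration): with a terminator the length has to be rediscovered by scanning, whereas a counted string carries it along.

#include <stdio.h>
#include <stddef.h>

/* Null-terminated: the length is implicit, so computing it is O(n). */
static size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}

/* Counted ("length-prefixed") string, as in BCPL or Pascal: O(1) length,
   at the cost of storing and maintaining the count. */
struct counted_str { size_t len; char data[64]; };

int main(void)
{
    struct counted_str p = { 5, "hello" };
    printf("%zu %zu\n", my_strlen("hello"), p.len);  /* prints "5 5" */
    return 0;
}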




The use of a string with some termination character seems to have already existed on the PDP-7, where the user could choose the termination character:




                The string is terminated by the second occurrence of the delimiting character chosen by the user



                http://bitsavers.trailing-edge.com/pdf/dec/pdp7/PDP-7_AsmMan.pdf




However, a real null-termination character can be seen in use on the PDP-8 (see the last line in the code block). Later the ASCIZ keyword was introduced in the assembly languages for the PDP-10/11:




                Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.




The B language, which appeared in 1969 and became the precursor to C, may have been influenced by that; it uses a special character for termination, although I'm not sure which one was chosen:




                In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled *e



                Dennis M. Ritchie, Development of the C Language




                Further reading



                • What's the rationale for null terminated strings?

                • How did the PDP-8 handle strings?

                • What's the rationale for null terminated strings? on Reddit

                • https://en.wikipedia.org/wiki/String_(computer_science)#Representations





                answered Jan 22 at 8:51









phuclv








                • 3





                  If you argue "C uses it because..." you're not providing an answer to the question, but rather the opposite

                  – tofro
                  Jan 22 at 9:10







                • 3





                  @tofro PDP affects C design decision, but C affects development of other hardware platforms. Those are 2 separate processes

                  – phuclv
                  Jan 22 at 9:41






                • 1





@JeremyP But they existed very specifically on DEC platforms. Can't find any auto-increment on a /360 or most other computers - similar character-based moves with flags set according to the character transferred.

                  – Raffzahn
                  Jan 22 at 10:27






                • 1





                  @Raffzahn C was not developed with specific features of the PDP 11 in mind. For example, the ++ and -- operators were introduced before C was ported to the PDP 11. Read it from the man himself.

                  – JeremyP
                  Jan 22 at 10:42






                • 2





@JeremyP Erm, have you tried to read my comment before replying? I didn't say PDP-11, but DEC. But when it comes to characters it's the PDP-11 where the zero termination happened. After all, the char type was only introduced into B when it got retargeted to produce PDP-11 code. Which also was before C was even fully defined - the definition (and the move from B to C) happened on the PDP-11. Not on any other machine. Further retargeting happened thereafter. Hint: Read the very link you gave :))

                  – Raffzahn
                  Jan 22 at 11:56























                8














I recall, back in the '80s (and it's referenced in the Wikipedia article), that the Bellmac 32 CPU, which became the AT&T Western Electric WE32100 CPU, was supposedly designed for the C programming language.



                This CPU was used by AT&T in their 3B2 line of Unix systems. There was also a single board VME bus version of it that some third parties used. Zilog also came out with a line of Unix systems using this chip - I think they were a second source for it for AT&T.



                I did a lot of work with these in the 80’s and probably early 90’s. It was pretty much a dog in terms of performance, if I remember.
















                    answered Jan 22 at 0:28









mannaggia






















                        8














                        Yes, definitely. A very good example is how Motorola moved from the 68k architecture to the (somewhat) compatible ColdFire range of CPUs. (It is also an example of how such an evolution might go wrong, but that is another story).



The Motorola ColdFire range of CPUs and microcontrollers was basically a 68000/CPU32 core with lots of instructions and addressing modes removed that "normal" C compilers wouldn't use frequently (like arithmetic instructions on byte and word operands, some complex addressing modes, and addressing modes that act only on memory rather than on registers). They also simplified the supervisor-mode model and removed some rarely used instructions completely. The whole instruction set was "optimized for C and C++ compilers" (this is how Motorola put it), and the freed-up chip space was used to improve the CPUs' performance (for example by adding larger data and instruction caches).



                        In the end, the changes made the CPUs quite a bit too incompatible for customers to stay within the product family, and the MC68k range of CPUs went towards its demise.
















                        answered Jan 22 at 7:54









tofro













                        • I couldn't quite figure out what you meant by "addressing modes that act only on memory"; these are very useful to a C compiler?

                          – Wilson
                          Jan 22 at 9:47











                        • The 68k has a number of very complex addressing modes, like "memory indirect with register offsets and postincrement" that contemporary C compilers didn't/couldn't/wouldn't use according to Motorola and were removed in the Coldfire architecture..

                          – tofro
                          Jan 22 at 10:40












                        • That sounds like *ptr[offset++] to me?

                          – Wilson
                          Jan 22 at 11:52






                        • 2





                          that's more like array[x++ + sizeof(array[0]) * y]. Compilers of that time apparently weren't able to make use of such complex addressing modes. Might be different with today's compilers.

                          – tofro
                          Jan 22 at 19:29
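To make the comment thread concrete, here is the kind of scaled-index access under discussion, in plain C (nothing ColdFire-specific). A 68k compiler could in principle fold the whole address computation into one complex addressing mode; according to Motorola, compilers of the day rarely did, which is why ColdFire dropped those modes.

/* Each access needs base + r * sizeof(row) + col * sizeof(long); whether
   that becomes one fancy addressing mode or a few adds and shifts is
   exactly the compiler/ISA trade-off discussed above. */
long sum_column(long table[][8], int rows, int col)
{
    long total = 0;
    for (int r = 0; r < rows; r++)
        total += table[r][col];
    return total;
}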




























                        7














Another interesting example is Java processors: CPUs that execute (a subset of) Java Virtual Machine bytecode as their instruction set.



                        If you’re interested enough to ask this question, you might want to read one of the later editions of Andrew Tanenbaum’s textbook, Structured Computer Organization¹, in which he walks the reader step-by-step through the design of a simple Java processor.



                        ¹ Apparently not the third edition or below. It’s in Chapter 4 of the fifth edition.
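As a rough sketch of what such a chip does in silicon, here is a software dispatch loop for three real JVM opcodes (iconst_1 = 0x04, iadd = 0x60, ireturn = 0xac); a Java processor such as Sun's picoJava performs this fetch/decode/execute in hardware instead of in a C switch.

#include <stdio.h>
#include <stddef.h>

static int run(const unsigned char *code)
{
    int stack[16], sp = 0;                 /* tiny operand stack */
    for (size_t pc = 0; ; pc++) {
        switch (code[pc]) {
        case 0x04: stack[sp++] = 1; break;                  /* iconst_1 */
        case 0x60: sp--; stack[sp - 1] += stack[sp]; break; /* iadd     */
        case 0xac: return stack[--sp];                      /* ireturn  */
        default:   return -1;              /* opcode not in this sketch */
        }
    }
}

int main(void)
{
    const unsigned char code[] = { 0x04, 0x04, 0x60, 0xac }; /* 1 + 1 */
    printf("%d\n", run(code));             /* prints 2 */
    return 0;
}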






                        answered Jan 22 at 7:29









Davislor








                        • 1





                          You must be thinking of a particular edition of Tanenbaum's textbook. I have the third edition which doesn't mention Java anywhere -- not surprisingly since it was published 6 years before Java 1.0. (The copyright dates in the third edition are 1976, 1984, 1990).

                          – Henning Makholm
                          Jan 22 at 15:28











                        • @HenningMakholm Yes, you're correct. Let me check.

                          – Davislor
                          Jan 22 at 18:13






                        • 2





While I don't see much use for a processor running Java bytecode vs. using a JIT, garbage-collected languages could benefit from some features that I've not seen supported in hardware, such as support for a cache-efficient lazy bit-set operation with the semantics that if synchronization is forced after multiple threads have independently set various bits in the same word, all threads would see all bits that are set. Bits could only be cleared at a time when all threads that might set them were stopped. Such a feature would greatly improve the efficiency of maintaining "dirty" flags.

                          – supercat
                          Jan 25 at 20:36
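For context, the software status quo that the comment contrasts with looks roughly like this C11 sketch: every thread that marks an object dirty pays for an atomic read-modify-write (and the associated cache-line traffic) on the shared word, which is the cost a lazy hardware bit-set operation could avoid.

#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long dirty_bits;   /* one bit per tracked object */

static void mark_dirty(unsigned idx)
{
    atomic_fetch_or_explicit(&dirty_bits, 1UL << idx, memory_order_relaxed);
}

int main(void)
{
    mark_dirty(3);
    mark_dirty(7);
    printf("0x%lx\n", atomic_load(&dirty_bits));   /* prints 0x88 */
    return 0;
}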























                        7














                        Another example is the decline of binary-coded decimal instructions



In the past it was common for computers to be decimal, or at least to have instructions for decimal operations. For example, x86 has AAM, AAD, AAA, FBLD... for operating on packed, unpacked and 10-byte BCD values. Many other classic architectures have similar features. However, they're rarely used, since modern languages often don't have a way to access those instructions: they either lack a decimal integer type completely (like C or Pascal), or don't have a decimal type that maps cleanly onto BCD instructions.
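For readers who haven't met BCD: packed BCD keeps one decimal digit per nibble, so the byte 0x42 means decimal 42. The two helpers below (names invented for this sketch) show the conversion that instructions like x86's DAA and many RTC chips deal in.

#include <stdio.h>

static unsigned bcd_to_bin(unsigned char bcd)      /* e.g. 0x42 -> 42 */
{
    return (bcd >> 4) * 10u + (bcd & 0x0Fu);
}

static unsigned char bin_to_bcd(unsigned bin)      /* bin must be 0..99 */
{
    return (unsigned char)(((bin / 10u) << 4) | (bin % 10u));
}

int main(void)
{
    printf("%u\n", bcd_to_bin(0x42));       /* prints 42 */
    printf("0x%02X\n", bin_to_bcd(59));     /* prints 0x59 */
    return 0;
}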



The result is that BCD instructions started to disappear. On x86 they're microcoded, and therefore very slow, which makes people avoid them even more. Later AMD removed the BCD instructions in x86-64, and other manufacturers did the same in newer versions of their architectures. Having said that, a remnant of BCD operations is still there in the FLAGS register of x86-64 and of many other platforms that use flags: the half-carry (auxiliary carry) flag. Newly designed architectures like ARM, MIPS, SPARC and RISC-V also didn't get any BCD instructions, and some of them don't use a flags register at all.



In fact, C and C++ allow float, double and long double to use a decimal radix; however, none of the common implementations do so for the default floating-point types, because modern computers are all binary and are bad at decimal operations. Very few architectures have decimal floating-point support.



                        Many C and C++ compilers do offer decimal floating-point types as an extension, such as gcc's _Decimal32, _Decimal64 and _Decimal128. Some other modern languages have decimal types as well, but those are mostly wide floating-point types for financial or scientific work, not an integer BCD type. For example, decimal in C# is a floating-point type whose significand is stored in binary, so BCD instructions would be of no help there. Arbitrary-precision types like BigInteger in C# and BigDecimal in Ruby or Java likewise store their digits in binary rather than decimal, for performance. A few languages do have a fixed-point decimal money type, but the significant digits are again stored in binary.
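                        [As a concrete illustration of the gcc extension just mentioned, here is a sketch that assumes a gcc target with decimal floating-point support enabled (for example x86 Linux). There is no standard printf conversion for these types, so the value is converted to double purely for display.]

                        #include <stdio.h>

                        int main(void)
                        {
                            _Decimal64 a = 0.1DD;      /* DD suffix = _Decimal64 literal       */
                            _Decimal64 b = 0.2DD;      /* 0.1 and 0.2 are exact in decimal FP  */
                            _Decimal64 sum = a + b;

                            /* The comparison is exact in decimal; the cast to double is only
                             * for printing and may itself round. */
                            printf("sum == 0.3 ? %s\n", (sum == 0.3DD) ? "yes" : "no");
                            printf("as double: %.17g\n", (double)sum);
                            return 0;
                        }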



                        That said, a few floating-point formats can still be stored in BCD or a related form. For example, the significand of the IEEE 754 decimal floating-point types can be stored either in binary (BID) or in DPD, a densely packed decimal format that converts to BCD easily. Even so, I doubt that decimal IEEE 754 libraries use BCD instructions, because those instructions usually don't exist at all on modern computers, and where they do exist they are extremely slow.




                        BCD was used in many early decimal computers, and is implemented in the instruction set of machines such as the IBM System/360 series and its descendants, Digital Equipment Corporation's VAX, the Burroughs B1700, and the Motorola 68000-series processors. Although BCD per se is not as widely used as in the past and is no longer implemented in newer computers' instruction sets (such as ARM; x86 does not support its BCD instructions in long mode any more), decimal fixed-point and floating-point formats are still important and continue to be used in financial, commercial, and industrial computing, where subtle conversion and fractional rounding errors that are inherent in floating point binary representations cannot be tolerated.



                        https://en.wikipedia.org/wiki/Binary-coded_decimal#Other_computers_and_BCD




                        • Does a compiler use all x86 instructions?

                        • Does a compiler use all x86 instructions? (2010)

                        • Do modern x86 processors have native support for decimal floating point arithmetic?





                        edited Jan 23 at 15:58
                        answered Jan 22 at 18:15

                        phuclv

                        • Was it the case that these instructions were rarely used to begin with, or just that the alternative implementations (where used) that you mention were better suited to modern languages?

                          – Nathan
                          Jan 22 at 18:40






                        • 1





                          Bizarrely, some microcontroller designers are incorporating BCD-based real-time-clock-calendar circuitry into ARM devices. That might be handy if for some reason software wanted to read a time and show it as a hex dump, but it adds silly overhead when doing almost anything else.

                          – supercat
                          Jan 22 at 22:12






                        • 2





                          They were never common in high-level languages. The only notable language with BCD support is COBOL, which is still in use today by financial institutions. They were, however, used a lot in hand-written assembly in the past. Even nowadays BCD is still commonly used on microcontrollers, as @supercat said, because those chips lack a fast multiplier/divider, and with BCD they can do arithmetic directly in decimal (for example when driving 7-segment LED displays) instead of repeatedly dividing by 10 to convert to decimal.

                          – phuclv
                          Jan 23 at 1:13






                        • 1





                          @phuclv: The only place I've seen any new microcontroller designs use BCD is in real-time clock/calendar modules. Even in applications that need to display large numbers, it's more practical to use one byte per digit or use an optimized function whose behavior is equivalent to int temp=256*remainder + *p; *p=temp/10; remainder = temp % 10; *p++; which can be implemented efficiently even on platforms that lack "divide" instructions, and can be used to convert any size binary number to decimal fairly easily.

                          – supercat
                          Jan 23 at 2:18






                        • 1





                          Fun-fact: AAM is literally just 8-bit division by an immediate, and is just as fast as div r/m8 even on modern CPUs (because AAM can use the same divider hardware, but the dividend is only 8-bit instead of 16). It puts the quotient and remainder in the opposite registers from div, though, so it actually costs one extra uop over div r/m8 on some CPUs, presumably doing a div and then swapping. But on some Intel CPUs it's still slightly faster than div r8, especially if you count the mov bl, imm8 against div. Of course, if you care about performance rather than size, you use a multiplicative inverse.

                          – Peter Cordes
                          Jan 23 at 15:28
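                        [The divide-by-10 idiom from supercat's comment above, fleshed out into a self-contained C sketch with hypothetical helper names: it converts an arbitrary-size big-endian binary number to decimal digits without BCD instructions and without a hardware divider, since each step only divides a value below 2560 by 10.]

                        #include <stdint.h>
                        #include <stddef.h>

                        /* Divide a big-endian multi-byte binary number in place by 10 and
                         * return the remainder, i.e. the next decimal digit (least significant
                         * first). One pass over the buffer yields one digit. */
                        static unsigned div10_inplace(uint8_t *num, size_t len)
                        {
                            unsigned remainder = 0;
                            for (size_t i = 0; i < len; i++) {
                                unsigned temp = 256u * remainder + num[i];
                                num[i]    = (uint8_t)(temp / 10);
                                remainder = temp % 10;
                            }
                            return remainder;
                        }

                        static int is_zero(const uint8_t *num, size_t len)
                        {
                            for (size_t i = 0; i < len; i++)
                                if (num[i] != 0)
                                    return 0;
                            return 1;
                        }

                        /* Convert a big-endian binary number (destroyed in the process) to an
                         * ASCII decimal string. `out` must hold at least len*3 + 1 bytes. */
                        static void binary_to_decimal(uint8_t *num, size_t len, char *out)
                        {
                            size_t n = 0;
                            do {                                 /* least-significant digit first */
                                out[n++] = (char)('0' + div10_inplace(num, len));
                            } while (!is_zero(num, len));

                            for (size_t i = 0; i < n / 2; i++) { /* reverse to normal digit order */
                                char t = out[i];
                                out[i] = out[n - 1 - i];
                                out[n - 1 - i] = t;
                            }
                            out[n] = '\0';
                        }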


























                        6














                        Yet another example: The Prime minicomputer had a segmented architecture, and anything in segment 07777 was a null pointer (Prime used octal, and the top 4 bits of the 16-bit word had other uses). Segment 0 was a kernel segment, and just loading 0 into a segment register in user code was an access violation. This would have been fine in properly written C (int* p = 0; stored bit pattern 7777/0 into p).



                        However, it turns out that a lot of C code assumes that if you memset a block of memory to all-bits-zero, any contained pointers will have been set to NULL. Prime eventually had to add a whole new instruction called TCNP (Test C Null Pointer).
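                        [To make the portability trap concrete, here is a generic C sketch (not Prime-specific; the struct and function names are invented): the memset idiom assumes a null pointer is represented by all-zero bits, which held on most machines but not on the Prime, while the explicit loop always produces the implementation's real null representation.]

                        #include <string.h>
                        #include <stddef.h>

                        struct node {
                            int          value;
                            struct node *next;
                        };

                        /* Non-portable idiom: assumes a null pointer is all-zero bits.
                         * True on today's mainstream platforms, false on machines like
                         * the Prime described above. */
                        static void clear_fast(struct node *arr, size_t n)
                        {
                            memset(arr, 0, n * sizeof arr[0]);
                        }

                        /* Portable form: assigning 0 / NULL to a pointer always yields the
                         * implementation's real null representation (e.g. segment 07777 on
                         * the Prime), whatever its bit pattern is. */
                        static void clear_portable(struct node *arr, size_t n)
                        {
                            for (size_t i = 0; i < n; i++) {
                                arr[i].value = 0;
                                arr[i].next  = NULL;
                            }
                        }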






                        answered Jan 22 at 10:16

                        Martin Bonner
                        • 1





                          C is just too flexible for its own good, and/or its standard library sucks. e.g. memset can only take a repeating pattern 1 char wide, so the only portable way to fill an array with NULL is an actual loop. But that's not efficient on any implementation that doesn't recognize it as a memset idiom or auto-vectorize it. So basically C doesn't give you the tools to conveniently deal with all the possible unusual implementations it allows, making it really easy and common to write non-portable code.

                          – Peter Cordes
                          Jan 23 at 16:35











                        • @PeterCordes: You can write static const foo_t zero_foo = 0; and then memcpy from that - but that does waste a chunk of memory, and doesn't help if people allocate stuff with calloc (and it means that zeroing the memory needs twice the memory bandwidth).

                          – Martin Bonner
                          Jan 23 at 16:40











                        • Ok, so not the only portable way, but there's no portable way that's as efficient as memset on typical implementations. Having arbitrary-sized arrays of NULL pointers of each type is totally impractical. (C does allow different object-representations for different pointer types, right? So even an arbitrary-sized array of (void*)NULL isn't enough.) Or you could loop in chunks of memcpying 1k at a time so you can have bounded sizes for your copy sources, but then you probably bloat each zeroing site with a loop around memcpy if the size isn't provably smaller than 4kiB at runtime.

                          – Peter Cordes
                          Jan 23 at 16:51











                        • And dirtying cache for these stupid source arrays is just horrible.

                          – Peter Cordes
                          Jan 23 at 16:51






                        • 1





                          "C does allow different object-representations for different pointer types, right?" - Yes, and the Prime used this. A pointer was segment + word offset, unless you needed a character pointer (or void ptr) - when one of the bits in the segment word was set, indicating that there was a third word which was the bit offset (which was always 0 or 8). It's a lovely machine for violating people's assumptions about "how all computers work". It used ASCII, but ASCII is seven-bit, and they always set the top bit!

                          – Martin Bonner
                          Jan 23 at 16:54


























                        5














                        Some more examples of programming languages affecting hardware design:



                        The MIPS RISC ISA often seems strange to newcomers: instructions like ADD generate exceptions on signed integer overflow. It is necessary to use ADDU, add unsigned, to get the usual 2’s complement wrap.



                        Now, I wasn’t there at the time, but I conjecture that MIPS provided this behavior because it was designed with the Stanford benchmarks - which were originally written in the Pascal programming language, which requires overflow detection.



                        The C programming language does not require overflow traps. The new MIPSr6 ISA (circa 2012) gets rid of the integer-overflow-trapping instructions - at least those with a 16-bit immediate - in order to free up opcode space. I was there when this was done.
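                        [For context, here is a small C sketch of what overflow detection looks like when the ISA has no trapping add, using the GCC/Clang __builtin_add_overflow intrinsic. This is roughly what a Pascal-style checked add, or a JavaScript engine's integer fast path, compiles to on MIPSr6 and most other ISAs; the second function spells out the wrapping behaviour that ADDU provides directly.]

                        #include <stdint.h>
                        #include <stdbool.h>

                        /* Add two 32-bit signed integers, reporting overflow instead of
                         * trapping or silently wrapping. GCC and Clang lower this to an add
                         * plus an overflow check (a flags test on x86, an explicit
                         * compare-and-branch sequence on MIPS or RISC-V). */
                        static bool checked_add(int32_t a, int32_t b, int32_t *result)
                        {
                            return !__builtin_add_overflow(a, b, result);  /* true on success */
                        }

                        /* The wrapping add, spelled out in portable C: do the arithmetic on
                         * the unsigned type, where wraparound is well defined, then convert
                         * back (implementation-defined in strict ISO C, but two's-complement
                         * wrap on the compilers under discussion). */
                        static int32_t wrapping_add(int32_t a, int32_t b)
                        {
                            return (int32_t)((uint32_t)a + (uint32_t)b);
                        }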





                        I can testify that programming languages influenced many modern x86 features, at least from P6 onwards. Mainly C/C++; to some extent JavaScript.






                        edited Jan 24 at 14:47
                        answered Jan 23 at 5:50

                        Krazy Glew
                        • 1





                          Indeed, the wiki says that the integer-overflow-trapping instructions with a 16-bit immediate were removed. That's good, since no one uses add or sub in MIPS anyway

                          – phuclv
                          Jan 23 at 16:04






                        • 2





                          Nobody uses MIPS’ trap on overflow instructions. Certainly not C code. But JavaScript requires overflow detection. MIPSr6 added a branch on overflow instruction. But that requires extra code. I conjecture that if OS and JavaScript were defined together, trap on overflow might be better. However, JavaScript engines that want to be portable would not do this.

                          – Krazy Glew
                          Jan 23 at 16:05












                        • autocorrect typo: s/oppose/opcode/.

                          – Peter Cordes
                          Jan 23 at 16:29











                          There are many purposes for which having a program terminate abnormally would be far preferable to having it produce incorrect output that seems valid. There are other purposes for which it is useful for a program that receives a mixture of valid and invalid data to produce a corresponding mixture of correct and meaningless output, so that the presence of invalid data doesn't prevent the program from correctly producing the portions of the output associated with the valid data. Having both instructions available facilitates both kinds of programming in languages that expose them.

                          – supercat
                          Jan 25 at 20:18












                        • @supercat and your point is? Trapping on overflow does not necessarily mean terminating the program. In the bad old days we used to have trap handlers that actually did things - like look up the trap PC, and decide what to do. I.e. trapping can be the exact equivalent of an IF statement - the tradeoff being that the IF statement requires an extra instruction, slowing down the case where IF-overflow does not happen, whereas the trap slows down the case where the IF-overflow happens. And requires cooperation between the compiler and the part of the runtime that handles traps.

                          – Krazy Glew
                          Jan 25 at 20:23















                        5














                        Some more examples of programming languages affecting hardware design:



                        The MIPS RISC ISA often seems strange to newcomers: instructions like ADD generate exceptions on signed integer overflow. It is necessary to use ADDU, add unsigned, to get the usual 2’s complement wrap.



                        Now, I wasn’t there at the time, but I conjecture that MIPS provided this behavior because it was designed with the Stanford benchmarks - which were originally written in the Pascal programming language, which requires overflow detection.



                        The C programming language does not require overflow traps. The MIPSr6 new (circa 2012) ISA gets rid of the integer overflow trapping instructions - at least those with a 16 bit immediate - in order to free up opcode space. I was there when this was done.





                        I can testify that programming languages influenced many modern x86 features, at least from P6 onwards. Mainly C/C++; to some extent JavaScript.






                        edited Jan 24 at 14:47

























                        answered Jan 23 at 5:50









                        Krazy Glew

                        2213











                        • 1





                          Indeed, the wiki says that the integer overflow trapping instructions with a 16-bit immediate were removed. That's good, since no one uses add or sub in MIPS anyway

                          – phuclv
                          Jan 23 at 16:04






                        • 2





                          Nobody uses MIPS’ trap on overflow instructions. Certainly not C code. But JavaScript requires overflow detection. MIPSr6 added a branch on overflow instruction. But that requires extra code. I conjecture that if the OS and JavaScript were defined together, trap on overflow might be better. However, JavaScript engines that want to be portable would not do this. (A sketch of the branch-on-overflow approach follows these comments.)

                          – Krazy Glew
                          Jan 23 at 16:05












                        • autocorrect typo: s/oppose/opcode/.

                          – Peter Cordes
                          Jan 23 at 16:29











                        • There are many purposes for which having a program terminate abnormally would be far preferable to having it produce incorrect output that seems valid. There are other purposes for which it is useful for a program that receives a mixture of valid and invalid data to produce a corresponding mixture of correct and meaningless output, so that the presence of invalid data doesn't prevent the program from correctly producing the portions of the output associated with the correct data. Having both instructions available facilitates both kinds of programming in languages that expose them.

                          – supercat
                          Jan 25 at 20:18












                        • @supercat and your point is? Trapping on overflow does not necessarily mean terminating the program. In the bad old days we used to have trap handlers that actually did things - like look up the trap PC and decide what to do. I.e., trapping can be the exact equivalent of an IF statement - the tradeoff being that the IF statement requires an extra instruction, slowing down the case where the overflow does not happen, whereas the trap slows down the case where the overflow does happen. And it requires cooperation between the compiler and the part of the runtime that handles traps.

                          – Krazy Glew
                          Jan 25 at 20:23
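
                        Following up on the JavaScript comments above, here is a minimal, hypothetical C sketch of the kind of branch-on-overflow fast path a portable JavaScript engine relies on (not any particular engine's actual code): try the int32 add and fall back to double arithmetic on overflow. A trap-based scheme would instead let the add run unchecked and recover inside a trap handler.

                            #include <stdint.h>

                            /* JS numbers are doubles, but engines keep small integers in int32 form
                               when they can.  The add therefore carries an explicit overflow check
                               (the "extra instruction" mentioned above); on overflow it falls back
                               to double arithmetic rather than trapping. */
                            double js_add(int32_t a, int32_t b)
                            {
                                int32_t fast;
                                if (!__builtin_add_overflow(a, b, &fast))  /* GCC/Clang builtin */
                                    return (double)fast;        /* common case: still fits in int32 */
                                return (double)a + (double)b;   /* overflow: fall back to doubles */
                            }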












                        2














                        Yes.
                        More recently, the TPUs (Tensor Processing Units) that Google designed to accelerate AI work are built to efficiently execute programs written with their TensorFlow framework.






                            answered Jan 22 at 23:49









                            William Smith

                            291



