Porting 8086-toolchain to ELKS #2112
Comments
I see my name exists. Let me know if you want any help. I'm happy to take patches for whatever I've moved to Codeberg. I'm not dead yet! |
One obvious thing lacking is an "ar" implementation for static libraries. It might be worth looking into the Dev86 implementation, too. |
I have temporarily been using the upstream 8086-toolchain to quickly compile macOS-hosted versions of C86 and NASM, since @rafael2k's version is currently an ELKS-only build. I have been playing with its C86 and NASM to get more information on how they work and their continued suitability for an ELKS-hosted 8086-only C compiler toolchain. Build script:
Here are the current results:
Looking at the class slides showing the intended workflow for the toolchain, it appears C86 was not built for, and knows nothing about, the idea of multiple input or output files. The class workflow looks somewhat like the following:
So we could face an uphill battle for certain unhandled situations (e.g. extern functions) and have to modify C86 in order to get NASM to output .o or .obj output that will communicate properly with the linker what to do with various constructs. Examples of problems could be .comm data (e.g. This is not all bad - C86 itself seems to handle the C source I've thrown at it - but creating a "toolchain" out of it might take some work, unless we want to generally produce smaller programs and/or compile and assemble everything at once. I'll help with whatever needs to be done. |
Yes, Dev86 has ar, and it would be well suited if Dev86 LD is used. Given my report above, we're getting ahead of ourselves, since C86 never produces a GLOBAL or EXTERN directive in its .asm output, so there'd be no symbols to manage, and NASM won't produce a .o or .obj file with undefined externals. So, for now, we're really talking about just getting CPP, C86 and NASM to compile, assemble, and produce a .bin (.com) binary output file with just those three tools. In order to do anything actually useful, we'll use @rafael2k's poor man's a.out header (likely created with an included .asm file) and the nasm -P option to automatically include it. Something like:
After that, in order to do anything useful (like call an ELKS system call to display something), we can easily produce a syscalls.asm file that can be added to the NASM assembly step, assembling header.asm, file.asm and syscalls.asm into a single a.out-compatible file that will load and run on ELKS. The NCC Project, oriented towards x86-64, uses NASM as well but uses a similar approach for syscalls, where a Linux-compatible list of system calls is linked to provide all system calls. While each system call should be in a separate library function, the NCC approach will work very well to provide a full set of system calls in a single .asm file, just by renumbering the system calls to ELKS' system call list. After that, a CC wrapper program could be built which automatically performs much of this workflow, and hide it from the user. Even though programs might be quite a bit larger in the beginning than necessary (because of the inclusion of all system calls or even perhaps a full mini-libc in a single source file), it could all be made to work. I like C86, but the big disadvantage is that we're not starting with a toolchain, instead we're having to build one. Lots of work, but fun. |
More news: on the upstream 8086-toolchain class online resources page, there's a link to known tool problems which discusses some known problems with C86. While most are seemingly OK, like having to declare all local variables at the start of a function rather than anywhere, there are a couple issues that could be very problematic for porting any larger piece(s) of code: the compiler apparently has problems dealing with multiple C statements on a single line separated by semicolon, as well as having register allocation problems when more than one C operator is used in an expression at once (which means porting the ELKS library code will be problematic), and then a killer problem of the usage of long/unsigned long (any 32-bit arithmetic) not working well. I've asked the 8086-toolchain maintainer for a copy of the C86 Manual, which is currently a dead link. Hopefully it can be found and we can read more about what C86 does and doesn't do. |
Oh... Reading this and your following answers, it seems c86 has some problems that make it unsuitable for anything more than small and fast projects, at least without more thorough changes to the compiler. Still, it might be useful in some cases. Btw, ar might potentially be used directly with nasm and ld86. That of course requires the project to be written entirely in assembly. |
Hi all. I got the cpp and ld from https://github.com/lkundrak/dev86 |
Yes, I meant to mention this when I had the chance, but you beat me to it! c86 has some limitations that may not make it suitable as a general-purpose 8086 C compiler. It was only used to compile a toy Real-time Operating System for the 8086 emulator. But y'all seem like you know what you're talking about with compilers, so perhaps you might be able to fix those limitations. I myself have not done much compiler work, so I probably won't be of much help here. |
As far as the license and history goes for the project, here is what my professor, James Archibald, said to me via email (on 2024-11-20):
So I believe you are free to extend and use as you see fit. |
C86 has some info on licensing in cmain.c (main.c in earlier versions)
|
An oral authorization is enough in my opinion, as this implies no one will try to sue us in the future. |
Unrelated to the question of licensing (not a lawyer, anyway), there is some documentation about c86 here: http://retro.co.za/68000/CC68K/QDOSC68K/c68.txt Although, this is related to a version maintained by the Walkers (Keith and Dave). I wonder if ECEn 425 version is based on this or on an earlier version? Also, I wonder if there is a newer release by Keith and Dave? Other parts of toolchain on that page are for QDOS (Sinclair QL). So, not really interesting for us here. |
There are two versions of C86 sources here (one from 1998, one from 1999): |
Found more docs (for QL version): |
I found a copy of the compiler's homepage: Also available on Internet Archive: |
@ghaerr sent this message:
|
From http://retro.co.za/68000/CC68K/QDOSC68K/c68.txt See ghaerr/elks#2112 (comment) I believe this is the original c86 text manual referenced in the class, though I don't know if there were any modifications. But the size (~146.4 KB) is close to the 147 KB expected size, so I suspect any modifications made were small.
That's a different project, as far as I know. It was in development since early nineties, maybe even before. As far as I know, it was written from scratch and it has always supported 386 only. It never had a support for 8086. |
Oh. I see I am wrong. I was sure it was written from scratch. |
One thing I was right about, it only emits 386 code:
And, it has been in development for so long, it's probably significantly changed compared to the original. Might have as well been rewritten a couple of times. :) |
Here is the manual of the compiler version I'm using at 8086-toolchain: The issue in 8086-toolchain upstream repo: hintron/8086-toolchain#13 |
I found Mathew Brandt's version: Original copyright notice:
|
I might be wrong about this too. I can vaguely place it in the late 90s, although the earliest version I found is this one from 2000 (ccdl*.zip): |
We need to compare to understand how different it is from the one in the 8086-toolchain. No problem in changing the source, if it makes sense. |
Found an early version from 1996 (ccdl122.zip). This version also supports only 68k and 386. |
But what I can tell you is that it likely doesn't fit the public domain definition. It's also not OSI compliant. That doesn't mean it's not hackable, changeable or usable. It just means it's not really open source in the OSI sense of the word and has limited usage and distribution permissions in comparison to GPL, BSD and MIT licensed stuff. |
Btw, @ghaerr, can you advise on adapting ld86 from elks aout v0 to v1? |
The a.out .version field was changed from 0 to 1 to indicate that the upper 16 bits of the previous 32-bit .chmem field were split off into a new 16-bit .minstack field occupying the same space. This allows the developer to specify a separate heap value in .chmem and stack value in .minstack. Nothing else was changed. ELKS can load both V0 and V1 executables, so no immediate change is necessary. At some point ld86 could be enhanced to output V1 executables by adding a min stack size command line argument and writing it in the revised header, along with version = 1. For now, since the executables likely to be built using either your poor man's header or ld86 are small, specifying either v0 or v1 will work the same, as both the chmem and minstack fields are zero anyway. I can help more with this when ld86 is actually needed by c86, as it seems for now there won't be any need for ld86 or ar until c86 is enhanced to output GLOBAL and EXTERN directives for function and data symbols. What this really means is that, for now, a poor man's a.out header can be fairly easily implemented with no c86 modifications using a pre- and post- .asm file around the NASM-assembled c86 output, with NASM then creating the a.out file directly using its -f bin option. The pre- header will list some internal symbols for start of text and data along with the a.out structure itself at address 0, and the post- header will calculate the length of each for inclusion in the a.out header fields. |
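The V0/V1 difference described above can be sketched in C. This is only an illustration of how the shared 32-bit slot could be read, not the ELKS loader code: the names `aout_mem` and `decode_chmem` are hypothetical, and the sketch assumes .chmem is the low 16 bits of the slot and .minstack the high 16 bits, as the description of the split implies.

```c
#include <stdint.h>

/* Hypothetical holder for the two values a loader would want. */
struct aout_mem {
    uint32_t chmem;     /* heap (V1) or the whole 32-bit chmem value (V0) */
    uint16_t minstack;  /* minimum stack, only meaningful for V1 */
};

/*
 * Interpret the 32-bit header slot according to the .version field.
 * Assumption: V1 keeps .chmem in the low 16 bits and puts .minstack
 * in the high 16 bits of what used to be the 32-bit .chmem.
 */
struct aout_mem decode_chmem(uint16_t version, uint32_t slot)
{
    struct aout_mem m;

    if (version >= 1) {
        m.chmem = slot & 0xFFFFu;
        m.minstack = (uint16_t)(slot >> 16);
    } else {
        m.chmem = slot;
        m.minstack = 0;
    }
    return m;
}
```

With both fields zero, as in the small test programs discussed here, `decode_chmem(0, 0)` and `decode_chmem(1, 0)` give identical results, which is why either version value works for now.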
A possible big problem with AS86 is that there's no manual for it anywhere - so we don't really know what its input format is supposed to look like. Also, we would have to rewrite all our ASM support files (for the fourth time for myself) to AS86 format, instead of NASM format. We could, however, use host NASM to assemble the libraries in standard NASM format and only use AS86 for C86 output. But that's a mess waiting to happen... So these are the downsides to switching to AS86. Nonetheless, if NASM is ridiculously slow, we might have to switch just for C86 output on ELKS only. Of course, we could look at BCC to see what it outputs. (No, don't port BCC!) We also don't know the AS86 object format, although when I updated NASM to its later AS86 support, I did learn a little more about it. It's not that important, but it's nice to have documentation on internal formats when we run into trouble. That's why I initially really liked NASM: it is very popular. I didn't realize it was also a pig. Luckily we don't have to run python etc to translate all its tables for compilation! Perhaps @bocke, our web search genius, can find some information on AS86 input format? Or BCC output format? |
This is kinda specific. I thought I might find more information in Usenet archives than by searching the web. Luckily, Google still carries the archive of comp.os.minix - where bcc and its utilities originally come from. I think I found some info:
This is answer to a Usenet thread where the following question was asked:
Direct link to the thread archive on Google: https://groups.google.com/g/comp.os.minix/c/vP6atMbKqtc/m/llUap37Tqe0J So, if this is true, the ultimate documentation would come from Minix documentation and source code. |
I missed the post in the thread because of the Google folding most of the posts:
So, while the posts about asld don't concern us, the fact that as86 was influenced by the Minix assembler might. |
More info from the archives: This discussion is very much related to your question: Manual for the assembler (1996). According to Albert S Woodhull:
|
Thanks for the info @bocke. Interesting, I didn't realize BCC and AS86 were for the purpose of compiling MINIX. The more I think about it, the less I like the idea of a (fourth!!) incompatible ASM format in ELKS. We've already got GCC ASM format, then OWC ASM format, NASM Intel standard format, and now AS86 ASM format. If the problem were only getting C86 output assembled, it might not be too bad, but another huge issue is writing all the other ASM files around the C86 runtime library, something I've already started. Writing all that in yet another incompatible ASM language, especially a super old one, doesn't make much sense to me. Having said all that, yes, we might still want to research this more (and thank you already for your fast work @bocke!), to see what the alternatives are. At this point, we've put in a lot of work and have a toolchain actually running, which is a big accomplishment. Switching out assemblers could make sense as NASM is a pig and now we see it's slow, but another alternative is to take a working small assembler in C and add AS86 .o object file output to it - but stay with Intel standard assembly format input. Finally, perhaps @rafael2k can take a quick look at running gdb on NASM on the host, and interrupt it to see what it might be doing to take up so much CPU time to assemble a file. I believe the issue is in multi-pass jump optimization. This is also an area where our early version of NASM could be buggy and a later version has a speed fix. So there are lots of options. I'll read up on the AS86 history, but presently I'm going to continue looking at fixing C86's compiler bugs, and put any assembler change on the back burner until we get more of the toolchain and C library running. |
Well, since we got an idea of where to look, it was easier to search further. I remembered Tenox had an OS archive and found a mirror. Looked into the InteractiveUnix folder (PC/IX) and found something that confirms that Minix used PC/IX assembly syntax: More PC/IX docs are available in the parent directory: http://mirrors.pdp-11.ru/_tenox_collection/@tenox.pdp-11.ru/os/interactiveunix/Documentation/ I did browse a few dev docs, but haven't yet found a full manual for the assembler. Edit: Coincidentally, this asm summary info is also from the Minix Usenet group. |
Found mininasm: @ghaerr, let's keep NASM/Intel assembly syntax. I'll test mininasm. ps: original mininasm is very interesting, but might not have everything we want, apart from not having as86 output: |
I searched for a little while after my last message, but without further luck. I did find the assembler man page, but it only lists command line arguments, not assembler syntax or instructions. I also took a quick look at the other dev docs, but haven't found anything on the assembler. Maybe the specific docs are not yet available online? |
I studied the nasm source code a bit, and cloned from the upstream repo here (our 0.98.x lineage): We might be able to run it as far as memory occupation is concerned (small changes from our 0.98.35), but by design, the runtime seems to grow exponentially with the size of the input. So for now, until we find a better solution, anyone assembling on an XT will need to have patience. |
I'm still not happy with the memory functions.
Can we know the size of the memory allocated with malloc? |
Fixed some bugs in the memory code: |
I would say that adding an AS86-compatible object format to any assembler is definitely not worth it. Lots of work, and very error prone. We're already in the middle of figuring out various existing problems with NASM's AS86 .o output, although it is at least now working for uninitialized BSS variables, something we had to add ourselves. True common variables are another story, we haven't got to them yet. I suspect as86 supports them, but I agree with @rafael2k that it's not worth changing the ASM input format away from Intel Standard unless we have to. Other ideas for assemblers could be anything that accepts NASM- or even MASM-style assembly source, but outputs Intel OMF/OBJ (.obj) object format files. There are quite a few assemblers out there that do that. Perhaps we could look at FASM or one of the early Microsoft MASM-compatible assemblers. All of MSFT's tools used to use OBJ format. However, we would have to replace the LD86 linker with one that supports .OBJ input in order for any of this to work. There are a number of linkers out there that do that. Except then the linker would have to be modified to output ELKS a.out format, or at least V7 UNIX or MINIX a.out format. There are very few that do that. So what we're talking about in our toolchain: CPP -> C86 -> ASM -> LD is throwing out ASM and LD, which is half our toolchain. It's lots of work, but could be worth it, depending on what's out there. I just want everybody to understand what we're up against. @rafael2k did a very good job with the initial selection of a toolchain that had a great C compiler and also used a well-known assembler that uses Intel Standard input. With this, we were initially stuck and couldn't do anything until it was realized that NASM could output AS86 object format and then we realized LD86 from dev86 could be used - and voila! - a vision for a workable toolchain was born.
It's a lot harder than one would think, just getting the architecture right, let alone all running on an ancient 8086 with very limited memory.
Well, that's bad news about the NASM design. I knew it was a disk pig at 141k, larger than even the C86 compiler, but I didn't know that it runs terribly inefficiently. I agree though with @rafael2k that for now, we're better off staying with NASM and moving the toolchain project forward, until both an assembler and linker are found that will meet our needs. @rafael2k, I suggest we keep moving forward with getting the memory management (malloc etc) completed as previously discussed on all tools, then trying to move forward on some C library compiles, while we still have some momentum. I have several big fixes for C86 and some ideas for a development system disk layout that would allow us to produce a prelim devkit for others to try. Once that's done, we will be in a better position to replace ASM/LD if wanted, and the good news is that most .c files won't (I hope) be large enough to cause NASM to run super slowly on ELKS. Right now it's only you and me running any of the toolchain on ELKS, so not too much to worry about! |
Yes, that memcpy would be bad news on any protected-mode system, as when the new size is larger than the old, an out-of-bounds exception might be thrown. However, for ELKS running in real mode, all that will ever happen worst-case is copying too much data from the old alloc location to the new location. This won't overwrite any important data, it just writes garbage in the new area from the old area.
Interestingly, I didn't think it was possible to write realloc without knowing the previous memory size, until I saw your first (working) implementation in NASM where you did the above - copy more data than needed. I then added the comment in a subsequent commit to remind ourselves that the copy was too much, but won't ever do harm. To answer your question: Outside the C library, no, there isn't currently a way to know the size of a memory block. Inside the C library, yes, we can access internal data structures and copy the correct number of bytes (although still this won't affect the operation of any program at all, except to be slightly faster by copying only the needed number of bytes). Nonetheless, I plan on adding an API call like Does that make sense, or are you thinking there is another problem with the memcpy that I'm not seeing? |
Thanks @ghaerr ! That clarifies my question - we are in real mode, so there will be no segmentation fault or anything weird apart from copying trash sometimes. |
This helps debugging the free()'s too, so it will be a welcome addition. |
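The "inside the C library we can copy the correct number of bytes" point can be illustrated with a small size-tracking allocator. This is a minimal sketch layered on the host malloc, not the ELKS C library code: `blk_alloc`, `blk_size`, `blk_free` and `blk_realloc` are hypothetical names, and the size header is an assumption about how block sizes could be recorded.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored just before each payload. */
typedef union blk_hdr {
    size_t size;         /* usable bytes that follow the header */
    long double align_;  /* keeps the payload suitably aligned */
} blk_hdr;

void *blk_alloc(size_t n)
{
    blk_hdr *h = malloc(sizeof(blk_hdr) + n);
    if (!h)
        return NULL;
    h->size = n;
    return h + 1;        /* payload starts right after the header */
}

/* Size originally requested for the block, or 0 for NULL. */
size_t blk_size(const void *p)
{
    return p ? ((const blk_hdr *)p - 1)->size : 0;
}

void blk_free(void *p)
{
    if (p)
        free((blk_hdr *)p - 1);
}

/* Allocate-copy-free realloc that copies only min(old, new) bytes. */
void *blk_realloc(void *p, size_t n)
{
    void *q = blk_alloc(n);
    if (!q)
        return NULL;     /* the old block stays valid on failure */
    if (p) {
        size_t old = blk_size(p);
        memcpy(q, p, old < n ? old : n);
        blk_free(p);
    }
    return q;
}
```

Because the old size is known, growing a block never reads past the end of the old allocation, which also makes the same code safe on protected-mode hosts.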
ps: at some point I'd like to hear more about the "True common variables". |
C linkers have to deal with three types of data variables: "data", "bss", and "common". The linker either stores data, or reserves space, for each of these types. Initialized data is easiest: the initialized value is stored directly in the .data segment by the linker. Uninitialized data is never stored, but space is reserved in the .bss section, which is not on disk but allocated by the loader when the program starts up, and is zeroed. Both "bss" variables and "common" variables end up in the .bss "section". C allows for multiple declarations of variables of the same name in separate source files: these are called common variables. They are not allocated separately, but magically combined by the linker at link time. To see how and why this must be, consider the following .c file:
What does our compiler do? Since it doesn't know about common variables, it has to emit the following:
Since this is not an Now what happens if another source file has the same declaration:
C86 emits the same thing, a RESB and global declaration. When the linker sees two global declarations with the same name, it'll fail the link with a multiple symbol definition error. And importantly, it can't "combine" them because it doesn't technically know how big each 'a' is. That is, one source file could have said Since C allows declaration of uninitialized data in separate source files without an extern declaration, this allows for the problem to occur. There is a The big problem comes up when header files use So there you go. Our C86 doesn't output a .comm directive, and I'm not sure yet whether NASM supports it. LD86 probably does, but I haven't checked fully. Since AS86 was written to take BCC output, I would think it knows about .comm directives. |
I did not know about this common section which the linker needs to do its magic. I was always wondering what that -fno-common was about. So apart from compiler work, I think we need outas86.c in nasm to output this section to the as86 object, in case this language "feature" is to be supported. |
So, good news from the as86 support in c86. Using the BAS output (_b) of c86, I get usable assembly for as86 that assembles correctly, but with some stuff missing, like the ".data" section and so on. After manually adjusting the generated assembly and adding the .data, the "hello world" correctly assembled and linked with the nasm-source libc86 in as86 object format! ps: at one point I thought the BAS output was related to Borland, but I think it is Bruce... |
Thanks for porting as86! It's good to know it assembled and correctly output (its own) .o object format after manually editing the asm file produced by C86. And very nice to know the NASM .o format from libc86 is actually compatible with real as86 output and ld86 accepts them both! As per earlier, this is another viable pathway for replacing NASM. The AS86 output would only be produced by C86 and could be mixed with NASM, allowing either. Kind of nice, but introduces a fourth assembly language variant. Have you tried seeing whether AS86 can assemble chess.asm, even without the .data section changes required for LD86? It would be interesting to know how fast it is.
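To make the "combined by the linker" step concrete, here is a minimal sketch of how a linker's symbol table might merge common symbols while still rejecting two real definitions. It is an illustration only, not LD86's implementation: `struct sym`, `link_merge` and the keep-the-largest-size rule (the usual behavior of traditional Unix linkers) are assumptions for this example.

```c
#include <stddef.h>
#include <string.h>

enum sym_kind { SYM_DEFINED, SYM_COMMON };

/* Hypothetical symbol record as seen by the linker. */
struct sym {
    const char *name;
    enum sym_kind kind;  /* SYM_COMMON: uninitialized, mergeable */
    size_t size;
};

/*
 * Merge `in` into `tab` (which holds `count` of `cap` entries).
 * Returns the new count, or -1 on a multiple-definition error
 * or a full table.
 */
int link_merge(struct sym *tab, int count, int cap, const struct sym *in)
{
    int i;

    for (i = 0; i < count; i++) {
        if (strcmp(tab[i].name, in->name) != 0)
            continue;
        if (tab[i].kind == SYM_DEFINED && in->kind == SYM_DEFINED)
            return -1;   /* two real definitions: link error */
        if (tab[i].kind == SYM_COMMON && in->kind == SYM_COMMON) {
            if (in->size > tab[i].size)
                tab[i].size = in->size;  /* reserve the largest */
        } else if (in->kind == SYM_DEFINED) {
            tab[i] = *in;  /* a real definition overrides a common */
        }
        /* defined + incoming common: keep the definition as-is */
        return count;
    }
    if (count >= cap)
        return -1;
    tab[count] = *in;
    return count + 1;
}
```

With a RESB plus GLOBAL in every object file, both symbols arrive as `SYM_DEFINED` and the merge fails, which is the multiple-definition error described above; a .comm directive would let them arrive as `SYM_COMMON` instead.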
Agreed, I have added this to my TODO list. |
I'll also play around with AS86 a bit myself, and look into the C86 outx86_b.c stuff. It might be a good idea to have both linked in to C86 so that either can be produced. I'm not sure exactly how/when C86 handles common variables, I'll have to dig deeper on that. My initial inspection seemed to show that it did not handle them. |
Currently AS86 is complaining about some call and jumps with the chess example:
ps: I'm calling as86 as (is this correct?):
p2: I can not even compare the speed with nasm, as as86 for me seems to run "instantaneously". |
I can see c86 is generating ".comm" directives, for the nanoprintf non-initialized line buffer, for example. |
Ugh, the reason for the jmp/ja/etc errors is that 8086 conditional jumps are limited to signed 8-bit offsets (-128 to +127 bytes) - and C86 is depending on the assembler to do jump optimization and "reversal", where a JC to a destination outside that range is replaced with JNC .+2 followed by a JMP. So this is going to be a big problem, unless AS86 can do the jump handling. We would have to check, it could easily be that BCC handled that before sending output to AS86. We can fix it in C86, but it's one more big thing to do. The "CALL AH" instructions are buggy - that will have to be fixed in outx86_b.c.
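The jump "reversal" can be sketched as a small encoder. This is a hedged illustration of the technique, not code from C86, NASM or AS86: `emit_jcc` is a hypothetical helper, and it assumes 16-bit code where a near JMP is opcode E9 followed by a 16-bit displacement.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Emit a conditional jump located at address `from` to `target`.
 * `cc` is the x86 condition number (0x2 = JC, 0x3 = JNC, 0x7 = JA, ...).
 * Returns the number of bytes written to `out` (2 or 5).
 */
size_t emit_jcc(uint8_t *out, unsigned cc, long from, long target)
{
    long disp = target - (from + 2);  /* rel8 counts from the next instruction */
    uint16_t disp16;

    if (disp >= -128 && disp <= 127) {
        out[0] = (uint8_t)(0x70 | (cc & 0x0F));
        out[1] = (uint8_t)disp;
        return 2;
    }
    /* Out of range: inverted condition skips over a near JMP. */
    disp16 = (uint16_t)(target - (from + 5));
    out[0] = (uint8_t)(0x70 | ((cc ^ 1) & 0x0F));
    out[1] = 3;                       /* length of the JMP rel16 below */
    out[2] = 0xE9;                    /* JMP rel16 */
    out[3] = (uint8_t)(disp16 & 0xFF);
    out[4] = (uint8_t)(disp16 >> 8);
    return 5;
}
```

Flipping the low bit of the condition number gives the opposite condition (JC becomes JNC, JA becomes JBE), so the long form needs no lookup table.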
Look at the ASM on this - is this the .comm issue already, or something else?
I think so, google for the Linux AS86 man page, I don't have it handy. See if there is a jump optimization parameter we need to turn on.
You mean when the AS86 output is selected?
Well, perhaps we're seeing more why NASM is so slow - it has to reassemble chess.asm every time a jump offset changes... and it looks like there are lots of them! But 30 seconds sounds terrible. |
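Why a changed jump offset can force another pass is easier to see in a toy model. The sketch below is not NASM's algorithm; it is a generic "start short, grow until stable" relaxation loop, with `struct item` and `relax` as hypothetical names and the 2-byte and 5-byte jump sizes taken as assumptions for a short form and a reversed long form.

```c
/*
 * One program item: plain code of `size` bytes when target < 0,
 * otherwise a conditional jump to item[target] whose size is
 * chosen by relax() (2 = short form, 5 = reversed long form).
 */
struct item {
    int size;
    int target;
};

/*
 * Repeat layout passes until no jump changes size.
 * `addr` must have room for n + 1 entries; on return addr[i] is
 * the final address of item i and addr[n] the total size.
 * Returns the number of passes that were needed.
 */
int relax(struct item *it, int n, int *addr)
{
    int i, changed, passes = 0;

    for (i = 0; i < n; i++)
        if (it[i].target >= 0)
            it[i].size = 2;          /* optimistic: everything short */

    do {
        changed = 0;
        passes++;
        addr[0] = 0;
        for (i = 0; i < n; i++)
            addr[i + 1] = addr[i] + it[i].size;
        for (i = 0; i < n; i++) {
            int disp;
            if (it[i].target < 0 || it[i].size == 5)
                continue;
            disp = addr[it[i].target] - addr[i + 1];
            if (disp < -128 || disp > 127) {
                it[i].size = 5;      /* grows, so later addresses shift */
                changed = 1;
            }
        }
    } while (changed);
    return passes;
}
```

Each jump that grows moves everything after it, which can push another jump out of range and trigger one more full pass over the file.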
as86 "-j" fixes the error:
I'm not sure. Here is the snippet:
Yes, when BAS output is selected in c86. |
It would seem this is an error in C86. Looking at chess.c, the variable is declared just once as I've made a note to fix this. For the time being, a workaround should be to declare the buffer 'static' in chess.c. Are you thinking we should add AS86 assembler format permanently configured into C86, or have you already done that? I can see it might be convenient to have both AS86 or NASM output available as a compiler switch while we evaluate both assemblers. BTW, what are you using for selecting AS86 output, -bas86? What about nasm format, what is the option you are using for that, -bnasm? |
Concerning C86, for now I'm just keeping both binaries, one with the defines set to use BAS (I did it like this in the "c86bas" branch: rafael2k/8086-toolchain@bf0a43d ), the other for NASM (default in the default "dev" branch). |
I found our compiler here too: |
Did you say (somewhere) that you had the '//' C++ style comment problem in CPP86 fixed? Or was that something about using the OWC compiler and using the -cc++ to do the same thing? I'm trying to understand if we need to fix //-style comments in CPP86, and exactly where the problem was. Can you generate a source code test case that I can look at that fails on our toolchain? Thanks! |
This is a continuation of the discussion in #1443 (comment), regarding issues getting what is hopefully the latest version of a C86 compiler and @rafael2k's port of its included (older) NASM assembler running on ELKS.
At the moment, there is some consideration of using Dev86's CPP C preprocessor, producing Dev86-compatible AS86 format object file output from NASM, and possibly using Dev86's LD linker, as both CPP and LD are (hopefully) likely to be easily ported to the ELKS 8086-only environment.
I'm not sure where the best current sources are for Dev86 - it used to be that @jbruchon hosted them on Github, and that version's upstream is quite old, but still present: https://github.com/lkundrak/dev86. It seems that jbruchon has moved his version to Codeberg at https://codeberg.org/jbruchon/dev86. During the last four years, I am aware of a number of bug fixes posted to his repo when it used to be on Github. I would recommend starting with jbruchon's Dev86 unless another more updated version is found on Github.
ELKS shares quite a history with Dev86: just five years ago the entire kernel and C library were compiled using its BCC->AS86->LD toolchain. The ELKS C library had originally been in dev86/libc but had been moved prior to that.
While it could make sense to use Dev86's CPP and LD in order to get C86 running more quickly on ELKS, unfortunately the BCC compiler is K&R only, and doesn't support ANSI C at all.
@rafael2k, which repo are you using for your CPP and future LD ports? I would assume that if you can get them running, both will be moved into your https://github.com/rafael2k/8086-toolchain repo.