Re: Learning *NIX programming?
From: cLIeNUX user (r_at_cLIeNUX.)
Date: 08/06/05
Date: Sat, 06 Aug 2005 06:16:22 -0000
humbubba@smart.net
>Can anyone please recommend a good _free_ book/tutorial
>(HTML/PDF/TXT/PS) ?
>
>I'd like to learn about file handling (POSIX), sockets, inter process
>communication, I/O, etc...
>
>I have had one C course. Currently running Debian GNU Linux. Also plan
>on installing OpenBSD and Solaris on some boxes.
>
>Zach
>
If you type "see C" at a shell prompt in cLIeNUX, you get this...
<html><head><title>cLIeNUX intro seedoc for the C programming language
</title></head> <body> <h1>A point of introduction to C in cLIeNUX </h1>
<h2>NAME</h2> C, gcc, cc1 - the C programming language
<h2>DOC DATE</h2> 19990912 <-> 20000309 <p>
<h2>purpose of this document</h2>
This seedoc is intended to provide a point of introduction to using the C
programming language on cLIeNUX. It is hoped that sufficient rudiments
will be revealed here to write useful programs, and that concepts of using
the C development tools will be clarified significantly for many
non-programming tasks, such as installing existing programs consisting
mainly of C sourcecode. I am not an expert C programmer, but with enough
flailing about I can get it to do what I want, and an introduction is
needed. I haven't done much checking of what I state here. Actual code
examples have been checked. As soon as you can, get other references to
fix the damage I do here.
<p>
I attempt to present what appear to me to be the basic concepts of C, and
this document is not complete as a reference to operators and so on. The
basic operators provided by awk, such as +, -, *, % and so on, are almost
exactly the same as C's, except as pertains to data typing. See awk for
that, and as a sort of a "practice C". Basic concepts of programming may
be omitted in lieu of the <a
href=file://localhost/help/see/programming.7.html> programming </a>
seedoc, which should be studied before this document if you don't know
programming at all. If you get lost here, refer back to that seedoc. Some
details of this discussion may be specific to GNU C and the GNU
development tools, or to cLIeNUX. Some other details are known to be less
than perfectly correct in the interest of brevity. C has some extremely
confusing aspects, but most aspects are fairly straightforward for a
language that produces high-performance results. Once C is learned, the
possibilities opened are proportional to the mass of existing C code.
<p>
Things for you to input at the terminal, C/cpp keywords, and commands,
are usually emphasized in this html like <em> this</em>.
<p>
<h2>general description of C</h2>
The C programming language was developed around 1973 to make UNIX
portable. C is a procedural language. That is a type of programming
language that is not very abstract compared to what a typical CPU actually
does. C is at the low-abstraction end of high-level languages. This is
one reason the performance of code produced by C is usually quite good,
and therefore hand-coding assembly or machine language in conjunction with
C is typically only done in circumstances where it is unavoidable. Almost
all of a typical Linux/GNU/etc. distribution is written in C. C++, also
from those wacky boys at Bell Labs, is very nearly a superset of C. That is,
most valid C is also valid C++, and the first C++ compilers worked by
translating C++ into C. C++ is not included in cLIeNUX Core.
<p>
Knowing C, or being aware of C terminology, and the terminology of the
assembler, linker and other utilities associated with C, is of great help
to users of unix in activities other than programming, and also for
programming most other languages popular in unix.
<p>
Documentation for unix that tells you how things actually work often
assumes a knowledge of C. That's not good, since C isn't the definition of
computing, but it is very good in that in many ways unix makes no
distinction between user and programmer, which may be difficult at first
but is ultimately very empowering.
<p>
The C compiler proper, <em>cc1</em>, which is what implements the actual
C programming language, is one component in a suite of tools. At least
four main components are usually implied by the phrase "written in C"; the
C pre-processor, the C compiler, the assembler, and the linker. There are
other preprocessors and translators of various types for use with C. There
are also two prevalent "front-ends" for the entire compilation process.
"Written in C" is represented directly by the <em>gcc</em> command. The
GNU <b>gcc</b> (or cc) command is a "driver" in the top-down sense of the
term, an interface to and manager for the four main programs mentioned
above, which in these examples will be the GNU <b>cpp</b> C preprocessor,
the GNU <b>cc1</b> C compiler, the GNU <b>as</b> or gas assembler, and the
GNU <b>ld</b> linker.
<p>
Large C-based programs consisting of many files are invariably built
under the control of the <b>make</b> utility. <em>make</em>,
<em>cpp</em>, <em>as</em> and <em>ld</em> can all be used for tasks
unrelated to the C compiler or libc, but are designed and heavily
defaulted to work with them, and each other. <em>cpp</em> in particular is
bothersome to use as a macro-processor for something other than C
sourcecode. See m4 for general macro-processing. System-wide subroutine
linking libraries in unix observe the C conventions for parameter passing,
header (#include) files and so on.
<p>
<h2>background, from the bottom up, machine-wise and historically</h2>
Using C benefits from a basic grasp of the layers underneath C, since it
resembles or works closely with them, and was written by the kind of
assembly language programmers that "They don't make 'em like that any
more". Common C terminology tends to assume a knowledge of assembly
language practice and concepts. I hope to present some of that here.
<p>
C, and GNU <em>gcc</em> in particular, allows detailed control of what
level of abstraction you operate at. <em>gcc</em> can in fact compile
several versions of C, "K&R" or "traditional", ANSI, and "GNU", which
allows a lot of syntactical constructs ANSI forbids or leaves undefined.
Some of my own code depends on the GNU C extensions, since GNU C, like GNU
software in general, is amazingly portable and widely used, to the point
of being perhaps a de-facto standard for C.
<p>
Unfortunately, something that is capable of high abstraction, extreme
flexibility and utter specificity takes a lot of explaining. Fortunately,
we are presenting C right in its natural habitat, its home court, and
everything works pretty much as expected, right at your fingertips. You
are urged to investigate things you don't understand in this document by
interacting with the C facilities cLIeNUX Core provides.
<p>
A typical CPU chip gets its instructions from RAM as simple binary codes
for various operations. These codes are called opcodes. The full set of
opcodes a particular CPU implements is called its machine language, or
instruction set. All programming is a matter of arranging these opcodes,
and usually arranging some initial data for them to act upon. At the
dawn of the digital computer age these opcodes and data were entered by
hand with a row of 2-position switches. Input devices were developed
to allow these binary opcodes to be entered in bulk, from paper tape,
punchcards and so on. Other devices arose to allow input to the computer
as hexadecimal numbers instead of individual bits. ( Lost yet? see <a
href=file://localhost/help/see/programming.7.html> programming.</a>
) These forms of controlling a computer utterly directly in
its native machine language and without abstraction are called first
generation languages.
<p>
The GNU <em>objdump</em> utility can display sections of an object file in
hex. So can cLIeNUX <em>binedit</em>. "object" has several meanings, and
more than one of them is used in this seedoc. In this case an "object
file" is a file containing program code in runnable form. "object code" is
binary data ready to give to a CPU as its program. An object file may or
may not also be runnable as a stand-alone command. "object" is not used in
this document in the sense of "object oriented programming", which is an
abstraction layer built on top of what is presented here. Let's look at
the object code for the <em>basename</em> command. Note that the code
examples in this background section are for perspective, not for detailed
understanding.
<pre>
:; cLIeNUX0 /dev/tty10 r 05:32:57 /subroutine/static
:;<em>objdump -s -j .text /.bi/basename |page</em>
/.bi/basename: file format elf32-i386
Contents of section .text:
8048d70 5989e389 e083e4f8 89ca01d2 01d201d0 Y...............
8048d80 83c00431 ed555555 89e55053 51b88800 ...1.UUU..PSQ...
8048d90 0000bb00 000000cd 808b4424 08a3c8ce ..........D$....
8048da0 04080fb7 05dcd004 0850e859 ffffff83 .........P.Y....
8048db0 c404e851 feffff68 70ba0408 e8f7feff ...Q...hp.......
8048dc0 ff83c404 e837fdff ffe80a02 000050e8 .....7........P.
8048dd0 24ffffff 5b8d7426 008db426 00000000 $...[.t&...&....
8048de0 b8010000 00cd80eb f78db426 00000000 ...........&....
8048df0 5589e553 bbfcce04 08833dfc ce040800 U..S......=.....
8048e00 740e89f6 8b03ffd0 83c30483 3b0075f4 t...........;.u.
</pre> Pretty cryptic, isn't it? The above format is affectionately
known as a
hex dump. The left column is the process virtual address of the first
byte of the line, then there are 16 bytes of memory shown in hexadecimal,
and in the right column bytes that have printable ASCII representations
are shown as such. Bytes that aren't printable in ASCII are accounted
for with periods. That's actually more user-friendly than what the
CPU sees. That's the first 160 bytes of the .text section of the
file /command/basename. ".text" is the ELF section of an executable
file that contains the actual code of the program, but to be honest,
there is sometimes non-code stuff in a .text section, and from this
display I can't tell for sure if the above is actually machine code or
some other data. Trying to talk to computers in a form like the above
quickly gave rise to the second generation of programming languages,
known as assembly languages.
<p>
Assembly languages are simple translations of opcodes to short text names
for the opcodes, called mnemonics, and some other rudimentary conveniences.
The C compiler itself produces assembly language to be processed into
binary object code by an assembler. Machine language monitors also exist
to convert assembly language to binary object code directly in RAM
interactively, and other on-the-metal activities. "File-to-file"
assemblers are more prevalent on multi-user operating systems than machine
language monitors, which give total interactive control over the machine
to an extent that is unsuitable for a running multi-user system, since
frequently crashing the machine is normal when using a machine language
monitor. The GNU debugger may have much of the functionality of a machine
language monitor, I don't know. I manage to stay out of <em>gdb</em> most
of the time. Here's an example of typing some C code at the <em>cc1</em>
C compiler directly. Again, this is just for perspective; don't be too
worried about details just yet.
<pre>
:; cLIeNUX0 /dev/tty10 r 06:39:15 /subroutine/static
:;<em>cc1
blah(){</em>
.file "stdin"
.version "01.01"
gcc2_compiled.:
blah
<em> int zay;
zay = zay + 3; }</em>
.text
.align 16
.globl blah
.type blah,@function
blah:
pushl %ebp
movl %esp,%ebp
subl $4,%esp
addl $3,-4(%ebp)
.L1:
movl %ebp,%esp
popl %ebp
ret
.Lfe1:
.size blah,.Lfe1-blah
<em> (at this point I did ctrl&c to exit) </em> </pre> OK, not very
clear. The C code I input was <pre>
blah(){ int zay; zay = zay + 3 ; }
</pre>
The compiler produces valid assembly language for the GNU gas assembler.
On this box the assembly is for the x86 family of chips. The output
contains linker directives, interspersed with the following x86 assembly
language instructions... <pre>
pushl %ebp
movl %esp,%ebp
subl $4,%esp
addl $3,-4(%ebp)
movl %ebp,%esp
popl %ebp
ret
</pre>
Admittedly, this probably doesn't make much more sense than a hex dump,
but if you know x86 assembly it does. I don't know x86 assembly well
myself, except for things like "ret" means "return from subroutine". Even
if one knows neither C nor assembly, the above illustrates some things.
The difference in size between the C code and the assembly language code
vaguely supports the statement that C is at the low end of high-level
languages. Also, one can clearly see that to get from C to the CPU one
does in fact have to pass through the lower-level stages. The lower level
stages aren't gone, and they aren't obsoleted, they are just subordinated,
and usually occur in the background. Also, if you want to know what your C
code actually results in, you can see that you have the tools to find out.
<p>
There are some other things that aren't actually self-evident from the
above, but that one can imagine when looking at that example. If we
did the same thing on a machine set up to assemble object code for some
other CPU besides the x86, such as the PowerPC, the same C code input
would produce different assembly language output. That is portability,
and is the most important characteristic of third-generation languages.
<p>
There are a couple more things I want to point out about the assembly
code that will explain a lot later. Our blah<em>()</em> in the C code is
what is called a function in C. It, and its contents, which is what is
between the curly-braces, are a named coherent functional unit of code.
"blah()" in the C input became <pre>
.globl blah
.type blah,@function
blah:
</pre> in the assembly language output. The ".globl blah" says that the
following blah: label is a "global symbol". This is very important to
the linker. This is how the entry points of routines are found in object
files and libraries during linking. You'll hear about symbols all the
time when dealing with linking, which you'll hear about a lot when building
large C programs that aren't pre-packaged for your setup.
<h2>Unsuccessfully doing nothing in C </h2>
For an actual example of C resulting in a working program, one would like
to begin with the simplest example possible. Given the fact that C is
really all in parts, there are a couple of meanings to "simplest". The
simplest way to get a working command out of C is not the simplest example
of legitimate C code, in terms of facilities used. An object file that's
not a command, such as a component in some other program or a library,
might use fewer features of the build environment, but we want a command.
<em>gcc</em> and the standard C library provide facilities for making a
stand-alone program. Our first actual code example will use those
facilities. If we use the name <em>main()</em> for the main routine of our
program, the handlers will all be included by <em>gcc</em> to make a runnable
command. OK, make a directory somewhere called /source/box/C_revelations
or something, and edit a file called null.c in it to contain the following
text...
<pre>
main(){}
</pre> Did that? OK, now do<br> <em> gcc null.c </em><br>
You should then have a new file, named a.out by ancient tradition. That's
your new executable. The text "main(){}" is the absolute minimum that
will compile into an executable. <em>gcc</em> did a lot of processing and
assumed a lot of default choices to produce a unix/Linux ELF executable
from null.c. <pre>
:;<em>ls -l </em> <br>
-rwxr-xr-x 1 r root 3772 Sep 12 10:46 a.out
:;<em>file a.out
</em> a.out: ELF 32-bit LSB executable, Intel 80386, version 1,
dynamically linked, not stripped
</pre>
We see that main(){} became a 3772 byte dynamically linked executable
command. "not stripped" means the symbols used to link it and various
other info are still in it. You know what symbols are, right? Good. Even
stripped, it would be much bigger than whatever few opcodes are required
to express a <em>main()</em> that does nothing, because it has an ELF
format data structure in it so the system can handle it properly. In the
case of a dynamically linked executable, that includes a reference to the
dynamic linker, ld-linux.so.
<p>
What all did <em>gcc</em> do? null.c doesn't include any preprocessor
directives, but <em>gcc</em> can't know that ahead of time, so GNU
<em>cpp</em> was run on it anyway. The output of <em>cpp</em>, which in
this case was the same as its input, our program text, was then passed to
<em>cc1</em>, the actual GNU C compiler executable, and the C code was
compiled into x86 assembly language. The assembly language code was passed
to <em>as</em>, the GNU assembler, which assembled the assembly language
text representation of the code into binary object code. <em>as</em> also
does just enough adding of header information so that the object file it
produces can be handled by the linker. <em>ld</em> took the object file
produced by <em>as</em>, and linked it to a file <em>gcc</em> keeps handy
named crt1.o that actually puts wrapper code around <em>main()</em>. The
wrapper code precompiled in crt1.o endowed our program with unix
commandline and environment variable processing, and some libc
initialization.
<em>ld</em> also linked our program to libc.so.(version#), so that if we
had used any functions from the standard C library they would be available
at runtime. You can get an account of all this processing with the
<em>-v</em> and <em>--save-temps</em> switches to <em>gcc</em>, as in<br>
<pre>
:;<em>gcc -v --save-temps null.c</em>
</pre>
<p>
Our current a.out is a full-fledged unix executable. The OS gives it the
full suite of facilities for a process when you run it, so you can time
it, you can do pointless things with redirection operators, and you can
give it commandline arguments and environment variables to ignore, but all
that stuff is the OS at work. We have actually failed to produce a program
that does absolutely nothing, however. The wrapper code that calls
<em>main()</em>, that <em>gcc</em> linked in from crt1.o, ends the process
by way of the <em>_exit()</em> system call. That's the normal way for a
process to end itself, and it always returns a byte to the calling process.
In our case that byte is typically zero, although with no <em>return</em>
statement in <em>main()</em> the older C standards don't promise any
particular value. Either way, that's doing something. That is about
the least a unix command can do, though.
<p>
One more aspect of doing nothing that bears mention is proper commenting.
Comments should state *what* a routine does. The code is *how* it does it.
Sometimes stating what things don't do is important also. For future
reference to null.c, you may want to edit it something like this... <pre>
/* null.c minimum C program. Returns 0, because returning
void from a Linux command isn't possible. */
main(){}
</pre>
<h2>Doing something</h2>
OK, you've picked up some assembly lingo, and you realize that there's
massive work done under the hood when making an executable from C. Note
also that everything <em>gcc</em> does can be specified individually.
<em>gcc</em> is the manager for lots of parts, but all the parts are
accessible individually. You can do whatever you want. The sticky parts
are A: you have to know what you want, and B: you have to be able to
express what you want. Now we can delve into expressing what you want in C
itself like someone that wants some results.
<p>
C can be thought of as a core language, and a standard library. That is
the overall format of the current ANSI C standard, that's how it's
implemented as <em>cpp</em>/<em>cc1</em> and libc, and it is rather
analogous to the CPU and the peripherals of a computer. The core language
defines the C virtual CPU, and the library routines provide things like
files, sockets, high-level math functions like trigonometric functions,
and so on. We'll emphasize the core language, but we'll need some other
functionality to use it, just as you need some kind of input/output to a
CPU to control it. We will continue to use the <em>main()</em> interface
to provide us with a testable command, and we will use the commandline
argument handling provided with it to give our program something to work
on that isn't necessarily the same every time the program runs. We'll use
the <em>return</em> value of <em>main()</em> for output, even though it
gets truncated to just a byte, and is really intended to provide just a
success/fail flag. Type the following into a file named plus5.c. <pre>
/* plus5.c add 5 to a commandline argument */
int main(int argc, char * argv[]){ return atoi(argv[1]) + 5; }
</pre> This probably looks real bad to a non-C programmer, but it's
much better than a hex dump. Let's pick it apart. The first line,<pre>
/* plus5.c add 5 to a commandline argument */</pre>
is a comment. The C preprocessor comes first and replaces everything from
the /* to the */ inclusive with a single blank. Right after the first
line is another example of my sloppy coding style. There should be a
line saying<pre>
#include <stdlib.h> </pre>
and there isn't. This is because later in the program, we use the atoi()
libc call, and we should <em>#include</em> the header file for it.
However, traditional C lets you call an undeclared function and assumes it
returns an <em>int</em>, which atoi() does, and <em>gcc</em> links libc by
default, so it works anyway. For a call that returns something other than
an <em>int</em>, or with a stricter compiler, this kind of sloppiness will
not work. <em>#include</em> is a facility of the preprocessor. It's kind of
the opposite of a comment, in that comments get removed, #includes get
inserted. The angle brackets around stdlib.h are shorthand for "in
the standard header files directory" which in cLIeNUX is
/source/C/include by default and is /usr/include on most other unices.
Next comes <pre>
int main(int argc, char * argv[]){ </pre>
Wow. A lot of the trickiness of C, and the <em>main()</em> interface,
comes to a head right here. Well, in the interest of brevity, I'm going to
ignore it. For the purpose of introduction, just note that this, exactly,
is the magic incantation you use to use the commandline facilities
associated with <em>main()</em>, and that <em>argv</em> is given to us as
an array of pointers to chars.
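<p>
To make "an array of pointers to chars" a little more concrete, here is a
small sketch that walks such an array. The function name
longest_argument() is made up for illustration. It takes the same argc and
argv that <em>main()</em> receives, so a <em>main()</em> could call it as
longest_argument(argc, argv).

```c
#include <string.h>

/* Hypothetical helper: return the index of the longest argument after
   the program name, or 0 if there are no such arguments.
   argv[0] is the program name, argv[1] .. argv[argc-1] are the arguments. */
int longest_argument(int argc, char *argv[])
{
    int i;
    int best = 0;
    size_t best_length = 0;

    for (i = 1; i < argc; i++) {
        size_t length = strlen(argv[i]);   /* argv[i] is a pointer to char */

        if (best == 0 || length > best_length) {
            best = i;
            best_length = length;
        }
    }
    return best;
}
```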
<p>
The <em>{</em> and <em>}</em> in our program define the limits of the
body of <em>main()</em>. <em>main()</em> contains one statement.
Statements are terminated with a semicolon. Our statement, <pre>
return atoi(argv[1]) + 5; </pre>
says to return from <em>main()</em>, i.e. quit, and to provide as the return
value the first commandline argument after the program name, converted
from an ASCII string to an <em>int</em>, plus 5. C evaluates
an expression like that from the innermost parenthesized group to the
outermost. First it gets argv[1], which is the first commandline argument.
argv[0] is the program name, which we don't happen to use within our
program. atoi() converts the string to an <em>int</em>, then 5 is added to
that, then return has everything it needs and it does its thing, and the
program is finished.
<p>
This is a very sloppy program. If it doesn't get an argument at all after
the program name, it segfaults. If it gets an argument it can't convert to
an <em>int</em>, it thinks it's 0 and it returns 5. If the argument + 5 is
more than 255, it gets truncated to a byte by <em>_exit</em>. But, when
you write your own code, you determine whether or not the program is
appropriate for the task at hand.
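<p>
For comparison, here is a hedged sketch of what a less sloppy version of
the same computation might look like. The names plus5_checked() and
as_exit_status() are invented for this example, and strtol() is swapped in
for atoi() because it can report bad input. A <em>main()</em> would call
plus5_checked(argv[1], &result) only after checking that argc is at least 2.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse text as a decimal integer, add 5, and store
   the sum in *out. Returns 0 on success, -1 if text is missing, is not
   a number, or is too large to add 5 to. */
int plus5_checked(const char *text, int *out)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0')        /* overflow, or junk in the string */
        return -1;
    if (value < INT_MIN || value > INT_MAX - 5)
        return -1;

    *out = (int) value + 5;
    return 0;
}

/* What the calling process would see if value were returned from main():
   only the low 8 bits survive. */
int as_exit_status(int value)
{
    return value & 0xFF;
}
```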
<h2>Format of a C program</h2>
From the top, a C program has an overall form like this <pre>
(not real C code)
variable declaration/definition . . .
function declaration/definition . . .
main definition
</pre>
That's for an executable. Code for routines to be linked with something
else won't have <em>main()</em>. "main", by the way, isn't formally a C
reserved keyword, but its use is ubiquitous.
Variables can also be declared inside functions.
If they are declared outside any function they are visible to any function
in the file. "visible to" means "usable by".
If they are declared within a function, they are only visible
within that function. That does not include functions called from
the function the variable is defined in; a called function sees only
its own locals, the file-level variables, and whatever it is passed as
arguments. These issues are called scoping. A variable that is only visible within a
function is "local" to that function. There are also "storage class" qualifiers
for variables, but I'm not going to address that.
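<p>
A minimal sketch of these scoping rules follows. All the names in it are
made up for the example. The point is that the function being called works
on the file-level variable, and never sees the local variable of the same
name in its caller.

```c
int calls_made = 0;            /* file level: visible to every function below */

void note_call(void)
{
    /* note_call() cannot see the locals of whoever called it,
       only its own locals and the file-level variables. */
    calls_made = calls_made + 1;
}

int scope_demo(void)
{
    int calls_made = 100;      /* local: hides the file-level variable here */

    note_call();               /* changes the file-level calls_made, not ours */
    return calls_made;         /* the local one, still 100 */
}
```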
<p>
#includes of necessary header files are usually at the top, but the
include mechanism works at any point in a file, except within comments,
since <em>cpp</em> does comment removal first. The preprocessor can also cause
sections of code to be included or omitted depending on variables in cpp's
own variables namespace. For any program of any size, various mechanisms
will be used to maintain the program in various parts, but at build time
it will all resolve to the above general outline by the time <em>cpp</em>
hands it to
<em>cc1</em>.
Routines linked from precompiled libraries are semantically like
variable declarations and non-main() function declarations. The structure
of a variable declaration is <pre>
(not real C code)
type [qualifier] name[= initializer] [, name, name...];
</pre>
where [ ] encloses optional material. For a variable declared outside any
function, what you are doing is initializing storage, which happens before
the program starts, so an initializer must be something that can be
determined at that point in time, such as a constant number, string, or
expression that can be resolved unambiguously. A variable local to a
function is initialized each time the function runs, so its initializer can
be any expression. There's a concise definition of
"expression" in C, but I don't know exactly what it is offhand.
Let's say for now that it's a description of a simple computation to be
performed that produces one value. According to the ANSI lingo, if a variable
declaration has an initializer it's a "definition".
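<p>
Here are a few declarations in that form, as a sketch. The variable names
are invented for the example.

```c
int line_limit = 10 * 4;        /* a definition: constant expression initializer */
char greeting[] = "hello";      /* array sized by its initializer: 5 letters + NUL */
int total_lines, blank_lines;   /* several names of one type, no initializer;
                                   file-level variables like these start out as 0 */

int greeting_size(void)
{
    int size = (int) sizeof greeting;   /* a local, initialized at run time */

    return size;
}
```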
<p>
All actual runtime code must be within functions. Declarations and definitions
don't have to be within functions because a program can have initialized data
in its memory image. The structure of a function definition, including main,
is this...<pre>
(not real C code)
return_type name ( [argument...] )
{
[statement ;
statement ;
statement; ....]
}
</pre>
Line spacing and indentation are just shown for illustration. Proper
formatting is important for readability and maintainability, but sequences
of blanks, tabs and newlines are all the same as a single blank to C.
Statements may be variable declarations, assignment statements or calls of
existing functions. All the statement lines are optional. A function that does
nothing is sometimes useful as a stub for future code. All data C handles must
have a data type, so a function definition must declare the type of data it
produces, that is, the data it returns to its calling function.
GNU C allows function definitions nested within a function.
An assignment statement has the form <pre>
(not real C code)
ob_expression = expression ;</pre>
ob_expression means some code that can be resolved to an object, i.e. an
entity that can store a value. This is a new meaning of the term "object"
within this seedoc. The simplest example of an object in this sense is
a variable. An expression that designates an object is called an lvalue.
In other words, the left side of an assignment must represent a storage
location of some kind in memory.
The = sign is the basic assignment operator. It means when this statement
has been performed the object on the left (the lvalue) will contain the results
of having performed the expression on the right.
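<p>
As a sketch of the two forms just described, here are some small function
definitions whose bodies are mostly assignment statements. The function
names are hypothetical. Note the different kinds of thing that can sit on
the left of the = sign.

```c
/* return_type name ( arguments ) { statements } */
int scale(int value, int factor)
{
    int result;                 /* a declaration */

    result = value * factor;    /* an assignment: result is the lvalue */
    return result;
}

void store_third(int table[], int value)
{
    table[2] = value;           /* an array element is an lvalue too */
}

void store_through(int *slot, int value)
{
    *slot = value;              /* so is whatever a pointer points at */
}
```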
<h2>controlling the order of execution</h2>
Execution of a C program begins at the first statement in main(). The
default sequence of execution flow of the program is from top to bottom
within a function, including main(), which can be altered with structured
flow control constructs and labels/gotos. The time sequence of actions
within a statement is determined by the precedence of the operators
used, which operators are used, and parentheses. C has hairy precedence
rules for its many operators, so use lots of parentheses. The action of
parentheses in C expressions is fairly intuitive, as far as plain expressions
are concerned. Unfortunately parentheses are also used as argument
delimiters for functions and flow control constructs, and for the
typecasting operation. More on these later. Meanwhile, parentheses are
a welcome simplifier for expressions.
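<p>
A small sketch of precedence versus parentheses. The function names are
made up. The last one shows the kind of case where leaving the parentheses
out quietly changes the meaning.

```c
int without_parentheses(void)
{
    return 2 + 3 * 4;          /* * binds tighter than +, so this is 2 + 12 */
}

int with_parentheses(void)
{
    return (2 + 3) * 4;        /* the parenthesized group is done first: 5 * 4 */
}

/* Without the inner parentheses, value & 1 == 0 would be read as
   value & (1 == 0), because == has higher precedence than &. */
int is_even(int value)
{
    return (value & 1) == 0;
}
```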
<p>
As pertains to labels and structured flow controls such as for-loops and
while-loops, a statement is a unit of program flow. There are, however,
two constructs to control and alter program flow within a statement.
<p>
A comma is an operator in some contexts. The comma operator creates a
compound expression. For each comma operator, the left side is evaluated,
and its results other than its value as an expression are asserted,
which may include changing values in variables and so on. The right
side of the comma operator expression is then evaluated, which may be
affected by the side-effects of evaluating the left side, and the value
of the comma expression as a whole is the value of the right side. This
creates a sub-program within an expression, and is sometimes used for
complex behavior where an expression is expected, such as in the loop
control specifiers of a for-loop. The comma expression <pre>
j = 2, j * 4
</pre>
will evaluate to 8, with the type j has. The term "side-effects" in the
above sense is usually applied to functions, and means changes to things
besides their return value.
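<p>
Here is the comma operator in compilable form, as a sketch with made-up
function names. The parentheses in comma_value() matter, since assignment
binds tighter than the comma. The second function shows the for-loop use,
where two index variables are set up and stepped together.

```c
int comma_value(void)
{
    int j;
    int result;

    result = (j = 2, j * 4);   /* left side stores 2 in j, right side gives 8 */
    return result;
}

/* Reverse an array in place; the comma operator lets the for-loop
   initialize and step two index variables at once. */
void reverse_in_place(int values[], int count)
{
    int low, high, saved;

    for (low = 0, high = count - 1; low < high; low++, high--) {
        saved = values[low];
        values[low] = values[high];
        values[high] = saved;
    }
}
```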
<p> The conditional operator is an "if" construct within an
expression. The
format is <pre>
(not real C code)
expressionC ? expressionT : expressionF
</pre>
The example represents one expression. Its value is the value of
expressionT if expressionC has a value other than zero (true), and its value
is the value of expressionF if expressionC is 0 (false). The side-effects
of expressionC and the other expression evaluated are asserted. In
other words, expressionC is the conditional, ? is the true/false test,
expressionT is the part to be performed IF TRUE, and expressionF is the
part to be performed IF FALSE.
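<p>
A sketch of the conditional operator follows, using hypothetical names.
The counter is only there to make the side-effects visible: each call of
pick() performs just one of the two noted() calls.

```c
int evaluations = 0;           /* counts calls to noted(), a visible side-effect */

int noted(int value)
{
    evaluations = evaluations + 1;
    return value;
}

int larger(int a, int b)
{
    return a > b ? a : b;      /* a if the test is true, b if it is false */
}

int pick(int condition)
{
    /* only the selected noted() call is performed */
    return condition ? noted(10) : noted(20);
}
```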
<p>
When a function is called, it is entered and its sequence of statements
and its flow controls are performed until such time as it returns to
the caller. Functions nest arbitrarily deep. That is, a function may
call a function which calls a function which calls a function etc. etc. A
function may call itself; this is called recursion.
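<p>
The usual small illustration of recursion is a factorial function. This is
a sketch, not something from cLIeNUX, and it makes no attempt to detect
results too big for an unsigned long.

```c
unsigned long factorial(unsigned int n)
{
    if (n <= 1)
        return 1;                    /* stop here: no further calls */
    return n * factorial(n - 1);     /* the function calls itself */
}
```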
<p>
Within a single function, there are a variety of flow control constructs
available in C to implement conditional execution of code sections and
loops of various kinds. The rudimentary, un-structured flow modifier is
a goto combined with a label. A label is specified like <pre>
labelname:</pre>
and represents the statement following it. C code in the form of <pre>
(not real C code)
toploop(){
statement a;
target: statement b;
statement c;
goto target; /* statement d */
statement e;
}</pre>
will do statements a, b, c and d, and then loop endlessly over statements
b, c and d. Without other provision for changing the flow, statement e
will never be executed and toploop() will never return to its caller.
Endless loops are useful in some situations, such as the top
user-interface loop of an interactive program. Statement e in this example
is what is called "dead code". The C compiler might optimize it away in
the final object code, if optimization is being used.
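<p>
For contrast with the endless loop, here is a sketch of a goto loop that
does have a way out. The function name is invented, and it borrows the
<em>if</em> test ahead of its proper introduction.

```c
int count_with_goto(int limit)
{
    int n = 0;

again:
    if (n < limit) {
        n = n + 1;
        goto again;        /* jump back up to the label */
    }
    return n;              /* reached once the test fails */
}
```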
<p>
A goto can go to a label anywhere in the same function. In particular, it
can cross the block boundaries of the other flow control constructs I'm
about to describe. This has issues. See <a
href=file://localhost/help/see/programming.7.html> programming </a> for
comments on goto.
<p>
STRUCTURED PROGRAMMING<br> sections and jumps<br>
Several statements enclosed in curly-braces <b>{ }</b> are called a
block. A block
is syntactically equivalent to a single statement; a block or a single
statement are interchangeable in the syntax of most flow control
constructs. Blocks, like function definition bodies, do not have a
trailing semicolon. The difference between blocks and the braces in
a function definition is the braces are not optional for a function,
but may be for other constructs.
<p>
The <b>break</b> statement will exit several types of flow-control
blocks, and
is necessary for normal use of the switch/case construct. The
<b>continue;</b> statement is used in loop constructs to end the current
loop iteration without leaving the loop, i.e. start the next iteration
of the loop immediately. The <b>return</b> statement leaves a function,
and can pass a value to the calling function it is returning to. Falling
through to the end <b>}</b> of a function returns to the caller with no
defined value; only in main(), and only since C99, is that equivalent to
return 0; . Note that a return statement can be inside a flow-control construct
like a <b>while</b> loop.
<p>
<h3>decisions</h3>
Conditional execution of statements may be caused by an
<b>if</b> construct. The general format is<pre>
(not real C code)
if ( expression )
statement or block
else if ( expression )
statement or block
else if ( expression )
statement or block
else
statement or block</pre>
The expressions are evaluated until one evaluates true (non-zero), or
until the <b>else</b> is encountered, and the following block/statement is
executed. Then flow resumes after the else part, outside the conditional.
Each section but the "if" section is optional. That is, the simplest
case is<pre>
(not real C code)
if ( expression )
statement</pre>
This is an example of the generality of <em>{ }</em> blocks. The
statement following the if clause can be a single <em>;</em>-terminated
statement, or a braces-enclosed block of statements.
<p>
The usual <b>case</b> construct is called <b>switch</b> in C.
<pre>
(not real C code)
switch ( expression ) {
case constant : statements
case constant : statements
case constant : statements . . .
default : statements
} </pre>
The expression is evaluated, and the case with the matching value for its
constant is jumped to. If no case matches then <b>default:</b> is jumped
to. The constants in the above may be constant expressions, ones the
compiler can evaluate itself. They must each evaluate
to a unique integer within the set. This is in effect a multi-target goto
with numbered labels, where the expression determines which case is the
goto target label. Flow does not automatically exit the construct after a
case is executed. That means break statements must be used to end atomic
cases, or flow will fall through into the following cases. The order of
the cases is not rigid, and how you order the cases may affect which case
is tested for first, which may affect performance. That is, in the ones
I've compiled anyway, a switch construct becomes several discrete tests
and branches, and you may want the most frequent cases first.
<p>
tested loops<br>
The while-loop construct tests an escape condition at the beginning of
each iteration of the loop. The do-while loop construct tests the escape
condition after each iteration of the loop. do-while is often used when
a loop is intended to always iterate at least once.<pre>
(not real C code)
/* while loop */
while ( expression )
statement or block
/* do-while loop */
do
statement or block
while ( expression ) ;
</pre> counted loops<br>
A counted loop can be constructed from while or do-while. In fact, any
flow control construct can be created with if and gotos/labels, but not
having to do that is one reason third-generation languages were developed.
C provides the <B>for</b> loop, which is very general, but is intended
as a convenience for counted loops. Its format is<pre>
(not real C code)
for ( init_expr ; test_expr ; incr_expr )
block/statement
</pre> Here's an example program using a for-loop... <pre>
/* for.c for-loop demo */
#include <stdio.h> /* declare printf from libc */
int i; /* we need a loop increment
var. */
int main(void){ /* program takes no arguments */
for (i = 0 ; i < 30 ; i = i + 1) /* for 0 thru 29,
count by 1's */
{ printf ("%d ", i ); /* print the count as a
decimal number, with
some trailing blanks */
}
printf("\n"); /* print a newline when done
looping */
return 0; /* tell the caller we succeeded */
} /* end program */
</pre>
That's fairly plain-vanilla C. It differs from most code in that the
comments are a bit verbose, (and crunched a bit for html,) since normally
one could assume the reader knows C. Also, "i = i + 1" is usually
expressed with the C increment operator ++, e.g. i++ . The curly-braces
around the single-statement for-loop body are unnecessary, but typical
for clarity. The indentation style is what I use. The documentation for
all the libc calls, including printf, is not in cLIeNUX Core, but it
is in a package. printf is almost a language unto itself, with lots
of formatting and conversion options. Paste the above into a text file
named for.c, gcc for.c, run it, change it, make it do something clever.
<h2>DATA TYPES</h2>
C is called a "typed language". When your code does something like
<b>+</b> in C, the compiler figures out what kind of things you are
adding, and then creates the appropriate assembly code. This is also
called "operator overloading", because operators like <b>+</b>, <b>-</b>,
<b>%</b>, <b>/</b>, <b><<</b> and so on have several possible meanings depending on
the data types of the entities they are currently being invoked on.
Really C is a "typed-data language". If you don't have typed data,
then you usually wind up with typed operators, i.e. various operators
for various datatypes.
<p>
The basic types in C are <b>void</b>, <b>char</b>, <b>int</b>, and
<b>float</b>. A <em>void</em> object has no size, and is sometimes useful
with "pointers", addresses of other objects. In other words, <em>void</em>
isn't nothing, it's an address of something of un-specified size or type.
<em>char</em> is, in practical terms, a byte. <em>int</em> is often the
same size as a machine address in a particular implementation, which on
32-bit Linux x86 is four bytes; on 64-bit Linux an <em>int</em> stays four
bytes while addresses are eight. <em>float</em> is a floating-point number.
<p>
Possible qualifiers of the above types include <b>unsigned</b>,
<b>short</b>, <b>long </b>(for
ints), <b>double</b> (for floats), and <b>signed</b> (for chars).
The <b>const</b> qualifier states that the object is constant, and the
<b>volatile </b>
qualifier says that the object's value may be changed by something other
than the program. <em>volatile</em> may be a necessary qualifier when an object
represents an input/output port of some kind, for example.
<p>
Data types are part real and part abstraction. An actual storage location
for a datum, an lvalue, has a certain size. That's a very real constraint.
If your C code tells the compiler a variable is an <em>int</em> it allocates 4
bytes for it (on x86). If you declare it <b>unsigned</b>, then that's an
abstraction, and affects how the data is handled, but it's still 4 bytes.
Sizes of things are a matter of physical reality, but typing information
more specific than that is entirely a service of, and internal to, the
compiler.
<P>
Situations often come up where you want to add two integral types of
different sizes. C will do a lot of different type conversions if
situations arise where it seems OK to do so. Usually what is allowed is a
"promotion", from <em>char</em> to <em>int</em> for example. If you add a
char and an int, the value produced will be carried around in the
compiler's idea of things as type int, which is the conversion, a
promotion, that results in no loss of information. That is, an
<em>int</em> holds all the bits of a <em>char</em> without losing any.
<h3>type casting</h3>
Type conversion, causing C to handle something as some particular type,
can also be caused deliberately by the programmer with the C cast
operator. In ANSI C you can't cast a memory-allocated object to some
other type, since an object has some predetermined amount of actual
memory storage. Older versions of gcc (before 4.0), however, allowed casting
lvalues as an extension if it was physically possible to do so, i.e. for
types of the same size, such as ints and pointers (usually).
The syntax of the cast operator is <pre>
(type) expression</pre>
That means that declared types of things are their defaults, but you can
do just about anything to them, as may be desirable. There are a lot of
possible conversions though, and what happens when you cast e.g. a
<em>float</em> to type <em>unsigned int</em> is something you had better
check in your particular C implementation if you need such strange
behavior.
<h2>pointers</h2>
An object whose purpose is to hold the memory address of other objects
is called a "pointer". Most useful programs involve pointers in one way or
another. C provides the unary <b>&</b> operator, and a
unary expression like<pre>
( & my_variable )
</pre>
evaluates to, or returns, or is seen by the compiler as, the address and
type of my_variable. Let's say you have an <em>int</em> variable called
fake_pointer. If you do <pre>
int fake_pointer;
fake_pointer = (int) & my_variable;
</pre>
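that statement will result in the contents of fake_pointer being the
address of my_variable, so you've created a pointer. (This assumes an
<em>int</em> is as wide as an address, as on 32-bit x86; where addresses
are wider, the cast can discard the high bits.) You've lost some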
information though, or rather the compiler has lost some information.
When you stored <em>&</em> my_variable in an <em>int</em>, (which you'll
get compiler warnings about if you don't do the cast to type int,) the
compiler lost track of the datatype of my_variable. All you stored was
my_variable's address. You can keep track of types yourself and handle the
necessary conversions with casts, or you can declare variables to be
pointers to objects of some type.
<p>
Given an address in a variable, you need some means to obtain the object
that address points to. This is called "dereferencing". The name of a
similar operation in the parlance of the Forth programming language is
"fetch", which I think is rather intuitive. A thing that points at another
thing is also known as a "degree of indirection". The C fetch or
dereference operator is unary <b>*</b>. That is, <em>*</em> not in an
arithmetic expression, but rather preceding the name of a pointer. I
suspect that perhaps one of the really confusing things in C is that
<em>&</em> and <em>*</em> are not exactly symmetrical. This is because of
data typing. You can't directly fetch something with an <em>int</em>,
because that doesn't get you a datatype for the pointed-to object, which
in most cases is useless, so C doesn't allow it. You can fetch something
with an <em>int</em> though, with a cast. What you are doing with the cast
is providing information C needs to keep track of types. Because of
operator overloading, because e.g. <em>+</em> is various operations for
various types, types have to be kept track of by you, with casts, or by C,
based on declarations.
<pre>
int fake_pointer, my_variable, other;
my_variable = 77 ;
fake_pointer = (int) & my_variable;
other = *(int *)fake_pointer;
</pre>
"other" now contains the same value as my_variable, 77, but was
passed that value using just addresses. It also happens to have the
same type, <em>int</em>. Doing it that way means that you, the
programmer, kept track of the datatype. Sometimes you may want to do that,
usually you don't. I usually do, but I'm weird.
<p>
Casts can be pretty arbitrary, especially in gcc, and arbitrarily
complex. Casts bind right-to-left, so the <em> *(int *)</em> is a cast to
pointer to <em>int</em>, specified by the <em>(int *)</em>, followed in
time sequence by a fetch or dereference, specified by the <em>*</em>.
That's the minimum you have to do to an object declared <em>int</em> to
dereference it as a pointer.
<p>
More typically, and more conveniently, but maybe not as clearly, you can
declare variables specifically for pointers as type "pointer to [type]".
A declaration of a variable to hold the address of another object of type
<em>float</em>, for example, would be<pre>
float * my_float;
</pre>
That creates an object the size of a machine address that is considered
to be the address of an object of type <em>float</em>. The compiler then
handles my_float and the object it points to various ways depending on
context.
<pre>
/* pointer_demo.c, pointer values and pointer net values */
#include <stdio.h> /* declare printf from libc */
int main(void){
int a, b ; /* declare a couple ints */
int * p, * q; /* declare a couple pointers to ints */
a = 777; /* give our int a value */
p = &a; /* set what p is pointing at */
q = p ; /* copy a pointer to a pointer.
The address is copied.
*/
printf("%d\n", * q ); /* print the object/net value the copy of
the pointer points at */
b = 4; /* initialize our other int */
*q = b ; /* change the value of what q is pointing at */
printf("%d\n", *q ); /* print the contents of what q points at */
printf("%d\n",a ); /* print what we set to 777, and then reset
to 4 indirectly via a pointer */
return 0; /* tell the caller we succeeded */
}
</pre>
A pointer to <em>void</em> is an address, but the pointed to object has
no size. Pointers to <em>void</em> are used at times to handle pointers
that you want to point at different types of objects at different times in
the program.
<h3>derived types</h3>
Compound objects or types with direct support in C are strings, arrays,
structs, enums, and unions. I guess in a sense a pointer is a compound
object also. Compound types are in a sense clusters of pointers of
various types that C handles internally to the particular type.
C/unix handles strings as pointers to type <em>char</em>. The fact
that strings vary in length is handled by terminating strings with a
zero byte. The zero-terminator convention is the only thing that makes
a C string any different than a regular pointer to <em>char</em>.
<pre>
char *stringy_thingy;
stringy_thingy = "My string\t\t\t for illustration\n" ;
</pre>
That declares a pointer to type <em>char</em>, and then initializes it to
the string literal shown. The actual value stringy_thingy will then
contain is the address of the "M" in "My". <em>\t</em> is the C string
literal escape to include an ASCII tab byte in a string, and <em>\n</em>
represents a newline. This convention is reflected in most unix
programming languages. When you declare a string literal C puts a zero on
the end of it, and then the various library routines and so on can
traverse the string from its pointer address up to the zero. This is the
case with stringy_thingy, and with the printf format control construct %s.
For example,<pre>
/* string.c demo of null-terminated string */
#include <stdio.h> /* declare printf from libc */
char * string = "blah blah woof woof " ;
int main(void){ printf("%s %s %s \n\n\n", string, string, string );
return 0;
}</pre>
A <b>struct</b> is an arbitrary grouping of data. The struct mechanism
allows an arbitrary data grouping to be replicated. Structs also implement
a hierarchical naming scheme for data somewhat like the pathnames of
a filesystem. A <b>union</b> is an object that can be accessed as more
than one type. <pre>
/* data_toy.c play with a struct within a union */
#include <stdio.h>
union convrt {
unsigned char b[20];
struct clump {
int header ;
int body[4]; }
clmp ;
} convert ;
int main(void)
{ int i ;
convert.clmp.header = 5555;
for (i = 0 ; i < 4 ; i++)
convert.clmp.body[i] = i * 4444 ;
for (i=0; i < 20 ; i++)
printf("%d ", convert.b[i] );
printf("\n");
return 0;
}</pre>
I didn't comment this one, because I'm going to unravel it here. The mess
above main() is a union definition. The union type is tagged convrt, its single instance is named convert, and it is
a union of a struct and an array of chars. The definition of the struct
is contained right in the definition of the union. Recall that runtime
code can't be outside a function. The things within the struct and union
definitions that look like statements, i.e. that are ended by a semicolon,
are the individual components of the compound structure definition. Unions
and structs have the format<pre>
(not real C code, [] encloses options)
struct classname {
type fieldname;
[ type fieldname;
.
.
. ] }
[ instance_name, ... ] ;
</pre>
The format for a <em>union</em> would say "union" where the above says
<em>struct</em>. If there are no instance names then it's a declaration.
If there are instance names, instances of that type are defined and
allotted storage. For a <em>struct</em>, the system makes a data format to
keep the fields organized. For a <em>union</em>, the fields actually
overlap in memory. In other words, a <em>union</em> provides a physical
location that you can manipulate as various types. Unions, like casts,
are an escape-clause of sorts for C typing.
<p>
Looking back to data_toy.c, our array of chars overlaps our
<em>struct</em>. That means we can fill the <em>struct</em> as the data
types it's defined as, and look at it as consecutive chars, <em>unsigned</em>
bytes. This is a form of low-level data conversion. By running data_toy.c
you can see how C types are actually stored as bytes.
<p>
What else did I introduce in data_toy.c? Too much, actually. Well, the
struct/union equivalent of the / in the unix file namespace is a period.
Fields in a <em>struct</em> are a compound name of the form struct.field,
similar to dir/dir/file in a unix filesystem. Also, this time I gave the
sizes of the arrays at declare-time. A <em>union</em> will be at least the size of
its largest field, but the declaration has to know what size that is.
<p>
An <b>enum</b> is a sequence of names for sequential constant
ints. I don't find enums very useful.
<h2>cpp THE PREPROCESSOR</h2>
The C preprocessor is used extensively in most large C programs. It
responds to certain character sequences in its input by performing
various modifications of its input, such as removing <em>/* */</em>
comments. It is something like a programming language, but quite unlike C.
It has variables and so on, but in its own context. All <em>cc1</em> sees
of the <em>cpp</em> variables and so on is their effects on the C source
sent to <em>cc1</em>. <em>cpp</em> is what is called a macro processor.
Macros are text representing other text. <em>cpp</em> thinks in lines,
unlike C; in <em>cpp</em> newlines aren't the same as other whitespace.
<p>
We can divide the directives to <em>cpp</em> into two classes by what
text causes them; special and named. Comment removal and line joining are
caused by specific simple character sequences in the input. You know about
<em>/*</em> and <em>*/</em> around comments. Line joining is caused when a
line ends with a <b>\</b> and an immediately following newline. <em>\</em>
allows you to make multi-line cpp directives.
<p> Named <em>cpp</em> directives all require a <em>#</em> prefix.
<em>#include</em> is one. The
prefix must be the first non-whitespace character on the line. There may
be whitespace between the <em>#</em> and the name. Here are six legit
named <em>cpp</em> directives... <pre>
#include <stdlib.h>
# include "app_local.h"
#define PI 3.1415926
#if 0
# define PI 3
#endif </pre>
If the above were in a C source file, when processed by <em>cpp</em>, the
files ./app_local.h and the stdlib.h in the standard system header files
would be inserted into the file, the macro PI would be converted to the
string 3.1415926 anywhere it occurred in the file, and the conditional
action of the next 3 lines shown wouldn't have any effect on the output,
since 0 is false. <em>cpp</em> doesn't do any math on macro text. 3 and 3.1415926
are just text strings to <em>cpp</em>. The 0 is the exception, since <em>cpp</em>
does evaluate the integer expression after <em>#if</em>, but the
point is <em>cpp</em> is all about text-to-text. PI in the above is a
macro, text. The prevailing convention, and a good one, is to use all-caps for
<em>cpp</em> macros in .c files.
<p>
Macros can be given a syntax of their own, and can take arguments. A macro
<em>#define</em> has to use the aforementioned line-joining mechanism to
be longer than one physical line. For a macro to take arguments, it has a
syntax roughly like a C function definition, and the <em>(</em> opening
the argument list must follow the macro name immediately. The following is
legit <em>cpp</em>. <em>cpp</em> also does some checking of whether it's
legit C or not. Run <em>cpp</em> on this.<pre>
/* twiddle.c cpp demo, real but rather bogus C. */
int first, second, third;
# define TWIDDLE(A, B, C) C = B + A + C ; \
"continued line" ;
TWIDDLE(first, second, third)
</pre> Note that the <em>\</em> logical line continuation operator only
works as such if the very next character in the file is a newline. If
there is other whitespace between the <em>\</em> and the newline it
doesn't work. This is one of my pet peeves with unix tools; some things
that are invisible have important meanings. "make" and sh-style shells
have this mis-feature also.
<h2>exercises</h2>
Healthy stuff. Even if you are reading this for non-programming reasons,
such as wanting to be better at importing apps to cLIeNUX, you should
write some code. Pick one of cLIeNUX's small scripted commands, like
<b>add</b> and write a C version, and add a feature or two.
<h2>bailing out</h2>
OK, I've spent too much time on this seedoc. Keep in mind that most
things you might have questions about can be tried and proven, which
is the best docs. If it's just one feature you are having trouble with,
a small test/experiment program can be written in a minute or two.
You also have the world of open source unix for examples. Start with
small stuff. Linux kernel code, as a counter-example, is a huge wad
of distant cross-references, and uses a lot of GNU assembly linking
tricks. Not for newbies. Instead, look at small utilities.
<p>
This file is about one tenth the size of "The C Programming Language"
Second (ANSI) Edition, Kernighan and Ritchie, the standard book on C by
the authors of the language, which I have had sitting in front of me for
this. Trying to present C in this much space is of course absurd. I do
think I've given a bit more bottom-up presentation than that work, and I
do think there's enough info here to write small useful programs.
<p>
C is a third generation language. Various parties represent other languages
as fourth-generation. The actual fourth-generation language is Forth,
circa 1971. Do as I say, not as I did, and learn C before Forth. Then learn
Forth.
<p> <em>RIGHTS</em><br>
Copyright 1999 Richard Allen Hohensee<br>
This file is released for redistribution only as part of an entire intact
cLIeNUX Core.
I have since released cLIeNUX, and thus this post, to the public domain.