Thursday, October 23, 2008

What is a line?

When a person reads a text file on a computer, the file appears neatly formatted, line by line. A COBOL program on mainframe MVS is displayed as 80-column lines. A C program on unix is likewise displayed line by line, and so are many DOS files.

People see a text file as a sequence of lines. How does the computer see the same file?

Actually, the computer does NOT see the file in terms of the human concept of lines.

In fact, the file is stored as a STREAM of bytes, i.e. one byte after another, continuously, until the end of the file. Some special handling (e.g. a line delimiter) is then used to identify which portion of that long stream of bytes is line 1 and which portion is line 2.
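
The idea can be sketched in a few lines of Python. This is only an illustration: the function name split_stream and the sample bytes are made up for this example, and the delimiter is assumed to be the 2-byte sequence 0D 0A.

```python
def split_stream(stream, delimiter):
    """Cut one continuous byte stream into lines at each delimiter."""
    return stream.split(delimiter)

# Illustrative stream: the computer only holds these bytes, one after another.
stream = b"line-1\r\nline-2"
lines = split_stream(stream, b"\r\n")
print(lines)  # [b'line-1', b'line-2']
```

The "lines" exist only after the split; in the stream itself there is nothing but bytes.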

For example, consider this 2-line file (named C:\example.txt) in DOS:

This is first line.
This is second line.

This file has 2 lines. The first line [ This is first line. ] has 19 characters, counting from T to the final period. Similarly, the second line [ This is second line. ] has 20 characters. That gives a total of 39 characters.

However, when you use the DOS command [ dir ] to examine the file size:

C:\> dir example.txt
2006-03-20  13:39                41 example.txt
1 File(s)             41 bytes

The [ dir ] command reports a file size of 41 bytes, not 39 bytes. Why are there 2 more bytes?

So how is the file actually stored in DOS? Using the [ debug ] program, the following can be seen:

C:\> debug example.txt
-d
0B1A:0100  54 68 69 73 20 69 73 20-66 69 72 73 74 20 6C 69   This is first li
0B1A:0110  6E 65 2E 0D 0A 54 68 69-73 20 69 73 20 73 65 63   ne...This is sec
0B1A:0120  6F 6E 64 20 6C 69 6E 65-2E 6B 6A 6B 6A 65 72 6A   ond line.kjkjerj
0B1A:0130  20 64 67 6B 3B 6C 64 73-20 6C 6B 6A 66 67 6C 73    dgk;lds lkjfgls
0B1A:0140  20 64 68 6A 73 64 6B 68-6A 6B 73 68 20 73 20 68    dhjsdkhjksh s h
0B1A:0150  6B 73 20 68 20 73 68 20-73 20 68 20 64 73 68 20   ks h sh s h dsh
0B1A:0160  66 64 73 20 68 20 73 66-64 68 20 6B 20 73 68 20   fds h sfdh k sh
0B1A:0170  6B 68 6A 6B 73 68 6B 20-73 68 20 73 66 64 68 20   khjkshk sh sfdh
-q

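As one can see, the file is stored as a STREAM of characters on the hard disk. After the string [ This is first line. ], there are 2 characters [ 0D 0A ]. Then the second line follows. The file ends at the period of the second line (the byte [ 2E ] in the third row of the dump); the bytes displayed after it are leftover memory contents, not part of the file.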

These [ 0D 0A ] characters are called the line delimiter. The 39 bytes of text plus the 2-byte line delimiter make up the 41-byte file.
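
The arithmetic can be checked with a short Python sketch. It rebuilds the file contents in memory (the variable names are illustrative) and assumes, as the 41-byte size indicates, that there is no delimiter after the last line.

```python
first = b"This is first line."
second = b"This is second line."
CRLF = b"\x0d\x0a"  # the 2-byte line delimiter

# Assumption: no delimiter follows the last line, matching the 41-byte size.
content = first + CRLF + second

print(len(first), len(second))   # 19 20
print(len(first) + len(second))  # 39
print(len(content))              # 41

# Hex view of the first 24 bytes, in the same style as the debug output.
print(" ".join(f"{b:02X}" for b in content[:24]))
```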

This line delimiter [ 0D 0A ] also tells a text editor to display the file as 2 lines.

In the ASCII encoding, hexadecimal [ 0D ] is the decimal value 13. It is called [ carriage return ], abbreviated [ CR ]. Similarly, [ 0A ] is decimal 10, called [ line feed ], abbreviated [ LF ]. Together, [ 0D 0A ] is written as CRLF.
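
These values are easy to confirm in Python, where "\r" and "\n" are the usual escapes for carriage return and line feed (the names CR and LF below are just labels for this sketch):

```python
CR = "\r"  # carriage return
LF = "\n"  # line feed

for name, ch in (("CR", CR), ("LF", LF)):
    # Print the name, the decimal value and the 2-digit hexadecimal value.
    print(name, ord(ch), format(ord(ch), "02X"))
# CR 13 0D
# LF 10 0A
```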

In the unix world, the line delimiter is [ 0A ] alone in most common settings.

As a result, when using FTP to transfer text files between DOS and unix, it is better to use the [ ascii ] option to turn on line delimiter conversion between [ 0D 0A ] and [ 0A ]. If someone forgets to do so, then after uploading a DOS file to unix there will be a ^M character at the end of each line (when the file is opened in the vi editor). This ^M is actually the [ 0D ] character, which unix does NOT treat as part of the line delimiter and therefore displays as an ordinary character.
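
The stray [ 0D ] can be reproduced with a small Python sketch. It only mimics the effect of the conversion and is not how any FTP client is implemented; the function name dos_to_unix and the sample data are illustrative.

```python
def dos_to_unix(data):
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")

dos_file = b"This is first line.\r\nThis is second line.\r\n"

# Transferred unchanged: unix splits on LF only, so the CR stays in each line.
raw_lines = dos_file.split(b"\n")
print(raw_lines[0])  # b'This is first line.\r'

# Converted: the stray CR is gone.
print(dos_to_unix(dos_file).split(b"\n")[0])  # b'This is first line.'
```

The trailing \r in the first result is the byte that vi shows as ^M.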

A similar mistake can occur when downloading a unix file to DOS. Without the [ ascii ] option, there is no line delimiter conversion. The received file in DOS will be delimited by [ 0A ] only, not the usual DOS line delimiter [ 0D 0A ]. So far, the Notepad application CANNOT recognize a lone [ 0A ] as a line delimiter, so you will see all the lines run together into one very long line. Another application, Wordpad, is more intelligent. It knows that [ 0A ] is also a line delimiter and opens the file normally for a person to read.
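
If a file has already been downloaded without conversion, the delimiters can be repaired afterwards. The following Python sketch shows one way to do it; the function name unix_to_dos is illustrative, and the code works on bytes already read into memory.

```python
def unix_to_dos(data):
    """Turn every LF line delimiter into CRLF."""
    # Normalize first so existing CRLF pairs are not turned into CR CR LF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

unix_file = b"This is first line.\nThis is second line.\n"
dos_file = unix_to_dos(unix_file)
print(dos_file)  # b'This is first line.\r\nThis is second line.\r\n'
```

Because of the normalization step, running the function on a file that already uses [ 0D 0A ] leaves it unchanged.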

In the mainframe MVS world, if the dataset is a QSAM dataset, there is no need for a line delimiter. The dataset organization already tells the exact number of characters in each line.
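
A rough Python sketch of delimiter-free lines follows. It assumes fixed-length records of 80 bytes (matching the 80-column COBOL example) and uses ASCII text for readability, whereas real MVS datasets normally hold EBCDIC. The names RECORD_LENGTH and split_records are made up for this illustration.

```python
RECORD_LENGTH = 80  # assumed fixed record length

def split_records(data, record_length=RECORD_LENGTH):
    """Cut a byte stream into records using only the known record length."""
    if len(data) % record_length != 0:
        raise ValueError("data is not a whole number of records")
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]

# Two records, each padded with spaces to 80 bytes; no CR or LF anywhere.
data = (b"This is first line.".ljust(RECORD_LENGTH)
        + b"This is second line.".ljust(RECORD_LENGTH))
records = split_records(data)
print(len(data), len(records))  # 160 2
print(records[1].rstrip())      # b'This is second line.'
```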

