# Data Presentation

## 1           Data representation

Representing data types is a recurring problem, as different computer systems store and represent data in different ways. For example, the PC, which is based on Intel microprocessors, uses the little endian approach for multi-byte values. The little endian form stores the least-significant byte in the lowest memory location and the most-significant byte in the highest location. The big endian form, as used with Motorola-based systems, starts with the high-order byte and ends with the low-order byte. For example, with little endian, the 16-bit integer values 4 (0000 0000 0000 0100b), 5,241 (0001 0100 0111 1001b) and 26,152 (0110 0110 0010 1000b) would be stored as:

| Memory location | Contents (hex) | Contents (binary) | Value |
|---|---|---|---|
| 00 | 04 | 0000 0100 | 4 |
| 01 | 00 | 0000 0000 | |
| 02 | 79 | 0111 1001 | 5,241 |
| 03 | 14 | 0001 0100 | |
| 04 | 28 | 0010 1000 | 26,152 |
| 05 | 66 | 0110 0110 | |

Whereas, in big endian, it would be stored as:

| Memory location | Contents (hex) | Contents (binary) | Value |
|---|---|---|---|
| 00 | 00 | 0000 0000 | 4 |
| 01 | 04 | 0000 0100 | |
| 02 | 14 | 0001 0100 | 5,241 |
| 03 | 79 | 0111 1001 | |
| 04 | 66 | 0110 0110 | 26,152 |
| 05 | 28 | 0010 1000 | |

Thus a program written for a PC would incorrectly read data written by a big endian system (typically a UNIX workstation), and vice versa. Another problem is that different computer systems represent data (such as numeric values) in different formats. For example, an integer can be represented with 16 bits, 32 bits, 64 bits or even 128 bits. The more bits that are used, the larger the range of integer values that can be represented.
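The byte orders in the two tables above can be reproduced with a short sketch using Python's standard `struct` module. The variable names are illustrative only; the last step shows the kind of misreading that occurs when little endian data is read as big endian.

```python
import struct

values = [4, 5241, 26152]

# "<" selects little endian, ">" selects big endian; "H" is an unsigned 16-bit integer.
little = struct.pack("<3H", *values)
big = struct.pack(">3H", *values)

print(little.hex(" "))  # 04 00 79 14 28 66
print(big.hex(" "))     # 00 04 14 79 66 28

# Reading little endian bytes with the wrong byte order gives the wrong values.
misread = struct.unpack(">3H", little)
print(misread)  # (1024, 30996, 10342)
```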

All these problems highlight the need for a conversion technique that knows how to read the value from memory, and convert it into a standard form that is independent of the operating system or the hardware of the computer. This is the function of eXternal Data Representation (XDR), which represents data in a standard format. In XDR the basic data types are:

• Unsigned integer and signed integer. Both unsigned and signed integers use a 32-bit value. The unsigned value has a range from 0 to 2³² – 1 (4,294,967,295), whereas the signed integer uses 2’s complement, which gives a range of –2,147,483,648 (1000 0000 0000 … 0000) to +2,147,483,647 (0111 1111 1111 … 1111).
• Single-precision floating point. A single-precision floating-point value uses the 32-bit IEEE floating-point format. An example is given later. The range of normalised values is approximately ±1.2×10⁻³⁸ to ±3.4×10³⁸.
• Double-precision floating point. A double-precision floating-point value uses the 64-bit IEEE floating-point format. The range of normalised values is approximately ±2.2×10⁻³⁰⁸ to ±1.8×10³⁰⁸.
• String. A string is represented with a number of bytes. The first four bytes define the number of ASCII characters that follow. For example, if there were four characters in the string then the first four bytes would be: 0, 0, 0, 4, followed by the four characters in the string. Note that this differs from the way that the C programming language represents strings, as C uses the ASCII NUL character to mark the end of a string.
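As a rough illustration of these basic types, the sketch below packs values with Python's `struct` module. The helper names (`xdr_uint`, `xdr_int`, `xdr_float`, `xdr_double`, `xdr_string`) are hypothetical and not part of any library. It assumes big endian byte order and zero-padding of strings to a multiple of four bytes, as in the XDR specification (RFC 4506); the padding rule is not covered in the text above.

```python
import struct

def xdr_uint(value: int) -> bytes:
    """Unsigned 32-bit integer, big endian."""
    return struct.pack(">I", value)

def xdr_int(value: int) -> bytes:
    """Signed 32-bit integer in 2's complement, big endian."""
    return struct.pack(">i", value)

def xdr_float(value: float) -> bytes:
    """32-bit IEEE single-precision value."""
    return struct.pack(">f", value)

def xdr_double(value: float) -> bytes:
    """64-bit IEEE double-precision value."""
    return struct.pack(">d", value)

def xdr_string(text: str) -> bytes:
    """Four-byte length, then the characters, zero-padded to a multiple of 4 (assumption)."""
    data = text.encode("ascii")
    padding = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding

print(xdr_string("fred").hex(" "))  # 00 00 00 04 66 72 65 64
print(xdr_int(-1).hex(" "))         # ff ff ff ff
```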

### A1.1.1    Negative numbers

Signed integers use a notation called 2’s complement to represent negative values. In this representation the most significant bit is ‘1’ if the number is negative, else it is ‘0’. To convert a negative decimal value into 2’s complement notation, the magnitude of the number is first represented in binary form. Next, all the bits are inverted and a ‘1’ is added. For example, to determine the 16-bit 2’s complement of the value –65, the following steps are taken:

```+65            00000000 01000001
invert         11111111 10111110
add 1          11111111 10111111```

Thus, –65 is 11111111 10111111 in 16-bit 2’s complement notation. Table A1.4 shows that with 16 bits the range of values that can be represented in 2’s complement is from –32 768 to 32 767 (that is, 65 536 values).
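The invert-and-add-one steps can be sketched in Python as follows. The function name `twos_complement` is illustrative, and the sketch assumes the value fits in the chosen number of bits.

```python
def twos_complement(value: int, bits: int = 16) -> str:
    """Return the 2's complement bit pattern of value as a string of 0s and 1s."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the given number of bits")
    if value >= 0:
        pattern = value
    else:
        mask = (1 << bits) - 1
        inverted = (-value) ^ mask        # invert every bit of the magnitude
        pattern = (inverted + 1) & mask   # then add 1
    return format(pattern, f"0{bits}b")

print(twos_complement(65))   # 0000000001000001
print(twos_complement(-65))  # 1111111110111111
```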

Two’s complement is also useful in subtraction operations, where the value to be subtracted is converted into its negative form and then added to the value it is to be subtracted from. For example, to subtract 42 from 65, first 42 is converted into 2’s complement (that is, –42) and added to the binary equivalent of 65. The result gives a carry into the sign bit and a carry-out (these are ignored).

| Decimal | Binary |
|---|---|
| 65 | 0100 0001 |
| –42 | 1101 0110 |
| = 23 | (1) 0001 0111 |
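A minimal sketch of this subtraction, assuming 8-bit values as in the worked example (the function name `subtract_8bit` is illustrative). The result is returned as an 8-bit pattern, so a negative answer appears in 2’s complement form.

```python
def subtract_8bit(a: int, b: int) -> int:
    """Compute a - b by adding the 8-bit 2's complement of b and dropping the carry-out."""
    mask = 0xFF
    negative_b = ((b ^ mask) + 1) & mask  # invert the bits of b and add 1
    total = a + negative_b                # may produce a ninth (carry-out) bit
    return total & mask                   # the carry-out is ignored

result = subtract_8bit(65, 42)
print(format(result, "08b"), result)  # 00010111 23
```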

A 16-bit signed integer can thus vary from –32768 (1000000000000000) to 32767 (0111111111111111).

Table A1.4 16-bit 2’s complement notation

| Decimal | 2’s complement |
|---|---|
| –32 768 | 10000000 00000000 |
| –32 767 | 10000000 00000001 |
| :: | :: |
| –2 | 11111111 11111110 |
| –1 | 11111111 11111111 |
| 0 | 00000000 00000000 |
| 1 | 00000000 00000001 |
| 2 | 00000000 00000010 |
| :: | :: |
| 32 766 | 01111111 11111110 |
| 32 767 | 01111111 11111111 |

### A1.1.2    Hexadecimal and octal numbers

Often it is difficult to differentiate binary numbers from decimal numbers (as one hundred and one can be seen as 101 in binary). A typical convention is to use a trailing b for binary numbers; for example, 010101111010b and 101111101010b are binary numbers. Hexadecimal and octal are often used to represent binary digits, as they are relatively easy to convert to and from binary. Table A1.5 shows the basic conversion between decimal, binary, octal and hexadecimal numbers. A typical convention is to append an ‘h’ to the end of a hexadecimal number (and an ‘o’ to an octal number). For example, 43F1h is a hexadecimal value whereas 4310o is octal.

To represent a binary number as a hexadecimal value, the binary digits are split into groups of four bits (starting from the least significant bit). A hexadecimal digit then replaces each of the groups. For example, to represent 0111 0101 1100 0000b the bits are split into groups of four to give:

| Binary | 0111 | 0101 | 1100 | 0000 |
|---|---|---|---|---|
| Hex | 7 | 5 | C | 0 |

Thus, 75C0h represents the binary number 0111010111000000b. To convert from decimal to hexadecimal, the decimal value is divided by 16 repeatedly and each remainder is noted. The first remainder gives the least significant digit and the final remainder the most significant digit. For example, the following shows the hexadecimal equivalent of the decimal number 1103:

| Division | Quotient | Remainder | Note |
|---|---|---|---|
| 1103 ÷ 16 | 68 | F | LSD (least significant digit) |
| 68 ÷ 16 | 4 | 4 | |
| 4 ÷ 16 | 0 | 4 | MSD (most significant digit) |

Thus, the decimal value 1103 is equivalent to 044Fh.
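Both conversions can be sketched in a few lines of Python. The function names are illustrative; `binary_to_hex` follows the group-of-four method and `decimal_to_hex` follows the repeated division by 16, collecting remainders from least to most significant digit.

```python
HEX_DIGITS = "0123456789ABCDEF"

def binary_to_hex(bits: str) -> str:
    """Convert a binary string (spaces allowed) to hex, four bits at a time."""
    bits = bits.replace(" ", "")
    bits = bits.zfill(-(-len(bits) // 4) * 4)  # left-pad to a multiple of four bits
    groups = (bits[i:i + 4] for i in range(0, len(bits), 4))
    return "".join(HEX_DIGITS[int(group, 2)] for group in groups)

def decimal_to_hex(value: int) -> str:
    """Convert a non-negative integer to hex by repeated division by 16."""
    if value == 0:
        return "0"
    digits = []
    while value > 0:
        value, remainder = divmod(value, 16)
        digits.append(HEX_DIGITS[remainder])   # first remainder is the LSD
    return "".join(reversed(digits))           # last remainder is the MSD

print(binary_to_hex("0111 0101 1100 0000"))  # 75C0
print(decimal_to_hex(1103))                  # 44F
```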

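Table A1.5 Decimal, binary, octal and hexadecimal conversions

| Decimal | Binary | Octal | Hex |
|---|---|---|---|
| 0 | 0000 | 0 | 0 |
| 1 | 0001 | 1 | 1 |
| 2 | 0010 | 2 | 2 |
| 3 | 0011 | 3 | 3 |
| 4 | 0100 | 4 | 4 |
| 5 | 0101 | 5 | 5 |
| 6 | 0110 | 6 | 6 |
| 7 | 0111 | 7 | 7 |
| 8 | 1000 | 10 | 8 |
| 9 | 1001 | 11 | 9 |
| 10 | 1010 | 12 | A |
| 11 | 1011 | 13 | B |
| 12 | 1100 | 14 | C |
| 13 | 1101 | 15 | D |
| 14 | 1110 | 16 | E |
| 15 | 1111 | 17 | F |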

### A1.1.3    Floating-point representation

A single-precision floating-point value uses 32 bits, where the most-significant bit is the sign bit (S), and the next eight bits hold the exponent of the number in base 2, stored with 127 added (E). The final 23 bits represent the base-2 fractional part of the number’s mantissa (F). The standard format is:

$$\text{Value} = (-1)^{S} \times 2^{(E-127)} \times 1.F$$

For example:

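1.23 = 3F9D 70A4h

= 0 01111111 00111010111000010100100b

$$= (-1)^{0} \times 2^{(127-127)} \times (1 + 2^{-3} + 2^{-4} + 2^{-5} + 2^{-7} + 2^{-9} + 2^{-10} + 2^{-11} + 2^{-16} + 2^{-18} + 2^{-21})$$

–5.67 = C0B5 70A4h

= 1 10000001 01101010111000010100100b

$$= (-1)^{1} \times 2^{(129-127)} \times (1 + 2^{-2} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9} + 2^{-10} + 2^{-11} + 2^{-16} + 2^{-18} + 2^{-21})$$

100.442 = 42C8 E24Eh

= 0 10000101 10010001110001001001110b

$$= (-1)^{0} \times 2^{(133-127)} \times (1 + 2^{-1} + 2^{-4} + 2^{-8} + 2^{-9} + 2^{-10} + 2^{-14} + 2^{-17} + 2^{-20} + 2^{-21} + 2^{-22})$$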

A double-precision floating-point value uses 64 bits, where the most-significant bit is the sign bit (S), and the next 11 bits hold the exponent of the number in base 2, stored with 1023 added (E). The final 52 bits represent the base-2 fractional part of the number’s mantissa (F).
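The single-precision examples can be checked with a small sketch that splits a 32-bit word into its S, E and F fields and applies the formula. The function name `decode_single` is illustrative, and the sketch only handles normalised values (it ignores zero, denormals, infinities and NaN).

```python
import struct

def decode_single(hex_word: str):
    """Split a 32-bit IEEE single-precision word into sign, exponent, fraction and value."""
    bits = int(hex_word.replace(" ", ""), 16)
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits
    value = (-1) ** sign * 2.0 ** (exponent - 127) * (1 + fraction / 2 ** 23)
    return sign, exponent, fraction, value

for word in ("3F9D 70A4", "C0B5 70A4", "42C8 E24E"):
    sign, exponent, fraction, value = decode_single(word)
    print(word, sign, format(exponent, "08b"), format(fraction, "023b"), value)

# Cross-check against the machine's own encoding of 1.23 as a 32-bit float.
print(struct.pack(">f", 1.23).hex().upper())  # 3F9D70A4
```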

### A1.1.4    ASCII

As we have seen, there are standard formats for integers and floating-point values. There are many standards for the representation of characters (known as character sets), but the most common one is ASCII. In its standard form it uses a 7-bit binary code to represent characters (giving a range of 0 to 127). This is rather limited, as it does not support symbols such as Greek letters, and so on. To increase the number of symbols that can be represented, extended ASCII is used, which has an 8-bit code.

Appendix 4 shows the standard ASCII character set (in binary, decimal, hexadecimal and also as a character). For example, the ‘a’ character has the ASCII binary representation of 0110 0001b (61h), and the ‘A’ character has the binary representation of 0100 0001b (41h). One thing that can be noticed is that the upper and lower case versions of the letters (‘a’ to ‘z’) differ by only a single bit (the 6th bit from the right-hand side).
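The single-bit difference between upper and lower case can be seen with a short sketch. The names `CASE_BIT` and `toggle_case` are illustrative, and the function assumes a single ASCII letter as input.

```python
CASE_BIT = 0x20  # 0010 0000b: the 6th bit from the right-hand side

def toggle_case(letter: str) -> str:
    """Flip the case of one ASCII letter by toggling a single bit."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(letter) ^ CASE_BIT)

print(format(ord("a"), "08b"), format(ord("A"), "08b"))  # 01100001 01000001
print(toggle_case("a"), toggle_case("Z"))                # A z
```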

In 1963, ANSI defined the 7-bit ASCII standard code for characters. At the same time IBM had developed the 8-bit EBCDIC code which allowed for up to 256 characters, rather than 128 characters for ASCII. It is thought that the 7-bit code was used for the standard as it was reckoned that eight holes in punched paper tape would weaken the tape. Thus the world has had to use the 7-bit ASCII standard, which is still popular in the days of global communications, and large-scale disk storage.

```Char  Dec  Oct  Hex | Char  Dec  Oct  Hex | Char  Dec  Oct  Hex | Char Dec  Oct   Hex
-------------------------------------------------------------------------------------
(nul)   0 0000 0x00 | (sp)   32 0040 0x20 | @      64 0100 0x40 | `      96 0140 0x60
(soh)   1 0001 0x01 | !      33 0041 0x21 | A      65 0101 0x41 | a      97 0141 0x61
(stx)   2 0002 0x02 | "      34 0042 0x22 | B      66 0102 0x42 | b      98 0142 0x62
(etx)   3 0003 0x03 | #      35 0043 0x23 | C      67 0103 0x43 | c      99 0143 0x63
(eot)   4 0004 0x04 | $      36 0044 0x24 | D      68 0104 0x44 | d     100 0144 0x64
(enq)   5 0005 0x05 | %      37 0045 0x25 | E      69 0105 0x45 | e     101 0145 0x65
(ack)   6 0006 0x06 | &      38 0046 0x26 | F      70 0106 0x46 | f     102 0146 0x66
(bel)   7 0007 0x07 | '      39 0047 0x27 | G      71 0107 0x47 | g     103 0147 0x67
(bs)    8 0010 0x08 | (      40 0050 0x28 | H      72 0110 0x48 | h     104 0150 0x68
(ht)    9 0011 0x09 | )      41 0051 0x29 | I      73 0111 0x49 | i     105 0151 0x69
(nl)   10 0012 0x0a | *      42 0052 0x2a | J      74 0112 0x4a | j     106 0152 0x6a
(vt)   11 0013 0x0b | +      43 0053 0x2b | K      75 0113 0x4b | k     107 0153 0x6b
(np)   12 0014 0x0c | ,      44 0054 0x2c | L      76 0114 0x4c | l     108 0154 0x6c
(cr)   13 0015 0x0d | -      45 0055 0x2d | M      77 0115 0x4d | m     109 0155 0x6d
(so)   14 0016 0x0e | .      46 0056 0x2e | N      78 0116 0x4e | n     110 0156 0x6e
(si)   15 0017 0x0f | /      47 0057 0x2f | O      79 0117 0x4f | o     111 0157 0x6f
(dle)  16 0020 0x10 | 0      48 0060 0x30 | P      80 0120 0x50 | p     112 0160 0x70
(dc1)  17 0021 0x11 | 1      49 0061 0x31 | Q      81 0121 0x51 | q     113 0161 0x71
(dc2)  18 0022 0x12 | 2      50 0062 0x32 | R      82 0122 0x52 | r     114 0162 0x72
(dc3)  19 0023 0x13 | 3      51 0063 0x33 | S      83 0123 0x53 | s     115 0163 0x73
(dc4)  20 0024 0x14 | 4      52 0064 0x34 | T      84 0124 0x54 | t     116 0164 0x74
(nak)  21 0025 0x15 | 5      53 0065 0x35 | U      85 0125 0x55 | u     117 0165 0x75
(syn)  22 0026 0x16 | 6      54 0066 0x36 | V      86 0126 0x56 | v     118 0166 0x76
(etb)  23 0027 0x17 | 7      55 0067 0x37 | W      87 0127 0x57 | w     119 0167 0x77
(can)  24 0030 0x18 | 8      56 0070 0x38 | X      88 0130 0x58 | x     120 0170 0x78
(em)   25 0031 0x19 | 9      57 0071 0x39 | Y      89 0131 0x59 | y     121 0171 0x79
(sub)  26 0032 0x1a | :      58 0072 0x3a | Z      90 0132 0x5a | z     122 0172 0x7a
(esc)  27 0033 0x1b | ;      59 0073 0x3b | [      91 0133 0x5b | {     123 0173 0x7b
(fs)   28 0034 0x1c | <      60 0074 0x3c | \      92 0134 0x5c | |     124 0174 0x7c
(gs)   29 0035 0x1d | =      61 0075 0x3d | ]      93 0135 0x5d | }     125 0175 0x7d
(rs)   30 0036 0x1e | >      62 0076 0x3e | ^      94 0136 0x5e | ~     126 0176 0x7e
(us)   31 0037 0x1f | ?      63 0077 0x3f | _      95 0137 0x5f | (del) 127 0177 0x7f```

## Base-64

When sending text, we can use ASCII. Unfortunately some of the ASCII codes are non-printable, and a binary file may contain any byte value, including non-printing ones. On the Internet, some communication protocols require printable characters, such as SMTP (which sends email). Thus we often have to convert a binary file into Base-64. For this we take six bits at a time, and map each 6-bit value to a character using the Base-64 table (given at the end of this section).

### Example 1

If we take the example of “fred”, then we get:

```ASCII      f       r         e        d
Binary 01100110 01110010 01100101 01100100```

Next we group the bits into 6-bit values:

`Binary 011001 100111 001001 100101 011001 00`

and then map these using the Base-64 table:

```Binary  011001 100111 001001 100101 011001 00
Decimal   25     39     9      37     25    0
Base-64   Z      n      J       l     Z     A```

The result is ZnJlZA
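The grouping and table lookup can be sketched directly in Python. The names `ALPHABET` and `base64_chars` are illustrative; the sketch zero-fills the final 6-bit group but does not add any “=” padding, so it reproduces the unpadded result above.

```python
import string

# The 64 symbols in table order: A-Z, a-z, 0-9, then + and /.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def base64_chars(data: bytes) -> str:
    """Map the input to Base-64 characters, six bits at a time (no '=' padding)."""
    bits = "".join(format(byte, "08b") for byte in data)
    bits += "0" * (-len(bits) % 6)  # zero-fill the last group to six bits
    groups = (bits[i:i + 6] for i in range(0, len(bits), 6))
    return "".join(ALPHABET[int(group, 2)] for group in groups)

print(base64_chars(b"fred"))  # ZnJlZA
```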

### Hash signatures

One thing that we use Base-64 for is to represent the hash signature of some data. Base-64 works on 24-bit groups of input bits (three bytes), each of which maps to a group of four Base-64 characters. When the input does not fill the last 24-bit group, we pad the binary value with zeros to complete the final 6-bit value, and then pad at the end so that the output is a multiple of four characters:

`Binary 011001 100111 001001 100101 011001 00[0000] xxxxxx xxxxxx`

The extra padding at the end is represented with a “=” character to give:

`ZnJlZA==`
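Python's standard `base64` module applies this padding automatically, as the sketch below shows. The use of SHA-256 for the hash signature is an assumption made for illustration; the text does not name a particular hash function.

```python
import base64
import hashlib

# The standard encoder adds "=" so the output is a multiple of four characters.
encoded = base64.b64encode(b"fred").decode("ascii")
print(encoded)  # ZnJlZA==

# A hash signature is raw binary, so Base-64 gives a printable form of it.
digest = hashlib.sha256(b"fred").digest()            # 32 raw bytes (SHA-256 is an assumption)
signature = base64.b64encode(digest).decode("ascii")
print(len(digest), len(signature))  # 32 44
print(signature)
```

A 32-byte digest is not a multiple of three bytes, so its Base-64 form ends with a single “=”.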

### Example 2

If we take the example of “napier”, then we get:

```ASCII      n        a        p        i        e        r
Binary 01101110 01100001 01110000 01101001 01100101 01110010```

Next we group the bits into 6-bit values:

`Binary 011011 100110 000101 110000 011010 010110 010101 110010`

We thus do not need any padding, as we have a multiple of four characters, and then map these using the Base-64 table:

```Binary  011011 100110 000101 110000 011010 010110 010101 110010
Decimal   27     38     5      48     26     22      21    50
Base-64   b       m     F      w      a      W        V     y```

The result is bmFwaWVy

### Base-64 table

The table is:

```     Value Encoding  Value Encoding  Value Encoding  Value Encoding
0 A            17 R            34 i            51 z
1 B            18 S            35 j            52 0
2 C            19 T            36 k            53 1
3 D            20 U            37 l            54 2
4 E            21 V            38 m            55 3
5 F            22 W            39 n            56 4
6 G            23 X            40 o            57 5
7 H            24 Y            41 p            58 6
8 I            25 Z            42 q            59 7
9 J            26 a            43 r            60 8
10 K            27 b            44 s            61 9
11 L            28 c            45 t            62 +
12 M            29 d            46 u            63 /
13 N            30 e            47 v
14 O            31 f            48 w         (pad) =
15 P            32 g            49 x
16 Q            33 h            50 y```

# File Forensics with Signatures

### Magic Numbers

Sometimes we need to scan a disk at a low level, and determine the files that it contains. One method is to look for standard signatures (magic numbers), which are normally fixed byte sequences at the start of a file. I’ve tried to gather as many of these signatures as possible for key file types (see Table 1). For example, an Adobe Illustrator file should start with the hex sequence 0x25, 0x50, 0x44, 0x46 (the ASCII characters %PDF), which is also the signature of a standard PDF file. If we scan a disk and find this signature, the file may thus be either a PDF document or an Illustrator file.

### PNG File

PNG files provide a high-quality bitmapped graphics format with lossless compression. They have a magic number of 0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A. The following gives a sample listing for a real PNG file:

http://www.profsims.com/information/png?file=bg.png

The starting part of the file shows the magic number:

```[00000000] 89 50 4E 47 0D 0A 1A 0A   .PNG....
[00000008] 00 00 00 0D 49 48 44 52   ....IHDR
[00000016] 00 00 00 F3 00 00 00 C3   ........
[00000024] 08 06 00 00 00 57 8C 27   .....W.'
[00000032] 92 00 00 00 04 67 41 4D   .....gAM
[00000040] 41 00 00 AF C8 37 05 8A   A....7..
[00000048] E9 00 00 00 19 74 45 58   .....tEX```
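As a sketch of how the start of this listing could be interpreted, the code below checks the magic number and reads the first chunk. The function name `parse_png_header` is hypothetical. The field layout (big endian length, `IHDR` type, then width, height, bit depth and colour type) follows the PNG specification rather than anything stated above, and `sample` is simply the first 32 bytes of the dump.

```python
import struct

PNG_SIGNATURE = bytes.fromhex("89504E470D0A1A0A")

def parse_png_header(data: bytes):
    """Return (width, height, bit depth, colour type) from the IHDR chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # Each chunk starts with a big endian 4-byte length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR":
        raise ValueError("expected IHDR as the first chunk")
    width, height, bit_depth, colour_type = struct.unpack(">IIBB", data[16:26])
    return width, height, bit_depth, colour_type

# First 32 bytes of the listing above.
sample = bytes.fromhex(
    "89504E470D0A1A0A" "0000000D49484452" "000000F3000000C3" "0806000000578C27"
)
print(parse_png_header(sample))  # (243, 195, 8, 6)
```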


### GIF file

The GIF file format uses a file signature of 0x47 0x49 0x46 0x38 0x39 0x61 (GIF89a) in the first six bytes of the file. After this, the key fields are Width (16 bits), Height (16 bits), Packed (8 bits), Color Index (8 bits) and Aspect (8 bits), followed by a colour table of up to 256 24-bit colours. This means that GIF files have good resolution for the colour of a pixel, but can only use 256 different colours, which limits their scope. For example, GIF is not good for photographs, as these typically need thousands of colours.

A sample analysis is:

http://www.profsims.com/information/gif?file=cat01_with_hidden_text.gif

which gives the following at the start of the file:

```[00000000] 47 49 46 38 39 61 64 00   GIF89ad.
[00000008] 55 00 E6 00 00 FF FF FF   U.......
[00000016] F7 F7 F6 F1 F4 F2 EE EE   ........
[00000024] EF E7 E7 E7 E1 E4 E6 DF   ........```
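The header fields can be pulled out of this dump with a short sketch. The function name `parse_gif_header` is hypothetical. The little endian byte order, the older `GIF87a` signature and the meaning of the Packed bits (top bit flags a colour table, low three bits give its size) are taken from the GIF specification rather than from the text above.

```python
import struct

def parse_gif_header(data: bytes) -> dict:
    """Parse the 6-byte signature and the 7 bytes of screen fields that follow it."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Width, Height (16 bits each, little endian), then Packed, Color Index, Aspect.
    width, height, packed, colour_index, aspect = struct.unpack("<HHBBB", data[6:13])
    has_table = bool(packed & 0x80)
    entries = 2 ** ((packed & 0x07) + 1) if has_table else 0
    return {
        "width": width,
        "height": height,
        "packed": packed,
        "colour_index": colour_index,
        "aspect": aspect,
        "colour_table_entries": entries,
    }

# First 16 bytes of the listing above.
sample = bytes.fromhex("4749463839616400" "5500E60000FFFFFF")
print(parse_gif_header(sample))
```

For this file the sketch reports an image of 100 by 85 pixels, with a Packed value of E6h that indicates a 128-entry colour table.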

It should be noted that I have added a covert message into the colour table (which will only change the colour of a few pixels):

```[00000048] A1 CC CC CC C4 C8 CC 68   .......h
[00000056] 65 6C 6C 6F C0 D1 C6 84   ello....
[00000064] C0 BF BD BD BB B8 B8 B6   ........```
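One simple way to spot text like this is to scan the bytes for runs of printable ASCII characters, much as the UNIX `strings` tool does. The sketch below is only an illustration of that idea (the function name `printable_runs` and the minimum length of four are assumptions), applied to the 24 bytes shown above.

```python
import re

def printable_runs(data: bytes, min_length: int = 4) -> list:
    """Return every run of at least min_length printable ASCII bytes."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return [match.group().decode("ascii") for match in re.finditer(pattern, data)]

# Bytes 48 to 71 of the listing above.
sample = bytes.fromhex(
    "A1CCCCCCC4C8CC68" "656C6C6FC0D1C684" "C0BFBDBDBBB8B8B6"
)
print(printable_runs(sample))  # ['hello']
```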


### PKZIP File

The PKZIP file format is used to compress files and, potentially, encrypt them. It can be identified by the magic number 0x504B0304 at the start of the file, followed by a fairly structured format of:

• Version: 14 00
• General purpose bit flag: 02 00
• Compression method: 08 00
• File last modification time: 80 9D
• File last modification date: 6C 39
• CRC: DA 4D B8 0F
• Compressed size: 90 01 00 00
• Uncompressed size: 27 06 00 00
• File name length: 09 00
• Extra field length: 00 00
• Filename: anim.xaml

The following shows an example with a real file:

http://www.profsims.com/information/zip?file=anim.zip

where we see the following at the start of the file:

```[00000000] 50 4B  03 04 14 00 02 00    PK......
[00000008] 08 00 80 9D 6C 39 DA 4D   ....l9.M```
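The fields listed earlier can be unpacked with a sketch like the one below. The function name `parse_zip_local_header` is hypothetical. The little endian byte order and the MS-DOS packing of the time and date fields are assumptions taken from the ZIP format specification, and `sample` is rebuilt from the field values given above.

```python
import struct

# Signature, five 16-bit fields, three 32-bit fields, two 16-bit lengths (30 bytes).
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")

def parse_zip_local_header(data: bytes) -> dict:
    """Decode the fixed part of a PKZIP local file header and the file name."""
    (signature, version, flags, method, mod_time, mod_date,
     crc, compressed, uncompressed, name_len, extra_len) = LOCAL_HEADER.unpack_from(data)
    if signature != b"PK\x03\x04":
        raise ValueError("not a PKZIP local file header")
    name = data[30:30 + name_len].decode("utf-8", errors="replace")
    return {
        "version": version,
        "flags": flags,
        "method": method,  # 8 is the deflate method
        # MS-DOS packing: hhhhh mmmmmm sssss (seconds / 2) and yyyyyyy mmmm ddddd.
        "time": (mod_time >> 11, (mod_time >> 5) & 0x3F, (mod_time & 0x1F) * 2),
        "date": (1980 + (mod_date >> 9), (mod_date >> 5) & 0x0F, mod_date & 0x1F),
        "crc": crc,
        "compressed": compressed,
        "uncompressed": uncompressed,
        "name": name,
    }

sample = bytes.fromhex(
    "504B0304 1400 0200 0800 809D 6C39 DA4DB80F 90010000 27060000 0900 0000"
) + b"anim.xaml"
print(parse_zip_local_header(sample))
```

With these bytes the compressed size decodes to 400 bytes, the uncompressed size to 1,575 bytes, and the modification stamp to 12 November 2008 at 19:44:00.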


An interesting fact is that Office 2010 files, such as DOCX, XLSX, and so on, are stored in an XML format inside a PKZIP compressed container. This can be seen with:

http://www.profsims.com/information/docx?file=hello.docx

```[00000000] 50 4B  03 04 14 00 06 00    PK......
[00000008] 08 00 00 00 21 00 09 24   ....!..$
[00000016] 87 82 81 01 00 00 8E 05   ........
[00000024] 00 00 13 00 08 02 5B 43   ......[C
[00000032] 6F 6E 74 65 6E 74 5F 54   ontent_T
[00000040] 79 70 65 73 5D 2E 78 6D   ypes].xm
[00000048] 6C 20 A2 04 02 28 A0 00   l....(..```

Table 1: Magic file numbers

| Description | Extension | Magic Number |
|---|---|---|
| Adobe Illustrator | .ai | 25 50 44 46 [%PDF] |
| Bitmap graphic | .bmp | 42 4D [BM] |
| Class File | .class | CA FE BA BE |
| JPEG graphic file | .jpg | FFD8 |
| JPEG 2000 graphic file | .jp2 | 0000000C6A5020200D0A [….jP..] |
| GIF graphic file | .gif | 47 49 46 38 [GIF8] |
| TIF graphic file | .tif | 49 49 [II] |
| PNG graphic file | .png | 89 50 4E 47 [.PNG] |
| Photoshop Graphics | .psd | 38 42 50 53 [8BPS] |
| Windows Meta File | .wmf | D7 CD C6 9A |
| MIDI file | .mid | 4D 54 68 64 [MThd] |
| Icon file | .ico | 00 00 01 00 |
| MP3 file with ID3 identity tag | .mp3 | 49 44 33 [ID3] |
| AVI video file | .avi | 52 49 46 46 [RIFF] |
| Flash Shockwave | .swf | 46 57 53 [FWS] |
| Flash Video | .flv | 46 4C 56 [FLV] |
| Mpeg 4 video file | .mp4 | 00 00 00 18 66 74 79 70 6D 70 34 32 [….ftypmp42] |
| MOV video file | .mov | 6D 6F 6F 76 [….moov] |
| Windows Video file | .wmv | 30 26 B2 75 8E 66 CF |
| Windows Audio file | .wma | 30 26 B2 75 8E 66 CF |
| PKZip | .zip | 50 4B 03 04 [PK] |
| GZip | .gz | 1F 8B 08 |
| Tar file | .tar | 75 73 74 61 72 |
| Microsoft Installer | .msi | D0 CF 11 E0 A1 B1 1A E1 |
| Object Code File | .obj | 4C 01 |
| Dynamic Library | .dll | 4D 5A [MZ] |
| CAB Installer file | .cab | 4D 53 43 46 [MSCF] |
| Executable file | .exe | 4D 5A [MZ] |
| RAR file | .rar | 52 61 72 21 1A 07 00 [Rar!…] |
| SYS file | .sys | 4D 5A [MZ] |
| Help file | .hlp | 3F 5F 03 00 [?_..] |
| VMWare Disk file | .vmdk | 4B 44 4D 56 [KDMV] |
| Outlook Post Office file | .pst | 21 42 44 4E 42 [!BDNB] |
| PDF Document | .pdf | 25 50 44 46 [%PDF] |
| Word Document | .doc | D0 CF 11 E0 A1 B1 1A E1 |
| RTF Document | .rtf | 7B 5C 72 74 66 31 [{\rtf1] |
| Excel Document | .xls | D0 CF 11 E0 A1 B1 1A E1 |
| PowerPoint Document | .ppt | D0 CF 11 E0 A1 B1 1A E1 |
| Visio Document | .vsd | D0 CF 11 E0 A1 B1 1A E1 |
| DOCX (Office 2010) | .docx | 50 4B 03 04 [PK] |
| XLSX (Office 2010) | .xlsx | 50 4B 03 04 [PK] |
| PPTX (Office 2010) | .pptx | 50 4B 03 04 [PK] |
| Microsoft Database | .mdb | 53 74 61 6E 64 61 72 64 20 4A 65 74 |
| PostScript File | .ps | 25 21 [%!] |
| Outlook Message File | .msg | D0 CF 11 E0 A1 B1 1A E1 |
| EPS File | .eps | 25 21 50 53 2D 41 64 6F 62 65 2D 33 2E 30 20 45 50 53 46 2D 33 20 30 |
| Jar File | .jar | 50 4B 03 04 14 00 08 00 08 00 |
| SLN File | .sln | 4D 69 63 72 6F 73 6F 66 74 20 56 69 73 75 61 6C 20 53 74 75 64 69 6F 20 53 6F 6C 75 74 69 6F 6E 20 46 69 6C 65 |
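A lookup based on Table 1 can be sketched as follows. Only a handful of the signatures are included, and the names `SIGNATURES` and `identify` are illustrative. Several file types share a signature, so a match narrows down the type rather than proving it.

```python
# A few entries from Table 1, keyed by the bytes expected at the start of the file.
SIGNATURES = {
    bytes.fromhex("89504E47"): "PNG graphic file",
    bytes.fromhex("47494638"): "GIF graphic file",
    bytes.fromhex("FFD8"): "JPEG graphic file",
    bytes.fromhex("25504446"): "PDF document or Adobe Illustrator file",
    bytes.fromhex("504B0304"): "PKZip file (also DOCX, XLSX, PPTX and JAR)",
    bytes.fromhex("D0CF11E0A1B11AE1"): "Microsoft Office document or installer",
    bytes.fromhex("4D5A"): "Executable, DLL or SYS file",
}

def identify(data: bytes) -> str:
    """Return a description for the first matching signature, longest signatures first."""
    for magic in sorted(SIGNATURES, key=len, reverse=True):
        if data.startswith(magic):
            return SIGNATURES[magic]
    return "unknown"

print(identify(bytes.fromhex("89504E470D0A1A0A")))  # PNG graphic file
print(identify(b"PK\x03\x04\x14\x00\x02\x00"))      # PKZip file (also DOCX, XLSX, PPTX and JAR)
```

To scan a real file, the first few bytes could be read with `open(path, "rb").read(16)` and passed to `identify`.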