SEGREGATION INDEX CALCULATOR (SEGCALC)


1. INTRODUCTION

Segregation Index Calculator (SEGCALC) was developed by Dr. Stavros
Konstantinidis (Dept. of Mathematics and Statistics) and Dr. Ivan Townshend
(Dept. of Geography) at the University of Lethbridge (Alberta) during the
months of May and July of 1997. The motivation for this project was the
need of researchers to compute geographical indices for evaluating the
degree of segregation of a city's population groups.

SEGCALC implements the following 18 indices:

    (a) ACE (absolute centralization)
    (b) ACL (absolute clustering)
    (c) ACO (absolute concentration)
    (d) Atkinson's
    (e) Delta
    (f) Dissimilarity
    (g) DPxx* (distance decay isolation)
    (h) DPxy* (distance decay interaction)
    (i) Entropy
    (j) Eta squared (correlation ratio)
    (k) Gini coefficient
    (l) PCC (proportion central city)
    (m) RCE (relative centralization)
    (n) RCL (relative clustering)
    (o) RCO (relative concentration)
    (p) SP (spatial proximity)
    (q) xP*x (isolation)
    (r) xP*y (interaction)

The formulas used to compute the above indices are taken from Massey and
Denton (1988), with the exception of DPxx* and DPxy*, which are implemented
according to the correct formula given in Morgan (1983). For the indices
that use the contiguity between areal units as a function of the distance
between their centroid coordinates, SEGCALC assumes that the contiguity
between an areal unit and itself is equal to 1 (see page 294 of Massey and
Denton, 1988).

While testing the effectiveness of SEGCALC, the following problems were
identified (see also Townshend, Konstantinidis and Walker (in progress) for
further details):

- The ACL index can produce negative values, despite the claim of Massey
  and Denton that its values always range between 0 and 1.

- There are situations where the values of the indices ACO and RCO are
  undefined (i.e., the formulas produce 0/0).

The rest of this document describes the capabilities of SEGCALC.
Section 2 gives some general suggestions on how to structure the input data
files used by SEGCALC. Section 3 contains further details on the input data
files and defines which data lines are considered valid and processed by
SEGCALC. Section 4 describes the set of permissible user inputs and how
SEGCALC responds to them. Section 5 lists the possible error messages and
warnings of SEGCALC. Finally, Section 6 states the operating system
requirements for using SEGCALC.


2. RECOMMENDED INPUT DATA FILE CONFIGURATION

To compute ALL indices in SEGCALC it is advisable (but not essential) to
set up your raw data file (ASCII format) as follows:

Column 1: Unique identifier for the row or record (e.g. census tract code
    or name). This column is not used in the computations but is
    recommended as a record identifier.

Column 2 (referred to as I6 below): Area of the census tract (note: SEGCALC
    assumes that the area of the city is the sum of all census tract
    areas). If you are not using the area-based indices (DEL, ACO, RCO,
    ACE) you can omit this column.

Column 3 (referred to as I7 below): The geographic x-coordinate of the
    census tract centroid (e.g. in metres or km). Note: the distance-based
    indices (PCC, RCE, ACE, ACL, RCL, SP, DPxy, DPxx) are sensitive to the
    unit of measurement. If you are not using indices based on distances
    (between tracts or from the central business district, CBD) you can
    omit columns 3-6.

Column 4 (referred to as I8 below): The geographic y-coordinate of the
    census tract centroid (same units as the x-coordinate above). If you
    are not using the distance-based indices (PCC, RCE, ACE, ACL, RCL, SP,
    DPxy, DPxx) you can omit columns 3-6.

Column 5 (referred to as I9 below): The x-coordinate of the central
    business district (CBD) or city centre (must be in the same unit of
    measure as the tract coordinates). If you are not using the
    distance-based indices (PCC, RCE, ACE, ACL, RCL, SP, DPxy, DPxx) you
    can omit columns 3-6.
Column 6 (referred to as I10 below): The y-coordinate of the central
    business district or city centre (must be in the same unit of measure
    as the tract coordinates). If you are not using the distance-based
    indices (PCC, RCE, ACE, ACL, RCL, SP, DPxy, DPxx) you can omit
    columns 3-6.

Column 7 (referred to as I5 below): First column of data (i.e. a population
    segment such as an age group, ethnic group, etc.).

Columns 8 to LAST column: Remaining population segments.

Note: the number of population segment columns (e.g. 7 to LAST) plus the
number of ID and geographic reference columns (e.g. columns 1 to 6) equals
the total number of columns in the input data file (referred to as I4
below).

IMPORTANT: SEGCALC indices are all based on dichotomous analysis, i.e.,
segment vs. non-segment, where non-segment is defined as the total tract
population minus the segment. For example:

    if segment = Seniors, non-segment = total tract population minus Seniors
    if segment = Blacks,  non-segment = total tract population minus Blacks

SEGCALC does not make provision for reading (from the raw data input file)
a separate column that contains the total tract populations. SEGCALC
computes the total tract population as the sum of the individual segments
(e.g. columns 7 to LAST). This means that if your analysis is concerned
with only one population segment, you can simplify the input data file by
entering just two columns of data: the first for the population of that
segment, and the second for the difference between the total population and
that segment. In all cases, AT LEAST TWO segment data columns are required.

SEGCALC also requires complete (valid) records. Records containing missing
data will not be processed.

EXAMPLE DATA FILE (note that a fixed field format is not required):

Ctname  Area(km2)  CTX   CTY   CBDX  CBDY  SEGMENT1  SEGMENT2 ... SEGMENTM
100      8         1.0   10.0  5     6      300       700     ... xxx
101     15         4.5   10.5  5     6     1000      5000     ... xxx
.        .         .     .     .     .      .         .           .
.        .         .     .     .     .      .         .           .
.        .         .     .     .     .      .         .           .
200     21         3.5    1.5  5     6      600      2000     ... xxx


3. TECHNICAL INPUT DATA FILE SPECIFICATIONS

SEGCALC processes tabular data that describe the geographical distribution
of the population groups in a given city (see the table below). The city in
question consists of N areal units and its population is made up of M
different groups.

    g11  g12  .....  g1L    p11  p12  .....  p1M
    g21  g22  .....  g2L    p21  p22  .....  p2M
    ...  ...  .....  ...    ...  ...  .....  ...
    ...  ...  .....  ...    ...  ...  .....  ...
    gN1  gN2  .....  gNL    pN1  pN2  .....  pNM

The rows of the table represent the city's areal units. Each row i consists
of two components:

- The geography component (referred to as the G-component below) with
  elements gi1, gi2,..., giL that represent data about the geography of the
  areal unit. For example, a typical G-component consists of 5 numbers:
  (a) the x-coordinate of the unit's centre, (b) the y-coordinate of the
  unit's centre, (c) the land area of the unit, (d) the x-coordinate of the
  central business district, and (e) the y-coordinate of the central
  business district. Note that the pair of numbers in (d) and (e) would be
  the same in every areal unit; their purpose is to allow the distance
  between the unit's centre and the centre of the central business district
  to be computed.

- The population component (referred to as the P-component below) with
  elements pi1, pi2,..., piM that represent the populations of the M groups
  living in the unit.

SEGCALC uses the following terminology and assumptions about the rows of
the table and the lines of the input file:

1. A data-line is a line of the input file that contains at least one
   non-whitespace character (whitespace characters are the space, the
   horizontal tab, and the new-line character).

2. Each row of the table occupies exactly one data-line of the input file.

3. The last row of the table corresponds to the data-line that immediately
   precedes the first non-data-line that appears in the file (if such a
   line exists), or the end-of-file character.

4. The elements of a data-line are delimited by at least one whitespace
   character.

5. In every row, the elements of the G-component precede the elements of
   the P-component.

SEGCALC reads the lines of the input file until the end-of-file character
is encountered or the first non-data-line is read. Only the valid
data-lines are processed, however. The validity of a data-line depends on
the user's specifications as follows. Near the beginning of its execution,
SEGCALC asks the user to enter the total number, L+M, of columns in the
table, and the column number, L+1, of the first population group. Then, a
data-line is considered valid if and only if

- the data-line consists of exactly L+M elements, and

- each element is a number, and

- each element of the last M columns is a non-negative integer, and

- the sum of the last M elements of the line is a positive integer (i.e.,
  the total population of the areal unit is at least one), and

- the land area of the row is positive, provided that the column of the
  land areas is included in the input file.

In any other case, the data-line is considered invalid and is not processed
by SEGCALC. For example, a data-line is invalid if it contains a letter, or
it contains more than L+M elements, or it consists of fewer than L+M
elements, or the number in column L+1 is negative, and so on.


4. USING SEGCALC

To run SEGCALC, the user types "SEGCALC" at the operating system prompt.
Then, the following information is requested from the user:

I1.  The name of the input file that contains the table of geographical
     data. SEGCALC will ask for another name if the file cannot be found.

I2.  The name of the output file in which the results will be stored. If no
     name is given, the results are only shown on the screen.

I3.  The number of rows in the table (i.e., the number of areal units in
     the city). This number must be between 1 and 2000; otherwise SEGCALC
     will ask for a new number. You can enter a number in excess of the
     actual number of records in your file.
     Do not enter a number less than the number of records in your file,
     or SEGCALC will not process the records beyond that number.

I4.  The total number of columns in the table (i.e., the number of elements
     in the G-component plus the number of elements in the P-component of
     each row). This number must be between 2 and 60; otherwise SEGCALC
     will ask for a new number. If the number is 2, the inputs I5 to I10
     will be omitted. (Note: the total number of columns must include all
     those which are placed before your first column of SEGMENT data.)

I5.  The column number, say F, of the first population group (segment) in
     the table. This number must be between 1 and C-1, where C is the
     number given in I4; otherwise SEGCALC will ask for a new number. If F
     is 1, the inputs I6-I10 are omitted.

I6.  The number of the column that contains the sizes of the land areas in
     the city. This number must be between 0 and F-1, where F is the number
     given in I5; otherwise SEGCALC will ask for a new number. If the
     number is 0, it indicates that this column is not available.

I7.  The number of the column that contains the x-coordinates of the areal
     units' centres. This number must be between 0 and F-1, where F is the
     number given in I5; otherwise SEGCALC will ask for a new number. If
     the number is 0, it indicates that this column is not available and
     the inputs I8-I10 are omitted.

I8.  The number of the column that contains the y-coordinates of the areal
     units' centres. This number must be between 0 and F-1, where F is the
     number given in I5; otherwise SEGCALC will ask for a new number. If
     the number is 0, it indicates that this column is not available and
     the inputs I9 and I10 are omitted.

I9.  The number of the column that contains the x-coordinate of the central
     business district. This number must be between 0 and F-1, where F is
     the number given in input I5. If the number is 0, it indicates that
     this column is not available and the input I10 is omitted.
I10. The number of the column that contains the y-coordinate of the central
     business district. This number must be between 0 and F-1, where F is
     the number given in input I5. If the number is 0, it indicates that
     this column is not available.

After obtaining the inputs I1-I10, SEGCALC reads the input file and reports
the number of data-lines read and the number of valid data-lines. The line
numbers of the invalid data-lines, if any, will be recorded in the output
file specified in I2, but will not be shown on the screen. If the number of
data-lines is greater than the number of rows N given in I3, SEGCALC will
not process the data-lines that appear after the N-th line of the file. On
the other hand, if the number of data-lines read is smaller than N, SEGCALC
will continue its execution.

If the total population of a group in the city is zero, SEGCALC reports
that this group will not be considered in any of the geographical indices.
Moreover, if the total land area of the city is zero, SEGCALC will set the
column number given in I6 to zero and will continue its execution as if the
land-area column were not available.

After the input file has been read, the following menu is shown on the
screen:

 ____________________________________________________________________________
|UNEVENNESS:                                                                 |
| 1. D (Dissimilarity Index)          2. GINI (Gini Coefficient)             |
| 3. H (Entropy Index)                4. ATKIN (Atkinson's Index)            |
|EXPOSURE:                                                                   |
| 5. XPy (Interaction Index)          6. XPx (Isolation Index)               |
| 7. V (Correlation Ratio)                                                   |
|CONCENTRATION:                                                              |
| 8. DEL (Delta Index)                9. ACO (Absolute Concentration Index)  |
|10. RCO (Relative Concentration Index)                                      |
|CENTRALIZATION:                                                             |
|11. PCC (Proportion Central City)                                           |
|12. ACE (Absolute Centr. Index)     13. RCE (Relative Centr. Index)         |
|CLUSTERING:                                                                 |
|14. ACL (Absolute Clust. Index)     15. SP (White's Index Spatial Proximity)|
|16. RCL (Relative Clust. Index)                                             |
|17. DPxy (Morgan's Distance Decay Interaction Index)                        |
|18. DPxx (Morgan's Distance Decay Isolation Index)                          |
|____________________________________________________________________________|
| 19. compute ALL INDICES     | 20. process NEW file    | 21. QUIT           |
|_____________________________|_________________________|____________________|
select operation (1 to 21):

The user can enter a number from 1 to 18 to get the value of the
corresponding geographical index for every group in the table whose total
population is at least one. SEGCALC will report that an index cannot be
computed if it requires information which is not available (e.g., when one
of the numbers given in I6-I10 is zero). In this case, no record is written
in the output file.

If the Atkinson index is selected, the user is asked to give a list of
parameters, one for each computation of that index. To terminate the list,
a 0 or 1 is required. For the PCC index, a list of area numbers is required
that represent the city's central areas. It is assumed that the areas are
sorted in increasing order of their distance from the CBD. The list
terminates when 0 is entered. The Atkinson or PCC index will not be
computed if the first input terminates the list.

If 19 is selected, SEGCALC will compute all the indices for which the
required information is available. Before it begins the computation it will
ask the user to enter the lists for the Atkinson index and for the PCC
index (if there is sufficient information to compute it) as described
above.

If 20 is selected, SEGCALC will again request the sequence of inputs I1-I10
in order to process a new data file. Finally, SEGCALC will terminate when
the user selects 21.


5. ERROR MESSAGES AND WARNINGS

This section lists all the possible error messages and warnings of SEGCALC.

"ERROR-0: could not open file"
     The message occurs if the input file specified in I1 does not exist,
     or the system cannot create the output file specified in I2.

"ERROR-2: out of space"
     This message can occur at the beginning of SEGCALC's execution, if the
     number of data-lines in the input file is large and the system's
     memory is limited. In this case SEGCALC terminates. In rather rare
     cases, the message could appear if some intermediate computations
     require memory that is not available. Then, SEGCALC will not perform
     these computations and will return to the main menu.

"ERROR-3: illegal number"
     The message occurs if a number is requested and the user enters
     something that is not a number, or a number outside the expected
     range. In either case, the user will be asked to give a correct
     number.

"ERROR-4: could not compute index. (the column numbers of the areas'
centroid coordinates are zero.)"
     The message occurs if the input I7 or I8 is zero and the user
     requested an index that depends on that input. The index can be one
     of: ACL, RCL, SP, DPxx*, DPxy*.

"ERROR-6: could not compute index. (the column number of the city's land
areas is zero.)"
     The message occurs if the input I6 is zero and the user requested an
     index that depends on that input. The index can be one of: DELTA, ACO,
     RCO, ACE.

"ERROR-7: could not compute index. (the column numbers of the CBD centroid
coordinates are zero.)"
     The message occurs if the input I9 or I10 is zero and the user
     requested an index that depends on that input. The index can be one
     of: ACE, RCE, PCC.

"WARNING-0: the total land area of the city is not a positive number. No
index based on the land areas can be computed."
     This warning can occur after the user has entered inputs I1-I10 and
     the input I6 is not zero. The indices that cannot be computed are
     those listed in ERROR-6.

"WARNING-1: the total population of column is not a positive number. This
column will not be used in any of the operations."
     This warning can occur after the user has entered inputs I1-I10. The
     column named in the message lies between the numbers given in I5 and
     I4 (i.e., it is one of the population columns).


6. SYSTEM REQUIREMENTS

...... to be completed ........


REFERENCES

Massey, D. S. and Denton, N. A. (1988) 'The dimensions of residential
    segregation', Social Forces 67, 281-315.

Morgan, B. S. (1983) 'A distance-decay based interaction index to measure
    residential segregation', Area 15, 211-217.

Ours....
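

ILLUSTRATION: DATA-LINE VALIDITY CHECK

The validity rules for data-lines listed in Section 3 can be sketched in a
few lines of Python. This is only an illustration written for this
document, not SEGCALC's own code. The function name and its parameters are
hypothetical, and two details are assumptions: a population element counts
as an integer only if it is written as a plain digit string such as "300",
and "number" means whatever Python's float() accepts, excluding infinities
and NaN.

```python
import math


def is_valid_data_line(line, total_cols, first_pop_col, area_col=0):
    """Check one data-line against the validity rules listed in Section 3.

    total_cols    -- L+M, the value entered for I4
    first_pop_col -- L+1, the value entered for I5 (columns count from 1)
    area_col      -- the value entered for I6; 0 means no land-area column
    """
    fields = line.split()  # elements are separated by runs of whitespace

    # The line must consist of exactly L+M elements.
    if len(fields) != total_cols:
        return False

    # Every element must be a number (assumption: finite values only).
    try:
        values = [float(f) for f in fields]
    except ValueError:
        return False
    if not all(math.isfinite(v) for v in values):
        return False

    # Each of the last M elements must be a non-negative integer
    # (assumption: only plain digit strings are accepted).
    pop_fields = fields[first_pop_col - 1:]
    if not all(f.isascii() and f.isdigit() for f in pop_fields):
        return False

    # The total population of the areal unit must be at least one.
    if sum(int(f) for f in pop_fields) < 1:
        return False

    # The land area must be positive when a land-area column is given.
    if area_col != 0 and values[area_col - 1] <= 0:
        return False

    return True
```

With the two-segment layout of Section 2 (8 columns, first population
column 7, land area in column 2), the call
is_valid_data_line("100 8 1.0 10.0 5 6 300 700", 8, 7, area_col=2) accepts
the first row of the example file, while a header line containing letters
is rejected.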