Data extraction and transformation for the Pará, State in Brazil¶
This notebook is part of the project "New Schools Location in Brazil" and it is focused on the data extraction and transformation for the Pará State in Brazil.
Step 0. Import libraries and setup environment¶
Step 1. Define Area of Interest (AoI)¶
For this, we will:
- Download a administrative limits for Brazil, and get the multi-level admistrative limits of Pará usig the geoBR library.
- Read schools data downloaded using the R package CensoBr.
code_state | abbrev_state | name_state | code_region | name_region | geometry | |
---|---|---|---|---|---|---|
0 | 11.0 | RO | Rondônia | 1.0 | Norte | MULTIPOLYGON (((-65.38150 -10.42907, -65.37916... |
1 | 12.0 | AC | Acre | 1.0 | Norte | MULTIPOLYGON (((-71.07772 -9.82774, -71.08554 ... |
2 | 13.0 | AM | Amazônas | 1.0 | Norte | MULTIPOLYGON (((-69.83766 -3.68659, -69.71230 ... |
3 | 14.0 | RR | Roraima | 1.0 | Norte | MULTIPOLYGON (((-63.96008 2.47313, -63.96041 2... |
4 | 15.0 | PA | Pará | 1.0 | Norte | MULTIPOLYGON (((-51.43248 -0.47334, -51.42843 ... |
Let's visualize of Area of Interest (AoI) - Pará - in Brazil
Since Pará is a really big state (1.28 km^2). We can explore intermediate administrative limits on which we can simplify the analysis.
Let's see the number of units and visualize each administrative level
Number of units | |
---|---|
Intermediate regions | 6 |
Meso regions | 7 |
Micro regions | 22 |
Municipalities | 144 |
From the visualization we can see that:
- There is a clear overlaping between Meso and Intermediate regions, and they are too big.
- Municipalities could be grouped considering if they form a contigous urban area.
- Micro regions are the chosen administrative level for the analysis. They are more homogeneous than municipalities and they are not too big. In addition, bigger micro regions are the ones with forest reserves.
Step 2. Extract and process Census Sociodemographic geographical data¶
Step 2.1. Extracting Census Tracts Data from IBGE¶
Census Tract's Geometries are obtained in this notebook with the GeoBR
Python library
Number of census tracts in Pará: 8919
code_tract | zone | code_muni | name_muni | name_neighborhood | code_neighborhood | code_subdistrict | name_subdistrict | code_district | name_district | code_state | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 150060205000138 | RURAL | 1500602.0 | Altamira | NaN | NaN | 1.500602e+10 | NaN | 150060205.0 | Altamira | 15.0 | POLYGON ((-54.49972 -5.53668, -54.49990 -5.538... |
1 | 150060205000139 | URBANO | 1500602.0 | Altamira | Jardim Altamira | 1.500602e+11 | 1.500602e+10 | NaN | 150060205.0 | Altamira | 15.0 | POLYGON ((-52.21407 -3.19932, -52.21353 -3.200... |
2 | 150060205000140 | RURAL | 1500602.0 | Altamira | NaN | NaN | 1.500602e+10 | NaN | 150060205.0 | Altamira | 15.0 | POLYGON ((-52.31884 -3.91514, -52.31992 -3.914... |
3 | 150060205000141 | RURAL | 1500602.0 | Altamira | NaN | NaN | 1.500602e+10 | NaN | 150060205.0 | Altamira | 15.0 | POLYGON ((-54.79477 -5.75528, -54.79602 -5.757... |
4 | 150060205000142 | RURAL | 1500602.0 | Altamira | NaN | NaN | 1.500602e+10 | NaN | 150060205.0 | Altamira | 15.0 | POLYGON ((-54.58234 -6.71616, -54.58653 -6.716... |
Let's visualize the census tracts in Pará to get a sense of their size and granularity
We can clearly visualize that the census tracts differ in size depeding on the geographical location. For example, near city centers, the census tracts are smaller, while in rural areas, the census tracts are bigger. This usually happens because in rural areas, the population is more sparse, so the census tracts are bigger to ensure that the population is represented in the census.
Now, we will read a dataset generated with the CensoBR
R package here. In this dataset we have census variables such as Income and Population by Age that are useful to identify areas where schools are needed.
Number of census tracts obtainer in R for Brazil: 308993
code_tract | V002 | V003 | pessoa01_V021 | pessoa01_V022 | pessoa01_V023 | pessoa01_V024 | pessoa01_V025 | pessoa01_V026 | pessoa01_V027 | ... | pessoa01_V044 | pessoa01_V045 | pessoa01_V046 | pessoa01_V047 | pessoa01_V048 | pessoa01_V049 | pessoa01_V050 | pessoa01_V051 | pessoa01_V052 | income_pc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 521520705000006 | 164 | 92958.0 | 1.0 | 4.0 | 0.0 | 4.0 | 3.0 | 6.0 | 1.0 | ... | 1.0 | 2.0 | 4.0 | 2.0 | 0.0 | 2.0 | 4.0 | 2.0 | 2.0 | 566.817073 |
2 | 521520705000007 | 221 | 112781.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 2.0 | 2.0 | ... | 6.0 | 3.0 | 3.0 | 3.0 | 1.0 | 4.0 | 3.0 | 3.0 | 1.0 | 510.321267 |
3 | 521520705000008 | 290 | 126647.0 | 3.0 | 8.0 | 1.0 | 7.0 | 1.0 | 1.0 | 4.0 | ... | 2.0 | 8.0 | 4.0 | 8.0 | 1.0 | 4.0 | 2.0 | 2.0 | 3.0 | 436.713793 |
4 | 521520705000009 | 150 | 90031.0 | 2.0 | 2.0 | 3.0 | 3.0 | 2.0 | 2.0 | 2.0 | ... | 2.0 | 1.0 | 2.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 600.206667 |
5 | 521520705000010 | 192 | 134946.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | ... | 2.0 | 5.0 | 5.0 | 2.0 | 3.0 | 3.0 | 1.0 | 3.0 | 1.0 | 702.843750 |
5 rows × 36 columns
Type of code_tract in para_cntr: object Type of code_tract in censobr_tract_income: int64
Number of census tracts with associated data: 8741
Now we will aggregate population by age variables according to the brazilian school levels:
- Infantil: 3 months to 5 years old
- Elementary: 6 to 14 years old
- Middle: 15 to 17 years old
- High: 18 years old
Step 2.2. Download Meta's High Resolution Population Density¶
Let's see all the available datasets for Brazil in Humanitarian Data Exchange (HDX).
created | name | population | size_mb | url | |
---|---|---|---|---|---|
id | |||||
0 | 2019-06-11 | BRA_children_under_five_2019-06-01_csv.zip | Children (ages 0-5) | 159.14 | https://data.humdata.org/dataset/c17003d1-47f4... |
1 | 2019-06-11 | BRA_elderly_60_plus_2019-06-01_csv.zip | Elderly (ages 60+) | 159.42 | https://data.humdata.org/dataset/c17003d1-47f4... |
2 | 2019-06-11 | BRA_men_2019-06-01_csv.zip | Men | 160.20 | https://data.humdata.org/dataset/c17003d1-47f4... |
26 | 2019-06-11 | BRA_women_2019-06-01_csv.zip | Women | 160.21 | https://data.humdata.org/dataset/c17003d1-47f4... |
27 | 2019-06-11 | BRA_women_of_reproductive_age_15_49_2019-06-01... | Women of reproductive age (ages 15-49) | 160.03 | https://data.humdata.org/dataset/c17003d1-47f4... |
28 | 2019-06-11 | BRA_youth_15_24_2019-06-01_csv.zip | Youth (ages 15-24) | 159.79 | https://data.humdata.org/dataset/c17003d1-47f4... |
33 | 2019-06-25 | population_bra_northeast_2018-10-01.csv..zip | Overall population density | 71.24 | https://data.humdata.org/dataset/c17003d1-47f4... |
34 | 2019-06-25 | population_bra_northwest_2018-10-01.csv.zip | Overall population density | 13.46 | https://data.humdata.org/dataset/c17003d1-47f4... |
35 | 2019-06-25 | population_bra_southeast_2018-10-01.csv.zip | Overall population density | 159.42 | https://data.humdata.org/dataset/c17003d1-47f4... |
36 | 2019-06-25 | population_bra_southwest_2018-10-01.csv.zip | Overall population density | 45.96 | https://data.humdata.org/dataset/c17003d1-47f4... |
We will download the Overall population density dataset from HDX. This dataset contains the population density of Brazil at 30m2 resolution.
The obtained dataset is simple and contains the following columns:
latitude | longitude | population_2015 | population_2020 | geometry | |
---|---|---|---|---|---|
0 | -9.840972 | -49.330417 | 1.221294 | 1.245924 | POINT (-49.33042 -9.84097) |
1 | -9.840972 | -49.330139 | 1.221294 | 1.245924 | POINT (-49.33014 -9.84097) |
2 | -9.840972 | -49.325417 | 1.221294 | 1.245924 | POINT (-49.32542 -9.84097) |
3 | -9.840972 | -48.980139 | 1.060563 | 1.081952 | POINT (-48.98014 -9.84097) |
4 | -9.840972 | -48.979861 | 1.060563 | 1.081952 | POINT (-48.97986 -9.84097) |
Number population data points for Pará: 1786979
Step 2.3. Aggregate Population Density in Hexagons (Uniform Spatial Units)¶
Step 2.4. Convert Census Tract Level Data to Hexagons Level¶
Census tract level numerical variables like population and income will be divided in hexagons proportionally to the Meta high-resolution population density.
Compare the total number of census tracts and hexagons in Pará
Census Tracts: 8919 --> Hexagons: 1501128
Step 2.5. Visualize downsampled population data in hexagons¶
Let's see how the population and income data looks like in the hexagons. Visualizing data side by side is a good way to see if the data is consistent. We will work with the Belem Metropolitan Area (BMA) since it is the most populated area in Pará and it is the capital of the state.
Another way to check population consistency is to compare population distribution with road network density. The idea is that areas with higher population density should have more roads.
We can clearly observe that the population values mostly are located around the road network. This is a good sign that our downscaling method is working as expected. Also, higher population values correspond to higher density of roads.
Step 2.6. Calculate and visualize income per capita¶
Step 3. Extract and process Schools geographical data¶
Number of schools in Brazil: 222936
abbrev_state | name_muni | code_school | name_school | education_level | education_level_others | admin_category | address | phone_number | government_level | private_school_type | private_government_partnership | regulated_education_council | service_restriction | size | urban | location_type | date_update | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RO | Porto Velho | 11000023 | EEEE ABNAEL MACHADO DE LIMA - CENE | Ensino Fundamental | Atendimento Educacional Especializado | Pública | AVENIDA AMAZONAS, 6492 ZONA LESTE. TIRADENTES.... | (69) 992083054 | Estadual | Não Informado | Não | Não | ESCOLA ATENDE EXCLUSIVAMENTE ALUNOS COM DEFICI... | Entre 51 e 200 matrículas de escolarização | Urbana | A escola não está em área de localização difer... | 2020-10-18 | POINT (-63.85401 -8.75846) |
1 | RO | Porto Velho | 11000040 | EMEIEF PEQUENOS TALENTOS | Educação Infantil | Pública | RUA CAETANO, 3256 PREDIO. CALADINHO. 76808-108... | (69) 32135237 | Municipal | Não Informado | Não | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | Entre 201 e 500 matrículas de escolarização | Urbana | A escola não está em área de localização difer... | 2020-10-18 | POINT (-63.88392 -8.79373) | |
2 | RO | Porto Velho | 11000058 | CENTRO DE ENSINO CLASSE A | Educação Infantil, Ensino Fundamental, Ensino ... | Privada | AVENIDA CARLOS GOMES, 1135 CENTRO. 76801-123 P... | (69) 32244473 | Privada | Particular | Não | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | Mais de 1000 matrículas de escolarização | Urbana | A escola não está em área de localização difer... | 2020-10-18 | POINT (-63.90199 -8.76073) | |
3 | RO | Porto Velho | 11000082 | CENTRO EDUCACIONAL PRESBITERIANO 15 DE NOVEMBRO | Educação Infantil, Ensino Fundamental | Privada | RUA ALMIRANTE BARROSO, 1483 SANTA BARBARA. 768... | (69) 32245636 | Privada | Particular | Não | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | Entre 51 e 200 matrículas de escolarização | Urbana | A escola não está em área de localização difer... | 2020-10-18 | POINT (-63.89618 -8.76520) | |
4 | RO | Porto Velho | 11000104 | CENTRO EDUC CORA CORALINA | Educação Infantil, Ensino Fundamental | Privada | RUA MEXICO, 1056 NOVA PORTO VELHO. 76820-190 P... | (69) 32252616 | Privada | Particular | Não | Sim | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | Entre 501 e 1000 matrículas de escolarização | Urbana | A escola não está em área de localização difer... | 2020-10-18 | POINT (-63.87547 -8.76863) |
Step 3.1. Aggregate categorical variables in hexagons¶
To work with categorical data we have to convert from text to numerical values. To do this we will create one column for each class in the caterogical variable and fill with 1 if the school belongs to that class and 0 otherwise. This process is known as One or Multi-hot encoding.
Example of education level columns assigned to the schools:
ensino_fundamental | educacao_infantil | ensino_medio | educacao_jovens_adultos | educacao_profissional | no_specified | |
---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 1 | 1 | 1 | 0 | 0 | 0 |
3 | 1 | 1 | 0 | 0 | 0 | 0 |
4 | 1 | 1 | 0 | 0 | 0 | 0 |
Example of administrative category columns assigned to the schools:
publica | privada | |
---|---|---|
0 | 1 | 0 |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 0 | 1 |
Once aggregated the categorical variables, we can visualize the number of schools per hexagon. We will overlay the schools points to see if the aggregation is working as expected.
Schools distribution seems to be consistent with the population distribution. This is a good sign that our aggregation method is working as expected.
Step 3.2. Aggregate numerical variables in hexagons¶
In this official dataset provided by IBGE, we don't have the enrollment data. We will use the enrollment data from the a previous schools dataset to estimate the number of students per school and hexagon.
Restrição de Atendimento | Escola | UF | Município | Localização | Categoria Administrativa | Dependência Administrativa | Categoria Escola Privada | Conveniada Poder Público | Regulamentação pelo Conselho de Educação | Porte da Escola | Etapas e Modalidade de Ensino Oferecidas | Outras Ofertas Educacionais | LAT_FINAL | LONG_FINAL | Matriculas totales | Matrículas oficiales | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ESCOLA ATENDE EXCLUSIVAMENTE ALUNOS COM DEFICI... | EEEE ABNAEL MACHADO DE LIMA - CENE | RO | Porto Velho | Urbana | Pública | Estadual | Não Informado | Não | Não | Entre 51 e 200 matrículas de escolarização | Ensino Fundamental | Atendimento Educacional Especializado | -8.758459 | -63.854011 | 159.0 | 94.0 |
1 | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | CENTRO DE ENSINO CLASSE A | RO | Porto Velho | Urbana | Privada | Privada | Particular | Não | Sim | Mais de 1000 matrículas de escolarização | Educação Infantil, Ensino Fundamental, Ensino ... | NaN | -8.760734 | -63.901986 | 1433.0 | 1433.0 |
2 | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | CENTRO EDUCACIONAL PRESBITERIANO 15 DE NOVEMBRO | RO | Porto Velho | Urbana | Privada | Privada | Particular | Não | Sim | Entre 51 e 200 matrículas de escolarização | Educação Infantil, Ensino Fundamental | NaN | -8.765205 | -63.896177 | 59.0 | 59.0 |
3 | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | CENTRO EDUC CORA CORALINA | RO | Porto Velho | Urbana | Privada | Privada | Particular | Não | Sim | Entre 501 e 1000 matrículas de escolarização | Educação Infantil, Ensino Fundamental | NaN | -8.768633 | -63.875471 | 636.0 | 636.0 |
4 | ESCOLA EM FUNCIONAMENTO E SEM RESTRIÇÃO DE ATE... | CENTRO EDUCACIONAL MOJUCA | RO | Porto Velho | Urbana | Privada | Privada | Particular | Não | Sim | Entre 201 e 500 matrículas de escolarização | Ensino Fundamental, Ensino Médio | NaN | -8.765028 | -63.891857 | 290.0 | 290.0 |
Separate matriculas by education level
ensino_fundamental | educacao_infantil | ensino_medio | educacao_jovens_adultos | educacao_profissional | no_specified | |
---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 0 | 0 |
3 | 1 | 1 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 1 | 0 | 0 | 0 |
This maps shows the difference between the number of students and the target student population for each education level.
- Blue hexagons have negative values and represent areas where there are more students than the target population.
- Red hexagons have positive values and represent areas where there are less students than the target population.
Given this we could say that:
- Blue hexagons are city centers were there is less population but have a lot of students that come from other areas.
- Red hexagons are most present in rural areas were there is more target student population that have to travel to other areas to attend school.
- This second group could be in danger of not attending school due to the distance.
Step 4. Assing the hexagon zone (rural or urban)¶
We will use the census track variable to determine the urban area of Pará.
We can see that the urban areas are correctly identified. Considering the satellite image, we can see that the urban areas are the ones with more buildings and roads.
Summary¶
We have extracted and processed schools and sociodemographic data from multiple source. The final results is a dataset on which each row is a H3 Hexagon. This will allow us to compare and analyze different microregions in Para. The variables obtained are the the following:
Source | Variable name | Type | Description |
---|---|---|---|
Meta's HRSM | population_2020 | Numerical | Total Population |
IBGE's Census* | pop_3_5_years_adj | Numerical | Population between 3 months and 5 years old |
IBGE's Census* | pop_6_14_years_adj | Numerical | Population between 6 and 14 years old |
IBGE's Census* | pop_15_17_years_adj | Numerical | Population between 15 and 17 years old |
IBGE's Census* | pop_18_years_adj | Numerical | Population between 18 years old |
IBGE's Census* | V002_adj | Numerical | Total Population |
IBGE's Census* | V003_adj | Numerical | Income (R$) |
INEP's School Census | ensino_fundamental | Numerical | Ensino Fundamental Schools Count |
INEP's School Census | educacao_infantil | Numerical | Educacao Infantil Schools Count |
INEP's School Census | ensino_medio | Numerical | Ensino Medio Schools Count |
INEP's School Census | educacao_jovens_adultos | Numerical | Educacao Jovens Adultos Schools Count |
INEP's School Census | educacao_profissional | Numerical | Educacao Profissional Schools Count |
INEP's School Census | no_specified | Numerical | Count of schools with unspecified type |
INEP's School Census | publica | Numerical | Public Schools Count |
INEP's School Census | privada | Numerical | Private Schools Count |
Custom School Census** | Matriculas totales | Numerical | Total Enrolled Students |
Custom School Census** | Matrículas oficiales | Numerical | Oficially Enrolled Students |
Custom School Census** | matric_total_[school type] | Numerical | Enrolled Students by School Type |
Custom School Census** | matric_oficial_[school type] | Numerical | Enrolled Students by School Type |
Custom School Census** | diff_pop_matric_total_[school type] | Numerical | Difference between target student population and enrolled students by School Type |
Custom School Census** | rate_pop_matric_total_[school type] | Numerical | Ratio between target student population and enrolled students by School Type |
IBGE's Urban Footprints | urban_area | Categorical | 1=Urban or 0=Rural |
Ours | income_pc | Numerical | Income Per Capita (R$) |
* IBGE's Census: Variables where downscaled proportionally to Meta's high resolution population data.
** Custom School Census: This dataset was provided by IDB Education team in Brazil (need to ask for the source)