I'm implementing a program which should get files from the user in Angular, send them to the node.js backend, where there those will be read and parsed into an array of object.
I went with javascript because it was the fastest, for me, to implement this program and test it out.
If you have any other language advice I'll consider it in my next implementation.
So far my XML files are 110mb the largest and 16mb the smallest.
I have to be able to parse up to 3000 files that can vary between those sizes.
All of these files have the same structures and tags, in almost the same order.
Example XML:
<start_xml_tag>
<property1>something</property1>
.
.
<property10>somethingElse</property10>
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
... 1Milion row after
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
...2 milion row after
</start_xml_tag>
The result attended is an array containing objects, where each object is like this:
{ codice: 123, nome: andrea stato: italia }
In advance the user can insert a filter where if the field stato is equal to (for example) 'italia' the result in the objects should only be those who have the stato = italia.
I will provide my code as soon as I can, I'm not at home until tomorrow so I don't have it right here.
Can you help me or tell me where I'm thinking wrong?
Thanks in advance!
So far I did achieve this by loading the files by text with smaller numbers of files up to 400 or even 600.
Then I switched to using stream sax to read the files in chunks, but it was taking forever to finish the parsing, I'm not sure if i made some mistakes, it was parsing all good, but after 9hrs it was still running.
The chunk size was set to 100000 bytes, higher than this i would get max stack size error.
I'm now trying with a xpath approach to see if i can target all 'Offerte' children with an expression, but I'm new to this and i still think I won't be too efficient at all.
import { Injectable } from '@angular/core';
import { HttpClient, } from '@angular/common/http';
import * as sax from 'sax';
import { targetTags, TargetTags, startTagName, checkUnitReferenceNo, checkStatusCd } from 'src/models/util';
const BATCH_SIZE = 100;
@Injectable({
providedIn: 'root'
})
export class BatchXmlService {
constructor(private http: HttpClient) { }
private result: TargetTags[] = [];
async processFiles(files: FileList, unitReferenceNoFilter: any, statusCdFilters: any) {
let result = [];
for (let i = 0; i < files.length; i += BATCH_SIZE) {
const batch = Array.from(files).slice(i, i + BATCH_SIZE);
result.push(...await this.processBatch(batch, unitReferenceNoFilter, statusCdFilters));
}
return result;
}
async processBatch(batch: File[], unitReferenceNoFilter: any, statusCdFilters: any) {
const promises = batch.map(file => this.parseFile(file, unitReferenceNoFilter, statusCdFilters));
const parsedData = await Promise.all(promises);
return parsedData.flatMap(x=>x);
}
// Function to parse a single XML file
async parseFile(file: File, unitReferenceNoFilter: any, statusCdFilters: any): Promise<any[]> {
const strict = true;
return new Promise((resolve, reject) => {
const saxStream = sax.createStream(strict);
saxStream.on('error', (error) => {
reject(error);
});
const parsedData: any[] = []; // Array to store parsed data
let lastTag = "";
let inTargetSection = false; // Flag to track if we're within the target section
let skipParsing = false;
saxStream.on('opentag', (node) => {
if (node.name === startTagName) {
inTargetSection = true;
parsedData.push({});
}
else if (inTargetSection && !skipParsing) {
lastTag = node.name;
if(targetTags.includes(node.name)){
// Store tag name as key and content as value in the current offer
const currentOffer = parsedData[parsedData.length - 1];
if (currentOffer) {
currentOffer[node.name] = "";
}
}
}
});
saxStream.on('text', (text) => {
// Assuming you only care about text content within target tags
if(text.trim() === "") return;
if (inTargetSection && lastTag?.length > 0 && targetTags.includes(lastTag) && !skipParsing) {
const currentOffer = parsedData[parsedData.length - 1];
if(lastTag){
if(
lastTag === "UNIT_REFERENCE_NO" && !checkUnitReferenceNo(unitReferenceNoFilter, text.trim()) ||
lastTag === "STATUS_CD" && !checkStatusCd(statusCdFilters, text.trim())
){
skipParsing = true;
parsedData.pop();
return;
}
currentOffer[lastTag] = text.trim();
}
}
});
saxStream.on('closetag', (nodeName) => {
if (nodeName === startTagName) {
inTargetSection = false;
skipParsing = false;
}
});
saxStream.on('end', () => {
resolve(parsedData);
});
const reader = new FileReader();
reader.readAsArrayBuffer(file);
reader.onload = () => {
const arrayBuffer = reader.result as ArrayBuffer;
const byteArray = new Uint8Array(arrayBuffer); // Convert to Uint8Array
let remainingData = byteArray;
while (remainingData.length > 0) {
const chunkSize = Math.min(remainingData.length, 100000); // Read in chunks of 50000 bytes
saxStream.write(String.fromCharCode.apply(null, remainingData.slice(0, chunkSize)));
remainingData = remainingData.slice(chunkSize);
}
saxStream.end(); // Call end after processing all data
};
reader.onerror = (error) => {
reject(error); // Handle file read errors
};
});
}
}
<form [formGroup]="form" class="d-flex flex-column justify-content-around w-100" (ngSubmit)="generateExcel()" style="height: 300px">
<div class="d-flex flex-column">
<label for="Filtro_UNIT_REFERENCE_NO">Filtro per UNIT_REFERENCE_NO <sup style="color: red">*</sup></label>
<div class="d-flex justify-content-between w-50">
<select class="form-control" formControlName="UNIT_REFERENCE_NO" (change)="setUnitReferenceNoFilter()">
<option *ngFor="let item of unitReferenceNoOptions" [value]="item.value">{{item.label}}</option>
</select>
</div>
</div>
<div class="d-flex flex-column">
<label for="Filtro_STATUS_CD">Filtro per STATUS_CD</label>
<div class="d-flex justify-content-between w-50">
<ng-select [items]="statusCdOptions" class="w-100"
[multiple]="true"
placeholder="Se nessuna scelta e' selezionata allora si estraggono tutte"
formControlName="STATUS_CD"
bindLabel="label"
bindValue="value">
</ng-select>
</div>
</div>
<input #myInput for="files-xml" (click)="checkFiltersInserted($event)" formControlName="files" class="form-control mx-1" type="file" class="file-upload bg-secondary-subtle w-50" (change)="onUpload($event)" required webkitdirectory multiple />
<button class="btn btn-primary w-25" type="submit" [disabled]="!isUnitReferenceNoFiltered || (loadingData && result !== [])">
<span *ngIf="loadingData" class="spinner-border spinner-border-sm" role="status" aria-hidden="true"></span>
Genera Excel
</button>
</form>
And my app.ts onUpload function is this:
async onUpload(event: any) {
const files = event.target.files;
this.loadingData = true;
try{
this.result = await this.batchXmlService.processFiles(files, this.form.controls['UNIT_REFERENCE_NO'].value, this.form.controls['STATUS_CD'].value);
this.loadingData = false;
}
catch(error){
this.loadingData = false;
}
}
I tried to clear the code as best as I could, I had some more commented code in the mix from previous tests.
I'm aware that this is on Angular.js and not Node.js, I'm now trying to write this same logic in node.js to delegate this calculation on the backend.
I also forgot the second filter is an array of choices where from the previous XML example the field 'nome' could have different values all contained in the multiselect option list.
This filter works this way: If the list contanins ['andrea', 'francesco', 'brian']
It filters the object so that all the results are matching one of these three strings.
Accepted XML example:
<Offerte>
<codice>123</codice>
<nome>andrea</nome>
<stato>italia</stato>
</Offerte>
Discarded xml example:
<Offerte>
<codice>123</codice>
<nome>john</nome>
<stato>italia</stato>
</Offerte>