1

I have a web page that submits a CSV file to the server. I have to validate the file: correct number of columns, correct data types, cross-field validations, data-range validations, and so on. At the end I either show a success message or return a CSV with error messages and line numbers.

Currently every row and every column is looped through to find all the errors in the CSV file. This becomes very slow for bigger files and sometimes results in a server time-out. Can someone please suggest a better way to do this?

Thanks

2
  • Post the code where you load/read the file. Commented Jan 21, 2011 at 5:00
  • It's a lot of code. What I was basically after is whether, instead of looping through each row and then each column, there is another way to validate the cells. Using regex, or some other way? Commented Jan 21, 2011 at 5:29

5 Answers

2

To validate a CSV file you will certainly need to check each column. The best option, if it is possible in your scenario, is to validate each entry as it is appended to the CSV file.


Edit

I have edited my code to fix the error pointed out by @accolaum.

It will only work provided each row is delimited with a `\n`.

If you only want to validate the number of columns, it is easier: split each row and compare the number of entries with the number of columns.

// Assumes one record per line and no quoted fields containing commas.
bool file_isvalid = true;
string data = streamreader.ReadLine();
while (data != null)
{
    // Compare the count directly; a modulo test would also accept
    // rows with 2x or 3x the expected number of columns.
    if (data.Split(',').Length != Num_Of_Columns)
    {
        // One bad row makes the whole file invalid.
        file_isvalid = false;
        // Perform operation, e.g. record the error for this line
    }
    data = streamreader.ReadLine();
}

Hope it helps


3 Comments

Splitting just to count is a bad idea. There is also a mistake in your code: if line 1 has 3 columns, line 2 has 5 columns and the other rows have 4 columns, the check will behave incorrectly.
I never said it's actual code; it's just an idea. However, you can perform this for every row in a loop, provided each row is delimited with a \n.
I am wondering if someone can remove that downvote, since I have edited my code.
1

I would suggest a rule-based approach, similar to unit tests. Think of every error that can possibly occur and order them by increasing abstraction level:

  • Correct file encoding
  • Correct number of lines/columns
  • Correct column headers
  • Correct number/text/date formats
  • Correct number ranges
  • Business rules
  • ...

These rules could also have automatic fixes. So if you could automatically detect the encoding, you could correct it before testing all the rules.

The implementation could use the command pattern:

public abstract class RuleBase
{
  public abstract bool Test();
  public virtual bool CanCorrect()
  { 
     return false;
  }
}

Then create a subclass for each test you want to make and put them in a list.
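As a rough sketch of how such a rule list might look for row-level checks: the names `RowRule`, `ColumnCountRule`, `IntRangeRule` and `RuleRunner` are made up for illustration and are not part of the `RuleBase` class above. The sketch also assumes a naive comma split with no quoted fields.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical rule contract: each rule inspects one parsed row and
// returns an error message, or null when the row passes.
public abstract class RowRule
{
    public abstract string Check(string[] fields);
}

public sealed class ColumnCountRule : RowRule
{
    private readonly int expected;
    public ColumnCountRule(int expected) { this.expected = expected; }

    public override string Check(string[] fields)
    {
        return fields.Length == expected
            ? null
            : "expected " + expected + " columns, found " + fields.Length;
    }
}

public sealed class IntRangeRule : RowRule
{
    private readonly int column, min, max;
    public IntRangeRule(int column, int min, int max)
    {
        this.column = column; this.min = min; this.max = max;
    }

    public override string Check(string[] fields)
    {
        if (column >= fields.Length) return null; // missing column is left to ColumnCountRule
        int value;
        if (!int.TryParse(fields[column], out value))
            return "column " + column + " is not an integer";
        return value >= min && value <= max
            ? null
            : "column " + column + " is outside " + min + ".." + max;
    }
}

public static class RuleRunner
{
    // Returns "line N: message" for every rule that fails on any line.
    public static List<string> Validate(IEnumerable<string> lines, IList<RowRule> rules)
    {
        var errors = new List<string>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            string[] fields = line.Split(','); // naive split, assumes no quoted commas
            foreach (RowRule rule in rules)
            {
                string error = rule.Check(fields);
                if (error != null) errors.Add("line " + lineNumber + ": " + error);
            }
        }
        return errors;
    }
}
```

Every cell is still visited once, so this organises the checks rather than speeding them up; the collected messages map directly onto the error CSV with line numbers that the question asks for.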

The timeout can be overcome by using a background thread just for testing incoming files. The user has to wait until the file is validated and becomes "active". When validation is finished, you can forward the user to the next page.


1

You may be able to optimize your code to perform faster, but what you really want to do is to spawn a worker thread to do the processing.

Two benefits of this:

  • You can redirect the user to another page so that they know their request has been submitted
  • The worker thread can be given a callback so that it can report its status - if you want to, you could put a progress bar or a percentage on the 'submitted' page so that the user can see as their file is being processed.
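A minimal sketch of the worker-plus-callback idea, assuming the uploaded lines are already in memory. `BackgroundValidation`, `validateLine` and `reportPercent` are hypothetical names; in a real web application the progress value would have to be stored somewhere the status page can read it, and work running inside the web server process can be lost if the application restarts.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BackgroundValidation
{
    // Starts validation on a thread-pool thread and returns immediately.
    // 'validateLine' is a hypothetical per-line check returning an error message or null.
    // 'reportPercent' is the status callback, invoked on the worker thread.
    public static Task<List<string>> Start(
        IList<string> lines,
        Func<string, string> validateLine,
        Action<int> reportPercent)
    {
        return Task.Run(() =>
        {
            var errors = new List<string>();
            for (int i = 0; i < lines.Count; i++)
            {
                string error = validateLine(lines[i]);
                if (error != null) errors.Add("line " + (i + 1) + ": " + error);
                reportPercent((i + 1) * 100 / lines.Count);
            }
            return errors;
        });
    }
}
```

The request handler would call `Start`, keep the returned task (or a job id) somewhere, and redirect straight away; the "submitted" page can then poll for the latest percentage.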

It is not good design to have the user waiting for long running processes to complete - they should be given updates or notifications, rather than just a 'loading' icon on their browser.

edit: This is my answer because (1) I can't recommend code improvements without seeing your code, and (2) efficiency improvements are probably only going to yield incremental improvements (unless you are doing something really wrong), which won't solve your problem long term.


0

Validation of CSV data almost always needs to look at every single cell. Can you post some of your code? There may be ways to optimise it.

EDIT

In most cases this is the best solution:

foreach(row) {
    foreach (column) {
        validate cell
    }
}

If you were really keen, you could try something with regexes:

foreach(row) {
    validate row by regex
}

But then you are really just offloading the validation code from yourself to the regex, and I really hate using regexes.
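For illustration, the row-by-regex variant might look like the sketch below. The column layout (integer id, name, ISO date) and the name `RegexRowValidator` are assumptions, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexRowValidator
{
    // Hypothetical layout: integer id, non-empty name without commas, date as yyyy-MM-dd.
    private static readonly Regex RowPattern = new Regex(
        @"^\d+,[^,]+,\d{4}-\d{2}-\d{2}$",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    // Returns the 1-based numbers of the lines that do not match the pattern.
    public static List<int> FindBadLines(IEnumerable<string> lines)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (!RowPattern.IsMatch(line)) bad.Add(lineNumber);
        }
        return bad;
    }
}
```

A pattern like this checks the column count and the shape of each field in one pass, but it only says that a row is bad, not which cell or why, and it cannot express range or cross-field rules. Those still need per-cell code.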

3 Comments

Please add this as a comment! :-)
The first sentence is my answer.
I do need to validate each cell, but the question is whether there is a better way to do this than looping.
0

You could use XmlReader and validate against an XSD.
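One way this could be wired up, as a sketch only: convert each CSV line to a `<row>` element and validate the resulting document with `XDocument.Validate`, rather than driving an `XmlReader` by hand. The schema, the two-column layout and the name `XsdCsvValidator` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class XsdCsvValidator
{
    // Hypothetical schema: every row needs an integer id (1..999) and a name.
    private const string Xsd = @"
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='rows'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='row' minOccurs='0' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='id'>
                <xs:simpleType>
                  <xs:restriction base='xs:int'>
                    <xs:minInclusive value='1'/>
                    <xs:maxInclusive value='999'/>
                  </xs:restriction>
                </xs:simpleType>
              </xs:element>
              <xs:element name='name' type='xs:string'/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Returns the validation messages reported against the schema.
    public static List<string> Validate(IEnumerable<string> lines)
    {
        // Wrap each CSV line in a <row> element (naive split, no quoted fields;
        // columns beyond the second are ignored in this sketch).
        var doc = new XDocument(new XElement("rows",
            lines.Select(line => line.Split(','))
                 .Select(f => new XElement("row",
                     new XElement("id", f[0]),
                     new XElement("name", f.Length > 1 ? f[1] : "")))));

        var schemas = new XmlSchemaSet();
        schemas.Add("", XmlReader.Create(new StringReader(Xsd)));

        var errors = new List<string>();
        doc.Validate(schemas, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

This moves the type and range rules into a declarative schema, but the conversion step still touches every cell, so it is a maintainability choice more than a guaranteed speed-up.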

1 Comment

Yes, if you pass the CSV into an XmlReader and then validate the XML using an XSD, it will work. We have implemented this solution in our company.
