So what are they using for a development method?
The "No complaints == No faults" rule?
A flaw in the scanning compression software of some Xerox copiers which changes digits and numbers run through the machine is worse than first thought and will require a full software upgrade, the self-styled "Document Company" has said. The flaw was first spotted by German computer science student David Kriesel, who …
I hear what you're saying but I believe there's a much larger issue here.
Is anyone reading/using the incorrect documents? It doesn't sound much like it. More than anything I think Xerox has made a great case for less copying of documents, obviously no one is using them anyway.
I wouldn't count on that. I'm still not sure the implications have sunk in. People assumed it was an accurate copy and proceeded. If they don't have cause to check it, they probably won't. Hell, the mass transit center next to where I work that was supposed to go into service 6 months ago is still in legal limbo because somebody couldn't follow directions, or there was confusion over blueprints. In fact, if faulty Xeroxes are anywhere in the loop on that one, I can pretty much guarantee the subcontractor will be reaching for that defense.
So from that report we can conclude that:
"... the unit’s “Quality/file size” factory default and highest modes don’t completely alleviate the problem"
and
"The default and highest modes ... a software bug character substitution is not completely eliminated"
however we then revert to the original statement that
"... on low-resolution scans of documents ... then it may reoccur"
Also, it doesn't affect normal office documents - unless you consider a large spreadsheet full of small-print numbers (there are quite a few of those about) a normal office document?
"We apologize for any confusion that came from our prior communications." ... and from this one.
Follow this link for a detailed comment from an AC who claims experience with the compression algorithm in question:
http://forums.theregister.co.uk/forum/1/2013/08/06/xerox_copier_flaw_means_dodgy_numbers_and_dangerous_designs/#c_1917144
Based on the level of detail and tone of the message I accept his claim of expertise.
"compression is not used when making basic copies."
How sure are you?
Based on what the apologists have said, compression is used to minimise the amount of memory needed.
The context they say this in seems to say that if you want to do things like collating or layup (two pages per sheet etc), the input is scanned and the compressed image is stored in memory, and then a potentially corrupt version of the input is printed out. It would make sense (well, it would if the compression/decompression was trustworthy).
Similarly, if you want multiple copies of a set of originals, I imagine it scans the originals once, stores the compressed version, and then a potentially corrupt version of the input is printed out. I'd be *very* surprised if it rescanned the original for each copy that was wanted - the way that dumb but trustworthy copiers used to (trustworthy as long as there wasn't a misfeed or other mechanical problem).
In this context, I struggle to see what a "basic copy" (without compression) might mean, and why the flow of data (including whether it's compressed or not) would be different for making 2 copies of an original rather than making one copy. But ICBW. Who's got the source code?
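The flow speculated about above can be sketched in a few lines. This is pure guesswork based on the comments here, not Xerox's actual pipeline: scan once, hold the pages compressed in memory, then decompress for every printed copy. The point is that if the codec in the middle is lossy, every copy inherits the same corruption, because nothing ever goes back to the glass.

```python
# Hypothetical copy pipeline (a guess, NOT Xerox's code): scan once,
# store compressed, decompress per copy instead of rescanning.
import zlib

def scan(pages):
    """Stand-in for the scanner: returns raw page bitmaps (here, bytes)."""
    return [p.encode() for p in pages]

def make_copies(pages, n_copies):
    # Scan once and store compressed to save memory...
    stored = [zlib.compress(raw) for raw in scan(pages)]
    # ...then decompress for every copy. A lossy codec here would
    # corrupt all copies identically, with no original left to check.
    return [[zlib.decompress(c).decode() for c in stored]
            for _ in range(n_copies)]

copies = make_copies(["page 1", "page 2"], 3)
```

With a lossless codec like zlib in that slot, the three copies match the originals exactly; swap in a lossy one and the corruption is baked into every copy.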
Wait till this kind of "cost excellence" reaches places where it actually matters. Engine control units, that kind of thing, where lives may eventually be at stake (how many computers in that nice new Volvo hybrid).
If you were following the issue there was a quite good explanation in a previous column.
It starts with they are compressing for data manipulation speed. It gets mangled with an algorithm that has more variations than RS-232. And it sounds like it has a library of sharp images that they use based on detection from the mangly algorithm.
Yes, to me the non-programming tech who is the first guy to catch flack from the users, it looks like they should simply turn off the compression while they get it fixed and take the performance hit. But since I'm not the programmer, I'm willing to believe them when they say it is a bit more complicated than that.
They're clueless? They've "cost excellenced" [1] everything down to a level where this isn't actually fixable?
Also, not all compression is lossy compression. ZIP files are often quite compressed, but it's a rare day when they get undetected data corruption even in the presence of other errors. Ditto PDF in most cases. And various other less well known lossless compression formats, some of which have been routinely applied to documents for many years without any difficulties of this nature.
What went wrong here? Too much clueless "cost excellence", perhaps. There's a lot of it about.
[1] Yes,seriously. "value engineering" used to be the term of choice, but now in line with the company's policy of continuous product and service improvement, and in order to acknowledge the important role of the value stream contribution from the staff in Strategic Sourcing (which everyone else still calls Purchasing), the term has been replaced with "cost excellence".
BINGO!
What do you want to bet they have error controls in place to prevent that? Like being able to access the copier via phone line. They advertise a line of copiers that can self-call a tech to fix an issue. If it can call out, it can transmit the page count - your bill - to a high-quality printer. When the tech comes out to fix the machine, they often print out a page count and other stats from the copier directly. Want to bet that gets printed out in the best quality?
It's like this joke someone once sent me. A young monk was sent to a monastery where all they did was make copies of the holy texts. The young monk noticed that they were working off copies of copies. He asked the senior monk: what happens if someone gets it wrong? Won't the error be duplicated? The senior monk said don't worry, it never happens - but to placate the young monk, he went into the archives to check the original. The senior monk came back wailing: they left out the R, they left out the f'n R. "Each day we shall be happy and celebrate" had become "each day we shall be happy and celibate".
While it's true that copying alone can incur errors, there are methods of copy error checking and integrity checking. The old method was to count individual letters and hope that any errors were misspellings (thus correctable) rather than substitutions - of letters or whole words - which are less correctable. That, along with context, allows for correction in most instances, with the exceptions above and things like numbers or unique names.
More recently we can do it mathematically, with 100% accuracy (assuming a perfect "machine/calculator") or close to it (assuming no hash collisions). Say, assign a prime number to each letter and record the total sum of a line (I may be off in my maths there, it's been a while). You have to hope the hash/integrity check is not itself copied in error though... and no, I'm not going to apply recurring checks! :P
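The prime-per-letter idea sketches out easily enough, and the sketch also shows why the poster hedged: a plain sum catches substitutions but is blind to reordering, so this is illustrative rather than a recommended integrity check.

```python
# Sketch of the "prime per letter, sum the line" checksum idea above.
# Illustrative only: a plain sum is a weak hash (anagrams collide).
from string import ascii_lowercase

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
          53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
WEIGHT = dict(zip(ascii_lowercase, PRIMES))

def line_sum(line):
    """Sum the prime weights of the letters; ignore everything else."""
    return sum(WEIGHT.get(ch, 0) for ch in line.lower())

# A dropped or substituted letter changes the sum (the monks' R)...
assert line_sum("celebrate") != line_sum("celibate")
# ...but reordered letters slip straight through:
assert line_sum("dog") == line_sum("god")
```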
Given the level of technical expertise here, I'm surprised that you're surprised OCR isn't 100%. Personally I think it's brilliant that Xerox came up with a file format that compresses the original document by running it through OCR and building a font from the original images of the characters used. I'd love to know the sizes of the files it generates for a given resolution. OCR is never 100 percent and shouldn't be used as the only method of storing/reading a document if every part of the information is important; if you are going to use OCR in such a case, you need a human checking it.
I think the fault here is that Xerox did not provide warning on the machine of the risks and limitations of using that file format. Perhaps a solution would be to provide notice of the limitations on the machine, and have the copier assess its certainty that it was seeing the correct characters and switch to an alternate file format if the certainty was below, say, 96%.
Thumbs up to the guy who created this format. Xerox should have made end users aware of its limitations.
I have been wondering how this can come about, and concluded it must be a serious thinko in the design of the lossy compression algorithm itself, when applied to documents. I guess it splits the input image into blocks, and then looks for blocks with the commonest patterns and substitutes them for approximately similar rarer blocks, so that it does not have to store a separate code for them (this is about what the "vector quantization" compression method does, remember the 1990's blocky QuickTime and AVI videos?). No problem for kitten videos, but too bad if in an important document, some common blocks contain digits "8" and rarer blocks contain "6":s...
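The failure mode guessed at above is easy to demonstrate in miniature. This toy (nothing to do with the actual JBIG2 implementation) replaces each scanned glyph with the nearest entry in a small codebook of stored patterns, measured by Hamming distance; a slightly noisy glyph then "snaps" to the wrong symbol and prints out crisp and confidently wrong. The bit patterns below are invented for the example.

```python
# Toy vector-quantization-style substitution (NOT the real JBIG2 code):
# each glyph is replaced by the closest stored codebook pattern.
def hamming(a, b):
    """Count differing positions between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical flattened 1-bit glyph bitmaps for two stored symbols.
codebook = {
    "8": "0110100101101001",
    "0": "0110100110010110",
}

def quantize(glyph_bits):
    # Substitute whichever stored pattern is closest to the scan.
    return min(codebook, key=lambda k: hamming(codebook[k], glyph_bits))

# A scanned "6" that happens to sit one bit away from the "8" pattern:
noisy_six = "0110100101100001"
quantize(noisy_six)  # snaps to "8" - clean-looking, and wrong
```

Fine for kitten videos, as the poster says; fatal for a load-bearing digit.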
Anyway, why should an all-in-one COPIER even apply lossy compression to an image on the way from the scanner part to the printer part? Another thinko there.
Xerox claims the algorithm the copier uses is called JBIG2. The original JBIG is lossless but JBIG2 allows for lossy. As for why the compression, consider a big scan run that has to be duplexed or collated. That means ALL the pages have to be in memory, and it's not unheard of for copiers to get skimped on memory. "Out of Memory" errors are rare enough that an expansion might not be considered or allowed by accounting.
@AC 07:03
I see the "out of memory" point, but why not just tell the user that after [X] pages the pdf will be written (e-mailed, whatever) and a new one started? Thus freeing up the memory problem.
Surely even with compression there must be a limit to how many pages the memory can take; it's just that you'd reach it sooner uncompressed.
The problem here is the action - to COPY; scanning puts the user in a different mindset. A significant number of users will have grown up with old-style not-scanned-but-copied copies, and therefore have a reasonable expectation that the copy will be just that. In the good old days, unclear copied numbers would be a mess (8, 9 and 6 often being the culprits): I knew they were wrong because I couldn't read them, or I had doubts from the quality of the copied digit, so I'd go and look at the original. Now I can't tell at a glance, as all the numbers look as clear as day, because they have been OCRd and rendered in a nice, clear font.
"I see the "out of memory" point, but why not just tell the user that after [X] pages the pdf will be written (e-mailed, whatever) and a new one started? Thus freeing up the memory problem."
Because it would be pointless in the job's case to do in segments. Plus, like I said, collating and duplexing are pure copy functions and involve rearranging the pages IN MEMORY so they come out in a certain way. For these kinds of jobs, they have to be all or nothing or there's no point in the exercise (especially collating, which requires each set of copies come out in the same order as the original—3 copies of 3 pages will go 1,2,3,1,2,3,1,2,3—any break defeats the purpose).
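The collation constraint described above is just an interleaving of the page sequence, which a couple of lines make concrete. A quick sketch, with the uncollated stacking order alongside for contrast:

```python
# Why collation wants the whole job available at once: 3 copies of
# pages [1, 2, 3] must interleave as 1,2,3,1,2,3,1,2,3.
def collate(pages, n_copies):
    """Collated output: the full page sequence repeated per copy."""
    return [p for _ in range(n_copies) for p in pages]

def stack(pages, n_copies):
    """Uncollated output: all copies of page 1, then page 2, ..."""
    return [p for p in pages for _ in range(n_copies)]

collate([1, 2, 3], 3)  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
stack([1, 2, 3], 3)    # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Page 3 is needed before the second copy of page 1 starts, which is why the whole set has to sit in memory (or the originals get rescanned per copy, as the dumb old machines did).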
Well, using lossy compression that can visibly alter the results defeats the purpose even worse!
They simply should have added enough memory to handle these jobs using only lossless compression, which should work fine for most actual pages, because they contain large areas of solid colour - usually white.
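That claim about solid-white pages is easy to spot-check. Using a run of zero bytes as a crude stand-in for an all-white bitmap, a general-purpose lossless codec like zlib squeezes it down to a fraction of a percent:

```python
# Rough check that lossless compression does very well on page data
# dominated by solid white (zero bytes as a stand-in bitmap).
import zlib

white_page = bytes(100_000)            # all-white bitmap stand-in
compressed = zlib.compress(white_page)

# Solid white compresses to well under 1% of the original size,
# and the round trip is exact - no substituted digits possible.
assert len(compressed) < len(white_page) // 100
assert zlib.decompress(compressed) == white_page
```

Real scans have noise and ink, so ratios are worse, but the "mostly solid colour" structure of typical pages is exactly what lossless coders exploit.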
" it would be pointless in the job's case to do in segments"
Oh really? Do you think we were born yesterday or are you a copier salesman?
How do you think the world managed before copiers could (not) hold a whole jobsworth of data in memory?
The world managed for several decades with photocopiers that copied without misrepresentation but were not capable of handling clever stuff in huge jobs and/or huge numbers of copies.
Manual intervention part way through (splitting a big job into a few smaller ones) was one widely used option which fixed a lot of things without too much effort and no loss of quality. Collating was and is particularly trivial to fix, and what couldn't be fixed manually could and can be sent to a copy shop with a clue, where $$$ will be required.
You're seriously trying to tell readers that copying with misrepresentation is preferable to occasional manual intervention or occasionally spending $$$ with a copy shop with a clue?
Words fail me.
"How do you think the world managed before copiers could (not) hold a whole jobsworth of data in memory?"
Clumsily, with plenty of potential for mistakes. That's why most firms went to professional printers for the big jobs, which meant outsourcing and money. Things the accounting department may not be keen to budget anymore. Same goes for the memory. As I've said, the higher-ups may not see the value in more memory for the copier.
Probably not. They'll say you're doing it wrong. It would have to take something exceptional, like a 100-copy run of a single page intended for the PHB and ALL bearing some critical mistake to draw their attention. As noted, the circumstances are already somewhat contrived (very small print for starters).