Large databases are not safe enough, says stats boffin

Large databases do not adequately protect sensitive personal information, according to a statistics professor in the US, who says that individuals can still be identified despite attempts to anonymise them. George Duncan is a statistics professor at Carnegie Mellon University in Pittsburgh, Pennsylvania. He writes in the …

COMMENTS

This topic is closed for new posts.
  1. JP

    Are you doing this on purpose?

    Since the comments in a previous article asking for the end of the use of the word "Boffin", I seem to notice it cropping up more and more in titles and tag lines...

    What's the rule about flaming and feeding trolls??

  2. Pete

    on flaming trolls...

    Fatter trolls burn more. Maybe troll candles are the way The Reg plans to cut spending on lightbulbs?

    "The question is, how can data be made useful for research purposes without compromising the confidentiality of those who provided the data?"

    "...further user-specific restrictions on the use of information in databases would go some way to solving the problem"

    So what you mean to say, is that the confidential information stored should only be viewable by the people who need it for legitimate purposes?

    Covered already by our Data Protection Act.

    The real issue that people have with privacy could easily be solved with a big group unbunching-of-panties. We already have Big Government departments that have excessive, often incorrect information about us stored. It can be expensive (overdue council tax bill for house you moved out of, resulting in a CCJ because you didn't get the mail) and dangerous (allergy to penicillin omitted from medical record) but it's something that we seem to have quickly just taken as the normal condition.

    I would welcome a single, coherent, accurate database that could be used by all Government departments, if I felt certain it would be planned, implemented and maintained with some degree of competence.

    Guess what? It's not possible to feel that way given the current state of this country. Every IT project in the news is over budget, badly run, and often just dead in the water after millions/billions of pounds have been blown.

    ID cards are not necessary, we already carry enough information like that. It's probably wrong (I know that most provisional driving licences have the wrong address on them, given the number of calls I used to field whilst on a DSA contract) but it's the same information that would be collected.

  3. Sceptical Bastard

    The real worry is...

    Quote: "What's the rule about flaming and feeding trolls?"

    It's not the trolls that worry me, it's the government's compiling, storage, usage and cross-referencing of data that keeps me awake at night.

  4. Charlie Clark

    Shock horror - data modelling allows conclusions to be drawn

    Data modelling is all about storing information in a way that allows it to be reconstructed usefully. If the database is storing information about people then you should be able to reconstruct it at will. As such access to such systems should be carefully controlled. This is nothing new.

    If the guy has anything interesting to say on how you might cut up your data and access privileges so that no single user can access everything at once, that might be more interesting, but otherwise this is non-news and not worthy of the boffin title.

  5. Anonymous Coward

    Boffinry? Really?

    Agree. In all probability a Professor of statistics actually knows enough about his subject that to call him a "boffin" would be gratuitously insulting and false. Leave "boffin" for the junk scientists, please.

  6. John Gamble

    Re: Boffinry? Really?

    Is the term boffinry really reserved for junk scientists? That's not the impression I got from reading Register articles (the slang's not hard to figure out if you pay attention). Does anyone have an OED citation?

  7. Charlie Clark

    re. Boffinry? Really?

    A professor of statistics probably knows nothing but makes it up as he goes along!

    He'd be a boffin if he came up with a new statistical approach, product or device (Richard Dimbleby's Swingometer was obviously developed by a proper boffin) related to the data worked with. Simply worrying about the practicalities of enforcing data protection doesn't count.

    For statistical purposes: data can indeed be reliably anonymised, although there is actually no need to collect any data of a distinctly (i.e. uniquely) personal nature in any statistical exercise. Not sure if I can think of a foolproof approach in any data management scenario. Abuse of data protection laws is by definition illegal, except if you're the US government, but so is theft, murder or grievous bodily harm. So are we going to get a rerun of the story of bullets that can be traced back to their owners or knives that can't be used to stab people with? Or rayguns, to be more topical.

  8. peter

    What we need is Boffin Pride! ;)

    My Concise OED (1990) says

    boffin. n. esp. Brit. colloq. a person engaged in scientific (esp. military) research. [20th c.: orig. unkn.]

    But that's largely irrelevant since the OED simply describes how _we_ are using words, rather than prescribing how we should use them.

    I always took /boffin/ to be pejorative, and it always seems to be used negatively in El Reg. (But maybe they mean it affectionately?) And maybe the term can be reclaimed from -ve connotations in the same way geek seems to be a much more +ve word these days.

  9. Sceptical Bastard

    Anonymous doesn't mean useless

    Further to Charlie Clark's point (above), data can be gathered with personally-identifying information yet be effectively anonymised while retaining its value for statistical analysis.

    My business recently conducted on behalf of a client a survey of 16,000 people at an event. Paper forms were distributed and a prize draw formed the inducement. To distribute the prizes, the form requested respondents' names, addresses and emails.

    We received a surprisingly high response, mainly because our client is a trusted brand and had been very generous with the prize pot.

    The form was designed so that the section containing the personally identifiable information could be readily cut off.

    Thus we ended up with two piles of paper: the answers to the twenty-odd questions (which contained no personally identifiable information) and the names and addresses of the respondents.

    The name-and-address portions were divided into those who had ticked the opt-in for email marketing and those who had opted out. The opt-outs were shredded: the opt-ins were added to an OpenOffice Calc (our client's preferred spreadsheet format) data file of email-shot recipients.

    The statistical information was entered into an entirely separate MySQL database for analysis. It will provide a very rich source of information which will allow our client to improve the event and target its market spend more cost-effectively in future.

    A good result for us and our client - and one achieved without anyone's privacy or identity being compromised.

    All this - and no boffins involved.
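The separation described in this comment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the form layout, the field names (`answers`, `contact`, `email_opt_in`) and the function `split_responses` are hypothetical, and the shuffle is an extra precaution that the comment does not mention. The point it shows is that the two outputs share no key, so an answer sheet cannot be joined back to a name.

```python
import random

def split_responses(forms):
    """Split each form into an anonymous answer record and, for opt-ins only,
    a contact record. No shared identifier is written to either output."""
    answers = []
    mailing_list = []
    for form in forms:
        # Keep only the survey answers; copy so the original form is not aliased.
        answers.append(dict(form["answers"]))
        contact = form["contact"]
        if contact["email_opt_in"]:
            mailing_list.append({"name": contact["name"], "email": contact["email"]})
        # Opt-out contact details are simply not retained (the "shredder").
    # Assumption: shuffle so that list position cannot link the two outputs.
    random.shuffle(answers)
    return answers, mailing_list

# Made-up sample forms, purely for illustration.
forms = [
    {"answers": {"q1": "yes", "q2": 4},
     "contact": {"name": "A. Example", "email": "a@example.org", "email_opt_in": True}},
    {"answers": {"q1": "no", "q2": 2},
     "contact": {"name": "B. Example", "email": "b@example.org", "email_opt_in": False}},
    {"answers": {"q1": "yes", "q2": 5},
     "contact": {"name": "C. Example", "email": "c@example.org", "email_opt_in": True}},
]

answers, mailing_list = split_responses(forms)
print(len(answers), "anonymous answer records")
print([entry["email"] for entry in mailing_list])
```

In the comment's setup the answer records would then go into the analysis database and the opt-in list into the spreadsheet, with nothing stored that relates one to the other.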

  10. Adrian Midgley

    conclusions may be drawn without identification

    Consider for a moment general practice medical records, which are presently stored in 10 000 systems of around a dozen different sorts in a like number of places.

    A question such as "How many people have Diabetes, of which types, by age and sex distribution and what medicines are they prescribed?" can be approached in at least two ways.

    One way is to construct a large computer system notionally placed in Richmond House, Whitehall, suck all information from the 10 000 systems into it, and then make an SQL query against it.

    Another way is to write two lines of Perl for each of those dozen sorts, which launch a (possibly SQL, possibly M, possibly procedural) query against the system to produce an answer (a small table of figures), ship that to a rather smaller computer notionally in Richmond House, and with another two lines of Perl aggregate the answers into a single table of figures.

    The first is more popular with the suppliers of large, and rather fanciful, computer systems, the civil service, and allegedly MI5. The latter has certain advantages, such as being known to be possible, easy even, and cheap and, as a small but topically relevant feature, not transferring identities from here to there or concentrating them in one place.
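The second approach in this comment (query each system locally, ship only the small table of figures, add the tables up centrally) can be sketched as follows. The comment talks about Perl; Python with in-memory SQLite databases is used here purely as a stand-in, and the `patients` schema, the ten-year age bands and the function names are all assumptions made for illustration. Prescribed medicines are left out to keep the sketch short.

```python
import sqlite3
from collections import Counter

def make_practice_db(rows):
    """Build an in-memory stand-in for one practice system (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE patients (name TEXT, age INTEGER, sex TEXT, diabetes_type TEXT)"
    )
    db.executemany("INSERT INTO patients VALUES (?, ?, ?, ?)", rows)
    return db

def local_summary(db):
    """Run the aggregate query inside the practice; only counts leave the site."""
    query = """
        SELECT diabetes_type, (age / 10) * 10 AS age_band, sex, COUNT(*)
        FROM patients
        WHERE diabetes_type IS NOT NULL
        GROUP BY diabetes_type, age_band, sex
    """
    return {(kind, band, sex): n for kind, band, sex, n in db.execute(query)}

def combine(summaries):
    """Central step: add the per-practice tables together cell by cell."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Made-up records for two practices, purely for illustration.
practice_a = make_practice_db([
    ("Alice Example", 34, "F", "type 1"),
    ("Bob Example", 67, "M", "type 2"),
    ("Carol Example", 61, "F", "type 2"),
    ("Dan Example", 45, "M", None),
])
practice_b = make_practice_db([
    ("Eve Example", 63, "F", "type 2"),
    ("Frank Example", 38, "M", "type 1"),
])

national = combine(local_summary(db) for db in (practice_a, practice_b))
for cell, count in sorted(national.items()):
    print(cell, count)
```

Names never leave `local_summary`, which is the property the comment is after. Very small counts in such a table can still point at an individual, so a real deployment would probably need to merge or suppress low-count cells as well.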

  11. Anonymous Coward

    re: on flaming trolls...

    Noooo! Burning trolls will result in the emission of massive quantities of greenhouse gas!
